Metalua: a Tutorial

Fabien Fleutot
November 10, 2006

Contents

1 Concepts
2 Data structures
  2.1 Algebraic Datatypes (ADT)
  2.2 Abstract Syntax Trees (AST)
  2.3 AST ⇐⇒ Lua source translation
    2.3.1 Expressions
  2.4 Representing statements
  2.5 Formal translation definition
3 Splicing and quoting
  3.1 Quasi-quoting
  3.2 Splicing
  3.3 A couple of simple concrete examples
4 MLP parser: how it works, how to extend it
  4.1 mll, the Metalua lexer
  4.2 gg, the grammar generator
    4.2.1 Sequence parsers
    4.2.2 Sequence parsers: chaining parsers together
    4.2.3 Sequence set parsers: choosing the appropriate sequence
    4.2.4 List parsers
    4.2.5 Expression parsers
    4.2.6 On-keyword conditional parsing
  4.3 mlp entry points
  4.4 Expressions
    4.4.1 Statements
5 Some serious examples
6 Roadmap to beta version: next features
A AST structure reminder

Caveat: Metalua is still alpha, but this doc is alpha as well: it features poor writing, inaccuracies, and obsolete parts, especially in the API description part, and it lacks such things as a formal AST/source translation, complex examples and coding guidelines. These issues will eventually be fixed...

1 Concepts

Metalua is an upward-compatible version of the Lua 5.1 compiler, written in Lua 5.1, which supports language extension through compile-time metaprogramming. To do this, the following facilities are offered to the metaprogrammer:

• A structured, easy to manipulate representation of sources as abstract syntax trees;
• A dynamically extensible parser, turning source files into syntax trees, in which syntax extensions are easy to plug in at runtime;
• Some meta-operators to run code during compilation, and to manipulate syntax trees almost as if they were regular source code;
• A compiler which turns syntax trees into Lua 5.1 bytecode.

This approach to language manipulation, directly inspired by Lisp macros, offers some advantages with respect to preprocessing by an external tool:

• It works with a high-level data representation of code, which is much easier to manipulate than ASCII characters or token streams. Lower-level representations of code tend to limit code transformation/generation to local usages.
• Being single-stage, it leads to easier debugging. When a preprocessor produces incorrect code, it is generally tricky to relate the incorrect code to the bug in the generator that caused it; tight integration of compilation and code generation allows better error reporting (although meta-programming will always remain intrinsically harder than “first degree” programming). Among other things, it naturally keeps track of the line at which an error occurred, something not always easy for a preprocessor.
• It allows modifying Lua’s parser on the fly, by dynamically plugging in extensions during compilation.
• By being properly modularized, the system is more language-agnostic: it is possible to independently change the source→syntax tree parser (thus creating a different language), or the syntax tree→bytecode compiler (thus compiling for other platforms).

On the other hand, preprocessors have benefits as well, and many projects are better suited to them. Indeed, preprocessors are conceptually simpler, so if you only need local code transformations, easily implemented at the ASCII or token stream level, and with little opportunity for metaprogramming bugs, there is no need to learn and master a metaprogramming framework.

General organization of Metalua. Metalua represents code as Abstract Syntax Trees (AST), which outline the code’s logical and hierarchical structure, rather than its textual representation. For instance, the AST representing “local a, b, c = foo, bar” will consist of:

• a tag indicating that it is a local variable declaration,
• a list of the names of the created variables { "a", "b", "c" },
• a list of the AST representing the expressions foo and bar.

It is (relatively) easy to perform global analysis on AST data, and to read, generate or transform it. However, as often, computer-friendliness is opposed to human-friendliness, and an AST is less readable to the programmer than the corresponding textual source code. For this reason, Metalua provides a constructor, +{...}, called “quasi-quote”, which translates source code into the corresponding AST. In addition to improving readability, quasi-quotes limit the need to learn the complete AST structure in great detail. When writing macros, the user can:

• Write long stretches of source code with quasi-quotes, because it’s easier, more readable, more maintainable.
• Write the key parts of the macros, the few bits whose generation is really complex, through direct AST manipulation. Indeed, there are some manipulations which are simply too advanced to be expressed through quotation.

Moreover, the “quasi” part in “quasi-quote” is important: there is a mechanism to include bits of generated AST into a quasi-quote.

The counterpart to quasi-quoting is splicing: once an AST has been manipulated to create the desired program, this AST must be brought down one metalevel, so that it is actually fed to the compiler. The splicing operator, written -{...}, also serves to prevent quotation of some parts of a quasi-quote, and is therefore often called “antiquote”.

The rest of this tutorial will present:

• Algebraic datatypes (ADT), a simple syntax extension added to Lua, which will make AST description and manipulation easier;
• AST, which happen to be a special case of ADT. Although this knowledge is not strictly necessary to the beginning user, we will explain how to translate source to and from AST structures.
• Quoting and splicing, the syntactic mechanisms which allow evaluating code during compilation, and generating code with as much source quoting as possible and as little direct AST manipulation as possible.
• Metalua grammar extension: the preferred way to use meta-programming, rather than littering code with -{...} constructs, is to extend Metalua syntax with new constructors. A general-purpose parser generator library, gg, is provided for this purpose, and is used to build mlp, the Metalua parser. This section will describe the gg API, as well as the key entry points in mlp which allow extending the grammar easily and efficiently.

2 Data structures

2.1 Algebraic Datatypes (ADT)

An algebraic datatype, or tagged union, is a data structure that generalizes C’s structs and unions, taken from ML dialects such as SML or OCaml, and from Haskell: when a variable can contain several different kinds of structures, it is necessary not only to store the struct fields, but also to remember which sort of struct is represented. This is easily encoded in Lua: the struct is represented by a table, and we reserve the field tag to store a string, which is the name of the kind of structure being stored.

In Metalua, we will respect some conventions for ADT usage: although contents can be stored in implicitly integer-indexed entries as well as string-indexed entries, we will tend to prefer the former alternative to the latter. More precisely, we will only put less important metadata in string-indexed fields, and the type of these entries’ content will be limited to strings and numbers. This is done for several reasons:

• Conventions help to learn a framework faster;
• It slightly eases automated code walking;
• This gives a Lisp-ish flavor to the ADT, which might be appreciated by some (and maybe loathed by most);
• It makes ADT structurally very close to XML or XHTML. Some cool applications might ensue... :-)
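As a minimal sketch of this encoding in plain Lua 5.1 (no Metalua syntax is needed), the tables below keep their kind in the tag field and their contents in integer-indexed entries. The Circle and Rect tags and the area function are hypothetical names chosen for illustration only; they are not part of Metalua.

```lua
-- Two hypothetical kinds of shapes, encoded as tagged tables.
-- Children live in integer-indexed entries, following the conventions above.
local circle = { tag = "Circle", 2 }      -- one child: the radius
local rect   = { tag = "Rect", 3, 4 }     -- two children: width and height

-- A function over the union looks at the tag to know how to read the children.
local function area(shape)
   if shape.tag == "Circle" then
      return math.pi * shape[1] ^ 2
   elseif shape.tag == "Rect" then
      return shape[1] * shape[2]
   else
      error("unknown shape tag: " .. tostring(shape.tag))
   end
end

print(area(circle), area(rect))
```

The same pattern, a test on the tag followed by positional access to the children, is what most code walking over such data might look like.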

Example 1. The most canonical example of ADT is probably the inductive list. Such a list is described either as the empty list Nil, or as a pair (called a cons in Lisp) of the first element on one side (car in Lisp), and the list of remaining elements on the other side (cdr in Lisp). These will be represented in Lua as { tag = "Nil" } and { tag = "Cons", car, cdr }. The list (1, 2, 3) will be represented as:

    { tag="Cons", 1,
      { tag="Cons", 2,
        { tag="Cons", 3,
          { tag="Nil" } } } }

Example 2. Here is a more programming-language-oriented example: imagine that we are working on a symbolic calculator. We will have to work with:

• literal numbers, represented as integers;
• symbolic variables, represented by the string of their symbol;
• formulae, i.e. numbers, variables and/or sub-formulae combined by operators. Such a formula is represented by the symbol of its operator, and the sub-formulae / numbers / variables it operates on.

Most operations, e.g. evaluation or simplification, will do different things depending on whether they are applied to a number, a variable or a formula. Moreover, the meaning of the fields in a data structure depends on its datatype. The datatype is given by the name put in the tag field. In this example, tag can be one of Number, Variable or Formula. The formula e^(iπ) + 1 would be encoded as:

    { tag="Formula", "Addition",
      { tag="Formula", "Exponent",
        { tag="Variable", "e" },
        { tag="Formula", "Multiplication",
          { tag="Variable", "i" },
          { tag="Variable", "pi" } } },
      { tag="Number", 1 } }

Syntax. Since in Metalua we will spend most of our time fiddling with ADT, and the simple data above already has a quite ugly representation, we will provide some syntax sugar to make them more readable:

• The tag can be put in front of the table, prefixed with a backquote. For instance, { tag = "Cons", car, cdr } can be abbreviated as `Cons{ car, cdr }.
• If the table contains nothing but a tag, the braces can be omitted. Therefore, { tag = "Nil" } can be abbreviated as `Nil (although `Nil{ } is also legal).
• If there is only one element in the table besides the tag, and this element is a literal number or a literal string, braces can be omitted. Therefore { tag = "Foo", "Bar" } can be abbreviated as `Foo "Bar".

With this syntax sugar, the e^(iπ) + 1 example above would read:

    `Formula{ "Addition",
      `Formula{ "Exponent",
        `Variable "e",
        `Formula{ "Multiplication",
          `Variable "i",
          `Variable "pi" } },
      `Number 1 }

This is not just a documentation convention: Metalua actually does support this syntax.
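To make the idea of operations that depend on the tag more concrete, here is a small sketch of an evaluator for the symbolic calculator, written in plain Lua 5.1 tables without the backquote sugar. The eval function and its env parameter (a table mapping variable names to numbers) are assumptions introduced for this illustration; only the Number, Variable and Formula tags and the operator names come from the example above.

```lua
-- Evaluate a formula ADT; env maps variable names to numeric values.
local function eval(f, env)
   if f.tag == "Number" then
      return f[1]
   elseif f.tag == "Variable" then
      local v = env[f[1]]
      if v == nil then error("unbound variable: " .. f[1]) end
      return v
   elseif f.tag == "Formula" then
      local op, a, b = f[1], eval(f[2], env), eval(f[3], env)
      if     op == "Addition"       then return a + b
      elseif op == "Multiplication" then return a * b
      elseif op == "Exponent"       then return a ^ b
      else error("unknown operator: " .. tostring(op)) end
   else
      error("unknown tag: " .. tostring(f.tag))
   end
end

-- The formula 2 * x + 1, as plain tagged tables.
local formula =
   { tag = "Formula", "Addition",
     { tag = "Formula", "Multiplication",
       { tag = "Number", 2 }, { tag = "Variable", "x" } },
     { tag = "Number", 1 } }

print(eval(formula, { x = 3 }))
```

A real-valued formula is used here rather than e^(iπ) + 1, since plain Lua numbers cannot represent the imaginary unit.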

2.2 Abstract Syntax Trees (AST)

An AST is an Abstract Syntax Tree, a data representation of source code suitable for easy manipulation. AST are encoded as ADT, and we will represent them with the ADT syntax described above.

Example. This is the tree representing the source code print(foo, "bar"):

    `Call{ `Id "print", `Id "foo", `String "bar" }

Metalua tries, as much as possible, to shield users from direct AST manipulation, and a thorough knowledge of AST is generally not needed. Metaprogrammers should know their general form, but it is reasonable to rely on a cheat-sheet to remember the exact details of AST structures. Such a summary is provided in the appendix of this tutorial, as a reference when dealing with them.

In the rest of this section, we will present the translation from Lua source concepts to their corresponding AST.
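When experimenting with AST in a plain Lua interpreter, it can help to print them back in the backquote notation. The dump function below is a hypothetical helper written for this tutorial, not a Metalua API; it only handles the tag and the integer-indexed children, and ignores any other string-indexed metadata.

```lua
-- Render a tagged table in the backquote notation used in this tutorial.
local function dump(x)
   if type(x) == "string" then return string.format("%q", x) end
   if type(x) ~= "table" then return tostring(x) end
   local parts = {}
   for i, child in ipairs(x) do parts[i] = dump(child) end
   local body = "{ " .. table.concat(parts, ", ") .. " }"
   if #parts == 0 then body = x.tag and "" or "{ }" end
   if x.tag and #parts == 1 and type(x[1]) ~= "table" then
      body = " " .. parts[1]      -- `Id "foo" rather than `Id{ "foo" }
   end
   return (x.tag and ("`" .. x.tag) or "") .. body
end

-- The AST of print(foo, "bar"), written as plain Lua tables.
local call = { tag = "Call",
               { tag = "Id", "print" },
               { tag = "Id", "foo" },
               { tag = "String", "bar" } }
print(dump(call))
```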

2.3 AST ⇐⇒ Lua source translation

This subsection explains how to translate a piece of Lua source code into the corresponding AST, and conversely. Most of the time, users will rely on quasi-quotes to produce the AST they work with, but it is sometimes necessary to deal with AST directly, and therefore to have at least a superficial knowledge of their structure.

2.3.1 Expressions

Expressions are pieces of Lua code which can be evaluated to give a value. This includes constants, variable identifiers, table constructors, expressions based on unary or binary operators, function definitions, function calls, method invocations, and index selection from a table.

Expressions should not be confused with statements: an expression has a value which can be returned through evaluation, whereas statements just execute themselves and change the computer state (mainly memory and IO). For instance, 2+2 is an expression which evaluates to 4, but four=2+2 is a statement, which sets the value of variable four but has no value itself.

Number constants. A number is represented by an AST with tag Number and the number value as its sole child. For instance, 6 is represented by `Number 6 (as explained in the section about ADT, this is exactly the same as `Number{ 6 }, or { tag = "Number", 6 } in plain Lua).

String constants. A string is represented by an AST with tag String and the string as its sole child. For instance, "foobar" is represented by `String "foobar".

Variable names. A variable identifier is represented by an AST with tag Id and the variable’s name, as a string, as its sole child. For instance, variable foobar is represented by `Id "foobar".

Other atomic values. Here are the translations of other, keyword-based, atomic values:

• nil is encoded as `Nil (which is a short-hand for `Nil{ }, or { tag="Nil" } in plain Lua);
• false is encoded as `False;
• true is encoded as `True;
• ... is encoded as `Dots.
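The reverse direction, from AST back to source text, can be sketched for the atomic expressions above plus the `Call node seen in the previous subsection. The to_source function is a hypothetical illustration in plain Lua 5.1, not a Metalua API, and it deliberately rejects every other tag.

```lua
-- Keyword-based atoms map directly to their source text.
local keywords = { Nil = "nil", True = "true", False = "false", Dots = "..." }

local function to_source(ast)
   local tag = ast.tag
   if keywords[tag] then
      return keywords[tag]
   elseif tag == "Number" then
      return tostring(ast[1])
   elseif tag == "String" then
      return string.format("%q", ast[1])
   elseif tag == "Id" then
      return ast[1]
   elseif tag == "Call" then
      -- First child is the called expression, the others are the arguments.
      local args = {}
      for i = 2, #ast do args[#args + 1] = to_source(ast[i]) end
      return to_source(ast[1]) .. "(" .. table.concat(args, ", ") .. ")"
   else
      error("unsupported tag: " .. tostring(tag))
   end
end

local call = { tag = "Call",
               { tag = "Id", "print" },
               { tag = "Id", "foo" },
               { tag = "String", "bar" } }
print(to_source(call))
```

This sketch assumes that the called expression is a simple identifier; a complete printer would have to add parentheses around more complex callees.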

Table constructors. A table constructor is encoded as:

    `Table{ ( `Key{ expr expr } | expr )* }

This is a list, tagged with Table, whose elements are either:

• the AST of an expression, for entries without an explicit associated key;
• a pair of expression AST, tagged with Key: the first expression AST represents a key, and the second represents the value associated with this key.

Examples

• The empty table { } is represented as `Table{ };
• {1, 2, "a"} is represented as `Table{ `Number 1, `Number 2, `String "a" };
• {x=1, y=2} is syntax sugar for {["x"]=1, ["y"]=2} and is represented by `Table{ `Key{ `String "x", `Number 1 }, `Key{ `String "y", `Number 2 } };
• indexed and non-indexed entries can be mixed: { 1, [100]="foo", 3 } is represented as `Table{ `Number 1, `Key{ `Number 100, `String "foo" }, `Number 3 }.

Binary operators. Binary operations are represented by `Op{ operator, left, right }, where operator is a childless AST describing the operator, left is the AST of the left operand, and right the AST of the right operand. The following table associates a Lua operator with its AST:

    Op.   AST        Op.   AST        Op.   AST
    +     `Add       -     `Sub       *     `Mul
    %     [...]      ^     `Pow       ..    `Concat
    ~=    [...]      >     `Gt        >=    `Ge
    and   `And       or    `Or        /     [...]
    ==    [...]
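As an illustration of why the `Op{ operator, left, right } shape is convenient to manipulate, the sketch below folds operations on literal numbers at the AST level, using plain Lua 5.1 tables. The fold function and the arith table are assumptions made for this example; only the `Op, `Number, `Add, `Sub and `Mul tags are taken from this tutorial, and the walk only descends through `Op nodes.

```lua
-- Arithmetic attached to a few operator tags.
local arith = {
   Add = function(a, b) return a + b end,
   Sub = function(a, b) return a - b end,
   Mul = function(a, b) return a * b end,
}

-- Fold `Op nodes whose operands are both number literals; leave the rest alone.
local function fold(ast)
   if type(ast) ~= "table" or ast.tag ~= "Op" then return ast end
   local op, left, right = ast[1], fold(ast[2]), fold(ast[3])
   local f = arith[op.tag]
   if f and left.tag == "Number" and right.tag == "Number" then
      return { tag = "Number", f(left[1], right[1]) }
   end
   return { tag = "Op", op, left, right }
end

-- AST of (2 + 2) * x
local sum  = { tag = "Op", { tag = "Add" }, { tag = "Number", 2 }, { tag = "Number", 2 } }
local expr = { tag = "Op", { tag = "Mul" }, sum, { tag = "Id", "x" } }

local folded = fold(expr)
print(folded.tag, folded[2].tag, folded[2][1], folded[3][1])
```

After folding, the tree represents 4 * x: the inner addition became a `Number node, while the multiplication is kept because one operand is a variable.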
[...] =1 and x3 then foo(bar) end}.

Finally, you might wish to quote a block of code. As you can guess, just type:

    +{block: y = 7; x = y+1; if x>3 then foo(bar) end}

However, quoting alone is not really useful: if it’s just about pasting pieces of code verbatim, there is little point in meta-programming. We want to be able to leave “holes” in quasi-quotes (hence the “quasi”), and fill them with bits of AST coming from outside. Such holes are marked with splices inside the quote. For instance, the following piece of Metalua will put the AST of 2+2 in variable X, then insert it into the AST of an assignment, stored in Y:

    X = +{ 2 + 2 }
    Y = +{stat: four = -{ X } }

After this, Y will contain the AST representing four = 2+2. Because of this, a splice inside a quasi-quote is often called an anti-quote.

Of course, quotes and antiquotes can be mixed with explicit AST. The following lines all put the same value in Y, although often in a contrived way:

    Y = +{stat: four = 2+2 }
    X = +{ 2+2 }; Y = +{stat: four = -{ X } }
    X = `Op{ `Add, `Number 2, `Number 2 }; Y = +{stat: four = -{ X } }
    X = `Op{ `Add, +{2}, +{2} }; Y = +{stat: four = -{ X } }
    Y = +{stat: four = -{ `Op{ `Add, `Number 2, `Number 2 } } }
    Y = +{stat: four = -{ +{ 2+2 } } }
    Y = +{stat: four = -{ +{ -{ +{ -{ +{ -{ +{ 2+2 } } } } } } } } }
    Y = `Let{ { `Id "four" }, { `Op{ `Add, `Number 2, `Number 2 } } }
    Y = `Let{ { `Id "four" }, { +{ 2+2 } } }

The content of an anti-quote is expected to be an expression. However, it is legal to put a block of statements in it, provided that this block returns an AST through a return statement. To do this, just add a “block:” markup at the beginning of the antiquote. The following line is (also) equivalent to the previous ones:

    Y = +{stat: four = -{block: local two=`Number 2; return `Op{ `Add, two, two } } }

Caveat: You can currently quote expressions and statements; most metaprogramming systems also allow quoting identifiers. It seems simpler to write `Id "foo" than +{id: foo}, so it hasn’t been implemented yet. Actually, it’s easy to implement as an extension and could be left as an exercise to the reader.
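The claim that these different spellings all produce the same value can be checked by hand in plain Lua 5.1, by writing out the tables that the quotes stand for. In this sketch, assign and same are hypothetical helpers introduced for the illustration: assign builds the `Let node that a quoted assignment with a hole evaluates to, and same compares two trees structurally.

```lua
-- Plain-table equivalent of +{stat: <name> = -{ value_ast } }:
-- a `Let node holding a list of assigned variables and a list of values.
local function assign(name, value_ast)
   return { tag = "Let", { { tag = "Id", name } }, { value_ast } }
end

-- Structural comparison of two AST (tags and integer-indexed children only).
local function same(a, b)
   if type(a) ~= "table" or type(b) ~= "table" then return a == b end
   if a.tag ~= b.tag or #a ~= #b then return false end
   for i = 1, #a do
      if not same(a[i], b[i]) then return false end
   end
   return true
end

-- X holds the AST of 2+2; Y1 fills the hole with it.
local X  = { tag = "Op", { tag = "Add" }, { tag = "Number", 2 }, { tag = "Number", 2 } }
local Y1 = assign("four", X)

-- Y2 is the fully explicit form, as in the `Let{ ... } line above.
local Y2 = { tag = "Let",
             { { tag = "Id", "four" } },
             { { tag = "Op", { tag = "Add" }, { tag = "Number", 2 }, { tag = "Number", 2 } } } }

print(same(Y1, Y2))
```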

3.2 Splicing

Splicing is used in two rather different contexts. First, as seen above, it’s used to poke holes into quotations. But it is also used to execute code at compile time.

As can be expected from their syntaxes, -{...} undoes what +{...} does: quotes change a piece of code into the AST representing it, and splices cancel the quotation of a piece of code, including it directly in the AST (that piece of code therefore has to either be an AST, or evaluate to an AST; if not, the result of the surrounding quote won’t be an AST).

Caveat: Quote operators cannot be directly nested yet, i.e. Metalua won’t parse the expression +{ +{ foo } }, although it does have a value (this value is `Table{ `Key{ `String "tag", `String "Id" }, `String "foo" }, just in case you were wondering). Practical usages of such quoting towers, besides stressing the parser, are yet to be found.

But what happens when a splice is put outside of any quote? There is no explicit quotation to cancel, but actually, there is a hidden quotation. The process of compiling a Metalua source file consists of the following steps:

                      ______               ________
    +-----------+    /      \    +---+    /        \    +--------+
    |SOURCE FILE|-->< Parser >-->|AST|-->< Compiler >-->|BYTECODE|
    +-----------+    \______/    +---+    \________/    +--------+

FIXME: put a fancy diagram of metaprog dataflow.

So in reality, the source file is translated into an AST; this process is (almost) a quotation. What happens when a splice is found at this level then follows logically: the enclosed piece of code is not simply added to the AST; instead, it is immediately compiled and executed, and its result, which is expected to be an AST, is inserted in the original AST. As an example, consider the following source code, its compilation and its execution:

    fabien@macfabien$ cat sample.lua
    -{block: print "META HELLO"; return +{ print "GENERATED HELLO" } }
    print "NORMAL HELLO"
    fabien@macfabien$ ./mlc sample.lua
    Compiling sample.lua...
    META HELLO
    ...Wrote sample.luac
    fabien@macfabien$ lua sample.luac
    GENERATED HELLO
    NORMAL HELLO
    fabien@macfabien$ _

Thanks to the print statement in the splice, we see that the code it contains is actually executed during compilation. In more detail, what happens is that:

• The code inside the splice is parsed and compiled separately;
• it is executed: the call to print "META HELLO" is performed, and the AST representing print "GENERATED HELLO" is generated and returned;
• in the AST generated from the source code, the splice is replaced by the AST representing print "GENERATED HELLO". Therefore, what is passed to the compiler is the AST representing print "GENERATED HELLO"; print "NORMAL HELLO".

Take time to read, re-read, play and re-play with the manipulation described above: once you’re not lost anymore between levels and meta-levels, you master everything required to do meta-programming.

Notice that it is admissible, for a splice outside a quote, not to return anything. This allows executing code at compile time without adding anything to the compiled program, typically to load syntax extensions. For instance, this source will just print “META HELLO” at compile time, and “NORMAL HELLO” at runtime:

    -{print "META HELLO"}; print "NORMAL HELLO"
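The replacement of top-level splices can be mimicked with a toy model in plain Lua 5.1. Everything in this sketch is an assumption made for illustration: the "Splice" tag, the expand function and the log table are hypothetical, and the real compiler works on parsed source rather than on hand-built tables. The model only shows the order of events: the function held by a splice runs while the tree is being built, and the AST it returns (if any) takes the place of the splice.

```lua
-- Messages produced "at compile time" are recorded here instead of printed.
local log = {}

-- Replace each top-level Splice node by the AST its function returns.
local function expand(block)
   local out = {}
   for _, node in ipairs(block) do
      if node.tag == "Splice" then
         local result = node[1]()               -- executed during compilation
         if result ~= nil then out[#out + 1] = result end
      else
         out[#out + 1] = node
      end
   end
   return out
end

-- AST of: print "<msg>"
local function hello(msg)
   return { tag = "Call", { tag = "Id", "print" }, { tag = "String", msg } }
end

-- Toy counterpart of the sample.lua file above.
local source = {
   { tag = "Splice", function()
        log[#log + 1] = "META HELLO"            -- compile-time side effect
        return hello "GENERATED HELLO"
     end },
   hello "NORMAL HELLO",
}

local ast = expand(source)
print(log[1], #ast, ast[1][2][1], ast[2][2][1])
```

A splice whose function returns nothing simply disappears from the resulting tree, which corresponds to the last example above.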

3.3 A couple of simple concrete examples

Ternary choice operator. Let’s build something more useful. As an example, we will build here a ternary choice operator, equivalent to the _ ? _ : _ from C. Here, we will not deal yet with syntax sugar: our operator will have to be put inside splices. Extending the syntax will be dealt with in the next section, and then, we will coat it with a sweet syntax.

Here is the problem: in Lua, choices are made by using if _ then _ else _ end statements. It is a statement, not an expression, which means that we can’t use it in, for instance:

    local hi = if lang=="fr" then "Bonjour" else "Hello" end -- illegal!

This won’t compile. So, how to turn the “if” statement into an expression? The simplest solution is to put it inside a function definition. Then, to actually execute it, we need to evaluate that function. Which means that our pseudo-code

    local hi = lang == "fr" ? "Bonjour" : "Hello"

will actually be compiled into:

    local hi = (function ()
       if lang == "fr" then return "Bonjour" else return "Hello" end
    end) ()

We are going to define a function building the AST above, filling holes with parameters. Then we are going to use it in the actual code, through splices.

    fabien@macfabien$ cat sample.lua
    -{stat:
       function ternary (cond, b1, b2)
          return +{ (function()
                        if -{cond} then return -{b1} else return -{b2} end
                     end)() }
       end }

    lang = "en"
    hi = -{ ternary ( +{lang=="fr"}, +{"Bonjour"}, +{"Hello"}) }
    print (hi)
    lang = "fr"
    hi = -{ ternary ( +{lang=="fr"}, +{"Bonjour"}, +{"Hello"}) }
    print (hi)
    fabien@macfabien$ ./mlc sample.lua
    Compiling sample.lua...
    ...Wrote sample.luac
    fabien@macfabien$ lua sample.luac
    Hello
    Bonjour
    fabien@macfabien$ _
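The trick the macro relies on can be tried in plain Lua 5.1, without any metaprogramming: an if statement wrapped in an anonymous function that is called immediately behaves like an expression. The greeting function is just a hypothetical wrapper used to run the pattern with two inputs.

```lua
local function greeting(lang)
   -- The parenthesized function is defined and called on the spot,
   -- so the whole right-hand side is an ordinary expression.
   local hi = (function()
      if lang == "fr" then return "Bonjour" else return "Hello" end
   end)()
   return hi
end

print(greeting "en", greeting "fr")
```

This is the code shape that the ternary function above generates; the macro only saves the programmer from typing the wrapper by hand.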

Incrementation. Now, we will write another simple example, which doesn’t use quasi-quotes, just to show that we can. Another operator that C developers might be missing in Lua is the ++ incrementer. As with the ternary operator, we won’t show yet how to put the syntax sugar coat around it, just how to build the feature.

Here, the transformation is really trivial: we want to encode x++ as x=x+1 (we will only deal with ++ as a statement, not as an expression. However, ++ as an expression is not much more complicated to do. Hint: use the turn-statement-into-expr trick shown in the previous example). The AST corresponding to x=x+1 is `Let{ { `Id "x" }, { `Op{ `Add, `Id "x", `Number 1 } } }. From here, the code is straightforward:

    fabien@macfabien$ cat sample.lua
    -{stat:
       function plusplus (var)
          assert (var.tag == "Id")
          return `Let{ { var }, { `Op{ `Add, var, `Number 1 } } }
       end }

    x = 1;
    print ("x = " .. tostring (x))
    -{ plusplus ( +{x} ) };
    print ("Incremented x: x = " .. tostring (x))
    fabien@macfabien$ ./mlc sample.lua
    Compiling sample.lua...
    ...Wrote sample.luac
    fabien@macfabien$ lua sample.luac
    x = 1
    Incremented x: x = 2
    fabien@macfabien$ _

Now, we just miss a decent syntax around this, and we are set!

Caveat: Actually, I couldn’t have written the plusplus example with quotes, due to the hackish way the parser recognizes an assignment statement. This is going to be fixed before version β. Then, I’ll be able to write function plusplus(v) return +{ -{v} = -{v} + 1 } end.

4 MLP parser: how it works, how to extend it

mlp is the part of Metalua which parses a token stream into an AST; it relies on a lexer to provide that token stream, which has to implement the API defined by mll, the Metalua lexer. The specific feature of mlp is that it has to support extension-while-running, something that a naive parser, or a parser generated by some descendant of Yacc, cannot do. To do so, it relies on a generic library, gg (for Grammar Generator), which allows dynamically building, extending and running a parser. gg is basically a simplification of Parser Combinators, as found for instance in Haskell distributions.

This section will be organized as follows:

• First, the principles and API of gg will be presented;
• then, we will overview how mlp is implemented on top of gg, and more importantly, indicate the entry points in mlp that allow extending Metalua’s syntax. We will also present the extensions to Lua supported by mlp (there’s more than splice, quote and ADT syntax, indeed); this subsection will try to be as practical and “extension cookbook recipes” oriented as possible;
• finally, we will present a couple of examples of more significant syntax extensions. Then, it’ll be up to the reader to overpimp his Metalua with so much sugar that it gets cancer of the semicolon.

4.1 mll, the Metalua lexer

The lexer is in charge of cutting the source text into tokens: keywords, identifiers, numbers, strings, etc. A Lua-compliant tokenizer, mll, is provided with Metalua. Users can extend or change it, as long as the API described in this subsection is respected.

Type token. A token is an ADT, whose tag is one of the following: Id, Keyword, String, Number, Eof. It has one child, of type string, except for Number whose content is a number, and Eof which doesn’t need to have a child. It should have line and file attributes, which hold the location of the token. A lexer always returns a token: when all available tokens are consumed, it must return an infinite series of `Eof.

Type lx. A lexer lx is an object which can produce tokens, thanks to methods next and peek. Beware that in Metalua, Lua’s “:” method operator isn’t used: Metalua being of a more functional flavor than Lua, it uses closures to create objects, and therefore methods are called with a regular dot.

Function lx.peek([n:integer]): token. Return the n-th next token, without consuming it. If n is not given, it defaults to 1.

Function lx.next([n:integer]): token. Return the n-th next token, and consume it, i.e. after a call to lx.next(n), the token number n+1 becomes the token number 1, and all tokens whose index is less than or equal to n are flushed from the lexer. If n is not given, it defaults to 1.

Function lx.register(keyword:string): nil. Register a word as a keyword recognized by the lexer. Next time this keyword is found in the source, it is returned with a tag Keyword. If the keyword is composed of non-alphanumeric characters, the sequence will be recognized as a single token: for instance, if “-{” is registered, then the sequence won’t be returned as two separate tokens “-” then “{”. Registering the same keyword more than once must not crash the lexer.

Caveat: There is currently no way to change the lexer without hacking the compiler sources. Eventually, a command line option will be provided to plug in alternative lexers.
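To give a feel for this API, here is a toy object exposing peek, next and register, written in plain Lua 5.1 with closures and dot-called methods as described above. It is only a sketch under strong assumptions: new_lexer is a hypothetical constructor, it works on a list of words that are already split rather than on source text, it never produces String tokens, and it omits the line and file attributes.

```lua
-- Build a lexer object over an already-split list of words.
local function new_lexer(words)
   local keywords, pos = {}, 1
   local lx = {}

   -- Classify the word at absolute index i; past the end, always Eof.
   local function token_at(i)
      local w = words[i]
      if w == nil then return { tag = "Eof" } end
      if keywords[w] then return { tag = "Keyword", w } end
      if tonumber(w) then return { tag = "Number", tonumber(w) } end
      return { tag = "Id", w }
   end

   -- Look at the n-th next token without consuming anything.
   function lx.peek(n) return token_at(pos + (n or 1) - 1) end

   -- Return the n-th next token and flush tokens 1..n.
   function lx.next(n)
      n = n or 1
      local tok = token_at(pos + n - 1)
      pos = math.min(pos + n, #words + 1)
      return tok
   end

   -- Registering twice is harmless: the set entry is simply overwritten.
   function lx.register(kw) keywords[kw] = true end

   return lx
end

local lx = new_lexer{ "local", "x", "=", "42" }
lx.register "local"
lx.register "="
print(lx.peek().tag, lx.peek(2).tag)   -- look ahead, nothing is consumed
print(lx.next(2)[1])                   -- consumes "local" and "x"
print(lx.next(2).tag, lx.next().tag)   -- consumes "=" and "42", then hits the end
```

Since words are classified lazily, a keyword registered after the lexer was created still affects the tokens that have not been read yet.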

4.2 gg, the grammar generator

Caveat: This section is incomplete; some simpler APIs have been implemented, and are as yet undocumented.

The gg library is a parser combinator library: it offers higher-order functions, which take simple parsers as parameters, and return a more complex parser as a result.

Type parser. A parser is a function (or a table with a __call metatable entry) which takes a lexer as an argument, and returns a result (generally an AST, unless stated otherwise) while consuming tokens on the lexer.

Type seq_parser. The sequence parser is a subtype of parser. It takes a series of parsers and keywords, and chains these parsers together. Sequence parsers are created by gg.sequence().

Type seqset_parser. The sequence set parser is a subtype of parser: it has a list of seq_parsers, each starting with a distinct keyword, and applies the sequence whose initial keyword is found at lx.peek(1). Sequence set parsers are created by gg.sequence_set().

Type list_parser. The list parser reads a list of simpler elements, optionally separated by a given keyword. List parsers are created by gg.list().

Type expr_parser. Expression parsers make it easy to handle prefix, infix and postfix operators. Expression parsers are created by gg.expr().

Type onkw_parser. “On-keyword” parsers apply a parser only if the next token lx.peek(1) is a given keyword. If it is not, then the parser simply returns false without consuming any token. Such parsers are returned by gg.onkeyword().

Type functor. A functor describes a transformation to apply to an AST. If it is a string, then this string is a tag to be applied to the AST. If it is a function (or has a __call metatable entry), it is expected to be of type AST->AST, and is applied to the AST. If it is nil, it doesn’t modify the AST.

Caveat: The name functor is rather unfortunate: AST transformers are regular, first-order functions, not functors. The name sticks from an early design, now abandoned. Some better name will have to be found for the β version.

4.2.1 Sequence parsers

Function gg.sequence{ [functor = f:functor,] (string|parser)+ }: seq parser Generate a sequence parser: strings represent keywords, other parsers are applied in order. If f is provided, it is applied to the list of the results of all parsers in the sequence.] Examples Supposing that func params parses a possibly empty list of Lua identifiers separated by “,”, and block parses a block of Lua statements, the following call creates a parser which recognizes anonymous function definitions: ano_func_parser = gg.sequence{ "function", "(", func_params, ")", block, "end" }

16

The parser above returns a list whose first element is the result of func params(lx), and whose second element is the result of block(lx). We probably wish to apply the tag ‘Function to that list, thus creating a function definition AST. This is done with a functor: ano_func_parser = gg.sequence{ "function", "(", func_params, ")", block, "end", functor = "Function" } 4.2.2

Sequence parsers: chaining parsers together

Function gg.sequence set{seq parser*, [functor = f:functor,] [default = d:parser]}: seqset parser Takes a list of sequence parsers, which must all start with a distinct keyword. When applied on a lexer, runs the lexer whose initial keyword is returned by lx.peek(). If the next token is not a keyword, or not a keyword starting one of the sequence parsers, then d the default parser is run. If there is no default parser and no keyword is recognized, an error occurs. Example Supposing that ano_func_parser is defined as above, and we define true_parser and false_parser as below to recognize booleans: true_parser = gg.sequence{ "true", functor = function() return ‘True end } false_parser = gg.sequence{ "false", functor = function() return ‘False end } Supposing that expr parser parses expressions, then ss parser as defined below will parser booleans and anonymous functions; if the first token in the lexer is neither function nor tt true nor false, then it tries to recognize an expression by calling expr: ss_parser = ss.sequence_set{ true_parser, false_parser, ano_func_parser, default = expr } 4.2.3

Sequence set parsers: choosing the appropriate sequence

Function seqset_parser.add(sp:seq_parser)

Adds a new sequence parser to a sequence set parser. This sequence parser must, as usual, start with a keyword, and be the only one to start with this keyword in this sequence set parser.

Example

This line adds to the parser above the capability to handle balanced parentheses:

    ss_parser.add (gg.sequence { "(", ss_parser, ")",
                                 functor = function(x) return x[1] end } )

Function seqset_parser.del(kw:string)

Removes from a sequence set parser the sequence which starts with keyword kw.

Function seqset_parser.set_default(p:parser)

Sets or updates the default parser of a sequence set parser.
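To make the dispatch rule concrete, here is a minimal sketch in plain Lua of how a sequence parser and a sequence set parser could cooperate. It is a toy model under stated assumptions, not gg's implementation: the lexer is just a cursor over a list of strings, parsers are plain functions, and the names make_lexer, sequence, sequence_set and word are hypothetical.

```lua
-- Toy model only: NOT gg's implementation. All names here are hypothetical.

-- A minimal "lexer": a cursor over a pre-split list of tokens.
local function make_lexer(tokens)
   local i = 0
   return {
      peek = function() return tokens[i + 1] end,
      next = function() i = i + 1; return tokens[i] end,
   }
end

-- Sequence: strings are keywords to match, functions are sub-parsers.
-- Sub-parser results are collected in a list, then passed to `functor`.
local function sequence(spec)
   return {
      keyword = spec[1],  -- assumed to be a keyword string
      parse = function(lx)
         local results = {}
         for _, item in ipairs(spec) do
            if type(item) == "string" then
               local tok = lx.next()
               if tok ~= item then
                  error("expected '" .. item .. "', got '" .. tostring(tok) .. "'")
               end
            else
               results[#results + 1] = item(lx)
            end
         end
         if spec.functor then return spec.functor(results) end
         return results
      end,
   }
end

-- Sequence set: dispatch on the next token, fall back to `default`.
local function sequence_set(spec)
   local by_keyword = {}
   local self = {}
   function self.add(seq)
      assert(not by_keyword[seq.keyword], "keyword already used: " .. seq.keyword)
      by_keyword[seq.keyword] = seq
   end
   function self.del(kw) by_keyword[kw] = nil end
   function self.parse(lx)
      local seq = by_keyword[lx.peek()]
      if seq then return seq.parse(lx) end
      if spec.default then return spec.default(lx) end
      error("no sequence starts with '" .. tostring(lx.peek()) .. "'")
   end
   for _, seq in ipairs(spec) do self.add(seq) end
   return self
end

-- Default parser: any other token becomes an identifier node.
local function word(lx) return { tag = "Id", lx.next() } end

local ss = sequence_set{
   sequence{ "true",  functor = function() return { tag = "True" } end },
   sequence{ "false", functor = function() return { tag = "False" } end },
   default = word,
}
-- Balanced parentheses, added after creation as with seqset_parser.add:
ss.add(sequence{ "(", ss.parse, ")", functor = function(x) return x[1] end })

local r1 = ss.parse(make_lexer{ "(", "(", "true", ")", ")" })
local r2 = ss.parse(make_lexer{ "foo" })
```

Keeping one table entry per initial keyword is what makes the "distinct keyword" requirement necessary in this model: the next token alone decides which sequence runs, so no backtracking is ever needed.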


4.2.4

List parsers

Function gg.list{ primary = p:parser, [separators = s:(table|string),] [terminators = t:(table|string),] [functor = f:functor] }: list_parser

Read a list of elements parsed by parser p. s and t are either strings, or tables of strings, or nil. If string(s) are provided as separators, then such a keyword is expected between each element parsed by p. If separators are given, then list reading stops as soon as such a separator is not found. If terminators are given, then reading stops as soon as one of these terminators is found (and the terminator token is not consumed). At least one of the terminators and separators fields has to be provided. If no terminators are provided, then the resulting parser can’t handle empty lists.

Examples

Supposing that variable parses identifiers, the parser func_params defined below parses lists of comma-separated variables:

    func_params = gg.list{ primary=variable; separators="," }

However, this parser can’t accept empty lists. If we want to use it to define functions, we need to accept empty lists of parameters; to do so, we must state that “)” is a legal list terminator:

    func_params = gg.list{
       primary     = variable;
       separators  = ",";
       terminators = ")" }

This parser works great with ano_func_parser as defined above (except that it doesn’t accept a final “...”, as Lua does).

4.2.5

Expression parsers

Caveat: This section is messy, inaccurate and obsolete. Due to some improvements to gg, the expression parser has been significantly refactored and simplified. The aim is to make syntax extension easy through a couple of recipes.

This parser generator handles prefix, infix and postfix operators. It is quite powerful, but also quite complex.

Function gg.expr{ primary = p:parser, [prefix = pre_o:op_cfg_list,] [infix = in_o:op_cfg_list,] [postfix = post_o:op_cfg_list] }: expr_parser

Create an expression parser, which binds elements parsed by parser p with operators described by pre_o, in_o and post_o. These operators are described by operator config tables, which are lists of operator descriptions.

Type op_cfg_list
List of operator descriptors. Each descriptor is of type op_desc.

Type op_desc
Description of an operator; it is a table with:

• a mandatory keyword field, holding a string, which is the name of the operator;
• an optional prec field, holding a number, which is the precedence of the operator. The higher the precedence, the tighter the operator binds; if not specified, the default precedence is 1;
• an optional assoc field, holding one of "left", "right" or "none", which describes the associativity of an infix operator. It is only meaningful for infix operators; if not specified, left associativity is assumed;
• an optional functor field to apply as usual to the parser’s results.

Examples

Supposing that number parses numbers, the following calc parser computes the usual infix operations on numbers, with usual precedences and associativity:

    calc = gg.expr {
       primary = number,
       infix = {
          { keyword = "+", prec = 60, functor = function(x) return x[1]+x[2] end },
          { keyword = "-", prec = 60, functor = function(x) return x[1]-x[2] end },
          { keyword = "*", prec = 70, functor = function(x) return x[1]*x[2] end },
          { keyword = "/", prec = 70, functor = function(x) return x[1]/x[2] end },
          { keyword = "%", prec = 70, functor = function(x) return x[1]%x[2] end },
          { keyword = "^", prec = 90, functor = function(x) return x[1]^x[2] end,
            assoc = "right" } } }
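To see how the prec and assoc fields interact, the following plain-Lua sketch evaluates a token list with a standard precedence-climbing loop. This is a swapped-in textbook technique, not gg.expr's actual code, which this section does not show; the operator table copies the values of the calc example, and the function name calc and the token-list input format are assumptions made for illustration.

```lua
-- Toy model only: a standard precedence-climbing loop, not gg.expr's code.
-- The operator table mirrors the calc example's prec/assoc/functor fields.
local infix = {
   ["+"] = { prec = 60, assoc = "left",  functor = function(x) return x[1] + x[2] end },
   ["-"] = { prec = 60, assoc = "left",  functor = function(x) return x[1] - x[2] end },
   ["*"] = { prec = 70, assoc = "left",  functor = function(x) return x[1] * x[2] end },
   ["/"] = { prec = 70, assoc = "left",  functor = function(x) return x[1] / x[2] end },
   ["%"] = { prec = 70, assoc = "left",  functor = function(x) return x[1] % x[2] end },
   ["^"] = { prec = 90, assoc = "right", functor = function(x) return x[1] ^ x[2] end },
}

-- Evaluate a pre-split token list such as { 2, "+", 3, "*", 4 }.
local function calc(tokens)
   local pos = 1
   local function expr(min_prec)
      local left = tokens[pos]          -- primary element: a number
      assert(type(left) == "number", "number expected")
      pos = pos + 1
      while true do
         local op = infix[tokens[pos]]
         -- Stop when there is no operator, or when it binds too loosely
         -- to belong to the sub-expression being read.
         if not op or op.prec < min_prec then return left end
         pos = pos + 1
         -- Left-associative: the right operand only takes tighter operators.
         -- Right-associative: it may take the same operator again.
         local next_min = op.assoc == "right" and op.prec or op.prec + 1
         left = op.functor{ left, expr(next_min) }
      end
   end
   return expr(0)
end

local a = calc{ 2, "+", 3, "*", 4 }     -- read as 2 + (3 * 4)
local b = calc{ 10, "-", 4, "-", 3 }    -- read as (10 - 4) - 3
local c = calc{ 2, "^", 3, "^", 2 }     -- read as 2 ^ (3 ^ 2)
```

The only difference between left and right associativity in this sketch is whether the minimum precedence passed to the recursive call is raised by one.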

Function expr_parser.add_prefix(op:op_desc)
Adds a new prefix parser to an expression parser.

Function expr_parser.add_infix(op:op_desc)
Adds a new infix parser to an expression parser.

Function expr_parser.add_postfix(op:op_desc)
Adds a new postfix parser to an expression parser.

Example

Let’s add a prefix unary minus to our calculator:

    calc.add_prefix{ keyword="-", prec = 80, functor = function(x) return -x[1] end }

Now we can handle stuff like −3 + −4 ∗ 5, which is read as (−3) + ((−4) ∗ 5). If we had set the unary minus precedence to less than 60, it would have been read as −(3 + −(4 ∗ 5)), as infix operators would have bound tighter.

4.2.6

On-keyword conditional parsing

Function gg.onkeyword{ keyword = kw:string, p:parser }

Returns a parser that applies parser p if the next token in the lexer is the keyword kw, or does nothing and returns false otherwise.
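The behaviour described above can be sketched in plain Lua as follows. This is a toy model with hypothetical names (make_lexer, onkeyword, number, step), not gg.onkeyword's actual code; in particular, whether the keyword itself is consumed before p runs is an assumption made here, since the text does not say.

```lua
-- Toy model only, with hypothetical names; not gg.onkeyword's actual code.
local function make_lexer(tokens)
   local i = 0
   return {
      peek = function() return tokens[i + 1] end,
      next = function() i = i + 1; return tokens[i] end,
   }
end

-- Run parser `p` only when the next token is `kw`; otherwise consume
-- nothing and return false.
local function onkeyword(kw, p)
   return function(lx)
      if lx.peek() == kw then
         lx.next()   -- assumption: the keyword itself is consumed here
         return p(lx)
      end
      return false
   end
end

-- A primary parser reading one numeric token.
local function number(lx) return tonumber(lx.next()) end

-- An optional "step <n>" clause, as might follow a loop header.
local step = onkeyword("step", number)

local lx1 = make_lexer{ "step", "2", "do" }
local lx2 = make_lexer{ "do" }
local with_step, without_step = step(lx1), step(lx2)
```

Returning false rather than raising an error is what makes such a parser usable for optional clauses: the caller can test the result and carry on with the token stream untouched.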

4.3

mlp entry points

As stated above, mlp, the Metalua source parser, is implemented on top of gg. It can be extended through a couple of entry points, thanks to the gg API.


4.4

Expressions

• mlp.expr is the main Metalua expression parser. It parses all unary and binary operators described in the tables pp.?? and ??. Between operators, primary elements are of type mlp.expr_2. mlp.expr can be extended with methods add_prefix, add_infix and add_postfix.

• mlp.expr_2 is not fit for extension: it calls mlp.expr_1, then mlp.expr_1_postfix as many times as possible.

• mlp.expr_1 is a sequence set parser, which recognizes expressions starting with a dedicated keyword, that is:
   – ( _ ),
   – function ( _ ) _ end,
   – -{ _ },
   – +{ _ },
   – `_{ _ },
   – nil,
   – true,
   – false,
   – ...
  This parser should be extended through its add method, when one wants to add a syntactic construct which starts with a dedicated keyword.

• mlp.expr_1_postfix is more complex: it reads postfix expressions, which are not simple keywords:
   – table indexing [ _ ];
   – field-style indexing . _;
   – method invocation : _ ( _ );
   – function call ( _ );
   – parenthesis-less function calls (single table or string argument).
  This parser doesn’t return an AST, but a function to apply to the expression it is postfixing. This parser, which is a sequence set parser, should be extended when one wants to add postfix notations to expressions. An example will be given below.

• mlp.func_params is a list parser which reads the parameters of a function or method definition.

• mlp.func_args is a list parser which reads the arguments of function and method applications.

Example

Let’s give a decent syntax to the ternary operator defined above. We won’t stick to the C syntax _ ? _ : _, because the colon would be ambiguous with method calls. Instead, we will use a comma, as in _ ? _ , _. We take up the definition of ternary given above, and add a postfix sequence parser reading ? _ , _ to expr_1_postfix. Here is a first, rather ugly version:

    -{stat:
       -- same as before:
       function ternary (cond, b1, b2)
          return +{ (function()
                        if -{cond} then return -{b1} else return -{b2} end
                     end)() }
       end

       -- now, let's deal with the syntax:
       local function my_functor(x) return ternary(x[1], x[2], x[3]) end
       local x = gg.sequence{ "?", mlp.expr, ",", mlp.expr, functor = my_functor }
       mlp.expr.add_postfix_sequence (x)
    }

    lang = "us"
    print ((lang=="fr") ? "Bonjour", "Hello")
    lang = "fr"
    print ((lang=="fr") ? "Bonjour", "Hello")

But given how simple our ternary operator is, we could just merge everything in a single statement:

    -{ stat : mlp.expr.add_postfix_sequence {
          "?", mlp.expr, ",", mlp.expr,
          functor = |x| +{ function()
                              if -{x[1]} then return -{x[2]} else return -{x[3]} end
                           end () } } }

    lang = "en"
    print ((lang=="fr") ? "Bonjour", "Hello")
    lang = "fr"
    print ((lang=="fr") ? "Bonjour", "Hello")

This will, as expected, write “Hello” then “Bonjour”.

But did you notice the |x| bit? This is syntax sugar for a very common operation in Metalua. Being more functional than Lua, Metalua works a lot with anonymous functions that return a value, the equivalent of Lisp’s lambda expressions. Therefore Metalua has a terser syntax for them: instead of writing function(foo, bar) return compute_with(foo, bar) end, you can just write |foo, bar| compute_with(foo, bar). Functional programmers will notice that this syntax is currying-friendly: |foo| |bar| foo+bar.

Caveat: The verbose version is not tested, and might contain typos.

Caveat: add_postfix_sequence has some issues:

• It isn’t documented;
• It doesn’t support precedence yet: postfix sequences always bind tighter than infix operators.

Both of these will be fixed Real Soon Now™.

4.4.1

Statements

• FIXME

Example

FIXME: introduce syntax for ++ operator.


5

Some serious examples

The examples given so far have remained extremely simple, and could have been implemented with a preprocessor or with token filters. In this section, we are going to show a couple of more complex, life-size extensions, which use the full power of macros.

FIXME: I’ll probably present here: a dynamic typechecking framework, and my ML-like pattern matching on ADT. Maybe also a sketchy Prolog inference engine. If I could find a meta-programming twist to the XML/ADT converters, that’d be sweet as well.

6

Roadmap to beta version: next features

Metalua is still alpha, and some crucial features are still missing. Most strikingly, a simpler API for syntax extension is really required: people shouldn’t have to know that much about mlp. Actually, they should not even be required to work with gg.

FIXME: list other showstoppers, give fixing time estimates.


A

AST structure reminder

Metalua AST Structure
=====================

block: { stat* (`Return{expr*} | `Break)? }

stat:
| `Do{ stat* (`Return{expr*} | `Break)? }
| `Let{ {lhs+} {expr+} }
| `While{ expr block }
| `Repeat{ block expr }
| `If{ (expr block)+ block? }
| `Fornum{ ident expr expr expr? block }
| `Forin{ {ident+} {expr+} block }
| `Local{ {ident+} {expr+}? }
| `Localrec{ ident expr }
| apply

expr:
| `Nil | `Dots | `True | `False
| `Number{ }
| `String{ }
| `Function{ { ident* `Dots? } block }
| `Table{ ( `Key{ expr expr } | expr )* }
| `Op{ opid expr expr? }
| apply
| lhs

apply:
| `Call{ expr expr* }
| `Method{ expr `String{ } expr* }

lhs: ident | `Index{ expr expr }

ident: `Id{ }

opid: `Add | `Sub | `Mul    | `Div
    | `Mod | `Pow | `Concat | `Eq
    | `Ne  | `Gt  | `Ge     | `Lt
    | `Le  | `And | `Or     | `Not
    | `Len
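As a reading aid for the grammar above, here is a small plain-Lua sketch that builds one AST by hand and renders part of it back to source text. It relies on the Metalua convention that `Tag{ ... } denotes the table { tag = "Tag", ... }. The helper names node and to_src are hypothetical, and storing the literal value as the first element of `Number and `Id nodes is an assumption, since those payloads are left blank in the grammar above.

```lua
-- Plain-Lua spelling of Metalua ADT values: `Tag{ ... } is { tag = "Tag", ... }.
-- The helper names (node, to_src) are hypothetical, for illustration only.
local function node(tag, ...)
   local t = { ... }
   t.tag = tag
   return t
end

-- AST for the statement:  local x = 1 + 2 * y
-- following  `Local{ {ident+} {expr+}? }  and  `Op{ opid expr expr? }.
-- Assumption: `Number and `Id carry their value as first element.
local ast = node("Local",
   { node("Id", "x") },
   { node("Op", node("Add"),
          node("Number", 1),
          node("Op", node("Mul"), node("Number", 2), node("Id", "y"))) })

-- Source spelling of a few binary operators from the opid list.
local symbols = { Add = "+", Sub = "-", Mul = "*", Div = "/" }

-- Render a small subset of expressions back to source text.
local function to_src(e)
   if e.tag == "Number" or e.tag == "Id" then
      return tostring(e[1])
   elseif e.tag == "Op" and e[3] then
      return "(" .. to_src(e[2]) .. " " .. symbols[e[1].tag]
                 .. " " .. to_src(e[3]) .. ")"
   end
   error("unsupported node: " .. tostring(e.tag))
end

local src = to_src(ast[2][1])
```

The renderer adds explicit parentheses around every binary operation, so the nesting of the `Op nodes stays visible in the output regardless of operator precedence.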
