Language implementation as a functional tool-building exercise

Sam Kamin
University of Illinois at Urbana-Champaign
[email protected]

Abstract

We describe an experiment in the use of domain-specific embedded languages for program generation. Specifically, we describe how the processor for the language JR has been built by embedding program generators in Standard ML. The processor is built from four program generators: a lexer generator, a parser generator, an abstract syntax generator, and an abstract syntax tree translation generator. We discuss the specifications given in these embedded languages as well as their implementation. In particular, we show that using the embedded language approach leads to powerful languages at relatively low cost.

Keywords: Domain-specific languages, functional languages, program generators

1 Introduction

A number of recent studies have explored the design of new special-purpose languages by extension to existing languages. Such extensions have been called embedded domain-specific languages [9, 6]. The embedding seems to work best with functional languages, because the existence of higher-order functions allows fragments to be put together neatly. Indeed, we view this as a way of gaining acceptance for functional languages. It is especially apropos for domain-specific languages, because it is a low-cost method, and provides power and conciseness, all of which are highly prized in domain-specific languages.

In this paper, we consider the development of program generators by the method of embedding into a functional language. Program generation is a potentially important application of programming language technology. Anecdotal evidence indicates that the development of a program generator can bring enormous productivity gains. The central thesis of this work is that the distinction between program generators and other domain-specific languages is artificial. In particular, program generators can be developed by embedding into a functional language. As with other domains, this has the advantage of being relatively easy to do, and it makes available all the power of the underlying language. It does have some disadvantages (the syntax is not as clean as if it were defined specifically for the purpose, and error messages can be confusing) but it produces effective languages at low cost.

This paper describes an experiment in the use of embedded program generators. We have used them to develop an interpreter for an ML-like language that we call JR. The interpreter is,

Partially supported by NSF grant CCR 96-19644


at present, based on four domain-specific languages. These are embedded in Standard ML [15]. In the next section we describe the structure of the interpreter and give a brief overview of each of the four languages. Spinellis [14] characterizes language implementation as a tool-building process. This paper advocates that the tools be built in a functional language. The title of this paper is a deliberate take-off on [14].

1.1 Related work

A number of experiments in developing languages by embedding them in existing functional languages have been reported. The method has been particularly championed by Hudak [10, 9], who also coined the term domain-specific embedded language. Haskore [10] is a Haskell-based language for the specification of musical scores. Fran [7, 6] is a language for describing animation models, also based on Haskell. FPIC [12] is an ML-based language for specifying simple pictures. Embedding languages in this way is also advocated in [4, 5, 9].

Program generation is a very common activity, but program-generating languages do not always seem to be considered as full-fledged languages. Examples include Gelernter's program builders [1] and Waters's KBEmacs [17]. Each provides a fixed set of program-building operations, but there is no direct manipulation of programs, nor is it easy for users to define their own operations. Work at OGI [3] and at ISI [2] is specifically aimed at the development of program generators. In both cases, the specification is translated to a very-high-level language, from which an imperative language program is extracted. Our approach is more direct: the program-generating language provides operations that manipulate programs, not specifications. Furthermore, that language has powerful features inherited from its base language, and the translation is not dependent upon the success of a compiler in compiling a very-high-level language down to imperative code.

Spinellis [14] describes an implementation of Haskell which is similar to our implementation of JR in that he builds a variety of mini-languages to specify the various phases of the compiler. However, these languages are written in Perl, and are not themselves functional languages. Sheard and Nelson have specifically addressed the use of ML as a meta-language for itself [13]. Their concern is to guarantee the type correctness of generated ML programs.

2 The experiment

Our interpreter is at present built on four domain-specific languages, which are used to specify programs in C++:

- Lexer generator. Tokens are specified by regular expressions, as in Lex, with associated actions.

- Parser generator. We've written an LL(1) parser generator; as in Yacc, value-producing actions are associated with productions, and these actions can refer to the values returned from subtrees.

- Abstract syntax generator. This language generates a C++ class specification from a specification of the abstract syntax operators. In our processor, we use two "abstract syntaxes," one fairly close to the concrete syntax and the other more abstract.

- Tree transformation generator. We specify the translation from the more concrete syntax to the abstract one in this language.

There is additional supporting code written in C++. In particular, the evaluation function, which interprets the abstract syntax trees, is written directly in C++. Overall, the amount of code is approximately as follows (we give the counts in characters instead of lines, because the program generators can generate programs with many very short lines):

    Specifications: 21,000 characters of embedded language code
        (which expand to 152,000 characters of C++ code)
    Hand-written C++ code: 27,000 characters
    Total hand-written code: 48,000 characters
    Total C++ code in application: 179,000 characters

These numbers do not include the ML code implementing the program generators, but only the "benefit" of using the program generators. However, as we will see, the program generators are all relatively small. (It is true that there are lexer and parser generators available in ML, based on Lex and Yacc, but those produce code in ML, whereas we want to produce C++ code. In any case, the goal is to study embedded program generators, and we have no reason to think that the lessons from these examples will not be applicable to program generators for other domains.)

In this section, we describe each of the four languages mentioned above, with examples of specifications and of generated code (when that code is at all readable), and a brief explanation of how they were implemented.

2.1 Lexer generator

We have implemented a lexer generator whose input is intended to mimic the style of Lex. That is, the user specifies regular expressions with associated actions; the scanner recognizes the longest sequence of input characters that is matched by one of the regular expressions, and then performs the associated action. The specification for lexing in JR is approximately 70 lines long, not counting supporting C++ code. We reproduce here a few representative lines of the specification:

    val digit     = #"0" to #"9";
    val integer   = kleeneplus digit;
    val real      = (kleenestar digit) oo (str ".") oo (kleeneplus digit);
    val character = (str "'") oo any oo (str "'");
    ...
       (integer   ==> (%`return make_token(^(ts tc_integer));`)
    || real      ==> (%`return make_token(^(ts tc_real));`)
    || character ==> (%`return make_token(^(ts tc_char));`)
    ...
    || str "let" ==> (%`return make_token(^(ts kw_let));`)
    || str "in"  ==> (%`return make_token(^(ts kw_in));`)
    || str "end" ==> (%`return make_token(^(ts kw_end));`)
    ...
    || ident ==> (%`if ((ptab = search_user_alpha()) == NULL)
                        return make_token(^(ts tc_ident));
                    else return make_userop_token(ptab);`)
    || oneOf " \n\t" ==> (%`;`)
    || str "//" oo (kleenestar any) ==> (%``)
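The discipline the generator follows (try every pattern at the current position, take the longest match, run its action) can be sketched in Python. Here Python's re module stands in for the generated DFA, and the token names are illustrative rather than the ones used in JR:

```python
import re

# Each rule pairs a regular expression with an action run on the matched text.
# Rule order breaks ties between equally long matches, as in Lex.
RULES = [
    (re.compile(r"[0-9]+\.[0-9]+"), lambda s: ("REAL", s)),
    (re.compile(r"[0-9]+"),         lambda s: ("INTEGER", s)),
    (re.compile(r"let|in|end"),     lambda s: ("KEYWORD", s)),
    (re.compile(r"[A-Za-z_]\w*"),   lambda s: ("IDENT", s)),
    (re.compile(r"[ \t\n]+"),       lambda s: None),  # whitespace: no token
]

def tokenize(text):
    pos, out = 0, []
    while pos < len(text):
        # Maximal munch: among all rules that match here, pick the longest.
        m, act = max(((r.match(text, pos), a) for r, a in RULES),
                     key=lambda p: p[0].end() if p[0] else -1)
        if m is None or m.end() == pos:
            raise SyntaxError(f"stuck at position {pos}")
        tok = act(m.group())
        if tok is not None:
            out.append(tok)
        pos = m.end()
    return out
```

Note that a keyword such as "let" wins over the identifier rule only by rule order, while a longer identifier such as "lets" wins by length; this is exactly Lex's disambiguation policy.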

The first three lines shown here are definitions, such as one gives in Lex before the actual listing of regular expressions. The meaning of the various operations used here, in order of their appearance, is:

    c1 to c2: Regular expression matching any of the characters in the range given by the arguments; like Lex's [c1-c2].
    kleenestar regexp: Regular expression matching any sequence (zero or more) of strings that match its argument; like Lex's regexp*.
    str string: Matches exactly the characters in its argument and nothing else; like Lex's string (but omitting the quotes).
    regexp1 oo regexp2: Matches regexp1 followed by regexp2; like Lex's regexp1 regexp2.
    kleeneplus regexp: Regular expression matching any non-empty sequence of strings that match its argument; like Lex's regexp+.
    any: Regular expression matching any single character except newline; like Lex's dot (.).
    oneOf string: Matches any one character in string; like Lex's [c1 c2 ... cn].

The actions associated with each regular expression are written in C++ directly, and are straightforward. (The action associated with identifiers includes a test for whether that identifier has been added as a user-defined infix operator.)

We do need to explain the "anti-quotation" notation used here, as it is used heavily both here and in the program generators themselves. This feature of Standard ML is similar to the "backquote-comma" notation of Lisp; here, the backquote serves the same purpose as in Lisp, and the caret (^) is used instead of the comma. Specifically, within the brackets %`...`, all characters are taken literally, except when a caret is present; in that case, the expression immediately following the caret (which may be either a single token, such as an identifier, or a parenthesized expression) is evaluated and its value (which must be a string) is interpolated. In the above

specification, the only use of the feature is to apply the function ts to the tokens, which transforms the integer-valued tokens (defined in an earlier part of the specification) to strings.

We will not show the output of this specification, because it consists primarily of a large matrix definition, giving the transitions of the DFA that results from processing the specification. We will, however, show some of the code implementing the lexer generator. In total, the lexer generator consists of about 400 lines of Standard ML code. This breaks down into approximately 100 lines to implement the standard NFA-to-DFA conversion; another 100 lines of output code; about 100 lines defining user-level operations such as those used in the specification above (kleenestar, etc.); and another 100 of miscellaneous type declarations and supporting code. (By comparison, flex, the Free Software Foundation's version of Lex, consists of about 7000 lines of C.)

The basic structure of the lexer generator is easy to describe: There is a data type called NFA. The operations mentioned above have values of type NFA as their arguments and/or results. The operation genlexer transforms the NFA to a DFA and writes out its transition matrix, the actions associated with each final state, and so on. This code, together with some additional C++ code supplied with the lexer generator, defines a function gettoken which, when called repeatedly, scans for a token and evaluates the associated action.

The most complicated part of the lexer generator is the function that performs the NFA-to-DFA translation, called subsetConstruction. (Indeed, it could easily be more complicated, if it were tuned for efficiency.) We won't show this code, because it represents a well-known algorithm. The functions implementing the user-level operations are more interesting.
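Since subsetConstruction is the textbook algorithm, its core can be sketched for reference in Python, using a toy dictionary-based representation rather than the paper's ML data type:

```python
def eps_closure(states, eps):
    """All states reachable from `states` via epsilon moves."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction(nfa_trans, eps, start, finals, alphabet):
    """nfa_trans: dict (state, char) -> set of successor states.
    Returns a DFA as (transition dict on integer states, set of final states)."""
    start_set = eps_closure({start}, eps)
    names = {start_set: 0}          # each reachable state-set gets a number
    dfa_trans, worklist = {}, [start_set]
    while worklist:
        S = worklist.pop()
        for c in alphabet:
            T = eps_closure({t for s in S for t in nfa_trans.get((s, c), ())}, eps)
            if T not in names:
                names[T] = len(names)
                worklist.append(T)
            dfa_trans[(names[S], c)] = names[T]
    dfa_finals = {names[S] for S in names if S & finals}
    return dfa_trans, dfa_finals
```

The generated lexer would serialize dfa_trans as the transition matrix mentioned above; a production version would also need the per-final-state actions and dead-state pruning.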
For example, here is the definition of kleenestar, which is a function from NFAs to NFAs:

    fun kleenestar (M as (_,_,_,_,_,fs):NFA) =
        makeStateFinal 0
            (foldr (fn (s, M) => makeStateNonfinal s M)
                   (foldr (fn (s, M) => addEpsTransition s 0 M) M fs)
                   fs);

We don't expect the reader to understand the details, but in broad terms it is easy to see that the construction implements the well-known Kleene construction for transforming an NFA for a regular expression R into an NFA for R*: add epsilon transitions from all final states to the start state, and then make the start state final and all existing final states non-final.

We consider this part of the experiment to have been a success. Our lexer generator has all the essential features of Lex except for "start conditions," which would not be hard to implement. It is written in somewhere around 10% of the code size of flex. This is in large part because all the support structure needed by Lex is already available in ML. To take one example, the definition facility used above (the one used to name digit, integer, and real) is simply the definition facility of ML; no code was needed to implement it. Furthermore, our lexer generator is more powerful than Lex, in that derived operations can be defined by the user. For example, kleeneplus is given by the following definition in the above specification:

    fun kleeneplus M = M oo (kleenestar M);
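The point that combinators are just host-language values can be illustrated outside ML as well. The sketch below models a regular expression not as an NFA but as a Python function from a starting position to the set of positions where a match can end; kleeneplus is then a one-line derived definition, exactly as in the specification above. This is an analogy, not the paper's implementation:

```python
# A "regular expression" is a function: (text, i) -> set of end positions.

def str_(s):
    """Match exactly the string s (the paper's `str`)."""
    return lambda text, i: {i + len(s)} if text.startswith(s, i) else set()

def oo(r1, r2):
    """Concatenation, named after the paper's operator."""
    return lambda text, i: {k for j in r1(text, i) for k in r2(text, j)}

def kleenestar(r):
    """Zero or more repetitions of r."""
    def match(text, i):
        ends, frontier = {i}, {i}
        while frontier:  # saturate: keep extending until no new end positions
            new = {k for j in frontier for k in r(text, j)} - ends
            ends |= new
            frontier = new
        return ends
    return match

# A derived combinator, defined by the user just as in the ML version:
def kleeneplus(r):
    return oo(r, kleenestar(r))
```

As in the ML version, nothing distinguishes built-in combinators from user-defined ones; both are ordinary functions of the host language.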

Lex includes no facility for defining new regular expression combinators like this one. The price for these advantages is the obvious syntactic awkwardness, as well as rather poor error messages. These are serious disadvantages, but on balance we think the benefit/cost ratio of this program generator is very high.

2.2 Parser generator

Our parser generator is LL(1). An example of a grammar rule is

    val Def = "Definition" ::=
           (term kw_val oo nonterm "Pattern" oo term sc_eq oo nonterm "Expr"
              ==> %`mkcs_valdef(^($ 2), ^($ 4))`)
        || (term kw_fun oo nonterm "ClauseList"
              ==> %`mkcs_fundef(^($ 2))`);

In right-hand sides, terminal symbols must be introduced with the function term and non-terminals with nonterm, and these must be connected by oo. The actions use the anti-quotation mechanism just as in the lexer generator. In our language implementation, the parser builds an "abstract" syntax tree; mkcs_valdef and mkcs_fundef are tree-building operations written in C++ (actually, generated by the abstract syntax generator discussed in the next section). In the calls to those C++ functions, the expressions ($ 2) and ($ 4) represent the values of the second and fourth subtrees, respectively; those are the trees corresponding to the Pattern and Expr non-terminals. (The choice of $ was intended to mimic Yacc.)

There is little point in discussing the details of this language. We would again mention that it compares favorably in terms of implementation size and effort with Yacc; like the lexer generator, it consists of about 400 lines of ML code, compared to about 7000 for bison, FSF's version of Yacc. (Of course, the LL(1) construction is simpler than bison's bottom-up construction, and we have no ambiguity-resolution facility.) Again, the syntax is somewhat awkward, but in return we gain some power from the underlying language.

An example of that power is in our treatment of the expression grammar. As is well known, the expression grammar has to be mangled somewhat to get it into LL(1) form. For example, here is an LL(1) expression grammar for addition and multiplication:1

    Expr    -> Term Expr'
    Expr'   -> + Term Expr'  |  (empty)
    Term    -> Primary Term'
    Term'   -> * Primary Term'  |  (empty)
    Primary -> id | ...
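A grammar in this form maps directly onto a recursive-descent parser, which is essentially what the LL(1) generator produces tables for. Here is a Python sketch for the grammar above; the tail rules Expr' and Term' become loops, and the tuple tree shape is our own invention, not JR's:

```python
def parse_expr(toks, i):
    """Expr -> Term Expr'. Returns (tree, next index)."""
    left, i = parse_term(toks, i)
    while i < len(toks) and toks[i] == '+':   # Expr' -> + Term Expr' | empty
        right, i = parse_term(toks, i + 1)
        left = ('+', left, right)
    return left, i

def parse_term(toks, i):
    """Term -> Primary Term'."""
    left, i = parse_primary(toks, i)
    while i < len(toks) and toks[i] == '*':   # Term' -> * Primary Term' | empty
        right, i = parse_primary(toks, i + 1)
        left = ('*', left, right)
    return left, i

def parse_primary(toks, i):
    """Primary -> ( Expr ) | id   (single-character identifiers here)."""
    if toks[i] == '(':
        tree, i = parse_expr(toks, i + 1)
        assert toks[i] == ')', "expected closing parenthesis"
        return tree, i + 1
    return toks[i], i + 1
```

The loops give the usual left-associated trees; the generated JR parser instead threads the Expr' values through functions, as the leveln code below shows.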

In our case, we wanted to allow nine levels of precedence (mimicking Standard ML), and this becomes tedious, not only initially but in maintenance of the code in the long term. Since the language is embedded in ML, we were able to write a function to generate these rules:

    fun leveln (n:int) =
        let val ns    = Int.toString n
            val ns'   = Int.toString (n+1)
            val ntn   = "Expr"^ns
            val ntn'  = "Expr"^ns^"prime"
            val ntn'' = "Expr"^ns'
            val infix = tc_infix+n-1
        in [ntn ::= (nonterm ntn'' oo nonterm ntn'
                       ==> %`applyConcSynFun(^($ 2), ^($ 1))`),
            ntn' ::= (empty ==> %`mkIdentityConcSynFun()`)
                  || (term infix oo nonterm ntn'' oo nonterm ntn'
                       ==> %`mkBinapplicConcSynFun(^($ 1),
                               applyConcSynFun(^($ 3), ^($ 2)))`)
           ]
        end;

1. For parsing purposes, the rule for Expr' could be simplified to Expr' -> + Expr, but our form makes it easier to assign actions to productions.

The expression List.concat (map leveln (int_interval 1 9)) generates the rules for all nine levels (the tenth level, or "primary," has different rules and is written by hand).
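The rule-generation trick is not specific to ML. A Python sketch of the same idea, with grammar rules as plain (lhs, rhs) data and placeholder operator tokens rather than JR's actual tc_infix encoding:

```python
def leveln(n, ops):
    """Grammar rules for precedence level n, as (lhs, rhs) pairs.
    ops[n] is the infix operator token at that level (a placeholder here)."""
    e, e_next, e_prime = f"Expr{n}", f"Expr{n + 1}", f"Expr{n}prime"
    return [
        (e,       [e_next, e_prime]),           # Expr_n  -> Expr_{n+1} Expr_n'
        (e_prime, []),                          # Expr_n' -> (empty)
        (e_prime, [ops[n], e_next, e_prime]),   # Expr_n' -> op Expr_{n+1} Expr_n'
    ]

OPS = {n: f"op{n}" for n in range(1, 10)}       # hypothetical operator tokens
rules = [r for n in range(1, 10) for r in leveln(n, OPS)]
```

The comprehension on the last line plays the role of List.concat (map leveln ...); changing the number of levels, or the shape of every level's rules, is a one-line edit.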

2.3 Abstract syntax generator

The abstract syntax generator represents the opposite end of the spectrum from the lexer and parser generators. There is no complicated algorithm used to generate C++ code from a specification. In a sense, it adds very little value, since the code it produces is very routine. By the same token, it was easy to write. In fact, the first version was written in a couple of hours; it has now been augmented several times, but probably no more than four hours has been spent on it all together. It is, in other words, a simple tool of a kind that we can imagine programmers building routinely for personal use. And yet, it is also remarkably useful. The purpose of this tool is to take specifications like this:

    genAbsSyn "AbsSyn"
        ["Expr", "Patt", "Input", "AnonClause", "Defn", "DefSeq"]
        ["as_constant"  oftype "LitValue" --> "Expr",
         "as_variable"  oftype "Token" --> "Expr",
         "as_primop"    oftype "Token" --> "Expr",
         "as_unapplic"  oftype "Token" ** "Expr" --> "Expr",
         "as_binapplic" oftype "Token" ** "Expr" ** "Expr" --> "Expr",
         ...

and generate C++ struct declarations like this:

    struct AbsSynNode;
    typedef AbsSynNode* AbsSyn;
    enum AbsSyncat { Expr, Patt, Input, AnonClause, Defn, DefSeq };
    enum AbsSynop {
        as_constant    // LitValue -> Expr
      , as_variable    // Token -> Expr
      , as_primop      // Token -> Expr
      , as_unapplic    // Token * Expr -> Expr
      , as_binapplic   // Token * Expr * Expr -> Expr
      ...
    };
    struct AbsSynNode {
        AbsSynop theASOp;
        union {
            struct { LitValue fld0; } _as_constant;
            struct { Token fld1; } _as_variable;
            struct { Token fld2; } _as_primop;
            struct { Token fld3; AbsSyn fld4; } _as_unapplic;
            struct { Token fld5; AbsSyn fld6; AbsSyn fld7; } _as_binapplic;
            ...
        };
        ...
    };
    bool AbsSynNode::getas_constant (LitValue& fld0);
    bool AbsSynNode::getas_variable (Token& fld1);
    bool AbsSynNode::getas_primop (Token& fld2);
    bool AbsSynNode::getas_unapplic (Token& fld3, AbsSyn& fld4);
    bool AbsSynNode::getas_binapplic (Token& fld5, AbsSyn& fld6, AbsSyn& fld7);
    ...

The functions generated also include constructors, a print operation (for debugging), and an overwrite operation to replace one node's contents by another's. Thus, this is a kind of low-rent version of the abstract syntax generator described in [16]. Although in principle it does very little, in practice it is extremely useful. We actually have two specifications in this language that we use: one for an abstract syntax which is fairly concrete (the parser actions shown above construct trees in that abstract syntax), and another for a more abstract syntax. Our language development process was marked by frequent changes to both of these abstract syntaxes, and the language saved us a great deal of bookkeeping. So, we consider this part of the development to have been a success as well.
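The essence of such a generator is a short text-emitting function. A minimal Python analogue (simplified: it emits only the tagged union, with none of the constructors, accessors, or category handling of the real tool):

```python
def gen_abs_syn(name, ops):
    """ops: list of (opname, argtypes). Emits a C-style tagged-struct sketch."""
    lines = [
        f"enum {name}op {{ " + ", ".join(op for op, _ in ops) + " };",
        f"struct {name}Node {{",
        f"    enum {name}op tag;",
        "    union {",
    ]
    for op, args in ops:
        # One anonymous struct per operator, fields numbered as in the paper.
        fields = " ".join(f"{t} fld{i};" for i, t in enumerate(args))
        lines.append(f"        struct {{ {fields} }} _{op};")
    lines += ["    } u;", "};"]
    return "\n".join(lines)

decl = gen_abs_syn("AbsSyn", [("as_constant", ["LitValue"]),
                              ("as_unapplic", ["Token", "AbsSyn*"])])
```

Even this toy version makes the bookkeeping point: renaming an operator or adding a field is a one-line change to the specification, not an edit to five places in the C++ output.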

2.4 Abstract syntax translation

The translation sublanguage is used to translate the more concrete syntax trees to the abstract ones. An example of its use is in these rules:

       cs_tldef_p (bar "defn") ==> trans "defn"
    || cs_tlexp_p (bar "exp") ==>
           as_def (as_pvar (quote "mkToken(1,\"_\")"))
                  (as_unapplic (quote "mkToken(1,\"print\")") (trans "exp"))

These two rules state that a "concrete" syntax tree containing a top-level definition should be translated to the translation of that definition, and that one containing a top-level expression e should be translated to the definition val _ = print e', where e' is the translation of e. They generate the C++ code:

    if (node->getcs_tldef(cs1)) {
        return trans(cs1);
    } else if (node->getcs_tlexp(cs1)) {
        return mkas_def(mkas_pvar(mkToken(1,"_")),
                        mkas_unapplic(mkToken(1,"print"), trans(cs1)));
    }
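The overall shape of the generated translator, dispatch on the node's operator and rebuild an abstract node, can be sketched in Python with trees as tuples; the tags and constructor names below are illustrative, not the generated C++ API:

```python
def trans(node):
    """Translate a 'concrete' tree (tag, *children) into the abstract syntax."""
    tag = node[0]
    if tag == "cs_tldef":                 # top-level definition: translate body
        return trans(node[1])
    if tag == "cs_tlexp":                 # top-level expression e
        # becomes: val _ = print e'
        return ("as_def", ("as_pvar", "_"),
                          ("as_unapplic", "print", trans(node[1])))
    if tag == "cs_const":                 # a literal passes through
        return ("as_constant", node[1])
    raise ValueError(f"no translation rule for {tag}")
```

Each if-branch corresponds to one rule of the sublanguage, just as each else-if branch does in the generated C++ above.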

This has been the least successful part of the experiment. There is no algorithm used to gain value (as in the lexer and parser generators), nor are the translation rules perfectly routine (as are the abstract syntax operators). Furthermore, this may be the program generator in which the syntactic awkwardness is most painful. One would like to write the above rules in terms of the initial concrete syntax, just as they are given in, say, the Haskell manual [11]. In this form, the second rule above might be:

    e  ==>  val _ = print e

and the first rule above would not be needed at all, since translation of sub-expressions is implicit. Obviously, our language is very far from this ideal.

3 Conclusions

As with any effort to create structure in programs, the payoff tends to be long-term. This project is still young, yet some advantages of using program generators, and more specifically functional program generators, can already be seen.

First, the syntax of our languages can be made as close to what one would like as it is only because of heavy use of higher-order functions. To give just one example, consider the abstract syntax specification "as_variable" oftype "Token" --> "Expr", which is one part of the specification given above. The code associated with this operator depends upon the set of categories in the abstract syntax. That is because Expr is not a type but a category, whereas Token is a type; thus, for example, the constructor for this operator has an argument of type Token and returns a value of type AbsSyn. It can only know whether a name refers to a type or a category by knowing the names of all the categories. That piece of information was given as an argument to the genAbsSyn function, but how does this phrase know it without being told? The answer is that the meaning of this phrase is a function from category names to C++ code. By using higher-order functions, this dependency is hidden; if it, and all such dependencies, had to be stated explicitly, the notation would become completely unusable.

Second, the use of functional languages makes the programming of these program generators easier. This is simply because the languages are very powerful. This fact would probably be widely acknowledged, but it is then usually followed by the admonition that they are inefficient and use too much memory. However, it is difficult to imagine a lexer or parser definition, or any other program generator specification, being so large that efficiency would be a serious concern.

Third, as we have mentioned several times, the underlying language can be useful.
We discussed above how we used ML to add regular expression operators to lexer specifications, and to create a stratified expression grammar with nine levels. In the long run, we expect this to be the most important advantage.

The disadvantages are clear: The syntax of our sublanguages is awkward, in some cases extremely so. When one makes an error in a specification, the error message comes from the

ML language processor, and is written, so to speak, in a foreign language. Our current research concerns ways in which we might ameliorate these problems.
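The first advantage above, that a specification phrase denotes a function from context to code, can be made concrete in a few lines of Python; the names and output format here are invented for illustration:

```python
# A spec phrase is a function: set of category names -> C++ fragment.
def oftype(opname, argtype, result):
    def phrase(categories):
        # An argument naming a category becomes the node type;
        # an argument naming a plain type is kept as-is.
        ctype = "AbsSyn" if argtype in categories else argtype
        return f"AbsSyn mk{opname}({ctype} x);   // {argtype} -> {result}"
    return phrase

phrase = oftype("as_variable", "Token", "Expr")
code = phrase({"Expr", "Patt"})   # the category set is supplied once, by the generator
```

The phrase itself never mentions the category set; the generator applies every phrase to it in one place, which is exactly the hidden dependency the higher-order encoding buys.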

Acknowledgments Chris Parrott contributed to programming the lexer speci cation. Matt Beckman provided assistance during the writing of this paper.

References

[1] S. Ahmed, D. Gelernter, Program builders as alternatives to high-level languages, Yale Univ. C.S. Dept. TR 887, November 1991.

[2] B. Balzer, N. Goldman, D. Wile, Rationale and Support for Domain Specific Languages, USC/Information Sciences Institute, available at http://www.isi.edu/software-sciences/dssa/dssls/dssls.html.

[3] J. Bell, F. Bellegarde, J. Hook, R.B. Kieburtz, A. Kotov, J. Lewis, L. McKinney, D.P. Oliva, T. Shear, L. Tong, L. Walton, and T. Zhou, Software design for reliability and reuse: A proof-of-concept demonstration, TRI-Ada '94.

[4] W. E. Carlson, P. Hudak, M. P. Jones, An experiment using Haskell to prototype "Geometric Region Servers" for navy command and control, Research Report YALEU/DCS/RR-1031, Yale Univ. C. S. Dept., May 1994.

[5] Emmanuel Chailloux, Ascander Suarez, mlPicTeX, a picture environment for LaTeX, in ACM SIGPLAN Workshop on Standard ML and its Applications, June 1994.

[6] C. Elliott, Modeling interactive 3D and multimedia animation with an embedded language, Proc. USENIX Conf. on Domain-Specific Languages, Santa Barbara, Oct. 1997, 285-296.

[7] C. Elliott, P. Hudak, Functional reactive animation, Proc. Intl. Conf. on Functional Programming, 1997.

[8] M. Frigo, S.G. Johnson, The Fastest Fourier Transform in the West, MIT Technical Report MIT-LCS-TR-728, Sept. 11, 1997.

[9] P. Hudak, Building domain-specific embedded languages, position paper for Workshop on Software Engineering and Programming Languages, Cambridge, MA, June 1996.

[10] Paul Hudak, Tom Makucevich, Syam Gadde, Bo Whong, Haskore music notation: An algebra of music, J. Func. Prog., to appear.

[11] Paul Hudak, Simon L. Peyton Jones, and Philip Wadler (eds.), Report on the Programming Language Haskell, A Non-strict Purely Functional Language (Version 1.2), SIGPLAN Notices 27(5), May 1992, Section R.

[12] S. Kamin, D. Hyatt, A special-purpose language for picture-drawing, Proc. USENIX Conf. on Domain-Specific Languages, Santa Barbara, Oct. 1997, 297-310.

[13] T. Sheard, N. Nelson, Type safe abstractions using program generators, Tech. Report 95-013, Oregon Graduate Institute, Computer Science Dept., 1995.

[14] Diomidis Spinellis, Implementing Haskell: Language implementation as a tool building exercise, Software: Concepts and Tools 14, 1993, 37-48.

[15] Standard ML of New Jersey User's Guide, Feb. 15, 1993, available at http://www.cs.princeton.edu/~appel/smlnj/.

[16] D. Wang, A. Appel, J. Korn, C. Serra, The Zephyr abstract syntax description language, Proc. USENIX Conf. on Domain-Specific Languages, Santa Barbara, Oct. 1997, 213-227.

[17] R. C. Waters, The Programmer's Apprentice: A session with KBEmacs, IEEE Trans. Software Eng. SE-11(11), 1296-1320, Nov. 1985.
