DML - A Meta-language and System for the Generation of Practical and Efficient Compilers from Denotational Specifications

Mikael Pettersson and Peter Fritzson
Department of Computer and Information Science, Linköping University
S-58183 Linköping, Sweden
Email: [email protected], [email protected]

Abstract

DML, the Denotational Meta Language, is a specification language and a compiler generation tool for producing practical and efficient compilers from denotational semantics specifications. This means that code emitted by generated compilers should be of product quality, that generated compilers should have reasonable compilation speed, and that they should interface well with standard front-ends and back-ends. To achieve this goal, the DML system contains two main contributions compared to previous work in this area: (1) a general algorithm for producing efficient quadruple code from continuation semantics of Algol-like languages, and (2) enhancements in the DML specification language with BNF rules for abstract syntax declarations and "semantic brackets" [| ... |] with in-line concrete syntax and pattern matching for readable and concise semantic equations. Generated quadruple code is fed into a standard optimizing back-end to obtain high-quality target code. The DML system generates efficient compilers in C, and contains a foreign language interface for communication, e.g. with parsers or optimizing back-ends. DML is a superset of Standard ML and uses applicative order semantics, i.e. call by value, for reasons of efficiency.

1. This work was supported by the Swedish Board for Technical Development (Nutec).

1 Introduction

Generating compilers from formal specifications of programming languages has long been a research goal in the compiler-writing community. Several efforts to generate compilers from denotational semantics specifications, starting with the SIS system by Peter Mosses 1979 [15], have resulted in compilers and code that run very slowly - usually a factor of 1000 slower than for commercial compilers - and that do not interface to commercial product-quality parsers or optimizing code generators. Another problem has been the poor modularization of denotational specifications, where high-level and low-level aspects, together with static and dynamic properties of language descriptions, are inter-mixed.

The situation has gradually improved through the work of several researchers, e.g. Sethi [24], Paulson [17], Raskovsky [21], Wand [28], [29], Appel [2], and Jouvelot [11], until the MESS system by Peter Lee [13] demonstrated the first realistic compiler generation system accepting denotational specifications. Peter Lee used Peter Mosses' action semantics [16] to separate high-level and low-level semantics (macro-semantics and micro-semantics). He also separated compile-time and dynamic aspects of the semantics to achieve better run-time efficiency.

However, the MESS system still has some drawbacks. It is monolithic and does not interface well with standard parser generators and code generators because of language incompatibilities. High-level semantic actions cannot be formally defined within the system. Micro-semantic specifications cannot be used as input to the automatic compiler-generation process - the code generator from the high-level abstract syntax intermediate form has to be implemented largely by hand.

In comparison to the MESS system, the DML system presented in this paper goes several steps further. It interfaces well with standard tools written in C-compatible languages and it can automatically generate a code generator for intermediate quadruple code. Also, the DML specification language supports readable and concise specifications through several language enhancements.

1.1 Background on Denotational Semantics

Denotational semantics is a formal method of specifying the meaning of programming language constructs in terms of basic mathematical objects: functions. Thus there is no modeling of computational processes as in operational approaches. Traditionally, these functions are expressed in the lambda calculus, since this is a mathematically well-defined notation for expressing function objects. However, other more readable notations are also possible, such as the DML syntax we are advocating in this paper.


In a denotational semantics specification, the meanings of programs or program constructs are expressed in terms of semantic functions, which map them into various semantic domains, the members of which are usually themselves functions. The main part of a denotational specification of the semantics of a subject language consists of a set of semantic equations which define the semantic functions. These equations are usually written as applying semantic functions to program segments shown in concrete syntax within "semantic brackets" on the left side of these equations, and mathematical function objects on the right side of the equations.
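For instance, a direct-style semantic equation for addition in this textbook notation (our illustration, not taken from the DML specification language) has the shape

    E[[ E1 + E2 ]]ρ = (E[[ E1 ]]ρ) + (E[[ E2 ]]ρ)

with the program fragment E1 + E2 in concrete syntax inside the brackets on the left, and a mathematical function of the environment ρ on the right.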

In the first part of this paper we present a method for generating efficient code from continuation-style denotational semantics. We claim that this form of semantics arises naturally in the definition of Algol-like languages, such as C. Once the language semantics has been simplified using a compile-time versus run-time distinction, we recognize two important properties of the resulting form: single-threading and label-freeness of closures. The first allows a conventional call-by-value execution with a single store. The second property is central to our work, as it allows bindings and closure creation to be replaced by assignments to uniquely named variables, i.e. quadruples. These quadruples are then passed on to a conventional code optimizer and generator. This method was first tested in a C++-based prototype by Pettersson [18], but has now been adapted and integrated into the DML system.

Some books that provide useful background information in the denotational semantics area are [25], [4], and [23].

1.2 Overview of the paper

The first part of this paper contains a brief overview of the method for generating efficient code for Algol-like languages, including a code generation example from a compiler generated for a tiny subset of C. The subset involves integer-only variables, recursive functions of one argument, assignment and conditional statements, loops and nested declarations. The second part of the paper concerns the DML system and specification language, including its implementation. An appendix contains a compiling semantics for TINY-C expressed in DML, from which a small compiler was generated.

[Figure 1. The contribution of the DML system to compiler generation. Diagram components: Source Text, Syntax Specification, Parser Generator, Parser, Syntax Trees, Static Semantics, Run-time Semantics, Semantics Processor Generator, Declaration Processor + IL generator, Quadruple IL Code, Machine Description, Code Generator Generator, Generation of final code, Machine Code.]

2 Generation of efficient code

2.1 Background

The usual approach for deriving a prototype compiler from a denotational definition results in code being generated for a stack machine, see Schmidt 1984 [23] and Paulson 1982 [17]. However, this has two drawbacks. First, one needs to actually derive a stack machine from the denotational equations, which may require a non-trivial amount of work, see Wand 1982 [28]. Second, the generated code is not in a form suitable for modern register-oriented machines.

Why then is the stack machine so popular? To see this, we will consider the case of generating code for an addition expression:

    R[[E1+E2]]ρκ = R[[E1]]ρ{λε1. R[[E2]]ρ{λε2. κ(ε1+ε2)}}

(Let us assume that the environment ρ and continuation κ are compile-time constants.) The central issue is how to communicate the value ε1 past the evaluation of the second expression to the addition operator, without using expensive run-time closures. A simple solution is to use a stack to send (sequences of) values around. Producing a value corresponds to a "push", and consuming a value is done by a "pop".

Our solution however is based on the following observation: whenever a value is produced, we know that there is a receiving continuation κ = λx.θ. Moreover, the text of this continuation is usually known at compile-time. So, the application of this continuation to a value ε β-reduces to θ[ε/x]. This is directly analogous to the imperative code x:=ε; θ. Applying this idea throughout results in a quadruple-style code generator. The code generation algorithm can then be formulated as these steps:

1. Calculate the meaning of the program, yielding a continuation-style λ-term.
2. α-convert this term so that all bindings have unique names (this allows us to dispense with run-time closures, as we show later).
3. Traverse the converted term, and produce quadruples as shown above.

The quadruples can then be fed into a conventional optimizing back-end for global optimization and register allocation, see Wall [27].

2.2 Related work on code generation

As mentioned in the introduction, we consider our approach of generating quadruples from continuation-style semantics to be quite different from the usual stack-based systems. Wand [28] proposes that closures be eliminated by introducing a family Dk of "argument-steering" combinators, similar to B = λf.λg.λx.f(gx). A disadvantage is that the resulting compiling semantics looks rather different from the original one.

Appel's work [2] is in some sense similar to ours. He too generates register transfers (or quadruples) instead of stack code. However, his system is built using very intimate knowledge about compile-time versus run-time entities, and many highly specific reduction rules. The code generated is similar in quality to ours, with many temporaries which must be optimized away by a code optimizer.

Most other systems appear to either be very general, relying on essentially interpreting λ-calculus (e.g. Paulson 1982 [17]), or very specialized, using lots of detailed knowledge about "standard semantics" (e.g. Raskovsky 1982 [21]).

Compilers such as Orbit [12], which use a continuation-passing intermediate form (CPS form), have to perform complicated escape analysis in order to eliminate unnecessary closures in certain contexts. The simple code generation method presented here can completely eliminate the use of closures in code from generated compilers for imperative languages, thus avoiding the need for escape analysis.

3 Compile-time versus run-time semantics

The dynamic semantics is in general quite cluttered with random details concerning three conceptually disjoint events: type checking, meaning calculation and execution.

The goal of the simplification step is to split these events into separate modules or passes. Type checking is done to certify that identifiers are properly declared, that expressions have the correct type depending upon their use, and so on. A type error discovered by the semantics causes Wrong to be applied and, conceptually, the execution to be terminated. However, as with all other languages that associate types with declarations and not only with values, most type checking can be done at compile-time. A compile-time entity is one that needs no access to the state or any input data for its value to be determined. In a conventional compiler, the environment would correspond to the symbol table and be a compile-time object.

Unfortunately, the environment carries two run-time types: the locations of declared variables, and the return continuations of functions. Looking first at the return κ's of function calls, we may notice their restricted use. On function entry, the value continuation κ is bound in the environment so that Return may find it. Thus all function bodies have the return point as a free variable. With a little operational intuition, we can see that a stack of return points can be added to the state, together with two simple primitives Call and Return to maintain it.

This leaves us with the environment containing run-time locations. In general, the compiler has to resort to being very intimate with the memory allocation mechanisms of the target machine, introducing notions such as stack offsets or frame pointers. While manageable, this is not an attractive solution due to the amount of detail involved. We have chosen not to follow this path, as it turns out that our handling of temporary bindings equally well takes care of the variable locations. Also, we do not have to assume a certain stack model of memory allocation.
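As an illustration only, the following Standard ML sketch shows what a state carrying a stack of return points could look like; the concrete types and the names call and return are our assumptions, not DML's actual runtime interface.

    (* Sketch under assumptions: the state carries the store and a stack of
       return points; Call pushes the caller's value continuation, Return pops it. *)
    datatype state =
      ST of { store : int list,                        (* the single-threaded store     *)
              rets  : (int -> state -> state) list }   (* stack of return continuations *)

    (* Call: push the caller's value continuation k, then run the function body. *)
    fun call (body : state -> state) (k : int -> state -> state) (ST {store, rets}) =
          body (ST {store = store, rets = k :: rets})

    (* Return: pop the topmost return point and hand it the result value. *)
    fun return (v : int) (ST {store, rets}) =
          case rets of
              k :: rest => k v (ST {store = store, rets = rest})
            | []        => ST {store = store, rets = rets}   (* return at top level: stop *)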

3.1 Simplified Interpretive Semantics

The simplified definition consists of the original definition, where the static semantics has been removed (i.e., no error checks), with the modification described above regarding the stack of return continuations. To simplify the next step, we have put the equations in a kind of "normal" form (not to be confused with the normal form of λ-terms). First, all primitive operators (like Fetch, Cond etc.) are uncurried, so all expression continuations are explicitly written as {λε.θ}. Also, for those cases where expression continuations are applied directly, we have introduced an explicit operator Literal = λε.λκ.κε. In essence, we have introduced a set of basic primitive operators such as Literal, Fetch, Update, Call and Return, which will simplify the generation of corresponding quadruple code.
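To make the primitives concrete, here is a minimal store-passing reading of some of them in plain Standard ML; the flat integer store and the exact types are assumptions made for this illustration, not the definitions used by DML.

    (* Sketch under assumptions: locations are indices into a flat store of integers. *)
    type store = int list
    type cont  = store -> store          (* command continuation  θ        *)
    type kont  = int -> cont             (* expression continuation {λε.θ} *)

    (* Literal = λε.λκ.κε : pass the constant straight to the continuation. *)
    fun literal (v : int) (k : kont) : cont = k v

    (* Fetch: read the location, pass the value on, thread the store through. *)
    fun fetch (loc : int) (k : kont) : cont =
          fn s => k (List.nth (s, loc)) s

    (* Update: overwrite the location and continue with the new store. *)
    fun update (loc : int) (v : int) (c : cont) : cont =
          fn s => c (List.take (s, loc) @ v :: List.drop (s, loc + 1))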

3.2 The need for name translation

Now that we have a simplified interpretive semantics, it is time to recall our original intuition: to recognize that {λx.θ}ε β-reduces to θ[ε/x], and to transform this to x:=ε; θ. A key element here is that we want to dispense with run-time closures. We have to be careful, though, so that we do not introduce name conflicts. Given the definitions so far, the denotation for 1+(2+3) is:

    λρ.λκ. R[[1]]ρ{λε1. (λκ. R[[2]]ρ{λε1. R[[3]]ρ{λε2. κ(ε1+ε2)}}) {λε2. κ(ε1+ε2)}}

which, given an environment and a continuation λx.θ, we would transform into:

    ε1 := 1; ε1 := 2; ε2 := 3; ε2 := ε1 + ε2; x := ε1 + ε2; θ

This is incorrect, since variables are overwritten over and over again. If we assume that all λ-terms are α-converted to have unique bound names, then this problem is eliminated.
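The unique names can be produced by a simple name supply. The appendix uses a function gensym() for this; the counter-based implementation below is our own sketch of it, not the DML library code.

    (* Sketch: a counter-based name supply giving every binding a fresh temporary. *)
    local
      val counter = ref 0
    in
      fun gensym () = (counter := !counter + 1; "temp" ^ Int.toString (!counter))
    end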

3.3 Proof Outline

Here we will show that run-time closures can indeed be dispensed with if all bound temporaries have unique names. Our informal proof is done in two steps. First we show the result for programs without user-defined functions, i.e. when the program is just a command sequence. Then we discuss why functions complicate matters, and a simple and effective solution which handles that case as well.

Part 1 - No Functions

It is convenient to have these definitions handy:

• an open term is a λ-term of type κ (value continuation) or θ (command continuation) with free variables; otherwise it is closed
• a term is labeled if there is more than one reference to it; otherwise it is label-free
• if the term B is a continuation to a term A, then A is a predecessor of B
• the use of a name is sound if it is preceded by the corresponding definition (binding, assignment); otherwise it is unsound

Our correctness criterion can then be stated as: all uses of variables must be sound. Intuitively, open terms need a run-time environment to interpret their free variables. If we see the control flow as a directed graph, then a node (term) is labeled if there are multiple arcs pointing to it. In linearized code, all but one of these arcs must be expressed as a goto to the label.

Consider an open, label-free term x. Since it is label-free, it has at most one predecessor y. If y is closed, then it must bind the free variables of x, and thus all uses in x are sound. Otherwise, if y is open, then it can be either labeled or label-free. If it is labeled, then uses in y (and x) can be unsound. If it is label-free, then this argument applies again, and the uses are sound. The upshot of this reasoning is that the correctness criterion is fulfilled if: all open terms are label-free.

So let us first examine all contexts in the revised semantics (see appendix) where some term might be labeled.

Semantic Equations: The if command uses its continuation twice, since it is the continuation of both the true and false branches. So, command denotations (not continuations θ in general) may be labeled. The denotation of the while command is labeled since it refers back to itself, while at the same time being referred to by its predecessor. The only other bound variables that have multiple uses are the environment arguments ρ, but they are compile-time only objects.

Now that we know that only command denotations can be labeled, we need to know if they have free variables. The answer is both yes and no. No, because each right-hand side of the semantic equations mentions only bound variables (i.e., all denotations are closed). Yes, because the locations of declared (in TINY-C) variables are bound in the environment; if we dispense with the environment then these locations become free in the commands and expressions in their scope. However, here we are saved by the structured nature of our language. While commands may be labeled, and also have locations as free variables, the restrictions on where jumps may go ensure that uses are sound anyway. Since jumps may only occur within the same scope level and never into a more deeply nested scope, we see that all uses are indeed preceded by their definitions. A quick look at the equations for if and while reveals that all commands involved execute in the same scope, so jumps here cannot cause unsound references to declared locations.

Temporaries: The natural implementation of temporaries like those in quadruples is to put them in the state - they are global anyway. We can add a third component Env = Ide -> EVal to keep track of defined temporaries. Since all uses are sound, this is a total function. Also, since all names are unique and the component is part of the single-threaded state, it can be efficiently implemented as an array indexed by the names of the temporaries.

Functions: Functions pose a problem because they may call themselves. A recursive call to a function causes it to be entered again, overwriting the live temporaries. On return, the temporaries in the first invocation of the function will have incorrect values. Apparently, we need infinitely many copies of the function, all with unique names for their temporaries. A modification to the state again takes care of this problem. The second component, the return points, is extended to also contain the values the temporaries in the called function had on entry. On return, these temporaries are restored, i.e. a callee-saves protocol.

3.4 Target Representation

Now that we have shown the soundness of our approach, it is time to define a concrete representation to be used by the compiling semantics. Since our idea of quadruples was expressed by syntactic transformations of λ-terms, we represent them by a concrete datatype of Λ-terms that is isomorphic to the text of the λ-terms in the simplified interpretive semantics. Expression continuations {λε.θ} are written (ΛI.θ). Command continuations are written as before, but with references to values ε replaced by identifiers I. It is assumed that the compiling semantics, when it creates new terms with bindings, allocates new unique names for these bindings. Also, the compile-time environment ρ is changed to bind the temporary assigned to hold a location, rather than the location itself. The representation of functions now includes a list of its temporaries so that the callee-saves protocol can be used. The compiler specification for TINY-C is given in the appendix.
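The paper does not spell out the CPSAlgol datatype itself, but its shape can be read off from the constructors used by the code generator and the appendix. The following Standard ML declaration is a sketch under that assumption; function denotations are simplified to plain symbols here, and the mark and reference-count fields needed by the code generator are omitted.

    type Sym = string
    datatype Opr = ADD | SUB | MUL | DIV | EQ | GE | NE | INDEX

    datatype ContRep =                            (* command continuations θ *)
        Halt
      | New     of KontRep                        (* I := new (a fresh location) *)
      | NewVec  of int * KontRep                  (* allocate a vector           *)
      | Cond    of Sym * ContRep * ContRep        (* two-way branch              *)
      | Literal of int * KontRep                  (* I := constant               *)
      | Binary  of Opr * Sym * Sym * KontRep      (* I3 := I1 op I2              *)
      | Fetch   of Sym * KontRep                  (* I2 := mem[I1]               *)
      | Update  of Sym * Sym * ContRep            (* mem[I1] := I2               *)
      | Call    of Sym * Sym list * KontRep       (* I := f(args)                *)
      | Return  of Sym
    and KontRep = Lambda of Sym * ContRep         (* expression continuations (ΛI.θ) *)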

3.5 The Code Generation Algorithm

Given a (possibly) circular Λ-term (i.e. an intermediate representation of the program), the algorithm below translates it to sequences of assignments, simple arithmetic, tests and jumps. We formulate the algorithm for code generation of a command denotation, since details about how functions and temporaries are declared in the quadruple language are ignored here. The input term is assumed to meet the following requirements:

• Each node is "unmarked".
• Each node has a reference count field containing the number of references to the node.
• The "false" branch of Cond nodes is assumed to have a reference count greater than one (this is just to force a label definition for the node, and to simplify code generation).

Quadruples are written as [text]; they should be self-explanatory.

    Gen(node) {
      if the node is marked, then
        output [goto L], where L is the label of the node, and return
      endif
      mark the node
      if the node's reference count > 1, then
        allocate a new label L, and store it in the node
        output [label L:]
      endif
      dispatch on the structure of the node:
        [Halt]                     => output [halt].
        [New (ΛI.θ)]               => output [I := new], then Gen(θ).
        [Cond I θ1 θ2]             => output [if not I then goto L], where L is the
                                      label of θ2, call Gen(θ1), then Gen(θ2).
        [Literal ν (ΛI.θ)]         => output [I := ν], then Gen(θ).
        [Call φ I1 (ΛI2.θ)]        => output [I2 := φ(I1)], then Gen(θ).
        [Return I]                 => output [return I].
        [Binary + I1 I2 (ΛI3.θ)]   => output [I3 := I1 + I2], then Gen(θ).
                                      (similarly for the other binary operators)
        [Update I1 I2 θ]           => output [mem[I1] := I2], then Gen(θ).
        [Fetch I1 (ΛI2.θ)]         => output [I2 := mem[I1]], then Gen(θ).
      end dispatch
    }
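For concreteness, the straight-line part of Gen can be rendered in Standard ML over the ContRep sketch given in Section 3.4 above. This is an illustration of the quadruple emission only: the marking, reference counting and label generation that handle shared and circular nodes (if, while) are deliberately left out.

    (* Simplified sketch: emit quadruples for straight-line terms only. *)
    fun oprName ADD = "+"  | oprName SUB = "-"  | oprName MUL = "*"
      | oprName DIV = "/"  | oprName EQ  = "==" | oprName GE  = ">="
      | oprName NE  = "!=" | oprName INDEX = "index"

    fun emit line = print (line ^ "\n")

    fun gen Halt                             = emit "halt"
      | gen (New (Lambda (i, t)))            = (emit (i ^ " := new"); gen t)
      | gen (NewVec (n, Lambda (i, t)))      = (emit (i ^ " := new[" ^ Int.toString n ^ "]"); gen t)
      | gen (Literal (v, Lambda (i, t)))     = (emit (i ^ " := " ^ Int.toString v); gen t)
      | gen (Binary (p, i1, i2, Lambda (i3, t))) =
          (emit (i3 ^ " := " ^ i1 ^ " " ^ oprName p ^ " " ^ i2); gen t)
      | gen (Fetch (i1, Lambda (i2, t)))     = (emit (i2 ^ " := mem[" ^ i1 ^ "]"); gen t)
      | gen (Update (i1, i2, t))             = (emit ("mem[" ^ i1 ^ "] := " ^ i2); gen t)
      | gen (Call (f, args, Lambda (i, t)))  =
          (emit (i ^ " := " ^ f ^ "(" ^ String.concatWith ", " args ^ ")"); gen t)
      | gen (Return i)                       = emit ("return " ^ i)
      | gen (Cond _)                         = raise Fail "Cond needs the label machinery of the full algorithm"

Running gen on a small straight-line term prints one quadruple per line, in execution order.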

3.6 A Code Generation Example

The quadruples emitted by the generated compilers are written as statements in C, and then compiled by an optimizing C compiler back-end (we use GCC by Stallman [Stallman89]). The final code quality is close to what an optimizing compiler would have generated. Due to the restricted nature of our language, large applications have not been compiled or benchmarked. But given that quadruples are a well-known technique in the compiler community, we see no reason to expect any problems in extending this work to full Algol-like languages. At the time of this writing, a more extensive evaluation of the code generation capabilities of DML-generated compilers has not yet been done, primarily because we have not yet written DML specifications for larger Algol-like languages. Note, however, that TINY-C includes all control structures normally found in such languages, in addition to recursive functions and several types.

Small C example, the function fac:

    int fac(int n)
    {
        if (n==0) return 1;
        else return n * fac(n-1);
    }

MC68020 code, after feeding the quadruples through an optimizing back-end:

    _fac:   movel   d2,sp@-
            movel   sp@(8),d2
            seq     d0
            btst    #0,d0
            jeq     L3
            moveq   #1,d0
            jra     L1
    L3:     movel   d2,d1
            subql   #1,d1
            movel   d1,sp@-
            jbsr    _fac
            mulsl   d2,d0
            addqw   #4,sp
    L1:     movel   sp@+,d2
            rts

Quadruples in C syntax emitted by the generated compiler when compiling fac:

    int fac(int temp1)
    {
        int temp2, temp3, temp4;
        int temp5, temp6, temp7;
        int temp8, temp9, temp10;
        int temp11;
        temp2 = temp1;
        temp3 = 0;
        temp4 = temp2 == temp3;
        if (! temp4) goto L1;
        temp5 = 1;
        return temp5;
    L1: temp6 = temp1;
        temp7 = temp1;
        temp8 = 1;
        temp9 = temp7 - temp8;
        temp10 = fac(temp9);
        temp11 = temp6 * temp10;
        return temp11;
    }

Figure 3. Example of high code quality from a generated compiler.

4 An overview of DML

DML contains all of Standard ML (SML) Harper 1986 [8], making it a higher-order, mostly functional language with a powerful polymorphic type system and a module system for type-safe separate compilation. Here we will look only at the DML-specific extensions.

4.1 Syntax definitions

A common way to specify the type of syntax trees in texts on denotational semantics is to use BNF rules like this:

    C : Con
    I : Ide
    E : Exp
    E ::= C | I | E + E

and then refer to syntax tree objects by writing their in-line syntax between "semantic brackets":

    eval [[ C ]] env = C
    eval [[ I ]] env = env I
    eval [[ E1 + E2 ]] env = (eval E1 env) + (eval E2 env)

The definition-by-syntax facility is really just a shorthand, eliminating the need to explicitly deal with Cartesian products, disjoint unions and their appropriate injections, projections and tag tests. Without this shorthand, the example would have to be written as:

    type Exp = Con + Ide + AddExp
    type AddExp = Exp * Exp
    eval exp = if exp ∈ Con then exp|Con
               else if exp ∈ Ide then env(exp|Ide)
               else (eval (exp|AddExp)↓1 env) + (eval (exp|AddExp)↓2 env)

where exp ∈ Con tests whether exp comes from the Con summand of the disjoint union, exp|Con (projection) removes the tag to yield the actual constant, and exp↓i selects the i:th component of a tuple. To create these objects, injections are used: con in Exp converts the constant con to an Exp by tagging it with the Con tag.

In DML we take the same view: syntactically defined types provide a convenient notation for ordinary types defined with datatype declarations. The example could be expressed in DML as follows:

    type Con = ...
    type Ide = ...
    syntax Exp = Con | Ide | Exp "+" Exp

    fun   eval [| c |] env = c
        | eval [| i |] env = env i
        | eval [| e1 "+" e2 |] env = (eval e1 env) + (eval e2 env)

(note: string constants are used to denote keywords). This can be understood by preprocessing it to vanilla SML like this:

    datatype Exp = ConExp of Con
                 | IdeExp of Ide
                 | AddExp of Exp * Exp

    fun   eval (ConExp c) env = c
        | eval (IdeExp i) env = env i
        | eval (AddExp(e1, e2)) env = (eval e1 env) + (eval e2 env)

The transformation is non-trivial. In this example, the "pure" SML version wasn't too ugly, but more complex types and patterns (especially nested ones) are clearly more conveniently handled using grammar rules. There are two basic reasons for preferring syntactic type definitions. First, we do not have to cast all expressions and patterns into the prefix or binary infix syntax of SML; instead we can use (almost) any meaningful notation we wish. Second, grammars allow us to make "unit productions" (type conversions) implicit. Consider:

    (* DML version *)
    fun simplify [| e "+" 0 |] = e
      | ...

    (* SML version *)
    fun simplify (AddExp(e, ConExp 0)) = e
      | ...

Here DML enables us to make the mapping from integer constants to Exp:s implicit, whereas SML wants us to spell out all the gory details.

4.2 Related work on in-line syntax

The work by Aasa [1] describes an approach to inline syntax, which in its current form only applies to the internal structure of tokens and is very inefficient. Lee [13] has a very simple approach to implementing in-line syntax within semantic brackets: everything within the bracket is simply concatenated to a string which is then matched to similar strings through a sequential string comparison. This disallows patterns with different variable names but similar structure, adding needed type information to patterns, and patterns with nested structure. All of these are handled by the DML syntax extension.

4.3 Handling recursion and loops

Another idiom of denotational semantics is the use of fixed-point equations or rec declarations to create circular or infinite objects, e.g.

    C[[while Exp do Cmd]]env cont =
        fix(λcont'. E[[Exp]]env {λx. if x then C[[Cmd]]env cont' else cont})

(here the braces act to parenthesize a continuation). Since DML, for reasons of efficiency, uses call-by-value semantics, only simple declarations of recursive lambda expressions (fn-expressions) are allowed. We also observe that for typical denotational specifications of procedural languages, constructed λ-expressions, i.e. denotations, can be treated as an abstract data type, since once a term has been constructed we do not access its components. This allows us to implement the fixed-point operators of DML by inserting an invisible indirection node, INDIR, in the datatype which represents λ-terms. Denotations of recursion and similar constructs are represented by circular data structures - the fixed-point operator will make the indirection node point back to itself during interpretation / code generation.

    fix(f) = let val r = INDIR(ref ...)   (* initially bottom *)
                 val result = f(r)        (* always OK, since the object is an ADT *)
             in r := result; result
             end

In DML, this is used to implement the fixed-point operators for command continuations (used for loops), and function denotations - see appendix. The strong typing of DML necessitates two fixed-point operators, fix_cont and fix_func.
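The knot-tying through an indirection node can be sketched in ordinary Standard ML as follows; the term datatype and the option-ref representation are assumptions made for this illustration, not the actual DML runtime.

    (* Sketch: tying a recursive knot through an invisible indirection node. *)
    datatype term = HALT
                  | SEQ   of string * term        (* one quadruple, then its continuation *)
                  | INDIR of term option ref      (* the indirection node                 *)

    fun fix_cont (f : term -> term) : term =
      let val hole = ref NONE        (* initially "bottom" *)
          val knot = INDIR hole
          val body = f knot          (* safe: f embeds knot but never inspects it *)
      in  hole := SOME body;         (* the denotation now refers back to itself  *)
          body
      end

    (* Example: the endless loop  L: x := 1; goto L  becomes a circular term. *)
    val loop = fix_cont (fn back => SEQ ("x := 1", back))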

4.4 Interfacing with other languages

It is very important for a general tool to be able to communicate with modules written in other languages. At the simplest level this reduces to being able to call procedures written in, say, C. DML does this by extending the declaration syntax as the following example shows:

    import "C" write_char : int -> unit = "dml_putchar"

(unit is an SML type with only one element; it is used roughly as C programmers use the void type.) Calls to write_char will then be routed to the C function dml_putchar instead. The C code can access the DML arguments using macros supplied in a header file:

    #include
    void dml_putchar(DMLOBJ x)
    {
        putchar( DML2INT(x) );
    }

To enable calls in the other direction, an export declaration is used:

    export "C" meaning = "dml_calculate_meaning"

making the DML function meaning visible from C code under the name dml_calculate_meaning. DML also allows registering garbage-collection-time call-back functions from C, to protect live DML objects allocated from C.

5 The implementation of DML

For the initial prototype, we have chosen to use Scheme [22] as the implementation language. This is partly because of personal preferences, but mainly due to Scheme’s expressiveness and the existence of efficient, garbage-collected implementations. The overall structure of the DML compiler is to parse the input DML specification, check its static semantics, rewrite higher-level features into simpler ones (e.g. syntactic expressions, pattern-matching constructs) and finally transform the simplified internal form to Scheme. The resulting Scheme code is then compiled to C using Joel Bartlett’s optimizing Scheme->C compiler [5].

[Figure 4: Bootstrapping the DML compiler generation tool. The DML-to-C compiler: DML code -> DML Parser and Type Checker -> DML-to-Scheme translator -> Scheme code -> Bartlett's Scheme-to-C compiler -> C code. Finally: specify DML in itself and bootstrap! DML code -> DML-to-C compiler -> C code.]

5.1 DML syntax extensions

The definition-by-syntax facility is problematical. The difficulty is that to properly parse a sentence between "semantic brackets" [| ... |] the parser must have access to type information for the whole sentence itself and all its sub-expressions (or -patterns). Consider the following example:

    syntax foo = int | string
    syntax fie = foo | ...
    syntax fum = int | ...
    fun bar [| x |] = ...
      | ...

Without type information, the first pattern in bar's definition is ambiguous: it might be a foo, a fie or a fum. Suppose now that the rest of bar's definition constrains the type of x to be a string. This eliminates fum, leaving foo and fie as possible candidates. If the whole pattern has been constrained to be a fie, then the ambiguity disappears, and the derivation must be: fie => foo => string.

To handle these problems, we have chosen the following approach for implementing this syntactic extension. During parsing of DML specifications, syntax declarations are compiled into an incremental LR(1) parser internal to the compiler. The parser generator is based on the description by Heering, Klint and Rekers 1989 [10]. This method starts multiple parses in parallel when needed. Only the correct parse compatible with context and type constraints is finally selected. The parsing method has the power to parse LR(1) languages, with an efficiency within a factor of two of normal LALR(1) parsing. The implementation of this method and its integration with type analysis is described by Pettersson in [20].

5.2 Static semantics checking of DML specifications

The static semantics of DML is almost identical to SML's, the exception being the treatment of syntax declarations and syntactic expressions (patterns). The actual type checker is structurally very close to the formal definition of SML by Harper, Milner and Tofte 1989 [9], but some implementation details, for instance the treatment of polymorphic types in let declarations, are based on the work by Cardelli 1987 [7].

5.3 Intermediate form transformations

The intermediate representation of a DML specification needs further processing before it is ready to be compiled to Scheme. Most importantly, pattern-matching constructs must be transformed to the appropriate tests and component extractions. The first implementation of DML used the traditional method of compiling pattern-matching constructs, which is well described by Augustsson 1985 [3] and Wadler 1987 [26]. Recently, the DML implementation was improved by replacing this with a new, more efficient pattern-match compilation method by Pettersson [19].
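As a small illustration of what pattern-match compilation produces (our own example, not taken from the DML sources), a function with constructor patterns is rewritten into an explicit case dispatch with component extractions:

    datatype tree = Leaf | Node of tree * tree

    (* Source form, with constructor patterns in the clauses. *)
    fun depth Leaf          = 0
      | depth (Node (l, r)) = 1 + Int.max (depth l, depth r)

    (* Roughly what a match compiler produces: one test per constructor,
       followed by explicit extraction of the components. *)
    fun depth' t =
      case t of
          Leaf    => 0
        | Node lr => 1 + Int.max (depth' (#1 lr), depth' (#2 lr))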

5.4 Compiler Generation

When the transformations above have been applied, code generation to Scheme is a simple matter of un-parsing the intermediate form. Issues regarding efficient data representation are not dealt with in our prototype compiler (see Cardelli 1984 [6] and Leroy 1990 [14] for discussions of those problems). The Scheme code is finally optimized and compiled by the Scheme-to-C compiler. To compensate for the lack of inter-procedural tail-recursion removal in certain C compilers, we have designed machine-dependent routines to perform this optimization, currently for Sun3:s and Sun4:s.

6 Conclusions

We believe that higher-level, mostly functional specifications should be used when implementing suitable phases of compilers or other programming-language-oriented applications. The lack of proper specification languages and implementations has so far prevented this. We think that the DML system, including the method for generating efficient code presented in this paper, has the necessary qualities to make it as useful as Yacc and other classical tools.

7 References

[1] Annika Aasa, Kent Petersson, and Dan Synek. Concrete Syntax for Data Objects in Functional Languages. In Proc. of the 1988 ACM Conference on LISP and Functional Programming, pages 96-105, Snowbird, Utah, July 25-27, 1988.
[2] Andrew W. Appel. Compile-time Evaluation and Code Generation for Semantics-directed Compilers. PhD thesis, Carnegie-Mellon University, 1985.
[3] Lennart Augustsson. Compiling pattern-matching. In Proceedings of the 1985 Conference on Functional Programming Languages and Computer Architecture, LNCS-201. Springer-Verlag, 1985.
[4] H.P. Barendregt. The Lambda Calculus, Its Syntax and Semantics. North-Holland, 1984. Revised edition.
[5] Joel Bartlett. Scheme to C, a portable Scheme-to-C compiler. Technical Report 89/1, DEC Western Research Lab, 1989.
[6] Luca Cardelli. Compiling a functional language. In Proceedings of the ACM Conference on Lisp and Functional Programming, 1984.
[7] Luca Cardelli. Basic polymorphic typechecking. Science of Computer Programming, (8), 1987. (An earlier version was published as Bell Labs CSTR-112.)
[8] Robert Harper. Introduction to Standard ML. Technical Report ECS-LFCS-86-14, Department of Computer Science, University of Edinburgh, 1986.
[9] Robert Harper, Robin Milner, and Mads Tofte. The Definition of Standard ML, Version 4. The MIT Press, 1990.
[10] Jan Heering, Paul Klint, and Jan Rekers. Incremental generation of parsers. In Proceedings of the SIGPLAN'89 Conference on Programming Language Design and Implementation, SIGPLAN Notices, Vol. 24, No. 7, July 1989.
[11] Pierre Jouvelot. Designing new languages or new language manipulation systems using ML. SIGPLAN Notices, 21:40-52, August 1986.
[12] David A. Krantz. Orbit - An optimizing compiler for Scheme. PhD thesis, Yale University, February 1988. YALEU/DCS/RR632.
[13] Peter Lee. Realistic Compiler Generation. PhD thesis, University of Michigan, 1987. Published by The MIT Press, 1989.
[14] Xavier Leroy. Efficient data representation in polymorphic languages. In Proceedings of the 1990 Workshop on Programming Language Implementation and Logic Programming (PLILP'90), LNCS-456. Springer-Verlag, 1990.
[15] Peter D. Mosses. SIS - Semantic Implementation System. PhD thesis, Aarhus University, 1979. TR DAIMI MD-30.
[16] Peter D. Mosses. Unified algebras and action semantics. Technical Report DAIMI PB-272, Aarhus University, December 1988.
[17] L. Paulson. A semantics-directed compiler generator. In Proceedings of the 9th ACM Conference on Principles of Programming Languages, pages 224-233, 1982.
[18] Mikael Pettersson. Generating efficient code from continuation semantics. In Proceedings of the 1990 Workshop on Compiler-Compilers, LNCS-477, Schwerin, Germany, 1990. Springer-Verlag.
[19] Mikael Pettersson. A Term Pattern-Match Compiler Inspired by Finite Automata Theory. Technical report, Department of Computer and Information Science, 1992.
[20] Mikael Pettersson. DML - A Denotational Meta Language and System (provisional title). Technical report, Department of Computer and Information Science, 1992. Licentiate thesis.
[21] Martin R. Raskovsky. Denotational semantics as a specification of code generators. In Proceedings of the ACM SIGPLAN '82 Conference on Compiler Construction, pages 230-244, 1982.
[22] Jonathan Rees and William Clinger, editors. Revised³ Report on the Algorithmic Language Scheme. SIGPLAN Notices, Vol. 21, No. 12, 1986.
[23] David A. Schmidt. An implementation from a direct semantics definition. In N.D. Jones, editor, Proc. Workshop on Programs as Data Objects, LNCS-217. Springer-Verlag, Berlin, 1985.
[24] Ravi Sethi. Control flow aspects of semantics directed compiling. Technical Report CSTR-98, Bell Labs, 1981.
[25] Joseph E. Stoy. Denotational Semantics. MIT Press, 1977.
[26] Philip Wadler. Efficient compilation of pattern-matching. In S. Peyton Jones, The Implementation of Functional Programming Languages. Prentice-Hall, 1987.
[27] David A. Wall and Michael L. Powell. The Mahler experience: Using an intermediate language as the machine description. Technical Report 87/1, DECWRL, 1987.
[28] Mitchell Wand. Semantics-directed machine architecture. In Proceedings of the 1982 ACM Conference on Lisp and Functional Programming, 1982.
[29] Mitchell Wand. A semantic prototyping system. In Proc. ACM SIGPLAN '84 Compiler Construction Conference, pages 213-222, 1984.

8 Appendix: A DML-style compiler to intermediate code for TINY-C

(* DML-style compiler for TINY-C.
 * Supports scalar and vector variables, and
 * recursive functions of several arguments. *)

open CPSAlgol                    (* Import the code gen module *)

(* Abstract Syntax *)

type Ide = string

syntax Op  = "+" | "-" | "*" | "/" | "==" | ">=" | "!="

syntax Exp = Exp Op Exp | Ide | int
           | Ide "(" Args ")" | Ide "[" Exp "]"
and Args   = "nil" | Exp "," Args          (* should be: Exp list *)

datatype Formals = FORMALS of Ide list     (* impl. restriction in current DML *)

syntax Cmd = Exp ":=" Exp
           | "if" Exp "then" Cmd "else" Cmd "fi"
           | "skip"
           | "while" Exp "do" Cmd "od"
           | Cmd ";" Cmd
           | Dec ";" Cmd
           | "return" Exp
and Dec    = "int" Ide "(" Formals ")" Cmd
           | Dec ";" Dec
           | "int" Ide
           | "int" Ide "[" int "]"

(* Semantic Domains *)

type LocRep = Sym
type VecRep = Sym

datatype DVal = LocRepDVal of LocRep
              | FuncRepDVal of FuncRep
              | VoidDVal
              | VecRepDVal of VecRep

type Env   = Ide -> DVal
type Dcont = Env -> ContRep

(* Auxiliary Functions *)

fun op2opr [| "+"  |] = ADD
  | op2opr [| "-"  |] = SUB
  | op2opr [| "*"  |] = MUL
  | op2opr [| "/"  |] = DIV
  | op2opr [| "==" |] = EQ
  | op2opr [| ">=" |] = GE
  | op2opr [| "!=" |] = NE

(* prelude : Env *)
.....
(* poor substitute for rho[val/key] syntax, use (key --> val)rho instead *)
.....
(* Unspecified : Int *)
......
(* tieargs : Ide list -> Sym list -> Env -> Env *)
..
exception Wrong
.........

(* Semantic Equations *)

(* expression R-values *)
(* R : Exp -> Env -> KontRep -> ContRep *)
fun R ([| (num:int) |]:Exp) env kont = Literal(num, kont)
  | R [| (var:Ide) |] env kont =
      (case env var of
           (LocRepDVal loc) => Fetch(loc, kont)
         | _ => wrong(var ^ " isn't a variable"))
  | R [| E1 opr E2 |] env kont =
      let val x = gensym() and y = gensym() and opr = op2opr opr
      in R E1 env (Lambda(x,
           R E2 env (Lambda(y,
             Binary(opr, x, y, kont)))))
      end
  | R [| ide "(" args ")" |] env kont =
      (case env ide of
           (FuncRepDVal phi) => A args env (fn args' => Call(phi, args', kont))
         | _ => wrong(ide ^ " isn't a function"))
  | R [| var "[" E "]" |] env kont =
      (case env var of
           (VecRepDVal sym) =>
             let val x = gensym() and y = gensym()
             in R E env (Lambda(x,
                  Binary(INDEX, sym, x, Lambda(y, Fetch(y, kont)))))
             end
         | _ => wrong(var ^ " isn't a vector"))

(* argument R-values *)
(* A : Args -> Env -> (Sym list -> ContRep) -> ContRep *)
and A [| "nil" |] env acont = acont []
  | A [| exp "," exps |] env acont =
      let val x = gensym()
      in R exp env (Lambda(x, A exps env (fn xs => acont (x::xs))))
      end

(* expression L-values *)
(* L : Exp -> Env -> (Sym -> ContRep) -> ContRep *)
fun L [| (var:Ide) |] env lkont =
      (case env var of
           (LocRepDVal sym) => lkont sym
         | _ => wrong(var ^ " isn't a variable"))
  | L [| var "[" E "]" |] env lkont =
      (case env var of
           (VecRepDVal sym) =>
             let val x = gensym() and y = gensym()
             in R E env (Lambda(x,
                  Binary(INDEX, sym, x, Lambda(y, lkont y))))
             end
         | _ => wrong(var ^ " isn't a vector"))
  | L _ _ _ = wrong "not an L-value"

(* commands *)
(* P : Cmd -> Env -> ContRep -> ContRep *)
fun P [| lhs ":=" rhs |] env cont =
      let val x = gensym()
      in R rhs env (Lambda(x, L lhs env (fn y => Update(y, x, cont))))
      end
  | P [| "if" E "then" C1 "else" C2 "fi" |] env cont =
      let val x = gensym()
      in R E env (Lambda(x, Cond(x, P C1 env cont, P C2 env cont)))
      end
  | P [| "while" E "do" C "od" |] env cont =
      let val x = gensym()
      in fix_cont(fn cont' =>
           R E env (Lambda(x, Cond(x, P C env cont', cont))))
      end
  | P [| "skip" |] env cont = cont
  | P [| "return" E |] env cont =
      let val x = gensym()
      in R E env (Lambda(x, Return x))
      end
  | P [| C1 ";" C2 |] env cont = P C1 env (P C2 env cont)
  | P [| dec ";" C |] env cont = D dec env (fn env' => P C env' cont)

(* declarations *)
(* D : Dec -> Env -> Dcont -> ContRep *)
and D [| "int" name "(" (FORMALS args) ")" cmd |] env dcont =
      let val z = gensym()
          val cont = Literal(Unspecified, Lambda(z, Return z))
          val args' = map (fn _ => gensym()) args
          val phi = fix_func(args', fn phi' =>
                      let val env'  = tieargs args args' env
                          val env'' = (name --> FuncRepDVal phi')env'
                      in P cmd env'' cont
                      end)
      in dcont((name --> FuncRepDVal phi)env)
      end
  | D [| "int" var |] env dcont =
      let val x = gensym()
      in New(Lambda(x, dcont((var --> LocRepDVal x)env)))
      end
  | D [| "int" var "[" size "]" |] env dcont =
      let val x = gensym()
      in NewVec(size, Lambda(x, dcont((var --> VecRepDVal x)env)))
      end
  | D [| (dec1:Dec) ";" (dec2:Dec) |] env dcont =
      D dec1 env (fn env' => D dec2 env' dcont)

(* main program *)
(* M : Cmd -> ContRep *)
fun M prog = P prog prelude (Halt())