An Object-oriented and Generic Compiler Generator
Michael Pitzer and Heinz Dobler
University of Applied Sciences, Hauptstr. 117, 4232 Hagenberg, Austria
{michael.pitzer, heinz.dobler}@fh-hagenberg.at
Abstract: Object-oriented software development has become the de-facto standard programming paradigm in modern software systems. In addition, genericity has grown more popular since Java and C# were extended with it. This paper reconsiders the principles of compiler construction from this modern, object-oriented point of view. We present a multi-paradigm, mainly object-oriented and generic approach to building a compiler generator that combines the Interpreter pattern with the Visitor pattern. A prototype of such an object-oriented and generic compiler generator has been developed in C# 2.0 and serves as a reference for explaining the design throughout this paper.
Keywords: Compiler, Compiler Generator, Design Patterns, Interpreter, Visitor, EBNF, Attributed Grammars, Genericity
1 Introduction and Motivation
Programming in different paradigms and languages, together with the basics of formal language theory and compiler construction, forms an essential part of computer science and especially of software engineering education. This paper results from work carried out mainly in the context of a bachelor thesis whose aim was to investigate the applicability of the object-oriented and generic programming paradigms to the field of compilers and compiler generators. The first step was to study existing techniques for both compilers and compiler generators. Aho and Ullman [1] provide a sound overview of the principles of compiler construction. Then, existing compiler generators were examined: traditional ones like lex and yacc [5] (aka flex and bison from GNU) as well as modern ones like JavaCC [9] and Coco-2 [7]. All of these tools turned out to have one aspect in common: they mainly follow the imperative/procedural programming paradigm and generate old-fashioned procedural code. In this paper, the authors present a new approach that ports this old idea to modern programming paradigms, specifically the object-oriented and generic paradigms.
2 Object-oriented and Generic Compiler Generator
The main idea of this approach is to use object-oriented programming and generic classes in a compiler generator as well as in the compilers generated by it. This should make the code more readable and easier to maintain. Provided a proper input grammar, the compiler generator could be bootstrapped, that is, it would be able to generate itself. Appel and Palsberg state in [4] that "two of the most useful abstractions used in modern compilers are context-free grammars, for parsing, and regular expressions, for lexical analysis." With this in mind, the hybrid compiler generator makes use of both techniques. As in existing compiler compilers, (E)BNF (see section 2.1) is used to specify the grammar of the compiler to be generated. The first place where the difference from procedural compilers shows up is the syntax tree, which is created from an input stream. In this case, we are literally talking about a tree: a tree of objects representing specific parts of the source grammar. By combining the design patterns Interpreter and Visitor [2], we are able to perform operations on this tree. To understand the design of the presented compiler generator, it is therefore necessary to learn about EBNF, the Interpreter and Visitor patterns, and generic programming, which the following sections introduce.
2.1 (E)BNF Notation
EBNF is an extension by Niklaus Wirth of BNF, short for Backus-Naur Form. It is a standard notation for describing context-free grammars. That is why many compiler generators, including the one presented in this paper, use variations of this notation to define the grammar of the source language for a compiler to be generated. This ensures that the compiler generator is able to deal with any given context-free grammar. To define the rules of a grammar, EBNF uses the syntactic elements alternative |, option [ … ], repetition { … }, and sequence, which is represented just by a blank. Parentheses ( … ) are used to declare precedence. Let us consider an example in the XML realm:
    XMLNode = '' XMLElement ''.
    XMLElement = { text | XMLNode }.
The above example shows a simplified grammar for parts of an XML document in EBNF. The input for the presented hybrid compiler generator will be exactly such a grammar. The main idea is to map all syntactic elements of EBNF as well as all terminal and non-terminal symbols of a grammar to specific classes. Objects of these classes are able to parse instances of the respective parts of the grammar. For example, an object of the class Rule, which represents a non-terminal symbol (left-hand side) together with its replacement (right-hand side of the grammar rule), is able to parse the parts of a sentence that conform to this rule of the grammar, and it additionally represents this rule in the form of an object tree.
2.2 Design Pattern Interpreter
The Interpreter pattern is an easy way to represent sentences of a language in an object-oriented programming language. As described in [2], the Interpreter pattern "uses a class to represent each grammar rule. Symbols on the right-hand side of the rule are instance variables of these classes." That is how non-terminal symbols are represented. Now, we consider each terminal symbol of the grammar as a rule of the form
    SymbolName = 'symbol'.
We can then treat terminal symbols the same way as non-terminal symbols: the rule just gives a name to the class that represents the symbol. Now that it is possible to represent all terminal and non-terminal symbols of a grammar as objects, we are able to represent any sentence of the source language by assembling a syntax tree from concrete objects of these classes, as is done for XML nodes in Illustration 1.
Illustration 1. XML syntax tree using the above grammar
Each element of the syntax tree is an instance of a class derived from a common abstract root class, which defines an Interpret method. All derived classes implement this method, in which the desired transformations are performed and the output of the compiler, for example machine code, is created. By traversing the syntax tree and calling the Interpret method of each object in the tree, we can perform these operations on the whole input.
Illustration 2. Class diagram for an XML interpreter according to the previous examples
Illustration 2 shows how all symbols of the grammar are represented by classes derived from the abstract root class XML, which override the Interpret method. The two classes on the left represent non-terminal symbols. The two classes on the right represent terminal symbols, or rather terminal classes. The Composite pattern [2] is used to map recursive rules in the grammar to the class structure.
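The class structure described for Illustration 2 can be sketched roughly as follows. This is a minimal illustration and not the prototype's code: the class names Text and Identifier, the tag syntax emitted by XMLNode, and the choice of "re-emit the document as text" as the interpreted operation are all assumptions made for this example.

```csharp
using System.Collections.Generic;
using System.Text;

// Abstract root class of the syntax tree (class XML in Illustration 2).
public abstract class XML
{
    // The single built-in operation; here it re-emits the document as text.
    public abstract void Interpret(StringBuilder output);
}

// Terminal class for character data (name assumed).
public sealed class Text : XML
{
    private readonly string value;
    public Text(string value) { this.value = value; }
    public override void Interpret(StringBuilder output) { output.Append(value); }
}

// Terminal class for a tag name (assumed; not part of the grammar as printed).
public sealed class Identifier : XML
{
    private readonly string name;
    public Identifier(string name) { this.name = name; }
    public override void Interpret(StringBuilder output) { output.Append(name); }
}

// Non-terminal XMLElement = { text | XMLNode }: a Composite of child elements.
public sealed class XMLElement : XML
{
    private readonly List<XML> children = new List<XML>();
    public XMLElement Add(XML child) { children.Add(child); return this; }
    public override void Interpret(StringBuilder output)
    {
        foreach (XML child in children) child.Interpret(output);
    }
}

// Non-terminal XMLNode: an XMLElement enclosed in assumed start and end tags.
public sealed class XMLNode : XML
{
    private readonly Identifier name;
    private readonly XMLElement content;
    public XMLNode(Identifier name, XMLElement content)
    {
        this.name = name;
        this.content = content;
    }
    public override void Interpret(StringBuilder output)
    {
        output.Append('<'); name.Interpret(output); output.Append('>');
        content.Interpret(output);
        output.Append("</"); name.Interpret(output); output.Append('>');
    }
}

public static class InterpreterDemo
{
    // Builds a small tree by hand and interprets it from the root.
    public static string Run()
    {
        XML tree = new XMLNode(new Identifier("p"), new XMLElement()
            .Add(new Text("Hello "))
            .Add(new XMLNode(new Identifier("b"), new XMLElement().Add(new Text("world")))));
        StringBuilder output = new StringBuilder();
        tree.Interpret(output);
        return output.ToString();
    }
}
```

Calling InterpreterDemo.Run() returns the string <p>Hello <b>world</b></p>. Note that the operation is hard-wired into every class of the tree, so a second operation would need a second method in each of them.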
Often, the user of such a compiler will want to perform various operations on such a tree, for example to generate machine code, pretty-print the code, or measure the complexity of the code represented by the tree. Each additional operation would require an additional Interpret method in the root class and an implementation in all derived classes. The more classes there are, the more time this takes, and it becomes rather impractical for complex grammars. Gamma et al. [2] note that "the Interpreter pattern works best when the grammar is simple. For complex grammars, the class hierarchy for the grammar becomes large and unmanageable." By combining the Interpreter with the Visitor pattern, however, we can work around this problem.
2.3 Design Pattern Visitor
According to [2], the Visitor pattern makes it possible to define operations that should be performed on the elements of an object structure without changing the classes of these elements. This means we can add new operations to the syntax tree without having to modify the classes of the objects that make up the syntax tree. Each operation is defined as a single Visitor class.
Illustration 3. A Visitor class hierarchy that executes various operations on an XML tree
All Visitor classes that work on the same grammar are derived from a common root class, an abstract Visitor class. For each symbol of the grammar, a Visit method is declared in the abstract Visitor class and implemented in all derived concrete Visitor classes. To perform an operation on a syntax tree, the tree is traversed, and a Visitor object visits each object of the tree. To this end, the Interpret method from the Interpreter pattern is replaced by an Accept method. When visiting an element of the tree, the Visitor object calls the element's Accept method. The visited object just calls back the appropriate Visit method of the Visitor object, where the operation is implemented, and passes itself as parameter. With this parameter, the Visitor possesses all the information required to perform its operation. So, we have replaced all the Interpret methods with one single Accept method that accepts different classes of Visitors to perform different types of operations. Gamma et al. [2] suggest unambiguous names for each Visit method, like VisitXMLNode, VisitText, or VisitIdentifier, but we can also just overload the Visit method as shown in Illustration 3. That way, we are able to implement Visit methods for the abstract elements of our grammar. For example, a Visit method that takes a terminal symbol could implement the default behaviour of an operation for all terminal symbols, so we would not have to implement all operations for all elements of a grammar. Dynamic binding ensures that the correct Visit method is called at run-time. Now, if we want to add new operations to our syntax tree, we only have to derive a new class from the abstract Visitor class and implement the required Visit methods for this operation. However, for the ease of adding new operations, we sacrifice the ease of extending the object structure: if new rules were added to our grammar, we would have to add a new Visit method to each of the Visitor classes. This is rarely the case, though. From [2] we learn that "the classes defining the object structure rarely change, but you often want to define new operations over the structure."
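A minimal sketch of this Accept/Visit interplay with overloaded Visit methods is given below. It is an illustration under assumptions, not the prototype's code: the two concrete Visitors, the abstract class TerminalSymbol as the "abstract element" with a shared default, and the decision to place the tree traversal in the default Visit methods are choices made for this example.

```csharp
using System.Collections.Generic;
using System.Text;

// Root of the syntax tree: Interpret is replaced by a single Accept method.
public abstract class XML
{
    public abstract void Accept(Visitor visitor);
}

// Abstract element shared by all terminal classes.
public abstract class TerminalSymbol : XML
{
    public readonly string Value;
    protected TerminalSymbol(string value) { Value = value; }
}

public sealed class Text : TerminalSymbol
{
    public Text(string value) : base(value) { }
    public override void Accept(Visitor visitor) { visitor.Visit(this); }
}

public sealed class Identifier : TerminalSymbol
{
    public Identifier(string value) : base(value) { }
    public override void Accept(Visitor visitor) { visitor.Visit(this); }
}

public sealed class XMLElement : XML
{
    public readonly List<XML> Children = new List<XML>();
    public XMLElement Add(XML child) { Children.Add(child); return this; }
    public override void Accept(Visitor visitor) { visitor.Visit(this); }
}

public sealed class XMLNode : XML
{
    public readonly Identifier Name;
    public readonly XMLElement Content;
    public XMLNode(Identifier name, XMLElement content) { Name = name; Content = content; }
    public override void Accept(Visitor visitor) { visitor.Visit(this); }
}

// Abstract Visitor with overloaded Visit methods. The defaults traverse the
// tree and route every terminal class to Visit(TerminalSymbol).
public abstract class Visitor
{
    public virtual void Visit(TerminalSymbol symbol) { }
    public virtual void Visit(Text text) { Visit((TerminalSymbol)text); }
    public virtual void Visit(Identifier identifier) { Visit((TerminalSymbol)identifier); }
    public virtual void Visit(XMLElement element)
    {
        foreach (XML child in element.Children) child.Accept(this);
    }
    public virtual void Visit(XMLNode node)
    {
        node.Name.Accept(this);
        node.Content.Accept(this);
    }
}

// Operation 1: count terminal symbols by overriding only the shared default.
public sealed class CountTerminalsVisitor : Visitor
{
    public int Count;
    public override void Visit(TerminalSymbol symbol) { Count++; }
}

// Operation 2: collect character data; identifiers keep the empty default.
public sealed class CollectTextVisitor : Visitor
{
    public readonly StringBuilder Result = new StringBuilder();
    public override void Visit(Text text) { Result.Append(text.Value); }
}

public static class VisitorDemo
{
    public static XML BuildTree()
    {
        return new XMLNode(new Identifier("p"), new XMLElement()
            .Add(new Text("Hello "))
            .Add(new XMLNode(new Identifier("b"), new XMLElement().Add(new Text("world")))));
    }
    public static int CountTerminals()
    {
        CountTerminalsVisitor visitor = new CountTerminalsVisitor();
        BuildTree().Accept(visitor);
        return visitor.Count;
    }
    public static string CollectText()
    {
        CollectTextVisitor visitor = new CollectTextVisitor();
        BuildTree().Accept(visitor);
        return visitor.Result.ToString();
    }
}
```

For the hand-built tree, VisitorDemo.CountTerminals() returns 4 (two identifiers and two text runs) and VisitorDemo.CollectText() returns "Hello world". Both operations were added without touching the tree classes. In C# the overload is selected at compile time inside each Accept method, while the virtual Accept and Visit calls provide the run-time binding.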
2.4 Generic Aspects of the Compiler Generator
To represent the syntactic elements of EBNF, ordinary classes do not suffice. Alternative, sequence, option, and repetition must be implemented as generic classes, because we do not know in advance which symbols will be used within these elements. For example, if alternative were implemented with conventional classes, we would have to implement a new Alternative class for almost every occurrence of | in the grammar of the source language. Each of these classes would be limited to two specific types. Generic classes, however, can be instantiated with any concrete type and allow us to map each alternative of a grammar to an instantiation of the generic Alternative class. Furthermore, if a rule consists of more than two alternatives, nested instantiations of the generic class can still represent this rule. The following classes can be used to put together the right-hand side of any rule given in EBNF:
• Sequence<T1, T2> to represent … T1 ○ T2 …,
• Alternative<T1, T2> to represent … T1 | T2 …,
• Option<T> to represent … [ T ] … and
• Repetition<T> to represent … { T } ….
The type parameters T1 and T2, or T respectively, are then replaced by concrete types. For example, the rule A {B | C}. is mapped to the class Sequence<A, Repetition<Alternative<B, C>>>.
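To make the nesting of generic instantiations concrete, the following sketch implements the four generic classes as simple recognizers and instantiates them for the rule A {B | C}. as Sequence<A, Repetition<Alternative<B, C>>>. Everything beyond the four class names is an assumption for illustration: the Input cursor, the Parse signature, the new() constraints, the backtracking strategy, and the terminal spellings "a", "b", "c". Unlike the design described in the paper, the sketch only recognizes input and does not build a syntax tree.

```csharp
// Input text with a cursor; parsers advance the cursor on success.
public sealed class Input
{
    public readonly string Text;
    public int Position;
    public Input(string text) { Text = text; }
}

public abstract class Parser
{
    // Returns true and advances the cursor on success;
    // leaves the cursor unchanged on failure.
    public abstract bool Parse(Input input);
}

public abstract class TerminalSymbol : Parser
{
    protected abstract string Symbol { get; }
    public override bool Parse(Input input)
    {
        if (input.Position + Symbol.Length > input.Text.Length) return false;
        if (input.Text.Substring(input.Position, Symbol.Length) != Symbol) return false;
        input.Position += Symbol.Length;
        return true;
    }
}

// … T1 T2 …
public class Sequence<T1, T2> : Parser
    where T1 : Parser, new() where T2 : Parser, new()
{
    public override bool Parse(Input input)
    {
        int start = input.Position;
        if (new T1().Parse(input) && new T2().Parse(input)) return true;
        input.Position = start; // backtrack if the second part fails
        return false;
    }
}

// … T1 | T2 …
public class Alternative<T1, T2> : Parser
    where T1 : Parser, new() where T2 : Parser, new()
{
    public override bool Parse(Input input)
    {
        return new T1().Parse(input) || new T2().Parse(input);
    }
}

// … [ T ] …
public class Option<T> : Parser where T : Parser, new()
{
    public override bool Parse(Input input)
    {
        new T().Parse(input); // zero or one occurrence, never fails
        return true;
    }
}

// … { T } …
public class Repetition<T> : Parser where T : Parser, new()
{
    public override bool Parse(Input input)
    {
        while (true)
        {
            int before = input.Position;
            // Stop when T fails or matches without consuming input.
            if (!new T().Parse(input) || input.Position == before) return true;
        }
    }
}

// Terminal classes for the example rule (spellings assumed).
public sealed class A : TerminalSymbol { protected override string Symbol { get { return "a"; } } }
public sealed class B : TerminalSymbol { protected override string Symbol { get { return "b"; } } }
public sealed class C : TerminalSymbol { protected override string Symbol { get { return "c"; } } }

public static class GenericsDemo
{
    // True if the whole text is a sentence of A { B | C }.
    public static bool Matches(string text)
    {
        Input input = new Input(text);
        Parser rule = new Sequence<A, Repetition<Alternative<B, C>>>();
        return rule.Parse(input) && input.Position == text.Length;
    }
}
```

With these definitions, GenericsDemo.Matches("abcb") and GenericsDemo.Matches("a") return true, while "ba" and "abd" are rejected. No class had to be written for the specific combination of A, B, and C; the nested instantiation alone describes the right-hand side of the rule.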
3 Design of an Object-oriented and Generic Compiler Generator
As mentioned in section 2.1, EBNF is used in compiler generators to describe any context-free language. So, our object-oriented and generic compiler generator has to implement the Interpreter pattern for EBNF, similar to ideas formulated in [3]. Since the input for the compiler generator will be grammars denoted in EBNF, we can describe the input for our compiler generator by expressing EBNF in EBNF.
    Grammar = Rule { Rule }.
    Rule = nonterminalIdent '=' Expr '.'.
    Expr = Term { '|' Term }.
    Term = Fact { ' ' Fact }.
    Fact = terminalIdent | nonterminalIdent | '[' Expr ']' | '{' Expr '}' | '(' Expr ')'.
For each syntactic element of EBNF, namely alternative, sequence, option, and repetition, a generic class is derived from the abstract root class Parser in our object-oriented compiler generator. The two classes TerminalSymbol and NonterminalSymbol are also derived from Parser. These classes form the core of our object-oriented compiler generator.
Illustration 4. Class diagram of a parser for EBNF
When parsing a grammar in EBNF, all the generator does is add a new class derived from TerminalSymbol for each terminal symbol found in the input grammar. In the case of EBNF, these are just a period symbol and an equals symbol, as shown in Illustration 4. Accordingly, for each rule of the input grammar, a new class is derived from NonterminalSymbol, instantiated with a combination of instantiations of Alternative, Sequence, Option, and Repetition that resembles the right-hand side of the rule. The following listing shows the class declarations that the compiler generator creates for the first three rules of the above grammar of EBNF.
    class Grammar : NonterminalSymbol
    class Rule : NonterminalSymbol
    class Expr : NonterminalSymbol
So for the generator it is rather easy to generate the code required for parsing a sentence of the specified language. All the information necessary to perform this task can be found in the EBNF grammar of the source language. However, this is just the front-end of the compiler. How can we describe the transformations and operations that we want to apply to the syntax tree? The EBNF notation has to be extended with attributes that add semantic information to the grammar, as described in [8]. This is usually done by writing semantic actions next to the element of the grammar that triggers the action when it is recognized in the input stream. As discussed above, in our object-oriented approach we use Visitor classes to define operations on the syntax tree. The compiler generator generates as many Visitor classes as there are operations defined in the input. As a result, semantic actions have to be qualified with the name of the Visitor they belong to. This way, the compiler generator is able to add the actions to the appropriate Visitor class. The following example shows what the notation looks like for addition and subtraction in arithmetic expressions. Two Visitor classes have to be generated: a CalculateVisitor that calculates the result of the arithmetic expression, and a PrintVisitor that simply prints out the expression.
    Expr = Term      CalculateVisitor: left = stack.pop();
      { ( '+'        CalculateVisitor: operation = Plus;
                     PrintVisitor: writer.write(" + ");
        | '-'        CalculateVisitor: operation = Sub;
                     PrintVisitor: writer.write(" - ");
        )
        Term         CalculateVisitor: right = stack.pop();
                       if (operation == Plus) left += right;
                       else if (operation == Sub) left -= right;
      }              CalculateVisitor: stack.push(left);
      .
The position of a semantic action defines at which moment in the parsing process the action is executed: an action is executed when the symbol after which it is located has been successfully parsed in the input stream. So, whenever a plus symbol is found in the input stream, the PrintVisitor executes the action that prints " + ". This notation, however, is just a suggestion; there is no "official standard" for extending EBNF with semantic actions.
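To illustrate how the semantic actions of the Expr rule could end up in two Visitor classes, the sketch below writes a CalculateVisitor and a PrintVisitor by hand. It is a guess at the general shape and not the output of the generator: the node classes Number (standing in for Term), PlusSymbol, MinusSymbol, and Tail, the actions attached to Number, and the Result properties are hypothetical. The actions for '+', '-', and Expr follow the attributed rule above.

```csharp
using System.Collections.Generic;
using System.IO;

public abstract class Node { public abstract void Accept(Visitor visitor); }

// Hypothetical stand-in for Term: a plain integer literal.
public sealed class Number : Node
{
    public readonly int Value;
    public Number(int value) { Value = value; }
    public override void Accept(Visitor visitor) { visitor.Visit(this); }
}

// Terminal classes for '+' and '-' (names assumed).
public sealed class PlusSymbol : Node { public override void Accept(Visitor visitor) { visitor.Visit(this); } }
public sealed class MinusSymbol : Node { public override void Accept(Visitor visitor) { visitor.Visit(this); } }

// One iteration of { ('+' | '-') Term }.
public sealed class Tail
{
    public readonly Node Operator;
    public readonly Node Term;
    public Tail(Node op, Node term) { Operator = op; Term = term; }
}

// Expr = Term { ('+' | '-') Term }.
public sealed class Expr : Node
{
    public readonly Node First;
    public readonly List<Tail> Rest = new List<Tail>();
    public Expr(Node first) { First = first; }
    public Expr Then(Node op, Node term) { Rest.Add(new Tail(op, term)); return this; }
    public override void Accept(Visitor visitor) { visitor.Visit(this); }
}

public abstract class Visitor
{
    public abstract void Visit(Number number);
    public abstract void Visit(PlusSymbol plus);
    public abstract void Visit(MinusSymbol minus);
    public abstract void Visit(Expr expr);
}

public sealed class CalculateVisitor : Visitor
{
    private enum Operation { Plus, Sub }
    private readonly Stack<int> stack = new Stack<int>();
    private Operation operation;
    public int Result { get { return stack.Peek(); } }

    public override void Visit(Number number) { stack.Push(number.Value); } // assumed action of Term
    public override void Visit(PlusSymbol plus) { operation = Operation.Plus; }
    public override void Visit(MinusSymbol minus) { operation = Operation.Sub; }
    public override void Visit(Expr expr)
    {
        expr.First.Accept(this);
        int left = stack.Pop();              // action after the first Term
        foreach (Tail tail in expr.Rest)
        {
            tail.Operator.Accept(this);      // sets operation
            tail.Term.Accept(this);
            int right = stack.Pop();         // action after the second Term
            if (operation == Operation.Plus) left += right;
            else if (operation == Operation.Sub) left -= right;
        }
        stack.Push(left);                    // action after the repetition
    }
}

public sealed class PrintVisitor : Visitor
{
    private readonly StringWriter writer = new StringWriter();
    public string Result { get { return writer.ToString(); } }

    public override void Visit(Number number) { writer.Write(number.Value); } // assumed action of Term
    public override void Visit(PlusSymbol plus) { writer.Write(" + "); }
    public override void Visit(MinusSymbol minus) { writer.Write(" - "); }
    public override void Visit(Expr expr)
    {
        expr.First.Accept(this);
        foreach (Tail tail in expr.Rest) { tail.Operator.Accept(this); tail.Term.Accept(this); }
    }
}

public static class ExprDemo
{
    // Syntax tree for the sentence 7 - 2 + 10, built by hand.
    public static Expr Build()
    {
        return new Expr(new Number(7))
            .Then(new MinusSymbol(), new Number(2))
            .Then(new PlusSymbol(), new Number(10));
    }
    public static int Calculate() { CalculateVisitor v = new CalculateVisitor(); Build().Accept(v); return v.Result; }
    public static string Print() { PrintVisitor v = new PrintVisitor(); Build().Accept(v); return v.Result; }
}
```

For the tree of 7 - 2 + 10, ExprDemo.Calculate() returns 15 and ExprDemo.Print() returns "7 - 2 + 10". Each action qualified with CalculateVisitor lands in one class and each action qualified with PrintVisitor in the other, which is the separation the notation is meant to express.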
4 Implementation of a Prototype
As a proof of concept, the authors have developed a prototype of such an object-oriented and generic compiler generator in the course of a bachelor thesis [10]. It has been implemented using Microsoft .NET and C# 2.0 [6]. The class structure presented in the previous section corresponds to the implementation of the prototype. Two versions of the back-end are available, however: one generates C# source code like conventional compiler generators, the other directly generates and loads Common Intermediate Language code using CodeDom. C# has been chosen over Java because Java's type erasure during compilation limits the use of generic classes in cases where one has to rely on reflection to create instances of these classes.
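The point about reflection can be illustrated with the standard .NET reflection API. The sketch below constructs Repetition<Alternative<B, C>> at run-time from the open generic types and shows that the type arguments remain inspectable afterwards. The empty placeholder classes and the ReflectionDemo helper are assumptions for this example; whether the prototype uses these particular calls is not stated in the paper.

```csharp
using System;

// Empty placeholders standing in for the generator's classes.
public abstract class Parser { }
public class Alternative<T1, T2> : Parser where T1 : Parser where T2 : Parser { }
public class Repetition<T> : Parser where T : Parser { }
public class B : Parser { }
public class C : Parser { }

public static class ReflectionDemo
{
    // Builds Repetition<Alternative<B, C>> from open generic types at run-time.
    public static Parser Create()
    {
        Type alternative = typeof(Alternative<,>).MakeGenericType(typeof(B), typeof(C));
        Type repetition = typeof(Repetition<>).MakeGenericType(alternative);
        return (Parser)Activator.CreateInstance(repetition);
    }

    // The type argument survives at run-time and can be read back.
    public static Type TypeArgument()
    {
        return Create().GetType().GetGenericArguments()[0];
    }
}
```

Here ReflectionDemo.Create() yields an object whose run-time type is Repetition<Alternative<B, C>>, and ReflectionDemo.TypeArgument() returns typeof(Alternative<B, C>). In Java, the corresponding object would carry no record of its type arguments after compilation.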
5 Summary
This paper described an attempt to apply modern programming paradigms, especially the object-oriented and generic paradigms, to the field of compiler construction. The design of a compiler generator based on these paradigms has been shown. The presented design uses a combination of the Interpreter and Visitor patterns to represent sentences of a language and operations working on these sentences. Generic classes are used to represent the syntactic elements of EBNF (alternative, sequence, option, and repetition) and to glue together classes that correspond to the rules of a given grammar. A variation of attributed grammars that allows the definition of various operations on a grammar has been presented. Finally, a prototype of such an object-oriented and generic compiler generator has been developed to show the applicability of the presented ideas.
References
1. Aho, A.V., Ullman, J.D.: Principles of Compiler Design. Addison-Wesley, 1977
2. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional Computing Series, 1995
3. Lorenz, D.H.: Tiling Design Patterns: A Case Study Using the Interpreter Pattern. ACM SIGPLAN Notices Vol. 32, pages 206-217, October 1997
4. Appel, A.W., Palsberg, J.: Modern Compiler Implementation in Java. Cambridge University Press, 2002
5. Levine, J.R., Mason, T., Brown, D.: Lex & Yacc. O'Reilly & Associates, 1992
6. Thai, T.L., Lam, H.: .NET Framework Essentials. O'Reilly, 2002
7. Dobler, H.: Coco-2: A new compiler compiler. SIGPLAN Notices 25, 1990
8. Knuth, D.E.: Semantics of context-free languages. Springer New York, 1967
9. McFlynn, D., Weissman, P.F.: Using JavaCC and SableCC. 4UPress, 2004
10. Pitzer, M.: An object-oriented compiler generator. Bachelor Thesis, 2006