LDTA’07

Seventh Workshop on Language Descriptions, Tools, and Applications

Braga, Portugal, March 25, 2007

Proceedings

Editors: Anthony Sloane, Adrian Johnstone

Foreword

The papers presented at the Seventh Workshop on Language Descriptions, Tools, and Applications (LDTA ’07) are contained in this volume. LDTA ’07 was a satellite event of the European Joint Conferences on Theory and Practice of Software (ETAPS ’07) and was held in Braga, Portugal, on March 25, 2007. Previous instances of this workshop have been held as satellite events of ETAPS in Vienna, Austria in 2006, Edinburgh, UK in 2005, Barcelona, Spain in 2004, Warsaw, Poland in 2003, Grenoble, France in 2002, and Genoa, Italy in 2001.

The aim of this one-day workshop is to bring together researchers from academia and industry who have an interest in the field of formal language definitions and language technologies. A special emphasis is placed on the development of tools based on formal language definitions.

The program for LDTA ’07 consists of an invited talk by Uwe Aßmann, five research papers, three experience reports, and three shorter tool demonstrations. The accepted papers were selected from twenty-five submissions, comprising eighteen research papers, three experience reports and four tool demonstrations. Each submitted paper was reviewed by at least three program committee members with conflicts of interest resolved by non-interested parties. The papers cover a range of topics including program transformation, formal grammars and parsing, attribute grammar systems, and debugging.

We would like to thank the members of the program committee for the careful review of the submitted papers. We also thank the ETAPS organizing committee for handling the local organization of the workshop. We are again very pleased that this workshop is held in cooperation with ACM SIGPLAN and with the publication of these proceedings as a volume in the Electronic Notes in Theoretical Computer Science (ENTCS) by Elsevier.

Anthony Sloane, Sydney, Australia
Adrian Johnstone, London, UK
March 2007

Program Committee

Judith Bishop, University of Pretoria, South Africa
Claus Brabrand, BRICS, University of Aarhus, Denmark
Nigel Horspool, University of Victoria, Canada
Johan Jeuring, Utrecht University, The Netherlands
Adrian Johnstone, Royal Holloway, University of London, UK (co-chair)
Steven Klusener, Vrije Universiteit, The Netherlands
Kent Lee, Luther College, USA
Brian Malloy, Clemson University, USA
Terence Parr, University of San Francisco, USA
Michael Schwartzbach, BRICS, University of Aarhus, Denmark
Anthony Sloane, Macquarie University, Australia (co-chair)
Jurgen Vinju, CWI, The Netherlands

Organizing Committee

Alcino Cunha, University of Minho, Braga, Portugal
Thomas Noll, RWTH Aachen University, Germany

Table of Contents

Session 1 (09:00-10:30)
  Welcome and Introduction
  Invited Talk: Collaboration-Based Composition of Languages (abstract), Uwe Aßmann ... 1
  Research Talk: Language Parametric Module Management, Paul Klint, Taeke Kooiker, Jurgen Vinju ... 3

Coffee Break (10:30-11:00)

Session 2 (11:00-12:30)
  Research Talk: Fusing a Transformation Language with an Open Compiler, Karl Trygve Kalleberg, Eelco Visser ... 18
  Experience Report: Implementing a Domain-Specific Language using Stratego/XT, Leonard Hamey, Shirley Goldrei ... 32
  Tool Demonstration: Spoofax: An Interactive Development Environment for Program Transformation with Stratego/XT, Karl Trygve Kalleberg, Eelco Visser ... 47

Lunch (12:30-14:00)

Session 3 (14:00-16:00)
  Research Talk: SPPF-Style Parsing From Earley Recognisers, Elizabeth Scott ... 51
  Experience Report: An Experimental Ambiguity Detection Tool, Sylvain Schmitz ... 66
  Research Talk: Grammar Engineering Support for Precedence Rule Recovery and Compatibility Checking, Eric Bouwers, Martin Bravenboer, Eelco Visser ... 82
  Tool Demonstration: SdfMetz: Extraction of Metrics and Graphs From Syntax Definitions, Tiago Alves, Joost Visser ... 97

Coffee Break (16:00-16:30)

Session 4 (16:30-18:15)
  Research Talk: Silver: an Extensible Attribute Grammar System, Eric Van Wyk, Derek Bodin, Jimin Gao, Lijesh Krishnan ... 101
  Experience Report: Development of a Modelica Compiler using JastAdd, Johan Åkesson, Torbjörn Ekman, Görel Hedin ... 116
  Tool Demonstration: A Domain-Specific Language Debugging Framework Demonstration, Hui Wu, Jeff Gray, Marjan Mernik ... 131

Discussion and Closing


Collaboration-Based Composition of Languages

Uwe Aßmann
Institut für Software- und Multimediatechnologie, Technische Universität Dresden
[email protected]

To achieve compositionality for language components, we transfer the notion of collaboration-based design from software modelling to language design. In modelling, collaboration schemes (also called role models) describe interactions between model concepts, encapsulating the interactions so that they can be reused in different scenarios. While collaboration schemes have been successfully used for system models, they have not yet been applied to metamodels in language design, for which they provide a large potential: they can describe the interaction of language concepts from different language components, explain and constrain their interplay, and adapt them to each other, even if they had not been designed for each other. Hence, the use of collaboration schemes paves the way to a new flexible technique for the composition of languages.

To show this, we first discuss several interesting advantages of collaboration schemes for models and model composition. Mainly, collaboration schemes provide a kind of relational modularity. Because they encapsulate the interplay of two or more modelling concepts they provide a form of relational modules. Model concepts can be specified in isolation, disregarding their contexts. To specify their interaction, a collaboration scheme can be superimposed, contributing its relational information to them. During superimposition, two composition relations are employed: role objects are added with a plays-a relationship to model concepts, and the implementation of already existing role objects may be extended by an extends-a relationship.

Similarly for metamodeling, that is, for the modeling of languages, collaboration schemes provide relational modularity of language concepts. Relations between language constructs can be encapsulated into collaboration schemes and superimposed on language concepts that had been developed in isolation. Hence, collaboration schemes are relational modules for the development of languages: language components can be composed that had not been developed for each other; in particular, languages can be extended with new components containing domain-specific constructs.

Hence, collaboration schemes improve the separation of concerns in language development. Because they encapsulate the collaboration aspect of model concepts, they can be used to separate the core features of a language construct from its relations to other concepts in different language components. This simplifies the break-up of a complex language into components, resp. the component-based development of languages.

Another advantage of role models is that the plays-a and extends-a relationships have to be implemented in a separate refinement step, being either represented by inheritance, mixin inheritance, or delegation. This refinement step separates abstract language models from concrete language models, and yields a large freedom for the concrete representation of complex language constructs. In many cases, inheritance and delegation in metamodels turn out to be implementations of plays-a and extends-a. When they are abstracted and re-introduced in the refinement step, many different variants of concrete language models can be produced from one abstract language model. This paves the way to a model-driven development of languages.

The applications of collaboration schemes do not only include composition or extension of languages, but also enable us to construct language factories for the construction of language product lines. Finally, after a long time of research, they let us better understand the notion of an extensible language: once the relational concern is separated from the core concern of a language concept, language extension and composition get a systematic background in modern object-oriented modeling.
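As a rough illustration of the plays-a idea (not taken from the talk; all class and method names below are invented), a role can be attached to a language concept by delegation in a metamodel implementation, leaving the concept itself untouched:

// Illustrative sketch only; names are hypothetical.
// A core language concept, specified in isolation:
class Statement { /* core features only */ }

// A role contributed by a collaboration scheme (e.g. "can be nested in a block"):
interface BlockMember { int nestingDepth(); }

// Refinement step: the plays-a relationship realized here by delegation.
class StatementAsBlockMember implements BlockMember {
    private final Statement player;  // the concept playing the role
    private final int depth;
    StatementAsBlockMember(Statement player, int depth) {
        this.player = player;
        this.depth = depth;
    }
    public int nestingDepth() { return depth; }
}

The same role could equally be realized by inheritance or mixin inheritance in a different refinement of the abstract language model.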



Language Parametric Module Management for IDEs

P. Klint, A.T. Kooiker, J.J. Vinju
Centrum voor Wiskunde en Informatica
P.O. Box 94079, NL-1090 GB Amsterdam, The Netherlands

Abstract

An integrated development environment (IDE) monitors all the changes that a user makes to source code modules and responds accordingly by flagging errors, by reparsing, by rechecking, or by recompiling modules and by adjusting visualizations or other information derived from a module. A module manager is the central component of the IDE that is responsible for this behavior. Although the overall functionality of a module manager in a given IDE is fixed, its actual behavior strongly depends on the programming languages it has to support. What is a module? How do modules depend on each other? What is the effect of a change to a module? We propose a concise design for a language parametric module manager: a module manager that is parameterized with the module behavior of a specific language. We describe the design of our module manager and discuss some of its properties. We also report on the application of the module manager in the construction of IDEs for the specification language ASF+SDF as well as for Java. Our overall goal is the rapid development (generation) of IDEs for programming languages and domain specific languages. The module manager presented here represents a next step in the creation of such generic language workbenches.

1 Introduction

The long-term goal of our research is generation of Integrated Development Environments (IDEs) for programming languages and domain specific languages. This is a classical topic, with a traditional focus on the generation of syntactic and semantic analysis tools [10, 15]. In this paper we instead focus on generating the interactive behavior of IDEs.

1.1 Motivation

IDEs increase productivity of programmers by providing them with an efficient input interface and rapid feedback. For many software projects the availability of a good IDE is one of the decisive factors in programming language selection. With language design and domain specific languages (DSLs) back on the (research) agenda [5], and knowing that tool support for DSLs is one of the limiting factors for their application [14], the key question is: “What is the quickest way to construct a full-fledged IDE for any kind of language?”


IDEs are complex systems. Apart from editing, building, linking and debugging programs they offer syntax highlighting, auto-completion, formatting, outlining, spell checking, indexing, refactoring, context-sensitive help, advanced static analysis, call graphs, version control, round-trip engineering, and much more. Programming languages have become more complex and software products are getting bigger and bigger. Many products actually use multiple programming and domain specific languages. This all adds up to the complexity of IDEs and building them requires major investments as exemplified by the effort in constructing Eclipse [6], and its various instantiations for Java, C, Cobol, and other languages.

The subject of this paper is a central part of each IDE that we call the “module manager”. The module manager coordinates all actions within the IDE and all interaction with the programmer. It does this by responding to the changes that the programmer makes to the source code of a project, and by triggering actions accordingly. The module manager does not implement the actual interaction with the user, nor does it implement any specific action, but it does coordinate these actions. The main data model behind such coordination is the collection of source code modules of a software project and their interdependencies. A well-designed module manager is central to each IDE and reduces the coupling between other components. It leads to a plug-in architecture in which IDE components can be added independently.

The mother of all module managers is the tool make [8] that uses the module dependency graph to initiate build actions on source code modules in a batch-like fashion. Ant [17] is a modern and more sophisticated version of make. The functionality of a module manager for an IDE is, however, much more complex. It has to react to many external triggers, is not restricted to pure build actions, and has to initiate many different actions as well. Examples are parsing, checking or compiling of modules, and adjusting visualizations or other information derived from modules such as context-sensitive help and error lists. The module manager is a fully interactive scheduler. It knows about language semantics in terms of modularity and packaging, and it knows about the capabilities of the IDE in terms of input and output to the user-interface. The main goal of the module manager is to provide fast and adequate feedback to the programmer on any modification she makes to any module’s source code.

1.2 A Language Parametric Module Manager

The basic functionality of a module manager is to provide access to the modular structure of the source code of a software project. This modular structure is different for each language. Apart from their pure syntactic appearance, the meaning of modules and module dependencies differs per language. For instance, the include mechanism of the C preprocessor does not coincide with a C namespace; files are simply concatenated one after the other. The Java import mechanism, however, does coincide with the namespace of a compilation unit; a class can be made invisible outside the compilation unit it is defined in. Another example: Java has wildcards in import statements, a feature that is not present in C. The module semantics of a language is an important aspect of its syntax and semantics that is essential from the viewpoint of IDE construction.

Large applications may even contain circular module dependencies: consider the processing of a text document containing an embedded spreadsheet that in its turn contains a text document, the syntax definition of a language in which statements can contain expressions but expressions may contain statements as well, or various design patterns that result in circular module dependencies.

Our goal is to develop a module manager that supports rapid prototyping of IDEs for any (domain specific) language and satisfies the following requirements:

R1 (Language parametric) It should be parameterized with the “module semantics” of a language. Circular module dependencies should be allowed.

R2 (Schedule actions/rapid feedback) It should schedule actions, optimizing the schedule for rapid feedback to the programmer.

R3 (Open) It should be open and be able to share a (partial) view of the modular structure of a project with other parts of the IDE.

R4 (Scalable) It should scale to large applications.

1.3 Contributions and Road Map

This paper contributes the following ideas:

• The use of attributed module dependency graphs as a practical and efficient vehicle for implementing a language parametric module manager;

• The use of a simple modal logic as a way to parameterize a module manager with language specific module semantics;

• An efficient algorithm for implementing this logic on top of an attributed module dependency graph.

In Section 2 we define the functionality of a module manager and its underlying data model. Section 3 defines the syntax and semantics of module predicates. In Section 4 we give an overview of the architecture of our implementation and highlight the efficient implementation of the modal logic. Section 5 describes the case studies in which we applied our module manager to construct various IDEs. Section 6 summarizes our conclusions.

2 Attributed module dependency graphs

We will now present all notions that play a role in our approach: the basic representation (Section 2.1), the mapping of language concepts (Section 2.2), module attributes (Section 2.3), name spaces (Section 2.4), events (Section 2.5), module predicates (Section 2.6), and the API of the module manager (Section 2.7). In Section 3 we will describe module predicates in more detail.

2.1 Basic representation

Directed graphs are an obvious representation for programming language modules and their interdependencies. We identify the nodes of a graph with the modules of a program, and the edges of the graph with the dependencies between the modules of a program. Each node has a unique name and a collection of attributes. Each attribute has a unique name within the scope of the node, and an arbitrary value. Dependencies are anonymous but they do have attributes that allow the distinction between different types of dependencies. We call the modules that depend on module M the parents of M and we call the modules that module M depends on the children of M. Graphs can contain cycles and we can therefore represent cyclic dependencies.

Let’s consider two examples. In Java, modules could be classes, packages and compilation units. Classes and packages are identified by their qualified name (i.e., including package prefix) and compilation units are identified by filename. Java has dependencies of type containment, import, and inheritance. Classes are contained in compilation units or other classes, compilation units are contained in packages, and packages are contained in other packages. Classes import other classes, and inherit from other classes. In C, modules could be compilation units and header files; both are identified by filename. For dependency types C has includes and uses external declaration. Compilation units and header files can include each other, and they can declare dependencies on anonymous compilation units via external declarations.
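As a rough sketch (our own illustration, not the actual Meta-Environment code), this data model could be realized in Java along the following lines:

import java.util.*;

// Illustrative attributed module dependency graph; names are invented for this sketch.
final class Module {
    final String name;                                        // unique (namespace-qualified) name
    final Map<String, Object> attributes = new HashMap<>();   // e.g. "state" -> "compiled"
    final List<Dependency> children = new ArrayList<>();      // modules this module depends on
    final List<Module> parents = new ArrayList<>();           // modules that depend on this module
    Module(String name) { this.name = name; }
}

final class Dependency {
    final Module target;                                      // dependencies are anonymous...
    final Map<String, Object> attributes = new HashMap<>();   // ...but carry attributes, e.g. "type" -> "import"
    Dependency(Module target) { this.target = target; }
}

final class ModuleGraph {
    private final Map<String, Module> modules = new HashMap<>();

    Module module(String name) {
        return modules.computeIfAbsent(name, Module::new);
    }

    void addDependency(String from, String to, String type) {
        Module parent = module(from), child = module(to);
        Dependency d = new Dependency(child);
        d.attributes.put("type", type);                       // containment, import, inheritance, ...
        parent.children.add(d);
        child.parents.add(parent);                            // cycles are allowed
    }
}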

2.2 Mapping Language Concepts to the Graph Model

The mapping of programming language concepts to our graph model is rather arbitrary and depends on the granularity of interaction required by the IDE. For example, functions in C could be considered to be modules that depend on each other via a calls dependency. The only reason for labeling a programming language artifact as a module should be that the IDE needs the knowledge about dependencies between these modules to trigger certain actions.

2.3 Attributes

Modules and dependencies may have arbitrary attributes. For a specific programming language, there are specific attributes that will be used by the IDE to implement language specific behavior. Module attributes will be used to visualize a module’s identity to the programmer. For example, in a Java IDE a class module will have a class name attribute and a package name attribute. Other attributes may contain aggregated information, such as whether a module contains a syntax error, or how many lines of code it spans.

2.4 Namespaces

One of the complexities of today’s IDEs is that they have to deal with several programming languages and domain specific languages that are either operating next to each other or are embedded in each other. To be able to support several concepts of module semantics at the same time, we introduce namespaces for all identifiers in our graph based model. So, module identifiers, dependencies, module attributes, and dependency attributes all have a namespace. For brevity, we will assume from now on that a valid namespace is part of each module or attribute identifier.
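For instance (purely illustrative, not the actual implementation), identifiers could carry their namespace explicitly:

// A namespace-qualified identifier, so that a Java "import" dependency type and
// an SDF "import" dependency type can coexist in the same graph.
record QualifiedName(String namespace, String name) {
    @Override public String toString() { return namespace + ":" + name; }
}
// e.g. new QualifiedName("java", "import") vs. new QualifiedName("sdf", "import")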

2.5 Events

So far, we have only introduced a generic data structure for storing and retrieving transient information about modules. In order to schedule actions we need rules to select actions for execution. Examples of actions are compiling a compilation unit, or extracting an outline of a Java class, alerting the programmer about a certain error, or decorating a package view with versioning pictograms. The rules of the module manager should trigger these actions at the appropriate times.

The listener or observer design pattern [9] is a simple method for decoupling coordination from computation. A computation, or action, registers itself as a listener, and the coordinator triggers the action at certain moments. The module manager allows registration of listeners to attribute change events, module existence events and dependency existence events such that an action may be triggered on any change in the data model. Note that actions may influence the state of the module manager, triggering new actions. Since we do not assume anything about the actions, there can be no a priori guarantee that such a process would terminate, not deadlock, or even be deterministic.
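A hypothetical listener interface for these three kinds of events might look as follows (the actual system communicates such events over the ToolBus, described in Section 4; the names below are ours):

// Hypothetical shape of the listener registration; not the published API.
interface ModuleEventListener {
    void attributeChanged(String module, String attribute, Object oldValue, Object newValue);
    void moduleAdded(String module);
    void moduleRemoved(String module);
    void dependencyAdded(String from, String to);
    void dependencyRemoved(String from, String to);
}
// An action registers itself once and is then triggered on any change in the data model,
// e.g. moduleManager.addListener(outlineRecomputer);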

2.6 Module Predicates

As we have seen earlier in Section 1.1, make and ant trigger build actions using the dependency relationship between modules. For example, the module graph contains the basic information for recompiling parts of a Java program without rebuilding the rest of it. In an IDE there are many more actions to be triggered under different kinds of conditions. For example, if a method is removed from a Java class, outlines need to be recomputed for all classes that inherit from it. Or, if a C include statement is moved in a file, at least all code between the old location and the new location needs to be rechecked for static correctness. Or, if a Java compilation unit is modified (in terms of the version management system), then all packages it is contained in are also modified.

The information that needs to be propagated through a module dependency graph is language specific, even IDE specific. So, the module manager must provide some way of making information propagation programmable. For this we introduce module predicates. These are inspired by attribute grammar systems [16] and modal logic [2]. Both formalisms provide a programmable way of distributing information over the elements of a complex data structure. An example of a module predicate for a C IDE is linkable. A C compilation unit is linkable when it contains a main function and all of its dependencies have compiled correctly. We will get back to the details later in Section 4. For now, the key idea is that the truth values of module predicates are determined automatically by inspecting and aggregating the values of the attributes of a module and possibly other modules. The way this inspection and aggregation is done is determined by module predicate definitions, which the module manager receives at configuration time. When the value of such a predicate is changed as a result of the changed value of an attribute, a predicate changed event triggers the appropriate actions via listeners. The definitions are expressed using a simple logic, which allows the module manager to statically check for consistency of the set of definitions.

2.7 API of the Module Manager

The basic operations that the module manager offers are adding and removing of modules and dependencies, setting attribute values, registration of event listeners and registration of module predicates. The module manager may also contain any kind of generic graph manipulation algorithms for the benefit of IDE actions. Operations like transitive closures of dependencies, reachability analysis, inversion, clustering, coloring and exports to graph visualizations are obvious candidates for inclusion in the module manager. Keeping the processing of these data as well as the data themselves as close as possible to the module manager will increase efficiency.
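Taken together, the operations listed above suggest a module manager interface roughly like the following (an illustrative sketch of the surface only, not the published API):

import java.util.Set;

interface ModuleManager {
    // structure
    void addModule(String name);
    void removeModule(String name);
    void addDependency(String from, String to);
    void removeDependency(String from, String to);

    // attributes
    void setAttribute(String module, String attribute, Object value);
    Object getAttribute(String module, String attribute);

    // coordination
    void addListener(ModuleEventListener listener);
    void registerPredicate(String name, String condition);   // condition in the logic of Figure 1

    // generic graph algorithms kept close to the data for efficiency
    Set<String> transitiveChildren(String module);
    Set<String> reachableFrom(String module);
}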

P ::= N : C                            (predicate definition)
N ::= <Predicate Name>                 (name of a predicate)
A ::= <Attribute Name>                 (name of an attribute)
V ::= <Attribute Value>                (value of an attribute)
C ::= true | false                     (Boolean constants)
    | A = V                            (attribute value equality)
    | ¬C | C ∧ C | C ∨ C | C → C       (not, and, or, implies)
    | ◇C                               (in some child C holds)
    | □C                               (in all children C holds)
    | (C)                              (parentheses for grouping)

Figure 1. The syntax of module predicates.

3 Module Predicates

When the user makes a change to a module, the module manager uses the dependencies between modules to trigger actions in response to that change. How can this be done in a language parametric way?

3.1 Domain analysis

Analysis of existing IDEs reveals that actions on modules are triggered either directly or indirectly. Direct actions are consequences of the actions of the programmer that are directly related to a specific module. A module is edited for example; in response to this change the system decides to invalidate the previous compilation and to trigger a new compile action. We call this directly influenced module the pivot. Every sequence of automatically triggered actions in an IDE always starts at a certain pivot.

Indirect actions are also consequences of the actions of the programmer, but the affected modules can be far away from the pivot. Only actions on the pivot or on modules that depend on the pivot can be triggered. However, these actions are triggered conditionally since actions for a certain module are not always triggered even if an (indirect) dependency changed. For example, a C program should be relinked if one of its dependencies changed, but only if it contains a main function, and only if all of its dependencies have compiled correctly. We conclude that:

• The triggering of actions is governed by the module dependency graph;

• The triggering of actions occurs under certain conditions;

• Conditions refer to properties (attributes) of modules;

• Conditions refer to the properties of children of modules.

We will use a simple language for expressing the conditions for triggering events. This language needs at least the Boolean operators, some operators for inspecting the attributes of modules, and some operators to refer to the children of modules. The idea is to evaluate these conditions in every module, and send an event to the IDE when the value of a condition changes. We will label each condition with a name in order to be able to identify it, and call it a module predicate.


The effect is that actions will be triggered automatically, in a cascading effect that starts at the pivotal change in the module dependency graph, and ends when all module predicates have been re-evaluated. Note how this method of automatically triggering actions is a generalization of build tools like “make”. Those tools trigger build actions (mainly) on a set of fixed (built-in) conditions, e.g., a file being out-of-date. In our system, the conditions are programmable. Furthermore, in make-like tools dependencies and actions are tightly coupled, since every dependency rule may have a list of actions. In our system, the way dependencies are used is programmable, because a module predicate may refer to the attributes of children of modules in several ways.

3.2 Syntax and semantics of module predicates

The syntax of module predicates is defined in Figure 1. A predicate declaration consists of a predicate name N followed by a condition C. We assume that disjoint sets of predicate and attribute names are used and that predicate names are only used once. A condition may consist of true and false, tests for the value of attributes (A = V), the Boolean operators (∨, ∧, ¬, →), and operators to express conditions on the children of a module that have to hold in some child (◇) or in all children (□).

The operational semantics of the conditions is defined as follows. Each condition is evaluated for every module M. Every module has an attribute environment E that maps attribute names to attribute values, and a set of children K. The notation we use is M_K^E. An evaluation function eval reduces a condition to either true or false. It defines an operational semantics for the standard Boolean conditions (which we leave out for brevity), and an operational semantics for the conditions A = V, □C and ◇C:

eval(M_K^E, A = V) = true  iff  equals(lookup(A, E), V)
eval(M_K^E, □C)    = true  iff  ∀k ∈ K : eval(k, C ∧ □C)
eval(M_K^E, ◇C)    = true  iff  ∃k ∈ K : eval(k, C ∨ ◇C)

Evaluating an attribute value equality amounts to a lookup of the attribute’s value in the module’s environment and comparing it with the given value V. Evaluating a condition containing the □ or ◇ operator leads to the recursive application of the given condition C to the children of the current module, but evaluation differs in the way the result is aggregated. For □, the condition must hold in all children. For ◇, the condition must hold in at least one of the children. Note that evaluating □ and ◇ implies computing the transitive closure of the child relation among modules and that this definition of eval does not terminate on cyclic dependency graphs. A terminating definition of eval can be obtained easily by remembering the result of an earlier visit. Otherwise this definition terminates because it is a recursion over a finite expression tree, and no updates are done in the module environments while eval is computed. We will present a terminating (incremental) evaluation algorithm in Section 4.

The function eval is a rephrasing of the definition of the satisfaction relation of a K4 modal logic [2] with attribute equalities as propositions. There are several satisfiability checkers for this logic available [13, 7]. The definition of the operators □ and ◇ resembles tree traversal mechanisms, such as found in ELAN [3], ASF+SDF [4], Stratego [18], JJTraveler [19] and Strafunski [12]. However, since we are in the domain of modal logic and not in the domain of either functional programming or term rewriting, the resemblance is rather coincidental. The main difference between modal logic and tree traversals is that in modal logic the other operators of the language do not perform arbitrary computation but compute truth values using Boolean operators, which is at a higher level of abstraction. Another difference is that these logic operators operate on (possibly circular) graphs instead of trees.
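For concreteness, the semantics of eval could be transcribed into Java roughly as follows. This is a naive, non-incremental sketch of our own, reusing the illustrative Module and Dependency classes from Section 2.1; like the EvaluateCondition function of Algorithm 1 in Section 4, it handles cycles by evaluating □ and ◇ over the set of transitive children.

import java.util.*;

// Conditions of the predicate language (implication can be encoded as ¬C ∨ C;
// the constants true/false are omitted for brevity).
sealed interface Cond permits Eq, Not, And, Or, Box, Diamond {}
record Eq(String attr, Object value) implements Cond {}   // A = V
record Not(Cond c) implements Cond {}                     // ¬C
record And(Cond l, Cond r) implements Cond {}             // C ∧ C
record Or(Cond l, Cond r) implements Cond {}              // C ∨ C
record Box(Cond c) implements Cond {}                     // □C: holds in all children
record Diamond(Cond c) implements Cond {}                 // ◇C: holds in some child

final class NaiveEval {
    static boolean eval(Module m, Cond c) {
        if (c instanceof Eq eq)      return Objects.equals(m.attributes.get(eq.attr()), eq.value());
        if (c instanceof Not n)      return !eval(m, n.c());
        if (c instanceof And a)      return eval(m, a.l()) && eval(m, a.r());
        if (c instanceof Or o)       return eval(m, o.l()) || eval(m, o.r());
        if (c instanceof Box b) {
            for (Module k : transitiveChildren(m)) if (!eval(k, b.c())) return false;
            return true;
        }
        if (c instanceof Diamond d) {
            for (Module k : transitiveChildren(m)) if (eval(k, d.c())) return true;
            return false;
        }
        throw new IllegalArgumentException("unknown condition: " + c);
    }

    // All direct and indirect children; the visited set makes this terminate on cyclic graphs.
    static Set<Module> transitiveChildren(Module m) {
        Set<Module> seen = new LinkedHashSet<>();
        Deque<Module> todo = new ArrayDeque<>(List.of(m));
        while (!todo.isEmpty())
            for (Dependency d : todo.pop().children)
                if (seen.add(d.target)) todo.push(d.target);
        return seen;
    }
}

With these definitions, the linkable predicate of Section 3.3 could be written as new And(new Eq("S", "compiled"), new And(new Eq("main", "yes"), new Box(new Eq("S", "compiled")))).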

3.3 Examples

The following examples illustrate how certain properties of modules can be described by a combination of attributes and predicates. In these examples we use attribute names S for module state and T for module type. They serve to show the flexibility of these rules since many different kinds of action triggering policies can be expressed using this simple formalism.

erroneous        : ◇(S = parse-error)
linkable         : S = compiled ∧ main = yes ∧ □(S = compiled)
not-exec         : S = error ∨ ◇(S = error)
package-modified : T = package ∧ ◇(T = program ∧ S = modified)

Predicate erroneous flags a module as erroneous when one of its children has a parse error. An action that could be triggered when the value of this predicate changes is a user-interface action that disables certain menu options such as, for instance, executing the module. Predicate linkable computes whether a certain module may be linked to a runnable program. If it is a compiled main module and all of its children are compiled, then an action may be triggered to link the program. Predicate not-exec performs a similar computation: the corresponding module is not executable when the module itself or some of its children are in an error state. Finally, predicate package-modified computes that a package can be marked as modified if there is one program in its dependencies that is modified. Such a change of value of this predicate could trigger a decoration in a package explorer view that is part of the IDE.

To show how the evaluation of module predicates uses dependency relations we take, for example, the predicate linkable and show the update process in a few steps. We have a cyclic dependency graph as shown in Figure 2(a). The initial state is consistent; module E has S = error, so there is no module for which all dependencies have S = compiled. After a manual update in module E, its value for S changes to compiled, see Figure 2(b). This triggers an update of all predicates in all modules. Figure 2(c) shows that in three modules the value of linkable changes from false to true. In this example, the module manager will trigger actions right after the initial manual update, and then also for each separate change in valuation of a predicate in a specific module. A linker action could listen to these events and start the linking process for all three main modules.


Figure 2. Automatically updating module predicates after a manual attribute update: (a) initial state; (b) manual update; (c) predicate updates.

4 Implementation

In the previous sections we have presented a high-level design of a language parametric module manager, including a data structure and syntax and semantics for module predicates. This section details some of the engineering trade-offs that are necessary to obtain an open (R3) and scalable (R4) implementation of this design.

4.1 Openness via language independent middleware

A language parametric module manager should easily allow any kind and amount of IDE extensions (R3). This means that many different kinds of components should be able to react to module events. This enables rapid prototyping and evolution of IDEs by reusing third-party components, by incrementally adding new components, and by gradually replacing prototype components. Third-party components can be written in any programming language, but there is even a case for developing heterogeneous components in-house: prototypes are usually more easily implemented in scripting languages, while the eventual product may be developed in a compiled language.

We use the ToolBus component coordination architecture to support this [1]. In a ToolBus-based design all computation is done in tools that connect via IP sockets to a software bus and all coordination is done via a script that describes the behaviour of this bus. As such, the tools can be written in any language and can be connected to the bus being totally oblivious of each other’s existence. ToolBus coordination scripts can express all kinds of collaboration protocols between tools on a high level of abstraction. For example, it is easy to express synchronous and asynchronous communication, broadcasts, and locking. We use these features to construct a generic communication protocol between the module manager and an arbitrary number of tools:

• Attribute/predicate change events are broadcasted asynchronously to listeners;

• Reads and updates of the attributed module dependency graph are done synchronously;

• Reads and updates are guarded by a lock mechanism to rule out race conditions.

Tools may anonymously register as listeners. This partially implements requirement R3 on openness. Openness can be improved further by allowing the module manager to anonymously register an arbitrary amount of module predicates at initialization time. After initialization, the module manager may present the predicates to a K4 modal logic solver in order to compute non-satisfiability and tautology. This is necessary only during development of an IDE (debugging mode). When the IDE is finished and released the set of module predicates will not change anymore.

4.2 Scalability by incremental module predicate evaluation

Algorithm 1 Incremental evaluation of predicates
 1: procedure UpdateAttribute(module, attr, value)
 2:   global visited ← ∅
 3:   value’ ← module(attr)
 4:   if value ≠ value’ then
 5:     module(attr) ← value
 6:     NotifyListeners(module, attr, value, value’)
 7:     EvaluatePredicates(module, attr)
 8: procedure EvaluatePredicates(module, attr)
 9:   if module ∉ visited then
10:     global visited ← visited ∪ {module}
11:     Predicates ← GetDependentPredicates(attr)
12:     for all pred ∈ Predicates do
13:       value ← EvaluateCondition(module, pred.condition)
14:       value’ ← module(pred.name)
15:       if value ≠ value’ then
16:         module(pred.name) ← value
17:         NotifyListeners(module, pred, value, value’)
18:     for all parent ∈ Parents(module) do
19:       EvaluatePredicates(parent, attr)
20: function EvaluateCondition(module, condition)
21:   switch condition
22:     case □x:
23:       children ← GetTransitiveChildren(module)
24:       for all child ∈ children do
25:         if ¬EvaluateCondition(child, x) then return false
26:       return true
27:     case ◇x:
28:       children ← GetTransitiveChildren(module)
29:       for all child ∈ children do
30:         if EvaluateCondition(child, x) then return true
31:       return false
32:     case attr = value:
33:       return (module(attr) equals value)
34:     case . . . :
35:       evaluate simple boolean expressions

Our implementation should scale to large projects (R4). Large projects have large module dependency graphs that frequently contain cyclic dependencies. A straightforward implementation of the semantics of module predicates that was presented in Section 3 would visit all nodes several times, after every single update of an attribute. A small experiment showed immediately that the performance would be too low. In this section we therefore present an incremental algorithm for the efficient evaluation of module predicates.

In Section 3 we have explained that when the truth value of a module predicate changes, an event must trigger all registered listeners. The truth value may change due to a change in attribute values of a pivot module, or due to a change in the configuration of the module dependency graph. A single change in the pivot module may have as effect that many module predicates change value, triggering many actions. An efficient implementation of a module predicate evaluator should at least recalculate the truth values of all predicates that indeed have changed (i.e., the implementation should be correct), while it should avoid wasting time on calculating module predicates that will certainly not change (i.e., the implementation should be incremental).

Algorithm 1 shows an incremental predicate evaluation algorithm in pseudocode. The evaluation is started by the procedure UpdateAttribute that initiates the value change of an attribute in the pivot module. The recursive procedure EvaluatePredicates recalculates the values of all predicates that are directly or indirectly dependent on the value of the changed attribute. Note that the previous value of each module predicate pred is maintained in the module environment as pred.name, thus enabling the detection of value changes with respect to the current value of pred.condition. The function EvaluateCondition computes the value of a given condition by recurring over its structure. The algorithm starts at the pivot, evaluates all module predicates, and works its way up the module dependency graph detecting the other modules that are affected. We do not show the definitions of GetDependentPredicates (gives all predicates that depend on a certain attribute), NotifyListeners (informs the outside world about value changes of attributes or predicates), and GetTransitiveChildren (yields all direct and indirect children of a module). For brevity, we only show the recalculation of predicates that is initiated by the update of an attribute value. When the structure of the dependency graph is changed, a similar recalculation is done.

EvaluatePredicates terminates because every node is visited only once, due to the use of a global worklist. EvaluateCondition terminates because conditions are finitely deep, and it does not traverse the dependency graph. Instead it uses GetTransitiveChildren, which uses a precomputed transitive closure of the dependency graph. Since □ and ◇ are transitively closed (see Section 3) this is a correct implementation. Note that we found that precomputing and caching the transitive closure saves time, since attribute updates are more frequent than adding and removing dependencies. The gain in efficiency, as compared to a naïve implementation as presented in Section 3, is caused by precomputing and caching the transitive closure of the module dependency graph and by evaluating the values of predicates in dependent modules only.

After this sketch of predicate evaluation, it is useful to understand the difference between our predicate evaluation method and conventional attribute evaluation algorithms as used in attribute grammars [16]. Attribute grammar systems take an attributed abstract syntax tree as point of departure. Attributes may be inherited (their value is propagated from root to leaves) or synthesized (their value is propagated from leaves to root). At each node, attribute equations determine the dependencies between attributes. Although attribute grammars can have cyclic attribute dependencies, the graph that holds the attributes is a tree. Our tool distributes attributes on any graph, not just trees, but the computed attributes cannot have cyclic dependencies. Furthermore, module predicates are limited to the computational power of modal logic, allowing extensive static consistency checking.
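The caching of the transitive closure could, for example, be organized as in the following sketch (ours, building on the illustrative classes from earlier sections): the closure is reused across the frequent attribute updates and only thrown away when the dependency structure itself changes.

import java.util.*;

// Illustrative cache of GetTransitiveChildren, invalidated only on structural changes.
final class TransitiveChildrenCache {
    private final Map<Module, Set<Module>> cache = new HashMap<>();

    // Frequent path: attribute updates never touch the cache.
    Set<Module> childrenOf(Module m) {
        return cache.computeIfAbsent(m, TransitiveChildrenCache::compute);
    }

    // Rare path: adding or removing a dependency invalidates the cached closures.
    void dependencyGraphChanged() {
        cache.clear();
    }

    private static Set<Module> compute(Module m) {
        Set<Module> seen = new LinkedHashSet<>();
        Deque<Module> todo = new ArrayDeque<>(List.of(m));
        while (!todo.isEmpty())
            for (Dependency d : todo.pop().children)
                if (seen.add(d.target)) todo.push(d.target);
        return seen;
    }
}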



Figure 3. State transition diagram for the SDF part of an ASF+SDF module

5 Case studies

The module manager as described in the previous sections has been applied successfully in the construction of IDEs for ASF+SDF and Java. In this section we focus on the use of the module manager in these IDEs.

5.1 ASF+SDF Meta-Environment

The ASF+SDF Meta-Environment uses the module manager to keep module states up to date and to store other information such as graph properties, paths, and module names. Since ASF+SDF modules can introduce user-defined syntax it is helpful to treat the SDF part and the ASF part of a module separately. The remainder of this section describes the use of the module manager in ASF+SDF focusing on its use for SDF. An SDF module can be in one of several states. The state diagram in Figure 3 describes the transitions between these states. The transitions themselves are handled by a ToolBus script as explained earlier. Once a module’s state becomes opened, it is possible for the module manager to evaluate the complete or child-error predicates.

complete    : S = opened ∧ □(S = opened)
child-error : ¬(S = error) ∧ ◇(S = error)

The complete predicate is only true when a module and all its children are opened. This indicates that the module has been parsed correctly and that all of its dependencies are free of errors. When a module’s state is complete, an action is triggered that starts the parsing process of the ASF part of the module. Since the ASF part of a module depends on the SDF part, the ASF part can only be parsed when the SDF part is complete. The child-error predicate is only true when one or more of a module’s dependencies fail to parse correctly. This predicate is used as a state value in the IDE. To avoid a module getting the child-error state when already having the error state, a self-check on the error state has been added.


5.2 Java IDE

The Java IDE is a prototype IDE for Java that uses the module manager to keep track of the same module states as for the ASF+SDF Meta-Environment, but also propagates errors and warnings through the package structure. The import structure of Java files is very similar to the import structure of SDF files and therefore the state attribute used in the ASF+SDF Meta-Environment is reused. This also means that we can reuse some of the predicates used in the ASF+SDF Meta-Environment. Apart from the import graph the module manager is provided with a package dependency graph. The modules of this graph consist of the segments of the package name and have Java files as leaves. We introduce the predicates package-error, package-warning and package-modified to describe the desired behaviour of the Java IDE:

package-error    : ◇(S = error)
package-warning  : ¬(◇(S = error)) ∧ ◇(warning = yes)
package-modified : ◇(vcs = modified)

The package-error predicate is true when one or more of a module’s dependencies have an error state. Since package segments do not have state it is not needed to have a self-check on the error state, which is needed in case of the import dependency graph. The package-warning predicate is only true when one or more of a module’s children have warnings. Furthermore, it can only be true if none of its children has the error state. The package-modified predicate has been added to indicate that files are modified according to the version control system. This predicate is true when one or more of a module’s children are modified.

Since Java development depends strongly on package structure, the addition of package predicates is essential for a Java IDE when editing Java source modules. The module manager made it easy to add the package dependency graph and the predicates that propagate through it.

5.3 Analysis

Both case studies have been carried out loading a medium-sized application in the IDE. In the ASF+SDF Meta-Environment case we used the sources of the SDF normalizer specification consisting of 75 modules. For the Java IDE case we used the source code of JSPWiki [11], which consists of nearly 38,000 lines of Java code in 228 source files and 192 libraries. In both IDEs a pivot module has been chosen in such a way that as many modules as possible were influenced by its changes. The profile run is done by editing the pivot module and causing an error. This error propagates through the import graph, evaluating all predicates and finally evaluating child-error to true for all dependent modules. Only a part of all available modules will be influenced by the pivot. Profiling these scenarios indicates that the evaluation algorithm requires a quarter of a second to compute the effects of a change in a medium-sized application. Table 1 shows the results for both case studies. We used a 2.2 GHz CPU with 1 GB of main memory.


                        ASF+SDF Meta-Environment    Java IDE
Nr. of modules          75                          420 (192 are libraries)
Nr. of rules involved   2                           5
Modules evaluated       69                          154
Time                    7 msec.                     265 msec.

Table 1. Performance statistics

6 Conclusions

The module manager in previous versions of the Meta-Environment was implemented entirely in, partly language-specific, ToolBus scripts. The approach described in this paper is completely generic, improves the response time for state changes and reduces the size and complexity of the implementation. The proposed module manager is fully language parametric and allows expressing module semantics in a surprisingly concise way. Module predicates can be used to propagate information through the module dependency graph. This information is IDE-specific and can be used to give feedback to the user or trigger other actions. After a change in one of the modules due to editing, module predicates can be recomputed very efficiently. Based on this positive experience we will further explore the application of this approach to other languages and IDE-features.

Acknowledgments

Jan van Eijck pointed out the similarity between our attribute evaluation mechanism and K4 modal logic; this enabled the reuse of an existing satisfiability checker.

References

[1] Bergstra, J. A. and P. Klint, The discrete time ToolBus – a software coordination architecture, Science of Computer Programming 31 (1998), pp. 205–229.
[2] Blackburn, P., M. de Rijke and Y. Venema, “Modal Logic,” Cambridge University Press, New York, NY, USA, 2001.
[3] Borovansky, P., C. Kirchner, H. Kirchner, P.-E. Moreau and M. Vittek, ELAN: A logical framework based on computational systems, in: J. Meseguer, editor, RWLW96, First International Workshop on Rewriting Logic and its Applications, Electronic Notes in Theoretical Computer Science 4 (1996), pp. 35–50.
[4] van den Brand, M. G. J., A. van Deursen, J. Heering, H. A. de Jong, M. de Jonge, T. Kuipers, P. Klint, P. A. Olivier, J. Scheerder, J. J. Vinju, E. Visser and J. Visser, The ASF+SDF Meta-Environment: a component-based language development environment, in: R. Wilhelm, editor, Compiler Construction (CC ’01), Lecture Notes in Computer Science 2027 (2001), pp. 365–370.
[5] van Deursen, A., P. Klint and J. Visser, Domain-specific languages: An annotated bibliography, ACM SIGPLAN Notices 35 (2000), pp. 26–36.
[6] Eclipse Foundation, The Eclipse tool platform, See: http://www.eclipse.org (2004).
[7] Fauthoux, D., Lotrec – a general tableaux theorem prover in Java, See: http://www.irit.fr/Lotrec (1999).
[8] Feldman, S. I., Make – a program for maintaining computer programs, Software: Practice and Experience 9 (1979), pp. 255–265.
[9] Gamma, E., R. Helm, R. Johnson and J. Vlissides, “Design Patterns,” Addison-Wesley, 1995.
[10] Hennessy, M., “The Semantics of Programming Languages: An Elementary Introduction using Structural Operational Semantics,” John Wiley & Sons, Inc., 1990.
[11] Jalkanen, J., JSPWiki, See: http://www.jspwiki.org (2001).


[12] Lämmel, R. and J. Visser, A Strafunski application letter, in: V. Dahl and P. Wadler, editors, Proceedings of Practical Aspects of Declarative Programming (PADL’03), Lecture Notes in Computer Science 2562 (2003), pp. 357–375.
[13] Le Berre, D., A satisfiability library for Java, See: http://www.sat4j.org (2004).
[14] Mernik, M., J. Heering and A. M. Sloane, When and how to develop domain-specific languages, ACM Computing Surveys 37 (2005), pp. 316–344.
[15] Nielson, F., H. R. Nielson and C. Hankin, “Principles of Program Analysis,” Springer-Verlag, 1999.
[16] Paakki, J., Attribute grammar paradigms — a high-level methodology in language implementation, ACM Computing Surveys 27 (1995), pp. 196–255.
[17] The Apache Software Foundation, Ant, See: http://ant.apache.org (2000).
[18] Visser, E., Program transformation with Stratego/XT: Rules, strategies, tools, and systems in StrategoXT-0.9, in: C. Lengauer et al., editors, Domain-Specific Program Generation, Lecture Notes in Computer Science 3016, Springer-Verlag, 2004, pp. 216–238.
[19] Visser, J., Visitor combination and traversal control, in: OOPSLA ’01: Proceedings of the 16th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications (2001), pp. 270–282.



Fusing a Transformation Language with an Open Compiler

Karl Trygve Kalleberg
Department of Informatics, University of Bergen, P.O. Box 7800, N-5020 BERGEN, Norway

Eelco Visser
Department of Software Technology, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, The Netherlands

Abstract Transformation systems such as Stratego/XT provide powerful analysis and transformation frameworks and concise languages for language processing, but instantiating them for every subject language is an arduous task, most often resulting in half-completed frontends. Open compilers, like the Eclipse Compiler for Java, provide mature frontends with robust parsers and type checkers, but solving language processing problems in general purpose languages without transformation libraries is tedious. Reusing these frontends with existing transformation systems is therefore attractive. However, for this reuse to be optimal, the functional logic found in the frontend should be exposed to the transformation system – simple data serialization of the abstract syntax tree is not enough, as this fails to expose important compiler functionality, such as import graphs, symbol tables and the type checker. In this paper, we introduce a scriptable analysis and transformation framework for Java built on top of the Eclipse Java compiler. The framework consists of an adapter extracted from the abstract syntax tree of the compiler, and an interpreter for the Stratego language. The adapter allows the Stratego interpreter to rewrite directly on the compiler AST. We illustrate the applicability of our system with scripts written in Stratego that perform framework and library-specific analyses and transformations.

1 Introduction

Developing and maintaining frameworks and libraries is at the core of software development. All domain abstractions of software applications are invariably encoded into libraries of a given programming language, and maintained using various language processing tools available for that language, such as compilers, editors, source code navigators, documentation generators, style checkers and static analysis tools. Unfortunately, most of these tools only have a fixed repertoire of functionality which seldom covers all the needs of a given library or framework. Furthermore, relatively few processing tools can quickly and easily be programmed, extended or adapted by the library developer, which often drives developers to implement their own text-based tools from scratch. A preferable solution would be for library developers to quickly write custom scripts in a suitable scripting language and thus implement analyses and transformations specific to their own code bases, such as
style checking and library-specific optimizations. Domain-specific languages (DSLs) for program analysis and transformations are attractive candidates for expressing such scripts, since they allow precise and concise formulations, but they rarely provide robust and mature parsers and type analysers. Open compilers are also attractive as they provide solid parsers and type analysers, but implementing analyses and transformation in their mainstream implementation languages is often very time-consuming. In this paper, we obtain the best of both worlds by combining Stratego, a DSL for program transformation and the open Eclipse Compiler for Java (ECJ), using a program object model (POM) adapter. The POM adapter welds together the Stratego runtime and the ECJ abstract syntax tree (AST), by translating Stratego rewriting operations on the fly to suitable method calls on the AST API. This obviates the need for data serialization, and the technique can be applied to many tree-like APIs, and is reusable for other rewriting systems. Using this adapter, Stratego becomes a compiler scripting language, offering its powerful features for analysis and transformation such as pattern matching, rewrite rules, generic tree traversals, and a reusable library of generic transformation functions and data-flow analysis. This combination is a powerful platform for programming domain-specific analyses and transformations. We will argue that the system can be wielded by advanced developers and framework providers because large and interesting classes of domain-specific analyses and transformations can be expressed by reusing the libraries provided with Stratego. The contributions of this paper include the fusing of a DSL for language processing with an open compiler without resorting to data serialization. This brings the analysis and transformation capabilities of modern compiler infrastructure into the hands of advanced developers through a convenient and feature-rich transformation language. The technique is reusable for other transformation languages, and can help make transformation tools and techniques practical and reusable both by compiler designers and by framework developers, as it directly integrates them with stable tools like the Java compiler – developers can write interesting classes of analyses and transformations easily and compiler designers can experiment with prototypes of analyses and transformations before committing to a final implementation. We validate the system’s applicability through a series of examples taken from mature and well-designed applications and frameworks. The remainder of this paper is organised as follows: In Sec. 2, we discuss the POM adapter and how it connects Stratego with ECJ. In Sec. 3, we show the practical applicability of our prototype on a series of common, framework-specific analysis and transformation problems. In Sec. 4, we discuss the implementation details of our prototype. In Sec. 5, we cover related work. In Sec. 6, we discuss some trade-offs related to our technique before we conclude in Sec. 7.

2 The Program Object Model Adapter

The program object model adapter is the linchpin in the composition of the compiler and the program transformation language. A program object model (POM) is our name for the object model representing a program in the compiler. This is typically an AST with symbol tables and other auxiliary data structures, such as import graphs. The POM adapter translates the primitive rewriting operations of the rewriting engine to method calls on the POM API.


Consider Fig. 2, which shows the principal components of our system. At the bottom, the ECJ provides an AST API for modifying and inspecting its internal program object model. The AST is implemented in a traditional object-oriented style. Each node type in the AST, such as CompilationUnit, is represented by a concrete class. Children of a node can be retrieved using get-methods, and replaced using set-methods. New nodes can be constructed using methods, such as newCompilationUnit(), in the AST factory.

The Stratego interpreter is a rewriting engine, or runtime, for the Stratego term rewriting language (more on this later), written in Java. It executes scripts compiled to an abstract machine. The crucial feature of the interpreter is that it abstracts over the actual term implementation. Any data structure that can provide a suitable interface can be treated as terms, and rewritten. The job of the POM adapter is to adapt tree-like data structures so that they can be transformed with Stratego. It does this by wrapping a POM in the term interface required by the interpreter. The adapter translates term rewriting operations to POM API method calls that are executed directly on the POM, without any intermediate data serialization. The interpreter also has a facility for calling foreign functions, i.e. functions implemented in Java. The interface between Stratego and ECJ includes a small foreign function interface (FFI) that exposes parts of the native Eclipse AST API as Stratego library functions. These allow Stratego scripts to ask for the type of a suitable node using type-of, the supertype using supertype-of, and more.

Fig. 2: Architecture (from top to bottom: compiled user script, Stratego interpreter, POM adapter and FFI library, POM (AST API), Eclipse Compiler for Java).

Our prototype system is available as a stand-alone, command-line application based on Eclipse, and as a reusable Eclipse plugin. In stand-alone mode, the system performs source-to-source transformation. The user supplies the path of a project and a script to execute. The script can use the FFI to traverse the project directories and to parse source files, to obtain their AST. After rewriting, the script can also use the FFI to write modified ASTs back to disk, as source code. In plugin mode, interpreter objects may be instantiated with arbitrary scripts. Scripts can be executed directly on individual ASTs by calling execute methods on the interpreter object. This allows scripts to be used for very fine-grained source code queries and transformations.

2.1 Scripting in Stratego

Stratego is a DSL for language processing based on the paradigm of strategic term rewriting. In our context, terms are essentially equivalent to ASTs. Stratego has language constructs that make it well-suited for language processing, such as pattern matching, generic tree traversals, rewrite rules and powerful combinators for expressing strategies for rewriting and analysis. Essential constructs of Stratego are explained below. Patterns are written using prefix notation of the form SimpleName("b"), and can contain variables, e.g. SimpleName(n), where n is a term variable. A term is a pattern that does not contain variables. Lists are written as [1,2,3]. Terms are built (instantiated) from patterns using the build operator (!): !MethodInvocation(obj, name, [], []), where obj and name are variables. Operators are applied to an implicit current term; a build


replaces the current term. Patterns are matched against terms using the match operator (?): ?SimpleName(x) binds the term variable x to the subterm of SimpleName. Matching fails if the pattern does not match, i.e. if the current term is not a SimpleName term. Strategy expressions combine basic operations such as match and build into more complex transformations. Since match can fail, strategy expressions may fail as well. Combinators are used to compose expressions and handle failures; for example, the left-choice combinator s0 <+ s1 applies s0 and, only if that fails, applies s1. The basic constructs and some common syntactic sugar are summarised below.

Basic Stratego Constructs

  !p(x)      (build) Instantiate the term pattern p(x) and make it the current term
  ?p(x)      (match) Match the term pattern p(x) against the current term
  s0 <+ s1   (left choice) Apply s0. If s0 fails, roll back, then apply s1
  s0 ; s1    (composition) Apply s0, then apply s1. Fail if either s0 or s1 fails
  id, fail   (identity, failure) Always succeeds/fails. Current term is not modified
  one(s)     Apply s to one direct subterm of the current term
  some(s)    Apply s to as many direct subterms of the current term as possible
  all(s)     Apply s to all direct subterms of the current term

Syntactic Sugar

  \ pl(x) -> pr(x) \   Anonymous rewrite rule from term pattern pl(x) to pr(x)
  ?x@p(y)              Equivalent to ?x ; ?p(y): bind the current term to x, then match p(y)
  <s> p(x)             Equivalent to !p(x) ; s: build p(x), then apply s
  s => p(x)            Equivalent to s ; ?p(x): match p(x) on the result of s

3 Domain-specific Analyses and Transformations

In this section, we motivate the applicability of our system by showing some framework-specific analyses and transformations. The examples in this section illustrate what an advanced framework developer with a good working knowledge of language processing and


Stratego could implement. However, Stratego is capable of performing significantly more advanced analyses and transformations than shown here. See [15,4,10] for some examples.

3.1 Project-specific Code Style Checking

Software projects of non-trivial size always adopt some form of (moderately) consistent code style to aid maintenance and readability. We are concerned with checking for proper implementation and proper use of domain abstractions. Consistency of implementation can be improved by encouraging systematic use of particular idioms. The following idiom is taken from the AST implementation in ECJ.

Bounds Checking Idiom. Consider the following code for iterating over x:

  for(int i = 0; i < x.length(); i++) { ... }

If x is a value object of type T, i.e. happens to be immutable, then the length() method will be invoked needlessly for every iteration. The JIT may eventually inline this call, but only if the code is executed frequently enough. One might want to encourage a coding style that is also efficient with the bytecode interpreter:

  { final int sz = x.length();
    for(int i = 0; i < sz; i++) { ... } }

This idiom is used throughout the implementation of the internal AST classes of the ECJ, and may be checked using the following function:

  check-for =
    ?ForStatement(_, e, _, _) ; e

  call-to-immutable =
    ?MethodInvocation(_, _, _, _, _, [])
    ; binding-of => MethodBinding(class-name, _, _, _)
    ; immutable-classes
    ; emit-warn(|"Call to method on immutable object in loop iteration")

check-for should be applied to a for-statement. If any of the condition expressions are calls to parameterless methods on objects of an immutable type, a warning is emitted. The list of known, non-mutating methods is kept in the global variable immutable-classes (technically, immutable-classes is a Stratego overlay, but this amounts to a global, immutable variable in our case). With data-flow analysis we could even consider method calls on objects which are not immutable; as long as the body of the for-loop does not invoke any mutating operation and does not pass x as an argument to another function, we can assume immutability. By keeping (typename, methodname) pairs in a list, say immutable-methods, we can look up the immutability property.
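For comparison, an equivalent check phrased directly against the ECJ DOM API needs an explicit visitor. The sketch below is illustrative only, not part of the system described in this paper; the reporting is deliberately trivial, while ASTVisitor, ForStatement, MethodInvocation and the binding methods are the standard org.eclipse.jdt.core.dom API.

  import org.eclipse.jdt.core.dom.*;

  // Illustrative sketch: warn about parameterless calls in for-loop conditions
  // whose declaring class is on a known list of immutable classes.
  class BoundsCheckIdiomVisitor extends ASTVisitor {
      private final java.util.Set<String> immutableClasses;

      BoundsCheckIdiomVisitor(java.util.Set<String> immutableClasses) {
          this.immutableClasses = immutableClasses;
      }

      public boolean visit(ForStatement loop) {
          Expression condition = loop.getExpression();   // the loop condition; may be null
          if (condition == null) return true;
          condition.accept(new ASTVisitor() {
              public boolean visit(MethodInvocation call) {
                  IMethodBinding binding = call.resolveMethodBinding();
                  if (call.arguments().isEmpty() && binding != null
                          && immutableClasses.contains(
                                 binding.getDeclaringClass().getQualifiedName())) {
                      warn(call);
                  }
                  return true;
              }
          });
          return true;
      }

      private void warn(ASTNode node) {
          // A real checker would create an editor marker; printing keeps the sketch self-contained.
          System.err.println("Call to method on immutable object in loop iteration"
                  + " (offset " + node.getStartPosition() + ")");
      }
  }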

3.2 Custom Data-Flow Analysis

Totem propagation is a kind of data-flow analysis where variables in the source code are marked with annotations, called totems [10].




These assert properties on the variables which are later used by other analyses and transformations. A meta-program will perform data-flow analysis and propagate the asserted totems throughout the code, following the same principles as constant propagation.

Totem propagation is in many ways similar to typestate analysis, which is "a dataflow analysis for verifying the operations performed on variables obey the typestate rules of the language" [16]. Typestate analysis is mostly concerned with verifying protocols, such as ensuring that files are opened before they are read. Totem propagation uses the same data-flow machinery to discover opportunities for optimizing away unnecessary calls (such as a call to sort() on a sorted list) or replacing costly operations with cheaper ones (such as binary search instead of linear search on sorted lists). Meta-programs performing these forms of data-flow analyses must be aware of the propagation rules for each kind of totem.

A totem propagator could be useful for removing dynamic boundary checks in a library for matrix computations. The following interface is found in the Matrix Toolkits for Java (MTJ) library [1]:

  interface Matrix {
    Matrix add(Matrix B);
    Matrix mult(Matrix B, Matrix C);
    Matrix transpose();
    ...
  }

These operations have certain, well-defined requirements. Two matrices, A and B, may only be added if they have the same dimensions, i.e. A has the same number of rows and columns as B. Two matrices, A and B, may be multiplied and placed into C if the number of columns of A equals the number of rows of B; the dimensions of C must be equal to the number of rows of A and the number of columns of B. Transposition of a matrix swaps the row and column dimensions. These rules are violated by the following code:

  Matrix m = new DenseMatrix(5,4);
  Matrix n = new DenseMatrix(4,6), z = new DenseMatrix(5,6), w = new DenseMatrix(3,5);
  m.mult(n,z);
  z.transpose();
  z.mult(m,w);   // m and w incompatible

All dimensions are compatible for the first two operations, but not for the final z.mult(m,w). The matrix operations in MTJ will verify dimensions before calculating and throw exceptions if the preconditions are not met. Performance-wise, this is costly, and latent mismatches may lurk in seldom used code.

To alleviate this problem, we can apply a totem propagator which knows how to propagate and verify the dimensions of matrix operations. Initial dimensions can be picked up from programmer-supplied assertions (in the form of a comment // @dim(m,4,3)) or from the variable initialization. Whenever a dimension is asserted for a variable in the code, a new dynamic rule Dimensions : name -> dim is created that remembers the asserted dimensions dim for a variable name. Dynamic rules are like regular rules, except that they can be introduced, updated and removed at runtime. If a Dimensions rule with the same left-hand side name already exists, it is updated to the (potentially) new dim. This rule can then be applied (and updated) when propagating the dimension totem across a transposition:

  PropTotem =
    ?MethodInvocation(src, SimpleName("transpose"), _, [])
    ; <type-of> src => "no.uib.cipr.matrix.Matrix"
    ; <Dimensions> src => [rows, columns]
    ; rules(Dimensions : src -> [columns, rows])

Here, the old dimensions (if they are known) will be swapped and the Dimensions rule updated. There are other (overloaded) PropTotem functions which deal with addition and multiplication. The propagator core is based on the general constant propagation framework proposed by Olmos and Visser [15], but is adapted to propagate arbitrary data properties, not just constants:

  prop-totem = PropTotem
    ... (s2r, s2c) ; dst => (dr, dc)
    ; !s1c => s2r ; !s2c => dc ; !s1r => dr

The where clause is a rewriting condition which ensures that the mult call is on the correct data type and that the dimensions are compatible. This rewrite rule is all that is needed to turn the analysis from Sec. 3.2 into an optimizing code transformation.

Optimizing Loop Boundary Checks. The bounds checking idiom from the previous section can also be turned into a code transformation:

  OptimizeFor :
    ForStatement(init, cond, incr, body) ->
    Block([vdecls, [ForStatement(init, cond', incr, body)]])
    where <collect(is-immutable)> cond => call-var-pairs
        ; <map(\ (e, v) -> vardecl(<type-of> e, v, e) \)> call-var-pairs => vdecls
        ; <RewriteImmutable> cond => cond'

The generic collect function is used with is-immutable to find all invocations of get-like methods in the condition expression. For each expression, a new uniquely named variable is created (by new-names) and a variable declaration for it is created that gets added before the for loop. Each expression is replaced with its corresponding, freshly named, temporary variable using the RewriteImmutable function, thus avoiding any name capture in the generated code.

4 Implementation

The ECJ AST is a class hierarchy consisting of abstract and concrete classes. For example, all expression nodes, such as InfixExpression, inherit from the abstract Expression class. The root node of the hierarchy is the abstract class ASTNode. The AST hierarchy is adapted to the term interface expected by the rewriting engine using the POM adapter.

4.1 Term Interface

The term interface is a generalization of the ATerm interface used by various term rewriting systems, such as ASF+SDF [18], Tom [13] and Stratego [4]. There are two levels to this interface, depending on whether read-only traversals or full rewriting is desired.

Inspection Interface. The inspection interface is a class hierarchy. At its root we find ITerm. There are four distinct primitive term types deriving from ITerm, for integers, strings, lists and applications. The essential methods of ITerm are given below.

  public int getPrimitiveTermType();
  public ITermConstructor getConstructor();
  public int getSubtermCount();
  public ITerm getSubterm(int index);
  public boolean isEqual(ITerm rhs);

The getPrimitiveTermType() method returns an integer specifying which primitive term type is represented by a given ITerm object. Most AST nodes are application nodes. An application C(t0, ..., tn) consists of a constructor name C and a list of subterms t0 through tn. The number and types of the subterms are given together with the constructor name in a signature, e.g.

  signature EclipseJava
    constructors
      InfixExpression : String * Expression * Expression -> Expression
      ...

This declares to Stratego that InfixExpression terms have three children, the first being a string and the remaining two being expressions. The declaration corresponds to the AST class InfixExpression. For each concrete AST node type, a constructor is generated. For each abstract AST node type, a sort is generated. The sets of constructors and sorts define the EclipseJava signature.

Calling getConstructor() on an InfixExpression will return an object that can be queried for the constructor name (in this case InfixExpression) and arity (in this case three). Calling getSubtermCount() will return three, and the method getSubterm() can be used to retrieve any of the subterms. The isEqual() method performs a deep equality check. Stratego allows pattern matching with variables; all the code for handling variable bindings is kept inside the interpreter implementation, to keep the POM adapter interface minimal.

Concrete implementations of the ITerm inspection interface can be derived mostly automatically from the AST class hierarchy. Each concrete class in the ECJ AST requires a small adapter class, all of it generated boilerplate. The only place where human intervention is needed is to decide how the subtrees in the AST should map to an ordered set of terms, e.g.:

  class WrappedPackageDeclaration implements ITermAppl {
    private PackageDeclaration wrappee;
    ...
    public ITerm getSubterm(int index) {
      switch(index) {
        case 0: return ECJFactory.wrap(wrappee.getPackage());
        case 1: return ECJFactory.wrap(wrappee.imports());
        case 2: return ECJFactory.wrap(wrappee.types());
      }
      throw new ArrayIndexOutOfBoundsException();
    }
  }

In the current implementation, AST nodes are wrapped lazily, thus wrapping only occurs when needed. When AST nodes are traversed by the rewriting engine, the AST node children are wrapped progressively, as terms are unfolded.

Generation Interface. The POM adapter technique does not require an implementation of the generation interface, but if one is not provided, only analysis and not rewriting can be done. The following are the essential factory methods that must be provided.

  interface ITermFactory {
    ...
    public ITerm makeAppl(ITermConstructor ctor, ITerm[] args);
    public ITerm makeString(String s);
    public ITerm makeInt(int i);
    public ITerm makeList(ITerm[] args);
  }

Default implementations exist for strings, lists and integers. Only the makeAppl method must be done by hand. In our prototype, this method forwards constructor requests to the appropriate factory methods of the ECJ AST; when it sees a request for constructing, say, a PackageDeclaration node, the request is forwarded to newPackageDeclaration() of the ECJ AST factory.

  class ECJFactory implements ITermFactory {
    ...
    public ITerm makeAppl(ITermConstructor ctor, ITerm[] args) {
      switch(constructorMap.get(ctor.getName())) {
        ...
        case PACKAGE_DECLARATION: return makePackageDeclaration(args);
        ...
      }
      ...
    }
  }

makePackageDeclaration will first ensure that args contains the correct number and types of arguments, call AST.newPackageDeclaration(), then use the relevant set methods on the resulting PackageDeclaration object to complete the construction. The use of switch(constructorMap.get(ctor.getName())) is a performance trick for mapping constructor names to constructor methods. The EclipseJava signature completely declares the structure of legal terms that ECJFactory will allow.
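As an illustrative sketch (not the actual ECJFactory code), a forwarding method for the InfixExpression constructor declared in the EclipseJava signature could look as follows. The unwrap, wrap and stringValueOf helpers are hypothetical stand-ins for whatever the factory uses internally; newInfixExpression(), InfixExpression.Operator.toOperator() and the operand setters are the standard ECJ AST API.

  // Hypothetical sketch of a forwarding constructor method inside ECJFactory.
  // Arity and subterm types are checked against the EclipseJava signature
  // (String * Expression * Expression -> Expression) before the ECJ factory is called.
  private ITerm makeInfixExpression(ITerm[] args) {
      if (args.length != 3)
          return null;                                   // let makeAppl fall back to a native/mixed term
      String op = stringValueOf(args[0]);                // hypothetical helper: term -> Java string
      Object left  = unwrap(args[1]);                    // hypothetical helper: term -> wrapped ECJ node
      Object right = unwrap(args[2]);
      if (!(left instanceof Expression) || !(right instanceof Expression))
          return null;
      InfixExpression e = ast.newInfixExpression();      // ast: the org.eclipse.jdt.core.dom.AST factory
      e.setOperator(InfixExpression.Operator.toOperator(op));
      e.setLeftOperand((Expression) left);
      e.setRightOperand((Expression) right);
      return wrap(e);                                    // hypothetical helper: ECJ node -> ITerm
  }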

4.2 Design Considerations

Functional Integration – The type analysis functions such as type-of are calls to the ECJ type checker, through the FFI library in Fig. 2. Invoking type-of on, say, an InfixExpression term, will result in a call to resolveTypeBinding() on the InfixExpression object wrapped by this term. Stratego is dynamically typed, and only the arity of terms is statically guaranteed. If, say, a SimpleName term is passed to type-of, the FFI stub for type-of will detect this and fail, just like any expression in Stratego can fail.

Imperative and Functional Data Structures – The rewriting engine assumes a functional data structure; in-place updates to existing terms are not allowed. The generation interface is designed so that existing terms will never be modified – there simply are no operations for modifying existing terms. This makes wrapping imperative data structures in such a functional dress relatively straightforward – the compiler need not provide one. The only restriction is that AST nodes must not change behind the scenes, i.e. the rewriting engine must have exclusive access while rewriting. For in-place rewriting systems, e.g. TXL [6], a slight modification of the ITerm interface would be necessary so that subterms of existing terms can be modified in place.

Efficiency Considerations – Using a functional data structure provides some appealing properties for term comparison and copying. As described in [7], maximal sharing of subterms (i.e. always representing two structurally identical terms by the same object) offers constant-time term copying and structural equality checks as these reduce to pointer copying and comparisons, respectively. This is important for efficient pattern matching because term equivalence is deep structural equality, not object (identity) equality. The ECJ AST interface provides deep structural matching, but this is not constant-time. This can be provided in the POM adapter, but then lazy wrapping must be given up. Hash codes must also be computed deeply. The hash code must be computed from the structure of the term, and not the object identity of the AST node, since the equality is structural (two objects that are equal should have the same hash code). Once a hash code has been computed, it can be memoized, as the subterms will never change. The memory footprint of the wrapper objects is small: each object has only two fields. By keeping a (weak) hash table of the AST nodes already wrapped, the overhead is reduced even further. The current implementation takes just over four minutes to run the bounds checking idiom analysis on the entire Eclipse code base (about 2.7 million lines of code), on a 1.4GHz laptop with 1.5GB of RAM. Complicated transformations are limited by the efficiency of the current Stratego interpreter, not the adapter. Compiling the scripts to Java byte code, instead of the abstract Stratego machine, should significantly improve performance for complicated scripts.

Strongly vs Weakly Typed ASTs – The ECJ AST is strongly typed, and the term rewriting system needs to respect this. Stratego is dynamically typed and would normally allow the term InfixExpression(1,BooleanLiteral(0),3) to be constructed, even though the subterms must be String and Expression, as declared previously (making 1, 3 invalid subterms). ECJFactory has two modes for dealing with this. In strict mode, the factory bars invalid EclipseJava terms from being built. As a result, the build expression !InfixExpression(1,BooleanLiteral(0),3) fails. Terms without any EclipseJava terms, such as (1,2,3), can be built freely. These will not be represented as EclipseJava terms, but by the default internal term library of the interpreter. We call these terms without EclipseJava constructors native terms. In lenient mode, mixed terms consisting of native and EclipseJava terms are allowed, such as InfixExpression(1,BooleanLiteral(0),3). The subterm BooleanLiteral remains an EclipseJava term, but 0 and 3 are native terms. The root term, InfixExpression, becomes a mixed term, and is also handled by the native term library. Since all terms are constructed from their leaves up (ITermFactory forces this), ECJFactory can determine when it can build an EclipseJava term, inside its makeAppl() method: iff all subterms are EclipseJava terms, and are compatible with the requested constructor, an EclipseJava term is built, otherwise a mixed term must be constructed. ECJ FFI functions will fail if they are passed mixed terms. Java programs, such as Eclipse plugins, using the Stratego interpreter to rewrite ASTs will receive an ITerm as the result from the interpreter. They should perform a dynamic type check to ensure that the ITerm is a wrapped ECJ AST node, and not a mixed or native term. Rewritings can result in structurally valid but semantically invalid ASTs, for example by removing a method which is called elsewhere from a class. Neither Stratego nor the ECJ AST API checks for this. However, a subsequent type reanalysis will uncover the problem. If the type analysis functions are used as transformation pre-condition checks, one can ensure that a transformation is always type correct.
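The memoization can be made concrete with a small sketch (illustrative only, not the actual wrapper code). It assumes the ITerm and ITermAppl interfaces from Sec. 4.1 and, additionally, a getName() accessor on ITermConstructor, which is an assumption.

  // Illustrative sketch of an application-term wrapper with structural equality
  // and a memoized structural hash code. Caching is safe because terms are never
  // modified after construction.
  abstract class WrappedApplTerm implements ITermAppl {
      private Integer structuralHash;   // computed lazily, then cached

      public boolean isEqual(ITerm other) {
          if (this == other) return true;                        // cheap identity shortcut
          if (getSubtermCount() != other.getSubtermCount()) return false;
          if (!getConstructor().getName().equals(other.getConstructor().getName()))
              return false;                                      // getName() is an assumed accessor
          for (int i = 0; i < getSubtermCount(); i++)
              if (!getSubterm(i).isEqual(other.getSubterm(i))) return false;
          return true;
      }

      public int hashCode() {
          if (structuralHash == null) {                          // structure-based, not identity-based
              int h = getConstructor().getName().hashCode();
              for (int i = 0; i < getSubtermCount(); i++)
                  h = 31 * h + getSubterm(i).hashCode();
              structuralHash = h;
          }
          return structuralHash;
      }
  }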

5 Related work

Language processing is what program transformation systems like Tom [13], TXL [6], ASF+SDF [18] and Stratego [4] were designed for. Programmable static analysis tools such as CodeQuest [9], CodeSurfer [2] and PQL [12],


all support writing various kinds of flow- and/or context-sensitive program analyses, in addition to (often limited) queries on the AST. Pluggable type systems, an implementation of which is described by Andreae et al. [3], also offer static analysis capabilities. Developers can express custom type checking rules on the AST that are executed at compile-time so as to extend the compiler's type checking. Neither programmable static analysis tools nor pluggable type systems support source code transformations, however.

Languages for refactoring such as JunGL [20] and ConTraCT [11] provide both program analysis and rewriting capabilities. JunGL is a hybrid between an ML-like language (for rewriting) and Datalog (for data-flow queries), whereas ConTraCT is based on Prolog. JunGL supports rewriting on both trees and graphs, but is a young language and does not (yet) support user-defined data types. Stratego is a comparatively mature program transformation language with sizable libraries and built-in language constructs for data- and control-flow analysis, handling scoping and variable bindings, and pattern matching with concrete syntax (not demonstrated in this paper), that comes with both a compiler and interpreter, and has been applied to processing various other mainstream languages such as C and C++ [4].

Open compilers such as SUIF [21], OpenJava [17], OpenC++ [5] and Polyglot [14] offer extensible language processing platforms, and in many open compilers, the entire compiler pipeline, including the backend, is extensible. Constructing and maintaining such an open architecture is a laborious task. As we have shown, many interesting classes of domain-specific analyses and transformations require only the front-end to be open. Exposing just the front-end is less demanding than maintaining a fully open compiler pipeline. In principle, we could have plugged Stratego into any of these compilers.

A key strength of Stratego is generic traversals (built with one, some and all) that cleanly separate the code for tree navigation from the actual operations (such as rewrite rules) performed on each node. The JJTraveler visitor combinator framework is a Java library described by van Deursen and Visser [19] that also provides generic traversals. Generic traversals and visitor combinators go far beyond traditional object-oriented visitors, and the core interface required by both approaches is very similar. Comparing the Visitable interface of JJTraveler, the ATerm interface found in ASF+SDF and the Stratego C runtime suggests that the POM adapter should be reusable for all of these systems, implementation language issues notwithstanding (C for ASF+SDF, and Java for JJTraveler and our interpreter).

A related approach to rewriting on existing class hierarchies is presented in Tom [13]. Tom is a language extension for Java that provides features for rewriting and matching on existing class hierarchies. Recent versions also support generic traversals in the style of JJTraveler, but its library of analyses is still rather small. High-level analyses are also provided by Engler et al. [8], where a system for checking system-specific programming rules for the Linux kernel is described. These rules are implemented as extensions to an open compiler. Our system is different in that it can also perform arbitrary code transformation, and that the language we use to implement our rules is a feature-rich transformation language designed for language processing.
For language processing problems, Stratego has the advantage of a sizable library of generic transformations, traversals and high-level data-flow analysis, in addition to its novel language features. The net result is that transformation code becomes both precise and concise.



6 Discussion

Recent research has provided pluggable type systems, style checkers and static analysis with scripting support. The appealing feature of our system, and that of JunGL and ConTraCT, is that we can also script source code transformations based on the analysis results. The trade-off with using a domain-specific language for scripting is that the same language features that make the language powerful and domain-specific also make it more difficult to learn. This can be offset in part by good documentation, and a sizeable corpus of similar code to learn from.

A compiler scripting language can also provide an appealing part of a testbed for prototyping language extensions, new compiler analyses and transformations, because its high-level constructs offer rapid prototyping. The plethora of custom analysis and transformation tools suggests that compiler writers should cater for potential extenders in their infrastructure design. As we have demonstrated, even a rather simple inspection interface is sufficient for read-only analysis, and by adding functionality for building AST nodes, general rewriting can be scripted. We are currently experimenting with improving our tools for (mostly) automatic generation of POM adapters using compile-time reflection techniques over the AST classes of existing frontends. We are testing these tools against other frontends such as the reference Java compiler from Sun and the Polyglot compiler [14].

A limitation of ECJ is that rewriting the AST will invalidate the type information. After rewriting, complete type reanalysis must be performed to restore accurate type information. An open compiler with incremental type reanalysis would help in ensuring that the transformation is semantically correct, as "safe points" can be defined in the transformation where the (intermediate) result is checked for type-correctness.

Stratego does not have any fundamental limitations on the types of analyses and transformations it can express. The language is Turing-complete, and can express both imperative and functional algorithms for program analysis and transformation. Special support exists, in the form of reusable strategy libraries and language constructs such as dynamic rules, for performing control- and data-flow analysis over subject programs represented as terms, i.e. abstract syntax trees. Please refer to [15] for more details on these features. In practice, the current performance of the interpreter may be a limiting factor for particularly resource-intensive analyses and transformations. In these cases, the C-based Stratego/XT infrastructure [4] may be an alternative. In the future, we anticipate a Java bytecode backend for the Stratego compiler. Certain whole-program analyses may require very efficient implementations of specific data structures, such as binary decision diagrams (BDDs). Stratego does not currently have a library providing BDDs.

7 Conclusion

We have presented the design of the program object model adapter, and shown how this technique can combine the Stratego rewriting language and the Eclipse Compiler for Java, yielding a powerful solution for scripting domain-specific analyses and transformations due to the stability of ECJ and the features of Stratego – the analyses and transformations are expressed precisely and concisely. We have shown that even a relatively small degree of extensibility on the part of the compiler is sufficient for plugging in a rewriting system, motivated that the POM adapter can be reused for other, tree-like data structures, and that


its design is also applicable to other rewriting engines. The applicability of the system was demonstrated through a series of analysis and transformation problems taken from mature and well-designed frameworks. We would like to thank the anonymous reviewers for helpful comments on this paper. Kalleberg is supported by the Norwegian Research Council, grant PLI-AST.

References

[1] Matrix Toolkits for Java. http://rs.cipr.uib.no/mtj/, 2006.
[2] P. Anderson and T. Teitelbaum. Software inspection using CodeSurfer. In Proceedings of WISE'01 (Int'l Workshop on Inspection in Software Engineering), 2001.
[3] C. Andreae, J. Noble, S. Markstrum, and T. Millstein. A framework for implementing pluggable type systems. In Proceedings of OOPSLA'06: Conference on Object-Oriented Programming, Systems, Languages, and Applications, New York, NY, USA, 2006. ACM Press.
[4] M. Bravenboer, K. T. Kalleberg, R. Vermaas, and E. Visser. Stratego/XT 0.16. Components for transformation systems. In ACM SIGPLAN 2006 Workshop on Partial Evaluation and Program Manipulation (PEPM'06), Charleston, South Carolina, January 2006. ACM SIGPLAN.
[5] S. Chiba. A metaobject protocol for C++. In OOPSLA '95: Proceedings of the tenth annual conference on Object-oriented programming systems, languages, and applications, pages 285–299, New York, NY, USA, 1995. ACM Press.
[6] J. R. Cordy. TXL - a language for programming language tools and applications. ENTCS, 110:3–31, 2004.
[7] M. G. J. van den Brand, H. A. de Jong, P. Klint, and P. A. Olivier. Efficient annotated terms. Softw. Pract. Exper., 30(3):259–291, 2000.
[8] D. R. Engler, B. Chelf, A. Chou, and S. Hallem. Checking system rules using system-specific, programmer-written compiler extensions. In OSDI, pages 1–16, 2000.
[9] E. Hajiyev, M. Verbaere, O. de Moor, and K. de Volder. CodeQuest: querying source code with datalog. In OOPSLA '05: Companion to the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, pages 102–103, New York, NY, USA, 2005. ACM Press.
[10] K. T. Kalleberg. User-configurable, high-level transformations with CodeBoost. Master's thesis, University of Bergen, P.O.Box 7800, N-5020 Bergen, Norway, March 2003.
[11] G. Kniesel and H. Koch. Static composition of refactorings. Sci. Comput. Program., 52(1-3):9–51, 2004.
[12] M. Martin, B. Livshits, and M. S. Lam. Finding application errors and security flaws using PQL: a program query language. In OOPSLA '05: Proc. of the ACM SIGPLAN Conf. on Object-oriented programming, systems, languages, and applications, pages 365–383, New York, NY, USA, 2005. ACM Press.
[13] P.-E. Moreau, C. Ringeissen, and M. Vittek. A pattern matching compiler for multiple target languages. In 12th International Conference on Compiler Construction, LNCS, pages 61–76. Springer, 2003.
[14] N. Nystrom, M. R. Clarkson, and A. C. Myers. Polyglot: An extensible compiler framework for Java. In 12th International Conference on Compiler Construction, LNCS, pages 138–152. Springer, 2003.
[15] K. Olmos and E. Visser. Composing source-to-source data-flow transformations with rewriting strategies and dependent dynamic rewrite rules. In R. Bodik, editor, 14th International Conference on Compiler Construction (CC'05), volume 3443 of Lecture Notes in Computer Science, pages 204–220. Springer-Verlag, April 2005.
[16] R. E. Strom and D. M. Yellin. Extending typestate checking using conditional liveness analysis. IEEE Trans. Softw. Eng., 19(5):478–485, 1993.
[17] M. Tatsubori, S. Chiba, K. Itano, and M.-O. Killijian. OpenJava: A class-based macro system for Java. In Proceedings of the 1st OOPSLA Workshop on Reflection and Software Engineering, pages 117–133, London, UK, 2000. Springer-Verlag.
[18] M. G. J. van den Brand, A. van Deursen, J. Heering, H. A. de Jong, M. de Jonge, T. Kuipers, P. Klint, L. Moonen, P. A. Olivier, J. Scheerder, J. J. Vinju, E. Visser, and J. Visser. The ASF+SDF meta-environment: A component-based language development environment. In CC '01: Proceedings of the 10th International Conference on Compiler Construction, pages 365–370, London, UK, 2001. Springer-Verlag.
[19] A. van Deursen and J. Visser. Building program understanding tools using visitor combinators. In Proceedings 10th Int. Workshop on Program Comprehension, IWPC 2002, pages 137–146. IEEE Computer Society, 2002.
[20] M. Verbaere, R. Ettinger, and O. de Moor. JunGL: a scripting language for refactoring. In ICSE '06: Proceedings of the 28th international conference on Software engineering, pages 172–181, New York, NY, USA, 2006. ACM Press.
[21] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K. Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy. SUIF: an infrastructure for research on parallelizing and optimizing compilers. SIGPLAN Not., 29(12):31–37, 1994.



Implementing a Domain-Specific Language using Stratego/XT: an experience paper

Leonard G. C. Hamey and Shirley N. Goldrei

Computing Department, Macquarie University, Sydney, Australia

Abstract

We describe the experience of implementing a Domain-Specific Language using transformation to a General Purpose Language. The domain of application is image processing and low-level computer vision. The transformation is accomplished using the Stratego/XT language transformation toolset. The implementation presented here is contrasted with the original implementation carried out many years ago using standard compiler implementation tools of the day. We highlight some of the unexpected advantages afforded to us, as language designers and implementers, by the source-to-source transformation technique. We also present some of the practical challenges faced in the implementation and show how these issues were addressed.

Key words: Domain-specific language, transformation, compiler implementation, language definition, computer vision

1 Introduction

This paper describes the re-implementation of a Domain-Specific Language for low-level computer vision called Apply. This work contributes a reflection on the experience of using source-to-source transformation tools to implement a non-embedded Domain-Specific Language, and compares that experience with more traditional compiler implementation techniques. Both implementations were carried out by the same developer. The present Apply implementation was carried out over a period of five months, and this paper distils the experience documented in a 68 page daily log maintained by the developer throughout the implementation process [2]. The log includes descriptions of progress made each day, language design thoughts, language implementation thoughts including ways of implementing optimisations of Apply programs, problems and solutions, as well as documentation notes on Stratego/XT usage.

We thank Alan Fekete, Tony Sloane and the reviewers for their constructive comments.


In terms of the implementation patterns for Executable DSLs identified by Mernik et al. [6], the implementation technique chosen could be described as a hybrid of Compiler/application generator, with considerable analysis and optimisation being done on the DSL program, and Pre-processor, using source-to-source transformation to arrive at the base language source code.

The Apply language was originally implemented in the 1980's at The Robotics Institute at Carnegie Mellon University. The language was designed to allow easy and efficient programming of Low-Level Vision applications using the Apply Programming Model [4]. This model of programming reduces the problem of writing image-to-image vision applications, which are implicitly parallel computations, to the task of writing a procedure to be applied to a window around a single pixel of the image. The original implementation design focussed on the ability to rapidly re-implement the back-end of the compiler to target a range of hardware platforms including special purpose multiprocessor parallel architectures, bit-serial processor arrays, distributed memory architectures and uniprocessor systems. Each platform offered different programming environments: operating systems, programming languages and models of parallel computation. Whilst the original design aims were achieved, recent hardware developments suggest it is time to re-implement the Apply compiler. The emphasis now is less towards targeting hardware infrastructure directly, and more towards implementing high-level source optimisations and rapidly targeting a range of image processing Application Programming Interfaces (APIs).

In the original implementation of Apply, Lex and Yacc were used to generate a parser which constructed an Abstract Syntax Tree (AST). Hand-written C code implemented the analysis and generation of appropriate target code depending on the target platform. For example, the compiler could generate W2 code for the Warp processor, or C code for a uniprocessor UNIX machine. In the current implementation Stratego/XT [9] was chosen to implement the transformation of Apply source to C source. Stratego/XT provides a complete toolkit with domain-specific language support for parsing, rewriting and pretty-printing. There was no need to resort to a general purpose language (such as C) for any part of the implementation. Even with the learning curve required by a novice user, only eight working days were required to implement the core of the Apply language and some basic optimisations [2].

Stratego/XT provides the capability to extend its rewrite language syntax with that of an object language [8] so that rewrite rules can be expressed in the concrete syntax of the language being transformed, in addition to the abstract term representation. This significantly simplifies the task of translating and optimising the Apply source. The use of concrete syntax facilitated optimisations that would not have been feasible in the C version of the compiler. This encouraged the development of the Apply language to further enhance its expressivity without fear of sacrificing application performance. It also made it possible to compile the base language for more efficient execution on a uniprocessor machine than could have been achieved with the original design.

Using concrete syntax in the transformation description makes it a very straightforward task to replace target code utilising one API with code utilising another API. It only requires re-implementing one small Stratego module that transforms specific aspects of the code, and the entire Apply compiler can be regenerated. This makes it feasible to use the Apply language as a way of succinctly specifying low-level vision processing algorithms in an implementation neutral way, allowing the establishment of reusable and retargetable libraries of such algorithms.

In the sections that follow we provide an outline of the domain of application. We introduce the Apply language and domain specific features and discuss the influence that using Stratego/XT has had on the language design. Then we discuss the implementation of the current Apply compiler and reflect on its development with reference to the original C-based implementation.

Fig. 1. Edge detection. Original image (left). Sobel edge detected image (right).

2 Low-Level Vision and the Apply Language

To motivate and illustrate the implementation design we begin by describing our domain of application and our transformation's source language, Apply. Low-Level Vision involves processing image colour and contrast information at a pixel level to identify features such as edges or corners of objects, ridges or blobs that can be used to identify objects or track moving objects. Figure 1 shows an example of an image and the output image after processing for edge detection using a Sobel operator [1, pp 418-420]. There are a large number of known algorithms for processing images and detecting features, and developing new algorithms continues to be an active research area. Kernel-based algorithms are one common class of global algorithms where each output pixel is computed from a small neighbourhood of the input. These algorithms can be described simply and concisely using Apply. The language allows the algorithm designer to focus on the details of the computation that will apply to a window of pixels surrounding a single pixel. For example, Figure 2(a) illustrates the application of a 3 x 3 kernel to an image at three different pixels. Apply abstracts away entirely the details of repeatedly applying a computation across an entire image. The Apply compiler generates the looping structures, handles the exceptional cases at the border of the image, deals with the internal representation of the image (which changes from system to system and API to API) and generates code to take advantage of the parallel computational capability of the target platform.

Fig. 2. (a) Illustration of applying a kernel computation at the corner of an image (X), edge of the image (Y) and clear of the border (Z). (b) Nine regions of an image, labelled A-I, each requiring different bounds checking code.

To illustrate the abstraction power of Apply consider the Sobel operator [1, pp 418-420], which detects the edges of objects by comparing the intensity of a pixel to its eight immediate neighbours. The algorithm used to apply this operator to vertical edges across the image requires looping over the rows and columns of the image and at each pixel multiplying, entry-wise, a matrix

        1  0 -1
  K =   2  0 -2
        1  0 -1

with an equally sized window of pixels taken from the image and centred around the 'current' pixel. The sum of the elements in this matrix of products is the vertical component of the output value for the current pixel location. This type of calculation is known as 'convolution' in image processing. The process is repeated with a different matrix K for horizontal edges, and the horizontal and vertical components are combined by adding their absolute values. Pixel values greater than a byte are truncated.

A hand-written Sobel operator must implement the loop that traverses the appropriate data-structure containing the image, and handle the behaviour of the algorithm at nine distinct regions to ensure correct computation at the corners and edges, see Figure 2(b). The C code for this operator, using two loops, is given in Figure 3. Using only one set of nested loops produces unacceptably slow code with some C compilers, while nine loops often perform best. The implementation in Figure 3 assumes that the images are each stored in a single array in row-major order, that the computations are being performed on a standard uniprocessor machine, and that no truncation of byte values is being done. In comparison, Figure 4 shows the code that would be written in Apply to achieve a similar computation, with byte truncation as well. The Apply compiler can of course generate a nine loop version automatically.
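For concreteness, the convolution just described can be written down directly. The following is an illustrative Java sketch, not the C code of Figure 3 or the Apply code of Figure 4; it assumes row-major pixel arrays with out-of-border pixels read as zero.

  // Illustrative sketch of the Sobel computation described above: convolve with the
  // vertical and horizontal kernels, combine absolute values, truncate to a byte.
  final class SobelSketch {
      private static final int[][] KX = { { 1, 0, -1 }, { 2, 0, -2 }, { 1, 0, -1 } };   // vertical edges
      private static final int[][] KY = { { 1, 2, 1 }, { 0, 0, 0 }, { -1, -2, -1 } };   // horizontal edges

      static void sobel(int[] from, int[] to, int height, int width) {
          for (int row = 0; row < height; row++) {
              for (int column = 0; column < width; column++) {
                  int gx = 0, gy = 0;
                  for (int i = -1; i <= 1; i++) {
                      for (int j = -1; j <= 1; j++) {
                          int p = pixel(from, height, width, row + i, column + j);
                          gx += KX[i + 1][j + 1] * p;
                          gy += KY[i + 1][j + 1] * p;
                      }
                  }
                  int x = Math.abs(gx) + Math.abs(gy);
                  to[row * width + column] = Math.min(x, 255);   // truncate values greater than a byte
              }
          }
      }

      // Border handling: any access outside the image reads as zero.
      private static int pixel(int[] img, int height, int width, int row, int column) {
          if (row < 0 || row >= height || column < 0 || column >= width) return 0;
          return img[row * width + column];
      }
  }

The per-pixel bounds test in pixel() is exactly the overhead that the nine-region decomposition of Figure 2(b), or specialised loops generated by the Apply compiler, can avoid for interior pixels.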

3 Apply Language Design and Stratego/XT Influence

The syntax of Apply is based on a subset of Ada [5]. In summary, Apply provides the following language features which are similar to Ada: arithmetic and boolean expressions; control flow structures: if, if else, while and for; primitive data-types: byte, real and integer; multidimensional array types; and a window parameter type.


  #define in_range(i,j) \
    (row+i >= 0 && row+i < height && column + j >= 0 && column + j < width)
  #define FROM(i,j)   from[(row+i)*width + column + j]
  #define FROM_R(i,j) (in_range(i,j) ? FROM(i,j) : 0)
  #define TO          to[row * width + column]

  void sobel(unsigned char from[], unsigned char to[], int height, int width)
  {
    int row;
    int column;
    int x, y;
    for (row = 0; row ...

Fig. 3. Hand-written C code for computing Sobel edge detection (two-loop version).

    ...
    if x > 255 then
      x := 255;
    end if;
    to := x;
  end sobel;

Fig. 4. Apply code for computing Sobel edge detection.

A window parameter is either a single element of a specified primitive data-type or a two dimensional array of a specified primitive data-type with index ranges. Formal parameters of this type are declared with the syntax window of Type or window(Range, Range) of Type border expr for scalar or subscripted instances respectively. The border expr modifier in the declaration is a succinct way of defining how windows should be handled at the edges of the image. When a location of the window falls outside the image (as it would at the edges) the given constant expression is substituted for the value that would otherwise have been taken from the image itself. Often this constant expression is zero.

The current implementation of Apply defines a number of new metaprogramming-style language extensions. These extensions serve one of two purposes. Some extensions are used as hints to the compiler indicating possible optimisations. Other extensions make it easier to write the Stratego rules that will generate the code to target specific APIs. These extensions were not available in the original Apply definition; they have been added to the language as a direct result of the experience of using transformation techniques for the compiler implementation. Examples of these extensions include @known expressions, @apply statements, defined expressions and assert statements.

The language extensions beginning with an '@' character were intended to be used by the language implementers and those implementing specialised transformations for target APIs or environments. They are used as an intermediate representation of the semantics of constructs that do not appear in the domain of the application programmer, but do appear in the implementation of the language. They are added as syntactical elements (rather than purely abstract terms) to make the transformations easier to both read and write for the language implementers. This technique takes advantage of the ability to extend the Stratego language with object syntax. In this case the object syntax is neither strictly the source nor the target language.


The meaning of @known(expr) is as follows: if expr evaluates at compile time to true, then the whole expression is replaced by the value true; otherwise, if the expression either evaluates to false or cannot be evaluated to a constant value at all, the whole expression is replaced by false. In conjunction with if and if else statements, @known can be used to provide the compiler with alternate algorithms. The compiler's standard unreachable code elimination techniques replace the complete branch statement with a single algorithm at compile time. To eliminate a possibly unnecessary modulo computation one would write

  if (@known(row < 255)) x := row; else x := row % 255;

If the programmer knows a property of a variable but the compiler could not be expected to prove it, the programmer can assert this knowledge using the assert expr statement, e.g. assert x >= 2;. The compiler exploits this knowledge to optimise code generation wherever possible. This statement could also be used to generate runtime checks, however currently this is not done.

The defined(id(expr,expr)) expression tests whether the access to a particular pixel of a window is currently defined. For example, consider the window from defined in Figure 4. When processing the top left corner as illustrated in Figure 2(a) in position X, the window access will be defined at from(0,0), from(0,1), from(1,0) and from(1,1), and undefined at, for example, from(-1,-1) and all other locations. The defined construct flexibly and simply expresses border handling for low-level vision algorithms and is used to implement the border expr modifier.

Originally @defined and @assert were added to simplify implementation; however, while developing the compiler it was decided that it would be beneficial to allow domain application programmers to make use of assertions and the defined construct. The '@' was dropped from the syntax and they became formal parts of the language. The @known is another construct that was originally devised to simplify implementation, however it too is likely to be added to the Apply language because of the power it affords when combined with automatically generated assert statements. For example, since the purpose of the Apply language is to take a computation and to "apply" it across an image, the computation (on a uniprocessor machine) is placed in several loops and the Apply compiler automatically generates assertions based on these loops. In particular, the loop for row in 1..100 loop would generate the assertions row >= 1 and row <= 100.

  |[ @setup_buffers(i1,i2,j1,j2);
     for row in 0..height-1 loop
       @getrow(row);
       for column in width*cpunumber/cpucount .. width*(cpunumber+1)/cpucount-1 loop
         ~s
       end loop;
       @putrow(row);
     end loop;
  ]|

Fig. 6. Sketch of an Apply loop transformation that targets a parallel architecture with abstract API calls to manage images.

5 Reflections on Using Stratego/XT Versus a GPL

The log maintained during the development of the current Apply compiler [2] prompted reflection on the comparative experiences of this development versus the experience of developing the original compiler. Since no such log was maintained during the development of the original compiler, the source code and its documentation were used for comparison. In some respects the approach taken in the original implementation mirrored the approach of transformation through walks or traversals of the abstract syntax tree, pattern matching and replacement of old tree nodes with


new sub-trees. However, instead of having a succinct purpose-designed notation, these tasks were achieved using calls to hand-written libraries. Figure 7 shows a small fragment of the code used to implement subscript translation in the main apply loop in the original C implementation. The function walk (and its variants) performs a matching tree walk starting at the current tree node. Since each tree node had a number of potential arguments (although only two were used in most circumstances), the walk functions specify which argument to follow, then the expected node type (or 0 if any node is acceptable); walk2 does two walks; walk3 does three. Failed walks return PT_NIL. The function nodei constructs a node that has an integer value and the function node2 constructs a tree node with two children. Note that the C code requires considerably more documentation in the form of comments in order to make the intention of the code clear.

In comparison, the Stratego/XT code shown in Figure 8 handles the same transformation not just for one variant of the window construct but for subscripted as well as scalar variants, with or without the border modifier, and with row/column indexing or with direct indexing. This code requires no additional commenting since the semantics of transformation are defined by the Stratego/XT language itself and tree walking details are implied.

Similarly, the original compiler's minimal optimisations were limited to constant folding, since the amount and complexity of the code needed in C was too great. Stratego was selected for the re-implementation to enable more sophisticated code manipulation to target more difficult environments (such as processing the image in place and therefore having to handle the border as part of the code). We had reached the limit of our ability to manage the complexity of the parse tree manipulation process expressed directly in C code.

It is difficult to give an accurate measure of the relative complexity or effort involved. However, to give a rough guide, and in the absence of accurate records of the time taken to program the original compiler, we compared the physical commented lines of code (LOC); we also counted non-commented code and present that as a proportion of the total. See [3] for further discussion. The current implementation required significantly fewer lines of code - see Figure 9. The analysis and transformation stages for the original C compiler were smaller, however the current implementation does significantly more code manipulation and optimisation. The original compiler included a very simple C module for performing constant folding which handled 8 arithmetic operators, computing constant values only when the arguments to the operator were themselves constant terms. By comparison, constant folding in the current compiler handles 15 operators including boolean operators. This code more aggressively simplifies expressions, rearranging them as necessary to bring constant terms together, and removes operator identities.

Needless to say, learning a new tool set involves a learning curve. The log documents many incidents where the first attempt at achieving a desired outcome using Stratego did not work and further reading, learning and trial and error testing was needed.
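The algebraic clean-up mentioned above (folding constant subexpressions, regrouping constants and removing operator identities) is easy to sketch. The following illustration is written in Java purely for compactness; in the Apply compiler the same idea is expressed as Stratego rewrite rules, and the operator set shown here is deliberately tiny.

  // Illustrative constant-folding sketch: fold constant operands, drop identities
  // (x + 0, x * 1) and regroup constants so that (x + c1) + c2 becomes x + (c1 + c2).
  abstract class Expr { }
  final class Num extends Expr { final int value; Num(int v) { value = v; } }
  final class Var extends Expr { final String name; Var(String n) { name = n; } }
  final class Add extends Expr { final Expr l, r; Add(Expr l, Expr r) { this.l = l; this.r = r; } }
  final class Mul extends Expr { final Expr l, r; Mul(Expr l, Expr r) { this.l = l; this.r = r; } }

  final class ConstantFolder {
      static Expr fold(Expr e) {
          if (e instanceof Add) return foldAdd(fold(((Add) e).l), fold(((Add) e).r));
          if (e instanceof Mul) return foldMul(fold(((Mul) e).l), fold(((Mul) e).r));
          return e;
      }

      private static Expr foldAdd(Expr l, Expr r) {
          if (l instanceof Num && r instanceof Num)
              return new Num(((Num) l).value + ((Num) r).value);          // constant folding
          if (l instanceof Num && ((Num) l).value == 0) return r;         // 0 + x = x
          if (r instanceof Num && ((Num) r).value == 0) return l;         // x + 0 = x
          if (l instanceof Add && ((Add) l).r instanceof Num && r instanceof Num)
              return new Add(((Add) l).l,                                 // (x + c1) + c2 = x + (c1 + c2)
                             new Num(((Num) ((Add) l).r).value + ((Num) r).value));
          return new Add(l, r);
      }

      private static Expr foldMul(Expr l, Expr r) {
          if (l instanceof Num && r instanceof Num)
              return new Num(((Num) l).value * ((Num) r).value);
          if (l instanceof Num && ((Num) l).value == 1) return r;         // 1 * x = x
          if (r instanceof Num && ((Num) r).value == 1) return l;         // x * 1 = x
          return new Mul(l, r);
      }
  }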


  ...
  if (tree->type == PT_SUBSCRIPT && mode == MODE_STMTS) {
    register pt_node *var, *sub1, *sub2;
    register declaration *d;
    /* Match a piece of tree that looks like
     *   PT_SUBSCRIPT -- PT_SUBLIST -- PT_SUBLIST
     *        |              |             |
     *       var            sub1          sub2
     */
    var  = walk (tree, ARG1, PT_VARIABLE);
    sub1 = walk2 (tree, ARG2, PT_SUBLIST, ARG1, 0);
    sub2 = walk3 (tree, ARG2, PT_SUBLIST, ARG2, PT_SUBLIST, ARG1, 0);
    if (var != PT_NIL && sub1 != PT_NIL && sub2 != PT_NIL && var->decl) {
      d = var->decl;
      if ((d->classtype == IN || d->classtype == OUT) && d->numdim == 2) {
        t1 = nodei (d->dimensions[1].high - d->dimensions[1].low + 1);
        t2 = nodei (d->dimensions[1].low);
        t3 = nodei (d->dimensions[0].low);
        /* Now set t1 to entire expression
         *   PT_PLUS -- PT_MINUS -- t3(minrow)
         *      |          |
         *      |        sub1(i)
         *      |
         *   PT_MULTIPLY -- t1(maxcol-mincol+1)
         *      |
         *   PT_PLUS -- unrollid
         *      |
         *   PT_MINUS -- t2(mincol)
         *      |
         *    sub2(j)
         */
        t1 = node2 (PT_PLUS,
                    node2 (PT_MULTIPLY,
                           node2 (PT_PLUS,
                                  node2 (PT_MINUS, sub2, t2),
                                  nodei (unrollid)),
                           t1),
                    node2 (PT_MINUS, sub1, t3));
        /* Now put expression into subscript list in tree and prune list. */
        sub1 = walk (tree, ARG2, PT_SUBLIST);
        sub1->arg[ARG1] = t1;
        sub1->arg[ARG2] = PT_NIL;
        return (tree); /* Prune recursion. */
  ...

Fig. 7. C code to replace window relative indexes with image relative indexes.



FixWindowsIndexBorder :
  WindowAccessElement(x,row,col,rowrange,colrange,Border(t,e),looptype) ->
  IfElseExp( |[ defined(x( ~row, ~col)) ]| ,
             Subscript(x, |[ ~row * width + app_index + ~col ]| ),
             e)

FixWindowsIndex :
  WindowAccessElement(x,row,col,rowrange,colrange,type,looptype) ->
  Subscript(x, |[ ~row * width + app_index + ~col ]| )

FixWindowsNoIndex :
  WindowAccessScalar(x,type,looptype) ->
  Subscript (x, |[ row * width + column ]| )

FixWindowsNoIndexBorder :
  WindowAccessElement(x,r,c,rowrange,colrange,Border(t,e),looptype) ->
  IfElseExp( |[ defined(x( ~r, ~c)) ]| ,
             Subscript(x, |[ (row + ~r) * width + column + ~c ]| ),
             e)

FixWindowsNoIndex :
  WindowAccessElement(x,r,c,rowrange,colrange,type,looptype) ->
  Subscript(x, |[ (row + ~r) * width + column + ~c ]| )

Fig. 8. Stratego/XT rewrite rule: window relative indexes to image relative indexes.

Phase                                     Original Compiler    Current Compiler
lexical & parsing                         1023   27%           280    25%
analysis & transformation                 1351   36%           1897   27%
constant folding (number of operators)    77(8)  32%           76(15) 38%
format target language                    306    34%           144     9%

Fig. 9. Physical commented LOC with % of comments and blank lines to total lines.
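
To make the constant-folding comparison concrete, the following is a minimal sketch in Python of the kind of simplification described above (folding constant subexpressions, removing operator identities and gathering constants together). It is for illustration only: it is not the Stratego/XT rules actually used, and the tuple encoding of expressions is invented for the example.

# Minimal sketch of constant folding with identity removal, written in
# Python for illustration only; the real compiler expresses these
# simplifications as Stratego/XT rewrite rules.  Expressions are nested
# tuples ("+", e1, e2) or ("*", e1, e2), integer literals, or variable names.

def fold(e):
    if isinstance(e, int):
        return e
    op, a, b = e
    a, b = fold(a), fold(b)
    # Fold when both arguments are already constants.
    if isinstance(a, int) and isinstance(b, int):
        return a + b if op == "+" else a * b
    # Remove operator identities.
    if op == "+" and a == 0: return b
    if op == "+" and b == 0: return a
    if op == "*" and a == 1: return b
    if op == "*" and b == 1: return a
    # Rearrange to bring constant terms together: (x + 1) + 2 -> x + 3.
    if isinstance(b, int) and not isinstance(a, int) \
            and a[0] == op and isinstance(a[2], int):
        inner = a[2] + b if op == "+" else a[2] * b
        return fold((op, a[1], inner))
    return (op, a, b)

# For example, fold(("+", ("+", "x", 1), 2)) yields ("+", "x", 3)
# and fold(("*", "x", 1)) yields "x".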

Stratego's terse syntax was the cause of confusion on a number of occasions, covering everything from strategy combinators to parameterised rules and matching terms. Using Stratego effectively often requires an unfamiliar functional programming style of problem solving. For example, to compute the extreme dimensions of all windows given as arguments to an Apply function, the log records requiring a different way of thinking: "Rather than thinking about it as passing through the tree and accumulating the most extreme dimensions, I need to think about reducing the tree to the most extreme dimensions by rewriting it. Then, I can use where to assign the results to a temporary variable inside a strategy." The reverse production format of the Stratego Syntax Definition Formalism (SDF) posed difficulties at first, as did identifying and correcting sources of ambiguity in the SDF grammar definition. Being forced to re-specify the grammar represented a barrier to entry, but it facilitated the integration between concrete and abstract syntax. Overall productivity was greater than before, primarily due to the domain-specific nature of Stratego compared with C.
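
The "reduce rather than accumulate" mindset quoted in the log entry above can be illustrated outside Stratego as well. The following Python fragment is an illustrative sketch only, not the actual compiler code: the (min_row, max_row, min_col, max_col) tuples are an invented encoding of window dimensions.

# Illustrative sketch of reducing a collection of window dimensions to the
# overall extremes, rather than accumulating them during a traversal.
from functools import reduce

def extremes(d1, d2):
    """Combine two (min_row, max_row, min_col, max_col) tuples."""
    return (min(d1[0], d2[0]), max(d1[1], d2[1]),
            min(d1[2], d2[2]), max(d1[3], d2[3]))

def apply_extremes(window_dims):
    """Reduce the dimensions of all window arguments to a single tuple."""
    return reduce(extremes, window_dims)

# apply_extremes([(-1, 1, -1, 1), (-2, 0, 0, 2)]) == (-2, 1, -1, 2)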


                    Core-gcc   PC-gcc   SPARC-gcc   Core-MSVC   PC-MSVC
Hand-written (1)    5.232      3.608    4.668       5.190       4.07
old compiler (2)    4.429      3.438    6.366       4.450       4.6
Apply compiler      3.283      2.462    4.596       3.940       3.97
Speedup (1)         37%        32%      2%          24%         2%
Speedup (2)         26%        28%      28%         11%         14%

Fig. 10. Summary of execution times and speedup for a Sobel operator.

Debugging was difficult during both implementations, as the only methods available were to dump the parse tree or to insert debugging statements in the code. With Stratego we can separate transformation stages into separate executables and test isolated stages on simple AST fragments, but it is not easy to see which of Stratego's dynamic rules are active except by their effect when executed.

6  Performance

We found it easy to implement optimisations in Stratego. Is it better to optimise the generated source rather than leave optimisation to the target C compiler? After constant folding, constant propagation and unreachable code elimination, the generated C code was succinct, more readable and thus easier to verify by inspection than unoptimised source code. We compared the performance of the Sobel operator written in Apply (see Figure 4) with a hand-written Sobel operator similar to that shown in Figure 3, using 512 x 512 pixel images. The Apply-generated code was specialised with a constant image width as discussed in Section 4. We also compared the performance of the C code generated by the old compiler with that of the current compiler. The tests were run with five combinations of CPU and C compiler. For each C compiler a range of common optimisation options was tested, with the best option used for comparison for each compiler on each platform. See [3] for more details. Each measurement is the median time over 7 runs, each involving 30 seconds of execution time with randomised array locations to avoid the impact of caching and paging. The results in Figure 10 show that the Apply-generated code ran up to 37% faster than the hand-written code. The results vary greatly between hardware platforms and compilers. Tests were also performed with operators that had simpler code. The speed-up achieved by the Apply-generated code depended on the complexity of the Apply program, but there was no consistent trend across the platforms. We conclude that source-to-source optimisation, using transformation techniques including the generation of specialised code, is worthwhile


for readability and verifiability and sometimes for performance reasons.
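
The measurement discipline described above (the median over repeated timed runs) can be sketched as follows. This is an illustrative Python harness only, not the benchmarking code actually used for Figure 10; the function passed in stands for one application of the operator under test.

# Illustrative timing harness: report the median per-call time over several
# runs, each run executing the operator repeatedly for a fixed time budget.
import time
from statistics import median

def time_operator(run_once, budget_seconds=30.0, runs=7):
    per_call_times = []
    for _ in range(runs):
        calls, start = 0, time.perf_counter()
        while time.perf_counter() - start < budget_seconds:
            run_once()          # e.g. apply the Sobel operator to one image
            calls += 1
        per_call_times.append((time.perf_counter() - start) / calls)
    return median(per_call_times)

# median_seconds = time_operator(lambda: sobel(image))
# ('sobel' and 'image' are placeholders for the operator and data under test)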

7  Conclusion

We described the experience of implementing a Domain-Specific Language for image processing and low-level computer vision, comparing our experience of using Stratego/XT with the experience of using traditional techniques. The Stratego/XT toolset enabled easy implementation and provided opportunities to enhance the language and improve the performance of generated code. The use of concrete syntax in the transformations facilitated rapid retargeting to exploit available APIs and target environments. Our experience demonstrates that implementing a compiler by transformation to a GPL is a practical way of achieving a non-embedded DSL.

References

[1] Gonzalez, R. C. and R. E. Woods, "Digital Image Processing," Addison-Wesley, New York, 1992.

[2] Hamey, L., Re-implementation of Apply: Stratego/XT experience notes (2006), unpublished.

[3] Hamey, L. G. C. and S. N. Goldrei, Implementing the Apply compiler using Stratego/XT, Technical Report C/TR01-01, Department of Computing, Macquarie University, NSW, Australia (February 2007).

[4] Hamey, L. G. C., J. A. Webb and I.-C. Wu, Low-level vision on Warp and the Apply programming model, Parallel computation and computers for artificial intelligence (1988), pp. 185-199.

[5] Ledgard, H., "Reference Manual for the ADA Programming Language," Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1983.

[6] Mernik, M., J. Heering and A. M. Sloane, When and how to develop domain-specific languages, ACM Computing Surveys 37 (2005), pp. 316-344.

[7] Olmos, K. and E. Visser, Strategies for source-to-source constant propagation, in: B. Gramlich and S. Lucas, editors, Workshop on Reduction Strategies (WRS'02), Electronic Notes in Theoretical Computer Science 70 (2002), p. 20.

[8] Visser, E., Meta-programming with concrete object syntax, in: D. Batory, C. Consel and W. Taha, editors, Generative Programming and Component Engineering (GPCE'02), Lecture Notes in Computer Science 2487 (2002), pp. 299-315.

[9] Visser, E., Program transformation with Stratego/XT, Technical Report UU-CS-2004-011, Institute of Information and Computing Sciences, Utrecht University (2004). URL http://www.stratego-language.org/Stratego/ProgramTransformationWithStrategoXT


Spoofax: An Extensible, Interactive Development Environment for Program Transformation with Stratego/XT

Karl Trygve Kalleberg
Department of Informatics, University of Bergen, P.O. Box 7800, N-5020 BERGEN, Norway

Eelco Visser
Department of Software Technology, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, The Netherlands

1  Introduction

Many programmable software transformation systems are based around novel domain-specific languages (DSLs), and have a long history of development and successful deployment. Despite their maturity and applicability, these systems are often dismissed as esoteric research prototypes. This is partly because the languages are frequently based on less familiar programming paradigms such as term and graph rewriting or logic programming. Another reason is that modern development environments are rarely available for these systems. The basic and expected interactive development aids, such as source code navigation, content completion, syntax highlighting and continuous error checking, are rarely available to developers of transformation code. The lack of development aids keeps the entry barrier for new developers high: DSLs for program transformation use their own syntax and language constructs which are unfamiliar to many, and most editing environments support them rather poorly, providing only limited syntax highlighting and little else. Even skilled developers are less effective, because errors are reported late in the edit-compile-run cycle, only after compiling. It is generally held that errors should be reported immediately after a change has been made, while the programmer is still in a relevant frame of mind. Ideally, error checks should also be customizable so that project-specific design rules can be checked. Stratego/XT is a domain-specific language and toolset for developing standalone software transformation systems based on formal language descriptions.


Fig. 1. Screenshot showing SDF and Stratego editors with outline view.

It is fairly mature and has been applied by various research groups and companies to tasks ranging from theorem proving to compiler implementation to domain-specific optimization to language extension; see [2]. Until recently, no good editing environment existed for Stratego, which made development harder than necessary. In this system description paper, we describe Spoofax, an extensible, interactive environment based on Eclipse for developing program transformation systems with Stratego/XT. Spoofax supports Stratego/XT by providing modern development aids, such as customizable syntax highlighting, content completion, source code outlining and navigation, and automatic, incremental project rebuilding. The contributions of this environment include user extensibility with scripts written in Stratego that allow live analyses and transformations of the code under development; syntax highlighting, navigation and content completion that ease the learning curve for new users of Stratego; and integration into a mainstream tools platform that is familiar to developers and that runs on most desktop platforms.

2  Description

Spoofax is a set of Eclipse plugins: a Stratego editor, an SDF editor, a help system and a Stratego interpreter. It supplements Stratego/XT, which must be installed separately, by providing an extensible, interactive development environment. Figure 1 shows a session with an SDF editor (top right), a Stratego editor (top middle), a list of pending tasks extracted from all project files (bottom), and a code outline view (left) displaying all imports, rules and strategies defined in the edited file. The popup is the content completer showing alternatives for the tc- prefix. Stratego/XT programs can be compiled and executed from within the environment. A notable feature of Spoofax is that users can write scripts in Stratego to extend the editor. These scripts perform code transformations and project-specific style or error checking on the Stratego code under development. For example, a script may ensure that no unwanted module dependencies creep in by continuously checking the import list during editing. Scripts are compiled to an abstract machine format by


the Stratego compiler, and the resulting files are loaded into the editor and executed inside the environment. Execution can happen on-demand or attached to predefined hooks, such as whenever a file is saved. This is an attractive feature because Stratego is a mature language for language processing, and its standard library provides a formal language description and reusable transformations for Stratego. This eases the writing of language processing scripts considerably, compared to other scriptable editors like the Emacs family, as scripts in Spoofax operate on the abstract syntax tree.
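
The idea behind such an editor script can be made concrete with a small, language-neutral sketch. The following Python fragment is illustrative only: the helper names, the allowed-module list and the hook signature are invented for the example, and a real Spoofax script would be a Stratego strategy running against the module's AST inside the environment.

# Hypothetical sketch of an "on save" check for unwanted module dependencies.
ALLOWED_IMPORTS = {"collection/list", "strategy/traversal"}  # project-specific rule

def check_imports(module_ast):
    """Return a diagnostic for every import that is not on the allowed list."""
    problems = []
    for imported in module_ast.get("imports", []):
        if imported not in ALLOWED_IMPORTS:
            problems.append("unwanted dependency on module '%s'" % imported)
    return problems

def on_save(module_ast, report):
    """Editor hook: run the project-specific check whenever a file is saved."""
    for message in check_imports(module_ast):
        report(message)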

3  Implementation

Stratego is a modular language. Each module is defined in a source file that contains definitions of rules and strategies; it may import other modules. Spoofax maintains an in-memory representation of all modules of a project, and their import dependencies, in what we call a build weave. This is used by the source-code navigator, the content completer and the code outliner. The module dependencies are resolved by parsing the Makefiles in the source tree and extracting the module include paths defined there. The editor is built on top of three different parsers for Stratego. The ones used for syntax highlighting and code outlining are hand-written in Java, because they must work well for syntactically incorrect programs. A scannerless GLR parser is used to extract the abstract syntax tree from source files, and these trees are available for user scripts to inspect. Modification is also possible, but layout is not (yet) always properly preserved. Spoofax comes with an interpreter, written in Java, for executing compiled Stratego scripts.

4  Related Work

Many program transformation systems provide some form of interactive environment. We briefly mention some that are advanced and actively developed. The Meta-Environment is an open and extensible framework for language development, source code analysis and source code transformation based on the ASF+SDF transformation system [4]. The environment provides interactive visualisations and editors with error checking and syntax highlighting. Tom is a software environment for defining transformations in Java [3] and comes with a basic Eclipse editor plugin that provides syntax highlighting, context-specific help, error checking and automatic compilation, but no source navigation. JTransformer is a Prolog-based query and transformation engine for Java source code, based on Eclipse. It provides a Prolog editor with syntax highlighting, auto-completion, code outlining, error checking and context-specific help. ANTLRWorks [1] is a graphical development environment for developing and debugging ANTLR grammars, with an impressive feature list that includes code navigation, visualisations, error checking and refactoring. All these systems have feature sets overlapping with Spoofax, but to our knowledge, only the Meta-Environment was also designed to be extensible using a transformation language.


5  Conclusion

We have introduced an extensible, interactive development environment for Stratego/XT that provides modern development aids like content completion, source code navigation, customizable syntax highlighting, automatic and incremental project building. Users can extend the environment with scripts written in Stratego, and these can perform analysis and transformation on the code under development. We feel that our environment lowers the entry level for new users by plugging into a familiar and widely available platform, and that it makes existing developers more productive by making errors quickly visible during editing. We would like to thank the anonymous reviewers for helpful comments on this paper. Kalleberg is supported by the Norwegian Research Council, grant PLI-AST.

References

[1] J. Bovet and T. Parr. ANTLRWorks: The ANTLR GUI development environment. Home page at www.antlr.org/works/ (visited 2006-12-10).

[2] M. Bravenboer, K. T. Kalleberg, R. Vermaas, and E. Visser. Stratego/XT 0.16. Components for transformation systems. In ACM SIGPLAN 2006 Workshop on Partial Evaluation and Program Manipulation (PEPM'06), Charleston, South Carolina, January 2006. ACM SIGPLAN.

[3] P.-E. Moreau, C. Ringeissen, and M. Vittek. A pattern matching compiler for multiple target languages. In 12th International Conference on Compiler Construction, LNCS, pages 61-76. Springer, 2003.

[4] M. van den Brand, A. van Deursen, J. Heering, H. A. de Jong, M. de Jonge, T. Kuipers, P. Klint, L. Moonen, P. A. Olivier, J. Scheerder, J. J. Vinju, E. Visser, and J. Visser. The ASF+SDF meta-environment: A component-based language development environment. In R. Wilhelm, editor, Proc. of the 10th Intl. Conf. on Compiler Construction, volume 2027 of LNCS, pages 365-370. Springer, 2001.

A  Demonstration

We will demonstrate how Spoofax improves the development of language processing tools from formal language descriptions with Stratego/XT by covering the following topics.

Starting a new project - We show how to set up Spoofax and Stratego/XT and create a new project from scratch that does simple transformations on a toy language called TIL, and how to configure the environment to build the project automatically and incrementally whenever a source file is saved.

Navigating the code - We demonstrate how source code navigation can be used to jump to other modules and to the definition sites of rules and strategies, how to search for modules, rules and strategies with wildcards, and how the source code outliner works.

Extending the editor - The main part of the demonstration will be devoted to developing a small extension to the Stratego editor that analyses the AST of the Stratego source code whenever a file is saved. We show how to develop such scripts inside Spoofax, how to compile and install scripts into the running environment, how the scripts can ask for the AST of a Stratego source file, and how the library for transforming Stratego programs, provided by Stratego/XT, makes processing Stratego code relatively easy by exploiting features of Stratego such as generic traversals, rewrite rules, strategy combinators, dynamic rules and concrete syntax patterns.

SPPF-Style Parsing From Earley Recognisers

Elizabeth Scott
Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom

Abstract

In its recogniser form, Earley's algorithm for testing whether a string can be derived from a grammar is worst case cubic on general context free grammars (CFG). Earley gave an outline of a method for turning his recognisers into parsers, but it turns out that this method is incorrect. Tomita's GLR parser returns a shared packed parse forest (SPPF) representation of all derivations of a given string from a given CFG but is worst case unbounded polynomial order. We have given a modified worst-case cubic version, the BRNGLR algorithm, that, for any string and any CFG, returns a binarised SPPF representation of all possible derivations of a given string. In this paper we apply similar techniques to develop two versions of an Earley parsing algorithm that, in worst-case cubic time, return an SPPF representation of all derivations of a given string from a given CFG.

Key words: Earley parsing, cubic generalised parsing, context free languages

Since Knuth's seminal 1960s work on LR parsing [14] was extended to LALR parsers by DeRemer [5,4], the computer science community has been able to automatically generate parsers for a very wide class of context free languages. However, many parsers are still written manually, either with tool support or even completely by hand. This is partly because in some application areas, such as natural language processing and bioinformatics, we do not have the luxury of designing the language so that it is amenable to known parsing techniques, but it is also clear that, left to themselves, computer language designers do not naturally write LR(1) grammars. A grammar not only defines the syntax of a language, it is also the starting point for the definition of the semantics, and the grammar which facilitates semantic definition is not usually the one which is LR(1). This is illustrated by the development of the Java standard. The first edition of the Java Language Specification [7] contains a detailed discussion of the need to modify


the grammar used to define the syntax and semantics in the main part of the standard to make it LALR(1) for compiler generation purposes. In the third version of the standard [8] the compiler version of the grammar is written in EBNF and is (unnecessarily) ambiguous, illustrating the difficulty of making correct transformations. Given this difficulty in constructing natural LR(1) grammars that support the desired semantics, the general parsing techniques, such as the CYK [20], Earley [6] and GLR [19] algorithms, developed for natural language processing are also of interest to the wider computer science community. When using grammars as the starting point for semantics definition, we distinguish between recognisers, which simply determine whether or not a given string is in the language defined by a given grammar, and parsers, which also return some form of derivation of the string, if one exists. In their basic forms the CYK and Earley algorithms are recognisers while GLR-style algorithms are designed with derivation tree construction, and hence parsing, in mind. However, in both recogniser and parser form, Tomita's GLR algorithm is of unbounded polynomial order in the worst case. In this paper we describe the expansion of Earley recognisers to parsers which are of worst case cubic order.

1  Generalised parsing techniques

There is no known linear time parsing or recognition algorithm that can be used with all context free grammars. In their recogniser forms the CYK algorithm is worst case cubic on grammars in Chomsky normal form, and Earley's algorithm is worst case cubic on general context free grammars and worst case order n^2 on non-ambiguous grammars. General recognisers must, by definition, be applicable to ambiguous grammars. Expanding general recognisers to parsers raises several problems, not least because there can be exponentially many or even infinitely many derivations for a given input string. A cubic recogniser which was modified to simply return all derivations could become an unbounded parser. Of course, it can be argued that ambiguous grammars reflect ambiguous semantics and thus should not be used in practice. This would be far too extreme a position to take. For example, it is well known that the if-else construct in C is ambiguous, but a longest match resolution results in linear time parsers that attach the 'else' to the most recent 'if', as specified by the ANSI-C semantics. The ambiguous ANSI-C standard grammar is certainly practical for parser implementation. However, in general ambiguity is not so easily handled, and it is well known that grammar ambiguity is in fact undecidable [11]; thus we cannot expect a parser generator simply to check for ambiguity in the grammar and report the problem back to the user. It is possible that many of the ad hoc methods of dealing with specific ambiguity, such as the longest match approach for if-else, can be generalised into standard classes of typical ambiguity which can be automatically tested for


(see, for example, [3]), but this remains a topic requiring further research. Another possibility is to avoid the issue by just returning one derivation. In [9] there is an algorithm for generating a rightmost derivation from the output of an Earley recogniser in at worst cubic time. However, if only one derivation is returned then this creates problems for a user who wants all derivations and, even in the case where only one derivation is required, there is the issue of ensuring that it is the required derivation that is returned. Furthermore, naïve users may not even be aware that there was more than one possible derivation. A truly general parser will return all possible derivations in some form. Perhaps the most well known representation is the shared packed parse forest (SPPF) described and used by Tomita [19]. Using this approach we can at least tell whether there is more than one derivation of a given string in a given grammar: use a GLR parser to build an SPPF and then test to see if the SPPF contains any packed nodes. Tomita's description of the representation does not allow for the infinitely many derivations which arise from grammars which contain cycles, but it is relatively simple to modify his formulation to include these, and a fully general SPPF construction, based on Farshi's version [15] of Tomita's GLR algorithm, was given by Rekers [16]. These algorithms are all of worst-case unbounded polynomial order and, in fact, Johnson [12] has shown that Tomita-style SPPFs are of worst-case unbounded polynomial size. Thus using such structures will also turn any cubic recognition technique into a worst case unbounded polynomial parsing technique. Leaving aside the potential increase in complexity when turning a recogniser into a parser, it is clear that this process is often difficult to carry out correctly. Earley gave an algorithm for constructing derivations of a string accepted by his recogniser, but this was subsequently shown by Tomita [19] to return spurious derivations in certain cases. Tomita's original version of his algorithm failed to terminate on grammars with hidden left recursion and, as remarked above, had no mechanism for constructing complete shared packed parse forests for grammars with cycles. In [2] there is an outline of an algorithm to turn the recogniser reported there and in [1] into a parser, but again, as written, this algorithm will generate spurious derivations as well as the correct ones. The recogniser described in [1] is not applicable to grammars with hidden left recursion, but the closely related RIGLR algorithm [18] is fully general and, as a recogniser, is of worst case cubic order. There is a parser version which correctly constructs SPPFs, but as these are Tomita-style SPPFs the parser is of unbounded polynomial order. As we have mentioned, Tomita's GLR algorithm was designed with parse tree construction in mind. We have given a GLR algorithm, BRNGLR [17], which is of worst case cubic order and, because the tree building is integral to the algorithm, the parser, which builds a modified form of SPPF, is also of worst case cubic order. In this paper we apply similar techniques to the Earley recogniser


and construct two versions of a complete Earley parser, both of which are of worst case cubic order. Thus we have an Earley parser which produces an SPPF representation of all derivations of a given input string in worst case cubic space and time.

2  Background theory

In this section we give a brief description of Earley's algorithm, for simplicity without lookahead, and show how Earley's own extension of this to a parser can fail. We then show how to apply the techniques developed in [17] to correctly generate a representation of all possible derivations of a given input string from Earley's recogniser in worst case cubic time and space.

A context free grammar (CFG) consists of a set N of non-terminal symbols, a set T of terminal symbols, an element S ∈ N called the start symbol, and a set P of numbered grammar rules of the form A ::= α where A ∈ N and α is a (possibly empty) string of terminals and non-terminals. The symbol ε denotes the empty string. A derivation step is an element of the form γAβ ⇒ γαβ where γ and β are strings of terminals and non-terminals and A ::= α is a grammar rule. A derivation of τ from σ is a sequence of derivation steps σ ⇒ β1 ⇒ ... ⇒ βn−1 ⇒ τ. We may also write σ ⇒* τ or σ ⇒^n τ in this case. A sentential form is any string α such that S ⇒* α, and a sentence is a sentential form which contains only elements of T. The set, L(Γ), of sentences which can be derived from the start symbol of a grammar Γ, is defined to be the language generated by Γ. A derivation tree is an ordered tree whose root is labelled with the start symbol, leaf nodes are labelled with a terminal or ε, and interior nodes are labelled with a non-terminal, A say, and have a sequence of children corresponding to the symbols on the right hand side of a rule for A.

A shared packed parse forest (SPPF) is a representation designed to reduce the space required to represent multiple derivation trees for an ambiguous sentence. In an SPPF, nodes which have the same tree below them are shared, and nodes which correspond to different derivations of the same substring from the same non-terminal are combined by creating a packed node for each family of children. Examples are given in Sections 3 and 4. Nodes can be packed only if their yields correspond to the same portion of the input string. Thus, to make it easier to determine whether two alternates can be packed under a given node, SPPF nodes are labelled with a triple (x, j, i) where aj+1 . . . ai is a substring matched by x. To obtain a cubic algorithm we use binarised SPPFs which contain additional intermediate nodes but which are of worst case cubic size. (The SPPF is said to be binarised because the additional nodes ensure that nodes whose children are not packed nodes have out-degree at most two.)
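
The node structure just described can be sketched directly. The following Python fragment is an illustrative sketch only, not the representation used in the paper's implementations: a node is identified by a label and an extent, and ambiguity is recorded by storing more than one packed family of children under the same node.

# Minimal sketch of binarised SPPF nodes, for illustration only.
class SPPFNode:
    def __init__(self, label, j, i):
        self.label = label        # e.g. "S", "a", or a dotted rule "S ::= S . S"
        self.extent = (j, i)      # the node derives the substring a_{j+1} ... a_i
        self.families = []        # packed families of children (each of size <= 2)

    def add_family(self, *children):
        """Add a family of children unless an identical one is already present."""
        if tuple(children) not in self.families:
            self.families.append(tuple(children))

    @property
    def is_ambiguous(self):
        return len(self.families) > 1

# For example, the node (T, 1, 2) in Section 3 ends up with two families,
# one for T ::= aB and one for T ::= a, and is therefore ambiguous.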


Earley's recognition algorithm constructs, for each position i in the input string a1 . . . an, a set of items. Each item represents a position in the grammar that a top down parser could be in after matching a1 . . . ai. In detail, the set E0 is initially set to contain the items (S ::= ·α, 0). For i > 0, Ei is initially set to contain the items (A ::= αai · β, j) such that (A ::= α · ai β, j) ∈ Ei−1. The sets Ei are constructed in order and 'completed' by adding items as follows: for each item (B ::= γ · Dδ, k) ∈ Ei and each grammar rule D ::= ρ, (D ::= ·ρ, i) is added to Ei, and for each item (B ::= τ ·, k) ∈ Ei, if (D ::= σ · Bµ, h) ∈ Ek then (D ::= σB · µ, h) is added to Ei. The input string is in the language of the grammar if and only if there is an item (S ::= α·, 0) ∈ En.

As an example consider the grammar

S ::= ST | a     B ::= ε     T ::= aB | a

and input string aa. The Earley sets are

E0 = {(S ::= ·ST, 0), (S ::= ·a, 0)}
E1 = {(S ::= a·, 0), (S ::= S · T, 0), (T ::= ·aB, 1), (T ::= ·a, 1)}
E2 = {(T ::= a · B, 1), (T ::= a·, 1), (B ::= ·, 2), (S ::= ST ·, 0), (T ::= aB·, 1),
      (S ::= S · T, 0), (T ::= ·aB, 2), (T ::= ·a, 2)}
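
The recogniser just described is small enough to sketch directly. The following Python fragment is an illustrative sketch only, not an implementation from the paper; for the example grammar above it computes exactly the sets E0, E1 and E2 listed.

# Illustrative sketch of Earley's recogniser (no lookahead).
# An item (A ::= alpha . beta, j) is represented as (A, alpha, beta, j),
# with alpha and beta tuples of grammar symbols.

GRAMMAR = {                        # the example grammar of this section
    "S": (("S", "T"), ("a",)),
    "B": ((),),                    # B ::= epsilon
    "T": (("a", "B"), ("a",)),
}
START = "S"

def add(item_set, item):
    if item in item_set:
        return False
    item_set.add(item)
    return True

def recognise(tokens):
    n = len(tokens)
    E = [set() for _ in range(n + 1)]
    E[0] = {(START, (), alpha, 0) for alpha in GRAMMAR[START]}
    for i in range(n + 1):
        changed = True
        while changed:                                   # complete E_i
            changed = False
            for (A, alpha, beta, j) in list(E[i]):
                if beta and beta[0] in GRAMMAR:          # predictor
                    for rho in GRAMMAR[beta[0]]:
                        changed |= add(E[i], (beta[0], (), rho, i))
                elif not beta:                           # completer
                    for (D, sigma, mu, h) in list(E[j]):
                        if mu and mu[0] == A:
                            changed |= add(E[i], (D, sigma + (A,), mu[1:], h))
        if i < n:                                        # scanner
            E[i + 1] = {(A, alpha + (tokens[i],), beta[1:], j)
                        for (A, alpha, beta, j) in E[i]
                        if beta and beta[0] == tokens[i]}
    accepted = any(A == START and not beta and j == 0
                   for (A, alpha, beta, j) in E[n])
    return E, accepted

# recognise(["a", "a"]) accepts, and builds the sets E_0, E_1, E_2 shown above.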

3  Problems with Earley parser construction

Earley's original paper gives a brief description of how to construct a representation of all possible derivation trees from the recognition algorithm, and claims that this requires at most cubic time and space. The proposal is to maintain pointers from the non-terminal instances on the right hand sides of a rule in an item to the item that 'generated' that item. So, if (D ::= τ · Bµ, h) ∈ Ek and (B ::= δ·, k) ∈ Ei then a pointer is assigned from the instance of B on the left of the dot in (D ::= τB · µ, h) ∈ Ei to the item (B ::= δ·, k) ∈ Ei. In order to keep the size of the sets Ei in the parser version of the algorithm the same as the size in the recogniser, we add pointers from the instance of B in (D ::= τB · µ, h) to each of the items of the form (B ::= δ′·, k′) in Ei.

Example 1  Applying this approach to the grammar from the previous section, and the string aa, gives the following structure.

[Diagram: the Earley sets E0, E1 and E2 for the input aa, with pointers from non-terminal instances in items to the items that generated them.]

From this structure the SPPF below can be constructed, as follows.

[Diagram: the resulting SPPF for aa, with root u0 = (S, 0, 2) and further nodes labelled (S, 0, 1), (a, 0, 1), (T, 1, 2), (a, 1, 2), (B, 2, 2) and ε; its construction is described below.]

We start with (S ::= ST ·, 0) in E2. Since the integer in the item is 0 and it lies in the level 2 Earley set, we create a node, u0, labelled (S, 0, 2). The pointer from S points to (S ::= a·, 0) in E1, so we create a child node, u1, labelled (S, 0, 1). From u1 we create a child node, u2, labelled (a, 0, 1). Returning to u0, there is a pointer from T that points to (T ::= aB·, 1) in E2, so we create a child node, u3, labelled (T, 1, 2). From u3 we create a child node u4 labelled (a, 1, 2) and, using the pointer from B, a child node, u5, labelled (B, 2, 2), which in turn has a child labelled ε. There is another pointer from T that points to (T ::= a·, 1) in E2. We already have an SPPF node, u3, labelled (T, 1, 2) so we reuse this node. We also have a node, u4, labelled (a, 1, 2). However, u3 does not have a family of children consisting of the single element u4, so we pack its existing family of children under a new packed node and create a further packed node with child u4.

The procedure proposed by Earley works correctly for the above example, but adding multiple pointers to a given instance of a non-terminal can create errors. As remarked in [19], p. 74, if we consider the grammar S ::= SS | b and the input string bbb, we find that the above procedure generates the correct derivations of bbb but also spurious derivations of the strings bbbb and bb. The problem is that the derivation of bb from the left-most S in one derivation of bbb becomes intertwined with the derivation of bb from the right-most S in the other derivation, resulting in the creation of bbbb.

We could avoid this problem by creating separate instances of the items for different substring matches, so if (B ::= δ·, k), (B ::= σ·, k′) ∈ Ei where k ≠ k′ then we create two copies of (D ::= τB · µ, h), one pointing to each of the two items. In the above example we would create two items (S ::= SS·, 0) in E3, one in which the second S points to (S ::= b·, 2) and the other in which the second S points to (S ::= SS·, 1). This would cause correct derivations to be generated, but it also effectively embeds all the derivation trees in the construction and, as reported by Johnson, the size cannot be bounded by O(n^p) for any fixed integer p. For example, using such a method for input b^n to the grammar

S ::= SSS | SS | b

the set Ei constructed by the parser will contain Ω(i^3) items and hence the complete structure contains Ω(n^4) elements. Thus this version of Earley's


method does not result in a cubic parser. To see this, note first that, when constructed by the recogniser, the Earley set Ei is the union of the sets

U0 = {(S ::= b·, i − 1), (S ::= ·SSS, i), (S ::= ·SS, i), (S ::= ·b, i)}
U1 = {(S ::= S · SS, k) | i − 1 ≥ k ≥ 0}
U2 = {(S ::= S · S, k) | i − 1 ≥ k ≥ 0}
U3 = {(S ::= SS·, k) | i − 1 ≥ k ≥ 0}
U4 = {(S ::= SS · S, k) | i − 2 ≥ k ≥ 0}
U5 = {(S ::= SSS·, k) | i − 3 ≥ k ≥ 0}.

If we add pointers then, since there are i elements (S ::= SS·, q) in Ei, 0 ≤ q ≤ (i − 1), and (S ::= ·SSS, q) ∈ Eq, we will add i elements of the form (S ::= S · SS, q) to Ei. Then Eq will have q elements of the form (S ::= S · SS, p), 0 ≤ p ≤ (q − 1), so we will add i(i − 1)/2 elements of the form (S ::= SS · S, r) to Ei, 0 ≤ r ≤ (i − 1). Finally, Eq will have q(q − 1)/2 elements of the form (S ::= SS · S, p), 0 ≤ p ≤ (q − 1), so we will add i(i − 1)(i − 3)/6 elements of the form (S ::= SSS·, r) to Ei.

Grune [10] has described a parser which exploits an Unger style parser to construct the derivations of a string from the sets produced by Earley's recogniser. However, as noted by Grune, in the case where the number of derivations is exponential the resulting parser will be of at least unbounded polynomial order in the worst case.

4  A cubic parser which walks the Earley sets

We can turn Earley's algorithm into a correct parser by adding pointers between items rather than between instances of non-terminals, and labelling the pointers in a way which allows a binarised SPPF to be constructed by walking the resulting structure. (In the next section we shall give a version of the algorithm that constructs a binarised SPPF as the Earley sets are constructed.)

Set E0 to be the items (S ::= ·α, 0). For i > 0 initialise Ei by adding the item p = (A ::= αai · β, j) for each q = (A ::= α · ai β, j) ∈ Ei−1 and, if α ≠ ε, creating a predecessor pointer labelled i − 1 from q to p. Before initialising Ei+1, complete Ei as follows. For each item (B ::= γ · Dδ, k) ∈ Ei and each rule D ::= ρ, add (D ::= ·ρ, i) to Ei. For each item t = (B ::= τ ·, k) ∈ Ei and each corresponding item q = (D ::= σ · Bµ, h) ∈ Ek, if there is no item p = (D ::= σB · µ, h) ∈ Ei create one. Add a reduction pointer labelled k from p to t and, if σ ≠ ε, a predecessor pointer labelled k from p to q.

We could walk the above structure in a fashion that is essentially the same as described in Example 1 above. However, in order to construct a binarised SPPF we also have to introduce additional nodes for grammar rules of length greater than two. Thus the final algorithm is slightly more complicated. An interior node, u, of the SPPF is either a symbol node labelled (B, j, i) or an intermediate node labelled (B ::= γx · δ, j, i). A family of children of u will consist of one or two nodes. For a symbol node the family will correspond to a grammar rule B ::= γy or B ::= ε. If γ ≠ ε then the children will be labelled (B ::= γ · y, j, l) and (y, l, i), for some l. Otherwise there will be a single child in the family, labelled (y, j, i) or ε. For an additional node the family will have a child labelled (x, l, i). If γ ≠ ε then the family will have a second child labelled (B ::= γ · xδ, j, l).

We now define a function which takes an SPPF node u and an item p from an Earley set Ei, possibly decorated with pointers, and builds the corresponding part of the SPPF. A decorated item consists of an LR(0)-item, A ::= α · β, a left hand index j, a right hand index i, and a set of associated labelled pointers. We assume that these attributes and the complete Earley set structure are passed into Buildtree with u and p.

Buildtree(u, p) {
  suppose that p ∈ Ei and that p is of the form (A ::= α · β, j)
  mark p as processed
  if p = (A ::= ·, j) {
    if there is no SPPF node v labelled (A, i, i) create one with child node ε
    if u does not have a family (v) then add the family (v) to u }
  if p = (A ::= a · β, j) (where a is a terminal) {
    if there is no SPPF node v labelled (a, i − 1, i) create one
    if u does not have a family (v) then add the family (v) to u }
  if p = (A ::= C · β, j) (where C is a non-terminal) {
    if there is no SPPF node v labelled (C, j, i) create one
    if u does not have a family (v) then add the family (v) to u
    for each reduction pointer from p labelled j {
      suppose that the pointer points to q
      if q is not marked as processed Buildtree(v, q) } }
  if p = (A ::= α′a · β, j) (where a is a terminal, α′ ≠ ε) {
    if there is no SPPF node v labelled (a, i − 1, i) create one
    if there is no SPPF node w labelled (A ::= α′ · aβ, j, i − 1) create one
    for each target p′ of a predecessor pointer labelled i − 1 from p {
      if p′ is not marked as processed Buildtree(w, p′) }
    if u does not have a family (w, v) add the family (w, v) to u }
  if p = (A ::= α′C · β, j) (where C is a non-terminal, α′ ≠ ε) {
    for each reduction pointer from p {
      suppose that the pointer is labelled l and points to q
      if there is no SPPF node v labelled (C, l, i) create one
      if q is not marked as processed Buildtree(v, q)
      if there is no SPPF node w labelled (A ::= α′ · Cβ, j, l) create one
      for each target p′ of a predecessor pointer labelled l from p {
        if p′ is not marked as processed Buildtree(w, p′) }
      if u does not have a family (w, v) add the family (w, v) to u } } }

We build the full SPPF from the root down using the following procedure.

PARSER {
  create an SPPF node u0 labelled (S, 0, n)
  for each decorated item p = (S ::= α·, 0) ∈ En Buildtree(u0, p) }

We illustrate this approach using two examples: the first is the example, discussed above, that results in an error when Earley's parsing approach is used; and the second is a grammar with hidden left recursion and a cycle, resulting in infinitely many derivations.

Example 2

Grammar : S ::= S S | b

Input : bbb

The Earley set structure is essentially as follows.

[Diagram: the Earley sets E0 to E3 for the input bbb, with labelled predecessor and reduction pointers between items.]

(For ease of reading, pointers from nodes not reachable from the node in E3 labelled (S ::= SS·, 0) have been left off the diagram.) The corresponding (correct) binarised SPPF, with the nodes labelled in construction order, is shown below.

[Diagram: the binarised SPPF for bbb, with root (S, 0, 3), symbol nodes (S, 0, 1), (S, 1, 2), (S, 2, 3), (S, 0, 2), (S, 1, 3), (b, 0, 1), (b, 1, 2), (b, 2, 3), intermediate nodes (S ::= S · S, 0, 1), (S ::= S · S, 0, 2) and (S ::= S · S, 1, 2), and packed nodes where derivations are combined.]




Example 3

Grammar:  S ::= A T | a T     A ::= a | B A     B ::= ε     T ::= b b b
Input:    abbb

The Earley set structure is:

[Diagram: the Earley sets E0 to E4 for the input abbb, with labelled predecessor and reduction pointers between items.]

and the corresponding binarised SPPF is shown below.

[Diagram: the binarised SPPF for abbb, with root (S, 0, 4), symbol nodes including (T, 1, 4), (A, 0, 1), (B, 0, 0), (a, 0, 1), (b, 1, 2), (b, 2, 3) and (b, 3, 4), and intermediate nodes including (S ::= a · T, 0, 1), (S ::= A · T, 0, 1), (T ::= b · bb, 1, 2), (T ::= bb · b, 1, 3) and (A ::= B · A, 0, 0).]

5  An integrated parsing algorithm

The Buildtree function described in the previous section is not as efficient as it could be; it has been designed to reflect the principles underlying the approach. We now give a different version of an Earley parser that constructs a binarised SPPF as the Earley sets are constructed, and does not require the items to be decorated with pointers. The SPPF constructed is similar to the binarised SPPF constructed by the BRNGLR algorithm, but the additional nodes are the left hand rather than the right hand children, reflecting the fact that Earley's recogniser is essentially top down rather than bottom up. It is also slightly smaller than the corresponding SPPF from the previous section, as a node with a label of the form (A ::= x · β, j, i) is merged with its child. The algorithm itself is in a form that is similar to the form in which GLR algorithms are traditionally presented. There is a step in the algorithm for each element of the input string, and at step i the Earley set Ei is constructed,


along with all the SPPF nodes with labels of the form (s, j, i), j ≤ i. In order to construct the SPPF as the Earley sets are built, we record with each Earley item the SPPF node that corresponds to it. Thus Earley items are triples (s, j, w) where s is a non-terminal or an LR(0) item, j is an integer and w is an SPPF node with a label of the form (s, j, l). The subtree below such a node w will correspond to the derivation of the substring aj+1 . . . al, from B if s is B and from α if s is B ::= α · β. Earley items of the form (A ::= α · β, j) where |α| ≤ 1 do not have associated SPPF nodes, so we use the dummy node null in this case.

The items in each Ei have to be 'processed', either to add more elements to Ei or to form the basis of the next set Ei+1. Thus when an item is added to Ei it is also added to a set Q, if it is of the form (A ::= α · ai+1 β, h, w), or to a set R otherwise. Elements are removed from R as they are processed and, when R is empty, the items in Q are processed to initialise Ei+1. There is a special case when an item of the form (A ::= α·, i, w) is in Ei; this happens if A ⇒ α ⇒* ε. When this item is processed, items of the form (X ::= τ · Aδ, i, v) ∈ Ei have to be considered, and it is possible that an item of this form may be created after the item (A ::= α·, i, w) has been processed. Thus we use a set H and, when (A ::= α·, i, w) is processed, the pair (A, w) is added to H. Then, when (X ::= τ · Aδ, i, v) is processed, the elements of H are checked and the appropriate action is taken. When an SPPF node is needed we first check to see if one with the required label already exists. To facilitate this checking, the SPPF nodes constructed at the current step are added to a set V.

In the following algorithm ΣN denotes the set of all strings of terminals and non-terminals that begin with a non-terminal, together with the empty string ε.

Input: a grammar Γ = (N, T, S, P) and a string a1 a2 . . . an

EARLEY PARSER {
  E0, . . . , En, R, Q′, V = ∅
  for all (S ::= α) ∈ P {
    if α ∈ ΣN add (S ::= ·α, 0, null) to E0
    if α = a1 α′ add (S ::= ·α, 0, null) to Q′ }
  for 0 ≤ i ≤ n {
    H = ∅, R = Ei, Q = Q′, Q′ = ∅
    while R ≠ ∅ {
      remove an element, Λ say, from R
      if Λ = (B ::= α · Cβ, h, w) {
        for all (C ::= δ) ∈ P {
          if δ ∈ ΣN and (C ::= ·δ, i, null) ∉ Ei { add (C ::= ·δ, i, null) to Ei and R }
          if δ = ai+1 δ′ { add (C ::= ·δ, i, null) to Q } }
        if ((C, v) ∈ H) {
          let y = MAKE_NODE(B ::= αC · β, h, i, w, v, V)
          if β ∈ ΣN and (B ::= αC · β, h, y) ∉ Ei { add (B ::= αC · β, h, y) to Ei and R }
          if β = ai+1 β′ { add (B ::= αC · β, h, y) to Q } } }
      if Λ = (D ::= α·, h, w) {
        if h = i {
          if there is no node v ∈ V labelled (D, i, i) create one with child ε and add it to V
          add (D, v) to H }
        for all (A ::= τ · Dδ, k, z) in Eh {
          let y = MAKE_NODE(A ::= τD · δ, k, i, z, w, V)
          if δ ∈ ΣN and (A ::= τD · δ, k, y) ∉ Ei { add (A ::= τD · δ, k, y) to Ei and R }
          if δ = ai+1 δ′ { add (A ::= τD · δ, k, y) to Q } } } } }
    V = ∅
    create an SPPF node v labelled (ai+1, i, i + 1)
    while Q ≠ ∅ {
      remove an element, Λ = (B ::= α · ai+1 β, h, w) say, from Q
      let y = MAKE_NODE(B ::= αai+1 · β, h, i + 1, w, v, V)
      if β ∈ ΣN { add (B ::= αai+1 · β, h, y) to Ei+1 }
      if β = ai+2 β′ { add (B ::= αai+1 · β, h, y) to Q′ } } }
  if (S ::= τ·, 0, w) ∈ En return w else return failure }

MAKE_NODE(B ::= αx · β, j, i, w, v, V) {
  if β = ε { s = B } else s = (B ::= αx · β)
  if α = ε and β ≠ ε { y = v }
  else {
    if there is no node y ∈ V labelled (s, j, i) create one and add it to V
    if w = null and y does not have a family of children (v) add one
    if w ≠ null and y does not have a family of children (w, v) add one }
  return y }


Using this algorithm on Example 3 from Section 4 results in the following SPPF.

[Diagram: the binarised SPPF for abbb produced by the integrated parser, with root (S, 0, 4), symbol nodes including (A, 0, 1), (T, 1, 4), (B, 0, 0), (a, 0, 1), (b, 1, 2), (b, 2, 3) and (b, 3, 4), and the intermediate node (T ::= bb · b, 1, 3).]

6  The order of the parsers

(As we have done throughout the paper, in this section we use n to denote the length of the input to the parser.) A formal proof that the binarised SPPFs constructed by the BRNGLR algorithm contain at most O(n^3) nodes and at most O(n^3) edges is given in [17]. The proof that the binarised SPPFs constructed by the parsers described in this paper are of at most cubic size is the same, and we do not give it here. Intuitively, however, the non-packed nodes are characterised by an LR(0)-item and two integers, 0 ≤ j ≤ i ≤ n, and thus there are at most O(n^2) of them. Packed nodes are children of some non-packed node, labelled (s, j, i) say, and for a given non-packed node the packed node children are characterised by an LR(0)-item and an integer l which lies between j and i. Thus each non-packed node has at most O(n) packed node children and there are at most O(n^3) packed nodes in a binarised SPPF. As non-packed nodes are the source of at most O(n) edges and packed nodes are the source of at most two edges, there are also at most O(n^3) edges in a binarised SPPF.

For the parsing approach based on the Buildtree procedure described in Section 4, the Earley sets are constructed as for Earley's original algorithm. There are at most O(n^2) items and each item has at most O(n) predecessor pointers, one to each of the collections Ej, 0 ≤ j ≤ i. Because an item is marked as processed as soon as Buildtree is called on it, the parsing process makes at most O(n^2) calls to Buildtree. Assuming that the SPPF is represented in a way that allows n-independent look-up time for a particular node and family of children, the only n-dependent behaviour of Buildtree occurs during the iteration over the predecessor pointers from the input item, and there are at most O(n) such pointers. It is possible to represent the SPPF in the required fashion, one such representation being described in [17]. Thus our Earley parsers can be implemented so that they have worst-case cubic order.

Finally we consider the integrated Earley parser given in Section 5. The while-loop that processes the elements in R executes once for each element added to Ei. For each triple (s, j, i) there is at most one SPPF node labelled


with this triple, and thus there are at most O(n) items in Ei. So the while-loop executes at most O(n) times. As we have already remarked, it is possible to implement the SPPF to allow n-independent look-up time for a given node and family of children. Thus, within the while-loop for R, the only case that triggers potentially n-dependent behaviour is the case in which the item chosen is of the form (D ::= α·, h, w). In this case the set Eh must be searched, which is a worst-case O(n) operation. The while-loop that processes Q is not n-dependent, and thus the integrated parser is worst case O(n^3).

7  Summary and further work

In this paper we have given two versions of a parser based on Earley’s recognition algorithm, both of which are of worst case cubic order. Both algorithms construct a binarised SPPF that represents all possible derivations of the given input string. The approach is based on the approach taken in BRNGLR, a cubic version of Tomita’s algorithm, and the SPPFs constructed are equivalent to those constructed by BRNGLR. Some experimental results comparing the recogniser versions of BRNGLR and Earley’s algorithm are reported in [13]. Now further experimental work is required to compare the performance of the integrated Earley parser described in this paper with the parser version of BRNGLR.

References

[1] John Aycock and Nigel Horspool. Faster generalised LR parsing. In Compiler Construction, 8th Intnl. Conf., CC'99, volume 1575 of Lecture Notes in Computer Science, pages 32-46. Springer-Verlag, 1999.

[2] John Aycock, R. Nigel Horspool, Jan Janousek, and Borivoj Melichar. Even faster generalised LR parsing. Acta Informatica, 37(8):633-651, 2001.

[3] Claus Brabrand. Grambiguity. http://www.brics.dk/~brabrand/grambiguity/, 2006.

[4] Frank L. DeRemer and Thomas J. Pennello. Efficient computation of LALR(1) look-ahead sets. ACM Trans. Program. Lang. Syst., 4(4):615-649, October 1982.

[5] Franklin L. DeRemer. Practical translators for LR(k) languages. PhD thesis, Massachusetts Institute of Technology, 1969.

[6] J. Earley. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94-102, February 1970.

[7] James Gosling, Bill Joy, and Guy Steele. The Java Language Specification. Addison-Wesley, 1996.

[8] James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. The Java Language Specification, Third Edition. Addison-Wesley, 2005.


[9] Susan L. Graham and Michael A. Harrison. Parsing of general context-free languages. Advances in Computing, 14:77-185, 1976.

[10] Dick Grune and Ceriel Jacobs. Parsing Techniques: A Practical Guide. Ellis Horwood, Chichester, England, 1990. (See also: http://www.cs.vu.nl/~dick/PTAPG.html)

[11] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Series in Computer Science. Addison-Wesley, 1979.

[12] Mark Johnson. The computational complexity of GLR parsing. In Masaru Tomita, editor, Generalized LR Parsing, pages 35-42. Kluwer Academic Publishers, The Netherlands, 1991.

[13] Adrian Johnstone, Elizabeth Scott, and Giorgios Economopoulos. Generalised parsing: some costs. In Evelyn Duesterwald, editor, Compiler Construction, 13th Intnl. Conf., CC'04, volume 2985 of Lecture Notes in Computer Science, pages 89-103. Springer-Verlag, Berlin, 2004.

[14] Donald E. Knuth. On the translation of languages from left to right. Information and Control, 8(6):607-639, 1965.

[15] Rahman Nozohoor-Farshi. GLR parsing for ε-grammars. In Masaru Tomita, editor, Generalized LR Parsing, pages 60-75. Kluwer Academic Publishers, The Netherlands, 1991.

[16] Jan G. Rekers. Parser generation for interactive environments. PhD thesis, University of Amsterdam, 1992.

[17] E. A. Scott, A. I. C. Johnstone, and G. R. Economopoulos. BRN-table based GLR parsers. Technical Report TR-03-06, Computer Science Department, Royal Holloway, University of London, London, 2003.

[18] Elizabeth Scott and Adrian Johnstone. Generalised bottom up parsers with reduced stack activity. The Computer Journal, 48(5):565-587, 2005.

[19] Masaru Tomita. Efficient parsing for natural language. Kluwer Academic Publishers, Boston, 1986.

[20] D. H. Younger. Recognition of context-free languages in time n^3. Inform. Control, 10(2):189-208, February 1967.


An Experimental Ambiguity Detection Tool

Sylvain Schmitz
Laboratoire I3S, Université de Nice - Sophia Antipolis & CNRS, France

Abstract

Although programs convey an unambiguous meaning, the grammars used in practice to describe their syntax are often ambiguous and are completed with disambiguation rules. Whether these rules remove all of the ambiguities while preserving the originally intended language can be difficult to ensure. We present an experimental ambiguity detection tool for GNU/bison, and illustrate how it can assist a grammatical development for a subset of Standard ML.

Key words: grammar verification, disambiguation, GLR

1  Introduction

With the broad availability of parser generators that implement the Generalized LR (GLR) [32] or the Earley [10] algorithm, it might seem that the struggles with the dreaded report

    grammar.y: conflicts: 223 shift/reduce, 35 reduce/reduce

are now over. General parsers of these families simulate the various nondeterministic choices in parallel with good performance, and return all the legitimate parses for the input (see Scott and Johnstone for a survey [30]). What our naive account overlooks is that all the legitimate parses according to the grammar might not always be correct in the intended language. With programming languages in particular, a program is expected to have a unique interpretation, and thus a single parse should be returned. Nevertheless, the grammar developed to describe the language is often ambiguous: ambiguous grammars are more concise and readable [1]. The language definition should then include some disambiguation rules to decide which parse to choose.

In this paper, we present a tool for GNU Bison [9] that pinpoints possible ambiguities in context-free grammars (CFGs); the modified Bison source is available from the author's webpage at http://www.i3s.unice.fr/~schmitz/. Grammar and parser developers can then use the ambiguities reported by the tool to write disambiguation


rules where they are needed. Since the problem of finding all the ambiguities in a CFG is undecidable [5,7], our tool implements a conservative algorithm [28]: it guarantees that no ambiguity will be overlooked, but it might return false positives as well. We attempt to motivate the use of such a tool for grammatical engineering [17].

• We first describe a well-known difficult subset of the syntax of Standard ML [22] (Section 2.1) that combines a genuine ambiguity with an LR conflict requiring unbounded lookahead (Section 2.2). A generalized parser manages to parse the corresponding Standard ML programs correctly, but might return more than one parse (Section 2.3).

• We detail how our tool identifies the ambiguity as such and discards the conflict (Section 3). We complete this overview of our tool with more experimental results in Section 4.

• Finally, we examine the shortcomings of the tool and provide some leads for its improvement (Section 5).

2  A Difficult Syntactic Issue

In this section, we consider a subset of the grammar of Standard ML, and use it to illustrate some of the difficulties encountered with classical LALR(1) parser generators in the tradition of YACC [14]. Unlike the grammars sometimes provided in other programming language references, the grammar defined by Miller et al. [22, Appendix B] is not put in LALR(1) form. In fact, it clearly values simplicity over ease of implementation, and includes highly ambiguous rules like ⟨dec⟩ → ⟨dec⟩ ⟨dec⟩.

2.1  Case Expressions in Standard ML

Kahrs [15] describes a situation in the Standard ML syntax where an unbounded lookahead is needed by a deterministic parser in order to correctly parse certain strings. The issue arises with alternatives in function value bindings and case expressions. A small set of grammar rules from the language specification that illustrates the issue is given in Figure 1. The rules describe Standard ML declarations ⟨dec⟩ for functions, where each function name vid is bound, for a sequence ⟨atpats⟩ of atomic patterns, to an expression ⟨exp⟩ using the rule ⟨sfvalbind⟩ → vid ⟨atpats⟩ = ⟨exp⟩. Different function value bindings can be separated by alternation symbols "|". Standard ML case expressions associate an expression ⟨exp⟩ with a ⟨match⟩, which is a sequence of matching rules ⟨mrule⟩ of the form ⟨pat⟩ => ⟨exp⟩, separated by alternation symbols "|".

We translated the original rules from their extended representation into classical BNF. We note ⟨nonterminals⟩ between angle brackets and terminals as such, except for the terminal alternation symbol '|', quoted in order to avoid confusion with the choice meta-character |.


⟨dec⟩       → fun ⟨fvalbind⟩
⟨fvalbind⟩  → ⟨sfvalbind⟩ | ⟨fvalbind⟩ '|' ⟨sfvalbind⟩
⟨sfvalbind⟩ → vid ⟨atpats⟩ = ⟨exp⟩
⟨atpats⟩    → ⟨atpat⟩ | ⟨atpats⟩ ⟨atpat⟩
⟨exp⟩       → case ⟨exp⟩ of ⟨match⟩ | vid
⟨match⟩     → ⟨mrule⟩ | ⟨match⟩ '|' ⟨mrule⟩
⟨mrule⟩     → ⟨pat⟩ => ⟨exp⟩
⟨pat⟩       → vid ⟨atpat⟩
⟨atpat⟩     → vid

Fig. 1. Syntax of function value binding and case expressions in Standard ML.
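The snippet below is not part of the paper; it is a minimal Python encoding of the Figure 1 rules, added so that the later sketches in this section have concrete data to operate on. Representing a production as a plain (lhs, rhs) pair is an assumption of these sketches, not anything prescribed by the tool.

# Hypothetical encoding of the Figure 1 rules: each production is a pair
# (lhs, rhs); nonterminals are lower-case names, terminals are the tokens
# "fun", "|", "=", "=>", "case", "of" and "vid".
GRAMMAR = [
    ("dec",       ["fun", "fvalbind"]),
    ("fvalbind",  ["sfvalbind"]),
    ("fvalbind",  ["fvalbind", "|", "sfvalbind"]),
    ("sfvalbind", ["vid", "atpats", "=", "exp"]),
    ("atpats",    ["atpat"]),
    ("atpats",    ["atpats", "atpat"]),
    ("exp",       ["case", "exp", "of", "match"]),
    ("exp",       ["vid"]),
    ("match",     ["mrule"]),
    ("match",     ["match", "|", "mrule"]),
    ("mrule",     ["pat", "=>", "exp"]),
    ("pat",       ["vid", "atpat"]),
    ("atpat",     ["vid"]),
]
# A symbol is a nonterminal exactly when it appears as a left-hand side.
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}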

Example 2.1 Using mostly these rules, the filter function of the SML/NJ Library could be written [20] as:

datatype 'a option = NONE | SOME of 'a
fun filter pred l =
    let fun filterP (x::r, l) =
              case (pred x)
                of SOME y => filterP (r, y::l)
                 | NONE   => filterP (r, l)
          | filterP ([], l) = rev l
    in filterP (l, [])
    end

The Standard ML compilers consistently reject this correct input, often pinpointing the error at the equal sign in “| filterP ([], l) = rev l”. Let us investigate why they behave this way. 4

2.2 The Conflict

We implemented our set of grammar rules in GNU Bison [9], and the result of a run in LALR(1) mode is a single shift/reduce conflict, a nondeterministic choice between two parsing actions:

state 20
    6 exp: "case" exp "of" match .
    8 match: match . '|' mrule
    '|'  shift, and go to state 24
    '|'  [reduce using rule 6 (exp)]

The conflict takes place just before “| filterP ([], l) = rev l” with the program of Example 2.1.

This behavior is now a de facto standard.

(a) Correct parse tree when reducing.   (b) Attempted parse when shifting.

Fig. 2. Partial parse trees corresponding to the two actions in conflict on Example 2.1.

If we choose one of the actions—shift or reduce—over the other, we obtain the parses drawn in Figure 2. The shift action is chosen by default by Bison, and ends on a parse error when seeing the equal sign where a double arrow was expected, exactly where the Standard ML compilers report an error.

Example 2.2 The issue is further complicated by a dangling ambiguity:

case a of b => case b of c => c | d => d

In this expression, should the dangling “d => d” matching rule be attached to “case b” or to “case a”? The Standard ML definition indicates that the matching rule should be attached to “case b”. In this case, the shift should be chosen rather than the reduction. Our two examples show that we cannot blindly choose one particular action over the other. Nonetheless, we could make the correct decision if we had more information at our disposal. The “=” sign in the lookahead string “| filterP ([], l) = rev l” indicates that the alternative is at the topmost function value binding ⟨fvalbind⟩ level, and not at the “case” level, or it would be a “=>” sign. But the sign can be arbitrarily far away in the lookahead string: an atomic pattern ⟨atpat⟩ can derive a sequence of tokens of unbounded length. The conflict requires an unbounded lookahead. This issue in the syntax of Standard ML is one of its few major defects according to a survey by Rossberg [27]: “[Parsing] this would either require horrendous grammar transformations, backtracking, or some nasty and expensive lexer hack.” Fortunately, the detailed analysis of the conflict we conducted, as well as the ugly or expensive solutions mentioned by Rossberg, are not necessary with a general parser. 5

Some deterministic parsing algorithms—LR-Regular [8,2], noncanonical [31,11], or LL-Regular [24,23]—albeit perhaps less known, are able to exploit unbounded lookahead lengths. Our ambiguity detection algorithm employs similar principles.


Fig. 3. The shared parse forest for the input of Example 2.2.

2.3 General Parsing

A general parser returns all the possible parses for the provided input, and as such discards the incorrect parse of Figure 2b and only returns the correct one of Figure 2a. In particular, a generalized LALR(1) parser explores the two possibilities of the conflict until it reaches the = sign, at which point the incorrect partial parse of Figure 2b fails. Our tool tackles an issue that appeared with the recent popularity of general algorithms for programming language parsers. The user does not know a priori whether the conflict reported by Bison in the LALR(1) automaton is caused by an ambiguity or by an insufficient lookahead length. A casual investigation of its source might only reveal the unbounded lookahead aspect of the conflict, as with Example 2.1, and overlook the ambiguity triggered by embedded case expressions like the one of Example 2.2. The result might be a collection of parse trees—a parse forest—where a single parse tree was expected, hampering the reliability of the computations that follow the parsing phase. Two notions pertain to the current use of parse forests in parsing tools.

The sharing of common subtrees bounds the forest space complexity by a polynomial function of the input length [3]. Figure 3 shows a shared forest for our ambiguity, with a topmost ⟨match⟩ node that merges the two alternative interpretations of the input of Example 2.2.



Klint and Visser [18] developed the general notion of disambiguation filters that reject some of the trees of the parse forest, with the hope of ending the selection process with a single tree. Such a mechanism is implemented in one form or in another in many GLR tools, including SDF [33], Elkhound [21], and Bison [9].
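To make the notion of a disambiguation filter more tangible, here is a small illustrative Python sketch, not taken from any of the cited tools: a shared forest is modelled with explicit ambiguity nodes, and a filter is simply a predicate applied to the competing alternatives, in the spirit of the filters of Klint and Visser. All names in it are assumptions of this sketch.

from dataclasses import dataclass
from typing import List, Union

@dataclass
class Node:
    rule: tuple                   # the production (lhs, rhs) applied at this node
    children: List["Forest"]

@dataclass
class Amb:
    alternatives: List["Node"]    # competing parses of the same substring

Forest = Union["Node", "Amb", str]  # a leaf token is just a string

def filter_forest(forest, prefer):
    """Apply a disambiguation filter: at each ambiguity node keep only the
    alternatives accepted by `prefer`, hoping exactly one survives."""
    if isinstance(forest, str):
        return forest
    if isinstance(forest, Amb):
        kept = [alt for alt in forest.alternatives if prefer(alt)]
        if len(kept) == 1:
            return filter_forest(kept[0], prefer)
        # the filter was not discriminating enough: the ambiguity survives
        return Amb([filter_forest(alt, prefer) for alt in (kept or forest.alternatives)])
    return Node(forest.rule, [filter_forest(c, prefer) for c in forest.children])

A filter only helps if it leaves exactly one alternative at every ambiguity node; Example 2.3 below shows why this cannot be taken for granted.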

Unexpected ambiguities are acute with GLR parsers that compute semantic attributes as they reduce partial trees. The GLR implementations of GNU Bison [9] and of Elkhound [21] are in this situation. Attribute values are synthesized for each parse tree node, and in a situation like the one depicted in Figure 3, the values obtained for the two alternatives of a shared node have to be merged into a single value for the shared node as a whole. The user of these tools should thus provide a merge function that returns the value of the shared node from the attributes of its competing alternatives. Failure to provide a merge function where it is needed forces the parser to choose arbitrarily between the possibilities, which is highly unsafe. Another line of action is to abort parsing with a message exhibiting the ambiguity; this can be set with an option in Elkhound, and it is the behavior of Bison.

Example 2.3 Let us suppose that the user has found out the ambiguity of Example 2.2, and is using a disambiguation filter (in the form of a merge function in Bison or Elkhound) that discards the dotted alternative of Figure 3, leaving only the correct parse according to the Standard ML definition. A simple way to achieve this is to check whether we are reducing using rule ⟨match⟩ → ⟨match⟩ '|' ⟨mrule⟩ or with rule ⟨match⟩ → ⟨mrule⟩. Filters of this variety are quite common, and are given a specific dprec directive in Bison, also corresponding to the prefer and avoid filters in SDF2 [33]. The above solution is unfortunately unable to deal with yet another form of ambiguity with ⟨match⟩, namely the ambiguity encountered with the input:

case a of b => b | c => case c of d => d | e => e

Indeed, with this input, the two shared ⟨match⟩ nodes are obtained through reductions using the same rule ⟨match⟩ → ⟨match⟩ '|' ⟨mrule⟩. Had we trusted our filter to handle all the ambiguities, we would be running our parser under a sword of Damocles. This last example shows that a precise knowledge of the ambiguous cases is needed for the development of a reliable GLR parser. While the problem of detecting ambiguities is undecidable, conservative answers could point developers in the right direction.

3 Detecting Ambiguities

Our tool is implemented in C as a new option in GNU Bison that triggers an ambiguity detection computation instead of the parser generation. The output of this verification on our subset of the Standard ML grammar is:

2 potential ambiguities with LR(0) precision detected:
(match -> mrule . , match -> match . '|' mrule )
(match -> match . '|' mrule , match -> match '|' mrule . )

From this ambiguity report, two things can be noted: that user-friendliness is not a strong point of the tool in its current form, and that the two detected ambiguities correspond to the two ambiguities of Examples 2.2 and 2.3. Furthermore, the reported ambiguities do not mention anything visibly related to the difficult conflict of Example 2.1.

Our ambiguity checking algorithm attempts to find ambiguities as two different parse trees describing the same sentence. Of course, there is in general an infinite number of parse trees with an infinite number of derived sentences, and we therefore make some approximations when visiting the trees. The algorithm in its full generality is described in [28], along with the proof that all ambiguities are caught, and more insights on the false positives returned along the way. We detail here the algorithm on the relevant portion of our grammar, and consider to this end approximations based on LR(0) items: a dot in a grammar production A → α · β can also be seen as a position in an elementary tree—a tree of height one—with root A and leaves labeled by αβ. When moving from item to item, we are also moving inside all the syntax trees that contain the corresponding elementary trees. All the moves from item to item that we describe in the following can be checked on the trees of Figures 2 and 3. Since we want to find two different trees, we work with pairs of concurrent items, starting from a pair (S → · ⟨dec⟩ $, S → · ⟨dec⟩ $) at the beginning of all trees, and ending on a pair (S → ⟨dec⟩ $ ·, S → ⟨dec⟩ $ ·). Between these, we pair items that could be reached upon reading a common sentence prefix, hence following trees that derive the same sentence.
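As an illustration of this item-based view, the sketch below (Python, not the C code added to Bison) represents LR(0) items as dotted productions and provides the two elementary moves used in the example run that follows. It assumes the hypothetical GRAMMAR list from the earlier sketch is in scope.

# An LR(0) item is a dotted production, represented as (lhs, rhs, dot).
def show(item):
    lhs, rhs, dot = item
    return f"{lhs} -> {' '.join(rhs[:dot])} . {' '.join(rhs[dot:])}"

def items_right_of(grammar, symbol):
    """Items whose dot sits immediately to the right of `symbol`, i.e. the
    positions reached just after that symbol has been recognized."""
    return [(lhs, tuple(rhs), d)
            for lhs, rhs in grammar
            for d in range(1, len(rhs) + 1) if rhs[d - 1] == symbol]

def items_starting(grammar, nonterminal):
    """Items with the dot at the start of a production of `nonterminal`
    (the downward closure used when the dot faces that nonterminal)."""
    return [(lhs, tuple(rhs), 0) for lhs, rhs in grammar if lhs == nonterminal]

# For instance, items_right_of(GRAMMAR, "exp") yields the three positions
# explored from pair (1) in the example run of Section 3.1.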

3.1 Example Run

Let us start with the couple of items reported as being in conflict by Bison; just like Bison, our algorithm has found out that the two positions might be reached by reading a common prefix from the beginning of the input:

(⟨match⟩ → ⟨match⟩ · '|' ⟨mrule⟩,  ⟨exp⟩ → case ⟨exp⟩ of ⟨match⟩ ·)        (1)

Unlike Bison, the algorithm attempts to see whether we can keep reading the same sentence until we reach the end of the input. Since we are at the extreme right of the elementary tree for rule ⟨exp⟩ → case ⟨exp⟩ of ⟨match⟩, we are also to the immediate right of the nonterminal ⟨exp⟩ in some rule right part. Our algorithm explores all the possibilities, thus yielding the three couples:

(⟨match⟩ → ⟨match⟩ · '|' ⟨mrule⟩,  ⟨mrule⟩ → ⟨pat⟩ => ⟨exp⟩ ·)             (2)
(⟨match⟩ → ⟨match⟩ · '|' ⟨mrule⟩,  ⟨exp⟩ → case ⟨exp⟩ · of ⟨match⟩)        (3)
(⟨match⟩ → ⟨match⟩ · '|' ⟨mrule⟩,  ⟨sfvalbind⟩ → vid ⟨atpats⟩ = ⟨exp⟩ ·)   (4)

Applying the same idea to the pair (2), we should explore all the items with the dot to the right of ⟨mrule⟩:

(⟨match⟩ → ⟨match⟩ · '|' ⟨mrule⟩,  ⟨match⟩ → ⟨mrule⟩ ·)                    (5)
(⟨match⟩ → ⟨match⟩ · '|' ⟨mrule⟩,  ⟨match⟩ → ⟨match⟩ '|' ⟨mrule⟩ ·)        (6)

At this point, we find ⟨match⟩ → ⟨match⟩ '|' ⟨mrule⟩ ·, our competing item, among the items with the dot to the right of ⟨match⟩: from our approximations, the strings we can expect to the right of the items in the pairs (5) and (6) are the same, and we report the pairs as potential ambiguities. Our ambiguity detection is not over yet: from (4), we could reach successively (showing only the relevant possibilities):

(⟨match⟩ → ⟨match⟩ · '|' ⟨mrule⟩,  ⟨fvalbind⟩ → ⟨sfvalbind⟩ ·)                   (7)
(⟨match⟩ → ⟨match⟩ · '|' ⟨mrule⟩,  ⟨fvalbind⟩ → ⟨fvalbind⟩ · '|' ⟨sfvalbind⟩)    (8)

In this last pair, the dot is to the left of the same symbol, meaning that the following item pair might also be reached by reading the same string from the beginning of the input:

(⟨match⟩ → ⟨match⟩ '|' · ⟨mrule⟩,  ⟨fvalbind⟩ → ⟨fvalbind⟩ '|' · ⟨sfvalbind⟩)    (9)

The dot being to the left of a nonterminal symbol, it is also at the beginning of all the right parts of the productions of this symbol, yielding successively:

(⟨mrule⟩ → · ⟨pat⟩ => ⟨exp⟩,  ⟨fvalbind⟩ → ⟨fvalbind⟩ '|' · ⟨sfvalbind⟩)   (10)
(⟨mrule⟩ → · ⟨pat⟩ => ⟨exp⟩,  ⟨sfvalbind⟩ → · vid ⟨atpats⟩ = ⟨exp⟩)        (11)
(⟨pat⟩ → · vid ⟨atpat⟩,  ⟨sfvalbind⟩ → · vid ⟨atpats⟩ = ⟨exp⟩)             (12)
(⟨pat⟩ → vid · ⟨atpat⟩,  ⟨sfvalbind⟩ → vid · ⟨atpats⟩ = ⟨exp⟩)             (13)
(⟨pat⟩ → vid · ⟨atpat⟩,  ⟨atpats⟩ → · ⟨atpat⟩)                             (14)
(⟨pat⟩ → vid ⟨atpat⟩ ·,  ⟨atpats⟩ → ⟨atpat⟩ ·)                             (15)
(⟨mrule⟩ → ⟨pat⟩ · => ⟨exp⟩,  ⟨atpats⟩ → ⟨atpat⟩ ·)                        (16)
(⟨mrule⟩ → ⟨pat⟩ · => ⟨exp⟩,  ⟨sfvalbind⟩ → vid ⟨atpats⟩ · = ⟨exp⟩)        (17)

Our exploration stops with this last item pair: its concurrent items expect different terminal symbols, and thus cannot reach the end of the input upon reading the same string. The algorithm has successfully found how to discriminate the two possibilities in conflict in Example 2.1.

3.2 Overview of the Algorithm

The example run detailed above relates couples of items. We call this relation the mutual accessibility relation ma, and define it as the union of several primitive relations:

mas for terminal and nonterminal shifts, holding for instance between pairs (8) and (9), but also between (14) and (15),
mae for downwards closures, holding for instance between pairs (9) and (10),
mac for upwards closures in case of a conflict, i.e. when one of the items in the pair has its dot to the extreme right of the rule right part and the concurrent item is different from it, holding for instance between pairs (2) and (5).
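The following rough Python sketch shows how such a mutual accessibility exploration can be organised. It is deliberately coarse and is not the tool's implementation: in particular, the upward closure here ignores the pairing constraints the actual algorithm enforces, so it would report more potential ambiguities than the real test.

from collections import deque

def is_complete(item):
    lhs, rhs, dot = item
    return dot == len(rhs)

def next_symbol(item):
    lhs, rhs, dot = item
    return rhs[dot] if dot < len(rhs) else None

def advance(item):
    lhs, rhs, dot = item
    return (lhs, rhs, dot + 1)

def moves(grammar, nonterminals, item, other):
    """One-sided moves of `item` paired with `other`, mirroring ma_s / ma_e / ma_c."""
    sym = next_symbol(item)
    if sym is not None and sym == next_symbol(other):      # ma_s: common shift
        yield advance(item), advance(other)
    if sym in nonterminals:                                 # ma_e: downward closure
        for lhs, rhs in grammar:
            if lhs == sym:
                yield (lhs, tuple(rhs), 0), other
    if is_complete(item) and item != other:                 # ma_c: upward closure
        for lhs, rhs in grammar:
            t = tuple(rhs)
            for d in range(1, len(t) + 1):
                if t[d - 1] == item[0]:
                    yield (lhs, t, d), other

def potential_ambiguities(grammar, init):
    nts = {lhs for lhs, _ in grammar}
    seen, todo, found = {init}, deque([init]), set()
    while todo:
        a, b = todo.popleft()
        succ = list(moves(grammar, nts, a, b))
        succ += [(x, y) for (y, x) in moves(grammar, nts, b, a)]
        for pair in succ:
            if pair[0] == pair[1] and a != b:
                found.add((a, b))        # two distinct derivations meet: report
            if pair not in seen:
                seen.add(pair)
                todo.append(pair)
    return found

# Usage sketch, assuming the GRAMMAR list from the earlier snippet:
#   aug = GRAMMAR + [("S'", ["dec", "$"])]
#   start = ("S'", ("dec", "$"), 0)
#   report = potential_ambiguities(aug, (start, start))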

The algorithm thus constructs the image of the initial pair (S′ → · S$, S′ → · S$) by the ma∗ relation. If at some point we reach a pair holding twice the same item from a pair with different items, we report an ambiguity. The eligible single moves from item to item are in fact the transitions in a nondeterministic LR(0) automaton (thereafter called the LR(0) NFA). The size of the ma relation is bounded by the square of the size of this NFA. Let |G| denote the size of the context-free grammar G, i.e. the sum of the lengths of all the rules' right parts, and |P| denote the number of rules; then, in the LR(0) case, the algorithm's time and space complexity is bounded by O((|G| |P|)²).

3.3 Implementation Details

The experimental tool currently implements the algorithm with LR(0), SLR(1), and LR(1) items. Although the space required by LR(1) item pairs is really large, we need this level of precision in order to guarantee an improvement over the LALR(1) construction. The implementation changes a few details:

We construct a nondeterministic automaton [13,12] whose states are either the items of form A → α · β, or some nonterminal items of form · A or A ·. For instance, a nonterminal item would be used when computing the mutual accessibility of (2) and before reaching (5):

(⟨match⟩ → ⟨match⟩ · '|' ⟨mrule⟩,  ⟨mrule⟩ ·)      (18)

The size of the NFA then becomes bounded by O(|G|) in the LR(0) and SLR(1) cases, and by O(|G| |T|²)—where |T| is the number of terminal symbols—in the LR(1) case, and the complexity of the algorithm is thus bounded by the square of these numbers.

We consider the associativity and static precedence directives [1] of Bison, and thus we do not report resolved ambiguities.

4 Experimental Comparisons

The choice of a conservative ambiguity detection algorithm is currently rather limited. Several parsing techniques define subsets of the unambiguous grammars, and beyond LR(k) parsing, two major parsing strategies exist: LR-Regular parsing [8], which in practice explores a regular cover of the right context of LR conflicts with a finite automaton [2], and noncanonical parsing [31], where the exploration is performed by the parser itself. Since we follow the latter principle with our algorithm, we call it a noncanonical unambiguity (NU) test. A different approach, unrelated to any parsing method, was proposed by Brabrand et al. [4] with their horizontal and vertical unambiguity check (HVRU). Horizontal ambiguity appears with overlapping concatenated languages, and vertical ambiguity with non-disjoint unions; their method thus follows exactly how the context-free grammar was formed.


Table 1. Reported ambiguities in the grammars from [28].
Grammar:      G3n  G4n  G5  G6  G7
actual class: LR(2^n)  ambiguous  non-LRR  non-LR  LR(0)
Bison:        1  1  1  6  0
HVRU [4]:     0  1
NU(item0):    0  1  0  9  0

Their intended application is to test grammars that describe RNA secondary structures [26]. We implemented an LR and an LRR test using the same item pairing technique as our NU algorithm. We present experimental comparisons with these, as well as with the HVRU algorithm when the data is available.

4.1 General Comparisons

The formal comparisons of our algorithm with various other methods given in [28] are sustained by several small grammars. Table 1 compiles the results obtained on these grammars. The “Bison” column provides the total number of conflicts (shift/reduce as well as reduce/reduce) reported by Bison, the “HVRU” column the number of potential ambiguities (horizontal or vertical) reported by the HVRU algorithm, and the “NU(item0)” column the number of potential ambiguities reported by our algorithm with LR(0) items. 6 The grammar families G3n and G4n demonstrate the complexity gains with our algorithm as compared to, e.g., LR(k) parsing:

S → A | Bn,  A → Aaa | a,  B1 → aa,  B2 → B1 B1, …, Bn → Bn−1 Bn−1

(G3n )

S → A | Bn a,  A → Aaa | a,  B1 → aa,  B2 → B1 B1, …, Bn → Bn−1 Bn−1.   (G4n)

While an LR(2^n) test is needed in order to tell that G3n is unambiguous, the grammar is found unambiguous with our algorithm using LR(0) items. Grammar G5 is a non-LRR [8] grammar with rules

S → AC | BCb,  A → a,  B → a,  C → cCb | cb.

(G5 )

It is also found unambiguous by our algorithm using LR(0) items. Grammars G6 and G7 show that our method is not comparable with the horizontal and vertical ambiguity detection method of Brabrand et al. Grammar G6 is a palindrome grammar with rules

S → aSa | bSb | a | b | ε

(G6 )

that our method erroneously finds ambiguous. Conversely, grammar G7 with rules

S → AA,  A → aAa | b   (G7)

is an LR(0) grammar, and the test of Brabrand et al. finds it horizontally ambiguous and not vertically ambiguous. For completeness, we also give the results of our tool on the RNA grammars of [26] in Table 2.

Table 2. Reported potential ambiguities in the RNA grammars from [26].
Grammar   actual class   Bison   HVRU [4]   NU(item1)
G1        ambiguous      30      6          14
G2        ambiguous      33      7          13
G3        non-LR         4       0          2
G4        SLR(1)         0       0          0
G5        SLR(1)         0       0          0
G6        LALR(1)        0       0          0
G7        non-LR         5       0          3
G8        LALR(1)        0       0          0

4.2 Experiments with Programming Language Grammars

We ran the LR, LRR and NU tests on five different ambiguous grammars for programming languages: ANSI C [16, Appendix A.13], retrieved from the comp.compilers FTP at ftp://ftp.iecc.com/pub/file/; the grammar is LALR(1), except for a dangling else ambiguity. The ANSI C' grammar is the same grammar modified by setting typedef names to be a nonterminal, with a single production ⟨typedef-name⟩ → identifier; the modification reflects the fact that GLR parsers should not rely on the lexer hack for disambiguation. Standard ML, extracted from the language definition [22, Appendix B]. Elsa C++, developed with the Elkhound GLR parser generator [21], and a smaller version without class declarations or function bodies.

In order to provide a better ground for comparisons between LR, LRR and NU testing, we implemented an option that computes the number of initial LR(0) item pairs in conflict—for instance pair (1)—that can reach a point of ambiguity—for instance pair (5)—through the ma relation. Table 3 presents the number of such initial conflicting pairs with our tests when employing LR(0) items, SLR(1) items, and LR(1) items. 6 Although we ran our tests on a machine equipped with a 3.2GHz Xeon and 3GiB of physical memory, several tests employing LR(1) items exhausted the memory. The explosive number of LR(1) items is also responsible for a huge slowdown: for the small Elsa grammar, the NU test with SLR(1) items ran in 0.38 seconds, against 7 minutes for the corresponding LR(1) test. The decrease in the figures as we go from LR to LRR and then to NU is guaranteed by the underlying theory. Nevertheless, the fact that, in practice, the NU test improves significantly over the LR and LRR ones is a good sign.

These conflicts are not directly comparable with the number of conflicts in a LALR(1) construction like that of Bison, since a single item pair can cause conflicts in different LALR(1) states, and conversely, in case of multiple conflicts in a single LALR(1) state, we report an item pair for each combination. They are not comparable with the numbers of potential ambiguities reported by NU either; for instance, NU(item1 ) would report 89 potential ambiguities for Standard ML and 52 for ANSI C’.

Table 3. Number of initial LR(0) conflicting pairs remaining with each test.
Grammar:  ANSI C, ANSI C', Standard ML, Small Elsa C++, Elsa C++
Bison:    1, 38, 258, 58, 115
LR(0) items (LR LRR NU):   377 14 3 | 387 43 32 | 477 299 271 | 1379 278 226 | 2094 323 308
SLR(1) items (LR LRR NU):  14 14 3 | 43 43 32 | 165 163 135 | 66 63 58 | 91 88 77
LR(1) items (LR LRR NU):   1 1 1 29 127 125 104 64 -

5 Current Limitations

Our implementation is still a prototype. We describe several planned improvements (Sections 5.1 and 5.2), followed by a small account on the difficulty of considering dynamic disambiguation filters and merge functions in the algorithm (Section 5.3).

5.1 Ambiguity Report

As mentioned in the beginning of Section 3, the ambiguity report returned by our tool is hard to interpret. A first solution, already adopted by Brabrand et al. [4], is to attempt to generate actually ambiguous inputs that match the detected ambiguities. The ambiguity report would then comprise two parts, one for proven ambiguities with examples of input, and one for the potential ambiguities. The generation should only follow item pairs from which the potential ambiguities are reachable through ma relations, and stop whenever finding the ambiguity or after having explored a given number of paths. Displaying the (potentially) ambiguous paths in the grammar in a graphical form is a second possibility. This feature is implemented by ANTLRWorks, the development environment for the upcoming version 3 of ANTLR [23].

5.2 Running Time

The complexity of our algorithm is a square function of the grammar size. If, instead of item pairs, we considered deterministic states of items like LALR(1) does, the worst-case complexity would rise to an exponential function. Our algorithm is thus more robust. Nonetheless, practical computations seem likely to be faster with LALR(1) item sets: a study of LALR(1) parser sizes by Purdom [25] showed that the size of the LALR(1) parser was usually a linear function of the size of the grammar. Therefore, all hope of analyzing large GLR grammars—like the Cobol grammar recovered by Lämmel and Verhoef [19]—is not lost. The theory behind noncanonical LALR parsing [29] translates well into a special case of our algorithm for ambiguity detection, and future versions of the tool should implement it.


Fig. 4. The shared parse forest for input aabc with grammar G8 .

5.3 Dynamic Disambiguation Filters

Our tool does not ignore potential ambiguities when the user has declared a merge function that might solve the issue. The rationale is simple: we do not know whether the merge function will actually solve the ambiguity. Consider for instance the rules

A → aBc | aaBc,  B → ab | b.   (G8)

Our tool reports an ambiguity on the item pair (B → ab ·, B → b ·), and is quite right: the input aabc is ambiguous. As shown in Figure 4, adding a merge function on the rules of B would not resolve the ambiguity: the merge function should be written for A. If we consider arbitrary productions for B, a merge function might be useful only if the languages of the alternatives for B are not disjoint. We could thus improve our tool to detect some useless merge declarations. On the other hand, if the two languages are not equivalent, then there are cases where a merge function is needed on A—or even at a higher level. Ensuring equivalence is difficult, but could be attempted in some decidable cases, namely when we can detect that the languages of the alternatives of B are finite or regular, or using structural bisimulation equivalence [6].
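For the finite case mentioned above, a bounded enumeration already suffices to spot a useless merge declaration. The following Python sketch is an illustration of that idea on G8, not a feature of the tool; the depth bound and the representation of productions are assumptions of the sketch.

def derive(grammar, symbols, depth):
    """Terminal strings derivable from the sentential form `symbols`
    within `depth` expansion levels (bounded, illustrative only)."""
    nts = {lhs for lhs, _ in grammar}
    if not symbols:
        return {()}
    head, rest = symbols[0], list(symbols[1:])
    tails = derive(grammar, rest, depth)
    if head not in nts:
        return {(head,) + t for t in tails}
    if depth == 0:
        return set()
    out = set()
    for lhs, rhs in grammar:
        if lhs == head:
            for h in derive(grammar, list(rhs), depth - 1):
                out |= {h + t for t in tails}
    return out

# G8 from the paper: A -> aBc | aaBc, B -> ab | b
G8 = [("A", ["a", "B", "c"]), ("A", ["a", "a", "B", "c"]),
      ("B", ["a", "b"]), ("B", ["b"])]

languages_of_B = [derive(G8, rhs, 3) for lhs, rhs in G8 if lhs == "B"]
merge_on_B_can_fire = bool(set.intersection(*languages_of_B))
# -> False: the alternatives of B derive the disjoint languages {ab} and {b},
#    so a merge function declared on B alone can never be called; as Figure 4
#    shows, the disambiguation has to be put on A instead.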

6 Conclusions

The paper reports on an ambiguity detection tool. In spite of its experimental state, the tool has been successfully used on a very difficult portion of the Standard ML grammar. The tool also improves on the dreaded LALR(1) conflicts report, albeit at a much higher computational price. We hope that the need for such a tool, the results obtained with this first implementation, and the solutions described for the current limitations will encourage the investigation of better ambiguity detection techniques. The integration of our method with the one designed by Brabrand et al. is another promising solution.

Acknowledgement

The author is highly grateful to Jacques Farré for his help in the preparation of this paper, to Sébastien Verel for granting him access to a fast computer, and to the anonymous referees for their numerous helpful suggestions.


References [1] Aho, A. V., S. C. Johnson and J. D. Ullman, Deterministic parsing of ambiguous grammars, Communications of the ACM 18 (1975), pp. 441–452. [2] Bermudez, M. E. and K. M. Schimpf, Practical arbitrary lookahead LR parsing, Journal of Computer and System Sciences 41 (1990), pp. 230–250. [3] Billot, S. and B. Lang, The structure of shared forests in ambiguous parsing, in: ACL’89 (1989), pp. 143–151. URL http://www.aclweb.org/anthology/P89-1018 [4] Brabrand, C., R. Giegerich and A. Møller, Analyzing ambiguity of context-free grammars, Technical Report RS-06-09, BRICS, University of Aarhus (2006). URL http://www.brics.dk/∼brabrand/grambiguity/ [5] Cantor, D. G., On the ambiguity problem of Backus systems, Journal of the ACM 9 (1962), pp. 477–479. [6] Caucal, D., Graphes canoniques de graphes alg´ebriques, RAIRO - Theoretical Informatics and Applications 24 (1990), pp. 339–352. URL http://www.inria.fr/rrrt/rr-0872.html [7] Chomsky, N. and M. P. Sch¨ utzenberger, The algebraic theory of context-free languages, in: P. Braffort and D. Hirshberg, editors, Computer Programming and Formal Systems, Studies in Logic, North-Holland Publishing, 1963 pp. 118– 161. ˇ [8] Culik, K. and R. Cohen, LR-Regular grammars—an extension of LR(k) grammars, Journal of Computer and System Sciences 7 (1973), pp. 66–96. [9] Donnely, C. and R. Stallman, “Bison version 2.1,” (2005). URL http://www.gnu.org/software/bison/manual/ [10] Earley, J., An efficient context-free parsing algorithm, Communications of the ACM 13 (1970), pp. 94–102. [11] Fortes G´ alvez, J., S. Schmitz and J. Farr´e, Shift-resolve parsing: Simple, linear time, unbounded lookahead, in: O. H. Ibarra and H.-C. Yen, editors, CIAA’06, Lecture Notes in Computer Science 4094 (2006), pp. 253–264. [12] Grune, D. and C. J. H. Jacobs, “Parsing Techniques: A Practical Guide,” Ellis Horwood Limited, 1990. URL http://www.cs.vu.nl/∼dick/PTAPG.html [13] Hunt III, H. B., T. G. Szymanski and J. D. Ullman, Operations on sparse relations and efficient algorithms for grammar problems, in: 15th Annual Symposium on Switching and Automata Theory (1974), pp. 127–132. [14] Johnson, S. C., YACC — yet another compiler compiler, Computing science technical report 32, AT&T Bell Laboratories, Murray Hill, New Jersey (1975).


[15] Kahrs, S., Mistakes and ambiguities in the definition of Standard ML, Technical Report ECS-LFCS-93-257, University of Edinburgh, LFCS (1993). URL http://www.lfcs.inf.ed.ac.uk/reports/93/ECS-LFCS-93-257/ [16] Kernighan, B. W. and D. M. Ritchie, “The C Programming Language,” Prentice-Hall, 1988. [17] Klint, P., R. L¨ ammel and C. Verhoef, Toward an engineering discipline for grammarware, ACM Transactions on Software Engineering and Methodology 14 (2005), pp. 331–380. [18] Klint, P. and E. Visser, Using filters for the disambiguation of context-free grammars, in: G. Pighizzini and P. San Pietro, editors, ASMICS Workshop on Parsing Theory, Technical Report 126-1994 (1994), pp. 89–100. URL http://citeseer.ist.psu.edu/klint94using.html [19] L¨ ammel, R. and C. Verhoef, Semi-automatic grammar recovery, Software: Practice & Experience 31 (2001), pp. 1395–1438. [20] Lee, P., “Using the SML/NJ System,” Carnegie Mellon University (1997). URL http://www.cs.cmu.edu/∼petel/smlguide/smlnj.htm [21] McPeak, S. and G. C. Necula, Elkhound: A fast, practical GLR parser generator, in: E. Duesterwald, editor, CC’04, Lecture Notes in Computer Science 2985 (2004), pp. 73–88. [22] Milner, R., M. Tofte, R. Harper and D. MacQueen, “The definition of Standard ML,” MIT Press, 1997, revised edition. [23] Parr, T. J., “The Definitive ANTLR Reference: Building Domain-Specific Languages,” The Pragmatic Programmers, 2007. [24] Poplawski, D. A., On LL-Regular grammars, Journal of Computer and System Sciences 18 (1979), pp. 218–227. [25] Purdom, P., The size of LALR(1) parsers, BIT Numerical Mathematics 14 (1974), pp. 326–337. [26] Reeder, J., P. Steffen and R. Giegerich, Effective ambiguity checking in biosequence analysis, BMC Bioinformatics 6 (2005), p. 153. [27] Rossberg, A., Defects in the revised definition of Standard ML, Technical report, Saarland University, Saarbr¨ ucken, Germany (2006). URL http://ps.uni-sb.de/Papers/paper info.php?label=sml-defects [28] Schmitz, S., Conservative ambiguity detection in context-free grammars, Technical Report I3S/RR-2006-30-FR, Laboratoire I3S, Universit´e de Nice Sophia Antipolis & CNRS (2006). URL http://www.i3s.unice.fr/∼mh/RR/2006/RR-06.30-S.SCHMITZ.pdf [29] Schmitz, S., Noncanonical LALR(1) parsing, in: Z. Dang and O. H. Ibarra, editors, DLT’06, Lecture Notes in Computer Science 4036 (2006), pp. 95–107.


[30] Scott, E. and A. Johnstone, Right nulled GLR parsers, ACM Transactions on Programming Languages and Systems 28 (2006), pp. 577–618. [31] Szymanski, T. G. and J. H. Williams, Noncanonical extensions of bottom-up parsing techniques, SIAM Journal on Computing 5 (1976), pp. 231–250. [32] Tomita, M., “Efficient Parsing for Natural Language,” Kluwer Academic Publishers, 1986. [33] van den Brand, M., J. Scheerder, J. J. Vinju and E. Visser, Disambiguation filters for scannerless generalized LR parsers, in: R. N. Horspool, editor, CC’02, Lecture Notes in Computer Science 2304 (2002), pp. 143–158. URL http://www.springerlink.com/content/03359k0cerupftfh/


Grammar Engineering Support for Precedence Rule Recovery and Compatibility Checking

Eric Bouwers a,1   Martin Bravenboer b,2   Eelco Visser b,3

a Department of Information and Computing Sciences, Utrecht University, The Netherlands
b Department of Software Technology, Delft University of Technology, The Netherlands

Abstract A wide range of parser generators are used to generate parsers for programming languages. The grammar formalisms that come with parser generators provide different approaches for defining operator precedence. Some generators (e.g. YACC) support precedence declarations, others require the grammar to be unambiguous, thus encoding the precedence rules. Even if the grammar formalism provides precedence rules, a particular grammar might not use it. The result is grammar variants implementing the same language. For the C language, the GNU Compiler uses YACC with precedence rules, the C-Transformers uses SDF without priorities, while the SDF library does use priorities. For PHP, Zend uses YACC with precedence rules, whereas PHP-front uses SDF with priority and associativity declarations. The variance between grammars raises the question if the precedence rules of one grammar are compatible with those of another. This is usually not obvious, since some languages have complex precedence rules. Also, for some parser generators the semantics of precedence rules is defined operationally, which makes it hard to reason about their effect on the defined language. We present a method and tool for comparing the precedence rules of different grammars and parser generators. Although it is undecidable whether two grammars define the same language, this tool provides support for comparing and recovering precedence rules, which is especially useful for reliable migration of a grammar from one grammar formalism to another. We evaluate our method by the application to non-trivial mainstream programming languages, such as PHP and C. Keywords: Precedence, precedence rules, disambiguation, priorities, associativity, grammar engineering, grammar recovery, parsing, YACC, SDF.

1 Introduction

Defining the syntax of a programming language using a context-free grammar is one of the most established practices in the software industry and computer science. For various reasons a wide range of parser generators are used to generate parsers from context-free grammars. For almost every mainstream programming language there exists a series of parser generators, not only featuring different parsing algorithms, but also different grammar formalisms. These grammar formalisms often provide methods for declaring the precedence of operators, since the notions of priority and associativity are pervasive in the definition of the syntax of programming languages.

1 Email: [email protected]
2 Email: [email protected] (corresponding author)
3 Email: [email protected]



As early as 1975 Aho and Johnson recognized [1] that for many languages the most natural grammar is not accepted by the parser generators that are used in practice, since the grammar does not fall in the class of context-free grammars for which the parser generator can produce an efficient parser. Aho and Johnson proposed to define the syntax of a programming languages as an ambiguous grammar combined with disambiguation rules that tell the parser how to resolve a parsing action conflict, a method that was implemented in the now still dominant YACC parser generator [5]. Unfortunately, most of the work on separate precedence declarations has been guided by the underlying parsing technique and not by an analysis of the requirements and fundamentals of precedence declarations. Indeed, parser generators only support precedence rules that can efficiently be implemented in the parser. This is understandable from a practical point of view, yet the result is that there is little known about the actual requirements for separate precedence declarations. Indeed, the semantics of separate precedence declarations is apparently so ill-defined that it is still not used in language specifications. Rather, language specifications prefer to encode precedence rules in the productions of the grammar. Sadly, it is difficult to disagree with this approach, since an encoding in productions is still the most precise, formal, and parsing technology independent way of defining precedences! In this paper, we argue that precedence rules need to be liberated from the idiosyncrasies of specific parser generators. The reasons for this are closely related to the efforts to work towards an engineering discipline for grammarware [6,11,13,9]. Liberating grammars from concrete parser generators is not a new idea [8], however precedence rules have never been studied fundamentally outside of the context of specific parsing technologies or parser generators. Indeed, there is currently, for example, no solid methodology to •

recover precedence rules from ‘legacy’ grammar formalisms. For example, for PHP there is no language specification, only a YACC grammar. Due to the conflict resolution semantics of YACC precedence declarations, the exact precedence rules of PHP are currently very difficult to determine.



compare the precedence rules of two grammars, whether they are defined in the same grammar formalism or not. For example, for the C language, the GNU Compiler uses YACC with precedence rules, the C-Transformers [2] uses SDF [15] without priorities, while the SDF library does use priorities. For PHP, Zend uses YACC with precedence rules, whereas PHP-front uses SDF with priority and associativity declarations. However, there is no way to check that the precedence rules of one grammar are compatible with those of another.



reliably migrate a grammar from one grammar formalism to another including its precedence rules. This does not necessarily have to be completely automatic, but at least there can be support for recovering precedence rules and generating precedence declarations in the new formalism.

In this paper we present a method and its implementation for recovering precedence rules from grammars. Our method is based on a core formalism for defining precedence rules, which is independent of specific parser generators. Based on this formalism and the recovery of precedence rules, we can compare precedence rules of different grammars, defined in different grammar formalisms, and using different precedence declaration mechanisms. We have implemented support for recovering precedence rules from YACC [5] and SDF [4,15] (parser generators using different parsing algorithms) and present the details

of an algorithm to check precedence rules against LR parsers. Although it is undecidable whether two grammars define the same language, this tool provides support for comparing and recovering precedence rules, which is especially useful for reliable migration of a grammar from one grammar formalism to another. Also, the method can be used to analyze the precedence rules of a language, for example to determine if they can be defined using a certain grammar formalism specific precedence declaration mechanism. We evaluate our method by the application to the non-trivial mainstream programming languages C and PHP. For both languages we compare the precedence rules of three grammars defined in SDF or YACC. The evaluation was most successful and revealed several differences and bugs in the precedence rules of the grammars. The YACC and SDF implementations of the method that we present are implemented in Stratego/XT [16] and available as open source software as part of the Stratego/XT Grammar Engineering Tools 4 . Contributions. The contributions of this paper are: (1) A core formalism for precedence rules. (2) A novel method for recovering precedence rules from grammars (3) A method for checking the compatibility of precedence rules across grammars (4) Implementations of the recovery method for YACC and SDF and an evaluation for non-trivial programming languages C and PHP. Organization. In Section 2 we introduce notations for context-free grammars and tree patterns. In Section 3 we introduce a running example and explain the precedence mechanisms of YACC and SDF. Section 4 is the body of the paper, where we present our precedence rule recovery method. Section 5 discusses compatibility checking. In Section 6 we present our evaluation, and we conclude with a discussion of related work.

2 Grammars and Tree Patterns

In this section we define the notions and notations for context-free grammars and tree patterns as we will use them in this paper. A context-free grammar G is a tuple (Σ, N, P ), with Σ a set of terminals, N a set of non-terminals, and P a set of productions of the form A → α, where we use the following notation: V for N ∪ Σ; A, B, C for variables ranging over N ; X, Y, Z for variables ranging over V ; a, b for variables ranging over Σ; v, w, x for variables ranging over Σ∗ ; and α, β, γ for variables ranging over V ∗ . Context-free grammars are usually written in some concrete grammar formalism. Figure 1 gives examples of grammars for the same language in different grammar formalisms. The underlying structure is that of context-free grammars just defined. The augmentation of grammars with precedence mechanisms will be discussed in the next section. The family of valid parse trees TG over a grammar G is a mapping from V to a set of trees, and is defined inductively as follows: •

if a is a terminal symbol, then a ∈ TG (a)



if A0 → X1 … Xn is a production in G, and ti ∈ TG(Xi) for 1 ≤ i ≤ n, then ⟨A0 → t1 … tn⟩ ∈ TG(A0).

For example, the tree ⟨E → ⟨E → ⟨T → ⟨F → NUM⟩⟩⟩ + ⟨T → ⟨F → NUM⟩⟩⟩ is a parse tree for the addition of two numbers according to the left-most grammar in Figure 1. 4

http://www.strategoxt.org/GrammarEngineeringTools

YACC1:
%token NUM
E: E '+' T
 | T
T: T '*' F
 | F
F: NUM

YACC2:
%token NUM
%left '+'
%left '*'
E: NUM
 | E '+' E
 | E '*' E

SDF1:
context-free syntax
  E "+" E -> E
  E "*" E -> E
  NUM -> E
context-free priorities
  E "*" E -> E {left} >
  E "+" E -> E {left}
lexical syntax
  [0-9] -> NUM

SDF2:
context-free syntax
  E "+" E -> E {left}
  T -> E
  T "*" T -> T {left}
  F -> T
  NUM -> F
lexical syntax
  [0-9] -> NUM

Fig. 1. Grammars for a small arithmetic expressions language. Left to right: YACC using encoded precedence (YACC1 ), YACC using precedence declarations (YACC2 ), SDF using precedence declarations (SDF1 ), SDF using a mixture of encoding and precedence declarations (SDF2 ).

The family TPG of parse tree patterns (or tree patterns for short) over a grammar G, is a mapping from grammar symbols in V to sets of parse trees over G extended with non-terminals as trees, which we define inductively as follows: •

if X is a terminal or non-terminal symbol in V , then X ∈ TPG (X)



if A0 → X1 … Xn is a production in G, and ti ∈ TPG(Xi) for 1 ≤ i ≤ n, then ⟨A0 → t1 … tn⟩ ∈ TPG(A0).

A parse tree pattern p denotes a set of parse trees, namely the set obtained by replacing each non-terminal A in p by the elements of TG(A). Basically, a parse tree pattern corresponds to the derivation tree for a sentential form. For example, the tree pattern ⟨E → ⟨E → ⟨T → F⟩⟩ + T⟩ denotes the set of trees for summation expressions where the first summand is a ‘factor’. We denote a tree pattern with root A ∈ N and yield α by ⟨A ; α⟩. We use the notation ⟨A ∼ B → t*⟩ to denote an injection chain from a tree pattern with root A to a node with non-terminal B and leaves t*. Formally, ⟨A ∼ B → t*⟩ is the subset of TPG(A) such that

if A → B is a production in G, and ⟨B → t*⟩ ∈ TPG(B), then ⟨A → ⟨B → t*⟩⟩ ∈ ⟨A ∼ B → t*⟩



if A → C is a production in G, and ⟨C ∼ B → t*⟩ ∈ TPG(C), then ⟨A → ⟨C ∼ B → t*⟩⟩ ∈ ⟨A ∼ B → t*⟩

For example, the expression ⟨E → ⟨E ∼ F → NUM⟩ + T⟩ abbreviates the injection chain in the tree pattern ⟨E → ⟨E → ⟨T → ⟨F → NUM⟩⟩⟩ + T⟩. Finally, to define the notion of precedence, we will need one-level tree patterns, which we define as follows:

if A → αBγ and B → β are productions, then ⟨A → α⟨B → β⟩γ⟩ ∈ TP1G(A)



if A → αBγ is a production and ⟨B ∼ C → β⟩ ∈ TPG(B), then ⟨A → α⟨B ∼ C → β⟩γ⟩ ∈ TP1G(A)

That is, one-level tree patterns are productions with a single subtree, with possibly an injection chain from the root production to the child production. Observe that TP1G(A) ⊆ TPG(A). The tree pattern ⟨E → E + ⟨T → T * F⟩⟩ is one-level, and so is ⟨E → ⟨E → ⟨T → T * F⟩⟩ + T⟩. However, ⟨E → ⟨E → ⟨T → T * F⟩⟩ + ⟨T → T * F⟩⟩ is not a one-level tree pattern, since it has two non-chain subtrees.
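Anticipating the generation step of Section 4, the following small Python sketch (not the Stratego/XT implementation of the paper) enumerates one-level tree patterns for a set of expression productions; injection chains are ignored here to keep the example short, and the tuple representation of patterns is an assumption of the sketch.

def one_level_patterns(productions, expression_nts):
    """Enumerate one-level tree patterns: nest one expression production
    directly under another, at every matching position (no injection chains)."""
    patterns = []
    for lhs1, rhs1 in productions:
        if lhs1 not in expression_nts:
            continue
        for lhs2, rhs2 in productions:
            if lhs2 not in expression_nts:
                continue
            for i, sym in enumerate(rhs1):
                if sym == lhs2:
                    nested = ("node", lhs2, list(rhs2))
                    patterns.append(("node", lhs1, rhs1[:i] + [nested] + rhs1[i + 1:]))
    return patterns

# The two productions used as an example in Section 4:
E = [("E", ["E", "+", "E"]), ("E", ["&", "E"])]
for pat in one_level_patterns(E, {"E"}):
    print(pat)
# Six patterns are printed, corresponding to
# <E -> <E -> E + E> + E>, <E -> & <E -> & E>>, <E -> E + <E -> E + E>>, etc.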

3 Precedence Mechanisms

In this paper we focus on two grammar formalisms, their parser generators, and their precedence mechanisms. The first is YACC (Yet Another Compiler-Compiler) and the second is SDF (Syntax Definition Formalism). The parser targeted by the SDF parser generator has a different name: SGLR (Scannerless Generalized-LR). Considering the combination of SDF and YACC is interesting for three reasons. First, the two grammar formalisms provide very different precedence declaration mechanisms. Second, the grammar formalisms are implemented using different parsing techniques. Third, the conversion of YACC to SDF is a very common use case. We introduce the basics of the YACC and SDF precedence declaration mechanisms with a few grammars for a small arithmetic language, see Figure 1.

YACC [5] is the classic parser generator. It accepts grammars of the LALR(1) class of context-free grammars with optionally additional disambiguation rules. For our YACC-based tools we use Bison, the GNU version of YACC; however, we will refer to our use of Bison as YACC (on most systems yacc is actually an alias of bison). The first and the second grammar of Figure 1 are YACC grammars. The first grammar encodes the precedence rules of the arithmetic language in the productions of the grammar. The operators + and * are left-associative, since the grammar does not allow an occurrence of + at the right-hand side of a +. The operator * takes precedence over the operator +, since it is not possible at all to have a + at the left or right-hand side of a *.

The second grammar uses separate YACC precedence declarations [1]. Without disambiguation rules (and implicit conflict resolution), this grammar is ambiguous, e.g. 1 + 2 * 3 has two different parse trees: ⟨E → ⟨E → ⟨E → 1⟩ + ⟨E → 2⟩⟩ * ⟨E → 3⟩⟩ and ⟨E → ⟨E → 1⟩ + ⟨E → ⟨E → 2⟩ * ⟨E → 3⟩⟩⟩ are both elements of TG(E). As disambiguation rules, YACC allows declarations of the precedence of operators, which can be %left, %right, or %nonassoc. After the associativity comes a list of tokens. All tokens on the same line have the same precedence. The relative precedence of the operators is defined by the order of the precedence declarations. The operators in the first precedence declaration have lower precedence than the next.

The semantics of the precedence declarations of YACC is defined in terms of parser generation. YACC produces an LALR parse table in which the action has to be deterministic for each state and lookahead. If there are multiple possible actions, then this results in shift/reduce or reduce/reduce conflicts. The precedence declarations are used by YACC to select the appropriate action if there is a conflict between two actions. If there is no precedence declaration for the involved tokens, then YACC will resolve the conflict by preferring a shift over a reduce. For a reduce/reduce conflict, YACC resolves the conflict by selecting the reduce of the first production in the input grammar. Later we will see in more detail what the consequence of this is for the precedence rules. The main weakness of precedence declarations of YACC is that it is not really a precedence declaration mechanism, i.e. YACC has no notion of precedence of operators. Precedence declarations are a mechanism to resolve conflicts in the parse table, which can be used to implement operator precedence. Unfortunately, this requires understanding of LALR parsing and the way YACC generates a parser.

SDF [4,15] is a feature-rich grammar formalism that integrates lexical and context-free syntax.
SDF supports arbitrary context-free grammars, so grammars are not restricted to subclasses of context-free grammars, such as LL or LALR. The SDF parser generator generates a parse table for a scannerless generalized-LR parser. For disambiguation, SDF supports various disambiguation filters [15,14], some of which are used to define precedence rules. The third grammar of Figure 1 uses the precedence declarations of SDF. 5 Similar to the second YACC grammar, the productions of this grammar define an ambiguous language. A separate definition of priorities is used to define that * takes precedence over +. Also, both operators are defined to be left-associative by using the associativity attribute left. The semantics of SDF priorities is well-defined in terms of the grammar, as opposed to operationally in the parser generator. SDF applies the transitive closure to the declared priority relation over productions (which introduces some limitations). Priority declarations generate a set conflicts(G) of tree patterns of the form ⟨A → α⟨B → β⟩γ⟩. Note that this pattern has the same form as patterns from the set of one-level tree patterns, excluding injection chains. If A → αBγ > B → β is in the closure of the priority relation, then ⟨A → α⟨B → β⟩γ⟩ ∈ conflicts(G). The generated parser will never create a parse tree that matches one of the tree patterns in conflicts(G). The fourth grammar of Figure 1 illustrates that encoding precedence in productions is possible in all grammar formalisms, even if they provide separate precedence declarations. To make the example a bit more interesting, this grammar defines the priority of operators in productions, but uses associativity definitions for individual operators.
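The conflict set construction sketched above can be mimicked in a few lines. The following Python fragment is illustrative only, is not the paper's Stratego/XT tooling, and ignores SDF's argument-specific priorities and associativity attributes.

def transitive_closure(pairs):
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def conflicts(priorities):
    """priorities: iterable of (higher_production, lower_production) pairs,
    each production written as (lhs, rhs)."""
    out = set()
    for hi, lo in transitive_closure(set(priorities)):
        lhs_hi, rhs_hi = hi
        lhs_lo, _ = lo
        for i, sym in enumerate(rhs_hi):
            if sym == lhs_lo:
                out.add((hi, i, lo))   # pattern: lo nested at position i of hi
    return out

mul = ("E", ("E", "*", "E"))
add = ("E", ("E", "+", "E"))
print(conflicts([(mul, add)]))
# -> two patterns, <E -> <E -> E + E> * E> and <E -> E * <E -> E + E>>,
#    i.e. the "* over +" entries of Figure 2 for SDF1 (associativity aside).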

4 Precedence Rule Recovery

In previous sections, we have argued that there is a need for methods and tools for determining the precedence rules of a grammar. In this section, we present such a method for recovering the precedence rules as encoded in productions or defined using separate precedence declarations.

A Core Formalism for Precedence Rules. The recovered precedence rules need to be expressed in a certain formalism. To liberate the precedence rules from the idiosyncrasies of specific grammar formalisms, we need a formalization that is independent of specific parsing techniques. The formalism for precedence rules does not need to be concise or notationally convenient. Rather, it serves as a core representation of precedence rules of programming languages. Inspired by previous work on SDF conflict sets defined by priorities [4,15], we use parse tree patterns to define precedence rules. Parse tree patterns denote a set of parse trees. Thus, a parse tree pattern can be used to define a set of invalid parse trees. For example, for the grammar SDF1 in Figure 1 the tree pattern ⟨E → ⟨E → E + E⟩ * E⟩ denotes a set of invalid parse trees according to the precedence rules of this grammar. However, the precedence rules for a grammar G cannot just be defined as a subset of TPG. The reason for this is that for grammars that encode precedence in productions, there will be no tree patterns that denote invalid parse trees. Such grammars have a series of expression non-terminals that are only allowed at specific places. For example, in grammar YACC1 of Figure 1, the expression E is not allowed at the right-hand side of the operator + in the production E → E + T. Nevertheless, we are interested in precedence rules over such

5 SDF uses a reversed notation for production rules. We will only use this notation in verbatim examples of SDF. All other productions are written in conventional A → α notation.

YACC1:  ⟨T → ⟨T ∼ E → E + T⟩ * F⟩   ⟨T → T * ⟨F ∼ T → T * F⟩⟩   ⟨T → T * ⟨F ∼ E → E + T⟩⟩   ⟨E → E + ⟨T ∼ E → E + T⟩⟩
YACC2:  ⟨E → ⟨E → E + E⟩ * E⟩   ⟨E → E * ⟨E → E * E⟩⟩   ⟨E → E * ⟨E → E + E⟩⟩   ⟨E → E + ⟨E → E + E⟩⟩
SDF1:   ⟨E → ⟨E → E + E⟩ * E⟩   ⟨E → E * ⟨E → E * E⟩⟩   ⟨E → E * ⟨E → E + E⟩⟩   ⟨E → E + ⟨E → E + E⟩⟩
SDF2:   ⟨T → ⟨T ∼ E → E + E⟩ * T⟩   ⟨T → T * ⟨T → T * T⟩⟩   ⟨T → T * ⟨T ∼ E → E + E⟩⟩   ⟨E → E + ⟨E → E + E⟩⟩

Fig. 2. Precedence rules for the grammars of Figure 1: YACC1, YACC2, SDF1, SDF2 (top to bottom).

grammars. Therefore, we define the set of precedence rules for G = (Σ, N, P) to be a subset of TPG⃗(NE), where G⃗(NE) is an extended context-free grammar of the grammar G, where NE ⊆ N and G⃗(NE) = (Σ, N, P′) with P′ = P ∪ {A → B | A ∈ NE, B ∈ NE, A ≠ B}. For example, for the grammar YACC1 in Figure 1, YACC1⃗({E, T, F}) contains the injections E → F, T → E, F → E, and F → T in addition to the productions of YACC1. Based on this definition we can now introduce the precedence rules for the grammars of Figure 1 that are presented in Figure 2. First, note that an injection chain ⟨A → α⟨B ∼ C → β⟩γ⟩ is used when the symbol C of the nested production is not equal to the symbol B at the place where the nested production is used. Second, note that for the grammar YACC1 the tree pattern ⟨T ∼ E → E + T⟩ is not actually valid. This is exactly where YACC1⃗ comes in, since the injection T → E is present in YACC1⃗.

There is no relation defined between the tree patterns that are members of the precedence rule set, e.g. we do not take the transitive closure of a precedence relation over productions. If a precedence declaration for operators needs to be transitively closed for a language, then this should be expressed by having all combinations in the set. A precedence rule set is not by definition required to be minimal. This means that some tree patterns can define precedence rules that are already implied by other tree patterns. Precedence rules defined by tree patterns are closely related to the set of conflicts conflicts(G) defined by SDF priority and associativity declarations. One important difference is that the set of conflicts of SDF is transitively closed, since it is defined by a priority relation that is a strict partial ordering between productions. Another difference is that we do not restrict the tree patterns used in the precedence rule sets to trees of two productions. As mentioned before, we do not assume anything about (the feasibility of) a concise notation for the set of tree patterns.

Tree Pattern Generation. We recover precedence rules from grammars by generating a set of tree patterns involving expression productions and checking if a parse is possible that will result in a parse tree matching the tree patterns. By default, we generate the set of one-level tree patterns TP1G⃗(NE) for a grammar G with P⃗ restricted to

P⃗ = {A → α ∈ P | A ∈ NE}, i.e. a set of tree patterns involving two productions for all combinations of expression productions. For example, the set of one-level tree patterns for the two productions E → E + E and E → & E is ⟨E → & ⟨E → & E⟩⟩, ⟨E → ⟨E → & E⟩ + E⟩, ⟨E → ⟨E → E + E⟩ + E⟩, ⟨E → & ⟨E → E + E⟩⟩, ⟨E → E + ⟨E → & E⟩⟩, ⟨E → E + ⟨E → E + E⟩⟩. One-level tree patterns are sufficient to express the precedence rules of most operator languages. Indeed, our case studies


in Section 6 are based on one-level tree patterns. However, some languages require precedence rules that include 3 or more productions. For this, the precedence recovery tool supports configuration of the number of levels that is to be generated. Next, the question is how to check if a grammar allows a parse that matches a tree pattern. If the pattern is accepted, then there are valid parse trees for this pattern. If not, then it denotes invalid parse trees and it will be an element of the resulting precedence rule set. Clearly, checking tree patterns is parser generator specific, since we need intimate knowledge about the semantics of the grammar formalism that is used by the parser generator. Based on the requirements for our case studies and our practical needs, we implemented the validation of tree patterns for YACC and SDF. However, the algorithm and the approach that is used can easily be ported to different (Generalized) LR parser generators. Precedence Rule Recovery: YACC. For YACC, the precedence rules are difficult to determine from the grammar directly, since the semantics of precedence declarations in YACC is defined operationally. The precedence declarations are used to resolve conflicts during parser generation, which means that precedence rules are only applied if there is actually a conflict. Also, YACC applies implicit conflict resolution mechanisms, i.e. preference for a shift over a reduce, and preference for a reduce of the first production in the grammar. Furthermore, grammars can encode the precedence rules in productions and combine this with precedence declarations, an issue that is not YACC specific. Hence, checking the grammar for possible matches of tree patterns is complex and requires intimate knowledge of YACC parser generation and conflict resolution. A much more general solution is to validate tree patterns against the parse table generated by YACC. Of course, a parser generated by YACC can not parse tree patterns. To check if a tree pattern is valid, we simulate the parsing of a sentential form that results in a parse tree matching the tree pattern. If this is possible, then the tree pattern is valid, otherwise it is invalid. A shift reduce parser is a transition system with as configuration a stack and an input string. The configuration is changed by shift and reduce actions. A shift action moves a symbol from the input to the stack, which corresponds to a step of one symbol in the righthand sides of a set of productions that is currently expected. A reduce removes a number of elements from the stack and replaces them by a single element, which corresponds to the application of a grammar production. In an LR parser [7], the information on the actions to perform is stored in an action table. Both a shift and a reduce introduce state transitions, which are recorded on the stack and are based on information in the action and goto table. After popping elements from the stack in a reduce, the goto entries of the state on top of the stack are consulted to find the new state to push on the stack. To recognize tree patterns, we change the input of the LR parser to a string of tree patterns and symbols. The tree patterns are translated into LR actions and all changes in the configuration of the parser are checked against the actions that are allowed to derive a parse tree that matches the tree pattern. Figure 3 lists the transition rules that implement the modified LR parser for recognizing tree patterns. 
The configuration of the parser, denoted by | stack | input |, is rewritten by the transition rules. The stack grows to the right, the input grows to the left. The variable e ranges over all possible input symbols, which is the set N ∪ Σ ∪ TP_G(N_E) ∪ R(P) ∪ R⃗(N). Hence, the input of the parser consists of a sequence of non-terminals, terminals, tree patterns, and two special elements for representing reduces. R(A → α) represents a reduction of the production A → α. R⃗(A) represents a reduction of any chain production B → C, until B is A. The function head finds the first non-reduce element in its list of arguments.

(1)  action(sm, a) = shift(sm+1)
     | s0 ... sm | a, ei ... en |  ⇒  | s0 ... sm, sm+1 | ei ... en |

(2)  | s0 ... sm | ⟨A → α⟩, ei ... en |  ⇒  | s0 ... sm | α, R(A → α), ei ... en |

(3)  | s0 ... sm | ⟨A ∼ B → α⟩, ei ... en |  ⇒  | s0 ... sm | ⟨B → α⟩, R⃗(A), ei ... en |

(4)  goto(sm, A) = sm+1
     | s0 ... sm | A, ei ... en |  ⇒  | s0 ... sm, sm+1 | ei ... en |

(5)  action(sm+k, head(ei ... en)) = reduce(A → X1 ... Xk)
     | s0 ... sm ... sm+k | R(A → X1 ... Xk), ei ... en |  ⇒  | s0 ... sm | A, ei ... en |

(6)  action(sm+1, head(ei ... en)) = reduce(A → B)
     | s0 ... sm, sm+1 | R⃗(A), ei ... en |  ⇒  | s0 ... sm | A, ei ... en |

(7)  action(sm+1, head(ei ... en)) = reduce(B → C), B ≠ A
     | s0 ... sm, sm+1 | R⃗(A), ei ... en |  ⇒  | s0 ... sm | B, R⃗(A), ei ... en |

Fig. 3. Transition rules for checking tree patterns for a YACC parser

Equation 1 defines a shift of a terminal a. This definition is not different from a shift in a normal LR parser: a terminal is removed from the input and a new state is pushed on the stack. Equation 2 defines the unfolding of a tree pattern ⟨A → α⟩. This transition rule does not exist for an LR parser, since the normal input is a sequence of terminals. The unfolding of a tree pattern involves adding α and a reduce of ⟨A → α⟩ to the input. The reduction is denoted by R(A → α). Equation 3 defines the unfolding of a tree pattern ⟨A ∼ B → α⟩. After the unfolding of ⟨B → α⟩, a reduce R⃗(A) for arbitrary chain productions is inserted.

Thanks to the unfolding of productions, the input of the system can now contain non-terminals. This is the reason for a separate transition rule 4 for performing a goto, which is usually considered to be a part of the reduce action. The goto transition rule removes a non-terminal from the input and pushes a new state on the stack, determined by the goto function. The reason why this works is that we can assume that the non-terminal A is productive, which means that there will always be a production for A that will finally reduce to state sm, which would lead to exactly the same goto.

Equation 5 defines a reduce action. The transition system only allows reduces if a reduce is explicitly identified in the input. This method of checked reduces is used to enforce the structure of the tree pattern on the parser, i.e., it is not possible to recognize the leaves of the tree pattern with a parse tree that has a different internal structure. The definition of the reduce action reuses the separate goto transition rule by inserting a non-terminal in front of the input. Equations 6 and 7 define the reduction of chain productions, which is allowed if there is an R⃗(A) in front of the input. If the reduce is applied for A → B, then R⃗(A) is removed from the input and A is added. If the chain production does not produce A, then more chain productions might be necessary. Therefore, the R⃗(A) is preserved and B is pushed in front of the input to trigger a goto.

Using the extended LR parser that operates on tree patterns, the parsing of an actual input of the form of a tree pattern is simulated in detail.
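To make the transition system concrete, the following Python sketch implements rules 1, 2, 4, and 5 of Figure 3; the chain-production rules 3, 6, and 7, and the handling of non-terminal or pattern lookaheads, are omitted for brevity. It is a hedged approximation, not the authors' Stratego implementation; the ACTION and GOTO dictionaries and the tuple encodings of patterns and reduce markers are assumptions of this sketch.

def check_pattern(pattern, ACTION, GOTO, terminals):
    # pattern: ("pat", lhs, rhs) where rhs items are terminals, non-terminals,
    # or nested ("pat", ...) tuples; reduce markers are ("R", lhs, prod_rhs).
    stack, inp = [0], [pattern]
    while inp:
        e = inp[0]
        if isinstance(e, tuple) and e[0] == "pat":          # rule 2: unfold
            _, a, rhs = e
            # prod is the rhs with nested patterns replaced by their non-terminal,
            # matching how the production appears in the parse table.
            prod = tuple(x[1] if isinstance(x, tuple) else x for x in rhs)
            inp[0:1] = list(rhs) + [("R", a, prod)]
        elif isinstance(e, tuple) and e[0] == "R":           # rule 5: checked reduce
            _, a, prod = e
            look = next((x for x in inp[1:]
                         if not (isinstance(x, tuple) and x[0] == "R")), "$")
            if ACTION.get((stack[-1], look)) != ("reduce", a, prod):
                return False     # the table forbids this reduce: pattern invalid
            if prod:             # guard for (unlikely) empty right-hand sides
                del stack[-len(prod):]
            inp[0:1] = [a]       # rule 4 will perform the goto on a
        elif e in terminals:                                 # rule 1: shift
            act = ACTION.get((stack[-1], e))
            if not (isinstance(act, tuple) and act[0] == "shift"):
                return False
            stack.append(act[1]); del inp[0]
        else:                                                # rule 4: goto
            nxt = GOTO.get((stack[-1], e))
            if nxt is None:
                return False
            stack.append(nxt); del inp[0]
    return True                  # input consumed: the tree pattern is valid

Run over the parse table of YACC2, such a checker would be expected to reproduce the accept and error outcomes of the two traces shown in Figure 4.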

To illustrate the validation of tree patterns, Figure 4 shows the configuration of a parser generated from grammar YACC2 (Figure 1) for every application of a transition rule. The R(*) and R(+) inputs are abbreviations for the complete productions of these operators. The tree pattern on the left is valid. The tree pattern on the right is invalid, since in the last configuration the lookahead is the terminal *. For this lookahead, a reduce of the + operator is not allowed, since that would give the + operator precedence over *. Thus, parsing fails and the tree pattern is invalid.

Valid tree pattern:
  unfold | 0 | ⟨E → E + ⟨E → E * E⟩⟩ |
  goto   | 0 | E, +, ⟨E → E * E⟩, R(+) |
  shift  | 0, 3 | +, ⟨E → E * E⟩, R(+) |
  unfold | 0, 3, 5 | ⟨E → E * E⟩, R(+) |
  goto   | 0, 3, 5 | E, *, E, R(*), R(+) |
  shift  | 0, 3, 5, 7 | *, E, R(*), R(+) |
  goto   | 0, 3, 5, 7, 6 | E, R(*), R(+) |
  reduce | 0, 3, 5, 7, 6, 8 | R(*), R(+) |
  goto   | 0, 3, 5 | E, R(+) |
  reduce | 0, 3, 5, 7 | R(+) |
  goto   | 0 | E |
  accept | 3, 0 | |

Invalid tree pattern:
  unfold | 0 | ⟨E → ⟨E → E + E⟩ * E⟩ |
  unfold | 0 | ⟨E → E + E⟩, *, E, R(*) |
  goto   | 0 | E, +, E, R(+), *, E, R(*) |
  shift  | 0, 3 | +, E, R(+), *, E, R(*) |
  goto   | 0, 3, 5 | E, R(+), *, E, R(*) |
  error  | 0, 3, 5, 7 | R(+), *, E, R(*) |

Fig. 4. LR configuration sequences for a valid and an invalid tree pattern

By working on the parse table generated by YACC, the recovery supports all precedence rules of a YACC grammar: encoded in productions, defined using precedence declarations, and even implicit conflict resolution. Indeed, if we remove the precedence declarations from YACC2, then the precedence rule recovery returns ⟨E → ⟨E → E * E⟩ * E⟩, ⟨E → ⟨E → E + E⟩ * E⟩, ⟨E → ⟨E → E * E⟩ + E⟩, ⟨E → ⟨E → E + E⟩ + E⟩, which illustrates that YACC prefers a shift over a reduce.

Bison has a detailed report function that provides information about the generated LR parse table, item sets, shifts, gotos, reduces, and conflicts. We parse this output to obtain a representation of the parse table. The tree pattern parser is implemented in Stratego. The transition rules of Figure 3 correspond directly to rewrite rules in the Stratego implementation, which are applied using a rewriting strategy. The configurations of the parser can be inspected, which was used to produce the configuration sequences of Figure 4. The implementation of the transition system takes 55 lines of code.

Precedence Rule Recovery: SDF. For recovering precedence rules from SDF grammars, an analysis of the grammar would be feasible, since the precedence declarations of SDF are not defined operationally in terms of parser generation. Yet, supporting a mixture of encoded and separately defined precedence declarations can still be rather involved. Based on the success of the approach used for recovering YACC precedence rules, we chose the same method for SDF grammars. Thus, precedence rules are recovered by checking generated tree patterns up to a certain level against the parse table generated from an SDF grammar.

We cannot reuse the transition system (a modified LR parser) that we defined for checking tree patterns against YACC parse tables, since SDF is implemented using a scannerless generalized-LR parser, called SGLR. Because the parser uses the generalized-LR algorithm, there will be cases where multiple actions are possible in some configuration, for example a shift as well as a reduce action. To handle the alternatives, the GLR configuration
needs to be forked, where in the end one of the alternatives has to succeed to make a tree pattern valid. Furthermore, the scannerless generalized-LR parser uses a different method for applying precedence declarations to the parse table. Whereas YACC uses precedence declarations to resolve conflicts between shift and reduce actions, SDF effectively prunes the goto table of the parse table. SGLR refines the goto table from gotos based on symbols to gotos based on productions, i.e., the goto table is now a table of states and productions instead of states and symbols [15]. This slightly complicates the definition of the transition system, since the system applies gotos that are not introduced by a reduce, but by a non-terminal in the tree pattern. For this reason, we distinguish such a goto from a goto induced by a reduce. In the case of a goto caused by a non-terminal in the input, we consider all possible gotos for this non-terminal: we determine the set of possible states the parser can go to from the current configuration and fork the GLR configuration to check all alternatives.

Our method supports ambiguous grammars, which is illustrated by the case studies of Section 6, where two ambiguous grammars for C are compared. The implementation of the precedence recovery tool for SDF is a very basic and somewhat naive GLR parser. However, for the size of tree patterns this is not an issue at all. Again, the checker is implemented in Stratego using rewrite rules that rewrite the GLR configuration. Due to space constraints we cannot present the transition system for the SDF implementation.
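A minimal Python sketch of this forking step (illustrative only, not the Stratego implementation; the goto_states, is_nonterminal, and step callbacks abstract over SGLR's production-based goto table and the remaining transition rules):

def check_all(config, goto_states, is_nonterminal, step):
    # config = (stack, input); returns True if some forked configuration accepts.
    stack, inp = config
    if inp and is_nonterminal(inp[0]):
        # A non-terminal in the tree pattern may allow several production-based
        # gotos: fork the configuration and try every possible successor state.
        return any(check_all((stack + [s], inp[1:]), goto_states, is_nonterminal, step)
                   for s in goto_states(stack[-1], inp[0]))
    outcome = step(config)          # one deterministic shift/unfold/reduce step
    if outcome == "accept":
        return True
    if outcome == "error":
        return False
    return check_all(outcome, goto_states, is_nonterminal, step)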

5  Precedence Compatibility

Comparing the languages defined by two grammars is undecidable, but this does not mean that nothing can be said about the compatibility of two grammars. Static analysis tools, such as our precedence rule recovery tool, can be used to extract information from different grammars and compare the results, even if the grammars are written in different grammar formalisms. While the recovered precedence rules are represented in a formalism-independent way, this does not imply that precedence rules can be compared directly in a useful way after recovering them from two different grammars. Grammars usually have different naming conventions, different names for lexical symbols, and often also a different structure at some points. The recovered precedence rules can still be compared by first applying grammar transformations to the precedence rules to achieve a common representation. After this, the comparison of precedence rules is a simple set comparison.

Grammar Transformation. Precedence rule recovery usually results in rather big sets of tree patterns. Transforming this huge set of tree patterns to a common representation directly is usually not practical. Instead, it is better to first extract the productions from the precedence rules, and to compare and transform this set of productions in order to find the grammar transformations that achieve a common representation. This is also the most convenient way to identify language extensions that are only present in one of the two grammars. The relationship between two grammars has to be custom defined for a particular combination of grammars.

Typically, one of the grammar transformations that needs to be applied to the precedence rules is the renaming of all expression symbols to a single expression symbol. Note that it is essential that this renaming is applied to the precedence rules and not to the original grammar, since that would most likely change the
precedence rules of the language or even make it impossible to generate a parser. Similar to the renaming of expression symbols, injections caused by the application of chain productions are no longer useful. To achieve a common representation, all injection chain nodes ⟨B ∼ C → β⟩ are transformed to ⟨C → β⟩. In the comparison of a YACC grammar and an SDF grammar, a common issue is that the YACC precedence rules use names for the operators of the language (e.g. ANDAND instead of &&). This is usually a straightforward renaming, where the lexical specification can be consulted if necessary.
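A small Python sketch of this normalization and comparison step (illustrative; tree patterns are encoded as nested (lhs, rhs) pairs, and the renaming tables are specific to the pair of grammars being compared, as in the ANDAND example above):

def normalize(pattern, nt_renames, tok_renames):
    # Rename non-terminals and lexical symbols; recurse into nested patterns.
    lhs, rhs = pattern
    new_rhs = []
    for item in rhs:
        if isinstance(item, tuple):
            new_rhs.append(normalize(item, nt_renames, tok_renames))
        else:
            new_rhs.append(tok_renames.get(item, nt_renames.get(item, item)))
    return (nt_renames.get(lhs, lhs), tuple(new_rhs))

def compare(rules_a, renames_a, rules_b, renames_b):
    # After normalization the comparison is a plain set comparison:
    # patterns reported by only one of the two grammars are the differences.
    a = {normalize(p, *renames_a) for p in rules_a}
    b = {normalize(p, *renames_b) for p in rules_b}
    return a - b, b - a

# e.g. mapping YACC token names to the concrete operators they stand for:
yacc_renames = ({"expr": "E"}, {"ANDAND": "&&", "OROR": "||"})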

6  Evaluation

We have evaluated the method for precedence rule recovery and compatibility checking by applying the implementation for YACC and SDF to a set of grammars for the C and PHP languages. Both languages have a large number of operators and non-obvious precedence rules. The size and complexity of the languages make this compatibility check a good benchmark for our method.

C99. We have compared three grammars for C99:

• The C compiler of the GNU Compiler Collection uses a parser generated from a YACC grammar⁶. The YACC grammar uses a mixture of precedence declarations and encoding of priorities in productions.

• The Transformers project provides a C99 SDF grammar [2]. This grammar is a direct translation of the standard to SDF⁷. The grammar does not use SDF precedence declarations. Instead, it uses an encoding of precedence in productions as specified by the standard. The grammar is designed to be ambiguous where the C syntax is ambiguous.

• The SDF Library provides an ANSI C SDF grammar⁸. Unlike C-Transformers, this grammar uses SDF precedence declarations. The grammar is designed to be ambiguous.

⁶ In GCC 4.1 the Bison-generated C parser has been replaced with a hand-written recursive-descent parser. We use the Bison grammar for GCC 4.03.
⁷ We used revision 1611 of the transformers-c-tools package. The one bug we found has been fixed in revision 1613.
⁸ We used revision 20649 of the sdf-library package for our evaluation.

The precedence tools reported various differences between the grammars. All the reports have been verified as being real differences, i.e., there were no false positives. Examples of the reported differences are:

⟨E → sizeof ⟨E → ( TypeName ) E⟩⟩
A cast as an argument of sizeof is forbidden in GCC and C-Transformers, which is correct, but it is allowed in the SDF Library, which is a bug.

⟨E → ++ ⟨E → ( TypeName ) E⟩⟩, ⟨E → -- ⟨E → ( TypeName ) E⟩⟩
GCC and the SDF Library allow a cast as an argument of ++ and --. C-Transformers does not, which corresponds to the standard. The standard defines ++ and -- separately from the unary operators, while GCC and the SDF Library ignore this difference.

E → sizeof(TypeName)
Though not a precedence problem, our tools reported this missing production in the SDF Library grammar. This means that some sizeof expressions that should be parsed ambiguously are currently unambiguous.

⟨E → E ? E : ⟨E → E = E⟩⟩
This tree pattern of an assignment in the else branch of the conditional is forbidden in GCC and the SDF Library, but is allowed in C-Transformers. This is a bug in C-Transformers: the else branch of the conditional operator uses the wrong non-terminal.

⟨E → ⟨E → E ? E : E⟩ = E⟩, ⟨E → ⟨E → ( TypeName ) E⟩ = E⟩
A conditional or a cast in the left-hand side of an assignment is allowed by GCC and the SDF Library. For GCC this is a legacy feature that now produces a semantic error. C-Transformers forbids this, which is correct. The same issue holds for many more binary operators (||, &&, |, ^, &, !=, ==, >=,

! (! l && ! r) }

concrete prod logical_and
e::Expr ::= l::Expr '&&' r::Expr
{ e.c = ...; e.errors := ... ; e.typerep = booleanType(); }

funcType in::TRep out::TRep {...}
booleanType {...}
arrayType component::TRep {...}
errorType {...}

Fig. 1. A portion of the SimpleC specifications written in (primarily) core Silver.

declaration of the grammar name. Figure 1 contains the partial specification of SimpleC, its name given by the grammar declaration. After the grammar declaration (and any import statements that include AG declarations from other grammar modules), a Silver file consists of a series of AG declarations. Order does not matter, as declarations in a file are visible in the entire file and in other files in that same module. Line comments begin with "--".

Reading from the beginning of Figure 1, we see the declaration of the nonterminal symbols Prog (the grammar start symbol), Dcl (declaration), Dcls, Type (type expressions), Stmt (statement), Stmts, and Expr (expression). Next is the declaration of the terminal symbol Id and the regular expression (denoted /regex/) used by the generated scanner to identify identifiers. Keyword and punctuation terminal symbols, like AndOp, that match a fixed string (denoted 'fixed lexeme') instead of a regular expression can be specified by their fixed string directly in productions, as in the production logical_and. Next, a synthesized attribute c of type String is declared. It contains the translation of SimpleC constructs to C and decorates the non-terminals specified in the occurs on clause. The attribute typerep is a higher-order attribute that holds trees whose root is a non-terminal of type TRep. The type of an Expr is represented by these trees.

Following are a few sample production declarations. Productions with the concrete modifier are used to generate the input specification to a parser generator. Different extensions to Silver integrate different parser and scanner generators into Silver. These extensions provide translations of concrete productions and terminal declarations to the input language of a parser/scanner generator. Productions marked as abstract or aspect are not used in the parser specification.

The first production is named program; its left-hand side non-terminal is Prog and is named p. The production's right-hand side contains the Dcls non-terminal named d. Attribute definitions are given between the curly braces ({ and }). Here, the attribute c on p is defined as indicated. Definitions of other attributes that use features added as language extensions, such as lists ([...]) and collections (:=), are also shown but described below in Section 2.2. Attributes can be defined on concrete and abstract productions; for SimpleC we evaluate attributes on the concrete syntax tree since it is a simple language. For more complex languages, one may separate the concrete and abstract syntax so that the only attributes on the concrete productions are those used to construct the AST over which attributes are evaluated.

Productions for conjunction and negation follow. These define the higher-order attribute typerep to be the tree constructed by the abstract production booleanType, to indicate that they are boolean expressions. Following are the abstract productions used to construct different type representations. The concrete production for function calls follows. Its definition of typerep is not specified here, but is given in the aspect production with the same name in Figure 2. Aspect productions allow attributes to be defined for concrete or abstract productions specified in different locations in the same file, different files, or even different modules. The pattern-matching case expression is an extension to Silver and is discussed below.

The logical_or production uses forwarding [14] to implement the local transformation that maps l || r to !(!l && !r). Forwarding allows a production to define a distinguished syntax tree that provides default values for synthesized attributes that it does not explicitly define with an attribute definition. When a tree node is queried for an attribute that is not explicitly defined, it "forwards" that query to this tree, which will return its value. In logical_or this tree is the semantically equivalent expression constructed from the logical_and and logical_not productions. The errors and typerep attributes are defined explicitly so that an error message can be reported on the code written by the programmer. The value of the c attribute is defined implicitly and retrieved from the forwards-to tree. Forwarding is used in the implementation of language extensions to define their translation to the host language. Forwarding suffices for translations that require only a local transformation.

Productions defining statements, declarations, and other expressions are what one might expect and are not shown. Also, several definitions that would have the expected value are elided with ellipses (...).
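The forwarding mechanism can be pictured as attribute lookup with a fallback tree. The following Python fragment is only an illustration of that idea (it is not Silver, and the class and attribute names are invented for this sketch):

class Node:
    def __init__(self, defs, forwards_to=None):
        self.defs = defs                  # attribute name -> function(node) -> value
        self.forwards_to = forwards_to    # the distinguished forwards-to tree, if any

    def attr(self, name):
        if name in self.defs:             # explicitly defined attribute
            return self.defs[name](self)
        if self.forwards_to is not None:  # implicit definition via forwarding
            return self.forwards_to.attr(name)
        raise KeyError(name)

# A logical_or node would define errors and typerep itself, but would obtain its
# C translation (c) from the semantically equivalent !(!l && !r) tree it forwards to.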

autocopy inh attr env::[ Binding ];
nonterminal Binding with typerep;
syn attr name :: String occurs on Binding;
syn attr errors :: [ String ] collect with ++;
attr errors occurs on Prog, Dcl, Dcls, Type, Stmt, Stmts, Expr;

aspect prod funcCall
e::Expr ::= f::Id '(' arg::Expr ')'
{ e.typerep = case ftype of funcType(in, out) => out | _ => errorType();
  e.errors [ :: Error ] | _ => [ "Error: " ++ f.pp ++ " must be a function." ];
  prod attr ftype :: TRep;
  ftype = ... lookup f in env ... ; }

Fig. 2. A portion of the SimpleC specifications written in full Silver.

2.2  Full Silver: core Silver with language extensions

The definitions of the attributes errors and env in Figure 1 and the specification in Figure 2 make use of Silver features that were added as extensions to the core Silver language. The inherited environment (symbol-table) attribute env defined in Figure 2 uses two extensions. First, it is an autocopy attribute; thus, if no explicit definition for env is given in a production, one is automatically generated that copies the value of env from the left-hand side nonterminal node to its appropriate children. Second, its type uses the type-safe polymorphic list extension to specify that env is a list of Binding values. The simple Binding nonterminal declaration is an extension that uses the with clause to indicate that the typerep attribute decorates Binding.

Collection attributes in Silver are similar to those defined by Boyland [2] and are associated with an associative operator used to fold together contributions to the attribute. Collection attributes are declared using the collect with clause that specifies the collection operator. The Silver collection assignment operator := (which differs from the standard definition operator =) is used in several productions to define the attribute's initial (or unit) value. Aspect productions may use the collection contribution operator