Chapter 2 A Simple One-Pass Compiler
A Simple One-Pass Compiler
● Language syntax: context-free grammar (Backus-Naur Form, BNF)
● Language semantics: informal descriptions
● Grammar used in syntax-directed translation
● Infix expressions will be converted to postfix
  – 8+4*3-2 becomes 843*+2-
  – postfix is easy to evaluate
● Later, programming constructs will be added
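To make the claim that postfix is easy to evaluate concrete, here is a minimal sketch of a stack-based evaluator. The function name eval_postfix and the fixed stack size are assumptions made for illustration; the input is assumed to be well-formed postfix over single-digit operands.

```c
#include <assert.h>
#include <ctype.h>

/* Evaluate a postfix string of single-digit operands and + - * / operators.
   Hypothetical helper: assumes the input is well-formed postfix. */
int eval_postfix(const char *s)
{
    int stack[64];
    int top = 0;                        /* number of items on the stack */

    for (; *s != '\0'; s++) {
        if (isdigit((unsigned char)*s)) {
            assert(top < 64);
            stack[top++] = *s - '0';    /* operand: push its value */
        } else if (*s == '+' || *s == '-' || *s == '*' || *s == '/') {
            assert(top >= 2);
            int right = stack[--top];   /* operator: pop two operands */
            int left = stack[--top];
            switch (*s) {
            case '+': stack[top++] = left + right; break;
            case '-': stack[top++] = left - right; break;
            case '*': stack[top++] = left * right; break;
            case '/': stack[top++] = left / right; break;
            }
        }
        /* any other character (for example a space) is ignored */
    }
    assert(top == 1);
    return stack[0];
}
```

For the slide's example, eval_postfix("843*+2-") pushes 8, 4 and 3, multiplies to get 12, adds to get 20, and subtracts 2 to leave 18.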
Initial Lexical Analysis
● We start with expressions formed from single-digit numbers and arithmetic operators.
● So the lexical analyzer can simply read a character and return it (ignoring white space).
● Later we will extend the lexical analysis to deal with multi-digit numbers, identifiers and keywords.
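A first lexical analyzer of this kind might look like the sketch below. It reads from a string rather than stdin so that it is easy to try out; the names set_input and next_token, and the use of 0 to signal end of input, are assumptions for illustration.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical scanner state: the input text and the current position. */
static const char *input = "";
static int pos = 0;

void set_input(const char *text)
{
    assert(text != 0);
    input = text;
    pos = 0;
}

/* Return the next non-blank character as the token, or 0 at end of input. */
int next_token(void)
{
    while (isspace((unsigned char)input[pos]))
        pos++;                          /* white space is simply skipped */
    if (input[pos] == '\0')
        return 0;
    return (unsigned char)input[pos++];
}
```

Each token is just the character itself, which is all the parser needs while operands are single digits.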
Context-Free Grammar
● Defines the hierarchical structure of a programming language
● stmt -> while ( expr ) stmt
● Means a statement can be the keyword while, followed by an expression in parentheses, followed by a statement.
● In this case we assume that expr is also defined by the grammar.
Grammar
● A context-free grammar consists of 4 parts:
● A set of tokens (aka terminals)
● A set of nonterminals (more or less variables)
● A set of productions where each production is
  – a nonterminal on the left of the arrow for the production
  – a sequence of terminals and nonterminals to the right of the arrow
● A designated start nonterminal
Example Grammar 1
● list -> list + digit
● list -> list - digit
● list -> digit
● digit -> 0|1|2|3|4|5|6|7|8|9
● The last line is shorthand for 10 productions:
  – digit -> 0
  – digit -> 1
  – ...
● Terminals are 0-9 and + and -
● Nonterminals are list and digit
Example Grammar 1 (2)
● You use a grammar to derive a string by starting with the start symbol and repeatedly replacing a nonterminal with a RHS for it.
● list => list + digit
       => list + 8
       => list - digit + 8
       => digit - digit + 8
       => 4 - digit + 8
       => 4 - 3 + 8
● This is referred to as a derivation.
Parse Trees
● The root is labeled with the start nonterminal
● Leaves are labeled by terminals or ε
● Interior nodes are labeled by nonterminals
● The children of a node are labeled by the RHS of a production for the nonterminal
Parse Tree Example
list
├─ list
│  ├─ list
│  │  └─ digit
│  │     └─ 4
│  ├─ -
│  └─ digit
│     └─ 3
├─ +
└─ digit
   └─ 8
Parse Trees (3)
● The leaves of a parse tree (left to right) form the “yield” of the tree.
● This is a string or sentence generated by the grammar.
● This string is derived from the start symbol.
Ambiguity
● A grammar is ambiguous if some string can be generated by 2 or more parse trees.
● The grammar below is ambiguous
● list -> list + list
● list -> list - list
● list -> 0|1|2|...|9
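Two Parse Trees for 6-1-1
Grouping (6-1)-1:
list
├─ list
│  ├─ list
│  │  └─ 6
│  ├─ -
│  └─ list
│     └─ 1
├─ -
└─ list
   └─ 1
Grouping 6-(1-1):
list
├─ list
│  └─ 6
├─ -
└─ list
   ├─ list
   │  └─ 1
   ├─ -
   └─ list
      └─ 1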
Associativity
● We prefer left associativity for + and -
  – 6-1-1 == (6-1)-1
● In C, assignment is right associative
  – a=b=c
  – list -> var = list | var
  – var -> a | b | c
● Parse trees for left associative grammars tend to expand on the left.
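Left vs Right Associativity
Left associative, 6-1-1:
list
├─ list
│  ├─ list
│  │  └─ digit
│  │     └─ 6
│  ├─ -
│  └─ digit
│     └─ 1
├─ -
└─ digit
   └─ 1
Right associative, a=b=c:
list
├─ var
│  └─ a
├─ =
└─ list
   ├─ var
   │  └─ b
   ├─ =
   └─ list
      └─ var
         └─ c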
Operator Precedence
● If we use all 4 basic math operations, we need operator precedence
  – 8-4*2 == 8-(4*2)
● We need more nonterminals and productions
● An expression is a sum/difference of terms
● A term is a product/quotient of factors
● A factor is a digit or an expression in parentheses
Grammar with Precedence
● expr -> expr + term | expr - term | term
● term -> term * factor | term / factor | factor
● factor -> digit | ( expr )
● digit -> 0|1|2|...|9
Why is this grammar ambiguous?
● stmt -> id = expr
        | if expr then stmt
        | if expr then stmt else stmt
        | while expr do stmt
        | { stmt_list }
● stmt_list -> stmt_list ; stmt | ε
Why is the grammar ambiguous?
● if 1 then
      if 2 then a = 2
      else a = 1
● if 1 then
      if 2 then a = 2
  else a = 1
● Which if “owns” the else?
Syntax-Directed Translation
● A compiler must keep track of a variety of values for program entities
  – The starting address for an else clause
  – The type of an expression
  – The size of an array
● We refer to these as attributes and associate them with terminals and nonterminals.
● A syntax-directed definition adds attribute rules (semantic rules) to productions.
Postfix Notation
● If E is a variable or constant, PF(E) = E
● If op is a binary operator, PF(E1 op E2) = PF(E1) PF(E2) op
● If E is of the form ( E1 ), then PF(E) = PF(E1)
● Postfix uses no parentheses
● PF(8-1-1) = 81-1-
● PF(2+3*4-5*2) = 234*+52*-
● PF((2+3)*(4-5)*2) = 23+45-*2*
Synthesized Attributes
● Synthesized attributes at an internal node of a parse tree are determined from attributes of its children (bottom up).
● The alternative is “inherited” attributes.
● Attributes are specified using a “dot” notation like members of a struct or class: expr.t
Syntax-Directed Definition
● expr -> expr1 + term    expr.t = expr1.t term.t '+'
● expr -> expr1 - term    expr.t = expr1.t term.t '-'
● expr -> term            expr.t = term.t
● term -> 0               term.t = '0'
● term -> 1               term.t = '1'
● ...
● Juxtaposition on the right side of a rule denotes string concatenation.
Attribute Synthesis
expr.t = 83+4-
├─ expr.t = 83+
│  ├─ expr.t = 8
│  │  └─ term.t = 8
│  │     └─ 8
│  ├─ +
│  └─ term.t = 3
│     └─ 3
├─ -
└─ term.t = 4
   └─ 4
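Depth-First Parse Tree Traversal
void visit ( node *n )
{
    node *m;
    /* next_sibling is an assumed helper, the counterpart of first_child */
    for ( m = first_child(n); m; m = next_sibling(m) ) {
        visit ( m );
    }
    /* determine attributes of n */
}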
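The traversal can be tried out on a small hand-built tree. The sketch below is one possible concrete version, not the course's code: the node layout (a symbol character, an array of children, and a string attribute t) is an assumption, and it handles only the expr/term productions of the syntax-directed definition for + and -.

```c
#include <assert.h>
#include <string.h>

#define MAX_CHILDREN 3

/* Hypothetical parse-tree node: a grammar symbol, its children, and the
   synthesized attribute t (the postfix translation of the subtree). */
struct node {
    char symbol;                        /* 'E' expr, 'T' term, else a terminal */
    int nchildren;
    struct node *child[MAX_CHILDREN];
    char t[32];
};

/* Visit the children first, then compute n->t from the children's attributes. */
void visit(struct node *n)
{
    for (int i = 0; i < n->nchildren; i++)
        visit(n->child[i]);

    n->t[0] = '\0';
    if (n->nchildren == 0) {            /* terminal: nothing to synthesize */
        return;
    } else if (n->symbol == 'T') {      /* term -> digit: term.t = digit */
        n->t[0] = n->child[0]->symbol;
        n->t[1] = '\0';
    } else if (n->nchildren == 1) {     /* expr -> term: expr.t = term.t */
        strcpy(n->t, n->child[0]->t);
    } else {                            /* expr -> expr1 op term */
        strcpy(n->t, n->child[0]->t);
        strcat(n->t, n->child[2]->t);
        strncat(n->t, &n->child[1]->symbol, 1);
    }
}
```

Because every node's attribute is computed only after its children have been visited, the root ends up with the postfix form of the whole expression.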
Translation Scheme
● A context-free grammar with programming language statements embedded in RHSs.
● The programming statements are called semantic actions.
● Similar to a syntax-directed definition, but the order of execution/evaluation is explicit.
● This format is used by yacc and bison (also more or less the same in lex and flex).
Example Translation Scheme
● expr -> expr + term { print('+'); }
● expr -> expr - term { print('-'); }
● expr -> term
● term -> 0 { print('0'); }
● term -> 1 { print('1'); }
● ...
● term -> 9 { print('9'); }
Augmented Parse Tree
expr
├─ expr
│  ├─ expr
│  │  └─ term
│  │     ├─ 8
│  │     └─ {print('8');}
│  ├─ +
│  ├─ term
│  │  ├─ 3
│  │  └─ {print('3');}
│  └─ {print('+');}
├─ -
├─ term
│  ├─ 4
│  └─ {print('4');}
└─ {print('-');}
Translation Scheme (2)
● The execution of the print statements could be done via a depth-first traversal.
● Alternatively, if parsing occurs in the same pattern (and order), the tree could be skipped.
● Note that the semantic actions can be more general purpose actions: symbol table actions, error messages, line counting, ...
Parsing
● A parser converts a string of tokens into a parse tree (perhaps the tree is not explicit).
● Only certain grammars yield efficient parsers.
  – Arbitrary context-free grammars might take O(n^3) time
  – Grammars for programming languages can generally be parsed in O(n) time
● Top-down parser: parse tree constructed starting with the root
  – Can be easily hand-generated
● Bottom-up: construction starts at the leaves
  – Handles a larger class of grammars
Top-Down Parsing
● Start with the start symbol as the root of the tree
● Repeat the steps below
  – Find a node, n, labeled with a nonterminal, A
  – Select a production for A and construct children of n for the RHS symbols of the production
● For nice grammars the parsing will proceed from left to right through the input string.
● The challenge is selecting a proper production.
● We will consider the current token from the input to help select a production (the lookahead).
Example Grammar
● type -> simple
        | id
        | array [ simple ] of type
● simple -> integer
          | char
          | num .. num
Picking a Production
● The 3 productions for type start with id, array or whatever starts simple.
● We can examine the first symbol and determine which of the 3 productions to use:
  – id: type -> id
  – array: type -> array [ simple ] of type
  – others: type -> simple
● Likewise the 3 productions for simple can be selected by inspecting the lookahead symbol.
Parsing Using Lookahead
type
array [ 1 .. 10 ] of integer
● Start with parse tree with start symbol at root
● Lookahead symbol == array
● Expand the tree by applying the third production
Parsing Using Lookahead
type
├─ array
├─ [
├─ simple
├─ ]
├─ of
└─ type
array [ 1 .. 10 ] of integer
● Match and consume the token array
● Advance to left bracket and match it
● Advance to 1 and expand simple
Parsing Using Lookahead
type
├─ array
├─ [
├─ simple
│  ├─ num
│  ├─ ..
│  └─ num
├─ ]
├─ of
└─ type
array [ 1 .. 10 ] of integer
● With lookahead 1 (a num), the correct production is selected and added to the tree
● We can finish the production for simple
Parsing Using Lookahead
type
├─ array
├─ [
├─ simple
│  ├─ num
│  ├─ ..
│  └─ num
├─ ]
├─ of
└─ type
array [ 1 .. 10 ] of integer
● We advance past ] and of to reach integer
● Now we can select the proper production to apply based on the lookahead
Predictive Parsing
● Recursive descent parsing is done using a function for each nonterminal.
● Predictive parsing is a type of recursive descent where the lookahead symbol is used to “predict” the correct production to apply.
● The parse tree is implicitly defined by the pattern of recursive function calls.
Functions for Predictive Parser
● The match function verifies that the current token is what we expect.
● It advances to the next token if it is correct.
void match ( int token )
{
    if ( look == token )
        look = next_token();
    else
        error();
}
void type()
{
    switch ( look ) {
    case INTEGER:                     // These 3 start the
    case CHAR:                        // 3 productions for
    case NUM:                         // simple
        simple();
        break;
    case ID:                          // type -> id
        match(ID);
        break;
    case ARRAY:                       // type -> array ...
        match(ARRAY); match('[');     // match 2 tokens
        simple();                     // expand simple
        match(']'); match(OF);        // match 2 tokens
        type();                       // expand type
        break;
    default:
        error();
    }
}
void simple()
{
    switch ( look ) {
    case INTEGER:
        match(INTEGER);
        break;
    case CHAR:
        match(CHAR);
        break;
    case NUM:
        match(NUM);                   // Not quite as simple
        match(DOTDOT);                // as the first 2 simple
        match(NUM);                   // productions
        break;
    default:
        error();
    }
}
First Sets
● Prediction of the proper production requires knowing which tokens can be first in strings generated from a particular production.
● We define First sets for the RHS of productions.
● Let A -> α be a production
● If α = ε or α can generate ε, then ε is in First(α).
● First(α) also includes all terminals which can be the first terminal in a string derived from α.
First Sets (2)
● First(simple) = { integer, char, num }
● First(id) = { id }
● First(array [ simple ] of type) = { array }
● We can choose one production over another if their First sets are disjoint.
Using ε-Productions
● opt_stmts -> stmt_list | ε
● In the code for opt_stmts, if the lookahead symbol is not in First(stmt_list) we can use opt_stmts -> ε
● The lookahead symbol may or may not be legal after opt_stmts.
● If the lookahead symbol is illegal, it will result in an error elsewhere.
Predictive Parser
● Write a function for each nonterminal.
● Select which production to use for a nonterminal by inspecting the lookahead symbol to determine which First set it is in.
● If First sets for competing productions are not disjoint, this plan won't work.
● Implement code for a production by calling functions for nonterminals of the RHS and matching terminals of the RHS.
Predictive Syntax-Directed Translator
● Extend the code for the predictive parser.
● Copy the actions from the translation scheme into the parser in the same position as in the translation scheme.
● The action will happen at the intended time.
● The code for the parser/translator could be automated using a tool which reads the translation scheme and writes C++ code.
Left Recursion
● Left recursion could cause infinite looping in a recursive-descent parser.
● expr -> expr + term
● The problem is that expr is the first symbol on the RHS.
● Applying that production would not change the lookahead symbol and would allow it to be selected again.
● Of course the alternative expr -> term would have a conflicting First set...
Eliminating Left Recursion
● expr -> expr + term | term
● Compare this to
● expr -> term rest
● rest -> + term rest | ε
● Now we have right recursion and recursive descent works.
● But we generate parse trees which are better for right associative operators.
A Translator for Simple Expressions
● We are extending the translator to include the 4 basic math operations, multi-digit numbers and identifiers.
● There will be a symbol table which will hold minimal information.
● The translator will accept a list of expressions with each expression terminated with a semicolon.
● We start with a left-recursive grammar which we convert to right-recursive.
Abstract and Concrete Syntax
● A parse tree can be called a concrete syntax tree.
● By contrast an abstract syntax tree leaves out grammar symbols, showing only operators and operands.
[Figure: abstract syntax tree rooted at + with leaves 8, 2 and 4]
Infix-to-postfix Simple Specification
● expr -> expr + term { print('+'); }
● expr -> expr - term { print('-'); }
● expr -> term
● term -> 0 { print('0'); }
● term -> 1 { print('1'); }
● ...
● term -> 9 { print('9'); }
Problem with the Specification
● The specification is left-recursive
● We need to convert to right-recursion
● expr -> term rest
● rest -> + expr | - expr | ε
● term -> 0 | 1 | ... | 9
● If we use this grammar, we get the same language, but we must be careful about the actions.
Problem with the Specification (2)
● Consider 2 choices for actions for rest -> - expr
  – rest -> - expr { rest.t = '-' expr.t; }
  – rest -> - expr { rest.t = expr.t '-'; }
● The first pattern translates 8-4 into 8-4.
● The second translates 8-4 into 84-, but it also translates 8-4+2 into 842+- (the postfix for 8-(4+2)) rather than 84-2+.
● We need help.
Eliminating Left-Recursion in a Translation Scheme
● The solution is to “drag” the actions around during the conversion, treating each as 1 grammar symbol.
● In general we convert A -> Aα | Aβ | γ into
● A -> γR
● R -> αR | βR | ε
● Actions can be part of the α, β and γ
Repaired Grammar
● expr -> term rest
● rest -> + term { print('+'); } rest
● rest -> - term { print('-'); } rest
● rest -> ε
● term -> 0 { print('0'); }
● term -> 1 { print('1'); }
● ...
● term -> 9 { print('9'); }
Translation of 8-4+2
expr
├─ term
│  ├─ 8
│  └─ print('8')
└─ rest
   ├─ -
   ├─ term
   │  ├─ 4
   │  └─ print('4')
   ├─ print('-')
   └─ rest
      ├─ +
      ├─ term
      │  ├─ 2
      │  └─ print('2')
      ├─ print('+')
      └─ rest
         └─ ε
void expr()
{
    term(); rest();
}
void rest()
{
    switch ( lookahead ) {
    case '+':
        match('+'); term(); print('+'); rest();
        break;
    case '-':
        match('-'); term(); print('-'); rest();
        break;
    }
}
void term()
{
    if ( isdigit(lookahead) ) {
        print(lookahead); match(lookahead);
    } else
        error();
}
Eliminating Tail Recursion
● Tail recursion: a recursive call made just before returning from a recursive function; might as well use a loop.
void rest()
{
    while ( 1 ) {
        switch ( lookahead ) {
        case '+':
            match('+'); term(); print('+');
            break;
        case '-':
            match('-'); term(); print('-');
            break;
        default:
            return;
        }
    }
}
Merging Code
● expr is called once and transfers control to rest, so we might as well merge them.
● The + and - cases are almost identical, so we can merge them using the lookahead variable.
Streamlined expr code
void expr()
{
    int t;
    term();
    while ( 1 ) {
        switch ( lookahead ) {
        case '+': case '-':
            t = lookahead;
            match(t); term(); print(t);
            break;
        default:
            return;
        }
    }
}
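Putting the streamlined expr together with term and match gives a complete translator for single-digit operands. The sketch below is one way to wire it up for experimentation: reading from a string, collecting output in a buffer, the failed flag, and the translate wrapper are all assumptions made for illustration, not part of the slides.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical globals: input text, lookahead token, output buffer. */
static const char *src;
static int lookahead;
static char out[128];
static int outlen;
static int failed;

static void advance(void)               /* read the next non-blank character */
{
    while (*src == ' ')
        src++;
    lookahead = *src ? (unsigned char)*src++ : 0;
}

static void print(int c)                /* append to out instead of stdout */
{
    assert(outlen < 127);
    out[outlen++] = (char)c;
    out[outlen] = '\0';
}

static void error(void) { failed = 1; }
static void match(int t) { if (lookahead == t) advance(); else error(); }

static void term(void)
{
    if (isdigit(lookahead)) { print(lookahead); match(lookahead); }
    else error();
}

static void expr(void)
{
    term();
    while (!failed && (lookahead == '+' || lookahead == '-')) {
        int t = lookahead;
        match(t); term(); print(t);
    }
}

/* Translate infix text to postfix; returns NULL on a syntax error. */
const char *translate(const char *text)
{
    src = text; outlen = 0; out[0] = '\0'; failed = 0;
    advance();
    expr();
    if (lookahead != 0)
        error();                        /* trailing input that is not part of expr */
    return failed ? NULL : out;
}
```

Calling translate("8-4+2") yields 84-2+, the left-associative result that the repaired grammar was designed to produce.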
Lexical Analysis
● White space and comment removal
  – Easy in the scanner
  – Difficult in the parser
● Constant = sequence of digits
  – scanner passes num to the parser
  – the value of the num is an attribute
● 25 + 15 - 12
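A scanner along these lines could be sketched as follows. The token codes NUM and DONE, the names lexan, tokenval and scan_from, and reading from a string are assumptions for illustration; comment removal is left out.

```c
#include <assert.h>
#include <ctype.h>

#define NUM  256                /* assumed token codes, chosen above 255 */
#define DONE 257

static const char *text = "";   /* hypothetical input buffer */
static int tokenval;            /* attribute of the most recent NUM */

void scan_from(const char *s)
{
    assert(s != 0);
    text = s;
}

/* Skip blanks; return NUM (value in tokenval), DONE, or the character itself. */
int lexan(void)
{
    while (isspace((unsigned char)*text))
        text++;
    if (*text == '\0')
        return DONE;
    if (isdigit((unsigned char)*text)) {
        tokenval = 0;
        while (isdigit((unsigned char)*text))
            tokenval = tokenval * 10 + (*text++ - '0');
        return NUM;
    }
    return (unsigned char)*text++;
}
```

On the input 25 + 15 - 12 the parser sees NUM, '+', NUM, '-', NUM and then DONE, with tokenval holding 25, 15 and 12 after the respective NUM tokens.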
Identifiers
● Identifier: letters and digits starting with a letter
● Might also be a keyword, but not now.
● Easy to code by starting a while loop when the next character is a letter and continuing until the next character is neither a letter nor a digit.
● Then we need to return that character to the input stream to be read by another section of code.
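The loop described above might be sketched like this. The helpers get_char and unget_char stand in for reading from and pushing back to the input stream (with stdio one would typically use getchar and ungetc); they, the buffer variables and scan_identifier are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

static const char *buf = "";    /* hypothetical input buffer */
static int at = 0;              /* index of the next unread character */

static int get_char(void)      { return buf[at] ? (unsigned char)buf[at++] : -1; }
static void unget_char(int c)  { if (c != -1) at--; }

/* If the next character is a letter, copy the identifier into lexeme and
   return 1; otherwise leave the input untouched and return 0. */
int scan_identifier(char *lexeme, int size)
{
    int c = get_char();
    int n = 0;

    if (c == -1 || !isalpha(c)) {
        unget_char(c);
        return 0;
    }
    while (c != -1 && isalnum(c)) {
        assert(n < size - 1);
        lexeme[n++] = (char)c;
        c = get_char();
    }
    unget_char(c);              /* give back the first character past the identifier */
    lexeme[n] = '\0';
    return 1;
}
```

The pushback matters because the character that ends the identifier, such as an operator, is itself the start of the next token.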
Interfacing to the Lexical Analyzer
● Lexical analyzer reads characters from stdin
● Parser gets tokens/attributes from the lexical analyzer
● The simplest arrangement is for the lexical analyzer to have a function to call to get the next token.
Symbol Table
● Generally a symbol table supports insertion of identifiers with attributes.
● A symbol table also allows searching for an identifier by name.
● Efficiency usually dictates using some form of hash table or tree (STL map: red-black tree).
● For Chapter 2 the symbol table is an array of tuples of char pointers and ints (struct entry).
Symbol Table (2)
● A symbol table is a good place to handle keywords like “div” and “mod”.
● The translator inserts “div” with the #defined constant DIV (and “mod” with MOD) in the table.
● DIV and MOD are ints greater than 255 to avoid confusion with single char tokens.
● The lexical analyzer uses lookup to search for a string.
● If it is in the table, it uses the token type from the table. Otherwise it inserts it as an ID.
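A minimal sketch of such a table follows, using a linear search over an array of struct entry. The specific token values, the table size, and the helpers init and token_for are assumptions; this version stores the caller's pointer instead of copying the lexeme, which real code would normally do.

```c
#include <assert.h>
#include <string.h>

#define ID     257              /* assumed token codes; any values above 255 work */
#define DIV    258
#define MOD    259
#define SYMMAX 100

struct entry {
    const char *lexptr;         /* the lexeme (not copied in this sketch) */
    int token;
};

static struct entry symtable[SYMMAX];
static int lastentry = 0;       /* index 0 is unused so 0 can mean "not found" */

/* Return the index of s in the table, or 0 if it is absent. */
int lookup(const char *s)
{
    for (int p = lastentry; p > 0; p--)
        if (strcmp(symtable[p].lexptr, s) == 0)
            return p;
    return 0;
}

/* Add s with token type tok and return its index. */
int insert(const char *s, int tok)
{
    assert(lastentry + 1 < SYMMAX);
    lastentry++;
    symtable[lastentry].lexptr = s;
    symtable[lastentry].token = tok;
    return lastentry;
}

/* Preload the keywords before any scanning happens. */
void init(void)
{
    insert("div", DIV);
    insert("mod", MOD);
}

/* What the lexical analyzer does with a scanned word. */
int token_for(const char *word)
{
    int p = lookup(word);
    if (p == 0)
        p = insert(word, ID);
    return symtable[p].token;
}
```

Because the keywords are inserted first, the scanner needs no special case for them: a word found in the table simply comes back with whatever token type was stored.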
Abstract Stack Machine
● An abstract stack machine is a possible form of intermediate code for a compiler.
● An ASM has data memory, instruction memory, a data stack and a CPU.
● The CPU has instructions to move data from data memory to the stack and vice versa.
● It also has instructions to perform operations on the top items of the stack.
● Lastly the CPU has flow-control instructions.
ASM Arithmetic Instructions
● Using an ASM is like interpreting postfix
● PF(2+3*4) = 234*+
● ASM instructions would be
  – push 2
  – push 3
  – push 4
  – multiply
  – add
● There would be a full collection of operators for ints and doubles.
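The arithmetic part of such a machine can be sketched in a few lines. The instruction encoding below (an enum opcode plus an integer operand) and the function name run are assumptions for illustration, and only int operations are covered.

```c
#include <assert.h>

enum op { PUSH, ADD, SUBTRACT, MULTIPLY, DIVIDE };

struct instr {
    enum op op;
    int operand;                /* used only by PUSH */
};

/* Run n instructions and return the value left on top of the stack. */
int run(const struct instr *code, int n)
{
    int stack[64];
    int top = 0;

    for (int pc = 0; pc < n; pc++) {
        if (code[pc].op == PUSH) {
            assert(top < 64);
            stack[top++] = code[pc].operand;
            continue;
        }
        assert(top >= 2);       /* every other instruction is binary */
        int right = stack[--top];
        int left = stack[--top];
        switch (code[pc].op) {
        case ADD:      stack[top++] = left + right; break;
        case SUBTRACT: stack[top++] = left - right; break;
        case MULTIPLY: stack[top++] = left * right; break;
        case DIVIDE:   stack[top++] = left / right; break;
        default:       break;
        }
    }
    assert(top >= 1);
    return stack[top - 1];
}
```

Running the five instructions from the slide (push 2, push 3, push 4, multiply, add) leaves 14 on the stack.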
L-values and R-values
● An identifier is used in 2 common ways in a programming language
  – On the left side of an assignment (l-value)
  – As part of an expression (r-value)
● When used for the target of an assignment the computer needs the address of the variable.
● When used in an expression the computer needs the value of the variable.
● An ASM needs rvalue and lvalue instructions.
lvalue and rvalue
● To push a variable's value onto the stack
  – rvalue a    // a's address is used to get a
● To push a variable's address onto the stack
  – lvalue a    // a's address is pushed
● To compute c=a+b
  – lvalue c
  – rvalue a
  – rvalue b
  – add
  – store       // := in the book
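The sequence for c=a+b can be traced with a small interpreter. In this sketch variables are simply indices into a data array, which is an assumption, as are the instruction encoding and the name run; store is taken to place the value on top of the stack into the address below it and pop both.

```c
#include <assert.h>

enum op { LVALUE, RVALUE, ADD, STORE };

struct instr {
    enum op op;
    int addr;                   /* data-memory address, used by LVALUE and RVALUE */
};

static int data[16];            /* data memory; variables live at fixed addresses */

void run(const struct instr *code, int n)
{
    int stack[64];
    int top = 0;

    for (int pc = 0; pc < n; pc++) {
        switch (code[pc].op) {
        case LVALUE:            /* push the address itself */
            assert(top < 64);
            stack[top++] = code[pc].addr;
            break;
        case RVALUE:            /* push the contents of the address */
            assert(top < 64);
            stack[top++] = data[code[pc].addr];
            break;
        case ADD:
            assert(top >= 2);
            stack[top - 2] += stack[top - 1];
            top--;
            break;
        case STORE:             /* value on top, address just below it */
            assert(top >= 2);
            data[stack[top - 2]] = stack[top - 1];
            top -= 2;
            break;
        }
    }
}
```

With a at address 0, b at 1 and c at 2, the program lvalue 2, rvalue 0, rvalue 1, add, store leaves a+b in data[2].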
ASM Control Flow
● label x      Set label named x
● goto x       Branch to the label named x
● gofalse x    Goto x if the top of the stack is 0 (also pops it)
● gotrue x     Goto x if the top of the stack is not 0 (also pops it)
● halt
ASM Code for if Statement
Source: if expr then stmt
Target:
    code for expr
    gofalse out
    code for stmt
    label out
Translation scheme:
    stmt -> if expr { out = newlabel(); emit('gofalse',out); }
            then stmt { emit('label',out); }
void stmt ()
{
    int out;                          // label for the end of the if statement
    if ( lookahead == ID ) {
        emit("lvalue",tokenval);
        match(ID); match('='); expr();
    } else if ( lookahead == IF ) {
        match(IF);
        expr();
        out = newlabel();
        emit("gofalse",out);
        match(THEN);
        stmt();
        emit("label",out);
    } else
        error();
}
Infix to Postfix Translator Specification
start  -> list EOF
list   -> expr ; list
        | ε
expr   -> expr + term        { print('+') }
        | expr - term        { print('-') }
        | term
term   -> term * factor      { print('*') }
        | term / factor      { print('/') }
        | term DIV factor    { print('DIV') }
        | term MOD factor    { print('MOD') }
        | factor
factor -> ( expr )
        | id                 { print(id.lexeme) }
        | num                { print(num.value) }
Translation Scheme with no Left Recursion
start       -> list EOF
list        -> expr ; list
             | ε
expr        -> term moreterms
moreterms   -> + term { print('+') } moreterms
             | - term { print('-') } moreterms
             | ε
term        -> factor morefactors
morefactors -> * factor { print('*') } morefactors
             | / factor { print('/') } morefactors
             | DIV factor { print('DIV') } morefactors
             | MOD factor { print('MOD') } morefactors
             | ε
factor      -> ( expr )
             | id { print(id.lexeme) }
             | num { print(num.value) }
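The scheme above maps fairly directly onto a recursive-descent translator. The following is a reduced sketch under several assumptions: it handles a single expression (no list or semicolons), omits DIV and MOD, replaces moreterms and morefactors with loops, reads from a string, and separates output tokens with spaces so multi-character lexemes stay readable. All names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum { NUM = 256, ID, DONE };   /* assumed token codes above 255 */

static const char *src;         /* hypothetical input buffer */
static int lookahead;           /* current token: a character, NUM, ID or DONE */
static char lexeme[32];         /* text of the current NUM or ID */
static char out[256];           /* postfix output, tokens separated by spaces */
static int failed;

static void emit(const char *s)
{
    assert(strlen(out) + strlen(s) + 2 < sizeof out);
    if (out[0] != '\0')
        strcat(out, " ");
    strcat(out, s);
}

static void next(void)          /* scan the next token into lookahead/lexeme */
{
    int n = 0;
    while (isspace((unsigned char)*src))
        src++;
    if (*src == '\0') {
        lookahead = DONE;
    } else if (isdigit((unsigned char)*src)) {
        lookahead = NUM;
        while (isdigit((unsigned char)*src) && n < 31) lexeme[n++] = *src++;
    } else if (isalpha((unsigned char)*src)) {
        lookahead = ID;
        while (isalnum((unsigned char)*src) && n < 31) lexeme[n++] = *src++;
    } else {
        lookahead = (unsigned char)*src++;
    }
    lexeme[n] = '\0';
}

static void match(int t) { if (lookahead == t) next(); else failed = 1; }

static void expr(void);

static void factor(void)        /* factor -> ( expr ) | id | num */
{
    if (lookahead == '(') { match('('); expr(); match(')'); }
    else if (lookahead == NUM || lookahead == ID) { emit(lexeme); next(); }
    else failed = 1;
}

static void term(void)          /* term -> factor morefactors, as a loop */
{
    factor();
    while (!failed && (lookahead == '*' || lookahead == '/')) {
        char op[2] = { (char)lookahead, '\0' };
        next(); factor(); emit(op);
    }
}

static void expr(void)          /* expr -> term moreterms, as a loop */
{
    term();
    while (!failed && (lookahead == '+' || lookahead == '-')) {
        char op[2] = { (char)lookahead, '\0' };
        next(); term(); emit(op);
    }
}

/* Translate one expression; returns NULL on a syntax error. */
const char *translate(const char *text)
{
    src = text; out[0] = '\0'; failed = 0;
    next();
    expr();
    if (lookahead != DONE)
        failed = 1;
    return failed ? NULL : out;
}
```

Precedence falls out of the call structure: expr only sees + and - after term has already consumed any * and / operators, so 2+3*4-5*2 comes out as 2 3 4 * + 5 2 * -.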
Tokens
● Tokens are identified by an integer and some of them have an integer attribute value.
● Many tokens like '+' are simply themselves.
● NUM, DIV, MOD, ID, and DONE are #defined as numbers starting with 256 to be distinct.
● The integer attribute for NUM is the sequence of digits converted to an integer.
● The integer attribute for ID is the index into the symbol table for that ID.