Chapter 2 A Simple One-Pass Compiler - Orca

Chapter 2 A Simple One-Pass Compiler

A Simple One-Pass Compiler ●

Language syntax: context-free grammar Backus-Naur Form – BNF



Language semantics: informal descriptions



Grammar used in syntax-directed translation



Infix expressions will be converted to postfix





8+4*3-2 becomes 843*+2-



postfix is easy to evaluate

Later, programming constructs will be added
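To illustrate why postfix is easy to evaluate, here is a minimal sketch of a stack-based evaluator. The function name eval_postfix, the fixed stack size and the restriction to single-digit operands are assumptions made for this example, not part of the slides.

```c
#include <ctype.h>

/* Evaluate a postfix string of single-digit operands and the
   operators + - * /. Returns the value left on the stack.
   Assumes well-formed input with at most 64 pending operands. */
int eval_postfix(const char *s)
{
    int stack[64];
    int top = 0;                      /* number of items on the stack */

    for (; *s != '\0'; s++) {
        if (isdigit((unsigned char)*s)) {
            stack[top++] = *s - '0';  /* operand: push its value */
        } else {
            int right = stack[--top]; /* operator: pop two operands */
            int left  = stack[--top];
            switch (*s) {
            case '+': stack[top++] = left + right; break;
            case '-': stack[top++] = left - right; break;
            case '*': stack[top++] = left * right; break;
            case '/': stack[top++] = left / right; break;
            }
        }
    }
    return stack[top - 1];
}
```

Calling eval_postfix("843*+2-") returns 18, the value of 8+4*3-2. No precedence or parenthesis handling is needed because the order of the operators in the string already fixes the order of evaluation.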

Initial Lexical Analysis

We start with expressions formed from single-digit numbers and arithmetic operators.

So the lexical analyzer can simply read a character and return it (ignoring white space).

Later we will extend the lexical analysis to deal with multi-digit numbers, identifiers and keywords.
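A one-character lexical analyzer of this kind might look like the sketch below. It reads from a string rather than from stdin so that it is easy to try out; the name next_token and the convention of returning 0 at end of input are assumptions for this example.

```c
#include <ctype.h>

/* Return the next non-blank character from the string at *p and
   advance *p past it; return 0 at end of input. */
int next_token(const char **p)
{
    while (**p != '\0' && isspace((unsigned char)**p))
        (*p)++;                 /* skip white space */
    if (**p == '\0')
        return 0;               /* end of input */
    return *(*p)++;             /* the character itself is the token */
}
```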

Context Free Grammar

Defines the hierarchical structure of a programming language

stmt -> while ( expr ) stmt

Means a statement can be the keyword while, followed by an expression in parentheses and a statement after the parentheses

In this case we assume that expr is also defined by the grammar.

Grammar ●

A context-free grammar consists of 4 parts:



A set of tokens (aka terminals)



A set of nonterminals (more or less, variables)



A set of productions where each production is





a nonterminal on the left of the arrow for the production



a sequence of terminals and nonterminals to the right of the arrow

A designated start nonterminal

Example Grammar 1 ●

list -> list + digit



list -> list – digit



list -> digit



digit -> 0|1|2|3|4|5|6|7|8|9



The last line is shorthand for 10 productions: –

digit -> 0 digit -> 1 ...



Terminals are 0-9 and + and -



Nonterminals are list and digit

Example Grammar 1 (2)

You use a grammar to derive a string by starting with the start symbol and repeatedly replacing a nonterminal with a RHS for it.

list => list + digit
     => list + 8
     => list - digit + 8
     => digit - digit + 8
     => 4 - digit + 8
     => 4 - 3 + 8

This is referred to as a derivation.

Parse Trees ●

The root is labeled with the start nonterminal



Leaves are labeled by terminals or ε



Interior nodes are labeled by nonterminals



The children of a node are labeled by the RHS of a production for the nonterminal

Parse Tree Example

[Figure: parse tree rooted at list, with nested list nodes on the left and the digit leaves 4, 3 and 8]

Parse Trees (3)

The leaves of a parse tree (left to right) form the “yield” of the tree.

This is a string or sentence generated by the grammar.

This string is derived from the start symbol.

Ambiguity ●

A grammar is ambiguous if some string can be generated by 2 or more parse trees.



The grammar below is ambiguous



list ­> list + list



list ­> list – list



list ­> 0|1|2|...|9

Two Parse Trees for 6-1-1

[Figure: two parse trees for 6-1-1, one grouping the string as (6-1)-1 and the other as 6-(1-1)]

Associativity

We prefer left associativity for + and -

– 6-1-1 == (6-1)-1

In C, assignment is right associative

– a=b=c means a=(b=c)

– list -> var = list | var

– var -> a | b | c

Parse trees for left associative grammars tend to expand on the left.
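The choice of associativity changes the result, not just the shape of the tree. The sketch below evaluates a string of single digits separated by '-' both ways; the function names eval_left and eval_right are made up for this illustration.

```c
/* Evaluate a string such as "6-1-1" (single digits separated by '-')
   grouping from the left: ((6-1)-1). */
int eval_left(const char *s)
{
    int value = s[0] - '0';
    for (int i = 1; s[i] != '\0'; i += 2)
        value -= s[i + 1] - '0';      /* fold each digit into the running value */
    return value;
}

/* Same input, grouping from the right: (6-(1-1)). */
int eval_right(const char *s)
{
    int first = s[0] - '0';
    if (s[1] == '\0')
        return first;                 /* a lone digit */
    return first - eval_right(s + 2); /* recurse on the rest first */
}
```

For "6-1-1", eval_left gives 4 and eval_right gives 6. The loop mirrors a tree that expands on the left, and the recursion mirrors a tree that expands on the right.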

Left vs Right Associativity

[Figure: the parse tree for 6-1-1 grows down to the left; the parse tree for a=b=c grows down to the right]

Operator Precedence

If we use all 4 basic math operations, we need operator precedence

– 8-4*2 == 8-(4*2)



We need more nonterminals and productions



An expression is a sum/difference of terms



A term is a product/quotient of factors



A factor is a digit or an expression in parentheses

Grammar with Precedence

expr -> expr + term
      | expr - term
      | term

term -> term * factor
      | term / factor
      | factor

factor -> digit
        | ( expr )

digit -> 0|1|2|...|9

Why is this grammar ambiguous?

stmt -> id = expr
      | if expr then stmt
      | if expr then stmt else stmt
      | while expr do stmt
      | { stmt_list }

stmt_list -> stmt_list ; stmt
           | ε

Why is the grammar ambiguous?

if 1 then
   if 2 then a = 2
   else a = 1

if 1 then
   if 2 then a = 2
else a = 1

Which if “owns” the else?

Syntax-Directed Translation ●





A compiler must keep track of a variety of values for program entities –

The starting address for an else clause



The type of an expression



The size of an array

We refer to these as attributes and associate them with terminals and nonterminals. A syntax-directed definition adds attribute rules (semantic rules) to productions.

Postfix Notation

If E is a variable or constant, PF(E) = E

If op is a binary operator, PF(E1 op E2) = PF(E1) PF(E2) op

If E is of the form ( E1 ), then PF(E) = PF(E1)

Postfix uses no parentheses

PF(8-1-1) = 81-1-

PF(2+3*4-5*2) = 234*+52*-

PF((2+3)*(4-5)*2) = 23+45-*2*

Synthesized Attributes

Synthesized attributes at an internal node of a parse tree are determined from attributes of its children (bottom up).

The alternative is “inherited” attributes.

Attributes are specified using a “dot” notation like members of a struct or class: expr.t

Syntax-Directed Definition ●

expr ­> expr1 + term  expr.t = expr1.t term.t +



expr ­> expr1 ­ term  expr.t = expr1.t term.t ­



expr ­> term          expr.t = term.t



term ­> 0             term.t = '0'



term ­> 1             term.t = '1'



...

Attribute Synthesis

[Figure: annotated parse tree for 8+3-4 with term.t = 8, 3 and 4 at the term nodes, expr.t = 8 and expr.t = 83+ at the inner expr nodes, and expr.t = 83+4- at the root]

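Depth-First Parse Tree Traversal

void visit ( node *n )
{
    node *m;
    /* visit the children from left to right
       (next_sibling is an assumed helper, like first_child) */
    for ( m = first_child(n); m; m = next_sibling(m) ) {
        visit ( m );
    }
    /* then determine the attributes of n from those of its children */
}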
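As a concrete instance of this traversal, the sketch below computes the synthesized attribute t for the postfix syntax-directed definition. The struct layout, the fixed limit of three children and the 32-byte attribute buffer are assumptions chosen to keep the example short.

```c
#include <string.h>

/* A parse-tree node with up to three children and a synthesized
   string attribute t (all names here are illustrative). */
struct node {
    char label;              /* 'E' for expr, 'T' for term, else a terminal */
    struct node *child[3];   /* unused slots are NULL */
    char t[32];
};

/* Depth-first traversal: visit the children, then compute n->t. */
void visit(struct node *n)
{
    for (int i = 0; i < 3 && n->child[i] != NULL; i++)
        visit(n->child[i]);

    if (n->child[0] == NULL) {
        /* leaf: a terminal is its own translation */
        n->t[0] = n->label;
        n->t[1] = '\0';
    } else if (n->child[1] == NULL) {
        /* expr -> term or term -> digit: copy the child's attribute */
        strcpy(n->t, n->child[0]->t);
    } else {
        /* expr -> expr1 op term: expr.t = expr1.t term.t op */
        strcpy(n->t, n->child[0]->t);
        strcat(n->t, n->child[2]->t);
        strcat(n->t, n->child[1]->t);
    }
}
```

Building the parse tree for 8+3-4 by hand and calling visit on its root leaves 83+4- in the root's t, matching the annotated tree.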

Translation Scheme

A context-free grammar with programming language statements embedded in RHSs.

The programming statements are called semantic actions.

Similar to a syntax-directed definition, but the order of execution/evaluation is explicit.

This format is used by yacc and bison (also more or less the same in lex and flex).

Example Translation Scheme ●

expr ­> expr + term  { print('+');}



expr -> expr - term  { print('-');}



expr ­> term



term ­> 0            { print('0');}



term ­> 1            { print('1');}



...



term ­> 9            { print('9');}

Augmented Parse Tree

[Figure: parse tree for 8+3-4 with each semantic action attached as an extra child: {print('8');}, {print('3');}, {print('+');}, {print('4');} and, at the root, {print('-');}]

Translation Scheme

The execution of the print statements could be done via a depth-first traversal.

Alternatively, if parsing occurs in the same pattern (and order), the tree could be skipped.

Note that the semantic actions can be more general purpose actions: symbol table actions, error messages, line counting, ...

Parsing

A parser converts a string of tokens into a parse tree (perhaps the tree is not explicit).

Only certain grammars yield efficient parsers.

– Parsing with an arbitrary context-free grammar might take O(n^3) time

– Programming language grammars can generally be parsed in O(n) time

Top-down parser: parse tree constructed starting with the root –



Can be easily hand-generated

Bottom-up: construction starts at the leaves –

Handles a larger class of grammars

Top-Down Parsing ●

Start with start symbol as the root of the tree



Repeat the steps below



● ●



Find a node, n, labeled with a nonterminal, A



Select a production for A and construct children of n for the RHS symbols of the production

For nice grammars the parsing will proceed from left to right through the input string.

The challenge is selecting a proper production.

We will consider the current token from the input to help select a production (lookahead).

Example Grammar ●

type ­> simple



      | id



      | array [ simple ] of type



simple ­> integer



        | char



        | num .. num

Picking a Production ●





The 3 productions for type start with id, array or whatever starts simple. We can examine the first symbol and determine which of the 3 productions to use: –

id:  type ­> id



array: type ­> array [ simple ] of type



others: type ­> simple

Likewise the 3 productions for simple can be selected by inspecting the lookahead symbol.

Parsing Using Lookahead

[Figure: parse tree consisting of the root type alone]

Input: array [ 1 .. 10 ] of integer



Start with parse tree with start symbol at root



Lookahead symbol == array



Expand the tree by applying third production

Parsing Using Lookahead

[Figure: type expanded to the children array, [, simple, ], of, type]

Input: array [ 1 .. 10 ] of integer



Match and consume the token array



Advance to left bracket and match it



Advance to 1 and expand simple

Parsing Using Lookahead

[Figure: type with children array, [, simple, ], of, type, where simple is expanded to num .. num]

Input: array [ 1 .. 10 ] of integer

With lookahead 1 (a num), the correct production is selected and added to the tree.

We can finish the production for simple.

Parsing Using Lookahead

[Figure: type with children array, [, simple, ], of, type, where simple is expanded to num .. num]

Input: array [ 1 .. 10 ] of integer

We advance past ] and of to reach integer.

Now we can select the proper production to apply based on the lookahead.

Predictive Parsing

Recursive descent parsing is done using a function for each nonterminal.

Predictive parsing is a type of recursive descent where the lookahead symbol is used to “predict” the correct production to apply.

Parse tree is implicitly defined by the pattern of recursive function calls.

Functions for Predictive Parser

The match function verifies that the current token is what we expect.

It advances to the next token if it is correct.

void match ( int token )
{
    if ( look == token )
        look = next_token();
    else
        error();
}

void type()
{
    switch ( look ) {
    case INTEGER:                // These 3 start the
    case CHAR:                   // 3 productions for
    case NUM:                    // simple
        simple();
        break;
    case ID:                     // type -> id
        match(ID);
        break;
    case ARRAY:                  // type -> array ...
        match(ARRAY); match('['); // match 2 tokens
        simple();                 // expand simple
        match(']'); match(OF);    // match 2 tokens
        type();                   // expand type
        break;
    default:
        error();
    }
}

void simple()
{
    switch ( look ) {
    case INTEGER:
        match(INTEGER);
        break;
    case CHAR:
        match(CHAR);
        break;
    case NUM:
        match(NUM);     // Not quite as simple
        match(DOTDOT);  // as the first 2 simple
        match(NUM);     // productions
        break;
    default:
        error();
    }
}
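The functions above depend on a token source and an error routine that the slides do not show. The following sketch fills those in under stated assumptions: tokens come from an array terminated by DONE, the token codes are an arbitrary enum, and an error only clears a flag instead of aborting. None of these choices come from the slides.

```c
/* Token codes are assumptions for this sketch. */
enum { DONE, INTEGER, CHAR, NUM, ID, ARRAY, OF, DOTDOT, LBRACKET, RBRACKET };

static const int *input;   /* current position in the token array */
static int look;           /* current lookahead token */
static int ok;             /* cleared on the first syntax error */

static void syntax_error(void) { ok = 0; }

static void match(int token)
{
    if (look == token) {
        if (look != DONE)
            look = *++input;   /* advance to the next token */
    } else
        syntax_error();
}

static void simple(void)
{
    switch (look) {
    case INTEGER: match(INTEGER); break;
    case CHAR:    match(CHAR);    break;
    case NUM:     match(NUM); match(DOTDOT); match(NUM); break;
    default:      syntax_error();
    }
}

static void type(void)
{
    switch (look) {
    case INTEGER: case CHAR: case NUM:   /* type -> simple */
        simple();
        break;
    case ID:                             /* type -> id */
        match(ID);
        break;
    case ARRAY:                          /* type -> array [ simple ] of type */
        match(ARRAY); match(LBRACKET); simple();
        match(RBRACKET); match(OF); type();
        break;
    default:
        syntax_error();
    }
}

/* Return 1 if the DONE-terminated token array is a valid type. */
int parse_type(const int *tokens)
{
    input = tokens;
    look = *input;
    ok = 1;
    type();
    match(DONE);               /* nothing may follow the type */
    return ok;
}
```

With the token sequence for array [ num .. num ] of integer, parse_type returns 1; dropping the .. num part makes match(DOTDOT) fail and the result is 0.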

First Sets ●

Prediction of the proper production requires knowing which tokens can be first in strings generated from a particular production.



We define First sets for RHS of productions.



Let A ­> α be a production





If α = ε or α can generate ε, then ε is in First(α).

First(α) also includes all terminals which can be the first terminal in a string derived from α.

First Sets (2) ●

First(simple) = { integer, char, num }



First(id) = { id }





First(array [ simple ] of type) = { array }

We can choose one production over another if their First sets are disjoint.
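One simple way to hold these First sets in code is as bit masks, which makes the disjointness test a single AND. This is only a sketch: the token codes, the BIT macro and the function names are assumptions for the example.

```c
/* Token codes (assumed for this sketch), used as bit positions. */
enum { INTEGER, CHAR, NUM, ID, ARRAY };

#define BIT(tok) (1u << (tok))

/* First sets of the three right-hand sides of type. */
static const unsigned first_simple = BIT(INTEGER) | BIT(CHAR) | BIT(NUM);
static const unsigned first_id     = BIT(ID);
static const unsigned first_array  = BIT(ARRAY);

/* Two First sets are disjoint when they share no token. */
int disjoint(unsigned a, unsigned b)
{
    return (a & b) == 0;
}

/* Pick the production of type (1, 2 or 3) for a lookahead token,
   or 0 if the token starts none of them. */
int select_type_production(int look)
{
    if (first_simple & BIT(look)) return 1;   /* type -> simple */
    if (first_id     & BIT(look)) return 2;   /* type -> id */
    if (first_array  & BIT(look)) return 3;   /* type -> array [ simple ] of type */
    return 0;
}
```

Because the three sets are pairwise disjoint, at most one branch in select_type_production can match, which is exactly what a predictive parser needs.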

Using ε-Productions

opt_stmts -> stmt_list | ε

In the code for opt_stmts, if the lookahead symbol is not in First(stmt_list) we can use opt_stmts -> ε

It may be that the lookahead symbol is legal after opt_stmts or not.

If the lookahead symbol is illegal, it will result in an error elsewhere.

Predictive Parser

Write a function for each nonterminal.

Select which production to use for a nonterminal by inspecting the lookahead symbol to determine which First set it is in.

If First sets for competing productions are not disjoint, this plan won't work.

Implement code for a production by calling functions for nonterminals of the RHS and matching terminals of the RHS.

Predictive Syntax-Directed Translator

Extend the code for the predictive parser.

Copy the actions from the translation scheme into the parser in the same position as in the translation scheme.

The action will happen at the intended time.

The code for the parser/translator could be automated using a tool which reads the translation scheme and writes C++ code.

Left Recursion ●

Left recursion could cause infinite looping in a recursive-descent parser.



expr -> expr + term



The problem is that expr is the first on the RHS.





Applying that production would not change the lookahead symbol and would allow it to be selected again.

Of course the alternative expr -> term would have a conflicting First set...

Eliminating Left Recursion ●

expr -> expr + term | term



Compare this to



expr -> term rest



rest -> + term rest | ε





Now we have right recursion and recursive descent works.

But we generate parse trees which are better for right associative operators.

A Translator for Simple Expressions

We are extending the translator to include the 4 basic math operations, multi-digit numbers and identifiers.

There will be a symbol table which will hold minimal information.

The translator will accept a list of expressions with each expression terminated with a semicolon.

We start with a left-recursive grammar which we convert to right-recursive.

Abstract and Concrete Syntax

A parse tree can be called a concrete syntax tree.

By contrast an abstract syntax tree leaves out grammar symbols, showing only operators and operands.

[Figure: abstract syntax tree with operators at the interior nodes and the operands 8, 2 and 4 at the leaves]

Simple Infix-to-postfix Specification

expr -> expr + term { print('+'); }

expr -> expr - term { print('-'); }

expr -> term

term -> 0 { print('0'); }

term -> 1 { print('1'); }

...

term -> 9 { print('9'); }

Problem with the Specification ●

The specification is left-recursive



We need to convert to right-recursion



expr -> term rest



rest -> + expr | - expr | ε



term -> 0 | 1 | ... | 9



If we use this grammar, we get the same language, but we must be careful about the actions.

Problem with the Specification (2)

Consider 2 choices for actions for rest -> - expr

rest -> - expr { rest.t = '-' expr.t; }

rest -> - expr { rest.t = expr.t '-'; }

The first pattern translates 8-4 into 8-4.

The second translates 8-4 into 84-, but it also translates 8-4+2 into 842+- instead of 84-2+.

We need help.

Eliminating Left-Recursion in a Translation Scheme

The solution is to “drag” the actions around during the conversion, treating each as 1 grammar symbol.

In general we convert A -> Aα | Aβ | γ into



A -> γR



R -> αR | βR | ε



Actions can be part of the α, β and γ

Repaired Grammar ●

expr -> term rest



rest -> + term { print('+'); } rest



rest -> - term { print('-'); } rest



rest -> ε



term -> 0 { print('0'); }



term -> 1 { print('1'); }



...



term -> 9 { print('9'); }

Translation of 8-4+2

[Figure: parse tree for 8-4+2 under the repaired grammar; read depth-first, the attached actions are print('8'), print('4'), print('-'), print('2'), print('+'), and the last rest derives ε]

void expr()
{
    term(); rest();
}

void rest()
{
    switch ( lookahead ) {
    case '+':
        match('+'); term(); print('+'); rest();
        break;
    case '-':
        match('-'); term(); print('-'); rest();
        break;
    }                       // any other lookahead: rest -> ε
}

void term()
{
    if ( isdigit(lookahead) ) {
        print(lookahead); match(lookahead);
    } else
        error();
}

Eliminating Tail Recursion

tail recursion: a recursive call made just before returning from a recursive function; might as well use a loop

void rest()
{
    while ( 1 ) {
        switch ( lookahead ) {
        case '+':
            match('+'); term(); print('+');
            break;
        case '-':
            match('-'); term(); print('-');
            break;
        default:
            return;
        }
    }
}

Merging Code

expr is called once and transfers control to rest

– might as well merge

The + and - cases are almost identical

– can merge using a variable that holds the lookahead

Streamlined expr code

void expr()
{
    int t;
    term();
    while ( 1 ) {
        switch ( lookahead ) {
        case '+':
        case '-':
            t = lookahead;
            match(t); term(); print(t);
            break;
        default:
            return;
        }
    }
}
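To try the streamlined translator end to end, the sketch below supplies the pieces the slide leaves out. It is only one possible arrangement: input is read from a string, print is replaced by appending to a caller-supplied buffer, errors set a flag, and the while(1)/switch pair is written as an equivalent while condition. The name translate and these conventions are assumptions.

```c
#include <ctype.h>

static const char *in;     /* points at the current lookahead character */
static char *out;          /* next free position in the output buffer */
static int lookahead;
static int failed;         /* set on the first syntax error */

static void match(int t)
{
    if (lookahead == t && t != '\0')
        lookahead = (unsigned char)*++in;   /* advance one character */
    else
        failed = 1;
}

static void term(void)
{
    if (isdigit(lookahead)) {
        *out++ = (char)lookahead;           /* print(lookahead) */
        match(lookahead);
    } else
        failed = 1;
}

static void expr(void)
{
    term();
    while (lookahead == '+' || lookahead == '-') {
        int t = lookahead;
        match(t);
        term();
        *out++ = (char)t;                   /* print(t) */
    }
}

/* Translate infix (single digits, + and -) to postfix in buf, which
   must hold at least strlen(infix) + 1 bytes.
   Returns 1 on success, 0 on a syntax error. */
int translate(const char *infix, char *buf)
{
    in = infix;
    out = buf;
    lookahead = (unsigned char)*in;
    failed = 0;
    expr();
    *out = '\0';
    return !failed && lookahead == '\0';
}
```

Translating "8-4+2" fills the buffer with 84-2+, the left-associative result that the repaired grammar is meant to produce.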

Lexical Analysis ●



White space and comment removal –

Easy in the scanner



Difficult in the parser

Constant = sequence of digits –

scanner passes num to the parser



the value of the num is an attribute



25 + 15 – 12
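A scanner that handles multi-digit constants this way could be sketched as follows. The name lexan, the global tokenval and the value 256 for NUM are assumptions for the example, and the input is a string rather than stdin to keep it self-contained.

```c
#include <ctype.h>

#define NUM 256            /* token code, assumed distinct from any char */

static int tokenval;       /* attribute of the most recent NUM token */

/* Scan one token from the string at *p: skip blanks, turn a run of
   digits into NUM with its value in tokenval, and return any other
   character as itself (0 at end of input). */
int lexan(const char **p)
{
    while (isspace((unsigned char)**p))
        (*p)++;
    if (isdigit((unsigned char)**p)) {
        tokenval = 0;
        while (isdigit((unsigned char)**p))
            tokenval = tokenval * 10 + (*(*p)++ - '0');
        return NUM;
    }
    if (**p == '\0')
        return 0;
    return (unsigned char)*(*p)++;
}
```

For the input 25 + 15 - 12 the parser sees NUM, '+', NUM, '-', NUM, with tokenval holding 25, 15 and 12 in turn.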





Identifiers ●

Identifier: letters and digits starting with a letter



Might also be a keyword, but not now.





Easy to code by starting a while loop when the next character is a letter and continuing until the next character is neither a letter nor a digit. Then we need to return that character to the input stream to be read by another section of code.
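The loop described above might be sketched like this. Reading from a string means that "returning the character" amounts to leaving the cursor on it; a version reading from stdin would typically use ungetc for that step. The function name scan_identifier and the truncation policy are assumptions.

```c
#include <ctype.h>
#include <stddef.h>

/* Copy the identifier starting at *p into buf (at most size-1 chars)
   and leave *p on the first character that is neither a letter nor a
   digit, so the next scan sees it. Returns the number of characters
   stored, or 0 if *p does not start with a letter. */
size_t scan_identifier(const char **p, char *buf, size_t size)
{
    size_t n = 0;

    if (size == 0)
        return 0;
    buf[0] = '\0';
    if (!isalpha((unsigned char)**p))
        return 0;
    while (isalnum((unsigned char)**p)) {
        if (n + 1 < size)
            buf[n++] = **p;     /* keep the character if it fits */
        (*p)++;
    }
    buf[n] = '\0';
    return n;
}
```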

Interfacing to the Lexical Analyzer

Lexical analyzer reads characters from stdin.

Parser gets tokens/attributes from the lexical analyzer.

The simplest arrangement is for the lexical analyzer to have a function to call to get the next token.

Symbol Table

Generally a symbol table supports insertion of identifiers with attributes.

A symbol table also allows searching for an identifier by name.

Efficiency usually dictates using some form of hash table or tree (STL map: red-black tree).

For Chapter 2 the symbol table is an array of tuples of char pointers and ints (struct entry).

Symbol Table (2)

A symbol table is a good place to handle keywords like “div” and “mod”.

The translator inserts “div” with the #defined constant DIV (and “mod” with MOD) in the table.

DIV and MOD are ints greater than 255 to avoid confusion with single char tokens.

The lexical analyzer uses lookup to search for a string. If in the table, it uses the token type from the table. Otherwise it inserts it as an ID.
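A minimal sketch of such a table, as an array of struct entry searched linearly, is shown below. The specific values of ID, DIV and MOD, the table size, and the helper token_for are assumptions; the sketch also stores the caller's pointer instead of copying the lexeme, which a real scanner would have to handle.

```c
#include <string.h>

#define ID   256            /* token codes above 255; */
#define DIV  257            /* the exact values are assumptions */
#define MOD  258
#define SYMMAX 100

struct entry {
    const char *lexptr;     /* the lexeme */
    int token;              /* its token type */
};

static struct entry symtable[SYMMAX];
static int lastentry = 0;   /* entries 1..lastentry are in use */

/* Return the index of s in the table, or 0 if it is absent. */
int lookup(const char *s)
{
    for (int p = lastentry; p > 0; p--)
        if (strcmp(symtable[p].lexptr, s) == 0)
            return p;
    return 0;
}

/* Add s with the given token type; returns its index, or 0 if full.
   The table keeps the pointer, so s must outlive the table. */
int insert(const char *s, int token)
{
    if (lastentry + 1 >= SYMMAX)
        return 0;
    lastentry++;
    symtable[lastentry].lexptr = s;
    symtable[lastentry].token = token;
    return lastentry;
}

/* Preload the keywords so the lexer finds them by ordinary lookup. */
void init(void)
{
    insert("div", DIV);
    insert("mod", MOD);
}

/* What the lexer does with a scanned word: reuse its entry if
   present, otherwise insert it as an ID. Returns the token type. */
int token_for(const char *word)
{
    int p = lookup(word);
    if (p == 0)
        p = insert(word, ID);
    return p ? symtable[p].token : 0;
}
```

After init(), token_for("div") yields DIV while an unseen word such as "x" is inserted and reported as ID, so keywords need no special case in the scanner.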

Abstract Stack Machine ●









An abstract stack machine is a possible form of intermediate code for a compiler. An ASM has data memory, instruction memory, a data stack and a CPU. The CPU has instructions to move data from data memory to the stack and vice versa. It also has instructions to perform operations on the top items of the stack. Lastly the CPU has flow-control instructions.

ASM Arithmetic Instructions ●

Using an ASM is like interpreting postfix



PF(2+3*4) = 234*+





ASM instructions would be

– push 2

– push 3

– push 4

– multiply

– add

There would be a full collection of operators for ints and doubles.
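The instruction sequence above can be run by a very small interpreter. The sketch below is one hypothetical encoding: the enum op, the struct instr layout and the function run are invented for this example and cover only integer arithmetic.

```c
enum op { PUSH, ADD, SUBTRACT, MULTIPLY, DIVIDE };

struct instr {
    enum op op;
    int arg;                 /* used only by PUSH */
};

/* Run n instructions on an empty stack and return the top value.
   Assumes a well-formed program needing at most 64 stack slots. */
int run(const struct instr *code, int n)
{
    int stack[64];
    int top = 0;

    for (int pc = 0; pc < n; pc++) {
        if (code[pc].op == PUSH) {
            stack[top++] = code[pc].arg;
            continue;
        }
        int right = stack[--top];    /* binary operators pop two items */
        int left  = stack[--top];
        switch (code[pc].op) {
        case ADD:      stack[top++] = left + right; break;
        case SUBTRACT: stack[top++] = left - right; break;
        case MULTIPLY: stack[top++] = left * right; break;
        case DIVIDE:   stack[top++] = left / right; break;
        default:       break;
        }
    }
    return stack[top - 1];
}
```

Encoding push 2, push 3, push 4, multiply, add as five instr values and calling run returns 14, the value of 2+3*4.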

L-values and R-values ●







An identifier is used in 2 common ways in a programming language –

On the left side of an assignment (l-value)



As part of an expression (r-value)

When used for the target of an assignment the computer needs the address of the variable.

When used in an expression the computer needs the value of the variable.

An ASM needs rvalue and lvalue instructions.

lvalue and rvalue

To push a variable's value onto the stack

– rvalue a    // a's address is used to get a

To push a variable's address onto the stack

– lvalue a    // a's address is pushed

To compute c=a+b –

lvalue c



rvalue a



rvalue b



add



store

// := in the book
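The difference between the two instructions shows up clearly in a small interpreter. In this sketch, variables are identified by their index in a data-memory array, and store is assumed to expect the value on top of the stack with the address just below it; the encoding (enum op, struct instr, run) is hypothetical.

```c
enum op { LVALUE, RVALUE, ADD, STORE };

struct instr {
    enum op op;
    int addr;                /* data-memory address for LVALUE/RVALUE */
};

/* Execute n instructions against the data memory mem. */
void run(const struct instr *code, int n, int *mem)
{
    int stack[64];
    int top = 0;

    for (int pc = 0; pc < n; pc++) {
        switch (code[pc].op) {
        case LVALUE:                     /* push the address itself */
            stack[top++] = code[pc].addr;
            break;
        case RVALUE:                     /* push the contents of the address */
            stack[top++] = mem[code[pc].addr];
            break;
        case ADD: {
            int right = stack[--top];
            int left  = stack[--top];
            stack[top++] = left + right;
            break;
        }
        case STORE: {                    /* value on top, address below it */
            int value = stack[--top];
            int addr  = stack[--top];
            mem[addr] = value;
            break;
        }
        }
    }
}
```

With a, b and c at addresses 0, 1 and 2 and mem = {5, 7, 0}, the sequence lvalue c, rvalue a, rvalue b, add, store leaves 12 in mem[2].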

ASM Control Flow

label x: set a label named x

goto x: branch to the label named x

gofalse x: go to x if the top of the stack is 0 (also pops it)

gotrue x: go to x if the top of the stack is not 0 (also pops it)

halt: stop execution

ASM Code for if Statement

Source: if expr then stmt

Target:
    code for expr
    gofalse out
    code for stmt
    label out

Translation scheme:
    stmt -> if expr { out = newlabel(); emit('gofalse',out); }
            then stmt { emit('label',out); }

void stmt ()
{
    if ( lookahead == ID ) {
        emit('lvalue',tokenval); match(ID); match('='); expr();
    } else if ( lookahead == IF ) {
        match(IF);
        expr();
        out = newlabel();
        emit('gofalse',out);
        match(THEN);
        stmt();
        emit('label',out);
    } else
        error();
}

Infix to Postfix Translator Specification

start -> list EOF

list -> expr ; list
      | ε

expr -> expr + term         { print('+') }
      | expr - term         { print('-') }
      | term

term -> term * factor       { print('*') }
      | term / factor       { print('/') }
      | term DIV factor     { print('DIV') }
      | term MOD factor     { print('MOD') }
      | factor

factor -> ( expr )
        | id                { print(id.lexeme) }
        | num               { print(num.value) }

Translation Scheme with no Left Recursion

start -> list EOF

list -> expr ; list
      | ε

expr -> term moreterms

moreterms -> + term { print('+') } moreterms
           | - term { print('-') } moreterms
           | ε

term -> factor morefactors

morefactors -> * factor { print('*') } morefactors
             | / factor { print('/') } morefactors
             | DIV factor { print('DIV') } morefactors
             | MOD factor { print('MOD') } morefactors
             | ε

factor -> ( expr )
        | id { print(id.lexeme) }
        | num { print(num.value) }

Tokens

Tokens are identified by an integer and some of them have an integer attribute value.

Many tokens like '+' are simply themselves.

NUM, DIV, MOD, ID, and DONE are #defined as numbers starting with 256 to be distinct.

The integer attribute for NUM is the sequence of digits converted to an integer.

The integer attribute for ID is the index into the symbol table for that ID.