Introduction to Parsing Ambiguity and Syntax Errors

Outline

• Regular languages revisited
• Parser overview
• Context-free grammars (CFGs)
• Derivations
• Ambiguity
• Syntax errors

Compiler Design 1 (2011)

Languages and Automata

• Formal languages are very important in CS
  – Especially in programming languages
• Regular languages
  – The weakest formal languages widely used
  – Many applications
• Many languages are not regular
  – E.g., the language of balanced parentheses is not regular: { (^i )^i | i ≥ 0 }
• We will also study context-free languages

Limitations of Regular Languages

• Intuition: a finite automaton that runs long enough must repeat states
• A finite automaton cannot remember the number of times it has visited a particular state, because a finite automaton has finite memory
  – Only enough to store which state it is in
  – It cannot count, except up to a finite limit
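The counting limitation is easy to see in code: recognizing { (^i )^i | i ≥ 0 } needs an unbounded counter, which no fixed set of states can provide. A minimal sketch (the function name is ours):

```python
def in_lang(s: str) -> bool:
    """Recognize { (^i )^i | i >= 0 }: i opens followed by i closes.

    The counter i is unbounded; a finite automaton with k states
    cannot distinguish '(' * k from '(' * (k + 1), so it cannot
    decide this language.
    """
    i = 0
    while i < len(s) and s[i] == "(":
        i += 1
    return s == "(" * i + ")" * i
```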

The Functionality of the Parser

• Input: sequence of tokens from the lexer
• Output: parse tree of the program

Example

• If-then-else statement
    if (x == y) then z = 1; else z = 2;
• Parser input
    IF (ID == ID) THEN ID = INT; ELSE ID = INT;
• Possible parser output

            IF-THEN-ELSE
           /     |       \
         ==      =        =
        /  \    / \      / \
      ID   ID  ID  INT  ID  INT

Comparison with Lexical Analysis

  Phase    Input                    Output
  Lexer    Sequence of characters   Sequence of tokens
  Parser   Sequence of tokens       Parse tree

The Role of the Parser

• Not all sequences of tokens are programs . . .
• . . . the parser must distinguish between valid and invalid sequences of tokens
• We need
  – A language for describing valid sequences of tokens
  – A method for distinguishing valid from invalid sequences of tokens
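The lexer/parser contrast can be made concrete with a small sketch; the token names mirror the example above, and the nested-tuple tree shape is our choice, not a fixed format:

```python
# Parser input for the example above: the lexer's token sequence.
tokens = ["IF", "(", "ID", "==", "ID", ")", "THEN",
          "ID", "=", "INT", ";", "ELSE", "ID", "=", "INT", ";"]

# One possible parser output: a tree as nested tuples (label, children...).
parse_tree = ("IF-THEN-ELSE",
              ("==", "ID", "ID"),   # condition
              ("=", "ID", "INT"),   # then-branch
              ("=", "ID", "INT"))   # else-branch

def leaves(t):
    """In-order traversal of the leaves."""
    if isinstance(t, str):
        return [t]
    return [tok for child in t[1:] for tok in leaves(child)]
```

Note that the tree keeps only the structure: punctuation such as parentheses and semicolons guided the parser and need not appear as leaves.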

Context-Free Grammars

• Many programming language constructs have a recursive structure
• A STMT is of the form
    if COND then STMT else STMT , or
    while COND do STMT , or
    ...
• Context-free grammars are a natural notation for this recursive structure

CFGs (Cont.)

• A CFG consists of
  – A set of terminals T
  – A set of non-terminals N
  – A start symbol S (a non-terminal)
  – A set of productions
• Assuming X ∈ N, the productions are of the form
    X → ε , or
    X → Y1 Y2 ... Yn   where Yi ∈ N ∪ T

Notational Conventions

• In these lecture notes
  – Non-terminals are written upper-case
  – Terminals are written lower-case
  – The start symbol is the left-hand side of the first production

Examples of CFGs

A fragment of our example language (simplified):

  STMT → if COND then STMT else STMT
       ⏐ while COND do STMT
       ⏐ id = int
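As a sketch, the four components of a CFG map directly onto plain data; the dict-of-alternatives representation below is our choice, not a standard one, with symbols taken from the STMT fragment:

```python
# A CFG as plain data: productions map each non-terminal to its
# alternatives, each alternative being a list of symbols.
grammar = {
    "STMT": [
        ["if", "COND", "then", "STMT", "else", "STMT"],
        ["while", "COND", "do", "STMT"],
        ["id", "=", "int"],
    ],
}
start = "STMT"                      # the start symbol S
nonterminals = set(grammar)         # the set N
terminals = {sym for alts in grammar.values()   # the set T: everything
             for alt in alts for sym in alt     # on a right-hand side
             if sym not in nonterminals}        # that has no rule
```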

Examples of CFGs (cont.)

Grammar for simple arithmetic expressions:

  E → E ∗ E
    ⏐ E + E
    ⏐ ( E )
    ⏐ id

The Language of a CFG

Read productions as replacement rules:

  X → Y1 ... Yn
      means X can be replaced by Y1 ... Yn

  X → ε
      means X can be erased (replaced with the empty string)

Key Idea

(1) Begin with a string consisting of the start symbol “S”
(2) Replace any non-terminal X in the string by a right-hand side of some production
(3) Repeat (2) until there are no non-terminals in the string

The Language of a CFG (Cont.)

More formally, we write

  X1 ... Xi ... Xn → X1 ... Xi−1 Y1 ... Ym Xi+1 ... Xn

if there is a production

  Xi → Y1 ... Ym
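One replacement step is a one-liner once sentential forms are lists of symbols; a minimal sketch (the representation is ours):

```python
def step(form, i, rhs, nonterminals):
    """X1 ... Xi ... Xn  ->  X1 ... Xi-1 Y1 ... Ym Xi+1 ... Xn
    given the production Xi -> Y1 ... Ym."""
    assert form[i] in nonterminals, "only non-terminals can be replaced"
    return form[:i] + rhs + form[i + 1:]
```

For example, applying E → E ∗ E to the first E of E + E gives step(["E", "+", "E"], 0, ["E", "*", "E"], {"E"}), i.e. E ∗ E + E.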

The Language of a CFG (Cont.)

Write

  X1 ... Xn →* Y1 ... Ym

if

  X1 ... Xn → ... → ... → Y1 ... Ym

in 0 or more steps

The Language of a CFG

Let G be a context-free grammar with start symbol S. Then the language of G is:

  { a1 ... an | S →* a1 ... an and every ai is a terminal }

Terminals

• Terminals are called so because there are no rules for replacing them
• Once generated, terminals are permanent
• Terminals ought to be the tokens of the language

Examples

L(G) is the language of the CFG G.

Strings of balanced parentheses: { (^i )^i | i ≥ 0 }

Two grammars:

  S → ( S )          OR          S → ( S ) | ε
  S → ε
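As a sketch of why these productions generate exactly { (^i )^i }, membership can be checked by running the two rules backwards (the function name is ours):

```python
def derives(s: str) -> bool:
    """Undo S -> ( S ) while possible, then accept iff what is left
    is the empty string, i.e. S -> eps finishes the derivation."""
    while s.startswith("(") and s.endswith(")"):
        s = s[1:-1]          # peel one application of S -> ( S )
    return s == ""           # S -> eps
```

Note that "()()" is rejected: it is balanced in the everyday sense, but it is not in { (^i )^i }, and no derivation of this grammar produces it.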

Example

A fragment of our example language (simplified):

  STMT → if COND then STMT
       ⏐ if COND then STMT else STMT
       ⏐ while COND do STMT
       ⏐ id = int
  COND → (id == id)
       ⏐ (id != id)

Example (Cont.)

Some elements of our language:

  id = int
  if (id == id) then id = int else id = int
  while (id != id) do id = int
  while (id == id) do while (id != id) do id = int
  if (id != id) then if (id == id) then id = int else id = int

Arithmetic Example

Simple arithmetic expressions:

  E → E + E | E ∗ E | ( E ) | id

Some elements of the language:

  id          id ∗ id
  id + id     (id) ∗ id
  (id)        id ∗ (id)

Notes

The idea of a CFG is a big step. But:

• Membership in a language is just “yes” or “no”; we also need the parse tree of the input
• Must handle errors gracefully
• Need an implementation of CFGs (e.g., yacc)

More Notes

• Form of the grammar is important
  – Many grammars generate the same language
  – Parsing tools are sensitive to the grammar

Note: Tools for regular languages (e.g., lex/ML-Lex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice

Derivations and Parse Trees

• A derivation is a sequence of productions

    S → ... → ... → ...

• A derivation can be drawn as a tree
  – The start symbol is the tree’s root
  – For a production X → Y1 ... Yn, add children Y1 ... Yn to node X

Derivation Example

• Grammar
    E → E + E | E ∗ E | ( E ) | id
• String
    id ∗ id + id

Derivation Example (Cont.)

  E → E + E
    → E ∗ E + E
    → id ∗ E + E
    → id ∗ id + E
    → id ∗ id + id

The corresponding parse tree:

          E
        / | \
       E  +  E
     / | \   |
    E  ∗  E  id
    |     |
    id    id
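The derivation above can be replayed mechanically by always rewriting the left-most E; a minimal sketch (the helper name is ours):

```python
def apply_leftmost(form, rhs):
    """Rewrite the left-most non-terminal E in `form` with `rhs`."""
    i = form.index("E")
    return form[:i] + rhs + form[i + 1:]

form = ["E"]
# The right-hand sides used at each step of the slide's derivation.
for rhs in (["E", "+", "E"], ["E", "*", "E"], ["id"], ["id"], ["id"]):
    form = apply_leftmost(form, rhs)
# form is now ["id", "*", "id", "+", "id"]
```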

Derivation in Detail (1)

Start with the start symbol:

  E

Derivation in Detail (2)

  E → E + E

The root E gains children E, +, E.

Derivation in Detail (3)

  E → E + E
    → E ∗ E + E

The left child E gains children E, ∗, E.

Derivation in Detail (4)

  E → E + E
    → E ∗ E + E
    → id ∗ E + E

The left-most remaining E is rewritten to id.

Derivation in Detail (5)

  E → E + E
    → E ∗ E + E
    → id ∗ E + E
    → id ∗ id + E

Derivation in Detail (6)

  E → E + E
    → E ∗ E + E
    → id ∗ E + E
    → id ∗ id + E
    → id ∗ id + id

The tree is now complete:

          E
        / | \
       E  +  E
     / | \   |
    E  ∗  E  id
    |     |
    id    id

Notes on Derivations

• A parse tree has
  – Terminals at the leaves
  – Non-terminals at the interior nodes
• An in-order traversal of the leaves is the original input
• The parse tree shows the association of operations; the input string does not

Left-most and Right-most Derivations

• The example is a left-most derivation
  – At each step, replace the left-most non-terminal
• There is an equivalent notion of a right-most derivation
  – At each step, replace the right-most non-terminal

  E → E + E
    → E + id
    → E ∗ E + id
    → E ∗ id + id
    → id ∗ id + id

Right-most Derivation in Detail (1)

Start with the start symbol:

  E

Right-most Derivation in Detail (2)

  E → E + E

Right-most Derivation in Detail (3)

  E → E + E
    → E + id

Right-most Derivation in Detail (4)

  E → E + E
    → E + id
    → E ∗ E + id

Right-most Derivation in Detail (5)

  E → E + E
    → E + id
    → E ∗ E + id
    → E ∗ id + id

Right-most Derivation in Detail (6)

  E → E + E
    → E + id
    → E ∗ E + id
    → E ∗ id + id
    → id ∗ id + id

The final tree is the same as for the left-most derivation:

          E
        / | \
       E  +  E
     / | \   |
    E  ∗  E  id
    |     |
    id    id

Derivations and Parse Trees

• Note that right-most and left-most derivations have the same parse tree
• The difference is just in the order in which branches are added

Summary of Derivations

• We are not just interested in whether s ∈ L(G)
  – We need a parse tree for s
• A derivation defines a parse tree
  – But one parse tree may have many derivations
• Left-most and right-most derivations are important in parser implementation

Ambiguity

• Grammar
    E → E + E | E * E | ( E ) | int
• String
    int * int + int

Ambiguity (Cont.)

This string has two parse trees:

          E                          E
        / | \                      / | \
       E  +  E                    E  *  E
     / | \   |                    |   / | \
    E  *  E  int                 int E  +  E
    |     |                          |     |
    int   int                        int   int
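The two trees matter because they assign different meanings once evaluated. A sketch with made-up values for the three int tokens:

```python
# "int * int + int" with the tokens standing for 2, 3, 4 (our choice).
a, b, c = 2, 3, 4

left_tree = (a * b) + c    # the tree with + at the root: (int * int) + int
right_tree = a * (b + c)   # the tree with * at the root: int * (int + int)

# Same token string, two parse trees, two different values.
assert left_tree != right_tree
```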

Ambiguity (Cont.)

• A grammar is ambiguous if it has more than one parse tree for some string
  – Equivalently, there is more than one right-most or left-most derivation for some string
• Ambiguity is bad
  – Leaves the meaning of some programs ill-defined
• Ambiguity is common in programming languages
  – Arithmetic expressions
  – IF-THEN-ELSE

Dealing with Ambiguity

• There are several ways to handle ambiguity
• The most direct method is to rewrite the grammar unambiguously

  E → T + E | T
  T → int * T | int | ( E )

• Enforces precedence of * over +
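The rewritten grammar parses directly by recursive descent, one function per non-terminal, and each input now gets exactly one tree. A sketch (the tuple tree shape and function names are ours):

```python
def parse_E(toks, i=0):
    """E -> T + E | T   (returns (tree, next index))"""
    t, i = parse_T(toks, i)
    if i < len(toks) and toks[i] == "+":
        e, i = parse_E(toks, i + 1)
        return ("+", t, e), i
    return t, i

def parse_T(toks, i=0):
    """T -> int * T | int | ( E )"""
    if toks[i] == "(":
        e, i = parse_E(toks, i + 1)
        assert toks[i] == ")", "expected closing parenthesis"
        return e, i + 1
    assert toks[i] == "int"
    if i + 1 < len(toks) and toks[i + 1] == "*":
        t, i = parse_T(toks, i + 2)
        return ("*", "int", t), i
    return "int", i + 1
```

For int * int + int this yields ('+', ('*', 'int', 'int'), 'int'): the * grouping happens below the +, so * binds tighter.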

Ambiguity: The Dangling Else

• Consider the following grammar

  S → if C then S
    | if C then S else S
    | OTHER

• This grammar is also ambiguous

The Dangling Else: Example

• The expression

    if C1 then if C2 then S3 else S4

  has two parse trees:

    if C1 then (if C2 then S3) else S4     (else attached to the outer if)
    if C1 then (if C2 then S3 else S4)     (else attached to the inner if)

• Typically we want the second form

The Dangling Else: A Fix

• else matches the closest unmatched then
• We can describe this in the grammar

  S   → MIF                  /* all then are matched */
      | UIF                  /* some then are unmatched */
  MIF → if C then MIF else MIF
      | OTHER
  UIF → if C then S
      | if C then MIF else UIF

• Describes the same set of strings

The Dangling Else: Example Revisited

• The expression if C1 then if C2 then S3 else S4

    if C1 then (if C2 then S3 else S4)
      – A valid parse tree (for a UIF)

    if C1 then (if C2 then S3) else S4
      – Not valid, because the then expression (if C2 then S3) is not a MIF
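The "else matches the closest unmatched then" rule is also what a recursive-descent parser does naturally if each S consumes a following else greedily. A sketch (single-token conditions and the tuple tree shape are our simplifications):

```python
def parse_S(toks, i=0):
    """S -> if C then S [else S] | OTHER.

    Consuming 'else' greedily right after the inner S has been parsed
    attaches each else to the closest unmatched then."""
    if toks[i] == "if":
        c = toks[i + 1]                 # conditions are single tokens here
        assert toks[i + 2] == "then"
        s1, i = parse_S(toks, i + 3)
        if i < len(toks) and toks[i] == "else":
            s2, i = parse_S(toks, i + 1)
            return ("if", c, s1, s2), i
        return ("if", c, s1), i
    return toks[i], i + 1               # OTHER
```

On if C1 then if C2 then S3 else S4 this produces the preferred reading, with the else on the inner if.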

Ambiguity

• There are no general techniques for handling ambiguity
• It is impossible to convert an ambiguous grammar to an unambiguous one automatically
• Used with care, ambiguity can simplify the grammar
  – Sometimes allows more natural definitions
  – We need disambiguation mechanisms

Precedence and Associativity Declarations

• Instead of rewriting the grammar
  – Use the more natural (ambiguous) grammar
  – Along with disambiguating declarations
• Most tools allow precedence and associativity declarations to disambiguate grammars
• Examples …

Associativity Declarations

• Consider the grammar  E → E + E | int
• Ambiguous: two parse trees of int + int + int

          E                          E
        / | \                      / | \
       E  +  E                    E  +  E
     / | \   |                    |   / | \
    E  +  E  int                 int E  +  E
    |     |                          |     |
    int   int                        int   int

• Left associativity declaration: %left +
  – Selects the left-hand tree, (int + int) + int

Precedence Declarations

• Consider the grammar  E → E + E | E * E | int
  – And the string int + int * int

          E                          E
        / | \                      / | \
       E  +  E                    E  *  E
       |   / | \                / | \   |
      int E  *  E              E  +  E  int
          |     |              |     |
          int   int            int   int

• Precedence declarations:  %left +
                            %left *
  – The later declaration gives * the higher precedence, selecting int + (int * int)
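What these declarations do can be simulated directly: a precedence table in which later declarations get higher numbers, plus left associativity. The sketch below uses precedence climbing, not the LR machinery the tools actually use:

```python
PREC = {"+": 1, "*": 2}   # %left +  then  %left *  (later = tighter)

def parse_expr(toks, i=0, min_prec=1):
    """Parse with the ambiguous grammar E -> E + E | E * E | int,
    disambiguated by the PREC table."""
    lhs = toks[i]; i += 1                      # operand: an int token
    while i < len(toks) and PREC.get(toks[i], 0) >= min_prec:
        op = toks[i]
        # Left associativity: the recursive call demands strictly
        # higher precedence, so equal-precedence operators group left.
        rhs, i = parse_expr(toks, i + 1, PREC[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs, i
```

int + int * int parses as ('+', 'int', ('*', 'int', 'int')), and int + int + int groups to the left.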

Error Handling

• The purpose of the compiler is
  – To detect invalid programs
  – To translate the valid ones
• Many kinds of possible errors (e.g., in C)

  Error kind    Example                    Detected by …
  Lexical       … $ …                      Lexer
  Syntax        … x *% …                   Parser
  Semantic      … int x; y = x(3); …       Type checker
  Correctness   your favorite program      Tester/User

Syntax Error Handling

• The error handler should
  – Report errors accurately and clearly
  – Recover from an error quickly
  – Not slow down compilation of valid code
• Good error handling is not easy to achieve

Approaches to Syntax Error Recovery

• From simple to complex
  – Panic mode
  – Error productions
  – Automatic local or global correction
• Not all are supported by all parser generators

Error Recovery: Panic Mode

• Simplest, most popular method
• When an error is detected:
  – Discard tokens until one with a clear role is found
  – Continue from there
• Such tokens are called synchronizing tokens
  – Typically the statement or expression terminators
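A minimal sketch of the skip-to-synchronizing-token step; the sync set and the token names in the example are illustrative:

```python
SYNC = {";"}   # synchronizing tokens: here, the statement terminator

def panic_skip(toks, i):
    """Panic-mode recovery: discard tokens from position i until a
    synchronizing token, then resume just past it (or at end of input)."""
    while i < len(toks) and toks[i] not in SYNC:
        i += 1
    return i + 1 if i < len(toks) else i
```

After an error at the stray *% in x = *% y ; z = 1 ; the parser would resume at z, the token after the next semicolon.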

Syntax Error Recovery: Panic Mode (Cont.)

• Consider the erroneous expression

    (1 + + 2) + 3

• Panic-mode recovery:
  – Skip ahead to the next integer and then continue
• (ML-)Yacc: use the special terminal error to describe how much input to skip

    E → int | E + E | ( E ) | error int | ( error )

Syntax Error Recovery: Error Productions

• Idea: specify in the grammar known common mistakes
• Essentially promotes common errors to alternative syntax
• Example:
  – Writing 5 x instead of 5 * x
  – Add the production E → … | E E
• Disadvantage
  – Complicates the grammar

Syntax Error Recovery: Past and Present

• Past
  – Slow recompilation cycle (even once a day)
  – Find as many errors in one cycle as possible
  – Researchers could not let go of the topic
• Present
  – Quick recompilation cycle
  – Users tend to correct one error per cycle
  – Complex error recovery is needed less
  – Panic mode seems enough