Introduction to Parsing Ambiguity and Syntax Errors

Outline

• Regular languages revisited
• Parser overview
• Context-free grammars (CFGs)
• Derivations
• Ambiguity
• Syntax errors

Compiler Design 1 (2011)

Languages and Automata

• Formal languages are very important in CS
  – Especially in programming languages
• Regular languages
  – The weakest formal languages widely used
  – Many applications
• Many languages are not regular
  – E.g., the language of balanced parentheses is not regular: { (^i )^i | i ≥ 0 }
• We will also study context-free languages

Limitations of Regular Languages

• Intuition: a finite automaton that runs long enough must repeat states
• A finite automaton cannot remember the number of times it has visited a particular state, because a finite automaton has finite memory
  – Only enough to store which state it is in
  – It cannot count, except up to a finite limit
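The counting limitation is easy to see in code: recognizing { (^i )^i | i ≥ 0 } needs an unbounded counter, which no fixed set of states can provide. A minimal sketch (the function name is ours):

```python
def in_lang(s: str) -> bool:
    """Recognize { (^i )^i | i >= 0 }: i opens followed by i closes.

    The counter i is unbounded; a finite automaton with k states
    cannot distinguish '(' * k from '(' * (k + 1), so it cannot
    decide this language.
    """
    i = 0
    while i < len(s) and s[i] == "(":
        i += 1
    return s == "(" * i + ")" * i
```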

The Functionality of the Parser

• Input: sequence of tokens from the lexer
• Output: parse tree of the program

Example

• If-then-else statement
    if (x == y) then z = 1; else z = 2;
• Parser input
    IF (ID == ID) THEN ID = INT; ELSE ID = INT;
• Possible parser output

            IF-THEN-ELSE
           /     |       \
         ==      =        =
        /  \    / \      / \
      ID   ID  ID  INT  ID  INT

Comparison with Lexical Analysis

  Phase    Input                    Output
  Lexer    Sequence of characters   Sequence of tokens
  Parser   Sequence of tokens       Parse tree

The Role of the Parser

• Not all sequences of tokens are programs . . .
• . . . the parser must distinguish between valid and invalid sequences of tokens
• We need
  – A language for describing valid sequences of tokens
  – A method for distinguishing valid from invalid sequences of tokens
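The lexer/parser contrast can be made concrete with a small sketch; the token names mirror the example above, and the nested-tuple tree shape is our choice, not a fixed format:

```python
# Parser input for the example above: the lexer's token sequence.
tokens = ["IF", "(", "ID", "==", "ID", ")", "THEN",
          "ID", "=", "INT", ";", "ELSE", "ID", "=", "INT", ";"]

# One possible parser output: a tree as nested tuples (label, children...).
parse_tree = ("IF-THEN-ELSE",
              ("==", "ID", "ID"),   # condition
              ("=", "ID", "INT"),   # then-branch
              ("=", "ID", "INT"))   # else-branch

def leaves(t):
    """In-order traversal of the leaves."""
    if isinstance(t, str):
        return [t]
    return [tok for child in t[1:] for tok in leaves(child)]
```

Note that the tree keeps only the structure: punctuation such as parentheses and semicolons guided the parser and need not appear as leaves.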

Context-Free Grammars

• Many programming language constructs have a recursive structure
• A STMT is of the form
    if COND then STMT else STMT , or
    while COND do STMT , or
    ...
• Context-free grammars are a natural notation for this recursive structure

CFGs (Cont.)

• A CFG consists of
  – A set of terminals T
  – A set of non-terminals N
  – A start symbol S (a non-terminal)
  – A set of productions
• Assuming X ∈ N, the productions are of the form
    X → ε , or
    X → Y1 Y2 ... Yn   where Yi ∈ N ∪ T

Notational Conventions

• In these lecture notes
  – Non-terminals are written upper-case
  – Terminals are written lower-case
  – The start symbol is the left-hand side of the first production

Examples of CFGs

A fragment of our example language (simplified):

  STMT → if COND then STMT else STMT
       ⏐ while COND do STMT
       ⏐ id = int
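As a sketch, the four components of a CFG map directly onto plain data; the dict-of-alternatives representation below is our choice, not a standard one, with symbols taken from the STMT fragment:

```python
# A CFG as plain data: productions map each non-terminal to its
# alternatives, each alternative being a list of symbols.
grammar = {
    "STMT": [
        ["if", "COND", "then", "STMT", "else", "STMT"],
        ["while", "COND", "do", "STMT"],
        ["id", "=", "int"],
    ],
}
start = "STMT"                      # the start symbol S
nonterminals = set(grammar)         # the set N
terminals = {sym for alts in grammar.values()   # the set T: everything
             for alt in alts for sym in alt     # on a right-hand side
             if sym not in nonterminals}        # that has no rule
```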

Examples of CFGs (cont.)

Grammar for simple arithmetic expressions:

  E → E ∗ E
    ⏐ E + E
    ⏐ ( E )
    ⏐ id

The Language of a CFG

Read productions as replacement rules:

  X → Y1 ... Yn
      means X can be replaced by Y1 ... Yn

  X → ε
      means X can be erased (replaced with the empty string)

Key Idea

(1) Begin with a string consisting of the start symbol “S”
(2) Replace any non-terminal X in the string by a right-hand side of some production
(3) Repeat (2) until there are no non-terminals in the string

The Language of a CFG (Cont.)

More formally, we write

  X1 ... Xi ... Xn → X1 ... Xi−1 Y1 ... Ym Xi+1 ... Xn

if there is a production

  Xi → Y1 ... Ym
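One replacement step is a one-liner once sentential forms are lists of symbols; a minimal sketch (the representation is ours):

```python
def step(form, i, rhs, nonterminals):
    """X1 ... Xi ... Xn  ->  X1 ... Xi-1 Y1 ... Ym Xi+1 ... Xn
    given the production Xi -> Y1 ... Ym."""
    assert form[i] in nonterminals, "only non-terminals can be replaced"
    return form[:i] + rhs + form[i + 1:]
```

For example, applying E → E ∗ E to the first E of E + E gives step(["E", "+", "E"], 0, ["E", "*", "E"], {"E"}), i.e. E ∗ E + E.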

The Language of a CFG (Cont.)

Write

  X1 ... Xn →* Y1 ... Ym

if

  X1 ... Xn → ... → ... → Y1 ... Ym

in 0 or more steps

The Language of a CFG

Let G be a context-free grammar with start symbol S. Then the language of G is:

  { a1 ... an | S →* a1 ... an and every ai is a terminal }

Terminals

• Terminals are called so because there are no rules for replacing them
• Once generated, terminals are permanent
• Terminals ought to be the tokens of the language

Examples

L(G) is the language of the CFG G.

Strings of balanced parentheses: { (^i )^i | i ≥ 0 }

Two grammars:

  S → ( S )          OR          S → ( S ) | ε
  S → ε
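As a sketch of why these productions generate exactly { (^i )^i }, membership can be checked by running the two rules backwards (the function name is ours):

```python
def derives(s: str) -> bool:
    """Undo S -> ( S ) while possible, then accept iff what is left
    is the empty string, i.e. S -> eps finishes the derivation."""
    while s.startswith("(") and s.endswith(")"):
        s = s[1:-1]          # peel one application of S -> ( S )
    return s == ""           # S -> eps
```

Note that "()()" is rejected: it is balanced in the everyday sense, but it is not in { (^i )^i }, and no derivation of this grammar produces it.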

Example

A fragment of our example language (simplified):

  STMT → if COND then STMT
       ⏐ if COND then STMT else STMT
       ⏐ while COND do STMT
       ⏐ id = int
  COND → (id == id)
       ⏐ (id != id)

Example (Cont.)

Some elements of our language:

  id = int
  if (id == id) then id = int else id = int
  while (id != id) do id = int
  while (id == id) do while (id != id) do id = int
  if (id != id) then if (id == id) then id = int else id = int

Arithmetic Example

Simple arithmetic expressions:

  E → E + E | E ∗ E | ( E ) | id

Some elements of the language:

  id          id ∗ id
  id + id     (id) ∗ id
  (id)        id ∗ (id)

Notes

The idea of a CFG is a big step. But:

• Membership in a language is just “yes” or “no”; we also need the parse tree of the input
• Must handle errors gracefully
• Need an implementation of CFGs (e.g., yacc)

More Notes

• Form of the grammar is important
  – Many grammars generate the same language
  – Parsing tools are sensitive to the grammar

Note: Tools for regular languages (e.g., lex/ML-Lex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice

Derivations and Parse Trees

• A derivation is a sequence of productions

    S → ... → ... → ...

• A derivation can be drawn as a tree
  – The start symbol is the tree’s root
  – For a production X → Y1 ... Yn, add children Y1 ... Yn to node X

Derivation Example

• Grammar
    E → E + E | E ∗ E | ( E ) | id
• String
    id ∗ id + id

Derivation Example (Cont.)

  E → E + E
    → E ∗ E + E
    → id ∗ E + E
    → id ∗ id + E
    → id ∗ id + id

The corresponding parse tree:

          E
        / | \
       E  +  E
     / | \   |
    E  ∗  E  id
    |     |
    id    id
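The derivation above can be replayed mechanically by always rewriting the left-most E; a minimal sketch (the helper name is ours):

```python
def apply_leftmost(form, rhs):
    """Rewrite the left-most non-terminal E in `form` with `rhs`."""
    i = form.index("E")
    return form[:i] + rhs + form[i + 1:]

form = ["E"]
# The right-hand sides used at each step of the slide's derivation.
for rhs in (["E", "+", "E"], ["E", "*", "E"], ["id"], ["id"], ["id"]):
    form = apply_leftmost(form, rhs)
# form is now ["id", "*", "id", "+", "id"]
```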

Derivation in Detail (1)

Start with the start symbol:

  E

Derivation in Detail (2)

  E → E + E

The root E gains children E, +, E.

Derivation in Detail (3)

  E → E + E
    → E ∗ E + E

The left child E gains children E, ∗, E.

Derivation in Detail (4)

  E → E + E
    → E ∗ E + E
    → id ∗ E + E

The left-most remaining E is rewritten to id.

Derivation in Detail (5)

  E → E + E
    → E ∗ E + E
    → id ∗ E + E
    → id ∗ id + E

Derivation in Detail (6)

  E → E + E
    → E ∗ E + E
    → id ∗ E + E
    → id ∗ id + E
    → id ∗ id + id

The tree is now complete:

          E
        / | \
       E  +  E
     / | \   |
    E  ∗  E  id
    |     |
    id    id

Notes on Derivations

• A parse tree has
  – Terminals at the leaves
  – Non-terminals at the interior nodes
• An in-order traversal of the leaves is the original input
• The parse tree shows the association of operations; the input string does not

Left-most and Right-most Derivations

• The example is a left-most derivation
  – At each step, replace the left-most non-terminal
• There is an equivalent notion of a right-most derivation
  – At each step, replace the right-most non-terminal

  E → E + E
    → E + id
    → E ∗ E + id
    → E ∗ id + id
    → id ∗ id + id

Right-most Derivation in Detail (1)

Start with the start symbol:

  E

Right-most Derivation in Detail (2)

  E → E + E

Right-most Derivation in Detail (3)

  E → E + E
    → E + id

Right-most Derivation in Detail (4)

  E → E + E
    → E + id
    → E ∗ E + id

Right-most Derivation in Detail (5)

  E → E + E
    → E + id
    → E ∗ E + id
    → E ∗ id + id

Right-most Derivation in Detail (6)

  E → E + E
    → E + id
    → E ∗ E + id
    → E ∗ id + id
    → id ∗ id + id

The final tree is the same as for the left-most derivation:

          E
        / | \
       E  +  E
     / | \   |
    E  ∗  E  id
    |     |
    id    id

Derivations and Parse Trees

• Note that right-most and left-most derivations have the same parse tree
• The difference is just in the order in which branches are added

Summary of Derivations

• We are not just interested in whether s ∈ L(G)
  – We need a parse tree for s
• A derivation defines a parse tree
  – But one parse tree may have many derivations
• Left-most and right-most derivations are important in parser implementation

Ambiguity

• Grammar
    E → E + E | E * E | ( E ) | int
• String
    int * int + int

Ambiguity (Cont.)

This string has two parse trees:

          E                          E
        / | \                      / | \
       E  +  E                    E  *  E
     / | \   |                    |   / | \
    E  *  E  int                 int E  +  E
    |     |                          |     |
    int   int                        int   int
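The two trees matter because they assign different meanings once evaluated. A sketch with made-up values for the three int tokens:

```python
# "int * int + int" with the tokens standing for 2, 3, 4 (our choice).
a, b, c = 2, 3, 4

left_tree = (a * b) + c    # the tree with + at the root: (int * int) + int
right_tree = a * (b + c)   # the tree with * at the root: int * (int + int)

# Same token string, two parse trees, two different values.
assert left_tree != right_tree
```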

Ambiguity (Cont.)

• A grammar is ambiguous if it has more than one parse tree for some string
  – Equivalently, there is more than one right-most or left-most derivation for some string
• Ambiguity is bad
  – Leaves the meaning of some programs ill-defined
• Ambiguity is common in programming languages
  – Arithmetic expressions
  – IF-THEN-ELSE

Dealing with Ambiguity

• There are several ways to handle ambiguity
• The most direct method is to rewrite the grammar unambiguously

  E → T + E | T
  T → int * T | int | ( E )

• Enforces precedence of * over +
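The rewritten grammar parses directly by recursive descent, one function per non-terminal, and each input now gets exactly one tree. A sketch (the tuple tree shape and function names are ours):

```python
def parse_E(toks, i=0):
    """E -> T + E | T   (returns (tree, next index))"""
    t, i = parse_T(toks, i)
    if i < len(toks) and toks[i] == "+":
        e, i = parse_E(toks, i + 1)
        return ("+", t, e), i
    return t, i

def parse_T(toks, i=0):
    """T -> int * T | int | ( E )"""
    if toks[i] == "(":
        e, i = parse_E(toks, i + 1)
        assert toks[i] == ")", "expected closing parenthesis"
        return e, i + 1
    assert toks[i] == "int"
    if i + 1 < len(toks) and toks[i + 1] == "*":
        t, i = parse_T(toks, i + 2)
        return ("*", "int", t), i
    return "int", i + 1
```

For int * int + int this yields ('+', ('*', 'int', 'int'), 'int'): the * grouping happens below the +, so * binds tighter.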

Ambiguity: The Dangling Else

• Consider the following grammar

  S → if C then S
    | if C then S else S
    | OTHER

• This grammar is also ambiguous

The Dangling Else: Example

• The expression

    if C1 then if C2 then S3 else S4

  has two parse trees:

    if C1 then (if C2 then S3) else S4     (else attached to the outer if)
    if C1 then (if C2 then S3 else S4)     (else attached to the inner if)

• Typically we want the second form

The Dangling Else: A Fix

• else matches the closest unmatched then
• We can describe this in the grammar

  S   → MIF                  /* all then are matched */
      | UIF                  /* some then are unmatched */
  MIF → if C then MIF else MIF
      | OTHER
  UIF → if C then S
      | if C then MIF else UIF

• Describes the same set of strings

The Dangling Else: Example Revisited

• The expression if C1 then if C2 then S3 else S4

    if C1 then (if C2 then S3 else S4)
      – A valid parse tree (for a UIF)

    if C1 then (if C2 then S3) else S4
      – Not valid, because the then expression (if C2 then S3) is not a MIF
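The "else matches the closest unmatched then" rule is also what a recursive-descent parser does naturally if each S consumes a following else greedily. A sketch (single-token conditions and the tuple tree shape are our simplifications):

```python
def parse_S(toks, i=0):
    """S -> if C then S [else S] | OTHER.

    Consuming 'else' greedily right after the inner S has been parsed
    attaches each else to the closest unmatched then."""
    if toks[i] == "if":
        c = toks[i + 1]                 # conditions are single tokens here
        assert toks[i + 2] == "then"
        s1, i = parse_S(toks, i + 3)
        if i < len(toks) and toks[i] == "else":
            s2, i = parse_S(toks, i + 1)
            return ("if", c, s1, s2), i
        return ("if", c, s1), i
    return toks[i], i + 1               # OTHER
```

On if C1 then if C2 then S3 else S4 this produces the preferred reading, with the else on the inner if.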

Ambiguity

• There are no general techniques for handling ambiguity
• It is impossible to convert an ambiguous grammar to an unambiguous one automatically
• Used with care, ambiguity can simplify the grammar
  – Sometimes allows more natural definitions
  – We need disambiguation mechanisms

Precedence and Associativity Declarations

• Instead of rewriting the grammar
  – Use the more natural (ambiguous) grammar
  – Along with disambiguating declarations
• Most tools allow precedence and associativity declarations to disambiguate grammars
• Examples …

Associativity Declarations

• Consider the grammar  E → E + E | int
• Ambiguous: two parse trees of int + int + int

          E                          E
        / | \                      / | \
       E  +  E                    E  +  E
     / | \   |                    |   / | \
    E  +  E  int                 int E  +  E
    |     |                          |     |
    int   int                        int   int

• Left associativity declaration: %left +
  – Selects the left-hand tree, (int + int) + int

Precedence Declarations

• Consider the grammar  E → E + E | E * E | int
  – And the string int + int * int

          E                          E
        / | \                      / | \
       E  +  E                    E  *  E
       |   / | \                / | \   |
      int E  *  E              E  +  E  int
          |     |              |     |
          int   int            int   int

• Precedence declarations:  %left +
                            %left *
  – The later declaration gives * the higher precedence, selecting int + (int * int)
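What these declarations do can be simulated directly: a precedence table in which later declarations get higher numbers, plus left associativity. The sketch below uses precedence climbing, not the LR machinery the tools actually use:

```python
PREC = {"+": 1, "*": 2}   # %left +  then  %left *  (later = tighter)

def parse_expr(toks, i=0, min_prec=1):
    """Parse with the ambiguous grammar E -> E + E | E * E | int,
    disambiguated by the PREC table."""
    lhs = toks[i]; i += 1                      # operand: an int token
    while i < len(toks) and PREC.get(toks[i], 0) >= min_prec:
        op = toks[i]
        # Left associativity: the recursive call demands strictly
        # higher precedence, so equal-precedence operators group left.
        rhs, i = parse_expr(toks, i + 1, PREC[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs, i
```

int + int * int parses as ('+', 'int', ('*', 'int', 'int')), and int + int + int groups to the left.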

Error Handling

• The purpose of the compiler is
  – To detect invalid programs
  – To translate the valid ones
• Many kinds of possible errors (e.g., in C)

  Error kind    Example                    Detected by …
  Lexical       … $ …                      Lexer
  Syntax        … x *% …                   Parser
  Semantic      … int x; y = x(3); …       Type checker
  Correctness   your favorite program      Tester/User

Syntax Error Handling

• The error handler should
  – Report errors accurately and clearly
  – Recover from an error quickly
  – Not slow down compilation of valid code
• Good error handling is not easy to achieve

Approaches to Syntax Error Recovery

• From simple to complex
  – Panic mode
  – Error productions
  – Automatic local or global correction
• Not all are supported by all parser generators

Error Recovery: Panic Mode

• Simplest, most popular method
• When an error is detected:
  – Discard tokens until one with a clear role is found
  – Continue from there
• Such tokens are called synchronizing tokens
  – Typically the statement or expression terminators
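A minimal sketch of the skip-to-synchronizing-token step; the sync set and the token names in the example are illustrative:

```python
SYNC = {";"}   # synchronizing tokens: here, the statement terminator

def panic_skip(toks, i):
    """Panic-mode recovery: discard tokens from position i until a
    synchronizing token, then resume just past it (or at end of input)."""
    while i < len(toks) and toks[i] not in SYNC:
        i += 1
    return i + 1 if i < len(toks) else i
```

After an error at the stray *% in x = *% y ; z = 1 ; the parser would resume at z, the token after the next semicolon.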

Syntax Error Recovery: Panic Mode (Cont.)

• Consider the erroneous expression

    (1 + + 2) + 3

• Panic-mode recovery:
  – Skip ahead to the next integer and then continue
• (ML-)Yacc: use the special terminal error to describe how much input to skip

    E → int | E + E | ( E ) | error int | ( error )

Syntax Error Recovery: Error Productions

• Idea: specify in the grammar known common mistakes
• Essentially promotes common errors to alternative syntax
• Example:
  – Writing 5 x instead of 5 * x
  – Add the production E → … | E E
• Disadvantage
  – Complicates the grammar

Syntax Error Recovery: Past and Present

• Past
  – Slow recompilation cycle (even once a day)
  – Find as many errors in one cycle as possible
  – Researchers could not let go of the topic
• Present
  – Quick recompilation cycle
  – Users tend to correct one error per cycle
  – Complex error recovery is needed less
  – Panic mode seems enough