Introduction to Parsing: Ambiguity and Syntax Errors
Compiler Design 1 (2011)

Outline
• Regular languages and their limitations
• The functionality of the parser
• Context-free grammars
• Derivations and parse trees
• Ambiguity
• Syntax errors and error recovery

Languages and Automata
• Formal languages are very important in CS
  – Especially in programming languages
• Regular languages
  – The weakest formal languages widely used
  – Many applications
• Many languages are not regular
  – E.g., the language of balanced parentheses { (^i )^i | i ≥ 0 } is not regular
• Intuition: a finite automaton that runs long enough must repeat states
  – A finite automaton cannot remember the number of times it has visited a particular state, because it has finite memory
  – Only enough to store which state it is in
  – Cannot count, except up to a finite limit
• We will also study context-free languages
The Functionality of the Parser
• Input: sequence of tokens from lexer
• Output: parse tree of the program

Example
• If-then-else statement
    if (x == y) then z = 1; else z = 2;
• Parser input
    IF (ID == ID) THEN ID = INT; ELSE ID = INT;
• Possible parser output

    IF-THEN-ELSE
    ├─ ==  with children ID, ID
    ├─ =   with children ID, INT
    └─ =   with children ID, INT
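To make the parser's interface concrete, here is a minimal Python sketch of how the token sequence and the parse tree above might be represented. The list/tuple encoding and the names are illustrative choices, not a fixed parser API.

```python
# Hypothetical encoding of the parser's input and output for
# "if (x == y) then z = 1; else z = 2;" (names mirror the slide above).

# Input: the token sequence produced by the lexer.
tokens = ["IF", "(", "ID", "==", "ID", ")", "THEN",
          "ID", "=", "INT", ";", "ELSE", "ID", "=", "INT", ";"]

# Output: a parse tree, written here as nested (label, children...) tuples.
parse_tree = ("IF-THEN-ELSE",
              ("==", "ID", "ID"),    # condition    (x == y)
              ("=",  "ID", "INT"),   # then-branch  z = 1
              ("=",  "ID", "INT"))   # else-branch  z = 2
```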
Comparison with Lexical Analysis

  Phase    Input                    Output
  Lexer    Sequence of characters   Sequence of tokens
  Parser   Sequence of tokens       Parse tree

The Role of the Parser
• Not all sequences of tokens are programs . . .
• . . . Parser must distinguish between valid and invalid sequences of tokens
• We need
  – A language for describing valid sequences of tokens
  – A method for distinguishing valid from invalid sequences of tokens
Context-Free Grammars
• Many programming language constructs have a recursive structure
• A STMT is of the form
    if COND then STMT else STMT , or
    while COND do STMT , or
    …
• Context-free grammars are a natural notation for this recursive structure

CFGs (Cont.)
• A CFG consists of
  – A set of terminals T
  – A set of non-terminals N
  – A start symbol S (a non-terminal)
  – A set of productions
• Assuming X ∈ N, the productions are of the form
    X → ε , or
    X → Y1 Y2 ... Yn   where Yi ∈ N ∪ T
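To make the four components concrete, here is a small sketch of a CFG as plain Python data, using the balanced-parentheses grammar that appears later in these notes. The dictionary/tuple encoding is just one possible choice for illustration, not a standard representation.

```python
# The balanced-parentheses grammar  S -> ( S ) | epsilon  encoded as plain data.

terminals = {"(", ")"}
non_terminals = {"S"}
start_symbol = "S"

# Each right-hand side is a tuple of grammar symbols; the empty tuple encodes epsilon.
productions = {
    "S": [("(", "S", ")"),   # S -> ( S )
          ()],               # S -> epsilon
}
```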
Notational Conventions
• In these lecture notes
  – Non-terminals are written upper-case
  – Terminals are written lower-case
  – The start symbol is the left-hand side of the first production

Examples of CFGs
A fragment of our example language (simplified):

  STMT → if COND then STMT else STMT
       | while COND do STMT
       | id = int
Examples of CFGs (cont.)
Grammar for simple arithmetic expressions:

  E → E ∗ E
    | E + E
    | ( E )
    | id

The Language of a CFG
Read productions as replacement rules:

  X → Y1 ... Yn   means X can be replaced by Y1 ... Yn
  X → ε           means X can be erased (replaced with the empty string)
Key Idea
(1) Begin with a string consisting of the start symbol "S"
(2) Replace any non-terminal X in the string by a right-hand side of some production
(3) Repeat (2) until there are no non-terminals in the string

The Language of a CFG (Cont.)
More formally, we write

  X1 ... Xi ... Xn → X1 ... Xi−1 Y1 ... Ym Xi+1 ... Xn

if there is a production

  Xi → Y1 ... Ym

The Language of a CFG (Cont.)
Write

  X1 ... Xn →* Y1 ... Ym

if

  X1 ... Xn → ... → ... → Y1 ... Ym   in 0 or more steps

The Language of a CFG
Let G be a context-free grammar with start symbol S. Then the language of G is:

  { a1 ... an | S →* a1 ... an and every ai is a terminal }
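The three-step procedure above can be sketched directly in Python. The code below is an illustrative sketch (a random choice of production stands in for "replace any non-terminal by a right-hand side of some production"); it repeatedly rewrites the left-most non-terminal of a sentential form until only terminals remain, using the balanced-parentheses grammar encoded earlier.

```python
import random

# Sketch of the "Key Idea": start from S and keep replacing non-terminals
# until only terminals remain.  Grammar and encoding are illustrative.
productions = {"S": [("(", "S", ")"), ()]}   # S -> ( S ) | epsilon
non_terminals = set(productions)

def derive(start="S", rng=random.Random(0)):
    sentential_form = [start]                 # step (1): begin with the start symbol
    while any(sym in non_terminals for sym in sentential_form):
        # step (2): pick the left-most non-terminal and one of its productions
        i = next(i for i, sym in enumerate(sentential_form) if sym in non_terminals)
        rhs = rng.choice(productions[sentential_form[i]])
        sentential_form[i:i + 1] = list(rhs)  # replace X by Y1 ... Yn (or erase it)
    return "".join(sentential_form)           # step (3): only terminals remain

print(derive())   # e.g. "(())" -- some string of balanced parentheses
```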
Terminals
• Terminals are called so because there are no rules for replacing them
• Once generated, terminals are permanent
• Terminals ought to be tokens of the language

Examples
L(G) is the language of the CFG G

Strings of balanced parentheses:  { (^i )^i | i ≥ 0 }

Two grammars:

  S → ( S )
  S → ε

  OR

  S → ( S ) | ε
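A quick way to see what this language contains is a tiny recursive-descent recognizer. The sketch below is illustrative code (not part of the lecture); it follows the single production S → ( S ) | ε directly.

```python
def parse_S(s, pos=0):
    """Try to match one S starting at pos; return the position just after it.
    S -> ( S ) | epsilon: either consume '(', an inner S, ')', or match nothing."""
    if pos < len(s) and s[pos] == "(":
        after_inner = parse_S(s, pos + 1)
        if after_inner < len(s) and s[after_inner] == ")":
            return after_inner + 1
    return pos  # the epsilon alternative: match the empty string

def in_language(s):
    return parse_S(s) == len(s)   # the whole input must be derived from S

for s in ["", "()", "(())", "(()", "())", "()()"]:
    print(repr(s), in_language(s))   # only (^i )^i strings print True
```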
Example
A fragment of our example language (simplified):

  STMT → if COND then STMT
       | if COND then STMT else STMT
       | while COND do STMT
       | id = int

  COND → (id == id)
       | (id != id)

Example (Cont.)
Some elements of our language:

  id = int
  if (id == id) then id = int else id = int
  while (id != id) do id = int
  while (id == id) do while (id != id) do id = int
  if (id != id) then if (id == id) then id = int else id = int
Arithmetic Example
Simple arithmetic expressions:

  E → E + E | E ∗ E | ( E ) | id

Some elements of the language:

  id
  id + id
  (id)
  id ∗ id
  (id) ∗ id
  id ∗ (id)

Notes
The idea of a CFG is a big step. But:
• Membership in a language is just "yes" or "no"; we also need the parse tree of the input
• Must handle errors gracefully
• Need an implementation of CFGs (e.g., yacc)
More Notes
• Form of the grammar is important
  – Many grammars generate the same language
  – Parsing tools are sensitive to the grammar
• Note: Tools for regular languages (e.g., lex/ML-Lex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice

Derivations and Parse Trees
A derivation is a sequence of productions

  S → ... → ... → ...

A derivation can be drawn as a tree
  – Start symbol is the tree's root
  – For a production X → Y1 ... Yn add children Y1 ... Yn to node X
Derivation Example
• Grammar
    E → E + E | E ∗ E | ( E ) | id
• String
    id ∗ id + id

  E → E + E
    → E ∗ E + E
    → id ∗ E + E
    → id ∗ id + E
    → id ∗ id + id

Derivation Example (Cont.)
The resulting parse tree:

            E
          / | \
         E  +  E
       / | \   |
      E  ∗  E  id
      |     |
      id    id
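The correspondence between the parse tree and the input can be made explicit in code. Below is an illustrative Python sketch (the tuple encoding and helper name are my own, with ASCII "*" standing for ∗) of the tree above; it also previews a later note that an in-order traversal of the leaves recovers the original input.

```python
# The parse tree above, written as nested (label, children...) tuples.
# Leaf nodes are plain strings; the encoding is illustrative.
tree = ("E",
        ("E", ("E", "id"), "*", ("E", "id")),   # E -> E * E, both E -> id
        "+",
        ("E", "id"))                            # E -> id

def leaves(node):
    """In-order traversal of the leaves: recovers the original input string."""
    if isinstance(node, str):
        return [node]
    _, *children = node
    return [leaf for child in children for leaf in leaves(child)]

print(" ".join(leaves(tree)))   # id * id + id
```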
Derivation in Detail (1)

  E

  [Tree so far: just the root node E]

Derivation in Detail (2)

  E → E + E

  [Tree so far: E with children E  +  E]

Derivation in Detail (3)

  E → E + E
    → E ∗ E + E

  [Tree so far: the left E now has children E  ∗  E]

Derivation in Detail (4)

  E → E + E
    → E ∗ E + E
    → id ∗ E + E

  [Tree so far: the left-most E now has child id]

Derivation in Detail (5)

  E → E + E
    → E ∗ E + E
    → id ∗ E + E
    → id ∗ id + E

  [Tree so far: the E after ∗ now has child id]

Derivation in Detail (6)

  E → E + E
    → E ∗ E + E
    → id ∗ E + E
    → id ∗ id + E
    → id ∗ id + id

  [Tree complete: the E after + now has child id, giving the parse tree shown above]
Notes on Derivations
• A parse tree has
  – Terminals at the leaves
  – Non-terminals at the interior nodes
• An in-order traversal of the leaves is the original input
• The parse tree shows the association of operations, the input string does not

Left-most and Right-most Derivations
• The example is a left-most derivation
  – At each step, replace the left-most non-terminal
• There is an equivalent notion of a right-most derivation
  – At each step, replace the right-most non-terminal:

  E → E + E
    → E + id
    → E ∗ E + id
    → E ∗ id + id
    → id ∗ id + id
Right-most Derivation in Detail (1)

  E

  [Tree so far: just the root node E]

Right-most Derivation in Detail (2)

  E → E + E

  [Tree so far: E with children E  +  E]

Right-most Derivation in Detail (3)

  E → E + E
    → E + id

  [Tree so far: the right E now has child id]

Right-most Derivation in Detail (4)

  E → E + E
    → E + id
    → E ∗ E + id

  [Tree so far: the left E now has children E  ∗  E]

Right-most Derivation in Detail (5)

  E → E + E
    → E + id
    → E ∗ E + id
    → E ∗ id + id

  [Tree so far: the E after ∗ now has child id]

Right-most Derivation in Detail (6)

  E → E + E
    → E + id
    → E ∗ E + id
    → E ∗ id + id
    → id ∗ id + id

  [Tree complete: the left-most E now has child id, giving the same parse tree as before]
Derivations and Parse Trees
• Note that right-most and left-most derivations have the same parse tree
• The difference is just in the order in which branches are added

Summary of Derivations
• We are not just interested in whether s ∈ L(G)
  – We need a parse tree for s
• A derivation defines a parse tree
  – But one parse tree may have many derivations
• Left-most and right-most derivations are important in parser implementation
Ambiguity
• Grammar
    E → E + E | E * E | ( E ) | int
• String
    int * int + int

Ambiguity (Cont.)
This string has two parse trees:

  Parse tree 1 — * at the root, grouping int * (int + int):

            E
          / | \
         E  *  E
         |   / | \
        int E  +  E
            |     |
           int   int

  Parse tree 2 — + at the root, grouping (int * int) + int:

            E
          / | \
         E  +  E
       / | \   |
      E  *  E int
      |     |
     int   int
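For concreteness, here are the two trees written in the same illustrative tuple encoding used earlier; only the grouping differs, which is exactly what makes the grammar ambiguous.

```python
# Two distinct parse trees for "int * int + int" under E -> E + E | E * E | ( E ) | int.
# Same (label, children...) tuple encoding as before (illustrative, not a fixed format).

tree_mul_at_root = ("E",
                    ("E", "int"), "*",
                    ("E", ("E", "int"), "+", ("E", "int")))   # int * (int + int)

tree_add_at_root = ("E",
                    ("E", ("E", "int"), "*", ("E", "int")), "+",
                    ("E", "int"))                             # (int * int) + int
```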
Ambiguity (Cont.)
• A grammar is ambiguous if it has more than one parse tree for some string
  – Equivalently, there is more than one right-most or left-most derivation for some string
• Ambiguity is bad
  – Leaves meaning of some programs ill-defined
• Ambiguity is common in programming languages
  – Arithmetic expressions
  – IF-THEN-ELSE

Dealing with Ambiguity
• There are several ways to handle ambiguity
• Most direct method is to rewrite the grammar unambiguously

    E → T + E | T
    T → int * T | int | ( E )

• Enforces precedence of * over +
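The rewritten grammar can be parsed directly by recursive descent, one function per non-terminal. The sketch below is illustrative (the function names, token list format, and tuple trees are my own choices); it builds trees in which * always ends up below +, as the grammar dictates.

```python
# Recursive-descent parser sketch for  E -> T + E | T   and   T -> int * T | int | ( E ).
# Tokens are plain strings; trees use the same (label, children...) tuple encoding.

def parse_E(tokens, pos):
    t_tree, pos = parse_T(tokens, pos)
    if pos < len(tokens) and tokens[pos] == "+":         # E -> T + E
        e_tree, pos = parse_E(tokens, pos + 1)
        return ("E", t_tree, "+", e_tree), pos
    return ("E", t_tree), pos                            # E -> T

def parse_T(tokens, pos):
    if tokens[pos] == "(":                               # T -> ( E )
        e_tree, pos = parse_E(tokens, pos + 1)
        assert tokens[pos] == ")", "expected ')'"
        return ("T", "(", e_tree, ")"), pos + 1
    assert tokens[pos] == "int", "expected 'int'"
    if pos + 1 < len(tokens) and tokens[pos + 1] == "*": # T -> int * T
        t_tree, pos = parse_T(tokens, pos + 2)
        return ("T", "int", "*", t_tree), pos
    return ("T", "int"), pos + 1                         # T -> int

tree, _ = parse_E(["int", "*", "int", "+", "int"], 0)
print(tree)   # the * is grouped inside a T, below the + — precedence of * over +
```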
Ambiguity: The Dangling Else
• Consider the following grammar

    S → if C then S
      | if C then S else S
      | OTHER

• This grammar is also ambiguous

The Dangling Else: Example
• The expression
    if C1 then if C2 then S3 else S4
  has two parse trees:

  – if C1 then ( if C2 then S3 ) else S4   (else attached to the outer if)
  – if C1 then ( if C2 then S3 else S4 )   (else attached to the inner if)

• Typically we want the second form
The Dangling Else: A Fix
• else matches the closest unmatched then
• We can describe this in the grammar

    S   → MIF                          /* all then are matched */
        | UIF                          /* some then are unmatched */
    MIF → if C then MIF else MIF
        | OTHER
    UIF → if C then S
        | if C then MIF else UIF

• Describes the same set of strings

The Dangling Else: Example Revisited
• The expression if C1 then if C2 then S3 else S4

  – if C1 then ( if C2 then S3 else S4 )   — a valid parse tree (for a UIF)
  – if C1 then ( if C2 then S3 ) else S4   — not valid, because the then expression is not a MIF
Ambiguity
• No general techniques for handling ambiguity
• Impossible to convert automatically an ambiguous grammar to an unambiguous one
• Used with care, ambiguity can simplify the grammar
  – Sometimes allows more natural definitions
  – We need disambiguation mechanisms

Precedence and Associativity Declarations
• Instead of rewriting the grammar
  – Use the more natural (ambiguous) grammar
  – Along with disambiguating declarations
• Most tools allow precedence and associativity declarations to disambiguate grammars
• Examples …
Associativity Declarations
• Consider the grammar  E → E + E | int
• Ambiguous: two parse trees of  int + int + int

    ( int + int ) + int        and        int + ( int + int )

• Left associativity declaration:  %left +
  – Selects the left-associated tree  ( int + int ) + int

Precedence Declarations
• Consider the grammar  E → E + E | E * E | int
  – And the string  int + int * int

    ( int + int ) * int        and        int + ( int * int )

• Precedence declarations:
    %left +
    %left *
  – Operators declared later bind tighter, so * gets higher precedence and the tree  int + ( int * int )  is selected
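As an informal picture of what such declarations achieve inside a parser, here is a small Python sketch of a table-driven expression parser. The precedence table and associativity set play roughly the role of the %left lines above; the names and the tuple trees are my own, and this is an illustration rather than how any particular tool is implemented.

```python
# Illustrative sketch: a precedence table standing in for "%left +  %left *".
PREC = {"+": 1, "*": 2}          # * binds tighter than +
LEFT_ASSOC = {"+", "*"}          # both operators associate to the left

def parse_expr(tokens, pos=0, min_prec=1):
    node, pos = tokens[pos], pos + 1          # an operand ("int")
    while pos < len(tokens) and tokens[pos] in PREC and PREC[tokens[pos]] >= min_prec:
        op = tokens[pos]
        # For a left-associative operator, the right operand may only contain
        # strictly tighter operators; that forces ((a op b) op c) grouping.
        next_min = PREC[op] + 1 if op in LEFT_ASSOC else PREC[op]
        rhs, pos = parse_expr(tokens, pos + 1, next_min)
        node = (op, node, rhs)
    return node, pos

print(parse_expr(["int", "+", "int", "+", "int"])[0])  # ('+', ('+', 'int', 'int'), 'int')
print(parse_expr(["int", "+", "int", "*", "int"])[0])  # ('+', 'int', ('*', 'int', 'int'))
```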
Error Handling
• Purpose of the compiler is
  – To detect non-valid programs
  – To translate the valid ones
• Many kinds of possible errors (e.g. in C)

  Error kind     Example                  Detected by …
  Lexical        …$…                      Lexer
  Syntax         … x *% …                 Parser
  Semantic       … int x; y = x(3); …     Type checker
  Correctness    your favorite program    Tester / User

Syntax Error Handling
• Error handler should
  – Report errors accurately and clearly
  – Recover from an error quickly
  – Not slow down compilation of valid code
• Good error handling is not easy to achieve
Approaches to Syntax Error Recovery
• From simple to complex
  – Panic mode
  – Error productions
  – Automatic local or global correction
• Not all are supported by all parser generators

Error Recovery: Panic Mode
• Simplest, most popular method
• When an error is detected:
  – Discard tokens until one with a clear role is found
  – Continue from there
• Such tokens are called synchronizing tokens
  – Typically the statement or expression terminators
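A minimal sketch of panic-mode recovery, assuming a hand-written parser over a token list with ';' as the synchronizing token; the statement shape and all names are illustrative, not the lecture's implementation.

```python
# Panic-mode recovery sketch (illustrative): statements are "id = int ;".
# On a syntax error, discard tokens until the synchronizing token ';' and resume.

SYNC = ";"

def parse_statement(tokens, pos):
    """Parse one 'id = int ;' statement; raise SyntaxError if it doesn't match."""
    for want in ["id", "=", "int", SYNC]:
        if pos >= len(tokens) or tokens[pos] != want:
            raise SyntaxError(f"expected {want!r} at position {pos}")
        pos += 1
    return pos

def parse_program(tokens):
    pos, errors = 0, 0
    while pos < len(tokens):
        try:
            pos = parse_statement(tokens, pos)
        except SyntaxError:
            errors += 1                              # report the error ...
            while pos < len(tokens) and tokens[pos] != SYNC:
                pos += 1                             # ... skip to the next ';'
            pos += 1                                 # ... and continue after it
    return errors

# One bad statement in the middle; the parser recovers and parses the rest.
print(parse_program(["id", "=", "int", ";",
                     "id", "*%", "int", ";",
                     "id", "=", "int", ";"]))   # 1
```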
Syntax Error Recovery: Panic Mode (Cont.)
• Consider the erroneous expression
    (1 + + 2) + 3
• Panic-mode recovery:
  – Skip ahead to the next integer and then continue
• (ML)-Yacc: use the special terminal error to describe how much input to skip

    E → int | E + E | ( E ) | error int | ( error )

Syntax Error Recovery: Error Productions
• Idea: specify in the grammar known common mistakes
• Essentially promotes common errors to alternative syntax
• Example:
  – Write 5 x instead of 5 * x
  – Add the production E → … | E E
• Disadvantage
  – Complicates the grammar

Syntax Error Recovery: Past and Present
• Past
  – Slow recompilation cycle (even once a day)
  – Find as many errors in one cycle as possible
  – Researchers could not let go of the topic
• Present
  – Quick recompilation cycle
  – Users tend to correct one error per cycle
  – Complex error recovery is needed less
  – Panic mode seems enough