© 2002/03 T. Grust · Compiler Construction: 1. Introduction
1 Introduction (1. Introduction, p. 3)
1.1 Compiler Phases
• Due to the complexity of the compilation task, a compiler typically proceeds in a sequence of compilation phases.
  – In the Tiger book, as in this lecture, each chapter is devoted to one compilation phase.
  – The phases communicate with each other via clearly defined interfaces.
  – Interface: data structure (e.g., a tree), set of exported functions.
  – Each phase operates on an abstract intermediate representation of the source program, not the source program text itself (except the first phase).
• Breaking the compiler into many phases enables reuse (of phase implementations).
  – Example: If we need to adapt our compiler to translate a different source language than Tiger, we only need to rewrite the early phases (Lex → Translate, see below). All phases following Translate remain untouched (after Translate, all specifics of the source language have been “abstracted away”).
1.1.1 A Brief Overview of the Tiger Compiler Phases
• Let us trace the compilation of the Tiger program below and see how the different phases transform the initial source program (later on, we pick only certain source program fragments to keep the exposition short).
  rem.tig
      /* compute the remainder when dividing x by y */
      let
        function rem (x : int, y : int) : int =
          let var d := x / y
          in x - d * y
          end

        var r := 0
      in
        r := rem (10, 3)
      end
1.2 Intermediate Representations (Tree Languages)
• We have seen that the compiler phases pass different intermediate representations (IR) to each other.
  – Often, these IRs take the form of a tree.
  – In these trees, each node assumes one of many node types.
  – Each node type carries a number of attributes to store details about the program (fragment) being represented by that node.
  Example: Recall from our review of the compiler phases:
  – Nodes of type CallExp (function call) carry as attributes: function name (a string), argument list (an IR tree rooted in an ExpList node).
  – Nodes of type IntExp (integer constant) carry as attribute: constant value (an integer).
• How do we precisely describe the valid IR (tree) forms?
  – Use grammars (grammars and trees are equivalent).
  – Grammar: set of rules of the form
        L → R    (T)
    T denotes one possible tree node type in our IR. The righthand side R indicates what the subtree below a node of type T may look like (L may occur in R, and one grammar may have several L → ... rules).
  – Example: Grammar:
        E  → E Op E    (OpExp)
        E  → num       (NumExp)
        Op → +         (Plus)
        Op → *         (Times)
A conforming tree (children are indented below their parent):
    OpExp
        NumExp
            num
        Plus
        NumExp
            num
N.B.
  – Some node types are marked to be leaves (typewriter font).
  – For the num node it might be sensible to add an attribute holding the actual numeric value represented by that node.
• Example: Grammar that describes the valid IR trees for a simple straight-line programming language (no loops, no gotos):
      Stm     → Stm ; Stm            (CompoundStm)
      Stm     → id := Exp            (AssignStm)
      Stm     → print ( ExpList )    (PrintStm)
      Exp     → id                   (IdExp)
      Exp     → num                  (NumExp)
      Exp     → Exp Binop Exp        (OpExp)
      Exp     → ( Stm , Exp )        (EseqExp)
      ExpList → Exp , ExpList        (PairExpList)
      ExpList → Exp                  (LastExpList)
      Binop   → +                    (Plus)
      Binop   → -                    (Minus)
      Binop   → *                    (Times)
      Binop   → /                    (Div)
• A valid program written in the straight-line programming language (provided that 1, 3, 5, 10 and a, b are acceptable for num and id, respectively):
      a := 5+3; b := (print (a, a-1), 10*a); print (b)
• Informal semantics of the straight-line programming language:
  – Stm (statement): may have side effects (variable assignment, I/O).
      Stm ; Stm            Execute left statement, then execute right statement.
      id := Exp            Evaluate Exp, then assign the numeric value to variable id.
      print ( ExpList )    Evaluate all expressions in the list (left to right), then print the resulting numeric values separated by spaces, terminated by newline.
  – Exp (expression): evaluates to a numeric value.
      id                   Evaluates to the current value of variable id.
      num                  Evaluates to the value of the numeric constant.
      Exp Binop Exp        Evaluate left expression, then evaluate right expression, then apply binary operator.
      ( Stm , Exp )        Execute statement, then evaluate Exp, whose value is the value of the whole expression.
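The informal semantics can be turned into a small tree-walking interpreter. The following is only a sketch under stated assumptions: it uses a tagged-union node representation of the kind developed later in this section, simplifies Binop to a plain enum, and keeps variables in a tiny fixed-size table. The names interp_stm, interp_exp, lookup, and MAX_VARS are hypothetical and do not come from the lecture.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical node types for this sketch: one tagged union per grammar
   symbol; Binop is simplified to a plain enum. */
typedef struct A_stm_ *A_stm;
typedef struct A_exp_ *A_exp;
typedef struct A_expList_ *A_expList;
typedef enum { A_plus, A_minus, A_times, A_div } A_binop;

struct A_stm_ {
  enum { A_compoundStm, A_assignStm, A_printStm } kind;
  union {
    struct { A_stm stm1, stm2; } compound;
    struct { const char *id; A_exp exp; } assign;
    struct { A_expList exps; } print;
  } u;
};
struct A_exp_ {
  enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
  union {
    const char *id;
    int num;
    struct { A_exp left; A_binop oper; A_exp right; } op;
    struct { A_stm stm; A_exp exp; } eseq;
  } u;
};
struct A_expList_ {
  enum { A_pairExpList, A_lastExpList } kind;
  union { struct { A_exp head; A_expList tail; } pair; A_exp last; } u;
};

/* Toy variable store (an assumption of this sketch): small fixed table. */
enum { MAX_VARS = 16 };
static const char *var_name[MAX_VARS];
static int var_value[MAX_VARS], var_count;

/* Storage cell of variable id; created with value 0 on first use. */
static int *lookup (const char *id)
{
  for (int i = 0; i < var_count; i++)
    if (strcmp (var_name[i], id) == 0) return &var_value[i];
  assert (var_count < MAX_VARS);
  var_name[var_count] = id;
  return &var_value[var_count++];
}

int interp_exp (A_exp e);

/* Stm: executed for its side effects. */
void interp_stm (A_stm s)
{
  switch (s->kind) {
  case A_compoundStm:                 /* left statement, then right */
    interp_stm (s->u.compound.stm1);
    interp_stm (s->u.compound.stm2);
    break;
  case A_assignStm: {                 /* evaluate Exp, then assign */
    int v = interp_exp (s->u.assign.exp);
    *lookup (s->u.assign.id) = v;
    break;
  }
  case A_printStm: {                  /* left to right, space-separated */
    A_expList l = s->u.print.exps;
    for (; l->kind == A_pairExpList; l = l->u.pair.tail)
      printf ("%d ", interp_exp (l->u.pair.head));
    printf ("%d\n", interp_exp (l->u.last));
    break;
  }
  }
}

/* Exp: evaluates to a numeric value. */
int interp_exp (A_exp e)
{
  switch (e->kind) {
  case A_idExp:  return *lookup (e->u.id);
  case A_numExp: return e->u.num;
  case A_opExp: {                     /* left operand first, then right */
    int l = interp_exp (e->u.op.left);
    int r = interp_exp (e->u.op.right);
    switch (e->u.op.oper) {
    case A_plus:  return l + r;
    case A_minus: return l - r;
    case A_times: return l * r;
    case A_div:   assert (r != 0); return l / r;
    }
    break;
  }
  case A_eseqExp:                     /* statement first, then Exp */
    interp_stm (e->u.eseq.stm);
    return interp_exp (e->u.eseq.exp);
  }
  assert (0);                         /* unknown node kind */
  return 0;
}
```

Calling interp_stm on the tree for the example program above should print the line "8 7" followed by the line "80". Storing the id pointer without copying relies on the string contents never changing.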
• The IR tree corresponding to the above example program:
      a := 5+3; b := (print (a, a-1), 10*a); print (b)
  (children are indented below their parent, attribute values are written next to their node)
      CompoundStm
          AssignStm a
              OpExp
                  NumExp 5
                  Plus
                  NumExp 3
          CompoundStm
              AssignStm b
                  EseqExp
                      PrintStm
                          PairExpList
                              IdExp a
                              LastExpList
                                  OpExp
                                      IdExp a
                                      Minus
                                      NumExp 1
                      OpExp
                          NumExp 10
                          Times
                          IdExp a
              PrintStm
                  LastExpList
                      IdExp b
  N.B.
  – This IR tree shows all node attributes (not just the IR subtrees of a node).
• How can we represent these IR trees in C code (i.e., inside our compiler)?
  – Represent each IR tree node by a C struct. A C struct lets us attach attributes (= struct fields) as well as subtrees to a node.
    Rule: For each lefthand side grammar symbol (Stm, Exp, ExpList, Binop), introduce a C struct type.
  – Example:
        Stm     ↦ struct A_stm_
        Exp     ↦ struct A_exp_
        ExpList ↦ struct A_expList_
        Binop   ↦ struct A_binop_
  – We will use pointers to these structs to link tree nodes, thus:
    C code
        typedef struct A_stm_     *A_stm;
        typedef struct A_exp_     *A_exp;
        typedef struct A_expList_ *A_expList;
        typedef struct A_binop_   *A_binop;
• In each of these structs, embed
  1 a kind field to indicate which node type this node actually has (e.g., for Exp (struct A_exp_), kind could be IdExp, NumExp, OpExp, EseqExp),
  2 all attributes and subtree (pointers) for this specific node type.
  Rule: If a node type is described by a single attribute value (e.g., NumExp), embed this value in the struct; if we need to represent more attribute values/subtrees, embed a nested struct that groups this information.
• Example:
  C code
      struct A_exp_ {
        enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
        string id;                       /* A_idExp */
        int num;                         /* A_numExp */
        struct { A_exp left;
                 A_binop oper;
                 A_exp right; } op;      /* A_opExp */
        struct { A_stm stm;
                 A_exp exp; } eseq;      /* A_eseqExp */
      };
• The kind field determines which attribute/subtree information is valid for any given node. All other fields are unused and must not be accessed! For example, accessing the op attributes (left, oper, right) while kind == A_idExp will result in havoc!
• Unused fields? ⇒ Use a C union to save space in each node. We get:
  C code
      struct A_exp_ {
        enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
        union {
          string id;                       /* A_idExp */
          int num;                         /* A_numExp */
          struct { A_exp left;
                   A_binop oper;
                   A_exp right; } op;      /* A_opExp */
          struct { A_stm stm;
                   A_exp exp; } eseq;      /* A_eseqExp */
        } u;
      };
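One way to respect the rule that only the member selected by kind may be accessed is to test kind before touching the union. The sketch below assumes a cut-down A_exp_ with three node types and no operator attribute; exp_num and count_nums are hypothetical helper names, not part of the lecture's code.

```c
#include <assert.h>

/* Cut-down, hypothetical version of A_exp_ (three node types only, the
   operator attribute is omitted) to keep the sketch short. */
typedef struct A_exp_ *A_exp;

struct A_exp_ {
  enum { A_idExp, A_numExp, A_opExp } kind;
  union {
    const char *id;                      /* A_idExp  */
    int num;                             /* A_numExp */
    struct { A_exp left, right; } op;    /* A_opExp  */
  } u;
};

/* Checked accessor: halts via assert instead of silently reading a
   union member that is not valid for this node. */
int exp_num (A_exp e)
{
  assert (e->kind == A_numExp);
  return e->u.num;
}

/* Dispatch on kind; each case touches only the member that belongs to
   it. Returns the number of NumExp nodes in the tree rooted at e. */
int count_nums (A_exp e)
{
  switch (e->kind) {
  case A_idExp:  return 0;
  case A_numExp: return 1;
  case A_opExp:  return count_nums (e->u.op.left)
                      + count_nums (e->u.op.right);
  }
  return 0;   /* not reached for well-formed nodes */
}
```

The switch form is what a tree-walking compiler phase would typically use; the asserting accessor is merely a debugging aid and disappears when compiling with -DNDEBUG.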
slp.h
    #include "util.h"                  /* provides type string */

    typedef struct A_stm_     *A_stm;
    typedef struct A_exp_     *A_exp;
    typedef struct A_expList_ *A_expList;
    typedef struct A_binop_   *A_binop;

    struct A_stm_ {
      enum { A_compoundStm, A_assignStm, A_printStm } kind;
      union {
        struct { A_stm stm1, stm2; } compound;
        struct { string id; A_exp exp; } assign;
        struct { A_expList exps; } print;
      } u;
    };

    struct A_exp_ {
      enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
      union {
        string id;
        int num;
        struct { A_exp left; A_binop oper; A_exp right; } op;
        struct { A_stm stm; A_exp exp; } eseq;
      } u;
    };

    struct A_expList_ {
      enum { A_pairExpList, A_lastExpList } kind;
      union {
        struct { A_exp head; A_expList tail; } pair;
        A_exp last;
      } u;
    };

    struct A_binop_ {
      enum { A_plus, A_minus, A_times, A_div } kind;
    };
• As IR tree nodes “live on the heap”, they need to be allocated via malloc() and initialized appropriately:
  – Example: create an A_opExp node with subtrees A_exp e1 and e2 and A_binop op:
    C code
        A_exp n;

        n = malloc (sizeof (*n));
        if (!n) { ... handle memory allocation failure ... }

        n->kind = A_opExp;
        n->u.op.left = e1;
        n->u.op.oper = op;
        n->u.op.right = e2;
• Such node creation routines will be needed over and over throughout the compiler. ⇒ Provide node constructors to allocate and initialize IR tree nodes.
  Rule: never call malloc() outside these constructors.
• Example: node constructors for node types A_CompoundStm and A_IdExp.
  C code
      A_stm A_CompoundStm (A_stm stm1, A_stm stm2)
      {
        A_stm s = checked_malloc (sizeof (*s));

        s->kind = A_compoundStm;
        s->u.compound.stm1 = stm1;
        s->u.compound.stm2 = stm2;

        return s;
      }
  C code
      A_exp A_IdExp (string id)   /* typedef char *string */
      {
        A_exp e = checked_malloc (sizeof (*e));

        e->kind = A_idExp;
        e->u.id = id;

        return e;
      }
• To actually construct larger IR trees, we can now simply plug the constructors together and build trees bottom-up:
  1 Use constructors to build the leaf nodes,
  2 use the results of these calls as arguments to constructors for inner tree nodes.
  – Example: build the IR tree corresponding to the straight-line program a := 42; print (a)
        CompoundStm
            AssignStm a
                NumExp 42
            PrintStm
                LastExpList
                    IdExp a
    C code
        A_stm p = A_CompoundStm (A_AssignStm ("a", A_NumExp (42)),
                                 A_PrintStm (A_LastExpList (A_IdExp ("a"))));
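The example call uses the constructors A_AssignStm, A_NumExp, A_PrintStm, and A_LastExpList, which are not spelled out in the text. A minimal sketch of how they might look, following the pattern of A_CompoundStm and A_IdExp: the node definitions are reduced to what this example needs, and the local checked_malloc is a stand-in for the util.h function (its failure handling is an assumption).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

typedef char *string;
typedef struct A_stm_ *A_stm;
typedef struct A_exp_ *A_exp;
typedef struct A_expList_ *A_expList;

/* Reduced node definitions: only the node types used in the example. */
struct A_stm_ {
  enum { A_compoundStm, A_assignStm, A_printStm } kind;
  union {
    struct { A_stm stm1, stm2; } compound;
    struct { string id; A_exp exp; } assign;
    struct { A_expList exps; } print;
  } u;
};
struct A_exp_ {
  enum { A_idExp, A_numExp } kind;
  union { string id; int num; } u;
};
struct A_expList_ {
  enum { A_pairExpList, A_lastExpList } kind;
  union { struct { A_exp head; A_expList tail; } pair; A_exp last; } u;
};

/* Stand-in for checked_malloc() from util.h (assumed: exit on failure). */
static void *checked_malloc (int len)
{
  void *p = malloc (len);
  if (!p) { fprintf (stderr, "out of memory\n"); exit (EXIT_FAILURE); }
  return p;
}

A_stm A_CompoundStm (A_stm stm1, A_stm stm2)
{
  A_stm s = checked_malloc (sizeof (*s));
  s->kind = A_compoundStm;
  s->u.compound.stm1 = stm1;
  s->u.compound.stm2 = stm2;
  return s;
}

A_stm A_AssignStm (string id, A_exp exp)
{
  A_stm s = checked_malloc (sizeof (*s));
  s->kind = A_assignStm;
  s->u.assign.id = id;
  s->u.assign.exp = exp;
  return s;
}

A_stm A_PrintStm (A_expList exps)
{
  A_stm s = checked_malloc (sizeof (*s));
  s->kind = A_printStm;
  s->u.print.exps = exps;
  return s;
}

A_exp A_IdExp (string id)
{
  A_exp e = checked_malloc (sizeof (*e));
  e->kind = A_idExp;
  e->u.id = id;
  return e;
}

A_exp A_NumExp (int num)
{
  A_exp e = checked_malloc (sizeof (*e));
  e->kind = A_numExp;
  e->u.num = num;
  return e;
}

A_expList A_LastExpList (A_exp last)
{
  A_expList l = checked_malloc (sizeof (*l));
  l->kind = A_lastExpList;
  l->u.last = last;
  return l;
}
```

With these definitions in place, the nested constructor call for a := 42; print (a) compiles, and the resulting tree can be inspected through p->kind and the members of p->u.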
1.2.1 Summary of IR Tree Representation Rules
1 Valid IR trees are described by a grammar.
2 Each lefthand side grammar symbol E is translated into a corresponding struct definition:
      E ↦ struct X_E_ { ... }
3 The struct X_E_ itself is never used anywhere else; instead, declare X_E (pointer to struct):
      typedef struct X_E_ *X_E;
4 Each struct X_E_ contains a kind enum, which contains an enumeration constant for each grammar rule with lefthand side E, and a union u to carry the specific attributes/subtrees:
      struct X_E_ { enum { ... } kind; union { ... } u; };
5 In union u, collect the information represented on the righthand side for each grammar rule for E. If several attributes/subtrees need to be represented, embed a struct carrying this information (e.g., compound in A_stm_).
6 If a single value describes the righthand side of a grammar rule for E, embed this value directly (e.g., num in A_exp_).
7 Each IR node type X_E will have a constructor that initializes all struct fields; malloc() is never called outside these constructors.
8 Each C file (compiler phase or module) will have a prefix X_ unique to that file.
9 Naming/capitalization:
      Exp (IdExp) ↦ struct X_exp_ { enum { X_idExp } kind; ... };
                    typedef struct X_exp_ *X_exp;
                    X_exp X_IdExp (...) { ... }
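As a worked illustration, the rules can be applied to the small grammar from the beginning of Section 1.2 (E → E Op E, E → num, Op → +, Op → *). This is a sketch, not code from the lecture: the prefix G_ is a hypothetical choice, NumExp is given an int attribute as suggested in the N.B. there, and the local checked_malloc stands in for the util.h function.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Rules 2 and 3: one struct per lefthand side symbol, always used
   through its pointer typedef. */
typedef struct G_e_  *G_e;
typedef struct G_op_ *G_op;

/* Rule 4: one enumeration constant per grammar rule. Op needs no
   union because its rules carry neither attributes nor subtrees. */
struct G_op_ {
  enum { G_plus, G_times } kind;
};

struct G_e_ {
  enum { G_opExp, G_numExp } kind;
  union {
    struct { G_e left; G_op oper; G_e right; } op;   /* rule 5 */
    int num;                                         /* rule 6 */
  } u;
};

/* Stand-in for checked_malloc() from util.h (assumed: exit on failure). */
static void *checked_malloc (int len)
{
  void *p = malloc (len);
  if (!p) { fprintf (stderr, "out of memory\n"); exit (EXIT_FAILURE); }
  return p;
}

/* Rules 7 and 9: one capitalized constructor per node type. */
G_op G_Plus (void)
{
  G_op o = checked_malloc (sizeof (*o));
  o->kind = G_plus;
  return o;
}

G_op G_Times (void)
{
  G_op o = checked_malloc (sizeof (*o));
  o->kind = G_times;
  return o;
}

G_e G_NumExp (int num)
{
  G_e e = checked_malloc (sizeof (*e));
  e->kind = G_numExp;
  e->u.num = num;
  return e;
}

G_e G_OpExp (G_e left, G_op oper, G_e right)
{
  G_e e = checked_malloc (sizeof (*e));
  e->kind = G_opExp;
  e->u.op.left = left;
  e->u.op.oper = oper;
  e->u.op.right = right;
  return e;
}
```

A tree for 1 + 2 would then be built as G_OpExp (G_NumExp (1), G_Plus (), G_NumExp (2)).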
• Variations of these general IR representation rules:
  1 Use a single struct definition to represent all IR node types uniformly.
    C code
        typedef struct A_node_ *A_node;

        struct A_node_ {
          enum { A_compoundStm, A_assignStm, A_printStm,
                 A_idExp, A_numExp, A_opExp, A_eseqExp,
                 A_pairExpList, A_lastExpList,
                 A_plus, A_minus, A_times, A_div } kind;
          union {
            struct { A_node stm1, stm2; } compound;
            ...
            struct { A_node left; A_node oper; A_node right; } op;
            ...
          } u;
        };
    – Bad idea, because the C compiler loses the ability to check that we do not build “nonsense” IR trees (everything is a generic A_node and may occur anywhere). Example:
      C code (buggy)
          A_node n = A_OpExp (A_IdExp ("a"),
                              A_AssignStm ("b", A_NumExp (42)),
                              A_PrintStm (A_LastExpList (A_NumExp (0))));
      Here an AssignStm node takes the place of the operator and a PrintStm node that of the right operand, yet the code compiles.
      N.B.
      – In a real compiler, we would write code to build complex IR trees, and bugs in that code might not be obvious to us at all.
  2 Consider the A_binop constructors:
    C code
        A_binop A_Plus (void)
        {
          A_binop op = checked_malloc (sizeof (*op));
          op->kind = A_plus;
          return op;
        }
    – The A_binop nodes encapsulate a single enum value kind only. This is uniform but unnecessarily complex and wastes space.
    – A_binop nodes only occur inside A_exp (of kind A_opExp) nodes. ⇒ Encode the operator inside A_opExp directly (using an enum).
    C code (modified A_exp node)
        typedef enum { A_plus, A_minus, A_times, A_div } A_binop;

        struct A_exp_ {
          enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
          union {
            string id;                       /* A_idExp */
            int num;                         /* A_numExp */
            struct { A_exp left;
                     A_binop oper;
                     A_exp right; } op;      /* A_opExp */
            struct { A_stm stm;
                     A_exp exp; } eseq;      /* A_eseqExp */
          } u;
        };
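With the operator encoded as an enum, the A_OpExp constructor takes the operator by value and no separate A_binop node is allocated. The sketch below assumes the union layout used in slp.h and leaves out IdExp and EseqExp for brevity; the local checked_malloc is a stand-in for the util.h function.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct A_exp_ *A_exp;
typedef enum { A_plus, A_minus, A_times, A_div } A_binop;

/* Reduced A_exp_ (IdExp and EseqExp left out to keep the sketch short). */
struct A_exp_ {
  enum { A_numExp, A_opExp } kind;
  union {
    int num;                                           /* A_numExp */
    struct { A_exp left; A_binop oper; A_exp right; } op;  /* A_opExp */
  } u;
};

/* Stand-in for checked_malloc() from util.h (assumed: exit on failure). */
static void *checked_malloc (int len)
{
  void *p = malloc (len);
  if (!p) { fprintf (stderr, "out of memory\n"); exit (EXIT_FAILURE); }
  return p;
}

A_exp A_NumExp (int num)
{
  A_exp e = checked_malloc (sizeof (*e));
  e->kind = A_numExp;
  e->u.num = num;
  return e;
}

/* The operator is a plain enum value stored inside the A_opExp node. */
A_exp A_OpExp (A_exp left, A_binop oper, A_exp right)
{
  A_exp e = checked_malloc (sizeof (*e));
  e->kind = A_opExp;
  e->u.op.left = left;
  e->u.op.oper = oper;
  e->u.op.right = right;
  return e;
}
```

The expression 5+3 is then built as A_OpExp (A_NumExp (5), A_plus, A_NumExp (3)), with one heap allocation fewer than in the A_binop node variant.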
1.3 C Coding Guidelines for the Tiger Compiler Project
• The Tiger compiler will be a rather complex piece of software. We strongly suggest that you follow the guidelines below when you build C source code for the compiler.
  1 Each phase of the compiler belongs in its own .c source file (which #includes an associated .h header file containing exported function prototypes and type declarations). [Separate compilation, handling, reusability]
  2 Each phase shall have an identifier prefix X_ unique to this phase. All global names (struct/union fields are not global) shall start with the prefix. [Organize the otherwise flat C namespace (avoid clashes), clarify origin of name]
  3 All functions shall have prototypes and the C compiler shall be told to warn about uses of functions without prototypes (gcc: -Wmissing-prototypes). [In C, functions without prototypes default to returning int and accepting int arguments (e.g., pointers and characters may be implicitly converted to int)]
  4 Each phase includes util.h and the compiler is linked against util.o.
    util.h
        #include <assert.h>

        typedef char *string;
        typedef char bool;

        #define TRUE 1
        #define FALSE 0

        void *checked_malloc(int);
        string String(char *);
    – assert: halt program if asserted expression yields 0. Example:
      C code
          A_exp e;
          e = malloc (sizeof (*e));
          assert (e);
      Should malloc() fail:
          tiger: phase.c:42: foo: Assertion ‘e’ failed.
          Aborted.
      To disable all assertion checks, compile with -DNDEBUG.
    – bool: simulate boolean type in C, use type bool if a variable/function actually deals with truth values.
    – checked_malloc(n) allocates n bytes and returns a pointer into the heap. Halts the program if allocation fails. ⇒ If checked_malloc() returns, the returned pointer is valid.
  5 Values of type string are heap-allocated strings. Constructor String("foo") allocates four bytes and copies the argument string. Convention: a function that receives a string argument may assume the string contents never change ⇒ it is safe to store the associated character pointer, there is no need to copy the string.
  6 Never call malloc() directly, always use checked_malloc(). [We may later re-implement checked_malloc() to, e.g., use a GC library.]
  7 Never call free() to release heap-allocated memory.
    – Correct usage of free() can be tricky: avoid space leaks (call free() early enough), avoid corruption/overwrites (call free() not too early).
    – Good practice: p = 0 if you plan to never access *p anymore. [Again, a GC library could make the compiler production-strength nevertheless.]
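The two functions declared in util.h are not defined in the text. A minimal sketch of a matching util.c, assuming that "halts program" means printing a message and exiting; the message text and exit status are assumptions, not taken from the lecture.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef char *string;

/* Allocate len bytes. Never returns NULL: if the allocation fails,
   the program is halted instead. */
void *checked_malloc (int len)
{
  void *p = malloc (len);
  if (!p) {
    fprintf (stderr, "Ran out of memory!\n");
    exit (EXIT_FAILURE);
  }
  return p;
}

/* Copy s, including the terminating '\0', into fresh heap memory.
   For "foo" this allocates strlen ("foo") + 1 = 4 bytes. */
string String (char *s)
{
  string p = checked_malloc ((int) strlen (s) + 1);
  strcpy (p, s);
  return p;
}
```

Because String() copies its argument, the caller may later overwrite or release the original buffer without affecting the stored string, which is what makes the "string contents never change" convention workable.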