Compiler Construction

© 2002/03 T. Grust · Compiler Construction: 1. Introduction
1 Introduction (1. Introduction, p. 3)

1.1 Compiler Phases

• Due to the complexity of the compilation task, a compiler typically proceeds in a sequence of compilation phases.
  – In the Tiger book, as in this lecture, each chapter is devoted to one compilation phase.
  – The phases communicate with each other via clearly defined interfaces.
  – Interface: a data structure (e.g., a tree) plus a set of exported functions.
  – Each phase except the first operates on an abstract intermediate representation of the source program, not on the source program text itself.

• Breaking the compiler into many phases enables reuse (of phase implementations).
  – Example: If we need to adapt our compiler to translate a source language other than Tiger, we only need to rewrite the early phases (Lex → Translate, see below). All phases following Translate remain untouched (after Translate, all specifics of the source language have been “abstracted away”).

1.1.1 A Brief Overview of the Tiger Compiler Phases

• Let us trace the compilation of the Tiger program below and see how the different phases transform the initial source program (later on, we pick only certain fragments of the source program to keep the exposition short).

rem.tig

    /* compute the remainder when dividing x by y */
    let
      function rem (x : int, y : int) : int =
        let var d := x / y
        in x - d * y
        end

      var r := 0
    in
      r := rem (10, 3)
    end

1.2 Intermediate Representations (Tree Languages)

• We have seen that the compiler phases pass different intermediate representations (IR) to each other.
  – Often, these IRs take the form of a tree.
  – In these trees, each node assumes one of many node types.
  – Each node type carries a number of attributes to store details about the program (fragment) represented by that node.
• Example: Recall from our review of the compiler phases:
  – Nodes of type CallExp (function call) carry as attributes:
    · function name (a string),
    · argument list (an IR tree rooted in an ExpList node).
  – Nodes of type IntExp (integer constant) carry as attribute:
    · constant value (an integer).

• How do we precisely describe the valid IR (tree) forms?
  – Use grammars (grammars and trees are equivalent).
  – Grammar: a set of rules of the form

        L → R    (T)

    · T denotes one possible tree node type in our IR.
    · The righthand side R indicates what the subtree below a node of type T may look like (L may occur in R; one grammar may have several L → ... rules).
  – Example grammar:

        E  → E Op E    (OpExp)
        E  → num       (NumExp)
        Op → +         (Plus)
        Op → *         (Times)

A conforming tree:

    OpExp
    ├─ NumExp
    │  └─ num
    ├─ Plus
    └─ NumExp
       └─ num

N.B.
  – Some node types are marked as leaves (typewriter font on the original slides).
  – For the num node it might be sensible to add an attribute holding the actual numeric value represented by that node.

• Example: Grammar that describes the valid IR trees for a simple straight-line programming language (no loops, no gotos):

    Stm     → Stm ; Stm              (CompoundStm)
    Stm     → id := Exp              (AssignStm)
    Stm     → print ( ExpList )      (PrintStm)
    Exp     → id                     (IdExp)
    Exp     → num                    (NumExp)
    Exp     → Exp Binop Exp          (OpExp)
    Exp     → ( Stm , Exp )          (EseqExp)
    ExpList → Exp , ExpList          (PairExpList)
    ExpList → Exp                    (LastExpList)
    Binop   → +                      (Plus)
    Binop   → -                      (Minus)
    Binop   → *                      (Times)
    Binop   → /                      (Div)

• A valid program written in the straight-line programming language (provided that 1, 3, 5, 10 and a, b are acceptable for num and id, respectively):

    a := 5+3; b := (print (a, a-1), 10*a); print (b)

• Informal semantics of the straight-line programming language:

  – Stm (statement): may have side effects (variable assignment, I/O).
    · Stm ; Stm            Execute the left statement, then execute the right statement.
    · id := Exp            Evaluate Exp, then assign the numeric value to variable id.
    · print ( ExpList )    Evaluate all expressions in the list (left to right), then print the resulting numeric values separated by spaces, terminated by a newline.
  – Exp (expression): evaluates to a numeric value.
    · id                   Evaluates to the current value of variable id.
    · num                  Evaluates to the value of the numeric constant.
    · Exp Binop Exp        Evaluate the left expression, then evaluate the right expression, then apply the binary operator.
    · ( Stm , Exp )        Execute the statement, then evaluate Exp; the value of Exp is the value of the whole expression.

• The IR tree corresponding to the above example program:

    a := 5+3; b := (print (a, a-1), 10*a); print (b)

    CompoundStm
    ├─ AssignStm
    │  ├─ a
    │  └─ OpExp
    │     ├─ NumExp 5
    │     ├─ Plus
    │     └─ NumExp 3
    └─ CompoundStm
       ├─ AssignStm
       │  ├─ b
       │  └─ EseqExp
       │     ├─ PrintStm
       │     │  └─ PairExpList
       │     │     ├─ IdExp a
       │     │     └─ LastExpList
       │     │        └─ OpExp
       │     │           ├─ IdExp a
       │     │           ├─ Minus
       │     │           └─ NumExp 1
       │     └─ OpExp
       │        ├─ NumExp 10
       │        ├─ Times
       │        └─ IdExp a
       └─ PrintStm
          └─ LastExpList
             └─ IdExp b

N.B.
  – This IR tree shows all node attributes (not just the IR subtrees of a node).

• How can we represent these IR trees in C code (i.e., inside our compiler)?
  – Represent each IR tree node by a C struct. A C struct lets us attach attributes (= struct fields) as well as subtrees to a node.
    Rule: For each lefthand side grammar symbol (Stm, Exp, ExpList, Binop), introduce a C struct type.
  – Example:

        Stm     ↦ struct A_stm_
        Exp     ↦ struct A_exp_
        ExpList ↦ struct A_expList_
        Binop   ↦ struct A_binop_

  – We will use pointers to these structs to link tree nodes, thus:

C code

    typedef struct A_stm_     *A_stm;
    typedef struct A_exp_     *A_exp;
    typedef struct A_expList_ *A_expList;
    typedef struct A_binop_   *A_binop;

• In each of these structs, embed
  1. a kind field to indicate which node type this node actually has (e.g., for Exp (struct A_exp_), kind could be IdExp, NumExp, OpExp, or EseqExp),
  2. all attributes and subtree pointers for this specific node type.
  Rule: If a node type is described by a single attribute value (e.g., NumExp), embed this value in the struct; if we need to represent more attribute values/subtrees, embed a nested struct that groups this information.
• Example:

C code

    struct A_exp_ {
        enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
        string id;                 /* A_idExp */
        int num;                   /* A_numExp */
        struct { A_exp left;
                 A_binop oper;
                 A_exp right;
               } op;               /* A_opExp */
        struct { A_stm stm;
                 A_exp exp;
               } eseq;             /* A_eseqExp */
    };

• The kind field determines which attribute/subtree information is valid for any given node. All other fields are unused and must not be accessed! (For example, accessing the op attributes (left, oper, right) while kind == A_idExp will result in havoc.)
• Unused fields? ⇒ Use a C union to save space in each node. We get:

C code

    struct A_exp_ {
        enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
        union {
            string id;                 /* A_idExp */
            int num;                   /* A_numExp */
            struct { A_exp left;
                     A_binop oper;
                     A_exp right;
                   } op;               /* A_opExp */
            struct { A_stm stm;
                     A_exp exp;
                   } eseq;             /* A_eseqExp */
        } u;
    };
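One way to guard against the havoc described above is to route every access to the union through a small accessor that checks kind first. The following is a minimal sketch under stated assumptions, not part of the original slp.h: the accessor names A_expNum and A_expId are hypothetical, and the struct definition is repeated only so that the fragment compiles on its own.

```c
#include <assert.h>

typedef char *string;
typedef struct A_stm_   *A_stm;    /* only used as pointer types here */
typedef struct A_exp_   *A_exp;
typedef struct A_binop_ *A_binop;

struct A_exp_ {
    enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
    union {
        string id;                                            /* A_idExp   */
        int num;                                              /* A_numExp  */
        struct { A_exp left; A_binop oper; A_exp right; } op; /* A_opExp   */
        struct { A_stm stm; A_exp exp; } eseq;                /* A_eseqExp */
    } u;
};

/* Hypothetical accessor: halts (via assert) unless e is a NumExp node. */
int A_expNum(A_exp e)
{
    assert(e && e->kind == A_numExp);
    return e->u.num;
}

/* Hypothetical accessor: halts (via assert) unless e is an IdExp node. */
string A_expId(A_exp e)
{
    assert(e && e->kind == A_idExp);
    return e->u.id;
}
```

Calling A_expNum on a node whose kind is A_idExp then stops the program at the faulty access instead of silently reinterpreting the pointer bits as an integer. The checks disappear when compiling with -DNDEBUG.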

slp.h

    typedef struct A_stm_     *A_stm;
    typedef struct A_exp_     *A_exp;
    typedef struct A_expList_ *A_expList;
    typedef struct A_binop_   *A_binop;

    struct A_stm_ {
        enum { A_compoundStm, A_assignStm, A_printStm } kind;
        union {
            struct { A_stm stm1, stm2; } compound;
            struct { string id; A_exp exp; } assign;
            struct { A_expList exps; } print;
        } u;
    };

    struct A_exp_ {
        enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
        union {
            string id;
            int num;
            struct { A_exp left; A_binop oper; A_exp right; } op;
            struct { A_stm stm; A_exp exp; } eseq;
        } u;
    };

    struct A_expList_ {
        enum { A_pairExpList, A_lastExpList } kind;
        union {
            struct { A_exp head; A_expList tail; } pair;
            A_exp last;
        } u;
    };

    struct A_binop_ {
        enum { A_plus, A_minus, A_times, A_div } kind;
    };

• As IR tree nodes “live on the heap”, they need to be allocated via malloc() and initialized appropriately:
  – Example: create an A_opExp node with subtrees A_exp e1 and e2 and A_binop op:

C code

    A_exp n;

    n = malloc (sizeof (*n));
    if (!n) {
        /* ... handle memory allocation failure ... */
    }

    n->kind = A_opExp;
    n->u.op.left = e1;
    n->u.op.oper = op;
    n->u.op.right = e2;

• Such node creation routines will be needed over and over throughout the compiler. ⇒ Provide node constructors to allocate and initialize IR tree nodes. Rule: never call malloc() outside these constructors.

• Example: node constructors for node types CompoundStm and IdExp.

C code

    A_stm A_CompoundStm (A_stm stm1, A_stm stm2)
    {
        A_stm s = checked_malloc (sizeof (*s));

        s->kind = A_compoundStm;
        s->u.compound.stm1 = stm1;
        s->u.compound.stm2 = stm2;

        return s;
    }

C code

    A_exp A_IdExp (string id)   /* typedef char *string */
    {
        A_exp e = checked_malloc (sizeof (*e));

        e->kind = A_idExp;
        e->u.id = id;

        return e;
    }
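The remaining constructors follow the same allocate, tag, fill pattern. The sketch below shows how a few of them might look; it is an illustration, not the original slp.c. The node types are reduced to the variants needed here, and the body of checked_malloc is an assumption (the slides only state that it halts the program if allocation fails).

```c
#include <stdio.h>
#include <stdlib.h>

typedef char *string;
typedef struct A_stm_ *A_stm;
typedef struct A_exp_ *A_exp;
typedef struct A_expList_ *A_expList;

/* Reduced node types: only the variants used below are included. */
struct A_stm_ {
    enum { A_compoundStm, A_assignStm, A_printStm } kind;
    union {
        struct { A_stm stm1, stm2; } compound;
        struct { string id; A_exp exp; } assign;
        struct { A_expList exps; } print;
    } u;
};
struct A_exp_ {
    enum { A_idExp, A_numExp } kind;
    union { string id; int num; } u;
};
struct A_expList_ {
    enum { A_pairExpList, A_lastExpList } kind;
    union {
        struct { A_exp head; A_expList tail; } pair;
        A_exp last;
    } u;
};

/* Assumed implementation: exit if the allocation fails. */
void *checked_malloc(int len)
{
    void *p = malloc((size_t) len);
    if (!p) { fprintf(stderr, "out of memory\n"); exit(EXIT_FAILURE); }
    return p;
}

A_stm A_AssignStm(string id, A_exp exp)
{
    A_stm s = checked_malloc(sizeof(*s));
    s->kind = A_assignStm;
    s->u.assign.id = id;
    s->u.assign.exp = exp;
    return s;
}

A_stm A_PrintStm(A_expList exps)
{
    A_stm s = checked_malloc(sizeof(*s));
    s->kind = A_printStm;
    s->u.print.exps = exps;
    return s;
}

A_exp A_NumExp(int num)
{
    A_exp e = checked_malloc(sizeof(*e));
    e->kind = A_numExp;
    e->u.num = num;
    return e;
}

A_expList A_PairExpList(A_exp head, A_expList tail)
{
    A_expList l = checked_malloc(sizeof(*l));
    l->kind = A_pairExpList;
    l->u.pair.head = head;
    l->u.pair.tail = tail;
    return l;
}

A_expList A_LastExpList(A_exp last)
{
    A_expList l = checked_malloc(sizeof(*l));
    l->kind = A_lastExpList;
    l->u.last = last;
    return l;
}
```

Every constructor sets kind together with exactly the union member that belongs to it, so callers never see a node whose tag and payload disagree.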


• To actually construct larger IR trees, we can now simply plug the constructors together and build trees bottom-up:
  1. Use constructors to build the leaf nodes,
  2. use the results of these calls as arguments to constructors for inner tree nodes.
  – Example: build the IR tree corresponding to the straight-line program a := 42; print (a)

    CompoundStm
    ├─ AssignStm
    │  ├─ a
    │  └─ NumExp 42
    └─ PrintStm
       └─ LastExpList
          └─ IdExp a

C code

    A_stm p = A_CompoundStm (A_AssignStm ("a", A_NumExp (42)),
                             A_PrintStm (A_LastExpList (A_IdExp ("a"))));
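Once such a tree exists, a later phase works by switching on kind and recursing into the subtrees. As an illustration, the informal semantics of the straight-line language given earlier can be turned into a small interpreter. This is a sketch under stated assumptions: the variable store (lookupVar, updateVar, a fixed-size table) and the names execStm and evalExp are hypothetical, and the operator is stored as a plain enum rather than as an A_binop node to keep the fragment short.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef char *string;
typedef struct A_stm_ *A_stm;
typedef struct A_exp_ *A_exp;
typedef struct A_expList_ *A_expList;
typedef enum { A_plus, A_minus, A_times, A_div } A_binop;

struct A_stm_ {
    enum { A_compoundStm, A_assignStm, A_printStm } kind;
    union {
        struct { A_stm stm1, stm2; } compound;
        struct { string id; A_exp exp; } assign;
        struct { A_expList exps; } print;
    } u;
};
struct A_exp_ {
    enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
    union {
        string id; int num;
        struct { A_exp left; A_binop oper; A_exp right; } op;
        struct { A_stm stm; A_exp exp; } eseq;
    } u;
};
struct A_expList_ {
    enum { A_pairExpList, A_lastExpList } kind;
    union { struct { A_exp head; A_expList tail; } pair; A_exp last; } u;
};

/* Hypothetical variable store: a small fixed-size table. */
enum { MAX_VARS = 32, MAX_ARGS = 32 };
static string names[MAX_VARS]; static int values[MAX_VARS]; static int nvars;

int lookupVar(string id)
{
    for (int i = 0; i < nvars; i++)
        if (strcmp(names[i], id) == 0) return values[i];
    assert(!"variable read before assignment");
    return 0;
}

void updateVar(string id, int v)
{
    int i = 0;
    while (i < nvars && strcmp(names[i], id) != 0) i++;
    if (i == nvars) { assert(nvars < MAX_VARS); names[nvars++] = id; }
    values[i] = v;
}

int evalExp(A_exp e);

void execStm(A_stm s)
{
    int vals[MAX_ARGS], n = 0;
    A_expList l;
    switch (s->kind) {
    case A_compoundStm:                 /* left statement, then right */
        execStm(s->u.compound.stm1);
        execStm(s->u.compound.stm2);
        break;
    case A_assignStm:
        updateVar(s->u.assign.id, evalExp(s->u.assign.exp));
        break;
    case A_printStm:                    /* evaluate all, then print */
        for (l = s->u.print.exps; l->kind == A_pairExpList; l = l->u.pair.tail) {
            assert(n < MAX_ARGS - 1);
            vals[n++] = evalExp(l->u.pair.head);
        }
        vals[n++] = evalExp(l->u.last);
        for (int i = 0; i < n; i++) printf(i ? " %d" : "%d", vals[i]);
        printf("\n");
        break;
    }
}

int evalExp(A_exp e)
{
    int l, r;
    switch (e->kind) {
    case A_idExp:  return lookupVar(e->u.id);
    case A_numExp: return e->u.num;
    case A_opExp:                       /* left operand first, then right */
        l = evalExp(e->u.op.left);
        r = evalExp(e->u.op.right);
        switch (e->u.op.oper) {
        case A_plus:  return l + r;
        case A_minus: return l - r;
        case A_times: return l * r;
        case A_div:   assert(r != 0); return l / r;
        }
        break;
    case A_eseqExp:                     /* statement first, for its effect */
        execStm(e->u.eseq.stm);
        return evalExp(e->u.eseq.exp);
    }
    assert(!"unknown node kind");
    return 0;
}
```

Running execStm on the tree for a := 5+3; b := (print (a, a-1), 10*a); print (b) would print the line 8 7 followed by the line 80 under these semantics.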


1.2.1 Summary of IR Tree Representation Rules

1. Valid IR trees are described by a grammar.

2. Each lefthand side grammar symbol E is translated into a corresponding struct definition:

       E ↦ struct X_E_ { ... }

3. The struct X_E_ itself is never used anywhere else; instead declare X_E (pointer to struct):

       typedef struct X_E_ *X_E;

4. Each struct X_E_ contains a kind enum, which contains an enumeration constant for each grammar rule with lefthand side E, and a union u to carry the specific attributes/subtrees:

       struct X_E_ { enum { ... } kind; union { ... } u; };

5. In union u, collect the information represented on the righthand side of each grammar rule for E. If several attributes/subtrees need to be represented, embed a struct carrying this information (e.g., compound in A_stm_).

6. If a single value describes the righthand side of a grammar rule for E, embed this value directly (e.g., num in A_exp_).

7. Each IR node type X_E will have a constructor that initializes all struct fields; malloc() is never called outside these constructors.

8. Each C file (compiler phase or module) will have a prefix X_ unique to that file.

9. Naming/capitalization: Exp (IdExp) ↦

       struct X_exp_ { enum { X_idExp } kind; ... };
       typedef struct X_exp_ *X_exp;
       X_exp X_IdExp (...) { ... }

• Variations of these general IR representation rules:

1. Use a single struct definition to represent all IR node types uniformly.

C code

    typedef struct A_node_ *A_node;

    struct A_node_ {
        enum { A_compoundStm, A_assignStm, A_printStm,
               A_idExp, A_numExp, A_opExp, A_eseqExp,
               A_pairExpList, A_lastExpList,
               A_plus, A_minus, A_times, A_div } kind;
        union {
            struct { A_node stm1, stm2; } compound;
            ...
            struct { A_node left;
                     A_node oper;
                     A_node right;
                   } op;
            ...
        } u;
    };

  – Bad idea, because the C compiler loses the ability to check that we do not build “nonsense” IR trees (everything is a generic A_node and may occur anywhere). Example:

C code (buggy)

    A_node n = A_OpExp (A_IdExp ("a"),
                        A_AssignStm ("b", A_NumExp (42)),
                        A_PrintStm (A_LastExpList (A_NumExp (0))));

N.B.
  – In a real compiler, we would write code to build complex IR trees, and bugs in that code might not be obvious to us at all.
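To see what is lost with the uniform representation, consider what it would take to reject such a tree by hand: every constructor has to classify its arguments at run time. The following sketch illustrates this under stated assumptions. The predicates A_isExp and A_isBinop are hypothetical, the node struct is reduced to a few kinds, and checked_malloc is given an assumed body.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct A_node_ *A_node;

/* Reduced generic node: only a few of the kinds are included. */
struct A_node_ {
    enum { A_assignStm, A_idExp, A_numExp, A_opExp, A_plus, A_times } kind;
    union {
        int num;                                   /* A_numExp */
        struct { A_node left, oper, right; } op;   /* A_opExp  */
    } u;
};

/* Hypothetical run-time classification of generic nodes. */
int A_isExp(A_node n)
{
    return n->kind == A_idExp || n->kind == A_numExp || n->kind == A_opExp;
}

int A_isBinop(A_node n)
{
    return n->kind == A_plus || n->kind == A_times;
}

/* Assumed implementation: exit if the allocation fails. */
static void *checked_malloc(int len)
{
    void *p = malloc((size_t) len);
    if (!p) { fprintf(stderr, "out of memory\n"); exit(EXIT_FAILURE); }
    return p;
}

A_node A_NumExp(int num)
{
    A_node n = checked_malloc(sizeof(*n));
    n->kind = A_numExp;
    n->u.num = num;
    return n;
}

A_node A_Plus(void)
{
    A_node n = checked_malloc(sizeof(*n));
    n->kind = A_plus;
    return n;
}

A_node A_OpExp(A_node left, A_node oper, A_node right)
{
    A_node n;
    /* With one struct per grammar symbol, the C compiler performs
       these checks for us at compile time. */
    assert(A_isExp(left) && A_isBinop(oper) && A_isExp(right));
    n = checked_malloc(sizeof(*n));
    n->kind = A_opExp;
    n->u.op.left = left;
    n->u.op.oper = oper;
    n->u.op.right = right;
    return n;
}
```

With these checks in place, a malformed call like the buggy one above is detected only when it executes, whereas the typed representation rejects it when the compiler is built.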


2. Consider the A_binop constructors:

C code

    A_binop A_Plus ()
    {
        A_binop op = checked_malloc (sizeof (*op));
        op->kind = A_plus;
        return op;
    }

  – The A_binop nodes encapsulate only a single enum value, kind. This is uniform but unnecessarily complex and wastes space.
  – A_binop nodes only occur inside A_exp nodes (of kind A_opExp). ⇒ Encode the operator inside A_opExp directly (using an enum).

C code (modified A_exp node)

    typedef enum { A_plus, A_minus, A_times, A_div } A_binop;

    struct A_exp_ {
        enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
        string id;                 /* A_idExp */
        int num;                   /* A_numExp */
        struct { A_exp left;
                 A_binop oper;
                 A_exp right;
               } op;               /* A_opExp */
        struct { A_stm stm;
                 A_exp exp;
               } eseq;             /* A_eseqExp */
    };
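With the operator stored as an enum value, an OpExp node needs no separate heap allocation for its operator. A constructor for this variation might look as follows; this is a sketch in which the A_exp_ struct is reduced to two variants (wrapped in a union, as in slp.h) and checked_malloc is given an assumed body.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct A_exp_ *A_exp;
typedef enum { A_plus, A_minus, A_times, A_div } A_binop;

struct A_exp_ {
    enum { A_numExp, A_opExp } kind;   /* reduced to two variants */
    union {
        int num;                                               /* A_numExp */
        struct { A_exp left; A_binop oper; A_exp right; } op;  /* A_opExp  */
    } u;
};

/* Assumed implementation: exit if the allocation fails. */
void *checked_malloc(int len)
{
    void *p = malloc((size_t) len);
    if (!p) { fprintf(stderr, "out of memory\n"); exit(EXIT_FAILURE); }
    return p;
}

A_exp A_NumExp(int num)
{
    A_exp e = checked_malloc(sizeof(*e));
    e->kind = A_numExp;
    e->u.num = num;
    return e;
}

/* The operator is passed and stored by value: no A_binop node is allocated. */
A_exp A_OpExp(A_exp left, A_binop oper, A_exp right)
{
    A_exp e = checked_malloc(sizeof(*e));
    e->kind = A_opExp;
    e->u.op.left = left;
    e->u.op.oper = oper;
    e->u.op.right = right;
    return e;
}
```

The expression 10*2 is then built as A_OpExp (A_NumExp (10), A_times, A_NumExp (2)), with two allocations fewer... or rather one fewer per operator node than in the A_Plus () style shown above.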


1.3 C Coding Guidelines for the Tiger Compiler Project

• The Tiger compiler will be a rather complex piece of software. We strongly suggest that you follow the guidelines below when you build C source code for the compiler.

1. Each phase of the compiler belongs in its own .c source file (which #includes an associated .h header file containing exported function prototypes and type declarations). [Separate compilation, handling, reusability]

2. Each phase shall have an identifier prefix X_ unique to this phase. All global names (struct/union fields are not global) shall start with the prefix. [Organize the otherwise flat C namespace (avoid clashes), clarify origin of name]

3. All functions shall have prototypes, and the C compiler shall be told to warn about functions without prototypes (gcc: -Wmissing-prototypes). [In C89, a function that is called without a visible declaration is assumed to return int, and its arguments are not checked against the parameter types, so passing, e.g., a pointer or a character where another type is expected goes unnoticed.]

4. Each phase includes util.h and the compiler is linked against util.o.

util.h

    #include <assert.h>

    typedef char *string;
    typedef char bool;

    #define TRUE 1
    #define FALSE 0

    void *checked_malloc(int);
    string String(char *);

  – assert: halt the program if the asserted expression yields 0. Example:

C code

    A_exp e;
    e = malloc (sizeof (*e));
    assert (e);

    Should malloc() fail:

    tiger: phase.c:42: foo: Assertion ‘e’ failed.
    Aborted.

    To disable all assertion checks, compile with -DNDEBUG.

  – bool: simulate a boolean type in C; use type bool if a variable/function actually deals with truth values.
  – checked_malloc(n) allocates n bytes and returns a pointer into the heap. Halts the program if the allocation fails. ⇒ If checked_malloc() returns, the returned pointer is valid.

5. Values of type string are heap-allocated strings. The constructor String("foo") allocates four bytes (three characters plus the terminating NUL) and copies the argument string. Convention: a function that receives a string argument may assume the string contents never change ⇒ it is safe to store the associated character pointer; there is no need to copy the string.

6. Never call malloc() directly, always use checked_malloc(). [We may later re-implement checked_malloc() to, e.g., use a GC library.]

7. Never call free() to release heap-allocated memory.
  – Correct usage of free() can be tricky: avoid space leaks (call free() early enough), avoid corruption/overwrites (do not call free() too early).
  – Good practice: set p = 0 if you plan never to access *p again.
  [Again, a GC library could make the compiler production-strength nevertheless.]
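The two helpers declared in util.h can be written in a few lines. The following is one plausible implementation, given as a sketch: the slides only specify the behavior (halt on allocation failure, copy the argument string), so the error message, the exit status, and the check for a positive length are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef char *string;

/* Allocate len bytes; never returns NULL (the program halts instead). */
void *checked_malloc(int len)
{
    void *p;

    assert(len > 0);               /* assumption: zero-size requests are bugs */
    p = malloc((size_t) len);
    if (!p) {
        fprintf(stderr, "Ran out of memory!\n");
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Copy s into freshly allocated heap memory, including the terminating NUL. */
string String(char *s)
{
    string p = checked_malloc((int) strlen(s) + 1);

    strcpy(p, s);
    return p;
}
```

Because String copies its argument, the caller may afterwards modify or reuse the original buffer without affecting the stored string, which is what makes the convention in guideline 5 safe.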
