Abstract. Syntax. Tree. Control. Flow. Graph. Object. Code. CMSC 631. 3.
Abstract Syntax ... Abstract Syntax Tree Example ... Assignments x := y op z or x :=
op z.
Compiler Structure Source Code
CMSC 631 — Program Analysis and Understanding
Abstract Syntax Tree
Control Flow Graph
Object Code
• Source code parsed to produce AST • AST transformed to CFG
Data Flow Analysis
• Data flow analysis operates on control flow graph (and other intermediate representations) CMSC 631
Abstract Syntax Tree (AST)
Abstract Syntax Tree Example x := a + b;
• Programs are written in text ■ ■
I.e., sequences of characters
y := a * b;
Awkward to work with
while (y > a) { a := a + 1;
• First step: Convert to structured representation ■
2
Use lexer (like flex) to recognize tokens
x := a + b }
Program :=
...
x
while
+ a
> b
y
Block a
:= a
... +
a
1
- Sequences of characters that make words in the language ■
Use parser (like bison) to group words structurally - And, often, to produce AST
CMSC 631
3
CMSC 631
4
ASTs
Disadvantages of ASTs
• ASTs are abstract ■
■
• AST has many similar forms
They don’t contain all information in the program
■
E.g., for, while, repeat...until
- E.g., spacing, comments, brackets, parentheses
■
E.g., if, ?:, switch
Any ambiguity has been resolved - E.g., a + b + c produces the same AST as (a + b) + c
• Expressions in AST may be complex, nested ■
• For more info, see CMSC 430 ■
In this class, we will generally begin at the AST level
• Want simpler representation for analysis ■
CMSC 631
(42 * y) + (z > 5 ? 12 * z : z + 20)
5
Control-Flow Graph (CFG)
...at least, for dataflow analysis
CMSC 631
6
Control-Flow Graph Example x := a + b;
• A directed graph where ■
Each node represents a statement
y := a * b;
■
Edges represent control flow
while (y > a) {
x := a + b
y := a * b
a := a + 1; x := a + b
• Statements may be ■
Assignments x := y op z or x := op z
■
Copy statements x := y
■
Branches goto L or if x relop y goto L
■
etc.
CMSC 631
y>a
} a := a + 1
x := a + b
7
CMSC 631
8
Variations on CFGs
Control-Flow Graph w/Basic Blocks
• We usually don’t include declarations (e.g., int x;) ■
But there’s usually something in the implementation
• May want a unique entry and exit node ■
Won’t matter for the examples we give
CMSC 631
• But more complicated to explain, so... ■ 9
Graph Example with Entry and Exit
a := a + 1; x := a + b } • All nodes without a (normal) predecessor should be pointed to by entry •All nodes without a successor should point to exit CMSC 631
We’ll use single-statement blocks in lecture today
CMSC 631
10
CFG vs. AST • CFGs are much simpler than ASTs
entry
y := a * b; while (y > a) {
y>a
• Can lead to more efficient implementations
A sequence of instructions with no branches into or out of the block
x := a + b;
x := a + b y := a * b
a := a + 1 x := a + b
• May group statements into basic blocks ■
x := a + b; y := a * b; while (y > a + b) { a := a + 1; x := a + b }
■
Fewer forms, less redundancy, only simple expressions
x := a + b
• But...AST is a more faithful representation
y := a * b y>a a := a + 1
exit
■
CFGs introduce temporaries
■
Lose block structure of program
• So for AST,
x := a + b
11
■
Easier to report error + other messages
■
Easier to explain to programmer
■
Easier to unparse to produce readable code
CMSC 631
12
Data Flow Analysis
Available Expressions
• A framework for proving facts about programs
• An expression e is available at program point p if
• Reasons about lots of little facts
■
e is computed on every path to p, and
■
the value of e has not changed since the last time e was computed on the paths to p
• Little or no interaction between facts ■
Works best on properties about how program computes
• Optimization ■
• Based on all paths through program ■
- (At least, if it’s still in a register somewhere)
Including infeasible paths
CMSC 631
13
Data Flow Facts • Is expression e available? • Facts: ■
a + b is available
■
a * b is available
■
a + 1 is available
CMSC 631
14
Gen and Kill • What is the effect of each statement on the set of facts?
entry x := a + b
Gen
Kill
y := a * b
a := a + 1
entry x := a + b
Stmt
y := a * b
y>a
exit
x := a + b
CMSC 631
If an expression is available, need not be recomputed
x := a + b
a+b
y := a * b
a*b
a := a + 1
15
CMSC 631
y>a a := a + 1 a + 1, a + b, a*b
exit
x := a + b
16
Computing Available Expressions
Terminology • A joint point is a program point where two branches meet
entry x := a + b
{a + b}
• Available expressions is a forward must problem
y := a * b
{a + b, a * b}
{a + b}
y>a
{a + b, a * b}
a := a + 1
Ø
{a + b} exit
■
Must = At join point, property must hold on all paths that are joined
{a + b}
CMSC 631
17
Data Flow Equations ■
succ(s) = { immediate successor statements of s }
■
pred(s) = { immediate predecessor statements of s}
■
In(s) = program point just before executing s
■
Out(s) = program point just after executing s
s
18
• A variable v is live at program point p if ■
v will be used on some execution path originating from p...
■
before v is overwritten
• Optimization
pred(s) Out(s )
• Out(s) = Gen(s)
CMSC 631
Liveness Analysis
• Let s be a statement
CMSC 631
Forward = Data flow from in to out
x := a + b
{a + b}
• In(s) =
■
■
If a variable is not live, no need to keep it in a register
■
If variable is dead at assignment, can eliminate assignment
(In(s) - Kill(s)) 19
CMSC 631
20
Data Flow Equations
Gen and Kill
• Available expressions is a forward must analysis ■
Data flow propagate in same dir as CFG edges
■
Expr is available only if available on all paths
• Liveness is a backward may problem To know if variable live, need to look at future uses
■
Variable is live if used on some path s
• In(s) = Gen(s)
Stmt
Gen
Kill
x := a + b
a, b
x
x := a + b
y := a * b
y>a
■
• Out(s) =
• What is the effect of each statement on the set of facts?
y := a * b
a, b
y>a
a, y
a := a + 1
a
y a := a + 1
succ(s) In(s )
x := a + b
a
(Out(s) - Kill(s))
CMSC 631
21
Computing Live Variables
CMSC 631
22
Very Busy Expressions • An expression e is very busy at point p if
{a, b} x := a + b
■
{x, a, b}
On every path from p, expression e is evaluated before the value of e is changed
y := a * b
• Optimization
{x,y,y,a,a}b} {x, y>a
■
{y, a, b} a := a + 1
• What kind of problem?
{y, a, b} x := a + b
{x, {x,y,y,a,a}b} CMSC 631
Can hoist very busy expression computation
{x}
23
■
Forward or backward?
■
May or must?
CMSC 631
backward must 24
Reaching Definitions
Space of Data Flow Analyses
• A definition of a variable v is an assignment to v
May
• A definition of variable v reaches point p if ■
There is no intervening assignment to v
Forward Backward
• Also called def-use information
Must
Reaching Available definitions expressions Live variables
Very busy expressions
• Most data flow analyses can be classified this way • What kind of problem? ■
Forward or backward?
■
May or must?
■
forward
• Lots of literature on data flow analysis
may
CMSC 631
25
Data Flow Facts and Lattices
a+b, a*b
�
■
“top”
■ ■
a+b, a+1
a*b, a+1
26
• A partial order is a pair (P, ≤) such that
Example: Available expressions a+b, a*b, a+1
CMSC 631
Partial Orders
• Typically, data flow facts form a lattice ■
A few don’t fit: bidirectional analysis
■
≤⊆P ×P
≤ is reflexive: x ≤ x
≤ is anti-symmetric: x ≤ y and y ≤ x ⇒ x = y ≤ is transitive: x ≤ y and y ≤ z ⇒ x ≤ z
a+b a*b
a+1
(none)
CMSC 631
⊥
“bottom” 27
CMSC 631
28
Lattices
Lattices (cont’d)
• A partial order is a lattice if � and � are defined
• A finite partial order is a lattice if meet and join exist for every pair of elements
on any set:
• A lattice has unique elements ⊥ and � such that
� is the meet or greatest lower bound operation: x � y ≤ x and x � y ≤ y if z ≤ x and z ≤ y, then z ≤ x � y ■ � is the join or least upper bound operation: x ≤ x � y and y ≤ x � y if x ≤ z and y ≤ z, then x � y ≤ z ■
CMSC 631
■ ■
• A partial order is a complete lattice if meet and join are defined on any set S ! P 30
• A function f on a partial order is monotonic if (Top - Kill(s))
x ≤ y ⇒ f (x) ≤ f (y)
(worklist)
• Easy to check that operations to compute In and Out are monotonic
repeat Take s from W In(s) := s pred(s) Out(s )
■
In(s) := s
■
temp := Gen(s)
(In(s) - Kill(s))
if (temp != Out(s)) { Out(s) := temp W := W succ(s)
pred(s) Out(s )
a function fs(In(s))
(In(s) - Kill(s))
• Putting these two together, ■
} until W = CMSC 631
CMSC 631
Monotonicity
Out(s) = Top for all statements s
temp := Gen(s)
x��=�
x ≤ y iff x � y = y
Forward Must Data Flow Algorithm
W := { all statements }
x��=x
• In a lattice, x ≤ y iff x � y = x
29
// Slight acceleration: Could set Out(s) = Gen(s)
x�⊥=x
x�⊥=⊥
31
CMSC 631
� temp := fs (�s� ∈pred(s) Out(s ))
32
Useful Lattices
Termination
• (2S, ) forms a lattice for any set S ■
• We know the algorithm terminates because
2S is the powerset of S (set of all subsets)
• If (S, ≤) is a lattice, so is (S, ≥) ■
The lattice has finite height
■
The operations to compute In and Out are monotonic
■
On every iteration, we remove a statement from the worklist and/or move down the lattice
I.e., lattices can be flipped
• The lattice for constant propagation 1
2
⊤ 3
...
CMSC 631
33
Forward Data Flow, Again Out(s) = Top
34
• Available expressions
(worklist)
repeat Take s from W temp := fs(⊓s pred(s) Out(s ))
CMSC 631
Lattices (P, ≤)
for all statements s
W := { all statements }
(fs monotonic transfer fn)
■
P = sets of expressions
■
S1 ⊓ S2 = S1
■
Top = set of all expressions
S2
• Reaching Definitions
if (temp != Out(s)) { Out(s) := temp W := W succ(s) } until W =
CMSC 631
■
35
■
P = set of definitions (assignment statements)
■
S1 ⊓ S2 = S1
■
Top = empty set
CMSC 631
S2
36
Fixpoints
Lattices (P, ≤), cont’d
• We always start with Top
• Live variables
■
Every expression is available, no defns reach this point
■
P = sets of variables
■
Most optimistic assumption
■
S1 ⊓ S2 = S1 ∪ S2
■
Strongest possible hypothesis
■
Top = empty set
- = true of fewest number of states
• Very busy expressions
• Revise as we encounter contradictions ■
Always move down in the lattice (with meet)
• Result: A greatest fixpoint
CMSC 631
37
Forward vs. Backward
! if (temp != In(s)) { ! ! In(s) := temp ! ! W := W pred(s)
! } until W =
! } until W =
P = set of expressions
■
S1 ⊓ S2 = S1
■
Top = set of all expressions
S2
CMSC 631
38
Termination Revisited
Out(s) = Top for all s In(s) = Top for all s W := { all statements } W := { all statements } repeat repeat ! Take s from W ! Take s from W ! temp := fs(⊓s pred(s) Out(s )) ! temp := fs(⊓s succ(s) In(s )) ! if (temp != Out(s)) { ! ! Out(s) := temp ! ! W := W succ(s)
■
• How many times can we apply this step: temp := fs(⊓s ! ■
pred(s) Out(s ))
if (temp != Out(s)) { ... }
Claim: Out(s) only shrinks - Proof: Out(s) starts out as top - So temp must be ≤ than Top after first step - Assume Out(s ) shrinks for all predecessors s of s - Then ⊓s
pred(s) Out(s ) shrinks
- Since fs monotonic, fs(⊓s CMSC 631
39
CMSC 631
pred(s) Out(s )) shrinks 40
Termination Revisited (cont’d)
Least vs. Greatest Fixpoints
• A descending chain in a lattice is a sequence ■
• Dataflow tradition: Start with Top, use meet
x0 ⊐x1 ⊐x2 ⊐...
■
- complete meet semilattice = meets defined for any set - finite height ensures termination
• The height of a lattice is the length of the longest descending chain in the lattice ■
• Then, dataflow must terminate in O(nk) time ■
n = # of statements in program
■
k = height of lattice
■
assumes meet operation takes O(1) time
CMSC 631
To do this, we need a meet semilattice with top
Computes greatest fixpoint
• Denotational semantics tradition: Start with Bottom, use join ■
41
Distributive Data Flow Problems
Computes least fixpoint
CMSC 631
42
Benefit of Distributivity
• By monotonicity, we also have
• Joins lose no information
f (x � y) ≤ f (x) � f (y)
f
g
h
• A function f is distributive if f (x � y) = f (x) � f (y)
k
k(h(f (�) � g(�))) = k(h(f (�)) � h(g(�))) = k(h(f (�))) � k(h(g(�))) CMSC 631
43
CMSC 631
44
Accuracy of Data Flow Analysis
What Problems are Distributive?
• Ideally, we would like to compute the meet over all paths (MOP) solution:
• Analyses of how the program computes ■
Live variables
■
Let fs be the transfer function for statement s
■
Available expressions
■
If p is a path {s1, ..., sn}, let fp = fn;...;f1
■
Reaching definitions
■
Let path(s) be the set of paths from the entry to s
■
Very busy expressions
MOP(s) = �p∈path(s) fp (�)
• All Gen/Kill problems are distributive
• If a data flow problem is distributive, then solving the data flow equations in the standard way yields the MOP solution CMSC 631
45
A Non-Distributive Example
CMSC 631
Practical Implementation
• Constant propagation
• Data flow facts = assertions that are true or false at a program point
x := 1
x := 2
y := 2
y := 1
• Represent set of facts as bit vector ■
Facti represented by bit i
■
Intersection = bitwise and, union = bitwise or, etc
z := x + y
• In general, analysis of what the program computes in not distributive
• “Only” a constant factor speedup ■
CMSC 631
46
47
CMSC 631
But very useful in practice 48
Basic Blocks
Order Matters
• A basic block is a sequence of statements s.t.
• Assume forward data flow problem
■
No statement except the last in a branch
■
Let G = (V, E) be the CFG
■
There are no branches to any statement in the block except the first
■
Let k be the height of the lattice
• If G acyclic, visit in topological order
• In practical data flow implementations, ■
■
Compute Gen/Kill for each basic block
• Running time O(|E|)
- Compose transfer functions ■
Store only In/Out for each basic block
■
Typical basic block ~5 statements
CMSC 631
■
49
Order Matters — Cycles
• Let Q = max # back edges on cycle-free path Nesting depth
■
Back edge is from node to ancestor on DFS tree
50
■
The order of statements is taken into account
■
I.e., we keep track of facts per program point
• Alternative: Flow-insensitive analysis
• Then if ∀x.f (x) ≤ x (sufficient, but not necessary) ■
CMSC 631
• Data flow analysis is flow-sensitive
Order from depth-first search
■
No matter what size the lattice
Flow-Sensitivity
• If G has cycles, visit in reverse postorder ■
Visit head before tail of edge
■
Analysis the same regardless of statement order
■
Standard example: types
Running time is O((Q + 1)|E|)
- /* x : int */ x := ... /* x : int */
- Note direction of req’t depends on top vs. bottom CMSC 631
51
CMSC 631
52
Terminology Review
Another Approach: Elimination
• Must vs. May ■
• Recall in practice, one transfer function per basic block
(Not always followed in literature)
• Forwards vs. Backwards • Why not generalize this idea beyond a basic block?
• Flow-sensitive vs. Flow-insensitive • Distributive vs. Non-distributive
CMSC 631
53
Lattices of Functions
■
“Collapse” larger constructs into smaller ones, combining data flow equations
■
Eventually program collapsed into a single node!
■
“Expand out” back to original constructs, rebuilding information
CMSC 631
54
Elimination Methods: Conditionals
• Let (P, ≤) be a lattice If
• Let M be the set of monotonic functions on P
IfThenElse
• Define f ≤f g if for all x, f(x) ≤ g(x)
Then
z
• Define the function f ⊓ g as ■
z
(f ⊓ g) (x) = f(x) ⊓ g(x)
fite = (fthen ◦ fif ) " (felse ◦ fif ) Out(if) = fif (In(ite))) Out(then) = (fthen ◦ fif )(In(ite))) Out(else) = (felse ◦ fif )(In(ite)))
• Claim: (M, ≤f) forms a lattice CMSC 631
Else
55
CMSC 631
56
Elimination Methods: Loops
Elimination Methods: Loops (cont’d) • Let f i = f o f o ... o f (i times)
Head
■
While
f 0 = id
• Let
Body z
z
g(j) = !i∈[0..j] (fhead ◦ fbody )i ◦ fhead
• Need to compute limit as j goes to infinity fwhile
= fhead ! fhead ◦ fbody ◦ fhead ! fhead ◦ fbody ◦ fhead ◦ fbody ◦ fhead ! · · ·
CMSC 631
57
Height of Function Lattice What is height of lattice of monotonic functions?
■
Claim: finite (see homework)
Does such a thing exist?
• Observe: g(j+1) ≤ g(j)
CMSC 631
58
Non-Reducible Flow Graphs
• Assume underlying lattice (P, ≤) has finite height ■
■
• Therefore, g(j) converges
• Elimination methods usually only applied to reducible flow graphs ■
Ones that can be collapsed
■
Standard constructs yield only reducible flow graphs
• Unrestricted goto can yield non-reducible graphs A
CMSC 631
59
CMSC 631
B
C
z
w
60
Comments
Data Flow Analysis and Functions
• Can also do backwards elimination ■
• What happens at a function call?
Not quite as nice (regions are usually single entry but often not single exit)
■
• For bit-vector problems, elimination efficient ■
• In practice, only analyze one procedure at a time
Easy to compose functions, compute meet, etc.
• Consequences
• Elimination originally seemed like it might be faster than iteration ■
Not really the case
CMSC 631
Lots of proposed solutions in data flow analysis literature
61
More Terminology
■
Call to function kills all data flow facts
■
May be able to improve depending on language, e.g., function call may not affect locals
CMSC 631
62
Data Flow Analysis and The Heap
• An analysis that models only a single function at a time is intraprocedural • An analysis that takes multiple functions into account is interprocedural • An analysis that takes the whole program into account is...guess?
• Data Flow is good at analyzing local variables ■
But what about values stored in the heap?
■
Not modeled in traditional data flow
• In practice: *x := e ■
Assume all data flow facts killed (!)
■
Or, assume write through x may affect any variable whose address has been taken
• Note: global analysis means “more than one basic block,” but still within a function
• In general, hard to analyze pointers CMSC 631
63
CMSC 631
64
Data Flow Analysis and Optimization
•LCLint - Evans et al. (UVa)
• Moore’s Law: Hardware advances double computing power every 18 months.
•METAL - Engler et al. (Stanford, now Coverity) •ESP - Das et al. (MSR)
• Proebsting’s Law: Compiler advances double computing power every 18 years. ■
DF Analysis and Defect Detection
•FindBugs - Hovemeyer, Pugh (Maryland) ■ For
Not so much bang for the buck!
Java. The first three are for C.
•Many other one-shot projects
CMSC 631
65
■ Memory
leak detection
■ Security
vulnerability checking (tainting, info. leaks)
CMSC 631
66