Global Code Motion / Global Value Numbering

Cliff Click†
Hewlett-Packard Laboratories, One Main St., Cambridge, MA 02142
Office: (617) 225-4915
cliffc@hpl.hp.com

† This work was supported by ARPA through ONR grant N00014-91-J-1989, and was done at the Rice University Computer Science Department.

Abstract

We believe that optimizing compilers should treat the machine-independent optimizations (e.g., conditional constant propagation, global value numbering) and code motion issues separately.¹ Removing the code motion requirements from the machine-independent optimizations allows stronger optimizations using simpler algorithms. Preserving a legal schedule is one of the prime sources of complexity in algorithms like PRE [18, 13] or global congruence finding [2, 20].

We present a straightforward near-linear-time² algorithm for performing Global Code Motion (GCM). Our GCM algorithm hoists code out of loops and pushes it into more control-dependent (and presumably less frequently executed) basic blocks. GCM is not optimal, in the sense that it may lengthen some paths: it hoists control-dependent code out of loops. This is profitable if the loop executes at least once; frequently it is very profitable. GCM relies only on dependence between instructions; the original schedule order is ignored. GCM moves instructions, but it does not alter the Control Flow Graph (CFG) nor remove redundant code. GCM benefits from CFG shaping (such as splitting control-dependent edges or inserting loop landing pads).

GCM allows us to use a simple hash-based technique for Global Value Numbering (GVN). GVN replaces sets of value-equivalent instructions with a single instruction; however, the single instruction is not placed in any basic block (alternatively, think of GVN as selecting a block at random; the resulting program is clearly incorrect). A following pass of GCM selects a correct placement for the instruction based solely on its dependence edges. GVN is also a convenient place to fold constants; constant expressions are then value-numbered like other expressions.

¹ Modern microprocessors with super-scalar or VLIW execution require global scheduling to make good use of the machine's functional units; that is a separate, machine-dependent form of code motion.
² We require a dominator tree, which requires O(α(n,n)) time to build.

1. Introduction

GVN attempts to replace a set of instructions that each compute the same value with a single instruction. GVN finds these sets by looking for instructions that have the same operation and identical inputs. Finding two instructions with the same operation and identical inputs is done with a hash-table lookup. In most programs the same variable is assigned many times; the requirement is for identical values, not identical variable names, so GVN relies heavily on Static Single Assignment (SSA) form [12]. GVN is fast, requiring only one pass over the program.

GCM will move code from main paths into more control-dependent paths. The effect is similar to Partial Dead Code Elimination [16]: expressions that are dead on some paths but not others are moved to the latest point where they dominate their uses. In general this is very profitable; however, like all heuristics, it sometimes lengthens a path.

We present GCM next, in Section 2, both because it is useful by itself and because GVN relies on it. In Section 3 we present our implementation of GVN, and in Section 4 we present numerical results showing the benefits of the GVN-GCM combination over more traditional techniques (e.g., PRE). For the rest of this section we discuss related work and the intermediate representation used throughout the paper.
1.1 Related Work

Loop-invariant code motion techniques have been around awhile [1]. Aho, Sethi and Ullman present an algorithm that moves loop-invariant code to a pre-header before the loop. Because it does not use SSA form, significant restrictions are placed on what can be moved; multiple assignments to the same variable are not hoisted. They also lift the profitability restriction and allow the hoisting of control-dependent code. They note that lifting a guarded division may be unwise, but they do not present any formal method for dealing with faulting instructions.

Partial Redundancy Elimination (PRE) [18, 13] removes partially redundant expressions when profitable. Loop-invariant expressions are considered partially redundant and, when profitable, will be hoisted to before the loop. Profitable here means that no path is lengthened. A computation that is used on some paths in a loop, but not all paths, may lengthen a path and so is not hoisted; GCM will hoist such control-dependent expressions.

PRE finds lexical congruences instead of value congruences [18, 13]. As such, the set of congruences it discovers (whether they are taken advantage of or not) only partially overlaps with GVN's. Some textually congruent expressions do not compute the same value and, due to algebraic identities, many value congruences are not textually equal. In practice, GVN finds a larger set of congruences than PRE.

Local value-numbering techniques have been generalized to extended basic blocks in several compilers. GVN is a direct extension of the local hash-table-based value-numbering techniques [1, 11]. Since it is global instead of local, it finds all the redundant expressions found by the local techniques.

Alpern, Wegman and Zadeck present a top-down technique for building partitions of congruent instructions [2]. Their partitioning method will find congruences amongst sequences of instructions that form loops. While this algorithm finds more congruences than GVN, it is quite complex; compiler writers have to trade off the expected gain against the cost. GVN uses a simple bottom-up technique for building congruences, and because it is bottom-up it cannot find such looping congruences. However, GVN can make use of algebraic identities and constant folding, which the A/W/Z technique cannot. Also, GVN runs in linear time with a simple one-pass algorithm, while A/W/Z use a more complex O(n log n) partitioning algorithm.

Rosen, Wegman and Zadeck present a method for global value numbering [19]. Their technique finds a nearly identical set of redundant expressions. However, the algorithm presented here is simpler, in part because the scheduling issues are handled by GCM. Also, GVN works on irreducible graphs; the R/W/Z approach relies heavily on program structure.
1.2 Representation

We represent programs as CFGs: directed graphs with the vertices representing basic blocks and the edges representing the flow of control³ [1]. Basic blocks contain sequences of instructions.

³ Our implementation treats basic blocks as a special kind of instruction, similar to Ferrante, Ottenstein and Warren's region nodes [15]. This allows us to simplify several optimization algorithms unrelated to this paper. Our presentation will use the more traditional CFG.

The instructions themselves are represented as a directed graph, with the vertices representing instructions and the edges representing data flow. We make extensive use of use-def and def-use chains, and use-def chains are explicit in the implementation: as assignments are renamed in SSA form, the names in expressions essentially become pointers to the instructions that define them. This converts sequences of instructions into a directed graph. It is not a DAG, because SSA φ-functions are handled like other expressions and can have backedges as inputs. In order to make the representation compositional, a φ-function needs an input that indicates under what conditions values are merged [10]; it suffices to make φ-functions keep a pointer to the basic block they exist in [8]. This is similar in spirit to the value dependence graph [23] or an operator-level Program Dependence Graph [15]. We show an example in Figure 1.
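To make this concrete, here is a minimal C sketch of such an instruction node. This is our own illustration, not the paper's data structure; the names Node, Opcode and the fixed-width input array are assumptions.

    /* Sketch of an instruction in the graph: use-def edges are explicit
     * pointers to defining instructions; no basic-block order is kept.  */
    enum Opcode { OP_CONST, OP_ADD, OP_MUL, OP_PHI, OP_LOAD, OP_STORE };

    struct Block;                   /* a CFG basic block (see Section 2)   */

    struct Node {
        enum Opcode   op;           /* the operation performed              */
        int           nin;          /* number of inputs                     */
        struct Node  *in[3];        /* use-def chains: inputs by pointer    */
        struct Block *block;        /* non-NULL only for pinned nodes; a    */
                                    /* PHI keeps the block it merges in     */
    };

In this sketch a φ-function is just a Node whose op is OP_PHI and whose block field names the merge point, matching the compositional representation described above.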
Both GCM and GVN rely solely on dependence for correct behavior; the original schedule is ignored (although it may be used as an input to the code-placement heuristic). To be correct, then, all dependences must be explicit. LOADs and STOREs need to be threaded with an extra input, a memory token, so that memory dependences are expressed like all other dependences [3, 15, 23]. Faulting instructions⁴ keep an explicit control input to the block they started in; this input allows the instruction to be scheduled in its original basic block or after, but not before. Most other data instructions are not pinned to any particular basic block; their correct behavior depends solely on the data dependences.

⁴ The exceptions are instructions which can cause exceptions: loads, stores, division, subroutine calls, etc.
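Continuing the Node sketch above, threading a memory token through a STORE and a LOAD makes the memory dependence explicit. The helper make_node and the operand order here are illustrative assumptions, not the paper's interface.

    /* Sketch: the LOAD takes the STORE's memory token as an input, so any
     * schedule GCM picks must place the LOAD after the STORE.             */
    struct Node *make_node(enum Opcode op, int nin, ...);   /* assumed helper */

    void thread_memory(struct Node *mem0, struct Node *addr, struct Node *val) {
        struct Node *st = make_node(OP_STORE, 3, mem0, addr, val); /* new token */
        struct Node *ld = make_node(OP_LOAD,  2, st,   addr);      /* uses it   */
        (void)ld;  /* a faulting node would also carry an explicit control input */
    }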
[Figure 1: A loop, and the resulting graph to be scheduled. The loop body computes b := a + 1, i2 := i1 + b, c := i2 × 2 and cc := i2 < 10, then does "br ne loop".]
An instruction, the instructions that depend on it, and the instructions on which it depends all exist in a "sea" of instructions, with little control structure. The "sea" of instructions is useful for optimization, but it does not represent any traditional intermediate representation such as a CFG. We need a way to serialize the graph and get back the CFG form. We do this with a simple global code motion algorithm.

2. Global Code Motion

The global code motion algorithm "schedules" the instructions by attaching the "free-floating" instructions to the CFG's basic blocks. It must preserve any existing control dependences (divides, subroutine calls and other possibly faulting operations keep a dependence to their original block) and all data dependences. While satisfying these restrictions, the scheduler uses a flexible and simple heuristic: place code outside as many loops as possible, then on as few execution paths as possible.

We will use the loop in Figure 1 as a running example. The code on the left is the original program; on the right is the resulting graph (possibly after optimization). Shadowed boxes are basic blocks, rounded boxes are instructions, dark lines represent control flow edges, and thin lines represent data-flow (data dependence) edges.

We use the following basic strategy:

1. Find the CFG dominator tree, and annotate basic blocks with their dominator-tree depth.
2. Find loops and compute the loop nesting depth of each basic block.
3. Schedule (select basic blocks for) all instructions early, based on existing control and data dependences. We place instructions in the first block where they are dominated by their inputs. This schedule has a lot of speculative code, with extremely long live ranges.
4. Schedule all instructions late. We place instructions in the last block where they dominate all their uses.
5. Between the early schedule and the late schedule we have a safe range in which to place computations. We choose the block that is in the shallowest loop nest possible, and then is as control dependent as possible.

We cover these points in more detail in the sections that follow.

2.1 Dominators and Loops

We find the dominator tree by running Lengauer and Tarjan's fast dominator-finding algorithm [17]. We find loops using a slightly modified version of Tarjan's "Testing flow graph reducibility" [22]; Chris Vick has an excellent write-up of the modifications and of building the loop tree [21], and we used his algorithm directly. The running time of these algorithms is nearly linear in the size of the CFG, and in practice they can be computed quite quickly.

The dominator tree and loop nesting depth give us a handle on the expected execution times of basic blocks. We use this information in the following sections in deciding where to place instructions. We show the dominator tree and loop tree for our running example in Figure 2.

[Figure 2: The dominator tree and loop tree for the running example.]
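Before moving on, a hedged sketch of the per-block annotations this section computes; the field names (idom, dom_depth, loop_nest) are our own assumptions, chosen to match the pseudocode that follows.

    /* Sketch: the annotations GCM needs on every basic block. */
    struct Block {
        struct Block *idom;        /* immediate dominator                    */
        int           dom_depth;   /* depth in the dominator tree            */
        int           loop_nest;   /* loop nesting depth; 0 = outside loops  */
    };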
2.2 Schedule Early
Our first scheduling pass greedily places instructions as early as possible: we place each instruction in the first block where it is dominated by its inputs. The only exceptions are instructions that are pinned into a specific block: PHI instructions, BRANCH/JUMP instructions (these end specific blocks), and the faulting instructions. Faulting instructions have an explicit control input, so they cannot be moved before their original basic block. This schedule has a lot of speculative code, with extremely long live ranges.

We make a post-order traversal over the program graph, starting at instructions that already have a control dependence (the "pinned" instructions). After scheduling all of an instruction's inputs, we schedule the instruction itself. The correct block is the block of the deepest input in the dominator tree: it is the earliest block in which the instruction is dominated by all of its inputs.
    Schedule_Early( instruction i ) {      // Instruction i is being visited now
        if i is pinned or already visited then return;
        mark i as visited;
        forall inputs x to i do            // Schedule all inputs early
            Schedule_Early( x );
        i.block := the root basic block;   // Start shallow, then choose the
        forall inputs x to i do            // deepest dominator-tree depth
            if i.block.dom_depth < x.block.dom_depth then
                i.block := x.block;        // amongst the inputs
    }
Schedule_Early assumes all of an instruction's inputs can be scheduled before the instruction itself, and it treats PHI instructions as pinned to the block where the merging occurs. PHI instructions deserve a closer look. Consider the program in Figure 4:

    1.   int first := TRUE, x, sum0 := 0;
    2.   while( pred() ) {
    2.1      sum1 := φ( sum0, sum2 );
    3.       if( first )
    4.           { first := FALSE; x := 1; }
    5.1      sum2 := sum1 + x;
         }

In the graph form (Figure 5) the conditionally assigned x is merged by a PHI instruction, and that PHI instruction on line 3 is critical. With the PHI instruction removed, the computation using x on line 4 would be scheduled early, in block 2. However, block 2 does not dominate block 4, and y is live on exit from block 4. In fact, there is no single block where the inputs to the φ dominate all the φ's uses, and thus no legal schedule. The PHI instruction tells us that it is all right: the programmer wrote code such that a well-defined value reaches every use. In essence, the declaration of x initializes it, and the PHI instruction in block 5 makes the merging explicit, so the merged value can be scheduled where it is needed. PHI instructions can arise outside loops as well.
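Steps 4 and 5 of the strategy, scheduling late and picking the final block, complete GCM. Below is a hedged C sketch of the selection step, using the Block fields sketched in Section 2.1; the formulation of the latest legal block as the dominator-tree least common ancestor (LCA) of an instruction's uses is our own restatement, not the paper's code.

    /* Sketch: choose the final block for an instruction, between its early
     * block (from Schedule_Early) and the latest legal block, the LCA of
     * the blocks of all its uses.                                          */
    struct Block *select_block(struct Block *early, struct Block *lca) {
        struct Block *best = lca;               /* start as late, i.e. as    */
        for (struct Block *b = lca; ; b = b->idom) {
            if (b->loop_nest < best->loop_nest) /* control dependent, as     */
                best = b;                       /* possible; hoist only when */
            if (b == early)                     /* a shallower loop nest     */
                break;                          /* appears; stop at the      */
        }                                       /* early schedule            */
        return best;
    }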
[Figure 7: Our loop example, after scheduling. The compare cc := i2 < 10 and the "branch ne loop" remain in the loop; the constant c2 := 2 and the multiply c := i2 × c2 have been placed in the block after the loop, before "return c".]

3. Global Value Numbering

3.1 Value-Numbering

Value-numbering attempts to assign exactly the same value number to equivalent instructions.
The bottom-up, hash-table-based approach is amenable to a number of extensions which are commonly implemented in local value-numbering algorithms. Among these are constant folding and recognizing algebraic identities. When we value-number an instruction, we perform the following tasks:

1) Determine if the instruction computes a constant result. If it does, replace the instruction with one that directly computes the constant. Instructions that compute constants are value-numbered by the hash table below, so all instructions generating the same constant will eventually fold into a single instruction. Instructions can generate constants for two reasons: either all their inputs are constant, or one of their inputs is a special constant. Multiplication by zero is such a case; there are many others, and they all interact in useful ways.⁸

2) Determine if the instruction is an algebraic identity on one of its inputs. Again, these arise for several reasons: either the instruction is a trivial COPY, or one input is a special constant (an add of zero), or both inputs are the same (the MIN or MAX of equal values). We replace uses of such instructions with uses of the copied value. The current instruction is then dead and can be removed.

3) Determine if the instruction computes the same value as some previous instruction. We do this by hashing the instruction's operation and inputs (which together uniquely determine the value produced by the instruction) and looking in a hash table. If we find a hit, then we replace uses of this instruction with uses of the previous instruction; again, we remove the redundant instruction. If the hash-table lookup misses, then the instruction computes a new value, and we insert it in the table. Commutative instructions use a hashing (and compare) function that ignores the order of their inputs.

While programmers occasionally write redundant code, more often redundancy arises out of a compiler's front end during the conversion of array addressing expressions into machine instructions.

⁸ By treating basic blocks as a special kind of instruction, we are able to constant-fold basic blocks like other instructions. Constant folding a basic block means removing constant tests and unreachable control flow edges, which in turn simplifies the dependent φ-functions.
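A minimal C sketch of this value-numbering step over the Node type sketched earlier. The hash mixing and the helpers (is_commutative, try_constant_fold, try_identity, table_lookup, table_insert, replace_uses) are all our own illustrative assumptions, not the paper's implementation.

    #include <stdint.h>

    /* assumed helpers: */
    int is_commutative(enum Opcode);
    struct Node *try_constant_fold(struct Node *);
    struct Node *try_identity(struct Node *);
    struct Node *table_lookup(struct Node *);
    void table_insert(struct Node *);
    void replace_uses(struct Node *old, struct Node *with);

    /* Hash on operation and inputs; ignore input order for commutative ops. */
    unsigned hash_node(struct Node *n) {
        uintptr_t a = (uintptr_t)n->in[0];
        uintptr_t b = (uintptr_t)(n->nin > 1 ? n->in[1] : 0);
        if (is_commutative(n->op) && a > b) { uintptr_t t = a; a = b; b = t; }
        return (unsigned)((unsigned)n->op * 31u + a * 7u + b);
    }

    /* One value-numbering step: tasks 1-3 above. */
    void value_number(struct Node *n) {
        struct Node *m;
        if ((m = try_constant_fold(n)) != 0)   /* task 1: constant result?     */
            { replace_uses(n, m); return; }
        if ((m = try_identity(n)) != 0)        /* task 2: x+0, x*1, COPY, ...  */
            { replace_uses(n, m); return; }
        if ((m = table_lookup(n)) != 0)        /* task 3: same op and inputs?  */
            replace_uses(n, m);                /* reuse previous instruction   */
        else
            table_insert(n);                   /* n computes a new value       */
    }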
3.2 Example

We present an extended example in Figure 8, copied from the Rosen, Wegman and Zadeck paper [19]. The original program fragment is on the left and the SSA form is on the right. We assume every original variable is live on exit from the fragment. For space reasons we do not show the program in graph form; we present the highlights of running GVN here.

[Figure 8: A larger example, also in SSA form. The fragment begins with read(a0, b0, c0, d0, l0, m0, s0, t0) and enters a loop whose header merges a1 := φ(a0, a3), d1 := φ(d0, d3), m1 := φ(m0, m3), s1 := φ(s0, s3) and t1 := φ(t0, t3). The two sides of a conditional compute, between them, l1 := c0 × b0, m2 := l1 + 4, a2 := c0, d2 := c0, l2 := d2 × b0, s2 := a1 × b0 and t2 := s2 + 1. The merge computes a3 := φ(a2, a1), d3 := φ(d1, d2), l3 := φ(l1, l2), m3 := φ(m2, m1), s3 := φ(s1, s2) and t3 := φ(t1, t2), then x0 := a3 × b0 and y0 := x0 + 1.]

We visit the instructions in reverse post-order. The read() call obviously produces a new value number. The φ-functions in the loop header block (a1, d1, m1, s1 and t1) are not algebraic identities and they do not compute constants, so they all produce new value numbers, as do "l1 := c0 × b0" and "m2 := l1 + 4". Next, a2 is an identity on c0; we follow the def-use chains to remap uses of a2 into uses of c0.
Running along the other side of the conditional, we discover that we can replace d2 with c0. Then "l2 := c0 × b0" has the same value number as l1, and we replace uses of l2 with uses of l1. The variables s2 and t2 produce new values. In the merge after the conditional we find "l3 := φ(l1, l1)", which is an identity on l1; we replace uses of l3 occurring after this code fragment with l1. The other φ-functions all produce new values.

We show the resulting program on the left in Figure 9. At this point the program is incorrect: we require a round of GCM to move "l1 := c0 × b0" to a point that dominates its uses. GCM moves the computations of l1, m2, x0 and y0 out of the loop. The final code, still in SSA form, is shown in the middle of Figure 9. Coming out of SSA form requires copies and new variable names, as shown on the right in Figure 9.
[Figure 9: The example after GVN on the left, after GCM in the middle, and back out of SSA form on the right. After GVN, a3 := φ(c0, a1) and d3 := φ(d1, c0); after GCM, "l1 := c0 × b0" and "m2 := l1 + 4" are computed once, ahead of the loop, and x0 := a3 × b0 and y0 := x0 + 1 have also been moved out of the loop.]
4. Experiments

We converted a large test suite into a low-level intermediate language, ILOC [6, 5]. The translation is very naive, and the resulting ILOC is clearly intended to be optimized. We then performed several machine-independent optimizations at the ILOC level, and ran the resulting ILOC on a simulator to collect execution cycle counts for the ILOC virtual machine. All applications ran to completion on the simulator and produced correct results.

ILOC is a virtual assembly language. The current simulator defines a machine with an infinite register set and 1-cycle latencies for all operations, including LOADs, STOREs and JUMPs. A single register can not be used for both integer and floating-point operations (conversion operators exist). ILOC is based on a load-store architecture; there are no memory operations other than LOAD and STORE. ILOC includes the usual assortment of logical, arithmetic and floating-point operations.

The simulator is implemented by translating the ILOC into C; the C code is then compiled and linked as a native program. The executed code contains annotations that collect statistics on the number of times each routine is called and on the dynamic cycle count.

ILOC clearly represents an unrealistic machine model. We argue that while the simulated latencies are overly simplistic, the model makes it possible to measure the separate effects of optimizations. We are looking at machine-independent optimization; were we to include memory-hierarchy effects or a fixed number of functional units, we would obscure the effects we are trying to measure.

Most of the optimization phases use SSA form internally, with a naive translation out of SSA at the end. This translation inserts a large number of copies, so we followed it with a round of coalescing to remove the copies [4, 9]. This empties some basic blocks that only hold copies, and some of the blocks inserted by CFG shaping go unused; we ran a pass to remove the empty blocks.

Besides the optimizations already discussed, we also used the following:

CCP: Conditional Constant Propagation is an implementation of Wegman and Zadeck's algorithm [24]. It simultaneously removes unreachable code and replaces computations of constants with loads of constants. After constant propagation, CCP folds algebraic identities that use constants, such as x × 1 and x + 0, into simple copy operations.

GCF: Global Congruence Finding implements Alpern, Wegman and Zadeck's global congruence algorithm [2]. Instead of using dominance information, available-expressions (AVAIL) information is used to remove redundant computations. AVAIL is stronger than dominance, so more computations are removed [20].

PRE: Partial Redundancy Elimination implements Morel and Renvoise's algorithm [18]. PRE is discussed in Section 1.1.
Reassociation: This phase reassociates and redistributes integer expressions, allowing other optimizations to find more loop-invariant code [6]. This is important for addressing expressions inside loops, where the rapidly varying innermost index may cause portions of a complex address to be recomputed on each iteration. Reassociation can allow all the invariant parts to be hoisted, leaving behind only an expression of the form "base + index × stride" in the loop.
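As a hedged illustration (our own example, not one drawn from the test suite), consider the address computation for a two-dimensional array element inside a j-indexed inner loop:

    /* Illustrative only: the effect of reassociation on an addressing
     * expression inside an inner loop over j.                          */
    char *base, *row, *addr; int i, j; enum { NCOLS = 100 };

    addr = base + (((i * NCOLS) + j) * 4);   /* before: the whole product is
                                                recomputed on each iteration */

    row  = base + ((i * NCOLS) * 4);         /* after: invariant in the j    */
    addr = row + (j * 4);                    /* loop is hoisted, leaving     */
                                             /* base + index * stride        */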
All experiments start by shaping the CFG: splitting control-dependent edges and inserting loop landing pads. All experiments end with a round of Dead Code Elimination (DCE) [12].

4.1 The Experiments

The test suite consists of a dozen applications with a total of 52 procedures. All the applications are written in FORTRAN, except for cplex, a large constraint solver written in C. Doduc, tomcatv, matrix300 and fpppp are from the Spec89 benchmark suite.⁹ The remaining procedures come from the Forsythe, Malcom and Moler suite of routines [14].

⁹ At the time, our tools were not up to handling the entire Spec suite.

Our experiments all follow this general strategy:

1. Convert from FORTRAN or C to naive ILOC. Reshape the CFG.
2. Optionally reassociate expressions.
3. Run either GVN-GCM or GCF-PRE-CCP, once or twice.
4. Clean up with DCE, coalescing and removing empty blocks.
5. Translate the ILOC to C with the simulator, then compile and link.
6. Run the compiled application, collecting statistics.

Variations on this strategy are outlined below in each experiment.

4.2 GVN-GCM vs. GCF-PRE-CCP

We compared the GVN-GCM combination against one pass each of GCF, PRE and CCP. (After running GVN we are required to run Global Code Motion.) As always, we finish up with a pass of DCE, coalescing and removing empty blocks. The table in Figure 10 shows the results. The first two columns are the application and procedure name; the third column is the runtime cycle count for GVN-GCM, and the fourth column is the cycle count for GCF-PRE-CCP. The percentage speedup is in the fifth column.

4.3 GVN-GCM vs. repeated GCF-PRE-CCP

We next compared the GVN-GCM combination against two passes of GCF, PRE and CCP, the final sequence being GCF, PRE, CCP, DCE, GCF, PRE and CCP. This represents an aggressive optimization strategy with reasonable compile times. As before, we finish up with a pass of DCE, coalescing and removing empty blocks. The sixth column in Figure 10 shows the runtime cycles for the repeated passes, and column seven shows the percentage speedup over GVN-GCM. Roughly half the gain over a single pass of GCF-PRE-CCP is lost; this indicates that the second pass of GCF and PRE could make use of constants found and code removed by CCP.

The separate passes have a phase-ordering problem: constants found with CCP cannot then be used to find congruences in the same pass. Each of GCF, PRE and CCP can find some congruences or constants that GVN cannot; however, constants discovered by GVN can help find congruences and vice-versa, because GVN folds constants and algebraic identities as it value-numbers. In addition, GCM is more aggressive than PRE in hoisting loop-invariant expressions.
4.4 Reassociate, then GVN-GCM vs. GCF-PRE-CCP

We first reassociate expressions, then run the GVN-GCM combination, and compare it to reassociation followed by a single pass of GCF, PRE and CCP. As before, we finish up with a pass of DCE, coalescing and removing empty blocks. In Figure 10, column eight is the runtime cycle count for reassociation followed by GVN-GCM; column nine is reassociation, then GCF-PRE-CCP; column ten is the percentage speedup. The average speedup is 5.9% instead of 4.3%: reassociation provides more opportunities for GVN-GCM than it does for GCF-PRE-CCP.

4.5 Reassociate, then GVN-GCM vs. repeated GCF-PRE-CCP

We first reassociate expressions, then run the GVN-GCM combination, and compare it to reassociation followed by two passes of GCF, PRE and CCP, finishing as always with DCE, coalescing and removal of empty blocks. Column eleven in Figure 10 shows the runtime cycles for the repeated passes, and column twelve shows the percentage speedup over GVN-GCM. Again, roughly half the gain over a single pass of GCF-PRE-CCP is lost, again indicating that the second pass of GCF and PRE makes use of constants found and code removed by CCP.

5. Conclusions

By separating the code motion issues from the machine-independent optimization issues, we are able to write a particularly simple and fast implementation of Global Value Numbering. Besides finding and removing redundant expressions, GVN folds constants and algebraic identities. Besides correcting the incorrect schedule produced by GVN, GCM moves code out of loops and down into more control-dependent branches. We implemented both and ran them on a suite of programs, with good results.

GCM requires the dominator tree and the loop nesting depth. It then makes two linear-time passes over the program to select basic blocks; its running time is essentially linear in the program size, and it is quite fast in practice. GVN requires one linear-time pass over the program and is also quite fast in practice. Together these optimizations provide a net gain over using GCF, PRE and CCP, and are very simple to implement.
[Figure 10: GVN-GCM vs. GCF-PRE-CCP. Per-procedure simulated cycle counts and percentage speedups for the 52 test procedures: application and routine name, GVN-GCM cycles, GCF-PRE-CCP cycles run once and repeated, and the same comparisons with reassociation first. The per-procedure rows are not reproduced here; the average per-procedure speedups of GVN-GCM are 4.3% over a single pass of GCF-PRE-CCP and 0.4% over repeated passes, rising to 5.9% and 2.2% respectively when both are preceded by reassociation.]
Bibliography

[1] A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1985.
[2] B. Alpern, M. N. Wegman, and F. K. Zadeck. Detecting equality of variables in programs. In Conference Record of the Fifteenth ACM Symposium on the Principles of Programming Languages, Jan. 1988.
[3] R. A. Ballance, A. B. Maccabe, and K. J. Ottenstein. The program dependence web: A representation supporting control-, data- and demand-driven interpretation of imperative languages. In Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation, June 1990.
[4] P. Briggs. Register Allocation via Graph Coloring. Ph.D. thesis, Rice University, 1992.
[5] P. Briggs. The Massively Scalar Compiler Project. Unpublished report; preliminary version available via ftp://cs.rice.edu/public/preston/optimizer/shared.ps. Rice University, July 1994.
[6] P. Briggs and K. Cooper. Effective partial redundancy elimination. In Proceedings of the SIGPLAN '94 Conference on Programming Language Design and Implementation, June 1994.
[7] P. Briggs and T. Harvey. Iloc '93. Technical report CRPC-TR93323, Rice University, 1993.
[8] R. Cartwright and M. Felleisen. The semantics of program dependence. In Proceedings of the SIGPLAN '89 Conference on Programming Language Design and Implementation, June 1989.
[9] G. J. Chaitin. Register allocation and spilling via graph coloring. In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, 1982.
[10] C. Click. Combining Analyses, Combining Optimizations. Ph.D. thesis, Rice University, 1995.
[11] J. Cocke and J. T. Schwartz. Programming Languages and Their Compilers. Courant Institute of Mathematical Sciences, New York University, April 1970.
[12] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. An efficient method of computing static single assignment form. In Conference Record of the Sixteenth ACM Symposium on the Principles of Programming Languages, Jan. 1989.
[13] K. H. Drechsler and M. P. Stadel. A solution to a problem with Morel and Renvoise's "Global optimization by suppression of partial redundancies". ACM Transactions on Programming Languages and Systems, 10(4):635-640, Oct. 1988.
[14] G. E. Forsythe, M. A. Malcom, and C. B. Moler. Computer Methods for Mathematical Computations. Prentice-Hall, Englewood Cliffs, New Jersey, 1977.
[15] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, 9(3):319-349, July 1987.
[16] J. Knoop, O. Rüthing, and B. Steffen. Partial dead code elimination. In Proceedings of the SIGPLAN '94 Conference on Programming Language Design and Implementation, June 1994.
[17] T. Lengauer and R. E. Tarjan. A fast algorithm for finding dominators in a flowgraph. ACM Transactions on Programming Languages and Systems, 1(1):121-141, July 1979.
[18] E. Morel and C. Renvoise. Global optimization by suppression of partial redundancies. Communications of the ACM, 22(2):96-103, Feb. 1979.
[19] B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Global value numbers and redundant computations. In Conference Record of the Fifteenth ACM Symposium on the Principles of Programming Languages, Jan. 1988.
[20] T. Simpson. Global Value Numbering. Unpublished report; available via ftp://cs.rice.edu/public/preston/optimizer/gval.ps. Rice University, 1994.
[21] C. Vick. SSA-Based Reduction of Operator Strength. Masters thesis, Rice University, 1994.
[22] R. E. Tarjan. Testing flow graph reducibility. Journal of Computer and System Sciences, 9:355-365, 1974.
[23] D. Weise, R. Crew, M. Ernst, and B. Steensgaard. Value dependence graphs: Representation without taxation. In Conference Record of the 21st ACM SIGPLAN Symposium on Principles of Programming Languages, Jan. 1994.
[24] M. N. Wegman and F. K. Zadeck. Constant propagation with conditional branches. ACM Transactions on Programming Languages and Systems, 13(2):181-210, April 1991.