Global Code Motion / Global Value Numbering

Cliff Click†

Hewlett-Packard Laboratories, One Main St., Cambridge, MA 02142
cliffc@hpl.hp.com    (617) 225-4915

Abstract

We believe that optimizing compilers should treat the machine-independent optimizations (e.g., conditional constant propagation, global value numbering) and code motion issues separately.¹ Removing the code motion requirements from the machine-independent optimizations allows stronger optimizations using simpler algorithms. Preserving a legal schedule is one of the prime sources of complexity in algorithms like PRE [18, 13] or global congruence finding [2, 20].

We present a straightforward near-linear-time² algorithm for performing Global Code Motion (GCM). Our GCM algorithm hoists code out of loops and pushes it into more control-dependent (and presumably less frequently executed) basic blocks. GCM is not optimal in the sense that it may lengthen some paths: it hoists control-dependent code out of loops. This is profitable if the loop executes at least once; frequently it is very profitable. GCM relies only on dependence between instructions; the original schedule order is ignored. GCM moves instructions, but it does not alter the Control Flow Graph (CFG) nor remove redundant code. GCM benefits from CFG shaping (such as splitting control-dependent edges, or inserting loop landing pads).

Since GCM is not required to preserve a legal schedule, the machine-independent optimizations are free to ignore one. This allows us to use a simple hash-based technique for Global Value Numbering (GVN). GVN replaces sets of value-equivalent instructions with a single instruction. GVN finds these sets by looking for instructions that have the same operation and identical inputs; finding two such instructions is done with a hash-table lookup. In most programs the same variable is assigned many times; the requirement is for identical values, not variable names. Hence GVN relies heavily on Static Single Assignment (SSA) form [12]. GVN is also a convenient place to fold constants; constant expressions are then value-numbered like other expressions. Like GCM, GVN is fast, requiring only one pass over the program.

† This work has been supported by ARPA through ONR grant N00014-91-J-1989, and was done at the Rice University Computer Science Department.
¹ Modern microprocessors require global scheduling, with super-scalar or VLIW execution.
² We require a dominator tree, which requires O(α(n,n)) time per operation to build.

1. Introduction

We believe that optimizing compilers should treat their machine-independent optimizations and their code motion issues separately. GVN attempts to replace a set of instructions that each compute the same value with a single instruction. However, the single replacement instruction is not placed in any basic block (alternatively, think of GVN selecting a block at random; the resulting program is clearly incorrect). A following pass of GCM selects a correct placement for the instruction based solely on its dependence edges. GCM hoists code out of loops and pushes it into more control-dependent (and presumably less frequently executed) basic blocks.

We present GCM next, in Section 2, both because it is useful by itself and because GVN relies on it. In Section 3 we present our implementation of GVN. In Section 4 we present numerical results showing the benefits of the GVN-GCM combination over more traditional techniques (e.g., PRE). For the rest of this section we discuss related work and the intermediate representation used throughout the paper.

1.1 Related Work

Loop-invariant code motion techniques have been around awhile [1]. Aho, Sethi, and Ullman present an algorithm that moves loop-invariant code to a pre-header before the loop. Because it does not use SSA form, significant restrictions are placed on what can be moved: multiple assignments to the same variable are not hoisted. They also lift the profitability restriction and allow the hoisting of control-dependent code. They note that lifting a guarded division may be unwise, but do not present any formal method for dealing with faulting instructions.

SIGPLAN '95, La Jolla, CA USA. © 1995 ACM 0-89791-697-2/95/0006…$3.50

Partial Redundancy Elimination (PRE) [18, 13] removes partially redundant expressions when profitable. Loop-invariant expressions are considered partially redundant and, when profitable, will be hoisted to before the loop. Profitable in this case means that no path is lengthened. A computation that is used on some paths in a loop but not all paths may lengthen a path, and so is not hoisted. GCM will hoist such control-dependent expressions out of loops.

GCM will also move code from main paths into more control-dependent paths. The effect is similar to Partial Dead Code Elimination [16]: expressions that are dead on some paths but not others are moved to the latest point where they dominate their uses. In general this is very profitable; however, like all heuristics, this method sometimes lengthens a path.

PRE finds lexical congruences [18, 13] instead of value congruences. As such, the set of discovered congruences (whether they are taken advantage of or not) only partially overlaps with GVN's. Some textually-congruent expressions do not compute the same value; due to algebraic identities, many value congruences are not textually equal. In practice, GVN finds a larger set of congruences than PRE.

Local value-numbering techniques have been generalized to extended basic blocks in several compilers. GVN is a direct extension of the local hash-table-based value-numbering techniques [1, 11]. Since it is global instead of local, it finds all the redundant expressions found by the local techniques, and more.

Alpern, Wegman, and Zadeck (A/W/Z) present a top-down technique that partitions the program into congruence classes [2]. Their method will find congruences amongst sequences of instructions that form loops. Because GVN is a bottom-up technique, it cannot find such looping congruences. However, GVN can make use of algebraic identities and constant folding, which the A/W/Z technique does not. Also, GVN runs in linear time with a simple one-pass algorithm, while A/W/Z use a more complex O(n log n) partitioning algorithm. In [8], we present extensions to the A/W/Z technique that include algebraic identities and conditional constant propagation. While this algorithm finds more congruences than GVN, it is quite complex; compiler writers have to trade off the expected gain against the higher cost.

Rosen, Wegman, and Zadeck (R/W/Z) present a technique for global value numbering that finds a nearly identical set of redundant expressions [19]. However, the algorithm presented here is simpler, in part because the scheduling issues are handled by GCM. Also, GVN works on irreducible graphs; the R/W/Z approach relies heavily on program structure.

1.2 Representation

Programs are commonly represented as CFGs: directed graphs with vertices representing basic blocks (sequences of instructions) and edges representing the flow of control³ [1]. We instead represent the program at the operator level, similar in spirit to the Program Dependence Graph [15]: we convert the program into a directed graph with the vertices representing instructions and the edges representing data flow. Variable names in expressions become pointers to the instruction that defines the variable. Since assignments are renamed in SSA form, each variable is defined by exactly one instruction, and the variable becomes essentially identified with the instruction that defines it. The resulting structure is not a DAG, because SSA φ-functions are handled like other expressions and can have backedges as inputs.

φ-functions need an input that indicates under what conditions values are merged. It suffices to make φ-functions keep a pointer to the basic block they exist in. We make extensive use of use-def and def-use chains; use-def chains are explicit in the implementation, since an instruction's inputs are pointers to the defining instructions.

Both GCM and GVN rely solely on dependence for correct behavior; the original schedule is ignored (although it may be used as an input to the code-placement heuristic). To be correct, then, all dependences must be explicit. LOADs and STOREs keep a memory input: memory dependence is threaded through the instructions with an explicit memory token edge [3, 15, 23]. Faulting instructions⁴ keep an explicit control input; this allows the instruction to be scheduled in its original basic block or after, but not before.

Data instructions are treated as not residing in any particular basic block; correct behavior depends solely on the data dependence.

³ Our implementation treats basic blocks as a special kind of instruction, similar to Ferrante, Ottenstein and Warren's region nodes [15]. This allows us to simplify several optimization algorithms unrelated to this paper. Our presentation will use the more traditional CFG.
⁴ The exceptions are instructions which can cause exceptions: loads, stores, division, subroutine calls, etc.
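The graph representation just described can be made concrete. The following is our own minimal illustration, not the paper's implementation; the class and field names (Node, inputs, uses, block) are invented:

```python
class Node:
    def __init__(self, op, *inputs):
        self.op = op                # 'add', 'phi', 'const 1', ...
        self.inputs = list(inputs)  # use-def edges: pointers to defining Nodes
        self.uses = []              # def-use edges, maintained incrementally
        self.block = None           # set only for "pinned" nodes (phis, branches)
        for d in self.inputs:
            d.uses.append(self)

# The loop body of Figure 1, b := a + 1; i2 := i1 + b, as a graph.
# Variable names disappear: a use is simply a pointer to the defining node.
a  = Node('param a')
c1 = Node('const 1')
b  = Node('add', a, c1)   # b := a + 1
i1 = Node('phi')          # loop-carried value; a phi also records its block
i2 = Node('add', i1, b)   # i2 := i1 + b
```

Because both use-def and def-use chains are explicit, an optimization like GVN can remap all uses of one node to another in time proportional to the number of uses.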

Our running example is the loop below; Figure 1 shows it and the corresponding graph.

    loop:
        b  := a + 1
        i2 := i1 + b
        c  := i2 × 2
        cc := i2 < 10
        br ne loop

Figure 1: A loop, and the resulting graph to be scheduled.

An instruction, the instructions that depend on it, and the instructions on which it depends exist in a "sea" of instructions, with little control structure. The "sea" of instructions is useful for optimization, but does not represent any traditional intermediate representation such as a CFG. We need a way to serialize the graph and get back the CFG form. We do this with a simple global code motion algorithm.

2. Global Code Motion

The global code motion algorithm "schedules" the instructions by attaching them to the CFG's basic blocks. It must preserve any existing control dependence (divides, subroutine calls and other possibly faulting operations keep a dependence to their original block) and all data dependence. While satisfying these restrictions, the scheduler uses a flexible and simple heuristic: place code outside as many loops as possible, then on as few execution paths as possible.

We will use the loop in Figure 1 as a running example. The code on the left is the original program; on the right is the resulting "free-floating" graph (possibly after some optimization). Shadowed boxes are basic blocks, rounded boxes are instructions, dark lines represent control flow edges, and thin lines represent data-flow (data dependence) edges.

We use the following basic strategy:

1. Find the CFG dominator tree, and annotate basic blocks with the dominator tree depth.
2. Find loops and compute loop nesting depth for each basic block.
3. Schedule (select basic blocks for) all instructions early, based on existing control and data dependence. We place instructions in the first block where they are dominated by their inputs. This schedule has a lot of speculative code, with extremely long live ranges.
4. Schedule all instructions late. We place instructions in the last block where they dominate all their uses.
5. Between the early schedule and the late schedule we have a safe range to place computations. We choose the block that is in the shallowest loop nest possible, and then is as control dependent as possible.

We cover these points in more detail in the sections that follow.

2.1 Dominators and Loops

We find the dominator tree by running Lengauer and Tarjan's fast dominator-finding algorithm [17]. Chris Vick has an excellent write-up of the modifications and on building the dominator tree [21]. We find loops using a slightly modified Tarjan's "Testing flow graph reducibility" [22]; we used his algorithm directly.

The dominator tree and loop nesting depth give us a handle on the expected execution times of basic blocks. We use this information in deciding where to place instructions in the following sections. The running time of these algorithms is nearly linear in the size of the CFG; in practice, they can be computed quite quickly. We show the dominator tree and loop tree for our running example in Figure 2.
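As an illustration of the annotations the scheduler needs, here is a toy computation of dominator-tree depths on a CFG shaped like the running example. The paper uses the Lengauer-Tarjan algorithm and Tarjan's reducibility test; the naive fixed-point solver and hand-annotated loop depths below merely stand in for them, and all names are ours:

```python
def idoms(succ, entry):
    """Immediate dominators via simple fixed-point iteration (illustrative,
    O(n^2) or worse; Lengauer-Tarjan is the near-linear algorithm)."""
    nodes = list(succ)
    pred = {n: [] for n in nodes}
    for n, ss in succ.items():
        for s in ss:
            pred[s].append(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == entry:
                continue
            new = set(nodes)
            for p in pred[n]:
                new &= dom[p]       # dominated by what dominates every pred
            new |= {n}
            if new != dom[n]:
                dom[n], changed = new, True
    # idom(n) = the strict dominator of n with the largest dominator set
    return {n: max(dom[n] - {n}, key=lambda d: len(dom[d]))
            for n in nodes if n != entry}

cfg = {'start': ['loop'], 'loop': ['loop', 'exit'], 'exit': []}
idom = idoms(cfg, 'start')
dom_depth = {'start': 0}
for n in ['loop', 'exit']:          # visit in dominator-tree order
    dom_depth[n] = dom_depth[idom[n]] + 1
loop_depth = {'start': 0, 'loop': 1, 'exit': 0}  # back edge loop -> loop
```

With these two annotations per block, the scheduler can compare candidate blocks by loop nesting depth and by dominator depth, which is all GCM consults.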

Figure 2: Dominator tree and loop tree for the running example.

2.2 Schedule Early

Our first scheduling pass greedily places instructions as early as possible. We place instructions in the first block where they are dominated by their inputs. The only exceptions are instructions "pinned" into a specific block: PHI instructions, and BRANCH/JUMP instructions (these end specific blocks). Faulting instructions have an explicit control input, so they cannot be moved before their original basic block. This schedule has a lot of speculative code, with extremely long live ranges.

We make a post-order walk over the inputs, starting at the pinned instructions. After scheduling all the inputs to an instruction, we schedule the instruction itself, in the block of its deepest input:

    Schedule_Early( i )            // Instruction i is being visited now
      forall inputs x to i do
        Schedule_Early( x );       // Schedule all inputs early first
        // Choose the deepest dominator among the inputs' blocks
        if i.block.dom_depth < x.block.dom_depth then
          i.block := x.block;

If an instruction's inputs dominate all of its uses, this block is the earliest correct placement.

One final problem remains: PHI instructions. It is possible to write a program in which no block's computation dominates all its uses. Consider this fragment:

    1.    int first := TRUE, x, sum0 := 0;
    2.    while( pred() ) {
    2.1     sum1 := φ( sum0, sum2 );
    3.      if( first ) {
    4.        first := FALSE;
    5.        x := ...;
            }
    5.1     sum2 := sum1 + x;
          }

The PHI instruction merging the definitions of x is critical. With the PHI instruction removed, the computation of x may be scheduled early, in block 2; however, block 2 does not dominate the block holding the use, and the value is live on exit from it. In fact, there is no single block where the inputs to the φ dominate the φ's uses, and thus no legal schedule. The PHI instruction tells us that it is all right: the programmer wrote code such that it appears we can use x before every path defines it (in essence, the declaration of x initializes it), and the φ makes the merging explicit, pinned in the block where the values merge. These PHI instructions dominate the uses of x. PHI instructions can arise outside loops as well.
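The Schedule_Early pseudocode can be rendered as runnable code. This is a sketch under our own naming assumptions (the Node and Block classes, and the pinned flag, are invented; the paper gives only the pseudocode):

```python
class Block:
    def __init__(self, dom_depth):
        self.dom_depth = dom_depth

class Node:
    def __init__(self, block, pinned, *inputs):
        self.block, self.pinned, self.inputs = block, pinned, list(inputs)

def schedule_early(i, visited):
    """Place i in the first block where it is dominated by its inputs."""
    if i in visited:
        return
    visited.add(i)
    for x in i.inputs:
        schedule_early(x, visited)       # schedule all inputs first
        # the deepest dominator among the inputs' blocks wins
        if not i.pinned and i.block.dom_depth < x.block.dom_depth:
            i.block = x.block

# start (depth 0) dominates the loop header (depth 1): a + 1 can float
# to start, but i1 + b must stay with the phi that feeds it.
start, loop = Block(0), Block(1)
a   = Node(start, True)            # parameter, pinned at start
one = Node(start, True)            # constant, placed at start here
b   = Node(start, False, a, one)   # b := a + 1, unpinned (begins at the root)
i1  = Node(loop, True)             # phi, pinned in the loop header
i2  = Node(start, False, i1, b)    # i2 := i1 + b
schedule_early(i2, set())
```

Note that unpinned nodes begin at the root block, so the pass only ever moves them deeper, toward their inputs.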



After scheduling, the multiply and its constant have been moved out of the loop, past the exit test:

    loop:
        ...
        cc := i2 < 10
        branch ne loop

        c2 := 2
        c  := i2 × c2
        return c

Figure 7: Our loop example, after scheduling.

3.1 Value-Numbering

Value-numbering attempts to assign the same value number to equivalent instructions. The bottom-up hash-table-based approach is amenable to a number of extensions which are commonly implemented in local value-numbering algorithms; among these are constant folding and recognizing algebraic identities. When we value-number an instruction, we perform the following tasks:

1) Determine if the instruction computes a constant result. If it does, replace the instruction with one that directly computes the constant. Instructions that compute constants are value-numbered by the hash table below, so all instructions generating the same constant will eventually fold into a single instruction. Instructions can generate constants for two reasons: either all their inputs are constant, or one of their inputs is a special constant. Multiplication by zero is such a case; there are many others, and they all interact in useful ways.⁸ While programmers occasionally write redundant code, more often this arises out of a compiler's front end during the conversion of array addressing expressions into machine instructions.

2) Determine if the instruction is an algebraic identity on one of its inputs. Again, these arise for several reasons: either the instruction is a trivial COPY, or one input is a special constant (an add of zero), or both inputs are the same (a MAX of equal values). We replace uses of such instructions with uses of the copied value. The current instruction is then dead and can be removed.

3) Determine if the instruction computes the same value as some previous instruction. We do this by hashing the instruction's operation and inputs (which together uniquely determine the value produced by the instruction) and looking in a hash table. If we find a hit, then we replace uses of this instruction with uses of the previous instruction; again, we remove the redundant instruction. If the hash table lookup misses, then the instruction computes a new value, and we insert it in the table. Commutative instructions use a hashing (and compare) function that ignores the order of inputs.

⁸ By treating basic blocks as a special kind of instruction, we are able to constant-fold basic blocks like other instructions. Constant folding a basic block means removing constant tests and unreachable control flow edges, which in turn simplifies the dependent φ-functions.
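The three tasks above can be sketched as a single value-numbering function. This is our own minimal model, not the paper's code: values are canonical tuples, ('const', c) marks a constant, and only add/mul folds and identities are shown:

```python
table = {}                 # the global hash table of known values
COMMUTE = {'add', 'mul'}   # commutative ops hash their inputs order-blind

def const(v):
    return ('const', v)

def vn(op, *args):
    # Task 1: constant folding -- all-constant inputs, or a special constant.
    if all(a[0] == 'const' for a in args):
        if op == 'add': return const(args[0][1] + args[1][1])
        if op == 'mul': return const(args[0][1] * args[1][1])
    if op == 'mul' and const(0) in args:
        return const(0)                        # x * 0 folds to 0
    # Task 2: algebraic identities -- the result IS one of the inputs.
    if op == 'add' and const(0) in args:
        return args[0] if args[1] == const(0) else args[1]
    if op == 'mul' and const(1) in args:
        return args[0] if args[1] == const(1) else args[1]
    # Task 3: hash lookup on (op, inputs); a hit yields the prior value.
    key = (op,) + tuple(sorted(args)) if op in COMMUTE else (op,) + args
    return table.setdefault(key, key)

x = ('param', 'x')
```

Because a hit simply returns the previously inserted value, remapping uses of the redundant instruction corresponds to following def-use chains in the real graph.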

3.2 Example

We present an extended example in Figure 8, which we copied from the Rosen, Wegman and Zadeck example [19]. The original program fragment is on the left and the SSA form is on the right. We assume every variable is live on exit from the fragment. For space reasons we do not show the program in graph form.

    read(a, b, c, d, l, m, s, t)        read(a0, b0, c0, d0, l0, m0, s0, t0)
    while( ... ) {                      loop header:
      if( ... ) {                         a1 := φ(a0, a3)    d1 := φ(d0, d3)
        l := c × b                        m1 := φ(m0, m3)    s1 := φ(s0, s3)
        m := l + 4                        t1 := φ(t0, t3)
        a := c                          then arm:
      } else {                            l1 := c0 × b0
        d := c                            m2 := l1 + 4
        l := d × b                        a2 := c0
        s := a × b                      else arm:
        t := s + 1                        d2 := c0
      }                                   l2 := d2 × b0
      x := a × b                          s2 := a1 × b0
      y := x + 1                          t2 := s2 + 1
    }                                   merge:
                                          a3 := φ(a2, a1)    d3 := φ(d1, d2)
                                          l3 := φ(l1, l2)    m3 := φ(m2, m1)
                                          s3 := φ(s1, s2)    t3 := φ(t1, t2)
                                          x0 := a3 × b0
                                          y0 := x0 + 1

Figure 8: A larger example, also in SSA form.

We present the highlights of running GVN here, visiting the instructions in reverse post-order. The read() call obviously produces a new value number. The φ-functions in the loop header block (a1, d1, m1, s1, and t1) are not any algebraic identities and they do not compute constants; they all produce new value numbers, as does "l1 := c0 × b0" and "m2 := l1 + 4". Next, a2 is an identity on c0. We follow the def-use chains to remap uses of a2 into uses of c0. Running along the other side of the conditional, we discover that we can replace d2 with c0. Then "l2 := c0 × b0" has the same value number as l1, and we replace uses of l2 with l1. The variables s2 and t2 produce new values. In the merge after the conditional we find "l3 := φ(l1, l1)", which is an identity on l1; we replace uses of l3 occurring after this with l1. The other φ-functions all produce new values.

At this point the program is incorrect; we require a round of GCM to move "l1 := c0 × b0" to a point that dominates its uses. GCM moves the computations of l1, m2, x0, and y0 out of the loop. The final code, still in SSA form, is shown in the middle of Figure 9. Coming out of SSA form requires a new variable name (the copy m := m2), as shown on the right in Figure 9.

Figure 9: After GVN on the left, after GCM in the middle, and back out of SSA form on the right.

4. Experiments

We converted a large test suite into a low-level intermediate language, ILOC [6, 5]. The translation is very naive and is intended to be optimized. We then performed several machine-independent optimizations at the ILOC level. We ran the resulting optimized ILOC on a simulator to collect execution cycle counts. All applications ran to completion on the simulator and produced correct results.

ILOC is a virtual assembly language. The current simulator defines a machine with an infinite register set and 1-cycle latencies for all operations, including LOADs, STOREs and JUMPs. A single register can not be used for both integer operators and floating-point operations (conversion operators exist). ILOC is based on a load-store architecture; there are no memory operations other than LOAD and STORE. ILOC includes the usual assortment of logical, arithmetic and floating-point operations.

The simulator is implemented by translating ILOC into C. The C code is then compiled and linked as a native program. The executed code contains annotations that collect statistics on the number of times each routine is called and the dynamic cycle count.

ILOC clearly represents an unrealistic machine model. We argue that while the simulated latencies are overly simplistic, they make measuring the separate effects of optimizations possible. We are looking at machine-independent optimization; were we to include memory hierarchy effects, with long latencies and a fixed number of functional units, we would obscure the effects we are trying to measure.

4.1 The Experiments

The test suite consists of a dozen applications with a total of 52 procedures. All the applications are written in FORTRAN, except for cplex, a large constraint solver written in C. Doduc, tomcatv, matrix300 and fpppp are from the Spec89 benchmark suite.⁹ The remaining procedures come from the Forsythe, Malcolm and Moler suite of routines [14].

All experiments start by shaping the CFG: splitting control-dependent edges and inserting loop landing pads. All experiments end with a round of Dead Code Elimination (DCE) [12]. Most of the phases use SSA form internally and insert a large number of φ-functions and variable copies on going out of SSA at the end. We followed the optimization with a round of coalescing to remove copies [4, 9]. This empties some basic blocks that only hold copies; we also split control-dependent edges, and some of those blocks go unused. We ran a pass to remove the empty blocks. Besides the optimizations already discussed, we also used the following:

CCP: Conditional Constant Propagation is an implementation of Wegman and Zadeck's algorithm [24]. It simultaneously removes unreachable code and replaces computations of constants with loads of constants. After constant propagation, CCP folds algebraic identities that use constants, such as x × 1 and x + 0, into simple copy operations.

GCF: Global Congruence Finding implements Alpern, Wegman, and Zadeck's global congruence algorithm [2]. Instead of using dominance information, available-expressions (AVAIL) information is used to remove redundant computations. AVAIL is stronger than dominance, so more computations are removed [20].

PRE: Partial Redundancy Elimination implements Morel and Renvoise's algorithm. PRE is discussed in Section 1.1.

⁹ At the time, our tools were not up to handling the entire Spec suite.

Reassociation: This phase reassociates and redistributes integer expressions, allowing other optimizations to find more loop-invariant code [6]. This is important for addressing expressions inside loops, where the rapidly varying innermost index may cause portions of a complex address to be recomputed on each iteration. Reassociation can allow all the invariant parts to be hoisted, leaving behind only an expression of the form "base + index × stride" in the loop.

Our experiments all follow this general strategy:

1. Convert from FORTRAN or C to naive ILOC. Reshape the CFG.
2. Optionally reassociate expressions.
3. Run either GVN-GCM or GCF-PRE-CCP, once or twice.
4. Clean up with DCE, coalescing and removing empty blocks.
5. Translate the ILOC to C with the simulator, then compile and link.
6. Run the compiled application, collecting statistics.

Variations on this strategy are outlined below in each experiment.

4.2 GVN-GCM vs. GCF-PRE-CCP

We optimized with the GVN-GCM combination and compared it to running one pass each of GCF, PRE and CCP. This represents an aggressive optimization strategy with reasonable compile times. As before, we finish up with a pass of DCE, coalescing and removing empty blocks. The table in Figure 10 shows the results. The first two columns are the application and procedure name; the third column is cycles for GVN-GCM; the fourth column is cycles for one pass of GCF-PRE-CCP. The percentage speedup is in the fifth column.

4.3 GVN-GCM vs. GCF-PRE-CCP repeated

We compared the GVN-GCM combination against two passes of GCF, PRE and CCP. We ran DCE after each set, the final sequence being GCF, PRE, CCP, DCE, GCF, PRE and CCP. The sixth column in Figure 10 shows the runtime cycles for the repeated passes, and column seven shows the percentage speedup over GVN-GCM. Roughly half the gain over a single pass of GCF-PRE-CCP is lost. This indicates that the second pass of GCF and PRE could make use of constants found and code removed by CCP.

Each of GCF, PRE and CCP can find some congruences GVN cannot. However, constants discovered with GVN can help find congruences, and vice versa; the separate passes have a phase-ordering problem, since constants found with CCP cannot then be used to find congruences in an earlier pass. In addition, GCM is more aggressive than PRE in hoisting loop-invariant expressions.

4.4 Reassociate, then GVN-GCM vs. GCF-PRE-CCP

We first reassociate expressions. Then we ran the GVN-GCM combination and compared it to running one pass of GCF, PRE and CCP. As before, we finish up with a pass of DCE, coalescing and removing empty blocks. In Figure 10, column eight shows the runtime cycles for reassociating, then GVN-GCM. Column nine is reassociation, then GCF-PRE-CCP. Column ten is the percentage speedup. The average speedup is 5.9% instead of 4.3%: reassociation provides more opportunities for GVN-GCM than it does for GCF-PRE-CCP.

4.5 Reassociate, then GVN-GCM vs. GCF-PRE-CCP repeated

We first reassociate expressions. Then we ran the GVN-GCM combination and compared it to running two passes of GCF, PRE and CCP. As before, we finish up with a pass of DCE, coalescing and removing empty blocks. Column eleven in Figure 10 shows the runtime cycles for the repeated passes, and column twelve shows the percentage speedup over GVN-GCM. Again, roughly half the gain over a single pass of GCF-PRE-CCP is lost, indicating that the second pass of GCF and PRE could make use of constants found and code removed by CCP.

5. Conclusions

By separating the code motion issues from the machine-independent optimizations, we are able to write a particularly simple and fast implementation of Global Value Numbering. Besides finding the redundant expressions, GVN folds constants and algebraic identities. After running GVN we are required to do Global Code Motion. Besides correcting the schedule produced by GVN, GCM also moves code out of loops and down into branches. GCM requires the dominator tree and loop nesting depth; it then makes two linear-time passes over the program to select basic blocks. Running time is essentially linear in the program size, and it is quite fast in practice. GVN requires one linear-time pass over the program and is also quite fast in practice. Together these optimizations provide a net gain over using GCF, PRE and CCP, and are very simple to implement.

Figure 10: Runtime cycles and percentage speedups, per procedure, for the four experiments of Sections 4.2–4.5 (applications include doduc, fpppp, rkf45, matrix300, cplex, tomcatv, svd, zeroin, fmin, and the Forsythe, Malcolm and Moler routines). (Per-procedure table data omitted.)
8.3% 5.7% 4.8% 4.8% 8.1% 96% 3.3% 13.2% 6.4% -2,4% -5 6% 7.470 9 o% 0 1% 0.0% 1 3% 5.6% 3.0% 34% 8.0% 2.5% 3.2% 3.1% 8.8% 129% 1,6% 14.4% 1,7% 4.5% -o 1% 5.4% 5.8% 1 8% 1.0% 0.4% 0 1% 0 3% 0.09’0 0 o% 0 o% 0.0% -9 o% -11.5’% -1 4% -23 2% 0.0$%0 -12 9% -13.0% 10 7% -6.Z?ZO

Bibliography

[1] A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.

[2] B. Alpern, M. N. Wegman, and F. K. Zadeck. Detecting equality of variables in programs. In Conference Record of the Fifteenth ACM Symposium on the Principles of Programming Languages, Jan. 1988.

[3] R. A. Ballance, A. B. Maccabe, and K. J. Ottenstein. The program dependence web: A representation supporting control-, data-, and demand-driven interpretation of imperative languages. In Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation, June 1990.

[4] P. Briggs. Register Allocation via Graph Coloring. Ph.D. thesis, Rice University, 1992.

[5] P. Briggs and K. Cooper. Effective partial redundancy elimination. In Proceedings of the SIGPLAN '94 Conference on Programming Language Design and Implementation, June 1994.

[6] P. Briggs and T. Harvey. Iloc '93. Technical report CRPC-TR93323, Rice University, 1993.

[7] P. Briggs et al. The Massively Scalar Compiler Project. Unpublished report. Preliminary version available via ftp://cs.rice.edu/public/preston/optimizer/shared.ps. Rice University.

[8] R. Cartwright and M. Felleisen. The semantics of program dependence. In Proceedings of the SIGPLAN '89 Conference on Programming Language Design and Implementation, June 1989.

[9] G. J. Chaitin. Register allocation and spilling via graph coloring. In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, June 1982.

[10] C. Click. Combining Analyses, Combining Optimizations. Ph.D. thesis, Rice University, 1995.

[11] J. Cocke and J. T. Schwartz. Programming languages and their compilers. Courant Institute of Mathematical Sciences, New York University, April 1970.

[12] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. An efficient method of computing static single assignment form. In Conference Record of the Sixteenth ACM Symposium on the Principles of Programming Languages, Jan. 1989.

[13] K. H. Drechsler and M. P. Stadel. A solution to a problem with Morel and Renvoise's "Global optimization by suppression of partial redundancies". ACM Transactions on Programming Languages and Systems, 10(4):635-640, Oct. 1988.

[14] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, 9(3):319-349, July 1987.

[15] G. E. Forsythe, M. A. Malcom, and C. B. Moler. Computer Methods for Mathematical Computations. Prentice-Hall, Englewood Cliffs, New Jersey, 1977.

[16] J. Knoop, O. Rüthing, and B. Steffen. Partial dead code elimination. In Proceedings of the SIGPLAN '94 Conference on Programming Language Design and Implementation, June 1994.

[17] T. Lengauer and R. E. Tarjan. A fast algorithm for finding dominators in a flowgraph. ACM Transactions on Programming Languages and Systems, 1(1):121-141, July 1979.

[18] E. Morel and C. Renvoise. Global optimization by suppression of partial redundancies. Communications of the ACM, 22(2):96-103, Feb. 1979.

[19] B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Global value numbers and redundant computations. In Conference Record of the Fifteenth ACM Symposium on the Principles of Programming Languages, Jan. 1988.

[20] T. Simpson. Global Value Numbering. Unpublished report. Available via ftp://cs.rice.edu/public/preston/optimizer/gval.ps. Rice University, 1994.

[21] R. E. Tarjan. Testing flow graph reducibility. Journal of Computer and System Sciences, 9:355-365, 1974.

[22] C. Vick. SSA-Based Reduction of Operator Strength. Masters thesis, Rice University, 1994.

[23] M. N. Wegman and F. K. Zadeck. Constant propagation with conditional branches. ACM Transactions on Programming Languages and Systems, 13(2):181-210, April 1991.

[24] D. Weise, R. Crew, M. Ernst, and B. Steensgaard. Value dependence graphs: Representation without taxation. In Proceedings of the 21st ACM Symposium on Principles of Programming Languages, 1994.