Global Instruction Scheduling for SuperScalar Machines

David Bernstein    Michael Rodeh
IBM Israel Scientific Center
Technion City
Haifa 32000, ISRAEL
Abstract

To improve the utilization of machine resources in superscalar processors, the instructions have to be carefully scheduled by the compiler. As internal parallelism and pipelining increases, it becomes evident that scheduling should be done beyond the basic block level. A scheme for global (intra-loop) scheduling is proposed, which uses the control and data dependence information summarized in a Program Dependence Graph, to move instructions well beyond basic block boundaries. This novel scheduling framework is based on the parametric description of the machine architecture, which spans a range of superscalar and VLIW machines, and exploits speculative execution of instructions to further enhance the performance of the code. We have implemented our algorithms in the IBM XL family of compilers and have evaluated them on the IBM RISC System/6000 machines.

Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Ontario, Canada, June 26-28, 1991.
© 1991 ACM 0-89791-428-7/91/0005/0241...$1.50

1. Introduction

Starting in the late seventies, a new approach for building high speed processors emerged which emphasizes streamlining of program instructions; subsequently, this direction in computer architecture was called RISC [P85]. It turned out that in order to take advantage of pipelining so as to improve performance, instructions have to be rearranged, usually at the intermediate language or assembly code level. The burden of such transformations, called instruction scheduling, has been placed on optimizing compilers.

Previously, scheduling algorithms at the instruction level were suggested for processors with several functional units [BG89, BRG89, BJR89], pipelined machines [HG83, GM86, W90] and Very Long Instruction Word (VLIW) machines [E85]. While for machines with n functional units the idea is to be able to execute as many as n instructions each cycle, for pipelined machines the goal is to issue a new instruction every cycle, effectively eliminating the so-called NOPs (No Operations). However, for both types of machines, the common feature required from the compiler is to discover in the code instructions that are data independent, allowing the generation of code that better utilizes the machine resources.

It was a common view that such data independent instructions can be found within basic blocks, and that there is no need to move instructions beyond basic block boundaries.
Virtually, all of the previous work on the implementation of instruction scheduling concentrated on scheduling within basic blocks [HG83, GM86, W90]. Even for basic RISC architectures, such restricted type of scheduling may result in code with many NOPs for certain Unix¹-type programs that include many unpredictable branches and small basic blocks. On the other hand, for scientific programs the problem is not so severe, since there, basic blocks tend to be larger, and vectorization techniques were established to expose the parallelism of such computations.

Recently, a new type of architecture is evolving that extends RISC by the ability to issue more than one instruction per cycle [GO89]. This type of high speed processors, called superscalar or superpipelined machines, poses more serious challenges to optimizing compilers, since instruction scheduling at the basic block level is in many cases not sufficient to allow generation of code that utilizes the machine resources to a desired extent [JW89]. There, the number of data independent instructions within a basic block is likely to be too small, and code generated on such assumption is likely to underutilize the machine. However, global scheduling, which is capable of taking advantage of instructions well beyond basic block boundaries whenever they are available, can do better.

One recent effort to pursue instruction scheduling for superscalar machines beyond the scope of basic blocks was reported in [GR90], where code duplication techniques were investigated, resulting in fair improvements of the running time of the compiled code. Also, one can view a superscalar machine as a VLIW machine with a small number of resources. Two main approaches for scheduling code for the VLIW machines were reported in the literature: trace scheduling [F81, E85], which assumes the existence of a main trace in the program (computed by profiling), an assumption which may not be true in many cases, and the enhanced percolation scheduling [EN89]. As for the enhanced percolation scheduling, our opinion is that it is more targeted towards machines with a large number of computational units, like VLIW machines, where instructions may have to be duplicated while being scheduled.

In this paper, we present a technique for global instruction scheduling for superscalar machines that permits the movement of instructions well beyond basic block boundaries, within the scope of the enclosing loop. We suggest combining a powerful scheduling framework with the parametric description of the target machine. The method employs a novel data structure, called the Program Dependence Graph (PDG), that was recently proposed by Ferrante et al. [FOW87] to be used in compilers for parallel machines. Using the control and data dependence information available in the PDG, we distinguish between several levels of scheduling; in particular, we consider the speculative execution of instructions, whose effect on performance depends on the probability of branches to be taken.

Since we are currently interested in machines with a small number of functional units (like the RISC System/6000 machines), a conservative approach to scheduling is taken. First we try to exploit the machine resources with useful instructions; next we consider speculative instructions, and we identify the cases where such speculative scheduling might be profitable. Also, code duplication is not employed, since duplication may increase the code size, incurring additional costs in terms of instruction cache misses. A more aggressive type of instruction scheduling, in which the execution of instructions that belong to different iterations of the loop may overlap (often called software pipelining [L88]), is left for future work.

For speculative instructions, it was previously suggested that they have to be supported by the machine architecture [E88, SLH90]. However, such architectural support carries a significant run-time overhead; here we are evaluating compile-time techniques for speculative execution that do not depend on such support, while retaining most of the performance effect promised by it.

We have implemented our scheme in the context of the IBM XL family of compilers for the IBM RISC System/6000 (RS/6K for short) computers, and preliminary performance results are reported.

¹ Unix is a trademark of AT&T Bell Labs.
The rest of the paper is organized as follows. In Section 2 we describe our generic machine model and show how it is applicable to the RS/6K machines. Then, in Section 3 we bring a program that will serve as a running example throughout this paper. In Section 4 we discuss the usefulness of the PDG for global instruction scheduling. In Section 5 several levels of scheduling, including speculative execution, are presented, while in Section 6 we bring performance results and conclude in Section 7.

2. Parametric machine description

We view a superscalar machine as a collection of functional units of m types, where the machine has n1, n2, ..., nm units of each type. Each instruction in the code can be potentially executed by any one of the units of its specified type. Our model is based on the description of a typical pipelined RISC processor whose only instructions that reference memory are load and store instructions, while all the computations are done in registers.

The data dependence constraints between instructions are modelled by integral delays assigned to the edges of the data dependence graph. Let I1 and I2 be two instructions such that (I1,I2) is a data dependence edge, let t (t >= 1) be the execution time of I1, and let d (d >= 0) be the delay assigned to (I1,I2). For performance purposes, if I1 is scheduled to start at time k, then I2 should be scheduled to start no earlier than k + t + d. Notice, however, that if I2 is scheduled to start earlier than mentioned above, this would affect some performance characteristics of the program but not its correctness, since we assume that the machine implements hardware interlocks to guarantee the delays at run time. More information about the notion of delays due to pipelined constraints can be found in [BG89, BRG89].

Throughout this paper we will not deal with register allocation at all. For instruction scheduling purposes, we assume that there is an unbounded number of symbolic registers, and that the scheduling is done before the register allocation phase of the compiler is completed. Subsequently, symbolic registers are mapped onto the real registers in the machine, using one of the standard (coloring) register allocation algorithms. For a discussion of the relationships between register allocation and instruction scheduling see [BEH89].
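The timing constraint above can be made concrete in a few lines of code. The sketch below is ours, not the paper's: it computes earliest start times over a list of dependence edges, where the field `t` plays the role of the execution time and `d` the role of the delay, and functional-unit contention is ignored.

```c
#define N 4 /* number of instructions in this small sketch */

/* A data dependence edge (I1,I2) with execution time t of I1 and
 * delay d assigned to the edge: if I1 starts at cycle k, then I2
 * may start no earlier than k + t + d. */
struct edge { int from, to, t, d; };

/* Earliest start times under the dependence constraints only.
 * Edges are relaxed repeatedly; N passes suffice for an acyclic
 * dependence graph on N instructions. */
void earliest_starts(const struct edge *e, int m, int start[N]) {
    for (int i = 0; i < N; i++) start[i] = 0;
    for (int pass = 0; pass < N; pass++)
        for (int j = 0; j < m; j++) {
            int lb = start[e[j].from] + e[j].t + e[j].d;
            if (start[e[j].to] < lb) start[e[j].to] = lb;
        }
}
```

For instance, a delayed load modelled as an edge with t = 1 and d = 1 forces its consumer to start no earlier than cycle 2, matching the k + t + d rule.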
2.1 The RS/6K machine model

Here we show how our generic model of a superscalar machine is configured to fit the RS/6K machine. The RS/6K processor is modelled as follows:

- m = 3: there are three types of functional units: fixed point, floating point and branch types.
- n1 = 1, n2 = 1, n3 = 1: there is a single fixed point unit, a single floating point unit and a single branch unit.
- Most of the instructions are executed in one cycle; however, there are also multi-cycle instructions, like multiplication, division, etc. Conceptually, the delays are marked on the corresponding data dependence edges. There are four main types of delays:
  - a delay of one cycle between a load instruction and the instruction that uses its result register (delayed load);
  - a delay of three cycles between a fixed point compare and the branch instruction that uses its result;²
  - a delay of one cycle between a floating point compare and the branch instruction that uses its result;
  - a delay of five cycles between a floating point instruction and the instruction that uses the result of that instruction.

In this paper we concentrate on fixed point computations only; therefore, the floating point delays will not be considered in the discussion.

3. A program example

Next, we present a small program that computes the maximum and the minimum of an array; it will serve us as a running example. This program, written in C, is shown in Figure 1. In this program, the elements of the array a are fetched and compared, two elements in every iteration of the loop: first, the two new elements of a are compared one to another (if (u > v)), and subsequently the larger is compared to the maximum and the smaller to the minimum, updating the max and min variables if needed.

    /* find the largest and the smallest number in a given array */
    minmax(a,n)
    {
      int i,u,v,min,max,n,a[SIZE];
      min=a[0]; max=min;
      i=1;
      /****************** LOOP STARTS ******************/
      while (i<n-1) {
        u=a[i]; v=a[i+1];
        if (u>v) {
          if (u>max) max=u;
          if (v<min) min=v;
        }
        else {
          if (v>max) max=v;
          if (u<min) min=u;
        }
        i=i+2;
      }
      /****************** LOOP ENDS ********************/
      ...
    }

    Figure 1. A program for finding the maximum and the minimum of an array

The pseudo-code that corresponds to the real code created by the IBM XL-C compiler³ for the loop of the program of Figure 1 is presented in Figure 2. For convenience, we number the instructions of Figure 2 (I1-I20) and annotate them with the matching statements of Figure 1. Also, we mark the ten basic blocks (BL1-BL10) of which the code is comprised. For simplicity of notation, the registers mentioned in the code are real; however, at this stage of the compilation there is an unbounded number of symbolic registers. As was mentioned in Section 2, we prefer to invoke the global scheduling algorithm before the register allocation, even though the effect of register allocation on scheduling is secondary for the purposes of this paper.

    max is kept in r30, min is kept in r28, i is kept in r29,
    n is kept in r27, address of a[i] is kept in r31

    . . . more instructions here . . .
    ****************** LOOP STARTS ******************
    CL.0:
    (I1)  L  r12=a(r31,4)        u = a[i]        BL1
    (I2)  LU r0,r31=a(r31,8)     v = a[i+1]
    (I3)  C  cr7=r12,r0          u > v
    (I4)  BF CL.4,cr7,0x2/gt                     END BL1
    (I5)  C  cr6=r12,r30         u > max         BL2
    (I6)  BF CL.6,cr6,0x2/gt                     END BL2
    (I7)  LR r30=r12             max = u         END BL3
    CL.6:
    (I8)  C  cr7=r0,r28          v < min         BL4
    (I9)  BF CL.9,cr7,0x1/lt                     END BL4
    (I10) LR r28=r0              min = v         BL5
    (I11) B  CL.9                                END BL5
    CL.4:
    (I12) C  cr6=r0,r30          v > max         BL6
    (I13) BF CL.11,cr6,0x2/gt                    END BL6
    (I14) LR r30=r0              max = v         END BL7
    CL.11:
    (I15) C  cr7=r12,r28         u < min         BL8
    (I16) BF CL.9,cr7,0x1/lt                     END BL8
    (I17) LR r28=r12             min = u         END BL9
    CL.9:
    (I18) AI r29=r29,2           i = i + 2       BL10
    (I19) C  cr4=r29,r27         i < n
    (I20) BT CL.0,cr4,0x1/lt                     END BL10
    ****************** LOOP ENDS ********************
    . . . more instructions here . . .

    Figure 2. The pseudo-code for the loop of the program of Figure 1

Notice that the code of Figure 2 executes in 20, 21 or 22 cycles per iteration, depending on whether 0, 1 or 2 updates of the max and min variables are required.

² More precisely, usually the three cycle delay between a fixed point compare and the respective branch instruction is encountered only when the branch is taken. However, here for simplicity we assume that such delay exists whether the branch is taken or not.

³ The only feature of the machine that was disabled in this example is that of keeping the iteration variable of the loop in a special counter register. Keeping the iteration variable in this register allows it to be decremented and tested for zero in a single instruction, effectively reducing the overhead for loop control instructions.
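Figure 1's loop can be packaged as a runnable function for reference; this version is a self-contained restatement (the pointer-based signature is ours, not the paper's), useful for checking that the two-elements-per-iteration logic matches a naive scan.

```c
/* Restatement of the minmax loop of Figure 1: elements are fetched
 * and compared two at a time; the larger of each pair is tested
 * against max, the smaller against min. As in the figure, the loop
 * runs while i < n-1; the figure elides the remaining cleanup code. */
void minmax(const int *a, int n, int *pmin, int *pmax) {
    int min = a[0], max = min;
    int i = 1;
    while (i < n - 1) {
        int u = a[i], v = a[i + 1];
        if (u > v) {
            if (u > max) max = u;
            if (v < min) min = v;
        } else {
            if (v > max) max = v;
            if (u < min) min = u;
        }
        i = i + 2;
    }
    *pmin = min;
    *pmax = max;
}
```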
To make it formal, let I and J be two instructions that are currently available to be scheduled at the same time in the scheduling process, and examine the heuristic functions, as they were defined above, as follows:

4. If D(I) > D(J), then pick I; if D(J) > D(I), then pick J. (For instructions of the same class, we pick one that has the biggest delay heuristic.)
5. If CP(I) > CP(J), then pick I; if CP(J) > CP(I), then pick J. (For instructions of the same class and delay, we pick one that has the biggest critical path heuristic.)
7. Pick the instruction that occurred first in the original ordering of the code.

By the last rule we try to preserve the original ordering of the program, since it is considered that the current ordering is tuned towards the machine resources.

It turns out that for speculative instructions the data dependence information is not sufficient; information about the control flow of the program has to be maintained as well. Examine the following excerpt of a program:

    if (cond) x=5; else x=3;
    ...
    printf("x=%d", x);

Assume that the condition is evaluated in basic block B1, x=5 belongs to B2, x=3 belongs to B3, while the value of x should be printed in B4. Each of these instructions can be (speculatively) moved into B1, but it is apparent that both of them are not allowed to move there, since a wrong value may be printed in B4. Notice that data dependences do not prevent the movement of both of them into B1.

To solve this problem, we maintain the information about the (symbolic) registers that are live on exit from a basic block. If an instruction that is being considered to be moved speculatively to a block B computes a new value for a register that is live on exit from B, such speculative movement is disallowed. Thus, if, let us say, x=5 is first moved into B1, then x (or actually a symbolic register that has to carry its value) becomes live on exit from B1, and the movement of x=3 into B1 will be prevented.

Notice that this type of information has to be maintained dynamically, i.e., after each motion of a speculative instruction this information has to be updated. Also, during the scheduling process we schedule a useful instruction before a speculative one, even though the speculative instruction may have a bigger delay. In any case, more experimentation and tuning of the heuristics are needed for better results.
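The tie-breaking rules and the live-on-exit test above can be sketched in a few lines of C. Everything here is a schematic restatement rather than the paper's implementation: the fields `D` and `CP` stand for the delay and critical path heuristic functions, and a bitset of registers is a stand-in for whatever live-register representation the compiler uses.

```c
/* Schematic instruction record for the scheduling heuristics:
 * D (delay heuristic), CP (critical path heuristic), the position
 * in the original ordering, and the registers it defines. */
struct instr {
    int D, CP, orig_pos;
    unsigned defs; /* bitset of (symbolic) registers defined */
};

/* Rules 4-7: prefer the bigger delay heuristic, then the bigger
 * critical path, then the original program order. */
const struct instr *pick(const struct instr *i, const struct instr *j) {
    if (i->D != j->D)   return i->D > j->D ? i : j;    /* rule 4 */
    if (i->CP != j->CP) return i->CP > j->CP ? i : j;  /* rule 5 */
    return i->orig_pos < j->orig_pos ? i : j;          /* rule 7 */
}

/* Speculative movement of `ins` into block B is disallowed if it
 * defines a register that is live on exit from B. */
int speculation_allowed(const struct instr *ins, unsigned live_on_exit_b) {
    return (ins->defs & live_on_exit_b) == 0;
}
```

In the x=5/x=3 example, once x=5 is moved into B1 its target register joins B1's live-on-exit set, so the same query for x=3 (which defines the same register) then returns 0.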
5.4 Scheduling examples

Let us demonstrate the effect of useful and speculative scheduling on the example of Figure 2. The result of scheduling the program of Figure 2 for useful instructions only is presented in Figure 5. During the useful scheduling, the only instructions moved into BL1 were those of BL10, since only BL10 belongs to EQUIV(BL1): two instructions of BL10 (I18 and I19) were moved to BL1, filling the delay slots of the loads. Similarly, I8 was moved from BL4 to BL2, and I15 from BL8 to BL6. The resultant program in Figure 5 takes 12-13 cycles per iteration, while the original program of Figure 2 was executing in 20-22 cycles per iteration.

    . . . more instructions here . . .
    ****************** LOOP STARTS ******************
    CL.0:
    (I1)  L  r12=a(r31,4)
    (I2)  LU r0,r31=a(r31,8)
    (I18) AI r29=r29,2
    (I3)  C  cr7=r12,r0
    (I19) C  cr4=r29,r27
    (I4)  BF CL.4,cr7,0x2/gt
    (I5)  C  cr6=r12,r30
    (I8)  C  cr7=r0,r28
    (I6)  BF CL.6,cr6,0x2/gt
    (I7)  LR r30=r12
    CL.6:
    (I9)  BF CL.9,cr7,0x1/lt
    (I10) LR r28=r0
    (I11) B  CL.9
    CL.4:
    (I12) C  cr6=r0,r30
    (I15) C  cr7=r12,r28
    (I13) BF CL.11,cr6,0x2/gt
    (I14) LR r30=r0
    CL.11:
    (I16) BF CL.9,cr7,0x1/lt
    (I17) LR r28=r12
    CL.9:
    (I20) BT CL.0,cr4,0x1/lt
    ****************** LOOP ENDS ********************
    . . . more instructions here . . .

    Figure 5. The result of applying the useful scheduling to the program of Figure 2

Figure 6 shows the result of applying both useful and speculative scheduling to the same program. In addition to the motions that were described above, two additional instructions (I5 and I12) were moved speculatively to BL1, filling the three cycle delay between I3 and I4; I5 was moved from BL2, while I12 was moved from BL6. Interestingly enough, I5 and I12 belong to basic blocks that are never executed together in any single execution of the program; in every iteration, only one of these two instructions will carry a useful result. All in all, the program in Figure 6 takes 11-12 cycles per iteration, a one cycle improvement over the program of Figure 5.

    . . . more instructions here . . .
    ****************** LOOP STARTS ******************
    CL.0:
    (I1)  L  r12=a(r31,4)
    (I2)  LU r0,r31=a(r31,8)
    (I18) AI r29=r29,2
    (I3)  C  cr7=r12,r0
    (I19) C  cr4=r29,r27
    (I5)  C  cr6=r12,r30
    (I12) C  cr5=r0,r30
    (I4)  BF CL.4,cr7,0x2/gt
    (I8)  C  cr7=r0,r28
    (I6)  BF CL.6,cr6,0x2/gt
    (I7)  LR r30=r12
    CL.6:
    (I9)  BF CL.9,cr7,0x1/lt
    (I10) LR r28=r0
    (I11) B  CL.9
    CL.4:
    (I15) C  cr7=r12,r28
    (I13) BF CL.11,cr5,0x2/gt
    (I14) LR r30=r0
    CL.11:
    (I16) BF CL.9,cr7,0x1/lt
    (I17) LR r28=r12
    CL.9:
    (I20) BT CL.0,cr4,0x1/lt
    ****************** LOOP ENDS ********************
    . . . more instructions here . . .

    Figure 6. The result of applying both useful and speculative scheduling to the program of Figure 2

A more detailed description of the PDG-based scheduling framework is out of the scope of this paper.
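The mutually exclusive speculation of I5 and I12 in Figure 6 can be mimicked at the source level: both comparisons are evaluated eagerly, but only the one on the taken path updates max. The function below is our illustration, not code from the paper; the variable names are hypothetical.

```c
/* u_gt_max and v_gt_max are both computed "speculatively", as I5
 * and I12 are in Figure 6; only one of them is consumed, depending
 * on the outcome of the u > v test (I3/I4). */
int speculative_max_update(int u, int v, int max) {
    int u_gt_max = u > max;   /* useful only if u > v  */
    int v_gt_max = v > max;   /* useful only if u <= v */
    if (u > v) {
        if (u_gt_max) max = u;
    } else {
        if (v_gt_max) max = v;
    }
    return max;
}
```

Exactly one of the two eagerly computed flags is read on any path, which is why moving both compares above the branch preserves the program's result.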
6. Performance evaluation

The evaluation of the global scheduling scheme was done on the IBM RS/6K machine, whose abstract model is presented in Section 2.1. For experimentation purposes, the global scheduling prototype has been embedded into the IBM XL family of compilers. These compilers support several high-level languages, like C, Fortran, Pascal, etc.; however, here we concentrate only on the C programs. The evaluation was done on the four C programs of the SPEC benchmark suite [S89]. In the following, LI denotes the Lisp Interpreter benchmark, GCC stands for the GNU C Compiler, while EQNTOTT and ESPRESSO represent programs for the manipulation of Boolean functions and equations.

The basis for all the following comparisons (denoted by BASE in the sequel) is the performance results of the same IBM XL C compiler with the global scheduling disabled. Please notice that the base compiler includes a sophisticated basic block instruction scheduler, similar to that of [W90], as well as peephole optimizations; so, most of the improvements that can be achieved within basic blocks were already part of the base compiler. Also, the global scheduler employs a set of code replication techniques that solve certain delay problems, similar to those of [GR90]; so, in some sense, certain improvements due to [GR90] are included in the global scheduling results as well.

Several design decisions characterize the status of the current prototype:

- Only reducible inner regions are scheduled. We distinguish between inner regions (i.e., regions that do not include other regions) and outer regions; only two levels of regions are scheduled, each region on its own.
- "Small" regions are those that have at most 4 basic blocks and 256 instructions.
- In a preparation step, before the global scheduling is applied, certain small inner loops are unrolled: the body of the loop is copied, so that two iterations of the loop are executed within the unrolled body, instead of one.

The general flow of the global scheduling is as follows:

1. certain inner loops are unrolled;
2. the global scheduling is applied the first time, to the inner regions only;
3. certain inner loops are rotated, i.e., a copy of their first basic block is placed after the end of the loop; by rotation we achieve the partial effect of software pipelining, in which some of the instructions of one iteration are executed within the previous iteration;
4. the global scheduling is applied the second time, to the rotated inner loops and the outer regions.

The compile-time overhead of the above described scheme is shown in Figure 7. The column marked BASE gives the compilation times in seconds, as measured on the IBM RS/6K model 530 machine (whose cycle time is 40ns), while the column marked CTO (Compile-Time Overhead) provides the increase in the compilation times when the global scheduling is invoked, in percents.

    PROGRAM    BASE   CTO
    LI          206   13%
    EQNTOTT      78   17%
    ESPRESSO    465   12%
    GCC        2457   13%

    Figure 7. Compile-time overheads for the global scheduling (BASE in seconds, CTO in percents)

We notice that the increase in the compilation time is modest, especially since at the moment no major steps were taken to reduce it, except of the control over the size of the regions that are being scheduled. This increase in the compilation time includes the time required to perform all of the above mentioned scheduling steps, including the loop unrolling, rotation, and duplication of code.

The run-time improvement (RTI) of the code compiled with the global scheduling, relative to the base compiler, is shown in Figure 8, in percents, for both the useful-only and the useful-and-speculative types of scheduling. The accuracy of the measurements is about 0.5%-1%.

    PROGRAM    BASE   USEFUL  SPECULATIVE
    LI          312    2.0%      6.9%
    EQNTOTT      45    7.1%      7.3%
    ESPRESSO    106      0%     -0.5%
    GCC          76      0%     -1.5%

    Figure 8. Run-time improvements for the global scheduling (BASE in seconds, RTI in percents)
To summarize, we notice in Figure 8 that for EQNTOTT most of the improvement is due to the useful scheduling, while for LI the speculative scheduling is dominant. On the other hand, for both ESPRESSO and GCC no improvement was observed; moreover, when the speculative scheduling is turned on, there is a slight degradation. In our short experience with the prototype, we consider the achieved run-time improvement reasonable, due to the fact that the code has already been optimized by the base compiler to a large extent; we may expect even bigger payoffs in machines with a larger number of computational units.

7. Summary

A scheme for global instruction scheduling for superscalar machines was presented. It is based on a data structure called the Program Dependence Graph (PDG), a parametric machine description that fits a range of superscalar and VLIW architectures, and a set of useful heuristics. The scheme allows the movement of instructions well beyond basic block boundaries, and distinguishes between useful-only and speculative scheduling. The results of evaluating the global scheduling scheme on the IBM RS/6K machine are quite encouraging, while the compile-time overhead is modest, since no major steps were taken to reduce it. We are going to extend our work towards more aggressive speculative scheduling, by supporting it with additional information, for better scheduling of the machine resources.

Acknowledgements. We would like to thank Kemal Ebcioglu for many helpful discussions, and Hugo Krawczyk, Vladimir Rainish, Ron Y. Pinter and Irit Boldo for their help in the implementation.
References

[BG89] Bernstein, D., and Gertner, I., "Scheduling expressions on a pipelined processor with a maximal delay of one cycle", ACM Transactions on Prog. Lang. and Systems, Vol. 11, Num. 1 (Jan. 1989), 57-66.

[BRG89] Bernstein, D., Rodeh, M., and Gertner, I., "Approximation algorithms for scheduling arithmetic expressions on pipelined machines", Journal of Algorithms, 10 (Mar. 1989), 120-139.

[BJR89] Bernstein, D., Jaffe, J.M., and Rodeh, M., "Scheduling arithmetic and load operations in parallel with no spilling", SIAM Journal of Computing, (Dec. 1989), 1098-1127.

[BEH89] Bradlee, D.G., Eggers, S.J., and Henry, R.R., "Integrating register allocation and instruction scheduling for RISCs", to appear in Proc. of the Fourth ASPLOS Conference, (April 1991).

[CFRWZ] Cytron, R., Ferrante, J., Rosen, B.K., Wegman, M.N., and Zadeck, F.K., "An efficient method for computing static single assignment form", Proc. of the Annual ACM Symposium on Principles of Programming Languages, (Jan. 1989), 25-35.

[CHH89] Cytron, R., Hind, M., and Wilson, H., "Automatic generation of DAG parallelism", Proc. of the SIGPLAN Annual Symposium, (June 1989), 54-68.

[E85] Ellis, J.R., "Bulldog: A compiler for VLIW architectures", Ph.D. thesis, Yale U/DCS/RR-364, Yale University, Feb. 1985.

[E88] Ebcioglu, K., "Some design ideas for a VLIW architecture for sequential-natured software", Proc. of the IFIP Conference on Parallel Processing, (April 1988), Italy.

[EN89] Ebcioglu, K., and Nakatani, T., "A new compilation technique for parallelizing regions with unpredictable branches on a VLIW architecture", Proc. of the Workshop on Languages and Compilers for Parallel Computing, (August 1989).

[F81] Fisher, J., "Trace scheduling: A technique for global microcode compaction", IEEE Trans. on Computers, C-30, No. 7 (July 1981), 478-490.

[FOW87] Ferrante, J., Ottenstein, K.J., and Warren, J.D., "The program dependence graph and its use in optimization", ACM Transactions on Prog. Lang. and Systems, Vol. 9, Num. 3 (July 1987), 319-349.

[GM86] Gibbons, P.B., and Muchnick, S.S., "Efficient instruction scheduling for a pipelined architecture", Proc. of the SIGPLAN Annual Symposium, (June 1986), 11-16.

[GO89] Groves, R.D., and Oehler, R., "An IBM second generation RISC processor architecture", Proc. of the IEEE Conference on Computer Design, (October 1989), 134-137.

[GR90] Golumbic, M.C., and Rainish, V., "Instruction scheduling beyond basic blocks", IBM J. Res. Dev., (Jan. 1990), 93-98.

[HG83] Hennessy, J.L., and Gross, T., "Postpass code optimization of pipeline constraints", ACM Trans. on Programming Languages and Systems, 5 (July 1983), 422-448.

[JW89] Jouppi, N.P., and Wall, D.W., "Available instruction-level parallelism for superscalar and superpipelined machines", Proc. of the Third ASPLOS Conference, (April 1989), 272-282.

[L88] Lam, M., "Software pipelining: An effective scheduling technique for VLIW machines", Proc. of the SIGPLAN Annual Symposium, (June 1988), 318-328.

[P85] Patterson, D.A., "Reduced instruction set computers", Comm. of ACM, (Jan. 1985), 8-21.

[S89] "SPEC Newsletter", Systems Performance Evaluation Cooperative, Vol. 1, Issue 1, (Sep. 1989).

[SLH90] Smith, M.D., Lam, M.S., and Horowitz, M.A., "Boosting beyond static scheduling in a superscalar processor", Proc. of the Computer Architecture Conference, (May 1990), 344-354.

[W90] Warren, H., "Instruction scheduling for the IBM RISC System/6000 processor", IBM J. Res. Dev., (Jan. 1990), 85-92.