Global Instruction Scheduling for SuperScalar Machines

David Bernstein        Michael Rodeh
IBM Israel Scientific Center
Technion City, Haifa 32000, ISRAEL

Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Ontario, Canada, June 26-28, 1991.

Abstract

To improve the utilization of machine resources in superscalar processors, the instructions have to be carefully scheduled by the compiler. As internal parallelism and pipelining increases, it becomes evident that such scheduling should be done well beyond the basic block level. A scheme for global (intra-loop) scheduling is proposed, which uses the control and data dependence information summarized in a Program Dependence Graph to move instructions well beyond basic block boundaries. This novel scheduling framework is based on the parametric description of the machine architecture, which spans a range of superscalar and VLIW machines, and exploits speculative execution of instructions to further enhance the performance of the code. We have implemented our algorithms in the IBM XL family of compilers and have evaluated them on the IBM RISC System/6000 machines.

1. Introduction

Starting in the late seventies, a new approach for building high speed processors emerged which emphasizes streamlining of program instructions; subsequently it was called RISC architecture [P85]. It turned out that in order to take advantage of pipelining so as to improve performance, instructions have to be rearranged, usually at the intermediate language or assembly code level. The burden of such transformations, called instruction scheduling, has been placed on optimizing compilers.

Previously, instruction scheduling algorithms were suggested for processors with several functional units [BG89, BRG89], for pipelined machines [BG89, HG83, GM86, W90], and for Very Large Instruction Word (VLIW) machines [E85]. While for machines with n functional units the idea is to be able to execute as many as n instructions each cycle, for pipelined machines the goal is to issue a new instruction every cycle, effectively eliminating the so-called NOPs (No Operations). However, for both types of machines, the common feature required from the compiler is to discover in the code instructions that are data independent, allowing the generation of code that better utilizes the machine resources.

It was a common view that such data independent instructions can be found within basic blocks, and that there is no need to move instructions beyond basic block boundaries.

Virtually all of the previous work on the implementation of instruction scheduling for pipelined machines concentrated on scheduling within basic blocks [HG83, GM86, W90]. Even for basic RISC architectures, this type of scheduling may result in code with many NOPs for certain Unix(1)-type programs that include many unpredictable branches, since there basic blocks tend to be small. For scientific programs the problem is not so severe, since there basic blocks tend to be larger.

Recently, a new type of architecture is evolving that extends RISC by the ability to issue more than one instruction per cycle [GO89]. This type of high speed processors, called superscalar or superpipelined machines, poses more serious challenges to optimizing compilers, since instruction scheduling at the basic block level is in many cases not sufficient to allow generation of code that utilizes machine resources to a desired extent [JW89].

One recent effort to pursue instruction scheduling for superscalar machines was reported in [GR90], where code replication techniques for scheduling beyond the scope of basic blocks were investigated, resulting in fair improvements of the running time of the compiled code. Also, for the purposes of instruction scheduling, one can view a superscalar processor as a VLIW machine with a small number of functional units. There are two main approaches for compiling code for VLIW machines that were reported in the literature: trace scheduling [F81, E85] and enhanced percolation scheduling [EN89].

(1) Unix is a trademark of AT&T Bell Labs.

Trace scheduling assumes the existence of a main trace in the program, whose scheduling is based on the probabilities of branches to be taken (e.g., as computed by profiling); this assumption is likely to hold in scientific computations, but may not be true in symbolic or Unix-type programs whose branches are unpredictable. Also, trace scheduling employs code duplication, which may increase the code size, incurring additional costs in terms of instruction cache misses. As for enhanced percolation scheduling, our opinion is that it is more targeted towards machines with a large number of computational units, like VLIW machines.

In this paper, we present a technique for global instruction scheduling which permits the movement of instructions well beyond basic block boundaries, within the scope of the enclosed loop. The method employs a novel data structure, called the Program Dependence Graph (PDG), that was recently proposed by Ferrante et. al [FOW87] to be used in compilers to expose parallelism for the purposes of vectorization and generation of code for multiprocessors. Using the information available in the PDG, we distinguish between useful and speculative execution of instructions. Since we are currently interested in machines with a small number of functional units (like the RISC System/6000 machines), we pursue a conservative approach to scheduling: first we try to exploit the machine resources with useful instructions; next we consider speculative instructions, whose effect on performance depends on the probability of branches to be taken. Also, we identify the cases where instructions have to be duplicated in order to be scheduled.

For speculative instructions, it was previously suggested that they have to be supported by the machine architecture [E88, SLH90]. Since such architectural support carries a significant run-time overhead, we take a different approach: using information about the program that is computed at compile-time, we schedule speculative instructions so as to retain most of the performance effect promised by speculative execution.

In our scheme, we do not overlap the execution of instructions that belong to different iterations of the loop. This more aggressive type of instruction scheduling, which is often called software pipelining [L88], is left for future work.

We have implemented our scheme in the context of the IBM XL family of compilers for the IBM RISC System/6000 (RS/6K for short) computers. The performance results for our scheduling prototype were based on a set of SPEC benchmarks [S89].

The rest of the paper is organized as follows. In Section 2 we describe our generic machine model and show how it is applicable to the RS/6K machines. Then, in Section 3 we bring a program that will serve as a running example. In Section 4 we discuss the usefulness of the PDG for global instruction scheduling. In Section 5 several levels of scheduling, including speculative execution, are presented. Finally, in Section 6 we bring performance results and conclude in Section 7.

2. Parametric machine description

Our model of a superscalar machine is based on the description of a typical RISC processor whose only instructions that reference memory are the load and store instructions, while all the computations are done in registers. We view a superscalar machine as a collection of functional units of m types, where the machine has n1, n2, ..., nm units of each type. Each instruction in the code can be potentially executed by any of the units of its specified type.

The data dependences among the instructions are modelled by the edges of the program's data dependence graph, and integral delays are assigned to the data dependence edges. Let I1 and I2 be two instructions such that (I1,I2) is a data dependence edge, let t (t >= 1) be the execution time of I1, and let d (d >= 0) be the delay assigned to (I1,I2). For performance purposes, if I1 is scheduled to start at time k, then I2 should be scheduled to start no earlier than k + t + d. Notice, however, that if I2 is scheduled (by the compiler) to start earlier than mentioned above, this would not affect the correctness of the program, since we assume that the machine implements hardware interlocks which guarantee the delays at run time; only some performance degradation may result. More information about the notion of delays due to pipelining constraints can be found in [BG89, BRG89].
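As a minimal sketch of this timing constraint (the types and names below are hypothetical illustrations, not taken from the paper), the earliest start time of an instruction can be computed from its already-scheduled data dependence predecessors:

    #include <stddef.h>

    /* Hypothetical representation of a data dependence edge (I1, I2):
       'time' is the execution time t of I1, 'delay' is the delay d. */
    typedef struct {
        int pred;   /* index of the predecessor instruction I1 */
        int time;   /* execution time t of I1 (t >= 1) */
        int delay;  /* pipeline delay d on the edge (d >= 0) */
    } DepEdge;

    /* Earliest cycle at which I2 may start without stalling: for every
       incoming edge (I1, I2), if I1 starts at cycle k, then I2 must
       start no earlier than k + t + d. */
    int earliest_start(const DepEdge *in_edges, size_t n_edges,
                       const int *start_cycle)
    {
        int earliest = 0;
        for (size_t i = 0; i < n_edges; i++) {
            int bound = start_cycle[in_edges[i].pred]
                      + in_edges[i].time + in_edges[i].delay;
            if (bound > earliest)
                earliest = bound;
        }
        return earliest;
    }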

Throughout

register allocation

scheduling

instructions

to the RS/6K

4 we discuss the usefulness

including Finally,

as follows.

2 we describe our generic machine

program

the

onto the real

on the relationships

pipelined

The

results for our scheduling

were based on a set of SPEC benchmarks

machines.

the

using one of the standard

algorithms.

computational

Section

during

phase of the compiler,

discussion

functional

RISC

[ss9].

while

registers,

of symbolic

Subsequently,

registers are mapped

will not deal with

our scheme in the context

performance

prototype

symbolic

number

we assume

execution.

We have implemented the IBM

register allocation

such support

most of the performance

by speculative

by the

execution

for replacing

analysis

registers in the machine.

Since

for speculative

techniques

was

purposes,

implements

to not

since we hardware

the delays at run time.

about

the notion

can be found

of delays due to in [BG8!J,

2.1 The RS/6K model

Here we show how our generic model of a superscalar machine is configured to fit the RS/6K machine. The RS/6K processor is modelled as follows:

• m = 3; there are three types of functional units: fixed point, floating point and branch types.

• n1 = 1, n2 = 1, n3 = 1; there is a single fixed point unit, a single floating point unit and a single branch unit.

• Most of the instructions are executed in one cycle; however, there are also multi-cycle instructions, like multiplication, division, etc.

• There are a few additional types of delays, among them:
  - a delay of one cycle between a load instruction and the instruction that uses its result register (delayed load);
  - a delay of three cycles between a fixed point compare instruction and the branch instruction that uses the result of that compare;(2)
  - a delay of five cycles between a floating point compare instruction and the branch instruction that uses the result of that compare;
  - a delay of one cycle between a floating point instruction and another instruction that uses its result.

In this paper we concentrate on fixed point computations only. Therefore, for the purposes of the following discussion, only the first and the second types of the above mentioned delays will be considered.

(2) More precisely, usually the three cycle delay between a fixed point compare and the respective branch instruction is encountered only when the branch is taken. However, here for simplicity we assume that such delay exists whether the branch is taken or not.
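A minimal sketch of how such a parametric machine description might be encoded (the struct layout and the names are hypothetical illustrations, not the paper's actual implementation):

    /* Hypothetical encoding of the parametric machine description. */
    enum UnitType { FIXED_POINT, FLOAT_POINT, BRANCH, NUM_UNIT_TYPES };

    typedef struct {
        int units[NUM_UNIT_TYPES];  /* n1, ..., nm: units per type */
    } Machine;

    /* A delay rule: extra cycles between a producing instruction class
       and a consuming instruction class. */
    typedef struct {
        const char *producer;
        const char *consumer;
        int delay;
    } DelayRule;

    /* The RS/6K configuration of Section 2.1: one fixed point, one
       floating point and one branch unit. */
    static const Machine rs6k = { .units = { 1, 1, 1 } };

    static const DelayRule rs6k_delays[] = {
        { "load",          "use of result", 1 },  /* delayed load */
        { "fixed compare", "branch",        3 },
        { "float compare", "branch",        5 },
        { "float op",      "use of result", 1 },
    };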

3. A program example

Next, we present a small program that will serve us as a running example: it finds the largest and the smallest number in a given array a. The pseudo-code of the program (written in C) is shown in Figure 1. In every iteration of the loop, two elements of the array a are fetched; they are compared to one another (if (u > v)), and subsequently the bigger of them is compared to the maximum (max) and the smaller to the minimum (min), updating max and min if needed.

The real code created for this program by the IBM XL-C compiler(3) for the RS/6K machine is presented in Figure 2. For convenience, we number the instructions of the code of Figure 2 (I1-I20) and annotate them with the corresponding statements of Figure 1. Also, we mark the ten basic blocks (BL1-BL10) of which the code comprises.

(3) The only feature of the machine that was disabled in this example is that of keeping the iteration variable of the loop in a special counter register. Keeping the iteration variable in this register allows it to be decremented and tested for zero in a single instruction, effectively reducing the overhead of loop control instructions.

    /* find the largest and the smallest number in a given array */
    minmax(a,n)
    {
        int i,u,v,min,max,n,a[SIZE];

        min=a[0]; max=min;
        i=1;
        /****************** LOOP STARTS ******************/
        while (i<n) {
            u=a[i]; v=a[i+1];
            if (u>v) {
                if (u>max) max=u;
                if (v<min) min=v;
            }
            else {
                if (v>max) max=v;
                if (u<min) min=u;
            }
            i=i+2;
        }
        /****************** LOOP ENDS ********************/
    }

Figure 1. A program for finding the largest and the smallest numbers in an array

    max is kept in r30          n is kept in r27
    min is kept in r28          address of a[i] is kept in r31
    i is kept in r29

        . . . more instructions here . . .
        *************** LOOP STARTS *******************
    CL.0:
    (I1)  L    r12=a(r31,4)        u=a[i]
    (I2)  LU   r0,r31=a(r31,8)     v=a[i+1]
    (I3)  C    cr7=r12,r0          u > v
    (I4)  BF   CL.4,cr7,0x2/gt     END BL1
    ---------------------------------------
    (I5)  C    cr6=r12,r30         u > max
    (I6)  BF   CL.6,cr6,0x2/gt     END BL2
    ---------------------------------------
    (I7)  LR   r30=r12             max = u; END BL3
    ---------------------------------------
    CL.6:
    (I8)  C    cr7=r0,r28          v < min
    (I9)  BF   CL.9,cr7,0x1/lt     END BL4
    ---------------------------------------
    (I10) LR   r28=r0              min = v
    (I11) B    CL.9                END BL5
    ---------------------------------------
    CL.4:
    (I12) C    cr6=r0,r30          v > max
    (I13) BF   CL.11,cr6,0x2/gt    END BL6
    ---------------------------------------
    (I14) LR   r30=r0              max = v; END BL7
    ---------------------------------------
    CL.11:
    (I15) C    cr7=r12,r28         u < min
    (I16) BF   CL.9,cr7,0x1/lt     END BL8
    ---------------------------------------
    (I17) LR   r28=r12             min = u; END BL9
    ---------------------------------------
    CL.9:
    (I18) AI   r29=r29,2           i = i+2
    (I19) C    cr4=r29,r27         i < n
    (I20) BT   CL.0,cr4,0x1/lt     END BL10
        *************** LOOP ENDS *********************
        . . . more instructions here . . .

Figure 2. The code of the program of Figure 1, as created by the XL-C compiler

Notice that the code of Figure 2 executes in 20, 21 or 22 cycles per iteration of the loop, depending on whether 0, 1 or 2 updates of the max and min variables are required.

...

More formally, examine the set of instructions that are available for scheduling at the same time, and let I and J be two such instructions. Let D(I) denote the delay heuristic function and CP(I) the critical path heuristic function of an instruction I, where CP is computed over the data dependence graph as

    CP(I) = max (CP(J) + d(I,J)),

the maximum being taken over the immediate successors J of I, and d(I,J) being the delay assigned to the edge (I,J). The choice between I and J is then made as follows (a sketch of these criteria in code follows the list):

...
4. If D(I) > D(J), then pick I; if D(J) > D(I), then pick J.
5. If CP(I) > CP(J), then pick I; if CP(J) > CP(I), then pick J.
...
7. Pick the instruction that occurred first in the code; i.e., for instructions of the same class, delay and critical path, we try to preserve the original ordering of the program.

Notice that the current ordering of the heuristic functions is tuned towards a machine with a small number of functional units.
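The following is a minimal sketch of the recovered criteria (the instruction record and memoization scheme are hypothetical, not the paper's implementation):

    #include <string.h>

    #define MAX_INSTRS 256

    /* Hypothetical instruction record for candidate selection. */
    typedef struct {
        int id;                     /* position in the original order */
        int delay;                  /* D(I): delay heuristic value */
        int n_succ;                 /* number of dependence successors */
        int succ[MAX_INSTRS];       /* indices of dependent instructions */
        int edge_delay[MAX_INSTRS]; /* d(I,J) for each successor J */
    } Instr;

    static int cp_cache[MAX_INSTRS];  /* memoized CP values, -1 = unknown */

    void cp_reset(void) { memset(cp_cache, -1, sizeof cp_cache); }

    /* CP(I) = max over successors J of (CP(J) + d(I,J)); 0 for a sink. */
    int critical_path(const Instr *code, int i)
    {
        if (cp_cache[i] >= 0)
            return cp_cache[i];
        int best = 0;
        for (int s = 0; s < code[i].n_succ; s++) {
            int j = code[i].succ[s];
            int v = critical_path(code, j) + code[i].edge_delay[s];
            if (v > best)
                best = v;
        }
        return cp_cache[i] = best;
    }

    /* Choose between candidates I and J by the recovered rules: bigger
       delay first, then bigger critical path, then original order. */
    int choose(const Instr *code, int i, int j)
    {
        if (code[i].delay != code[j].delay)
            return code[i].delay > code[j].delay ? i : j;
        int cpi = critical_path(code, i), cpj = critical_path(code, j);
        if (cpi != cpj)
            return cpi > cpj ? i : j;
        return code[i].id < code[j].id ? i : j;
    }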

In the scheduling process, we schedule useful instructions first; a useful instruction is always preferred over a speculative one, even though the speculative instruction may have a bigger delay. It turns out that for speculative instructions additional information has to be maintained, as described next. In any case, more experimentation and tuning are needed for better results.

5.4 Scheduling speculative instructions

While instructions are moved beyond basic block boundaries, the data dependences of the program have to be respected and the control dependence information of the PDG has to be maintained. However, for speculative instructions this is not sufficient, and a new type of information has to be considered. Consider the following program excerpt:

    if (cond) x=5; else x=3;
    printf("x=%d", x);

Here x=5 belongs to basic block B2, while x=3 belongs to B3. Each of them can be (speculatively) moved into B1, the block of the condition; but it is apparent that both of them are not allowed to move there, since then a wrong value of x may be printed in B4, the block of the printf. Data dependences alone do not prevent this movement, since x (or actually the symbolic register that holds x) is not referenced in B1.

To solve this problem, for every basic block B we maintain the information about the registers that are live on exit from B. If an instruction that is being moved speculatively to a block B computes a new value for a register that is live on exit from B, such speculative movement is disallowed. Notice that this type of information has to be updated dynamically, i.e., after each movement of an instruction from one basic block to another. Thus, if, let us say, x=5 is first moved into B1, this information is updated, and then the movement of x=3 into B1 will be prevented.
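A minimal sketch of this legality test (the block record and the bitset representation are hypothetical; the paper does not spell out an implementation):

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical basic block record: 'live_out' is the set of
       registers that are live on exit from the block. */
    typedef struct {
        uint64_t live_out;   /* bit r set => register r live on exit */
    } Block;

    /* An instruction may be moved speculatively into block 'b' only if
       the register it defines is NOT live on exit from 'b'. */
    bool can_move_speculatively(const Block *b, int def_reg)
    {
        return (b->live_out & ((uint64_t)1 << def_reg)) == 0;
    }

    /* After the motion, the moved definition becomes live on exit from
       the target block; updating the set dynamically is what makes a
       later move of another definition of the same register illegal. */
    void commit_speculative_move(Block *b, int def_reg)
    {
        b->live_out |= (uint64_t)1 << def_reg;
    }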

Let us demonstrate the effect of useful and speculative scheduling on the example of Figure 2. The result of applying useful scheduling only to that program is presented in Figure 5. During the useful scheduling, the only instructions moved into BL1 were the two instructions of BL10 (I18 and I19), since only BL10 is in EQUIV(BL1); they fill the delay slots of the compare and branch instructions of BL1. Similarly, I8 was moved from BL4 to BL2, and I15 was moved from BL8 to BL6. The result is that the program of Figure 5 takes 12-13 cycles per iteration, while the original program of Figure 2 was executing in 20-22 cycles per iteration.

        . . . more instructions here . . .
        *********** LOOP STARTS *************
    CL.0:
    (I1)  L    r12=a(r31,4)
    (I2)  LU   r0,r31=a(r31,8)
    (I18) AI   r29=r29,2
    (I3)  C    cr7=r12,r0
    (I19) C    cr4=r29,r27
    (I4)  BF   CL.4,cr7,0x2/gt
    (I5)  C    cr6=r12,r30
    (I8)  C    cr7=r0,r28
    (I6)  BF   CL.6,cr6,0x2/gt
    (I7)  LR   r30=r12
    CL.6:
    (I9)  BF   CL.9,cr7,0x1/lt
    (I10) LR   r28=r0
    (I11) B    CL.9
    CL.4:
    (I12) C    cr6=r0,r30
    (I15) C    cr7=r12,r28
    (I13) BF   CL.11,cr6,0x2/gt
    (I14) LR   r30=r0
    CL.11:
    (I16) BF   CL.9,cr7,0x1/lt
    (I17) LR   r28=r12
    CL.9:
    (I20) BT   CL.0,cr4,0x1/lt
        *********** LOOP ENDS ***************
        . . . more instructions here . . .

Figure 5. The result of applying useful scheduling to the program of Figure 2
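The EQUIV relation used above pairs blocks that are control-equivalent; the paper's precise definition falls in a part of the text not recovered here, but a standard formulation (assumed in this sketch) pairs B1 and B2 when B1 dominates B2 and B2 post-dominates B1:

    #include <stdint.h>
    #include <stdbool.h>

    /* dom[b] / pdom[b]: bitsets of the blocks that dominate /
       post-dominate block b, assumed precomputed by the usual
       dataflow analyses over the control flow graph. */
    extern uint32_t dom[], pdom[];

    /* B2 is in EQUIV(B1) iff executing one implies executing the other
       (e.g. BL10 and BL1 in Figure 2); moving an instruction between
       such blocks is "useful", never speculative. */
    bool equiv(int b1, int b2)
    {
        return (dom[b2]  & ((uint32_t)1 << b1)) != 0
            && (pdom[b1] & ((uint32_t)1 << b2)) != 0;
    }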

Figure 6 shows the result of applying both useful and speculative scheduling to the same program. In addition to the motions that were described above, two additional (speculative) instructions (I5 and I12) were moved into BL1, filling the three cycle delay between the compare I3 and the branch I4. (Notice that the result register of I12 was renamed from cr6 to cr5, together with its use in I13.)

        . . . more instructions here . . .
        *********** LOOP STARTS *************
    CL.0:
    (I1)  L    r12=a(r31,4)
    (I2)  LU   r0,r31=a(r31,8)
    (I18) AI   r29=r29,2
    (I3)  C    cr7=r12,r0
    (I19) C    cr4=r29,r27
    (I5)  C    cr6=r12,r30
    (I12) C    cr5=r0,r30
    (I4)  BF   CL.4,cr7,0x2/gt
    (I8)  C    cr7=r0,r28
    (I6)  BF   CL.6,cr6,0x2/gt
    (I7)  LR   r30=r12
    CL.6:
    (I9)  BF   CL.9,cr7,0x1/lt
    (I10) LR   r28=r0
    (I11) B    CL.9
    CL.4:
    (I15) C    cr7=r12,r28
    (I13) BF   CL.11,cr5,0x2/gt
    (I14) LR   r30=r0
    CL.11:
    (I16) BF   CL.9,cr7,0x1/lt
    (I17) LR   r28=r12
    CL.9:
    (I20) BT   CL.0,cr4,0x1/lt
        *********** LOOP ENDS ***************
        . . . more instructions here . . .

Figure 6. The result of applying both useful and speculative scheduling to the program of Figure 2

A detailed description of the PDG-based global scheduling algorithm is out of the scope of this paper.

Interestingly enough, I5 and I12 belong to basic blocks (BL2 and BL6) that are never executed together in any single execution of the program; in each iteration of the loop, only one of these two instructions will carry a useful result. All in all, the program in Figure 6 takes 11-12 cycles per iteration, a one cycle improvement over the program of Figure 5.

Next we describe how the global scheduling was implemented and evaluated, and discuss the trade-off between its compile-time overhead and the run-time improvement it achieves.

6. Performance evaluation

Performance evaluation of the global scheduling scheme was done on the IBM RS/6K machine, whose abstract model is presented in Section 2.1. For experimentation purposes, the global scheduling has been embedded into the IBM XL family of compilers. These compilers support several high-level languages, like C, Fortran, Pascal, etc.; however, at the moment we concentrate only on C programs. The evaluation was done on the four C programs of the SPEC benchmark suite [S89]: GCC stands for the GNU C Compiler, LI denotes the Lisp Interpreter, while EQNTOTT and ESPRESSO represent programs for the manipulation of Boolean functions and equations.

The basis for all the following comparisons (denoted by BASE in the sequel) is the performance of the code produced by the same IBM XL C compiler with the global scheduling disabled. Please notice that the base compiler includes a sophisticated basic block instruction scheduler of its own, similar to that of [W90], as well as peephole and other machine-level optimizations; so, in some sense, the improvements achieved by the global scheduling come on top of those of the scheduling techniques that were already part of the base compiler.

The global scheduler works on regions of the PDG. We distinguish between inner regions (i.e., regions that do not include other regions, such as inner loops) and outer regions; only two levels of regions are scheduled, the inner regions and the outer regions that include them. Also, only "small" reducible regions are scheduled, i.e., those that have at most 4 basic blocks and 256 instructions. In a preparation step, before the global scheduling is applied, certain inner loops are unrolled once (i.e., after unrolling, the body of such a loop represents two iterations of the original loop instead of one).
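A trivial sketch of the "small region" filter just described (the field names are hypothetical):

    #include <stdbool.h>

    typedef struct {
        int n_blocks;    /* basic blocks in the region */
        int n_instrs;    /* instructions in the region */
        bool reducible;  /* region control flow is reducible */
        bool inner;      /* region contains no other region */
    } Region;

    /* Only "small" reducible regions are scheduled: at most 4 basic
       blocks and 256 instructions, the limits used in the prototype. */
    bool schedulable(const Region *r)
    {
        return r->reducible && r->n_blocks <= 4 && r->n_instrs <= 256;
    }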

The general flow of the global scheduling is as follows (a sketch of the rotation step appears after this list):

1. certain inner loops are unrolled;
2. the global scheduling is applied the first time, to the inner regions only;
3. certain inner loops are rotated, i.e., a copy of their first basic block is placed after the end of the loop, and the loop-closing branch is adjusted accordingly;
4. the global scheduling is applied the second time, to the rotated inner loops and to the outer regions.

By applying the rotation of step 3, we achieve a partial effect of software pipelining: some of the instructions that belong to the next iteration of a loop may be executed within the current iteration.
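As a schematic illustration of the rotation step (this is an assumed source-level analogue, not the compiler's actual transformation), a loop whose first basic block tests the exit condition can be rotated by peeling a copy of that block past the loop end, turning the loop-closing branch into a bottom test:

    /* Before rotation: the exit test sits in the loop's first basic
       block and is executed at the top of every pass. */
    void before(int *a, int n) {
        int i = 1;
        while (i < n) {       /* test in the first basic block */
            (void)a;          /* ... body: fetch a[i], a[i+1], ... */
            i = i + 2;
        }
    }

    /* After rotation: a copy of the first block's test guards entry,
       and the loop closes with a bottom test; the scheduler may now
       overlap instructions of the next iteration (e.g. the loads)
       with the tail of the current one. */
    void after(int *a, int n) {
        int i = 1;
        if (i < n) {
            do {
                (void)a;      /* ... body ... */
                i = i + 2;
            } while (i < n);  /* rotated copy of the exit test */
        }
    }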

The compile-time overhead of the above described scheme is shown in Figure 7. The column marked BASE gives the compilation time in seconds, as measured on an IBM RS/6K model 530 machine (whose cycle time is 40ns) with the global scheduling disabled, while the column marked CTO (Compile-Time Overhead) provides the increase of the compilation time, in percents, when the global scheduling is invoked. This increase comes from the additional steps described above (including the unrolling, the rotation and the duplication of code); we notice that it is about 12-17% for the four benchmarks.

    PROGRAM     BASE    CTO
    LI           206    13%
    EQNTOTT       78    17%
    ESPRESSO     465    12%
    GCC         2457    13%

Figure 7. Compile-time overheads for the global scheduling (BASE in seconds)

The run-time improvement (RTI) due to the global scheduling is shown in Figure 8, in percents relative to the run time of the code compiled by the base compiler (the column marked BASE, in seconds). We distinguish between the two levels of scheduling discussed above: the column marked USEFUL gives the improvement when only useful instructions are scheduled, while the column marked SPECULATIVE gives the improvement when speculative instructions are allowed as well. The accuracy of the measurements is about 0.5% - 1%.

    PROGRAM     BASE    USEFUL    SPECULATIVE
    LI           312    2.0%       7.3%
    EQNTOTT       45    6.9%       7.1%
    ESPRESSO     106    0%        -0.5%
    GCC           76    0%        -1.5%

Figure 8. Run-time improvements for the global scheduling (BASE in seconds)

We notice in Figure 8 that for EQNTOTT most of the improvement is already achieved by the useful scheduling, while for LI the contribution of the speculative scheduling is dominant. On the other hand, for both ESPRESSO and GCC no improvement was observed; moreover, with speculative scheduling there is even a slight degradation of the running time.

To summarize our short experience with the global scheduling, we notice that the achieved improvement in run-time is modest; this is due to the fact that the base compiler has already been optimized to a maximum possible extent, and that our current policy for moving instructions is a conservative one. We consider the compile-time overhead as reasonable, especially since no major steps were taken to reduce it.

7. Summary

We presented a scheme for global instruction scheduling which allows the movement of instructions well beyond basic block boundaries, within the scope of the enclosing loop. The scheme is based on a data structure called the Program Dependence Graph, on a parametric description of the machine architecture that covers a range of superscalar and VLIW machines, and on a flexible set of useful heuristics. The proposed framework allows trading off the compile-time overhead against the run-time improvement of the scheduled code.

The results of evaluating the global scheduling scheme on the IBM RS/6K machine are quite encouraging. We may expect even bigger payoffs on machines with a larger number of functional units. We are going to extend our work towards more aggressive speculative scheduling, by supporting it with branch probability information, and towards a better exploitation of machine resources for a range of superscalar machines.

Acknowledgements. We would like to thank Kemal Ebcioglu for many helpful discussions, Vladimir Rainish for his help with the implementation, and Hugo Krawczyk, Ron Y. Pinter and Irit Boldo for their help.

References

[BEH89] Bradlee, D.G., Eggers, S.J., and Henry, R.R., "Integrating register allocation and instruction scheduling for RISCs", to appear in Proc. of the Fourth ASPLOS Conference, (April 1991).

[BG89] Bernstein, D., and Gertner, I., "Scheduling expressions on a pipelined processor with a maximal delay of one cycle", ACM Transactions on Prog. Lang. and Systems, Vol. 11, Num. 1 (Jan. 1989), 57-66.

[BJR89] Bernstein, D., Jaffe, J.M., and Rodeh, M., "Scheduling arithmetic and load operations in parallel with no spilling", SIAM Journal of Computing, (Dec. 1989), 1098-1127.

[BRG89] Bernstein, D., Rodeh, M., and Gertner, I., "Approximation algorithms for scheduling arithmetic expressions on pipelined machines", Journal of Algorithms, 10 (Mar. 1989), 120-139.

[CFRWZ] Cytron, R., Ferrante, J., Rosen, B.K., Wegman, M.N., and Zadeck, F.K., "An efficient method for computing static single assignment form", Proc. of the Annual ACM Symposium on Principles of Programming Languages, (Jan. 1989), 25-35.

[CHH89] Cytron, R., Hind, M., and Wilson, H., "Automatic generation of DAG parallelism", Proc. of the SIGPLAN Annual Symposium, (June 1989), 54-68.

[E85] Ellis, J.R., "Bulldog: A compiler for VLIW architectures", Ph.D. thesis, Yale U/DCS/RR-364, Yale University, Feb. 1985.

[E88] Ebcioglu, K., "Some design ideas for a VLIW architecture for sequential-natured software", Proc. of the IFIP Conference on Parallel Processing, (April 1988), Italy.

[EN89] Ebcioglu, K., and Nakatani, T., "A new compilation technique for parallelizing regions with unpredictable branches on a VLIW architecture", Proc. of the Workshop on Languages and Compilers for Parallel Computing, (August 1989).

[F81] Fisher, J., "Trace scheduling: A technique for global microcode compaction", IEEE Trans. on Computers, C-30, No. 7 (July 1981), 478-490.

[FOW87] Ferrante, J., Ottenstein, K.J., and Warren, J.D., "The program dependence graph and its use in optimization", ACM Transactions on Prog. Lang. and Systems, Vol. 9, Num. 3 (July 1987), 319-349.

[GM86] Gibbons, P.B., and Muchnick, S.S., "Efficient instruction scheduling for a pipelined architecture", Proc. of the SIGPLAN Annual Symposium, (June 1986), 11-16.

[GO89] Groves, R.D., and Oehler, R., "An IBM second generation RISC processor architecture", Proc. of the IEEE Conference on Computer Design, (October 1989), 134-137.

[GR90] Golumbic, M.C., and Rainish, V., "Instruction scheduling beyond basic blocks", IBM J. Res. Dev., (Jan. 1990), 93-98.

[HG83] Hennessy, J.L., and Gross, T., "Postpass code optimization of pipeline constraints", ACM Trans. on Programming Languages and Systems, 5 (July 1983), 422-448.

[JW89] Jouppi, N.P., and Wall, D.W., "Available instruction-level parallelism for superscalar and superpipelined machines", Proc. of the Third ASPLOS Conference, (April 1989), 272-282.

[L88] Lam, M., "Software pipelining: An effective scheduling technique for VLIW machines", Proc. of the SIGPLAN Annual Symposium, (June 1988), 318-328.

[P85] Patterson, D.A., "Reduced instruction set computers", Comm. of ACM, (Jan. 1985), 8-21.

[S89] "SPEC Newsletter", Systems Performance Evaluation Cooperative, Vol. 1, Issue 1, (Sep. 1989).

[SLH90] Smith, M.D., Lam, M.S., and Horowitz, M.A., "Boosting beyond static scheduling in a superscalar processor", Proc. of the Computer Architecture Conference, (May 1990), 344-354.

[W90] Warren, H., "Instruction scheduling for the IBM RISC System/6000 processor", IBM J. Res. Dev., (Jan. 1990), 85-92.