Scheduling time-critical instructions on RISC machines - CiteSeerX

2 downloads 0 Views 2MB Size Report
704, Yorktown. Heights,. NY. 10598. e-mail: rpalem. @watson.ibm.com. ...... idle slot is TX; the maximum number of idle time steps in S~,,,dy. (that is, time steps ...
Scheduling Time-Critical on RISC Machines

Instructions

KRISHNA V. PALEM IBM T. J. Watson Research Center and BARBARA B. SIMONS IBM Santa Teresa Laboratory

We present

a polynomial

instructions

from

Berkeley

RISC

heuristic

for

algorithm. have

be

for

reuse.

also

always

a length

interest

because, pipelines,

independent and

that input

register

the

that

as we show, even

in that the

independent

of

and

NP-hard

such

there

even

problem

code

being

small

streams

of straightline

are allowed

to complete

be used

some quickly

becomes

NP-hard

input

consists

available

pipelines

that

heuristic

is of

for arbitrary

of

only

there

several

Finally,

no time-critical

if either

that

scheduling

constraints,

simultaneously

as a

real-time

a greedy identical

the

optimal

are instructions in

of

801,

known

of the

pipelines, code

can

behavior

are no time-critical

for

IBM

as registers

multiple

The

schedule

the no

occur

constraints, with

schedule.

block

code,

becomes

machine

is

which

instructions

scheduling

basic

of straightline

and Subject and Terms.

Additional

for a target

there

instructions, resources

of an optimal

algorithm

which

of time-critical

time

SPARC,

we

constraints, is only

because

a single

of some shared

as a bus

optwnization;

General

shared

absence

completion

Sun Our

for

Time-critical

the instruction

when

as the

Architecture.

time-critical

to make the

a minimum

such

pipelines,

time.

a schedule

twice

problem

such

Algorithms

tion

than

of several

Categories at~on;

prove

streams

Precision longer

a specific

or if no two instructions

resource

HP

with

machines

can also handle by

produces

less

length prove

the

and can also be used We

algorithm has

and

algorithm

for constructing

on RISC

processors

completed

computations,

algorithm

block

machine, RISC

Our

to

time

a basic

Key

Descriptors:

F.2.2

D.3.4

[Analysis

[Programming

of Algorithms

Problems—sequencing

and

Languages]: and

Problem

Processors—code

Complexity]:

gener-

Nonnumerical

scheduling

Algorithms Words

scheduling,

and

latency,

Phrases:

Compiler

NP-complete,

optimization,

pipeline

deadline,

processor,

greedy

register

algorithm,

allocation,

RISC

instrucmachme

scheduling

1. INTRODUCTION Many code optimization problems for parallel and pipelined modeled as deterministic scheduling problems. Typically, problems involve rearranging generated object code that

Authors’ 704,

addresses:

Yorktown

K. V. Palem,

Heights,

NY

Technology

Institute,

Permission

to copy without

not made of the

or distributed

publication

Association specific

IBM

and

IBM

10598.

Santa

Teresa

its

date

of this

commercial

appear,

Machinery.

Division,

rpalem

Laboratory,

fee all or part

for direct

for Computing

Research e-mail:

and

T. J. Watson

555 Bailey

advantage, notice

Research

@watson.ibm.com.

material is given

To copy otherwise,

the ACM that

Center,

B. B. Simons,

Road,

is granted

machines can be these scheduling is derived from a

San Jose,

provided copyright

copying

or to republish,

CA 95141.

that notice

the copies

0164-0925/93/0900-0632

of the

a fee and/or

permission.

@ 1993 ACM

are

and the title

is by permission requu-es

P.O. Box Compiler

$1.50

ACM TransactIons OnProgramming Languages and Systems, Vd 15, No. 4, September 1993, Pages 632–65S

Scheduling Time-Critical Instructions

.

633

single basic block of source code. The object code instructions have deterministic behavior and often require a single unit of execution time on the CPU [19, 25, 26]. A fast algorithm for rearranging the object code in a basic block to minimize execution time can improve generated by the compiler. In particular, contains

the

utilizing

all

necessary the

smallest of the

to guarantee

schedule.

help

possible

processor

For

to minimize

To illustrate

that

example, spillage

number

cycles

of

early

induced

In

critical,

izes a value which is subsequently Suppose that s, is given an early

or

idle

cycles,

addition,

instructions execution

thereby

it is frequently

are completed

of certain

by register

of time

and efficiency of the code execution time schedule

no-ops

effectively.

time-critical the

the notion

the quality a minimum

early

instructions

in

might

allocation.

assume

that

instruction

referenced by mstructlons deadline and that s;, sj, ...,

s, initials~, sj, . . . . .sj. s~ are given

somewhat later, but still early, deadlines. Then, if a schedule satisfies the deadline constraints, the register that is used to store the value initialized by s, will be available for other use no later than the latest deadline that is assigned

to s;, s~, . . ..s~.

We present constructs

a fast algorithm

a minimum

machines.

This

the Sun

model

SPARC

that

takes

completion

time

approximates

[28],

the IBM

a basic block schedule

several

of code as its input

for a generic

early

RISC

801 [26, 27], the Berkeley

model

processors

such

RISC

and the

[19],

HP Precision Architecture [ 13]. It also constructs an optimal schedule ing time-critical constraints for such machines. Our algorithm can be a heuristic for constructing “fast” schedules for RISC processors with multiple pipelines for which it is not optimal such as the Intel 80860 Any

scheduling

available

for

that

algorithm

scheduling

in the absence

produces that

of two

never

for

is a worst-case

rithms This

tend to perform paper contains

problem

of producing

machines

less than

and

in

with

pipelines

schedule.

most One

overall

is

algorithm

identical

practice

results.

minimum

satisfyused as long or [7].

We show

scheduling

of an optimal

better. three NP-completeness schedules

algorithm.

multiple

that

as

if some instruction

a greedy

with

twice

guarantee,

a no-op

scheduling

constraints,

target

time

inserts

a greedy

of time-critical

a schedule

has a completion

factor

that is called

and

of RISC

The

greedy

proves

algo-

that

completion

the

time

is

NP-hard if the depth of the pipeline grows as part of the input, even when the basic block of code being input consists of only several independent streams of straightline

code,

and

these

instructions

have

no time-critical

The other two results demonstrate how the introduction straints can make a problem NP-complete. In particular, scheduling inputs

problem

constraints, allowed

if there

to complete

1The completion Workshop sity,

is NP-hard

are only independent

Aug.,

is only

problem Languages

register,

Compilers

pipeline,

code without

or if two

A weaker

was brought and

small-depth

of straightline

a single

simultaneously.1

time-constraint

on Programming

for a single

streams

for Parallel

even if the time-critical

instructions

NP-completeness

to our attention

constraints.

of resource conthe instruction

are not result

as an open problem Computing,

Cornell

for

at the Univer-

1988. ACM

Transactions

on Programming

Languages

and Systems,

Vol. 15, No 4, September

1993.

634

the

K. V. Palem and B. B. Simons

.

instruction

[17],

but

scheduling

the proof

2. DESCRIPTION We consider

problem

is flawed

register

target

machines

in which

We use the [17].

standard

Each

requires

load

must

are

every

acyclic DAG

machine

time

delayed

delay,

until

which

instruction

as LOADs, access.

graph

(DAG)

corresponds

in

requires

cannot

the

entire

is represented

if

additional of basic

instructions LOAD

each

until

instruction, that

has

as a weight

and

be executed

an

one

on-chip in one

representation

Furthermore,

to complete,

require

to an instruction,

An instruction

completed.

additional

be

additional

in the

to a dependence.

predecessors

LOAD,

directed

node

edge corresponds of its

is claimed

of the instruction are derived from are fetched, decoded, and executed

In contrast, some instructions, such due to latencies introduced by memory

blocks

constraints

OF THE MODEL

cycle of CPU time. If the operands registers, then such instructions cycle. cycles

with

[15].

all

say

depend

a

on that

been

completed.

The

on the

appropriate

out-

edges from the LOAD to its immediate descendants, is called an interinstructional latency, or latency, for short. The value of the latency is the additional delay beyond the unit of time required by the CPU. Figure 1 shows a simple DAG, all the edges of which two sary

possible idle

schedules

time

for that

can be introduced

order. Clearly, schedule preferable. The idle time

DAG. if the

Sa, which in schedule

Schedule

S]

nodes

completes S1 could

have

latency

illustrates

how

are scheduled

1, and unneces-

in a suboptimal

execution earlier than S’l, is have been introduced either at

compile time using no-ops or at runtime, if the target machine has hardware interlocks. Therefore, depending on the machine, the problem is either to minimize the number piler or to minimize activated. In addition, instruction

the has

of idle cycles caused the number of cycles

input

might

associated

with

contain

by no-ops produced by the during which the interlocks

time-critical

it a nonnegative

instructions. integer

called

comare

Such

an

a deadline.

The deadline could be either a real-time constraint or a value chosen by the programmer or compiler to try to improve performance. In this case, the problem is to construct a schedule in which all instructions are completed by their deadlines. We do not address the question schedule in which all the nodes are completed

of how to assign deadlines. by their deadlines is called

A a

feasible schedule; otherwise it is infeasible. A problem instance is feasible or infeasible according to whether or not a feasible schedule exists for that instance.

instruction be tardy, tardiness

If

the

input

completed

has

after

no

deadlines,

its deadline

a minimum tardiness schedule of any node is minimized.

then

is tardy.

by

default

it

If instructions

is a schedule

in which

is

feasible.

An

are allowed

to

the maximum

2.1 Some Definitions Assume we have a set of instructions that form a basic block. Techniques such as trace scheduling [8, 9] or global compaction [1, 22] can be used to ACM

Transactions

on Programmmg

Languages

and Systems,

Vol. 15, No. 4, September

1993,

Scheduling

Time-Critical

Instructions

.

635

o---

il

i3

i5

Idle

Cycle

7

S* il

i5

i3

0123456

S2 il

ia

i5

i2

i4

01234

5

Fig.1,

ADAGwith

twopossible

schedules.

increase the size of this basic block. The problem that represents the basic block, where each node the

instructions,

and

each

edge

(i, J

input is a DAG G = (N, E) to one of i ● N corresponds

- E corresponds

to a dependence.

In

addition, each edge has a nonnegative integer weight W( i, j), which is the latency of edge (i, j). If a node i must be completed by a certain time t,then i has a deadline sors,

then

processors

the

d(i)

= t. If the target

delays

through

involved

on-chip

in

register

(1) For i, j = N, i #j, and executed simultaneously (2) S(j) > S(i) + w(i, j) depends on the start

data

set 1, 2,...,

= M(j),

IS(i)

on the same

+ 1 for (i, j) time,

is encoded

latency,

(identical)

items

proces-

between

by appropriately

for each instruction

the

M(i)

has multiple

banks

menting the edge latencies. Formally, a schedule S specifies S(i) and a processor M(i) from the target machine such that:

machine transferring

or node

m of identical

– S(j)l

>1,

these incre-

a start

time

processors

in

(No two nodes

are

processor.)

E E. (The and

earliest

processing

start time

time

of a node

of its

predeces-

sors.) If there

are no deadlines,

then

the

goal

of the

algorithm

is to construct

a

schedule with minimum completion time, that is, max Z{S(i ) + 1} is minimized; if there are deadlines, then the goal is to construct a feasible schedule that starts at time O. If for some assignment of deadlines there is no feasible ACM

Transactions

on Programming

Languages

and Systems,

Vol. 15, No, 4, September

1993.

636

K. V. Palem and B. B. Simons

.

schedule, then the algorithm struct a minimum tardiness rank

algorithm,

defined

should return that information schedule. We show in Section

below,

constructs

a minimum

and also con4.1,2 that if the

tardiness

schedule

for

a problem instance with deadlines, it also constructs a minimum completion time schedule for an instance of the same problem without deadlines. 2.2

Relationship

The pipeline standard

to Pipeline

model

notion

studied

with

the

of the

stage

k unit

this

model

start

that

of any instruction

may

be in the pipeline

paper

machine. length

pipeline

t + k. A new instruction the

in this

of a pipeline

k is a machine first

Scheduling

stages

at time

can enter

time

is more

the

for which pipeline

simultaneously.

the

at times

is at least and

The model

special case of the latency model in which A generalization of the standard pipeline

the classical

machine

an instruction

from

it depends,

than

pipeline

t exists

of an instruction

on which

general

A standard

last

or

of length that

stage

enters at time

t + 1,t + 2,. . . . In k units

as many

greater

than

as k instructions

of a standard

pipeline

is a

all the latencies are h – 1. model is obtained by allowing

an

instruction to exit the pipeline at some stage prior to the last stage. An instance of this problem can be represented by having identical latencies on all the out-edges of a node, but allowing different nodes to have different values

on their

out-edges.

For the algorithms

in this

the most general version of the model, namely of the same node can have different latencies. 2.3

Compiler

We briefly optimization

Construction

essary shared

discuss the interaction between in the compiler, particularly

instructions hazards resource

unnecessary instructions

that

we consider different

only

out-edges

Issues

allocation phase precedes the scheduling to use the same register, the register between

paper,

one in which

the scheduler and other stages of register allocation. If the register phase, then, by forcing allocator can create

are not otherwise

might be introduced. This that is allocated at compile

hazards could be introduced that complete simultaneously

such as a bus. Deadlines might provide There are different approaches for

dependent.

instructions dependence

Consequently,

problem can time. Another

unnec-

be caused by any example in which

is a target machine in which two access a single (limited) resource,

a technique for handling handling the interaction

this problem. between the

scheduler and the register allocator. In the approach used by Hennessy and Gross [ 16, 17], the instruction scheduler is explicitly constrained by hazards that are introduced by the register allocator and by memory access. Gibbons and Muchnick [13] deal with register allocation by introducing edges in the DAG to prevent instructions that share registers from overlapping. Register allocation is handled in the PL.8 compiler [3] by having the instruction scheduler preceded by a first register allocation phase and succeeded by a second register allocation phase. In the first phase the allocation is done for a target machine with an unbounded number of registers. A register is reused in the first phase only when the reuse is guaranteed not to add any additional ACM

TransactIons

on Progammmg

Languages

and Systems,

Vol. 15, No. 4, September

1993

Scheduling constraints.

After

instruction

scheduling,

Time-Critical register

Instructions

allocation

target machine is performed, and hazards are eliminated introduced spill code. The latter two approaches obviate scheduler

to explicitly

deal with

other

than

those

encoded

back

and

forth

between

presented

into

constraints

the input

and Hsu

from

separate sources,

that

the such

instruction

of register

scheduler as a bus.

scheduling

the

strategy

and

actual

allocation,

that

register

switches

allocation

is

[12].

We assume that the compiler that in [13] or [3]. Consequently, separated

scheduling

for

637

by appropriately the need for the

by register

DAG.2 A mixed

instruction

by Goodman

introduced

.

designer has employed a technique such the problem of instruction scheduling allocation.

Similar

approaches

as is

can be used to

from constraints introduced by other For a more detailed discussion, see our

shared chapter

reon

in [2].

3. PREVIOUS WORK In [3–6,

10, 13, 16– 18, 20, 21, 23, 30],

aspects

of instruction

scheduling

for

pipeline and related machines are studied. A survey of deterministic scheduling results for pipelined machines is contained in [20]; some of these results can be found general

in

more

sufficient

in polynomial previously

detail

condition

time, known

and

in

shows

Hennessy for

and

Gross

[16,

n is the number

their

heuristic

worst-case a heuristic

for the

schedules

an algorithm

new

the

problems,

a heuristic

machine

case in which

Palem

characterizes

problems

as well

problems,

produced

by their

that

runs

and report

[18].

There

Gibbons latencies

heuristic.

scheduling

as several satisfy

in time

this

0(n4),

performance

results

is no analysis

of the

and Muchnick

[13]

describe

are O or 1, with

the

substan-

time of 0(rz2 ). Although they report they too do not do a worst-case analysis

for optimally

a

to be solvable

taken for the PL.8 compiler [3]. are based on greedy scheduling.

in the DAG,

MIPS

[23],

scheduling

the approach results that

of the heuristic.

tially improved running mance for the heuristic, of the

some

17] present

on the

In

scheduling

solvable

of nodes

performance

10, 21].

that

polynomially

condition. We have already discussed There are a number of other where

[6,

for pipelined

Bernstein

an arbitrary

and

graph

good perforof the quality

Gertner

with

[4] give

latencies

of O

or 1 on a single processor. Since their algorithm uses transitive reduction as a preprocessing step, the running time of their algorithm is either that of transitive reduction3 or 0( n2 ), if preprocessing costs are ignored. Their algorithm does not handle time-critical instructions. In [5], Bernstein, Rodeh, and Gertner analyze the worst-case behavior of the greedy scheduling algorithm that

for a target result

2If code scheduling ing register

machine

with

to the multiple is done

lifetimes,

of backward scope of this

scheduling paper.

3The

running

time

DAG,

and the

running

prior

additional may

is the time

ACM Transactions

a single

pipeline to register register help

allocation

spill

eliminate

minimum of matrix

pipeline.

or processor

of O(ne),

and

In Section

no consideration

may

be introduced.

this

problem.

where

5 we generalize

case.

A detailed

e is the

is given

The use of deadlines

number

discussion of edges

to minimizand forms

is beyond in the

the

original

multiplication.

on Programming

Languages

and Systems,

Vol. 15, No. 4, September

1993.

638

K, V. Palem and B. B. Slmons

.

4. THE RANK ALGORITHM A standard algorithm

technique for instruction that always schedules

available

node. The input

a DAG

G, which

which the

We

the

nonnegative

algorithm

is an ordered

dependence

integer,

between

and

m, the

to each

possible

integer

step

t finishes

at

Time

algorithm

scans

the list,

scheduled,

giving

priority

time

time

choosing

at which

list

of nodes,

nodes,

number

a node

t + 1. At

each

up to m eligible

latencies,

of processors

to the nodes

earliest

of the current

node is eligible.

time

step is increased

can

time

nodes

in

as a time

start

step

the

greedy

on each scan to be

on the list.

all of its predecessors in G have been scheduled relevant latency constraints have been satisfied. value

the

architecture.

refer

step.4

to the greedy

represents

can be any

target

scheduling is to use a greedy scheduling a node whenever there is at least one

A node is eligible

if

on an earlier scan and the If no node is eligible, the

to the earliest

At the end of the scan, the chosen

time

nodes

for which

are deleted

some

from

the

list, and new nodes may become eligible. The process is repeated until the list is empty. The input to the greedy algorithm could be any list, including an arbitrarily ordered prioritize the

one, that does not use information nodes.5 The rank algorithm, defined

latencies

deadline node

between

of node

is an

a node

i, to compute

upper

bound

i and the

on

each

rank

the

of i’s

tardiness

are either

O or 1, and

Although

the rank

an arbitrary

DAG

schedule

finish

time

or all

algorithm

of the

as well (i).

node

as the

The rank

in

any

of a

feasible

the algorithm constructs a list, greedily. single processor schedule or a input

nodes

is not guaranteed

if the latencies

rank

of that

for an arbitrary

some

successors,

of i, written

schedule. Once the ranks are all computed, based on the ranks, which it then schedules The rank algorithm constructs a feasible minimum

about the graph structure to below, uses information about

are greater

DAG

if all the

have

preassigned

to find

a feasible

than

1, or if there

latencies deadlines.

schedule is more

one pipeline in the target machine, we conjecture that its behavior heuristic is quite good in the general case. There are some preliminary results in [7] showing the rank algorithm performing better on the 80860 than the Warren algorithm, which is used for instruction the IBM RS/6000 [30]. In at least one special case, namely interval ordered graph (see Section 4.1. 5), the rank algorithm optimal

schedule

for

arbitrary

latencies,

deadlines,

proximation bound of Section 5 applies any other greedy scheduling algorithm,

to the if there

4 In a (forward)

first

5The

highest

schedule level

some information level

first

j5rst about

algorithm

TransactIons

the

is not

cases of the instruction

ACM

we assume algorithm, graph

that

structure

guaranteed

scheduling

on Programming

O is the

a more

to construct

to construct

problem,

Languages

and Systems,

scheduling in the monotone constructs an

processors.

The

ap-

step.

version the

an optimal

as illustrated

as a test Intel

rank algorithm, as well as to are no preassigned deadlines.

time

sophisticated

and

for than

of the

ordered

schedule

in schedule

Vol

list.

greedy

algorithm,

However, even

for

S1 of Figure

15, No 4, September

the some 1,

1993

uses highest simple

Scheduling

Time-Critical

instructions

.

639

4.1 The Algorithm (1) Compute the ranks of all the nodes. If some node is assigned a rank less than or equal to O, return the information that the problem instance is infeasible.G (2) Construct list, their ranks. (3) Apply

which

is an ordered

the greedy scheduling

list of nodes in non decreasing

algorithm

order

of

to list.

4.1.1 Computing the Ranks. The weighted length of a path p is the sum of its constituent edge latencies and the number of nodes in p, excluding the endpoints of p. Let w‘( i, j) denote the weighted length of the longest path from node i to a successor j. In Figure 2, w+(il, iz) = w“F(il, i~) = ii) = O, and w+(il, i~) = 3. w+(i~, i~) = 1, w+(iz, The rank of node i is computed after w‘( i, j) and rank(j) have been computed for all nodes j that are successors of i. If j is a node that does not have preassigned deadlines, then d(i) + D, where D is some integer that is sufficiently example with

large

that

all nodes

no successors

are guaranteed

is (k + 1)n,where

of such a value

is called

a sink.

to be completed

k is the maximum

If i is a sink,

then

by time latency.

rank(i)

D. An A node

~ d(i).

For Figure 2, suppose d(il) = d(iz) = d(i~) = d(ii) = 6. Then rank(i4) = 6. If i has only a single successor j, then rank(i) = min{rank(j) – 1 – w+(i, j), d(i)}. In Figure 2 rank(iz) = 5 and rank(i~) = 4. Let i be a node with more than one successor, whose successors’ ranks have all been computed. We construct a sorted list sw(i ) of the successor set of node i. The nodes in SW(i) are sorted in nonincreasing order by the w+ values relative to i, that is, if w‘( i, ~“) > W+( i, p), then node ~“ occurs before node

p in

sw(i ). For

node

sw(il)

= ili~iz

and

which

w‘(

= q. Because

i, j)

sw(il)

il

of Figure

= i~izia.

Let

2, the

possible

sw(i)~

be all the nodes

SW(i) is sorted,

values

the nodes

in sw(i)~

for

SW( i ~) are

j in sw(i)

for

are contigu-

ous in sw(i). We next sort each swr(i) be the resulting

segment sw(i)~ by nonincreasing order of ranks. Let list. The nodes in swr(i ) are all the successors of node

i stored in the nonincreasing w+ value are further sorted

order based on w+; all the nodes with the same by their ranks. For Figure 2, the only possible

choice for swr(il) is swr(il) = idizi~. A schedule for a target machine with matrix in which each row represents machine, and each column represents

m processors

can be represented

by a

one of the processors of the target a time step. A slot is a single entry in

the matrix, and represents a specific time step on a specific processor. A slot is available if no node has been assigned to the specific start time and processor represented by the slot. To compute rank(i), we select nodes in the order in which they appear in swr(i

), starting

at the beginning

of swr(i

). Each

backward schedule j by greedily scheduling latest possible start time less than rank(j).

6If the problem

ACM

instance

Transactions

is infeasible,

a minimum

on Programming

tardiness

Languages

time

we select

a node J“, we

j in an available slot with the In particular, we schedule j in

schedule

and Systems,

is constructed.

Vol

15, No 4, September

1993.

640

K, V. Palem and B. B. Simons

.

1 Fig.

2.

Aa example

i]

DAG.

1

\

is

the time

step,

finishing

(1)

D’ < rank(j),

(2)

the number this

time

D’, such that:

at the largest

and of nodes that

step is strictly

j in swr(i)

occur before less than

and

have

been

m.

The rank of i with respect to j equals min{D’ – 1 – w+(i, D’ – 1 is the start time of node j. The rank of i with respect latest

time

completion respect

node in the

i

can

finish

backward

Correctness for the

rank

node

rank(i)

of the Ranks. algorithm.

if

schedule.

to each of its successors;

4.1.2 proof

that time

shows

j

is

to

be

We compute

is the

Below

It

assigned

we present

that

completed the

smallest

j), d(i)}. ‘l’he to j gives the rank

of these the

if no nodes

key

by

its

of i with

values. theorem

are completed

and later

than their deadlines, they must also be completed no later than their ranks. We assume that nodes with no preassigned deadlines are given the default deadline D. Note that the proof is entirely general, and holds for any number of processors

and any latencies.

TFI~O~E~

4.1.

is completed rank(i). PROOF.

Let G be a DAG

by its deadline.

If

i is a sink

and

Then

node,

S a schedule

for G in which

every node i in S is completed

then

the

theorem

definition of rank. Suppose that i is not a sink node and assume holds for all successors of i. If rank(i) = d(i),

follows inductively then the

every node

no later

trivially

from

than

the

that the theorem theorem obviously

holds. So assume that rank(i) = D – 1 – w+(i, j’) < d(i) for some j’ and D’. By the manner in which rank is computed, j’ is scheduled in the backwards schedule in time step D’ — 1. If D = rank(~”’ ), then the result follows immediately from the assumption that the ranks of all the successors of i satisfy the theorem, together with the definition of w + ( i, j’ ). Now assume that D’ < rank (j’ ) and let S~~C~W,r~ with completion time T ~~C~W~,~ be the backwards schedule as it exists immediately after the insertion of j’. Case 1. There are no idle slots in the time steps finishing at 1)’ + 1, D’ + 2, ..., T~~C~W~,~.Since nodes are scheduled as late as possible in S~,C~W,,~, ~ank(J”) ~ Tb~CkW~,d, for j ● S~,CkW,,d. From the order in which nodes are ACM

Transactions

on Programming

Languages

and Systems,

Vol. 15, No 4, September

1993.

Scheduling

Time-Critical

successors

w+(i,

j’)

Fig. 3.

An illustration

ofi

Tbackwe~

of Case 1 of the proof

1.

for m =

ofi

t

—t w+(i, j’) Fig.4.

t

t

D’

Anillustration

of Case 20fthe

proof

T&kua,d

form

= 1.

in S~.C~W.,~, we get W+( i, j) > W+( i, j’), j G S~,C~W,,~ (see Figure pigeon-hole argument suffices to prove that if i is completed

placed simple

641

.

D’+1

D’

successors

than than

Instructions

3). A later

rank(i) in the forward schedule, some successor of i will complete later the assumption that all of i’s successors time T~~C~W~,~.This contradicts

are completed Case

2.

D’ + 2,..

There

ranks.

is some

., Z’~,C~W~,~. Let

containing all nodes

by their

an idle in time

slot. steps

idle

slot

t be the

the

time

steps

time of the S~8C~W~,~is constructed

Since with

in

start

start

time

less than

finishing

smallest greedily,

t have rank

at

D’ + 1,

such time it follows no greater

step that than

t.Also,

by assumption, all these time steps have no idle slots. Therefore, all by time t and nodes scheduled in times steps prior to t must be completed have a w+ value at least as great as w+ (i, j’ ) (see Figure 4). The theorem again If

follows

from

a pigeon-hole

1 is a problem

that is obtained no preassigned (rather instance LEMMA

instance,



argument. then

18 is defined

than D). We also define rank(i) to 1, and rarzk~(i) to be the rank of node 4.2.

Let I and Ia be as defined

The proof

PROOF.

to be the

by adding 8 to every preassigned deadline deadline, then it is given the preassigned

follows

directly

above.

from

problem

be the ranks i in 18. Then

the definition

instance

of I. If a node has deadline of D + 8

ranka(i)

computed

= rank(i)

of rank.

for

+ 8.



Corollaries 4.3 and 4.4 show how the rank algorithm can be used to solve the minimum tardiness problem and the minimum completion time problem in the absence of deadlines. ACM

TransactIons

on Programming

Languages

and Systems,

Vol. 15, No. 4, September

1993

642

K. V, Palem and B, B. Simons

.

COROLLmY schedule

4.3.

for

minimum

tardiness

PROOF’. ciently

If

sponding algorithm

that

the

of problems

if

schedule

a problem

large

which

Assume

a class

6 such

I

when

a feasible

schedule

is infeasible,

8 is ad.ded

it

from

that

then

to all

CO~O~I,m~

a

exists

deadlines,

a suffi-

the

corre-

Since, by assumption, the rank if one exists, the smallest 6 for

can be constructed

4.4.

Suppose

the rank

is the

algorithm

for a class of problems.

a minimum preassigned

value

completion deadlines.

time

Then

of the

schedule

constructs

the rank

minimum

for

a minimum

algorithm

inputs

in

the problem

instance,

O tardiness. If rank algorithm

then

D is the minimum The Running running

graph.

also

Time time,

assume

algorithm

will

tardi-

there

are

no

each node the same completion time for

construct

a schedule

with

minimum completion time, by Lemma 4.2, the the identical schedule as for the case in which

completion

worst-case We

the rank

D is not the will construct

18. ❑

also constructs

which

The proof follows from the technique of giving the minimum D as a deadline. If D is precisely

PROOF.

number

4.1.3

constructs class.

there

the

a feasible

also

It follows from Lemma 4.2 that list is the same for both 1 and the same schedule is constructed for both problem instances.

ness schedule



time.

of the Rank we

that

Algorithm.

assume the

that

input

For

the

DAG

input

the

DAG

analysis is

is a transitively

of the

a connected closed

graph,

G = (N, E’), the transitive closure of a graph G = (N, E) is a graph consists of all the nodes from G together with an edge from i to j if is a path in G from i to j. Otherwise, the transitive closure is

where which there

automatically

computed

during

the computation

Computing the w+ values. The time O(en ). Given the w+ values, involves that

constructs

Then

instances

problem instance 18 is feasible. constructs a feasible schedule

tardiness. Therefore,

algorithm exists.

for infeasible

instance

that

rank one

sorting

has

the

various

a worst-case

transitively

closed

graph

total time required for where e’ is the number Backscheduling

computation constructing

successor

running

of the w‘-

sets.

time

of

We

of all the w+ values takes the lists sw(i) and swr(i) can

use

0( n log n).

is used for processing

values.

any

Since

the rank

sorting

each

of only

algorithm

edge

in

the

one node,

the

sorting all the sets sw(i) and swr(i) is O(e’ log n), of edges in the transitive closure of G.

using

UNION-FIND.

Once

the

list

swr( i) has

been

con-

i is If the backward scheduling is done in a straightforward fashion, it increase the running time of the algorithm. Therefore, we implement this

structed,

the

backward

scheduling

step

of the

rank

computation

for

node

performed.

will

step using successors Suppose

the UNION-FIND algorithm of i. there are n, distinct ranks

create n, associated that rank in swr(i),

single node trees, tree[ 1], tree[2], . . . . with each tree. We order the trees by (tree[ p – 1]) < rank (tree[ p ]). So rank and rank (tree[ n,]) is the largest rank

ACM

Transactions

on Programming

Languages

[29] on the values values

and Systems,

among

the

of the ranks successors

of i. We

tree[ n, ], with a single their associated ranks, (tree[ 1])is the smallest in swr(i).

Vol

15, No 4, September

of the

1993

rank such rank

Scheduling Each

tree[ p ] has

a field

called

Time-Critical

capczcity[

instructions

p ] associated

643

.

with

it.

We

set

capacity [l] + m x mwk(tree[ l]), and cczpacity[ p] + m X (rarzk(tree[ p]) – rank (tree[ p – l])), where m is the number of processors. capacity[ p ] is the number

of nodes

greater than tree[ p ] also

that

can be inserted

into

the backward

schedule

in the slots

rcmk(tree[ p – 1]) and less than or equal to rcmh( tree[ p ]). Each has a field called corztent[ p ]; initially confent[ p ] ~ O for 1
1, then tree[ p ] is made rcmk( tree[ q ]) is the largest rank of any tree with (Initially, q = p – 1, but as trees are have q < p – 1.) If content[ 1] ever becomes greater feasible schedule and rank(i) rank(i) (greedy that some node is scheduled in S,~~~ prior to j),

(2) an idle

slot is encountered,

(3) all of S,..k Suppose rank(j) let

has been

the

first

> ranh(

Z

be

the

examined.

condition

set

holds,

t be the time

i ). Let of

scheduling

nodes

namely step

node

at which

scheduled

to

j

is

encountered

~ is scheduled

start

at

time

with

to start

and

t + 1,t +

steps

– 1, together with node i itself. (Node j is not included in 2.) 2 9..., rank(i) By the definition of node j, all nodes in Z have rank less than rank(j). Also, since i = Z, IX] = rank(i) – t. Let j’ be the node scheduled to start at time step t – 1; j’ must be a predecessor of all the nodes in X, since otherwise have scheduled one of those nodes at time step successor

since

of j’

and

otherwise

Therefore,

for

the backward

there

the all

are

no

greedy

successors

schedule

will

other

paths

algorithm k’

of j’

the greedy

algorithm

would

t.If k = X is an immediate j’ to k, then w (~, k) = 1, schedule node k at time t.

from

would we have

w+ (~, k’)

cause some successor

>1.

Consequently,

k’ of j’ to have

a finish

time

t + 1. By the definition of rank, this gives rank(j’) < (D’) no greater than the assumption that all the nodes in S,., ~ are completed t – 1, contradicting by their ranks. Therefore, this condition cannot occur. If the second condition holds, namely an idle slot is encountered, then again let argument

node j’ be the node immediately preceding the idle is the same as above. Again, this condition cannot occur.

8As computation computation memory ACM

time.

latencies

TransactIons

speeds The exceed

increac+e, rank

latencies

algorithm

can

from be used

memory

accesges

as a heuristic

are

increasing

for those

cases

slot;

relative

to

in which

the

1.

on Programming

Languages

and Systems,

Vol. 15, No. 4, September

the

1993

Scheduling If the last

condition

s ,ank have rank principle,

holds,

then

no greater

there

that

this

Instructions

S,~~h has no idle time,

than

rank(i).

is a node whose

4.1, we conclude

Time-Critical

rank

contradicts

the

from

than

existence

645

and all the nodes in

Consequently,

is no greater

.

the pigeon-hole

zero. From

Theorem

of a feasible

schedule.

(Intuitively, at least two nodes must be scheduled in the same slot, contradicting the fact we are constructing a schedule for a single processor.) If the problem instance is infeasible, it follows from Corollary 4.3 that the rank

algorithm

4.1.5

Monotone

sors.

Even

plete time)

constructs

a minimum

Interval-Order,

though

the

tardiness

Arbitrary

general

for arbitrarily large latencies, algorithms exist for interesting

which the rank algorithm constructs is called monotone interval-orders. processors

in

the

target

Latencies,

instruction

machine



schedule. and

scheduling

these

in the real

intervals,

numbers each

possible that fast (polynomial of graphs. One such class, for

a feasible For this

schedule problem

can

be arbitrary,

whenever class the and

node

signed [24].

y with

either

LEMMA 4.7. all

For

deadline

Let G = (N,

A monotone

y c j,

a preassigned

constraints nodes that

the predecessors

are also predecessors

i, j = N, (i, j)

x = i and

has

The only and all

the same large

either

The edges of G are derived

as follows.

x and

algorithm. nonnegative

line.



the N

from

one exists, number of

latencies

and

is a set of closed the order

between

E if and only if for any pair

x < y. Each deadline

edge has a latency,

or

is

assigned

on their values are that do not have preassigned by the algorithm.

E)

Proces-

is NP-com-

it is still classes

deadlines can assume arbitrary integer values. An interval-order graph is a DAG G = (N, E), where intervals

Multiple

problem

be an interval-order

by

the

the latencies are deadlines are as-

The following

graph.

of i are also predecessors

one

of

and

lemma

is from

Then for i, j G N,

of j or all predecessors

of j

of i.

interval-order

(i, j) and (i, j’), zv(i, j) predecessors of j.

graph

z w(i,

j’)

is one in which, whenever

THEOREM 4.8. Let G = (N, E) arbitrary latencies and deadlines,

the

given

any pair

predecessors

of edges

of j’

are

also

be a monotone interval-order graph with and assume that there are m > 1 proces-

sors. Then the rank algorithm constructs a feasible schedule for G whenever one exists, and constructs a minimum tardiness schedule otherwise. As in Theorem 4.6, we initially assume for contradiction algorithm fails to construct a feasible schedule for G, but

PROOF.

rank

feasible

schedule

constructed the

problem

completed Clearly, following

for

G. Let

by the greedy instance by rank(j) three

Case 1. scheduled

its

S,~~~

is infeasible,

rank.

> S,..h(j) cases.

be the

scheduling For

and

j = S,a.~, for

There are precisely at each of the time

partial

algorithm

all nodes

let let in

schedule

when i be the

S,~.~( j) S,a.k

it first first

had

been

determines node

be the except

that

that the there is a

start

that

that is not

time

i. We consider

of j. the

m nodes with ranks bounded above by rank(i) steps O, 1, ..., rank(i) – 1. Then by a simple

ACM Transactions on Programming Languages and Systems,Vol. 15, No. 4, September 1993.

646

K. V. Palem and B. B. Simons

.

pigeon-hole feasible

argument,

together

with

Theorem

4.1, there

does not

exist

any

schedule.

Case 2.

There

is either

rank (i) scheduled

at time

an idle

slot or some node

with

rank

greater

than

step

rank(i) – 1. Then i must have a predecessor, i) + 1 > rank(i). Otherwise, i would have say ~, such that S,ank(~) + w+(~, of rank, been scheduled to start at time step rank(i) – 1.From the definition we get that rank(j) s S,., ~(.j), contradicting the assumption in a time step smaller than rank(i) has a start time less than Case 3. There is some time step either an idle slot or some node with

O s t’ < rank(i) rank

greater

that

– 1 such

than

aw

its

ranii(

node

rank.

that

there

i ) scheduled

is at

time

step t’. Let t be the largest such time step, and let Z be the set of nodes – 1} together with node i. Clearly, with start times of{t + 1, t + 2, . . . . rank(i) IXI = (rank(i) – t – 1) x m + 1. Any node i’ ● Z must have a predecessor j such

that

S,u.h(j)

i’) + 1 > t. Otherwise,

+ W+(j,

Ipred(k)l

< Ipred(k’)1,

i’

constraining the smallest

j is signed start time t. We say that Let k = Z be the node in X with

for k, k’ = X. BY Lemma

would

have

been

node of i’. sized predecessor

a

4.7 every

as-

set, i.e.

node in ~red(k~

is a

node for predecessor of all the nodes in X. Let j ● pred( k ) be a constraining k. Then, because G is a monotone interval order, w ‘(j, k’) > w ‘(j, k ) for k’ = Z. Consequently, j is a constraining node for all k’ = Z. But now the rank computation for j results in a rank for j which is less than the finish contradicting the assumption that i is the first node in time for j in Sr..~, ,anh with this property. s If the problem instance is infeasible, it follows from Corollary 4.3 that the rank

algorithm

constructs

a minimum

5. THE GREEDY SCHEDULING

tardiness



schedule.

HEURISTIC

Most scheduling algorithms are greedy in that they do not introduce idle time if some instruction is available for scheduling. The result in this section holds for any greedy scheduling algorithm applied to an instruction scheduling problem containing ber of processors,

an arbitrary DAG, and no preassigned

and the algorithms THEOREM

cies

schedule no more PROOF.

with the disposal, being

5.1.

between

will Let

O and

tend

to perform

G = (N, k. Then

E)

the completion

latencies, an arbitrary The analysis is worst

better

numcase,

in practice.

be an arbitrary

the greedy

for G on a target machine than a factor of 2 – l/m(k Consider assumption as opposed

arbitrary deadlines.

DAG

scheduling

with m processors + 1) worse than

with

arbitrary

algorithm

laten-

constructs

that is guaranteed optimal.

a

to be

the greedy schedule constructed for the given DAG G that we have as many processors as we can use at our to only m. We use S% to denote this schedule, with T. time

of Sm. Let

Sgr.,~Y

be a schedule

constructed

by the

greedy algorithm for the given DAG with a target machine of m processors, and let Tg,e,~Y be the completion time of Sg,,,dY. We say that a time step in a schedule is actiue provided it has at least one node scheduled in it. Otherwise, it is idle. If P is a path in G, the number of ACM

Transactions

on Programming

Languages

and Systems,

Vol. 15, No 4, September

1993

Scheduling

Time-Critical

idle slots in P is defined to be the sum of the latencies define idle. to be the maximum number of idle slots length

of a path

P is the sum of the number

slots in P. By construction, LEMMA 5.2.

The maximum

number

idle slot is TX; the maximum

(that

steps containing

is, time

Any

PROOF.

a node

time

or an idle

only

step in Sg,.,~Y

slot

from

every

steps

number

idle

and the number

of the longest

of time

least a single

slots)

647

.

of the edges of P. We of any path in G. The

of nodes

T. is the length

instructions

path

of idle

in G.

in S~r. n/m

and

– T./(k

+ T.(1

+ 1)) + T.

– l/(m(k

+ 1))).

(4)

TOP, > T. in (4), we get

T~,,e~Y s TOP, + TOP,(l

– l/(m(k

+ 1)))

(5)

or T~,,cdY/TOP, ACM

Transactions

= 2-

on Programming

l/(m(h

Languages

+ l)). ❑ and Systems,

Vol

(6) 15, No. 4, September

1993.

648

K. V. Pa[em and B. B. Simons

.

The

completion

time

of a greedy

schedule

is within

completion time of an optimal schedule. However, schedule can degrade as the number of processors well

the latencies,

constructed given

increase.

There

greedy

algorithm

by the

by (6), thereby

showing

6. NP-COMPLETENESS All

of the

chain

has at most chain

to

consists

a simple

an NP-completeness

Similarly,

close

block

a schedule

to the

bound

for

with

result

it also holds reductions

in which

every

is connected.

straightline

Since a

a chain is the reader

case automatically

implies

that

since a chain is a very to more complex DAGs.

in Section

for machines are all from

A

node

dependence

graph (for example, results. We remind

a simple

since our NP-completeness

is a set of chains.

and the graph

of code

is a very simple strong negative

proof

that path,

the more complex cases are NP-hard. In particular, simple DAG, our NP-completeness results also apply with a single register, The NP-completeness

in which

is tight.

use a DAG

of only

and one out-edge, a basic

(threads) and since a chain also a tree), these are very that

in [20]

be arbitrarily

the bound

reductions

that

one in-edge

corresponds

that

can

of 2 of the

RESULTS

NP-completeness

is a subgraph

are examples

a factor

the quality of the greedy in the target machine, as

6.1 holds

with multiple the 3-partition

for machines registers. problem

[ 11],

which is defined as follows. Given a multiset A containing 3rz integers and a positive integer bound B, where B/4 < a, < B/2 for all a, = A anl x~fl la, = Bn, is there a partition the sum of the integers

of A into n triples of three in each triple equals B?g

elements

each such

that

6.1 Registers If there

is only

a single

processor

and some but not necessarily then constructing a minimum

and a single

register

on the target

machine

all of the nodes are preassigned to the register, completion time schedule is NP-complete. To

the best of our knowledge, ours is the first correct proof that the register constraints can transform a version of the instruction

addition of scheduling

problem, for which a polynomial time algorithm exists (from Theorem 4.6), into an NP-complete problem. When a value is stored in a register by instruction i, a new value cannot be inserted

into

current

value

the have

register

until

after

been executed.

all

the

We define

instructions a register

which constraint

access

the

as follows.

If w~ .X(i) is the maximum latency for all edges (i, j), then a new node cannot be inserted into the register until at least Wma, time units after the completion of instruction i [17]. Because we are presenting a negative result, this very weak definition of a register constraint only strengthen 1s the result. THEOREM 6.1. (The register that

is a set of chains

allocation

the latencies

Let G = (IV, E) be a DAG is

‘Because the 3-partition problem is strongly NP-complete the value of the numbers in the 3-partition problem NP-completeness.

[11], a reduction that is polynomial instance is sufficient for a proof

in of

ACM

Vol. 15, No. 4, September

on Programming

which

problem).

there

Transactions

for

Languages

and Systems,

are all

equal

to 1 and

1993

Scheduling

Time-Critical

Instructions

only one register. The problem of determining if there schedule for G having a completion time no greater than

649

.

is a single processor D, for some given D,

is NP-complete. Membership

PROOF.

NP-hard

by reducing

in the

NP

is

obvious.

3-partition

We

problem

show

that

the

to the register

problem

allocation

is

prob-

lem. Given an instance of the 3-partition problem, we construct an instance the register allocation problem in which all latencies are one. For each there Each

of a,

is a corresponding chain called a number chain, as shown in Figure 5. number chain C(a,) consists of two subchains called the first subchain

and the

second

subchain,

each containing

a, nodes.

Nodes

that

are assigned

to the register are called register nodes and nodes that are not assigned to the register are nonregister nodes. All of the nodes in the first subchain are nonregister nodes and all of the nodes in the second subchain are register nodes. There C;h,..

is also ., Cjh

a place-holding

(see Figure

chain,

5). C~k

CPk, that

contains

consists

of

2 B + 1 nodes,

n subchains

with

the

first

B

nodes being register nodes and the remaining B + 1 nodes being nonregister nodes. C~k, 2 < i < n, contains 2 B nodes, with the first B – 1 nodes being register nodes and the remaining B + 1 nodes being nonregister nodes. Cpfi is constructed by linking subchams c~~, . . . . C“~k in order with latency 1 edges. If the node

in CP~ is started

at time

O, since

all edges have

latency

1

and the last node of C~k has no out-going edge, the earliest completion time any schedule with a for CPh is 4Bn + 1. We set D = 4Bn + 1; consequently, completion

time

created Cpfi

by the

must

solution

schedule above

as soon

to the 3-partition

4Bn

both

and end with

of the register

every

node

problem

and suppose We construct

corresponding

instance

place

Suppose

that

a schedule

the last

node from

tion

is that

subchain

node from

all register

ter node, with

nodes

the exception

node in S. More specifically, C(al ), C(aJ),

Cj~,

and

which

a number

is a nonre@ster

chain

(Figure

are both

preceded

of the first

register

schedule C(ak )—all

each

of the

of which

after

the

first

B

nonregister

completion time of Cph, since repeated for each of the triples

ACM Transactions

nodes

C;h.

of nodes nodes,

prob-

from Cpk. except for

node and is followed

in the

first

descrip-

by a nonregis-

node of c~k, which

are nonregister

of

completion

constraint

and succeeded

the first B register nodes from C;h. In a similar of the B register nodes from the second subchains

is a

ah comprise

6). An equivalent

B nodes

chain

there

S with

of the register

CP~.

problem

holding that

a,, aJ, and

lem by inserting a number chain node between each pair The nodes in S alternate between register and nonregister by a nonregister

a node from constraint

of the

as it is available.

in the solution.

+ 1 for the

begin

for an instance

transformation,

be scheduled

one of the triples time

+ 1 must

of 4Bn

In any feasible

is the first

subchains

nodes—after

of

each

of

manner we schedule each of C(a, ), C(al), and C( ah ) This

does

not

increase

the

each edge in Cph has latency 1. The Process is in the solution of the 3-partition problem, each

on Programming

Languages

and Systems,

Vol

15, No 4, September

1993.

650

K. V. Palem and B. B. Simons

.

J-=d--o-J+u&l

C(al)

v al

nodes

L

C-U-L-J

—cl

o

al nodes

0 0

a3n nodes

a3n nodes

543 “d==cd-oo’d-o”” C&4=D cJu&kcL-o OO

\

00

/\

k“

v

B nodes



/

B+l

~~ nodes

B-1

O

register Fig. 5.

The number

and place-holding

nodes

nodes

nonregister

chains for the register-constraint

nodes last nonregister

B+l

frOm

problem.

C;h

node horn C~~~ 7

(B-1) register

nodes

I

(B+l)

nonregister

nodes

I

I

1

I

v B non register

nodes

B register

nodes ‘1

nodes from number Fig. 6. ACM

TransactIons

Structure

of a feasible

on Programmmg

chains C(a, ), C(aj),

schedule

Languages

C(ak)

for the register-allocation

and Systems,

Vol

~ problem.

15, No 4, September

1993

Scheduling time

interleaving

register

nodes with

Time-Critical

nonregister

instructions

nodes.

There

is no idle time

in the schedule, and the completion time is 4Bn + 1. Conversely, suppose that we are given a schedule S for the multiple problem that completes by time 4Bn + 1. This implies that S has time and that all the nodes from nodes from CP~. It also implies between

register

described Given

nodes

and

the that

number chains the interleaving

nonregister

nodes,

are

except

651

.

chain no idle

interleaved with strictly alternates

at the

boundaries,

as

predecessors

or

above. schedule

S, let

~,

be the

nodes

that

are

either

successors of the register nodes of C~~ in S, and let R, be the nodes that have nonregister nodes of C~~ as both their predecessor and successor nodes in S,l