Scheduling Time-Critical Instructions on RISC Machines

KRISHNA V. PALEM, IBM T. J. Watson Research Center
and
BARBARA B. SIMONS, IBM Santa Teresa Laboratory
We present a polynomial time algorithm for constructing a minimum completion time schedule of instructions from a basic block of straightline code, for a target RISC machine such as the IBM 801, the Berkeley RISC machine, the Sun SPARC, or the HP Precision Architecture. The algorithm can also handle time-critical instructions, which must always be completed by specified deadlines. Time-critical instructions are of interest because, as we show, they can be used to make registers available for reuse early and to model real-time constraints on the input computations. Our algorithm can also be used as a heuristic for target machines with multiple identical pipelines. We prove that a greedy scheduling algorithm for such machines always produces a schedule whose length is less than twice that of an optimal schedule. We also prove that the scheduling problem becomes NP-hard even in the absence of time-critical constraints, and even when the code being input consists of only several independent streams of straightline code. Finally, we prove that the problem of scheduling time-critical instructions quickly becomes NP-hard for a single small-depth pipeline if either there is only a single register available for reuse, or if no two instructions are allowed to complete simultaneously because of some shared resource such as a bus.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—code generation; compilers; optimization; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems—sequencing and scheduling

General Terms: Algorithms

Additional Key Words and Phrases: Compiler optimization, deadline, greedy algorithm, latency, machine scheduling, NP-complete, pipeline processor, register allocation, RISC instruction scheduling
1. INTRODUCTION

Many code optimization problems for parallel and pipelined machines can be modeled as deterministic scheduling problems. Typically, these scheduling problems involve rearranging generated object code that is derived from a

Authors' addresses: K. V. Palem, IBM Research Division, T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598; e-mail: rpalem@watson.ibm.com. B. B. Simons, IBM Santa Teresa Laboratory, Compiler Technology Institute, 555 Bailey Road, San Jose, CA 95141.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1993 ACM 0164-0925/93/0900-0632 $1.50
ACM Transactions on Programming Languages and Systems, Vol. 15, No. 4, September 1993, Pages 632–658.
single basic block of source code. The object code instructions have deterministic behavior and often require a single unit of execution time on the CPU [19, 25, 26]. A fast algorithm for rearranging the object code in a basic block to minimize execution time can help improve the quality and efficiency of the code generated by the compiler. In particular, a minimum execution time schedule contains the smallest possible number of idle cycles, or no-ops, thereby utilizing the processor effectively. In addition, it is frequently necessary to guarantee that certain instructions are completed early, for example, to minimize the spillage of registers induced by register allocation.

To illustrate the notion of time-critical instructions, assume that instruction s_i initializes a value which is subsequently referenced by instructions s_j, s_k, ..., s_l. Suppose that s_i is given an early deadline and that s_j, s_k, ..., s_l are given somewhat later, but still early, deadlines. Then, if a schedule satisfies the deadline constraints, the register that is used to store the value initialized by s_i will be available for other use no later than the latest deadline that is assigned to s_j, s_k, ..., s_l.
We present a fast algorithm that takes a basic block of code as its input and constructs a minimum completion time schedule for a generic model of RISC machines. This model approximates several RISC processors such as the Sun SPARC [28], the IBM 801 [26, 27], the Berkeley RISC machine [19], and the HP Precision Architecture [13]. It also constructs an optimal schedule satisfying time-critical constraints for such machines. Our algorithm can be used as a heuristic for constructing "fast" schedules for RISC processors with multiple pipelines, for which it is not optimal, such as the Intel 80860 [7].
We show that any greedy scheduling algorithm, that is, one that inserts a no-op only when no instruction is available for scheduling, produces a schedule with an overall completion time that is never more than twice that of an optimal schedule, in the absence of time-critical constraints, for target machines with multiple identical pipelines. The factor of two is a worst-case guarantee; in practice, greedy algorithms tend to perform better.

This paper also contains three NP-completeness results. One proves that the problem of producing a minimum completion time schedule is NP-hard if the depth of the pipeline grows as part of the input, even when the basic block of code being input consists of only several independent streams of straightline code, and these instructions have no time-critical constraints.
The other two results demonstrate how the introduction of resource constraints can make a problem NP-complete. In particular, the instruction scheduling problem is NP-hard for a single small-depth pipeline, even if the inputs are only independent streams of straightline code, if either the time-critical constraints are for a single register, or if no two instructions are allowed to complete simultaneously.1

1 The completion time-constraint problem was brought to our attention as an open problem at the Workshop on Languages and Compilers for Parallel Computing, Cornell University, Aug. 1988.
A weaker NP-completeness result for the instruction scheduling problem is claimed in [17], but the proof is flawed [15].

2. DESCRIPTION OF THE MODEL

We consider target machines in which instructions are fetched, decoded, and executed in one cycle of CPU time. If the operands of the instruction are derived from on-chip registers, then such instructions require one cycle. In contrast, some instructions, such as LOADs, require additional cycles due to latencies introduced by memory access.

We use the standard directed acyclic graph (DAG) representation of basic blocks [17]. Each node in the graph corresponds to an instruction, and each edge corresponds to a dependence. An instruction cannot be executed until all of its predecessors have been completed. Furthermore, if an instruction requires additional machine cycles to complete, say a LOAD, every instruction that depends on that LOAD must be delayed until the entire LOAD instruction, with the additional time, has been completed. The additional delay, which is represented as a weight on the appropriate out-edges from the LOAD to its immediate descendants, is called an interinstructional latency, or latency, for short. The value of the latency is the additional delay beyond the unit of time required by the CPU.

Figure 1 shows a simple DAG, all the edges of which have latency 1, and two possible schedules for that DAG. Schedule S1 illustrates how unnecessary idle time can be introduced if the nodes are scheduled in a suboptimal order. Clearly, schedule S2, which completes execution earlier than S1, is preferable. The idle time in schedule S1 could have been introduced either at compile time using no-ops or at runtime, if the target machine has hardware interlocks. Therefore, depending on the machine, the problem is either to minimize the number of idle cycles caused by no-ops produced by the compiler or to minimize the number of cycles during which the interlocks are activated.

In addition, the input might contain time-critical instructions. Such an instruction has associated with it a nonnegative integer called a deadline.
The deadline could be either a real-time constraint or a value chosen by the programmer or compiler to try to improve performance. In this case, the problem is to construct a schedule in which all instructions are completed by their deadlines. We do not address the question of how to assign deadlines. A schedule in which all the nodes are completed by their deadlines is called a
feasible schedule; otherwise it is infeasible. A problem instance is feasible or infeasible according to whether or not a feasible schedule exists for that instance.
If the input has no deadlines, then by default it is feasible. An instruction completed after its deadline is tardy. If instructions are allowed to be tardy, a minimum tardiness schedule is a schedule in which the maximum tardiness of any node is minimized.
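The feasibility and tardiness definitions above can be checked mechanically. The following is a minimal sketch; the dict-based encoding of finish times and deadlines is our assumption, not the paper's:

```python
def max_tardiness(finish, deadline):
    """Maximum tardiness of a schedule: the largest amount by which any
    node completes after its deadline (0 if the schedule is feasible).
    `finish` and `deadline` are hypothetical dicts mapping node -> time."""
    return max(max(finish[v] - deadline[v], 0) for v in finish)

def is_feasible(finish, deadline):
    # A schedule is feasible when every node completes by its deadline.
    return max_tardiness(finish, deadline) == 0
```

For example, a node that finishes at time 3 against a deadline of 2 is tardy by 1, so the schedule is infeasible even if every other node meets its deadline.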
2.1 Some Definitions

Assume we have a set of instructions that form a basic block. Techniques such as trace scheduling [8, 9] or global compaction [1, 22] can be used to
Fig. 1. A DAG with two possible schedules.
increase the size of this basic block. The problem input is a DAG G = (N, E) that represents the basic block, where each node i ∈ N corresponds to one of the instructions, and each edge (i, j) ∈ E corresponds to a dependence. In addition, each edge has a nonnegative integer weight w(i, j), which is the latency of edge (i, j). If a node i must be completed by a certain time t, then i has a deadline d(i) = t. If the target machine has multiple (identical) processors, then the delays involved in transferring data items between processors through on-chip register banks are encoded by appropriately incrementing the edge latencies. Formally, a schedule S specifies for each instruction or node i a start time S(i) and a processor M(i) from the set 1, 2, ..., m of identical processors in the target machine such that:

(1) For i, j ∈ N, i ≠ j, and M(i) = M(j), |S(i) − S(j)| ≥ 1. (No two nodes are executed simultaneously on the same processor.)

(2) S(j) ≥ S(i) + w(i, j) + 1 for (i, j) ∈ E. (The earliest start time of a node depends on the start time, latency, and processing time of its predecessors.)

If there are no deadlines, then the goal of the algorithm is to construct a schedule with minimum completion time, that is, max_i {S(i) + 1} is minimized; if there are deadlines, then the goal is to construct a feasible schedule that starts at time 0. If for some assignment of deadlines there is no feasible
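Conditions (1) and (2) can be checked directly against a candidate schedule. The sketch below assumes a hypothetical encoding (start times S and processor assignments M as dicts, latencies w keyed by edge); the paper prescribes no concrete data layout:

```python
def is_valid_schedule(S, M, w, m):
    """Check conditions (1) and (2) from the text.
    S: node -> start time; M: node -> processor in {1..m};
    w: dict mapping edge (i, j) -> latency. Hypothetical encoding."""
    nodes = list(S)
    if any(not (1 <= M[v] <= m) for v in nodes):
        return False
    # (1) No two nodes run simultaneously on the same processor.
    for a in nodes:
        for b in nodes:
            if a != b and M[a] == M[b] and abs(S[a] - S[b]) < 1:
                return False
    # (2) A node starts only after each predecessor's unit execution
    #     time plus the edge latency: S(j) >= S(i) + w(i, j) + 1.
    for (i, j), lat in w.items():
        if S[j] < S[i] + lat + 1:
            return False
    return True
```

With a single edge of latency 1, a successor may start no earlier than two time units after its predecessor, which matches condition (2).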
schedule, then the algorithm should return that information and also construct a minimum tardiness schedule. We show in Section 4.1.2 that if the rank algorithm, defined below, constructs a minimum tardiness schedule for a problem instance with deadlines, it also constructs a minimum completion time schedule for an instance of the same problem without deadlines.

2.2 Relationship to Pipeline Scheduling

The pipeline model studied in this paper is more general than the classical notion of a standard pipeline machine. A standard pipeline of length k is a machine for which the first and last stages of an instruction that enters the pipeline at time t exist at times t and t + k. A new instruction can enter the pipeline at times t + 1, t + 2, .... In this model, the start time of an instruction is at least k units greater than that of any instruction on which it depends, and as many as k instructions may be in the pipeline simultaneously. The standard pipeline machine is a special case of the latency model in which all the latencies are k − 1. A generalization of the standard pipeline model is obtained by allowing an instruction to exit the pipeline at some stage prior to the last stage. An instance of this problem can be represented by having identical latencies on all the out-edges of a node, but allowing different nodes to have different values on their out-edges.
For the algorithms in this paper, we consider the most general version of the model, namely one in which different out-edges of the same node can have different latencies.

2.3 Compiler Construction Issues

We briefly discuss the interaction between the scheduler and other stages of optimization in the compiler, particularly register allocation. If the register allocation phase precedes the scheduling phase, then, by forcing instructions to use the same register, the register allocator can create hazards between instructions that are not otherwise dependent. Consequently, unnecessary dependence might be introduced. This problem can be caused by any shared resource that is allocated at compile time. Another example in which unnecessary hazards could be introduced is a target machine in which two instructions that complete simultaneously access a single (limited) resource, such as a bus. Deadlines might provide a technique for handling this problem. There are different approaches for handling the interaction between the
scheduler and the register allocator. In the approach used by Hennessy and Gross [16, 17], the instruction scheduler is explicitly constrained by hazards that are introduced by the register allocator and by memory access. Gibbons and Muchnick [13] deal with register allocation by introducing edges in the DAG to prevent instructions that share registers from overlapping. Register allocation is handled in the PL.8 compiler [3] by having the instruction scheduler preceded by a first register allocation phase and succeeded by a second register allocation phase. In the first phase the allocation is done for a target machine with an unbounded number of registers. A register is reused in the first phase only when the reuse is guaranteed not to add any additional
constraints. After instruction scheduling, the actual register allocation for the target machine is performed, and hazards are eliminated by appropriately introduced spill code. The latter two approaches obviate the need for the instruction scheduler to explicitly deal with constraints other than those encoded into the input DAG.2 A mixed strategy that switches back and forth between instruction scheduling and register allocation is presented by Goodman and Hsu [12].

We assume that the compiler designer has employed a technique such as that in [13] or [3]. Consequently, the problem of instruction scheduling is separated from that of register allocation. Similar approaches can be used to separate scheduling from constraints introduced by other shared resources, such as a bus. For a more detailed discussion, see our chapter in [2].
3. PREVIOUS WORK

In [3–6, 10, 13, 16–18, 20, 21, 23, 30], aspects of instruction scheduling for pipeline and related machines are studied. A survey of deterministic scheduling results for pipelined machines is contained in [20]; some of these results can be found in more detail in [6, 10, 21]. Palem [18] characterizes a general sufficient condition for scheduling problems to be solvable in polynomial time, and shows that previously known polynomially solvable problems, as well as several new problems, satisfy this condition. We have already discussed the approach taken for the PL.8 compiler [3].

There are a number of other scheduling results that are based on greedy scheduling. Hennessy and Gross [16, 17] present a heuristic for the MIPS [23] that runs in time O(n^4), where n is the number of nodes in the DAG, and report good performance results for the schedules produced by their heuristic. There is no analysis of the worst-case performance of the heuristic. Gibbons and Muchnick [13] describe a heuristic for the case in which the latencies are 0 or 1, with the substantially improved running time of O(n^2). Although they report good performance for the heuristic, they too do not do a worst-case analysis of the quality of the schedules produced by their heuristic. Bernstein and Gertner [4] give an algorithm for optimally scheduling an arbitrary graph with latencies of 0
or 1 on a single processor. Since their algorithm uses transitive reduction as a preprocessing step, the running time of their algorithm is either that of transitive reduction3 or O(n^2), if preprocessing costs are ignored. Their algorithm does not handle time-critical instructions. In [5], Bernstein, Rodeh, and Gertner analyze the worst-case behavior of the greedy scheduling algorithm for a target machine with a single pipeline. In Section 5 we generalize this result to the multiple pipeline or processor case.

2 If code scheduling is done prior to register allocation, with no consideration given to minimizing register lifetimes, additional register spill code may be introduced. The use of deadlines may help eliminate this problem. A detailed discussion of backward scheduling is beyond the scope of this paper.

3 The running time of transitive reduction is the minimum of O(ne), where e is the number of edges in the original DAG, and the running time of matrix multiplication.
4. THE RANK ALGORITHM

A standard technique for instruction scheduling is to use a greedy scheduling algorithm that always schedules a node whenever there is at least one available node. The input to the algorithm is an ordered list of nodes, a DAG G, which represents the dependence between nodes, the nonnegative integer latencies, and m, the number of processors in the target architecture. We refer to each nonnegative integer as a time step.4 Time step t finishes at time t + 1. At each time step the greedy algorithm scans the list, choosing up to m eligible nodes on each scan to be scheduled, giving priority to the nodes with the earliest possible start time on the list. A node is eligible if all of its predecessors in G have been scheduled on an earlier scan and the relevant latency constraints have been satisfied. If no node is eligible, the value of the current time step is increased to the earliest time at which some node is eligible. At the end of the scan, the chosen nodes are deleted from the list, and new nodes may become eligible. The process is repeated until the list is empty.

The input to the greedy algorithm could be any list, including an arbitrarily ordered one, that does not use information about the graph structure to prioritize the nodes.5 The rank algorithm, defined below, uses information about the latencies between a node i and some of i's successors, as well as the deadline of node i, to compute the rank of that node, written rank(i). The rank of node i is an upper bound on the finish time of i in any feasible schedule. Once the ranks are all computed, the algorithm constructs a list based on the ranks, which it then schedules greedily.

The rank algorithm constructs a feasible single processor schedule or a minimum tardiness schedule for an arbitrary input DAG if the latencies are either 0 or 1, even if all the input nodes have preassigned deadlines. Although the rank algorithm is not guaranteed to find a feasible schedule for an arbitrary DAG if some latencies are greater than 1, or if there is more than one pipeline in the target machine, we conjecture that its behavior as a heuristic is quite good in the general case. There are some preliminary results in [7] showing the rank algorithm performing better on the 80860 than the Warren algorithm, which is used for instruction scheduling on the IBM RS/6000 [30]. In at least one special case, namely when the input is an interval ordered graph (see Section 4.1.5), the rank algorithm constructs an optimal schedule for arbitrary latencies, deadlines, and numbers of processors. The approximation bound of Section 5 applies to the rank algorithm, as well as to any other greedy scheduling algorithm, if there are no preassigned deadlines.
4 In a (forward) schedule we assume that 0 is the first time step.

5 The highest level first algorithm uses some information about the graph structure, namely the level of a node, to construct a more sophisticated ordered list. However, even the highest level first greedy algorithm is not guaranteed to construct an optimal schedule for some simple cases of the instruction scheduling problem, as illustrated by schedule S1 of Figure 1.
4.1 The Algorithm

(1) Compute the ranks of all the nodes. If some node is assigned a rank less than or equal to 0, return the information that the problem instance is infeasible.6
(2) Construct list, which is an ordered list of nodes in nondecreasing order of their ranks.
(3) Apply the greedy scheduling algorithm to list.
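Step (3), the greedy scan, might be sketched as follows. The predecessor-list encoding of the DAG is our assumption, and this sketch advances time one step at a time rather than jumping directly to the earliest eligible step, as the text describes:

```python
def greedy_schedule(order, preds, m):
    """Greedy list scheduling over an ordered list of nodes.
    order: nodes in nondecreasing rank (any list works);
    preds: node -> list of (predecessor, latency) pairs (must be a DAG);
    m: number of processors. Returns node -> start time."""
    start, remaining, t = {}, list(order), 0
    while remaining:
        chosen = []
        for v in remaining:          # scan the list in priority order
            if len(chosen) == m:
                break
            # Eligible: every predecessor already scheduled, and its unit
            # execution time plus the edge latency has elapsed.
            if all(u in start and t >= start[u] + 1 + lat
                   for (u, lat) in preds.get(v, [])):
                chosen.append(v)
        for v in chosen:
            start[v] = t
            remaining.remove(v)
        t += 1                       # an idle step (no-op) if nothing ran
    return start
```

On a three-node chainless example with latencies 1 and 0 out of the root, the scheduler fills the would-be idle step with the latency-0 successor, exactly the effect Figure 1's schedule S2 illustrates.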
4.1.1 Computing the Ranks. The weighted length of a path p is the sum of its constituent edge latencies and the number of nodes in p, excluding the endpoints of p. Let w+(i, j) denote the weighted length of the longest path from node i to a successor j. In Figure 2, w+(i1, i2) = w+(i2, i4) = 0, w+(i1, i3) = w+(i3, i4) = 1, and w+(i1, i4) = 3.

The rank of node i is computed after w+(i, j) and rank(j) have been computed for all nodes j that are successors of i. If j is a node that does not have a preassigned deadline, then d(j) = D, where D is some integer that is sufficiently large that all nodes are guaranteed to be completed by time D. An example of such a value is (k + 1)n, where k is the maximum latency. A node with no successors is called a sink. If i is a sink, then rank(i) = d(i). For Figure 2, suppose d(i1) = d(i2) = d(i3) = d(i4) = 6. Then rank(i4) = 6. If i has only a single successor j, then rank(i) = min{rank(j) − 1 − w+(i, j), d(i)}. In Figure 2, rank(i2) = 5 and rank(i3) = 4.

Let i be a node with more than one successor, whose successors' ranks have all been computed. We construct a sorted list sw(i) of the successor set of node i. The nodes in sw(i) are sorted in nonincreasing order by their w+ values relative to i; that is, if w+(i, j) > w+(i, p), then node j occurs before node p in sw(i). For node i1 of Figure 2, sw(i1) = i4 i3 i2. Let sw(i)_q be all the nodes j in sw(i) for which w+(i, j) = q. Because sw(i) is sorted, the nodes in sw(i)_q are contiguous in sw(i). We next sort each segment sw(i)_q by nonincreasing order of ranks. Let swr(i) be the resulting list. The nodes in swr(i) are all the successors of node i stored in nonincreasing order based on w+; all the nodes with the same w+ value are further sorted by their ranks. For Figure 2, the only possible choice for swr(i1) is swr(i1) = i4 i3 i2.

A schedule for a target machine with m processors can be represented by a matrix in which each row represents one of the processors of the target machine, and each column represents a time step. A slot is a single entry in the matrix, and represents a specific time step on a specific processor. A slot is available if no node has been assigned to the specific start time and processor represented by the slot. To compute rank(i), we select nodes in the order in which they appear in swr(i), starting at the beginning of swr(i). Each time we select a node j, we backward schedule j by greedily scheduling j in an available slot with the latest possible start time less than rank(j). In particular, we schedule j in

6 If the problem instance is infeasible, a minimum tardiness schedule is constructed.
Fig. 2. An example DAG.
the time step finishing at the largest time D′ such that:

(1) D′ ≤ rank(j), and
(2) the number of nodes that occur before j in swr(i) and have been scheduled in this time step is strictly less than m.
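The w+ values, and the ranks of sinks and single-successor nodes, can be computed directly. In the sketch below the edge latencies are an assumption matching our reading of Figure 2, and simple_rank applies the min-over-successors formula to every node, omitting the backward-scheduling refinement for multiple successors (the two definitions agree on this small example):

```python
def w_plus(succs):
    """Longest weighted path lengths w+(i, j): edge latencies plus the
    number of interior nodes, maximized over all paths from i to j.
    succs: node -> list of (successor, latency). Hypothetical layout."""
    wp = {}
    def paths_from(i):
        if i in wp:
            return wp[i]
        best = {}
        for (j, lat) in succs.get(i, []):
            best[j] = max(best.get(j, -1), lat)          # direct edge
            for k, d in paths_from(j).items():           # paths via j
                best[k] = max(best.get(k, -1), lat + 1 + d)
        wp[i] = best
        return best
    for i in list(succs):
        paths_from(i)
    return wp

def simple_rank(succs, d, wp):
    """Simplified rank: min(d(i), min over successors j of
    rank(j) - 1 - w+(i, j)). The paper's full definition refines the
    multi-successor case with a backward schedule."""
    rank = {}
    def r(i):
        if i not in rank:
            cands = [r(j) - 1 - wp[i][j] for j in wp.get(i, {})]
            rank[i] = min([d[i]] + cands)
        return rank[i]
    for i in d:
        r(i)
    return rank
```

With the assumed Figure 2 latencies (0 on i1→i2 and i2→i4, 1 on i1→i3 and i3→i4) and all deadlines 6, this reproduces the values in the text: rank(i4) = 6, rank(i2) = 5, rank(i3) = 4.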
The rank of i with respect to j equals min{D′ − 1 − w+(i, j), d(i)}. The time D′ − 1 is the start time of node j. The rank of i with respect to j gives the latest time that node i can finish if node j is to be completed by its completion time in the backward schedule. We compute the rank of i with respect to each of its successors; rank(i) is the smallest of these values.

4.1.2 Correctness of the Ranks. Below we present the key theorem and proof for the rank algorithm. It shows that if no nodes are completed later than their deadlines, they must also be completed no later than their ranks. We assume that nodes with no preassigned deadlines are given the default deadline D. Note that the proof is entirely general, and holds for any number of processors and any latencies.

THEOREM 4.1. Let G be a DAG and S a schedule for G in which every node is completed by its deadline. Then every node i in S is completed no later than rank(i).

PROOF. If i is a sink node, then the theorem follows trivially from the definition of rank. Suppose that i is not a sink node and assume inductively that the theorem holds for all successors of i. If rank(i) = d(i), then the theorem obviously holds. So assume that rank(i) = D′ − 1 − w+(i, j′) < d(i) for some j′ and D′. By the manner in which rank is computed, j′ is scheduled in the backwards schedule in time step D′ − 1. If D′ = rank(j′), then the result follows immediately from the assumption that the ranks of all the successors of i satisfy the theorem, together with the definition of w+(i, j′). Now assume that D′ < rank(j′) and let S_backward with completion time T_backward be the backwards schedule as it exists immediately after the insertion of j′.

Case 1. There are no idle slots in the time steps finishing at D′ + 1, D′ + 2, ..., T_backward. Since nodes are scheduled as late as possible in S_backward, rank(j) ≤ T_backward for j ∈ S_backward. From the order in which nodes are
Fig. 3. An illustration of Case 1 of the proof for m = 1.

Fig. 4. An illustration of Case 2 of the proof for m = 1.
placed in S_backward, we get w+(i, j) ≥ w+(i, j′) for j ∈ S_backward (see Figure 3). A simple pigeon-hole argument suffices to prove that if i is completed later than rank(i) in the forward schedule, some successor of i will complete later than time T_backward. This contradicts the assumption that all of i's successors
are completed by their ranks.

Case 2. There is some idle slot. Let t be the time with the smallest start time of the time steps finishing at D′ + 1, D′ + 2, ..., T_backward containing an idle slot. Since S_backward is constructed greedily, it follows that all nodes in time steps with start time less than t have rank no greater than t. Also,
by assumption, all these time steps have no idle slots. Therefore, all nodes scheduled in time steps prior to t must be completed by time t and have a w+ value at least as great as w+(i, j′) (see Figure 4). The theorem again follows from a pigeon-hole argument. □

If I is a problem instance, then I_δ is defined to be the problem instance that is obtained by adding δ to every preassigned deadline of I. If a node has no preassigned deadline, then it is given the preassigned deadline D + δ (rather than D). We also define rank(i) to be the rank of node i computed for instance I, and rank_δ(i) to be the rank of node i in I_δ.

LEMMA 4.2. Let I and I_δ be as defined above. Then rank_δ(i) = rank(i) + δ.

PROOF. The proof follows directly from the definition of rank. □
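Lemma 4.2 is what makes deadline shifting well behaved: the whole rank structure simply translates by δ. As a sketch, given a hypothetical feasibility oracle (for example, a scheduler that reports whether a feasible schedule exists for given deadlines), the smallest feasible shift δ can be found by binary search, since adding slack to every deadline never destroys feasibility:

```python
def min_feasible_shift(deadlines, feasible, hi):
    """Smallest delta >= 0 such that shifting every deadline by delta
    yields a feasible instance. `feasible` is a hypothetical oracle;
    `hi` is any shift known to be large enough. Binary search is valid
    because feasibility is monotone in the shift."""
    lo = 0
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible({v: t + mid for v, t in deadlines.items()}):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

The corollaries that follow identify this smallest shift with the minimum tardiness of the instance.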
Corollaries 4.3 and 4.4 show how the rank algorithm can be used to solve the minimum tardiness problem and the minimum completion time problem in the absence of deadlines.
COROLLmY schedule
4.3.
for
minimum
tardiness
PROOF’. ciently
If
sponding algorithm
that
the
of problems
if
schedule
a problem
large
which
Assume
a class
6 such
I
when
a feasible
schedule
is infeasible,
8 is ad.ded
it
from
that
then
to all
CO~O~I,m~
a
exists
deadlines,
a suffi-
the
corre-
Since, by assumption, the rank if one exists, the smallest 6 for
can be constructed
4.4.
Suppose
the rank
is the
algorithm
for a class of problems.
a minimum preassigned
value
completion deadlines.
time
Then
of the
schedule
constructs
the rank
minimum
for
a minimum
algorithm
inputs
in
the problem
instance,
O tardiness. If rank algorithm
then
D is the minimum The Running running
graph.
also
Time time,
assume
algorithm
will
tardi-
there
are
no
each node the same completion time for
construct
a schedule
with
minimum completion time, by Lemma 4.2, the the identical schedule as for the case in which
completion
worst-case We
the rank
D is not the will construct
18. ❑
also constructs
which
PROOF. The proof follows from the technique of giving the minimum completion time D as a deadline. If D is precisely the minimum completion time, then the problem instance is feasible, and the algorithm constructs a feasible schedule for it. It follows from Lemma 4.2 that the list is the same for both problem instances, and the same schedule is constructed for both. Therefore, the algorithm also constructs a minimum tardiness schedule for infeasible instances. ❑

4.1.3 Analysis of the Rank Algorithm. For the analysis we assume that the input DAG is a connected, transitively closed graph, where the transitive closure of a graph G = (N, E) is a graph G' = (N, E') which consists of all the nodes from G together with an edge from i to j if there is a path in G from i to j. If the input DAG is not transitively closed, the transitive closure is automatically computed during the computation of the w+ values. The analysis is a worst-case analysis.

The rank computation involves computing the w+ values, constructing and sorting the successor sets, and backscheduling. The computation of all the w+ values takes time O(en). Given the w+ values, the lists sw(i) and swr(i) can be constructed by sorting; sorting any one set takes time O(n log n). Since each edge in the transitively closed graph contributes only one node to the various sets, the total time required for sorting all the sets sw(i) and swr(i) is O(e' log n), where e' is the number of edges in the transitive closure of G.

Backscheduling is implemented using UNION-FIND. Once the sorted list swr(i) has been constructed, the backward scheduling step of the rank computation for node i is performed. If the backward scheduling is done in a straightforward fashion, it will increase the running time of the algorithm. Therefore, we implement this step using the UNION-FIND algorithm [29] on the rank values of the successors of i. Suppose there are n_i distinct rank values among the ranks of the successors of i in swr(i). We create n_i single node trees, tree[1], tree[2], ..., tree[n_i], with a single rank associated with each tree. We order the trees by their associated ranks, so that rank(tree[p - 1]) < rank(tree[p]); thus rank(tree[1]) is the smallest rank and rank(tree[n_i]) is the largest rank in swr(i).

ACM Transactions on Programming Languages and Systems, Vol. 15, No. 4, September 1993.
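Since the analysis assumes a transitively closed input DAG, the closure computation can be sketched directly; the following is a rough illustration (the function name and interface are ours, not the paper's) that performs a depth-first search from each node:

```python
from collections import defaultdict

def transitive_closure(nodes, edges):
    """Return the edge set E' containing (i, j) whenever j is reachable from i."""
    adj = defaultdict(list)
    for i, j in edges:
        adj[i].append(j)
    closure = set()
    for src in nodes:
        stack = list(adj[src])  # DFS from src over the original edges
        seen = set()
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            closure.add((src, v))
            stack.extend(adj[v])
    return closure
```

For a connected DAG the n depth-first searches take O(en) time in total, matching the bound quoted above for the w+ computation.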
Scheduling Time-Critical Instructions · 643

Each tree[p] has a field called capacity[p] associated with it. We set capacity[1] ← m × rank(tree[1]), and capacity[p] ← m × (rank(tree[p]) − rank(tree[p − 1])), where m is the number of processors. capacity[p] is the number of nodes that can be inserted into the backward schedule in the slots greater than rank(tree[p − 1]) and less than or equal to rank(tree[p]). Each tree[p] also has a field called content[p]; initially content[p] ← 0 for all p. When content[p] reaches capacity[p], tree[p] is made a subtree of tree[q], where rank(tree[q]) is the largest rank of any tree with rank smaller than rank(tree[p]). (Initially, q = p − 1, but as trees are merged we can have q < p − 1.) If content[1] ever becomes greater than capacity[1], there is no feasible schedule.

In the correctness proof for a single processor (Theorem 4.6), the schedule S_rank is examined until one of three conditions holds: (1) some node j with rank(j) > rank(i) is scheduled in S_rank prior to i, (2) an idle slot is encountered, or (3) all of S_rank has been examined.

Suppose the first condition holds, namely that rank(j) > rank(i). Let t be the time step at which node j is scheduled to start, and let Z be the set of nodes scheduled to start at time steps t + 1, t + 2, ..., rank(i) − 1, together with node i itself. (Node j is not included in Z.) By the definition of node j, all nodes in Z have rank less than rank(j). Also, since i ∈ Z, |Z| ≥ rank(i) − t. Let j' be the node scheduled to start at time step t − 1; j' must be a predecessor of all the nodes in Z, since otherwise the greedy algorithm would have scheduled one of those nodes at time step t. If k ∈ Z were an immediate successor of j' with w(j', k) = 1 and no other paths from j' to k, the greedy algorithm would schedule node k at time t. Therefore, for all successors k' of j' we have w+(j', k') > 1. Consequently, the backward schedule will cause some successor k' of j' to have a finish time no greater than t + 1. By the definition of rank, this gives rank(j') no greater than t − 1, contradicting the assumption that all the nodes in S_rank are completed by their ranks. Therefore, this condition cannot occur.

If the second condition holds, namely an idle slot is encountered, then again let node j' be the node immediately preceding the idle slot; the argument is the same as above. Again, this condition cannot occur.

⁸As computation speeds increase, latencies from memory accesses are increasing relative to computation time. The rank algorithm can be used as a heuristic for those cases in which the memory latencies exceed 1.
If the last condition holds, then S_rank has no idle time, and all the nodes in S_rank have rank no greater than rank(i). Consequently, from the pigeon-hole principle, there is a node whose rank is no greater than zero. From Theorem 4.1, we conclude that this contradicts the existence of a feasible schedule. (Intuitively, at least two nodes must be scheduled in the same slot, contradicting the fact that we are constructing a schedule for a single processor.) If the problem instance is infeasible, it follows from Corollary 4.3 that the rank algorithm constructs a minimum tardiness schedule. ❑

4.1.5 Monotone Interval-Order, Arbitrary Latencies, Multiple Processors. Even though the general problem is NP-complete, it is still possible that fast (polynomial time) scheduling algorithms exist for interesting classes of graphs. One such class, for which the rank algorithm constructs a feasible schedule whenever one exists, is called monotone interval-orders. For this problem class the number of processors in the target machine and the latencies can be arbitrary, and the deadlines can assume arbitrary integer values.

An interval-order graph is a DAG G = (N, E), where N is a set of closed intervals on the real line. The edges of G are derived from the order between the intervals as follows. For i, j ∈ N, (i, j) ∈ E if and only if for any pair x ∈ i and y ∈ j, x < y. Each edge has a latency, and the latencies are arbitrary nonnegative numbers. Each node either has a preassigned deadline or is assigned one by the algorithm; the only constraint on their values is that nodes that do not have preassigned deadlines are assigned the same large deadline by the algorithm. A monotone interval-order graph is one in which, given any pair of edges (i, j) and (i, j'), w(i, j) ≥ w(i, j') whenever the predecessors of j' are also predecessors of j. The following lemma is from [24].

LEMMA 4.7. Let G = (N, E) be an interval-order graph. Then for i, j ∈ N, either all the predecessors of i are also predecessors of j or all the predecessors of j are also predecessors of i.

THEOREM 4.8. Let G = (N, E) be a monotone interval-order graph with arbitrary latencies and deadlines, and assume that there are m ≥ 1 processors. Then the rank algorithm constructs a feasible schedule for G whenever one exists, and constructs a minimum tardiness schedule otherwise.

PROOF. As in Theorem 4.6, we initially assume for contradiction that the rank algorithm fails to construct a feasible schedule for G, but that there is a feasible schedule for G. Let S_rank be the partial schedule constructed by the greedy scheduling algorithm when it first determines that the problem instance is infeasible, and let i be the first node that is not completed by its rank. For all nodes j ∈ S_rank, let S_rank(j) be the start time of j; clearly, rank(j) > S_rank(j) for all nodes in S_rank except i. We consider the following three cases.

646 · K. V. Palem and B. B. Simons

Case 1. There are precisely m nodes with ranks bounded above by rank(i) scheduled at each of the time steps 0, 1, ..., rank(i) − 1. Then by a simple pigeon-hole argument, together with Theorem 4.1, there does not exist any feasible schedule.
Case 2. There is either an idle slot or some node with rank greater than rank(i) scheduled at time step rank(i) − 1. Then i must have a predecessor, say j, such that S_rank(j) + w+(j, i) + 1 > rank(i). Otherwise, i would have been scheduled to start at time step rank(i) − 1. From the definition of rank, we get that rank(j) ≤ S_rank(j), contradicting the assumption that any node scheduled in a time step smaller than rank(i) has a start time less than its rank.

Case 3. There is some time step 0 ≤ t' < rank(i) − 1 such that there is either an idle slot or some node with rank greater than rank(i) scheduled at time step t'. Let t be the largest such time step, and let Z be the set of nodes with start times of {t + 1, t + 2, ..., rank(i) − 1} together with node i. Clearly, |Z| = (rank(i) − t − 1) × m + 1. Any node i' ∈ Z must have a predecessor j such that S_rank(j) + w+(j, i') + 1 > t. Otherwise, i' would have been assigned start time t. We say that j is a constraining node of i'. Let k ∈ Z be the node in Z with the smallest sized predecessor set, i.e., |pred(k)| ≤ |pred(k')| for k, k' ∈ Z. By Lemma 4.7 every node in pred(k) is a predecessor of all the nodes in Z. Let j ∈ pred(k) be a constraining node for k. Then, because G is a monotone interval order, w+(j, k') ≥ w+(j, k) for k' ∈ Z. Consequently, j is a constraining node for all k' ∈ Z. But now the rank computation for j results in a rank for j which is less than the finish time for j in S_rank, contradicting the assumption that i is the first node in S_rank with this property.

If the problem instance is infeasible, it follows from Corollary 4.3 that the rank algorithm constructs a minimum tardiness schedule. ❑
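The capacity/content bookkeeping of the UNION-FIND backscheduling step in Section 4.1.3 can be sketched in a few lines. This is our own minimal illustration under assumed names and interface, not the paper's implementation; the caller supplies successors one at a time by the index of their rank in the sorted list:

```python
class BackScheduler:
    """Sketch of the UNION-FIND backscheduling step of the rank computation.
    `ranks` are the distinct ranks of node i's successors, in ascending order;
    m is the number of processors."""

    def __init__(self, ranks, m):
        # capacity[p]: slots between rank(tree[p-1]) and rank(tree[p])
        self.cap = [m * ranks[0]] + [m * (b - a) for a, b in zip(ranks, ranks[1:])]
        self.content = [0] * len(ranks)          # initially content[p] = 0
        self.parent = list(range(len(ranks)))    # union-find forest over trees

    def find(self, p):
        # path-compressing FIND: the tree currently absorbing insertions for p
        while self.parent[p] != p:
            self.parent[p] = self.parent[self.parent[p]]
            p = self.parent[p]
        return p

    def insert(self, p):
        """Insert one successor into the backward schedule at tree index p.
        Returns False once no feasible backward schedule exists."""
        r = self.find(p)
        self.content[r] += 1
        if self.content[r] == self.cap[r] and r > 0:
            # tree[r] is full: UNION it with the next lower-ranked tree so
            # that later insertions spill into lower-numbered slots
            self.parent[r] = self.find(r - 1)
        return self.content[0] <= self.cap[0]
```

For example, with one processor and successor ranks {1, 2} there are two backward slots, so a third insertion reports infeasibility, mirroring the content[1] > capacity[1] test in the text.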
5. THE GREEDY SCHEDULING HEURISTIC

Most scheduling algorithms are greedy in that they do not introduce idle time if some instruction is available for scheduling. The result in this section holds for any greedy scheduling algorithm applied to an instruction scheduling problem containing an arbitrary DAG, an arbitrary number of processors, arbitrary latencies, and no preassigned deadlines. The analysis is worst case, and the algorithms will tend to perform better in practice.

THEOREM 5.1. Let G = (N, E) be an arbitrary DAG with arbitrary latencies between 0 and k. Then the greedy scheduling algorithm constructs a schedule for G on a target machine with m processors whose completion time is no more than a factor of 2 − 1/(m(k + 1)) worse than that of a schedule that is guaranteed to be optimal.

PROOF. Consider the greedy schedule constructed for the given DAG G with the assumption that we have as many processors as we can use at our disposal, as opposed to only m. We use S∞ to denote this schedule, with T∞ being the completion time of S∞. Let S_greedy be a schedule constructed by the greedy algorithm for the given DAG with a target machine of m processors, and let T_greedy be the completion time of S_greedy.

We say that a time step in a schedule is active provided it has at least one node scheduled in it. Otherwise, it is idle. If P is a path in G, the number of
idle slots in P is defined to be the sum of the latencies of the edges of P. We define idle_max to be the maximum number of idle slots of any path in G. The length of a path P is the sum of the number of nodes and the number of idle slots in P. By construction, T∞ is the length of the longest path in G.

LEMMA 5.2. The maximum number of time steps in S_greedy containing at least a single idle slot is T∞; the maximum number of idle time steps in S_greedy (that is, time steps in which no node is scheduled) is idle_max.

PROOF. Any time step in S_greedy contains either a node or an idle slot from every path of maximum length in G; in particular, any idle time step contains an idle slot from every such path. ❑

The longest path in G contains at least T∞/(k + 1) nodes, since each of its edges has latency at most k. Combining this with Lemma 5.2, the number of time steps in which all m processors are busy is at most (n − T∞/(k + 1))/m, where n is the number of nodes in G, so that

T_greedy ≤ (n − T∞/(k + 1))/m + T∞ = n/m + T∞(1 − 1/(m(k + 1))).   (4)

Substituting T_opt ≥ n/m and T_opt ≥ T∞ in (4), we get

T_greedy ≤ T_opt + T_opt(1 − 1/(m(k + 1)))   (5)

or

T_greedy/T_opt ≤ 2 − 1/(m(k + 1)). ❑   (6)
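As a numeric illustration of the bound in (6) (the function name is ours): with a single processor and zero latencies the greedy schedule is optimal, and the worst-case ratio approaches 2 as m or k grows.

```python
def greedy_bound(m, k):
    """Worst-case ratio T_greedy / T_opt from (6): 2 - 1/(m(k + 1))."""
    return 2 - 1 / (m * (k + 1))
```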
The completion time of a greedy schedule is within a factor of 2 of the completion time of an optimal schedule. However, the quality of the greedy schedule can degrade as the number of processors in the target machine, as well as the latencies, increase. There are examples in [20] of schedules constructed by the greedy algorithm that can be arbitrarily close to the bound given by (6), thereby showing that the bound is tight.

6. NP-COMPLETENESS RESULTS

All of the NP-completeness reductions use a DAG that is a set of chains. A chain is a graph in which every node has at most one in-edge and at most one out-edge, and the graph is connected. Since a chain corresponds to a simple straightline sequence of code (threads), and since a chain is a subgraph of a basic block dependence graph (for example, a chain is also a tree), these are very strong negative results. We remind the reader that an NP-completeness proof for a simple case automatically implies that the more complex cases are NP-hard. In particular, since a chain is a very simple DAG, our NP-completeness results also apply to more complex DAGs. Similarly, since the NP-completeness result in Section 6.1 holds for machines with a single register, it also holds for machines with multiple registers.

The NP-completeness reductions are all from the 3-partition problem [11], which is defined as follows. Given a multiset A containing 3n integers and a positive integer bound B, where B/4 < a_i < B/2 for all a_i ∈ A and Σ_{a_i ∈ A} a_i = Bn, is there a partition of A into n triples of three elements each such that the sum of the integers in each triple equals B?⁹
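Since all of the reductions start from 3-partition, a witness checker makes the problem statement concrete. This small verifier is our own illustration; checking a proposed partition is exactly what places the problem in NP:

```python
def is_3partition_witness(A, B, triples):
    """Check that `triples` partitions the multiset A into triples, each summing to B."""
    flat = sorted(x for t in triples for x in t)
    return (sorted(A) == flat
            and all(len(t) == 3 for t in triples)
            and all(sum(t) == B for t in triples))
```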
6.1 Registers If there
is only
a single
processor
and some but not necessarily then constructing a minimum
and a single
register
on the target
machine
all of the nodes are preassigned to the register, completion time schedule is NP-complete. To
the best of our knowledge, ours is the first correct proof that the register constraints can transform a version of the instruction
addition of scheduling
problem, for which a polynomial time algorithm exists (from Theorem 4.6), into an NP-complete problem. When a value is stored in a register by instruction i, a new value cannot be inserted
into
current
value
the have
register
until
after
been executed.
all
the
We define
instructions a register
which constraint
access
the
as follows.
If w~ .X(i) is the maximum latency for all edges (i, j), then a new node cannot be inserted into the register until at least Wma, time units after the completion of instruction i [17]. Because we are presenting a negative result, this very weak definition of a register constraint only strengthen 1s the result. THEOREM 6.1. (The register that
is a set of chains
allocation
the latencies
Let G = (IV, E) be a DAG is
‘Because the 3-partition problem is strongly NP-complete the value of the numbers in the 3-partition problem NP-completeness.
[11], a reduction that is polynomial instance is sufficient for a proof
in of
ACM
Vol. 15, No. 4, September
on Programming
which
problem).
there
Transactions
for
Languages
and Systems,
are all
equal
to 1 and
1993
Scheduling
Time-Critical
Instructions
only one register. The problem of determining if there schedule for G having a completion time no greater than
649
.
is a single processor D, for some given D,
is NP-complete. Membership
PROOF. Membership in NP is obvious. We show that the problem is NP-hard by reducing the 3-partition problem to the register allocation problem. Given an instance of the 3-partition problem, we construct an instance of the register allocation problem in which all latencies are one. For each a_i there is a corresponding chain called a number chain, as shown in Figure 5. Each number chain C(a_i) consists of two subchains called the first subchain and the second subchain, each containing a_i nodes. Nodes that are assigned to the register are called register nodes and nodes that are not assigned to the register are nonregister nodes. All of the nodes in the first subchain are nonregister nodes and all of the nodes in the second subchain are register nodes.

There is also a place-holding chain, C_ph, that consists of n subchains C_ph^1, ..., C_ph^n (see Figure 5). C_ph^1 contains 2B + 1 nodes, with the first B nodes being register nodes and the remaining B + 1 nodes being nonregister nodes. C_ph^i, 2 ≤ i ≤ n, contains 2B nodes, with the first B − 1 nodes being register nodes and the remaining B + 1 nodes being nonregister nodes. C_ph is constructed by linking subchains C_ph^1, ..., C_ph^n in order with latency 1 edges.

If the first node in C_ph is started at time 0, since all edges have latency 1 and the last node of C_ph has no out-going edge, the earliest completion time for C_ph is 4Bn + 1. We set D = 4Bn + 1; consequently, any schedule with a completion time of 4Bn + 1 must begin and end with a node from C_ph, and every node of C_ph must be scheduled as soon as it is available. An equivalent description of the register constraint is that in any such schedule S every register node is both preceded and succeeded by a nonregister node, with the exception of the first register node in S, which is the first node of C_ph^1.

Suppose there is a solution to the 3-partition problem, and suppose that a_i, a_j, and a_k comprise one of the triples in the solution. We construct a schedule S for the corresponding instance of the register allocation problem by inserting a number chain node between each pair of nodes from C_ph (Figure 6). More specifically, the nonregister nodes from the first subchains of the number chains C(a_i), C(a_j), and C(a_k), B nodes in all, are interleaved with the first B register nodes from C_ph^1. In a similar manner, the B register nodes from the second subchains of C(a_i), C(a_j), and C(a_k) are interleaved with the B + 1 nonregister nodes of C_ph^1. The nodes in S alternate between register and nonregister nodes, and this does not increase the completion time of C_ph, since each edge in C_ph has latency 1. The process is repeated for each of the triples in the solution of the 3-partition problem, each
Fig. 5. The number and place-holding chains for the register-constraint problem.

Fig. 6. Structure of a feasible schedule for the register-allocation problem.
Scheduling time
interleaving
register
nodes with
Time-Critical
nonregister
instructions
nodes.
There
is no idle time
in the schedule, and the completion time is 4Bn + 1. Conversely, suppose that we are given a schedule S for the multiple problem that completes by time 4Bn + 1. This implies that S has time and that all the nodes from nodes from CP~. It also implies between
register
described Given
nodes
and
the that
number chains the interleaving
nonregister
nodes,
are
except
651
.
chain no idle
interleaved with strictly alternates
at the
boundaries,
as
predecessors
or
above. schedule
S, let
~,
be the
nodes
that
are
either
successors of the register nodes of C~~ in S, and let R, be the nodes that have nonregister nodes of C~~ as both their predecessor and successor nodes in S,l