for ILP. Processors. Scott A. Mahlke*. Richard E. Hank. James E. McCormick. David I. August. Wen-mei W. Hwu. Center for Reliable and High-Performance.
A Comparkon
Scott A. Mahlke*
of Full
and Partial Predicated for ILP Processors
Richard E. Hank
James E. McCormick
David I. August
Center for Reliable and High-Performance University of Illinois Urbana-Champaign,
Support
Wen-mei W. Hwu
Computing
IL 61801
in conjunction with speculative execution is typically utilized by the compiler andlor hardware to remove control dependence and expose ILP in superscalar and VLIW processors [1] [2] [3]. However, misprediction of these branches can result in severe performance penalties. Recent studies have reported a performance reduction of two to more than ten when realistic instead of perfect branch prediction is utilized [4] [5] [6]. The second limitation is that processor resources to handle branches are often restricted. As a result, for control intensive applications, an artificial upper bound on performance will be imposed by the branch resource constraints. For example, in an instruction stream consisting of 40% branches, a four issue processor capable of processing only one branch per cycle is bounded to a maximum of 2.5 sustained instructions per cycle.
Abstract One can effectively utilize predicated execution to improve branch handling in instruction-level parallel processors. Although the potential benefits of predicated execution are high, the tradeoffs involved in the design of an instruction set to support predicated execution can be difficult. On one end of the design spectrum, architectural support for full predicated execution requires increasing the number of source operands for all instructions. Full predicate support provides for the most flexibility and the largest potential performance improvements. On the other end, partial predicated execution support, such as conditional moves, requires very little change to existing architectures. This paper presents a preliminary study to qualitatively and quantitatively address the benefit of full and partial predicated execution support. With our current compiler technology, we show that the compiler can use both partial and full predication to achieve speedup in large control-intensive programs. Some details of the code generation techniques are shown to provide insight into the benefit of going from partial to full predication. Preliminary experimental results are very encouraging: partial predication provides an average of 33’% performance improvement for an 8-issue processor with no predicate support while full predication provides an additional 30~o improvement.
1
Execution
Predicated execution support provides an effective means to eliminate branches from an instruction stream. Predicated or guarded execution refers to the conditional execution of an instruction based on the value of a boolean source operand, referred to as the predicate [7] [8]. This architectural support allows the compiler to employ an z\corwersion algorithm to convert conditional branches into predicate defining instructions, and instructions along alternative paths of each branch into predicated instructions [9] [10] [11]. Predicated instructions are fetched regardless of their predicate value. Instructions whose predicate is true are executed normally. Conversely, instructions whose predicate is false are nullified, and thus are prevented from modifying the processor state.
Introduction
Predicated execution provides the opportunity to significantly improve branch handling in ILP processors. The most obvious benefit is that decreasing the number of branches reduces the need to sustain multiple branches per cycle. Therefore, the artificial performance bounds imposed by limited branch resources can be alleviated. Eliminating frequently mispredicted branches also leads to a substantial reduction in branch prediction misses [12]. As a result, the performance penalties associated with mispredictions of the eliminated branches are removed. Finally, predicated execution provides an efficient interface for the compiler to expose multiple execution paths to the hardware. Without compiler support, the cost of maintaining multiple execution paths in hardware grows exponentially.
Branch instructions are recognized as a major impediment to exploiting instruction-level parallelism (ILP). ILP is limited by branches in two principle ways. First, branches impose control dependence which restrict the number of independent instructions available each cycle. Branch prediction * Scott Mahlke is now with Hewlett Packard Laboratories, Palo Alto, CA. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copyin is by permission of the Association of Computing Machinery. ? o copy otherwise, or to republish, requires a fee and/or specific permission. ISCA ’95, Santa Margherita Ligure Italy @ 1995 ACM 0-89791 -698-0/95/0006... $3.50
Predicated execution may be supported by a range of architectural extensions. The most complete approach is full
138
predicate support. With this technique, all instructions are provided with an additional source operand to hold a predicate specifier. In this manner, every instruction may be a predicated. Additionally, a set of predicate defining opcodes are added to efficiently manipulate predicate values. This approach was most notably utilized in the Cydra 5 minisupercomputer [8] [13]. Full predicate execution support and the largest potential perprovides the most flexibility formance improvements. The other approach is to provide partial predicate support. With partial predicate support, a small number of instructions are provided which conditionally execute, such as a conditional move. As a result, partial predicate support minimizes the required changes to existing instruction set architectures (ISA’s) and data paths. This approach is most attractive for designers extending current ISA’s in an upward compatible manner.
options
for extending
ducing
guard
of subsequent
2
ISA
instructions
2.1
Extensions
of instruction extensions ecution
Suppression determines
stored
in a set of l-bit
tively
referred
predicate using
relying
support,
predicated
an equivalent
these sequences structions
executed
of instructions.
in an increase
and requires
efficient
However,
this
port
requires
can frequently
a chain
the performance dependent
parallel
with
same computation
that
with
of sequentially increase
of partial
predicate
computations
are typically
dition
is true,
the processor
more
conditional
moves
have
speculation a predicate
selectively always
employed being
for
is not
locations. state
Then,
example. Thus,
it improves
With since
full all
performance
each instruction
specifies
which
operation
If the value in the specified
predshould
predicate
is execui~ed normally;
the instruction
if
is suppressed.
An
one or
This
predicate
predicate
be
other
rather
operations. must
cate register
the
instruction This
it still
be suppressed
be available
during
for
regard-
ties up an
the complexsignaling
ciuring
whose unit
the
de-
corresponding
not issuedl.
value
is
stage.
an instruction
the
of
which
reduced
is that
and force exception
the execution
the
value
to be generated
method
is suppressed,
logic also
Since
must
modifies
schedules
stage
it reduces
register.
to this
is false is simply
of allowing
referenced
utilized.
predicate
the last pipeline Thus,
register
advantage
since
met hod may also increase
can
stage.
in the write-back
that
compact
bypass
until
instruction
code/issue
on that
an instruction
unit.
to be delayed
state
is useful
A drawback
ity of the register
storing
may
of
the use of an addi-
and a subsequent
more
code.
less of whether
instructions
speculation
enables
execution
if the con-
using
is
The values in
the
an instruction
based
predicated
performing and
is updated,
required
specifier. where
by first
(speculative)
the result (s) in some temporary
support,
Conditional
represented
unconditionally
latency
Finally,
is extensively
with
state
are collec-
The setting
operand
method
register
conditioned
instructions,
support
file.
whether
of processor This
between
the predicate
sup-
execution.
register
in this section.
the instruction
is O, or false,
any change
latency
support.
length.
modify
registers
set through
This
state.
of the pipeline.
values
predicate
path
allow
architec-
predicate
These
One way to perform the suppression of an instruction in hardware is to allow the instruction to execute and to dis-
of in-
of predicate
dependent
of
later
determine
is 1, or true,
of ex-
of the condition
should
file are associated
operand.
processor
the value
of registers
predicate
partial
the critical
modify register
Generation
number
full
on the use of speculative
the computation
than
a larger
execupartial
be performed
in the number
values for the partial and
only
must
In the second case, the computation
is highly
may
with
The result
registers.
instruction will
These
for suppression
of condition.
is discussed register
extensions.
as support
if an instruction
in the extended
pred-
on speculative
operations
sequence
results
to hold intermediate ture.
extensively
case, for an architecture
Predication
to as the predicate
these registers
source
and
Full
of Execution.
which
register
In the first
for
and expression
icate
values,
intro-
specifiers
[14].
set and micro-architecture
tional
icate
may be utilized
can be classified
ing unsupported tion.
For example,
The essence of predicated execution is the ability to suppress the modification of the processor state based upon some condition. There must be a way to express this condition and a way to express when the condition should a,ffect execution. Full predication cleanly supports this through a combination
the predicate
manipulating
ISA.
hold the predicate
Extensions
However, there are several fundamental performance limitations of partial predicate support that are overcome with include representfull predicate support. These difficulties instructions,
which
In this section, a set of extensions to the instruction set architecture for both full and partial predicate support are presented. The baseline architecture assumed is generic ILP processor (either VLIW or superscaku-) with in-order issue and register interlocking. A generic Ioadistore ISA is further assumed as the baseline ISA.
In this paper, the tradeoffs involved in supporting full Using and partial predicated execution are investigated. the compilation techniques proposed in this paper, partial predicate support enables the compiler to perform full ifconversion to eliminate branches and expose ILP. Therefore, the compiler may remove as many branches with partial predicate support as with full predicate support. By removing a large portion of the branches, branch handling is significantly improved for ILP processors with partial predicate support. The relatively few changes needed to add partial predicate support into an architecture make this approach extremely attractive for designers.
predicated
an existing
instructions
This
has the
to be allocated
of the predicate decode/issue,
at least be set in the previous
to
register the predi-
cycle.
This
The issues discussed in the paper are intended for both designers of new ISA’s, as well as those extending existing
dependence distance may also be larger for deeper or if bypass is not available for predicate registers.
ISA’s. With a new instruction set, the issue of supporting full or partial predicate support is clearly a choice that is available. Varying levels of partial predicate support provide
ing the dependence distance between definiticms and uses of predicates may adversely affect execution time by lengthening
139
the
schedule
for
predicated
code.
An
example
pipelines Increas-
of this
Pout P ..,.
‘o 1 0
0 0 1 1
Cydra assumed needed
These
Predicate
The
of the HPL
Playdoh
nal instruct
registers
This
The
is that
these predicate
registers.
is shown
assigns
The comparison than
of w-cl
(gt),
each destination The predicate
predicate
and of the input
1, write types
defines
condition.
have
destination
cution
format
predicate
of a
on the destination
the unconditional
(U),
complements.
of three
Table
1 contains
destination
comparison.
If the value
parison
is placed
for ~).
Otherwise,
to the
ually
of the
instructions,
control
OR
predicates
type
block
can be enabled
predicates table
registers
ure
AND
(&&)
predicate
and OR (11) constructs registers
comparison icate
register
must
be explicitly
used.
However,
predicate
are set if P,n
is 1 (O for ~), is unchanged.
defines
initialized
into
only
after may
they
of a
me initialized
be issued
multiple
simultaneously
to individually
set each predicate
the
use of the
instructions,
in some
can be costly.
and pred_set,
aforemen-
cases individTherefore,
are defined
predicate
two
to provide
register
file to O or
1 contains
of predicated [10],
code
in
is initiated
all predicate branches
ions.
can be true
for
if the entire
short
code
are cleared.
pl
to
be true. is not
boolean is true.
order
in
The
first
two
to two
pred-eq
If p2 in the first executed.
evaluation. The
transuse of
type since either
p f is OR
pred-eq
expression
The
into
il-
source in Fig-
is then
1(c).
me translated
circuit
The
by a pred-clear
register
is false the second
example
the code shown the
Figure
registers
in (b)
Predicate
with
into
if-conversion shown
a simple execution.
This
“then”
part
of the
that
both
is not executed.
This
is performed.
“else”
part
Notice
of the outer
of i is performed
if statement.
p4 and
p5 remain
is consistent Finally,
with
the incre-
unconditionally.
2.2
Extensions
for
Partial
Predication
and
type
and in any
Enhancing tion
140
is
p3 is true
pred-
predicates OR
for transfor-
[16].
instruction
ment
are defined
reduction
the entire
at zero if the predme
of the
type
is unchanged.
useful
on p3 for this reason. The the addition or subtraction
is 1 and the result OR
height
pred-
are cleared
outer if statement is predicated pred_ne simply decides whether
the
they
multiple
is O (1 for AND),
register
is particularly
Figure
registers
consistent
such as logical
that
OR
an exe-
using
registers
predicate
1(a) is compiled the
that
pred_eq
register.
to O before
un-
logical
to the OR type
predicate
each predicate
Using
condition
of the com-
the destination
define
the concept
1(b).
instruct
of the
execution
since the
to compute height
are analogous
or one through
pred.clear
conditional
are always
in C. OR type destination
otherwise Note
when
is true
as a wired
can be utilized
it is possible zero
Example.
predicate
for these
(or its compliment
conditions,
This
zero dependence
predicate
of setting
formed
and
for blocks which are i.e., they ha~e a single
are useful
by multiple
as-
1 in one cycle.
dependence.
The
(b)
a 1 or leaves the register
implementation
destination
setting
a method
useful,
to the predicate
Unconditional predicates are utilized executed based on a single condition,
code,
if-conversion.
register.
writes
destination
predicate
of 34 = 81 possible
of P,~ is 1, the result
a O is written
(a) source
code after
instructions.
type
to
Code
the truth
register
(c)
of predication,
such as control
lustrating
of P,n and the result
in the predicate
(pi)
(p4) (p5)
i,i,l
predicates,
the
AND
It can write
type
predicate
of the value
(P2)
k,k,l
is 1 and the result of the comparison
insure
regardless
k,k,l L3
either
type
Although
actions
are particularly
and AND
k,k,l k,k,l i,i,l
if Pi.
register
types.
Unconditional defined,
OR,
add sub add
AND
for
For each com-
predicate.
types which
c,0,L2
with
code in Figure
are six predicate
bne
property
define
mations
instructions
the result
one
A total
jjj,l (P3) p4U ,p5V,c,0
icate.
The
(ne),
is specified
Pi..
and P,n,
O, or leave it unchanged.
predicate
This
type
tioned
upon
add predme
allows
condition
otherwise
accord-
the value written
based
result
src2 (Pin)
by .
defining
predicate,
L3
on the same predicate which
exist.
There their
order
by P,~.
register
jump
(c) assembly
changed
(eq), not equal
determines
destination
may be performed
specified
Predicate
as specified
of comparison
code,
conventional
and Pout2
src2
A predicate
comparison bination
1: Example
predicate
srcl,
to Poutl
and
predicate.
are also predicated,
Figure sembly
AND
can be: equal
etc.
define
OR type
The instruction
values
p10R,p3~,b,0
(b)
in the origi-
their
Pout2,
pred.eq
M those
below.
Pout I,
ing to a comparison
with
that
PI OR ,p2~,a,0
j,j,l
(a)
is a predicate
opcode
and
pred_eq
add
used are those
There
difference
are predicate
instruction
greater
semantics [15].
pred_clear
is
registers.
major registers
crop>
is also
conditional
be classified
a,O,Ll b,O,Ll
add jump L2: sub L3: add
by the
stage
predicate
for each comparison
define
pred.
=,N COMCLR,= DEPI,TR DEPI $pred-O
%r3,2,$pred-O %r0,%r24,%r0 1,3,1,%3 1,1,1,%3
EXTRU,EV ADDI
%r3,3,1,770r0 1,%r25,%r25
COMCLR,=
%r8,%r0,%r0
COPY
%r6,%r5
COMCLR,= OR,TR
96r8,%r0,%r0
COPY
%r7,%r5
%r6,%r5
non-excepting
support
specula-
techniques [20].
to as Superblock
2 1
2 bit counter
7100.
ILP
eq ge
r18,0,r13 r19,r4,r15
issue cycle
segment
predic-
The instruction
to fully
o
eq
blocks.
allocate
does not support
of all instructions Superblock
64 byte
to have an infinite
processor
execution.
versions
utilized
BTB
penalty.
is assumed
The baseline
or silent
entry
are those of the HP PA-RISC
processor
registers.
with
T
r13,rl,–2 r17,r13,10
for scheduling
hyperblock
loop
r16,rl,r2
id_c
or consists
no write
The dynamic
is a lK
a 2 cycle misprediction
cies assumed
with
assumed
for branches.
perfect
cache and a 64K
data cache; both
of 12 cycles.
6: Example
except
as either
cache is write-through
has a miss penalty tion
each cycle,
is specified
mapped
4–issue, 1-branch
(b) fully predicated
Figure which
or–kee
3
cycle
Note:
The memory
iter3
1
6 cycles
tions
iter2
3
pl ~R,rl,r22
jump st_c st_c st_c add add jumb
12
.ge
2 2 1
2
pred_ge
+
12
(a) superblock
iter 1
iter3
11
r6,rl ,r2 r3,rl,–3 r7,r3, 10 r8,0,r3 r9,r4,r5
o 0
eq eq ge
R_c
ld_c
pred_ge
0 10
14 cycles
o
pred~e
>
12
Jump
pl
iter2
9
bge
pred_clear
T
2
bge
iter2
iter3
bge
The
in all graphs
Figure 7: HP PA-RISC
are
emulation
of predicate
support.
baseparticular,
and
speedup
for a l-issue
tables.
processor
is calculated
baseline
processor
of the specified
by dividing
the cycle count
by the cycle count
of a k-issue
model.
For partial predicate support, the baseline processor is extended to support conditional move instructions. Note that since non-excepting the
more
efficient
for partial support
versions
of all instructions
conversions
are applied
predication processor
(Section
is referred
3.2).
The
4.2
are available, by the partial
to as Conditional
compiler
Figure
predicate Move.
to as Full
compilation
Predication.
techniques
models
is compared
ticular
processor
For this
are applied. by reporting
model
versus
the the
model,
Performance speedup
The
baseline
of the
processor.
In
relative move,
processor. benchmark
Full with
superblock. 1 Speedup tween superblock and
3
par-
the
conditional
8, l-branch in every
hyperblock of the
8 shows
perblock,
final model is the baseline processor extended to support full predication as described in Section 2.1. This model is referred
Results
except
072.sc
1Averages
146
which
reported
performance predication
an average
performed
by su-
for an issue-
performed speedup
with conditional full predication
refer
achieved
and full predication
the best
of 63%
over
move code fell befor all benchmarks
slightly
below
to the arithmetic
mean.
superblock
8.@2 703 6.IM 5.03 4.CQ 3.(s3 2.@2 I.ca
■
Figure
■
Suwhkxk
8: Effectiveness
for an 8-issue,
of full
l-branch
❑
Condwonal Move
and partial
processor
with
❑
Full Prediction
predicate perfect
support
Figure
caches.
■
Sunerblock
10: Effectiveness
for an 4-issue,
Cditicml
of full
l-branch
❑
Move
F.11 F’rdication
and partial
processor
with
predicate
perfect
support
caches.
8.iXl 7.(-0 6.(XI 5.cm 4.03 3.02 Z.ca 1m
I 1
LI
■
Figure
■
Supblock
9: Effectiveness
for an 8-issue,
of full
2-branch
❑
Conditional Move
and partial
processor
with
■
Full Fwd&tion
predicate perfect
support
Figure
caches.
code’s performance. marily
the conditional conditional
The
substantial. reported niques
small
moves.
However,
Full
consistent
predication
significant changes
with
gain
full
of representing
of
certain
over
ss~o
move
code
the
predicated
Figure
that
provided
there
by the
instructions,
the
trend
predication move
ISA
reduced
dition
move
by creating
many
the 8 issue machine,
improvement.
Increasing cycle
provides
full predicate
issue rate
interesting
The performance
insight
to improving
hyperblock
from
1 to 2 branches
into
the
effectiveness
superblock the
of conditional
of
full
predication,
performance
This
of superblock.
These move
results
substantial
To evaluate execution,
The
methods
147
the cache
Figure
in the
must
with
with
conservative
processor.
gain
by
creat-
to be employed
performance
full
are absorbed
a 4-issue
with
con-
compensate
it would
a more
needs
full
in the majority
predication
indicate
effects
while
conditional
the 4 issue machine
is still
for For
possi-
an average of 33% speedup associated
11 is presented.
were affected
is that
than
on a 4 The most
code,
instructions
algorithm model
benchmarks
the compiler
saturate
ble for the 4-issue processor, over superblock.
is at-
only
performed
per cycle.
superblock for
instructions
but
formation
conditional
than
These extra
results.
the
move code and
code is reduced.
support.
per
Figure 9 shows the performance result that can execute 2 branches per cycle.
improvement
against
ing poor
a gain of 30% over the
was observed.
the branch
predicated execution. of an 8-issue processor
tributal
On average,
move model
pred-
performed
predication
superblock
support
more
per-
formance
worse
code is limited,
fines provided conditional
full
benchmarks
beats
Since
predicate
the additional
move
of the
across these
code performs
of benchmarks.
does not noticeably
can issue 1 branch
consistently
the effi-
while
of
There-
move and full
conditional
performance
that
and de-
with
and
superblock.
10 shows
noticeable
is
formation.
of branches
superblock
than
issue processor
In particular,
support
support
instruction
code has had many
of conditional
On average,
dependence heights to represent predicated instructions, the ability to simultaneously execute OR type predicate full predicate
predicate
64K
hyperblock
the number
than
35% faster
tech-
gain on top of
illustrates
code.
3% faster
move transforma-
performance
possible
with
predication
with
the performance
ication
have
removed
increasing
improve
improvements.
predication.
and partial
processor
move and full
branches
fore,
su-
cases with
the hyperblock
This
Full kdicati,m
caches.
conditional
is very
developers special
utilizing
model.
performance
though,
the conditional
also achieved
to support
caused by
On average,
and product
performance
move
lengths
conditional for
of full
l-branch
❑
COnditiond Mow
of 072. sc was pri-
chain
a speedup
for
except
in conjunction yields
had
researchers
gains
the conditional
ciency
code
speedup
Most
conditional
behavior
dependence
move transformations. move
perblock.
tions
The unusual
due to increased
11: Effectiveness
for an 8-issue, data
■
S.perblcck
by a realistic
with
As expected cache model.
predicated all
three
However,
Table
2: Dynamic
two benchmarks duced memory using
move traffic
often
tional
proportionally instruction
move’s
instruction
larger
instruction
icated
code
still
predication
conditional The respect cation
Superblock over
full
structions overall
into
code’s
code
having
combined
from
full
predication as shown and fully
in the superblock tial predication predicted With
had only predicated
code.
Much
branches
cost of branch
incur
prediction,
the issue rate
The
taken
goal of the
of dynamic
branches.
this
number
of mispre-
technique.
Remarks
benefit
In
general,
both
number full
predication
evaluation,
less reliance
instructions
executed. enable
As shown full
we show that
sors using Results
based
remains
substantially
with
by
and
not
low
tional
instruction.
marks. in
predication The
support
issue processor support.
superior
perfor-
illustrate
in ILP
illustrate
proces-
predicate
This
contrasts
extra
dynamic
the
issue resources
degradation
Nevertheless,
the
support
even in the presence
where
the processor
the ad-
Full predication
overhead.
support
able performance
extension
Full
handling
is due to its efficient
instruction
move
over utilize
superblock,
of branches
This
encouraging
move.
predication branch
on a four
Conditional
issue rate.
is very gain.
30% with
execution.
of full predication
how-
about
processor
architectural
over conditional
of improving
predicated
allows
base ILP
speedup
results,
more robust
up to one branch
move
small
and partial
predicate and fewer
configurations.
executes
performance
30% gain
gains of full
efficient
to provide
a
stream.
execution,
of processor
This
to remove
in our quantitative
conditional
significant
another
mance
more
that
a relatively
illuspred-
compiler
predication
support. that
paper
and full
the instruction
gain over an aggressive
shows
offers
the from
allows
in this partial
on speculative
gain in a variety
the importance
hidden
allow
of branches
vantage
hardest;
presented of both
substantial
with
condi-
instructions
and result
for the majority
of a low
evaluation
in a siz-
of the bench-
performance
gain
for
and par-
ever,
code,
performance there
and
predicated with
increase code fully
have
Mis-
rate
mispre-
does indicate when
that
exploiting
a compiler conditional
of the results
2-cycle
branch
presented
prediction
show
conservative
performance
Therefore,
fewer
tion.
For machines
with
the available
ILP.
close to the
predicated
code often
ties, we expect to be much vanced
148
compiler
larger
the benefits
more
in this
be extremely on low
paper
miss penalty.
is less
very
must move
issue
processors.
All
of the high
for branches
this
intelligent
penalty.
are fewer because
instructions.
Partially
of full
of branches.
the issue rate
in the code can greatly of branches,
branch.
less than
branches
fully
is heavily
of the speedup
architectures,
for other
which
code is substantially
a significant
in the
in many
superblock
transformation
two of the benchmarks suggests that conditional move could be a valuable feature even in a low issue rate processor. How-
comes from the elimination
Also,
same number
dynamic
7%
strategy
the qualitative
can provided
move code had an
3, the number
code generation
and
con-
the
than
The
and full
unlikely
the total
with
The
no predication
over su-
associated
instructions
As a result,
predication
code normally
Conditional
dynamic in Table
branches
fewer
dictions.
executing
of the
combining
the number
Concluding
performance
with
Therefore
is hit
that
3.
move
branch typically causes more mispreof the mispredictions caused by the
5
per cycle,
and
in-
full
in Table
transformation,
For an eight issue processor
instruction
indicate.
count
the inefficiencies
support.
of 46% more
Finally, partially
instruction
with
full predicate
average
between
than
to a single
be increased
performance
with
speculated
paths.
count
can vary as the results dynamic
ever, since it suffers branches
executed
may
these benefits
2. Full predi-
dynamic
by unnecessary
frequently
of
pred-
of an if-then-else
can increase
branches.
However,
by the dy-
count
this
is to reduce
original
grep
predicated
calls.
by the compiler
With
are combined
ication.
superblock.
instruction
paths
in instruction
and superblock
while
dynamic both
grep.
dictions
trates
re-
superblock
in Table
for hyperblocks for
substantial
effect
benchmarks
are larger
However, the combined dictions than the sum
due to conditional
than
is caused by a branch
transformation
move
of all
for
models
This
and fully
for the conditional
model.
branches
of condi-
faster
is observed
predicated
The differences
subroutine
predication
applied
significantly.
over superblock
is shown
predication
relation
move
zA~o
count
behavior
code.
for partially
of mispredictions
employed
instruc-
Conditional
than
ratios
of predicated
in the number
around
data
2. On average
results
predicated
code is also a result odd
difference
branches
in the misprediction An
execution
is evidenced
54% faster
model
as is executes
struct. count
than
good
code can increase
perblock
This
moves performing
to a processor
increased
is a side
small
re-
and superblock
This
The
of adding
calls in partially
number
models.
promoted
result.
count.
instruction
(1.16)
the performance
same.
performing
50M
code dropped
of egn in Table
yielded
dynamic
these
cache miss rate
count
(0.99) (0.99) (1.28) (1.29) (1.11) (1.02)
in all three
predication
the
the increased namic
misses,
full
922K 44M 1647K 46M 49M 1526K
speculative
Since
predication
while
(0.83) (1.04)
a few less. is a result
subroutine
comparison.
code
performing
an interesting
poorly
75M 3855K
just
of branches
(1.01) (0.99)
real cache significantly
predication
by
cache
and full
also exhibited
mained
The
promotion.
3603M 11073M
count
of 026. compress
and full
caused
move
performed
full
out.
more
predicate
tions Eqn
stand
having
Full Pred. 626M (1.28) 32M (1.04) 885M (0.86) 108M (1.20)
Move (1.66) (1.23) (1.19) (1.41) (1.12) (1.23) (0.93) (1.37) (1.53) (1.11) (1.92) (2.10) (1.61) (2.01) (1.53)
instruction
the performance
Conditional
Cond. 812M 38M 1230M 128M 4003M 13838M 85M 5077K 1422K 49M 2467K 75M 70M 2999K 66M
Superblk 489M 31M 103OM 90M 3574M 11225M 91M 3701K 932K 44M 1282K 36M 44M 1493K 43M
Benchmark 008.espresso 022.li 023.eqntott 026.compress 052. alvinn 056.ear 072.sc Cccp cmp eqn grep lex qsort Wc yacc
branch
of both
pronounced. optimization
gains
are based
This
for predicated prediction
techniques
when become
to
execu-
miss penal-
full and partial
Furthermore,
on a
was chosen
prediction more
ad-
available,
Benchmark 008.espresso 022.li 023.eqntott 026.compress 052.alvinn 056.ear 072.sc Cccp cmp eqn grep lex qsort Wc yacc
Table
3: Comparison
we expect
530K 7470K 663K 14M 8847K 478K 12M
of branch
the performance
to increase.
explore
the range of predication
statistics:
support
Conditional MP
38M 6169K 53M 9269K 74M 443M llM 537K 26K 4506K 171K 3070K 6092K 224K
T.lgyo
number
partial
BR
0.8370 8.2% 1.46% 1.6570 15.06V0 6.85% 4.31%
517K
We also feel it would
predication
MPR 4.55% 10.38% 13.47~o 10.9?% 0.24% 4.3% 5.49%
4395 1612K 9660 232K 1332K 33K
gain of both
ication
move and full
Superblock MP 3402K 774K 42M 1344K I091K 66M 1232K 66K
BR 75M 7457K 315M 12M 463M 1539M 22M 921K
2066K 694K 6732K 864K 896K 16M 1044K 65K 31 514K 20K 201K 597K 57 445K
5944K
of branches
(BR),
and full pred-
be interesting
between
The authors
would
for their
also wish
like to thank
effort
to extend
in helping thanks
to Mike
Kathail
at HP
Labs
Playdoh
model
of predicated
like to thank John
Robert
Ruttenberg
of conditional supported grant
for
Roger
their
insightful
Cohn
and Geoff
SGI
for
moves
and
selects.
by the National
vces, Hewlett-Packard,
Intel
their
Science
Lowney
This
Foundation
of the
and AT&T
(MP),
G. R. Beck, D. W. Yen, and T. L. Anderson, “The Cydra 5 minisupercomputer: Architecture and implementation,” The Journai of Supercomputing, vol. 7, pp. 143-180, January 1993.
[14]
D. N. Pnevmatikatos and G. S. Sohi, “Guarded execution and branch prediction in dynamic ILP processors,” in Proceedings of the 21st International Symposium on Computer Architecture, pp. 120-129, April 1994.
[15]
V. Kathail, M. S. Schlansker, and B. R. Rau, “HPL playdoh architecture specification: Version 1.0,” Tech. Rep. HPL93-80, Hewlett-Packard Laboratories, Palo Alto, CA 94303, February 1994.
strategies and pp. 6–22, Jan-
on Com-
[16]
M. Schlansker,
pp. 21 7–227,
V. Kathail,
“On predicated execution,” Packard Laboratories, Palo
December
and S. Anik,
Hank, and predicated of the 25th pp. 45-54,
1994.
“Height
reduction
of
control recurrences for ILP processors,” in Proceedings of the 27th International Symposium on J4icroarchitecture, pp. 40– 51, December 1994.
M. D. Smith, M. Johnson, and M. A. Horowitz, “Limits on multiple instruction issue,” in Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 290-302, April 1989.
[5] D. W. Wall, “Limits of instruction-level parallelism,” in Proceedings of the lth International Conference on Architectural Support for Programming Languages and Operating Systems, PP. 176–188, April 1991. [6]
R. A. Towle, “The IEEE Computer,
[13]
of dynamic branch [3] T. Y. Yeh and Y. N. Patt, “A comparison predictors that use two levels of branch history? in Proceed-
[4]
(MPR).
S. A. Mahlke, R. E. Hank, R. A. Bringmann, J. C. Gyllenhaal, D. M. Gallagher, and W. W. Hwu, “Characterizing the impact of predicated execution on branch prediction: in Proceedings of the 27th International Symposium on Mi-
GIS.
E. Smith, “A study of branch prediction strategies:’ in Proceedings of the 8th International Symposium on Computer Architecture, pp. 135–148, May 1981.
Symposium 1993.
rate
[12]
De-
[1]J.
ings of the 20th Annual International puter Architecture, pp. 257-266, May
and miss prediction
S. A. Mahlke, D. C. Lin, W. Y. Chen, FL E. R. A. Bringmann, “Effective compiler suppc,rt for “ in Proceedings execution using the hyperblock, International Sympostum on Micr-oarchitecture, December 1992.
References
prediction Computer,
_
[11]
croarchitecture,
“Branch [2] J. Lee and A. J. Smith, branch target buffer design,” IEEE uary 1984.
7.31%
J. C. Park and M. S. Schlansker, Tech. Rep. HPL-91-58, Hewlett Alto, CA, May 1991.
use
under
Micro
mispredictions
3.15~o 11.5% 13.57% 9.3870 1.38% 3.4% 8.26~o 12.15% 0.12!70 11.37% 11.73% 6.46% 10.06% .025%
[10]
and
has been
(NSF)
Advanced
5900K
6931K 867K 1032K 15M 934K 65K 31 511K 20K 196K 61OK 57 431K
MPR
B. R. Rau, D. W. L. Yen, W. Yen, and Cydra 5 departmental supercomputer,” vol. 22, pp. 12–35, January 1989.
we would on the
research
Microsystems
We
at DEC,
discussions
Corporation,
SUN
Finally,
7.48%
1039K 702K
[8]
and Vinod
discussions
33M 611OK 51M 9240K 74M 442M llM 534K 26K 4495K 171K 3030K 6066K 224K
[9] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren, “Conversion of control dependence to data dependence,” in Proceedings of the 10th ACM Symposium on Principles of Programming Languages, pp. 177-189, January 1983.
and Dan
together.
Schlansker
execution.
at
MIP-9308013,
Bringmann
put this paper
5.38~o 11.25~o 12.66~o 9.32% 1.23% 3.52% 9.19% 12.17% 0.12% 11.4% 11.7% 6.55% 9.79% .025%
Predication MP
P. Y. Hsu and E. S. Davidson, “Highly concurrent scalar processing,” in Proceedings of the 13th International Symposium on Computer Architecture, pp. 386–395, June 1986.
conditional
Acknowledgements
Lavery
IMl BR
[7]
to
support.
Move MPR
M. Butler, T. Yeh, Y. Patt, M. Alsup, H. Scales, and M. Shebanow, “Single instruction stream parallelism is greater than two,” in Proceedings of the 18th International Symposium on Computer Architecture, pp. 276-286, May 1991.
149
[17]
Hewlett-Packard Company, Architecture and Instruction
Cupertino, CA, PA-RISC 1.1 Set Reference Manual, 1990.
[18]
D. S. Blickstein et al., tem,” Digital Technical
[19]
trace scheduling comP. G. Lowney et al., “The Multiflow piler,” The Journal 0$ Supercomputing, vol. 7, pp. 51–142, January 1993.
[20]
An effective technique W. W. Hwu et al., “The Superblock: for VLIW and superscalar compilation,” The Journal of Supercomputing, vol. 7, pp. 229–248, January 1993.
“The GEM optimizing compiler sysJournal, vol. 4, pp. 121-136, 1992.