Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns

David R. Kaeli†‡          Philip G. Emma‡

† Rutgers University, Dept. of Electrical and Computer Engineering, New Brunswick, NJ 08903
‡ IBM T. J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598

Abstract
Ideally, a pipeline processor can run at a rate that is limited by its slowest stage. Branches in the instruction stream disrupt the pipeline and reduce processor performance to well below ideal. Since workloads contain a high percentage of taken branches, techniques are needed to reduce or eliminate this degradation.

A Branch History Table (BHT) stores the past action and target for branches, and predicts that future behavior will repeat. Although past action is a good indicator of future action, the subroutine CALL/RETURN paradigm makes correct prediction of the branch target difficult. We propose a new stack mechanism for reducing this type of misprediction.

Using traces of the SPEC benchmark suite running on an RS/6000, we provide an analysis of the performance enhancements possible using a BHT. We show that a number of simple designs perform well, and discuss why a design should instead focus not on the amount of saved history, but on the correctness of the saved branch target address. We further provide a solution to eliminating one class of incorrect target address predictions by using a subroutine CALL/RETURN detection scheme. The proposed mechanism can reduce the wrong guesses by 18.2% on average.
Introduction

Ideal speedup in pipeline processors is seldom achieved due to stalls and breaks in the execution stream. These interruptions are caused by data dependencies, hardware resource contention, and control transfers (branches and interrupts). Of these pipeline hazards, the control transfer class can be the most detrimental to pipeline performance. We address this class in this paper.
A taken branch in the instruction stream will introduce a break in execution. In order to approach peak performance, the pipeline must constantly be kept full of valid instructions. The frequency of taken branches is shown to be over 20% of the instructions in today's workloads [1][2][3][4]. It is necessary to reduce the number of pipeline stalls due to taken branches if we wish to approach ideal performance.

A Branch History Table (BHT) has been shown to be an effective mechanism for reducing the number of these stalls. In this paper we characterize the different wrong target predictions made by a BHT.
1. Branch Penalty Reduction Techniques

Since branches can degrade the performance of the pipeline severely, many methods have been proposed to address this problem. Techniques such as delayed branching [5], branch folding [6], static prediction based on opcode [7], and branch target prefetching have all been used to reduce the penalty imposed by branches.

Delayed branches have been implemented in many architectures (e.g. the IBM 801, MIPS, and Intel 80860). The idea with a delayed branch is to schedule useful work during the time in which the branch instruction is being resolved. One problem with this technique is that for a large percentage of the branches, an instruction can not be found to fill the delay slot. (Campbell reports that 10% of the unconditional branch slots and 40-60% of the conditional branch slots can not be filled [8].)
Branch folding, as implemented in the CRISP microprocessor, attempts to include the next instruction address within the instruction format. A field is added to each instruction that contains the address of the next instruction to be executed. As the instructions are decoded, the address of the next instruction is entered into the previous instruction. On an unconditional branch, the branch target address is entered into the next address field of the previous instruction; the branch is effectively folded into the previous instruction, and the next instruction address is immediately available.

On a conditional branch, the next address field of the previous instruction is modified once the upstream condition that the branch depends upon is resolved. In this case some delay will be introduced unless the condition code can be resolved far enough upstream to permit the modification of the next address field before the branch is executed. Reordering the instruction sequence to do this is called branch spreading. Due to the clustering of branches, this technique can only work in a small number of cases.
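As a rough illustration of the folding step described above, the sketch below fills a next-address field for a window of decoded instructions. The structure layout and field names are assumptions for illustration only, not the CRISP encoding.

    /* Sketch of branch folding: each decoded instruction carries a next-address
     * field.  When the following instruction is an unconditional branch, its
     * target is folded into that field so the branch never occupies a slot. */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint32_t addr;              /* address of this instruction           */
        int      is_uncond_branch;  /* nonzero for an unconditional branch   */
        uint32_t target;            /* branch target, if is_uncond_branch    */
        uint32_t next_addr;         /* folded next-instruction address field */
    } decoded_insn;

    static void fold_branches(decoded_insn *w, size_t n)
    {
        for (size_t i = 0; i + 1 < n; i++) {
            if (w[i + 1].is_uncond_branch)
                w[i].next_addr = w[i + 1].target;   /* branch folded away */
            else
                w[i].next_addr = w[i + 1].addr;     /* next sequential    */
        }
    }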
One approach to reducing the penalty incurred from branches is to examine the problem at the programming level. By studying the typical programming constructs found in high-level languages, we can propose a branch penalty reduction strategy that takes advantage of the more common branching behavior. Unconditional branches, such as subroutine CALL/RETURN and GOTO, are always taken; their behavior seems quite easy to predict. Conditional branching constructs, such as the DO LOOP and the IF/THEN/ELSE block, do not always follow the same path. A DO LOOP is a fairly well-behaved structure, usually resulting in a taken branch to a destination at a negative displacement off the program counter. The IF/THEN/ELSE block may not exhibit predictable behavior, since it may change direction on subsequent iterations.

Static prediction techniques can produce a high percentage of correct guesses, but their success is dependent on the nature of the workload. Smith has reported that by statically predicting that every branch is taken, on average 76.7% of the predictions were correct [9]. Smith has also shown that opcode-based branch prediction can yield accuracies in the range of 65.7% to 99.4%. The problem with this approach is that it relies heavily on characteristics of the branches in the workload; these characteristics tend to vary from workload to workload.

Unfortunately, a processor does not have information about the high-level language programming constructs that are used to produce the object-runnable code. The processor is only able to detect patterns of branch behavior by capturing a sample of the behavior. Then, by using this stored history of prior branch execution, the processor can predict the direction of subsequent executions when the branch is next encountered. This is the fundamental concept behind the Branch History Table.

Prefetching of branch targets was implemented on the IBM System/360 Model 91. For forward branches (backward branches are handled by two doubleword buffers operating in loop mode), once the branch is decoded and the target address has been generated, the branch sequencer initiates fetches down the target stream. The target instruction can then be rerouted to the decoder once the branch is resolved. One problem with this strategy is that unnecessary memory accesses for instructions that are not executed can drastically reduce the amount of memory bandwidth available to the processor.
A common problem with all of these techniques is that prediction or redirection is performed at decode or execute time. By this time, a typical processor pipeline has prefetched, and possibly decoded, past this control point. A Branch History Table provides autonomous prediction and prefetching of instructions: it redirects the prefetching of the instruction stream as early as possible, via the historically observed branch target. (Appendix A provides an example of one BHT design.) Providing the ability to predict far upstream of the execution unit, ahead of decode time, will minimize any latency incurred by the target fetch. None of the other mechanisms we described issue the target fetch as early as a BHT design does. In the next section we provide further insight into the use of a BHT.
2. Branch History Tables

Branch History Tables have been described by Smith [9], Lee and Smith [10], Holgate and Ibbett [11], Hughes et al. [12], Lilja [13], and Pomerene [14]. In its simplest form, a BHT maintains the outcomes of previously executed branches. The table is accessed by the instruction prefetch unit and decides whether prefetching should be redirected or not. The table is searched for a valid entry, just as a cache is searched. (The table is typically set-associative, with many of the same organizations as a cache [15].)

An entry is added to the table when a taken branch is executed by the processor. On entry into the table, historical information for the branch is kept; this information is used by the prediction algorithm. For a taken prediction, prefetching is redirected to the stored target; for a not-taken prediction, prefetching simply continues with the next sequential instruction. Some implementations invalidate the entry when the branch changes to not taken. In this case, a miss will occur subsequently, and next-sequential prefetching will ensue. Cache and memory cycles will be used for fetching instructions that will not be executed (assuming prefetching is redirected by the prediction). When a wrong prediction is made, a back-out strategy is needed; the processor must be equipped with the ability to restore the necessary state.
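The table operation just described can be made concrete with a small trace-driven sketch. The direct-mapped organization, table size, and function names below are assumptions for illustration; the model used in this paper (Section 3) is fully associative.

    #include <stdint.h>

    #define BHT_SIZE 1024               /* illustrative size, direct-mapped for brevity */

    typedef struct {
        int      valid;
        uint32_t branch_addr;           /* tag: full branch address      */
        uint32_t target_addr;           /* historically observed target  */
    } bht_entry;

    static bht_entry bht[BHT_SIZE];

    static unsigned bht_index(uint32_t addr) { return (addr >> 2) % BHT_SIZE; }

    /* Lookup during prefetch: returns 1 and the predicted target on a hit;
     * on a miss, prefetching continues with the next sequential instruction. */
    int bht_predict(uint32_t addr, uint32_t *target)
    {
        bht_entry *e = &bht[bht_index(addr)];
        if (e->valid && e->branch_addr == addr) { *target = e->target_addr; return 1; }
        return 0;
    }

    /* Update when the branch is resolved: taken branches are (re)installed,
     * and an entry is invalidated when its branch goes not-taken. */
    void bht_update(uint32_t addr, int taken, uint32_t actual_target)
    {
        bht_entry *e = &bht[bht_index(addr)];
        if (taken) {
            e->valid = 1; e->branch_addr = addr; e->target_addr = actual_target;
        } else if (e->valid && e->branch_addr == addr) {
            e->valid = 0;
        }
    }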
The amount of branch history kept also affects the accuracy of the predictions. Lee and Smith [10] show how to use five bits of history to increase the prediction accuracy to over 92% for the majority of their workload (the five bits indicate the direction of the last five iterations of a branch). It can be argued that one or at most two bits of history are necessary to take advantage of the underlying data structures; two bits would be useful to handle the case of nested loops. Any more history than two bits may even degrade the prediction performance for other workloads.
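For reference, the two-bit case mentioned above is usually realized as a saturating counter per entry. A minimal sketch follows; it is not part of this paper's model, which keeps no history bits.

    /* Two-bit saturating direction counter: values 0..3, predict taken when >= 2.
     * Keeping one such counter per BHT entry is one way to hold two bits of history. */
    typedef unsigned char ctr2_t;

    static int predict_taken(ctr2_t c) { return c >= 2; }

    static ctr2_t train(ctr2_t c, int taken)
    {
        if (taken)  return (ctr2_t)(c < 3 ? c + 1 : 3);
        return (ctr2_t)(c > 0 ? c - 1 : 0);
    }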
One feature of a BHT is that, by invalidating not-taken conditional branches in the table, they are (by default) predicted not taken.

Some design trade-offs need to be addressed to tailor the BHT to a particular machine implementation. Most of these trade-offs can be determined by analyzing traces of the target machine environment. Issues such as table size, associativity, number of history bits saved, and prediction algorithm all influence the success of a particular implementation. The goal is to provide the processor with the correct instruction stream before the "control point" [11] in the pipeline is reached. Other factors that determine the effectiveness of a BHT are: the aging out of valid history table entries, the frequency of first-time taken branches, address aliasing, and incorrect instruction predictions due to page relocation. It should be stressed that all of these implementation issues are tied directly to the nature of the target workload.

Previous BHT studies focused on the accuracy of the prediction algorithm [9][10]. Prediction algorithm accuracy only relates to predicting conditional branches; unconditional branches should always be predicted taken.

One key problem that has been overlooked in the past is the correctness of the branch target address that is stored in the BHT. The BHT logic must be able to predict the direction of the branch (taken or not taken), and must also produce the correct target address, so that instruction prefetching can be redirected to the appropriate address. Wrong predictions due to a changing branch target accounted for a large percentage of the wrong predictions in our traces. (In two of our traces, over half of the wrong predictions were due to changing targets.) From further analysis, we found that incorrect predictions made due to CALL/RETURN from subroutines were the major contributor to this problem. An important difference between changing history and a changing target is that we can eliminate many of the wrong guesses due to a changing target by using a set of CALL/RETURN stacks. This design is detailed in Section 5.
3. Trace Tapes and Model Description

The traces used in this evaluation are taken from a workload running on an IBM RS/6000 [4] running AIX. The applications traced are taken from the SPEC version 1.0 benchmark suite [16]; Appendix B provides the details of each trace. The traces are taken from both Fortran and C source programs. Ten traces, 1 million instructions each, were used to run through our BHT model.

To model the performance of a BHT we use full instruction traces. (If only address traces were used, then only taken branches could be detected, and misses in the BHT on not-taken conditional branches could not be modeled.) We assume that: 1) no overlays occur (due to remapping), 2) the BHT only keeps information for taken branches, 3) if a valid entry is found in the BHT and the current execution of the branch results in a not-taken branch, then the BHT entry is invalidated, and 4) when the target of a branch changes, the corresponding branch target address field is updated in the BHT.

Our BHT model uses the following parameters: 1) the table is fully associative, and each entry contains the full branch address as the tag along with the full branch target address, 2) since we only use the last execution of a branch in our prediction algorithm, no branch history bits are necessary in our implementation (if a branch is in the table, then it must have been taken the last time it was executed), 3) the table is updated at branch resolution time, and 4) replacement is LRU. As discussed before, only in some cases can two bits of history be used effectively, and any more history is highly sensitive to workload characteristics.

We look at how the size of the BHT affects the prediction accuracy. Our prediction results are broken into the following five classes: 1) branches predicted correctly, 2) branches predicted incorrectly due to a change in target address, 3) branches predicted incorrectly due to a change in history, 4) branches that miss the BHT that are not taken (miss correctly), and 5) branches that miss the BHT that are taken (miss incorrectly). A miss in a BHT should not be confused with a miss in a cache; a miss in a BHT is the desired result for branches that are not taken.

BHT sizes of 128, 256, 512, 1024, 2048 and 4096 entries were simulated. These sizes were chosen as reasonable sizes to implement in today's technology. It can be argued that a fully-associative design is overkill (in practice a 2- or 4-way design would be more practical). Our purpose is not to propose an optimal BHT scheme. Instead we choose to analyze how the BHT performs, and we identify the action taken for each branch encountered on the trace. This breakdown is shown in the next section for each of the ten traces.
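The classification just listed maps directly onto a small trace-driven loop. The sketch below follows the stated model (taken-only entries, invalidate on not-taken, target updated on change, fully associative with LRU replacement); the record layout and helper names are assumptions for illustration.

    #include <stdint.h>
    #include <stddef.h>

    enum { PRED_CORRECT, WRONG_TARGET, WRONG_HISTORY, MISS_CORRECT, MISS_WRONG };

    typedef struct { int valid; uint32_t addr, target, lru; } entry;

    #define ENTRIES 4096
    static entry    tab[ENTRIES];
    static uint32_t now;                  /* LRU clock                  */
    static long     count[5];             /* tally per prediction class */

    static entry *find(uint32_t addr)
    {
        for (int i = 0; i < ENTRIES; i++)
            if (tab[i].valid && tab[i].addr == addr) return &tab[i];
        return NULL;
    }

    static entry *victim(void)            /* LRU replacement */
    {
        entry *v = &tab[0];
        for (int i = 0; i < ENTRIES; i++) {
            if (!tab[i].valid) return &tab[i];
            if (tab[i].lru < v->lru) v = &tab[i];
        }
        return v;
    }

    /* Score one branch record from the trace into one of the five classes. */
    void simulate_branch(uint32_t addr, int taken, uint32_t target)
    {
        entry *e = find(addr);
        now++;
        if (e == NULL) {                               /* BHT miss */
            count[taken ? MISS_WRONG : MISS_CORRECT]++;
            if (taken) {                               /* only taken branches enter */
                e = victim();
                e->valid = 1; e->addr = addr; e->target = target; e->lru = now;
            }
            return;
        }
        e->lru = now;
        if (!taken)                   { count[WRONG_HISTORY]++; e->valid = 0; }
        else if (e->target != target) { count[WRONG_TARGET]++;  e->target = target; }
        else                          { count[PRED_CORRECT]++; }
    }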
Figure 1. Percent correct predictions on each trace (tomcatv, nasa, matrix, eqntott, fpppp, spice, espresso, doduc, li, gcc) as a function of the number of BHT entries (500 to 4500).
4. BHT Modeling Results

Figure 1 plots the prediction accuracy for each of the ten traces as a function of the number of BHT entries. Predictions are reported as a percent of all branches, and this number includes correct predictions on not-taken branches that miss the BHT. (These are implicitly correct, since a miss is the correct result for a not-taken branch.) The accuracy is better than 85% for most of the traces; the worst is 70-75%. The poor performance on Espresso can be attributed to the large number of misses, which are a result of BHT invalidations due to wrong history predictions (a number of branches are displaying an unpredictable behavior). GCC also exhibits poor performance for BHTs smaller than 1024 entries; this can be attributed to the large number of unique branches found in this benchmark. Increasing the size of the BHT past 512 entries does not increase the prediction accuracy significantly. From Table 1 we see that no trace contains more than 4096 unique branches; thus, increasing the BHT beyond 4096 entries will not change the prediction accuracy. Our purpose here is not to propose the optimal BHT size, and results will vary from workload to workload. Table 1 is used here to pick a BHT size that captures the maximum number of branches contained in any single trace of our suite (which emulates an infinite BHT).

Table 1. Number of unique branch instructions in each trace (tomcatv, nasa, matrix, eqntott, fpppp, spice, espresso, doduc, li, gcc). Counts range from a handful for the small floating-point kernels up to 2,529; no trace exceeds 4096 unique branches.
In Figure 2 we show the branch performance profile for each of the ten traces when run through a 4096-entry BHT. We select this breakdown so that we can identify where there exists potential for improvement. The goal is to decrease the number of wrong predictions. As previously stated, we can not expect to perform better across all workloads if we only focus on the history algorithm.

Figure 2. BHT Performance Profile (4096 BHT entries), per benchmark, broken down by prediction class (correct predictions, correct misses, wrong misses, wrong history, wrong target).

The number of correct misses encountered can not be reduced; we have already captured the maximum number of branches possible by using a sufficiently large BHT. Wrong misses are a result of first-time misses (either due to the first occurrence of the branch on the trace tape, or a branch being taken after it had been invalidated in the BHT due to wrong history).

A point that should be stressed is that when the BHT predicts that a branch will be taken and then it is not taken, the BHT will perform worse than the non-BHT case (the non-BHT design will have prefetched the next sequential instruction address). For a wrong miss, the BHT will perform the same as the non-BHT case (the same action will be taken by the BHT design). As a result, we focus on decreasing the number of incorrect predictions.

For each of the ten traces we have plotted the breakdown of all wrong predictions in Figure 3. This illustrates the potential for improvement. In six out of ten of the traces there exists potential for reducing the number of wrong branch-target predictions; the other four traces contain very few wrong predictions, and we must be sure not to reduce any of the existing accuracy of the history algorithm used for those traces while optimizing the other six. In the next section, we look at a solution that reduces the number of wrong predictions due to incorrect targets.

Figure 3. BHT Wrong Prediction Profile (4096 BHT entries), per benchmark.

5. CALL/RETURN Stack Pair Implementation

To reduce the number of wrong predictions due to changing targets, we need to better understand the underlying structure that causes these targets to change. Subroutine RETURN constitutes the largest percentage of wrong targets.
The sample program provided in Figure 4 illustrates the problem. From address 100 we branch to the subroutine PRINT at address 500. We then enter an entry in the BHT for address 100 with a target of 500. On the RETURN from PRINT we enter the address 600 in the BHT with a target of 110. On instruction 130 we again branch to PRINT, entering the address 130 into the BHT with a target of 500. When we get to the end of the subroutine PRINT, we will hit on a valid entry in the BHT (address = 600) and predict that we will return to address 110. Of course, this is not correct; the next instruction to be executed is at address 140.
Figure 4. Sample Program

    PROGRAM SAMPLE
    Instruction Address    Instruction
    100                    CALL @PRINT
    110                    ...
    120                    ...
    130                    CALL @PRINT
    140                    ...

    PRINT:
    500                    ...
    510                    ...
    600                    RETURN
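The moving-target behavior can be replayed in a few lines: a plain BHT entry for the RETURN keeps whatever target it went to last, so the second return is predicted to go back to the first call site. The sketch below hard-codes the Figure 4 addresses purely for illustration.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Single BHT entry for the RETURN at address 600. */
        int valid = 0; uint32_t target = 0;

        /* First call, from address 100: the RETURN goes to 110 and is installed. */
        valid = 1; target = 110;

        /* Second call, from address 130: the correct return address is now 140, */
        /* but the table still predicts the historically observed target.        */
        if (valid)
            printf("prediction for RETURN at 600: %u (actual next address: 140)\n",
                   (unsigned)target);
        return 0;
    }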
How can we detect this behavior and supply the correct address before an incorrect prediction is made? By using a set of stacks, we can identify when a subroutine CALL is made and then supply the correct target address when the RETURN is encountered [17].

Figure 5 shows a picture of the two-stack design. In addition to the two stacks, each BHT entry must be augmented with a "subroutine return" bit (SR) that is used to indicate that this is a special entry.

Figure 5. Call/Return Stack Design: the BHT, the stack pair S1/S2, and the processor, connected by the numbered paths referred to in the algorithm below.
We will use the programming example in Figure 4 to describe the design. The algorithm works as follows:

1. When the CALL at address 100 is executed, the addresses 100 and 500 are sent along path 1 to create an entry with the branch address set to 100 and the branch target set to 500. Also, the target address of the CALL (500) is pushed onto the S1 stack via path 3, and the return address (110, the next sequential instruction) is pushed onto the S2 stack via path 6.

2. When the RETURN is executed at 600 with target address 110, the addresses 600 and 110 are sent to the BHT along path 1, as branch address and branch target respectively. In parallel, 110 is sent to stack S2 along path 6 to see if S2 has an entry for it. In this case, S2 does have an entry, and its corresponding entry in S1 (in this case 500) is found and sent along path 4, where it replaces the target address on path 1. The entry in the BHT then has a branch address of 600 and a target address of 500. A bit in the BHT entry (the SR bit) is turned on to denote that this is a special entry.

3. When the CALL at address 130 is executed, the branch address (130) and target (500) are sent to the BHT via path 1. Also, the target address of the CALL (500) is pushed onto S1 via path 3, and the return address (140) is pushed onto stack S2 via path 6.

4. When instruction address 600 is later prefetched, the BHT will find an entry for address 600 (the RETURN), and that entry will have the SR bit turned on. The prediction (branch address 600, target 500) is sent to path 2, and since the special bit is on, the target field associated with this entry (500) is sent to stack S1 on path 5 to see if S1 has an entry for 500. In this case S1 does, so the corresponding entry in S2 (140) is identified via path 9 and is put on path 7, where it replaces the target field on path 2. (All entries on the stack are compared in parallel; in the event of a tie, the most recent entry is chosen.)

5. The prediction that is made for address 600 will then have the target 140. The stack provides the correct return address instead of the historical address.

To show how much savings can be expected from this implementation, we have run all ten traces through our 4096-entry BHT model, adding an SR bit for each entry and a pair of CALL/RETURN stacks. Stack depths of 5 and 10 were simulated; Table 2 lists the results. By adding a 5-entry pair of stacks we can reduce the percentage of wrong predictions by up to 4.6% (an improvement of 31.5%). We have found that, for many of the benchmarks, using a BHT containing 128 entries with a pair of stacks will predict more branches correctly than using a stackless BHT with 4096 entries (e.g., the Li run with a 128-entry BHT and a pair of 5-entry stacks showed a 33% improvement over a stackless 4096-entry BHT).
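A software sketch of the stack pair is given below. It collapses S1 and S2 into a single stack of (call target, return address) pairs, ignores the numbered hardware paths, and does not pop entries, matching the parallel-compare, most-recent-wins rule above. These simplifications, and all of the names, are assumptions for illustration.

    #include <stdint.h>

    #define DEPTH 5                    /* stack depth evaluated in Table 2 */

    static uint32_t s1[DEPTH];         /* S1: CALL target (subroutine entry) addresses */
    static uint32_t s2[DEPTH];         /* S2: corresponding return addresses           */
    static int top = -1;

    /* On a CALL: push the pair (subroutine entry, return address). */
    void on_call(uint32_t call_target, uint32_t return_addr)
    {
        if (top + 1 < DEPTH) {
            top++;
        } else {                       /* full: age out the oldest pair */
            for (int i = 1; i < DEPTH; i++) { s1[i-1] = s1[i]; s2[i-1] = s2[i]; }
        }
        s1[top] = call_target; s2[top] = return_addr;
    }

    /* When a RETURN resolves: if its actual target matches a pushed return
     * address, store the subroutine entry point in the BHT entry instead of
     * the return address, and set the entry's SR bit. */
    int on_return_update(uint32_t actual_target, uint32_t *bht_target, int *sr_bit)
    {
        for (int i = top; i >= 0; i--)          /* most recent match wins */
            if (s2[i] == actual_target) { *bht_target = s1[i]; *sr_bit = 1; return 1; }
        return 0;
    }

    /* At prediction time, an SR entry's stored target (the subroutine entry
     * point) is mapped through S1 to the most recently pushed return address. */
    int sr_predict(uint32_t stored_target, uint32_t *predicted)
    {
        for (int i = top; i >= 0; i--)
            if (s1[i] == stored_target) { *predicted = s2[i]; return 1; }
        return 0;
    }

Replaying Figure 4 with these routines (on_call(500, 110), then on_return_update(110, ...), then on_call(500, 140)) yields sr_predict(500, ...) = 140, the correct return address.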
Table 2. Percent of wrong predictions (includes all predictions) for each benchmark (tomcatv, nasa, matrix, eqntott, fpppp, spice, espresso, doduc, li, gcc), with no stack, with a 5-entry stack pair, and with a 10-entry stack pair.
Doubling the stack depth to 10 entries only showed a difference in the Li and GCC benchmarks. Another encouraging result is that the performance of the four benchmarks that were previously generating good results was not adversely affected by the introduction of the stacks.
Conclusions

To optimize processor performance, the instruction pipeline must be kept full with valid instructions. Much effort has been spent trying to reduce or eliminate the latencies that arise in the presence of taken branches. Branch History Tables have been shown to be an effective approach to reducing these latencies; BHTs are especially attractive since they attempt to redirect prefetching as early as possible.

Previous studies have characterized BHT behavior while trying to optimize the prediction algorithm by investigating different amounts of history. In our study, we have focused on the problem of saving the correct target, concentrating on those aspects of the design that reduce the number of incorrect predictions and that are less dependent on workload characteristics than some previously proposed history-based algorithms. We have also presented a robust BHT design that is based on the underlying programming structures that exist in today's applications. We have shown how to reduce the number of wrong predictions made by 18.2% on average by using a pair of CALL/RETURN stacks. The results have been found to perform consistently as well, if not considerably better, across all of the traces in our study.
Acknowledgements

The authors would like to thank Juho Tung for providing the tracing tool used to produce the traces, and Herbert Freeman for his support of this research.
References

1. Kaeli D.R., Kirkpatrick S., Ong S., "PC Workload Characterization", Proceedings of the ACM Sigmetrics and Performance '89, May 1989, p. 220.
2. Adams T., Zimmerman R., "An Analysis of 8086 Instruction Set Usage in MS DOS Programs", Proc. Third Symposium on Architectural Support for Programming Languages and Operating Systems, Boston, 1989, pp. 152-161.
3. Clark D.W., Levy H., "Measurement and Analysis of Instruction Use in the VAX 11/780", Proc. Ninth Symposium on Computer Architecture, Austin, Tx., April 1982, pp. 9-17.
4. Grohoski G.F., "Machine Organization of the IBM RISC System/6000 Processor", IBM Journal of Research and Development, Vol. 34, No. 1, Jan. 1990, pp. 37-58.
5. Patterson D.A., "Reduced Instruction Set Computers", Communications of the ACM, Vol. 28, No. 1, Jan. 1985, pp. 8-21.
6. Ditzel D.R., McLellan H.R., "Branch Folding in the CRISP Microprocessor: Reducing Branch Delay to Zero", Proc. 14th Ann. Symp. on Computer Architecture, 1987, pp. 2-9.
7. Anderson D.W., et al., "The IBM System/360 Model 91: Machine Philosophy and Instruction Handling", IBM Journal of Research and Development, Jan. 1967, pp. 8-24.
8. Campbell R., "Compiling C for the Reduced Instruction Set Computer", Master's report, EECS, U.C. Berkeley 94720, Dec. 1980.
9. Smith J.E., "A Study of Branch Prediction Strategies", Proc. Eighth Symposium on Computer Architecture, Minneapolis, May 1981, pp. 135-148.
10. Lee J., Smith A.J., "Branch Prediction Strategies and Branch Target Buffer Design", Computer, Vol. 17, No. 1, Jan. 1984, pp. 6-22.
11. Holgate R.W., Ibbett R.N., "An Analysis of Instruction Fetching Strategies in Pipelined Computers", IEEE Trans. on Computers, Vol. C-29, No. 4, April 1980, pp. 325-329.
12. Hughes J.E., Liptay J.S., Rymarczyk J.W., Stone S.E., "Multi-Instruction Stream Branch Processing Mechanism", U.S. Patent 4,200,927, Apr. 29, 1980.
13. Lilja D.J., "Reducing the Branch Penalty in Pipelined Processors", IEEE Computer Magazine, July 1988, pp. 47-55.
14. Pomerene J.H., Puzak T.R., Rechtschaffen R., Rosenfeld P.L., Sparacio F.J., "Pageable Branch History Table", U.S. Patent 4,679,141, Jul. 7, 1987.
15. Smith A.J., "Cache Memories", Computing Surveys, Vol. 14, No. 3, Sept. 1982, pp. 473-530.
16. SPEC Quarterly Newsletter, System Performance Evaluation Cooperative, 1st Quarter 1990.
17. Webb C.F., "Subroutine Call/Return Stack", IBM Tech. Disc. Bulletin, Vol. 30, No. 11, April 1988.
Appendix A. Branch History Table Description

The purpose of the Branch History Table is to redirect prefetching such that when a branch is executed in the processor, the target instruction stream has already been fetched and is ready for execution. The example below is one possible implementation of a BHT. In our example, we assume that prefetching will fetch the next sequential address in the absence of any redirection. In the non-BHT description, prefetching is only redirected when the execute unit (E) detects a taken branch.

A.1. Non-BHT Description

Instructions are prefetched from memory at the next sequential address; only when the execute unit (E) resolves a taken branch (branch address BA, target address TA) is prefetching redirected.

A.2. BHT Description

In the BHT description, each prefetch effects a BHT lookup. The IPFAR (Instruction Prefetch Address Register) is compared against the branch address (BA) of each entry. If a match is found, this constitutes a BHT hit, and the corresponding target address (TA) is fed to the select logic; prefetching is redirected to the new address.

In the case that the BHT prediction is wrong, the execute unit (E) will signal the branch error handling logic to redirect prefetching to the new target address or to the fall-through address (just as in the case of the non-BHT description). If the instruction was neither detected in the BHT nor signaled as wrong by the execute unit, prefetching continues with the next sequential address. When a first-time taken branch is encountered, the BHT is updated by the branch error handling logic. Similarly, when a wrong prediction is made by the BHT, the branch error handling logic will invalidate the corresponding BHT entry.
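A sketch of the per-prefetch lookup described above follows. The lookup callback stands in for the BA compare and select logic, and the fixed instruction size is an assumption.

    #include <stdint.h>

    #define INSN_SIZE 4u               /* assumed fixed instruction length */

    typedef int (*bht_lookup_fn)(uint32_t addr, uint32_t *target);

    /* One prefetch step: compare the prefetch address (IPFAR) against the BHT;
     * on a hit, redirect to the stored target, otherwise advance sequentially. */
    uint32_t prefetch_step(uint32_t ipfar, bht_lookup_fn lookup)
    {
        uint32_t target;
        if (lookup && lookup(ipfar, &target))
            return target;             /* hit: redirect prefetching     */
        return ipfar + INSN_SIZE;      /* miss: next sequential address */
    }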
Appendix B. Trace Tape Description

The benchmarks used in this study were taken from the version 1 release of the SPEC Benchmark Suite. The traces are taken at an offset of 1 million instructions into the execution of each application. Each trace is 1 million instructions in length.

1. Espresso - one of a collection of tools for the generation and optimization of Programmable Logic Arrays. This is an integer benchmark written in C; our trace used tial.in as input.
2. Spice - a general purpose circuit simulation program written in Fortran.
3. Doduc - a Monte Carlo simulation written in Fortran.
4. Nasa - a collection of 7 floating point kernels, comprised of 2,200 lines of Fortran.
5. Li - a Lisp interpreter written in C, solving the 8-queens problem.
6. Eqntott - an integer-intensive benchmark written in C, translating a logical representation of a boolean equation to a truth table.
7. Matrix - a Fortran benchmark built around matrix operations.
8. Fpppp - a double precision quantum chemistry benchmark written in Fortran.
9. Tomcatv - a double precision, vectorizable mesh generation Fortran program.
10. GCC - the GNU C compiler, compiling preprocessed source files into optimized assembly language. There are 109,000 lines of C.