Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns

David R. Kaeli†‡        Philip G. Emma‡

† Dept. of Electrical and Computer Engineering, Rutgers University, New Brunswick, N.J. 08903
‡ IBM T. J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598
Abstract

Ideally, a pipelined processor can run at a rate that is limited by its slowest stage. Branches in the instruction stream disrupt the pipeline and reduce processor performance to well below ideal. Since workloads contain a high percentage of taken branches, techniques are needed to reduce or eliminate this degradation.

A Branch History Table (BHT) stores the past action and target for branches, and predicts that future behavior will repeat. Although past action is a good indicator of future action, the subroutine CALL/RETURN paradigm makes correct prediction of the branch target difficult. We propose a stack mechanism for reducing this type of misprediction.

Using traces of the SPEC benchmark suite running on an RS/6000, we provide an analysis of the performance enhancements possible using a BHT. We show the accuracy achievable with a number of simple designs, and discuss why a design should focus not on the amount of saved history, but on the correctness of the saved branch target address. We further provide a solution for eliminating one class of incorrect target address predictions by using a subroutine CALL/RETURN detection scheme. The proposed mechanism reduces the wrong guesses by 18.2% on average.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1991 ACM 0-89791-394-9/91/0005/0034 $1.50

Introduction

A taken branch in the instruction stream will introduce a break in execution. In order to approach peak performance, the pipeline must constantly be kept full of valid instructions. The frequency of taken branches is shown to be over 20% of the instructions in today's workloads [1][2][3][4]. It is necessary to reduce the number of pipeline stalls due to taken branches if we wish to approach ideal performance. A Branch History Table (BHT) has been shown to be an effective mechanism for reducing the number of these stalls. In this paper we characterize the different wrong predictions that a BHT can make, and propose a mechanism that addresses one important class of mispredictions.

1. Branch Penalty Reduction Techniques

Ideal speedup in pipelined processors is seldom achieved, due to stalls and breaks in the execution stream. These interruptions are caused by data dependencies, hardware resource contention, and control transfers (branches and interrupts). Of these pipeline hazards, the control transfer class can be the most detrimental to pipeline performance. We address this class in this paper.

Since branches can degrade the performance of the pipeline severely, many methods have been proposed to address this problem. Techniques such as delayed branching [5], branch folding [6], static prediction based on opcode, and branch target prefetching [7] have all been used to reduce the penalty imposed by branches.

Delayed branches have been implemented in many architectures (e.g. IBM 801, MIPS, Intel 80860). The idea with a delayed branch is to schedule useful work during the time in which the branch instruction is being resolved. One problem with this technique is that for a large percentage of the branches, an instruction can not be found to fill the delay slot. (Campbell reports that 10% of the unconditional branch slots and 40-60% of the conditional branch slots can not be filled [8].)

Branch folding, as implemented in the CRISP microprocessor, attempts to include the next instruction address within the instruction format. A field is added to each instruction that contains the address of the next instruction to be executed. As the instructions are decoded, the next instruction address is entered into the previous instruction. On an unconditional branch, the next address field of the previous instruction is effectively the target of the branch; the branch is folded into the previous instruction, and its target is immediately available.

On a conditional branch, the next address field of the previous instruction is modified once the upstream condition that the branch depends upon is resolved. In this case some delay will be introduced unless the condition code can be resolved far enough upstream to permit the modification of the next address field before the branch is executed. Reordering the instruction sequence to do this is called branch spreading. Due to the clustering of branches, this technique can only work in a small number of cases.

One approach to reducing the penalty incurred from branches is to examine the problem at a programming level. By studying the branching behavior of typical programming constructs, we can propose a branch penalty reduction strategy that takes advantage of the constructs more common in high-level languages. Unconditional branching constructs, such as subroutine CALL/RETURN and GOTO, are always taken; their behavior seems quite easy to predict. Conditional branching constructs, such as the DO LOOP and the IF/THEN/ELSE block, do not always follow the same path. A DO LOOP is a fairly well-behaved structure, usually resulting in a taken branch to a destination at a negative displacement off the program counter. The IF/THEN/ELSE block may not exhibit highly predictable behavior, since it may change direction on subsequent iterations.

Smith has reported that by statically predicting that every branch is taken, on average 76.7% of the predictions were correct [9]. Smith has also shown that opcode-based branch prediction can yield accuracies in the range of 65.7% to 99.4%. The problem with this approach is that it relies heavily on characteristics of the branches in the workload, and these characteristics tend to vary from workload to workload.

Branch target prefetching was implemented on the IBM System/360 Model 91. For forward branches (backward branches are handled by loop mode), once a branch has been decoded and the branch target address has been generated, the branch sequencer initiates two doubleword fetches down the target stream. The target instruction can then be rerouted to the decoder once the branch is resolved. One problem with this strategy is that unnecessary memory accesses for instructions that are not executed can drastically reduce the amount of memory bandwidth available to the processor.

A common problem with all of these techniques is that prediction or redirection is performed at decode or execute time. By this time, a typical processor has prefetched, and possibly decoded, past this point in the instruction stream. Unfortunately, a processor does not have information about the high-level language programming constructs that are used to produce the object code; it is only able to detect patterns of branch behavior by capturing a sample of that behavior. Then, by using this stored history of prior branch execution, the processor can predict subsequent execution when the branch is next encountered. This is the fundamental concept behind a Branch History Table.

2. Branch History Tables

Branch History Tables have been described by Smith [9], Lee and Smith [10], Holgate and Ibbett [11], Hughes et al. [12], Lilja [13], and Pomerene [14]. In its simplest form, a BHT maintains the outcomes of previously executed branches. The table is accessed by the instruction prefetch unit, which decides whether prefetching should be redirected or not. The table is searched for a valid entry, just as a cache is searched, and the table is typically set-associative, as with many cache organizations [15]. An entry is only added to the table when a taken branch is executed by the processor.

On a BHT hit, the historical information for that entry is used by the prediction algorithm. The algorithm redirects prefetching via the historically observed branch target for a taken prediction, or continues prefetching the next sequential instruction for a not-taken prediction. Some implementations invalidate the entry when the branch changes to not taken; in this case, a miss will occur subsequently, and next-sequential prefetching will ensue. If the prediction is wrong, a back-out strategy is needed, and the processor must be equipped to restore the necessary state.

A BHT provides autonomous prediction and prefetching of instructions. Appendix A provides an example of one BHT design. The BHT redirects the prefetching of the instruction stream as early as possible; providing the ability to predict far upstream of the execution unit will minimize any latency incurred by the target fetch. None of the other mechanisms we described issue the target fetch as early as in a BHT design. In the next section we provide further insight into the use of a BHT.

The amount of branch history kept also affects the accuracy of the predictions. Lee and Smith [10] describe how to use five bits of history to increase the prediction accuracy to over 92% for the majority of their workload (the five bits indicate the direction of the last five iterations of a branch). It can be argued that one or at most two bits of history are enough to take advantage of the underlying data structures; two bits would be useful to handle the case of nested loops. Any more history than two bits may even degrade the prediction performance for other workloads.
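The two-bit case argued for above behaves as a saturating counter. The following sketch is a hypothetical illustration of that scheme (it is not the design evaluated in this paper, which keeps only a single implicit "taken" bit per entry); it shows why two bits tolerate the single not-taken exit of an inner loop without also mispredicting the next visit to the loop:

```python
class TwoBitCounter:
    """Hypothetical 2-bit saturating counter, one per table entry."""

    def __init__(self, state=3):          # 0,1 predict not taken; 2,3 predict taken
        self.state = state

    def predict(self):
        return self.state >= 2            # True means "predict taken"

    def update(self, taken):
        # Saturate at 0 and 3 so one anomalous outcome cannot flip the prediction.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# An inner loop taken three times and then falling through, visited five times:
ctr = TwoBitCounter()
outcomes = [True, True, True, False] * 5
correct = 0
for taken in outcomes:
    if ctr.predict() == taken:
        correct += 1
    ctr.update(taken)
print(correct, len(outcomes))             # prints "15 20": only the loop exits mispredict
```

A one-bit scheme would mispredict twice per loop visit (the exit and the re-entry), which is the nested-loop case the two-bit counter handles.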

One feature of a BHT is that by invalidating not-taken conditional branches in the table, they are (by default) predicted not taken.

Some design trade-offs need to be addressed to tailor the BHT to a particular machine implementation. Most of these trade-offs can be determined by analyzing traces of the target machine environment. Issues such as table size, associativity, number of history bits saved, and prediction algorithm all influence the success of a particular implementation. The goal is to provide the processor with the correct instruction stream before the "control point" [11] in the pipeline is reached. Other factors that determine the effectiveness of a BHT are: the aging out of valid history table entries, the frequency of first-time taken branches, address aliasing, and incorrect instruction predictions due to page relocation (remapping). It should be stressed that all of these implementation issues are tied directly to the nature of the target workload.

Previous BHT studies focused on the accuracy of the prediction algorithm [9][10]. Prediction algorithm accuracy only relates to predicting conditional branches; unconditional branches should always be predicted taken. One key problem that has been overlooked in the past is the correctness of the branch target address that is stored in the BHT. The BHT logic must be able to predict the direction of the branch (taken or not taken), and must also produce the correct target address, so that instruction prefetching can be redirected to the appropriate target address.

Wrong predictions due to a changing branch target accounted for a large percentage of the wrong predictions in our traces; in two of our traces, over half of the wrong predictions were due to a changing target. We analyzed what caused the branch target to change, and found that incorrect predictions made due to CALL/RETURN from subroutines were the major contributor to this problem. An important difference between changing history and a changing target is that we can eliminate many of the wrong guesses due to a changing target using a set of CALL/RETURN stacks. This design is detailed in Section 5.

3. Trace Tapes and Model Description

The traces used in this evaluation are taken from a workload running on an IBM RS/6000 [4] running AIX. The applications traced are taken from the SPEC version 1.0 benchmark suite [16]. Appendix B provides the details of each trace. The traces are taken from both Fortran and C source programs. Ten traces, 1 million instructions each, were used to run through our BHT model. We use full instruction traces: if only address traces were used, then only taken branches could be detected, and misses in the BHT on not-taken branches could not be modeled.

Our BHT model is designed with the following parameters: 1) the table is fully associative, and each entry contains the full branch address as the tag along with the full branch target address; 2) since we only use the history of the last taken execution of a branch in our prediction algorithm, no branch history bits are necessary in our implementation — if a branch is in the table then it must have been taken; 3) the BHT only keeps information for taken branches: if a valid entry is found in the BHT and the current execution of the branch results in a not-taken branch, then the BHT entry is invalidated, and when the branch target changes, the corresponding branch target address field is updated in the BHT at execution time; and 4) replacement is LRU. As discussed before, only in some cases can two bits of history be used effectively.

To analyze the performance of a BHT, we look at how the size of the BHT affects the prediction accuracy. Our results are broken down into the following five classes: 1) branches predicted correctly, 2) branches predicted incorrectly due to a change in target address, 3) branches predicted incorrectly due to incorrect history, 4) branches that miss the BHT and are not taken (miss correctly), and 5) branches that miss the BHT and are taken (miss incorrectly). A miss in a BHT should not be confused with a miss in a cache: a miss in a BHT is the correct result for branches that are not taken.

BHT sizes of 128, 256, 512, 1024, 2048 and 4096 entries were simulated. These sizes were chosen as reasonable sizes to implement in today's technology. It can be argued that a fully associative design is overkill (in practice a 2- or 4-way design would be more practical); our purpose is not to propose an optimal BHT associativity scheme. Instead, we show how the BHT performs and identify the action taken for each branch encountered on the trace. This breakdown is shown in the next section for each of the ten traces.
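The model parameters above can be captured in a small trace-driven sketch. The function below is our own illustrative reconstruction, not the authors' simulator; the trace format (branch address, taken flag, target address) and all names are assumptions:

```python
# Minimal trace-driven sketch of the BHT model described in Section 3:
# fully associative, LRU, taken branches only, invalidate on not-taken,
# update the stored target when it changes.
from collections import OrderedDict

def simulate_bht(trace, entries=4096):
    bht = OrderedDict()                      # branch addr -> predicted target, in LRU order
    counts = {"correct": 0, "wrong_target": 0, "wrong_history": 0,
              "miss_correct": 0, "miss_wrong": 0}
    for addr, taken, target in trace:
        if addr in bht:
            bht.move_to_end(addr)            # LRU touch on every lookup hit
            if not taken:
                counts["wrong_history"] += 1 # predicted taken, branch fell through
                del bht[addr]                # invalidate: only taken branches are kept
            elif bht[addr] == target:
                counts["correct"] += 1
            else:
                counts["wrong_target"] += 1
                bht[addr] = target           # update the stored target address
        else:
            if taken:
                counts["miss_wrong"] += 1    # first-time (or invalidated) taken branch
                bht[addr] = target
                if len(bht) > entries:
                    bht.popitem(last=False)  # evict the least recently used entry
            else:
                counts["miss_correct"] += 1  # a miss is the correct result here
    return counts

# Tiny example: a loop branch taken twice, then falling through once.
trace = [(0x130, True, 0x100), (0x130, True, 0x100), (0x130, False, 0x134)]
print(simulate_bht(trace))
```

The five counters correspond directly to the five result classes listed above.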

4. BHT Modeling Results

Figure 1 plots BHT accuracy (correct predictions as a percent of all branches) for each trace. This number includes not-taken branches that miss the BHT, which are implicitly correct predictions. The accuracy is better than 85% for most of the traces; the worst accuracy, 70-75%, occurs for Espresso. The poor performance on Espresso can be attributed to the large number of misses, which are a result of BHT invalidations due to wrong history predictions (a large number of branches are displaying an unpredictable behavior). GCC also exhibits poor performance for BHTs smaller than 1024 entries; this can be attributed to the large number of unique branches found in that benchmark.

[Figure 1. Percent Correct Predictions on each tape, plotted against the number of BHT entries (500 to 4500), with one curve per benchmark: tomcatv, nasa, matrix, eqntott, fpppp, spice, espresso, doduc, li and gcc.]

Increasing the size of the BHT past 512 entries does not improve the prediction accuracy significantly. From Table 1 we see that no trace contains more than 4096 unique branches; thus, increasing the BHT further will not change the prediction accuracy, and a 4096-entry BHT emulates an infinite BHT, capturing the maximum number of branches contained in any single trace of our suite. Our purpose here is not to propose the optimal BHT size; results will vary from workload to workload.

[Table 1. Number of Unique Branch Instructions in each of the ten traces; the counts range from 2 to 2,529.]
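The accuracy metric plotted in Figure 1 counts not-taken misses as implicitly correct, a detail worth making explicit. A one-line sketch, in our own formulation, with illustrative class counts:

```python
# Accuracy as plotted in Figure 1: correct predictions plus correct
# (not-taken) misses, as a percent of all branches encountered.
def bht_accuracy(counts):
    total = sum(counts.values())
    good = counts["correct"] + counts["miss_correct"]
    return 100.0 * good / total

# Illustrative values only, not measurements from the paper:
counts = {"correct": 700, "wrong_target": 60, "wrong_history": 90,
          "miss_correct": 100, "miss_wrong": 50}
print(round(bht_accuracy(counts), 1))   # prints "80.0"
```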

In Figure 2 we show the branch performance profile for each of the ten traces when run through a 4096-entry BHT. We select this breakdown so that we can identify where there exists potential for improvement. The goal is to decrease the number of wrong predictions. As previously stated, we can not expect to perform better across all workloads if we only focus on the history algorithm.

[Figure 2. BHT Performance Profile (4096 BHT entries): per-benchmark breakdown into wrong misses, correct misses, and wrong history.]

The number of incorrect misses encountered can not be reduced: we have already captured the maximum number of branches possible by using a sufficiently large BHT. Wrong misses are a result of first-time misses, either due to the first occurrence of the branch on the trace tape or a branch being taken after it had been invalidated in the BHT due to wrong history.

A point that should be stressed is that when the BHT predicts that a branch will be taken and then it is not taken, the BHT will perform worse than the non-BHT case: the BHT will have prefetched down the wrong taken path, while the non-BHT design will have prefetched the next sequential instruction address. For a wrong miss, the BHT will perform the same as the non-BHT case, since both designs take the same action. As a result, we focus on decreasing the number of incorrect predictions.

For each of the ten traces we have plotted the breakdown of all wrong predictions in Figure 3. This illustrates the potential for improvement. In six out of ten of the traces there exists potential for reducing the number of wrong branch-target predictions. The other four traces contain very few wrong predictions; we must be sure not to reduce their existing accuracy with any algorithm used to optimize the other six traces.

[Figure 3. BHT Wrong Prediction Profile (4096 BHT entries) for the ten benchmarks.]

To reduce the number of wrong predictions due to changing targets, we need to better understand the underlying structure that causes these targets to change. Subroutine RETURNs constitute the largest percentage of wrong targets. In the next section, we look at this problem and propose a solution that reduces the number of incorrect predictions due to an incorrect target.

5. CALL/RETURN Stack Pair Implementation

The program SAMPLE provided in Figure 4 illustrates the problem. From address 100 we branch to the subroutine PRINT at address 500. We then enter an entry in the BHT for address 100 with a target of 500. On the RETURN from PRINT we enter the address 600 in the BHT with a target of 110. On instruction 130 we again branch to PRINT, entering into the BHT the address 130 with a target of 500. When we get to the end of the subroutine PRINT, we will hit on a valid entry in the BHT (address = 600) and predict that we will return to address 110. Of course, this is not correct: the next instruction to be executed is at address 140.

    PROGRAM SAMPLE

    Instruction Address
    100     CALL @PRINT
    110     ...
    120     ...
    130     CALL @PRINT
    140     ...

    PRINT
    500     ...
    510     ...
    600     RETURN

Figure 4. Sample Program

How can we detect this behavior and supply the correct address before an incorrect prediction is made? By using a set of stacks, we can identify when a subroutine CALL is made and then supply the correct target address when the RETURN is encountered [17].

Figure 5 shows a picture of the two-stack design. In addition to the two stacks, each BHT entry must be augmented with a "subroutine return" bit (SR) that is used to indicate that this is a special entry.

[Figure 5. Call/Return Stack Design: the processor, the BHT, and the stack pair S1/S2, connected by paths 1 through 9.]

We will use the programming example in Figure 4 to describe the design. The algorithm works as follows:

1. When the CALL at address 100 is executed, the addresses 100 and 500 are sent along path 1 to create an entry with the branch address set to 100 and the branch target set to 500. Also, the target address of the CALL (500) is pushed onto the S1 stack via path 3, and the return address (110, the next sequential instruction) is pushed onto the S2 stack via path 6.

2. When the RETURN is executed at 600 with target address 110, the addresses 600 and 110 are sent to the BHT along path 1, as branch address and branch target respectively. In parallel, 110 is sent to stack S2 along path 6 to see if S2 has an entry for it. In this case, S2 does have an entry, and its corresponding entry in S1 (in this case 500) is found via path 8 and sent along path 4, where it replaces the target address on path 1. The entry in the BHT then has a branch address of 600 and a target address of 500. A bit in the BHT entry (the SR bit) is turned on to denote that this is a special entry.

3. When the CALL at address 130 is executed, the branch address (130) and target (500) are sent to the BHT via path 1. Also, the target address of the CALL (500) is pushed onto stack S1 on path 3, and the return address (140) is pushed onto stack S2 on path 6.

4. When instruction address 600 is later prefetched, the BHT will find an entry for address 600 (the RETURN), and that entry will have the SR bit turned on. Since the special bit is on, the target of the entry (500) is sent to stack S1 on path 5 to see if S1 has an entry for 500. In this case, S1 does, and so the corresponding entry in S2 (140) is identified via path 9 and is put on path 7, where it replaces the target field on path 2. (All entries on the stack are compared in parallel; in the event of a tie, the most recent entry is chosen.)

5. The prediction that is made for address 600 will then have the target 140. The stack provides the correct return address instead of the historical address.

To show how much savings can be expected from this implementation, we have run all ten traces through our 4096-entry BHT model, adding an SR bit for each entry and a pair of CALL/RETURN stacks. Stack depths of 5 and 10 were simulated. Table 2 lists the results. By adding a 5-entry pair of stacks we can reduce the percentage of wrong predictions by up to 4.6% (an improvement of 31.5%). We have found that for many of the benchmarks, a BHT containing more than 128 entries together with a pair of stacks will predict more branches correctly than a stackless BHT with 4096 entries (i.e., Li run with a 128-entry BHT and a pair of 5-entry stacks showed a 33% improvement over a stackless 4096-entry BHT).
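The five steps above can be replayed in a simplified functional model. The class below is our own software sketch of the stack pair and SR bit, following the path numbering of Figure 5 but ignoring hardware timing; the method names are assumptions. It reproduces the SAMPLE program of Figure 4:

```python
# Simplified functional sketch of the paired CALL/RETURN stacks of Section 5.
class StackPairBHT:
    def __init__(self, depth=5):
        self.bht = {}        # branch addr -> (target, SR bit)
        self.s1 = []         # CALL targets (subroutine entry points), path 3
        self.s2 = []         # matching return addresses, path 6
        self.depth = depth

    def execute_call(self, call_addr, callee, return_addr):
        self.bht[call_addr] = (callee, False)      # path 1 entry
        self.s1.append(callee)
        self.s2.append(return_addr)
        if len(self.s1) > self.depth:              # bounded stack depth
            self.s1.pop(0); self.s2.pop(0)

    def execute_return(self, ret_addr, target):
        # If the return target matches a pending return address in S2 (path 6),
        # store the matching subroutine entry from S1 instead (path 8/4),
        # and set the SR bit on the new BHT entry.
        if target in self.s2:
            i = max(j for j, t in enumerate(self.s2) if t == target)  # most recent on a tie
            self.bht[ret_addr] = (self.s1[i], True)
            del self.s1[i]; del self.s2[i]
        else:
            self.bht[ret_addr] = (target, False)

    def predict(self, addr):
        if addr not in self.bht:
            return None
        target, sr = self.bht[addr]
        if sr and target in self.s1:               # SR entry: consult the stacks (path 5/9/7)
            i = max(j for j, t in enumerate(self.s1) if t == target)
            return self.s2[i]                      # predicted return address
        return target                              # ordinary historical target

# Replay PROGRAM SAMPLE from Figure 4:
m = StackPairBHT()
m.execute_call(0x100, 0x500, 0x110)                # step 1
m.execute_return(0x600, 0x110)                     # step 2: SR bit set, target 0x500 kept
m.execute_call(0x130, 0x500, 0x140)                # step 3
print(hex(m.predict(0x600)))                       # steps 4-5: predicts 0x140, not 0x110
```

Keeping the subroutine entry point (rather than the return address) in the SR-marked BHT entry is what lets the prediction be recomputed from whichever CALL is currently pending, which is the point of the two-stack pairing.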

[Table 2. Percent of Wrong Predictions (includes all predictions) for tomcatv, nasa, matrix, eqntott, fpppp, spice, espresso, doduc, li and gcc, with no stack, a 5-entry stack pair, and a 10-entry stack pair.]

Doubling the stack depth to 10 entries only showed a difference in the Li and GCC benchmarks. Another encouraging result is that the performance of the four benchmarks that were previously generating good results was not adversely affected by the introduction of the stacks.

Conclusions

To optimize processor performance, the instruction pipeline must be kept full with valid instructions. Much effort has been spent on reducing or eliminating the latencies that arise in the presence of taken branches. Branch History Tables have been shown to be an effective approach to reducing these latencies, and BHTs are especially attractive since they attempt to redirect prefetching as early as possible.

Previous studies have focused on the prediction algorithm, trying to optimize it by saving different amounts of history. In our study, we have characterized BHT behavior while concentrating on those aspects of the design that reduce the number of incorrect predictions, and that are less dependent on workload characteristics than some previously proposed history-based algorithms. We have also presented a robust BHT design that is based on the underlying programming structures that exist in today's applications. We have shown how to reduce the number of wrong predictions made by 18.2% on average, by using a pair of CALL/RETURN stacks. The results have been found to be consistently as good, if not considerably better, across all of the traces in our study.

Acknowledgements

The authors would like to thank Juho Tung for providing the tracing tool used to produce the traces, and Herbert Freeman for his support of this research.

References

1. Kaeli D.R., Kirkpatrick S., Ong S., "PC Workload Instruction Characterization and Performance", Proceedings of the ACM Sigmetrics Symposium '89, May 1989, p. 220.
2. Adams T., Zimmerman R., "An Analysis of 8086 Instruction Set Usage in MS DOS Programs", Proc. Third Symposium on Architectural Support for Programming Languages and Operating Systems, Boston, 1989, pp. 152-161.
3. Clark D.W., Levy H., "Measurement and Analysis of Instruction Use in the VAX 11/780", Proc. Ninth Symposium on Computer Architecture, Austin, Tx., April 1982, pp. 9-17.
4. Grohoski G.F., "Machine Organization of the IBM RISC System/6000 Processor", IBM Journal of Research and Development, Vol. 34, No. 1, Jan. 1990, pp. 37-58.
5. Patterson D.A., "Reduced Instruction Set Computers", Communications of the ACM, Vol. 28, No. 1, Jan. 1985, pp. 8-21.
6. Ditzel D.R., McLellan H.R., "Branch Folding in the CRISP Microprocessor: Reducing Branch Delay to Zero", Proc. 14th Ann. Symposium on Computer Architecture, 1987, pp. 2-9.
7. Anderson D.W., et al., "The IBM System/360 Model 91: Machine Philosophy and Instruction Handling", IBM Journal of Research and Development, Jan. 1967, pp. 8-24.
8. Campbell R., "Compiling C for the Reduced Instruction Set Computer", Master's report, EECS, U.C. Berkeley 94720, Dec. 1980.
9. Smith J.E., "A Study of Branch Prediction Strategies", Proc. Eighth Symposium on Computer Architecture, Minneapolis, May 1981, pp. 135-148.
10. Lee J., Smith A.J., "Branch Prediction Strategies and Branch Target Buffer Design", Computer, Vol. 17, No. 1, Jan. 1984, pp. 6-22.
11. Holgate R.W., Ibbett R.N., "An Analysis of Instruction Fetching Strategies in Pipelined Computers", IEEE Trans. on Computers, Vol. C-29, No. 4, April 1980, pp. 325-329.
12. Hughes J.F., Liptay J.S., Rymarczyk J.W., Stone S.E., "Multi-Instruction Stream Branch Processing Mechanism", U.S. Patent 4,200,927, Apr. 29, 1980.
13. Lilja D.J., "Reducing the Branch Penalty in Pipelined Processors", IEEE Computer Magazine, July 1988, pp. 47-55.
14. Pomerene J.H., Puzak T.R., Rechtschaffen R.N., Rosenfeld P.L., Sparacio F.J., "Pageable Branch History Table", U.S. Patent 4,679,141, Jul. 7, 1987.
15. Smith A.J., "Cache Memories", Computing Surveys, Vol. 14, No. 3, Sept. 1982, pp. 473-530.
16. SPEC Quarterly Newsletter, System Performance Evaluation Cooperative, 1st Quarter 1990.
17. Webb C.F., "Subroutine Call/Return Stack", IBM Technical Disclosure Bulletin, Vol. 30, No. 11, April 1988.

Appendix A
Branch History Table Description

The purpose of the Branch History Table is to redirect prefetching such that when a branch is executed in the processor, the target instruction stream has already been fetched and is ready for execution. The example below is one possible implementation of a BHT. In our example, we assume that prefetching will fetch the next sequential address in the absence of any redirection.

A.1. Non-BHT Description

In the non-BHT description, prefetching is only redirected when the execute unit (E) detects a taken branch.

[Figure: non-BHT prefetching. Addresses flow from the next-sequential path to memory; the execute unit (E) redirects the prefetch stream only on a taken branch.]

A.2. BHT Description

In the BHT description, each prefetch constitutes a BHT lookup. The IPFAR (Instruction Prefetch Address Register) is compared against the branch address (BA) of each entry. If a match is found, this constitutes a BHT hit, and the corresponding target address (TA) is fed to the select logic; prefetching is redirected to the new address.

[Figure: BHT prefetching. The IPFAR is compared against the branch addresses (BA) in the table; on a hit, the target address (TA) drives the select logic and redirects the next-sequential prefetch stream.]

In the case that the BHT prediction is wrong, the execute unit (E) will signal the branch error handling logic to redirect prefetching to the new target address or to the fall-through address (just as in the case of the non-BHT description). If the instruction was neither detected in the BHT nor signaled as wrong by the execute unit, prefetching continues with the next sequential address. When a first-time taken branch is encountered, the BHT is updated by the branch error handling logic. Similarly, when a wrong prediction is made by the BHT, the branch error handling logic will invalidate the corresponding BHT entry.

Appendix B
Trace Tape Description

The benchmarks used in this study were taken from the version 1 release of the SPEC Benchmark Suite. The traces are taken at an offset of 1 million instructions into the execution of each application. Each trace is 1 million instructions in length.

1. Espresso - One of a collection of tools for the generation and optimization of Programmable Logic Arrays. This is an integer benchmark written in C; our trace used tial.in as input.
2. Spice - A general purpose circuit simulation program written in Fortran.
3. Nasa - A collection of 7 floating point kernels, comprised of 2,200 lines of Fortran.
4. Li - A Lisp interpreter written in C, solving the 8-queens problem.
5. Doduc - A Monte Carlo simulation written in Fortran.
6. Eqntott - An integer-intensive benchmark written in C, translating a logical representation of a boolean equation to a truth table.
7. Fpppp - A double precision quantum chemistry benchmark written in Fortran.
8. Matrix - A Fortran benchmark involving scalar and vectorized matrix operations.
9. Tomcatv - A double precision vectorized mesh generation Fortran program.
10. GCC - The GNU C compiler, compiling preprocessed source files into optimized assembly language. There are 109,000 lines of C.
