of Full and Partial Predicated Execution Support for ILP Processors

1 downloads 0 Views 1MB Size Report
for ILP. Processors. Scott A. Mahlke*. Richard E. Hank. James E. McCormick. David I. August. Wen-mei W. Hwu. Center for Reliable and High-Performance.
A Comparkon

Scott A. Mahlke*

of Full

and Partial Predicated for ILP Processors

Richard E. Hank

James E. McCormick

David I. August

Center for Reliable and High-Performance University of Illinois Urbana-Champaign,

Support

Wen-mei W. Hwu

Computing

IL 61801

in conjunction with speculative execution is typically utilized by the compiler andlor hardware to remove control dependence and expose ILP in superscalar and VLIW processors [1] [2] [3]. However, misprediction of these branches can result in severe performance penalties. Recent studies have reported a performance reduction of two to more than ten when realistic instead of perfect branch prediction is utilized [4] [5] [6]. The second limitation is that processor resources to handle branches are often restricted. As a result, for control intensive applications, an artificial upper bound on performance will be imposed by the branch resource constraints. For example, in an instruction stream consisting of 40% branches, a four issue processor capable of processing only one branch per cycle is bounded to a maximum of 2.5 sustained instructions per cycle.

Abstract One can effectively utilize predicated execution to improve branch handling in instruction-level parallel processors. Although the potential benefits of predicated execution are high, the tradeoffs involved in the design of an instruction set to support predicated execution can be difficult. On one end of the design spectrum, architectural support for full predicated execution requires increasing the number of source operands for all instructions. Full predicate support provides for the most flexibility and the largest potential performance improvements. On the other end, partial predicated execution support, such as conditional moves, requires very little change to existing architectures. This paper presents a preliminary study to qualitatively and quantitatively address the benefit of full and partial predicated execution support. With our current compiler technology, we show that the compiler can use both partial and full predication to achieve speedup in large control-intensive programs. Some details of the code generation techniques are shown to provide insight into the benefit of going from partial to full predication. Preliminary experimental results are very encouraging: partial predication provides an average of 33’% performance improvement for an 8-issue processor with no predicate support while full predication provides an additional 30~o improvement.

1

Execution

Predicated execution support provides an effective means to eliminate branches from an instruction stream. Predicated or guarded execution refers to the conditional execution of an instruction based on the value of a boolean source operand, referred to as the predicate [7] [8]. This architectural support allows the compiler to employ an z\corwersion algorithm to convert conditional branches into predicate defining instructions, and instructions along alternative paths of each branch into predicated instructions [9] [10] [11]. Predicated instructions are fetched regardless of their predicate value. Instructions whose predicate is true are executed normally. Conversely, instructions whose predicate is false are nullified, and thus are prevented from modifying the processor state.

Introduction

Predicated execution provides the opportunity to significantly improve branch handling in ILP processors. The most obvious benefit is that decreasing the number of branches reduces the need to sustain multiple branches per cycle. Therefore, the artificial performance bounds imposed by limited branch resources can be alleviated. Eliminating frequently mispredicted branches also leads to a substantial reduction in branch prediction misses [12]. As a result, the performance penalties associated with mispredictions of the eliminated branches are removed. Finally, predicated execution provides an efficient interface for the compiler to expose multiple execution paths to the hardware. Without compiler support, the cost of maintaining multiple execution paths in hardware grows exponentially.

Branch instructions are recognized as a major impediment to exploiting instruction-level parallelism (ILP). ILP is limited by branches in two principle ways. First, branches impose control dependence which restrict the number of independent instructions available each cycle. Branch prediction * Scott Mahlke is now with Hewlett Packard Laboratories, Palo Alto, CA. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copyin is by permission of the Association of Computing Machinery. ? o copy otherwise, or to republish, requires a fee and/or specific permission. ISCA ’95, Santa Margherita Ligure Italy @ 1995 ACM 0-89791 -698-0/95/0006... $3.50

Predicated execution may be supported by a range of architectural extensions. The most complete approach is full

138

predicate support. With this technique, all instructions are provided with an additional source operand to hold a predicate specifier. In this manner, every instruction may be a predicated. Additionally, a set of predicate defining opcodes are added to efficiently manipulate predicate values. This approach was most notably utilized in the Cydra 5 minisupercomputer [8] [13]. Full predicate execution support and the largest potential perprovides the most flexibility formance improvements. The other approach is to provide partial predicate support. With partial predicate support, a small number of instructions are provided which conditionally execute, such as a conditional move. As a result, partial predicate support minimizes the required changes to existing instruction set architectures (ISA’s) and data paths. This approach is most attractive for designers extending current ISA’s in an upward compatible manner.

options

for extending

ducing

guard

of subsequent

2

ISA

instructions

2.1

Extensions

of instruction extensions ecution

Suppression determines

stored

in a set of l-bit

tively

referred

predicate using

relying

support,

predicated

an equivalent

these sequences structions

executed

of instructions.

in an increase

and requires

efficient

However,

this

port

requires

can frequently

a chain

the performance dependent

parallel

with

same computation

that

with

of sequentially increase

of partial

predicate

computations

are typically

dition

is true,

the processor

more

conditional

moves

have

speculation a predicate

selectively always

employed being

for

is not

locations. state

Then,

example. Thus,

it improves

With since

full all

performance

each instruction

specifies

which

operation

If the value in the specified

predshould

predicate

is execui~ed normally;

the instruction

if

is suppressed.

An

one or

This

predicate

predicate

be

other

rather

operations. must

cate register

the

instruction This

it still

be suppressed

be available

during

for

regard-

ties up an

the complexsignaling

ciuring

whose unit

the

de-

corresponding

not issuedl.

value

is

stage.

an instruction

the

of

which

reduced

is that

and force exception

the execution

the

value

to be generated

method

is suppressed,

logic also

Since

must

modifies

schedules

stage

it reduces

register.

to this

is false is simply

of allowing

referenced

utilized.

predicate

the last pipeline Thus,

register

advantage

since

met hod may also increase

can

stage.

in the write-back

that

compact

bypass

until

instruction

code/issue

on that

an instruction

unit.

to be delayed

state

is useful

A drawback

ity of the register

storing

may

of

the use of an addi-

and a subsequent

more

code.

less of whether

instructions

speculation

enables

execution

if the con-

using

is

The values in

the

an instruction

based

predicated

performing and

is updated,

required

specifier. where

by first

(speculative)

the result (s) in some temporary

support,

Conditional

represented

unconditionally

latency

Finally,

is extensively

with

state

are collec-

The setting

operand

method

register

conditioned

instructions,

support

file.

whether

of processor This

between

the predicate

sup-

execution.

register

in this section.

the instruction

is O, or false,

any change

latency

support.

length.

modify

registers

set through

This

state.

of the pipeline.

values

predicate

path

allow

architec-

predicate

These

One way to perform the suppression of an instruction in hardware is to allow the instruction to execute and to dis-

of in-

of predicate

dependent

of

later

determine

is 1, or true,

of ex-

of the condition

should

file are associated

operand.

processor

the value

of registers

predicate

partial

the critical

modify register

Generation

number

full

on the use of speculative

the computation

than

a larger

execupartial

be performed

in the number

values for the partial and

only

must

In the second case, the computation

is highly

may

with

The result

registers.

instruction will

These

for suppression

of condition.

is discussed register

extensions.

as support

if an instruction

in the extended

pred-

on speculative

operations

sequence

results

to hold intermediate ture.

extensively

case, for an architecture

Predication

to as the predicate

these registers

source

and

Full

of Execution.

which

register

In the first

for

and expression

icate

values,

intro-

specifiers

[14].

set and micro-architecture

tional

icate

may be utilized

can be classified

ing unsupported tion.

For example,

The essence of predicated execution is the ability to suppress the modification of the processor state based upon some condition. There must be a way to express this condition and a way to express when the condition should a,ffect execution. Full predication cleanly supports this through a combination

the predicate

manipulating

ISA.

hold the predicate

Extensions

However, there are several fundamental performance limitations of partial predicate support that are overcome with include representfull predicate support. These difficulties instructions,

which

In this section, a set of extensions to the instruction set architecture for both full and partial predicate support are presented. The baseline architecture assumed is generic ILP processor (either VLIW or superscaku-) with in-order issue and register interlocking. A generic Ioadistore ISA is further assumed as the baseline ISA.

In this paper, the tradeoffs involved in supporting full Using and partial predicated execution are investigated. the compilation techniques proposed in this paper, partial predicate support enables the compiler to perform full ifconversion to eliminate branches and expose ILP. Therefore, the compiler may remove as many branches with partial predicate support as with full predicate support. By removing a large portion of the branches, branch handling is significantly improved for ILP processors with partial predicate support. The relatively few changes needed to add partial predicate support into an architecture make this approach extremely attractive for designers.

predicated

an existing

instructions

This

has the

to be allocated

of the predicate decode/issue,

at least be set in the previous

to

register the predi-

cycle.

This

The issues discussed in the paper are intended for both designers of new ISA’s, as well as those extending existing

dependence distance may also be larger for deeper or if bypass is not available for predicate registers.

ISA’s. With a new instruction set, the issue of supporting full or partial predicate support is clearly a choice that is available. Varying levels of partial predicate support provide

ing the dependence distance between definiticms and uses of predicates may adversely affect execution time by lengthening

139

the

schedule

for

predicated

code.

An

example

pipelines Increas-

of this

Pout P ..,.

‘o 1 0

0 0 1 1

Cydra assumed needed

These

Predicate

The

of the HPL

Playdoh

nal instruct

registers

This

The

is that

these predicate

registers.

is shown

assigns

The comparison than

of w-cl



(gt),

each destination The predicate



predicate

and of the input

1, write types

defines

condition.

have

destination

cution

format

predicate

of a

on the destination

the unconditional

(U),

complements.

of three

Table

1 contains

destination

comparison.

If the value

parison

is placed

for ~).

Otherwise,

to the

ually

of the

instructions,

control

OR

predicates

type

block

can be enabled

predicates table

registers

ure

AND

(&&)

predicate

and OR (11) constructs registers

comparison icate

register

must

be explicitly

used.

However,

predicate

are set if P,n

is 1 (O for ~), is unchanged.

defines

initialized

into

only

after may

they

of a

me initialized

be issued

multiple

simultaneously

to individually

set each predicate

the

use of the

instructions,

in some

can be costly.

and pred_set,

aforemen-

cases individTherefore,

are defined

predicate

two

to provide

register

file to O or

1 contains

of predicated [10],

code

in

is initiated

all predicate branches

ions.

can be true

for

if the entire

short

code

are cleared.

pl

to

be true. is not

boolean is true.

order

in

The

first

two

to two

pred-eq

If p2 in the first executed.

evaluation. The

transuse of

type since either

p f is OR

pred-eq

expression

The

into

il-

source in Fig-

is then

1(c).

me translated

circuit

The

by a pred-clear

register

is false the second

example

the code shown the

Figure

registers

in (b)

Predicate

with

into

if-conversion shown

a simple execution.

This

“then”

part

of the

that

both

is not executed.

This

is performed.

“else”

part

Notice

of the outer

of i is performed

if statement.

p4 and

p5 remain

is consistent Finally,

with

the incre-

unconditionally.

2.2

Extensions

for

Partial

Predication

and

type

and in any

Enhancing tion

140

is

p3 is true

pred-

predicates OR

for transfor-

[16].

instruction

ment

are defined

reduction

the entire

at zero if the predme

of the

type

is unchanged.

useful

on p3 for this reason. The the addition or subtraction

is 1 and the result OR

height

pred-

are cleared

outer if statement is predicated pred_ne simply decides whether

the

they

multiple

is O (1 for AND),

register

is particularly

Figure

registers

consistent

such as logical

that

OR

an exe-

using

registers

predicate

1(a) is compiled the

that

pred_eq

register.

to O before

un-

logical

to the OR type

predicate

each predicate

Using

condition

of the com-

the destination

define

the concept

1(b).

instruct

of the

execution

since the

to compute height

are analogous

or one through

pred.clear

conditional

are always

in C. OR type destination

otherwise Note

when

is true

as a wired

can be utilized

it is possible zero

Example.

predicate

for these

(or its compliment

conditions,

This

zero dependence

predicate

of setting

formed

and

for blocks which are i.e., they ha~e a single

are useful

by multiple

as-

1 in one cycle.

dependence.

The

(b)

a 1 or leaves the register

implementation

destination

setting

a method

useful,

to the predicate

Unconditional predicates are utilized executed based on a single condition,

code,

if-conversion.

register.

writes

destination

predicate

of 34 = 81 possible

of P,~ is 1, the result

a O is written

(a) source

code after

instructions.

type

to

Code

the truth

register

(c)

of predication,

such as control

lustrating

of P,n and the result

in the predicate

(pi)

(p4) (p5)

i,i,l

predicates,

the

AND

It can write

type

predicate

of the value

(P2)

k,k,l

is 1 and the result of the comparison

insure

regardless

k,k,l L3

either

type

Although

actions

are particularly

and AND

k,k,l k,k,l i,i,l

if Pi.

register

types.

Unconditional defined,

OR,

add sub add

AND

for

For each com-

predicate.

types which

c,0,L2

with

code in Figure

are six predicate

bne

property

define

mations

instructions

the result

one

A total

jjj,l (P3) p4U ,p5V,c,0

icate.

The

(ne),

is specified

Pi..

and P,n,

O, or leave it unchanged.

predicate

This

type

tioned

upon

add predme

allows

condition

otherwise

accord-

the value written

based

result

src2 (Pin)

by .

defining

predicate,

L3

on the same predicate which

exist.

There their

order

by P,~.

register

jump

(c) assembly

changed

(eq), not equal



determines

destination

may be performed

specified

Predicate

as specified

of comparison

code,

conventional

and Pout2

src2

A predicate

comparison bination

1: Example

predicate

srcl,

to Poutl

and

predicate.

are also predicated,

Figure sembly

AND

can be: equal

etc.

define

OR type

The instruction

values

p10R,p3~,b,0

(b)

in the origi-

their

Pout2,

pred.eq

M those

below.

Pout I,

ing to a comparison

with

that

PI OR ,p2~,a,0

j,j,l

(a)

is a predicate

opcode

and

pred_eq

add

used are those

There

difference

are predicate

instruction

greater

semantics [15].

pred_clear

is

registers.

major registers

crop>

is also

conditional

be classified

a,O,Ll b,O,Ll

add jump L2: sub L3: add

by the

stage

predicate

for each comparison

define

pred.
=,N COMCLR,= DEPI,TR DEPI $pred-O

%r3,2,$pred-O %r0,%r24,%r0 1,3,1,%3 1,1,1,%3

EXTRU,EV ADDI

%r3,3,1,770r0 1,%r25,%r25

COMCLR,=

%r8,%r0,%r0

COPY

%r6,%r5

COMCLR,= OR,TR

96r8,%r0,%r0

COPY

%r7,%r5

%r6,%r5

non-excepting

support

specula-

techniques [20].

to as Superblock

2 1

2 bit counter

7100.

ILP

eq ge

r18,0,r13 r19,r4,r15

issue cycle

segment

predic-

The instruction

to fully

o

eq

blocks.

allocate

does not support

of all instructions Superblock

64 byte

to have an infinite

processor

execution.

versions

utilized

BTB

penalty.

is assumed

The baseline

or silent

entry

are those of the HP PA-RISC

processor

registers.

with

T

r13,rl,–2 r17,r13,10

for scheduling

hyperblock

loop

r16,rl,r2

id_c

or consists

no write

The dynamic

is a lK

a 2 cycle misprediction

cies assumed

with

assumed

for branches.

perfect

cache and a 64K

data cache; both

of 12 cycles.

6: Example

except

as either

cache is write-through

has a miss penalty tion

each cycle,

is specified

mapped

4–issue, 1-branch

(b) fully predicated

Figure which

or–kee

3

cycle

Note:

The memory

iter3

1

6 cycles

tions

iter2

3

pl ~R,rl,r22

jump st_c st_c st_c add add jumb

12

.ge

2 2 1

2

pred_ge

+

12

(a) superblock

iter 1

iter3

11

r6,rl ,r2 r3,rl,–3 r7,r3, 10 r8,0,r3 r9,r4,r5

o 0

eq eq ge

R_c

ld_c

pred_ge

0 10

14 cycles

o

pred~e

>

12

Jump

pl

iter2

9

bge

pred_clear

T

2

bge

iter2

iter3

bge

The

in all graphs

Figure 7: HP PA-RISC

are

emulation

of predicate

support.

baseparticular,

and

speedup

for a l-issue

tables.

processor

is calculated

baseline

processor

of the specified

by dividing

the cycle count

by the cycle count

of a k-issue

model.

For partial predicate support, the baseline processor is extended to support conditional move instructions. Note that since non-excepting the

more

efficient

for partial support

versions

of all instructions

conversions

are applied

predication processor

(Section

is referred

3.2).

The

4.2

are available, by the partial

to as Conditional

compiler

Figure

predicate Move.

to as Full

compilation

Predication.

techniques

models

is compared

ticular

processor

For this

are applied. by reporting

model

versus

the the

model,

Performance speedup

The

baseline

of the

processor.

In

relative move,

processor. benchmark

Full with

superblock. 1 Speedup tween superblock and

3

par-

the

conditional

8, l-branch in every

hyperblock of the

8 shows

perblock,

final model is the baseline processor extended to support full predication as described in Section 2.1. This model is referred

Results

except

072.sc

1Averages

146

which

reported

performance predication

an average

performed

by su-

for an issue-

performed speedup

with conditional full predication

refer

achieved

and full predication

the best

of 63%

over

move code fell befor all benchmarks

slightly

below

to the arithmetic

mean.

superblock

8.@2 703 6.IM 5.03 4.CQ 3.(s3 2.@2 I.ca



Figure



Suwhkxk

8: Effectiveness

for an 8-issue,

of full

l-branch



Condwonal Move

and partial

processor

with



Full Prediction

predicate perfect

support

Figure

caches.



Sunerblock

10: Effectiveness

for an 4-issue,

Cditicml

of full

l-branch



Move

F.11 F’rdication

and partial

processor

with

predicate

perfect

support

caches.

8.iXl 7.(-0 6.(XI 5.cm 4.03 3.02 Z.ca 1m

I 1

LI



Figure



Supblock

9: Effectiveness

for an 8-issue,

of full

2-branch



Conditional Move

and partial

processor

with



Full Fwd&tion

predicate perfect

support

Figure

caches.

code’s performance. marily

the conditional conditional

The

substantial. reported niques

small

moves.

However,

Full

consistent

predication

significant changes

with

gain

full

of representing

of

certain

over

ss~o

move

code

the

predicated

Figure

that

provided

there

by the

instructions,

the

trend

predication move

ISA

reduced

dition

move

by creating

many

the 8 issue machine,

improvement.

Increasing cycle

provides

full predicate

issue rate

interesting

The performance

insight

to improving

hyperblock

from

1 to 2 branches

into

the

effectiveness

superblock the

of conditional

of

full

predication,

performance

This

of superblock.

These move

results

substantial

To evaluate execution,

The

methods

147

the cache

Figure

in the

must

with

with

conservative

processor.

gain

by

creat-

to be employed

performance

full

are absorbed

a 4-issue

with

con-

compensate

it would

a more

needs

full

in the majority

predication

indicate

effects

while

conditional

the 4 issue machine

is still

for For

possi-

an average of 33% speedup associated

11 is presented.

were affected

is that

than

on a 4 The most

code,

instructions

algorithm model

benchmarks

the compiler

saturate

ble for the 4-issue processor, over superblock.

is at-

only

performed

per cycle.

superblock for

instructions

but

formation

conditional

than

These extra

results.

the

move code and

code is reduced.

support.

per

Figure 9 shows the performance result that can execute 2 branches per cycle.

improvement

against

ing poor

a gain of 30% over the

was observed.

the branch

predicated execution. of an 8-issue processor

tributal

On average,

move model

pred-

performed

predication

superblock

support

more

per-

formance

worse

code is limited,

fines provided conditional

full

benchmarks

beats

Since

predicate

the additional

move

of the

across these

code performs

of benchmarks.

does not noticeably

can issue 1 branch

consistently

the effi-

while

of

There-

move and full

conditional

performance

that

and de-

with

and

superblock.

10 shows

noticeable

is

formation.

of branches

superblock

than

issue processor

In particular,

support

support

instruction

code has had many

of conditional

On average,

dependence heights to represent predicated instructions, the ability to simultaneously execute OR type predicate full predicate

predicate

64K

hyperblock

the number

than

35% faster

tech-

gain on top of

illustrates

code.

3% faster

move transforma-

performance

possible

with

predication

with

the performance

ication

have

removed

increasing

improve

improvements.

predication.

and partial

processor

move and full

branches

fore,

su-

cases with

the hyperblock

This

Full kdicati,m

caches.

conditional

is very

developers special

utilizing

model.

performance

though,

the conditional

also achieved

to support

caused by

On average,

and product

performance

move

lengths

conditional for

of full

l-branch



COnditiond Mow

of 072. sc was pri-

chain

a speedup

for

except

in conjunction yields

had

researchers

gains

the conditional

ciency

code

speedup

Most

conditional

behavior

dependence

move transformations. move

perblock.

tions

The unusual

due to increased

11: Effectiveness

for an 8-issue, data



S.perblcck

by a realistic

with

As expected cache model.

predicated all

three

However,

Table

2: Dynamic

two benchmarks duced memory using

move traffic

often

tional

proportionally instruction

move’s

instruction

larger

instruction

icated

code

still

predication

conditional The respect cation

Superblock over

full

structions overall

into

code’s

code

having

combined

from

full

predication as shown and fully

in the superblock tial predication predicted With

had only predicated

code.

Much

branches

cost of branch

incur

prediction,

the issue rate

The

taken

goal of the

of dynamic

branches.

this

number

of mispre-

technique.

Remarks

benefit

In

general,

both

number full

predication

evaluation,

less reliance

instructions

executed. enable

As shown full

we show that

sors using Results

based

remains

substantially

with

by

and

not

low

tional

instruction.

marks. in

predication The

support

issue processor support.

superior

perfor-

illustrate

in ILP

illustrate

proces-

predicate

This

contrasts

extra

dynamic

the

issue resources

degradation

Nevertheless,

the

support

even in the presence

where

the processor

the ad-

Full predication

overhead.

support

able performance

extension

Full

handling

is due to its efficient

instruction

move

over utilize

superblock,

of branches

This

encouraging

move.

predication branch

on a four

Conditional

issue rate.

is very gain.

30% with

execution.

of full predication

how-

about

processor

architectural

over conditional

of improving

predicated

allows

base ILP

speedup

results,

more robust

up to one branch

move

small

and partial

predicate and fewer

configurations.

executes

performance

30% gain

gains of full

efficient

to provide

a

stream.

execution,

of processor

This

to remove

in our quantitative

conditional

significant

another

mance

more

that

a relatively

illuspred-

compiler

predication

support. that

paper

and full

the instruction

gain over an aggressive

shows

offers

the from

allows

in this partial

on speculative

gain in a variety

the importance

hidden

allow

of branches

vantage

hardest;

presented of both

substantial

with

condi-

instructions

and result

for the majority

of a low

evaluation

in a siz-

of the bench-

performance

gain

for

and par-

ever,

code,

performance there

and

predicated with

increase code fully

have

Mis-

rate

mispre-

does indicate when

that

exploiting

a compiler conditional

of the results

2-cycle

branch

presented

prediction

show

conservative

performance

Therefore,

fewer

tion.

For machines

with

the available

ILP.

close to the

predicated

code often

ties, we expect to be much vanced

148

compiler

larger

the benefits

more

in this

be extremely on low

paper

miss penalty.

is less

very

must move

issue

processors.

All

of the high

for branches

this

intelligent

penalty.

are fewer because

instructions.

Partially

of full

of branches.

the issue rate

in the code can greatly of branches,

branch.

less than

branches

fully

is heavily

of the speedup

architectures,

for other

which

code is substantially

a significant

in the

in many

superblock

transformation

two of the benchmarks suggests that conditional move could be a valuable feature even in a low issue rate processor. How-

comes from the elimination

Also,

same number

dynamic

7%

strategy

the qualitative

can provided

move code had an

3, the number

code generation

and

con-

the

than

The

and full

unlikely

the total

with

The

no predication

over su-

associated

instructions

As a result,

predication

code normally

Conditional

dynamic in Table

branches

fewer

dictions.

executing

of the

combining

the number

Concluding

performance

with

Therefore

is hit

that

3.

move

branch typically causes more mispreof the mispredictions caused by the

5

per cycle,

and

in-

full

in Table

transformation,

For an eight issue processor

instruction

indicate.

count

the inefficiencies

support.

of 46% more

Finally, partially

instruction

with

full predicate

average

between

than

to a single

be increased

performance

with

speculated

paths.

count

can vary as the results dynamic

ever, since it suffers branches

executed

may

these benefits

2. Full predi-

dynamic

by unnecessary

frequently

of

pred-

of an if-then-else

can increase

branches.

However,

by the dy-

count

this

is to reduce

original

grep

predicated

calls.

by the compiler

With

are combined

ication.

superblock.

instruction

paths

in instruction

and superblock

while

dynamic both

grep.

dictions

trates

re-

superblock

in Table

for hyperblocks for

substantial

effect

benchmarks

are larger

However, the combined dictions than the sum

due to conditional

than

is caused by a branch

transformation

move

of all

for

models

This

and fully

for the conditional

model.

branches

of condi-

faster

is observed

predicated

The differences

subroutine

predication

applied

significantly.

over superblock

is shown

predication

relation

move

zA~o

count

behavior

code.

for partially

of mispredictions

employed

instruc-

Conditional

than

ratios

of predicated

in the number

around

data

2. On average

results

predicated

code is also a result odd

difference

branches

in the misprediction An

execution

is evidenced

54% faster

model

as is executes

struct. count

than

good

code can increase

perblock

This

moves performing

to a processor

increased

is a side

small

re-

and superblock

This

The

of adding

calls in partially

number

models.

promoted

result.

count.

instruction

(1.16)

the performance

same.

performing

50M

code dropped

of egn in Table

yielded

dynamic

these

cache miss rate

count

(0.99) (0.99) (1.28) (1.29) (1.11) (1.02)

in all three

predication

the

the increased namic

misses,

full

922K 44M 1647K 46M 49M 1526K

speculative

Since

predication

while

(0.83) (1.04)

a few less. is a result

subroutine

comparison.

code

performing

an interesting

poorly

75M 3855K

just

of branches

(1.01) (0.99)

real cache significantly

predication

by

cache

and full

also exhibited

mained

The

promotion.

3603M 11073M

count

of 026. compress

and full

caused

move

performed

full

out.

more

predicate

tions Eqn

stand

having

Full Pred. 626M (1.28) 32M (1.04) 885M (0.86) 108M (1.20)

Move (1.66) (1.23) (1.19) (1.41) (1.12) (1.23) (0.93) (1.37) (1.53) (1.11) (1.92) (2.10) (1.61) (2.01) (1.53)

instruction

the performance

Conditional

Cond. 812M 38M 1230M 128M 4003M 13838M 85M 5077K 1422K 49M 2467K 75M 70M 2999K 66M

Superblk 489M 31M 103OM 90M 3574M 11225M 91M 3701K 932K 44M 1282K 36M 44M 1493K 43M

Benchmark 008.espresso 022.li 023.eqntott 026.compress 052. alvinn 056.ear 072.sc Cccp cmp eqn grep lex qsort Wc yacc

branch

of both

pronounced. optimization

gains

are based

This

for predicated prediction

techniques

when become

to

execu-

miss penal-

full and partial

Furthermore,

on a

was chosen

prediction more

ad-

available,

Benchmark 008.espresso 022.li 023.eqntott 026.compress 052.alvinn 056.ear 072.sc Cccp cmp eqn grep lex qsort Wc yacc

Table

3: Comparison

we expect

530K 7470K 663K 14M 8847K 478K 12M

of branch

the performance

to increase.

explore

the range of predication

statistics:

support

Conditional MP

38M 6169K 53M 9269K 74M 443M llM 537K 26K 4506K 171K 3070K 6092K 224K

T.lgyo

number

partial

BR

0.8370 8.2% 1.46% 1.6570 15.06V0 6.85% 4.31%

517K

We also feel it would

predication

MPR 4.55% 10.38% 13.47~o 10.9?% 0.24% 4.3% 5.49%

4395 1612K 9660 232K 1332K 33K

gain of both

ication

move and full

Superblock MP 3402K 774K 42M 1344K I091K 66M 1232K 66K

BR 75M 7457K 315M 12M 463M 1539M 22M 921K

2066K 694K 6732K 864K 896K 16M 1044K 65K 31 514K 20K 201K 597K 57 445K

5944K

of branches

(BR),

and full pred-

be interesting

between

The authors

would

for their

also wish

like to thank

effort

to extend

in helping thanks

to Mike

Kathail

at HP

Labs

Playdoh

model

of predicated

like to thank John

Robert

Ruttenberg

of conditional supported grant

for

Roger

their

insightful

Cohn

and Geoff

SGI

for

moves

and

selects.

by the National

vces, Hewlett-Packard,

Intel

their

Science

Lowney

This

Foundation

of the

and AT&T

(MP),

G. R. Beck, D. W. Yen, and T. L. Anderson, “The Cydra 5 minisupercomputer: Architecture and implementation,” The Journai of Supercomputing, vol. 7, pp. 143-180, January 1993.

[14]

D. N. Pnevmatikatos and G. S. Sohi, “Guarded execution and branch prediction in dynamic ILP processors,” in Proceedings of the 21st International Symposium on Computer Architecture, pp. 120-129, April 1994.

[15]

V. Kathail, M. S. Schlansker, and B. R. Rau, “HPL playdoh architecture specification: Version 1.0,” Tech. Rep. HPL93-80, Hewlett-Packard Laboratories, Palo Alto, CA 94303, February 1994.

strategies and pp. 6–22, Jan-

on Com-

[16]

M. Schlansker,

pp. 21 7–227,

V. Kathail,

“On predicated execution,” Packard Laboratories, Palo

December

and S. Anik,

Hank, and predicated of the 25th pp. 45-54,

1994.

“Height

reduction

of

control recurrences for ILP processors,” in Proceedings of the 27th International Symposium on J4icroarchitecture, pp. 40– 51, December 1994.

M. D. Smith, M. Johnson, and M. A. Horowitz, “Limits on multiple instruction issue,” in Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 290-302, April 1989.

[5] D. W. Wall, “Limits of instruction-level parallelism,” in Proceedings of the lth International Conference on Architectural Support for Programming Languages and Operating Systems, PP. 176–188, April 1991. [6]

R. A. Towle, “The IEEE Computer,

[13]

of dynamic branch [3] T. Y. Yeh and Y. N. Patt, “A comparison predictors that use two levels of branch history? in Proceed-

[4]

(MPR).

S. A. Mahlke, R. E. Hank, R. A. Bringmann, J. C. Gyllenhaal, D. M. Gallagher, and W. W. Hwu, “Characterizing the impact of predicated execution on branch prediction: in Proceedings of the 27th International Symposium on Mi-

GIS.

E. Smith, “A study of branch prediction strategies:’ in Proceedings of the 8th International Symposium on Computer Architecture, pp. 135–148, May 1981.

Symposium 1993.

rate

[12]

De-

[1]J.

ings of the 20th Annual International puter Architecture, pp. 257-266, May

and miss prediction

S. A. Mahlke, D. C. Lin, W. Y. Chen, FL E. R. A. Bringmann, “Effective compiler suppc,rt for “ in Proceedings execution using the hyperblock, International Sympostum on Micr-oarchitecture, December 1992.

References

prediction Computer,

_

[11]

croarchitecture,

“Branch [2] J. Lee and A. J. Smith, branch target buffer design,” IEEE uary 1984.

7.31%

J. C. Park and M. S. Schlansker, Tech. Rep. HPL-91-58, Hewlett Alto, CA, May 1991.

use

under

Micro

mispredictions

3.15~o 11.5% 13.57% 9.3870 1.38% 3.4% 8.26~o 12.15% 0.12!70 11.37% 11.73% 6.46% 10.06% .025%

[10]

and

has been

(NSF)

Advanced

5900K

6931K 867K 1032K 15M 934K 65K 31 511K 20K 196K 61OK 57 431K

MPR

B. R. Rau, D. W. L. Yen, W. Yen, and Cydra 5 departmental supercomputer,” vol. 22, pp. 12–35, January 1989.

we would on the

research

Microsystems

We

at DEC,

discussions

Corporation,

SUN

Finally,

7.48%

1039K 702K

[8]

and Vinod

discussions

33M 611OK 51M 9240K 74M 442M llM 534K 26K 4495K 171K 3030K 6066K 224K

[9] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren, “Conversion of control dependence to data dependence,” in Proceedings of the 10th ACM Symposium on Principles of Programming Languages, pp. 177-189, January 1983.

and Dan

together.

Schlansker

execution.

at

MIP-9308013,

Bringmann

put this paper

5.38~o 11.25~o 12.66~o 9.32% 1.23% 3.52% 9.19% 12.17% 0.12% 11.4% 11.7% 6.55% 9.79% .025%

Predication MP

P. Y. Hsu and E. S. Davidson, “Highly concurrent scalar processing,” in Proceedings of the 13th International Symposium on Computer Architecture, pp. 386–395, June 1986.

conditional

Acknowledgements

Lavery

IMl BR

[7]

to

support.

Move MPR

M. Butler, T. Yeh, Y. Patt, M. Alsup, H. Scales, and M. Shebanow, “Single instruction stream parallelism is greater than two,” in Proceedings of the 18th International Symposium on Computer Architecture, pp. 276-286, May 1991.

149

[17]

Hewlett-Packard Company, Architecture and Instruction

Cupertino, CA, PA-RISC 1.1 Set Reference Manual, 1990.

[18]

D. S. Blickstein et al., tem,” Digital Technical

[19]

trace scheduling comP. G. Lowney et al., “The Multiflow piler,” The Journal 0$ Supercomputing, vol. 7, pp. 51–142, January 1993.

[20]

An effective technique W. W. Hwu et al., “The Superblock: for VLIW and superscalar compilation,” The Journal of Supercomputing, vol. 7, pp. 229–248, January 1993.

“The GEM optimizing compiler sysJournal, vol. 4, pp. 121-136, 1992.

Suggest Documents