Boosting Trace Cache Performance with NonHead ...

Boosting Trace Cache Performance with NonHead Miss Speculation

Stevan Vlaović
Architecture and Advanced Development, Sun Microsystems
June 25, 2002

ICS 2002

Motivation
• Performance of new trace cache design (linked list vs. fixed length)
• How to improve on this design in the context of x86 ISA
• Not limited to x86; applicable to any design that uses a linked-list-style trace cache
• Especially helpful with longer pipelines

Trace Cache Designs
[Figure: basic blocks A to E; the academic trace cache [Rotenberg96][Patel97] holds A, C, D within the issue width, while the Pentium 4 trace cache [Hinton Patent97] holds A, C, D, E]
• Pentium 4 issue width = 3 µops, but TC allows long traces

NonHead Miss
[Figure: dynamic instruction stream (basic blocks A, B, ...) compared with the trace stored in the trace cache (A, C, D, E)]
• Basic block B is executed, not C
• Only find out in Execute stage!
• Only then look up trace that starts with basic block B

Outline
• Trace Analysis For X86 Interpretation (TAXI)
• Applications
• NonHead Miss Speculation
• Conclusions

Modern X86 Processors
• High performance ==> dynamic translation of x86 instructions into smaller, more RISC-like instructions (µops)
• Back end can be similar to other, more conventional processors
• Complexity is moved to the front end



Trace Analysis for X86 Interpretation [to appear in ICCD '02]
• Simulator
  – functional (state information, correctness)
  – performance (dependencies, timing, resource utilization)
• A performance simulator needs an x86 instruction to µop(s) mapping for each instruction
  – for our applications: 150 instructions, 9 different addressing modes

Example Decomposition

ADD addr, REG  ==>
  LW  Temp1, addr
  ADD Temp1, Temp1, REG
  SW  Temp1, addr
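The decomposition above can be sketched as a small mapping function. This is only an illustration of the idea, not TAXI's actual implementation: the `Uop` record, `decompose_add_mem_reg`, and `format_uop` are hypothetical names, and only the memory-destination ADD from the slide is handled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Uop:
    """One RISC-like micro-operation (hypothetical record, for illustration)."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def decompose_add_mem_reg(addr: str, reg: str, temp: str = "Temp1") -> List[Uop]:
    """Split `ADD addr, REG` into load, add, and store µops."""
    return [
        Uop("LW", temp, (addr,)),        # load the memory operand into a temporary
        Uop("ADD", temp, (temp, reg)),   # add the register operand
        Uop("SW", addr, (temp,)),        # store the result back to memory
    ]


def format_uop(u: Uop) -> str:
    """Render a µop in the slide's notation (value first, address last for SW)."""
    if u.op == "SW":
        return f"SW {u.srcs[0]}, {u.dest}"
    return f"{u.op} {u.dest}, {', '.join(u.srcs)}"


uops = decompose_add_mem_reg("addr", "REG")
lines = [format_uop(u) for u in uops]
```

Here `lines` holds the same three µops shown on the slide; a full mapping would need one such rule per instruction and addressing mode.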

TAXI Attributes
• Trace-based
  – No wrongpath execution
  ± Separate functional/performance simulations
• Bochs
  – limited device models (e.g. monitor)
  + captures OS activity
• sim-outorder (SimpleScalar)
  + easy to modify


Outline
• Trace Analysis For X86 Interpretation (TAXI)
• Applications
• NonHead Miss Speculation
• Conclusions and future work

Applications
• Id’s Doom
• Microsoft Explorer 5.0
• Filemaker Pro 5.0
• Microsoft Visual C++ 5.0
• Netscape Navigator 6.0
• Real System’s RealPlayer 8.0
• Winamp 2.72
• Winzip 8.0

Application Characteristics

Application   Trace Inst   µ/In   In/St   In/Ld   In/Br
Doom          388          1.52   4.61    2.64    5.40
Explorer      396          1.50   4.53    2.55    4.77
FileMaker     447          1.49   4.61    2.85    5.04
MsDev         469          1.46   4.94    2.63    4.07
Netscape      377          1.52   4.53    2.65    4.33
RealPlayer    522          1.44   5.44    2.76    4.63
Winamp        456          1.47   4.11    2.31    6.88
Winzip        302          1.44   5.74    3.05    5.63

Application Characteristics (vs. CPU2000)
• Worse instruction and data cache performance
  – larger applications
  – use much more OS support
  – 32 KB or 64 KB caches generally sufficient
• Generally worse branch prediction performance
  – more indirects
  – more frequent branches, except for streaming



Model Trace Cache (TC)
• Holds µops, not x86 instructions
• 12 K µops
• Provides µops that are physically noncontiguous
• Traces composed of segments (TC lines)
  – Each segment holds 6 µops [at most 2 branches/segment]
  – Possibly multiple segments per trace
  – First segment called head, last segment called tail
• Trace ends on the following conditions:
  – Indirect branch
  – Call
  – Return
• Allows up to 64 segments/trace
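The segment rules above can be sketched as a linked-list trace builder. This is a minimal sketch under stated assumptions, not the modeled hardware: the `Uop` and `Segment` classes and the `build_trace` and `segment_lengths` helpers are hypothetical, and every control-transfer µop (conditional branch, indirect branch, call, return) is assumed to count against the 2-branch limit of a segment.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

SEGMENT_UOPS = 6                 # µops per segment (from the slide)
MAX_BRANCHES_PER_SEGMENT = 2     # branch limit per segment (from the slide)
MAX_SEGMENTS_PER_TRACE = 64      # segment limit per trace (from the slide)
TRACE_ENDING_KINDS = {"indirect", "call", "return"}


@dataclass
class Uop:
    name: str
    kind: str = "alu"  # "alu", "branch", "indirect", "call", or "return"


@dataclass
class Segment:
    uops: List[Uop] = field(default_factory=list)
    next: Optional["Segment"] = None  # link to the following segment

    def branches(self) -> int:
        # Assumption: any non-"alu" µop counts as a branch.
        return sum(u.kind != "alu" for u in self.uops)

    def has_room_for(self, uop: Uop) -> bool:
        if len(self.uops) >= SEGMENT_UOPS:
            return False
        if uop.kind != "alu" and self.branches() >= MAX_BRANCHES_PER_SEGMENT:
            return False
        return True


def build_trace(uops: List[Uop]) -> Tuple[Segment, int]:
    """Pack µops into linked segments; return (head segment, µops consumed)."""
    head = tail = Segment()
    segments, consumed = 1, 0
    for uop in uops:
        if not tail.has_room_for(uop):
            if segments == MAX_SEGMENTS_PER_TRACE:
                break                      # trace is full
            tail.next = Segment()          # start a new segment
            tail = tail.next
            segments += 1
        tail.uops.append(uop)
        consumed += 1
        if uop.kind in TRACE_ENDING_KINDS:
            break                          # indirect branch, call, or return ends the trace
    return head, consumed


def segment_lengths(head: Segment) -> List[int]:
    """Walk the linked list from the head and report µops per segment."""
    lengths, seg = [], head
    while seg is not None:
        lengths.append(len(seg.uops))
        seg = seg.next
    return lengths


stream = [Uop(f"u{i}") for i in range(8)] + [Uop("ret", "return"), Uop("x"), Uop("y")]
head, consumed = build_trace(stream)
```

With this input the trace stops at the return: `consumed` is 9 and the trace has a 6-µop head followed by a 3-µop tail.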

Functional Diagram
[Figure: block diagram with L2 Cache, Streaming Buffer, Trace Cache, Bpred, BTB, DIQ, and Back end]

Model Parameters
• Front end
  – 12 K-µops, 4-way trace cache (512 entry BTB)
  – 1 cache line (64 bytes) Streaming Buffer
  – 4 K-entry GAg predictor (4 K-entry BTB)
  – 1 complex (>5 µops) decoder
• Back end
  – 8 entry Decoded Instruction Queue
  – 4 µop issue and commit width
  – 8 µop decode width
  – 126 ROB (RUU) entries
  – 8 KB, 4-way L1 data cache

Trace Cache Access Classification (8 cases)
[Flowchart: BTB Access (Hit with Correct Address / Hit with Wrong Address / BTB Miss) -> TC Head Lookup (Hit / Head Miss) -> More Blocks? (no: Exit; yes: TC Body Pointer gives Correct Address or Wrong Address, i.e. Nonhead Miss; get address of next block from execution pipeline)]

Trace Cache Access Classification
• BTB hit with correct address followed by a trace cache head hit
• BTB hit with correct address followed by a trace cache head miss
• BTB hit with wrong address followed by a trace cache head hit
• BTB hit with wrong address followed by a trace cache head miss
• Nonhead miss, followed by a trace cache head hit
• Nonhead miss, followed by a trace cache head miss
• BTB miss followed by a trace cache head hit
• BTB miss followed by a trace cache head miss
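The classification can be read as four ways of arriving at a head lookup, each followed by a head hit or a head miss. The sketch below enumerates the eight cases and tallies the baseline percentages reported in the deck's access breakdown. The `Entry` enum and `classify` function are hypothetical names; the labels follow the chart legend.

```python
from enum import Enum


class Entry(Enum):
    """How the access reached the trace cache head lookup."""
    BTB_HIT_CORRECT = "BTB Hit"
    BTB_HIT_WRONG = "BTB Hit Wrong"
    NONHEAD_MISS = "NH miss"
    BTB_MISS = "BTB Miss"


def classify(entry: Entry, head_hit: bool) -> str:
    """Return the chart-style label for one trace cache access."""
    return f"{entry.value}/H {'hit' if head_hit else 'miss'}"


ALL_CASES = [classify(e, h) for e in Entry for h in (True, False)]

# Baseline breakdown as reported in the slides (percent of all TC accesses).
BASELINE = {
    "BTB Hit/H hit": 74,
    "BTB Hit/H miss": 4,
    "BTB Hit Wrong/H hit": 4,
    "BTB Hit Wrong/H miss": 1,
    "NH miss/H hit": 14,
    "NH miss/H miss": 1,
    "BTB Miss/H hit": 1,
    "BTB Miss/H miss": 1,
}

nonhead_share = sum(v for k, v in BASELINE.items() if k.startswith("NH miss"))
```

The two nonhead-miss categories together account for 15% of accesses, which matches the figure quoted in the breakdown.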

Trace Cache Access Breakdown
[Pie chart: BTB Hit/H hit 74%, NH miss/H hit 14%, BTB Hit/H miss 4%, BTB Hit Wrong/H hit 4%, BTB Hit Wrong/H miss 1%, NH miss/H miss 1%, BTB Miss/H hit 1%, BTB Miss/H miss 1%]
• 74% of all TC accesses are BTB hits (w. correct addr.) followed by a Head hit
• 15% of all accesses are Nonhead misses
• BUT 14% of all accesses are Nonhead misses that then hit in the trace cache as the Head of a new Trace
• BUT Nonhead misses discovered at Execute (avg. 17 cycles after TC access)!

Optimal nonhead miss handling (discover wrong address immediately, 17 -> 0 cycles)
[Bar chart: CPµ (axis 0.8 to 2) for Base and Opt across Doom, Explorer, FileMaker, MsDev, Netscape, RealVideo, Winamp, Winzip, and Hmean]

NonHead Miss Speculation (NHMS)
[Figure: speculation mechanism with Trace Cache, DIQ, EIP, Nonhead Miss Predictor, and Nonhead Miss Target Buffer]

Miss Predictor Detail
[Figure: EIP (bit fields 31-16 and 15-0), 16-bit Nonhead History Register, Table of 3-bit Counters]
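A predictor built from these pieces might look like the sketch below. The slides name the components (EIP, a 16-bit nonhead history register, a table of 3-bit counters) but not how they are combined, so several details here are assumptions: the index is the low 16 bits of EIP XORed with the history (gshare-style), counters saturate at 0 and 7, a miss is predicted when the counter is in its upper half, and the history shifts in one bit per outcome. `NonheadMissPredictor` is a hypothetical name.

```python
class NonheadMissPredictor:
    """Sketch of a nonhead miss predictor (indexing scheme is an assumption)."""

    def __init__(self, index_bits: int = 16, counter_bits: int = 3):
        self.mask = (1 << index_bits) - 1            # 16 bits -> 64 K entries
        self.max_count = (1 << counter_bits) - 1     # 3-bit counters saturate at 7
        self.threshold = (self.max_count + 1) // 2   # predict a miss at 4 and above
        self.counters = [0] * (1 << index_bits)
        self.history = 0                             # 16-bit nonhead history register

    def _index(self, eip: int) -> int:
        # Assumed gshare-style hash: low EIP bits XOR history.
        return (eip ^ self.history) & self.mask

    def predict_miss(self, eip: int) -> bool:
        return self.counters[self._index(eip)] >= self.threshold

    def update(self, eip: int, was_miss: bool) -> None:
        i = self._index(eip)
        if was_miss:
            self.counters[i] = min(self.max_count, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the history register.
        self.history = ((self.history << 1) | int(was_miss)) & self.mask


predictor = NonheadMissPredictor()
eip = 0x00401A2C
initial = predictor.predict_miss(eip)
for _ in range(24):                  # repeated nonhead misses at the same EIP
    predictor.update(eip, True)
trained = predictor.predict_miss(eip)
```

After enough repeated misses the history settles to all ones, the same counter is trained on every update, and the predictor starts predicting a miss for that EIP.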

64 K-entry Miss Predictor
• 72.5% of Nonhead misses correctly predicted
• 27.5% of Nonhead misses not predicted to miss
• # of Nonhead hits predicted to miss
  – 0.22% of all accesses to the predictor (once per 455 x86 instructions)
  – equals only 4.2% of all nonhead misses (net effect is as if 68.3% of nonhead misses discovered early)
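The net figure follows from subtracting the false alarms, expressed as a fraction of nonhead misses, from the correctly predicted misses. The short check below reproduces that arithmetic using only the numbers on the slide; the variable names are illustrative.

```python
correctly_predicted = 72.5   # % of nonhead misses predicted to miss
not_predicted = 27.5         # % of nonhead misses not predicted to miss
false_alarms = 4.2           # nonhead hits predicted to miss, as % of nonhead misses

# Each false alarm is charged against one correct prediction.
net_early = round(correctly_predicted - false_alarms, 1)
coverage_total = correctly_predicted + not_predicted
```

`net_early` comes out at 68.3, the "as if discovered early" share quoted above, and the two prediction outcomes sum to 100% of nonhead misses.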

4 K NonHead Miss Target Buffer

Application   Hit Rate (%)
Doom          97.7
Explorer      98.0
FileMaker     98.3
MsDev         99.2
Netscape      96.3
RealVideo     99.4
Winamp        99.3
Winzip        98.7
Average       98.4
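One way to picture the target buffer is as a small tagged table that maps the EIP of a predicted nonhead miss to the address of the trace head to fetch next. The sketch below assumes a direct-mapped, 4 K-entry organization indexed by EIP modulo the table size; the slides give only the size, so the organization and the `NonheadMissTargetBuffer` name are assumptions. The last lines recompute the average of the per-application hit rates listed above.

```python
from typing import List, Optional, Tuple


class NonheadMissTargetBuffer:
    """Sketch of a direct-mapped target buffer (organization is an assumption)."""

    def __init__(self, entries: int = 4096):
        self.entries = entries
        # Each slot holds (tag, target) or None when empty.
        self.table: List[Optional[Tuple[int, int]]] = [None] * entries

    def lookup(self, eip: int) -> Optional[int]:
        slot = self.table[eip % self.entries]
        if slot is not None and slot[0] == eip:
            return slot[1]            # hit: predicted head address
        return None                   # miss: no speculation possible

    def install(self, eip: int, target: int) -> None:
        self.table[eip % self.entries] = (eip, target)


ntb = NonheadMissTargetBuffer()
ntb.install(0x1000, 0x2000)
hit = ntb.lookup(0x1000)
alias = ntb.lookup(0x1000 + 4096)     # same slot, different tag

hit_rates = [97.7, 98.0, 98.3, 99.2, 96.3, 99.4, 99.3, 98.7]
average = round(sum(hit_rates) / len(hit_rates), 1)
```

The rounded mean of the eight hit rates agrees with the Average row of the table.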

Impact of Adding NHMS (removes 80% of NonHead Miss Penalty on average)
[Bar chart: CPµ (axis 0.8 to 2) for Base, Opt, and 64k pred 4k NMTB across Doom, Explorer, FileMaker, MsDev, Netscape, RealVideo, Winamp, Winzip, and Hmean]

NHMS Results
[Pie chart: BTB Hit/H hit 84%, BTB Hit/H miss 4%, BTB Hit Wrong/H hit 1%, BTB Hit Wrong/H miss 4%, NH miss/H hit 3%, NH miss/H miss 1%, BTB Miss/H hit 2%, BTB Miss/H miss 1%]
• BTB hit then head hit increases from 74% to 84%
• Nonhead misses reduced from 15% to 4%
• Average CPµ improvement ~10%
  – only 2% for Netscape
  – 20% for MsDev
  – 15% for RealVideo
  – 14% for Winamp

Conclusions
• 8 categories of Model trace cache accesses help identify the majority of misses as nonhead misses
• NHMS removes most of this penalty, yielding almost 10% performance improvement

NHMS vs. Other Implementations
[Bar chart: CPµ (axis 1 to 2) for base, BB, br_comp, BB_8k_btb, br_comp_8k, and NHMS across doom, explorer, filemaker, msdev, netscape, realvideo, winamp, winzip, and hmean]

% of NonHead Miss Penalty Removed
[Bar chart: Percent (axis 0 to 100) for Doom, Explorer, FileMaker, MsDev, Netscape, RealVideo, Winamp, Winzip, and Average]

Related Work (Trace Cache)
• Decoded Instruction Cache [Patt85]
  – Fill unit: larger piece of atomic work that could be given to the dynamic scheduler [Melvin88]
  – VAX instructions decomposed into micro-operations [Melvin89]
  – CPI 6 -> 2 for current VAX implementations
  – Pentium Pro could use a fill unit [Franklin95]
• Trace Cache
  – Branch Address Cache [Yeh93] [Dutta95]
  – Patent [Peleg94]
