GRFPU - High Performance IEEE-754 Floating-Point Unit


Edvin Catovic, Gaisler Research

GRFPU – High Performance IEEE754 Floating-Point Unit

Gaisler Research

GRFPU Overview

● IEEE-754 compliant, supporting single- and double-precision FP numbers
● Primarily developed for use with LEON
● Significant system performance improvement over existing solutions
● Extensively validated
● Fault tolerant
● Written in high-level, synthesizable VHDL code


IEEE-754 Standard for Binary Floating-Point Arithmetic

● Formats
  – Single-precision FP: S (bit 31) | EXPONENT (bits 30-23) | FRACTION (bits 22-0)
    $fp_{single} = (-1)^{s} \times 1.f \times 2^{exp-127}$
  – Double-precision FP: S (bit 63) | EXPONENT (bits 62-52) | FRACTION (bits 51-0)
    $fp_{double} = (-1)^{s} \times 1.f \times 2^{exp-1023}$
● Operations
  – Arithmetic: addition, subtraction, multiplication, division and square root
  – Comparison
  – Format conversions: FP to integer, integer to FP
● Rounding
  – 4 rounding modes: round-to-nearest, round-to-zero, round-to-+∞, round-to-−∞
● Exceptions
  – Invalid operation, division by zero, overflow, underflow, inexact
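The two formulas can be checked directly against the bit layouts. The Python sketch below is an illustration, not part of GRFPU (the helper names `to_bits` and `decode_fp` are hypothetical): it packs a value with the standard `struct` module, splits the bit pattern into $s$, $exp$ and $f$, and re-evaluates $(-1)^{s} \times 1.f \times 2^{exp-bias}$ with a bias of 127 or 1023. Only normalized numbers are handled.

```python
import struct

# (exponent bits, fraction bits, bias) for the two formats above.
FORMATS = {"single": (8, 23, 127), "double": (11, 52, 1023)}


def to_bits(value, fmt):
    """Raw IEEE-754 bit pattern of `value` in the given format."""
    if fmt == "single":
        return struct.unpack(">I", struct.pack(">f", value))[0]
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def decode_fp(bits, fmt):
    """Split a bit pattern into (s, exp, f) and evaluate (-1)^s * 1.f * 2^(exp - bias).

    Only normalized numbers are handled: an exponent field of all zeros
    (zero, denormals) or all ones (infinity, NaN) raises ValueError.
    """
    exp_bits, frac_bits, bias = FORMATS[fmt]
    s = bits >> (exp_bits + frac_bits)
    exp = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    f = bits & ((1 << frac_bits) - 1)
    if exp == 0 or exp == (1 << exp_bits) - 1:
        raise ValueError("not a normalized number")
    significand = 1 + f / (1 << frac_bits)  # hidden leading 1 plus the fraction .f
    return s, exp, f, (-1) ** s * significand * 2.0 ** (exp - bias)
```

For example, `decode_fp(to_bits(-6.5, "single"), "single")` returns `(1, 129, 5242880, -6.5)`, since $-6.5 = (-1)^{1} \times 1.625 \times 2^{129-127}$ and the fraction $.625$ is $5242880 / 2^{23}$.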


FPU Design Challenges

● FP algorithms
  – Complexity
  – Correctness and accuracy
  – IEEE-754 compliance
● System architecture
  – HW or SW support
  – Affects overall system performance
● Hardware design
  – Complex high-precision operations
  – Trade-offs to achieve high performance/area
  – Large data paths
● Test and validation


FPU Algorithms

● System-level performance impact
  – Latency and throughput
  – HW support
● Division and square root
  – Lack of support in HW ⇒ overall CPI (cycles per instruction) increase
  – High latency (> 30 clock cycles) ⇒ significant CPI increase ⇒ system performance degradation
● Div/sqrt by digit recurrence
  – Dedicated HW, high latency
● Div/sqrt by functional iteration
  – Low-latency div and sqrt operations
  – Multiplication is the basic step ⇒ the multiplier can be shared between mul, div and sqrt
  – Small area overhead ⇒ high performance/area
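To make the functional-iteration idea concrete, here is a minimal Python sketch of Newton-Raphson division and square root built only from multiplications and subtractions, seeded from a small lookup table. The slides do not say which iteration GRFPU uses (Newton-Raphson, Goldschmidt or a variant), nor its table size or iteration count; `TABLE_BITS = 8`, three iterations and all function names are assumptions chosen for illustration.

```python
import math

TABLE_BITS = 8  # hypothetical width of the seed-table index
SIZE = 1 << TABLE_BITS

# Seed "ROMs": entry i holds 1/m and 1/sqrt(m) for the midpoint m of
# [1 + i/SIZE, 1 + (i+1)/SIZE). In hardware these would be constants.
_MIDPOINTS = [1.0 + (i + 0.5) / SIZE for i in range(SIZE)]
RECIP_TABLE = [1.0 / m for m in _MIDPOINTS]
RSQRT_TABLE = [1.0 / math.sqrt(m) for m in _MIDPOINTS]


def _index(b):
    """Top TABLE_BITS fraction bits of a significand b in [1, 2)."""
    return int((b - 1.0) * SIZE)


def nr_divide(a, b, iterations=3):
    """a / b for significands a, b in [1, 2), using only multiply and subtract."""
    x = RECIP_TABLE[_index(b)]  # seed: |1 - b*x| <= 2**-9
    for _ in range(iterations):
        x = x * (2.0 - b * x)  # Newton-Raphson step for 1/b
    return a * x


def nr_sqrt(b, iterations=3):
    """sqrt(b) for b in [1, 2) via the reciprocal square-root iteration."""
    y = RSQRT_TABLE[_index(b)]
    for _ in range(iterations):
        y = y * (3.0 - b * y * y) * 0.5  # Newton-Raphson step for 1/sqrt(b); *0.5 is a shift
    return b * y  # sqrt(b) = b * (1/sqrt(b))
```

Each step roughly squares the relative error, so a seed error of at most $2^{-9}$ becomes about $2^{-18}$, $2^{-36}$ and $2^{-72}$, past double precision after three steps. The result is not guaranteed to be correctly rounded; a hardware implementation also needs a final rounding and correction step, which this sketch omits.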


GRFPU Architectural Features

● Implements all SPARC V8 FP operations
● Efficient implementation of the complex FDIV and FSQRT operations
  – Low latency and high throughput
● Special handling of denormalized numbers
  – Operations on denormalized inputs deferred to software
  – Tiny results flushed to zero (allowed by IEEE-754)
  – Fast non-IEEE mode
● Fast FP multiplier
  – Capable of performing multiplication on two DP operands ⇒ low-latency mul
  – Adapted for division and square root ⇒ good performance/area trade-off
● Non-blocking division and square root
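A minimal Python sketch of the denormal policy above, applied to a single-precision multiply: denormalized inputs raise a hypothetical `DeferToSoftware` exception (standing in for the trap to a software handler), and nonzero results below the smallest normal number $2^{-126}$ are flushed to a zero of the same sign. The names and the choice to detect tininess on the exact product are assumptions; the sketch does not model the GRFPC trap mechanism or the fast non-IEEE mode.

```python
import math
import struct

MIN_NORMAL_F32 = 2.0 ** -126  # smallest positive normalized single-precision value


class DeferToSoftware(Exception):
    """Hypothetical signal that an operation must be completed by a software handler."""


def f32_bits(x):
    """Bit pattern of x rounded to single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def is_denormal_f32(bits):
    """True for a nonzero single with an all-zero exponent field."""
    return ((bits >> 23) & 0xFF) == 0 and (bits & 0x7FFFFF) != 0


def fmuls_ftz(a, b):
    """Multiply two single-precision values a and b.

    Denormalized inputs are deferred to software; a nonzero exact product below
    MIN_NORMAL_F32 is flushed to a zero with the product's sign.
    """
    for operand in (a, b):
        if is_denormal_f32(f32_bits(operand)):
            raise DeferToSoftware(f"denormalized input {operand!r}")
    exact = a * b  # exact in double: two 24-bit significands give at most 48 bits
    if exact != 0.0 and abs(exact) < MIN_NORMAL_F32:
        return math.copysign(0.0, exact)  # tiny result: flush to signed zero
    try:
        return struct.unpack(">f", struct.pack(">f", exact))[0]  # round to nearest single
    except OverflowError:
        return math.copysign(math.inf, exact)
```

For example, `fmuls_ftz(2.0 ** -100, -2.0 ** -30)` returns `-0.0`, because the exact product $-2^{-130}$ lies in the denormal range.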

GRFPU – Logical View

[Figure: GRFPU logical view. Signals: clk, opcode, operand1, operand2 and id in; result, exceptions and id out. Internal blocks: pipelined execution unit and iteration unit.]

● All SPARC V8 FP operations
● FADD, FSUB, FMUL, FCMP and CONV are fully pipelined
● Separate non-blocking iteration unit (FDIV and FSQRT)


GRFPU Operation Timing Example

Fully pipelined operations (FADD, FSUB, FMUL, FCMP, CONV):

  Clock cycle   1       2       3       4       5       6
  Stage 1       FADDD   FMULD   FSUBD
  Stage 2               FADDD   FMULD   FSUBD
  Stage 3                       FADDD   FMULD   FSUBD

[Figure: fully pipelined operations interleaved with FDIV (or FSQRT); clock cycles 1 to 15; rows Stage 1, Stage 2 / Iter. Stage, Stage 3; instructions FDIVS, FADDS and FMULD.]

● FDIV and FSQRT are non-blocking operations
● All other operations can be interleaved with FDIV or FSQRT
● Operations can complete out of order
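The scheduling behaviour in the timing example can be modelled in a few lines of Python. The sketch below issues one independent operation per cycle, sends FADD/FSUB/FMUL/FCMP/conversions through a 3-stage pipeline, and sends FDIV/FSQRT to a separate iteration unit using the latencies from the performance table. Treating the single-precision pipelined operations and conversions as 3-cycle ops, ignoring data dependencies and structural conflicts between the two units, and the `schedule` function itself are assumptions made for illustration.

```python
# Latencies in clock cycles, from the performance table. Treating the
# single-precision pipelined ops and the conversions as 3-cycle ops is an assumption.
PIPELINE_LATENCY = 3
PIPELINED = {
    "FADDS", "FADDD", "FSUBS", "FSUBD", "FMULS", "FMULD", "FCMPS", "FCMPD",
    "FITOS", "FITOD", "FSTOI", "FDTOI", "FSTOD", "FDTOS",
}
ITERATIVE = {"FDIVS": 15, "FDIVD": 16, "FSQRTS": 23, "FSQRTD": 24}


def schedule(ops):
    """Return (op, issue_cycle, completion_cycle) for a stream of independent FP ops.

    Model assumptions: at most one op is issued per cycle, in program order;
    pipelined ops never stall; the iteration unit is non-blocking for other ops
    but accepts a new FDIV/FSQRT only after the previous one has completed
    (throughput equal to latency, as in the performance table).
    """
    cycle = 1  # next issue cycle
    iter_free = 1  # first cycle the iteration unit can accept a new op
    timeline = []
    for op in ops:
        if op in ITERATIVE:
            cycle = max(cycle, iter_free)  # only FDIV/FSQRT wait for the iteration unit
            done = cycle + ITERATIVE[op] - 1
            iter_free = done + 1
        elif op in PIPELINED:
            done = cycle + PIPELINE_LATENCY - 1
        else:
            raise ValueError(f"unknown FP operation: {op}")
        timeline.append((op, cycle, done))
        cycle += 1
    return timeline
```

`schedule(["FDIVS", "FADDS", "FMULD"])` returns `[("FDIVS", 1, 15), ("FADDS", 2, 4), ("FMULD", 3, 5)]`: the two later instructions finish before the divide, i.e. out of order.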

Performance

● Throughput (clock cycles between issues) and latency (clock cycles)

  OPERATION                          THROUGHPUT   LATENCY
  FADDD, FSUBD, FMULD, FCMP, CONV    1            3
  FDIVS                              15           15
  FDIVD                              16           16
  FSQRTS                             23           23
  FSQRTD                             24           24

● Frequency
  – 250 MHz on a 0.13 µm standard-cell ASIC process
  – 65 MHz on a Virtex-II FPGA
● Area
  – 100 kgates on ASIC
  – 8500 LUTs on Virtex-II FPGA


GRFPU Block Diagram

[Figure: GRFPU block diagram. Blocks: FPOP decode and control; unpack; alignment adder logic; adder; leading-zero counter (LZ CNT); approximation tables; Booth encoder; Wallace tree; shifter; two postnormalization/rounder units; iteration buffer and control; intermediate result.]
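The Booth encoder and Wallace tree in the block diagram are standard building blocks of a fast array multiplier. As an illustration only (the slides do not state GRFPU's Booth radix or tree organization, so radix 4 is an assumption), the Python sketch below recodes the multiplier into radix-4 Booth digits in $\{-2, -1, 0, 1, 2\}$, forms one shifted partial product per digit, and compresses the partial products with 3:2 carry-save adders until two rows remain for the final carry-propagate addition. Python's arbitrary-precision integers stand in for the fixed-width datapath; the function names are hypothetical.

```python
def booth_radix4_digits(y, n):
    """Recode an unsigned n-bit multiplier y (0 <= y < 2**n) into radix-4 Booth digits.

    Digit k inspects bits (2k+1, 2k, 2k-1) of y, with an implicit 0 below bit 0,
    and lies in {-2, -1, 0, 1, 2}; y == sum(d * 4**k for k, d in enumerate(digits)).
    """
    digits = []
    prev = 0  # bit 2k-1 of y
    for p in range(0, n + 1, 2):
        b0 = (y >> p) & 1
        b1 = (y >> (p + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def csa(a, b, c):
    """3:2 carry-save adder: returns (sum, carry) with a + b + c == sum + carry."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def booth_wallace_multiply(x, y, n):
    """Multiply unsigned n-bit integers x and y via Booth partial products and CSA reduction."""
    # One partial product per Booth digit d: d * x in {-2x, -x, 0, x, 2x}
    # (a shift and/or negation in hardware), shifted left by 2k.
    rows = [(d * x) << (2 * k) for k, d in enumerate(booth_radix4_digits(y, n))]
    # Wallace-style reduction: compress each group of three rows into two.
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        reduced = []
        for i in range(0, full, 3):
            reduced.extend(csa(rows[i], rows[i + 1], rows[i + 2]))
        reduced.extend(rows[full:])  # 0-2 rows that did not fill a group
        rows = reduced
    return sum(rows)  # final carry-propagate addition
```

Radix-4 recoding roughly halves the number of partial products: a 53-bit double-precision significand yields 27 Booth digits instead of 53 rows, which helps make a single-pass DP multiplication affordable in area.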

FPU Comparison

Table shows throughput and (latency), in clock cycles.

  FPU          FADDD   FMULD    FDIVD     FREQ (MHz)   AREA (kgates)   COMMENTS
  GRFPU        1 (3)   1 (3)    16 (16)   250          100             0.13 µm, synthesis
  ARM VFP9-S   1 (4)   2 (5)    28 (31)   140          100             0.18 µm, synthesis
  ARM VFP11    1 (5)   2 (10)   29 (33)   350          100             0.13 µm, hard-block
  AMD K7       1 (2)   1 (4)    17 (20)   500          ?               0.13 µm, hard-block
  MEIKO        8       10       50        140          25              0.18 µm, synthesis

● GRFPU compares well against other FPUs on the market
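Since the FPUs in the table run at different clock frequencies, one way to read the FDIVD column is to convert latency to wall-clock time, $t = \text{cycles} / f$. The small Python sketch below does this for the four entries that give a latency in parentheses; MEIKO is left out because its row gives a single figure. The results mix synthesis and hard-block frequencies on different processes, so they are only a rough comparison.

```python
# FDIVD latency (clock cycles) and clock frequency (MHz), from the table above.
FDIVD_LATENCY = {
    "GRFPU": (16, 250),
    "ARM VFP9-S": (31, 140),
    "ARM VFP11": (33, 350),
    "AMD K7": (20, 500),
}


def latency_ns(cycles, freq_mhz):
    """Convert a latency in clock cycles to nanoseconds at the given clock frequency."""
    return cycles * 1000.0 / freq_mhz


# Print the FPUs from lowest to highest FDIVD latency in nanoseconds.
for fpu, (cycles, mhz) in sorted(FDIVD_LATENCY.items(), key=lambda kv: latency_ns(*kv[1])):
    print(f"{fpu:<11} {latency_ns(cycles, mhz):6.1f} ns")
```

This prints about 40 ns for the AMD K7, 64 ns for GRFPU, 94 ns for the ARM VFP11 and 221 ns for the ARM VFP9-S.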


GRFPU Controller - GRFPC

● GRFPC provides an interface between LEON and GRFPU
● Schedules SPARC FPops for execution on GRFPU
● Handles the FP register file (32 x 32-bit FP registers)
● Exception handling (FP status register, FP deferred queue)
● Parallel execution of FP and integer operations
  – FP operations do not block the IU pipeline, and vice versa
  – FP loads and stores handled by the IU
● Out-of-order execution of FP instructions
● Full compliance with the SPARC V8 instruction scheduling and trap model


GRFPC Block Diagram

[Figure: GRFPC block diagram. Pipeline stages: decode stage, register file stage, execution stage(s), write-back stage. Blocks: DECODE, FP REGFILE, FORWARD, EDAC, GRFPU, low-latency instruction buffer, high-latency instruction buffer, EXC CTRL, WB CTRL, FQ, FSR. Data paths: load, store, from WB, from instruction buffers, to RF, to forwarding.]


Instruction Trace Example

  TIME        ADDRESS    INSTRUCTION               RESULT
  262843492   40003700   fsubs  %f6, %f3, %f6      [10000060]
  262843493   40003704   ld     [%o0 + 0xc], %f5   [3cc90aaf]
  262843494   40003708   fmuls  %f2, %f4, %f2      [00000054]
  262843495   4000370c   fmuls  %f5, %f6, %f3      [00001fe1]
  262843499   40003710   fsubs  %f2, %f3, %f2      [00000054]
  262843503   40003714   st     %f2, [%o7 + %o3]   [40014dd8]
  262843504   40003718   ld     [%o1 + %i5], %f3   [c1200000]
  262843505   4000371c   add    %o7, 8, %o7        [00000028]
  262843506   40003720   ld     [%o1 + %i0], %f4   [c1200000]
  262843507   40003724   add    %i5, 8, %i5        [00000418]
  262843510   40003728   fsubs  %f4, %f3, %f4      [4000a800]
  262843511   4000372c   ld     [%o0 + 0x8], %f2   [3f7fec43]
  262843515   40003730   fmuls  %f2, %f6, %f2      [00000054]
  262843516   40003734   fmuls  %f5, %f4, %f5      [00001fe1]
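Traces like this are where FPU latency shows up at the instruction level: an FP instruction that reads a register produced by a recent FP operation may have to wait for it. The Python sketch below encodes the trace rows and lists read-after-write (RAW) pairs with the time distance between producer and consumer. The operand convention it assumes (the last register operand is the destination, except for stores) follows the SPARC assembly syntax shown; the function names are hypothetical.

```python
import re

# (time, address, mnemonic, operands) rows from the trace above.
TRACE = [
    (262843492, 0x40003700, "fsubs", "%f6, %f3, %f6"),
    (262843493, 0x40003704, "ld", "[%o0 + 0xc], %f5"),
    (262843494, 0x40003708, "fmuls", "%f2, %f4, %f2"),
    (262843495, 0x4000370C, "fmuls", "%f5, %f6, %f3"),
    (262843499, 0x40003710, "fsubs", "%f2, %f3, %f2"),
    (262843503, 0x40003714, "st", "%f2, [%o7 + %o3]"),
    (262843504, 0x40003718, "ld", "[%o1 + %i5], %f3"),
    (262843505, 0x4000371C, "add", "%o7, 8, %o7"),
    (262843506, 0x40003720, "ld", "[%o1 + %i0], %f4"),
    (262843507, 0x40003724, "add", "%i5, 8, %i5"),
    (262843510, 0x40003728, "fsubs", "%f4, %f3, %f4"),
    (262843511, 0x4000372C, "ld", "[%o0 + 0x8], %f2"),
    (262843515, 0x40003730, "fmuls", "%f2, %f6, %f2"),
    (262843516, 0x40003734, "fmuls", "%f5, %f4, %f5"),
]

REGISTER = re.compile(r"%[a-z]+\d+")


def reads_and_writes(mnemonic, operands):
    """Split register operands into (read, written) lists.

    Assumes SPARC assembly order: the last register is the destination,
    except for stores, which only read.
    """
    regs = REGISTER.findall(operands)
    if mnemonic.startswith("st"):
        return regs, []
    return regs[:-1], regs[-1:]


def raw_dependencies(trace):
    """Yield (consumer_time, producer_time, register) for each read-after-write pair."""
    last_writer = {}
    for time, _address, mnemonic, operands in trace:
        reads, writes = reads_and_writes(mnemonic, operands)
        for reg in reads:
            if reg in last_writer:
                yield time, last_writer[reg], reg
        for reg in writes:
            last_writer[reg] = time


for consumer, producer, reg in raw_dependencies(TRACE):
    print(f"{consumer}: reads {reg} written at {producer} ({consumer - producer} later)")
```

Among the pairs it reports are `fsubs %f2, %f3, %f2` at 262843499 reading `%f3` from the `fmuls` at 262843495, and the `st` at 262843503 reading `%f2` from that `fsubs`; in both cases the consumer appears 4 time units after its producer.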


Application Level Performance

● High performance provided by GRFPU + parallel integer and floating-point instruction execution ⇒ overall system-level performance increase
● Example: LEON2 + GRFPU/GRFPC running at 100 MHz
  – GRFPU @ 100 MHz: 100 MFLOPS peak FP performance
  – C code: 30-40 MFLOPS @ 100 MHz
  – Hand-coded assembly: 40-70 MFLOPS @ 100 MHz
● Large overall system-level performance increase for heavy FP applications
  – A typical GNC (guidance, navigation and control) application runs 60% faster with GRFPU than with MEIKO (at the same clock frequency)


Fault Tolerance

● GRFPU and GRFPC are SEU (single-event upset) protected by design
  – TMR (triple modular redundancy) registers
  – FP register file protected using a (32, 7) BCH code (SEC/DED: single-error correction, double-error detection)
  – Integrated with the LEON instruction restart capability
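The two SEU-protection mechanisms can be sketched in Python. Triple modular redundancy keeps three copies of a register and takes a bitwise 2-of-3 majority vote. For the register file, the sketch uses a generic extended Hamming code, which has the same shape as the code on the slide (32 data bits, 7 check bits, single-error correction and double-error detection) but is not GRFPC's actual BCH check matrix, which the slides do not give. All function names are hypothetical.

```python
def tmr_vote(a, b, c):
    """Bitwise 2-of-3 majority of three register copies."""
    return (a & b) | (a & c) | (b & c)


# Extended Hamming (39, 32) layout: bit 0 is an overall parity bit, bits 1..38
# form a Hamming code with check bits at the power-of-two positions.
CHECK_POS = [1, 2, 4, 8, 16, 32]
DATA_POS = [p for p in range(1, 39) if p not in CHECK_POS]  # 32 data positions


def secded_encode(data):
    """Encode a 32-bit word into a 39-bit codeword (32 data + 7 check bits)."""
    word = 0
    for i, pos in enumerate(DATA_POS):
        word |= ((data >> i) & 1) << pos
    for c in CHECK_POS:
        # Check bit c makes the parity of all positions with bit c set even.
        parity = 0
        for pos in range(1, 39):
            if pos & c:
                parity ^= (word >> pos) & 1
        word |= parity << c
    return word | (bin(word).count("1") & 1)  # overall parity: total number of ones even


def secded_decode(word):
    """Return (data, status), status being 'ok', 'corrected' or 'double_error'."""
    syndrome = 0
    for pos in range(1, 39):
        if (word >> pos) & 1:
            syndrome ^= pos
    odd = bin(word).count("1") & 1
    if odd:
        # Single error: at position `syndrome`, or the overall parity bit if 0.
        word ^= 1 << syndrome
        status = "corrected"
    elif syndrome:
        status = "double_error"  # detected, not correctable
    else:
        status = "ok"
    data = 0
    for i, pos in enumerate(DATA_POS):
        data |= ((word >> pos) & 1) << i
    return data, status
```

A single flipped bit anywhere in the 39-bit codeword is corrected, while two flipped bits leave the overall parity even with a nonzero syndrome and are reported as uncorrectable.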


Design methodology and validation

● FP algorithms are highly complex (especially divide and square root)
● Validation proved to be a very hard task
  – Several bugs in commercial processors were detected after large-scale deployment (e.g. the Pentium FDIV bug)
● GRFPU design work
  – Phase 1: Development of FP algorithms. Correctness, accuracy and convergence of the FP algorithms were mathematically proved
  – Phase 2: Development of a high-level FPU model in C, attached to the TSIM simulator. FP test programs and real-life software could be run on the model before HW development started
  – Phase 3: HW development
  – Phase 4: Test and validation
● Validation performed during several stages of the development work
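As an example of the kind of convergence argument Phase 1 refers to, consider a Newton-Raphson reciprocal iteration (an assumption; the slides do not name the iteration GRFPU uses). With relative error $\varepsilon_i = 1 - b\,x_i$, each step squares the error, so a seed accurate to $k$ bits gives about $k \cdot 2^{n}$ bits after $n$ steps:

```latex
\begin{aligned}
x_{i+1} &= x_i\,(2 - b\,x_i), \qquad \varepsilon_i = 1 - b\,x_i \\
\varepsilon_{i+1} &= 1 - b\,x_i\,(2 - b\,x_i) = 1 - 2\,b\,x_i + (b\,x_i)^2 = (1 - b\,x_i)^2 = \varepsilon_i^2 \\
|\varepsilon_0| \le 2^{-k} &\;\Rightarrow\; |\varepsilon_n| = |\varepsilon_0|^{2^{n}} \le 2^{-k \cdot 2^{n}}
\end{aligned}
```

A complete proof must also bound the rounding error introduced by each finite-precision multiplication, which this idealized derivation ignores.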


Design methodology and validation (2)

● TSIM + GRFPU as loadable module

[Figure: TSIM connected to the GRFPU model through the FPU module interface (I/F), which exposes functions such as fadds(), fdivs(), faddd(), fmuld(), fsqrts(), ...]

● TSIM simulates the full functionality of LEON, memory and peripherals
● Provides an interface to attach a user-defined FPU model
● Possible to test the FP algorithms before HW implementation started
● Offers high performance (20+ MIPS)
  – Large and exhaustive test programs were run on the C model (UCBTEST, SoftFloat, IeeeCC754, GNC application)
● Used as golden model in later stages of the development work
● Efficient debugging environment
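The idea of a loadable FPU model used as a golden reference can be sketched as a table that maps each FP operation name to a function on raw register bit patterns, so that any other model (or RTL simulation output) can be compared against it vector by vector. The Python sketch below is hypothetical and does not reproduce TSIM's actual C module interface. It relies on a known property of IEEE arithmetic: for add, subtract, multiply, divide and square root, computing a single-precision operation in double precision and rounding once to single gives the correctly rounded result (round-to-nearest only). Exception flags and traps are not modelled; exceptional inputs such as division by zero raise Python exceptions instead.

```python
import math
import struct


def f32(bits):  # 32-bit pattern -> Python float
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def bits32(x):  # Python float -> nearest single, as a 32-bit pattern
    return struct.unpack(">I", struct.pack(">f", x))[0]


def f64(bits):  # 64-bit pattern -> Python float
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def bits64(x):  # Python float -> 64-bit pattern
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def single_op(fn):
    """Wrap a float function as a single-precision op on 32-bit register patterns."""
    return lambda *regs: bits32(fn(*(f32(r) for r in regs)))


def double_op(fn):
    """Wrap a float function as a double-precision op on 64-bit register patterns."""
    return lambda *regs: bits64(fn(*(f64(r) for r in regs)))


# Hypothetical golden model: operation name -> function on raw bit patterns.
# Single-precision results are computed in double and rounded once to single.
REFERENCE_MODEL = {
    "fadds": single_op(lambda a, b: a + b),
    "fmuls": single_op(lambda a, b: a * b),
    "fdivs": single_op(lambda a, b: a / b),
    "fsqrts": single_op(math.sqrt),
    "faddd": double_op(lambda a, b: a + b),
    "fmuld": double_op(lambda a, b: a * b),
    "fdivd": double_op(lambda a, b: a / b),
    "fsqrtd": double_op(math.sqrt),
}


def compare(model_under_test, vectors):
    """Run (op, operand, ...) vectors on both models and return the mismatches."""
    mismatches = []
    for op, *operands in vectors:
        expected = REFERENCE_MODEL[op](*operands)
        actual = model_under_test[op](*operands)
        if actual != expected:
            mismatches.append((op, operands, expected, actual))
    return mismatches
```

For example, `REFERENCE_MODEL["fdivs"](0x3F800000, 0x40400000)` computes 1.0 / 3.0 and returns `0x3EAAAAAB`, the correctly rounded single-precision result.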


Design methodology and validation (3)

● Floating-point test programs
  – UCBTEST: uses number theory to generate hard-case test vectors
  – TestFloat: checks an FPU implementation by comparing it against its own software implementation; uses a large set of test vectors + random data
  – IeeeCC754: checks IEEE-754 compliance
● Run on both the final implementation and a C model of the GRFPU
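A TestFloat-style check can be sketched in a few lines: run the same operand pairs through a reference and through the implementation under test, mixing hand-picked special values (zeros, the smallest denormal, the smallest normal, 0.5, ±1, the largest finite value, ±∞) with random bit patterns, and report every result that differs bit for bit (any two NaNs count as equal). In this hypothetical Python sketch the device under test, `rz_fmuls`, deliberately rounds toward zero where round-to-nearest is expected, so the harness has something to find. Real TestFloat covers all operations, rounding modes and exception flags, which this sketch does not.

```python
import random
import struct


def to_float(bits):
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def to_bits(x):
    """Round x to the nearest single and return its bit pattern (overflow -> infinity)."""
    try:
        return struct.unpack(">I", struct.pack(">f", x))[0]
    except OverflowError:
        return 0xFF800000 if x < 0 else 0x7F800000


def is_nan(bits):
    return (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0


def reference_fmuls(a, b):
    # The exact product of two singles fits in a double, so one rounding suffices.
    return to_bits(to_float(a) * to_float(b))


def rz_fmuls(a, b):
    """Hypothetical faulty device under test: rounds toward zero, not to nearest."""
    p = to_float(a) * to_float(b)
    r = to_bits(p)
    if abs(to_float(r)) > abs(p):
        r -= 1  # step the magnitude one ulp toward zero; the sign bit is untouched
    return r


SPECIALS = [0x00000000, 0x80000000, 0x00000001, 0x00800000, 0x3F000000,
            0x3F800000, 0xBF800000, 0x7F7FFFFF, 0x7F800000, 0xFF800000]


def run_tests(dut, n_random=1000, seed=0):
    """Compare dut with reference_fmuls on special and random operand pairs."""
    rng = random.Random(seed)
    vectors = [(a, b) for a in SPECIALS for b in SPECIALS]
    vectors += [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(n_random)]
    mismatches = []
    for a, b in vectors:
        expected, actual = reference_fmuls(a, b), dut(a, b)
        if expected != actual and not (is_nan(expected) and is_nan(actual)):
            mismatches.append((a, b, expected, actual))
    return mismatches
```

`run_tests(rz_fmuls)` reports, among others, the overflow case 0x7F7FFFFF × 0x7F7FFFFF, where round-to-nearest gives +∞ (0x7F800000) but round-toward-zero gives the largest finite value (0x7F7FFFFF).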


Summary

● GRFPU/GRFPC offer a significant performance improvement over existing solutions (LEON/MEIKO or ERC32/MEIKO)
● GRFPU compares well against other implementations
● Scheduled for 2 SoC designs
● Portability and fault-tolerance (FT) capabilities make it suitable for long-term space use
