GRFPU - High Performance IEEE-754 Floating-Point Unit


Edvin Catovic Gaisler Research [email protected]


Gaisler Research

GRFPU Overview

● IEEE-754 compliant, supporting single and double precision FP numbers
● Primarily developed for use with LEON
● Significant system performance improvement over existing solutions
● Extensively validated
● Fault tolerant
● Written in high-level synthesizable VHDL code



IEEE-754 Standard for Binary Floating-Point Arithmetic

● Formats
  – Single precision FP: S (bit 31), EXPONENT (bits 30-23), FRACTION (bits 22-0)
    $fp_{single} = (-1)^{s} \times 1.f \times 2^{exp-127}$
  – Double precision FP: S (bit 63), EXPONENT (bits 62-52), FRACTION (bits 51-0)
    $fp_{double} = (-1)^{s} \times 1.f \times 2^{exp-1023}$
● Operations
  – Arithmetic: addition, subtraction, multiplication, division and square-root
  – Comparison
  – Format conversions: FP to integer, integer to FP
● Rounding
  – 4 rounding modes: round-to-nearest, round-to-zero, round-to-+∞, round-to-−∞
● Exceptions
  – Invalid operation, division by zero, overflow, underflow, inexact
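The two formats can be checked with a short Python sketch. It unpacks the sign, biased exponent and fraction fields from the raw bits and re-evaluates (-1)^s × 1.f × 2^(exp-bias) for normalized numbers. The names `decode` and `value` are illustrative and not part of any GRFPU software; zeros, denormals, infinities and NaNs are deliberately left out.

```python
import struct
from fractions import Fraction

# (total bits, exponent bits, fraction bits, bias) for the two IEEE-754 formats
FORMATS = {"single": (32, 8, 23, 127), "double": (64, 11, 52, 1023)}

def decode(x, fmt="single"):
    """Return (sign, biased exponent, fraction field) of x in the given format."""
    bits, ebits, fbits, _ = FORMATS[fmt]
    packed = struct.pack(">f" if fmt == "single" else ">d", x)
    raw = int.from_bytes(packed, "big")
    sign = raw >> (bits - 1)
    exp = (raw >> fbits) & ((1 << ebits) - 1)
    frac = raw & ((1 << fbits) - 1)
    return sign, exp, frac

def value(sign, exp, frac, fmt="single"):
    """Evaluate (-1)^s * 1.f * 2^(exp - bias); valid for normalized numbers only."""
    _, ebits, fbits, bias = FORMATS[fmt]
    assert 0 < exp < (1 << ebits) - 1, "normalized numbers only"
    significand = 1 + Fraction(frac, 1 << fbits)
    return float((-1) ** sign * significand * Fraction(2) ** (exp - bias))

s, e, f = decode(-6.5, "single")
print(s, e, f"{f:023b}")            # sign, biased exponent, fraction bits
print(value(s, e, f, "single"))     # reconstructs the original number
```

For -6.5 = -1.625 × 2^2 the biased single-precision exponent is 2 + 127 = 129, which is what the decoder reports.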



FPU Design Challenges

● FP algorithms
  – Complexity
  – Correctness and accuracy
  – IEEE-754 compliance
● System architecture
  – HW or SW support ⇒ Affects overall system performance
● Hardware design
  – Complex high precision operations ⇒ Large data paths
  – Tradeoffs to achieve high performance/area
● Test and validation



FPU Algorithms

● System level performance impact
  – Latency and throughput
  – HW support
● Division and square-root
  – Lack of support in HW ⇒ overall CPI increase
  – High latency (> 30 clock cycles) ⇒ significant CPI increase ⇒ system performance degradation
● Div/sqrt by digit-recurrence
  – Dedicated HW, high latency
● Div/sqrt by functional iteration
  – Low latency div and sqrt operations
  – Multiplication is basic step ⇒ Multiplier can be shared between mul, div and sqrt
  – Small area overhead ⇒ high performance/area
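As an illustration of functional iteration, the sketch below divides using Newton-Raphson reciprocal refinement, where each step costs two multiplications and one subtraction. This is a textbook scheme chosen for illustration; the slides do not say which iteration GRFPU uses. The linear seed 48/17 - 32/17·d is the usual starting approximation for a divisor scaled to [0.5, 1), and `nr_divide` is a hypothetical name.

```python
import math

def nr_divide(a, b, steps=4):
    """Approximate a / b (b != 0) with Newton-Raphson reciprocal iteration."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    # Scale the divisor into [0.5, 1): |b| = d * 2**e
    d, e = math.frexp(abs(b))
    # Linear seed for 1/d on [0.5, 1); relative error at most 1/17
    x = 48.0 / 17.0 - (32.0 / 17.0) * d
    for _ in range(steps):
        # x <- x * (2 - d * x): the relative error is squared on every step
        x = x * (2.0 - d * x)
    q = math.ldexp(a * x, -e)       # undo the scaling: a / (d * 2**e)
    return -q if b < 0 else q

print(nr_divide(1.0, 3.0), 1.0 / 3.0)
```

Because every step is built from multiplications, one multiplier can serve mul, div and sqrt, which is the area argument made on this slide. The result here is accurate to within a few units in the last place; a real FPU additionally needs a correctly rounded final step.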



GRFPU Architectural Features

● Implements all SPARC V8 FP operations
● Efficient implementation of complex FDIV and FSQRT operations
● Low latency and high throughput
● Special handling of denormalized numbers
  – Operations on denormalized inputs deferred to software
  – Tiny results flushed to zero (allowed by IEEE-754)
  – Fast non-IEEE mode
● Fast FP multiplier
  – Capable of performing multiplication on two DP operands ⇒ low latency mul
  – Adapted for division and square-root ⇒ good performance/area tradeoff
  – Non-blocking division and square-root


GRFPU – Logical View

[Figure: GRFPU logical view. Blocks: pipelined execution unit, iteration unit. Signals: clk, opcode, operand1, operand2, id, result, exceptions.]

● All SPARC V8 FP operations
● FADD, FSUB, FMUL, FCMP and CONV are fully pipelined
● Separate non-blocking iteration unit (FDIV and FSQRT)



GRFPU Operation Timing Example

● Fully pipelined operations (FADD, FSUB, FMUL, FCMP, CONV)

[Figure: pipeline timing diagram over clock cycles 1-6. FADDD, FMULD and FSUBD pass in turn through Stage 1, Stage 2 and Stage 3.]

● Fully pipelined operations interleaved with FDIV (or FSQRT)

[Figure: pipeline timing diagram over clock cycles 1-15, with rows Stage 1, Stage 2 / Iter. Stage and Stage 3. An FDIVS occupies the iteration stage over many cycles while FADDS and FMULD pass through the pipeline.]

  – FDIV and FSQRT are non-blocking operations
  – All other operations can be interleaved with FDIV or FSQRT
  – Operations can complete out-of-order
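The interleaving described on this slide can be mimicked with a toy issue model. It assumes one instruction is issued per clock cycle, a result is ready `latency` cycles after issue, and there are no data dependencies or structural stalls. The latencies are those quoted in the performance table of this presentation. The function `completion_order` is illustrative only and does not model the GRFPC.

```python
# Latencies in clock cycles, taken from the performance table in these slides
LATENCY = {"FADDD": 3, "FSUBD": 3, "FMULD": 3, "FDIVS": 15, "FDIVD": 16}

def completion_order(program):
    """Issue one instruction per cycle starting at cycle 1; return
    (ready_cycle, issue_cycle, opcode) tuples sorted by ready cycle."""
    done = []
    for issue_cycle, op in enumerate(program, start=1):
        done.append((issue_cycle + LATENCY[op], issue_cycle, op))
    return sorted(done)

for ready, issued, op in completion_order(["FDIVS", "FADDD", "FMULD"]):
    print(f"{op}: issued in cycle {issued}, result ready in cycle {ready}")
```

Under these assumptions the FADDD and FMULD issued after the FDIVS finish well before it, which is the out-of-order completion the slide refers to.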


Performance

● Throughput and latency (clock cycles)

OPERATION                        THROUGHPUT  LATENCY
FADDD, FSUBD, FMULD, COMP, CONV  1           3
FDIVS                            15          15
FDIVD                            16          16
FSQRTS                           23          23
FSQRTD                           24          24

● Frequency
  – 250 MHz on 0.13 um standard-cell ASIC process
  – 65 MHz on Virtex-II FPGA
● Area
  – 100 kgates on ASIC
  – 8500 LUTs on Virtex-II FPGA



GRFPU Block Diagram

[Figure: GRFPU block diagram. Block labels: UNPACK; FPOP DECODE AND CONTROL; ALIGNMENT ADDER LOGIC; ADDER; APPROX TABLES; LZ CNT; BOOTH ENCODER; WALLACE TREE; SHIFTER; POSTNORM / ROUNDER (two instances); ITERATION BUFFER AND CTRL; INTERM. RESULT.]


FPU Comparison

● Table shows throughput and (latency)

FPU         FADDD  FMULD   FDIVD    FREQ (MHz)  AREA (kgates)  COMMENTS
GRFPU       1 (3)  1 (3)   16 (16)  250         100            0.13 um, synthesis
ARM VFP9-S  1 (4)  2 (5)   28 (31)  140         100            0.18 um, synthesis
ARM VFP11   1 (5)  2 (10)  29 (33)  350         100            0.13 um, hard-block
AMD K7      1 (2)  1 (4)   17 (20)  500         ?              0.13 um, hard-block
MEIKO       8      10      50       140         25             0.18 um, synthesis

● GRFPU compares well against other FPUs on the market
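Cycle counts alone hide the clock-rate differences, so a rough conversion to time helps when reading the comparison. The sketch below assumes the FREQ column is in MHz, which the slide does not state explicitly but which matches the 250 MHz quoted for GRFPU elsewhere in the presentation. MEIKO is left out because its entry gives a single number without a separate latency.

```python
# (FDIVD latency in cycles, clock in MHz) copied from the comparison table
FPUS = {
    "GRFPU": (16, 250),
    "ARM VFP9-S": (31, 140),
    "ARM VFP11": (33, 350),
    "AMD K7": (20, 500),
}

def latency_ns(cycles, freq_mhz):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles * 1000.0 / freq_mhz

for name, (cycles, mhz) in FPUS.items():
    print(f"{name}: FDIVD latency {latency_ns(cycles, mhz):.1f} ns")
```

The entries span different process nodes and design styles (synthesis versus hard-block), so the resulting times are indicative only.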



GRFPU Controller - GRFPC

● GRFPC provides an interface between LEON and GRFPU
● Schedules SPARC FPOPs for execution on GRFPU
● Handles FP register file (32 x 32-bit FP registers)
● Exception handling (FP Status register, FP deferred queue)
● Parallel execution of FP and integer operations
  – FP operations do not block IU pipeline and vice versa
  – FP load and store handled by IU
● Out-of-order execution of FP instructions
● Full compliance with SPARC V8 instruction scheduling and trap model



GRFPC Block Diagram

[Figure: GRFPC block diagram. Pipeline stages: decode stage, register file stage, execution stage(s), write-back stage. Block labels: DECODE; FP REGFILE; FORWARD; EDAC; GRFPU; LOW LATENCY INST BUFFER; HIGH LATENCY INST BUFFER; EXC CTRL; WB CTRL; FQ; FSR. Labeled paths: from WB, from inst buffers, load, store, to fwd, to RF.]



Instruction Trace Example

TIME       ADDRESS   INSTRUCTION                 RESULT
262843492  40003700  fsubs  %f6, %f3, %f6        [10000060]
262843493  40003704  ld     [%o0 + 0xc], %f5     [3cc90aaf]
262843494  40003708  fmuls  %f2, %f4, %f2        [00000054]
262843495  4000370c  fmuls  %f5, %f6, %f3        [00001fe1]
262843499  40003710  fsubs  %f2, %f3, %f2        [00000054]
262843503  40003714  st     %f2, [%o7 + %o3]     [40014dd8]
262843504  40003718  ld     [%o1 + %i5], %f3     [c1200000]
262843505  4000371c  add    %o7, 8, %o7          [00000028]
262843506  40003720  ld     [%o1 + %i0], %f4     [c1200000]
262843507  40003724  add    %i5, 8, %i5          [00000418]
262843510  40003728  fsubs  %f4, %f3, %f4        [4000a800]
262843511  4000372c  ld     [%o0 + 0x8], %f2     [3f7fec43]
262843515  40003730  fmuls  %f2, %f6, %f2        [00000054]
262843516  40003734  fmuls  %f5, %f4, %f5        [00001fe1]



Application Level Performance

● High performance provided by GRFPU and parallel integer and floating-point instruction execution ⇒ Overall system level performance increase
● Example: LEON2 + GRFPU/GRFPC running at 100 MHz
  – GRFPU @ 100 MHz: 100 MFLOPS peak FP performance
  – C code: 30 - 40 MFLOPS @ 100 MHz
  – Hand coded assembly: 40 - 70 MFLOPS @ 100 MHz
● Large overall system level performance increase for heavy FP applications
  – A typical GNC application runs 60% faster with GRFPU compared to MEIKO (at the same clock frequency)



Fault Tolerance

● GRFPU and GRFPC are SEU protected by design
● TMR registers
● FP register file is protected using (32, 7) BCH code (SEC/DED)
● Integrated with LEON instruction restart capability
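Triple modular redundancy can be illustrated with a bitwise majority voter: each register bit is stored three times and the output follows any two copies that agree, so a single upset in one copy is masked. The sketch below is a generic software illustration, not the GRFPU VHDL, and `tmr_vote` is a hypothetical name.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three copies of a register value."""
    return (a & b) | (a & c) | (b & c)

stored = 0x3CC90AAF
upset = stored ^ (1 << 7)       # single-event upset flips bit 7 of one copy
print(hex(tmr_vote(stored, upset, stored)))
```

The voter masks one faulty copy per bit position; two upsets in the same bit position of different copies would still produce a wrong output.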



Design methodology and validation

● FP algorithms are highly complex (especially divide and square-root)
● Validation proved to be a very hard task
  – Several cases of bugs in commercial processors were detected after large-scale deployment (Pentium divide bug)
● GRFPU design work:
  – Phase 1: Development of FP algorithms. Correctness, accuracy and convergence of the FP algorithms were mathematically proved
  – Phase 2: Development of a high level FPU model in C and attaching it to the TSIM simulator. FP test programs and real-life software could be run on the model before development of HW started.
  – Phase 3: HW development
  – Phase 4: Test and validation
● Validation performed during several stages of the development work



Design methodology and validation (2)

● TSIM + GRFPU as loadable module

[Figure: TSIM connected to the GRFPU model through the FPU module I/F: fadds(), fdivs(), faddd(), fdivs(), fmuld(), fsqrts(), ...]

● TSIM simulates full functionality of LEON, memory and peripherals
● Provides an interface to attach user-defined FPU model
● Possible to test the FP algorithms before HW implementation started
● Offers high performance (+20 MIPS)
  – Large and exhaustive test programs were run on the C-model (UCBTEST, SoftFloat, IeeeCC754, GNC application)
● Used as golden model in later stages of the development work
● Efficient debugging environment



Design methodology and validation (3)

● Floating-point test programs
  – UCBTEST: Uses number theory to generate hard case test vectors.
  – TestFloat: Checks FPU implementation by comparing it against its own software implementation. Uses large set of test vectors + random data.
  – IeeeCC754: Checks IEEE-754 compliance
● Run on both final implementation as well as a C-model of the GRFPU



Summary

● GRFPU/GRFPC offer significant performance improvement over existing solutions (LEON/MEIKO or ERC32/MEIKO)
● GRFPU compares well against other implementations
● Scheduled for 2 SOC designs
● Portability and FT capabilities make it suitable for long-term space use

