GRFPU - High Performance IEEE-754 Floating-Point Unit
Edvin Catovic, Gaisler Research
[email protected]
GRFPU Overview
● IEEE-754 compliant, supporting single- and double-precision FP numbers
● Primarily developed for use with LEON
● Significant system performance improvement over existing solutions
● Extensively validated
● Fault tolerant
● Written in high-level synthesizable VHDL code
IEEE-754 Standard for Binary Floating-Point Arithmetic
● Formats
  – Single-precision FP: S (bit 31) | EXPONENT (bits 30-23) | FRACTION (bits 22-0)
    $fp_{single} = (-1)^{s} \cdot 1.f \cdot 2^{exp-127}$
  – Double-precision FP: S (bit 63) | EXPONENT (bits 62-52) | FRACTION (bits 51-0)
    $fp_{double} = (-1)^{s} \cdot 1.f \cdot 2^{exp-1023}$
● Operations
  – Arithmetic: addition, subtraction, multiplication, division and square root
  – Comparison
  – Format conversions: FP to integer, integer to FP
● Rounding
  – 4 rounding modes: round to nearest, round to zero, round to +∞, round to -∞
● Exceptions
  – Invalid operation, division by zero, overflow, underflow, inexact
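As a minimal sketch of the two formats on this slide, the following Python snippet unpacks the sign, biased exponent and fraction fields and re-evaluates the value formula for normalized numbers. The function names `decode_single`, `decode_double` and `normal_value` are illustrative and do not come from the slides.

```python
import struct

def decode_single(x):
    """Split x (rounded to single precision) into (s, exp, f) bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def decode_double(x):
    """Split a double-precision x into (s, exp, f) bit fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

def normal_value(s, exp, f, frac_bits, bias):
    """Evaluate (-1)^s * 1.f * 2^(exp - bias); valid for normalized numbers only."""
    return (-1) ** s * (1 + f / (1 << frac_bits)) * 2.0 ** (exp - bias)
```

For example, `decode_single(-6.5)` yields sign 1, biased exponent 129 and fraction 0x500000, and `normal_value(1, 129, 0x500000, 23, 127)` evaluates back to -6.5. Zeros, denormalized numbers, infinities and NaNs use reserved exponent values and are not covered by `normal_value`.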
FPU Design Challenges
● FP algorithms
  – Complexity
  – Correctness and accuracy
  – IEEE-754 compliance
● System architecture
  – HW or SW support ⇒ affects overall system performance
● Hardware design
  – Complex high-precision operations ⇒ large data paths
  – Tradeoffs to achieve high performance/area
● Test and validation
FPU Algorithms
● System-level performance impact
  – Latency and throughput
  – HW support
● Division and square root
  – Lack of support in HW ⇒ overall CPI increase
  – High latency (> 30 clock cycles) ⇒ significant CPI increase ⇒ system performance degradation
● Div/sqrt by digit recurrence
  – Dedicated HW, high latency
● Div/sqrt by functional iteration
  – Low-latency div and sqrt operations
  – Multiplication is the basic step ⇒ multiplier can be shared between mul, div and sqrt
  – Small area overhead ⇒ high performance/area
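To make "division by functional iteration" concrete, here is a minimal sketch using the Newton-Raphson reciprocal iteration, in which every step is a multiplication and a subtraction. The slides do not say which iteration the GRFPU uses, so this is only one well-known example of the technique; the function name `nr_divide`, the linear seed and the iteration count are assumptions made for illustration.

```python
import math

def nr_divide(a, b, iterations=4):
    """Approximate a / b using only multiply and subtract (Newton-Raphson)."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    m, e = math.frexp(abs(b))            # abs(b) = m * 2**e with 0.5 <= m < 1
    x = 48.0 / 17.0 - (32.0 / 17.0) * m  # linear seed for 1/m (constants precomputed)
    for _ in range(iterations):
        x = x * (2.0 - m * x)            # relative error is squared on each pass
    q = math.ldexp(a * x, -e)            # undo the scaling of the divisor
    return -q if b < 0 else q
```

The seed has a relative error of at most 1/17, so four passes bring the error below double-precision resolution. The sketch omits the final correction and rounding step that an IEEE-754 compliant unit needs, so its result is close to, but not guaranteed identical to, the correctly rounded quotient.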
GRFPU Architectural Features
● Implements all SPARC V8 FP operations
● Efficient implementation of complex FDIV and FSQRT operations
● Low latency and high throughput
● Special handling of denormalized numbers
  – Operations on denormalized inputs deferred to software
  – Tiny results flushed to zero (allowed by IEEE-754)
  – Fast non-IEEE mode
● Fast FP multiplier
  – Capable of performing multiplication on two DP operands ⇒ low-latency mul
  – Adapted for division and square root ⇒ good performance/area tradeoff
  – Non-blocking division and square root
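To illustrate what flushing tiny results to zero means for single precision, here is a small sketch. `flush_to_zero_single` is a hypothetical helper that models only the observable effect on a result value, not the GRFPU data path.

```python
import math

MIN_NORMAL_SINGLE = 2.0 ** -126   # smallest normalized single-precision magnitude

def flush_to_zero_single(x):
    """Return x, or a signed zero if |x| is below the normalized single range."""
    if x != 0.0 and abs(x) < MIN_NORMAL_SINGLE:
        return math.copysign(0.0, x)   # keep the sign of the tiny result
    return x
```

A result such as 1e-40, which would be a denormalized single-precision number, becomes +0.0, while values at or above 2^-126 in magnitude pass through unchanged.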
GRFPU – Logical View
[Figure: logical view of the GRFPU. Blocks: pipelined execution unit, iteration unit. Signals: clk, opcode, operand1, operand2, id, result, exceptions.]
● All SPARC V8 FP operations
● FADD, FSUB, FMUL, FCMP and CONV are fully pipelined
● Separate non-blocking iteration unit (FDIV and FSQRT)
GRFPU Operation Timing Example
● Fully pipelined operations (FADD, FSUB, FMUL, FCMP, CONV)
  Clock cycle   1      2      3      4      5      6
  Stage 1       FADDD  FMULD  FSUBD
  Stage 2              FADDD  FMULD  FSUBD
  Stage 3                     FADDD  FMULD  FSUBD
● Fully pipelined operations interleaved with FDIV (or FSQRT)
  [Figure: timing diagram over clock cycles 1-15 with rows Stage 1, Stage 2 / Iter. Stage and Stage 3, showing an FDIVS occupying the iteration stage while FADDS and FMULD pass through the pipeline.]
  – FDIV and FSQRT are non-blocking operations
  – All other operations can be interleaved with FDIV or FSQRT
  – Operations can complete out of order
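The out-of-order completion described on this slide can be sketched with a very small timing model. It assumes one instruction is issued per clock cycle, a latency of 3 cycles for the pipelined operations and 15 cycles for FDIVS, and it ignores data dependencies and any contention for the result path. The names `LATENCY`, `completion_cycles` and `completion_order` are hypothetical.

```python
LATENCY = {"FADDS": 3, "FMULD": 3, "FDIVS": 15}  # cycles, assumed for this sketch

def completion_cycles(program):
    """Issue one instruction per cycle from cycle 1; return (name, issue, last_cycle)."""
    out = []
    for issue, name in enumerate(program, start=1):
        # last_cycle is the final cycle in which the instruction occupies the unit
        out.append((name, issue, issue + LATENCY[name] - 1))
    return out

def completion_order(program):
    """Instruction names sorted by the cycle in which they leave the unit."""
    timed = sorted(completion_cycles(program), key=lambda t: t[2])
    return [name for name, _, _ in timed]
```

With the sequence FDIVS, FADDS, FMULD the model reports that FADDS (cycle 4) and FMULD (cycle 5) finish long before the FDIVS issued ahead of them (cycle 15).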
Performance
● Throughput and latency (in clock cycles)
  OPERATION                          THROUGHPUT   LATENCY
  FADDD, FSUBD, FMULD, COMP, CONV    1            3
  FDIVS                              15           15
  FDIVD                              16           16
  FSQRTS                             23           23
  FSQRTD                             24           24
● Frequency
  – 250 MHz on 0.13 um standard-cell ASIC process
  – 65 MHz on Virtex-II FPGA
● Area
  – 100 kgates on ASIC
  – 8500 LUTs on Virtex-II FPGA
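Reading the THROUGHPUT column as the number of cycles between successive issues of the same operation (an interpretation, since the slide does not define it), the time for a run of independent operations can be estimated as follows. `cycles_for` is a hypothetical helper, not part of any GRFPU tooling.

```python
def cycles_for(n, issue_interval, latency):
    """Cycles until the last of n independent, back-to-back operations finishes."""
    if n <= 0:
        return 0
    # the first result appears after `latency` cycles, each further one
    # follows `issue_interval` cycles later
    return latency + (n - 1) * issue_interval
```

For three pipelined operations (interval 1, latency 3) this gives 5 cycles, against 45 cycles for three FDIVS (interval 15, latency 15).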
GRFPU Block Diagram
[Figure: GRFPU block diagram. Block labels: UNPACK; FPOP DECODE AND CONTROL; ALIGNMENT ADDER LOGIC; ADDER; APPROX TABLES; LZ CNT; BOOTH ENCODER; WALLACE TREE; SHIFTER; POSTNORM / ROUNDER (two instances); ITERATION BUFFER AND CTRL; INTERM. RESULT.]
FPU Comparison
● Table shows throughput and (latency)
  FPU          FADDD   FMULD    FDIVD     FREQ (MHz)   AREA (kgates)   COMMENTS
  GRFPU        1 (3)   1 (3)    16 (16)   250          100             0.13 um, synthesis
  ARM VFP9-S   1 (4)   2 (5)    28 (31)   140          100             0.18 um, synthesis
  ARM VFP11    1 (5)   2 (10)   29 (33)   350          100             0.13 um, hard-block
  AMD K7       1 (2)   1 (4)    17 (20)   500          ?               0.13 um, hard-block
  MEIKO        8       10       50        140          25              0.18 um, synthesis
● GRFPU compares well against other FPUs on the market
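Because the FPUs in the table run at different clock rates, cycle counts alone do not give the wall-clock latency. A minimal sketch of the conversion, assuming the FREQ column is in MHz; `latency_ns` is a hypothetical helper.

```python
def latency_ns(cycles, freq_mhz):
    """Wall-clock latency in nanoseconds for a given cycle count and clock rate."""
    return cycles * 1000.0 / freq_mhz
```

For the FDIVD column this gives 64 ns for GRFPU (16 cycles at 250 MHz) and 40 ns for the AMD K7 (20 cycles at 500 MHz).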
GRFPU Controller - GRFPC
● GRFPC provides an interface between LEON and GRFPU
● Schedules SPARC FPOPs for execution on GRFPU
● Handles FP register file (32 x 32-bit FP registers)
● Exception handling (FP status register, FP deferred queue)
● Parallel execution of FP and integer operations
  – FP operations do not block the IU pipeline and vice versa
  – FP load and store handled by IU
● Out-of-order execution of FP instructions
● Full compliance with SPARC V8 instruction scheduling and trap model
GRFPC Block Diagram
[Figure: GRFPC block diagram. Pipeline stages: decode stage, register file stage, execution stage(s), write-back stage. Block labels: DECODE; FP REGFILE; FORWARD; EDAC; GRFPU; LOW LATENCY INST BUFFER; HIGH LATENCY INST BUFFER; EXC CTRL; WB CTRL; FQ; FSR. Path labels: load, store, from WB, from inst buffers, to RF, to fwd.]
Instruction Trace Example
  TIME        ADDRESS    INSTRUCTION                  RESULT
  262843492   40003700   fsubs  %f6, %f3, %f6         [10000060]
  262843493   40003704   ld     [%o0 + 0xc], %f5      [3cc90aaf]
  262843494   40003708   fmuls  %f2, %f4, %f2         [00000054]
  262843495   4000370c   fmuls  %f5, %f6, %f3         [00001fe1]
  262843499   40003710   fsubs  %f2, %f3, %f2         [00000054]
  262843503   40003714   st     %f2, [%o7 + %o3]      [40014dd8]
  262843504   40003718   ld     [%o1 + %i5], %f3      [c1200000]
  262843505   4000371c   add    %o7, 8, %o7           [00000028]
  262843506   40003720   ld     [%o1 + %i0], %f4      [c1200000]
  262843507   40003724   add    %i5, 8, %i5           [00000418]
  262843510   40003728   fsubs  %f4, %f3, %f4         [4000a800]
  262843511   4000372c   ld     [%o0 + 0x8], %f2      [3f7fec43]
  262843515   40003730   fmuls  %f2, %f6, %f2         [00000054]
  262843516   40003734   fmuls  %f5, %f4, %f5         [00001fe1]
Application Level Performance
● High performance provided by GRFPU and parallel integer and floating-point instruction execution ⇒ overall system-level performance increase
● Example: LEON2 + GRFPU/GRFPC running at 100 MHz
  – GRFPU @ 100 MHz: 100 MFLOPS peak FP performance
  – C code: 30-40 MFLOPS @ 100 MHz
  – Hand-coded assembly: 40-70 MFLOPS @ 100 MHz
● Large overall system-level performance increase for heavy FP applications
  – A typical GNC application runs 60% faster with GRFPU compared to MEIKO (at the same clock frequency)
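The peak figure follows from one pipelined FP operation completing per clock cycle, and the measured figures can be read as a fraction of that peak. A minimal sketch of the arithmetic; `peak_mflops` and `utilization` are hypothetical helper names.

```python
def peak_mflops(clock_mhz, ops_per_cycle=1):
    """Peak rate if ops_per_cycle FP operations complete every cycle."""
    return clock_mhz * ops_per_cycle

def utilization(measured_mflops, clock_mhz):
    """Fraction of the peak rate that a measured figure represents."""
    return measured_mflops / peak_mflops(clock_mhz)
```

At 100 MHz the peak is 100 MFLOPS, so the C-code figures correspond to 30-40% of peak and the hand-coded assembly figures to 40-70%.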
Fault Tolerance
● GRFPU and GRFPC are SEU protected by design
● TMR registers
● FP register file is protected using (32, 7) BCH code (SEC/DED)
● Integrated with LEON instruction restart capability
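The principle behind TMR (triple modular redundancy) registers is a bitwise majority vote over three copies of the stored value. The sketch below shows only that voting step in software; `tmr_vote` is a hypothetical name and the code says nothing about how the voters are implemented in the GRFPU netlist.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three register copies."""
    return (a & b) | (a & c) | (b & c)
```

If a single-event upset flips one bit in one copy, for example `tmr_vote(0x3CC90AAF ^ 0x400, 0x3CC90AAF, 0x3CC90AAF)`, the vote still returns 0x3CC90AAF. Upsets in two different copies are also masked as long as they hit different bit positions.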
Design methodology and validation
● FP algorithms are highly complex (especially divide and square root)
● Validation has proved to be a very hard task
  – Several cases of bugs in commercial processors were detected after large-scale deployment (Pentium divide bug)
● GRFPU design work:
  – Phase 1: Development of FP algorithms. Correctness, accuracy and convergence of the FP algorithms were mathematically proved.
  – Phase 2: Development of a high-level FPU model in C and attaching it to the TSIM simulator. FP test programs and real-life software could be run on the model before development of HW started.
  – Phase 3: HW development
  – Phase 4: Test and validation
● Validation performed during several stages of the development work
Design methodology and validation (2)
● TSIM + GRFPU as loadable module
  [Figure: TSIM and the GRFPU model connected through the FPU module I/F: fadds(), fdivs(), faddd(), fdivs(), fmuld(), fsqrts(), ...]
● TSIM simulates full functionality of LEON, memory and peripherals
● Provides an interface to attach a user-defined FPU model
● Possible to test the FP algorithms before HW implementation started
● Offers high performance (+20 MIPS)
  – Large and exhaustive test programs were run on the C model (UCBTEST, SoftFloat, IeeeCC754, GNC application)
● Used as golden model in later stages of the development work
● Efficient debugging environment
Design methodology and validation (3)
● Floating-point test programs
  – UCBTEST: Uses number theory to generate hard-case test vectors.
  – TestFloat: Checks an FPU implementation by comparing it against its own software implementation. Uses a large set of test vectors + random data.
  – IeeeCC754: Checks IEEE-754 compliance
● Run on both the final implementation and a C model of the GRFPU
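The comparison-based approach can be sketched as a small harness that feeds directed and random operand pairs to a unit under test and to a reference model, then reports every bit-level mismatch. This is not the interface of TestFloat or of any tool named on the slide. All names here are hypothetical, and `buggy_fmuls` is a deliberately broken stand-in used only to show that the harness detects errors.

```python
import random
import struct

def to_bits(x):
    """Bit pattern of x rounded to single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def from_bits(b):
    """Value of a single-precision bit pattern."""
    return struct.unpack(">f", struct.pack(">I", b))[0]

def golden_fmuls(a_bits, b_bits):
    """Reference single-precision multiply (operands kept small, so no overflow)."""
    return to_bits(from_bits(a_bits) * from_bits(b_bits))

def buggy_fmuls(a_bits, b_bits):
    """Hypothetical faulty unit: always clears the least significant result bit."""
    return golden_fmuls(a_bits, b_bits) & ~1

def make_vectors(n, seed=0):
    """Two directed cases followed by n random operand pairs."""
    rng = random.Random(seed)
    directed = [(to_bits(1.0), 0x3F800001), (to_bits(-2.0), to_bits(0.5))]
    rand = [(to_bits(rng.uniform(-1e3, 1e3)), to_bits(rng.uniform(-1e3, 1e3)))
            for _ in range(n)]
    return directed + rand

def compare(dut, golden, vectors):
    """Return the operand pairs on which dut and golden disagree."""
    return [(a, b) for a, b in vectors if dut(a, b) != golden(a, b)]
```

Calling `compare(buggy_fmuls, golden_fmuls, make_vectors(1000))` returns a non-empty list that includes the directed case 1.0 x 0x3F800001, whose correct result has its least significant bit set. Comparing the reference against itself returns an empty list.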
Summary
● GRFPU/GRFPC offer significant performance improvement over existing solutions (LEON/MEIKO or ERC32/MEIKO)
● GRFPU compares well against other implementations
● Scheduled for 2 SOC designs
● Portability and FT capabilities make it suitable for long-term space use