Oct 2, 2006 - Beyond caches and domain decomposition ... Arithmetic is cheap, Communication is expensive. • Arithmetic ... Local Register. Time. Cost*.
ICCD: 1
Oct 2, 2006
[Figure: Perf (ps/Inst) and Linear (ps/Inst) against year, 1980 to 2020, on a log scale from 1e+7 down to 1e-4.]
Dally et al., “The Last Classical Computer”, ISAT Study, 2001
Source: S. Borkar, Intel
[Figure labels: 90nm Chip ($200, 1GHz), 12mm; 64-bit FPU (to scale), 0.5mm; 1 clock; Decreasing BW; Increasing power.]
*Cost of providing 1GW/s of bandwidth. All numbers approximate.
[Figure: hierarchy of Global Memory, Switch, LM, CM, Switch, RM (x3), Switch (x3), R (x9), A (x9).]
loop over cells
  flux[i] = ...
loop over cells
  ... = f(flux[i],...)
Flux passed through SRF, no memory traffic
Explicit re-use of Cells, no misses
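The two-loop pattern above can be sketched in Python. This is a minimal illustration, not the presenter's code: `compute_flux`, `update`, and the strip size are hypothetical stand-ins. It shows only that when the two loops are fused strip by strip, each flux value is consumed from a small local buffer (standing in for the stream register file, SRF) instead of a full-size array in memory.

```python
# Hypothetical stand-ins for the two kernels; not taken from the slides.
def compute_flux(cell):
    return 2.0 * cell


def update(cell, flux):
    return cell + flux


def two_pass(cells):
    """Baseline: the whole flux array is written out, then read back."""
    flux = [compute_flux(c) for c in cells]              # first loop over cells
    return [update(c, f) for c, f in zip(cells, flux)]   # second loop over cells


def streamed(cells, strip=4):
    """Strip-mined version: flux lives only in a strip-sized local buffer."""
    out = []
    peak_buffer = 0
    for start in range(0, len(cells), strip):
        block = cells[start:start + strip]        # explicit load of one strip
        flux = [compute_flux(c) for c in block]   # producer kernel
        peak_buffer = max(peak_buffer, len(flux))
        out.extend(update(c, f) for c, f in zip(block, flux))  # consumer kernel
    return out, peak_buffer


cells = [float(i) for i in range(10)]
result, peak = streamed(cells, strip=4)
assert result == two_pass(cells)   # same answer, smaller intermediate storage
```

The intermediate buffer never exceeds the strip size, which is the property the slide attributes to passing flux through the SRF.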
All needed data and instructions on-chip, no misses
99% hit rate, 1 miss costs 100s of cycles, 10,000s of ops
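A rough worked calculation shows why a 99% hit rate can still be costly. Only the 99% figure comes from the slide. The 1-cycle hit, the 300-cycle miss ("100s of cycles"), and the 100 operations per cycle are assumptions chosen to fall within the ranges the slide quotes.

```python
def average_access_cycles(hit_rate, hit_cycles, miss_cycles):
    """Expected cycles per access for a single-level cache."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


hit_rate = 0.99       # from the slide
hit_cycles = 1        # assumption
miss_cycles = 300     # assumption, within "100s of cycles"
ops_per_cycle = 100   # assumption, so one miss forgoes "10,000s of ops"

avg = average_access_cycles(hit_rate, hit_cycles, miss_cycles)
miss_share = (1.0 - hit_rate) * miss_cycles / avg   # fraction of time spent on misses
ops_lost_per_miss = miss_cycles * ops_per_cycle

print(f"average cycles per access: {avg:.2f}")
print(f"share of access time due to misses: {miss_share:.0%}")
print(f"operations forgone per miss: {ops_lost_per_miss}")
```

Under these assumptions the average access takes roughly four cycles, and about three quarters of that time comes from the 1% of accesses that miss.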
SW Pipeline
[Figure: software-pipelined schedule with one iteration highlighted, axis ticks 0 to 120.]
ComputeCellInt kernel from StreamFem3D
Over 95% of peak with simple hardware
Depends on explicit communication to make delays predictable
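Software pipelining can be sketched as a static schedule that overlaps stages of successive iterations. The three stage names and the one-time-step latency per stage below are assumptions for illustration, and the ComputeCellInt kernel itself is not reproduced. Such a schedule can only be fixed ahead of time when every stage has a known latency, which is the role the slide assigns to explicit communication.

```python
def software_pipeline(n_iters, stages):
    """Build a static schedule: at time t, stage s works on iteration t - s."""
    depth = len(stages)
    schedule = []
    for t in range(n_iters + depth - 1):
        step = [(stages[s], t - s) for s in range(depth) if 0 <= t - s < n_iters]
        schedule.append(step)
    return schedule


def utilization(n_iters, depth):
    """Fraction of stage slots doing useful work, including fill and drain."""
    busy = n_iters * depth
    slots = (n_iters + depth - 1) * depth
    return busy / slots


stages = ["load", "compute", "store"]   # hypothetical stage names
schedule = software_pipeline(4, stages)
for t, step in enumerate(schedule):
    print(t, step)
```

With these assumptions, utilization approaches 1 as the iteration count grows, because the fill and drain steps are amortized over more iterations.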
StreamFEM application
[Figure: stream-flow diagram. Labels: Read-Only Table Lookup Data (Master Element), Elements (Current), Gather Cell, Gathered Elements, Element Faces, Compute Flux States, Face Geometry, Compute Numerical Flux, Numerical Flux, Cell Geometry, Cell Orientations, Compute Cell Interior, Advance Cell, Elements (New).]
Prefetching, reuse, use/def, limited spilling
Node memory
void __task matmul::leaf( __in float A[M][P],
                          __in float B[P][N],
                          __inout float C[M][N] )
{
  for (int i=0; i