Computer Architecture in the Many-Core Era

Oct 2, 2006 - Beyond caches and domain decomposition ... Arithmetic is cheap, Communication is expensive. • Arithmetic ... Local Register. Time. Cost*.
ICCD: 1

Oct 2, 2006


[Figure: Perf (ps/Inst) and Linear (ps/Inst) on a log scale from 1e+7 to 1e-4, plotted for 1980 to 2020.]
Source: Dally et al., “The Last Classical Computer”, ISAT Study, 2001

Source: S. Borkar, Intel


[Figure: 90nm chip, $200, 1GHz, 12mm, with a 64-bit FPU (to scale) at 0.5mm and the distance covered in 1 clock. Annotations: decreasing BW, increasing power.]

*Cost of providing 1GW/s of bandwidth. All numbers approximate.


[Figure: Global Memory, Switch, LM, CM, Switch, RM ×3, Switch ×3, R ×9, A ×9]


loop over cells
    flux[i] = ...
loop over cells
    ... = f(flux[i],...)

Flux passed through SRF, no memory traffic
Explicit re-use of Cells, no misses
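The two-loop pattern above can be sketched in plain C++ to show what "passed through SRF" buys. This is a minimal illustration, not the deck's implementation: `compute_flux`, `apply_flux`, and the strip size are hypothetical stand-ins (the slide elides the real per-cell work), SRF is read here as the stream register file, and a small fixed-size buffer plays its role.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-cell kernels; the slide elides the real computations.
inline double compute_flux(double cell) { return 0.5 * cell; }
inline double apply_flux(double cell, double flux) { return cell + flux; }

// Two full loops: flux[] is a full-size array that round-trips through memory.
std::vector<double> update_two_pass(const std::vector<double>& cells) {
    std::vector<double> flux(cells.size());
    for (std::size_t i = 0; i < cells.size(); ++i)
        flux[i] = compute_flux(cells[i]);
    std::vector<double> out(cells.size());
    for (std::size_t i = 0; i < cells.size(); ++i)
        out[i] = apply_flux(cells[i], flux[i]);
    return out;
}

// Strip-mined version: flux only ever lives in a small buffer that stands in
// for on-chip storage, so no full-size flux array is written or re-read.
std::vector<double> update_streamed(const std::vector<double>& cells,
                                    std::size_t strip = 64) {
    assert(strip > 0);
    std::vector<double> out(cells.size());
    std::vector<double> flux_strip(strip);
    for (std::size_t base = 0; base < cells.size(); base += strip) {
        const std::size_t len = std::min(strip, cells.size() - base);
        for (std::size_t i = 0; i < len; ++i)
            flux_strip[i] = compute_flux(cells[base + i]);
        for (std::size_t i = 0; i < len; ++i)
            out[base + i] = apply_flux(cells[base + i], flux_strip[i]);
    }
    return out;
}
```

Both functions produce the same values; the difference is that the streamed version's intermediate storage is bounded by the strip size rather than by the number of cells.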


All needed data and instructions on-chip, no misses


99% hit rate; 1 miss costs 100s of cycles, 10,000s of ops
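The step from "100s of cycles" to "10,000s of ops" is a multiplication by the number of arithmetic units that could have been busy each cycle. A small sketch of that arithmetic follows; the 300-cycle penalty and 100 ops per cycle used in the note below are illustrative assumptions, not figures from the slides.

```cpp
#include <cassert>

// Average stall cycles that misses add to each memory access.
inline double avg_stall_cycles(double hit_rate, double miss_penalty_cycles) {
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return (1.0 - hit_rate) * miss_penalty_cycles;
}

// Arithmetic operations forgone while a single miss is outstanding,
// assuming the chip could otherwise issue ops_per_cycle operations.
inline double ops_lost_per_miss(double miss_penalty_cycles, double ops_per_cycle) {
    return miss_penalty_cycles * ops_per_cycle;
}
```

Under those assumed values, `ops_lost_per_miss(300.0, 100.0)` gives 30,000 operations for one miss, and `avg_stall_cycles(0.99, 300.0)` comes to roughly 3 cycles per access even at a 99% hit rate.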


[Figure: SW pipeline schedule shown beside one iteration, axis ticks 0 to 120.]
ComputeCellInt kernel from StreamFem3D
Over 95% of peak with simple hardware
Depends on explicit communication to make delays predictable

[Figure: StreamFEM application dataflow.]
Kernels: Compute Flux States, Compute Numerical Flux, Gather Cell, Compute Cell Interior, Advance Cell
Streams: Element Faces, Gathered Elements, Face Geometry, Numerical Flux, Cell Geometry, Cell Orientations, Elements (Current), Elements (New)
Read-Only Table Lookup Data (Master Element)
Prefetching, reuse, use/def, limited spilling

Node memory

void __task matmul::leaf( __in float A[M][P],
                          __in float B[P][N],
                          __inout float C[M][N] )
{
    for (int i=0; i
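The leaf task is cut off in the source. As a hedged illustration only, here is a plain C++ sketch of what a matrix-multiply leaf with this argument shape would typically compute, assuming it accumulates `C += A * B` on blocks that already sit in the lowest memory level. The name `matmul_leaf` and the template parameters are hypothetical; the `__task`, `__in`, and `__inout` qualifiers from the slide are not standard C++ and are dropped.

```cpp
#include <cassert>
#include <cstddef>

// Plain C++ stand-in for a leaf task: accumulate C += A * B on small blocks.
// M, P, N are the block dimensions, matching A[M][P], B[P][N], C[M][N].
template <std::size_t M, std::size_t P, std::size_t N>
void matmul_leaf(const float (&A)[M][P],
                 const float (&B)[P][N],
                 float (&C)[M][N]) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < P; ++k)
                C[i][j] += A[i][k] * B[k][j];  // read-only A and B, read-write C
}
```

The const references mirror the read-only role of `A` and `B`, and the mutable reference mirrors the read-write role of `C`; a caller that tiles a larger matrix would invoke this once per block triple.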
