A Multigrain Parallelizing Compiler with Power Control for Multicore Processors
Hironori Kasahara
Professor, Department of Computer Science
Director, Advanced Chip-Multiprocessor Research Institute
Waseda University, Tokyo, Japan
http://www.kasahara.cs.waseda.ac.jp
Feb. 5, 2008, at Google Headquarters
Hironori Kasahara

<Personal History>
B.S. (1980, Waseda), M.S. (1982, Waseda), Ph.D. (1985, EE, Waseda). Research Associate (1983, Waseda), JSPS Special Research Fellow (1985), Visiting Scholar (1985, Univ. of California at Berkeley), Assistant Professor (1986, Waseda), Associate Professor (1988, Waseda), Visiting Research Scholar (1989-1990, Center for Supercomputing R&D, Univ. of Illinois at Urbana-Champaign), Professor (1997-, Dept. of Computer Science, Waseda). IFAC World Congress Young Author Prize (1987), IPSJ Sakai Memorial Special Award (1997), STARC Industry-Academia Cooperative Research Award (2004).

<Activities for Societies>
IPSJ: SIG Computer Architecture (Chair), Transactions of IPSJ Editorial Board (HG Chair), Journal of IPSJ Editorial Board (HWG Chair), 2001 Journal of IPSJ Special Issue on Parallel Processing (Chair of Editorial Board / Guest Editor), JSPP2000 (Program Chair), etc.
ACM: International Conference on Supercomputing (ICS) Program Committee (including '96 ENIAC 50th Anniversary Co-Program Chair).
IEEE: Computer Society Japan Chapter Chair, Tokyo Section Board Member, SC07 PC.
OTHER: PCs of many conferences on supercomputing and parallel processing.
METI: IT Policy Proposal Forum (Architecture/HPC WG Chair), Super Advanced Electronic Basis Technology Investigation Committee.
NEDO: Millennium Project IT21 "Advanced Parallelizing Compiler" (Project Leader), Computer Strategy WG (Chair), Multicore for Real-time Consumer Electronics Project Leader, project evaluation committees, etc.
MEXT: Earth Simulator project evaluation committee, 10 PFLOPS supercomputer evaluation committee.
JAERI: Research accomplishment evaluation committee, CCSE 1st class invited researcher.
JST: Scientific Research Fund Sub-Committee, COINS Steering Committee, Precursory Research for Embryonic Science and Technology (Research Area Adviser).
Cabinet Office: CSTP Expert Panel on Basic Policy, Information & Communication Field Promotion Strategy, R&D Infrastructure WG, Software & Security WG.

<Papers>
158 papers, 64 invited talks, 114 technical reports, 25 symposium papers, 139 newspaper/TV/web news/magazine items; IEEE Trans. Computers, IPSJ Trans., ISSCC, Cool Chips, Supercomputing, ACM ICS, etc.
Multi-core Everywhere
Multi-core from embedded to supercomputers

- Consumer Electronics (Embedded): mobile phones, games, digital TVs, car navigation, DVD, cameras. IBM/Sony/Toshiba Cell, Fujitsu FR1000, NEC/ARM MPCore & MP211, Panasonic UniPhier, Renesas SH multi-core (4-core RP1, 8-core RP2), Tilera Tile64, SPI Storm-1 (16 VLIW cores).
  (Photo: OSCAR-type multi-core chip by Renesas in the METI/NEDO Multicore for Real-time Consumer Electronics Project, Leader: Prof. Kasahara, Jul. 30, 2007.)
- PCs, Servers: Intel Quad Xeon, Core 2 Quad, Montvale, Tukwila, 80-core; AMD Quad-Core Opteron, Phenom.
- WSs, Deskside & High-end Servers: IBM Power4, 5, 5+, 6; Sun Niagara (SPARC T1, T2), Rock.
- Supercomputers: Earth Simulator (40 TFLOPS, 2002, 5120 vector processors); IBM Blue Gene/L (360 TFLOPS, 2005, low-power CMP-based, 128K processor chips, installed at Lawrence Livermore National Laboratory); BG/P in 2008.
  (Figure labels, translated from Japanese: a supercomputer based on low-power chip multiprocessors, scheduled to start operation in April 2005; 2 processors per chip, 1024 processor chips per cabinet, 65,536 processor chips, 128K processors, 360 TFLOPS = 360 trillion floating-point operations per second.)

High-quality application software, productivity, cost performance, and low power consumption are important, e.g., for mobile phones and games. Compiler-cooperative multi-core processors are promising for realizing these features.
Market of Consumer Electronics: 1 Trillion Dollars in 2010 (Worldwide)

(Chart, Japanese labels translated: worldwide market size in shipment value from '02 to '06, on the order of 12-16 trillion yen in total, with individual product lines (mobile phones, flat-panel TVs, digital cameras, digital video, DVD recorders, car navigation) each in the range of roughly 0.5-2 trillion yen.)

Annual growth rates ('03 vs. '07 shipments):
  Digital still cameras (M units):        49   ->  76    (12% per year)
  Digital TVs (M units):                   6   ->  27    (45% per year)
  DVD recorders (M units):                 3.6 ->  33    (74% per year)
  DVDs for PCs, recordable (M units):     27   -> 114    (43% per year)
  Mobile phones (M units):               490   -> 670    ( 8% per year)
  Semiconductor demand for cars (B$):     14.0 ->  20.9  (11% per year)

Source: NEDO Roadmap Report Meeting, Electronics and Information Technology Development Department, "Technology Development Strategy," May 11, 2005.
Roadmap of Compiler-Cooperative Multicore Projects

(Timeline, 2000-2010:)
- 2000-2003, Millennium Project IT21: NEDO Advanced Parallelizing Compiler (Waseda Univ., Fujitsu, Hitachi, JIPDEC, AIST). Compiler development for multiprocessor servers, applied to supercomputer compilers.
- STARC Compiler-Cooperative Chip Multiprocessor: basic research through practical research (partners in different phases: Waseda, Toshiba, Fujitsu, Panasonic, Sony; Waseda, Fujitsu, NEC, Toshiba; Waseda Univ., Fujitsu, NEC, Toshiba, Panasonic, Sony). STARC: Semiconductor Technology Academic Research Center (Fujitsu, Toshiba, NEC, Renesas, Panasonic, Sony, etc.).
- NEDO (2004.07-2007.06): Heterogeneous Multiprocessor (Waseda Univ., Hitachi). Architecture and compiler R&D from plan toward practical use.
- NEDO (2005.06-2008.03): Multicore Technology for Real-time Consumer Electronics (Waseda Univ., Hitachi, Renesas, Fujitsu, NEC, Toshiba, Panasonic). Power-saving multicore architecture, parallelizing compiler, API; plan, soft real-time multicore architecture/compiler/API R&D, then practical use.
- NEDO (2007.02-2010.03): Heterogeneous Multicore for Consumer Electronics (Waseda Univ., Hitachi, Renesas, Tokyo Inst. of Tech.). Heterogeneous multicore architecture and compiler R&D, from plan toward practical use.
METI/NEDO Advanced Parallelizing Compiler (APC) Technology Project: Background and Problems
Millennium Project IT21, Sep. 8, 2000 - Mar. 31, 2003 (Waseda Univ., Fujitsu, Hitachi, AIST)

Background:
① Adoption of parallel processing as a core technology from PCs to HPC
② Increasing importance of software in IT
③ Need to improve cost-performance and usability

(Figure: hardware peak performance vs. effective performance of HPC systems, 1980-2000, on a scale from 1G to 1T; effective performance grows more slowly, so the gap keeps widening, motivating improvement of ① effective performance, ② cost-performance, and ③ ease of use.)

Contents of research and development:
① R&D of an advanced parallelizing compiler: multigrain parallelization, data localization, overhead hiding
② R&D of performance evaluation technology for parallelizing compilers
Goal: double the effective performance.

Ripple effects:
① Development of competitive next-generation PCs and HPC
② Putting the innovative automatic parallelizing compiler technology to practical use
③ Development and market acquisition of future single-chip multiprocessors
④ Boosting R&D in many fields: IT, biotechnology, devices, earth environment, next-generation VLSI design, financial engineering, weather forecasting, new clean energy, space development, automobiles, electronic commerce, etc.
Performance of the APC Compiler on an IBM pSeries 690 16-Processor High-end Server

- IBM XL Fortran for AIX Version 8.1
  - Sequential execution: -O5 -qarch=pwr4
  - Automatic loop parallelization: -O5 -qsmp=auto -qarch=pwr4
  - OSCAR compiler: -O5 -qsmp=noauto -qarch=pwr4 (su2cor: -O4 -qstrict)

(Bar chart: speedup ratios of XL Fortran automatic loop parallelization (max) vs. the APC/OSCAR compiler (max) for SPEC95 benchmarks (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 benchmarks (swim, mgrid, applu, apsi), annotated with execution times. The APC compiler gave 3.5 times speedup on average over XL Fortran.)
Performance of Multigrain Parallel Processing for 102.swim on the IBM pSeries 690

(Line chart: speedup ratio vs. number of processors, 1-16, for XLF automatic parallelization (XLF(AUTO)) and the OSCAR compiler; the vertical axis runs from 0 to 16.)
OSCAR Parallelizing Compiler
- Improves effective performance, cost-performance, and productivity, and reduces power consumption
  - Multigrain Parallelization: exploitation of parallelism from the whole program, using coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism.
  - Data Localization: automatic data distribution for distributed shared memory, cache, and local memory on multiprocessor systems.
  - Data Transfer Overlapping: data transfer overhead hiding by overlapping task execution and data transfer using DMA or data pre-fetching.
  - Power Reduction: reduction of power consumption by compiler control of frequency, voltage, and power shutdown with hardware support.
Generation of Coarse-Grain Tasks
Macro-tasks (MTs):
- Block of Pseudo Assignments (BPA): basic block (BB)
- Repetition Block (RB): outermost natural loop
- Subroutine Block (SB): subroutine

(Figure: the total program is decomposed hierarchically. At the 1st layer the program splits into BPA, RB, and SB macro-tasks; BPAs are parallelized at near-fine grain, RBs by loop-level parallelization or near-fine-grain parallelization of the loop body, and SBs by coarse-grain parallelization. Each RB and SB is again decomposed into BPA/RB/SB macro-tasks at the 2nd and 3rd layers.)

A hand-written C fragment illustrating the three kinds of macro-tasks follows.
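As a rough illustration of these three kinds of macro-tasks (not compiler output; the routine and loop bodies are invented for the example):

    #include <stdio.h>

    #define N 1000

    /* SB: a whole subroutine call becomes one macro-task
     * (the routine itself is made up for the illustration). */
    static void init(double *a, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = (double)i;
    }

    int main(void)
    {
        static double a[N], b[N];
        double sum;

        init(a, N);                   /* SB : Subroutine Block                   */

        for (int i = 0; i < N; i++)   /* RB : Repetition Block (outermost loop); */
            b[i] = 2.0 * a[i];        /*      loop-level / near-fine-grain       */
                                      /*      parallelism inside the loop body   */

        sum = b[0] + b[N - 1];        /* BPA: Block of Pseudo Assignments        */
        printf("%f\n", sum);          /*      (straight-line basic block)        */
        return 0;
    }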
Earliest Executable Condition Analysis for Coarse-Grain Tasks (Macro-Tasks)

The compiler analyzes data dependences, control flow, and conditional branches among macro-tasks (BPA: block of pseudo assignment statements, RB: repetition block).

(Figures: a Macro Flow Graph showing the original control flow, conditional branches, and data dependences among macro-tasks 1-15, and the derived Macro Task Graph, whose edges represent data dependences and extended control dependences combined with AND/OR of branch conditions. A macro-task can start as soon as its earliest executable condition is satisfied, i.e., once its data-dependence predecessors have finished and the relevant branches have been resolved, without waiting for the branching macro-task itself to complete.)

A simplified code sketch of this condition is given below.
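The graphs themselves cannot be reproduced from the extracted text, but the condition they encode can be sketched in code. The structure below is a deliberately simplified illustration, assuming one optional controlling branch per macro-task, whereas the real analysis builds AND/OR condition trees over the macro flow graph:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_PRED 8

    /* Simplified macro-task record: data-dependence predecessors plus one
     * optional controlling branch. */
    typedef struct MacroTask {
        int  n_data_pred;
        struct MacroTask *data_pred[MAX_PRED]; /* tasks producing our inputs    */
        struct MacroTask *branch;              /* task whose branch selects us  */
        bool branch_taken_toward_us;           /* set when that branch resolves */
        bool finished;
    } MacroTask;

    /* Earliest executable condition (simplified): all data predecessors have
     * finished AND, if the task is control-dependent on a branch, that branch
     * has already been decided in our direction.  Note that we wait only for
     * the branch decision, not for the branching macro-task to complete. */
    static bool earliest_executable(const MacroTask *mt)
    {
        for (int i = 0; i < mt->n_data_pred; i++)
            if (!mt->data_pred[i]->finished)
                return false;
        if (mt->branch != NULL && !mt->branch_taken_toward_us)
            return false;
        return true;
    }

    int main(void)
    {
        MacroTask mt1 = { 0, {0}, NULL, false, false };   /* branching task  */
        MacroTask mt3 = { 0, {0}, &mt1, false, false };   /* selected by MT1 */

        printf("%d\n", earliest_executable(&mt3)); /* 0: branch not yet resolved */
        mt3.branch_taken_toward_us = true;         /* MT1's branch direction known */
        printf("%d\n", earliest_executable(&mt3)); /* 1: MT3 may start even though
                                                      MT1 itself has not finished */
        return 0;
    }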
Automatic Processor Assignment in 103.su2cor
- Using 14 processors: coarse-grain parallelization within loop DO400 of subroutine LOOPS.

(Figure: macro-task graph of su2cor LOOPS DO400, containing DOALL loops, sequential loops, subroutine blocks (SB), and basic blocks (BB); the coarse-grain parallelism PARA_ALD is 4.3.)
Data Localization: Loop-Aligned Decomposition
- Decompose multiple loops (DOALL and sequential) into CARs and LRs considering inter-loop data dependences.
  - Most data in an LR can be passed through local memory (LM).
  - LR: Localizable Region, CAR: Commonly Accessed Region

    C     RB1 (Doall)
          DO I=1,101
            A(I)=2*I
          ENDDO
    C     RB2 (Doseq)
          DO I=1,100
            B(I)=B(I-1)+A(I)+A(I+1)
          ENDDO
    C     RB3 (Doall)
          DO I=2,100
            C(I)=B(I)+B(I-1)
          ENDDO

  Aligned decomposition:
    RB1:  LR DO I=1,33 | CAR DO I=34,35 | LR DO I=36,66 | CAR DO I=67,68 | LR DO I=69,101
    RB2:     DO I=1,33 |     DO I=34,34 |    DO I=35,66 |     DO I=67,67 |    DO I=68,100
    RB3:     DO I=2,34 |                |    DO I=35,67 |                |    DO I=68,100
Data Localization

(Figures: the macro-task graph (MTG) of the example, the MTG after loop-aligned decomposition in which the decomposed loop iterations are grouped into data-localization groups dlg0-dlg3, and a resulting schedule for two processors (PE0, PE1) in which the macro-tasks of each data-localization group are executed consecutively on the same processor so that shared data stays in local memory or cache.)
An Example of Data Localization for SPEC95 swim

          DO 200 J=1,N
          DO 200 I=1,M
            UNEW(I+1,J) = UOLD(I+1,J)+
         1    TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
         2    +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
            VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
         1    *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
         2    -TDTSDY*(H(I,J+1)-H(I,J))
            PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
         1    -TDTSDY*(CV(I,J+1)-CV(I,J))
      200 CONTINUE

          DO 210 J=1,N
            UNEW(1,J) = UNEW(M+1,J)
            VNEW(M+1,J+1) = VNEW(1,J+1)
            PNEW(M+1,J) = PNEW(1,J)
      210 CONTINUE

          DO 300 J=1,N
          DO 300 I=1,M
            UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
            VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
            POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
      300 CONTINUE

(a) An example of a target loop group for data localization

(b) Image of the alignment on a 4 MB cache of the arrays accessed by the target loops (UNEW, VNEW, PNEW, UOLD, VOLD, POLD, U, V, P, CU, CV, Z, H): the arrays are laid out consecutively, so several of them share the same locations on the cache and cache line conflicts occur among them.
Data Layout for Removing Line Conflict Misses by Array Dimension Padding
Declaration part of the arrays in SPEC95 swim:

  Before padding:
        PARAMETER (N1=513, N2=513)
        COMMON U(N1,N2), V(N1,N2), P(N1,N2),
       *  UNEW(N1,N2), VNEW(N1,N2),
       1  PNEW(N1,N2), UOLD(N1,N2),
       *  VOLD(N1,N2), POLD(N1,N2),
       2  CU(N1,N2), CV(N1,N2),
       *  Z(N1,N2), H(N1,N2)

  After padding:
        PARAMETER (N1=513, N2=544)
        COMMON U(N1,N2), V(N1,N2), P(N1,N2),
       *  UNEW(N1,N2), VNEW(N1,N2),
       1  PNEW(N1,N2), UOLD(N1,N2),
       *  VOLD(N1,N2), POLD(N1,N2),
       2  CU(N1,N2), CV(N1,N2),
       *  Z(N1,N2), H(N1,N2)

(Figure: with padding, the access ranges of data-localization group DLG0 no longer collide in the 4 MB cache, removing the line conflicts.)
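The same padding idea can be written directly in C. The 544 value simply copies the Fortran PARAMETER change above, and only three of the swim arrays are shown; this is a sketch of the technique, not the benchmark source:

    #include <stdio.h>

    /* Before padding, each 513 x 513 double array occupies about 2.1 MB, and
     * the thirteen swim arrays are laid out back to back, so several of them
     * map onto the same sets of a 4 MB cache and evict each other (line
     * conflicts).  Enlarging one dimension to 544 changes the spacing of the
     * array start addresses and spreads the arrays over the cache. */
    #define N1 513
    #define N2_PAD 544          /* padded dimension (originally 513) */

    static double u[N2_PAD][N1], v[N2_PAD][N1], p[N2_PAD][N1];

    int main(void)
    {
        u[0][0] = 1.0; v[0][0] = 2.0; p[0][0] = 3.0;
        printf("array size with padding: %zu bytes\n", sizeof u);
        return 0;
    }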
APC Compiler Organization

A Fortran source program is processed by the APC multigrain parallelizing compiler:
- Data dependence analysis: inter-procedural, conditional
- Multigrain parallelization: hierarchical parallelism, affine partitioning
- Automatic data distribution: cache optimization, DSM optimization
- Scheduling: static scheduling, dynamic scheduling
- Speculative execution: coarse grain, medium grain, architecture support
- OpenMP code generation

The generated OpenMP code is compiled by native compilers (Sun Forte 6 Update 2, IBM XL Fortran Ver. 7.1, IBM XL Fortran Ver. 8.1) for a variety of shared-memory parallel machines (Sun Ultra 80, IBM RS6000, IBM pSeries 690).

A parallelizing tuning system provides feedback: program visualization techniques, techniques for profiling and utilizing runtime information, and feedback-directed selection of compiler directives.
Image of Generated OpenMP Code for Hierarchical Multigrain Parallel Processing

(Figure: threads T0-T7 are forked once for the whole program. In the 1st layer, OpenMP SECTIONS split the threads into two thread groups. The 1st-layer macro-tasks MT1_1, MT1_2, MT1_3 (SB), and MT1_4 (DOALL RB) are assigned to the groups, and the 2nd-layer macro-tasks inside MT1_3 (1_3_1 ... 1_3_6) and MT1_4 (1_4_1 ... 1_4_4) are executed within each group using centralized or distributed scheduling code, with SEND/RECV-style synchronization between the groups; END SECTIONS closes the region.)

A compact OpenMP sketch of this code shape is given below.
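The sketch below only suggests the code shape implied by the figure. Task names and bodies are invented, two thread groups are hard-coded, and a fixed execution order stands in for the compiler's generated scheduling code; the point is the one-time thread creation and the SECTIONS-based splitting into thread groups:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000
    static double a[N], b[N], c[N];

    static void mt1_1(void) { for (int i = 0; i < N; i++) a[i] = (double)i; }
    static void mt1_2(void) { for (int i = 0; i < N; i++) a[i] += 1.0; }
    static void mt1_3_body(int id) { b[id] = 2.0 * id; }  /* one 2nd-layer MT of MT1_3 (SB) */
    static void mt1_4_chunk(int lo, int hi)               /* one group's share of MT1_4     */
    {
        for (int i = lo; i < hi; i++) c[i] = 0.5 * i;
    }

    int main(void)
    {
        /* One-time creation of threads: a single parallel region for the whole
         * run, so there is no fork/join per macro-task. */
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            {   /* thread group 0 */
                mt1_1();
                mt1_2();
                mt1_4_chunk(0, N / 2);
            }
            #pragma omp section
            {   /* thread group 1: 2nd-layer macro-tasks of MT1_3, then its
                 * share of MT1_4 (a fixed order stands in for the compiler's
                 * static/dynamic scheduling code) */
                for (int id = 0; id < N; id++)
                    mt1_3_body(id);
                mt1_4_chunk(N / 2, N);
            }
        }
        printf("%f %f %f\n", a[N - 1], b[N - 1], c[N - 1]);
        return 0;
    }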
Performance of the OSCAR Multigrain Parallelizing Compiler on an IBM p550q 8-Core Deskside Server
2.7 times speedup against the loop-parallelizing compiler on 8 cores.

(Bar chart: speedup ratios of loop parallelization vs. multigrain parallelization for SPEC95 (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 (swim, mgrid, applu, apsi).)
OSCAR Compiler Performance on a 24-Processor IBM p690 High-end SMP Server

(Figure: four 8-way multi-chip modules (MCM0-MCM3), each with processor pairs sharing L2 caches and connected to L3 caches, memory slots, and GX slots, assembled into a 32-way pSeries 690.)

(Bar chart: speedup ratios of IBM XL Fortran Ver. 8.1 vs. the OSCAR compiler for SPEC95 (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 (swim, mgrid, applu, apsi); the vertical axis runs from 0 to 25.)

The OSCAR compiler gave 4.82 times speedup against loop parallelization.
Performance on an SGI Altix 450 Montecito 16-Processor cc-NUMA Server

(Bar chart: speedup ratios of the Intel compiler Ver. 8.1 vs. the OSCAR compiler for SPEC95 (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 (swim, mgrid, applu, apsi).)

- The OSCAR compiler gave 1.86 times speedup against the Intel Fortran Itanium Compiler revision 8.1.
Performance of the OSCAR Compiler on a 16-Core SGI Altix 450 Montvale Server

(Bar chart: speedup ratios of the Intel compiler Ver. 10.1 vs. the OSCAR compiler for the same SPEC95 and SPEC2000 benchmarks.)

- The OSCAR compiler gave 2.32 times speedup against the Intel Fortran Itanium Compiler revision 10.1.
NEC/ARM MPCore Embedded 4-Core SMP

(Bar chart: speedup ratios of g77 vs. the OSCAR compiler on 1, 2, 3, and 4 processors for the SPEC95 benchmarks tomcatv, swim, su2cor, hydro2d, mgrid, applu, and turb3d.)

3.48 times speedup by the OSCAR compiler against sequential processing.
Power Reduction by Power Supply, Clock Frequency, and Voltage Control by the OSCAR Compiler
- Shortest execution time mode
OSCAR Multi-Core Architecture

(Block diagram: chip multiprocessors CMP0 ... CMPm, centralized shared memories (CSM), and I/O multicore chips are connected by an inter-chip connection network (crossbar, buses, or multistage network). Inside each chip, processor elements PE0 ... PEn are connected by an intra-chip connection network (multiple buses, crossbar, etc.) to a CSM / L2 cache. Each PE contains a CPU, LPM/I-cache, LDM/D-cache, DSM, DTC, network interface, and FVR; FVRs are also attached to the chip and to the shared resources.)

CSM: centralized shared memory; LDM: local data memory; DSM: distributed shared memory; LPM: local program memory; DTC: data transfer controller; FVR: frequency/voltage control register.
An Example of Machine Parameters for the Power Saving Scheme
- Functions of the multiprocessor:
  - The frequency of each processor can be changed among several levels
  - Voltage is changed together with frequency
  - Each processor can be powered on/off

  State parameters (relative to FULL):
    state   frequency   voltage   dynamic energy   static power
    FULL        1         1             1               1
    MID        1/2       0.87          3/4              1
    LOW        1/4       0.71          1/2              1
    OFF         0         0             0               0

- State transition overhead:
  Delay time [u.t.]:
            FULL   MID   LOW   OFF
    FULL      0    40k   40k   80k
    MID      40k    0    40k   80k
    LOW      40k   40k    0    80k
    OFF      80k   80k   80k    0

  Energy overhead [μJ]:
            FULL   MID   LOW   OFF
    FULL      0     20    20    40
    MID      20      0    20    40
    LOW      20     20     0    40
    OFF      40     40    40     0

A simplified sketch of how these parameters could drive the state choice for a task follows.
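As a simplified sketch of how a compiler-directed scheme might use these parameters in real-time mode: for a macro-task with slack before its deadline, pick the slowest state whose stretched execution time (plus the transition delay) still fits, since dynamic energy falls as the state is lowered. The state values below are copied from the tables; the task and slack representation is invented for the illustration:

    #include <stdio.h>

    /* Frequency/voltage states from the slide, FULL..LOW (OFF is used only
     * for idle periods).  Values are relative to FULL. */
    typedef struct { const char *name; double speed; double dyn_energy; } FVState;

    static const FVState states[] = {
        { "FULL", 1.00, 1.00 },
        { "MID",  0.50, 0.75 },   /* 1/2 frequency, 3/4 dynamic energy */
        { "LOW",  0.25, 0.50 },   /* 1/4 frequency, 1/2 dynamic energy */
    };

    /* Pick the slowest (hence lowest dynamic energy) state such that the task
     * still finishes within exec_full + slack, accounting for a fixed 40k u.t.
     * transition delay taken from the overhead table. */
    static const FVState *choose_state(double exec_full, double slack)
    {
        const double transition_delay = 40000.0;  /* [u.t.] */
        const FVState *best = &states[0];
        for (int i = 1; i < 3; i++) {
            double stretched = exec_full / states[i].speed + transition_delay;
            if (stretched <= exec_full + slack)
                best = &states[i];      /* slower state still meets the deadline */
        }
        return best;
    }

    int main(void)
    {
        /* Example: a task of 10^6 u.t. at FULL with 1.5 x 10^6 u.t. of slack
         * can run at MID (about 2 x 10^6 u.t. in total) but not at LOW. */
        const FVState *s = choose_state(1.0e6, 1.5e6);
        printf("chosen state: %s\n", s->name);
        return 0;
    }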
Speed-up in the Fastest Execution Mode

(Bar chart: speedup ratios without and with power saving on 1, 2, and 4 processors for the benchmarks tomcatv, swim, applu, and mpeg2enc.)
Consumed Energy in the Fastest Execution Mode

(Bar charts: consumed energy without and with power saving on 1, 2, and 4 processors; tomcatv, swim, and applu are plotted in joules (up to about 200 J) and mpeg2_encode in millijoules (up to about 1600 mJ).)
Energy Reduction by the OSCAR Compiler in Real-time Processing Mode (10% Leakage)
- Deadline = sequential execution time; leakage power: 10%.
METI/NEDO National Project: Multi-core for Real-time Consumer Electronics
R&D of compiler-cooperative multicore processor technology for consumer electronics such as mobile phones, games, DVDs, digital TVs, and car navigation systems. From July 2005 to March 2008 (2005.7-2008.3).**

Goals:
- Good cost performance
- Short hardware and software development periods
- Low power consumption
- Scalable performance improvement with the advancement of semiconductors
- Use of the same parallelizing compiler for multi-cores from different vendors through a newly developed API (Application Programming Interface)

(Block diagram, Japanese labels translated: multicore chips CMP0 ... CMPm, each with processor cores PC0 ... PCn containing a CPU, local program memory / instruction cache, local data memory / L1 data cache, data transfer controller (DTC), distributed shared memory (DSM), and network interface (NI), connected by an intra-chip connection network (multiple buses, crossbar, etc.) to a centralized shared memory or L2 cache. Chips, centralized shared memories, and I/O multicore chips are connected by an inter-chip connection network (multiple buses, crossbar, multistage network, etc.). Annotations: the new multicore processors offer high performance, low power consumption, short HW/SW development periods, application sharing across chips, high reliability, and performance that grows with semiconductor integration; the developed multicore chips go into consumer electronics and multicore integrated ECUs.)

**Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC
NEDO Multicore Technology for Real-time Consumer Electronics: R&D Organization (2005.7-2008.3)

Integrated R&D Steering Committee
- Chair: Hironori Kasahara (Waseda Univ.), Project Leader
- Sub-chair: Kunio Uchiyama (Hitachi), Project Sub-leader

Grant: Multicore Architecture and Compiler Research and Development
- Project Leader: Prof. Hironori Kasahara, Waseda Univ.
- R&D Steering Committee
- Architecture & Compiler R&D Group; Group Leader: Associate Prof. Keiji Kimura, Waseda Univ. (outsourcing: Fujitsu)
- Standard Multicore Architecture & API Committee; Chair: Prof. Hironori Kasahara, Waseda Univ. (Hitachi, Fujitsu, Toshiba, NEC, Panasonic, Renesas Technology)

Subsidy: Evaluation Environment for Multicore Technology
- Project Sub-leader: Dr. Kunio Uchiyama, Chief Researcher, Hitachi
- Test Chip Development Group; Group Leader: Renesas Technology
- Evaluation System Development Group; Group Leader: Hitachi
Fujitsu FR-1000 Multicore Processor / FR550 VLIW Processor Core

(Block diagrams: the FR-V multi-core processor contains four FR550 cores connected by a crossbar switch to DMA engines (DMA-I, DMA-E), a local bus interface, and two memory controllers with attached memories. Each FR550 core has a 32 KB I-cache, a 32 KB D-cache, GR/FR register files, an integer operation unit, and a media operation unit, issuing up to 8 instructions per cycle (Inst. 0 - Inst. 7). A system configuration connects FR-1000 chips (Core1/Core2 each) through switches to memories and a fast I/O bus via an I/O chip. Memory bus: 64 bit x 2 ch / 266 MHz; system bus: 64 bit / 178 MHz.)
Panasonic UniPhier
CELL Processor Overview
- Power Processor Element (PPE)
  - The Power core processes OS and control tasks
  - 2-way multithreaded
- Synergistic Processor Element (SPE)
  - 8 SPEs offer high performance
  - Dual-issue RISC architecture
  - 128-bit SIMD (16-way)
  - 128 x 128-bit general registers
  - 256 KB local store
  - Dedicated DMA engines
  - Synergistic processor elements for high FLOPS/Watt

(Block diagram: eight SPEs, each with an SPU and local store (LS) connected at 16 B/cycle to the EIB (up to 96 B/cycle); the PPE, with a 512 KB L2 and 32 KB + 32 KB L1, connects at 32 B/cycle; the MIC connects to dual XDR memory and the BIC to RRAC I/O at 16 B/cycle each. The PPE is a 64-bit Power Architecture core with VMX for conventional processing.)
1987 OSCAR (Optimally Scheduled Advanced Multiprocessor)
OSCAR (Optimally Scheduled Advanced Multiprocessor)

(Block diagram: a host computer connects to a control & I/O processor (RISC processor plus I/O processor). Three centralized shared memories CSM1-CSM3 (simultaneously readable, with banks addressable at the same address) and the processor elements are connected through buses with a read & write request arbitrator and bus interfaces. Sixteen processor elements PE1-PE16 are organized into 5-PE clusters (SPC1-SPC3) and 8-PE processor clusters (LPC1, LPC2). Each PE (CP) contains a 5 MFLOPS, 32-bit RISC processor with 64 registers, 2 banks of program memory, data memory, stack memory, a DMA controller, and a dual-port distributed shared memory.)
OSCAR PE (Processor Element)

(Block diagram: the PE connects to the system bus through a bus interface and has two local buses plus an instruction bus linking the following units.)
- DMA: DMA controller
- LPM: local program memory (128 KW x 2 banks)
- INSC: instruction control unit
- DSM: distributed shared memory (2 KW)
- LSM: local stack memory (4 KW)
- LDM: local data memory (256 KW)
- DP: data path
- IPU: integer processing unit
- FPU: floating-point processing unit
- REG: register file (64 registers)
1987 OSCAR PE Board
OSCAR Memory Space

System memory space:
  00000000  undefined
  00100000  broadcast (PE0, PE1, ..., PE15)
  00200000  not used
  01000000  CP (control processor)
  01100000  not used
  02000000  PE0
  02100000  PE1
  02200000  ...
  02F00000  PE15
  03000000  CSM1, 2, 3 (centralized shared memory)
  03400000  not used
  FFFFFFFF  (end)

Local memory space:
  00000000  DSM (distributed shared memory; local memory area)
  00000800  not used
  00010000  LPM bank 0 (local program memory)
  00020000  LPM bank 1
  00030000  not used
  00040000  LDM (local data memory)
  00080000  not used
  000F0000  control
  00100000  system accessing area
  FFFFFFFF  (end)
OSCAR Multi-Core Architecture (recap of the block diagram and abbreviations shown earlier)
API and Parallelizing Compiler in the METI/NEDO Advanced Multicore for Real-time Consumer Electronics Project

(Flow diagram: a sequential application program written in a subset of the C language, together with API directives that specify data assignment, data transfer, and power reduction control, is fed to the Waseda OSCAR compiler. The compiler translates it into parallel code for each vendor and produces a macro-task schedule per processor (e.g., T1 and T4 on Proc0; T2 on Proc1, which is then stopped; T3 and T6 on Proc2, which is slowed down), with data transfers performed by the DTC (DMAC). Each vendor's backend compiler, consisting of an API decoder and a sequential compiler, then generates executable machine code for its own chip: SH multicore, FR-V, and (CELL). Target applications are real-time consumer electronics programs such as image processing, security, audio, and streaming.)

An illustrative fragment of such an annotated sequential program follows.
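The fragment below only illustrates the kind of information such API directives convey on top of a sequential C program; the pragma names are invented placeholders, not the actual OSCAR API syntax, and an ordinary compiler simply ignores them:

    #include <stdio.h>

    #define N 1024
    static double in[N], out[N];

    /* Hypothetical hints: where data should live, what the DTC/DMAC should
     * transfer, and when a core may be stopped.  Names are made up. */
    #pragma hypothetical_oscar data_assign(in : local_data_memory(pe0))
    #pragma hypothetical_oscar dma_transfer(in -> dsm(pe1), overlap_with_exec)
    static void filter(void)
    {
        for (int i = 1; i < N - 1; i++)
            out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            in[i] = (double)i;
        filter();
    #pragma hypothetical_oscar power_control(pe1 : stop)  /* idle core shut down */
        printf("%f\n", out[N / 2]);
        return 0;
    }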
OSCAR Multigrain Parallelizing Compiler
- Automatic parallelization
  - Multigrain parallel processing
  - Data localization
  - Data transfer overlapping
  - Compiler-controlled power saving scheme
- Compiler-cooperative multi-core architectures
  - OSCAR multi-core architecture
  - OSCAR heterogeneous multiprocessor architecture
- Commercial SMP machines
Processor Block Diagram (600 MHz cores)

(Block diagram: four cores (Core #0-#3), each with a CPU, FPU, 32 KB I-cache, 32 KB D-cache, cache controller (CCN), 8 KB instruction local memory (IL), 16 KB data local memory (DL), 128 KB user RAM (URAM), and a local clock pulse generator (LCPG 0-3); a snoop controller (SNC) keeps the caches coherent. The cores connect to a 300 MHz on-chip system bus (SHwy) together with the global clock pulse generator (GCPG), an SRAM controller (LBSC, 32-bit), a DDR2 controller (DBSC, 32-bit), hardware IP blocks, and a 4-lane PCI Express interface.)

CCN: cache controller; IL, DL: instruction/data local memory; URAM: user RAM; GCPG: global clock pulse generator; LCPG: local CPG for each core; LBSC: SRAM controller; DBSC: DDR2 controller.
Chip Overview (SH4A multicore SoC chip; die photo shows Core #0-#3, the SNC, and peripherals)

  Process technology:   90 nm, 8-layer, triple-Vth, CMOS
  Chip size:            97.6 mm2 (9.88 mm x 9.88 mm)
  Supply voltage:       1.0 V (internal), 1.8/3.3 V (I/O)
  Power consumption:    0.6 mW/MHz/CPU @ 600 MHz (90 nm G)
  Clock frequency:      600 MHz
  CPU performance:      4320 MIPS (Dhrystone 2.1)
  FPU performance:      16.8 GFLOPS
  I/D cache:            32 KB, 4-way set-associative (each)
  ILRAM/OLRAM:          8 KB / 16 KB (each CPU)
  URAM:                 128 KB (each CPU)
  Package:              FCBGA, 554-pin, 29 mm x 29 mm

ISSCC07 Paper No. 5.3, Y. Yoshida, et al., "A 4320MIPS Four-Processor Core SMP/AMP with Individually Managed Clock Frequency for Low Power Consumption"
Performance on the Developed SH Multi-core (RP1: SH-X3) Using the Compiler and API

(Bar charts: speedup vs. number of processors (1-4) for an audio AAC* encoder and for image Susan smoothing**; speedup grows from 1.00 on one processor to roughly 3.4-3.8 on four processors.)

*) ISO Advanced Audio Coding
**) MiBench embedded application benchmark from the University of Michigan
RP2 Chip Photo and Specifications

(Die photo labels: Core #0-#7 with I$, D$, ILRAM, DLRAM, and URAM; SNC0 and SNC1; CSM; DBSC and DDR pads; LBSC; SHWY; VSWC; GCPG.)

  Process technology:   90 nm, 8-layer, triple-Vth, CMOS
  Chip size:            104.8 mm2 (10.61 mm x 9.88 mm)
  CPU core size:        6.6 mm2 (3.36 mm x 1.96 mm)
  Supply voltage:       1.0 V - 1.4 V (internal), 1.8/3.3 V (I/O)
  Power domains:        17 (8 CPUs, 8 URAMs, common)
8-Core RP2 Chip Block Diagram

(Block diagram: two clusters of four cores each (Cluster #0: Core #0-#3 with Snoop controller 0, LCPG0, and PCR0-3; Cluster #1: Core #4-#7 with Snoop controller 1, LCPG1, and PCR4-7). Each core has a CPU, FPU, 16 KB I-cache, 16 KB D-cache, local memories (instruction: 8 KB, data: 32 KB), a CCN/BAR cache controller with barrier register, and 64 KB user RAM (URAM). Barrier synchronization lines connect the cores. The clusters attach to the on-chip system bus (SuperHyway), which also connects the DDR2, SRAM, and DMA controllers.)

LCPG: local clock pulse generator; PCR: power control register; CCN/BAR: cache controller / barrier register; URAM: user RAM.
Processing Performance on the Developed Multicore Using the Automatic Parallelizing Compiler
Speedup against single-core execution for audio AAC* encoding:
  1 core: 1.0,  2 cores: 1.9,  4 cores: 3.6,  8 cores: 5.8
*) Advanced Audio Coding
Power Reduction Using Power Shutdown and Voltage Control by the OSCAR Parallelizing Compiler

(Measured power waveforms: without power control, the average power is 5.82 W; with compiler power control (resume standby: power shutdown and voltage lowering), the average power is 0.81 W, an 86.0% power reduction.)
OSCAR Heterogeneous Multicore
- OSCAR-type memory architecture
  - LPM: local program memory
  - LDM: local data memory
  - DSM: distributed shared memory
  - CSM: centralized shared memory (on-chip and/or off-chip)
  - DTU: data transfer unit
- Interconnection network
  - Multiple buses
  - Split-transaction buses
  - Crossbar, ...
Static Scheduling of Coarse-Grain Tasks for a Heterogeneous Multi-core

(Figure: a macro-task graph whose nodes are labeled "MT for CPU" (MT1, MT2, MT3, MT6, MT8, MT11) or "MT for DRP" (MT4, MT5, MT7, MT9, MT10, MT12, MT13), and the resulting static schedule on CPU0, CPU1, and a DRP accelerator, ending at EMT.)

A toy list-scheduling sketch of this assignment step follows.
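A toy version of the static scheduling step is sketched below: each macro-task carries a preferred PE type (as in the MT-for-CPU / MT-for-DRP labels above) and an estimated cost, and tasks are placed greedily on the earliest-available PE of that type. This is a deliberately minimal list scheduler with made-up costs, not the compiler's actual algorithm:

    #include <stdio.h>

    enum PEType { CPU, DRP };

    typedef struct { const char *name; enum PEType type; double ready; } PE;
    typedef struct { const char *name; enum PEType wants; double cost; } MacroTask;

    /* Greedy static scheduling: assign each macro-task to the earliest-available
     * PE of the requested type (dependences are assumed to be respected by the
     * task order, which a real scheduler derives from the macro-task graph). */
    int main(void)
    {
        PE pe[] = { {"CPU0", CPU, 0.0}, {"CPU1", CPU, 0.0}, {"DRP0", DRP, 0.0} };
        MacroTask mt[] = {
            {"MT1", CPU, 3.0}, {"MT2", CPU, 2.0}, {"MT4", DRP, 1.5},
            {"MT3", CPU, 2.5}, {"MT5", DRP, 2.0}, {"MT6", CPU, 1.0},
        };
        int n_pe = (int)(sizeof pe / sizeof pe[0]);
        int n_mt = (int)(sizeof mt / sizeof mt[0]);

        for (int t = 0; t < n_mt; t++) {
            int best = -1;
            for (int p = 0; p < n_pe; p++) {
                if (pe[p].type != mt[t].wants) continue;
                if (best < 0 || pe[p].ready < pe[best].ready) best = p;
            }
            printf("%s -> %s at t=%.1f\n", mt[t].name, pe[best].name, pe[best].ready);
            pe[best].ready += mt[t].cost;   /* PE becomes free after this task */
        }
        return 0;
    }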
An Image of a Static Schedule for a Heterogeneous Multicore with Data Transfer Overlapping and Power Control
(Gantt-style schedule figure with time on the vertical axis.)
Compiler Performance on an OSCAR Heterogeneous Multi-core
25.2 times speedup using 4 SH general-purpose cores and 4 DRP accelerators against a single SH core.

(Bar charts of relative performance (from execution clock cycles) for configurations from 1 CPU up to 8 CPUs + 8 DRPs on the SH4A-core system, under three memory configurations: on-chip CSM with STB x 1, on-chip CSM with BUS x 3, and on-chip CSM with STB x 3. The charts are annotated with speedups of 25.2 times (4 CPUs + 4 DRPs) and up to 41.3 times against one SH core.)
Power Reduction by the OSCAR Compiler (4 SHs + 4 DRPs)
0.78 W: 22% power reduction by compiler control.

(Bar chart of average power consumption for the off-chip and on-chip memory configurations (STB x 1, BUS x 3, STB x 3): roughly 1.01 W without frequency/voltage control versus 0.78 W with FV control and power shutdown; the bars break power down into DRP, clock, FPU, IEU, BPU, and register components. July 30, 2007.)
Conclusions
- Compiler-cooperative, low-power, high-effective-performance multi-core processors will become increasingly important in a wide range of information systems, from games, mobile phones, and automobiles to peta-scale supercomputers.
- Parallelizing compilers are essential for realizing:
  - Good cost performance
  - Short hardware and software development periods
  - Low power consumption
  - High software productivity
  - Scalable performance improvement with advances in semiconductor integration technology
- Key technologies in multi-core compilers: multigrain parallelization, data localization, data transfer overlapping using DMA, and low-power control.