"OSCAR Parallelizing Compiler and API for Low Power High ...

17 downloads 1516 Views 3MB Size Report
Apr 10, 2008 - 12. BPA. 13 RB. 14. 13. Conditional branch. OR. AND. Original control flow. 14 RB. END. A Macro Flow Graph. A Macro Task Graph. 9 ...
OSCAR Parallelizing Compiler and API for Low Power High Performance Multicores

Hironori Kasahara
Professor, Department of Computer Science & Engineering
Director, Advanced Multicore Processor Research Institute
Waseda University, Tokyo, Japan
IEEE Computer Society Board of Governors
http://www.kasahara.cs.waseda.ac.jp

Climate 2009, Oak Ridge National Laboratory, Mar. 17, 2009

Multi-core Everywhere

Multicores are now used everywhere, from embedded systems to supercomputers.

- Consumer electronics (embedded): mobile phones, games, digital TV, car navigation, DVD, cameras.
  IBM/Sony/Toshiba Cell, Fujitsu FR1000, NEC/ARM MPCore and MP211, Panasonic UniPhier, Renesas SH multicores (4-core RP1, 8-core RP2), Tilera TILE64, SPI Storm-1 (16 VLIW cores)
- PCs and servers:
  Intel Quad Xeon, Core 2 Quad, Montvale, Nehalem (8 cores), 80-core chip, Larrabee (32 cores); AMD Quad-Core Opteron, Phenom
- Workstations, deskside and high-end servers:
  IBM POWER4, 5, 5+, 6; Sun Niagara (SPARC T1, T2), Rock
- Supercomputers:
  Earth Simulator: 40 TFLOPS, 2002, 5120 vector processors; IBM Blue Gene/L: 360 TFLOPS, 2005, low-power CMP with 128K processor chips; BG/Q: 20 PFLOPS, 2011; Blue Waters: effective 1 PFLOPS, July 2011, NCSA UIUC

[Photo: OSCAR-type multicore chip by Renesas, developed in the METI/NEDO "Multicore for Real-time Consumer Electronics" project (leader: Prof. Kasahara)]

High-quality application software, productivity, cost performance, and low power consumption are important, for example in mobile phones and games. Compiler-cooperative multicore processors are promising for realizing these features.

METI/NEDO National Project: Multicore for Real-time Consumer Electronics

R&D of compiler-cooperative multicore processor technology for consumer electronics such as mobile phones, games, DVD, digital TV, and car navigation systems.
Period: July 2005 to March 2008.  Participating companies**: Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC.

Project goals:
- Good cost performance
- High performance and high reliability
- Low power consumption
- Short hardware and software development periods
- Scalable performance improvement with the advancement of semiconductor integration
- Use of the same parallelizing compiler for multicores from different vendors, through the newly developed API
- Applications shared among chips from different vendors
- The developed multicore chips are aimed at consumer information appliances and multicore integrated automotive ECUs

[Figure: the new multicore processor architecture. Multicore chips CMP 0 .. CMP m each contain processor cores PC 0 .. PC n; every core has a CPU, LPM/I-cache (local program memory / instruction cache), LDM/D-cache (local data memory / L1 data cache), DSM (distributed shared memory), DTC (data transfer controller), and NI (network interface), connected by the IntraCCN (intra-chip connection network: multiple buses, crossbar, etc.) to the on-chip CSM / L2 cache (centralized shared memory or L2 cache). The chips, centralized shared memories CSM j, I/O multicore chips, and I/O devices are connected by the InterCCN (inter-chip connection network: multiple buses, crossbar, multistage network, etc.).]

API: Application Programming Interface

OSCAR Multi-Core Architecture

[Figure: chip multiprocessors CMP 0 .. CMP m, each containing processor elements PE 0 .. PE n. Every PE has a CPU, LPM/I-cache, LDM/D-cache, DTC, DSM, a network interface, and FVRs; the PEs are connected by an intra-chip connection network (multiple buses, crossbar, etc.) to an on-chip CSM / L2 cache, which also has an FVR. The chips, centralized shared memories CSM j, I/O chips CMP k, and I/O devices are connected by an inter-chip connection network (crossbar, buses, multistage network, etc.).]

CSM: centralized shared memory    LDM: local data memory
DSM: distributed shared memory    LPM: local program memory
DTC: data transfer controller     FVR: frequency/voltage control register

Renesas-Hitachi-Waseda 8-Core RP2 Chip: Photo and Specifications

[Chip photo with labeled blocks: Core#0 .. Core#7 (each with I$, D$, ILRAM, DLRAM, URAM), SNC0/SNC1, CSM, LBSC, DBSC, DDRPAD, SHWY, VSWC, GCPG.]

Process technology:  90 nm, 8-layer, triple-Vth CMOS
Chip size:           104.8 mm2 (10.61 mm x 9.88 mm)
CPU core size:       6.6 mm2 (3.36 mm x 1.96 mm)
Supply voltage:      1.0 V - 1.4 V (internal), 1.8/3.3 V (I/O)
Power domains:       17 (8 CPUs, 8 URAMs, common)

IEEE ISSCC 2008, Paper No. 4.5: M. Ito, ... and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler."

Demo of the NEDO Multicore for Real-Time Consumer Electronics project at the Council for Science and Technology Policy (CSTP) on April 10, 2008

CSTP members:
- Prime Minister: Mr. Y. Fukuda
- Minister of State for Science, Technology and Innovation Policy: Mr. F. Kishida
- Chief Cabinet Secretary: Mr. N. Machimura
- Minister of Internal Affairs and Communications: Mr. H. Masuda
- Minister of Finance: Mr. F. Nukaga
- Minister of Education, Culture, Sports, Science and Technology: Mr. K. Tokai
- Minister of Economy, Trade and Industry: Mr. A. Amari

OSCAR Parallelizing Compiler

Goals: improve effective performance, cost performance, and productivity, and reduce power consumption.

- Multigrain parallelization: exploit parallelism over the whole program by using coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism.
- Data localization: automatic data distribution for distributed shared memory, cache, and local memory on multiprocessor systems.
- Data transfer overlapping: hide data transfer overhead by overlapping task execution and data transfer, using DMA or data prefetching.
- Power reduction: reduce power consumption by compiler control of frequency, voltage, and power shutdown, with hardware support.

Generation of Coarse-Grain Tasks

The program is decomposed into macro-tasks (MTs):
- Block of Pseudo Assignments (BPA): basic block (BB)
- Repetition Block (RB): natural loop
- Subroutine Block (SB): subroutine

[Figure: hierarchical decomposition of the total system. At the first layer, the program is divided into BPAs, RBs, and SBs. A BPA is parallelized at near-fine grain; an RB is parallelized at the loop level, at near-fine grain inside the loop body, or at coarse grain; an SB is parallelized at coarse grain. RBs and SBs are recursively decomposed into BPAs, RBs, and SBs at the second and third layers.]
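As a concrete illustration (a made-up fragment, not taken from the slides), the short Fortran program below marks how source regions would map onto the three macro-task kinds; the subroutine SOLVE and all array names are assumptions introduced only for this example.

      PROGRAM MTEX
C     Hypothetical example: each labelled region would become one
C     macro-task in the OSCAR decomposition.
      INTEGER I, N
      REAL A(100), B(100), SCAL, DT
      N = 100
C     --- MT1 (BPA): a basic block of assignment statements ---
      DT   = 0.5
      SCAL = 2.0 * DT
C     --- MT2 (RB): a natural loop, a candidate for loop-level parallelism ---
      DO I = 1, N
         B(I) = SCAL * REAL(I)
      ENDDO
C     --- MT3 (SB): a subroutine call, coarse-grain parallelized inside ---
      CALL SOLVE(A, B, N)
      PRINT *, A(N)
      END

      SUBROUTINE SOLVE(A, B, N)
      INTEGER I, N
      REAL A(N), B(N)
      DO I = 1, N
         A(I) = B(I) + 1.0
      ENDDO
      END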

Earliest Executable Condition Analysis for Coarse-Grain Tasks (Macro-Tasks)

[Figure: a Macro Flow Graph (MFG) and the Macro Task Graph (MTG) derived from it. MFG nodes are macro-tasks (BPA: block of pseudo assignment statements, RB: repetition block); its edges represent data dependency and the original control flow, with small diamonds marking conditional branches. In the MTG, edges represent data dependency and extended control dependency, and arcs bundle incoming edges by AND/OR conditions; the original control flow is removed, so each macro-task can start as soon as its earliest executable condition is satisfied.]

Automatic Processor Assignment in su2cor

- Using 14 processors
- Coarse-grain parallelization within loop DO400 of subroutine LOOPS

MTG of su2cor LOOPS DO400

Coarse-grain parallelism PARA_ALD = 4.3

[Figure: macro task graph of DO400; node types are DOALL loops, sequential loops, subroutine blocks (SB), and basic blocks (BB).]

Data Localization: Loop Aligned Decomposition

Decompose multiple loops (DOALL and sequential) into CARs and LRs, considering inter-loop data dependence.
- LR: Localizable Region, CAR: Commonly Accessed Region
- Most data in an LR can be passed through local memory (LM).

C RB1 (DOALL)
      DO I = 1, 101
         A(I) = 2*I
      ENDDO
C RB2 (DOSEQ)
      DO I = 1, 100
         B(I) = B(I-1) + A(I) + A(I+1)
      ENDDO
C RB3 (DOALL)
      DO I = 2, 100
         C(I) = B(I) + B(I-1)
      ENDDO

Aligned decomposition:
- RB1: LR DO I=1,33 / CAR DO I=34,35 / LR DO I=36,66 / CAR DO I=67,68 / LR DO I=69,101
- RB2: DO I=1,33 / DO I=34,34 / DO I=35,66 / DO I=67,67 / DO I=68,100
- RB3: DO I=2,34 / DO I=35,67 / DO I=68,100
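The sketch below (hand-written for illustration, not OSCAR output, and assuming one plausible grouping of the decomposed chunks) shows how the chunks forming one localizable group can be run back-to-back on one core, so that the corresponding slices of A and B stay in that core's local data memory; the few boundary elements produced in the commonly accessed regions are read from shared memory.

C     One data-localization group executed on a single core (sketch).
      DO I = 1, 33
         A(I) = 2*I
      ENDDO
      DO I = 1, 33
         B(I) = B(I-1) + A(I) + A(I+1)
      ENDDO
      DO I = 2, 34
         C(I) = B(I) + B(I-1)
      ENDDO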

Data Localization

[Figure: the macro task graph (MTG) before loop aligned decomposition, the MTG after division, the data localization groups (dlg0 - dlg3), and a schedule for two processors in which the macro-tasks of each data localization group are assigned consecutively to the same processor (PE0 or PE1).]

An Example of Data Localization for SPEC95 swim

      DO 200 J=1,N
      DO 200 I=1,M
      UNEW(I+1,J) = UOLD(I+1,J)+
     1     TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
     2     +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
      VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
     1     *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
     2     -TDTSDY*(H(I,J+1)-H(I,J))
      PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
     1     -TDTSDY*(CV(I,J+1)-CV(I,J))
  200 CONTINUE

      DO 210 J=1,N
      UNEW(1,J) = UNEW(M+1,J)
      VNEW(M+1,J+1) = VNEW(1,J+1)
      PNEW(M+1,J) = PNEW(1,J)
  210 CONTINUE

      DO 300 J=1,N
      DO 300 I=1,M
      UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
      VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
      POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
  300 CONTINUE

(a) An example of a target loop group for data localization

[Figure (b): image of the alignment on a 4 MB cache of the arrays accessed by the target loops (U, V, P, UNEW, VNEW, PNEW, UOLD, VOLD, POLD, CU, CV, Z, H). The arrays wrap around the cache, and cache line conflicts occur among arrays that share the same location on the cache.]

Data Layout for Removing Line Conflict Misses by Array Dimension Padding

Declaration part of the arrays in SPEC95 swim:

Before padding:
      PARAMETER (N1=513, N2=513)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

After padding:
      PARAMETER (N1=513, N2=544)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

Padding the second dimension changes each array's starting offset modulo the 4 MB cache size, so the arrays accessed together by the target loops (the access range of DLG0 in the figure) no longer map to the same cache lines.

[Figure: layout of the arrays over the 4 MB cache before and after padding; the box marks the access range of DLG0.]
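A back-of-the-envelope check of why N2 = 544 helps (my own arithmetic, assuming 4-byte REAL elements as in the SPEC95 source and a low-associativity 4 MB cache):

  before padding:  513 x 513 x 4 bytes ~= 1.05 MB per array;
                   4 arrays ~= 4.21 MB, only about 16 KB more than the 4 MB
                   cache, so every fourth array in the COMMON block starts at
                   nearly the same cache offset and conflicts on the same lines
  after padding:   513 x 544 x 4 bytes ~= 1.12 MB per array;
                   4 arrays ~= 4.47 MB, so each wrap around the cache is
                   shifted by roughly 265 KB and the 13 arrays are spread
                   across the cache instead of colliding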

Power Reduction by Power Supply, Clock Frequency and Voltage Control by the OSCAR Compiler

- Shortest execution time mode: the compiler lowers the frequency and voltage of macro-tasks that have scheduling slack (i.e., are off the critical path) and shuts power off for idle cores, without lengthening the total execution time.

Image of Generated Multigrain Parallelized Code Using the Developed Multicore API

(The API is compatible with OpenMP.)

[Figure: threads T0 - T7 are created once by an OpenMP SECTIONS construct, one SECTION per thread. In the first layer, centralized and distributed scheduling code assigns the macro-tasks MT1_1, MT1_2 (DOALL), MT1_3 (SB), and MT1_4 (RB) to thread groups. In the second layer, the sub-macro-tasks of MT1_3 (1_3_1 .. 1_3_6) and of MT1_4 (1_4_1 .. 1_4_4) are executed by thread group 0 and thread group 1, which synchronize and exchange data with SEND/RECV/SYNC operations. The parallel region closes with END SECTIONS.]
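To make the one-time thread creation structure concrete, here is a minimal hand-written OpenMP sketch of what such output looks like; the subroutine names and the split of macro-tasks between the two sections are illustrative assumptions, not actual OSCAR compiler output.

      PROGRAM OSCSKEL
C     Sketch only: each SECTION corresponds to one thread (group) that
C     executes the macro-tasks assigned to it; real OSCAR output also
C     contains scheduling, synchronization, and data-transfer code.
!$OMP PARALLEL SECTIONS
!$OMP SECTION
      CALL MT1_1
      CALL MT1_3
!$OMP SECTION
      CALL MT1_2
      CALL MT1_4
!$OMP END PARALLEL SECTIONS
      END

      SUBROUTINE MT1_1
      PRINT *, 'macro-task MT1_1'
      END
      SUBROUTINE MT1_2
      PRINT *, 'macro-task MT1_2 (DOALL)'
      END
      SUBROUTINE MT1_3
      PRINT *, 'macro-task MT1_3 (SB, second-layer tasks inside)'
      END
      SUBROUTINE MT1_4
      PRINT *, 'macro-task MT1_4 (RB, second-layer tasks inside)'
      END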

Compilation Flow Using the OSCAR API

1. Application program: Fortran or parallelizable C (sequential program).
2. The Waseda Univ. OSCAR parallelizing compiler performs coarse-grain task parallelization, global data localization, data transfer overlapping using DMA, and power reduction control using DVFS, clock gating, and power gating.
3. The output is a parallelized Fortran or C program with API directives (code with directives for each processor core / thread).
4. For each target, an API analyzer plus the vendor's existing sequential compiler (the backend compiler) generates the parallel machine code; for shared-memory servers, an OpenMP compiler can be used directly as the backend.
5. The resulting executables run on various multicores (multicore from vendor A, multicore from vendor B, shared-memory servers).

OSCAR: Optimally Scheduled Advanced Multiprocessor
API: Application Program Interface
OSCAR API for real-time low-power high-performance multicores: directives for thread generation, memory, data transfer using DMA, and power management.

Performance of the OSCAR Compiler on an IBM p6 595 Power6 (4.2 GHz) 32-Core SMP Server

OpenMP codes generated by the OSCAR compiler are about 3.3 times faster on average than automatic parallelization by IBM XL Fortran for AIX Ver. 12.1.

Compile options:
(*1) Sequential: -O3 -qarch=pwr6; XLF: -O3 -qarch=pwr6 -qsmp=auto; OSCAR: -O3 -qarch=pwr6 -qsmp=noauto
(*2) Sequential: -O5 -q64 -qarch=pwr6; XLF: -O5 -q64 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -q64 -qarch=pwr6 -qsmp=noauto
(others) Sequential: -O5 -qarch=pwr6; XLF: -O5 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -qarch=pwr6 -qsmp=noauto

Performance of the OSCAR Compiler Using the Multicore API on an Intel Quad-Core Xeon

The OSCAR compiler gives 2.1 times speedup on average against the Intel compiler Ver. 10.1.

[Bar chart: speedup ratios of Intel Ver. 10.1 automatic parallelization vs. OSCAR for SPEC95 (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 (swim, mgrid, applu, apsi).]

Performance of the OSCAR Compiler on a 16-Core SGI Altix 450 Montvale Server

The OSCAR compiler gave 2.3 times speedup on average against the Intel Fortran Itanium compiler revision 10.1.

Compiler options: Intel automatic parallelization: -fast -parallel; OpenMP codes generated by OSCAR: -fast -openmp.

[Bar chart: speedup ratios of Intel Ver. 10.1 vs. OSCAR for SPEC95 (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 (swim, mgrid, applu, apsi).]

Performance of the OSCAR Compiler on the NEC NaviEngine (ARM/NEC MPCore)

The OSCAR compiler gave 3.43 times speedup against one core on the ARM/NEC MPCore with four ARM 400 MHz cores.

Compile option: -O3

[Bar chart: speedup ratios of g77 vs. OSCAR on 1, 2, and 4 PEs for SPEC95 mgrid, su2cor, and hydro2d.]

Performance of the OSCAR Compiler Using the Multicore API on the Fujitsu FR1000 Multicore

3.38 times speedup on average for 4 cores against single-core execution.

[Bar chart: speedups on 1 to 4 cores for MPEG2 encoding, MPEG2 decoding, MP3 encoding, and JPEG 2000 encoding, with 4-core speedups of up to 3.76.]

Performance of the OSCAR Compiler Using the Developed API on a 4-Core (SH-4A) OSCAR-Type Multicore

3.31 times speedup on average for 4 cores against single-core execution.

[Bar chart: speedups on 1 to 4 cores for MPEG2 encoding, MPEG2 decoding, MP3 encoding, and JPEG 2000 encoding, with 4-core speedups of 3.17 to 3.50.]

Processing Performance on the Developed Multicore Using the Automatic Parallelizing Compiler

Speedup against single-core execution for AAC (Advanced Audio Coding) audio encoding:

  Processor cores:  1     2     4     8
  Speedup:          1.0   1.9   3.6   5.8

Power Reduction by the OSCAR Parallelizing Compiler for Secure Audio Encoding

AAC encoding + AES encryption with 8 CPU cores:
- Without power control (voltage 1.4 V): average power 5.68 W
- With power control (frequency control, resume standby with power shutdown, and voltage lowering from 1.4 V to 1.0 V): average power 0.67 W
- 88.3% power reduction

[Figure: measured power waveforms (0 - 7 W) for the two cases.]

Power Reduction by the OSCAR Parallelizing Compiler for MPEG2 Decoding

MPEG2 decoding with 8 CPU cores:
- Without power control (voltage 1.4 V): average power 5.73 W
- With power control (frequency control, resume standby with power shutdown, and voltage lowering from 1.4 V to 1.0 V): average power 1.52 W
- 73.5% power reduction

[Figure: measured power waveforms (0 - 7 W) for the two cases.]

Low Power High Performance Multicore Computer with Solar Panel

- Clean-energy autonomous servers that can operate in deserts

Conclusions

- OSCAR-compiler-cooperative low-power multicores with high effective performance and short software development periods will be important in a wide range of IT systems, from consumer electronics to exa-scale supercomputers.
- The OSCAR compiler with the API boosts the vendor compilers' performance on various multicores and servers:
  - 3.3 times on an IBM p595 SMP server using Power6
  - 2.1 times on an Intel quad-core Xeon
  - 2.3 times on an SGI Altix 450 using Intel Itanium 2 (Montvale)
- 88% power reduction by compiler power control on the Renesas-Hitachi-Waseda 8-core (SH-4A) multicore RP2 for real-time secure AAC encoding
- 70% power reduction on the same multicore for MPEG2 decoding
