OSCAR Parallelizing Compiler and API for Low Power High Performance Multicores
Hironori Kasahara, Professor, Department of Computer Science & Engineering; Director, Advanced Multicore Processor Research Institute, Waseda University, Tokyo, Japan; IEEE Computer Society Board of Governors
http://www.kasahara.cs.waseda.ac.jp
Climate 2009, Oak Ridge National Laboratory, Mar. 17, 2009
Multi-core Everywhere
Multicores from embedded systems to supercomputers:
• Consumer Electronics (Embedded): mobile phones, games, digital TV, car navigation, DVD, cameras. IBM/Sony/Toshiba Cell, Fujitsu FR1000, NEC/ARM MPCore & MP211, Panasonic UniPhier, Renesas SH multicore (4-core RP1, 8-core RP2), Tilera Tile64, SPI Storm-1 (16 VLIW cores). The OSCAR-type multicore chip was developed by Renesas in the METI/NEDO "Multicore for Real-time Consumer Electronics" project (leader: Prof. Kasahara).
• PCs, Servers: Intel Quad Xeon, Core 2 Quad, Montvale, Nehalem (8 cores), 80-core research chip, Larrabee (32 cores); AMD Quad-Core Opteron, Phenom
• WSs, Deskside & High-end Servers: IBM Power4, 5, 5+, 6; Sun Niagara (SPARC T1, T2), Rock
• Supercomputers: Earth Simulator (40 TFLOPS, 2002, 5120 vector processors); IBM Blue Gene/L (360 TFLOPS, 2005, low-power CMP, 128K processor chips); BG/Q (20 PFLOPS, 2011); Blue Waters (effective 1 PFLOPS, July 2011, NCSA, UIUC)
High-quality application software, productivity, cost performance, and low power consumption are important (e.g., mobile phones, games). Compiler-cooperative multicore processors are promising for realizing these features.
METI/NEDO National Project: Multicore for Real-time Consumer Electronics (July 2005 - March 2008)**
R&D of compiler-cooperative multicore processor technology for consumer electronics such as mobile phones, games, DVD, digital TV, and car navigation systems.
Goals:
• Good cost performance
• Short hardware and software development periods
• Low power consumption
• Scalable performance improvement with the advancement of semiconductor integration
• Application programs sharable among chips
• High reliability
• Use of the same parallelizing compiler for multicores from different vendors, via a newly developed API
[Figure: the new multicore processor architecture. Each multicore chip CMP0 to CMPm contains processor cores PC0 to PCn, each with a CPU, local program memory / instruction cache (LPM/I-cache), local data memory / L1 data cache (LDM/D-cache), distributed shared memory (DSM), a data transfer controller (DTC), and a network interface (NI), connected by an intra-chip connection network (IntraCCN: multiple buses, crossbar, etc.) to centralized shared memory (CSM) / L2 cache. The chips, I/O multicore chips, I/O devices, and CSM modules are connected by an inter-chip connection network (InterCCN: multiple buses, crossbar, multistage network, etc.). The developed multicore chips target consumer electronics and integrated automotive ECUs.]
API: Application Programming Interface
**Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC
OSCAR Multi-Core Architecture
[Figure: chip multiprocessors CMP0 to CMPm, each containing processor elements PE0 to PEn. Each PE has a CPU, LPM/I-cache, LDM/D-cache, DSM, DTC, and a frequency/voltage control register (FVR). PEs connect through an intra-chip connection network (multiple buses, crossbar, etc.) to CSM / L2 cache and a network interface. Chips, I/O devices, I/O multicore chips CMPk, and CSM modules connect through an inter-chip connection network (crossbar, buses, multistage network, etc.). FVRs are provided per core, per chip, and per shared resource.]
CSM: central shared memory; LDM: local data memory; DSM: distributed shared memory; LPM: local program memory; DTC: data transfer controller; FVR: frequency/voltage control register
Renesas-Hitachi-Waseda 8-Core RP2: Chip Photo and Specifications
[Chip photo: eight cores (Core#0 to Core#7), each with I$, D$, ILRAM, DLRAM, and URAM; snoop controllers SNC0 and SNC1; CSM; and LBSC, DBSC, DDRPAD, SHWY, VSWC, and GCPG blocks.]
Process technology: 90nm, 8-layer, triple-Vth CMOS
Chip size: 104.8mm2 (10.61mm x 9.88mm)
CPU core size: 6.6mm2 (3.36mm x 1.96mm)
Supply voltage: 1.0V-1.4V (internal), 1.8/3.3V (I/O)
Power domains: 17 (8 CPUs, 8 URAMs, common)
IEEE ISSCC08, Paper No. 4.5: M. Ito, ... and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"
Demo of NEDO Multicore for Real-Time Consumer Electronics at the Council for Science and Technology Policy (CSTP), April 10, 2008
CSTP members:
Prime Minister: Mr. Y. FUKUDA
Minister of State for Science, Technology and Innovation Policy: Mr. F. KISHIDA
Chief Cabinet Secretary: Mr. N. MACHIMURA
Minister of Internal Affairs and Communications: Mr. H. MASUDA
Minister of Finance: Mr. F. NUKAGA
Minister of Education, Culture, Sports, Science and Technology: Mr. K. TOKAI
Minister of Economy, Trade and Industry: Mr. A. AMARI
OSCAR Parallelizing Compiler
• Improves effective performance, cost performance, and productivity, and reduces power consumption
– Multigrain Parallelization: exploitation of parallelism from the whole program, using coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism (see the sketch below)
– Data Localization: automatic data distribution for distributed shared memory, cache, and local memory on multiprocessor systems
– Data Transfer Overlapping: hiding data transfer overhead by overlapping task execution and data transfer, using DMA or data pre-fetching
– Power Reduction: reducing power consumption by compiler control of frequency, voltage, and power shutdown, with hardware support
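As a rough illustration of what multigrain parallelization means at the source level, the sketch below uses plain OpenMP rather than the OSCAR API: a loop and an independent subroutine run as coarse-grain tasks, with loop-level parallelism nested inside. The subroutine INIT and all sizes are invented for this example.

      PROGRAM MGRAIN
C     Minimal sketch, assuming plain OpenMP (not the OSCAR API).
C     Two independent coarse-grain tasks (a loop and a subroutine
C     call) run as parallel sections; the loop is additionally
C     parallelized at the iteration level.
      INTEGER I
      REAL A(1000), B(1000)
C$OMP PARALLEL SECTIONS
C$OMP SECTION
C     Coarse-grain task 1: a DOALL loop (Repetition Block)
C$OMP PARALLEL DO
      DO 10 I = 1, 1000
         A(I) = 2.0 * REAL(I)
   10 CONTINUE
C$OMP SECTION
C     Coarse-grain task 2: an independent subroutine (Subroutine Block)
      CALL INIT(B, 1000)
C$OMP END PARALLEL SECTIONS
      PRINT *, A(1), B(1000)
      END

      SUBROUTINE INIT(X, N)
      INTEGER N, I
      REAL X(N)
      DO 20 I = 1, N
         X(I) = REAL(I) * REAL(I)
   20 CONTINUE
      END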
Generation of Coarse-Grain Tasks
A program is decomposed into macro-tasks (MTs):
• Block of Pseudo Assignments (BPA): basic block (BB)
• Repetition Block (RB): natural loop
• Subroutine Block (SB): subroutine
[Figure: hierarchical decomposition. On the 1st layer, the total program is coarse-grain parallelized into BPAs, RBs, and SBs. On the 2nd layer, each BPA is parallelized at the near-fine grain among statements, each RB is parallelized at the loop level or at the near-fine grain within the loop body, and each SB is again coarse-grain parallelized into BPAs, RBs, and SBs; the same decomposition repeats on the 3rd layer.]
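To make the decomposition concrete, here is a small invented fragment annotated with the macro-task kinds the compiler would generate (the partitioning itself is automatic; the labels follow the slide's terminology):

      PROGRAM MTDEMO
C     Invented example: how a program body splits into macro-tasks.
      INTEGER N, I
      PARAMETER (N = 100)
      REAL X, Y, D(N), E(N), RES
C     MT1: BPA, a basic block of assignment statements
      X = 1.5
      Y = X * 2.0
C     MT2: RB, a natural loop; on the 2nd layer its body is
C     parallelized at the loop or near-fine grain
      DO 10 I = 1, N
         E(I) = REAL(I)
         D(I) = Y * E(I)
   10 CONTINUE
C     MT3: SB, a subroutine; on the 2nd layer its body is again
C     decomposed into BPAs, RBs, and SBs
      CALL SOLVE(D, N, RES)
      PRINT *, RES
      END

      SUBROUTINE SOLVE(D, N, RES)
      INTEGER N, I
      REAL D(N), RES
      RES = 0.0
      DO 20 I = 1, N
         RES = RES + D(I)
   20 CONTINUE
      END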
Earliest Executable Condition Analysis for Coarse-Grain Tasks (Macro-Tasks)
The compiler analyzes data dependence and control flow, including conditional branches, among macro-tasks.
[Figure: a Macro Flow Graph of BPA and RB nodes connected by data dependence edges, original control flow edges, and conditional branches, and the Macro Task Graph derived from it. In the Macro Task Graph, solid edges show data dependence, dotted edges show extended control dependence, small circles show conditional branches, and arcs across incoming edges show AND/OR combinations of conditions; the original control flow is removed.]
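As a hand-worked illustration of what this analysis buys (this fragment is invented, not the graph on the slide):

C     Invented fragment to illustrate earliest executable conditions.
      IF (X .GT. 0.0) THEN     ! MT1: BPA ending in a conditional branch
         CALL SUBA(R)          ! MT2: runs only on the THEN path
      ELSE
         CALL SUBB(S)          ! MT3: runs only on the ELSE path
      END IF
      Y = R + 1.0              ! MT4: data-dependent on MT2 through R

Following the original control flow, MT4 would wait for the whole IF construct. Its earliest executable condition, however, is "(MT1 branches to MT3) OR (MT2 completes)": once the branch resolves toward MT3, the value of R can no longer be redefined, so MT4 may start immediately and overlap with MT3. The AND/OR arcs in the Macro Task Graph encode exactly such combined conditions.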
Automatic Processor Assignment in su2cor
• Using 14 processors: coarse-grain parallelization within DO400 of subroutine LOOPS
MTG of su2cor LOOPS DO400: coarse-grain parallelism PARA_ALD = 4.3
[Figure: the macro task graph, with node types DOALL, sequential loop, SB, and BB.]
Data-Localization: Loop Aligned Decomposition
• Decompose multiple loops (DOALL and sequential) into CARs and LRs, considering inter-loop data dependence.
– LR: Localizable Region; CAR: Commonly Accessed Region
– Most data in an LR can be passed through local memory (LM).

C RB1 (Doall)
      DO I=1,101
        A(I)=2*I
      ENDDO
C RB2 (Doseq)
      DO I=1,100
        B(I)=B(I-1)+A(I)+A(I+1)
      ENDDO
C RB3 (Doall)
      DO I=2,100
        C(I)=B(I)+B(I-1)
      ENDDO

Aligned decomposition (see the sketch after this table):
RB1: LR DO I=1,33 | CAR DO I=34,35 | LR DO I=36,66 | CAR DO I=67,68 | LR DO I=69,101
RB2: DO I=1,33 | DO I=34,34 | DO I=35,66 | DO I=67,67 | DO I=68,100
RB3: DO I=2,34 | DO I=35,67 | DO I=68,100
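A sketch of what one processor (PE0) would execute for the first localizable region, using the ranges from the table above (declarations omitted): the three partial loops touch almost the same elements of A and B, so those elements can stay in PE0's local memory across all three loops.

C     PE0's share of the first LR (ranges from the decomposition above).
      DO 10 I = 1, 33          ! part of RB1
         A(I) = 2 * I
   10 CONTINUE
      DO 20 I = 1, 33          ! part of RB2
         B(I) = B(I-1) + A(I) + A(I+1)
   20 CONTINUE
      DO 30 I = 2, 34          ! part of RB3
         C(I) = B(I) + B(I-1)
   30 CONTINUE
C     Only the CAR elements (e.g. A(34:35), B(34)) must be exchanged
C     with other processors through shared memory.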
Data Localization
[Figure: (left) an MTG with data dependence edges; (center) the MTG after loop-aligned division, with the divided macro-tasks grouped into data localization groups dlg0 to dlg3; (right) a schedule of the divided macro-tasks on two processors, PE0 and PE1, in which the macro-tasks of each DLG are executed consecutively on the same processor so that their shared data stays local.]
An Example of Data Localization for Spec95 Swim

      DO 200 J=1,N
      DO 200 I=1,M
      UNEW(I+1,J) = UOLD(I+1,J)+
     1    TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
     2    +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
      VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
     1    *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
     2    -TDTSDY*(H(I,J+1)-H(I,J))
      PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
     1    -TDTSDY*(CV(I,J+1)-CV(I,J))
  200 CONTINUE

      DO 210 J=1,N
      UNEW(1,J) = UNEW(M+1,J)
      VNEW(M+1,J+1) = VNEW(1,J+1)
      PNEW(M+1,J) = PNEW(1,J)
  210 CONTINUE

      DO 300 J=1,N
      DO 300 I=1,M
      UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
      VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
      POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
  300 CONTINUE

(a) An example of a target loop group for data localization
[Figure: alignment of the arrays (U, V, P, UNEW, VNEW, PNEW, UOLD, VOLD, POLD, CU, CV, Z, H) on a 4MB cache; cache line conflicts occur among arrays that share the same location on the cache (e.g., UN/VO/Z, VN/PO/H, PN/CU, and UO/CV fall on the same cache locations).]
(b) Image of alignment of arrays on the cache accessed by the target loops
Data Layout for Removing Line Conflict Misses by Array Dimension Padding
Declaration part of the arrays in Spec95 Swim:

before padding:
      PARAMETER (N1=513, N2=513)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

after padding:
      PARAMETER (N1=513, N2=544)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

[Figure: after padding, the access ranges of DLG0 no longer overlap on the 4MB cache. Box: access range of DLG0.]
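A rough check of why this padding works (back-of-the-envelope numbers, assuming REAL*4 arrays and a 4MB direct-mapped cache): before padding, each array occupies 513 x 513 x 4 = 1,052,676 bytes, so four consecutively allocated arrays span about 4,210,704 bytes, only about 16KB more than 4MB. Arrays four apart in the COMMON block therefore map almost exactly onto the same cache lines and keep evicting one another. After padding, each array occupies 513 x 544 x 4 = 1,116,288 bytes, so four arrays span about 4,465,152 bytes; each group of four is shifted by roughly 265KB in the cache, and the arrays accessed together no longer collide.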
Power Reduction by Power Supply, Clock Frequency, and Voltage Control by the OSCAR Compiler
• Shortest execution time mode
[Figure: a macro-task schedule in which the compiler lowers frequency and voltage for macro-tasks with scheduling slack and powers off idle cores, without extending the total execution time.]
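A conceptual sketch of the inserted control (the runtime entry points OSCAR_SET_FREQ and OSCAR_POWER_OFF below are hypothetical names used for illustration, not the actual OSCAR API or RP2 register interface):

C     Hypothetical sketch: the compiler statically schedules the
C     macro-tasks, finds slack on non-critical paths, and emits
C     control calls like these into the code generated for each core.
      CALL OSCAR_SET_FREQ(ICORE, IFULL)   ! critical-path MT: full speed
      CALL MT2()
      CALL OSCAR_SET_FREQ(ICORE, IHALF)   ! MT with slack: half frequency
      CALL MT3()                          ! and lower voltage, still
C                                           meeting the next deadline
      CALL OSCAR_POWER_OFF(ICORE)         ! idle until the next sync
C                                           point: shut the core down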
Image of Generated Multigrain Parallelized Code Using the Developed Multicore API
(The API is compatible with OpenMP.)
[Figure: the generated program is a single OpenMP SECTIONS construct with one SECTION per thread (T0 to T7). On the 1st layer, centralized scheduling code runs MT1_1, MT1_2 (DOALL), MT1_3 (SB), and MT1_4 (RB). On the 2nd layer, MT1_3 is decomposed into macro-tasks 1_3_1 to 1_3_6 executed by thread group 0, and MT1_4 into 1_4_1 to 1_4_4 executed by thread group 1, each under distributed scheduling code; the groups synchronize with SYNC SEND / SYNC RECV operations. The construct ends with END SECTIONS.]
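The overall shape of such generated code, as a simplified sketch (macro-task bodies, the scheduler internals, and the SEND/RECV flag handling are omitted, and all subroutine names here are placeholders):

C     Simplified sketch of the generated structure: one OpenMP
C     section per thread group. Group 0 runs the 1st-layer MTs and
C     then the 2nd-layer MTs of MT1_3; group 1 runs the 2nd-layer
C     MTs of MT1_4.
C$OMP PARALLEL SECTIONS
C$OMP SECTION
      CALL MT1_1()
      CALL MT1_2()
      CALL GROUP0_SCHED()      ! distributed scheduling of 1_3_1..1_3_6
C$OMP SECTION
      CALL GROUP1_SCHED()      ! distributed scheduling of 1_4_1..1_4_4
C$OMP END PARALLEL SECTIONS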
Compilation Flow Using the OSCAR API
Application program: Fortran or parallelizable C (sequential program)
→ Waseda Univ. OSCAR Parallelizing Compiler
• Coarse-grain task parallelization
• Global data localization
• Data transfer overlapping using DMA
• Power reduction control using DVFS, clock gating, and power gating
→ Parallelized Fortran or C program with API directives for thread generation, memory, data transfer using DMA, and power management (one code with directives per processor/thread)
→ Backend: an API analyzer plus an existing sequential compiler generates parallel machine code (executable code) for each target, e.g., multicores from vendor A and vendor B, making the program executable on various multicores; alternatively, an OpenMP compiler targets shared-memory servers.
OSCAR: Optimally Scheduled Advanced Multiprocessor
API: Application Program Interface
Performance of OSCAR Compiler on IBM p6 595 Power6 (4.2GHz) Based 32-Core SMP Server
OpenMP code generated by the OSCAR compiler is about 3.3 times faster on average than IBM XL Fortran for AIX Ver. 12.1.
Compile options:
(*1) Sequential: -O3 -qarch=pwr6; XLF: -O3 -qarch=pwr6 -qsmp=auto; OSCAR: -O3 -qarch=pwr6 -qsmp=noauto
(*2) Sequential: -O5 -q64 -qarch=pwr6; XLF: -O5 -q64 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -q64 -qarch=pwr6 -qsmp=noauto
(others) Sequential: -O5 -qarch=pwr6; XLF: -O5 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -qarch=pwr6 -qsmp=noauto
Performance of OSCAR Compiler Using the Multicore API on Intel Quad-Core Xeon
[Bar chart: speedup ratios (0 to 9) of Intel Compiler Ver. 10.1 vs. OSCAR for SPEC95 (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 (swim, mgrid, applu, apsi).]
• The OSCAR compiler gives 2.1 times speedup on average over Intel Compiler Ver. 10.1.
Performance of OSCAR Compiler on 16-Core SGI Altix 450 Montvale Server
Compiler options for the Intel compiler: automatic parallelization: -fast -parallel; OpenMP codes generated by OSCAR: -fast -openmp
[Bar chart: speedup ratios (0 to 6) of Intel Compiler Ver. 10.1 vs. OSCAR for the same SPEC95 and SPEC2000 benchmarks.]
• The OSCAR compiler gave 2.3 times speedup over Intel Fortran Itanium Compiler revision 10.1.
Performance of OSCAR Compiler on NEC NaviEngine (ARM-NEC MPCore)
[Bar chart: speedup ratios (0 to 4.5) of g77 vs. OSCAR for SPEC95 mgrid, su2cor, and hydro2d on 1, 2, and 4 PEs.]
Compile option: -O3
• The OSCAR compiler gave 3.43 times speedup against 1 core on the ARM/NEC MPCore with four 400MHz ARM cores.
Performance of OSCAR Compiler Using the Multicore API on Fujitsu FR1000 Multicore
[Bar chart: speedup ratios for MPEG2enc, MPEG2dec, MP3enc, and JPEG 2000enc on 1, 2, 3, and 4 cores; 4-core speedups reach up to 3.76.]
• 3.38 times speedup on average for 4 cores against single-core execution.
Performance of OSCAR Compiler Using the Developed API on a 4-Core (SH4A) OSCAR-Type Multicore
[Bar chart: speedup ratios for MPEG2enc, MPEG2dec, MP3enc, and JPEG 2000enc on 1, 2, 3, and 4 cores; 4-core speedups reach up to 3.50.]
• 3.31 times speedup on average for 4 cores against 1 core.
Processing Performance on the Developed Multicore Using the Automatic Parallelizing Compiler
Speedup against single-core execution for AAC* audio encoding:
1 core: 1.0; 2 cores: 1.9; 4 cores: 3.6; 8 cores: 5.8
*) AAC: Advanced Audio Coding
Power Reduction by OSCAR Parallelizing Compiler for Secure Audio Encoding
AAC encoding + AES encryption with 8 CPU cores:
• Without power control (voltage: 1.4V): average power 5.68 [W]
• With power control (frequency and voltage lowering 1.4V to 1.0V; resume standby: power shutdown): average power 0.67 [W]
88.3% power reduction
Power Reduction by OSCAR Parallelizing Compiler for MPEG2 Decoding
MPEG2 decoding with 8 CPU cores:
• Without power control (voltage: 1.4V): average power 5.73 [W]
• With power control (frequency and voltage lowering 1.4V to 1.0V; resume standby: power shutdown): average power 1.52 [W]
73.5% power reduction
Low Power High Performance Multicore Computer with Solar Panel: Clean-Energy Autonomous Systems
• Servers operational in deserts
Conclusions
• OSCAR compiler-cooperative low-power multicores with high effective performance and short software development periods will be important in a wide range of IT systems, from consumer electronics to exa-scale supercomputers.
• The OSCAR compiler with the API boosts vendor compilers' performance on various multicores and servers:
– 3.3 times on an IBM p595 SMP server using Power6
– 2.1 times on an Intel quad-core Xeon
– 2.3 times on an SGI Altix 450 using Intel Itanium2 (Montvale)
• 88% power reduction by compiler power control on the Renesas-Hitachi-Waseda 8-core (SH4A) multicore RP2 for real-time secure AAC encoding
• 70% power reduction on the same multicore for MPEG2 decoding