A Multigrain Parallelizing Compiler with Power Control for Multicore Processors
Hironori Kasahara
Professor, Department of Computer Science
Director, Advanced Chip-Multiprocessor Research Institute
Waseda University, Tokyo, Japan
http://www.kasahara.cs.waseda.ac.jp
Feb. 5, 2008, at Google Headquarters
Hironori Kasahara

<Personal History>
B.S. (1980, Waseda), M.S. (1982, Waseda), Ph.D. (1985, EE, Waseda). Research Associate (1983, Waseda), JSPS Special Research Fellow (1985), Visiting Scholar (1985, Univ. of California at Berkeley), Assistant Professor (1986, Waseda), Associate Professor (1988, Waseda), Visiting Research Scholar (1989-1990, Center for Supercomputing R&D, Univ. of Illinois at Urbana-Champaign), Professor (1997-, Dept. of Computer Science, Waseda). IFAC World Congress Young Author Prize (1987), IPSJ Sakai Memorial Special Award (1997), STARC Industry-Academia Cooperative Research Award (2004).

<Activities for Societies>
IPSJ: SIG Computer Architecture (Chair), Transactions of IPSJ Editorial Board (HG Chair), Journal of IPSJ Editorial Board (HWG Chair), 2001 Journal of IPSJ Special Issue on Parallel Processing (Chair of Editorial Board / Guest Editor), JSPP2000 (Program Chair), etc.
ACM: International Conference on Supercomputing (ICS) Program Committee (including '96 ENIAC 50th Anniversary Co-Program Chair).
IEEE: Computer Society Japan Chapter Chair, Tokyo Section Board Member, SC07 PC.
OTHER: PCs of many conferences on supercomputing and parallel processing.
METI: IT Policy Proposal Forum (Architecture/HPC WG Chair), Super Advanced Electronic Basis Technology Investigation Committee.
NEDO: Millennium Project IT21 "Advanced Parallelizing Compiler" (Project Leader), Computer Strategy WG (Chair), Multicore for Real-time Consumer Electronics Project Leader, project evaluation committees, etc.
MEXT: Earth Simulator project evaluation committee, 10 PFLOPS supercomputer evaluation committee.
JAERI: Research accomplishment evaluation committee, CCSE 1st class invited researcher.
JST: Scientific Research Fund Sub-Committee, COINS Steering Committee, Precursory Research for Embryonic Science and Technology (Research Area Adviser).
Cabinet Office: CSTP Expert Panel on Basic Policy, Information & Communication Field Promotion Strategy, R&D Infrastructure WG, Software & Security WG.

<Papers>
158 papers, 64 invited talks, 114 technical reports, 25 symposium papers, 139 newspaper/TV/web news/magazine items; IEEE Trans. Computers, IPSJ Trans., ISSCC, Cool Chips, Supercomputing, ACM ICS, etc.
Multi-core Everywhere
Multi-core from embedded to supercomputers

- Consumer Electronics (Embedded): mobile phones, games, digital TVs, car navigation, DVD, cameras. IBM/Sony/Toshiba Cell, Fujitsu FR1000, NEC/ARM MPCore & MP211, Panasonic UniPhier, Renesas SH multi-core (4-core RP1, 8-core RP2), Tilera Tile64, SPI Storm-1 (16 VLIW cores).
  (Photo: OSCAR-type multi-core chip by Renesas in the METI/NEDO Multicore for Real-time Consumer Electronics Project, Leader: Prof. Kasahara, Jul. 30, 2007.)
- PCs, Servers: Intel Quad Xeon, Core 2 Quad, Montvale, Tukwila, 80-core; AMD Quad-Core Opteron, Phenom.
- WSs, Deskside & High-end Servers: IBM Power4, 5, 5+, 6; Sun Niagara (SPARC T1, T2), Rock.
- Supercomputers: Earth Simulator (40 TFLOPS, 2002, 5120 vector processors); IBM Blue Gene/L (360 TFLOPS, 2005, low-power CMP-based, 128K processor chips, installed at Lawrence Livermore National Laboratory); BG/P in 2008.
  (Figure labels, translated from Japanese: a supercomputer based on low-power chip multiprocessors, scheduled to start operation in April 2005; 2 processors per chip, 1024 processor chips per cabinet, 65,536 processor chips, 128K processors, 360 TFLOPS = 360 trillion floating-point operations per second.)

High-quality application software, productivity, cost performance, and low power consumption are important, e.g., for mobile phones and games. Compiler-cooperative multi-core processors are promising for realizing these features.
Market of Consumer Electronics: 1 Trillion Dollars in 2010 (Worldwide)

(Chart, Japanese labels translated: worldwide market size in shipment value from '02 to '06, on the order of 12-16 trillion yen in total, with individual product lines (mobile phones, flat-panel TVs, digital cameras, digital video, DVD recorders, car navigation) each in the range of roughly 0.5-2 trillion yen.)

Annual growth rates ('03 vs. '07 shipments):
  Digital still cameras (M units):        49   ->  76    (12% per year)
  Digital TVs (M units):                   6   ->  27    (45% per year)
  DVD recorders (M units):                 3.6 ->  33    (74% per year)
  DVDs for PCs, recordable (M units):     27   -> 114    (43% per year)
  Mobile phones (M units):               490   -> 670    ( 8% per year)
  Semiconductor demand for cars (B$):     14.0 ->  20.9  (11% per year)

Source: NEDO Roadmap Report Meeting, Electronics and Information Technology Development Department, "Technology Development Strategy," May 11, 2005.
Roadmap of Compiler-Cooperative Multicore Projects

(Timeline, 2000-2010:)
- 2000-2003, Millennium Project IT21: NEDO Advanced Parallelizing Compiler (Waseda Univ., Fujitsu, Hitachi, JIPDEC, AIST). Compiler development for multiprocessor servers, applied to supercomputer compilers.
- STARC Compiler-Cooperative Chip Multiprocessor: basic research through practical research (partners in different phases: Waseda, Toshiba, Fujitsu, Panasonic, Sony; Waseda, Fujitsu, NEC, Toshiba; Waseda Univ., Fujitsu, NEC, Toshiba, Panasonic, Sony). STARC: Semiconductor Technology Academic Research Center (Fujitsu, Toshiba, NEC, Renesas, Panasonic, Sony, etc.).
- NEDO (2004.07-2007.06): Heterogeneous Multiprocessor (Waseda Univ., Hitachi). Architecture and compiler R&D from plan toward practical use.
- NEDO (2005.06-2008.03): Multicore Technology for Real-time Consumer Electronics (Waseda Univ., Hitachi, Renesas, Fujitsu, NEC, Toshiba, Panasonic). Power-saving multicore architecture, parallelizing compiler, API; plan, soft real-time multicore architecture/compiler/API R&D, then practical use.
- NEDO (2007.02-2010.03): Heterogeneous Multicore for Consumer Electronics (Waseda Univ., Hitachi, Renesas, Tokyo Inst. of Tech.). Heterogeneous multicore architecture and compiler R&D, from plan toward practical use.
METI/NEDO Advanced Parallelizing Compiler (APC) Technology Project: Background and Problems
Millennium Project IT21, Sep. 8, 2000 - Mar. 31, 2003 (Waseda Univ., Fujitsu, Hitachi, AIST)

Background:
① Adoption of parallel processing as a core technology from PCs to HPC
② Increasing importance of software in IT
③ Need to improve cost-performance and usability

(Figure: hardware peak performance vs. effective performance of HPC systems, 1980-2000, on a scale from 1G to 1T; effective performance grows more slowly, so the gap keeps widening, motivating improvement of ① effective performance, ② cost-performance, and ③ ease of use.)

Contents of research and development:
① R&D of an advanced parallelizing compiler: multigrain parallelization, data localization, overhead hiding
② R&D of performance evaluation technology for parallelizing compilers
Goal: double the effective performance.

Ripple effects:
① Development of competitive next-generation PCs and HPC
② Putting the innovative automatic parallelizing compiler technology to practical use
③ Development and market acquisition of future single-chip multiprocessors
④ Boosting R&D in many fields: IT, biotechnology, devices, earth environment, next-generation VLSI design, financial engineering, weather forecasting, new clean energy, space development, automobiles, electronic commerce, etc.
Performance of the APC Compiler on an IBM pSeries 690 16-Processor High-end Server

- IBM XL Fortran for AIX Version 8.1
  - Sequential execution: -O5 -qarch=pwr4
  - Automatic loop parallelization: -O5 -qsmp=auto -qarch=pwr4
  - OSCAR compiler: -O5 -qsmp=noauto -qarch=pwr4 (su2cor: -O4 -qstrict)

(Bar chart: speedup ratios of XL Fortran automatic loop parallelization (max) vs. the APC/OSCAR compiler (max) for SPEC95 benchmarks (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 benchmarks (swim, mgrid, applu, apsi), annotated with execution times. The APC compiler gave 3.5 times speedup on average over XL Fortran.)
Performance of Multigrain Parallel Processing for 102.swim on the IBM pSeries 690

(Line chart: speedup ratio vs. number of processors, 1-16, for XLF automatic parallelization (XLF(AUTO)) and the OSCAR compiler; the vertical axis runs from 0 to 16.)
OSCAR Parallelizing Compiler
- Improves effective performance, cost-performance, and productivity, and reduces power consumption
  - Multigrain Parallelization: exploitation of parallelism from the whole program, using coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism.
  - Data Localization: automatic data distribution for distributed shared memory, cache, and local memory on multiprocessor systems.
  - Data Transfer Overlapping: data transfer overhead hiding by overlapping task execution and data transfer using DMA or data pre-fetching.
  - Power Reduction: reduction of power consumption by compiler control of frequency, voltage, and power shutdown with hardware support.
Generation of Coarse-Grain Tasks
Macro-tasks (MTs):
- Block of Pseudo Assignments (BPA): basic block (BB)
- Repetition Block (RB): outermost natural loop
- Subroutine Block (SB): subroutine

(Figure: the total program is decomposed hierarchically. At the 1st layer the program splits into BPA, RB, and SB macro-tasks; BPAs are parallelized at near-fine grain, RBs by loop-level parallelization or near-fine-grain parallelization of the loop body, and SBs by coarse-grain parallelization. Each RB and SB is again decomposed into BPA/RB/SB macro-tasks at the 2nd and 3rd layers.)

A hand-written C fragment illustrating the three kinds of macro-tasks follows.
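As a rough illustration of these three kinds of macro-tasks (not compiler output; the routine and loop bodies are invented for the example):

    #include <stdio.h>

    #define N 1000

    /* SB: a whole subroutine call becomes one macro-task
     * (the routine itself is made up for the illustration). */
    static void init(double *a, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = (double)i;
    }

    int main(void)
    {
        static double a[N], b[N];
        double sum;

        init(a, N);                   /* SB : Subroutine Block                   */

        for (int i = 0; i < N; i++)   /* RB : Repetition Block (outermost loop); */
            b[i] = 2.0 * a[i];        /*      loop-level / near-fine-grain       */
                                      /*      parallelism inside the loop body   */

        sum = b[0] + b[N - 1];        /* BPA: Block of Pseudo Assignments        */
        printf("%f\n", sum);          /*      (straight-line basic block)        */
        return 0;
    }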
Earliest Executable Condition Analysis for Coarse-Grain Tasks (Macro-Tasks)

The compiler analyzes data dependences, control flow, and conditional branches among macro-tasks (BPA: block of pseudo assignment statements, RB: repetition block).

(Figures: a Macro Flow Graph showing the original control flow, conditional branches, and data dependences among macro-tasks 1-15, and the derived Macro Task Graph, whose edges represent data dependences and extended control dependences combined with AND/OR of branch conditions. A macro-task can start as soon as its earliest executable condition is satisfied, i.e., once its data-dependence predecessors have finished and the relevant branches have been resolved, without waiting for the branching macro-task itself to complete.)

A simplified code sketch of this condition is given below.
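The graphs themselves cannot be reproduced from the extracted text, but the condition they encode can be sketched in code. The structure below is a deliberately simplified illustration, assuming one optional controlling branch per macro-task, whereas the real analysis builds AND/OR condition trees over the macro flow graph:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_PRED 8

    /* Simplified macro-task record: data-dependence predecessors plus one
     * optional controlling branch. */
    typedef struct MacroTask {
        int  n_data_pred;
        struct MacroTask *data_pred[MAX_PRED]; /* tasks producing our inputs    */
        struct MacroTask *branch;              /* task whose branch selects us  */
        bool branch_taken_toward_us;           /* set when that branch resolves */
        bool finished;
    } MacroTask;

    /* Earliest executable condition (simplified): all data predecessors have
     * finished AND, if the task is control-dependent on a branch, that branch
     * has already been decided in our direction.  Note that we wait only for
     * the branch decision, not for the branching macro-task to complete. */
    static bool earliest_executable(const MacroTask *mt)
    {
        for (int i = 0; i < mt->n_data_pred; i++)
            if (!mt->data_pred[i]->finished)
                return false;
        if (mt->branch != NULL && !mt->branch_taken_toward_us)
            return false;
        return true;
    }

    int main(void)
    {
        MacroTask mt1 = { 0, {0}, NULL, false, false };   /* branching task  */
        MacroTask mt3 = { 0, {0}, &mt1, false, false };   /* selected by MT1 */

        printf("%d\n", earliest_executable(&mt3)); /* 0: branch not yet resolved */
        mt3.branch_taken_toward_us = true;         /* MT1's branch direction known */
        printf("%d\n", earliest_executable(&mt3)); /* 1: MT3 may start even though
                                                      MT1 itself has not finished */
        return 0;
    }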
Automatic Processor Assignment in 103.su2cor
- Using 14 processors: coarse-grain parallelization within loop DO400 of subroutine LOOPS.

(Figure: macro-task graph of su2cor LOOPS DO400, containing DOALL loops, sequential loops, subroutine blocks (SB), and basic blocks (BB); the coarse-grain parallelism PARA_ALD is 4.3.)
Data Localization: Loop-Aligned Decomposition
- Decompose multiple loops (DOALL and sequential) into CARs and LRs considering inter-loop data dependences.
  - Most data in an LR can be passed through local memory (LM).
  - LR: Localizable Region, CAR: Commonly Accessed Region

    C     RB1 (Doall)
          DO I=1,101
            A(I)=2*I
          ENDDO
    C     RB2 (Doseq)
          DO I=1,100
            B(I)=B(I-1)+A(I)+A(I+1)
          ENDDO
    C     RB3 (Doall)
          DO I=2,100
            C(I)=B(I)+B(I-1)
          ENDDO

  Aligned decomposition:
    RB1:  LR DO I=1,33 | CAR DO I=34,35 | LR DO I=36,66 | CAR DO I=67,68 | LR DO I=69,101
    RB2:     DO I=1,33 |     DO I=34,34 |    DO I=35,66 |     DO I=67,67 |    DO I=68,100
    RB3:     DO I=2,34 |                |    DO I=35,67 |                |    DO I=68,100
Data Localization

(Figures: the macro-task graph (MTG) of the example, the MTG after loop-aligned decomposition in which the decomposed loop iterations are grouped into data-localization groups dlg0-dlg3, and a resulting schedule for two processors (PE0, PE1) in which the macro-tasks of each data-localization group are executed consecutively on the same processor so that shared data stays in local memory or cache.)
An Example of Data Localization for SPEC95 swim

          DO 200 J=1,N
          DO 200 I=1,M
            UNEW(I+1,J) = UOLD(I+1,J)+
         1    TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
         2    +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
            VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
         1    *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
         2    -TDTSDY*(H(I,J+1)-H(I,J))
            PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
         1    -TDTSDY*(CV(I,J+1)-CV(I,J))
      200 CONTINUE

          DO 210 J=1,N
            UNEW(1,J) = UNEW(M+1,J)
            VNEW(M+1,J+1) = VNEW(1,J+1)
            PNEW(M+1,J) = PNEW(1,J)
      210 CONTINUE

          DO 300 J=1,N
          DO 300 I=1,M
            UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
            VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
            POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
      300 CONTINUE

(a) An example of a target loop group for data localization

(b) Image of the alignment on a 4 MB cache of the arrays accessed by the target loops (UNEW, VNEW, PNEW, UOLD, VOLD, POLD, U, V, P, CU, CV, Z, H): the arrays are laid out consecutively, so several of them share the same locations on the cache and cache line conflicts occur among them.
Data Layout for Removing Line Conflict Misses by Array Dimension Padding
Declaration part of the arrays in SPEC95 swim:

  Before padding:
        PARAMETER (N1=513, N2=513)
        COMMON U(N1,N2), V(N1,N2), P(N1,N2),
       *  UNEW(N1,N2), VNEW(N1,N2),
       1  PNEW(N1,N2), UOLD(N1,N2),
       *  VOLD(N1,N2), POLD(N1,N2),
       2  CU(N1,N2), CV(N1,N2),
       *  Z(N1,N2), H(N1,N2)

  After padding:
        PARAMETER (N1=513, N2=544)
        COMMON U(N1,N2), V(N1,N2), P(N1,N2),
       *  UNEW(N1,N2), VNEW(N1,N2),
       1  PNEW(N1,N2), UOLD(N1,N2),
       *  VOLD(N1,N2), POLD(N1,N2),
       2  CU(N1,N2), CV(N1,N2),
       *  Z(N1,N2), H(N1,N2)

(Figure: with padding, the access ranges of data-localization group DLG0 no longer collide in the 4 MB cache, removing the line conflicts.)
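The same padding idea can be written directly in C. The 544 value simply copies the Fortran PARAMETER change above, and only three of the swim arrays are shown; this is a sketch of the technique, not the benchmark source:

    #include <stdio.h>

    /* Before padding, each 513 x 513 double array occupies about 2.1 MB, and
     * the thirteen swim arrays are laid out back to back, so several of them
     * map onto the same sets of a 4 MB cache and evict each other (line
     * conflicts).  Enlarging one dimension to 544 changes the spacing of the
     * array start addresses and spreads the arrays over the cache. */
    #define N1 513
    #define N2_PAD 544          /* padded dimension (originally 513) */

    static double u[N2_PAD][N1], v[N2_PAD][N1], p[N2_PAD][N1];

    int main(void)
    {
        u[0][0] = 1.0; v[0][0] = 2.0; p[0][0] = 3.0;
        printf("array size with padding: %zu bytes\n", sizeof u);
        return 0;
    }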
APC Compiler Organization

A Fortran source program is processed by the APC multigrain parallelizing compiler:
- Data dependence analysis: inter-procedural, conditional
- Multigrain parallelization: hierarchical parallelism, affine partitioning
- Automatic data distribution: cache optimization, DSM optimization
- Scheduling: static scheduling, dynamic scheduling
- Speculative execution: coarse grain, medium grain, architecture support
- OpenMP code generation

The generated OpenMP code is compiled by native compilers (Sun Forte 6 Update 2, IBM XL Fortran Ver. 7.1, IBM XL Fortran Ver. 8.1) for a variety of shared-memory parallel machines (Sun Ultra 80, IBM RS6000, IBM pSeries 690).

A parallelizing tuning system provides feedback: program visualization techniques, techniques for profiling and utilizing runtime information, and feedback-directed selection of compiler directives.
Image of Generated OpenMP Code for Hierarchical Multigrain Parallel Processing

(Figure: threads T0-T7 are forked once for the whole program. In the 1st layer, OpenMP SECTIONS split the threads into two thread groups. The 1st-layer macro-tasks MT1_1, MT1_2, MT1_3 (SB), and MT1_4 (DOALL RB) are assigned to the groups, and the 2nd-layer macro-tasks inside MT1_3 (1_3_1 ... 1_3_6) and MT1_4 (1_4_1 ... 1_4_4) are executed within each group using centralized or distributed scheduling code, with SEND/RECV-style synchronization between the groups; END SECTIONS closes the region.)

A compact OpenMP sketch of this code shape is given below.
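The sketch below only suggests the code shape implied by the figure. Task names and bodies are invented, two thread groups are hard-coded, and a fixed execution order stands in for the compiler's generated scheduling code; the point is the one-time thread creation and the SECTIONS-based splitting into thread groups:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000
    static double a[N], b[N], c[N];

    static void mt1_1(void) { for (int i = 0; i < N; i++) a[i] = (double)i; }
    static void mt1_2(void) { for (int i = 0; i < N; i++) a[i] += 1.0; }
    static void mt1_3_body(int id) { b[id] = 2.0 * id; }  /* one 2nd-layer MT of MT1_3 (SB) */
    static void mt1_4_chunk(int lo, int hi)               /* one group's share of MT1_4     */
    {
        for (int i = lo; i < hi; i++) c[i] = 0.5 * i;
    }

    int main(void)
    {
        /* One-time creation of threads: a single parallel region for the whole
         * run, so there is no fork/join per macro-task. */
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            {   /* thread group 0 */
                mt1_1();
                mt1_2();
                mt1_4_chunk(0, N / 2);
            }
            #pragma omp section
            {   /* thread group 1: 2nd-layer macro-tasks of MT1_3, then its
                 * share of MT1_4 (a fixed order stands in for the compiler's
                 * static/dynamic scheduling code) */
                for (int id = 0; id < N; id++)
                    mt1_3_body(id);
                mt1_4_chunk(N / 2, N);
            }
        }
        printf("%f %f %f\n", a[N - 1], b[N - 1], c[N - 1]);
        return 0;
    }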
Performance of the OSCAR Multigrain Parallelizing Compiler on an IBM p550q 8-Core Deskside Server
2.7 times speedup against the loop-parallelizing compiler on 8 cores.

(Bar chart: speedup ratios of loop parallelization vs. multigrain parallelization for SPEC95 (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 (swim, mgrid, applu, apsi).)
OSCAR Compiler Performance on a 24-Processor IBM p690 High-end SMP Server

(Figure: four 8-way multi-chip modules (MCM0-MCM3), each with processor pairs sharing L2 caches and connected to L3 caches, memory slots, and GX slots, assembled into a 32-way pSeries 690.)

(Bar chart: speedup ratios of IBM XL Fortran Ver. 8.1 vs. the OSCAR compiler for SPEC95 (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 (swim, mgrid, applu, apsi); the vertical axis runs from 0 to 25.)

The OSCAR compiler gave 4.82 times speedup against loop parallelization.
Performance on an SGI Altix 450 Montecito 16-Processor cc-NUMA Server

(Bar chart: speedup ratios of the Intel compiler Ver. 8.1 vs. the OSCAR compiler for SPEC95 (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 (swim, mgrid, applu, apsi).)

- The OSCAR compiler gave 1.86 times speedup against the Intel Fortran Itanium Compiler revision 8.1.
Performance of the OSCAR Compiler on a 16-Core SGI Altix 450 Montvale Server

(Bar chart: speedup ratios of the Intel compiler Ver. 10.1 vs. the OSCAR compiler for the same SPEC95 and SPEC2000 benchmarks.)

- The OSCAR compiler gave 2.32 times speedup against the Intel Fortran Itanium Compiler revision 10.1.
NEC/ARM MPCore Embedded 4-Core SMP

(Bar chart: speedup ratios of g77 vs. the OSCAR compiler on 1, 2, 3, and 4 processors for the SPEC95 benchmarks tomcatv, swim, su2cor, hydro2d, mgrid, applu, and turb3d.)

3.48 times speedup by the OSCAR compiler against sequential processing.
Power Reduction by Power Supply, Clock Frequency, and Voltage Control by the OSCAR Compiler
- Shortest execution time mode
OSCAR Multi-Core Architecture

(Block diagram: chip multiprocessors CMP0 ... CMPm, centralized shared memories (CSM), and I/O multicore chips are connected by an inter-chip connection network (crossbar, buses, or multistage network). Inside each chip, processor elements PE0 ... PEn are connected by an intra-chip connection network (multiple buses, crossbar, etc.) to a CSM / L2 cache. Each PE contains a CPU, LPM/I-cache, LDM/D-cache, DSM, DTC, network interface, and FVR; FVRs are also attached to the chip and to the shared resources.)

CSM: centralized shared memory; LDM: local data memory; DSM: distributed shared memory; LPM: local program memory; DTC: data transfer controller; FVR: frequency/voltage control register.
An Example of Machine Parameters for the Power Saving Scheme
- Functions of the multiprocessor:
  - The frequency of each processor can be changed among several levels
  - Voltage is changed together with frequency
  - Each processor can be powered on/off

  State parameters (relative to FULL):
    state   frequency   voltage   dynamic energy   static power
    FULL        1         1             1               1
    MID        1/2       0.87          3/4              1
    LOW        1/4       0.71          1/2              1
    OFF         0         0             0               0

- State transition overhead:
  Delay time [u.t.]:
            FULL   MID   LOW   OFF
    FULL      0    40k   40k   80k
    MID      40k    0    40k   80k
    LOW      40k   40k    0    80k
    OFF      80k   80k   80k    0

  Energy overhead [μJ]:
            FULL   MID   LOW   OFF
    FULL      0     20    20    40
    MID      20      0    20    40
    LOW      20     20     0    40
    OFF      40     40    40     0

A simplified sketch of how these parameters could drive the state choice for a task follows.
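As a simplified sketch of how a compiler-directed scheme might use these parameters in real-time mode: for a macro-task with slack before its deadline, pick the slowest state whose stretched execution time (plus the transition delay) still fits, since dynamic energy falls as the state is lowered. The state values below are copied from the tables; the task and slack representation is invented for the illustration:

    #include <stdio.h>

    /* Frequency/voltage states from the slide, FULL..LOW (OFF is used only
     * for idle periods).  Values are relative to FULL. */
    typedef struct { const char *name; double speed; double dyn_energy; } FVState;

    static const FVState states[] = {
        { "FULL", 1.00, 1.00 },
        { "MID",  0.50, 0.75 },   /* 1/2 frequency, 3/4 dynamic energy */
        { "LOW",  0.25, 0.50 },   /* 1/4 frequency, 1/2 dynamic energy */
    };

    /* Pick the slowest (hence lowest dynamic energy) state such that the task
     * still finishes within exec_full + slack, accounting for a fixed 40k u.t.
     * transition delay taken from the overhead table. */
    static const FVState *choose_state(double exec_full, double slack)
    {
        const double transition_delay = 40000.0;  /* [u.t.] */
        const FVState *best = &states[0];
        for (int i = 1; i < 3; i++) {
            double stretched = exec_full / states[i].speed + transition_delay;
            if (stretched <= exec_full + slack)
                best = &states[i];      /* slower state still meets the deadline */
        }
        return best;
    }

    int main(void)
    {
        /* Example: a task of 10^6 u.t. at FULL with 1.5 x 10^6 u.t. of slack
         * can run at MID (about 2 x 10^6 u.t. in total) but not at LOW. */
        const FVState *s = choose_state(1.0e6, 1.5e6);
        printf("chosen state: %s\n", s->name);
        return 0;
    }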
Speed-up in the Fastest Execution Mode

(Bar chart: speedup ratios without and with power saving on 1, 2, and 4 processors for the benchmarks tomcatv, swim, applu, and mpeg2enc.)
Consumed Energy in the Fastest Execution Mode

(Bar charts: consumed energy without and with power saving on 1, 2, and 4 processors; tomcatv, swim, and applu are plotted in joules (up to about 200 J) and mpeg2_encode in millijoules (up to about 1600 mJ).)
Energy Reduction by the OSCAR Compiler in Real-time Processing Mode (10% Leakage)
- Deadline = sequential execution time; leakage power: 10%.
METI/NEDO National Project: Multi-core for Real-time Consumer Electronics
R&D of compiler-cooperative multicore processor technology for consumer electronics such as mobile phones, games, DVDs, digital TVs, and car navigation systems. From July 2005 to March 2008 (2005.7-2008.3).**

Goals:
- Good cost performance
- Short hardware and software development periods
- Low power consumption
- Scalable performance improvement with the advancement of semiconductors
- Use of the same parallelizing compiler for multi-cores from different vendors through a newly developed API (Application Programming Interface)

(Block diagram, Japanese labels translated: multicore chips CMP0 ... CMPm, each with processor cores PC0 ... PCn containing a CPU, local program memory / instruction cache, local data memory / L1 data cache, data transfer controller (DTC), distributed shared memory (DSM), and network interface (NI), connected by an intra-chip connection network (multiple buses, crossbar, etc.) to a centralized shared memory or L2 cache. Chips, centralized shared memories, and I/O multicore chips are connected by an inter-chip connection network (multiple buses, crossbar, multistage network, etc.). Annotations: the new multicore processors offer high performance, low power consumption, short HW/SW development periods, application sharing across chips, high reliability, and performance that grows with semiconductor integration; the developed multicore chips go into consumer electronics and multicore integrated ECUs.)

**Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC
NEDO Multicore Technology for Real-time Consumer Electronics: R&D Organization (2005.7-2008.3)

Integrated R&D Steering Committee
- Chair: Hironori Kasahara (Waseda Univ.), Project Leader
- Sub-chair: Kunio Uchiyama (Hitachi), Project Sub-leader

Grant: Multicore Architecture and Compiler Research and Development
- Project Leader: Prof. Hironori Kasahara, Waseda Univ.
- R&D Steering Committee
- Architecture & Compiler R&D Group; Group Leader: Associate Prof. Keiji Kimura, Waseda Univ. (outsourcing: Fujitsu)
- Standard Multicore Architecture & API Committee; Chair: Prof. Hironori Kasahara, Waseda Univ. (Hitachi, Fujitsu, Toshiba, NEC, Panasonic, Renesas Technology)

Subsidy: Evaluation Environment for Multicore Technology
- Project Sub-leader: Dr. Kunio Uchiyama, Chief Researcher, Hitachi
- Test Chip Development Group; Group Leader: Renesas Technology
- Evaluation System Development Group; Group Leader: Hitachi
Fujitsu FR-1000 Multicore Processor / FR550 VLIW Processor Core

(Block diagrams: the FR-V multi-core processor contains four FR550 cores connected by a crossbar switch to DMA engines (DMA-I, DMA-E), a local bus interface, and two memory controllers with attached memories. Each FR550 core has a 32 KB I-cache, a 32 KB D-cache, GR/FR register files, an integer operation unit, and a media operation unit, issuing up to 8 instructions per cycle (Inst. 0 - Inst. 7). A system configuration connects FR-1000 chips (Core1/Core2 each) through switches to memories and a fast I/O bus via an I/O chip. Memory bus: 64 bit x 2 ch / 266 MHz; system bus: 64 bit / 178 MHz.)
Panasonic UniPhier
CELL Processor Overview
- Power Processor Element (PPE)
  - The Power core processes OS and control tasks
  - 2-way multithreaded
- Synergistic Processor Element (SPE)
  - 8 SPEs offer high performance
  - Dual-issue RISC architecture
  - 128-bit SIMD (16-way)
  - 128 x 128-bit general registers
  - 256 KB local store
  - Dedicated DMA engines
  - Synergistic processor elements for high FLOPS/Watt

(Block diagram: eight SPEs, each with an SPU and local store (LS) connected at 16 B/cycle to the EIB (up to 96 B/cycle); the PPE, with a 512 KB L2 and 32 KB + 32 KB L1, connects at 32 B/cycle; the MIC connects to dual XDR memory and the BIC to RRAC I/O at 16 B/cycle each. The PPE is a 64-bit Power Architecture core with VMX for conventional processing.)
1987 OSCAR (Optimally Scheduled Advanced Multiprocessor)
OSCAR (Optimally Scheduled Advanced Multiprocessor)

(Block diagram: a host computer connects to a control & I/O processor (RISC processor plus I/O processor). Three centralized shared memories CSM1-CSM3 (simultaneously readable, with banks addressable at the same address) and the processor elements are connected through buses with a read & write request arbitrator and bus interfaces. Sixteen processor elements PE1-PE16 are organized into 5-PE clusters (SPC1-SPC3) and 8-PE processor clusters (LPC1, LPC2). Each PE (CP) contains a 5 MFLOPS, 32-bit RISC processor with 64 registers, 2 banks of program memory, data memory, stack memory, a DMA controller, and a dual-port distributed shared memory.)
OSCAR PE (Processor Element)

(Block diagram: the PE connects to the system bus through a bus interface and has two local buses plus an instruction bus linking the following units.)
- DMA: DMA controller
- LPM: local program memory (128 KW x 2 banks)
- INSC: instruction control unit
- DSM: distributed shared memory (2 KW)
- LSM: local stack memory (4 KW)
- LDM: local data memory (256 KW)
- DP: data path
- IPU: integer processing unit
- FPU: floating-point processing unit
- REG: register file (64 registers)
1987 OSCAR PE Board
OSCAR Memory Space

System memory space:
  00000000  undefined
  00100000  broadcast (PE0, PE1, ..., PE15)
  00200000  not used
  01000000  CP (control processor)
  01100000  not used
  02000000  PE0
  02100000  PE1
  02200000  ...
  02F00000  PE15
  03000000  CSM1, 2, 3 (centralized shared memory)
  03400000  not used
  FFFFFFFF  (end)

Local memory space:
  00000000  DSM (distributed shared memory; local memory area)
  00000800  not used
  00010000  LPM bank 0 (local program memory)
  00020000  LPM bank 1
  00030000  not used
  00040000  LDM (local data memory)
  00080000  not used
  000F0000  control
  00100000  system accessing area
  FFFFFFFF  (end)
OSCAR Multi-Core Architecture (recap of the block diagram and abbreviations shown earlier)
API and Parallelizing Compiler in the METI/NEDO Advanced Multicore for Real-time Consumer Electronics Project

(Flow diagram: a sequential application program written in a subset of the C language, together with API directives that specify data assignment, data transfer, and power reduction control, is fed to the Waseda OSCAR compiler. The compiler translates it into parallel code for each vendor and produces a macro-task schedule per processor (e.g., T1 and T4 on Proc0; T2 on Proc1, which is then stopped; T3 and T6 on Proc2, which is slowed down), with data transfers performed by the DTC (DMAC). Each vendor's backend compiler, consisting of an API decoder and a sequential compiler, then generates executable machine code for its own chip: SH multicore, FR-V, and (CELL). Target applications are real-time consumer electronics programs such as image processing, security, audio, and streaming.)

An illustrative fragment of such an annotated sequential program follows.
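The fragment below only illustrates the kind of information such API directives convey on top of a sequential C program; the pragma names are invented placeholders, not the actual OSCAR API syntax, and an ordinary compiler simply ignores them:

    #include <stdio.h>

    #define N 1024
    static double in[N], out[N];

    /* Hypothetical hints: where data should live, what the DTC/DMAC should
     * transfer, and when a core may be stopped.  Names are made up. */
    #pragma hypothetical_oscar data_assign(in : local_data_memory(pe0))
    #pragma hypothetical_oscar dma_transfer(in -> dsm(pe1), overlap_with_exec)
    static void filter(void)
    {
        for (int i = 1; i < N - 1; i++)
            out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            in[i] = (double)i;
        filter();
    #pragma hypothetical_oscar power_control(pe1 : stop)  /* idle core shut down */
        printf("%f\n", out[N / 2]);
        return 0;
    }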
OSCAR Multigrain Parallelizing Compiler
- Automatic parallelization
  - Multigrain parallel processing
  - Data localization
  - Data transfer overlapping
  - Compiler-controlled power saving scheme
- Compiler-cooperative multi-core architectures
  - OSCAR multi-core architecture
  - OSCAR heterogeneous multiprocessor architecture
- Commercial SMP machines
Processor Block Diagram (600 MHz cores)

(Block diagram: four cores (Core #0-#3), each with a CPU, FPU, 32 KB I-cache, 32 KB D-cache, cache controller (CCN), 8 KB instruction local memory (IL), 16 KB data local memory (DL), 128 KB user RAM (URAM), and a local clock pulse generator (LCPG 0-3); a snoop controller (SNC) keeps the caches coherent. The cores connect to a 300 MHz on-chip system bus (SHwy) together with the global clock pulse generator (GCPG), an SRAM controller (LBSC, 32-bit), a DDR2 controller (DBSC, 32-bit), hardware IP blocks, and a 4-lane PCI Express interface.)

CCN: cache controller; IL, DL: instruction/data local memory; URAM: user RAM; GCPG: global clock pulse generator; LCPG: local CPG for each core; LBSC: SRAM controller; DBSC: DDR2 controller.
Chip Overview (SH4A multicore SoC chip; die photo shows Core #0-#3, the SNC, and peripherals)

  Process technology:   90 nm, 8-layer, triple-Vth, CMOS
  Chip size:            97.6 mm2 (9.88 mm x 9.88 mm)
  Supply voltage:       1.0 V (internal), 1.8/3.3 V (I/O)
  Power consumption:    0.6 mW/MHz/CPU @ 600 MHz (90 nm G)
  Clock frequency:      600 MHz
  CPU performance:      4320 MIPS (Dhrystone 2.1)
  FPU performance:      16.8 GFLOPS
  I/D cache:            32 KB, 4-way set-associative (each)
  ILRAM/OLRAM:          8 KB / 16 KB (each CPU)
  URAM:                 128 KB (each CPU)
  Package:              FCBGA, 554-pin, 29 mm x 29 mm

ISSCC07 Paper No. 5.3, Y. Yoshida, et al., "A 4320MIPS Four-Processor Core SMP/AMP with Individually Managed Clock Frequency for Low Power Consumption"
Performance on the Developed SH Multi-core (RP1: SH-X3) Using the Compiler and API

(Bar charts: speedup vs. number of processors (1-4) for an audio AAC* encoder and for image Susan smoothing**; speedup grows from 1.00 on one processor to roughly 3.4-3.8 on four processors.)

*) ISO Advanced Audio Coding
**) MiBench embedded application benchmark from the University of Michigan
RP2 Chip Photo and Specifications

(Die photo labels: Core #0-#7 with I$, D$, ILRAM, DLRAM, and URAM; SNC0 and SNC1; CSM; DBSC and DDR pads; LBSC; SHWY; VSWC; GCPG.)

  Process technology:   90 nm, 8-layer, triple-Vth, CMOS
  Chip size:            104.8 mm2 (10.61 mm x 9.88 mm)
  CPU core size:        6.6 mm2 (3.36 mm x 1.96 mm)
  Supply voltage:       1.0 V - 1.4 V (internal), 1.8/3.3 V (I/O)
  Power domains:        17 (8 CPUs, 8 URAMs, common)
8-Core RP2 Chip Block Diagram

(Block diagram: two clusters of four cores each (Cluster #0: Core #0-#3 with Snoop controller 0, LCPG0, and PCR0-3; Cluster #1: Core #4-#7 with Snoop controller 1, LCPG1, and PCR4-7). Each core has a CPU, FPU, 16 KB I-cache, 16 KB D-cache, local memories (instruction: 8 KB, data: 32 KB), a CCN/BAR cache controller with barrier register, and 64 KB user RAM (URAM). Barrier synchronization lines connect the cores. The clusters attach to the on-chip system bus (SuperHyway), which also connects the DDR2, SRAM, and DMA controllers.)

LCPG: local clock pulse generator; PCR: power control register; CCN/BAR: cache controller / barrier register; URAM: user RAM.
Processing Performance on the Developed Multicore Using the Automatic Parallelizing Compiler
Speedup against single-core execution for audio AAC* encoding:
  1 core: 1.0,  2 cores: 1.9,  4 cores: 3.6,  8 cores: 5.8
*) Advanced Audio Coding
Power Reduction Using Power Shutdown and Voltage Control by the OSCAR Parallelizing Compiler

(Measured power waveforms: without power control, the average power is 5.82 W; with compiler power control (resume standby: power shutdown and voltage lowering), the average power is 0.81 W, an 86.0% power reduction.)
OSCAR Heterogeneous Multicore
- OSCAR-type memory architecture
  - LPM: local program memory
  - LDM: local data memory
  - DSM: distributed shared memory
  - CSM: centralized shared memory (on-chip and/or off-chip)
  - DTU: data transfer unit
- Interconnection network
  - Multiple buses
  - Split-transaction buses
  - Crossbar, ...
Static Scheduling of Coarse-Grain Tasks for a Heterogeneous Multi-core

(Figure: a macro-task graph whose nodes are labeled "MT for CPU" (MT1, MT2, MT3, MT6, MT8, MT11) or "MT for DRP" (MT4, MT5, MT7, MT9, MT10, MT12, MT13), and the resulting static schedule on CPU0, CPU1, and a DRP accelerator, ending at EMT.)

A toy list-scheduling sketch of this assignment step follows.
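A toy version of the static scheduling step is sketched below: each macro-task carries a preferred PE type (as in the MT-for-CPU / MT-for-DRP labels above) and an estimated cost, and tasks are placed greedily on the earliest-available PE of that type. This is a deliberately minimal list scheduler with made-up costs, not the compiler's actual algorithm:

    #include <stdio.h>

    enum PEType { CPU, DRP };

    typedef struct { const char *name; enum PEType type; double ready; } PE;
    typedef struct { const char *name; enum PEType wants; double cost; } MacroTask;

    /* Greedy static scheduling: assign each macro-task to the earliest-available
     * PE of the requested type (dependences are assumed to be respected by the
     * task order, which a real scheduler derives from the macro-task graph). */
    int main(void)
    {
        PE pe[] = { {"CPU0", CPU, 0.0}, {"CPU1", CPU, 0.0}, {"DRP0", DRP, 0.0} };
        MacroTask mt[] = {
            {"MT1", CPU, 3.0}, {"MT2", CPU, 2.0}, {"MT4", DRP, 1.5},
            {"MT3", CPU, 2.5}, {"MT5", DRP, 2.0}, {"MT6", CPU, 1.0},
        };
        int n_pe = (int)(sizeof pe / sizeof pe[0]);
        int n_mt = (int)(sizeof mt / sizeof mt[0]);

        for (int t = 0; t < n_mt; t++) {
            int best = -1;
            for (int p = 0; p < n_pe; p++) {
                if (pe[p].type != mt[t].wants) continue;
                if (best < 0 || pe[p].ready < pe[best].ready) best = p;
            }
            printf("%s -> %s at t=%.1f\n", mt[t].name, pe[best].name, pe[best].ready);
            pe[best].ready += mt[t].cost;   /* PE becomes free after this task */
        }
        return 0;
    }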
An Image of a Static Schedule for a Heterogeneous Multicore with Data Transfer Overlapping and Power Control
(Gantt-style schedule figure with time on the vertical axis.)
Compiler Performance on an OSCAR Heterogeneous Multi-core
25.2 times speedup using 4 SH general-purpose cores and 4 DRP accelerators against a single SH core.

(Bar charts of relative performance (from execution clock cycles) for configurations from 1 CPU up to 8 CPUs + 8 DRPs on the SH4A-core system, under three memory configurations: on-chip CSM with STB x 1, on-chip CSM with BUS x 3, and on-chip CSM with STB x 3. The charts are annotated with speedups of 25.2 times (4 CPUs + 4 DRPs) and up to 41.3 times against one SH core.)
Power Reduction by the OSCAR Compiler (4 SHs + 4 DRPs)
0.78 W: 22% power reduction by compiler control.

(Bar chart of average power consumption for the off-chip and on-chip memory configurations (STB x 1, BUS x 3, STB x 3): roughly 1.01 W without frequency/voltage control versus 0.78 W with FV control and power shutdown; the bars break power down into DRP, clock, FPU, IEU, BPU, and register components. July 30, 2007.)
Conclusions
- Compiler-cooperative, low-power, high-effective-performance multi-core processors will become increasingly important in a wide range of information systems, from games, mobile phones, and automobiles to peta-scale supercomputers.
- Parallelizing compilers are essential for realizing:
  - Good cost performance
  - Short hardware and software development periods
  - Low power consumption
  - High software productivity
  - Scalable performance improvement with advances in semiconductor integration technology
- Key technologies in multi-core compilers: multigrain parallelization, data localization, data transfer overlapping using DMA, and low-power control.