NEDO Multicore Technology for Real-time Consumer Electronics (Waseda Univ., Hitachi, Renesas, Fujitsu, NEC, Toshiba, Panasonic)
University of Illinois Oct. 2006
Multi-core Parallelizing Compiler for Low Power High Performance Computing
Hironori Kasahara
Head, Department of Computer Science; Director, Advanced Chip-Multiprocessor Research Institute
Waseda University Tokyo, Japan http://www.kasahara.cs.waseda.ac.jp
Hironori Kasahara
<Personal History>
B.S. (1980, Waseda), M.S. (1982, Waseda), Ph.D. (1985, EE, Waseda). Research Associate (1983, Waseda); Special Research Fellow, JSPS (1985); Visiting Scholar (1985, Univ. of California at Berkeley); Assistant Prof. (1986, Waseda); Associate Prof. (1988, Waseda); Visiting Research Scholar (1989-1990, Center for Supercomputing R&D, Univ. of Illinois at Urbana-Champaign); Prof. (1997-, Dept. of CS, Waseda). IFAC World Congress Young Author Prize (1987); IPSJ Sakai Memorial Special Award (1997); STARC Industry-Academia Cooperative Research Award (2004).
<Activities for Societies>
IPSJ: SIG Computer Architecture (Chair); Transactions of IPSJ Editorial Board (HG Chair); Journal of IPSJ Editorial Board (HWG Chair); 2001 Journal of IPSJ Special Issue on Parallel Processing (Chair of Editorial Board / Guest Editor); JSPP2000 (Program Chair), etc.
ACM: International Conference on Supercomputing (ICS) Program Committee (esp. '96 ENIAC 50th Anniversary Co-Program Chair).
IEEE: Computer Society Japan Chapter Chair; Tokyo Section Board Member; SC07 PC.
OTHER: PCs of many conferences on supercomputing and parallel processing.
METI: IT Policy Proposal Forum (Architecture/HPC WG Chair); Super Advanced Electronic Basis Technology Investigation Committee.
NEDO: Millennium Project IT21 "Advanced Parallelizing Compiler" (Project Leader); Computer Strategy WG (Chair); Multicore for Realtime Consumer Electronics (Project Leader), etc.
MEXT: Earth Simulator project evaluation committee. JAERI: Research accomplishment evaluation committee; CCSE 1st-class invited researcher.
JST: Scientific Research Fund Sub-Committee; COINS Steering Committee; Precursory Research for Embryonic Science and Technology (Research Area Adviser).
Cabinet Office: CSTP Expert Panel on Basic Policy; Information & Communication Field Promotion Strategy; R&D Infrastructure WG; Software & Security WG.
<Papers>
151 papers with review, 20 symposium papers with review, 105 technical reports, 154 papers for annual conventions, 49 invited talks, 74 articles in newspapers & the web, etc.
Multi-core Everywhere
Multi-core from embedded systems to supercomputers:
• Consumer electronics (embedded): mobile phones, games, digital TV, car navigation, DVD, cameras — IBM/Sony/Toshiba Cell, Fujitsu FR1000, NEC/ARM MPCore & MP211, Panasonic UniPhier, Renesas SH multi-core (RP1). IBM Power4 was the first chip multiprocessor (CMP): 2 cores on a chip.
• PCs, servers: Intel Dual-Core Xeon, Core 2 Duo, Montecito; AMD Quad- and Dual-Core Opteron.
• Workstations, deskside & high-end servers: IBM Power4, 5, 5+, pSeries 690 (32-way), p5 550Q (8-way); Sun Niagara (SPARC T1, T2); SGI Altix 350.
• Supercomputers: Earth Simulator (40 TFLOPS, 2002, multiprocessor supercomputer with 5120 vector processors); IBM Blue Gene/L (360 TFLOPS, 2005, low-power CMP-based, 128K processor chips).
High-quality application software, productivity, cost performance, and low power consumption are important (e.g., for mobile phones and games). Compiler-cooperated multi-core processors are promising for realizing these features.
IBM BlueGene/L
Roadmap of the Compiler Cooperative Multicore Project (2000-2010)
• 2000-2003: Millennium Project IT21 — NEDO Advanced Parallelizing Compiler (Waseda Univ., Fujitsu, Hitachi, JIPDEC, AIST): compiler development for multiprocessor servers, later applied to supercomputer compilers.
• STARC Compiler Cooperative Chip Multiprocessor: basic research (Waseda, Toshiba, Fujitsu, Panasonic, Sony; also Waseda Univ., Fujitsu, NEC, Toshiba, Panasonic, Sony), followed by practical research (Waseda, Fujitsu, NEC, Toshiba) — architecture and compiler R&D plan through practical use. STARC: Semiconductor Technology Academic Research Center (Fujitsu, Toshiba, NEC, Renesas, Panasonic, Sony, etc.).
• NEDO Heterogeneous Multiprocessor (Waseda Univ., Hitachi).
• 2005-2008: NEDO Multicore Technology for Realtime Consumer Electronics (Waseda Univ., Hitachi, Renesas, Fujitsu, NEC, Toshiba, Panasonic): power-saving multicore architecture, parallelizing compiler, and API for soft real-time multicore systems, from architecture/compiler/API R&D plan to practical use.
METI/NEDO National Project: Multi-core for Real-time Consumer Electronics
R&D of compiler cooperative multicore processor technology for consumer electronics such as mobile phones, games, DVD, digital TV, and car navigation systems. From July 2005 to March 2008.
METI/NEDO Real-time Consumer Electronics Multicore Processor (July 2005 - March 2008)**
API: Application Programming Interface
(Architecture figure: multicore chips CMP 0 ... CMP m, each holding processor cores PC 0 ... PC n. Every core has a CPU, LPM/I-Cache (local program memory / instruction cache), LDM/D-cache (local data memory / L1 data cache), DTC (data transfer controller), DSM (distributed shared memory), and NI (network interface), tied together by the IntraCCN (intra-chip connection network: multiple buses, crossbar, etc.) with an on-chip CSM / L2 cache (centralized shared memory). Chips, centralized shared memories CSM j, I/O devices, and I/O multicore chips CSP k are joined by the InterCCN (inter-chip connection network: multiple buses, crossbar, multistage network, etc.), e.g., forming an integrated ECU.)
• Good cost performance / high performance
• Short hardware and software development periods
• Low power consumption
• Scalable performance improvement with the advancement of semiconductor integration
• Applications shared among chips; high reliability
• Use of the same parallelizing compiler for multi-cores from different vendors through a newly developed API
The developed multicore chips are aimed at consumer electronics.
** Hitachi, Fujitsu, Renesas, Toshiba, Panasonic, NEC
NEDO Multicore Technology for Realtime Consumer Electronics: R&D Organization (July 2005 - March 2008)
Integrated R&D Steering Committee — Chair: Hironori Kasahara (Waseda Univ.), Project Leader; Sub-chair: Kunio Uchiyama (Hitachi), Project Sub-leader.
Grant: Multicore Architecture and Compiler Research and Development. Project Leader: Prof. Hironori Kasahara, Waseda Univ.; R&D Steering Committee; Architecture & Compiler R&D Group — Group Leader: Associate Prof. Keiji Kimura, Waseda Univ. (outsourcing: Fujitsu).
Standard Multicore Architecture & API Committee — Chair: Prof. Hironori Kasahara, Waseda Univ.
Subsidy: Evaluation Environment for Multicore Technology. Project Sub-leader: Dr. Kunio Uchiyama, Chief Researcher, Hitachi. Test Chip Development Group — Group Leader: Renesas Technology. Evaluation System Development Group — Group Leader: Hitachi.
Participants: Hitachi, Fujitsu, Toshiba, NEC, Panasonic, Renesas Technology.
MPCore™ (ARM and NEC collaboration)
(Block diagram highlights:)
• Configurable between 1 and 4 symmetric CPUs (CPU/VFP), each with its own L1 memory.
• Interrupt Distributor with a configurable number of hardware interrupt lines and private FIQ lines; per-CPU aliased peripherals (timer, watchdog) and per-CPU interrupt (IRQ) interfaces on a private peripheral bus.
• Snoop Control Unit (SCU) with duplicated L1 tags for instruction & data coherence (64-bit bus plus control bus).
• Primary AXI read/write 64-bit bus, an optional second AXI R/W 64-bit bus, and an optional L2 cache (L220).
Fujitsu FR-1000 Multicore Processor
FR-V multi-core processor (figure): four FR550 VLIW cores connected by a crossbar switch, together with DMA engines (DMA-I, DMA-E), a local bus interface, two memory controllers with attached memories, and a fast I/O bus to an I/O chip.
• Memory bus: 64 bit x 2 ch / 266 MHz; system bus: 64 bit / 178 MHz.
FR550 VLIW core: 8-issue, with an integer operation unit (instructions 0-3), a media operation unit (instructions 4-7), GR/FR register files, a 32 KB I-cache, and a 32 KB D-cache.
(A second diagram shows the core pairs, each pair connected through a switch to its memories and joined by the FR1000 crossbar.)
CELL Processor Overview
• Power Processor Element (PPE)
  – Power core processes OS and control tasks; 2-way multi-threaded
  – 64-bit Power Architecture with VMX for conventional processing; 32 KB + 32 KB L1, 512 KB L2
• Synergistic Processor Element (SPE) — for high (fl)ops/watt
  – 8 SPEs offer high performance
  – Dual-issue RISC architecture
  – 128-bit SIMD (16-way)
  – 128 x 128-bit general registers
  – 256 KB local store (LS)
  – Dedicated DMA engines
(Figure: each SPU reaches its LS at 16 B/cycle; the Element Interconnect Bus (EIB) carries up to 96 B/cycle, with 16 B/cycle ports to each SPE, to the MIC (dual XDR™, RRAC), and to the BIC (I/O, 16 B/cycle x2); the PPE attaches via its 512 KB L2 at 32 B/cycle.)
1987 OSCAR (Optimally Scheduled Advanced Multiprocessor)
OSCAR (Optimally Scheduled Advanced Multiprocessor)
System organization (figure): a host computer attaches through a control & I/O processor (RISC processor plus I/O processor) to three banks of simultaneously readable centralized shared memory (CSM1-CSM3; the same address n is readable from all banks at once) guarded by a read & write request arbitrator. Sixteen processor elements (PE1-PE16) connect over bus interfaces and are grouped into 5-PE clusters (SPC1-SPC3) and 8-PE processor clusters (LPC1, LPC2).
Each PE (CP): a 5 MFLOPS, 32-bit RISC processor with 64 registers, 2 banks of program memory, data memory, stack memory, a DMA controller, and a dual-port distributed shared memory.
OSCAR PE (Processor Element)
Each PE hangs off the system bus through a bus interface, with two local buses and an instruction bus joining the following units:
• DMA: DMA controller
• LPM: local program memory (128 KW x 2 banks)
• INSC: instruction control unit
• DSM: distributed shared memory (2 KW)
• LSM: local stack memory (4 KW)
• LDM: local data memory (256 KW)
• DP: data path
• IPU: integer processing unit
• FPU: floating-point processing unit
• REG: register file (64 registers)
1987 OSCAR PE Board
OSCAR Memory Space
(Layout recoverable from the figure.)
System memory space (00000000-FFFFFFFF): 00000000-00100000 undefined (aliases the local memory area); a broadcast region for writing to all PEs at once; 01000000 CP (control processor); 02000000 PE 0; 02100000 PE 1; ...; 02F00000 PE 15; 03000000 CSM1, 2, 3 (centralized shared memory); 03400000-FFFFFFFF not used.
Local memory space of each PE (00000000-FFFFFFFF): 00000000-00000800 DSM (distributed shared memory); 00010000-00020000 LPM bank 0 (local program memory); 00020000-00030000 LPM bank 1; 00040000-00080000 LDM (local data memory); 000F0000-00100000 control registers; 00100000-FFFFFFFF system accessing area; the remaining ranges are not used.
OSCAR Multi-Core Architecture
(Figure:) chip multiprocessors CMP 0 ... CMP m, each with PE 0 ... PE n. Each PE contains a CPU, LPM/I-cache, LDM/D-cache, DTC, DSM, an FVR, and a network interface, connected by the intra-chip connection network (multiple buses, crossbar, etc.) to the on-chip CSM / L2 cache. Chips, CSM j modules, I/O devices, and I/O CMP k connect through the inter-chip connection network (crossbar, buses, multistage network, etc.). FVRs exist at both chip and PE level.
Legend: LDM: local data memory; CSM: central shared memory; DSM: distributed shared memory; LPM: local program memory; DTC: data transfer controller; FVR: frequency/voltage control register.
API and Parallelizing Compiler in the METI/NEDO Advanced Multicore for Realtime Consumer Electronics Project
Flow (figure): a sequential application program (subset of the C language) — e.g., real-time consumer electronics applications such as image processing and secure audio streaming — goes into the Waseda OSCAR compiler, which translates it into parallel code for each vendor using an API that specifies data assignment, data transfer, and power-reduction control (e.g., stop, slow). Tasks T1-T6 are scheduled onto Proc0-Proc2, with data transfers performed by the DTC (DMAC). Each vendor's backend compiler (an API decoder plus a sequential compiler) then generates executable machine code for its own chip: SH multicore, FR-V, (CELL). A hypothetical sketch of such API directives follows.
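The API itself is not reproduced on this slide; the following is a purely hypothetical C sketch (directive names invented for illustration) of the kind of hints described here — data assignment, DTC data transfer, and power control — that each vendor's API decoder would lower to its own chip:

    /* Hypothetical directive names, for illustration only; the project's
       real API is not shown on this slide. */
    void task_T4(float *a, int n)
    {
    #pragma oscar map(a, local_data_memory)   /* data assignment to LDM     */
    #pragma oscar dma_in(a, n)                /* data transfer by DTC(DMAC) */
        for (int i = 0; i < n; i++)
            a[i] *= 2.0f;
    #pragma oscar frequency(slow)             /* power reduction: slow mode */
    }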
OSCAR Parallelizing Compiler
• Improve effective performance, cost performance, and productivity, and reduce power consumption
  – Multigrain Parallelization: exploitation of parallelism from the whole program, using coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism.
  – Data Localization: automatic data distribution for distributed shared memory, cache, and local memory on multiprocessor systems.
  – Data Transfer Overlapping: hiding data transfer overhead by overlapping task execution and data transfer using DMA or data pre-fetching (see the sketch below).
  – Power Reduction: reducing power consumption by compiler control of frequency, voltage, and power shutdown, with hardware support.
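As a concrete illustration of the data transfer overlapping idea, here is a minimal C double-buffering sketch; dma_start/dma_wait are hypothetical stand-ins for whatever interface the DTC driver actually provides:

    /* Minimal double-buffering sketch; dma_start/dma_wait are hypothetical
       stand-ins for the DTC interface, and CHUNK is an arbitrary size. */
    enum { CHUNK = 1024 };
    extern void dma_start(float *dst, const float *src, int n); /* hypothetical */
    extern void dma_wait(void);                                 /* hypothetical */
    extern void compute(float *buf, int n);

    void process(const float *shared, float *local0, float *local1, int nchunks)
    {
        float *cur = local0, *next = local1;
        dma_start(cur, shared, CHUNK);               /* fetch first chunk       */
        for (int i = 0; i < nchunks; i++) {
            dma_wait();                              /* chunk i is now local    */
            if (i + 1 < nchunks)                     /* prefetch chunk i+1 ...  */
                dma_start(next, shared + (i + 1) * CHUNK, CHUNK);
            compute(cur, CHUNK);                     /* ... while computing i   */
            float *t = cur; cur = next; next = t;    /* swap the two buffers    */
        }
    }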
OSCAR Multigrain Parallelizing Compiler
• Automatic parallelization
  – Multigrain parallel processing
  – Data localization
  – Data transfer overlapping
  – Compiler-controlled power saving scheme
• Compiler cooperative multi-core architecture
  – OSCAR multi-core architecture
  – OSCAR heterogeneous multiprocessor architecture
• Commercial SMP machines
METI/NEDO Advanced Parallelizing Compiler (APC) Technology Project
Millennium Project IT21, 2000.9.8-2003.3.31; Waseda Univ., Fujitsu, Hitachi, AIST.
Background and problems:
① Adoption of parallel processing as a core technology from PCs to HPC; ② increasing importance of software in IT; ③ need to improve cost performance and usability. (Chart: hardware peak performance grew from about 1 GFLOPS in 1980 toward 1 TFLOPS, while the gap between theoretical maximum performance and effective performance of HPC keeps expanding.)
Contents of research and development:
① R&D of advanced parallelizing compiler technology: multigrain parallelization, data localization, overhead hiding; ② R&D of performance evaluation technology for parallelizing compilers. Targets: improvement of ① effective performance, ② cost performance, and ③ ease of use. Goal: double the effective performance.
Ripple effects:
① Development of competitive next-generation PCs and HPC; ② putting the innovative automatic parallelizing compiler technology to practical use; ③ development and market acquisition of future single-chip multiprocessors; ④ boosting R&D in many fields: IT, biotech, devices, earth environment, next-generation VLSI design, financial engineering, weather forecast, new clean energy, space development, automobiles, electronic commerce, etc.
Generation of Coarse Grain Tasks
Macro-tasks (MTs):
• Block of Pseudo Assignments (BPA): basic block (BB)
• Repetition Block (RB): outermost natural loop
• Subroutine Block (SB): subroutine
Hierarchical decomposition (figure): the total program forms the 1st layer. Each BPA is parallelized near-fine-grain; each RB is handled by loop-level parallelization, near-fine-grain parallelization of the loop body, or coarse-grain parallelization; each SB is decomposed by coarse-grain parallelization into 2nd-layer BPA/RB/SB macro-tasks, and those in turn into a 3rd layer. A small C illustration follows.
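A hedged C illustration of the three macro-task kinds (names invented; the compiler's actual input here is Fortran or a C subset):

    extern void solve(double *c, int n);      /* some subroutine             */

    void first_layer(double *c, int n, double a, double b)
    {
        double x = a + b;                     /* BPA: block of pseudo        */
        double y = x * 2.0;                   /* assignments (basic block)   */

        for (int i = 0; i < n; i++)           /* RB: outermost natural loop  */
            c[i] = c[i] * x + y;

        solve(c, n);                          /* SB: subroutine block, whose */
    }                                         /* body becomes 2nd-layer MTs  */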
Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)
The compiler builds a Macro Flow Graph over the macro-tasks — BPAs (blocks of pseudo-assignment statements) and RBs (repetition blocks) — whose edges represent data dependences, control flow, and conditional branches. From it, it derives a Macro Task Graph whose edges are data dependences and extended control dependences, combined through AND/OR conditions over branch outcomes (in the figures: solid edges = data dependency, dotted edges = extended control dependency / original control flow, small circles = conditional branches).
The earliest executable condition of a macro-task is the weakest condition under which it may start: all macro-tasks it is data-dependent on have completed, and the conditional branches that determine whether it executes at all have been resolved. For example, if MT A is data-dependent only on MT B, and a branch in MT C decides whether A executes, then A can start as soon as B completes AND C's branch selects the path containing A — possibly much earlier than A's position in the original control flow.
Automatic Processor Assignment in 103.su2cor
• Using 14 processors
• Coarse grain parallelization within DO 400 of subroutine LOOPS
MTG of su2cor LOOPS DO 400 (figure): coarse grain parallelism PARA_ALD = 4.3; the macro-tasks include DOALL loops, sequential loops, SBs, and BBs.
Data-Localization: Loop Aligned Decomposition
• Decompose multiple loops (DOALL and sequential) into LRs and CARs considering inter-loop data dependences.
  – LR: Localizable Region; CAR: Commonly Accessed Region. Most data in an LR can be passed through local memory (LM).

    C     RB1 (Doall)
          DO I=1,101
            A(I)=2*I
          ENDDO
    C     RB2 (Doseq)
          DO I=1,100
            B(I)=B(I-1)+A(I)+A(I+1)
          ENDDO
    C     RB3 (Doall)
          DO I=2,100
            C(I)=B(I)+B(I-1)
          ENDDO

Aligned decomposition (figure):
  RB1: LR DO I=1,33 | CAR DO I=34,35 | LR DO I=36,66 | CAR DO I=67,68 | LR DO I=69,101
  RB2: DO I=1,33 | DO I=34,34 | DO I=35,66 | DO I=67,67 | DO I=68,100
  RB3: DO I=2,34 | DO I=35,67 | DO I=68,100
A C sketch of the generated partitions follows.
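A C rendering (hedged sketch; the original is Fortran, with 1-based indexing kept by oversizing the arrays) of the partitions, showing the first localizable region in full:

    /* Sketch only: loop bounds are the slide's. The CAR iterations
       (RB1 I=34-35, RB2 I=34) supply boundary values shared by adjacent
       LRs; the compiler schedules them between the LR partitions. */
    float A[103], B[102], C[102];

    void rb1_lr0(void)   { for (int i = 1;  i <= 33; i++) A[i] = 2.0f * i; }
    void rb1_car01(void) { for (int i = 34; i <= 35; i++) A[i] = 2.0f * i; }

    void rb2_lr0(void)   { for (int i = 1;  i <= 33; i++)      /* reads A[34]  */
                               B[i] = B[i-1] + A[i] + A[i+1]; } /* from CAR    */
    void rb3_lr0(void)   { for (int i = 2;  i <= 34; i++)      /* reads B[34]  */
                               C[i] = B[i] + B[i-1]; }          /* from RB2 CAR */
    /* The remaining partitions (I=35..67, I=68..100/101) are analogous and
       run on another core, with their data kept in that core's local memory. */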
Inter-loop Data Dependence Analysis in the TLG (Target Loop Group)
• Define the exit RB in the TLG as the standard loop.
• Find the iterations on which an iteration of the standard loop is data-dependent.
  – e.g., the Kth iteration of RB3 is data-dependent on the (K-1)th and Kth iterations of RB2, and on the (K-1)th, Kth, and (K+1)th iterations of RB1 (see the derivation below).
(Figure: the example TLG with the iteration spaces I(RB1), I(RB2), I(RB3) of the three loops above, and dependence arrows between iterations K-1, K, K+1.)
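The dependence sets can be read directly off the loop bodies; for iteration $K$ of the standard loop RB3:

    \begin{align*}
    \text{RB3}_K:\; C(K) &= B(K) + B(K-1)
      &&\Rightarrow \text{ depends on } \text{RB2}_{K},\ \text{RB2}_{K-1},\\
    \text{RB2}_K:\; B(K) &= B(K-1) + A(K) + A(K+1)
      &&\Rightarrow \text{ depends on } \text{RB1}_{K},\ \text{RB1}_{K+1}.
    \end{align*}

Since $\text{RB2}_{K-1}$ reads $A(K-1)$ and $A(K)$, iteration $\text{RB3}_K$ transitively depends on $\text{RB1}_{K-1}$, $\text{RB1}_{K}$, and $\text{RB1}_{K+1}$, matching the slide.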
Decomposition of RBs in the TLG
• Decompose the GCIR into DGCIRp (1 ≤ p ≤ n), where n is (a multiple of) the number of PCs (DGCIR: decomposed GCIR).
• Generate a CAR from the iterations on which both DGCIRp and DGCIRp+1 are data-dependent.
• Generate an LR from the iterations on which only DGCIRp is data-dependent.
(Figure: for the example TLG, the iteration spaces I(RB1)=1..101, I(RB2)=1..100, I(RB3)=2..100 are split into RB11/RB12/RB13, RB21/RB22/RB23, RB31/RB32/RB33 around the boundaries 33-36 and 66-68, yielding DGCIR1, DGCIR2, DGCIR3.)
Data Localization Example (figures)
Left: an MTG whose macro-tasks are grouped into data-localization groups (DLGs) dlg0-dlg3. Right: the MTG after loop division, and a schedule of the divided macro-tasks onto two processors (PE0, PE1) in which the tasks of each data-localization group run consecutively on the same processor, so that the data they share stays in that processor's cache or local memory.
An Example of Data Localization for Spec95 Swim

          DO 200 J=1,N
          DO 200 I=1,M
            UNEW(I+1,J) = UOLD(I+1,J)+
     1        TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
     2        +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
            VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
     1        *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
     2        -TDTSDY*(H(I,J+1)-H(I,J))
            PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
     1        -TDTSDY*(CV(I,J+1)-CV(I,J))
      200 CONTINUE

          DO 210 J=1,N
            UNEW(1,J) = UNEW(M+1,J)
            VNEW(M+1,J+1) = VNEW(1,J+1)
            PNEW(M+1,J) = PNEW(1,J)
      210 CONTINUE

          DO 300 J=1,N
          DO 300 I=1,M
            UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
            VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
            POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
      300 CONTINUE

(a) An example of a target loop group for data localization.
(b) Image of the alignment of the accessed arrays on a 4 MB cache (figure): the thirteen arrays U, V, P, UNEW, VNEW, PNEW, UOLD, VOLD, POLD, CU, CV, Z, H wrap around the 4 MB cache, so cache line conflicts occur among arrays which share the same location on the cache (the boxes mark the access range of DLG0).
Data Layout for Removing Line Conflict Misses by Array Dimension Padding
Declaration part of the arrays in spec95 swim:

Before padding:
      PARAMETER (N1=513, N2=513)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

After padding:
      PARAMETER (N1=513, N2=544)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

(Figure: after padding, the boxes marking the access range of DLG0 no longer overlap within the 4 MB cache.)
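The arithmetic behind the conflicts and the padding (a worked sketch assuming 4-byte REALs, as in SPEC95 swim, and the 4 MB cache shown in the figure):

    \begin{align*}
    \text{before: } 513 \times 513 \times 4\,\mathrm{B} &= 1{,}052{,}676\,\mathrm{B} \approx 1.004\,\mathrm{MB per array},\\
    4 \times 1{,}052{,}676 \bmod 4{,}194{,}304 &= 16{,}400\,\mathrm{B} \approx 16\,\mathrm{KB},
    \end{align*}

so arrays four apart in the COMMON block land within about 16 KB of the same cache offset and their lines conflict. After padding, each array occupies $513 \times 544 \times 4 = 1{,}116{,}288$ B, and $4 \times 1{,}116{,}288 \bmod 4{,}194{,}304 = 270{,}848$ B $\approx 264$ KB, separating the formerly conflicting arrays in the cache.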
APC Compiler Organization
APC Multigrain Parallelizing Compiler (figure): a Fortran source program is analyzed by inter-procedural, conditional data dependence analysis, then processed by multigrain parallelization (hierarchical parallelism, affine partitioning), automatic data distribution (cache optimization, DSM optimization), scheduling (static and dynamic), and speculative execution (coarse grain, medium grain, with architecture support), and emitted as OpenMP code. The OpenMP code is compiled by native compilers (Sun Forte 6 Update 2, IBM XL Fortran Ver. 7.1, IBM XL Fortran Ver. 8.1) for a variety of shared-memory parallel machines (IBM pSeries 690, Sun Ultra 80, IBM RS6000). A parallelizing tuning system (program visualization techniques, techniques for profiling and utilizing runtime information) feeds runtime information back through feedback-directed selection of compiler directives.
Image of Generated OpenMP Code for Hierarchical Multigrain Parallel Processing
(Figure: the generated code opens with centralized scheduling code, then a SECTIONS construct whose SECTIONs form the 1st layer. Thread group 0 (threads T0-T3) executes MT1_1, MT1_2 (a DOALL), and the 2nd-layer sub-macro-tasks 1_3_1 ... 1_3_6 of MT1_3 (an SB) under distributed scheduling code; thread group 1 (threads T4-T7) executes the sub-tasks 1_4_1 ... 1_4_4 of MT1_4 (an RB). SYNC SEND / SYNC RECV pairs synchronize the two groups before END SECTIONS.) A simplified C/OpenMP sketch of this structure follows.
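A simplified C/OpenMP sketch of the structure in the figure (hedged: the actual generated code's distributed scheduling is more elaborate, and each thread group is collapsed to a single thread here for brevity):

    #include <omp.h>

    /* Hypothetical task bodies standing in for the figure's macro-tasks. */
    void mt1_1(void); void mt1_2(void);
    void mt1_3_sub(int k);   /* 2nd-layer tasks 1_3_1 .. 1_3_6 of SB MT1_3 */
    void mt1_4_sub(int k);   /* 2nd-layer tasks 1_4_1 .. 1_4_4 of RB MT1_4 */

    void hierarchical(void)
    {
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section          /* thread group 0 (T0-T3)         */
            {
                mt1_1();
                mt1_2();                 /* DOALL                          */
                for (int k = 1; k <= 6; k++)
                    mt1_3_sub(k);        /* distributed scheduling code    */
            }
            #pragma omp section          /* thread group 1 (T4-T7)         */
            {
                for (int k = 1; k <= 4; k++)
                    mt1_4_sub(k);
                /* SYNC SEND/RECV with group 0 before END SECTIONS         */
            }
        }
    }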
IBM pSeries 690 Regatta-H
• Up to 16 Power4 chips: a 32-way SMP server
  – L1(D): 64 KB per chip (32 KB/processor, 2-way assoc.); L1(I): 128 KB per chip (64 KB/processor, direct-mapped)
  – L2: 1.5 MB (cache shared by 2 processors, 4-8-way assoc.)
  – L3: 32 MB (external, 8-way assoc.) [x 8 = 256 MB]
(Figure: four 8-way multi-chip modules, MCM 0 - MCM 3, each carrying four dual-core Power4 chips with shared L2 caches, surrounded by L3 modules, GX buses/slots, and memory slots — four 8-way MCMs assembled into a 32-way pSeries 690.)
Source: "IBM eServer pSeries 690: Configuring for Performance", http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/p690_config.html
Performance of the APC Compiler on an IBM pSeries 690 16-Processor High-end Server
• IBM XL Fortran for AIX Version 8.1
  – Sequential execution: -O5 -qarch=pwr4
  – Automatic loop parallelization: -O5 -qsmp=auto -qarch=pwr4
  – OSCAR compiler: -O5 -qsmp=noauto -qarch=pwr4 (su2cor: -O4 -qstrict)
(Bar chart: speedup ratios of XL Fortran (max) and APC (max) over sequential execution, annotated with execution times, for SPEC95 codes tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 and SPEC2000 codes swim, mgrid, applu, apsi, sixtrack; y-axis up to 12.0, with the APC bars consistently higher than the XL Fortran bars.)
Performance of Multigrain Parallel Processing for 102.swim on IBM pSeries 690
(Graph: speedup ratio vs. number of processors, 1 to 16, for XLF(AUTO) and OSCAR; y-axis up to 16.00.)
Performance on IBM pSeries 690 Power4 24-Processor SMP Server
(Bar chart: speedup ratios of XL Fortran Ver. 8.1 and OSCAR for SPEC95 codes tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 and SPEC2000 codes swim, mgrid, applu, apsi; y-axis up to 25.)
• The OSCAR compiler gave 4.82 times speedup against XL Fortran Ver. 8.1.
SGI Altix 350: 16 Itanium 2 (1.3 GHz) SMP server; L1 16 KB (4-way), L2 256 KB (8-way), L3 3 MB (12-way).
2.4 times speedup against Intel Fortran 7.1, using the OpenMP code generated for the IBM pSeries 690.
Sun Fire V880
• UltraSPARC III 8-processor SMP server
  – L1(Data): 64 KB (4-way assoc.); L1(Inst.): 32 KB (4-way assoc.)
  – Prefetch cache: 2 KB (4-way assoc.)
  – L2: 8 MB / processor (2-way assoc.)
Source: "The Sun Fire™ V880 Server Architecture", http://www.sun.com/servers/entry/880/
Performance of the OSCAR Multigrain Parallelizing Compiler on Sun Fire V880: Data Localization and Cache Prefetching
(Chart: performance of the OSCAR compiler on a Sun Fire V880 with 8 UltraSPARC III Cu 1050 MHz processors.) 1.9 times speedup against the Sun Forte compiler 7.1.
IBM p5 550Q Server (1.5 or 1.65 GHz)
(Figure: two quad-core modules; each 2-core POWER5+ processor has a 1.9 MB L2 cache and shares a 36 MB L3 over the distributed SMP bus switch at 105.6 GB/sec (L3 at 42.2 GB/sec); a memory controller connects 1 to 64 GB of DDR2 533 MHz DIMMs; GX buses at 4.4 GB/sec (one on the 2nd processor card) feed an I/O hub at 4.0 GB/sec toward four I/O drawers; four PCI-X and one PCI-X DDR slots, with GX adapters sharing PCI-X slots 4 and 5.)
Performance on IBM p5 550 POWER5+ 8-Processor SMP Server
(Bar chart: speedup ratios of XL Fortran Ver. 10.1 (SMT off), XL Fortran Ver. 10.1 (SMT on), and OSCAR (SMT off) for SPEC95 codes tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 and SPEC2000 codes swim, mgrid, applu, apsi.)
• The OSCAR compiler achieved:
  – 2.74 times speedup against XL Fortran Ver. 10.1 without SMT
  – 2.78 times speedup against XL Fortran Ver. 10.1 with SMT
Example of Near Fine Grain Tasks
Statement-level tasks:
 1) u12 = a12/l11
 2) u24 = a24/l22
 3) u34 = a34/l33
 4) l54 = -l52 * u24
 5) u45 = a45/l44
 6) l55 = u55 - l54 * u45
 7) y1 = b1 / l11
 8) y2 = b2 / l22
 9) b5 = b5 - l52 * y2
10) y3 = b3 / l33
11) y4 = b4 / l44
12) b5 = b5 - l54 * y4
13) y5 = b5 / l55
14) x4 = y4 - u45 * y5
15) x3 = y3 - u34 * x4
16) x2 = y2 - u24 * x4
17) x1 = y1 - u12 * x2
(Figure: the task graph of these statements; node numbers are task numbers, and edge weights are data transfer times t_ij, e.g., 10 or 19 units.)
Task Graph for FPPPP
Redundant Synchronization Removal Using Static Scheduling Results on Shared Memory Multiprocessors
(Figure: tasks A-E statically scheduled onto PE1-PE3, with FS = flag set, FC = flag check, and arrows showing precedence relations. When static scheduling fixes the order of two dependent tasks on the same processor, or an earlier flag check already guarantees that the producer has finished, the corresponding flag checks become unnecessary and are removed.)
Generated parallel machine code for near fine grain parallel processing
(Tasks A and B run on PE1, task C on PE2; B flag-checks C's flag set. A C rendering follows.)

PE1:
    ; ---- Task A ----
    ; Task Body
        FADD R23, R19, R21
    ; ---- Task B ----
    ; Flag Check : Task C
    L18: LDR R28, [R14, 0]
        CMP R28, R0
        JNE L18
    ; Data Receive
        LDR R29, [R14, 1]
    ; Task Body
        FMLT R24, R23, R29

PE2:
    ; ---- Task C ----
    ; Task Body
        FMLT R27, R28, R29
        FSUB R29, R19, R27
    ; Data Transfer
        STR [R14, 1], R29
    ; Flag Set
        STR [R14, 0], R0
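The same synchronization pattern as a C sketch (volatile is a stand-in for the machine's shared-memory semantics; variable names follow the assembly above):

    /* Flag and datum live in distributed shared memory; flag starts nonzero. */
    volatile int   dsm_flag = 1;    /* [R14, 0]: 0 means "data ready"         */
    volatile float dsm_data;        /* [R14, 1]                               */

    void task_C_on_PE2(float r28, float r29, float r19)
    {
        float r27 = r28 * r29;      /* FMLT                                   */
        float out = r19 - r27;      /* FSUB                                   */
        dsm_data = out;             /* data transfer (STR [R14,1])            */
        dsm_flag = 0;               /* flag set      (STR [R14,0], R0)        */
    }

    float task_B_on_PE1(float r23)
    {
        while (dsm_flag != 0)       /* flag check loop (LDR/CMP/JNE L18)      */
            ;
        return r23 * dsm_data;      /* data receive + FMLT                    */
    }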
Multigrain Parallel Processing of JPEG Encoding on the OSCAR CMP
(Bar chart: speedup ratio over sequential execution (1.00) for processor configurations 1PG2PE, 2PG1PE, 1PG4PE, 4PG1PE, and 2PG2PE of 4-issue cores, exploiting 1st- and 2nd-layer parallelism; recoverable bar values are 1.00, 1.25, 1.51, 1.62, 2.70, 3.36, and 3.59, with the best four-processor configuration reaching about 3.59x.)
OSCAR Multi-Core Architecture (the same architecture figure shown earlier, repeated here as an introduction to the power control discussion).
Power Reduction by Power Supply, Clock Frequency, and Voltage Control by the OSCAR Compiler
• Shortest execution time mode
An Example of Machine Parameters for the Power Saving Scheme
• Functions of the multiprocessor:
  – The frequency of each processor can be changed among several levels.
  – Voltage is changed together with frequency.
  – Each processor can be powered on/off.

    state           FULL  MID   LOW   OFF
    frequency       1     1/2   1/4   0
    voltage         1     0.87  0.71  0
    dynamic energy  1     3/4   1/2   0
    static power    1     1     1     0

• State transition overhead:

    delay time [u.t.]                  energy overhead [uJ]
    from\to  FULL  MID   LOW   OFF     from\to  FULL  MID  LOW  OFF
    FULL     0     40k   40k   80k     FULL     0     20   20   40
    MID      40k   0     40k   80k     MID      20    0    20   40
    LOW      40k   40k   0     80k     LOW      20    20   0    40
    OFF      80k   80k   80k   0       OFF      40    40   40   0

A C sketch applying these parameters follows.
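A hedged C model (not the compiler's actual cost function) of the energy to execute a task of `work` cycles in a given state, plus the overhead of the following state change:

    enum state { FULL, MID, LOW, OFF };

    static const double freq[]   = { 1.0, 0.5,  0.25, 0.0 }; /* relative clock */
    static const double dyn_e[]  = { 1.0, 0.75, 0.5,  0.0 }; /* dynamic energy */
    static const double stat_p[] = { 1.0, 1.0,  1.0,  0.0 }; /* static power   */
    static const double trans_e[4][4] = {   /* energy overhead [uJ], from\to   */
        {  0, 20, 20, 40 },
        { 20,  0, 20, 40 },
        { 20, 20,  0, 40 },
        { 40, 40, 40,  0 },
    };

    /* Energy (arbitrary units) to run `work` cycles in state s (s != OFF;
       execution time stretches as the clock slows) and then switch to t. */
    double task_energy(double work, enum state s, enum state t)
    {
        double time = work / freq[s];       /* u.t. at the reduced frequency  */
        return dyn_e[s] * work              /* dynamic part                   */
             + stat_p[s] * time             /* static part                    */
             + trans_e[s][t];               /* state-transition overhead      */
    }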
Speed-up in Fastest Execution Mode
(Bar chart: speedup ratios without and with power saving ("w/o Saving" / "w Saving") for 1, 2, and 4 processors on tomcatv, swim, applu, and mpeg2enc; y-axis up to 4.5.)
Consumed Energy in Fastest Execution Mode
(Bar charts: consumed energy without and with power saving ("w/o Saving" / "w Saving") for 1, 2, and 4 processors — energy in J for tomcatv, swim, and applu, and in mJ for mpeg2_encode.)
Energy Reduction by OSCAR Compiler in Real-time Processing Mode (1% Leakage)
• Deadline = sequential execution time; leakage power: 1%.
Energy Reduction by OSCAR Compiler in Real-time Processing Mode (10% Leakage)
• Deadline = sequential execution time; leakage power: 10%.
Energy Reduction vs. Leakage Power (4 cores)
• Deadline = 0.5 x sequential execution time.
Conclusions
• Compiler cooperative, low-power, high effective performance multi-core processors will become increasingly important in a wide range of information systems, from games, mobile phones, and automobiles to peta-scale supercomputers.
• Parallelizing compilers are essential for realizing:
  – Good cost performance
  – Short hardware and software development periods
  – Low power consumption
  – High software productivity
  – Scalable performance improvement with advancement in semiconductor integration technology
• Key technologies in multi-core compilers: multigrain parallelization, data localization, data transfer overlapping using DMA, and low-power control.