Low Power Multicores, Parallelizing Compiler and Multiplatform API for Green Computing
Hironori Kasahara
Professor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research Institute
Waseda University, Tokyo, Japan
IEEE Computer Society Board of Governors
URL: http://www.kasahara.cs.waseda.ac.jp/
DASAN Conference on "Green IT", Nov. 3, 2011
Multi/Many-core Everywhere
Multicores from embedded systems to supercomputers:
- Consumer electronics (embedded): mobile phones, games, TV, car navigation, cameras. IBM/Sony/Toshiba Cell, Fujitsu FR1000, Panasonic UniPhier, NEC/ARM MPCore/MP211/NaviEngine, Renesas 4-core RP1, 8-core RP2 and 15-core heterogeneous RP-X, Plurality HAL 64 (Marvell), Tilera Tile64/-Gx100 (toward 1000 cores), DARPA UHPC (2017: 80 GFLOPS/W).
- PCs, servers: Intel Quad Xeon, Core 2 Quad, Montvale, Nehalem (8 cores), Larrabee (32 cores), SCC (48 cores), Knights Corner (50+ cores, 22 nm), AMD Quad-Core Opteron (8, 12 cores).
- Workstations, deskside and high-end servers: IBM (Power4, 5, 6, 7), Sun (Sparc T1, T2), Fujitsu SPARC64 VIIIfx.
- Supercomputers: Earth Simulator (40 TFLOPS, 2002, 5120 vector processors), BG/Q (A2: 16 cores, water cooled, 20 PFLOPS, 3-4 MW, 2011-12), BlueWaters (HPCS, Power7, 10+ PFLOPS, 2011.07), Tianhe-1A (4.7 PFLOPS, 6-core X5670 + NVIDIA Tesla M2050), Godson-3B (1 GHz, 40 W, 8 cores, 128 GFLOPS) and Godson-T (64 cores, 192 GFLOPS, 2011), RIKEN/Fujitsu "K" (10 PFLOPS, 8-core SPARC64 VIIIfx, 128 GFLOPS per chip).
[Photo: OSCAR-type multicore chip by Renesas in the METI/NEDO Multicore for Real-time Consumer Electronics project (leader: Prof. Kasahara).]
High-quality application software, productivity, cost performance and low power consumption are important, e.g. for mobile phones and games. Compiler-cooperated multicore processors are promising for realizing these features.
Top500 list (June 20, 2011): No. 1, Fujitsu "K", 548,352 cores, current peak 8.774 PFLOPS, LINPACK 8.162 PFLOPS (93.0%).
Nov. 2, 2011, press release for the "K" supercomputer by RIKEN:
- Number of CPUs: 88,128
- Peak performance: 11.28 PFLOPS
- LINPACK performance: 10.51 PFLOPS
- Execution efficiency: 93.2%
METI/NEDO National Project: Multicore for Real-time Consumer Electronics (July 2005 - March 2008)
R&D of compiler-cooperative multicore processor technology for consumer electronics such as mobile phones, games, DVD, digital TV and car navigation systems.
Project goals:
- Good cost performance
- Short hardware and software development periods
- Low power consumption
- Scalable performance improvement with the advancement of semiconductor technology
- Application sharing among chips and high reliability
- Use of the same parallelizing compiler for multicores from different vendors through a newly developed API (Application Programming Interface)
[Figure: OSCAR-type multicore architecture. Each chip multiprocessor (CMP0 .. CMPm) contains processor cores PC0 .. PCn, each with a CPU, local program memory / instruction cache (LPM/I-cache), local data memory / L1 data cache (LDM/D-cache), distributed shared memory (DSM), data transfer controller (DTC) and network interface (NI), connected by an intra-chip connection network (multiple buses, crossbar, etc.) to a centralized shared memory or L2 cache (CSM/L2). CMPs, I/O multicore chips and CSMs are connected by an inter-chip connection network (multiple buses, crossbar, multistage network, etc.).]
Participating companies: Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC
Renesas-Hitachi-Waseda Low Power 8-core RP2, developed in 2007 in the METI/NEDO project
[Die photo: eight cores (Core#0-#7), each with I$, D$, ILRAM, DLRAM and URAM; snoop controllers SNC0/SNC1, CSM, DBSC, DDR pads, LBSC, GCPG, VSWC and the SuperHyway (SHWY) bus.]
Process technology: 90 nm, 8-layer, triple-Vth CMOS
Chip size: 104.8 mm2 (10.61 mm x 9.88 mm)
CPU core size: 6.6 mm2 (3.36 mm x 1.96 mm)
Supply voltage: 1.0 V - 1.4 V (internal), 1.8/3.3 V (I/O)
Power domains: 17 (8 CPUs, 8 URAMs, common)
IEEE ISSCC08, Paper No. 4.5: M. Ito, ... and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"
OSCAR Multi-Core Architecture
[Figure: chip multiprocessors CMP0 .. CMPm, each containing processor elements PE0 .. PEn with CPU, LPM/I-cache, LDM/D-cache, DSM, DTC, FVR and a network interface, connected by an intra-chip connection network (multiple buses, crossbar, etc.) to a CSM / L2 cache; CMPs, I/O CMPs and CSMs are connected by an inter-chip connection network (crossbar, buses, multistage network, etc.). FVRs are provided per core and per chip.]
CSM: centralized shared memory; LDM: local data memory; DSM: distributed shared memory; LPM: local program memory; DTC: data transfer controller; FVR: frequency/voltage control register
8-Core RP2 Chip Block Diagram
[Figure: two 4-core clusters (Cluster #0: Core#0-#3, Cluster #1: Core#4-#7). Each core has a CPU, FPU, 16 KB I$, 16 KB D$, local memory (I: 8 KB, D: 32 KB), a cache controller / barrier register (CCN/BAR) and 64 KB user RAM (URAM). Each cluster has a snoop controller (SNC0/SNC1), a local clock pulse generator (LCPG0/LCPG1) and per-core power control registers (PCR0-PCR7); barrier synchronization lines connect the cores. The clusters, DDR2 controller, SRAM controller and DMA controller are connected by the on-chip system bus (SuperHyway).]
LCPG: local clock pulse generator; PCR: power control register; CCN/BAR: cache controller / barrier register; URAM: user RAM (distributed shared memory)
Demo of the NEDO Multicore for Real-Time Consumer Electronics project at the Council for Science and Technology Policy (CSTP) on April 10, 2008
CSTP members:
- Prime Minister: Mr. Y. FUKUDA
- Minister of State for Science, Technology and Innovation Policy: Mr. F. KISHIDA
- Chief Cabinet Secretary: Mr. N. MACHIMURA
- Minister of Internal Affairs and Communications: Mr. H. MASUDA
- Minister of Finance: Mr. F. NUKAGA
- Minister of Education, Culture, Sports, Science and Technology: Mr. K. TOKAI
- Minister of Economy, Trade and Industry: Mr. A. AMARI
1987 OSCAR (Optimally Scheduled Advanced Multiprocessor)
[Figure: host computer, control & I/O processor (RISC processor and I/O processor), and three centralized shared memories (CSM1-CSM3, simultaneously readable, multiple banks), connected to 16 processor elements (PE1-PE16) organized into 5-PE and 8-PE processor clusters (SPC1-SPC3, LPC1-LPC2) via buses with read & write request arbitrators. Each PE: a 5 MFLOPS, 32-bit RISC processor with 64 registers, two banks of program memory, data memory, stack memory, dual-port distributed shared memory and a DMA controller.]
1987 OSCAR PE Board
OSCAR Memory Space (Global and Local Address Space)
System memory space (global address space):
  00000000  broadcast area (PE0, PE1, ..., PE15)
  01000000  CP (control processor)
  01100000  not used
  02000000  PE0
  02100000  PE1
     :
  02F00000  PE15
  03000000  CSM1, 2, 3 (centralized shared memory)
  03400000 - FFFFFFFF  not used
Local memory space (per PE):
  00000000 - 00000800  DSM (distributed shared memory)
  00010000  not used
  00020000  LPM Bank 0 (local program memory)
  00030000  LPM Bank 1
  00040000  LDM (local data memory)
  00080000  not used
  000F0000  control / system accessing area
  00100000 - FFFFFFFF  undefined (local memory area)
OSCAR Parallelizing Compiler
Goal: improve effective performance, cost performance and software productivity, and reduce power consumption.
- Multigrain parallelization: coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism.
- Data localization: automatic data management for distributed shared memory, cache and local memory.
- Data transfer overlapping: data transfers are overlapped with computation using data transfer controllers (DMAs), as sketched below.
- Power reduction: compiler-controlled DVFS and power gating with hardware support.
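For flavor, the following is a hedged C sketch of the data-transfer-overlapping idea (double buffering between shared memory and local data memory). It is not compiler output; dma_start()/dma_wait(), CHUNK and process_chunk() are illustrative names rather than OSCAR API calls, and the "DMA" is emulated with memcpy so the sketch is self-contained.

    /* Double-buffering sketch: overlap the fetch of chunk i+1 with the
     * computation on chunk i, in the spirit of DTC/DMA data transfer overlapping. */
    #include <string.h>

    #define CHUNK 1024

    static float *dma_dst; static const float *dma_src; static int dma_pending;
    static void dma_start(float *dst, const float *src, int n)   /* would be asynchronous on real hardware */
    { dma_dst = dst; dma_src = src; dma_pending = n; }
    static void dma_wait(void)                                    /* wait for the outstanding transfer */
    { if (dma_pending) memcpy(dma_dst, dma_src, (size_t)dma_pending * sizeof(float)); dma_pending = 0; }

    static void process_chunk(float *buf, int n)                  /* placeholder computation */
    { for (int i = 0; i < n; i++) buf[i] *= 2.0f; }

    /* Process 'total' elements (assumed a multiple of CHUNK) living in off-chip
     * shared memory, staging them through two buffers in local data memory. */
    void process_array(const float *shared_src, int total)
    {
        static float ldm_buf[2][CHUNK];    /* double buffer in local data memory */
        int cur = 0;

        dma_start(ldm_buf[cur], shared_src, CHUNK);               /* prefetch the first chunk */
        for (int i = 0; i < total; i += CHUNK) {
            dma_wait();                                           /* chunk starting at i is now local */
            if (i + CHUNK < total)                                /* start fetching the next chunk */
                dma_start(ldm_buf[1 - cur], shared_src + i + CHUNK, CHUNK);
            process_chunk(ldm_buf[cur], CHUNK);                   /* compute while the transfer "runs" */
            cur = 1 - cur;
        }
    }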
[Figure: a macro task graph (MTG) of about 33 macro-tasks in which groups of loops accessing the same data are gathered into data localization groups dlg0-dlg3.]
Generation of Coarse Grain Tasks
The program is decomposed into macro-tasks (MTs):
- Block of Pseudo Assignments (BPA): basic block (BB), parallelized at the near-fine-grain level
- Repetition Block (RB): natural loop, handled by loop-level parallelization, near-fine-grain parallelization of the loop body, or coarse-grain parallelization of the body
- Subroutine Block (SB): subroutine, handled by coarse-grain parallelization of its body
Decomposition is hierarchical: the total program forms the 1st layer, and the bodies of RBs and SBs are decomposed again into BPA/RB/SB macro-tasks in the 2nd and 3rd layers. A minimal C sketch follows.
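The toy program below (illustrative code, not from the OSCAR compiler) shows how source constructs map onto the three macro-task types; the comments mark what would become one macro-task in the 1st layer and what would be decomposed again in lower layers.

    /* Illustrative decomposition of a small C program into macro-tasks. */
    #include <stdio.h>

    #define N 1000
    static double a[N], b[N];

    static double reduce(const double *x, int n)   /* SB: subroutine block */
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)                /* 2nd-layer RB inside the SB */
            s += x[i];
        return s;
    }

    int main(void)
    {
        /* BPA: block of assignments (near-fine-grain parallelism among statements) */
        double scale = 2.0;
        double offset = 1.0;

        /* RB: repetition block (loop-level or near-fine-grain parallelism of the body) */
        for (int i = 0; i < N; i++) {
            a[i] = scale * i + offset;
            b[i] = a[i] * a[i];
        }

        /* SB: subroutine block (coarse-grain parallelization inside the callee) */
        printf("%f\n", reduce(b, N));
        return 0;
    }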
Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)
The compiler analyzes both data dependences and control dependences (conditional branches) among macro-tasks.
[Figure: a macro flow graph of BPAs and RBs showing data dependences, the original control flow and conditional branches, and the corresponding macro task graph showing data dependences, extended control dependences, conditional branches, and OR/AND combinations of predecessor conditions.]
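As a hedged illustration of what an earliest executable condition expresses (a hypothetical three-task fragment, not the graph on the slide): suppose MT2 executes only when MT1 takes the branch toward it, and MT3 data-depends on MT2. Then MT3 may start as soon as MT2 completes, or as soon as MT1's branch outcome shows that MT2 will never execute, so MT3 does not have to wait for MT1 itself to finish.

    /* Hypothetical earliest-executable-condition check (illustration only). */
    #include <stdbool.h>

    bool mt3_can_start(bool mt2_completed,
                       bool mt1_branch_resolved,
                       bool mt1_branched_to_mt2)
    {
        /* MT3 may start when its data producer MT2 has finished,
         * OR when it is known that MT2 will never execute. */
        return mt2_completed || (mt1_branch_resolved && !mt1_branched_to_mt2);
    }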
Automatic Processor Assignment in su2cor
- Using 14 processors
- Coarse grain parallelization within DO400 of subroutine LOOPS
MTG of su2cor LOOPS DO400
[Figure: macro task graph; coarse grain parallelism PARA_ALD = 4.3; node types: DOALL, sequential loop, SB, BB.]
Data-Localization: Loop Aligned Decomposition
- Decompose multiple loops (DOALL and sequential) into CARs and LRs considering inter-loop data dependences.
- LR: Localizable Region, CAR: Commonly Accessed Region
- Most data in an LR can be passed through local memory.
Example:
C RB1 (Doall)
      DO I=1,101
        A(I)=2*I
      ENDDO
C RB2 (Doseq)
      DO I=1,100
        B(I)=B(I-1)+A(I)+A(I+1)
      ENDDO
C RB3 (Doall)
      DO I=2,100
        C(I)=B(I)+B(I-1)
      ENDDO
Aligned decomposition:
  RB1: LR DO I=1,33 | CAR DO I=34,35 | LR DO I=36,66 | CAR DO I=67,68 | LR DO I=69,101
  RB2: DO I=1,33 | DO I=34,34 | DO I=35,66 | DO I=67,67 | DO I=68,100
  RB3: DO I=2,34 | DO I=35,67 | DO I=68,100
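A minimal C sketch of the idea (illustrative only): the RB1/RB2/RB3 chunks that belong to one localizable region are executed back to back on the same core, so the A and B chunks stay in local data memory. Arrays are used 1-based (element 0 unused), and the few CAR iterations shared with the neighbouring region are omitted for brevity.

    /* Per-core execution of one localizable region (LR). */
    #define LR_MAX 64                      /* assumed upper bound on an LR's size */

    void run_lr(double *B, double *C, int lo, int hi)   /* e.g. lo=1, hi=33 for the first LR above */
    {
        double a_local[LR_MAX + 2];        /* A(lo..hi+1) kept in local data memory */
        double b_local[LR_MAX + 2];        /* B(lo-1..hi) kept in local data memory */

        for (int i = lo; i <= hi + 1; i++)             /* RB1 chunk: A(I)=2*I */
            a_local[i - lo] = 2.0 * i;

        b_local[0] = B[lo - 1];                        /* boundary value from shared memory */
        for (int i = lo; i <= hi; i++) {               /* RB2 chunk: B(I)=B(I-1)+A(I)+A(I+1) */
            b_local[i - lo + 1] = b_local[i - lo] + a_local[i - lo] + a_local[i - lo + 1];
            B[i] = b_local[i - lo + 1];                /* store back to shared memory */
        }

        for (int i = lo + 1; i <= hi; i++)             /* RB3 chunk: C(I)=B(I)+B(I-1) */
            C[i] = b_local[i - lo + 1] + b_local[i - lo];
    }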
Data Localization
[Figure: the MTG before loop-aligned division, the MTG after division, and a schedule of the divided macro-tasks for two processors (PE0, PE1); loops belonging to the same data localization group (dlg0-dlg3) are assigned to the same processor so that their data can be passed through local memory.]
Power Reduction by Power Supply, Clock Frequency and Voltage Control by OSCAR Compiler
- Shortest execution time mode
- Power reduction scheduling
Generated Multigrain Parallelized Code
The nested coarse-grain task parallelization is realized using only the OpenMP "section", "flush" and "critical" directives.
[Figure: code image. The 1st layer is an OpenMP SECTIONS construct with one SECTION per thread group (threads T0-T7 divided between thread group 0 and thread group 1), containing centralized and distributed scheduling code for macro-tasks MT1_1, MT1_2, MT1_3 (SB) and MT1_4 (DOALL RB). The 2nd-layer macro-tasks (MT1_3_1..MT1_3_6 and MT1_4_1..MT1_4_4) are scheduled onto the threads of each group, with SYNC SEND / SYNC RECV operations between the groups.]
A hedged sketch of this code style follows.
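The following is a minimal, hedged OpenMP/C sketch of this style of code, not actual OSCAR compiler output: one thread per group for brevity, mt1_1()..mt1_4() are placeholder macro-tasks, and group-to-group synchronization is done with a flag and "flush" only.

    /* Sketch: one OpenMP "section" per thread group; a flag plus flush
     * plays the role of the SYNC SEND / SYNC RECV between groups. */
    #include <omp.h>
    #include <stdio.h>

    static volatile int mt1_2_done = 0;     /* completion flag for MT1_2 */

    static void mt1_1(void) { /* ... coarse-grain macro-task ... */ }
    static void mt1_2(void) { /* ... */ }
    static void mt1_3(void) { /* ... 2nd-layer tasks MT1_3_1..MT1_3_6 ... */ }
    static void mt1_4(void) { /* ... 2nd-layer DOALL chunks MT1_4_1..   ... */ }

    int main(void)
    {
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            {                                /* thread group 0 */
                mt1_1();
                mt1_2();
                mt1_2_done = 1;              /* "send": publish completion of MT1_2 */
                #pragma omp flush(mt1_2_done)
                mt1_3();
            }
            #pragma omp section
            {                                /* thread group 1 */
                do {                         /* "receive": wait for MT1_2 */
                    #pragma omp flush(mt1_2_done)
                } while (!mt1_2_done);
                mt1_4();
            }
        }
        printf("done\n");
        return 0;
    }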
Compilation Flow Using OSCAR API
1. Application program: Fortran or parallelizable C (sequential program).
2. Waseda Univ. OSCAR parallelizing compiler: coarse-grain task parallelization, global data localization, data transfer overlapping using DMA, and power reduction control using DVFS, clock gating and power gating.
3. Output: parallelized Fortran or C program with the API, i.e. code with directives for each processor/thread (Proc0/Thread 0, Proc1/Thread 1, ...). The OSCAR API provides directives for thread generation, memory mapping, data transfer using DMA and power management for real-time, low-power, high-performance multicores.
4. An API analyzer plus an existing sequential (backend) compiler from each vendor (Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC), or an OpenMP compiler, generates parallel machine code using the sequential compilers.
5. The executable machine code runs on various multicores (multicore from vendor A, multicore from vendor B, ...) and shared memory servers.
OSCAR: Optimally Scheduled Advanced Multiprocessor; API: Application Program Interface
OSCAR API
http://www.kasahara.cs.waseda.ac.jp/api/regist_en.html
- Targets mainly real-time consumer electronics devices: embedded computing and various kinds of memory architecture (SMP, local memory, distributed shared memory, ...)
- Developed with six Japanese companies (Fujitsu, Hitachi, NEC, Toshiba, Panasonic, Renesas); supported by METI/NEDO
- Based on a subset of OpenMP, a very popular parallel processing API with a shared-memory programming model
- Six categories of directives:
  - Parallel execution (4 directives from OpenMP)
  - Memory mapping (distributed shared memory, local memory)
  - Data transfer overlapping using the DMA controller
  - Power control (DVFS, clock gating, power gating)
  - Timer for real-time control
  - Synchronization (hierarchical barrier synchronization)
(Kasahara-Kimura Lab., Waseda University)
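For a flavor of how such a program looks, here is a hedged sketch that combines the parallel-execution category (plain OpenMP sections) with the power-control directive shown on the next slide. Only the fvcontrol pragma is taken from this deck; mt_a()/mt_b() are placeholder macro-tasks, and the pragma is meaningful only to an API-aware backend compiler (a standard C compiler will simply ignore it).

    /* Sketch of an API-annotated C program (illustrative, not generated code). */
    #include <omp.h>

    static void mt_a(void) { /* ... macro-task ... */ }
    static void mt_b(void) { /* ... macro-task ... */ }

    int main(void)
    {
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            {
                mt_a();
                /* power-control category: request a low-power state for this core */
                #pragma oscar fvcontrol ((OSCAR_CPU(), 0))
            }
            #pragma omp section
            {
                mt_b();
            }
        }
        return 0;
    }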
Low-Power Optimization with OSCAR API
[Figure: the scheduled result by the OSCAR compiler for two virtual cores (VC0, VC1) executing macro-tasks MT1-MT4, and the corresponding generated code image, main_VC0() and main_VC1(). Idle (sleep) intervals in the schedule are filled with the power-control directives
    #pragma oscar fvcontrol ((OSCAR_CPU(),0))
and
    #pragma oscar fvcontrol (1,(OSCAR_CPU(),100))  ]
Performance of OSCAR Compiler on IBM p6 595 Power6 (4.2GHz) based 32-core SMP Server
OpenMP codes generated by the OSCAR compiler accelerate IBM XL Fortran for AIX Ver. 12.1 by about 3.3 times on average.
Compile options:
(*1) Sequential: -O3 -qarch=pwr6; XLF: -O3 -qarch=pwr6 -qsmp=auto; OSCAR: -O3 -qarch=pwr6 -qsmp=noauto
(*2) Sequential: -O5 -q64 -qarch=pwr6; XLF: -O5 -q64 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -q64 -qarch=pwr6 -qsmp=noauto
(Others) Sequential: -O5 -qarch=pwr6; XLF: -O5 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -qarch=pwr6 -qsmp=noauto
Performance of OSCAR Compiler Using the Multicore API on Intel Quad-core Xeon
[Bar chart: speedup ratios of Intel Compiler Ver. 10.1 vs. OSCAR on SPEC95 benchmarks (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 benchmarks (swim, mgrid, applu, apsi).]
The OSCAR compiler gives 2.1 times speedup on average against the Intel compiler ver. 10.1.
Performance of OSCAR Compiler on a 16-core SGI Altix 450 Montvale Server
[Bar chart: speedup ratios of Intel Compiler Ver. 10.1 vs. OSCAR on SPEC95 benchmarks (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 benchmarks (swim, mgrid, applu, apsi).]
Compiler options for the Intel compiler: automatic parallelization: -fast -parallel; OpenMP codes generated by OSCAR: -fast -openmp
The OSCAR compiler gave 2.3 times speedup against the Intel Fortran Itanium compiler revision 10.1.
Performance of OSCAR Compiler on NEC NaviEngine (ARM-NEC MPCore)
[Bar chart: speedup ratios of g77 vs. OSCAR for 1, 2 and 4 PEs on mgrid, su2cor and hydro2d (SPEC95).]
Compile option: -O3
The OSCAR compiler gave 3.43 times speedup against 1 core on the ARM/NEC MPCore with four 400 MHz ARM cores.
Performance of OSCAR Compiler Using the Multicore API on Fujitsu FR1000 Multicore
[Bar chart: speedups against single-core execution for 1, 2, 3 and 4 cores on MPEG2 encoding, MPEG2 decoding, MP3 encoding and JPEG 2000 encoding; the 4-core speedups are 3.76, 3.75, 3.47 and 2.55.]
3.38 times speedup on average for 4 cores against single-core execution.
Performance of OSCAR Compiler Using the Developed API on the 4-core (SH-4A) OSCAR-Type Multicore
[Bar chart: speedups against single-core execution for 1, 2, 3 and 4 cores on MPEG2 encoding, MPEG2 decoding, MP3 encoding and JPEG 2000 encoding; the 4-core speedups are 3.50, 3.35, 3.24 and 3.17.]
3.31 times speedup on average for 4 cores against 1 core.
Processing Performance on the Developed Multicore Using the Automatic Parallelizing Compiler
Speedups against single-core execution for AAC* audio encoding: 1.0 (1 core), 1.9 (2 cores), 3.6 (4 cores), 5.8 (8 cores).
*) Advanced Audio Coding
Power Reduction by OSCAR Parallelizing Compiler for Secure Audio Encoding (AAC encoding + AES encryption with 8 CPU cores)
[Power waveforms: without power control (voltage fixed at 1.4 V) vs. with power control (frequency control, resume standby: power shutdown and voltage lowering from 1.4 V to 1.0 V).]
Average power: 5.68 W without power control vs. 0.67 W with power control, an 88.3% power reduction.
Power Reduction by OSCAR Parallelizing Compiler for MPEG2 Decoding (MPEG2 decoding with 8 CPU cores)
[Power waveforms: without power control (voltage fixed at 1.4 V) vs. with power control (frequency control, resume standby: power shutdown and voltage lowering from 1.4 V to 1.0 V).]
Average power: 5.73 W without power control vs. 1.52 W with power control, a 73.5% power reduction.
Low-Power, High-Performance Multicore Computer with Solar Panel: clean-energy autonomous servers operational in deserts
OSCAR API-Applicable Heterogeneous Multicore Architecture
[Figure: general-purpose CPU cores (VC(0),VPC(0) ... VC(n),VPC(n)) and accelerator cores ACCa_0, ACCa_1, ACCb_0, ... (VC(n+1),VPC(n+1) ... VC(n+m+l)), some attached to a CPU and some driven by a DMAC. Each core has a local program memory (LPM), local data memory (LDM), distributed shared memory (DSM), data transfer unit (DTU), frequency/voltage control registers (FVR) and a network interface; an interconnection network connects the cores to the on-chip CSM and, through a memory interface, to the off-chip CSM.]
DTU: data transfer unit; LPM: local program memory; LDM: local data memory; DSM: distributed shared memory; CSM: centralized shared memory; FVR: frequency/voltage control register
An Image of a Static Schedule for a Heterogeneous Multicore with Data Transfer Overlapping and Power Control
[Figure: Gantt chart of the static schedule over time.]
Heterogeneous Multicore RP-X, presented in the ISSCC 2010 Processors Session on Feb. 8, 2010 (Renesas, Hitachi, Tokyo Institute of Technology and Waseda University)
[Block diagram: two clusters of four SH-4A (SH-X3) CPU cores, each core with CPU, FPU, I$, D$, ILM, DLM, URAM, DTU and CRU, plus an SNC and L2C per cluster; accelerators FE #0-#3, MX2 #0-#1, VPU5 and SPU2; DBSC #0/#1 and DMAC #0/#1; and peripherals (SATA, PCI Express, SHPB, HPB, LBSC), connected by the SuperHyway buses SHwy#0 (Address=40, Data=128), SHwy#1 (Address=40, Data=128) and SHwy#2 (Address=32, Data=64).]
Parallel Processing Performance Using OSCAR Compiler and OSCAR API on RP-X (optical flow with a hand-tuned library)
[Bar chart: speedups against a single SH processor: 1 (1SH), 2.29 (2SH), 3.09 (4SH), 5.4 (8SH), and with accelerators 18.85 (2SH+1FE), 26.71 (4SH+2FE), 32.65 (8SH+4FE); the CPU performs data transfers between SH and FE. Frame rate improves from 3.5 fps (1SH) to 111 fps (8SH+4FE).]
Power Reduction in a Real-Time Execution Controlled by OSCAR Compiler and OSCAR API on RP-X (optical flow with a hand-tuned library)
[Power waveforms over one 33 ms cycle (30 fps): without power reduction, average 1.76 W; with power reduction by the OSCAR compiler, average 0.54 W, about 70% power reduction.]
Fujitsu’s Vector Manycore Design Evaluated in NEDO Manycore Leading Research, Feb. 2010
Green Computing Systems R&D Center, Waseda University (supported by METI, completed March 2011)
<R&D Target> Hardware, software and applications for super low-power manycore processors: more than 64 cores, natural air cooling (no fan), cool, compact, clear, quiet, operational by solar panel.
Hitachi SR16000: Power7 128-core SMP; Fujitsu M9000: SPARC VII 256-core SMP.
Participating companies: Hitachi, Fujitsu, NEC, Renesas, Olympus, Toyota, Denso, Mitsubishi, Toshiba, etc.
<Ripple Effect> Low CO2 (carbon dioxide) emissions; creation of value-added products: consumer electronics, automobiles, servers.
Location: beside the subway Waseda Station, near the Waseda Univ. main campus.
Application domains: Environment, Industry, Lives
Cancer Treatment: Carbon Ion Radiotherapy (with the National Institute of Radiological Sciences, NIRS)
- 5.6 times speedup with 8 processors (Intel quad-core Xeon, 8-core SMP)
- 50 times speedup with 64 processors (IBM Power7 64-core SMP, Hitachi SR16000)
LCPC 2012
The 25th International Workshop on Languages and Compilers for Parallel Computing (Tentative) http://www.kasahara.cs.waseda.ac.jp/lcpc2012/
September 11 – 13, 2012 Green Computing Systems Research and Development Center, Waseda University, Tokyo, Japan
The LCPC workshop is a forum for sharing cutting‐edge research on all aspects of parallel languages, compilers and related topics including runtime systems and tools. The scope of the workshop spans foundational results and practical experience, and all classes of parallel processors including concurrent, multithreaded, multicore, accelerated, multiprocessor, and tightly‐clustered systems. Given the rise of multicore processors, LCPC is particularly interested in work that seeks to transition parallel programming into the computing mainstream.
General Chair • Hironori Kasahara, Waseda University Program Chair • Keiji Kimura, Waseda University
Post workshop optional tour on September 14: “K” Kobe Supercomputing Center
Conclusions
OSCAR-compiler-cooperative, real-time, low-power multicores with high effective performance and short software development periods will be important in a wide range of IT systems, from consumer electronics to automobiles, medical systems, super-real-time disaster simulators (e.g. tsunami) and exa-FLOPS machines.
For industry, a few minutes of compilation of a C program using the OSCAR compiler and API, instead of months of hand parallelization, gives:
- Scalable speedup on various multicores, such as the Hitachi/Renesas/Waseda 8-core homogeneous RP2 (8 SH-4A) and 15-core heterogeneous RP-X, NEC/ARM MPCore, Fujitsu FR1000, Intel quad-core Xeon, IBM Power7, etc.
- 70% automatic power reduction on RP2 and RP-X for real-time media processing, for the first time in the world.
The OSCAR green compiler, API, multicores and manycores will be continuously developed for saving lives from natural disasters and sicknesses such as cancer, in addition to consumer electronics and scientific applications.