Low Power Multicores, Parallelizing Compiler and Multiplatform API

Low Power Multicores, Parallelizing Compiler and Multiplatform API for Green Computing
Hironori Kasahara
Professor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research Institute

Waseda University, Tokyo, Japan IEEE Computer Society Board of Governors URL: http://www.kasahara.cs.waseda.ac.jp/

DASAN Conference on “Green IT”, Nov. 3, 2011

Multi/Many-core Everywhere

Multi-core from embedded to supercomputers
 Consumer Electronics (Embedded): mobile phones, games, TVs, car navigation, cameras; IBM/Sony/Toshiba Cell, Fujitsu FR1000, Panasonic Uniphier, NEC/ARM MPCore/MP211/NaviEngine, Renesas 4-core RP1, 8-core RP2, 15-core heterogeneous RP-X, Plurality HAL 64 (Marvell), Tilera Tile64/-Gx100 (toward 1000 cores), DARPA UHPC (2017: 80 GFLOPS/W)

 PCs, Servers: Intel Quad Xeon, Core 2 Quad, Montvale, Nehalem (8 cores), Larrabee (32 cores), SCC (48 cores), Knights Corner (50+ cores, 22nm), AMD Quad-Core Opteron (8, 12 cores)
[Photo: OSCAR-type multi-core chip by Renesas in the METI/NEDO Multicore for Real-time Consumer Electronics Project (Leader: Prof. Kasahara)]

 WSs, Deskside & High-end Servers: IBM (Power4, 5, 6, 7), Sun (SPARC T1, T2), Fujitsu SPARC64fx8

 Supercomputers: Earth Simulator (40 TFLOPS, 2002, 5120 vector processors); BG/Q (A2: 16 cores), water cooled, 20 PFLOPS, 3-4 MW (2011-12); BlueWaters (HPCS), Power7, 10 PFLOPS+ (2011.07); Tianhe-1A (4.7 PFLOPS, 6-core X5670 + Nvidia Tesla M2050); Godson-3B (1 GHz, 40 W, 8 cores, 128 GFLOPS); Godson-T (64 cores, 192 GFLOPS: 2011); RIKEN Fujitsu "K", 10 PFLOPS (8-core SPARC64 VIIIfx, 128 GFLOPS)

High-quality application software, productivity, cost performance, and low power consumption are important (e.g., mobile phones, games). Compiler-cooperative multi-core processors are promising to realize the above features.
The 37th TOP500 (June 20, 2011): No. 1, Fujitsu "K", 548,352 cores; current peak 8.774 PFLOPS; LINPACK 8.162 PFLOPS (93.0%).


Nov. 2, 2011, press release for the "K" supercomputer by RIKEN:
• No. of CPUs: 88,128
• Peak performance: 11.28 PFLOPS
• LINPACK performance: 10.51 PFLOPS
• Execution efficiency: 93.2%

METI/NEDO National Project: Multi-core for Real-time Consumer Electronics. R&D of compiler-cooperative multicore processor technology for consumer electronics such as mobile phones, games, DVDs, digital TVs, and car navigation systems. From July 2005 to March 2008.
・Good cost performance

(2005.7-2008.3)**
[Block diagram: CMP 0 ... CMP m (multicore chips), each with processor cores PC 0 ... PC n; each core contains a CPU (processor), LPM/I-cache (local program memory / instruction cache), LDM/D-cache (local data memory / L1 data cache), a DTC (data transfer controller), and DSM (distributed shared memory).]

・Short hardware and software development periods
・Low power consumption
・Scalable performance improvement with the advancement of semiconductor technology
・Use of the same parallelizing compiler for multicores from different vendors through a newly developed API (API: Application Programming Interface)

New multicore processor goals: high performance, low power consumption, short HW/SW development period, applications shared across chips, high reliability, and performance improvement with semiconductor integration.
[Diagram labels: NI (network interface); IntraCCN (intra-chip connection network: multiple buses, crossbar, etc.); CSM / L2 cache (centralized shared memory or L2 cache); InterCCN (inter-chip connection network: multiple buses, crossbar, multistage network, etc.); CSM j (centralized shared memory); multicore chip for I/O; I/O devices; integrated automotive ECU.]
**Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC
Renesas-Hitachi-Waseda Low Power 8-core RP2, developed in 2007 in the METI/NEDO project
[Die photo: Core#0-Core#7, each with I$, D$, ILRAM, DLRAM and URAM; SNC0/SNC1 (snoop controllers); CSM; DBSC; DDRPAD; LBSC; VSWC; GCPG; SHWY.]
Process Technology: 90nm, 8-layer, triple-Vth CMOS
Chip Size: 104.8 mm2 (10.61 mm x 9.88 mm)
CPU Core Size: 6.6 mm2 (3.36 mm x 1.96 mm)
Supply Voltage: 1.0-1.4 V (internal), 1.8/3.3 V (I/O)
Power Domains: 17 (8 CPUs, 8 URAMs, common)

IEEE ISSCC 2008, Paper No. 4.5: M. Ito, … and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"


OSCAR Multi-Core Architecture
[Block diagram: CMP 0 ... CMP m (chip multiprocessors), each with PE 0 ... PE n; each PE contains a CPU, LPM/I-cache, LDM/D-cache, DTC, DSM, FVRs and a network interface; an intra-chip connection network (multiple buses, crossbar, etc.); CSM / L2 cache; CSM j and I/O CMP k chips connected through an inter-chip connection network (crossbar, buses, multistage network, etc.).]
CSM: centralized shared memory; LDM: local data memory; DSM: distributed shared memory; LPM: local program memory; DTC: data transfer controller; FVR: frequency/voltage control register


8 Core RP2 Chip Block Diagram

[Block diagram: Cluster #0 (Core #0-#3) and Cluster #1 (Core #4-#7). Each core has a CPU, FPU, I$ 16K, D$ 16K, local memory (I: 8K, D: 32K), CCN/BAR and 64K URAM. Snoop controllers 0 and 1, barrier synchronization lines, LCPG0/LCPG1, power control registers PCR0-PCR7, the on-chip system bus (SuperHyway), and DDR2, SRAM and DMA controllers.]

LCPG: local clock pulse generator; PCR: power control register; CCN/BAR: cache controller / barrier register; URAM: user RAM (distributed shared memory)


Demo of NEDO Multicore for Real-Time Consumer Electronics at the Council for Science and Technology Policy (CSTP) on April 10, 2008.
CSTP members:
Prime Minister: Mr. Y. FUKUDA
Minister of State for Science, Technology and Innovation Policy: Mr. F. KISHIDA
Chief Cabinet Secretary: Mr. N. MACHIMURA
Minister of Internal Affairs and Communications: Mr. H. MASUDA
Minister of Finance: Mr. F. NUKAGA
Minister of Education, Culture, Sports, Science and Technology: Mr. K. TOKAI
Minister of Economy, Trade and Industry: Mr. A. AMARI

1987 OSCAR (Optimally Scheduled Advanced Multiprocessor)
[System diagram: host computer; control & I/O processors (RISC processor, I/O processor); centralized shared memories CSM1-CSM3 (simultaneously readable, Banks 1-3); read & write request arbitrator; bus interfaces; processing elements PE1-PE16, grouped into 5-PE clusters (SPC1-SPC3) and 8-PE processor clusters (LPC1, LPC2). Each PE (with control processor, CP) has a 5 MFLOPS, 32-bit RISC processor with 64 registers, 2 banks of program memory, data memory, stack memory, a dual-port distributed shared memory, and a DMA controller.]

1987 OSCAR PE Board

OSCAR Memory Space (Global and Local Address Spaces)
[Memory map. System (global) memory space: distributed shared memories of PE0-PE15 and a broadcast area in the low addresses; 01000000 CP (control processor); 02000000-02F00000 local memory areas of PE0-PE15; 03000000 CSM1, 2, 3 (centralized shared memory); 03400000 to FFFFFFFF not used. Local memory space: 00000000 DSM (distributed shared memory); 00010000 and 00020000 LPM Banks 0 and 1 (local program memory); 00040000 LDM (local data memory); 000F0000 control; 00100000 system accessing area up to FFFFFFFF; unused regions in between.]

OSCAR Parallelizing Compiler
To improve effective performance, cost-performance and software productivity, and to reduce power:
 Multigrain Parallelization: coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism.
 Data Localization: automatic data management for distributed shared memory, cache and local memory.
 Data Transfer Overlapping: data transfer overlapping using data transfer controllers (DMAs).
 Power Reduction: reduction of consumed power by compiler-controlled DVFS and power gating with hardware support.
[Figure: a macro task graph whose nodes are grouped into data localization groups dlg0-dlg3.]

Generation of coarse-grain tasks: macro-tasks (MTs)
 Block of Pseudo Assignments (BPA): basic block (BB)
 Repetition Block (RB): natural loop
 Subroutine Block (SB): subroutine

[Figure: hierarchical multigrain parallelization of a program. The total system (1st layer) is decomposed into BPA, RB and SB macro-tasks; BPAs are parallelized at the near-fine-grain level, RBs at the loop level or near-fine-grain level within loop bodies, and SBs by coarse-grain parallelization; each RB and SB is further decomposed into BPA/RB/SB macro-tasks in the 2nd and 3rd layers.]
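As a concrete illustration only (not the compiler's internal data structures), the macro-task hierarchy above could be modeled in C as follows; all type and field names are hypothetical.

#include <stddef.h>

/* Hypothetical model of the macro-task hierarchy (BPA / RB / SB). */
enum mt_kind { MT_BPA, MT_RB, MT_SB };

struct macro_task {
    enum mt_kind kind;            /* BPA: basic block, RB: loop, SB: subroutine */
    int id;                       /* node number in the macro-task graph        */
    int layer;                    /* 1 = whole program, 2, 3, ... nested layers */
    struct macro_task **children; /* macro-tasks generated inside an RB or SB   */
    size_t n_children;            /* BPAs have no children                      */
};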

Earliest Executable Condition Analysis for coarse-grain tasks (macro-tasks)
[Figure: a macro flow graph of BPAs and RBs connected by data-dependence edges, control-flow edges and conditional branches, and the corresponding macro task graph, whose edges represent data dependences and extended control dependences combined by AND/OR conditions; the original control flow and the END node are shown for reference.]
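To make the idea concrete, here is a simplified sketch (not the OSCAR implementation) of checking whether a macro-task's earliest executable condition is satisfied; the structures, the AND/OR rule as written, and the function name are assumptions for illustration.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical edge sets of one macro-task-graph node. */
struct eec_node {
    const int *data_preds;   size_t n_data;   /* predecessors it is data-dependent on    */
    const int *branch_preds; size_t n_branch; /* branches that can decide its execution  */
};

/* Simplified rule: a task becomes ready when all data-dependent predecessors
   have finished (AND) and at least one branch that selects it has been taken (OR). */
static bool earliest_executable(const struct eec_node *n,
                                const bool finished[], const bool branch_taken[])
{
    for (size_t i = 0; i < n->n_data; i++)
        if (!finished[n->data_preds[i]])
            return false;
    if (n->n_branch == 0)
        return true;
    for (size_t i = 0; i < n->n_branch; i++)
        if (branch_taken[n->branch_preds[i]])
            return true;
    return false;
}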

Automatic processor assignment in su2cor
• Using 14 processors
• Coarse-grain parallelization within loop DO400 of subroutine LOOPS

MTG of su2cor LOOPS-DO400
Coarse-grain parallelism PARA_ALD = 4.3
[Macro task graph figure; node types: DOALL, sequential loop, SB, BB.]

Data-Localization: Loop Aligned Decomposition
• Decompose multiple loops (DOALL and sequential) into CARs and LRs considering inter-loop data dependence.
  – Most data in an LR can be passed through local memory (LM).
  – LR: Localizable Region, CAR: Commonly Accessed Region

C RB1 (Doall)
      DO I=1,101
        A(I)=2*I
      ENDDO
C RB2 (Doseq)
      DO I=1,100
        B(I)=B(I-1)+A(I)+A(I+1)
      ENDDO
C RB3 (Doall)
      DO I=2,100
        C(I)=B(I)+B(I-1)
      ENDDO

Decomposition:
RB1: LR DO I=1,33; CAR DO I=34,35; LR DO I=36,66; CAR DO I=67,68; LR DO I=69,101
RB2: DO I=1,33; DO I=34,34; DO I=35,66; DO I=67,67; DO I=68,100
RB3: DO I=2,34; DO I=35,67; DO I=68,100
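A minimal sequential C sketch of the same decomposition is shown below; the group boundaries follow the slide, while the assignment of groups to cores and the handling of CAR data are simplified for illustration (on the real multicore each group would run on its own core and the CAR data would be exchanged through shared memory).

#include <stdio.h>

#define N 101
static double A[N + 1], B[N + 1], C[N + 1];   /* indices 0..101 */

static void rb1(int lo, int hi) { for (int i = lo; i <= hi; i++) A[i] = 2.0 * i; }
static void rb2(int lo, int hi) { for (int i = lo; i <= hi; i++) B[i] = B[i-1] + A[i] + A[i+1]; }
static void rb3(int lo, int hi) { for (int i = lo; i <= hi; i++) C[i] = B[i] + B[i-1]; }

int main(void)
{
    /* dlg0: LR partitions around I = 1..33 plus the neighbouring CAR iterations. */
    rb1(1, 33);  rb1(34, 35);  rb2(1, 33);  rb2(34, 34);  rb3(2, 34);
    /* dlg1: LR partitions around I = 35..66 plus the neighbouring CAR iterations. */
    rb1(36, 66); rb1(67, 68);  rb2(35, 66); rb2(67, 67);  rb3(35, 67);
    /* dlg2: LR partitions around I = 68..100. */
    rb1(69, 101); rb2(68, 100); rb3(68, 100);

    printf("C(100) = %.1f\n", C[100]);
    return 0;
}

Executing the aligned partitions of RB1, RB2 and RB3 back-to-back keeps A and B for each localizable region in one core's local memory, which is the point of the decomposition.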

Data Localization
[Figure: the macro task graph (MTG), the MTG after loop division, and a schedule of the divided macro-tasks for two processors (PE0 and PE1); macro-tasks are grouped into data localization groups dlg0-dlg3 so that tasks in the same group are scheduled consecutively on the same processor.]

Power Reduction by Power Supply, Clock Frequency and Voltage Control by OSCAR Compiler
• Shortest execution time mode
[Figure: power reduction scheduling of macro-tasks MT1_1 to MT1_4 across threads T0-T7.]

Generated Multigrain Parallelized Code
(The nested coarse-grain task parallelization is realized using only the OpenMP "section", "flush" and "critical" directives.)
[Figure: an OpenMP SECTIONS construct with one SECTION per thread group. The 1st layer uses centralized scheduling code for MT1_1, MT1_2 (DOALL), MT1_3 (SB) and MT1_4 (RB); the 2nd layer uses distributed scheduling code for the nested macro-tasks 1_3_1 ... 1_3_6 and 1_4_1 ... 1_4_4 on thread group 0 and thread group 1, with SYNC SEND / SYNC RECV between them.]
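A minimal hand-written C/OpenMP sketch of the code shape described above, using only the sections and flush directives named on the slide; the macro-task names, the completion flag and the scheduling bodies are placeholders, and the actual compiler-generated code differs.

#include <omp.h>
#include <stdio.h>

static volatile int mt1_2_done = 0;   /* simple completion flag between groups */

static void mt1_1(void) { /* ... first-layer macro-task ... */ }
static void mt1_2(void) { /* ... DOALL macro-task ... */ }
static void mt1_3(void) { /* ... nested MTs 1_3_1..1_3_6 scheduled here ... */ }
static void mt1_4(void) { /* ... nested MTs 1_4_1..1_4_4 scheduled here ... */ }

int main(void)
{
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section            /* thread group 0 */
        {
            mt1_1();
            mt1_2();
            mt1_2_done = 1;            /* "SYNC SEND" */
            #pragma omp flush(mt1_2_done)
            mt1_3();
        }
        #pragma omp section            /* thread group 1 */
        {
            #pragma omp flush(mt1_2_done)
            while (!mt1_2_done) {      /* busy-wait "SYNC RECV" on the flag */
                #pragma omp flush(mt1_2_done)
            }
            mt1_4();
        }
    }
    printf("done\n");
    return 0;
}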

Compilation Flow Using OSCAR API
Application program: Fortran or Parallelizable C (sequential program)
→ Waseda Univ. OSCAR Parallelizing Compiler:  coarse-grain task parallelization,  global data localization,  data transfer overlapping using DMA,  power reduction control using DVFS, clock gating and power gating
→ Parallelized Fortran or C program with the OSCAR API for real-time, low-power, high-performance multicores: Proc0 code with directives for thread 0, Proc1 code with directives for thread 1, ... (directives for thread generation, memory, data transfer using DMA, and power management)
→ For each vendor (Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC), an API analyzer in front of the existing sequential backend compiler, or an OpenMP compiler, generates machine code (executable code)
→ Executables for multicores from vendor A, vendor B, ... and for shared-memory servers
OSCAR: Optimally Scheduled Advanced Multiprocessor; API: Application Program Interface
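For reference, the input at the top of this flow is ordinary sequential code. Below is a tiny, made-up example of the loop-based C a parallelizing compiler can analyze; the slide does not spell out the exact restrictions of "Parallelizable C", so treat this only as an illustration of the style (simple array loops, no aliasing tricks).

#include <stdio.h>

#define N 1024

int main(void)
{
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) a[i] = i * 0.5;       /* init loop         */
    for (int i = 0; i < N; i++) b[i] = a[i] * 3.0;    /* independent loop  */
    double sum = 0.0;
    for (int i = 0; i < N; i++) {                     /* loop with a reduction */
        c[i] = a[i] + b[i];
        sum += c[i];
    }
    printf("sum = %f\n", sum);
    return 0;
}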

OSCAR API
http://www.kasahara.cs.waseda.ac.jp/api/regist_en.html
• Targeting mainly real-time consumer electronics devices
  – embedded computing
  – various kinds of memory architecture: SMP, local memory, distributed shared memory, ...
• Developed with 6 Japanese companies
  – Fujitsu, Hitachi, NEC, Toshiba, Panasonic, Renesas
  – supported by METI/NEDO
• Based on a subset of OpenMP
  – a very popular parallel processing API
  – shared memory programming model
• Six categories of directives
  – Parallel Execution (4 directives from OpenMP)
  – Memory Mapping (distributed shared memory, local memory)
  – Data Transfer Overlapping Using DMA Controller
  – Power Control (DVFS, clock gating, power gating)
  – Timer for Real-Time Control
  – Synchronization (hierarchical barrier synchronization)
Kasahara-Kimura Lab., Waseda University
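The memory-mapping category above places data into specific on-chip memories. The OSCAR API's own directive syntax is not reproduced here; the sketch below only conveys the idea using plain GCC section attributes, with section names that are purely hypothetical.

#include <stdint.h>

/* Hypothetical placement of buffers into on-chip memories via linker sections.
   This is ordinary GCC syntax, not the OSCAR API; it illustrates the intent of
   the memory-mapping directives, not their spelling. */
static int32_t coeff[256]     __attribute__((section(".local_data_memory")));
static int32_t frame_buf[4096] __attribute__((section(".distributed_shared_memory")));

int32_t dot_local(const int32_t *x)
{
    int32_t acc = 0;
    for (int i = 0; i < 256; i++)
        acc += coeff[i] * x[i];   /* coeff stays in the fast local memory */
    (void)frame_buf;              /* frame data would live in shared memory */
    return acc;
}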

Low-Power Optimization with OSCAR API
[Figure: scheduled result by the OSCAR compiler for two virtual cores (VC0, VC1) and the generated code image. Each core's generated main function (main_VC0(), main_VC1()) executes its assigned macro-tasks MT1-MT4; in the idle slots of the schedule the compiler inserts sleep and frequency/voltage control directives such as
  #pragma oscar fvcontrol \
      ((OSCAR_CPU(),0))
and
  #pragma oscar fvcontrol \
      (1,(OSCAR_CPU(),100))  ]

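A hand-written sketch of what such generated code could look like follows; only the pragma spelling is taken from the slide, while the macro-task assignment, the meaning of the fvcontrol arguments, and the wrapping main() are assumptions added so the sketch compiles stand-alone.

/* Hypothetical macro-task bodies. */
static void MT1(void) { /* ... */ }
static void MT2(void) { /* ... */ }
static void MT3(void) { /* ... */ }
static void MT4(void) { /* ... */ }

void main_VC0(void)
{
    MT1();
    /* idle slot in the schedule: assumed request to power this core down */
#pragma oscar fvcontrol \
    ((OSCAR_CPU(), 0))
    MT4();
}

void main_VC1(void)
{
    MT2();
    /* assumed request to restore 100% clock frequency before the next task */
#pragma oscar fvcontrol \
    (1, (OSCAR_CPU(), 100))
    MT3();
}

int main(void)
{
    /* On the real chip each function runs on its own core; they are called
       sequentially here only so the sketch compiles and links on any system. */
    main_VC0();
    main_VC1();
    return 0;
}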

Performance of OSCAR Compiler on an IBM p6 595 Power6 (4.2 GHz) based 32-core SMP Server
OpenMP codes generated by the OSCAR compiler run about 3.3 times faster on average than IBM XL Fortran for AIX Ver. 12.1 automatic parallelization.
Compile options:
(*1) Sequential: -O3 -qarch=pwr6; XLF: -O3 -qarch=pwr6 -qsmp=auto; OSCAR: -O3 -qarch=pwr6 -qsmp=noauto
(*2) Sequential: -O5 -q64 -qarch=pwr6; XLF: -O5 -q64 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -q64 -qarch=pwr6 -qsmp=noauto
(Others) Sequential: -O5 -qarch=pwr6; XLF: -O5 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -qarch=pwr6 -qsmp=noauto

Performance of OSCAR Compiler Using the Multicore API on Intel Quad-core Xeon
[Bar chart: speedup ratios of Intel Compiler Ver. 10.1 automatic parallelization vs. OSCAR on SPEC95 (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 (swim, mgrid, applu, apsi).]
• The OSCAR compiler gives 2.1 times speedup on average over the Intel Compiler Ver. 10.1.

Performance of OSCAR Compiler on a 16-core SGI Altix 450 Montvale Server
Compiler options for the Intel Compiler: automatic parallelization: -fast -parallel; OpenMP codes generated by OSCAR: -fast -openmp
[Bar chart: speedup ratios of Intel Compiler Ver. 10.1 vs. OSCAR on SPEC95 (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 (swim, mgrid, applu, apsi).]
• The OSCAR compiler gave 2.3 times speedup against the Intel Fortran Itanium Compiler revision 10.1.

Performance of OSCAR Compiler on NEC NaviEngine (ARM-NEC MPCore)
[Bar chart: speedup ratios of g77 vs. OSCAR for 1, 2 and 4 PEs on SPEC95 mgrid, su2cor and hydro2d.]
Compile option: -O3
• The OSCAR compiler gave 3.43 times speedup against 1 core on the ARM/NEC MPCore with four 400 MHz ARM cores.

Performance of OSCAR Compiler Using the Multicore API on Fujitsu FR1000 Multicore
[Bar chart: speedup ratios for 1, 2, 3 and 4 cores on MPEG2enc, MPEG2dec, MP3enc and JPEG 2000enc.]
• 3.38 times speedup on average for 4 cores against single-core execution.

Performance of OSCAR Compiler Using the Developed API on a 4-core (SH4A) OSCAR-Type Multicore
[Bar chart: speedup ratios for 1, 2, 3 and 4 cores on MPEG2enc, MPEG2dec, MP3enc and JPEG 2000enc.]
• 3.31 times speedup on average for 4 cores against 1 core.

Processing Performance on the Developed Multicore Using the Automatic Parallelizing Compiler
Speedup against single-core execution for audio AAC* encoding (*Advanced Audio Coding):
1 core: 1.0; 2 cores: 1.9; 4 cores: 3.6; 8 cores: 5.8

Power Reduction by OSCAR Parallelizing Compiler for Secure Audio Encoding
AAC encoding + AES encryption with 8 CPU cores
[Power waveforms: without power control (voltage 1.4 V) vs. with power control (frequency and voltage control, resume standby: power shutdown and voltage lowering 1.4 V → 1.0 V).]
Average power 5.68 W without control vs. 0.67 W with control: 88.3% power reduction.

Power Reduction by OSCAR Parallelizing Compiler for MPEG2 Decoding
MPEG2 decoding with 8 CPU cores
[Power waveforms: without power control (voltage 1.4 V) vs. with power control (frequency and voltage control, resume standby: power shutdown and voltage lowering 1.4 V → 1.0 V).]
Average power 5.73 W without control vs. 1.52 W with control: 73.5% power reduction.

Low-power, high-performance multicore computer with a solar panel: clean-energy autonomous servers operational in deserts.

OSCAR API-Applicable Heterogeneous Multicore Architecture
[Block diagram: general-purpose cores VC(0)...VC(n) / VPC(0)...VPC(n), each with a CPU, DTU, LPM, LDM, DSM, FVRs and a network interface; accelerator cores ACCa_0, ACCa_1, ..., ACCb_0, ... with DMACs in VC(n+1)...VC(n+m+l); an interconnection network; on-chip CSM and off-chip CSM behind a memory interface.]
LPM: local program memory; LDM: local data memory; DSM: distributed shared memory; CSM: centralized shared memory; DTU: data transfer unit; FVR: frequency/voltage control register

An Image of a Static Schedule for a Heterogeneous Multicore with Data Transfer Overlapping and Power Control
[Gantt-style schedule over time: macro-tasks on CPU and accelerator cores, DTU data transfers overlapped with computation, and low-power states in idle slots.]
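The data-transfer-overlapping idea in such a schedule can be sketched in plain C with a double buffer; dma_start_copy() and dma_wait() below are invented stand-ins for a DTU/DMA interface (not an OSCAR or Renesas API) and copy synchronously so the sketch runs anywhere.

#include <string.h>
#include <stdio.h>

#define BLK 256
static double input[4 * BLK];      /* data in off-chip shared memory          */
static double local_buf[2][BLK];   /* double buffer in on-chip local memory   */

/* Hypothetical asynchronous-copy interface; here it is synchronous. */
static void dma_start_copy(double *dst, const double *src, int n) { memcpy(dst, src, n * sizeof *dst); }
static void dma_wait(void) { /* a real DTU would be polled or raise an interrupt */ }

static double compute(const double *blk, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += blk[i];
    return s;
}

int main(void)
{
    for (int i = 0; i < 4 * BLK; i++) input[i] = i;

    double total = 0.0;
    dma_start_copy(local_buf[0], &input[0], BLK);          /* prefetch block 0      */
    for (int b = 0; b < 4; b++) {
        dma_wait();                                        /* block b has arrived   */
        if (b + 1 < 4)                                     /* overlap next transfer */
            dma_start_copy(local_buf[(b + 1) & 1], &input[(b + 1) * BLK], BLK);
        total += compute(local_buf[b & 1], BLK);           /* compute current block */
    }
    printf("total = %.0f\n", total);
    return 0;
}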

Heterogeneous Multicore RP-X, presented in the ISSCC 2010 Processors Session on Feb. 8, 2010 (Renesas, Hitachi, Tokyo Inst. of Tech. & Waseda Univ.)
[Block diagram: Cluster #0 and Cluster #1, each with four SH-4A (SH-X3) cores; each core has a CPU, FPU, I$, D$, ILM, DLM, URAM, DTU and CRU, with an SNC and L2C per cluster; accelerators FE #0-3 and MX2 #0-1, a VPU5, DMAC #0/#1 and DBSC #0/#1; on-chip buses SHwy#0 (address=40, data=128), SHwy#1 (address=40, data=128) and SHwy#2 (address=32, data=64); peripherals SHPB, PCI express, SATA, SPU2, LBSC and HPB.]

Parallel Processing Performance Using OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)
Speedups against a single SH processor; the CPU performs data transfers between SH and FE.
[Bar chart: 1SH: 1 (3.5 fps); 2SH: 2.29; 4SH: 3.09; 8SH: 5.4; 2SH+1FE: 18.85; 4SH+2FE: 26.71; 8SH+4FE: 32.65 (111 fps).]

Power Reduction in real-time execution controlled by OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)
[Power waveforms: without power reduction, average 1.76 W; with power reduction by the OSCAR compiler, average 0.54 W; about 70% power reduction. One cycle: 33 ms → 30 fps.]
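For a real-time workload like this (a 33 ms period per frame), the frequency/voltage decision can be pictured with a small sketch; the discrete frequency levels and the selection rule below are illustrative assumptions, not the OSCAR compiler's algorithm.

#include <stdio.h>

/* Hypothetical discrete frequency levels as fractions of the maximum clock. */
static const double levels[] = { 1.00, 0.75, 0.50, 0.25 };

/* Pick the lowest frequency that still finishes the task before its deadline.
   exec_ms_at_max: task time at full clock; slack_ms: time left in the 33 ms period. */
static double pick_level(double exec_ms_at_max, double slack_ms)
{
    double best = levels[0];
    for (int i = 0; i < 4; i++)
        if (exec_ms_at_max / levels[i] <= slack_ms)
            best = levels[i];          /* levels[] is sorted from high to low */
    return best;
}

int main(void)
{
    /* e.g. a macro-task that takes 10 ms at full speed, with 25 ms left in the frame */
    double f = pick_level(10.0, 25.0);
    printf("run at %.0f%% of max frequency\n", f * 100.0);  /* prints 50% here */
    return 0;
}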

Fujitsu’s Vector Manycore Design Evaluated in NEDO Manycore Leading Research, Feb. 2010


Green Computing Systems R&D Center, Waseda University (supported by METI; completed March 2011)
<R&D Target> Hardware, software and applications for super-low-power manycore processors: more than 64 cores, natural air cooling (no fan), cool, compact, clear and quiet, operable from a solar panel.
Servers: Hitachi SR16000 (Power7 128-core SMP), Fujitsu M9000 (SPARC64 VII 256-core SMP)
Participants: Hitachi, Fujitsu, NEC, Renesas, Olympus, Toyota, Denso, Mitsubishi, Toshiba, etc.
<Ripple Effect> Low CO2 (carbon dioxide) emissions; creation of value-added products: consumer electronics, automobiles, servers.
Location: beside the subway Waseda Station, near Waseda University's main campus.

Environment, Industry, Lives
Cancer Treatment: Carbon Ion Radiotherapy (National Institute of Radiological Sciences, NIRS)
• 5.6 times speedup with 8 processors on an Intel Quad-core Xeon 8-core SMP
• 50 times speedup with 64 processors on an IBM Power7 64-core SMP (Hitachi SR16000)

LCPC 2012

The 25th International Workshop on Languages and Compilers for Parallel Computing (Tentative) http://www.kasahara.cs.waseda.ac.jp/lcpc2012/

September 11 – 13, 2012 Green Computing Systems Research and Development Center, Waseda University, Tokyo, Japan

The LCPC workshop is a forum for sharing cutting‐edge research on all aspects of parallel languages, compilers and related topics including  runtime systems and tools. The scope of the workshop spans foundational results and practical experience, and all classes of parallel  processors including concurrent, multithreaded, multicore, accelerated, multiprocessor, and tightly‐clustered systems. Given the rise of  multicore processors, LCPC is particularly interested in work that seeks to transition parallel programming into the computing mainstream. 

General Chair: Hironori Kasahara, Waseda University
Program Chair: Keiji Kimura, Waseda University

Post-workshop optional tour on September 14: "K" Kobe Supercomputing Center

Conclusions  OSCAR compiler cooperative real-time low power multicore with high effective performance, short software development period will be important in wide range of IT systems from consumer electronics to automobiles, medical systems, disaster superrealtime simulator (Tsunami), and EX-FLOPS machines.  For industry A few minutes of compilation of C program using OSCAR Compiler and API without months of programming allows us  Scalable speedup on various multicores like Hitachi/ Renesas/Waseda 8 core homogeneous RP2 (8SH4A) & 15 core heterogeneous RPX , NEC/ARM MPCore, Fujitsu FR1000, Intel Quadcore Xeon and IBM Power 7etc.  70% automatic power reduction on RP2 and RPX for realtime media processing for the first time in the world.  OSCAR green compiler, API, multicores and manycores will be continuously developed for saving lives from natural disasters and sickness like cancer in addition to the consumer electronics and scientific applications. 46
