
Multi-core Parallelizing Compiler for Low Power High Performance Computing

Hironori Kasahara
Head, Department of Computer Science
Director, Advanced Chip-Multiprocessor Research Institute
Waseda University, Tokyo, Japan
http://www.kasahara.cs.waseda.ac.jp


Hironori Kasahara


<Personal History> B.S. (1980, Waseda), M.S. (1982, Waseda), Ph.D. (1985, EE, Waseda). Research Associate (1983, Waseda), JSPS Special Research Fellow (1985), Visiting Scholar (1985, Univ. of California at Berkeley), Assistant Prof. (1986, Waseda), Associate Prof. (1988, Waseda), Visiting Research Scholar (1989-1990, Center for Supercomputing R&D, Univ. of Illinois at Urbana-Champaign), Prof. (1997-, Dept. of CS, Waseda). IFAC World Congress Young Author Prize (1987), IPSJ Sakai Memorial Special Award (1997), STARC Industry-Academia Cooperative Research Award (2004).

<Activities for Societies> IPSJ: SIG Computer Architecture (Chair), Trans. of IPSJ Editorial Board (HG Chair), Journal of IPSJ Editorial Board (HWG Chair), 2001 Journal of IPSJ Special Issue on Parallel Processing (Chair of Editorial Board / Guest Editor), JSPP2000 (Program Chair), etc. ACM: International Conference on Supercomputing (ICS) Program Committee (Co-Program Chair of the '96 ENIAC 50th Anniversary conference). IEEE: Computer Society Japan Chapter Chair, Tokyo Section Board Member, SC07 PC. Other: PCs of many conferences on supercomputing and parallel processing. METI: IT Policy Proposal Forum (Architecture/HPC WG Chair), Super Advanced Electronic Basis Technology Investigation Committee. NEDO: Millennium Project IT21 "Advanced Parallelizing Compiler" (Project Leader), Computer Strategy WG (Chair), Multicore for Realtime Consumer Electronics (Project Leader), etc. MEXT: Earth Simulator project evaluation committee. JAERI: Research accomplishment evaluation committee, CCSE 1st class invited researcher. JST: Scientific Research Fund Sub-Committee, COINS Steering Committee, Precursory Research for Embryonic Science and Technology (Research Area Adviser). Cabinet Office: CSTP Expert Panel on Basic Policy, Information & Communication Field Promotion Strategy, R&D Infrastructure WG, Software & Security WG.

<Papers> 151 papers with review, 20 symposium papers with review, 105 technical reports, 154 papers for annual conventions, 49 invited talks, 74 articles in newspapers and on the web, etc.

Multi-core Everywhere


Multi-core from embedded systems to supercomputers:
• Consumer Electronics (Embedded): mobile phones, games, digital TV, car navigation, DVD, cameras — IBM/Sony/Toshiba Cell, Fujitsu FR1000, NEC/ARM MPCore & MP211, Panasonic UniPhier, Renesas SH multi-core (RP1). (The IBM Power4 was the first chip multiprocessor (CMP), with 2 cores on a chip.)
• PCs, Servers: Intel Dual-Core Xeon, Core 2 Duo, Montecito; AMD Quad- and Dual-Core Opteron
• WSs, Deskside & High-end Servers: IBM Power4/5/5+, pSeries 690 (32-way), p5 550Q (8-way), Sun Niagara (SPARC T1, T2), SGI Altix 350
• Supercomputers: Earth Simulator (40 TFLOPS, 2002, multiprocessor supercomputer with 5120 vector processors), IBM Blue Gene/L (360 TFLOPS, 2005, low-power CMP based, 128K processor chips)

High-quality application software, productivity, cost performance, and low power consumption are all important (e.g., mobile phones, games). Compiler-cooperative multi-core processors are promising for realizing these features.


IBM BlueGene/L


Roadmap of compiler cooperative multicore projects (2000-2010)

• Millennium Project IT21 / NEDO Advanced Parallelizing Compiler (Waseda Univ., Fujitsu, Hitachi, JIPDEC, AIST): compiler development for multiprocessor servers in the early 2000s, with the results later applied to supercomputer compilers.
• STARC Compiler Cooperative Chip Multiprocessor: basic research followed by practical research with Waseda and STARC member companies (Waseda, Toshiba, Fujitsu, Panasonic, Sony; later Waseda Univ., Fujitsu, NEC, Toshiba, Panasonic, Sony and Waseda, Fujitsu, NEC, Toshiba). (STARC: Semiconductor Technology Academic Research Center — Fujitsu, Toshiba, NEC, Renesas, Panasonic, Sony, etc.)
• NEDO Heterogeneous Multiprocessor (Waseda Univ., Hitachi): architecture and compiler planning, R&D, and practical use.
• NEDO Multicore Technology for Realtime Consumer Electronics (Waseda Univ., Hitachi, Renesas, Fujitsu, NEC, Toshiba, Panasonic): power-saving multicore architecture, parallelizing compiler, and API — soft realtime multicore architecture/compiler/API planning, R&D, and practical use toward the end of the decade.


METI/NEDO National Project

Multi-core for Real-time Consumer Electronics: R&D of compiler cooperative multicore processor technology for consumer electronics such as mobile phones, games, DVD players, digital TVs, and car navigation systems. From July 2005 to March 2008.

METI/NEDO multicore processor for real-time consumer electronics (new multicore processor, 2005.7-2008.3; participants: Hitachi, Fujitsu, Renesas, Toshiba, Matsushita (Panasonic), NEC). API: Application Programming Interface.

Architecture figure: each multicore chip CMP 0 ... CMP m contains processor cores PC 0 ... PC n — each core with a CPU, local program memory / instruction cache (LPM/I-cache), local data memory / L1 data cache (LDM/D-cache), distributed shared memory (DSM), data transfer controller (DTC), and network interface (NI) — connected by an intra-chip connection network (multiple buses, crossbar, etc.) to a centralized shared memory or L2 cache (CSM/L2) and I/O devices. Chips, centralized shared memories (CSM j), and multicore I/O chips (CSP k) are connected by an inter-chip connection network (multiple buses, crossbar, multistage network, etc.). The developed multicore chips are aimed at consumer electronics such as integrated automotive ECUs.

Goals noted in the figure:
• High performance and good cost performance
• Low power consumption
• Short hardware and software development periods
• Applications shared among chips; high reliability
• Scalable performance improvement with the advancement of semiconductor integration
• Use of the same parallelizing compiler for multi-cores from different vendors through a newly developed API


NEDO Multicore Technology for Realtime Consumer Electronics: R&D Organization (2005.7-2008.3)

Integrated R&D Steering Committee — Chair: Hironori Kasahara (Waseda Univ.), Project Leader; Sub-chair: Kunio Uchiyama (Hitachi), Project Sub-leader

Grant: Multicore Architecture and Compiler Research and Development — Project Leader: Prof. Hironori Kasahara, Waseda Univ.
 - R&D Steering Committee
 - Architecture & Compiler R&D Group — Group Leader: Associate Prof. Keiji Kimura, Waseda Univ. (outsourcing: Fujitsu)
 - Standard Multicore Architecture & API Committee — Chair: Prof. Hironori Kasahara, Waseda Univ.

Subsidy: Evaluation Environment for Multicore Technology — Project Sub-leader: Dr. Kunio Uchiyama, Chief Researcher, Hitachi
 - Test Chip Development Group — Group Leader: Renesas Technology
 - Evaluation System Development Group — Group Leader: Hitachi

Participating companies: Hitachi, Fujitsu, Toshiba, NEC, Panasonic, Renesas Technology


MPCore™ (ARM and NEC collaboration)

Block diagram summary: 1 to 4 symmetric CPU/VFP cores, each with its own L1 memory and CPU interrupt interface (IRQ); per-CPU aliased peripherals (timer and watchdog); an Interrupt Distributor with a configurable number of hardware interrupt lines and private FIQ lines; a Private Peripheral Bus; a Snoop Control Unit (SCU) with duplicated L1 tags providing instruction and data coherence over a 64-bit bus; a primary and an optional second 64-bit AXI read/write bus; and an optional L2 cache controller (L220).


Fujitsu FR-1000 Multicore Processor

The FR-V multi-core processor integrates four FR550 VLIW cores connected by a crossbar switch, together with DMA engines (DMA-I, DMA-E), a local bus interface, memory controllers, and a fast I/O bus to an I/O chip (memory bus: 64 bit x 2 ch / 266 MHz; system bus: 64 bit / 178 MHz). Each FR550 core issues up to eight instructions per cycle — instructions 0-3 to the integer operation unit and instructions 4-7 to the media operation unit — with GR and FR register files and 32 KB instruction and 32 KB data caches.


CELL Processor Overview

• Power Processor Element (PPE): the Power core runs the OS and control tasks; 2-way multi-threaded; a 64-bit Power Architecture core with VMX for conventional processing; 32 KB + 32 KB L1 and 512 KB L2 caches.
• Synergistic Processor Elements (SPE): 8 SPEs offer high performance (high Flops/Watt); dual-issue RISC architecture; 128-bit SIMD (16-way); 128 x 128-bit general registers; 256 KB local store (LS); dedicated DMA engines.
• Element Interconnect Bus (EIB): up to 96 B/cycle, with 16 B/cycle ports to each SPE local store, the PPE, the memory interface controller (MIC, dual XDR memory), and the bus interface controller (BIC, RRAC I/O).


1987 OSCAR (Optimally Scheduled Advanced Multiprocessor)


OSCAR (Optimally Scheduled Advanced Multiprocessor)

System organization (figure): a host computer and a control & I/O processor (a RISC processor plus an I/O processor); three centralized shared memories CSM1-CSM3, each with three banks that are simultaneously readable at the same address; and 16 processor elements PE1-PE16, grouped into 5-PE clusters (SPC1-SPC3) and 8-PE processor clusters (LPC1, LPC2), connected through buses with a read & write request arbitrator and bus interfaces. Each PE is a 5 MFLOPS, 32-bit RISC processor (64 registers) with two banks of program memory, data memory, stack memory, a dual-port distributed shared memory, and a DMA controller; the PEs marked CP act as control processors.


OSCAR PE (Processor Element)

Each PE connects to the system bus through a bus interface and has two local buses plus an instruction bus. Components: DMA controller (DMA), local program memory (LPM, 128 KW x 2 banks), instruction control unit (INSC), distributed shared memory (DSM, 2 KW), local stack memory (LSM, 4 KW), local data memory (LDM, 256 KW), data path (DP), integer processing unit (IPU), floating point processing unit (FPU), and register file (REG, 64 registers).


1987 OSCAR PE Board


OSCAR Memory Space (approximate map recovered from the figure)

System memory space (00000000-FFFFFFFF):
 - 00000000-: distributed shared memories of PE15 ... PE0 and a broadcast area (up to about 00200000)
 - 01000000-: CP (control processor)
 - 02000000-: local data memories of PE0 (02000000), PE1 (02100000), ..., PE15 (02F00000)
 - 03000000-: CSM1, 2, 3 (centralized shared memory)
 - 03400000-FFFFFFFF and the remaining gaps: not used

Local memory space (00000000-FFFFFFFF):
 - 00000000-00000800: DSM (distributed shared memory), followed by an undefined local memory area
 - 00010000-00030000: LPM bank 0 and bank 1 (local program memory)
 - 00040000-00080000: LDM (local data memory)
 - 000F0000-00100000: control registers
 - 00100000-: system accessing area
 - remaining regions: not used


OSCAR Multi-Core Architecture

Each chip multiprocessor CMP 0 ... CMP m contains processor elements PE 0 ... PE n, a centralized shared memory or L2 cache (CSM / L2 cache), and an intra-chip connection network (multiple buses, crossbar, etc.). Each PE has a CPU, local program memory or I-cache (LPM), local data memory or D-cache (LDM), distributed shared memory (DSM), data transfer controller (DTC), network interface, and frequency/voltage control registers (FVR, provided per core, per chip, and for the interconnect). Chips, centralized shared memories (CSM j), I/O devices, and I/O chip multiprocessors (CMP k) are connected by an inter-chip connection network (crossbar, buses, multistage network, etc.).

LDM: local data memory; CSM: centralized shared memory; DSM: distributed shared memory; LPM: local program memory; DTC: data transfer controller; FVR: frequency / voltage control register.


API and Parallelizing Compiler in the METI/NEDO Advanced Multicore for Realtime Consumer Electronics Project

Flow: a sequential application program (a subset of the C language) — for example real-time consumer electronics applications such as image, secure, and audio streaming codes — is parallelized by the Waseda OSCAR compiler, which schedules macro-tasks onto processors (e.g., T1 and T4 on Proc0, T2 and T6 on Proc1, T3 on Proc2, with data transfers performed by the DTC/DMAC and idle intervals handled by stop or slow modes). The compiler emits parallel code annotated with an API that specifies data assignment, data transfer, and power reduction control. For each vendor chip, a backend consisting of an API decoder and the vendor's sequential compiler translates this parallel code into executable machine code (e.g., for the SH multicore, FR-V, or CELL).


OSCAR Parallelizing Compiler

• Improve effective performance, cost performance, and productivity, and reduce power consumption
 - Multigrain Parallelization: exploitation of parallelism from the whole program, using coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism
 - Data Localization: automatic data distribution for distributed shared memory, cache, and local memory on multiprocessor systems
 - Data Transfer Overlapping: hiding data transfer overhead by overlapping task execution and data transfer using DMA or data pre-fetching (see the double-buffering sketch after this list)
 - Power Reduction: reducing power consumption by compiler control of frequency, voltage, and power shutdown with hardware support
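To make the data-transfer-overlapping idea concrete, here is a minimal double-buffering sketch in C. It is not OSCAR compiler output; the dma_start/dma_wait calls stand in for whatever DMA or DTC interface the target chip provides and are assumptions for illustration only.

    #include <stddef.h>

    #define CHUNK 1024

    /* Hypothetical asynchronous DMA interface (placeholder names, not an
     * actual OSCAR or vendor API). */
    extern void dma_start(float *dst, const float *src, size_t n); /* begin copy  */
    extern void dma_wait(void);                                    /* wait for it */

    /* Process a large array in CHUNK-sized pieces while the DMA engine
     * fetches the next chunk into the other local buffer. */
    void process_stream(const float *global_src, float *global_dst, size_t n_chunks)
    {
        static float local[2][CHUNK];              /* double buffer in local memory */
        int cur = 0;

        dma_start(local[cur], global_src, CHUNK);  /* prefetch first chunk */
        for (size_t i = 0; i < n_chunks; i++) {
            dma_wait();                            /* chunk i is now in local[cur] */
            int nxt = 1 - cur;
            if (i + 1 < n_chunks)                  /* overlap: fetch chunk i+1 ...  */
                dma_start(local[nxt], global_src + (i + 1) * CHUNK, CHUNK);

            for (int j = 0; j < CHUNK; j++)        /* ... while computing on chunk i */
                global_dst[i * CHUNK + j] = 2.0f * local[cur][j] + 1.0f;

            cur = nxt;
        }
    }

Only one DMA transfer is outstanding at a time, so the single dma_wait suffices; the point is simply that the transfer of chunk i+1 proceeds while chunk i is being computed on.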


OSCAR Multigrain Parallelizing Compiler

• Automatic Parallelization
 - Multigrain Parallel Processing
 - Data Localization
 - Data Transfer Overlapping
 - Compiler-Controlled Power Saving Scheme
• Compiler cooperative multi-core architectures
 - OSCAR Multi-core Architecture
 - OSCAR Heterogeneous Multiprocessor Architecture
• Commercial SMP machines


METI/NEDO Advanced Parallelizing Compiler (APC) Technology Project: Background and Problems
Millennium Project IT21, 2000.9.8-2003.3.31; Waseda Univ., Fujitsu, Hitachi, AIST

Background:
① Adoption of parallel processing as a core technology from PCs to HPC
② Increasing importance of software in IT
③ Need to improve cost performance and usability

Contents of research and development:
① R&D of advanced parallelizing compiler technology: multigrain parallelization, data localization, overhead hiding
② R&D of performance evaluation technology for parallelizing compilers
Goal: double the effective performance.
(Background chart: hardware peak performance of HPC has grown from roughly 1 GFLOPS around 1980 toward 1 TFLOPS, while effective performance has grown more slowly, so the gap between theoretical maximum and effective performance keeps expanding; the APC project targets improvements in effective performance, cost performance, and ease of use.)

Ripple effects:
① Development of competitive next-generation PCs and HPC
② Putting innovative automatic parallelizing compiler technology into practical use
③ Development and market acquisition of future single-chip multiprocessors
④ Boosting R&D in many fields: IT, biotechnology, devices, earth environment, next-generation VLSI design, financial engineering, weather forecasting, new clean energy, space development, automobiles, electronic commerce, etc.


Generation of Coarse Grain Tasks

Macro-tasks (MTs):
• Block of Pseudo Assignments (BPA): basic block (BB)
• Repetition Block (RB): outermost natural loop
• Subroutine Block (SB): subroutine

The whole program (total system) is decomposed hierarchically: the 1st layer consists of BPA, RB, and SB macro-tasks; inside each RB or SB a 2nd layer of BPA/RB/SB macro-tasks is generated, and so on into a 3rd layer. BPAs are parallelized at the near-fine-grain level; RBs by loop-level parallelization, near-fine-grain parallelization of the loop body, or coarse-grain parallelization of the body; SBs by coarse-grain parallelization. A minimal sketch of this decomposition on a small C program follows.
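The following small C fragment is an assumed example (not taken from the project's benchmarks) of how a program body maps onto macro-tasks: straight-line statement groups become BPAs, outermost loops become RBs, and calls to subroutines become SBs, each of which is decomposed again at the next layer.

    /* Hypothetical program annotated with the macro-task kinds an
     * OSCAR-style front end would generate at the 1st layer. */
    #define N 1024

    double a[N], b[N], c[N];

    static double reduce_sum(const double *x, int n)   /* body forms 2nd-layer MTs */
    {
        double s = 0.0;                                /* BPA (basic block)        */
        for (int i = 0; i < n; i++)                    /* RB  (outermost loop)     */
            s += x[i];
        return s;                                      /* BPA                      */
    }

    double kernel(void)
    {
        /* MT1: BPA - a block of assignment statements (basic block) */
        double alpha = 2.5;
        double beta  = 0.5;

        /* MT2: RB - outermost natural loop (loop-level / near-fine-grain) */
        for (int i = 0; i < N; i++)
            b[i] = alpha * a[i];

        /* MT3: RB - another outermost loop, data dependent on MT2 */
        for (int i = 0; i < N; i++)
            c[i] = b[i] + beta;

        /* MT4: SB - subroutine call, parallelized by decomposing its body */
        return reduce_sum(c, N);
    }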


Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)

A Macro Flow Graph represents the macro-tasks of one layer (BPA: block of pseudo assignment statements, RB: repetition block) together with their data dependences, control flow, and conditional branches. From it the compiler derives a Macro Task Graph, whose edges are data dependences and extended control dependences combined with AND/OR conditions at conditional branches; original control flow is kept only for reference. The earliest executable condition of a macro-task states when it may start: as soon as the conditional branches it is control dependent on have been resolved in its favor, and every data-dependent predecessor has either completed or been determined not to execute. (Figure: a macro flow graph of macro-tasks 1-15 plus END, and the corresponding macro task graph.)
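A minimal sketch of how a dynamic scheduler could test the earliest executable condition at run time is shown below; the data structures and the three-valued branch state are assumptions for illustration, not the compiler's actual internal representation.

    #include <stdbool.h>

    /* Outcome of the conditional branch that decides whether a macro-task runs. */
    typedef enum { BR_UNDECIDED, BR_WILL_RUN, BR_WILL_NOT_RUN } branch_state;

    typedef struct macro_task {
        bool                 finished;       /* task has completed execution        */
        branch_state         decided;        /* control dependence resolved?        */
        struct macro_task  **data_preds;     /* tasks this one is data dependent on */
        int                  n_data_preds;
    } macro_task;

    /* Earliest executable condition: the task is known to execute, and every
     * data predecessor has either finished or is known not to execute (so that
     * data dependence can never be satisfied and is dropped). */
    bool earliest_executable(const macro_task *mt)
    {
        if (mt->decided != BR_WILL_RUN)
            return false;
        for (int i = 0; i < mt->n_data_preds; i++) {
            const macro_task *p = mt->data_preds[i];
            if (!(p->finished || p->decided == BR_WILL_NOT_RUN))
                return false;
        }
        return true;
    }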


Automatic processor assignment in 103.su2cor
• Using 14 processors
 - Coarse grain parallelization within DO400 of subroutine LOOPS


MTG of Su2cor LOOPS DO400
• Coarse grain parallelism: PARA_ALD = 4.3
(Figure: the macro task graph, whose nodes are DOALL loops, sequential loops, subroutine blocks (SB), and basic blocks (BB).)


Data-Localization: Loop Aligned Decomposition

• Decompose multiple loops (DOALL and sequential) into LRs and CARs, considering inter-loop data dependences.
 - LR: Localizable Region, CAR: Commonly Accessed Region
 - Most data in an LR can be passed through local memory (LM).

Example (three loops over a 101-element array):

      C     RB1 (Doall)
            DO I=1,101
              A(I)=2*I
            ENDDO
      C     RB2 (Doseq)
            DO I=1,100
              B(I)=B(I-1)+A(I)+A(I+1)
            ENDDO
      C     RB3 (Doall)
            DO I=2,100
              C(I)=B(I)+B(I-1)
            ENDDO

Aligned decomposition (for three processor groups): RB1 is split into LR I=1-33, CAR I=34-35, LR I=36-66, CAR I=67-68, LR I=69-101; RB2 into I=1-33, I=34-34, I=35-66, I=67-67, I=68-100; and RB3 into I=2-34, I=35-67, I=68-100, so that each LR chain can keep its data in one processor's local memory.
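As an illustration only (a hand-written C sketch of the idea under the decomposition above, not the compiler's actual code generation), the macro-tasks assigned to the processor that owns the first LR chain could look as follows; the assignment of the CAR iterations, the barrier, and the initial value of B(0) are assumptions.

    /* One processor's share of the aligned decomposition above (LR chain
     * RB1: I=1..33, RB2: I=1..33, RB3: I=2..34), written with 1-based
     * indexing emulated by over-allocating by one element. */
    #define N 101

    extern void barrier_sync(void);              /* assumed runtime barrier */

    void pe0_lr_chain(double B_shared[N + 1], double C_shared[N + 1])
    {
        double A_loc[N + 2], B_loc[N + 1];       /* kept in local data memory */

        for (int i = 1; i <= 35; i++)            /* RB1: LR I=1..33 plus the CAR
                                                    iterations 34..35 duplicated here */
            A_loc[i] = 2.0 * i;

        B_loc[0] = 0.0;                          /* assumed initial B(0) */
        for (int i = 1; i <= 33; i++)            /* RB2: LR I=1..33 */
            B_loc[i] = B_loc[i - 1] + A_loc[i] + A_loc[i + 1];
        B_shared[33] = B_loc[33];                /* boundary value for the CAR iteration */

        barrier_sync();                          /* wait for CAR iterations done elsewhere */

        for (int i = 2; i <= 33; i++)            /* RB3: LR part using only local B */
            C_shared[i] = B_loc[i] + B_loc[i - 1];
        C_shared[34] = B_shared[34] + B_loc[33]; /* I=34 needs B(34) produced by the CAR */
    }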


Inter-loop Data Dependence Analysis in TLG (Target Loop Group)

• Define the exit RB in the TLG as the standard loop.
• Find the iterations on which an iteration of the standard loop is data dependent.
 - e.g., the K-th iteration of RB3 is data dependent on the (K-1)-th and K-th iterations of RB2, and on the (K-1)-th, K-th, and (K+1)-th iterations of RB1.

(Figure: example TLG for the RB1/RB2/RB3 loops above, showing the dependence arrows between the iteration spaces I(RB1), I(RB2), and I(RB3).)


Decomposition of RBs in TLG

• Decompose the GCIR into DGCIRp (1 ≤ p ≤ n), where n is (a multiple of) the number of processor groups and DGCIR is a decomposed GCIR.
• Generate a CAR for iterations on which both DGCIRp and DGCIRp+1 are data dependent.
• Generate an LR for iterations on which only DGCIRp is data dependent.

(Figure: RB1 is decomposed into RB11/RB12/RB13, RB2 into RB21/RB22/RB23, and RB3 into RB31/RB32/RB33, aligned with DGCIR1, DGCIR2, and DGCIR3, which each cover about one third of the iteration spaces I(RB1), I(RB2), I(RB3) of the GCIR, with boundaries around I=34/35 and I=67/68 matching the aligned decomposition shown earlier.)


Data Localization: Macro Task Graph Division and Scheduling

(Figure: an original MTG, the MTG after division into data localization groups dlg0-dlg3, and a schedule of the divided macro-tasks for two processors (PE0 and PE1) in which the macro-tasks of each data localization group are executed consecutively on the same processor so that they can pass data through its local memory or cache.)


An Example of Data Localization for SPEC95 Swim

(a) An example of a target loop group for data localization:

          DO 200 J=1,N
          DO 200 I=1,M
            UNEW(I+1,J) = UOLD(I+1,J)+
         1    TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
         2    +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
            VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
         1    *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
         2    -TDTSDY*(H(I,J+1)-H(I,J))
            PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
         1    -TDTSDY*(CV(I,J+1)-CV(I,J))
      200 CONTINUE

          DO 210 J=1,N
            UNEW(1,J) = UNEW(M+1,J)
            VNEW(M+1,J+1) = VNEW(1,J+1)
            PNEW(M+1,J) = PNEW(1,J)
      210 CONTINUE

          DO 300 J=1,N
          DO 300 I=1,M
            UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
            VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
            POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
      300 CONTINUE

(b) Image of the alignment on a 4 MB cache of the arrays accessed by the target loops: the thirteen arrays (U, V, P, UNEW, VNEW, PNEW, UOLD, VOLD, POLD, CU, CV, Z, H) wrap around the 4 MB cache, so arrays that fall on the same cache locations (e.g., UN/VO/Z, VN/PO/H, PN/CU, UO/CV) overlap. Cache line conflicts occur among arrays which share the same location on the cache.


Data Layout for Removing Line Conflict Misses by Array Dimension Padding

Declaration part of the arrays in SPEC95 swim:

  Before padding:
          PARAMETER (N1=513, N2=513)
          COMMON  U(N1,N2), V(N1,N2), P(N1,N2),
         *        UNEW(N1,N2), VNEW(N1,N2),
         1        PNEW(N1,N2), UOLD(N1,N2),
         *        VOLD(N1,N2), POLD(N1,N2),
         2        CU(N1,N2), CV(N1,N2),
         *        Z(N1,N2), H(N1,N2)

  After padding:
          PARAMETER (N1=513, N2=544)
          (COMMON declaration identical to the one before padding)

Padding the second dimension from 513 to 544 changes the starting addresses of the arrays relative to the 4 MB cache, so the access ranges of a data localization group (the box marks the access range of DLG0) no longer map onto the same cache lines, removing the line conflict misses.
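The same trick in C (a generic illustration, not code from swim; the sizes and pad amount below are arbitrary): enlarging one array dimension staggers where successive arrays fall in a physically indexed cache, so the regions a loop touches stop evicting each other.

    /* Generic illustration of array-dimension padding against cache line
     * conflicts (sizes are examples, not the swim parameters). */
    #define N1  513
    #define PAD 31                      /* chosen so the arrays no longer share sets */
    #define N2  (N1 + PAD)

    /* Without padding, u, v and p would start 513*513*8 bytes apart, which can
     * map the same (i,j) elements of all three arrays onto the same cache sets.
     * The padded dimension staggers their starting addresses. */
    static double u[N2][N1], v[N2][N1], p[N2][N1];

    void relax(void)
    {
        for (int j = 1; j < N1 - 1; j++)
            for (int i = 1; i < N1 - 1; i++)
                p[j][i] = 0.5 * (u[j][i] + v[j][i]);   /* three streams, fewer conflicts */
    }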

APC Compiler Organization

The APC multigrain parallelizing compiler takes a Fortran source program through data dependence analysis (inter-procedural, conditional), multigrain parallelization (hierarchical parallelism, affine partitioning), automatic data distribution (cache optimization, DSM optimization), scheduling (static and dynamic), speculative execution support (coarse grain, medium grain, architecture support), and OpenMP code generation. A parallelizing tuning system provides program visualization, techniques for profiling and utilizing runtime information, and feedback-directed selection of compiler directives. The generated OpenMP code is compiled by native compilers (e.g., Sun Forte 6 Update 2, IBM XL Fortran Ver. 7.1 / Ver. 8.1) for a variety of shared memory parallel machines (SUN Ultra 80, IBM RS6000, IBM pSeries 690, etc.).


Image of Generated OpenMP Code for Hierarchical Multigrain Parallel Processing

(Figure: threads T0-T7 are created once. The 1st-layer macro-tasks MT1_1, MT1_2 (a DOALL), MT1_3 (an SB), and MT1_4 (an RB) are executed by a combination of centralized and distributed scheduling code expressed with OpenMP SECTIONS/SECTION. Thread group 0 (T0-T3) and thread group 1 (T4-T7) each execute the 2nd-layer macro-tasks — 1_3_1 ... 1_3_6 inside MT1_3 and 1_4_1 ... 1_4_4 inside MT1_4 — synchronizing between groups through SEND/RECV-style flag operations.)
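A minimal OpenMP/C sketch of this code-generation style is given below; it is hand-written for illustration (the actual compiler emits OpenMP Fortran with a more elaborate scheduling and synchronization structure, and the mt1_* names are placeholders). Threads are created once, each SECTION holds one thread's static schedule, and the threads are grouped into two thread groups that execute the 2nd-layer macro-tasks of MT1_3 and MT1_4.

    #include <omp.h>

    /* Hypothetical macro-task bodies (placeholders, not generated names). */
    extern void mt1_1(void);
    extern void mt1_3_sub(int id);   /* 2nd-layer MTs 1_3_1 .. 1_3_6 */
    extern void mt1_4_sub(int id);   /* 2nd-layer MTs 1_4_1 .. 1_4_4 */

    void layer1(void)
    {
        /* Threads are created once; each SECTION is the static schedule of one
         * thread.  Threads 0-1 form thread group 0 (MT1_1 and MT1_3's 2nd layer),
         * threads 2-3 form thread group 1 (MT1_4's 2nd layer). */
        #pragma omp parallel sections num_threads(4)
        {
            #pragma omp section                  /* thread group 0, thread 0 */
            { mt1_1(); mt1_3_sub(1); mt1_3_sub(3); mt1_3_sub(5); }

            #pragma omp section                  /* thread group 0, thread 1 */
            { mt1_3_sub(2); mt1_3_sub(4); mt1_3_sub(6); }

            #pragma omp section                  /* thread group 1, thread 0 */
            { mt1_4_sub(1); mt1_4_sub(3); }

            #pragma omp section                  /* thread group 1, thread 1 */
            { mt1_4_sub(2); mt1_4_sub(4); }
        }   /* implicit barrier; flag-based SEND/RECV sync between groups omitted */
    }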


IBM pSeries 690 Regatta-H

• Up to 16 Power4 chips: 32-way SMP server
 - L1(D): 64 KB (32 KB/processor, 2-way assoc.); L1(I): 128 KB (64 KB/processor, direct mapped)
 - L2: 1.5 MB (cache shared by the 2 processors on a chip, 4-8 way assoc.)
 - L3: 32 MB (external, 8-way assoc.) x 8 = 256 MB

(Figure: four 8-way multi-chip modules (MCM 0-3), each carrying four dual-core Power4 chips with shared L2 caches, attached L3 modules, GX buses/slots, and memory slots, assembled into a 32-way pSeries 690.)

Source: "IBM eServer pSeries 690: Configuring for Performance",
http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/p690_config.html


Performance of APC Compiler on IBM pSeries 690 16-Processor High-end Server

• IBM XL Fortran for AIX Version 8.1
 - Sequential execution: -O5 -qarch=pwr4
 - Automatic loop parallelization: -O5 -qsmp=auto -qarch=pwr4
 - OSCAR compiler: -O5 -qsmp=noauto -qarch=pwr4 (su2cor: -O4 -qstrict)

(Figure: speedup ratios over sequential execution for SPEC95fp programs — tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 — and SPEC2000fp programs — wupwise, swim, mgrid, applu, sixtrack, apsi — comparing the maximum speedups of XL Fortran automatic parallelization ("XL Fortran(max)") and of the APC compiler ("APC(max)"); each bar is annotated with the corresponding execution time in seconds, and the APC-generated code reaches speedups of up to about 10x on some codes.)


Performance of Multigrain Parallel Processing for 102.swim on IBM pSeries 690

(Figure: speedup ratio versus number of processors (1-16) for XLF automatic parallelization ("XLF(AUTO)") and the OSCAR compiler; the OSCAR-compiled code continues to scale with processor count, far above the XLF automatic parallelization curve.)


Performance on IBM pSeries 690: Power4 24-processor SMP server

(Figure: speedup ratios of XL Fortran Ver. 8.1 automatic parallelization and the OSCAR compiler for SPEC95 programs — tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 — and SPEC2000 programs — swim, mgrid, applu, apsi.)

• The OSCAR compiler gave a 4.82 times speedup against XL Fortran Ver. 8.1.


SGI Altix 350: 16 Itanium 2 (1.3 GHz) SMP server; L1 16 KB (4-way), L2 256 KB (8-way), L3 3 MB (12-way)

2.4 times speedup against Intel Fortran 7.1, using the OpenMP code generated for the IBM p690.


Sun Fire V880

• UltraSPARC III 8-processor SMP server
 - L1 (Data): 64 KB (4-way assoc.); L1 (Inst.): 32 KB (4-way assoc.)
 - Prefetch cache: 2 KB (4-way assoc.)
 - L2: 8 MB / processor (2-way assoc.)

Source: "The Sun Fire V880 Server Architecture", http://www.sun.com/servers/entry/880/


Performance of the OSCAR Multigrain Parallelizing Compiler on the Sun Fire V880 (8 UltraSPARC III Cu, 1050 MHz) with Data Localization and Cache Prefetching

1.9 times speedup against the Sun Forte 7.1 compiler.


IBM p5 550Q server (1.5 or 1.65 GHz)

(Figure: two quad-core modules built from dual-core POWER5+ chips; each chip has a 1.9 MB L2 cache, a 36 MB L3 cache, a memory controller to 1-64 GB of DDR2 533 MHz DIMMs, and a distributed switch on the SMP bus, with bus bandwidths of 105.6 GB/s and 42.2 GB/s shown in the figure. GX buses (4.4 GB/s each, one on the second processor card) connect through a 4.0 GB/s I/O hub to four I/O drawers with four PCI-X and one PCI-X DDR slots; GX adapters share PCI-X slots 4 and 5.)


Performance on IBM p5 550: POWER5+ 8-processor SMP server

(Figure: speedup ratios for SPEC95 programs — tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 — and SPEC2000 programs — swim, mgrid, applu, apsi — comparing XL Fortran Ver. 10.1 with SMT off, XL Fortran Ver. 10.1 with SMT on, and OSCAR with SMT off.)

• OSCAR compiler:
 - 2.74 times speedup against XL Fortran Ver. 10.1 without SMT
 - 2.78 times speedup against XL Fortran Ver. 10.1 with SMT


Example of Near Fine Grain Tasks

Each statement becomes a task (task numbers 1-17):
  1) u12 = a12/l11
  2) u24 = a24/l22
  3) u34 = a34/l33
  4) l54 = -l52 * u24
  5) u45 = a45/l44
  6) l55 = u55 - l54 * u45
  7) y1 = b1 / l11
  8) y2 = b2 / l22
  9) b5 = b5 - l52 * y2
 10) y3 = b3 / l33
 11) y4 = b4 / l44
 12) b5 = b5 - l54 * y4
 13) y5 = b5 / l55
 14) x4 = y4 - u45 * y5
 15) x3 = y3 - u34 * x4
 16) x2 = y2 - u24 * x4
 17) x1 = y1 - u12 * x2

(Figure: the corresponding task graph, in which nodes are labeled with task numbers and edges carry data transfer times tij.)


Task Graph for FPPPP


Redundant Synchronization Removal Using Static Scheduling Results on Shared Memory Multiprocessors

(Figure: tasks A-E are scheduled statically on PE1-PE3, with precedence relations enforced by flag set (FS) and flag check (FC) operations; when the static schedule already guarantees an ordering — for example when an earlier flag check on the same processor already covers the same precedence relation — the corresponding flag checks become unnecessary and are removed.)


Generated Parallel Machine Code for Near Fine Grain Parallel Processing

Tasks A and B run on PE1 and task C on PE2; B is data dependent on C, so PE2 writes its result to distributed shared memory and sets a flag, while PE1 spins on the flag check before reading the data.

  PE1:
  ; ---- Task A ----
  ; Task Body
        FADD  R23, R19, R21
  ; ---- Task B ----
  ; Flag Check : Task C
  L18:  LDR   R28, [R14, 0]
        CMP   R28, R0
        JNE   L18
  ; Data Receive
        LDR   R29, [R14, 1]
  ; Task Body
        FMLT  R24, R23, R29

  PE2:
  ; ---- Task C ----
  ; Task Body
        FMLT  R27, R28, R29
        FSUB  R29, R19, R27
  ; Data Transfer
        STR   [R14, 1], R29
  ; Flag Set
        STR   [R14, 0], R0
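The same flag-based handshake written as a C sketch (volatile accesses to a distributed-shared-memory slot; the dsm_slot layout and names are assumptions for illustration, and a real implementation would also need the appropriate memory barriers):

    #include <stdint.h>

    /* Word 0: synchronization flag, word 1: transferred datum; both assumed to
     * live in distributed shared memory visible to both processor elements.
     * The flag must be initialized to a nonzero value before the tasks start. */
    typedef struct { volatile uint32_t flag; volatile float data; } dsm_slot;

    /* Producer side (PE2, task C): write the result, then set the flag. */
    void task_c(dsm_slot *s, float r19, float r28, float r29)
    {
        float result = r19 - r28 * r29;
        s->data = result;          /* data transfer into DSM               */
        s->flag = 0;               /* flag set: the value PE1 is waiting for */
    }

    /* Consumer side (PE1, task B): busy-wait on the flag, then use the datum. */
    float task_b(const dsm_slot *s, float r23)
    {
        while (s->flag != 0)       /* flag check loop, like L18 above */
            ;                      /* spin */
        return r23 * s->data;      /* task body: FMLT R24, R23, R29   */
    }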


Multigrain Parallel Processing of JPEG Encoding on the OSCAR CMP

(Figure: speed-up ratio over sequential execution ("逐次実行") for processor configurations ("プロセッサ構成") on a 4-issue OSCAR CMP, combining 1st-layer processor groups (PG) and 2nd-layer processor elements (PE) within each group — 1PG2PE and 2PG1PE with 2 processors, and 1PG4PE, 4PG1PE, and 2PG2PE with 4 processors. Measured speedups range from 1.00 for sequential execution, through 1.25 and 1.51 with 2 processors, up to 1.62, 2.70, 3.36, and 3.59 with 4 processors.)


OSCAR Multi-Core Architecture (shown again as the target of the power control scheme)

Each chip multiprocessor contains processor elements (CPU, LPM/I-cache, LDM/D-cache, DSM, DTC, network interface) with frequency/voltage control registers (FVR) per core, per chip, and for the interconnect, an intra-chip connection network, and a CSM/L2 cache; chips, centralized shared memories, I/O devices, and I/O CMPs are connected by an inter-chip connection network.


Power Reduction by Power Supply, Clock Frequency and Voltage Control by the OSCAR Compiler

• Shortest execution time mode


An Example of Machine Parameters for the Power Saving Scheme

• Functions of the multiprocessor
 - The frequency of each processor can be changed among several levels
 - Voltage is changed together with frequency
 - Each processor can be powered on/off

  State parameters:
    state            FULL   MID    LOW    OFF
    frequency        1      1/2    1/4    0
    voltage          1      0.87   0.71   0
    dynamic energy   1      3/4    1/2    0
    static power     1      1      1      0

• State transition overhead

  Delay time [u.t.]:
    from\to  FULL   MID    LOW    OFF
    FULL     0      40k    40k    80k
    MID      40k    0      40k    80k
    LOW      40k    40k    0      80k
    OFF      80k    80k    80k    0

  Energy overhead [μJ]:
    from\to  FULL   MID    LOW    OFF
    FULL     0      20     20     40
    MID      20     0      20     40
    LOW      20     20     0      40
    OFF      40     40     40     0
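As a sketch of how parameters like these could drive the compiler's decision (illustrative only; the actual OSCAR scheme makes this choice as part of static scheduling, together with power-off of idle periods): given a task's workload in cycles and the slack before its successor or deadline, pick the slowest state whose execution plus transition delay still fits, since dynamic energy falls as the clock is lowered.

    /* Machine parameters from the table above (FULL, MID, LOW; OFF is used
     * only for idle periods, not for running a task). */
    typedef struct { double freq; double dyn_energy; long switch_delay; } pstate;

    static const pstate states[3] = {
        { 1.00, 1.00, 0     },   /* FULL: assume we start in FULL, no switch */
        { 0.50, 0.75, 40000 },   /* MID : 1/2 frequency, 3/4 dynamic energy  */
        { 0.25, 0.50, 40000 },   /* LOW : 1/4 frequency, 1/2 dynamic energy  */
    };

    /* Choose the lowest-dynamic-energy state that still finishes `cycles` of
     * work within `slack` time units (time measured in FULL-speed cycles). */
    int choose_state(long cycles, long slack)
    {
        int best = 0;                                    /* default: FULL */
        for (int s = 1; s < 3; s++) {
            long exec = (long)(cycles / states[s].freq); /* slower clock takes longer */
            if (exec + states[s].switch_delay <= slack)
                best = s;                                /* later entries use less energy */
        }
        return best;
    }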


Speed-up in Fastest Execution Mode

(Figure: speedup ratios with and without the power saving control ("w/o Saving", "w Saving") for tomcatv, swim, applu, and mpeg2enc on 1, 2, and 4 processors.)


Consumed Energy in Fastest Execution Mode

(Figure: consumed energy with and without the power saving control ("w/o Saving", "w Saving") on 1, 2, and 4 processors — for tomcatv, swim, and applu in joules (left chart, scale up to about 200 J) and for mpeg2_encode in millijoules (right chart, scale up to about 1600 mJ).)


Energy Reduction by OSCAR Compiler in Real-time Processing Mode (1% Leak)

• deadline = sequential execution time, leakage power: 1%


Energy Reduction by OSCAR Compiler in Real-time Processing Mode (10% Leak)

• deadline = sequential execution time, leakage power: 10%


Energy Reduction vs. Leakage Power (4 cores)

• deadline = sequential execution time * 0.5

Conclusions


• Compiler cooperative, low power, high effective performance multi-core processors will become increasingly important in a wide range of information systems, from games, mobile phones, and automobiles to peta-scale supercomputers.
• Parallelizing compilers are essential for realizing:
 - Good cost performance
 - Short hardware and software development periods
 - Low power consumption
 - High software productivity
 - Scalable performance improvement with advancement in semiconductor integration technology
• Key technologies in multi-core compilers:
 - Multigrain parallelization, data localization, data transfer overlapping using DMA, and low power control technologies
