Inter-Block Scoreboard Scheduling in a JIT Compiler for VLIW Processors

Benoît Dupont de Dinechin
Research & Development Responsible, Software, Tools and Services (STS)
STMicroelectronics, Grenoble (France)

Inter-Block Scoreboard Scheduling

Presentation Outline

• JIT for Media Processing
• Classic List Scheduling
• Scoreboard Scheduling
• Inter-Block Scheduling
• ST200 VLIW Experiments
• Observations and Conclusions

Euro-Par 2008 – August 28th 2008


JIT for Media Processing

Systems-On-Chip (SoCs) at STMicroelectronics
• STMicroelectronics SoC(s) used in: consumer electronics (set-top boxes, car infotainment), telecoms infrastructure, mobile phones
• STMicroelectronics SoC(s) typically comprise:
  – Host processors: ARM family, ST40/SH4 processors
  – Application processors: DSPs, VLIW-Media (ST200 family)
  – Programmable hardware: processor with custom extensions, coarse-grained reconfigurable arrays (CGRA), GP-GPU
• By using a processor-neutral program representation, and AOT or JIT compilation, C / C++ media processing code may dispatch to different processors
  ⇒ need byte-code for C / C++ programs
The Microsoft .NET Common Language Infrastructure (CLI) standard


The ST200 VLIW Media Family (ST210, ST220, ST231, ST240)

• Lx architecture [ISCA’00], partial predication with SELECT
• 63 × 32-bit general registers, 8 × 1-bit branch registers
• Scheduled resources: 4×ISSUE, 1×MEM, 1×CTL, 2×ODD


JIT for Media Processing Post-Pass Scheduling Challenges
• Achieve efficiency (code quality) and speed (compilation time)
  – C media processing kernels expose significantly more instruction-level parallelism than Java applications
• Satisfy post-pass scheduling constraints along all program paths
  – Required on VLIW processors without interlocking hardware (MIPS ≡ Microprocessor without Interlocked Pipeline Stages)
• Preserve pre-pass region schedules that are still valid and that satisfy post-pass scheduling constraints at boundaries
  – Not only local pre-pass schedules, but also global pre-pass schedules including software pipelines (cyclic schedules)


Classic Approaches in Static and JIT Compilers
• Open64-based ST200 VLIW production compiler: post-pass schedule superblock regions, insert NOPs between regions to prevent scheduling hazards
• IBM Testarossa Java JIT compiler (zSeries 990 and POWER4): apply pre-pass and post-pass scheduling to a few code paths

Proposed Approach: Inter-Block Scoreboard Scheduling
• Scoreboard Scheduling is a restriction of classic Operation Scheduling that can be implemented efficiently
• Inter-Block Scheduling is an iterative scheduling constraint propagation reminiscent of forward data-flow analysis
• Combining these two techniques addresses all our “JIT for Media Processing Post-Pass Scheduling Challenges”


Classic List Scheduling

Sample Dependence Graph
Assume two execution units (scheduled resources) and 5 operations:

[Figure: dependence graph over the operations O0 to O5]

The dependence graph contains a dummy operation O0

Critical-Path Scheduling Priorities
Defined as longest path from operation start to end of execution:

Operation        O1   O2   O3   O4   O5
Execution Time    1    2    1    2    1
Priority          4    2    3    2    1
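As an illustration of the priority definition, here is a minimal Python sketch that computes critical-path priorities by recursion over successors. The execution times come from the table; the dependence edges in `succs` are an assumption (the edges of the figure are not recoverable from this transcript), chosen only because they reproduce the listed priorities. The names `exec_time`, `succs`, and `priority` are hypothetical.

```python
from functools import lru_cache

# Execution times of O1..O5, as listed in the slide's table.
exec_time = {"O1": 1, "O2": 2, "O3": 1, "O4": 2, "O5": 1}

# ASSUMED dependence edges (hypothetical, not taken from the slide's figure):
# O1 -> O3, O3 -> O4, O3 -> O5, and O2 has no successor.
succs = {"O1": ["O3"], "O2": [], "O3": ["O4", "O5"], "O4": [], "O5": []}


@lru_cache(maxsize=None)
def priority(op):
    """Longest path from the start of `op` to the end of execution."""
    return exec_time[op] + max((priority(s) for s in succs[op]), default=0)


priorities = {op: priority(op) for op in exec_time}
print(priorities)
```

Under these assumed edges the computed values agree with the Priority row of the table. Sorting operations by decreasing priority also yields a topological order of the graph, since every operation has a strictly higher priority than its successors when execution times are positive.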


Cycle Scheduling (Graham List Scheduling)
• Schedule by non-decreasing time slot order
• At each time slot, try to schedule all the dependence-ready operations in priority order

[Figure: Gantt chart of the cycle schedule of operations O1 to O5 on the two execution units]

Cycle Scheduling produces 'Non-Delay Schedules'
• No execution resources are left idle if there exists an operation that could start executing
• Non-Delay Schedules may not contain optimal schedules (for Makespan, Max-Lateness, and other regular measures)
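The two rules of Cycle Scheduling can be sketched in Python as follows. This is a minimal illustration, not the implementation from the talk. It assumes that an execution unit stays occupied for the whole execution time of an operation and that a successor may start once its predecessors have finished. The dependence edges in `preds` are hypothetical (the figure's edges are not recoverable here), and `cycle_schedule` is an illustrative name.

```python
def cycle_schedule(exec_time, preds, priority, units):
    """Cycle (Graham list) scheduling: visit time slots in increasing order
    and, at each slot, start ready operations by decreasing priority."""
    start = {}
    free_at = [0] * units  # time at which each execution unit becomes idle
    t = 0
    while len(start) < len(exec_time):
        # Dependence-ready: every predecessor is scheduled and has completed.
        ready = [op for op in exec_time
                 if op not in start
                 and all(p in start and start[p] + exec_time[p] <= t
                         for p in preds[op])]
        for op in sorted(ready, key=lambda o: -priority[o]):
            idle = [u for u in range(units) if free_at[u] <= t]
            if not idle:
                break  # no unit left at this time slot
            start[op] = t
            free_at[idle[0]] = t + exec_time[op]
        t += 1
    return start


# Execution times and priorities as listed for O1..O5.
exec_time = {"O1": 1, "O2": 2, "O3": 1, "O4": 2, "O5": 1}
priority = {"O1": 4, "O2": 2, "O3": 3, "O4": 2, "O5": 1}
# ASSUMED dependence edges (hypothetical): O1 -> O3, O3 -> O4, O3 -> O5.
preds = {"O1": [], "O2": [], "O3": ["O1"], "O4": ["O3"], "O5": ["O3"]}

schedule = cycle_schedule(exec_time, preds, priority, units=2)
makespan = max(schedule[op] + exec_time[op] for op in schedule)
print(schedule, makespan)
```

Because the outer loop only advances time after trying every ready operation, a unit is never left idle while a ready operation exists, which is the Non-Delay property.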


Operation Scheduling
• Consider operations in priority order, which must be a topological sort of the dependence graph
• Schedule an operation at the earliest time slot possible

[Figure: Gantt chart of the operation schedule of operations O1 to O5 on the two execution units]

Operation Scheduling produces 'Active Schedules'
• No operation can be completed earlier without delaying another operation
• Active Schedules contain Non-Delay Schedules and also optimal schedules (for Makespan, Max-Lateness, etc.)
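For comparison, Operation Scheduling can be sketched with a per-slot resource usage table. This is a minimal illustration under the same kind of assumptions: units are identical and occupied for the full execution time, successors wait for predecessors to finish, and the edges in `preds` are hypothetical. The name `operation_schedule` and the table `usage` are illustrative, not taken from the talk.

```python
from collections import defaultdict


def operation_schedule(order, exec_time, preds, units):
    """Operation scheduling: take operations in priority order and place
    each one at the earliest slot allowed by dependences and resources."""
    usage = defaultdict(int)  # usage[t] = number of units busy during slot t
    start = {}
    for op in order:
        # Earliest start allowed by the already-scheduled predecessors.
        t = max((start[p] + exec_time[p] for p in preds[op]), default=0)
        # Delay until a unit is available for the whole execution time.
        while any(usage[t + d] >= units for d in range(exec_time[op])):
            t += 1
        start[op] = t
        for d in range(exec_time[op]):
            usage[t + d] += 1
    return start


# Execution times and priorities as listed for O1..O5.
exec_time = {"O1": 1, "O2": 2, "O3": 1, "O4": 2, "O5": 1}
priority = {"O1": 4, "O2": 2, "O3": 3, "O4": 2, "O5": 1}
# ASSUMED dependence edges (hypothetical): O1 -> O3, O3 -> O4, O3 -> O5.
preds = {"O1": [], "O2": [], "O3": ["O1"], "O4": ["O3"], "O5": ["O3"]}

# Decreasing critical-path priority is a topological order of this graph.
order = sorted(exec_time, key=lambda o: -priority[o])
schedule = operation_schedule(order, exec_time, preds, units=2)
makespan = max(schedule[op] + exec_time[op] for op in schedule)
print(order, schedule, makespan)
```

Unlike the cycle-driven loop, this version may place an operation in a time slot earlier than one already filled, because it scans the usage table from the operation's earliest dependence-ready slot rather than from a global current time.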


Cases of Unit Execution Time (Pipelined Execution)
• Cycle Scheduling computes the same schedules as Operation Scheduling
• Optimality proved for various shapes of dependence graph
• Classic Graham performance bound for m resources: $2 - \frac{1}{m}$
• Performance bound for k types of resources, $m_i$ units of resource i, and z + 1 maximum latency: $(k + 1) - \frac{1}{(z + 1) \cdot \max_{0 \le i \ldots}}$
