Inter-Block Scoreboard Scheduling in a JIT Compiler for VLIW Processors
Benoît Dupont de Dinechin
Research & Development Responsible
Software, Tools and Services (STS)
STMicroelectronics Grenoble (France)
[email protected]
Presentation Outline
• JIT for Media Processing
• Classic List Scheduling
• Scoreboard Scheduling
• Inter-Block Scheduling
• ST200 VLIW Experiments
• Observations and Conclusions
Euro-Par 2008 – August 28th 2008
JIT for Media Processing
Systems-On-Chip (SoCs) at STMicroelectronics
• STMicroelectronics SoCs are used in: consumer electronics (set-top boxes, car infotainment), telecoms infrastructure, mobile phones
• STMicroelectronics SoCs typically comprise:
  – Host processors: ARM family, ST40/SH4 processors
  – Application processors: DSPs, VLIW-Media (ST200 family)
  – Programmable hardware: processors with custom extensions, coarse-grained reconfigurable arrays (CGRA), GP-GPU
• By using a processor-neutral program representation, and AOT or JIT compilation, C / C++ media processing code may be dispatched to different processors
⇒ need byte-code for C / C++ programs
The Microsoft .NET Common Language Infrastructure (CLI) standard
The ST200 VLIW Media Family (ST210, ST220, ST231, ST240)
• Lx architecture [ISCA’00], partial predication with SELECT
• 63 × 32-bit general registers, 8 × 1-bit branch registers
• Scheduled resources: 4×ISSUE, 1×MEM, 1×CTL, 2×ODD
JIT for Media Processing Post-Pass Scheduling Challenges
• Achieve efficiency (code quality) and speed (compilation time)
  – C media processing kernels expose significantly more instruction-level parallelism than Java applications
• Satisfy post-pass scheduling constraints along all program paths
  – Required on VLIW processors without interlocking hardware (MIPS ≡ Microprocessor without Interlocked Pipeline Stages)
• Preserve pre-pass region schedules that are still valid and that satisfy post-pass scheduling constraints at boundaries
  – Not only local pre-pass schedules, but also global pre-pass schedules including software pipelines (cyclic schedules)
Classic Approaches in Static and JIT Compilers
• Open64-based ST200 VLIW production compiler: post-pass schedule superblock regions, insert NOPs between regions to prevent scheduling hazards
• IBM Testarossa Java JIT compiler (zSeries 990 and POWER4): apply pre-pass and post-pass scheduling to a few code paths

Proposed Approach: Inter-Block Scoreboard Scheduling
• Scoreboard Scheduling is a restriction of classic Operation Scheduling that can be implemented efficiently
• Inter-Block Scheduling is an iterative scheduling constraint propagation reminiscent of forward data-flow analysis
• Combining these two techniques addresses all our “JIT for Media Processing Post-Pass Scheduling Challenges”
Classic List Scheduling
Sample Dependence Graph
Assume two execution units (scheduled resources) and 5 operations:
[Figure: dependence graph over operations O0 to O5]
The dependence graph contains a dummy operation O0

Critical-Path Scheduling Priorities
Defined as longest path from operation start to end of execution:

Operation        O1   O2   O3   O4   O5
Execution Time    1    2    1    2    1
Priority          4    2    3    2    1
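The priority definition above can be sketched in a few lines of Python. The execution times come from the table; the successor lists are an assumption, since the figure's edges are not recoverable from this text. They were chosen only because they reproduce the priorities listed in the table, and the dependence latency is assumed equal to the producer's execution time.

```python
from functools import lru_cache

# Execution times from the table above.
exec_time = {"O1": 1, "O2": 2, "O3": 1, "O4": 2, "O5": 1}

# ASSUMED successor lists (hypothetical edges, the original figure is lost);
# picked so that the computed priorities agree with the table.
succs = {"O1": ["O3"], "O2": [], "O3": ["O4"], "O4": [], "O5": []}


@lru_cache(maxsize=None)
def priority(op):
    """Longest path from the start of `op` to the end of execution."""
    return exec_time[op] + max((priority(s) for s in succs[op]), default=0)


priorities = {op: priority(op) for op in exec_time}
print(priorities)
```

Memoizing the recursion keeps the computation linear in the size of the graph, which matters when priorities are recomputed at JIT time.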
Cycle Scheduling (Graham List Scheduling)
• Schedule by non-decreasing time slot order
• At each time slot, try to schedule all the dependence-ready operations in priority order
[Figure: Gantt chart of the cycle schedule of operations 1 to 5 on the two execution units]

Cycle Scheduling produces 'Non-Delay Schedules'
• No execution resources are left idle if there exists an operation that could start executing
• Non-Delay Schedules may not contain optimal schedules (for Makespan, Max-Lateness, and other regular measures)
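A minimal sketch of Cycle Scheduling on the sample operations follows. The execution times and priorities are those of the priority table; the predecessor lists are an assumption (hypothetical edges consistent with that table), so the resulting placement may differ from the slide's Gantt chart. The function name `cycle_schedule` and the tie-breaking rule (table order) are also illustrative choices.

```python
exec_time = {"O1": 1, "O2": 2, "O3": 1, "O4": 2, "O5": 1}
priority = {"O1": 4, "O2": 2, "O3": 3, "O4": 2, "O5": 1}
# ASSUMED predecessor lists (hypothetical, the original figure is lost).
preds = {"O1": [], "O2": [], "O3": ["O1"], "O4": ["O3"], "O5": []}


def cycle_schedule(num_units=2):
    """Return {op: (start_cycle, unit)} built one time slot at a time."""
    start = {}
    busy_until = [0] * num_units  # first cycle at which each unit is free
    t = 0
    while len(start) < len(exec_time):
        # Dependence-ready: every predecessor has completed by cycle t.
        ready = [
            op for op in exec_time
            if op not in start
            and all(p in start and start[p][0] + exec_time[p] <= t
                    for p in preds[op])
        ]
        ready.sort(key=lambda op: -priority[op])  # stable: ties keep table order
        for op in ready:
            unit = next((u for u in range(num_units) if busy_until[u] <= t), None)
            if unit is None:
                break  # no free unit left in this time slot
            start[op] = (t, unit)
            busy_until[unit] = t + exec_time[op]
        t += 1  # time only moves forward, earlier slots are never revisited
    return start


schedule = cycle_schedule()
makespan = max(t + exec_time[op] for op, (t, _) in schedule.items())
print(schedule, makespan)
```

Because a unit is filled whenever some ready operation exists, the result is a non-delay schedule by construction.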
Operation Scheduling
• Consider operations in priority order, which must be a topological sort of the dependence graph
• Schedule an operation at the earliest time slot possible
[Figure: Gantt chart of the operation schedule of operations 1 to 5 on the two execution units]

Operation Scheduling produces 'Active Schedules'
• No operation can be completed earlier without delaying another operation
• Active Schedules contain Non-Delay Schedules and also optimal schedules (for Makespan, Max-Lateness, etc.)
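Operation Scheduling can be sketched as follows, again under assumptions: the predecessor lists are hypothetical edges consistent with the priority table, the reservation table is modeled as a plain set of `(unit, cycle)` pairs, and `operation_schedule` is an illustrative name. Sorting by decreasing critical-path priority yields a topological order here because every operation has a strictly higher priority than its successors.

```python
exec_time = {"O1": 1, "O2": 2, "O3": 1, "O4": 2, "O5": 1}
priority = {"O1": 4, "O2": 2, "O3": 3, "O4": 2, "O5": 1}
# ASSUMED predecessor lists (hypothetical, the original figure is lost).
preds = {"O1": [], "O2": [], "O3": ["O1"], "O4": ["O3"], "O5": []}


def operation_schedule(num_units=2):
    """Return {op: (start_cycle, unit)}, placing one operation at a time."""
    occupied = set()  # reserved (unit, cycle) pairs
    start = {}
    for op in sorted(exec_time, key=lambda o: -priority[o]):
        # Earliest start allowed by the already-scheduled predecessors.
        t = max((start[p][0] + exec_time[p] for p in preds[op]), default=0)
        while op not in start:
            for u in range(num_units):
                cycles = range(t, t + exec_time[op])
                if all((u, c) not in occupied for c in cycles):
                    occupied.update((u, c) for c in cycles)
                    start[op] = (t, u)
                    break
            else:
                t += 1  # no unit is free for the whole duration, try later
    return start


schedule = operation_schedule()
makespan = max(t + exec_time[op] for op, (t, _) in schedule.items())
print(schedule, makespan)
```

Unlike a slot-by-slot scheduler, each operation here searches from its own earliest dependence-feasible cycle, so a low-priority operation may be placed in a gap left before operations scheduled earlier.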
Cases of Unit Execution Time (Pipelined Execution)
• Cycle Scheduling computes the same schedules as Operation Scheduling
• Optimality proved for various shapes of dependence graph
• Classic Graham performance bound for m resources: $2 - \frac{1}{m}$
• Performance bound for k types of resources, m_i units of resource i, and z + 1 maximum latency: (k + 1) − 1 / (z+1)∗max 0≤i