Mapping Reference Code to Irregular DSPs within the Retargetable, Optimizing Compiler COGEN(t)
Gary Grewal and Tom Wilson
Dept. of Computing and Information Science
University of Guelph
Guelph, Ontario, Canada
Outline
1. The problem of generating code for highly encoded instruction sets: to optimize or retarget?
2. Our approach: make optimization more generic by ignoring some details until near the end
3. The idealized version of the processor (clean machine)
4. Quick tour of how we compile for the clean machine
5. Shake And Bake: the final mapping to the real processor
6. Inside Shake and Bake: enhanced genetic algorithms
7. Closing comments
The Problem with Highly Encoded Instruction Sets
• Code generation becomes more difficult with…
– Instruction level parallelism (especially if constrained)
– Non-orthogonal structure (overlapping roles and special cases)
– Small, heterogeneous register classes
– Instruction-based constraints on allowable register sets
– Instruction-based interdependencies among registers
• Specialized processors have a niche
• Optimized code is usually required for their applications
• Retargetability is important during design exploration
– Trying different existing processors
– Developing an ASIP or extending an existing processor family
Typical DSP Architecture
[Figure: General DSP architecture showing register classes (accumulators Accum1 and Accum2, operand registers OpndReg1 and OpndReg2, address registers), dual data memories (X Memory and Y Memory), and register-to-register ALU operations.]
To Optimize or Retarget?
• Optimizing for a “difficult” instruction set often involves clever, specialized techniques that don’t adapt easily.
• Retargeting requires agile algorithms that often fail to find good enough code.
• The COGEN(t) Strategy:
1. Optimize for a well-behaved target processor that resembles the real one – minus some ugly constraints
2. Then map from the resulting code to code that satisfies all the constraints of the real processor
[Figure: two compilation flows. Conventional: Source Program → Compiler → Object Program → Target Processor. COGEN(t): Source Program → Compiler (targeting the Clean Machine) → Reference Code → Shake And Bake → Object Program → Target Processor.]
Sample Loop Kernel
double FIR_filter(in double A[], in double B[], in int tap)
{
    int k;
    double sum=0;
    for(k=0; k< tap; k++)
        sum += A[k] * B[k];
    return sum;
}

ALU                 X-Memory        Y-Memory
CLR   A             X:(R0)+, X0     Y:(R4)+, Y0     [1]
REP   #N-1                                          [2]
MAC   X0, Y0, A     X:(R0)+, X0     Y:(R4)+, Y0     [3]
MACR  X0, Y0, A                                     [4]
Parallel Instruction Template
[Figure: the parallel instruction MAC X0,Y0,A  X:(R0)+,X0  Y:(R4)+,Y0 drawn as a template of basic operations (++, ld, mac, ld, ++), each with input ports and output ports.]
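Domains relate to specific units of the processor
[Figure: the basic operations of MAC X0,Y0,A  X:(R0)+,X0  Y:(R4)+,Y0 (++, ld, mac, ld, ++) overlaid on the architecture diagram (Addr Regs, Opnd1, Opnd2, Accum1, Accum2, X Mem, Y Mem), with each port annotated with the register it uses (R0, X0, A, Y0, R4).]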
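Domains and Allowable Register Sets
[Figure: the template of basic operations (++, ld, macc, ld, ++) with the allowable register set shown at each port. Sets shown: R4, R5, R6, R7; A, B, Y0, Y1, R0, R1, R2, R3; R4, R5, R6, R7.]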
Register-to-Port Constraints
[Figure: the same template (++, ld, macc, ld, ++) with its allowable register sets (R4, R5, R6, R7; A, B, Y0, Y1, R0, R1, R2, R3), next to the concrete instruction MAC X0, Y0, A  X:(R0)+, X0  Y:(R4)+, Y0 with the chosen register (R0, X0, A, Y0, R4) at each port.]
Effect of Parallelism on Constraints
[Figure: four instruction templates built from ×, +, ld and st operations, each with numbered ports, register sets and constraints.]
• Template with ports P1 to P3
– Register Sets: P1 = {A0, A1, R0, R1, L0, L1, P}, P2 = {A0, A1, R0, R1}, P3 = {P}
– Constraints: {R0, A0} incompatible, {R1, A1} incompatible, {A0, A1} incompatible
• Template with ports P1 to P4
– Register Sets: P1 = {L0, L1}, P2 = {R0, R1}, P3 = {P}, P4 = {R0, R1}
– Constraints: P4 = P2 (same register)
• Template with ports P1 to P5
– Register Sets: P1 = {L0, L1}, P2 = {L0, L1}, P3 = {R0, R1}, P4 = {P}, P5 = {R0, R1}
• Template with ports P1 to P7
– Register Sets: P1 = {A0, A1}, P2 = {P}, P3 = {L0, L1}, P4 = {R0, R1}, P5 = {A0, A1}, P6 = {A0, A1}, P7 = {P}
– Constraints: if (P3 = L0) then P5 = A0; if (P3 = L1) then P5 = A1; P1 = P6 (same register)
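The register sets and constraints of a template can be checked mechanically. The sketch below is only an illustration, not COGEN(t) code: it encodes the seven-port register sets from this slide and assumes that the if-then and same-register constraints belong to that template (the port names line up that way). The names `PORT_SETS`, `IF_THEN`, `SAME`, `violations` and `feasible_assignments` are hypothetical.

```python
from itertools import product

# Allowable register set per port, copied from the seven-port template.
PORT_SETS = {
    "P1": {"A0", "A1"}, "P2": {"P"}, "P3": {"L0", "L1"}, "P4": {"R0", "R1"},
    "P5": {"A0", "A1"}, "P6": {"A0", "A1"}, "P7": {"P"},
}
# Assumption: these constraints are read as belonging to that template.
IF_THEN = [("P3", "L0", "P5", "A0"), ("P3", "L1", "P5", "A1")]
SAME = [("P1", "P6")]


def violations(assign):
    """Return the constraints that a port -> register assignment breaks."""
    broken = []
    for port, allowed in PORT_SETS.items():
        if assign[port] not in allowed:
            broken.append(f"{port} not in its register set")
    for if_port, if_reg, then_port, then_reg in IF_THEN:
        if assign[if_port] == if_reg and assign[then_port] != then_reg:
            broken.append(f"if {if_port} = {if_reg} then {then_port} = {then_reg}")
    for a, b in SAME:
        if assign[a] != assign[b]:
            broken.append(f"{a} = {b} (same register)")
    return broken


def feasible_assignments():
    """Enumerate every assignment that satisfies all constraints."""
    ports = sorted(PORT_SETS)
    for regs in product(*(sorted(PORT_SETS[p]) for p in ports)):
        assign = dict(zip(ports, regs))
        if not violations(assign):
            yield assign


if __name__ == "__main__":
    print(len(list(feasible_assignments())))
```

Under these assumptions the register sets alone allow 32 combinations, and the if-then and same-register constraints leave 8 of them, which shows how quickly interdependencies shrink the choice for a single instruction.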
Clean Machine – operator parallelism
• Each domain supports certain basic ops
• Ops from different domains may be used in parallel without restriction
Alignment of basic operations
[Figure: domains d1, d2 and d3 hold basic operations b1, b2 and b3, each with register ports (r11 … r33). A schedule table with steps 1 to N aligns operations such as ld, st, * and + across the domains.]
Clean Machine – register sets
• Each basic operation in a domain has a set of allowable registers for each port
• These register sets are unaffected by parallelism (they remain the same whether the op is isolated or used in parallel)
• Register sets that map to ports of the same instruction have no special external constraints between them
[Figure: RegSet1 and RegSet2 attached to the ports (r12, r22, r32, …) of basic operations b1, b2, b3 used in parallel, and the same RegSet1 and RegSet2 attached to b2 used in isolation.]
Sample Loop Kernel – many connected instructions
[Figure: the FIR_filter kernel and its four-instruction code ([1] CLR, [2] REP, [3] MAC, [4] MACR) from the Sample Loop Kernel slide, redrawn as a graph of connected basic operations: clr, mac, mac and macr in the ALU, each linked to ld and ++ operations on the X-Memory and Y-Memory sides.]
The COGEN(t) Retargetable Code Generator
• Patterns map basic ops into supported ops, restructure
• Basic blocks partitioned into traces, inner loops first
• Schedule mainstream ops and bind ops to domains using an “enhanced genetic algorithm” (EGA)
• Spills introduced to meet register availability
• Memory-resident values mapped to final memory
• Address registers and address update code inserted
• Relational Database supplies intermediate storage
– powerful search capability
– easy to maintain pools of alternatives
Note: scheduler finds “optimum” schedule by successive discovery of “feasible” length-constrained schedules
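The note on length-constrained schedules can be illustrated with a small sketch. It is not the COGEN(t) scheduler: where COGEN(t) uses an EGA to discover feasible schedules, this example substitutes an exhaustive backtracking search, and it assumes unit latency and one basic op per domain per step. The function names, the op names and the three domains (X, Y, ALU) are hypothetical.

```python
def asap_times(ops, deps):
    """Earliest step of each op; `ops` must be listed in topological order."""
    t = {}
    for op in ops:
        t[op] = 1 + max((t[p] for p, s in deps if s == op), default=0)
    return t


def feasible_schedule(ops, deps, length):
    """Search for a schedule of at most `length` steps, or return None."""
    order = list(ops)          # assumed topological
    step, used = {}, set()     # used holds occupied (step, domain) slots

    def place(i):
        if i == len(order):
            return True
        op = order[i]
        earliest = 1 + max((step[p] for p, s in deps if s == op), default=0)
        for t in range(earliest, length + 1):
            slot = (t, ops[op])
            if slot not in used:
                step[op] = t
                used.add(slot)
                if place(i + 1):
                    return True
                used.discard(slot)
                del step[op]
        return False

    return dict(step) if place(0) else None


def shortest_schedule(ops, deps):
    """Try successive length bounds, starting from the critical path."""
    lower = max(asap_times(ops, deps).values())
    for length in range(lower, len(ops) + 1):
        schedule = feasible_schedule(ops, deps, length)
        if schedule is not None:
            return length, schedule


DEPS = [("ld_a", "mul"), ("ld_b", "mul"), ("mul", "add"), ("ld_c", "add"), ("add", "st")]
DUAL_MEM = {"ld_a": "X", "ld_b": "Y", "mul": "ALU", "ld_c": "X", "add": "ALU", "st": "X"}
SINGLE_MEM = {"ld_a": "X", "ld_b": "X", "mul": "ALU", "ld_c": "X", "add": "ALU", "st": "X"}

if __name__ == "__main__":
    print(shortest_schedule(DUAL_MEM, DEPS)[0], shortest_schedule(SINGLE_MEM, DEPS)[0])
```

With two data memories the toy trace fits in 4 steps; forcing every memory op into one domain stretches it to 5. Because the first feasible length found is also the smallest bound tried, the result is optimal for this exhaustive search; with a heuristic feasibility test such as an EGA, “optimum” only holds as far as the search succeeds.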
[Pipeline: Compiler → Reference Code → Shake And Bake → Object Program]
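Pool Methodology
[Figure: the EGA produces a candidate pool in which all candidates are feasible. Each candidate's criteria values (V1, V3, V4, …) are multiplied by weights (W1, W2, W3) that express the relative importance of factors and are adjusted for the architecture and the type of trace. The candidate with the max result is selected. Others may be kept as backups.]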
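One way to read the pool figure is as a weighted sum over criteria followed by a max. The sketch below assumes exactly that reading; the criteria names `v1` to `v3`, the weight values and the function `select_from_pool` are hypothetical and not taken from COGEN(t).

```python
def select_from_pool(pool, weights):
    """Rank feasible candidates by weighted criteria.

    pool    -- list of (name, criteria) pairs, criteria being a dict of values
    weights -- dict with the relative importance of each criterion
    Returns the best candidate and the remaining ones, kept as backups.
    """
    def score(criteria):
        return sum(weights[k] * criteria[k] for k in weights)

    ranked = sorted(pool, key=lambda item: score(item[1]), reverse=True)
    return ranked[0], ranked[1:]


# Weights would be adjusted per architecture and type of trace (values assumed).
WEIGHTS = {"v1": 1.0, "v2": 2.0, "v3": 0.5}
POOL = [
    ("a", {"v1": 3, "v2": 1, "v3": 2}),  # score 6.0
    ("b", {"v1": 1, "v2": 3, "v3": 0}),  # score 7.0
    ("c", {"v1": 2, "v2": 1, "v3": 2}),  # score 5.0
]

if __name__ == "__main__":
    best, backups = select_from_pool(POOL, WEIGHTS)
    print(best[0], [name for name, _ in backups])
```

Keeping the ranked remainder rather than discarding it matches the slide's remark that other candidates may be kept as backups.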
Reference Code produced by COGEN(t)
• In every trace, the code observes constraints:
– At most one basic op per domain per instruction (1)
– The assignment of ops to domains is appropriate (2)
– All necessary orderings are observed (3)
Note: all parallel domain combinations (alignments) are allowed in the clean machine
• In every trace, the code consumes the fewest possible steps and is as parallel as possible
• At every step of every trace, the number of live variables does not exceed the number of suitable registers
Note: Actual registers are assigned later by Bake
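Constraints (1) to (3) are simple enough to state as a checker. The following is a minimal sketch under assumed data structures (a dict from op name to step, domain and kind); the domain table and op names are hypothetical and only stand in for a clean-machine description.

```python
# Basic ops supported by each domain (assumed example values).
DOMAIN_OPS = {"X": {"ld", "st"}, "Y": {"ld"}, "ALU": {"mul", "add"}}


def check_reference_code(schedule, deps, domain_ops=DOMAIN_OPS):
    """Return (rule, op, detail) tuples for every broken constraint.

    schedule -- dict: op name -> (step, domain, kind)
    deps     -- list of (producer, consumer) orderings
    """
    problems = []
    occupied = {}
    for op, (step, domain, kind) in schedule.items():
        if (step, domain) in occupied:
            problems.append((1, op, occupied[(step, domain)]))  # two ops share a domain
        else:
            occupied[(step, domain)] = op
        if kind not in domain_ops.get(domain, set()):
            problems.append((2, op, domain))                    # op bound to a wrong domain
    for pred, succ in deps:
        if schedule[pred][0] >= schedule[succ][0]:
            problems.append((3, pred, succ))                    # ordering not observed
    return problems


DEPS = [("ld_a", "mul"), ("ld_b", "mul"), ("mul", "st")]
GOOD = {"ld_a": (1, "X", "ld"), "ld_b": (1, "Y", "ld"),
        "mul": (2, "ALU", "mul"), "st": (3, "X", "st")}
BAD = {"ld_a": (1, "X", "ld"), "ld_b": (1, "X", "ld"),
       "mul": (1, "Y", "mul"), "st": (3, "X", "st")}

if __name__ == "__main__":
    print(check_reference_code(GOOD, DEPS))
    print(check_reference_code(BAD, DEPS))
```

The live-variable bound from the last bullet is left out of this sketch; it would need lifetime information for each value in addition to the schedule.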
[Figure: reference code drawn as a grid of basic operations labelled L, S, M, ∗ and +, one column per domain and one row per step. Caption: No Register Conflicts.]
Task of Shake And Bake
• General objective: Map reference code to actual target instruction set
• Specifically:
– Perturb reference code, so that alignments of basic operations all correspond to actual machine instructions
– Find an actual register assignment to all live variables, so that encoding constraints of target instruction set are observed
Overview of Shake And Bake
• SHAKE generates a pool of alternative reference code sequences, equivalent to the original reference code, but with basic ops aligned differently
• AND verifies that every alignment in the reference code corresponds to an actual (parallel) instruction
• AND inserts register copy operations on edges having no feasible register assignment
• BAKE discovers (if possible) a feasible register assignment meeting all constraints
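Action of SHAKE: produce pool of candidate alignments
[Figure: the same basic operations (ld, ld, *, +, +, st) placed in the X-Mem, ALU and Y-mem domains under several alternative alignments.]
Action of AND: map aligned operations to real instructions
[Figure: aligned operations (ld, *, +, st) grouped into actual machine instructions.]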
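Effect of Different Alignments
[Figure: the same ld, × and + operations under two alignments, with the register set available on each edge. { } = possible register assignment.]
– First alignment: e1:{L0, L1, A0H, A1H}, e2:{R0, R1, A0H, A1H}, e3:{∅}, e4:{R0, R1}, e5:{L0, L1}, e6:{A0H, A1H}
– Second alignment: e1:{L0, L1}, e3:{P}, e4:{R0, R1}, e5:{L0, L1, A0H, A1H}, e6:{A0H, A1H}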
The effect of relaxing parallelism
[Figure: four versions (a) to (d) of a schedule built from ld, ×, +, mac and st operations over steps 1 to 6, with register mismatches marked.]
AND verifies instructions (feasible alignments) and determines register sets for each connecting edge
[Figure: operations op_h, op_i, op_j and op_k joined by edges. Intersect Register Sets allowed at edge sources and destinations.]
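The intersection step can be sketched in a few lines. This is an illustration under assumed data structures, not the AND implementation: the port names, the example register classes ({X, Y} and {A, B}) and both function names are hypothetical.

```python
def edge_register_sets(edges, out_sets, in_sets):
    """Intersect the registers allowed at each edge's source and destination.

    edges    -- list of (edge name, source port, destination port)
    out_sets -- dict: source port -> registers the producer may write
    in_sets  -- dict: destination port -> registers the consumer may read
    """
    return {name: out_sets[src] & in_sets[dst] for name, src, dst in edges}


def edges_without_register(edge_sets):
    """Edges whose intersection is empty, i.e. with no feasible register."""
    return sorted(name for name, regs in edge_sets.items() if not regs)


# Assumed toy instruction set: loads and multiplier inputs use X/Y,
# multiplier and adder results and adder inputs use A/B.
OUT_SETS = {"ld1.out": {"X", "Y"}, "mul1.out": {"A", "B"}, "add.out": {"A", "B"}}
IN_SETS = {"mul1.in": {"X", "Y"}, "add.in": {"A", "B"}, "mul2.in": {"X", "Y"}}
EDGES = [("e1", "ld1.out", "mul1.in"),
         ("e3", "mul1.out", "add.in"),
         ("e5", "add.out", "mul2.in")]

if __name__ == "__main__":
    sets = edge_register_sets(EDGES, OUT_SETS, IN_SETS)
    print(sets["e1"], edges_without_register(sets))
```

In this toy case edge e5 comes out empty, because the adder writes A or B while the multiplier reads X or Y; such an edge is a candidate for a register copy operation.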
AND: register copy operations resolve mismatches
[Figure: a graph of ld, ×, - and + operations over steps s1 to s5, before and after copy insertion.]
– Before: e1:{X,Y}, e2:{X,Y}, e3:{A,B}, e4:{A,B}, e5:{∅}, e6:{∅}
– After inserting two cp operations: e1:{X,Y}, e2:{X,Y}, e3:{A,B}, e4:{A,B}, e5:{A,B}, e6:{A,B}, e7:{X,Y}, e8:{X,Y}
Operation of BAKE
• BAKE uses an Enhanced Genetic Algorithm (EGA) to “shower” the edges of reference code, until a feasible register assignment is found
• Observes many constraints:
– Every edge is assigned a register from a suitable class
– Restrictions imposed by parallelism: each instruction port can be serviced only by suitable register subclass
– Inter-instruction: same register used throughout the life of any edge, with no conflicting uses
– Intra-instruction: explicit “same/different”, “if-then”, and “iff” register assignment at different ports
Interactions Between Instructions – problem for BAKE
• Dependencies between ports may restrict allowable registers
[Figure: ports of connected instructions whose allowable register sets are disjoint.]
Combined Effect of Shake And Bake
[Figure: Reference Code → SHAKE: Find new Alignments → AND: Valid instructions → AND: Copy insertion → BAKE: Register Assignment → SELECT → Object Code.]
Flow through Shake and Bake
[Figure: Clean Machine → Reference code → SHAKE → AND (inserting cp operations) → BAKE → pool of candidate solutions. On No Feasible Solution: perform repair or increase schedule length.]
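Genetic Algorithm (GA)
[Figure: two candidates are selected from the pool by fitness (fi, fj), combined by crossover with large probability, changed by mutation with small probability, and placed in the candidate pool for the next generation. For scheduling, a chromosome holds an [asap,alap] step for each operation, subject to order constraints. For register assignment, it holds a register from a [constrained RegSet] for each edge, subject to many reg-port constraints.]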
[Figure: GA and EGA side by side, each selecting candidates by fitness (fi, fj) and producing the pool for the next generation. In the EGA:]
• crossover always performed; select best two (based on fewest violated constraints)
• always mutate violated constraint: alter a variable in each violated constraint
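The two EGA rules can be sketched for a toy register-assignment problem. This is one possible reading of the slide, not the authors' implementation: the edge sets, the two constraints, the pool size and all function names are assumptions made for illustration, and only “same” and “different” constraints are modelled.

```python
import random

# Registers allowed on each edge, and constraints between edges (assumed).
EDGE_SETS = {"e1": ["X", "Y"], "e2": ["X", "Y"], "e3": ["A", "B"], "e4": ["A", "B"]}
CONSTRAINTS = [("different", "e1", "e2"), ("same", "e3", "e4")]


def violated(assign):
    """List the constraints that an edge -> register assignment breaks."""
    bad = []
    for kind, a, b in CONSTRAINTS:
        same = assign[a] == assign[b]
        if (kind == "different" and same) or (kind == "same" and not same):
            bad.append((kind, a, b))
    return bad


def crossover(p1, p2, rng):
    """Uniform crossover: each edge takes its register from one parent."""
    return {e: (p1 if rng.random() < 0.5 else p2)[e] for e in EDGE_SETS}


def repair_mutation(assign, rng):
    """Alter one variable (edge) in each violated constraint."""
    child = dict(assign)
    for _, a, b in violated(assign):
        edge = rng.choice([a, b])
        child[edge] = rng.choice(EDGE_SETS[edge])
    return child


def ega(pool_size=8, generations=200, seed=0):
    rng = random.Random(seed)
    pool = [{e: rng.choice(regs) for e, regs in EDGE_SETS.items()}
            for _ in range(pool_size)]
    for _ in range(generations):
        pool.sort(key=lambda a: len(violated(a)))   # fewest violations first
        if not violated(pool[0]):
            break                                   # feasible assignment found
        best, second = pool[0], pool[1]             # select best two
        children = [repair_mutation(crossover(best, second, rng), rng)
                    for _ in range(pool_size - 2)]  # crossover always performed
        pool = [best, second] + children
    pool.sort(key=lambda a: len(violated(a)))
    return pool[0], len(violated(pool[0]))


if __name__ == "__main__":
    print(ega())
```

The difference from a plain GA lies in the mutation step: instead of changing a random gene with small probability, the sketch always changes a register that takes part in a violated constraint, so every mutation is aimed at the part of the candidate that is currently infeasible.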
Results for Biquad Filter
• Results generated by SAB at each step of the mapping process

Patterns?  DSP    SHAKE       AND       AND     BAKE      Steps in Schedule  Time per
                  Alignments  Feasible  Copies  Feasible  Best    Worst      schedule
No         M56K   30          29        0       29        11      13         3.0s
No         ST950  30          21        2-5     15        14      18         3.1s
Yes        M56K   30          30        0       29        8       11         2.0s
Yes        ST950  30          30        2       30        11      13         4.3s
Mapping Code to DSPs and ASIPs
• ASIPs differ in register number, type, and connectivity [Leupers 97]

Schedule Length
      M56K  ST950  ASIP1  ASIP2  ASIP3  ASIP4  ASIP5  ASIP6  ASIP7  ASIP8
T1    8     11     7      7      7      8      7      7      7      8
T2    7     9      6      6      6      6      6      6      6      7
T3    16    27     17     16     16     20     16     16     18     18
Shake and Bake in Other Contexts
[Figure: Instruction Set Capture yields descriptions of the Clean Machine and the Real Machine. The Source Program is compiled to Reference Code, which Shake & Bake maps to Object Code. An ISS Generator produces a Clean ISS and a Real ISS.]