Mapping Reference Code to Irregular DSPs within the Retargetable, Optimizing Compiler COGEN(t)

Gary Grewal and Tom Wilson
Dept. of Computing and Information Science
University of Guelph
Guelph, Ontario, Canada

Outline
1. The problem of generating code for highly encoded instruction sets: to optimize or retarget?
2. Our approach: make optimization more generic by ignoring some details until near the end
3. The idealized version of the processor (clean machine)
4. Quick tour of how we compile for the clean machine
5. Shake And Bake: the final mapping to the real processor
6. Inside Shake And Bake: enhanced genetic algorithms
7. Closing comments

The Problem with Highly Encoded Instruction Sets
• Code generation becomes more difficult with…
  – Instruction level parallelism (especially if constrained)
  – Non-orthogonal structure (overlapping roles and special cases)
  – Small, heterogeneous register classes
  – Instruction-based constraints on allowable register sets
  – Instruction-based interdependencies among registers
• Specialized processors have a niche
• Optimized code is usually required for their applications
• Retargetability is important during design exploration
  – Trying different existing processors
  – Developing an ASIP or extending an existing processor family

Typical DSP Architecture
[Figure: general DSP architecture showing register classes (accumulators Accum1 and Accum2, operand registers OpndReg1 and OpndReg2, two sets of address registers), dual data memories (X Memory and Y Memory), and register-to-register ALU operations.]

To Optimize or Retarget?
• Optimizing for a “difficult” instruction set often involves clever, specialized techniques that don’t adapt easily.
• Retargeting requires agile algorithms that often fail to find good enough code.
• The COGEN(t) Strategy:
  1. Optimize for a well-behaved target processor that resembles the real one, minus some ugly constraints
  2. Then map from the resulting code to code that satisfies all the constraints of the real processor

[Figure: the conventional flow (Source Program → Compiler → Object Program → Target Processor) compared with the COGEN(t) flow (Source Program → Compiler → Reference Code for the Clean Machine → Shake And Bake → Object Program → Target Processor).]

Sample Loop Kernel

    double FIR_filter(in double A[], in double B[], in int tap)
    {
        int k;
        double sum = 0;
        for (k = 0; k < tap; k++)
            sum += A[k] * B[k];
        return sum;
    }

         ALU                X-Memory        Y-Memory
    [1]  CLR   A            X:(R0)+, X0     Y:(R4)+, Y0
    [2]  REP   #N-1
    [3]  MAC   X0, Y0, A    X:(R0)+, X0     Y:(R4)+, Y0
    [4]  MACR  X0, Y0, A

Parallel Instruction Template
[Figure: the parallel instruction MAC X0,Y0,A  X:(R0)+,X0  Y:(R4)+,Y0 drawn as a template of basic operations (++, ld, mac, ld, ++), each with input ports and output ports.]

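Domains relate to specific units of the processor
[Figure: the basic operations of MAC X0,Y0,A  X:(R0)+,X0  Y:(R4)+,Y0 overlaid on the DSP architecture. The ++ and ld operations on R0 and X0, the mac operation on X0, Y0 and A, and the ld and ++ operations on R4 and Y0 are each placed on the corresponding unit. Units shown: Addr Regs, Opnd1, Opnd2, Accum1, Accum2, X Mem, Y Mem.]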

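Domains and Allowable Register Sets
[Figure: the template (++, ld, macc, ld, ++) with the allowable register set marked at its ports. Register labels shown: “R4, R5, R6, R7” and “A, B, Y0, Y1, R0, R1, R2, R3 R4, R5, R6, R7”.]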

Register-to-Port Constraints
[Figure: the same template (++, ld, macc, ld, ++) with register sets at its ports (“R4, R5, R6, R7” and “A, B, Y0, Y1, R0, R1, R2, R3 R4, R5, R6, R7”), shown next to the concrete instruction MAC X0, Y0, A  X:(R0)+, X0  Y:(R4)+, Y0, in which the ports are bound to R0, X0, A, Y0 and R4.]

Effect of Parallelism on Constraints
[Figure: instruction templates built from ×, ld, + and st, with numbered ports (P1, P2, ...) and the register set allowed at each port.]
Register sets shown, one line per template:
• P1 = {A0, A1, R0, R1, L0, L1, P}, P2 = {A0, A1, R0, R1}, P3 = {P}; {R0, A0} incompatible, {R1, A1} incompatible, {A0, A1} incompatible
• P1 = {L0, L1}, P2 = {R0, R1}, P3 = {P}, P4 = {R0, R1}; P4 = P2 (same register)
• P1 = {A0, A1}, P2 = {P}, P3 = {L0, L1}, P4 = {R0, R1}, P5 = {A0, A1}, P6 = {A0, A1}, P7 = {P}
• P1 = {L0, L1}, P2 = {L0, L1}, P3 = {R0, R1}, P4 = {P}, P5 = {R0, R1}
Constraints:
• if (P3 = L0) then P5 = A0
• if (P3 = L1) then P5 = A1
• P1 = P6 (same register)

Clean Machine – operator parallelism
• Each domain supports certain basic ops
• Ops from different domains may be used in parallel without restriction

Alignment of basic operations
[Figure: basic operations b1, b2, b3 from domains d1, d2, d3, each with register ports (r11 ... r33), aligned side by side at steps 1 to N of the schedule. Example operations shown: ld, st, *, +.]

Clean Machine – register sets
• Each basic operation in a domain has a set of allowable registers for each port
• These register sets are unaffected by parallelism (they remain the same whether the op is isolated or used in parallel)
• Register sets that map to ports of the same instruction have no special external constraints between them
[Figure: basic operations b1, b2, b3 with port register sets (RegSet1, RegSet2) on ports r12, r22, r31, r32, r33; the sets for b2 are the same when b2 is shown alone.]

Sample Loop Kernel – many connected instructions
[Figure: the FIR_filter loop kernel and its four-instruction schedule (CLR, REP, MAC, MACR with parallel X-memory and Y-memory loads) redrawn as a graph of connected basic operations: chains of ++ and ld on the X-memory and Y-memory sides feeding clr, mac, mac and macr.]

The COGEN(t) Retargetable Code Generator
• Patterns map basic ops into supported ops, restructure
• Basic blocks partitioned into traces, inner loops first
• Schedule mainstream ops and bind ops to domains using an “enhanced genetic algorithm” (EGA)
• Spills introduced to meet register availability
• Memory-resident values mapped to final memory
• Address registers and address update code inserted
• Relational database supplies intermediate storage:
  – powerful search capability
  – easy to maintain pools of alternatives

Note: the scheduler finds the “optimum” schedule by successive discovery of “feasible” length-constrained schedules
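The note on length-constrained scheduling can be illustrated with a small sketch. It is not COGEN(t)'s scheduler: the slides use an EGA to find feasible schedules, whereas this stand-in uses brute-force enumeration, and it assumes the length bound is grown from 1 upward. The function names, the op names and the domain labels are all hypothetical.

```python
from itertools import product

def feasible_schedule(ops, deps, domain_of, length):
    """Return a step assignment that fits in `length` steps, or None.

    Brute force over all placements; only sensible for tiny traces.
    """
    for steps in product(range(length), repeat=len(ops)):
        at = dict(zip(ops, steps))
        # Ordering: a producer must sit in an earlier step than its consumer.
        if any(at[a] >= at[b] for a, b in deps):
            continue
        # At most one op per domain per step.
        slots = [(at[o], domain_of[o]) for o in ops]
        if len(slots) == len(set(slots)):
            return at
    return None

def shortest_schedule(ops, deps, domain_of):
    """Grow the length bound until a feasible schedule is discovered."""
    for length in range(1, len(ops) + 1):
        found = feasible_schedule(ops, deps, domain_of, length)
        if found is not None:
            return length, found
    return None

# Hypothetical trace: two loads feed a multiply, then an add and a store.
ops = ["ldx", "ldy", "mul", "add", "st"]
deps = {("ldx", "mul"), ("ldy", "mul"), ("mul", "add"), ("add", "st")}
domain_of = {"ldx": "X", "ldy": "Y", "mul": "ALU", "add": "ALU", "st": "X"}
length, schedule = shortest_schedule(ops, deps, domain_of)
```

The first length at which a feasible schedule exists is reported as the optimum; here the dependence chain ldx → mul → add → st forces four steps.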

[Pipeline: Compiler → Reference Code → Shake And Bake → Object Program]

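Pool Methodology
[Figure: the EGA fills a candidate pool in which every candidate is feasible. Each candidate is scored against several criteria, each multiplied by a weight (W1, W2, W3) that expresses the relative importance of that factor; the weights are adjusted for the architecture and the type of trace. The candidate with the maximum value (among V1 ... V4) is selected. Others may be kept as backups.]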
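A minimal sketch of the weighted selection in the pool figure, assuming each candidate's value is a plain weighted sum of its criteria. The criteria names, scores and weights below are made up for illustration; the slides do not name the criteria.

```python
def rank_pool(pool, weights):
    """Order feasible candidates by sum(W_k * criterion_k), best first."""
    def value(candidate):
        return sum(weights[name] * candidate["criteria"][name] for name in weights)
    return sorted(pool, key=value, reverse=True)

# Hypothetical criteria and weights (W1, W2, W3); in the slides the weights
# are adjusted for the architecture and the type of trace.
weights = {"parallelism": 3, "spare_registers": 2, "few_copies": 1}
pool = [
    {"id": "s1", "criteria": {"parallelism": 4, "spare_registers": 1, "few_copies": 2}},
    {"id": "s2", "criteria": {"parallelism": 2, "spare_registers": 5, "few_copies": 1}},
    {"id": "s3", "criteria": {"parallelism": 3, "spare_registers": 3, "few_copies": 3}},
]

# The maximum-value candidate is taken; the rest stay available as backups.
best, *backups = rank_pool(pool, weights)
```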

Reference Code produced by COGEN(t)
• In every trace, the code observes constraints:
  – At most one basic op per domain per instruction (1)
  – The assignment of ops to domains is appropriate (2)
  – All necessary orderings are observed (3)
  Note: all parallel domain combinations (alignments) are allowed in the clean machine
• In every trace, the code consumes the fewest possible steps and is as parallel as possible
• At every step of every trace, the number of live variables does not exceed the number of suitable registers
  Note: actual registers are assigned later by BAKE
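Constraints (1) to (3) above can be checked mechanically. The sketch below assumes a trace is stored as a list of steps, each step a list of (domain, op) pairs; that representation, the function name and the op names are hypothetical, not COGEN(t)'s data model.

```python
def check_reference_code(steps, allowed_domains, deps):
    """Return a list of violations of constraints (1)-(3) for one trace."""
    problems = []
    step_of = {}
    for i, step in enumerate(steps):
        domains = [d for d, _ in step]
        # (1) at most one basic op per domain per instruction
        for d in sorted(set(domains)):
            if domains.count(d) > 1:
                problems.append(f"step {i}: domain {d} used more than once")
        for d, op in step:
            # (2) the op must be bound to a domain that supports it
            if d not in allowed_domains[op]:
                problems.append(f"step {i}: {op} cannot run in domain {d}")
            step_of[op] = i
    # (3) every producer is scheduled strictly before its consumer
    for a, b in deps:
        if step_of[a] >= step_of[b]:
            problems.append(f"{a} must precede {b}")
    return problems

allowed_domains = {"ld_a": {"X"}, "ld_b": {"Y"}, "mul": {"ALU"}, "add": {"ALU"}}
deps = [("ld_a", "mul"), ("ld_b", "mul"), ("mul", "add")]

good = [[("X", "ld_a"), ("Y", "ld_b")], [("ALU", "mul")], [("ALU", "add")]]
bad = [[("X", "ld_a"), ("Y", "ld_b")], [("ALU", "mul"), ("ALU", "add")]]

good_report = check_reference_code(good, allowed_domains, deps)
bad_report = check_reference_code(bad, allowed_domains, deps)
```

The second trace packs mul and add into the same ALU slot, so it breaks both (1) and (3).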


[Figure: reference-code schedule drawn as a grid of basic operations labelled L, M, S and +, captioned “No Register Conflicts”.]

Task of Shake And Bake
• General objective: map reference code to the actual target instruction set
• Specifically:
  – Perturb the reference code, so that alignments of basic operations all correspond to actual machine instructions
  – Find an actual register assignment for all live variables, so that the encoding constraints of the target instruction set are observed


Overview of Shake And Bake
• SHAKE generates a pool of alternative reference code sequences, equivalent to the original reference code, but with basic ops aligned differently
• AND verifies that every alignment in the reference code corresponds to an actual (parallel) instruction
• AND inserts register copy operations on edges having no feasible register assignment
• BAKE discovers (if possible) a feasible register assignment meeting all constraints


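Action of SHAKE: produce pool of candidate alignments
[Figure: the basic operations st, +, ld, *, ld, + placed in the X-Mem, ALU and Y-mem columns of the schedule in several alternative alignments.]

Action of AND: map aligned operations to real instructions
[Figure: aligned operations (for example + with ld, st, * with +, ld, and * with ld) matched against instructions of the real machine.]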
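The two actions can be sketched together, under simplifying assumptions: SHAKE is modelled as enumerating every placement of each op inside an [asap, alap] step window that keeps the dependence order, and AND's instruction check as a lookup of each step's op kinds in a table of legal combinations. The template table below is invented for illustration and does not describe a real DSP.

```python
from itertools import product

def shake(windows, deps):
    """Yield alignments: each op is placed at a step inside its [asap, alap] window."""
    ops = list(windows)
    ranges = [range(lo, hi + 1) for lo, hi in (windows[o] for o in ops)]
    for steps in product(*ranges):
        at = dict(zip(ops, steps))
        if all(at[a] < at[b] for a, b in deps):
            yield at

def and_verify(alignment, kind_of, templates):
    """True if the op kinds aligned at every step form a real instruction."""
    by_step = {}
    for op, step in alignment.items():
        by_step.setdefault(step, []).append(kind_of[op])
    return all(tuple(sorted(kinds)) in templates for kinds in by_step.values())

# Hypothetical trace: two loads feed a multiply.
windows = {"ld1": (0, 1), "ld2": (0, 1), "mul": (1, 2)}
deps = [("ld1", "mul"), ("ld2", "mul")]
kind_of = {"ld1": "ld", "ld2": "ld", "mul": "mul"}

# Hypothetical real machine: ld and mul may pair up, but two loads may not.
templates = {("ld",), ("mul",), ("ld", "mul")}

pool = list(shake(windows, deps))
feasible = [a for a in pool if and_verify(a, kind_of, templates)]
```

Of the five alignments SHAKE produces here, only the two that keep the loads in different steps survive the AND check.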

Effect of Different Alignments
[Figure: the same operations (ld, ld, ×, +, ×) in two different alignments, with the possible register assignment { } marked on each edge.]
• First alignment: e1:{L0, L1, A0H, A1H}, e2:{R0, R1, A0H, A1H}, e3:{∅}, e4:{R0, R1}, e5:{L0, L1}, e6:{A0H, A1H}
• Second alignment: e1:{L0, L1}, e3:{P}, e4:{R0, R1}, e5:{L0, L1, A0H, A1H}, e6:{A0H, A1H}

The effect of relaxing parallelism
[Figure: four schedules, (a) to (d), of the same ×, ld, st, +, mac operations spread over steps 1 to 6, with register mismatches marked.]

AND verifies instructions (feasible alignments) and determines register sets for each connecting edge
• Intersect the register sets allowed at edge sources and destinations
[Figure: operations opi, oph, opj, opk joined by edges.]
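The intersection step amounts to a set operation. In this sketch an edge has one source port and one or more destination ports; the register names follow the earlier slides, but the specific port sets are chosen only for illustration.

```python
def edge_register_set(source_set, destination_sets):
    """Registers usable on an edge: allowed at the source and at every destination."""
    allowed = set(source_set)
    for regs in destination_sets:
        allowed &= set(regs)
    return allowed

# The producer may write A0, A1, R0 or R1; the consumer reads R0, R1, L0 or L1.
shared = edge_register_set({"A0", "A1", "R0", "R1"}, [{"R0", "R1", "L0", "L1"}])

# Disjoint sets leave no feasible register for the edge.
empty = edge_register_set({"A", "B"}, [{"X", "Y"}])
```

An empty result corresponds to an edge marked {∅} in the figures.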

AND: register copy operations resolve mismatches
[Figure: a schedule (steps s1 to s5) of ld, ld, ×, -, +, × shown before and after copy insertion.]
• Before: e1:{X,Y}, e2:{X,Y}, e3:{A,B}, e4:{A,B}, e5:{∅}, e6:{∅}
• After inserting two cp operations: e1:{X,Y}, e2:{X,Y}, e3:{A,B}, e4:{A,B}, e5:{A,B}, e6:{A,B}, e7:{X,Y}, e8:{X,Y}
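Copy insertion can be sketched as splitting every edge whose source and destination sets do not intersect. This assumes a cp operation can read any register of the source set and write any register of the destination set, which the slides do not state; the edge representation and the cp naming are likewise hypothetical.

```python
def insert_copies(edges):
    """Split each infeasible edge src -> dst into src -> cp -> dst.

    edges: list of (src_op, dst_op, src_regs, dst_regs).
    Returns a list of (src_op, dst_op, allowed_regs).
    """
    out = []
    copies = 0
    for src, dst, src_regs, dst_regs in edges:
        common = set(src_regs) & set(dst_regs)
        if common:
            out.append((src, dst, common))
        else:
            copies += 1
            cp = f"cp{copies}"
            # Each half of the split edge keeps the set of its own endpoint.
            out.append((src, cp, set(src_regs)))
            out.append((cp, dst, set(dst_regs)))
    return out

edges = [
    ("mul", "add", {"A", "B"}, {"A", "B"}),    # feasible as is
    ("add", "mul2", {"A", "B"}, {"X", "Y"}),   # mismatch: needs a copy
]
repaired = insert_copies(edges)
```

Each copy adds an operation to the schedule, which is why the number of copies is tracked as a cost in the results.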

Operation of BAKE
• BAKE uses an Enhanced Genetic Algorithm (EGA) to “shower” the edges of the reference code, until a feasible register assignment is found
• Observes many constraints:
  – Every edge is assigned a register from a suitable class
  – Restrictions imposed by parallelism: each instruction port can be serviced only by a suitable register subclass
  – Inter-instruction: same register used throughout the life of any edge, with no conflicting uses
  – Intra-instruction: explicit “same/different”, “if-then”, and “iff” register assignment at different ports
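A hedged sketch of how such constraints might be evaluated for one candidate register assignment. It covers class membership, “same”, “different” and “if-then” only; the function name and data layout are assumptions, and the example reuses register sets and constraints of the kind shown on the parallelism slide.

```python
def violations(assign, allowed, same=(), different=(), if_then=()):
    """List the constraints that a register assignment violates.

    assign:    variable (edge or port) -> register
    allowed:   variable -> set of allowable registers
    same:      pairs that must share a register
    different: pairs that must not share a register
    if_then:   (a, reg_a, b, reg_b), meaning: if a gets reg_a then b must get reg_b
    """
    bad = []
    for var, regs in allowed.items():
        if assign[var] not in regs:
            bad.append(("class", var))
    for a, b in same:
        if assign[a] != assign[b]:
            bad.append(("same", a, b))
    for a, b in different:
        if assign[a] == assign[b]:
            bad.append(("different", a, b))
    for a, reg_a, b, reg_b in if_then:
        if assign[a] == reg_a and assign[b] != reg_b:
            bad.append(("if-then", a, b))
    return bad

allowed = {"P1": {"A0", "A1"}, "P3": {"L0", "L1"},
           "P5": {"A0", "A1"}, "P6": {"A0", "A1"}}
same = [("P1", "P6")]
if_then = [("P3", "L0", "P5", "A0"), ("P3", "L1", "P5", "A1")]

ok = violations({"P1": "A0", "P3": "L0", "P5": "A0", "P6": "A0"},
                allowed, same=same, if_then=if_then)
broken = violations({"P1": "A0", "P3": "L1", "P5": "A0", "P6": "A1"},
                    allowed, same=same, if_then=if_then)
```

The length of the returned list gives a natural fitness measure: an assignment is feasible when the list is empty.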

Interactions Between Instructions – problem for BAKE
• Dependencies between ports may restrict allowable registers
[Figure: register sets at dependent ports, one pair labelled “Disjoint”.]

Combined Effect of Shake And Bake
[Figure: Reference Code → SHAKE: find new alignments → AND: valid instructions → AND: copy insertion → BAKE: register assignment → SELECT → Object Code.]

Flow through Shake And Bake
[Figure: reference code for the clean machine passes through SHAKE, AND (which inserts cp operations) and BAKE, producing a pool of candidate solutions with fitness values fi, fj. If there is no feasible solution, perform repair or increase the schedule length.]

Genetic Algorithm (GA)
• Candidate pool; each candidate has a fitness (fi, fj)
• Select; crossover with large probability; mutation with small probability
• Result: candidate pool for the next generation
• Encodings:
  – Operations → step within [asap, alap], subject to order constraints
  – Edges → registers from a [constrained RegSet], subject to many reg-port constraints

GA versus EGA
• GA: select by fitness (fi, fj), then crossover and mutate as above
• EGA:
  – Select the best two (based on fewest violated constraints)
  – Crossover always performed
  – Always mutate violated constraints: alter a variable in each violated constraint
  – Result: pool for the next generation
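The EGA loop can be sketched as follows, as a minimal reading of the three rules above rather than the authors' implementation. The pool size, the generation limit, single-point crossover, rebuilding the whole pool from the two best parents, and the example constraints are all assumptions.

```python
import random

def violated(assign, constraints):
    """Constraints are (variables, predicate) pairs; return those that fail."""
    return [c for c in constraints if not c[1](assign)]

def ega(domains, constraints, pool_size=20, generations=200, seed=0):
    rng = random.Random(seed)
    names = list(domains)
    pool = [{v: rng.choice(domains[v]) for v in names} for _ in range(pool_size)]
    for _ in range(generations):
        # Fitness: fewest violated constraints first.
        pool.sort(key=lambda a: len(violated(a, constraints)))
        if not violated(pool[0], constraints):
            break
        p1, p2 = pool[0], pool[1]                  # select the best two
        children = []
        for _ in range(pool_size):
            cut = rng.randrange(1, len(names)) if len(names) > 1 else 0
            # Crossover is always performed (single point, an assumption).
            child = {v: (p1 if i < cut else p2)[v] for i, v in enumerate(names)}
            # Always mutate: alter one variable of each violated constraint.
            for variables, _ in violated(child, constraints):
                v = rng.choice(variables)
                child[v] = rng.choice(domains[v])
            children.append(child)
        pool = children
    best = min(pool, key=lambda a: len(violated(a, constraints)))
    return best, len(violated(best, constraints))

# Hypothetical edges with allowable registers and two constraints.
domains = {"e1": ["L0", "L1"], "e2": ["R0", "R1"],
           "e3": ["A0", "A1"], "e4": ["A0", "A1"]}
constraints = [
    (("e3", "e4"), lambda a: a["e3"] != a["e4"]),                  # different registers
    (("e1", "e3"), lambda a: a["e1"] != "L0" or a["e3"] == "A0"),  # if e1 = L0 then e3 = A0
]
best, remaining = ega(domains, constraints)
```

Directing every mutation at a variable of a violated constraint is what distinguishes this from the plain GA, where mutation is rare and blind.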

Results for Biquad Filter
• Results generated by SAB at each step of the mapping process

Patterns?  DSP    SHAKE       AND       AND     BAKE      Steps in Schedule  Time per
                  Alignments  Feasible  Copies  Feasible  Best     Worst     schedule
No         M56K   30          29        0       29        11       13        3.0s
No         ST950  30          21        2-5     15        14       18        3.1s
Yes        M56K   30          30        0       29        8        11        2.0s
Yes        ST950  30          30        2       30        11       13        4.3s

Mapping Code to DSPs and ASIPs
• ASIPs differ in register number, type, and connectivity [Leupers 97]

Schedule Length
Targets as labelled on the slide: ASIP1, ASIP2, ASIP3, ASIP4, ASIP5, M56K, ST950, ASIP6, ASIP7, ASIP8
T1:  8  11   7   7   7   8   7   7   7   8
T2:  7   9   6   6   6   6   6   6   6   7
T3: 16  27  17  16  16  20  16  16  18  18

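Shake And Bake in Other Contexts
[Figure: Instruction Set Capture produces descriptions of the Clean Machine and the Real Machine. The Source Program is compiled for the clean machine into Reference Code, which Shake & Bake maps to Object Code. An ISS Generator produces a Clean ISS and a Real ISS.]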
