TAO: Two-level Atomicity for Dynamic Binary Optimizations

2 downloads 0 Views 1MB Size Report
Apr 26, 2010 - Mauricio Breternitz Jr., Shiliang Hu, *Esfir Natanzon, *Shai Rotem, *Roni Rosner. MPR/Programming Systems Lab *MCD/Microprocessor ...
道 ... a 'way', 'path', often used to signify the true nature of the world

TAO: Two-level Atomicity for Dynamic Binary Optimizations Edson Borin, Youfeng Wu, Cheng Wang, Wei Liu, Mauricio Breternitz Jr., Shiliang Hu, *Esfir Natanzon, *Shai Rotem, *Roni Rosner MPR/Programming Systems Lab

Intel Labs

*MCD/Microprocessor

Architecture

Intel Architecture Group Intel Corporation April 26, 2010

Binary Compilation Technology

Agenda y Atomic execution support y Optimization scope vs rollback penalty y Two level atomicity y Preliminary results y Conclusions & Future Work

2

Binary Compilation Technology

Atomicity and Binary Optimizations y Binary optimizations are very limited without atomicity support – Many optimizations are not allowed: cannot reorder load/load, store/store, early load/later store – Many hard issues: memory accesses across cache line boundary, non-cacheable memory operation, precise exception, alias speculation, etc

y Atomic regions – Increase the optimization opportunities – Address many tough issues – Simplify the optimizations design

3

Binary Compilation Technology

Atomic Region Scope y Small atomicity scope – Traces, super-blocks, basic blocks, etc – Fast local optimizations, but small opportunities

y Large atomicity scope – Loops or large DAGs – Global optimizations, loop invariant hoisting, software pipelining, long range prefetch, etc – May suffer from resource overflow and large amount of work being discarded when rollback.

4

Binary Compilation Technology

Small Atomicity Scope: Loop body

B1 ckpt

cmit

r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] B2 r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2

B1 has the same EIP as B2 (region entry)

Loop body

B3 …

5

Binary Compilation Technology

Small Atomicity Scope: Loop body B1 t3 Å ld [r2]

B1 ckpt

cmit

r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] B2 r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2 B3 …

ckpt

cmit

r3 Å t3 r4 Å ld [r2 + r0*4 + 4] B2 r6 Å t3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2 B3 … Optimizations: LICM + copy propagation

6

Binary Compilation Technology

Small Atomicity Scope: Loop body Remote store may invalidate the load after commit B1 t3 Å ld [r2]

B1 ckpt

cmit

r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] B2 r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2 B3 …

ckpt

cmit

r3 Å t3 r4 Å ld [r2 + r0*4 + 4] B2 r6 Å t3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2 B3 … Optimizations: LICM + copy propagation

7

Binary Compilation Technology

Small Atomicity Scope: Loop body B1 t3 Å ld [r2]

B1 ckpt

cmit

r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] B2 r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2 B3 …

ckpt

cmit

r3 Å t3 r4 Å ld [r2 + r0*4 + 4] B2 r6 Å t3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2 B3 …

Commit requires precise architectural state: Partially dead store elimination is invalid

8

Binary Compilation Technology

Large Atomicity Scope: Whole loop ckpt B1 r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] B2 r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2

cmit

9

Whole loop

B3 …

Binary Compilation Technology

Large Atomicity Scope: Whole loop ckpt B1

ckpt r3 Å ld [r2] B1

r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] B2 r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2

cmit

r4 Å ld [r2 + r0*4 + 4] r6 Å r3 + r4 B2 r0 Å r0 + 1 if (r6 < r1) goto B2

B3 … cmit

10

B3 st [r2+4] Å r6 … Optimizations: LICM + PDSE + copy propagation Binary Compilation Technology

Rollback Penalty: Small Atomicity Scope Execution eip: 0x8 B1 …

eip: 0x8 B2 ckpt … … cmit B3 …

11

Exception

B1 B21 B22 … B2999 B21000

Binary Compilation Technology

Rollback Penalty: Small Atomicity Scope Execution eip: 0x8 B1 …

eip: 0x8 B2 ckpt … … cmit B3 …

Exception

B1 B21 B22 … B2999 B21000 Eip: 0x8 (B21000)

Resume from last checkpoint with non-opt. code eip: 0x8

12

Binary Compilation Technology

Rollback Penalty: Small Atomicity Scope Execution eip: 0x8 B1 …

eip: 0x8 B2 ckpt … … cmit B3 …

Exception

Resume optimized code execution

B1 B21 B22 … B2999 B21000 Eip: 0x8 (B21000) B1 B21001

Amount of work discarded by rollback = 1 loop iteration

13

Binary Compilation Technology

Rollback Penalty: Large Atomicity Scope eip: 0x8 B1 ckpt …

Execution

eip: 0x8 B2 … …

cmit

14

B3 …

Exception

B1 B21 B22 … B2999 B21000

Binary Compilation Technology

Rollback Penalty: Large Atomicity Scope Execution

eip: 0x8 B1 ckpt … eip: 0x8 B2 … …

cmit

Exception

B3 … Resume from last checkpoint with non-opt. code (first loop iteration)

15

B1 B21 Discarded 1000 B22 … iterations B2999 B21000 Eip: 0x8 (B21) B1 Resume optimized B22 code execution …

Binary Compilation Technology

Rollback Penalty: Large Atomicity Scope Execution

eip: 0x8 B1 ckpt … eip: 0x8 B2 … …

cmit

Exception

B3 …

Fail and discard again!

16

B1 B21 B22 … B2999 B21000 Eip: 0x8 (B21) B1 Discarded 999 B22 … iterations B21000 Eip: 0x8 (B22) B1 B23 … Binary Compilation Technology

Rollback Penalty: Large Atomicity Scope eip: 0x8 B1 ckpt …

Execution

eip: 0x8 B2 … …

cmit

B3 …

Exception

B1 B21 B22 … B2999 B21000 Eip: 0x8 (B21) B1 B22 … B21000 Eip: 0x8 (B22) B1 B23 …

Large amount of work discarded at rollbacks

17

Binary Compilation Technology

Small and Large Atomicity Scopes

Rollback Cost

+ Large Atomicity Scope

Small Atomicity Scope

-

18

Optimization opportunities

+

Binary Compilation Technology

Small and Large Atomicity Scopes

Rollback Cost

+ Large Atomicity Scope

TAO Small Atomicity Scope

Two Level Atomicity Scope

-

19

Optimization opportunities

+

Binary Compilation Technology

Two Level Atomicity Region checkpoint Local ckpt

Local cmit Region commit

B1 (0x8)

B2 (0x8) r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2

Whole loop 2nd Level

B3 …

1st Level Local cmit Buffer

20

Loop Body 1st Level

2nd Level Reg. cmit Buffer

Non-spec. State Binary Compilation Technology

Two Level Atomicity Reg. ckpt.

B1 (0x8)

Local ckpt

B2 (0x8)

Local cmit

r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2

Reg. ckpt.

B1 (0x8)

Local ckpt

B2 (0x8)

Local cmit

r3 Å ld [r2]

r4 Å ld [r2 + r0*4 + 4] r6 Å r3 + r4 r0 Å r0 + 1 if (r6 < r1) goto B2

B2-Fixup

B3 Reg. cmit

21

B3



Reg. cmit

Reg. cmit

st [r2+4] Å r6

st [r2+4] Å r6

Binary Compilation Technology

Two Level Atomicity Reg. ckpt.

B1 (0x8)

Local ckpt

B2 (0x8)

Local cmit

r3 Å ld [r2]

B1 B21 B22

r4 Å ld [r2 + r0*4 + 4] r6 Å r3 + r4 r0 Å r0 + 1 if (r6 < r1) goto B2

B2-Fixup B3

Reg. cmit

Execution

Reg. cmit

st [r2+4] Å r6

st [r2+4] Å r6 1st Level B22

22

2nd Level

Non-spec.

B21,B1

… Binary Compilation Technology

Two Level Atomicity Reg. ckpt.

B1 (0x8)

Local ckpt

B2 (0x8)

Local cmit

r3 Å ld [r2]

r4 Å ld [r2 + r0*4 + 4] r6 Å r3 + r4 r0 Å r0 + 1 if (r6 < r1) goto B2

Exception

Reg. cmit

st [r2+4] Å r6

st [r2+4] Å r6 1st Level B21000

23

B1 B21 B22 … B2999 B21000

B2-Fixup B3

Reg. cmit

Execution

2nd Level

Non-spec.

B2999, …, B21, B1

… Binary Compilation Technology

Two Level Atomicity Reg. ckpt. Local ckpt Local cmit

Reg. cmit

Execution

B1 (0x8) r3 Å ld [r2]

B1 B21 B2 (0x8) B22 … r4 Å ld [r2 + r0*4 + 4] B2999 Exception r6 Å r3 + r4 B21000 r0 Å r0 + 1 Reg. B2-Fixup if (r6 < r1) goto B2 cmit Eip: 0x8 (B2 1000) B2-Fixup B1 Reg. st [r2+4] Å r6 B21001 cmit B3 B21002 st [r2+4] Å r6

1st Level B21002

24

2nd Level B21001, B1

Non-spec. 0x8, B2-Fix, B2999, …, B1 Binary Compilation Technology

Two Level Atomicity Reg. ckpt. Local ckpt Local cmit

Reg. cmit

B1 (0x8)

Execution

r3 Å ld [r2]

B1 B21 B2 (0x8) B22 … r4 Å ld [r2 + r0*4 + 4] B2999 Exception r6 Å r3 + r4 B21000 r0 Å r0 + 1 Reg. B2-Fixup if (r6 < r1) goto B2 cmit Eip: 0x8 (B2 1000) B2-Fixup B1 Reg. st [r2+4] Å r6 B21001 cmit B3 B21002 st [r2+4] Å r6

Amount of work discarded by rollback = 1 loop iterations

25

Binary Compilation Technology

Preliminary Results y Implemented TAO in a cycle accurate simulator y 1st level atomicity is modeled with frame y 2nd level atomicity is modeled assuming unlimited speculative cache y Regions consist of frames y Global partial redundancy elimination (PRE) and dead code elimination (PDE) are implemented – PDSE not measured due to simulator issue – Global optimization overhead is not measured, although frame level HW optimizations are modeled

26

Binary Compilation Technology

Region and Frame Dynamic Sizes Frames Only

27

TAO Regions

Binary Compilation Technology

Region and Frame Coverage

28

Binary Compilation Technology

Performance Potential Frames Only

29

Additional from TAO Regions

Binary Compilation Technology

Conclusions & Future Work y Two Level Atomicity enlarges the Optimization Scope with limited rollback costs y Promising performance gains over frame level atomicity alone y More optimizations can boost the performance gain – Global fusion, etc

y May improve in-order co-designed processor by enabling more global scheduling y Hardware needs more investigation – Pipeline Buffers + Speculative Cache – 2-level speculative cache

30

Binary Compilation Technology

Questions?

31

Binary Compilation Technology

Suggest Documents