Apr 26, 2010 - Mauricio Breternitz Jr., Shiliang Hu, *Esfir Natanzon, *Shai Rotem, *Roni Rosner. MPR/Programming Systems Lab *MCD/Microprocessor ...
道 ... a 'way', 'path', often used to signify the true nature of the world
TAO: Two-level Atomicity for Dynamic Binary Optimizations Edson Borin, Youfeng Wu, Cheng Wang, Wei Liu, Mauricio Breternitz Jr., Shiliang Hu, *Esfir Natanzon, *Shai Rotem, *Roni Rosner MPR/Programming Systems Lab
Intel Labs
*MCD/Microprocessor
Architecture
Intel Architecture Group Intel Corporation April 26, 2010
Binary Compilation Technology
Agenda y Atomic execution support y Optimization scope vs rollback penalty y Two level atomicity y Preliminary results y Conclusions & Future Work
2
Binary Compilation Technology
Atomicity and Binary Optimizations y Binary optimizations are very limited without atomicity support – Many optimizations are not allowed: cannot reorder load/load, store/store, early load/later store – Many hard issues: memory accesses across cache line boundary, non-cacheable memory operation, precise exception, alias speculation, etc
y Atomic regions – Increase the optimization opportunities – Address many tough issues – Simplify the optimizations design
3
Binary Compilation Technology
Atomic Region Scope y Small atomicity scope – Traces, super-blocks, basic blocks, etc – Fast local optimizations, but small opportunities
y Large atomicity scope – Loops or large DAGs – Global optimizations, loop invariant hoisting, software pipelining, long range prefetch, etc – May suffer from resource overflow and large amount of work being discarded when rollback.
4
Binary Compilation Technology
Small Atomicity Scope: Loop body
B1 ckpt
cmit
r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] B2 r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2
B1 has the same EIP as B2 (region entry)
Loop body
B3 …
5
Binary Compilation Technology
Small Atomicity Scope: Loop body B1 t3 Å ld [r2]
B1 ckpt
cmit
r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] B2 r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2 B3 …
ckpt
cmit
r3 Å t3 r4 Å ld [r2 + r0*4 + 4] B2 r6 Å t3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2 B3 … Optimizations: LICM + copy propagation
6
Binary Compilation Technology
Small Atomicity Scope: Loop body Remote store may invalidate the load after commit B1 t3 Å ld [r2]
B1 ckpt
cmit
r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] B2 r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2 B3 …
ckpt
cmit
r3 Å t3 r4 Å ld [r2 + r0*4 + 4] B2 r6 Å t3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2 B3 … Optimizations: LICM + copy propagation
7
Binary Compilation Technology
Small Atomicity Scope: Loop body B1 t3 Å ld [r2]
B1 ckpt
cmit
r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] B2 r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2 B3 …
ckpt
cmit
r3 Å t3 r4 Å ld [r2 + r0*4 + 4] B2 r6 Å t3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2 B3 …
Commit requires precise architectural state: Partially dead store elimination is invalid
8
Binary Compilation Technology
Large Atomicity Scope: Whole loop ckpt B1 r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] B2 r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2
cmit
9
Whole loop
B3 …
Binary Compilation Technology
Large Atomicity Scope: Whole loop ckpt B1
ckpt r3 Å ld [r2] B1
r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] B2 r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2
cmit
r4 Å ld [r2 + r0*4 + 4] r6 Å r3 + r4 B2 r0 Å r0 + 1 if (r6 < r1) goto B2
B3 … cmit
10
B3 st [r2+4] Å r6 … Optimizations: LICM + PDSE + copy propagation Binary Compilation Technology
Rollback Penalty: Small Atomicity Scope Execution eip: 0x8 B1 …
eip: 0x8 B2 ckpt … … cmit B3 …
11
Exception
B1 B21 B22 … B2999 B21000
Binary Compilation Technology
Rollback Penalty: Small Atomicity Scope Execution eip: 0x8 B1 …
eip: 0x8 B2 ckpt … … cmit B3 …
Exception
B1 B21 B22 … B2999 B21000 Eip: 0x8 (B21000)
Resume from last checkpoint with non-opt. code eip: 0x8
12
Binary Compilation Technology
Rollback Penalty: Small Atomicity Scope Execution eip: 0x8 B1 …
eip: 0x8 B2 ckpt … … cmit B3 …
Exception
Resume optimized code execution
B1 B21 B22 … B2999 B21000 Eip: 0x8 (B21000) B1 B21001
Amount of work discarded by rollback = 1 loop iteration
13
Binary Compilation Technology
Rollback Penalty: Large Atomicity Scope eip: 0x8 B1 ckpt …
Execution
eip: 0x8 B2 … …
cmit
14
B3 …
Exception
B1 B21 B22 … B2999 B21000
Binary Compilation Technology
Rollback Penalty: Large Atomicity Scope Execution
eip: 0x8 B1 ckpt … eip: 0x8 B2 … …
cmit
Exception
B3 … Resume from last checkpoint with non-opt. code (first loop iteration)
15
B1 B21 Discarded 1000 B22 … iterations B2999 B21000 Eip: 0x8 (B21) B1 Resume optimized B22 code execution …
Binary Compilation Technology
Rollback Penalty: Large Atomicity Scope Execution
eip: 0x8 B1 ckpt … eip: 0x8 B2 … …
cmit
Exception
B3 …
Fail and discard again!
16
B1 B21 B22 … B2999 B21000 Eip: 0x8 (B21) B1 Discarded 999 B22 … iterations B21000 Eip: 0x8 (B22) B1 B23 … Binary Compilation Technology
Rollback Penalty: Large Atomicity Scope eip: 0x8 B1 ckpt …
Execution
eip: 0x8 B2 … …
cmit
B3 …
Exception
B1 B21 B22 … B2999 B21000 Eip: 0x8 (B21) B1 B22 … B21000 Eip: 0x8 (B22) B1 B23 …
Large amount of work discarded at rollbacks
17
Binary Compilation Technology
Small and Large Atomicity Scopes
Rollback Cost
+ Large Atomicity Scope
Small Atomicity Scope
-
18
Optimization opportunities
+
Binary Compilation Technology
Small and Large Atomicity Scopes
Rollback Cost
+ Large Atomicity Scope
TAO Small Atomicity Scope
Two Level Atomicity Scope
-
19
Optimization opportunities
+
Binary Compilation Technology
Two Level Atomicity Region checkpoint Local ckpt
Local cmit Region commit
B1 (0x8)
B2 (0x8) r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2
Whole loop 2nd Level
B3 …
1st Level Local cmit Buffer
20
Loop Body 1st Level
2nd Level Reg. cmit Buffer
Non-spec. State Binary Compilation Technology
Two Level Atomicity Reg. ckpt.
B1 (0x8)
Local ckpt
B2 (0x8)
Local cmit
r3 Å ld [r2] r4 Å ld [r2 + r0*4 + 4] r6 Å r3 + r4 st [r2+4] Å r6 r0 Å r0 + 1 if (r6 < r1) goto B2
Reg. ckpt.
B1 (0x8)
Local ckpt
B2 (0x8)
Local cmit
r3 Å ld [r2]
r4 Å ld [r2 + r0*4 + 4] r6 Å r3 + r4 r0 Å r0 + 1 if (r6 < r1) goto B2
B2-Fixup
B3 Reg. cmit
21
B3
…
Reg. cmit
Reg. cmit
st [r2+4] Å r6
st [r2+4] Å r6
Binary Compilation Technology
Two Level Atomicity Reg. ckpt.
B1 (0x8)
Local ckpt
B2 (0x8)
Local cmit
r3 Å ld [r2]
B1 B21 B22
r4 Å ld [r2 + r0*4 + 4] r6 Å r3 + r4 r0 Å r0 + 1 if (r6 < r1) goto B2
B2-Fixup B3
Reg. cmit
Execution
Reg. cmit
st [r2+4] Å r6
st [r2+4] Å r6 1st Level B22
22
2nd Level
Non-spec.
B21,B1
… Binary Compilation Technology
Two Level Atomicity Reg. ckpt.
B1 (0x8)
Local ckpt
B2 (0x8)
Local cmit
r3 Å ld [r2]
r4 Å ld [r2 + r0*4 + 4] r6 Å r3 + r4 r0 Å r0 + 1 if (r6 < r1) goto B2
Exception
Reg. cmit
st [r2+4] Å r6
st [r2+4] Å r6 1st Level B21000
23
B1 B21 B22 … B2999 B21000
B2-Fixup B3
Reg. cmit
Execution
2nd Level
Non-spec.
B2999, …, B21, B1
… Binary Compilation Technology
Two Level Atomicity Reg. ckpt. Local ckpt Local cmit
Reg. cmit
Execution
B1 (0x8) r3 Å ld [r2]
B1 B21 B2 (0x8) B22 … r4 Å ld [r2 + r0*4 + 4] B2999 Exception r6 Å r3 + r4 B21000 r0 Å r0 + 1 Reg. B2-Fixup if (r6 < r1) goto B2 cmit Eip: 0x8 (B2 1000) B2-Fixup B1 Reg. st [r2+4] Å r6 B21001 cmit B3 B21002 st [r2+4] Å r6
1st Level B21002
24
2nd Level B21001, B1
Non-spec. 0x8, B2-Fix, B2999, …, B1 Binary Compilation Technology
Two Level Atomicity Reg. ckpt. Local ckpt Local cmit
Reg. cmit
B1 (0x8)
Execution
r3 Å ld [r2]
B1 B21 B2 (0x8) B22 … r4 Å ld [r2 + r0*4 + 4] B2999 Exception r6 Å r3 + r4 B21000 r0 Å r0 + 1 Reg. B2-Fixup if (r6 < r1) goto B2 cmit Eip: 0x8 (B2 1000) B2-Fixup B1 Reg. st [r2+4] Å r6 B21001 cmit B3 B21002 st [r2+4] Å r6
Amount of work discarded by rollback = 1 loop iterations
25
Binary Compilation Technology
Preliminary Results y Implemented TAO in a cycle accurate simulator y 1st level atomicity is modeled with frame y 2nd level atomicity is modeled assuming unlimited speculative cache y Regions consist of frames y Global partial redundancy elimination (PRE) and dead code elimination (PDE) are implemented – PDSE not measured due to simulator issue – Global optimization overhead is not measured, although frame level HW optimizations are modeled
26
Binary Compilation Technology
Region and Frame Dynamic Sizes Frames Only
27
TAO Regions
Binary Compilation Technology
Region and Frame Coverage
28
Binary Compilation Technology
Performance Potential Frames Only
29
Additional from TAO Regions
Binary Compilation Technology
Conclusions & Future Work y Two Level Atomicity enlarges the Optimization Scope with limited rollback costs y Promising performance gains over frame level atomicity alone y More optimizations can boost the performance gain – Global fusion, etc
y May improve in-order co-designed processor by enabling more global scheduling y Hardware needs more investigation – Pipeline Buffers + Speculative Cache – 2-level speculative cache
30
Binary Compilation Technology
Questions?
31
Binary Compilation Technology