Examples: Part I
1:30-1:40  Welcome + DynamoRIO History
1:40-2:40  DynamoRIO Internals
2:40-3:00  Examples, Part 1
3:00-3:15  Break
3:15-4:15  DynamoRIO API
4:15-5:15  Examples, Part 2
5:15-5:30  Feedback
DynamoRIO Examples Part I Outline
• Common Steps of writing a DynamoRIO client
• Dynamic Instruction Counting Example
DynamoRIO Tutorial at MICRO 12 Dec 2009
Common Steps
• Step 1: Register Events
  – DR_EXPORT void dr_init(client_id_t id)

  Register Function               Events
  dr_register_bb_event            Basic Block Building
  dr_register_thread_init_event   Thread Initialization
  dr_register_exit_event          Process Exit

• Step 2: Implementation
  – Initialization
  – Finalization
  – Instrumentation
• Step 3: Optimization
  – Optimize the instrumentation to improve performance
Step 1: Register Events

    uint num_dyn_instrs;

    static void event_init(void);
    static void event_exit(void);
    static dr_emit_flags_t event_basic_block(void *drcontext, void *tag,
                                             instrlist_t *ilist,
                                             bool for_trace, bool translating);

    DR_EXPORT void dr_init(client_id_t id)
    {
        /* register events */
        dr_register_bb_event(event_basic_block);
        dr_register_exit_event(event_exit);
        /* process initialization event */
        event_init();
    }
Step 2: Implementation (I)

    static void event_init(void)
    {
        num_dyn_instrs = 0;
    }

    static void event_exit(void)
    {
        dr_printf("Total number of instructions executed: %u\n", num_dyn_instrs);
    }

    static dr_emit_flags_t event_basic_block(void *drcontext, void *tag,
                                             instrlist_t *ilist,
                                             bool for_trace, bool translating)
    {
        int num_instrs;
        num_instrs = ilist_num_instrs(ilist);
        insert_count_code(drcontext, ilist, num_instrs);
        return DR_EMIT_DEFAULT;
    }
Step 2: Implementation (II)

    static int ilist_num_instrs(instrlist_t *ilist)
    {
        instr_t *instr;
        int num_instrs = 0;
        /* iterate over instruction list to count number of instructions */
        for (instr = instrlist_first(ilist); instr != NULL;
             instr = instr_get_next(instr))
            num_instrs++;
        return num_instrs;
    }

    static void do_ins_count(int num_instrs)
    {
        num_dyn_instrs += num_instrs;
    }

    static void insert_count_code(void *drcontext, instrlist_t *ilist,
                                  int num_instrs)
    {
        dr_insert_clean_call(drcontext, ilist, instrlist_first(ilist),
                             do_ins_count, false, 1,
                             OPND_CREATE_INT32(num_instrs));
    }
Instrumented Basic Block

    # switch stack
    # switch aflags and errno
    # save all registers
    # call do_ins_count
    push $0x00000003
    call $0xb7ef73e4 (do_ins_count)
    # restore registers
    # switch aflags and errno back
    # switch stack back
    # application code
    add  $0x0000e574 %ebx → %ebx
    test %al $0x08
    jz   $0xb80e8a98
Step 3: Optimization (I): counter update inlining

    static void insert_count_code(void *drcontext, instrlist_t *ilist,
                                  int num_instrs)
    {
        instr_t *instr, *where;
        opnd_t opnd1, opnd2;

        where = instrlist_first(ilist);
        /* save aflags */
        dr_save_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
        /* num_dyn_instrs += num_instrs */
        opnd1 = OPND_CREATE_ABSMEM(&num_dyn_instrs, OPSZ_PTR);
        opnd2 = OPND_CREATE_INT32(num_instrs);
        instr = INSTR_CREATE_add(drcontext, opnd1, opnd2);
        instrlist_meta_preinsert(ilist, where, instr);
        /* restore aflags */
        dr_restore_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
    }
Instrumented Basic Block

    mov  %eax → %fs:0x0c
    lahf → %ah
    seto → %al
    add  $0x00000003, 0xb7d25030
    add  $0x7f %al → %al
    sahf %ah
    mov  %fs:0x0c → %eax
    # application code
    add  $0x0000e574 %ebx → %ebx
    test %al $0x08
    jz   $0xb7f14a98
Step 3: Optimization (II): aflags stealing

    static void insert_count_code(void *drcontext, instrlist_t *ilist,
                                  int num_instrs)
    {
        …
        save_aflags = aflags_analysis(ilist);
        /* save aflags */
        if (save_aflags)
            dr_save_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
        /* num_dyn_instrs += num_instrs */
        opnd1 = OPND_CREATE_ABSMEM(&num_dyn_instrs, OPSZ_PTR);
        opnd2 = OPND_CREATE_INT32(num_instrs);
        instr = INSTR_CREATE_add(drcontext, opnd1, opnd2);
        instrlist_meta_preinsert(ilist, where, instr);
        /* restore aflags */
        if (save_aflags)
            dr_restore_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
    }
Instrumented Basic Block

    add  $0x00000003, 0xb7d25030
    # application code
    add  $0x0000e574 %ebx → %ebx
    test %al $0x08
    jz   $0xb7f14a98
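The body of aflags_analysis is elided on the slide. The following is only a minimal sketch of what such an analysis could look like, written against a simplified instruction model rather than the DynamoRIO API. The type flag_use_t, the mask ALL_AFLAGS, and the function aflags_need_save are hypothetical names introduced for illustration. The idea: scan forward from the insertion point; if the application overwrites every arithmetic flag before reading any of them, the inserted add may clobber the flags without a save/restore.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one bit per arithmetic flag (CF, PF, AF, ZF, SF, OF). */
#define ALL_AFLAGS 0x3fu

typedef struct {
    unsigned reads;  /* flags this instruction reads */
    unsigned writes; /* flags this instruction writes */
    bool is_cti;     /* control transfer: the block may be left here */
} flag_use_t;

/* Returns true if the flags must be saved around inserted code placed
 * before instrs[0], false if they are dead at that point. */
bool aflags_need_save(const flag_use_t *instrs, size_t n)
{
    unsigned live = ALL_AFLAGS; /* flags whose incoming value may still be read */
    for (size_t i = 0; i < n; i++) {
        if (instrs[i].reads & live)
            return true;          /* incoming value is read: must preserve it */
        live &= ~instrs[i].writes;
        if (live == 0)
            return false;         /* every flag overwritten before any read */
        if (instrs[i].is_cti)
            return true;          /* leaving the block: be conservative */
    }
    return true;                  /* ran off the end: be conservative */
}
```

In the block shown above, the first application instruction is an add, which writes all six flags and reads none, so this sketch would report that no save is needed; that matches the instrumented block containing only the counter update.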
Step 3: Optimization (III): more optimizations
• Using lea (load effective address) instead of add
    lea [%reg, num_instr] → %reg
• Register liveness analysis
  – Using dead register to avoid register save/restore for lea
• Global aflags/registers analysis
  – Analyze aflags/registers liveness over CFG
• Trace Optimization
  – Trace: single-entry multi-exit
  – Update counters only at trace exits
Other Issues
• Data race on counter update in multithreaded programs
  – Global lock for every update
  – Atomic update (lock prefixed add)
    • LOCK(instr);
  – Thread private counter
    • Thread-private code cache: different variable at different address
    • Thread-shared code cache: thread local storage
• 32-bit counter overflow
  – 64-bit counter:
    • Two instructions on 32-bit architecture: add, adc
  – One 32-bit local counter and one 64-bit global counter
    • Instrument to update 32-bit local counter
    • Update 64-bit global counter using timer interrupt
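The two overflow remedies can be illustrated in plain C. This is a sketch under stated assumptions, not client code: counter64_t, counter64_add, counter64_value, and flush_local are hypothetical names. counter64_add mimics the add/adc pair on a 32-bit machine by propagating the carry by hand, and flush_local mimics the periodic step that folds a cheap 32-bit local counter into a 64-bit global one.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit counter held as two 32-bit halves, as on a 32-bit architecture. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} counter64_t;

/* Equivalent of "add lo, n" followed by "adc hi, 0". */
void counter64_add(counter64_t *c, uint32_t n)
{
    uint32_t old = c->lo;
    c->lo += n;                 /* add: may wrap around */
    c->hi += (c->lo < old);     /* adc: add the carry out of the low half */
}

uint64_t counter64_value(const counter64_t *c)
{
    return ((uint64_t)c->hi << 32) | c->lo;
}

/* Fold a 32-bit local counter into a 64-bit global counter and reset it.
 * This must run often enough that the local counter cannot wrap in between. */
void flush_local(uint32_t *local, uint64_t *global)
{
    *global += *local;
    *local = 0;
}
```

The second scheme keeps the inlined instrumentation to a single 32-bit add, at the cost of needing some periodic event to perform the flush.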
Examples: Part 2
Larger Examples
• Dynamic Optimization
  – Strength Reduction
  – Software Prefetch
• Profiling
  – Pipelined Profiling and Analysis
• Shadow Memory
  – Umbra
  – Millions of Watchpoints
• Dr. Memory
Dynamic Optimization Opportunities
• Traditional compiler optimizations
  – Compiler has limited view: application assembled at runtime
  – Some shipped products are built without optimizations
• Microarchitecture-specific optimizations
  – Feature set and relative performance of instructions varies
  – Combinatorial blowup if done statically
• Adaptive optimizations
  – Need runtime information: prior profiling runs not always representative
Dynamic Optimization in DynamoRIO
• Traces are natural unit for optimization
  – Focus only on hot code
  – Cross procedure, file and module boundaries
• Linear control flow
  – Single-entry, multi-exit simplifies analysis
• Support for adaptive optimization
  – Can replace traces dynamically
Strength Reduction: inc to add
• On Pentium 4, inc is slower than add 1 (and dec is slower than sub 1)
• Opposite is true on Pentium 3
• Microarchitecture-specific optimization best performed dynamically
    DR_EXPORT void dr_init(client_id_t id)
    {
        /* Pentium 4? Only then is the replacement worthwhile. */
        if (proc_get_family() == FAMILY_PENTIUM_IV)
            dr_register_trace_event(event_trace);
    }

    static void event_trace(void *drcontext, app_pc tag,
                            instrlist_t *trace, bool xl8)
    {
        instr_t *instr, *next_instr;
        int opcode;
        /* Look for inc / dec */
        for (instr = instrlist_first(trace); instr != NULL; instr = next_instr) {
            /* fetch the successor first: instr may be replaced and destroyed */
            next_instr = instr_get_next(instr);
            opcode = instr_get_opcode(instr);
            if (opcode == OP_inc || opcode == OP_dec)
                replace_inc_with_add(drcontext, instr, trace);
        }
    }

    static bool replace_inc_with_add(void *drcontext, instr_t *instr,
                                     instrlist_t *trace)
    {
        instr_t *in;
        uint eflags;
        int opcode = instr_get_opcode(instr);
        bool ok_to_replace = false;

        /* Ensure eflags change ok: add/sub write CF, inc/dec do not, so CF
         * must be overwritten before it is read or the trace is exited. */
        for (in = instr; in != NULL; in = instr_get_next(in)) {
            eflags = instr_get_arith_flags(in);
            if ((eflags & EFLAGS_READ_CF) != 0)
                return false;
            if ((eflags & EFLAGS_WRITE_CF) != 0) {
                ok_to_replace = true;
                break;
            }
            if (instr_is_exit_cti(in))
                return false;
        }
        if (!ok_to_replace)
            return false;

        /* Replace with add / sub */
        if (opcode == OP_inc)
            in = INSTR_CREATE_add(drcontext, instr_get_dst(instr, 0),
                                  OPND_CREATE_INT8(1));
        else
            in = INSTR_CREATE_sub(drcontext, instr_get_dst(instr, 0),
                                  OPND_CREATE_INT8(1));
        instr_set_prefixes(in, instr_get_prefixes(instr));
        instrlist_replace(trace, instr, in);
        instr_destroy(drcontext, instr);
        return true;
    }
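The eflags check is the subtle part: inc and dec leave the carry flag untouched, while add and sub overwrite it, so the replacement is only safe when the old CF value is dead. The sketch below restates that scan over a simplified, hypothetical instruction model (cf_use_t and inc_to_add_is_safe are illustrative names, not DynamoRIO API) so the three exit conditions can be checked in isolation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-instruction summary of carry-flag behavior. */
typedef struct {
    bool reads_cf;  /* instruction reads CF */
    bool writes_cf; /* instruction writes CF */
    bool is_exit;   /* instruction is a trace exit */
} cf_use_t;

/* 'after' describes the instructions that follow the inc/dec in the trace.
 * Returns true only if CF is overwritten before it is read and before any
 * trace exit, so that clobbering CF with add/sub cannot be observed. */
bool inc_to_add_is_safe(const cf_use_t *after, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (after[i].reads_cf)
            return false;  /* someone depends on the CF value inc preserved */
        if (after[i].writes_cf)
            return true;   /* CF is redefined: the old value is dead */
        if (after[i].is_exit)
            return false;  /* code after the exit is unknown */
    }
    return false;          /* end of trace reached without a CF write */
}
```

The order of the three checks matters: an instruction such as adc both reads and writes CF, and the read must win.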
Strength Reduction Results
[Figure: normalized execution time (y-axis, 0.0 to 1.2) of base vs. inc2add per benchmark: ammp, applu, apsi, art, equake, mesa, mgrid, sixtrack, swim, wupwise, and harmonic mean. 2% mean speedup.]
Software Prefetching
• Ubiquitous Memory Introspection (UMI):
  – Online, lightweight, adaptive memory optimization
• Key Ideas
  – Sampling to identify hot traces
    • Timer interrupt
    • Performance counter (L2 cache miss) overflow interrupt
  – Profiling a trace if it is hot enough
    • Instrument trace for memory profiling
  – Analyzing the profile once it is full
    • Cache simulation to identify loads with high miss ratios
    • Stride reference analysis
    [Figure: profile layout with references Ref 1 to Ref 5 recorded across iterations Iter 1 to Iter 3]
  – Optimization
    • Instrument trace with L2 prefetch requests for high-miss loads
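Stride reference analysis can be pictured with a few lines of C. This is a deliberately simple sketch, not the UMI implementation: detect_stride and prefetch_target are hypothetical names, and the rule used here (every consecutive pair of recorded addresses must differ by the same amount) is an assumption chosen for clarity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Given the addresses one load touched on successive iterations, return the
 * constant stride between them, or 0 if there is no single constant stride.
 * At least three samples are required before a stride is trusted. */
int64_t detect_stride(const uint64_t *addrs, size_t n)
{
    if (n < 3)
        return 0;
    int64_t stride = (int64_t)(addrs[1] - addrs[0]);
    for (size_t i = 2; i < n; i++) {
        if ((int64_t)(addrs[i] - addrs[i - 1]) != stride)
            return 0;
    }
    return stride;
}

/* Address to prefetch 'distance' iterations ahead of the current access. */
uint64_t prefetch_target(uint64_t current, int64_t stride, unsigned distance)
{
    return current + (uint64_t)(stride * (int64_t)distance);
}
```

A load that the cache simulation flags as high-miss and that shows a nonzero stride is the kind of candidate for which a prefetch of prefetch_target(addr, stride, d) could be inserted.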
Software Prefetching Results
PiPA: Pipelined Profiling and Analysis
[Figure: timeline (x-axis: time; y-axis: threads or processes). The original application is shown with its instrumentation overhead and its profiling and analyzing overhead, next to the pipelined version: stage 0, instrumented application; stage 1, profile processing; stage 2, parallel analysis on profiles 1, 2, 3, and 4.]
PiPA Challenges
• Minimize the profiling overhead
  – Runtime Execution Profile (REP)
• Minimize the communication between stages
  – double buffering
• Design efficient parallel analysis algorithms
  – we focus on cache simulation
• PiPA Prototype
  – Stage 0: instrumented application, collects the REP
  – Stage 1: profile reconstruction and splitting
  – Stage 2: parallel cache simulation
Stage 0: Profiling

    static dr_emit_flags_t event_basic_block(void *drcontext, void *tag,
                                             instrlist_t *ilist,
                                             bool for_trace, bool xl8)
    {
        instr_t *instr;
        for (instr = instrlist_first(ilist); instr != NULL;
             instr = instr_get_next(instr)) {
            /* check and instrument if any memory read */
            if (instr_reads_memory(instr))
                for (int i = 0; i < instr_num_srcs(instr); i++)
                    if (opnd_is_memory_reference(instr_get_src(instr, i)))
                        instrument_mem_read(drcontext, ilist, instr,
                                            instr_get_src(instr, i));
            /* check and instrument if any memory write */
            if (instr_writes_memory(instr))
                for (int i = 0; i < instr_num_dsts(instr); i++)
                    if (opnd_is_memory_reference(instr_get_dst(instr, i)))
                        instrument_mem_write(drcontext, ilist, instr,
                                             instr_get_dst(instr, i));
        }
        return DR_EMIT_DEFAULT;
    }
Stage 0: Profiling

    static void instrument_mem_read(void *drcontext, instrlist_t *ilist,
                                    instr_t *where, opnd_t ref)
    {
        app_pc pc = instr_get_app_pc(where);
        int size = opnd_size_in_bytes(opnd_get_size(ref));
        …
        /* calculate memory reference address */
        if (opnd_is_base_disp(ref)) {
            opnd_set_size(ref, OPSZ_lea);
            instr = INSTR_CREATE_lea(drcontext, opnd_create_reg(reg1), ref);
        } else if (opnd_is_rel_addr(ref) || opnd_is_abs_addr(ref))
            instr = INSTR_CREATE_mov_imm(drcontext, opnd_create_reg(reg1),
                                         OPND_CREATE_INTPTR(opnd_get_addr(ref)));
        instrlist_meta_preinsert(ilist, where, instr);
        /* put address into profile buffer */
        instrlist_meta_preinsert(ilist, where,
            INSTR_CREATE_mov_st(drcontext, OPND_CREATE_MEMPTR(reg2, 4),
                                opnd_create_reg(reg1)));
        …
    }
Stage 0: Profiling
• Runtime Execution Profile (REP)
  – fast profiling
  – small profile size
  – easy information extraction
• REP unit
  – REP static (REPS)
    • Information that is known statically
  – REP dynamic (REPD)
    • Information that is only known at runtime
• Can be customized for different analyses
  – in our prototype we consider cache simulation
REP Example

Basic blocks:
    bb1: mov [eax + 0x0c] → eax
         mov ebp → esp
         pop ebp
         return
    bb2: pop ebx
         pop ecx
         cmp eax, 0
         jz  label_bb3

REPS for bb1: tag: 0x080483d7, num_slots: 2, num_refs: 3, refs:

    ref   pc          type  size  offset  value_slot  size_slot
    ref0  0x080483d7  read  4     12      1           -1
          0x080483dc  read  4     0       2           -1
          0x080483dd  read  4     4       2           -1

REPD: the profile base pointer points into the first REP buffer, which holds one REP unit per executed basic block and ends in a canary zone, followed by the next buffer:

    . . . | bb1 | eax | esp | bb2 | esp | . . . | Canary Zone | Next buffer . . .

The bb1 REP unit (bb1, eax, esp) is labeled 12 bytes.
Profiling Optimization
• Store register values in REP
  – avoid computing the memory address
• Register liveness analysis
  – avoid register stealing if possible
• Record a single register value for multiple references
  – a single stack pointer value for a sequence of push/pop
  – the base address for multiple accesses to the same structure
• Profiling buffer count update
  – Update counter once per basic block
  – Using lea instruction to avoid aflags usage
• Buffer full check
  – Canary Zone
  – lea & jecxz to avoid aflags usage
Profiling Overhead
[Figure: slowdown relative to native execution (y-axis, 0 to 8) for optimized instrumentation vs. instrumentation without optimization, on SPECint2000, SPECfp2000, and SPEC2000. Avg slowdown: ~3x.]
Stage 1: Profile Reconstruct
Need to reconstruct the full memory reference information from the REP

REPS (bb1): tag: 0x080483d7, num_slots: 2, num_refs: 3, refs:

    pc          type  size  offset  value_slot  size_slot
    0x080483d7  read  4     12      1           -1
    0x080483dc  read  4     0       2           -1
    . . .

REPD:

    . . . | bb1 | 0x2304 | 0x141a | bb2 | 0x1423 | . . . | Canary Zone | . . .

Reconstructed references:

    PC          Address  Type  Size
    0x080483d7  0x2310   read  4
    0x080483dc  0x141a   read  4
    . . .
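The reconstruction step in the example is simple arithmetic: each reference's address is the recorded value of its slot plus its static offset (0x2304 + 12 = 0x2310, and 0x141a + 0 = 0x141a). A minimal sketch follows; the struct layouts and the name rep_reconstruct are hypothetical, slots are assumed to be numbered from 1 as in the example, and the size_slot field is left out because it is -1 (unused) for every reference shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Static part of one reference (from REPS); field names follow the slide. */
typedef struct {
    uint32_t pc;
    char     type;       /* 'r' for read, 'w' for write */
    int      size;
    int32_t  offset;
    int      value_slot; /* 1-based index into the dynamic slots (assumption) */
} rep_ref_t;

/* One fully reconstructed memory reference. */
typedef struct {
    uint32_t pc;
    uint32_t addr;
    char     type;
    int      size;
} mem_ref_t;

/* Combine the static references of a basic block with the register values
 * recorded at runtime (REPD). Writes num_refs entries to out. */
size_t rep_reconstruct(const rep_ref_t *refs, size_t num_refs,
                       const uint32_t *slots, mem_ref_t *out)
{
    for (size_t i = 0; i < num_refs; i++) {
        out[i].pc   = refs[i].pc;
        out[i].addr = slots[refs[i].value_slot - 1] + (uint32_t)refs[i].offset;
        out[i].type = refs[i].type;
        out[i].size = refs[i].size;
    }
    return num_refs;
}
```

Because only two register values are stored for three references, the dynamic profile stays small while stage 1 can still recover every address.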
Profile Reconstruct Overhead
The impact of using REP
  – experiments done on the 8-core system with 16MB buffers and 8 threads
[Figure: slowdown relative to native execution (y-axis, 0 to 35) for PiPA using REP vs. PiPA using standard profile format. PiPA-REP: 4.5x; PiPA-standard: 20.7x.]
Stage 2: Parallel Cache Simulation
How to parallelize?
  – split the address trace into independent groups (in stage 1)
  – two memory references that access different sets are independent
Set associative caches
  – partition the cache sets and simulate them using several independent simulators
  – merge the results (number of hits and misses) at the end of the simulation
Example:
  – 32K cache, 32-byte line, 4-way associative => 256 sets
  – 4 independent simulators, each one simulates 64 sets (round-robin distribution)

    PC    Address     Type  Size
    ....  0xbf9c4614  r     4
    ....  0xbf9c4705  w     4
    ....  0xbf9c4a34  r     4
    ....  0xbf9c4a60  w     4
    ....  0xbf9c4a5c  r     4
    ....  0xbf9c460d  r     4

    Simulator 0: 0xbf9c4614, 0xbf9c4705, 0xbf9c460d ...
    Simulator 1: 0xbf9c4a34 ...
    Simulator 2: 0xbf9c4a5c ...
    Simulator 3: 0xbf9c4a60 ...
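The splitting rule behind the example can be written down directly. The sketch below assumes the usual set-index computation (line number modulo number of sets) and a round-robin assignment of sets to simulators (set index modulo number of simulators); the function names are illustrative. With the example's parameters it reproduces the four groups listed above.

```c
#include <assert.h>
#include <stdint.h>

enum {
    CACHE_SIZE = 32 * 1024,                        /* 32K cache */
    LINE_SIZE  = 32,                               /* 32-byte line */
    WAYS       = 4,                                /* 4-way associative */
    NUM_SETS   = CACHE_SIZE / (LINE_SIZE * WAYS),  /* = 256 sets */
    NUM_SIMS   = 4                                 /* independent simulators */
};

/* Cache set that an address maps to. */
unsigned cache_set_index(uint32_t addr)
{
    return (addr / LINE_SIZE) % NUM_SETS;
}

/* Simulator responsible for that set, with sets dealt out round-robin so
 * each simulator owns NUM_SETS / NUM_SIMS = 64 sets. */
unsigned simulator_for(uint32_t addr)
{
    return cache_set_index(addr) % NUM_SIMS;
}
```

Two references handled by different simulators never touch the same set, so each simulator can keep its own replacement state and hit/miss counts, which are summed at the end.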
Cache Simulation Overhead
Experiments done on the 8-core system
  – 8 recovery threads and 8 cache simulators
[Figure: slowdown relative to native execution (y-axis, 0 to 60). PiPA: 10.5x; Pin dcache: 32x. PiPA speedup over dcache: 3x.]
Shadow Memory
• Application
  – Store meta-data associated with application data
    • Millions of software watchpoints
    • Dynamic information flow tracking (taint propagation)
    • Race detection
    • Memory usage debugging tool (MemCheck/Dr. Memory)
• Issues
  – Performance
  – Multi-thread applications
  – Flexibility
  – Platform dependent
Umbra Outline
• Design
• Implementation
• Optimization

Design
• Address Space
  – A collection of fixed size units
    • 256 MB (32-bit), 4 GB (64-bit)
    • Application, Shadow, Unused
• Translation Table
  – Translation from application memory unit to corresponding shadow memory unit
      addr_shd = addr_app × scale + offset

    Unit       App Mem                   Shd Mem                   Offset
    App Mem 1  [0x00000000, 0x10000000)  [0x20000000, 0x30000000)   0x20000000
    App Mem 2  [0x60000000, 0x70000000)  [0x40000000, 0x50000000)  -0x20000000
    App Mem 3  [0x80000000, 0x90000000)  [0x50000000, 0x60000000)  -0x30000000

[Figure: address space layout with App Mem 1, Shd Mem 1, Shd Mem 2, Shd Mem 3, App Mem 2, and App Mem 3 units separated by unused units]
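A table lookup of this kind can be sketched in a few lines of C. The code below is an illustration under stated assumptions, not Umbra's implementation: unit_map_t, unit_table, and app_to_shadow are hypothetical names, scale is assumed to be 1, and units are 256 MB as on a 32-bit system. The offsets are derived from the listed ranges as shadow base minus application base, which gives -0x30000000 for the third unit (0x50000000 − 0x80000000).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UNIT_SIZE 0x10000000u  /* 256 MB units on a 32-bit address space */

/* One translation table entry: an application unit and its offset. */
typedef struct {
    uint32_t app_base;  /* start of the application memory unit */
    int64_t  offset;    /* shadow base minus application base */
} unit_map_t;

static const unit_map_t unit_table[] = {
    { 0x00000000u,  0x20000000 },  /* App Mem 1 -> [0x20000000, 0x30000000) */
    { 0x60000000u, -0x20000000 },  /* App Mem 2 -> [0x40000000, 0x50000000) */
    { 0x80000000u, -0x30000000 },  /* App Mem 3 -> [0x50000000, 0x60000000) */
};

/* Translate an application address to its shadow address with scale = 1.
 * Returns false if the address lies in a unit with no shadow memory. */
bool app_to_shadow(uint32_t app, uint32_t *shadow)
{
    uint32_t base = app & ~(UNIT_SIZE - 1);
    for (size_t i = 0; i < sizeof(unit_table) / sizeof(unit_table[0]); i++) {
        if (unit_table[i].app_base == base) {
            *shadow = (uint32_t)((int64_t)app + unit_table[i].offset);
            return true;
        }
    }
    return false;
}
```

The linear search is the slow path; the optimizations later in this part exist to avoid running it on every memory reference.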
Implementation
• Memory Manager
  – Monitor and control application memory allocation
    • brk, mmap, munmap, mremap
    • dr_register_pre_syscall_event
    • dr_register_post_syscall_event
  – Allocate shadow memory
  – Maintain translation table
• Instrumenter
  – Instrument every memory reference
    • Context save
    • Address calculation
    • Address translation
    • Shadow memory update
    • Context restore
Instrument Code Example

    # Context Save
    mov  %ecx → [ECX_SLOT]
    mov  %edx → [EDX_SLOT]
    mov  %eax → [EAX_SLOT]
    lahf → %ah
    seto → %al
    # Address Calculation
    lea  [%ebx, 16] → %ecx
    # Address Translation
    mov  0 → %edx
    …    # table lookup code
    add  %ecx, table[%edx].offset → %ecx
    # Shadow Memory Update
    mov  1 → [%ecx]
    # Context Restore
    add  %al 0x7f
    sahf
    mov  [ECX_SLOT] → %ecx
    mov  [EDX_SLOT] → %edx
    mov  [EAX_SLOT] → %eax
    # Application memory reference
    mov  0 → [%ebx, 16]

Optimization
• Translation Optimization
  – Thread Local Translation Table
  – Memoization Check
  – Reference Check
• Instrumentation Optimization
  – Context Switch Reduction
  – Reference Grouping
  – 3-stage Code Layout
Translation Optimization
• Thread Local Translation Optimization
  – Local translation table per thread
  – Synchronize with global translation table when necessary
  – Avoid lock contention
[Figure: Thread 1 and Thread 2 each use a thread-local translation table, backed by the global translation table]

Translation Optimization
• Memoization Cache
  – Software cache per thread
  – Stores frequently used translation entries
    • Stack
    • Units found in last table lookup
[Figure: each thread's memoization cache sits in front of its thread-local translation table, which is backed by the global translation table]

Translation Optimization
• Reference Cache
  – Software cache per static application memory reference
    • Last reference unit tag
    • Last translation offset
[Figure: per thread, the reference cache sits in front of the memoization cache, then the thread-local translation table, then the global translation table]
Instrumentation Optimization
• Context Switch Reduction
  – Register liveness analysis
• Reference Grouping
  – One translation for multiple references using the same base
    • Stack local variables
    • Different members of the same object
• 3-stage Code Layout
  – Inline stub
    • Quick inline check code with minimal context switch
  – Lean procedure
    • Simple assembly procedure with partial context switch
  – Callout
    • C function with complete context switch
3-stage Code Layout
• Inline stub
  – Reference cache check
  – Jump to lean procedure on a miss
• Lean procedure
  – Memoization cache check
  – Local table lookup
  – Clean call to callout
• Callout
  – Global table synchronization
  – Local table lookup
Instrumentation Optimization

Inline Stub

    # reference cache check
    lea  [ref] → %r1
    %r1 & 0xf0000000 → %r1
    cmp  %r1, ref.tag
    je   .update_shadow_memory
    # jmp-and-link to lean procedure
    mov  %r1 → ref.tag
    mov  .update_ref_cache → [ret_pc]
    jmp  lean_procedure
    .update_ref_cache
    mov  %r1 → ref.offset
    # shadow memory update
    .update_shadow_memory
    lea  [ref] → %r1
    add  %r1 + ref.offset → %r1
    mov  1 → [%r1]

Lean Procedure

    # memoization check
    cmp  %r1, cache1.tag
    jne  .cache1_miss
    mov  cache1.offset → %r1
    jmp  [ret_pc]
    .cache1_miss
    cmp  %r1, cache2.tag
    jne  .cache2_miss
    mov  cache2.offset → %r1
    jmp  [ret_pc]
    .cache2_miss
    # table lookup
    mov  %r1 → cache2.tag
    mov  %r2 → [R2_SLOT]
    mov  0 → %r2
    …
    mov  [R2_SLOT] → %r2
    mov  %r1 → cache2.offset
    jmp  [ret_pc]
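The inline stub's fast path amounts to a one-entry cache per static memory reference. The C sketch below mirrors that logic under stated assumptions: ref_cache_t, shadow_addr, lookup_fn, and demo_lookup are hypothetical names, the slower stages (memoization cache, table lookup, callout) are collapsed into a single callback, and demo_lookup returns made-up offsets purely so the example is self-contained.

```c
#include <assert.h>
#include <stdint.h>

#define UNIT_MASK 0xf0000000u  /* selects the 256 MB unit of an address */

/* Per-reference cache: last unit tag seen and its translation offset. */
typedef struct {
    uint32_t tag;
    int32_t  offset;
    int      valid;
} ref_cache_t;

/* Stand-in for the slower stages of the lookup. */
typedef int32_t (*lookup_fn)(uint32_t unit_tag);

/* Hypothetical slow path with illustrative offsets only. */
int32_t demo_lookup(uint32_t unit_tag)
{
    return unit_tag == 0x00000000u ? 0x20000000 : -0x20000000;
}

/* Translate app_addr, consulting the reference cache first. *misses counts
 * how often the slow path had to run. */
uint32_t shadow_addr(ref_cache_t *rc, uint32_t app_addr,
                     lookup_fn slow_lookup, unsigned *misses)
{
    uint32_t tag = app_addr & UNIT_MASK;
    if (!rc->valid || rc->tag != tag) {
        /* miss: take the slow path and refill the cache */
        rc->tag = tag;
        rc->offset = slow_lookup(tag);
        rc->valid = 1;
        (*misses)++;
    }
    /* hit, or freshly refilled: one add yields the shadow address */
    return app_addr + (uint32_t)rc->offset;
}
```

A given instruction tends to keep touching the same unit (the stack, or one heap region), so after the first miss the common case costs a mask, a compare, and an add.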
Performance Evaluation
[Figure: bar chart (y-axis, 0 to 18) over CINT, CFP, and CPU2006 with one series per configuration: DynamoRIO, Local Translation Table, Memoization Check, Reference Cache, Context Switch Reduction, Reference Grouping. Labeled values: 3.29, 2.49, 1.89.]
Umbra Client: Shared Memory Detection

    static void instrument_update(void *drcontext, umbra_info_t *umbra_info,
                                  mem_ref_t *ref, instrlist_t *ilist,
                                  instr_t *where)
    {
        …
        /* test [%reg].tid_map, tid_map */
        opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0);
        opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map);
        instrlist_meta_preinsert(ilist, where,
                                 INSTR_CREATE_test(drcontext, opnd1, opnd2));
        /* jnz where */
        opnd1 = opnd_create_instr(where);
        instrlist_meta_preinsert(ilist, where,
                                 INSTR_CREATE_jcc(drcontext, OP_jnz, opnd1));
        /* lock or [%reg].tid_map, tid_map | 1 */
        opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0);
        opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map | 1);
        instr = INSTR_CREATE_or(drcontext, opnd1, opnd2);
        LOCK(instr);
        instrlist_meta_preinsert(ilist, label, instr);
    }
Watchpoints
• Watchpoints are powerful debugging tools
  – Millions of Watchpoints
    • Watch all heap allocations to detect uninitialized values
    • Watch all buffers to detect overflow
    • Watch all return addresses
• However, more than a handful of watched addresses cannot be handled efficiently by today's debuggers
  – GDB is forced into a single-step mode
  – Prohibitively expensive
Millions of Watchpoints
• Full instrumentation with shadow memory
  – Simply mark the shadow memory
  – Instrument to check on every memory access
• Partial instrumentation:
  – On setting a watchpoint
    • Duplicate data into shadow pages
    • Set application page non-accessible
  – In the access violation handler
    • Instrument the faulting instruction
      – Check reference target
      – Report if it accesses the watched data
      – Redirect if it accesses the protected pages
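The full-instrumentation variant boils down to "mark on set, test on access". The sketch below shows that idea on a small simulated region with one shadow byte per application byte; the region size, the direct-mapped layout, and the names watch_range and access_hits_watchpoint are assumptions made for illustration, not the design of any particular tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REGION_SIZE 4096u

/* One shadow byte per application byte: nonzero means "watched". */
static uint8_t shadow[REGION_SIZE];

/* Set or clear a watchpoint on [off, off + len) within the region. */
void watch_range(size_t off, size_t len, bool on)
{
    if (off >= REGION_SIZE)
        return;
    if (len > REGION_SIZE - off)
        len = REGION_SIZE - off;  /* clamp to the region */
    memset(shadow + off, on ? 1 : 0, len);
}

/* The check that would be instrumented before every memory access:
 * does any byte of [off, off + len) carry a watchpoint mark? */
bool access_hits_watchpoint(size_t off, size_t len)
{
    for (size_t i = 0; i < len && off + i < REGION_SIZE; i++) {
        if (shadow[off + i])
            return true;
    }
    return false;
}
```

Setting a watchpoint is a memset regardless of how many addresses are already watched, which is what makes millions of watchpoints feasible; the cost moves into the per-access check instead.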