Examples: Part I - DynamoRIO

Examples: Part I

1:30-1:40  Welcome + DynamoRIO History
1:40-2:40  DynamoRIO Internals
2:40-3:00  Examples, Part 1
3:00-3:15  Break
3:15-4:15  DynamoRIO API
4:15-5:15  Examples, Part 2
5:15-5:30  Feedback

DynamoRIO Examples Part I Outline
• Common Steps of writing a DynamoRIO client
• Dynamic Instruction Counting Example

DynamoRIO Tutorial at MICRO 12 Dec 2009


Common Steps
• Step 1: Register Events
  – DR_EXPORT void dr_init(client_id_t id)

    Register Function               Event
    dr_register_bb_event            Basic Block Building
    dr_register_thread_init_event   Thread Initialization
    dr_register_exit_event          Process Exit

• Step 2: Implementation
  – Initialization
  – Finalization
  – Instrumentation
• Step 3: Optimization
  – Optimize the instrumentation to improve performance


Step 1: Register Events

    uint num_dyn_instrs;

    static void event_init(void);
    static void event_exit(void);
    static dr_emit_flags_t event_basic_block(void *drcontext, void *tag,
                                             instrlist_t *ilist,
                                             bool for_trace, bool translating);

    DR_EXPORT void dr_init(client_id_t id)
    {
        /* register events */
        dr_register_bb_event(event_basic_block);
        dr_register_exit_event(event_exit);
        /* process initialization event */
        event_init();
    }


Step 2: Implementation (I)

    static void event_init(void)
    {
        num_dyn_instrs = 0;
    }

    static void event_exit(void)
    {
        dr_printf("Total number of instructions executed: %u\n", num_dyn_instrs);
    }

    static dr_emit_flags_t event_basic_block(void *drcontext, void *tag,
                                             instrlist_t *ilist,
                                             bool for_trace, bool translating)
    {
        int num_instrs;
        /* count the instructions in this block, then instrument it */
        num_instrs = ilist_num_instrs(ilist);
        insert_count_code(drcontext, ilist, num_instrs);
        return DR_EMIT_DEFAULT;
    }


Step 2: Implementation (II)

    static int ilist_num_instrs(instrlist_t *ilist)
    {
        instr_t *instr;
        int num_instrs = 0;
        /* iterate over instruction list to count number of instructions */
        for (instr = instrlist_first(ilist); instr != NULL;
             instr = instr_get_next(instr))
            num_instrs++;
        return num_instrs;
    }

    static void do_ins_count(int num_instrs)
    {
        num_dyn_instrs += num_instrs;
    }

    static void insert_count_code(void *drcontext, instrlist_t *ilist,
                                  int num_instrs)
    {
        /* clean call to do_ins_count(num_instrs) at the top of the block */
        dr_insert_clean_call(drcontext, ilist, instrlist_first(ilist),
                             do_ins_count, false, 1,
                             OPND_CREATE_INT32(num_instrs));
    }


Instrumented Basic Block

    # switch stack
    # switch aflags and errno
    # save all registers
    # call do_ins_count
    push $0x00000003
    call $0xb7ef73e4 (do_ins_count)
    # restore registers
    # switch aflags and errno back
    # switch stack back
    # application code
    add  $0x0000e574 %ebx -> %ebx
    test %al $0x08
    jz   $0xb80e8a98


Step 3: Optimization (I): counter update inlining

    static void insert_count_code(void *drcontext, instrlist_t *ilist,
                                  int num_instrs)
    {
        instr_t *instr, *where;
        opnd_t opnd1, opnd2;

        where = instrlist_first(ilist);
        /* save aflags */
        dr_save_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
        /* num_dyn_instrs += num_instrs */
        opnd1 = OPND_CREATE_ABSMEM(&num_dyn_instrs, OPSZ_PTR);
        opnd2 = OPND_CREATE_INT32(num_instrs);
        instr = INSTR_CREATE_add(drcontext, opnd1, opnd2);
        instrlist_meta_preinsert(ilist, where, instr);
        /* restore aflags */
        dr_restore_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
    }


Instrumented Basic Block

    mov  %eax -> %fs:0x0c
    lahf -> %ah
    seto -> %al
    add  $0x00000003, 0xb7d25030
    add  $0x7f %al -> %al
    sahf %ah
    mov  %fs:0x0c -> %eax
    # application code
    add  $0x0000e574 %ebx -> %ebx
    test %al $0x08
    jz   $0xb7f14a98


Step 3: Optimization (II): aflags stealing

    static void insert_count_code(void *drcontext, instrlist_t *ilist,
                                  int num_instrs)
    {
        …
        save_aflags = aflags_analysis(ilist);
        /* save aflags */
        if (save_aflags)
            dr_save_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
        /* num_dyn_instrs += num_instrs */
        opnd1 = OPND_CREATE_ABSMEM(&num_dyn_instrs, OPSZ_PTR);
        opnd2 = OPND_CREATE_INT32(num_instrs);
        instr = INSTR_CREATE_add(drcontext, opnd1, opnd2);
        instrlist_meta_preinsert(ilist, where, instr);
        /* restore aflags */
        if (save_aflags)
            dr_restore_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
    }


Instrumented Basic Block

    add  $0x00000003, 0xb7d25030
    # application code
    add  $0x0000e574 %ebx -> %ebx
    test %al $0x08
    jz   $0xb7f14a98


Step 3: Optimization (III): more optimizations
• Use lea (load effective address) instead of add
    lea [%reg, num_instr] -> %reg
• Register liveness analysis
  – Use a dead register to avoid the register save/restore for lea
• Global aflags/registers analysis
  – Analyze aflags/register liveness over the CFG
• Trace Optimization
  – Trace: single-entry, multi-exit
  – Update counters only at trace exits


Other Issues
• Data race on counter update in multithreaded programs
  – Global lock for every update
  – Atomic update (lock-prefixed add)
    • LOCK(instr);
  – Thread-private counter
    • Thread-private code cache: a different variable at a different address
    • Thread-shared code cache: thread-local storage
• 32-bit counter overflow
  – 64-bit counter
    • Two instructions on a 32-bit architecture: add, adc
  – One 32-bit local counter and one 64-bit global counter
    • Instrument to update the 32-bit local counter
    • Update the 64-bit global counter on a timer interrupt
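The two overflow remedies listed above can be sketched in plain C, outside DynamoRIO. This is a minimal single-threaded illustration, not the tutorial's code: `thread_counter_t`, `count_instrs`, `flush_counter`, and `add64_via_32` are hypothetical names, and a real client would call the flush from a timer callback and synchronize the global update.

```c
#include <stdint.h>

/* Hypothetical per-thread state: a 32-bit counter that inlined
 * instrumentation can update with a single add. */
typedef struct {
    uint32_t local;
} thread_counter_t;

/* 64-bit running total (assumption: updates are serialized elsewhere). */
static uint64_t global_count;

/* What the inlined instrumentation does once per basic block. */
static void count_instrs(thread_counter_t *tc, uint32_t num_instrs)
{
    tc->local += num_instrs;
}

/* What a periodic timer handler would do: fold the local counter into
 * the global one before the local counter can wrap. */
static void flush_counter(thread_counter_t *tc)
{
    global_count += tc->local;
    tc->local = 0;
}

/* A 64-bit add performed on two 32-bit halves, mirroring add + adc. */
static void add64_via_32(uint32_t *lo, uint32_t *hi, uint32_t n)
{
    uint32_t old = *lo;
    *lo += n;                       /* add: may wrap around */
    *hi += (*lo < old) ? 1u : 0u;   /* adc: propagate the carry */
}
```

The local/global split keeps the hot path at one 32-bit add; the price is that the flush interval must be short enough that the local counter cannot wrap between flushes.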


Examples: Part 2

Larger Examples
• Dynamic Optimization
  – Strength Reduction
  – Software Prefetch
• Profiling
  – Pipelined Profiling and Analysis
• Shadow Memory
  – Umbra
  – Millions of Watchpoints
• Dr. Memory


Dynamic Optimization Opportunities
• Traditional compiler optimizations
  – Compiler has a limited view: the application is assembled at runtime
  – Some shipped products are built without optimizations
• Microarchitecture-specific optimizations
  – Feature set and relative performance of instructions vary
  – Combinatorial blowup if done statically
• Adaptive optimizations
  – Need runtime information: prior profiling runs are not always representative


Dynamic Optimization in DynamoRIO
• Traces are a natural unit for optimization
  – Focus only on hot code
  – Cross procedure, file, and module boundaries
• Linear control flow
  – Single-entry, multi-exit simplifies analysis
• Support for adaptive optimization
  – Can replace traces dynamically


Strength Reduction: inc to add
• On Pentium 4, inc is slower than add 1 (and dec is slower than sub 1)
• The opposite is true on Pentium 3
• A microarchitecture-specific optimization is best performed dynamically


    DR_EXPORT void dr_init(client_id_t id)
    {
        /* Pentium 4? Only then is the transformation worthwhile. */
        if (proc_get_family() == FAMILY_PENTIUM_IV)
            dr_register_trace_event(event_trace);
    }

    static dr_emit_flags_t event_trace(void *drcontext, void *tag,
                                       instrlist_t *trace, bool xl8)
    {
        instr_t *instr, *next_instr;
        int opcode;
        /* Look for inc / dec */
        for (instr = instrlist_first(trace); instr != NULL; instr = next_instr) {
            /* fetch the successor first: instr may be replaced and destroyed */
            next_instr = instr_get_next(instr);
            opcode = instr_get_opcode(instr);
            if (opcode == OP_inc || opcode == OP_dec)
                replace_inc_with_add(drcontext, instr, trace);
        }
        return DR_EMIT_DEFAULT;
    }

    static bool replace_inc_with_add(void *drcontext, instr_t *instr,
                                     instrlist_t *trace)
    {
        instr_t *in;
        uint eflags;
        int opcode = instr_get_opcode(instr);
        bool ok_to_replace = false;
        /* Ensure eflags change ok: inc/dec leave CF untouched but add/sub
         * write it, so CF must be overwritten before anything reads it. */
        for (in = instr; in != NULL; in = instr_get_next(in)) {
            eflags = instr_get_arith_flags(in);
            if ((eflags & EFLAGS_READ_CF) != 0)
                return false;
            if ((eflags & EFLAGS_WRITE_CF) != 0) {
                ok_to_replace = true;
                break;
            }
            /* CF may be live past a trace exit */
            if (instr_is_exit_cti(in))
                return false;
        }
        if (!ok_to_replace)
            return false;
        /* Replace with add / sub */
        if (opcode == OP_inc)
            in = INSTR_CREATE_add(drcontext, instr_get_dst(instr, 0),
                                  OPND_CREATE_INT8(1));
        else
            in = INSTR_CREATE_sub(drcontext, instr_get_dst(instr, 0),
                                  OPND_CREATE_INT8(1));
        instr_set_prefixes(in, instr_get_prefixes(instr));
        instrlist_replace(trace, instr, in);
        instr_destroy(drcontext, instr);
        return true;
    }
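The eflags check is the subtle part of this transformation, so here is a toy model of it in plain C that runs without DynamoRIO. It is a sketch under stated assumptions: `toy_instr_t` and `cf_dead_after` are hypothetical stand-ins for `instr_t` and the scanning loop in `replace_inc_with_add`, and each instruction is reduced to three flags.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of an instruction: only the properties
 * the carry-flag check cares about. */
typedef struct {
    bool reads_cf;   /* instruction reads the carry flag */
    bool writes_cf;  /* instruction overwrites the carry flag */
    bool is_exit;    /* instruction may leave the trace */
} toy_instr_t;

/* Returns true if CF is provably dead from position `start` onward:
 * some instruction overwrites CF before any instruction reads it and
 * before control can leave the trace. Reads are checked first, so an
 * instruction that both reads and writes CF (such as adc) blocks the
 * replacement. */
static bool cf_dead_after(const toy_instr_t *code, size_t n, size_t start)
{
    for (size_t i = start; i < n; i++) {
        if (code[i].reads_cf)
            return false;
        if (code[i].writes_cf)
            return true;
        if (code[i].is_exit)
            return false;
    }
    return false;  /* ran off the end without seeing a write: be conservative */
}
```

Falling off the end returns false because nothing is known about the code that runs after the trace, which matches the conservative `ok_to_replace = false` default in the slide.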


Strength Reduction Results

[Figure: normalized execution time (y-axis, 0.0 to 1.2) of base vs. inc2add for the benchmarks ammp, applu, apsi, art, equake, mesa, mgrid, sixtrack, swim, wupwise, and their harmonic mean. 2% mean speedup.]

Software Prefetching
• Ubiquitous Memory Introspection (UMI)
  – Online, lightweight, adaptive memory optimization
• Key Ideas
  – Sampling to identify hot traces
    • Timer interrupt
    • Performance counter (L2 cache miss) overflow interrupt
  – Profile a trace if it is hot enough
    • Instrument the trace for memory profiling
  – Analyze the profile when it is full
    • Cache simulation to identify loads with high miss ratios
    • Stride reference analysis
  – Optimization
    • Instrument the trace with L2 prefetch requests for high-miss loads

[Figure: memory profile table with columns Ref 1 to Ref 5 and rows Iter 1 to Iter 3]


Software Prefetching Results


PiPA: Pipelined Profiling and Analysis

[Figure: timelines (threads or processes vs. time). The original application is extended by instrumentation overhead and by profiling and analysis overhead. In the pipelined design the work is split into stage 0 (instrumented application), stage 1 (profile processing), and stage 2 (parallel analysis on profiles 1 to 4).]

PiPA Challenges
• Minimize the profiling overhead
  – Runtime Execution Profile (REP)
• Minimize the communication between stages
  – Double buffering
• Design efficient parallel analysis algorithms
  – We focus on cache simulation

PiPA Prototype
  – Stage 0: instrumented application, collects the REP
  – Stage 1: profile reconstruction and splitting
  – Stage 2: parallel cache simulation


Stage 0: Profiling

    static dr_emit_flags_t event_basic_block(void *drcontext, void *tag,
                                             instrlist_t *ilist,
                                             bool for_trace, bool xl8)
    {
        instr_t *instr;
        for (instr = instrlist_first(ilist); instr != NULL;
             instr = instr_get_next(instr)) {
            /* check and instrument if any memory read */
            if (instr_reads_memory(instr)) {
                for (int i = 0; i < instr_num_srcs(instr); i++) {
                    if (opnd_is_memory_reference(instr_get_src(instr, i)))
                        instrument_mem_read(drcontext, ilist, instr,
                                            instr_get_src(instr, i));
                }
            }
            /* check and instrument if any memory write */
            if (instr_writes_memory(instr)) {
                for (int i = 0; i < instr_num_dsts(instr); i++) {
                    if (opnd_is_memory_reference(instr_get_dst(instr, i)))
                        instrument_mem_write(drcontext, ilist, instr,
                                             instr_get_dst(instr, i));
                }
            }
        }
        return DR_EMIT_DEFAULT;
    }


Stage 0: Profiling

    static void instrument_mem_read(void *drcontext, instrlist_t *ilist,
                                    instr_t *where, opnd_t ref)
    {
        app_pc pc = instr_get_app_pc(where);
        int size = opnd_size_in_bytes(opnd_get_size(ref));
        …
        /* calculate memory reference address */
        if (opnd_is_base_disp(ref)) {
            opnd_set_size(ref, OPSZ_lea);
            instr = INSTR_CREATE_lea(drcontext, opnd_create_reg(reg1), ref);
        } else if (opnd_is_rel_addr(ref) || opnd_is_abs_addr(ref)) {
            instr = INSTR_CREATE_mov_imm(drcontext, opnd_create_reg(reg1),
                                         OPND_CREATE_INTPTR(opnd_get_addr(ref)));
        }
        instrlist_meta_preinsert(ilist, where, instr);
        /* put address into profile buffer */
        instrlist_meta_preinsert(ilist, where,
            INSTR_CREATE_mov_st(drcontext, OPND_CREATE_MEMPTR(reg2, 4),
                                opnd_create_reg(reg1)));
        …
    }


Stage 0: Profiling
• Runtime Execution Profile (REP)
  – Fast profiling
  – Small profile size
  – Easy information extraction
• REP unit
  – REP static (REPS)
    • Information that is known statically
  – REP dynamic (REPD)
    • Information that is known only at runtime
• Can be customized for different analyses
  – In our prototype we consider cache simulation


REP Example

[Figure: the profile base pointer points into the REP buffer; the first buffer ends in a Canary Zone and is followed by the next buffer.]

Application code:

    bb1: mov [eax + 0x0c] -> eax
         mov ebp -> esp
         pop ebp
         return

    bb2: pop ebx
         pop ecx
         cmp eax, 0
         jz  label_bb3

REPS (static part), bb1 REP unit:

    tag: 0x080483d7   num_slots: 2   num_refs: 3   refs: ref0

    pc          type  size  offset  value_slot  size_slot
    0x080483d7  read  4     12      1           -1
    0x080483dc  read  4     0       2           -1
    0x080483dd  read  4     4       2           -1

REPD (dynamic part), profile buffer contents:

    bb1 | eax | esp      (12 bytes)
    bb2 | esp | . . .
    . . .
    Canary Zone

Profiling Optimization
• Store register values in the REP
  – Avoid computing the memory address
• Register liveness analysis
  – Avoid register stealing if possible
• Record a single register value for multiple references
  – A single stack pointer value for a sequence of push/pop
  – The base address for multiple accesses to the same structure
• Profiling buffer count update
  – Update the counter once per basic block
  – Use the lea instruction to avoid aflags usage
• Buffer full check
  – Canary Zone
  – lea and jecxz to avoid aflags usage


Profiling Overhead

[Figure: slowdown relative to native execution (y-axis, 0 to 8) on SPECint2000, SPECfp2000, and SPEC2000, comparing optimized instrumentation with instrumentation without optimization. Average slowdown: ~3x.]

Stage 1: Profile Reconstruct

Need to reconstruct the full memory reference information from the REP.

REPS, bb1 REP unit:

    tag: 0x080483d7   num_slots: 2   num_refs: 3   refs: ref0

    pc          type  size  offset  value_slot  size_slot
    0x080483d7  read  4     12      1           -1
    0x080483dc  read  4     0       2           -1
    . . .

REPD, profile buffer contents:

    bb1 | 0x2304 | 0x141a
    bb2 | 0x1423 | . . .
    . . .
    Canary Zone

Reconstructed memory references:

    PC          Address  Type  Size
    0x080483d7  0x2310   read  4
    0x080483dc  0x141a   read  4
    . . .
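The reconstruction step amounts to adding each reference's static offset to the register value recorded in its dynamic slot. The following plain C sketch illustrates that arithmetic with the numbers from the slide. The struct layouts and the name `reconstruct_refs` are assumptions made for illustration, as is the convention that slot 0 of a dynamic unit identifies the basic block and `value_slot` indexes the entries after it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical static description of one memory reference (REPS). */
typedef struct {
    uint32_t pc;
    char     type;        /* 'r' for read, 'w' for write */
    uint32_t size;
    int32_t  offset;      /* displacement from the recorded register value */
    int      value_slot;  /* index of that value in the dynamic unit */
} rep_static_ref_t;

/* One fully reconstructed reference, as consumed by the analysis stage. */
typedef struct {
    uint32_t pc;
    uint32_t addr;
    char     type;
    uint32_t size;
} mem_ref_t;

/* Rebuild full references for one basic block execution.
 * dyn_unit[0] is assumed to identify the block; dyn_unit[1..] hold the
 * register values recorded at runtime (REPD). */
static void reconstruct_refs(const rep_static_ref_t *refs, size_t num_refs,
                             const uint32_t *dyn_unit, mem_ref_t *out)
{
    for (size_t i = 0; i < num_refs; i++) {
        out[i].pc   = refs[i].pc;
        out[i].addr = dyn_unit[refs[i].value_slot] + (uint32_t)refs[i].offset;
        out[i].type = refs[i].type;
        out[i].size = refs[i].size;
    }
}
```

With the slide's values, slot 1 holds 0x2304 and the first reference has offset 12, which gives the address 0x2310 shown in the reconstructed table.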

Profile Reconstruct Overhead

The impact of using REP
  – Experiments done on the 8-core system with 16MB buffers and 8 threads

[Figure: slowdown relative to native execution (y-axis, 0.00 to 35.00), PiPA using REP vs. PiPA using the standard profile format. PiPA-REP: 4.5x; PiPA-standard: 20.7x.]

Stage 2: Parallel Cache Simulation

How to parallelize?
  – Split the address trace into independent groups (in stage 1)
  – Two memory references that access different sets are independent

Set-associative caches
  – Partition the cache sets and simulate them using several independent simulators
  – Merge the results (number of hits and misses) at the end of the simulation

Example:
  – 32K cache, 32-byte line, 4-way associative => 256 sets
  – 4 independent simulators, each one simulates 64 sets (round-robin distribution)

    PC    Address     Type  Size
    ....  0xbf9c4614  r     4
    ....  0xbf9c4705  w     4
    ....  0xbf9c4a34  r     4
    ....  0xbf9c4a60  w     4
    ....  0xbf9c4a5c  r     4
    ....  0xbf9c460d  r     4

    Simulator 0: 0xbf9c4614, 0xbf9c4705, 0xbf9c460d ...
    Simulator 1: 0xbf9c4a34 ...
    Simulator 2: 0xbf9c4a5c ...
    Simulator 3: 0xbf9c4a60 ...
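The round-robin split in the example can be checked with a few lines of C. This is a minimal sketch assuming the conventional set-index computation (line number modulo the number of sets); the names `set_index` and `simulator_for` are hypothetical.

```c
#include <stdint.h>

/* Cache geometry from the example: 32K, 32-byte lines, 4-way. */
enum {
    CACHE_SIZE = 32 * 1024,
    LINE_SIZE  = 32,
    WAYS       = 4,
    NUM_SETS   = CACHE_SIZE / (LINE_SIZE * WAYS),  /* 256 */
    NUM_SIMS   = 4                                 /* independent simulators */
};

/* Set that an address maps to: its line number modulo the set count. */
static unsigned set_index(uint32_t addr)
{
    return (addr / LINE_SIZE) % NUM_SETS;
}

/* Round-robin distribution of sets over simulators: set s goes to
 * simulator s mod NUM_SIMS, so each simulator owns 64 sets. */
static unsigned simulator_for(uint32_t addr)
{
    return set_index(addr) % NUM_SIMS;
}
```

For instance, 0xbf9c4614 falls in set 48 and therefore goes to simulator 0, while 0xbf9c4a34 falls in set 81 and goes to simulator 1, in agreement with the grouping listed above.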

Cache Simulation Overhead

Experiments done on the 8-core system
  – 8 recovery threads and 8 cache simulators

[Figure: slowdown relative to native execution (y-axis, 0.00 to 60.00). PiPA: 10.5x; Pin dcache: 32x. PiPA speedup over dcache: 3x.]


Shadow Memory
• Application
  – Store meta-data associated with application data
    • Millions of software watchpoints
    • Dynamic information flow tracking (taint propagation)
    • Race detection
    • Memory usage debugging tool (MemCheck/Dr. Memory)
• Issues
  – Performance
  – Multi-threaded applications
  – Flexibility
  – Platform dependence


Umbra Outline
• Design
• Implementation
• Optimization

Design
• Address Space
  – A collection of fixed-size units
    • 256M (32-bit), 4G (64-bit)
    • Application, Shadow, Unused
• Translation Table
  – Translation from an application memory unit to the corresponding shadow memory unit

    addr_shd = addr_app × scale + offset

              App Mem                    Shd Mem                    Offset
  App Mem 1   [0x00000000, 0x10000000)   [0x20000000, 0x30000000)    0x20000000
  App Mem 2   [0x60000000, 0x70000000)   [0x40000000, 0x50000000)   -0x20000000
  App Mem 3   [0x80000000, 0x90000000)   [0x50000000, 0x60000000)   -0x20000000

[Figure: address space layout with App Mem 1, Unused, Shd Mem 1, Unused, Shd Mem 2, Shd Mem 3, App Mem 2, App Mem 3]
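A translation table of this kind can be sketched in plain C as one entry per 256M unit. This is an illustration under stated assumptions, not Umbra's code: the names `xl8_entry_t`, `map_unit`, and `app_to_shadow` are hypothetical, the scale is taken to be 1, and only the first two rows of the table are exercised.

```c
#include <stdbool.h>
#include <stdint.h>

#define UNIT_SHIFT 28   /* 256M units on a 32-bit address space */
#define NUM_UNITS  16   /* 4G / 256M */

/* Hypothetical table entry: one per application memory unit. */
typedef struct {
    bool    valid;    /* unit has shadow memory allocated */
    int64_t offset;   /* shadow base minus application base (scale = 1) */
} xl8_entry_t;

static xl8_entry_t xl8_table[NUM_UNITS];

/* Record that the unit starting at app_base is shadowed at shd_base. */
static void map_unit(uint32_t app_base, uint32_t shd_base)
{
    xl8_entry_t *e = &xl8_table[app_base >> UNIT_SHIFT];
    e->valid  = true;
    e->offset = (int64_t)shd_base - (int64_t)app_base;
}

/* addr_shd = addr_app * scale + offset, with scale assumed to be 1.
 * Returns false if the unit has no shadow memory. */
static bool app_to_shadow(uint32_t app_addr, uint32_t *shd_addr)
{
    const xl8_entry_t *e = &xl8_table[app_addr >> UNIT_SHIFT];
    if (!e->valid)
        return false;
    *shd_addr = (uint32_t)((int64_t)app_addr + e->offset);
    return true;
}
```

Because units are fixed-size and aligned, the table index is just the top bits of the address, so a lookup costs one shift, one load, and one add.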

Implementation
• Memory Manager
  – Monitor and control application memory allocation
    • brk, mmap, munmap, mremap
    • dr_register_pre_syscall_event
    • dr_register_post_syscall_event
  – Allocate shadow memory
  – Maintain translation table
• Instrumenter
  – Instrument every memory reference
    • Context save
    • Address calculation
    • Address translation
    • Shadow memory update
    • Context restore


Instrument Code Example

    # Context Save
    mov  %ecx -> [ECX_SLOT]
    mov  %edx -> [EDX_SLOT]
    mov  %eax -> [EAX_SLOT]
    lahf -> %ah
    seto -> %al

    # Address Calculation
    lea  [%ebx, 16] -> %ecx

    # Address Translation
    mov  0 -> %edx
    …    # table lookup code
    add  %ecx, table[%edx].offset -> %ecx

    # Shadow Memory Update
    mov  1 -> [%ecx]

    # Context Restore
    add  %al 0x7f
    sahf
    mov  [ECX_SLOT] -> %ecx
    mov  [EDX_SLOT] -> %edx
    mov  [EAX_SLOT] -> %eax

    # Application memory reference
    mov  0 -> [%ebx, 16]

Optimization
• Translation Optimization
  – Thread Local Translation Table
  – Memoization Check
  – Reference Check
• Instrumentation Optimization
  – Context Switch Reduction
  – Reference Grouping
  – 3-stage Code Layout


Translation Optimization
• Thread Local Translation Optimization
  – Local translation table per thread
  – Synchronize with the global translation table when necessary
  – Avoid lock contention

[Figure: Thread 1 and Thread 2 each have a thread-local translation table backed by the global translation table]

Translation Optimization
• Memoization Cache
  – Software cache per thread
  – Stores frequently used translation entries
    • Stack
    • Units found in the last table lookup

[Figure: Thread 1 and Thread 2 each have a memoization cache in front of the thread-local translation table, backed by the global translation table]

Translation Optimization
• Reference Cache
  – Software cache per static application memory reference
    • Last reference unit tag
    • Last translation offset

[Figure: Thread 1 and Thread 2 each have a reference cache in front of the memoization cache and the thread-local translation table, backed by the global translation table]

Instrumentation Optimization
• Context Switch Reduction
  – Register liveness analysis
• Reference Grouping
  – One translation for multiple references using the same base
    • Stack local variables
    • Different members of the same object
• 3-stage Code Layout
  – Inline stub
    • Quick inline check code with minimal context switch
  – Lean procedure
    • Simple assembly procedure with partial context switch
  – Callout
    • C function with complete context switch

3-stage Code Layout
• Inline stub
  – Reference cache check
  – Jump to the lean procedure on a miss
• Lean procedure
  – Memoization cache check
  – Local table lookup
  – Clean call to the callout
• Callout
  – Global table synchronization
  – Local table lookup


Instrumentation Optimization

Inline Stub

    # reference cache check
    lea  [ref] -> %r1
    %r1 & 0xf0000000 -> %r1
    cmp  %r1, ref.tag
    je   .update_shadow_memory
    # jmp-and-link to lean procedure
    mov  %r1 -> ref.tag
    mov  .update_ref_cache -> [ret_pc]
    jmp  lean_procedure
    .update_ref_cache
    mov  %r1 -> ref.offset
    # shadow memory update
    .update_shadow_memory
    lea  [ref] -> %r1
    add  %r1 + ref.offset -> %r1
    mov  1 -> [%r1]

Lean Procedure

    # memoization check
    cmp  %r1, cache1.tag
    jne  .cache1_miss
    mov  cache1.offset -> %r1
    jmp  [ret_pc]
    .cache1_miss
    cmp  %r1, cache2.tag
    jne  .cache2_miss
    mov  cache2.offset -> %r1
    jmp  [ret_pc]
    .cache2_miss
    # table lookup
    mov  %r1 -> cache2.tag
    mov  %r2 -> [R2_SLOT]
    mov  0 -> %r2
    …
    mov  [R2_SLOT] -> %r2
    mov  %r1 -> cache2.offset
    jmp  [ret_pc]
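The layered lookup (reference cache, then memoization cache, then table) can be modelled in plain C to show how often the slow path is reached. This is a simplified sketch and not Umbra's implementation: it uses a single memoization entry instead of two, folds the inline stub and lean procedure into one function, and every name in it (`xl8_cache_t`, `unit_offset`, `table_lookup`, `translate`) is hypothetical.

```c
#include <stdint.h>

#define UNIT_MASK   0xf0000000u  /* top bits select the 256M unit */
#define INVALID_TAG 0xffffffffu  /* never equals a unit-aligned tag */

/* A one-entry software cache: last unit tag and its translation offset. */
typedef struct {
    uint32_t tag;
    uint32_t offset;  /* added modulo 2^32, so it can encode negatives */
} xl8_cache_t;

static uint32_t unit_offset[16];  /* stand-in for the translation table */
static int table_lookups;         /* counts how often the slow path runs */

/* Slow path: the full table lookup. */
static uint32_t table_lookup(uint32_t tag)
{
    table_lookups++;
    return unit_offset[tag >> 28];
}

/* ref_cache belongs to one static memory reference; memo is per thread. */
static uint32_t translate(xl8_cache_t *ref_cache, xl8_cache_t *memo,
                          uint32_t app_addr)
{
    uint32_t tag = app_addr & UNIT_MASK;
    if (ref_cache->tag != tag) {        /* inline stub: reference cache miss */
        if (memo->tag != tag) {         /* lean procedure: memoization miss */
            memo->offset = table_lookup(tag);
            memo->tag = tag;
        }
        *ref_cache = *memo;             /* refill the reference cache */
    }
    return app_addr + ref_cache->offset;
}
```

A reference that keeps touching the same unit pays only one compare after its first execution, and a different reference touching that unit is served by the memoization cache without a table lookup.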

Performance Evaluation

[Figure: slowdown (y-axis, 0 to 18) on CINT, CFP, and CPU2006, with bars for DynamoRIO, Local Translation Table, Memoization Check, Reference Cache, Context Switch Reduction, and Reference Grouping. Labelled values: 3.29, 2.49, 1.89.]

Umbra Client: Shared Memory Detection

    static void instrument_update(void *drcontext, umbra_info_t *umbra_info,
                                  mem_ref_t *ref, instrlist_t *ilist,
                                  instr_t *where)
    {
        …
        /* test [%reg].tid_map, tid_map */
        opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0);
        opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map);
        instrlist_meta_preinsert(ilist, where,
                                 INSTR_CREATE_test(drcontext, opnd1, opnd2));
        /* jnz where: this thread is already recorded, skip the update */
        opnd1 = opnd_create_instr(where);
        instrlist_meta_preinsert(ilist, where,
                                 INSTR_CREATE_jcc(drcontext, OP_jnz, opnd1));
        /* or: atomically record this thread in the shadow word */
        opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0);
        opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map | 1);
        instr = INSTR_CREATE_or(drcontext, opnd1, opnd2);
        LOCK(instr);
        instrlist_meta_preinsert(ilist, label, instr);
    }


Watchpoints
• Watchpoints are powerful debugging tools
  – Millions of Watchpoints
    • Watch all heap allocations to detect uninitialized values
    • Watch all buffers to detect overflow
    • Watch all return addresses
• However, today's debuggers cannot efficiently handle more than a handful of watched addresses
  – GDB is forced into single-step mode
  – Prohibitively expensive


Millions of Watchpoints
• Full instrumentation with shadow memory
  – Simply mark the shadow memory
  – Instrument to check on every memory access
• Partial instrumentation
  – On setting a watchpoint
    • Duplicate data into shadow pages
    • Set the application page non-accessible
  – In the access violation handler
    • Instrument the faulting instruction
      – Check the reference target
      – Report if it accesses watched data
      – Redirect if it accesses the protected pages
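The full-instrumentation variant reduces to two operations on shadow memory: marking bytes when a watchpoint is set, and testing them on every access. The sketch below models that in plain C over a small fixed region. It is an illustration only: the one-shadow-byte-per-application-byte layout, the region size, and the names `set_watchpoint`, `clear_watchpoint`, and `access_hits_watchpoint` are assumptions, and callers are expected to pass in-range offsets.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REGION_SIZE 4096

/* One shadow byte per application byte: nonzero means "watched". */
static uint8_t shadow[REGION_SIZE];

/* Setting a watchpoint simply marks the shadow memory. */
static void set_watchpoint(size_t offset, size_t len)
{
    memset(shadow + offset, 1, len);
}

static void clear_watchpoint(size_t offset, size_t len)
{
    memset(shadow + offset, 0, len);
}

/* The check that instrumentation would run before each memory access:
 * does any byte touched by the access have its shadow byte set? */
static bool access_hits_watchpoint(size_t offset, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        if (shadow[offset + i] != 0)
            return true;
    }
    return false;
}
```

The cost of a check is independent of how many watchpoints exist, which is what makes millions of them practical; the trade-off is that every memory access pays for the check.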
