Examples: Part I - DynamoRIO

Examples: Part I

1:30-1:40  Welcome + DynamoRIO History
1:40-2:40  DynamoRIO Internals
2:40-3:00  Examples, Part 1
3:00-3:15  Break
3:15-4:15  DynamoRIO API
4:15-5:15  Examples, Part 2
5:15-5:30  Feedback

DynamoRIO Examples Part I Outline
• Common Steps of writing a DynamoRIO client
• Dynamic Instruction Counting Example

DynamoRIO Tutorial at MICRO 12 Dec 2009


Common Steps
• Step 1: Register Events
  – DR_EXPORT void dr_init(client_id_t id)

    Register Function               Events
    dr_register_bb_event            Basic Block Building
    dr_register_thread_init_event   Thread Initialization
    dr_register_exit_event          Process Exit

• Step 2: Implementation
  – Initialization
  – Finalization
  – Instrumentation

• Step 3: Optimization
  – Optimize the instrumentation to improve performance


Step 1: Register Events

    uint num_dyn_instrs;

    static void event_init(void);
    static void event_exit(void);
    static dr_emit_flags_t event_basic_block(void *drcontext, void *tag,
                                             instrlist_t *ilist,
                                             bool for_trace, bool translating);

    DR_EXPORT void dr_init(client_id_t id)
    {
        /* register events */
        dr_register_bb_event(event_basic_block);
        dr_register_exit_event(event_exit);
        /* process initialization event */
        event_init();
    }

Step 2: Implementation (I)

    static void event_init(void)
    {
        num_dyn_instrs = 0;
    }

    static void event_exit(void)
    {
        dr_printf("Total number of instructions executed: %u\n",
                  num_dyn_instrs);
    }

    static dr_emit_flags_t event_basic_block(void *drcontext, void *tag,
                                             instrlist_t *ilist,
                                             bool for_trace, bool translating)
    {
        int num_instrs;
        /* count the instructions in this block once, at build time */
        num_instrs = ilist_num_instrs(ilist);
        /* insert code that adds that count every time the block runs */
        insert_count_code(drcontext, ilist, num_instrs);
        return DR_EMIT_DEFAULT;
    }

Step 2: Implementation (II)

    static int ilist_num_instrs(instrlist_t *ilist)
    {
        instr_t *instr;
        int num_instrs = 0;
        /* iterate over instruction list to count number of instructions */
        for (instr = instrlist_first(ilist); instr != NULL;
             instr = instr_get_next(instr))
            num_instrs++;
        return num_instrs;
    }

    static void do_ins_count(int num_instrs)
    {
        num_dyn_instrs += num_instrs;
    }

    static void insert_count_code(void *drcontext, instrlist_t *ilist,
                                  int num_instrs)
    {
        dr_insert_clean_call(drcontext, ilist, instrlist_first(ilist),
                             do_ins_count, false, 1,
                             OPND_CREATE_INT32(num_instrs));
    }

Instrumented Basic Block

    # switch stack
    # switch aflags and errno
    # save all registers
    # call do_ins_count
    push $0x00000003
    call $0xb7ef73e4 (do_ins_count)
    # restore registers
    # switch aflags and errno back
    # switch stack back
    # application code
    add  $0x0000e574 %ebx -> %ebx
    test %al $0x08
    jz   $0xb80e8a98

Step 3: Optimization (I): counter update inlining

    static void insert_count_code(void *drcontext, instrlist_t *ilist,
                                  int num_instrs)
    {
        instr_t *instr, *where;
        opnd_t opnd1, opnd2;

        where = instrlist_first(ilist);
        /* save aflags */
        dr_save_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
        /* num_dyn_instrs += num_instrs */
        opnd1 = OPND_CREATE_ABSMEM(&num_dyn_instrs, OPSZ_PTR);
        opnd2 = OPND_CREATE_INT32(num_instrs);
        instr = INSTR_CREATE_add(drcontext, opnd1, opnd2);
        instrlist_meta_preinsert(ilist, where, instr);
        /* restore aflags */
        dr_restore_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
    }

Instrumented Basic Block

    mov  %eax -> %fs:0x0c
    lahf -> %ah
    seto -> %al
    add  $0x00000003, 0xb7d25030
    add  $0x7f %al -> %al
    sahf %ah
    mov  %fs:0x0c -> %eax
    # application code
    add  $0x0000e574 %ebx -> %ebx
    test %al $0x08
    jz   $0xb7f14a98

Step 3: Optimization (II): aflags stealing

    static void insert_count_code(void *drcontext, instrlist_t *ilist,
                                  int num_instrs)
    {
        …
        save_aflags = aflags_analysis(ilist);
        /* save aflags */
        if (save_aflags)
            dr_save_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
        /* num_dyn_instrs += num_instrs */
        opnd1 = OPND_CREATE_ABSMEM(&num_dyn_instrs, OPSZ_PTR);
        opnd2 = OPND_CREATE_INT32(num_instrs);
        instr = INSTR_CREATE_add(drcontext, opnd1, opnd2);
        instrlist_meta_preinsert(ilist, where, instr);
        /* restore aflags */
        if (save_aflags)
            dr_restore_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
    }

Instrumented Basic Block

    add  $0x00000003, 0xb7d25030
    # application code
    add  $0x0000e574 %ebx -> %ebx
    test %al $0x08
    jz   $0xb7f14a98

Step 3: Optimization (III): more optimizations
• Using lea (load effective address) instead of add
    lea [%reg, num_instrs] -> %reg

• Register liveness analysis
  – Use a dead register to avoid the register save/restore for lea

• Global aflags/registers analysis
  – Analyze aflags/register liveness over the CFG

• Trace Optimization
  – Trace: single-entry, multi-exit
  – Update counters only at trace exits

Other Issues
• Data race on counter update in multithreaded programs
  – Global lock for every update
  – Atomic update (lock-prefixed add)
    • LOCK(instr);
  – Thread-private counter
    • Thread-private code cache: different variable at different address
    • Thread-shared code cache: thread-local storage

• 32-bit counter overflow
  – 64-bit counter
    • Two instructions on a 32-bit architecture: add, adc
  – One 32-bit local counter and one 64-bit global counter
    • Instrument to update the 32-bit local counter
    • Update the 64-bit global counter on a timer interrupt
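The last option above (a cheap 32-bit local counter folded into a 64-bit global counter) can be sketched in plain C. This is only an illustration under stated assumptions: the names instr_counter_t, count_block, flush_counter, and total_count are hypothetical and not part of the DynamoRIO API, and the periodic flush that the slide attributes to a timer interrupt is modeled as an ordinary function call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model, not DynamoRIO API. */
typedef struct {
    uint32_t local_count;  /* updated by the inlined instrumentation */
    uint64_t global_count; /* updated only by the periodic flush */
} instr_counter_t;

/* Models the inlined instrumentation: a single 32-bit add per block. */
static void count_block(instr_counter_t *c, uint32_t num_instrs)
{
    assert(c != NULL);
    c->local_count += num_instrs;
}

/* Models the periodic handler: fold the local count into the 64-bit
 * total and reset it. It must run before local_count can wrap. */
static void flush_counter(instr_counter_t *c)
{
    assert(c != NULL);
    c->global_count += c->local_count;
    c->local_count = 0;
}

/* Current total: flushed part plus what has not been flushed yet. */
static uint64_t total_count(const instr_counter_t *c)
{
    assert(c != NULL);
    return c->global_count + c->local_count;
}
```

The point of the split is that the hot path stays a single 32-bit add, while the wider arithmetic happens rarely. The sketch is only correct if the flush interval is short enough that the local counter cannot overflow between two flushes.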

Examples: Part 2

Larger Examples
• Dynamic Optimization
  – Strength Reduction
  – Software Prefetch

• Profiling
  – Pipelined Profiling and Analysis

• Shadow Memory
  – Umbra
  – Millions of Watchpoints

• Dr. Memory

Dynamic Optimization Opportunities
• Traditional compiler optimizations
  – Compiler has limited view: application assembled at runtime
  – Some shipped products are built without optimizations

• Microarchitecture-specific optimizations
  – Feature set and relative performance of instructions vary
  – Combinatorial blowup if done statically

• Adaptive optimizations
  – Need runtime information: prior profiling runs not always representative

Dynamic Optimization in DynamoRIO
• Traces are a natural unit for optimization
  – Focus only on hot code
  – Cross procedure, file and module boundaries

• Linear control flow
  – Single-entry, multi-exit simplifies analysis

• Support for adaptive optimization
  – Can replace traces dynamically

Strength Reduction: inc to add
• On Pentium 4, inc is slower than add 1 (and dec is slower than sub 1)
• The opposite is true on Pentium 3
• Microarchitecture-specific optimizations are best performed dynamically

    static dr_emit_flags_t event_trace(void *drcontext, void *tag,
                                       instrlist_t *trace, bool xl8);
    static bool replace_inc_with_add(void *drcontext, instr_t *instr,
                                     instrlist_t *trace);

    DR_EXPORT void dr_init(client_id_t id)
    {
        /* Pentium 4? */
        if (proc_get_family() == FAMILY_PENTIUM_IV)
            dr_register_trace_event(event_trace);
    }

    static dr_emit_flags_t event_trace(void *drcontext, void *tag,
                                       instrlist_t *trace, bool xl8)
    {
        instr_t *instr, *next_instr;
        int opcode;
        /* look for inc / dec */
        for (instr = instrlist_first(trace); instr != NULL;
             instr = next_instr) {
            /* fetch the successor first: instr may be replaced below */
            next_instr = instr_get_next(instr);
            opcode = instr_get_opcode(instr);
            if (opcode == OP_inc || opcode == OP_dec)
                replace_inc_with_add(drcontext, instr, trace);
        }
        return DR_EMIT_DEFAULT;
    }

    static bool replace_inc_with_add(void *drcontext, instr_t *instr,
                                     instrlist_t *trace)
    {
        instr_t *in;
        uint eflags;
        int opcode = instr_get_opcode(instr);
        bool ok_to_replace = false;

        /* ensure the eflags change is ok: inc/dec leave CF untouched but
         * add/sub write it, so CF must be overwritten before it is read */
        for (in = instr; in != NULL; in = instr_get_next(in)) {
            eflags = instr_get_arith_flags(in);
            if ((eflags & EFLAGS_READ_CF) != 0)
                return false;
            if ((eflags & EFLAGS_WRITE_CF) != 0) {
                ok_to_replace = true;
                break;
            }
            if (instr_is_exit_cti(in))
                return false;
        }
        if (!ok_to_replace)
            return false;

        /* replace with add / sub */
        if (opcode == OP_inc)
            in = INSTR_CREATE_add(drcontext, instr_get_dst(instr, 0),
                                  OPND_CREATE_INT8(1));
        else
            in = INSTR_CREATE_sub(drcontext, instr_get_dst(instr, 0),
                                  OPND_CREATE_INT8(1));
        instr_set_prefixes(in, instr_get_prefixes(instr));
        instrlist_replace(trace, instr, in);
        instr_destroy(drcontext, instr);
        return true;
    }
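The carry-flag check in replace_inc_with_add can be tried out without DynamoRIO by modeling each instruction as a set of flag effects. The sketch below is a simplified stand-in, not the client code: insn_model_t, READS_CF, WRITES_CF, and cf_dead_after are hypothetical names, and a real client would obtain the flag effects from instr_get_arith_flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag-effect model of one instruction in a trace. */
enum { READS_CF = 1u << 0, WRITES_CF = 1u << 1 };

typedef struct {
    unsigned flags; /* READS_CF and/or WRITES_CF */
    bool is_exit;   /* control transfer that can leave the trace */
} insn_model_t;

/* Returns true if the carry flag is overwritten before anything reads it,
 * scanning forward from index start (the inc/dec itself). Only then may
 * inc/dec, which leave CF unchanged, become add/sub, which write CF. */
static bool cf_dead_after(const insn_model_t *trace, size_t len, size_t start)
{
    assert(trace != NULL && start < len);
    for (size_t i = start; i < len; i++) {
        if (trace[i].flags & READS_CF)
            return false; /* a later reader would see a changed CF */
        if (trace[i].flags & WRITES_CF)
            return true;  /* CF is overwritten first: safe */
        if (trace[i].is_exit)
            return false; /* code outside the trace might read CF */
    }
    return false;         /* end of trace reached without a CF write */
}
```

Note that the read test comes before the write test, so an instruction such as adc, which both reads and writes CF, blocks the replacement. The scan is deliberately conservative: when it cannot prove CF is dead, it answers false.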

Strength Reduction Results

[Figure: normalized execution time (0.0 to 1.2) of base vs. inc2add for the benchmarks ammp, applu, apsi, art, equake, mesa, mgrid, sixtrack, swim, wupwise, and their harmonic mean. 2% mean speedup.]

Software Prefetching
• Ubiquitous Memory Introspection (UMI):
  – Online, lightweight, adaptive memory optimization

• Key Ideas
  – Sampling to identify hot traces
    • Timer interrupt
    • Performance counter (L2 cache miss) overflow interrupt
  – Profiling a trace if it is hot enough
    • Instrument trace for memory profiling
  – Analyzing the profile once it is full
    • Cache simulation to identify loads with high miss ratios
    • Stride reference analysis
  – Optimization
    • Instrument trace with L2 prefetch requests for high-miss loads

[Figure: profile table with references Ref 1 to Ref 5 and iterations Iter 1 to Iter 3.]
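The stride reference analysis step can be illustrated with a small C sketch. It assumes the profile holds the addresses one static load touched in successive iterations; the function names find_constant_stride and prefetch_target and the notion of a prefetch distance measured in iterations are assumptions made for this example, not details taken from UMI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true and stores the stride if the recorded addresses of one
 * load advance by the same nonzero amount in every iteration. */
static bool find_constant_stride(const uintptr_t *addrs, size_t n,
                                 intptr_t *stride)
{
    assert(addrs != NULL && stride != NULL);
    if (n < 3)
        return false; /* too few samples to call it a pattern */
    intptr_t first = (intptr_t)(addrs[1] - addrs[0]);
    if (first == 0)
        return false; /* same address every time: nothing to prefetch */
    for (size_t i = 2; i < n; i++) {
        if ((intptr_t)(addrs[i] - addrs[i - 1]) != first)
            return false;
    }
    *stride = first;
    return true;
}

/* Address to prefetch when running `distance` iterations ahead. */
static uintptr_t prefetch_target(uintptr_t addr, intptr_t stride,
                                 unsigned distance)
{
    return addr + (uintptr_t)(stride * (intptr_t)distance);
}
```

A load that passes this test and also shows a high miss ratio in the cache simulation is a candidate for an inserted prefetch of prefetch_target(current address, stride, distance).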

Software Prefetching Results



PiPA: Pipelined Profiling and Analysis

[Figure: execution timeline (threads or processes vs. time). The original application is shown with its instrumentation overhead and its profiling and analysis overhead. The pipeline is shown as stage 0, the instrumented application; stage 1, profile processing; and stage 2, parallel analysis, with analyses running on profiles 1 to 4.]

PiPA Challenges
• Minimize the profiling overhead
  – Runtime Execution Profile (REP)

• Minimize the communication between stages
  – Double buffering

• Design efficient parallel analysis algorithms
  – We focus on cache simulation

PiPA Prototype
  – Stage 0: instrumented application, collects REP
  – Stage 1: profile reconstruction and splitting
  – Stage 2: parallel cache simulation

Stage 0: Profiling

    static dr_emit_flags_t
    event_basic_block(void *drcontext, void *tag, instrlist_t *ilist,
                      bool for_trace, bool xl8)
    {
        instr_t *instr;
        for (instr = instrlist_first(ilist); instr != NULL;
             instr = instr_get_next(instr)) {
            /* check and instrument if any memory read */
            if (instr_reads_memory(instr))
                for (int i = 0; i < instr_num_srcs(instr); i++)
                    if (opnd_is_memory_reference(instr_get_src(instr, i)))
                        instrument_mem_read(drcontext, ilist, instr,
                                            instr_get_src(instr, i));
            /* check and instrument if any memory write */
            if (instr_writes_memory(instr))
                for (int i = 0; i < instr_num_dsts(instr); i++)
                    if (opnd_is_memory_reference(instr_get_dst(instr, i)))
                        instrument_mem_write(drcontext, ilist, instr,
                                             instr_get_dst(instr, i));
        }
        return DR_EMIT_DEFAULT;
    }

Stage 0: Profiling

    static void
    instrument_mem_read(void *drcontext, instrlist_t *ilist,
                        instr_t *where, opnd_t ref)
    {
        app_pc pc = instr_get_app_pc(where);
        int size = opnd_size_in_bytes(opnd_get_size(ref));
        …
        /* calculate memory reference address */
        if (opnd_is_base_disp(ref)) {
            opnd_set_size(&ref, OPSZ_lea);
            instr = INSTR_CREATE_lea(drcontext, opnd_create_reg(reg1), ref);
        } else if (opnd_is_rel_addr(ref) || opnd_is_abs_addr(ref))
            instr = INSTR_CREATE_mov_imm(drcontext, opnd_create_reg(reg1),
                        OPND_CREATE_INTPTR(opnd_get_addr(ref)));
        instrlist_meta_preinsert(ilist, where, instr);
        /* put address into profile buffer */
        instrlist_meta_preinsert(ilist, where,
            INSTR_CREATE_mov_st(drcontext, OPND_CREATE_MEMPTR(reg2, 4),
                                opnd_create_reg(reg1)));
        …
    }

Stage 0: Profiling
• Runtime Execution Profile (REP)
  – Fast profiling
  – Small profile size
  – Easy information extraction

• REP unit
  – REP static (REPS)
    • Information that is known statically
  – REP dynamic (REPD)
    • Information that is only known at runtime

• Can be customized for different analyses
  – In our prototype we consider cache simulation

REP Example

Application code:

    bb1: mov [eax + 0x0c] -> eax
         mov ebp -> esp
         pop ebp
         return

REPS, bb1 REP unit:

    tag: 0x080483d7   num_slots: 2   num_refs: 3
    refs:
    pc           type   size   offset   value_slot   size_slot
    0x080483d7   read   4      12       1            -1
    0x080483dc   read   4      0        2            -1
    0x080483dd   read   4      4        2            -1

REPD, reached through the profile base pointer:

    First buffer: ... | bb1 | eax | esp | ... | Canary Zone
    Next buffer:  ...

REP Example (continued)

Application code:

    bb2: pop ebx
         pop ecx
         cmp eax, 0
         jz  label_bb3

REPD, reached through the profile base pointer:

    First buffer: ... | bb1 | eax | esp | bb2 | esp | ... | Canary Zone
                        (bb1 entry: 12 bytes)
    Next buffer:  ...

Profiling Optimization
• Store register values in REP
  – Avoid computing the memory address

• Register liveness analysis
  – Avoid register stealing if possible

• Record a single register value for multiple references
  – A single stack pointer value for a sequence of push/pop
  – The base address for multiple accesses to the same structure

• Profiling buffer count update
  – Update the counter once per basic block
  – Use the lea instruction to avoid aflags usage

• Buffer full check
  – Canary Zone
  – lea & jecxz to avoid aflags usage

Profiling Overhead

[Figure: slowdown relative to native execution (0 to 8) on SPECint2000, SPECfp2000, and SPEC2000, comparing instrumentation without optimization against optimized instrumentation. Avg slowdown: ~3x.]

Stage 1: Profile Reconstruction
Need to reconstruct the full memory reference information

REPS:

    ...
    pc: 0x080483d7  type: read  size: 4  offset: 12  value_slot: 1  size_slot: -1
    pc: 0x080483dc  type: read  size: 4  offset: 0   value_slot: 2  size_slot: -1
    ...

REPD:

    ... | bb1 | 0x2304 | 0x141a | bb2 | 0x1423 | ... | Canary Zone

Reconstructed references:

    PC           Address   Type   Size
    ....         ....      ....   ....
    0x080483d7   0x2310    read   4
    0x080483dc   0x141a    read   4
    ....         ....      ....   ....
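The reconstruction step combines the static part of a REP unit with the register values recorded at runtime: each address is the recorded value in the reference's slot plus its static offset. The C sketch below illustrates this with the numbers from the example. The struct layouts and the name reconstruct_refs are assumptions made for illustration, value_slot is treated as 1-based as the example suggests, and size_slot is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Static part of one reference (hypothetical layout). */
typedef struct {
    uint32_t pc;
    char     type;       /* 'r' for read, 'w' for write */
    uint32_t size;
    int32_t  offset;     /* displacement added to the recorded value */
    int      value_slot; /* 1-based index into the dynamic slots */
} rep_static_ref_t;

/* Fully reconstructed memory reference. */
typedef struct {
    uint32_t pc;
    uint32_t addr;
    char     type;
    uint32_t size;
} mem_ref_t;

/* Rebuilds num_refs references of one basic block execution from the
 * static descriptions and the slot values recorded at runtime. */
static size_t reconstruct_refs(const rep_static_ref_t *refs, size_t num_refs,
                               const uint32_t *slots, size_t num_slots,
                               mem_ref_t *out)
{
    assert(refs != NULL && slots != NULL && out != NULL);
    for (size_t i = 0; i < num_refs; i++) {
        assert(refs[i].value_slot >= 1 &&
               (size_t)refs[i].value_slot <= num_slots);
        out[i].pc = refs[i].pc;
        out[i].addr = slots[refs[i].value_slot - 1] + (uint32_t)refs[i].offset;
        out[i].type = refs[i].type;
        out[i].size = refs[i].size;
    }
    return num_refs;
}
```

With slot values 0x2304 and 0x141a, the first two references come out as 0x2310 and 0x141a, matching the reconstructed table. Because only two values are stored for three references, the profile stays smaller than a full address trace.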

Profile Reconstruct Overhead
The impact of using REP
  – Experiments done on the 8-core system with 16MB buffers and 8 threads

[Figure: slowdown relative to native execution (0 to 35). PiPA using REP: 4.5x. PiPA using standard profile format: 20.7x.]

Stage 2: Parallel Cache Simulation
How to parallelize?
  – Split the address trace into independent groups (in stage 1)
  – Two memory references that access different sets are independent

Set associative caches
  – Partition the cache sets and simulate them using several independent simulators
  – Merge the results (number of hits and misses) at the end of the simulation

Example:
  – 32K cache, 32-byte line, 4-way associative => 256 sets
  – 4 independent simulators, each one simulates 64 sets (round-robin distribution)

    PC     Address      Type   Size
    ....   0xbf9c4614   r      4
    ....   0xbf9c4705   w      4
    ....   0xbf9c4a34   r      4
    ....   0xbf9c4a60   w      4
    ....   0xbf9c4a5c   r      4
    ....   0xbf9c460d   r      4

    0: 0xbf9c4614, 0xbf9c4705, 0xbf9c460d ...
    1: 0xbf9c4a34 ...
    2: 0xbf9c4a5c ...
    3: 0xbf9c4a60 ...
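The split in the example can be reproduced with a few lines of C. The sketch assumes the conventional set-index computation (line number modulo number of sets) and assigns sets to simulators round-robin; cache_cfg_t, num_sets, set_index, and simulator_for are illustrative names rather than PiPA code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cache geometry plus the number of parallel simulators (hypothetical). */
typedef struct {
    uint32_t cache_bytes;
    uint32_t line_bytes;
    uint32_t ways;
    uint32_t num_simulators;
} cache_cfg_t;

/* Number of sets = capacity / (line size * associativity). */
static uint32_t num_sets(const cache_cfg_t *c)
{
    assert(c != NULL && c->line_bytes != 0 && c->ways != 0);
    return c->cache_bytes / (c->line_bytes * c->ways);
}

/* Set that an address maps to: line number modulo number of sets. */
static uint32_t set_index(const cache_cfg_t *c, uint32_t addr)
{
    return (addr / c->line_bytes) % num_sets(c);
}

/* Round-robin distribution of sets over the independent simulators. */
static uint32_t simulator_for(const cache_cfg_t *c, uint32_t addr)
{
    assert(c != NULL && c->num_simulators != 0);
    return set_index(c, addr) % c->num_simulators;
}
```

For the 32K, 32-byte-line, 4-way configuration this yields 256 sets, and the six addresses in the table land in simulators 0, 0, 1, 3, 2, and 0, which is the grouping shown above. References in different groups never touch the same set, so each simulator can process its group without synchronization.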

Cache Simulation Overhead
Experiments done on the 8-core system
  – 8 recovery threads and 8 cache simulators

[Figure: slowdown relative to native execution (0 to 60). PiPA: 10.5x. Pin dcache: 32x. PiPA speedup over dcache: 3x.]





Shadow Memory
• Application
  – Store meta-data associated with application data
    • Millions of software watchpoints
    • Dynamic information flow tracking (taint propagation)
    • Race detection
    • Memory usage debugging tools (MemCheck/Dr. Memory)

• Issues
  – Performance
  – Multi-threaded applications
  – Flexibility
  – Platform dependence

Umbra Outline
• Design
• Implementation
• Optimization

Design
• Address Space
  – A collection of fixed-size units
    • 256M (32-bit), 4G (64-bit)
    • Application, Shadow, Unused

• Translation Table
  – Translation from an application memory unit to its corresponding shadow memory unit

    addr_shd = addr_app × scale + offset

    Unit        App Mem                    Shd Mem                    Offset
    App Mem 1   [0x00000000, 0x10000000)   [0x20000000, 0x30000000)   0x20000000
    App Mem 2   [0x60000000, 0x70000000)   [0x40000000, 0x50000000)   -0x20000000
    App Mem 3   [0x80000000, 0x90000000)   [0x50000000, 0x60000000)   -0x30000000

[Figure: address space layout with the units App Mem 1, Shd Mem 1, Shd Mem 2, Shd Mem 3, App Mem 2, and App Mem 3, separated by unused units.]
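A unit-based translation table of this kind can be sketched in C for the 32-bit case. The sketch assumes a scale of 1 (one shadow byte per application byte) and 256MB units, so the top four address bits select the unit; unit_entry_t, map_unit, and translate are hypothetical names, not Umbra's interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UNIT_BITS 28 /* 256MB units on a 32-bit address space */
#define NUM_UNITS 16

/* One translation table entry per unit (hypothetical layout). */
typedef struct {
    bool    is_app; /* true if the unit holds application memory */
    int64_t offset; /* shadow base minus application base */
} unit_entry_t;

/* Records that the application unit at app_base is shadowed at shd_base. */
static void map_unit(unit_entry_t *table, uint32_t app_base, uint32_t shd_base)
{
    const uint32_t unit_mask = (1u << UNIT_BITS) - 1;
    assert(table != NULL);
    assert((app_base & unit_mask) == 0 && (shd_base & unit_mask) == 0);
    table[app_base >> UNIT_BITS].is_app = true;
    table[app_base >> UNIT_BITS].offset = (int64_t)shd_base - (int64_t)app_base;
}

/* addr_shd = addr_app * 1 + offset; fails for non-application units. */
static bool translate(const unit_entry_t *table, uint32_t app_addr,
                      uint32_t *shd_addr)
{
    assert(table != NULL && shd_addr != NULL);
    const unit_entry_t *e = &table[app_addr >> UNIT_BITS];
    if (!e->is_app)
        return false;
    *shd_addr = (uint32_t)((int64_t)app_addr + e->offset);
    return true;
}
```

Because every address in a unit shares one offset, a translation costs one table index and one add, and the table needs only as many entries as there are units.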

Implementation
• Memory Manager
  – Monitor and control application memory allocation
    • brk, mmap, munmap, mremap
    • dr_register_pre_syscall_event
    • dr_register_post_syscall_event
  – Allocate shadow memory
  – Maintain translation table

• Instrumenter
  – Instrument every memory reference
    • Context save
    • Address calculation
    • Address translation
    • Shadow memory update
    • Context restore

Instrument Code Example

    # Context Save
    mov  %ecx -> [ECX_SLOT]
    mov  %edx -> [EDX_SLOT]
    mov  %eax -> [EAX_SLOT]
    lahf -> %ah
    seto -> %al

    # Address Calculation
    lea  [%ebx, 16] -> %ecx

    # Address Translation
    mov  0 -> %edx
    …    # table lookup code
    add  %ecx, table[%edx].offset -> %ecx

    # Shadow Memory Update
    mov  1 -> [%ecx]

    # Context Restore
    add  %al 0x7f
    sahf
    mov  [ECX_SLOT] -> %ecx
    mov  [EDX_SLOT] -> %edx
    mov  [EAX_SLOT] -> %eax

    # Application memory reference
    mov  0 -> [%ebx, 16]

Optimization
• Translation Optimization
  – Thread Local Translation Table
  – Memoization Check
  – Reference Check

• Instrumentation Optimization
  – Context Switch Reduction
  – Reference Grouping
  – 3-stage Code Layout

Translation Optimization
• Thread Local Translation Optimization
  – Local translation table per thread
  – Synchronize with global translation table when necessary
  – Avoid lock contention

• Memoization Cache
  – Software cache per thread
  – Stores frequently used translation entries
    • Stack
    • Units found in last table lookup

• Reference Cache
  – Software cache per static application memory reference
    • Last reference unit tag
    • Last translation offset

[Figure: per-thread lookup layers for Thread 1 and Thread 2 (reference cache, memoization cache, thread-local translation table) in front of the global translation table.]

Instrumentation Optimization
• Context Switch Reduction
  – Register liveness analysis

• Reference Grouping
  – One translation for multiple references using the same base
    • Stack local variables
    • Different members of the same object

• 3-stage Code Layout
  – Inline stub
    • Quick inline check code with minimal context switch
  – Lean procedure
    • Simple assembly procedure with partial context switch
  – Callout
    • C function with complete context switch

3-stage Code Layout
• Inline stub
  – Reference cache check
  – Jump to lean procedure on a miss

• Lean procedure
  – Memoization cache check
  – Local table lookup
  – Clean call to callout

• Callout
  – Global table synchronization
  – Local table lookup

Instrumentation Optimization

Inline Stub

    # reference cache check
    lea  [ref] -> %r1
    and  %r1, 0xf0000000 -> %r1
    cmp  %r1, ref.tag
    je   .update_shadow_memory
    # jmp-and-link to lean procedure
    mov  %r1 -> ref.tag
    mov  .update_ref_cache -> [ret_pc]
    jmp  lean_procedure
    .update_ref_cache
    mov  %r1 -> ref.offset
    # shadow memory update
    .update_shadow_memory
    lea  [ref] -> %r1
    add  %r1, ref.offset -> %r1
    mov  1 -> [%r1]

Lean Procedure

    # memoization check
    cmp  %r1, cache1.tag
    jne  .cache1_miss
    mov  cache1.offset -> %r1
    jmp  [ret_pc]
    .cache1_miss
    cmp  %r1, cache2.tag
    jne  .cache2_miss
    mov  cache2.offset -> %r1
    jmp  [ret_pc]
    .cache2_miss
    # table lookup
    mov  %r1 -> cache2.tag
    mov  %r2 -> [R2_SLOT]
    mov  0 -> %r2
    …
    mov  [R2_SLOT] -> %r2
    mov  %r1 -> cache2.offset
    jmp  [ret_pc]
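The reference cache check performed by the inline stub can be modeled in C. In this sketch each static memory reference remembers the unit tag and offset of its last translation and falls back to a slower lookup only when the tag changes. The types ref_cache_t and xlate_state_t, the 16-entry offset table that stands in for the lean procedure and callout, and the slow_lookups counter are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNIT_MASK 0xf0000000u /* top 4 bits select a 256MB unit */

/* Per static memory reference: last unit tag and its offset. */
typedef struct {
    uint32_t tag;
    int32_t  offset;
    int      valid;
} ref_cache_t;

/* Stand-in for the slower translation levels (hypothetical). */
typedef struct {
    int32_t  unit_offset[16];
    unsigned slow_lookups; /* how often the slow path was taken */
} xlate_state_t;

static int32_t slow_lookup(xlate_state_t *s, uint32_t tag)
{
    s->slow_lookups++;
    return s->unit_offset[tag >> 28];
}

/* Translates app_addr, consulting the reference cache first. */
static uint32_t shadow_addr(xlate_state_t *s, ref_cache_t *ref,
                            uint32_t app_addr)
{
    assert(s != NULL && ref != NULL);
    uint32_t tag = app_addr & UNIT_MASK;
    if (!ref->valid || ref->tag != tag) {
        /* miss: take the slow path and refill the reference cache */
        ref->offset = slow_lookup(s, tag);
        ref->tag = tag;
        ref->valid = 1;
    }
    /* unsigned addition wraps, so negative offsets work as intended */
    return app_addr + (uint32_t)ref->offset;
}
```

A given instruction tends to keep touching the same unit (the stack, or one heap region), so most executions take only the tag compare and the add.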

Performance Evaluation

[Figure: bar chart (y-axis 0 to 18) over CINT, CFP, and CPU2006 with the series DynamoRIO, Local Translation Table, Memoization Check, Reference Cache, Context Switch Reduction, and Reference Grouping. Labeled values: 3.29, 2.49, 1.89.]

Umbra Client: Shared Memory Detection

    static void
    instrument_update(void *drcontext, umbra_info_t *umbra_info,
                      mem_ref_t *ref, instrlist_t *ilist, instr_t *where)
    {
        …
        /* test [%reg].tid_map, tid_map */
        opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0);
        opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map);
        instrlist_meta_preinsert(ilist, where,
                                 INSTR_CREATE_test(drcontext, opnd1, opnd2));
        /* jnz where: skip the update if this thread's bit is already set */
        opnd1 = opnd_create_instr(where);
        instrlist_meta_preinsert(ilist, where,
                                 INSTR_CREATE_jcc(drcontext, OP_jnz, opnd1));
        /* or: atomically set this thread's bit */
        opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0);
        opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map | 1);
        instr = INSTR_CREATE_or(drcontext, opnd1, opnd2);
        LOCK(instr);
        instrlist_meta_preinsert(ilist, where, instr);
    }

Watchpoints
• Watchpoints are powerful debugging tools
  – Millions of Watchpoints
    • Watch all heap allocations to detect uninitialized values
    • Watch all buffers to detect overflow
    • Watch all return addresses

• However, today’s debuggers cannot efficiently handle more than a handful of watched addresses
  – GDB is forced into single-step mode
  – Prohibitively expensive

Millions of Watchpoints
• Full instrumentation with shadow memory
  – Simply mark the shadow memory
  – Instrument to check on every memory access

• Partial instrumentation
  – On setting a watchpoint
    • Duplicate data into shadow pages
    • Make the application page inaccessible
  – In the access violation handler
    • Instrument the faulting instruction
      – Check the reference target
      – Report if it accesses watched data
      – Redirect if it accesses the protected pages
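The full-instrumentation variant reduces to two operations: marking shadow memory when a watchpoint is set, and checking the shadow bytes of every access. A minimal C sketch follows, assuming one shadow byte per application byte over a single fixed-size region; watch_region_t, watch_set, and access_hits_watchpoint are hypothetical names, and a real tool would locate the shadow bytes through a translation scheme such as Umbra's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REGION_SIZE 4096

/* Shadow state for one application region (hypothetical layout):
 * a nonzero shadow byte means the application byte is watched. */
typedef struct {
    uint8_t shadow[REGION_SIZE];
} watch_region_t;

static void watch_init(watch_region_t *r)
{
    assert(r != NULL);
    memset(r->shadow, 0, sizeof r->shadow);
}

/* Setting a watchpoint just marks the shadow bytes. */
static void watch_set(watch_region_t *r, size_t off, size_t len)
{
    assert(r != NULL && off <= REGION_SIZE && len <= REGION_SIZE - off);
    memset(r->shadow + off, 1, len);
}

/* The check inserted before every memory access. */
static bool access_hits_watchpoint(const watch_region_t *r, size_t off,
                                   size_t len)
{
    assert(r != NULL && off <= REGION_SIZE && len <= REGION_SIZE - off);
    for (size_t i = 0; i < len; i++) {
        if (r->shadow[off + i] != 0)
            return true;
    }
    return false;
}
```

Setting or clearing a watchpoint costs the same no matter how many others exist, which is what makes millions of them practical. The price is the check on every access, which the partial-instrumentation variant tries to avoid by using page protection.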




