A New Facility for Dynamic Control of Program Execution ... - EMSOFT

3 downloads 0 Views 1MB Size Report
Oct 9, 2002 - x86, SuperH, ARM, ST200. • PocketPC, Linux. DEVICE HARDWARE. • Code & Data Caching and Linking. • Code Optimization. • Hardware ...
A New Facility for Dynamic Control of Program Execution: DELI

Giuseppe Desoli (STMicroelectronics*) Nikolay Mateev (HP Labs) Evelyn Duesterwald (IBM*) Paolo Faraboschi (HP Labs) Josh Fisher (HP Labs)

(* work done while at HP)

© 2002

EMSOFT 2002

October 9, 2002

page 1

programs that process programs

compile time

run time

persistent changes

compilers

dynamic loaders, DELI

transient changes

[?]

superscalar hardware

the ultimate control point in program execution © 2002

EMSOFT 2002

October 9, 2002

page 2

• emulators must stare at every target instruction as it runs (can’t run on native hardware)

example: emulators

– this is considered a “necessary evil” – serious (~100x) performance loss, when operating at the binary level

Program Being Emulated

• an important “trick” – cache, link and then execute a private copy of fragments of the executable code

Emulator Hardware

– as long as you stay in the cache, you achieve nearly native speed © 2002

EMSOFT 2002

October 9, 2002

page 3

hpl project whose goal was to optimize pa-risc binaries at run-time (Vasanth Bala, Evelyn Duesterwald)

short history

– in effect, they did native-native emulation, with optimization as the desired side-effect

produced a startling result

the dynamo project at hpl

– given a running program, you can run a second program (“dynamo”) that speeds up the first one, more than enough to amortize the cost of running dynamo

(1996-2001+)

© 2002

some (already optimized) specint95 programs went 23% faster EMSOFT 2002

October 9, 2002

page 4

dynamo speeds up native code, making up for its own overhead

how dynamo works

the gains come from – inlining small functions – “straightening out” branches – cache locality improvements

Input native instruction stream Interpretation/Profiling Loop no Lookup next PC in Trace Cache hit

Dynamo Code Cache

ex it br an ch

Interpret until taken branch

miss

Hot start of trace? yes recycle counter Trace Selector

Trace Optimizer

Emit Trace

Trace Linker

– eliminating redundant loads

dynamo uses cache & link – when we realized that we could do this at ‘native speed’, we thought about a general layer, accessible for many different purposes

the “cambridge DELI” © 2002

EMSOFT 2002

October 9, 2002

page 5

the point of this talk

a dynamic control point is very powerful

... this is not an emulator, and we did not rediscover cache & link

but only an elaborate fantasy if it costs 100x performance

we can make this practical and open it up to many clients applications the potential and variety of clients that take advantage of this is stunning © 2002

EMSOFT 2002

virtually every processor— gp or embedded— could include such a facility if this works, there will be far-reaching changes in what computers do and how they do it October 9, 2002

page 6

a “control point” in program execution to dynamically transform code and data, enabling:

DELI: Dynamic Execution Layer Interface

Remote Native App. Native App.

• • • •

Remote Emulation System

network

Emulation System

Fragment Description Interface • Code & Data Caching and Linking • Code Optimization • Hardware Abstraction Model • Linking of Remote Data and Code

DEVICE HARDWARE

© 2002

dynamic optimization caching emulation transparent remote streaming virtualized hardware

DELI facilitates • efficient emulation • program transformation • program observation

work in progress on different platforms • x86, SuperH, ARM, ST200 • PocketPC, Linux

EMSOFT 2002

October 9, 2002

page 7

• provides a uniform layer for caching & linking of code • facilitates the construction of emulators & virtual machines • provides hardware virtualization

the DELI infrastructure

• integrates emulated and native code execution • acts as a system service to support 1. observation 2. transformation 3. emulation

© 2002

EMSOFT 2002

October 9, 2002

page 8

DELI components Application Transparent Injector

Programming Interface (API) Dispatch

Optimization Manager

Fragment Manager Lookup

Trace Picker

IR

Profiler Linker

Optimizer Scheduler

Cache Manager

Hardware Abstraction Module (HAM) Hardware Platform © 2002

EMSOFT 2002

October 9, 2002

page 9

transparent or explicit?

could run as-on hardware as is

it was previously impractical to work at this level

© 2002

transparent DELI execution

explicit DELI API

• use dynamic opt. technology to achieve performance comparable to the underlying hardware

• control the DELI operations

application #1



• emit custom code dynamically • extend the native ISA

application #n

emulation system #1



emulation system #n

the DELI

hardware EMSOFT 2002

October 9, 2002

page 10

how DELI works Application level Syscall interface (user < - >kernel)

Trace manager

Optimizer

Optimizer Deli IR

Fragment manager

Optimizer

Dispatcher Cache manager

Optimization manager

DELI API

Code cache Code cache Code cache

Hardware Abstraction Module (HAM) Hardware Platform © 2002

EMSOFT 2002

October 9, 2002

page 11

the transformation infrastructure App licatio plica tio n Level

ma ke_ inst ake_ ap pe n d_ nd _ in st in cre m en tal_ op increm tal_o p tim ize D e li IR

em it_ fra gm en t it_fra

De li IR eli

O p tim ize

la yo ut

in st_ sche sch e du le

in cr _ op tim ize op o p _sch ed ule

ap pe pp e nd _in st d ecod e_ inst in st

e m it_ cod eb lock C od e C a ch e © 2002

EMSOFT 2002

October 9, 2002

page 12

Transmeta “code-morphing” Transitive “dynamite” – transparent to the application and bundled with HW – DELI exports API to application and supports mix native-emulated

VM-Ware: full-system emulation, transparent to the client

how unique is DELI?

– DELI is not an emulator- it only provides support for it

AppStream: streams Java binaries to client machines running JVMs. – requires Java, and modification of the client-side JVM – DELI can support streaming of legacy native binaries

© 2002

EMSOFT 2002

October 9, 2002

page 13

• efficient emulation of embedded cores on a VLIW processor – leverages OS + SDK (e.g. WinCE) without a port

emulation with DELI:

– augments the OS functionality with no OS changes

• integration of native and emulated code

the “DELIverX” prototype

– allows native implementation of compute-intensive tasks – enables integration of legacy GUIs with native computation engines – facilitates “incremental” migration across product generations

© 2002

EMSOFT 2002

October 9, 2002

page 14

example: emulation in rich-media devices PocketTV running on emulated PocketPC /SuperH

Native thread decodes mpeg video/audio stream with a large speedup with respect to actual target core

Emulated application controls user interface and native thread is seen as a library in the emulated environment © 2002

EMSOFT 2002

October 9, 2002

page 15

emulation in an embedded appliance legacy applications (legacy ISA)

new apps or plug-ins (legacy ISA)

media engines (VLIW ISA) ISA))

legacy O/S (WinCE,Palm,Symbian) emulated

patches upgrades ...patches

standard micro emulation system (eg, SuperH, ARM)

DELI native O/S (eg, embedded linux)

native apps (VLIW ISA)

VLIW processor © 2002

EMSOFT 2002

October 9, 2002

page 16

“Lx”: a vliw architecture for media-oriented applications

•base vliw instr. set + extensions •efficient 32-bit integer ISA •extensible (for fp, simd, special ops) •many gp/predicate registers •simple predication through “select” •static branch prediction •precise interrupts •explicit speculation and prefetching © 2002

EMSOFT 2002

joint hp - stmicro family of vliw embedded cores customizable and scalable for a specific application domain variable number of clusters and registers, cache sizes, operation latencies, special operations October 9, 2002

page 17

the st200 (Lx) vliw architecture DELIverX target: the ST210 core

4-way VLIW

HP & STMicroelectronics joint project

IPU

Exception Control

Multiply

Multiply

250MHz (0.18µ) DPU

Control Registers

Other Clusters Pre-Decode

I$

Reg File 64 GPR (32b)

Load Store Unit

scalable customizable compiler-driven

D$

ST210

Branch Unit

Branch RegFile 8BR (1 bit)

ALU

ALU

ALU

Cluster 0

© 2002

Prefetch Buffer

ALU Cluster

EMSOFT 2002

October 9, 2002

32KB I$ 32KB D$ (10mm2)

page 18

DELIverX components

© 2002

EMSOFT 2002

October 9, 2002

page 19

DELIverX performance 2.0 1.83

DELI/opt 500

1.8

DELI/no opt

1.6 1.4

1.33

1.32

1.28

1.25 1.2 1.0 0.88

0.91 0.8

0.83

0.92

0.74

0.70

0.65

0.6 0.4

0.34

0.2

© 2002

EMSOFT 2002

October 9, 2002

qs or t

cr c3 2

bi tc ou nt

ne ry st o dh

sh a

ad m

e la m

g dj pe

g cj pe

ad

pc m

d

c pc m ad

cf m

bz

ip 2

0.0

page 20

mixing native and emulated code value proposition for developers: • scale performance by selectively porting the computeintensive kernels of applications • without giving up the benefits of a legacy operating system and GUI SH3 Emulated/no opt

0.70

SH3 Emulated/opt 500 SH3 Native

1.83 1.00

Mix Emulated/Native

5.36

10.71

VLIW Native 0

© 2002

1

2

3

EMSOFT 2002

4

5

October 9, 2002

6

7

8

9

10

11

page 21

• emulation is just one of the many DELI clients • many types of code transformations are practical at run-time

transforming and observing programs

– – – – –

dynamic code patching streaming of data & code code decompression code decryption polymorphic virus detection

• many types of code observations are useful at run-time – sand-boxing – profiling – digital rights management

© 2002

EMSOFT 2002

October 9, 2002

page 22

transformation example: dynamic code patching loaded code image

supply patch dispatch

cache lookup?

no

API

select fragment

patch table

inspect & patch Miss

Deli

yes

Context Switch

executable code cache

© 2002

EMSOFT 2002

October 9, 2002

Executing Patched Application

page 23

transformation example: code decompression (or decryption) loaded code image

decompression dispatch

cache lookup?

no

API

select fragment decompression routine

decompress Miss

Deli

yes

Context Switch

executable code cache © 2002

EMSOFT 2002

October 9, 2002

Executing Decompressed Application page 24

DELI is a general layer, accessible for many purposes

summary

other alternatives have a specific usage in mind, or point products

we have shown feasibility, achievable performance, and basic infrastructure © 2002

EMSOFT 2002

future work ideas hardware support compiler annotations more applications October 9, 2002

page 25

Suggest Documents