A New Facility for Dynamic Control of Program Execution: DELI
Giuseppe Desoli (STMicroelectronics*) Nikolay Mateev (HP Labs) Evelyn Duesterwald (IBM*) Paolo Faraboschi (HP Labs) Josh Fisher (HP Labs)
(* work done while at HP)
© 2002
EMSOFT 2002
October 9, 2002
page 1
programs that process programs
[figure: grid of compile time vs. run time against persistent vs. transient changes; entries: compilers; dynamic loaders, DELI; [?]; superscalar hardware]
the ultimate control point in program execution
page 2
example: emulators
• emulators must stare at every target instruction as it runs (can’t run on native hardware)
– this is considered a “necessary evil”
– serious (~100x) performance loss, when operating at the binary level
• an important “trick”: cache, link and then execute a private copy of fragments of the executable code
– as long as you stay in the cache, you achieve nearly native speed
[figure: Program Being Emulated, on top of Emulator, on top of Hardware]
page 3
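The cache-and-execute part of the trick can be sketched on a toy machine. Everything here is an illustration, not any real emulator's design: the four-operation instruction set and the names `make_step`, `translate` and `run_cached` are hypothetical, and linking (patching one fragment's exit to jump straight into the next) is left out for brevity.

```python
# Toy program: (op, arg) tuples; an address is an index into the list.
PROGRAM = [
    ("set", 0),      # 0: acc = 0
    ("seti", 5),     # 1: i = 5
    ("add", 3),      # 2: acc += 3        (loop head)
    ("deci", None),  # 3: i -= 1
    ("jnz", 2),      # 4: if i != 0 goto 2
    ("halt", None),  # 5: stop
]


def make_step(op, arg):
    """Decode one non-branch instruction into a callable (hypothetical ISA)."""
    if op == "set":
        def step(state): state["acc"] = arg
    elif op == "seti":
        def step(state): state["i"] = arg
    elif op == "add":
        def step(state): state["acc"] += arg
    elif op == "deci":
        def step(state): state["i"] -= 1
    else:
        raise ValueError(f"unknown op: {op}")
    return step


def translate(program, pc):
    """Build a private copy of the straight-line fragment starting at pc."""
    steps = []
    while True:
        op, arg = program[pc]
        pc += 1
        if op == "jnz":
            taken, fallthrough = arg, pc
            break
        if op == "halt":
            taken = fallthrough = None
            break
        steps.append(make_step(op, arg))

    def fragment(state):
        for step in steps:
            step(state)
        if taken is None:          # fragment ended in halt
            return None
        return taken if state["i"] != 0 else fallthrough

    return fragment


def run_cached(program):
    """Dispatch loop: decode a fragment only on a cache miss, then reuse it."""
    cache, state, translations, pc = {}, {"acc": 0, "i": 0}, 0, 0
    while pc is not None:
        fragment = cache.get(pc)
        if fragment is None:       # miss: pay the decoding cost once
            fragment = cache[pc] = translate(program, pc)
            translations += 1
        pc = fragment(state)       # hit: run the cached copy
    return state["acc"], translations
```

On this program the dispatcher translates three fragments (starting at addresses 0, 2 and 5); the loop-body fragment is decoded once and then run four times.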
short history
the dynamo project at hpl (1996-2001+)
• hpl project whose goal was to optimize pa-risc binaries at run-time (Vasanth Bala, Evelyn Duesterwald)
– in effect, they did native-native emulation, with optimization as the desired side-effect
• produced a startling result
– given a running program, you can run a second program (“dynamo”) that speeds up the first one, more than enough to amortize the cost of running dynamo
– some (already optimized) specint95 programs went 23% faster
page 4
how dynamo works
dynamo speeds up native code, making up for its own overhead
the gains come from
– inlining small functions
– “straightening out” branches
– cache locality improvements
– eliminating redundant loads
[figure: dynamo flowchart. Input native instruction stream -> Interpret until taken branch -> Lookup next PC in Trace Cache; hit: run in the Dynamo Code Cache until an exit branch; miss: Hot start of trace? no: back to the Interpretation/Profiling Loop; yes: Trace Selector -> Trace Optimizer -> Emit Trace -> Trace Linker, recycle counter]
dynamo uses cache & link
– when we realized that we could do this at ‘native speed’, we thought about a general layer, accessible for many different purposes
– the “cambridge DELI”
page 5
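The interpretation/profiling loop can be sketched as a counter-based dispatcher. This is a simplified model under stated assumptions, not dynamo's implementation: the threshold value, the name `simulate_dispatch`, and the use of a plain stream of branch-target addresses as input are all hypothetical, and trace selection, optimization, emission and linking are collapsed into a single cache insertion.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical value; the slides do not give the real one


def simulate_dispatch(branch_targets, hot_threshold=HOT_THRESHOLD):
    """Model the lookup / profile / emit decisions for a stream of target PCs."""
    trace_cache = set()      # start addresses that have an emitted trace
    counters = Counter()     # execution counts for potential trace starts
    stats = {"interpreted": 0, "cached": 0}

    for pc in branch_targets:
        if pc in trace_cache:            # hit: execute from the code cache
            stats["cached"] += 1
            continue
        stats["interpreted"] += 1        # miss: keep interpreting
        counters[pc] += 1
        if counters[pc] >= hot_threshold:
            # "hot start of trace": select, optimize, emit, link (all elided)
            trace_cache.add(pc)
            del counters[pc]             # recycle the counter
    return trace_cache, stats
```

With the threshold at 3, a target is interpreted three times and served from the cache afterwards; a target seen only once never leaves the interpreter.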
the point of this talk
... this is not an emulator, and we did not rediscover cache & link
• a dynamic control point is very powerful
– but only an elaborate fantasy if it costs 100x performance
• we can make this practical and open it up to many client applications
• the potential and variety of clients that take advantage of this is stunning
• virtually every processor, gp or embedded, could include such a facility
• if this works, there will be far-reaching changes in what computers do and how they do it
page 6
DELI: Dynamic Execution Layer Interface
a “control point” in program execution to dynamically transform code and data, enabling:
• dynamic optimization
• caching emulation
• transparent remote streaming
• virtualized hardware
[figure: layer diagram. Native App. and Emulation System, plus Remote Native App. and Remote Emulation System reached over the network, sit on top of the DELI layer, which sits on DEVICE HARDWARE]
DELI layer: Fragment Description Interface
• Code & Data Caching and Linking
• Code Optimization
• Hardware Abstraction Model
• Linking of Remote Data and Code
DELI facilitates
• efficient emulation
• program transformation
• program observation
work in progress on different platforms
• x86, SuperH, ARM, ST200
• PocketPC, Linux
page 7
the DELI infrastructure
• provides a uniform layer for caching & linking of code
• facilitates the construction of emulators & virtual machines
• provides hardware virtualization
• integrates emulated and native code execution
• acts as a system service to support
1. observation
2. transformation
3. emulation
page 8
DELI components
[figure: block diagram with the blocks Application, Transparent Injector, Programming Interface (API), Dispatch, Optimization Manager, Fragment Manager, Lookup, Trace Picker, IR, Profiler, Linker, Optimizer, Scheduler, Cache Manager, Hardware Abstraction Module (HAM), Hardware Platform]
page 9
transparent or explicit?
transparent DELI execution
• use dynamic opt. technology to achieve performance comparable to the underlying hardware
explicit DELI API
• control the DELI operations
• emit custom code dynamically
• extend the native ISA
[figure: application #1 … application #n and emulation system #1 … emulation system #n run on the DELI, which runs on the hardware; annotations: “could run on hardware as is”, “it was previously impractical to work at this level”]
page 10
how DELI works
[figure: block diagram with the blocks Application level, Syscall interface (user <-> kernel), DELI API, Dispatcher, Trace manager, Fragment manager, Cache manager, Optimization manager, Optimizer (several instances), Deli IR, Code cache (several instances), Hardware Abstraction Module (HAM), Hardware Platform]
page 11
the transformation infrastructure
[figure: Application Level operations on the Deli IR and Code Cache, with the labels make_inst, append_inst, incremental_optimize, emit_fragment, Deli IR, Optimize, layout, inst_schedule, incr_optimize, op_schedule, decode_inst, emit_codeblock, Code Cache]
page 12
how unique is DELI?
• Transmeta “code-morphing”, Transitive “dynamite”
– transparent to the application and bundled with HW
– DELI exports an API to the application and supports a mix of native and emulated code
• VMware: full-system emulation, transparent to the client
– DELI is not an emulator; it only provides support for one
• AppStream: streams Java binaries to client machines running JVMs
– requires Java, and modification of the client-side JVM
– DELI can support streaming of legacy native binaries
page 13
emulation with DELI: the “DELIverX” prototype
• efficient emulation of embedded cores on a VLIW processor
– leverages OS + SDK (e.g. WinCE) without a port
– augments the OS functionality with no OS changes
• integration of native and emulated code
– allows native implementation of compute-intensive tasks
– enables integration of legacy GUIs with native computation engines
– facilitates “incremental” migration across product generations
page 14
example: emulation in rich-media devices
PocketTV running on emulated PocketPC/SuperH
• Native thread decodes mpeg video/audio stream with a large speedup with respect to actual target core
• Emulated application controls user interface, and the native thread is seen as a library in the emulated environment
page 15
emulation in an embedded appliance
[figure: software stack, top to bottom]
• emulated: legacy applications (legacy ISA), new apps or plug-ins (legacy ISA), patches, upgrades, ...
• emulated: legacy O/S (WinCE, Palm, Symbian)
• standard micro emulation system (eg, SuperH, ARM)
• native: media engines (VLIW ISA), native apps (VLIW ISA)
• DELI
• native O/S (eg, embedded linux)
• VLIW processor
page 16
“Lx”: a vliw architecture for media-oriented applications
joint hp - stmicro family of vliw embedded cores
• customizable and scalable for a specific application domain
• variable number of clusters and registers, cache sizes, operation latencies, special operations
architecture
• base vliw instr. set + extensions
• efficient 32-bit integer ISA
• extensible (for fp, simd, special ops)
• many gp/predicate registers
• simple predication through “select”
• static branch prediction
• precise interrupts
• explicit speculation and prefetching
page 17
the st200 (Lx) vliw architecture
DELIverX target: the ST210 core
• HP & STMicroelectronics joint project
• scalable, customizable, compiler-driven
• 4-way VLIW
• 250MHz (0.18µ)
• 32KB I$, 32KB D$ (10mm²)
[figure: ST210 block diagram with the blocks I$, Pre-Decode, IPU, Exception Control, Control Registers, Branch Unit, Branch RegFile 8 BR (1 bit), Reg File 64 GPR (32b), ALU (x4), Multiply (x2), Load Store Unit, DPU, D$, Prefetch Buffer, Cluster 0, Other Clusters]
page 18
DELIverX components
page 19
DELIverX performance
[figure: bar chart, y-axis 0.0 to 2.0, series DELI/opt 500 and DELI/no opt; benchmarks: qsort, crc32, bitcount, dhrystone, sha, mad, lame, djpeg, cjpeg, adpcm d, adpcm c, mcf, bzip2; bar labels surviving in extraction order: 1.83, 1.33, 1.32, 1.28, 1.25, 0.88, 0.91, 0.83, 0.92, 0.74, 0.70, 0.65, 0.34]
page 20
mixing native and emulated code
value proposition for developers:
• scale performance by selectively porting the compute-intensive kernels of applications
• without giving up the benefits of a legacy operating system and GUI
[figure: bar chart, x-axis 0 to 11]
SH3 Emulated/no opt    0.70
SH3 Emulated/opt 500   1.83
SH3 Native             1.00
Mix Emulated/Native    5.36
VLIW Native           10.71
page 21
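The chart values can be turned into pairwise ratios with a few lines of arithmetic. This sketch assumes the numbers are performance relative to SH3 Native = 1.00 with higher meaning faster, which the slide suggests but does not state; `RELATIVE_PERF` and `speedup` are hypothetical names.

```python
# Values copied from the chart; the interpretation as "performance
# relative to SH3 Native = 1.00" is an assumption.
RELATIVE_PERF = {
    "SH3 Emulated/no opt": 0.70,
    "SH3 Emulated/opt 500": 1.83,
    "SH3 Native": 1.00,
    "Mix Emulated/Native": 5.36,
    "VLIW Native": 10.71,
}


def speedup(config, baseline):
    """Ratio of two chart entries: how much faster config is than baseline."""
    return RELATIVE_PERF[config] / RELATIVE_PERF[baseline]


# Effect of dynamic optimization on pure emulation: about 2.61x.
opt_gain = speedup("SH3 Emulated/opt 500", "SH3 Emulated/no opt")
# Effect of porting the kernels to native code: about 2.93x over emulation.
mix_gain = speedup("Mix Emulated/Native", "SH3 Emulated/opt 500")
# Remaining gap between the mix and a full native port: about 2.00x.
native_gap = speedup("VLIW Native", "Mix Emulated/Native")
```

Under that reading, porting only the kernels recovers roughly half of what a full native port would deliver, while keeping the legacy OS and GUI.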
transforming and observing programs
• emulation is just one of the many DELI clients
• many types of code transformations are practical at run-time
– dynamic code patching
– streaming of data & code
– code decompression
– code decryption
– polymorphic virus detection
• many types of code observations are useful at run-time
– sand-boxing
– profiling
– digital rights management
page 22
transformation example: dynamic code patching
[figure: flowchart. loaded code image -> dispatch -> cache lookup? yes: Context Switch into the executable code cache (Executing Patched Application); no (Miss): Deli selects the fragment, then inspect & patch against the patch table; the patch table is filled through the API (supply patch)]
page 23
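The patch-on-miss flow can be sketched with a dictionary standing in for each box of the flowchart. This is a toy model, not the DELI API: fragments are plain Python callables, and `make_patching_fetch`, `code_image` and `patch_table` are hypothetical names.

```python
def make_patching_fetch(code_image, patch_table):
    """Return a fetch(addr) that applies patches only when filling the cache."""
    code_cache = {}

    def fetch(addr):
        fragment = code_cache.get(addr)
        if fragment is None:                 # cache miss
            fragment = code_image[addr]      # select fragment from loaded image
            # inspect & patch: substitute the supplied replacement, if any
            fragment = patch_table.get(addr, fragment)
            code_cache[addr] = fragment      # later lookups hit the patched copy
        return fragment

    return fetch


# Loaded code image: two "fragments" keyed by their start address.
code_image = {
    0x100: lambda x: x + 1,
    0x200: lambda x: x * 2,   # the routine we want to fix
}
# Patch supplied by the client for the fragment at 0x200.
patch_table = {0x200: lambda x: x * 3}

fetch = make_patching_fetch(code_image, patch_table)
```

The loaded image is never modified; only the cached copy carries the patch, so dropping the cache restores the original behavior.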
transformation example: code decompression (or decryption)
[figure: flowchart. loaded code image -> dispatch -> cache lookup? yes: Context Switch into the executable code cache (Executing Decompressed Application); no (Miss): Deli selects the fragment and decompresses it with the decompression routine supplied through the API]
page 24
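Decompress-on-miss follows the same shape, with the client supplying the routine. A minimal sketch, assuming fragments are stored as zlib-compressed byte strings; the class name `TransformingCache` and its fields are hypothetical, and a decryption routine could be passed as `transform` in the same way.

```python
import zlib


class TransformingCache:
    """Code cache that runs a client-supplied routine on each miss."""

    def __init__(self, image, transform=zlib.decompress):
        self.image = image          # addr -> stored (e.g. compressed) bytes
        self.transform = transform  # routine supplied by the client
        self.cache = {}             # addr -> executable (decompressed) bytes
        self.misses = 0

    def fetch(self, addr):
        code = self.cache.get(addr)
        if code is None:            # miss: decompress once, then keep the result
            self.misses += 1
            code = self.transform(self.image[addr])
            self.cache[addr] = code
        return code


# Stand-in for a fragment's machine code: 64 identical bytes compress well.
raw_fragment = b"\x90" * 64
cache = TransformingCache({0x400: zlib.compress(raw_fragment)})
first = cache.fetch(0x400)    # miss: runs the decompression routine
second = cache.fetch(0x400)   # hit: served from the code cache
```

Only fragments that are actually executed are ever decompressed, and each one pays the cost once.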
summary
• DELI is a general layer, accessible for many purposes
• other alternatives have a specific usage in mind, or are point products
• we have shown feasibility, achievable performance, and basic infrastructure
future work ideas
• hardware support
• compiler annotations
• more applications
page 25