Parallel SimOS: Scalability and Performance for Large System Simulation
Ph.D. Oral Defense
Robert E. Lantz
Computer Systems Laboratory, Stanford University
Overview
• This work develops methods to simulate large computer systems with practical performance
• We use smaller machines to simulate larger machines
• We extend the capabilities of computer system simulation by an order of magnitude, to systems of more than 1000 processors
Outline
• Background and Motivation
• Parallel SimOS Investigation
  • Design Issues and Experiences
  • Performance Evaluation
  • Usability Evaluation
• Related Work
• Future Work and Conclusions
Why large systems?
• Large applications!
  • Biology, chemistry, physics, engineering
  • From large systems (e.g. Earth's climate) to small systems (e.g. cells, DNA)
  • Web applications, search, databases
  • Simulation, visualization (and games!)
Why simulate large systems?
• Compare alternative designs
• Verify a system before building it
• Predict behavior and performance
• Debug a system during bring-up
• Write software when the system is not available (or before it exists!)
• Avoid expensive mistakes
The SimOS System
• Complete machine simulator developed in the Computer Systems Laboratory (CSL)
• Simulates the complete hardware of a computer system: CPU, memory, devices
• Enough speed and detail to run a full operating system, system software, and application programs
• Multiple CPU and memory models for fast or detailed performance and behavioral modeling
[Figure: SimOS architecture. The target workload and target OS run on SimOS' simulated hardware (CPU model, memory model, and device models for disk, network, and other devices), which runs on the host OS and host hardware.]
Using SimOS
[Figure: SimOS workflow. Inputs: a disk image (OS, system software, user applications, application data) and configuration/control scripts. Outputs: modeled performance and event statistics, program output, external I/O, and simulator statistics.]
Performance Terminology
• Execution time is the most meaningful measurement of simulator performance
• Slowdown = real time / simulated time (see the sketch below)
• Slowdown tells you how much longer it will take to simulate a workload compared to running it on actual hardware
• Self-relative slowdown compares a simulator with the machine it is running on
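As a tiny illustration of these two definitions, here is a sketch with hypothetical numbers (not measurements from the thesis): the target is assumed faster than the host, so the two metrics differ.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical numbers, not measurements from the thesis. */
    double real_time        = 2400.0; /* wall-clock seconds spent simulating      */
    double simulated_time   =   60.0; /* virtual seconds on the simulated target  */
    double host_native_time =  240.0; /* seconds the workload would take natively
                                         on the (smaller) host machine            */

    /* Slowdown: how much longer simulation takes than the target hardware. */
    double slowdown = real_time / simulated_time;            /* 40x */

    /* Self-relative slowdown: simulator vs. the machine it runs on. */
    double self_relative = real_time / host_native_time;     /* 10x */

    printf("slowdown %.0fx, self-relative slowdown %.0fx\n",
           slowdown, self_relative);
    return 0;
}
```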
Speed/Detail Trade-off: SimOS CPU and Memory Models
(approximate KIPS on a 225 MHz R10K; self-relative slowdown)
• MXS: dynamic, superscalar microarchitecture model, non-blocking memory system; 12 KIPS; 2000+ slowdown
• Mipsy: sequential interpreter, blocking memory system; 800 KIPS; 300+ slowdown
• Embra w/caches: single-cycle CPU model, simplified cache model; 12,000 KIPS; ~20 slowdown
• Embra: single-cycle CPU and memory model; 25,000 KIPS; ~10 slowdown
Benefits of fast simulation
• Makes it possible to simulate complex workloads
  • Real OS, system software, large applications
  • Many billions of cycles
  • Positioning before more detailed simulation
• Allows software development and debugging
  • Interactive usability
• Enables exploration of a large design space
• Provides rough estimate of performance and trends
SimOS Applications
• Used in the design, development, and debugging of the Stanford FLASH multiprocessor throughout its life cycle
• Enabled numerous studies of OS and application performance
• Research platform for operating systems, virtual machines, and visualization
SimOS Limitations
• As we simulate larger machines, slowdown increases
[Figure: slowdown (real time / simulated time, axis up to 15,000) versus number of simulated processors (1 to 1024) for Barnes, FFT, Radix, and LU.]
SimOS Limitations
• ...resulting in longer simulation times
[Figure: time to simulate one minute of virtual time versus number of simulated processors: about 10 minutes at 1 processor, 23 hours at 128 processors, and more than 1 week at 1024 processors.]
Problem: Simulator Slowdown
• What causes simulator slowdown?
  • Intrinsic slowdown
  • Resource exhaustion
  • Linear slowdown
• Overall multiplicative slowdown (illustrated below):
  Simulation Time = Workload Time * (Intrinsic Slowdown + Resource Exhaustion Penalty) * Linear Slowdown
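A minimal sketch of this multiplicative model, with made-up parameter values (the ~10x intrinsic slowdown matches Embra's base slowdown quoted later; the resource-exhaustion penalty and function names are illustrative, not SimOS identifiers):

```c
#include <stdio.h>

/* Illustrative sketch of the multiplicative slowdown model. */
static double simulation_time(double workload_time,
                              double intrinsic_slowdown,
                              double resource_exhaustion_penalty,
                              double linear_slowdown)
{
    return workload_time *
           (intrinsic_slowdown + resource_exhaustion_penalty) *
           linear_slowdown;
}

int main(void)
{
    /* Example: 60 s workload, ~10x intrinsic slowdown, a hypothetical
     * penalty of 2 for host memory exhaustion, and linear slowdown from
     * multiplexing 128 simulated CPUs onto one host CPU. */
    double t = simulation_time(60.0, 10.0, 2.0, 128.0);
    printf("estimated simulation time: %.0f seconds (%.1f hours)\n",
           t, t / 3600.0);
    return 0;
}
```

With these made-up numbers the estimate works out to roughly a day, which is at least in the ballpark of the 23 hours shown for 128 simulated processors on the previous slide.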
Solution: Parallel SimOS
• Use the increased capacity of shared-memory multiprocessors to address resource exhaustion and linear slowdown
• Extend the speed/detail trade-off with a fast, parallel mode of simulation
• Goal: eliminate slowdown due to parallelism and increase scalability to enable large system simulation with practical performance
Outline
• Background and Motivation
• Parallel SimOS Investigation
  • Design Issues and Experiences
    • Embra background
    • Parallel Embra design
  • Performance Evaluation
  • Usability Evaluation
• Related Work
• Future Work and Conclusions
Embra: SimOS' fastest simulation mode
• Binary translation CPU and memory simulator
• Translation Cache (TC)
• Callouts to handle events, MMU operations, exceptions, and annotations
• CPU multiplexing
• ~10x base slowdown
[Figure: Embra architecture: kernel and user Translation Caches (TC) with a TC index and MMU/glue code; translator and decoder; callout, exception, and event handlers; MMU cache and MMU handler; statistics reporting; SimOS interface.]
Embra: sources of slowdown
• Binary translation overhead
• Multiplexing overhead
• Resource exhaustion
ST = WT * (Slowdown(I) + Slowdown(R)) * M
Binary translation overhead
[Figure: the decoder and translator convert target code at the simulated PC into translated code stored in the Translation Cache (TC), located via the TC index.]
Target code:
  lw  r1, (r2)
  lw  r3, (r4)
  add r5, r1, r3
Translated code (in the TC):
  lw    SIM_T1, R2(cpu_base)
  jal   mem_read_addr
  lw    SIM_T2, (SIM_T1)
  sw    SIM_T2, R1(cpu_base)
  lw    SIM_T1, R4(cpu_base)
  jal   mem_read_addr
  lw    SIM_T3, (SIM_T1)
  sw    SIM_T3, R3(cpu_base)
  add.w SIM_T1, SIM_T2, SIM_T3
  sw    SIM_T1, R5(cpu_base)
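A toy sketch of how a binary-translating simulator amortizes this overhead through a translation cache: translation is paid only on a TC miss, and cached blocks are reused thereafter. The structures and names here are illustrative, not Embra's actual interface, and translated blocks are modeled as data records rather than real host code.

```c
#include <stdint.h>
#include <stdio.h>

#define TC_SIZE 1024

/* Toy translation cache keyed by simulated PC. In a real binary
 * translator the payload would be generated host machine code. */
typedef struct {
    uint64_t sim_pc;     /* simulated PC this block translates   */
    uint64_t next_pc;    /* where execution continues afterwards */
    int      valid;
} tc_entry;

static tc_entry tc[TC_SIZE];
static unsigned long tc_misses;

static tc_entry *tc_lookup(uint64_t pc)
{
    tc_entry *e = &tc[pc % TC_SIZE];
    return (e->valid && e->sim_pc == pc) ? e : NULL;
}

/* Stand-in for the decoder/translator: the expensive step. */
static tc_entry *translate_block(uint64_t pc)
{
    tc_entry *e = &tc[pc % TC_SIZE];
    e->sim_pc  = pc;
    e->next_pc = (pc + 16) % 64;   /* pretend each block is 4 instructions */
    e->valid   = 1;
    tc_misses++;
    return e;
}

int main(void)
{
    uint64_t pc = 0;
    for (int i = 0; i < 1000; i++) {
        tc_entry *block = tc_lookup(pc);
        if (!block)
            block = translate_block(pc);   /* translation overhead here */
        pc = block->next_pc;               /* "execute" the cached block */
    }
    printf("executed 1000 blocks with %lu TC misses\n", tc_misses);
    return 0;
}
```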
CPU multiplexing overhead
• CPU state array
• Context switching with variable timeslice (see the sketch below)
  • Large for low overhead
  • Small for better responsiveness
  • Minimal: MPinUP mode
[Figure: an array of per-CPU state (CPU 0, CPU 1, CPU 2, ...), each holding registers, FPU, MMU, and other state, multiplexed onto the host processor.]
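A rough sketch of multiplexing several virtual CPUs onto one host CPU with a configurable timeslice. The structures and names are hypothetical, not Embra's; the point is only the round-robin loop and the overhead/responsiveness trade-off in the slice size.

```c
#include <stdint.h>
#include <stdio.h>

#define NCPUS 4

/* Hypothetical per-CPU state record; a real simulator would also hold
 * FPU, MMU, and other architectural state here. */
typedef struct {
    uint64_t pc;
    uint64_t regs[32];
    uint64_t cycles;          /* simulated cycles executed so far */
} cpu_state;

static cpu_state cpus[NCPUS];

/* Stand-in for "run this virtual CPU for up to `slice` cycles". */
static void run_cpu_for(cpu_state *cpu, uint64_t slice)
{
    cpu->cycles += slice;
    cpu->pc     += slice * 4;  /* pretend one instruction per cycle */
}

int main(void)
{
    /* A large timeslice lowers switching overhead; a small one keeps the
     * virtual CPUs more closely interleaved (better responsiveness). */
    uint64_t timeslice = 50000;
    for (int round = 0; round < 100; round++)
        for (int i = 0; i < NCPUS; i++)
            run_cpu_for(&cpus[i], timeslice);

    for (int i = 0; i < NCPUS; i++)
        printf("CPU %d: %llu simulated cycles\n",
               i, (unsigned long long)cpus[i].cycles);
    return 0;
}
```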
A new, faster mode: Parallel Embra
• Use the parallelism and memory system of a shared-memory multiprocessor
• Decimation-in-space approach
• Parallelism and increased memory bandwidth reduce linear slowdown and resource exhaustion:
  ST = WT * (Slowdown(I) + Slowdown(R)) * M
[Figure: simulated nodes partitioned across simulator threads (decimation in space).]
Design Evolution
• We started with a baseline design and evolved it to achieve scalable performance
• Baseline: thread-based parallelism, shared memory
• Critical design features:
  • Mirroring hardware in software
  • Replication, fine-grained parallelism
  • Unsynchronized execution speed
Design: Software should mirror Hardware
• Shared Translation Cache to reduce overhead?
  • Problem: contention and serialization; chaining and cache conflicts
  • Fuses hardware, breaks parallelism
• Solution: mirror hardware in software with replicated Translation Caches (sketched below)
[Figure: Parallel Embra with per-CPU replicated MMU/glue code, kernel and user Translation Caches, and TC indexes; translator/decoder, callout/exception/event handlers, MMU cache and handler, statistics reporting, and SimOS interface.]
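A minimal sketch of the replication idea: each simulator thread fills and indexes its own Translation Caches rather than sharing one, so TC fills, chaining, and index updates never contend. The structure and names are illustrative, not Parallel Embra's.

```c
#include <stdint.h>
#include <stdlib.h>

#define TC_BYTES   (4 << 20)          /* per-CPU translation cache size (example) */
#define TC_ENTRIES 4096

/* Per-simulated-CPU translation state: replicated, never shared. */
typedef struct {
    uint8_t  *kernel_tc;              /* replicated kernel translation cache   */
    uint8_t  *user_tc;                /* replicated user translation cache     */
    uint64_t  tc_index[TC_ENTRIES];   /* replicated TC index (PC -> TC offset) */
} percpu_tc;

static percpu_tc *alloc_percpu_tc(int ncpus)
{
    percpu_tc *t = calloc(ncpus, sizeof *t);
    for (int i = 0; i < ncpus; i++) {
        t[i].kernel_tc = malloc(TC_BYTES);
        t[i].user_tc   = malloc(TC_BYTES);
    }
    return t;
}

int main(void)
{
    /* One private TC set per simulated CPU; simulator thread i touches
     * only tcs[i], mirroring the per-processor hardware it models. */
    percpu_tc *tcs = alloc_percpu_tc(32);
    (void)tcs;
    return 0;
}
```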
Design: Software should mirror Hardware
• Shared event queue for global ordering? Events are rare!
  • Problem: event frequency increases with parallelism
  • Solution: replicated event queues to mirror hardware in software (see the sketch below)
[Figure: Parallel Embra block diagram, as before, highlighting the event handlers.]
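A sketch of per-CPU event queues (illustrative names, not Parallel Embra's): each simulated CPU drains its own queue, and only the rare cross-CPU posts take the lock, so there is essentially no contention on a shared global queue.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_EVENTS 256

typedef struct {
    uint64_t fire_cycle;               /* simulated cycle when the event fires */
    void   (*handler)(void *arg);
    void    *arg;
} sim_event;

/* One queue per simulated CPU, mirroring per-processor hardware. */
typedef struct {
    pthread_mutex_t lock;              /* taken only for (rare) cross-CPU posts */
    int             count;
    sim_event       events[MAX_EVENTS];
} event_queue;

static void post_event(event_queue *q, sim_event ev)
{
    pthread_mutex_lock(&q->lock);
    if (q->count < MAX_EVENTS)
        q->events[q->count++] = ev;
    pthread_mutex_unlock(&q->lock);
}

/* Called by the owning CPU's thread from its main loop. */
static void run_due_events(event_queue *q, uint64_t now)
{
    pthread_mutex_lock(&q->lock);
    for (int i = 0; i < q->count; ) {
        if (q->events[i].fire_cycle <= now) {
            q->events[i].handler(q->events[i].arg);
            q->events[i] = q->events[--q->count];   /* swap-remove */
        } else {
            i++;
        }
    }
    pthread_mutex_unlock(&q->lock);
}

static void demo_handler(void *arg) { printf("event on CPU %ld\n", (long)arg); }

int main(void)
{
    event_queue q = { .count = 0 };
    pthread_mutex_init(&q.lock, NULL);
    post_event(&q, (sim_event){ .fire_cycle = 100,
                                .handler = demo_handler, .arg = (void *)0L });
    run_due_events(&q, 150);   /* fires the queued event */
    return 0;
}
```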
Design: Software should mirror Hardware
• 90% of time is spent in the TC; why not parallelize only the TC?
  • Problem: Amdahl's law
  • Problem: frequent callouts, contention everywhere
  • Result: critical region expansion and serialization
[Figure: Parallel Embra block diagram, as before.]
Critical Region Expansion
[Figure: timeline showing critical regions expanding over time; contention and descheduling lead to expansion and serialization.]
Design: Software should mirror Hardware
• Solution: mirror hardware in software with fine-grained parallelism throughout Parallel Embra
• OS and applications require parallel callouts from the Translation Cache
• Parallel statistics reporting is also a good idea, but it happens infrequently
[Figure: Parallel Embra block diagram, as before.]
Design: flexible virtual time synchronization
• Problem: cycle skew between fast and slow processors
• Solution: configurable barrier synchronization (see the sketch below)
  • Fast processors wait for slow processors
  • Fine grain (like MPinUP mode)
  • Loose grain (reduce sync overhead)
  • Variable interval for flexibility
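A rough sketch of interval-based barrier synchronization in virtual time, using a POSIX barrier (illustrative only; Parallel Embra's actual mechanism may differ). Each thread runs its CPUs for `sync_interval` simulated cycles and then waits, so no thread gets more than one interval ahead; a small interval bounds skew tightly, a large one reduces synchronization overhead. Build with `-pthread`.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 4

static uint64_t sync_interval = 1000000;     /* configurable, in cycles */
static pthread_barrier_t cycle_barrier;
static uint64_t virtual_time[NTHREADS];

/* Stand-in for "advance this thread's simulated CPUs by `cycles`". */
static void run_cpus_for(int id, uint64_t cycles)
{
    virtual_time[id] += cycles;
}

static void *simulator_thread(void *arg)
{
    int id = (int)(intptr_t)arg;
    for (int interval = 0; interval < 10; interval++) {
        run_cpus_for(id, sync_interval);
        pthread_barrier_wait(&cycle_barrier);  /* fast threads wait for slow */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    pthread_barrier_init(&cycle_barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, simulator_thread, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("all threads reached virtual time %llu\n",
           (unsigned long long)virtual_time[0]);
    return 0;
}
```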
Design: synchronization causes slowdown
[Figure: slowdown of 32-processor simulations relative to a large sync interval, plotted against synchronization interval (500,000; 1,000,000; 10,000,000 cycles) for Barnes, FFT, LU, MP3D, Ocean, Raytrace, Radix, and Water; y-axis 0 to 4x.]
Design: unsynchronized execution
• For performance, the best synchronization interval is longer than the workload, i.e. never synchronize
• We were surprised to find that both the OS and parallel benchmarks ran correctly with unlimited time skew
• This is because every thread sees a consistent ordering of memory and synchronization events
Design conclusions
• Parallelism increases contention for callouts, the event system, the TC, the clock, the MMU, interrupt controllers, and any shared subsystem
• Contention cascades, resulting in critical region expansion and serialization
• Mirroring hardware in software preserves parallelism and avoids contention effects
• Fine-grained synchronization is required to permit correct and highly parallel access to simulator data
• Time synchronization across processors is unnecessary for correctness and undesirable for speed
• Performance depends on the combination of all parallel performance features
Outline
• Background and Motivation
• Parallel SimOS Investigation
  • Design Issues and Experiences
  • Performance Evaluation
  • Usability Evaluation
• Related Work
• Future Work and Conclusions
Performance: Test Configuration
Workload (benchmarks):
• Barnes: hierarchical Barnes-Hut method for the N-body problem
• FFT: fast Fourier transform
• LU: lower/upper matrix factorization
• MP3D: particle-based hypersonic wind tunnel simulation
• Radix: integer radix sort
• Raytrace: ray tracer
• Ocean: ocean currents simulation
• Water: water molecule simulation
• pmake: compile phase of the Modified Andrew Benchmark
• ptest: simple benchmark for sanity check/peak performance
Machine: Stanford FLASH Multiprocessor; 64 nodes, MIPS R10000 at 225 MHz, 220 MB DRAM per node (14 GB total); configurations flash1, flash32, flash64, etc.
Performance: Peak and actual MIPS
[Figure: simulated MIPS over time on flash32; chart labels mark roughly 1600 MIPS for ptest (peak) and roughly 1000 MIPS for the SPLASH-2 suite.]
Overall result: > 1000 MIPS in simulation, ~10x slowdown compared to hardware
Performance: Hardware self-relative slowdown
[Figure: self-relative slowdown (0 to 60) versus simulated machine size (1 to 64 processors) for Barnes, FFT, LU, MP3D, Ocean, Radix, Raytrace, Water, pmake, LU-big, and Radix-big.]
~10x slowdown regardless of machine size
Performance: benchmark phases
[Figures: per-phase performance profiles for Barnes, LU, and MP3D on flash32.]
Large Scale Performance
[Figure: slowdown (real time / simulated time) for Radix/Flash32 and LU/Flash64: roughly 10,323 and 9,409 under serial SimOS versus roughly 772 and 442 under Parallel SimOS.]
Hours or days rather than weeks
Speed/Detail Trade-off, revisited: Parallel SimOS CPU and Memory Models
(approximate KIPS on a 225 MHz R10K; self-relative slowdown)
• MXS: dynamic, superscalar microarchitecture model, non-blocking memory system; 12 KIPS; 2000+ slowdown
• Mipsy: sequential interpreter, blocking memory system; 800 KIPS; 300+ slowdown
• Embra w/caches: single-cycle CPU model, simplified cache model; 12,000 KIPS; ~20 slowdown
• Embra: single-cycle CPU and memory model; 25,000 KIPS; ~10 slowdown
• Parallel Embra: non-deterministic, single-cycle CPU and memory model; > 1,000,000 KIPS; ~10 slowdown
Performance Conclusions
• Parallel SimOS achieves peak and actual MIPS far beyond serial SimOS
• Parallel SimOS simulates a multiprocessor with performance analogous to serial SimOS simulating a uniprocessor
• Parallel SimOS extends the scalability of complete machine simulation to 1024-processor systems
Usability Study
• Study of a large, complex parallel program: Parallel SimOS itself
• Self-hosting capability of orthogonal simulators
• Performance debugging of Parallel SimOS, and a test of functionality and usability
• Self-hosting architecture (top to bottom): benchmark (Radix) on inner Irix 6.5 on inner SimOS, on outer Irix 6.5 on outer SimOS, on Irix 6.5 on hardware (SGI Origin)
Phase profile
[Figures: computation intervals (CPU versus time in seconds) for self-hosted Radix under serial SimOS and under Parallel SimOS.]
Bugs found: excessive TLB misses, interrupt storms
Limitation: system imbalance effects
Usability Conclusions
• Parallel SimOS worked correctly on itself
• Revealed bugs and limitations of Parallel SimOS
• Speed/detail trade-off enabled with checkpoints
• Detailed mode too slow: ended up scaling down the workload
• Need for faster detailed simulation modes
Limitations
• Virtual time depends on real time
  • Loss of determinism, repeatability
  • ...but can use checkpoints!
• System imbalance effects
• Memory limits
• Need for fast detailed mode
  • Future work
Related Work
• Parallel SimOS uses shared-memory multiprocessors and decimation in space
• Other approaches to improving performance using parallelism include:
  • Decimation in time
  • Cluster-based simulation
Related Work: Decimation in Time
[Figure: an initial serial execution is divided into segments by checkpoints; the segments are then re-executed in parallel (with some overlap) and serially reconstructed afterwards.]
ST = WT * (Slowdown(I) + Slowdown(R)) * N
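A schematic sketch of the decimation-in-time idea (not any particular system's implementation): take checkpoints during a fast serial pass, then re-simulate each segment in parallel, one segment per worker. Build with `-pthread`.

```c
#include <pthread.h>
#include <stdio.h>

#define NSEGMENTS 4

/* Hypothetical checkpoint handle; a real system would capture the full
 * machine state (memory, registers, devices) at a given cycle. */
typedef struct { int id; long start_cycle, end_cycle; } checkpoint;

static checkpoint cps[NSEGMENTS];

/* Stand-in for "restore this checkpoint and re-simulate its segment in
 * detail"; segments are independent, so they can run in parallel. */
static void *replay_segment(void *arg)
{
    checkpoint *cp = arg;
    printf("replaying segment %d: cycles %ld..%ld\n",
           cp->id, cp->start_cycle, cp->end_cycle);
    return NULL;
}

int main(void)
{
    /* Phase 1 (serial, fast mode): run the workload once, dropping a
     * checkpoint at the start of each segment. */
    for (int i = 0; i < NSEGMENTS; i++)
        cps[i] = (checkpoint){ i, i * 1000000L, (i + 1) * 1000000L };

    /* Phase 2 (parallel, detailed mode): replay all segments at once. */
    pthread_t workers[NSEGMENTS];
    for (int i = 0; i < NSEGMENTS; i++)
        pthread_create(&workers[i], NULL, replay_segment, &cps[i]);
    for (int i = 0; i < NSEGMENTS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}
```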
Parallel SimOS: Decimation in Space
[Figure: simulated nodes partitioned in space across simulator threads.]
ST = WT * (Slowdown(I) + Slowdown(R)) * M
Related Work: Cluster-based Simulation
• Most common means of parallel simulation: Shaman, BigSim, and others
• Fast (?) LAN = high-latency communication switch
• Software-based shared memory = low performance
• Reduced flexibility
Parallel SimOS: Flexible Simulation
• Tightly and loosely coupled machines
• From clusters to multiprocessors and everything in between
• Parallelism across multiprocessor nodes
[Figure: example targets: a workstation cluster ("Sweet Hall") of networked nodes; a NUMA shared-memory multiprocessor (the Stanford FLASH machine) with per-node CPUs, caches, memory, and controllers on a multi-level bus/interconnect; and a multiprocessor cluster of such machines connected through network interfaces.]
Related Work Summary
• Decimation in time achieves good speedup at the expense of interactivity
  • Synergistic with Parallel SimOS
• Cluster-based simulation addresses the needs of loosely coupled systems, generally without shared memory
• The Parallel SimOS approach achieves programmability and performance for a larger design space that includes tightly coupled and hybrid systems
Future Work
• Faster detailed simulation
  • Parallel detailed mode with flexible memory and pipeline models
• Try to recapture determinism
  • Global memory ordering in virtual time
• Faster less-detailed simulation
  • Revisit direct execution, using virtual machine monitors, user-mode OS, etc.
Conclusion: Thesis Contributions
• Developed the design and implementation of scalable, parallel complete machine simulation
• Eliminated slowdown due to resource exhaustion and multiplexing
• Scaled complete machine simulation up by an order of magnitude: 1024-processor machines on our hardware
• Developed a flexible simulator capable of simulating large, tightly coupled systems with interactive performance