Parallel SimOS: Scalability and Performance for Large System Simulation
Ph.D. Oral Defense
Robert E. Lantz
Computer Systems Laboratory, Stanford University


Overview

• This work develops methods to simulate large computer systems with practical performance
• We use smaller machines to simulate larger machines
• We extend the capabilities of computer system simulation by an order of magnitude, to systems of more than 1000 processors

Outline

• Background and Motivation
• Parallel SimOS Investigation
  • Design Issues and Experiences
  • Performance Evaluation
  • Usability Evaluation
• Related Work
• Future Work and Conclusions

Why large systems?

• Large applications!
  • Biology, chemistry, physics, engineering
  • From large systems (e.g. Earth's climate) to small systems (e.g. cells, DNA)
  • Web applications, search, databases
  • Simulation, visualization (and games!)

Why simulate large systems?

• Compare alternative designs
• Verify a system before building it
• Predict behavior and performance
• Debug a system during bring-up
• Write software when the system is not available (or before it exists!)
• Avoid expensive mistakes

The SimOS System

• Complete machine simulator developed in the CSL
• Simulates the complete hardware of a computer system: CPU, memory, devices
• Enough speed and detail to run a full operating system, system software, and application programs
• Multiple CPU and memory models for fast or detailed performance and behavioral modeling

[Figure: SimOS architecture. The target workload and target OS run on SimOS's simulated hardware (CPU model, memory model, and device models for disk, network, and other devices), which in turn runs on the host OS and host hardware.]

Using SimOS

[Figure: SimOS inputs and outputs. Inputs: a disk image (OS, system software, user applications, application data) and config/control scripts. Outputs: modeled performance and event statistics, program output, external I/O, and simulator statistics.]

Performance Terminology

• Execution time is the most meaningful measurement of simulator performance
• Slowdown = Real Time / Simulated Time
• Slowdown tells you how much longer it will take to simulate a workload than to run it on actual hardware
• Self-relative slowdown compares a simulator with the machine it is running on
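
A quick worked reading of these definitions (the numbers here are illustrative, not measurements): if one minute of simulated time takes ten minutes of wall-clock time to simulate,

```latex
\text{Slowdown} = \frac{\text{Real Time}}{\text{Simulated Time}}
                = \frac{10~\text{min (real)}}{1~\text{min (simulated)}} = 10\times
```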

Speed/Detail Trade-off

SimOS CPU and Memory Models:

CPU Model        Detail                                           Approx. KIPS (225 MHz R10K)   Self-relative slowdown
MXS              Dynamic, superscalar microarchitecture model;    12                            2000+
                 non-blocking memory system
Mipsy            Sequential interpreter; blocking memory system   800                           300+
Embra w/ caches  Single-cycle CPU model; simplified cache model   12,000                        ~20
Embra            Single-cycle CPU and memory model                25,000                        ~10

Benefits of fast simulation

• Makes it possible to simulate complex workloads
  • Real OS, system software, large applications
  • Many billions of cycles
• Positioning before more detailed simulation
• Allows software development and debugging
  • Interactive usability
• Enables exploration of a large design space
  • Provides rough estimates of performance and trends

SimOS Applications

• Used in design, development, and debugging of the Stanford FLASH multiprocessor throughout its life cycle
• Enabled numerous studies of OS and application performance
• Research platform for operating systems, virtual machines, visualization

SimOS Limitations

• As we simulate larger machines, slowdown increases

[Figure: slowdown (real time / simulated time), on a scale up to 15,000, versus simulated processors (1 to 1024) for Barnes, FFT, Radix, and LU.]

SimOS Limitations

• ...resulting in longer simulation times

[Figure: time (minutes) to simulate one minute of virtual time versus simulated processors: about 10 minutes on 1 processor, 23 hours on 128, and more than a week on 1024.]

Problem: Simulator Slowdown

• What causes simulator slowdown?
  • Intrinsic slowdown
  • Resource exhaustion
  • Linear slowdown
• Overall multiplicative slowdown:

  Simulation Time = Workload Time * (Intrinsic Slowdown + Resource Exhaustion Penalty) * Linear Slowdown
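
Written in the shorthand used on later slides (ST for Simulation Time, WT for Workload Time, M for the linear multiplexing factor):

```latex
ST = WT \times \bigl(\mathrm{Slowdown}(I) + \mathrm{Slowdown}(R)\bigr) \times M
```

Parallelism can attack the resource exhaustion term and the factor M, but not the intrinsic slowdown.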

Solution: Parallel SimOS

• Use the increased capacity of shared-memory multiprocessors to address resource exhaustion and linear slowdown
• Extend the speed/detail trade-off with a fast, parallel mode of simulation
• Goal: use parallelism to eliminate slowdown and increase scalability, enabling large system simulation with practical performance

Outline

• Background and Motivation
• Parallel SimOS Investigation
  • Design Issues and Experiences
    • Embra background
    • Parallel Embra design
  • Performance Evaluation
  • Usability Evaluation
• Related Work
• Future Work and Conclusions

Embra: SimOS's fastest simulation mode

• Binary-translation CPU and memory simulator
• Translation Cache (TC)
• Callouts to handle events, MMU operations, exceptions, and annotations
• CPU multiplexing
• ~10x base slowdown

[Figure: Embra block diagram. Translation Cache (kernel TC, user TC, MMU/glue code) with a TC index, fed by the decoder and translator; callout and exception handlers, event handlers, MMU cache and MMU handler, statistics reporting, and the SimOS interface.]

Embra: sources of slowdown

• Binary translation overhead
• Multiplexing overhead
• Resource exhaustion

  ST = WT * (Slowdown(I) + Slowdown(R)) * M

Binary translation overhead

Original target code (at PC):

  lw  r1, (r2)
  lw  r3, (r4)
  add r5, r1, r3

Translation stored in the Translation Cache (TC):

  lw   SIM_T1, R2(cpu_base)
  jal  mem_read_addr
  lw   SIM_T2, (SIM_T1)
  sw   SIM_T2, R1(cpu_base)
  lw   SIM_T1, R4(cpu_base)
  jal  mem_read_addr
  lw   SIM_T3, (SIM_T1)
  sw   SIM_T3, R3(cpu_base)
  add.w SIM_T1, SIM_T2, SIM_T3
  sw   SIM_T1, R5(cpu_base)

[Figure: the decoder and translator read target instructions from simulator memory and emit translations into the TC, found later via the TC index.]
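
The control structure this implies can be sketched as a translation-cache lookup with translate-on-miss. This is a minimal illustration, not Embra's actual code; all names are hypothetical:

```c
/* Sketch of a binary-translation dispatch loop (illustrative only). */
#include <stddef.h>
#include <stdint.h>

struct TCIndex;                               /* PC -> translation map      */
struct CPUState {
    uint64_t pc;                              /* simulated program counter  */
    struct TCIndex *tc_index;
};
typedef void (*TransFn)(struct CPUState *);   /* one translated basic block */

TransFn tc_lookup(struct TCIndex *, uint64_t pc);            /* hypothetical */
void    tc_insert(struct TCIndex *, uint64_t pc, TransFn);   /* hypothetical */
TransFn translate_block(struct CPUState *, uint64_t pc);     /* hypothetical */

void run_cpu(struct CPUState *cpu)
{
    for (;;) {
        TransFn fn = tc_lookup(cpu->tc_index, cpu->pc);  /* fast path: reuse */
        if (fn == NULL) {
            fn = translate_block(cpu, cpu->pc);  /* slow path: decode,      */
            tc_insert(cpu->tc_index, cpu->pc, fn); /* translate, install    */
        }
        fn(cpu);   /* execute the translation; advances cpu->pc */
    }
}
```

The translation overhead is paid once per block on the slow path; repeated execution stays on the fast path inside the TC.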

CPU multiplexing overhead

• CPU state array
• Context switching with variable timeslice:
  • Large for low overhead
  • Small for better responsiveness
  • Minimal: MPinUP mode

[Figure: an array of per-CPU state (registers, FPU, MMU, other state) for CPU 0, CPU 1, CPU 2, ...]
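
A round-robin multiplexing loop of the kind described above might look like the following sketch (hypothetical names; the timeslice is in simulated cycles):

```c
/* Sketch of serial CPU multiplexing over a CPU state array.
 * run_cpu_for() is a hypothetical helper that runs one simulated CPU
 * for a bounded number of simulated cycles. A large timeslice lowers
 * switching overhead; a small one improves responsiveness. */
#include <stdint.h>

#define NCPUS 64

struct CPUState { uint64_t pc; /* registers, FPU, MMU, other state */ };

void run_cpu_for(struct CPUState *cpu, uint64_t cycles);

void multiplex(struct CPUState cpus[NCPUS], uint64_t timeslice)
{
    for (;;) {
        for (int i = 0; i < NCPUS; i++)
            run_cpu_for(&cpus[i], timeslice);  /* context switch between CPUs */
    }
}
```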

A new, faster mode: Parallel Embra

• Use the parallelism and memory system of a shared-memory multiprocessor
• Decimation-in-space approach: simulated nodes are divided among simulator threads
• Parallelism and increased memory bandwidth reduce linear slowdown and resource exhaustion:

  ST = WT * (Slowdown(I) + Slowdown(R)) * M
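
In outline, decimation in space is one host thread per group of simulated nodes, all sharing the simulated machine's memory through the host's shared address space. A minimal pthreads sketch, with a hypothetical simulate_nodes() helper:

```c
/* Sketch of decimation in space: NSIM simulated CPUs divided among
 * NTHREADS host threads, each multiplexing only its own share. */
#include <pthread.h>

#define NSIM     64   /* simulated CPUs */
#define NTHREADS 8    /* host threads   */

void simulate_nodes(int first, int count);    /* hypothetical */

static void *worker(void *arg)
{
    int id  = (int)(long)arg;
    int per = NSIM / NTHREADS;
    simulate_nodes(id * per, per);            /* run this thread's nodes */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```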

Design Evolution

• We started with a baseline design and evolved it to achieve scalable performance
• Baseline: thread-based parallelism, shared memory
• Critical design features:
  • Mirroring hardware in software
  • Replication, fine-grained parallelism
  • Unsynchronized execution speed

Design: Software should mirror Hardware

• Shared Translation Cache to reduce overhead?
  • Problem: contention and serialization; chaining and cache conflicts
  • Fuses hardware, breaks parallelism
• Solution: mirror hardware in software with replicated Translation Caches

[Figure: Parallel Embra block diagram with replicated Translation Caches (kernel TC, user TC, MMU/glue code) and TC indexes.]
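
The replication idea can be expressed as per-simulated-CPU copies of the hot structures; a hypothetical sketch (these types and sizes are invented for illustration, not Parallel Embra's):

```c
/* Sketch of replication: each simulated CPU gets private copies of the
 * structures that would otherwise be contended, mirroring per-CPU
 * hardware in software. */
#include <stdint.h>

#define TC_BYTES (4 << 20)                /* illustrative TC size */

struct EventQueue {
    uint64_t next_due;                    /* simulated time of next event */
};

struct PerCPU {
    uint8_t  tc[TC_BYTES];                /* private translation cache       */
    void    *tc_index[4096];              /* private PC -> translation map   */
    struct EventQueue events;             /* private event queue             */
    char     pad[64];                     /* avoid false sharing at the tail */
};

struct PerCPU cpus[64];                   /* one entry per simulated CPU */
```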

Design: Software should mirror Hardware

• Shared event queue for global ordering? Events are rare!
  • Problem: event frequency increases with parallelism
• Solution: replicated event queues to mirror hardware in software

[Figure: Parallel Embra block diagram, now with replicated per-CPU event queues and event handlers.]
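
With replicated queues, the hot-path check is purely local; a sketch (names illustrative, reusing the PerCPU shape from the previous sketch):

```c
/* Sketch: each simulator thread polls only its own CPU's event queue
 * against that CPU's simulated time; no shared lock, no global order. */
#include <stdint.h>

struct EventQueue { uint64_t next_due; };
struct PerCPU     { struct EventQueue events; uint64_t now; };

void dispatch_next_event(struct PerCPU *cpu);   /* hypothetical */

static inline void poll_events(struct PerCPU *cpu)
{
    while (cpu->events.next_due <= cpu->now)    /* cheap local comparison */
        dispatch_next_event(cpu);
}
```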

Design: Software should mirror Hardware

• 90% of time is spent in the TC; why not parallelize only the TC?
  • Problem: Amdahl's law
  • Problem: frequent callouts, contention everywhere
  • Result: critical region expansion and serialization

[Figure: Parallel Embra block diagram; only the Translation Cache is parallelized, while callouts, event handlers, MMU handling, and statistics remain serial.]

Critical Region Expansion

[Figure: timeline showing critical regions expanding under contention and descheduling, leading to serialization.]

Design: Software should mirror Hardware

• Solution: mirror hardware in software with fine-grained parallelism throughout Parallel Embra
• OS and applications require parallel callouts from the Translation Cache
• Parallel statistics reporting is also a good idea, but reporting happens infrequently

[Figure: Parallel Embra block diagram with fine-grained parallelism in the callout, exception, and event handlers, MMU handling, and statistics reporting.]

Design: flexible virtual time synchronization

• Problem: cycle skew between fast and slow processors
• Solution: configurable barrier synchronization
  • Fast processors wait for slow processors
  • Fine grain (like MPinUP mode)
  • Loose grain (reduces sync overhead)
  • Variable interval for flexibility
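
A configurable interval can be realized as a barrier reached every fixed number of simulated cycles, roughly as in this sketch (POSIX barrier; step_one_block() is hypothetical):

```c
/* Sketch of configurable virtual-time synchronization: every interval
 * simulated cycles each simulator thread waits at a barrier, bounding
 * cycle skew. A small interval behaves like MPinUP mode; a large one
 * minimizes synchronization overhead. */
#include <pthread.h>
#include <stdint.h>

struct CPUState { uint64_t cycles; /* ... */ };

void step_one_block(struct CPUState *cpu);   /* advances cpu->cycles */

pthread_barrier_t sync_barrier;              /* initialized to the thread count */

void simulate_with_sync(struct CPUState *cpu, uint64_t interval)
{
    for (;;) {
        uint64_t target = cpu->cycles + interval;
        while (cpu->cycles < target)
            step_one_block(cpu);
        pthread_barrier_wait(&sync_barrier); /* fast CPUs wait for slow */
    }
}
```

Making the interval effectively infinite recovers the unsynchronized mode evaluated below.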

Design: synchronization causes slowdown

[Figure: 32-processor slowdown relative to a large synchronization interval, for Barnes, FFT, LU, MP3D, Ocean, Raytrace, Radix, and Water, as the synchronization interval varies from 500,000 to 10,000,000 cycles; shorter intervals increase slowdown by up to about 4x.]

Design: unsynchronized execution

• For performance, the best synchronization interval is longer than the workload, i.e. never synchronize
• We were surprised to find that both the OS and parallel benchmarks ran correctly with unlimited time skew
• This is because every thread sees a consistent ordering of memory and synchronization events

Design conclusions

• Parallelism increases contention for callouts, the event system, the TC, the clock, the MMU, interrupt controllers: any shared subsystem
• Contention cascades, resulting in critical region expansion and serialization
• Mirroring hardware in software preserves parallelism and avoids contention effects
• Fine-grained synchronization is required to permit correct and highly parallel access to simulator data
• Time synchronization across processors is unnecessary for correctness and undesirable for speed
• Performance depends on the combination of all parallel performance features

Outline

• Background and Motivation
• Parallel SimOS Investigation
  • Design Issues and Experiences
  • Performance Evaluation
  • Usability Evaluation
• Related Work
• Future Work and Conclusions

Performance: Test Configuration

Workload:

Benchmark   Description
Barnes      Hierarchical Barnes-Hut method for the N-body problem
FFT         Fast Fourier Transform
LU          Lower/upper matrix factorization
MP3D        Particle-based hypersonic wind tunnel simulation
Radix       Integer radix sort
Raytrace    Ray tracer
Ocean       Ocean currents simulation
Water       Water molecule simulation
pmake       Compile phase of the Modified Andrew Benchmark
ptest       Simple benchmark for sanity check/peak performance

Machine: Stanford FLASH multiprocessor; 64 nodes, MIPS R10000 at 225 MHz; 220 MB DRAM per node (14 GB total); configurations flash1, flash32, flash64, etc.

Performance: Peak and actual MIPS

[Figure: simulated MIPS over time for flash32 running ptest and the SPLASH-2 suite; peak rates of roughly 1600 MIPS, sustained rates near 1000 MIPS.]

• Overall result: > 1000 MIPS in simulation, ~10x slowdown compared to hardware

Performance: Hardware self-relative slowdown

[Figure: self-relative slowdown versus simulated machine size (1 to 64 processors) for Barnes, FFT, LU, MP3D, Ocean, Radix, Raytrace, Water, pmake, LU-big, and Radix-big.]

• ~10x slowdown regardless of machine size

Performance: benchmark phases

[Figures: MIPS over time for Barnes, LU, and MP3D on flash32, showing per-phase variation in simulation speed.]


Large Scale Performance

Slowdown (real time / simulated time):

Workload        SimOS    Parallel SimOS
Radix/Flash32   10,323      772
LU/Flash64       9,409      442

• Hours or days rather than weeks

Speed/Detail Trade-off, revisited

Parallel SimOS CPU and Memory Models:

CPU Model        Detail                                           Approx. KIPS (225 MHz R10K)   Self-relative slowdown
MXS              Dynamic, superscalar microarchitecture model;    12                            2000+
                 non-blocking memory system
Mipsy            Sequential interpreter; blocking memory system   800                           300+
Embra w/ caches  Single-cycle CPU model; simplified cache model   12,000                        ~20
Embra            Single-cycle CPU and memory model                25,000                        ~10
Parallel Embra   Non-deterministic, single-cycle CPU and          > 1,000,000                   ~10
                 memory model

Performance Conclusions

• Parallel SimOS achieves peak and actual MIPS far beyond serial SimOS
• Parallel SimOS simulates a multiprocessor with performance analogous to serial SimOS simulating a uniprocessor
• Parallel SimOS extends the scalability of complete machine simulation to 1024-processor systems

Usability Study

• Study of a large, complex parallel program: Parallel SimOS itself
• Self-hosting capability of orthogonal simulators
• Performance debugging of Parallel SimOS, and a test of functionality and usability

[Figure: self-hosting architecture, top to bottom: Benchmark (Radix) on inner Irix 6.5, on inner SimOS, on outer Irix 6.5, on outer SimOS, on Irix 6.5, on the hardware (SGI Origin).]

Phase profile

[Figure: per-CPU computation intervals for self-hosted Radix, under serial SimOS (time axis roughly 40-80 s) and under Parallel SimOS (roughly 11-25 s).]

• Bugs found: excessive TLB misses, interrupt storms
• Limitation: system imbalance effects

Usability Conclusions

• Parallel SimOS worked correctly on itself
• Revealed bugs and limitations of Parallel SimOS
• Speed/detail trade-off enabled with checkpoints
• Detailed mode was too slow; we ended up scaling down the workload
• Need for faster detailed simulation modes

Limitations

• Virtual time depends on real time
  • Loss of determinism and repeatability
  • But can use checkpoints!
• System imbalance effects
• Memory limits
• Need for a fast detailed mode
  • Future work

Related Work

• Parallel SimOS uses shared-memory multiprocessors and decimation in space
• Other approaches to improving performance using parallelism include:
  • Decimation in time
  • Cluster-based simulation

Related Work: Decimation in Time

[Figure: an initial serial execution writes checkpoints dividing the workload into segments 1-4; the segments are then re-executed in parallel (with overlap), followed by serial reconstruction.]

  ST = WT * (Slowdown(I) + Slowdown(R)) * N

Parallel SimOS: Decimation in Space

[Figure: simulated nodes divided among simulator threads.]

  ST = WT * (Slowdown(I) + Slowdown(R)) * M

Related Work: Cluster-based Simulation

• Most common means of parallel simulation: Shaman, BigSim, others
• Fast (?) LAN = high-latency communication switch
• Software-based shared memory = low performance
• Reduced flexibility

Parallel SimOS: Flexible Simulation

• Tightly and loosely coupled machines
• From clusters to multiprocessors and everything in between
• Parallelism across multiprocessor nodes

[Figure: target architectures, from a workstation cluster ("Sweet Hall") connected by a network, to a NUMA shared-memory multiprocessor (the Stanford FLASH machine) with per-node CPU, cache, memory, and controller on a multi-level bus/interconnect, to a cluster of multiprocessors joined by network interfaces.]

Related Work Summary

• Decimation in time achieves good speedup at the expense of interactivity
  • Synergistic with Parallel SimOS
• Cluster-based simulation addresses the needs of loosely coupled systems, generally without shared memory
• The Parallel SimOS approach achieves programmability and performance for a larger design space that includes tightly coupled and hybrid systems

Future Work

• Faster detailed simulation
  • Parallel detailed mode with flexible memory and pipeline models
• Try to recapture determinism
  • Global memory ordering in virtual time
• Faster less-detailed simulation
  • Revisit direct execution, using virtual machine monitors, user-mode OS, etc.

Conclusion: Thesis Contributions

• Developed the design and implementation of scalable, parallel complete machine simulation
• Eliminated slowdown due to resource exhaustion and multiplexing
• Scaled complete machine simulation up by an order of magnitude: 1024-processor machines on our hardware
• Developed a flexible simulator capable of simulating large, tightly coupled systems with interactive performance
