Performance Issues for Parallel Implementations of Bootstrap Simulation Algorithm
22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2010)

Ricardo M. Czekster, Paulo Fernandes, Afonso Sales and Thais Webber
Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
PaleoProspec Project - PUCRS/Petrobras
Also funded by CAPES and CNPq - Brazil

Context

Interest
The solution of complex and large state-based stochastic models to extract performance indices.

Solution
Numerical (iterative methods)
  ◮ Power method [Stewart 94]
  ◮ Arnoldi [Arnoldi 51]
  ◮ GMRES [Saad and Schultz 86]
Simulation
  ◮ Traditional [Ross 96]
  ◮ Monte Carlo [Häggström 02]
  ◮ Backward [Propp and Wilson 96]
  ◮ Bootstrap [Czekster et al. 10]
    ⋆ reliable estimations
    ⋆ high computational cost to generate repeated batches of samples

Context

Markovian simulation
Generation of independent samples (parallel execution)
Parallel sampling (e.g., master-worker approach)
Possible sequence of states using the transition matrix
  ◮ random walk or simulation trajectory
  ◮ huge size → huge memory cost
  ◮ Stochastic Automata Networks (structured formalism, underlying Markov Chain)

Objective

Goal
To present a parallel implementation of Bootstrap simulation, focusing on the overall performance of the technique, through a method that generates a large number of samples in less time.

Discussion
processing versus communication times
model size versus number of generated samples

Outline

Stochastic Automata Networks (SAN)
Bootstrap simulation
Parallelization
Experiments and results
Conclusion and future works

Stochastic Automata Networks (SAN)

• SAN allows the description of a large system in a structured manner by its parts (automata)

SAN model
Three automata, A, B and C, each with local states 0 and 1.

  Type   Event   Rate
  syn    s1      α
  syn    s2      β
  loc    l1      f

  f = [(B == 0) && (C == 0)] × γ

Underlying Continuous-Time Markov Chain
Three reachable global states, 000, 100 and 111 (relabelled 0, 1 and 2 in a later build of the slide), with transitions at rates α, β and γ.

[Figure: the SAN model (automata A, B, C with events l1, s1, s2) next to its underlying CTMC]
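The functional rate f of the local event l1 can be read as "γ while B and C are both in state 0, otherwise 0". A minimal Python sketch of that evaluation follows; the function name, the tuple encoding of the global state and the numeric value used for γ are illustrative assumptions, not part of the slides or of any SAN tool.

```python
def local_rate_l1(global_state, gamma):
    """Rate of local event l1 in a global state (A, B, C).

    Mirrors f = [(B == 0) && (C == 0)] * gamma: the bracket evaluates
    to 1 when the condition holds and to 0 otherwise.
    (Hypothetical helper, for illustration only.)
    """
    _, b, c = global_state
    return gamma if (b == 0 and c == 0) else 0.0


# l1 is enabled in state 000 and disabled as soon as B or C leaves state 0.
enabled = local_rate_l1((0, 0, 0), gamma=2.5)
disabled = local_rate_l1((1, 1, 1), gamma=2.5)
```

In a structured formalism such as SAN, this kind of state-dependent rate is what lets one automaton influence another without an explicit synchronizing event.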

Bootstrap

Method
Bootstrap is a well-known statistical method, applied in many fields, that improves accuracy when performing sample estimations for complex distributions.

In the simulation context
Bootstrap simulation provides more reliable estimations than traditional simulation [SCSC'10].

Main feature
Generation of repeated batches of samples, which helps to improve the accuracy of the method.

Traditional simulation

Transition Matrix (row = current state, column = next state)

         0      1      2
  0    0.10   0.65   0.25
  1    0.25   0.55   0.20
  2    0.30   0.25   0.45

Initial state: 0

  Time      U      Visited state   Counters π′ (π0′, π1′, π2′)
  0         -      0 (initial)     0, 0, 0
  0 → 1     0.08   0               1, 0, 0
  1 → 2     0.87   2               1, 0, 1
  2 → 3     0.32   1               1, 1, 1
  ...
  n

n = trajectory length
each visited state = sample
mean permanence probability: $\pi = \pi' / n$

[Figure: visited state (0, 1, 2) plotted against time, built up one step per slide]
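The walk-through above can be reproduced with a short Python sketch. It assumes the next state is chosen by inverse-transform sampling on the row of the current state (the first state whose cumulative probability exceeds U), which matches the three steps shown on the slides; the function names are illustrative and the uniforms are passed in explicitly so the example is deterministic.

```python
from bisect import bisect_right
from itertools import accumulate

# Transition matrix from the slides (row = current state).
P = [
    [0.10, 0.65, 0.25],
    [0.25, 0.55, 0.20],
    [0.30, 0.25, 0.45],
]


def next_state(P, current, u):
    """Pick the first state whose cumulative probability exceeds u."""
    cumulative = list(accumulate(P[current]))
    # min() guards against floating-point rounding in the last cumulative value.
    return min(bisect_right(cumulative, u), len(cumulative) - 1)


def simulate(P, initial, uniforms):
    """Follow one trajectory and count visits to each state.

    Returns the visited states, the raw counters (pi') and the
    normalized estimate pi = pi' / n, with n = len(uniforms).
    """
    state = initial
    counters = [0] * len(P)
    visited = []
    for u in uniforms:
        state = next_state(P, state, u)
        visited.append(state)
        counters[state] += 1
    n = len(visited)
    return visited, counters, [c / n for c in counters]


# The three uniforms shown on the slides.
visited, counters, pi = simulate(P, initial=0, uniforms=[0.08, 0.87, 0.32])
```

With only three samples the estimate is of course meaningless; in practice the uniforms would come from a random number generator and n would be large.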

Bootstrap simulation

Transition Matrix (row = current state, column = next state)

         0      1      2
  0    0.10   0.65   0.25
  1    0.25   0.55   0.20
  2    0.30   0.25   0.45

Initial state: 0

The trajectory advances over time 0, 1, 2, 3, ..., n with the same draws as before (U = 0.08, U = 0.87, U = 0.32, ...).

K1, K2, ..., Kz: one vector of counters per bootstrap, with one entry per state (0, 1, 2)
For each bootstrap, n̄ trials are performed to execute the resamplings
At the end of the trajectory, each Ki is normalized into x̄i (x̄1, x̄2, ..., x̄z)

n = trajectory length
K: bootstrap
z: number of bootstraps
mean permanence probability: $\pi = \frac{1}{z} \sum_{i=1}^{z} \bar{x}_i$

$\pi_0 = \frac{\bar{x}_1[0] + \bar{x}_2[0] + \cdots + \bar{x}_z[0]}{z}$
$\pi_1 = \frac{\bar{x}_1[1] + \bar{x}_2[1] + \cdots + \bar{x}_z[1]}{z}$
$\pi_2 = \frac{\bar{x}_1[2] + \bar{x}_2[2] + \cdots + \bar{x}_z[2]}{z}$

[Figure: visited state (0, 1, 2) plotted against time, with the bootstrap counters K1 ... Kz filled in one step per slide]
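The normalize-and-average step can be sketched in Python as follows. The slides do not spell out how the n̄ resampling trials are carried out, so this sketch substitutes the classical bootstrap (drawing n̄ states with replacement from a stored trajectory); that choice, the function names and the parameter values are assumptions and may differ from the authors' procedure.

```python
import random
from bisect import bisect_right
from itertools import accumulate

P = [
    [0.10, 0.65, 0.25],
    [0.25, 0.55, 0.20],
    [0.30, 0.25, 0.45],
]


def simulate_trajectory(P, initial, n, rng):
    """Generate n visited states by inverse-transform sampling."""
    cumulative = [list(accumulate(row)) for row in P]
    state, trajectory = initial, []
    for _ in range(n):
        state = min(bisect_right(cumulative[state], rng.random()), len(P) - 1)
        trajectory.append(state)
    return trajectory


def bootstrap_estimate(trajectory, num_states, z, n_bar, rng):
    """Average z normalized bootstraps: pi = (1/z) * sum_i x_bar_i.

    Assumption: each bootstrap K_i is filled by n_bar draws with
    replacement from the trajectory (classical resampling).
    """
    pi = [0.0] * num_states
    for _ in range(z):
        K = [0] * num_states
        for _ in range(n_bar):
            K[rng.choice(trajectory)] += 1
        for s in range(num_states):
            pi[s] += K[s] / n_bar  # normalize K_i into x_bar_i and accumulate
    return [p / z for p in pi]


rng = random.Random(2010)  # fixed seed so the run is repeatable
trajectory = simulate_trajectory(P, initial=0, n=1000, rng=rng)
pi = bootstrap_estimate(trajectory, num_states=3, z=36, n_bar=500, rng=rng)
```

Each x̄i sums to 1, so the averaged vector π is itself a probability vector regardless of z.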

Parallelization

Approach
Split the bootstrap sampling tasks over the processing nodes
Each node performs the full trajectory simulation but produces a different set of samples
Master-worker pattern

Implementation: C++ language and MPI primitives
Executed on a cluster of 8 Dell PowerEdge R610 machines connected by a Gigabit Ethernet network
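The shape of this master-worker scheme can be illustrated with a small Python sketch. It is not the authors' C++/MPI code: a thread pool stands in for the cluster nodes, the final sum stands in for an MPI reduction, per-worker seeds and resampling with replacement are assumptions, and all names are hypothetical. Python threads will not show a real speedup here; the point is only the division of work and the combination of partial results.

```python
import random
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate

P = [
    [0.10, 0.65, 0.25],
    [0.25, 0.55, 0.20],
    [0.30, 0.25, 0.45],
]


def worker(task):
    """One node: run the full trajectory, then fill only its own bootstraps."""
    seed, n, n_bar, my_bootstraps = task
    rng = random.Random(seed)  # assumption: each worker uses its own seed
    cumulative = [list(accumulate(row)) for row in P]
    state, trajectory = 0, []
    for _ in range(n):
        state = min(bisect_right(cumulative[state], rng.random()), len(P) - 1)
        trajectory.append(state)
    partial = [0.0] * len(P)
    for _ in range(my_bootstraps):
        K = [0] * len(P)
        for _ in range(n_bar):
            K[rng.choice(trajectory)] += 1
        for s in range(len(P)):
            partial[s] += K[s] / n_bar  # sum of this worker's normalized bootstraps
    return partial


def master(shares, n, n_bar, base_seed=0):
    """Hand each worker its share of bootstraps and average the partial sums."""
    tasks = [(base_seed + rank, n, n_bar, share) for rank, share in enumerate(shares)]
    with ThreadPoolExecutor(max_workers=len(shares)) as pool:
        partials = list(pool.map(worker, tasks))
    z = sum(shares)
    return [sum(p[s] for p in partials) / z for s in range(len(P))]


pi = master(shares=[18, 18], n=500, n_bar=100)
```

Each worker returns a vector with one entry per state, which is consistent with the communication cost growing with the state space rather than with the trajectory length.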

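Experiments

Number of bootstraps assigned to nodes in each configuration

configuration     1    2    3    4    5    6    7    8
number of bootstraps
  node 1         36   18   12    9    7    6    5    4
  node 2          -   18   12    9    7    6    5    4
  node 3          -    -   12    9    7    6    5    4
  node 4          -    -    -    9    7    6    5    4
  node 5          -    -    -    -    8    6    5    5
  node 6          -    -    -    -    -    6    5    5
  node 7          -    -    -    -    -    -    6    5
  node 8          -    -    -    -    -    -    -    5

(each configuration totals 36 bootstraps)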
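The assignment in this table is consistent with an even split of 36 bootstraps in which the leftover bootstraps go to the last nodes. A minimal Python sketch of such a split follows; the function name and the "extras go last" rule are inferred from the table, not stated by the authors.

```python
def split_bootstraps(z, nodes):
    """Split z bootstraps over the given number of nodes.

    Every node gets z // nodes bootstraps; the remaining z % nodes
    are given one each to the last nodes (rule inferred from the table).
    """
    base, extra = divmod(z, nodes)
    return [base] * (nodes - extra) + [base + 1] * extra


# One row of shares per configuration (1 to 8 nodes), 36 bootstraps in total.
table = {nodes: split_bootstraps(36, nodes) for nodes in range(1, 9)}
```

For example, with 5 nodes the split is 7, 7, 7, 7, 8, and with 8 nodes it is 4, 4, 4, 4, 5, 5, 5, 5, so the most loaded node bounds the processing time of each configuration.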

Models (examples)
ASP - Alternate Service Patterns: describes an Open Queueing Network with servers that map P different service patterns.
FAS - First Available Server: indicates the availability of N servers.
RS - Resource Sharing: maps R shared resources to P processes.

Results

Large models (millions of states)

[Figure: four panels, for $n = 10^6$, $n = 10^7$, $n = 10^8$ and $n = 10^9$. Each panel plots time (s), split into processing (Proc.) and communication (Comm.), against the number of nodes (1 to 8) for the ASP, FAS and RS models.]

Results

Small models (hundreds of states)

[Figure: four panels, for $n = 10^6$, $n = 10^7$, $n = 10^8$ and $n = 10^9$. Each panel plots time (s), split into processing (Proc.) and communication (Comm.), against the number of nodes (1 to 8) for the ASP, FAS and RS models.]

Conclusion and future works

Summary
An efficient implementation of a novel simulation algorithm
Considerable speedup for very large models
  ◮ especially for long trajectories
The speedup was consistent across different SAN models
  ◮ nearly 5 times speedup for 8 nodes
The processing demands depend only on the simulation trajectory length (n)
The communication demands depend only on the reachable state space size of the model
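As a rough reading of the reported figure, the standard definitions of speedup and parallel efficiency can be applied. These definitions are not given in the slides, and T(p), the wall-clock time on p nodes, is a symbol introduced here for illustration.

```latex
S(p) = \frac{T(1)}{T(p)}, \qquad
E(p) = \frac{S(p)}{p}, \qquad
S(8) \approx 5 \;\Rightarrow\; E(8) \approx \frac{5}{8} = 0.625
```

Under these definitions, a speedup of nearly 5 on 8 nodes corresponds to a parallel efficiency of a little over 60%.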

Future works
Study of bootstrap distribution over non-uniform memory architectures
  ◮ some levels of shared memory could be highly beneficial to cope with high communication (short trajectories for large models)
Blending methods
  ◮ combination of the parallel Bootstrap approach with more sophisticated simulation approaches (e.g., Perfect Simulation)

Thank you for your attention.
