Exploiting Multi-core Architectures in Clusters for Enhancing the Performance of the Parallel Bootstrap Simulation Algorithm
The 12th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2011)
César A. F. De Rose, Paulo Fernandes, Antonio M. Lima, Afonso Sales, Thais Webber, Avelino F. Zorzo
Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
PaleoProspec Project - PUCRS/Petrobras. Funded also by CAPES and CNPq - Brazil
Context

Interest
- The solution of complex and large state-based stochastic models to extract performance indices

Application domains
- Biology, Physics, Social Sciences, Business, Computer Science and Telecommunication, etc.

Modeling examples
- ASP (Alternate Service Patterns): describes an open queueing network with servers that map P different service patterns
- FAS (First Available Server): indicates the availability of N servers
- RS (Resource Sharing): maps R shared resources to P processes
Context

Problem
- The large number of possible configurations of large models makes a traditional numerical solution intractable and heavily dependent on the available computational resources

Solution alternatives
- Numerical (iterative) methods: bounded by memory and computational power
  - Power [Stewart 94]
  - GMRES [Saad and Schultz 86]
  - Arnoldi [Arnoldi 51]
- Simulation methods: lack of accuracy in generating samples
  - Traditional [Ross 96]
  - Monte Carlo [Häggström 02]
  - Backward [Propp and Wilson 96]
  - Bootstrap [Czekster et al. 10]
    * reliable estimations
    * high computational cost to generate repeated batches of samples
Context

Simulation of Markovian models
- Definitions: initial state, trajectory length (number of samples)
- Main idea: perform a random walk over the set of possible states the system can assume (the simulation trajectory), computing an approximation of the steady-state probability distribution
- Generation of independent samples for later statistical analysis
- Large models imply large memory costs
- Remark: long-run trajectories are needed for more accurate results
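The random walk described above can be sketched in a few lines of Python. The two-state transition matrix below is a made-up toy example for illustration, not one of the ASP/FAS/RS models from this talk:

```python
import random

def simulate(P, n, s0=0, seed=42):
    """Estimate the steady-state distribution of a discrete-time Markov
    chain by walking a trajectory of n steps and counting state visits.
    P is a row-stochastic transition matrix given as nested lists."""
    rng = random.Random(seed)
    counts = [0] * len(P)
    s = s0
    for _ in range(n):
        u = rng.random()
        acc = 0.0
        for s_new, p in enumerate(P[s]):  # next-state function phi(s, u)
            acc += p
            if u < acc:
                s = s_new
                break
        counts[s] += 1
    return [c / n for c in counts]        # visit frequencies approximate pi

# Toy 2-state chain; its exact steady state is pi = (1/3, 2/3)
P = [[0.8, 0.2],
     [0.1, 0.9]]
pi = simulate(P, 100_000)
```

As the remark above notes, the estimate only approaches the exact distribution for long trajectories; with short runs the visit frequencies are still dominated by the initial state.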
Outline

Goal
- Faster numerical solution using the Bootstrap simulation algorithm, exploiting parallelism in a multi-core SMP cluster

Strategy
- Parallel approaches:
  - using only MPI primitives
  - hybrid programming model with MPI and OpenMP: fine grain (Hybrid-I) and coarse grain (Hybrid-II)

Results and discussion
- Performance issues of the parallel Bootstrap implementations
Bootstrap simulation

Algorithm schema

[Figure: the bootstrapping process. Starting from an initial state, a trajectory of n time steps is walked over the states {s0, s1, s2, ...}. At each step, z bootstrap histograms K1, ..., Kz are updated by resampling. After computing, each histogram is normalized into a frequency vector x̄1, ..., x̄z, and the z vectors are averaged:

  π[i] = (x̄1[i] + x̄2[i] + ··· + x̄z[i]) / z ]
01. α ← U(0..n̄−1)
02. π ← 0
03. K ← 0
04. s ← s0
05. for t ← 1 to n do
06.   snew ← φ(s, U(0..1))
07.   for b ← 1 to z do
08.     for c ← 1 to n̄ do
09.       if (U(0..n̄−1) == α) then
10.         Kb[snew] ← Kb[snew] + 1
11.     end for
12.   end for
13.   s ← snew
14. end for
15. for b ← 1 to z do
16.   ω ← 0
17.   for i ← 1 to |S| do
18.     ω ← ω + Kb[i]
19.   end for
20.   for i ← 1 to |S| do
21.     x̄b[i] ← Kb[i] / ω
22.   end for
23. end for
24. for i ← 1 to |S| do
25.   for b ← 1 to z do π[i] ← π[i] + x̄b[i] end for
26.   π[i] ← π[i] / z end for
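A direct Python transcription of this pseudocode follows. The tiny two-state chain and its next-state function φ are illustrative stand-ins, not the paper's models:

```python
import random

def bootstrap_simulation(phi, num_states, n, z, n_bar, s0=0, seed=1):
    """Bootstrap simulation: at every trajectory step, each of the z
    bootstrap histograms K_b resamples the new state n_bar times,
    counting a hit whenever a uniform draw equals the pivot alpha."""
    rng = random.Random(seed)
    alpha = rng.randrange(n_bar)                 # alpha <- U(0..n_bar-1)
    K = [[0] * num_states for _ in range(z)]     # z bootstrap histograms
    s = s0
    for _ in range(n):                           # trajectory of length n
        s_new = phi(s, rng.random())             # next-state function
        for b in range(z):
            for _ in range(n_bar):               # n_bar resampling trials
                if rng.randrange(n_bar) == alpha:
                    K[b][s_new] += 1
        s = s_new
    x_bar = []
    for b in range(z):                           # normalize each bootstrap
        omega = sum(K[b])
        x_bar.append([k / omega for k in K[b]])
    # average the z normalized frequency vectors into pi
    return [sum(x_bar[b][i] for b in range(z)) / z
            for i in range(num_states)]

# Toy 2-state chain (exact steady state pi = (1/3, 2/3)), for illustration
def phi(s, u):
    return (1 - s) if u < (0.2 if s == 0 else 0.1) else s

pi = bootstrap_simulation(phi, 2, n=20_000, z=8, n_bar=10)
```

Note how the inner loops make z × n̄ resampling trials per trajectory step, which is exactly where the processing cost discussed next comes from.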
Bootstrap simulation

Processing costs
- Parameters: the trajectory length (n), the number of bootstraps (z), and the number of trials (n̄) in the resampling process
- z × n̄ samplings take place for every trajectory step

[Figure: Time (s) vs. Maximum Absolute Error as a function of the simulation trajectory length (1e+05 to 1e+09) for the FAS, ASP, and RS models.]

Proposal
- Parallel Bootstrap implementations that change the workload distribution and the programming model
Parallel approaches

Environment
- Multi-core SMP cluster: 8 nodes (Gigabit Ethernet network)
- Each node: two Intel Xeon E5520 Quad-core processors with Hyper-Threading technology (16 logical processors) and 16 GB RAM
- Linux O.S., OpenMPI 1.4.2, OpenMP 2.5

Experiments
- 30 trials, taking a 95% confidence interval
- 1 (sequential), 2, 3, 4, 5, 6, 7, and 8 nodes
- z = 36 bootstraps and n = 1e+06, 1e+07, 1e+08, and 1e+09

Number of bootstraps assigned to the nodes in each configuration:

configuration | bootstraps per node
      1       | 36
      2       | 18 18
      3       | 12 12 12
      4       | 9 9 9 9
      5       | 7 7 7 7 8
      6       | 6 6 6 6 6 6
      7       | 5 5 5 5 5 5 6
      8       | 4 4 4 4 5 5 5 5
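One way to reproduce the per-node counts in the configuration table is a floor-plus-remainder split of the z = 36 bootstraps, with the leftover bootstraps spread one per node over the last nodes. This helper is an illustration of that split, not the paper's actual code:

```python
def assign_bootstraps(z, num_nodes):
    """Distribute z bootstraps over num_nodes nodes: every node gets
    floor(z / num_nodes), and the remainder is handed out one extra
    bootstrap per node, starting from the last nodes."""
    base, extra = divmod(z, num_nodes)
    return [base] * (num_nodes - extra) + [base + 1] * extra

# Reproduces the rows of the configuration table for z = 36
assert assign_bootstraps(36, 1) == [36]
assert assign_bootstraps(36, 5) == [7, 7, 7, 7, 8]
assert assign_bootstraps(36, 7) == [5, 5, 5, 5, 5, 5, 6]
assert assign_bootstraps(36, 8) == [4, 4, 4, 4, 5, 5, 5, 5]
```

Since 36 is not divisible by 5, 7, or 8, those configurations are slightly imbalanced: the most loaded node bounds the bootstrapping time.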
Parallel approaches (pure-MPI)

Pure MPI implementation
- Split the bootstrap sampling tasks over C processing nodes
- Each node (#1 to #C) runs its share of the bootstrapping process as a sequential computation
- Only MPI primitives are used
Parallel approaches (pure-MPI results)

[Figure: execution time breakdown (Next-State, Bootstrapping, Normalization, Communication, Computing) per configuration (1 to 8 nodes) for the ASP, FAS, and RS models. Left: small models (hundreds of states); right: large models (millions of states). Top row: n = 1e+06; bottom row: n = 1e+09.]
Parallel approaches (fine grain)

Hybrid MPI/OpenMP implementation (Hybrid-I)
- Hybrid programming: MPI between nodes, OpenMP inside each node
- Each node (#1 to #C) runs the bootstrapping process as a parallel computation over # threads
- Intra-node parallelism: #pragma omp parallel for applied to the loop at line 7

  05. for t ← 1 to n do
  06.   snew ← φ(s, U(0..1))
  07.   for b ← 1 to z do
  08.     for c ← 1 to n̄ do
  09.       if (U(0..n̄−1) == α) then
  10.         Kb[snew] ← Kb[snew] + 1
  11.     end for
  12.   end for
  13.   s ← snew
  14. end for
Parallel approaches (Hybrid-I results)

[Figure: execution time breakdown (Next-State, Bootstrapping, Normalization, Communication, Computing) vs. the number of threads (1, 2, 4, 8) in only one node for the ASP, FAS, and RS models. Left: n = 1e+06; right: n = 1e+09.]

A parallel region is created at each step in the simulation process
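OpenMP parallel regions have no direct Python equivalent, but as a loose analogy the Hybrid-I scheme behaves like the sketch below: a fresh thread pool (the "parallel region") is opened at every trajectory step, splitting that single step's z bootstraps among the threads. The pool setup/teardown at each of the n steps is where the fine-grain overhead comes from. The resampling kernel here is a deterministic stand-in, not the paper's code:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def step_fine_grained(K, s_new, z, resample, num_threads=4):
    """Hybrid-I analogy: open a new thread pool for one trajectory step
    and split the bootstraps b = 0..z-1 of that step among the threads."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:  # per-step region
        list(pool.map(lambda b: resample(K[b], s_new), range(z)))

def resample(K_b, s_new):
    """Deterministic stand-in kernel: count hits into histogram K_b."""
    rng = random.Random(s_new)
    K_b[s_new] += sum(1 for _ in range(10) if rng.randrange(10) == 0)

z, num_states = 8, 3
K = [[0] * num_states for _ in range(z)]
for s_new in (0, 1, 2, 1):                 # four simulated trajectory steps
    step_fine_grained(K, s_new, z, resample)   # pool rebuilt at every step
```

Each thread writes to a distinct histogram K[b], so no locking is needed; the cost lies purely in recreating the pool n times.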
Parallel approaches (coarse grain)

Hybrid MPI/OpenMP implementation (Hybrid-II)
- Hybrid programming: MPI between nodes, OpenMP inside each node
- Each node (#1 to #C) runs the bootstrapping process as a parallel computation over # threads
- Intra-node parallelism: the parallel region integrates the whole simulation process (lines 5 to 14)

  05. for t ← 1 to n do
  06.   snew ← φ(s, U(0..1))
  07.   for b ← 1 to z do
  08.     for c ← 1 to n̄ do
  09.       if (U(0..n̄−1) == α) then
  10.         Kb[snew] ← Kb[snew] + 1
  11.     end for
  12.   end for
  13.   s ← snew
  14. end for
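Continuing the thread-pool analogy, the coarse-grain scheme opens the parallel region once: each thread keeps a fixed subset of the bootstrap histograms and walks the entire trajectory with them, avoiding the per-step region overhead of Hybrid-I. The trivial kernel and trajectory below are illustrative only:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate_coarse_grained(trajectory, z, resample, num_threads=4):
    """Hybrid-II analogy: one thread pool (parallel region) spans the
    whole simulation; each thread owns a subset of the z bootstrap
    histograms and processes every trajectory step for that subset."""
    num_states = max(trajectory) + 1
    K = [[0] * num_states for _ in range(z)]

    def worker(my_bootstraps):
        for s_new in trajectory:          # the entire trajectory runs
            for b in my_bootstraps:       # inside the single region
                resample(K[b], s_new)

    chunks = [range(t, z, num_threads) for t in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:  # opened once
        list(pool.map(worker, chunks))
    return K

def resample(K_b, s_new):
    K_b[s_new] += 1                        # trivial stand-in kernel

K = simulate_coarse_grained([0, 1, 1, 2, 0, 1], z=8, resample=resample)
```

As in the fine-grain sketch, threads write to disjoint histograms, so the bootstraps stay independent and no synchronization is required until the final averaging.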
Parallel approaches (Hybrid-II results)

[Figure: execution time breakdown (Next-State + Bootstrapping, Normalization, Communication, Computing) per configuration, with the number of threads in parentheses — 1 (16), 2 (16), 3 (12), 4 (9), 5 (8), 6 (6), 7 (6), 8 (5) — for the ASP, FAS, and RS models. Left: small models; right: large models. Top row: n = 1e+06; bottom row: n = 1e+09.]
Parallel approaches (Hybrid-II results)

Speedup vs. Efficiency

[Figure: speedup and efficiency (%) per configuration (1 to 8 nodes) for the ASP, FAS, and RS models at n = 1e+09. Left: large models; right: small models.]
Conclusion and future works
Summary
- Parallel performance analysis of the Bootstrap simulation algorithm, exploiting different characteristics of multi-core SMP clusters
- The algorithm generates samples independently, which makes parallelization straightforward
- Considerable speedups have been achieved for very large models
- Processing demands depend only on the trajectory length (n)
- Communication demands depend only on the size of the model
Conclusion and future works
Future works
- Practical:
  - an efficient implementation of the transition function to compute samples
  - adapting parallel sampling techniques to mitigate the effort of simulating structured Markovian models
- Theoretical:
  - further study of the impact of the number of bootstraps on the accuracy of the simulation results
  - incorporating the parallel sampling process into more sophisticated simulation approaches, such as Perfect Sampling
Thank you for your attention