Using SimGrid to Evaluate the Impact of AMPI Load Balancing in a Geophysics HPC Application

Rafael Keller Tesser*, Philippe O. A. Navaux*, Lucas Mello Schnorr*, Arnaud Legrand†
*: UFRGS GPPD/Inf, Porto Alegre, Brazil
†: CNRS/Inria POLARIS, Grenoble, France

14th Charm++ Workshop, Urbana, April 2016


Outline

1. Context: Improving the Performance of Iterative Unbalanced Applications
2. SimGrid and SMPI in a Nutshell
3. A Simulation-Based Methodology
4. Experimental Results
   • Validation
   • Investigating AMPI parameters
5. Conclusion

Context

Parallel HPC applications are often written in MPI, which is based on a regular SPMD programming model.
• Many of these applications are iterative, and this paradigm is well suited to balanced applications;
• Unbalanced applications:
  • may resort to static load-balancing techniques (at the application level);
  • or not... when the load imbalance comes from the nature of the input data and evolves over time and space (e.g., Ondes3D), handling it at the application level is just a nightmare.

A possible approach is to use over-decomposition and dynamic process-level load balancing, as proposed by AMPI/Charm++.
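The idea can be illustrated with a toy model (this is not AMPI code; the VP counts and loads below are invented): over-decompose the domain into more virtual processors (VPs) than physical processes, so a balancer can migrate VPs between processes to even out the load.

```python
def makespan(assignment, vp_loads, n_procs):
    """Time of the slowest process under a given VP-to-process assignment."""
    per_proc = [0.0] * n_procs
    for vp, proc in enumerate(assignment):
        per_proc[proc] += vp_loads[vp]
    return max(per_proc)

# 16 VPs over-decomposed onto 4 processes; the heavy VPs are clustered,
# as they might be near an absorbing boundary.
vp_loads = [5.0] * 4 + [1.0] * 12
blocked = [vp // 4 for vp in range(16)]  # static blocks: process 0 holds every heavy VP
spread = [vp % 4 for vp in range(16)]    # after migrations spread the heavy VPs out

print(makespan(blocked, vp_loads, 4))  # 20.0: process 0 dominates the makespan
print(makespan(spread, vp_loads, 4))   # 8.0: perfectly balanced
```

With only one task per process (plain MPI), no such migration is possible; over-decomposition is what gives the runtime room to rebalance.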

Ondes3D, a Seismic Wave Propagation Simulator

• Developed by BRGM [Aochi et al. 2013];
• Used to predict the consequences of future earthquakes.

Many sources of load imbalance:
• Absorbing boundary conditions (tasks at the borders perform more computation);
• Variation in the constitutive laws of the different geological layers (different equations);
• Propagation of the shockwave in space and time.

Mesh-partitioning techniques and quasi-static load-balancing algorithms are thus ineffective.

AMPI can be quite effective

[Bar chart: average execution times (seconds) of 500 time-steps for 288, 2304, and 4608 chunks; bars compare MPI (no LB), AMPI (no LB), RefineLB, NucoLB, HwTopoLB, HierarchicalLB 1, and HierarchicalLB 2. Load balancing reduces execution time by roughly 32% to 37% (-33.97%, -36.58%, -32.12%, -34.72%).]

Based on the Mw 6.6, 2007 Niigata Chuetsu-Oki, Japan, earthquake [Aochi et al., ICCS 2013]
• Full problem (6000 time-steps): 162 minutes on 32 nodes (Intel Harpertown processors)

Challenges

Finding the best load-balancing parameters:
• Which load balancer is the most suited?
• How many iterations should be grouped together? (migration frequency)
• How many VPs? (decomposition level)
• Load-balancing benefit vs. application communication overhead and LB overhead
• ...

And preparing for AMPI is not free:
• Need to write data-serialization code;
• Engaging in such an approach without knowing how much there is to gain can be deterring.

Goal: propose a sound methodology for investigating the performance improvement of irregular applications through over-decomposition.


SimGrid

Off-line mode: trace replay
• Trace the application once, on a simple cluster (mpirun, TAU, PAPI), to obtain a time-independent trace:

  0 compute 1e6
  0 send 1 1e6
  0 recv 3 1e6
  1 recv 0 1e6
  1 compute 1e6
  1 send 2 1e6
  2 recv 1 1e6
  2 compute 1e6
  2 send 3 1e6
  3 recv 2 1e6
  3 compute 1e6
  3 send 0 1e6

• Model the machine of your dreams through a platform description
  [Figure: hierarchical platform with up/down links and limiters; 13G, 10G, 1.5G, and 1G link capacities over node groups 1-39, 40-74, 75-104, and 105-144]
• Replay the trace as many times as you want:

  [0.001000] 0 compute 1e6 0.01000
  [0.010028] 0 send 1 1e6 0.009028
  [0.040113] 0 recv 3 1e6 0.030085
  [0.010028] 1 recv 0 1e6 0.010028
  ...

On-line mode (SMPI): simulate/emulate unmodified complex applications
• MPI application → simulated or emulated computations + simulated communications → simulated execution time (e.g., 43.232 seconds) and a timed trace for visualization (Pajé time slices, Triva);
• Possible memory folding and shadow execution;
• Handles non-deterministic applications.

About SimGrid/SMPI:
• SimGrid: a 15-year-old collaboration between France, the US, the UK, Austria, ...;
• Flow-level models that account for topology and contention;
• SMPI supports both trace replay and direct emulation;
• Embeds 100+ collective-communication algorithms.
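The time-independent traces above can be replayed under any cost model. A minimal sketch of the idea (nothing like SimGrid's actual flow-level models; the compute speed, bandwidth, and latency constants are invented) replays the four-rank ring trace from the slide:

```python
from collections import defaultdict, deque

FLOPS = 1e8           # assumed compute speed (flop/s)
BW, LAT = 1e9, 1e-5   # assumed network bandwidth (B/s) and latency (s)

def replay(trace_lines):
    """Replay a time-independent trace; return each rank's finish time."""
    events = defaultdict(deque)
    for line in trace_lines:
        parts = line.split()
        events[int(parts[0])].append(parts[1:])
    clock = {rank: 0.0 for rank in events}
    sent = defaultdict(deque)  # (src, dst) -> send completion times
    progressed = True
    while progressed and any(events.values()):
        progressed = False
        for rank, queue in events.items():
            while queue:
                ev = queue[0]
                if ev[0] == "compute":
                    clock[rank] += float(ev[1]) / FLOPS
                elif ev[0] == "send":
                    clock[rank] += LAT + float(ev[2]) / BW
                    sent[(rank, int(ev[1]))].append(clock[rank])
                else:  # recv: wait until the matching send has been replayed
                    src = int(ev[1])
                    if not sent[(src, rank)]:
                        break
                    clock[rank] = max(clock[rank], sent[(src, rank)].popleft()) + LAT
                queue.popleft()
                progressed = True
    return clock

trace = """0 compute 1e6
0 send 1 1e6
0 recv 3 1e6
1 recv 0 1e6
1 compute 1e6
1 send 2 1e6
2 recv 1 1e6
2 compute 1e6
2 send 3 1e6
3 recv 2 1e6
3 compute 1e6
3 send 0 1e6""".splitlines()

finish = replay(trace)  # rank 0 finishes last, after its message loops the ring
```

Changing FLOPS, BW, or LAT and replaying again is the whole point: one captured trace, arbitrarily many simulated platforms.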


Principle

Approach:
1. Implement various load-balancing algorithms in SMPI;
2. Capture a time-independent trace (a faithful application profile):
   • Two alternatives:
     - standard tracing: parallel and fast, but requires more resources;
     - emulation (smpicc/smpirun): requires a single host, but slow;
   • Add a fake call to MPI_Migrate where needed;
   • Track how much memory is used by each VP and use it as an upper bound on the migration cost;
   • May take some time, but requires minimal modification of, and knowledge about, the application;
3. Replay the trace as often as wished, playing with the different parameters (LB, frequency, topology, ...).
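The memory-based migration-cost bound above can be approximated very simply (a sketch only; the real capture instruments the application, and the per-VP footprints here are invented): charge each migrating VP its peak memory footprint shipped over the network.

```python
def migration_cost(vp_mem_peak, migrations, bw=1e9, lat=1e-5):
    """Upper bound on migration time: ship each migrating VP's peak
    memory footprint (bytes) over a bw-B/s link with latency lat."""
    return sum(lat + vp_mem_peak[vp] / bw for vp in migrations)

# Invented peak footprints (bytes) for 4 VPs; suppose VPs 0 and 2 migrate.
peaks = {0: 64e6, 1: 16e6, 2: 64e6, 3: 16e6}
cost = migration_cost(peaks, [0, 2])  # about 0.128 s on a 1 GB/s link
```

Because it is an upper bound, the replay stays pessimistic about migration: if load balancing wins under this model, it should win in real life too.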

Principle

Key questions:
• How do we know whether our simulations are faithful?
• How do we understand where a mismatch comes from?
  • VP scheduling, LB implementation, trace capture, network, ...

Evaluation Challenge

[Gantt charts: No LB vs. GreedyLB]

No LB vs. GreedyLB: simple Gantt charts are not very informative.


Description of the Experiments

Scenarios: two different earthquake simulations
• Niigata-ken Chuetsu-Oki: 2007, Mw 6.6, Japan; 500 time-steps; dimensions 300x300x150
• Ligurian: 1887, Mw 6.3, north-western Italy; 300 time-steps; dimensions 500x350x130

Load balancers: no load balancing vs. GreedyLB vs. RefineLB

Hardware resources: Parapluie cluster from Grid'5000
• 2 x AMD Opteron™ 6164 HE, 24 cores, 1.7 GHz, InfiniBand
• Plus my own laptop (Intel Core™ i7-4610M, 2 cores, 3 GHz)
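The two balancers compared here differ mainly in how aggressively they reshuffle work. A rough sketch of both strategies (simplified from Charm++'s actual GreedyLB/RefineLB implementations; object names and loads are made up):

```python
import heapq

def greedy_lb(obj_loads, n_procs):
    """GreedyLB-style: heaviest object first, onto the least-loaded processor.
    Ignores the current placement, so it may migrate almost everything."""
    heap = [(0.0, p) for p in range(n_procs)]  # (load, processor)
    heapq.heapify(heap)
    assignment = {}
    for obj in sorted(obj_loads, key=obj_loads.get, reverse=True):
        load, p = heapq.heappop(heap)
        assignment[obj] = p
        heapq.heappush(heap, (load + obj_loads[obj], p))
    return assignment

def refine_lb(assignment, obj_loads, n_procs, tol=1.05):
    """RefineLB-style: only move objects off overloaded processors,
    keeping the number of migrations small."""
    avg = sum(obj_loads.values()) / n_procs
    loads = [0.0] * n_procs
    for obj, p in assignment.items():
        loads[p] += obj_loads[obj]
    for obj, p in sorted(assignment.items(), key=lambda kv: -obj_loads[kv[0]]):
        if loads[p] > tol * avg:                     # p is overloaded
            q = min(range(n_procs), key=loads.__getitem__)
            if loads[q] + obj_loads[obj] < loads[p]:  # the move helps
                loads[p] -= obj_loads[obj]
                loads[q] += obj_loads[obj]
                assignment[obj] = q
    return assignment
```

On a toy 2-processor case where four objects (loads 4, 3, 2, 1) all start on processor 0, refine_lb reaches a perfect 5/5 split while migrating only two objects; greedy_lb reaches the same balance but computes the placement from scratch, which can mean many more migrations on a real run.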

Chuetsu-Oki simulation, 64 VPs and 16 processes: Detailed View

[Heatmap: load (0.4 to 1.0) of each of the 16 resources along 500 iterations, in panels None / GreedyLB / RefineLB; top row AMPI (real execution), bottom row SMPI (simulation)]

Chuetsu-Oki simulation, 64 VPs and 16 processes: Space-Aggregated View

[Scatter plot: average load (0 to 0.9) over time (0 to 750 s), 5-10 runs for each configuration, in panels None / GreedyLB / RefineLB; AMPI vs. SMPI points]

• The simulated load behaves very similarly to real life;
• GreedyLB is the best choice both in simulation and in real life;
• There is still some mismatch in terms of makespan.

Ligurian simulation, 64 VPs and 16 processes: Detailed View

[Heatmap: load (0.4 to 1.0) of each of the 16 resources along 300 iterations, in panels None / GreedyLB / RefineLB; top row AMPI (real execution), bottom row SMPI (simulation)]

Ligurian simulation, 64 VPs and 16 processes: Space-Aggregated View

[Scatter plot: average load (0 to 0.9) over time (0 to 1250 s), in panels None / GreedyLB / RefineLB; AMPI vs. SMPI points]

• Once again, the simulated and real-life loads behave similarly;
• RefineLB is the best choice both in simulation and in real life;
• There is a mismatch in the timings between simulation and real life.

Impact of the LB Frequency (simulation)

• Call MPI_Migrate on every iteration during capture;
• Change the load-balancing frequency in simulation.

[Line chart: makespan (s) vs. load-balancing interval (10, 20, 30, 40), for the None, GreedyLB, and RefineLB heuristics]

Conclusion: use RefineLB every 10 or 20 iterations.
• Trace-capture time: 10 (XP) x 5 hours ≈ 50 hours
• Simulation time: 10 x 3 (heuristics) x 4 (frequencies) x 200 s ≈ 6 h 40 min
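The trade-off this sweep explores can be captured by a toy cost model (all constants invented: 1 s per balanced iteration, imbalance drifting by 0.02 s per iteration since the last rebalance, 2 s per LB invocation): balancing too often pays the LB cost too many times, balancing too rarely lets the imbalance build up.

```python
def total_time(n_iters, interval, t_balanced=1.0, drift=0.02, lb_cost=2.0):
    """Total runtime when the balancer runs every `interval` iterations.
    Iteration time grows with the imbalance accumulated since the last LB."""
    total = 0.0
    for it in range(n_iters):
        total += t_balanced + drift * (it % interval)
        if (it + 1) % interval == 0:
            total += lb_cost
    return total

times = {k: total_time(500, k) for k in (1, 5, 10, 20, 40)}
best = min(times, key=times.get)  # interior optimum, neither extreme
```

With these invented constants the minimum happens to fall around 10-20 iterations; in the methodology, of course, the optimum comes from replaying the actual trace rather than from any analytic model.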

Impact of the Decomposition Level (in simulation)

[Bar chart: makespan (s) vs. VP count (32, 48, 64, 128, 256), for the None, GreedyLB, and RefineLB heuristics]

Conclusion: use either GreedyLB with 32 VPs or RefineLB with 64 VPs.
• Trace-capture time: 5 (VP) x 5 (XP) x 5 hours ≈ 5 days
• Simulation time: 5 x 5 x 3 (heuristics) x 200 s ≈ 4.1 hours

Impact of the Decomposition Level (real experiments)

[Bar chart: makespan (s) vs. VP count (32, 48, 64, 128, 256), for the None, GreedyLB, and RefineLB heuristics]

Same conclusion... obtained in only ≈ 29 hours, but on a 16-node cluster.


Conclusion

• This is still ongoing work... any comments are welcome!
• Simulation of over-decomposition-based dynamic load balancing:
  • good results in terms of load distribution;
  • some inaccuracy in terms of total makespan.
• Visualizing the evolution of resource usage is:
  • quite useful to compare simulation with real life;
  • or to compare different load-balancing heuristics.
• We need to devise some way to speed up trace collection:
  • to facilitate the analysis of different over-decomposition levels;
  • is there some way to get similar input traces straight from Charm++/AMPI?

Acknowledgements

This research was partially supported by:
• CNPq: PhD scholarship at the Post-Graduate Program in Computer Science (PPGC) at UFRGS;
• CAPES-COFECUB: part of this work was conducted during a sandwich doctorate at the Laboratoire d'Informatique de Grenoble, supported by the CAPES/COFECUB international cooperation program, financed by CAPES within the Ministry of Education of Brazil;
• HPC4E: this research has received funding from the EU H2020 Programme and from MCTI/RNP-Brazil under the HPC4E project, grant agreement number 689772.

