Using SimGrid to Evaluate the Impact of AMPI Load Balancing in a Geophysics HPC Application

Rafael Keller Tesser*, Philippe O. A. Navaux*, Lucas Mello Schnorr*, Arnaud Legrand†
*: UFRGS GPPD/Inf, Porto Alegre, Brazil
†: CNRS/Inria POLARIS, Grenoble, France

14th Charm++ Workshop, Urbana, April 2016
Outline
1. Context: Improving the Performance of Iterative Unbalanced Applications
2. SimGrid and SMPI in a Nutshell
3. A Simulation-Based Methodology
4. Experimental Results: Validation; Investigating AMPI Parameters
5. Conclusion
Context

Parallel HPC applications are often written with MPI, which is based on a regular SPMD programming model.
• Many of these applications are iterative, and this paradigm is well suited to balanced workloads.
• Unbalanced applications:
  • may resort to static load balancing techniques (at the application level);
  • or not... when the load imbalance comes from the nature of the input data and evolves over time and space (e.g., Ondes3D), handling it at the application level is a nightmare.

A possible approach is to use over-decomposition and dynamic process-level load balancing, as proposed by AMPI/Charm++.
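The benefit of over-decomposition can be illustrated with a toy model (plain Python, invented numbers; `greedy_makespan` is a simplified stand-in for a Charm++-style greedy balancer, not AMPI code): cutting the domain into many more chunks than processes gives the runtime the freedom to even out an irregular workload.

```python
# Toy illustration (not AMPI itself): over-decomposition lets a greedy
# load balancer even out an irregular workload. All numbers are invented.
import random

def greedy_makespan(chunk_costs, n_procs):
    """Assign chunks greedily (heaviest first onto the least-loaded
    processor) and return the resulting makespan."""
    loads = [0.0] * n_procs
    for cost in sorted(chunk_costs, reverse=True):
        i = loads.index(min(loads))
        loads[i] += cost
    return max(loads)

random.seed(0)
procs = 16
# Irregular per-chunk cost, as caused e.g. by absorbing boundary regions.
work = [1.0 + 4.0 * random.random() for _ in range(256)]

# Static SPMD split: 16 contiguous blocks, one per process.
block = len(work) // procs
static = max(sum(work[i*block:(i+1)*block]) for i in range(procs))
# Over-decomposed: 256 small chunks balanced across the same 16 processes.
balanced = greedy_makespan(work, procs)
print(f"static makespan:   {static:.1f}")
print(f"balanced makespan: {balanced:.1f}")
```

With 256 chunks the greedy assignment lands close to the ideal average load per process, while the static 16-way split is stuck with whatever imbalance the contiguous blocks happen to contain.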
Ondes3D, a Seismic Wave Propagation Simulator
• Developed by BRGM [Aochi et al. 2013];
• Used to predict the consequences of future earthquakes.

Many sources of load imbalance:
• Absorbing boundary conditions (tasks at the borders perform more computation);
• Variation in the constitutive laws of different geological layers (different equations);
• Propagation of the shockwave in space and time.

Mesh partitioning techniques and quasi-static load balancing algorithms are thus ineffective.
AMPI Can Be Quite Effective

[Figure: average execution times (500 time-steps) for 288, 2304, and 4608 chunks, comparing MPI (no LB), AMPI (no LB), RefineLB, NucoLB, HwTopoLB, HierarchicalLB 1, and HierarchicalLB 2. The best load balancers reduce execution time by roughly 32% to 37% (-33.97%, -36.58%, -32.12%, -34.72%).]

Based on the Mw 6.6, 2007 Niigata Chuetsu-Oki, Japan, earthquake [Aochi et al. ICCS 2013].
• Full problem (6000 time-steps): 162 minutes on 32 nodes (Intel Harpertown processors).
Challenges

Finding the best load balancing parameters:
• Which load balancer is the most suited?
• How many iterations should be grouped together? (migration frequency)
• How many VPs? (decomposition level)
• Load-balancing benefit vs. application communication overhead and LB overhead
• ...

And preparing for AMPI is not free:
• Data serialization code must be written for migration;
• Engaging in such an approach without knowing how much there is to gain can be deterring.

Goal: propose a sound methodology for investigating the performance improvement of irregular applications through over-decomposition.
SimGrid

Off-line mode: trace replay.
• Capture a time-independent trace once, on a simple cluster (mpirun, instrumented with TAU/PAPI):

    0 compute 1e6
    0 send 1 1e6
    0 recv 3 1e6
    1 recv 0 1e6
    1 compute 1e6
    1 send 2 1e6
    2 recv 1 1e6
    2 compute 1e6
    2 send 3 1e6
    3 recv 2 1e6
    3 compute 1e6
    3 send 0 1e6

• Model the machine of your dreams: a platform description with hierarchical links (e.g., 1G/10G/13G links, up/down limiters, node groups such as 1-39, 40-74, 75-104, 105-144).
• Replay the trace as many times as you want, obtaining a timed trace:

    [0.001000] 0 compute 1e6 0.01000
    [0.010028] 0 send 1 1e6 0.009028
    [0.040113] 0 recv 3 1e6 0.030085
    [0.010028] 1 recv 0 1e6 0.010028
    ...

On-line mode (SMPI): simulate/emulate unmodified, complex MPI applications.
• Simulated or emulated computations plus simulated communications yield a simulated execution time (e.g., 43.232 seconds);
• Possible memory folding and shadow execution;
• Handles non-deterministic applications;
• Visualization of the simulated execution (Pajé time slices, Triva).

In a nutshell:
• SimGrid: a 15-year-old collaboration between France, the US, the UK, Austria, ...
• Flow-level models that account for topology and contention;
• SMPI supports both trace replay and direct emulation;
• Embeds 100+ collective communication algorithms.
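The off-line mode can be sketched with a toy replayer, under strong simplifying assumptions (a uniform flop rate, a single latency/bandwidth link, eager sends, no contention); the real SimGrid instead uses validated flow-level network models. The trace below is the ring pattern from the slide.

```python
# A minimal time-independent trace replayer in the spirit of SMPI's
# off-line mode (a toy; SimGrid models contention, topology, and many
# more MPI details). Speeds are assumed values, not SimGrid defaults.
FLOPS = 1e9          # assumed compute speed (flop/s)
BANDWIDTH = 1e9      # assumed link bandwidth (bytes/s)
LATENCY = 1e-5       # assumed link latency (s)

def replay(trace):
    """trace: list of (rank, action, ...) with volumes in flops/bytes."""
    ranks = sorted({e[0] for e in trace})
    clock = {r: 0.0 for r in ranks}
    queues = {r: [e for e in trace if e[0] == r] for r in ranks}
    sent = {}  # (src, dst) -> completion time of the posted send
    progress = True
    while progress:
        progress = False
        for r in ranks:
            while queues[r]:
                ev = queues[r][0]
                if ev[1] == "compute":
                    clock[r] += ev[2] / FLOPS
                elif ev[1] == "send":
                    clock[r] += LATENCY + ev[3] / BANDWIDTH
                    sent[(r, ev[2])] = clock[r]
                elif ev[1] == "recv":
                    if (ev[2], r) not in sent:
                        break  # wait until the matching send is replayed
                    clock[r] = max(clock[r], sent.pop((ev[2], r)))
                queues[r].pop(0)
                progress = True
    return max(clock.values())

trace = [(0, "compute", 1e6), (0, "send", 1, 1e6), (0, "recv", 3, 1e6),
         (1, "recv", 0, 1e6), (1, "compute", 1e6), (1, "send", 2, 1e6),
         (2, "recv", 1, 1e6), (2, "compute", 1e6), (2, "send", 3, 1e6),
         (3, "recv", 2, 1e6), (3, "compute", 1e6), (3, "send", 0, 1e6)]
print(f"simulated makespan: {replay(trace):.6f} s")
```

Changing `FLOPS`, `BANDWIDTH`, or `LATENCY` re-simulates the same trace on a different (possibly imaginary) platform without re-running the application, which is exactly the appeal of trace replay.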
Outline 1
Context: Improving The Performance of Iterative Unbalanced Applications
2
SimGrid and SMPI in a Nutshell
3
A Simulation Based Methodology
4
5
Experimental Results Validation Investigating AMPI parameters Conclusion
9 / 21
Principle

Approach:
1. Implement various load-balancing algorithms in SMPI.
2. Capture a time-independent trace (a faithful application profile):
   • Two alternatives:
     – standard tracing: parallel and fast, but requires more resources;
     – emulation (smpicc/smpirun): requires only a single host, but slow.
   • Add a fake call to MPI_Migrate where needed;
   • Track how much memory is used by each VP and use it as an upper bound on the migration cost;
   • May take some time, but requires minimal modification and knowledge of the application.
3. Replay the trace as often as wished, playing with the different parameters (LB algorithm, frequency, topology, ...).
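Step 3 can be sketched as follows. The load profile, the migration cost, and the balancer (a greedy heaviest-first heuristic standing in for GreedyLB) are all illustrative assumptions, not Ondes3D data or the actual SMPI implementation.

```python
# Sketch: replay one captured per-VP load profile under different
# load-balancing intervals. Profile and costs are invented.
def replay_with_lb(vp_loads, n_procs, lb_interval, migration_cost=0.5):
    """vp_loads[v][t]: cost of VP v at iteration t. Rebalance greedily
    every lb_interval iterations; each rebalancing adds migration_cost."""
    n_iter = len(vp_loads[0])
    placement = {v: v % n_procs for v in range(len(vp_loads))}
    makespan = 0.0
    for t in range(n_iter):
        if lb_interval and t % lb_interval == 0:
            # Greedy: heaviest VP (by current cost) onto least-loaded proc.
            loads = [0.0] * n_procs
            for v in sorted(range(len(vp_loads)),
                            key=lambda v: -vp_loads[v][t]):
                p = loads.index(min(loads))
                placement[v] = p
                loads[p] += vp_loads[v][t]
            makespan += migration_cost
        per_proc = [0.0] * n_procs
        for v, p in placement.items():
            per_proc[p] += vp_loads[v][t]
        makespan += max(per_proc)  # iteration ends at the slowest process
    return makespan

# A drifting hotspot: the expensive region moves across VPs over time.
vps, iters = 16, 100
profile = [[1.0 + (3.0 if (t // 10) % vps == v else 0.0)
            for t in range(iters)] for v in range(vps)]
for k in (0, 1, 10, 50):
    print(f"LB interval {k}: makespan {replay_with_lb(profile, 4, k):.1f}")
```

With this synthetic profile, rebalancing exactly as often as the hotspot moves beats both never balancing (stale placement) and balancing every iteration (migration overhead), which is the trade-off the methodology lets one explore cheaply in simulation.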
Principle

Key questions:
• How do we know whether our simulations are faithful?
• How do we understand where a mismatch comes from?
  • VP scheduling, LB implementation, trace capture, network, ...
Evaluation Challenge

No LB vs. GreedyLB: simple Gantt charts are not very informative.
Description of the Experiments

Scenarios: two different earthquake simulations:
• Niigata-ken Chuetsu-Oki: 2007, Mw 6.6, Japan; 500 time-steps; dimensions: 300x300x150.
• Ligurian: 1887, Mw 6.3, north-western Italy; 300 time-steps; dimensions: 500x350x130.

Load balancers: no load balancing vs. GreedyLB vs. RefineLB.

Hardware resources: Parapluie cluster from Grid'5000:
• 2 x AMD Opteron™ 6164 HE, 24 cores per node, 1.7 GHz, InfiniBand;
• Plus my own laptop (Intel Core™ i7-4610M, 2 cores, 3 GHz).
Chuetsu-Oki Simulation - 64 VPs and 16 Processes: Detailed View

[Figure: per-resource load heatmaps (resources 1-16 vs. iterations 0-500, load from 0.4 to 1.0) for three configurations (None, GreedyLB, RefineLB), with one row of panels for the real AMPI execution and one for the SMPI simulation.]
Chuetsu-Oki Simulation - 64 VPs and 16 Processes: Space-Aggregated View

[Figure: average load over time (0-750 s), 5-10 runs for each configuration (None, GreedyLB, RefineLB), overlaying the real AMPI executions and the SMPI simulations.]

• The simulated load behaves very similarly to real life;
• GreedyLB is the best choice in both simulation and real life;
• There is still some mismatch in terms of makespan.
Ligurian Simulation - 64 VPs and 16 Processes: Detailed View

[Figure: per-resource load heatmaps (resources 1-16 vs. iterations 0-300, load from 0.4 to 1.0) for None, GreedyLB, and RefineLB, with one row of panels for the real AMPI execution and one for the SMPI simulation.]
Ligurian Simulation - 64 VPs and 16 Processes: Space-Aggregated View

[Figure: average load over time (0-1250 s) for None, GreedyLB, and RefineLB, overlaying the real AMPI executions and the SMPI simulations.]

• Once again, the simulated and real-life loads behave similarly;
• RefineLB is the best choice in both simulation and real life;
• There is a mismatch in the timings between simulation and real life.
Impact of the LB Frequency (Simulation)

• Call MPI_Migrate on every iteration;
• Change the load-balancing frequency in simulation.

[Figure: makespan vs. load-balancing interval (10, 20, 30, 40 iterations) for None, GreedyLB, and RefineLB.]

Conclusion: use RefineLB every 10 or 20 iterations.
• Trace capture time: 10 (XP) x 5 hours ≈ 50 hours;
• Simulation time: 10 x 3 (heuristics) x 4 (freq.) x 200 s ≈ 6h40m.
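The cost arithmetic on this slide can be checked directly; it is what makes the simulation-based parameter sweep attractive compared with running every configuration for real.

```python
# Checking the slide's cost arithmetic: 10 experiments captured at
# ~5 hours each, versus replaying each trace under 3 heuristics x
# 4 LB frequencies at ~200 s per simulation.
capture_h = 10 * 5                      # 50 hours of trace capture
sim_h = 10 * 3 * 4 * 200 / 3600         # ~6.7 hours (about 6h40m)
print(f"capture: {capture_h} h, simulation: {sim_h:.2f} h")
```

The capture cost is paid once; every additional parameter combination afterwards costs only simulation time.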
Impact of the Decomposition Level (Simulation)

[Figure: makespan vs. VP count (32, 48, 64, 128, 256) for None, GreedyLB, and RefineLB.]

Conclusion: use either GreedyLB with 32 VPs or RefineLB with 64 VPs.
• Trace capture time: 5 (VP) x 5 (XP) x 5 hours ≈ 5 days;
• Simulation time: 5 x 5 x 3 (heuristics) x 200 s ≈ 4.1 hours.
Impact of the Decomposition Level (Real Experiments)

[Figure: makespan vs. VP count (32, 48, 64, 128, 256) for None, GreedyLB, and RefineLB, measured on the real platform.]

Same conclusion... in only ≈ 29 hours, but on a 16-node cluster.
Conclusion
• This is still ongoing work... any comments are welcome!
• Simulation of over-decomposition-based dynamic load balancing:
  • good results in terms of load distribution;
  • some inaccuracy in terms of total makespan.
• Visualizing the evolution of resource usage is quite useful:
  • to compare simulation with real life;
  • or to compare different load-balancing heuristics.
• We need to devise some way to speed up trace collection:
  • to facilitate the analysis of different over-decomposition levels;
  • is there some way to get similar input traces straight from Charm++/AMPI?
Acknowledgements

This research was partially supported by:
• CNPq: PhD scholarship at the Post-Graduate Program in Computer Science (PPGC) at UFRGS;
• CAPES-COFECUB: part of this work was conducted during a sandwich doctorate scholarship at the Laboratoire d'Informatique de Grenoble, supported by the International Cooperation Program CAPES/COFECUB and financed by CAPES within the Ministry of Education of Brazil;
• HPC4E: this research has received funding from the EU H2020 Programme and from MCTI/RNP-Brazil under the HPC4E Project, grant agreement number 689772.