Apr 4, 2008
Reliability-aware Optimal K-Node Allocation of Parallel Applications in Large Scale HPC Systems
Narasimha R. Gottumukkala1, 2, Chokchai Box Leangsuksun2, Raja Nassar2, Mihaela Paun2, 3, Dileep Sule2, Stephen L. Scott4
1 Centre for Business and Information Technologies, University of Louisiana at Lafayette, LA
2 College of Engineering & Science, Louisiana Tech University, LA
3 Faculty of Mathematics and Informatics, Spiru Haret University, Bucharest, Romania
4 Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN
1 This research is supported by the Department of Energy Grant No. DE-FG02-05ER25659, and 4 by the Mathematics, Information and Computational Sciences Office, Office of Advanced Scientific Computing Research, Office of Science.
Introduction
Present and future computational applications require massively parallel processors
- Top500.org reports
Unexpected failures and downtimes are a major performance hindrance for large scale parallel applications
- A single node failure interrupts the entire parallel application running on all the nodes
For MPI jobs, if one processor fails, the whole job running on all processors is aborted
- An increasing processor count means more failures and downtimes, which degrades performance
Motivation
Typical fault tolerance mechanisms for parallel jobs
- Duplicate tasks/resources: too expensive
- Failure prediction: not 100% accurate
- Checkpoint/restart: complex and difficult in HPC
Reliability-aware resource management
- Allocates nodes such that the performance loss due to failures is minimal
- Individual nodes are found to have time-varying failure rates
Scalability vs Reliability
Increasing the number of nodes decreases reliability
Job run length
Increasing job completion time decreases reliability
Scalability vs Reliability
[Figure; recoverable label: MTTR]
Increasing the number of nodes decreases completion time
Problem Statement
Given a parallel job that takes T hours to run on a single node, what is the optimal number of nodes that minimizes the completion time?
- Minimize the failure probability and waste time
- Minimize the total completion time
Outline
- Related Work
- Resource Allocation Algorithms
- Expected Completion Time on K Nodes with Reliability
- Reliability-Aware Optimal k-Node Allocation Algorithm
- Simulation Results
- Conclusion and Future Work
Related Work
Current schedulers do not consider reliability as an important performance metric
- FIFO (First In First Out)
- Backfilling (move short jobs ahead in the queue as long as the long jobs are not delayed)
Scalability of parallel applications
- Amdahl's Law
• Speedup saturates beyond a certain limit set by the serial fraction, no matter how many processors are added
- Gustafson's Law
• Any sufficiently large problem can be efficiently parallelized
The importance of the number of processors has been studied for checkpointing applications [Plank et al]
Reliability-aware resource allocation uses the observed reliability of nodes to minimize the waste time due to failures.
Reliability aware optimal k-node allocation
Idea:
- Select k nodes out of n such that the expected completion time is minimal
The expected completion time is based on
- The reliability of k nodes
- The completion time on k nodes
- The expected repair time
- The expected waste time in the presence of failures
Expected Completion Time of Parallel Program on k-Nodes
The expected completion time is given by
$$E(T_c(k)) = T_c(k) + (W + R)\,\frac{F_k}{1 - F_k}$$
where
- $T_c(k)$ is the actual running time of the job on k nodes
- $W$ is the expected waste time
- $R$ is the expected repair time
- $F_k$ is the system failure probability
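One way to read the factor $F_k/(1-F_k)$ in the expected completion time is sketched below. It assumes, which the slide does not state, that each attempt of the job fails independently with probability $F_k$ and that the job is resubmitted until it succeeds. Here $T_c(k)$ is the running time on k nodes, $W$ the expected waste time and $R$ the expected repair time. Under that assumption the number of failed attempts before the first success is geometrically distributed:

```latex
\begin{align*}
E[\text{failed attempts}]
  &= \sum_{j=0}^{\infty} j \, F_k^{\,j} (1 - F_k)
   = \frac{F_k}{1 - F_k}, \\
E(T_c(k))
  &= T_c(k) + (W + R)\,\frac{F_k}{1 - F_k}.
\end{align*}
```

As an illustrative calculation with made-up numbers: for $F_k = 0.2$, $W + R = 4$ hours and $T_c(k) = 10$ hours, the expected number of failed attempts is $0.2 / 0.8 = 0.25$, so the expected completion time is $10 + 4 \times 0.25 = 11$ hours.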
Expected completion time: With scalability models
[Figure panels: Amdahl's Law, Gustafson's Law]
The expected completion time increases beyond a certain point because the higher failure probability requires resubmission
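The interaction between a scalability model and reliability can be sketched in a few lines of Python. This is a minimal illustration, not the authors' model: it assumes that all nodes fail independently with the same exponential rate `lam`, that the Gustafson running time is the single-node time divided by the scaled speedup, and the parameter names (`s` for the serial fraction, `W`, `R`) and values are hypothetical.

```python
import math

def amdahl_time(T, k, s):
    """Running time on k nodes under Amdahl's Law for a job taking T hours on one node."""
    return T * (s + (1.0 - s) / k)

def gustafson_time(T, k, s):
    """Running time on k nodes, assuming Gustafson's scaled speedup S(k) = s + (1 - s) * k."""
    return T / (s + (1.0 - s) * k)

def expected_completion_time(T, k, s, lam, W, R, time_model=amdahl_time):
    """E(T_c(k)) = T_c(k) + (W + R) * F_k / (1 - F_k).

    Assumption: k identical nodes with exponential failure rate lam (per hour),
    so the system reliability over the run is exp(-k * lam * T_c(k)).
    """
    t_k = time_model(T, k, s)
    F_k = 1.0 - math.exp(-k * lam * t_k)  # probability that at least one node fails
    return t_k + (W + R) * F_k / (1.0 - F_k)

# Hypothetical job: 100 hours on one node, 5% serial work, W = 5 h, R = 2 h.
for k in (1, 128, 4096):
    print(k, round(expected_completion_time(100.0, k, 0.05, 1e-4, 5.0, 2.0), 2))
```

With these illustrative parameters the expected completion time first falls as nodes are added and then rises again once the failure term dominates, which is the trade-off the slide describes.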
Reliability Aware Optimal K-Node allocation Algorithm
$$R_k = \prod_{i=1}^{k} R_i(t_i + x \mid t_i)$$
$$E(T_c(k)) = T_c(k) + (W + R)\,\frac{F_k}{1 - F_k}$$
Optimal k node Algorithm:
- Select k out of N nodes such that the expected completion time is minimal
Each node has a different reliability, and increasing the number of nodes decreases reliability
Increasing the number of nodes also decreases completion time
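The selection rule above can be sketched as a simple search over k. This is an illustration under assumptions that are not taken from the slides: each node i has a constant (exponential) failure rate, so its conditional reliability over a run of length x is exp(-rate_i * x); the system fails if any node fails, so F_k = 1 - R_k; and the running time follows Amdahl's Law with a hypothetical serial fraction `s`. The function name `optimal_k` and all parameter values are made up for the example.

```python
import math

def optimal_k(failure_rates, T, s, W, R):
    """Return (k, expected_time) minimizing E(T_c(k)) over k = 1..N.

    failure_rates: per-node failure rates in failures per hour (hypothetical inputs).
    For a fixed k, the k lowest-rate nodes maximize R_k under this model,
    so it is enough to scan prefixes of the sorted list.
    """
    rates = sorted(failure_rates)          # most reliable nodes first
    best_k, best_e = 0, math.inf
    rate_sum = 0.0
    for k, lam in enumerate(rates, start=1):
        rate_sum += lam
        x = T * (s + (1.0 - s) / k)        # assumed Amdahl running time on k nodes
        R_k = math.exp(-rate_sum * x)      # product of exp(-rate_i * x) over the k nodes
        if R_k == 0.0:                     # reliability underflowed; expected time is unbounded
            continue
        F_k = 1.0 - R_k
        e = x + (W + R) * F_k / (1.0 - F_k)
        if e < best_e:
            best_k, best_e = k, e
    return best_k, best_e

# Hypothetical system: 100 reliable nodes and 100 nodes that fail 100x more often.
rates = [1e-4] * 100 + [1e-2] * 100
k, e = optimal_k(rates, T=100.0, s=0.05, W=5.0, R=2.0)
print(k, round(e, 3))
```

In this made-up system the search stops at the 100 reliable nodes: adding any of the unreliable ones shortens the run slightly but raises the failure term by more.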
Reliability aware optimal k-node algorithm: An Example Case
[Figure: Optimal k Node Allocation]
The figure shows how an optimal k is selected out of N nodes
The algorithm selects k nodes such that the expected completion time is minimal
Simulation study
Failure Data
- Used the failure properties of ASC White
- ASC White:
• LLNL system, 4 years of failure data from 7/1/2000 to 10/1/2004
• 512 nodes, 8192 processors
Parallel Job Workloads
- Generated a synthetic workload based on the distribution of job run-lengths and the distribution of the number of processors [Lublin01]
- Used the uniform-log distribution to generate the number of nodes, and a two-stage hyperexponential distribution to generate job run-lengths
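A synthetic workload of this kind could be generated roughly as follows. This is only a sketch of the two distributions named above, not the full workload model from the cited work: the mixing probability `p`, the two mean run-lengths and the node limit are hypothetical values chosen for illustration, and the function names are made up.

```python
import math
import random

def sample_num_nodes(rng, max_nodes):
    """Log-uniform node count: the base-2 logarithm is drawn uniformly."""
    u = rng.uniform(0.0, math.log2(max_nodes))
    return max(1, min(max_nodes, round(2.0 ** u)))

def sample_run_length(rng, p, mean_short, mean_long):
    """Two-stage hyperexponential: pick one of two exponentials, then sample it."""
    mean = mean_short if rng.random() < p else mean_long
    return rng.expovariate(1.0 / mean)  # expovariate takes the rate, i.e. 1 / mean

def generate_workload(n_jobs, max_nodes=512, p=0.7,
                      mean_short=1.0, mean_long=24.0, seed=0):
    """Return a list of (num_nodes, run_length_hours) tuples."""
    rng = random.Random(seed)  # seeded for repeatable experiments
    return [(sample_num_nodes(rng, max_nodes),
             sample_run_length(rng, p, mean_short, mean_long))
            for _ in range(n_jobs)]

jobs = generate_workload(5)
for nodes, hours in jobs:
    print(nodes, round(hours, 2))
```

Seeding the generator keeps a simulation run repeatable, so different allocation policies can be compared on the same job stream.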
Simulation Framework
Performance Metrics
- Job completion time
• MCT (Mean Completion Time) = total completion time / unit job run-length
- Waste time
• MWT (Mean Waste Time) = total waste time / unit job run-length
• where unit job run-length = job run-length / number of processors
- Relative Percentage Difference (RPD)
Resource Allocation Algorithms
RR (Round Robin)
- Allocates the job to k adjacent nodes based on a rotation policy over node ids
All (Select all available nodes)
- Selects all the available nodes in the system
RAS (Select m nodes out of N, with m fixed)
- The number of processors m for a job is given by the user (fixed). The algorithm selects the m most reliable processors available for every job
RA-Opt (Select k nodes out of N, with k not fixed)
- Selects k nodes out of N such that the expected completion time is minimal.
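The three baseline policies can be sketched as small selection functions. These are illustrative readings of the one-line descriptions above, not the authors' simulator code: the function names, the rotating `start` offset and the `reliability` dictionary are hypothetical. RA-Opt is left out because it also needs an expected completion time function.

```python
def round_robin(node_ids, k, start):
    """RR: take k adjacent node ids beginning at a rotating offset.

    Returns the selected ids and the offset to use for the next job.
    """
    n = len(node_ids)
    chosen = [node_ids[(start + i) % n] for i in range(k)]
    return chosen, (start + k) % n

def select_all(node_ids):
    """All: use every available node."""
    return list(node_ids)

def ras(reliability, m):
    """RAS: pick the m most reliable nodes; reliability maps node id -> reliability."""
    return sorted(reliability, key=reliability.get, reverse=True)[:m]

nodes = [0, 1, 2, 3, 4]
print(round_robin(nodes, 3, 3))                   # wraps around the end of the list
print(select_all(nodes))
print(ras({"a": 0.90, "b": 0.99, "c": 0.50}, 2))
```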
Experimental Results (MCT)
Mean Completion Time
- MCT is lower for RA-Opt than for all other techniques
- The RPD (Relative Percentage Difference) shows by what percentage RA-Opt performs better than each technique
Experimental Results (MWT)
Mean Waste Time
- MWT is lower for RA-Opt than for all other techniques
- Observe that the MWT is higher for All (selecting all nodes) than for the other techniques
Experimental Results (MWT with different job Run-lengths)
The MWT increases as job run-lengths increase
- Short and medium jobs do not fail very often, whereas longer jobs have a higher chance of failure
We observe that the RA-Opt technique has the lowest MWT of all the techniques, especially for very long jobs
Conclusions
Several factors affect the completion time of a parallel program as the number of nodes is scaled up
- Reliability becomes a major factor in deciding the optimal number of nodes to minimize completion time
Developed a reliability-aware optimal k-node allocation algorithm
- Based on the expected completion time function on k nodes
Simulation results
- Long jobs especially benefit from the reliability-aware optimal k-node allocation algorithm
Future Work
The reliability-aware optimal k-node allocation can be combined with various scheduling algorithms
- Investigate whether further improvement is possible
Developing reliability-aware optimal k-node allocation for checkpointed jobs
- Select the optimal k nodes by considering the checkpoint overhead
- The importance of selecting the number of processors for checkpointing has been noted by Plank et al.
Future Work
Different scalability models for expected completion times
This work can be extended to malleable jobs
- Jobs whose requirements change during runtime
Reliability-aware resource management/allocation for time-sharing applications
- More than one job is affected by a failure
References
[Plank et al 99] James S. Plank and Michael G. Thomason, "The Average Availability of Parallel Checkpointing Systems and Its Importance in Selecting Runtime Parameters," 29th International Symposium on Fault-Tolerant Computing, Madison, WI, June 1999, pp. 250-259.
[Kumar et al 91] V. Kumar and A. Gupta, "Analysis of scalability of parallel algorithms and architectures: a survey," Proceedings of the 5th International Conference on Supercomputing (ICS '91), Cologne, West Germany, June 17-21, 1991, E. S. Davidson and F. Hossfield, Eds., ACM, New York, NY, pp. 396-405.
[Gottumukkala et al 07] Narasimha Raju Gottumukkala, Chokchai Leangsuksun, Raja Nassar, and Stephen L. Scott, "Reliability-Aware Resource Allocation in HPC Systems," Proceedings of the IEEE International Conference on Cluster Computing 2007, Austin, Texas.
U. Lublin and D. G. Feitelson, "The Workload on Parallel Supercomputers: Modeling the Characteristics of Rigid Jobs," Journal of Parallel & Distributed Computing, vol. 63, no. 11, pp. 1105-1122, November 2003.
The End! Thank you