Reliability-aware Optimal K-Node Allocation of ...

Apr 4, 2008
Reliability-aware Optimal K-Node Allocation of Parallel Applications in Large Scale HPC Systems

Narasimha R. Gottumukkala (1, 2), Chokchai Box Leangsuksun (2), Raja Nassar (2), Mihaela Paun (2, 3), Dileep Sule (2), Stephen L. Scott (4)

(1) Centre for Business and Information Technologies, University of Louisiana at Lafayette, LA
(2) College of Engineering & Science, Louisiana Tech University, LA
(3) Faculty of Mathematics and Informatics, Spiru Haret University, Bucharest, Romania
(4) Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN

This research is supported by the Department of Energy, Grant no. DE-FG02-05ER25659, and (4) by the Mathematics, Information and Computational Sciences Office, Office of Advanced Scientific Computing Research, Office of Science.

Introduction

- Present and future computational applications require massively parallel processors
  - See the Top500.org reports
- Unexpected failures and downtimes are a major performance hindrance for large-scale parallel applications
  - A single node failure interrupts the entire parallel application running on all the nodes
- For MPI jobs, if one processor fails, the whole job running on all processors is aborted
  - An increasing processor count means more failures and more downtime, which degrades performance

Motivation

- Typical fault tolerance mechanisms for parallel jobs
  - Duplicate tasks/resources: too expensive
  - Failure prediction: not 100% accurate
  - Checkpoint/restart: complex and difficult in HPC
- Reliability-aware resource management
  - Allocates nodes such that the performance loss due to failures is minimal
  - Is motivated by the finding that individual nodes have time-varying failure rates

Scalability vs Reliability

- Increasing the number of nodes decreases reliability

Job run length

- Increasing job completion time decreases reliability
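The two trends above can be illustrated with a small sketch. It assumes independent nodes with a constant (exponential) failure rate, which is a simplification chosen here for illustration; the slides note that real nodes show time-varying failure rates. The function name `system_reliability` and the parameter values are hypothetical.

```python
import math


def system_reliability(num_nodes: int, run_hours: float, node_mtbf_hours: float) -> float:
    """Probability that every node survives the whole run.

    Assumption (not from the slides): nodes fail independently with a
    constant rate of 1 / node_mtbf_hours, so one node survives with
    probability exp(-run_hours / node_mtbf_hours).
    """
    per_node = math.exp(-run_hours / node_mtbf_hours)
    return per_node ** num_nodes


# More nodes, or a longer run, both lower the chance of a failure-free run.
for nodes in (1, 64, 512):
    for hours in (1.0, 24.0, 168.0):
        print(nodes, hours, system_reliability(nodes, hours, 10_000.0))
```

Under this model the system reliability is exp(-k * x / MTBF) for k nodes and run length x, so node count and run length enter symmetrically.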

Scalability vs Reliability

[Figure: plot with MTTR marked]

- Increasing the number of nodes decreases completion time

Problem Statement

- Given a parallel job that takes T hours to run on a single node, what number of nodes minimizes its completion time?
  - Minimize the failure probability and the waste time
  - Minimize the total completion time


Outline

- Related Work
- Resource Allocation Algorithms
- Expected Completion Time on K Nodes with Reliability
- Reliability-Aware Optimal k-Node Allocation Algorithm
- Simulation Results
- Conclusion and Future Work

Related Work

- Current schedulers do not consider reliability as an important performance metric
  - FIFO (First In First Out)
  - Backfilling (moves short jobs ahead in the queue as long as the longer jobs ahead of them are not delayed)
- Scalability of parallel applications
  - Amdahl's Law
    - Speedup grows with the number of processors but saturates at a limit set by the serial fraction of the program
  - Gustafson's Law
    - Any sufficiently large problem can be efficiently parallelized
- The importance of the number of processors has been studied for checkpointing applications [Plank et al 99]
- Reliability-aware resource allocation uses the observed reliability of nodes to minimize the waste time due to failures
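For reference, the two scalability laws named above can be written as short functions. This is a sketch using the standard textbook forms; the function and parameter names are chosen here for illustration.

```python
def amdahl_speedup(serial_fraction: float, num_procs: int) -> float:
    """Fixed-size speedup: bounded above by 1 / serial_fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / num_procs)


def gustafson_speedup(serial_fraction: float, num_procs: int) -> float:
    """Scaled speedup: the parallel part of the work grows with num_procs."""
    return num_procs - serial_fraction * (num_procs - 1)


# With a 5% serial fraction, Amdahl's speedup can never exceed 20,
# while Gustafson's scaled speedup keeps growing with the processor count.
for p in (8, 64, 512):
    print(p, amdahl_speedup(0.05, p), gustafson_speedup(0.05, p))
```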

Reliability-aware optimal k-node allocation

- Idea:
  - Select k nodes out of n such that the expected completion time is minimal
- The expected completion time is based on
  - The reliability of k nodes
  - The completion time on k nodes
  - The expected repair time
  - The expected waste time in the presence of failures

Expected Completion Time of Parallel Program on k-Nodes

- The expected completion time is given by

\[ E(T_k) = T_k + (M + R)\,\frac{F_k}{1 - F_k} \]

  where
  - T_k is the actual running time of the job on k nodes
  - M is the expected waste time
  - R is the expected repair time
  - F_k is the system failure probability
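One way to see where the factor F_k / (1 - F_k) comes from is the following sketch. It assumes, as a simplification not stated on the slide, that each submission of the job fails independently with probability F_k, that every failed attempt costs M + R on average, and that the final successful attempt takes T_k. The symbol N (number of failed attempts before the first success) is introduced here for illustration.

```latex
\begin{align*}
P(N = n) &= F_k^{\,n}\,(1 - F_k), \qquad n = 0, 1, 2, \dots \\
E[N] &= \sum_{n=0}^{\infty} n\,F_k^{\,n}\,(1 - F_k) = \frac{F_k}{1 - F_k} \\
E(T_k) &= T_k + (M + R)\,E[N] = T_k + (M + R)\,\frac{F_k}{1 - F_k}
\end{align*}
```

For example, with F_k = 0.5 the job is expected to fail once before it succeeds, so one waste period and one repair period are added to T_k on average.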

Expected completion time: with scalability models

[Figure: two panels, Amdahl's Law and Gustafson's Law]

- The expected completion time decreases only up to a certain point; beyond it, the higher failure probability, which requires resubmission, outweighs the speedup

Reliability Aware Optimal K-Node allocation Algorithm

\[ R_k = \prod_{i=1}^{k} R_i(t_i + x \mid t_i) \]

\[ E(T_c(k)) = T_c(k) + (W + R)\left(\frac{F_k}{1 - F_k}\right) \]

- Optimal k-node algorithm:
  - Select k out of N nodes such that the expected completion time is minimal
- Each node has a different reliability, and increasing the number of nodes decreases reliability
- Increasing the number of nodes also decreases completion time
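The selection step can be sketched as an exhaustive search over k. This is an illustrative sketch, not the authors' implementation, and it rests on several assumptions that are not in the slides: node lifetimes follow a Weibull distribution, the run time on k nodes follows Amdahl's Law, the system failure probability is F_k = 1 - R_k, and the expected waste time W is half the run time. All names (`Node`, `run_time`, `optimal_k`) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    node_id: int
    age_hours: float    # t_i: time the node has already been up
    scale_hours: float  # Weibull scale (assumed lifetime model)
    shape: float        # Weibull shape (assumed lifetime model)

    def conditional_reliability(self, x: float) -> float:
        """R_i(t_i + x | t_i) = R_i(t_i + x) / R_i(t_i) for a Weibull lifetime."""
        t = self.age_hours
        return math.exp((t / self.scale_hours) ** self.shape
                        - ((t + x) / self.scale_hours) ** self.shape)


def run_time(single_node_hours: float, k: int, serial_fraction: float) -> float:
    """T_c(k) under Amdahl's Law (an assumed scalability model)."""
    return single_node_hours * (serial_fraction + (1.0 - serial_fraction) / k)


def optimal_k(nodes, single_node_hours, serial_fraction, repair_hours):
    """Return (k, expected_hours, node_ids) minimizing the expected completion time."""
    if not nodes:
        raise ValueError("at least one node is required")
    best = None
    for k in range(1, len(nodes) + 1):
        x = run_time(single_node_hours, k, serial_fraction)
        # For a fixed k the run length x is fixed, so the k most reliable
        # nodes maximize R_k and therefore minimize the expected time.
        ranked = sorted(nodes, key=lambda n: n.conditional_reliability(x), reverse=True)
        chosen = ranked[:k]
        r_k = math.prod(n.conditional_reliability(x) for n in chosen)
        f_k = 1.0 - r_k             # assumption: F_k = 1 - R_k
        waste = x / 2.0             # assumption: half the run is lost on failure
        expected = x + (waste + repair_hours) * f_k / r_k if r_k > 0.0 else math.inf
        if best is None or expected < best[1]:
            best = (k, expected, [n.node_id for n in chosen])
    return best
```

Sorting inside the loop is deliberate: with time-varying failure rates, the ranking of nodes can change with the run length x, which itself depends on k.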

Reliability-aware optimal k-node algorithm: an example case

[Figure: optimal k-node allocation]

- The figure shows how an optimal k is selected out of N nodes
- The algorithm selects k nodes such that the expected completion time is minimal

Simulation study

- Failure data
  - Used the failure properties of ASC White
  - ASC White:
    - LLNL system, 4-year failure data from 7/1/2000 to 10/1/2004
    - 512 nodes, 8192 processors
- Parallel job workloads
  - Generated a synthetic workload based on the distribution of job run-lengths and the distribution of the number of processors [Lublin et al 03]
  - Used the uniform-log distribution to generate the number of nodes, and a two-stage hyperexponential distribution to generate job run-lengths
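A workload generator along these lines could look like the sketch below. It only illustrates the two distribution families named above; the parameter values (`p_short`, the two means, the node range) are placeholders chosen here and are not the fitted values from the workload model or the ones used in the study.

```python
import math
import random


def sample_num_nodes(rng: random.Random, min_nodes: int, max_nodes: int) -> int:
    """Log-uniform job size: log2(size) is uniform between the two bounds."""
    log2_size = rng.uniform(math.log2(min_nodes), math.log2(max_nodes))
    return max(min_nodes, min(max_nodes, round(2 ** log2_size)))


def sample_run_length(rng: random.Random, p_short: float,
                      mean_short_hours: float, mean_long_hours: float) -> float:
    """Two-stage hyperexponential: pick one of two exponentials, then sample it."""
    mean = mean_short_hours if rng.random() < p_short else mean_long_hours
    return rng.expovariate(1.0 / mean)  # expovariate takes the rate, 1 / mean


def generate_workload(num_jobs: int, seed: int = 0):
    """Return a list of (num_nodes, run_length_hours) pairs (placeholder parameters)."""
    rng = random.Random(seed)
    return [
        (sample_num_nodes(rng, 1, 512), sample_run_length(rng, 0.8, 0.5, 24.0))
        for _ in range(num_jobs)
    ]
```

Passing an explicit seed keeps the workload reproducible, so different allocation policies can be compared on the same job stream.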

Simulation Framework

- Performance metrics
  - Job completion time
    - MCT (Mean Completion Time) = total completion time / unit job run-length
  - Waste time
    - MWT (Mean Waste Time) = total waste time / unit job run-length
    - where unit job run-length = job run-length / number of processors
  - Relative Percentage Difference (RPD)

Resource Allocation Algorithms

- RR (Round Robin)
  - Allocates the job to k adjacent nodes based on a rotation policy over node IDs
- All (select all available nodes)
  - Selects all the available nodes in the system
- RAS (select m nodes out of N, m fixed)
  - The number of processors m for a job is given by the user (fixed); the algorithm selects the m most reliable processors available for every job
- RA-Opt (select k nodes out of N, k not fixed)
  - Selects k nodes out of N such that the expected completion time is minimal
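The three baseline policies can be sketched as simple selection functions. This is a hypothetical illustration of the descriptions above, not the simulator's code; the function names and the dictionary of per-node reliabilities are assumptions.

```python
def round_robin(available_ids, k, start):
    """RR: k adjacent nodes starting at a rotating index, wrapping around."""
    n = len(available_ids)
    if k > n:
        raise ValueError("not enough available nodes")
    return [available_ids[(start + j) % n] for j in range(k)]


def select_all(available_ids):
    """All: every available node in the system."""
    return list(available_ids)


def most_reliable(reliability_by_id, m):
    """RAS: the m most reliable nodes, with m fixed by the user."""
    if m > len(reliability_by_id):
        raise ValueError("not enough available nodes")
    ranked = sorted(reliability_by_id, key=reliability_by_id.get, reverse=True)
    return ranked[:m]


print(round_robin([0, 1, 2, 3, 4], 3, 4))
print(most_reliable({0: 0.90, 1: 0.99, 2: 0.50}, 2))
```

RA-Opt differs from RAS in that the node count itself is a decision variable rather than a user input.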

Experimental Results (MCT)

- Mean Completion Time
  - MCT is lower for RA-Opt than for all other techniques
  - The RPD (Relative Percentage Difference) shows by what percentage RA-Opt outperforms each of the other techniques

Experimental Results (MWT)

- Mean Waste Time
  - MWT is lower for RA-Opt than for all other techniques
  - MWT is higher for All (select all available nodes) than for the other techniques

Experimental Results (MWT with different job run-lengths)

- The MWT increases as job run-lengths increase
  - Short and medium jobs do not fail very often, whereas longer jobs have a higher chance of failure
- RA-Opt has the lowest MWT of all the techniques, especially for very long jobs

Conclusions

- Several factors affect the completion time of a parallel program as the number of nodes is scaled up
  - Reliability becomes a major factor in deciding the optimal number of nodes to minimize completion time
- Developed a reliability-aware optimal k-node allocation algorithm
  - Based on the expected completion time function on k nodes
- Simulation results
  - Long jobs benefit especially from the reliability-aware optimal k-node allocation algorithm

Future Work

- The reliability-aware optimal k-node allocation can be combined with various scheduling algorithms
  - Investigate whether further improvement is possible
- Develop reliability-aware optimal k-node allocation for checkpointed jobs
  - Select the optimal k nodes by considering the checkpoint overhead
  - The importance of selecting the number of processors for checkpointing has been noted by Plank et al.

Future Work

- Different scalability models for expected completion times
- This work can be extended to malleable jobs
  - Jobs whose resource requirements change during runtime
- Reliability-aware resource management/allocation for time-sharing applications
  - More than one job is affected by a failure

References

[Plank et al 99] James S. Plank and Michael G. Thomason, "The Average Availability of Parallel Checkpointing Systems and Its Importance in Selecting Runtime Parameters," 29th International Symposium on Fault-Tolerant Computing, Madison, WI, June 1999, pp. 250-259.

[Kumar et al 91] V. Kumar and A. Gupta, "Analysis of scalability of parallel algorithms and architectures: a survey," in Proceedings of the 5th International Conference on Supercomputing (Cologne, West Germany, June 17-21, 1991), E. S. Davidson and F. Hossfield, Eds., ICS '91, ACM, New York, NY, pp. 396-405.

[Gottumukkala et al 07] Narasimha Raju Gottumukkala, Chokchai Leangsuksun, Raja Nassar, and Stephen L. Scott, "Reliability-Aware Resource Allocation in HPC Systems," Proceedings of the IEEE International Conference on Cluster Computing 2007, Austin, Texas.

[Lublin et al 03] U. Lublin and D. G. Feitelson, "The Workload on Parallel Supercomputers: Modeling the Characteristics of Rigid Jobs," Journal of Parallel & Distributed Computing, vol. 63, no. 11, pp. 1105-1122, November 2003.

The End! Thank you
