Performability Modeling for Scheduling and Fault Tolerance Strategies for Scientific Workflows
Lavanya Ramakrishnan and Daniel A. Reed
Indiana University, Bloomington, IN / Microsoft Research, Redmond, WA
Proceedings of HPDC 2008
Presenter: Sean
Outline
1. Introduction
2. Reliability Specification
3. Performability Analysis
4. Evaluation
Lavanya Ramakrishnan Brief CV
Ph.D. student, graduated; now at MCNC
Ph.D., Indiana University, 2008 (expected); advisor: Dennis Gannon. M.Sc., Indiana University, 2002. B.Sc., University of Mumbai, 2000.
Research interests:
Distributed systems including grid computing, high-performance computing and utility computing; workflow tools; resource management; monitoring and adaptation for performance and fault tolerance
Publications:
Lavanya Ramakrishnan and Daniel A. Reed. "Performability Modeling for Scheduling and Fault Tolerance Strategies for Scientific Workflows", HPDC 2008.
Lavanya Ramakrishnan, Laura Grit, Adriana Iamnitchi, David Irwin, Aydan Yumerefendi, and Jeff Chase. "Toward a Doctrine of Containment: Grid Hosting with Adaptive Resource Control", SC 2006.
Projects: Linked Environments for Atmospheric Discovery (LEAD), Virtual Grid Application Development Software (VGrADS), Open Resource Control Architecture (ORCA)
Daniel A. Reed Brief CV
Director of scalable and multicore computing at Microsoft Research (since Nov. 2007). Ph.D. in Computer Science, Purdue University. M.Sc. in Computer Science, Purdue University. B.Sc. in Computer Science, University of Missouri, Rolla.
Research interests:
Design of very high-speed computers, providing new computing capabilities for scholars in science, medicine, engineering and the humanities; tools and techniques for capturing and analyzing the performance of parallel systems; and collaborative virtual environments for real-time performance analysis. Two great forces are reshaping computing: multicore processors with unprecedented power and the explosive growth of software services hosted on megascale data centers.
Professional experience:
2005: The North Carolina General Assembly appropriated $5.9M in state FY06 and $11.8M in FY07 and beyond to expand the Renaissance Computing Institute (RENCI).
2005: The President's Information Technology Advisory Committee (PITAC) and its subcommittee on computational science, which he chaired, produced a report on the future of computational science, entitled "Computational Science: Ensuring America's Competitiveness."
2001: Reed led the effort to launch the National Science Foundation's TeraGrid, the world's largest, most comprehensive distributed cyberinfrastructure for open scientific research, and then served as TeraGrid chief architect through 2003.
INTRODUCTION
Outline
1. Introduction
2. Reliability Specification
3. Performability Analysis
4. Evaluation
INTRODUCTION
Core Concept: Performability
Performability: a composite measure of a system's performance and its dependability
Performance: the "quality of service (QoS), provided the system is correct"
Dependability: an all-encompassing term for reliability, availability, safety and security
INTRODUCTION
Problem statement
Grid/cloud computing needs to be degradable
Resource availability varies significantly: hardware + software
Performance (QoS) fluctuates with changes in resource availability
Degradable: a resource is not limited to two states, "fully operational" or "failed"
How to be degradable
Resource providers: provide an assured level of service under a cost model
Software: provide an interface for users to express their performance and reliability requirements
Execution models: the characteristics of program execution need to be understood
Approach: use performability to model and analyze the effect of resource reliability on application performance
INTRODUCTION
Background
Detailed approach:
Allow users to express the availability requirement, based on the existing Virtual Grids framework in the VGrADS project
Understand the applications' reliability requirements: three common programming models
Background excerpt ("Implementing Virtual Grids", VGrADS): each class of virtual grid (e.g. a bag, a cluster) may have a different specialized implementation, but these implementations share a set of technologies including scheduling, performance monitoring, information services, resource selection, checkpointing, etc. The virtual grid vision is realized as part of the Virtual Grid Execution System (vgES).
[Figure 1. vgES overall architecture: Application, vgDL, vgES APIs, vgFAB, vgMON, vgLAUNCH, Information Services, Resource Managers]
INTRODUCTION
Three Common Programming Models
Figure 1: Three common programming models (a) Master Worker (b) Divide and Conquer (c) SPMD
INTRODUCTION
Resource Description in vgDL
Description of vgDL: BNF grammar for Redline
The BNF grammars for Redline and vgDL are given in Figures 2-1 and 2-2, described hereafter.
Redline expression ::= Identifier "=" Arithmetic_expr | Logic_expr | Predicate
Arithmetic_expr ::= A_operand [A_op A_operand]*
A_operand ::= Integer | Real
A_op ::= "+" | "-" | "*" | "/" | "^"
Logic_expr ::= L_operand [L_op L_operand]*
L_operand ::= Integer | Real | Boolean | …
Figure 2-1. BNF grammar for Redline
INTRODUCTION
Description of vgDL: BNF for vgDL (from "Virtual Grids: Resource Abstractions for Grid Applications", 8/9/2004)
Vgrid ::= Identifier "=" Rdl-expression [ at time/event ]
Rdl-expression ::= Rdl-subexpression | [ "(" Rdl-expression ")" op "(" Rdl-expression ")" ]*
Rdl-subexpression ::= Associator-expression | Node-expression
Associator-expression ::= Bag-of-expression | Cluster-of-expression
Bag-of-expression ::= LooseBagof "" "[" MinNode ":" MaxNode "]" [ "[" Number [ "su" | "sec" ] "]" ] ";" Node-expression | TightBagof "" "[" MinNode ":" MaxNode "]" [ "[" Number [ "su" | "sec" ] "]" ] ";" Node-expression
Identifier ::= String
Min ::= Integer
Max ::= Integer
Node-expression ::= Identifier "=" Node-constraint
Node-constraint ::= "{" Attribute-constraint | Rdl-expression "}" | Rdl-expression
Attribute-constraint ::= Redline expression for attribute and constraint [see Figure 3-2]
Cluster-of-expression ::= Clusterof "" "[" MinNode ":" MaxNode [ "," MaxTime ":" MinTime ] "]" ";" Node-expression
op ::= close | far | highBW | lowBW
Figure 2-2. BNF for Virtual Grid Description Language (vgDL)
INTRODUCTION
Example 1: mpiBLAST (vgDL)
BLAST follows the master-worker execution model. Consider an mpiBLAST resource request for a master node connected to a set of worker nodes, each with at least 4 GB of memory. In the virtual grid description language (vgDL), this would be specified as follows:
mpiBLAST1 = MasterNode={memory 4GB, disk > 20GB} highBW LooseBagOf [4:32]; WorkerNode={memory >= 4GB}
One fault tolerance strategy might require the network link between the master and the worker to have "good" reliability (section 3). The modified vgDL might look like the following:
mpiBLAST2 = MasterNode={memory 4GB, disk > 20GB} (goodReliability AND highBW) LooseBagOf [4:32]; WorkerNode={memory >= 4GB}
In addition to the network being reliable, the request could also specify that the master node be highly reliable:
mpiBLAST3 = HighReliabilityBag= {memory 4GB, disk > 20GB} (goodReliability AND highBW) LooseBagOf [4:32]; WorkerNode={memory >= 4GB}; MasterNode={memory 4GB, disk > 20GB}
INTRODUCTION
Example 2: Weather Research and Forecasting (WRF) Model
The Weather Research and Forecasting (WRF) model [17] is a mesoscale numerical weather prediction system. The WRF model is an SPMD computation where geographic regions are modeled in parallel. For a simple WRF execution, the request might be for a cluster with 8 to 32 nodes, each with at least 4 GB of memory:
wrf1 = WRFBag = TightBagOf [8:32]; CNode = {memory>=4GB}
We might require all the nodes and the network connecting them to be highly reliable since this is an SPMD computation. A modified request is shown below to request a HighReliabilityBag:
wrf2 = WRFBag = HighReliabilityBag [1:1]; ManyNodes = TightBagOf [8:32]; CNode = {memory>=4GB}
From these examples we see that applications can have varied reliability requirements based on their characteristics. Workflow planning components need higher-level interfaces to describe collective qualitative reliability requirements in the resource selection process. These requirements are based on application characteristics and other real-time constraints such as deadlines or budget. These user-specified …
RELIABILITY SPECIFICATION
Outline
1. Introduction
2. Reliability Specification
3. Performability Analysis
4. Evaluation
RELIABILITY SPECIFICATION
Reliability Specification for vgDL Extension
Quantitative Reliability Levels: Excellent (90-100%), Good (80-89%), Satisfactory (70-79%), Fair (60-69%), Poor (0-59%)
Node: HighReliabilityBag, GoodReliabilityBag, MediumReliabilityBag, LowReliabilityBag, PoorReliabilityBag
Link: highReliability, goodReliability, mediumReliability, lowReliability, poorReliability
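To make the mapping concrete, here is a minimal Python sketch, not taken from the paper, that encodes the quantitative levels above as lower bounds; the LEVELS table and the satisfies helper are hypothetical names used only for illustration.

```python
# Minimal sketch (not from the paper): map the qualitative vgDL reliability
# levels to the quantitative lower bounds listed on this slide.
LEVELS = {              # level -> minimum reliability (fraction of time)
    "Excellent": 0.90,
    "Good": 0.80,
    "Satisfactory": 0.70,
    "Fair": 0.60,
    "Poor": 0.00,
}

def satisfies(measured_reliability: float, requested_level: str) -> bool:
    """Return True if a resource's measured reliability meets the requested level."""
    return measured_reliability >= LEVELS[requested_level]

print(satisfies(0.85, "Good"))        # True  (0.85 >= 0.80)
print(satisfies(0.85, "Excellent"))   # False (0.85 <  0.90)
```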
PERFORMABILITY ANALYSIS
Outline
1. Introduction
2. Reliability Specification
3. Performability Analysis
4. Evaluation
PERFORMABILITY ANALYSIS
Example System
The most commonly used performability model today is the Markov Reward Model (MRM). To illustrate this technique applied to a cyclic case, a 3-CPU multi-processor system is used that begins running in fully operational mode. Jobs arrive at the buffer and are stored until a processor (CPU) becomes available; the job at the head of the buffer is then sent to that CPU to be processed. In this manner jobs are shared equally between the three processors.
Fig 1: Model of the multi-processor system
Some assumptions in our model should be noted. It is assumed that no more than one processor can fail at a time, i.e., there are no simultaneous failures of CPUs. This is described by the transition arrows (only one possible transition to and from a state). Another assumption is that the buffers are ultra-reliable, so buffer failure is not considered, although such a failure might result in a complete system breakdown. There are no limits on buffer size.
PERFORMABILITY ANALYSIS
Markov Reward Model
The behaviour model and the reward model together describe the MRM:
Fig 2: The Markov Reward Model
In Figure 2, you see that there are four states describing the system. These are:
1 : 3 processors up, 0 processors down
2 : 2 processors up, 1 processor down
3 : 1 processor up, 2 processors down
4 : 0 processors up, 3 processors down
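To show how such a four-state MRM is evaluated in practice, the sketch below builds the generator matrix for the 3-CPU example, solves for the steady-state probabilities, and weights them by per-state reward rates. The failure rate, repair rate, and the 10 jobs/hour reward are assumed values chosen only for illustration.

```python
import numpy as np

# States: number of working CPUs (3, 2, 1, 0); assumed rates, per hour.
lam = 0.001   # per-CPU failure rate (assumption)
mu = 0.1      # repair rate, one repair at a time (assumption)

states = [3, 2, 1, 0]
n = len(states)
Q = np.zeros((n, n))          # CTMC generator matrix
for i, k in enumerate(states):
    if k > 0:                 # one of the k working CPUs fails
        Q[i, i + 1] += k * lam
    if k < 3:                 # one failed CPU is repaired
        Q[i, i - 1] += mu
    Q[i, i] = -Q[i].sum()

# Steady state: solve pi @ Q = 0 together with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(n)])
b = np.zeros(n + 1)
b[-1] = 1.0
pi = np.linalg.lstsq(A, b, rcond=None)[0]

# Reward rate per state: assumed 10 jobs/hour per working CPU.
rewards = np.array([10.0 * k for k in states])
print("steady-state probabilities:", np.round(pi, 6))
print("expected reward rate E[Z]:", float(pi @ rewards))
```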
PERFORMABILITY ANALYSIS
Accumulated Reward Y(t)
Fig 3: Sample paths of the Z(t), X(t) and Y(t) processes
PERFORMABILITY ANALYSIS
The Probability Distribution Function of Y(t)
Fig 4: The Probability Distribution Function of Y(t)
PERFORMABILITY ANALYSIS
Definition of Performability
Performability is defined as "the probability that a system reaches an accomplishment level y over a utilization interval (0, t)": y(x, t) = Prob[Y(t) ≥ x]
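The definition can be illustrated with a small simulation. The following sketch, which is my illustration rather than the paper's method, simulates the 3-CPU example as a continuous-time Markov chain, accumulates the reward Y(t), and estimates Prob[Y(t) ≥ y] by Monte Carlo; the rates and the per-CPU reward are assumed.

```python
import random

# Assumed parameters for the 3-CPU example (illustration only).
LAM, MU = 0.001, 0.1      # per-CPU failure rate and repair rate (per hour)
REWARD_PER_CPU = 10.0     # jobs/hour contributed by each working CPU

def accumulated_reward(t_end: float) -> float:
    """Simulate the CTMC up to t_end and return Y(t_end), the accumulated reward."""
    t, k, y = 0.0, 3, 0.0                # start fully operational (3 CPUs up)
    while t < t_end:
        fail_rate = k * LAM
        repair_rate = MU if k < 3 else 0.0
        total = fail_rate + repair_rate
        dwell = min(random.expovariate(total), t_end - t)
        y += k * REWARD_PER_CPU * dwell  # reward accrues at the current state's rate
        t += dwell
        if t >= t_end:
            break
        if random.random() < fail_rate / total:
            k -= 1                       # a CPU fails
        else:
            k += 1                       # a CPU is repaired
    return y

def performability(y_level: float, t_end: float, runs: int = 10_000) -> float:
    """Monte Carlo estimate of Prob[Y(t_end) >= y_level]."""
    hits = sum(accumulated_reward(t_end) >= y_level for _ in range(runs))
    return hits / runs

print(performability(y_level=290.0, t_end=10.0))   # e.g. at least 290 jobs in 10 hours
```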
Figure 2: Markov chain for the resource performance and reliability states — High, Good, Medium, Low, Poor, and Fail, with λ-labeled transitions between successive states
PERFORMABILITY ANALYSIS
Resource State Reliability Model
MTBF = MTTF + MTTR, λ = MTTF^−1, μ = MTTR^−1
Steady-state probability of occupancy in each state: π_n = ρ^n · π_0, with π_0 = 1 − ρ and ρ = λ/μ (the failure-to-repair ratio)
Normally ρ < 1; otherwise the system drifts toward complete failure
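A minimal sketch of these steady-state expressions, assuming hypothetical MTTF/MTTR values and interpreting the leftover probability mass as the Fail state (an assumption the slides do not spell out):

```python
# Sketch of the resource-state reliability model: pi_n = rho**n * pi_0, pi_0 = 1 - rho.
mttf = 100.0           # mean time to failure (assumed, hours)
mttr = 10.0            # mean time to repair (assumed, hours)
lam = 1.0 / mttf       # failure rate
mu = 1.0 / mttr        # repair rate
rho = lam / mu         # failure-to-repair ratio; must be < 1 for a stable system

states = ["High", "Good", "Medium", "Low", "Poor"]
pi0 = 1.0 - rho
pis = [pi0 * rho**n for n in range(len(states))]
for state, p in zip(states, pis):
    print(f"{state:<7} {p:.5f}")
print(f"Fail    {1.0 - sum(pis):.5f}   (leftover probability mass, assumption)")
```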
PERFORMABILITY ANALYSIS
Performability Modeling
T: the running time on a highly available resource
Running time in the other (degraded) states: T + n_i·x, for i = 1, 2, 3, 4 (the Fail state is not counted)
Reward rate: the inverse of the running time, r_i = 1/(T + n_i·x)
Performability is measured as the expected reward rate over a specified time interval: E[Z(t)] = Σ_i r_i·π_i(t)
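The slide's model can be sketched in a few lines of Python. The treatment of the High state (reward 1/T) and of the Fail state (zero reward, receiving the leftover probability mass) is my reading of the model, chosen because it reproduces the Table 1 values on the next slide.

```python
def expected_reward_rate(T: float, rho: float, x: float,
                         degradations=(1, 2, 3, 4)) -> float:
    """E[Z] = sum_i r_i * pi_i over the High, four degraded, and Fail states.

    T   : running time on the highly available resource (minutes)
    rho : failure-to-repair ratio (lambda / mu)
    x   : per-level performance penalty added to the running time
    High has reward 1/T, degraded state i has reward 1/(T + n_i * x),
    Fail has reward 0 and absorbs the leftover probability mass.
    """
    pi0 = 1.0 - rho
    rewards = [1.0 / T] + [1.0 / (T + n * x) for n in degradations]
    pis = [pi0 * rho**i for i in range(len(rewards))]
    return sum(r * p for r, p in zip(rewards, pis))

# Machine A from Table 1: T = 30 min, rho = 0.1.
print(round(expected_reward_rate(T=30, rho=0.1, x=2), 3))    # ~0.033
print(round(expected_reward_rate(T=30, rho=0.1, x=100), 3))  # ~0.031
```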
PERFORMABILITY ANALYSIS
Performability Example
Parameter                      Machine A   Machine B   Machine C   Machine D
Application running time T     30 min      30 min      25 min      15 min
Failure-to-repair ratio ρ      0.1         0.4         0.4         0.6
Perform. x=2                   0.033       0.032       0.038       0.055
Perform. x=100                 0.031       0.022       0.027       0.029

Table 1: Performability for different performance model numbers and reliability characteristics, where n1 = 1, n2 = 2, n3 = 3, n4 = 4
PERFORMABILITY ANALYSIS
Performability of Different Programming Models
Master-worker applications: E(M−W) = Min(E_Master, E_Worker, E_Network), when T_Master >> T_Worker and T_Master >> T_Network
Divide and conquer: the performability of the root (the tree root runs longest)
SPMD: E_SPMD = Min(E_system components)
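A small sketch of how these composition rules might be applied, using hypothetical per-component performability values:

```python
def master_worker(e_master: float, e_worker: float, e_network: float) -> float:
    """E(M-W) = min(E_master, E_worker, E_network), assuming the master dominates."""
    return min(e_master, e_worker, e_network)

def divide_and_conquer(e_root: float) -> float:
    """Approximated by the performability of the root, which runs longest."""
    return e_root

def spmd(component_performabilities) -> float:
    """E(SPMD) = min over all system components."""
    return min(component_performabilities)

# Hypothetical per-component expected reward rates:
print(master_worker(0.033, 0.038, 0.055))       # master-worker
print(divide_and_conquer(0.033))                # divide and conquer
print(spmd([0.033, 0.032, 0.038, 0.055]))       # SPMD
```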
PERFORMABILITY ANALYSIS
Workflow Planning for Performability
Workflow scheduling can be based on the projected application running time: T_projected = 1/E[Z] (computation)
Follow the same performability modeling procedure to obtain the network performability
Based on the computation performability and the network performability, apply a traditional scheduling algorithm to the workflow, as sketched below
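As a sketch of the planning step, assuming the E[Z] values come from the performability model above (the machine names and numbers are hypothetical):

```python
# Rank candidate machines for one workflow task by projected runtime
# T_projected = 1 / E[Z]; the machine names and E[Z] values are hypothetical.
candidates = {"A": 0.033, "B": 0.032, "C": 0.038, "D": 0.055}   # E[Z], per minute

projected = {machine: 1.0 / ez for machine, ez in candidates.items()}
for machine, t in sorted(projected.items(), key=lambda kv: kv[1]):
    print(f"{machine}: projected runtime {t:.1f} min")
# A conventional workflow scheduler can then consume these projected times
# (plus the analogous network performability) in place of raw benchmark numbers.
```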
PERFORMABILITY ANALYSIS
Fault Tolerance Strategies
Two common strategies: replication (good performance and reliability, but high cost) and checkpoint-restart (good reliability but lower performance)
Cost of replication: C_R = T_projected × n, where n is the number of replicas
Cost of checkpoint-restart: C_CR = C_checkpoint + C_restart−on−failure, where C_checkpoint = C_per−checkpoint × T_projected / T_interval and T_interval is the optimal checkpoint interval chosen to meet the performability level
The strategy is selected by comparing C_R and C_CR, as sketched below
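A hedged sketch of the cost comparison, with all the numeric inputs (projected runtime, replica count, checkpoint overhead, interval, and restart cost) being hypothetical placeholders:

```python
def replication_cost(t_projected: float, n_replicas: int) -> float:
    """C_R = T_projected * n (total resource time consumed by n replicas)."""
    return t_projected * n_replicas

def checkpoint_restart_cost(t_projected: float, c_per_checkpoint: float,
                            t_interval: float, c_restart_on_failure: float) -> float:
    """C_CR = C_checkpoint + C_restart-on-failure,
    with C_checkpoint = C_per_checkpoint * T_projected / T_interval."""
    return c_per_checkpoint * t_projected / t_interval + c_restart_on_failure

# Hypothetical inputs: 100-minute projected runtime, 2 replicas, 1-minute
# checkpoint overhead every 20 minutes, 15-minute expected restart cost.
cr = replication_cost(100.0, 2)
ccr = checkpoint_restart_cost(100.0, 1.0, 20.0, 15.0)
print("replication cost:", cr, " checkpoint-restart cost:", ccr)
print("cheaper strategy:", "replication" if cr < ccr else "checkpoint-restart")
```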