IEICE TRANS. INF. & SYST., VOL.E89–D, NO.2 FEBRUARY 2006
PAPER
Special Section on Parallel/Distributed Computing and Networking
DRIC: Dependable Grid Computing Framework

Hai JIN†a), Xuanhua SHI†, Weizhong QIANG†, and Deqing ZOU†, Nonmembers
SUMMARY  Grid computing represents a new trend in distributed and Internet computing that coordinates large-scale resource sharing and problem solving in dynamic, multi-institutional virtual organizations. Due to the diverse failures and error conditions in grid environments, developing, deploying, and executing applications over the grid is a challenge; thus dependability is a key factor for grid computing. This paper presents a dependable grid computing framework, called DRIC, to provide an adaptive failure detection service and a policy-based failure handling mechanism. The failure detection service in DRIC adapts to users' QoS requirements and system conditions, and the failure-handling mechanism is optimized by a policy engine using a decision-making method. The performance evaluation results show that this framework is scalable, highly efficient, and low overhead.
key words: dependable computing, grid computing, failure detection, fault tolerance
Manuscript received April 1, 2005.
Manuscript revised August 16, 2005.
† The authors are with the Cluster and Grid Computing Lab., Huazhong University of Science and Technology, Wuhan, China.
a) E-mail: [email protected]
DOI: 10.1093/ietisy/e89–d.2.612

1. Introduction

Grid computing represents a new trend in distributed and Internet computing that coordinates large-scale sharing of heterogeneous resources and problem solving in dynamic, multi-institutional virtual organizations [15]. The computing resources are highly heterogeneous, ranging from single PCs and workstations, through clusters of workstations, to supercomputers. With grid technologies, it is possible to construct large-scale applications over grid environments. Grid technologies are now evolving towards an Open Grid Services Architecture (OGSA), in which a grid provides an extensible set of services that virtual organizations can aggregate in various ways [14]. Under this philosophy, resources and services can be accessed through standard interfaces by which they are defined, described, registered, discovered, and executed. The purpose of grid computing is to eliminate resource islands and to make computing and services ubiquitous. However, there are many challenges in constructing dependable grid services, for example: (1) failure of a power supply leading to power loss in one part of the distributed system; (2) physical damage to the grid computing fabric as a result of natural events or human acts; and (3) failure of system or application software leading to the loss of services. Due to the diverse failures and error conditions in grid environments, developing, deploying, and executing applications over the grid is a challenge. Dependability
is a key factor for grid computing. Constructing dependable grid services requires an efficient failure detection approach and a systematic failure handling mechanism.

There are many QoS metrics for failure detection services (also called failure detectors), such as detection time and mistake rate. There are two kinds of completeness properties and four kinds of accuracy properties for failure detectors [6], [8]. In grid systems, we focus on the Eventually Perfect detector, that is, the ♦P detector, which requires strong completeness and eventual strong accuracy. The grid places additional requirements on a failure detection service because of its wide-area distributed nature [9], [22]. A failure detection service in the grid must address the following issues:
• Message Explosion. There are a large number of objects that need to be monitored. The failure detection service should prevent monitoring messages from flooding or overloading the network.
• Scalability. Scalability has two aspects: one is that a grid application running over a grid requires a large number of resources and the failure detection service must monitor them efficiently; the other is that when the number of monitored objects increases, the failure detection service still works efficiently.
• Flexibility. The failure detection service should adapt to different grid applications and to the dynamism of the grid.
• Tolerance to Message Loss. The failure detection service must be able to work efficiently when the network condition is not good.

There are different types of failures in grid systems and grid applications due to the diverse nature of grid components and applications. The existing failure handling techniques in distributed systems, parallel systems, and even grid systems address failure handling with a single scheme, which cannot handle failures with different semantics in grids. The failure handling method in grids should address the following requirements:
• Support for diverse failure handling strategies. This is driven by the heterogeneous nature of the grid context, such as heterogeneous tasks and heterogeneous execution environments.
• Separation of failure handling policies from application code. This is driven by the dynamic nature of the grid system. The separation of policies from the application code provides a high-level handling method without
Copyright © 2006 The Institute of Electronics, Information and Communication Engineers
requiring any awareness of the grid resources and the applications.
• Task-specific failure handling support. This is driven by the diversity of grid applications and grid resources. The user should be able to specify an appropriate failure handling method for performance and cost considerations.

In this paper, we present a dependable grid computing framework, called DRIC, to provide an efficient failure detection service and a systematic failure handling method. The failure detection service in DRIC is a fundamental component, which detects task failures based on timeouts and can configure the failure detector parameters (such as the message sending interval and the timeout) according to user requirements and system conditions. The failure detection services are organized hierarchically, and DRIC can tune the hierarchical topology automatically according to system changes. The failure handling method is based on a decision-making approach, which analyzes QoS metrics for the different failure handling policies and establishes an optimized failure handling policy for different tasks. DRIC integrates different failure-recovery techniques. Built on the efficient failure detection service, the failure handling method addresses the unique requirements for fault tolerance in grids.

The rest of this paper is organized as follows. In Sect. 2, related work on failure detection and fault tolerance is reviewed. We present an overview of DRIC in Sect. 3. In Sect. 4 and Sect. 5, we present the key models of the failure detection service and the failure handling method. The system performance is evaluated in Sect. 6. Finally, we conclude this paper and outline future work in Sect. 7.

2. Related Work

Much research has been done on dependable computing in distributed, parallel, and grid systems. In this section, we present a short review of the related techniques and models.

Chandra and Toueg [5] provided the first formal specification of unreliable failure detectors for reliable distributed systems. Chen et al. [6] presented an adaptive failure detector which can adapt to changing conditions by reconfiguring itself dynamically, but this approach only reconfigured some system parameters; the system topology could not be reconfigured. Another kind of adaptive failure detection protocol was lazy failure detection [13]; the lazy protocol depended highly on the communication patterns between the application processes and may perform poorly for some specific applications. Sergent et al. [1] analyzed several implementations of failure detectors in the context of a LAN and proposed to customize the implementation of the failure detector with respect to the communication pattern of particular algorithms. This approach had the same problem as the lazy failure detection protocol; besides, it was difficult to adapt to the grid context. Stelling et al. [33] proposed a failure detection service
for the Globus Toolkit, namely the Globus Heartbeat Monitor (HBM). R-GMA [10] presented a general monitoring architecture with good scalability and flexibility, but its failure detection time was large, especially when the network condition was not good. Felber et al. [12] proposed a failure detection service with a hierarchical structure; however, the hierarchical structure had weak flexibility for addressing grid dynamism. Van Renesse et al. [32] distinguished two variations of a gossip-style failure detection service based on random gossiping, namely basic gossiping and multi-level gossiping. The gossip style had good system scalability and flexibility, especially with multi-level gossiping. However, this approach did not work well if a large percentage of members crashed or became partitioned away.

Kumar et al. [26] presented a definition of the reliability of distributed programs and distributed systems, and there are many research works based on this model [25], [29]. In [11], reliability analysis for grid computing was defined as the probability of successful execution of a given program running on multiple nodes and exchanging information with the remote resources of other nodes. But grid computing applications include more than this, such as file transfer, grid workflows, and grid service invocation. In [34], the reliability of grid computing was defined as the probability of successful running of a given task.

Much research has also been done on failure recovery mechanisms in distributed, parallel, and grid systems. The main focus of those works is on the provision of a single failure recovery mechanism targeting their system-specific domains. In traditional distributed systems, transaction and replication mechanisms are often incorporated (e.g., OLTP [21], Ficus [31]). More techniques are used in parallel systems (e.g., PVM [18], DOME [2]). Some research works have been done in grid systems (e.g., retrying in NetSolve [4] and Condor-G [16], replication in Mentat [19], a hardcoded method in CoG Kits [28], and checkpointing in Legion [30] and DataGrid [20]). In [23], a workflow-based failure handling framework was presented, which targeted generic, heterogeneous, and dynamic grid environments. However, this method did not address the policy of failure handling.

Different from the research mentioned above, the DRIC framework presented in this paper provides an adaptive failure detector which can reconfigure both system parameters and system topology. Based on this failure detector, DRIC also provides a policy-based failure handling mechanism that chooses the appropriate failure recovery method to fulfill task-specific requirements.

3. System Overview

DRIC is designed to provide a software infrastructure that enables the construction of dependable grid applications by dynamically coupling services located in different sites, based on failure detection and failure handling. As shown in Fig. 1, DRIC includes four main components:
Fig. 1  Overview of DRIC.
Fig. 2  System architecture overview of failure detection service.
the application/user interface, resource management, the failure detector, and failure handling. The user interface includes an application QoS requirement interface and a user policy-definition interface; with these interfaces, the user can tell the system the required QoS for a particular application and specify the failure handling policies. The resource management component includes a service broker, a data broker, an information center, and a data center; the resources include computing power, data, network, storage, and information resources. The failure handling component includes a policy maker, a policy executor, and failure-recovery techniques. The policy maker makes the failure-handling policy based on a decision-making method, and the policy executor is an engine that carries out the specific techniques required by the policy. The failure-recovery techniques component is an integration of different failure-recovery techniques, which fulfils the failure handling policy and carries out the failure recovery for tasks. The failure detector component is an adaptive service which monitors the status of the services running over the grid and notifies the failure handling components to take action on such failures. As shown in Fig. 1, the failure detection service is based on heartbeats, using the "push" method.

4. Failure Detection

The failure detection service is able to adapt to system conditions and users' QoS requirements.

4.1 Architecture of Failure Detection Service

The failure detection service is organized in a hierarchical way, as shown in Fig. 2. The system architecture has two levels: local groups and global groups. In each local group there is a unique group leader. Failure detectors in local groups monitor the objects in their local space; the monitored objects in one local space may be in one LAN or span several LANs, but the network condition within one space should be good. Failure detectors in global groups monitor the global objects by monitoring the detectors in the local groups.
Thus, in this architecture there are two different types of failure detectors: simple failure detectors and group leaders. The monitored objects periodically send "I'm alive" messages to the failure detectors in their local groups, while the message a group leader sends is a list containing the monitored objects and their status rather than a plain "I'm alive". The group leaders in the global groups share failure detection status using an epidemic method [17].

For simplicity of management, we incorporate the idea of R-GMA into the system design, as shown in Fig. 2. There is an index service in the system implementation, which works as a directory registry of the global-group and local-group failure detectors. Besides, the index service provides some decision-making capability for the organization of the group leaders. As in R-GMA, there are three components or roles in the failure detection system: consumers, producers, and an index service. A consumer that wants to monitor specific objects in the grid first queries the index service and obtains the location of the producer; here the failure detector is the producer that gives the status of the monitored objects. As discussed above, the consumer and the failure detector should be located in the same LAN or have a good network connection between them. Generally, the failure detector contacted by a consumer is one of the group leaders.

4.2 Adaptive Model

The key idea behind failure detection in DRIC is an adaptive model. As mentioned above, a grid failure detection service should address two issues: one is how to satisfy the QoS requirement between two processes, and the other is how to cope with the dynamic nature of the grid, that is, how to organize the hierarchical structure. The QoS between the monitored process and the detector can be represented as a tuple $(T_{DE}^U, T_{MR}^L, T_M^U)$, where $T_{DE}^U$ is an upper bound on the detection time ($T_{DE}$), $T_{MR}^L$ is a lower bound on the average mistake recurrence time ($T_{MR}$), and $T_M^U$ is an upper bound on the average mistake duration ($T_M$). The QoS requirement can be expressed as in Eq. (1):

$$T_{DE} \le T_{DE}^U, \quad T_{MR} \ge T_{MR}^L, \quad T_M \le T_M^U \qquad (1)$$
It is clear that the heartbeat interval $\Delta_{interval}$ is a very important factor that contributes to the detection time:
$$T_{DE}^U \ge \Delta_{interval} + \Delta_{tr} \qquad (2)$$
where $\Delta_{tr}$ is a safety margin. We use a recursive method to obtain $\Delta_{interval}$, as depicted in algorithm 1, adopted from [6]. After obtaining $\Delta_{interval}$, a sliding-window algorithm is used to record the message behaviors, so that $\Delta_{interval}$ changes adaptively with the system conditions. Another issue for failure detection is how to set $\Delta_{timeout}$ and when to suspect a failure. We use an estimate of the arrival time of the next heartbeat message to suspect a failure. The details are presented in algorithm 1.

Algorithm 1:
Assumption: The inter-arrival times of the "I'm alive" messages follow a Gaussian distribution whose parameters are estimated from a sample window. The probability of a given message arriving after more than $t$ time units is given in Eq. (3):

$$p(t) = \frac{1}{\sigma\sqrt{2\pi}} \int_t^{+\infty} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx \qquad (3)$$

Step 1: [Find $\Delta_{interval}^{max}$] Compute

$$\gamma = \frac{(1 - P_L)(T_{DE}^U)^2}{\Delta_{tr} + (T_{DE}^U)^2} \qquad (4)$$

$$\Delta_{interval}^{max} = \max(\gamma T_M^U, T_{DE}^U) \qquad (5)$$
If $\Delta_{interval}^{max} = 0$, the QoS cannot be achieved. $P_L$ is the probability of message loss and can be simply computed as in Eq. (6):

$$P_L = \frac{C_{To} - C_{Re}}{C_{To}} \qquad (6)$$
$C_{To}$ is the count of total messages sent in a sample window, and $C_{Re}$ is the count of messages received in the sample window.

Step 2: [Get $\Delta_{interval}$] Let

$$f(\Delta_{interval}) = \Delta_{interval} \sum_{j=1}^{T_{DE}^U/\Delta_{interval}} \frac{\Delta_{tr} + (T_{DE}^U - j\Delta_{interval})^2}{\Delta_{tr} + P_L (T_{DE}^U - j\Delta_{interval})^2} \qquad (7)$$
and find the largest $\Delta_{interval}$ that is less than $\Delta_{interval}^{max}$ and satisfies $f(\Delta_{interval}) \ge T_{MR}^L$. Such a $\Delta_{interval}$ always exists.

Step 3: [Estimate the arrival time of the next heartbeat message] From the assumption and Eq. (3), we obtain $\mu$ and $\sigma$ by statistical estimation over the arrival times of the heartbeat message samples in the sliding window. The next message arrival time is estimated as

$$ET_{n+1} = W_s \mu + \sigma \qquad (8)$$

where $W_s$ is the window size of heartbeat messages.

Step 4: [Get the freshness point $\tau_{n+1}$] Compute the freshness point:
$$\tau_{n+1} = ET_{n+1} + T_{DE}^U - \Delta_{interval} \qquad (9)$$
If no heartbeat message is received by time $\tau_{n+1}$, the monitored process is suspected, that is:

$$\Delta_{timeout} = \tau_{n+1} - T_{now} \qquad (10)$$
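To make the estimation steps of algorithm 1 concrete, the following is a minimal sketch, assuming a fixed-size sliding window of heartbeat arrival times and taking Eq. (8) literally (the estimate is measured from the oldest sample in the window, which is an assumption). The class and method names are illustrative only; this is not the actual DRIC code, and the selection of $\Delta_{interval}$ via Eqs. (4)–(7) is left out, the interval simply being passed in as a parameter.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Minimal sketch of the timeout estimation of algorithm 1 (Sect. 4.2).
 * Names are illustrative assumptions, not the DRIC implementation.
 * All times are in milliseconds.
 */
public class AdaptiveTimeoutEstimator {

    private final int windowSize;        // W_s: number of heartbeat samples kept
    private final double tDeUpper;       // T_DE^U: upper bound of the detection time
    private final double deltaInterval;  // Delta_interval: heartbeat sending interval
    private final Deque<Double> arrivals = new ArrayDeque<>(); // sliding window of arrival times

    public AdaptiveTimeoutEstimator(int windowSize, double tDeUpper, double deltaInterval) {
        this.windowSize = windowSize;
        this.tDeUpper = tDeUpper;
        this.deltaInterval = deltaInterval;
    }

    /** Record the arrival time of one "I'm alive" message. */
    public void onHeartbeat(double arrivalTime) {
        arrivals.addLast(arrivalTime);
        if (arrivals.size() > windowSize) {
            arrivals.removeFirst();
        }
    }

    /** Eq. (6): message-loss probability P_L over the sample window. */
    public static double messageLossProbability(long totalSent, long received) {
        return totalSent == 0 ? 0.0 : (double) (totalSent - received) / totalSent;
    }

    /**
     * Eqs. (8)-(10): estimate the next heartbeat arrival from the sample mean
     * and standard deviation of the window (Eq. (8) taken literally, measured
     * from the oldest sample -- an assumption), derive the freshness point
     * tau_{n+1} (Eq. (9)), and return the remaining timeout (Eq. (10)).
     */
    public double remainingTimeout(double now) {
        if (arrivals.size() < 2) {
            return tDeUpper;                                   // not enough samples yet
        }
        double[] gaps = interArrivalTimes();
        double mu = mean(gaps);
        double sigma = stdDev(gaps, mu);
        double oldest = arrivals.peekFirst();
        double etNext = oldest + arrivals.size() * mu + sigma; // Eq. (8): ET_{n+1} = W_s*mu + sigma
        double freshness = etNext + tDeUpper - deltaInterval;  // Eq. (9): tau_{n+1}
        return freshness - now;                                // Eq. (10): suspect when this expires
    }

    private double[] interArrivalTimes() {
        Double[] a = arrivals.toArray(new Double[0]);
        double[] gaps = new double[a.length - 1];
        for (int i = 1; i < a.length; i++) {
            gaps[i - 1] = a[i] - a[i - 1];
        }
        return gaps;
    }

    private static double mean(double[] x) {
        double s = 0;
        for (double v : x) s += v;
        return s / x.length;
    }

    private static double stdDev(double[] x, double mu) {
        double s = 0;
        for (double v : x) s += (v - mu) * (v - mu);
        return Math.sqrt(s / x.length);
    }
}
```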
With algorithm 1, the failure detector adapts to the system conditions between two processes. Next we present the algorithm that organizes the failure detectors in grids. As presented above, we use a hierarchical structure to organize them. The key issue is how to organize the hierarchical structure adaptively, that is, how to choose a unique leader for each local group and how to organize these leaders in the global group according to system conditions. The details are presented in algorithm 2.

Algorithm 2:
Step 1: [Index service pulls information about the host environments] The index service pulls host environment information from all failure detectors with a round-robin algorithm. The information about the host environments includes static and dynamic information.
Step 2: [Compute $SL_i$ of all hosts and choose the most powerful host] Compute the service level of each host environment as:

$$SL_i = \frac{\alpha FR_i + \beta Mem_i}{Load_i} \qquad (11)$$
and take the host $i$ that has $\max(SL_i)$ (a code sketch of this ranking is given after the algorithm). In Eq. (11), $FR_i$ is the frequency of $CPU_i$, $Mem_i$ is the memory of host $i$, $Load_i$ is the load of host $i$, and $\alpha$ and $\beta$ are coefficients.
Step 3: [Find a group leader on the most powerful host] Get the list of failure detectors on the host $i$ chosen in step 2. If one of them is a group leader, go to step 6; else go to step 4.
Step 4: [Deploy a new local group leader on the most powerful host] Deploy a group-leader failure detector on host $i$. All the failure detectors in this group send detection messages to it.
Step 5: [Global notification about the local group leader] The index service sends a trigger message to the upper-level failure detector to notify it of the new local group leader, and the new group leader sends detection messages to the upper-level failure detector.
Step 6: [Handle newly added objects] If new monitored objects are added, first try to find the lightest-loaded failure detector in this group by computing $\min(N_{fd_i})$, where $N_{fd_i}$ is the number of objects that failure detector $i$ monitors. If $\min(N_{fd_i})$ is less than $F_s - 1$, the new object sends "I'm alive" messages to this failure detector, and go to step 8; else go to step 7.
Step 7: [Deploy a new failure detector] Create a new failure detector, and the new object sends "I'm alive" messages to this failure detector. If the number of failure detectors in this local group is larger than $S_g$, a new group with a group-leader failure detector is added, and this new group leader registers itself with the index service and notifies the other group leaders.
Step 8: [Update the messages sent to other group leaders] Update the messages sent to the other group leaders with the new object; at the same time, send a message to the index service to notify it of the global change.
Step 9: [Handle failures of local detectors] The group leader detects the failure of a local detector. If one fails, a new one is added, and the monitored objects send "I'm alive" messages to it.
Step 10: [Handle failure of a group leader] The failure detectors in the global groups and the index service detect the failure of a group leader. If one fails, a new group leader is deployed on the same host registered with the index service.
Step 11: [Merge small groups] When the number of monitored objects decreases and the decrease $N_{decrease}$ satisfies the constraint $S_g \le N_{decrease} \le 2S_g - 1$, compute the number of groups needed with Eq. (12), where $S_g$ is the size of the local groups:

$$n = 2\log_{S_g} N_{Total} \qquad (12)$$

We choose $S_g$ with Eq. (13), where $\varphi$ is a coefficient:

$$S_g = \varphi\, N_{Total} \log N_{Total} \qquad (13)$$

The network condition between two failure detectors is evaluated as in Eq. (14):

$$\forall i, j,\ T_{ij} \le \Phi \mid i, j \in LAN \rightarrow good \qquad (14)$$

where $T_{ij}$ is the average round-trip time between $i$ and $j$, and $\Phi$ is a threshold. In DRIC, the smallest group is annexed to the second smallest one if the network conditions between them are good enough. With this process, the number of groups will eventually be less than $n$.
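As an illustration of the service-level ranking used in steps 1–2 of algorithm 2 (Eq. (11)), the sketch below ranks candidate hosts and picks the one with the maximal service level. The HostInfo record, the coefficient values, and the method names are assumptions made for illustration only; they are not part of the DRIC implementation.

```java
import java.util.Comparator;
import java.util.List;

/** Sketch of steps 1-2 of algorithm 2: choose the group-leader host by service level. */
public class LeaderSelection {

    /** Static and dynamic host information pulled by the index service (illustrative type). */
    public record HostInfo(String hostId, double cpuFrequency, double memory, double load) {}

    // alpha and beta trade CPU frequency against memory size; these values are assumptions.
    private static final double ALPHA = 1.0;
    private static final double BETA = 1.0;

    /** Eq. (11): SL_i = (alpha * FR_i + beta * Mem_i) / Load_i. */
    public static double serviceLevel(HostInfo h) {
        // A guard against a reported load of zero is omitted for brevity.
        return (ALPHA * h.cpuFrequency() + BETA * h.memory()) / h.load();
    }

    /** Take the host with Max(SL_i); a group-leader failure detector is then deployed on it. */
    public static HostInfo chooseLeaderHost(List<HostInfo> hosts) {
        return hosts.stream()
                .max(Comparator.comparingDouble(LeaderSelection::serviceLevel))
                .orElseThrow();
    }
}
```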
With the algorithms above, the failure detector adapts to grid applications. In the next section we present the failure handling framework.

5. Failure Handling

In this section, we present a policy-based approach to handling detected failures. We first give a brief overview of our approach, then review the application-level fault-tolerance techniques, and finally present the policy-based model of failure handling.

5.1 Overview of Failure Handling

As depicted in Fig. 1, the dependable grid computing framework comprises two phases: failure detection and failure handling. Figure 3 presents the overview of the failure handling approach. The failure handling approach uses a decision-making method to attain the QoS requirements described by users, and integrates almost all kinds of application-level failure-recovery techniques, such as checkpointing, replication, and workflow.
Fig. 3  Overview of failure handling approach.
First, the user submits a task with QoS requirements. The policy engine analyzes the QoS requirements and the attributes of the application to constitute a failure-recovery policy. Based on this policy, the policy engine carries it out with the appropriate techniques, with the help of the job management, data management, and information management components illustrated in Fig. 3. The user can also specify the failure handling policy directly with the policy-definition interface, in which case the policy engine simply carries it out.

5.2 Application-Level Fault-Tolerance Techniques

In this section we describe the task-level fault-tolerance techniques used to handle task failures.

• Retrying: Retrying is a simple and obvious failure handling technique, in which the system retries the task on the same resources when a failure is detected. Generally, the user or the system specifies the maximum number of retries and the interval between retries. If the failure persists after these retries, the system reports an error, and the user or the system should provide another failure handling method to finish the task.

• Replication: The basic idea of replication is to have replicas of a task running on different resources, so that as long as not all replicated tasks crash, the execution of the associated activity succeeds. The failure detector monitors these replicated tasks during execution, and the policy executor kills the other replicas when one of them finishes with an appropriate result.

• Checkpointing: Checkpointing has been widely studied in distributed systems. Traditionally, for a single system, checkpointing can be realized at three levels: kernel level, library level, and application level [24]. Our failure handling framework supports checkpoint-enabled tasks so that when a task fails, it can be restarted from the most recently checkpointed state. Besides these three levels of checkpointing, we implement
workflow-level checkpointing, which defines the task as a workflow and saves the states of its atomic tasks as checkpoints.

• Workflow: Workflow is another technique for handling failures in grids [23]. There are several workflow methods for failure handling, such as alternative task, workflow redundancy, and user-defined exception handling. The main idea of this method is to construct a new workflow for task execution. The workflow method can be regarded as a composition of the other failure handling techniques.

5.3 Model of Policy Making

In this section, we present some basic definitions about failure-recovery policies, give quality criteria for failure recovery, and present the policy-making model.

5.3.1 Basic Definitions

We first present some basic definitions about failure-recovery policies.

Definition 1 Execution time: Execution time ($Q_{et}$) is the execution duration of a task under some failure-recovery method. The task can be a single running service or a composition of services. Execution time is important for a failure-recovery method, especially in mission-critical applications; the execution time of the same task varies greatly with different failure-recovery methods [23].

Definition 2 Failure-recovery policy (P): A failure-recovery policy is a policy specified by the user or the system that identifies which failure-recovery techniques should be applied for a particular application. In this paper, a failure-recovery policy specifies one of the techniques presented above or a composition of them, such as replication with checkpointing or replication with retrying.

Definition 3 Execution cost: The execution cost ($Q_{co}$) of a service is the amount of money a service requester pays for executing the specific operation. The execution cost can be computed as in Eq. (15), where $Q_{price}$ is the price of the specific resource and $R_{num}$ is the number of resources:

$$Q_{co} = Q_{price} \times Q_{et} \times R_{num} \qquad (15)$$
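As a hypothetical numeric illustration of Eq. (15) (the figures below are assumptions, not values from the paper): with a resource price of 0.1 cost units per time unit, an execution time of 120 time units, and 3 resources, the execution cost is $Q_{co} = 0.1 \times 120 \times 3 = 36$ cost units.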
Definition 4 Task: A task is a state chart with a sequence of states $[t_1, t_2, \cdots, t_n]$, such that $t_1$ is the initial state and $t_n$ is the final state.

With the above definitions, we define the QoS of a service $s$ as the tuple:

$$Q(s) = (Q_{et}(s), Q_{co}(s)) \qquad (16)$$
5.3.2 Quality Criteria

In this section, we present the quality criteria for the different failure-recovery policies. For simplicity and feasibility, we discuss only five policies here: retrying, replication, checkpointing, retrying with replication, and replication with checkpointing. Workflow methods such as alternative task are specified by the user; the system cannot control them but only executes them. We first define the five failure-recovery policies.

Definition 5 Retrying: Given a task $[t_1, t_2, \cdots, t_n]$, if the task fails at $t_i$, it is re-executed from $t_1$.

Definition 6 Replication: Given a task $[t_1, t_2, \cdots, t_n]$, there are $m$ replicas of this task: $[t_{11}, t_{12}, \cdots, t_{1n}]$, $[t_{21}, t_{22}, \cdots, t_{2n}]$, $\cdots$, $[t_{m1}, t_{m2}, \cdots, t_{mn}]$. If any task fails at $t_{ij}$, replica $i$ is killed. If any replicated task successfully finishes at $t_{in}$, the other replicas are killed.

Definition 7 Checkpointing: Given a task $[t_1, t_2, \cdots, t_n]$, the running states are saved to a file. If the task fails at $t_i$, it is re-executed from $t_i$.

Definition 8 Replication with checkpointing: Given a task $[t_1, t_2, \cdots, t_n]$, there are $m$ replicas of this task: $[t_{11}, t_{12}, \cdots, t_{1n}]$, $[t_{21}, t_{22}, \cdots, t_{2n}]$, $\cdots$, $[t_{m1}, t_{m2}, \cdots, t_{mn}]$, and the running states are saved to files. If any task fails at $t_{ij}$, it is re-executed from $t_{ij}$. If any replicated task successfully finishes at $t_{in}$, the other replicas are killed.

Definition 9 Retrying with replication: Given a task $[t_1, t_2, \cdots, t_n]$, there are $m$ replicas of this task: $[t_{11}, t_{12}, \cdots, t_{1n}]$, $[t_{21}, t_{22}, \cdots, t_{2n}]$, $\cdots$, $[t_{m1}, t_{m2}, \cdots, t_{mn}]$. If any task fails at $t_{ij}$, it is re-executed from $t_{1j}$. If any replicated task successfully finishes at $t_{in}$, the other replicas are killed.

Having defined the failure-recovery policies, we present the QoS criteria for them. For simplicity and feasibility of expression, we make some assumptions about the failure behaviors in grid computing.

Assumption 1
1. The time a task spends at state $t_i$ is the duration for which the task performs a specific operation on a specific service.
2. The time to switch from state $t_i$ to state $t_{i+1}$ is zero, that is, the task enters state $t_{i+1}$ immediately when state $t_i$ finishes.

Assumption 2
1. The probability that the task fails at state $t_i$ is $\lambda$ (the failure rate), and failures follow a Poisson distribution. The Mean Time To Failure (MTTF) is defined as $1/\lambda$.
2. When the task fails at state $t_i$, the time at which the failure occurs follows a Poisson distribution.
3. The failure-free execution time of the task is $T_F$.
4. The average downtime of the system is $T_D$.

Assumption 3
1. The number of replicas of a task is $N_R$ for the replication policy and the replication with checkpointing policy.
2. The checkpoint overhead ($T_C$) is the same for the different policies.
3. The checkpoint recovery time ($T_R$) is the same for the different policies.
4. The checkpoint interval is $T_I$.

With these definitions and assumptions, a brief explanation of the QoS criteria for the different failure-recovery policies is presented below.
• Retrying: The computation of the execution time with retrying was first specified by Duda [7]:

$$Q_{et(ret)} = \frac{e^{\lambda T_D}\left(e^{\lambda T_F} - 1\right)}{\lambda} \qquad (17)$$

The execution cost for retrying is computed as:

$$Q_{co} = Q_{price} \times Q_{et(ret)} \qquad (18)$$

• Checkpointing: The execution time with checkpointing is computed as in Eq. (19), adopted from [35]:

$$Q_{et(ch)} = \frac{T_F}{T_I} \cdot \frac{e^{\lambda(T_D - T_C + T_R)}\left(e^{\lambda(T_I + T_C)} - 1\right)}{\lambda} \qquad (19)$$

The execution cost for checkpointing is:

$$Q_{co} = Q_{price} \times Q_{et(ch)} \qquad (20)$$

• Replication with checkpointing: The execution time for replication with checkpointing is as follows:

$$Q_{et(rep+ch)} = \frac{T_F}{T_I} \cdot \frac{e^{\frac{\lambda}{N_R}(T_D - T_C + T_R)}\left(e^{\frac{\lambda}{N_R}(T_I + T_C)} - 1\right)}{\lambda/N_R} \qquad (21)$$

The execution cost for replication with checkpointing is computed as:

$$Q_{co} = Q_{price} \times Q_{et(rep+ch)} \times N_R \qquad (22)$$

• Retrying with replication: The execution time for retrying with replication is:

$$Q_{et(ret+rep)} = \frac{e^{\frac{\lambda}{N_R}T_D}\left(e^{\frac{\lambda}{N_R}T_F} - 1\right)}{\lambda/N_R} \qquad (23)$$

The execution cost for retrying with replication is computed as:

$$Q_{co} = Q_{price} \times Q_{et(ret+rep)} \times N_R \qquad (24)$$

5.3.3 Policy-Making Model

The basic idea of the policy-making model is to select the optimized failure-recovery method using a Multiple Attribute Decision Making (MADM) approach [27]. With the above definitions and equations, we obtain the decision-making matrix $M$ as below:

$$M_{i,j} = \begin{pmatrix} Q_{et(ret)} & Q_{co(ret)} \\ Q_{et(ch)} & Q_{co(ch)} \\ Q_{et(rep+ch)} & Q_{co(rep+ch)} \\ Q_{et(ret+rep)} & Q_{co(ret+rep)} \end{pmatrix} = \begin{pmatrix} Q_{1,1} & Q_{1,2} \\ Q_{2,1} & Q_{2,2} \\ Q_{3,1} & Q_{3,2} \\ Q_{4,1} & Q_{4,2} \end{pmatrix} \qquad (25)$$

As execution time and execution cost have different scales, we first normalize the matrix $M$ with Eq. (26):

$$V_{i,j} = \begin{cases} \dfrac{Q_i^{max} - Q_{i,j}}{Q_i^{max} - Q_i^{min}}, & \text{if } Q_i^{max} - Q_i^{min} \neq 0 \\ 1, & \text{if } Q_i^{max} - Q_i^{min} = 0 \end{cases} \quad (i = 1, 2) \qquad (26)$$

In Eq. (26), $Q_i^{max}$ and $Q_i^{min}$ are the maximal and minimal values of a quality criterion in matrix $M$. With Eq. (26), we obtain a new matrix $M'$:

$$M' = (V_{i,j})_{2 \times 4} \qquad (27)$$

The following formula is used to compute the overall quality score. The policy that has the maximum score and whose execution time is less than the user requirement will be chosen by DRIC:

$$Score(P_j) = \sum_{i=1}^{2} (V_{i,j} \times W_i) \qquad (28)$$

where $W_i$ represents the weight of each criterion and is computed with the entropy method, as shown in the following equations:

$$H_i = -\frac{1}{\ln 4} \sum_{j=1}^{4} f_{i,j} \ln f_{i,j}, \quad i = 1, 2 \qquad (29)$$

where $f_{i,j}$ is defined in Eq. (30):

$$f_{i,j} = \frac{V_{i,j}}{\sum_{j=1}^{4} V_{i,j}} \qquad (30)$$

With Eq. (30), $W_i$ is computed as:

$$W_i = \frac{1 - H_i}{2 - \sum_{i=1}^{2} H_i} \qquad (31)$$

In Eq. (28), the end user can also specify the weight $W_i$ according to their preference on QoS. In this paper we use the entropy weight to consider both the QoS requirement of the end user and the execution cost of the tasks.
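To make the policy-making model concrete, the following is a condensed sketch of the normalization, entropy weighting, and selection rule of Eqs. (26)–(31) and (28). It takes the per-policy execution times and costs of Eqs. (17)–(24) as inputs instead of recomputing them; the class name, method signature, and parameter names are illustrative assumptions and not the actual DRIC implementation.

```java
/**
 * Condensed sketch of the policy-making model (Eqs. (26)-(31), selection by Eq. (28)).
 * The rows of q must follow the enum order below, matching the matrix of Eq. (25).
 */
public class PolicyEngine {

    /** Candidate policies, in the row order of the decision matrix M of Eq. (25). */
    public enum Policy { RETRYING, CHECKPOINTING, REPLICATION_WITH_CHECKPOINTING, RETRYING_WITH_REPLICATION }

    /**
     * @param q     4x2 matrix: q[j][0] = Q_et of policy j, q[j][1] = Q_co of policy j
     * @param maxEt execution-time bound required by the user
     * @return the policy with the maximal score whose execution time meets the bound
     *         (null if no policy meets the bound)
     */
    public static Policy choose(double[][] q, double maxEt) {
        int policies = q.length;
        int criteria = 2;

        // Eq. (26): normalize each criterion so that a smaller Q (time or cost) scores higher.
        double[][] v = new double[criteria][policies];
        for (int i = 0; i < criteria; i++) {
            double max = Double.NEGATIVE_INFINITY, min = Double.POSITIVE_INFINITY;
            for (int j = 0; j < policies; j++) {
                max = Math.max(max, q[j][i]);
                min = Math.min(min, q[j][i]);
            }
            for (int j = 0; j < policies; j++) {
                v[i][j] = (max - min == 0) ? 1.0 : (max - q[j][i]) / (max - min);
            }
        }

        // Eqs. (29)-(30): entropy of each criterion over the four policies (1/ln 4 for 4 policies).
        double[] h = new double[criteria];
        for (int i = 0; i < criteria; i++) {
            double colSum = 0;
            for (int j = 0; j < policies; j++) colSum += v[i][j];
            for (int j = 0; j < policies; j++) {
                double f = (colSum == 0) ? 0 : v[i][j] / colSum;          // Eq. (30)
                if (f > 0) h[i] -= f * Math.log(f) / Math.log(policies);  // Eq. (29)
            }
        }

        // Eq. (31): entropy weights (the degenerate case of both entropies equal to 1 is not handled).
        double hSum = h[0] + h[1];
        double[] w = { (1 - h[0]) / (criteria - hSum), (1 - h[1]) / (criteria - hSum) };

        // Eq. (28): overall score; consider only policies whose execution time meets the bound.
        Policy best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int j = 0; j < policies; j++) {
            if (q[j][0] > maxEt) continue;
            double score = v[0][j] * w[0] + v[1][j] * w[1];
            if (score > bestScore) {
                bestScore = score;
                best = Policy.values()[j];
            }
        }
        return best;
    }
}
```

As a usage note, one would fill q with the four $(Q_{et}, Q_{co})$ pairs computed from Eqs. (17)–(24) for the current failure rate and checkpoint parameters, and call choose(q, userTimeBound).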
6. Performance Evaluation

In this section, we study the performance of DRIC over a real grid environment. The testbed is based on resources over the China Education and Research Network (CERNET) [3], which covers more than 1500 universities, colleges, and institutes in China. Currently, CERNET is the second largest nationwide network in China, and the bandwidth of the CERNET backbone is 10 Gbps. The testbed includes two sites: one in the Cluster and Grid Computing Lab (CGCL) at Wuhan, China, and the other in the Future Internet Technology National Lab (FIT) at Beijing, China. In CGCL there is a 16-node cluster linked by 100 Mbps switched Ethernet; each
Table 1  Summary of accuracy test.

                                    DRIC     Chen
Number of false detections            35       29
Average mistake duration (ms)      237.1    291.3

Fig. 4  Detection time with $W_s = 50$.
node is equipped with a Pentium III processor at 1 GHz and 512 MB of memory, and the operating system is Red Hat Linux 9.0. In FIT there are 2 PCs linked by 100 Mbps switched Ethernet, each with a Pentium IV processor at 2.4 GHz and 512 MB of memory, also running Red Hat Linux 9.0. The prototype of the DRIC framework is implemented in Java and integrated into the campus grid of Huazhong University of Science and Technology (HUST).
Fig. 5  System load and the monitored objects.
Fig. 6  Mistake rate and the monitored objects.
6.1 Performance Evaluation of Failure Detection

We first evaluate the performance of algorithm 1. We set one node of the cluster in CGCL as the message sending host and one PC in FIT as the message receiving host. All messages are transmitted with the UDP protocol. During the experiment, a heartbeat message of 64 bytes is generated every 200 ms. We measure an average round-trip time (RTT) of 336.9 ms with a standard deviation of 77.1 ms; the minimum and maximum RTT are 155.8 and 747.7 ms, respectively. The average inter-arrival time of the received messages is 209.3 ms with a deviation of 101.9 ms.

Figure 4 compares the detection time of algorithm 1 and Chen's algorithm [6], with the sample window size set to 50. The Receive line in Fig. 4 denotes the inter-arrival times of the messages, the Chen line denotes the detection time with Chen's method, and the DRIC line denotes the detection time with algorithm 1 presented above. Figure 4 shows that Chen's method has a longer detection time but converges faster, being almost constant from the very beginning. Figure 4 also shows that with only a small number of messages the adaptive method does not work well. To evaluate the accuracy of algorithm 1, a 24-hour experiment was performed comparing algorithm 1 with Chen's. In this experiment, $\Delta_{interval}$ was set to 200 ms, and the results are summarized in Table 1. From Table 1, we can see that Chen's detection method has a slightly lower mistake rate but a longer mistake duration.

We then compare the system scalability of HBM [33] and the adaptive system. The system scalability is evaluated by processor load and accuracy of failure detection. As shown in Fig. 5, the DRIC system has better scalability: when the number of monitored objects is about 400, the system load reaches 100% in the HBM system, whereas for DRIC the number can be more than 1000. Accuracy is another aspect of scalability. We compare the mistake rate as the number of monitored objects increases; the result is presented in Fig. 6. From Fig. 6, we see that the DRIC system has a lower mistake rate than HBM, especially when the number of monitored objects is larger than 200, and the gap between DRIC and HBM grows as the number of monitored objects increases. When the number of monitored objects is about 800, the failure detection
Fig. 7  Execution time with $T_D = 0$.
Fig. 8  Execution cost with $T_D = 0$.
Fig. 9  Execution time with $T_D = T_F$.
Fig. 10  Execution cost with $T_D = T_F$.
service of HBM becomes almost useless, as the failure detection host almost collapses at that point.

6.2 Performance Evaluation of Failure-Recovery

We run simulation tasks to evaluate the performance of failure recovery in DRIC. We first test the influence of the failure rate on the performance of the different failure-recovery techniques, then evaluate the influence of downtime on system performance, and finally test the computation cost of failure-recovery policy making. We run the simulation tasks over the campus grid of HUST, with most tasks running on the clusters in CGCL.

First we set $T_F$ to 60, $T_I$ to 6, $T_D$ to 0, both $T_R$ and $T_C$ to 0.5, and $N_R$ to 3. Figure 7 and Fig. 8 give the execution time and execution cost for the different policies and for the DRIC system. Figure 7 shows that checkpointing and replication with checkpointing have shorter execution times than the other policies when the failure rate is high; when the failure rate is small, these two policies show less benefit in execution time because of the extra overhead of checkpointing. Figure 8 shows that replication and retrying with replication consume more resources when the failure rate is small, whereas for a high failure rate, retrying consumes many resources because of its long execution time. DRIC selects checkpointing as the failure-recovery policy under this condition, considering both the execution time and the execution cost.

We then evaluate the impact of downtime on the performance of the different failure-recovery policies. We set the parameters as in the above experiments and use different downtimes: 60, 120, and 300. The results are shown in Figs. 9, 10, 11, 12, 13, and 14. The results illustrate that for longer downtimes, replication has a shorter execution time than the other policies. The results also show that the MTTF is a very important factor for different downtimes, especially when the failure rate is high; when the downtime is very long, the MTTF has less impact on the performance. We also find that the system using the replication with checkpointing policy is the worst one under long downtime and high failure rate. From the results, we find that the DRIC system chooses the appropriate failure-recovery policy considering both execution time and cost; it does not stick to a single failure-recovery policy, but changes the policy dynamically according to the system conditions and user requirements.

We also test the computation cost of policy-making as the number of tasks increases. We test the computation cost on a node of a cluster in CGCL (Pentium III processor at 1 GHz and 512 MB of memory). The results are
Fig. 11  Execution time with $T_D = 2T_F$.
Fig. 12  Execution cost with $T_D = 2T_F$.
Fig. 13  Execution time with $T_D = 5T_F$.
Fig. 14  Execution cost with $T_D = 5T_F$.
Fig. 15  Computation cost for policy-making.
plotted in Fig. 15. From Fig. 15, we see that the policy engine has good scalability and that the computation cost of policy-making is low: for 10000 tasks, it costs only 100 s on a PC equipped with a PIII processor at 1 GHz.
7. Conclusion and Future Work
Due to the diverse failures and error conditions in grid environments, developing, deploying, and executing applications over the grid is a challenge; thus dependability is a key factor for grid computing. In this paper, we present a dependable grid computing framework, called DRIC, which provides an adaptive failure detection service and a policy-based failure handling method. The results of the performance evaluation show that this framework is scalable, highly efficient, and has low overhead. This framework also provides a reference for failure-recovery policy-making in other areas.

DRIC handles failures based on failure detection. In the future, we plan to provide a failure avoidance framework for grids. We will also test the performance of DRIC for different kinds of applications, such as computation-intensive and data-intensive applications. Making failure-recovery policies based on system availability and grid program availability is another direction for future research.
Acknowledgments

This paper is supported by the National Science Foundation of China under grants 60125208, 60273076, and 90412010, the ChinaGrid project of the Ministry of Education, and the National 973 Basic Research Program under grant 2003CB317003.

References

[1] M. Bertier, O. Marin, and P. Sens, "Implementation and performance evaluation of an adaptable failure detector," Proc. International Conference on Dependable Systems and Networks, pp.354–363, Washington D.C., USA, June 2002.
[2] A. Beguelin, E. Seligman, and P. Stephan, "Application level fault tolerance in heterogeneous networks of workstations," Special Issue on Workstation Clusters and Network-based Computing, J. Parallel Distrib. Comput., vol.43, pp.147–155, June 1997.
[3] CERNET, http://www.cernet.edu.cn
[4] H. Casanova, J. Dongarra, C. Johnson, and M. Miller, "Application-specific tools," in The Grid: Blueprint for a New Computing Infrastructure, ed. I. Foster and C. Kesselman, pp.159–180, Morgan Kaufmann, 1998.
[5] T.D. Chandra and S. Toueg, "Unreliable failure detectors for reliable distributed systems," J. ACM, vol.43, no.2, pp.225–267, 1996.
[6] W. Chen, S. Toueg, and M.K. Aguilera, "On the quality of service of failure detectors," IEEE Trans. Comput., vol.51, no.2, pp.13–32, 2002.
[7] A. Duda, "The effects of checkpointing on program execution time," Inf. Process. Lett., vol.16, pp.221–229, 1983.
[8] D. Dolev, R. Friedman, I. Keidar, and D. Malkhi, "Failure detectors in omission failure environments," Technical Report, Department of Computer Science, Cornell University, Sept. 1996. http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cs/TR96-1608
[9] X. Défago, N. Hayashibara, and T. Katayama, "On the design of a failure detection service for large scale distributed systems," Proc. International Symposium towards Peta-Bit Ultra-Networks (PBit), pp.88–95, Ishikawa, Japan, Sept. 2003.
[10] "DataGrid Information and Monitoring Services Architecture Report: Design, Requirements and Evaluation Criteria," DataGrid Report, DataGrid-03-D3.2-334453-4-0, 2002.
[11] Y.S. Dai, M. Xie, and K.L. Poh, "Reliability analysis of grid computing systems," Proc. 2002 Pacific Rim International Symposium on Dependable Computing, pp.97–104, 2002.
[12] P. Felber, X. Défago, R. Guerraoui, and P. Oser, "Failure detectors as first class objects," Proc. 9th IEEE International Symposium on Distributed Objects and Applications (DOA'99), pp.132–141, Sept. 1999.
[13] C. Fetzer, M. Raynal, and F. Tronel, "An adaptive failure detection protocol," Proc. 8th IEEE Pacific Rim Symposium on Dependable Computing (PRDC-8), pp.146–153, 2001.
[14] I. Foster, C. Kesselman, J.M. Nick, and S. Tuecke, "The physiology of the grid: An open grid services architecture for distributed systems integration," http://www.gridforum.forum.org/ogsi-wg/drafts/ogsa draft2.9 2002-06-22.pdf, 2002.
[15] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: Enabling scalable virtual organizations," International Journal of High Performance Computing Applications, vol.15, no.3, pp.200–222, 2001.
[16] J. Frey, T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke, "Condor-G: A computation management agent for multi-institutional grids," Cluster Computing, vol.5, no.3, pp.237–246, 2002.
[17] R.A. Golding, Weak-Consistency Group Communication and Membership, PhD Thesis, University of California at Santa Cruz, 1992.
[18] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, B. Manchek, and V. Sunderam, PVM: Parallel Virtual Machine: A User's Guide and Tutorial for Network Parallel Computing, MIT Press, 1994.
[19] A.S. Grimshaw, A. Ferrari, and E.A. West, "Mentat," in Parallel Programming Using C++, ed. G.V. Wilson and P. Lu, pp.382–427, MIT Press, Cambridge, Mass., 1996.
[20] A. Gianelle, R. Peluso, and M. Sgaravatto, "Job partitioning and checkpointing," Technical Report DataGrid-01-TED-0119-0 3, European DataGrid Project, 2001.
[21] J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, 1994.
[22] N. Hayashibara, A. Cherif, and T. Katayama, "Failure detectors for large-scale distributed systems," Proc. 21st IEEE Symposium on Reliable Distributed Systems, pp.404–409, Oct. 2002.
[23] S. Hwang and C. Kesselman, "Grid workflow: A flexible failure handling framework for the grid," Proc. 12th IEEE International Symposium on High Performance Distributed Computing, pp.126–137, 2003.
[24] K. Hwang and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming, pp.468–472, McGraw-Hill, 1997.
[25] A. Kumar and D.P. Agrawal, "A generalized algorithm for evaluating distributed-program reliability," IEEE Trans. Reliab., vol.42, no.3, pp.416–424, 1993.
[26] V.K.P. Kumar, S. Hariri, and C.S. Raghavendra, "Distributed program reliability analysis," IEEE Trans. Softw. Eng., vol.SE-12, no.1, pp.42–50, March 1986.
[27] M. Köksalan and S. Zionts, Multiple Criteria Decision Making in the New Millennium, Springer-Verlag, 2001.
[28] G. von Laszewski, I. Foster, J. Gawor, W. Smith, and S. Tuecke, "CoG Kits: A bridge between commodity distributed computing and high-performance grids," Proc. ACM 2000 Java Grande Conference, pp.97–106, 2000.
[29] C.D. Lai, M. Xie, K.L. Poh, Y.S. Dai, and P. Yang, "A model for availability analysis of distributed software/hardware systems," Information and Software Technology, vol.44, pp.343–350, 2002.
[30] A. Nguyen-Tuong, "Integrating fault-tolerance techniques in grid applications," Ph.D. dissertation, www.cs.virginia.edu/an7s/publications/thesis/thesis.pdf
[31] G.J. Popek, R.G. Guy, T.W. Page, Jr., and J.S. Heidemann, "Replication in Ficus distributed file systems," IEEE Computer Society Technical Committee on Operating Systems and Application Environments Newsletter, vol.4, pp.24–29, 1990.
[32] R. van Renesse, Y. Minsky, and M. Hayden, "A gossip-style failure detection service," Proc. Middleware'98, pp.55–70, 1998.
[33] P. Stelling, I. Foster, C. Kesselman, C. Lee, and G. von Laszewski, "A fault detection service for wide area distributed computations," Proc. 7th IEEE Symposium on High Performance Distributed Computing, pp.268–278, July 1998.
[34] X. Shi, H. Jin, W. Qiang, and D. Zou, "Reliability analysis for grid computing," Lect. Notes Comput. Sci., vol.3251, pp.787–790, 2004.
[35] N.H. Vaidya, "Impact of checkpoint latency on overhead ratio of a checkpointing scheme," IEEE Trans. Comput., vol.46, no.8, pp.942–947, Aug. 1997.
Hai Jin is a professor of Computer Science and Engineering at the Huazhong University of Science and Technology (HUST) in China. He received his Ph.D. in computer engineering from HUST in 1994. In 1996, he was awarded a German Academic Exchange Service (DAAD) fellowship for visiting the Technical University of Chemnitz in Germany. He worked for The University of Hong Kong between 1998 and 2000 and participated in the HKU cluster project. He worked as a visiting scholar at the University of Southern California between 1999 and 2000. He is the chief scientist of the largest grid computing project in China, ChinaGrid. He is a member of IEEE and ACM. He is an associate editor of the International Journal of Computer and Applications and the Journal of Computer Science and Technology. He is the steering committee chair of the International Conference on Grid and Pervasive Computing (GPC), a steering committee member of the IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid) and the International Conference on Grid and Cooperative Computing (GCC), and has served as program chair of GCC'04, HIS'05, NPC'05, and UIC'06, and as program vice-chair of CCGrid'01, PDCAT'03, NPC'04, e-Science'05, and AINA'06. He has also served on the program committees of more than 100 international conferences and workshops. He has co-authored more than 10 books and published over 150 research papers. His research interests include cluster computing and grid computing, peer-to-peer computing, network storage, network security, and trusted computing.
Xuanhua Shi received his bachelor's degree in computer science from Daqing Petroleum Institute (China) in 1999 and his master's degree from Sichuan University of Science and Technology (China) in 2002. Currently, he is a Ph.D. candidate in the Cluster and Grid Computing Lab at Huazhong University of Science and Technology (China). His research interests include cluster and grid computing, fault tolerance, peer-to-peer computing, web services, the semantic web, network security, and grid security. Contact him at
[email protected].
Weizhong Qiang received his bachelor's degree in computer science from Nanjing University of Science and Technology (China) in 1999 and his master's degree from Huazhong University of Science and Technology (China) in 2002. Currently, he is a Ph.D. candidate in the Cluster and Grid Computing Lab at Huazhong University of Science and Technology. His research interests include grid security, peer-to-peer computing, and the semantic web. Contact him at
[email protected].
Deqing Zou received his bachelor's degree in computer science from Fuzhou University (China) in 1997, entered Huazhong University of Science and Technology (China) for a master's degree in 1999, and received his Ph.D. in computer engineering from Huazhong University of Science and Technology in 2004. His research interests include cluster and grid computing, peer-to-peer computing, the semantic web, network security, and grid security. Contact him at
[email protected].