Journal of the Chinese Institute of Engineers, Vol. 31, No. 4, pp. 675-690 (2008)
A STUDY ON APPLICATION CLUSTER SERVICE SCHEME AND COMPUTER PERFORMANCE EVALUATOR
Fan-Tien Cheng*, Tsung-Li Wang, Haw-Ching Yang, Shang-Lun Wu, and Chi-Yao Lo

*Corresponding author. (Tel: 886-6-2099113; Email: [email protected]) F. T. Cheng, T. L. Wang, S. L. Wu, and C. Y. Lo are with the Institute of Manufacturing Engineering, National Cheng Kung University, Tainan 701, Taiwan, R.O.C. H. C. Yang is with the Institute of Systems and Control Engineering, National Kaohsiung First University of Science and Technology, Kaohsiung 811, Taiwan, R.O.C.
ABSTRACT

Applications of a distributed computer system are required to provide continuous service 24 hours a day, 7 days a week. However, computer failures due to exhaustion of operating system resources, data corruption, numerical error accumulation, and so on, may interrupt services and cause significant losses. Hence, this work proposes an application cluster service (APCS) scheme. The proposed APCS provides both a failover scheme and a state recovery scheme for failure management. The failover scheme automatically activates the backup application to replace the failed application whenever it is sick or down, while the state recovery scheme provides an inheritable software architecture to support applications with state recovery requirements: an application simply needs to inherit and implement this scheme to accomplish the task of state backup and recovery. Furthermore, a performance evaluator (PEV) that can detect performance degradation and predict time to failure is developed in this study. Using these detection and prediction capabilities, the APCS can perform the failover process before node breakdown. Thus, applying the APCS and PEV enables an asynchronous distributed computer system with shared memory to provide services with near-zero downtime.

Key Words: application cluster service (APCS), failover scheme, state recovery scheme, performance evaluator (PEV).
I. INTRODUCTION

The international technology roadmap for semiconductors (ITRS) states that the total allowable scheduled and non-scheduled down time of a factory information and control system (FICS) decreases from 480 min in 2003 to 180 min in 2009 (ITRS, 2003). This requirement means that the availability of the FICS should exceed 0.9997 (1 − 180 min/(365 × 24 × 60 min)); that is, the services provided by the FICS should have near-zero-downtime performance. Generally, these services include application programs and the computer systems that provide operating environments for the applications. Consequently, if continuous services of application programs are required, their supporting computer systems must also be healthy.

To improve availability, high availability cluster service schemes for computer systems are often proposed. High availability clustering technology (Cluster HA, 2008; Clustering Center.com, 2002; Marcus and Stern, 2000; Piedad and Hawkins, 2001) was developed to automatically detect node failures during service processes. Upon detecting an application program failure, the cluster service launches a backup application program to continue providing application services from the point where the malfunctioning application program left off. Several commercial products, such as Microsoft Cluster Service (MSCS) (Gamache et al., 1998; Microsoft Cluster Service, 2003; Vogels et al., 1998) and Matrix HA and Matrix Server of PolyServe (PolyServe, 2008), can provide such cluster service schemes.
Fig. 1 Cluster environment using APCS
However, the heartbeat mechanisms used for failure detection in the above schemes require private networks for their implementation (Microsoft Cluster Service, 2003). The heartbeat mechanism is just one of several mechanisms that can detect failures; others include ping-acks and leases. Cheng et al. (2004b) developed a service management scheme (SMS) based on the Jini infrastructure (Arnold et al., 2000) and the technology of design by contract (Meyer, 1992). A generic evaluator (GEV) is included in the SMS. The SMS can detect the following errors: service crashing, transmission of wrong messages, and service degradation. Furthermore, the GEV can periodically back up the execution status and parameters of a service to the database. The GEV informs the client if an error is detected in a service; the client then retrieves the backup information and activates the backup service to continue the application process. However, the GEV simply uses a simple SPC (statistical process control)-like 3σ test to determine an abnormal state, which is inadequate because the parameters to be monitored for diagnosing a system tend to be multi-dimensional vectors rather than single variables. Therefore, multivariate analysis models or more advanced diagnosis techniques are required.

This study proposes an application cluster service (APCS) scheme with the capabilities of failover and state recovery, such that application programs under the control of the APCS can provide more reliable services. The APCS has an efficient and concise failure-detection mechanism, also known as a heartbeat mechanism. Heartbeat mechanisms (Zonghao et al., 2003; Guo et al., 2004; Johnson et al., 2005) are widely used in high availability applications to monitor network services and server nodes. A software architecture scheme for state recovery is also provided by the APCS. If a backup application program requires state recovery to resume normal operations, this program can simply inherit this scheme to gain the state recovery capability. Furthermore, a performance evaluator (PEV) that can detect performance degradation and predict the time to failure of a node is developed in this work. To simplify the computer-performance-monitoring problem, this work only considers the factors related to process aging (Brown, 1994; Huang et al., 1995). By combining the APCS with the PEV and using the detection and prediction capabilities of the PEV, the APCS can perform the failover process before an application program or a node breaks down. Consequently, a distributed computer system can provide services with near-zero downtime by applying APCS+PEV.

The rest of this paper is organized as follows. Section II details the APCS scheme. Section III then introduces the PEV scheme. Next, Section IV presents considerations for deploying the APCS and PEV. Subsequently, Section V compares the APCS+PEV with the Microsoft cluster service (MSCS) and the APCS. Section VI then describes the availability analysis for applying APCS+PEV. Next, Section VII introduces an illustrative example of implementing the APCS and PEV in a manufacturing execution system. Finally, conclusions are made in Section VIII.

II. APCS SCHEME

Figure 1 shows the cluster environment using the APCS. This cluster environment contains several nodes. A personal computer (PC) or any computer server may act as a node. Each node contains
an APCS and several applications. The operating system application programming interface (OS API) is applied by the APCS to monitor and control each application within the node. The CORBA specification is adopted as the communication infrastructure among the APCSs. The communication protocols used among the applications depend on user preferences and can include CORBA, COM+, .NET, or Java RMI. Finally, the shared storage stores the statuses and records of all the APCSs and applications. The APCS has a failover scheme and a state recovery scheme, implemented by the failover service manager (FSM) and the software architecture scheme for state recovery (SASSR), respectively.

1. Failover Scheme

The nodes in the cluster environment are assigned one of two roles: Master or Slave. Only one node can act as the Master; the others serve as Slaves. The Master sends heartbeats to each Slave so that each Slave is aware of the existence of the Master. A Slave should send an acknowledgement after receiving a heartbeat from the Master. If the Master does not receive this acknowledgement, the Slave in question is down, and the failover process for that Slave is launched. If a Slave fails to receive heartbeats from the Master on N consecutive occasions, the Slave starts investigating the existence of the Master. Here, N is a positive integer; the value of N depends on design considerations and is explained later in this section. Following N successive inquiries without a response, the Master is most likely down. The Slave then notifies all of the other Slaves to also start investigating the existence of the Master. If more than half of the Slaves obtain the same result, "N successive inquiries without a response", then it is considered certain that the Master is down. Subsequently, the failed Master is excluded from the cluster and all of the Slaves enter a reconfiguring state to select a new Master. Finally, the replacement process for the Master is launched. The entire failover scheme involves the following five steps.

(i) Establishing the Cluster Environment

All nodes to be included in the cluster environment must register at a specific group, after which their node IDs are assigned and their APCS service data are stored in the shared storage.

(ii) Invoking Applications and Commencing Detection

Users can invoke a local application via the local APCS, or can invoke a remote application by
notifying the node that contains the specific application to do so. After an application is invoked, the state of the application stored in the shared storage is updated to "UP". The associated APCS then begins detecting the status of the application.

(iii) Fault Recovery of an Application

When an APCS detects that an application is down, the APCS attempts to restart it. If the APCS cannot bring the application back to normal after M tries, the APCS updates the status of the application in the shared storage to "DOWN". The value of M is determined by design considerations. The APCS then finds the node holding the backup application in the shared storage and notifies that node to invoke the backup application to continue providing the service. All of the subsequent statuses are updated in the shared storage, and the essential recovery operations are then performed.

(iv) Role Assignment and Fault Detection of a Node

The NIST Internet Time Service (ITS) allows users to synchronize computer clocks via the Internet (Internet Time Service, 2007). Therefore, by applying ITS, the APCS scheme knows which node started first. The first node started after establishing the cluster environment becomes the Master; all subsequently started nodes are assigned to be Slaves. The role (either Master or Slave) of each node is recorded in a database server installed at the shared storage. Each record in the database server is treated as a critical section for protection purposes, so each record can only be accessed by a single node at a time. Each node that intends to become the Master must first confirm, by checking the role of each node, that no node in the cluster is already serving as the Master. If the confirmation is granted, the node becomes the Master. With the above design, multiple Masters are not allowed.

Following role assignment, the Master periodically sends heartbeats to all of the Slaves, and each Slave starts a timer to check whether the heartbeats are received normally. When a heartbeat is received at a Slave, an acknowledgement is returned to the Master and the timer is reset. Based on the above scheme, if the Master sends a heartbeat to a specific Slave and does not receive the acknowledgement from that Slave, a malfunction of that Slave is detected; after N consecutive malfunctions, the breakdown of the Slave is confirmed. On the other hand, if a Slave does not receive heartbeats from the Master N successive times (namely, if N consecutive timeouts are detected), then the failure of the Master may have occurred. The Slave then notifies all of the other Slaves to also start investigating the failure of the Master. If more than half of the Slaves obtain the same result, "N successive inquiries without a response", then it is confirmed that the Master is down.
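As an illustration of the Slave-side detection logic in Step (iv), a minimal C++ sketch is given below. The function names, the callback types, and the default values of N and of the heartbeat period are assumptions for illustration only; they are not taken from the actual APCS implementation.

#include <chrono>
#include <functional>
#include <thread>

// Hypothetical sketch of the Slave-side timeout logic of Step (iv).
// The two callbacks stand in for the real CORBA-based calls.
bool masterSuspectedDown(
    const std::function<bool(std::chrono::seconds)>& waitForHeartbeat,  // false on timeout
    const std::function<bool()>& inquireMaster,                         // false if no response
    int N = 3,
    std::chrono::seconds period = std::chrono::seconds(2))
{
    int missed = 0;
    while (missed < N) {
        if (waitForHeartbeat(period))
            missed = 0;                  // heartbeat arrived: reset the count
        else
            ++missed;                    // timeout: one more consecutive miss
    }
    // N consecutive timeouts: actively inquire the Master N more times.
    for (int i = 0; i < N; ++i) {
        if (inquireMaster())
            return false;                // the Master answered, so this was a false alarm
        std::this_thread::sleep_for(period);
    }
    return true;                         // most likely down; the other Slaves must now confirm
}

If the function returns true, the Slave would notify the other Slaves, and only when more than half of them reach the same conclusion is the Master declared down, as described above.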
(v) Node Replacement

Node replacement includes Slave replacement and Master replacement. Slave replacement is executed after a Slave breakdown is detected, while Master replacement is performed once the failure of the Master is confirmed.

The failover scheme of the APCS is implemented by the FSM, which includes a node event listener, a fault detector, an application manager, and several Ap agents, as depicted in Fig. 2. The functions of the FSM modules are outlined as follows.
• The node event listener receives all the messages sent from all the other nodes and distributes them to the fault detector or the application manager.
• The fault detector mainly handles the tasks of fault detection and node replacement, as stated in Steps (iv) and (v) of the failover scheme, respectively. After performing node replacement, the fault detector forwards a request to the application manager to execute the fault recovery process of the applications.
• The application manager is primarily concerned with the tasks of invoking applications and application replacement, as described in Steps (ii) and (iii) of the failover scheme. The application manager is in charge of all Ap agents that monitor and control applications.
• Ap agents are governed by the application manager. Each Ap agent monitors and controls exactly one application. An Ap agent can invoke, detect, and close an application by using the OS API, and it reports the application status to the application manager.

Fig. 2 Block diagram of failover service manager

Determination of the Value of N

The value of N mentioned above needs to be determined by design considerations. The approach for practically assigning the value of N is explained below, using typical data from the example presented in Section VII. The typical WIP (work in progress) tracking state-change period of an in-line IC-packaging production line is at least 30 sec. Therefore, the failover scheme should be designed to back up states, detect a failure, and finish the entire failover process within 30 sec. The typical number of WIPs is about 500. The backup time for storing 500 files of WIP object states to the shared storage
needs about 10.705 sec (as shown in Table 5), and the recovery time for retrieving 500 files of WIP object states requires about 4.397 sec (as shown in Case 3 of Table 5). The Master heartbeat sending period is chosen as 2 sec, and the Slave investigating period for checking the existence of the Master is also set to 2 sec. Consequently, if N is chosen to be three, the time required for replacing the Master is 10.705 + 4.397 + 3 × 2 + 3 × 2 = 27.102 < 30 sec. Therefore, for the above design considerations, the value of N is assigned to be three for this typical example. Of course, different design considerations will result in different values of N. Generally speaking, dynamically determining the value of N based on the states of the cluster may be better than assigning a static value as this study does. However, as the discussion in Section VII makes clear, because the scale of the cluster does not affect the total fault recovery time much, it may not be necessary to implement a scheme for dynamically determining the value of N.

2. State Recovery Scheme and Its Software Architecture

Two types of state recovery, namely independent state recovery and dependent state recovery, are considered here. These types are defined below.

(i) Independent State Recovery

During initialization, the backup application creates all of the objects that need state recovery. Therefore, this type of state recovery is independent of object creation.

(ii) Dependent State Recovery

Following initialization, the backup application may not create all of the objects that require state recovery.
Fig. 3 Software architecture scheme for state recovery
Those objects may be created and/or deleted depending on the process state. Consequently, the dependent state recovery scheme must create the related objects based on the status of the application before the failure occurred; the state recovery process can then proceed when a failure occurs.

A software architecture scheme for state recovery (SASSR) is developed to facilitate state recovery operations. Fig. 3 shows that the SASSR consists of a Command Pattern (Command Pattern, 2008; Gamma et al., 1994) and a Memento Pattern (Gamma et al., 1994; Memento Pattern, 2008), which are described below. According to the documentation for the Command Pattern (Command Pattern, 2008), the Command Pattern is used to "encapsulate a request as an object, thereby letting you parameterize clients with different requests, queue or log requests, and support undoable operations". Fig. 3 shows that the participants of the SASSR Command Pattern include Application, StateMgr, SR_Object, aObject, and aMemento. According to the documentation for the Memento Pattern (Memento Pattern, 2008), the Memento Pattern is used to capture and externalize an object's internal state, without violating encapsulation, so that the object can be backed up and later restored to this state. Fig. 3 depicts that the participants of the SASSR Memento Pattern include Memento, aMemento, StateMgr, and aObject.

Observing Fig. 3, the state backup/recovery base
classes, which include StateMgr, SR_Object, and Memento, are the kernel of the SASSR. These classes are described below.
(i) StateMgr is designed to manage all of the backup and recovery processes of object states and transition functions between Memento and the shared storage. Furthermore, for dependent state recovery, StateMgr notifies the backup application to create the associated objects according to the status of the application stored in the shared storage.
(ii) SR_Object is constructed to perform the operations of state backup and recovery. An SR_Object should register with StateMgr so that it can be supervised by StateMgr. Any object can inherit SR_Object to obtain the functions of state backup and recovery.
(iii) Memento is designed to pack/unpack all of the application objects' state values for the backup and recovery processes.

Figure 3 shows that, in the implementation layer, aStateMgr should be implemented by inheriting StateMgr and should implement the virtual ToCreateSRObj(), which creates the objects that existed prior to the failure and need state recovery, based on the parameters stored in the shared storage. Besides, aStateMgr should also implement the virtual ToContiExe(), which accesses the associated parameters stored in the shared storage and invokes the transition function of the backup application to enter the state where the failed application left off.
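A minimal C++ sketch of how the three base classes could be declared is given below. The method names follow Fig. 3; the parameter types, the containers, the file paths, and the simplified bodies are assumptions that only outline the behaviour described in the text.

#include <cstddef>
#include <memory>
#include <string>
#include <vector>

class Memento {                          // packs/unpacks an object's state values
public:
    virtual ~Memento() = default;
    virtual void SaveToFile(const std::string& path) = 0;    // write states to shared storage
    virtual void ReadFromFile(const std::string& path) = 0;  // read states back
    // Concrete mementos also offer BackupStateValue()/RecoverStateValue(),
    // visible only to their friend aObject (see the example further below).
};

class SR_Object {                        // base of every object needing state recovery
public:
    virtual ~SR_Object() = default;
    virtual std::shared_ptr<Memento> BackupState() = 0;      // pack own states into a memento
    virtual void RecoverState(const Memento& m) = 0;         // restore own states from a memento
};

class StateMgr {                         // supervises all registered SR_Objects
public:
    virtual ~StateMgr() = default;
    void RegisterObject(SR_Object* obj) { objects_.push_back(obj); }
    void BackupObjState() {              // collect and store a memento for every object
        mementos_.clear();
        for (SR_Object* obj : objects_) {
            mementos_.push_back(obj->BackupState());
            mementos_.back()->SaveToFile("shared_storage/object_state");   // assumed path
        }
    }
    void RecoverObjState() {             // hand each retrieved memento back to its object
        // simplification: in the real scheme the mementos are rebuilt from shared storage
        for (std::size_t i = 0; i < objects_.size(); ++i) {
            mementos_[i]->ReadFromFile("shared_storage/object_state");     // assumed path
            objects_[i]->RecoverState(*mementos_[i]);
        }
    }
    void BackupTFParamValue(const std::string&) { /* save transition-function parameters */ }
    void ContiExe()                             { /* re-launch the last transition function */ }
    virtual void ToCreateSRObj() = 0;    // recreate the objects that existed before the failure
    virtual void ToContiExe() = 0;       // resume from where the failed application left off
protected:
    std::vector<SR_Object*> objects_;
    std::vector<std::shared_ptr<Memento>> mementos_;
};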
An aObject (of an Application) that needs the functions of state backup and recovery should inherit SR_Object, register at aStateMgr, and implement the virtual BackupState() and RecoverState(). aStateMgr invokes BackupState() to ask aObject to back up its states. BackupState() must construct aObject's own aMemento object, which inherits Memento to handle the packing of object states, and call aMemento::BackupStateValue() to pass the backup states to the aMemento object. The aMemento object is then sent to aStateMgr, and aStateMgr invokes aMemento::SaveToFile() to save the backup states into the shared storage. Similarly, aStateMgr calls RecoverState() to pass the backup states to aObject. These backup states are retrieved from the shared storage by aStateMgr and stored in a newly built aMemento object, which is then passed to aObject. After receiving the aMemento object, aObject invokes aMemento::RecoverStateValue() to retrieve the backup state values and recover its states.

Each aObject in an Application must have its own specific aMemento object. For example, aObject_1 of an Application must have its corresponding aMemento_1 object:

class aMemento_1 : public Memento { friend class aObject_1; };

As shown above, aObject_1 is aMemento_1's friend class, meaning that only aObject_1 can invoke aMemento_1's private methods; aMemento_1 thus protects the stored state values from being accessed by the other aObjects, which preserves aObject_1's encapsulation. When aObject_1 is informed by aStateMgr to back up its states, aObject_1 constructs an aMemento_1 and stores its states into it. Then, aStateMgr and aMemento_1 take over to proceed with the remaining backup process.
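Continuing the sketch above, a hypothetical concrete pair aObject_1/aMemento_1 could look as follows; the stored member count_ is an assumed example state, and only the friend relationship and the method names come from the text.

class aObject_1;                         // forward declaration for the friend grant

class aMemento_1 : public Memento {
    friend class aObject_1;              // the only class allowed to touch the state below
public:
    void SaveToFile(const std::string&) override   { /* write count_ to shared storage */ }
    void ReadFromFile(const std::string&) override { /* read count_ back */ }
private:
    void BackupStateValue(int count) { count_ = count; }     // callable by aObject_1 only
    int  RecoverStateValue() const   { return count_; }
    int  count_ = 0;                     // the backed-up state value (assumed example)
};

class aObject_1 : public SR_Object {
public:
    std::shared_ptr<Memento> BackupState() override {
        auto m = std::make_shared<aMemento_1>();
        m->BackupStateValue(count_);     // pack the current state into its own memento
        return m;                        // aStateMgr will later call SaveToFile() on it
    }
    void RecoverState(const Memento& m) override {
        // the memento handed back by aStateMgr is known to be an aMemento_1
        count_ = static_cast<const aMemento_1&>(m).RecoverStateValue();
    }
private:
    int count_ = 0;                      // example state that must survive a failover
};

Because the state-holding members of aMemento_1 are private and aObject_1 is its only friend, no other aObject can read the backed-up values, which is how the scheme preserves encapsulation.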
Fig. 4 illustrates the detailed state backup and recovery processes, which are described as follows.

(i) Backup Process

If a transition function is executed and the state of the corresponding Application is changed, the Application invokes aStateMgr::BackupTFParamValue() to store the parameters of the transition function in the shared storage so that these parameters can be retrieved during recovery. Subsequently, the Application calls aStateMgr::BackupObjState() to store all of the state values in the shared storage. Following that, aStateMgr invokes BackupState() of each registered aObject to complete the backup process. Each aObject then creates a corresponding aMemento object and calls aMemento::BackupStateValue() to pass its state values to the aMemento for backup. The returned aMemento_1 to aMemento_n are then passed to aStateMgr. Finally, aStateMgr calls SaveToFile() of each aMemento to store all of the state values in the shared storage.

(ii) Recovery Process

When a backup Application is initialized, it invokes aStateMgr::ToCreateSRObj() to create all of the necessary aObjects that need state recovery. Then, aStateMgr creates the necessary aObject_1 to aObject_n based on the state values saved in the shared storage. Subsequently, each aObject invokes aStateMgr::RegisterObject() to register at aStateMgr. Afterwards, the Application calls aStateMgr::RecoverObjState() to conduct the recovery process. Meanwhile, aStateMgr first creates an aMemento object and then invokes aMemento::ReadFromFile() to retrieve all of the state values from the shared storage. Next, aStateMgr calls RecoverState() of each aObject to pass the state values in aMemento to each individual aObject. Finally, each individual aObject invokes aMemento::RecoverStateValue() to recover its state values. Following state recovery, if the backup Application wishes to resume service provision, the Application can call aStateMgr::ToContiExe() to ask aStateMgr to retrieve, from the shared storage, the parameters of the transition function stored just before the breakdown of the failed node; aStateMgr then invokes aStateMgr::ContiExe() to launch the last transition function and continue executing the application.

III. PEV SCHEME

The PEV is designed to detect whether a node is in a sick state by monitoring and evaluating the system resource parameters of the node. When a node is in a sick state, the PEV is also designed to predict the time to failure of the node. Fig. 5 illustrates the architecture of the PEV, which consists of several data collectors, a detection module, and a prediction module. These modules are detailed below.

(i) Data Collector

A data collector resides at each node to collect system resource parameters via the OS API. These parameters are then passed to the PEV via CORBA.

(ii) Detection Module

Before introducing the detection module, the state diagram of a node is presented, as shown in Fig. 6.
Fig. 4 Sequence diagrams of state backup and recovery: (a) backup process; (b) recovery process
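For illustration, a short hypothetical driver mirroring the two call sequences of Fig. 4 is sketched below, using the base classes sketched in Section II.2; the function names onStateChanged() and onBackupApplicationStart() and the params argument are assumptions.

// (a) Backup: run by the Application after every executed transition function.
void onStateChanged(StateMgr& mgr, const std::string& params)
{
    mgr.BackupTFParamValue(params);      // 1. save the transition-function parameters
    mgr.BackupObjState();                // 2. collect and store every registered object's states
}

// (b) Recovery: run when the backup Application is initialized after a failover.
void onBackupApplicationStart(StateMgr& mgr)
{
    mgr.ToCreateSRObj();                 // 1. recreate the objects that existed before the failure
    mgr.RecoverObjState();               // 2. restore each object's states from shared storage
    mgr.ToContiExe();                    // 3. re-run the last transition function and resume service
}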
Table 1 State transition table of a node

Number | Current state | Trigger | New state
1 | Initial | Initialization complete | Active
2 | Active | Available resources being less than thresholds | Sick
3 | Active | Service being paused | Inactive
4 | Sick | Available resources being exhausted | Dead
5 | Dead | Repair complete | Initial
6 | Sick | Available resources back to normal | Active
7 | Inactive | Service being resumed | Active
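A compact C++ rendering of Table 1 is given below for illustration; the enum and function names are assumptions, and any trigger not listed for a state is assumed to leave the state unchanged.

enum class NodeState { Initial, Active, Inactive, Sick, Dead };

enum class Trigger { InitComplete, ResourcesLow, ServicePaused, ResourcesExhausted,
                     RepairComplete, ResourcesNormal, ServiceResumed };

NodeState next(NodeState s, Trigger t)
{
    switch (s) {
    case NodeState::Initial:
        if (t == Trigger::InitComplete)       return NodeState::Active;    // row 1
        break;
    case NodeState::Active:
        if (t == Trigger::ResourcesLow)       return NodeState::Sick;      // row 2
        if (t == Trigger::ServicePaused)      return NodeState::Inactive;  // row 3
        break;
    case NodeState::Sick:
        if (t == Trigger::ResourcesExhausted) return NodeState::Dead;      // row 4
        if (t == Trigger::ResourcesNormal)    return NodeState::Active;    // row 6
        break;
    case NodeState::Dead:
        if (t == Trigger::RepairComplete)     return NodeState::Initial;   // row 5
        break;
    case NodeState::Inactive:
        if (t == Trigger::ServiceResumed)     return NodeState::Active;    // row 7
        break;
    }
    return s;                            // unlisted triggers leave the state unchanged
}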
Fig. 5 Architecture of performance evaluator (PEV)

Fig. 6 State diagram of a node

Fig. 7 Fuzzy logic detection module for inferring the state_index
The state diagram includes five states: initial, active, inactive, sick, and dead. Table 1 lists the corresponding state transition table. Generally, a node is in the active state. However, when the available resources fall below the thresholds (due to process aging (Huang et al., 1995), for example), the node enters the sick state. The node returns to the active state if the available resources recover; on the contrary, if the available resources are exhausted, the node enters the dead state or, restated, the node is down. The above description demonstrates that the major purpose of the detection module is to detect whether a node is in the sick state. Therefore, as shown in Fig. 7, a fuzzy logic detection module is applied to infer a State_Index from the major resource performance parameters.

Several resource bottlenecks are suggested by the Microsoft Development Network (MSDN) (Microsoft Development Network, 2008) for monitoring and evaluating system performance. For example, % processor time is one such system resource, and its recommended threshold may be 85% (Microsoft Development Network, 2008). Microsoft Windows 2000 technical information (Performance Monitoring, 2008) states that computer performance monitoring includes evaluating memory and cache usage, analyzing processor activity, examining and tuning disk performance, and monitoring network performance. To simplify the computer-performance-monitoring problem, this work only considers the factors related to process aging (Huang et al., 1995). As defined by Parnas (1994), "process aging is related to the application processes getting degraded over days and weeks of execution". Process aging describes the accumulation of errors during software execution, eventually resulting in a crash/hang failure. A (process) crash fault is typically defined as a process ceasing execution without warning and never recovering thereafter. Gradual performance degradation may also accompany process aging.
Fig. 8 Input membership functions of % processor time
The major factors related to process aging are the processor and memory. According to the technical report of Microsoft (Performance Monitoring, 2008), the resource performance parameters related to the processor and memory include processor time, privileged time, pool nonpaged bytes, available Mbytes, working set, pages/sec, pool nonpaged allocations, pool paged bytes, pages input/sec, pages output/sec, and so on. By applying multivariate analysis and factor analysis, only four key parameters, namely processor time, privileged time, pool nonpaged bytes, and available Mbytes, are extracted as the inputs of the fuzzy logic detection module, as presented in Fig. 7. Among these inputs, processor time and privileged time are used to check for processor bottlenecks, pool nonpaged bytes for memory leakage, and available Mbytes for memory shortage. To simplify the calculation of the input parameters, the weights of the four input parameters considered in the fuzzy rules of the inference engine are assigned to be equal, because the different effects of the parameters can be accounted for in their specific input membership functions. For example, the membership functions of % processor time are shown in Fig. 8. According to the suggestion of MSDN (Microsoft Development Network, 2008), the threshold of % processor time is 85 and the safety value is 25. Hence, the turning point of the membership function 'low' is set to 25 and that of 'high' to 85. By the same token, the input membership functions of privileged time, pool nonpaged bytes, and available Mbytes can be designed.

The semantic output variable of the fuzzy module is the State_Index, whose value is normalized to be between 0 and 1. After running several experiments, we conclude that if the value of the State_Index is between 0 and 0.3, then the node is in the sick state; otherwise, the node is in the active state. Therefore, the output membership functions of the State_Index are designed as in Fig. 9. The fuzzy module shown in Fig. 7 converts the four input parameters into a single State_Index; it contains 40 fuzzy rules to infer the value of the State_Index from the input values of available Mbytes, pool nonpaged bytes rising percentage, privileged time, and processor time.
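To make the fuzzification concrete, a small C++ sketch is given below. Only the turning points 25 ('low') and 85 ('high') for % processor time come from the text; the trapezoidal shapes, the intermediate breakpoint of 55, and the use of min as the AND operator in the rule are assumptions.

#include <algorithm>

struct ProcessorTimeDegrees { double low, medium, high; };

// 0 at a, 1 at b, linear in between.
static double ramp(double x, double a, double b)
{
    return std::clamp((x - a) / (b - a), 0.0, 1.0);
}

ProcessorTimeDegrees fuzzifyProcessorTime(double pct)      // pct in [0, 100]
{
    ProcessorTimeDegrees d;
    d.low    = 1.0 - ramp(pct, 25.0, 55.0);                // full membership below 25%
    d.high   = ramp(pct, 55.0, 85.0);                      // full membership above 85%
    d.medium = std::min(ramp(pct, 25.0, 55.0), 1.0 - ramp(pct, 55.0, 85.0));
    return d;
}

// Firing strength of one "sick" rule among the 40 rules, with min as the AND operator.
double sickRuleStrength(double memLow, double poolRiseHigh, double privHigh, double procHigh)
{
    return std::min({ memLow, poolRiseHigh, privHigh, procHigh });
}

The inference engine would combine the strengths of all 40 rules, and the defuzzifier would map the result to a State_Index in [0, 1], with values below 0.3 read as the sick state.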
Fig. 9 Output membership functions of state_index
Fig. 10 Estimation curves for predicting mean time to failure
For example, one of the rules for inferring that a computer is in the sick state is listed below:

IF Available MBytes is low and Pool Nonpaged Bytes Rising Percentage is high and Privileged Time is high and Processor Time is high THEN State_Index is sick.

(iii) Prediction Module

When the detection module detects that a node is in the sick state, the prediction module is launched to predict the time to failure. A typical example of a monotonic-decay failure mode is that virtual memory expands monotonically. If the symptom of virtual memory expansion cannot be cured, the computer will crash when the size of virtual memory reaches a certain threshold. If a failure mode of process aging with the monotonic-decay property is assessed, then the corresponding mean time to failure (MTTF) can be evaluated as follows.

Figure 10 shows three curves for predicting the MTTF of a monotonic-decay software-aging system, where t and y denote time and remaining resources, respectively. Statistically, if the relationship between y and t can be represented by a simple linear-regression equation:
y = β0 + β1 · t,    (1)

where β0 and β1 are the coefficients in Eq. (1), then, with enough samples (ti, yi), i = 1, 2, ..., n, the first curve, which represents the simple linear-regression relationship between t and the estimates of y, β0, and β1 (denoted ŷ, β̂0, and β̂1, respectively), can be formulated as (Draper and Smith, 1998):

ŷi = β̂0 + β̂1 · ti,  i = 1, 2, ..., n,    (2)

where

β̂1 = [Σ_{i=1}^{n} (yi − ȳ)(ti − t̄)] / [Σ_{i=1}^{n} (ti − t̄)²],  β̂0 = ȳ − β̂1 · t̄,  ȳ = (Σ_{i=1}^{n} yi)/n, and t̄ = (Σ_{i=1}^{n} ti)/n.

By applying the concept of the prediction interval (Draper and Smith, 1998), the second and third curves, ŷLB and ŷUB, which denote the lower bound (LB) and the upper bound (UB) of the estimate of y, are also illustrated in Fig. 10. If the detection module detects that a node is in the sick state at time tS, then the MTTF can be predicted as follows. Let yD represent the remaining resources when entering the dead state; yD, therefore, is given. Then, observing Fig. 10 and applying Eq. (2) and the inverse regression methods of calibration (Brown, 1994), t̂D,MTTF (the estimated MTTF time of entering the dead state), t̂D,LB (the estimated LB time of entering the dead state), and t̂D,UB (the estimated UB time of entering the dead state) can be derived from the intersections of yD with ŷ, ŷLB, and ŷUB, respectively. As such, the most likely MTTF, T̂MTTF, the lower bound of the MTTF, T̂LB, and the upper bound of the MTTF, T̂UB, can be derived as:

T̂MTTF = t̂D,MTTF − tS,
T̂LB = t̂D,LB − tS, and
T̂UB = t̂D,UB − tS.

Notably, time-to-failure prediction modules should vary among different failure modes.

Fig. 11 Illustrative example for predicting MTTF

A simple numerical example for predicting the MTTF of an available-RAM shortage is presented as follows. The personal computer used for the simulation is a P4 1.6G with a memory capacity of 512 MB. Observing Fig. 11, to begin with, the computer is running normally with the available RAM maintained at about 486 MB. At t = 100 sec, a task in the computer starts to consume the available RAM at a constant rate. According to the suggestion of MSDN (Microsoft Development Network, 2008), the memory should retain at least 128 MB to maintain computer performance and stability; and if the size of memory falls to 64 MB, then the computer is virtually dead. Therefore, we define yS = 128 and yD = 64. As shown in Fig. 11, when the available RAM is down to yS (= 128), tS = 295. As such, the 40 samples shown in Table 2 are taken as the training data to build the conjecture model for predicting the MTTF.

According to Table 2, the linear-regression equation for predicting the MTTF is derived below. First, ȳ and t̄ are calculated separately:

ȳ = (Σ yi)/n = (y1 + y2 + ... + y40)/40 = (479.4 + 472.3 + ... + 128.0)/40 = 308.9,
t̄ = (Σ ti)/n = (t1 + t2 + ... + t40)/40 = (100 + 105 + ... + 295)/40 = 198.

Next, β̂0 and β̂1 are calculated based on ȳ and t̄:

β̂1 = Σ(yi − ȳ)(ti − t̄) / Σ(ti − t̄)²
   = [(479.4 − 308.9)(100 − 198) + (472.3 − 308.9)(105 − 198) + ... + (128.0 − 308.9)(295 − 198)] / [(100 − 198)² + (105 − 198)² + ... + (295 − 198)²]
   = −1.833.
Table 2 Training data for building the MTTF conjecture model

Sample No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
t (sec) | 100 | 105 | 110 | 115 | 120 | 125 | 130 | 135 | 140 | 145
y (MB) | 479.4 | 472.3 | 462.9 | 457.4 | 440.5 | 435.7 | 441.0 | 433.6 | 406.3 | 404.4

Sample No. | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20
t (sec) | 150 | 155 | 160 | 165 | 170 | 175 | 180 | 185 | 190 | 195
y (MB) | 394.5 | 386.9 | 383.5 | 387.3 | 379.2 | 375.3 | 350.0 | 325.4 | 320.6 | 303.7

Sample No. | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30
t (sec) | 200 | 205 | 210 | 215 | 220 | 225 | 230 | 235 | 240 | 245
y (MB) | 303.0 | 295.2 | 285.1 | 274.8 | 267.3 | 257.0 | 248.8 | 238.0 | 219.7 | 218.1

Sample No. | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40
t (sec) | 250 | 255 | 260 | 265 | 270 | 275 | 280 | 285 | 290 | 295
y (MB) | 202.9 | 204.8 | 198.8 | 175.9 | 173.4 | 170.5 | 159.6 | 151.2 | 144.7 | 128.0
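For illustration, the hypothetical C++ sketch below fits the regression of Eq. (2) to the Table 2 samples and derives the predicted MTTF; with these data it yields β̂1 ≈ −1.833, β̂0 ≈ 671.83, and an MTTF of about 36.6 sec, matching the hand calculation that continues below.

#include <cstddef>
#include <vector>

struct Fit { double beta0, beta1; };

Fit leastSquares(const std::vector<double>& t, const std::vector<double>& y)
{
    const std::size_t n = t.size();
    double tBar = 0.0, yBar = 0.0;
    for (std::size_t i = 0; i < n; ++i) { tBar += t[i]; yBar += y[i]; }
    tBar /= n;
    yBar /= n;

    double num = 0.0, den = 0.0;                          // slope of Eq. (2)
    for (std::size_t i = 0; i < n; ++i) {
        num += (y[i] - yBar) * (t[i] - tBar);
        den += (t[i] - tBar) * (t[i] - tBar);
    }
    const double beta1 = num / den;
    return { yBar - beta1 * tBar, beta1 };                // intercept = ybar - beta1 * tbar
}

// Intersect the fitted line with the dead-state level yD and subtract the time tS
// at which the node entered the sick state.
double predictedMTTF(const Fit& f, double yD, double tS)
{
    const double tDead = (yD - f.beta0) / f.beta1;        // estimated time of entering the dead state
    return tDead - tS;                                    // T_MTTF
}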
β̂0 = ȳ − β̂1 · t̄ = 308.9 − (−1.833 × 198) = 671.83.

Then, the linear-regression equation for predicting the MTTF is

ŷi = 671.83 − 1.833 · ti,  i = 1, 2, ..., n.

Finally, with yD = 64, t̂D,MTTF can be calculated as

t̂D,MTTF = (yD − β̂0)/β̂1 = (64 − 671.83)/(−1.833) = 331.6.

Therefore, the predicted MTTF is

T̂MTTF = 331.6 − 295 = 36.6 sec.

As mentioned in Section II, the APCS possesses fault-detection and failover capabilities. However, the fault-detection scheme of the APCS can only detect a failure after the failure has occurred; therefore, the failover process can only proceed after a failure has been encountered, which results in discontinuous service. Because the PEV has the capability of predicting the time to failure of a sick node, integrating the PEV with the APCS may provide continuous service, which is the subject of the following sub-section.

• Integrating PEV with APCS

Figures 1, 2, and 5 illustrate the cluster environment using the APCS, the block diagram of the FSM in the APCS, and the architecture of the PEV, respectively. To integrate the PEV with the APCS, the PEV substitutes for the Master and all the other nodes serve as Slaves. Each node implements a data collector of the PEV. Additionally, the functions of the heartbeat mechanism provided by the fault detector of the FSM are no longer needed, while the other functions of the FSM are still required. The deployment of the APCS and the PEV is demonstrated in Section IV.

IV. DEPLOYMENT CONSIDERATIONS
The deployment diagrams of 3-tier high availability cluster services obtained by applying the APCS and the APCS+PEV are displayed in Fig. 12. Fig. 12(a) shows that the APCS is implemented in each node at the application tier, and one of the nodes acts as the Master, which sends heartbeats through the public (cluster-to-client) network to all of the other nodes, which are Slave nodes. The shared storage that is required for the failover scheme of the APCS is combined with the system database. Figure 12(b) illustrates that the Master node is replaced by the PEV and each of the other nodes still needs to implement the APCS, owing to failover and state recovery considerations. Besides, each node must own a data collector that obtains the required resource parameters and sends them to the PEV for health monitoring.

V. COMPARISONS OF APCS+PEV WITH APCS AND MSCS

One commercially available high availability clustering service is provided by Microsoft Cluster Service 2003 (MSCS 2003) (Sun et al., 2003), which is available only in the Enterprise and Datacenter Editions of Windows Server 2003. According to the documentation of MSCS 2003 (Microsoft Cluster Service, 2003), MSCS 2003 provides high availability and scalability for mission-critical applications such as databases, messaging systems, and file and print services.
Table 3 Comparisons of APCS+PEV with APCS and MSCS 2003

Item | MSCS 2003 | APCS | APCS+PEV
Dedicated shared storage | Needed | No need | No need
Heartbeat mechanism | Broadcast by each node | Only sent by Master | No need
Private network | Needed | No need | No need
State recovery | No | Yes | Yes
Sickness detection & TTF prediction | No | No | Yes
Fig. 12 Deployment considerations by applying APCS and PEV: (a) with APCS; (b) with APCS and PEV
In a cluster, multiple servers (nodes) remain in constant communication so that, if one of the nodes becomes unavailable as a result of failure or maintenance, another node immediately takes over its service. Such a process is known as failover. Based on the configuration instruction of the MSCS 2003 documentation (Microsoft Cluster Service, 2003), MSCS 2003 requires two peripheral-component-interconnect (PCI) network adapters, certified for the hardware compatibility list (HCL), in each node of the cluster. The configuration instruction directs the users to configure one of the network adapters in each node on the production network with a static Internet protocol (IP) address for the public network, and to configure the other network adapter in each node on a separate network for private cluster communication only. All network cards on the public network need to be on the same logical network (same subnet) regardless of their physical location. From the above description, private networks are required in MSCS 2003 (Microsoft Cluster Service, 2003).

Table 3 lists the comparisons of APCS+PEV with APCS and MSCS 2003. Basically, MSCS 2003 requires a dedicated shared storage for each group of resources; this dedicated shared storage is different from the system database. On the other hand, the APCS and APCS+PEV do not require dedicated shared storage and can use the system database as the shared storage. Owing to this difference, MSCS 2003 is more costly to install. Each node using MSCS 2003 broadcasts its heartbeat to all the other nodes in a group, while the heartbeat mechanism of the APCS only allows the Master node to send heartbeats to the Slave nodes; clustering services utilizing the APCS+PEV do not require a heartbeat mechanism at all. MSCS 2003 requires a private (node-to-node) network to support the heartbeat mechanism, while the APCS applies a public (cluster-to-client) network to send heartbeats; the APCS+PEV does not require a private network either. MSCS 2003 has no state recovery scheme, while the APCS and APCS+PEV do have state recovery schemes. Neither MSCS 2003 nor the APCS has the capabilities of node sickness detection and time-to-failure (TTF) prediction. If the clustering service integrates the PEV with the APCS, then the health of individual nodes can be monitored; furthermore, if the PEV detects that a node is in the sick state, the PEV may also predict the time to failure of this sick node and ask the APCS to perform the failover operations before the node breaks down.

VI. AVAILABILITY ANALYSIS FOR APPLYING APCS+PEV

The availability of the APCS+PEV, RA+P, is
derived to evaluate the improvement of the overall system availability. Suppose that a system has N similar nodes with the same availability, R, and that the clustering system needs M nodes to function properly. In this case, N − M backups are involved. Assume that the reliability of the PEV is RPEV. Then, RA+P can be expressed as (Cheng, 2004b):

RA+P = RPEV × (Σ_{i=0}^{N−M} C_i^N (1 − R)^i R^(N−i)) = RPEV × RAPCS.    (3)

Table 4 Availability improvement with M = 1 and RPEV = 1

      | R = 0.9 | R = 0.95 | R = 0.98 | R = 0.99
N = 2 | 0.09 | 0.0475 | 0.0196 | 0.0099
N = 3 | 0.099 | 0.049875 | 0.019992 | 0.009999
N = 4 | 0.0999 | 0.04999375 | 0.01999984 | 0.00999999
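A short hypothetical C++ sketch of Eq. (3) is given below; with M = 1 and RPEV = 1 it reproduces the Table 4 entries (e.g., N = 2 and R = 0.99 give an improvement of 0.0099).

#include <cmath>

double binomial(int n, int k)                              // C(n, k)
{
    double c = 1.0;
    for (int i = 1; i <= k; ++i) c = c * (n - k + i) / i;
    return c;
}

double availabilityAPCS(int N, int M, double R)            // the summation in Eq. (3)
{
    double sum = 0.0;
    for (int i = 0; i <= N - M; ++i)
        sum += binomial(N, i) * std::pow(1.0 - R, i) * std::pow(R, N - i);
    return sum;
}

double availabilityAPCSplusPEV(int N, int M, double R, double Rpev)
{
    return Rpev * availabilityAPCS(N, M, R);               // R_A+P = R_PEV x R_APCS
}

double improvement(int N, int M, double R)                 // Table 4: R_APCS - R^M (with R_PEV = 1)
{
    return availabilityAPCS(N, M, R) - std::pow(R, M);
}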
The availability of the original system without any backup is R0 = R^M. Therefore, the improvement of the availability is RA+P − R0. Table 4 tabulates the availability improvement with M = 1 and RPEV = 1. Table 4 shows that N = 2 is required for R = 0.99, N = 3 for R = 0.98, and N = 4 for R = 0.95 to guarantee that RAPCS exceeds or equals 0.9999. If a mirror-type backup is prepared for the PEV and the reliability of each node of this mirror-type PEV is 0.99, then RPEV = 0.99 × 0.99 + (1 − 0.99) × 0.99 + 0.99 × (1 − 0.99) = 0.9999. Consequently, we have RA+P = RPEV × RAPCS = 0.9999 × 0.9999 = 0.9998.

The international technology roadmap for semiconductors (ITRS) states that the scheduled and non-scheduled down time of a FICS ranges from 480 min in 2003 to 180 min in 2009, which means that the availability of the FICS should exceed 0.9997. From the above analysis, if only one healthy node is required for proper FICS functioning, then for R = 0.99 a total of four nodes (two for the APCS and two for the PEV) are required to ensure that the FICS has 0.9998 availability.

VII. ILLUSTRATIVE EXAMPLE

Figure 13 presents an illustrative example for an IC-packaging factory involving two nodes, several pieces of equipment, and one shared storage.
Fig. 13 Illustrative examples with APCS and APCS+PEV
Two PEVs with a mirror-type backup configuration are also shown in the dotted-line box. The first version of the illustrative example applies the APCS only; therefore, these two PEVs are not required. The second version utilizes the APCS+PEV. Two identical equipment management application programs (EM1 and EM2) are installed, with EM1 running in Node 1 and the backup EM2 in Node 2. Similarly, two identical work-in-process (WIP) tracking application programs (WT1 and WT2) are installed, with WT1 running in Node 2 and the backup WT2 in Node 1. The APCS resides at both Node 1 (APCS1) and Node 2 (APCS2).

The first version of the clustering system (with the APCS only) operates as follows. Under the normal condition, EM1 is running at Node 1 to manage several pieces of equipment, WT2 is a backup and remains at Node 1, WT1 is running at Node 2 to handle the WIP tracking task, and EM2 is a backup and resides at Node 2. If APCS2 detects a failure of WT1, then APCS2 notifies APCS1 to invoke WT2 to recover the service provided by WT1. In another case, if APCS2 detects that Node 1 is down via the heartbeat mechanism, then APCS2 launches EM2 to resume the management of those pieces of equipment originally managed by EM1. However, in this case, the services provided by EM may be paused.

The second version of the clustering system (with the APCS+PEV) performs as follows. Under the normal condition, EM1 and WT1 are running at Node 1 and Node 2, respectively. If APCS2 detects a failure of WT1, then APCS2 notifies APCS1 to invoke WT2 to recover the service provided by WT1.
Table 5 Performance evaluation of SASSR

No. of objects | 10 | 250 | 500 | 750 | 1000
Backup time (sec) (WT object state size = 10 KB) | 0.21 | 5.298 | 10.705 | 16.150 | 21.591
Recovery time (sec), Case 1 (WT down only) | 0.11 | 1.633 | 3.526 | 5.388 | 7.610
Recovery time (sec), Case 2 (Slave node down) | 0.13 | 2.063 | 3.755 | 5.848 | 7.842
Recovery time (sec), Case 3 (Master node down) | 0.35 | 2.504 | 4.397 | 6.309 | 8.342
In another case, if the PEV detects that Node 1 is sick and also predicts its time to failure, then the PEV informs APCS2 to have EM2 take over the service originally provided by Node 1 before Node 1 breaks down. In this case, continuous service by EM is assured. Notably, if the availability/reliability is 0.99 for Node 1, Node 2, PEV1, and PEV2, then the second version of the clustering system (applying the APCS+PEV) shown in Fig. 13 has an availability of 0.9998, as analyzed in Section VI. On the other hand, even if the PEV's detection that "Node 1 is sick" is a false alarm, the PEV still predicts Node 1's time to failure and informs APCS2 to have EM2 take over the service originally provided by Node 1 before Node 1 breaks down. Therefore, continuous service by EM is still assured, which is the most important goal to fulfill; the false alarm is recorded and then fixed by a maintenance engineer later.

1. Performance Evaluation of SASSR

The performance evaluation of the SASSR is executed using this illustrative example. The personal computers for Node 1, Node 2, and the shared-storage server are P4 1.4G (512 MB), P4 1.4G (512 MB), and P4 1.8G (512 MB), respectively. Initially, Node 2 serves as the Master and Node 1 as a Slave. Microsoft Windows 2000 SP4 is adopted as the operating system for Node 1, Node 2, and the shared-storage server. The shared-storage server also installs Microsoft SQL 2000. The network bandwidth is 100 Mbits per sec. The WIP tracking (WT) application program is selected as the evaluation sample. The performance evaluation results of the SASSR for WT are tabulated in Table 5. The file size for maintaining the state values of a typical WT object is about 1.6 Kbytes; however, in this evaluation, 10 Kbytes are assigned for tolerance consideration. The number of WT objects depends on the number of WIPs (Cheng et al., 2004a). In general, the WIP count of a production line may vary from 0 to 1000. Therefore, the numbers of objects are assigned to be 10, 250, 500, 750, and 1000 for the evaluation.

Table 5 shows that the backup time increases linearly with the number of objects. Typically, for 500 objects, it takes about 10.705 sec to accomplish
the entire backup process. Three cases are tested for evaluating the recovery time. Case 1 assumes that only the WT application program is down, while Node 1 and Node 2 are still up. Case 2 assumes that WT is installed and running in a Slave node and the Slave node is down. Case 3 assumes that WT is installed and running in the Master node and the Master node is down. As expected, the recovery time increases as the number of objects increases. The recovery time of Case 1 is less than that of Case 2 because Case 1 does not need to replace the failed Slave node. The recovery time of Case 3 is higher than that of Case 2 because Case 3 needs to re-elect a new Master. Finally, the backup time is longer than the recovery time mainly because the data-storing time of the shared-storage server is longer than its data-retrieving time.

The typical WT state-change period of an in-line IC-packaging production line is at least 30 sec. Table 5 shows that even when the number of objects is about 500, the backup time is only 10.705 sec for a file size of 10 Kbytes for maintaining the state values. If the file size is reduced to 1.6 Kbytes, which is the real figure taken from an actual production line, then the backup time will be less than 10 sec. Therefore, the SASSR is practical and feasible.

2. Discussion

This work assumes that the availability of the shared storage is 1 and that the network availability is 1. The shared storage is actually the system database, whose availability must be 1 according to practical considerations, while this study assumes that a network availability of 1 can be achieved by a network resource failover scheme. Future research will examine how to incorporate a network resource failover scheme into the clustering system using the APCS+PEV.

The recovery times shown in Table 5 are for test cases involving two nodes (scale of cluster = 2) only. In fact, the fault recovery times will be affected by the scale of the cluster, as explained below. Besides the brief explanation stated in Step (v) of Section II.1 (Failover Scheme), the detailed Slave replacement processes (for a scale of cluster > 2) are described as follows. After a Slave breakdown is confirmed, the
Master stops sending heartbeats to the Slave, notifies all of the other nodes about the breakdown of the Slave, updates the state of the Slave in the shared storage to "DOWN", and excludes this failed Slave from the cluster. Then, the Master selects a healthy Slave by checking the information in the shared storage and notifies the backup Slave to continue providing the application services of the failed Slave. Similarly, the detailed Master replacement processes (for a scale of cluster > 2) are depicted below. Once the failure of the Master is confirmed, the first Slave that discovered the failure updates the state of the failed Master to "DOWN" in the shared storage and excludes the failed Master from the cluster. Then, all of the Slaves enter a reconfiguring state to select a new Master. During the reconfiguring state, each of the nodes checks its starting time, which is stored in the shared storage, and the earliest-started node is selected as the new Master. The new Master then starts sending heartbeats to all of the Slaves, and each Slave initiates its timer and waits for the arrival of heartbeats. Furthermore, the new Master checks the statuses stored in the shared storage for the applications originally provided by the failed Master. As such, for a scale of cluster > 2, some administrative steps (such as selecting a healthy Slave when a Slave is down, or entering a reconfiguring state to select a new Master when the Master is down) are required. However, the execution time of those administrative steps is less than a second. Therefore, the scale of the cluster will not affect the total fault recovery time much.

VIII. CONCLUSIONS

Two novel schemes for high availability clustering services are proposed in this work. The first scheme utilizes the APCS only. The APCS has a concise heartbeat mechanism that is executed via the public network; besides, the APCS possesses both failover and state recovery schemes. The second scheme applies both the APCS and the PEV. The PEV can detect whether a node is sick; moreover, if a node is sick, the PEV can also forecast its time to failure. Consequently, the heartbeat mechanism is not required for the scheme applying the APCS+PEV. Because the PEV can predict the time to failure of a sick node and notify the APCS of the backup node to perform the failover process before the node breaks down, near-zero-downtime services can be guaranteed.

ACKNOWLEDGEMENTS

The APCS and PEV have Taiwan, R.O.C. patent numbers I235299 and I292091, respectively. The authors would like to thank the National Science Council of the Republic of China for financially supporting this
research under Contracts No. NSC-96-2221-E-006-279-MY3 and NSC-96-2221-E-006-280-MY3.

NOMENCLATURE

APCS   Application Cluster Service
COM    Component Object Model
CORBA  Common Object Request Broker Architecture
EM     Equipment Management
FICS   Factory Information and Control System
FSM    Failover Service Manager
GEV    Generic Evaluator
HCL    Hardware Compatibility List
IP     Internet Protocol
ITRS   International Technology Roadmap for Semiconductors
ITS    Internet Time Service
LB     Lower Bound
MSCS   Microsoft Cluster Service
MSDN   Microsoft Development Network
MTTF   Mean Time To Failure
OS API Operating System Application Programming Interface
PC     Personal Computer
PCI    Peripheral Component Interconnect
PEV    Performance Evaluator
RMI    Remote Method Invocation
SASSR  Software Architecture Scheme for State Recovery
SMS    Service Management Scheme
SPC    Statistical Process Control
TTF    Time To Failure
UB     Upper Bound
WIP    Work In Progress
WT     WIP Tracking

REFERENCES

Arnold, K., O'Sullivan, B., Scheifler, R. W., Waldo, R., and Wollrath, A., 2000, The Jini Specifications, 2nd ed., Addison-Wesley, USA.
Brown, P. J., 1994, Measurement, Regression, and Calibration, Oxford University Press, USA.
Cheng, F.-T., Chang, C.-F., and Wu, S.-L., 2004a, "Development of Holonic Manufacturing Execution Systems," Journal of Intelligent Manufacturing, Vol. 15, No. 2, pp. 253-267.
Cheng, F.-T., Yang, H.-C., and Tsai, C.-Y., 2004b, "Developing a Service Management Scheme for Semiconductor Factory Management Systems," IEEE Robotics and Automation Magazine, Vol. 11, No. 1, pp. 26-40.
Cluster HA, 2008, High Availability Center.com, Available: http://www.highavailabilitycenter.com/.
Clustering Center.com, 2002, PC Cluster, Clustering Center.com, Available: http://www.clusteringcenter.com/.
Command Pattern, 2008, Design Pattern, Available: http://www.dofactory.com/Patterns/PatternCommand.aspx.
Draper, N., and Smith, H., 1998, Applied Regression Analysis, 3rd ed., Wiley-Interscience, USA.
Gamache, R., Short, R., and Massa, M., 1998, "Windows NT Clustering Service," IEEE Computer, Vol. 31, No. 10, pp. 55-62.
Gamma, E., Helm, R., Johnson, R., and Vlissides, J., 1994, Design Patterns: Elements of Reusable Object-Oriented Software, 1st ed., Addison-Wesley Professional, USA.
Guo, H., Zhou, J., Li, Y., and Yu, S., 2004, "Design of a Dual-Computer Cluster System and Availability Evaluation," Proceedings of the 2004 IEEE International Conference on Networking, Sensing and Control, Taipei, Taiwan, Vol. 1, pp. 355-360.
Huang, Y., Kintala, C., Kolettis, N., and Fulton, N. D., 1995, "Software Rejuvenation: Analysis, Module and Application," Proceedings of the 25th IEEE International Symposium on Fault-Tolerant Computing, Pasadena, CA, USA, pp. 381-390.
Internet Time Service, 2007, NIST, Available: http://tf.nist.gov/service/its.htm.
ITRS, 2003, Factory Integration - International Technology Roadmap for Semiconductors.
Johnson, T., Muthukrishnan, S., Shkapenyuk, V., and Spatscheck, O., 2005, "A Heartbeat Mechanism and its Application in Gigascope," Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, pp. 1079-1088.
Marcus, E., and Stern, H., 2000, Blueprints for High Availability: Designing Resilient Distributed Systems, John Wiley & Sons, NY, USA.
Memento Pattern, 2008, Design Pattern, Available: http://www.dofactory.com/Patterns/PatternMemento.aspx.
Meyer, B., 1992, "Applying Design by Contract," IEEE Computer, Vol. 25, No. 10, pp. 40-51.
Microsoft Cluster Service, 2003, Microsoft, Available: http://www.microsoft.com/windowsserver2003/evaluation/overview/technologies/clustering.mspx.
Microsoft Development Network, 2008, Microsoft, Available: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/counter/counters1_zugs.asp.
Parnas, D. L., 1994, "Software Aging," Proceedings of the 16th IEEE International Conference on Software Engineering, Sorrento, Italy, pp. 279-287.
Performance Monitoring, 2008, Microsoft, Available: http://www.microsoft.com/resources/documentation/windows/2000/professional/reskit/en-us/part6/proch27.mspx.
Piedad, F., and Hawkins, M., 2001, High Availability: Design, Techniques and Processes, Prentice Hall, NY, USA.
PolyServe, 2008, Matrix HA/Server, PolyServe Corporation, Available: http://www.polyserve.com/.
Sun, H., Han, J. J., and Levendel, H., 2003, "Availability Requirement for a Fault-Management Server in High-Availability Communication Systems," IEEE Transactions on Reliability, Vol. 52, No. 2, pp. 238-244.
Vogels, W., Dumitriu, D., Birman, K., Gamache, R., Massa, M., Short, R., Vert, J., Barrera, J., and Gray, J., 1998, "The Design and Architecture of the Microsoft Cluster Service - A Practical Approach to High-Availability and Scalability," Proceedings of the 28th Symposium on Fault-Tolerant Computing, CS Press, Munich, Germany, pp. 422-431.
Zonghao, H., Yongxiang, H., Shouqi, Z., Xiaoshe, D., and Bingkang, W., 2003, "Design and Implementation of Heartbeat in Multi-Machine Environment," Proceedings of the 17th International Conference on Advanced Information Networking and Applications, Xi'an, China, pp. 583-586.

Manuscript Received: Aug. 23, 2007
Revision Received: Dec. 20, 2007
Accepted: Mar. 19, 2008