Computer Communications 27 (2004) 1341–1353 www.elsevier.com/locate/comcom
An alarm management framework for automated network fault identification

Chi-Shih Chao*, An-Chi Liu

Department of Information Engineering, Feng Chia University, 100 Wenhua Road, Seatwen, Taichung 407, Taiwan, ROC

Received 17 January 2003; revised 9 March 2004; accepted 13 April 2004. Available online 18 May 2004. doi:10.1016/j.comcom.2004.04.009

* Corresponding author. Tel.: +886-42-451-7250/2760; fax: +886-42451-7110. E-mail addresses: [email protected], [email protected] (C.-S. Chao).
Abstract

Many timing constraint (or real-time) distributed systems, such as real-time database systems, are now being used in safety critical applications. However, they are subject to system failures caused by the malfunction of underlying network components. Without the help of network experts or sophisticated management tools, most users cannot resolve these network problems by themselves. Worse, the use of such management tools, e.g. the 'ping' command, is often prohibited for security's sake. Accordingly, we develop a management system that automates network fault identification based merely on the analysis of the abnormal events of the monitored timing constraint distributed system. In this system, a fault identification framework is designed to automatically identify faulty network elements by using a two-level fault propagation model which combines Timing Constraint Petri nets with an alarm clustering mechanism. In addition, the concepts of redundant/ringleader alarms and innocent network elements are introduced into the framework to obtain an effective diagnosis. Finally, the management system is implemented according to the framework to demonstrate the performance of our fault identification. © 2004 Elsevier B.V. All rights reserved.

Keywords: Automated fault identification; Timing constraint Petri net; Alarm clustering; Redundant/ringleader alarm; Innocent network element
1. Introduction

Many real-time distributed systems are now being used in safety critical applications, in which human lives or expensive machinery may be at stake. Transactions in these systems should be scheduled considering both data consistency and timing constraints. Fault management in such systems is of the utmost importance because a trivial fault can propagate and cause costly damage to the entire system [1,2]. However, for network and system security, it is very hard for users of timing constraint distributed systems to obtain effective fault management because some management activities (like using ICMP packets) are prohibited [3-5]. When a system failure occurs, the system users can easily get lost in a large number of system messages and logs. Thus, it would benefit users greatly if the capability of
automated network fault identification became part of timing constraint distributed system management.

Much related work has been done around the world. Gambhir et al. [6] develop an SNMP-based distributed software fault localization architecture with a 'local directed graph (LDG)' and 'node ordering' to 'localize' a fault occurring in a real-time distributed software system. However, with the LDG and node ordering mechanisms, the distributed fault localization architecture can only provide a set of 'localized process fragments' as its final answer, not the real root fault. Bouloutas et al. [7] and Katzela [8] use a hierarchical formal specification to describe the dependency relationships among the component elements of a networked system. Based on the dependency relationships, a new concept named alarm clustering is proposed to localize the most probable faulty components. To obtain the best deduction for each alarm cluster, it is necessary to set the prior failure probability of each system component; however, this is hard to do in practice. In addition, the timing of alarms is not taken into account. Kätker and Geihs [9] integrate a generic model into TMN (Telecommunication Management Network), where fault isolation is accomplished by using
a four-level dependency hierarchy of system services. The dependency hierarchy is built manually to correlate the system-level abnormal symptoms, and system users are notified which system service does not work properly when the upper-level applications are out of order. Nevertheless, because the timing factor (the sequence of system execution) is excluded from the dependency hierarchy, the diagnosis results are not fully trustworthy.

In our work, a framework is introduced for the automated fault identification of timing constraint distributed systems. Within this framework, the sequence of system execution can be specified by a Timing Constraint Petri Net (TCPN) [10] and is used for our fault identification. In the first stage of our fault identification, an effective failure detection strategy is adopted to detect the occurrence of a system failure. The alarms derived from the TCPN's frozen tokens are also collected. In the second stage, a two-level fault propagation model with the system's TCPN specification is employed to calculate alarm clusters by correlating the alarm tokens. Each alarm cluster contains all the suspicious network components capable of forming the cluster. The concepts of redundant alarms and innocent elements are also included to obtain a more effective system diagnosis. In the third stage, based upon the best explanation convention, the most suspicious faulty network elements are deduced from each alarm cluster. A prototype system is built according to the fault identification framework. This system not only provides effective fault identification, but also uses a standard-based network management architecture for heterogeneity.

The rest of the paper is organized as follows. Section 2 describes the two-level fault propagation model for timing constraint distributed systems. TCPN is used for model establishment and failure detection. Two categories of alarms are defined to remove the unnecessary alarms' domains and the innocent network elements from alarm clustering for the best diagnosis. A fault identification
framework with a standard-based management architecture is shown in Section 3, and the corresponding algorithm is also presented. Section 4 shows the framework's implementation to demonstrate the diagnosis effectiveness. Section 5 concludes the paper and describes our future work.
2. Fault propagation modeling in timing constraint distributed systems

To develop a competent and practical system for automated fault identification in a timing constraint distributed environment, we require a complete understanding of the fault behavioral pattern. The term 'fault' used in this paper refers to the malfunction or failure of a network component that causes the behavior of the monitored timing constraint distributed application to deviate from that given by its specification. In this situation, a user may perceive that something is wrong with the system in use by keeping track of the system's execution and analyzing it. Accordingly, in our work, a bipartite causality graph is used to model and analyze the fault propagation. Fig. 1 shows a simple fault propagation model with a two-level causality graph. In the propagation model, the execution of the specified timing constraint distributed system and the connection of the system network are embedded into the alarm level and the fault level, respectively.

2.1. Modeling timing constraint distributed systems with TCPN

The execution of a distributed system can be treated as a series of state transformations and specified by Petri nets, which consist of places and transitions standing for system states and designated events, respectively [11-13]. A timing constraint distributed system comprises timing-restricted task components (or processes) [10], where each timing-restricted task component is characterized by two events: the beginning event at which a task begins its execution and the ending event at which a task completes its execution.
Fig. 1. Fault propagation in timing constraint distributed systems.
A Petri net is a particular kind of directed graph suitable for modeling concurrent behaviors, such as the non-deterministic phenomena of distributed systems [14]. A typical Petri net has transitions, places, directed arcs, and tokens. TCPN [10] extends Petri nets by associating a maximum timing constraint TC_max(x) and a minimum timing constraint TC_min(x) with each place p_j (x = p_j) or transition t_j (x = t_j), and a duration timing constraint FIRE_dur(t_j) with each transition t_j. This makes TCPN suitable for modeling the execution behavior of timing constraint or real-time distributed systems [10]. Users can determine the best approximation of the earliest fire beginning time (EFBT) and the latest fire ending time (LFET) of each strongly firable (schedulable) transition from the initial assignment of some transitions' timing constraints. In our work, TCPN is used to build a proper framework for failure detection and fault identification of timing constraint distributed systems. The TCPN we use here is ordinary, conflict-free, and not in a strong firing mode [14]. In addition, the fire-or-not analysis of the TCPN [10] for the monitored distributed system must be made in advance; that is, the LFET and EFBT of all the transitions are preset.
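To make the notation concrete, the following is a minimal sketch (in Python, for illustration only; the class and field names are our own, not part of the TCPN definition in [10]) of one possible in-memory representation of a TCPN with the timing attributes used in this paper:

from dataclasses import dataclass, field

@dataclass
class Place:
    name: str                      # e.g. "p6_a" for place p_6 of process a
    has_token: bool = False        # a token frozen here becomes an alarm on failure

@dataclass
class Transition:
    name: str                      # e.g. "t2_a"
    tc_min: float = 0.0            # minimum timing constraint TC_min(t)
    tc_max: float = float("inf")   # maximum timing constraint TC_max(t)
    fire_dur: float = 0.0          # duration timing constraint FIRE_dur(t)
    efbt: float = 0.0              # earliest fire beginning time (preset by fire-or-not analysis)
    lfet: float = 0.0              # latest fire ending time (preset by fire-or-not analysis)

@dataclass
class TCPN:
    places: dict = field(default_factory=dict)       # place name -> Place
    transitions: dict = field(default_factory=dict)  # transition name -> Transition
    inputs: dict = field(default_factory=dict)       # transition name -> list of input place names
    outputs: dict = field(default_factory=dict)      # transition name -> list of output place names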
Definition 2.1. The system failure of a timing constraint distributed system occurs if LFET(t_e) < T_exec, where t_e is the ending event and T_exec is the measured execution time of the system.

For example, in Fig. 2 we say there is a system failure if LFET(t_e) < T_exec. In this figure, p_i^j/t_i^j denotes the ith place/transition of the system's task component (process) j; p_s/t_s is the beginning place/
transition, p_e/t_e is the ending place/transition, and p_i^{j/k} denotes the ith place belonging to both process j and process k at the same time. The reason for using LFET(t_e) as the threshold of a system failure, instead of an arbitrary LFET(t_i) where t_i is one of the transitions of the TCPN specification, will be explained in Section 2.2.
2.2. Two-level fault propagation model

When a system failure occurs, the tokens in the TCPN will stop at some places, commonly called marked places, such as p_6^a and p_10^{a/b} in Fig. 2. It follows that there is a frozen marking M_fail if a system failure is detected. In Fig. 2, M_fail = {p_6^a, p_10^{a/b}}. Our goal is to use M_fail to find the most suspicious network elements that make the monitored system fail.
Definition 2.2. When a system failure is detected, M_fail will be determined. Any token that stops at a place in M_fail can be viewed as an alarm.

As in Definition 2.1, Fig. 2 shows that there are two alarms, p_6^a and p_10^{a/b}, when the system fails. The presence of alarms reflects the existence of faulty network elements on which a timing constraint distributed system is running. A network manager has the responsibility of identifying the underlying causes of alarms and performing corrective actions as necessary. Thus, a data structure for the alarms is created to solve the fault identification problem. Each alarm a_i can be characterized by its domain, Domain(a_i) [8]. The domain of an alarm is the set of suspicious network elements which could trigger the alarm. Since we focus our fault identification on the network elements at fault, the domain of an alarm a_i
Fig. 2. The system failure.
(Domain(a_i)) involves all the network elements of an inter-process communication from the SEND transition of one process to the RECV transition of another, while a token is frozen as the alarm a_i. For example, in Fig. 2, Domain(p_10^{a/b}) contains all the network elements along the communication path from the SEND transition t_7^b to the RECV transition t_5^a: it includes link(e-f), link(f-g), and link(g-h). As for p_6^a, which has no matched SEND transition, we assume Domain(p_6^a) is ∅, indicating that no network elements are suspicious for p_6^a. Such alarms, like p_6^a, are called null alarms.

To deduce the suspicious faulty elements, the concept of alarm clustering is used [7]. An alarm belongs to a cluster if its domain has an intersection with the domain of at least one alarm that belongs to this cluster. Two different clusters share no intersection. When a system fails, a number of alarms may appear from the network. Any network element that appears in the domain of an alarm is a possible suspect. If two or more alarms belong to the same cluster, these alarms should be examined together because it is likely that they can be explained by the same set of faults [8]. If the collected alarms create more than one cluster, then for each cluster we find the set of faults that best explains it. The union of all such sets, one for each cluster, gives the best explanation of the collected alarms. The best explanation of an alarm cluster should be the set of network elements whose total number of network elements is minimum among all the sets that can explain the alarm cluster. This explanation can be justified by assuming that all network elements have the same probability of being at fault; it is based on the convention that few network elements are usually at fault during any time interval [7]. Fig. 3 shows a cluster consisting of 10 network elements (E_1-E_10) and five alarms (A_1-A_5). Using the best explanation for this cluster of alarms, the most suspicious network elements are E_1, E_6, and E_8.

On the basis of the best explanation, we propose two strategies to improve the performance of alarm clustering in a timing constraint distributed environment: (1) ringleader/redundant alarms and (2) innocent network elements. To describe the first strategy, we assume the firing of one designated event t_i depends on another event t_j's firing; that is, if t_j fails to fire, t_i will have no chance to fire.
Fig. 3. An example of the best alarm cluster explanation.
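The best-explanation search itself is a brute-force minimal-cover computation. The sketch below (Python) follows the convention stated above, trying explanations of size 1, 2, ... and stopping at the first size that covers every alarm; the five domains are hypothetical stand-ins chosen to reproduce the answer {E_1, E_6, E_8}, since Fig. 3's exact domains are not reproduced here:

from itertools import combinations

def best_explanation(domains):
    # domains: list of sets, one Domain(a_i) per alarm in the cluster.
    # Returns one smallest set of elements that intersects every domain.
    elements = sorted(set().union(*domains))
    for size in range(1, len(elements) + 1):
        for candidate in combinations(elements, size):
            if all(set(candidate) & d for d in domains):  # candidate explains every alarm?
                return set(candidate)
    return set()

# Hypothetical domains in the spirit of Fig. 3 (five alarms A1..A5 over E1..E10).
domains = [{"E1", "E2", "E9"},    # Domain(A1)
           {"E1", "E3", "E10"},   # Domain(A2)
           {"E4", "E6"},          # Domain(A3)
           {"E5", "E6", "E7"},    # Domain(A4)
           {"E8"}]                # Domain(A5)
print(best_explanation(domains))  # prints {'E1', 'E6', 'E8'} (set ordering may vary)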
Fig. 4 shows a fragment of a TCPN specification with alarms p_1, p_3, and p_5 at the time of a system failure. It is apparent that t_4 and t_5 are waiting for the arrival of tokens at p_4 and p_6, respectively. However, the root cause is the unfiring of t_2 with alarm p_1. In our study, alarm p_1 is called the ringleader of alarms p_3 and p_5; p_3 and p_5 are called redundant alarms and should be excluded from the calculation of alarm clusters. More detailed definitions follow.
Definition 2.3. An alarm a_i is called a redundant alarm of another alarm a_j iff (1) any of a_i's sibling places without alarms (or with null alarms) can find an ascendant place with a frozen token in the TCPN at the time of the system failure, or (2) a_i is a null alarm. This ascendant place is termed the ringleader alarm of the redundant alarm.

The second strategy considers the possibility that some well-behaved network elements may be regarded erroneously as suspects even if the calculation of alarm clusters excludes redundant alarms. Even after a ringleader alarm occurs, some other related tokens can keep moving until they become redundant alarms of the ringleader alarm. Network elements used by these tokens are supposed to function well from the EFBT of the immediate-ascendant transition of the ringleader alarm to the moment these tokens stop. For this reason, some places located on the execution path of a redundant alarm become meaningful.
Definition 2.4. If p_i is a redundant alarm, some places located on p_i's execution path are needed by fault diagnosis. They lie between p_i's immediate-ascendant transition and the ascendant transition of p_i that is the first one whose EFBT is greater than (or equal to) the EFBT of the immediate-ascendant transition of p_i's ringleader alarm. The network elements (if any) in the domains of all such places behave well in spite of the system failure. These network elements are called p_i's innocent network elements and should not appear in the domain of p_i's ringleader alarm.

For example, in Fig. 4, even though there is a stopped token at p_1, a token running along the path t_0 → ... → t_6 → p_2 → t_3 may still move forward until it is stopped at place p_5. The network elements in the domain of p_2 are thought to behave well during the time interval [EFBT(t_1), LFET(t_3)]. Assume Domain(p_1) and Domain(p_2) are {link(e-f), link(f-g), link(g-h), link(g-j), link(j-k), link(k-l)} and {link(i-j), link(g-j), link(g-h)}, respectively. Then link(g-j) should not be included in the domain of alarm p_1 during the calculation of alarm clusters.

Accordingly, our fault identification for a timing constraint distributed system has to gather two kinds of information: effective alarms (ringleader alarms) and innocent network elements. To obtain these two kinds of information completely, the alerting time for a system failure
Fig. 4. Redundant alarms and innocent elements.
is defined as LFET(t_e). If the definition of a system failure used the timing constraint violation of an arbitrary TCPN transition, some available information could be overlooked. For instance, if the alert of a system failure were triggered as soon as the first unfired transition is detected, information on redundant alarms and innocent network elements might never be obtained correctly.

Using the combination of the TCPN's frozen tokens (comprising ringleader alarms and redundant alarms), the alarm clusters' faulty elements (excluding innocent ones), and the causal relationships between alarms and network elements, a two-level fault propagation model of timing constraint distributed systems can be established, as shown in Fig. 5. Using this model, our fault identification framework is built to achieve automated failure detection and fault identification.
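In data terms, the net effect of Definitions 2.3 and 2.4 on this model is plain set arithmetic: redundant alarms are dropped, and each surviving ringleader's domain is pruned of the innocent elements collected from its redundant alarms' execution paths. A minimal sketch, assuming the redundant/innocent classification has already been produced by the analysis described in Section 3:

def effective_domains(domains, redundant, innocent):
    # domains:   dict alarm -> set of suspect network elements
    # redundant: set of alarms classified as redundant (Definition 2.3)
    # innocent:  dict ringleader alarm -> set of innocent elements (Definition 2.4)
    # Returns dict ringleader alarm -> pruned domain used for alarm clustering.
    pruned = {}
    for alarm, dom in domains.items():
        if alarm in redundant:
            continue                                      # redundant alarms are excluded outright
        pruned[alarm] = dom - innocent.get(alarm, set())  # innocent elements are never suspects
    return pruned

# Fig. 4's example: p1 is the ringleader of redundant alarms p3 and p5, and
# link(g-j) is innocent for p1; the p3/p5 domains below are made up for illustration.
doms = {"p1": {"e-f", "f-g", "g-h", "g-j", "j-k", "k-l"},
        "p3": {"x-y"}, "p5": {"y-z"}}
print(effective_domains(doms, redundant={"p3", "p5"}, innocent={"p1": {"g-j"}}))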
3. Fault identification framework

In this section, a framework is introduced for automated fault identification of timing constraint distributed systems. By utilizing this framework, we can correlate the alarms derived from the corresponding TCPN's frozen tokens to deduce the most suspicious faulty network elements. For automated fault identification, a standard-based timing constraint distributed system management architecture is used by the framework for event collection and system management in a heterogeneous distributed environment.
3.1. Timing constraint distributed system management architecture

Fig. 6 shows the timing constraint distributed system management architecture used in our work. The system manager is an NMS-based (Network Management Standard) manager, and the MIB (either GDMO MIB or Internet MIB) databases contain the descriptions of managed resources or managed objects (MOs). An MO can be a process described by a set of MIB variables either applicable to all processes or specific to a particular process. The agent, running on every cooperative machine, monitors and forwards the events generated by the cooperative components (or processes) to the system manager. The system manager then records these events in an execution history database and displays them immediately. As mentioned previously, the behavior of a monitored timing constraint distributed system is specified by a TCPN whose system states and designated events are defined by MIBs. Thus, the system manager compares the collected events (delivered from agents via SNMP or CMIP) with the global TCPN specification, or with the local TCPN specifications stored in the monitored system, to determine the current system state (execution behavior) and to move the TCPN's tokens [13]. To unify the management communication interface in a heterogeneous distributed environment, we use SNMP and CMIP as standard communication channels. However, the system manager is configured as an OSI manager because the management capability of GDMO MIB is better than that of SNMP MIB.
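As a rough illustration of this run-phase bookkeeping, the sketch below (reusing the TCPN classes from Section 2.1; the SNMP/CMIP and MIB plumbing is elided, and the encoding of an event as a plain transition name is our assumption) moves tokens when an agent reports a designated event:

def apply_event(tcpn, transition_name):
    # An agent reports that the designated event (transition) has occurred.
    # If the transition is enabled, consume its input tokens and produce output tokens.
    if all(tcpn.places[p].has_token for p in tcpn.inputs[transition_name]):
        for p in tcpn.inputs[transition_name]:
            tcpn.places[p].has_token = False
        for p in tcpn.outputs[transition_name]:
            tcpn.places[p].has_token = True   # token moves forward; history DB write elided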
Fig. 5. Two-level fault propagation model.
The SNMP-based agents communicate with the system manager via the message transformation of a gateway (also called a proxy). In our system, IQA [15] is used as the gateway.

3.2. Fault identification algorithm

The fault identification framework uses a three-stage process: (1) failure detection: as stated in Definition 2.1, when the program execution time T_exec exceeds LFET(t_e), an alert of system failure is triggered
to invoke the diagnosis system for the determination of M_fail; (2) alarm clustering: the domain of each ringleader alarm ra_i in M_fail is assigned. As stated in the latter half of Section 2.2, the innocent network elements for each ra_i are decided, and then all the alarm clusters of the ra_i alarms are calculated; (3) fault identification: the most suspicious network elements are selected from each cluster individually as the final answer.

In the failure detection stage, M_fail is determined for further fault identification when a system failure occurs. As shown in Fig. 2, M_fail = {p_6^a, p_10^{a/b}} consists of two alarms, p_6^a and p_10^{a/b}, which can be obtained by using a Reverse Depth First Search (RDFS).
Fig. 6. Distributed system management architecture.
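A compact sketch of the failure-detection test and of the reverse search for M_fail follows; the reverse-adjacency encoding of the TCPN (each node mapped to its immediate ascendants) is our own choice for illustration, and the stack-based traversal mirrors the RDFS walk from t_e along the dotted lines of Fig. 7:

def detect_failure(lfet_te, t_exec):
    # Definition 2.1: a system failure occurs iff LFET(t_e) < T_exec.
    return lfet_te < t_exec

def derive_m_fail(reverse_arcs, frozen, start="te"):
    # reverse_arcs: node -> list of immediate-ascendant nodes (places/transitions)
    # frozen: set of places currently holding a frozen token
    # Returns M_fail: the frozen-token places reachable backwards from t_e.
    m_fail, seen, stack = set(), set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in frozen:
            m_fail.add(node)          # a frozen token found here is an alarm
        stack.extend(reverse_arcs.get(node, []))
    return m_fail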
Fig. 7. Reverse alarm searching.

The search starts from t_e and follows the dotted lines as shown in Fig. 7 [11]. In the alarm clustering stage, based on Definition 2.3, the ringleader alarms are refined from M_fail by using RDFS. In Fig. 7, only p_10^{a/b} is the ringleader alarm, because p_10^{a/b}'s sibling place p_6^a (a null alarm) cannot find an ascendant place with a frozen token. After the assignment of the domain of each ringleader alarm ra_i, the innocent network elements of all of ra_i's redundant alarms (as stated in Definition 2.4) are removed from ra_i's domain. To find the innocent elements of each of ra_i's redundant alarms, RDFS is utilized again, searching from the immediate-ascendant transition of each redundant alarm. To obtain all of the alarm clusters, we have to check every pair of elements in which each element comes from a different ringleader alarm's domain. If one element appears in both checked alarm domains, the two alarms are combined into an alarm cluster. For example, in Fig. 8 there are three ringleader alarm domains and six network elements. After 21 comparisons (six for a_1 and a_2, six for a_2 and a_3, nine for a_1 and a_3) between any two elements of different alarm domains, we obtain an alarm cluster with two common elements, E_1 and E_4.

In the fault identification stage, based on the best explanation concept, we select the least number of elements that can cause an alarm cluster as the diagnosis answer of that cluster. For example, there are 16 possible explanations for the cluster in Fig. 8: {E_1, E_4}; {E_1, E_2, E_4}; {E_1, E_3, E_4}; {E_1, E_4, E_5}; {E_1, E_4, E_6}; {E_1, E_2, E_3, E_4}; {E_1, E_2, E_4, E_5}; {E_1, E_2, E_4, E_6}; {E_1, E_3, E_4, E_5}; {E_1, E_3, E_4, E_6}; {E_1, E_4, E_5, E_6}; {E_1, E_2, E_3, E_4, E_5}; {E_1, E_2, E_3, E_4, E_6}; {E_1, E_2, E_4, E_5, E_6}; {E_1, E_3, E_4, E_5, E_6}; and {E_1, E_2, E_3, E_4, E_5, E_6}. Clearly, {E_1, E_4} would be chosen as the best answer. The final answer to the system failure is the union of all the alarm clusters' solutions. The corresponding algorithm for the three-stage fault identification is as follows:
Algorithm fault_identification(TN, T_exec);
/* Given the TCPN, TN, and T_exec (as stated in Section 2) of the monitored distributed software, fault_identification determines the most suspicious faulty network elements in a system failure. Section 4 will show how to construct a TN in our implementation. */
Begin
  If LFET(t_e) < T_exec Then {  /* A system failure is detected. */
    (1) Derive the marking M_fail of frozen tokens (alarms) by using the Reverse Depth First Search starting from the t_e of TN.
    (2.1) Based on Definition 2.3, use the Reverse Depth First Search to retrieve ringleader alarms from M_fail.
    (2.2) As stated in Definition 2.4, obtain innocent network elements by using the Reverse Depth First Search and then remove them from their own ringleader alarm's domain.
    (2.3) With all the domains of ringleader alarms (each domain has no innocent network elements), derive the alarm clusters consisting of intersecting ringleader alarms.
Fig. 8. Alarm clustering.
    /* If fault_identification disregards redundant alarms, the ringleader alarms used in (2.3) are replaced by the alarms in M_fail. */
    (3) For each alarm cluster {
      /* Assume all the network elements have the same probability of being at fault in the alarm cluster containing k network elements. */
      (3.1) Search for a single faulty element that could explain all the alarms;
      (3.2) Search for two faulty elements that could explain all the alarms;
      ...
      (3.k) Search for a combination of k faulty elements that could explain all the alarms;
      (3.k+1) Select one solution from phases (3.1) through (3.k) such that it not only explains all the alarms in the alarm cluster but also has the minimum number of network elements;
    } /* End of For. */
    (4) Output the union of each alarm cluster's solution as the final answer. /* Output the best explanation for the received alarms. */
  } /* End of If. */
End.

In each (3.m) phase of the fault_identification algorithm, there are C(n, m) possible combinations of explanation, where m = 1 to k and n is the total number of network
elements in the alarm cluster. This number is bounded by n^m. Thus, phase (3) is bounded by O(cN^k), where N is the total number of network elements in the maximum alarm cluster and c is the total number of alarm clusters. As for the running times of phase (1) and phase (2), they were discussed in Refs. [8,11]. Their upper bounds are pt and apt + C(r, 2)e^2, respectively, where e is the total number of network elements in the maximum ringleader alarm domain, and a, p, t, r are the total numbers of alarms, TCPN places, TCPN transitions, and ringleader alarms, respectively.
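Phases (2.3) and (4) likewise reduce to short set routines. The sketch below is a minimal Python rendering of that set logic, not the SD2 implementation itself; best_explanation is the brute-force search sketched in Section 2:

def cluster_alarms(domains):
    # domains: dict ringleader alarm -> pruned domain (output of phase (2.2)).
    # Two alarms share a cluster iff their domains intersect, transitively (phase (2.3)).
    clusters = []
    for alarm, dom in domains.items():
        touching = [c for c in clusters if any(dom & domains[a] for a in c)]
        for c in touching:
            clusters.remove(c)
        clusters.append(set().union({alarm}, *touching))
    return clusters

def fault_identification(domains):
    answer = set()
    for cluster in cluster_alarms(domains):                  # phase (3), per cluster
        answer |= best_explanation([domains[a] for a in cluster])
    return answer                                            # phase (4): union of cluster solutions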
4. System implementation and performance evaluation

To demonstrate our fault identification framework, SD2 (A System for Distributed Software Development) [11,13] was designed for developing distributed software systems in a heterogeneous distributed environment. Fig. 9 shows the implementation of SD2, which provides five phase functions: preprocess, dispatch, compile, run, and debug. In the preprocess phase, a user inputs a CSP-like textual file which specifies all the communication events of a specific timing constraint distributed system. The textual file is then decomposed by the CSP Transformation Subsystem into several cooperative C program files including embedded monitor calls. The corresponding Petri nets and IPC dataflow are also produced automatically [16]. The timing constraint parameters of each transition of the TCPN are assigned by the user. Later, in the dispatch phase, Global
Fig. 9. The implementation of SD2.
Monitor dispatches these cooperative C modules onto the system machines which will run them. Next, in the compile phase, these C modules are compiled individually on those machines, and then, in the run phase, the System Manager catches the designated events from local agents via standard management protocols and drives the Petri Nets Monitor to show the program execution status by moving token(s). All the triggered events of the running program are recorded in a database. Finally, in the debug phase, the Debugger retrieves history tables from the database to replay the program execution by using the Event Code View [17] and Petri nets. In addition, the Fault Diagnostician shows the most suspicious faulty network elements if a system failure has occurred.

Now a real scenario is used to illustrate the process of our fault identification in SD2, utilizing the frozen tokens (or alarms) of the monitored system during a system failure. The timing constraint voting mechanism with a two-phase commit protocol is widely used in many distributed timing-critical systems for failure recovery, such as distributed real-time databases [2]. In the example of Fig. 10, the timing constraint voting mechanism is performed by the coordinator (process A) to select the process with the largest process identifier from the backup group {B, C, D}. Process A then uses the backup data stored in process B's repository as the major resource for failure recovery, within a time limit. In Step 1, process A multicasts a request to the backup group to ask for their identifiers. After receiving the request from process A, each of {B, C, D} sends its identifier back to process A (Step 2). Since process A has no idea about the arrival sequence of the reply messages, receiving messages from {B, C, D} is a non-deterministic event for process A. When process A receives all the process identifiers from {B, C, D} within a designated time limit, it selects the one with the largest process identifier value. Fig. 11 shows the Petri net representation of the voting system. The assignments of transitions/places in Fig. 11 and their
attached timing constraints are listed in Appendices A and B, respectively. In the case of Fig. 10, the link between Node3 and the Ethernet backbone breaks down before process A's multicast is done (i.e. before Step 1). When the time LFET(t_e) is up, there must be some frozen tokens in the TCPN's places (Fig. 11), and then Algorithm fault_identification is triggered by SD2's Fault Diagnostician. Table 1 shows the detailed results of comparing three different diagnosis methods based on the fault_identification algorithm described in Section 3.2. In phase (1) of Method 1, the alarms of M_fail are derived and passed to the next phase. Since our fault identification framework is developed to identify the network elements at fault, p_s, t_s (system initialization), p_e, t_e (system ending), p_1^a, p_8^b, p_13^c, p_18^d (process ready to go), t_3^a, t_6^b, t_9^c, t_12^d (process ending), and p_7^a, p_12^b, p_17^c, p_22^d (process ready to die) can be excluded from fault_identification. Phase (2) separates a redundant alarm, p_11^{b/a}, from the ringleader alarms {p_2^a, p_4^{a/c}, p_5^{a/d}}. The corresponding domains of p_2^a, p_4^{a/c}, p_5^{a/d}, and p_11^{b/a} can also be figured out: ∅; Links{(c-d), (c-f), (e-f)}; Links{(c-d), (c-f), (e-f)}; and Links{(e-f), (f-g), (g-h)}, respectively. According to the definition of an innocent network element, the network elements in the domains of p_3^{a/b} and p_9^b are innocent, where both p_4^{a/c} and p_5^{a/d} are the ringleader alarms of p_11^{b/a}. That means link(e-f), link(f-g), and link(g-h) are innocent (Domain(p_9^b) is ∅). By removing every innocent element from the calculation of alarm clusters, only one alarm cluster is obtained: Links{(c-d), (c-f)}. On the basis of the best explanation of alarm clusters, phases (3) and (4) deduce the most suspicious faulty network elements: link(c-d) or link(c-f). Without the help of network experts or sophisticated management tools, one can isolate the faulty link(c-d) with at most two link checks in this case. Let us check the diagnosis results of the two other methods. As shown in Table 1, if the information about innocent elements is ignored (Method 2), the final diagnosis answer should be link(c-d), link(c-f), or link(e-f).
Fig. 10. A timing constraint voting mechanism.
Fig. 11. The Petri nets of Fig. 10.
There is one more suspect offered by Method 2 than by Method 1. However, if we consider neither the innocent elements nor the redundant alarms (Method 3), the alarm cluster for phase (3) will be Links{{(c-d), (c-f), (e-f)}, {(e-f), (f-g), (g-h)}} and the best explanation will be link(e-f). That is definitely a wrong diagnosis. Therefore,
the information about innocent elements and redundant alarms should be taken into account in the fault identification process to obtain a more precise diagnosis.

To evaluate the performance of these three fault identification methods, we simulated the occurrence of faults, generated randomly at different locations and different times, four hundred times in the networking environment depicted by Fig. 10.
Table 1. The diagnosis results of the different fault identification methods (execution of Algorithm fault_identification).

Method 1: remove innocent elements & redundant alarms (with phases (2.1) & (2.2)).
Method 2: remove redundant alarms only (without phase (2.2)).
Method 3: ignore both innocent elements & redundant alarms (without phases (2.1) & (2.2)).

Phase 1
  Method 1: M_fail = {p_2^a, p_4^{a/c}, p_5^{a/d}, p_11^{b/a}}
  Method 2: M_fail = {p_2^a, p_4^{a/c}, p_5^{a/d}, p_11^{b/a}}
  Method 3: M_fail = {p_2^a, p_4^{a/c}, p_5^{a/d}, p_11^{b/a}}

Phase 2
  Method 1: redundant alarm = {p_11^{b/a}}; ringleader alarms = {p_2^a, p_4^{a/c}, p_5^{a/d}}; innocent network elements = Links{(e-f), (f-g), (g-h)} derived from phase (2.2); alarm cluster = Links{(c-d), (c-f)}
  Method 2: redundant alarm = {p_11^{b/a}}; ringleader alarms = {p_2^a, p_4^{a/c}, p_5^{a/d}}; alarm cluster = Links{(c-d), (c-f), (e-f)}
  Method 3: alarm cluster = Links{{(c-d), (c-f), (e-f)}, {(e-f), (f-g), (g-h)}}

Phase 3.1
  Method 1: link(c-d) or link(c-f)
  Method 2: link(c-d) or link(c-f) or link(e-f)
  Method 3: link(e-f)

Phase 3.2
  Method 1: link(c-d) and link(c-f)
  Method 2: (link(c-d) and link(c-f)) or (link(c-f) and link(e-f)) or (link(c-d) and link(e-f))
  Method 3: two links selected from the alarm cluster, including link(e-f)

Phase 3.3
  Method 1: null
  Method 2: link(c-d) and link(c-f) and link(e-f)
  Method 3: three links selected from the alarm cluster, including link(e-f)

Phase 3.4
  Method 1: null
  Method 2: null
  Method 3: four links selected from the alarm cluster, including link(e-f)

Phase 3.5
  Method 1: null
  Method 2: null
  Method 3: five links selected from the alarm cluster, including link(e-f)

Phase 3.6
  Method 1: null
  Method 2: null
  Method 3: all the links in the alarm cluster

Phase 4
  Method 1: link(c-d) or link(c-f)
  Method 2: link(c-d) or link(c-f) or link(e-f)
  Method 3: link(e-f)
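For concreteness, the Method 1 column of Table 1 can be replayed with the sketches from the previous sections (domains written as link sets; p_2^a's empty domain is dropped as a null alarm, and the innocent links (e-f), (f-g), (g-h) are already removed):

ringleader_domains = {
    "p4_ac": {"c-d", "c-f"},
    "p5_ad": {"c-d", "c-f"},
}
print(cluster_alarms(ringleader_domains))        # one cluster: [{'p4_ac', 'p5_ad'}] (ordering may vary)
print(fault_identification(ringleader_domains))  # {'c-d'}: one of the two tied answers
                                                 # link(c-d) / link(c-f) of Table 1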
In addition, we assume there are no more than three simultaneously faulty elements, since it is very rare for four elements to fail simultaneously in a network. Fig. 12 shows the performance of the three methods in terms of the corresponding average fault hit ratio (FHR). The identification effectiveness (FHR) of Method 1 remains relatively high, while Method 3 shows considerably low identification effectiveness and Method 2 performs merely passably. Meanwhile, on checking the identification results, we find that Methods 2 and 3 (especially Method 3) have a high error diagnosis rate for 'healthy' elements. As shown in Fig. 12, alarm clustering with the removal of innocent elements and redundant alarms (Method 1) is an appropriate alternative for the fault identification of timing constraint distributed applications when 'ping-like' probing (e.g. probing with ICMP packets) is prohibited.

In our work, TCPN is used as the base tool for analysis and diagnosis in fault identification. However, for large
Fig. 12. Fault identification performance.
communication systems, it is very hard to represent the whole system with any kind of Petri net. Therefore, to address this issue, our SD2 provides 'cluster' and 'uncluster' functions (the 'cluster' and 'uncluster' buttons on the menu bar of the window in Fig. 11) that let users change the size of a Petri net [16]. Users can designate a desired area on the window's canvas and then 'cluster' it into a representative icon embedded within the Petri net graph. This improves the readability of a Petri net representation of a large telecommunication service. Additionally, SD2 also offers sub-functions not only to monitor the 'on-line' execution of a specified distributed application program via the real-time movement of its Petri net tokens, but also to analyze and debug the Petri net's execution in an 'off-line' manner [12].
5. Conclusion and future work

Without the support of sophisticated management tools, it is very hard for the users of a timing constraint distributed system to identify the root cause(s) of a system failure effectively and efficiently. In such cases, costly damage is unavoidable. In this paper, we propose a framework that incorporates a standard-based management architecture (with SNMP and CMIP) and a two-level (TCPN and faulty network elements) fault propagation model into a distributed software development testbed, SD2, to demonstrate the capability of our fault identification. The framework accomplishes fault identification by using an important concept, alarm
clustering, to reason about the root cause(s) based merely on the analysis of the monitored distributed system's abnormal symptoms. Using the reasoning results, system users can effectively acquire information about the suspicious faulty area to speed up failure recovery. A proper failure detection strategy for obtaining innocent network elements is adopted to construct better alarm clusters for more precise fault identification. The concepts of redundant alarms and ringleader alarms are also introduced to remove erroneous/unnecessary information from the calculation of alarm clusters. Lastly, our implementation shows that diagnosis with both redundant alarm removal and innocent element exclusion is more effective than diagnosis without either or both of the two refinements.

To avoid the inherent shortcomings of centralized management, a delegated fault identification algorithm executed by each host's agent is being developed. Utilizing its individual TCPN, each agent may perform individual fault identification and exchange diagnosis results with the others. Furthermore, we have also been applying the framework to the time-aware network services in our university, such as the FCU campus VoIP [18], for a more robust IP service.
Appendix A. Assignments of transitions/places in Fig. 11

p_s: system is ready
t_s: system starts or system forks processes
p_1^a: process A is ready
t_1^a: process A multicasts
p_2^a: process A is waiting to receive messages
p_3^{a/b}: message from process A to process B is on the way
p_4^{a/c}: message from process A to process C is on the way
p_5^{a/d}: message from process A to process D is on the way
t_2^a: process A receives responses non-deterministically
p_6^a: process A has received all the desired responses
p_7^a: process A releases all the occupied resources
t_3^a: process A ends
p_8^b: process B is ready
t_4^b: process B receives message from process A
p_9^b: process B prepares the process identifier
t_5^b: process B sends process identifier to process A
p_10^b: process B has sent response
p_11^{b/a}: response from process B to process A is on the way
p_12^b: process B releases all the occupied resources
t_6^b: process B ends
p_13^c: process C is ready
t_7^c: process C receives message from process A
p_14^c: process C prepares the process identifier
t_8^c: process C sends process identifier to process A
p_15^c: process C has sent response
p_16^{c/a}: response from process C to process A is on the way
p_17^c: process C releases all the occupied resources
t_9^c: process C ends
p_18^d: process D is ready
t_10^d: process D receives message from process A
p_19^d: process D prepares the process identifier
t_11^d: process D sends process identifier to process A
p_20^d: process D has sent response
p_21^{d/a}: response from process D to process A is on the way
p_22^d: process D releases all the occupied resources
t_12^d: process D ends
p_e: system stops
t_e: system ends
Appendix B. Assignments of timing constraints in Fig. 11

[EFBT(t_s), LFET(t_s)]: [0, 2]
[EFBT(t_1^a), LFET(t_1^a)]: [1, 4]
[EFBT(t_2^a), LFET(t_2^a)]: [8, 17]
[EFBT(t_3^a), LFET(t_3^a)]: [9, 18]
[EFBT(t_4^b), LFET(t_4^b)]: [3, 7]
[EFBT(t_5^b), LFET(t_5^b)]: [4, 9]
[EFBT(t_6^b), LFET(t_6^b)]: [5, 10]
[EFBT(t_7^c), LFET(t_7^c)]: [3, 8]
[EFBT(t_8^c), LFET(t_8^c)]: [4, 10]
[EFBT(t_9^c), LFET(t_9^c)]: [5, 12]
[EFBT(t_10^d), LFET(t_10^d)]: [4, 7]
[EFBT(t_11^d), LFET(t_11^d)]: [5, 9]
[EFBT(t_12^d), LFET(t_12^d)]: [7, 13]
[EFBT(t_e), LFET(t_e)]: [10, 20]
References

[1] C.S. Chao, D.L. Yang, A.C. Liu, A time-aware fault diagnosis system in LAN, Proceedings of the 2001 IFIP/IEEE International Symposium on Integrated Network Management (IM 2001), 2001, pp. 499-512.
[2] S.H. Son, R.C. Beckinger, D.A. Baker, DRDB: a distributed real-time database server for high-assurance time-critical applications, Proceedings of the IEEE COMPSAC'97, 1997, pp. 362-367.
[3] G. Meredith, The New Management Landscape, Cisco Packet Magazine, 3rd Quarter, 2002, pp. 29-34.
[4] S.A. Yemini, et al., High speed and robust event correlation, IEEE Commun. Mag. 34 (5) (1996) 82-90.
[5] C.S. Chao, D.L. Yang, A.C. Liu, An automated fault diagnosis system using hierarchical reasoning and alarm correlation, Proceedings of the IEEE Workshop on Internet Applications, 1999, pp. 120-127.
[6] D. Gambhir, M. Post, I. Frisch, A framework for adding real-time distributed software fault detection and isolation to SNMP-based system management, J. Network Syst. Mgmt 2 (3) (1994) 257-281.
[7] A.T. Bouloutas, S. Calo, A. Finkel, Alarm correlation and fault identification in communication networks, IEEE Trans. Commun. 42 (2-4) (1994) 523-533.
[8] I. Katzela, M. Schwartz, Schemes for fault identification in communication networks, IEEE/ACM Trans. Networking 3 (6) (1995) 753-764.
[9] K. Kätker, K. Geihs, A generic model for fault isolation in integrated management systems, J. Network Syst. Mgmt 5 (2) (1997) 109-130.
[10] J.P. Tsai, J.H. Yang, Y.H. Chang, Timing constraint Petri nets and their application to schedulability analysis of real-time system specifications, IEEE Trans. Software Eng. 21 (1) (1995) 32-49.
[11] C.S. Chao, D.L. Yang, A.C. Liu, Distributed software system fault management in a heterogeneous environment, Proceedings of the Workshop on Distributed System Technologies and Applications, 1998, pp. 267-274.
[12] C.S. Chao, A.C. Liu, A CSP-based visual distributed programming environment, Proceedings of the Visual Conference, 1997, pp. 225-232.
[13] A.C. Liu, S.U. Cheng, A system for distributed software development and management, Proceedings of the International Conference on Circuits and Systems Science, 1996, pp. 207-212.
[14] T. Murata, Petri nets: properties, analysis and applications, Proc. IEEE 77 (4) (1989) 541-580.
[15] K. McCarthy, et al., Exploiting the power of OSI management for the control of SNMP-capable resources using generic application level gateways, Proceedings of the IFIP/IEEE International Symposium on Integrated Network Management, 1995, pp. 440-453.
[16] C.S. Chao, A.C. Liu, A visualization modeling framework for a CSP-based system, J. Vis. Lang. Comput. 11 (1) (2000) 83-103.
[17] C.S. Chao, T.T. Wang, A.C. Liu, Applying distributed breakpoints to a replay-based debugger, Proceedings of the Workshop on Distributed System Technologies and Applications, 1997, pp. 538-547.
[18] S.C. Ke, C.S. Chao, A.C. Liu, FCU VoIP Network Management - A Case Study, TANET 2002, Session IV, no. 1, 2002.
Chi-Shih Chao received his MS and PhD degrees in Information Engineering from Feng Chia University in 1996 and 2001, respectively. From 2002 to 2004, he was the director of Computer Resource Center, OIT, Feng Chia University. Currently, he is an assistant professor at the Communications Engineering Department of Feng Chia University and a part-time assistant professor at the Information Management Department of Chung Shan Medical University. His research interests include VoIP performance/fault measurement, wireless LAN management, multimedia networks/communications, and distributed/mobile application systems. In addition, he is a member of IEEE, ACM and Phi-Tau-Phi.
An-Chi Liu received his BSEE and MS degrees from the National Chiao Tung University in Taiwan, and PhD degree from the University of Illinois at Chicago, in 1973, 1977 and 1981, respectively. He is a Professor at Feng Chia University, Taiwan, which he joined in 1989. From 1989 to 1994, he was the Chair of Department of Information Engineering. From 1994 to 1998, he was the Dean of Engineering and, since 1998, he has been the President of Feng Chia University. Previously, he was a faculty member at the Illinois Institute of Technology and at the North Dakota State University. His research interests include distributed processing, network management, and software engineering.