Document not found! Please try again

Fault Identification with Binary Adaptive Fireflies in ... - Semantic Scholar

1 downloads 20 Views 346KB Size Report
Rafael Falcon, Marcio Almeida and Amiya Nayak. School of Information ... this goal in mind (e.g. branch-and-bound [8] [9], etc.), it wasn't until a few years ago ...
Fault Identification with Binary Adaptive Fireflies in Parallel and Distributed Systems Rafael Falcon, Marcio Almeida and Amiya Nayak School of Information Technology and Engineering, University of Ottawa 800 King Edward Ave., Ottawa ON, Canada K1N 6N5 Email: {rfalc032, malmeida, anayak}@site.uottawa.ca

Abstract—The efficient identification of hardware and software faults in parallel and distributed systems still remains a serious challenge in today’s most prolific decentralized environments. System-level fault diagnosis is concerned with the detection of all faulty nodes in a set of interconnected units. This is accomplished by thoroughly examining the collection of outcomes of all tests carried out by the nodes under a particular test model. Such task has non-polynomial complexity and can be posed as a combinatorial optimization problem, whose optimal solution has been sought through bio-inspired methods like genetic algorithms, ant colonies and artificial immune systems. In this paper, we employ a swarm of artificial fireflies to quickly and reliably navigate across the search space of all feasible sets of faulty units under the invalidation and comparison test models. Our approach uses a binary encoding of the potential solutions (fireflies), an adaptive light absorption coefficient to accelerate the search and problem-specific knowledge to handle infeasible solutions. The empirical analysis confirms that the proposed algorithm outperforms existing techniques in terms of convergence speed and memory requirements, thus becoming a viable approach for real-time fault diagnosis in large-size systems. Index Terms—fault diagnosis, firefly optimization, swarm intelligence, invalidation model, comparison model

I. I NTRODUCTION Parallel and distributed systems continue to increasingly permeate societies nowadays. From cellular networks to distributed database management systems, the emergence of innovative architectural and communication protocols has given rise to the next generation of decentralized systems such as wireless sensor and actuator networks and cloud computing. Moreover, many groundbreaking research projects largely rest on powerful multiprocessor systems due to their unrivaled processing capabilities, rapid growth and improved affordability. It is from the standpoint of these technological advancements that we witness a revival among the scientific community when it comes to fault tolerance protocols, as processing units in decentralized systems are subject to both hardware and software faults. Since undetected faults lead to system errors with unpredictable consequences, efficiently diagnosing the system’s status (i.e., identifying which nodes are faulty and which are fault-free) still remains a serious challenge. The aforementioned problem is known as “system-level fault diagnosis” and has drawn a significant amount of research over the last thirty years. Different test models [1] [2] [3] have been proposed in literature, each relying on the common assumption that every system unit is tested by one

978-1-4244-7835-4/11/$26.00 ©2011 IEEE

or more remaining units. Two well-known scenarios are the invalidation and comparison models. They constitute classical benchmarks in the field and assume that a test outcome indicates the correctness of the tested node and that the system’s status can be entirely derived out of the input syndrome, which is the collection of all test results. Thorough examination of the input syndrome under the set of rules enforced by a particular test model is a nonpolynomial complexity task. Furthermore, given the discrete nature of the node assignment process (i.e. every node is labeled as either faulty or fault-free), system-level fault diagnosis can be modeled as a combinatorial optimization problem, whose optimal solution is the system labeling (set of node assignments) that entirely matches the known input syndrome. Although several algorithms have been put forward with this goal in mind (e.g. branch-and-bound [8] [9], etc.), it wasn’t until a few years ago that nature-inspired meta-heuristic optimization approaches [10] were brought into the context of fault diagnosis in distributed systems. This sort of methods have become very popular for their proved ability to overcome local optima through a parallel exploration of the search space and the exploitation of social communication mechanisms to drive the population toward promising search regions. Ant colonies (ACO) [11], genetic algorithms (GA) [12] and artificial immune systems (AIS) [13] have all succeeded in reliably spotting the actual ensemble of damaged nodes. Yet the implementation of such algorithms could be rendered infeasible in many real-life scenarios given their underlying intricacy. In the former ACO case, the authors maintained two sets of pheromone trails and heuristic information per node, whereas GA and AIS develop their search strategy on the basis of the procreation (either by recombination or cloning) of their population members. It is clear that memory and computing power can severely hinder the applicability of the above techniques in many resource-constrained settings. We are pursuing fault detection algorithms which are “light” and accurate enough so that realistic conclusions are drawn with minimal memory and processing power requirements. A first step in this direction was recently undertaken in [14] by using a discrete formulation of Particle Swarm Optimization (PSO) [15]. The authors encoded the system status as a binary vector (particle) which “flies” over the search space of all feasible solutions. Because the population size remained constant and infeasible particles were fixed by using problem-

1359

specific knowledge, PSO exhibited a promising behavior in terms of memory consumption and convergence speed to the actual fault set (the group of all faulty units). Yet PSO evidenced a wandering worst-case behavior in medium-size networks and its performance in large graphs was not tested. This paper builds on the success of PSO-based fault diagnosis by further enhancing the exploration abilities with Firefly Optimization (FO) [16], which is a natural extension of PSO in which multiple “informer” particles are used to drive the search. In our scheme, each binary firefly (prospective solution) is attracted to every other swarm individual with better fitness value, thus fostering a greater diversification of the corporate search. Moreover, the light absorption coefficient is computed in an customized way for each swarm member and infeasible solutions are rapidly conducted into feasibility by flipping the assignment of poorly guessed nodes. The empirical analysis embraced the two aforementioned test models and the overall result is an accelerated convergence towards the global optimum with low memory consumption regardless of the system size under consideration and a steady performance superiority over PSO and AIS. The manuscript has been structured as follows. Section II is related work and Section III elaborates on the two fault diagnosis models under consideration. The original FO algorithm for the continuous optimization case and its adaptation to the combinatorial realm are the subject of Section IV. Then we unveil our bio-inspired scheme to fault identification (Section V), provide simulation results (Section VI) and outline some final remarks (Section VII). II. R ELATED W ORK The extent to which a system can react to unexpected hardware or software faults largely rests on how fast they are detected. All approaches described in this section share the same goal: to locate the set of faulty nodes given the input syndrome in the least amount of time and with high reliability. A large stream of research works [4], [5], [6], [7] aims at designing efficient algorithms based on the exploitation of structural properties of the testing graph. Hypercubes, crossed cubes, extended stars and twisted cubes, just to name a few, are among the preferred topologies because they simplify the conceptual design of the proposed solution approaches. We stick to a more flexible methodology in which no restrictions on the topological features of the system are laid, other than those to guarantee a reliable (i.e. deterministic) result. In this avenue, system-level fault diagnosis is posed as an optimization problem under a particular test model and an arbitrary structure of the testing graph, as explained in Sect. III. Exact, heuristic and metaheuristic methods are needed to explore the discrete search space in polynomial time. The complexity of such approaches is generally higher compared to those leaning on a simplified topology yet they bear a much wider applicability and still yield satisfactory results. For example, authors in [8] disclose an exact algorithm of the branch-and-bound type whose time complexity is 𝑂(𝑁 3 ), where 𝑁 is the system size (number of units). Sullivan [9]

introduced a backtracking-based enhancement to [8] which caused the complexity to drop to 𝑂(𝑡3 + ∣𝐸∣), where 𝑡 is the number of tolerated faults and ∣𝐸∣ the total number of tests in the system. This bound becomes more relevant as fewer units are found defective among a large group of nodes. As stated in Section I, evolutionary methods were applied to system-level fault diagnosis by Elhadef et al [12] [13], in particular GA and AIS. Both methods lack an adequate population diversity given the biased exploration brought about by an adaptive mutation operator (openly criticized in [13] but finally adopted). As a result of that, the convergence is slowed down and the worst-case identification of faulty nodes gets severely hindered in large systems composed of hundreds or thousands of units. Memory requirements are another important concern, given the enormous number of individuals generated by the algorithms over time in an attempt to find the optimal solution. Such demeanor is made even worst in the AIS-based model [13], where the number of elitist (fittest) antibodies and the number of clones per elite antibody are both equal to the population size. These drawbacks have been eliminated in our method, as the number of individuals is never altered (constant population size) and an unbiased exploration takes place, i.e. all components of the binary vector representing the candidate solution are allowed to change. An ACO-inspired scheme was designed for system-level fault diagnosis [11]. The chief idea is to have every ant construct a full tour of the graph representing the system and label each node as faulty/fault-free as it goes by. Unlike widely recognized ACO models, the authors use neither pheromone nor heuristic values to perform internodal transitions but to guess the node’s status. The disadvantage here lies in the redundant storage of dual sets of pheromone trails and heuristic information per node during the ants’ journey, which aggravates the memory consumption issue in large systems. Furthermore, it is not clear what is the heuristic value assigned to a node by a “flying ant”, as no link between the current and “leaping-to” nodes might exist. Another downside is that infeasible solutions are discarded by default, thus wasting valuable information previously gathered by the ants. Our proposed approach requires less information at the search agent level (firefly), gracefully turns infeasible solutions into feasible ones and is theoretically simpler. A binary PSO to system-level fault diagnosis was introduced in [14]. This protocol exhibits a superior convergence rate compared to previous ones and has low computational demands. Each particle is encoded as a binary vector representing the status of every node and “flies” across the search space of all candidate fault sets driven by its own best position and by the best position ever reached by the swarm. PSO has served as the foundation for enunciating more competitive metaheuristics like FO [16]. We diversify the search even further by adding more “informers” to the velocity rule of the artificial fireflies. Hence, our algorithm completely eliminates the erratic worst-case behavior of PSO in medium networks and its inability to grasp the global optimum in large graphs.

1360

III. FAULT D IAGNOSIS M ODELS Several models to identify faults in distributed systems appear in literature. In this section, we will focus on the popular invalidation and comparison models. Both assume that each node will be tested by a particular subset of the other nodes in the system and every test outcome has a binary nature, indicating whether the tested node is either faulty (1) or fault-free (0). The group of all faulty units in a system is called the fault set. One attribute of the so-called 𝑡-diagnosable systems is that they can only have at most 𝑡 faulty nodes. Although both diagnosis models represent the problem in a graph-like fashion and use the collection of test outcomes as a means to figure out what the status of every node is, they differ in the type of graph representation and the implicit assumptions on how tests are conducted. A. The Invalidation Model In the invalidation or PMC model (named after its authors in [3]), the problem is represented as a directed graph 𝐺 = (𝑉, 𝐸) with 𝑉 = {𝑣1 , 𝑣2 , . . . , 𝑣𝑛 } being the set of nodes (processors, sensors, etc.) and 𝐸 the set of edges. Each edge 𝑒𝑖𝑗 ∈ 𝐸 stands for a test performed by node 𝑣𝑖 upon node 𝑣𝑗 and whose binary outcome is denoted as 𝜎𝑖𝑗 . The set of all test outcomes is called a syndrome and is symbolized by 𝜎. Since a syndrome 𝜎 is a physical manifestation of an underlying fault set 𝐹 , we usually write 𝜎𝐹 . According to the PMC model, the system has a stationary nature, i.e. the statuses of all nodes do not change during the diagnosis phase. It is also assumed that tests carried out by fault-free nodes are always correct while those executed by faulty devices are unreliable, i.e. yield arbitrary results. Let 𝑣𝑖 ∈ 𝑉 be any node in 𝐺. Then the set of nodes tested by 𝑣𝑖 is Γ(𝑣𝑖 ) = {𝑣𝑗 ∈ 𝑉 : 𝑒𝑖𝑗 ∈ 𝐸} and 𝜎(𝑣𝑖 ) = {𝜎𝑖𝑗 ∈ 𝜎 : 𝑗 ∈ Γ(𝑣𝑖 )} is the subset of the syndrome 𝜎 containing the results of the tests realized by 𝑣𝑖 . Likewise, Γ−1 (𝑣𝑖 ) = {𝑣𝑗 ∈ 𝑉 : 𝑒𝑗𝑖 ∈ 𝐸} is the set of 𝑣𝑖 ’s testers and 𝜎 −1 (𝑣𝑖 ) = {𝜎𝑗𝑖 ∈ 𝜎 : 𝑗 ∈ Γ−1 (𝑣𝑖 )} the outcomes of all tests performed upon 𝑣𝑖 . Definition III.1 A fault set 𝐹 ⊆ 𝑉 is consistent with the syndrome 𝜎 under the PMC model if ∀ 𝑒𝑖𝑗 ∈ 𝐸, neither 𝜎𝑖𝑗 = 0 where 𝑣𝑖 ∈ 𝑉 − 𝐹 and 𝑣𝑗 ∈ 𝐹 , nor 𝜎𝑖𝑗 = 1 where 𝑣𝑖 , 𝑣𝑗 ∈ 𝑉 − 𝐹 , holds. The above definition states that only fault-free nodes always yield correct results. This assumption will play a pivotal role later on during the generation of candidate syndromes as part of the quest for the true fault set. B. Comparison Model In the comparison model [18], nodes perform tests in a pairwise fashion, i.e. some pairs of units test each other. The comparison between the two test outcomes (match/mismatch) stands as the foundation for deriving the nodes’ statuses. More formally, in a comparison-based scenario, an undirected graph 𝐺 = (𝑉, 𝐸) is used to model the system, where 𝑉 is the set of units and 𝐸 the collection of edges. Now each edge 𝑒𝑖𝑗 has an associated weight 𝜎𝑖𝑗 = 0 when the two test results carried out by units 𝑣𝑖 and 𝑣𝑗 are identical and 𝜎𝑖𝑗 = 1 otherwise.

As in the PMC model, the collection of all edge weights (test comparisons) makes up the syndrome 𝜎. The comparison model also regards the system as static, likewise PMC. Now the main assumption is that two faultfree units will always agree on their test outcomes whereas any couple of faulty and fault-free nodes will always disagree. However, divergent standpoints arise when it comes to the test comparison involving two faulty nodes. The asymmetric comparison model [2] considers that two damaged nodes will only produce a mismatch while in the symmetric variant [1], both a match and a mismatch are likely choices. Let 𝑣𝑖 ∈ 𝑉 be any node in 𝐺. Then the set of neighbors of 𝑣𝑖 is Γ(𝑣𝑖 ) = {𝑣𝑗 ∈ 𝑉 : 𝑒𝑖𝑗 ∈ 𝐸} and 𝜎(𝑣𝑖 ) = {𝜎𝑖𝑗 ∈ 𝜎 : 𝑗 ∈ Γ(𝑣𝑖 )} is the subset of the syndrome 𝜎 containing the test agreements/disagreements concerning 𝑣𝑖 . Definition III.2 A fault set 𝐹 ⊆ 𝑉 is consistent with the syndrome 𝜎 under the comparison model if ∀ 𝑒𝑖𝑗 ∈ 𝐸, neither 𝜎𝑖𝑗 = 0 where 𝑣𝑖 ∈ 𝑉 − 𝐹 and 𝑣𝑗 ∈ 𝐹 (or 𝑣𝑖 ∈ 𝐹 and 𝑣𝑗 ∈ 𝑉 − 𝐹 ), nor 𝜎𝑖𝑗 = 1 where 𝑣𝑖 , 𝑣𝑗 ∈ 𝑉 − 𝐹 , holds. C. 𝑡-Diagnosable Systems From the previous two subsections one may realize that, given an input syndrome 𝜎 which is compliant with the specifications of a particular model, multiple faults sets could possibly give rise to it. This makes system-level fault diagnosis a non-deterministic problem, which is of course very undesirable since the actual set of damaged units cannot be guessed with full certainty. To prevent this, we introduce the following definition. Definition III.3 A system of 𝑁 units is 𝑡-diagnosable if and only if the number of faulty units doesn’t exceed 𝑡 and for any consistent syndrome 𝜎, there is only one fault set 𝐹 that originated it. Hereafter, we confine ourselves to a type of system which is easy to generate and known to be 𝑡-diagnosable. As such, it will be used in our experiments. Definition III.4 A system represented by a test graph 𝐺 = (𝑉, 𝐸) with 𝑁 = ∣𝑉 ∣ is a 𝐺𝑡 (𝑁 ) design iff i) 𝑁 ≥ 2𝑡 + 1 and ii) each node is tested by at least 𝑡 others. For the PMC model, it was proved in [19] that a 𝐺𝑡 (𝑛) system in which no two units test each other is 𝑡−diagnosable. As to the comparison models, [18] concluded that the asymmetric version is 𝑡-diagnosable if every unit is linked to at least 𝑡 + 1 other units whereas in the symmetric version there must be at least 2𝑡 links to other units per node. Figure 1 displays a 𝐺𝑡 (𝑁 ) graph under the PMC model. Finally, let us point out that, irrespective of the diagnosis model being used, the problem of fault detection in distributed and parallel systems can be posed as a combinatorial optimization problem if we undertake a quest, over the discrete space of all possible fault sets 𝐹 , of the actual fault set 𝐹 which univocally generates the known input syndrome 𝜎𝐹 . This task has non-polynomial time complexity [20] and calls for the application of heuristic and meta-heuristic methods to informedly navigate over the search space until 𝐹 is found.

1361

&ĂƵůƚLJ Ϭ

Ϭ



&ĂƵůƚͲĨƌĞĞ sϮ

sϭϬ

Ϭ

Ϭ

Ϭ

Ϭ

Ϭ

Ϭ



ϭ

sϯ ϭ

Ϭ Ϭ



Ϭ ϭ



Ϭ

Ϭ ϭ

ϭ



sϱ sϲ

Ϭ

ϭ

Fig. 1. A 𝐺2 (10) test assignment graph under the PMC model. The directed edge labels represent test outcomes.

IV. F IREFLY O PTIMIZATION

A. A Binary Firefly Implementation

This metaheuristic method was proposed by X. S. Yang in 2009 [16] after watching the phenomenon of bioluminescent communication featured by firefly insects in tropical countries. It has been observed that most fireflies produce short and rhythmic flashes, which serve to attract either potential prey or other fireflies for mating purposes. Since the light intensity (a.k.a. brightness) is inversely proportional to the square of the distance from the light source (firefly) and a certain amount of light is absorbed by the air, fireflies are only visible up to a limited distance. Bearing the above ideas in mind, FO was enunciated as a population-based optimization method relying on a swarm 𝑆 of size 𝑀 , in which each firefly (candidate solution) has ⃗ 𝑖 = (𝑋𝑖1 , . . . , 𝑋𝑖𝑁 ), 𝑖 = an associated position vector 𝑋 1, . . . , 𝑀 in the real-valued, 𝑁 -dimensional continuous search ⃗ 𝑖 ) is directly/inversely space. The brightness of any firefly 𝐼0 (𝑋 ⃗ 𝑖 ) for the maximizaproportional to its fitness value 𝑓 (𝑋 ⃗ 𝑖 is attracted tion/minimization case.1 The degree to which 𝑋 ⃗ 𝑗 is modeled after the distance-dependent to another firefly 𝑋 ⃗𝑗 ) as follows: ⃗ 𝑖, 𝑋 attractiveness attenuation function 𝛽(𝑋 2 ⃗ 𝑖, 𝑋 ⃗𝑗 ) = 𝛽0 ⋅ exp(−𝛾 ⋅ 𝑟𝑖𝑗 𝛽(𝑋 )

(1)

where 𝛽0 , 𝛾 are algorithm’s parameters called “initial attractiveness” and “light absorption coefficient”, respectively (both ⃗ 𝑖, 𝑋 ⃗ 𝑗 ) is a distance usually set to fixed values) and 𝑟𝑖𝑗 = dist(𝑋 ⃗𝑗 . ⃗ 𝑖 and 𝑋 measure (e.g. Euclidean) between 𝑋 The Firefly Algorithm is a generalization of PSO because it allows any swarm member to consider multiple “informers” in its neighborhood for updating the position vector, whereas the standard PSO only takes into account a single informer: the best individual in a particle’s neighborhood. ⃗ 𝑖 is attracted along any proMore precisely, each firefly 𝑋 blem dimension 𝑘 = 1, . . . , 𝑁 towards every other “brighter” ⃗ 𝑗 as shown in (2): firefly 𝑋 (

⃗ 𝑖, 𝑋 ⃗𝑗 )(𝑋𝑗𝑘 −𝑋𝑖𝑘 )+𝛼⋅ rand( ) − 1 𝑋𝑖𝑘 = 𝑋𝑖𝑘 +𝛽(𝑋 2 1 Notice

that 𝑓 : ℜ𝑁 → ℜ is strictly problem-specific.

where 𝛼 is a randomization factor (algorithm parameter) which controls the magnitude of the stochastic perturbation endured by 𝑋𝑖𝑘 and rand( ) ∼ 𝑈 (0, 1). Notice that the ⃗ 𝑗 ) > 𝐼 0 (𝑋 ⃗ 𝑖 ). above expression is computed only when 𝐼0 (𝑋 Besides, the standard FO adopts a “global-best” neighborhood topology, i.e. each firefly is able to communicate with any other individual in the population. One may realize that the above movement law doesn’t fully apply to the best firefly 𝑋⃗𝑖∗ ∈ 𝑆, as no one is “brighter” than it is. In such case, we simply drop the “social” term in (2) and arrive at the shorter expression (3), in which 𝑋⃗𝑖∗ will perform a controlled random walk across the search space. ( ) 1 𝑋𝑖∗ 𝑘 = 𝑋𝑖∗ 𝑘 + 𝛼 ⋅ rand( ) − (3) 2

) (2)

Though at first conceived for coping with continuous optimization landscapes, FO can be modified to seek the most promising combination of discrete vector components that optimizes a given objective function. The first FO plunge into the field of combinatorial optimization took place shortly after its birth, when a binary version of the algorithm was put forward in [21] for makespan minimization in the realm of permutation flow shop scheduling. We reuse these ideas for our problem as well. In binary FO, after updating the firefly’s position according to (2), a mapping between the real-valued vector space ℜ𝑁 and the binary space {0, 1}𝑁 is enforced by means of a probabilistic rule based on a sigmoidal transformation applied to each component of the real-valued position vector, as shown in (4): ⎧ 1 ⎨ 1, if rand( ) ≤ (4) 𝑋𝑖𝑘 = 1 + exp(−𝑋𝑖𝑘 ) ⎩ 0, otherwise where 𝑘 = 1, . . . , 𝑁 and rand( ) ∼ 𝑈 (0, 1). Expression (4) forces the firefly encoding to be of binary type, as required by the problem representation, after receiving “search guidance” ⃗ 𝑗 as in (2). from every informer firefly 𝑋 V. T HE P ROPOSED A PPROACH We model system-level fault diagnosis as a combinatorial optimization problem undertaken by a binary firefly approach. The goal is to find, among many possible fault sets (see Sect. III), the true set 𝐹 that entirely matches the collection of test outcomes contained in the input syndrome 𝜎𝐹 . The proposed protocol (Firefly Algorithm to Fault Diagnosis, FAFD) is executed at a centralized location. A. Firefly Encoding and Initialization Each firefly symbolizes a candidate fault set 𝐹 . Its position ⃗ 𝑖 (𝑖 = 1..𝑀 ) is encoded as a binary vector whose dimen𝑋 sionality 𝑁 = ∣𝑉 ∣ is given by the system size (number of units). The status of the 𝑘-th node (𝑘 = 1..𝑁 ) is denoted by 𝑋𝑖𝑘 = 1 if 𝑣𝑘 is deemed faulty or 𝑋𝑖𝑘 = 0 otherwise.

1362

Initially, for each firefly position it is first decided at random the number of faulty units nf ∼ 𝑈 (1, 𝑡) the firefly will encode and then nf bits will be arbitrarily set to 1. By doing so, diversity is promoted in the first swarm, as feasible initial fault sets of varying cardinality will be stochastically derived. B. Fitness Function The quality or fitness of any individual (fault set 𝐹 ) can be measured as the degree of resemblance between the input syndrome 𝜎𝐹 and the candidate syndrome 𝜎𝐹 . The calculation of the latter and of the fitness function itself is strongly modeldependent although some similarities can be spotted. 1) Candidate Syndrome Generation: Of all test models considered in this study, only the asymmetric comparison model is entirely deterministic. That is, given the assumption of which nodes are faulty and which are fault-free (firefly encoding), we follow the asymmetric model rules in Sect. III-B and assign a label to every edge 𝑒𝑖𝑗 in the graph 𝐺. The collection of these labeled edges becomes the candidate syndrome 𝜎𝐹 . Given the non-deterministic nature of the tests carried out by faulty nodes (in the invalidation model) and the match/mismatch between pairs of damaged devices (in the symmetric comparison model), one can make the reasonable assumption that these edges in 𝜎𝐹 equal those in 𝜎𝐹 , i.e. 𝜎𝐹 (𝑣𝑖 ) = 𝜎𝐹 (𝑣𝑖 )

∀ 𝑣𝑖 ∈ 𝐹

Then we generate the remaining edge labels in 𝜎𝐹 as per the intrinsic model rules portrayed in Sect. III-B. We denote ⃗𝑖 ⃗ 𝑖 → 𝜎𝐹 the mapping from a candidate fault set 𝑋 by 𝑔 : 𝑋 𝑖 to its corresponding syndrome 𝜎𝐹𝑖 . 2) The Firefly Fitness: A firefly has higher fitness than others if the candidate syndrome associated with its encoding (fault set 𝐹 ) better resembles the input syndrome. From the local viewpoint of node 𝑣𝑖 , this is equivalent to counting how many labels of the incoming/outgoing edges coincide in both syndromes. For the comparison models, expression (5) models the node-level similarity. ) ∣𝜎𝐹 (𝑣𝑖 ) ∩ 𝜎𝐹 (𝑣𝑖 )∣ ( 𝑓 𝜎𝐹 , 𝜎𝐹 , 𝑣 𝑖 = ∣Γ(𝑣𝑖 )∣

𝑓

(

)

𝜎𝐹 , 𝜎𝐹 , 𝑣 𝑖 =

𝑓

−1

(

{

1,

∣𝜎𝐹 (𝑣𝑖 ) ∩ 𝜎𝐹 (𝑣𝑖 )∣ , ∣Γ(𝑣𝑖 )∣

)

𝜎𝐹 , 𝜎𝐹 , 𝑣 𝑖 =

if ∣Γ(𝑣𝑖 )∣ = 0 otherwise

∣𝜎𝐹−1 (𝑣𝑖 ) ∩ 𝜎𝐹−1 (𝑣𝑖 )∣ ∣Γ−1 (𝑣𝑖 )∣

𝑓 (𝜎𝐹 , 𝜎𝐹 ) =

𝑣𝑖 ∈𝑉

𝑁

(9)

where 𝑁 = ∣𝑉 ∣ is the system size. The above equation can be seen as the correctness probability of the potential fault set 𝐹 being the actual fault set 𝐹 . Because we are solely working with 𝑡-diagnosable systems, it is thus guaranteed that the fault set 𝐹 that makes 𝜎𝐹 = 𝜎𝐹 is unique. Since (5) and (8) take values in [0, 1], the image of (9) is also the unit interval. C. Adaptive Light Absorption Coefficient Equation (1) models the influence of the solutions attained by other fireflies over the current position of any individual. The attractiveness attenuation function is governed by two built-in parameters: 𝛽0 and 𝛾. The former describes the attractiveness of two individuals colliding at the same location and is usually set to 𝛽0 ∈ [0, 1]. The latter is the light absorption coefficient 𝛾, which determines how increasing distance from the light emitter degrades its appeal to other individuals. All firefly solutions are equally interesting when 𝛾 → 0 and completely irrelevant when 𝛾 → ∞. A range that works well in practice is 𝛾 ∈ [0, 10] as suggested by [10]. Superior results are obtained when 𝛾 is problem-tailored. Instead of using a fixed value, we write 𝛾 in terms of the ⃗ 𝑖 is supposed “characteristic length” of the subspace a firefly 𝑋 to travel in. The adaptive version of 𝛾 is as follows: ⃗ 𝑖) = 𝛾(𝑋

𝛾0

⃗ 𝑖) 𝑟𝑚𝑎𝑥 (𝑋

(10)

⃗ 𝑖 ) = max {dist(𝑋 ⃗ 𝑖, 𝑋 ⃗ 𝑗 )} for where 𝛾0 ∈ [0, 1] and 𝑟𝑚𝑎𝑥 (𝑋 ⃗ ⃗ all 𝑋𝑗 brighter than 𝑋𝑖 . D. Binary Mapping Rule

(5)

For the PMC model, we are to take into account that all nodes will be tested but not necessarily testers. Equations (6) to (8) respectively define the similarity from 𝑣𝑖 ’s perspective as a tester node, tested node and both. +1

Regardless of the diagnosis model used, the overall resemblance between the input and candidate syndromes (which in turn becomes FAFD’s fitness function) can be stated as in (9), . ∑ ( ) 𝑓 𝜎𝐹 , 𝜎𝐹 , 𝑣 𝑖

Expression (11) introduced in [22] yielded more promising empirical results for our problem.

𝑋𝑖𝑘

⎧ ⎨ 1, = ⎩ 0,

if rand( ) ≤ otherwise

1 −1.5 ⋅ 𝑁 ⋅ exp(−𝑋𝑖𝑘 )

(11)

E. Handling Infeasible Solutions (6) (7)

( ) ( ) ( ) 𝑓 +1 𝜎𝐹 , 𝜎𝐹 , 𝑣𝑖 + 𝑓 −1 𝜎𝐹 , 𝜎𝐹 , 𝑣𝑖 𝑓 𝜎𝐹 , 𝜎𝐹 , 𝑣 𝑖 = (8) 2

The application of (2) and (11) to each firefly’s encoding may lead to an infeasible solution, i.e. a particular configuration of node guesses in which either all units are faultfree or there are more than 𝑡 faulty units. Both scenarios are unacceptable in 𝑡-diagnosable systems. We gracefully manage this situation in the following way: 1) Compute the node-level resemblance by either (5) or (8) depending on the diagnosis model under consideration.

1363

2) If the firefly’s position is the zero vector, generate a random number 𝑏 ∈ [1..𝑡] and flip (turn to 1) the 𝑏 bits with the lowest fitness values. Otherwise, let 𝑘 be the number of 1’s in the particle (notice that 𝑘 > 𝑡). Then generate a random number 𝑏 ∈ [𝑘 − 𝑡..𝑘 − 1] and flip (turn to 0) the 𝑏 bits set to 1 with the lowest fitness values. The above procedure rapidly turns infeasible into feasible individuals by inverting the guesses on the status of those nodes with poorest local similarity to 𝜎𝐹 , in an attempt to speed up the algorithm’s convergence. F. Stop Criterion FAFD stops when it either reaches an upper bound of the running time or finds the actual fault set 𝐹 , i.e. a firefly encoding 𝐹 with fitness value 𝑓 (𝜎𝐹 , 𝜎𝐹 ) = 1. Such fault set 𝐹 is proved to be unique in 𝑡-diagnosable systems [18] [19]. G. The Pseudocode Algorithm 1 unfolds the pseudocode of our method. Algorithm 1 Firefly Algorithm to Fault Diagnosis (FAFD) Input: Parameters 𝜎𝐹 , 𝑓 (⋅), 𝑔(⋅), 𝑀 , 𝑁 , 𝛾0 , 𝛼, 𝛽0 , maxIter ⃗ 𝑖∗ Output: The true fault set 𝐹 returned in 𝑋 1: stop ← false; iteration ← 0; 2: initialize swarm 𝑆 as shown in Sect. V-A; 3: while not stop do 4: iteration ← iteration + 1; ⃗ 𝑖 ); ∀ 𝑖 = 1..𝑀 5: 𝜎𝐹𝑖 ← 𝑔(𝑋 ⃗ 6: 𝐼0 (𝑋𝑖 ) ← 𝑓 (𝜎𝐹𝑖 , 𝜎𝐹 ); ∀ 𝑖 = 1..𝑀 ⃗ 𝑖 )}; 7: 𝐼0∗ ← max {𝐼0 (𝑋 ⊳ best fitness in this iteration ⃗ 𝑖 )}; 8: 𝑖∗ ← argmax𝑖 {𝐼0 (𝑋 ⊳ index of best firefly 9: 10: 11: 12: 13: 14: 15:

H. Time Complexity An upper bound for the worst-case temporal complexity of FAFD is provided in this section. The analysis of the spatial complexity is analogous and thus omitted for brevity. Generation of the initial swarm (line 2) takes 𝑂(𝑀 𝑡) time steps, as at most 𝑡 bits per individual will be set to one. Since 𝑡 ≤ (𝑁 − 1)/2 in a 𝑡-diagnosable system, this is equivalent to 𝑂(𝑀 𝑁 ). The loop in line 3 has ‘maxIter’ as upper bound. The derivation of the candidate syndromes for the population in line 5 is the most expensive step, requiring 𝑂(𝑀 𝑁 𝑡) with the PMC model and 𝑂(𝑀 𝑁 𝑡/2) for any variant of the comparison model. In any case, such complexity is bounded by 𝑂(𝑀 𝑁 2 ). Line 6 is done in 𝑂(𝑀 𝑁 ) steps whereas lines 7–8 are ac⃗ 𝑖) complished altogether in 𝑂(𝑀 ). The calculation of 𝑟𝑚𝑎𝑥 (𝑋 has 𝑂(𝑀 𝑁 ) whereas the infeasibility handling scheme in line 32 needs 𝑂(𝑁 log 𝑁 ) time slots for sorting the fitness values of all vector components plus 𝑂(𝑁 ) for flipping bits. In total, the code in lines 14–25 can be run in 2𝑀 2 𝑁 +5𝑀 2 −𝑀 𝑁 −𝑀 and that in lines 27–34 requires 𝑀 𝑁 log 𝑁 +3𝑀 𝑁 +𝑁 . This brings the overall FAFD time complexity to 𝑂(maxIter×(𝑀 𝑁 2 +𝑀 2 𝑁 +𝑀 𝑁 log 𝑁 )). As simulations will reveal, both 𝑀 and maxIter will take modest values in reality, so the final expression boils down to 𝑂(𝑁 2 +𝑁 log 𝑁 ). The quadratic component above is not contributed by FAFD itself but is due to the unfolding of the candidate syndromes. To the best of the authors’ knowledge, all optimization algorithms dealing with diagnosable systems of arbitrary topologies run in Ω(𝑁 2 ). The actual sought-after property is a steady average detection time as more faulty units appear in the system. Next section confirms that FAFD fulfills this goal.

if 𝐼0∗ = 1 ∣∣ iteration > maxIter then stop ← true; break; end if for 𝑖 = 1 to 𝑀 do ⃗𝑖) ← 𝑟𝑚𝑎𝑥 (𝑋

max

⃗𝑗 :𝐼0 (𝑋 ⃗𝑗 )>𝐼0 (𝑋 ⃗𝑖 ) ∀𝑋

VI. S IMULATION R ESULTS

⃗𝑖 ⊳ for each firefly 𝑋 ⃗𝑖, 𝑋 ⃗𝑗 )}; {dist(𝑋

16: ⃗𝑗 17: for 𝑗 = 1 to 𝑀 do ⊳ for each firefly 𝑋 ⃗ 𝑗 ) > 𝐼0 (𝑋 ⃗ 𝑖 ) then ⃗𝑖 → 𝑋 ⃗𝑗 18: if 𝐼0 (𝑋 ⊳ move 𝑋 ⃗ ⃗ 19: 𝑟𝑖𝑗 ← dist(𝑋𝑖 , 𝑋𝑗 ); ⃗ 𝑖 ) ← 𝛾0 /𝑟𝑚𝑎𝑥 (𝑋 ⃗ 𝑖 ); 20: 𝛾(𝑋 ⃗𝑖, 𝑋 ⃗𝑗 ) ← 𝛽0 ⋅ exp(−𝛾(𝑋 ⃗ 𝑖 ) ⋅ 𝑟𝑖𝑗 ); 21: 𝛽(𝑋 22: apply movement rule in (2) ∀𝑘 = 1..𝑁 ; 23: end if 24: end for 25: end for 26: move best firefly 𝑋⃗𝑖∗ according to (3) ∀𝑘 = 1..𝑁 ; 27: 28: 29: for 𝑖 = 1 to 𝑀 do ⃗ 𝑖 ← binary mapping rule in (11) ∀𝑘 = 1..𝑁 ; 𝑋 30: ⃗ 𝑖 is infeasible then 31: if 𝑋 32: apply infeasibility scheme in Sect. V-E; 33: end if 34: end for 35: end while

We have empirically validated FAFD’s performance under 𝑡-diagnosable systems of various sizes with the PMC and comparison diagnosis models. Our algorithm has been contrasted to AIS in [13] and PSO in [14]. The experiments were conducted in Java using JDK 1.6 on an Intel Core2 Duo E4500 2.2GHz with 2 GB of RAM under Windows Vista. Each approach was executed until either the true fault set 𝐹 was found or a maximum time was reached2 . The performance metrics are: 𝜙 = total number of candidate solutions probed and 𝜌 = CPU time consumed. For scalability assessment, 𝑡diagnosable systems of different sizes were tried, viz 𝐺49 (100) and 𝐺249 (500). For each system above, the number of faults 𝑥 ranged from 1 to 𝑡 and for every value of 𝑥, 100 arbitrary fault sets of that cardinality were generated. The average- and worst-case behaviors of the algorithms in terms of 𝜙 and 𝜌 were recorded. 2 The upper bound of the running time is expressed as a function of the number of iterations carried out by each algorithm. The value maxIter = 1000 was used for all experiments as a means to ensure an acceptable degree of responsiveness, a tight requirement in many real-world applications. In such case, the fault set that best resembles the input syndrome is returned.

1364

A. Parameterization Setup The AIS configuration in [13] was adopted. For PSO, recommended values in [17] were respected, i.e. 𝑐1 = 𝑐2 = 2.05 and the constriction factor 𝜒 = 0.7298 is included in the velocity update rule. Both PSO and FAFD have a common neighborhood topology (“global best”). To guarantee a fair comparison baseline, the three metaheuristic schemes share the same population size (𝑀 = 10). Extensive trials confirm this is the most realistic choice in terms of accuracy and convergence rate for the application of these methods to resource-constrained environments. The randomization factor in FAFD was set to 𝛼 = 10−3 , as it slightly outperforms peer values. We don’t need too much randomness in the firefly movement rule, as the boundaries of the 𝑁 -dimensional binary search space can be efficiently reached with the help of the solutions contributed by informer fireflies. The remaining FAFD parameters were set to 𝛾0 = 𝛽0 = 1 after applying an identical trial-and-error rationale. B. Asymmetric Comparison Model, Medium-Size Graph When the three algorithms are tested with a 𝐺49 (100) graph, Fig. 2(a) reveals that PSO shows an erratic worst-case behavior in the number of particles required to find the global optimum. Interestingly, this phenomenon fades out as the dimension of the search space augments. One could envision that the presence of a single informer (swarm’s best) in a particular region of the search space could be drawing other particles strongly enough so as to lead them into an infeasible encoding. Particles can more easily escape stagnation and be redirected towards any stochastic, feasible encoding as the size of the search space increases. Fortunately, the combined influence of ⃗ 𝑖 ) in (10) manifold informers together with the customized 𝛾(𝑋 prevents FAFD agents from getting caught in local optima, as proved by the fact that FAFD consumes a nearly constant and negligible number of candidate solutions to converge. The evolutionary search undertaken by AIS yields almost identical results for the average and worst cases. The running time of the firefly-based algorithm is smaller than that of rival methods. Fig. 2(b) displays how the insufficient guidance in PSO still produces a significant delay in convergence speed. Although PSO and FAFD do very well in the average case (both efficiently reach the global optimum in less than 2 sec), for the former it takes up to 5x to detect 𝐹 than the latter in the worst case. C. PMC Model, Large Graph Let us finally contrast FAFD vs. PSO in presence of a complex system 𝐺249 (500). AIS was excluded from this experiment due to its poor and impractical behavior. Fig. 3(a) bears witness to the inability of PSO in locating the true fault set for most 𝑡 ∈ {1..200}, as depicted by the straight line topping 𝜙 along 1,000 iterations. Its vanishing effect can be explained by the same argument in Sec. VI-B. While FAFD never failed to encounter 𝐹 , PSO fell short in 78% of the cases. Recall that both methods are tried with a small population size (𝑀 = 10), which makes FAFD a strong

candidate for facing challenging fault detection scenarios with low resource requirements. Even FAFD’s worst-case setting is preferable to PSO’s average-case. PSO’s running time 𝜌 increases nearly monotonically in Fig. 3(b) with the expected number of faulty units until a large enough search space is reached for 𝑡 > 200. In such cases, the presence of numerous feasible candidate sets helps attenuate the undesirable effect of having a single informer particle draw all remaining population members towards a reduced group of infeasible encodings. Because of PSO’s high failure rate, its average-case behavior approaches the worst-case performance. Observe how FAFD responds with the true fault set within 20 sec. on average and always within 40 sec., irrespective of the values of 𝑡. This can be ascribed to the enhanced social communication mechanisms present in the firefly swarm. VII. C ONCLUSIONS AND F UTURE W ORK In this study, we have formulated a discrete version of the Firefly Algorithm to tackle the system-level fault diagnosis problem. The proposed scheme FAFD utilizes a binary encoding of the potential solutions (fireflies) and has them navigate in a personalized search subspace, whose characteristic length is embedded into an adaptive light absorption coefficient. Infeasible solutions are repaired by flipping the assumption on the status of the least promising nodes. As a result, we witness an accelerated convergence to the global optimum and a very low memory consumption rate (in contrast to PSO and AIS) under the invalidation and comparison test models and regardless of the system size. FAFD also eliminates PSO’s erratic worst-case behavior in small and medium networks and its inability to capture the true fault set in large graphs. In the future, other fault diagnosis models and performance metrics will be tested and further enhancements will be sought through hybridization and local search mechanisms. R EFERENCES [1] S. L. Hakimi and K. Y. Chwa, “Schemes for Fault Tolerant Computing: a Comparison of Modularly Redundant and ⁀t-Diagnosable Systems”, Information & Control, vol. 49, pp. 212–238, Jun. 1981. [2] J. Maeng and M. Malek, “A Comparison Connection Assignment for Self-Diagnosis of Multiprocessor Systems”, Proc. 11𝑡ℎ Int’l Symposium on Fault-Tolerant Computing, New York, USA, 1981, pp. 173–175. [3] F. P. Preparata, G. Metze and R. T. Chien, “On the Connection Assignment of Diagnosable Systems”, IEEE Transactions on Electronic Computing, vol. 16(6), pp. 848–854, Dec. 1967. [4] R. Ahlswede and H. Aydinian, “On Diagnosability of Large Multiprocessor Networks”, Discrete Applied Mathematics, vol. 156(18), pp. 3464– 3474, 2008. [5] X. Yang. G. M. Megson and D. J. Evans, “A Comparison-based Diagnosis Algorithm Tailored for Crossed Cube Multiprocessor Systems”, Microprocessors and Microsystems, vol. 29(4), pp. 169–1754, 2005. [6] H. Yang, X. Yang and A. Nayak, “A (4n - 9)/3 Diagnosis Algorithm for Generalised Cube Networks”, Int. J. Parallel Emerg. Distrib. Syst., vol. 25(3), pp. 171–182, 2010. [7] K. Tzu-Liang, C. Hsing-Chung and J.J.M. Tan, “On the Faulty Sensor Identification Algorithm of Wireless Sensor Networks under the PMC Diagnosis Model”, Proc. 6th Int’l Conference on Networked Computing and Advanced Information Management (NCM) pp. 657–661, 2010. [8] T. Kameda, S. Toida and F.J. Allan, “A Diagnosis Algorithm for Networks”, Information & Control, vol. 29, pp. 141–148, 1975. [9] G. Sullivan, “An 𝑂(𝑡3 +∣𝐸∣) Fault Identification Algorithm for Diagnosable Systems”, IEEE Transactions on Computers, vol. 37(4), pp. 388–397, 1988.

1365

Asymmetric Comp. 100/49 - FAFD alpha=0.001 - M=10

Asymmetric Comp. 100/49 - FAFD alpha=0.001 - M=10 25000 FAFD: Average-case FAFD: Worst-case PSO: Average-case PSO: Worst-case AIS: Average-case AIS: Worst-case

25000

Execution time (milliseconds)

Total number of generated individuals

30000

20000

15000

10000

FAFD: Average-case FAFD: Worst-case PSO: Average-case PSO: Worst-case AIS: Average-case AIS: Worst-case

20000

15000

10000

5000

5000

0

0 0

5

10

15

20

25

30

35

40

45

50

5

0

10

(a) Exploratory capabilities of the three algorithms Fig. 2.

15

20

25

30

35

40

45

50

Number of faulty nodes

Number of faulty nodes

(b) Convergence speed of the three algorithms

Performance metrics for AIS, PSO and FAFD with the asymmetric comparison model and 𝐺49 (100) graph.

PMC 500/249 - FAFD alpha=0.001 - M=10

PMC 500/249 - FAFD alpha=0.001 - M=10 100000

FAFD: Average-case FAFD: Worst-case PSO: Average-case PSO: Worst-case

12000

Execution time (milliseconds)

Total number of generated individuals

14000

FAFD: Average-case FAFD: Worst-case PSO: Average-case PSO: Worst-case

10000

8000

6000

4000

80000

60000

40000

20000

2000

0

0 0

50

100

150

200

250

0

Number of faulty nodes

100

150

200

250

Number of faulty nodes

(a) Exploratory capabilities of the two algorithms Fig. 3.

50

(b) Convergence speed of the two algorithms

Performance metrics for PSO and FAFD with the PMC model and 𝐺249 (500) graph.

[10] X.S. Yang, Nature-Inspired Metaheuristic Algorithms, Luniver Press, 2008. [11] M. Elhadef, A. Nayak and N. Zeng, “An Ant-based Fault Identification Algorithm for Distributed and Parallel Systems”, Proc. 10𝑡ℎ World Conference on Integrated Design & Process Technology IDPT-2007, Antalya, Turkey, Jun. 2007, pp. 1–6. [12] M. Elhadef and B. Ayeb, “An Evolutionary Algorithm for Identifying Faults in t-Diagnosable Systems”, Proc. IEEE Symposium on Reliable Distributed Systems SRDS 2000, N¨urnberg, Germany, Oct. 2000, pp. 74– 83. [13] H. Yang, M. Elhadef, A. Nayak and X. Yang, “Network Fault Diagnosis: An Artificial Immune System Approach”, Proc. 14th IEEE Int’l Conference on Parallel and Distributed Systems, Melbourne, Australia, Dec. 2008, pp. 463–469. [14] R. Falcon, M. Almeida and A. Nayak, “A Binary Particle Swarm Optimization Approach to Fault Diagnosis in Parallel and Distributed Systems”, Proc. IEEE Congress on Evolutionary Computation (CEC), Barcelona, Spain, July 2010, pp. 41–48. [15] J. Kennedy and R. C. Eberhart, “Particle Swarm Optimization”, Proc. IEEE Int’l Conference on Neural Networks, Perth, Australia, 1995, pp. 1942–1948.

[16] X.S. Yang, “Firefly Algorithms for Multimodal Optimization”, Proc. Stochastic Algorithms: Foundations and Applications, LNCS 5792, 2009, pp. 169–178. [17] D. Bratton and J. Kennedy, “Defining a Standard for Particle Swarm Optimization”, Proc. IEEE Swarm Intelligence Symposium SIS 2007, Honolulu HI, USA, Apr. 2007, pp. 120–127. [18] A. Pelc, “Undirected Graph Models for System-Level Fault Diagnosis” IEEE Trans. on Electronic Computers, vol. 40(11), pp. 1271–1276, 1991. [19] S. L. Hakimi and A. T. Amin, “Characterization of the Connection Assignment of Diagnosable Systems”, IEEE Trans. on Computers, vol. C-23, pp. 86–88, Jan. 1974. [20] D.M. Blough and A. Pelc., “Complexity of Fault Diagnosis in Comparison Models”, Proc. IEEE Transactions on Computers, vol 41(3), 1992, pp. 318–324. [21] M.K. Sayadi, R. Ramezanian and N. Ghaffari-Nasab, “A Discrete Firefly Meta-Heuristic with Local Search for Makespan Minimization in Permutation Flow Shop Scheduling Problems”, Int’l Journal of Industrial Engineering Computations, vol. 1, pp. 1–10, 2010. [22] S. Yuan and F. Chu, “Fault Diagnostics based on Particle Swarm Optimization and Support Vector Machines” Mechanical Systems and Signal Processing, vol. 21, pp. 1787–1798, 2007.

1366

Suggest Documents