International Journal of Computers and Applications, Vol. 31, No. 4, 2009
TIME-CONSTRAINED FAULT TOLERANT X-BY-WIRE SYSTEMS
P.M. Khilar∗ and S. Mahapatra∗∗
Abstract
This paper presents a software-based approach to the problem of distributed actuator failure diagnosis under resource and deadline constraints in fly-by-wire (FBW) systems, using a comparison of actuator behaviour against a corresponding mathematical model. (2k + 1) replicated control and diagnosis tasks are executed to tolerate k faults in the control and diagnosis functions. The proposed method has been evaluated through simulation based on real data and has been compared with the single-rate steer-by-wire (SBW) system recently suggested by Kandasamy et al. [1].
Key Words
Safety critical systems, fault diagnosis, distributed embedded systems, task scheduling, diagnosis latency
1. Introduction
Distributed embedded systems, which comprise processors, smart sensors, and smart actuators interacting via a communication medium, are widely used in safety critical systems such as fly-by-wire (FBW) systems. The repeated occurrence of transient faults (faults of short duration) in such systems can lead to permanent faults. Transient faults cause the system to work in a degraded fashion but do not cause catastrophic damage to system components. Permanently faulty units, on the other hand, can cause catastrophic events, endangering the related applications, which demand a high level of fault tolerance and performance under severe cost constraints. One possible solution is to mask failures through sufficient physical redundancy, such as the N backup microcontrollers used in the Boeing 777 [2] and MARS [3], which does not affect the execution time of the working microcontrollers. Moreover, the weight, cost, space, and power consumption of the most powerful up-to-date microcontrollers are lower than those of older-generation processors, making an NMR approach easier to adopt [4]. However, the power consumption of the electronically signalled sensors, actuators, and processors used in an FBW system is higher. An alternative and better approach is to use software means to diagnose the critical components and bypass faulty components during system operation. A few active standby backup microcontrollers can then be enabled by software to replace permanently faulty critical components. In this paper, comparison-based diagnosis is used to diagnose a permanently faulty actuator, to reach an agreement over the fault-free units [5], and to allow the recovery tasks to enable the active standby nodes automatically to replace the faulty nodes. Comparison-based diagnosis can reduce the physical redundancy required by N-modular backup microcontrollers [6]. The control surfaces and other components of an FBW system use a number of actuators and processors to perform flight control functions such as steering, tracking, and adaptive cruise control during take-off, flight, and landing. Embedded systems used in FBW applications for flight navigation and control employ microprocessor-controlled electromechanical actuators and networks without any mechanical backup. In these systems, processors measure the angle of attack, calculate the desired actuator deflection, and then command the electromechanical actuators at the control surfaces of the aircraft appropriately [7–9]. If an actuator delivers erroneous results due to an electromechanical failure, this may result in undesirable system-level behaviour; e.g., a faulty actuator may cause an unwanted runway deviation during landing. The timely diagnosis of faulty actuators under deadline and resource constraints can ensure a fault-tolerant FBW system. Previous approaches to distributed diagnosis mainly treat diagnosis as a standalone objective without considering the normal system functions and the real-time behaviour of control applications [5, 10–21]. Most of these algorithms work offline and are not suitable for online embedded control applications. Recently, Kandasamy et al. [1] proposed the distributed failure diagnosis of actuators used in steer-by-wire (SBW) systems. Their work assumed a single-rate (SR) system in which the control functions such as steering, traction, and cruise have the same period. However, their approach suffers from a number of limitations: (i) the control and diagnosis tasks do not have any jitter left between them and therefore fail to accommodate the delay due to clock deviation and communication latency;
∗ Department of Computer Science & Engineering, NIT Rourkela – 769008; e-mail: [email protected]
∗∗ Department of E & ECE, IIT Kharagpur – 721 302; e-mail: [email protected]
Recommended by Dr. D. Wang (paper no. 202-2391)
(ii) the technique follows a first-fit approach for processor allocation and hence fails to achieve optimum utilization of system processors (in fact, to maximize the use of available hardware, FBW systems tend towards high processor utilization, often close to 100%); (iii) the technique diagnoses actuator failures caused by actuator faults while overlooking those caused by processor faults; and (iv) manual recovery action is assumed when an actuator becomes permanently faulty, leaving the burden on the pilot. The aim of this paper is to propose an improved software-based approach to the problem of distributed actuator failure diagnosis under resource and deadline constraints using a comparison of actuator behaviour with a corresponding mathematical model [22]. The proposed method also implements control and diagnosis task scheduling using a fixed-priority approach, which guarantees actuator/system processor diagnosis within designer-specified deadlines while meeting control performance goals. The paper outline is as follows: Section 2 presents the adopted system and fault models. The algorithm for distributed actuator diagnosis is given in Section 3. Section 4 discusses the scheduling problem and derives an estimate for the diagnosis deadline. Section 5 introduces a simulation based on real data and compares the proposed method with the recently suggested single-rate SBW system to show its advantages, followed by the conclusion in Section 6.
2. System and Fault Model
Fig. 1 shows the physical system architecture, comprising processors, smart sensors, and actuators directly interconnected by a broadcast network such as a time-shared bus. Sensors and actuators have limited computing power and communicate with system processors through a network interface. Processors exchange messages via the broadcast network and incur a communication delay, using a reliable communication protocol such as TTP [23] or CAN [24]. A synchronous system is assumed, in which the execution times of the control and diagnosis tasks, the communication delay, and the clock drift are bounded and deterministic. Unlike an asynchronous system model, the synchronous assumption allows real-time guarantees on the control and diagnosis task deadlines for the proposed system. The fault model assumes that processors, sensors, and actuators may suffer a bounded number of operational failures, including permanent and transient ones, and actuators can deliver erroneous results to the physical system due to electromechanical failures. Permanent faults remain until corrective action is taken, while transient faults appear and disappear quickly and at random times. Fig. 2 illustrates a typical k-fault-tolerant (k-FT) configuration for a periodic control application executing on a subset of the n system processors. The control tasks obtain their input data from replicated and possibly diverse sensors, which may be of different types. The sensing and control tasks "Sensei" and "Computei" execute on sensor Si and processor Pi, respectively. The actuation task "Actuatei" executes on actuator Ai, acts on the generated commands, and issues the final result controlling the actuator. Faults affecting this task due to hardware failures may result in incorrect actuation outputs being sent to the physical system.
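To illustrate the k-FT configuration described above, the sketch below shows how (2k + 1) replicated sensor readings can be voted on before a single actuation command is issued, masking up to k erroneous readings. It is only a minimal illustration: the function names, the use of a median as the voter, and the example control law are assumptions, not part of the original design, and a real implementation would run the replicated tasks on separate sensors and processors.

```python
from statistics import median

def k_ft_control_step(sensor_readings, control_law):
    """One control period of a (2k+1)-replicated sense/compute path.

    sensor_readings : values from the 2k+1 replicated (possibly diverse) sensors
    control_law     : function mapping a sensed value to an actuator command
    """
    assert len(sensor_readings) % 2 == 1, "needs 2k+1 replicas"
    # Vote on the replicated sensor values first, masking up to k bad readings.
    voted = median(sensor_readings)
    # The voted value is then used to compute the actuation command.
    return control_law(voted)

# Example: k = 1, so 3 replicas; the corrupted third reading is masked.
readings = [10.1, 10.0, 97.3]                            # third sensor is faulty
print(k_ft_control_step(readings, lambda s: 0.5 * s))    # ~5.05, not ~48
```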
3. The Proposed Diagnosis Algorithm
Fig. 2 shows the integrated control and distributed actuator diagnosis approach used to tolerate k faults. A sensing task executing on monitoring sensor Sj monitors Ai's behaviour in response to actuator command ai. Typically, the monitored signal is measured by Sj after a sampling delay ts. A model evaluation task applies a mathematical model Mi to the measured sensor value sj and estimates the previously applied actuator command; let this estimate be denoted a1i. The checker tasks executing on different processors compare the voted estimate with the actual actuator command ai to generate the error or residue ei = ai − a1i. These local residues are then exchanged among the median voter tasks executing on different processors to obtain a consolidated residue Ei, discarding the residues with the largest differences and keeping only the middle value. At this point, distributed actuator diagnosis begins in order to obtain a global view of Ai's status. In distributed actuator diagnosis, each diagnosis task independently evaluates the behaviour of actuator Ai, and multiple tasks agree on the fault status of Ai.
Figure 2. The integrated approach to fault-tolerant control and distributed actuator diagnosis.
Figure 1. The distributed system architecture.
Each distributed diagnosis task obtains the consolidated residue Ei and evaluates it against a previously defined threshold ε to diagnose Ai. The actuator status is observed in every control period, and the occurrence of a transient fault is recorded whenever the actuator fails. Every occurrence of a transient fault increments the fault count α-count until an upper limit αT is reached; the occurrence of multiple consecutive transient faults is treated as a permanent fault. Once an actuator is diagnosed as suffering from a permanent fault, the faulty actuator is shut down and another actuator in active standby mode is automatically switched on as the recovery action. Fig. 3 outlines the diagnosis algorithm, which is executed iteratively for each of the actuators to achieve correct and timely fault diagnosis. Since transient faults do not prevent the system from functioning correctly, a processor waits for the occurrence of a permanent fault by using the error count α-count (i.e., the number of consecutive transient faults observed so far), which is incremented on every consecutive occurrence of a transient fault. Depending on how many processors/actuators are diagnosed as permanently faulty, the system can automatically reconfigure and enable the active standby units, which execute the same tasks as the other units. Since the diagnosis is assumed to be correct, there is no need to duplicate the distributed diagnosis and recovery tasks, nor to mask out faulty results with voting [33]; these task replications are therefore omitted from the task graph of Fig. 4. As a result, distributed failure diagnosis theoretically improves the throughput of a system by the number of fault-free units over the same system employing NMR [25]. A penalty is incurred when a fault occurs and the system must switch the tasks from the faulty unit to an active standby to recover the system for continued operation.
4. The Scheduling Algorithm
The integrated control and distributed diagnosis approach shown in Fig. 4 is modelled as a directed acyclic precedence task graph involving multiple control and diagnosis tasks
on a network of multiple processing entities. Each vertex is labelled Ti(ci), where Ti is the task and ci its worst-case execution time in μs. A precedence relation between tasks Ti and Tj is represented by an edge whose weight (the exact values are given in Section 5) is either the interprocessor communication delay (if tasks Ti and Tj are scheduled on different processors) or the sampling delay of the monitored signals. The graph G1 in Fig. 4 is a sequential composition of actuator control, monitoring, evaluation, voting, and distributed diagnosis and recovery tasks, where all the inputs to one task are the outputs of previous ones. G1 is augmented with additional buffering tasks to allow the tasks to be pipelined. The control performance and diagnosis latency requirements impose end-to-end timing constraints on the system. Table 1 gives the functional descriptions of the tasks with the execution times of the various control and diagnosis portions. The performance and diagnosability requirements of the control application impose the following timing constraints on the task graph: (i) control period: the control period πi is the time elapsed between reading the sensors and commanding the actuator; and (ii) diagnosis latency td: the diagnosis latency is the time from the occurrence of a transient fault until all the processors are aware of the event, and is a relative timing constraint derived from system-level safety requirements. The test-scheduling method used here assumes a priori knowledge of the diagnosis latency td for an actuator Ai. The upper bounds on the fault detection latency td (or Td) are given by the following inequalities: td ≤ tunsafe − tr − (1 + αT) × πi and Td ≤ tunsafe − tr, where Td = td + (1 + αT) × πi and αT = teval/πi. It is assumed that tunsafe is the maximum allowable diagnosis latency, which forms the basis for estimating td. A suspect actuator is typically evaluated for some period teval after the initial diagnosis to improve confidence in the final decision. The actuator is considered to suffer from a permanent fault once the error count α-count reaches the threshold given by αT = teval/πi. Once a permanent fault is detected, the faulty actuator or processor is shut down and a hot standby is activated, with a recovery-action overhead of tr.
Perform the following for each actuator under control and diagnosis:
Step 1. Fault-tolerant control
1.1 Each of the (2k + 1) sensor tasks samples the data and sends the value to the (2k + 1) compute tasks.
1.2 Each of the (2k + 1) compute tasks votes on the sensed values and issues the actuation command to the physical system.
Step 2. Fault-tolerant diagnosis
2.1 Each of the (2k + 1) checker tasks reads the sampled actuator value from the monitoring sensor (e.g., the lift force deflecting the control surface) and compares it with the voted value obtained from model-based evaluation to obtain the residues e1 to e2k+1.
2.2 Each of the (2k + 1) median voter tasks discards the residues having the largest difference and keeps only the single most appropriate residue Ei.
Step 3. Distributed actuator diagnosis
3.1 Each diagnosis task Ti broadcasts its residue Ei to every other diagnosis task Tj.
3.2 Each diagnosis task Ti receives the residue Ej from every other diagnosis task Tj.
3.3 Each diagnosis task compares the residue Ei with the error threshold ε.
3.4 If Ei is greater than ε, the actuator is recorded as transient faulty and the α-count (i.e., the number of times the actuator has been transient faulty so far) is incremented.
3.5 If (α-count > αT), the actuator is recorded as permanently faulty; shut down the actuator and activate the standby.
Step 4. Repeat Steps 1 through 3 for each actuator under control at regular intervals.
Figure 3. The algorithm for fault-tolerant control and actuator diagnosis.
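To make the residue comparison and α-count bookkeeping of Fig. 3 concrete, a minimal per-actuator sketch is given below. The function and variable names (model_estimate, alpha_threshold, the reset-on-fault-free policy) are illustrative assumptions; in the actual system the replicated checker, voter, and diagnosis tasks run on separate processors and exchange residues over the broadcast network.

```python
from statistics import median

def diagnose_actuator(applied_cmd, sensor_samples, models, epsilon,
                      alpha_count, alpha_threshold):
    """One control-period iteration of the Fig. 3 diagnosis for one actuator.

    applied_cmd     : actuator command a_i issued in this period
    sensor_samples  : measurements s_j from the (2k+1) monitoring sensors
    models          : (2k+1) inverse models M_i mapping s_j -> estimated command
    epsilon         : residue threshold for declaring a transient fault
    alpha_count     : consecutive-transient-fault counter kept between periods
    alpha_threshold : alpha_T = t_eval / pi_i; beyond this the fault is permanent
    """
    # Step 2.1: form the local residues e_i = a_i - a_i^1.
    residues = [applied_cmd - model_estimate(s) for model_estimate, s
                in zip(models, sensor_samples)]

    # Step 2.2: median voting discards outlying residues and keeps the
    # middle value as the consolidated residue E_i.
    consolidated = median(residues)

    # Steps 3.3-3.5: threshold comparison and alpha-count update.
    if abs(consolidated) > epsilon:
        alpha_count += 1          # transient fault recorded this period
    else:
        alpha_count = 0           # fault-free period (assumed reset policy)

    permanent = alpha_count > alpha_threshold
    return consolidated, alpha_count, permanent
```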
Figure 4. Task graph G1 corresponding to the integrated control and diagnosis method of Fig. 2 with the buffering tasks.

Table 1
Functional Descriptions of Tasks in the Task Graph in Fig. 4 with Execution-Time Ranges of the Control and Diagnostic Tasks Used in the Simulations

Task          Function                                        cmin (μs)   cmax (μs)
T1 T2 T3      Read speed/AOA sensors                          150         200
T4 T5 T6      Compute force command                           400         500
T7 T8 T9      Issue voted force command                       350         400
T10 T11 T12   Read torque/moment sensor ST                    150         200
T13 T14 T15   Compute residues e1, e2, e3 using M1, M2, M3    400         500
T16 T17 T18   Vote on {ei} to diagnose actuator               350         400
T19 – T30     Buffering                                       50          50
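As a rough illustration of how the execution-time ranges in Table 1 are used later in the simulations (Section 5), the sketch below stores each task group's [cmin, cmax] range and draws one execution time per group uniformly at random. The dictionary layout and helper name are assumptions made for illustration; only the numeric ranges and the uniform-sampling choice come from the paper.

```python
import random

# (cmin, cmax) execution-time ranges in microseconds, taken from Table 1.
TASK_RANGES = {
    "T1-T3":   (150, 200),   # read speed/AOA sensors
    "T4-T6":   (400, 500),   # compute force command
    "T7-T9":   (350, 400),   # issue voted force command
    "T10-T12": (150, 200),   # read torque/moment sensor ST
    "T13-T15": (400, 500),   # compute residues e1..e3 using M1..M3
    "T16-T18": (350, 400),   # vote on {e_i} to diagnose the actuator
    "T19-T30": (50, 50),     # buffering tasks
}

def sample_execution_times(ranges=TASK_RANGES, seed=None):
    """Draw one execution time per task group, uniformly in [cmin, cmax]."""
    rng = random.Random(seed)
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

# Example: one random instantiation of the task execution times.
for group, c in sample_execution_times(seed=1).items():
    print(f"{group}: {c:.0f} us")
```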
The scheduling problem is: given the task graphs {Gi} corresponding to the diagnosis approach shown in Fig. 4, obtain a feasible schedule for each Gi satisfying both the period πi and the diagnostic latency td while minimizing the number of required processors and the processor overhead, and maximizing processor utilization. The majority of current scheduling practice for avionics systems employs a cyclic scheduler, which involves two constituent parts: a minor cycle and a major cycle. The major cycle is a sequence of tasks that is executed periodically. Each major cycle consists of a number of minor cycles that split the major cycle into uniform parts to reduce the jitter. Jitter is the change in the time at which a task is released; it is caused by variations in the worst-case execution times of tasks and increases through the course of the cycle, starting from zero at the start of the cycle. A major strand of this paper is supporting the transition from cyclic scheduling to a simplified version of fixed-priority scheduling, which eases the problem of meeting timing requirements, now and for the immediate future, for FBW systems. This scheduling strategy considers the resource, precedence, and synchronization requirements of all tasks in the system and generates a feasible schedule that guarantees that the timing requirements of the tasks will be met. The scheduler is typically table driven and has low run-time overhead. Several dependable architectures for next-generation applications use this type of scheduling [1, 3, 26]. Static table-driven approaches are applicable to periodic tasks or to aperiodic (sporadic) tasks that can be transformed into periodic ones [27–30]. To tackle multi-rate (MR) systems, we use fixed-priority table-driven scheduling in which tasks of a graph with a shorter period are given higher priority. Within each graph, a list-scheduling heuristic is followed in which tasks are assigned priorities in topological sorting order and placed in a list of decreasing priority [31]. Tasks with higher priorities are scheduled first. This approach satisfies the static scheduling objectives and minimizes the number of processors needed to schedule the task graph. The scheduling or dispatch table identifies the start and finish times of each task, and tasks are dispatched according to this table. Initially, the scheduling range of tasks in the control and diagnosis portions is determined using πi and td as the end-to-end deadlines. During execution, the range of tasks is updated based on task dependency, performance, and diagnosis constraints between overlapped iterations. The schedule generation algorithm schedules a task Tj in a pre-emptive fashion within a partial schedule S(Gi) such that previously scheduled tasks continue to satisfy their constraints. For multiple graphs G1, . . . , Gk with different periods, each graph is scheduled by overlapping its individual tasks to meet the control performance and diagnosis deadlines.
Graph unrolling is used to achieve pipelined execution: the original graph is replicated to expose tasks across multiple iterations, which are then overlapped to obtain the schedule. The release time and deadline of tasks are determined by a combination of precedence and performance/diagnosis constraints. If Ti → Tj, then the message (data) produced by Ti must be consumed by Tj only after a communication or sampling delay. The scheduling algorithm is given in Fig. 5 and consists of three major steps: (i) initialization, (ii) task selection and frame generation, and (iii) updating. During the first step, the scheduling range of the entry task and exit task of the critical path, which is the path in the graph along which the execution and communication times of the tasks are highest, is fixed. The laxity available for distribution to the intermediate tasks is slacki = [Di − Σci]/(n ∗ y), where Di is the deadline of the critical path of the graph, ci is the execution time of a task Ti, n denotes the number of tasks along the critical path, and y is a parameter (in percent) used to vary this laxity. The distribution heuristic maximizes the minimum laxity added to each Ti along path i by dividing slacki equally among the tasks during each iteration, thus allowing sufficient jitter between the tasks. We assume that these task jitters can be accommodated within the clock deviation and communication latency of any clock synchronization and communication protocol [3]. The above process is repeated until all tasks are assigned a release time ri and a deadline di. The second step computes the set of ready tasks. The schedule is feasible if all the tasks finish before their deadlines. A ready task is scheduled as soon as possible (ASAP) within a frame, where a processor's time is divided into multiple frames [32]. Once all ready tasks from step (ii) are scheduled, the scheduling range of each successor task Ti in the graph is computed. Steps (ii) and (iii) are repeatedly executed until the graph is completely scheduled.

Step 1: Initialization
1.1 Prepare the ready list by assigning higher priority to tasks of graphs with shorter periods.
1.2 Assign higher priority to tasks based on the precedence relationships within each graph.
1.3 Compute slack = [Di − Σci]/(n ∗ y) /* Di is the deadline of critical path i, ci the execution time of a task Ti, n the number of tasks along this path, and y the parameter used to vary the slack */
1.4 Distribute the slack to every task and obtain the scheduling range [rj, dj] of all tasks.
Step 2: Task selection and frame generation
2.1 Select a task Tj from the ready list.
2.2 If Tj is a sensing (actuation) task, schedule it on the corresponding sensor (actuator) Si (Ai).
2.3 If Tj is a control, diagnosis, or buffering task, obtain a list of candidate processors in increasing order of utilization and, using a load-balancing approach, schedule and allocate the task to a processor Pi within this list on which its corresponding buffering task Tk is scheduled.
2.4 If the schedule for Tj on Pi is infeasible, unschedule Tk and place it back in the ready list, reschedule it on a different processor (adding a new processor if necessary), and remove Pi as a candidate for tasks Tk and Tj.
Step 3: Updating. If a task overflow occurs (i.e., its range extends across the schedule boundary), allocate the overflow portion to [0, dj] and adjust the scheduling range.
Step 4: Repeat Steps 2 through 4 until all the graphs are completely scheduled.
Figure 5. The scheduling algorithm.
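A minimal sketch of the slack-distribution step of Fig. 5 (steps 1.3–1.4) is given below, assuming a single critical path represented as an ordered list of tasks. The names (Task, distribute_slack) and the single-pass equal split are illustrative simplifications of the iterative heuristic described above, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    c: float          # execution time (us)
    release: float = 0.0
    deadline: float = 0.0

def distribute_slack(path, D, y=1.0):
    """Assign [release, deadline] ranges to tasks along one critical path.

    path : tasks in precedence order along the critical path
    D    : end-to-end deadline of the path (e.g., pi_i or t_d)
    y    : parameter used to vary the laxity, as in slack_i = (D - sum c_i)/(n*y)
    """
    n = len(path)
    slack = (D - sum(t.c for t in path)) / (n * y)   # laxity share per task
    clock = 0.0
    for t in path:
        t.release = clock
        t.deadline = clock + t.c + slack             # equal share of the laxity
        clock = t.deadline
    return path

# Example: a three-task path with a 5,000 us end-to-end deadline.
tasks = [Task("sense", 200), Task("compute", 500), Task("actuate", 400)]
for t in distribute_slack(tasks, D=5000):
    print(t.name, round(t.release), round(t.deadline))
```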
5. Simulation Results
In the simulation model, a file stores the execution times, communication delays, sampling delays, and control and diagnosis deadlines corresponding to the graphs of interest. These parameters were used as inputs to evaluate output parameters such as the number of processors, computational cost overhead, average processor utilization, and diagnosis and message overhead according to the diagnosis and scheduling algorithms. Since the graphs are static, the control and diagnosis deadlines are set beforehand. Multiple files storing different control and diagnosis deadlines reflect a multi-rate system having multiple task graphs of different periods. The computation of release times and deadlines for each task of the multiple graphs proceeds as given in the algorithm. The execution time of task Ti is taken to be a random variable uniformly distributed in the range ci = [cmin, cmax]. Table 1 shows the cmin and cmax values assumed for the various control and diagnostic tasks. The interprocessor communication cost tc is assumed to be 1,000 μs and the sampling delay Ts for the monitoring sensors is uniformly distributed in [500, 1,000] μs. The parameters tunsafe, teval, and tr are assumed to have values of 50, 22, and 18 ms, respectively. The schedules on the individual processors are simulated for a duration equal to the least common multiple (LCM) of the graph periods, to evaluate all possible interactions between tasks belonging to different graph iterations and to retain their precedence order. The maximum of the periods of all graphs gives the overall system period, which can be varied as max(πi) × (1 + laxity), where laxity is a tightness factor varying from 0 to 1 and max(πi) denotes the maximum period over all graphs. The diagnosis latency of each graph varies with the actuator Ai under control. The minimum time tmin taken to diagnose a fault after issuing the corresponding actuation command is given by the longest path through the graph from the actuation task to an exit task; the diagnosis deadline of each graph is therefore simply tmin × (1 + laxity). If 0 ≤ laxity < 0.3, the corresponding deadlines are categorized as tight; if 0.3 ≤ laxity ≤ 0.7, they are categorized as medium; otherwise, they are termed loose. For each design subspace, such as (tight period, tight latency), (loose period, medium latency), etc., 50 task graphs with different periods were created and scheduled. While evaluating the MR system, the following features were considered: (i) all task graphs have different control periods and different diagnostic latencies, (ii) variable task jitters were allowed between tasks, (iii) a best-fit processor allocation policy was used, and (iv) distributed diagnosis and recovery tasks were augmented to the task graphs. Similarly, while evaluating the SR system, the following features were considered: (i) all task graphs have equal control periods and equal diagnostic latencies to represent an SR system, (ii) no task jitters were allowed between tasks, (iii) a first-fit task-to-processor allocation policy was used, and (iv) distributed diagnosis and recovery tasks were excluded from the task graphs, thereby keeping this slot unused. To compare the MR and SR systems, the LCM of the graph periods of all graphs in the MR system is taken to be the same as the graph period of each graph in the SR system. Both the SR and MR systems use graph unrolling for functional pipelining.
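The short sketch below, under the parameter values quoted above (tunsafe = 50 ms, teval = 22 ms, tr = 18 ms), illustrates how a diagnosis-deadline bound, the laxity-based deadline categories, and the LCM simulation horizon can be computed. The helper names, the example control period of 5 ms, and the example graph periods are assumptions for illustration only.

```python
from math import lcm

T_UNSAFE_MS, T_EVAL_MS, T_R_MS = 50.0, 22.0, 18.0   # values assumed in Section 5

def td_upper_bound(period_ms):
    """t_d <= t_unsafe - t_r - (1 + alpha_T) * pi_i, with alpha_T = t_eval / pi_i,
    so (1 + alpha_T) * pi_i reduces to pi_i + t_eval."""
    return T_UNSAFE_MS - T_R_MS - (period_ms + T_EVAL_MS)

def deadline_category(laxity):
    """Classify a deadline by its laxity (tightness factor in [0, 1])."""
    if laxity < 0.3:
        return "tight"
    if laxity <= 0.7:
        return "medium"
    return "loose"

# Example: a hypothetical 5 ms control period leaves at most 5 ms for diagnosis,
# and graph periods of 4, 5, and 10 ms give a 20 ms LCM simulation horizon.
print(td_upper_bound(5.0))      # -> 5.0 ms
print(deadline_category(0.5))   # -> medium
print(lcm(4, 5, 10))            # -> 20 (ms), length of the simulated schedule
```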
5.1 Performance Analysis
The performance measures used here to evaluate the single-rate and multi-rate systems are: (i) the required number of processors, (ii) the cost overhead and average processor utilization, and (iii) the diagnosis overhead and message complexity.
1. Number of processors: For multiple graphs G1, . . . , Gk in a multi-rate system, the theoretical lower bound on the number of processors nmin is determined using nmin = max(Σi Σj cj/πi, FT(Gi)), where the inner summation is taken over all tasks Tj of graph Gi and the outer over all the graphs. This expression accounts for the number of processors needed to accommodate the corresponding workload and for the fault-tolerance requirements FT(Gi) of the individual graphs. The number of processors needed to guarantee a feasible schedule is typically greater than nmin, depending on the control and diagnosis constraints on Gi. Fig. 6 compares the number of processors required to obtain a feasible schedule for the single-rate SBW system of reference [1] and the multi-rate FBW system. The number of processors required to obtain a feasible schedule for the MR system is less than for the SR system as the graph period and diagnostic latency laxity vary from tight to loose. This is due to the greater flexibility in task scheduling and the use of a best-fit approach to processor allocation.
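A minimal sketch of this lower-bound computation is given below. The graph representation (a list of tuples of task execution times, period, and fault-tolerance requirement), the use of a ceiling to obtain an integer processor count, and taking the maximum FT over the graphs are assumptions made for illustration.

```python
from math import ceil

def processor_lower_bound(graphs):
    """n_min = max( sum_i sum_j c_j / pi_i , FT(G_i) ), as described above.

    graphs : list of (exec_times_us, period_us, ft_requirement) per task graph,
             where exec_times_us holds the c_j of all tasks in that graph.
    """
    workload = sum(sum(c) / period for c, period, _ in graphs)
    ft_needed = max(ft for _, _, ft in graphs)
    return max(ceil(workload), ft_needed)

# Example with two hypothetical graphs: the utilization-driven bound versus the
# fault-tolerance requirement (e.g., 2k + 1 = 3 replicas for k = 1).
graphs = [
    ([150, 400, 350, 50], 2000, 3),   # c_j in us, period 2 ms, FT(G_1) = 3
    ([200, 500, 400, 50], 5000, 3),   # period 5 ms, FT(G_2) = 3
]
print(processor_lower_bound(graphs))  # -> 3 here, since FT dominates
```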
Figure 6. Performance results showing the number of processors under various period and deadline laxity combinations (SR versus MR system).
2. Cost overhead and average processor utilization: The cost overhead represents the additional number of processors required by the proposed algorithm over the theoretical lower bound. Therefore, if nmin denotes the lower bound and n the actual number of processors required to satisfy the feasibility constraints, the overhead is Ω = ((n − nmin)/nmin) × 100. Since tasks are executed on a processor Pi within a frame Fk of duration φ, the individual processor utilization is UPi = ((Σj cj)/φ) × 100, where j varies over all the tasks executing in any frame of processor Pi. Therefore, the average utilization of the processors in the system is U = (Σi UPi)/n, 1 ≤ i ≤ n. Figs 7 and 8 show the performance results for processor overhead and average processor utilization under various period and deadline laxity combinations for the single- and multi-rate systems.
Figure 9. Performance results showing diagnosis overhead under various period and deadline laxity combinations.
Figure 7. Performance results showing processor overhead under various period and deadline laxity combinations.
Figure 8. Performance results showing processor utilization under various period and deadline laxity combinations.
Figure 10. Performance results showing message complexity under various period and deadline laxity combinations.
The proposed MR system reduces the cost overhead compared with the SR system, particularly when the period and diagnosis latency laxity vary from tight to medium. The processor utilization of the MR system is better than that of the SR system when the graph period and diagnostic latency are loose. This improvement is due to the more flexible scheduling policy and the best-fit processor allocation used in the MR system. Clearly, it is desirable to generate embedded systems with low overhead and high processor utilization.
3. Diagnosis overhead and message complexity: The overhead due to diagnosis when no fault occurs is computed as Δ = Σcd/(Σci + Σcd) × 100, where Σcd and Σci are the execution times of the diagnostic tasks and the control tasks, respectively, and the summation is taken over all the graphs.
Figs 9 and 10 show the diagnosis overhead and the message complexity for the proposed MR system and the existing SR system, respectively. The diagnosis overhead and message complexity of the MR system are lower than those of the SR system as the graph period and diagnostic latency laxity vary from tight to loose. This is due to the smaller number of processors required to meet the schedule.
6. Conclusion
In a safety critical application, faulty hardware components such as actuators and processors need to be identified, shut down, and subjected to recovery action before the underlying system becomes unsafe. This paper has presented a fault diagnosis method for distributed embedded applications targeting an FBW system. The approach results in lower overhead in terms of the additional number of processors required and in higher processor utilization, especially when the graph period and diagnosis latency are loose, for a multi-rate system as compared with a single-rate system. The diagnosis overhead and message complexity of the proposed system are also nominal, making the approach feasible for safety critical embedded systems.
References
[1] N. Kandasamy, J.P. Hayes, & B.T. Murray, Time-constrained failure diagnosis in distributed embedded systems: application to actuator diagnosis, IEEE Transactions on PDS, 16 (3), 2005, 258–270.
[2] Y.C. (Bob) Yeh, Design considerations in Boeing 777 fly-by-wire computers, 3rd IEEE International Symp. on High-Assurance Systems Engineering, 1998, 64–73.
[3] H. Attiya, A. Herzberg, & S. Rajsbaum, Clock synchronization under different delay assumptions, SIAM Journal on Computing, 25 (2), 1996, 369–389.
[4] R. Isermann, R. Schwarz, & S. Stolzl, Fault-tolerant drive-by-wire systems, IEEE Control Systems Magazine, 22 (5), 2002, 64–81.
[5] D. Blough & H.W. Brown, The broadcast comparison model for on-line fault diagnosis in multicomputer systems: theory and implementation, IEEE Transactions on Computers, 48 (5), 1999, 470–493.
[6] C.P. Fuhrman, Comparison-based diagnosis in fault-tolerant, multiprocessor systems, Ph.D. Dissertation, Swiss Federal Institute of Technology in Lausanne (EPFL), 1996.
[7] S.E. Lyshevski, R.D. Colgren, & V.A. Skormin, High torque density integrated electromechanical flight actuators, IEEE Transactions on Aerospace and Electronic Systems, 38 (1), 2002, 174–182.
[8] S.E. Lyshevski, Electromechanical flight actuators for advanced flight vehicles, IEEE Transactions on Aerospace and Electronic Systems, 35 (2), 1999, 511–518.
[9] The beginner's guide to aeronautics, http://www.grc.nasa.gov/www/K-12/airplane.
[10] M. Barborak, M. Malek, & A. Dahbura, The consensus problem in fault-tolerant computing, ACM Computing Surveys, 25, 1993, 171–220.
[11] C.J. Walter, P. Lincoln, & N. Suri, Formally verified on-line diagnosis, IEEE Transactions on Software Engineering, 23 (11), 1997, 684–721.
[12] D. Chen et al., Reliability and availability analysis for the JPL remote exploration and experimentation system, Proc. International Conf. on DSN, 2002.
[13] R.P. Bianchini & R. Buskens, Implementation of on-line distributed system-level diagnosis theory, IEEE Transactions on Computers, 41, 1992, 616–626.
[14] E.P. Duarte Jr. & T. Nanya, A hierarchical adaptive distributed system-level diagnosis algorithm, IEEE Transactions on Computers, 47, 1998, 34–45.
[15] E.P. Duarte Jr. et al., An algorithm for distributed hierarchical diagnosis of dynamic fault and repair events, Proc. IEEE ICPADS'00, 2000.
[16] A. Subbiah & D.M. Blough, Distributed diagnosis in dynamic fault environments, IEEE Transactions on PDS, 15 (5), 2004, 453–467.
[17] S. Rangarajan et al., A distributed system-level diagnosis algorithm for arbitrary network topologies, IEEE Transactions on Computers, 44 (2), 1995, 312–334.
[18] R.P. Bianchini & R. Buskens, An adaptive distributed system-level diagnosis algorithm and its implementation, Proc. FTCS-21, 1991, 222–229.
[19] M. Hiltunen, Membership and system diagnosis, Proc. 14th Symp. on Reliable Distributed Systems, 1995, 208–217.
[20] P.M. Khilar & S. Mahapatra, Distributed diagnosis in dynamic fault environments for arbitrary network topologies, Proc. IEEE INDICON Conf., Chennai, India, 11–13 Dec. 2005, 56–59.
[21] P.M. Khilar & S. Mahapatra, Heartbeat based fault diagnosis in mobile ad hoc networks, Proc. 3rd International Conf. on ACST, Phuket, Thailand, April 2–4, 2007, 194–199.
[22] F. Jahanian et al., Run-time monitoring of timing constraints in distributed real-time systems, Proc. IEEE Real-Time Systems Symp., 1994, 247–273.
[23] H. Kopetz & G. Gruensteidl, TTP – a time-triggered protocol for fault-tolerant real-time systems, Proc. IEEE Fault-Tolerant Comp. Symp., 1993, 524–532.
[24] L.M. Pinho & F. Vasques, Reliable real-time communication in CAN networks, IEEE Transactions on Computers, 52 (12), 2003, 1594–1607.
[25] M. Barborak & M. Malek, Partitioning for efficient consensus, Proc. 26th Hawaii International Conf. on System Sciences, Jan. 1993.
[26] C. Almeida & P. Verissimo, Timing failure detection and real-time group communication in quasi-synchronous systems, Proc. 8th Euromicro Real-Time Systems Workshop, IEEE Computer Society Press, 1996, 230–235.
[27] G.M.A. Lima, Fault tolerance in fixed-priority hard real-time distributed systems, Ph.D. Thesis, Department of CSE, University of York, May 2003.
[28] N.V. Satyanarayana, Fault-tolerant scheduling in parallel and distributed real-time systems, Ph.D. Thesis, Department of CSE, IIT Kharagpur, Feb. 2001.
[29] S. Darbha & D.P. Agrawal, Optimal scheduling algorithm for distributed memory machines, IEEE Transactions on PDS, 9 (1), 1998, 87–95.
[30] R. Bajaj & D.P. Agrawal, Improving scheduling of tasks in a heterogeneous environment, IEEE Transactions on PDS, 15 (2), 2004, 107–118.
[31] N. Kandasamy et al., Synthesis of robust task schedules for minimum disruption repair, Proc. IEEE International Conf. on Systems, Man and Cybernetics, 2004, 5056–5061.
[32] A. Sarkar, P.P. Chakraborty, & R. Kumar, Frame-based proportional round-robin, IEEE Transactions on Computers, 55 (9), 2006, 1121–1129.
[33] P.R. Lorczak et al., A theoretical investigation of generalized voters for redundant systems, Proc. IEEE Fault-Tolerant Comp. Symp., 1989, 444–451.
Biographies
Pabitra Mohan Khilar was born in India on July 10, 1968. He graduated (first class with distinction) in Computer Science & Engineering from Mysore University in 1990 and received his M.Tech. degree (first rank) in Computer Science from NIT Rourkela in 1999. He obtained his Ph.D. degree from the Indian Institute of Technology, Kharagpur, India, in 2009. At present he is a senior faculty member in the Department of Computer Science and Engineering, NIT Rourkela, India. His current research interests include parallel and distributed processing, fault-tolerant computing, and ad hoc and sensor networks.
Sudipta Mahapatra was born in India on February 27, 1969. He graduated in Electronics and Telecommunications Engineering from Sambalpur University in 1990 and received his M.Tech. and Ph.D. degrees from IIT Kharagpur in 1992 and 1997, respectively. Presently he is working as an associate professor in the E & ECE Department, IIT Kharagpur. His current research interests include parallel and distributed processing, lossless data compression, and optical networking.