Evidence-based Bayesian Networks Approach to Airplane Maintenance
Oscar Kipersztok
Mathematics & Computing Technology, Phantom Works, The Boeing Company
P. O. Box 3707, MC 7L-44, Seattle, WA 98124-0346, USA
[email protected]
Abstract: - A system that facilitates airplane maintenance provides decision support for finding the root cause of a failure from observed symptoms and findings. Such a system provides diagnostic advice, listing the most probable causes and recommending possible remedial actions. Its goal is to reduce the number of delays, cancellations, and unnecessary parts removals, which add significant costs to airplane maintenance operations. A model-based approach using Bayesian belief networks is presently being used to build such diagnostic models. The paper describes the pertinent issues in using such models.
I. INTRODUCTION
Delays and cancellations add significant costs to airline maintenance operations, and unnecessary parts removals compound the problem. Much of this operational cost is attributed to a decline in the diagnostic ability of airline mechanics, which results from a lack of experience with an increasing variety of airplane types in the fleet and from the growing practice of outsourcing maintenance operations. The critical factors in commercial airline maintenance operations are airplane safety, dispatch reliability, and turnaround time. To ensure safety and reliability, airline operators must adhere to the standards of government regulatory agencies, which require a Minimum Equipment List (MEL) specifying the minimal set of Line Replaceable Units (LRUs) that must be in working order before dispatch is approved. In response to a reported fault, and under time pressure to meet scheduled departures, operators tend to replace suspect parts unnecessarily, a practice referred to as “shotgunning”. Seasoned mechanics can quickly narrow the list of possible causes to a small number of replaceable units. The challenge is to disambiguate among the most probable parts by performing further troubleshooting tests before departure time. To avoid costly delays and potential cancellations, mechanics have to decide what action to take before departure, which suspect LRUs should be replaced, and which remedial actions can be deferred to the next destination. A diagnostic decision support system for airplane
0-7803-7278-6/02/$10.00 ©2002 IEEE
Glenn A. Dildy
Boeing Research & Product Development, Boeing Commercial Airplane Company
P. O. Box 3707, MC 2H-70 M-1016, Seattle, WA 98124-0346, USA
[email protected]
maintenance should be designed to facilitate the decision process in a way that improves the accuracy of airplane diagnosis without compromising safety and reliability. This paper describes the basis for such a system.
II. DECISION SUPPORT METHODS FOR DIAGNOSIS
There are several methods for building diagnostic models, and tools to help build them. Model-based reasoning systems rely on physical models that describe the input/output relations between system sub-components and the fault-propagation dependencies between them [3,4,5,6]. Case-based reasoning systems rely on historical references to associations between problem feature descriptions and the actions taken to correct them [7,18]. Although several approaches incorporate measures of uncertainty to help resolve ambiguities, some methods are inherently probabilistic. One such method uses Bayesian belief networks to encode probabilistic dependencies between the variables of a diagnostic problem into the structure of a directed acyclic graph [1,15,19]. Such a graph is capable of updating the subcomponents' prior probabilities of failure when evidence of a fault is observed. Other approaches to diagnosis include the use of rule-based expert systems, fuzzy logic, and neural networks [2,13,14,16]. In this paper, we propose defining a diagnostic model as a transfer function between the causes of a problem and their observed effects. In airplane maintenance, the causes are LRUs, and the observed effects are either flight deck effects (FDEs), which are failure-triggered events visible to pilots in the cockpit, or other perceived anomalies such as unusual sounds, smells, or visible cues (e.g., smoke in the cabin). Once such a function is defined, the diagnostic problem is reduced to that of computing the root causes of the problem given the observed effects.
In this manner, a diagnostic model is built directly to simulate the way a system fails, rather than to simulate the way the system deviates from its normal behavior. Airplane diagnosticians do not rely only on their systemic knowledge of the system, just as medical diagnosticians do not always rely on their understanding of the physiology and biochemistry of the body when seeing a patient. Beyond systemic knowledge, much of the approach to diagnosis also relies on experiential knowledge accumulated through repeated exposure to similar problems and on associations between causes and effects observed over long periods of time.
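As a rough illustration of computing root causes from an observed effect, the sketch below applies Bayes' rule to a handful of candidate causes. It is a minimal example and not the authors' model: the LRU names, prior weights, and likelihoods are hypothetical, and it assumes that exactly one of the listed candidates caused the observed effect.

```python
def posterior_over_causes(priors, likelihoods):
    """Return P(cause | effect) for mutually exclusive candidate causes.

    priors:      relative prior weight of each cause (hypothetical values)
    likelihoods: P(effect observed | cause failed) for each cause
    """
    joint = {cause: priors[cause] * likelihoods[cause] for cause in priors}
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("observed effect is impossible under all candidate causes")
    return {cause: value / evidence for cause, value in joint.items()}


# Hypothetical numbers, chosen only to show the mechanics of the update.
priors = {"valve": 0.02, "sensor": 0.05, "controller": 0.01}
likelihoods = {"valve": 0.9, "sensor": 0.3, "controller": 0.6}

posterior = posterior_over_causes(priors, likelihoods)
# Rank candidate causes from most to least probable given the effect.
ranked = sorted(posterior, key=posterior.get, reverse=True)
```

Although the sensor has the highest prior in this made-up example, the valve ranks first after the update because the observed effect is far more likely when the valve fails.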
[Figure 1 diagram labels: “Systemic” knowledge (engineering design, basic principles at the LRU level; understanding how the system behaves); “Expertise” knowledge (mechanic expertise, anecdotal cause-effect associations, heuristic rules of thumb); “Factual” knowledge (in-service data, airline/supplier/airframe maintenance records, numeric/textual data, reliability and maintainability); central question: understanding how the system fails.]
Figure 1 – Three types of knowledge needed to diagnose complex airplane systems.
Figure 1 shows the three sources of knowledge that are critical for the diagnosis of a complex system such as an airplane. First is the “systemic” knowledge, which entails understanding how the sub-components of the system relate to each other and operate under normal conditions, so that it is possible to understand the different operational pathways that lead to failures. This type of knowledge is possessed mostly by the engineers responsible for designing and building the various systems. Second is the “experiential” knowledge (labeled “expertise” in Figure 1), which entails the cause-and-effect associations learned over long periods of maintenance exposure and familiarity with the system. This type of knowledge is possessed mostly by the mechanics and engineers who maintain the systems. Third is the “factual” knowledge, a combination of text and numeric records that capture the actual field experience, i.e., the history of the actions taken in the field and the reliability data for each replaceable component. The latter is usually in the form of Mean Time Between Failures (MTBF) or Mean Time Between Unscheduled Removals (MTBUR). These three essential sources of knowledge provide the information content required for any comprehensive airplane diagnosis decision support system. Each of these types of knowledge is, to a greater or lesser extent, suited for representation by different diagnostic-modeling methods. Systemic knowledge, which relies on understanding the physical functionality of each component of the system, is better suited for representation by model-based reasoning or simulation methods. Expertise knowledge,
which is built on heuristic rules of thumb, is better accommodated by methods such as rule-based and fuzzy expert systems, or by case-based reasoning. Factual knowledge is better handled by data-intensive methods such as statistical analysis or neural networks. In terms of knowledge representation and reasoning, Bayesian belief networks provide a rich and efficient representation language that allows the three types of knowledge to be handled within a single structure. Furthermore, it is plausible that each diagnosis domain is better suited to a particular method. For example, a diagnostic decision support system for a help-desk organization may require a method, such as case-based reasoning, for handling the 20% of problems to which 80% of complaints are attributed. Typically, help desks troubleshoot a broad spectrum of loosely related problems (e.g., network shutdowns, hard drive breakdowns, computer and software problems, etc.), where there is no single blueprint of how these different components fit together or how dependent their failure modes are. In this type of environment it is relatively easy to build, in a reasonably short time, a database of case histories that can be used in a case-based reasoning system capable of helping with the most common problems. Another example is the monitoring of chemical or nuclear plants in support of critical safety decisions, such as whether or not to shut down the plant. A fault detection system used to support such decisions may be best built using a model-based approach. A detailed physical model of the plant is built, accounting for each of its components, and is used to simulate the overall predicted performance of the plant. A fault detection monitoring system would then detect deviations of the actual measured performance from the simulated, expected normal behavior.
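The model-based monitoring idea in the plant example can be sketched as a comparison between measured values and the values predicted by a simulation. This is only a schematic illustration: the function name, the readings, and the fixed tolerance are assumptions made for the example, and a real monitor would take its predictions from a physical plant model.

```python
def detect_deviations(measured, simulated, tolerance):
    """Return the sample indices where |measured - simulated| exceeds tolerance."""
    if len(measured) != len(simulated):
        raise ValueError("measured and simulated series must have the same length")
    return [
        index
        for index, (actual, expected) in enumerate(zip(measured, simulated))
        if abs(actual - expected) > tolerance
    ]


# Hypothetical readings: the last two samples drift away from the prediction.
simulated = [300.0, 301.0, 302.0, 303.0]
measured = [300.2, 300.9, 305.5, 309.0]

flagged = detect_deviations(measured, simulated, tolerance=2.0)
```

This contrasts with the approach taken in this paper, where the model describes how the system fails rather than how it departs from normal behavior.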
In airplane diagnosis, the different subsystems are highly integrated and designed to meet very strict standards of safety and reliability. There is a need to integrate the different types of knowledge with the available reliability data of replaceable components. Bayesian networks provide an adequate representation language in which to capture these different types of knowledge, together with the calculus of probability theory needed to properly update, from priors to posteriors, the probability that each replaceable component caused the failure.
III. THE PREFLIGHT TROUBLESHOOTING METHOD AT THE AIRPORT GATE
Diagnosis at an airport gate is done as part of a decision support process to determine:
a) which LRUs, if possible, should be fixed on the ground before scheduled departure,
b) which LRUs should be replaced before scheduled departure,
c) whether scheduled departure should be delayed to support either a) or b) and, if so, for how long, and
d) whether the flight should be cancelled altogether.
[Figure 2 flowchart elements: Crew/Pilot Report; Findings; Probable Causes; Remedial Actions; Access to relevant information (manuals, schematics, diagrams, reliability data, etc.); decision node “Departure Time?” with YES and NO branches; Document Actions and Decisions (fix, replace, delay, cancel).]
Figure 2 - Maintenance process at the airport gate
Figure 2 describes the maintenance cycle that takes place at the airport gate. Preflight troubleshooting begins when the aircraft arrives and is scheduled to depart on an outgoing flight. If a failure was detected by the pilot or flight crew during the preceding flight, or by the maintenance crew while on the ground, troubleshooting begins in order to ensure safe and timely airplane dispatch. The deadline for decisions is the departure time of the next scheduled flight. Troubleshooting is the responsibility of several decision makers, including Airline Maintenance Operation Control (MOC), the ground maintenance staff, and the airplane flight crew.
IV. BUILDING BAYESIAN BELIEF NETWORKS FOR AIRPLANE
Bayesian belief networks are directed acyclic graphs that capture probabilistic dependencies between the variables of a problem. Bayesian networks approximate the joint probability distribution over the variables of the diagnosis problem using the chain rule of probability, given in Eq. (1).
The general approach to building Bayesian networks is to map the fault causes (LRUs) to the observed effects (FDEs), keeping in mind that what is being modeled is not the normal behavior of the system but rather its behavior when one or more of its parts fail. Constructing the Bayesian network requires creating nodes with associated discrete or continuous states, and arcs connecting them, where the probability of every child's state is conditioned on the states of its parents [15]. Figure 3 shows a section of a Bayesian network from an air-conditioning system diagnostic model, in which LRUs are connected to FDEs through intermediate nodes. The process of building such networks requires the elicitation of knowledge from domain experts. In the case of an airplane system diagnosis model, the experts should represent the three types of knowledge shown in Figure 1. To improve knowledge elicitation in the creation of an airplane diagnostic model, we have found that the modeler must become familiar with the functionality and terminology of the system, and understand its behavior well enough to be conversant about it with the experts. One can reach this level of understanding from system manuals (also used for training mechanics), maintenance manuals, and system schematics. Building the network starts from a list of the most problematic system faults, which are not necessarily those that occur most frequently but rather those that are the most difficult to troubleshoot.
[Figure 3 node labels: Heat Exchanger (LRU); ACM (LRU); Switch1 (LRU); Switch2 (LRU); Relay (LRU); Switch1 failure modes; Switch2 failure modes; Switch1 state; Switch2 state; Relay state; Turbine Inlet Temp.; Duct Temp.; Switch Closed?; Switch2 test; Light On (FDE).]
The chain rule of probability gives

$$p(x_1, x_2, \ldots, x_n) = \prod_{k=1}^{n} p(x_k \mid x_1, x_2, \ldots, x_{k-1}) \qquad (1)$$

which, subject to simplifying conditional independence assumptions, results in the product of the probabilities of the variables conditioned on their parents,

$$p(x_1, x_2, \ldots, x_n) = \prod_{k=1}^{n} p(x_k \mid pa(x_k)) \qquad (2)$$

where $pa(x)$ is the parent variable set of $x$.
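To make Eq. (2) concrete, the sketch below builds a three-node network in which two LRUs are parents of a single FDE, evaluates the joint probability as a product of each node's probability given its parents, and computes posteriors by brute-force enumeration. The structure, node names, and all probabilities are hypothetical and are not taken from the authors' air-conditioning model; enumeration is used here only because the network is tiny.

```python
from itertools import product

# Parents of each node. For LRUs, True means "failed"; for the FDE, True means "light on".
parents = {"switch": (), "relay": (), "light_on": ("switch", "relay")}

# Conditional probability tables: parent states -> P(node = True | parents).
# All numbers are made up for illustration.
cpt = {
    "switch": {(): 0.01},
    "relay": {(): 0.02},
    "light_on": {
        (False, False): 0.001,
        (False, True): 0.90,
        (True, False): 0.95,
        (True, True): 0.99,
    },
}


def joint_probability(assignment):
    """Eq. (2): product over all nodes of p(x_k | pa(x_k))."""
    probability = 1.0
    for node, node_parents in parents.items():
        p_true = cpt[node][tuple(assignment[p] for p in node_parents)]
        probability *= p_true if assignment[node] else 1.0 - p_true
    return probability


def posterior(query, evidence):
    """P(query = True | evidence), summing the joint over consistent assignments."""
    nodes = list(parents)
    numerator = denominator = 0.0
    for values in product([False, True], repeat=len(nodes)):
        assignment = dict(zip(nodes, values))
        if any(assignment[name] != state for name, state in evidence.items()):
            continue
        p = joint_probability(assignment)
        denominator += p
        if assignment[query]:
            numerator += p
    return numerator / denominator


p_switch = posterior("switch", {"light_on": True})
p_relay = posterior("relay", {"light_on": True})
```

With these made-up tables, observing the light raises the failure probability of both LRUs well above their priors, and the relay ends up as the more probable cause. Practical networks would rely on a dedicated inference engine rather than enumeration, whose cost grows exponentially with the number of nodes.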
Figure 3 – Section of a Bayesian network connecting causes (LRUs) to effects (FDEs).
Parentless nodes in the network are populated with prior probabilities derived from component reliability data. The data are available from various sources in the form of Mean Time Between Unscheduled Removals (MTBUR). These estimates can be converted into probability estimates using the exponential distribution, assuming a Poisson process,
$$F(x) = 1 - e^{-\lambda x} \qquad (3)$$
where 1/λ is the long-term average lifetime of the LRU and x can be interpreted as a single cycle of operation, equivalent, for example, to the average duration of the last flight leg. Typical values of MTBUR are of order greater than $10^5$ hours. Since then λ