Challenges, Issues, and Lessons Learned Chasing the "Big P": Real Predictive Prognostics Part 2

Mr. Andrew Hess Joint Strike Fighter Program Office 200 12th Street South Arlington, VA 22202-4304 703-601-5551 [email protected]

Maj. Giulio Calvello (It.A.F.) Joint Strike Fighter Program Office 200 12th Street South Arlington, VA 22202-4304 703-604-3852 [email protected]

Mr. Stephen J. Engel Associate Technical Fellow Northrop Grumman Corp. Bethpage, NY 11714 516.346.6830 [email protected]

Dr. David Hoitsma Information Systems Technologist Northrop Grumman Corp. Bethpage, NY 11714 516.575.9667 [email protected]

Abstract—The desire and need for real predictive prognostic capabilities have been around for as long as man has operated complex and expensive machinery. This has been true for both mechanical and electronic systems, and there is a long history of trying to develop and implement various degrees of prognostic and useful-life-remaining capability. Recently, stringent Diagnostic, Prognostic, and Health Management (PHM) capability requirements have been placed on new applications, like the Joint Strike Fighter (JSF), in order to enable and reap the benefits of new and revolutionary logistic support concepts. While fault detection and fault isolation effectiveness, with very low false alarm rates, continue to improve on these new applications, prognostic requirements are even more ambitious and present very significant challenges to the system design teams. These prognostic challenges have been aggressively addressed for mechanical systems for some time, but are only recently being fully explored for electronic systems. This second paper in a series will continue to explore background, benefit impacts, and architectures; highlight some additional design challenges and issues; discuss prognostic capabilities for electronic systems; review strategies for prognostic capability verification and validation; and draw heavily on related lessons learned from previous and current prognostic development efforts.


Dr. Peter Frith Joint Strike Fighter Program Office 200 12th Street South Arlington, VA 22202-4304 (703) 601-5545 [email protected]

TABLE OF CONTENTS

1. INTRODUCTION
2. BACKGROUND
3. CHALLENGES: PBL, GS AND PROGNOSTICS
4. ISSUES: UNCERTAINTY, PERFORMANCE REQUIREMENTS, AND VALIDATION
5. LESSONS LEARNED
6. CONCLUSION AND SUMMARY
7. ACKNOWLEDGEMENTS
8. REFERENCES

1. INTRODUCTION

This paper builds on the background, benefit impacts, current playing field for predictive prognostics, and the specific design challenges, issues, and lessons learned discussed in the previous paper "… Real Prognostics Part 1" [1]. It draws heavily on new specifics and updated lessons learned from several previous and current diagnostic and prognostic development efforts. Real predictive prognostic capabilities are just one element among many interrelated and complementary functions in the discipline of PHM. The challenges involved with developing and implementing prognostic capabilities are many, and this second paper (Part 2) in the series will focus on two areas: the impacts and enabling benefits associated with new Performance Based Logistics (PBL) support concepts, long-term and global sustainment strategies, and new business practices; and the issue of how to specify and validate prognostic requirements in the presence of uncertainty. Additional lessons learned, beyond those identified in the first paper (Part 1) [1], will also be discussed and explored.

0-7803-9546-8/06/$20.00 © 2006 IEEE. IEEEAC paper #1489, Version 2, updated January 13, 2005.

Maintainers see the need for modern PHM systems with the capability to anticipate future maintenance problems and required maintenance actions, coupled with an ability to predict future health status. Such capabilities are key enablers of, and attributes for, any Condition Based Maintenance (CBM) concept and for "enlightened" opportunistic maintenance decisions. Furthermore, these predictive abilities and prognostic capabilities are also key enablers of, and very necessary attributes for, both the evolving Performance Based Logistics (PBL) concepts and paradigm-changing new business case approaches.

Predictive prognostics is one of the fundamental factors influencing the decision to shift to a new product support strategy whose goal is to improve system readiness by procuring performance: the purchase of system sustainment based on output measures, such as aircraft availability, rather than input measures, such as parts and technical services. Within this PBL vision, a Global Sustainment (GS) solution is necessary, in which a single, integrated, shared system leverages global capabilities and facilities and minimizes the need for a unique logistic infrastructure.

Similarly, the technical issues involved with developing real predictive prognostic capabilities are large and many, and here some of the issues associated with uncertainty will be discussed. In particular, the issues of how to deal with all the uncertainty associated with useful prognostic predictions will be refocused and explored in more detail from the base established in the first paper [1]. Knowing the problems with uncertainty, the closely related issues of establishing good prognostic performance requirements, and of validating actual performance against these requirements, will also be discussed.

Some of the benefits of such a PHM system are listed below.

Change of Maintenance Philosophy
• Real on-condition maintenance
• More informed opportunistic maintenance decisions
• Not "on-failure" nor "per schedule"
• Less interruption of mission schedule
• Reduction in test equipment
• Reduced spares and logistics footprint
• Significant reduction in peculiar support equipment
• Eliminated O-level TE

Benefits to the Maintainer
• Unprecedented insight into vehicle/squadron/fleet health
• Less time spent on inspections
• Better ability to plan maintenance
• Simplified training
• Improved fault detection
• Increased aircraft availability and sortie generation

The challenges, issues, and lessons learned are many and varied, and they will certainly continue to grow as this discipline matures and is more widely applied. Readers should not expect to find here all the answers to the complex new world of "prediction", but they will discover new horizons from which to start their travels in the winding and deceptive forest of prognosis.

Modern and comprehensive PHM systems include many types of functional capabilities. Many of these, though much advanced and more accurate, are an evolution of capabilities developed by legacy diagnostic systems. Through the philosophy and concept of health management, the modern PHM system uses these functional capabilities to complement each other and provide a much larger impact, and a broader set of maintenance-oriented benefits, than any single function by itself.

2. BACKGROUND

Prognostics and Health Management (PHM) is the name given to the capability being developed by JSF to enable the vision of Autonomic Logistics and so meet the overall affordability and supportability goals [2-4]. In PHM, the term prognostics takes on a much more global meaning, covering the broader functions of fault/failure detection, fault/failure isolation, enhanced diagnostics, material condition assessment, performance monitoring, life tracking, and prognostics, rather than just the prognostic function alone. For this reason, the term "real predictive prognostics" has been used to emphasize that here we are interested in the predictive part of a comprehensive health management system. The earlier paper "… Real Predictive Prognostics Part 1" [1] discussed the needs and impacts of PHM, and the definition and playing field of prognostics, in detail; only a few major points are reviewed here, especially the PHM playing field, before proceeding with the new set of challenges, issues, and lessons learned. The concept of the failure progression timeline is the center of this prognostic playing-field review.

Essentially, prognostics provides the predictive part of such a comprehensive health management system, and so complements the diagnostic capabilities that detect, isolate, and quantify the fault; prognostics in turn is dependent on the quality of the diagnostic system. From the user's viewpoint, whether operator, maintainer, or logistician, what distinguishes prognostics from diagnostics is the provision of a lead time, or warning time, to the end of useful life or to failure, with this time window far enough ahead for the appropriate action to be taken. Naturally, what constitutes an appropriate time window and action depends on the particular user of the information and on the overall logistic system-of-systems design one is trying to optimize. In summary, predictive prognostics is that part of the overall PHM capability that provides a prediction of the lead time to a fault/failure event in sufficient time for it to be acted on.
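The lead-time notion above can be stated very simply: a prediction is useful only if its warning window exceeds the window the user needs in order to act. A minimal sketch (function names, units, and numbers are hypothetical, for exposition only):

```python
# Minimal illustration of the prognostic lead-time concept.
# Names and numbers are hypothetical, for exposition only.

def lead_time(prediction_time_fh: float, predicted_failure_time_fh: float) -> float:
    """Warning window (in flight hours) between the moment the prediction
    is made and the predicted failure event."""
    return predicted_failure_time_fh - prediction_time_fh

def actionable(lead_fh: float, required_window_fh: float) -> bool:
    """A prediction is only useful if the window is long enough for the
    user (operator, maintainer, or logistician) to act on it."""
    return lead_fh >= required_window_fh

lt = lead_time(prediction_time_fh=1200.0, predicted_failure_time_fh=1350.0)
print(lt)                      # 150.0 flight hours of warning
print(actionable(lt, 100.0))   # True: enough time to schedule maintenance
```

Note that the required window differs by user: a maintainer planning a shop visit needs a different window than a logistician positioning spares.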

Maintenance will often be delayed until an early incipient fault progresses to a more severe state, but before an actual failure event, in order to maximize the useful operational life of a system or subsystem component. This area, between very early detection of incipient faults and progression to actual system or component failure states, is the realm of prognostics.

Though most of these functional capabilities will be developed together, and can complement each other, during PHM system development for a particular platform, real predictive prognostics, the "Big P", represents the new, hard, and often "risky" technology capability. The rest of this paper will address a PHM playing field focused only on this new predictive prognostic capability.

If an operator chooses to continue to operate a system or component with a known, detected incipient fault present, he will want to ensure that this can be done safely, and will want to know how much useful life remains at any point along that failure progression timeline. This is the specific domain of real predictive prognostics, the "Big P": being able to accurately predict useful life remaining along a specific failure progression timeline for a particular system or component.

To understand the role of predictive prognostics, one has to understand the relationship between diagnostic and prognostic capabilities. Envisioning an initial fault-to-failure progression timeline is one way of exploring this relationship, and Figure 1 represents such a timeline. It starts with a new component in proper working order, indicates a time where an early incipient fault develops, and depicts how, under continuing usage, the component reaches a component or system failure state and eventually, under further operation, reaches states of secondary system damage and complete catastrophic failure. This failure progression timeline is very important to understand and keep in mind as the issues of uncertainty, and the probabilistic perspective on accurate useful-life-remaining predictions, are discussed and explored in the remaining sections of this paper.

Actually accomplishing these accurate useful-life-remaining prediction capabilities requires many tools in your prognostic tool kit. Sometimes the sensors currently used for diagnostics provide adequate prognostic state-awareness inputs, and sometimes advanced sensors or additional incipient fault detection techniques are required. This brings up the questions: how early or how small an incipient fault do you want to detect, and how small a material state change do you want or need to "see"? Other needed tools in your prognostic tool kit include: a model or set of models that represent the understanding of a particular fault-to-failure progression rate; material physics-of-failure models; statistically and/or probabilistically based models; models to represent failure effects across interconnected subsystems; and models to account for and address future operational mission usage.
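As one deliberately simplified example of a physics-of-failure tool of the kind listed above, crack growth under cyclic loading is classically modeled with a Paris-law relation, da/dN = C(ΔK)^m, integrated from the detected crack size to a critical size. A sketch follows; all material constants, geometry, and loads here are hypothetical, not values from any JSF system:

```python
import math

# Hedged sketch: Paris-law crack growth as one possible physics-of-failure
# model for useful-life-remaining prediction. All constants are hypothetical.
C, M = 1e-12, 3.0          # Paris-law material constants (units consistent with MPa, m)
DSIGMA = 100.0             # stress range per cycle, MPa
Y = 1.0                    # geometry factor
A_CRIT = 0.02              # critical crack length, m

def cycles_remaining(a0: float, da_step: float = 1e-5) -> int:
    """Numerically integrate da/dN = C*(dK)**M from the current (detected)
    crack size a0 to the critical size, returning predicted cycles remaining."""
    a, n = a0, 0.0
    while a < A_CRIT:
        dk = DSIGMA * Y * math.sqrt(math.pi * a)   # stress-intensity range
        dadn = C * dk ** M                          # crack growth per cycle
        n += da_step / dadn                         # cycles needed to grow by da_step
        a += da_step
    return int(n)

# Detecting a smaller incipient crack yields a longer predicted remaining life,
# which is exactly why earlier "state change" detection is so valuable:
print(cycles_remaining(0.001) > cycles_remaining(0.005))  # True
```

This also makes concrete the question posed above: the smaller the state change you can "see", the further left on the timeline the prediction starts, and the longer the usable warning window.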


As true prognostic capabilities evolve and are developed for particular applications, many difficult questions need to be addressed, and these questions can be very application specific. Perhaps the first basic question to ask is: how far does your specific application want or need to "see" into the future? The answer will very much depend on your application-specific prognostic perspectives, and this first question generates many additional considerations. Some of these include: specific needs versus recognized benefits; what is possible and doable; capabilities available versus those desired or highly valued; technology shortfalls to be filled; integration, implementation, and usage strategies; and what is good enough.

[Figure 1 here depicts the failure progression timeline, running from proper working order (new), through a very early incipient fault, to system, component, or sub-component failure, and finally to secondary damage and catastrophic failure. Diagnostics covers the right of the timeline and prognostics the left. Annotations note the need to manage the interaction between diagnostics and prognostics; the desire for advanced sensors and detection techniques to "see" the incipient fault (state-awareness detection); the need to understand fault-to-failure progression rate characteristics; the development of useful-life-remaining prediction models, both physics- and statistics-based; the need for better models to determine failure effects across the rest of the aircraft and its subsystems; and the goal of detecting "state changes" as far to the left as possible.]

Figure 1 - Failure progression timeline

Indeed, the following challenge and issue sections examine the questions of what is good enough, and of how far ahead you want or need to see, from two different perspectives. The first is the perspective of the Performance Based Logistics framework in which prognostics will be operating, and the models and tools needed to quantify and measure logistic performance, hence setting what is good enough for that purpose. The second is the perspective of providing safe and accurate predictions of failure.

Diagnostic capabilities have traditionally been applied at, or between, the initial detection of a system, component, or sub-component failure and complete system catastrophic failure. More recent diagnostic technologies are enabling detections to be made at much earlier incipient fault stages, in order to maximize the benefits of continued operational life.

Predicting failure in the presence of uncertainty raises the question of how to formulate requirements and validation techniques that give the designer a measure of what is good enough.

Once sufficient reliability and cost data are obtained, the contract would transition to a fixed-price type contract, with performance incentives linked to metrics identified during LRIP. Through joint cooperative forums, the client (i.e., the Government) and the contractor would establish the criteria and performance metrics for the transition in accordance with substantiated data, maturity of process, pricing, and performance responsibility.

3. CHALLENGES: PERFORMANCE BASED LOGISTICS (PBL), GLOBAL SUSTAINMENT (GS) AND PROGNOSTICS

Total transition to a fixed-price type contract would be tied to obtaining sufficient reliability and cost data. However, the actual transition point would depend on the system and air vehicle peculiarities and maturity. The final decisions for transition sequencing would be determined during LRIP using factors such as achieved and stable reliability, business case analysis impact, incentives, and the maturity of depot level repair.

Performance Based Logistics

The future vision for logistic support to military aircraft is a structured organization responsible for the provision of global technical support to the "war-fighters" during peace operations as well as in crisis phases. This organization would be managed by a joint Government and contractor team. Most of the "legacy" activities performed under Government responsibility will be reallocated under contractor management, but the Government will maintain full visibility. Work is underway in developing the details of this future defense acquisition/sustainment strategy, and there is an urgent need to properly identify the roles and responsibilities of PHM and predictive prognostics in this strategy.

In order to implement a performance-based approach, the strategy will revolve around how core capabilities for dealing with Original Equipment Manufacturers (OEMs) and suppliers are established, refined, and executed in relation to delivering performance to the user community. This is a challenge in itself, and it is an additional challenge to figure out how "Big P" capabilities and their associated data products would be assimilated into it.

A key target of this new Logistic Strategy is the move to long-term Performance-based Contracting, an approach for buying set levels of performance, such as a payment per flight hour approach. The development of well-defined performance, cost, and schedule metrics, financial and other incentives, and award fee and award term plans are prerequisites to establishing successful long-term performance-based arrangements.

Predictive prognostics is one of the fundamental factors influencing the decision to shift to the new product support strategy, called Performance Based Logistics (PBL), intended to improve weapon system readiness by procuring performance, capitalizing on integrated logistics chains and public/private partnerships. The cornerstone of PBL is the purchase of weapon system sustainment as an affordable, integrated package based on output measures, such as availability, rather than input measures, such as parts and technical services.

A performance-based approach is intended to:
• Reduce total ownership costs
• Increase war-fighter confidence and satisfaction
• Facilitate contractor-Government integration and communication
• Reduce the demand for logistics
• Provide incentives for reliability enhancements
• Decrease the resource requirements for support
• Encourage early and continuing emphasis on diminishing manufacturing sources and obsolescence planning
• Centralize management
• Create a real-time problem response (24/7)
• Optimize technology insertion
• Utilize a consistent life-cycle cost analysis
• Optimize infrastructure harmonization and asset utilization

Performance-based contracting is envisioned as a mechanism to optimize product designs with enhanced, reliable, maintainable components having "Big P" capability, in order to reduce cost throughout component service life.

The vision is that legacy renegotiable price-based contracting will be abandoned over time, shifting gradually to the new long-term performance-based concept. For example, cost-type contracts could be used during the LRIP (Low Rate Initial Production) phase, while the design matures, defined enablers are achieved, and sufficient reliability and cost data are obtained.

Global Sustainment

During new system design development and maturation, within the above PBL vision, a Global Sustainment (GS) solution is sought that provides a single, integrated, shared system able to leverage global capabilities and facilities and so minimize the need for unique logistic infrastructure. The GS concept is based on the following principles:

a) A collaborative partnership between Governments and industries to enhance efficiency, eliminate service duplication, accelerate execution, promote competition, and comply with legislative requirements.
b) A common global solution for different product variants and customers that allows for customer-unique tailoring when economically and politically viable.
c) An incremental, stepped implementation approach through the development phase, initial production, and on to steady state. This iterative development builds confidence and reduces risk.
d) Long-term, multi-year, performance-based, incentivized contracts to facilitate investment and exploit economies of scale.

The following are anticipated contracting and accounting precepts:

• Transition from a cost-plus award-fee contract in the development phase to a firm fixed-price incentive-fee contract in the steady state.
• Employment of a Price Improvement Curve (PIC) for the firm fixed-price contract to drive cost reduction.
• Sharing of cost reductions between Government and industry.
• Compliance with the Truth in Negotiations Act where applicable.
• Transition to performance-based pricing through LRIP, moving to a dollar-per-flight-hour payment structure in steady state, with incentives tied to performance.
• Pricing broken into two elements: a firm fixed-price element and a variable cost element.
• Payments based on usage versus breakage.
• Systematic application of risk management with appropriate allocation of risk and reward.

Business Case Analysis

BCA is a decision support and planning tool that projects the likely financial results and other business outcomes of an action. The objective of the BCA is to provide a recommended global best-value business or system approach; the term "best value" is defined as meeting user requirements at the lowest Total Ownership Cost (TOC). As already discussed in Part 1 [1], BCA systematically analyzes alternative approaches for the appropriate assignment of sustainment responsibility between Government and industry, and the ability to meet participant requirements (shared, partially common, unique). The business case process establishes an end-to-end process that captures results derived from all analyses, whilst maintaining traceability of all design elements, methods, assumptions, and cost estimates. The BCA goes through spiral development as the quality and fidelity of data mature, the results from trade studies are assessed, Government boundaries are finalized, and lessons learned from other programs are incorporated. As the BCA progresses through this spiral development effort and incorporates progressively more detailed quantitative evaluations, the best-value position for selected business elements and associated attributes is re-evaluated.
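The Price Improvement Curve named among the precepts above is commonly modeled, in cost estimating generally, as a learning (experience) curve, where unit price falls to a fixed fraction of its value with each doubling of cumulative quantity. A hypothetical sketch of that standard form (the 85% rate and prices are illustrative, not program values):

```python
import math

def unit_price(first_unit_price: float, unit_number: int, learning_rate: float) -> float:
    """Classic learning-curve model: price of the n-th unit when cost falls
    to `learning_rate` (e.g. 0.85 = 85%) of its value at each doubling of
    cumulative production. All numbers here are hypothetical."""
    b = math.log(learning_rate, 2)          # negative slope exponent
    return first_unit_price * unit_number ** b

p1 = unit_price(1_000_000.0, 1, 0.85)
p2 = unit_price(1_000_000.0, 2, 0.85)
print(round(p2 / p1, 4))   # 0.85: each doubling of quantity cuts unit price by 15%
```

Whether a PIC in a given contract follows this exact form is a negotiated matter; the sketch only shows the usual shape of such a curve.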

Logisticians use several modeling, simulation, and cost models to support the estimating process. Some example models are listed below:

1. Discrete-event, PC-based computer simulation models designed to model the interaction between operations, maintenance, and logistics elements at the base/ship level. They can be used to predict sortie-generation capability, estimate maintenance manpower requirements, assess spares, support equipment, and facility limitations, and assess the impact of policy/procedural decisions.
2. Spares modeling tools that compute optimal spares mixes to support a wide range of possible operating scenarios and size the spares inventory explicitly on the basis of desired weapon system readiness levels, rather than supply-oriented measures. They can handle multiple indentures of parts for multiple configurations, for non-homogeneous squadron sizes and flying schedules. It is also possible to perform an essential split between squadron-level and depot-level spares to provide a globally optimized sparing solution.
3. Discrete-event simulation tools that provide the user with the ability to define the operational and support environment, ascertain measures of effectiveness for performance and cost metrics based on multiple trials, and characterize the sensitivity to changes in support system architecture, processes, business approach, and air vehicle reliability/maintainability characteristics.

e) Public/private depot partnering that incorporates the best Government and commercial practices, policies, skills, and capabilities to provide an effective joint-depot infrastructure that complies with Title 10 USC Section 2466 legislation³.
f) Utilization of a "best value" concept for sustainment responsibility assignment, underpinned by robust business case analysis.
g) Utilization of lessons learned and past experience to refine the GS solution.

In order to establish the level of affordability, a Global Sustainment solution is evaluated by conducting successive iterations of analysis against a business framework. A Business Case Analysis (BCA) spiral development is used as the basis for designing and operating the GS system. Each BCA iteration is referred to as a BCA "Spiral", with each spiral designed to address specific key development milestones and policy decisions.

³ "Not more than 50 percent of the funds made available in a fiscal year to a military department or a Defense Agency for depot-level maintenance and repair workload may be used to contract for the performance by non-Federal Government personnel of such workload for the military department or the Defense Agency."


5. Late prediction probability (fraction of the time that the predicted failure time exceeds the actual failure time);
6. Actual failure time;
7. Order lead time (time when the new part is ordered to the base).

4. Models that simulate student flow through training centers on an event-by-event basis to determine the quantity of training resources.
5. Tools designed to optimize repair locations for components and aircraft based on costs of repair, transportation, spares, and infrastructure.
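To illustrate the inventory-versus-readiness trade the spares modeling tools above work with, here is a toy Monte Carlo sketch (all demand rates and spares levels are hypothetical) estimating the probability of a stock-out for a given spares level:

```python
import random

def stockout_probability(spares: int, mean_demands: float, trials: int = 20000,
                         seed: int = 0) -> float:
    """Toy spares model: part demands in one resupply period arrive randomly
    (Poisson-like, simulated via exponential inter-arrival times); a stock-out
    occurs when demands exceed the spares on the shelf. Illustration only."""
    rng = random.Random(seed)
    stockouts = 0
    for _ in range(trials):
        t, demands = 0.0, 0
        while True:
            t += rng.expovariate(mean_demands)  # next demand arrival
            if t > 1.0:                         # past end of resupply period
                break
            demands += 1
        if demands > spares:
            stockouts += 1
    return stockouts / trials

# More spares on the shelf means fewer stock-outs (and more inventory cost):
print(stockout_probability(1, 2.0) > stockout_probability(4, 2.0))  # True
```

A real spares tool optimizes this trade across multiple indentures, sites, and scenarios; the sketch shows only the core random-demand mechanism.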

To limit the number of inputs and reduce computational complexity and run-time, prognostics is usually modeled for only a select number of items (e.g., parts that are not critical to flight safety may be allowed to fail), and some failure modes for those parts may not be predictable in the modeling software.

Implementation of a disciplined approach, including systems engineering analysis tools such as Failure Mode, Effects, and Criticality Analysis (FMECA) or Reliability Centered Maintenance (RCM), will produce a Maintenance Task Analysis (MTA) directly linked to a system's reliability, maintainability, and supportability. The MTA is based upon detailed technical tasks, including those determined by application of a Condition Based Maintenance (CBM) approach. Close collaboration between engineers and logisticians is critically important during system design and development and throughout the life cycle.
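In classical FMECA practice, candidate failure modes are often ranked by a Risk Priority Number, the product of severity, occurrence, and detectability ratings; the highest-ranked modes are natural first candidates for prognostic coverage. A minimal sketch (the failure modes and scores are invented for illustration; real analyses use program-specific rating scales and criticality rules):

```python
# Minimal FMECA ranking sketch. Failure modes and scores are hypothetical.
failure_modes = [
    # (name, severity 1-10, occurrence 1-10, detectability 1-10: 10 = hard to detect)
    ("pump bearing spall",   7, 4, 6),
    ("valve seal leak",      5, 6, 3),
    ("actuator motor short", 9, 2, 8),
]

def rpn(severity: int, occurrence: int, detectability: int) -> int:
    """Risk Priority Number: the classic FMECA product ranking."""
    return severity * occurrence * detectability

ranked = sorted(failure_modes, key=lambda fm: rpn(*fm[1:]), reverse=True)
for name, s, o, d in ranked:
    print(f"{name}: RPN={rpn(s, o, d)}")
# Highest-RPN modes are the first candidates for sensors and "Big P" coverage.
```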

Prognostic capabilities are expected to result in an improvement in system performance (e.g., Availability rate (A0), Mission Capable rate (MC), Not Mission Capable Supply rate (NMCS), and Not Mission Capable Maintenance rate (NMCM)) when compared to a non-PHM scenario with equal spare parts inventories and resources. In addition, prognostic capability is expected to impact system affordability by reducing the requirement for spares at squadron/base locations.
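At bottom, these readiness rates are ratios of reported status hours; a simplified sketch of how such rates might be computed (the bucket names and numbers are hypothetical, and real definitions follow the applicable reporting instructions):

```python
def rates(hours: dict) -> dict:
    """Compute example readiness rates from status-hour buckets.
    Hypothetical keys: 'MC' mission capable, 'NMCM' not mission capable
    for maintenance, 'NMCS' not mission capable for supply."""
    total = hours["MC"] + hours["NMCM"] + hours["NMCS"]
    return {
        "MC_rate":   hours["MC"] / total,
        "NMCM_rate": hours["NMCM"] / total,
        "NMCS_rate": hours["NMCS"] / total,
    }

r = rates({"MC": 720.0, "NMCM": 120.0, "NMCS": 60.0})
print(round(r["MC_rate"], 2))  # 0.8
```

Prognostics improves these rates by converting unscheduled NMCM/NMCS hours into planned, shorter maintenance windows.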

These tasks are refined through PBL Business Case Analysis (BCA) to determine a cost-effective, sustainable product support solution that meets user needs in an operational environment. The real challenge is to develop valid and measurable metrics for quantifying the impact of the various prognostic technologies. The first step is to perform a cost/benefit analysis, using appropriate modeling tools, for each component and/or subsystem/system, to evaluate the consequences of developing and supporting each component with "Big P" capability. Once this first analysis is accomplished, it is necessary to evaluate the consequences of having the "Big P" component installed; to accomplish this, it may be necessary to include statistical/damage/physics-of-failure accumulation models and sensor-based condition monitoring, such as predicted mean life or failure/damage accumulation, that translate time savings into monetary savings. The final step is to evaluate the results within the PBL environment to assess the effects of "Big P" performance on the Global Sustainment solution.

Figure 2 - Parameters for modeling Prognostics

Listed below are some examples of prognostic parameters incorporated in some of the mentioned modeling tools (see Figure 2, and the discussion in Section 4):

1. Prognosis success rate (percentage of the time that the predicted failure time is earlier than the actual failure time);
2. Prognosis lead time (average time between the time the failure is predicted and the predicted failure time);
3. Variation of the prediction lead times (standard deviation of the prediction lead times);
4. Downlink lead time (time to communicate the fault and isolation information, either by radio frequency (RF) data link or by physical connection);
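Several of these parameters (success rate, mean and standard deviation of lead time, and the complementary late-prediction probability) can be computed directly from logged prediction records; a sketch with hypothetical data:

```python
from statistics import mean, stdev

# Each record: (prediction_time, predicted_failure_time, actual_failure_time),
# all in flight hours. Data are hypothetical.
records = [
    (100.0, 150.0, 160.0),
    (200.0, 240.0, 235.0),   # a late prediction: predicted time exceeds actual
    (300.0, 370.0, 380.0),
    (400.0, 430.0, 445.0),
]

success = [pred_fail < actual for _, pred_fail, actual in records]
leads   = [pred_fail - t_pred for t_pred, pred_fail, _ in records]

success_rate   = sum(success) / len(records)   # prognosis success rate
mean_lead_time = mean(leads)                   # prognosis lead time
lead_time_sd   = stdev(leads)                  # variation of prediction lead times
late_prob      = 1.0 - success_rate            # late prediction probability

print(success_rate, mean_lead_time, late_prob)  # 0.75 47.5 0.25
```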

Finally, a key component of any PBL implementation is the establishment of metrics. Since the purpose of PBL is "buying performance", what constitutes performance must be defined in a manner in which its achievement can be tracked, measured, and assessed. PBL metrics are designed to answer critical questions about what performance we are trying to measure for each of the customer performance objectives. Typical LRIP and full-production PBL metrics are depicted in Figure 3, providing an example of metrics that can be used during two different production periods.

PBL Metric

LRIP

Readiness / Availability Full Mission Capable Aircraft Availability Mission Capable Training hours Days Ready to Deploy

X X

Mission Effectiveness

X

X

X X X X X X

X X X X X X X

Required Sorties/FH Accomplished Percent Sorties Flown Percent Flying Hours Flown Logistics Footprint Delta Cannibalizations per 1000 FH Maintenance Man-hours per FH Maint. Man-hours / FH (Subsystem)

another set of challenges and a new paradigm for subsystem suppliers. Suppliers will not only be required to design for functional performance but also to design for PHM. The requirement for an inherent PHM capability will necessarily drive the sub-system suppliers to better know their equipment, how it operates, its operating environment, and most importantly, how it fails. This rightly is the task of the sub-system supplier but the requirement will also cascade down to the individual component suppliers. Some of the challenges and issues imposed by this new paradigm are as follows.

Full Production X X X X X X

Prognostic challenge for Sub-system suppliers

Performance Based Logistics, and the PHM capabilities required to enable its successful implementation, present significant challenges for sub-system suppliers. Most sub-system suppliers do not have design teams with the necessary PHM skill sets and tools. Such skill sets and tools are in short supply and expensive to develop, and the validation and test of PHM algorithms can impose significant costs. Suppliers therefore need strategies to engage outside help from PHM specialists and experts, and to share development costs across a variety of industry projects and/or defense platforms. The new paradigm means greater interdependence between the sub-system suppliers and the prime system integrator: the supplier provides the foundation of the PHM capability through sub-system-level diagnostics and prognostics, while the prime provides the enhanced diagnostics and prognostics for failures that propagate across several sub-systems. Both levels of PHM are needed to meet the stringent diagnostic and prognostic requirements imposed by Performance Based Logistics, so successful PHM design and development will require close collaboration between sub-system suppliers and the prime. Over the life of type, this interdependence also means tighter configuration control on proposed sub-system changes. Previously, a component in a sub-system design could be changed as long as it appeared functionally identical to the system; now how it fails must also be taken into account, and any difference must be quantified and passed on to the system designer so that the higher-level reasoning can be updated. A successful outcome is most likely if the sub-system suppliers and primes develop a strategic partnership that can support a platform over its life of type.

Figure 3 - Example of LRIP and full Production PBL Performance Metrics

One of the most significant aspects of PBL is the concept of a negotiated agreement between the major stakeholders that formally documents performance and support expectations, and the resources commensurate with achieving the desired PBL outcomes. Inherent in any business transaction where a level of performance is purchased, rather than discrete goods and services, is a de facto shift of risk to the provider of support. This is true of the PBL relationship as well. While DoD can never completely delegate risk for system operational performance, PBL strategies move a level of risk from DoD to the support provider, commensurate with the scope of support for which the provider is responsible. If structured with the right metrics, a PBL support package provides strong incentives to develop items with "Big P" capabilities, enabling decisions beneficial to both contractors and the Government while avoiding the financial consequences of bad decisions. Predictive Prognostics improves most PBL performance metrics, and particularly influences reliability- and maintainability-based profits: contracts may be structured so that the PBL provider has an inherent profit incentive to lower operating costs by achieving higher product operational availability, retaining all or a portion of the savings achieved as a result of providing a better product.

Suppliers need to take up these challenges if they want to remain competitive in bids for modern systems, where Performance Based Logistics will be the new way of doing business.

Prognostics for Electronic Systems

The challenges and issues of prognostics have been aggressively addressed for mechanical systems for some time, but are only recently being fully explored for electronic systems. In some respects, these challenges and issues are the same or very similar for both electronic and mechanical prognostic applications; in other respects, they are unique to electronic systems. This is particularly true, and particularly challenging, for electronic systems employing many digital devices and components in their designs.

4. ISSUES: UNCERTAINTY, PERFORMANCE REQUIREMENTS AND VALIDATION

The fact that digital devices and components appear to fail randomly, and do not exhibit adequate pre-failure attributes or characteristics, leads many to conclude that prognostic capabilities are not possible. The truth is that prognostic capabilities for electronic systems are currently a "mixed bag" of approaches and techniques, and are evolving just as they have for mechanical systems. Many of the more traditional prognostic approaches, when applied to the analog devices and components used in electronic systems, will most likely work fine. Most electronic systems use many analog devices and components alongside their digital ones, so at least partial prognostic coverage is likely possible. Also, just being able to detect some form of system performance degradation, even without an accurate useful-life-remaining prediction, can at times be very beneficial.

One of the major challenges to the designers of modern PHM systems is the need to develop diagnostic and prognostic methods that are truly capable of handling real-world uncertainties, as the real world is not deterministic. Such uncertainties cause havoc with deterministic approaches, leading to high false alarm rates, inaccurate predictions, incorrect decisions, and an overall PHM system that is not very robust. Some of the issues uncertainty presents to the designer were elaborated in the previous paper [1], including issues associated with the various steps in the predictive process, the estimate of current condition, the prediction of time to failure (or time remaining), the choice of appropriate lead times (how far ahead to predict), and the choice of an overall prognostic methodology. That discussion of uncertainty is continued here to address the important issue of how to create and validate requirements for accurate and safe prognostic predictions.

There are several approaches that can be applied to the electronic prognostic challenge. Some of these can be described as follows:

- The legacy time-based approach.
- Very specific data interrogation techniques using attributes, measurands, or characteristics that lead to the indication of repeatable degradation trends.
- Environmental monitoring to determine cumulative exposures, with replacement of the system before some pre-defined limit is exceeded.
- A combination of environmental monitoring and modeling that uses cumulative exposure to drive a predictive cumulative damage model, causing replacement of the system when a sufficient amount of life has been consumed.
- Canary devices and/or witness samples that are integrated into a specific component, device, or system design and that incorporate failure mechanisms that occur first in the embedded device. These embedded canary devices would be non-critical elements of the overall design, providing early incipient-failure warnings before actual system or component failure.

There are and will be many other approaches developed, but an important point to remember is that an effective electronics prognostic capability will utilize a combination of approaches.
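The environmental-monitoring-plus-model approach above can be illustrated with a minimal sketch based on Miner's linear damage rule; the severity classes, cycles-to-failure values, and replacement margin below are illustrative assumptions, not values from any fielded system.

```python
# Sketch of the environmental-monitoring / cumulative-damage approach,
# using Miner's linear damage rule. All thresholds and cycle-life values
# are hypothetical.

# Hypothetical cycles-to-failure for each thermal-cycle severity class.
CYCLES_TO_FAILURE = {"mild": 100_000, "moderate": 20_000, "severe": 2_000}

def accumulated_damage(cycle_counts):
    """Miner's rule: damage = sum(n_i / N_i); failure is predicted at 1.0."""
    return sum(n / CYCLES_TO_FAILURE[severity]
               for severity, n in cycle_counts.items())

def should_replace(cycle_counts, margin=0.8):
    """Flag replacement before the predicted limit (here at 80% of life)."""
    return accumulated_damage(cycle_counts) >= margin

usage = {"mild": 40_000, "moderate": 8_000, "severe": 200}
# damage = 0.4 + 0.4 + 0.1 = 0.9 -> exceeds the 0.8 margin
print(accumulated_damage(usage), should_replace(usage))
```

The same bookkeeping extends naturally to vibration, humidity, or corrosion exposure bins; the difficult part in practice is obtaining trustworthy cycles-to-failure data for each exposure class.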

In general, the issues of uncertainty mean that no one method will provide a useful, robust, and comprehensive prognostic capability for a total aircraft logistic system of systems. Rather, an integrated approach is needed, combining information from the individual detection, diagnostic, and prognostic functions at the appropriate component, Line Replaceable Unit, sub-system, system, and system-of-systems levels, and from both on-board and off-board sources. The actual prognostic approach will also vary depending on how well the system behavior and failure mechanisms are understood, and how well the aircraft sensors and the maintenance and logistic systems capture and use the required information. The approaches will necessarily involve a mixture of physics/model-based, rule-based, and data-driven algorithms embedded in an artificial intelligence reasoning framework to enable the desired automatic decisions to be made in the presence of uncertainty.

Prognosis Requirements and Their Validation

The creation of requirements for Prognosis, the predictive element of prognostics, and the validation4 thereof, presents several challenges for the technical community. Unlike the deterministic methods used in diagnostics, predictions of future health involve multiple unknowns. The most apparent of these are the component's current state of health and its future usage. The current state prior to failure may be inferred from malfunction precursors or the absence of precursors, but their detection can be erratic and their interpretation uncertain in the incipient stages of failure.

The use of physics-of-failure modeling approaches for electronic components and devices, like those used for mechanical systems, will be a powerful tool in support of electronic prognostic capabilities. It is felt that this is likely because the root cause of almost all electronic device or component failures is actually mechanical: something physically breaks at the sub-component, solder joint, connection, layer, or de-lamination level. Solder fatigue models are already under development and show promise. Where such a mechanical failure mechanism exists, a combination of prognostic approaches should be used to predict failure.
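As a hedged sketch of the kind of solder-fatigue model mentioned above, the following uses a Coffin-Manson-style power law for thermal cycling; the constants C and n are hypothetical placeholders, since real values come from accelerated-life testing of a specific package and solder alloy.

```python
# Illustrative Coffin-Manson-style low-cycle-fatigue model for solder joints:
#   N_f = C * (dT ** -n)
# C and n are assumed, not measured; an exponent near 2 is often quoted
# for eutectic solder, but real programs fit these from test data.
C = 4.0e8   # assumed material/geometry constant
n = 2.0     # assumed fatigue exponent

def cycles_to_failure(delta_T):
    """Predicted thermal cycles to solder-joint failure for swing delta_T (deg C)."""
    return C * delta_T ** -n

def life_consumed(cycles_seen, delta_T):
    """Fraction of fatigue life consumed at a given thermal-cycle amplitude."""
    return cycles_seen / cycles_to_failure(delta_T)

# A 100 deg C swing gives N_f = 4e8 / 1e4 = 40,000 cycles in this sketch.
print(cycles_to_failure(100.0))          # 40000.0
print(life_consumed(10_000, 100.0))      # 0.25
```

A model of this shape plugs directly into the cumulative-damage bookkeeping described earlier: each monitored thermal cycle consumes a computable fraction of joint life.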

4 Validation and verification are terms that are often confused. In this context, validation means proving you’ve built the right product (it does what you intended), verification means proving the product is built right (built according to the design specifications).


It is important to recognize the distinction between predicting the expected failure time and predicting the point when a system reaches a specified probability of failure.

In all but the simplest components, remaining life also depends on the destructive forces encountered during use (loads, temperature, humidity, corrosive exposure, shock, etc.). These are often governed by unpredictable events and circumstances that cannot be known a priori. Additionally, the models used to generate predictions can have multiple input parameters, some of which may themselves be random variables.

Extrapolating an observed trend to a failure threshold predicts the expected time of failure, but this only reveals when something will fail on average. At that point, there could be a 50% probability that the failure has already happened. Further, the probability of failing at exactly that point approaches zero as we increase the precision in defining that point. Thus, a prognosis system should not necessarily focus on the expected time of failure, but rather on the point in time when a component should be taken out of service5 so that failures are avoided with a probability determined by the level of acceptable risk.
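The removal-time idea above can be made concrete with a small sketch: given a failure-time distribution, compute the time at which the probability of failure equals an acceptable risk rather than acting at the 50% point. The Normal failure-time model and its parameters are illustrative assumptions.

```python
from statistics import NormalDist

# Sketch: rather than acting at the *expected* failure time (50% chance the
# failure has already occurred), compute the removal time at which the
# probability of failure equals an acceptable risk level. The normal
# failure-time distribution (mean 1000 h, sigma 80 h) is an assumption.
failure_time = NormalDist(mu=1000.0, sigma=80.0)

def removal_time(acceptable_risk):
    """Latest time at which P(failure has occurred) <= acceptable_risk."""
    return failure_time.inv_cdf(acceptable_risk)

print(failure_time.inv_cdf(0.50))    # 1000.0  (the expected failure time)
print(round(removal_time(0.05), 1))  # ~868.4  (5% risk: remove ~132 h earlier)
```

The gap between the two printed times is exactly the cost of bounding risk: lower acceptable risk pushes the removal point earlier and sacrifices more usable life.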

Methods that form predictions based on uncertain inputs produce uncertain outputs. How then can a requirement be framed for an uncertain output? How can a system be tested to determine if it satisfies these requirements? If a prediction is acted upon and an operational component is removed from service, how can its failure prediction be validated since the failure didn’t happen? As a given failure progresses, indications typically become more pronounced and predictions can be updated. Predictions made near the end of life for example, may be very accurate. At what point in this iterative process should prediction performance be judged? It is generally accepted that prognosis will mature with field experience. How can we evaluate the level of maturity and how might we use field data to improve predictions in the interim?

To facilitate this calculation, it is useful to form the probability density function (pdf) as a function of time (or use). This is shown in Figure 4. Note that the prediction in this figure and the ones that follow are illustrated as a static forecast in time (a snapshot of what would ordinarily be a continuously updated assessment of the future). In other words, this prediction is the best one that we have at the current time (t0 in this illustration). This situation is further complicated when new data arrives and predictions are updated. We will discuss updated forecasts later in this paper.

This section begins to address these questions by introducing concepts that lead to requirements specification and by defining three central parameters that form the basis of prognosis requirements. These parameters are: maximum allowable probability of failure (this bounds risk), minimum acceptable probability of failure (this bounds unnecessary maintenance) and lead time (this provides advanced warning). The motivation behind these parameters and the validation procedure to determine if these requirements have been satisfied are explained in the discussion that follows.

Prognosis Enables Risk Management The underlying promise of prognosis is to provide confident information about an upcoming failure with sufficient advanced notice to take appropriate action. The appropriate action is often to remove the indicated component prior to its failure, but not so early that its usable capability is not fully depleted. Since achieving the theoretical limit of 100% failure avoidance is both impractical and wasteful, we are forced to accept some level of risk. Prognosis empowers the user to make these decisions by providing the information necessary to evaluate and manage the risks associated with the actions to be taken. Requirements for prognosis must therefore address the type and quality of the information needed to assess the risks involved.

Figure 4 - Damage Forecast And The Probability Density Function

In Figure 4, the state of damage at the current time (t0) is not known exactly and is shown as a random variable (the small pdf at the bottom left of the figure) whose value lies between d0 and d1 with 95% confidence.

The key parameters associated with risk are the probability of a given failure and the impact of that failure. The latter depends on many factors discussed elsewhere, including the effect of failure on mission success, safety, availability, and cost. Of significance here is that prognosis enables the computation of the probability of failure as it continuously evolves during component use.

5 Removing a component from service is not the only remedial action. Some components may intentionally be allowed to fail, but predictions are still needed to order spares, prepare for maintenance etc.


The compound effect of the unknown current state and the unknown future usage typically causes the probability density function to broaden with time. The expected6 damage progression is shown as a curved dotted line that extends from the most probable state at the current time (red dot at t0) to the failure threshold.
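The broadening forecast cone can be reproduced with a simple Monte Carlo sketch: sample an uncertain initial damage state, propagate it under uncertain per-step usage, and collect the resulting failure times. All growth-rate and threshold numbers below are illustrative assumptions.

```python
import random

# Monte Carlo sketch of the forecast cone in Figure 4: an uncertain initial
# damage state plus uncertain future usage makes the failure-time pdf broaden.
random.seed(1)
THRESHOLD = 1.0
N_PATHS = 10_000

def simulate_failure_time():
    damage = random.gauss(0.30, 0.03)   # uncertain current state (d0..d1)
    t = 0
    while damage < THRESHOLD:
        # uncertain usage/loads: random, non-negative damage growth per step
        damage += max(0.0, random.gauss(0.01, 0.004))
        t += 1
    return t

times = sorted(simulate_failure_time() for _ in range(N_PATHS))
mean_t = sum(times) / N_PATHS
lo, hi = times[int(0.025 * N_PATHS)], times[int(0.975 * N_PATHS)]
print(f"expected failure time ~{mean_t:.0f}, 95% of outcomes in [{lo}, {hi}]")
```

The sorted failure times are exactly the empirical failure pdf of the figure: the mean gives the expected failure time, while the 2.5%/97.5% order statistics trace the 95% bounds of the cone.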

When the forecast in Figure 4 is sliced horizontally at any specified damage level, the result is a probability density in time showing when this damage level will occur (see Figure 6). Here the time when damage d1 occurs is predicted to lie between tL and tU 95% of the time.

This forecast may be found by calculating the trend in a series of observations that are fit to a model. The extrapolation point where this line intersects the failure threshold indicates the expected time of failure. The two solid red curves represent a forecast of the upper and lower bounds that enclose some percentage (in this case 95%) of all the possible outcomes. The failure pdf (the bell-shaped curve in blue drawn above the failure threshold) shows how the outcomes are distributed. The probability of failure between any two points in time is equal to the area under this curve. Thus the probability of failure when we reach the expected time of failure is the area from t = 0 to t = expected failure time (area shaded in red). This is ordinarily half the total area, giving a probability of failure of 50%.
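The trend-extrapolation step described above can be sketched with an ordinary least-squares fit intersected with the failure threshold; the observation values are illustrative.

```python
# Sketch of trend extrapolation to the failure threshold: fit a line to a
# series of damage observations and find where it crosses the threshold.
def fit_line(ts, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(ts)
    tbar, ybar = sum(ts) / n, sum(ys) / n
    slope = (sum((t - tbar) * (y - ybar) for t, y in zip(ts, ys))
             / sum((t - tbar) ** 2 for t in ts))
    return slope, ybar - slope * tbar

def expected_failure_time(ts, ys, threshold):
    slope, intercept = fit_line(ts, ys)
    return (threshold - intercept) / slope  # where the trend meets threshold

ts = [0, 10, 20, 30, 40]                    # observation times
ys = [0.20, 0.24, 0.31, 0.34, 0.40]         # observed (noisy) damage trend
print(expected_failure_time(ts, ys, 1.0))   # ~160.4
```

This yields only the expected failure time; as the surrounding text stresses, a usable prognosis also needs the spread of outcomes around that point, not just the point itself.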

Figure 6 - Horizontal Slice Through The Forecast

Requirements: Lead Time, Maximum Allowable Probability of Failure and Minimum Acceptable Probability of Failure

The underlying requirement for prognosis is to provide sufficient lead time (Lt) so that appropriate actions can be taken within the acceptable bounds of risk and unnecessary maintenance. From what reference point should lead time be measured? Using the expected failure time may only provide sufficient time to prepare for half of the failures. It is clearly more useful to measure lead time from the "Just-in-Time point"7 (JIT point) [5]. The JIT point was previously defined in the reference as the point in time where the probability of failure equals one minus the required probability of failure avoidance. In other words, repairs must be made no later than the JIT point to achieve the desired probability of avoiding a failure. In order to adequately prepare for this failure, actions must be taken no later than the JIT point minus the lead time. Figure 7 illustrates the relationship between these parameters, where ta is the time by which actions must be taken (i.e., spares managed, maintenance scheduled, etc.) to prepare for the predicted failure, given a 95% failure avoidance requirement8.

If the forecast shown in Figure 4 is sliced vertically at any time t1, the result is a probability density of the predicted damage at that time (see Figure 5). The 95% interval in this vertical slice lies between damage dL and dU.

Figure 5 - Vertical Slice Through The Forecast

It may also be appropriate to define failure as a critical damage size (dc) or greater. Figure 5 also shows a vertical slice at time t2 where the probability of failure, thus defined, is equal to the area shaded in blue under this pdf.

8 Note that failures can always be avoided at the last second when symptoms become obvious. Failure avoidance probabilities in this context refer to those predicted at or before time ta. Consequently, the term “Just in time” does not refer to the condition-based approach where a component is removed just prior to failure, but rather to a predicted point far enough in the future so that maintenance has just enough time to prepare for an anticipated failure whose probability of occurrence is specified in the prognosis requirement.

6 “Expected” in this context refers to the expectation of the random variable in the statistical sense. For simplicity, many of the pdfs in these figures are rendered as unimodal and symmetrical functions. Consequently, the expectation and the mode appear to be the same point. In practice, this is often not the case.


To satisfy the requirement for failure avoidance (95% in this example), a component must be removed on or before the point where the probability of failure equals pmax (e.g., maximum allowable probability of failure = 5%). However, removing a component when the probability of failure is too low will result in excessive unnecessary maintenance. If this probability is 0.01%, for example, 10,000 components will be removed for repair when only one might actually require it.

To satisfy the intent to avoid unnecessary maintenance, we impose a second requirement for minimum acceptable probability of failure (pmin). This requirement ensures that we will not waste time preparing for and removing a component until it is likely to fail with at least a probability of pmin.

In order to satisfy both requirements, the component must be removed at a point that falls between pmax and pmin. The further we move to the right within this interval, the lower the probability of unnecessary maintenance; the further we move to the left, the lower the probability of having a failure. The point that falls halfway between pmax and pmin is our refined definition of the JIT point. Note that this definition remains consistent with the previous definition when pmax = pmin. The joint requirement is well formed as long as pmax >= pmin. Using the midpoint as the target for prediction has the advantage of being the easiest to validate, as will be explained later.
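The compliance interval and the refined JIT point can be sketched numerically: given a predicted failure-time distribution, invert it at pmin, at pmax, and at their midpoint. The Normal failure-time model and all numbers are illustrative assumptions.

```python
from statistics import NormalDist

# Sketch of the compliance interval: find the times at which the probability
# of failure reaches pmax and pmin, and the JIT point midway between them
# (midway in probability). Model and parameters are assumptions.
failure_time = NormalDist(mu=1000.0, sigma=80.0)
p_max, p_min = 0.05, 0.01   # example requirement: 95% failure avoidance

t_latest = failure_time.inv_cdf(p_max)             # remove no later than this
t_earliest = failure_time.inv_cdf(p_min)           # remove no earlier than this
t_jit = failure_time.inv_cdf((p_max + p_min) / 2)  # refined JIT point

print(round(t_earliest, 1), round(t_jit, 1), round(t_latest, 1))
```

Any removal time inside [t_earliest, t_latest] satisfies both requirements; the JIT point sits between them, trading failure risk against unnecessary maintenance exactly as described above.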

Figure 7 - Lead Time Measured From The Just-In-Time Point

This JIT concept raises the next question: "How close should the prediction be to the ideal probability of failure in order to satisfy the intended requirements?" Prognosis intends to avoid failure while minimizing unnecessary maintenance. The JIT point was previously defined as the one ideal point where there is a perfect tradeoff between these two opposed purposes: avoiding failure and avoiding unnecessary maintenance. In practice it is very difficult to define, build, and validate a predictor that identifies this exact point. It is more practical to break this point into two points having separate requirements that define an interval between failure avoidance and unnecessary maintenance. These points are the maximum allowable probability of failure (pmax) and the minimum acceptable probability of failure (pmin), respectively (see Figure 8).

Note that pmax, pmin and the JIT point may also be found in the prediction from the vertical slice if the definition of failure is a damage of dc or greater, as described in Figure 5. This is accomplished by finding the time t2 where the blue area above the critical damage size is equal to the specified probability of failure. The analyses and methods described in this paper apply to either definition of failure.

Validating Predictions

Previously the question was posed, “How can a requirement be framed for an uncertain output?” We have asserted that the requirement must be specified in probabilistic terms (i.e. maximum and minimum probabilities) since the output exhibits random behavior. Consequently, the validation method must also be expressed in probabilistic terms for the reasons that follow.

The prognosis method must predict the point in time where the probability of failure is [pmax + pmin]/2. To validate this prediction, we will collect field data from components at this point in time and show that their probability of failure does indeed equal [pmax + pmin]/2. However, a specific failure probability cannot be exactly validated until the outcome of all components to which this requirement applies is known with certainty9.

Figure 8 - Requirements Illustrated

9 Validation by analysis is even less defendable.

In all cases, using field data will only produce an estimate of the given parameter, and thus should include the confidence interval associated with that estimate.

Suppose, for example, a particular lot of 100 components is used to validate that the predicted JIT point does indeed correspond to the point in time where the probability of failure is 3%. It could happen that more than 3% in this lot have failed before this point, but this does not necessarily mean that the prediction fails to meet the requirement. Conversely, if the lot contains exactly 3% failed components, this does not necessarily mean that the prediction does meet the requirement. Validation can only show that, as more and more field data are analyzed, the observed probability of failure converges to an interval around the required value at the required time.

This paper introduces some of the theory associated with validating prognosis performance against requirements, but there are practical aspects which must be considered. For example, if the maximum allowable probability of failure is small (e.g., 0.1%), then the sample sizes needed to produce an estimate can easily increase tenfold over the illustrations below. More sophisticated methods are needed to enable the estimation process to be accomplished with reasonable sample sizes even for small probabilities of failure.10

Therefore, we will make the following definition: A requirement specified for a random variable is statistically satisfied if a statistical estimator for the variable that is based on the field data (values of the random variable) satisfies the requirement within an acceptable tolerance for an adequate proportion of the time.

Approach 1, For Components That Are Allowed to Fail In Situ

Note that the definition includes the condition that the requirement is satisfied to a tolerance. For example, suppose we want to validate a requirement that a coin is fair (probability of heads is 50%). If we flip the coin an odd number of times, no matter how large, the observed proportion can never be exactly 50%, even for a perfectly fair coin. If we are willing to accept that the coin is fair when the proportion of heads lies between some bounds (e.g., 49.99% and 50.01%), then we have specified a requirement that can be validated if a fair coin is flipped enough times.

To validate a prediction method whose requirements indirectly specify the probability p of a component failing on or before the JIT point, we create a statistical model of the random behavior using a binomial distribution. This distribution is constructed using a collection of n trials from field data, where the outcome (operational or failed) of each trial is a random event. Given no external bias (all components are equal at the start), the proportion of components that failed on or before the predicted JIT point, out of the total number n of evaluated components, is an estimate of the probability of failure p at that point. Furthermore, a confidence interval can be calculated for this estimate based on the variability of the proportion over the n components (trials). As the number of trials increases, the estimate and the confidence interval converge to the true value p.

The second condition in our definition is that the requirement must allow for the random behavior of the variable we are trying to validate with some reasonable procedure. In our coin-flipping problem, if the coin is fair, we can guarantee that most of the time the estimated probability of heads lies within our acceptable tolerance range, provided the coin is flipped enough times. However, due to the random behavior of the coin, we cannot guarantee that the estimate will be within tolerance on every batch of n flips. No matter how many times you flip a fair coin, it is always possible that every flip comes up tails; this will be extremely rare, but it is not impossible. Therefore, our requirement must also specify that it has been satisfied if the tolerance is met some sufficiently large proportion of the time.
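The coin-flipping argument can be demonstrated by simulation: with enough flips per batch, most batches land inside the tolerance band, but never all of them. The batch sizes and the (wider-than-the-text) tolerance band below are illustrative choices.

```python
import random

# Simulation of the coin example: with enough flips, the estimated probability
# of heads for a fair coin falls inside the tolerance band most of the time,
# but random behavior means no single batch is guaranteed to comply.
random.seed(42)

def estimate_heads(n_flips):
    return sum(random.random() < 0.5 for _ in range(n_flips)) / n_flips

def proportion_within_tolerance(n_batches, n_flips, lo=0.49, hi=0.51):
    hits = sum(lo <= estimate_heads(n_flips) <= hi for _ in range(n_batches))
    return hits / n_batches

# With 10,000 flips per batch (sigma ~ 0.005), roughly 95% of batches land
# inside +/-0.01 of 0.5; with only 100 flips per batch, far fewer do.
print(proportion_within_tolerance(200, 10_000))
print(proportion_within_tolerance(200, 100))
```

This is exactly the "sufficiently large proportion of the time" clause: the validation requirement is on the fraction of compliant batches, not on any single batch.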

This procedure can be implemented in a number of ways11. Using the adjusted Wald method12, for example, we illustrate example13 estimates of the probability of failure and of plow and pup, which represent the lower and upper limits of the 90% confidence interval for the estimate of p (see Figure 9).
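A minimal implementation of the adjusted Wald (Agresti-Coull) interval cited above follows: add z²/2 pseudo-successes and z²/2 pseudo-failures before forming the usual Wald interval. The example counts are illustrative, not the data behind Figure 9.

```python
from statistics import NormalDist

# Adjusted Wald (Agresti-Coull) confidence interval for a binomial proportion.
def adjusted_wald(failed, n, confidence=0.90):
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # ~1.645 for 90%
    n_adj = n + z * z
    p_adj = (failed + z * z / 2) / n_adj                # shrunken estimate
    half = z * (p_adj * (1 - p_adj) / n_adj) ** 0.5
    return max(0.0, p_adj - half), p_adj + half

# e.g. 2 failures observed among 50 components sampled at the predicted
# JIT point (hypothetical counts)
p_low, p_up = adjusted_wald(failed=2, n=50)
print(round(p_low, 3), round(p_up, 3))   # ~0.008, ~0.119
```

Note how wide the interval is at n = 50; this is the effect, visible in Figure 9, that only shrinks as more field components are evaluated.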

Using Field Data for Requirement Validation and Maturation

Field data can thus be used for validation and prognosis maturation in one of two approaches, depending on whether or not the component is allowed to operate until failure. If the component is not removed from service prior to failure, statistical methods can be used to estimate the true probability of failure, and the associated confidence, at the predicted JIT point. These data may also be used to estimate the true time at which a given probability of failure is reached. If a component is removed prior to failure, prediction confidence can be validated by projecting its current state to the failure threshold and applying the same statistical methods.

10 Concepts like importance sampling or extreme value theory may be applicable, but their consideration is outside the scope of this paper.
11 Reiczigel, J., 'Confidence intervals for the binomial parameter: some new considerations', http://bio.univet.hu/qp/Reiczigel_conf_int.pdf
12 Agresti, A. and Coull, B. A., 'Approximate is better than "exact" for interval estimation of binomial proportions', The American Statistician, 52, 119-126 (1998).
13 If this illustration is repeated using the same number of components, the bounds and prediction will vary since these are random samples; however, the confidence level (90%) remains the same for all.


Number of Evaluated Components   Estimate of p   plow    pup
10                               .185            .006    .364
30                               .103            .015    .190
50                               .082            .020    .145
100                              .052            .016    .088
500                              .050            .034    .066
Figure 9 - Sample Predictions And 90% Confidence Intervals For p = 5%

Figure 9 contains simulated results from various numbers of trial components, where the true value of p in the simulation was 5%. The confidence interval discussed above refers to the estimate of the probability of failure from the field data, not to the prediction of the JIT point using the prognosis model. This confidence simply states that there is a 90% probability that the true probability of failure among the n components lies between plow and pup. Since the field data only provide an estimate of p within some confidence bounds (90% in this example), the results from any one lot of n components could be wrong 10% of the time. Thus, any one lot of data cannot definitively validate the prognosis model. As more data are added, the confidence bounds shrink, making it reasonable to recalculate more stringent bounds (e.g., 99%, 99.999%, etc.). The more confident the estimate from field data, the more definitive the validation can be. Note that defining the JIT point as the midpoint between pmax and pmin has the benefit of reducing the amount of field data needed for validation in most cases. This is true when the estimate of p from field data also lies at the midpoint between plow and pup. Accordingly, a prognosis method is valid to any desired confidence when the agreement between the estimated probability of failure from n field data points sampled at the predicted JIT point and the required probability of failure at the same point is statistically satisfied. This test is performed using the following procedure:

1. Choose a desired confidence level for validation.
2. Using n components from the field, count the number that have failed on or before the predicted JIT point.
3. Using the adjusted Wald method (or equivalent), estimate the probability of failure p and its confidence bounds plow and pup from the field results in step 2.
4. If the upper and lower bounds of the confidence interval, plow and pup, fall within the closed interval [pmin, pmax] = [(minimum acceptable probability of failure), (maximum allowable probability of failure)], the prediction method has met its requirements for the n components.
5. If the estimate of p is within the interval [pmin, pmax] but both bounds of the confidence interval are not, the method is valid at a lower confidence, and more field data are needed to either achieve the desired confidence or show that the prediction has not met its requirement for all field data. In the latter case we can use adaptation methods to update the JIT prediction.14
6. If the estimate of p and the confidence bounds plow and pup continue to converge outside the requirements interval [pmin, pmax] as n increases, the probability that the requirements will be satisfied using more field data is 1 minus the confidence chosen in step 1. Usually the requirement will not be satisfied at this level of confidence, and a decision must be made to update the JIT prediction.

While the method above provides a pass/fail test to determine whether the requirements are satisfied, it does not provide a direct means to improve the prediction. If components are allowed to remain in service for a trial period until they fail, a statistical model for determining the correct time for the JIT point, with the requisite confidence interval, can be constructed using order statistics. For example, if the requirement for maximum allowable probability of failure is 7% and the minimum acceptable probability of failure is 3%, the target probability of failure at the JIT point is 5%. Given the time of failure for each of n trial components, the 5th time percentile is an estimator of the point where 5% of the failures have occurred. Given the example results in Figure 10, we see from evaluating 10 trial components that we can be 90% confident that the true JIT point is between times 85.13 and 175.3. As before, when more components are evaluated, the upper and lower bounds for the same 90% confidence come closer together and surround the true time where the probability of failure is 5%.
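The order-statistics estimate of the JIT point from observed failure times can be sketched as follows; the Weibull failure-time data and the bootstrap method for the 90% interval are illustrative assumptions, not the simulation behind Figure 10.

```python
import random

# Sketch of the order-statistics approach: if n trial components run to
# failure, the 5th-percentile failure time estimates the point where the
# probability of failure is 5%. A bootstrap gives an approximate 90%
# confidence interval. The Weibull model is an assumption, not field data.
random.seed(7)

def percentile(sorted_xs, q):
    return sorted_xs[min(len(sorted_xs) - 1, int(q * len(sorted_xs)))]

# Simulated failure times for 500 trial components
failure_times = sorted(random.weibullvariate(150.0, 4.0) for _ in range(500))
jit_estimate = percentile(failure_times, 0.05)

# Bootstrap: resample with replacement, re-take the 5th percentile each time
boot = sorted(
    percentile(sorted(random.choice(failure_times) for _ in range(500)), 0.05)
    for _ in range(1000))
t_low, t_up = boot[50], boot[949]   # approximate 90% interval
print(round(jit_estimate, 1), round(t_low, 1), round(t_up, 1))
```

As in Figure 10, rerunning with more trial components tightens the interval around the true 5%-probability-of-failure time.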

Number of Evaluated Components   Estimate for JIT time   tlow     tup
10                               153.5                   85.13    175.3
30                               138.1                   99.70    154.2
50                               132.9                   104.5    146.8
100                              122.8                   100.6    134.4
500                              122.1                   114.4    128.0

Figure 10 - Predictions Of Just In Time Point And Its 90% Confidence Interval

14 Engel, S. and Hoitsma, D., 'Adaptive Prognosis', to be submitted for publication.


Figure 10 shows a slow convergence of the prediction to the true value, and a decrease in the width of the confidence interval, as the number of components evaluated in the field increases. After 500 trial components, the bounds are reasonably narrow around the true JIT point of 121.9 used in this simulation.

validation and prediction improvement can be used. Admittedly, this approach is not as exacting as Approach 1; however, it is a reasonable compromise when components are not permitted to fail.

Approach 2, Components That Are Not Allowed to Fail In Situ

While it may be acceptable to allow some components to fail in the field, the failure of safety-critical components must be avoided at all costs. How then can prediction confidence be determined? One approach is to inspect the degree of damage found in components that have been removed from service at some known time. Damage found during inspection should lie within the forecast cone, and can be compared to the damage predicted in the vertical slices (refer to Figure 5) formed at the corresponding time in service. For example, a component removed from service at time t1 should have damage that falls between dL and dU per the predicted pdf in Figure 5.

Predictions Are Updated

Up to this point, predictions have been illustrated as static forecasts in time. The situation is further complicated when new data arrive and predictions are updated. Normally, as the failure approaches, indications become less vague, affording an opportunity to update forecasts and reduce uncertainty. This raises another question: which version of the prediction do we act upon, and which do we use for validation? An updated prediction stating that we have less than the required lead time may in fact be more accurate, and will ultimately affect our avoidance of failures, but it does not satisfy our lead time requirement. In practice, we should let predictions continue to be updated with new data, and only take action and validate performance when the current time equals ta from the most recent prediction per Figure 8.

Current time

Thus the requirement for maximum allowable probability of failure should be interpreted from the point of view of providing advanced warning. The ultimate probability of failure avoidance is influenced by symptoms that may be observed right up to the point of failure. For example, we may require 10 hours of lead time, but we can still avoid a failure if the symptoms become sufficiently convincing 1 hour before failure. At this point however, we will not have satisfied the required lead time.

Just-In-Time Point PoF=5% Failure pdf 95% area shaded red

Damage

Failure Threshold

Each data point found in the field can be projected to the failure threshold using prediction models

The number that fall to the left of the JIT point in proportion to the total number, approximates the probability of failure at the JIT point

5. LESSONS LEARNED As various programs progress to develop comprehensive diagnostic, prognostic and health management (PHM) systems many lessons are learned. Similarly, as specific prognostic directed projects are undertaken additional lessons are accumulated. The following paragraphs represent and discuss some of these accumulated lessons learned to date. They address both some general PHM programmatic lessons learned as well as those specifically focused on prognostic capabilities.

time t0

ts

Figure 11 - Using The State Of Removed Components To Measure Prediction Confidence The inspection of one component at service life ts however, only provides one point to compare to the vertical slice (pdf) at time ts. To form a more useful comparison, the prediction model can be used to project the damage from each inspection to the failure threshold where the points from n inspections can be accumulated in a histogram as shown in Figure 11. This histogram can be compared to the pdf generated by the prediction model using various statistical tests. Furthermore, when data points gathered from field inspections are extrapolated to the failure threshold, the proportion that reach failure on or before the JIT point in comparison to the total number n of data points can be used to estimate the probability of failure at the JIT point and define the confidence interval as previously described in approach 1. Given these forecasts, the same procedure for

Prognostic capabilities are new, mostly very hard to develop, and often take time to mature, but they are doable in many, many cases. Having said this, prognostics are certainly not doable in all cases. There is a need to identify the clearly not-doable cases and then not worry about them; rather, focus your limited resources on those that are both doable and directed at high-value systems, subsystems, and components. Sometimes, what is doable as a prognostic capability for a specific subsystem or component is surprising. This appears to be the case for some electronics systems, where simple trending of measurands associated with overall system performance degradation can provide a prognostic capability. Prepare to be pleasantly surprised: record a lot of seemingly unrelated but available measurands, and continually analyze them for relationships that may track to useful life remaining predictions.

It takes significant data, experience, and maturation time to develop sophisticated prognostic and accurate useful life remaining prediction capabilities. This is true in some cases more than others, since there are always "low hanging fruit" examples where degradation occurs along easy-to-understand trend lines. The statement is particularly true for accurate useful life remaining prediction capabilities and for specific components where physics-of-failure models are not well understood or where the root-cause failure mechanisms are very random.

The maturation of comprehensive prognostic capabilities takes time, lots of data, and significant analysis resources. It also takes dedicated management commitment, a well-thought-out maturation strategy and coordinated plan, and adequate funding resources at many levels of the platform development team. This maturation will often extend well past the original platform development program, so it is necessary to plan for long-term, continuing maturation of prognostic capabilities.

A good Failure Modes, Effects, and Criticality Analysis (FMECA) will not provide prognostics by itself, but, as with the development of good diagnostic capabilities, it is a necessary starting point for many reasons. Comprehensive and up-to-date FMECAs identify failure modes and related symptoms, prioritize high-occurrence failure modes, and identify critical failure-path components. It must be noted that prognostic capabilities for a specific subsystem Line Replaceable Component (LRC) most often will not cover every failure mode for that LRC or for the elements that make it up. This translates to a question of the actual degree of prognostic coverage that exists for that LRC. In some cases the most frequently occurring and/or highest priority failure modes are covered; in other cases they are not, or the coverage is mixed. The issue is then how best to assess and represent the actual degree of prognostic coverage provided; the lesson is that this situation must be addressed to give an accurate assessment of that coverage.

Performance Based Specifications (PBS) are not ideal for driving the development of PHM capabilities and their implementation support systems. They tend to be too top-level to really drive the needed design detail. Performance requirement levels for relatively new technology-driven capabilities, like those associated with PHM, are both very hard to define and even harder to get potential contractors to agree to. A simpler, more direct approach to PHM requirements specification is desirable: if you know what works and what you want, specify it; if you know what does not work, write a specification requirement to ensure you do not get it.

Perhaps one of the most important and consistently useful propulsion system diagnostic capabilities is modern vibration monitoring. State-of-the-art vibration data acquisition and analysis techniques allow current propulsion and power-drive monitoring systems to detect very small changes in the vibration frequency signatures of rotating machinery. This corresponds to early detection of very small incipient faults, cracks, spalls, unbalance, rub, etc., for blades, gears, bearings, and the like. The number of rotating machinery faults that cause a vibration signature change is very large, making accurate vibration monitoring a very powerful diagnostic capability for a great many propulsion system component applications. Once you have an accurate vibration diagnostic capability that reliably detects and isolates small incipient faults, it can be applied as the measurand input to support prognostic modeling and useful life remaining predictions. Subsystem component seeded fault testing is extremely useful but can be prohibitively expensive. Plan for a few seeded fault tests and use them wisely. Back up this strategy with a resourced plan to capture performance degradation AND incipient fault-to-failure data whenever you can in a piggybacked environment. Examples include planning to capture this invaluable data during all possible system, subsystem, and component developmental testing, qualification testing, environmental tests, final acceptance tests, etc. From a propulsion system perspective, these could include all specific component qualification tests (e.g., fan, turbine, controls, bearing, etc.); any damage-tolerance testing (e.g., bird ingestion, HCF, FOD, etc.); and all life demonstration testing (e.g., ASMET, blade and disk spin-pit tests, etc.). Data from this type of testing is invaluable in supporting the development of needed fault-to-failure progression models, alarm decision threshold settings, and useful life remaining predictive models. Ensuring that a procedure is in place to get this data is so imperative that a very specific SOW requirement statement should be included in the propulsion system development contract. Such a contractual requirement could read as follows: "All propulsion sub-component, component, rig, engine test cell, and flight tests conducted in support of the development contract shall have adequate PHM technologies, appropriate PHM sensors, and adequate instrumentation to capture the necessary data to develop and mature prognostic models and the overall PHM capabilities set."

The big Prime Contractors want to be system integrators but do not necessarily have the "niche" technologies and expertise to provide fully capable, state-of-the-art PHM capabilities. If a new PHM technology or capability is not fully mature and commercial-off-the-shelf (COTS), they often do not seem to want to use it. This situation is very limiting in a dynamic and rapidly developing technology environment like that currently associated with PHM. Much of the new and truly innovative PHM technology and associated capability seems to reside in the small business arena. Aggressively take advantage of government SBIR programs by using specific PHM technology supporting topics to provide "niche" capability needs. Look for feeder technologies for new PHM capabilities in other related and non-related disciplines and industries. For example, much of the advanced vibration diagnostics used in gearbox monitoring came out of the signal processing and data analysis techniques found in the Anti-Submarine Warfare (ASW) field. Looking for additional and/or advanced capabilities outside the aerospace discipline will greatly expand your PHM "tool kit".

Capturing more PHM data is always better. Learn to handle and manage it so that it becomes an asset, not a burden. With respect to data related to possible prognostic predictions, this is particularly important. Captured and saved examples of periodic and frequent no-known-incipient-fault data sets are needed to establish baseline status. Also, as prognostic detection and useful life remaining models are developed and become more efficient, this previously captured data may be used to develop and execute an earlier prognostic event horizon.

Strong, well-defined Operations and Support (O&S) affordability metrics are needed to drive the design optimization of PHM capabilities. Keeping continued top-level program management commitment amid design/cost pressures through the course of the development program is very challenging. Establish some type of "stake in the ground" management commitment to PHM capabilities early in the program to act as an irrefutable foundation for the need for PHM.

Even with a comprehensive, fully automatic, and very capable PHM system, a separate pilot record switch feature is needed and always useful. No matter how automatic and complete the data acquisition coverage is, there will always be unforeseen events that the system may miss recording. The pilot record switch provides some insurance to cover these unknown and TBD events. It also offers an easy opportunity to turn every fleet aircraft into an already-instrumented "flight test asset". This feature can be invaluable in investigating and providing data to analyze new and difficult-to-understand fleet service related deficiencies.

A comprehensive PHM system is really a robust data acquisition system that will continue to surprise you as it aids in addressing TBD problems it was not originally designed to address. Experience with several successful legacy system applications has repeatedly shown that data collected by these systems were invaluable in addressing unexpected but serious problems. Often these unexpected problems were caused by basic platform design and/or operational performance deficiencies. Data collected but not originally envisioned as related to any pre-designed detection event was used to identify and understand the TBD problem, mitigate its fleet operational impacts, and support the development of a platform design change solution.

PHM system development is a very multi-disciplined, multi-functional, multi-technology, multi-faceted endeavor. Understand this well and plan to deal with it. On-board and off-board PHM capabilities need to be designed and developed at the same time, together, and integrated by the same Prime Contractor. If the development and implementation schedules of the on-board and off-board capabilities get out of synch, or if their individual design requirements do not line up, overall platform PHM system performance will suffer greatly and successful program development and implementation is doomed.

The PHM, Reliability and Maintainability (R+M), System Integrity, and Safety disciplines are married at the hip. They need to work very closely together, support each other, and attend each other's meetings. PHM, unlike the R+M, Integrity, and Safety engineering teams, has definite deliverable products, and as such should have its own engineering Integrated Product Team (IPT). It is somewhat misleading to include PHM under a specialty engineering discipline team.

On-board and off-board PHM algorithms need to be the responsibility of the on-board, air vehicle, subsystem-specific engineering design teams. This includes requirement and algorithm development, validation, and verification. Mission system and avionics hardware and software infrastructure issues can significantly limit PHM system development and maturation. PHM system development depends very strongly on their hardware, throughput, processing, storage, and software attributes, and on their successful and on-time implementation to support the planned design capabilities. They are always a problem and always let you down. PHM system developers have to watch these potential mission system and avionics infrastructure issues "like a hawk".

Autonomic Logistics (AL), or its equivalent, is the PHM development IPT's main customer. It is very easy for the logistics team to fall back on legacy supportability approaches. The effectuation of PHM capabilities and data products by the AL, logistics, or supportability teams is extremely important to actually achieving the originally envisioned benefits. This successful implementation and execution on the AL customer side is very difficult to accomplish because of the newness of some PHM capabilities, particularly prognostics; the paradigm-shifting aspects of the new maintenance concept; and the tendency of "old logistics hands" to still embrace legacy approaches.

PHM capabilities, anticipated benefits, and system attributes must be part of the overall aircraft platform system design process and its many trade studies. This covers all PHM capabilities, but the prognostic capabilities will need special focus, since they are the newest to "bring to the mix" and their tangible and understandable benefits are sometimes hard to articulate. It can be even harder to provide quantifiable benefit numbers to support these design trade studies.

6. SUMMARY AND CONCLUSIONS

This paper built on the background, benefit impacts, current playing field for predictive prognostics, and the specific design challenges, issues, and lessons learned highlighted in the previous paper "....Real Prognostics Part 1" [1]. The fleet maintainer needs and benefit impacts of real predictive prognostics, the "Big P", are obvious, real, and great. There are several types of prognostic capabilities of varying levels of sophistication, from those based on simple trending techniques to those using multiple integrated modeling approaches. The prognostic definitions used for these different approaches may also vary and are evolving with the technology. Real predictive prognostic capabilities are just one element among many interrelated and complementary functions in the playing field of PHM.

The challenges involved with developing and implementing prognostic capabilities are many, and this paper focused on those impacting and enabling the benefits associated with new performance-based logistic support concepts, global sustainment strategies, and new business practices. Similarly, the technical issues involved with developing real predictive prognostic capabilities are large and many. This paper focused on some of the issues involving uncertainty and prognostic requirements and validation; the main findings are summarized below.

Within the described PBL vision and GS concept, the development of valid and measurable metrics to quantify the impact of various prognostic technologies in a new business environment emerged as a real challenge. Indeed, the key component of any PBL implementation is the generation of measurable parameters, and the inclusion of the "Big P" in some logistics modeling tools was illustrated. Since the purpose of PBL is "buying performance", what constitutes performance must be defined in a manner in which its achievement can be tracked, measured, and assessed. In order to make decisions beneficial to both contractors and the Government, while avoiding the financial consequences of bad decisions, a PBL support package, if structured with the right metrics, provides strong incentives to develop items with "Big P" capabilities. By improving most PBL performance metrics, prognostics particularly influences reliability- and maintainability-based profits, whereby contracts may be structured to provide an inherent profit incentive for the PBL provider to lower operating costs by achieving higher product operational availability and to retain all or a portion of the savings achieved as a result of providing a better product.

The key to developing a successful prognosis system is the fundamental realization that a precise, deterministic prediction of the time of failure is not the ultimate goal. By their very nature, future health predictions are unavoidably probabilistic. We have therefore asserted that both the requirements and their validation must be cast in probabilistic terms.

Accordingly, three fundamental requirements were introduced:

1. Maximum Allowable Probability of Failure: Since advance notice for 100% of failures is not feasible, we are forced to accept some level of risk. Specifying a Maximum Allowable Probability of Failure (pmax) of 0.002 means that we want to guarantee that at most one component out of 500 might not be detected with sufficient advance notice. This requirement is analogous to diagnostic requirements for fault detection.

2. Minimum Acceptable Probability of Failure: This requirement ensures that we will not waste time removing a component until it is likely to fail with at least a probability of pmin. Specifying a Minimum Acceptable Probability of Failure (pmin) of 0.001 means that we are willing to initiate advance action (manage spares, plan maintenance, etc.) for 1000 suspect components when in fact only one may need it. This requirement is analogous to diagnostic requirements for false alarms. To satisfy both requirements 1 and 2, we must predict the point in time where the failure probability falls between pmin and pmax. The point halfway between the times at which pmin and pmax occur is defined as the Just-In-Time point. This is the ideal point of reference for advance notice.

3. Lead Time: This is the amount of advance notice required to prepare for the impending failure. This may include the management of spares, maintenance training, opportunistic maintenance, etc. Subtracting the lead time from the Just-In-Time point gives the time when notice must be given to allow sufficient time for the appropriate action to take place.

Two approaches were described for validating prognosis performance against the stated requirements. The first approach covers operational situations where components are allowed to remain in operation until they fail. The second covers situations where components are not permitted to fail, but an assessment of their health state can be made when they are removed from service. In both approaches, methods were cited for testing a prognosis design against the requirements above and for improving predictions as experience unfolds.

In conclusion, it can be said that real predictive prognostics, the "Big P", is hard and challenging, often involving very complex and difficult technical issues; but with wise development management and the appropriate resources, it is doable. Real predictive prognostics represent an essential combination of available, developing, and maturing capabilities critical to an evolving technology discipline called PHM. The challenges, issues, and lessons learned associated with predictive prognostics are many and will certainly continue to grow as this technology discipline matures and is more widely applied. Future papers will continue to discuss and document these newfound and additional challenges, issues, and lessons learned. We will still be chasing the "Big P" for some time to come and on many new applications, so stay tuned for parts 3 and 4 of this continuing series.

7. ACKNOWLEDGMENTS

The work on prognostics requirements and validation presented in this paper was partially funded by DARPA contract HR0011-04-C-0003 under the Northrop Grumman Corporation Structural Integrity Prognosis System Program.

8. REFERENCES

[1] Hess, A., Frith, P., and Calvello, G., "Challenges, Issues, and Lessons Learned Chasing the Big 'P': Real Prognostics Part 1", paper #1595, 2005 IEEE Conference, March 2005.

[2] Hess, A. and Fila, L., "The Joint Strike Fighter (JSF) PHM Concept: Potential Impact on Aging Aircraft Problems", paper #403, 2001 IEEE Conference, March 2001.

[3] Hess, A. and Fila, L., "Prognostics, from the Need to Reality - from the Fleet Users and PHM System Designer/Developers Perspectives", paper #116, 2002 IEEE Conference, March 2002.

[4] Calvello, G., Dabney, T., and Hess, A., "PHM a Key Enabler for the JSF Autonomic Logistics Support Concept", paper #1601, 2004 IEEE Conference, March 2004.

[5] Engel, S., Gilmartin, B., Bongort, K., and Hess, A., "Prognostics, The Real Issues Associated With Predicting Life Remaining", 2000 IEEE Conference, March 2000.


AUTHOR BIOGRAPHIES

Dr Peter Frith is a principal research scientist with the Australian Defence Science and Technology Organization (DSTO). He has over twenty years' experience in the development of advanced diagnostic and health management technologies for military propulsion systems, and until recently was Head of Engine Performance at DSTO. He is currently attached to the JSF Program Office as Collaborative Program Personnel in the Air System PHM Team. His prime focus areas are requirements, verification, prognostics, and propulsion.

Andy Hess is a graduate of both the University of Virginia and the U.S. Navy Test Pilot School. Andy is world renowned for his work in fixed- and rotary-wing health monitoring and is recognized as the father of naval aviation propulsion diagnostics. Beginning with the A-7E engine health monitoring program of the early 1970s, Andy has been a leading advocate for health monitoring in the Navy and has been instrumental in the development of every Navy aircraft application since the A-7E, through the F/A-18 and V-22. His efforts largely led to the successful transition of a development program, the Helicopter Integrated Diagnostic System, into production (IMD HUMS) for all H-60, H-53, and AH-1 aircraft. More recently, Andy has been a strong advocate of prognostic capabilities and has been involved in many efforts advancing the development of these predictive capabilities for a variety of systems and their component elements. Currently, Andy is hard at work leading the development and integration of the Prognostic and Health Management (PHM) system for the Joint Strike Fighter program and is a frequent participant in the international technical community.

Stephen J. Engel is an associate technical fellow for the Integrated Systems Sector of the Northrop Grumman Corporation. For the past fifteen years, Mr. Engel has focused on diagnostics and prognostics design and development for military systems. He is presently coauthor of the model-based reasoning algorithm used for Prognostics Health Management (PHM) onboard the Joint Strike Fighter. He is also the technical lead responsible for the design of reasoning and prediction algorithms for DARPA’s Structural Integrity Prognosis System. This involves probabilistic reasoning methods that fuse physics-based failure models with information from multiple sensors to form adaptive predictions of component health. Mr. Engel holds a dozen patents in PHM, artificial intelligence paradigms and related technologies.

Major Giulio Calvello is an Italian Air Force Corps of Engineers officer and a 1994 graduate of both the Italian Air Force Academy and the University of Naples "Federico II" in Aeronautical Engineering. He is Professional Engineer (PE) certified, as well as a certified Non-Destructive Inspection and Corrosion Analysis inspector. He served for several years at the F-104 Depot Maintenance Center in Grosseto as Mechanics and Avionics Supervisor. Assigned to the Italian Air Force Logistic Command as Deputy of the National Supply Authority for the Tornado IDS, he was selected to serve as co-chairman of the Financial Sub-Group for the F3 Tornado ADV leasing arrangement between the Italian and UK Governments. He currently serves as Autonomic Logistics lead for Prognostic and Health Management (PHM) at the Joint Strike Fighter Program Office.

Dr David Hoitsma is an information systems technologist at Northrop Grumman Corporation and earned his Ph.D. in mathematics at New York University's Courant Institute. He has over twenty-five years' experience in applied mathematics and software engineering, including research and development for industry, government, and academia in the areas of engineering and manufacturing, CAE applications, simulation, and information systems. He taught for several years at Wesleyan University and later worked at NASA Goddard Space Flight Center on weather prediction. Currently he is involved in the development of prognostic algorithms for the JSF and Structural Integrity programs.

