Quality Technology & Quantitative Management
Vol. 4, No. 2, pp. 235-250, 2007
QTQM © ICAQM 2007
Environment and Usage Monitoring of Electronic Products for Health Assessment and Product Design
Nikhil Vichare1, Peter Rodgers2, Valérie Eveloy2 and Michael Pecht1
1 CALCE Electronic Products and Systems Center, University of Maryland, College Park, USA
2 Department of Mechanical Engineering, The Petroleum Institute, Abu Dhabi, United Arab Emirates
(Received August 2005, accepted June 2006)
Abstract: The life-cycle environmental and usage conditions of a product or system can be monitored and analyzed to assess its on-going health, provide advance warning of failure, and provide information to improve the design and qualification of fielded and future products. The challenge lies in the implementation of this method in application conditions. This paper presents methods to effectively collect and analyze life-cycle environmental and usage data for in-situ health assessments. An integrated hardware-software micro-programmable module for health and usage monitoring of electronic products in their application environment is presented. The hardware incorporates local sensors and on-board processing power using embedded software. These data processing capabilities are intended to enable immediate and localized processing of the raw sensor data to reduce the power and memory consumption anticipated during load monitoring. Guidelines are also provided to develop a life cycle monitoring plan that encompasses the selection of environmental and usage parameters. A case study is presented to illustrate the methodology.

Keywords: Electronic prognostics, health monitoring, HUMS, reliability, usage monitoring.

1. Introduction
Traditional handbook-based electronic reliability prediction methods (e.g., Mil-Hdbk-217, RIAC, PRISM) have been shown to be misleading and to provide erroneous life predictions [19]. Among other concerns, these methods do not accurately account for the life cycle environment of electronic equipment. This arises from fundamental flaws in the reliability assessment methodologies used, and uncertainties in the product life cycle loads [27]. All of this has led the United States military to abandon their electronics reliability prediction methods [19]. Although the use of stress and damage models permits a more accurate account of the physics-of-failure (PoF) [8], their application to long-term reliability predictions based on extrapolated short-term life testing data or field data is typically constrained by insufficient knowledge of the actual operating and environmental application conditions of the product [27].

The life cycle environment of a product consists of manufacturing, storage, handling, operating and non-operating conditions. The life cycle loads (Table 1), either individually or in various combinations, may lead to performance or physical degradation of the product and reduce its service life [30]. The extent and rate of product degradation depend upon the nature, magnitude, and duration of exposure to such loads. If one can assess in-situ the extent of deviation or degradation from an expected normal operating condition (i.e., the system’s health), this data can be used to meet the following objectives:
• provide an early warning of failure,
• forecast maintenance as needed: avoid scheduled maintenance and extend maintenance cycles,
• assess the potential for life extensions,
• reduce the amount of redundancy,
• provide guidance for system re-configuration and self-healing,
• provide efficient fault detection and identification, including evidence of “failed” equipment found to function properly when re-tested (no-fault found), and
• improve future designs and qualification methods.

Table 1. Life cycle loads.
Load | Examples
Thermal | Steady-state temperature, temperature ranges, temperature cycles, spatial temperature gradients, temperature ramp rates, heat dissipation
Mechanical | Pressure magnitude, pressure gradient, vibration, shock load, acoustic level, strain, stress
Chemical | Aggressive versus inert environment, humidity level, contamination, ozone, pollution, fuel spills
Physical | Radiation, electromagnetic interference, altitude
Electrical | Current, voltage, power

Health monitoring is a method of measuring and recording the extent of deviation and degradation from a normal operating condition. Prognostics is the prediction of the future state of health based on current and historic health conditions. Prognostics and health management (PHM) is a method that permits the reliability of a system to be evaluated in its actual application conditions. Thus, by determining the advent of failure based on actual life cycle application conditions, procedures can be developed to mitigate, manage and maintain the system [39].

Safety critical mechanical systems and structures, such as propulsion engines, aircraft structures, bridges, buildings, roads, pressure vessels, rotary equipment, and gears, have benefited from advanced sensor systems and models for diagnostics and prognostics, developed specifically for in-situ fault diagnosis and prognosis (often called health and usage monitoring or condition monitoring) [6], [37]. As a result, a significant body of knowledge exists on health monitoring of mechanical systems. Extensive research has been conducted on establishing failure precursors for mechanical systems, such as changes in the vibration signatures of roller bearings and variations in acoustic levels due to wear [7], [16], [18]. Despite the fact that most mechanical systems contain significant electronics content that is often the first to fail [17], [21], [35], the application of PHM to electronics is still in its infancy.
This is because wear and damage in electronics are comparatively more difficult to detect and inspect, owing to geometric scales and complex architecture. In addition, faults in electronic parts may not necessarily lead to failure or loss of designated electrical performance or functionality, making it difficult to quantify product degradation and the progression from faults to final failure. Furthermore, there has been a significant shortage of knowledge about failure precursors in electronics.

To provide a perspective for the PHM methodology used in this study, the current state of practice and ongoing research in PHM of electronic products and systems is reviewed. Approaches to prognostic health monitoring are typically classified in one of the following three broad categories:
• Use of expendable devices such as “canaries” and fuses [23], [41].
• Monitoring or trending of parameters that are precursors to failure [6], [7], [18], [26], [37].
• Modeling of damage accumulation and deterioration in electronic parts and assemblies utilizing exposure conditions (temperature/vibration/radiation) to compute the accumulated damage to the electronic circuits and structures [24], [29], [38].
The use of weaker elements and fuses for providing advance warning of failure has been investigated by Mishra et al. [23], who studied the applicability of semiconductor level health monitors that consisted of pre-calibrated cells (circuits) located with the actual circuitry on the same chip. Such prognostic cells would monitor semiconductor level failure mechanisms such as electro-migration, hot carrier aging, and time-dependent dielectric breakdown. However, although semiconductor cells are available and could be embedded at the product design phase, their actual application has not been reported. Anderson et al. [3] proposed a similar approach based on the use of consumable devices (canaries) having circuits that include the same physical features as the main system. The environmental degradation of these canaries is assessed using accelerated testing, and the degradation levels are calibrated and correlated to actual failure levels of the main system.

Built-in test (BIT) can be considered a pioneering effort in diagnostic health monitoring of electronics [1]. BIT is a set of evaluation and diagnostic tests that uses resources that are an integral part of the system under test. One of the earliest instruments available with BIT was the HP-3325A function generator (1980). However, several studies [1], [9], [26], [31] conducted on the use of BIT for fault identification and diagnostics showed that BIT can be prone to false alarms and can result in unnecessary and costly replacement, re-qualification, delayed shipping, and loss of system availability.

The self-monitoring analysis and reporting technology (SMART) is currently employed in certain computing equipment for hard disk drives (HDD) [36]. HDD operating parameters, including flying height of the head, error counts, variations in spin time, temperature, and data transfer rates, are monitored to provide advance warning of failures.
This is achieved through an interface between the computer’s start-up program (BIOS) and the hard disk drive. Systems for early fault detection and failure prediction have also been developed for high-end servers using continuous sensing of variables such as current, voltage, and temperature, combined with analysis using pattern recognition algorithms [25].

In terms of research on establishing precursors to failure, Lall et al. [20] experimentally identified that the phase growth rate of solder interconnects, normal stress at the chip interface, and interfacial shear stress could possibly be used as indicators of failure. However, methods to implement the use of these precursors for predicting failure in real or near real time are not yet available.

The approach for PHM based on modeling accumulated damage due to exposure conditions has not been widely used for electronics. The basic philosophy underlying this PHM approach is that damage is a function of the loads experienced by the product in its life cycle environment. Exposure to the life cycle loads listed in Table 1 leads to an accumulation of damage and eventually failure. This approach is most useful in assessing wearout type failure mechanisms such as fatigue, corrosion, and electro-migration. Mishra et al. [24] and Ramakrishnan and Pecht [29] monitored the temperature, humidity, vibration and shock loads experienced by an electronic board operated in automotive underhood environments. The monitored data was used in conjunction with physics-of-failure models for damage estimation and remaining life prediction. The PHM methodology was shown to accurately predict remaining life, in contrast to traditionally used electronics reliability prediction methods.
Several European electronic companies monitored the application conditions of a game console and a household fridge-freezer. The study provided statistics on usage rate and frequencies that could be used for future product development [5]. Searls et al. [34] undertook in-situ temperature measurements in both notebook and desktop computers used in different parts of the world, and found that notebooks experienced more severe temperature cycling. However, the use of these data for PHM was not considered. In Vichare et al. [38], the present authors monitored and statistically analyzed the temperatures inside a notebook computer, including those experienced during usage, storage, and transportation, and discussed the need to collect such data both to improve the thermal design of the product, and for prognostic health monitoring. The effects of power cycles, usage history, CPU computing resources usage, and external thermal environment on peak transient thermal loads were characterized. It was found that the product thermal operating envelope differed considerably from the operating conditions specified by environmental standard IPC SM-785 [11] and from the worst-case operating conditions assumed by the vendor, in terms of absolute temperatures, temperature cycle amplitudes, and temperature ramp rates. A major challenge in PHM is its implementation within the application environment. Strategies are required to design an effective PHM process for the product and application specific needs. Environmental and usage load profiles need to be efficiently and accurately captured in the application environment, and utilized in real time or near real time health assessment and prognostics. This paper outlines generic strategies both for load monitoring and conversion of the sensor data into a format that can be used in physics-of-failure models, for both damage estimation and remaining life prediction due to specific failure mechanisms. 
The selection of appropriate parameters to monitor, design of an effective monitoring plan, and selection of appropriate monitoring equipment are discussed. Methods to process the raw sensor data during in-situ monitoring for reducing the memory requirements and power consumption of the monitoring device are suggested. Conceptual approaches are also presented for embedding such processing capabilities in monitoring systems to enable data reduction and simplification (without sacrificing relevant load information) prior to model input for health assessment and prognostics. The strategies presented are generically applicable to electronic health monitoring processes and are illustrated using the thermal load data generated in [38].
2. In-Situ Monitoring of Environmental and Usage Loads

In many applications, it is not always known a priori what parameters need to be measured, nor with what frequency or precision. Preliminary field tests may be justified, to understand what data is actually relevant in a fully operating system. For health assessment as well as design decisions, monitoring plans are required to capture the distribution of loading parameters and usage profiles. Table 1 lists loads that are commonly observed in the life cycle of a product. The life cycle loads depend on the usage conditions such as usage frequency, severity, and period of use. The following sections outline the steps in implementing a process for effective monitoring of life cycle loads.

2.1. Selection of Load Parameters for Monitoring

The first step in the data collection process is to select the life cycle parameters to be monitored. Parameters can be identified based on factors that are crucial for safety, are likely to cause catastrophic failures, are essential for mission completeness, or can result in long downtimes. Selection can also be based on knowledge of the critical parameters established from past experience and field failure data on similar products, and
qualification testing. Methods such as failure modes, mechanisms and effects analysis (FMMEA) [24] can be used to determine parameters that need to be monitored. A listing of common failure mechanisms and failure sites in electronics with the relevant loads and associated failure models can be obtained from standards including JESD659-A: Failure-mechanism-driven reliability monitoring [12], JEP 122C: Failure mechanisms and models for silicon semiconductor devices [13], JEP143A: Solid-state reliability assessment and qualification methodologies [14], JESD91A: Method for developing acceleration models for electronic component failure mechanisms [15], SEMATECH #00053955A-XFR: Semiconductor device reliability failure models [32], and SEMATECH #99083810A-XFR: Use condition based reliability evaluation of new semiconductor technologies [33].

2.2. Selection of Monitoring Equipment

The choice of monitoring equipment depends on the parameter to be measured, desired measurement accuracy, size and weight limitations, mounting constraints, desired operating range and operating conditions, power requirements, and purchase and sustainment costs. In Table 2, the left column presents user requirements for in-situ monitoring of life cycle loads. The right column presents a list of desired features of the sensor system, combinations of which have to be integrated in the equipment for use.

Table 2. Monitoring requirements versus desired sensor system attributes.

Monitoring Requirements (left column)
• Ability to simultaneously monitor different loads
• Easy to setup
• Software for communication
• Low power consumption
• Sufficient memory
• Small and light
• Self contained with minimum intrusion into host system
• Enables real time evaluation of sensor data
• Enables smart monitoring based on thresholds
• Requires minimum user intervention
• Reliable
• Easy to mount
• Remote access
• Low cost

Equipment Features (right column)
• Multi-sensor module
• Better sensor attributes (accuracy, range, resolution, sensitivity, frequency response)
• Self powered sensors
• Smart power saving features
• Miniature size
• On board analog to digital converter
• On board non-volatile memory
• Wireless capabilities
• On board processing
• Embedded algorithms
• Light weight
• Standard components
• Host software
• Robust packaging
• In-process calibration
• High Signal/Noise ratio
Advances in the areas of sensors, microprocessors, compact non-volatile memory, battery technologies, and wireless telemetry have enabled the implementation of miniature sensors and autonomous data loggers. However, for in-situ health monitoring, there is still a growing need for miniaturization of sensor systems operated using low-power portable power supplies. One of the major constraints is power consumption, which strongly
influences the selection of components and hence the size, weight, and cost of the monitoring system. The generic architecture of a sensor system for life cycle data collection for health monitoring is depicted in Figure 1.
[Figure 1: block diagram comprising internal sensors, an external sensor module, a microprocessor (with ADC), SRAM memory (data storage), flash memory (embedded software), a transceiver, an external battery, and a handheld device / base station PC.]

Figure 1. Block diagram of life cycle data collection system.

The power consumption of sensor systems can be divided into three domains: sensing, communication, and processing [14]. The power consumed for sensing varies depending upon the parameter being monitored. Periodic sensing can reduce power consumption relative to continuous monitoring, but presents a risk of losing critical data. Power consumption can also be reduced by performing measurements at events triggered by defined thresholds.

Power management is even more challenging for battery-powered wireless sensor systems. Most of the power consumed by wireless sensors is expended in communication, which involves both data transmission and reception [2], [28]. The power consumption of wireless sensor systems can be reduced by integrating embedded processing capabilities with on-board processors to enable immediate and localized processing of the raw sensor data. This approach reduces the amount of data to be transmitted to the base station (processed instead of raw data), and hence reduces power consumption. In systems incorporating a large number of sensor systems working in a network, this approach would also allow decentralization of computational power and facilitate efficient parallel data processing [22].

Embedding computational power with on-board processors can also facilitate efficient data analysis for health monitoring applications. Embedded computations can be set to provide real time updates for taking immediate actions, such as powering off the equipment to avoid accidents or catastrophic failures, and also for providing prognostic horizons for conducting future repair and maintenance activities. Power consumption and flash memory of the microprocessor may limit the integration of computationally intensive algorithms with on-board processors. However, even with simple algorithms and routines to process the raw sensor data, significant gains can be achieved through in-situ analysis.
The use of portable devices in conjunction with wireless sensor systems can enable efficient fault diagnosis and prognostics by integrating more complex algorithms in the
handheld device. Customized processing and reporting tools can be programmed on portable devices for efficient maintenance activities. For example, the data collected by a Bluetooth enabled accelerometer system [25] can be downloaded to a handheld device by maintenance technicians, and can be processed further using Fast Fourier Transforms (FFT) embedded in the handheld device.

2.3. Design of Monitoring Plan

An important aspect of the in-situ load monitoring process is the setup of data collection equipment before it is dispatched to the application environment. Once the parameters to be monitored are selected, the next step is to design the complete monitoring process. The user needs to select the monitoring equipment (sensors and data recorders) and set up the monitoring process. To study user behavior, the data collection process should cover usage scenarios across all geographic regions (continents or countries) of use. If possible, the monitoring process should not be known to the user, to avoid any kind of bias. Another important item in the setup is deciding the monitoring interval and frequency. In making this decision the user has to ensure that the relevant loads are recorded and at the same time that the memory is not flooded by irrelevant load data. Table 3 provides a list of challenges and possible solutions that can guide the setup of the monitoring equipment.

For effective utilization of sensor-system memory, the user should be able to define threshold values for measurement. Appropriate setting of thresholds can facilitate efficient data collection. For example, a measurement will be recorded or a scan will be triggered only if the stimulus meets the set threshold. Events can be set to trigger above or below an absolute value, such as recording acceleration levels above 2g or humidity levels above 80% R.H.
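The absolute-value triggering described above can be sketched in a few lines. This is a minimal illustration only: the function name `record_on_threshold`, the `(time, value)` sample layout, and the sample readings are assumptions made for the example and do not come from the monitoring equipment used in this study.

```python
def record_on_threshold(samples, level, above=True):
    """Keep only the (time, value) samples that satisfy the trigger.

    samples : iterable of (time, value) pairs (hypothetical layout)
    level   : absolute threshold, e.g. 2.0 for 2 g or 80.0 for 80% R.H.
    above   : True to trigger above the level, False to trigger below it
    """
    recorded = []
    for time, value in samples:
        triggered = value > level if above else value < level
        if triggered:
            recorded.append((time, value))
    return recorded


# Hypothetical acceleration readings in g, sampled every 0.1 s.
acceleration = [(0.0, 0.3), (0.1, 2.4), (0.2, 1.9), (0.3, 3.1)]
print(record_on_threshold(acceleration, 2.0))  # [(0.1, 2.4), (0.3, 3.1)]
```

Only two of the four readings are stored, which is the memory saving the threshold is meant to provide. Anything below the level is lost, so the level has to be chosen with the failure mechanism of interest in mind.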
Users can also set thresholds based on the value of the slope (positive or negative) of the curve formed by the measurements made by the sensor. This strategy allows usage based data recording, which can result in substantial savings in storage space and extend the battery life of the equipment.

The monitored loads are simplified and used as inputs into failure models to assess the damage and failure susceptibility due to a particular failure mechanism. Higher threshold levels lead to greater abbreviation of the load history. However, this can also introduce errors in the subsequent damage assessment. The threshold selection therefore depends strongly on the failure mechanism being monitored and the model used for damage assessment. For example, in [29] temperatures on an electronic board mounted on the exhaust manifold of a car were monitored in-situ. This data was processed using the ordered overall range (OOR) method to convert the irregular temperature variations into a sequence of peaks and valleys. The OOR output was used to extract the temperature cycles using the Rainflow cycle counting algorithm. The temperature cycles and a thermal fatigue model were used to assess the damage on the terminals of the components mounted on the board.

Using the OOR, a reversal elimination index S (S < 1) can be selected to filter out amplitudes smaller than the specified fraction of the largest measured amplitude. The data was screened using different values of the reversal elimination index (threshold), and the accumulated damage was estimated using each data set. For simplicity it was assumed that the error in damage accumulation is zero when cycles of all amplitudes are considered for the analysis (S = 0). The error in damage accumulation due to the use of reduced data for all other data sets was obtained by the following formula:
\[
\text{Error} = \frac{\text{Accumulated damage}\,(S = 0) - \text{Accumulated damage}\,(S > 0)}{\text{Accumulated damage}\,(S = 0)}
\]
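As a small worked instance of this formula, the sketch below computes the relative error for one reduced data set. The function name `damage_error` and the two damage values are hypothetical and chosen only to show the arithmetic; they are not results from [29].

```python
def damage_error(damage_full, damage_reduced):
    """Relative error in accumulated damage caused by data reduction.

    damage_full    : accumulated damage with all cycles kept (S = 0)
    damage_reduced : accumulated damage after screening with S > 0
    """
    if damage_full <= 0:
        raise ValueError("reference damage (S = 0) must be positive")
    return (damage_full - damage_reduced) / damage_full


# Hypothetical damage values: the reduced data set loses 1% of the damage.
error = damage_error(2.0e-3, 1.98e-3)
print(f"{100 * error:.1f}%")  # 1.0%
```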
Figure 2 illustrates the percentage error in damage accumulation and the percentage data reduction with change in S-parameter. It is observed that even when S is zero, the data reduction is 84%. This is due to filtering all data points that are in a monotonic increasing or decreasing sequence. Also small values of S, ranging from 0 to 0.1, resulted in more than 90% data reduction with only 1% error in the damage calculation. Higher values of S result in increased error in damage accumulation. This is due to elimination of reversals with large magnitude, and the dependence of the fatigue model on the temperature amplitude.
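The two effects described above (removal of points inside monotonic runs at S = 0, and elimination of small reversals for S > 0) can be illustrated with a simple hysteresis-style peak-valley filter. This is a simplified sketch in the spirit of the OOR method, not the exact OOR algorithm used in [29]; the function names and the short sample signal are assumptions made for the example.

```python
def reduce_reversals(signal, s):
    """Return the peaks and valleys of `signal`, dropping small reversals.

    A reversal is kept only if it exceeds s * (max - min) of the signal.
    With s = 0 every turning point is kept and only the points inside
    monotonic runs are removed. Simplification: before the first reversal
    is confirmed, excursions are measured from the first sample only.
    """
    if not 0.0 <= s < 1.0:
        raise ValueError("s must satisfy 0 <= s < 1")
    if len(signal) < 2:
        return list(signal)
    threshold = s * (max(signal) - min(signal))
    kept = [signal[0]]
    direction = 0            # +1 rising, -1 falling, 0 not yet known
    extreme = signal[0]      # current candidate peak or valley
    for value in signal[1:]:
        if direction == 0:
            if abs(value - extreme) > threshold:
                direction = 1 if value > extreme else -1
                extreme = value
        elif direction * (value - extreme) > 0:
            extreme = value                  # run continues, extend it
        elif direction * (extreme - value) > threshold:
            kept.append(extreme)             # reversal is large enough
            direction = -direction
            extreme = value
    if direction != 0:
        kept.append(extreme)
    return kept


def data_reduction_percent(signal, reduced):
    """Percentage of the original points that were discarded."""
    return 100.0 * (1.0 - len(reduced) / len(signal))


raw = [0, 1, 2, 1, 3, 5, 4]          # hypothetical load history
print(reduce_reversals(raw, 0.0))    # [0, 2, 1, 5, 4]
print(reduce_reversals(raw, 0.3))    # [0, 5]
```

Even at S = 0 the seven raw points shrink to five, because the points lying inside rising or falling runs carry no peak or valley information. Raising S to 0.3 removes the small 2-to-1 reversal and the final drop as well, which is the kind of loss that shows up as error in the damage estimate.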
[Figure 2: plot of data reduction (%) and error in damage accumulation (%), on a vertical scale of 0 to 100, versus reversal elimination index S from 0 to 1.]

Figure 2. Example of percentage data reduction and error in damage accumulation.

The selection of the S parameter can be made by assessing the distribution of cycles that are obtained after screening the data. Figure 3(a) compares the number of temperature cycles obtained after processing the data with S = 0.1 against S = 0. By choosing an S value of 0.1, all cycles with amplitude < 15°C are eliminated. It cannot be directly concluded from Figure 2 that all the temperature cycles with amplitude < 15°C contributed negligibly to the fatigue damage and can always be neglected. For example, in this case, the failure mechanism being monitored is creep and stress relaxation enhanced fatigue due to thermal cycling. Hence, temperature cycles with small amplitude (say < 10°C) but high mean temperatures (say > 100°C) can cause more damage than cycles with the same small amplitude but low mean temperatures (say < 30°C). Figure 3(b) investigates the cyclic mean temperatures of the 2489 cycles with amplitudes < 10°C that were eliminated from the analysis in Figure 3(a). The histogram shows that more than 60% of the thermal amplitudes have a mean temperature within 21°C-30°C, and only 6% of the cycles have a mean temperature greater than 100°C.

Figures 2 and 3 show the trade-off analysis that can guide the selection of a threshold. An a priori estimate of the threshold can be based on conducting simulation using the selected damage model, to understand what can be eliminated. As monitoring continues, the threshold can be updated based on the measured data.
The monitoring equipment should be set up by effectively dividing the memory between periodic measurements and threshold based measurements. A strategic combination of measurement intervals (for periodic measurements) and thresholds can enable the recording of a larger number of relevant measurements.

Table 3. Assessment of product environmental and usage conditions.
(Each "Environment and Usage" question is followed by its "Possible Options".)

How is the product powered?
• Always ON
• ON only for operation
• OFF at end of day (or shift)
• Auto OFF if not in use
• Switches to low power mode if not in use
• Doesn’t matter

What is the usage rate?
• Continuously operated (e.g., production machine)
• Irregular fixed number of times per day or week
• Irregular and unknown

How many different users per day?
• Completely unknown (e.g., paper copier at Kinko’s or clothes dryer in a commercial laundry)
• Only qualified users
• Doesn’t matter

Average operating period per usage
• Fixed period every usage (e.g., starter motors)
• Operating time varies within known intervals (e.g., electronics operating a bank ATM)
• Predictable but depends on loading conditions (e.g., electronics for induction motor control)
• Unpredictable operating time per usage (e.g., television, cell phone, etc.)

What are the special usage conditions, if any?
• E.g., regular battery operated products operating in low temperature environments

What does the life cycle environment comprise?
• One time transportation, handling, installation followed by operation and maintenance in controlled environment (e.g., electronics in machine tools)
• One time transportation, handling, installation but operation and maintenance in uncontrolled environment (e.g., telecommunication equipment in the field)
• Portable and uncontrolled (e.g., electronics in automobiles)

What are the typical environmental extremes (temperature, humidity, radiation, altitude etc.) the product is expected to experience? (without considering usage loads)
• Controlled room or office environment
• Atmospheric ambient in the region of use
• More severe than atmospheric ambient (e.g., equipment used in oil well drilling)

What are the special environmental conditions unique for the product?
• E.g., underwater installations, high shock loading, vacuum, etc.
[Figure 3: (a) histogram of the number of cycles versus range of ∆T (°C), in bins from 0-10 to 141-160, for S = 0 and S = 0.1; (b) histogram of the number of cycles with ∆T < 10°C versus range of Tmean (°C), in bins from 10-20 to > 140, totaling 2489 cycles.]

Figure 3. Comparison of data sets (a) histogram of filtered and non-filtered load cycles (b) histogram of mean loads associated with the filtered cycles.

2.4. Data Preprocessing for Model Input

The raw environmental and usage data from sensors is usually not in a form that is compatible with the required damage analysis and reliability prediction models. Hence for further analysis of the acquired data, it is essential to simplify the raw sensor data to a form compatible with the input requirements of the selected models. Data reduction is often the first step in preprocessing and is important for reducing both data storage space and calculation time. By using information that is most relevant to the failure models, an efficient data reduction method should: 1) permit gains in computing speed and testing time, 2) condense load histories without sacrificing important damage characteristics, 3) preserve the interaction of load parameters if any, 4) provide an estimate of the error introduced by reducing and simplifying the data. Data reduction can be achieved using a variety of tools such as filters, Fourier transforms, wavelets, Hayes method, Ordered Overall Range (OOR), etc.

Subsequently the data may have to be processed to extract the relevant load parameters (e.g., cyclic mean, amplitudes, ramp rates, hold periods, power spectral densities, etc.) for PoF model input. The data reduction may be a part of the load parameter extraction algorithms. Commonly used load parameter extraction methods include cycle counting algorithms for extracting cycles from time-load signal, and Fast Fourier transforms (FFT)
for extracting the frequency content of signals. Depending upon the application and type of signal, custom load extraction methods can be developed.
Embedding the data reduction and load parameter extraction algorithms with the sensor modules can enable a reduction in both on-board storage space and power consumption, and uninterrupted data collection over longer durations. Figure 4 shows the schematic of this concept. A time-load signal is monitored in-situ using sensors, and further processed to extract (in this case) cyclic range (∆s), cyclic mean load (Smean), rate of change of load (ds/dt), and dwell periods (tD) using embedded load extraction algorithms. Finally, the extracted load parameters can be stored in appropriately binned histograms to achieve further data reduction. After the binned data is downloaded, it can be used to estimate the distributions of the load parameters. Optimal binning of parameters can result in accurate estimation of load distributions [40] and serves as a visual indication of usage history.
Figure 4. Load parameter extraction using embedded algorithms for health assessment. [Schematic: a time-load signal, Load (s) versus Time (t), passes through embedded data reduction and load parameter extraction to yield four frequency histograms: Range (∆s), Mean load (Smean), Ramp rate (ds/dt), and Dwell time (tD).]
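The histogram storage step can be sketched in a few lines of Python. This is a hedged illustration only: it assumes fixed-width bins with hypothetical limits (0 to 20 in four bins), whereas the optimal binning discussed in [40] would choose the bins from the data; the function names and sample values are likewise invented for the example.

```python
def make_histogram(lower, upper, n_bins):
    """Fixed-width histogram; only the bin counts need to be stored."""
    return {"lower": lower, "upper": upper, "counts": [0] * n_bins}


def add_sample(hist, value):
    """Increment the bin containing `value`.

    Out-of-range values are clamped into the edge bins (an assumption;
    separate overflow counters would be another option).
    """
    n_bins = len(hist["counts"])
    width = (hist["upper"] - hist["lower"]) / n_bins
    index = int((value - hist["lower"]) // width)
    index = min(max(index, 0), n_bins - 1)
    hist["counts"][index] += 1


def relative_frequencies(hist):
    """Fraction of samples per bin, as plotted on a frequency axis."""
    total = sum(hist["counts"])
    if total == 0:
        return [0.0] * len(hist["counts"])
    return [count / total for count in hist["counts"]]


# Hypothetical cyclic ranges arriving one at a time from an embedded
# extraction step; each value is discarded after its bin is incremented.
range_hist = make_histogram(lower=0.0, upper=20.0, n_bins=4)
for cyclic_range in [1.2, 3.4, 4.9, 7.5, 12.0, 2.2]:
    add_sample(range_hist, cyclic_range)

print(range_hist["counts"])
print(relative_frequencies(range_hist))
```

Memory use is fixed by the number of bins rather than by the monitoring duration, which is the property that makes this form of storage attractive for long-term in-situ monitoring.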
3. Case Study
The methodologies presented in Section 2 for load monitoring and analysis are applied to a notebook computer, which was previously thermally characterized by Vichare et al. [38]. While a comprehensive health monitoring plan may involve multiple life cycle conditions, such as humidity, vibration, shock, radiation, and contamination, the present study focuses on temperature. Internal temperatures were dynamically monitored in-situ during all phases of the life cycle, including usage, storage, and transportation. The parameters monitored and the monitoring equipment are listed in Table 4, along with key characteristics of the monitoring plan and data preprocessing. Figure 5 shows the raw sensor data, namely the absolute temperature-time history for the heat sink and HDD, the corresponding temperature cycle amplitudes, and CPU usage. The raw absolute temperature data shown in Figure 5a were converted into a sequence of peaks and valleys using the ordered overall range method. The sequence of peaks and valleys was processed using a 3-parameter Rainflow method [4] to identify the complete and half cycles and to extract the amplitude, mean temperature, and ramp rate of each cycle. The resulting distributions of temperature and temperature cycle amplitudes on the heat sink and HDD are shown in Figure 5b and Figure 5c, respectively [38]. These data can be used with stress and damage models to assess the degradation of the product and estimate its remaining life. Along with the temperatures, the usage of the product was recorded by logging the percentage CPU usage and the operation of the CPU cooling fan. An example of CPU usage monitoring is shown in Figure 5d, with CPU usage and the corresponding heat sink temperature.
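The two processing steps described above (reduction to peaks and valleys, then cycle counting) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the algorithm of [4] or the code used in the case study: `extract_reversals` is a simplified threshold filter standing in for the ordered overall range method, `rainflow` follows the widely used three-point counting rule, only the range and mean of each cycle are returned (the ramp rate, the third parameter, would require time stamps and is omitted), and the temperature series is hypothetical.

```python
def extract_reversals(series, threshold):
    """Reduce a series to peaks and valleys, ignoring swings below `threshold`.

    Simplified stand-in for ordered overall range (assumption). The first
    sample is always kept as the starting point.
    """
    if len(series) < 2:
        return list(series)
    reversals = [series[0]]
    candidate = series[0]
    direction = 0  # +1 rising, -1 falling, 0 not yet known
    for value in series[1:]:
        if direction == 0:
            if abs(value - reversals[0]) >= threshold:
                direction = 1 if value > reversals[0] else -1
                candidate = value
        elif direction * (value - candidate) > 0:
            candidate = value  # extend the current peak or valley
        elif abs(value - candidate) >= threshold:
            reversals.append(candidate)  # confirmed turning point
            candidate = value
            direction = -direction
    if direction != 0:
        reversals.append(candidate)
    return reversals


def rainflow(reversals):
    """Three-point rainflow counting; returns (range, mean, count) tuples."""
    stack, cycles = [], []
    for point in reversals:
        stack.append(point)
        while len(stack) >= 3:
            x = abs(stack[-1] - stack[-2])
            y = abs(stack[-2] - stack[-3])
            if x < y:
                break
            if len(stack) == 3:
                # Range Y includes the starting point: count a half cycle.
                cycles.append((y, (stack[0] + stack[1]) / 2, 0.5))
                stack.pop(0)
            else:
                # Range Y is a closed loop: count a full cycle and remove it.
                cycles.append((y, (stack[-3] + stack[-2]) / 2, 1.0))
                last = stack.pop()
                stack.pop()
                stack.pop()
                stack.append(last)
    # Whatever remains is counted as half cycles.
    for a, b in zip(stack, stack[1:]):
        cycles.append((abs(b - a), (a + b) / 2, 0.5))
    return cycles


# Hypothetical heat sink temperatures (degrees C) with small fluctuations.
temps = [20, 21, 20.5, 35, 34.6, 35.2, 25, 25.3, 24.8, 40, 22]
turning_points = extract_reversals(temps, threshold=2.0)
for cycle_range, cycle_mean, count in rainflow(turning_points):
    print(f"range={cycle_range:.1f}  mean={cycle_mean:.1f}  count={count}")
```

The threshold controls the trade-off between data reduction and the loss of small cycles, so in practice it would be chosen with the damage model in mind.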
Figure 5. Monitoring and analysis of environmental and usage conditions in a notebook computer: (a) example illustration of raw sensor data, (b) distribution of absolute temperatures on heat sink and HDD, (c) distribution of temperature cycles on heat sink and HDD, (d) example of monitoring percentage CPU usage [Event A: notebook is powered on. Events B to C: numerical simulation is executed. Event D: notebook is powered off.].
In Figure 5b, the CPU heat sink temperature was found to be 13°C and 8°C lower than its maximum rating over 90% and 95% of the monitored time period, respectively. This highlights the potential conservativeness of thermal management solutions optimized for worst-case operating conditions that rarely occur. Such findings could contribute to the design of more sustainable, least-energy-consumption thermal management solutions. Also, approximately 97% of the temperature cycles experienced by either the CPU heat sink or the hard disk drive had an amplitude of less than 5°C. However, the maximum temperature cycle amplitudes measured exceeded those specified by environmental standards for computer and consumer equipment. Such a discrepancy between standardized and actual conditions provides a strong motivation for monitoring actual product application environments.
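Statements of the form "X°C below the maximum rating over 90% of the monitored time" can be derived from uniformly sampled temperature data with a percentile calculation. The Python sketch below shows one way to do this; it is an assumption-laden example, not the analysis used in the study. It assumes equally spaced samples (so sample fractions equal time fractions), uses a nearest-rank percentile, and the rating of 100°C and all data values are hypothetical.

```python
import math


def margin_at_coverage(temps_c, max_rating_c, coverage):
    """Margin below `max_rating_c` that holds for at least `coverage` of samples.

    Uses the nearest-rank percentile; assumes uniformly spaced samples.
    """
    ordered = sorted(temps_c)
    # Small tolerance guards against floating-point error in coverage * n.
    rank = max(1, math.ceil(coverage * len(ordered) - 1e-9))
    return max_rating_c - ordered[rank - 1]


def fraction_below(values, limit):
    """Fraction of values strictly below `limit`."""
    return sum(1 for v in values if v < limit) / len(values)


# Hypothetical heat sink temperatures (degrees C) and maximum rating.
heat_sink_temps = [40, 42, 45, 50, 55, 60, 62, 65, 70, 85]
MAX_RATING_C = 100.0  # hypothetical value, not taken from the study

print(margin_at_coverage(heat_sink_temps, MAX_RATING_C, 0.90))
print(margin_at_coverage(heat_sink_temps, MAX_RATING_C, 1.00))

# Hypothetical temperature cycle amplitudes (degrees C).
amplitudes = [1, 2, 3, 4, 6, 8, 2, 1, 3, 12]
print(fraction_below(amplitudes, 5))
```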
The monitored life cycle temperature data could also be used in physics-based stress and damage models to provide both damage estimation and remaining life prediction for specific failure mechanisms influenced by temperature. The measured data could also be used to determine the stress levels to be imposed in accelerated testing, refine product specifications, and set product warranties.
Table 4. Outline of monitoring setup used in case study.
Steps | Description
Parameters monitored | CPU heat sink temperature; hard disk drive (HDD) temperature; ambient temperature (in vicinity of laptop); percentage of CPU utilized; fan condition (on/off)
Monitoring equipment | Miniature RTDs for temperature sensing; Windows administrator log for CPU usage; battery-operated data logger
Monitoring plan | Temperatures monitored and recorded in-situ during power on-off, use, storage, and handling. CPU monitored during usage. Data downloaded on a daily basis.
Data preprocessing | Measured data reduced using ordered overall range, and temperature cycles extracted using Rainflow cycle counting algorithm.
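To indicate how extracted temperature cycles might feed a damage model, the following Python sketch combines an inverse power law for cycles to failure with linear (Miner's rule) damage accumulation. The choice of model is an assumption made for illustration and is not stated in this case study; the coefficient, exponent, cycle list, and monitoring duration are all hypothetical, and the remaining-life estimate assumes that future usage resembles the monitored usage.

```python
def cycles_to_failure(delta_t, coefficient=1.0e7, exponent=2.0):
    """Hypothetical inverse power law: N_f = C * (delta_T) ** (-n)."""
    return coefficient * delta_t ** (-exponent)


def accumulated_damage(cycles):
    """Miner's rule: sum of (cycles counted) / (cycles to failure)."""
    return sum(count / cycles_to_failure(delta_t)
               for delta_t, count in cycles if delta_t > 0)


def remaining_life(damage, monitored_days):
    """Days until damage reaches 1, assuming the same damage rate continues."""
    return monitored_days * (1.0 - damage) / damage


# Hypothetical (temperature range in degrees C, cycle count) pairs,
# e.g. taken from a binned rainflow histogram.
cycles = [(10.0, 100.0), (20.0, 10.0), (2.0, 1000.0)]

damage_all = accumulated_damage(cycles)
# Effect of a data reduction step that discards cycles smaller than 5 C.
damage_filtered = accumulated_damage([c for c in cycles if c[0] >= 5.0])
relative_error = (damage_all - damage_filtered) / damage_all

print(f"damage (all cycles):      {damage_all:.2e}")
print(f"damage (filtered cycles): {damage_filtered:.2e}")
print(f"error from filtering:     {relative_error:.1%}")
print(f"remaining life: {remaining_life(damage_all, 30.0):.0f} days")
```

With these invented numbers the small cycles contribute a noticeable share of the damage; with a larger exponent they would matter far less, which is why the acceptable level of data reduction depends on the failure model.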
4. Conclusion
A PHM methodology for electronic systems, based on monitoring and modeling of environmental and usage loads, was discussed. To implement this methodology in a real system, guidelines were provided for the selection of monitoring parameters and the design of a monitoring plan. An integrated hardware-software module was proposed for in-situ monitoring and processing of load data in the life cycle environment. The main attributes for sensor system selection in these applications were identified as the level of integration, power consumption and management, on-board memory and its utilization, small size and low weight, wireless communication, embedded computational power, and software support. Methods to improve the efficiency of the data collection process by reducing power consumption, making effective use of memory, and integrating embedded computational power with sensor systems were discussed. Using field data, the trade-off between data reduction and error in damage accumulation was demonstrated. The analysis showed that more than 90% data reduction resulted in less than 1% error in damage accumulation. The concept of health monitoring was illustrated for a notebook computer, focusing on thermal loads. The study revealed that the recorded temperature cycles (∆T) ranged up to 50°C, which exceeds the worst-case use conditions specified by standard IPC-SM-785 for consumer and computer products, namely 30°C and 20°C, respectively. It is envisaged that future electronic systems will integrate sensing and processing modules that will work on power supplied by the product to enable in-situ health monitoring.
References
1. Allen, D. (2003). Probabilities associated with a Built-in-Test system, focus on false alarms. Proceedings of AUTOTESTCON, IEEE Systems Readiness Technology Conference, 22-25, 643-645.
2. Akyildiz, I. F., Su, W., Sankarasubramaniam, Y. and Cayirci, E. (2002). Wireless sensor networks: a survey. Journal of Computer Networks, 38, 393-422.
3. Anderson, N. and Wilcoxon, R. (2004). Framework for prognostics of electronic systems. Proceedings of International Military and Aerospace/Avionics COTS Conference, Seattle, WA, August 3-5.
4. Anzai, H. (1991). Algorithm of Rainflow Method. The Rainflow Method in Fatigue, Butterworth-Heinemann, Oxford, 11-20.
5. Bodenhoefer, K. (2004). Environmental life cycle information management and acquisition – first experiences and results from field trials. Proceedings of Electronics Goes Green 2004+, Berlin, 541-546.
6. Carden, E. P. and Fanning, P. (2004). Vibration based condition monitoring: A review. Journal of Structural Health Monitoring, 3(4), 355-377.
7. Chang, P., Flatau, A. and Liu, S. (2003). Review paper: Health monitoring of civil infrastructure. Journal of Structural Health Monitoring, 3(3), 257-267.
8. Dasgupta, A. and Pecht, M. (1991). Material failure mechanisms and damage models. IEEE Transactions on Reliability, 40(5), 531-536.
9. Gao, R. X. and Suryavanshi, A. (2002). BIT for intelligent system design and condition monitoring. IEEE Transactions on Instrumentation and Measurement, 51(5), 1061-1067.
10. Gross, K. (2005). Continuous System Telemetry Harness. Sun Labs Open House, 2004, research.sun.com/sunlabsday/docs.2004/talks/1.03_Gross.pdf.
11. IPC-SM-785. (1992). Guidelines for accelerated reliability testing for surface mount attachments, IPC Standard, November.
12. JESD659-A. (1999). Failure-mechanism-driven reliability monitoring, September.
13. JEP 122C. (2006). Failure mechanisms and models for silicon semiconductor devices, March.
14. JEP143A. (2004). Solid-state reliability assessment and qualification methodologies, May.
15. JESD91A. (2003). Method for developing acceleration models for electronic component failure mechanisms, August.
16. Kacprzynski, G. J., Roemer, M. J., Modgil, G. and Palladino, A. (2002). Enhancement of physics of failure prognostic models with system level features. IEEE Aerospace Conference Proceedings, 6, 2925.
17. Karyagina, M. (1996). Designing for fault-tolerance in the commercial environment. Proceedings of Reliability and Maintainability Symposium, 22-25, 258-262.
18. Krok, M. and Goebel, K. (2003). Prognostics for advanced compressor health monitoring. Proceedings of SPIE, 5107.
19. Lall, P., Pecht, M. and Hakim, E. (1997). Influence of Temperature on Microelectronics and System Reliability, CRC Press, New York.
20. Lall, P., Islam, N., Rahim, K., Gale, S. and Suhling, J. (2004). Prognostics and health monitoring for electronic and MEMS packaging. Proceedings of International Military and Aerospace/Avionics COTS Conference, Seattle, WA, August 3-5.
21. Leonard, C. T. and Pecht, M. G. (1991). Improved techniques for cost effective electronics. Annual Proceedings of Reliability and Maintainability Symposium, 29-31 January, 174 – 18.
22. Lynch, J. P., Sundararajan, A., Law, K. H. and Kiremidjian, A. S. (2004). Embedding damage detection algorithms in a wireless sensing unit for operational power efficiency. Journal of Smart Materials and Structures, 13, 800-810.
23. Mishra, S. and Pecht, M. (2002). In-situ sensors for product reliability monitoring. Proceedings of SPIE, 4755, 10-19.
24. Mishra, S., Pecht, M., Smith, T., McNee, I. and Harris, R. (2002). Remaining life prediction of electronic products using life consumption monitoring approach. European Microelectronics Packaging and Interconnection Symposium, Cracow, June 16-18, 136-142.
25. Oceana Sensors. (2006). ICHM Wireless Data Acquisition and Processing Module, http://www.oceanasensor.com/Products/ichm-20-20.htm, Viewed March.
26. Pecht, M., Dube, M., Natishan, M. and Knowles, I. (2001). An evaluation of Built-In Test. IEEE Transactions on Aerospace and Electronic Systems, 37(1), 266-272.
27. Pecht, M., Das, D. and Ramakrishnan, A. (2002). The IEEE standards on reliability program and reliability prediction methods for electronic equipment. Microelectronics Reliability, 42, 1259-1266.
28. Porret, A., Melly, T., Enz, C. and Vittoz, E. A. (2000). A low power low voltage transceiver architecture suitable for wireless distributed sensor networks. IEEE International Symposium on Circuits and Systems, Geneva, 1, 56-59.
29. Ramakrishnan, A. and Pecht, M. (2003). A life consumption monitoring methodology for electronic systems. IEEE Transactions on Components and Packaging Technologies, 26(3), 625-634.
30. Ramakrishnan, A. and Pecht, M. (2004). Load characterization during transportation. Microelectronics Reliability, 44(2), 333-338.
31. Rosenthal, D. and Wadell, B. (1990). Predicting and eliminating Built-in Test false alarms. IEEE Transactions on Reliability, 39(4), 500-505.
32. SEMATECH. (2000). #00053955A-XFR: Semiconductor device reliability failure models, May.
33. SEMATECH. (1999). #99083810A-XFR: Use condition based reliability evaluation of new semiconductor technologies, August.
34. Searls, D., Dishongh, T. and Dujari, P. (2001). A strategy for enabling data driven product decisions through a comprehensive understanding of the usage environment. Proceedings of IPACK’01, Kauai, Hawaii, USA, July 8-13.
35. Shapiro, A. A., Ling, S. X., Ganesan, S., Cozy, R. S., Hunter, D. J., Schatzel, D. V., Mojarradi, M. M. and Kolawa, E. A. (2004). Electronic packaging for extended Mars surface missions. Proceedings of IEEE Aerospace Conference, 4, 2515-2527.
36. Self-Monitoring Analysis and Reporting Technology (SMART). (2005). PC Guide, http://www.pcguide.com/ref/hdd/perf/qual/featuresSMART-c.html, last accessed on August 30.
37. Tumer, I. and Bajwa, A. (1999). A survey of aircraft engine health monitoring systems. Proceedings of AIAA.
38. Vichare, N., Rodgers, P., Eveloy, V. and Pecht, M. G. (2004). In-situ temperature measurement of a notebook computer - a case study in health and usage monitoring of electronics. IEEE Transactions on Device and Materials Reliability, 4(4), 658-663.
39. Vichare, N. and Pecht, M. (2006). Prognostics and health management of electronics. IEEE Transactions on Components and Packaging Technologies, 29(1), 222-229.
40. Vichare, N., Rodgers, P. and Pecht, M. (2006). Methods for binning and density estimation of load parameters for prognostics and health management. International Journal of Performability Engineering, 2(2), 149-161.
41. Yaxley, S. J. and Knight, J. A. G. (1999). The development and integration of novel sacrificial wear sensors into the quarrying industry. Sensors and Actuators, A75(1), 24-34.
Authors’ Biographies:
Dr. Nikhil Vichare has a Ph.D. in Mechanical Engineering from the University of Maryland, College Park, an M.S. in Industrial Engineering from the State University of New York at Binghamton, and a B.S. in Production Engineering from the University of Mumbai, India. He is currently a reliability engineer at Dell Inc. His research interests include reliability, failure prognostics, and system health management. He has published several articles in journals and conference proceedings on his research.
Dr. Peter Rodgers received the Ph.D. degree in mechanical engineering from the University of Limerick, Ireland. He is currently an Associate Professor at The Petroleum Institute, United Arab Emirates, and was formerly with the CALCE Electronic Products and Systems Center, University of Maryland; Nokia Research Center, Finland; and Electronics Thermal Management, Ltd., Ireland. In these positions he has been actively involved in electronics cooling and reliability, including health monitoring of electronics. He has authored or co-authored over 50 journal and conference publications on a broad range of topics in these areas. He has been an invited lecturer, keynote speaker, panelist, and session chair at international conferences. Dr. Rodgers received the 1999 Harvey Rosten Memorial Award for his publications on the application of computational fluid dynamics analysis to electronics thermal design. He is a member of several international conference program committees, and is program co-chair for EuroSimE 2007.
Dr. Valérie Eveloy holds a Ph.D. degree in mechanical engineering from Dublin City University, Ireland, and an M.Sc. degree in physical engineering from the National Institute of Applied Science (INSA), France. She has been involved in the packaging and reliability of electronic equipment for ten years, and is currently an Assistant Professor at the Petroleum Institute in the United Arab Emirates (UAE). She was previously an Assistant Research Scientist at the CALCE Electronic Products and Systems Center at the University of Maryland, College Park, and was also with Nokia Research Center, Finland. Her activities at CALCE included electrical overstress failures in integrated circuits, lead-free electronics, and computational fluid dynamics applied to electronics cooling. She has authored or co-authored over 45 refereed journal and conference publications and two book chapters. Dr. Eveloy is a member of several international conference program committees focused on electronics reliability, and was Physics-of-Failure track Chair at the IEEE Workshop on Accelerated Stress Testing & Reliability (ASTR) 2005.
Prof. Michael Pecht has a BS in Acoustics, an MS in Electrical Engineering, and an MS and PhD in Engineering Mechanics from the University of Wisconsin at Madison. He is a Professional Engineer, an IEEE Fellow, and an ASME Fellow. He has received the 3M Research Award for electronics packaging, the IEEE Undergraduate Teaching Award, and the IMAPS William D. Ashman Memorial Achievement Award for his contributions in electronics reliability analysis. He has written eighteen books on electronic products development, use, and supply chain management. He served as chief editor of the IEEE Transactions on Reliability for eight years and on the advisory board of IEEE Spectrum. He is chief editor for Microelectronics Reliability and an associate editor for the IEEE Transactions on Components and Packaging Technology. He is the founder of CALCE (Center for Advanced Life Cycle Engineering) and the Electronic Products and Systems Consortium at the University of Maryland. He is also a Chair Professor. He has been leading a research team in the area of prognostics for the past ten years, and has now formed a new Electronics Prognostics and Health Management Consortium at the University of Maryland. He has consulted for over 50 major international electronics companies, providing expertise in strategic planning, design, test, prognostics, IP, and risk assessment of electronic products and systems.