Overview of Reliability Testing - IEEE Xplore


IEEE TRANSACTIONS ON RELIABILITY, VOL. 61, NO. 2, JUNE 2012

Overview of Reliability Testing

Elsayed A. Elsayed

Abstract—This paper presents an overview of reliability testing, reliability estimation, and prediction models. It also presents approaches for the design of test plans that yield failure data and/or degradation data within a limited test duration. The equivalence of different test plans that result in similar reliability estimates is also discussed. The use of degradation data for reliability estimates and maintenance decisions is presented.

Index Terms—Degradation modeling, equivalence of test plans, failure modes, maintenance threshold, reliability testing.

ACRONYMS

ADDT: Accelerated destructive degradation testing
ADT: Accelerated degradation testing
AFT: Accelerated failure time
ALT: Accelerated life testing
BIST: Built-in-self-test
Cdf: Cumulative distribution function
EHR: Extended hazard regression
ELHR: Extended linear hazard regression
HALT: Highly accelerated life testing
HASS: Highly accelerated stress screening
MLE: Maximum likelihood estimate
MNF: Minimum number of failures
MOS: Metal-oxide-semiconductor
MRL: Mean residual life
MTF: Mean time to failure
pdf: Probability density function
PH: Proportional hazards
PO: Proportional odds
PMRL: Proportional mean residual life
RAM: Random access memory
RGT: Reliability growth test
RDT: Reliability demonstration test
TAFT: Test, analyze, fix, and test

NOTATION

$\lambda_0(\cdot)$: An arbitrary baseline hazard rate
$\mathbf{z}$: A vector of the covariates or the applied stresses
$\boldsymbol{\beta}$: A vector of the unknown regression parameters
$g(\cdot)$: A known function
$k$: The number of the stresses
$\ln$: Natural logarithm
ML: Maximum likelihood
$n$: Total number of test units
$z_H$, $z_M$, $z_L$: High, medium, low stress levels, respectively
$z_D$: Specified design stress
$p_1$, $p_2$, $p_3$: Proportion of test units allocated to $z_L$, $z_M$, and $z_H$, respectively
$T$: Prespecified period of time over which the reliability estimate is of interest at normal operating conditions
$R(t;\mathbf{z})$: Reliability at time $t$ for a given $\mathbf{z}$
$f(t;\mathbf{z})$: pdf at time $t$ for a given $\mathbf{z}$
$F(t;\mathbf{z})$: Cdf at time $t$ for a given $\mathbf{z}$
$\Lambda(t;\mathbf{z})$: Cumulative hazard function at time $t$ for a given $\mathbf{z}$

I. INTRODUCTION

Manuscript received November 13, 2011; revised January 09, 2012; accepted January 12, 2012. Date of publication April 27, 2012; date of current version May 28, 2012. Associate Editor: L. Cui. The author is with the Department of Industrial and Systems Engineering, Rutgers University, Piscataway, NJ 08854-8018 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TR.2012.2194190

0018-9529/$31.00 © 2012 IEEE

RELIABILITY is one of the key quality characteristics of products, structures, communication networks, military equipment, and many other items. Telecommunication networks and electric power grids, whose outages can affect millions of users, are examples of complex systems whose failures have an immediate impact on many users with different needs. The introduction of new materials and sensors, the functional complexity of new products, and the higher expectations of users for more “reliable” products have prompted engineers to conduct extensive testing before the release of products (or services). Indeed, testing represents a significant portion of the total product cost, especially for military equipment. Accurate reliability metrics are obtained using data from testing at normal operating conditions. However, testing under normal operating conditions requires a very long time, especially for components and products with long expected lives. It also requires an extensive number of test units, so it is usually costly and impractical to perform reliability testing under normal conditions. Moreover, the results are useful only for operating environments similar to the test conditions, and they may not be suitable for predicting the reliability of units operating at significantly different conditions. Therefore, reliability engineers seek alternative methods to “predict” the reliability metrics using data and test conditions other than normal operating conditions. We refer to these methods as accelerated testing methods. The main objective of these methods is to induce failures or degradation of the components, units, and systems in a much shorter time, and to use the failure data and degradation observations at the accelerated conditions to estimate the reliability at normal operating conditions.

Careful reliability testing of systems, products, and components at the first stage of the product’s life cycle (the design stage) is crucial to achieving the desired reliability in the following stages. During this early stage, the design weaknesses inherent to intermediate prototypes of complex systems are eliminated via the test, analyze, fix, and test (TAFT) process. This process is generally referred to as “reliability growth.” Specifically, reliability growth is the improvement in the true but unknown initial reliability of a developmental item as a result of failure mode discovery, analysis, and effective correction. Corrective actions generally take the form of fixes, adjustments, or modifications to problems found in the hardware, software, or human error aspects of a system [1]. Likewise, field test results are used to improve the product design and, consequently, its reliability.
A comprehensive compilation of reliability growth model descriptions and characterizations, as well as discussions of related statistical methodologies for parameter estimation and confidence intervals, is given in [2]; the use of reliability growth models in ALT is described in [3].

There is a wide variety of reliability testing methodologies and objectives. They include testing to determine the potential failure mechanisms, RDT, reliability acceptance testing, reliability prediction testing using ALT, and others. In this overview, we provide a brief description of these different types of testing with emphasis on ALT, design of test plans and modeling, and analysis of degradation testing.

A. Highly Accelerated Life Testing (HALT)

The objective of this test is to determine the operational limits of components, subsystems, and systems. These are the limits beyond which failure mechanisms other than those that occur at normal operating conditions appear. Moreover, extreme stress conditions are applied to determine all potential failures (and failure modes) of the unit under test. HALT is primarily used during the design phase of a product. In a typical HALT test, the product (or component) is subject to increasing stress levels of temperature and vibration (individually and in combination) as well as rapid thermal transitions (cycles) and other specific stresses related to the actual use of the product. In electronics, for example, HALT is used to locate the causes of the malfunctions of an electronic board. These tests often consist of testing the product under temperature, vibration, and moisture. However, the effect of humidity on the product’s failure mechanism requires a long time to appear. Consequently, HALT is conducted only under two main stresses: temperature and vibration [4].

B. Reliability Growth Test (RGT)

The objective of RGT is to provide continuous improvement of the unit during its design phase. This test is conducted at normal operating conditions, and test results are analyzed to verify whether a specific reliability goal has been reached. In general, the first prototypes of a product are likely to contain design deficiencies, most of which can be discovered through a rigorous testing program. It is also unlikely that the initial design will meet the target reliability metrics. However, it is rather common that the initial design will experience iterative changes (design changes) that will lead to improvements in the reliability of a product. Once the design is released to production, the product is then tested and monitored in the field for potential design changes that will further improve its reliability [4].

C. Highly Accelerated Stress Screening (HASS)

The main objective of HASS is to conduct screening tests on regular production units to verify that actual production units continue to operate properly when subjected to the cycling of environmental variables used during the test [5]. It is also used to detect shifts (and changes) in the “quality” of a production process. It uses the same stresses as those used during the HALT test, except that the stresses are de-rated because HASS is primarily used to detect process shift and not design marginality issues.

D. Reliability Demonstration Test (RDT)

RDT is conducted to demonstrate whether the units (components, subsystems, systems) meet one or more measures (metrics) of reliability. These measures include a quantile of the population that does not fail before a defined number of cycles or test duration. They may also include a requirement that the failure rate after the test not exceed a predetermined value. Usually, RDT is conducted at the normal operating conditions of the units. There are several methods for conducting RDT, such as the “success-runs” test. This test involves the use of a sample with a predetermined size, and if no more than a specific number of failures occurs by the end of the test, then the units are deemed acceptable. Otherwise, depending on the number of failures in the test, another sample is drawn, and the test is conducted once more, or the units are deemed unacceptable [6].

E. Reliability Acceptance Test

The objective of the reliability acceptance test is similar to that of the RDT, where units are subjected to a well-designed test plan, and decisions regarding the acceptance of the unit (units, prototypes) are made accordingly. An example of a reliability test plan of a major automotive company is shown in Fig. 1. In Fig. 1, the test units are subjected to the shown temperature profile over a cycle of 480 minutes. The units are deemed acceptable when they survive 200 cycles without failure. The accepted units are expected to have a mean life equivalent to 100 000 miles of driving (based on the engineers’ experience).
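The accept/reject logic of the “success-runs” style RDT described in Section I-D can be illustrated with a short sketch. It assumes a simple binomial model (independent units, each surviving the test with the same probability), which the text does not state explicitly, and it covers only the single-sample decision, not the optional second sample. The function names `zero_failure_sample_size` and `acceptance_probability` are hypothetical and chosen only for illustration.

```python
import math


def zero_failure_sample_size(target_reliability, confidence):
    """Smallest sample size n such that n survivals out of n units
    demonstrate the target reliability at the stated confidence level
    (binomial model; an assumption of this sketch)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(target_reliability))


def acceptance_probability(n, max_failures, true_reliability):
    """Probability that a sample of n units shows at most max_failures
    failures when each unit survives the test with true_reliability."""
    q = 1.0 - true_reliability
    return sum(
        math.comb(n, k) * q**k * true_reliability ** (n - k)
        for k in range(max_failures + 1)
    )


if __name__ == "__main__":
    n = zero_failure_sample_size(0.90, 0.90)
    print("units needed when no failures are allowed:", n)
    print("P(accept | true R = 0.90):", acceptance_probability(n, 0, 0.90))
    print("P(accept | true R = 0.99):", acceptance_probability(n, 0, 0.99))
```

Under these assumptions, the acceptance probability at the target reliability stays below one minus the confidence level, while a much more reliable population is accepted far more often. Allowing a nonzero number of failures trades a larger sample for a lower risk of rejecting good units.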


Fig. 1. Temperature profile for an acceptance test.

Such acceptance tests are based on experience. However, optimum test plans can be designed for acceptance tests based on certain criteria and constraints [4].

F. Burn-In Test

The objective of the burn-in test is to screen out the less reliable components from a population. This is usually performed for a relatively short period of time at conditions slightly more severe than normal operating conditions. A typical burn-in test is performed at 80/80 (80 °C and 80% relative humidity) for 24 to 48 hours to remove defects contributing to infant mortality (the early failure time region). Almost all integrated circuit makers conduct burn-in by combining electrical stresses and temperature over time to activate temperature-dependent and voltage-dependent failure mechanisms in a short time. Some of the issues to be considered when designing a burn-in test include the duration of the test, the types and levels of the stresses, and the residual life after the burn-in test. Burn-in models for systems operating under different environmental conditions and policies for performing burn-in are presented in [7], [8]. A cost model to determine the optimal termination time of a burn-in test considering the degradation of the units under test is given in [9]; a cost model for burn-in of electronic devices is presented in [10].

G. Built-In-Self-Test (BIST)

This test differs from traditional reliability testing in that testing is accomplished through built-in hardware and software features [11]. In a typical BIST, a controller generates the test patterns (usually for electronic chips and guidance systems), controls the circuit under test, and collects and analyzes the responses. Normally, the test is initiated over a single pin, and the result of “pass” or “fail” is signaled via a second pin. There are many variations of this procedure. It is important to note that this approach is commonly used for testing units that do not operate continuously or that are stored for future use, e.g., missiles. The issues of how often the test should be conducted, and to what extent it should be applied to “assess” the reliability of the units, must be addressed, specifically when aging is considered.

H. Accelerated Life Testing (ALT) and Accelerated Degradation Testing (ADT)

In many cases, ALT might be the only viable approach to assess whether the product meets the expected long-term reliability requirements. ALT experiments can be conducted using three different approaches. The first is conducted by accelerating the “use” of the unit at normal operating conditions, as in the case of products that are used only a fraction of the time in a typical day, such as home appliances and auto tires. The second is conducted by subjecting a sample of units to stresses more severe than normal operating conditions to accelerate the failure. The third is conducted by subjecting units that exhibit some type of degradation, such as stiffness of springs, corrosion of metals, and wear out of mechanical components, to accelerated stresses. The last approach is referred to as ADT.

The reliability data obtained from the experiments are then utilized to construct a reliability model for predicting the reliability of the product under normal operating conditions through a statistical and/or physics-based inference procedure. The accuracy of the inference procedure has a profound effect on the reliability estimates and on the subsequent decisions regarding system configuration, warranties, and preventive maintenance schedules. Specifically, the reliability estimate depends on two factors: the ALT model, and the experimental design of the ALT test plans. A “good” model fits the test data appropriately and therefore yields accurate estimates at normal conditions. Likewise, an optimal design of the test plans, which determines the stress loadings (constant-stress, ramp-stress, cyclic-stress), allocation of test units, number of stress levels, optimum test durations, and other experimental variables, can improve the accuracy of the reliability estimates. Indeed, without an optimum test plan, it is likely that a sequence of expensive and time-consuming tests results in inaccurate reliability estimates. This might also cause delays in product release or the termination of the entire product, as has been observed by the author [12].
We next describe the reliability prediction models for both ALT and ADT, followed by the design of optimum test plans.

II. RELIABILITY PREDICTION MODELS

The design of ALT (or ADT) and the development of appropriate reliability prediction models that utilize the test results to estimate reliability metrics at any operating conditions are the two key components of such accelerated testing. The important assumption for relating the results at accelerated conditions to those at normal operating conditions is that the components or products operating at the normal conditions experience the same failure modes and failure mechanisms as those at the accelerated conditions. Elsayed [4] classifies the existing ALT models into three categories: statistics-based models, physics-statistics-based models, and physics-experimental-based models. In particular, the statistics-based models are generally used when the relationship between the applied stresses and the failure time of the product is difficult to determine based on physics or chemistry principles. In this case, AFT is used to determine the model parameters statistically after assuming either a linear or nonlinear life-stress relationship.

The statistics-based models can be further classified into parametric models and semi-parametric or non-parametric models. The most commonly used failure time distributions in the parametric models are the exponential, Weibull, normal, lognormal, gamma, and extreme value distributions. The underlying assumption of these models is that the failure times of the products follow the same distributions at different stress levels, and that the shape parameter is statistically independent of the stresses. In reality, when the failure process involves complex or inconsistent failure time distributions or both, the parametric models may not interpret the data satisfactorily, and the reliability prediction will be far from accurate. Consequently, semi-parametric or non-parametric models appear to be attractive and more suitable for reliability estimation due to their “distribution-free” property. We briefly review the most commonly used ALT models.

A. Accelerated Failure Time Models (AFT)

AFT models are the most widely used approaches that relate results from ALT to normal operating conditions. For many products, there are well-established acceleration models that perform satisfactorily over the desired range of stresses. For instance, for temperature accelerated testing, the Arrhenius model has gained acceptance because of its many successful applications and the general agreement of laboratory test results with long-term field performance. In an AFT model, it is assumed that for a unit under the applied stress vector $\mathbf{z}$, the log-lifetime $\ln T$ has a distribution with a location parameter $\mu(\mathbf{z})$ depending on the stress vector $\mathbf{z}$, and a constant scale parameter $\sigma$, in the form

$$\ln T = \mu(\mathbf{z}) + \sigma \varepsilon,$$

where $\varepsilon$ is a random variable whose distribution does not depend on $\mathbf{z}$. The location parameter follows some assumed life-stress relationship, e.g., $\mu(\mathbf{z}) = \eta_0 + \eta_1 x_1 + \eta_2 x_2$, where $x_1$ and $x_2$ are some known functions of stresses. The popular Inverse Power law and Arrhenius model are special cases of this simple life-stress relationship. The AFT models assume that the covariates act multiplicatively on the failure time, or linearly on the log failure time, rather than multiplicatively on the hazard rate. The hazard function in the AFT model can be written in terms of the baseline hazard function $\lambda_0(\cdot)$ as

$$\lambda(t;\mathbf{z}) = \lambda_0\!\left(t\,e^{\boldsymbol{\beta}^{T}\mathbf{z}}\right) e^{\boldsymbol{\beta}^{T}\mathbf{z}}.$$

The main assumption of the AFT models is that the time to failure is inversely proportional to the applied stresses, e.g., the time to failure at high stress is shorter than the time to failure at low stress. It also assumes that the failure time distributions are of the same type. In other words, if the failure time distribution at the higher stress is exponential, then the distribution at the low stress is also exponential. Therefore, a general Cdf for a two-parameter Weibull distribution under an applied stress vector $\mathbf{z}$ is

$$F(t;\mathbf{z}) = 1 - \exp\!\left[-\left(\frac{t}{\theta(\mathbf{z})}\right)^{\gamma}\right], \quad t \ge 0,$$

where $\gamma$ is the shape parameter, and $\theta(\mathbf{z})$ is the scale parameter as a function of applied stresses, which can be expressed as $\theta(\mathbf{z}) = \exp\!\left(\eta_0 + \sum_{i=1}^{k} \eta_i z_i\right)$, where $\eta_i$ is a coefficient of the covariate $z_i$. Of course, similar derivations can be obtained for other types of failure time distributions, as well as time dependent covariates (stresses).

B. Proportional Hazards Model (PH)

Multiple regression models can be used to predict the time to failure of a component under multiple covariates. A similar regression based model is the PH model introduced in [13]. The PH model is generally expressed as

$$\lambda(t;\mathbf{z}) = \lambda_0(t)\exp\!\left(\boldsymbol{\beta}^{T}\mathbf{z}\right),$$

where $\mathbf{z}$ is a column vector of covariates (for ALT, it is the column vector of stresses and/or their interactions that components experience). Unlike standard regression models, the PH models assume that the applied stresses act multiplicatively, rather than additively, on the hazard rate, which is a much more realistic assumption in many cases [14]–[16]. The PH model is a class of models with the property that the hazard functions of two units at two different stress levels $\mathbf{z}_1$ and $\mathbf{z}_2$ are proportional to each other. In other words, the ratio of their hazard rates does not vary with time. One advantage of the PH model is the ability to include time dependent covariates. Let $\mathbf{z}_i(t)$ be the covariate vector at time $t$ for the $i$th individual unit under study. Then the associated hazard rate function can be expressed as

$$\lambda(t;\mathbf{z}_i(t)) = \lambda_0(t)\exp\!\left(\boldsymbol{\beta}^{T}\mathbf{z}_i(t)\right),$$

where the hazard rate at time $t$ depends only on the current stress level $\mathbf{z}_i(t)$, and there is no effect caused by the previous stress history.

We note that the Weibull distribution is the only continuous distribution that fits both a PH and an accelerated failure time model. A PH model for a random failure time $T$ with a Weibull baseline hazard $\lambda_0(t) = \frac{\gamma}{\theta}\left(\frac{t}{\theta}\right)^{\gamma-1}$ is

$$\lambda(t;\mathbf{z}) = \frac{\gamma}{\theta}\left(\frac{t}{\theta}\right)^{\gamma-1}\exp\!\left(\boldsymbol{\beta}_{\mathrm{PH}}^{T}\mathbf{z}\right). \tag{1}$$

An AFT model for $T$ with the same Weibull baseline hazard is

$$\lambda(t;\mathbf{z}) = \frac{\gamma}{\theta}\left(\frac{t\,e^{\boldsymbol{\beta}_{\mathrm{AFT}}^{T}\mathbf{z}}}{\theta}\right)^{\gamma-1} e^{\boldsymbol{\beta}_{\mathrm{AFT}}^{T}\mathbf{z}} = \frac{\gamma}{\theta}\left(\frac{t}{\theta}\right)^{\gamma-1}\exp\!\left(\gamma\,\boldsymbol{\beta}_{\mathrm{AFT}}^{T}\mathbf{z}\right). \tag{2}$$

Comparing (1) and (2), we have $\boldsymbol{\beta}_{\mathrm{PH}} = \gamma\,\boldsymbol{\beta}_{\mathrm{AFT}}$.
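The statement that a Weibull baseline satisfies both the PH and the AFT assumptions can be checked numerically. The sketch below assumes a Weibull baseline hazard with shape `gamma` and scale `theta`, the PH form (baseline hazard times the exponential of the linear predictor), and the AFT form (baseline hazard evaluated at accelerated time, times the acceleration factor). All function names and parameter values are hypothetical illustrations, not values from the paper. With these forms, the two hazards coincide when each PH coefficient equals the shape parameter times the corresponding AFT coefficient.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def weibull_baseline_hazard(t, gamma, theta):
    """Weibull hazard with shape gamma and scale theta."""
    return (gamma / theta) * (t / theta) ** (gamma - 1.0)


def ph_hazard(t, z, beta_ph, gamma, theta):
    """PH model: stresses multiply the baseline hazard."""
    return weibull_baseline_hazard(t, gamma, theta) * math.exp(dot(beta_ph, z))


def aft_hazard(t, z, beta_aft, gamma, theta):
    """AFT model: stresses rescale time, then multiply the hazard."""
    accel = math.exp(dot(beta_aft, z))
    return weibull_baseline_hazard(t * accel, gamma, theta) * accel


if __name__ == "__main__":
    gamma, theta = 1.8, 1000.0   # illustrative Weibull parameters
    z = [50.0, 1.2]              # illustrative stress vector
    beta_aft = [0.02, 0.5]       # illustrative AFT coefficients
    beta_ph = [gamma * b for b in beta_aft]
    for t in (10.0, 500.0, 2000.0):
        print(t, ph_hazard(t, z, beta_ph, gamma, theta),
              aft_hazard(t, z, beta_aft, gamma, theta))
```

For any shape parameter other than one, using the same coefficients in both models gives different hazards, which is why assuming PH or AFT for non-Weibull data can lead to different results.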

C. Extended Linear Hazard Regression Model (ELHR)

The PH and AFT models have very different assumptions (failure rate proportionality or failure time proportionality, respectively). The only model that satisfies both assumptions is the Weibull model. Assuming PH or AFT for a particular data set may lead to different results. Therefore, a simultaneous treatment of the two is of practical importance, especially when the assumption regarding the PH or AFT is difficult to justify or does not hold. The EHR model, which encompasses both the PH and AFT models as special cases, is proposed in [17]. To further enhance the capability of modeling ALT, Elsayed et al. [16] propose a more generalized model, the ELHR model, by incorporating the time-varying coefficient effect into the EHR model. The ELHR model is expressed as

$$\lambda(t;\mathbf{z}) = \lambda_0\!\left(t\,e^{(\boldsymbol{\beta}_0^{T} + \boldsymbol{\beta}_1^{T} t)\mathbf{z}}\right)\exp\!\left[(\boldsymbol{\alpha}_0^{T} + \boldsymbol{\alpha}_1^{T} t)\mathbf{z}\right]. \tag{3}$$

The ELHR model encompasses all previous models, including PH, AFT, and EHR, as special cases. It incorporates the time-changing effects, proportional hazard effects, as well as time-varying coefficient effects into one model. The ELHR model outperforms the PH model and other extended models (e.g., [18]) in that it can better interpret physical failure processes, thus providing a better model fit to the corresponding failure time data. Furthermore, the ELHR model is essentially “distribution-free” and thus has a significant potential for dealing with complex failure processes. For example, by assuming the baseline hazard function to be a quadratic function $\lambda_0(t) = \gamma_0 + \gamma_1 t + \gamma_2 t^2$, the model for a single stress $z$ can be expressed as

$$\lambda(t;z) = \gamma_0 e^{(\alpha_0 + \alpha_1 t) z} + \gamma_1 t\, e^{(\phi_0 + \phi_1 t) z} + \gamma_2 t^2 e^{(\omega_0 + \omega_1 t) z}, \tag{4}$$

where $\phi_0 = \alpha_0 + \beta_0$, $\phi_1 = \alpha_1 + \beta_1$, $\omega_0 = \alpha_0 + 2\beta_0$, and $\omega_1 = \alpha_1 + 2\beta_1$. Then, the associated reliability is given by

$$R(t;z) = \exp\!\left[-\Lambda(t;z)\right] = \exp\!\left[-\int_0^t \lambda(u;z)\,du\right], \tag{5}$$

where $\Lambda(t;z)$ is the cumulative hazard rate function. One of the drawbacks of the ELHR model is the number of parameters of the model. As the number increases, it is likely that the accuracy of the estimated parameters decreases, which may result in inaccurate reliability prediction at normal operating conditions. This drawback becomes more acute when the failure time data set is small.

D. Proportional Mean Residual Life Model (PMRL)

All the models above are based on the proportionality of failure rates or of failure times. The concept of the PMRL was originally introduced in [19] by analogy with [13]. Two distributions with reliability functions $R_0(t)$ and $R_1(t)$ are said to have proportional MRL functions if $m_1(t) = \delta\, m_0(t)$ for some $\delta > 0$, where $m_1(t)$ is the MRL at accelerated conditions, $\delta$ is a constant, and $m_0(t)$ is the mean residual life at normal conditions.

In industrial reliability studies of repair and replacement strategies, the MRL function may be more relevant than the hazard function or failure time distribution function, although they are mathematically equivalent in the sense that knowing one of them, the other can be determined. However, for modeling purposes, MRL and the hazard function play somewhat different roles; the former summarizes the entire residual life distribution at time $t$, whereas the latter relates only to the risk of immediate failure at time $t$. On the other hand, it is possible for the MRL function to exist but for the failure rate function not to exist (e.g., the standard Cantor ternary function), although sometimes it is possible for the failure rate function to exist but the MRL function not to exist (e.g., for a distribution whose mean is not finite). In cases where the main concern is the risk of immediate failure, needless to say, the PH model proves to be an extremely useful model. Yet, the fact remains that it may not be the most appropriate model when one is concerned, as in replacement and repair models, with the remaining lifetime of an individual at time $t$. In this case, an ALT model based on the PMRL provides a useful alternative to the standard models for ALT, which include the AFT and PH models.

Let $T$ be a continuous random variable with reliability function $R(t)$, and finite expectation $E(T)$. The MRL is defined to be

$$m(t) = E[T - t \mid T > t] = \frac{\int_t^{\infty} R(u)\,du}{R(t)}.$$

Two distributions with reliability functions $R_0(t)$ and $R_1(t)$, and with mean residual lives at time $t$ of $m_0(t)$ and $m_1(t)$, respectively, are said to have proportional MRL functions if they are related as follows.

$$m_1(t) = \delta\, m_0(t). \tag{6}$$

Therefore,

$$R_1(t) = R_0(t)\left[\frac{\int_t^{\infty} R_0(u)\,du}{m_0(0)}\right]^{\frac{1}{\delta}-1},$$

where $m_0(0) = \int_0^{\infty} R_0(u)\,du$ is the mean life at normal conditions.
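The proportional MRL relationship can be verified numerically. The following sketch assumes a Weibull baseline with shape 2 and scale 1, chosen only because its tail integral has a closed form through the complementary error function. It builds the accelerated reliability function by multiplying the baseline reliability by the normalized baseline tail integral raised to the power (1/delta - 1), then checks by trapezoidal integration that the resulting MRL equals `delta` times the baseline MRL. The function names, the value of `delta`, and the integration limits are hypothetical choices for this illustration.

```python
import math

SQRT_PI_OVER_2 = math.sqrt(math.pi) / 2.0


def r0(t):
    """Baseline reliability: Weibull with shape 2 and scale 1 (assumed)."""
    return math.exp(-t * t)


def tail0(t):
    """Integral of r0 from t to infinity, in closed form via erfc."""
    return SQRT_PI_OVER_2 * math.erfc(t)


def mrl0(t):
    """Baseline mean residual life at time t."""
    return tail0(t) / r0(t)


def r1(t, delta):
    """Reliability whose MRL is delta times the baseline MRL."""
    return r0(t) * (tail0(t) / tail0(0.0)) ** (1.0 / delta - 1.0)


def mrl1(t, delta, upper=12.0, steps=20000):
    """MRL of r1 at time t by trapezoidal integration on [t, upper];
    the tail beyond upper is negligible for the values used here."""
    h = (upper - t) / steps
    total = 0.5 * (r1(t, delta) + r1(upper, delta))
    for i in range(1, steps):
        total += r1(t + i * h, delta)
    return total * h / r1(t, delta)


if __name__ == "__main__":
    delta = 2.0
    for t in (0.0, 0.5, 1.5):
        print(t, mrl1(t, delta), delta * mrl0(t))
```

The two printed columns should agree closely at every time point, which illustrates that the proportionality holds over the whole time axis and not only for the mean life at time zero.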

E. Proportional Odds Model (PO)

In many applications, it is often unreasonable to assume that the effects of covariates on the hazard rates remain fixed over time as in the PH model. It is observed in [20] that the ratio of the death rates, or hazard rates, of two populations under different stress levels (for example, one population of smokers and the other of non-smokers) is not constant with age or time, but follows a more complicated course, in particular converging closer to unity for older people. So the PH model is not suitable for this case. Therefore, a more realistic model in [21] is

$$\frac{F(t;\mathbf{z})}{1 - F(t;\mathbf{z})} = \exp\!\left(\boldsymbol{\beta}^{T}\mathbf{z}\right)\frac{F_0(t)}{1 - F_0(t)}, \tag{7}$$

where $F_0(t)$ is the baseline Cdf. This model is referred to as the PO model because the odds functions, which are defined as $F(t;\mathbf{z})/[1 - F(t;\mathbf{z})]$, under different stress levels are proportional to each other. The proportional odds models have been widely used as a kind of ordinal logistic regression model for categorical data analysis, as described in [22], [23]. Recently the PO model has received more attention for survival data analysis. Estimation methods for only two-sample lifetime data based on the PO model are described in [24], [25]. Estimates of the parameters in the PO model using ranks, which ignore the actual observations, result in inaccurate estimation, as given in [26], [27]. The estimation method of the PO model based on profile likelihood is presented in [28]. The likelihood is constructed from the conditional probability density. Because the number of unknown parameters in the profile likelihood is extremely large, the estimation is computationally intensive. The description of the PO model and the current inference procedures of parameter estimation are summarized as follows.

Let $T$ denote the failure time, and $C$ the right censoring time. The data, based on a sample of size $n$, consists of the triples $(t_i, I_i, \mathbf{z}_i)$, $i = 1, \ldots, n$, where $t_i$ is the failure time of the $i$th unit under study, $I_i$ is the failure indicator for the $i$th unit ($I_i = 1$ if the failure has occurred, which means $T_i \le C_i$, and $I_i = 0$ if the failure time is right-censored, or $T_i > C_i$), and $\mathbf{z}_i$ is the vector of covariates, or stresses, for the $i$th unit.

Let $\psi(t;\mathbf{z})$ be the odds function, which is the ratio of the probability of a unit’s failure to the probability of its not failing, at time $t$ for a unit with a vector of stresses $\mathbf{z}$. The basic PO model is

$$\psi(t;\mathbf{z}) = \exp\!\left(\boldsymbol{\beta}^{T}\mathbf{z}\right)\psi_0(t), \tag{8}$$

where $\boldsymbol{\beta}^{T}\mathbf{z} = \beta_1 z_1 + \cdots + \beta_k z_k$, $k$ is the number of stresses, and $\psi_0(t)$ is the baseline odds function. For two failure time samples with stress levels $\mathbf{z}_1$ and $\mathbf{z}_2$, the difference between the respective log odds functions

$$\ln \psi(t;\mathbf{z}_1) - \ln \psi(t;\mathbf{z}_2) = \boldsymbol{\beta}^{T}(\mathbf{z}_1 - \mathbf{z}_2)$$

is independent of the baseline odds function $\psi_0(t)$, and furthermore of the time $t$; thus, one odds function is constantly proportional to the other. The baseline odds function could be any monotone increasing function of time with the property $\psi_0(0) = 0$. When $\psi_0(t) = t^{\gamma}$, the PO model described in (8) becomes the log-logistic accelerated failure time model [29], which is a special case of the general PO model. The PO model is useful when the covariate (stress) effect on the hazard rate diminishes over time.

III. DESIGN OF ALT PLANS

The first step in the design of reliability test plans is to define clear objectives of the test. As discussed in Section I, there are many testing methods, each with a specific objective. The objective of most, if not all, ALT plans is to obtain accurate reliability metrics at operating conditions. The second step requires the determination of the types of stresses to be applied. The third step requires the determination of the profiles of the stress applications and the allocation of test units to specific profiles. The test plans have constraints such as the duration of the test, the budget, and the maximum number of units available. Once the test is conducted (and tests are almost never conducted exactly as planned), the test results are used for reliability prediction utilizing the models presented in Section II. Of course, there are other models, such as physics-based models or experimentally developed ones, that can be used when deemed appropriate.

A. Stress Types The stress types to be used should be related to the stresses at operating conditions. For example, if the unit will be used in areas where the difference between maximum and minimum temperatures is significant, and that the unit is susceptible to failure due to a thermal effect, then temperature becomes one of the stresses to be applied. Normally, temperature is a key stress to be used in most of the ALT plans. This approach is repeated for other stress types. We briefly summarize some if these stresses. a) Mechanical Stresses: Fatigue stress is the most commonly used accelerated test for mechanical components. Fatigue is the cause of failures of all rotating mechanical components. When the components are subject to elevated temperature, then creep testing (which combines both temperature and static or dynamic loads) should be applied. Shock and vibration testing is suitable for components or products subject to such conditions as in the case of bearings, shock absorbers, cell phones, tires, and airplane circuit boards and automobiles. Wear out is another cause of moving mechanical parts. Depending on the actual use of the unit at normal operating conditions, an accelerated test that mimics these conditions needs to be designed but with increased loads to cause significant wear out of the unit. Other types of mechanical stresses include bending and shear stresses. b) Electrical Stresses: These include power cycling, electric field, current density, and electro-migration. Electric field is one of the most common electrical stresses as it induces failures in relatively short times, and its effect is significantly higher than other types of stresses. The power generated due to the application of the electric stresses results in temperature rise, which in turn causes thermal fatigue due to power-on power-off applications. 
c) Environmental Stresses: Humidity is as critical as temperature, but its application usually requires a very long time before its effect is noticed. Corrosion is another cause of failure of most ferrous material, and is induced due to humidity and corrosive environment. Units that are subject to corrosion should then be tested using humidity and corrosive environments as a stress. Other environmental stresses include ultraviolet light which affects the strength of elastomers, sulfur dioxide which causes corrosion in circuit boards, and salt and fine particles and alpha rays which cause the failure of RAM and similar components. Likewise, high levels of ionization can cause electrons in outer orbits to be free, which results in electronic noise and signal spikes in digital circuits. Therefore, radiation is an environmental stress that should be applied to the units subject to deployment in space and other similar environments. B. Stress Applications Traditionally, ALT is conducted under constant stresses during the entire test duration. The test results are used to extrapolate the product life at normal conditions. In practice, constant-stress tests are easier to carry out but need more test units and a long time at low stress levels to yield sufficient

288

degradation or failure data. However, in many cases, the available number of test units and the test duration are limited. These cases have prompted industry to consider different types of stress applications. Besides constant stress, the most widely accepted stress application is step-stress, where the units are subjected to a stress higher than the design stress and, after an elapsed time, the stress is increased to an even higher level (but lower than stresses that could induce different failure mechanisms). This is referred to as simple step-stress, where the main decisions are the stress levels and the time at which to increase the stress from low to high. Analysis and design of step-stress test plans for different considerations, design of test plans for highly reliable components, design of test plans for small sample sizes, and test results with masked observations are given in [30]–[35]. Others, in the fatigue testing community, conduct the test in reverse order (high stress, then a step down to lower stress). Of course, the order of stress applications has an effect on the test results. When multiple stress types are applied, even a simple step-stress test is no longer simple and requires careful planning. Other stress applications, including cyclic loading, power-on power-off, ramp tests, and combinations of these, are possible.

Due to tight budgets and time constraints, there is an increasing need to determine the best stress loading to shorten the test duration and reduce the total cost while achieving an accurate reliability prediction. Most research has focused on the design of optimum test plans when the stress loadings (applications) are given. Until recently, however, the equivalency of these tests had not been investigated in the reliability engineering literature. Without an understanding of such equivalency, it is difficult, if not impossible, for a test engineer to determine the best experimental settings before conducting an actual ALT.
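The stress applications described above can be pictured as functions of test time. The following is a minimal sketch and is not taken from the paper: the function names, the stress values, and the switching time are all hypothetical, and the level check only encodes the requirement that test stresses lie above the design stress without exceeding the level at which other failure mechanisms could appear.

```python
def constant_stress(t, level):
    """Constant-stress loading: the same stress for the whole test."""
    return level


def simple_step_stress(t, low, high, switch_time):
    """Simple step-stress: `low` until `switch_time`, then `high`."""
    return low if t < switch_time else high


def ramp_stress(t, start, rate):
    """Ramp loading: stress grows linearly with test time."""
    return start + rate * t


def check_levels(levels, design_stress, max_stress):
    """Each test level must exceed the design stress and must not exceed
    the (assumed known) limit beyond which failure mechanisms change."""
    return all(design_stress < s <= max_stress for s in levels)


# Hypothetical values: temperatures as stresses, times in hours.
profile = [simple_step_stress(t, low=150.0, high=200.0, switch_time=100.0)
           for t in (0.0, 50.0, 100.0, 150.0)]
print(profile)
print(check_levels([150.0, 200.0], design_stress=50.0, max_stress=250.0))
```

Representing each loading as a plain function of time keeps the profiles interchangeable, which is convenient when different stress applications are to be compared under the same test constraints.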

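C. Design of ALT Plans

When designing an ALT, we need to address several issues: (a) selection of the stress types; (b) determination of the stress levels for each stress type selected; and (c) determination of the proportion of units to be allocated to each stress level [36]. We refer the reader to [37]–[42] for other approaches for the design of ALT plans. We consider the selection of the stress levels $z_i$ and the proportion of units $p_i$ to allocate to each $z_i$, such that the most accurate reliability estimate at normal operating conditions can be obtained. We consider two types of censoring. Type I censoring involves running each test unit for a pre-specified time: the censoring times are fixed, and the number of failures is random. Type II censoring involves simultaneously testing units until a pre-specified number fail: the censoring time is random, while the number of failures is fixed. We assume the baseline hazard function $\lambda_0(t)$ to be linear with time, $\lambda_0(t) = \gamma_0 + \gamma_1 t$. (Other baseline functions can be considered based on a baseline experiment or similarities of the unit under test with previously modeled units.)

Substituting $\lambda_0(t)$ into the PH model described above, we obtain

$$\lambda(t; z) = (\gamma_0 + \gamma_1 t)\, e^{\beta z}.$$

We obtain the corresponding cumulative hazard function, $\Lambda(t; z)$, and the variance of the hazard function as

$$\Lambda(t; z) = \left(\gamma_0 t + \frac{\gamma_1 t^2}{2}\right) e^{\beta z},$$

$$\mathrm{Var}\left[\hat{\lambda}(t; z)\right] = \nabla\lambda^{T}\, \Sigma\, \nabla\lambda, \qquad \nabla\lambda = \left[\, e^{\beta z},\; t\, e^{\beta z},\; z(\gamma_0 + \gamma_1 t)\, e^{\beta z} \,\right]^{T},$$

where $\Sigma$ is the asymptotic variance-covariance matrix of the estimates $(\hat{\gamma}_0, \hat{\gamma}_1, \hat{\beta})$.

Formulation of the Test Plan: Under the constraints of available test units, test time, and specification of the MNF at each stress level, the objective of the test plan is to optimally allocate stress levels and test units so that the asymptotic variance of the hazard rate estimate at normal conditions is minimized over a pre-specified period of time $T$ (normally the warranty length). If we consider three stress levels, then the optimal decision variables $x = (z_L, z_M, z_H, p_1, p_2, p_3)$ are obtained by solving the following optimization problem with a nonlinear objective function and both linear and nonlinear constraints [43], [44].

$$\min_{x}\; f(x) = \int_0^{T} \mathrm{Var}\left[(\hat{\gamma}_0 + \hat{\gamma}_1 t)\, e^{\hat{\beta} z_D}\right] dt$$

Subject to

$$\Sigma = F^{-1}$$
$$0 < p_i < 1, \quad i = 1, 2, 3$$
$$\sum_{i=1}^{3} p_i = 1$$
$$z_D < z_L < z_M < z_H$$
$$n\, p_i \Pr\left[t \le \tau \mid z_i\right] \ge \mathrm{MNF}, \quad i = 1, 2, 3$$

where $z_1 = z_L$, $z_2 = z_M$, $z_3 = z_H$, $n$ is the total number of test units, $\tau$ is the test (censoring) time, and $\Sigma$ is the inverse of the Fisher's information matrix $F$.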
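As a sketch of how this objective could be evaluated numerically, the code below applies the delta method to a PH hazard with a linear baseline, written here as (g0 + g1*t)*exp(beta*z), and integrates the resulting variance over the prediction horizon. The symbol names, the parameter estimates, and the covariance matrix (the inverse of the Fisher information matrix) are hypothetical placeholders; in practice the matrix depends on the decision variables of the test plan.

```python
import math


def hazard(t, z, g0, g1, beta):
    """PH hazard with a linear baseline: (g0 + g1*t) * exp(beta*z)."""
    return (g0 + g1 * t) * math.exp(beta * z)


def hazard_variance(t, z, g0, g1, beta, cov):
    """Delta-method variance of the hazard estimate at time t.

    `cov` is a 3x3 covariance matrix (nested lists) of (g0, g1, beta),
    i.e. the inverse of the Fisher information matrix.
    """
    e = math.exp(beta * z)
    # Partial derivatives of the hazard with respect to g0, g1, beta.
    grad = [e, t * e, z * (g0 + g1 * t) * e]
    return sum(grad[i] * cov[i][j] * grad[j]
               for i in range(3) for j in range(3))


def integrated_variance(z, g0, g1, beta, cov, horizon, steps=1000):
    """Trapezoidal integral of the hazard variance over [0, horizon]."""
    h = horizon / steps
    values = [hazard_variance(k * h, z, g0, g1, beta, cov)
              for k in range(steps + 1)]
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))


# Hypothetical estimates and covariance matrix, for illustration only.
cov = [[1e-10, 0.0, 0.0],
       [0.0, 1e-14, 0.0],
       [0.0, 0.0, 1e-2]]
print(integrated_variance(z=1.0, g0=1e-4, g1=1e-6, beta=0.5,
                          cov=cov, horizon=100.0))
```

A plan optimizer would call a function like `integrated_variance` inside its search loop, recomputing the covariance matrix for each candidate set of stress levels and allocations.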

Other objective functions can be formulated, which result in different test plan designs. These include the D-optimal design, which provides efficient estimates of the parameters of the distribution. It allows relatively efficient determination of all quantiles of the population, but the estimates are distribution dependent.

Example: An accelerated life test is to be conducted at three temperature levels for MOS capacitors to estimate the life distribution at a design temperature of 50 °C. The test needs to be completed in 300 hours. The total number of items to be placed under test is 200 units. To avoid introducing failure mechanisms other than those expected at the design temperature, it has been decided, through engineering judgment, that the testing temperature should not exceed 250 °C. The minimum number of failures for each of the three temperatures is specified as 25. Furthermore, the experiment should provide the most accurate reliability estimate over a 10-year period [44].

Consider three stress levels. The formulation of the objective function and the test constraints then follows that given in the previous section. The optimum plan, which optimizes the objective function and meets the constraints, consists of the three test temperatures and the corresponding allocations of units to each temperature level.
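Before any optimization is attempted, a candidate plan for this example can be screened against the stated constraints. The sketch below assumes the PH model with a linear baseline hazard, so that the probability of failure by the end of the test is 1 - exp(-cumulative hazard). The test limits come from the example; the parameter values, the candidate temperatures and allocations, the use of degrees Celsius, and the covariate choice z = 1/(absolute temperature) are hypothetical and are not the plan reported in [44].

```python
import math

N_UNITS = 200          # total units on test (from the example)
TEST_HOURS = 300.0     # available test time (from the example)
MNF = 25               # minimum number of failures per level (from the example)
MAX_TEMP = 250.0       # upper test temperature; degrees C is an assumption
DESIGN_TEMP = 50.0     # design temperature; degrees C is an assumption


def failure_probability(t, temp_c, g0, g1, beta):
    """P(failure by t) = 1 - exp(-cumulative hazard), linear-baseline PH.

    The covariate z = 1 / (temperature in kelvin) is an assumption.
    """
    z = 1.0 / (temp_c + 273.15)
    cum_hazard = (g0 * t + 0.5 * g1 * t * t) * math.exp(beta * z)
    return 1.0 - math.exp(-cum_hazard)


def plan_is_feasible(temps_c, proportions, g0, g1, beta):
    """Check the allocation, stress-level, and MNF constraints."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        return False
    if not all(0.0 < p < 1.0 for p in proportions):
        return False
    ordered = all(a < b for a, b in zip(temps_c, temps_c[1:]))
    if not ordered or temps_c[0] <= DESIGN_TEMP or temps_c[-1] > MAX_TEMP:
        return False
    expected = [N_UNITS * p * failure_probability(TEST_HOURS, temp, g0, g1, beta)
                for temp, p in zip(temps_c, proportions)]
    return all(e >= MNF for e in expected)


# Hypothetical parameter values and candidate plan, for illustration only.
params = dict(g0=5.0, g1=0.05, beta=-3800.0)
print(plan_is_feasible([150.0, 200.0, 250.0], [0.5, 0.3, 0.2], **params))
```

The minimum-number-of-failures check is the nonlinear constraint here: a low test temperature needs a larger share of the units to produce enough failures within the test time.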

D. Equivalent Accelerated Life Testing Plans

In the design of ALT plans, estimates of one or more reliability characteristics, such as the model parameters, the hazard rate, and the MTF at certain conditions, are commonly of interest. Accordingly, different optimization criteria might be considered. For instance, if estimates of the model parameters are the main concern, D-optimality, which maximizes the determinant of the Fisher information matrix, is considered an appropriate criterion. When an estimate of the time to quantile failure is of interest, then variance optimality, which minimizes the asymptotic variance of the time to quantile failure at normal operating conditions, is commonly used. Meanwhile, different methods, e.g., MLE or a Bayesian estimator, can be used for the estimation of the model parameters. However, each method has its inherent statistical properties and efficiencies. In light of this, we discuss equivalent test plans with respect to the same reliability characteristics and optimization criterion, and determine equivalent test plans using the same inference procedure. In this paper, we propose two possible definitions of equivalency as follows.

Definition 1: Two test plans are equivalent if the absolute difference of the objectives for reliability prediction is less than a specified tolerance under the same set of constraints on the number of test units, expected number of failures, or total test time.

Definition 2: Two test plans are equivalent if they achieve the same objective for reliability prediction under the same constraints on the number of test units, expected number of failures, or total test time within a margin.

According to the above definitions, equivalent test plans are not unique. Therefore, we recommend the following procedure for constructing equivalent plans [45]–[47].

The first step of the approach is to obtain an optimal baseline test plan. Because the constant-stress test is the most commonly conducted ALT in industry, and its statistical inference has been extensively investigated, we propose to use an optimal constant-stress plan as a baseline [46]. Suppose an optimal baseline test plan can be determined from the following general formulation.

$$\min_{x}\; f(x) \quad \text{s.t.} \quad lb \le x \le ub, \quad C(x) \le 0, \quad C_{eq}(x) = 0 \qquad (9)$$

where $f(x)$ is the objective function (e.g., the asymptotic variance of the mean time to failure), $x$ is its decision variable (which can be expressed as either a vector or a scalar), and $lb$ and $ub$ are the corresponding lower and upper bounds of $x$. $C(x) \le 0$ and $C_{eq}(x) = 0$ are the possible inequality and equality constraints, respectively.

The second step is to determine the equivalent test plan based on Definition 1 or 2, using (10) or (11), respectively. Formulation (10) is given as follows.

(10)

where the two objective functions are those of the baseline plan and the equivalent plan, respectively, and the decision variable is that of the equivalent test plan. The constraint may represent the total number of test units, the expected number of failures, or the test time. Note that the constrained quantity can be the censoring time under Type-I censoring or the expected number of failures under Type-II censoring, and vice versa. Similarly, based on Definition 2, the optimal equivalent test plan can be determined as

(11)

An example that demonstrates these methods, developing equivalent step-stress and ramp-stress test plans along with the baseline constant-stress test plan, is given in [45].

E. Degradation Testing

Most of the data obtained from ALT and standard reliability testing are time-to-failure measurements obtained from testing samples of units at different stresses. However, there are many situations where the units, especially at stress levels close to the normal operating conditions, may not fail but may degrade with time. For example, a component may start the test with an acceptable resistance value, but during the test the resistance reading “drifts” [40]. As the test time progresses, the resistance eventually reaches an unacceptable level that causes the unit to fail. In such cases, measurements of the degradation of the characteristics of interest (those whose failure may cause catastrophic failure of the part) are frequently taken during the test. The degradation data are then analyzed and used to predict the time to failure at normal conditions. Like failure time data, degradation data can be obtained at normal operating conditions or at accelerated conditions. We refer to the latter as ADT, which requires a reliability prediction model to relate the results of a test at accelerated conditions to normal operating conditions.

Proper identification of the degradation indicator is critical for the analysis of degradation data and for subsequent decisions such as maintenance, replacement, and warranty policies. Examples of these indicators include hardness, which is a measure of the degradation of elastomers. These materials are critical to many applications, including hoses, seals, and dampers of
various types, and their hardness increases over time to a critical level at which their ability to absorb energy is severely degraded. This condition may lead to cracks, excessive wear, and related failure modes in components [50]. Other indicators include loss of stiffness of springs, corrosion rate of beams and pipes, crack growth in rotating machinery, and more. In some cases, the degradation indicator might not be directly observed, and destruction of the unit under test is the only alternative available to assess its degradation [51]. Modeling approaches for degradation data collected at normal operating conditions and accelerated conditions are beyond the space limit of this paper. Recent advances in degradation modeling are presented in [52], [53]. There are other types of ADT where the degradation of the unit can only be assessed through the destruction of the unit. This type of test is referred to as ADDT, and is presented in [54], [55].

IV. SUMMARY

This paper presents an overview of the objectives and approaches of reliability testing. Reliability prediction and estimation using practical assumptions are also discussed. In addition, design of optimum test plans with different objective functions and constraints, as well as equivalency of test plans, are addressed.

REFERENCES

[1] J. B. Hall, P. M. Ellner, and A. Mosleh, “Reliability growth management metrics and statistical methods for discrete-use systems,” Technometrics, vol. 52, pp. 379–408, 2010.
[2] A. Fries and A. Sen, “A survey of discrete reliability growth models,” IEEE Trans. Reliability, vol. R-45, no. 4, pp. 582–604, 1996.
[3] P. E. Acevedo, D. S. Jackson, and R. W. Kotlowitz, “Reliability growth and forecasting for critical hardware through accelerated life testing,” Bell Labs Technical Journal, vol. 11, no. 3, pp. 121–135, 2006.
[4] E. A. Elsayed, Reliability Engineering. New York: John Wiley and Sons Inc., 2012.
[5] W. Lagattolla, “The next generation of environmental testing,” Evaluation Engineering, vol. 44, pp. 44–47, 2005.
[6] S. Feth, “Partially Passed Component Counting for Evaluating Reliability,” Ph.D. dissertation, Technische Universität Kaiserslautern, Germany, 2009.
[7] C. J. Hwan and F. Maxim, “Burn-in for systems operating in a shock environment,” IEEE Trans. Reliability, vol. 60, pp. 721–728, 2011.
[8] C. J. Hwan and F. Maxim, “Stochastically ordered subpopulations and optimal burn-in procedure,” IEEE Trans. Reliability, vol. 59, pp. 635–643, 2010.
[9] T. C. Chun, T. S. Tsaing, and B. Narayanaswamy, “Optimal burn-in policy for highly reliable products using gamma degradation process,” IEEE Trans. Reliability, vol. 60, pp. 234–245, 2011.
[10] Y. Tao and K. Yue, “Bayesian analysis of hazard rate, change point, and cost-optimal burn-in time for electronic devices,” IEEE Trans. Reliability, vol. 59, pp. 132–138, 2010.
[11] V. Agrawal, C. Kime, and K. Saluja, “A tutorial on built-in-self-test,” in IEEE Design and Test of Computers. Silver Spring, MD: IEEE Computer Society Press, March 1993, pp. 73–80, and pp. 69–77, June 1993.
[12] E. A. Elsayed, “Design of Reliability Test Plans: An Overview,” in Stochastic Reliability and Maintenance Modeling - Essays in Honor of Professor Shunji Osaki on his 70th Birthday, T. Dohi and T. Nakagawa, Eds. New York: Springer Publishing, 2012, to appear.
[13] D. R. Cox, “Regression models and life tables (with discussion),” Journal of the Royal Statistical Society, Series B, vol. 34, pp. 187–220, 1972.
[14] C. J. Dale, “Application of the proportional hazards model in the reliability field,” Reliability Engineering, vol. 10, pp. 1–14, 1985.

[15] E. A. Elsayed and C. K. Chan, “Estimation of thin-oxide reliability using proportional hazards models,” IEEE Trans. Reliability, vol. 39, pp. 329–335, 1990.
[16] E. A. Elsayed, H. T. Liao, and X. D. Wang, “An extended linear hazard regression model with application to time-dependent-dielectric-breakdown of thermal oxides,” IIE Trans. Quality and Reliability Engineering, vol. 38, pp. 1–12, 2006.
[17] A. Ciampi and J. Etezadi-Amoli, “A general model for testing the proportional hazards and the accelerated failure time hypotheses in the analysis of censored survival data with covariates,” Commun. Statist. Theor. Meth., vol. 14, pp. 651–667, 1985.
[18] H.-J. Shyur, E. A. Elsayed, and J. T. Luxhoj, “A general model for accelerated life testing with time-dependent covariates,” Naval Research Logistics, vol. 46, pp. 303–321, 1999.
[19] D. Oakes and T. Dasu, “A note on residual life,” Biometrika, vol. 77, pp. 409–410, 1990.
[20] W. Brass, Ed., “On the Scale of Mortality,” in Biological Aspects of Mortality, Symposia of the Society for the Study of Human Biology. London: Taylor & Francis Ltd., 1971, pp. 69–110.
[21] W. Brass, “Mortality models and their uses in demography,” Trans. of the Faculty of Actuaries, vol. 33, pp. 122–133, 1974.
[22] P. McCullagh, “Regression models for ordinal data,” Journal of the Royal Statistical Society, Series B, vol. 42, no. 2, pp. 109–142, 1980.
[23] A. Agresti, Categorical Data Analysis. New Jersey: John Wiley & Sons, 2002.
[24] D. M. Dabrowska and K. Doksum, “Estimation and testing in a two-sample generalized odds-rate model,” Journal of the American Statistical Association, vol. 83, no. 403, pp. 744–749, 1988.
[25] C. O. Wu, “Estimating the real parameter in a two-sample proportional odds model,” The Annals of Statistics, vol. 23, no. 2, pp. 376–395, 1995.
[26] A. N. Pettitt, “Inference for the linear model using a likelihood based on ranks,” Journal of the Royal Statistical Society, Series B, vol. 44, no. 2, pp. 234–243, 1982.
[27] A. N. Pettitt, “Proportional odds models for survival data and estimates using ranks,” Applied Statistics, vol. 33, no. 2, pp. 169–175, 1984.
[28] S. A. Murphy, A. J. Rossini, and A. W. Vaart, “Maximum likelihood estimation in the proportional odds model,” Journal of the American Statistical Association, vol. 92, pp. 968–976, 1997.
[29] S. Bennett, “Log-logistic regression models for survival data,” Applied Statistics, vol. 32, pp. 165–171, 1983.
[30] X. Liu and Q. W. S. Qiu, “Modeling and planning of step-stress accelerated life tests with independent competing risks,” IEEE Trans. Reliability, vol. 60, pp. 712–720, 2011.
[31] T.-H. Fan and W.-L. Wang, “Accelerated life tests for Weibull series systems with masked data,” IEEE Trans. Reliability, vol. 60, pp. 557–569, 2011.
[32] E. M. Benavides, “Reliability model for step-stress and variable-stress situations,” IEEE Trans. Reliability, vol. 60, pp. 219–233, 2011.
[33] H. Ma and W. Q. Meeker, “Strategy for planning accelerated life tests with small sample sizes,” IEEE Trans. Reliability, vol. 59, pp. 610–619, 2010.
[34] G. Yang, “Accelerated life test plans for predicting warranty cost,” IEEE Trans. Reliability, vol. 59, pp. 628–634, 2010.
[35] M. D. Turner, “A practical application of quantitative accelerated life testing in power systems engineering,” IEEE Trans. Reliability, vol. 59, pp. 91–101, 2010.
[36] E. A. Elsayed and L. Jiao, “Optimal design of proportional hazards based accelerated life testing plans,” International Journal of Materials & Product Technology, vol. 17, no. 5/6, pp. 411–424, 2002.
[37] W. Q. Meeker and L. A. Escobar, Statistical Methods for Reliability Data. New York: John Wiley & Sons, 1998.
[38] W. Q. Meeker, L. A. Escobar, and J. C. Lu, “Accelerated degradation tests: Modeling and analysis,” Technometrics, vol. 40, pp. 89–99, 1998.
[39] W. B. Nelson, Accelerated Testing: Statistical Methods, Test Plans, and Data Analysis. New York: John Wiley & Sons, 1990.
[40] W. B. Nelson, Accelerated Testing: Statistical Models, Test Plans, and Data Analyses. New York: John Wiley & Sons, 2004.
[41] W. B. Nelson, “A bibliography of accelerated test plans,” IEEE Trans. Reliability, vol. 54, pp. 194–197, 2005.
[42] W. B. Nelson, “A bibliography of accelerated test plans Part II - References,” IEEE Trans. Reliability, vol. 54, pp. 370–373, 2005.
[43] E. A. Elsayed and H. Zhang, “Design of optimum multiple-stress accelerated life testing plans based on proportional odds model,” International Journal of Product Development, vol. 7, no. 3/4, pp. 186–198, 2009.

[44] E. A. Elsayed and H. Zhang, “Design of PH-based accelerated life testing plans under multiple-stress-type,” Reliability Engineering and Systems Safety, vol. 92, pp. 286–292, 2007.
[45] Y. Zhu and E. A. Elsayed, “Design of equivalent accelerated life testing plan under different stress applications,” International Journal of Quality Technology and Quantitative Management, vol. 8, no. 4, pp. 463–478, 2011.
[46] H. T. Liao and E. A. Elsayed, “Equivalent accelerated life testing plans and application to reliability prediction,” SAE International Journal of Materials and Manufacturing, vol. 3, no. 1, pp. 71–77, 2010.
[47] H. Liao and E. A. Elsayed, “Equivalent accelerated life testing plans for log-location-scale distributions,” Naval Research Logistics, vol. 57, no. 5, pp. 472–488, 2010.
[48] Y. Zhu, “Optimal Design and Equivalency of Accelerated Life Testing Plans,” Ph.D. dissertation, Department of Industrial and Systems Engineering, Rutgers University, New Jersey, June 2010.
[49] P. A. Tobias and D. Trindade, Applied Reliability. New York: Van Nostrand Reinhold, 1986.
[50] J. W. Evans and J. Y. Evans, Development of Acceleration Factors for Testing Mechanical Equipment White Paper. Milford, OH: International TechneGroup Incorporated, ITI Headquarters, 2010.
[51] S.-L. Jeng, B.-Y. Huang, and W. Q. Meeker, “Accelerated destructive degradation tests robust to distribution misspecification,” IEEE Trans. Reliability, 2011, 10.1109/TR.2011.2161051.
[52] M. S. Nikulin, N. Limnios, N. Balakrishnan, W. Kahle, and C. Huber-Carol, Eds., Advances in Degradation Modeling. Boston, MA: Birkhauser, 2010.
[53] L. A. Escobar, F. Guerin, W. Q. Meeker, and M. S. Nikulin, Eds., “Special issue on degradation, damage, fatigue and accelerated life models in reliability testing,” Journal of Statistical Planning and Inference, vol. 139, no. 5, pp. 1575–1820, 2009.
[54] Y. Shi, L. A. Escobar, and W. Q. Meeker, “Planning accelerated destructive degradation tests,” Technometrics, vol. 51, pp. 1–13, 2009.
[55] S.-L. Jeng, B.-Y. Huang, and W. Q. Meeker, “Accelerated destructive degradation tests robust to distribution misspecification,” IEEE Trans. Reliability, vol. 60, pp. 701–711, 2011.

Elsayed A. Elsayed, photograph and biography not available at the time of publication.