IEEE TRANSACTIONS ON RELIABILITY, VOL. 60, NO. 3, SEPTEMBER 2011
Hardware Error Likelihood Induced by the Operation of Software

Bing Huang, Manuel Rodriguez, Member, IEEE, Ming Li, Member, IEEE, Joseph B. Bernstein, Senior Member, IEEE, and Carol S. Smidts, Senior Member, IEEE
Abstract—The influence of the software, and of its interaction and interdependency with the hardware, on the creation and propagation of hardware failures is usually neglected in reliability analyses of safety critical systems. The software operation is responsible for the usage of semiconductor devices over the system lifetime. This usage consists of voltage changes and current flows that steadily degrade the materials of circuit devices until the degradation becomes permanent, and the device can no longer perform its intended function. At the circuit level, these failures manifest as stuck-at values, signal delays, or circuit functional changes. These failures are permanent in nature. Due to the extreme scaling of complementary metal-oxide-semiconductor (CMOS) technology into deep submicron regimes, permanent hardware failures are a key concern, and can no longer be neglected relative to transient failures in radiation-intense applications. Our work proposes a methodology for the reliability analysis of permanent failure manifestations of hardware devices due to the usage induced by the execution of embedded software applications. The methodology is illustrated with a case study based on a safety critical application.
Index Terms—Circuit simulation, embedded systems, failure propagation, hardware-software interaction, permanent hardware failures.

ACRONYM1

ALU	Arithmetic Logic Unit
ANSI	American National Standards Institute
CMOS	Complementary Metal-Oxide-Semiconductor
CPU	Central Processing Unit
EM	Electromigration
HCI	Hot Carrier Injection
HELIOS	Hardware Error Likelihood Induced by the Operation of Software
MOSFET	Metal-Oxide-Semiconductor Field-Effect Transistor
MTTF	Mean Time To Failure
NBTI	Negative Bias Temperature Instability
RAM	Random-Access Memory
RISC	Reduced Instruction Set Computer
RTL	Register Transfer Level
SPICE	Simulation Program with Integrated Circuit Emphasis
TDDB	Time Dependent Dielectric Breakdown
VHDL	VHSIC (Very High Speed Integrated Circuit) Hardware Description Language

NOTATION

Cross section of interconnects.
Empirically determined constant.
Model prefactor, a constant determined from life testing.
Device gate oxide area of a transistor, equivalent to W (channel width) × L (channel length).
Empirically determined constants.
Weibull slope parameter.
Manuscript received August 16, 2009; revised August 23, 2010; accepted November 24, 2010. Date of publication July 22, 2011; date of current version August 31, 2011. This research was funded in part by the Space Vehicle Technology Institute under Grant NCC3-989 (jointly funded by NASA and DOD within the NASA Constellation University Institutes Project, with Claudia Meyer as the project manager), by NASA's Office of Safety and Mission Assurance through the NASA SARP program managed by the NASA IV&V facility under NASA Grant NAG511952, and by the Air Force Office of Scientific Research under Grant AFOSR FA9550-08-1-0139. Associate Editor: J. C. Lu.
B. Huang is with Everspin Technologies, Inc., Chandler, AZ 85224 USA (e-mail: [email protected]).
M. Rodriguez and C. S. Smidts are with Ohio State University, Columbus, OH 43210 USA (e-mail: [email protected]; [email protected]).
M. Li is with the NASA Goddard Space Flight Center/ManTech International Corp., Greenbelt, MD 20771 USA (e-mail: [email protected]).
J. B. Bernstein is with Bar Ilan University, Israel (e-mail: bernstj@macs.biu.ac.il).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TR.2011.2161699
Coefficient for the dispersion in hydrogen diffusion.
Total number of logic combinations of the input signals (for an n-asynchronous input signal circuit, equal to 2^n).
Set of logic combinations of the input signals for which stress impacts a segment.
Number of logic combinations of the input signals under which stress has an effect on a segment.
Number of demand transitions of any type that occurred during the time window.

1The singular and plural of an acronym are always spelled the same.
0018-9529/$26.00 © 2011 IEEE
Number of demands of a given type that occurred within the time window.
Activation energy for the EM wearout mechanism.
Apparent activation energy for HCI.
Material, oxide electric field, and failure mode dependent parameters.
Set of failure manifestations that impact the circuit.
Set of hardware wearout mechanisms that impact the circuit, leading to a failure manifestation.
Total number of demand transition types for an n-asynchronous input signal circuit.
Average current flowing through an interconnection.
Set of sub-indexes of those demand transition types for which stress impacts a segment.
Number of demand transition types for which stress impacts a segment.
Average substrate current of a transistor.
Boltzmann's constant.
Lifetime of a circuit segment under a wearout mechanism. A segment corresponds to an interconnect for EM, and to a transistor for HCI, TDDB, and NBTI.
Circuit reliability at time t.
Circuit reliability for a manifestation at time t.
Set of segments (transistors and interconnections) stressed by a wearout mechanism leading to a failure manifestation.
Duration of a logic combination of the input signals within the time window.
Durations of logic combinations 00, 01, 10, 11 of the input signals within the time window.
Absolute temperature in Kelvin (lifetime models of wearout mechanisms).
Duration of the time window in which the measurement is being performed (duty factor and reliability models).
Duration of a demand transition.
Gate-to-source voltage.
Duty factor, equivalent to the percentage of time a segment is subjected to stress during circuit operation under a software execution.
Reliability induced by a segment under stress given a duty factor.
Transistor channel width.

I. INTRODUCTION

HARDWARE failures [1] have traditionally constituted one of the leading causes of abnormal software and system behavior. The very first "bug" report described a hardware failure: on September 9, 1945, when the Mark II, the Aiken Relay Calculator, was experiencing problems, an investigation showed that there was a moth trapped between the points of Relay #70 [2]. Iyer and Velardi [3] found that 35% of software failures in MVS systems were hardware-failure related. In business, about 32% of e-mail outages were determined to be caused by server hardware failures [4]. The Titan 4A rocket exploded during its launch phase due to a wiring failure (short); investigations uncovered about 113 cases of wiring damage that could have caused similar launch failures in the 25 Titan missions flown since 1989 [5]. Cosmic-radiation-induced errors in spacecraft embedded computers and microelectronics are another important source of software and system malfunction. Space agencies, governmental organizations, and industry allocate large budgets to the research, development, testing, and integration of radiation-hardened microprocessors. Some examples are the RCA1802 (used in the Voyager, Viking, and Galileo spacecraft), the RAD6000 and RAD750 computers (used in numerous NASA spacecraft), and the SPARC V8 (developed in conjunction with the European Space Agency) [6]. Thus, the reliability of computer systems needs to be assessed considering the hardware, the software, and the hardware-software interactions and interdependencies. The software execution on hardware devices is in essence a series of 0 and 1 signal alternations at the inputs of hardware components. Such signal alternations lead to voltage changes and current flows in microelectronic devices.
Voltage and current act as electronic stresses on a device, and may lead to physical changes and degradations in circuit elements. A failure will occur when the degradation becomes permanent, and the device can no longer perform its intended function (e.g., an open in a metal wire). Device failures manifest in different ways at the circuit level, such as functionality changes, signal delays, or stuck-at signals. This work examines the influence of the software execution on the probability of creation and propagation of hardware failures caused by electronic stresses. These failures are permanent in nature. Permanent hardware failures are today a key concern, and can no longer be neglected compared to other types of hardware failures such as transient failures in radiation-intense applications including space missions, and nuclear power plants (e.g., [11, Chapter 5]). The scaling of CMOS technology into deep submicron regimes has brought about new challenges on semiconductor device lifetime reliability. Such scaling pushes device performance to the limits of technology, and gradually reduces device reliability margins. As a consequence, the expected lifetime of hardware devices is being reduced from decades to years,
TABLE I WEAROUT MECHANISMS DUE TO OPERATIONAL USAGE
increasing the probability of occurrence of permanent physical defects in today's semiconductor devices [12]. As far as we know, none of the related work has studied the reliability dynamics of the software execution with respect to the creation and propagation of permanent hardware failures. Li et al. [13] adopted a one-dimensional wearout transistor degradation model developed by Leblebici [14], and built a two-transistor degradation model. The model is used to simulate the faulty behaviors of benchmark circuits. Segura et al. [15] investigated the circuit functionality of CMOS gates with damaged gate oxide through a failure equivalent circuit model. The model consisted of a series connection of two transistors, and a resistor between the gate and the common terminal. Srinivasan et al. [16] developed an architecture-level microprocessor model to calculate processor lifetime reliability under multiple wearout mechanisms and environmental stress conditions. Other researchers have analyzed system reliability at a high level of abstraction without explicit consideration of hardware wearout mechanisms [17]–[19]. For example, Teng et al. [17] proposed a system-level modeling methodology to address hardware and software interaction failures. The authors studied how hardware failures impact software and system reliability, but the influence of software execution on hardware reliability is not addressed. On the other hand, the research described in [18], [19] assumes that the hardware and software subsystems' reliabilities are s-independent of each other. The main contributions of the methodology proposed in this paper are: (i) modeling of the effect of the software operation on the creation and propagation of hardware failures, (ii) joint consideration of wearout mechanisms of semiconductor devices, and (iii) analysis of failure manifestations (signal delays, stuck-at signals, functionality changes, etc.) at the circuit gate level (e.g., logic gates and storage elements). The proposed
methodology has been called Hardware Error Likelihood Induced by the Operation of Software (HELIOS). The paper is structured as follows. Section II introduces the physics of wearout mechanisms, the process by which failures propagate through the different hardware levels, and the resulting manifestations. Section III describes the proposed methodology for the reliability analysis of the software operation on the creation and propagation of permanent hardware failures. The methodology is divided into the analysis of failure manifestations, the development of reliability models, and the calculation of failure probability distributions. Section IV illustrates the methodology by applying it to a case study based on a safety critical application. Finally, Section V concludes the paper. II. CREATION AND PROPAGATION OF HARDWARE FAILURES DUE TO SOFTWARE OPERATION The operation of the software leads to electrical stresses in terms of voltage changes and current flows at the semiconductor level. These stresses will progressively deteriorate intrinsic defects within the transistors and interconnections of a circuit, e.g., lithography and processing deficiencies, contamination, and material limitations. This deterioration process is known as operational usage. The defects might originally be small enough so that they are non-lethal, and do not impact the system. During operation, the non-lethal defects may grow into lethal ones, e.g., a transistor being defective so that it never functions or functions incorrectly. The way that defects in transistors and interconnects deteriorate is controlled by well-known physical phenomena, e.g., oxidation, ionization, etc. These physical phenomena are referred to as wearout (or failure) mechanisms. The major wearout mechanisms that arise in a circuit due to operational usage correspond to Hot Carrier
TABLE II LIFETIME MODELS FOR HCI, EM, TDDB, AND NBTI
Injection (HCI), Electromigration (EM), Time Dependent Dielectric Breakdown (TDDB), and Negative Bias Temperature Instability (NBTI) [7], [20]. These mechanisms, and their lifetime models are briefly described in Table I, and Table II. The context for the failure propagation analysis across hardware layers examined in this paper is introduced hereafter. Hardware wearout mechanisms affect components at the physical level of semiconductor devices, that is, transistors, and their physical interconnections. Besides the physical level, Integrated Circuit (IC) designs also distinguish the Gate Level, and the Register Transfer Level (RT level or RTL). The Gate Level consists of the abstraction of transistors into a netlist of logic gates (AND, OR, NOT, etc.) and storage elements (flip-flops, latches, etc.). The RT level consists of the grouping of the logic gates and storage elements into blocks that perform basic functions. At this level, the processing carried out by the logic gates is referred to as combinational logic, and normally provides access control to the storage elements. A permanent2 failure at the physical level that becomes lethal leads to a transistor stuck-on or stuck-off, or an open or short in a metal wire. At the Gate and RT levels, these physical defects mainly manifest as stuck-at values (the logic voltage of a signal is stuck either at 0 or 1), and signal delays. Also, physical defects may lead to a functionality change of a combinational logic element, e.g., a transistor stuck-on failure in a NAND gate could change the truth table of the gate, making it behave differently than desired.
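To make the functionality-change manifestation concrete, the following sketch models, at the gate level only, a 2-input CMOS NAND in which one pull-down transistor is assumed stuck-on. The fault choice is hypothetical and drive-fight (contention) effects are deliberately ignored; the point is simply that one row of the truth table changes.

```python
# Gate-level sketch: truth-table change of a 2-input CMOS NAND when the
# nMOS transistor driven by input 'a' is assumed stuck-on (hypothetical
# fault; drive contention between pull-up and pull-down is ignored).

def nand_ok(a: int, b: int) -> int:
    """Fault-free NAND: the series pull-down conducts only when a = b = 1."""
    return 0 if (a and b) else 1

def nand_a_nmos_stuck_on(a: int, b: int) -> int:
    """With the a-side nMOS always conducting, the pull-down path conducts
    whenever b == 1, so the gate degenerates to NOT b."""
    return 0 if b else 1

if __name__ == "__main__":
    print(" a b | ok faulty")
    for a in (0, 1):
        for b in (0, 1):
            good, bad = nand_ok(a, b), nand_a_nmos_stuck_on(a, b)
            flag = "  <- changed" if good != bad else ""
            print(f" {a} {b} |  {good}    {bad}{flag}")
```

Only the row (a = 0, b = 1) differs, i.e., one quarter of the truth table changes, which is exactly the kind of "different-function" behavior described above.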
Fig. 1. HELIOS methodology—Overview.
III. METHODOLOGY

Fig. 1 presents an overview of the proposed methodology. It is divided into three steps: (i) analysis of failure manifestations, (ii) development of reliability models, and (iii) calculation of failure probability distributions. The methodology analyzes the gate-level circuit elements (i.e., logic gates and storage elements) used in the hardware devices of computer systems, e.g., CPU, RAM, etc. During the system operational phase, these circuits are impacted by the software-induced degradation caused
by hardware wearout mechanisms that are responsible for the creation of permanent failures (as explained in Section II). The probability distribution of such failures depends on the particular execution profile of the application software run by the computer system. During the first step (analysis of failure manifestations), the circuit elements are analyzed (in isolation from each other) through SPICE simulations using generic input signal stimuli, and failure equivalent circuit models. This approach allows for calculating the mean time to failure (MTTF) of transistors and interconnections, and for capturing into stress patterns the different ways these are stressed by voltage changes and current flows. The main outcome of this step consists of the failure manifestation patterns observed on the circuits' output signals, e.g., delays, functionality changes, or stuck-at values.3 During the second step (development of reliability models), a set of reliability models is built for each failure manifestation
2Note that transient failures [8]–[10], and other types of permanent failures (i.e., failures caused by mechanical stress or ion contamination, temperature and corrosion related failures, interconnection and packaging failures, electrical stress failures, manufacturing failures, etc. [7]) are induced by phenomena other than the operational usage, and thus are out of scope. As far as intermittent failures are concerned, they usually arise during the process of the formation of permanent defects, and eventually lead to permanent failures, so they are not analyzed in this paper.
3The models and tools employed have been validated by other authors. We have used the Cadence Spectre SPICE circuit simulator, which is an industry-leading tool (http://www.cadence.com/products/rf/spectre_circuit/Pages/default.aspx). It uses production-proven circuit simulation techniques, and is widely used in both industry and academia. The lifetime models and failure equivalent circuit models employed for HCI, EM, TDDB, and NBTI are also well known and extensively used, and were proposed and validated by previous authors (see [7], [21], [22], [29], [30]).
Fig. 2. HELIOS methodology’s detailed steps.
as a function of the software execution profile. The models account for the effect of hardware wearout mechanisms, circuits’ operating conditions, and software-execution induced usage.4 During the third step (calculation of failure probability distributions), the actual usage profile of the hardware devices is obtained through VHDL simulations of the computer system under the execution of an application software. This approach allows for solving the reliability models by using best-fit statistical distributions, and calculating the software-specific probability profiles of the circuits’ failure manifestations. Fig. 2 presents a detailed view of the methodology. The various steps, procedures, and elements depicted in Fig. 2 are described in Sections III-A–III-C. A. Analysis of Failure Manifestations 1) Characterization of Patterns: As indicated by Fig. 2, the main inputs of the methodology during the first step consist of lifetime models for hardware wearout mechanisms, and schematics of the circuits to be analyzed. The lifetime models provided in Table II (i.e., (1), (2), (3), and (4)) are used as first input to the methodology, although models other than those in Table II can also be used. For the second input (circuits’ 4No empirical data modeling is required that is based on techniques like data or curve fitting, regression, or other techniques used to build analytical models from empirical data (such as those employed to build the lifetime models in the microelectronics domain). The development of reliability models involves empirical data only to the extent that empirical data from an actual circuit are required to feed the proposed models.
schematics), an example circuit is used and introduced hereafter, whose schematic is provided in Fig. 4. The example circuit considered (second input to the methodology) is an AND2_1 logic gate circuit operating under HCI, EM, TDDB, and NBTI stresses. The AND2_1 gate implements the logic AND operation of two input signals (i.e., the output is high only when both inputs are high). The implementation of the circuit is based on the standard cell library vtvtlib255 developed by the Virginia Tech VLSI for Telecommunications (VTVT) group [31], [32], which employs the TSMC 0.25 μm technology [33]. The physical layout of the AND2_1 gate is shown in Fig. 3. It is composed of six transistors (M0 to M5), and five interconnections (N1 to N5). The corresponding schematic is shown in Fig. 4. For the purposes of our analysis, the input signal stimuli to a circuit are designed according to the following criteria:
1) The set of input signals includes all possible combinations of logic levels (1s and 0s), and transitions (rising and falling edges). We assume that two or more transitions in different lines do not happen at the same time, e.g., a distance between two signals can be observed in most cases by increasing the precision of the measurement.
2) The set of input signals leads to the same total duration for every combination of logic levels.

5In microelectronic circuit design, standard cell libraries are used to translate a high-level circuit description into an actual physical-level circuit layout based on the technology node of the library. The vtvtlib25 standard cell library is a set of standard logic gates and flip-flop circuits based on a 0.25 μm technology node.
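The two criteria can be captured as a short executable check. The sketch below is illustrative (the dwell time is arbitrary): the walk is an Eulerian circuit on the 2-cube, so it exercises every single-signal rising and falling edge under both levels of the other signal, and holds every logic combination for the same total duration.

```python
# Sketch of a "Standardized Inputs" sequence for a 2-asynchronous-input
# circuit (illustrative; the dwell time T is arbitrary). The walk is an
# Eulerian circuit on the 2-cube: every directed single-bit edge occurs
# exactly once, and every combination is held for equal total time.
from collections import Counter

T = 1.0  # dwell time per state, arbitrary units

# Each step changes exactly one input (a rising or falling edge).
WALK = ["00", "01", "11", "10", "00", "10", "11", "01"]

def transitions(walk):
    """Directed single-bit transitions, including wrap-around to the start."""
    return [(walk[i], walk[(i + 1) % len(walk)]) for i in range(len(walk))]

def check_standardized(walk):
    edges = transitions(walk)
    # Criterion 1: only single-signal transitions, and all 8 directed
    # edges of the 2-cube are exercised exactly once.
    assert all(sum(x != y for x, y in zip(s, t)) == 1 for s, t in edges)
    assert len(set(edges)) == 8
    # Criterion 2: equal total dwell time per logic combination.
    dwell = Counter(walk)
    assert len(set(dwell.values())) == 1
    return {combo: n * T for combo, n in dwell.items()}

if __name__ == "__main__":
    print(check_standardized(WALK))
```

Each of the four combinations is held for two dwell periods, matching the equal-duration requirement of criterion 2.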
Fig. 5. Standardized Inputs example for a 2-asynchronous input circuit.
Fig. 3. Circuit layout of an AND2_1 logic gate for the TSMC 0.25-μm technology node, consisting of six transistors (M0, M1, M2, M3, M4, and M5) and five interconnections (N1, N2, N3, N4, and N5).
Fig. 4. Schematic of the AND2_1 logic gate.
We refer to a set of input signals matching these criteria as Standardized Inputs. Fig. 5 provides an example of Standardized Inputs for a 2-asynchronous input circuit. The first criterion is fulfilled because all combinations of logic levels (i.e., "00," "01," "10," "11") and transitions (i.e., rising and falling edges in one signal during high and low levels of the other signal) are present. The second criterion is also fulfilled because the total duration of each logic combination is the same (i.e., the total durations of combinations "00," "01," "10," and "11" are equal). Criteria 1 and 2 above are chosen because they allow for the "capturing" of all the different occurrence patterns of wearout mechanisms, including their relative time intervals with respect to each input combination. Indeed, transistors and interconnects suffer from electrical stresses only for specific logic combinations and transitions of the input signals. We refer to the resulting diagrams as stress patterns. Fig. 6 shows some stress
Fig. 6. Stress patterns examples for the AND2_1 gate. (a) HCI effect in transistor M0 (instant 1.2 s). (b) EM effect in interconnection N2. (c) TDDB and NBTI effect in transistor M5.
patterns of the AND2_1 logic gate obtained with SPICE (Simulation Program with Integrated Circuit Emphasis). The operating parameters of a lifetime model are determined using a SPICE-like simulator, such as the Cadence Spectre SPICE circuit simulator [28]. SPICE is a general-purpose circuit simulation program for nonlinear dc, nonlinear transient, and linear ac analyses. A SPICE circuit simulation closely reproduces the results of the actual circuit operation.

As shown in Fig. 6(a), a transistor suffers from HCI stress during transition periods in which both the gate voltage and the drain voltage are high, and there is current (the substrate current parameter of (1)) flowing through the channel. For transistor M0, such conditions appear for one combination6 of the input signals (i.e., instant 1.2 s in Fig. 6(a)). Fig. 6(b) shows an EM stress pattern. A metal wire suffers from EM when there is a peak of electric current flowing through it (the current parameter of (2)). For interconnection N2, this occurs for two input combinations (i.e., instants 0.25 s and 1.45 s in Fig. 6(b)). As depicted in Fig. 6(c), TDDB and NBTI affect the gate dielectrics of the transistors when the gate-to-source voltage (see (3) and (4)) equals the power supply voltage, either in dynamic or static state operation. For transistor M5, such a condition arises for input combination "11" (i.e., time intervals [0.25, 0.5] and [1.2, 1.45] in Fig. 6(c)).

Using the above results, the mean time to failure (MTTF) of a transistor or an interconnection (referred to in the paper as circuit segments) can be calculated. For example, using (1), the MTTF of transistor M0 under HCI stress can be computed; the result provides the mean time to failure in hours of transistor M0 due to the Hot Carrier Injection (HCI) stress when the circuit operates under universal input stimuli (Standardized Inputs, see Fig. 5). The values of the parameters of (1), such as the transistor's average substrate current, have been derived through SPICE simulation. The parameters of (1) are calculated as follows.
1) The Boltzmann constant is equal to 8.617 × 10^-5 eV/K.
2) The typical transistor channel width of an nMOSFET is used.
3) The typical absolute temperature in Kelvin is 300 K.
4) The apparent activation energy for HCI is equivalent to 0.15 eV.
5) The technology-related constant is equal to 1.5.
6) The model prefactor is tied to a specific semiconductor manufacturing process, and could vary between different semiconductor foundries; a value determined from life testing is used in the calculation.
7) The average substrate current of the transistor is determined by SPICE simulation.

2) Identification of Failure Manifestations: Once the MTTF is calculated for every transistor and interconnection of a circuit element, a selection is made based on the elimination of those segments that are lightly stressed or not stressed by wearout mechanisms, so that their impact on the global failure probability of the circuit can be neglected. For instance, this is the case for pMOSFET transistors under HCI stress. Due to a higher mobility and a lower energy barrier, hot electrons can more easily be injected into the oxide than hot holes. This means that nMOSFET transistors are much more prone to the HCI effect than pMOSFET transistors, and the latter can be ignored in reliability estimates [34]. To analyze the effect of different wearout mechanisms on circuit functionality, the methodology includes the notion of a failure equivalent circuit model.

6Refer to Fig. 5 for the notation employed.
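The parameter plug-in described above can be sketched numerically. The model form below, MTTF = A0 · (I_sub/W)^(-m) · exp(E_a/kT), is a common HCI lifetime expression from the literature and may differ from the paper's exact (1); the prefactor, channel width, and substrate current values are placeholders (the paper derives its actual values from the process data and SPICE simulation).

```python
import math

# Sketch of the HCI MTTF calculation. The model form and the prefactor,
# width, and substrate-current values are illustrative assumptions.
K_B = 8.617e-5     # Boltzmann constant, eV/K
T_K = 300.0        # absolute temperature, K
E_A = 0.15         # apparent activation energy for HCI, eV
M = 1.5            # technology-related constant
A0 = 1.0e-3        # model prefactor (placeholder, process dependent)
W = 0.36e-6        # nMOSFET channel width, m (placeholder)
I_SUB = 1.0e-9     # average substrate current from SPICE, A (placeholder)

def mttf_hci(i_sub, w, a0=A0, m=M, e_a=E_A, t=T_K):
    """MTTF under HCI for a common lifetime-model form:
    MTTF = A0 * (I_sub / W)**(-m) * exp(E_a / (k * T))."""
    return a0 * (i_sub / w) ** (-m) * math.exp(e_a / (K_B * t))

if __name__ == "__main__":
    print(f"MTTF(HCI) = {mttf_hci(I_SUB, W):.3e} (placeholder inputs)")
```

As expected from the negative power on the normalized substrate current, a smaller average substrate current yields a longer predicted lifetime.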
Fig. 7. NBTI failure equivalent circuit model proposed in [36].
The underlying idea of such a model is to represent device degradation by using additional lumped components (resistors, dependent current sources, etc.) to capture the behavior of a damaged device in a circuit's operating environment. As far as HCI, EM, TDDB, and NBTI are concerned, several failure equivalent circuit models proposed by other researchers have been extensively applied, tested, and validated in practice. For instance, an HCI model named the Hot Carrier Induced Series Resistance Enhancement Model (HISREM) was proposed by Hwang [35], and improved in [36]. A model based on a simple resistor is usually employed in EM lifetime tests [37]. Also, the Maryland Circuit Reliability Oriented (MaCRO) model [36] includes TDDB and NBTI failure equivalent circuit models. As an example, Fig. 7 shows the NBTI model proposed in [36]. In Fig. 7, the NBTI-induced pMOSFET threshold voltage increase is modeled as a decrease of the absolute gate-to-source voltage. Gate tunneling current flowing through the gate resistance leads to an increase of the voltage at the internal node. This corresponds to the decrease of the pMOSFET absolute gate-to-source voltage, and therefore mimics the threshold voltage degradation effect. The gate tunneling current is modeled with two voltage-controlled current sources that follow the form of a power-law formula. Their inclusion inherently accounts for oxide breakdown effects, and also supplies leakage currents whose voltage drop is equivalent to the pMOSFET threshold voltage degradation. The failure of a circuit segment due to a wearout mechanism can lead to a functional error at the circuit level. Wearout mechanisms are usually considered s-independent in reliability analyses [38]. Indeed, the failure probability of a transistor or an interconnection due to the joint effect of various wearout mechanisms occurring at the same time is very low.
This is partly due to the fact that wearout mechanisms are induced by distinct physical processes, and act on different parts of a circuit component. Accordingly, the failure equivalent circuit models are used to replace one segment at a time in our analysis. The objective is
Fig. 8. Failure manifestation patterns of the AND2_1 logic gate due to HCI, EM, TDDB, and NBTI stresses in its circuit segments. (a) Correct behavior [(a) input 1; (b) input 2; (c) output]. (b) Failure manifestations observed in the output (c).
to analyze circuit-level failure manifestations through SPICE simulation. Therefore, if a circuit contains transistors and interconnections, a set of mutated versions of the same circuit is produced to cover the wearout mechanisms HCI, EM, TDDB, and NBTI. A "mutant" is thus a circuit in which one segment considered to be faulty is replaced by a failure equivalent circuit model. The next step consists of running one independent SPICE simulation per mutated circuit. The objective is to determine whether the functional behavior of the circuit is impacted by a faulty segment, and whether it leads to a particular failure manifestation. To do so, the waveform of the circuit's output signal is logged and analyzed after every simulation. These outputs are referred to as failure manifestation patterns.

The results obtained during this process for the AND2_1 gate under HCI, EM, TDDB, and NBTI are provided in Fig. 8. Fig. 8(a) displays the inputs and nominal output of the AND2_1 gate, while Fig. 8(b) provides the observed failure manifestation patterns. For the HCI stress, all the observed failure manifestations consist of output delays. The operation of the circuit under one faulty transistor leads to a delay of every rising transition of the output signal; this failure manifestation is referred to as Delay-rise, and a similar behavior is observed for a second faulty transistor. On the other hand, falling output transitions are delayed when a third transistor is faulty (Delay-fall). For the EM stress, the observed circuit failure manifestations also correspond to different types of output delays: a delay of the falling output transitions triggered by a given input combination for one interconnection, and Delay-rise or Delay-fall for the remaining interconnections. For the TDDB stress, the observed failure manifestations are stuck-at-0 for some transistors, stuck-at-1 for another, and different-function for the rest; a similar behavior is observed for the NBTI stress.

TABLE III
NOTATION OF THE FAILURE MANIFESTATIONS

The proposed methodology allows for obtaining very detailed information about the failure manifestations of a circuit, e.g., delayed rising or falling transitions, input combinations leading to a delay or a different function, the new logic function of a circuit after a failure, etc. These characteristics can be captured by the detailed manifestation notation proposed in Table III, which has been used to label the failure manifestations in Fig. 8(b). Table III also proposes a more compact notation referred to as the alternate manifestation notation. In this notation, delays are characterized by the percentage of deferred pulses of the output signal, while the different-function manifestation is characterized by the percentage of changes in the truth table of the circuit. Using this notation, the failure manifestations of Fig. 8(b) would be classified under labels such as Delay-0.5, Delay-1, DiffFunc-0.25, DiffFunc-0.13, and DiffFunc-0.88. Although other types of alternate notations might be constructed, the proposed notation is meant to easily allow hardware and software testers to reuse and model failure types, rates, and probabilities, e.g., via VHDL-based fault injection [8].
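The alternate manifestation notation can be derived mechanically from the logged waveforms. A minimal sketch (the pulse and truth-table counts used below are hypothetical, not values from the case study):

```python
# Sketch of deriving the alternate manifestation notation: delays are
# labeled by the fraction of deferred output pulses, different-function
# manifestations by the fraction of changed truth-table rows.
# The counts passed in are hypothetical examples.

def delay_label(deferred_pulses: int, total_pulses: int) -> str:
    """E.g., 2 of 4 output pulses deferred -> 'Delay-0.5'."""
    return f"Delay-{deferred_pulses / total_pulses:g}"

def difffunc_label(changed_rows: int, table_rows: int) -> str:
    """E.g., 1 of 4 truth-table rows changed -> 'DiffFunc-0.25'."""
    return f"DiffFunc-{changed_rows / table_rows:g}"

if __name__ == "__main__":
    print(delay_label(2, 4))      # half of the output pulses deferred
    print(difffunc_label(1, 4))   # one of four truth-table rows changed
```

Such labels can then be reused directly as fault types in, e.g., VHDL-based fault injection campaigns, as suggested above.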
B. Development of Reliability Models
The lifetime of an entire circuit results from a combination of the effects of the different wear-out mechanisms across different circuit segments. This fact requires information on the time-dependent lifetime distribution of each wearout mechanism. In a complex integrated circuit, the whole system will be extremely prone to failure if any segment fails. We can thus approximate a complex integrated circuit using a competing failure mode model [39] to determine a system's reliability from its individual wearout mechanisms. This elevates the reliability from the transistor and interconnection levels to the circuit level. The reliability of a circuit element (e.g., logic gate, flip-flop, etc.) can thus be related to the lifetime of its segments. This relation is given by (5) (circuit reliability) as a function of the failure manifestations (e.g., stuck-at-0, stuck-at-1, delay, different function, etc.) that impact the circuit:
$R_{circuit}(t) = \prod_{m \in M} R_m(t)$   (5)
where $M$ is the set of failure manifestations that impact the circuit. Equation (6) (circuit manifestation reliability) provides the circuit reliability $R_m(t)$ for a failure manifestation $m$. This is a function of the reliability $R_{e,s}(t)$ induced by a segment $e$ under stress $s$:
$R_m(t) = \prod_{(e,s)} R_{e,s}(t)$   (6)
In (5), the dynamic stress conditions in a circuit operating environment are taken into account by calculating the duty factors [40] of the electronic stresses. A duty factor $\alpha_{e,s}$ corresponds to the percentage of time during which a certain type of stress $s$ affects a certain transistor or interconnection $e$. The duty factors are used to calculate the reliability of the circuit: the expression of the reliability has been factorized in terms of the duty factor in (6), which in turn is used to calculate the reliability of the entire circuit through (5). As an example, according to the information provided in Fig. 8, expression (6) for the AND2_1 gate would correspond, for the Delay-rise manifestation, to the product of the reliabilities of the transistors stressed by HCI and the interconnections stressed by EM; on the other hand, for a stuck-at manifestation it would be equivalent to the product of the reliabilities of the transistors stressed by TDDB or NBTI.
The duty factors should be modeled as a function of the software execution, because different software executions will activate hardware devices in different ways, and lead to different stress-time percentages for transistors and interconnections. We thus introduce the concept of the Hardware Demand Model to account for the activation of hardware devices due to the software execution (Fig. 9). The execution of software on hardware devices (which normally include the CPU, memory, buses, etc.) is, in essence, a series of 0 and 1 signal alterations in the hardware units. This model assumes that software accesses to hardware devices are divided into a series of demand and idle intervals, as depicted in Fig. 9(a). Thus, with respect to a specific unit, the software execution constitutes a series of demand (being-demanded) and idle (not-being-demanded) intervals. As an example, Fig. 9(b) shows the Hardware Demand Model of a 2-asynchronous-input circuit under the Standardized Inputs. As shown in Fig. 9(b), a demand interval is actually triggered by a logic change of any of the input signals of the circuit (which is due to the software execution). We assume that the duration of a demand interval is equivalent to the duration of a signal's transition period (because, in practice, electronic stresses occur along the whole
In general, a duty factor $\alpha_{e,s}(t)$ can be described by
$\alpha_{e,s} = \frac{t_d \sum_{d \in D_{e,s}} N_d}{T}$  (current-based stress)
$\alpha_{e,s} = \frac{\sum_{c \in C_{e,s}} t_c}{T}$  (voltage-based stress)   (7)
where $D_{e,s}$ is the set of demand transition types under which segment $e$ suffers stress $s$, $C_{e,s}$ is the set of input logic combinations under which it does, $N_d$ is the number of demand transitions of type $d$ observed within the measurement window $T$, $t_d$ is the duration of a demand transition, and $t_c$ is the total duration of logic combination $c$ within $T$.
Fig. 9. Hardware Demand Model. (a) Hardware Demand Model during software execution. (b) Demand (D) and idle (I) intervals of a 2-asynchronous input circuit.
duration of a signal transition, as shown in Section III-A-1). Such a duration is symbolized by $t_d$.
The duty factors $\alpha_{e,s}$ can be calculated using the Hardware Demand Model described above in combination with the Standardized Inputs, the Stress Patterns, and the Failure Manifestation Patterns. The way in which these elements are combined is illustrated in Fig. 10 for the AND2_1 gate. As shown in Fig. 10, the different current peaks of the stress patterns for the HCI and EM effects can be matched to a particular demand transition of the Hardware Demand Model. For instance, a transistor is stressed by HCI during a given demand type $d$, and such a stress leads to a Delay-rise failure manifestation. The corresponding coefficient can then be calculated as $N_d t_d / T$, where $T$ is the duration of the time window within which the measurement is being performed, $t_d$ is the duration of a demand transition, and $N_d$ is the number of demands of type $d$ that occurred within time window $T$; e.g., in Fig. 10, $N_d$ is equal to 1 for all $d$. The interpretation of the duty factor is that the transistor is stressed by Hot Carrier Injection (HCI) during a certain percentage of time, which is given by the expression $N_d t_d / T$. Fig. 10, which uses the standardized inputs (i.e., the combination of all possible signal transitions and value levels for the inputs to the circuit, see Fig. 5), shows that HCI affects the transistor only during input signal transitions of a given type (e.g., rise of one input from 0 to 1 while the other input is at 1). The duration of a transition is symbolized by $t_d$. Therefore, given any interval of time $T$ (where $T$ is the mission time under consideration), the percentage of stress occurring is equal to the duration of all transitions of type $d$ within the interval (calculated as $N_d t_d$, where $N_d$ is the number of transitions of type $d$ observed over the mission time) divided by the duration of the interval (represented by time $T$).
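The current-based duty-factor calculation can be sketched as follows. This is an illustrative reading under stated assumptions: demand types are keyed by (previous, current) input-vector pairs, the names N_d, t_d, T follow the text, and the trace and counts are made up, not the paper's tooling:

```python
# Hedged sketch of a current-based (HCI/EM) duty factor: the fraction of
# the measurement window T during which stressing demand transitions occur.

def count_demand_types(trace):
    """Count demand transitions per (previous, current) input-vector pair."""
    counts = {}
    for prev, curr in zip(trace, trace[1:]):
        if prev != curr:  # any input change triggers a demand interval
            counts[(prev, curr)] = counts.get((prev, curr), 0) + 1
    return counts

def duty_factor(counts, stressing_types, t_d, T):
    """alpha = t_d * (sum of N_d over the stressing demand types) / T."""
    return t_d * sum(counts.get(d, 0) for d in stressing_types) / T

trace = [(0, 1), (1, 1), (0, 1), (1, 1)]      # sampled input vectors
counts = count_demand_types(trace)
# Suppose HCI stresses a transistor only during the (0,1)->(1,1) rise:
alpha = duty_factor(counts, [((0, 1), (1, 1))], t_d=1e-9, T=1e-6)
print(round(alpha, 6))  # 0.002
```

Two stressing transitions of 1 ns each within a 1 us window give a duty factor of 0.2%.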
On the other hand, the different voltage pulses of the stress patterns for the TDDB and NBTI effects can be matched to a particular logic combination of the input signals (not to a demand transition). For instance, a transistor is stressed by TDDB for logic combination “11” of the input signals, and such a stress leads to a stuck-at-0 failure manifestation. The corresponding coefficient would then be $t_{11}/T$, where $t_{11}$ is the total duration of logic combination “11” within the window; e.g., in Fig. 10, this duration is the same for all the affected transistors.
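The companion voltage-based coefficient can be sketched the same way; the sampled trace, uniform sampling period, and function names below are illustrative assumptions:

```python
# Hedged sketch of a voltage-based (TDDB/NBTI) duty factor: the stress is
# tied to the time spent in a given input logic combination, so the
# coefficient is t_c / T (duration of the combination over the window).

def time_in_combination(trace, combo, sample_period):
    """Total time the input vector equals `combo`, given uniform sampling."""
    return sum(sample_period for v in trace if v == combo)

trace = [(1, 1), (1, 1), (0, 1), (1, 1), (0, 0)]
T = len(trace) * 1e-6                      # 5 us window, 1 us per sample
t_11 = time_in_combination(trace, (1, 1), 1e-6)
print(round(t_11 / T, 3))  # 0.6 (three of the five samples are "11")
```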
In (7), the terms $D_{e,s}$ and $C_{e,s}$ can be better understood with simple examples. Based on the information provided in Fig. 10, a term such as $D_{e,s}$ would correspond to the set of indexes {2,3,6,7}, and a term like $C_{e,s}$ to the set of logic combinations {00, 01, 10}. Parameters $N_d$ and $t_c$ are software dependent, and allow for obtaining the software-induced hardware usage profile (i.e., the last step of the methodology in Fig. 2). In other words, the execution of different software applications will lead to different values for $N_d$ and $t_c$, and so to different usage profiles.
The measurement of $N_d$ and $t_c$ might be time consuming in certain cases. For example, a 4-asynchronous-input circuit (e.g., a NAND4_1 logic gate) requires that the occurrences of 64 demand transition types and 16 input combination types be identified, monitored, and collected in practice. Accordingly, the methodology provides an additional, alternate version of the duty factor model:
$\alpha_{e,s} = \frac{|D_{e,s}|}{Z} \cdot \frac{N t_d}{T}$  (current-based stress)
$\alpha_{e,s} = \frac{|C_{e,s}|}{Z_c}$  (voltage-based stress)   (8)
where $Z$ is the total number of demand transition types, $Z_c$ is the number of input logic combinations, and $N$ is the total number of demand transitions of any type observed within $T$.
In (8), the variable $Z$ and the terms $|D_{e,s}|/Z$ and $|C_{e,s}|/Z_c$ can be better understood with simple examples. Based on the information provided in Fig. 10, $Z$ would be equal to 8, a term such as $|D_{e,s}|/Z$ would be equal to $4/8$, and a term like $|C_{e,s}|/Z_c$ to $3/4$. In this model, only parameter $N$ (representing the total number of demand transitions of any type) is software dependent, and its practical measurement is more straightforward. We assume a uniform distribution of demand transition types and input logic combinations over time (which is the tendency for sufficiently large time periods). Using (6) and (8), the circuit manifestation reliability for Delay-rise and the corresponding duty-factor expression can then be obtained for the AND2_1 gate.
The term $R_{Delay\text{-}rise}(t)$ corresponds to the probability that the rising transitions of the output signal of the circuit will not be delayed. The Delay-rise failure is provoked either by HCI stress impacting certain transistors, or by EM stress impacting certain interconnections, as shown by Fig. 10. Thus, the reliability of the circuit under Delay-rise is equal to the product of the reliabilities of these transistors and interconnections under
Fig. 10. Calculation of duty factors for the AND2_1 logic gate.
such stresses, i.e., $R_{Delay\text{-}rise}(t) = \prod_{e} R_{e,HCI}(t) \prod_{e'} R_{e',EM}(t)$. This assumes failure independence (see Section III-A-2). The reliability models for the other failure manifestations of the AND2_1 gate are similarly calculated. The resulting models are provided in Table IV (for the sake of conciseness, a compact notation is used for recurring terms).
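The competing-failure-mode combination of (5)-(6) reduces, under failure independence, to plain products of reliabilities. A minimal sketch; the per-segment reliabilities below are illustrative placeholders, not case-study values:

```python
# Competing-failure-mode combination: the reliability under one
# manifestation is the product over the segments whose stresses provoke it,
# and the circuit reliability is the product over all manifestations.

from math import prod

def manifestation_reliability(segment_reliabilities):
    return prod(segment_reliabilities)

def circuit_reliability(per_manifestation):
    return prod(per_manifestation.values())

# e.g., Delay-rise provoked by HCI on two transistors and EM on two wires
r_delay_rise = manifestation_reliability([0.999, 0.998, 0.9995, 0.9997])
per_m = {"Delay-rise": r_delay_rise, "Stuck-at-0": 0.9999}
print(circuit_reliability(per_m))
```

Because every term is a probability of survival, any weak segment (a low factor) dominates the product, which is exactly the "any segment fails, the system fails" approximation of Section III-B.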
C. Calculation of Failure Probabilities
1) Computation of the Software-Specific Hardware Usage Profile: As discussed in Section III-B, a device's failure probability over time depends on the frequency with which the unit is demanded. For digital devices, especially a CPU, the way each unit is accessed depends heavily on the set of instructions (collectively called the software) it executes. The use of the tool Synopsys
VCS MX [41] (a full-language VHDL simulator) and Synopsys Design Analyzer [42] (a VHDL design analyzer) makes it feasible to simulate the software execution against the hardware device's VHDL description. The procedure to obtain the hardware usage profile of circuit elements is described in Fig. 11. First, the functionality of the VHDL RTL script that describes the hardware platform is verified by means of the Synopsys VCS MX simulator. The VHDL RTL script is then ready to be translated into an actual gate-level netlist using the Synopsys Design Analyzer. The process of converting the RTL description into a netlist for a given target technology is called logic synthesis. To produce the synthesized netlist, the synthesis tool requires the RTL code of the hardware devices and the cell libraries. The cell libraries provide information about all the available cells, including connectivity and functionality, timing, area, and corresponding symbol, among others. Some extra VHDL code is then inserted into the
TABLE IV RELIABILITY MODELS FOR THE AND2_1 GATE PER FAILURE MANIFESTATION UNDER HCI, EM, TDDB, AND NBTI STRESSES
gate-level VHDL scripts aimed at monitoring the demands issued to the circuit elements (e.g., logic gates, flip-flops) during the VHDL simulation, e.g., every time the actual input of a circuit element changes, the value of a corresponding counter is increased. On the other hand, the application software needs to be compiled into machine code. This code will be loaded into the system memory during the VHDL simulation. The gate-level VHDL scripts and the machine code are used by Synopsys VCS MX to simulate the execution of the application software on the system. The outcome of this simulation will consist of the software-specific hardware usage profile in terms of the demands issued to the circuit elements under study. This profile thus encompasses the values of parameters $N_d$, $t_c$, and $N$ of (7) and (8) for each circuit element.
2) Calculation of the Failure Probability Distributions: The methodology allows the user to define the statistical distribution that should be used for each wearout mechanism during the calculation of failure probabilities. Otherwise, best-fit statistical models are selected by default: the lognormal distribution has been demonstrated to be the best-fit model for HCI [43], EM [44], and NBTI [45], while the Weibull distribution is considered to be the best-fit model for TDDB [46].
The methodology defines a software-specific hardware failure probability as the probability that a circuit element (e.g., a logic gate, a flip-flop, etc.) leads to a given failure manifestation under a particular software execution. A software-specific hardware failure probability distribution (or profile) is defined as an n-tuple array specifying the probability of occurrence of the failure manifestation for each of the circuit elements of a hardware device (e.g., ALU, RAM, etc.). A combined software-specific hardware failure
Fig. 11. Description of the VHDL simulation step of the methodology.
TABLE V LOGIC GATES’ TYPES OF THE ALU
probability distribution (or profile) of a hardware device under any failure manifestation is obtained by combining the individual manifestation profiles.7
IV. CASE STUDY
The proposed methodology has been applied to a safety critical embedded system. We refer to this system as the APPlication system (APP). APP's application software is developed in ANSI C, and consists of about 1200 lines of code. The hardware platform of APP consists of a Reduced Instruction Set Computer (RISC) based microprocessor, referred to as the central processing unit (CPU). The case study aims to obtain the software-specific hardware failure profile of the logic gates of the CPU component of APP under HCI, EM, TDDB, and NBTI stresses. We focus on the Arithmetic Logic Unit (ALU) component because it is one of the critical components of any CPU (see also [49] for an example
7For a given device i (like a random-access memory (RAM), or an ALU), the expression P_i(t) is an n-tuple array containing the probability of failure for each of the n circuit elements (1, ..., n) of the device (for example, each cell of a RAM or each gate of an ALU). Accordingly, the expression P_i(t) can be used to build failure probability maps of a hardware device, as illustrated by Fig. 13(e) for an ALU. These failure maps are useful to hardware analysts and designers to detect failure-prone designs. System integrators can also use the failure maps in verification and validation analyses (such as fault injection) to determine the fulfillment of system requirements.
analysis of microprocessor registers). Best-fit statistical models are used for the wearout mechanisms, namely lognormal for HCI, EM, and NBTI; and Weibull for TDDB.
A. Failure Manifestations and Reliability Models
The VHDL code of the CPU is initially coded at the register transfer level. We use Synopsys Design Analyzer to synthesize the RTL code into logic-gate-level code. The cell library vtvtlib25 [31], [32] for the TSMC 0.25 μm technology is used for the logic synthesis. The logic gates of the ALU are of different types. These types are listed in Table V, specifying the number of gates of each type (#), the number of transistors and interconnections of each gate, the inputs of each gate, and the logic function implemented. In total, 142 different types of failure manifestations were observed out of the 461 logic gates of the ALU.
• 66 manifestation types correspond to delays of the transitions of the output signal triggered by specific input combinations. From the 66, 32 impact the falling transitions of the output, while 34 impact the rising transitions.
• 74 types correspond to behavioral changes of the gates leading to different logic functions.
TABLE VI RELIABILITY MODELS FOR THE ALU GATES
• 2 types correspond to the stuck-at-0 and stuck-at-1 manifestation types.
To cover these different types of failure manifestations, we have developed more than 200 reliability models for the ALU gates. Using the "alternate manifestation" notation, it is possible to combine them, and reduce these figures to 100 models for 28 manifestation types.
Fig. 12. Synthesis of the failure manifestation types for the logic gates. (a) Number of failure manifestation types per logic gate using the detailed and alternate manifestation notations. (b) Impact of the delay and different-function manifestation types per logic gate.
Fig. 12 provides a synthesis of the failure manifestation types found for the logic gates during the SPICE simulations. Fig. 12(a) provides the number of failure manifestation types per logic gate for both the detailed and alternate manifestation notations. The gates leading to the highest number of "detailed" failure manifestation types are MUX2_1 (25 types), ABorCorD (18 types), and nABorCorD (17 types). This result can be explained by the fact that these gates are also the most complex ones in terms of the number of transistors and interconnections. In Fig. 12(b), we analyze the impact of the failure manifestations on the shape of the output signal waveform. The impact can be measured in terms of the number of output pulses delayed, and the number of changed entries in the truth table. This impact is analyzed per gate, and calculated as the average between (i) the percentage of pulses delayed, and (ii) the percentage of truth table entries changed. According to Fig. 12(b), gates MUX2_1 and ABorC lead to the manifestations with the greatest impact on the shape of the output signal waveform. The rest of the gates lead to a lower impact.
Further, we have merged the various types of delays and different functions into single categories, which leads to 49 reliability models for four different manifestation types. These models are shown in Table VI for some of the gates. According to Table VI, the HCI and EM stresses lead to delays, while the TDDB and NBTI stresses provoke stuck-at failures and changes in the circuit functionality. Note that the demand transition duration $t_d$ is a known parameter. The only variables of the reliability models correspond to the simulation time window $T$, and the number of demand transitions $N$. As explained in Section III-B, variable $N$ reflects the software-induced hardware usage effect along the system lifetime. The calculation of the usage and failure probability distributions of the logic gates is the object of the following section.
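Once the usage variables are fixed, a reliability model of this form can be evaluated directly: lognormal lifetimes for HCI/EM/NBTI and Weibull for TDDB, as stated in Section III-C-2. The sketch below uses made-up distribution parameters and usage numbers (not the case-study values), and the step that turns the demand count into an effective cumulative stress time is one plausible reading of the models:

```python
# Hedged sketch of evaluating a per-gate reliability model whose only
# variables are the window T and the demand count N. All parameter values
# (mu, sigma, eta, beta, N, t_d) are illustrative placeholders.

from math import erf, exp, log, sqrt

def lognormal_cdf(t, mu, sigma):
    """P(T <= t) for a lognormal lifetime with log-mean mu, log-std sigma."""
    return 0.5 * (1.0 + erf((log(t) - mu) / (sigma * sqrt(2.0))))

def weibull_cdf(t, eta, beta):
    """P(T <= t) for a Weibull lifetime with scale eta and shape beta."""
    return 1.0 - exp(-((t / eta) ** beta))

T = 6 * 30 * 24 * 3600.0        # six-month simulation window, seconds
N = 5e10                        # number of demand transitions (variable)
t_d = 1e-9                      # demand transition duration (known)
t_stress = N * t_d              # cumulative stress time over the window

p_delay = lognormal_cdf(t_stress, mu=6.0, sigma=2.0)   # HCI-induced delay
p_stuck = weibull_cdf(t_stress, eta=1e5, beta=1.0)     # TDDB stuck-at
print(p_delay, p_stuck)
```

The same two CDFs, with fitted parameters per mechanism and per gate, would yield the failure probability profiles discussed in the next section.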
Fig. 13. ALU maps, showing usage and probability profiles. (a) Usage in terms of number of demands. (b) Delay probability profile. (c) Different-Function probability profile. (d) Stuck-at probability profile. (e) Combined failure probability profile.
B. Usage and Failure Probability Distributions
The calculation of the software-specific hardware failure profile of the ALU gates is performed in two steps. In the first step, the usage information is obtained in terms of the number of demands of the ALU gates by simulating the system as described in Section III-C-1. In the second step, the usage data are included into the reliability models of Section IV-A. The failure probability profiles of the ALU gates are then estimated for the first six months of continuous operation of the APP system.
The usage and the failure probability profiles of the ALU logic gates are presented in Fig. 13. We use a two-dimensional grayscale map to represent the ALU layout: the higher the number of demands and the probability values, the darker the shade of gray in the maps.
The usage information for the ALU gates is presented in Fig. 13(a). Out of the 461 gates, the gates fall into four ranges of demand counts (45, 377, 26, and seven gates, respectively), and six are not demanded. Per gate type, the most demanded types (i.e., summing up the contribution of each gate) are MUX2_1 and NOR2_1, and the least demanded are OR2_1 and OR4_1. To some extent, this result is correlated to the number of gates per type (see Table V). The types with the highest demand averages are OR4_1 (there is only one gate of this type) and NOT_AB_OR_C_OR_D, while the lowest are NAND2_1 and INV_1. The highest demand variance among the gates of a same type, in terms of a coefficient of variation (i.e., the ratio between the standard deviation and the mean), is experienced for types NOR4_1 (1.55) and INV_1 (1.05), while the lowest occurs for XNOR2_1 (0.02) and NAND2_1 (0.33). Across gates, the maximum number of demands is observed in gates of types NOR4_1 and MUX2_1. The gates leading to the minimum number of demands belong to types INV_1 and OR2_1 (with at least one gate with zero demands).
The failure probability profiles of the ALU gates are shown in Fig. 13(b) (delay manifestation profile), Fig. 13(c) (different-function manifestation profile), Fig. 13(d) (stuck-at manifestation profile), and Fig. 13(e) (combined manifestation profile). Note that different scales are used in each case (which is necessary to avoid "blank" figures), since each manifestation leads to probabilities within its own range. Considering the 461 gates together, overall probabilities per failure manifestation are obtained for delay, different-function, and stuck-at. Per gate type, the most impacted types (i.e., summing up the contribution of each gate) are MUX2_1 (delay and different-function) and NOR2_1 (stuck-at). The types with the highest failure probability averages are XNOR2_1 (delay and different-function) and OR4_1 (stuck-at). The highest probability variance (in terms of a coefficient of variation) is experienced by gates of type NOR4_1 (2.7) for delay. The variance per gate type for different-function and stuck-at is insignificant, and the corresponding failure probabilities are evenly distributed. Across logic gates, the
maximum failure probability is observed in a gate of type MUX2_1 for delay, XNOR2_1 for different-function, and OR4_1 for stuck-at. A number of gates are not affected by some of the manifestations: six gates for delay (of types INV_1 and OR2_1), 77 gates for different-function (of type INV_1), and 44 gates for stuck-at (of types NAND3_1 and XOR2_1).
From the analysis above, one can see that the delay manifestation dominates over the different-function and stuck-at manifestations. Accordingly, the combined failure probability profile (Fig. 13(e)) follows a similar tendency as the delay probability profile (Fig. 13(b)). In the combined representation, the gates fall into four probability ranges (27, 110, 55, and 269 gates, respectively). Considering all the gates together, an overall probability of occurrence of a failure manifestation is obtained.
The proposed methodology can help find the root causes of the obtained results. The latter requires the involvement of hardware analysts and hardware designers, e.g., to explain the particular shapes of the ALU failure maps, and whether they reveal a poor ALU design from a reliability viewpoint. This work requires access to hardware-related information that is not publicly available, and is out of the scope of our work. From a system-level viewpoint, the proposed failure maps can be used as a base to perform system-level analysis by fault injection (see [8] for an example of fault injection of transient hardware failures).
V. CONCLUSION
This paper has proposed a methodology for the reliability analysis of permanent failure manifestations of computer hardware devices resulting from the operational usage induced by embedded software applications. With respect to related approaches, our methodology introduces important contributions. It systematically models, simulates, and analyses the hardware and software interactions and interdependencies responsible for the creation of hardware failures.
It considers basic physical phenomena and wearout mechanisms (e.g., Hot Carrier Injection, Electromigration, Time Dependent Dielectric Breakdown, Negative Bias Temperature Instability) leading to the usage of hardware devices as a consequence of a software execution. Additionally, the methodology not only provides estimates for the probability of failure of circuits under software operation, but also analyses the resulting high-level failure manifestations (delayed and stuck-at signals, circuits’ functional changes, etc.). From a technical perspective, the methodology uses universal signal stimuli that allow for obtaining generic stress patterns and failure manifestation patterns through SPICE simulations. Stress patterns characterize all the different combinations of circuits’ voltage and current stresses induced by the software operation at the physical level (i.e., transistors and interconnections). Failure manifestation patterns characterize the permanent failures that can occur at the circuit level (i.e., logic gates and storage devices). A systematic approach including a hardware demand model has been provided to build analytical expressions of failure manifestation probabilities as a function of the software execution. Different structures and notations are
also proposed to condense large numbers of manifestations into a reduced, practical set of expressions. The methodology uses an automated procedure based on VHDL simulation to measure the software-induced hardware usage. The failure manifestation probability profiles of computer hardware devices are finally obtained using best-fit statistical models. Overall, the HELIOS methodology provides the Hardware Error Likelihood Induced by the Operation of Software, and we believe it represents a step forward towards filling the gap between the software and microelectronics reliability domains.
The methodology has been applied to a case study consisting of a safety critical application. About 142 different subtypes of failure manifestations have been observed, and more than 200 reliability models developed. Such models have allowed for obtaining the software-specific failure probability profiles of the ALU logic gates.
Future work can address the extension of the methodology to consider the influence of the inter-cell topology of hardware units in the analysis of the reliability. This extension encompasses the analysis of electromigration stresses in the cell-to-cell wiring, the influence of a circuit fan-out on hot carrier injection, and the propagation of corrupted signals (e.g., delays) through a netlist of circuits.
The methodology and results provided in this work have multiple applications. They can help both software and hardware reliability engineers analyze and improve system reliability more efficiently by focusing on the most failure-prone circuit elements. They can also help extend testing techniques such as fault injection, and reliability methodologies such as Probabilistic Risk Assessment (PRA). Fault injection [48] can be extended to address the reliability prediction of software failures by defining hardware fault models based on the failure manifestation probability profiles [49], [50].
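One way the failure-probability profiles could feed a fault-injection campaign is to sample the gate and manifestation to inject in proportion to their probabilities. The sketch below is illustrative only; the gate names and probability values are made-up placeholders, not the APP case-study data:

```python
# Hedged sketch: weighted sampling of an injection target from a
# failure-probability profile (gate id -> manifestation -> probability).

import random

profile = {
    "MUX2_1/g17": {"Delay": 4e-6, "DiffFunc": 5e-8, "Stuck-at": 1e-8},
    "NOR2_1/g03": {"Delay": 1e-6, "DiffFunc": 2e-8, "Stuck-at": 4e-8},
}

def sample_fault(profile, rng):
    """Pick a gate weighted by its combined probability, then a kind."""
    gates = list(profile)
    weights = [sum(kinds.values()) for kinds in profile.values()]
    gate = rng.choices(gates, weights=weights)[0]
    kinds = profile[gate]
    kind = rng.choices(list(kinds), weights=list(kinds.values()))[0]
    return gate, kind

rng = random.Random(0)  # seeded for a repeatable campaign
print(sample_fault(profile, rng))
```

Repeating the draw over many runs reproduces the profile statistically, so the injected fault mix reflects the software-specific hardware usage rather than a uniform fault model.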
The development of non-functional hardware input distributions based on the failure profiles would lead to software operational profiles that extend the use of the PRA methodology [51]. REFERENCES [1] J. Laprie, “Dependable computing and fault tolerance: Concepts and terminology,” in Proc. of IEEE FTCS-15, Ann Arbor, Michigan, 1985. [2] F. R. Shapiro, “Entomology of the computer bug: History and folklore,” American Speech, vol. 62, pp. 376–378, 1987. [3] R. K. Iyer and P. Velardi, “Hardware-related software errors: Measurement and analysis,” IEEE Trans. Software Engineering, vol. SE-11, pp. 223–231, 1985. [4] SunGard, “Why e-mail fails,” White Paper, SunGard Availability Services and MessageOne Survey of e-mail Outages 2004. [5] D. M. Gray, Frontier Status Report #145 1999 [Online]. Available: http://www.asi.org/adb/06/09/07/1999/fs-19990409.html [6] “Wikipedia,” Radiation hardening [Online]. Available: http://en.wikipedia.org/wiki/ Radiation_hardened#Examples_of_rad-hard_computers [7] JEDEC, Failure Mechanisms and Models for Semiconductor Devices JEDEC Publication No. 122B 2003. [8] B. Huang, M. Rodriguez, J. Bernstein, and C. Smidts, “Software reliability estimation of microprocessor transient faults,” in Proc. of The 42nd AIAA/ASME/SAE/ASEE Joint Propulsion Conference and Exhibit (AIAA 2006), Sacramento, CA, USA, July 9–12, 2006. [9] Y. Wei, M. Rodriguez, and C. Smidts, “How time-related failures affect the software system,” in Proc. of The 8th International Conference on Probabilistic Safety Assessment and Management (PSAM 8), New Orleans, Louisiana, USA, May 14–19, 2006. [10] T. Karnik, P. Hazucha, and J. Patel, “Characterization of soft errors caused by single event upsets in CMOS processes,” IEEE Trans. Dependable and Secure Computing, vol. 1, no. 2, April–June 2004.
[11] B. Huang, “Study of the Impact of Hardware Failures on Software Reliability,” Ph.D. Dissertation, University of Maryland, 2006. [12] L. Condra, Proposal for a Contract to Develop an Integrated Aerospace Parts Acquisition Strategy, July 2002. [13] P. C. Li et al., “iProbe-d: A hot-carrier and oxide reliability simulator,” in Proc. of IEEE International Reliability Physics Symposium, 1994. [14] Y. Leblebici and S. M. Kang, “A one-dimensional MOSFET model for simulation of hot-carrier induced device and circuit degradation,” in Proc. of IEEE International Symposium on Circuits and Systems, 1990. [15] J. Segura, C. De Benito, A. Rubio, and C. F. Hawkins, “A detailed analysis of GOS defects in MOS transistors: Testing implications at circuit level,” in Proc. of IEEE International Test Conference, 1995. [16] J. Srinivasan et al., RAMP: A Model for Reliability Aware MicroProcessor Design, IBM Research Report RC23048, 2003. [17] X. Teng, H. Pham, and D. R. Jeske, “Reliability modeling of hardware and software interactions, and its applications,” IEEE Trans. Reliability, vol. 55, no. 4, Dec. 2006. [18] S. R. Welke, B. W. Johnson, and J. H. Aylor, “Reliability modeling of hardware/software systems,” IEEE Trans. Reliability, vol. 44, no. 3, pp. 413–418, September 1995. [19] M. A. Friedman and P. Tran, “Reliability techniques for combined hardware/software systems,” in Proc. of Annual Reliability and Maintainability Symposium, Las Vegas, NV, 1992. [20] J. Segura and C. Hawkins, CMOS Electronics: How It Works, How It Fails. Wiley-IEEE Press, 2004. [21] S. Zafar, B. Lee, J. Stathis, A. Callegari, and T. Ning, “A model for negative bias temperature instability (NBTI) in oxide and high k pFETs,” in Proc. of IEEE Symposium on VLSI Technology and Circuits, Digest of Technical Papers, June 2004, pp. 208–209. [22] S. Zafar, “Statistical mechanics based model for negative bias temperature instability induced degradation,” Journal of Applied Physics, vol. 97, no. 1, pp.
1–9, 2005. [23] R. Degraeve, B. Kaczer, A. De Keersgieter, and G. Groeseneken, “Relation between breakdown mode and breakdown location in short channel NMOSFETs and its impact on reliability specifications,” in Proc. of 39th Annual International Reliability Physics Symposium (IRPS’01), 2001, pp. 360–366. [24] Linder, J. Stathis, D. Frank, S. Lombardo, and A. Vayshenker, “Growth and scaling of oxide conduction after breakdown,” in Proc. of 41st Annual International Reliability Physics Symposium (IRPS’03), 2003, pp. 402–405. [25] B. E. Deal, M. Sklar, A. S. Grove, and E. H. Snow, “Characteristics of the surface-state charge of thermally oxidized silicon,” Journal of the Electrochemical Society, vol. 114, pp. 266–274, 1967. [26] V. Huard, M. Denais, F. Perrier, N. Revil, C. Parthasarathy, A. Bravaix, and E. Vincent, “A thorough investigation of MOSFETs NBTI degradation,” Microelectronics Reliability, vol. 45, pp. 83–89, 2005. [27] D. K. Schroder and J. A. Babcock, “Negative bias temperature instability: Road to cross in deep submicron semiconductor manufacturing,” Journal of Applied Physics, vol. 94, pp. 1–18, 2003. [28] T. Quarles, A. R. Newton, D. O. Pederson, and A. Sangiovanni-Vincentelli, SPICE3 Version 3f3 User’s Manual. Berkeley, CA: Dept. of Electrical Engineering and Computer Sciences, University of California, 1993. [29] J. R. Lloyd, “Reliability modeling for electromigration failure,” Quality and Reliability Engineering International, vol. 10, pp. 303–308, 1999. [30] E. Wu, J. Sune, W. Lai, E. Nowak, L. McKenna, A. Vayshenker, and D. Harmon, “Interplay of voltage and temperature acceleration of oxide breakdown for ultra-thin gate oxides,” Solid-State Electronics, vol. 46, pp. 1787–1798, 2002. [31] J. B. Sulistyo et al., “A new characterization method for delay and power dissipation of standard library cells,” VLSI Design, vol. 15, pp. 667–678, 2002. [32] J. B. Sulistyo, J. Perry, and D. S.
Ha, Developing Standard Cells for TSMC 0.25 Technology under MOSIS DEEP Rules Dept. of Electrical and Computer Eng., Virginia Polytechnic Institute and State University, Blacksburg, VA, Tech. Report VISC-2003-01, 2003. [33] TSMC, 0.25 Micron CL025/CR025 (CM025) Process [Online]. Available: http://www.mosis.org/products/fab/vendors/tsmc/tsmc025/ [34] A. Acovic, G. Rosa, and Y. Sun, “A review of hot-carrier degradation mechanism in MOSFETs,” Microelectronics Reliability, vol. 36, pp. 845–869, 1996. [35] N. Hwang and L. Forbes, “Hot-carrier induced series resistance enhancement model (HISREM) of nMOSTFET’s for circuit simulation and reliability projections,” Microelectronics Reliability, vol. 35, pp. 225–239, 1995.
[36] X. Li, J. Qin, B. Huang, X. Zhang, and J. B. Bernstein, “A new SPICE reliability simulation method for deep submicron CMOS VLSI circuits,” in IEEE Trans. Device and Materials Reliability, 2006, vol. 6. [37] Reliability in CMOS IC Design: Physical Failure Mechanisms and Their Modeling [Online]. Available: http://www.mosis.org/support/technical-notes.html MOSIS Technical Notes [38] M. Ohring, Reliability and Failure of Electronic Materials and Devices. : Academic Press, 1998. [39] ReliaSoft Corporation, Competing Failure Modes [Online]. Available: http://www.weibull.com/LifeDataWeb/competing_failure_modes.htm [40] B. K. Liew, N. W. Cheung, and C. Hu, “Electromigration interconnect lifetime under AC and pulse DC stress,” in Proc. of International Reliability Physics Symposium, April 1989, pp. 215–219. [41] “Synopsys,” VCS MX Reference Guide Synopsys Inc., 2006 [Online]. Available: http://www.synopsys.com [42] “Synopsys,” Synopsys Design Analyzer Reference Manual Synopsys Inc., 2002 [Online]. Available: http://www.synopsys.com [43] E. S. Snyder, A. Kapoor, and C. Anderson, “The impact of statistics on hot-carrier lifetime estimates of n-channel MOSFETs,” SPIE—Microelectronics Manufacturing and Reliability, vol. 1802, pp. 180–187, 1992. [44] M. Gall et al., “Statistical analysis of early failures in electromigration,” Applied Physics, vol. 90, no. 2, pp. 732–740, 2001. [45] H. Masuda et al., “Assessment of a 90 nm PMOS NBTI in the form of products failure rate,” in Proc. of IEEE 2005 Int. Conf. on Microelectronics Test Structures, 2005. [46] E. Y. Wu et al., “Challenges for accurate reliability projections in the ultrathin oxide regime,” in Proc. of IRPS, 1999, pp. 57–65. [47] B. Huang, X. Li, M. Li, J. B. Bernstein, and C. S. Smidts, “Softwarespecific hardware failure profile,” in Proc. of AIAA/ASME/SAE/ASEE Joint Propulsion Conference and Exhibit (AIAA 2005), Tucson, AZ, USA, July 10–13, 2005. [48] J. Arlat, M. Aguera, L. Amat, Y. Crouzet, J.-C. Fabre, J.-C. 
Laprie, E. Martins, and D. Powell, “Fault injection for dependability validation—A methodology and some applications,” IEEE Trans. Software Eng., vol. 16, no. 2, pp. 166–182, Feb. 1990. [49] B. Huang, X. Li, M. Li, J. Bernstein, and C. Smidts, “Study of the impact of hardware failures on software reliability,” in Proc. of The 16th IEEE Int. Symp. on Software Reliability (ISSRE 2005), Chicago, IL, USA, Nov. 8–11, 2005. [50] B. Huang, M. Rodriguez, M. Li, and C. Smidts, “On the development of fault injection profiles,” in Proc. of The 53nd Annual Reliability & Maintainability Symposium (RAMS 2007), Orlando, Florida, USA, January 22–25, 2007. [51] B. Li, M. Li, and C. Smidts, “Integrating software into PRA,” in Proc. of The 14th IEEE International Symposium on Software Reliability Engineering (ISSRE 2003), Denver, 2003, pp. 457–467, IEEE.
Bing Huang is a Senior Reliability Engineer at Everspin Technologies Inc. in Chandler, AZ. He received his M.S. degree (2000) from Tsinghua University, China, and his Ph.D. (2006) in Reliability Engineering from the University of Maryland, College Park. He joined the Center for Reliability Engineering at the University of Maryland in 2001 as a research assistant responsible for SRAM accelerated testing. From 2004, he worked on a NASA project studying the impact of microprocessor hardware faults on software reliability. His current research focuses on the reliability of Magnetoresistive Random Access Memory (MRAM) products.
Manuel Rodriguez (M’08) is a Senior Research Scientist at The Ohio State University (USA). From 1997 to 2002, he was a member of the Dependable Computing and Fault Tolerance (TSF) group in the research laboratory LAAS-CNRS (Toulouse, France). He received the M.S. degree (1998) from the Polytechnic University of Valencia (UPV), Spain, and the Ph.D. degree (2002) from the National Polytechnic Institute of Toulouse (INPT), France. Dr. Rodriguez has authored and co-authored more than 30 conference and journal papers. He has served as a senior researcher for projects sponsored by government (NASA, ESA, AFOSR) and industry. His research interests include formal methods, software testing, real-time systems, and software & hardware reliability.
Ming Li (M’98) is a Reliability Tech Lead and Principal Reliability Engineer at Mantech International Corp. under the NASA GSFC MASCII contract in Greenbelt, MD. Prior to this position, he was the Senior Software Reliability Manager at GE Healthcare in Waukesha, WI. His fields of interest include software reliability assessment, system reliability modeling, and PRA for complex systems. He received his B.S. in Electrical Engineering and M.S. in Systems Engineering from Tsinghua University, China, and his Ph.D. in Reliability Engineering from the University of Maryland, College Park. He has authored over 30 papers in peer-reviewed journals and international conferences.
Joseph B. Bernstein (SM’03) is a Professor of Engineering at Bar Ilan University, and was formerly with the Mechanical Engineering department at the University of Maryland, College Park. Dr. Bernstein received his Ph.D. in Electrical Engineering from MIT in 1990. He is actively involved in microelectronic device and systems reliability research and physics of failure, including power device reliability, ultra-thin gate oxide integrity, radiation effects, MEMS, and laser-programmable metal interconnects. He supervises the laboratory for failure analysis and reliability of microelectronic devices, and is the head of the microelectronics device reliability program. His research areas include statistical interactions of multiple failure mechanisms in ULSI devices. He also works extensively with the semiconductor industry on projects relating to system qualification for reliability based on fundamental physics and circuit simulation techniques, and on programmable devices and repair in microelectronic circuits and packaging.
Carol S. Smidts (M’94–SM’06) is a Professor at The Ohio State University. She graduated with a B.S. & M.S., and a Ph.D. in Engineering Physics from the Université Libre de Bruxelles, Belgium, in 1986 and 1991, respectively. Professor Smidts then became an Assistant Professor, and later an Associate Professor, in the Reliability Engineering Program at the University of Maryland, College Park. She joined The Ohio State University Department of Mechanical and Aerospace Engineering as a Full Professor in 2008. While at the University of Maryland, Dr. Smidts established the certificate and concentration area in Software Reliability Engineering. Her research is in Software Reliability Modeling, Software Test Automation, Probabilistic Dynamics for Complex Systems, and Human Reliability. She is the author of more than 100 refereed journal and conference publications, the recipient of multiple awards such as the NASA Rotary Award, and the holder of three patents. Dr. Smidts’ research has been sponsored by government (DOE, AFOSR, AFRL, NRC, NASA, NSF, FAA, DOD, NSA), as well as by industry (Texas Instruments, IBM). Dr. Smidts is listed in Men and Women of Science, Who’s Who in America, Who’s Who in the World, and Who’s Who in Engineering Education. She is a Senior Member of IEEE, was the conference co-chair of the IEEE International Symposium on Software Reliability Engineering (2006) and of IEEE High Assurance Systems Engineering (2008), is an Associate Editor for the IEEE TRANSACTIONS ON RELIABILITY, and is a regular member of review panels (NSF, FDA, ISSRE, HASE, DSN).