To appear in the International Symposium on Computer Architecture (ISCA), 2008

A Proactive Wearout Recovery Approach for Exploiting Microarchitectural Redundancy to Extend Cache SRAM Lifetime



Jeonghee Shin† Victor Zyuban‡ Pradip Bose‡ Timothy M. Pinkston†
† EE-Systems, University of Southern California, Los Angeles, CA 90089, {jeonghes, tpink}@usc.edu
‡ IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, {zyuban, pbose}@us.ibm.com

Abstract

Microarchitectural redundancy has been proposed as a means of improving chip lifetime reliability. It is typically used in a reactive way, allowing chips to maintain operability in the presence of failures by detecting and isolating, correcting, and/or replacing components on a first-come, first-served basis only after they become faulty. In this paper, we explore an alternative, preferable method of exploiting microarchitectural redundancy to enhance chip lifetime reliability. In our proposed approach, redundancy is used proactively to allow non-faulty microarchitecture components to be temporarily deactivated, on a rotating basis, to suspend and/or recover from certain wearout effects. This approach improves chip lifetime reliability by warding off the onset of wearout failures rather than reacting to them after the fact. Applied to on-chip cache SRAM for combating NBTI-induced wearout failure, our proactive wearout recovery approach increases the lifetime reliability (measured in mean time to failure) of the cache by about a factor of seven relative to no use of microarchitectural redundancy and a factor of five relative to conventional reactive use of redundancy with similar area overhead.

1 Introduction

Deep submicron semiconductor technologies enable greater degrees of device integration and performance, but they also pose many new microprocessor design challenges. Chip lifetime reliability as affected by wearout-related failures, for one, has become a major concern [5]. Atomic-range dimensions, escalating power densities, process/operational variation and other consequences of extreme scaling all contribute to this concern. Much recent research has been conducted to understand and model the effects of wearout failure mechanisms such as negative bias temperature instability (NBTI), electromigration, gate oxide breakdown, etc., on chip lifetime reliability [11][12][27]. Circuit and architectural techniques for mitigating and/or tolerating such wearout failures are also being explored for extending chip lifetime. Some techniques are based on adjusting the operational characteristics (e.g., supply voltage, frequency, threshold voltage, or duty cycle) to reduce or recover from wearout stress conditions of failure mechanisms [10][12][22]. Others are based on some form of redundancy (e.g., component sparing) used reactively to tolerate the effects of wearout [6][13].

In this paper, we propose a proactive wearout recovery approach for exploiting microarchitectural redundancy, targeted at enhancing the lifetime reliability of cache SRAM susceptible to NBTI failure. The NBTI failure mechanism significantly impacts the lifetime reliability of chips implemented with deep submicron technology. SRAM arrays are particularly affected by NBTI-induced wearout. SRAM arrays generally take up a large portion of the chip and therefore contain a large number of devices vulnerable to NBTI. Array cells tend to hold the same value over a long period of time, which causes some devices to be under NBTI stress for a large portion of time (i.e., to have a high duty cycle). Moreover, the degradation of cell stability caused by NBTI cannot be mitigated simply by providing sufficient delay margin at design time, as is done for degradation of logic circuit speed. Cache SRAM arrays and NBTI-induced wearout are, thus, the focus of this work, though we believe proactive wearout recovery can be applied to other microarchitectural components and failure mechanisms.

The rest of the paper is organized as follows. In Section 2, we describe the proposed proactive approach for using redundancy. In Section 3, we propose a circuit-level implementation of SRAM cells which allows them to operate in recovery mode to mitigate the effects of NBTI-induced wearout. In Section 4, we discuss design considerations and describe how proactive wearout recovery can be applied to cache SRAM arrays. In Section 5, we evaluate the lifetime reliability enhancement, performance impact and area overhead of the proposed approach in comparison to no use and conventional reactive use of redundancy. In Section 6, we present related work, and we conclude the paper in Section 7.

2 Proactive Use of Redundancy

Redundancy is a commonly used technique for improving lifetime reliability as well as yield of processor systems [1][6][13]. When applied to microprocessors, chips can maintain operability in the presence of defects or failures by detecting and isolating, correcting, and/or replacing microarchitecture components reactively on a first-come, first-served basis after components become faulty. We refer to this as reactive use of microarchitectural redundancy for extending chip lifetime. Reactive use of redundancy allows as many failures to be tolerated as there are non-faulty redundant components. With this, non-faulty components operate either in active or standby modes. Lifetime can be extended in one of two ways: by graceful performance degradation of the system, in which all components (including redundant ones) initially operate in active mode until failing, or by swapping into the system redundant spares that transition from standby mode to active mode when failures occur.

An alternative approach proposed in this paper for extending chip lifetime is to use redundancy for the purpose of allowing components to suspend or recover from wearout well before any of them fail. We refer to this as proactive use of microarchitectural redundancy. Redundancy used proactively allows non-faulty microarchitecture components to be temporarily deactivated and later reactivated on a rotating basis to suspend and/or reverse the effects of wearout. With this, non-faulty components (including redundant ones) operate either in active mode or in recovery mode, periodically transitioning between the two modes according to a recovery schedule. This enables chip lifetime reliability to be improved by warding off the onset of wearout failures rather than reacting to them after the fact.

While both approaches have similar area and delay overhead to implement redundancy, proactive use of redundancy for extending chip lifetime has several advantages over reactive use. The number of failures occurring over a given period of time (i.e., failure rate) tends to increase rapidly over time after a certain amount of component wearout [1]. Prolonging the time before components reach this point of wearout by suspending their use can extend lifetime. For some failure mechanisms such as NBTI, the effects of wearout can be reversed during the suspended period in which stress conditions are removed (i.e., during recovery mode) [27]. Thus, proactive use of even a limited amount of redundancy can suspend or reverse component wearout. Reactive use of redundancy provides no such benefits to component wearout but, instead, provides only for as many wearout failures to be tolerated as there are redundant components, which typically is very limited. Furthermore, with proper scheduling, proactive use of redundancy allows component wearout to be balanced across the chip to stave off chip kill owing to only a few heavily worn-out components.

3 Wearout Recovery

Proactive use of redundancy can suspend wearout, but it is more effective at delaying the onset of failure if the target failure mechanism has wearout recovery properties that can be exploited. Below, the physical phenomena behind NBTI-induced failure and NBTI recovery effects are described. This is followed by our proposed circuit-level technique for implementing a wearout recovery mode in which devices can undergo intense recovery from NBTI-induced wearout.
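As an illustration of the rotating recovery schedule introduced in Section 2, the following is a minimal sketch of one possible policy, assuming a plain round-robin rotation over equal-length periods. The function names `recovery_schedule` and `recovery_counts`, their parameters, and the round-robin policy itself are hypothetical and chosen only for illustration; they are not taken from the design described in this paper.

```python
from collections import Counter


def recovery_schedule(num_components, num_in_recovery, num_periods):
    """Yield, for each period, the set of component indices in recovery mode.

    num_components counts every component, redundant ones included.
    Assumption: plain round-robin rotation (illustrative only).
    """
    if not 0 < num_in_recovery < num_components:
        raise ValueError("need 0 < num_in_recovery < num_components")
    for period in range(num_periods):
        # Advance the window of resting components by its own width each period.
        start = (period * num_in_recovery) % num_components
        yield frozenset(
            (start + k) % num_components for k in range(num_in_recovery)
        )


def recovery_counts(schedule):
    """Count how many periods each component spent in recovery mode."""
    return Counter(index for resting in schedule for index in resting)


if __name__ == "__main__":
    # Five components in total, one of them resting in any given period.
    schedule = list(recovery_schedule(5, 1, 10))
    for period, resting in enumerate(schedule):
        print(period, sorted(resting))
    print(recovery_counts(schedule))
```

Under this assumed policy, every component spends the same share of periods in recovery mode (here one period in five), which is one simple way to balance wearout across components rather than concentrating it on a few.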

3.1 NBTI Failure Mechanism

Negative bias temperature instability (NBTI) is a critical failure mechanism affecting deep submicron technologies [27]. NBTI occurs in PFET devices stressed with negative gate-source bias (i.e., Vgs = −Vdd) at elevated temperature. After silicon oxidation, most Si atoms bond to oxygen at the interface of silicon and gate oxide, but some Si atoms bond to hydrogen, forming hydrogen-terminated trivalent silicon bonds (Si3–Si–H). According to the hydrogen reaction-diffusion model [27], these bonds are dissociated under stress conditions such as high electric field and/or elevated temperature. As a result, dangling bonds (Si3–Si·) create traps at the interface, and hydrogen atoms diffused from the interface create traps in the gate oxide. These positively charged traps result in an undesired threshold voltage increase. The shift in threshold voltage causes degradation in circuit speed and noise margin, eventually leading to circuit failures due to timing violations or array cell state instability or destruction [18][22].

Recovery from NBTI-induced threshold voltage shift can occur during the period over which no stress is applied on the gate (i.e., Vgs = 0), as hydrogen atoms diffused during NBTI stress return to the interface to mend the dangling bonds and electrons injected from the substrate neutralize oxide traps created from NBTI stress [20][27]. This naturally occurring recovery effect of NBTI-induced wearout is intensified (i.e., made faster and more pronounced) when PFET devices are reverse biased (i.e., Vgs = Vdd), as hydrogen atoms are more effectively attracted to the interface and electron injection is more active [20][21][23]. An SRAM array design that allows PFET devices to undergo intensified NBTI wearout recovery is proposed below.
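The three bias conditions described above (stress at Vgs = −Vdd, natural recovery at Vgs = 0, intensified recovery at Vgs = Vdd) can be summarized in a small sketch. The names `NbtiCondition` and `classify_pfet_bias` and the `tolerance` parameter are hypothetical. Treating every negative bias as stress and every positive bias as intensified recovery is a simplifying assumption, since the text only discusses the three specific bias points; temperature, which also affects NBTI stress, is ignored here.

```python
from enum import Enum


class NbtiCondition(Enum):
    STRESS = "stress"
    NATURAL_RECOVERY = "natural recovery"
    INTENSIFIED_RECOVERY = "intensified recovery"


def classify_pfet_bias(vgs, tolerance=1e-3):
    """Map a PFET gate-source voltage (in volts) to an NBTI condition.

    Assumption: only the sign of vgs matters; values within `tolerance`
    of zero count as unbiased. Temperature is not modeled.
    """
    if abs(vgs) <= tolerance:
        return NbtiCondition.NATURAL_RECOVERY      # Vgs = 0
    if vgs < 0:
        return NbtiCondition.STRESS                # e.g. Vgs = -Vdd
    return NbtiCondition.INTENSIFIED_RECOVERY      # e.g. Vgs = +Vdd


if __name__ == "__main__":
    vdd = 1.0  # illustrative supply voltage, not a value from the paper
    for vgs in (-vdd, 0.0, vdd):
        print(vgs, classify_pfet_bias(vgs).value)
```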

3.2 Implementing Recovery Mode

Conventional power reduction techniques such as voltage scaling and power gating reduce NBTI stress conditions to a certain degree by reducing the applied electric field, resulting in a slow-down of wearout [25]. However, as these techniques do not completely remove the electric field (i.e., Vgs