Trading off Cache Capacity for Reliability to Enable Low Voltage Operation

Chris Wilkerson, Hongliang Gao*, Alaa R. Alameldeen, Zeshan Chishti, Muhammad Khellah and Shih-Lien Lu
Microprocessor Technology Lab, Intel Corporation
*University of Central Florida

Abstract

One of the most effective techniques to reduce a processor’s power consumption is to reduce supply voltage. However, reducing voltage in the context of manufacturing-induced parameter variations can cause many types of memory circuits to fail. As a result, voltage scaling is limited by a minimum voltage, often called Vccmin, beyond which circuits may not operate reliably. Large memory structures (e.g., caches) typically set Vccmin for the whole processor. In this paper, we propose two architectural techniques that enable microprocessor caches (L1 and L2) to operate at low voltages despite very high memory cell failure rates. The Word-disable scheme combines two consecutive cache lines to form a single cache line where only non-failing words are used. The Bit-fix scheme uses a quarter of the ways in a cache set to store positions and fix bits for failing bits in other ways of the set. During high-voltage operation, both schemes allow use of the entire cache. During low-voltage operation, they sacrifice cache capacity by 50% and 25%, respectively, to reduce Vccmin below 500mV. Compared to current designs with a Vccmin of 825mV, our schemes enable a 40% voltage reduction, which reduces power by 85% and energy per instruction (EPI) by 53%.

1. Introduction

Ongoing technology improvements and feature-size reduction have led to an increase in manufacturing-induced parameter variations. These variations affect various memory cell circuits, including Small Signal Arrays (SSAs) (e.g., SRAM cells) and Large Signal Arrays (LSAs) (e.g., register file cells), making them unreliable at low voltages.

The minimum voltage at which these circuits can reliably operate is often referred to as Vccmin. Modern microprocessors contain a number of large memory structures. For each of these structures, the bit with the highest Vccmin determines the Vccmin of the structure as a whole. Since defective cells are distributed randomly throughout the microprocessor, the memory structures with the highest capacity (the L1 data and instruction caches and the L2 cache) typically determine the Vccmin of the microprocessor as a whole. Vccmin is therefore a critical parameter that limits the voltage scaling of a design, and voltage scaling is one of the most effective ways to reduce the power consumed by a microprocessor, since dynamic power is a quadratic function of Vcc and leakage is an exponential function of Vcc [17]. Overcoming Vccmin limits allows designs to operate at lower voltages, improving energy consumption and battery life for handheld and laptop products. This paper shows how reducing Vcc below Vccmin affects the reliability of a microprocessor. Based on published bit failure data, we estimate that a 2MB cache implemented in 65nm technology (similar to the L2 size of the Intel® Core™ 2 Duo processor) has a Vccmin of 825mV. We examine two novel architectural mechanisms that enable some of the largest memory structures (specifically the L1 data and instruction caches and the second-level cache) to be redesigned to decrease Vccmin. These schemes identify and disable defective portions of the cache at two different granularities: individual words or pairs of bits. With minimal overhead, both schemes significantly reduce the Vccmin of a cache in low-voltage mode while reducing performance only marginally in high-voltage mode.

1 Leakage has an exponential dependence on device threshold voltage (Vth), and Vth is in turn a function of Vcc as a result of the Drain Induced Barrier Lowering (DIBL) effect [17].

1. We highlight the importance of Vccmin and its impact on power and reliability.

2. We propose two schemes to enable large memory structures to operate at low voltages. Both schemes achieve reliable operation of caches at 500mV with minimal overhead, while reducing performance marginally when the processor operates at the normal voltage level.

3. To our knowledge, this is the first paper to show architectural techniques that allow reliable cache operation from unreliable components at sub-500mV voltages.

4. We demonstrate that our schemes can reduce overall microprocessor power significantly compared to a system with traditional caches.

In the remainder of this paper, we discuss the importance of Vccmin and its impact on power (Section 2). We introduce our model for failure probability in Section 3. We discuss some prior work in Section 4. We then discuss our two proposed schemes in detail in Section 5. We introduce our experimental methodology in Section 6. Section 7 evaluates our proposed schemes in terms of performance, power, and area overhead, and we conclude in Section 8.

2. Vccmin and its impact on cache power and reliability

A conventional six-transistor (or 6T) SRAM cell is shown in Figure 1 [13]. The cell consists of two identical CMOS inverters, connected in a positive feedback loop to create a bi-stable circuit allowing the storage of either “0” or “1”, and two NMOS access transistors. A word-line select signal (WL) is applied to the access transistors to allow reading or writing the cell by connecting the bit-lines (BL0, BL1) to the cell storage nodes. The SRAM cell is symmetrical and is divided into a left side and a right side. In Figure 1, the left side is assumed to be initially storing a “0” while the right side is assumed to be initially storing a “1”.


Both schemes achieve a significantly lower overhead compared to ECC-based defect-tolerance schemes at low voltages. In the first scheme, Word-disable, we disable 32-bit words that contain defective bits. Physical lines in two consecutive cache ways combine to form one logical line where only non-failing words are used. This cuts both the cache size and associativity in half in low-voltage mode. Each line’s tag includes a defect map (one bit per word, or 16 bits per 64-byte cache line) that represents which words are defective (0) or valid (1). This scheme allows a 32KB cache to operate at a voltage level of 490mV. In the second scheme, Bit-fix, we use a quarter of the cache ways to store the location of defective bits in other ways, as well as the correct value of these bits. For an 8-way cache, for example, we reserve two ways (in low-voltage mode) to store defect-correction information. This reduces both the cache size and associativity by 25% in low-voltage mode, while permitting a 2MB cache to operate at 475mV. In this paper, we make the following main contributions:


Figure 1. Conventional six-transistor (6T) SRAM cell

To read, both bit-lines are first pre-charged equally to Vcc. The word-line select pulse is then applied to turn on the access transistors, allowing a differential voltage (50-100mV) to build across the bit-lines. After that, a sense-amplifier connected to the bit-lines is activated to amplify the bit-line voltage differential and provide the read value. To write, the bit-lines are first driven to the desired value. The word-line select pulse is then applied, allowing the write value to be driven into the cell through the access transistors.

2.1 Types of cell failures

An SRAM cell can fail in four different ways, which we describe next [2, 11, 16]:

Read failure. A read failure occurs when the stored value is flipped during a read operation. This happens when the noise developed on the node storing “0” (due to charge sharing with the highly-capacitive precharged BL0) is larger than the trip point of inverter I1.

As Vcc decreases, the cell failure probability increases exponentially. This increase can be attributed to the higher sensitivity of cell stability to parametric variations at lower supply voltages. Thus, stability failures impose a limit on the minimum supply voltage.


Write failure. A write failure occurs when the cell contents cannot be toggled during a write operation. To toggle the contents of the cell in Figure 1, BL1 is driven to Vss while BL0 remains pre-charged to Vcc. In an operational cell, transistor X1 will start discharging node I1 (on the right), and positive feedback will complete the operation, driving I0 (on the left) to 1. Strengthening the pull-up device P1 relative to the access device X1 makes toggling the contents of the cell harder and causes write failures. Write and read failures dominate the other two types of cell failures.

Access failure. An access failure happens when the differential voltage developed across the bit-lines during a read operation is not sufficient for the sense-amplifier to identify the correct value. Increasing the WL pulse period usually helps reduce access failures.

Retention (hold) failure. A retention failure occurs when the stored value in the cell is lost during standby, for example as the result of a voltage drop.


Figure 2. Probability of failure (Pfail) for a cell vs. Vcc [8]

2.2 Impact of supply voltage on reliability

At high supply voltages, cell operating margins are large, leading to reliable operation. For example, during a read operation, the voltage bump induced on the “0” side of the cell is much smaller than the noise margin of inverter I1, which prevents the cell from losing its contents. However, reducing supply voltage decreases cell noise margins. This decrease, coupled with parametric variation due to manufacturing imperfections, significantly limits the minimum supply voltage (Vccmin) below which the SRAM cell can no longer operate. In particular, intra-die Random Dopant Fluctuations (RDF) play a primary role in cell failure by arbitrarily impacting the number and location of dopant atoms in transistors, resulting in different Vth for supposedly matched SRAM devices [1]. For example, RDF can make X0 stronger than N0, resulting in a read-unstable cell, or make X1 weaker than P1, resulting in write failures in another cell. Random intra-die variations (such as RDF) are modeled as random variables and used as inputs to determine the failure probability (Pfail) of a given SRAM cell [1, 2]. Figure 2 shows an example cell failure probability as a function of supply voltage Vcc [8]. As Vcc decreases, the failure probability increases exponentially.

3. Model for failure probability (Pfail)

As suggested in the previous section, we assume that stability failures are randomly distributed throughout a memory array. Each cell has a probability of failure (Pfail). A die containing even a single cell failure must be discarded, so the aggregate probability of failure for all memory arrays on the die is a significant component of overall manufacturing yield. In our analysis, we assume that the Pfail for each memory array must be kept below 1 in 1000 to achieve acceptable manufacturing yields. With these assumptions, we derive nominal Vccmin as the voltage at which every cell is functional in 999 out of every 1000 dies. Figure 3 shows the Vccmin values for a 2MB and a 32KB cache (representative of the L2 and L1 I&D caches of the Intel® Core™ 2 Duo processor [4]). Achieving acceptable manufacturing yields for a 2MB cache requires a Vccmin of nearly 825mV. Because of its smaller size, the 32KB cache has a lower Pfail than that of the 2MB cache. However, one cannot decrease Vcc below nearly 760mV for the 32KB cache without sacrificing yield. One possible mechanism to decrease cache Vccmin is to selectively disable defective data in the cache. Such disabling can be carried out at

2 Vccmin is closely related to the yield of a product. As such, industrial Vccmin data is unavailable for publication. Data on cell failure rates for different SRAM designs are based on simulated/measured results reported in [8]. The analysis in [8] focuses on a 130nm process node. This paper focuses on a 65nm technology but assumes a similar Pfail/Vcc slope. This assumes that variations (RDF in particular) won’t get worse on the newer technology node. Worsening variations would make Vccmin worse and increase the need for aggressive mitigation mechanisms such as those discussed in this paper.

different granularities, ranging from cache ways [12] or entire cache lines [3] (coarse), to individual bits (fine). To explore such possibilities, we show the relationship between Vcc and Pfail for a 64-byte line, a byte, and a bit in Figure 3. While the failure probability for a 64B line is less than that for the whole cache, almost every cache line has at least one defective bit at Vcc below 500mV. Because a cache line consists of many cells, and the failure of any one of these cells could render the cache line defective, the Pfail of a cache line is significantly higher than that of individual bytes and bits. The data in Figure 3 suggests that disabling at finer granularities could enable operation at Vcc values below 500mV. We explore mechanisms for finer-granularity disabling in detail in Section 5.
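The yield model of Section 3 is simple enough to sketch in a few lines of code. The snippet below is our own illustration (not the authors' tooling): cells are assumed to fail independently with probability p_cell, so a structure of n cells is fully functional with probability (1 - p_cell)^n.

```python
def pfail(p_cell, n_bits):
    """Probability that a structure of n_bits cells has at least one defect."""
    return 1.0 - (1.0 - p_cell) ** n_bits

CACHE_2MB = 2 * 1024 * 1024 * 8  # data bits in a 2MB cache
LINE_64B = 64 * 8                # bits in a 64-byte cache line

# The yield target requires Pfail(cache) < 1/1000, which forces the
# per-cell failure rate down to roughly 1/(1000 * n) -- the reason a
# larger cache needs a higher Vccmin than a smaller one.
p_cell = 1e-3 / CACHE_2MB
print(pfail(p_cell, CACHE_2MB))  # just under 1e-3
print(pfail(p_cell, LINE_64B))   # orders of magnitude smaller
```

As Figure 3 illustrates, the same per-cell failure rate that dooms a whole 2MB cache leaves the vast majority of individual lines, bytes, and bits intact, which is what makes finer-granularity disabling attractive.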


Figure 3. Pfail vs. Vcc for various cache structures

In addition to parametric failures, several other types of failures may occur in an SRAM cell. For example, hard failures (due to short or open defects) are typically mitigated through column/row redundancy. Soft errors, failures due to device aging such as Negative Bias Temperature Instability (NBTI) [9], and erratic bit behavior due to gate oxide leakage [10] affect FIT (Failures in Time) rates. These can be addressed with ECC and additional supply voltage (Vcc) guardband.

4. Related work

Khellah, et al., propose a Dual-Vcc approach to allow the core to operate at a low voltage while maintaining a separate power supply for large memory structures [6]. Although this approach increases platform cost, it may be practical for large memory structures that are segregated from the core (e.g., the L2 cache). For memory structures embedded in the core (e.g., the L1 I&D caches), differing adjacent voltage domains add a great deal of design complexity. Voltage-level converters must be added to allow signal sharing between the memory structures running at high voltages and logic blocks running at low voltages. Furthermore, accesses to memory structures running at high voltages generate noise on nearby low-voltage blocks, requiring either shielding or additional noise margin. The most serious drawback of this approach is that power consumption in the high-voltage structures will quickly begin to dominate total power consumption as aggressive voltage scaling reduces power consumption on the rest of the die. As an example, the L1 caches might account for 20% of the dynamic power consumption at high voltage. After a voltage reduction of 40%, classic voltage scaling for the rest of the die would indicate that the L1 caches account for 55% of the dynamic power. In our analysis, we assume that our platform provides a single power supply, and Vccmin in the large memory structures determines Vccmin for the die.

A cache designer has several alternatives to improve the reliability of memory cells at low voltages. The least intrusive option addresses the problem by changing the basic design of the cell. The designer can upsize the devices in the cell, reducing the impact of variations on device performance and improving the stability of the cell. Variations on the traditional 6-T (6 transistor) SRAM design can also be adopted, including 8-T and 10-T cells. Kulkarni et al. [3] compare some of these approaches with their proposed Schmitt Trigger (ST) 10-T SRAM cell. For a fixed area, they find that the ST cell delivers better low-voltage reliability than 6-T, 8-T, and 10-T SRAM cell designs. The improved reliability of the ST cell comes at the cost of a 100% area increase and an increase in latency. Figure 4 shows the Vccmin reduction achievable if we use the ST cell for a 2MB cache. The ST cell allows us to reduce Vccmin to 530mV for the 2MB cache. As we show in Section 5, our proposals achieve lower Vccmin than the ST cell with less area. A simple solution to increase reliability, which has been used by industry, is implementing row and column redundancy in caches [14]. In these schemes, one or several additional rows and/or columns are added to a cache array design, and these additional cells are enabled when other cache cells fail. However, these techniques are

3 An n-input majority function gate outputs a ‘1’ if the majority of its n inputs are ‘1’.


impractical for dealing with the very high failure rates we consider. While row/column redundancy is effective for dealing with tens of defective bits, our techniques allow us to address caches with thousands of defective bits.

Instead of improving the design of the cell, a designer can choose to build error detection/correction into the array by choosing from a wide variety of available error detection/correction codes (EDC/ECC). ECC has proven to be an attractive option for reducing cache vulnerability to transient faults such as soft errors, where multi-bit errors are unlikely. Figure 4 shows that 1-bit ECC, or SECDED (single-error correction, double-error detection), can reduce Vccmin by around 150mV. However, SECDED falls far short of our 500mV goal. To allow operation around 500mV, we would need to augment our 2MB cache with a 10-bit ECC. However, multi-bit error correction codes have high overhead [7], requiring storage for correction codes as well as complex and slow decoders to identify the errors. As an example, Kim, et al. [7], propose a scheme that uses 2-D ECC to detect and correct multi-bit errors. While this scheme requires less overhead in check bits, it requires a read-modify-write operation for every write and many extra cycles (depending on the organization) to correct. Hsiao, et al. [5], propose a class of one-step majority-decodable codes called Orthogonal Latin Square Codes (OLSC) that is both fast and implementable for multi-bit error correction. Encoding check bits for t-bit errors (t = 10 in our case) and m^2 data bits (m^2 = 512 in our case) requires 2*t*m (m-1)-input XOR gates. To decode and correct the transmitted data, we need 2*t*m^2 m-input XOR gates and a (2t+1)-input majority function gate for each data bit. Unfortunately, this class of codes has substantial storage overhead, since m^2 data bits require 2*t*m check bits. The number of check bits grows as O(m) = O(n^1/2), rather than the theoretical ECC limit of O(log n). Implementing our 10-bit ECC with OLSC to allow operation at 500mV would require 460 check bits for a 512-bit cache line. Therefore, a 10-bit ECC introduces a significant (nearly 2X) area overhead, as well as significant complexity to correct errors. In the next section, we describe our two proposals to achieve lower Vccmin at a much lower overhead.
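The OLSC storage arithmetic above can be checked mechanically. The sketch below is our own, with m taken as the ceiling of sqrt(512) = 23 (an assumption; the text does not spell out its rounding):

```python
import math

t = 10                     # bits of correction targeted (10-bit ECC)
n = 512                    # data bits per cache line, n = m^2
m = math.isqrt(n - 1) + 1  # ceil(sqrt(n)) = 23
check_bits = 2 * t * m     # OLSC storage cost: 2*t*m check bits
print(check_bits)          # 460, matching the figure quoted above
print(check_bits / n)      # ~0.9, i.e. nearly 2X total storage
```

The same formula makes the scaling argument concrete: check bits grow with m = sqrt(n), far worse than the O(log n) of a Hamming-style code.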

Figure 4. Pfail for different failure mitigation schemes (all data are for 2MB arrays except the 32KB curve; “6T” denotes a 6T cell-based cache)

5. Two defect-tolerance mechanisms

Ideally, a mechanism to handle defective bits should be turned off in high-performance mode, when high voltages preclude defects, and turned on at low voltages, when bit failures are pervasive. In this section, we describe two novel architectural mechanisms that achieve this objective. Both mechanisms trade off cache capacity at low voltages, where performance (and cache capacity) may be less important, to gain the improved reliability required for low-voltage operation. At high voltages, where performance is critical, both mechanisms have minimal overhead. To achieve this, we leverage low-voltage memory tests performed when the processor boots [3]. These memory tests identify parts of the cache that will fail during low-voltage operation, allowing defect avoidance and repair. The first mechanism, Word-disable, identifies and disables defective words in the cache. The second mechanism, Bit-fix, identifies broken bit-pairs and maintains patches to repair the cache line. Both mechanisms address bit failures in the cache data array but not the tag array. As a result, we implement the tag arrays of our caches using the ST cell, which allows the 32KB and 2MB cache tag arrays to operate at 460mV and 500mV respectively. The tag array for the 2MB cache can also be implemented with SECDED ECC, which would allow correct operation at voltages as low as 400mV.

5.1 Cache word-disable

The word-disable mechanism isolates defects at a word-level granularity and then disables words containing defective bits. Each cache line’s tag keeps a defect map with one bit per word that represents whether the word is defective (1) or valid (0). For each cache set, physical lines in two consecutive ways combine to form one logical line. After disabling defective words, the two physical lines together store the contents of one logical line. This cuts both the cache size and associativity in half.

Example. Assume we have a 32KB 8-way L1 cache containing 64-byte lines. For each cache set, eight physical lines correspond to four logical lines. We use a fixed mapping from physical lines to logical lines: lines in physical ways 0 and 1 combine to form logical line 0, lines in physical ways 2 and 3 combine to form logical line 1, and so on. The first physical line in a pair stores the first eight valid words, and the second stores the next eight valid words of the logical line, for a total of 16 32-bit words. As a result, in low-voltage mode, the cache effectively becomes a 16KB 4-way set-associative cache. In the tag array, each tag entry is required to store a 16-bit defect map that consists of one defect bit for each of the 16 32-bit words. These bits are initialized at system boot time and are kept unchanged in both high- and low-voltage modes.

Operation in low-voltage mode. During the tag match, the way comparators try to match the incoming address’s tag bits with the address tag for even ways 0, 2, and so on. We do not need to match tags in odd ways, since they contain the same information as their pair partners. If one of the tags matches, we select the even way (e.g., way 0) or the odd way based on the sixth least-significant bit of the address, since half the words are stored in the even/odd way. Requests that require data from two separate 32-byte lines are treated as accesses that cross a cache line boundary.
To obtain the 32-byte data in aligned form, we use two four-stage shifters to remove the defective words and aggregate working words, as shown in Figure 5. We divide the cache line into two halves, each with a maximum of four defective words, and each storing four of the eight required words. A line with more than four defective words in either half renders the whole cache defective. Each of the shifting stages depicted in Figure 5 “shifts out” a single defective word. To control each shifter, we decompose the defect map into four separate bit-vectors, each a 1-hot representation of the location of a different defective word.
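Functionally, each physical line's contribution can be modeled as follows (a software sketch of the shifter's end result, with names of our own choosing; the hardware does this with muxes, not lists):

```python
def usable_words(words, defect_map):
    """Model of word-disable selection: a 16-word physical line yields
    the 8 words it contributes to its logical line (4 per 8-word half)."""
    assert len(words) == 16 and len(defect_map) == 16
    contribution = []
    for half in range(2):
        ws = words[half * 8:half * 8 + 8]
        dm = defect_map[half * 8:half * 8 + 8]
        good = [w for w, d in zip(ws, dm) if not d]  # drop defective words
        if len(good) < 4:
            # more than four defective words in a half: unrepairable
            raise ValueError("half has more than four defective words")
        contribution.extend(good[:4])
    return contribution

# Words 2 and 9 defective: the line still supplies 8 usable words.
dm = [i in (2, 9) for i in range(16)]
print(usable_words(list(range(16)), dm))  # [0, 1, 3, 4, 8, 10, 11, 12]
```

Two such contributions, one from each physical line of a pair, assemble the 16-word logical line described above.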

Figure 5. Overview of cache word-disable operation


Figure 6. Word-disable operation: removing a defective word

Figure 6 shows a simplified version of the logic used to disable and remove a single defective word from a group of words. Starting with the defect map, we extract a 1-hot repair vector identifying the position of a single defective word. In the figure, the vector “00100” identifies the 3rd word as defective. The

decoder converts the 1-hot vector into a Mux-control vector containing a string of 0s up to (but not including) the defective position followed by a string of 1s. This has no effect on the words to the left of the defect, but each of the words to the right of the defective word shifts left, thereby “shifting-out” the defective word. Each level of 32-bit 2:1 muxes eliminates a single defective word. Our proposed implementation must eliminate four defective words, requiring a cache line to pass through four levels of muxes. The additional multiplexing, and associated logic for Mux-control, will add latency to the cache access. Assuming 20 FO4 delays per pipeline stage, we estimate that the additional logic increases the cache latency by one cycle.
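The decoder and one mux level can be modeled in a few lines (our own sketch; the names are ours, and short strings stand in for 32-bit buses):

```python
def mux_control(one_hot):
    """Decode a 1-hot repair vector into the Figure 6 mux-control string:
    0s before the defective position, 1s from the defective position on."""
    i = one_hot.index("1")
    return "0" * i + "1" * (len(one_hot) - i)

def shift_out(words, one_hot):
    """One level of 2:1 muxes: output word i is word i (control 0) or
    word i+1 (control 1), which shifts the defective word out."""
    ctrl = mux_control(one_hot)
    return [words[i + int(c)] for i, c in enumerate(ctrl[:-1])]

print(mux_control("00100"))                                 # 00111
print(shift_out(["w0", "w1", "BAD", "w3", "w4"], "00100"))  # drops BAD
```

Chaining four such levels, each driven by its own 1-hot vector, removes up to four defective words per half, as described above.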

For our 8-way cache we maintain two fix-lines, one in each bank. As we show later in this section, the repair patterns for three cache lines fit in a single cache line. Therefore, we maintain a single fix-line (a cache line storing repair patterns) for every three cache lines. A fix-line is assigned to the bank opposite to the three cache lines that use its repair patterns. This strategy allows a cache line to be fetched in parallel with its repair patterns without increasing the number of cache ports.
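The claim that three repair patterns fit in a single cache line can be sanity-checked under explicit assumptions. The budget below is our own back-of-the-envelope accounting, not the paper's stated breakdown: we assume each pattern covers up to 16 defective pairs, each needing a pointer wide enough to index the 256 pairs of a 512-bit line plus a 2-bit patch, with the whole pattern protected by SECDED.

```python
def secded_check_bits(k):
    """Check bits for SECDED over k data bits: smallest r with
    2**r >= k + r + 1, plus one parity bit for double-error detection."""
    r = 0
    while 2 ** r < k + r + 1:
        r += 1
    return r + 1

PAIRS = 512 // 2                         # 256 two-bit pairs per line
POINTER_BITS = (PAIRS - 1).bit_length()  # 8 bits to index any pair
MAX_DEFECTS = 16                         # assumed repair budget per line
payload = MAX_DEFECTS * (POINTER_BITS + 2)  # pointers + patches = 160 bits
pattern = payload + secded_check_bits(payload)
print(POINTER_BITS, payload, pattern)    # 8 160 169
print(pattern <= 512 // 3)               # True: three patterns per fix-line
```

Under these assumptions one repair pattern needs 169 bits, just inside the 170 bits available when a 512-bit fix-line is split three ways.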


5.2 Cache bit-fix

The Bit-fix mechanism differs from the Word-disable mechanism in three respects. First, instead of disabling at word-level granularity, the Bit-fix mechanism disables groups of 2 bits; a defective pair is a 2-bit group in which at least one bit is defective. Second, for each defective pair, the Bit-fix mechanism maintains a 2-bit patch that can be used to correct the defective pair. Third, the Bit-fix mechanism requires no additional storage for repair patterns, instead storing them in selected cache lines in the data array. This eliminates the need for the additional tag bits that store the defect map in the word-disable mechanism, but introduces the need to save repair patterns elsewhere in the system (e.g., main memory) while they are not in use (in high-voltage mode). During low-voltage operation, the repair patterns (repair pointers as well as patches) are stored in the cache. As a result, any read or write operation on a cache line must first fetch the repair patterns for that line. When reading, repair pointers allow reads to avoid reading data from broken bits. Using patches from the repair patterns, the cache line is reconstructed before being forwarded to the core or another cache, or written back to memory. When writing, repair pointers allow writes to avoid writing to broken bits, and new patches must be written to the repair patterns to reflect the new data written to the cache. Although we focus on reads in the following discussion, reads and writes are symmetric.

Example. Assume we have a 32KB 8-way L1 cache containing 64-byte lines. Each access to data stored in the cache requires an additional access to retrieve the appropriate repair patterns. To access the repair patterns without increasing the number of ports, the Bit-fix scheme organizes the cache into two banks.


Figure 7. HighHigh-level operation of the bitbit-fix scheme Operation in low-voltage mode. Figure 7 shows the operation of the Bit-fix mechanism at a high level. On a cache hit, both the data line and fix-line are read. In this figure, we fetch the data line from Bank A and the fix-line from Bank B. The data line passes through ‘n’ bit shift stages, where ‘n’ represents the number of defective bit pairs. Each stage removes a defective pair, replacing it with the fixed pair. Since the fix-line may also contain broken bits, we apply SECDED ECC to correct the repair patterns in the fix-line before they are used. After the repair patterns have been fixed, they are used to correct the data-line. Repairing a single defective pair consists of three parts. First, SECDED ECC repairs any defective bits in the repair pattern. Second, a defect pointer identifies the defective pair. Third, after the defective pair has been removed, a patch reintroduces the missing correct bits into the cache line. We demonstrate how Bit-fix works for a short 10bit line in Figure 8. The logic depicted in Figure 8 can be generalized to 512-bit lines. The 10-bit line consists of five 2-bit segments. We indicate the defective bit with an X and the valid bits as 0 or 1. The defective

pair contains “X1”. To repair the line, we first identify and remove the defective pair. For example, the defect pointer “000” indicates that the first 2-bit pair is defective. In Figure 8, the defect pointer “010” indicates that the third 2-bit pair is defective. We decode the defect pointer into a 5 bit Mux-control vector in a similar fashion to the Word-disable scheme, where the defect position and all the bits to its right are set to 1, and all other bits are set to 0. In Figure 8, a defect in the third 2-bit pair produces the control vector “00111”. The Mux-control is used to control multiplexers that create a new “repaired” line by shifting all bits in the original line located to the right of the defective pair two positions to the left, while keeping the other bits unchanged, thereby “shifting-out” the defective pair. We restore the vector to its original state by appending the patch “01”, where “0” is the correct value of the defect bit “X”. 10-bit bit vector divided up into 5 2-bit segments. X marks a defective bit 00 Repair pointer 010 (2)
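The shift-out-and-patch repair just described can be sketched in a few lines. This is a behavioral sketch, not the hardware implementation: lines are modeled as strings, the defect pointer as an integer (the paper's "010" is 2), and each position's 2:1 multiplexer as a conditional driven by the decoded control vector.

```python
# One Bit-fix repair stage for a short line of 2-bit segments.

def decode_pointer(ptr, n_pairs):
    """Decode a defect pointer into the Mux-control vector: the defective
    pair's position and every position to its right are set to 1."""
    return [1 if i >= ptr else 0 for i in range(n_pairs)]

def repair(line, ptr, patch):
    """Shift out the defective 2-bit pair at index `ptr`, append `patch`."""
    pairs = [line[i:i + 2] for i in range(0, len(line), 2)]
    ctrl = decode_pointer(ptr, len(pairs))
    out = []
    for i, c in enumerate(ctrl):
        # Positions with ctrl=1 take the pair to their right ("shifting out"
        # the defective pair); the last position takes the 2-bit patch.
        out.append(pairs[i + 1] if c and i + 1 < len(pairs) else
                   (patch if c else pairs[i]))
    return "".join(out)

# The 10-bit example: pointer "010" (2) marks the third pair ("X1") as
# defective, and the patch "01" reintroduces the correct bits.
print(repair("0000X11111", ptr=2, patch="01"))   # -> 0000111101
```

For 512-bit lines the same logic is replicated over 256 pairs, with one such stage per repairable defect.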

[Figure 8: A 10-bit vector divided into five 2-bit segments, with X marking a defective bit. The repair pointer "010" (2) feeds a decoder that produces the Mux-control bits; each 2:1 multiplexer selects either its own pair or the pair to its right, shifting out the defective pair "X1", and the 2-bit patch "01" is shifted in on the right, yielding the repaired vector "00 00 11 11 01".]

Figure 8. Implementation of the Bit-fix scheme for ten bits

For 512-bit lines, Bit-fix organizes the line as 256 pairs, which require 256 2:1 2-bit multiplexers arranged similarly to Figure 8. To repair ten defective pairs, we require ten repair stages, each with a separate decoder. Using synthesis tools we have found that each decoder requires 306 FO4 gates with three inputs or fewer. In total, the logic required to implement this function is 2560 2:1 2-bit muxes and 3060 FO4 gates with a maximum of three inputs, for a total of fewer than 26,000 transistors. Due to the need for ECC on repair lines and the large number of repair stages, we estimate that repairing up to ten defective bits would increase cache access latency by three cycles, assuming 20 FO4 delays per cycle.

Fix-line structure. To repair a single defective pair in a 512-bit cache line, each defect pointer requires 8 bits (log2(256) = 8), and each patch requires 2 bits, in addition to four bits of SECDED ECC to compensate for broken bits in the repair pattern itself. In total, we need 14 bits per defect. Our Pfail model estimates that, with a bit failure probability of 0.001, only three out of every billion cache lines contain more than ten defective bit pairs. A 140-bit repair pattern can therefore repair a single data cache line with ten or fewer defects, and each 512-bit fix-line stores three 140-bit repair patterns, for a total of 420 bits.
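The fix-line sizing above is simple arithmetic; the sketch below just reproduces it from the stated field widths, confirming that three repair patterns fit in one 512-bit line.

```python
# Arithmetic check of the fix-line layout: 8-bit defect pointer + 2-bit
# patch + 4 bits of SECDED ECC per repaired pair, ten repairable pairs per
# 512-bit data line, three repair patterns per fix-line.
import math

LINE_BITS = 512
PAIRS = LINE_BITS // 2                          # 256 two-bit pairs
PTR_BITS = math.ceil(math.log2(PAIRS))          # 8 bits to name one pair
PATCH_BITS = 2
ECC_BITS = 4                                    # SECDED over the entry
ENTRY_BITS = PTR_BITS + PATCH_BITS + ECC_BITS   # 14 bits per defective pair

MAX_DEFECTS = 10
PATTERN_BITS = MAX_DEFECTS * ENTRY_BITS         # 140 bits per data line

print(ENTRY_BITS, PATTERN_BITS, 3 * PATTERN_BITS)   # -> 14 140 420
assert 3 * PATTERN_BITS <= LINE_BITS   # three patterns fit in one fix-line
```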

5.3 Discussion

Vccmin improvements. We demonstrate the failure probabilities of our proposed schemes in Figures 9 and 10. For a failure probability of 0.001, a 32KB cache can reliably operate at 490mV, and a 2MB cache can reliably operate at 510mV using the Word-disable scheme. For Bit-fix, a 32KB cache can reliably operate at 455mV, and a 2MB cache can reliably operate at 475mV. Compared to a modern high-performance microprocessor, where low-power voltages are in the range of 800-850mV, this represents more than a 40% drop in Vccmin. We show in the results section that this significantly reduces power. Figures 9 and 10 also compare the Vccmin of the Word-disable and Bit-fix schemes with a conventional 6T cell, a 6T cell with 1-bit ECC, and the previously proposed ST cell-based cache. Both Word-disable and Bit-fix show significant Vccmin reductions relative to the 6T cell and the 6T cell with 1-bit ECC. For a 2MB cache, Word-disable and Bit-fix decrease Vccmin by 20mV and 55mV respectively without incurring the cell area overhead of the much larger ST cell.

Comparison. Both the Word-disable and Bit-fix schemes require memory tests to be performed when the processor is powered on [3]. When switching between low-voltage and high-voltage modes, the cache contents need to be flushed to memory. In addition, both schemes require that the reliability of the tags be improved through the use of ST cells and potentially ECC. The size of the tag array is roughly 5% of the size of the data array. The use of ST cells and ECC increases the size of the tag array by roughly 2.5X for the Bit-fix mechanism, and 4X for the Word-disable mechanism, which also requires a more reliable defect map. For a 2MB cache, the Bit-fix mechanism increases the cache size by ~7%, and the Word-disable mechanism increases the cache size by ~15%. Compared to Word-disable, Bit-fix has a lower area overhead, operates at a lower Vccmin, and preserves three-fourths of the cache for use at low voltages versus one-half for the Word-disable approach. However, Word-disable requires less complex muxing and as a result has a smaller cache latency. Word-disable's tag overhead includes storage for the defect map, which avoids the Bit-fix complexity of using the cache to store repair patterns.
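The ~7% and ~15% figures follow directly from the tag-to-data ratio; a small sketch of that arithmetic, assuming tag hardening multiplies tag area by 2.5x (Bit-fix) or 4x (Word-disable, which also keeps the defect map in the tags):

```python
# Total-cache area growth from enlarging only the tag array, which is
# roughly 5% of the data array.

TAG_FRACTION = 0.05

def cache_growth(tag_multiplier):
    """Fractional increase in total cache area from hardened tags."""
    return TAG_FRACTION * (tag_multiplier - 1)

print(f"Bit-fix:      +{cache_growth(2.5):.1%}")   # -> +7.5%
print(f"Word-disable: +{cache_growth(4.0):.1%}")   # -> +15.0%
```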

[Figure 9: Probability of failure (1E+00 down to 1E-06) versus Vcc (0.4-0.9 V) for a 32KB cache with the 6T cell, 6T cell with 1-bit ECC, ST cell, Word-disable (WdDis), and Bit-fix (BitFix) schemes; the curves cross Pfail = 0.001 at roughly 0.76, 0.625, 0.49, and 0.455 V for the 6T cell, 6T cell with 1-bit ECC, WdDis, and BitFix, respectively.]

Figure 9. Pfail of 32KB cache with different failure mitigation schemes

[Figure 10: Probability of failure versus Vcc (0.3-0.9 V) for a 2MB cache with the same schemes; at Pfail = 0.001, the ST cell, WdDis, and BitFix curves cross at roughly 0.53, 0.51, and 0.475 V, respectively.]

Figure 10. Pfail of 2MB cache with different failure mitigation schemes

6. Simulation methodology

We use a cycle-accurate, execution-driven simulator running IA32 binaries. The simulator is micro-operation (uOp) based, executes both user and kernel instructions, and models a detailed memory subsystem. Table 1 shows the parameters of our baseline out-of-order processor, which is similar to the Intel® Core™ 2 Duo processor on 65nm technology [4]. In order to quantify the relative performance loss caused by the cache latency and capacity changes introduced by our schemes, we also simulate a defect-free low-voltage baseline that has reliable caches without any latency overhead or capacity loss. Since reducing the supply voltage increases transistor switching delay, we perform circuit simulations to predict the frequencies at different voltages. To simplify our discussion, we fix the low voltage at 500mV when applying our schemes instead of using a different voltage for each scheme. In the following sections, we annotate the results of a processor that implements Scheme A for the L1 cache and Scheme B for the L2 cache as "L1A_L2B". For example, "L1WDis_L2BFix" means that the L1 cache uses the Word-disable scheme, while the L2 cache uses the Bit-fix scheme.

Table 1. Baseline processor parameters

  Voltage-dependent parameters   High Vol.     Low Vol.
  Processor frequency            3 GHz         500 MHz
  Memory latency                 300 cycles    50 cycles
  Voltage                        1.3V          0.5V

  Voltage-independent parameters
  ROB size                       256
  Register file                  256 fp, 256 int
  Fetch/schedule/retire width    6/5/5
  Scheduling window size         32 FP, 32 Int, 32 Mem
  Memory disambiguation          Perfect
  Load buffer size               32
  Store buffer size              32
  Branch predictor               16KB TAGE [15]
  Hardware data prefetcher       Stream-based (16 streams)
  Cache line size                64 bytes
  L1 instruction cache           32KB, 8-way, 3 cycles
  L1 data cache                  32KB, 8-way, 3 cycles
  L2 unified cache               2MB, 8-way, 20 cycles

In our experiments, we simulate nine categories of benchmarks. For each individual benchmark, we carefully select multiple sample traces that represent the benchmark's behavior well. Table 2 lists the number of traces and example benchmarks included in each category. We use instructions per cycle (IPC) as the performance metric. The IPC of each category is the geometric mean of IPC over all traces within that category. We then normalize the IPC of each category to the baseline to show performance.
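The per-category metric can be sketched as follows; the trace IPC values below are hypothetical placeholders, not measured data from the paper.

```python
# Per-category IPC = geometric mean over the category's traces, then
# normalized to the baseline's geometric mean over the same traces.
import math

def geomean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

baseline_ipcs = [1.20, 0.90, 1.50]   # hypothetical per-trace baseline IPCs
scheme_ipcs   = [1.10, 0.85, 1.35]   # same traces under a low-Vccmin scheme

normalized = geomean(scheme_ipcs) / geomean(baseline_ipcs)
print(f"normalized IPC: {normalized:.3f}")
```

The geometric mean keeps one outlier trace from dominating a category, which is why it is preferred over the arithmetic mean for ratios like normalized IPC.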


Table 2. Benchmark suites

  Category              # of traces   Example benchmarks
  Digital home (DH)     60            H264 decode/encode, Flash
  FSPEC2K (FP2K)        25            www.spec.org
  ISPEC2K (INT2K)       26            www.spec.org
  Games (GM)            49            Doom, Quake
  Multimedia (MM)       77            Photoshop, raytracer
  Office (OF)           52            Excel, Outlook
  Productivity (PROD)   43            File compression, Winstone
  Server (SERV)         48            SQL, TPCC
  Workstation (WS)      82            CAD, bioinformatics
  ALL                   462

7. Results

In this section, we present the experimental results of our proposed schemes. Section 7.1 evaluates the performance impact of the proposed schemes and discusses the tradeoffs involved in choosing an appropriate scheme at each cache level. Section 7.2 compares our schemes with the ST cell-based cache in terms of area overhead, power consumption, and energy efficiency.

7.1 Performance

In this section, we evaluate the performance overhead of our two proposals in both high-voltage and low-voltage modes. For both operating modes*, Word-disable has a 1-cycle latency overhead and Bit-fix has a 3-cycle latency overhead (Section 5). We evaluate different combinations of our two schemes for the two cache levels. Recall from Section 5.3 that, of the two schemes, Word-disable has a latency advantage while Bit-fix provides more capacity in the low-voltage mode.

* For the high-voltage mode, it is possible to design bypass networks around the logic required for the low-voltage mode. However, we chose to evaluate using a conservative estimate of the high-voltage-mode latencies.

[Figure 11: Normalized IPC (0.8-1.0) in the high-voltage mode for the Baseline, L1WDis_L2WDis, L1BFix_L2WDis, L1WDis_L2BFix, and L1BFix_L2BFix configurations across the DH, FP2K, INT2K, GM, MM, OF, PROD, SERV, and WS categories and their geometric mean (GMEAN).]

Figure 11. Normalized IPCs in the high-voltage mode

Figure 11 shows IPC normalized to the high-voltage baseline when applying different combinations of the Word-disable and Bit-fix schemes to the L1 and L2 caches in the high-voltage mode. Performance in the high-voltage mode is sensitive to L1 latency and more tolerant of increases in L2 latency. When we use Word-disable for the L1 cache and Word-disable (Bit-fix) for the L2 cache, IPC drops by 4% (5%). On the other hand, if we use Bit-fix for the L1 cache and Word-disable (Bit-fix) for the L2 cache, IPC drops by 9.7% (10.5%). The INT2K, OF, SERV, and PROD benchmarks are very sensitive to L1 latency and see a 10% performance loss when using Bit-fix for the L1 cache. To minimize additional latency in the L1 cache, we use Word-disable for the L1 cache in the remainder of this section.

To quantify the performance overhead of our schemes in the low-voltage mode, Figure 12 shows IPC normalized to the defect-free low-voltage baseline when applying Word-disable to the L1 cache and Word-disable or Bit-fix to the L2 cache. With Word-disable in the L1 cache, the Word-disable and Bit-fix schemes for the L2 cache have a similar impact on IPC (10.2% and 10.7% reductions, respectively). Five of the nine benchmark categories see IPC reductions of more than 10%. However, this overhead is measured against a defect-free baseline that has reliable caches with no area or latency overhead at such low voltages. Moreover, since the low-voltage mode is normally used when the processor load is low, performance is not the primary concern there.

[Figure 12: Normalized IPC (0.8-1.0) in the low-voltage mode for the Baseline, L1WDis_L2WDis, and L1WDis_L2BFix configurations across the same benchmark categories and GMEAN.]

Figure 12. Normalized IPCs in the low-voltage mode

From the results shown in Figures 11 and 12, it is not clear which scheme is better for the L2 cache. To distinguish the two schemes, Figure 13 examines the increase in memory accesses caused by the capacity loss of each scheme. Compared to the low-voltage baseline, Word-disable increases the number of memory accesses by 28% on average (up to 115%), while Bit-fix increases the number of memory accesses by only 7.5% on average (up to 13%). Bit-fix incurs fewer memory accesses due to its larger effective cache size and higher associativity compared to Word-disable. Since the extra memory accesses increase platform power consumption, we argue that Bit-fix is a better choice than Word-disable for the L2 cache.

[Figure 13: Normalized number of memory accesses (0.0-2.5) for the Baseline, L1WDis_L2WDis, and L1WDis_L2BFix configurations across the benchmark categories and GMEAN.]

Figure 13. Normalized number of memory accesses

7.2 Comparison with ST cell-based cache

As shown in Section 4, the ST cell is more reliable than the conventional 6T cell and enables a Vccmin of 530mV for a 2MB cache, which is close to the achievable Vccmin of our schemes. In this section we present a detailed comparison between our schemes and the ST cell-based cache. Based on the analysis in Section 7.1, we use a processor that applies Word-disable to the L1 cache and Bit-fix to the L2 cache as a representative application of our schemes.

Table 3 summarizes the achievable Vccmin, cache area, frequency, power consumption, IPC, and energy per instruction (EPI) of each scheme in the low-voltage mode. We normalize the area, power, and EPI results for each scheme to the corresponding results for the conventional 6T cell-based cache, while showing absolute results for Vccmin, frequency, and IPC. To estimate power consumption, we assume that dynamic power scales quadratically with supply voltage and linearly with frequency, and that static power scales with the cube of supply voltage [17]. We give an artificial advantage to the ST cell-based cache in our power calculations by ignoring the leakage overhead due to the ST cell's larger area. We also ignore the latency overhead of the ST cell in our performance simulations.

Table 3. Comparison between the ST cell-based cache and our proposed schemes

  Scheme          Vccmin   Norm.        Clock speed   Norm.   IPC    Norm.
                  (mV)     cache area   (MHz)         power          EPI
  6T Cell         825      1            1900          1       1.02   1
  ST Cell         530      2            700           0.19    1.17   0.45
  L1WDis_L2BFix   500      1.08         600           0.15    1.07   0.47

Our schemes (L1WDis_L2BFix) achieve a higher power reduction than the ST cell-based cache despite our ignoring the ST cell's leakage power overhead. Compared to the processor with the 6T cell-based cache, our schemes reduce power consumption by 85%, while the ST cell-based cache reduces it by 81%. Moreover, our schemes have significantly lower area overhead (8%) than the ST cell-based cache (100%). While one may reduce the area overhead of the ST cell-based cache by reducing the cache capacity, such a reduction comes at the cost of sacrificing performance in the high-voltage mode.

Since we scale the memory latency with frequency in our performance simulations, the ST cell-based cache and our schemes have higher IPC than the 6T cell-based cache. At the same time, the lower frequency makes execution time longer, which results in a smaller energy-efficiency improvement than the power reduction. Although the latency overhead and capacity loss of our schemes in the low-voltage mode result in a lower IPC compared to the ST cell-based cache, our schemes' lower power achieves a similar energy efficiency, which amounts to a 50% improvement over the 6T cell-based cache.
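The stated scaling assumptions can be checked with a short model: dynamic power proportional to V²f, static power proportional to V³ [17], and EPI proportional to power/(IPC·f). The 1/3 leakage share of baseline power used below is our assumption (the paper does not give the split); with it, the model lands close to the normalized power and EPI columns of Table 3.

```python
# Power and EPI relative to the 6T cell-based baseline of Table 3.
# STATIC_SHARE is a hypothetical leakage fraction, not from the paper.

STATIC_SHARE = 1 / 3
BASE_V, BASE_F, BASE_IPC = 0.825, 1900, 1.02   # 6T cell row of Table 3

def norm_power(v, f):
    """Power vs. the 6T baseline: dynamic ~ V^2 * f, static ~ V^3."""
    dynamic = (v / BASE_V) ** 2 * (f / BASE_F)
    static = (v / BASE_V) ** 3
    return (1 - STATIC_SHARE) * dynamic + STATIC_SHARE * static

def norm_epi(v, f, ipc):
    """Energy per instruction vs. the 6T baseline: power / (IPC * f)."""
    return norm_power(v, f) * (BASE_IPC * BASE_F) / (ipc * f)

print(f"ST cell:       power {norm_power(0.530, 700):.2f}, "
      f"EPI {norm_epi(0.530, 700, 1.17):.2f}")   # power ~0.19, EPI ~0.45
print(f"L1WDis_L2BFix: power {norm_power(0.500, 600):.2f}, "
      f"EPI {norm_epi(0.500, 600, 1.07):.2f}")   # power ~0.15
```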

8. Conclusions

In this paper, we demonstrated that the minimum supply voltage (Vccmin) is a critical parameter affecting microprocessor power and reliability. We proposed two novel architectural techniques that enable microprocessor caches (L1 and L2) to operate despite very high memory cell failure rates, thereby allowing a processor to decrease its Vccmin to 500mV while having minimal impact on the high-performance mode. The Word-disable scheme combines two consecutive cache lines, in low-voltage mode, to form a single cache line where only non-failing words are used. The Bit-fix scheme uses a quarter of the ways in a cache set, in low-voltage mode, to store positions and fix bits for failing bits in other ways of the set. We show that the Word-disable and Bit-fix schemes enable reliable operation below 500mV in exchange for a 50% and 25% loss of cache capacity, respectively. Compared to an unrealistic baseline in which all cache lines are reliable in the low-voltage mode, our schemes reduce performance by only 10%.

Acknowledgements We are especially grateful to Prof. Kaushik Roy and Jaydeep Kulkarni for helping us with the bit failure data and feedback. We thank Tingting Sha, Eric Ernst and the anonymous reviewers for their feedback on our work.

References

[1] A. Agarwal, et al., "Process Variation in Embedded Memories: Failure Analysis and Variation Aware Architecture," IEEE Journal of Solid-State Circuits, Vol. 40, No. 9, pp. 1804-1814, September 2005.
[2] A. J. Bhavnagarwala, X. Tang, and J. D. Meindl, "The Impact of Intrinsic Device Fluctuations on CMOS SRAM Cell Stability," IEEE Journal of Solid-State Circuits, Vol. 40, No. 9, pp. 1804-1814, September 2005.
[3] J. Chang, et al., "The 65-nm 16-MB Shared On-Die L3 Cache for the Dual-Core Intel® Xeon Processor 7100 Series," IEEE Journal of Solid-State Circuits, Vol. 42, No. 4, pp. 846-852, April 2007.
[4] J. Doweck, "Inside the Core™ Microarchitecture," In Proceedings of the 18th IEEE HotChips Symposium on High-Performance Chips, August 2006.
[5] M. Y. Hsiao, D. C. Bossen, and R. T. Chien, "Orthogonal Latin Square Codes," IBM Journal of Research and Development, Vol. 14, No. 4, pp. 390-394, July 1970.
[6] M. Khellah, et al., "A 256-Kb Dual-Vcc SRAM Building Block in 65-nm CMOS Process With Actively Clamped Sleep Transistor," IEEE Journal of Solid-State Circuits, Vol. 42, No. 1, pp. 233-242, January 2007.
[7] J. Kim, et al., "Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding," In Proceedings of the 40th International Symposium on Microarchitecture (MICRO-40), December 2007.
[8] J. P. Kulkarni, K. Kim, and K. Roy, "A 160 mV Robust Schmitt Trigger Based Subthreshold SRAM," IEEE Journal of Solid-State Circuits, Vol. 42, No. 10, pp. 2303-2313, October 2007.
[9] S. V. Kumar, C. H. Kim, and S. S. Sapatnekar, "Impact of NBTI on SRAM Read Stability and Design for Reliability," 7th International Symposium on Quality Electronic Design, pp. 6-11, March 2006.
[10] Y.-H. Lee, et al., "Prediction of Logic Product Failure Due to Thin Gate Oxide Breakdown," IEEE International Reliability Physics Symposium Proceedings, pp. 18-28, March 2006.
[11] S. Mukhopadhyay, H. Mahmoodi, and K. Roy, "Modeling of Failure Probability and Statistical Design of SRAM Array for Yield Enhancement in Nanoscaled CMOS," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 24, No. 12, pp. 1859-1880, December 2005.
[12] S. Ozdemir, et al., "Yield-Aware Cache Architectures," In Proceedings of the 39th International Symposium on Microarchitecture, pp. 15-25, December 2006.
[13] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design Perspective, Prentice Hall, 2003, p. 657.
[14] S. E. Schuster, "Multiple Word/Bit Line Redundancy for Semiconductor Memories," IEEE Journal of Solid-State Circuits, Vol. SC-13, No. 5, pp. 698-703, October 1978.
[15] A. Seznec and P. Michaud, "A Case for (Partially) Tagged Geometric History Length Branch Prediction," Journal of Instruction Level Parallelism, Vol. 8, February 2006.
[16] P. P. Shirvani and E. J. McCluskey, "PADded Cache: A New Fault-Tolerance Technique for Cache Memories," In Proceedings of the 17th IEEE VLSI Test Symposium, April 1999.
[17] Y. Taur and T. H. Ning, Fundamentals of Modern VLSI Devices, Cambridge University Press, 1998, p. 144.
