IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 60, NO. 4, AUGUST 2013
Reliability Models for SEC/DED Memory With Scrubbing in FPGA-Based Designs

Yubo Li, Brent Nelson, and Michael Wirthlin
Abstract—Block RAMs on field-programmable gate arrays (FPGAs) are susceptible to radiation-induced single-event upsets (SEUs) and are usually protected by fault-tolerant techniques such as single error correction/double error detection (SEC/DED) codes and/or scrubbing. Scrubbing can be divided into two categories based on the mechanism that controls the scrubbing interval. With deterministic scrubbing, each memory location is scrubbed (repaired) on a regular basis and the interval is fixed. With probabilistic scrubbing, a word is checked and corrected any time it is accessed (read or written); in this case, the scrubbing interval is exponentially distributed. This paper presents two mean-time-to-failure (MTTF) models for SEC/DED memory with scrubbing, which can be applied to general applications, including FPGA designs. The first considers only probabilistic scrubbing but takes into account nonuniform scrub rates for different memory locations. The second combines both deterministic scrubbing and probabilistic scrubbing into a single model. The proposed models provide more accurate MTTF estimates than prior models. An FPGA-based DSP application is studied as an example to show how the proposed models can be used for estimating memory reliability and to reveal the impact of probabilistic scrub rates and memory access distributions on memory reliability.

Index Terms—Error correction, FPGAs, memory fault tolerance, reliability modeling, single-event upset.
I. INTRODUCTION

FIELD-PROGRAMMABLE gate arrays (FPGAs) are an attractive technology for space applications due to their reprogrammability and low non-recurring engineering (NRE) cost. However, SRAM-based FPGAs are susceptible to radiation-induced single-event upsets (SEUs) because they employ volatile memory for configuration data and user data storage. Block random access memory (BRAM) is widely used in FPGAs to store user data. BRAMs are smaller and more distributed than traditional memory. In addition, the depth and width of BRAMs in Xilinx FPGAs are configurable [1], [2], making them more flexible to fit the functionality of various modules or designs. For example, in digital signal processing (DSP) applications, different modules usually perform different
Manuscript received September 27, 2012; revised December 27, 2012 and March 01, 2013; accepted March 04, 2013. Date of publication April 08, 2013; date of current version August 14, 2013. This work was supported in part by the I/UCRC Program of the National Science Foundation within the NSF Center for High-Performance Reconfigurable Computing (CHREC), Grant No. 0801876.
The authors are with the NSF Center for High-Performance Reconfigurable Computing (CHREC), Department of Electrical and Computer Engineering, Brigham Young University, Provo, UT 84602 USA (e-mail: [email protected]; [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNS.2013.2251902
operations on the data. This results in different BRAM sizes and nonuniform access rates. The impact of nonuniform access distributions on memory reliability will be discussed in the following sections.

The reliability of BRAMs can be jeopardized by radiation-induced SEUs. Single error correction/double error detection (SEC/DED) codes are a common fault-tolerant technique for memory systems. SEC/DED can detect any single-bit error in a memory word and correct it immediately, as well as detect, but not correct, double-bit errors. However, when a word is corrupted by a single-bit error, the data is only corrected on the output port, while the corrupted word remains erroneous in the memory storage. As errors accumulate in the memory, multiple faults in a word will eventually defeat SEC/DED. Thus, scrubbing, another fault-tolerant technique, is often employed along with SEC/DED to prevent a memory from accumulating a second error in a single word.

During deterministic scrubbing, each word of the memory is regularly read and checked for correctness. If a single-bit error is detected, it is corrected and written back to the memory. An appropriate scrub interval can effectively prevent a memory from accumulating a second error and therefore considerably improves the memory's reliability. However, another kind of scrubbing exists, called probabilistic scrubbing [3]. With probabilistic scrubbing, when a word is accessed (read or written), the data is checked for correctness and, if needed, corrected and written back into the memory. Depending on the design implementation, probabilistic scrubbing can be done any time a word is read, written, or both.

Others have investigated the problem of memory reliability over the past decades [3]–[6]. Both [3] and [6] have provided mean-time-to-failure (MTTF) models for SEC/DED memory with deterministic scrubbing. Saleh et al.
[3] have provided an MTTF model for SEC/DED memory with probabilistic scrubbing but assume that all memory locations have a uniform access (probabilistic scrub) rate, which is unrealistic for FPGA applications. In addition, none of the prior work has proposed models that combine both deterministic and probabilistic scrubbing in the same model. This paper will investigate the reliability and MTTF of memories in FPGA-based designs by proposing two new MTTF models that improve on the prior work. The first one considers only probabilistic scrubbing but takes into account nonuniform scrub rates for different memory locations. The second model combines both deterministic scrubbing and probabilistic scrubbing into a single model. The proposed models provide more accurate MTTF estimates compared with existing models and will allow for insights into the impact of probabilistic scrub rate and memory access patterns on memory reliability.
0018-9499 © 2013 IEEE
LI et al.: RELIABILITY MODELS FOR SEC/DED MEMORY WITH SCRUBBING IN FPGA-BASED DESIGNS
II. RELIABILITY MODELS FOR SEC/DED MEMORY WITH SCRUBBING

In this section, several previous models for SEC/DED memory with scrubbing are introduced and compared, and the motivation for the proposed models is discussed.
A. Saleh's Deterministic Model

In [3], Saleh et al. proposed a reliability model for SEC/DED memory with deterministic scrubbing. This model was developed for caches in desktop computing devices rather than for embedded FPGA memories. Saleh's deterministic model made five assumptions. 1) Transient faults occur with a Poisson distribution. 2) All bit flips are statistically independent. 3) A second bit flip in a word does not correct the first. 4) Bits storing ones and zeros have identical upset rates. 5) Every word is treated as an entity with error rate nλ, where λ is the bit failure rate and n is the number of bits in a single word. The second assumption indicates that multiple-bit errors induced by a single particle hit are not considered. The third assumption is for the sake of simplicity and has a conservative effect on the result. The fourth assumption is also for the sake of simplicity. The fifth assumption means that after a first error in a word, the model does not take into account that there are now only n − 1 bits that could be flipped by a second error; this makes the estimate more conservative than reality. The MTTF estimate of Saleh's deterministic model is given by (1), where μ_d is the deterministic scrub rate enforced by the scrubber and N is the number of words in the memory:

MTTF = 2μ_d / (N n² λ²)    (1)

B. Edmonds' Deterministic Model

Edmonds et al. [6] proposed another reliability model for SEC/DED memories with deterministic scrubbing. This model was created for BRAMs on a Xilinx FPGA XQR5VFX130 [2] and therefore had a similar context to the proposed models of this paper. The assumptions of Edmonds' deterministic model are the same as Saleh's deterministic model, except for the fifth one. Edmonds' model assumes that every bit has an error rate of λ; thus, the error rate of the first bit flip in a word is nλ, and the second bit flip occurs with a rate of (n − 1)λ. Equation (2) gives Edmonds' deterministic model:

MTTF = 2μ_d / (N n(n − 1) λ²)    (2)

It is worth noting that although Saleh's deterministic model and Edmonds' deterministic model take different derivation approaches, they give almost the same result. The only difference is in the denominator, where Saleh's model has the term n² but Edmonds' model has n(n − 1). This difference is caused by the different fifth assumption discussed above. Specifically, the ratio of the result given by Saleh's model to that of Edmonds' model is (n − 1)/n. Since n, the number of bits per word, is usually 36 or 72, this difference is very small.

C. Saleh's Probabilistic Model

Saleh et al. also presented a model for SEC/DED memories with probabilistic scrubbing [3], following the same assumptions as Saleh's deterministic model. This model assumes that a word will be repaired in the memory whenever it is read or written. Because READ and WRITE operations occur randomly, the scrubbing behavior is probabilistic. In addition, by assuming the use of the least recently used (LRU) approach for cache line replacement, it is claimed in [3] that all words can be assumed to share the same probabilistic scrub rate μ. Unlike the scrub interval in deterministic scrubbing, which is fixed, the scrub interval in probabilistic scrubbing is exponentially distributed with rate μ. The mathematical expression of Saleh's probabilistic model is

MTTF = μ / (N n² λ²)    (3)
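The relationship between these three closed forms can be sketched in a few lines of Python. The formulas below follow the reconstructed equations above; all parameter values are illustrative assumptions, not figures from the paper.

```python
# Sketch of the three prior MTTF models. Symbols: lam = per-bit upset
# rate, n = bits per word, N = words, mu_d = deterministic scrub rate,
# mu = probabilistic scrub rate. All values are illustrative.

def mttf_saleh_det(lam, n, N, mu_d):
    # Saleh: each word treated as an entity with error rate n*lam,
    # so both the first and second flips occur at rate n*lam
    return 2.0 * mu_d / (N * n**2 * lam**2)

def mttf_edmonds_det(lam, n, N, mu_d):
    # Edmonds: the second flip can only hit the remaining n-1 bits
    return 2.0 * mu_d / (N * n * (n - 1) * lam**2)

def mttf_saleh_prob(lam, n, N, mu):
    # Saleh: exponentially distributed scrub intervals with rate mu
    return mu / (N * n**2 * lam**2)

lam, n, N, mu = 1e-9, 72, 1024, 0.01
ratio = mttf_saleh_det(lam, n, N, mu) / mttf_edmonds_det(lam, n, N, mu)
print(ratio)  # (n-1)/n = 71/72, i.e. the two models differ by ~1.4%
```

As the printed ratio shows, for realistic word widths the difference between the two deterministic models is negligible, which matches the observation above.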
D. Motivation of the Proposed Models In Section I, it was pointed out that probabilistic scrubbing can occur when a word is being written or read. Scrub-on-write and scrub-on-read are mathematically the same, but they are different in implementation. For purposes of derivation, we will discuss probabilistic scrubbing in general without differentiating between the two (similar to what Saleh did). After the model derivation is complete, we then discuss the different implementation costs of probabilistic read scrubbing and probabilistic write scrubbing for real designs. Because of the context of cache memory in which it was developed, Saleh’s probabilistic model assumes a uniform scrub rate for all the memory locations. However, this is not always true for FPGA applications. For example, in digital signal processing (DSP) applications, different modules usually perform different operations on the data. This would result in different BRAM sizes and nonuniform access rates to BRAMs. In order to study the reliability of the memory on FPGAs, a new reliability model will be proposed by this paper that takes into account nonuniform access rates for different memory locations. Furthermore, although Saleh’s deterministic model and Edmonds’ deterministic model both aim at memories with deterministic scrubbing, none of the prior work has proposed models that combine both deterministic scrubbing and probabilistic scrubbing in the same model. The second model proposed by this paper, called the “mixed scrubbing” model, will combine both scrubbing mechanisms and therefore provide more accurate MTTF estimates. The proposed models are applicable to any general applications that involve SEC/DED memory. In this paper, however, we will focus on FPGA-based designs that contain BRAMs.
III. PROPOSED MODEL FOR PROBABILISTIC SCRUBBING WITH PER-WORD SCRUB RATES

In this section, the proposed model for probabilistic scrubbing with per-word scrub rates will be derived and analyzed, and several insights revealed by this model will be discussed.

A. Mathematical Derivation

The proposed models will be built on the following assumptions: 1) Transient faults occur with a Poisson distribution. 2) All bit flips are statistically independent. 3) A second bit flip in a word does not correct the first. 4) Bits storing ones and zeros have identical upset rates. 5) Every bit has an independent error rate λ. 6) Every word i has its own scrub rate μ_i. The first five assumptions are the same as Edmonds' deterministic model. The sixth assumption is new to the proposed model and will result in more accurate estimates and novel insights into the relation between probabilistic scrub rates and reliability. This assumption, however, requires more parameters and a more detailed understanding of how the memory is used.

This subsection presents the derivation of the proposed model for probabilistic scrubbing. The memory system has a nonuniform scrub distribution, meaning that the scrub rate is different for each word. When a memory location is accessed (read or written), we assume it is corrected in memory if needed. The memory fails whenever two errors accumulate in one word and defeat the SEC/DED protection. For a single word, both the error and the repair processes occur continuously and independently; therefore, their occurrences follow the Poisson distribution. The time interval between two scrubs or two bit errors in the same memory location is not fixed and can be modeled by an exponential distribution. Thus, this random process can be modeled using a Markov model.

1) Markov Model: The continuous-time Markov model of a single word is given in Fig. 1. Each state represents the number of errors in the word, and the directed arcs between states indicate the transition rates from one state to another. For example, if the total number of bits in the word is n, and each bit has the same error rate λ, then the error rate of the entire word is nλ, which labels the arc from state 0 to state 1. If the word is already in state 1, a scrub operation will bring it back to state 0 because the single-bit error is corrected. Therefore, the transition rate from state 1 to state 0 is the probabilistic scrub rate μ_i. State 2 is the failing state, or the absorbing state, of a word when SEC/DED is used, so there is no outgoing arc from state 2.

Fig. 1. State transition rate diagram for word i with scrub rate μ_i.

From Fig. 1 we can write the following transition rate matrix (rows and columns ordered as states 0, 1, 2):

A = [ −nλ     nλ                  0
      μ_i    −((n − 1)λ + μ_i)    (n − 1)λ
      0       0                   0 ]

2) MTTF of a Single Word: The solution to this Markov chain can be obtained by defining P_k(t) as the probability of reaching state k in the time interval (0, t). From the transition matrix, we get the following set of differential equations (Chapman–Kolmogorov equations), which represents the dynamics of the probability system [3]:

dP_0/dt = −nλ P_0 + μ_i P_1
dP_1/dt = nλ P_0 − ((n − 1)λ + μ_i) P_1
dP_2/dt = (n − 1)λ P_1    (4)
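As a sanity check on this chain before continuing with the derivation, the Chapman–Kolmogorov equations can be integrated numerically and compared against the closed-form absorption probability obtained later. The rates below are illustrative assumptions, not values from the paper.

```python
import math

# Forward-Euler sketch of the 3-state chain for one word: rates are
# n*lam (state 0 -> 1), (n-1)*lam (1 -> 2, failure) and mu (1 -> 0,
# scrub). Parameter values are illustrative only.
n, lam, mu = 8, 1e-3, 0.5

p0, p1, p2 = 1.0, 0.0, 0.0        # the word starts error free
dt, t_end = 1e-3, 50.0
for _ in range(int(t_end / dt)):
    d0 = -n * lam * p0 + mu * p1
    d1 = n * lam * p0 - ((n - 1) * lam + mu) * p1
    d2 = (n - 1) * lam * p1
    p0, p1, p2 = p0 + d0 * dt, p1 + d1 * dt, p2 + d2 * dt

# Closed-form P2(t) from the roots of s^2 + ((2n-1)lam + mu)s + n(n-1)lam^2
b = (2 * n - 1) * lam + mu
c = n * (n - 1) * lam * lam
disc = math.sqrt(b * b - 4 * c)
r1, r2 = (-b + disc) / 2, (-b - disc) / 2
p2_exact = 1 + (r2 * math.exp(r1 * t_end) - r1 * math.exp(r2 * t_end)) / (r1 - r2)
print(abs(p2 - p2_exact))  # agreement to roughly the Euler step error
```

The integrated probability and the closed form agree closely, which is a useful cross-check on the algebra that follows.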
Taking the Laplace transform of the set of equations in (4) gives

s P_0(s) − 1 = −nλ P_0(s) + μ_i P_1(s)
s P_1(s) = nλ P_0(s) − ((n − 1)λ + μ_i) P_1(s)
s P_2(s) = (n − 1)λ P_1(s)    (5)

Then, by rearranging the set of equations in (5), we get

(s + nλ) P_0(s) − μ_i P_1(s) = 1
−nλ P_0(s) + (s + (n − 1)λ + μ_i) P_1(s) = 0
−(n − 1)λ P_1(s) + s P_2(s) = 0    (6)

Therefore, the vector P(s) = [P_0(s), P_1(s), P_2(s)]^T can be expressed as P(s) = Φ(s)^{−1} P(0), where Φ(s) is the coefficient matrix of (6) and P(0) is the state-probability vector at time 0. It is assumed that the word has no error at time 0; in other words, the probability that the word is in state 0 at time 0 is 1. Therefore, we get P(0) = [1, 0, 0]^T. Then, P_2(s) can be calculated as follows:

P_2(s) = n(n − 1)λ² / [ s (s − r_1)(s − r_2) ]    (7)
where

r_1 = [ −((2n − 1)λ + μ_i) + sqrt( ((2n − 1)λ + μ_i)² − 4n(n − 1)λ² ) ] / 2    (8)

r_2 = [ −((2n − 1)λ + μ_i) − sqrt( ((2n − 1)λ + μ_i)² − 4n(n − 1)λ² ) ] / 2    (9)

By expanding (7) in partial fractions in terms of r_1 and r_2, we get

P_2(s) = 1/s + r_2 / [ (r_1 − r_2)(s − r_1) ] − r_1 / [ (r_1 − r_2)(s − r_2) ]    (10)

Taking the inverse Laplace transform of (10) gives

P_2(t) = 1 + [ r_2 e^{r_1 t} − r_1 e^{r_2 t} ] / (r_1 − r_2)    (11)

By definition, P_2(t) is the probability that the word is in state 2 after time interval t, and the reliability of word i is the probability that the word is in state 0 or state 1 at time instant t. Therefore, we have

R_i(t) = 1 − P_2(t) = [ r_1 e^{r_2 t} − r_2 e^{r_1 t} ] / (r_1 − r_2)    (12)

Then MTTF_{w,i}, the MTTF of word i, is given by

MTTF_{w,i} = ∫_0^∞ R_i(t) dt = −(r_1 + r_2)/(r_1 r_2) = ((2n − 1)λ + μ_i) / (n(n − 1)λ²)    (13)

3) MTTF of the Memory: By definition, the MTTF of the entire memory with N words can be calculated by

MTTF = ∫_0^∞ ∏_{i=1}^{N} R_i(t) dt

However, this equation is difficult to evaluate, and we seek ways to simplify the evaluation of R_i(t) in (12). Equations (8) and (9) can be simplified when μ_i ≫ λ. This is usually true, as the bit failure rate is usually much smaller than the rate at which memory words are being read and/or written. With this simplification, r_1 approaches −n(n − 1)λ²/μ_i and r_2 approaches −μ_i. Since n is usually 8, 16, 32, or 64, |r_1| is very small compared with |r_2|. The term −r_2/(r_1 − r_2) in (12) simplifies to the constant 1, and the term r_1/(r_1 − r_2) simplifies to the constant 0. In this case, (12) can be approximated as

R_i(t) ≈ e^{r_1 t} = e^{ −n(n − 1)λ² t / μ_i }    (14)

Fig. 2. Ratio of the approximated value to the exact value.

Fig. 2 shows the ratio of the approximated value given by (14) to the exact value given by (12). The abscissa of Fig. 2 is the ratio of μ_i to λ, and the figure shows how close (14) is to (12) for different values of μ_i/λ and n. Note that when μ_i/λ is small, a large n leads to a significant error. However, when μ_i/λ is large enough, the effect of n becomes negligible and the error drops below 0.1%. With the assumption that μ_i ≫ λ, (14) can be viewed as a good approximation of (12). In this case, the reliability of word i follows an exponential distribution, and thus word i can be seen as an entity with a constant error rate λ_{w,i}:

λ_{w,i} = 1 / MTTF_{w,i} = n(n − 1)λ² / ((2n − 1)λ + μ_i)

Consequently, the MTTF of the entire memory can be calculated by

MTTF = 1 / Σ_{i=1}^{N} λ_{w,i} = 1 / Σ_{i=1}^{N} [ n(n − 1)λ² / ((2n − 1)λ + μ_i) ]    (15)

Note that if μ_i is not orders of magnitude higher than λ (in extreme cases, such as a ROM, μ_i could be zero), then (15) is not accurate and should not be used. When (15) is valid, μ_i is orders of magnitude greater than (2n − 1)λ; therefore, we can discard the term (2n − 1)λ to further simplify (15):

MTTF = 1 / Σ_{i=1}^{N} [ n(n − 1)λ² / μ_i ]    (16)
This approximation has only a negligible effect on the result and makes the result more conservative.

B. Impact of Probabilistic Scrub Rate on Reliability

With the knowledge from the previous subsection, it becomes possible to investigate the reliability of a memory with different
Fig. 4. Variation of the reliability with both types of scrubbing with time; upper and lower bound of MTTF.
Fig. 3. Reliability of a memory with two groups.
probabilistic scrub rates. As a simple illustrative example, consider a memory with two groups of words with significantly different scrub rates. Each group has its own scrub rate, μ_1 and μ_2, respectively. Group 1 has N_1 words, and group 2 has N_2 words. Thus, (16) becomes

MTTF = 1 / [ n(n − 1)λ² (N_1/μ_1 + N_2/μ_2) ]    (17)
Fig. 3 shows (17) with N_1 increasing and the total number of words N_1 + N_2 fixed at 100. In addition, μ_1 is four orders of magnitude greater than μ_2. The memory reaches its highest possible MTTF when all words are in group 1 and its lowest possible MTTF when all words are in group 2. These two bounds are marked by the two dotted lines in Fig. 3. It is notable that as N_1 increases, the improvement of memory reliability is strongly nonlinear: even when only 10% of the words belong to group 2, the MTTF of the memory remains roughly three orders of magnitude below its upper bound, and only when all the words are in group 1 does the MTTF skyrocket to that bound. In summary, this section has presented a model that extends that of Saleh et al. [3] by modeling the effects of nonuniform probabilistic scrub rates. It has also presented a simple illustration showing that a small fraction of words with low scrub rates can drastically hurt the reliability of a memory.
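The two-group example can be reproduced with a short Python sketch of (17). The word width, upset rate, and scrub rates below are illustrative assumptions chosen only to expose the nonlinearity; they are not the values behind Fig. 3.

```python
# Sketch of the two-group MTTF of (17). n, lam, mu1 and mu2 are
# illustrative assumptions; mu1 is four orders of magnitude above mu2,
# mirroring the example in the text.
n, lam = 72, 1e-9          # bits per word, per-bit upset rate (1/s)
mu1, mu2 = 1.0, 1e-4       # per-word probabilistic scrub rates (1/s)
total = 100                # total number of words, N1 + N2

def mttf_two_groups(n1):
    # Eq. (17): the memory failure rate is the sum of per-word rates,
    # n(n-1)*lam^2/mu, over both groups
    n2 = total - n1
    rate = n * (n - 1) * lam**2 * (n1 / mu1 + n2 / mu2)
    return 1.0 / rate

for n1 in (0, 50, 90, 99, 100):
    print(n1, mttf_two_groups(n1))
# The last step, from n1 = 99 to n1 = 100, raises the MTTF by about
# two orders of magnitude: one slow word dominates the sum.
```

Because the per-word rates add as 1/μ_i, the slowest-scrubbed words dominate the harmonic-style sum, which is exactly the nonlinearity visible in Fig. 3.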
IV. PROPOSED MODEL FOR MIXED SCRUBBING

In this section, the proposed model mixing deterministic scrubbing and probabilistic scrubbing will be derived and analyzed.

A. Mathematical Derivation

In Section III, the reliability of word i with probabilistic scrub rate μ_i was given by (12). Then, the reliability of an N-word memory is

R_m(t) = ∏_{i=1}^{N} R_i(t)    (18)

When deterministic scrubbing is also taken into account, this reliability is valid only before the first deterministic scrub. We denote the deterministic scrub rate as μ_d and the scrub interval as T_s = 1/μ_d. After the memory is scrubbed at time T_s, it is considered error free. Therefore, the reliability of the memory at time kT_s (k a positive integer) is the probability that the memory has survived all k previous deterministic scrub intervals, which can be expressed as

R(kT_s) = [ R_m(T_s) ]^k    (19)

Equation (19) has a staircase characteristic which helps the evaluation of the MTTF. In Fig. 4, R(t) is depicted by the curve, and the MTTF is the area of the shaded part below the curve. While the exact area of the shaded part is nonintuitive, we can easily calculate its upper bound and lower bound. For example, the upper bound and lower bound of R(t) during the interval [0, T_s) are 1 and R_m(T_s), respectively. For each following interval, the upper bound and the lower bound decrease by a factor of R_m(T_s). Using the summation formula for a geometric progression, we get the upper bound and the lower bound of the MTTF:

MTTF_upper = T_s Σ_{k=0}^{∞} [R_m(T_s)]^k = T_s / (1 − R_m(T_s))    (20)

MTTF_lower = T_s Σ_{k=1}^{∞} [R_m(T_s)]^k = T_s R_m(T_s) / (1 − R_m(T_s))    (21)

In (20) and (21), R_m(T_s) is the reliability of the memory at time T_s when only probabilistic scrubbing is applied, which can be calculated using (18).

B. Impact of Deterministic Scrubbing on Reliability

Deterministic scrubbing can be applied when probabilistic scrubbing cannot provide sufficient protection, and the mixed scrubbing model proposed by this paper includes both types of scrubbing. In order to better understand the characteristics of the
Fig. 5. MTTF given by the mixed scrubbing model for a memory with two different access rates.
proposed model for mixed scrubbing, we take the same example from Section III-B, where the memory has 100 words in total, divided into two groups. The first group has a probabilistic scrub rate of 1, and the second group has a probabilistic scrub rate of 10^−4 (four orders of magnitude lower, as before). The deterministic scrub rate is 0.01 for every location in the memory. Fig. 5 presents the MTTF of the entire memory as the number of words in group 1 varies. The lower dotted curve in Fig. 5 shows the MTTF when only probabilistic scrubbing is considered (16), and the horizontal line shows the MTTF of deterministic scrubbing alone. When all the words are in group 2, the memory has the lowest MTTF. If deterministic scrubbing is combined with probabilistic scrubbing, the MTTF can be greatly elevated. From Fig. 5, we can see that deterministic scrubbing provides a "base protection" for the memory, and it helps the most when most words have low probabilistic scrub rates. From this discussion of the mixed scrubbing model, we can conclude that deterministic scrubbing can improve the MTTF of the entire memory when part of the memory is accessed less frequently than other parts, as long as the deterministic scrub rate is higher than the lowest access rate. In addition, if there are locations with a scrub rate of zero, adding deterministic scrubbing is an effective way to provide a minimum level of protection.

V. VALIDATION OF THE PROPOSED MODELS

Two sets of Monte Carlo simulations were run to validate the proposed models. Matlab programs were written to simulate upsets as well as scrubbing behaviors of a memory. Fig. 6 presents a flowchart of the Matlab program used for the model of probabilistic scrubbing. The program used for the model of mixed scrubbing is very similar, with only a few slight changes. The key is to generate random scrubbing events and bit error events with their intervals exponentially distributed.
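The simulation loop described above can be sketched in Python (in place of Matlab). The rates below are illustrative assumptions, not the values used in the paper's validation runs, and the word count and run count are kept small so the sketch finishes quickly.

```python
import random

# Monte Carlo sketch of the probabilistic-scrubbing model: each word
# follows the 0 -> 1 -> {0 or 2} chain with exponential bit-error and
# scrub intervals; the memory fails when any word reaches two errors.
# All parameter values are illustrative assumptions.
random.seed(1)
n, lam = 8, 1e-4                                  # bits/word, per-bit upset rate
mus = [random.uniform(0.2, 0.8) for _ in range(6)]  # per-word scrub rates

def word_ttf(mu):
    # Jump simulation of one word's chain until absorption in state 2
    t = 0.0
    while True:
        t += random.expovariate(n * lam)           # wait for a first error
        err = random.expovariate((n - 1) * lam)    # time to a second error
        scrub = random.expovariate(mu)             # time to the next scrub
        if err < scrub:
            return t + err                         # two errors: word fails
        t += scrub                                 # scrubbed back to clean

runs = 400
sim = sum(min(word_ttf(mu) for mu in mus) for _ in range(runs)) / runs
model = 1.0 / sum(n * (n - 1) * lam**2 / mu for mu in mus)  # eq. (16)
print(sim / model)  # close to 1, up to Monte Carlo noise
```

With enough runs the simulated mean converges on the estimate of (16), mirroring the within-3% agreement reported below.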
When two errors show up in a single word, a failure of the memory occurs and the time to failure is recorded. After 100 iterations of such a simulation, the mean time to failure is calculated and then compared to the theoretical value obtained by the proposed models. Both proposed models give results within 3% of the simulated MTTFs.

Fig. 6. Flowchart of the Monte Carlo simulation.

In the simulation for the proposed model of probabilistic scrubbing, the per-word scrub rates μ_i are drawn randomly from the range (100, 200) per second, and the bit failure rate λ is five orders of magnitude lower, so (15) is valid. The difference between the MTTF estimate given by (15) and the mean of the simulation results is 1.03%. In the simulation for the model of mixed scrubbing, the same λ and μ_i are used, and deterministic scrubbing is applied at a fixed interval. The estimate given by (21) is again very close to the mean of the simulation results, with a difference of 2.08%.

VI. A REAL-LIFE EXAMPLE ILLUSTRATING THE VALUE OF THE PROPOSED MODELS

The preceding derivation and simulation of our new probabilistic scrub models for memories were done under the assumption that the design contains circuitry such that any time a memory word is accessed, it is corrected in memory to reduce the probability of accumulating a second error in a
Fig. 7. A block diagram of the 3-FPGA MIMO receiver [7].
TABLE I BRAM USE IN THE SCTR DESIGN
The MTTF of brik2 is calculated using the equation for SEC/DED memory without scrubbing [3].
word. In other words, thus far we have not differentiated between scrub-on-read and scrub-on-write. In reality, the costs of implementing these two types of probabilistic scrubbing are quite different. For example, scrub-on-write requires no work on the designer's part: a word is considered correct when it is written to memory, and it is only subsequent single-event upsets that corrupt it. Thus, all designs already contain probabilistic scrub-on-write protection, something that previous models have failed to take into account. In contrast, scrub-on-read requires additional circuitry and clock cycles to ensure that a corrupted word can be corrected in memory each time it is read. In this section, an example is provided that shows the benefit of scrub-on-write only, which is reflected in the two models of the previous sections.

FPGAs are an attractive technology for DSP applications because they are particularly suitable for parallel algorithm implementation. In [7], Lavin et al. described the design of a space-time coded telemetry receiver (SCTR), which solves the issue of data dropouts through the use of multiple transmit antennas. Fig. 7 shows a block diagram of this design. The SCTR design spreads over three FPGAs, called brik1, brik2, and brik3. BRAM memory use in each FPGA is summarized in Table I. Simulations were run on each FPGA block, and the write rates of all BRAM locations in each FPGA were collected. The data width of the BRAM memories is 18 bits, and the clock period is approximately 5 ns. The first column of Table I gives the number of memory words in each design block. The "Least" column presents the write rate (per second) of the least-written location in each FPGA. The third column, "Never", shows the number of words that were never written to. Using (15), the last column gives the MTTF estimates when all memories in each FPGA are considered as an entity.

In brik1, there are 23 BRAMs and 8448 words in total. All memory words in brik1 were written, and the MTTF of the memory on brik1 was calculated using (15). Brik3 has 32 BRAMs and 2752 words; similar to brik1, all memory words in brik3 were written, and its MTTF was calculated in the same way. Brik2 has 18 BRAMs and 5664 words. However, 12 words from two BRAMs were never written and thus have a probabilistic scrub rate of zero. Because (15) is not accurate when μ_i = 0, the MTTF of the 12 never-written words is calculated using the equation for SEC/DED memory without scrubbing [3], and the result is 32.35 years. Thus the MTTF of the BRAMs on brik2 must be shorter than 32.35 years. This confirms the discussion in Section III that words with low write rates drastically reduce the MTTF of the entire memory system. If the 12 words could be excluded from the memory, the MTTF would be greatly elevated.

There are two points in this example that need to be clarified. First, MTTF is a metric that provides an averaged failure time over a large number of deployments. For example, the MTTF of the BRAMs on this 3-FPGA MIMO receiver is limited by brik2 and is less than 32.35 years; if 100 such designs are deployed, the overall MTTF will be less than 0.32 years. Second, unlike memory in computer systems, BRAMs can be initialized during FPGA configuration. This means that a write rate of 0 for a BRAM location does not necessarily imply a read rate of 0. For example, the 12 words in brik2 that were never written were read thousands of times during the simulation. Therefore, the reliability of these 12 words sets the limit on the reliability of the entire memory. This example further shows that in DSP applications, most memory cells tend to be written frequently because of the streaming of large amounts of data through memory. The reliability of these locations can be estimated using our model for probabilistic scrubbing. Whether that reliability is sufficient should be decided based on the mission requirements; if it is, then no additional reliability technique is required for these locations.
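The brik2 argument can be sketched numerically. The per-bit upset rate below is an illustrative assumption (the paper's actual rate is not reproduced here), the no-scrub word MTTF is modeled as the expected time to accumulate two upsets, and treating the never-written words as rate-additive is an approximation that holds when their lifetimes are roughly exponential.

```python
# Sketch of how a few never-written words cap the memory's MTTF.
# lam is an assumed, illustrative per-bit upset rate; n = 18 matches
# the SCTR data width mentioned above.
n = 18                      # BRAM data width in the SCTR design (bits)
lam = 1e-9                  # assumed per-bit upset rate (1/s)
SECONDS_PER_YEAR = 3.156e7

def word_mttf_no_scrub(n, lam):
    # Expected time to a first upset (rate n*lam) plus a second upset
    # among the remaining bits (rate (n-1)*lam); no scrub ever repairs it
    return 1.0 / (n * lam) + 1.0 / ((n - 1) * lam)

def memory_mttf_bound(num_unscrubbed, n, lam):
    # Rough rate-additive bound: k unscrubbed words fail independently,
    # so their combined failure rate caps the whole memory's MTTF
    return word_mttf_no_scrub(n, lam) / num_unscrubbed

bound = memory_mttf_bound(12, n, lam)
print(bound / SECONDS_PER_YEAR)        # bound in years for this assumed lam
print(bound / 100 / SECONDS_PER_YEAR)  # fleet of 100 deployed designs
```

With the paper's actual upset rate this bound works out to 32.35 years for the 12 words as a group divided among deployments; here the point is only the structure of the argument, not the number.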
This example also shows how severely words with low write rates can hurt the overall reliability of the memory. Therefore, memories that have words with very low or zero write rates must be protected. Adding a scrubber is the most straightforward way to protect those infrequently written locations. Alternatively, if the memory access pattern is known, the designer may be able to modify the design to periodically correct these locations. Another possible solution is replacing these locations with triplicated registers if there are only a few of them.

VII. CONCLUSION

In this paper, previous reliability models for SEC/DED memory with scrubbing are discussed and compared, and two new models are derived based on the characteristics of FPGA applications. The proposed model for probabilistic scrubbing takes into account nonuniform scrub rates, and the proposed model for mixed scrubbing combines both probabilistic scrubbing and deterministic scrubbing. The proposed models indicate that memory locations that are frequently accessed tend to be reliable, while locations that are rarely accessed may need extra protection in order to maintain the overall reliability of the entire memory. Therefore, it can
help the designers if they know the memory access rates of their designs. In addition, MATLAB-based Monte Carlo simulations show that both of the proposed models give results within 3% of the simulated MTTF. DSP is a common application of FPGAs. Experiments show that in the DSP application that was taken as the case study, most memory locations are frequently written and therefore have considerably high reliability. For these locations, no additional reliability technique is required. However, words with very low write rates can significantly hurt the overall reliability of the entire memory. Designers should be aware of such locations and protect them appropriately. If there are just a few such words, they can be put into triplicated registers. On the other hand, if such words make up most of the memory, then using deterministic scrubbing for the entire memory could be more convenient. To sum up, the proposed models can help designers to better understand the reliability of their designs and effectively make case-specific decisions.
REFERENCES [1] “Virtex-4 FPGA User Guide,” Xilinx Corp., San Jose, CA, USA, Tech. Rep. UG070 (v2.6), Dec. 1, 2008. [2] “Virtex-5 FPGA User Guide,” Xilinx Corp., San Jose, CA, USA, Tech. Rep. UG190 (v5.3), May 17, 2010. [3] A. M. Saleh, J. J. Serrano, and J. H. Patel, “Reliability of scrubbing recovery-techniques for memory systems,” IEEE Trans. Reliab., vol. 39, pp. 114–122, Apr. 1990. [4] J. Ayache and M. Diaz, “A reliability model for error correcting memory systems,” IEEE Trans. Reliab., vol. R-28, pp. 310–315, Oct. 1979. [5] S. A. Elkind and D. P. Siewiorek, “Reliability and performance of error-correcting memory and register arrays,” IEEE Trans. Comput., vol. C-29, pp. 920–927, Oct. 1980. [6] G. Allen, L. Edmonds, C. W. Tseng, G. Swift, and C. Carmichael, “Single-event upset (SEU) results of embedded error detect and correct enabled Block Random Access Memory (Block RAM) within the Xilinx XQR5VFX130,” IEEE Trans. Nucl. Sci., vol. 57, no. 6, pp. 3426–3431, Dec. 2010. [7] C. Lavin, B. Nelson, J. Palmer, and M. Rice, “An FPGA-based space-time coded telemetry receiver,” in Proc. Aerosp. Electron. Conf., Dayton, OH, USA, Jul. 2008, pp. 250–256.