Probabilistic Compensation for Digital Filters Under Pervasive Noise-Induced Operator Errors

Maryam Ashouei, Soumendu Bhattacharya, and Abhijit Chatterjee
School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA 30302
{ashouei, soumendu, chat}@ece.gatech.edu

Abstract

It is well known that scaled CMOS technologies are increasingly susceptible to induced soft errors and environmental noise. Probabilistic checksum-based error detection and compensation has been proposed in the past for scaled DSP circuits for which a certain level of inaccuracy can be tolerated as long as system-level Quality-of-Service (QoS) metrics are satisfied. Although the technique has been shown to be effective in improving the SNR of digital filters, it can only handle errors that occur in the system states. However, the transient-error rate of combinational logic is increasing with technology scaling. Therefore, handling errors in the arithmetic logic circuitry of DSP systems is also essential. This is a significantly more difficult task because a single error at the output of an adder or multiplier can propagate to more than one system state, causing multiple states to be erroneous. In this paper, a unified scheme that can address probabilistic compensation for errors both in the system states and in the embedded adders and multipliers of DSP filters is developed. It is shown that by careful checksum code design, significant SNR improvements (up to 13 dB) can be obtained for linear filters in the presence of soft errors.
1. Introduction

Because of technology scaling and increased susceptibility of DSM circuitry to transient errors (originating from environmental noise and cosmic rays), it has become necessary to design error detection and correction capability into future logic designs for reliable computation. Technology scaling increases the vulnerability of a circuit to transient errors for several reasons. First, feature size reduction reduces the average node capacitance, thereby increasing the voltage fluctuation at a node due to a fixed amount of charge deposited by radiation. Second, supply voltage reduction in every technology generation reduces noise margins and aggravates the transient error generation problem. Third, the increase of clock frequency with every successive technology generation raises the chance of an error being latched and propagated to a primary output. Moreover, because of shorter pipeline stages, the number of gates through which an error propagates (and hence attenuates) is smaller. Therefore, the probability of an error being masked in a modern high-performance digital system is becoming increasingly small compared to earlier technologies. This work was supported by NSF-ITR under award CCR-0220259 and by GSRC MARCO under award 2003-DT660.
It is known that memory elements such as DRAM and SRAM are susceptible to errors and are protected from single event upsets using error correction codes. It is expected that in the future, error rates of combinational logic in scaled technologies will escalate by 9 orders of magnitude and will equal the error rate of unprotected memory elements [1]. Hence, handling of combinational logic errors is also essential for future logic design [2]. Prior work in [3] and [4] introduced the probabilistic checksum-based error correction technique that targets errors in only the states of DSP filters. In this paper, a unified theory for developing checksum-based probabilistic compensation of errors in the states and in the combinational logic is developed. It is shown that, despite the fact that an error at the output of an adder or multiplier can propagate to more than one state, the code can be designed in such a way that no additional hardware is necessary for compensation. Algorithms for selecting the right checksum code parameters are developed. The rest of the paper is organized as follows. In the next section, an overview of prior work is presented. Next, a discussion of linear digital state variable systems is presented. This is followed by a summary of the probabilistic checksum correction approach. The generalization of the probabilistic checksum-based error compensation for combinational circuits is presented. Finally, simulation results and the conclusion are discussed.

2. Previous Work

Traditionally, fault-tolerant techniques were used in applications where an error could result in an irreversible consequence. Different forms of redundancy, such as hardware, software, time, and/or information redundancy, are used in fault-tolerant systems [5]. Hardware redundancy techniques, such as triple modular redundancy (TMR), have high hardware overhead but are geared to full compensation of occurring errors.
The high area and power cost associated with these techniques makes them impractical for more general applications. Therefore, techniques such as time redundancy, partial duplication [1], or software redundancy have been proposed. Such techniques have far less hardware overhead but negatively impact system performance. For reliable DSP functions, algorithm-based fault-tolerant
methods focused on matrix and FFT operations have been proposed [7]-[10]. These methods aim at minimizing both the hardware and the performance penalty. In all of the techniques discussed above, the focus is on reducing the probability of an error being propagated to the system output, either by masking the error or by detecting it and performing correction as soon as possible after its occurrence. In recent years, with the increase in soft-error rates of scaled technologies, significant research has been dedicated to protecting circuits against soft errors. This is done by using hardware-level solutions that aim to reduce the probability of a transient (soft) error impacting the circuit functionality, with minimal impact on the circuit delay, power, and area. Soft-error resilience techniques can be categorized into those targeting the flip-flops and those designed for the combinational logic. Techniques in [11] and [12] are two recent examples proposed to make the combinational logic soft-error resilient. Techniques in [12], [14], and [15] aim to protect the system flip-flops. Algorithmic noise tolerant (ANT) techniques were proposed in [16]-[19] to compensate for errors (noise) introduced into a digital signal processing (DSP) system. While coding techniques are used for protecting the memory circuitry, the cost of data and hardware redundancy necessary to implement coding techniques is a key barrier to their widespread use in other types of circuits. Theoretically, a code of distance t+1 is necessary for detecting up to t errors, and a code of distance 2t+1 is necessary for correcting up to t errors. Linear real-number checksum codes have been used in the past for error detection in applications such as digital filtering, matrix multiplication, and FFT computation [7]-[10]. While error detection is accomplished easily, error correction is a harder problem and can require significant computation for exact error correction.
This renders real-time correction without significant loss of throughput difficult to achieve for the DSP applications that are the core of this research. In this paper, a real-time error correction technique is developed under a realistic error model that includes errors in the system states as well as in the adders and multipliers of DSP filters. Fast real-time error compensation is achieved by bypassing the error diagnosis step used in current exact error correction techniques and instead performing direct, inexact probabilistic error compensation. The latter is acceptable in applications such as voice and image processing where exact error correction is not always necessary. In the following, first the concept of a linear digital state variable system is presented. Then the use of checksum codes for error detection and correction is reviewed. Next, the concept of probabilistic error correction is introduced.
3. Linear Digital State Variable Systems

Linear digital state variable systems can be used to represent linear time-invariant systems such as digital filters. The general form of a state variable system is similar to the Huffman representation of a sequential circuit, with the combinational block replaced by a module that computes a linear matrix transformation. This module is a network of basic computational elements, such as adders, multipliers, and shifters, and feeds the system primary outputs and flip-flops. The processing is purely arithmetic; therefore inputs, outputs, and states represent numerical values. Let (u1…um) and (y1…yw) be the primary inputs and primary outputs of the linear state variable system respectively. If s(t)=[s1(t), s2(t), …, sn(t)]T is the state vector and u(t)=[u1(t), u2(t), …, um(t)]T is the input vector at time t, then the system function can be represented by the following equations:

s(t + 1) = A ⋅ s(t) + B ⋅ u(t + 1)    (1)
y(t + 1) = C ⋅ s(t) + D ⋅ u(t + 1)

where the A, B, C, D matrices represent arithmetic operations performed on the current state variables, s(t), and the m primary inputs to generate the next system states, s(t+1), and the w primary outputs, y(t+1).

4. Checksum-Based Error Detection and Correction

Real-number codes can be used for error detection and error correction in linear digital state variable systems [20]. The state vector, s(t), is encoded using one or more check variables. The idea is briefly described below. A coding vector, CV=[α1, α2, …, αn], is used to encode the A and B matrices such that X=CV⋅A and Y=CV⋅B. A check variable c, corresponding to each coding vector, is computed as c(t+1) = X⋅s(t) + Y⋅u(t+1). If there is no error in the system, c(t+1) = CV⋅s(t+1). An error signal, e, can be computed as e(t+1) = CV⋅s(t+1) − c(t+1) and is zero in the absence of any error. A non-zero value of e(t+1) can be caused by an error either in the state computation, i.e.
in the computational block, in the system states, s(t+1), in the check variable computation, c(t+1), or in the error signal computation, e(t+1). In [4], it was shown how to compensate for errors in the system states using a probabilistic checksum-based technique. Below, the compensation technique of [4] is summarized and then extended to cover the case where the error can occur in the combinational block as well as in the system states. It is assumed that the probability of error in the check variable computation or the error signal computation is negligible. Given the complexity of the state computation in relation to that of the check variable or error signal computation, this is a reasonable assumption. Also, errors in computing the output signal die after one clock cycle and are not considered in this paper.
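The detection scheme above can be sketched in a few lines of code. The 2-state, single-input system and the coding vector below are illustrative stand-ins, not the paper's example:

```python
# A minimal sketch of checksum-based error detection for a state-variable
# system s(t+1) = A s(t) + B u(t+1). All numeric values are illustrative.
A  = [[0.5, 0.1], [-0.2, 0.4]]   # state matrix
B  = [1.0, 0.5]                  # input column (single input)
CV = [2.0, 3.0]                  # coding vector [alpha_1, alpha_2]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Encoded matrices, precomputed offline: X = CV.A, Y = CV.B
X = [dot(CV, [A[i][j] for i in range(2)]) for j in range(2)]
Y = dot(CV, B)

s, u = [0.3, -0.1], 0.7
s_next = [dot(A[i], s) + B[i] * u for i in range(2)]   # s(t+1) = A s(t) + B u(t+1)
c = dot(X, s) + Y * u                                  # check variable c(t+1)
e = dot(CV, s_next) - c                                # error signal e(t+1)
assert abs(e) < 1e-9                                   # fault-free: e = 0

s_next[0] += 0.25                                      # inject an error into state 1
e = dot(CV, s_next) - c                                # now e = alpha_1 * 0.25 = 0.5
```

Because X and Y are computed once offline, the runtime cost of detection is only one extra inner product per cycle plus the comparison against CV⋅s(t+1).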
Probabilistic Compensation of Errors Affecting a State

In [3] and [4], it was shown how a single checksum variable could be used to partially correct errors occurring in the state variables of a DSP system such that the overall output SNR is improved. Here, we provide a short review of the proposed technique. Let wi be the probability of the ith state being erroneous, where ∑i=1..n wi = 1. If the error signal has the value e and the ith state is faulty, then the error value of the ith state is e/αi (αi is defined in Section 4). Let ∆i be the error vector when the ith state is faulty. Then ∆i is an n×1 vector whose ith element is e/αi and whose other elements are zero. Let EV be an n×1 vector which indicates the errors in the state variable values, i.e. EV(i) shows the error in the ith state. If an error is detected, then the error vector (EV) is ∆i with probability wi, for i=1…n. Let ygood be the output signal when there is no error and yerr be the output when there is an error. The output noise signal is noise = ygood − yerr. The output noise power and the output signal-to-noise ratio (SNR) are defined as follows:

NoisePower = ∑i=1..T noise(i)^2    (2)

SNR = 10 log( var(ygood) / var(noise) )    (3)

where T is the duration of the measurement of the output signal and noise(i)^2 is the noise power component at time i. An error during the time interval (t, t+1) in one of the system states results in a deviation in the state value, represented by the error vector EV, as described in equation (4):

serr(t + 1) = sgood(t + 1) + EV    (4)

If there is no error correction, the error in the system states at time t+k+1, k cycles after its occurrence (assuming no error happens in between), is given by equation (5). Thus, if the system is stable, the errors in the system states disappear after m cycles, where A^m → 0.

serr(t + k + 1) = A^k EV + sgood(t + k + 1)    (5)

The goal of checksum-based probabilistic correction is to find a correction vector, V, derived from the error signal e(t+1), to be subtracted from the state vector at the time when an error is detected. After the correction, the error in the system states is EV−V. The errors in the states and at the output k cycles after the correction are A^k(EV−V) and C⋅A^k(EV−V) respectively. The goal is to find V such that the average output noise power, as computed below, is minimized:

AverageNoise = ∑i=1..n ∑k=0..m wi (C⋅A^k(∆i − V))^2    (6)

subject to:

∑k=0..m (C⋅A^k(∆i − V))^2 ≤ ∑k=0..m (C⋅A^k ∆i)^2,  i = 1…n    (7)

A solution for (6) is:

V = ∑i=1..n wi ∆i    (8)
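For state errors, equation (8) reduces to a very cheap computation, since ∆i has e/αi in its ith position and zeros elsewhere, giving V(i) = wi⋅e/αi. A minimal sketch, with illustrative coding vector, weights, and state values:

```python
# Sketch of the correction vector of equation (8) for state errors.
# Delta_i has e/alpha_i in position i and zeros elsewhere, so
# V = sum_i w_i Delta_i has w_i * e / alpha_i in position i.
alpha = [2.0, 3.0, 4.0]            # coding vector elements (illustrative)
w     = [1/3, 1/3, 1/3]            # P(state i faulty); must sum to 1
e     = 0.6                        # observed error-signal value

V = [w[i] * e / alpha[i] for i in range(3)]            # eq. (8)
s = [0.30, -0.10, 0.25]            # (possibly erroneous) state vector
s_corrected = [s[i] - V[i] for i in range(3)]          # subtract V from states
# V = [0.1, 0.0666..., 0.05]
```

No diagnosis is performed: the same V is subtracted regardless of which state was actually hit, which is what makes the correction single-cycle.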
In the special case when all state variables have the same probability of being erroneous, i.e. wi = 1/n, i=1…n, and the coding vector elements are αi = 1, i=1…n, the correction vector is V = e×[1/n 1/n … 1/n]T.

Probabilistic Compensation of Errors Affecting an Operator in the Computational Block

If an error, ε, occurs in an operator (an adder or a multiplier) of the combinational block, L, its effect on the final states of the system depends on the structure of the computational block. Here, we first describe the concept of the gain of an operator, which quantifies how an error in the operator affects different system states. Then, we show how the probabilistic compensation described in the previous section can be applied to cover the case of errors in operators (adders/multipliers). In general, an operator error might affect more than one state. In the following, we show how probabilistic error compensation for such operators can be performed.

Gain of an Operator: To find out how an error in an operator Oj affects the ith state, si, we first find all the paths from the output of Oj to si. For each such path p, we define its gain, Өp, to be the product of the gains of all the operators on that path. The gains of an adder, a subtractor, and a multiplier are defined to be 1, −1, and the multiplication constant respectively. Let gj,i, the total gain from Oj to si, be ∑p=1..P Өp, where P is the number of paths from the output of Oj to si. gj,i effectively represents the amount by which an error εj at the output of operator (adder or multiplier) Oj is scaled before being added to the value of state si. In other words, an error εj in Oj causes an error gj,i×εj in si. For example, for the system shown in Figure 1, the gain of path 8, 9, 5, 3, S3 from O8 to S3 is (1/3)(+1)(−1) = −1/3, and g8,3 = (1/3)(+1)(−1)+(1/3)(1)(1)(−1) = −2/3 [21]. The gain matrix GM is defined such that each element GM(i,j) represents gj,i.
Figure 1. Structure of a linear state variable system with shared operators [courtesy of [21]].
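The path-enumeration rule for gj,i can be sketched as a small recursion over the operator network. The tiny graph below is hypothetical (not the network of Figure 1), and for brevity each edge carries the gain contributed by the operator it leads into (1 for an adder input, −1 for a subtracted input, the constant for a multiplier):

```python
# Path-gain computation sketch over a hypothetical operator DAG.
# graph: node -> list of (successor, edge_gain); states are sink nodes.
graph = {
    'O1': [('O2', 1.0), ('O3', -1.0)],   # O1 fans out to O2 and (negated) O3
    'O2': [('S1', 0.5)],                 # multiply-by-0.5 on the way to S1
    'O3': [('S1', 1.0)],
    'S1': [],
}

def total_gain(src, dst):
    """g_{src,dst}: sum over all paths of the product of edge gains."""
    if src == dst:
        return 1.0
    return sum(g * total_gain(nxt, dst) for nxt, g in graph[src])

g = total_gain('O1', 'S1')   # (1.0)(0.5) + (-1.0)(1.0) = -0.5
```

Running this for every operator/state pair fills in the gain matrix GM; for a DAG with shared operators, memoizing `total_gain` keeps the enumeration linear in the number of edges.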
Next, it is shown how the probabilistic compensation can be performed when an error occurs in an operator or in the system states. If an error, ε, occurs in the output of an operator Oj, then the error in the system state si is ε×gj,i by definition. It can be easily seen that the value of e on the error signal is as shown in Equation (9), where αi is the ith element of the coding vector as described earlier in this section:

e(t + 1) = (∑i=1..n αi gj,i) × ε    (9)

where n is the number of states in the system. Therefore, after detecting an error on the e(t+1) signal, if the faulty operator is known, then one can directly compute the error in each state as:

esi|Oj = gj,i × e(t + 1) / (∑i=1..n αi gj,i)    (10)

In the case of Oj being erroneous, the ith element of ∆j, the vector representing the error in the states, is ∆j(i) = esi|Oj as described in Equation (10). We then substitute ∆j into Equation (6), where n is now the number of columns in the gain matrix. Here, wi is the probability of Oi being erroneous, and the solution V = ∑i=1..n wi ∆i still holds.

Figure 2. An implementation of the system in Table 1 with shared operators.
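Equations (9) and (10) can be sketched as follows; the 2-state gain matrix and coding vector are illustrative stand-ins, not the paper's example:

```python
# Sketch of equations (9) and (10): mapping an operator error back to
# per-state error estimates through the gain matrix.
# GM[i][j] holds g_{j,i}, the total gain from operator O_j to state s_i.
GM = [[1.0, -0.5],
      [0.3,  1.0]]
alpha = [2.0, 3.0]                 # coding vector (illustrative)
eps = 0.2                          # error at the output of the faulty operator

j = 0                              # index of the faulty operator
denom = sum(alpha[i] * GM[i][j] for i in range(2))    # sum_i alpha_i g_{j,i}
e = denom * eps                                       # eq. (9): error signal
delta_j = [GM[i][j] * e / denom for i in range(2)]    # eq. (10): error in s_i
# each entry recovers eps scaled by that state's gain, g_{j,i} * eps
```

Note that eps itself is never observed at runtime; only e is, and equation (10) redistributes it across the states in proportion to the column of GM belonging to Oj.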
It is worth mentioning that the error signal e(t+1), as defined in Equation (9), is (∑i=1..n αi gj,i) × ε. Therefore,
the elements of the coding vector must be chosen in such a way that there is no aliasing, i.e. e(t+1) ≠ 0 for non-zero values of ε. It is shown in [21] that the condition for preventing aliasing is that all elements of CV×GM are non-zero (for proof see [21]). A key issue is that the choice of coding vector might affect the SNR improvement obtained by using the probabilistic error compensation. In [20], it is shown that the choice of coding vector is in fact a trade-off between the round-off error, caused by increasing the values of the coding vector elements, and the reduction in code reflectivity, caused by smaller coding vector values. Therefore, in this paper we assume that there is a limited range of acceptable values for the coding vector elements, [xmin, xmax]. The choice of coding vector is analyzed for the system shown in Table 1. An implementation of the system and its corresponding gain matrix are shown in Figure 2 and Figure 3 respectively.

Table 1. A 3rd-order linear state system

A = [  0     0    0.3
     −0.3  −0.3  −0.9
      0.3   0.1   0.8 ]
B = [2.6 1.2 1.5]T
C = [0.11 0.06 0.08]
D = [0.2]
Figure 3. The gain matrix corresponding to the implementation in Figure 2.
Figure 4. Noise power as a function of coding vector elements (α1, α2, α3).

The effect of the coding vector choice is analyzed by sweeping the values of the coding vector elements, (α1, α2, α3), over the range [xmin, xmax]^3. The reduction in noise power (compared to the no-correction case) is measured. The noise power as a function of the coding vector is shown in Figure 4. The figure shows the orthogonal slice planes of the volume [1, 5]^3 along the α3 axis, with the noise power averaged over all possible faulty operators. The minimum average noise power is obtained using the coding vector CV = [5 3 4]. It has to be noted that the optimum coding vector depends on the range of acceptable values, i.e. on xmin and xmax. For the optimal compensation vector, the noise power for different faulty operators, along with the no-correction case, is shown in Figure 5. The resulting SNR improvements for the different faults are also shown in Figure 5. For this system, the average noise power is reduced by more than 60% over the system with no correction, and an average SNR improvement of 7.7 dB was obtained.
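The sweep described above can be sketched as a brute-force search that scores each candidate coding vector by the average noise power of equation (6). For brevity the sketch uses a hypothetical 2-state system with an assumed gain matrix and equal operator-error probabilities, not the paper's 3rd-order system of Table 1 and Figure 3:

```python
import itertools

# Hypothetical 2-state example (illustrative values only).
A  = [[0.5, 0.1], [-0.2, 0.4]]     # state matrix
C  = [0.11, 0.06]                  # output row vector
GM = [[1.0, -0.5], [0.3, 1.0]]     # GM[i][j] = g_{j,i}, gain from O_j to s_i
w  = [0.5, 0.5]                    # equal operator-error probabilities
m  = 30                            # horizon after which A^m is ~0

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def avg_noise(alpha):
    """Average output noise power of eq. (6) for coding vector alpha."""
    # aliasing check: every element of CV x GM must be non-zero [21]
    denom = [dot(alpha, [GM[i][j] for i in range(2)]) for j in range(2)]
    if any(abs(d) < 1e-9 for d in denom):
        return float('inf')
    # eq. (10): per-state error estimate Delta_j for a unit operator error
    deltas = [[GM[i][j] / denom[j] for i in range(2)] for j in range(2)]
    # eq. (8): correction vector
    V = [sum(w[j] * deltas[j][i] for j in range(2)) for i in range(2)]
    total = 0.0
    for j in range(2):                       # residual error Delta_j - V ...
        r = [deltas[j][i] - V[i] for i in range(2)]
        for _ in range(m + 1):               # ... propagated through C A^k
            total += w[j] * dot(C, r) ** 2
            r = [dot(A[i], r) for i in range(2)]
    return total

# exhaustive sweep over integer coding vectors in [1, 5]^2
best = min(itertools.product(range(1, 6), repeat=2), key=avg_noise)
```

For an n-state system the sweep costs O((xmax − xmin + 1)^n) evaluations, which is why the acceptable range [xmin, xmax] is kept small.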
Figure 5. Noise power reduction and SNR improvement using the optimal coding vector.

In the results section, the coding vector that results in the maximum noise power reduction is used to find the SNR improvement compared to the case of no error correction. We also compare our results with a technique that is called state restoration from here on. The state restoration technique is in fact the prediction-based ANT [16] in its simplest form. This approach simply sets the state latches to their previous values whenever an error is detected. The technique requires an extra set of latches to hold the previous state values, so that they can be restored to the system latches when an error is detected.

5. Experimental Results

The experimental results of this section are generated using the 3rd-order system shown in Table 1. The linear system, the error detection, and the error correction modules were implemented in MATLAB. The errors in different operators were emulated by modifying the magnitude of the states in proportion to the gain of the faulty operator for each state. For the checksum-based correction, the correction vector of equation (8) was computed using the modified ∆j that covers the operator faults as well. The input is a sinusoid
with a maximum amplitude of 1 and a frequency of 10 kHz, sampled at 10 times the signal frequency. The simulation time is assumed to contain 20 periods of the sine wave. Two different cases were considered: 1) errors in only the states, and 2) errors in both the operators and the states. For each case, first the best coding vector is found, as described in Section 4, by sweeping over the range of possible choices. Then, for the optimal coding vector, an error of unit magnitude is injected at different potential fault points (individual states in case 1, and all operators and states in case 2) and the average SNR improvement over the different fault points is computed. The SNR improvement is defined in comparison with a system with no correction capability. The results are shown in Table 2. The average SNR improvement for the state restoration technique is also shown in Table 2. It is assumed that all operators and all states have the same probability of being erroneous. For this simulation, a single error of magnitude 1 occurs midway through the simulation. For the case of state restoration, the average, the maximum, and the minimum were obtained over both the different fault points and the different positions during the simulation at which the error occurs. This is because, as mentioned in [4], the state restoration performance is severely dependent on the position of the error. If the error occurs at a position where the state has the largest derivative, the SNR of state restoration is at its minimum. Conversely, if the error occurs at a position where the derivative is the smallest, the maximum SNR improvement is achieved.
Table 2. SNR improvement using the probabilistic correction technique and the state restoration technique.

                 Probabilistic Correction            State Restoration
                 SNR Improvement (dB)                SNR Improvement (dB)
                 Errors in       Errors in
                 States Only     Operators & States
Average          9.8             7.7                 7.9
Max              18.9            12.9                15.9
Min              4.7             0.8                 -9.3
Coding vector    [4 2 3]         [5 3 4]             N/A

As can be seen from the table, the SNR improvement of probabilistic correction decreases as we go from case 1 (errors in the states only) to case 2 (errors in the states and the operators). The average SNR improvement of probabilistic correction was reduced by 21% in case 2 compared to case 1. A similar decrease is seen for the minimum and the maximum values. Nevertheless, an average SNR improvement of 7.7 dB was still obtained. Although on average the state restoration and the probabilistic checksum-based error correction perform equally well for this example, in the worst case state restoration has a large negative impact on the system SNR (≈ −9 dB).

6. Conclusion

In this paper, it is shown how probabilistic checksum-based error compensation can be used to mitigate not only errors occurring in flip-flops but also errors in the combinational logic. This is an important problem because, down the technology roadmap, transient errors in the combinational part of a system will become as important as errors in the sequential part. The proposed technique results in a large SNR improvement in a linear digital system; for the example presented here, up to 13 dB of improvement was achieved.

References:

[1]. P. Shivakumar, et al., "Modeling the effect of technology trends on the soft error rate of combinational logic," International Conference on Dependable Systems and Networks, Bethesda, MD, 2002, pp. 389-398.
[2]. R. K. Iyer, "Recent Advances and New Avenues in
Hardware-Level Reliability Support," International Symposium on Microarchitecture, Barcelona, Spain, 2005, pp. 18-29.
[3]. M. Ashouei, et al., "Design of Soft Error Resilient Linear Digital Filters Using Checksum-Based Probabilistic Error Correction," VLSI Test Symposium, Berkeley, CA, 2006, pp. 208-213.
[4]. M. Ashouei, et al., "Improving SNR for DSM Linear Systems Using Probabilistic Error Correction and State Restoration: A Comparative Study," European Test Symposium, Southampton, UK, 2006, pp. 35-42.
[5]. B. W. Johnson, "Design and Analysis of Fault Tolerant Digital Systems," Addison-Wesley, 1989.
[6]. T. Karnik, et al., "Scaling Trend of Cosmic Ray Induced Soft Errors in Static Latches Beyond 0.18u," Symposium on VLSI Circuits, Kyoto, Japan, 2001, pp. 61-62.
[7]. K. H. Huang and J. A. Abraham, "Algorithm-based fault tolerance for matrix operations," IEEE Transactions on Computers, Vol. C-33, pp. 518-528, June 1984.
[8]. J. Jou and J. A. Abraham, "Fault-tolerant Matrix Arithmetic and Signal Processing on Highly Concurrent Computing Structures," Proceedings of the IEEE, Vol. 74, No. 5, May 1986.
[9]. J. Y. Jou and J. A. Abraham, "Fault Tolerant FFT Networks," IEEE Transactions on Computers, Vol. 37, pp. 548-561, May 1988.
[10]. L. N. Reddy and P. Banerjee, "Algorithm-based fault detection for signal processing applications," IEEE Transactions on Computers, Vol. 39, No. 10, pp. 1304-1308, October 1990.
[11]. Y. S. Dhillon, A. U. Diril, A. Chatterjee, "Soft-error tolerance analysis and optimization of nanometer circuits," Proceedings of Design Automation and Test in Europe, Munich, Germany, 2005, pp. 288-293.
[12]. A. U. Diril, Y. S. Dhillon, A. Chatterjee, A. D. Singh, "Design of adaptive nanometer digital systems for effective control of soft error tolerance," VLSI Test Symposium, Palm Springs, CA, 2005, pp. 98-303.
[13]. M. Nicolaidis, "Time redundancy based soft error tolerance to rescue nanometer technologies," VLSI Test Symposium, San Diego, CA, 1999, pp. 86-94.
[14]. S. Mitra, et al., "Robust System Design with Built-In Soft-Error Resilience," Computer, Vol. 38, No. 2, pp. 43-52, Feb. 2005.
[15]. Y. Arima, et al., "Cosmic-ray immune latch circuit for 90nm technology and beyond," International Solid-State Circuits Conference, San Francisco, CA, 2004, pp. 492-493.
[16]. R. Hegde and N. R. Shanbhag, "Soft Digital Signal Processing," IEEE Trans. on VLSI, pp. 813-823, Dec. 2001.
[17]. B. Shim and N. R. Shanbhag, "Reduced Precision Redundancy for Low-Power Digital Filtering," Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, 2001, pp. 148-152.
[18]. N. R. Shanbhag, "Reliable and Energy-Efficient Digital Signal Processing," Design Automation Conference, New Orleans, LA, 2002, pp. 830-835.
[19]. B. Shim and N. R. Shanbhag, "Energy-Efficient Soft Error-Tolerant Digital Signal Processing," IEEE Trans. on VLSI, Vol. 14, No. 4, pp. 336-348, April 2006.
[20]. V. S. Nair and J. A. Abraham, "Real-number codes for fault-tolerant matrix operations on processor arrays," IEEE Trans. on Computers, Vol. 39, pp. 426-435, April 1990.
[21]. A. Chatterjee, M. A. d'Abreu, "The design of fault tolerant linear digital state variable systems: theory and techniques," IEEE Trans. on Computers, Vol. 42, pp. 794-808, July 1993.