A Cache Error Propagation Model∗

Arun K. Somani
Department of Electrical Engineering and Department of Computer Science and Engineering, Box 352500, University of Washington, Seattle, WA 98195-2500
Tel: (206) 685-1602, email: [email protected]

Kishor S. Trivedi
Center for Advanced Computing and Communication, Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0291
Tel: (919) 660-5269, email: [email protected]

January 12, 2004
∗ This research was supported in part by NSF grants MIP-9224462 and MIP-9630058, and by NASA under NAS1-19480 through the Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, VA 23681. The authors sincerely thank P. N. Marinos of Duke University and D. Nicol for their thorough reading and helpful comments. The authors also acknowledge Allen Sansano's contribution in setting up the experiments on the Proteus system.
Abstract

Cache memory is a small, fast memory that holds frequently used data. With increasing processor speeds, designers follow aggressive practices in the design of cache memories. Such practices increase the probability of fault occurrence and the presence of latent errors, because the processor allows only a short duration for each read and write. A fault may corrupt the cache memory system or lead to an erroneous internal CPU state. In this paper, we investigate error propagation in a cache memory system due to transient faults in the cache memory itself, in the processor's registers, or both. The information gained from such an investigation should lead to more effective error recovery mechanisms against failures due to transient faults arising in the machine's cache memory and register set. We establish that even though the computer system is capable of recovering from the effect of a single erroneous cache location or processor register about 50% of the time, the other 50% of the time error recovery can be effected only through specific recovery mechanisms. Our results are obtained both from a discrete-time Markov model and by means of error injection on a real system.

Keywords: cache memory system, cache error propagation, cache error recovery, fault injection, latent faults in cache memory systems, Markov models.
1 Motivation
Cache memory is a small, fast memory that holds frequently used data, determined dynamically during program execution. Cache memory systems (also referred to simply as caches throughout the paper) differ principally in their sizes, control mechanisms, and organization. Almost all microprocessor chips contain an on-chip cache, and most computer systems, including PCs, are now equipped with a second-level cache as well; state-of-the-art systems use very large cache memory systems. For a detailed description of various cache organizations and control mechanisms and their effect on the percentage of accesses served by the cache, called the hit ratio, the reader is referred to [2]. We summarize the cache operation briefly below.

The use of a cache increases the efficiency of a processor. A data read or write operation by the processor is first looked up in the cache. A cache line is a set of contiguous words that the cache transfers to and from the main memory. On a read access, if the needed data is found in the cache (a cache hit), it is supplied to the processor directly from the cache. If it is not found in the cache (a cache miss), the cache line containing the data is read from the main memory, stored in the cache, and supplied to the processor. On a write access that hits in the cache, the data can be updated either only in the cache (known as write back) or in both the cache and the main memory (known as write through). On a write miss there are two options: (1) the data may be written directly into the main memory (known as write around); or (2) the cache may fetch the line containing the required word and then treat the write as a hit (known as write allocate). Write around implies a write-through operation, whereas with write allocate either write through or write back may be used. The write-back option is preferred for high performance.

A cache line is initially in an invalid state. When it is first read from the main memory, it is valid and clean. If the processor modifies the contents of a line, the line is called dirty. If most of the processor's read and write accesses can be served by the cache, system performance is enhanced considerably.

Faster processor clock rates considerably reduce the cycle time, and therefore the time available to the cache to perform a read or write operation while meeting the processor-cache bandwidth requirements. This reduction in time increases the probability of a transient fault occurrence. An error may flip one or more bits in the word read.
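To make the cache operation summarized above concrete, the following minimal Python sketch (purely illustrative, not from any system studied in this paper) models a direct-mapped, write-back, write-allocate cache with the invalid/clean/dirty line states just described.

```python
# Minimal sketch of a direct-mapped, write-back, write-allocate cache.
# Illustrative only; real caches are set-associative and move actual data.

class CacheLine:
    def __init__(self):
        self.valid = False   # invalid until first filled from main memory
        self.tag = None
        self.dirty = False   # set when the processor modifies the line

class Cache:
    def __init__(self, num_lines):
        self.lines = [CacheLine() for _ in range(num_lines)]
        self.hits = 0
        self.accesses = 0

    def access(self, line_addr, is_write):
        """One read or write to the cache line holding line_addr."""
        self.accesses += 1
        index = line_addr % len(self.lines)
        tag = line_addr // len(self.lines)
        line = self.lines[index]
        if line.valid and line.tag == tag:
            self.hits += 1                 # cache hit
        else:                              # miss: write-allocate policy
            if line.valid and line.dirty:
                pass                       # write-back of the dirty victim would go here
            line.valid, line.tag, line.dirty = True, tag, False
        if is_write:
            line.dirty = True              # write back: memory updated only on eviction

    def hit_ratio(self):
        return self.hits / self.accesses if self.accesses else 0.0
```

The hit ratio h and the dirty probability d used in the model of Section 3 correspond to statistics one could gather from such a simulation.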
Studies in [7, 8, 9] show that a large fraction of detected errors are caused by transient faults, which are 5 to 100 times more common than permanent faults [7], and such errors in data may even go unnoticed. Erroneous data in a processor register or a cache memory location may corrupt the results through incorrect data usage, wrong instruction execution, or the program taking an incorrect branch. This may further corrupt other cache memory locations or lead to an erroneous internal CPU state. An error may appear in a cache line either directly, through a transient fault, or because the processor writes incorrect data into it. A processor using erroneous data is likely to produce incorrect results, so fast and early recovery of erroneous cache data is crucial for overall system reliability.

One simple scheme to prevent the processor from using erroneous cache data is to employ an error-correcting code (ECC) [10]. An ECC offers recovery capability only up to its code-word limit; conventional ECC codes are limited in their recovery capability and cannot detect processor transient faults. ECC circuitry also increases the cache memory access time and the chip real estate. In addition, the complexity of ECC codes grows tremendously for multiple-bit recovery, making multiple-bit ECC codes prohibitively expensive in many cases. Systems using single-bit-correcting ECC codes require memory scrubbing to minimize the probability of multiple-bit errors: the contents of the memory are read and written back at regular, predefined time intervals. Moreover, erroneous data written by the processor can still corrupt the cache memory system even with ECC protection, since such data is seen by the cache as valid.

The main contributions of this paper are the development and solution of a model of error propagation for transient faults in a cache location or a processor register. The ultimate goal is to devise recovery mechanisms for such faults, and these mechanisms depend on an accurate fault model. We begin by discussing error recovery mechanisms in Section 2 and stress that no measured data has been reported on the extent of error propagation from cache memory and/or register transient faults. In Section 3 we develop a simple discrete-time Markov model to predict the extent of error propagation. We verify the model by injecting faults into real programs and present the results in Section 4. Our findings will serve as a basis for future work on error recovery techniques.
2 Error Recovery Mechanisms
Checkpointing and recovery using checkpoint data are commonly used recovery techniques. Intermediate states of the computation are stored at several time steps in stable storage [1] or backup storage. Many of these schemes assume that reliable cache memory systems and ECC-protected main memory are available to store the checkpoint information in uni- and multiprocessor systems [3, 4, 12]. Processor-based schemes [5] handle transient faults using processor-based transparent rollback techniques. Memory-based schemes [5] roll data back instead of instructions and can be integrated with the processor techniques. Most of these papers do not describe mechanisms to detect processor faults or validate a checkpoint; it is generally assumed that processor faults can be detected by duplicating processors and/or by other special-purpose hardware. Such special needs and the high frequency of checkpoints limit the use of these schemes.

Control-flow checking schemes that detect whether a processor deviates from its intended flow due to a transient fault have been studied in [13, 14]. Checksum error codes are embedded in the basic blocks of the program code, and a special processor-memory interface is required to separate instructions from the error-checking code. This combination of hardware and software detects transient faults in an instruction sequence. Wilken and Shen [13] describe continuous signature monitoring, and Schuette and Shen [14] exploit instruction-level resource parallelism for transparent, integrated control-flow monitoring. Saxena and McCluskey [16] describe control-flow checking using watchdog assists and extended-precision checksums, which has low error detection latency.

Ooi et al. [17] describe a real-time degradable four-way set-associative cache memory controller (CMC) that reconfigures a faulty cache partition and reduces the associativity of the cache, which adversely affects performance. Adams [8] and Sarnaik and Somani [18] present hardware-assisted recovery techniques that detect the memory segments in the failed processor that need to be restored, so that recovery is accomplished incrementally by restoring the corrupted segments without degrading real-time control functions. The HaL memory system [15] employs mechanisms to recover from transient faults using extensive coding in the memory management unit. In [1], a Stable Transaction Memory (STM) is specially designed to achieve atomic memory updates (transactions) for a group of processes in a shared-bus multiprocessor
system. On successful termination of a transaction, the data is copied to a secondary memory bank. Sequoia [6], a tightly coupled shared-bus multiprocessor, uses a similar scheme: upon taking a checkpoint, a processor pair first copies its status and flushes all dirty lines to a primary and a secondary memory unit separately. Chen and Somani [11] proposed a read-first-dirty-broadcasting protocol for redundant systems employing cache memory, which recovers from cache/processor transient faults without affecting system operation; in this scheme, both cache and processor transient faults can be tolerated without any special fault-detection mechanism.

From this discussion, we note that various schemes have been developed to recover from transient and permanent faults in the memory or the processor. However, none of the above schemes makes a quantitative assessment of the extent of error propagation, since they assume that faults are detected instantly. Understanding how faults propagate is a key factor in developing effective solutions to the error recovery problem, and it will also have a significant impact on cache and overall system design.
3 Fault Propagation Model
An error in a cache word may propagate into other cache lines as well as into the processor's registers during the execution of a program. As noted earlier, write back is the preferred protocol for higher performance [2]. When an error occurs in a write-back cache, it may propagate into other cache lines as the program executes. A read or write hit in a write-back cache may fail to invoke the detection or recovery procedure even in the presence of an error, since the detection and recovery mechanisms may be implemented in the bus interface unit. Such errors can be detected by the bus interface unit only when a dirty line is replaced to bring a missing line into the cache on a read or write miss. (For a write miss, a cache with a write-allocate or write-generate policy reads the missing line first and then performs the write.) Thus, an error may propagate through the cache and the processor's registers during program execution.

For our model, we assume that the main memory remains error-free, and we consider the system as a whole to have failed if an instruction outputs the contents of an erroneous register to the external world or the cache writes erroneous data back to the main memory.
We assume a standard cache operation: when a read or write miss occurs, the cache replaces an existing line with the new line being read or written. If the replaced line is dirty and contains an error, it contaminates the main memory, which we consider a system failure, since the error has propagated outside the processor-cache system. On a hit, the cache reads from or writes to a cache line. For write operations we assume write-allocate and write-back strategies, although the model is easily extended to account for other strategies.

Let the hit ratio of the cache be h, and let d denote the probability that a line is dirty. Although it is possible to use a more general model of the instruction sequence, we assume that successive instructions are chosen independently from the following four kinds:

1. Read or load instructions, executed with probability p_r.

2. Write or store instructions, executed with probability p_w.

3. Compute instructions, which are three-operand register-to-register instructions that read two registers and write their result into a third register, executed with probability p_c.

4. Output instructions, which propagate the results of a computation to the external world (i.e., external to the processor and its cache), executed with probability p_t = 1 − p_r − p_w − p_c.

A fault that appears in the cache memory system modifies the contents of a particular cache line and may propagate through the processor/cache system during the execution of a program consisting of a sequence of the above instructions. We represent the state of the system by a two-tuple: (1) the number of erroneous registers, k, and (2) the number of erroneous cache lines, l. Let m denote the total number of registers in the processor and n the total number of cache lines; at any given time, some k of the m registers and l of the n cache lines may be erroneous.

With the above processor and cache model, the error propagation and the probability of failure at each instruction execution are computed assuming simple uniform memory access behavior. The probability that a register used by an instruction is error-free when there are k erroneous registers is r_k = (m − k)/m.
Similarly, the probability that a read or write instruction uses an error-free cache line when there are l erroneous cache lines is r_l = (n − l)/n. The processor executes one of the four types of instructions in every machine cycle. With these assumptions, the behavior of the cache and the processor can be represented by a discrete-time Markov chain (DTMC) with state space {(k, l) | 0 ≤ k ≤ m, 0 ≤ l ≤ n}. The state space is quite large, but the DTMC is well structured, so we can develop the transition probabilities from a generic state. When an instruction is executed, the system in state (k, l) may move to state (k+1, l), (k, l), (k−1, l), (k, l+1), (k, l−1), or (k−1, l−1). One-step transition probabilities are derived from the type of instruction currently being executed, as follows:

1. Read Instruction: A read instruction may hit or miss in the cache, and the register and the cache line involved in the operation may each be erroneous or error-free. If an erroneous line is replaced, it corrupts the main memory, which we consider a non-recoverable situation. All possible cases are listed in Table 1, together with the next state and the symbol q_i for the associated transition probability. The value of each q_i is the product of the corresponding row and column probabilities; for example, q_6 = p_r (1 − h)(1 − d)(1 − r_k) r_l.
Table 1: State transitions and their probabilities for a read instruction

Condition (probability) | Register ok, Cache ok: r_k r_l | Register ok, Cache error: r_k (1 − r_l) | Register error, Cache ok: (1 − r_k) r_l | Register error, Cache error: (1 − r_k)(1 − r_l)
Read hit: p_r h | (k, l), q_0 | (k+1, l), q_1 | (k−1, l), q_2 | (k, l), q_3
Read miss & not dirty: p_r (1 − h)(1 − d) | (k, l), q_4 | (k, l−1), q_5 | (k−1, l), q_6 | (k−1, l−1), q_7
Read miss & dirty: p_r (1 − h) d | (k, l), q_8 | Fail, q_9 | (k−1, l), q_a | Fail, q_b
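As an illustration of how the table is read, the short Python sketch below (our illustration, not code from the original study) computes the twelve read-instruction transition probabilities q_0 through q_b as the products of the row and column probabilities of Table 1.

```python
def read_probs(p_r, h, d, k, l, m, n):
    """Transition probabilities q_0..q_b of Table 1, in row-major table order."""
    r_k = (m - k) / m                       # register used is error-free
    r_l = (n - l) / n                       # cache line used is error-free
    rows = [p_r * h,                        # read hit
            p_r * (1 - h) * (1 - d),        # read miss, replaced line not dirty
            p_r * (1 - h) * d]              # read miss, replaced line dirty
    cols = [r_k * r_l, r_k * (1 - r_l),
            (1 - r_k) * r_l, (1 - r_k) * (1 - r_l)]
    return [row * col for row in rows for col in cols]

# Indices 0..11 correspond to q_0..q_9, q_a, q_b; e.g.
# read_probs(...)[6] = p_r * (1 - h) * (1 - d) * (1 - r_k) * r_l = q_6.
```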
2. Write Instruction: A write instruction is similar to a read instruction, except that the next state differs depending on whether the access hits or misses in the cache. Again, the register being written and the cache line involved in the operation may each be error-free or erroneous, and if an erroneous line is replaced, it corrupts the main memory, which we consider a non-recoverable situation. All possible cases are listed in Table 2; each entry gives the next state and the associated transition probability w_i.

Table 2: State transitions and their probabilities for a write instruction
Condition (probability) | Register ok, Cache ok: r_k r_l | Register ok, Cache error: r_k (1 − r_l) | Register error, Cache ok: (1 − r_k) r_l | Register error, Cache error: (1 − r_k)(1 − r_l)
Write hit: p_w h | (k, l), w_0 | (k, l−1), w_1 | (k, l+1), w_2 | (k, l), w_3
Write miss & not dirty: p_w (1 − h)(1 − d) | (k, l), w_4 | (k, l−1), w_5 | (k, l+1), w_6 | (k, l), w_7
Write miss & dirty: p_w (1 − h) d | (k, l), w_8 | Fail, w_9 | (k, l+1), w_a | Fail, w_b
Figure 1: A segment of the DTMC Model of error propagation
3. Compute Instruction: A compute instruction propagates an error if it uses an erroneous source register and writes into an error-free register; this changes the system state from (k, l) to (k+1, l) with probability c_1 = p_c (1 − r_k^2) r_k. On the other hand, if the instruction writes an error-free result, computed from error-free operands, into an erroneous register, the register recovers from the error; this changes the state from (k, l) to (k−1, l) with probability c_2 = p_c r_k^2 (1 − r_k). Otherwise the state remains the same, with probability c_3 = p_c (1 − (1 − r_k^2) r_k − r_k^2 (1 − r_k)).

4. Output Instruction: An output instruction causes a system failure if the output register is erroneous, which happens with probability t_1 = p_t (1 − r_k). Otherwise the system state remains the same, with probability t_2 = p_t r_k.

Notice that the system as a whole recovers when both the cache and all the registers become error-free: either the cache has no erroneous lines and the last erroneous register is overwritten with good data, or the only erroneous cache line is replaced by good data. We assume that a system failure occurs when an erroneous output is produced. The DTMC thus constructed is rather large, with a state space of size (n + 1) × (m + 1). A segment of its state diagram, corresponding to the transitions out of a generic state (k, l), is shown in Figure 1. The one-step transition probabilities are given below, where 1_A denotes the indicator function of the event A:

f_0 = t_1 + q_9 + q_b + w_9 + w_b
f_1 = q_1
f_2 = c_1
f_3 = (q_2 + q_6 + q_a + c_2) · 1_{(k,l) ≠ (1,0)}
f_{3r} = (q_2 + q_6 + q_a + c_2) · 1_{(k,l) = (1,0)}
f_4 = q_0 + q_3 + q_4 + q_8 + w_0 + w_3 + w_4 + w_7 + w_8 + c_3 + t_2        (1)
f_5 = q_7 · 1_{(k,l) ≠ (1,1)}
f_{5r} = q_7 · 1_{(k,l) = (1,1)}
f_6 = (q_5 + w_1 + w_5) · 1_{(k,l) ≠ (0,1)}
f_{6r} = (q_5 + w_1 + w_5) · 1_{(k,l) = (0,1)}
f_7 = w_2 + w_6 + w_a
f_8 = 1_{(k = m_t)}
f_9 = 1_{(l = n_t)}
The state (k, l) recovers to the fault-free state (labeled "rec" in Figure 1) if the only remaining erroneous cache line or register, or both, are overwritten or replaced by fault-free data; this condition is captured by the indicator functions in the definitions of f_{3r}, f_{5r}, and f_{6r}.
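The following sketch (again ours, with illustrative names) assembles these one-step probabilities for a generic transient state (k, l), reusing read_probs() from the earlier sketch; the w_i of Table 2 share the row-and-column product structure of the q_i, with p_w in place of p_r. The thresholds m_t and n_t are the terminating parameters introduced below, and states reaching them are absorbed directly into failure (f_8, f_9).

```python
def transitions(k, l, m, n, m_t, n_t, P):
    """Map each successor of transient state (k, l) to its probability.

    A successor is a state (k', l'), 'rec' (full recovery), or 'fail'.
    P is a dict of instruction-mix and cache parameters, e.g.
    dict(p_r=0.13, p_w=0.07, p_c=0.78, p_t=0.02, h=1.0, d=0.0).
    """
    q = read_probs(P['p_r'], P['h'], P['d'], k, l, m, n)   # q_0..q_b
    w = read_probs(P['p_w'], P['h'], P['d'], k, l, m, n)   # w_0..w_b
    r_k = (m - k) / m
    c1 = P['p_c'] * (1 - r_k**2) * r_k     # error spreads to a clean register
    c2 = P['p_c'] * r_k**2 * (1 - r_k)     # erroneous register overwritten
    c3 = P['p_c'] - c1 - c2                # compute leaves the state unchanged
    t1, t2 = P['p_t'] * (1 - r_k), P['p_t'] * r_k
    out = {}
    def add(s, p):
        if p > 0.0:
            out[s] = out.get(s, 0.0) + p
    add('fail', t1 + q[9] + q[11] + w[9] + w[11])              # f_0
    add('fail' if k + 1 == m_t else (k + 1, l), q[1] + c1)     # f_1 + f_2 (f_8 at k = m_t)
    add('rec' if (k, l) == (1, 0) else (k - 1, l),
        q[2] + q[6] + q[10] + c2)                              # f_3 / f_3r
    add((k, l), q[0] + q[3] + q[4] + q[8] + w[0] + w[3]
        + w[4] + w[7] + w[8] + c3 + t2)                        # f_4 (self-loop)
    add('rec' if (k, l) == (1, 1) else (k - 1, l - 1), q[7])   # f_5 / f_5r
    add('rec' if (k, l) == (0, 1) else (k, l - 1),
        q[5] + w[1] + w[5])                                    # f_6 / f_6r
    add('fail' if l + 1 == n_t else (k, l + 1),
        w[2] + w[6] + w[10])                                   # f_7 (f_9 at l = n_t)
    return out
```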
Transitions from a state (k, l) to the system failure state (labeled "failure" in Figure 1) are explained below; the transitions with probabilities f_8 and f_9 handle special terminating cases. Since the numbers of registers and cache lines are large, the DTMC has a very large state space, so automated means of generating the state space and solving for the desired measures are needed. Automated generation of Markov models starting from some variant of stochastic Petri nets (SPNs) is quite common, and a number of tools support this paradigm, so we were tempted to consider this avenue. However, since most of these tools produce a continuous-time Markov chain (CTMC) from the SPN description, one might be inclined to dismiss this route for solving a DTMC. Interestingly, if the SPN paradigm includes both immediate and timed transitions, the reachability graph produced has a set of vanishing markings that, together with their interconnections, constitutes a DTMC, and this DTMC is solved as part of the standard process of eliminating vanishing markings [20]. We use this artifice to solve our DTMC.
Figure 2: SPN model of cache error propagation
In order to use the tool at our disposal, we introduce a fictitious timed transition with rate 1.0 into the SPN model of Figure 2. All the transitions of the original DTMC are modeled by immediate transitions in the SPN. When SPNP [20] generates the reachability graph, there are three tangible markings: two correspond to the two absorbing states of the original DTMC, and one is the initial marking. All the remaining markings are vanishing and correspond to the transient
states of the DTMC. SPNP then eliminates all the vanishing markings and replaces them by a branching switch; the subsequent merger with the tangible markings produces the CTMC shown in Figure 3. This CTMC consists of three states: fault-free, recover, and failure. SPNP solves for the two transition probabilities: (i) from the fault-free state to recover (after a fault occurs), called the coverage C; and (ii) from the fault-free state to failure (after a fault occurs), called the failure probability F. We are not interested in the solution of the CTMC itself, but only in these two factors, which are essentially the probabilities of recovery and of failure due to a cache error. SPNP determines them while reducing the reachability graph; we are thus primarily using the vanishing-marking elimination step of SPNP here.
Figure 3: Equivalent Markov Chain for error propagation and recovery
In Figure 2, transitions i0 to i7 represent the eight possible transitions out of a generic state (k, l) shown in Figure 1, including the self-loop from (k, l) to itself. For each transition im, the corresponding transition probability is given by the value of f_m defined above. To keep the vanishing state space from exploding, we also created a terminating condition: we define two parameters m_t and n_t as the numbers of registers and cache lines which, once they become erroneous, always result in a failure. Transitions i8 and i9 represent these terminating conditions. More than (m_t + 1)(n_t + 1) − 3 vanishing markings are produced, and constructing and solving such a large Markov chain without the terminating condition would be a prohibitive task; the terminating condition keeps the size of the model within reasonable limits. Using this model, we computed the probabilities of failure and recovery; the results are given in Section 3.1.
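As a cross-check on the SPNP route, the absorption probabilities can also be computed directly from the DTMC. The sketch below (ours; a standard Gauss-Seidel absorption solver, not the paper's SPNP procedure) computes the coverage C, i.e., the probability of reaching the recover state before the failure state, starting from a single erroneous cache line (0, 1) or a single erroneous register (1, 0). It reuses transitions() from the earlier sketch.

```python
def recovery_probability(m, n, m_t, n_t, P, start=(0, 1),
                         sweeps=100000, tol=1e-12):
    """Probability of eventual recovery (coverage C) from state `start`.

    A(s) = sum over successors s' of Pr(s -> s') * A(s'), with
    A('rec') = 1 and A('fail') = 0; the self-loop f_4 is folded in.
    """
    states = [(k, l) for k in range(m_t) for l in range(n_t)
              if (k, l) != (0, 0)]              # transient states only
    T = {s: transitions(s[0], s[1], m, n, m_t, n_t, P) for s in states}
    A = {s: 0.0 for s in states}
    for _ in range(sweeps):
        delta = 0.0
        for s in states:
            val = sum(p * (1.0 if nxt == 'rec' else A.get(nxt, 0.0))
                      for nxt, p in T[s].items() if nxt != s)
            val /= 1.0 - T[s].get(s, 0.0)       # remove the self-loop mass
            delta = max(delta, abs(val - A[s]))
            A[s] = val
        if delta < tol:
            break
    return A[start]

# Parameter values from Table 3, 1k-line cache (cf. its entries):
# P = dict(p_r=0.13, p_w=0.07, p_c=0.78, p_t=0.02, h=1.0, d=0.0)
# print(recovery_probability(m=32, n=1024, m_t=32, n_t=16, P=P))
```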
3.1 Results of the Model
Table 3 shows the effect of the number of erroneous registers allowed (m_t) on the probability of recovery for p_r = 0.13, p_w = 0.07, p_c = 0.78, and p_t = 0.02. In these results we assume that the hit ratio is 1.0, so the probability of a line being dirty plays no role. We note that n_t = 16 and n_t = 32 yield identical results in this scenario, and the probability of recovery changes only slightly between m_t = 8 and m_t = 32. Therefore, in the rest of our study, we terminate the model once all 32 registers or 16 cache locations are erroneous.
Table 3: Effect of the number of erroneous registers on the probability of recovery (h = 1.0, d = 0.0; terminating at n_t = 16 or 32 gives identical results)

Cache size | m_t = 4 | m_t = 8 | m_t = 16 | m_t = 32
128 lines | 0.5180773 | 0.5346667 | 0.5354806 | 0.5354816
256 lines | 0.5183661 | 0.5348332 | 0.5356356 | 0.5356366
512 lines | 0.5185141 | 0.5349188 | 0.5357154 | 0.5357163
1k lines | 0.5185189 | 0.5349622 | 0.5357559 | 0.5357568
2k lines | 0.5186266 | 0.5349841 | 0.5357763 | 0.5357772
4k lines | 0.5186455 | 0.5349950 | 0.5357864 | 0.5357874
Next, we study the effect of the hit ratio and of the probability that a replaced line is dirty on the probability of recovery. We performed two sets of experiments involving two sets of values for p_r, p_w, p_c, and p_t, varying the hit ratio h and the dirty probability d. Table 4 gives the probability of recovery for n = 1024, m = 32, n_t = 16, m_t = 32, p_r = 0.13, p_w = 0.07, p_c = 0.78, and p_t = 0.02. Table 5 gives the probability of recovery for p_r = 0.25, p_w = 0.10, p_c = 0.60, and p_t = 0.05, with the same values of n, m, n_t, and m_t. The results show interesting properties. For values of d below 0.5, the probability of recovery increases as the hit ratio decreases, because on a miss the main memory provides new data that overwrites erroneous data in the cache. For d > 0.5, the reverse effect causes the probability of recovery to fall sharply; this is expected, since a replaced dirty line may corrupt the main memory, which we count as error spread and failure. For a fixed hit ratio h, the probability of recovery decreases with increasing d, at a rate that depends on h. In the next section we describe the fault injection experiments performed to verify the results obtained with the analytical model described above.
Table 4: Effect of hit ratio and dirty probability on recovery probability (n = 1024, m = 32, n_t = 16, m_t = 32, p_r = 0.13, p_w = 0.07, p_c = 0.78, p_t = 0.02)

d | h = 0.95 | h = 0.90 | h = 0.85 | h = 0.75 | h = 0.50 | h = 0.25
0.0 | 0.571250 | 0.604926 | 0.636899 | 0.696152 | 0.821109 | 0.920156
0.2 | 0.555944 | 0.575075 | 0.593222 | 0.626827 | 0.697734 | 0.754180
0.4 | 0.540662 | 0.545305 | 0.549706 | 0.557849 | 0.575041 | 0.588779
0.6 | 0.525402 | 0.515614 | 0.506345 | 0.489201 | 0.452971 | 0.423894
0.8 | 0.510165 | 0.486001 | 0.463135 | 0.420864 | 0.331475 | 0.259475
1.0 | 0.494949 | 0.456462 | 0.420071 | 0.352825 | 0.210511 | 0.095480
Table 5: Effect of hit ratio and dirty probability on recovery probability (n = 1024, m = 32, n_t = 16, m_t = 32, p_r = 0.25, p_w = 0.10, p_c = 0.60, p_t = 0.05)

d | h = 0.95 | h = 0.90 | h = 0.85 | h = 0.75 | h = 0.50 | h = 0.25
0.0 | 0.534457 | 0.575957 | 0.614692 | 0.684644 | 0.823590 | 0.924681
0.2 | 0.516965 | 0.542058 | 0.565423 | 0.607541 | 0.691366 | 0.753113
0.4 | 0.499543 | 0.508404 | 0.516637 | 0.531457 | 0.561017 | 0.583028
0.6 | 0.482188 | 0.474983 | 0.468300 | 0.456282 | 0.432239 | 0.414145
0.8 | 0.464900 | 0.441783 | 0.420380 | 0.381927 | 0.304810 | 0.246267
1.0 | 0.447676 | 0.408793 | 0.372851 | 0.308318 | 0.178559 | 0.079255
4 Error Injection Experiments
If a hardware system is available, faults can be injected at the circuit level using an external or internal built-in fault injector [24, 26, 27], using heavy-ion radiation [7], or using software-based approaches. In each case, the choice of fault location is governed by the goal of the study. If no hardware is available, the only feasible option is to implement a simulation model of the system in software and inject faults into that model; the level of detail of the model can be varied to match the goals of the study, although simulation may be very expensive in most cases. Hardware-level fault injection via circuit pins is better suited to studying permanent faults and is therefore not required in our context, since we are studying transient faults. Radiation and similar methods are recommended for studying the impact of random transient faults. Software-based fault injection provides full controllability and observability, and hence a systematic study of fault behavior, since both the location and the time of injection can be controlled. The injected fault or error may be permanent or transient in nature.

Numerous tools, such as SOFIT [21], FIAT [22], FINE [23], and FERRARI [25], use software-implemented fault injection, also known as error injection. Most such techniques, including FIAT, FINE, SOFIT, and MESSALINE, inject faults at the system level. FIAT emulates a variety of distributed system architectures, monitors system behavior, and injects faults for the experimental characterization and validation of a system's dependability; it can inject faults both in the code and in the data portions of memory images. FINE injects hardware-induced software errors and software faults into the UNIX kernel and traces the execution flow and key variables of the kernel. In [26], the authors' approach is based on fault injection at the physical level on a hardware/software prototype of the system under consideration. An excellent example of a combined hardware/software simulation methodology can be found in [25]. FERRARI emulates transient errors and permanent faults in software.

An ideal approach would be to use methods such as heavy-ion radiation, but separating the caches from the rest of the hardware is not feasible in the present context, since the cache may be on the same chip as the processor.
We therefore chose software-implemented error injection in real hardware while a program is running, and used the Proteus machine to carry out our error injection studies. Specifically, we run identical copies of a program on two nodes of the Proteus multiprocessor system, with one copy being injected with errors according to a given probability of error occurrence. Both cache locations and CPU registers can be corrupted. The programs running on the two processors are monitored and their results compared to determine how the injected error affected the execution. Since we have complete control of the processor registers, cache, and main memory in the Proteus system, we can perform controlled experiments. In Proteus, both nodes are controlled by a control processor and can be interrupted at precise points in time.
4.1 Experimental Set-up
The following experiment was set up to investigate the effects of injecting an error into the cache memory system. We used a node of the Proteus Parallel Processor; a node consists of four Processing Elements (PEs) and one Control Processor (CP). A program was run on a PE, and the CP interrupted the PE at a random time during the course of a run. When interrupted, the PE flipped a random bit in a random location within a chosen segment of the cache memory system, in this case the data segment. The PEs on the Proteus system have direct access to the cache locations, which are also mapped into their address space; it is therefore possible to manipulate cache locations without affecting the state of the cache line (valid, clean, dirty). After injecting the error, the PE exits the interrupt and completes the program.

Errors were classified into one of the following groups. The first group contains errors injected into still-invalid cache locations that the program initializes after the error has been injected, overwriting the erroneous location; such an error does not affect the execution at all, and the program produces the correct result. The second group contains errors injected into clean or dirty data locations; these errors can either propagate or disappear, for example when a clean location is overwritten.

We chose three programs with markedly different data and instruction
cache access patterns, which we call QSORT, JACOBI, and FFT. QSORT is a quicksort; JACOBI is akin to the Jacobi relaxation technique of numerical linear algebra; and FFT is a two-dimensional FFT implementation.
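The following self-contained Python sketch (ours; the Proteus-specific CP/PE mechanics are replaced by an in-process bit flip, and the array stands in for the cache data segment) illustrates the structure of one trial: a golden copy and a faulty copy of a QSORT-like program run on the same input, one random bit of the faulty copy's data is flipped at a random point in the run, and the outputs are compared.

```python
import random

def quicksort_with_injection(data, inject_at):
    """In-place quicksort; after inject_at comparisons, flip one random
    bit of one random element, emulating a transient error in the data
    segment. If inject_at exceeds the run length, no error is injected."""
    count = [0]
    def maybe_inject():
        count[0] += 1
        if count[0] == inject_at:
            i = random.randrange(len(data))
            data[i] ^= 1 << random.randrange(32)   # single-bit flip
    def sort(lo, hi):
        if lo >= hi:
            return
        pivot, i = data[hi], lo
        for j in range(lo, hi):                    # Lomuto partition
            maybe_inject()
            if data[j] < pivot:
                data[i], data[j] = data[j], data[i]
                i += 1
        data[i], data[hi] = data[hi], data[i]
        sort(lo, i - 1)
        sort(i + 1, hi)
    sort(0, len(data) - 1)

def trial(n=1024):
    """Return True if the injected error was masked (outputs match)."""
    golden = [random.randrange(1 << 32) for _ in range(n)]
    faulty = list(golden)
    quicksort_with_injection(faulty, random.randrange(1, 12 * n))
    return faulty == sorted(golden)

# masked = sum(trial() for _ in range(1000))  # cf. the QSORT row of Table 6
```

Repeating such trials and classifying each outcome by the state of the corrupted location at injection time (invalid, clean, or dirty) yields estimates of the kind reported in Table 6.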
4.2 Experimental Results
In our experiments, errors were injected into the data section of the cache memory system; care was taken in mapping the cache so that the program and data regions never overlapped, much like a split instruction/data cache. The results are shown in Table 6. The QSORT program resulted in a failure (that is, produced an erroneous output) almost all of the time, because the quicksort algorithm we used is an in-place algorithm and any error injected into the data results in a failure. We note with interest that out of the 99.3% failure cases listed in the last column for this program, in 39.3% of the cases the error did not actually spread: only a single location, the one where the error was injected, was erroneous. Although these runs are classified as producing an incorrect result, the incorrect result is due essentially to the injected error alone; the error in fact spread in only about 60% of the cases. The spread did not occur when the corrupted element had already been placed in its final sorted position (according to its error-free value) before the error was injected, so the element was no longer used in the calculations for subsequent reordering. This sort program thus behaves as if the value of d were high and the hit ratio low.

For the JACOBI code we expected a failure rate of about 50%, since the program involves two separate data regions: the data at step i−1 and the data being calculated at step i as a function of the data at step i−1. If the error was injected into an element associated with step i−1 after that element had been used, the error would be overwritten in the next step; if it was injected into a data element before that element was calculated, it would be overwritten by the new value. This gives roughly a 50% chance of the error landing in one of these two areas. We observed erroneous output in 44.3% of the cases.

The FFT program has a different data access pattern. We perform both the forward and inverse calculations in our program, and we organize the data so that there are four distinct data regions accessed in a specific order during program execution.
An error must land in an active data region to affect the output; for this reason we expected a relatively low probability of an erroneous output, resembling the low-hit-ratio, low-dirty-probability case. We observed an incorrect output in only 15.2% of the cases.
Table 6: Fault injection into program space

Program | Invalid line, no output error | Clean line, no output error | Dirty line, no output error | Clean line, output error | Dirty line, output error
QSORT | 0.3% | 0.0% | 0.0% | 0.0% | 99.7%
JACOBI | 1.4% | 0.0% | 54.3% | 0.0% | 44.3%
FFT | 29.8% | 12.2% | 42.8% | 0.0% | 15.2%
These results can be suitably scaled for actual error rates; even with error rates ten times lower, one would expect about 5% of program executions to be corrupted.
5 Conclusions
The introduction of cache memory increases the probability of fault occurrence and of latent errors, because the processor cycle time allows only a short duration for reads and writes in the cache. A fault may corrupt the cache memory system or lead to an erroneous internal CPU state. We have studied and reported the extent of error propagation due to a transient fault that originates in a processor register or a cache location during program execution. The error may propagate to other cache lines and/or registers, and eventually to the output of a computation, given sufficient time and depending on the type of program being executed. For a single cache fault, in most of the programs, the execution of the program eliminates the error about half of the time, as the erroneous line or register is replaced or overwritten with correct data by a later computation. We establish that even though the computer system is normally capable of recovering from the effect of a single erroneous cache location or processor register about 50% of the time, the other 50% of the time error recovery can be effected only through specific recovery mechanisms. We also developed and used a discrete-time
Markov model to study this behavior, in addition to using software-based error injection on a real machine. We are currently developing techniques for fault and error detection in cache memories.
References

[1] M. Banatre and P. Joubert, "Cache Management in a Tightly Coupled Fault Tolerant Multiprocessor," Proc. 20th Symposium on Fault-Tolerant Computing, pp. 89-96, 1990.
[2] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan Kaufmann, 1990.
[3] D. B. Hunt and P. N. Marinos, "A General Purpose Cache-Aided Rollback Error Recovery (CARER) Technique," Proc. 17th Symposium on Fault-Tolerant Computing, pp. 170-175, 1987.
[4] K. Wu, W. K. Fuchs, and J. H. Patel, "Error Recovery in Shared Memory Multiprocessors Using Private Caches," IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2, pp. 231-240, April 1990.
[5] N. S. Bowen and D. K. Pradhan, "Processor- and Memory-Based Checkpoint and Rollback Recovery," IEEE Computer, Vol. 26, No. 2, pp. 22-31, Feb. 1993.
[6] P. A. Bernstein, "Sequoia: A Fault-Tolerant Tightly Coupled Multiprocessor for Transaction Processing," IEEE Computer, Vol. 21, pp. 37-45, Feb. 1988.
[7] U. Gunneflo, J. Karlsson, and J. Torin, "Evaluation of Error Detection Schemes Using Fault Injection by Heavy-Ion Radiation," Proc. 19th Symposium on Fault-Tolerant Computing, Chicago, IL, pp. 340-347, June 1989.
[8] S. J. Adams, "Hardware Assisted Recovery from Transient Errors in Redundant Processing Systems," Proc. 19th Symposium on Fault-Tolerant Computing, pp. 512-519, 1989.
[9] X. Castillo, S. R. McConnel, and D. P. Siewiorek, "Derivation and Calibration of a Transient Error Reliability Model," IEEE Transactions on Computers, Vol. C-31, No. 7, pp. 658-671, July 1982.
[10] C. L. Chen and M. Y. Hsiao, "Error-Correcting Codes for Semiconductor Memory Applications: A State-of-the-Art Review," IBM Journal of Research and Development, Vol. 28, No. 2, March 1984.
[11] C. H. Chen and A. K. Somani, "A Cache Protocol for Error Detection and Recovery in Fault-Tolerant Computing Systems," Proc. 24th Symposium on Fault-Tolerant Computing, pp. 321-330, June 1994.
[12] H. S. Lin, "High-Performance Comparison-Based Cache-Aided Rollback Error Recovery Computing Systems," M.S. Thesis, Department of Electrical Engineering, University of Washington, Seattle, WA 98195-2500.
[13] K. Wilken and J. P. Shen, "Continuous Signature Monitoring: Low-Cost Concurrent Detection of Processor Control Errors," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 9, No. 6, pp. 629-641, June 1990.
[14] M. A. Schuette and J. P. Shen, "Exploiting Instruction-Level Resource Parallelism for Transparent, Integrated Control-Flow Monitoring," Proc. 21st Symposium on Fault-Tolerant Computing, Montreal, Canada, pp. 318-325, June 1991.
[15] N. R. Saxena, C. W. D. Chang, K. Dawallu, J. Kohli, and P. Helland, "Fault-Tolerant Features in the HaL Memory Management Unit," IEEE Transactions on Computers, Vol. 44, No. 2, pp. 170-180, Feb. 1995.
[16] N. R. Saxena and E. J. McCluskey, "Control-Flow Checking Using Watchdog Assists and Extended-Precision Checksums," IEEE Transactions on Computers, Vol. 39, No. 4, pp. 554-559, April 1990.
[17] Y. Ooi, M. Kashimura, H. Takeuchi, and E. Kawamura, "Fault-Tolerant Architecture in a Cache Memory Control LSI," IEEE Journal of Solid-State Circuits, Vol. 27, No. 4, pp. 507-514, April 1992.
[18] T. R. Sarnaik and A. K. Somani, "On Reducing Test Time and Meeting Deadlines in Real-Time Systems," Proc. First Asian Test Conference, 1992.
[19] J. B. Dugan and K. S. Trivedi, "Coverage Modeling for Dependability Analysis of Fault-Tolerant Systems," IEEE Transactions on Computers, Vol. 38, No. 6, pp. 775-787, June 1989.
[20] G. Ciardo, A. Blakemore, P. F. Chimento, Jr., J. K. Muppala, and K. S. Trivedi, "Automated Generation and Analysis of Markov Reward Models Using Stochastic Reward Nets," in Linear Algebra, Markov Chains and Queuing Models, C. Meyer and R. Plemmons (eds.), IMA Volumes in Mathematics and its Applications, Vol. 48, pp. 145-191, Springer-Verlag, Heidelberg, 1993.
[21] P. K. Tapadiya and D. R. Averesky, "A Framework for Developing a Software-Based Fault-Injection Tool for Validation of Software Fault-Tolerant Techniques Under Hardware Faults," Technical Report 94-001, Department of Computer Science, Texas A&M University, 1994.
[22] Z. Segall, D. Vrsalovic, D. Siewiorek, D. Yaskin, J. Kownacki, J. Barton, R. Dancey, A. Robinson, and T. Lin, "FIAT: Fault Injection Based Automated Testing Environment," Proc. 18th Symposium on Fault-Tolerant Computing, Tokyo, Japan, pp. 102-109, June 1988.
[23] W.-L. Kao, R. Iyer, and D. Tang, "FINE: A Fault Injection and Monitoring Environment for Tracing the UNIX System Behavior under Faults," IEEE Transactions on Software Engineering, Vol. 19, No. 11, pp. 1105-1118, Nov. 1993.
[24] J. Karlsson, P. Folkesson, J. Arlat, Y. Crouzet, G. Leber, and J. Reisinger, "Evaluation of the MARS Fault Tolerance Mechanisms Using Three Physical Fault Injection Techniques," Third IEEE International Workshop on Integrating Error Models with Fault Injection, Annapolis, MD, April 1994.
[25] G. Kanawati, N. A. Kanawati, and J. A. Abraham, "FERRARI: A Flexible Software-Based Fault and Error Injection System," IEEE Transactions on Computers, Vol. 44, No. 2, pp. 248-260, Feb. 1995.
[26] J. Arlat, Y. Crouzet, and J. C. Laprie, "Fault Injection for Dependability Validation of Fault-Tolerant Computing Systems," Proc. 19th Symposium on Fault-Tolerant Computing, Chicago, IL, pp. 348-355, June 1989.
[27] K. G. Shin and Y. H. Lee, "Measurement and Application of Fault Latency," IEEE Transactions on Computers, Vol. C-35, No. 4, pp. 370-375, April 1986.