On-line Integrity Monitoring of Microprocessor Control Logic

Seongwoo Kim
Department of Electrical and Computer Engineering, Iowa State University
[email protected]

Abstract

Traditionally, the control logic of most microprocessors is not checked for soft errors because of the great overheads involved, while the remaining parts, such as the memory arrays and the data path, are often protected with error correcting codes. This paper presents a low-cost reliability enhancement scheme for the processor's control logic. We classify control signals into static and dynamic control depending on their changeability for a given instruction, and employ a different mechanism for each. For static control, the signals used in each pipeline stage are integrated into a signature and verified against a cached check code at commit time; the concept of caching signatures is introduced. Dynamic control is examined on the spot where the signals are created, using parity or component-level duplication. Fault injection simulations on an RTL model of a MIPS-like processor demonstrate that our scheme can achieve more than 99% coverage on average with a very small addition of hardware. We have also investigated the criticality of errors in the processor logic, which suggests a direction for devising an efficient allocation of redundancy.

1 Introduction

Unstable environmental conditions cause hardware transient faults that may result in soft errors in microprocessors. Technology scaling in VLSI chip fabrication makes silicon chips more prone to soft errors while improving performance dramatically. According to scaling theory [1], a new technology generation scales gate delay by 0.7, resulting in a 43% increase in operating frequency. It also reduces the energy per signal transition by 65% and doubles transistor density. All of these scaling factors are responsible for the processors' increasing susceptibility to soft errors. The critical charge (Qcrit) of a digital circuit element is the minimum charge needed to change the element's logic state. Since $Q = CV$, Qcrit decreases with future technologies as both capacitance ($C$) and voltage ($V$) shrink. Therefore, the probability that random noise disrupts circuit elements is increasing significantly. The major sources of such transient faults are electromagnetic interference, power jitter, alpha particles, and cosmic rays.
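These figures are mutually consistent under classical scaling with a ratio of 0.7 per generation; the following arithmetic is our reading of the numbers quoted from [1], not notation from the original:

$$f' = \frac{f}{0.7} \approx 1.43\,f, \qquad E' = (0.7C)(0.7V)^2 \approx 0.34\,CV^2, \qquad Q'_{\mathrm{crit}} = (0.7C)(0.7V) = 0.49\,Q_{\mathrm{crit}},$$

i.e., the 43% frequency gain and 65% energy reduction per generation come at the cost of roughly halving the critical charge.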

In highly complex modern microprocessors, soft errors have various effects depending upon their manifestation timing and location. Once an error occurs in an element, it can propagate to other elements and generate a control mistake, data loss, an incorrect arithmetic or logical operation, etc. Although every element may be vulnerable to soft errors, it should be noted that not all errors result in failures. For example, a corrupted register value can be overwritten before it is used as an operand; sometimes errors are masked without any special effort. Nevertheless, the fact that even a single-bit error may cause fatal damage creates the need for error detection and recovery capabilities. As we come to depend on computers in a widening range of applications, highly dependable systems at affordable cost are in great demand. New fabrication materials with better fault immunity may enhance the reliability of the microprocessor, but they are not sufficient to prevent soft errors from occurring. Alternatively, fault tolerance techniques using redundancy are often employed in the processor architecture and design to achieve the desired level of dependability. The most common techniques are hardware/time replication and error-checking codes (ECCs). Although replication is advantageous in terms of design and verification complexity, full replication is usually very expensive and is therefore applied only to critical applications. Partial replication at unit level is also considered in several designs: in IBM's S/390 G4 [2] and G5 [3] mainframes, the integer and floating-point execution units are duplicated. ECCs are very effective in protecting memory arrays, but they are not well suited to irregular structures such as control logic.

Figure 1: Chip area classification and our target area for protection. (The omitted figure partitions the chip into four blocks: memory array; data path; debug, performance monitoring, etc.; and control logic, which is shaded as our target.)

In general, some parts of a microprocessor remain unprotected. Figure 1 depicts a microprocessor chip whose area is classified into four logic blocks. 1) Information in storage units such as cache memories and register files is covered by ECCs; the on-chip caches of many general-purpose processors (GPPs) are equipped with byte-parity and SEC/DED [4] protection. 2) Units for debugging, testing, and performance monitoring, which are not used in actual computation, require no run-time error checking. 3) For the data path, including buses and some functional units, residue checks and illegal-condition checks can be used as well as ECCs [5]. 4) For the remaining part, control and other random logic, it is difficult to devise appropriate protection techniques that can replace expensive unit-level replication. As a result, most GPPs incorporate no special protection features for this part. This paper focuses on this problem and proposes a cost-efficient solution.

Faults in the control logic of the processor may result in incorrect execution of instructions and errors in their sequencing. For example, a fault in the decoder of a cache memory can cause the delivery of an instruction sequence that is a mixture of instructions from correct and incorrect locations. Signature monitoring is a technique to check the integrity of program execution and the flow of control using signatures [6], [7], [8]. Traditional signature monitoring schemes were developed at the system level: a watchdog processor concurrently monitors the processor's behavior using the signals on the external address and data buses [9], [10]. Because of its inherently limited observability, external control flow monitoring is not effective for complex processors with built-in caches. Signature monitoring can also be implemented fully in software [11], [12]. Instead of using a specialized hardware monitor, signatures are embedded in the program code and control flow errors are checked internally at every assertion point assigned by the compiler's preprocessing. Since the monitoring consumes the processor's computing power, this approach degrades performance by more than 30%. To reduce the performance impact, the ARC technique [13] exploits the processor's unutilized resources for error checking. It achieves high coverage in detecting errors of some defined types, but the original resource usage of the base processor modeled is quite low (on average 36% for the integer unit and 17% for the floating-point unit). Idle resources can also be used for redundant execution of instructions. In the RESO technique [14], [15], a function is recomputed with shifted operands for verification. Operand shifting makes it possible to detect permanent as well as transient faults in the computational functional unit; however, only computation instructions can be covered with this approach. Recently, thread-level fault detection schemes have been proposed for simultaneous multithreading (SMT) architectures [16]. The AR-SMT [17] and SRT [18] processors run two copies of the same program in separate thread contexts by dynamically partitioning resources and compare the outputs of each redundant instruction pair. While this approach

checks the integrity of every instruction execution with very little hardware added to the SMT processor, the performance penalty may be high unless enough idle resources are available. For integration into modern microprocessors, fault tolerance techniques need to achieve negligible performance impact, low cost, and high coverage at the same time. Efficient processor integrity monitoring (PIM) is still an open problem. In this paper, we present a hardware-based PIM scheme at the circuit level to protect the processor's control logic (the shaded portion in Figure 1). The proposed scheme checks the integrity of instructions and their correct sequencing from instruction fetch to the commit point. We categorize the control logic into two groups and protect them separately with different techniques. This paper also presents a simple control flow monitoring mechanism. The scheme is developed under a realistic fault model, and its high efficiency is validated with software-implemented fault injection simulations.

The rest of the paper is organized as follows. In Section 2, we present the base processor model and the fault model for which our schemes are developed. In Section 3, we examine the effects of faults on control logic in the processor pipeline and propose our integrity checking strategy for each class of control logic. The evaluation methodology is discussed in Section 4, and the results of the simulations are given in Section 5. Section 6 addresses fault behavior in allocating limited redundancy for error checking. We conclude our study in Section 7.

2 Processor and Fault Model

Modern microprocessors employ superscalar, superpipelining, and/or dynamic pipelining techniques to achieve maximum speedup. The goal of our study is to develop efficient reliability enhancement techniques for such processors. Figure 2(a) illustrates the base processor pipeline model. The pipeline is divided into two parts. The front-end fetches instructions from the cache or memory and feeds them into the back-end. In the back-end, instructions are executed on functional units and the results update the processor's architectural state. To keep the execution engine busy in the presence of branches and cache misses, the front-end may speculatively fetch sufficiently far ahead and store instructions in a decoupling buffer. Depending on microarchitectural choices, the back-end can process a single instruction or a group of instructions at a time, in or out of program order. Regardless of the order of instruction issue and execution, the processor always maintains the in-order and lookahead state [19] correctly to support precise interrupts. The number of pipeline stages and functional units varies with the implementation.

Figure 2: (a) The base processor pipeline model: a front-end (instruction pointer generation, instruction fetch/alignment) feeding a back-end that may be in-order or out-of-order (instruction decode/issue, register renaming, execution and exception detection, result commitment). (b) An example of soft error propagation: noise during decode corrupts the write control signal of an "add r1,r2,r3" instruction; the error, if not detected, stays latent until it causes a register write failure in the write stage, and the error detection latency spans the interval from error generation to detection.

Transient faults disturb the processor circuitry used in any pipeline stage for a short period. An error manifesting the fault, i.e., a change of the system state, can multiply along pipeline paths. The error may cause the system to start malfunctioning instantly or on a later cycle. Figure 2(b) depicts an example of such an event. An electronic noise upsets a part of the random logic that produces the control signal for storing the result of an addition operation into a destination register, r1. The fault creates an erroneous control signal, inverting write to no-write, in the instruction decode stage, and the error stays latent until it causes a failure in the write stage. Since the control signal does not enable r1 for a write, r1 keeps its old value. If the result differs from the old value, the control error has propagated into a data error. If r1 is referenced by subsequent operations, further propagation is expected. We categorize processor errors by their originating site on the chip, as classified in Figure 1: 1) data errors in memory arrays, 2) data path errors in buses and functional units other than control logic, and 3) control errors in control logic blocks. As the example shows, an error of one type can generate new errors of other types. In this paper, our focus is on protection against control errors. Control errors that incorrectly change the flow of program control are control flow errors (CFEs); the rest are referred to as control signal errors (CSEs).

Control signals of the processor fall into two types. One is a signal directly derived from instructions: each field of an instruction becomes a unique signal that controls particular processor components in planned pipeline stages. This type of control is always the same for a given instruction, and we call it static control. The other type of signal, called dynamic control, is generated by special conditions of the processor during program execution; these signals may thus vary for the same instruction. The state of components and the product of several static controls determine such signals, which include hazard detection, bypassing control, predicate control, etc. For instruction sequencing, unconditional branch instructions mostly use static control, whereas conditional branch instructions always generate dynamic control as well. A single random noise event from an alpha-particle or neutron hit can affect multiple nearby memory cells or logic gates simultaneously. Since a circuit's susceptibility to transient faults varies from chip to chip and from block to block within a chip, the number of circuit elements affected by a single fault may vary accordingly, and a larger number would be expected for continuously shrinking and low-powered future processors. Although a single-bit failure mode has been used in many studies, it was observed as early as 1979 that a single alpha particle could cause four contiguous bit cells to flip erroneously in an 8-KB DRAM [20]. Taking this observation into account, the following transient fault model is established for our study.



- Faults occur randomly in terms of time and location (random fault model).
- Faults cause signal inversion both from 1 to 0 and from 0 to 1 with equal probability (inversion model).
- A fault results in an error of at most k bits in the output of a unit component, where k is no more than the output width of the component (multiple-bit error model).
- While a fault is in action, additional faults may occur and some impacts may overlap; however, the probability of this event is negligible (independent fault model).
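To make the fault model concrete, the following Verilog sketch (our illustration, not part of any processor model in this paper) corrupts a monitored control signal for exactly one clock cycle; the module and port names are hypothetical, and the flip mask, with at most k bits set, is assumed to be chosen by the surrounding testbench:

    // A minimal sketch of the transient fault model; names are illustrative.
    module fault_injector #(parameter W = 8) (
      input  wire         inject,     // one-cycle pulse at the fault injection time
      input  wire [W-1:0] flip_mask,  // random pattern with at most k bits set
      input  wire [W-1:0] sig_in,     // fault-free control signal
      output wire [W-1:0] sig_out     // signal seen by the rest of the pipeline
    );
      // XOR inverts 1 -> 0 and 0 -> 1 alike (inversion model); the mask is
      // applied only while inject is high, so the fault is transient.
      assign sig_out = sig_in ^ (inject ? flip_mask : {W{1'b0}});
    endmodule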

3 Integrity Checking Strategy for Control Logic

Parity checking is one of the simplest ways to verify the operation of a logic block if the parity of the logic output is easily predictable from the input. For example, suppose an n-to-1 multiplexer is provided with n pairs of data and parity delivered from array-structured components. The multiplexer is extended such that it can select a parity code along with the corresponding data. The selected parity code then confirms the operation of the multiplexer by testing the selected data. The type of the parity determines the error coverage. We can flow the parity code through the pipeline along with the data for as long as possible and use it for logic integrity checking wherever applicable. Unfortunately, for random logic this parity-flowing approach becomes ineffective, since the logic output loses any simple relation with the input parity. A different strategy is needed for control logic protection.
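A sketch of the parity-flow idea for the multiplexer example above, under our own naming (even parity is assumed; nothing here is taken from a specific design): the mux is widened by one bit so that it selects the parity code along with the data word, and the selection is confirmed by recomputing parity over the selected data.

    // n-to-1 multiplexer extended to carry and check a per-word parity bit.
    module mux_with_parity #(parameter W = 32, N = 4) (
      input  wire [N*W-1:0]       data_in,    // n data words, packed
      input  wire [N-1:0]         parity_in,  // one even-parity bit per word
      input  wire [$clog2(N)-1:0] sel,
      output wire [W-1:0]         data_out,
      output wire                 parity_out, // flows down the pipeline with the data
      output wire                 error       // selected data and parity disagree
    );
      assign data_out   = data_in[sel*W +: W];   // indexed part-select
      assign parity_out = parity_in[sel];
      // For even parity, the XOR-reduction of the data must equal the carried
      // parity bit; a mismatch exposes a faulty selection or corrupted data.
      assign error = (^data_out) ^ parity_out;
    endmodule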

Examining the behavior of control errors during program run-time provides a direction in devising an effective protection scheme. Figure 3 shows an example program segment running on the base processor pipeline. The pipeline is about to complete I1 and needs to make sure of correct control in the earlier stages for that instruction. If there has been any error during I1's execution, its effect needs to be removed by this point; otherwise, the system state becomes faulty.

    loop1:  I1   addi r1, r2, 8
            I2   slt  r4, r3, r1
            I3   lui  r5, 4097
            I4   addi r6, r7, r8
            I5   sb   r6, r5, 4
            I6   sb   r16, r5, 8
            I7   addi r9, r6, 8
            I8   bne  r1, r9, loop2
            I9   mthi r16
            I10  sub  r9, r16, r6
            I11  beq  r1, r9, loop1
            ...
    loop2:  I22  ori  r7, r1, 1
            I23  lui  r5, 1025
            I24  sub  r9, r6, r2
            I25  bgtz r9, loop2
            ...
            I31  add  r3, r1, r2
            I32  sb   r3, r5, 12
            I33  sub  r9, r6, r3
            I34  bgez r9, loop1

Figure 3: (a) An example program segment. (b) Control flow deviated by a CFE in the base processor pipeline executing the example code in (a). (The omitted pipeline snapshot shows instructions ready for execution, the check-and-commit stage, instructions annulled due to mis-speculation, and the points where a dynamic control error and a control flow error arise.) The processor uses speculation, i.e., it executes instructions before it is known that they will be needed. A branch, I8, is predicted as taken, and instructions starting at I22 are speculatively executed. A control error that occurred in the front-end propagates and causes a jump to I31 after I23, which is not a branch instruction. Fortunately, fall-through occurs at I8 and the instructions in the false flow are squashed before being committed.

As a result, the effect of the CFE is eliminated as well. As this example shows, errors may disappear without affecting the system state. Irrespective of the instruction scheduling, only instructions in the true flow of program control are committed and need to be protected. In this context, postponing the integrity checking until the last stage may be advantageous. This commit-time checking is suitable for CFE detection and is also applicable to other control signals if the information required for verification is available at commit time. The error latency is at most the pipeline depth, and recovery is to re-execute from the affected instruction. For the dynamic control, commit-time checking is difficult to implement and has no advantage over checking in the stage in which the control operates. In Figure 3(b), I2 is data-dependent on I1, but the bypassing unit forwards the result of I4 instead due to a control signal error. If the processor immediately checks the unit and detects the error, the system state can be recovered with a simple operation retry; in this case, the error latency is minimal. Since immediate checking distributed over the pipeline may require a large overhead and impact performance, these costs should be taken into account in the design process. It might be possible to adjust the overhead with selective protection based on the criticality of errors in the control logic. Our strategy for checking the integrity of control signals is to consider their distinctive characteristics and use a different mechanism for each type. For the static control, signals used in the pipeline are collected and verified against previously known signals at commit time; this is possible because of the unchangeability of the signals. The dynamic control, on the other hand, is examined on the spot with check codes when the signals are created: random logic for dynamic control is parity-protected or duplicated at the component level. In order to detect an erroneous flow of control, the branch target address is tracked and the address of every committing instruction is monitored.

3.1 Static control protection

Unless there is an error, the static control signals for a given instruction remain unchanged in every execution of the instruction. Once we know the correct signals, the integrity checking for static control logic is to examine whether or not it produces and applies the same signals at the proper time. It is impractical for the processor to keep the static control signals of every instruction and use them for error checking in the different pipeline stages. Our alternative proposal is to transfer the signals used to the last stage in a compacted form, called a signature, and verify the control against a pre-stored signature.

Figure 4: Pipelined signature generation (instruction fetch, signature generation across the stages, and signature checking at commit).

The signature is an 8-bit code resulting from exclusive-ORing the static control signals throughout the pipeline. Figure 4 shows the pipelined generation of a signature; the various symbols in the boxes represent the different signals that drive operations in each stage. When an instruction is fetched, its signature starts to form. The generation and the use of a control signal can occur in different stages: as the instruction flows through the pipeline, the signals actually used in a stage are integrated into the signature and the rest are passed to the next stage. When the instruction reaches the final stage, the signature generation is complete. As a result, the signature manifests errors, if any, in the generation, transfer, and actual use of the static control signals. Bits in the signature are interleaved in such a way that multiple erroneous signal bits do not overlap during formation; 8-bit signatures provide almost the same coverage as wider ones.

In order to verify a newly generated signature at commit time, a correct signature, or checking signature, needs to be provided for comparison. The checking signature of each instruction could be embedded in the program beforehand and retrieved with the instruction fetch, but this approach requires a large overhead. To maintain the checking signatures economically, we employ a small table called the signature table and apply cache management techniques. The signature table is configured like a common instruction cache, holding signatures instead of instructions in its entries; we call this scheme signature caching. For simplicity, the table starts empty, and no checking signatures are prepared separately by preprocessing. On the first execution of each instruction, the run-time generated signature fills the table entry indexed by the instruction's address; this signature is then used as the checking signature in later executions of the instruction. Since the table has a limited size, it replaces old entries to keep the signatures of recently executed instructions. When an instruction is about to be committed, the table is searched in parallel for the corresponding checking signature to conduct the comparison. Error checking for the static control is thus possible only on a table hit. In case of error detection, i.e., a signature mismatch, the pipeline is flushed and the program counter is set to the address of the corrupted instruction; execution resumes by re-fetching that instruction. Considering the processor's high operation rate, the performance impact of annulling a few instructions over billions of cycles is negligible. Moreover, the processor already includes this kind of recovery mechanism for speculative execution. Although table misses may still occur after the initial phase, the high locality in programs makes it possible for the table to provide the desired signatures most of the time with minimal hardware. For further reduction in hardware overhead, the table may be merged with existing structures such as the instruction cache or the branch target buffer. Unlike conventional signature-based protection schemes, the proposed scheme requires no compiler modification for embedding signatures and causes no pipeline stalls for error checking.
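The per-stage accumulation itself is a single exclusive-OR (in Verilog, sig_out <= sig_in ^ used_ctrl). The module below sketches the signature-caching side under assumed names; a direct-mapped organization is shown for brevity, whereas the evaluated design (Section 5) is two-way set associative with 64 entries.

    // Signature table sketch: on a hit at commit, a signature mismatch raises
    // `error`; on a miss, the run-time signature fills the entry and serves as
    // the checking signature for later executions of the same instruction.
    module signature_table #(parameter IDX = 6) (
      input  wire        clk,
      input  wire        commit,   // an instruction is committing this cycle
      input  wire [31:0] cpc,      // address of the committing instruction
      input  wire [7:0]  sig_in,   // run-time generated signature
      output wire        hit,
      output wire        error     // detected static control error
    );
      reg [7:0]       sig_mem [0:(1<<IDX)-1];
      reg [29-IDX:0]  tag_mem [0:(1<<IDX)-1];
      reg             valid   [0:(1<<IDX)-1];
      integer i;
      initial for (i = 0; i < (1<<IDX); i = i + 1) valid[i] = 1'b0;  // table starts empty

      wire [IDX-1:0]  idx = cpc[IDX+1:2];   // word-aligned instruction address
      wire [29-IDX:0] tag = cpc[31:IDX+2];

      assign hit   = valid[idx] && (tag_mem[idx] == tag);
      assign error = commit && hit && (sig_mem[idx] != sig_in);

      always @(posedge clk)
        if (commit && !hit) begin           // first execution, or entry replaced
          sig_mem[idx] <= sig_in;
          tag_mem[idx] <= tag;
          valid[idx]   <= 1'b1;
        end
    endmodule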

3.2 Dynamic control protection

A signature of dynamic control signals can keep changing even for the same instruction, so the mechanism used for the static control is not applicable. Employing a parity prediction scheme for a random logic block can cost as much as duplication. It may be possible to redesign the logic in such a way that the correct output, combined with some redundant output, always meets a predefined condition for verification purposes. However, it is hard to guarantee, for every piece of random logic, such a solution that is more advantageous than duplication in terms of error coverage and overhead. We therefore choose duplication at the component level for the dynamic control logic. For a given unit, say the integer unit, the entire unit can be duplicated for design simplicity in high-end systems, but large area, high power dissipation, and other overheads limit this application. In our strategy, by contrast, duplication is used only for the dynamic control part of the unit, which is difficult to check with the redundant-code-based techniques that protect the rest of the unit. This minimizes the chip area devoted to integrity checking, and the separate checking mechanisms employed for the different parts of the unit are complementary. The two outputs of the duplicated logic are compared on every cycle. In case of a mismatch, the comparator promptly asserts error detection before the erroneous signals are latched into the pipeline register. Recovery takes place without discarding any instruction in the pipeline: each pipeline stage repeats the same operation until the two outputs match. The recovery time depends on the duration of the transient fault in the logic; the effect of a transient usually lasts a few clock cycles. In order to optimize the protection capability within a limited area, selective duplication can be performed based on priorities obtained from the fault sensitivity information of the logic. The fault sensitivity of logic components will be discussed in Section 6.
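A sketch of the comparator side of component-level duplication, with names of our choosing: the original and duplicated copies of a dynamic-control block (instantiated elsewhere from the same inputs) feed this checker, which latches the control into the pipeline register only when the two copies agree.

    // Per-cycle comparison of a duplicated dynamic-control block.
    module dup_checker #(parameter W = 8) (
      input  wire         clk,
      input  wire [W-1:0] out_a,    // output of the original control logic copy
      input  wire [W-1:0] out_b,    // output of the duplicated copy
      output reg  [W-1:0] ctrl_out, // checked control latched into the pipeline register
      output wire         retry     // mismatch: hold the stage and recompute
    );
      assign retry = (out_a != out_b);   // compared on every cycle
      always @(posedge clk)
        if (!retry)
          ctrl_out <= out_a;   // latch only when the copies agree; otherwise
                               // the stage repeats until the outputs match
    endmodule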

3.3 Control flow monitoring

Errors in both the static and the dynamic control signals may result in a CFE. Depending on the source and destination of the erroneous control transfer, CFEs can be classified into four types. Figure 5 illustrates each type of CFE using the flow diagram of the example code in Figure 3(a). A box denotes a basic block, a sequence of instructions with only one entry and one exit point. The dotted line is the correct flow of control that should have been taken in the fault-free condition, whereas the solid line is the incorrect flow actually taken due to an operational mistake.


Figure 5: Classification of control flow errors (four panels, Types I through IV, drawn on the basic-block flow diagram of the code in Figure 3(a)).

In Types I and II, the processor takes a jump before completing the current basic block. The jump can be taken either to the beginning of a basic block (Type I) or to an instruction past the entry point of a basic block (Type II). The CFE in Figure 3(b) is of Type I. CFEs can also occur during the execution of a branch, i.e., at the exit point of a basic block, as in Types III and IV. The processor may start to execute a new but wrong basic block (Type III) or jump into the middle of a basic block, which can be either the correct or a wrong one (Type IV).

Control flow monitoring (CFM) detects CFEs. In traditional CFM schemes, each basic block is associated with a signature, e.g., a checksum of its instruction stream, which is prepared before program execution. This signature is compared with the run-time generated signature at every exit point of a basic block. Any jump from a non-exit point or to a non-entry point of a basic block causes a discrepancy between the two signatures, achieving error detection. However, CFEs of Type III are undetectable with this approach.

Figure 6: Hardware block diagram for control flow monitoring combined with the signature table. (The omitted diagram shows nPC, NAR, and cPC feeding two comparators, the signature table look-up producing a hit signal and a signature compared against sig_in, and the comparator outputs combined with the bubble flag in the error flag generation logic.)

Once the combination of the signature caching, parity checking, and component-level duplication covers the execution integrity of each instruction, CFM reduces to a simple check of the address of the instruction being committed. Figure 6 shows the block diagram implementing our instruction-address (IA)-based CFM scheme; the signature table described in Section 3.1 is combined with the CFM hardware. When the processor executes an instruction, it maintains the address of the next instruction to be executed in the actual flow of control, denoted nPC. Unless a branch occurs, nPC points to the next contiguous location in the current basic block. For a branch instruction, nPC is updated with the branch-resolved target address in an earlier stage of the pipeline and is passed along with the instruction's own address (cPC) and signature (sig_in). In this way, nPC at commit time indicates the scheduled control flow. Annulling an instruction in a pipeline stage sets bubble to 1. Upon completion of an instruction's commitment, its nPC is stored in a register, NAR, for future use. When the next instruction is ready for commitment, its cPC is compared with NAR to check for a CFE. In case of a mismatch, the pipeline is flushed and execution resumes from the instruction addressed by NAR; this is similar to the recovery procedure in the signature caching. The initial source of cPC and of the address in NAR is the same component, so if this common source is faulty (we call this case a common fault), CFE detection is not possible; however, the source component is protected with duplication. At commit time, the signature table is also accessed with the address in NAR. On a table hit, the run-time signature, i.e., sig_in, and the signature from the table should match in the error-free condition.
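The address-checking half of the scheme can be sketched as follows (our port names; the signature-table access shown in Figure 6 is omitted here): NAR holds the nPC of the last committed instruction, and the next committing instruction's cPC must match it.

    // IA-based control flow monitor: compare the committing address with NAR.
    module cfm (
      input  wire        clk,
      input  wire        reset,
      input  wire        commit,  // an instruction completes commitment this cycle
      input  wire        bubble,  // set if the instruction was annulled in some stage
      input  wire [31:0] cpc,     // address of the committing instruction
      input  wire [31:0] npc,     // branch-resolved next address carried with it
      output wire        cfe      // control flow error detected
    );
      reg [31:0] nar;             // next address register
      reg        nar_valid;       // no check before the first commitment
      assign cfe = commit && !bubble && nar_valid && (cpc != nar);
      always @(posedge clk)
        if (reset)
          nar_valid <= 1'b0;
        else if (commit && !bubble) begin
          nar       <= npc;       // scheduled flow for the following instruction
          nar_valid <= 1'b1;
        end
    endmodule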

Figure 7: Block diagram of SimR2K, excluding the redundant logic added for integrity checking. (The omitted diagram shows the MIPS-like pipelined datapath together with its control blocks, among them the main control, branch control, hazard detection, bypass control, ALU control, MUL/DIV control, data cache control, and the pipeline-register write/flush logic.)

In addition to control signal errors, an incorrect branch decision by the branch unit can cause CFEs of Type III. Corruption of memory array data, such as the instruction addresses and operands of a branch instruction, can result in an improper jump of any type. As discussed earlier, this paper concentrates on the detection of CFEs induced by control logic errors; however, some of the CFEs resulting from such memory and functional errors are also detectable with the proposed scheme. It should be noted that the IA-based CFM technique covers all the types of CFEs described in Figure 5. This run-time error checking is performed off the critical path, resulting in no or negligible performance degradation.

4 Validation Methodology

For the purpose of validation, fault injection simulations have been conducted on the SimR2K processor, which we built in Verilog HDL. SimR2K is an RTL model of a 32-bit RISC CPU core and implements the instruction set architecture of the MIPS R2000/3000 [21]. Figure 7 depicts the block diagram of SimR2K, excluding the redundant logic for integrity checking.

Figure 8: The complete list of FILs and their corresponding output signals. (The omitted table enumerates all 70 FILs of SimR2K, giving each output signal's width and name, e.g., pc_4, pc_in, IDFlush, BranchOp, ALUCtl, ControlIn, Stall, dst_reg, and BrFlag, together with the architectural component or operation affected, such as the PC, branch handling, operands, register data, pipeline stalls, or flow control. Shading indicates the protection scheme applied to each block: signature caching, IA-based CFM, signature caching and CFM, or component-level duplication; unshaded FILs belong to the data path.)

Figure 9: Fault injection timing during benchmark execution. (The omitted timeline spans the total execution time T in the fault-free condition: a warm-up period of T/8, a fault injection period of 3T/8 in which successive FIPs are more than 10 clocks apart, and an observation period extending to T + 500 clocks.)

This processor incorporates the proposed protection schemes, and error checking takes place concurrently while a benchmark program is running. Under our fault model, component-level duplication provides 100% protection coverage for the dynamic control logic; thus, the dynamic control protection is not separately tested, and the redundant logic required for that scheme is not implemented in SimR2K. Four application programs with different algorithms were chosen as the benchmark suite. Hanoi solves the towers of Hanoi puzzle using recursive function calls. Intmm performs 4x4 integer matrix multiplication using 5 different algorithms. Heap uses the heap sort algorithm to sort 32 integer numbers. Queens finds all the solutions of the Six Queens problem on a 6x6 board using a recursive algorithm. All benchmarks are written in C and compiled with a cross-compiler, dlxcc [22]. Software-simulated faults based on the model described in Section 2 were injected into the control logic of the processor at run-time. The logic in the data path was also examined with fault injection, to measure the likelihood that a logic error results in a failure. For a given logic block, fault injection locations (FILs) are identified after grouping fault-equivalent gates (those with a common fault effect on the output) together by examining error propagation paths with test generation rules. Thus the number of gates comprising an FIL varies with the characteristics of the logic block. Figure 8 lists the 70 identified FILs for SimR2K and their output signals. A different shade indicates the protection scheme applied to each block, and FILs with no shade belong to the data path. Although FIL 70 is represented by FIL 3 in flushing the pipeline register, it also controls the redundant logic that determines the branch-resolved nPC in the decode stage, and it is therefore considered separately. The figure shows which architectural component or operation would be disturbed by an error in the output signal of each FIL. A fault injection is performed on a targeted FIL at a randomly selected clock, denoted the fault injection point (FIP), for a single clock cycle. The fault always creates an inversion of multiple random bits in the signal of the FIL; thus a fault injection corresponds to the injection of an output signal error. Multiple FIPs are examined for each FIL independently, i.e., one injection per run. The number of FIPs is determined by the experiment time limit and the desired accuracy of the measurement. Figure 9 depicts the fault injection timing during benchmark execution. Faults are injected after a warm-up period and before the first half of the total execution time, T, obtained in the fault-free condition. The warm-up period is one eighth of T. Any FIP is at least 10 clocks apart from other FIPs.
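As an illustration of this timing (a sketch under our assumptions, not the authors' actual testbench), a driver for the fault_injector of Section 2 could pick a single FIP uniformly from the injection window [T/8, T/2) and pulse the injection for exactly one clock:

    // One-injection-per-run FIP driver; T is the fault-free execution time.
    module fip_driver #(parameter T = 100000) (
      input  wire clk,
      output reg  inject
    );
      integer fip, cycle;
      initial begin
        fip    = T/8 + ({$random} % (3*T/8));  // after warm-up, in the first half of T
        inject = 1'b0;
        cycle  = 0;
      end
      always @(posedge clk) begin
        cycle  <= cycle + 1;
        inject <= (cycle == fip);              // high for a single clock cycle
      end
    endmodule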

Figure 10: Classification of the outcomes of the benchmark runs in the presence of transient faults. (The omitted table crosses the detection outcome, namely error not detected, not detected but not effective, detected by the CFM, and detected by the signature caching, with the program termination time relative to T and the comparison of results and system state (match or mismatch), yielding ten cases whose effect ranges from program failure (Cases 1-4) through none or negligible to easy or hard to recover (Case 10).)

After the fault injection, the behavior of the processor is observed until the program ends. On completion of each benchmark run, the outcome in the presence of the fault is evaluated by comparison with the precomputed correct results and final architectural state. If the program does not terminate by the normal termination time, it is inspected for 500 more clocks; a continuing run is forcibly stopped at T + 500 and the evaluation is made. Figure 10 categorizes the possible outcomes of benchmark runs with fault injections into 10 cases. If a fault is not detected, a failure may or may not occur; in case of a failure, there are four cases depending on the program termination time. We assume that a program always fails after T + 500. Therefore, there are three cases for undetected faults with correct program completion. When the signature caching detects a fault, either of the two signatures being compared may be the erroneous one. Since the checking signature in the signature table is also generated at run-time, it can be incorrect as well. In this case, the processor needs to roll back to a point prior to the checking-signature generation; this is Case 10. If the processor employs check-pointing [4], recovery is possible. We measure the performance of the proposed schemes with a fault protection coverage (%) computed by

$$\left(1 - \frac{\text{number of runs resulting in a failure}}{\text{number of runs with a fault injection}}\right) \times 100\% \;=\; \left(1 - \frac{\sum_{i=1}^{4} \mathrm{Case}_i}{\sum_{j=1}^{10} \mathrm{Case}_j}\right) \times 100\%.$$

5 Experiment Results

Unlike duplication, the protection coverage of the signature caching and of the IA-based CFM can be affected by the processor's run-time behavior. In SimR2K, the combination of these two schemes protects 39 control signals (FILs), as shown in Figure 8; the total area of the logic blocks that can be protected by each technique varies across processor architectures. Figure 11 presents their performance for the four benchmarks. For each FIL on the x-axis, the coverage shown was obtained from faults at 100 FIPs. Most signals are completely covered; on average, no more than 1% of the injected faults result in a failure for any benchmark. The possible error of this coverage estimate is less than 7% in all cases at the 95% confidence level. The distribution of fault injection outcomes based on the classification of Figure 10 is also shown. Interestingly, the benchmarks share a common general tendency in the distribution, which indicates that the characteristics of the control logic matter more to the effect of faults than the program being executed. Faults that are not covered correspond to table misses in the signature caching and to the common-fault cases in the CFM. Effective faults mostly prevent the program from ending at T; for example, faults in FIL 2 and FIL 31 cause an illegal jump and premature program termination, respectively. Faults missed by the signature checking may be captured by the CFM, and vice versa; the signature checking for FIL 1 and the CFM for FILs 19 and 20 are examples of such cases. If both the signature caching and the CFM detect the same error, it is counted toward the coverage of the signature caching. Error checking with an incorrect signature in the table, i.e., Case 10 of the fault injection outcomes, is seen mostly in FILs 1 and 4. Therefore, once an error is detected by our mechanisms, the recovery is usually simple. The signature table used is a two-way set-associative cache with 64 signature entries of 8 bytes. Signature generators/checkers, comparators, and other supporting hardware are required, but these are still very small overheads compared to unit-level replication. As a result, our strategy achieves low-cost, high-coverage integrity checking for the control logic of the processor.

6 Prioritized Redundancy Allocation

As seen in Figure 11, some faults may not affect the program execution at all. This is also true in other fault injection studies using different methods such as heavy-ion radiation, power supply disturbances, and lasers [23], [24], [25], [26], [27], [28]. Even control logic with no intended redundancy for fault tolerance already has some fault masking capability. Figure 12 shows how often faults in each FIL result in a failure, which we call the fault sensitivity; an FIL with a fault sensitivity of 1 means that the computation always fails if a fault occurs at that location. The number of faults injected per FIL is 30, at different FIPs. Only a few FILs show a fault sensitivity higher than 0.6. The maximum error margins for these proportion estimates at the 95% confidence level are 0.18 for hanoi, intmm, and heap, and 0.16 for queens.
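Written out (our notation; the paper defines the quantity only in words), the fault sensitivity of a fault injection location $\ell$ is the fraction of injections at $\ell$ that end in failure:

$$s_\ell \;=\; \frac{\text{number of injected faults at } \ell \text{ resulting in a failure}}{\text{number of faults injected at } \ell}, \qquad 0 \le s_\ell \le 1.$$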

Figure 11: Fault coverage for hanoi, intmm, heap, and queens (from top to bottom). (The omitted plots show, for each protected FIL on the x-axis, the coverage in percent, together with the distribution of the fault injection outcomes, Cases 5 through 10.)

Figure 12: Fault sensitivity for the four benchmark programs: hanoi, intmm, heap, and queens (from top to bottom). (The omitted plots show the fault sensitivity, from 0.0 to 1.0, for each of the 70 fault locations.)

Figure 13: Timing dependency of fault sensitivity for the benchmark programs: (a) hanoi and intmm (top two); (b) heap and queens (bottom two). (The omitted scatter plots show fault location, FIL 1 to 70, against fault injection point number, 0 to 30; each dot marks an effective fault.)

As is clear from the figure, for all benchmarks the processor often produces correct results despite a loss of integrity in the logic. A fault becomes effective (i.e., causes a failure) only if it coincides with the active cycle of the affected logic, the period in which the logic is actually used. As the active cycle increases, the fault sensitivity also increases. For example, the control signal of FIL 62 flushes the pipeline in case of an arithmetic overflow or a system call. It is continually active, and its incorrect signaling either erroneously flushes an instruction or allows an undesirable instruction to be executed; therefore, its fault sensitivity is 1. Several logic signals show a fault sensitivity of 0 because they are used so infrequently that none of the 30 random FIPs falls into their active cycles. While faults at some locations usually have a fatal impact on the computation, faults at other locations are rarely critical. We can use this fact to allocate a given redundancy budget so as to maximize the system reliability.

The architectural function of a logic signal affects its fault sensitivity. When a corrupted control signal causes a pipeline stall, it delays the execution by one cycle but still produces correct results. Pipeline flushes may also sometimes eliminate the effects of faulty control. The active cycle of a control signal may also vary with the characteristics of the workload; some control logic, such as division control for programs with no divisions, is never used. Since only intmm uses the data path for multiplication, FILs 53, 54, and 55 of the multiplication logic show some fault sensitivity for intmm alone. In-depth studies of the effects of workload input variations on fault behavior can be found in [29], [30].

Fault sensitivity also depends on the timing of the fault occurrence. Figure 13 illustrates the timing dependence of fault manifestation. The FIPs are numbered on the x-axis, but not to scale. Each dot represents an effective fault; if every fault were effective, the graph would have a dot for each location-and-point pair. In reality, faults at certain points turn out to be more often effective. For example, fault injections at FIP 13 for queens result in failures at most logic locations, whereas injections at other points are less effective.

We have shown that the fault sensitivity of a logic block is determined by 1) its logic characteristics, 2) its architectural importance, and 3) the fault timing. The faults do not always cause a failure; in fact, the fault sensitivity in our experiment is below 0.4 for most of the logic. Taking this into account, replicating the processor at the unit or system level is overkill. Since an RTL model of a processor is always available in an early design phase, identifying fault sensitivity can be accomplished with minimal effort. Using such fault sensitivity data, the designer can intelligently allocate a budgeted amount of redundancy to achieve cost-effective reliability enhancement.

7 Conclusion

We have presented a comprehensive integrity checking scheme for the control logic of microprocessors, a hard-to-protect area of the chip. The scheme was devised after examining the effects of faults on the pipeline and the characteristics of the control logic. Control signals are classified into static and dynamic control, and separate protection approaches are applied. We have introduced the concept of caching signatures for the static control protection. By exploiting locality in the program code, this technique significantly reduces the overheads required by traditional signature-based checking while providing high detection capability. In addition, commit-time checking eliminates unnecessary checking in earlier stages and allows the use of an existing mechanism for recovery. Our IA-based CFM technique can detect all four types of CFEs with the assistance of the signature caching, whereas conventional methods cover only three types. For the dynamic control protection, we reduce the replication from the large unit level to the minimal component level, which can take advantage of selective redundancy allocation based on the fault sensitivity of the control logic. The proposed techniques simultaneously achieve 1) low cost, 2) high fault coverage, 3) negligible impact on performance, and 4) easy recovery. Moreover, the schemes complement the existing integrity checking of microprocessors. As a result, incorporating the proposed techniques will significantly enhance a processor's reliability.

8 Acknowledgement

The author would like to thank Dr. Arun K. Somani for his continuous encouragement, support, and advice. This work was funded in part by NSF Grant MIP-9896025. The author also thanks Siwon Noh for her support.

References

[1] S. Borkar, "Design challenges of technology scaling," IEEE Micro, pp. 23-29, July 1999.
[2] L. Spainhower and T. Gregg, "G4: a fault-tolerant CMOS mainframe," Proc. Int'l Symp. Fault-Tolerant Computing, pp. 432-440, 1998.
[3] T. J. Slegel et al., "IBM's S/390 G5 microprocessor design," IEEE Micro, pp. 12-23, Mar. 1999.
[4] D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems: Design and Evaluation, Digital Press, Massachusetts, 1992.
[5] A. Maamar and G. Russell, "A 32 bit RISC processor with concurrent error detection," Proc. 24th EUROMICRO Conf., pp. 461-467, 1998.
[6] J. Ohlsson and M. Rimen, "Implicit signature checking," Int'l Symp. Fault-Tolerant Computing, pp. 218-227, 1995.
[7] K. D. Wilken, "Optimal signature placement for processor-error detection using signature monitoring," Int'l Symp. Fault-Tolerant Computing, pp. 326-333, 1991.
[8] G. Miremadi and J. Torin, "Effects of physical injection of transient faults on control flow and evaluation of some software-implemented error detection techniques," Dependable Computing for Critical Applications 4, pp. 435-457, 1995.
[9] A. Mahmood, E. McCluskey, and D. Lu, "Concurrent fault detection using a watchdog processor and assertions," Proc. Int'l Test Conf., pp. 622-628, Oct. 1983.
[10] I. Majzik, W. Hohl, A. Pataricza, and V. Sieh, "Multiprocessor checking using watchdog processors," Computer Systems Science and Engineering, vol. 11, no. 5, pp. 301-310, Sept. 1996.
[11] S. S. Yau and F. C. Chen, "An approach to concurrent control flow checking," IEEE Trans. Software Engineering, vol. SE-5, no. 2, pp. 126-137, 1980.
[12] Z. Alkhalifa and V. S. Nair, "Design of a portable control-flow checking technique," Proc. High-Assurance Engineering Workshop, pp. 120-123, 1997.
[13] M. A. Schuette and J. P. Shen, "Exploiting instruction-level parallelism for integrated control-flow monitoring," IEEE Trans. Computers, vol. 43, no. 2, pp. 129-140, Feb. 1994.
[14] J. H. Patel and L. Y. Fung, "Concurrent error detection in ALU's by recomputing with shifted operands," IEEE Trans. Computers, vol. C-32, Apr. 1983.
[15] G. S. Sohi et al., "A study of time-redundant fault tolerance techniques for high-performance pipelined computers," Int'l Symp. Fault-Tolerant Computing, 1989.
[16] D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous multithreading: maximizing on-chip parallelism," Proc. Int'l Symp. Computer Architecture, May 1996.
[17] E. Rotenberg, "AR-SMT: A microarchitectural approach to fault tolerance in microprocessors," Proc. Int'l Symp. Fault-Tolerant Computing, 1999.
[18] S. K. Reinhardt and S. S. Mukherjee, "Transient fault detection via simultaneous multithreading," Proc. Int'l Symp. Computer Architecture, pp. 25-36, 2000.
[19] W. Johnson, Superscalar Microprocessor Design, Prentice-Hall, New Jersey, 1991.
[20] J. F. Ziegler et al., "IBM experiments in soft fails in computer electronics (1978-1994)," IBM Journal of Research and Development, vol. 40, no. 1, pp. 3-16, Jan. 1996.
[21] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, 1997.
[22] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 1996.
[23] G. Choi, R. K. Iyer, and V. Carreno, "FOCUS: an experimental environment for validation of fault-tolerant systems - case study of a jet-engine controller," Int'l Conf. Computer Design: VLSI in Computers and Processors, pp. 561-564, 1989.
[24] R. Johansson, "On single event upset error manifestation," 1st European Dependable Computing Conference, pp. 217-231, 1994.
[25] J. Karlsson, P. Liden, P. Dahlgren, R. Johansson, and U. Gunneflo, "Using heavy-ion radiation to validate fault-handling mechanisms," IEEE Micro, vol. 14, no. 1, pp. 8-23, Feb. 1994.
[26] G. Miremadi and J. Torin, "Evaluating processor behavior and three error-detection mechanisms using physical fault-injection," IEEE Trans. Reliability, vol. 44, no. 3, pp. 441-454, Sept. 1995.
[27] W. A. Moreno, F. J. Falquez, J. R. Samson, and T. Smith, "First test results of system level fault tolerant design validation through laser fault injection," Int'l Conf. Computer Design: VLSI in Computers and Processors, pp. 544-548, 1997.
[28] C. K. Kouba and G. Choi, "The single event upset characteristics of the 486-DX4 microprocessor," IEEE Radiation Effects Data Workshop, pp. 48-52, 1997.
[29] E. W. Czeck and D. P. Siewiorek, "Observations on the effects of fault manifestation as a function of workload," IEEE Trans. Computers, vol. 41, no. 5, pp. 599-566, May 1992.
[30] P. Folkesson and J. Karlsson, "Considering workload input variations in error coverage estimation," 3rd European Conf. Dependable Computing, pp. 171-188, Sept. 1999.
