
An Experimental Framework for Implementing and Evaluating Concurrent Error Detection and Recovery Techniques

Alexander G. Dean and John Paul Shen

Computing Systems Center
Department of Electrical and Computer Engineering
Carnegie-Mellon University
Pittsburgh, PA 15213

1993

Correspondent: Alexander G. Dean
Tel: (412) 268-6639
Fax: (412) 268-3204
Email: adean@ece.cmu.edu

Abstract

This paper presents an experimental framework consisting of three software tools for characterizing, implementing, and evaluating concurrent error detection and recovery techniques for general-purpose processors. The first tool analyzes compiled assembly programs (currently MIPS R2000/R3000), and can perform a number of code transformations, including the embedding of instructions for profiling and integrated monitoring for both error detection and recovery. The second tool facilitates error injection experiments. It can inject permanent or transient faults, manage the execution of the corrupted code, record the error handling behavior, and perform repeated experiments. The third tool analyzes the recorded experimental data, classifies the error types and determines the effectiveness of the techniques. This framework has been implemented and initially applied to the analysis of intrinsic error detection mechanisms in contemporary pipelined processors, and the detection of and recovery from control flow errors via signature monitoring.

Keywords: Concurrent error detection and recovery, signature monitoring of control flow, fault injection experiments, code transformation, pipelined processors.

1. Introduction

Size, cost, weight and power restrictions limit the addition of massive hardware error detection and recovery mechanisms to many computers. New designs are limited by one or more of these factors, while retrofitting existing computers is expensive. Software-based concurrent error detection (CED) and recovery techniques combined with existing processor mechanisms provide an inexpensive alternative to add-on hardware mechanisms. Rather than adding dedicated hardware, the application software can be modified to create a robust program to run on an existing system, with the capability to detect and recover from error occurrences.

Many modern general-purpose microprocessors indirectly provide concurrent error detection mechanisms as an integral part of the processor's capabilities. Word-addressed memory interfaces identify unaligned accesses, virtual memory systems detect out of range accesses, and so forth. Such mechanisms provide an inexpensive and already existing base of CED, which can be further enhanced by software CED techniques.

Recent research has proposed software CED techniques such as signature monitoring [Wilken] [Schuette 86], which monitors the control flow of a program using additional hardware. Integrated monitoring has been proposed as well, in which error detection instructions are inserted directly into the application program [Schuette 91], rather than adding a dedicated hardware monitor.

This paper presents a new framework for implementing and evaluating CED and recovery techniques. The tools facilitate implementing, characterizing and evaluating software-based CED techniques. In addition, the tools can analyze intrinsic hardware-based mechanisms in existing processors. The tools in this framework automate most of the tasks involved in implementing and evaluating concurrent error detection and recovery techniques. A program modification tool incorporates CED and recovery code, a fault injection tool facilitates automatic injection experiments, and a data analysis tool classifies batches of injected fault/symptom pairs. Each of these tools is extensible, forming a versatile customizable framework. Future work will use the framework tools to characterize a variety of CED and recovery techniques, as well as expand the framework to support superscalar processors.

Section 2 presents an overview of the tools and the context for this work. Sections 3, 4 and 5 describe the tools Transform, Corrupt and Analyze, respectively. Section 6 provides conclusions and future directions.

2. Experimental Framework

2.1 Background

The tools presented in this framework implement and empirically characterize integrated monitoring techniques, which embed CED and recovery instructions within the application program. In addition to such software-based detection techniques, the tools can characterize intrinsic mechanisms relying on existing hardware in the processor. Most general purpose microprocessors have virtual memory support hardware, an instruction decoder and an address bus interface, providing a broad base of support for the hardware-based techniques. These mechanisms work together to provide an inexpensive foundation for concurrent error detection.

There are many approaches to fault injection; some require the addition of hardware, simulation of the system in software, a physical means of injection, or some combination of these methods. These methods vary in the types of faults that can be injected, fault duration, fault location, injection rate and control, execution speed, monitoring information and software and hardware overhead. Each technique is suited for a certain type of application, but in general most fault injection techniques require a substantial overhead.

Fault injection hardware can be added, as in [Schuette 86], but this requires physical modification of the system under test. The only signals available for corruption are those which run off-chip, and fault triggering and logging circuitry may be complex. However, the system runs at full speed, the faults are user-defined and many signals are corruptible.

The system under test may be simulated entirely in software: [Ohlsson 92] presents fault injection into the simulation of a RISC system, consisting of a VHDL description of the processor and memory. Fault simulation involves corrupting state bits in the processor and observing the effects on the test programs. Though the system behavior is accurate, and there is control over the injected faults, simulation speed is very slow.

Actual physical experiments may be used: [Miremadi 92] presents a system which injects faults into a microprocessor by using heavy ions from a radioactive source. This system requires a vacuum chamber to hold the ion source and the microprocessor, greatly increasing system complexity. Faults are injected slowly (less than one per minute), but they are potentially more representative of real-world faults.

The fault injection technique presented here requires no additional hardware, has negligible execution time overhead, offers precise control over fault location and duration, and is quickly and easily implemented. The faults can be repeated, and their results can be observed instruction by instruction with a common debugger such as dbx.

In this paper the Stanford and SPEC benchmark suites are used. Library routines (e.g., fread) are not tested here, though these tools are capable of such an analysis. The work presented here limits the injection of faults to application code, though these faults may be passed to C library routines through corrupted arguments.

A basic automated checkpoint-based recovery technique [Siewiorek 92] is implemented in our initial experiments. It provides substantial restartability when coupled with a low-latency concurrent error detection scheme. The recovery scheme is limited in that it requires reasonably fast error detection, as only a limited amount of memory in addition to registers can be restored.

2.2 Framework

[Figure 1: Use of Framework Tools to Evaluate Program. The diagram shows an input assembly program passing through Transform to produce an executable self-monitoring program and an executable self-profiling program. Corrupt operates on the self-monitoring program, and Analyze combines the corruption results with the execution profile.]

Three tools comprise the framework: Transform, Corrupt, and Analyze. The tools can implement and evaluate an integrated monitoring technique using the following steps. Figure 1 shows how the tools are used to characterize a program's performance with a given monitoring technique. First, the user targets Transform to insert the appropriate CED code in the control flow graph (CFG), then Transform enhances the test program by automatically incorporating detection and recovery instructions. Corrupt then repeatedly corrupts and executes the modified test program. Finally, an execution profile of the good program is generated using Transform, and Analyze uses this information in conjunction with the corruption test results to gather CED performance statistics.

Transform reads an assembly file (currently MIPS R2000/R3000) and creates a CFG [Aho 86] with register liveness information. The CFG can be modified to be self-profiling or self-tracing to generate execution information for later use. The CFG can be enhanced with the insertion of tracking functions and checkpoints to implement software integrated monitoring. A tracking function captures a program's control flow history, while a checkpoint verifies that it is correct. In addition, code to attempt recovery can be inserted.

    0x40079c  old 0x1cfc021  addu r24, r14, r15
              new 0x34fc021  addu r24, r26, r15
    Segmentation fault (core dumped)

Figure 2: Example Error Log Entry

Corrupt inserts one or more transient or permanent bit faults in the code section of an executable file. Transient faults are specified by duration n; the given word is corrupt the first n tries and restored to fault-free after that. Certain address ranges may be excluded from the corruption process, and library routines are automatically excluded. As an instruction is corrupted, Corrupt disassembles that instruction's original and corrupt versions for use in later analysis, as shown in Figure 2.

Table 1: Sample Output from Analyze

Modification           Segmentation   Bus Error      Infinite       Correct        Incorrect      Total
                       Fault                         Loop           Results        Results
No Modification        0.00% (0)      0.00% (0)      0.00% (0)      3.03% (3)      0.00% (0)      3.03% (3)
Source Register        0.00% (0)      0.00% (0)      12.12% (12)    5.05% (5)      3.03% (3)      20.20% (20)
Destination Register   0.00% (0)      0.00% (0)      7.07% (7)      1.01% (1)      3.03% (3)      11.11% (11)
Address Offset         0.00% (0)      0.00% (0)      10.10% (10)    7.07% (7)      3.03% (3)      20.20% (20)
Immediate Data         0.00% (0)      0.00% (0)      4.04% (4)      0.00% (0)      1.01% (1)      5.05% (5)
Operation              0.00% (0)      0.00% (0)      14.14% (14)    24.24% (24)    0.00% (0)      38.38% (38)
Address                0.00% (0)      0.00% (0)      2.02% (2)      0.00% (0)      0.00% (0)      2.02% (2)
Total                  0.00% (0)      0.00% (0)      49.49% (49)    40.40% (40)    10.10% (10)    100.00% (99)

Analyze reads the list of injected faults and resulting errors generated by Corrupt and categorizes them according to the type of corruption and the type of result, as illustrated in Table 1. The program's execution profile determines which injected faults are actually encountered during program execution. Analyze can generate statistics of error recovery attempts in addition to detection performance, simplifying analysis of both mechanisms.
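The categorization described above amounts to a cross-tabulation of (fault class, symptom) pairs. The following is a minimal sketch of that bookkeeping, not the actual Analyze implementation: the function name `tabulate`, the tuple layout and the four sample runs are hypothetical and chosen only for illustration.

```python
from collections import Counter

def tabulate(runs):
    """Count each (fault_class, symptom) pair and express it as a percentage
    of all test runs, in the style of a Table 1 cell."""
    counts = Counter(runs)
    total = sum(counts.values())
    return {pair: (n, round(100.0 * n / total, 2)) for pair, n in counts.items()}

# Hypothetical error-log entries, NOT data from the paper.
runs = [
    ("Operation", "Correct Results"),
    ("Operation", "Correct Results"),
    ("Source Register", "Incorrect Results"),
    ("Address Offset", "Bus Error"),
]
table = tabulate(runs)
for (fault, symptom), (count, percent) in sorted(table.items()):
    print(f"{fault:16s} {symptom:18s} {percent:6.2f}% ({count})")
```

Each cell holds both the count and the percentage of all runs, which is the layout the reconstructed table uses.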

3. Code Modification

The first of the tools, Transform, is a general purpose code modification program. It loads a program into a form which facilitates code transformations and analysis.

3.1 Analysis of Original Code

The code modification tool Transform begins by building a control flow graph (CFG) [Aho 86] from a MIPS R2000/R3000 assembly language program [Kane 92]. Transform then identifies the CFG back edges and determines register liveness. After completing these tasks, the CFG is ready for insertion of generic or technique-specific code, further elaborated in subsequent subsections.

Transform generates the CFG by parsing the input file procedure by procedure and partitioning the code into basic blocks using labels and control transfer instructions as delimiters. Each CFG node represents a basic block. The order in which the basic blocks appear in the program listing is preserved by including link pointers in each node; by using these pointers the nodes can be accessed as elements of a linked list. These links are maintained to allow generation of the output assembly file.

Transform identifies computed jumps, such as those resulting from switch statements, by searching for a jump register instruction which is not a subroutine return. The jump table label immediately precedes the jump instruction. Each entry in the jump table is processed and used to define an additional successor to the basic block in question.

After all edges have been added to the CFG, a loop identifier marks loop back edges using a recursive algorithm. The loop back edge markers are used in most other operations on the CFG. The process of adding the loop identification support revealed that in some compiled C code computed jumps contained back edges.

The final step in the annotation of the CFG is the register liveness determination, which simplifies code transformation. Liveness is found only for general purpose registers, as the CED and recovery techniques initially investigated do not need to use floating-point registers, although it is straightforward to support them. Register liveness can be automatically indicated in the output assembly program listing to aid in planning a tracking function or debugging. When subroutine entries, exits and calls are processed, the liveness determination tool follows guidelines concerning register use presented in [Kane 92]. System registers such as the stack pointer, the return address register, the kernel registers and the global pointer are assumed to be always live. Argument registers a0-a3 and callee-saved registers s0-s9 are assumed to be live at procedure entry. All other registers are assumed to be dead.
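The basic-block partitioning and the recursive back-edge marking described in this subsection can be sketched as follows. This is an illustrative sketch under stated assumptions, not Transform's code: the function names, the mnemonic set, the dictionary-based block representation and the `label_N` naming of unlabeled blocks are all hypothetical, and computed jumps and branch delay slots are ignored.

```python
# Assumed set of control transfer mnemonics; a real tool would cover the full ISA.
CONTROL_TRANSFER = {"b", "beq", "bne", "blt", "ble", "bgt", "bge", "j", "jal", "jr"}

def split_basic_blocks(lines):
    """Partition assembly lines into basic blocks.
    A label starts a new block; a control transfer instruction ends one.
    Unlabeled blocks are named after the preceding label plus an index (assumption)."""
    blocks = []
    base, suffix = None, 0
    current = None
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.endswith(":"):                      # label delimiter
            base, suffix = text[:-1], 0
            current = {"label": base, "instrs": []}
            blocks.append(current)
            continue
        if current is None:                         # block following a control transfer
            current = {"label": f"{base}_{suffix}", "instrs": []}
            suffix += 1
            blocks.append(current)
        current["instrs"].append(text)
        if text.split()[0] in CONTROL_TRANSFER:     # control transfer delimiter
            current = None
    return blocks

def mark_back_edges(successors, entry):
    """Recursive depth-first search; an edge whose target is still on the
    current search path is marked as a loop back edge."""
    back_edges, visited, on_path = set(), set(), set()

    def visit(node):
        visited.add(node)
        on_path.add(node)
        for succ in successors.get(node, []):
            if succ in on_path:
                back_edges.add((node, succ))
            elif succ not in visited:
                visit(succ)
        on_path.discard(node)

    visit(entry)
    return back_edges

# Hypothetical listing and CFG, loosely modelled on the CheckSort procedure.
listing = ["CheckSort:", "li $14,1", "$43:", "lw $15,36($sp)", "ble $9,$10,$44",
           "li $11,1", "$44:", "lw $12,36($sp)", "blt $13,9999,$43",
           "lw $14,32($sp)", "bne $14,0,$45"]
labels = [block["label"] for block in split_basic_blocks(listing)]

cfg = {"CheckSort": ["$43"], "$43": ["$43_0", "$44"], "$43_0": ["$44"],
       "$44": ["$43", "$44_0"], "$44_0": ["$45"], "$45": []}
back = mark_back_edges(cfg, "CheckSort")
```

Because the search is recursive, very deep graphs would require an explicit stack or a raised recursion limit in Python.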

Transform can also generate PostScript [Adobe 85] diagrams of the control flow graphs. This allows the user to rapidly determine the control structure of a program without decoding lengthy assembly listings. Figure 3 shows a procedure as it is translated into a CFG diagram. Each basic block is represented as a node tagged with the basic block's label. Solid lines represent regular control flow edges, while dashed lines show loop back edges. The PostScript files are generated by DAG [Gansner 89], a program which draws directed acyclic graphs from a list of edges and node labels.

[Figure 3: Sample Application Code, DAG File for CFG and Diagram of CFG. The figure shows the assembly listing of the procedure CheckSort partitioned into basic blocks labeled CheckSort, $43, $43_0, $44, $44_0, $44_1, $44_2, $45 and $46, the corresponding DAG input file, and the resulting CFG diagram.]

3.2 Insertion of Generic Support Code

Three types of generic support code can be inserted by Transform: profile generation code, trace generation code, and error handler code. In the course of implementing and characterizing a CED technique, one or two of these code modifications would typically be made to a program. The execution profile is used by Analyze to determine which corrupted instructions are executed.

Figure 4: Grouping Basic Blocks into Sequences

When run, self-profiling code provides an execution count for each basic block, while self-tracing code generates a sequence of basic blocks executed. In order to minimize overhead, trace and profile operations deal with sequences of basic blocks. This reduces the number of tracing or profiling instructions which must be inserted and executed, as well as the trace size. A sequence is a contiguous control-equivalent set of code, formed from instructions as a basic block is, except subroutine calls and returns are ignored. If the first instruction in a sequence is executed, then all following instructions are executed, as all instructions in a sequence are control equivalent. Figure 4 shows an example of a procedure's basic blocks grouped as sequences, and how Transform presents this information in its output files. Transform assigns an identifier number to each sequence and stores these numbers in the TraceID file. The sequence numbers are used in all subsequent trace and profile operations, as shown in the Profile and Trace files. Analyze uses the profile or trace information to determine which faults are encountered during program execution. The profile and trace information can steer more sophisticated CED techniques to concentrate error detection resources on heavily executed code sections, or to minimize execution time overhead.

An error handler is invoked after a checkpoint detects an error. A checkpoint is a set of instructions which detects errors. Transform provides three types of error handlers: Terminate, Identify and Terminate, and Restart. Terminate merely attempts to end the program gracefully by restoring the stack pointer to the value from program entry and then exiting from the program. Identify and Terminate reports the label of the failed checkpoint and then attempts to terminate gracefully. Restart tries to continue the program by reloading registers and restarting the procedure or section of the procedure.

3.3 Insertion of Technique-Specific Code

Transform currently implements checkpoint-based recovery to demonstrate the framework's tools and to characterize the recovery effectiveness. Future work will include other types of error detection and recovery, as well as support for superscalar processors.

If an error is detected within a short time, it may be possible to recover from the error and continue correct program execution. One form of error recovery is checkpoint recovery, in which program information is saved periodically for use in a possible recovery attempt. This is backward error recovery, in that when an error is detected the error handler backs up the program state to the last good checkpoint (rollback) and then starts the program running after that checkpoint (restart). By using checkpoints many procedures can be successfully restarted, leading to correct program completion.

[Figure 5: Shadow Stack Organization. The diagram lists the entries of a shadow stack frame: Saved Register N, Saved Register N-1, ..., Saved Register 1, Saved Register 0, Procedure Zone and Procedure Code, together with the Shadow Stack Pointer.]

Checkpoint recovery consists of saving program information on a stack. Shown in Figure 5, this stack is called the shadow stack and is separate from the procedure-call stack. Transform can implement checkpoint recovery by automatically adding shadow stack code to the application program. It is not possible to restart from all errors, but the technique presented provides recovery from a significant fraction of non-memory-corrupting errors.

The shadow stack holds all registers which are live upon procedure entry. Immediately before the procedure's exit the current frame of the shadow stack is popped. This technique provides recovery from a significant amount of errors, while adding minor memory and execution time overhead.

Original Procedure Entry Code

    Search:
            subu    $sp, 56
            sw      $31, 40($sp)
            sw      $17, 36($sp)
            sw      $16, 32($sp)
            lw      $8, 76($sp)
            li      $9, 15

Procedure Entry Code With Recovery Support Added

    Search:
            subu    $sp, 56
            sw      $31, 40($sp)
            sw      $17, 36($sp)
            sw      $16, 32($sp)
            # Shadow Stack Save Code
            lw      $8, ShadowSP
            lw      $9, ($8)
            li      $10, RegMask
            sw      $10, ($9)
            li      $10, ProcCode
            sw      $10, 4($9)
            li      $10, ProcZone
            sw      $10, 8($9)
            sw      $4, 12($9)
            sw      $5, ...
            ...
            li      $10, FrameSize
            sw      $10, 40($9)
            # End of Shadow Stack Save
            lw      $8, 76($sp)
            li      $9, 15

Figure 6: Procedure Entry Modification for Recovery Support

In order to add recovery support to a procedure, Transform adds shadow stack code near its entry and exit points. As illustrated in Figure 6, the first basic block in the procedure is split after any instructions which save registers on the regular stack. These instructions save registers which do not need to be restored for the procedure to run. Instructions to save recovery information on the shadow stack are inserted in the split point of the entry basic block. This code saves information needed to roll back the program, as well as updating the global shadow stack pointer. Figure 5 illustrates the structure of the shadow stack, which holds live registers, a mask identifying the registers, values identifying the procedure and the zone within it, and the frame size.

The procedure's exit points, or return from subroutine instructions, are located in its exit basic blocks. Shadow stack cleanup instructions are inserted immediately before these exit basic blocks in order to remove the current procedure's shadow stack frame by updating the shadow stack pointer. The error handler for recovery procedures reloads registers from the shadow stack and then resumes procedure execution after the saves to the shadow stack.

Currently all checkpoints restart execution at the beginning of the procedure. Recovery support will be added to provide restart capabilities within a loop. This feature will reduce recovery latency by eliminating the need to duplicate successfully completed iterations. Each protected loop will contain both a checkpoint and instructions to save live registers which have changed since the previous iteration. In addition, the shadow stack's Procedure Zone entry will indicate which loop within the procedure is being executed.
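The shadow stack discipline described above (save a frame at procedure entry, pop it before exit, reload registers on rollback) can be modelled in a few lines. This is a behavioural sketch only, assuming a register file represented as a Python dictionary; the class name, method names and the frame-size formula are hypothetical and do not reproduce the MIPS code that Transform inserts.

```python
class ShadowStack:
    """Toy model of the shadow stack: one frame per active protected procedure."""

    def __init__(self):
        self.frames = []

    def push(self, proc_code, proc_zone, live_regs):
        """Procedure entry: save live registers plus identifying information.
        live_regs maps register number -> value."""
        mask = 0
        for reg in live_regs:
            mask |= 1 << reg                  # one mask bit per saved register
        self.frames.append({
            "mask": mask,
            "proc": proc_code,
            "zone": proc_zone,
            "regs": dict(live_regs),
            "size": 4 + len(live_regs),       # assumed: words in the frame
        })

    def pop(self):
        """Procedure exit: discard the current procedure's frame."""
        self.frames.pop()

    def rollback(self, regs):
        """Error handler: reload saved registers and report where to restart."""
        frame = self.frames[-1]
        regs.update(frame["regs"])
        return frame["proc"], frame["zone"]

# Usage: registers $4 and $5 are live at entry to a hypothetical procedure.
shadow = ShadowStack()
registers = {4: 10, 5: 20, 8: 0}
shadow.push("Search", 0, {4: registers[4], 5: registers[5]})
registers[4] = 999                            # an error corrupts a live register
restart_at = shadow.rollback(registers)
```

Only register state is restored in this model, which mirrors the limitation noted earlier that memory-corrupting errors are generally not recoverable.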

4. Fault Injection and Testing

Corrupt is a tool which injects faults into a program and then tests the corrupt program. Each program test run consists of injecting one or more faults into a program and executing the program while observing the results. Typically many tests are run to gather representative data. The experimental data presented here involve from 300 to 1500 tests per program.

4.1 Fault Types

Transient and permanent faults can be characterized by parameters such as duration, severity and location. A transient fault has a certain duration; after that time the fault disappears. The fault severity can be measured semantically or syntactically. One syntactic measure is the number of corrupted bits. A semantic measure of the fault describes the extent of the change in the instruction's function. The location of the fault can be measured semantically or syntactically. A syntactic measurement provides the physical address of the fault, while a semantic measurement describes the program procedure or function where the fault was encountered. These three fault characteristics describe a multidimensional fault space.

The fault space used by Corrupt has axes of fault duration, syntactic fault severity, and syntactic fault location. The fault duration is measured in the number of times the corrupted instruction is read before the fault disappears. The fault severity is measured in the number of bits corrupted, while the fault location is described by its address. In order to simplify analysis, Analyze maps this fault space to one with axes of fault duration, semantic fault severity and semantic fault location. The fault duration dimension remains the same, while the other two dimensions are mapped according to certain fault classifications. Analyze can use either of two fault severity classifications to quantify the semantic fault severity. One indicates the field of the instruction corrupted, while the other classifies the original and corrupted instructions as data-flow or control-flow instructions. The location is determined by the basic block containing the fault.

In the fault injection experiments presented here, faults are limited to a subset of the fault space. The fault duration is one or five iterations or permanent, the fault severity is one or two bits and the fault location is restricted to application code, excluding library routines.

Corrupt loads an executable file and injects one or more faults, whose characteristics are specified on the command line and in supplementary program information files. The corrupted file is then saved and executed. Corrupt can inject any number of bit faults into the test program. Injected faults may be transient or permanent. The lifetime of a transient fault is specified as the number of times the word in question is accessed incorrectly. This method of simulating transient faults requires little overhead and provides a succinct method of specifying fault duration.
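A bit fault of a given syntactic severity can be represented as a mask over a 32-bit instruction word. The sketch below assumes that corruption is applied by inverting the masked bits with an exclusive-or; the function names are hypothetical, and the two instruction words are the ones shown in the Figure 2 log entry.

```python
import random

def make_mask(num_bits, rng, width=32):
    """Build a mask with num_bits distinct, randomly chosen bit positions set."""
    mask = 0
    for bit in rng.sample(range(width), num_bits):
        mask |= 1 << bit
    return mask

def corrupt_word(word, mask):
    """Invert the masked bits of a 32-bit instruction word."""
    return (word ^ mask) & 0xFFFFFFFF

# The Figure 2 entry: addu r24, r14, r15 becomes addu r24, r26, r15.
original = 0x01CFC021
observed = 0x034FC021
figure2_mask = original ^ observed            # a two-bit fault in the rs field

rng = random.Random(0)                        # seeded so experiments can be repeated
random_mask = make_mask(2, rng)
```

Recording the address and mask of each fault is enough to recreate it later, since applying the same mask to the corrupted word restores the original.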

4.2 Fault Injection Methods

Faults are injected by selecting an instruction within the executable file of the program and either corrupting it directly or replacing it with a jump to a transient fault simulator subroutine, shown in Figure 7. This subroutine has a fault-free and a corrupt copy of the instruction: one is executed at runtime. A counter variable is used to determine the number of times the corrupt instruction is to be executed before the fault-free instruction is executed, thereby controlling the fault duration. The instruction to be corrupted can be selected randomly or read from a file, facilitating the duplication of experiments.
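The counter-controlled choice between the corrupt and fault-free copies can be modelled as a small closure. This is a behavioural sketch of the idea only, with hypothetical names; the real simulator is a MIPS subroutine patched into the executable.

```python
def make_transient_fault(good_word, bad_word, duration):
    """Return a fetch function that yields the corrupt word for the first
    `duration` fetches and the fault-free word afterwards."""
    remaining = duration

    def fetch():
        nonlocal remaining
        if remaining > 0:          # faults still enabled
            remaining -= 1         # one fewer fault
            return bad_word
        return good_word

    return fetch

# A transient fault lasting two iterations, using the Figure 2 words.
fetch = make_transient_fault(0x01CFC021, 0x034FC021, duration=2)
trace = [fetch() for _ in range(4)]
```

A permanent fault needs no simulator in this model: the word is simply overwritten in place.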

The branch delay slot [Kane 92] in the MIPS architecture complicates the insertion of the jump to the transient fault simulator. As a result, a preceding or delay slot instruction must be added to the simulator. If the instruction to be corrupted is in a control transfer delay slot, then the preceding instruction must be copied as well. If a control transfer instruction is to be corrupted, then its delay slot must be copied.

Most instructions can be moved to the transient fault simulator simply by duplication. However, some instructions have relative address references which depend upon the instruction's address. As a result, Corrupt must relink these instructions. If a branch instruction is moved, then its target offset field must be updated to contain the offset from each new position of the branch instruction. This requires extracting the offset from the original instruction, combining it with the original instruction's address to find the target, determining the offset from the new branch instruction to the target, and encoding that value into the new branch instruction.
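The four relinking steps listed above can be worked through for a MIPS I conditional branch, whose 16-bit signed offset is counted in words from the address of the delay slot. The sketch below is an illustration under that encoding; the function names and the example addresses are hypothetical, and it is not the code used by Corrupt.

```python
def sign_extend16(value):
    """Interpret a 16-bit field as a signed integer."""
    return value - 0x10000 if value & 0x8000 else value

def branch_target(word, addr):
    """Steps 1 and 2: extract the offset and combine it with the branch address."""
    offset = sign_extend16(word & 0xFFFF)
    return (addr + 4 + (offset << 2)) & 0xFFFFFFFF

def relink_branch(word, old_addr, new_addr):
    """Steps 3 and 4: compute the offset from the new location and encode it."""
    target = branch_target(word, old_addr)
    new_offset = (target - (new_addr + 4)) >> 2
    if not -0x8000 <= new_offset <= 0x7FFF:
        raise ValueError("branch target out of range after move")
    return (word & 0xFFFF0000) | (new_offset & 0xFFFF)

# beq $9, $0, +3 words, moved from 0x400100 into a simulator at 0x400200.
beq = 0x11200003
moved = relink_branch(beq, old_addr=0x400100, new_addr=0x400200)
```

The range check matters in practice: a simulator placed more than 128 KB away from the original branch cannot be reached with a 16-bit word offset.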

    Transient_Fault:
            sw      $8, -4($sp)                     # save the old value of $8
            sw      $9, -8($sp)                     # save the old value of $9
            la      $8, EnableFaults                # load pointer to counter indicating # of faults
            lw      $9, ($8)                        # load counter
            beq     $9, $0, Transient_Fault_Good    # if faults not enabled, branch
    Transient_Fault_Bad:
            subu    $9, $9, 1                       # one fewer fault
            sw      $9, ($8)
            lw      $8, -4($sp)                     # restore it
            lw      $9, -8($sp)                     # restore it
            nop                                     # space for corrupted
            nop                                     # and surrounding
            nop                                     # instructions
            j       Application_Code                # filler for jump back
            nop                                     # extra space
    Transient_Fault_Good:
            lw      $8, -4($sp)                     # restore it
            lw      $9, -8($sp)                     # restore it
            nop                                     # space for correct
            nop                                     # and surrounding
            nop                                     # instructions
            j       Application_Code                # filler for jump back to code
            nop

Figure 7: Transient Fault Simulator Code

Corrupt prints information about the corruption, displaying the address of the corrupted instruction, its original and corrupted instruction words, and disassembled versions of the original and corrupted instructions. This information is later used by Analyze to classify the corruption. The standard disassembler() function is used to decode the instruction. This function decodes instructions as the R2000 and R3000 processors do; some corrupted instruction words are equivalent to the original instructions, and disassembler() identifies them properly.

Certain sections of code are excluded from corruption. Corrupt limits corruption to the application code, ignoring library routines such as printf. The address range of the application code is determined by searching the executable file for application basic block labels. It is possible to allow corruption of library routines, but the current research focuses on testing application programs. This is meant to reduce the chance of corrupting the experimentation system. For example, execution of corrupted file I/O routines could lead to corruption of the host computer's disk file structure. In order to test system-wide robustness, one would allow corruption of all program instructions and data within the program.

Table 2: Corrupt Command-Line Options

Option            Description of Action
-nDuration        inject transient error (permanent error is default);
                  make transient error last for Duration iterations
-c                inject only control-flow errors
-xExclusionFile   exclude address ranges listed in ExclusionFile from corruption
-aAddress         corrupt word at Address
-mMask            use Mask to corrupt word
-bNumBits         invert NumBits in the word
-wFaultFile       store addresses and masks of errors in FaultFile
-rFaultFile       load addresses and masks of errors from FaultFile
-fFaultNumber     use address and mask of fault FaultNumber

Table 2 lists the command line options for Corrupt. The user can specify address ranges to be excluded from corruption, allowing flexible testing of programs with built-in error detection capabilities. For example, one program might perform array operations and then compute a checksum, comparing it with the correct value. By excluding the checksum computation and comparison code from corruption, one can determine the effectiveness of the checking code and the vulnerability of the user-written application code.

In order to simulate errors of varying duration, the user may specify the number of iterations which an instruction is corrupt. The experiments presented here simulate permanent errors as well as transient errors lasting one or five iterations. These numbers are arbitrarily chosen for our initial experiments.

Corrupt can automatically generate an error log to record the corrupted addresses and the masks used. The user may specify for Corrupt to recreate a specific corruption listed in the log, simplifying the repetition of a test or a series of tests.

5. Data Analysis Data analysis concenlrates on how injected faults manifest thclnselves wilh respect to program results.

A fault lnay or maynot lead to an error. A~alyze is a data analysis tool wl~ich examinesfaulls and

symptomsand generates statistics

based upon certain classifications.

5.1 Data Classification

Analyze reads an error log and classifies each fault and its symptom. Two fault classifications and one symptom classification are used, providing a straightforward breakdown of the test runs. Analyze uses test program trace or profile information to select only the test runs in which the corrupted instruction is executed. In addition, it identifies instructions which are undetectable by software means (such as instructions corrupted to become jumps to unallocated memory).

One of two methods verifies program output. If convenient, the user adds a function to the program to verify its results. This function is excluded from corruption during testing, so each test run indicates whether the program produced correct results. If the function cannot be added easily, then the program results are checked after testing. VERIFY, a script, classifies program results based on output listed in the error log and possible output files. The error log is annotated by VERIFY to indicate correct program completion, and Analyze uses these annotations in its classification of the test run.

The two fault classifications characterize the change in the instruction, either in terms of the component of the instruction corrupted (opcode, register specifier, etc.) or the type of the fault-free and corrupted instructions (control flow or data flow). Table 3 shows the first classification, which follows directly from identifying which instruction field has been corrupted by the fault. This classification applies to the MIPS R2000/3000 processors; other processors may have additional instruction fields (e.g. post-increment pointer) which would need to be added to the classification.

Some corrupted instructions may be decoded by the processor the same as the original instructions. Other corrupted instructions may result in different operations but with identical results. The MIPS nop instruction is an extreme example of this, as it is implemented by the assembler as sll r0, r0, 0 (shift left logical, instruction code 0x00000000), with five fields: opcode, immediate data, and three register specifiers. This instruction is rather resistant to single bit errors. If the unused register specifier field is corrupted (five bits of the 32 in the instruction), there is no change in execution. If the source register specifier or immediate data is corrupted (ten bits of 32), the instruction has the same result, writing some value to register r0, where it is discarded. If the operation is corrupted, five of the twelve possible resultant operations write to r0 with no other effect, resulting in a nop. As a result, a corrupted nop instruction has a 20/32 or 62.5% probability of remaining a nop functionally.
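The 20/32 figure can be checked by tallying the single-bit faults field by field. The sketch below assumes the standard MIPS R-type layout (6-bit opcode, three 5-bit register specifiers, 5-bit shift amount, 6-bit function code) and takes the count of five harmless operation faults from the text as given rather than deriving it. The dictionary names are illustrative only.

```python
# Bits per field of an R-type MIPS instruction word (32 bits in total).
FIELD_BITS = {
    "operation (opcode + function code)": 12,
    "unused register specifier": 5,
    "source register specifier": 5,
    "immediate data (shift amount)": 5,
    "destination register specifier": 5,
}

# Single-bit faults per field after which the instruction still acts as a nop.
HARMLESS = {
    "operation (opcode + function code)": 5,   # count stated in the text
    "unused register specifier": 5,            # field is ignored
    "source register specifier": 5,            # result is still discarded in r0
    "immediate data (shift amount)": 5,        # result is still discarded in r0
    "destination register specifier": 0,       # not counted in the text's total
}

total_bits = sum(FIELD_BITS.values())
harmless_bits = sum(HARMLESS.values())
probability = harmless_bits / total_bits
print(f"{harmless_bits}/{total_bits} = {probability:.1%}")
```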

Table 3: Instruction Field Fault Classification

Instruction Component Modified    Example: Original Instruction    Example: Corrupted Instruction
Source Register Specifier         add r3, r1, r2                   add r3, r1, r18
Destination Register Specifier    add r3, r1, r2                   add r11, r1, r2
Address                           jal 0x400520
Address Offset                    lw r2, 32(sp)                    lw r2, 160(sp)
Immediate Data                    andi r1, r3, 15                  andi r1, r3, 7
Operation                         slti r1, r14, 200                j 0x7040320
No Modification                   jal 0x400520                     jal 0x400520
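One way to arrive at the field classification of Table 3 is to XOR the original and corrupted instruction words and see which field the differing bits fall in. The sketch below is a simplified illustration, not the implementation used by Analyze. It assumes the standard MIPS R/I/J encodings, treats every opcode from 0x20 to 0x3F as a memory access whose immediate is an address offset, and labels the I-type rt field as a destination even though stores and branches use it as a source. `fields` and `classify_field` are hypothetical names.

```python
def fields(word):
    """Return (label, low_bit, width) triples for the word's instruction format."""
    opcode = (word >> 26) & 0x3F
    if opcode == 0:                       # R-type: register-to-register operations
        return [("Operation", 26, 6),
                ("Source Register Specifier", 21, 5),
                ("Source Register Specifier", 16, 5),
                ("Destination Register Specifier", 11, 5),
                ("Immediate Data", 6, 5),         # shift amount
                ("Operation", 0, 6)]              # function code
    if opcode in (2, 3):                  # J-type: j and jal
        return [("Operation", 26, 6), ("Address", 0, 26)]
    # I-type: the immediate is an offset for memory accesses (simplifying assumption)
    immediate = "Address Offset" if 0x20 <= opcode <= 0x3F else "Immediate Data"
    return [("Operation", 26, 6),
            ("Source Register Specifier", 21, 5),
            ("Destination Register Specifier", 16, 5),
            (immediate, 0, 16)]


def classify_field(original, corrupted):
    """List the Table 3 categories touched by the bits that differ."""
    diff = (original ^ corrupted) & 0xFFFFFFFF
    if diff == 0:
        return ["No Modification"]
    labels = []
    for label, low, width in fields(original):
        field_mask = ((1 << width) - 1) << low
        if diff & field_mask and label not in labels:
            labels.append(label)
    return labels


# Encodings of the Table 3 examples.
print(classify_field(0x00221820, 0x00321820))  # add r3, r1, r2 -> add r3, r1, r18
print(classify_field(0x8FA20020, 0x8FA200A0))  # lw r2, 32(sp)  -> lw r2, 160(sp)
print(classify_field(0x29C100C8, 0x09C100C8))  # slti r1, r14, 200 -> j 0x7040320
```

A multiple-bit fault may touch more than one field, which is why the function returns a list rather than a single label.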

Table 4 shows the second classification, which uses the types of the original and corrupted instructions. This provides a slightly different perspective on the faults and their symptoms.

Table 4: Instruction Type Fault Classification

Instruction Type Modification          Example: Original Instruction    Example: Corrupted Instruction
Control Flow to Control Flow           jal 0x400520                     jal 0x400720
Control Flow to Data Flow              jr r31                           addi r0, r31, 8
Data Flow to Control Flow              slti r1, r14, 200                j 0x7040320
Data Flow to Data Flow                 add r3, r1, r2                   add r3, r1, r18
Control Flow to Illegal Instruction    slti r1, r14, 200                scathe r14, 80(r1)
Data Flow to Illegal Instruction       add r3, r1, r2                   sdc3 r5, 24(r2)
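The type classification of Table 4 can be sketched by decoding just enough of each word to tell control flow from data flow. The code below is an illustrative simplification under stated assumptions: it recognizes only the MIPS jumps and integer branches (j, jal, jr, jalr, beq, bne, blez, bgtz and the bltz/bgez group), ignores coprocessor branches and traps, and leaves the decision of what counts as an illegal instruction to a caller-supplied predicate. All function names are hypothetical.

```python
def is_control_flow(word):
    """True for the jump and branch encodings this sketch knows about."""
    opcode = (word >> 26) & 0x3F
    funct = word & 0x3F
    if opcode == 0:
        return funct in (0x08, 0x09)              # jr, jalr
    return opcode in (0x01, 0x02, 0x03,           # bltz/bgez group, j, jal
                      0x04, 0x05, 0x06, 0x07)     # beq, bne, blez, bgtz


def instruction_type(word, is_illegal=lambda w: False):
    if is_illegal(word):
        return "Illegal Instruction"
    return "Control Flow" if is_control_flow(word) else "Data Flow"


def classify_type_change(original, corrupted, is_illegal=lambda w: False):
    """Return a Table 4 style label such as 'Data Flow to Control Flow'."""
    before = instruction_type(original)           # the fault-free word is assumed legal
    after = instruction_type(corrupted, is_illegal)
    return f"{before} to {after}"


print(classify_type_change(0x03E00008, 0x23E00008))  # jr r31 -> addi r0, r31, 8
print(classify_type_change(0x29C100C8, 0x09C100C8))  # slti r1, r14, 200 -> j 0x7040320

# Stand-in rule for this example only: treat opcode 0x3F as unimplemented.
unimplemented = lambda w: (w >> 26) & 0x3F == 0x3F
print(classify_type_change(0x00221820, 0xFC450018, unimplemented))
```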

The result classification, shown in Table 5, describes the outcome of the corruption. This classification shows the relative performance of the detection mechanisms. The program may terminate normally with correct or incorrect output. If used, the VERIFY script differentiates between these two cases. The program may have its own built-in error detection capabilities which detect the error, in which case Analyze uses the program's error notification message. The code added for CED may detect the error: the error handler prints a notification of the error for Analyze to use. A software watchdog detects infinite loops and terminates execution of the test program. Finally, segmentation faults, bus errors and illegal instructions are all caught by the processor and identified automatically. Analyze can provide classification information for each test, simplifying interpretation of specific faults and errors.

Table 5: Error Classification

Program Result                                                       Detection Mechanism
Normal Program Completion with Correct Results                       Post-Run Comparison or Inherent Program Checking
Normal Program Completion with Incorrect Results                     Post-Run Comparison or Inherent Program Checking
Program Detection of Error                                           Inherent Program Checking
Checkpoint Detection of Error                                        CED Software
Infinite Loop, Segmentation Fault, Bus Error, Illegal Instruction    CPU Hardware
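The decision order behind Table 5 can be sketched as a small function over the facts recorded for one test run. This is an assumption-laden illustration rather than Analyze itself: the parameter names are hypothetical, and the return code follows the convention of Python's `subprocess` module, where a negative value -N means the process was terminated by signal N.

```python
import signal

# Signals raised for errors that the processor catches.
CPU_SIGNAL_NAMES = {
    int(signal.SIGSEGV): "Segmentation Fault",
    int(signal.SIGILL): "Illegal Instruction",
}
if hasattr(signal, "SIGBUS"):                      # not defined on every platform
    CPU_SIGNAL_NAMES[int(signal.SIGBUS)] = "Bus Error"


def classify_result(returncode, output_correct, program_reported=False,
                    ced_reported=False, watchdog_fired=False):
    """Map the recorded facts about one test run to a Table 5 program result."""
    if watchdog_fired:                             # watchdog terminated the run
        return "Infinite Loop"
    if returncode < 0:                             # terminated by a signal
        return CPU_SIGNAL_NAMES.get(-returncode, "Other Signal")
    if ced_reported:                               # CED error handler printed a notice
        return "Checkpoint Detection of Error"
    if program_reported:                           # the program's own checks fired
        return "Program Detection of Error"
    if output_correct:
        return "Normal Program Completion with Correct Results"
    return "Normal Program Completion with Incorrect Results"


print(classify_result(-int(signal.SIGSEGV), output_correct=False))
print(classify_result(0, output_correct=True))
```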

5.2 Test Results

A set of programs from the Stanford and SPEC92 benchmark suites, listed in Table 6, is used to demonstrate the framework, to characterize a processor's intrinsic CED capabilities and to characterize an error recovery technique. The benchmark programs are compute intensive, perform little I/O, and use a range of data structures, from matrices to dynamically allocated trees. As expected, the programs show a variety of intrinsic error detection levels. The result of each program is verified by manually inserting a verification function into the application program, except for alvinn, which uses a script for post-run-time verification. All verification functions must be excluded from corruption to ensure accurate program result verification. The user must tailor each result verification approach to match the data format. For example, the results of the three sort programs (quick, bubble and tree) are verified by confirming the output list of elements is in order. The permutation program (perm) is verified by comparing the output value to the known good value.

Each benchmark is tested in two sets of tests, with three subsets in each set. Each set uses the same faults, but the fault duration varies between permanent, one and five iterations across the subsets. One test set injects single bit faults, while the second injects dual bit faults. Each benchmark is configured to run to completion in several tens of millions of clock cycles, providing a sufficiently realistic environment. The benchmarks were compiled using CC without any optimizations. Figures 7 and 8 present the results of the fault injection experiments.
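The two result checks described above for the sort programs and for perm can be sketched as follows. The function names are hypothetical, and the real verification functions are compiled into the benchmark programs rather than written in Python; this only shows the logic.

```python
def is_in_order(values):
    """Sort check: every element is no greater than the one that follows it."""
    values = list(values)
    return all(a <= b for a, b in zip(values, values[1:]))


def matches_known_good(value, known_good):
    """perm-style check: compare the single output value to a stored reference."""
    return value == known_good


print(is_in_order([1, 2, 2, 9]))
print(is_in_order([1, 3, 2]))
print(matches_known_good(43300, 43300))   # the reference value here is made up
```

An ordering check of this kind confirms only that the output is sorted, not that it still contains the original elements, so a corrupted but ordered list would pass.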

Table 6: Benchmark Program Information

Program    Description                       Benchmark Type    Executable File Size    Instructions Executed
bubble     Bubble Sort                       Stanford          40812                   37111480
quick      Quick Sort                        Stanford          41036                   42439092
tree       Tree Sort                         Stanford          41812                   55000944
perm       Permutation                       Stanford          40428                   94613962
puzzle     Puzzle Solver                     Stanford          50268                   59712518
queens     Eight Queens                      Stanford          40324                   ~ 1698013
fft        ~ Evaluator                       Stanford          426~                    79332658
mm         Floating Point Matrix Multiply    Stanford          40964                   44744219
intmm      Integer Matrix Multiply           Stanford          41)98                   37088215
alvinn     Neural network simulator          SPEC 92           67940                   36040896

Table 7: Average Benchmark Test Results for 1 Bit Faults

Program    Number of Tests    % No Effect    % Incorrect Result    % CPU Detected
bubble     300                28             16                    56
quick      297                34             11                    55
tree       294                50             2                     48
perm       281                39             9                     52
puzzle     270                31             5                     64
queens     297                45             6                     49
fft        228                14             41                    45
mm         300                16             32                    52
intmm      288                18             33                    49
alvinn     1494               48             15                    37
Average    Total: 4049

Table 8: Average Benchmark Test Results for 2 Bit Faults

Program    Number of Tests    % No Effect    % Incorrect Result    % CPU Detected
bubble     300                27             14                    59
quick      291                24             6                     70
tree       297                27             3                     70
perm       288                32             15                    53
puzzle     267                20             10                    70
queens     297                32             11                    57
fft        255                9              29                    62
mm         300                13             34                    53
intmm      288                10             35                    55
alvinn     1491               33             13                    54
Average                                      17.0                  60.3

Alvinn is enhanced with a rudimentary software concurrent error detection mechanism and shadow stack recovery code. The purpose is to determine the restartability of the code. Transform adds a simplistic error detection scheme in which each basic block verifies that its predecessor is in a valid control flow path. The technique is a simplified version of the signature method presented in [Schuette 91]. Basic blocks load identifier keys into registers for verification by successors. Each basic block checks only its immediate predecessor. Because of the short detection latency, errors are caught before serious damage can be caused. Corrupt injects single iteration faults; the error detection scheme identifies 31 of 299 errors that are not detected by the processor (214 of the missed errors have no effect). Table 9 shows that 85% of the restart attempts for the detected errors lead to correct program recovery. Due to the loop-intensive nature of Alvinn, execution time overhead is only 2.5%.
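The predecessor check described above can be modelled in a few lines. The sketch below is a software simulation under stated assumptions, not the code that Transform inserts: block names stand in for identifier keys, a local variable stands in for the key register, and the control flow graph is a made-up example.

```python
class ControlFlowError(Exception):
    """Raised when a block is entered from a block that is not a valid predecessor."""


def run_with_monitoring(trace, valid_predecessors, entry):
    """Replay a sequence of executed blocks, checking each block's immediate predecessor."""
    key = None                                   # models the register holding the last key
    for block in trace:
        if key is None:
            if block != entry:
                raise ControlFlowError(f"execution started at {block!r}, not {entry!r}")
        elif key not in valid_predecessors.get(block, ()):
            raise ControlFlowError(f"{block!r} entered from {key!r}")
        key = block                              # publish this block's key for its successors
    return len(trace)


# Hypothetical control flow graph: init -> loop -> loop ... -> done
PREDECESSORS = {"loop": {"init", "loop"}, "done": {"loop"}}

run_with_monitoring(["init", "loop", "loop", "done"], PREDECESSORS, "init")
try:
    run_with_monitoring(["init", "done"], PREDECESSORS, "init")   # skips the loop
except ControlFlowError as err:
    print("detected:", err)
```

Because each block checks only the block that ran immediately before it, an illegal transfer is flagged on entry to the very next block, which is consistent with the short detection latency noted above.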

Table 9: Alvinn Restart Attempt Results

Column headings: Modification, % Segmentation Fault, % Bus Error, % Infinite Loop, % Incorrect Results, % Correct Results, Total
Row labels: Source Register, Destination Register, Address Offset, Immediate Data, Operation, Total

Source Register 0.0
Destination Register 0.0 5.9
Address Offset 0.0
14.7 2.9 0.0 8.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 44.1 2.9 47.0 0.0 0.0 2.9 17.7 0.0 20.6 85.3 5.9 100.0 14.7

6. Conclusions

This paper introduces a framework of software tools to automatically implement, characterize and evaluate concurrent error detection and recovery techniques. Modern processors provide intrinsic error detection mechanisms through virtual memory, bus interfaces and instruction decoders. This paper characterizes the performance of these intrinsic CED mechanisms in MIPS R2000/R3000 processors. Recent work has proposed CED techniques such as signature monitoring. The framework provides an automated method for implementing them and characterizing their effectiveness. More recent work introduced integrated monitoring, in which the CED instructions are embedded within the application program instructions. This paper presents an automated method of embedding integrated monitoring. Finally, the paper presents an automatically-inserted checkpoint-based error recovery scheme with initial performance results that show it to be quite effective.

These represent only our initial efforts in using the experimental framework. The main purpose is to illustrate the tools' effectiveness. However, based on these initial experimental results, we can make some observations.

6.1 Observations

The experiments conducted show a range of program resistance to corruption, as seen in Tables 7 and 8. The data can also be analyzed from the point of view of program completion. Given that the program has terminated normally, with no segmentation faults or bus errors, what is the likelihood that the computed results are correct? Table 10 presents this information for each program.

Table 10: Correct Program Completion Statistics

Program    Percentage for 1 Bit Faults    Percentage for 2 Bit Faults
bubble     64                             66
quick      77                             81
tree       76                             89
puzzle     86                             66
queens     89                             76
fft        25                             29
mm         34                             28
intmm      35                             22
alvinn     77                             71
perm
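The question posed above is a conditional probability, and it can be estimated from the averaged percentages in Tables 7 and 8. The sketch below assumes that "% No Effect" corresponds to normal completion with correct results, that "% Incorrect Result" corresponds to normal completion with incorrect results, and that every other run ended abnormally. The function name is hypothetical. Because the published percentages are rounded averages, values recomputed this way need not match every entry of Table 10 exactly.

```python
def correct_given_normal(no_effect_pct, incorrect_pct):
    """Percentage of normally terminating runs that also produced correct results."""
    normal_pct = no_effect_pct + incorrect_pct
    if normal_pct == 0:
        raise ValueError("no run terminated normally")
    return 100.0 * no_effect_pct / normal_pct


# bubble: 28% no effect and 16% incorrect for 1 bit faults, 27% and 14% for 2 bit faults
print(round(correct_given_normal(28, 16)))
print(round(correct_given_normal(27, 14)))
```

For bubble the two results round to 64 and 66, which agrees with the bubble entries listed for Table 10.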

The programs vary significantly, reflecting the nature of the data manipulation in each algorithm. Figure 8 shows that fft is quite vulnerable to corruption, as only about one fourth of the correctly terminating test runs actually give correct results. This is due to the combination of the large amount of data handled and the fact that the data are not reduced to a smaller set. Figure 9 indicates perm is much more robust, as three quarters of correctly terminating test runs yield correct results. This is due to the reduction action of the program; all data produced by the program are reduced eventually to a single integer. Alvinn repeatedly refines its data, eventually producing a small data set. This averaging action tends to filter out errors; about three fourths of the program results are correct. The tree sort program is likely to produce correct results due to its pointer-rich structure; each node in the data structure consists of one data field and two pointers. Nearly all data flow errors which corrupt a pointer lead to a CPU-detectable error.

Figure 8: Error Characteristics for fft Benchmark (two panels: Error Distribution vs. Fault Duration for 1 Bit Fault and for 2 Bit Fault; horizontal axis: Fault Duration in Iterations; series: CPU Detected, Incorrect, No Effect)

Figure 9: Error Characteristics for Perm Benchmark (two panels: Error Distribution vs. Fault Duration for 1 Bit Fault and for 2 Bit Fault; horizontal axis: Fault Duration in Iterations; series: CPU Detected, Incorrect, No Effect)

6.2 Future Work

This framework automates most of the work needed to implement and evaluate many CED and recovery techniques. Future work will include characterization of existing and new CED and recovery techniques, providing a clear comparison of the different techniques. The framework will be expanded to accommodate CED and recovery techniques for superscalar processors.

The framework with its three tools has been implemented, consists in total of 16 000 lines of C code and currently targets the MIPS R2000 and R3000 processors in the DECstation 3100 and 5000. We intend to distribute these tools to help stimulate more experimental research in this area within the dependable computing research community.

Acknowledgments

This work was funded by the Office of Naval Research under contract N00014-91-J-1518.

References

[Adobe 85] Adobe Systems, PostScript Language Reference Manual, Addison-Wesley, 1985

[Aho 86] A.V. Aho, R. Sethi, J.D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1988

[Gansner 89] E.R. Gansner, S.C. North, K.P. Vo, "DAG -- A Program that Draws Directed Graphs," AT&T Bell Laboratories, Murray Hill, New Jersey

[Kane 92] G. Kane, J. Heinrich, MIPS RISC Architecture, Prentice Hall, 1992

[Miremadi 92] G. Miremadi, J. Karlsson, U. Gunneflo and J. Torin, "Two Software Techniques for On-line Error Detection," 1992

[Ohlsson 92] J. Ohlsson, M. Rimén, U. Gunneflo, "A Study of the Effects of Transient Fault Injection into a 32-bit RISC with Built-in Watchdog," Proceedings of 22nd International Symposium on Fault-Tolerant Computing, 1992

[Schuette 86] M.A. Schuette, J.P. Shen, D.P. Siewiorek, Y.X. Zhu, "Experimental Evaluation of Two Concurrent Error Detection Schemes," Proceedings of 16th International Symposium on Fault-Tolerant Computing, 1986

[Schuette 91] M.A. Schuette, J.P. Shen, "Exploiting Instruction-level Resource Parallelism for Transparent, Integrated Control-flow Monitoring," Proceedings of 21st International Symposium on Fault-Tolerant Computing, 1991

[Segall 88] Z.Z. Segall, D. Vrsalovic, D.P. Siewiorek, D. Yaskin, J. Kownacki, J. Barton, R. Dancey, A. Robinson, T. Lin, "FIAT -- Fault Injection Based Automated Testing Environment," Proceedings of 18th International Symposium on Fault-Tolerant Computing, 1988

[Siewiorek 92] D.P. Siewiorek, R.S. Swarz, Reliable Computer Systems: Design and Evaluation, Digital Press, 1992

[Wilken 90] K.D. Wilken, J.P. Shen, "Continuous Signature Monitoring: Low-Cost Concurrent Detection of Processor Control Errors," IEEE Transactions on Computer-Aided Design, Vol. 9, No. 6, June 1990
