CARNEGIE
MELLON
An Experimental Framework for Implementing and Evaluating Concurrent Error Detection and Recovery Techniques Alexander G. Dean 1993
Alexander G. Dean and John Paul Shen
Computing Systems Center, Department of Electrical and Computer Engineering
Carnegie Mellon University, Pittsburgh, PA 15213
Correspondent: Alexander G. Dean
Tel: (412) 268-6639 Fax: (412) 268-3204 Email: adean@ece.cmu.edu
Abstract
This paper presents an experimental framework consisting of three software tools for characterizing, implementing, and evaluating concurrent error detection and recovery techniques for general-purpose processors. The first tool analyzes compiled assembly programs (currently MIPS R2000/R3000), and can perform a number of code transformations, including the embedding of instructions for profiling and integrated monitoring for both error detection and recovery. The second tool facilitates error injection experiments. It can inject permanent or transient faults, manage the execution of the corrupted code, record the error handling behavior, and perform repeated experiments. The third tool analyzes the recorded experimental data, classifies the error types and determines the effectiveness of the techniques. This framework has been implemented and initially applied to the analysis of intrinsic error detection mechanisms in contemporary pipelined processors, and the detection of and recovery from control flow errors via signature monitoring.

Keywords: Concurrent error detection and recovery, signature monitoring of control flow, fault injection experiments, code transformation, pipelined processors.
1. Introduction
Size, cost, weight and power restrictions limit the addition of massive hardware error detection and recovery mechanisms to many computers. New designs are limited by one or more of these factors, while retrofitting existing computers is expensive. Software-based concurrent error detection (CED) and recovery techniques combined with existing processor mechanisms provide an inexpensive alternative to add-on hardware mechanisms. Rather than adding dedicated hardware, the application software can be modified to create a robust program to run on an existing system, with the capability to detect and recover from error occurrences.

Many modern general-purpose microprocessors indirectly provide concurrent error detection mechanisms as an integral part of the processor's capabilities. Word-addressed memory interfaces identify unaligned accesses, virtual memory systems detect out of range accesses, and so forth. Such mechanisms provide an inexpensive and already existing base of CED, which can be further enhanced by software CED techniques.

Recent research has proposed software CED techniques such as signature monitoring [Wilken] [Schuette 86], which monitors the control flow of a program using additional hardware. Integrated monitoring has been proposed as well, in which error detection instructions are inserted directly into the application program [Schuette 91], rather than adding a dedicated hardware monitor.

This paper presents a new framework for implementing and evaluating CED and recovery techniques. The tools facilitate implementing, characterizing and evaluating software-based CED techniques. In addition, the tools can analyze intrinsic hardware-based mechanisms in existing processors. The tools in this framework automate most of the tasks involved in implementing and evaluating concurrent error detection and recovery techniques. A program modification tool incorporates CED and recovery code, a fault injection tool facilitates automatic injection experiments, and a data analysis tool classifies batches of injected fault/symptom pairs. Each of these tools is extensible, forming a versatile customizable framework. Future work will use the framework tools to characterize a variety of CED and recovery techniques, as well as expand the framework to support superscalar processors.

Section 2 presents an overview of the tools and the context for this work. Sections 3, 4 and 5 describe the tools Transform, Corrupt and Analyze, respectively. Section 6 provides conclusions and future directions.
2. Experimental Framework
2.1 Background
The tools presented in this framework implement and empirically characterize integrated monitoring techniques, which embed CED and recovery instructions within the application program. In addition to such software-based detection techniques, the tools can characterize intrinsic mechanisms relying on existing hardware in the processor. Most general purpose microprocessors have virtual memory support hardware, an instruction decoder and an address bus interface, providing a broad base of support for the hardware-based techniques. These mechanisms work together to provide an inexpensive foundation for concurrent error detection.

There are many approaches to fault injection; some require the addition of hardware, simulation of the system in software, a physical means of injection, or some combination of these methods. These methods vary in the types of faults that can be injected, fault duration, fault location, injection rate and control, execution speed, monitoring information and software and hardware overhead. Each technique is suited for a certain type of application, but in general most fault injection techniques require a substantial overhead.

Fault injection hardware can be added, as in [Schuette 86], but this requires physical modification of the system under test. The only signals available for corruption are those which run off-chip, and fault triggering and logging circuitry may be complex. However, the system runs at full speed, the faults are user-defined and many signals are corruptible.

The system under test may be simulated entirely in software: [Ohlsson 92] presents fault injection into the simulation of a RISC system, consisting of a VHDL description of the processor and memory. Fault simulation involves corrupting state bits in the processor and observing the effects on the test programs. Though the system behavior is accurate, and there is control over the injected faults, simulation speed is very slow.

Actual physical experiments may be used: [Miremadi 92] presents a system which injects faults into a microprocessor by using heavy ions from a radioactive source. This system requires a vacuum chamber to hold the ion source and the microprocessor, greatly increasing system complexity. Faults are injected slowly (less than one per minute), but they are potentially more representative of real-world faults.
The fault injection technique presented here requires no additional hardware, has negligible execution time overhead, offers precise control over fault location and duration, and is quickly and easily implemented. The faults can be repeated, and their results can be observed instruction by instruction with a common debugger such as dbx. In this paper the Stanford and SPEC benchmark suites are used. Library routines (e.g., fread) are not tested here, though these tools are capable of such an analysis. The work presented here limits the injection of faults to application code, though these faults may be passed to C library routines through corrupted arguments.

A basic automated checkpoint-based recovery technique [Siewiorek 92] is implemented in our initial experiments. It provides substantial restartability when coupled with a low-latency concurrent error detection scheme. The recovery scheme is limited in that it requires reasonably fast error detection, as only a limited amount of memory in addition to registers can be restored.
2.2 Framework
[Figure 1 diagram: Input Assembly Program -> Transform -> Executable Self-Monitoring Program -> Corrupt -> Analyze; Transform -> Executable Self-Profiling Program -> Execution Profile -> Analyze]
Figure 1: Use of Framework Tools to Evaluate Program
Three tools comprise the framework: Transform, Corrupt, and Analyze. The tools can implement and evaluate an integrated monitoring technique using the following steps. Figure 1 shows how they are used to characterize a program's performance with a given monitoring technique. First, the user targets Transform to insert the appropriate CED code in the control flow graph (CFG), then Transform enhances the test program by automatically incorporating detection and recovery instructions. Corrupt then repeatedly corrupts and executes the modified test program. Finally, an execution profile of the good program is generated using Transform, and Analyze uses this information in conjunction with the corruption test results to gather CED performance statistics.

Transform reads an assembly file (currently MIPS R2000/R3000) and creates a CFG [Aho 86] with register liveness information. The CFG can be modified to be self-profiling or self-tracing to generate execution information for later use. The CFG can be enhanced with the insertion of tracking functions and checkpoints to implement software integrated monitoring. A tracking function captures a program's control flow history, while a checkpoint verifies that it is correct. In addition, code to attempt recovery can be inserted.
    0x40079c  old 0x1cfc021  addu r24, r14, r15
              new 0x34fc021  addu r24, r26, r15
    Segmentation fault (core dumped)
Figure 2: Example Error Log Entry
Corrupt inserts one or more transient or permanent bit faults in the code section of an executable file. Transient faults are specified by duration n; the given word is corrupt the first n tries and restored to fault-free after that. Certain address ranges may be excluded from the corruption process, and library routines are automatically excluded. As an instruction is corrupted, Corrupt disassembles that instruction's original and corrupt versions for use in later analysis, as shown in Figure 2.
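The log entry in Figure 2 can be checked by hand. The following Python sketch is not part of the framework; the helper names are hypothetical, and only the standard MIPS R-type field layout and the two instruction words from Figure 2 are taken as given. It XORs the two words to recover the fault mask and reports which instruction field the flipped bits fall in.

```python
# Illustrative sketch only: decode the Figure 2 log entry by hand.
# Field positions are the standard MIPS R-type layout; the helper
# names are hypothetical and are not part of Corrupt or Analyze.

R_TYPE_FIELDS = {            # name: (low bit, width)
    "opcode": (26, 6),
    "rs":     (21, 5),
    "rt":     (16, 5),
    "rd":     (11, 5),
    "shamt":  (6, 5),
    "funct":  (0, 6),
}

def decode_r_type(word):
    """Split a 32-bit instruction word into its R-type fields."""
    return {name: (word >> low) & ((1 << width) - 1)
            for name, (low, width) in R_TYPE_FIELDS.items()}

def fault_mask(old, new):
    """Bits that differ between the fault-free and corrupted words."""
    return old ^ new

def corrupted_fields(old, new):
    """Names of the R-type fields touched by the fault mask."""
    mask = fault_mask(old, new)
    return [name for name, (low, width) in R_TYPE_FIELDS.items()
            if (mask >> low) & ((1 << width) - 1)]

old_word, new_word = 0x01CFC021, 0x034FC021    # words from Figure 2
mask = fault_mask(old_word, new_word)
print(hex(mask), bin(mask).count("1"))          # mask and number of flipped bits
print(decode_r_type(old_word)["rs"], decode_r_type(new_word)["rs"])
print(corrupted_fields(old_word, new_word))
```

For this entry the mask has two bits set, both inside the rs field, which is why the first source operand changes from r14 to r26 while the rest of the instruction is untouched.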
Table 1: Sample Output from Analyze

Modification            Segmentation   Bus          Infinite      Correct       Incorrect     Total
                        Fault          Error        Loop          Results       Results
No Modification         0.00% (0)      0.00% (0)    0.00% (0)     3.03% (3)     0.00% (0)     3.03% (3)
Source Register         0.00% (0)      0.00% (0)    12.12% (12)   5.05% (5)     3.03% (3)     20.20% (20)
Destination Register    0.00% (0)      0.00% (0)    7.07% (7)     1.01% (1)     3.03% (3)     11.11% (11)
Address Offset          0.00% (0)      0.00% (0)    10.10% (10)   7.07% (7)     3.03% (3)     20.20% (20)
Immediate Data          0.00% (0)      0.00% (0)    4.04% (4)     0.00% (0)     1.01% (1)     5.05% (5)
Operation               0.00% (0)      0.00% (0)    14.14% (14)   24.24% (24)   0.00% (0)     38.38% (38)
Address                 0.00% (0)      0.00% (0)    2.02% (2)     0.00% (0)     0.00% (0)     2.02% (2)
Total                   0.00% (0)      0.00% (0)    49.49% (49)   40.40% (40)   10.10% (10)   100.00% (99)
Analyze reads the list of injected faults and resulting errors generated by Corrupt and categorizes them according to the type of corruption and the type of result, as illustrated in Table 1. The program's execution profile determines which injected faults are actually encountered during program execution. Analyze can generate statistics of error recovery attempts in addition to detection performance, simplifying analysis of both mechanisms.
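The tallying step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Analyze implementation: the record layout (address, fault class, symptom) and the function name are hypothetical, and the execution profile is reduced to a set of executed addresses.

```python
from collections import Counter

def tabulate(runs, executed_addresses):
    """Cross-tabulate (fault class, symptom) pairs for executed faults only.

    `runs` is a list of (address, fault_class, symptom) tuples; this record
    layout is an assumption made for the sketch, not Corrupt's log format.
    """
    counts = Counter()
    for address, fault_class, symptom in runs:
        if address in executed_addresses:      # skip faults never reached
            counts[(fault_class, symptom)] += 1
    total = sum(counts.values())
    percents = ({key: 100.0 * n / total for key, n in counts.items()}
                if total else {})
    return counts, percents

runs = [
    (0x400100, "Source Register", "Incorrect Results"),
    (0x400104, "Operation", "Correct Results"),
    (0x400104, "Operation", "Correct Results"),
    (0x400200, "Address Offset", "Incorrect Results"),  # never executed
]
counts, percents = tabulate(runs, executed_addresses={0x400100, 0x400104})
for key in sorted(counts):
    print(key, counts[key], f"{percents[key]:.2f}%")
```

Filtering on the profile before counting keeps faults in unexecuted code from diluting the percentages.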
3. Code Modification
The first of the tools, Transform, is a general purpose code modification program. It loads a program into a form which facilitates code transformations and analysis.
3.1 Analysis of Original Code
The code modification tool Transform begins by building a control flow graph (CFG) [Aho 86] from a MIPS R2000/R3000 assembly language program [Kane 92]. Transform then identifies the CFG back edges and determines register liveness. After completing these tasks, the CFG is ready for insertion of generic or technique-specific code, further elaborated in subsequent subsections.

Transform generates the CFG by parsing the input file procedure by procedure and partitioning the code into basic blocks using labels and control transfer instructions as delimiters. Each CFG node is a basic block. The order in which the basic blocks appear in the program listing is preserved by including link pointers in each node; by using these pointers the nodes can be accessed as elements of a linked list. These links are maintained to allow generation of the output assembly file.

Transform identifies computed jumps, such as those resulting from switch statements, by searching for a jump register instruction which is not a subroutine return. The jump table label immediately precedes the jump instruction. Each entry in the jump table is processed and used to define an additional successor to the basic block in question.

After all edges have been added to the CFG, a loop identifier marks loop back edges using a recursive algorithm. The loop back edge markers are used in most other operations on the CFG. The process of adding the loop identification support revealed that in some compiled C code computed jumps contained back edges.

The final step in the annotation of the CFG is the register liveness determination, which simplifies code transformation. Liveness is found only for general purpose registers, as the CED and recovery techniques initially investigated do not need to use floating-point registers, although it is straightforward to support them. Register liveness can be automatically indicated in the output assembly program listing to aid in planning a tracking function or debugging. When subroutine entries, exits and calls are encountered, the liveness determination tool follows guidelines concerning register use presented in [Kane 92]. System registers such as the stack pointer, the return address register, the kernel registers and the global pointer are assumed to be always live. Argument registers a0-a3 and callee-saved registers s0-s8 are assumed to be live at procedure entry. All other registers are assumed to be dead.
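The partitioning and back-edge steps can be illustrated with a short Python sketch. It is a simplified model, not Transform itself: instructions are reduced to hypothetical (label, mnemonic, target) tuples, and delay slots, subroutine calls and computed jumps are ignored. Back edges are found with a recursive depth-first search, in the spirit of the recursive loop identifier described above.

```python
# Sketch under simplifying assumptions: each instruction is a tuple
# (label_or_None, mnemonic, branch_target_or_None); delay slots, calls and
# computed jumps are ignored. None of these names come from Transform.

COND_BRANCHES = {"beq", "bne", "ble", "blt", "bge", "bgt"}
UNCOND = {"b", "j"}

def build_cfg(instructions):
    """Return (blocks, successors); blocks are lists of instruction indices."""
    # A block starts at index 0, at every label, and after every control transfer.
    leaders = {0}
    for i, (label, op, _target) in enumerate(instructions):
        if label is not None:
            leaders.add(i)
        if op in COND_BRANCHES | UNCOND and i + 1 < len(instructions):
            leaders.add(i + 1)
    starts = sorted(leaders)
    ends = starts[1:] + [len(instructions)]
    blocks = [list(range(s, e)) for s, e in zip(starts, ends)]
    block_of_label = {instructions[b[0]][0]: n for n, b in enumerate(blocks)
                      if instructions[b[0]][0] is not None}
    successors = {}
    for n, block in enumerate(blocks):
        _label, op, target = instructions[block[-1]]
        succ = []
        if op in COND_BRANCHES | UNCOND:
            succ.append(block_of_label[target])     # branch edge
        if op not in UNCOND and n + 1 < len(blocks):
            succ.append(n + 1)                      # fall-through edge
        successors[n] = succ
    return blocks, successors

def back_edges(successors, entry=0):
    """Mark loop back edges with a recursive depth-first search."""
    on_path, done, found = set(), set(), []
    def visit(n):
        on_path.add(n)
        for s in successors[n]:
            if s in on_path:
                found.append((n, s))    # edge to a block on the current path
            elif s not in done:
                visit(s)
        on_path.discard(n)
        done.add(n)
    visit(entry)
    return found

program = [
    ("CheckSort", "li", None),
    ("$43", "lw", None),
    (None, "ble", "$44"),
    (None, "li", None),
    ("$44", "addu", None),
    (None, "blt", "$43"),
    (None, "jr", None),
]
blocks, successors = build_cfg(program)
print(blocks)
print(successors)
print(back_edges(successors))
```

The toy program is loosely modeled on the loop in the sample procedure; the single back edge it reports runs from the block ending in blt to the block labeled $43.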
Transform can also generate PostScript [Adobe 85] diagrams of the control flow graphs. This allows the user to rapidly determine the control structure of a program without decoding lengthy assembly listings. Figure 3 shows a procedure as it is translated into a CFG diagram. Each basic block is represented as a node tagged with the basic block's label. Solid lines represent regular control flow edges, while dashed lines show loop back edges. The PostScript files are generated by DAG [Gansner 89], a program which draws directed acyclic graphs from a list of edges and node labels.

    CheckSort:
            subu    $sp,40
            sw      $31,28($sp)
            sw      $0,32($sp)
            li      $14,1
            sw      $14,36($sp)
    $43:
            lw      $15,36($sp)
            mul     $24,$15,4
            la      $25,sortlist
            addu    $8,$24,$25
            lw      $9,0($8)
            lw      $10,4($8)
            ble     $9,$10,$44
    $43_0:
            li      $11,1
            sw      $11,32($sp)
    $44:
            lw      $12,36($sp)
            addu    $13,$12,1
            sw      $13,36($sp)
            blt     $13,9999,$43
    $44_0:
            lw      $14,32($sp)
            bne     $14,0,$45
    $44_1:
            la      $4,$$20
            jal     printf
    $44_2:
            b       $46
    $45:
            la      ...
            jal     printf
    $46:
            lw      $31,28($sp)
            addu    $sp,40
            j       $31
Figure 3: Sample Application Code, DAG File for CFG and Diagram of CFG

3.2 Insertion of Generic Support Code
Three types of generic support code can be inserted by Transform: profile generation code, trace generation code, and error handler code. In the course of implementing and characterizing a CED technique, one or two of these code modifications would typically be made to a program. The execution profile is used by Analyze to determine which corrupted instructions are executed.
Figure 4: Grouping Basic Blocks into Sequences
When run, self-profiling code provides an execution count for each basic block, while self-tracing code generates a sequence of basic blocks executed. In order to minimize overhead, trace and profile operations deal with sequences of basic blocks. This reduces the number of tracing or profiling instructions which must be inserted and executed, as well as the trace size. A sequence is a contiguous control-equivalent set of code, formed from instructions as a basic block is, except subroutine calls and returns are ignored. If the first instruction in a sequence is executed, then all following instructions are executed, as all instructions in a sequence are control equivalent. Figure 4 shows an example of a procedure's basic blocks grouped as sequences, and how Transform presents this information in its output files. Transform assigns an identifier number to each sequence and stores these numbers in the TraceID file. The sequence numbers are used in all subsequent trace and profile operations, as shown in the Profile and Trace files. Analyze uses the profile or trace information to determine which faults are encountered during program execution. The profile and trace information can steer more sophisticated CED techniques to concentrate error detection resources on heavily executed code sections, or to minimize execution time overhead.

An error handler is invoked after a checkpoint detects an error. A checkpoint is a set of instructions which detects errors. Transform provides three types of error handlers: Terminate, Identify and Terminate, and Restart. Terminate merely attempts to end the program gracefully by restoring the stack pointer to the value from program entry and then exiting from the program. Identify and Terminate reports the label of the failed checkpoint and then attempts to terminate gracefully. Restart tries to continue the program by reloading registers and restarting the procedure or section of the procedure.

3.3 Insertion of Technique-Specific Code
Transform currently implements checkpoint-based recovery to demonstrate the framework's tools and to characterize the recovery effectiveness. Future work will include other types of error detection and recovery, as well as support for superscalar processors.

If an error is detected within a short time, it may be possible to recover from the error and continue correct program execution. One form of error recovery is checkpoint recovery, in which program information is saved periodically for use in a possible recovery attempt. This is backward error recovery, in that when an error is detected the error handler backs up the program state to the last good checkpoint (rollback) and then starts the program running after that checkpoint (restart). By using checkpoints many procedures can be successfully restarted, leading to correct program completion.
    Saved Register N
    Saved Register N-1
    ...
    Saved Register 1
    Saved Register 0
    Procedure Zone
    Procedure Code
    (Shadow Stack Pointer)
Figure 5: Shadow Stack Organization
Checkpoint recovery consists of saving program information on a stack. Shown in Figure 5, this stack is called the shadow stack and is separate from the procedure-call stack. Transform can implement checkpoint recovery by automatically adding shadow stack code to the application program. It is not possible to restart from all errors, but the technique presented provides recovery from a significant fraction of non-memory-corrupting errors. The shadow stack holds all registers which are live upon procedure entry. Immediately before the procedure's exit the current frame of the shadow stack is popped. This technique provides recovery from a significant number of errors, while adding minor memory and execution time overhead.
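The push, pop and rollback behavior of the shadow stack can be modeled in a few lines of Python. This is a minimal sketch for illustration: the class, the dictionary-based frame layout and the register names are assumptions, not the code that Transform inserts.

```python
# Minimal model of the shadow stack idea; register names, the frame layout
# and the class itself are illustrative assumptions, not Transform's code.

class ShadowStack:
    def __init__(self):
        self.frames = []

    def push(self, proc_code, zone, registers, live):
        """At procedure entry: save only the registers that are live."""
        saved = {r: registers[r] for r in live}
        self.frames.append({"proc": proc_code, "zone": zone, "saved": saved})

    def pop(self):
        """Immediately before procedure exit: discard the current frame."""
        self.frames.pop()

    def rollback(self, registers):
        """On a detected error: reload the saved registers for a restart."""
        frame = self.frames[-1]
        registers.update(frame["saved"])
        return frame["proc"], frame["zone"]

regs = {"a0": 7, "a1": 3, "t0": 0}
shadow = ShadowStack()
shadow.push(proc_code=1, zone=0, registers=regs, live=["a0", "a1"])
regs["a0"] = 999          # a fault corrupts a live register
regs["t0"] = 5            # dead at entry, so never saved
restart_at = shadow.rollback(regs)
print(regs, restart_at)
shadow.pop()
```

Only registers live at entry are restored, which is why the model leaves t0 at its later value; this mirrors the limitation that memory-corrupting errors are outside what the scheme can undo.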
Original Procedure Entry Code:

    Search:
            subu    $sp, 56
            sw      $31, 40($sp)
            sw      $17, 36($sp)
            sw      $16, 32($sp)
            lw      $8, 76($sp)
            li      $9, 15

Procedure Entry Code With Recovery Support Added:

    Search:
            subu    $sp, 56
            sw      $31, 40($sp)
            sw      $17, 36($sp)
            sw      $16, 32($sp)
            # Shadow Stack Save Code
            lw      $8, ShadowSP
            lw      $9, ($8)
            li      $10, RegMask
            sw      $10, ($9)
            li      $10, ProcCode
            sw      $10, 4($9)
            li      $10, ProcZone
            sw      $10, 8($9)
            sw      $4, 12($9)
            sw      $5, ...
            li      $10, FrameSize
            sw      $10, 40($9)
            # End of Shadow Stack Save
            lw      $8, 76($sp)
            li      $9, 15

Figure 6: Procedure Entry Modification for Recovery Support
In order to add recovery support to a procedure, Transform adds shadow stack code near its entry and exit points. As illustrated in Figure 6, the first basic block in the procedure is split after any instructions which save registers on the regular stack. These instructions save registers which do not need to be restored for the procedure to run. Instructions to save recovery information on the shadow stack are inserted in the split point of the entry basic block. This code saves information needed to roll back the program, as well as updating the global shadow stack pointer. Figure 5 illustrates the structure of the shadow stack, which holds live registers, a mask identifying the registers, values identifying the procedure and the zone within it, and the frame size.

The procedure's exit points, or return from subroutine instructions, are located in its exit basic blocks. Shadow stack cleanup instructions are inserted immediately before these exit basic blocks in order to remove the current procedure's shadow stack frame by updating the shadow stack pointer. The error handler for recovery procedures reloads registers from the shadow stack and then resumes procedure execution after the saves to the shadow stack.

Currently all checkpoints restart execution at the beginning of the procedure. Recovery support will be added to provide restart capabilities within a loop. This feature will reduce recovery latency by eliminating the need to duplicate successfully completed iterations. Each protected loop will contain both a checkpoint and instructions to save live registers which have changed since the previous iteration. In addition, the shadow stack's Procedure Zone entry will indicate which loop within the procedure is being executed.
4. Fault Injection and Testing
Corrupt is a tool which injects faults into a program and then tests the corrupt program. Each program test run consists of injecting one or more faults into a program and executing the program while observing the results. Typically many tests are run to gather representative data. The experimental data presented here involve from 300 to 1500 tests per program.
4.1 Fault Types
Transient and permanent faults can be characterized by parameters such as duration, severity and location. A transient fault has a certain duration; after that time the fault disappears. The fault severity can be measured semantically or syntactically. One syntactic measure is the number of corrupted bits. A semantic measure of the fault describes the extent of the change in the instruction's function. The location of the fault can be measured semantically or syntactically. A syntactic measurement provides the physical address of the fault, while a semantic measurement describes the program procedure or function where the fault was encountered. These three fault characteristics describe a multidimensional fault space.

The fault space used by Corrupt has axes of fault duration, syntactic fault severity, and syntactic fault location. The fault duration is measured in the number of times the corrupted instruction is read before the fault disappears. The fault severity is measured in the number of bits corrupted, while the fault location is described by its address. In order to simplify analysis, Analyze maps this fault space to one with axes of fault duration, semantic fault severity and semantic fault location. The fault duration dimension remains the same, while the other two dimensions are mapped according to certain fault classifications. Analyze can use either of two fault severity classifications to quantify the semantic fault severity. One indicates the field of the instruction corrupted, while the other classifies the original and corrupted instructions as data-flow or control-flow instructions. The location is determined by the basic block containing the fault.
In the fault injection experiments presented here, faults are limited to a subset of the fault space. The fault duration is one or five iterations or permanent, the fault severity is one or two bits and the fault location is restricted to application code, excluding library routines.

Corrupt loads an executable file and injects one or more faults, whose characteristics are specified on the command line and in supplementary program information files. The corrupted file is then saved and executed. Corrupt can inject any number of bit faults into the test program. Injected faults may be transient or permanent. The lifetime of a transient fault is specified as the number of times the word in question is accessed incorrectly. This method of simulating transient faults requires little overhead and provides a succinct method of specifying fault duration.
4.2 Fault Injection Methods
Faults are injected by selecting an instruction within the executable file of the program and either corrupting it directly or replacing it with a jump to a transient fault simulator subroutine, shown in Figure 7. This subroutine has a fault-free and a corrupt copy of the instruction: one is executed at runtime. A counter variable is used to determine the number of times the corrupt instruction is to be executed before the fault-free instruction is executed, thereby controlling the fault duration. The instruction to be corrupted can be selected randomly or read from a file, facilitating the duplication of experiments.

The branch delay slot [Kane 92] in the MIPS architecture complicates the insertion of the jump to the transient fault simulator. As a result, a preceding or delay slot instruction must be added to the simulator. If the instruction to be corrupted is in a control transfer delay slot, then the preceding instruction must be copied as well. If a control transfer instruction is to be corrupted, then its delay slot must be copied.

Most instructions can be moved to the transient fault simulator simply by duplication. However, some instructions have relative address references which depend upon the instruction's address. As a result, Corrupt must relink these instructions. If a branch instruction is moved, then its target offset field must be updated to contain the offset from each new position of the branch instruction. This requires extracting the offset from the original instruction, combining it with the original instruction's address to find the target, determining the offset from the new branch instruction to the target, and encoding that value into the new branch instruction.
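The four relinking steps can be written out as a short Python sketch. It assumes the usual MIPS branch rule (target = branch address + 4 + signed 16-bit offset shifted left by two); the function names and the example addresses are hypothetical and do not come from Corrupt.

```python
# Sketch of re-linking a MIPS I-type branch that is copied to a new address.
# Assumes the usual MIPS rule: target = (branch address + 4) + (offset << 2),
# with a signed 16-bit word offset. Function names are hypothetical.

def sign_extend16(value):
    return value - 0x10000 if value & 0x8000 else value

def branch_target(word, address):
    """Target address of the branch `word` located at `address`."""
    return address + 4 + (sign_extend16(word & 0xFFFF) << 2)

def relink_branch(word, old_address, new_address):
    """Return the branch word re-encoded so it reaches the same target."""
    target = branch_target(word, old_address)       # steps 1 and 2
    delta = target - (new_address + 4)              # step 3
    if delta % 4:
        raise ValueError("target is not word aligned")
    offset = delta >> 2
    if not -0x8000 <= offset <= 0x7FFF:
        raise ValueError("target out of range for a 16-bit branch offset")
    return (word & 0xFFFF0000) | (offset & 0xFFFF)  # step 4

# beq $9, $0, +3 words, originally at 0x400100 and moved to 0x400800
beq = (0x04 << 26) | (9 << 21) | (0 << 16) | 3
moved = relink_branch(beq, 0x400100, 0x400800)
print(hex(branch_target(beq, 0x400100)), hex(branch_target(moved, 0x400800)))
```

The range check matters in practice: a simulator subroutine placed too far from the original code cannot be reached by a 16-bit branch offset, so a sketch like this would have to reject or handle that case.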
    Transient_Fault:
            sw      $8, -4($sp)                     # save the old value of $8
            sw      $9, -8($sp)                     # save the old value of $9
            la      $8, EnableFaults                # load pointer to counter indicating # of faults
            lw      $9, ($8)                        # load counter
            beq     $9, $0, Transient_Fault_Good    # if faults not enabled, branch
    Transient_Fault_Bad:
            subu    $9, $9, 1                       # one fewer fault
            sw      $9, ($8)
            lw      $8, -4($sp)                     # restore it
            lw      $9, -8($sp)                     # restore it
            nop                                     # space for corrupted and surrounding instructions
            j       Application_Code                # filler for jump back
            nop                                     # extra space
    Transient_Fault_Good:
            lw      $8, -4($sp)                     # restore it
            lw      $9, -8($sp)                     # restore it
            nop                                     # space for correct
            nop                                     # and surrounding
            nop                                     # instructions
            j       Application_Code                # filler for jump back to code
            nop
Figure 7: Transient Fault Simulator Code
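The counter logic of the simulator can be modeled behaviorally in Python. This is a sketch of the idea only: the two callables stand in for the corrupt and fault-free instruction copies, and the choice of subu as the corrupted form of addu is a hypothetical example.

```python
# Behavioral model of the transient fault simulator: a counter selects the
# corrupt copy of an instruction for the first n executions and the
# fault-free copy afterwards. The callables stand in for the two copies.

def make_transient(good, bad, duration):
    remaining = {"faults": duration}      # plays the role of the counter
    def execute(*args):
        if remaining["faults"] > 0:
            remaining["faults"] -= 1      # one fewer fault
            return bad(*args)
        return good(*args)
    return execute

def addu(a, b):                           # fault-free instruction
    return (a + b) & 0xFFFFFFFF

def subu(a, b):                           # hypothetical corrupted form
    return (a - b) & 0xFFFFFFFF

step = make_transient(addu, subu, duration=2)
results = [step(5, 1) for _ in range(4)]
print(results)
```

With a duration of two, the first two executions use the corrupt copy and every later execution uses the fault-free copy; a duration of zero models a disabled fault.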
Corrupt prints information about the corruption, displaying the address of the corrupted instruction, its original and corrupted instruction words, and disassembled versions of the original and corrupted instructions. This information is later used by Analyze to classify the corruption. The standard disassembler() function is used to decode the instruction. This function decodes instructions as the R2000 and R3000 processors do; some corrupted instruction words are equivalent to the original instructions, and disassembler() identifies them properly.

Certain sections of code are excluded from corruption. Corrupt limits corruption to the application code, ignoring library routines such as printf. The address range of the application code is determined by searching the executable file for application basic block labels. It is possible to allow corruption of library routines, but the current research focuses on testing application programs. This is in part to reduce the chance of corrupting the experimentation system. For example, execution of corrupted file I/O routines could lead to corruption of the host computer's disk file structure. In order to test system-wide robustness, one would allow corruption of all program instructions and data within the program.
Table 2: Corrupt Command-Line Options

Option            Description of Action
[illegible]       inject transient error (permanent error is default)
-nDuration        make transient error last for Duration iterations
-c                inject only control-flow errors
-xExclusionFile   exclude address ranges listed in ExclusionFile from corruption
-aAddress         corrupt word at Address
-mMask            use Mask to corrupt word
-bNumBits         invert NumBits in the word
-wFaultFile       store addresses and masks of errors in FaultFile
-rFaultFile       load addresses and masks of errors from FaultFile
-fFaultNumber     use address and mask of fault FaultNumber
Table 2 lists the command line options for Corrupt. The user can specify address ranges to be excluded from corruption, allowing flexible testing of programs with built-in error detection capabilities. For example, one program might perform array operations and then compute a checksum, comparing it with the correct value. By excluding the checksum computation and comparison code from corruption, one can determine the effectiveness of the checking code and the vulnerability of the user-written application code. In order to simulate errors of varying duration, the user may specify the number of iterations for which an instruction is corrupt. The experiments presented here simulate permanent errors as well as transient errors lasting one or five iterations. These numbers are arbitrarily chosen for our initial experiments.
Corrupt can automatically generate an error log to record the corrupted addresses and the masks used. The user may specify for Corrupt to recreate a specific corruption listed in the log, simplifying the repetition of a test or a series of tests.
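The combination of exclusion ranges, bit masks and a replayable log can be sketched as follows. This Python model is an illustration under stated assumptions: the code section is a plain list of 32-bit words, the log record is an (address, mask) pair, and the function names are hypothetical rather than taken from Corrupt.

```python
import random

def inject(words, base, num_bits, excluded, rng):
    """Flip `num_bits` random bits in one word outside the excluded ranges.

    `words` is the code section as a list of 32-bit ints starting at address
    `base`; `excluded` is a list of (start, end) address ranges, end exclusive.
    Returns a log record (address, mask) so the fault can be replayed.
    """
    candidates = [i for i in range(len(words))
                  if not any(s <= base + 4 * i < e for s, e in excluded)]
    index = rng.choice(candidates)
    mask = 0
    for bit in rng.sample(range(32), num_bits):
        mask |= 1 << bit
    words[index] ^= mask
    return base + 4 * index, mask

def replay(words, base, record):
    """Re-create a logged fault from its (address, mask) record."""
    address, mask = record
    words[(address - base) // 4] ^= mask

code = [0x01CFC021, 0x8FA80024, 0x03E00008, 0x00000000]
first = list(code)
record = inject(first, 0x400000, 2, [(0x400008, 0x400010)], random.Random(0))
second = list(code)
replay(second, 0x400000, record)
print(hex(record[0]), hex(record[1]), first == second)
```

Because a fault is fully described by its address and XOR mask, storing that pair is enough to repeat an experiment exactly, which is the property the error log relies on.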
5. Data Analysis
Data analysis concentrates on how injected faults manifest themselves with respect to program results. A fault may or may not lead to an error. Analyze is a data analysis tool which examines faults and symptoms and generates statistics based upon certain classifications.
5.1 Data Classification
Analyze reads an error log and classifies each fault and its symptom. Two fault classifications and one symptom classification are used, providing a straightforward breakdown of the test runs. Analyze uses test program trace or profile information to select only the test runs in which the corrupted instruction is executed. In addition, it identifies instructions which are undetectable by software means (such as instructions corrupted to become jumps to unallocated memory).

One of two methods verifies program output. If convenient, the user adds a function to the program to verify its results. This function is excluded from corruption during testing, so each test run indicates whether the program produced correct results. If the function cannot be added easily, then after testing the program results are checked. VERIFY, a script, classifies program results based on output listed in the error log and possible output files. The error log is annotated by VERIFY to indicate correct program completion, and Analyze uses these annotations in its classification of the test run.

The two fault classifications characterize the change in the instruction, either in terms of the component of the instruction corrupted (opcode, register specifier, etc.) or the type of the fault-free and corrupted instructions (control flow or data flow). Table 3 shows the first classification, which follows directly from identifying which instruction field has been corrupted by the fault. This classification applies to the MIPS R2000/3000 processors; other processors may have additional instruction fields (e.g. post-increment pointer) which would need to be added to the classification.

Some corrupted instructions may be decoded by the processor the same as the original instructions. Other corrupted instructions may result in different operations but with identical results. The MIPS nop instruction is an extreme example of this, as it is implemented by the assembler as sll r0, r0, 0 (shift left logical, instruction code 0x00000000), with five fields: opcode, immediate data, and three register specifiers. This instruction is rather resistant to single bit errors. If the unused register specifier field is corrupted (five bits of the 32 in the instruction), there is no change in execution. If the source register specifier or immediate data is corrupted (ten bits of 32), the instruction has the same result, writing some value to register r0 where it is discarded. If the operation is corrupted, five of the twelve possible resultant operations write to r0 with no other effect, resulting in a nop. As a result, a corrupted nop instruction has a 20/32 or 62.5% probability of remaining a nop functionally.
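The 62.5% figure for the nop instruction can be checked with a short calculation. The sketch below only restates the bit counts given in the text; the variable and function names are illustrative assumptions, not part of Analyze.

```python
# Bit budget of the 32-bit nop encoding (sll r0, r0, 0), using the counts
# stated in the text. All names here are illustrative assumptions.
INSTRUCTION_BITS = 32

UNUSED_REGISTER_BITS = 5        # unused register specifier: no change in execution
SOURCE_AND_IMMEDIATE_BITS = 10  # source register + immediate data: result still lands in r0
HARMLESS_OPERATION_BITS = 5     # 5 of the 12 single-bit operation corruptions still act as a nop


def nop_survival_probability() -> float:
    """Probability that one uniformly chosen bit flip leaves a nop functionally a nop."""
    harmless = (UNUSED_REGISTER_BITS
                + SOURCE_AND_IMMEDIATE_BITS
                + HARMLESS_OPERATION_BITS)
    return harmless / INSTRUCTION_BITS


print(f"{nop_survival_probability():.1%}")  # 62.5%
```

The calculation assumes each of the 32 bits is equally likely to be hit by a single bit fault.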
Table 3: Instruction Field Fault Classification

Instruction Component Modified    Example: Original Instruction    Example: Corrupted Instruction
Source Register Specifier         add r3, r1, r2                   add r3, r1, r18
Destination Register Specifier    add r3, r1, r2                   add r11, r1, r2
Address                           jal 0x400520
Address Offset                    lw r2, 32(sp)                    lw r2, 160(sp)
Immediate Data                    andi r1, r3, 15                  andi r1, r3, 7
Operation                         slti r1, r14, 200                j 0x7040320
No Modification                   jal 0x400520                     jal 0x400520
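The field classification of Table 3 can be illustrated by comparing the original and corrupted instruction words. The following is a minimal sketch, assuming the standard MIPS I field layout (opcode in bits 31-26, rs 25-21, rt 20-16, rd 15-11, shift amount 10-6, function code 5-0) and a deliberately partial opcode list. The function name and the mapping of fields to categories are assumptions for illustration; the paper does not describe how Analyze performs this step.

```python
# Opcode values from the MIPS I instruction set (partial list, assumed
# sufficient for illustration only).
SPECIAL = 0                                   # R-type instructions
REGIMM = 1                                    # bltz, bgez, ...: rt selects the condition
JUMPS = {2, 3}                                # j, jal
BRANCHES = {4, 5, 6, 7}                       # beq, bne, blez, bgtz
LOADS = {32, 33, 34, 35, 36, 37, 38}          # lb, lh, lwl, lw, lbu, lhu, lwr
STORES = {40, 41, 42, 43, 46}                 # sb, sh, swl, sw, swr


def classify_field(original: int, corrupted: int) -> str:
    """Name the Table 3 category for a single-bit difference between two words."""
    diff = original ^ corrupted
    if diff == 0:
        return "No Modification"
    if diff & (diff - 1):
        raise ValueError("expected a single-bit fault")
    bit = diff.bit_length() - 1
    opcode = (original >> 26) & 0x3F

    if bit >= 26:
        return "Operation"                            # primary opcode
    if opcode in JUMPS:
        return "Address"                              # 26-bit jump target
    if bit >= 21:
        return "Source Register Specifier"            # rs is always read
    if opcode == SPECIAL:
        if bit >= 16:
            return "Source Register Specifier"        # rt
        if bit >= 11:
            return "Destination Register Specifier"   # rd
        if bit >= 6:
            return "Immediate Data"                   # shift amount
        return "Operation"                            # function code
    if bit >= 16:                                     # rt field of an I-type word
        if opcode == REGIMM:
            return "Operation"                        # rt encodes the branch condition
        reads_rt = opcode in STORES or opcode in {4, 5}
        return ("Source Register Specifier" if reads_rt
                else "Destination Register Specifier")
    uses_offset = opcode in (LOADS | STORES | BRANCHES | {REGIMM})
    return "Address Offset" if uses_offset else "Immediate Data"


# slti r1, r14, 200 with bit 29 flipped decodes as j 0x7040320 (Table 3, "Operation").
print(classify_field(0x29C100C8, 0x09C100C8))
```

Encoding the table's examples by hand gives add r3, r1, r2 = 0x00221820, lw r2, 32(sp) = 0x8FA20020 and andi r1, r3, 15 = 0x3061000F; flipping bit 20, bit 7 and bit 3 respectively reproduces the corrupted instructions listed in the table.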
Table 4 shows the second classification, which uses the types of the original and corrupted instructions. This provides a slightly different perspective on the faults and their symptoms.

Table 4: Instruction Type Fault Classification

Instruction Type Modification          Example: Original Instruction    Example: Corrupted Instruction
Control Flow to Control Flow           jal 0x400520                     jal 0x400720
Control Flow to Data Flow              jr r31                           addi r0, r31, 8
Data Flow to Control Flow              slti r1, r14, 200                j 0x7040320
Data Flow to Data Flow                 add r3, r1, r2                   add r3, r1, r18
Control Flow to Illegal Instruction    slti r1, r14, 200                scathe r14, 80(r1)
Data Flow to Illegal Instruction       add r3, r1, r2                   sdc3 r5, 24(r2)
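The type classification of Table 4 can be sketched in the same way. This is an illustrative approximation only: it assumes the MIPS I opcode assignments for jumps and branches, treats every other word as data flow, and omits the illegal-instruction check, which would need a complete opcode table. The function names are hypothetical.

```python
# MIPS I opcodes that transfer control (assumed list for illustration).
CONTROL_FLOW_OPCODES = {1, 2, 3, 4, 5, 6, 7}   # REGIMM branches, j, jal, beq, bne, blez, bgtz
JUMP_REGISTER_FUNCTS = {8, 9}                  # jr, jalr under opcode 0 (SPECIAL)


def instruction_type(word: int) -> str:
    """Return "Control Flow" or "Data Flow" for a 32-bit instruction word."""
    opcode = (word >> 26) & 0x3F
    funct = word & 0x3F
    if opcode in CONTROL_FLOW_OPCODES:
        return "Control Flow"
    if opcode == 0 and funct in JUMP_REGISTER_FUNCTS:
        return "Control Flow"
    return "Data Flow"  # illegal-instruction detection is deliberately left out


def classify_type_change(original: int, corrupted: int) -> str:
    """Name the Table 4 category for an original/corrupted instruction pair."""
    return f"{instruction_type(original)} to {instruction_type(corrupted)}"


# jr r31 (0x03E00008) with bit 29 flipped decodes as addi r0, r31, 8 (0x23E00008).
print(classify_type_change(0x03E00008, 0x23E00008))  # Control Flow to Data Flow
```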
The result classification, shown in Table 5, describes the outcome of the corruption. This classification shows the relative performance of the detection mechanisms. The program may terminate normally with correct or incorrect output. If used, the VERIFY script differentiates between these two cases. The program may have its own built-in error detection capabilities which detect the error, in which case Analyze uses the program's error notification message. The code added for CED may detect the error: the error handler prints a notification of the error for Analyze to use. A software watchdog detects infinite loops and terminates execution of the test program. Finally, segmentation faults, bus errors and illegal instructions are all caught by the processor and identified automatically. Analyze can provide classification information for each test, simplifying interpretation of specific faults and errors.

Table 5: Error Classification

Program Result                                                       Detection Mechanism
Normal Program Completion with Correct Results                       Post-Run Comparison or Inherent Program Checking
Normal Program Completion with Incorrect Results                     Post-Run Comparison or Inherent Program Checking
Program Detection of Error                                           Inherent Program Checking
Checkpoint Detection of Error                                        CED Software
Infinite Loop, Segmentation Fault, Bus Error, Illegal Instruction    CPU Hardware
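The decision order implied by the result classification can be sketched as follows. The record type and its field names are hypothetical; the paper does not give the error-log format, so this only illustrates how one test run might be mapped onto the Program Result column of Table 5.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical per-run record; the field names are assumptions and do not
# reflect the framework's actual error-log format.
@dataclass
class RunRecord:
    hardware_exception: Optional[str] = None  # "Segmentation Fault", "Bus Error" or "Illegal Instruction"
    watchdog_fired: bool = False               # software watchdog ended an infinite loop
    ced_notified: bool = False                 # inserted CED code reported an error
    program_notified: bool = False             # the program's own checks reported an error
    output_correct: bool = False               # verdict of the verification function or VERIFY


def classify_result(run: RunRecord) -> str:
    """Map one test run onto a Program Result from Table 5."""
    if run.hardware_exception is not None:
        return run.hardware_exception
    if run.watchdog_fired:
        return "Infinite Loop"
    if run.ced_notified:
        return "Checkpoint Detection of Error"
    if run.program_notified:
        return "Program Detection of Error"
    if run.output_correct:
        return "Normal Program Completion with Correct Results"
    return "Normal Program Completion with Incorrect Results"


print(classify_result(RunRecord(ced_notified=True)))  # Checkpoint Detection of Error
```

Checking abnormal terminations first is a design choice of this sketch: a run that ends in a hardware exception never reaches the output comparison.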
5.2 Test Results
A set of programs from the Stanford and SPEC92 benchmark suites, listed in Table 6, is used to demonstrate the framework, to characterize a processor's intrinsic CED capabilities and to characterize an error recovery technique. The benchmark programs are compute intensive, perform little I/O, and use a range of data structures, from matrices to dynamically allocated trees. As expected, the programs show a variety of intrinsic error detection levels.

The result of each program is verified by manually inserting a verification function into the application program, except for alvinn, which uses a script for post-run-time verification. All verification functions must be excluded from corruption to ensure accurate program result verification. The user must tailor each result verification approach to match the data format. For example, the results of the three sort programs (quick, bubble and tree) are verified by confirming the output list of elements is in order. The permutation program (perm) is verified by comparing the output value to the known good value.

Each benchmark is tested in two sets of tests, with three subsets in each set. Each set uses the same faults, but the fault duration varies between permanent, one and five iterations across the subsets. One test set injects single bit faults, while the second injects dual bit faults. Each benchmark is configured to run to completion in several tens of millions of clock cycles, providing a sufficiently realistic environment. The benchmarks were compiled using CC without any optimizations. Figures 7 and 8 present the results of the fault injection experiments.
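The two verification approaches and the layout of the test sets described above can be sketched briefly. This is an illustration under stated assumptions: the function names are hypothetical, and the benchmarks' real verification functions are written into the application programs themselves rather than in Python.

```python
from itertools import product


def is_sorted(values) -> bool:
    """Verification for the sort benchmarks: the output list must be in order."""
    return all(a <= b for a, b in zip(values, values[1:]))


def matches_known_good(output, known_good) -> bool:
    """Verification for perm: compare the output value with the known good value."""
    return output == known_good


# Two test sets (single and dual bit faults), each with three subsets that
# differ only in fault duration.
FAULT_WIDTHS = (1, 2)                  # bits corrupted per fault
FAULT_DURATIONS = ("permanent", 1, 5)  # duration in iterations


def experiment_matrix():
    """Enumerate every (fault width, fault duration) subset."""
    return list(product(FAULT_WIDTHS, FAULT_DURATIONS))


print(len(experiment_matrix()))  # 6
```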
Table 6: Benchmark Program Information

Program   Benchmark Type   Description                      Executable File Size   Instructions Executed
bubble    Stanford         Bubble Sort                      40812                  37111480
quick     Stanford         Quick Sort                       41036                  42439092
tree      Stanford         Tree Sort                        41812                  55000944
perm      Stanford         Permutation                      40428                  94613962
puzzle    Stanford         Puzzle Solver                    50268                  59712518
queens    Stanford         Eight Queens                     40324                  ?1698013
fft       Stanford         FFT Evaluator                    426??                  79332658
mm        Stanford         Floating Point Matrix Multiply   40964                  44744219
intmm     Stanford         Integer Matrix Multiply          41098                  37088215
alvinn    SPECint 92       Neural network simulator         67940                  36040896
Table 7: Average Benchmark Test Results for 1 Bit Faults

Program   Number of Tests   % No Effect   % Incorrect Result   % CPU Detected
bubble    300               28            16                   56
quick     297               34            11                   55
tree      294               50            2                    48
perm      281               39            9                    52
puzzle    270               31            5                    64
queens    297               45            6                    49
fft       228               14            41                   45
mm        300               16            32                   52
intmm     288               18            33                   49
alvinn    1494              48            15                   37
Average   Total: 4049
Table 8: Average Benchmark Test Results for 2 Bit Faults

Program   Number of Tests   % No Effect   % Incorrect Result   % CPU Detected
bubble    300               27            14                   59
quick     291               24            6                    70
tree      297               27            3                    70
perm      288               32            15                   53
puzzle    267               20            10                   70
queens    297               32            11                   57
fft       255               9             29                   62
mm        300               13            34                   53
intmm     288               10            35                   55
alvinn    1491              33            13                   54
Average                                   17.0                 60.3
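The Average row of Table 8 appears to be an unweighted mean over the ten programs, which the sketch below checks against the two average values that survive in the table (17.0 and 60.3). The row labels for tree and fft are inferred from the program order used in Table 6, and the data structure is an assumption made for illustration.

```python
# Table 8 rows: program -> (tests, % no effect, % incorrect result, % CPU detected).
# The "tree" and "fft" labels are inferred from the row order of Table 6.
TABLE_8 = {
    "bubble": (300, 27, 14, 59),
    "quick":  (291, 24, 6, 70),
    "tree":   (297, 27, 3, 70),
    "perm":   (288, 32, 15, 53),
    "puzzle": (267, 20, 10, 70),
    "queens": (297, 32, 11, 57),
    "fft":    (255, 9, 29, 62),
    "mm":     (300, 13, 34, 53),
    "intmm":  (288, 10, 35, 55),
    "alvinn": (1491, 33, 13, 54),
}


def column_mean(table, column: int) -> float:
    """Unweighted mean of one percentage column across all programs."""
    values = [row[column] for row in table.values()]
    return round(sum(values) / len(values), 1)


print(column_mean(TABLE_8, 2))  # % incorrect result: 17.0
print(column_mean(TABLE_8, 3))  # % CPU detected: 60.3
```

A mean weighted by number of tests would be dominated by alvinn, which has roughly five times as many tests as any other program.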
Alvinn is enhanced with a rudimentary software concurrent error detection mechanism and shadow stack recovery code. The purpose is to determine the restartability of the code. Transform adds a simplistic error detection scheme in which each basic block verifies that its predecessor is in a valid control flow path. The technique is a simplified version of the signature method presented in [Schuette 91]. Basic blocks load identifier keys into registers for verification by successors. Each basic block checks only its immediate predecessor. Due to the loop-intensive nature of Alvinn, execution time overhead is only 2.5%. Because of the short detection latency, errors are caught before serious damage can be caused. Corrupt injects single iteration faults; the error detection scheme identifies 31 of 299 errors that are not detected by the processor (214 of the missed errors have no effect). Table 9 shows that 85% of the restart attempts for the detected errors lead to correct program recovery.
Table 9: Alvinn Restart Attempt Results
Columns: Modification; % Segmentation Fault; % Bus Error; % Infinite Loop; % Incorrect Results; % Correct Results; Total
Rows: Source Register; Destination Register; Address Offset; Immediate Data; Operation; Total
Values in extracted order: 0.0, 0.0, 5.9, 0.0, 14.7, 2.9, 0.0, 8.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 44.1, 2.9, 47.0, 0.0, 0.0, 2.9, 17.7, 0.0, 20.6, 85.3, 5.9, 100.0, 14.7
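The predecessor check used in the Alvinn experiment can be illustrated with a small simulation. This is a sketch of the general idea only: the control flow graph, block names and exception type are hypothetical, and the real scheme is inserted by Transform as MIPS instructions that load identifier keys into registers.

```python
# Hypothetical control flow graph: block -> set of legal immediate predecessors.
VALID_PREDECESSORS = {
    "B1": {"ENTRY"},
    "B2": {"B1", "B3"},   # loop header, entered from B1 or from the loop body
    "B3": {"B2"},
    "EXIT": {"B2"},
}


class ControlFlowError(Exception):
    """Raised when a block is entered from a block that is not a legal predecessor."""


def run_trace(trace):
    """Execute a sequence of basic blocks, checking only the immediate predecessor."""
    key = "ENTRY"  # stands in for the register holding the predecessor's identifier key
    for block in trace:
        if key not in VALID_PREDECESSORS[block]:
            raise ControlFlowError(f"{block} entered from {key}")
        key = block  # this block's key is what its successor will verify
    return key


print(run_trace(["B1", "B2", "B3", "B2", "EXIT"]))  # EXIT
```

Because each block checks only its immediate predecessor, an illegal transfer is reported at the first block boundary after it, which matches the short detection latency described above. A corrupted jump that lands on a legal successor is not caught by this check.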
6. Conclusions
This paper introduces a framework of software tools to automatically implement, characterize and evaluate concurrent error detection and recovery techniques. Modern processors provide intrinsic error detection mechanisms through virtual memory, bus interfaces and instruction decoders. This paper characterizes the performance of these intrinsic CED mechanisms in MIPS R2000/R3000 processors. Recent work has proposed CED techniques such as signature monitoring, in which the CED instructions are embedded within the application program instructions. The framework provides an automated method for implementing them and characterizing their effectiveness. More recent work introduced integrated monitoring. This paper presents an automated method of embedding integrated monitoring. Finally, the paper presents an automatically-inserted checkpoint-based error recovery scheme with initial performance results that show it to be quite effective.

These represent only our initial efforts in using the experimental framework. The main purpose is to illustrate the tools' effectiveness. However, based on these initial experimental results, we can make some observations.
6.1 Observations
The experiments conducted show a range of program resistance to corruption, as seen in Tables 7 and 8. The data can also be analyzed from the point of view of program completion. Given that the program has terminated normally, with no segmentation faults or bus errors, what is the likelihood that the computed results are correct? Table 10 presents this information for each program.
Table 10: Correct Program Completion Statistics

Program   Percentage for 1 Bit Faults   Percentage for 2 Bit Faults
bubble    64                            66
quick     77                            81
tree      76                            89
puzzle    86                            66
queens    89                            76
fft       25                            29
mm        34                            28
intmm     35                            22
alvinn    77                            71
perm
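The likelihood asked about above is a conditional probability. A minimal sketch follows, assuming that "No Effect" in Tables 7 and 8 means normal completion with correct results and that "Incorrect Result" means normal completion with incorrect results. The function name is hypothetical, and the paper does not state the exact formula behind Table 10.

```python
def correct_given_normal_completion(no_effect_pct: float, incorrect_pct: float) -> float:
    """Percentage of normally terminating runs whose results are correct.

    Assumes every run that is neither "No Effect" nor "Incorrect Result"
    terminated abnormally (CPU detected) and is therefore excluded.
    """
    return 100.0 * no_effect_pct / (no_effect_pct + incorrect_pct)


# bubble, 1 bit faults (Table 7): 28% no effect, 16% incorrect result.
print(round(correct_given_normal_completion(28, 16)))  # 64
# intmm, 2 bit faults (Table 8): 10% no effect, 35% incorrect result.
print(round(correct_given_normal_completion(10, 35)))  # 22
```

Both values agree with the bubble and intmm entries of Table 10. Other entries computed this way can differ from the table by a point or more, since the inputs in Tables 7 and 8 are already rounded percentages.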
The programs vary significantly, reflecting the nature of the data manipulation in each algorithm. Figure 8 shows that fft is quite vulnerable to corruption, as only about one fourth of the correctly terminating test runs actually give correct results. This is due to the combination of the large amount of data handled and the fact that the data are not reduced to a smaller set. Figure 9 indicates perm is much more robust, as three quarters of correctly terminating test runs yield correct results. This is due to the reduction action of the program; all data produced by the program are reduced eventually to a single integer. Alvinn repeatedly refines its data, eventually producing a small data set. This averaging action tends to filter out errors; about three fourths of the program results are correct. The tree sort program is likely to produce correct results due to its pointer-rich structure; each node in the data structure consists of one data field and two pointers. Nearly all data flow errors which corrupt a pointer lead to a CPU-detectable error.
Figure 8: Error Characteristics for fft Benchmark. Two panels: Error Distribution vs. Fault Duration (1 Bit Fault) and Error Distribution vs. Fault Duration (2 Bit Fault). Horizontal axis: Fault Duration in Iterations, including Permanent. Regions: No Effect, Incorrect, CPU Detected.

Figure 9: Error Characteristics for Perm Benchmark. Two panels: Error Distribution vs. Fault Duration (1 Bit Fault) and Error Distribution vs. Fault Duration (2 Bit Fault). Horizontal axis: Fault Duration in Iterations, including Permanent. Regions: No Effect, Incorrect, CPU Detected.

6.2 Future Work
This framework automates most of the work needed to implement and evaluate many CED and recovery techniques. Future work will include characterization of existing and new CED and recovery techniques, providing a clear comparison of the different techniques. The framework will be expanded to accommodate CED and recovery techniques for superscalar processors.

The framework with its three tools has been implemented, consists in total of 16 000 lines of C code and currently targets the MIPS R2000 and R3000 processors in DECstation 3100 and 5000 workstations. We intend to distribute these tools to help stimulate more experimental research in this area within the dependable computing research community.
Acknowledgments
This work was funded by the Office of Naval Research under contract N00014-91-J-1518.
References
[Adobe 85] Adobe Systems, PostScript Language Reference Manual, Addison-Wesley, 1985
[Aho 86] A.V. Aho, R. Sethi, J.D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1988
[Gansner 89] E.R. Gansner, S.C. North, K.P. Vo, "DAG -- A Program that Draws Directed Graphs," AT&T Bell Laboratories, Murray Hill, New Jersey
[Kane 92] G. Kane, J. Heinrich, MIPS RISC Architecture, Prentice Hall, 1992
[Miremadi 92] G. Miremadi, J. Karlsson, U. Gunneflo and J. Torin, "Two Software Techniques for On-line Error Detection," 1992
[Ohlsson 92] J. Ohlsson, M. Rimén, U. Gunneflo, "A Study of the Effects of Transient Fault Injection into a 32-bit RISC with Built-in Watchdog," Proceedings of 22nd International Symposium on Fault-Tolerant Computing, 1992
[Schuette 86] M.A. Schuette, J.P. Shen, D.P. Siewiorek, Y.X. Zhu, "Experimental Evaluation of Two Concurrent Error Detection Schemes," Proceedings of 16th International Symposium on Fault-Tolerant Computing, 1986
[Schuette 91] M.A. Schuette, J.P. Shen, "Exploiting Instruction-level Resource Parallelism for Transparent, Integrated Control-flow Monitoring," Proceedings of 21st International Symposium on Fault-Tolerant Computing, 1991
[Segall 88] Z.Z. Segall, D. Vrsalovic, D.P. Siewiorek, D. Yaskin, J. Kownacki, J. Barton, R. Dancey, A. Robinson, T. Lin, "FIAT -- Fault Injection Based Automated Testing Environment," Proceedings of 18th International Symposium on Fault-Tolerant Computing, 1988
[Siewiorek 92] D.P. Siewiorek, R.S. Swarz, Reliable Computer Systems: Design and Evaluation, Digital Press, 1992
[Wilken 90] K.D. Wilken, J.P. Shen, "Continuous Signature Monitoring: Low-Cost Concurrent Detection of Processor Control Errors," IEEE Transactions on Computer-Aided Design, Vol. 9, No. 6, June 1990