Dynamic Binary Control-Flow Errors Detection - Semantic Scholar

2 downloads 114 Views 298KB Size Report
techniques to address soft-errors, software techniques can provide a less expensive and .... The control-flow checking by software signatures. [OSM02] (CFCSS) ...
Dynamic Binary Control-Flow Errors Detection Edson Borin‡†, Cheng Wang†, Youfeng Wu†, Guido Araujo‡ †Programming Systems Lab Intel Corporation 2200 Mission College Blvd Santa Clara, CA, USA

‡Institute of Computing UNICAMP Campinas, SP, Brazil

Abstract

Hardware based approaches generally relies on inserting redundant hardware such as duplicating functional units or even the entire processor. As an alternative to the hardware ones, software-only techniques change only the software, and the reliability comes free of cost (except for the performance loss). This paper focus on a special class of faults called controlflow errors that occur when a processor jumps to an incorrect next instruction due to a soft-error. We propose a control-flow error classification and two software based control-flow checking techniques. The techniques were evaluated in our dynamic binary translator, the StarDBT. Section 2 depicts the control-flow error classification. Section 3 describes related works and the new control-flow checking techniques. Section 4 presents the dynamic binary translator and the control-flow checking techniques implementation. Section 5 shows the results, and Section 6 concludes the paper.

Shrinking microprocessor feature size will increase the soft-error rates to unacceptable levels in the near future. While reliable systems typically employ hardware techniques to address soft-errors, software techniques can provide a less expensive and more flexible alternative. This paper presents a control-flow error classification and proposes new software based control-flow error detection techniques. The new techniques are better than the previous ones in the sense that they detect errors in all the brancherror categories. We also compare the performance of our new techniques with that of the previous ones using our dynamic binary translator.

1

Introduction

Transient faults, also called soft-errors or single-event upsets (SEUs), are intermittent faults that do not occur consistently. Generally, these faults are caused by external events such as neutron and alpha particles striking the chip or power supply and interconnect noise [Con03]. Although these faults do not cause permanent damage, it may result in incorrect program execution by altering signal transfers or stored values. In the past, soft-errors have been addressed in highavailability systems and safety-critical applications [VHM03], but nowadays, new trends in microprocessor manufacturing have pushed these faults under the spot light. Transistors are becoming increasingly smaller, faster, and with tighter noise margins, which make processors more susceptible to soft-errors [SKK+02]. Indeed, soft-errors are already changing the way the industry looks at processors design. Sun Microsystems lost a major customer to IBM due to server crashes caused by soft-errors [Bau02]; and the fear of cosmic ray strikes led Fujitsu to protect 80 percent of the 200,000 latches in its recent Sparc processor with some form of error detection [AYI+03]. Most modern microprocessors already incorporate mechanisms for detecting soft-errors. Memory elements, particularly caches, are protected using mechanisms such as error-correcting codes (ECC) and parity. The protection is typically focused on memory because the techniques are well understood and do not require expensive extra circuitry. Moreover, caches take up a large part of the chip area in modern microprocessors. Recent studies [SKK+02] show that in a near future the soft-error rate in combinational logic will be comparable to that of memory elements, and protecting the entire chip, instead of only the memory elements, will be in the top of the designers to do list. Several works have investigated redundancy techniques to provide soft-errors reliability [SM90, MLS91, Nam83]. ACM SIGARCH Computer Architecture News

2

Control-flow Errors

A control-flow error is a deviation from the program’s normal instruction flow execution. This error can be a result of a fault in a comparison or even a change in the instruction pointer (IP) register due to external interference [ORT+96]. We classify the control-flow errors into two main categories: • Branch-error: when the error occurs in a branch instruction (mistaken branch, or branch to a random address, due to an error in the branch flag or in the target address). Although the error occurs at the branch instruction, it could be caused by instructions executed earlier than the branch instruction, such as instructions that generate the flags which affect the branch instruction. • IP-error: when the error occurs in any place, due to a change in the IP register. IP-errors are very hard to cover with software based control-flow reliability techniques. For example, take the instruction: a=a+c. If after executing this instruction the IP turns back to the same instruction and executes it again, the fault generates an error, but we cannot detect it. Therefore, we will assume that the IP register is reliable, and concentrate our efforts on branch-errors. When a fault occurs in a branch instruction, we have the following possibilities: • Mistaken branch: when it is supposed to jump, but falls through (or vice-versa). • Random target: when it jumps to an address other than the jump target or the next instruction (fall through). o Same-BB-branch: When the execution jumps to the same basic block (backwards). 15

Vol. 33, No. 5, December 2005

ƒ BB-Beginning: When the execution jumps to the beginning of the basic block; ƒ BB-Middle: When the execution jumps to the middle of the basic block; o Other-BB-branch: When the execution jumps to another basic block. ƒ BB-Beginning: When the execution jumps to the beginning of the basic block; ƒ BB-Middle: When the execution jumps to the middle of the basic block; o Outside-code-branch: When the execution jumps to a region of memory that does not have code. We classify the branch-errors in the following categories: A – Mistaken branch; B – Random target – same BB – Beginning; C – Random target – same BB – Middle; D – Random target – other BB – Beginning; E – Random target – other BB – Middle; F – Outside code target. Figure 1 shows the branch-errors categories in a controlflow graph. The solid lines are correct control-flows and the dashed ones represent the different categories of branch errors.

signatures and to detect the faults in the control flow. The PC’ always contains the signature for the currently executing block. Upon entry to any block, the PC’ is xor’ed with a statically determined constant to transform the previous block’s signature into the current block’s signature. After the transformation, the PC’ can be compared to the basic block signature. The CFCSS technique requires that predecessor blocks have the same signature. Due to this restriction, the technique cannot detect errors in categories A, D, and E if the correct and the wrong target have the same signature. Errors in category C are not detected. The ECF (enhanced control-flow checking), proposed by Reis et al. [RCV+05b], extends the CFCSS technique with a run-time adjusting signature. The extension fix the CFCSS technique restrictions of detecting faults in categories A, D and E, but still cannot detect faults in category C. We propose two new control-flow checking techniques: the EdgCF (Edge control-flow) and RCF (Region based control-flow) checking techniques.

3.1

A

C

B

D

F

BB1 (L1) ... 5: xor PC’, L1_to_L2 6: cmp ... 7: jmp BB2

E

BB2 (L2)

Figure 1 - Branch-errors categories. The faults in category F generate invalid instruction exceptions or invalid memory access in protected memory systems. Therefore, we consider the faults in this category to be automatically covered by all the control-flow checking techniques.

3

1: 2: 3: 4:

cmp jnz add sub

PC’, L2 .fault ... ...

Figure 2 – PC’ update and branch fault example. Notice that the example in Figure 2 still does not detect faults that jump to the middle of the correct target basic block. If, due to a fault, instruction 7 branches directly to instruction 4, the execution skip instruction 3, and the code does not detect the fault. The undetected fault in Figure 2 occurs because the control-flow jumps between two points that have the same signature (L2). In order to detect this kind of fault we also update the signature in the beginning of the basic block. Figure 3 shows the EdgCF technique updating PC’ in the beginning of the basic blocks. The EdgCF technique modifies PC’ so that between basic blocks (in the control-flow edges) PC’ contains the correct next basic block signature, and in the middle of the basic blocks it contains zero. The technique is able to detect the fault in Figure 2. Although this fault skips the checking code in BB2, the PC’ value will also be wrong in the next basic block, and the next checking code will detect the fault.

Control-flow Checking Techniques

Control-flow checking is usually performed through signature monitoring. The basic idea is to detect errors by comparing a run-time signature with a pre-computed one. Although some works have used hardware to assist the checking, we focus on the software-only techniques. Alkhalifa et al. [ANKA99] proposed the Enhanced Control-flow Checking using Assertions (ECCA). This technique modifies the source code inserting a test assertion in the beginning of every basic block to check the signature, and a set assignment in the end to update the signature to the next basic blocks. The technique cannot detect control-flow errors on categories A and C. The control-flow checking by software signatures [OSM02] (CFCSS) assigns a signature to each basic block, and uses a general purpose register (PC’) to track these ACM SIGARCH Computer Architecture News

The EdgCF Checking Technique

The main idea behind the Edge Control-Flow (EdgCF) Checking Technique is to update the PC’ register with the next basic block signature in the end of the current basic block and check it in the beginning of the next one. Figure 2 shows an example of PC’ being updated and checked. Instruction 5 updates the PC’ with the next basic block signature. L1_to_L2 is a constant that when xor’ed with L1 generates L2. Instructions 1 and 2 check PC’.

16

Vol. 33, No. 5, December 2005

approach to check the signature, the “div” instruction is very expensive, and the performance overhead would be prohibitive. We also experienced a performance overhead when using the instruction “cmov” to update the signature in the end of the basic block. To overcome these problems, we proposed the Region based Control-Flow technique.

BB1 (L1) 1: xor 2: cmp 3: jnz ... 5: xor 6: cmp 7: jmp

PC’, L1 PC’, 0 .fault PC’, L1_to_L2 ... BB2

BB2 (L2)

3.2

1: xor 2: cmp 3: jnz 4: add 5: sub ...

The Region based Control-Flow (RCF) Checking Technique attributes signatures to regions, instead of basic blocks. A region is a small sequence of instructions; therefore, a basic block can have many regions. The PC’ register hold the current region signature, and in the end of each region, the technique updates PC’ accordingly to the next region. Figure 5 shows a basic block with regions. The "xor" instructions change the signature at the end of each region.

PC’, L2 PC’, 0 .fault ... ...

Figure 3 - PC' updated in the beginning of the basic blocks. To update PC’ in the end of the basic block, we need to consider the following cases: • If the basic block has only one successor: We insert one instruction to transform the current value of PC’ (zero) to the new value (the successor basic block signature). • If the basic block has a conditional branch: We use conditional instructions (such as predicated instructions, or conditional branches) to update the signature accordingly to the next basic blocks. • If the basic block has a dynamic branch, such as indirect jumps/calls, or a return instruction: We generate code to get the dynamic target address and map it to the target basic block signature. To avoid the cost of mapping the address to the signature, we use the address of the first instruction in a basic block as the basic block signature. This is very convenient, since in this way we always have unique signatures, and the address to signature mapping has no cost. Figure 4 shows an example of a basic block with a conditional branch. Instructions 6 to 10 update the PC’ (using the conditional move instruction-“cmov”) to the next basic block signature (L1 or L2) accordingly to the branch condition.

R1E R1 R3E

R2E/R3E

xor EDR, L0 test EDR, EDR jnz .fault cmp R1’, R2’ mov AUX, EDR xor EDR, L1 xor AUX, L2 cmovle EDR, AUX cmp R1, R2 jle L2

Figure 4 – Basic block with a conditional branch. The examples in Figure 4 use a branch instruction to check the signature. This instruction is a new potential source of branch-errors, but the EdgCF and the previous techniques do not detect these faults. The ECCA [ANKA99] technique uses the “div” instruction to check the signature. The technique generates two numbers and divides both so that if the signature is wrong, a division by zero occurs. The divide by zero exception handler is modified to detect if the exception is a control-flow error or a pure divide by zero exception. Although the EdgCF technique can use the same ACM SIGARCH Computer Architecture News

BB1 (L1) 1: xor PC’, R1E_to_R1 ... 4: xor PC’, R1_to_R3E 5: cmp R1’, R2’ 6: jcc .BB1_sig 7: xor PC’, R3E_to_R2E .BB1_sig: 8: cmp R1, R2 9: jcc BB3

Figure 5 - Fragment of a program with regions. Region R2E/R3E means that R2E and R3E are valid signatures. Region R1 comprises the original basic block instructions (other than the branch). Region R1E was attributed to the basic block entrance. We could assign a region for each instruction, but the performance cost and code footprint size would be prohibitive. Instructions 6 to 9 are used to update the PC’ with the next two basic blocks signatures (R2E and R3E). Notice that after the signature update in BB1 there is a region with two possible signatures (R2E or R3E). If, due to a fault, the control-flow jumps between two regions with different signatures, the PC’ will not match the current (wrong) region signature, and the same happens to each update to PC’. Since we are considering only one fault model, we could check the control-flow correctness only in the end of the program (or functions). Figure 5 has an issue in basic block BB1. If the instruction 6 (used to update the PC’) jumps directly to the basic block with signature RE3, a fault occurs, but the signature is correct. Therefore we are not able to detect the fault. This happens because the branch instruction and the target have the same region signature. To fix this issue we create a new region for the update signature code in BB1. Figure 6 shows the fixed code. Notice that there is a region with a weird signature (R1U/*). It means that there are two correct signatures in this region: R1U and “R1U xor R2E xor R3E”. This signature is generated due to the inversion of the update signature instructions ("xor"). The R3E_to_R2E transformation was applied before the R1U_to_R3E one, therefore the intermediate result is a value corresponding to “R1U xor R2E xor R3E”.

BB1 (L0) 1 : 2 : 3 : ... 6 : 7 : 8 : 9 : 10: 11: 12

The RCF Checking Technique

17

Vol. 33, No. 5, December 2005

R1E R1 R1U R1U/* R2E/R3E

BB1 Program Binary Code

1 : xor PC’, R1E_to_R1 ... ... 6 : xor PC’, R1_to_R1U 7 : cmp R1’, R2’ 8 : jcc .BB1_sig 9 : xor PC’, R3E_to_R2E .BB1_sig: 10: xor PC’, R1U_to_R3E 11: cmp R1, R2 12: jcc BB3 * = R1U xor R2E xor

StarDBT Run Time

Code Cache Back End

Figure 6 - Region based Control Flow technique. Figure 7 shows how to check the control-flow using a branch instruction. Notice that a new region was created, so that we can detect faults that occur in the branch instruction used to check the control-flow.

OS

Figure 8 - StarDBT overall structure. The Frontend module manages the whole program execution for dynamic binary translation. It dynamically recognizes the original program instructions, translates them into instructions in the code cache using different dynamic binary translation techniques, and controls the code execution in the code cache. For the system related features in the program, it interacts with the Runtime module to get system support. To provide optimization to the dynamic binary translation, the Frontend module also collects program profiling information during the code execution and selects hot traces based on the profiling information for run-time optimization. The Backend module performs run-time optimization for the dynamic binary translations. It builds an intermediate representation (IR) from the hot traces selected by the Frontend module. It then performs optimizations on the IR, and finally generates optimized code into the code cache to improve performance. In our experiments, we configured the StarDBT to translate IA32 [IA-32] to EM64T [EM64T] binary code. The binary translation is done on demand, which means that every time a non-translated basic block has to be executed, the StarDBT takes control of the execution and translate the basic block. Therefore, only executed blocks are translated. In order to implement the control-flow checking techniques, we insert instructions to check and update the signature in every translated basic block. Due to the translation on demand scheme, the CFG generation is expensive; therefore we do not implement the techniques that need the CFG to attribute signatures to the basic blocks, such as CFCSS and ECCA. The ECF, EdgCF and RCF techniques were implemented using the address of the first instruction in each basic block (region) as the signature number.

R1E 1 : xor PC’, R1E_to_R1 R1 ... BB1 R1E PC’, R1E_to_R1C PC’, R1C .fault_detected PC’, R1C_to_R1

Figure 7 - Transformation to check the signature in the RCF technique. We update PC’ in the end of basic blocks using the same approach in EdgCF technique. Notice that this approach updates the signature three times in each basic block.

4

Implementation

We implemented the control-flow checking techniques in our dynamic binary translator, the StarDBT. StarDBT is a dynamic binary translation system developed at Intel Programming System Lab. It provides a flexible and efficient infrastructure for researches on different dynamic binary translation techniques. The overall structure of StarDBT is shown in Figure 8. StarDBT runs on top of OS as a user-level run-time system. The program binary code is dynamically translated and stored into the code cache. Then the translated code can be executed under the control of StarDBT, which allows us to apply different dynamic binary translation techniques to the code, such as compatibility support, security checking, reliability enforcement, performance improvement, etc. StarDBT consists of three individual modules: the Runtime module, the Frontend module and the Backend module. The Runtime module provides the system supports for StarDBT. It automatically loads the original program code into memory and initializes the program execution context at program startup. To facilitate the program execution, it provides all OS interfaces such as I/O and system calls. The Runtime module also handles all the system events such as OS call-backs, exception, dynamic library load, code selfmodification, etc.

ACM SIGARCH Computer Architecture News

Data Flow

Control Flow

BB1

1: xor R1C 2: cmp 3: jne 4: xor R1 ...

Front End

4.1

EM64T Architecture Issues

The control-flow checking techniques require dedicated registers for PC’ and RTS. Since the EM64T architecture have more registers than the IA32 one, we do not need to spill registers to implement PC’ and RTS.

18

Vol. 33, No. 5, December 2005

2.500 2.000 1.500 1.000 0.500

RCF Slow dow n

EdgCF Slow dow n

ll na ea om ge

zi p 4. g

17

16

5. vp 17 r 6. gc c 18 1. m cf 18 6. cr af 19 ty 7. pa rs er 25 2. e 25 on 3. pe rl b m 25 k 4. ga 25 p 5. vo r te x 25 6. bz ip 2 30 0. tw ge ol f om ea nint

16

8. w up

wi se 17 1. sw im 17 2. m gr id 17 3. ap pl u 17 7. m es 17 a 8. ga lg el 17 9. ar 18 t 3. eq ua 18 ke 7. fa ce re c 18 8. am m p 18 9. lu ca 19 s 1. fm a3 20 d 0. si xt ra ck 30 1. ap ge si om ea nfp

0.000

ECF Slow dow n

Figure 10 - Performance slow down for the RCF, EdgCF and ECF techniques. When modifying the binary, we have to make sure that the Since the RCF technique requires the signature to be inserted instructions do not change the program behavior. updated more than twice in each basic block, it inserts more Although the instructions to update the signatures only instructions per basic block than the other techniques. modify registers not used by the original program, the “xor” Therefore, this technique shows worse performance than the instruction implementation in the EM64T architecture other two. modifies the EFLAGS register. Therefore, instead of using The EdgCF and the ECF techniques insert the same the “xor” instruction, we use load effective address (lea). amount of instructions per basic block; however, in the The “lea” instruction does not have side-effects and has current ISA, the former uses cheaper instructions to update performance similar to the “xor” instruction. Moreover it is RTS, which leads to a slight performance difference. a three address instruction, which allows us to save some Although the RCF technique presented the worst instructions in the implementation. Figure 9 shows the performance, it has the best error coverage, since it can usage of the “lea” instruction to update the signature. detect the errors in the branch instructions inserted to update We implemented the signature checking using branch and check the signatures. The EdgCF and ECF techniques instructions. This approach insert unsafe branches in the can use safe instructions (like “cmov” and “div”) to update ECF and EdgCF techniques, but it provides a fair and check the signatures, but these instructions may lead to performance comparison with the RCF technique. a significant performance loss. The signature checking code cannot generate side-effects In our tests, the StarDBT itself was not modified to as well. In order to check the signature without changing the provide reliability. The binary translation time is very small EFLAGS, we use the jump if CX is zero (jcxz) instruction. compared to the application execution time. Therefore, we Figure 9 shows an example of the RCF technique checking estimate that a reliable version of the StarDBT translator the signature. Instruction 1 updates PC’ to R1C region. will not change significantly the total execution time. Instructions 2 and 6 save and restore CX, and instructions 3 6 Conclusion and Future Work to 5 check PC’. We proposed a new control-flow error classification and BB1 R1E two control-flow checking techniques: the EdgCF and the 1: lea PC’, [PC’+R1E_to_R1C] RCF. We also evaluated the techniques using our dynamic 2: mov AUX, ecx 3: lea ecx, [PC’-R1C] binary translator, the StarDBT. 4: jcxz BB1.check The results show that the RCF technique can cover the R1C 5: jmp fault_detected branch-errors even when new branch instructions are BB1.check: 6: mov ecx, AUX inserted to update/check the signature, and the performance ... cost is very close to the other techniques. The current implementation checks the signatures in every Figure 9 – RCF technique checking the signature. basic block. In non real-time safety critical applications it 5 Results can be done less frequently, for example, at every function return or even only at the end of the program. We are Figure 10 shows the performance slowdown for the SPEC investigating these optimizations to improve the techniques 2000 applications when applying the RCF, EdgCF and ECF performance. techniques in the StarDBT. The left part of Figure 10 shows the SPEC floating point benchmarks and the right part 7 References shows the integer ones. The figure also shows the geometric [Bau02] R. Baumann, Soft Errors in Commercial Semiconductor mean for the floating point, the integer, and the entire Technology: Overview and Scaling Trends, IEEE 2002 benchmark. The techniques RCF, EdgCF, and ECF Reliability Physics Symp. Tutorial Notes, Reliability presented an average slowdown of 1.46, 1.41, and 1.39 Fundamentals, IEEE Press, 2002, pp.121-01.1-121-01.14. times, respectively. [AYI+03] H. Ando et al., A 1.3GHz Fifth Generation SPARC64 Notice that the performance slowdown is less dramatic in Microprocessors, Proc. IEEE International Solid-State Circuits the floating point benchmarks. It happens because these Conference. (ISSCC 03), IEEE Press, 2003, pp. 246-247. benchmarks have large basic blocks and/or more timeconsuming instructions (like floating point instructions). ACM SIGARCH Computer Architecture News

19

Vol. 33, No. 5, December 2005

[VHM03] R. Venkatasubramanian, J. P. Hayes, and B. T. Murray. Low-cost on-line fault detection using control flow assertions. 9th IEEE On-Line Testing Symposium, pp.137-143, July 2003. [Con03] C. Constantinescu. Trends and challenges in VLSI circuit reliability. IEEE Micro, vol. 23, pp. 14-19, Jul.-Aug. 2003. [SKK+02] P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi. Modeling the effect of technology trends on the soft error rate of combinational logic. In Proceedings of the 2002 International Conference on Dependable Systems and Networks, pages 389-399, June 2002. [SM90] N. R. Saxena, E. J. McCluskey. Control-flow Checking Using Watchdog Assists and Extended-Precision Checksums. IEEE Transactions on Computers, Vol. 39, No. 4, Apr. 1990, pp. 554-559. [MLS91] T. Michel, R. Leveugle and G. Saucier. A New Approach to Control-flow Checking without Program Modification. Proc. FTCS-21, 1991, pp. 334-341. [Nam83] M. Namjoo. CERBERUS-16: An Architecture for a General Purpose Watchdog Processor. Proc. Symposium on Fault-Tolerant Computing. 1983, pp.216-219.

ACM SIGARCH Computer Architecture News

[ORT+96] T. J. O’Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, I. C. J. Montrose, H. W. Curtis, and J. L. Walsh. Field testing for cosmic ray soft errors in semiconductor memories. In IBM Journal of Research and Development, pages 41-49, January 1996. [ANKA99] Z. Alkhalifa, V. S. S. Nair, N. Krishnamurthy, and J. A. Abraham, Design and evaluation of system-level checks for on-line control-flow error detection, IEEE Trans. Parallel Distrib. Syst., vol. 10, pp. 627-641, June 1999. [OSM02] N. Oh, P. P. Shirvani, and E. J. McCluskey. Controlflow checking by software signatures. IEEE Transactions on Reliability, vo. 51, No 2, pp. 111-122, March 2002. [RCV+05b] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August. SWIFT: Software Implemented Fault Tolerance. Proceedings of the Third International Symposium on Code Generation and Optimization, March 2005. [IA-32] IA-32 Intel@ Architecture Software Developer’s Manual [EM64T] Intel@ Extended Memory 64 Technology Software Developer’s Guide

20

Vol. 33, No. 5, December 2005

Suggest Documents