SHIELD: A Software Hardware Design Methodology for Security and Reliability of MPSoCs

Krutartha Patel, Sri Parameswaran
School of Computer Science and Engineering, The University of New South Wales, Sydney, Australia
{kpatel, sridevan}@cse.unsw.edu.au

ABSTRACT
Security of MPSoCs is an emerging area of concern in embedded systems. Security is jeopardized by code injection attacks, which are the most common type of software attack. Previous attempts to detect code injection in MPSoCs have been burdened with significant performance overheads. In this work, we present "SHIELD", a hardware/software methodology to detect code injection attacks in MPSoCs. SHIELD instruments the software programs running on the application processors in the MPSoC, and extracts control flow and basic block execution time information for runtime checking. We employ a dedicated security processor (the monitor processor) to supervise the application processors on the MPSoC. Custom hardware is designed and used in the monitor and application processors; the monitor processor uses its custom hardware to rapidly analyze information communicated to it from the application processors at runtime. We have implemented SHIELD on a commercial extensible processor (Xtensa LX2) and tested it on a multiprocessor JPEG encoder program. In addition to code injection attacks, the system is also able to detect 83% of bit flip errors in the control flow instructions. The experiments show that SHIELD produces systems whose runtime is at least 9 times faster than the previous solution. SHIELD incurs a runtime (clock cycles) performance overhead of only 6.6% and an area overhead of 26.9%, when compared to a non-secure system.

Categories and Subject Descriptors
B.8.1 [Performance and Reliability]: Reliability, Testing and Fault Tolerance

General Terms
Security, Reliability, Design, Measurement

Keywords
Multiprocessors, Tensilica, Architecture, Bit Flips, Code Injection

1. INTRODUCTION
Multiprocessor System on Chips (MPSoCs) are emerging as the preeminent design solution to increasing demands in functional requirements, low power needs, and programmability. Such systems are particularly useful in computationally intensive multimedia systems [17]. As more and more MPSoCs are deployed in embedded devices, it is necessary to make these systems secure against known attacks. Software attacks (and in particular code injection attacks) are commonly unleashed upon embedded systems. Such attacks do not require any special equipment or sophisticated techniques, unlike physical attacks or side-channel attacks. A detailed explanation of common software attacks (heap attacks, format string vulnerabilities, arc injection, etc.) can be found in [12, 19, 20]. Multiprocessor systems are inevitably more complex, and thus attacks pose significant risks. Very little work has been done on MPSoC security and reliability. Secure and reliable systems always add overheads, and the necessity of a minimal-overhead system becomes critical in an MPSoC, because the system is already large and power hungry. Hence designers have to integrate security features into MPSoC-based embedded systems while keeping the overhead associated with these additional features to a minimum [14].

In this paper, we propose an automated, fast, hardware/software methodology for detecting code injection attacks and transient bit flips in the instruction memory using a dedicated monitor processor. The methodology is implemented as a design flow. To show the efficacy of the methodology, we implement the system on an MPSoC architecture using a commercial processor (Xtensa LX2) and a practical multiprocessor application benchmark (JPEG encoder). The system is briefly described here. The MPSoC system is connected via FIFO queues to a monitoring processor. The code running in the application processors is instrumented, and the time taken by each basic block (a max and a min time is given for each basic block, to allow for cache misses) is stored in the monitoring processor. The monitor processor also contains the control flow map of the programs in all the application processors. Thus, if the application program takes more time than it should, or takes an unintended path, the monitor processor raises an interrupt which alerts the application processors. In addition to code injection attacks, the system also tackles some reliability issues caused by control flow errors (CFEs). According to the reliability studies by Schuette et al. [16] and Ohlsson et al. [10], between 33% and 77% of all transient faults correspond to CFEs, which may be caused by transient bit flips. The methodology targets Tensilica's Xtensa processing system. In this system the designer has access to neither the Program Counter, the Instruction Register, nor the Hardware Description. However, the designer is able to add instructions, and to use a special feature called Ports and Queues to implement FIFO buffers for communication between processors. We exploit these features to implement the security measures in the system.

The remainder of the paper is organized as follows. A summary of related work is presented in Section 2. Section 3 describes the software and hardware design flows, while Section 4 describes the systematic methodology of using SHIELD to generate a secure MPSoC. Results are presented in Section 5 and the paper is concluded in Section 6.

2. RELATED WORK
Countermeasures for code injection attacks have not been researched extensively in the multiprocessor domain, but countermeasures do exist in the single processor domain with which comparisons can be made. We evaluate the scalability of the countermeasures from the single processor domain for detecting code injection attacks on a commercial, extensible processor in the multiprocessor domain. The countermeasures for code injection attacks can be classified into static and dynamic techniques. Static techniques try to eradicate the vulnerabilities in the code during compilation, while dynamic techniques detect attacks at runtime. A number of static analysis techniques have been proposed in [2, 3, 5, 19]. A dynamic code analysis tool proposed in [6] helps protect against invalid array accesses. CCured, proposed in [9], uses both static and dynamic analysis to check that pointers are safe and cannot cause the memory errors which could potentially be exploited for code injection attacks. In terms of hardware-based countermeasures, Milenkovic et al. [8] propose a method to fetch instructions from memory using a signature verification unit. Ragel et al. [13] also propose a method for basic block validation using microinstructions.

Static analysis tools like StackGuard [2] aim only to solve buffer overflow problems, and may cause portability issues. Static analyzers are also known to raise a number of false positives and false negatives. The dynamic code analysis approaches in [6] and [9] incur high runtime overheads of up to 220% and up to 150% respectively. Many of the solutions proposed using hardware-assisted techniques require major hardware modifications which are not possible in all commercial processors, such as Tensilica's Xtensa LX2. The Xtensa LX2 can be extended only to a certain extent using the Tensilica Instruction Extension (TIE) language. The approach in [8] requires architecture modification to allow interception of fetched instructions, while the approach in [13] requires modification of the microinstructions. The hardware techniques proposed in [1, 7] to detect code injection attacks, or to perform fast cryptographic operations [4], also face similar limitations of needing significant architectural modifications. Thus the existing single processor hardware techniques cannot be applied in the MPSoC domain on commercial processors like the Xtensa LX2 and similar, where the processor hardware description is unavailable. In contrast to the case study in [11], the work described in this paper is a general methodology (with the associated design flow) which targets security and some reliability aspects of MPSoCs, has much faster detection due to the additional customized hardware, and is easier to implement due to automation. To the best of our knowledge, in the MPSoC domain, our work is the first to address both code injection attacks and reliability together using a single approach.

2.1 Contributions and Limitations
The main contributions of this paper can be summarized as follows. For the first time, a systematic methodology is proposed to detect code injection attacks in MPSoCs. This methodology automates the process required to achieve an instrumented binary and a secure MPSoC architecture. This is also the first time a fast, hardware-oriented detection mechanism is proposed for code injection attacks at runtime in an MPSoC architecture.

The limitations of our work can be outlined as follows. Our approach relies on obtaining the minimum (min) and maximum (max) times for basic blocks. The min and max times are unavailable for basic blocks that do not fall on the execution path, and have to be estimated as described in subsection 4.4. Basic blocks employing rendezvous communication with other processors may often have their max time violated; hence the monitoring processor may raise a false alarm for these blocks. Data corruption is not detected by our approach: bit flips in, as well as corruption of, the data memory are not covered. Finally, the target of CFIs employing register indirect addressing must be deterministic at compile time.


3. SYSTEM DESIGN
The security system incorporates both custom hardware and software instrumentation, as shown in Figure 1. Our design flow, SHIELD, is used to automate the software instrumentation of the source code, and to identify the hardware customizations required to implement security features on an MPSoC. The inputs to SHIELD are the C/C++ source files and an architectural template. The architectural template contains information regarding the number of processors and their configurations; the configurations also contain a random secure key built into the hardware. SHIELD assigns each processor core in the architectural template a unique processor ID, pId. The output of SHIELD is the instrumented binaries, generated from the partitioned source, ready to be loaded on the customized MPSoC.

Figure 1: The hardware and software flow of the proposed design.

3.1 Software Design
As shown in Figure 1, the system inputs are the processor architectural template and an already partitioned software program. The program is compiled to generate assembly code for the target instruction set architecture (ISA). The software flow involves SHIELD instrumenting the assembly file of the source program. The instrumentation is automated and involves basic block division, generating the control flow map, and obtaining the timing of basic blocks. Following instrumentation, the application programs are assembled and linked to obtain executable files, which are then loaded into the instruction memory of the individual processors of the MPSoC.

3.2 Hardware Design
The partitioned program and the architectural template are used to design custom hardware for the monitor processor, as shown in Figure 1. The custom hardware for the monitor processor includes storage tables, TS FIFO queues, Single Instruction Multiple Data (SIMD) units and a custom register file. SHIELD's software instrumentation accurately predicts the length and word size of the storage tables needed to store the control flow map and timing information. For an MPSoC with N application processors to be monitored, we need N TS FIFO queues and N SIMD units. The SIMD units are hardware blocks initiated by a single instruction to perform operations on data from all the application processors. A custom register file is used for fast storage and retrieval of frequently used data. A "Secure Loader" loads the binaries onto the MPSoC, encrypting the special instructions inserted during software instrumentation. SHIELD uses a secure loader that is equipped with the same random hardware key that is built into the configurations of the architectural template. Every architectural template, as well as the secure loader that goes with it, is built with a different random hardware key. A sketch of the loader's encryption step appears below.
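To make the loader step concrete, the following C fragment sketches one way such an encryption pass could work. The paper does not disclose the instruction encoding or the cipher, so the opcode values, the 16-bit immediate field, the XOR "encryption" and all names below are purely illustrative placeholders, not the actual implementation.

    /* Illustrative sketch of the secure-loader step: walk the instruction
     * words of a binary image and encrypt the immediate of every
     * inform/confirm instruction with the per-device hardware key.
     * Opcodes, field layout and the XOR cipher are hypothetical. */
    #include <stddef.h>
    #include <stdint.h>

    #define OP_INFORM  0x3Au   /* hypothetical opcodes for the two */
    #define OP_CONFIRM 0x3Bu   /* custom TIE instructions          */

    static void secure_load(uint32_t *image, size_t nwords, uint16_t hw_key)
    {
        for (size_t i = 0; i < nwords; i++) {
            uint32_t op = image[i] >> 24;              /* assumed layout  */
            if (op == OP_INFORM || op == OP_CONFIRM) {
                uint16_t imm = (uint16_t)(image[i] & 0xFFFFu);
                imm ^= hw_key;                         /* stand-in cipher */
                image[i] = (image[i] & 0xFFFF0000u) | imm;
            }
        }
    }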

4. METHODOLOGY
We discuss the architectural template first, and then explain aspects of SHIELD such as the automatic software instrumentation and the monitoring hardware design. Finally, we explain the admissible behavior of an application that has been instrumented with SHIELD.

4.1 Architectural Template
The architecture proposed in Figure 2(a) is a generic MPSoC system with N application processors and one additional monitor processor. The monitor processor is used to supervise the N application processors. The N application processors can run programs independently, as shown in Figure 2(a), or as a pipeline of processors communicating amongst themselves. Because this is a multiprocessor application, the key communication feature employed is the FIFO queue. A timestamping FIFO (TS FIFO) is used for communication between an application processor and the monitor processor. The TS FIFO, shown in Figure 2(b), attaches a timestamp to every input it receives. The timestamp is taken from the clock cycle count register (CCOUNT) of the processor pushing the entry into the TS FIFO. The queue stalls on a read from an empty queue and on a write to a full queue, using the Empty and Full signals shown in Figure 2(b) respectively.

Figure 2: (a) An MPSoC system with a Monitor (b) Monitoring via TS FIFO.
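As an illustration of the TS FIFO of Figure 2(b), the following C fragment models its behaviour. In hardware the Full and Empty signals stall the producer and consumer; the sketch reports failure instead. The depth, the types and all names are our assumptions, not part of the paper.

    /* Behavioural model (for illustration only) of the timestamping FIFO:
     * every pushed entry is tagged with the pushing processor's cycle
     * count (CCOUNT). */
    #include <stdbool.h>
    #include <stdint.h>

    #define TS_FIFO_DEPTH 16            /* assumed depth */

    typedef struct {
        uint32_t data;                  /* encrypted pId_bId from inform/confirm */
        uint32_t timestamp;             /* CCOUNT of the pushing processor       */
    } ts_entry;

    typedef struct {
        ts_entry buf[TS_FIFO_DEPTH];
        unsigned head, tail, count;
    } ts_fifo;

    static bool ts_fifo_push(ts_fifo *q, uint32_t data, uint32_t ccount)
    {
        if (q->count == TS_FIFO_DEPTH) return false;  /* Full: producer stalls  */
        q->buf[q->tail] = (ts_entry){ data, ccount }; /* attach the timestamp   */
        q->tail = (q->tail + 1) % TS_FIFO_DEPTH;
        q->count++;
        return true;
    }

    static bool ts_fifo_pop(ts_fifo *q, ts_entry *out)
    {
        if (q->count == 0) return false;              /* Empty: consumer stalls */
        *out = q->buf[q->head];
        q->head = (q->head + 1) % TS_FIFO_DEPTH;
        q->count--;
        return true;
    }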

4.2 Basic Block Division
We define a basic block (BB) as a collection of sequential instructions that starts immediately after a control flow instruction (CFI) and ends at a CFI or a label. In Figure 3(a), starting at the label L3, the first basic block finishes when the jump instruction j is seen. In Figure 3(b), the xor, mov and j instructions make up one basic block, enclosed by the two special instructions inform and confirm.

    L3:  ...                          L3:  ...
         xor  a5, a4, a3                   inform   5940
         mov  a4, a7                       xor      a5, a4, a3
         j    L4                           mov      a4, a7
    L4:  add  a3, a3, a5                   confirm  5940
         ...                               j        L4
                                      L4:  inform   4125
                                           add      a3, a3, a5
                                           confirm  4125
                                           ...
            (a)                                (b)

Figure 3: (a) Extract of an assembly program (b) The basic block division by SHIELD.

The inform and confirm instructions are custom hardware instructions defined using the Tensilica Instruction Extension (TIE) language. The instruction inform is inserted at the start of each basic block, and confirm is inserted at the end of the basic block, or just before the CFI if the basic block ends in a CFI. In Figure 3(b), the number 5940 in the inform and confirm instructions is an encrypted pId_bId, i.e., a combination of the processor ID pId and the block ID bId, encrypted using a secure key. The inform instruction registers with the monitor processor which block on a particular processor is currently being executed, and the confirm instruction signals the end of the executing basic block to the monitor processor.
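The following C sketch illustrates how an immediate such as 5940 could be formed. The paper does not specify the field widths of pId and bId or the cipher used, so the 4-bit/12-bit split and the XOR with the hardware key below are illustrative assumptions only.

    /* Hypothetical packing and encryption of pId_bId; the real field
     * widths and cipher are not disclosed in the paper. */
    #include <stdint.h>

    static uint16_t encode_pid_bid(uint8_t pid, uint16_t bid, uint16_t hw_key)
    {
        uint16_t plain = (uint16_t)(((pid & 0xFu) << 12) | (bid & 0x0FFFu));
        return plain ^ hw_key;               /* stand-in for the encryption */
    }

    static void decode_pid_bid(uint16_t enc, uint16_t hw_key,
                               uint8_t *pid, uint16_t *bid)
    {
        uint16_t plain = enc ^ hw_key;       /* monitor inverts the cipher  */
        *pid = (uint8_t)(plain >> 12);
        *bid = plain & 0x0FFFu;
    }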


4.3 Control Flow Extraction
A control flow map of a program is a layout of all the possible transitions each basic block in the program can take. Consistent with the definition of a basic block in subsection 4.2, most basic blocks have one or two possible transitions; the exception is a basic block terminated by a return instruction, which has possible transitions to all basic blocks that called the current function or label. SHIELD automatically analyzes the instrumented assembly file to generate a control flow map of the program. The generated control flow map is then stored in the monitor processor's hardware for runtime verification.
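A software view of such a control flow map, and the successor check the monitor would perform against it, might look as follows in C. The table sizes and layout are assumptions for illustration; the actual map resides in custom hardware tables.

    /* Assumed representation: for each basic block, the set of legal
     * successor block IDs. */
    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_BLOCKS     256
    #define MAX_SUCCESSORS 8      /* return blocks may need several entries */

    typedef struct {
        uint16_t succ[MAX_SUCCESSORS];
        unsigned nsucc;
    } cf_entry;

    static cf_entry cf_map[MAX_BLOCKS];   /* filled in by SHIELD's analysis */

    /* Runtime CF check: is next_bId a legal successor of prev_bId? */
    static bool cf_check(uint16_t prev_bId, uint16_t next_bId)
    {
        const cf_entry *e = &cf_map[prev_bId];
        for (unsigned i = 0; i < e->nsucc; i++)
            if (e->succ[i] == next_bId)
                return true;
        return false;                 /* would raise ERR CF in hardware */
    }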

4.4 Basic Block Timing
After instrumentation by SHIELD, the MPSoC system with all the application processors is executed, producing a trace file for each of the processors. A trace file shows each instruction that was executed on the processor and the number of clock cycles taken by each instruction. Since each basic block has already been assigned a block ID and is enclosed between inform and confirm instructions, the execution time of a particular basic block is the number of clock cycles between its inform and confirm instructions. Some basic blocks are executed many times, and their execution times differ because our MPSoC system has a cache; a cache hit or a miss can make a significant difference to the execution time. We therefore analyze the execution trace file and, for each basic block, obtain the min and max times and store them in the monitor processor's hardware tables. A small number of basic blocks, however, may not fall on the execution path. For such basic blocks, a theoretical worst case execution time (WCET) can be estimated using the processor's ISA or by looking at the WCET of similar instructions.
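The trace analysis can be pictured as the following C fragment: every inform/confirm pair observed in the trace updates the block's min and max bounds. The record format here is a hypothetical stand-in for Tensilica's actual trace output.

    /* Derive per-block min/max execution times from trace observations. */
    #include <stdint.h>

    #define MAX_BLOCKS 256

    typedef struct { uint32_t tmin, tmax; } bb_bounds;

    static bb_bounds bounds[MAX_BLOCKS];

    static void init_bounds(void)
    {
        for (int i = 0; i < MAX_BLOCKS; i++)
            bounds[i] = (bb_bounds){ UINT32_MAX, 0 };  /* empty interval */
    }

    /* One inform/confirm pair seen in the trace for block bId. */
    static void record_execution(uint16_t bId,
                                 uint32_t t_inform, uint32_t t_confirm)
    {
        uint32_t dt = t_confirm - t_inform;    /* cycles spent in the block */
        if (dt < bounds[bId].tmin) bounds[bId].tmin = dt;  /* e.g. cache hits   */
        if (dt > bounds[bId].tmax) bounds[bId].tmax = dt;  /* e.g. cache misses */
    }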

4.5 Runtime Monitoring
The basic blocks in the application processors continuously communicate with the monitor processor. The program map and timing information obtained through SHIELD is already in the hardware storage tables of the monitor processor. At runtime, the monitor processor checks the information coming through the TS FIFOs from the application processors against the information in the storage tables, using the customized monitoring hardware discussed later. The algorithm used by the monitor processor is shown in Algorithm 1.

Algorithm 1 Initiation of the monitoring hardware
    Initialize finished = 0;
    while ((finished == 0) AND (status == 0)) do
        for j = 1 to N do
            if (TS_FIFO_Pj not EMPTY) then
                Read and Decrypt TS_FIFO_Pj Information
                AUDIT(status, finished);

Algorithm 1 first obtains the TS FIFO information from each of the N application processors and decrypts it. The AUDIT instruction then initiates the N SIMD units to process up to N blocks of data received from the N application processors. The AUDIT instruction updates status to 1 if any of the checks performed by the SIMD units is violated, and updates finished to 1 at the end of the application.

Figure 4: Architecture of the SIMD Unit.

Figure 4 shows the hardware blocks inside each SIMD unit. The runtime checks performed in hardware by the monitor processor comprise two main tasks: checking of the control flow (CF) and checking of the timing (TI). A sequence check of the inform and confirm instructions also aids the checking of CF and TI. Since the top of each basic block has been instrumented with an inform instruction (code=0) and the end with a confirm instruction (code=1), the monitor processor expects these instructions in code=0,1,0,1,... order. To ensure this, the monitor processor keeps a record of the previously communicated special instruction in Prev Code, as shown in Figure 4. The top shaded part of Figure 4 performs the TI checks and also checks that the previously communicated special instruction was an inform instruction; the result is an ERR TI signal, shown as the output of the top OR gate in Figure 4. The bottom shaded part of Figure 4 performs the CF check and also checks that the previously communicated special instruction was a confirm instruction; the result is an ERR CF signal, shown as the output of the bottom OR gate in Figure 4. The curr_time and curr_bId wires drawn from the TS_FIFO Data block refer to the time and block ID information of the TS_FIFO data currently being processed. N fail signals, for the N SIMD units, are computed simultaneously. If any of the signals is set to 1, the application processors are interrupted and the multiprocessor program exits. The code values 0 and 1 from the TS_FIFO data are used to select the output signal from the control flow checker and the timing checker respectively. A code value of 2 indicates that the TS_FIFO data is not ready yet, and a code value of 3 indicates that the application has finished; hence the code values 2 and 3 select input 0 of the Mux. A software rendering of these checks is sketched below.
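The following C fragment is our reading of Figure 4, not the actual RTL: code 0 (inform) triggers the CF check, code 1 (confirm) triggers the TI check on the timestamp difference tf - ts, and codes 2 and 3 produce no fail signal. cf_check and the timing-bound lookups are the assumed helpers from the sketches in subsections 4.3 and 4.4.

    /* Software rendering of one SIMD unit's checks; all structure and
     * names are illustrative assumptions. */
    #include <stdbool.h>
    #include <stdint.h>

    extern bool     cf_check(uint16_t prev_bId, uint16_t next_bId);
    extern uint32_t tmin_of(uint16_t bId);
    extern uint32_t tmax_of(uint16_t bId);

    typedef struct {
        uint32_t start_time;   /* ts: timestamp of the last inform        */
        uint16_t prev_bId;
        uint8_t  prev_code;    /* expects 0,1,0,1,...; initialise to 1 so
                                * the first inform passes the check       */
    } simd_state;

    static bool simd_fail(simd_state *s, uint8_t code,
                          uint16_t curr_bId, uint32_t curr_time)
    {
        bool fail = false;
        if (code == 0) {                      /* inform: CF check          */
            fail = (s->prev_code != 1) ||     /* previous must be confirm  */
                   !cf_check(s->prev_bId, curr_bId);         /* ERR CF     */
            s->start_time = curr_time;        /* ts                        */
        } else if (code == 1) {               /* confirm: TI check         */
            uint32_t dt = curr_time - s->start_time;         /* tf - ts    */
            fail = (s->prev_code != 0) ||     /* previous must be inform   */
                   dt < tmin_of(curr_bId) ||
                   dt > tmax_of(curr_bId);                   /* ERR TI     */
        }                                     /* codes 2 and 3: no check   */
        s->prev_code = code;
        s->prev_bId  = curr_bId;
        return fail;          /* a 1 interrupts all application processors */
    }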

4.6 Admissible Application Behavior
We classify the inform and confirm instructions as boundary instructions (BIs) and all other instructions as non-BIs. We also categorize the branch, jump, call and return instructions as CFIs. Table 1 shows a list of possible code integrity violations in the first column, resulting from code injection or code corruption. The second column shows the original instructions, and the third column the modified or corrupted instructions. The corresponding error signals generated by the monitor processor in our MPSoC are in the final column.

    Attack Type   Original          Modified            Error
    A             non-BI            added more non-BI   ERR TI
    B             non-BI            inform              ERR CF
    C             non-BI            confirm             ERR TI
    D             non-BI            CFI                 ERR CF
    E             CFI               another CFI         ERR CF / ISS (a)
    F             CFI               modify target       ERR CF
    G             CFI               non-BI              ERR CF (b)
    H             CFI               BI                  ERR CF/TI
    I             inform/confirm    modify pId_bId      ERR CF/TI
    J             inform/confirm    CFI                 ERR CF
    K             inform/confirm    non-BI              ERR CF/TI
    L             inform            confirm             ERR TI
    M             confirm           inform              ERR CF
    N             entire BB         another BB          ERR CF/TI

    (a) Attack not detected if only the opcode is changed.
    (b) Attack not detected if the branch instruction is changed to a non-BI.

Table 1: Types of code integrity violations and error signals

Error signals ERR TI and ERR CF are only reported when the inform or confirm BIs are executed. Type A attacks cause the attacked basic block's runtime to exceed its max limit and hence generate an ERR TI. In attacks of type B and M, the monitor processor will register two inform BIs consecutively, which generates an ERR CF. Similarly, type C and L attacks cause the monitor to register two confirm BIs consecutively and hence generate an ERR TI. In type D attacks, the CFI causes a premature switch in the control flow to another block; since every block starts with an inform BI, the monitor processor registers two inform BIs consecutively and hence generates an ERR CF. Attacks of type E generate an ERR CF if the modified CFI's target is different such that it violates the control flow check; however, if only the opcode is changed, leaving the target unchanged, the attack will not be detected. Type F attacks generate an ERR CF if the modified target of the CFI is not on a valid control flow path from the current basic block. Type G attacks generate an ERR CF if the original CFI was supposed to cause a change in the control flow to a basic block other than the one following the current CFI. The inform BI in a type H attack generates an ERR CF, because the decryption of the pId_bId numbers results in an incorrect control flow (assuming that the attacker does not know about the numbers being encrypted or the key); similarly, the confirm BI generates an ERR TI due to a second consecutive confirm BI being received by the monitor. An attack of type I causes an ERR CF, since modifying the pId_bId of an inform BI fails the control flow check; in the case of the confirm BI, a corrupted pId_bId means that incorrect timing values are fetched from the storage tables for the timing checks, thus generating an ERR TI. An attack of type J skips the current basic block; if the transition from the previous basic block to the next basic block is an invalid control flow, an ERR CF is generated. Attack K depicts a situation similar to attacks B and C, where consecutive inform or confirm BIs are encountered by the monitor. In an attack of type N, if the inserted foreign block uses the inform or confirm BIs (interfacing instructions), the attack will be detected.


However, without the interfacing instructions, the attack is only detected when the next inform or confirm BI is executed. Any attack that relies on the corruption of pId_bId numbers or on the insertion of BIs is prevented through encryption: it is infeasible for an attacker to learn the hardware key in the processor, and hence infeasible to generate BIs with correct control flow. We do not consider physical attacks in this paper; but even if an attacker managed to obtain the hardware key through physical or side-channel attacks, mass attacks would still be impossible, since each processor has a different random hardware key.

5. EXPERIMENTAL SETUP AND RESULTS
We used the commercial Xtensa LX2 processor for both the application processors and the monitor processor. The processor offers extensible features that can be synthesized in addition to the base core, which has 80 instructions and 64 general-purpose registers [15]. We obtained a six-processor JPEG encoder benchmark produced by Shee et al. [18] (using Tensilica's toolset) for our experimentation. The partitioned JPEG program (a pipeline of processors) was mapped onto the Xtensa LX2 using the design flow shown in Figure 1. The six processors were responsible for file reading, RGB conversion, DCT, quantization, Huffman encoding and writing to file, respectively. The runtime security checks were implemented using six SIMD hardware units, which allowed the monitor processor to process information from all six application processors at once. Although we used six, we could have used just one SIMD hardware unit to perform the runtime checks by sharing it; our decision was based on rapid detection rather than on area savings.

5.1 Performance Impact
The MPSoC for the JPEG encoder benchmark was designed on a 130nm LV process and had an individual core speed of 303 MHz with an area of 0.55 mm2 without additional customized hardware. The application code size increased by 35.2% because of the software instrumentation at the basic block level. There was an overall increase of only 6.6% in the runtime of the application. We tested the approach described in [11] as well as our SHIELD design flow on 25 different JPEG frames from five different benchmark images (5 frames per benchmark image). For the case with no simulated attacks, the average execution times (in clock cycles) of the application and of the instrumented systems (both the approach in [11] and SHIELD) are shown in Table 2. The App. Exec. time refers to the time taken by the JPEG encoder benchmark to run with the instrumented inform and confirm instructions but without the monitor processor monitoring it. The Sys. Exec. time refers to the time taken by the entire MPSoC design to run, which is essentially the time taken for the monitor processor to finish execution. Table 2 clearly shows that our monitor processor finishes soon after the application (within 4000 cycles) and is thus able to rapidly process the incoming information from the application processors.

    Benchmark       App. Exec. time   Sys. Exec. time (×10^3 cc)     Speedup
    Image           (×10^3 cc)        Alg in [11]    Alg in SHIELD
    grandmom        4500.2            42350.1        4504.2             9
    mom             4500.7            47674.8        4504.6            11
    mom-daughter    4500.5            44170.1        4504.4            10
    flower garden   4513.0            85129.9        4516.9            19
    tennis          4503.3            67754.2        4507.2            15

Table 2: Results from the tests comparing the two approaches

Table 2 also shows the performance speedup achieved by SHIELD over the work in [11] for the five benchmark images, computed as the ratio of the two Sys. Exec. times. SHIELD obtained a performance speedup of almost 19 times compared to the approach in [11] for the flower garden benchmark, which has a complicated and varying background. This is expected: flower garden, being a complicated benchmark, causes certain basic blocks in some of the processors (e.g., the Huffman encoding processor) to be executed a greater number of times than in the other benchmarks. This leads to greater communication overhead and hence a wider gap between the two approaches.

5.2 Area Overheads
The implementation of our methodology for the JPEG case study required us to employ a monitor processor and to define custom hardware. The original area of 3.36 mm2 increased to 4.27 mm2, representing an increase of 26.9%. The dedicated monitor processor accounted for an overhead of 16.4%, and the custom hardware for security in the six application processors and the monitor processor accounted for the remaining 10.5%.

5.3 Fault Injection Analysis
We tested our system for bit flip errors in the CFIs that may occur in the instruction memory of our MPSoC system. Table 3 shows the analysis of 100 faults injected into our MPSoC system. Each of the six application processors had 16 or 17 faults randomly injected into it, and the runtime output of the system was observed. Each processor N has a certain number of CFIs, denoted CN. We generated random numbers between 1 and CN for each N; these numbers were used to identify the CFI into which the fault was to be injected. To determine which bit of the instruction was to be corrupted, a further random number was generated between 1 and 16 or between 1 and 24, depending on whether the instruction was 16 or 24 bits long. This procedure is sketched below, after Table 3.

    Processor      Func/Lib Call   Branch   Jump   Total   Detected   % Detected
    Read File             6            9       2     17        13          76
    RGB Convert           9            6       2     17        11          65
    DCT2                  9            4       4     17        17         100
    Quantization          7           10       0     17        15          88
    Huffman               3            7       6     16        14          88
    Write File           11            3       2     16        13          81
    Total                45           39      16    100        83          83

Table 3: Results from fault injection in the instruction memory

Table 3 shows that 83% of the errors were detected in our MPSoC system, by either the instruction set simulator (ISS) or runtime monitoring. Note that the CFIs in the system library functions were not subjected to fault injection, as they are not currently instrumented; they could be instrumented to provide protection over a bigger base. Some of the undetected faults may simply not lie on the execution path of the program.
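The injection procedure can be sketched as follows in C. The representation of a CFI and its encoding length are hypothetical; the random choice of CFI and bit position mirrors the description above.

    /* Flip one random bit of a randomly chosen CFI among the C_N CFIs of
     * a processor (paper: CFI index in [1, C_N], bit in [1, 16] or [1, 24]). */
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        uint32_t word;     /* instruction encoding            */
        int      nbits;    /* 16 or 24 for the Xtensa LX2 ISA */
    } cfi_instr;

    static void inject_fault(cfi_instr *cfis, int c_n)
    {
        int idx = rand() % c_n;              /* 0-based index over the C_N CFIs */
        int bit = rand() % cfis[idx].nbits;  /* bit position to flip            */
        cfis[idx].word ^= (uint32_t)1 << bit;
    }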

6. CONCLUSIONS
In this paper we have, for the first time, presented an automatic hardware/software design flow, SHIELD, for detecting code injection attacks and ensuring the reliability of CFIs on MPSoCs. We implemented this design flow with a commercial ASIP design tool from Tensilica Inc. and tested it on a JPEG encoder multiprocessor program. Reliability errors were simulated by injecting faults into the CFIs in the instruction memory of the application processors, and 83% of these errors were detected. Our methodology resulted in a speedup of at least 9 times over a previously proposed methodology, with a modest performance overhead of around 6.6% and an area overhead of 26.9%.

7. REFERENCES

[1] D. Arora et al. Secure embedded processing through hardware-assisted run-time monitoring. In DATE '05, pages 178–183, Washington, DC, USA, 2005.
[2] C. Cowan et al. StackGuard: Automatic adaptive detection and prevention of buffer overflow attacks. In Proc. 7th USENIX Security Conference, pages 63–78, San Antonio, Texas, Jan. 1998.
[3] N. Dor, M. Rodeh, and M. Sagiv. CSSV: Towards a realistic tool for statically detecting all buffer overflows in C. In PLDI '03, pages 155–167, New York, NY, USA, 2003.
[4] J. G. Dyer et al. Building the IBM 4758 secure coprocessor. Computer, 34(10):57–66, 2001.
[5] D. Larochelle and D. Evans. Statically detecting likely buffer overflow vulnerabilities. pages 177–190, 2001.
[6] E. Larson and T. Austin. High coverage detection of input-related security faults. In SSYM '03, pages 9–9, Berkeley, CA, USA, 2003. USENIX Association.
[7] J. McGregor et al. A processor architecture defense against buffer overflow attacks. pages 243–250, 2003.
[8] M. Milenkovic, A. Milenkovic, and E. Jovanov. Hardware support for code integrity in embedded processors. In CASES '05, pages 55–65, New York, NY, USA, 2005.
[9] G. C. Necula, S. McPeak, and W. Weimer. CCured: Type-safe retrofitting of legacy code. In POPL '02, pages 128–139, New York, NY, USA, 2002.
[10] J. Ohlsson, M. Rimén, and U. Gunneflo. A study of the effects of transient fault injection into a 32-bit RISC with built-in watchdog. In FTCS, pages 316–325, 1992.
[11] K. Patel, S. Parameswaran, and S. L. Shee. Ensuring secure program execution in multiprocessor embedded systems: A case study. In CODES+ISSS '07, pages 57–62, New York, NY, USA, 2007.
[12] J. Pincus and B. Baker. Beyond stack smashing: Recent advances in exploiting buffer overruns. IEEE Security and Privacy, 2(4):20–27, 2004.
[13] R. G. Ragel and S. Parameswaran. IMPRES: Integrated monitoring for processor reliability and security. In DAC '06, pages 502–505, New York, NY, USA, 2006.
[14] S. Ravi et al. Security in embedded systems: Design challenges. ACM Trans. Embedded Comput. Syst., 3(3):461–491, 2004.
[15] C. Rowen and D. Maydan. Automated processor generation for system-on-chip. Technical report, Sept. 2001.
[16] M. A. Schuette and J. P. Shen. Processor control flow monitoring using signatured instruction streams. IEEE Trans. Comput., 36(3):264–276, 1987.
[17] M. Shafique, L. Bauer, and J. Henkel. An optimized application architecture of the H.264 video encoder for application specific platforms. In ESTIMedia 2007, pages 119–124, Oct. 2007.
[18] S. L. Shee and S. Parameswaran. Design methodology for pipelined heterogeneous multiprocessor system. In DAC '07, pages 811–816, 2007.
[19] D. Wagner et al. A first step towards automated detection of buffer overrun vulnerabilities. In Network and Distributed System Security Symposium, pages 3–17, San Diego, CA, February 2000.
[20] Y. Younan, W. Joosen, and F. Piessens. Code injection in C and C++: A survey of vulnerabilities and countermeasures. Technical Report CW386, Departement Computerwetenschappen, Katholieke Universiteit Leuven, July 2004.
