OCEAN: An Optimized HW/SW Reliability Mitigation Approach for Scratchpad Memories in Real-Time SoCs

MOHAMED M. SABRY and DAVID ATIENZA, Embedded Systems Lab (ESL), EPFL
FRANCKY CATTHOOR, IMEC

Recent process technology advances trigger reliability issues that degrade the Quality-of-Service (QoS) required by embedded Systems-on-Chip (SoCs). To maintain the required QoS with acceptable overheads, we propose OCEAN, a novel cross-layer error mitigation approach. OCEAN enforces on-chip SRAM reliability with a fault-tolerant buffer. We utilize this buffer to protect a portion of the processed data, which is used to restore the system from runtime errors. We optimally select the buffer size to minimize the energy overhead, under timing and area constraints. OCEAN achieves full error mitigation with a 10.1% average energy overhead compared to a baseline operation that does not include any error correction capability, and 65% energy savings compared to a cross-layer error mitigation mechanism.

Categories and Subject Descriptors: B.8.1 [Performance and Reliability]: Reliability, Testing, and Fault-Tolerance

General Terms: Design, Algorithms, Performance, Reliability

Additional Key Words and Phrases: Error correction, hybrid mitigation, embedded systems

ACM Reference Format:
Mohamed M. Sabry, David Atienza, and Francky Catthoor. 2014. OCEAN: An optimized HW/SW reliability mitigation approach for scratchpad memories in real-time SoCs. ACM Trans. Embedd. Comput. Syst. 13, 4s, Article 138 (March 2014), 26 pages.
DOI: http://dx.doi.org/10.1145/2584667

1. INTRODUCTION

Future processing technologies increase system functionality by integrating more transistors within the same unit area. However, CMOS scaling triggers different reliability issues, such as Negative Bias Temperature Instability (NBTI) [Agostinelli et al. 2005b] and erratic bit errors [Agostinelli et al. 2005a; Mitra et al. 2005], that are major challenges in robust systems design and operation. With the dramatic increase of error rates and reliability issues, it is evident that prospective embedded systems should have error resiliency as a pivotal design parameter [Henkel et al. 2011; Karakonstantis et al. 2012]. However, such systems should overcome the recent reliability challenges that result in different error types (e.g., soft errors, intermittent errors, and wear-out [Mitra 2008]) with a reasonable cost in terms of area, energy, and performance. Traditionally, bit errors resulting from a Single-event Single-bit Upset (SSU) are alleviated at different abstraction layers, based on correction and detection techniques.

This work was supported in part by a joint research grant for ESL-EPFL by IMEC, the EC FP7 FET SCoRPiO project (no. 323872), and the BodyPoweredSenSE RTD project (no. 20NA21 143069) evaluated by the Swiss NSF and funded by Nano-Tera.ch with Swiss Confederation financing.

At the hardware level, Error Correction Circuitry (ECC) has been widely used at different memory levels [AMD 2001; Mitchell et al. 2005; Kongetira et al. 2005], as ECCs provide single-bit correction with feasible area, energy, and timing overhead. For example, previous work [Pyo et al. 2009] shows that Single-Error-Correction Double-Error-Detection (SECDED) ECC adds 15% area overhead when used to protect L1 SRAMs. However, as technology scales, the Single-event Multibit Upset (SMU) rate increases significantly [Ibe et al. 2010], debilitating the SECDED ECC mitigation capability. Although multibit ECC circuits can be used to mitigate SMU-based errors, multibit ECC circuitry demands significant area, energy, and timing overheads. These overheads can be feasible in high-capacity, low-level (e.g., L2, L3) memories [Paul et al. 2011; Sun et al. 2009]. However, they are unacceptable, from an industrial perspective, for low-capacity and fast embedded SRAMs (e.g., 64KB). For example, the area overhead of an 8-bit ECC integrated into a 64KB L1 SRAM is reported to be more than 80% [Kim et al. 2008].

At the software level, several backward-error or forward-error correction schemes have shown their efficiency in recovering from errors [Lee et al. 2008; Siewiorek and Swarz 1998; Pradhan 1996]. However, as indicated in previous works [Li et al. 2012], these techniques either incur significant time overhead or degrade the output signal considerably. Thus, if these techniques are used as-is with the expected growth in fault rate, the time overhead will increase alarmingly to unacceptable values.

A promising error resilience paradigm to overcome the expected multi-parameter overheads is based on cross-layer reliable systems design and management [Lee et al. 2008; Henkel et al. 2011]. Cross-layer reliable systems design combines the error resiliency at different layers of abstraction, where information on reliability from the hardware level is propagated to the application level. This information enhances the inter-layer reliability communication in order to successfully handle the error rates at a reasonable cost. Another utilization of the cross-layer paradigm is decoupled error detection and correction mechanisms. For instance, an error-detecting unit could be a software anomaly detector that triggers a hardware ECC unit [Eles et al. 2008]. Another example is the utilization of a partially protected cache, where a small segment of the data is HW protected, while the remaining data segment is placed in an error-prone cache and software-based techniques are used to recover from errors [Lee et al. 2008; Sabry et al. 2012].

However, cross-layer designs can be too fine grained (micro-operation level [Li et al. 2012]) or too coarse grained (task level [Huang et al. 2009]) with respect to the inter-layer communication. Such abstract granularity, if not carefully and globally decided, will lead to one optimized parameter (e.g., time) at the expense of another suboptimal parameter (e.g., energy). It is worth mentioning that a well-balanced multiparametrized design is crucial to an industrial designer. Thus, the aforementioned diverse behavior cannot be accepted.

In this context, we propose OCEAN, an Optimized Cross-layer Error AtteNuation methodology. Our proposal targets full error protection using cross-layer designs, but with multi-objective optimization of various design and runtime parameters, such as energy, time, and area.
Thus, our hybrid multilayer proposal is energy, time, and area efficient. We show in this article that OCEAN is a demand-driven, per-processing-element technique, which makes it scalable to multi-processor SoCs (MPSoCs). Moreover, we focus in this article on the realization of OCEAN for particular embedded Systems-on-Chip (SoC) designs that handle streaming applications (e.g., multimedia). These designs adopt an architectural structure where one or more processing units are connected to software-controlled memories, namely ScratchPad Memories (SPMs). These SPMs are further connected to higher memory levels.

Fig. 1. Schematic diagram of typical SoC platform we are targeting. The SoC has n processing units that are connected to L1 ScratchPad Memories (SPMs), with the opportunity to integrate additional L1 memories. These L1 memories are further connected to lower-hierarchy memory units.

This architectural structure has been proliferating recently as an alternative to using HW L1 caches, since scratchpad memories are software mapped, hence user controlled, for higher performance and energy efficiency. Typical examples of such systems are the NXP microcontrollers [NXP 2014] and the Texas Instruments C60 Digital Signal Processors (DSPs) [TIC60 2014].

Figure 1 shows a typical structure of an embedded SoC. In this example, n processing units are connected, by an interconnection structure (e.g., direct connection, AMBA-AHB, NoC), to m L1 scratchpad memories. These memories are classified into n directly connected scratchpad memories and (m − n) additional L1 memories that serve as instruction, shared, and other scratchpad memories. In addition, the processing units and L1 memories are further connected to different memory units and/or peripherals. In this article, we primarily focus on the most energy- and performance-critical on-chip modules, namely the processing units, the SPMs, and their information exchange. It is important to mention that the considered target architectures do not include caches in their memory hierarchy and organization. In particular, we enhance the target system's reliability with the following contributions.

—We propose a hybrid multilayer mitigation technique that uses a combination of added HW modules with small area overhead, and SW routines with negligible time overhead. We propose to integrate a small error-protected buffer into the target HW platform. In addition, we propose to add frequent checkpoints and use a rollback-based error mitigation mechanism at the target application level.

—We formulate the buffer size selection and the checkpoint frequency as an optimization problem to minimize the energy overhead, given that the performance and size overheads are restricted by hard constraints decided beforehand by the system designers.

—We evaluate this proposal on a low-power embedded system running various applications as case studies. We show the overhead variation when different chunk sizes are selected, which demonstrates the existence of an optimal operating point, with single- and multiple-error incidents. Moreover, we show that we can achieve full error mitigation under hard time and area constraints, with a maximum of 22% and an average of 10.1% energy overhead with respect to a baseline (no error correction) system operation, while guaranteeing all the design-time constraints. We also show that OCEAN reduces the time and energy overhead of a cross-layer error mitigation scheme by an average of 65%.

This article starts by reviewing the state-of-the-art error mitigation techniques in Section 2. In Section 3, we elaborate on our proposed multilayer hybrid solution (OCEAN), showing its implementation details.
We identify the overheads introduced by our mitigation scheme, and formulate the problem of finding the best chunk size to meet the imposed metrics as an optimization problem, in Section 4. We describe the target applications, the HW platform, and the simulation platform we use in evaluating our mitigation scheme in Section 5. In Section 6, we evaluate the effectiveness of our proposed mitigation scheme, in terms of energy and time, by comparing it with strict HW- and SW-based solutions. Finally, we conclude our work in Section 7.

2. RELATED WORK

This section explores the various branches of related work relevant to our proposal. Since our proposal is a multilevel HW/SW error mitigation scheme, we cover related works on mitigation techniques applied at the HW, SW, or hybrid HW/SW layers.

2.1. HW-Based Error Mitigation

HW-based redundancy for reliability enhancement has been widely studied in previous works. For example, May et al. [2008] use HW resource duplication and triplication in the reliability-aware design of a Low-Density Parity-Check (LDPC) code decoder. The amount of resource redundancy is based on the protection priority of the corresponding resource. The controller and functional units in the LDPC decoder are triplicated, while only the most significant bytes of the SRAM memory words are duplicated (or triplicated), to withstand a small Mean-Time-Between-Failures (MTBF) value. This proposal has the ability to correct multibit errors, but it comes with a significant area overhead that is also accompanied by a substantial increase in leakage power.

Other research directions use interconnected modules to mitigate hard failures through fine-grained redundancy. For example, Gupta et al. [2008b] propose StageNet, a highly reconfigurable multicore architecture that is designed as a network of pipeline stages rather than isolated cores. Using this connectivity, the instruction flow can be routed away from a faulty pipeline stage to a healthy counterpart to ensure reliability. However, this approach comes with a rather large area overhead that reaches over 20%. Other work proposes energy overhead minimization in fault-tolerant hardware-redundant systems [Ejlali et al. 2009], where primary and spare processing units are used in parallel. The proposed technique uses Dynamic Voltage Scaling (DVS) and Dynamic Power Management (DPM) to achieve energy minimization. However, this solution may be costly for power- and area-sensitive systems, since a full duplicate of the processing unit is dedicated only to this technique.

To protect L1 and L2 cache architectures, Manoochehri et al. [2011] propose a low-cost hardware mechanism that provides multibit error protection. This mechanism enhances the write-back parity-protected cache by adding two registers used to store information on data written in caches, such that if an error occurs in one of the written lines, a recovery can be performed. While this approach is effective at low error rates, it is limited in the number of simultaneously injected errors it can correct. Moreover, this approach corrects every detected error, which implies significant energy consumption at high error rates, regardless of whether the written data is reused or not. Kim et al. [2008] propose cache memory error protection using 2D coding schemes. This proposal places two error detection circuitry units, namely horizontal and vertical units, which are used together to guarantee error correction of any faulty cache line. However, this work adds area and energy overheads that would not be acceptable for the targeted low-power SoCs.
The main reason behind the significant energy overhead is the unnecessary correction of all bit-flip occurrences in the target memories, which may not be required due to data error masking [Krishnamohan and Mahapatra 2005].

2.2. SW-Based Error Mitigation

Error mitigation at the SW layer has attracted a lot of attention. SW-based mitigation can be split into techniques that apply checkpoints and rollback-based recovery (backward error correction), or resource redundancy (forward error correction). We discuss the related works in these categories as follows.

Previous works have advocated the insertion of several checkpoints and rollback recovery in case of failures [Pop et al. 2009; Abate et al. 2008; Eles et al. 2008; Väyrynen et al. 2009]. For example, Gupta et al. [2008a] propose a delayed commit and rollback mechanism to overcome soft errors resulting from different sources, such as noise margin violations and voltage emergencies. The authors divide the data stored in the processor pipeline into two different states: noise-speculative and noise-verified. Moreover, the authors base their solution on a violation detector that has a time lag (D) to detect a margin violation. If a data value is in the noise-speculative state for a time period D and no violation is detected, it is considered noise verified (correct data). Otherwise, it is considered faulty and a rollback to the last verified checkpoint is performed, with all noise-speculative states flushed. Although this approach seems interesting, it is orthogonal to our scheme, as we are primarily interested in mitigating errors in the memory system. Moreover, this approach has a performance loss that reaches 18%, and the authors consider the memories to be fault tolerant; thus, this technique cannot mitigate memory-based faults.

Other approaches exploit data redundancy, such as redundancy in networked embedded system design for reliability management [Lukasiewycz et al. 2009]. This work identifies the data redundancy between different functions that run on a certain architecture. This identification is later used to optimize resource allocation, hence improving reliability. However, this approach does not mitigate any error occurrence. Thus, it is complementary to our technique.

Some related work minimizes Soft Error Rates (SER) by using temporal and spatial redundancy [Hyman et al. 2009]. This approach is based on creating frequent checkpoints and performing a rollback in case of error occurrence. In the temporal redundancy approach, an instruction execution is duplicated in its latency-use slack, which is the number of cycles elapsed before the computed result of the instruction becomes the source operand of a subsequent instruction. In the spatial redundancy approach, the instruction is duplicated on a nearby idle core. These techniques have varying latency overheads (8%–25%), while no information on the energy overhead is provided. Moreover, this approach assumes that the soft errors exist in the system logic while the storage elements are fault tolerant. Thus, it is complementary to our approach as well.

2.3. HW/SW Reliability Management

In addition to strictly HW and SW mitigation mechanisms, several works have exploited the benefits of developing integrated HW/SW reliability management mechanisms. These mechanisms mainly target multi-objective goals, such as minimizing energy while meeting certain reliability constraints. We briefly review various HW/SW reliability mechanisms as follows.

Some research directions exploit the use of control elements, such as voltage and frequency scaling and task management, to enhance system reliability. Previous work proposes reliability-aware task allocation and scheduling [Huang et al. 2009], which includes an approximated Mean-Time-To-Failure (MTTF) model. This model is used in a reliability-aware task allocation based on a simulated annealing technique.
However, the authors provide only a design-time approach that does not take into consideration any runtime soft-error-based bit flips. Another work includes task scheduling for reliability-aware energy management [Zhu and Aydin 2009]. This scheduling minimizes the energy consumption of periodic real-time systems while preserving the functional reliability. It uses preemptive Earliest-Deadline-First (EDF) scheduling and Dynamic Voltage and Frequency Scaling (DVFS) to exploit the slack generated from the correct execution of a task (without functional failure) to scale down the operating frequency of another task, while preserving a time slot for fault recovery of the latter task.

Instruction-level redundancy is proposed in prior work [Vera et al. 2009] to reduce soft errors. In this work [Vera et al. 2009], a micro-architectural technique that replicates a subset of the executed instructions is proposed. The replicated instructions are chosen by their significant impact on the overall vulnerability of the system. However, this technique does not guarantee error-free operation, which may violate the timing constraints if a nonreplicated instruction is affected. Another approach that targets Single-Event Upset (SEU) mitigation is proposed by Shafik et al. [2010]. In this approach, task mapping and voltage scaling are utilized to minimize the power consumption while improving the reliability of a homogeneous MPSoC. However, no error correction is applied; thus, the system still experiences SEUs, which limits the deployment of this technique in systems with high error rates.

The use of checkpoints with a partially protected memory segment has been proposed by Lee et al. [2008]. This work uses a Partially Protected Cache (PPC) to store a portion of the streaming data, such that it is used to recover from an error in the unprotected cache. In case of an error, this technique applies either rollback or drop error recovery to the nearest checkpoint. Although this work is close to our proposal, the authors in Lee et al. [2008] assume a certain portion of protected memory, without taking into consideration the impact of this memory on the overall system cost, in terms of area, time, and energy. Thus, our work has a clear advancement over the work by Lee et al. [2008], by including the memory impact in the system energy, area, and time overheads, as well as by optimizing these overheads.

Based on our survey of the state-of-the-art error mitigation techniques, we find that there is a gap between the mitigation schemes. Different approaches assume certain HW platform conditions, without taking into consideration whether these conditions are scalable or affordable. Furthermore, other approaches assume that the executed tasks have enough slack to tolerate the execution overhead, which may not be the case in low-power systems. In our proposal, we not only take both the HW and SW impacts on the mitigation technique into consideration, but we also optimize their combined impact for overall efficiency.

We advance the state-of-the-art and our previous work [Sabry et al. 2012] with the following contributions. First, we show the implementation details of OCEAN in both HW and SW layers. Second, we identify the overheads introduced by the proposed scheme and use them to formulate the selection of the data chunk size and checkpoint frequency as a nonlinear programming problem.
Finally, we evaluate our mitigation scheme with a typical HW platform and different benchmarks. We explore the influence of the chunk size and number of checkpoints on the energy overhead, highlighting the existence of an optimal value. Moreover, we compare our proposal with other mitigation techniques, showing its benefits and novelty. In this comparison, we show the detailed energy consumption and overall time overheads of each technique. Moreover, we show the error sensitivity of OCEAN with respect to varying error rates.

Fig. 2. A schematic diagram showing an overview of OCEAN.

3. PROPOSED MITIGATION SCHEME

3.1. Overview

As we mention earlier, OCEAN targets error mitigation via cross-layer optimization. For generality, we show a generalized framework of OCEAN, which enables its applicability at many abstraction levels. An overall view of OCEAN is shown in Figure 2. From a system-level perspective, OCEAN starts with holistic information on all the targeted platform layers, namely the circuit-level layout, target architecture, utilized system software and mapping, and the target applications. From this information, OCEAN achieves optimal error correction by following a four-phase protocol, where each phase is elaborated as follows.

—Layers Identification. This is the initial step required to identify the layers where the various components of the mitigation scheme will be implemented. For instance, the hardware layer could be selected for error detection, whereas the protection can be in the middleware layer.

—Granularity Identification. This phase indicates the minimal correction capability, which means the smallest time, space, or data unit at which the reliability technique is applied; the technique cannot be applied for an error occurrence at a resolution finer than that unit. In other terms, this phase identifies ε such that the error correction capability function f(x) = 0 if |x| < ε and f(x) = 1 otherwise.

—Overhead Identification. This is the stage where the overheads are parameterized as a function of the parameters under inspection (e.g., time, energy, or data size). This identification is achieved by a thorough analysis of the various target system layers, such as the hardware layer, the middleware, and the target application.
—Overhead Optimization. After the overheads are parameterized, a cost function is derived for optimization. For simplicity, this process can be single variate, such that the cost function depends on a single variable, while the remaining variables impose certain constraints on the solution space. Alternatively, the cost function can be multivariate, where more complex optimization routines (e.g., simulated annealing) can be used to reach the Pareto-optimal operating point.

It is important to mention that Layers Identification and Granularity Identification are intertwined, particularly in the error correction scheme. Thus, the two phases are combined into a single larger phase, as shown in Figure 2. These phases are performed at design time, to derive the runtime error mitigation scheme.

While the preceding description is a generalized approach, in this article we show a crisp realization of OCEAN that uses the hardware layer for error detection and protection, while the correction mechanism is achieved through the interaction of the middleware and application layers. We select the granularity at the data-size level, in the form of what we later call a data chunk, together with the insertion of periodic checkpoints. We identify the overheads in terms of area, energy, checkpoints, data chunk, and execution time, and we optimize the energy overhead consumed in error mitigation subject to area and time constraints in Section 4. The granularity identification is the initial layer of OCEAN, where the error detection and correction mechanisms are identified, as well as the communication and error propagation mechanism from the error detection layer to the correction layer. The following sections show the error detection and correction mechanisms in more detail.

3.2. Error Detection Mechanism

The error detection mechanism is a hybrid HW/SW-based module that we design to detect data processing errors. We first add a trigger-based error detection circuit that has a negligible overhead. This circuit is triggered with an enable signal, which we control using middleware routines. The exact detection mechanism at the architectural and circuit levels is beyond the scope of this work. Moreover, several research directions have invested in designing low-overhead multibit error detection circuitry [Bhattacharya et al. 2009; Kim et al. 2008; Mukherjee et al. 2005; Abate et al. 2008; Nicolaidis 2005; Nieuwland et al. 2006].

The error detection circuit operates in a demand-driven fashion. Error detection in a demand-driven fashion, such as on each memory read, is effective in triggering the mitigation mechanism. Errors injected into the system do not always incur system faults and computational errors; hence, a portion of the errors is masked and does not require any correcting action [Krishnamohan and Mahapatra 2005; Lee et al. 2008; Mukherjee et al. 2005]. Consequently, we benefit from error masking by correcting only the errors that affect the system-level computing functionality and performance, rather than every induced error.

Figure 3(a) shows a schematic diagram example of the memory system with the error detection module. This module is placed at the interconnection bus. In this way, we can use the same physical module to check for errors in words read from different memory banks. In our experiments, we use another implementation technique, where we directly integrate this error detection circuit, which is based on multibit parity detection, into the processing unit's memory data port. It is important to mention that the area and power overheads of the error detection circuit are included in our evaluations later, in all the examined error protection scenarios in Section 6.

Figure 3(b) shows the detection flow we implement in this mechanism. In our approach, we enable the error detection mechanism when the memory address is issued from the Memory Address Register (MAR).

Fig. 3. Our proposed error detection mechanism that acts at each memory read transaction.

Before buffering the read data to the processor, the error detection circuit is triggered by asserting the error check signal. If the read data is faulty, the mitigation routine is triggered, which rolls the system back to the latest committed checkpoint. If not, the data is buffered to the Memory Data Register (MDR) to be used by the processor. It is important to mention that there is no execution time delay due to this mechanism, since this detection is done within the memory access time.

3.3. Error Correction Mechanism

In this article, OCEAN uses the concept of data chunks, which we define as follows.

Data chunk. The data segment D_CH(i) that is generated in computation phase i and/or should be alive between two consecutive computation phases.

It is important to mention that we select the data chunk such that D_CH(i) is the only data segment needed to compute D_CH(i + 1). Indeed, in our chunk selection, we take into consideration the data live ranges within the task execution. If a variable is alive within a sequence of checkpoints, it has to be considered in all generated chunks in that sequence.

Our proposed methodology relies primarily on the insertion of a number of periodic checkpoints N_CH within a task execution. At each checkpoint CH(i), i ∈ [1, N_CH], a data chunk is stored in a protected memory buffer that we integrate into the system. We refer to this buffer as L1′. When checkpoint CH(i) is being committed, D_CH(i) is buffered to L1′ to overwrite the data chunk generated at the previous checkpoint, D_CH(i − 1), while the task is being executed. However, if D_CH(i) is faulty, it is regenerated using the error-free D_CH(i − 1).

For illustration purposes, we assume that a certain computation task (T1) contains an iterative procedure. We further assume that this iterative procedure is repeated 5 · n times (i.e., the number of iterations is divisible by 5). Thus, we can split T1 into five computation phases P_i, i ∈ [1, 5], as shown in Figure 4. After each phase P_i, L cycles elapse in the checkpoint process and data chunk buffering. If the data is error free, the data chunk D_P(i) is buffered while executing P_(i+1). If an error occurs in one of the phases, as shown in the example in P_3, a system rollback to the last successful checkpoint is performed. From that rollback, P_3 is restarted by reading D_P(2) from the protected buffer L1′, hence only the data chunk D_P(3) is recomputed.

Fig. 4. An example of dividing T1 into five phases (P1–P5), showing the impact on error mitigation. The faulty task/phase is shaded with light grey, the checkpoints are shaded with dark grey, and the data chunks are shaded in stripe-patterned blue.

Therefore, the deadline violation that would otherwise have occurred due to the introduced error is avoided in this case. However, in the case where no checkpoints are inserted, T1 has to restart to recover from the same error, leading to a deadline violation.
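To make the phase splitting concrete, the following C sketch (not taken from the paper) shows how a task like T1 could be organized: the iterations are grouped into five phases, and a checkpoint call is issued on the live data chunk after each phase. The array layout, the placeholder computation, and the checkpoint() interface are assumptions made purely for illustration; here checkpoint() is only a stub standing in for the routine of Algorithm 1 in Section 3.4.

    /* Minimal sketch (not taken from the paper): splitting an iterative task T1
     * into five computation phases and committing a checkpoint after each phase. */
    #include <stdint.h>
    #include <stddef.h>

    #define N       1000                      /* iterations per phase (5*N in total) */
    #define PHASES  5

    static int32_t spm_data[PHASES * N];      /* data chunks living in the L1 SPM    */

    static int checkpoint(void *start, size_t bytes)
    {
        (void)start; (void)bytes;             /* real routine: validate the chunk,   */
        return 1;                             /* then copy it to the protected L1'   */
    }

    static void compute_phase(int32_t *chunk, int len)
    {
        for (int i = 1; i < len; i++)         /* placeholder computation             */
            chunk[i] += chunk[i - 1];
    }

    void task_T1(void)
    {
        for (int p = 0; p < PHASES; p++) {
            int32_t *chunk = &spm_data[p * N];        /* D_P(p): data alive after P_p */
            compute_phase(chunk, N);
            /* Commit CH(p): if the chunk is faulty, the real routine triggers the
             * mitigation routine, which rolls back to the previous checkpoint.      */
            checkpoint(chunk, N * sizeof(int32_t));
        }
    }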

As mentioned earlier, our proposal is a hybrid HW/SW mitigation mechanism that adds extra HW modules and SW routines to the system. We integrate an additional storage buffer, referred to as L1′, between the L1 SRAM and the processing unit. L1′ has an extremely limited size and is used to buffer the data chunk(s) at checkpoint CH(i), which is (are) essential and sufficient to mitigate an error that occurs between checkpoints CH(i) and CH(i + 1).

The introduction of SW routines and the splitting of data into chunks require, in most cases, some effort in refitting the target application. However, we would like to highlight that we mainly target low-power embedded SoCs with SW-controlled ScratchPad Memories (SPMs), as mentioned in Section 1. Thus, this application adaptation, which involves checkpoint insertion and remapping the data to SPM segments, can be done automatically at compile time, or even higher, at the source-code level. Consequently, there is a design-time overhead applied to the application sources, but the benefits obtained from the robust application operation at runtime overcome this negligible design-time overhead.

We elaborate our mitigation implementation by splitting the overall mechanism into two major modules: the checkpoint routine and the mitigation routine. These modules are designed and implemented as follows.

3.4. Checkpoint Routine

This routine is used to trigger a computation phase termination. We implement it as a software routine that controls data buffering from the L1 memory to the protected buffer L1′. A call to this routine is inserted within the task execution, and this insertion can be done at the code compilation phase. For a specific application, we first study the application behavior using a Control-Flow Graph (CFG)-like representation. This study proceeds briefly as follows.

—We first identify the application subtasks, or routines represented by function calls. This identification involves characterizing each subtask with the required input data, output data, operating time, and energy consumption.
—We monitor the execution flow of these subtasks. This process is required to know the inter-relations between the subtasks and whether an iterative processing of certain subtasks occurs.

—Based on the previous two steps, we define various sets of computation phases. In each set, subtasks are grouped into a computation phase such that all phases within a set have similar behavior, defined by operating time and energy consumption.

—We identify the start and end points of each computation phase, and mark them as potential checkpoint placement points.

Based on this analysis, we detect the possible checkpoint insertion points. Then, based on the number of checkpoints required, we place at each desired insertion point a call to the checkpoint routine, which applies the procedure shown in Algorithm 1. The algorithm takes the to-be-protected data address space, represented by a starting address and an offset, as input. Then, the routine checks for errors in this memory range (with the error detection circuitry; refer to Section 3.2), as this is a crucial part of checkpoint validation. If an error is found, this routine is terminated by triggering another routine, namely the mitigation routine (refer to Section 3.5). If there are no errors in the selected data range, this routine stores the Program Counter (PC), along with other status register(s), to the protected buffer L1′, and then sends the desired address range to a DMA-like module to enable the transfer between the memory module and the buffer. Thus, by checking the validity of the to-be-stored data and guaranteeing the protection of this data, we ensure the validity of each checkpoint [Prvulovic et al. 2002; Sorin et al. 2002]. It is worth mentioning that we implement this checkpoint routine with a minimal register footprint. This checkpoint routine is implemented as an SW macro, defined by a set of assembly instructions that provides the desired functionality with minimal routine- and context-switching overheads. We report the cycle overhead of the checkpoint routine on the target architecture in Section 5.2.

ALGORITHM 1: Checkpoint(Start_address, Offset)
  y = Check_Error(Start_address, Offset)
  if y = 0 then
    TRIGGER(Mitigation_routine)
    Return -1
  else
    L1′.PC_buffer ← Proc.PC
    μDMA.base_address ← Start_address
    μDMA.length ← Offset
    μDMA.transfer(L1, L1′)
    Commit_checkpoint()
    Return 1
  end if

The DMA-like module (referred to as μDMA) connects the protected buffer L1′ to the L1 SRAM, as shown in Figure 5. This μDMA module is accessible from the processing unit via a special address space. In our implementation, we use a customized DMA module tailored to the target system, as we describe in Section 5.
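For concreteness, the following C sketch shows one possible realization of the checkpoint routine of Algorithm 1 on top of such a memory-mapped μDMA module. The register addresses, register layout, and helper functions (check_error, trigger_mitigation, read_pc) are hypothetical names introduced for illustration only; the actual routine is an assembly macro tailored to the target SoC, as described above.

    #include <stdint.h>

    /* Hypothetical memory map: the base addresses and register layout below are
     * placeholders for illustration, not taken from the paper or the NXP platform. */
    #define UDMA_SRC       (*(volatile uint32_t *)0x40001000u)  /* L1 source address   */
    #define UDMA_LEN       (*(volatile uint32_t *)0x40001004u)  /* bytes to copy       */
    #define UDMA_CTRL      (*(volatile uint32_t *)0x40001008u)  /* write 1: start copy */
    #define L1P_PC_BUFFER  (*(volatile uint32_t *)0x40002000u)  /* PC slot inside L1'  */

    extern int      check_error(const void *start, uint32_t bytes); /* parity scan, 0 = error  */
    extern void     trigger_mitigation(void);                       /* rollback (Section 3.5)  */
    extern uint32_t read_pc(void);                                  /* restart address to save */

    int checkpoint(const void *start, uint32_t bytes)
    {
        if (check_error(start, bytes) == 0) {       /* chunk is faulty: do not commit it */
            trigger_mitigation();
            return -1;
        }
        L1P_PC_BUFFER = read_pc();                  /* save the restart point in L1'     */
        UDMA_SRC  = (uint32_t)(uintptr_t)start;     /* program the chunk transfer        */
        UDMA_LEN  = bytes;
        UDMA_CTRL = 1u;                             /* L1 -> L1' copy runs while the task resumes */
        return 1;                                   /* checkpoint committed              */
    }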

3.5. Mitigation Routine

This routine manages the rollback from a faulty memory access to a committed checkpoint. It can be implemented as an HW, SW, or hybrid routine. Moreover, this routine is triggered with different control knobs that exist both in HW and SW (refer to Sections 3.2 and 3.4).

Fig. 5. Schematic diagram of the target system with the implemented μDMA module.

The mitigation routine starts by transferring the data chunk from the protected buffer to the L1 memory, while granting the processor read access to the protected buffer. Afterwards, it restores the processor Program Counter (PC) with the committed checkpoint address, and the processor pipeline is flushed to avoid any hazards in the system. When the processor returns from this routine, it executes the program segment that starts at the last committed checkpoint. Finally, once the chunk restoration is performed, the processor is granted access to the L1 memory.
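The rollback steps can be outlined in C-like form as in the sketch below. The register map and helper functions are illustrative assumptions that mirror the checkpoint sketch of Section 3.4; parts of the real mechanism (pipeline flush, access arbitration, PC restoration) are HW actions that cannot be expressed in portable C and are shown here only as named stubs.

    #include <stdint.h>

    /* Hypothetical interfaces, mirroring the checkpoint sketch above. */
    #define UDMA_SRC       (*(volatile uint32_t *)0x40001000u)
    #define UDMA_LEN       (*(volatile uint32_t *)0x40001004u)
    #define UDMA_CTRL      (*(volatile uint32_t *)0x40001008u)  /* bit 0 set while a transfer is active */
    #define UDMA_DIR       (*(volatile uint32_t *)0x4000100Cu)  /* 0: L1 -> L1', 1: L1' -> L1           */
    #define L1P_PC_BUFFER  (*(volatile uint32_t *)0x40002000u)

    extern void grant_processor_access(int use_l1_prime);  /* arbitration between L1 and L1'   */
    extern void flush_pipeline(void);                       /* discard in-flight faulty work    */
    extern void jump_to(uint32_t pc);                       /* resume execution at the given PC */

    void mitigation(uint32_t chunk_addr, uint32_t bytes)
    {
        grant_processor_access(1);     /* processor may read from L1' meanwhile      */
        UDMA_DIR  = 1u;                /* restore the last committed chunk L1' -> L1 */
        UDMA_SRC  = chunk_addr;
        UDMA_LEN  = bytes;
        UDMA_CTRL = 1u;
        flush_pipeline();
        while (UDMA_CTRL & 1u)         /* wait until the chunk restoration completes */
            ;
        grant_processor_access(0);     /* normal access to the L1 memory again       */
        jump_to(L1P_PC_BUFFER);        /* restart at the last committed checkpoint   */
    }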

4. MITIGATION OVERHEADS IDENTIFICATION AND OPTIMIZATION

The inclusion of checkpoints and the division of data into chunks add area, timing, and energy overheads to the system. First, the data chunk generated at a certain phase (CH(i)) incurs additional memory access energy, due to the chunk storage, for a possible upcoming mitigation in the subsequent phase (CH(i + 1)). Second, the checkpoint routines at the end of each computation phase add time and energy overheads to the overall execution time and energy consumption. This is related to the sanity check performed to ensure that the chunk is error free, as well as the order issued to migrate the data chunk to the protected memory buffer. Finally, if an error occurs, the system pays additional time and energy to recompute a faulty chunk of data.

Since OCEAN achieves multivariate efficiency through optimization, a proper cost function has to be formulated. In order to make our proposal energy, time, and area efficient, an optimum chunk size and number of checkpoints pair must be selected. We first parametrize the energy, time, and area overhead functions in terms of the data chunk size and the number of checkpoints. Then, we use these functions in the optimization problem formulation. It is worth mentioning that what we optimize here is the physical storage requirement and not the logical data size. Since different logical data variables have different lifetimes and can be in different data chunks, they can be mapped to a single physical storage unit.

4.1. Energy Consumption Overhead

The energy overhead results from two operations, namely, the checkpoint and rollback procedures. We split this energy into storage and computation costs, such that the storage energy (E_store) is introduced by storing each data chunk in the protected buffer L1′ at each checkpoint CH(i). E_store includes the energy of buffering the data chunks to L1′ at each checkpoint. The chunk buffering energy includes the chunk reading energy from the L1 SRAM, the chunk writing energy to L1′, and the μDMA module energy. In addition, more storage energy is spent when an error occurs, to transfer data chunks back from L1′ to L1.

Hence, we compute the storage energy as follows:

$$E_{store} = N_{CH} \cdot \big[ S_{CH} \cdot \big( E_{L1'}(S_{CH}, W) + E_{L1}(S_M, R) \big) + E_{\mu DMA} \big] + err \cdot \big[ S_{CH} \cdot \big( E_{L1'}(S_{CH}, R) + E_{L1}(S_M, W) \big) + E_{\mu DMA} \big], \quad (1)$$

where N_CH is the number of checkpoints, S_CH is the chunk size (in bytes), err is the expected number of chunks that will be faulty within a running task, E_L1'(S_CH, Y) and E_L1(S_M, Y) are the energies consumed in accessing (Y ∈ {R, W}) the protected buffer L1′ of size S_CH and the L1 SRAM of size S_M, respectively, and E_μDMA is the energy consumed in the μDMA module. We simplify E_store in (1) by assuming that the read and write energy consumption values are similar and that the μDMA energy is negligible. Thus, E_store is rewritten as follows:

$$E_{store} = (N_{CH} + err) \cdot S_{CH} \cdot \big( E_{L1'}(S_{CH}) + E_{L1}(S_M) \big). \quad (2)$$

The computation energy (E_comp) results from two items, namely, the energy consumed at each call to the checkpoint routine and the energy used to correct an error using the mitigation routine, which in turn reexecutes a certain computation phase. We define the computation energy cost as follows:

$$E_{comp} = N_{CH} \cdot E_{CH} + err \cdot \big( E_{Mitig} + E(P(S_{CH})) \big), \quad (3)$$

where E_CH is the checkpoint routine energy consumption, E_Mitig is the energy consumed by the mitigation routine triggered when an error occurs, and E(P(S_CH)) is the energy consumed to recompute a data chunk of size S_CH. The energy parameters used in deriving Eqs. (1), (2), and (3) depend on technological aspects of the target SoC (as in the L1, L1′, and μDMA energy values), on the computation phase set selection of the target application (E(P(S_CH)); refer to Section 3.4), or on the mapping of the developed routines to the target SoC (E_CH and E_Mitig).
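As an illustration of how this cost model is evaluated, the C sketch below computes Eqs. (2) and (3) for one candidate (S_CH, N_CH) pair. All numeric parameters are placeholders (in the actual flow they come from CACTI characterization and application profiling, as described in Section 5), and the expected number of faulty chunks is an assumed value.

    #include <stdio.h>

    /* Placeholder technology/application parameters (illustrative values only). */
    typedef struct {
        double e_l1p;        /* E_L1'(S_CH): protected-buffer access energy per byte, J */
        double e_l1;         /* E_L1(S_M):   L1 SRAM access energy per byte, J          */
        double e_checkpoint; /* E_CH:    energy of one checkpoint routine call, J       */
        double e_mitigation; /* E_Mitig: energy of one mitigation routine call, J       */
        double e_recompute;  /* E(P(S_CH)): energy to recompute one data chunk, J       */
    } energy_params_t;

    /* Eq. (2): storage energy for N_CH checkpoints and err expected faulty chunks. */
    static double e_store(int n_ch, int s_ch_bytes, double err, const energy_params_t *p)
    {
        return (n_ch + err) * s_ch_bytes * (p->e_l1p + p->e_l1);
    }

    /* Eq. (3): computation energy of the checkpoint and mitigation routines. */
    static double e_comp(int n_ch, double err, const energy_params_t *p)
    {
        return n_ch * p->e_checkpoint + err * (p->e_mitigation + p->e_recompute);
    }

    int main(void)
    {
        energy_params_t p = { 5e-12, 20e-12, 40e-9, 25e-9, 3e-6 }; /* assumed values   */
        int    n_ch = 512, s_ch = 11 * 4;  /* 11 words of 4 bytes, as in Section 6.2   */
        double err  = 2.0;                 /* assumed expected faulty chunks per run   */

        printf("J = E_store + E_comp = %.3e J\n",
               e_store(n_ch, s_ch, err, &p) + e_comp(n_ch, err, &p));
        return 0;
    }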

4.2. Area Overhead

The area overhead introduced to the system is identified by the area of the protected buffer (A_L1'(S_CH)), the area of the μDMA module (A_μDMA), and the additional interconnects (A_interconnect). We define the total area overhead A_ov as follows:

$$A_{ov} = A_{L1'}(S_{CH}) + A_{\mu DMA} + A_{interconnect}. \quad (4)$$

It is important to mention that the buffer area A_L1'(S_CH) is related to the chunk size, as well as to the number of correction bits we target. Thus, it is important to show the impact of chunk size and correction-bit variations on the overall buffer area. We base our correction codes on Hamming codes, as they are the most widely known class of block codes, as well as being optimal in the required protection overhead [Morelos-Zaragoza 2002]. Based on error correcting coding theory [Morelos-Zaragoza 2002], for an information word of size n bits, the number m of redundancy bits required to ensure the recovery of t errors must satisfy

$$\sum_{i=0}^{t} \binom{n+m}{i} \le 2^m. \quad (5)$$

By further simplification, and for a large enough word length n as observed in the target HW platforms [NXP 2014; TIC60 2014] (n > 16), we can rewrite (5) as

$$(n+m)^t \le 2^m. \quad (6)$$

Inequality (6) shows that A_ov is dominated by A_L1'(S_CH). Therefore, in our optimization we only account for A_L1'(S_CH), while safely assuming that A_μDMA + A_interconnect contribute less than 10% to A_ov.
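A small helper illustrating inequality (5): it computes the minimum number of redundancy bits m needed to correct t errors in a word of n information bits. This is an illustrative sketch, not code from the paper; for n = 32 and t = 1 it returns m = 6, which matches the classical Hamming single-error-correction requirement.

    #include <stdio.h>
    #include <stdint.h>

    /* Binomial coefficient C(n, k), for small arguments only. */
    static uint64_t binom(unsigned n, unsigned k)
    {
        uint64_t r = 1;
        for (unsigned i = 1; i <= k; i++)
            r = r * (n - k + i) / i;
        return r;
    }

    /* Minimum m such that sum_{i=0..t} C(n+m, i) <= 2^m, i.e., inequality (5). */
    static unsigned min_redundancy_bits(unsigned n, unsigned t)
    {
        for (unsigned m = 1; m < 64; m++) {
            uint64_t sum = 0;
            for (unsigned i = 0; i <= t; i++)
                sum += binom(n + m, i);
            if (sum <= (1ull << m))
                return m;
        }
        return 0;  /* not reachable for practical n and t */
    }

    int main(void)
    {
        /* 32-bit data words, as in the target ARM9-based platform. */
        for (unsigned t = 1; t <= 4; t++)
            printf("t = %u errors -> m = %u redundancy bits\n",
                   t, min_redundancy_bits(32, t));
        return 0;
    }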

4.3. Execution Time Overhead

An executed application encounters two primary sources of delay due to our mitigation approach, namely, the delay due to the checkpoint routine, D_CH, and the delay due to mitigation, D_Mit. D_Mit is the sum of the delay of the mitigation routine and the delay of reexecuting the computation phase. We define these delays as follows:

$$D_{CH} = \frac{Inst_{checkpoint} \cdot CPI_{proc}}{f_{Proc}} + \frac{Mem_{checkpoint} \cdot CPM_{mem}}{f_{mem}}, \quad (7)$$

$$D_{Mit} = D_{Mitig} + \frac{Inst_{CP_{avg}} \cdot CPI_{proc}}{f'_{Proc}} + \frac{Mem_{CP_{avg}} \cdot CPM_{mem}}{f'_{mem}}, \quad (8)$$

$$D_{tot} = N_{CH} \cdot D_{CH} + err \cdot D_{Mit}. \quad (9)$$

The delay due to the checkpoint routine is composed of the number of instructions and memory accesses of this routine (Algorithm 1, including processor context switching), multiplied by the Cycles Per Instruction (CPI) and the Cycles Per Memory access (CPM), respectively. The delay is divided by the operating frequencies to obtain the delay in seconds. D_Mit is composed of the mitigation routine delay (D_Mitig) and the computation phase reexecution. Since we do not know the occurrence of the error in advance, we use worst-case error occurrence scenarios, where the error is at the end of a computation phase. We use the average instruction and memory access counts of the computation phases (Inst_CPavg and Mem_CPavg) as a representation of any computation phase. We could add a correction vector (α^T) representing the deviation of the different computation phases from the average case; however, we omit this term here for simplicity. Moreover, we can select the computation phases such that they experience delays with negligible variations from the average case. Indeed, we rely on the computation phase selection to eliminate the correction factor in our experiments (refer to Section 5).

It is important to note that we differentiate between the frequencies used in normal operation (f_Proc, f_mem) and the frequencies applied during mitigation (f'_Proc, f'_mem). We enable frequency scalability for systems with a tight time constraint, for which mitigation using the normal operating frequency is not adequate. In such cases, the system can increase its operating frequency, but at the cost of higher energy consumption.

4.4. Optimization Problem Formulation

It is crucial to minimize the aforementioned overheads for an efficient mitigation implementation. Thus, we use nonlinear programming to find the optimum chunk size and number of checkpoints pair (S_CH*, N_CH*). In our formulation, we use the energy overhead as the cost function, while putting the area and time overheads as hard constraints. This is due to the fact that the area overhead, as well as time in real-time systems, is more crucial than the energy consumption. Thus, our technique has to guarantee that the time and area overheads satisfy the acceptable values. In our problem formulation, we express the time and area overheads as inequality constraints as follows:

$$A_{L1'}(S_{CH}) \le OV_1 \cdot M, \quad (10)$$
$$D_{tot} \le OV_2 \cdot Exec. \quad (11)$$

Inequality (10) guarantees that the area overhead to implement the buffer of the optimum chunk size, A_L1'(S_CH), is less than the affordable area overhead in the system, OV_1 · M, where M is the area of the system storage element and OV_1 is the maximum area overhead percentage. In addition, inequality (11) guarantees that the cycle overhead required for error mitigation, D_tot, is maintained within the allowed overhead of the program execution time, OV_2 · Exec, where OV_2 is the maximum time overhead percentage. Overall, our problem of finding the optimum chunk size S_CH and number of checkpoints N_CH is formulated as follows:
$$\min_{(S_{CH},\, N_{CH})} \; J = E_{store} + E_{comp} \quad (12)$$

$$\text{subject to:} \quad A_{L1'}(S_{CH}) \le OV_1 \cdot M, \quad (13)$$
$$D_{tot} \le OV_2 \cdot Exec, \quad (14)$$
$$N_{CH} \in \mathbb{Z}^{+}, \quad (15)$$
$$S_{CH} = F(N_{CH}, CFG_{prog}). \quad (16)$$

To simplify solving this optimization problem, we fix the expected error rate to the worst-case rate, which is dependent on technological and operational aspects [Baumann 2002]. Nevertheless, varying error rates can be included in the optimization procedure, but with additional variations in the runtime checkpoint and mitigation procedures. By fixing the error rate, the optimization problem becomes deterministic. However, the optimization cost function is nonlinear. In this respect, we mainly rely on the Karush-Kuhn-Tucker (KKT) first-order necessary conditions to ensure optimality [Bazaraa et al. 2006].

It is worth mentioning that the number of checkpoints and the data chunk size cannot be chosen arbitrarily; they are dependent on the application behavior. We notice that in some applications there exists an inverse relation between the number of checkpoints and the data chunk size. For example, a small number of checkpoints implies a large data chunk size, while a higher number of checkpoints implies a smaller data chunk. In other applications, we notice that the data chunk size saturates at certain upper and lower bounds for the range of allowable checkpoints. Thus, in our optimization strategy we express such an application-dependent relation, for generality, as an equality constraint (16) rather than substituting it directly in the cost function derivation.
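Since N_CH is a positive integer and S_CH is tied to N_CH through the application's CFG (constraint (16)), one simple way to approximate the optimum in practice is to enumerate the feasible checkpoint counts and keep the lowest-energy point that satisfies (13) and (14). The C sketch below does exactly that; the cost, area, and delay callbacks are toy placeholders for the models of Sections 4.1 to 4.3, the candidate list is an assumption, and the sketch does not replace the KKT-based analysis used in the paper.

    #include <math.h>
    #include <stdio.h>

    /* Toy placeholder models standing in for Sections 4.1-4.3 and constraint (16). */
    static double chunk_size_for(int n_ch) { return 4096.0 / n_ch + 8.0; }  /* F(N_CH, CFG), bytes */
    static double energy_cost(int n_ch, double s_ch)                        /* J = E_store + E_comp */
    { return n_ch * 40e-9 + s_ch * 25e-12 * (n_ch + 1); }
    static double buffer_area(double s_ch) { return 2.0 * s_ch; }           /* A_L1'(S_CH)          */
    static double total_delay(int n_ch, double s_ch)                        /* D_tot, seconds       */
    { return n_ch * 0.5e-6 + s_ch * 5e-9; }

    int main(void)
    {
        const double OV1_M    = 1500.0;   /* placeholder affordable buffer area (5% budget)  */
        const double OV2_Exec = 2.0e-3;   /* placeholder affordable time overhead (10% budget) */

        int best_n = -1; double best_s = 0.0, best_j = INFINITY;

        for (int n_ch = 1; n_ch <= 4096; n_ch *= 2) {          /* candidate checkpoint counts */
            double s_ch = chunk_size_for(n_ch);                /* constraint (16)             */
            if (buffer_area(s_ch) > OV1_M)          continue;  /* constraint (13)             */
            if (total_delay(n_ch, s_ch) > OV2_Exec) continue;  /* constraint (14)             */
            double j = energy_cost(n_ch, s_ch);                /* objective (12)              */
            if (j < best_j) { best_j = j; best_n = n_ch; best_s = s_ch; }
        }

        if (best_n > 0)
            printf("N_CH* = %d, S_CH* = %.0f bytes, J = %.3e J\n", best_n, best_s, best_j);
        else
            printf("no feasible (S_CH, N_CH) pair under the given constraints\n");
        return 0;
    }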

5. EXPERIMENTAL FRAMEWORK

5.1. Target Application Case Studies

In our evaluation, we use various embedded benchmarks that exist in the literature, such as the MiBench [Guthaus 2001] and MediaBench [Lee et al. 1997] suites. We rely on the Synchronous DataFlow (SDF) [de Kock 2002] representation of these applications in each of the Granularity Identification, Overhead Identification, and Overhead Optimization phases (refer to Section 3). Figure 6 shows the SDF graph of one benchmark used in our simulations, namely the JPEG decoding benchmark [de Kock 2002; Kumar et al. 2011]. These graphs consist of a group of actors (ellipses) that contain the actor name and the execution cycles, while the edges represent the amount of data produced by/required by the connected actors. We run these applications on a single processing system (refer to Section 5.2). We refer the reader to previous work [de Kock 2002] for more details on SDF graphs.

Fig. 6. SDF graph of the JPEG decoder we use in our evaluation.

Fig. 7. Feasible chunk areas with number of correctable bits, based on the 5% area overhead constraint.

5.2. Target System Architecture

Since our mitigation is a per-core technique, we run the aforementioned applications on a simulated single-core NXP SoC platform [NXP 2014]. The targeted platform is based on the 32-bit ARM9 processor, which operates at a maximum frequency of 250 MHz. In our experiments, we fix the operating frequency (processor and memory) to 200 MHz, with no frequency scaling. A schematic diagram of the target system with the proposed L1′ buffer integrated is shown in Figure 5. This system has a 64KB L1 SRAM, split into instruction and software-controlled scratchpad data memories. We select the data memory as our targeted vulnerable memory. We use CACTI 6.5 [CACTI 2014] to estimate the L1 SRAM area, energy, and access time using a 65nm process technology. We assume that the L1′ access time is half the access time of L1, and we verify this assumption against the access time estimations we deduce using CACTI.

In our evaluation, we select the affordable area overhead (OV_1) in (10) as 5% of the total chip area, which is the maximal overhead that our industrial partners [NXP 2014] typically use as a target. This overhead restricts the storage capacity of L1′, as well as the maximum number of correctable bits per word, as shown in Figure 7. This figure shows the feasible L1′ size values with different error correcting capabilities, which we use in the overhead optimization phase of OCEAN (refer to Section 4.4).

In real-time embedded systems design, the processing platform is slightly overdesigned with respect to the target application Worst-Case Execution Time (WCET). This means that the processing time of a certain data segment is slightly lower than the deadline of this segment. Thus, there exists a slight time slack that we can rely on in deriving the timing overhead constraints. Based on the used case studies and the target platform characterization, we select the affordable cycle overhead (OV_2) in (11) as 10%.

In our experiments, we use an error rate of 10^-6 words per cycle, which is an upper bound comparable to the rate values mentioned in previous work [Leem et al. 2010]. Thus, this is the worst-case situation we consider to show the functional effectiveness of the target system with our proposal. However, we also experiment with more realistic lower error rates, to further emphasize the effectiveness of our proposal. It is important to mention that this error rate is only applied to the derived data segments stored in the data memory. We do not explore the error injection impact on the instruction memory, since instructions can be viewed as static memory that does not change its values during execution. Thus, errors in the instruction memory can be mitigated with techniques complementary to our proposal [Abate et al. 2008; Bhattacharya et al. 2009; Gizopoulos et al. 2011; Krishnamohan and Mahapatra 2005]. For example, if an instruction memory word is faulty, a reread from the lower memory levels, which can be protected with low overhead [Kim et al. 2008], corrects that error.

We report in this article the detailed on-chip power consumption. This includes the processing unit, the instruction memory, the data memory, and the introduced protected buffer. Moreover, we briefly report the off-chip power consumption, represented by interconnects and lower memory levels.

5.3. Simulation Platform

We simulate the efficiency of the target system with OCEAN on a cycle-accurate virtual platform, namely MPARM [Benini et al. 2005]. MPARM is a multi-processor virtual platform that includes SystemC models of different modules (processing units, cache memories, SPMs, DMA, etc.) interconnected with different interconnection protocols (AMBA-AHB, AMBA-AXI, NoC, etc.). In our work, we extend MPARM as follows.

—We have modified the structure of the SRAM to be error prone. In this context, we have added the ability to inject multibit errors with a user-defined error rate (in words per cycle). Based on the passed error rate, random bits of a randomly selected memory word are flipped, as used in Leem et al. [2010]. In particular, we have integrated an error injection module connected to the target L1 SPM. This module generates a random mask applied to a random memory word, based on the user-required error rate, and flips the bits based on the generated mask (a sketch of this injection step is shown after this list). It is important to mention that this error-injector module does not contribute to the system energy consumption.

—We have added a parity-based error detection mechanism. We have inserted multiple parity bits per memory word, which are generated twice: once when the memory word is written, and again when the word is read, to check its correctness. We have used CACTI [CACTI 2014] to estimate the energy required for this error detection mechanism, and we have found that its energy contribution is negligible compared to the access energy of the L1 SRAM.

—We have added the secured memory module, along with its connectivity to the processing element and the error-prone SRAM, as shown in Figure 5. This memory module is fully secured with n-bit ECC circuitry. The energy consumption and the timing delay caused by this module are estimated from CACTI and used in MPARM.

—We have integrated the checkpoint and rollback mechanisms into the simulator (refer to Section 3). The checkpoint mechanism is triggered with a software-driven enabling signal specified at the application level, while the rollback mechanism is triggered with a hardware-driven signal, which is asserted when a faulty memory word is read from the SPM. We optimize both the checkpoint and mitigation routines to minimize their overheads on the system execution time. For the target SoC, we implement the checkpoint routine such that its time overhead is only 500ns (100 clock cycles at 200 MHz), while the mitigation overhead is 240ns (48 clock cycles at 200 MHz) in addition to the time needed to transfer the data from the buffer back to the memory. For n protected words, n + 4 cycles are needed for the data transfer.
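The error injection step can be pictured with the stand-alone C sketch below: it flips a random number of bits in a randomly chosen word of a modeled SPM and shows how a per-word interleaved-parity check flags (or, in case of aliasing, misses) the corruption on the next read. This is an illustrative host-side model with assumed parameters, not the actual MPARM/SystemC extension.

    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>

    #define SPM_WORDS 16384u                  /* 64KB data SPM modeled as 32-bit words */

    static uint32_t spm[SPM_WORDS];
    static uint8_t  stored_parity[SPM_WORDS]; /* parity bits generated on each write   */

    /* Four interleaved parity bits per 32-bit word (one per byte lane). */
    static uint8_t parity4(uint32_t w)
    {
        uint8_t p = 0;
        for (int lane = 0; lane < 4; lane++) {
            uint8_t b = (uint8_t)(w >> (8 * lane));
            uint8_t bit = 0;
            while (b) { bit ^= 1u; b &= (uint8_t)(b - 1); }   /* parity of one byte */
            p |= (uint8_t)(bit << lane);
        }
        return p;
    }

    /* Inject a multibit upset: flip 1..4 random bits of one random word. */
    static uint32_t inject_error(void)
    {
        uint32_t addr  = (uint32_t)rand() % SPM_WORDS;
        int      nbits = 1 + rand() % 4;
        uint32_t mask  = 0;
        while (nbits--) mask |= 1u << (rand() % 32);
        spm[addr] ^= mask;
        return addr;
    }

    int main(void)
    {
        srand(1);

        for (uint32_t a = 0; a < SPM_WORDS; a++) {    /* write phase                 */
            spm[a]           = a * 2654435761u;       /* arbitrary data pattern      */
            stored_parity[a] = parity4(spm[a]);
        }

        uint32_t hit = inject_error();                /* random multibit upset       */

        /* Demand-driven check on the next read of the corrupted word. */
        if (parity4(spm[hit]) != stored_parity[hit])
            printf("error detected on read of word %u -> trigger rollback\n", hit);
        else
            printf("upset at word %u aliased within the parity lanes (undetected)\n", hit);
        return 0;
    }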

6. EXPERIMENTAL RESULTS

6.1. Single Error Mitigation Impact

We first explore the impact of a single error on the overheads, while varying the number of checkpoints and data chunks. This exploration is carried out for all the examined benchmarks. For illustration, however, we use the Adaptive Differential Pulse Code Modulation (ADPCM) benchmark as a case study. In this example, we have applied our mitigation technique to the ADPCM decoding algorithm from the MediaBench benchmark suite [Lee et al. 1997]. We explore the execution time and energy overhead induced by our cross-layer mitigation mechanism with different numbers of checkpoints applied to the system.
In our simulations, we feed a 2.6Mb streaming signal to the ADPCM decoder. Moreover, we have tested our proposal with a single error injection that is placed randomly within the application execution. We normalize the benchmark execution time and energy consumption with respect to the default case, where there are no checkpoints and no secured memory module.

Fig. 8. Execution time and energy consumption of ADPCM decoder with different number of checkpoints, normalized to the default system configuration case. The reported values are for the cases with and without errors. The results are averaged for 100 simulations.

For each number of checkpoints, we run the system simulation 100 times, and we report the average execution time and the per-module energy consumption, with and without errors, in Figure 8. This figure shows that the overhead is highly related to the number of checkpoints. For a large number of checkpoints (N_CH > 1024), we observe that the overheads with and without error occurrence are similar, implying that the checkpoint overhead is the dominant part. Despite the minute overhead of a single checkpoint, a substantial number of checkpoints leads to up to a 6% increase in execution time and a 10% increase in energy consumption. On the other hand, the overhead due to mitigation is negligible when a large number of checkpoints exists, but it becomes the dominant part when the number of checkpoints decreases. Thus, in the worst case it may cause a time constraint violation. A simple cycle-level model of this trade-off is sketched below.
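The trade-off just described can be illustrated with a back-of-the-envelope cycle model that reuses the routine costs reported in Section 5.3. The amount of protected data and the relation S_CH = TOTAL_WORDS / N_CH assumed below are simplifications made only for this illustration; the measured optima in Figures 8 and 9 come from the full overhead models of Section 4.

```python
# Back-of-the-envelope single-error cycle model using the routine costs of
# Section 5.3 (100 cycles per checkpoint, 48 cycles plus n + 4 cycles to copy
# n words back on a rollback). TOTAL_WORDS and the relation
# S_CH = TOTAL_WORDS / N_CH are simplifying assumptions made only for this
# sketch; checkpoint-time data copies are omitted for brevity.

C_CHECKPOINT = 100          # cycles per checkpoint (500 ns at 200 MHz)
C_ROLLBACK = 48             # fixed rollback cost (240 ns at 200 MHz)
TOTAL_WORDS = 64 * 1024     # hypothetical amount of data covered by the chunks

def single_error_overhead_cycles(n_ch):
    s_ch = TOTAL_WORDS // n_ch                # words copied per chunk
    checkpointing = n_ch * C_CHECKPOINT       # grows with the number of checkpoints
    rollback = C_ROLLBACK + (s_ch + 4)        # one rollback restores one chunk
    return checkpointing, rollback

for n_ch in (8, 32, 128, 512, 2048):
    cp, rb = single_error_overhead_cycles(n_ch)
    print(f"N_CH={n_ch:5d}  checkpointing={cp:7d}  rollback={rb:6d}")
# Few checkpoints -> the rollback (chunk restore) dominates; many checkpoints ->
# checkpointing dominates, matching the qualitative behavior discussed above.
```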

6.2. Multiple Errors Mitigation Impact

We extend the study performed in Section 6.1 to show the variation of the energy and time overheads with an arbitrary number of errors resulting from the injected error rate. The total energy consumption and execution time of the ADPCM decoder with the aforementioned error rate (10⁻⁶ words/cycle) and different numbers of checkpoints are shown in Figure 9. This figure shows that there is a clear correlation between the energy and time overheads. With a low number of checkpoints, we observe significant overheads that in the worst case exceed 27%. The overheads then decrease, with a minor deviation between energy and time, such that the minimum energy overhead, combining the average and worst-case errors, is observed at N_CH = 512 checkpoints. By contrast, the minimum time overhead is observed at N_CH = 2048 checkpoints.
Since our optimization function is energy based (refer to Section 4), the energy consumption curve is of more interest to us than the time curve. However, since our optimization is constrained by a time overhead, the optimum solution may differ in each case. For example, if we select OV_2 = 10%, then the optimum number of checkpoints and data chunk size (in words) is the pair (N_CH, S_CH) = (512, 11).

Fig. 9. Normalized execution time and energy consumption of ADPCM decoder with different number of checkpoints. The shown lines are for the average cases, with error margins showing the maximum and minimum values obtained at each checkpoint.

However, if we place a tighter constraint on the execution time, such that OV_2 = 5%, the optimum operating point moves to (N_CH, S_CH) = (1024, 10), with a 3% energy increase over (N_CH, S_CH) = (512, 11). The energy increase observed when tightening the time constraint from 10% to 5% varies, which indicates that this increase is application dependent. For example, we observe a 10% energy increase when we change the time overhead constraint in the JPEG decoding benchmark.
This study shows how crucial the optimal selection of the controlled variables is. Indeed, without cross-layer optimization, the system overheads reach constraint-violating levels that degrade the system operation and the output quality. The key novelty of OCEAN is that it brings these overheads to their minimum values, as we show later.

6.3. Compared Mitigation Techniques

We investigate the effectiveness of OCEAN on different benchmarks, and compare it against other mitigation approaches, as follows.
(1) The Default case, where the system operates with no error mitigation.
(2) The HW mitigation case, where the targeted L1 is fully protected, but at the cost of a large area overhead.
(3) The SW mitigation case, where the memory has minimal ECC capability, while the mitigation is performed by task restarting.
(4) The Cross-layer case, where we use a cross-layer error mitigation mechanism. This mechanism uses the same layers we mention in Section 3, but no optimization is performed.
Based on the problem formulation of Section 4.4 and the conditions mentioned before, we use fmincon, the nonlinear programming solver of the MATLAB Optimization Toolbox, to obtain the optimum chunk size and number of checkpoints. We mainly rely on the KKT conditions (refer to Section 4.4), while using the interior-point solving technique [Bazaraa et al. 2006; Amini and Peyghami 2006] to deduce the optimal operating points, with computational complexity O(√n · log(n)). A simplified sketch of this optimization step is given below.
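As a structural illustration of this optimization step, the sketch below solves an analogous constrained problem with SciPy's trust-constr solver (an interior-point-style method) in place of MATLAB's fmincon. The objective, the constraints, the coverage relation between N_CH and S_CH, and all coefficients are simplified placeholders rather than our actual overhead models from Section 4; only the structure (an energy objective minimized over (N_CH, S_CH) subject to a timing bound and a buffer-capacity bound) mirrors our formulation.

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

# Simplified stand-ins for the overhead models of Section 4. The coefficients,
# the coverage relation N_CH * S_CH >= TOTAL_WORDS, and the application figures
# below are illustrative assumptions, not the profiled values used in OCEAN.
E_CP, E_RB, E_WORD = 1.0, 0.5, 0.02     # energy units per checkpoint / rollback / copied word
C_CP, C_RB = 100.0, 48.0                # cycles per checkpoint / rollback (Section 5.3)
BASE_CYCLES = 5.0e6                     # assumed application length in cycles
ERR_RATE = 1e-6                         # error rate in words per cycle
OV_2 = 0.10                             # affordable cycle overhead, cf. constraint (11)
MAX_BUFFER_WORDS = 512                  # area-derived buffer limit, cf. constraint (10)
TOTAL_WORDS = 16 * 1024                 # data to be covered by the checkpointed chunks

def energy_overhead(x):
    n_ch, s_ch = x
    n_err = ERR_RATE * BASE_CYCLES      # expected number of rollbacks
    return n_ch * (E_CP + E_WORD * s_ch) + n_err * (E_RB + E_WORD * s_ch)

def time_overhead(x):
    n_ch, s_ch = x
    n_err = ERR_RATE * BASE_CYCLES
    return (n_ch * C_CP + n_err * (C_RB + s_ch + 4)) / BASE_CYCLES

constraints = [
    NonlinearConstraint(time_overhead, 0.0, OV_2),                     # timing bound
    NonlinearConstraint(lambda x: x[0] * x[1], TOTAL_WORDS, np.inf),   # chunk coverage (assumption)
]

res = minimize(energy_overhead, x0=[256.0, 64.0], method="trust-constr",
               bounds=[(1.0, 8192.0), (1.0, MAX_BUFFER_WORDS)], constraints=constraints)
n_ch_opt, s_ch_opt = np.rint(res.x).astype(int)   # round to an integer operating point
print(n_ch_opt, s_ch_opt)
```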

Table I. Optimum Chunk Size Obtained for Different Benchmarks

Benchmark           Optimum protected buffer size (words)
ADPCM encode        11
ADPCM decode        11
JPEG encode         112
JPEG decode         44
G721 encode         16
G721 decode         32
FFT                 128
IFFT                128
BlowFish Encrypt    40
BlowFish Decrypt    40
GSM toast           32
GSM untoast         32

Table I shows the obtained optimum chunk sizes for the examined benchmarks. The table shows that the buffer size can vary significantly based on the target application and the input dataset. While the optimum buffer size is fairly small for most of the applications, the buffer size for FFT and IFFT is significantly larger, at 128 words. This is mainly due to the application semantics: in the case of an N-point FFT, the minimal data chunk size is always N words. This can violate the area requirements if the area overhead is constrained to a small value. Thus, optimizing the physical data memory utilization of the application would significantly improve the effectiveness of this work, and is part of our future work.

6.4. Energy Consumption Overhead

We use the detailed system energy consumption (for every component) to quantify the energy overhead of the different mitigation techniques. Figure 10 shows the energy consumption of the target system with different benchmarks and the aforementioned mitigation techniques. From this figure, we observe that HW mitigation incurs the highest energy overhead among the different mitigation techniques. This is due to the fact that protecting the whole L1 SRAM with a multibit ECC circuit adds a substantial energy overhead to each memory access. Thus, the protected data memory becomes the dominant cause of the increased power consumption. On the contrary, SW mitigation exhibits a more distributed energy contribution from the different modules, since the mitigation is realized by task reexecution, so all components are utilized to regenerate the desired output. We observe the energy overhead of both HW mitigation and SW mitigation to be, on average, 70% higher than the default case, and their maximum energy overhead exceeds 100%.
Similar to SW mitigation, the OCEAN and cross-layer mitigation approaches exhibit distributed energy consumption overheads across the different components. However, Figure 10 shows that OCEAN manages to meet the area and timing constraints, while maintaining error-free processing with a maximum of 22% energy overhead with respect to the default case. On average, we observe that the energy overhead is 10.1%. We observe a higher energy consumption when the cross-layer mitigation approach is applied, namely 17% on average and 55% as peak energy overhead. Thus, OCEAN achieves 65% average and 80% peak overhead energy savings when compared to cross-layer error mitigation.
Moreover, we observe from Figure 10 that the energy profile of the system is highly dependent on the workload behavior. For instance, the energy contribution of the data memory (Data mem in Figure 10) is significant in some applications, such as the ADPCM and G721 benchmarks. Thus, when HW mitigation is deployed, the scratchpad data memory energy is the dominant part of the system energy consumption. This substantial increase in the data memory energy consumption stems from the data-access-intensive nature of such applications, as characterized in MiBench [Guthaus 2001] and MediaBench [Lee et al. 1997].

Fig. 10. Normalized per-component energy consumption (processing core, instruction memory, scratchpad data memory, and protected buffer) for different benchmarks using different mitigation schemes. For each benchmark, the components' energy consumptions are normalized to the total energy consumption of the default case.

On the other hand, the core's energy consumption is significant in the cases where substantial processing is required when mitigation is applied. This is clear in the BlowFish benchmark, in particular when SW mitigation is applied. Despite the variation in the modules' contributions to the overall energy consumption, OCEAN maintains the overhead contribution of each module within similar percentages across all the benchmarks. This is due to the cost function (Eq. (12)) we use in our optimization, and the workload characteristics extraction we perform at design time. On average, we observe that the total 10.1% energy overhead decomposes into 5.21% core energy, 2.36% instruction memory energy, 2.42% data memory energy, and 0.11% protected buffer energy.
In addition to this detailed on-chip component energy consumption analysis, we briefly report the energy figures of the off-chip components (off-chip memory, IO connections, etc.). With the exception of the SW mitigation technique, the off-chip energy consumption is around 40–50% of the total energy consumption of each application in the default case, and it is not affected by applying any of the other mitigation techniques. In the case of the SW mitigation technique, we find that the off-chip energy increases due to application reexecution, reaching up to twice the off-chip energy consumed in the default case. This is yet another reason to discourage the use of SW mitigation in our target domain.

6.5. Execution Time Overhead

We evaluate the execution time overhead of the different applications to show the impact of the mitigation techniques on the application performance. Figure 11 shows the normalized execution time of the different mitigation scenarios when running various benchmarks. As this figure shows, both HW mitigation and SW mitigation violate the time constraints by values exceeding 100%. It is expected for the SW mitigation case to exceed the allowable overhead, due to the reexecution of the program in case of an error. In HW mitigation, however, the memory access time increases substantially due to the multibit protection: memory words must be encoded/decoded to generate the protection code and corrected accordingly if an error occurs. It is important to mention, however, that the HW realization we use may not be the optimal implementation. Based on previous work [Rossi et al. 2011], we expect that the time overhead due to HW-based memory protection can be reduced, yet the system would still experience a significant overhead of around 50%. In contrast, OCEAN manages to maintain the execution time overhead constraint provided at design time.

Fig. 11. Normalized execution time for different benchmarks using different mitigation schemes. For each benchmark, the execution time is normalized to the default case.

This is due to the fact that our overhead models provide highly accurate design-time estimations of the runtime execution overhead, as these models are derived from application profiling on cycle-accurate simulations. In some applications (JPEG and G721), the time overhead constraint (11) is active, hence the time overhead is 10%. In other applications (ADPCM), the delay constraint (11) is not active, and the time overhead drops to only 2%. This is because the ADPCM benchmarks are lightweight, so the mitigation delay overhead of the checkpoint and rollback technique is negligible. Moreover, we observe that OCEAN brings additional time savings with respect to conventional cross-layer error mitigation: OCEAN lowers the execution time overhead by 65% on average and by 85% in the best case.

6.6. Varying Error Rate Analysis

In addition to demonstrating our proposal's effectiveness at high error rates, we also analyse the sensitivity of all examined mitigation techniques (SW mitigation, HW mitigation, OCEAN, and cross-layer) to varying error rates. In particular, we analyse the energy and time overheads observed when we run the examined benchmarks with the aforementioned mitigation techniques while varying the error rate (SER) from 10⁻⁶ to 10⁻¹² words per cycle. For brevity, we only show the analysis results for the ADPCM decode benchmark (refer to Section 6.1), but similar behavior is observed in all the other benchmarks.
Figure 12 shows the execution time and energy consumption sensitivities of each of the examined mitigation techniques to varying error rates. As this figure shows, each mitigation technique behaves differently. SW mitigation significantly reduces the average energy and time overheads as the error rate decreases, even attaining acceptable overheads (less than 10%) that surpass our proposed technique at very low error rates; however, its worst-case scenarios at each error rate violate the system time constraint. This worst-case performance is indeed the downfall of SW mitigation if it is used in systems with guaranteed error correction requirements and tight timing constraints.
We also observe in Figure 12 that the HW mitigation overheads behave differently. The time overhead improves to acceptable values (less than 10%) as the error rate decreases, because fewer error corrections are performed. However, this improvement in time overhead does not translate to the energy figures, since the significant energy overhead (more than 100%) is caused by the multibit ECC codes: the multibit ECC unit in the HW mitigation technique generates a codeword for each written memory word. Even though the codeword is generated simultaneously with the actual word storage, its energy consumption cannot be hidden by the energy consumption of the other units.

Fig. 12. Energy and time overheads for various error mitigation techniques with varying error rates. All the results are normalized to the default case where no error mitigation is applied. The error bars indicate the worst-case observed scenario at each error rate.

Finally, Figure 12 shows that our proposed technique exhibits a consistent overhead behavior (within 10%) across all error rates. In all error rate cases, our proposal guarantees, in both the average and worst cases, the satisfaction of the time constraint, outperforming the cross-layer approach (refer to Section 6.3). Moreover, our proposal consumes the least energy in all worst cases, exceeding the savings achieved by cross-layer mitigation. However, our proposal can be surpassed by SW mitigation techniques that utilize either backward or forward error correction-based mechanisms. These techniques are effective when the target system can tolerate erroneous operation [Leem et al. 2010], and when best-effort (not guaranteed) error correction is satisfactory.

7. CONCLUSION

In this article, we have proposed a novel error mitigation mechanism, called OCEAN, that relies on a hybrid HW/SW mechanism. With OCEAN, we reinforce the error-prone on-chip SRAMs with a fault-tolerant memory buffer of minimal capacity to ensure error-free operation. We utilize this buffer to temporarily store a chunk of data at the end of each periodic computation phase, which can then be used to restore that data if it is later found faulty. We optimally select the size of this data chunk and the number of computation phases to minimize the energy overhead, subject to system constraints that are decided beforehand by the system designers. Our experimental results have explored the impact of a single error on the system overheads, and highlighted the existence of an optimal solution that can be found with OCEAN. Moreover, we have shown that OCEAN achieves full error mitigation with only 10.1% average energy overhead (and 22% overhead in the worst case) with respect to a baseline system operation, while guaranteeing all the design-time constraints.

REFERENCES

F. Abate, L. Sterpone, and M. Violante. 2008. A new mitigation approach for soft errors in embedded processors. IEEE Trans. Nuclear Sci. 55, 4.
M. Agostinelli, J. Hicks, J. Xu, B. Woolery, K. Mistry, et al. 2005a. Erratic fluctuations of sram cache vmin at the 90nm process technology node. In Proceedings of the IEEE International Electron Devices Meeting Technical Digest (IEDM'05). 655–658.
M. Agostinelli, S. Pae, W. Yang, C. Prasad, D. Kencke, et al. 2005b. Random charge effects for pmos nbti in ultra-small gate area devices. In Proceedings of the 43rd Annual IEEE International Reliability Physics Symposium (IRPS'05). 529–532.
AMD. 2001. AMD eighth-generation processor architecture. AMD white paper. http://intel80386.com/amd/k8 architecture.pdf.
K. Amini and M. Peyghami. 2006. Complexity analysis of interior-point methods for linear optimization based on some conditions on kernel function. Elsevier Appl. Math. Comput. 176, 1.
R. Baumann. 2002. The impact of technology scaling on soft error rate performance and limits to the efficacy of error correction. In Proceedings of the International Electron Devices Meeting (IEDM'02). 329–332.
M. S. Bazaraa, H. D. Sherali, and C. M. Shetty. 2006. Nonlinear Programming: Theory and Algorithms. John Wiley and Sons.
L. Benini, D. Bertozzi, A. Bogliolo, F. Menichelli, and M. Olivieri. 2005. MPARM: Exploring the multi-processor soc design space with systemc. J. VLSI Signal Process. Syst. 41, 2.
K. Bhattacharya, N. Ranganathan, and S. Kim. 2009. A framework for correction of multi-bit soft errors in l2 caches based on redundancy. IEEE Trans. VLSI Syst. 17, 2.
CACTI. 2008. CACTI, an integrated cache access time, cycle time, area, leakage, and dynamic power model for uniform and non-uniform cache architectures. www.cs.utah.edu/~rajeev/cacti6/.
E. de Kock. 2002. Multiprocessor mapping of process networks: A jpeg decoding case study. In Proceedings of the 15th International Symposium on System Synthesis (ISSS'02). 68–73.
A. Ejlali, B. M. Al-Hashimi, and P. Eles. 2009. A standby-sparing technique with low energy-overhead for fault-tolerant hard real-time systems. In Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'09). 193–202.
P. Eles, V. Izosimov, P. Pop, and Z. Peng. 2008. Synthesis of fault-tolerant embedded systems. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE'08). 1117–1122.
Fmincon. 2013. Fmincon: Minimization of non-linear multivariate functions. www.mathworks.com/help/toolbox/optim/ug/fmincon.html.
D. Gizopoulos, M. Psarakis, S. V. Adve, P. Ramachandran, and S. K. S. Hari. 2011. Architectures for online error detection and recovery in multicore processors. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE'11).
M. S. Gupta, K. K. Rangan, M. D. Smith, G. Y. Wei, and D. Brooks. 2008a. DeCoR: A delayed commit and rollback mechanism for handling inductive noise in processors. In Proceedings of the 14th IEEE International Symposium on High Performance Computer Architecture (HPCA'08). 381–392.
S. Gupta, S. Feng, A. Ansari, J. Blome, and S. Mahlke. 2008b. StageNetSlice: A reconfigurable microarchitecture building block for resilient cmp systems. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES'08). 1–10.

M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. 2001. MiBench: A free, commercially representative embedded benchmark suite. In Proceedings of the IEEE International Workshop on Workload Characterization (WWC'01). 3–14.
J. Henkel, L. Bauer, J. Becker, O. Bringmann, U. Brinkschulte, et al. 2011. Design and architectures for dependable embedded systems. In Proceedings of the 9th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'11). 69–78.
L. Huang, F. Yuan, and Q. Xu. 2009. Lifetime reliability-aware task allocation and scheduling for mpsoc platforms. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE'09).
R. Hyman, K. Bhattacharya, and N. Ranganathan. 2009. A strategy for soft error reduction in multi core designs. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS'09). 2217–2220.
E. Ibe, H. Taniguchi, Y. Yahagi, K. Shimbo, and T. Toba. 2010. Impact of scaling on neutron-induced soft error in srams from a 250 nm to a 22 nm design rule. IEEE Trans. Electron Devices 57, 7, 1527–1538.
G. Karakonstantis, C. Roth, C. Benkeser, and A. Burg. 2012. On the exploitation of the inherent error resilience of wireless systems under unreliable silicon. In Proceedings of the 49th Annual Design Automation Conference (DAC'12). 510–515.
J. Kim, N. Hardavellas, K. Mai, B. Falsafi, and J. Hoe. 2008. Multi-bit error tolerant caches using two-dimensional error coding. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'08). 197–209.
P. Kongetira, K. Aingaran, and K. Olukotun. 2005. Niagara: A 32-way multithreaded sparc processor. IEEE Micro 25.
S. Krishnamohan and N. R. Mahapatra. 2005. Combining error masking and error detection plus recovery to combat soft errors in static cmos circuits. In Proceedings of the International Conference on Dependable Systems and Networks (DSN'05). 40–49.
A. Kumar, H. Corporaal, B. Mesman, and Y. Ha. 2011. Multimedia Multiprocessor Systems Analysis, Design and Management. SpringerLink.
C. Lee, M. Potkonjak, and W. H. Mangione-Smith. 1997. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO'97). 330–335.
K. Lee, A. Shrivastava, M. Kim, N. Dutt, and N. Venkatasubramanian. 2008. Mitigating the impact of hardware defect on multimedia application - A cross-layer approach. In Proceedings of the 16th ACM International Conference on Multimedia (MM'08). 319–328.
L. Leem, H. Cho, J. Bau, Q. A. Jacobson, and S. Mitra. 2010. ERSA: Error resilient system architecture for probabilistic applications. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'10).
T. Li, R. Ragel, and S. Parameswaran. 2012. Reli: Hardware/software checkpoint and recovery scheme for embedded processors. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'12).
M. Lukasiewycz, M. Glass, and J. Teich. 2009. Exploiting data-redundancy in reliability-aware networked embedded system design. In Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'09). 229–238.
M. Manoochehri, M. Annavaram, and M. Dubois. 2011. Cppc: Correctable parity protected cache. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA'11). 223–234.
M. May, M. Alles, and N. Wehn. 2008. A case study in reliability-aware design: A resilient ldpc code decoder. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'08). 456–461.
J. Mitchell, D. Henderson, and G. Ahrens. 2005. IBM power 5 processor-based servers: A highly available design for business-critical applications. IBM white paper. http://www-07.ibm.com/systems/includes/pdf/power5 ras.pdf.
S. Mitra. 2008. Globally optimized robust systems to overcome scaled cmos reliability challenges. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'08).
S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim. 2005. Robust system design with built-in soft-error resilience. Comput. 38, 2.
R. H. Morelos-Zaragoza. 2002. The Art of Error Correcting Coding. Wiley.
S. S. Mukherjee, J. Emer, and S. K. Reinhardt. 2005. The soft error problem: An architectural perspective. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA'05). 243–247.

M. Nicolaidis. 2005. Design for soft error mitigation. IEEE Trans. Device Mater. Reliab. 5, 3.
A. K. Nieuwland, S. Jasarevic, and G. Jerin. 2006. Combinational logic soft error analysis and protection. In Proceedings of the 12th IEEE International Symposium on On-Line Testing (IOLTS'06). 99–104.
NXP. 2014. NXP arm-based microcontrollers. www.nxp.com/documents/data sheet/LH7A400 N.pdf.
S. Paul, F. Cai, X. Zhang, and S. Bhunia. 2011. Reliability-driven ecc allocation for multiple bit error resilience in processor cache. IEEE Trans. Comput. 60, 1, 20–34.
P. Pop, V. Izosimov, P. Eles, and Z. Peng. 2009. Design optimization of time- and cost-constrained fault-tolerant embedded systems with checkpointing and replication. IEEE Trans. VLSI Syst. 17, 3, 389–402.
D. K. Pradhan. 1996. Fault-Tolerant Computer System Design. Prentice Hall.
M. Prvulovic, Z. Zhang, and J. Torrellas. 2002. Revive: Cost-effective architectural support for rollback recovery in shared-memory multiprocessors. In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA'02). 111–122.
S.-S. Pyo, C.-H. Lee, G.-H. Kim, K.-M. Choi, Y.-H. Jun, and B.-S. Kong. 2009. 45nm low-power embedded pseudo-sram with ecc-based auto-adjusted self-refresh scheme. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS'09). 2517–2520.
D. Rossi, N. Timincini, M. Spica, and C. Metra. 2011. Error correcting code analysis for cache memory high reliability and performance. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'11).
M. M. Sabry, D. Atienza, and F. Catthoor. 2012. A hybrid hw-sw approach for intermittent error mitigation in streaming-based embedded systems. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'12).
R. A. Shafik, B. M. Al-Hashimi, and K. Chakrabarty. 2010. Soft error-aware design optimization of low power and time-constrained embedded systems. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'10).
D. P. Siewiorek and R. S. Swarz. 1998. Reliable Computer Systems: Design and Evaluation. A. K. Peters.
D. J. Sorin, M. M. K. Martin, M. D. Hill, and D. A. Wood. 2002. Safetynet: Improving the availability of shared memory multiprocessors with global checkpoint/recovery. In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA'02). 123–134.
H. Sun, N. Zheng, and T. Zhang. 2009. Leveraging access locality for the efficient use of multibit error-correcting codes in l2 cache. IEEE Trans. Comput. 58, 1297–1306.
TIC60. 2014. Texas instruments c60 member data sheet. http://focus.ti.com/lit/ds/symlink/tms320c6424.pdf.
M. Vayrynen, V. Singh, and E. Larsson. 2009. Fault-tolerant average execution time optimization for general-purpose multi-processor system-on-chips. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'09).
X. Vera, J. Abella, J. Carretero, and A. Gonzalez. 2009. Selective replication: A lightweight technique for soft errors. ACM Trans. Comput. Syst. 27, 4.
D. Zhu and H. Aydin. 2009. Reliability-aware energy management for periodic real-time tasks. IEEE Trans. Comput. 58, 10.

Received August 2012; revised May 2013; accepted November 2013
