A Framework for Supporting Adaptive Fault Tolerant Solutions

[INVITED FROM VIPES'2013]

A Framework for Supporting Adaptive Fault Tolerant Solutions

KOSTAS SIOZIOS, National Technical University of Athens
DIMITRIOS SOUDRIS, National Technical University of Athens
MICHAEL HÜBNER, Ruhr-University Bochum

For decades computer architects pursued one primary goal: performance. The ever-faster transistors provided by Moore's law were translated into remarkable gains in operation frequency and power efficiency. However, device-level scaling and architectural complexity impose several new challenges, including a decrease in dependability due to physical failures. In this paper we propose a software-supported methodology based on game theory for adapting the aggressiveness of fault tolerance at run-time. Experimental results prove the efficiency of our solution, as it achieves fault masking comparable to relevant solutions but at significantly lower mitigation cost. More specifically, our framework speeds up the identification of suspicious for failure resources on average by 76%, as compared to the HotSpot tool. Similarly, the introduced solution leads to average Power×Delay Product (PDP) savings of 53% compared to the existing TMR approach.

General Terms: Algorithms; Reliability

Additional Key Words and Phrases: Adaptive Fault Tolerance; FPGA; Reliability; CAD Tool

ACM Reference Format:
Kostas Siozios, Dimitrios Soudris and Michael Hübner, 2013. A Framework for Supporting Adaptive Fault Tolerant Solutions. ACM Trans. Embed. Comput. Syst. 9, 4, Article 39 (March 2010), 23 pages.
DOI:http://dx.doi.org/10.1145/0000000.0000000

1. INTRODUCTION

The rapid progress of process technology has resulted in tremendous growth in capacity and performance for Field-Programmable Gate Arrays (FPGAs). Even though this trend has alleviated the architects' concern about designing ever-faster devices, it has introduced new headaches. Among others, reliability degradation becomes an important issue not only during the fabrication process, but also over the product's lifetime [ITRS 2012]; the goal for architects today is to design reliable products from unreliable components. Defects in VLSI designs are tightly coupled to the operating conditions. Specifically, the Mean Time To Failure (MTTF) due to various intrinsic reliability factors, such as electromigration [Srinivasan et al. 2004], Time-Dependent Dielectric Breakdown (TDDB) and NBTI [Wang et al. 2007], varies exponentially with on-chip temperature [Gielen et al. 2008] [Black 1969] [Mangalagiri et al. 2008]. This problem becomes far more important if we take into consideration the non-uniformity of the thermal profile over the chip's area due to variations in


[Figure 1: taxonomy of available fault tolerant solutions, classified by what they protect, what they modify, whether they are applied uniformly, and whether they operate at hardware- or software-level. Software-level entries include techniques that protect routing by modifying the routing algorithm (applied uniformly) [Rubin and DeHon 2009] [Sivaswamy and Bazargan 2008] [Campregher et al. 2005], defect maps (protect logic & routing, modify nothing, not applied uniformly) [Jain et al. 2006] [Doumar et al. 1999], TMR (protects logic, modifies HDL, applied uniformly) [Xilinx 2011c] [Johnson and Wirthlin 2010] [Pratt et al. 2006], and the proposed work (not applied uniformly) [Siozios and Soudris 2010]; hardware-level entries include spare routing (protects routing, modifies hardware, applied uniformly) [Yu and Lemieux 2005].]

Fig. 1. Taxonomy of various pioneering works on fault tolerance.

the power density of components. For instance, an intra-die temperature variation of up to 20°C was observed on a Virtex-4 FPGA [Sundararajan et al. 2006].

In order to prevent, or at least alleviate, the consequences of reliability degradation, a number of fault tolerant mechanisms have been proposed. The term fault tolerant refers to a design able to continue its operation, possibly at a reduced level, rather than failing completely when some part of the system fails. Even though fault tolerance might be a pre-requisite for existing architectures, the excessive mitigation cost makes it affordable only for mission-critical systems. However, there are numerous applications that can afford lower fault coverage in exchange for significantly reduced mitigation cost. The importance of designing fault-tolerant systems is much more crucial for FPGAs than for other platforms, because upsets at FPGAs can alter the design itself, not just user data. Additionally, applications mapped onto an FPGA utilize only a subset of the actually fabricated resources, and hence only a portion of the upsets may result in failures. Consequently, FPGA-specific mitigation techniques, able to provide a reasonable balance between the desired fault masking and the performance degradation, are required.

Up to now, a number of architectures and design methodologies able to provide non-disrupted device operation have been proposed at different levels of abstraction, as depicted in Fig. 1 [Avizienis et al. 2004] [Cheatham et al. 2006]. Specifically, in the literature there are two mainstream approaches for designing fault tolerant systems. The first deals with the design of new, fault tolerance enabled hardware elements [Nikolic et al. 2002] [Kastensmidt et al. 2006] [Yu and Lemieux 2005] [Campregher et al. 2005], whereas in the second approach the desired fault masking is provided at software-level with the usage of specialized CAD tools [Bhaduri and Shukla 2004] [Xilinx 2011d] [Jain et al. 2006] [Doumar et al. 1999] [Koren and Krishna 2007]. Both approaches exhibit advantages and disadvantages. The (re-)designed hardware blocks can either replace existing components of conventional FPGAs, or new fault tolerant architectures can be designed to improve robustness. The drawback of such a strategy is the increased design complexity, while the derived FPGA also provides only a static (defined at fabrication time) fault tolerant mechanism. Typical instantiations of this approach involve the usage of spare logic and routing resources [Yu and Lemieux 2005] [Campregher et al. 2005], as well as the defense-grade [Xilinx 2011a] and space-grade [Xilinx 2011b] FPGA devices.


On the other hand, software-based fault masking combines the required dependability level with the low cost of commodity devices. For instance, by appropriately exploiting the inherent re-programmability of FPGAs, it is possible to recover from hardware malfunctions. However, software-based fault tolerant systems assume that the designer is responsible for protecting the design. Since this approach does not impose any hardware modifications, it is widely accepted for research and product development. Among others, algorithms that perform application P&R under fault tolerance [Jain et al. 2006] [Doumar et al. 1999] and/or reliability [Sivaswamy and Bazargan 2008] constraints have been proposed, whereas a solution that embeds testing operations and alternative configurations into bitstream files is discussed in [Rubin and DeHon 2009]. Also, there are solutions that perform application P&R by taking into consideration a defect map of the target FPGA, in order to avoid utilizing non-functional hardware resources (either routing or logic blocks).

The only available commercial product for providing fault masking is based on triplication of the application's functionality without imposing any architectural modifications [Xilinx 2011d]. This approach, known as Triple Modular Redundancy (TMR), assumes three functional blocks that work in parallel, and the output is derived by comparing their partial outputs with a majority voter. Although TMR provides the maximum fault coverage for the logic infrastructure, the additional (replica) functionality introduces considerable performance, power and area overheads, which often violate the system's specifications [Carmichael 2006]. For this purpose, a number of improvements have been proposed. A partial TMR, where redundancy is applied selectively only to portions of the design, is discussed in [Pratt et al. 2006]. Similarly, an algorithm for inserting alternative voters in designs with TMR is presented in [Johnson and Wirthlin 2010]. A methodology that supports application mapping with the maximum affordable (in terms of delay and power overheads) redundancy can be found in [Siozios and Soudris 2010]. Even though these approaches improve performance and power consumption metrics, all of them are applicable only at design-time.

In this paper we introduce a software-supported framework that allows balancing, at run-time, the fault masking and the consequent mitigation cost. The introduced framework, also called adaptive TMR, employs a game theory approach in order to identify and protect with redundancy only the sensitive (suspicious to fail) sub-circuits of the design. Different reliability degradation mechanisms could be taken into account for this analysis (e.g., TDDB, NBTI, electromigration, etc.), whereas for the scope of this work the injected upsets follow the NBTI reliability degradation model. By applying the proposed framework to the 20 biggest MCNC benchmarks, we speed up the identification of suspicious for failure resources on average by 76%, as compared to the well-established HotSpot tool. Even though one might expect that such a speedup sacrifices the quality of the derived solutions, our analysis depicts that the average error is only 6%, which can be considered affordable for consumer (non-mission-critical) products. Similarly, the introduced solution leads to an average Power×Delay Product (PDP) saving of 53% compared to conventional TMR.
The contributions of this paper are summarized as follows:

(1) Introduction of adaptive TMR, a novel software-supported methodology for enabling rapid exploration of different scenarios of fault tolerance at FPGAs. Unlike similar approaches that focus on Single Event Upsets (SEUs), our solution aims to provide fault masking against aging phenomena. Furthermore, since the proposed methodology does not impose any hardware modification, it is also applicable to commercial devices (e.g., in conjunction with the Altera Quartus Framework [Altera 2011a]).


(2) Rather than inserting redundancy into the entire FPGA, sensitive sub-circuits are identified with a game theory approach, and then only the application's functionalities mapped onto these sub-circuits are protected with the introduced adaptive TMR. Hence, the proposed framework can be thought of as a proactive approach for alleviating the consequences of reliability degradation at FPGAs.

(3) Development of a new fault injection tool targeting FPGA devices. This tool supports control over where (spatial criterion) and when (temporal criterion) faults occur, while it also enables observability (e.g., monitoring the internal values of a circuit after fault injection) and error propagation analysis.

(4) By providing a spectrum of solutions that trade off fault masking and mitigation cost, the designer can select the operating point that best matches the system's specifications.

The rest of the paper is organized as follows: Section 2 discusses the motivation of this work, Section 3 presents the target platform, and the proposed adaptive TMR methodology is described in Section 4. Section 5 provides a number of quantitative results that prove the efficiency of the proposed framework. Finally, conclusions are summarized in Section 6.

2. MOTIVATION

The impact of intrinsic failure mechanisms on life-time reliability depends tightly both on the operating conditions and on design-specific attributes. Previous studies have shown that hardware resources with increased power densities and thermal stress exhibit higher failure probability [Gielen et al. 2008] [Black 1969] [Mangalagiri et al. 2008]. This mainly occurs because a number of important device parameters, such as mobility, threshold voltage, and saturation velocity, depend on temperature. For instance, the power density, and hence the thermal stress, of an application's kernel mapped onto an FPGA depends on the decisions made during P&R. Hence, a potential solution for designing an efficient fault tolerant system is to identify regions of the device that include hardware resources with increased temperature values, and then to selectively protect these resources by applying fault detection and correction mechanisms. Such an approach allows balancing the desired fault masking against the consequent mitigation cost. To this end, this section proposes a method for the rapid classification of the FPGA's hardware resources as either suspicious to fail, or not.

Throughout this paper, we investigate the impact of the Negative Bias Temperature Instability (NBTI) physical degradation on FPGAs. This phenomenon has recently gained a lot of attention due to its increasingly adverse impact on nanometer CMOS technology. NBTI is typically seen as a threshold voltage shift after a negative bias has been applied to a MOS gate at elevated temperature. The NBTI phenomenon mainly affects pMOS transistors, while degradation of channel carrier mobility is also observed [Mahapatra et al. 2005]. Equation 1 gives the MTTF (mean time to failure) due to NBTI:

MTTF = A_NBTI × (1/V_gs)^γ × exp(E_a / (k × T))    (1)

where A_NBTI is a process-related constant, V_gs is the gate voltage, γ is the voltage acceleration factor, E_a is the activation energy, k is the Boltzmann constant, and T denotes the on-chip temperature. Note that the selection of the NBTI degradation mechanism does not affect the efficiency of the proposed methodology, which is also applicable to any other aging degradation phenomenon (e.g., electromigration, TDDB, hot-carrier, etc.).
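As an illustration of Equation 1, the following minimal Python sketch evaluates the MTTF trend; the parameter values are placeholders chosen for demonstration only, not calibrated process data.

```python
import math

K_BOLTZMANN = 8.617e-5  # Boltzmann constant (eV/K)

def nbti_mttf(v_gs, temp_k, a_nbti=1.0, gamma=6.0, e_a=0.5):
    """Mean Time To Failure due to NBTI (Equation 1).

    v_gs   : gate voltage (V)
    temp_k : on-chip temperature (K)
    a_nbti, gamma, e_a : process-related constant, voltage acceleration
    factor, and activation energy (eV); placeholder values only.
    """
    return a_nbti * (1.0 / v_gs) ** gamma * math.exp(e_a / (K_BOLTZMANN * temp_k))

# A hotter slice ages faster: the MTTF ratio below is greater than 1,
# i.e., MTTF shrinks as the on-chip temperature rises.
print(nbti_mttf(v_gs=1.2, temp_k=330.0) / nbti_mttf(v_gs=1.2, temp_k=360.0))
```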


[Figure 2: two 60×60 slice maps of the des benchmark; the color scale runs from 0% (minimum value) to 100% (maximum value).]

Fig. 2. Spatial distribution of (a) power consumption and (b) temperature values regarding the des benchmark.

Figs. 2(a) and 2(b) plot the spatial distribution of power sources and temperature values at slice level, respectively, for the des benchmark [Yang 1991]. These values were retrieved after the application's P&R onto a Stratix-like FPGA device [Altera 2011b] with the VPR tool [Betz et al. 1999], whereas the power and thermal profiles were retrieved with two tools well established in the FPGA research community, namely PowerModel [Poon et al. 2005] and HotSpot [Kong et al. 2012], respectively. Note that, up to this point, the application description does not include any fault tolerance. Different colors in Fig. 2 denote regions of the device with different power and temperature values; the closer a region is to red, the higher the power and temperature at which the corresponding hardware resources (i.e., slices) operate. For demonstration purposes, both parameters are plotted in a normalized manner according to Equations 2 and 3:

power(i,j) = (power consumption at slice(i,j)) / (maximum power consumption of any slice over the FPGA)    (2)

temperature(i,j) = (temperature at slice(i,j)) / (maximum temperature of any slice over the FPGA)    (3)
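A minimal sketch of this normalization, assuming the per-slice power (or temperature) values are available as a 2-D array (a hypothetical data layout):

```python
import numpy as np

def normalize(per_slice_values):
    """Normalize a 2-D per-slice map to [0, 1] (Equations 2 and 3):
    every slice is divided by the maximum value over the FPGA."""
    values = np.asarray(per_slice_values, dtype=float)
    return values / values.max()

# Example: a toy 3x3 power map; the hottest slice maps to 1.0.
power_map = [[0.2, 0.5, 0.1],
             [0.9, 1.8, 0.4],
             [0.3, 0.6, 0.2]]
print(normalize(power_map))
```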

Based on Fig. 2, we can conclude that power and temperature values are not constant over the FPGA, since both vary considerably between any two arbitrary points (i1,j1) and (i2,j2) of the device. However, it is possible to determine regions with similar (e.g., excessively high or low) values. Note that these regions are not common among applications, and even for a given application there may be variations depending on the selected optimization strategy (e.g., timing-driven, power-driven, etc.). Hence, the challenge the designer faces is to choose only the actually needed fault coverage, considering the associated spatial information from Fig. 2.

The first task in applying such a localized redundancy scheme is the identification of suspicious for failure hardware resources. Figs. 3(a) and 3(b) depict the spatial location of these resources over the FPGA architecture, as classified by applying the threshold of Equation 4 to the power and temperature values, respectively.


[Figure 3: three 60×60 slice maps of the des benchmark, with panels "Power Consumption (PowerModel)", "Temperature (Quick_Hotspot)", and "Differences (XOR)"; black marks slices with Q = 0.5 ≤ value(i,j) ≤ 1.0, white marks slices with 0 ≤ value(i,j) < Q = 0.5.]

Fig. 3. Suspicious for failure hardware resources regarding the des benchmark: (a) based on power consumption, (b) based on temperature values, and (c) differences between the two previously mentioned distributions.

The value of this threshold, denoted Q (where 0 < Q ≤ 1), is designer-specified and depends on the aggressiveness of the applied fault masking:

value(i,j) ≥ Q:  slice(i,j) is suspicious for failure → yes (black color);
value(i,j) < Q:  slice(i,j) is suspicious for failure → no (white color).    (4)
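A minimal sketch of the classification rule of Equation 4, together with the exclusive disjunction used in Fig. 3(c) to highlight disagreements between the power-based and thermal-based maps; the input arrays are hypothetical normalized maps per Equations 2 and 3:

```python
import numpy as np

def suspicious(normalized_map, q=0.5):
    """Equation 4: a slice is suspicious for failure (True/black)
    when its normalized value is >= Q, otherwise not (False/white)."""
    return np.asarray(normalized_map) >= q

power_based = suspicious([[0.2, 0.7], [0.9, 0.4]])     # cf. Fig. 3(a)
thermal_based = suspicious([[0.3, 0.6], [0.8, 0.55]])  # cf. Fig. 3(b)

# Slices classified as suspicious by only one of the two methods (Fig. 3(c)).
difference = np.logical_xor(power_based, thermal_based)
print(difference.sum(), "slices disagree")
```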

For demonstration purposes, we assume (without affecting the generality of our solution) that Q = 0.5. With this threshold, 601 of the 1,581 utilized slices are characterized as suspicious to fail based on the distribution of power consumption (Fig. 3(a)), whereas the more accurate approach (Fig. 3(b)) reports that 727 slices belong to regions with increased probability of failure. The error of 8% between the two alternative methods occurs mainly due to the thermal diffusion effect, which is not taken into consideration in Fig. 3(a). This is also highlighted in Fig. 3(c), where we plot (in black) the slices of the des benchmark that are characterized as suspicious to fail by only one of the two alternative methods (by taking the exclusive disjunction (XOR) of Figs. 3(a) and 3(b)). Even though such an error is not negligible, it can be considered affordable if we take into account the significantly lower execution time for identifying suspicious to fail resources based on power sources (∼1 second), as compared to thermal analysis (∼1 hour [Siozios et al. 2011]).

3. TARGET PLATFORM

Fig. 4 depicts a schematic view of the employed architecture. A fundamental role in this architecture is given to the virtual FPGA (V-FPGA), which is an island-style FPGA. The novelty of this approach is that the FPGA is realized as virtual reconfigurable hardware on top of a traditional off-the-shelf FPGA device. Additionally, as the architectural properties (e.g., the LUT size and the number of LUTs per slice) of each V-FPGA are easily customizable at design-time (through a graphical user interface) depending on the application's requirements, the employed V-FPGA can be thought of as an optimized application-specific platform. The advantage of this approach is that the specification of the virtual FPGA stays unchanged, independent of the underlying hardware, and therefore provides features that the exploited physical host FPGA cannot. This becomes far more important regarding the target

[Figure 4: block diagram of the target MPSoC platform, realized on a Virtex-6 FPGA: a microprocessor core (μP), memory controller, configuration controller, arbiter/MUX, and AHB/APB bridge communicate over AMBA AHB/APB busses, while a dedicated configuration bus connects the configuration controller to several V-FPGA cores.]

Fig. 4. Overview of the virtual FPGA (V-FPGA) architecture.

application domain of our analysis (consumer products), where low-cost reconfigurable platforms, usually without support for dynamic reconfiguration, are employed.

Apart from the V-FPGA, the target platform also includes a microprocessor core, a configuration controller, some off-chip memories, and busses for on-chip communication and configuration, as depicted in Fig. 4. While the microprocessor is well suited for control-oriented tasks and interfacing, the V-FPGA cores add the advantage of efficient parallel data processing to the system. On-chip communication is realized by AMBA busses. The microprocessor can access the V-FPGA cores over the AMBA APB bus, since each core acts as an APB slave. The microprocessor also communicates with the configuration controller over the AMBA APB bus; it specifies which bitstream file the configuration controller should load into a certain V-FPGA core. The configuration controller can access the memory controller over the AMBA AHB bus. This requires a bus arbiter, as there are two masters on the AMBA AHB: the microprocessor and the configuration controller. Additional details about the architecture of the V-FPGA can be found in [Hübner et al. 2011].

Regarding the architecture of each V-FPGA core, it is a generic recent FPGA similar to Xilinx Virtex [Xilinx 2011c] and Altera Stratix [Altera 2011b] devices, as depicted in Fig. 5(a), consisting of an array of slices. The term slice refers to the Configurable Logic Block (CLB), the upper and right routing segments, as well as the corresponding switch box (see Fig. 5(b)). In order to label these resources we use Cartesian co-ordinates. The next level of hierarchy assumes that CLBs are formed by a number of Basic Logic Elements (BLEs), each of which consists of a Look-Up Table (LUT), a flip-flop, some multiplexers, as well as the required wires for local connectivity (Fig. 5(c)). Such an architectural arrangement allows local interconnects inside CLBs to be optimized in terms of delay and power consumption [Betz et al. 1999].

[Figure 5: (a) the slice array of a V-FPGA core, including embedded RAM and DSP columns; (b) a slice, comprising a CLB, its upper and right routing segments, and the corresponding switch box (SB); (c) the internal structure of a CLB with N BLEs sharing inputs, outputs, and clock.]

Fig. 5. Architecture template of the employed FPGA device.

The main architectural features of the employed CLB are summarized as follows:

— Cluster of 4 Basic Logic Elements (BLEs).
— 4-input Look-Up Table (LUT) per BLE.
— One double edge-triggered flip-flop per BLE.
— 10 inputs and 4 outputs provided by each CLB.
— A 14-to-1 multiplexer at every LUT input, in order to obtain a fully connected CLB.
— All 4 outputs can be registered.
— A single asynchronous clear signal for the whole CLB.
— A single clock signal for the whole CLB.
— One gated clock signal per BLE and CLB.
— One power gate signal per BLE and CLB.

The last two of the previously mentioned architectural parameters, namely the gated clock and the power gate, are employed in our adaptive TMR methodology to support the insertion and removal of fault tolerance at run-time. More specifically, the goal of clock gating is to isolate unutilized flip-flops and BLEs from the clock network in order to reduce the transition activity, and thus the effective capacitance. Similarly, power gating allows temporarily shutting down unused BLEs and CLBs in order to reduce the overall leakage power of the architecture. In order to realize the switching on/off of BLEs, our methodology takes advantage of the partial reconfiguration feature provided by the V-FPGA. Even though throughout this paper we use an instance of the V-FPGA where partial reconfiguration is performed at slice level, coarser granularities (e.g., the column-based approach employed by Xilinx Virtex FPGAs) can also be handled. Note that our target FPGA platform does not incorporate any special blocks or mechanisms aiming to support fault tolerance. Hence, the proposed methodology is applicable not only to the architecture discussed in this section, but to any other FPGA device. The next section describes in detail the proposed methodology for supporting application implementation under adaptive fault masking.

4. PROPOSED FAULT TOLERANT METHODOLOGY

This section introduces the proposed methodology for supporting adaptive fault masking. Unlike relevant approaches, where redundancy is applied uniformly over the entire architecture [Xilinx 2011d] [Carmichael 2006], the proposed solution selectively triplicates only the application's functionalities mapped onto suspicious for failure slices. As already mentioned, this strategy requires partial reconfiguration at slice level, which is supported by our V-FPGA platform. Such a selective insertion of fault tolerance, named adaptive TMR, provides a trade-off between the desired fault masking and the affordable performance degradation due to redundant hardware. Fig. 6 depicts the proposed methodology, which consists of two complementary stages: (i) static fault prevention, applied at design-time, and (ii) adaptive fault tolerance, which modifies the aggressiveness of the employed fault tolerance at run-time as a response to the reliability degradation posed by the application's execution.

[Figure 6: flow chart of the framework. Design-time (static): Application (HDL) → Insertion of TMR (similar to [Xilinx 2011d]) → Synthesis (Quartus) → P&R (VPR) → Compute distribution of power sources → Identify suspicious to fail resources → Selective elimination of TMR, with reliability degradation modeled via AgeSim. Run-time (adaptive): Application execution → Monitor distribution of power sources → Identify suspicious to fail resources → Selectively enable/disable fault tolerance. The legend distinguishes inputs/outputs, existing software tools, new software tools, and the task solved with game theory.]

Fig. 6. The proposed framework for supporting adaptive TMR.

Initially, the application's HDL description is triplicated following a technique similar to Xilinx TMR [Xilinx 2011d]. At the output of this step, the application is annotated with the maximum possible fault tolerance. Note that, even with commercial tools, it is not possible to further improve the fault coverage without using a special-purpose device (e.g., defense-grade [Xilinx 2011a], or space-grade [Xilinx 2011b] FPGAs). The next task involves application synthesis and technology mapping with the Quartus II software [Altera 2011a]. The derived netlist is placed and routed onto the target FPGA with the VPR tool [Betz et al. 1999], and then the power consumption per slice is computed with PowerModel [Poon et al. 2005]. Different scenarios for power analysis are feasible by appropriately selecting the test vectors (e.g., worst-case, normal operation, etc.). By intersecting the results from P&R, the power consumption per slice, and the reliability degradation retrieved from AgeSim [Huang and Xu 2010], it is possible to identify the suspicious for failure resources (similar to Fig. 3(a)), which are candidates for protection with TMR. This task is solved with the employed game theory approach.

The remaining steps deal with the evaluation of the derived fault tolerant solution. For this purpose, a number of upsets that model reliability degradation are injected into the configuration file. Both the spatial distribution (which resource is affected) and the temporal distribution (i.e., how often upsets occur) are retrieved from AgeSim [Huang and Xu 2010].

The proposed methodology also incorporates a mechanism for adapting the aggressiveness of fault tolerance at run-time, as a response to variations in reliability degradation occurring during the execution phase. To accomplish this task, the power sources over the target architecture are periodically monitored; in case they differ from their previous values, the protected regions are appropriately re-tuned by the proposed game theory approach. Based on the conclusions derived from this analysis, the fault tolerance for the affected slices is turned either "on" (insertion of redundancy) or "off" (removal of redundancy), as sketched below. The next subsections provide additional details about the dynamic insertion and removal of redundancy.
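The run-time stage of the flow can be summarized with the following sketch; monitor_power(), solve_subgames(), and set_tmr() are hypothetical callbacks standing in for the monitoring infrastructure, the game-theory solver, and the partial-reconfiguration mechanism, respectively.

```python
import time

def adapt_fault_tolerance(monitor_power, solve_subgames, set_tmr,
                          period_s=1.0, epsilon=0.05):
    """Periodically re-tune the aggressiveness of TMR at run-time.

    monitor_power()          -> dict {(i, j): normalized power}
    solve_subgames(powers)   -> dict {(i, j): True/False (enable TMR)}
    set_tmr(slice_xy, state) -> reconfigure one slice (insert/remove TMR)
    """
    previous = monitor_power()
    while True:
        time.sleep(period_s)
        current = monitor_power()
        # Re-run the game only when some slice's power profile drifted.
        changed = {xy: p for xy, p in current.items()
                   if abs(p - previous.get(xy, 0.0)) > epsilon}
        if changed:
            decisions = solve_subgames(current)
            for xy, enable in decisions.items():
                set_tmr(xy, enable)   # switch redundancy on or off
        previous = current
```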


[Figure 7: (a) the normalized value range split by the thresholds QL and QH into Region1 (below QL, without protection), Region3 (between QL and QH, to be solved with game theory), and Region2 (above QH, protected with TMR); (b) the QL borders drawn over the 60×60 slice map of the des benchmark.]

Fig. 7. Clustering of hardware resources based on the QL and QH thresholds.

4.1. Supporting Adaptive TMR at the Target FPGAs

The threshold Q, as discussed in Section 2, enables clustering hardware resources as either suspicious to fail, or not. However, such a strict decision usually does not provide sufficient flexibility in balancing the desired fault masking against the consequent performance and power overheads. Hence, instead of Q we employ two threshold values, named QL and QH, where 0 ≤ QL < Q ≤ QH ≤ 1. The absolute values of these thresholds are designer-specified and affect the area coverage where game theory is applied. Based on these thresholds, the FPGA slices are classified into three regions, as follows:

— Region1: Whenever the normalized power consumption of slice(i,j) is lower than QL (0 ≤ Power_slice(i,j) < QL), we can almost safely claim that the application's functionalities mapped onto this slice are not likely to be affected by upsets due to aging phenomena. Consequently, our methodology eliminates TMR from these slices in order to improve delay and power metrics.

— Region2: In case the normalized power consumption of slice(i,j) is higher than QH (QH < Power_slice(i,j) ≤ 1), this slice is very likely to be affected by reliability degradation. Hence, TMR protection is retained for this slice.

— Region3: For the remaining slices, whose normalized power consumption ranges between QL and QH (QL ≤ Power_slice(i,j) ≤ QH), it is not clear whether they should be protected with redundancy or not. These regions mostly include slices that are affected by thermal diffusion. In order to find out which of these slices should incorporate TMR, we employ a game theory approach, as described in the next subsection.

Fig. 7(a) shows the QL and QH thresholds graphically, whereas Fig. 7(b) depicts how these regions are applied to the des benchmark. For demonstration purposes, in Fig. 7(b) we plot only the borders that correspond to the QL threshold; a sketch of the three-way clustering follows. In order to clarify how a slice may or may not incorporate TMR, Fig. 8 demonstrates the two alternative operating modes for the underlying FPGA device.
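A minimal sketch of the three-region clustering, assuming normalized per-slice power values as before; the default thresholds QL = 0.34 and QH = 0.66 are the values selected later in Section 5:

```python
def classify_region(norm_power, q_low=0.34, q_high=0.66):
    """Cluster a slice by its normalized power (0 <= value <= 1):
    Region1 -> TMR removed, Region2 -> TMR kept,
    Region3 -> decided by the game theory approach."""
    if norm_power < q_low:
        return "Region1"   # unlikely to fail: eliminate TMR
    if norm_power > q_high:
        return "Region2"   # likely to fail: keep TMR
    return "Region3"       # uncertain: resolve with game theory

print([classify_region(v) for v in (0.1, 0.5, 0.9)])
```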

[Figure 8: a slice in its two operating modes. With TMR enabled (a), BLEs #1–#3 implement the original functionality and its two replicas, and BLE #4 is configured as the majority voter driving the outputs. With TMR disabled (b), only BLE #1 implements the original functionality, while the remaining BLEs are switched off via their SRAM configuration. Panels (c) and (d) depict the enabled (switched-on TMR) and disabled (switched-off TMR) BLE, together with the truth table of the majority voter:

  A (original BLE): 0 0 0 0 1 1 1 1
  B (replica #1)  : 0 0 1 1 0 0 1 1
  C (replica #2)  : 0 1 0 1 0 1 0 1
  V (output)      : 0 0 0 1 0 1 1 1 ]

Fig. 8. Application mapping onto the target architecture: (a) a slice with TMR-based fault tolerance and (b) a slice without fault tolerance.
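A one-line model of the majority voter shown above, matching its truth table:

```python
def majority_voter(a, b, c):
    """Majority of three 1-bit replica outputs (TMR voter)."""
    return (a & b) | (a & c) | (b & c)

# Reproduce the truth table: a single faulty replica is out-voted.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, "->", majority_voter(a, b, c))
```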

More specifically, whenever our framework indicates that a slice has to be protected against upsets (i.e., it belongs to Region2), three Basic Logic Elements (BLEs) are configured to perform exactly the same functionality, while the fourth BLE is configured as the majority voter of the TMR (see Fig. 8(a)). The functionality and the truth table of the majority voter are depicted in Figs. 8(c) and 8(d), respectively. On the other hand, if the slice's failure probability is considerably low (Region1), then only one of the four BLEs is configured to perform the application's functionality; the remaining BLEs are not utilized (see Fig. 8(b)). Finally, for those slices assigned to Region3, the BLEs are configured as in the first case (Fig. 8(a)), but the decision on whether fault tolerance is enabled is derived from the employed game theory approach. In order to support the dynamic insertion and removal of redundancy, our adaptive TMR methodology switches the corresponding slices on or off, respectively, using the power gating and clock gating low-power techniques.

4.2. The Employed Game Theory Approach

This subsection describes the employed approach, based on non-cooperative game theory [Neumann and Morgenstern 2004], for determining whether or not to enable redundancy at slices belonging to Region3. Game theory is a branch of applied mathematics


that studies the interaction of decision makers with multiple and usually conflicting objectives. The essential elements of a game, which also form its rules, are the players, actions, payoffs, and information. Games may be cooperative or non-cooperative. More specifically, a game is said to be cooperative when the rules can be previously stated, or agreed upon, by the players for use in deducing common strategies. On the other hand, a game without any such agreements among the players is called a non-cooperative game [Nash 1950]. Non-cooperative games are typically played between rational players who know the complete details of the game, including each other's preferences and outcomes. Ordinarily, the players of a game are assumed to be rational decision makers. Each player, or decision maker, attempts to maximize his own utility by a set of actions in the presence of the other decision makers [Osborne and Rubinstein 1994]. An action, or a move by a player, is a choice at an instant of the game. These actions take place based on the player's strategy, which is a rule, or set of rules, used to decide upon an action at every instant of the game. Another crucial element of the game's execution is the player's payoff: the utility obtained by each player after all players have selected their strategies and the game has been played. Each player is aware of the actions previously chosen by the other players at a given instant of the game, as well as of their selections as the game proceeds.

In order to complete a game, it has to reach equilibrium. This occurs whenever the corresponding combination of strategies comprises the best strategy for each of the players in the game. In other words, at the equilibrium, the combination of strategies selected by the players maximizes their individual payoffs among the possible strategy combinations for these players. Once a game reaches the equilibrium phase, it is possible to derive the set of payoff values of the players corresponding to their equilibrium strategies.

The problem we address in this section with game theory is a non-cooperative normal form game. In these games, players choose their own strategies independently, and the rules of the game do not allow commitments among them. Thus, such games are played by exclusively rational players, where each player focuses on his own selection of strategies and the corresponding payoff. In normal form games, which are also non-cooperative, the players move simultaneously to choose their strategies. The normal form shows what payoff results from each possible strategy combination. Since all the players make their moves simultaneously, it is not possible to learn each other's private information by observing the other players' selections. Consequently, it is possible to model a normal form game by setting the information of each player about the other players to null.

The problem we tackle in this subsection is formulated as follows:

Problem definition: Given an application, the virtual FPGA platform, and the maximum affordable mitigation cost as compared to the conventional (i.e., without fault coverage) application implementation, find the number of slices that should be protected with redundancy, as well as their spatial location over the virtual FPGA, in order to maximize the fault coverage against reliability degradation.

Different levels of granularity might be used for modeling the underlying architecture. For the scope of this paper, we assume that a slice assigned to spatial location (i, j) is treated as a player (player(i,j)) of the employed game. Assuming that the target architecture consists of M × N slices, the game is described as:



Game = {P, S(i,j), Score(i,j)},  ∀(i,j) ∈ (M,N)    (5)

where P = {player(i,j) | i = 1, 2, ..., M; j = 1, 2, ..., N} corresponds to the set of players, whereas the strategy selected by the player assigned to spatial location (i,j) is denoted as S(i,j). In our game, the valid strategies for each player are either to employ redundancy (S(i,j) = yes), or not (S(i,j) = no).

The efficiency of a player's selections is evaluated with a score metric. Equation 6 gives the score for a player assigned to spatial location (i,j). Since our game is non-cooperative, the goals for each player are: (i) to minimize the probability of failure (PoF) of this slice and (ii) to reduce the Power×Delay Product (PDP). These two objectives are competitive for a single player, because minimizing the PoF requires the slice to incorporate TMR redundancy, which is implemented by introducing additional (replica) functionality, and hence increases both the application's delay and its power consumption.

Score(i,j) = W_PoF × PoF(i,j) + W_PDP × (Power(i,j) × Delay_net)    (6)

Parameters PoF(i,j) and Power(i,j) are recomputed for each player periodically after every turn of the game, whereas Delay_net is common to all players belonging to the same network. In order to retrieve power and delay values, we incorporate two well-established models, namely PowerModel [Poon et al. 2005] and the Elmore delay model [Gupta et al. 1997] [Okamoto and Cong 1996], respectively. Similarly, the PoF values at slice level (PoF(i,j)) are based on the AgeSim model [Huang and Xu 2010]. The weighting factors W_PoF and W_PDP define the importance of improving the fault masking and the PDP value, respectively. Note that the combination of the weighting factors has to satisfy Equation 7. For instance, when W_PoF = 1.00, the only objective during the game's execution is to retrieve a solution that maximizes the fault masking, regardless of the consequent mitigation cost. On the contrary, if maximum fault masking is not a pre-requisite, it is possible to achieve reasonable power and delay savings by relaxing the value of W_PoF.

W_PoF + W_PDP = 1.00    (7)
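A minimal sketch of the score metric (Equations 6 and 7); in the actual flow the PoF, power, and delay inputs would come from AgeSim, PowerModel, and the Elmore model, respectively, whereas the values below are illustrative placeholders:

```python
def score(pof, power, delay_net, w_pof=0.5):
    """Equation 6: weighted sum of failure probability and PDP.
    Equation 7 forces the two weights to be complementary."""
    w_pdp = 1.0 - w_pof           # W_PoF + W_PDP = 1.00 (Equation 7)
    return w_pof * pof + w_pdp * (power * delay_net)

# A slice with TMR: lower PoF, but higher power (replicas + voter).
print(score(pof=0.05, power=3.1, delay_net=0.8))   # protected
print(score(pof=0.40, power=1.0, delay_net=0.8))   # unprotected
```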

Algorithm 1 gives the pseudo-code for solving the adaptive TMR problem with game theory. This algorithm is guaranteed to find a Nash equilibrium, which is very close to the game's optimal solution [Puschini et al. 2008].

The worst-case time complexity for finding a Nash equilibrium of an M×N-player game with two strategies per player (i.e., to employ redundancy or not) is O(M × N × 2^(M×N)) [Nash 1950]. Since the proposed adaptive fault tolerance has to be executed on an embedded processor, we have to further reduce the problem's complexity. For this purpose, rather than employing a single game for the whole FPGA, we apply a separate sub-game per region of importance (Region3), as these regions are depicted in Fig. 7(b). This selection enables the dynamic optimization of G different regions simultaneously, guaranteeing that the game provides an acceptable solution even under run-time constraints [Puschini et al. 2008]. We have to note that, for a given benchmark, all the partial sub-games are applied simultaneously


ALGORITHM 1: Pseudo-code for solving the adaptive TMR problem with game theory.
Input: number of players
Input: desired level of fault masking
Input: affordable performance degradation in terms of delay and power consumption
Output: Nash equilibrium solution

Optimal_solution ← empty;
Current_score ← MAX_VALUE;
Optimal_score ← MAX_VALUE;
foreach partial sub-game g ∈ [1, ..., G] do
    foreach player i ∈ [1, ..., M] do
        foreach player j ∈ [1, ..., N] do
            foreach strategy s ∈ {yes, no} do
                Current_score ← Σ Score(i, j);
                if Current_score < Optimal_score then
                    Optimal_solution ← Current_solution;
                    Optimal_score ← Current_score;
                end
            end
        end
    end
end
solution ← Report(Nash equilibrium);
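A minimal executable rendering of the sub-game search in Algorithm 1, assuming each partial sub-game is small enough for exhaustive enumeration; player_score is a hypothetical callback implementing Equation 6 for a given strategy assignment.

```python
from itertools import product

def solve_subgame(players, player_score):
    """Exhaustively search one partial sub-game (Algorithm 1 sketch).

    players      : list of (i, j) slice coordinates in Region3
    player_score : callable(strategies) -> total score (Equation 6),
                   where strategies maps (i, j) -> True (TMR) / False
    Returns the strategy assignment with the minimum total score.
    """
    best_strategies, best_score = None, float("inf")
    for choices in product((True, False), repeat=len(players)):
        strategies = dict(zip(players, choices))
        current = player_score(strategies)
        if current < best_score:
            best_strategies, best_score = strategies, current
    return best_strategies

# Toy usage: protecting a slice lowers its PoF term but adds PDP cost.
players = [(1, 1), (1, 2)]
pof = {(1, 1): 0.4, (1, 2): 0.1}
total = lambda s: sum(0.5 * (0.05 if on else pof[p])
                      + 0.5 * (3.1 if on else 1.0) * 0.8
                      for p, on in s.items())
print(solve_subgame(players, total))
```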

with identical QL and QH values (this choice does not affect the generality of our proposed methodology).

5. EXPERIMENTAL RESULTS

One of the most complicated tasks during the design of a reliable architecture is verifying that the derived system provides the desired fault masking while meeting the application's timing and power specifications. This section provides a number of quantitative comparisons that prove the efficiency of the proposed adaptive TMR against relevant approaches. For this purpose, we employ the 20 biggest MCNC benchmarks [Yang 1991], whereas reliability degradation is modeled with the AgeSim tool [Huang and Xu 2010]. For the sake of completeness, each benchmark is mapped onto an identical instance of the virtual FPGA (in terms of array size and number of logic/routing resources) implemented on a Virtex-6 FPGA board [Xilinx 2011c]. Furthermore, as there are no other publicly available tools targeting fault masking against reliability degradation, and the existing techniques are difficult to tune for our experimental setup (e.g., target architecture, employed P&R algorithms, etc.), it is not possible to provide quantitative comparisons against the approaches discussed in related work.

Initially, we quantify the efficiency of the proposed methodology in identifying suspicious for failure resources (assigned to Region2 and Region3) based on their power consumption, as compared to a more accurate approach (the HotSpot tool [Kong et al. 2012]). The results of this analysis, summarized in Table I, show that the error in terms of slices characterized as suspicious for failure by only one of the two alternative approaches ranges between 2% and 11%, whereas the average error among the 20 benchmarks is only 6%. As already mentioned, this error can be considered affordable, taking into account that our framework concludes about the criticality of a slice much faster than the thermal analysis approach.


Table I. Identification of suspicious for failure slices with the two alternative methods (discussed in Fig. 3). Columns 3 and 4 give the slices clustered as suspicious for failure (count and percentage of utilized slices) based on power sources [Poon et al. 2005] and on the thermal profile [Kong et al. 2012], respectively.

Benchmark | Total slices | Power-based      | Thermal-based    | Error (%) vs. HotSpot [Kong et al. 2012]
alu4      |  1,509  |   392 (26%) |   501 (33%) |  7%
apex2     |  1,861  |   558 (30%) |   739 (40%) | 10%
apex4     |  1,255  |   351 (28%) |   442 (35%) |  7%
bigkey    |  1,817  |   799 (44%) |   999 (55%) | 11%
clma      |  8,073  | 2,203 (27%) | 2,987 (37%) | 10%
des       |  1,581  |   601 (38%) |   727 (46%) |  8%
diffeq    |  1,475  |   590 (40%) |   635 (43%) |  3%
dsip      |  1,370  |   425 (31%) |   547 (40%) |  9%
elliptic  |  3,600  | 1,620 (45%) | 1,692 (47%) |  2%
ex1010    |  4,545  | 1,591 (35%) | 2,013 (44%) |  9%
ex5p      |  1,038  |   311 (30%) |   418 (40%) | 10%
frisc     |  3,532  | 1,024 (29%) | 1,272 (36%) |  7%
misex3    |  1,376  |   537 (39%) |   601 (44%) |  5%
pdc       |  4,506  | 1,802 (40%) | 1,955 (43%) |  3%
s298      |  1,931  |   676 (35%) |   731 (38%) |  3%
s38417    |  6,208  | 2,732 (44%) | 2,980 (48%) |  4%
s38584    |  6,293  | 2,769 (44%) | 3,147 (50%) |  6%
seq       |  1,733  |   537 (31%) |   712 (41%) | 10%
spla      |  3,631  | 1,489 (41%) | 1,561 (43%) |  2%
tseng     |  1,044  |   418 (40%) |   459 (44%) |  4%
Average:  | 2,918.9 | 1,071 (36%) | 1,256 (42%) |  6%

Then, we summarize some properties of the employed game. Specifically, Table II gives the number of players (slices) per benchmark, the number of partial sub-games applied simultaneously, the average number of players per partial sub-game, as well as the reduction in execution cycles achieved by applying the proposed solution with partial sub-games. Based on Table II, we can conclude that on average 9.15 partial sub-games per benchmark are required to solve the adaptive TMR problem, while each of these sub-games has on average 57.25 players. This leads to almost 5× faster execution for solving the adaptive fault tolerance, as compared to the execution cycles required by the single game (reference solution).

The efficiency of adaptive TMR is highly affected by the values of the weighting factors in Equation 6. For this purpose, four representative combinations of weighting factors (given in Equation 8) are evaluated under the performance and power consumption metrics. The results of this analysis in terms of maximum operation frequency and power consumption are summarized in Tables III and IV, respectively. The second column of these tables corresponds to the initial application implementation (without redundancy), which is treated as the reference solution, whereas columns 3–6 depict results for the alternative combinations of weighting factors. Finally, the values in the last column ("Uniform TMR") correspond to the case where all the application's functionalities are triplicated similar to Xilinx TMR [Xilinx 2011d].

(W_PoF, W_PDP) ∈ {(1.00, 0.00); (0.75, 0.25); (0.50, 0.50); (0.25, 0.75)}    (8)

A number of conclusions might be derived from these tables. Among others, considerable variations in both delay and power consumption are feasible for different combinations of weighting factors.


Table II. Properties of the employed partial sub-games.

Benchmark | Players in single game (reference solution) | Simultaneous sub-games | Average players per sub-game | Reduction in execution cycles (vs. reference)
alu4      |  1,509  |    8 | 19.63 | 8.62×
apex2     |  1,861  |   14 | 32.86 | 3.05×
apex4     |  1,255  |   14 | 20.64 | 3.34×
bigkey    |  1,817  |    5 | 80.00 | 3.55×
clma      |  8,073  |   10 | 91.80 | 7.80×
des       |  1,581  |   14 | 33.36 | 2.38×
diffeq    |  1,475  |    9 | 42.22 | 2.89×
dsip      |  1,370  |   11 | 32.73 | 2.81×
elliptic  |  3,600  |    8 | 85.38 | 4.28×
ex1010    |  4,545  |    5 | 93.60 | 8.71×
ex5p      |  1,038  |    7 | 26.14 | 4.72×
frisc     |  3,532  |   11 | 73.09 | 3.39×
misex3    |  1,376  |    8 | 33.50 | 4.13×
pdc       |  4,506  |    9 | 85.78 | 4.83×
s298      |  1,931  |    8 | 48.25 | 4.00×
s38417    |  6,208  |    6 | 97.50 | 9.61×
s38584    |  6,293  |   11 | 95.45 | 4.99×
seq       |  1,733  |    7 | 35.71 | 5.92×
spla      |  3,631  |   13 | 93.00 | 2.00×
tseng     |  1,044  |    5 | 24.80 | 7.49×
Average:  | 2,918.9 | 9.15 | 57.25 | 4.93×

Table III. Evaluation in terms of maximum operation frequency (MHz) for alternative instantiations of game theory. Columns 3–6 are adaptive TMR with the indicated (W_PoF, W_PDP); the last column is uniform TMR (similar to Xilinx TMR [Xilinx 2011d]).

Benchmark | Without TMR (ref.) | (0.25, 0.75) | (0.50, 0.50) | (0.75, 0.25) | (1.00, 0.00) | Uniform TMR
alu4      | 15.91 | 14.19 | 13.45 | 12.63 | 12.52 | 12.18
apex2     | 13.09 | 12.33 | 12.11 | 10.64 | 10.27 | 10.27
apex4     | 15.84 | 14.26 | 13.02 | 13.02 | 11.88 | 11.67
bigkey    | 24.31 | 21.67 | 20.54 | 18.04 | 17.89 | 16.26
clma      |  5.72 |  5.24 |  5.11 |  4.53 |  4.33 |  4.02
des       | 12.47 | 11.95 | 11.43 | 10.63 | 10.08 |  9.35
diffeq    | 19.20 | 19.20 | 17.53 | 16.15 | 14.89 | 14.62
dsip      | 22.44 | 22.07 | 20.54 | 17.86 | 16.30 | 14.97
elliptic  | 12.13 | 11.01 | 10.73 | 10.63 |  9.89 |  9.17
ex1010    |  7.28 |  7.21 |  6.78 |  5.95 |  5.48 |  5.24
ex5p      | 12.56 | 11.93 | 11.31 | 10.91 | 10.91 | 10.62
frisc     |  7.37 |  7.00 |  6.94 |  6.82 |  6.29 |  6.17
misex3    | 15.97 | 14.90 | 14.51 | 13.50 | 12.80 | 11.75
pdc       |  6.68 |  6.07 |  5.75 |  5.60 |  5.51 |  5.41
s298      | 10.04 |  9.95 |  9.09 |  8.38 |  7.72 |  7.30
s38417    | 12.17 | 11.96 | 11.65 | 10.54 | 10.17 |  9.52
s38584    | 11.99 | 11.19 | 10.51 | 10.23 |  9.70 |  9.52
seq       | 14.77 | 14.03 | 12.93 | 11.36 | 10.77 | 10.57
spla      |  7.56 |  7.56 |  7.10 |  6.91 |  6.43 |  6.02
tseng     | 25.12 | 22.82 | 22.82 | 22.02 | 20.11 | 18.28
Average:  | 13.63 | 12.83 | 12.19 | 11.32 | 10.70 | 10.15
Ratio:    |  1.00 | 0.94× | 0.89× | 0.83× | 0.78× | 0.74×


Table IV. Evaluation in terms of power consumption (mWatt) for alternative instantiations of game theory. Columns 3–6 are adaptive TMR with the indicated (W_PoF, W_PDP); the last column is uniform TMR (similar to Xilinx TMR [Xilinx 2011d]).

Benchmark | Without TMR (ref.) | (0.25, 0.75) | (0.50, 0.50) | (0.75, 0.25) | (1.00, 0.00) | Uniform TMR
alu4      | 22.81 | 27.25 | 27.45 | 33.31 | 34.76 | 39.58
apex2     | 21.07 | 23.31 | 28.61 | 30.56 | 36.68 | 40.95
apex4     | 19.72 | 22.76 | 29.83 | 30.56 | 30.56 | 33.80
bigkey    | 47.46 | 54.76 | 58.21 | 58.68 | 60.18 | 64.68
clma      | 41.05 | 45.05 | 48.27 | 54.36 | 62.72 | 72.13
des       | 29.73 | 33.17 | 33.17 | 38.75 | 46.05 | 48.58
diffeq    | 14.88 | 15.33 | 17.54 | 18.43 | 19.57 | 21.64
dsip      | 35.63 | 41.11 | 41.11 | 49.41 | 55.41 | 59.01
elliptic  | 33.93 | 34.97 | 40.35 | 43.86 | 46.57 | 52.00
ex1010    | 28.83 | 37.79 | 47.24 | 53.19 | 57.51 | 64.84
ex5p      | 16.32 | 20.22 | 25.04 | 28.46 | 28.69 | 28.69
frisc     | 30.00 | 37.50 | 43.64 | 45.08 | 47.04 | 48.30
misex3    | 21.75 | 22.08 | 28.94 | 34.78 | 36.29 | 36.93
pdc       | 67.02 | 71.80 | 80.78 | 82.77 | 84.17 | 88.80
s298      | 14.65 | 18.48 | 19.96 | 23.10 | 23.89 | 25.21
s38417    | 29.01 | 33.76 | 35.33 | 44.17 | 47.75 | 49.03
s38584    | 32.86 | 34.13 | 35.17 | 36.94 | 38.89 | 39.57
seq       | 31.61 | 34.98 | 35.78 | 36.96 | 40.69 | 40.69
spla      | 37.50 | 38.64 | 45.76 | 50.18 | 55.76 | 60.49
tseng     | 10.32 | 11.61 | 14.13 | 16.50 | 19.23 | 22.11
Average:  | 29.31 | 32.94 | 36.82 | 40.50 | 43.62 | 46.85
Ratio:    |  1.00 | 1.12× | 1.26× | 1.38× | 1.49× | 1.60×

Table V. Average Power×Delay product for alternative fault tolerant scenarios (mWatt×µsec). Columns 2–5 are adaptive TMR with the indicated (W_PoF, W_PDP); the last column is uniform TMR (similar to Xilinx TMR [Xilinx 2011d]).

       | Without TMR (ref.) | (0.25, 0.75) | (0.50, 0.50) | (0.75, 0.25) | (1.00, 0.00) | Uniform TMR
PDP    | 2.15 | 2.57  | 3.02  | 3.58  | 4.08  | 4.62
Ratio: | 1.00 | 1.19× | 1.40× | 1.66× | 1.89× | 2.14×

Additionally, adaptive TMR achieves average performance and power savings, compared to the uniform insertion of redundancy, of up to 26% and 48%, respectively. In order to have a global overview of the behavior of these weighting factors, Table V provides the average Power×Delay product (PDP) for the previously mentioned combinations of weighting factors over the 20 biggest MCNC benchmarks. These results indicate that the highest PDP value occurs when uniform TMR is applied to the entire architecture. Specifically, this approach increases the average PDP by 2.14×, as compared to the initial application implementation (without redundancy). A slightly reduced PDP overhead (about 1.89×) occurs if the only objective during the employed game is to improve fault masking (W_PoF = 1.00). On the other hand, when W_PoF = 0.25, the number of triplicated functionalities in Region3 is minimized, and hence the PDP overhead is only 1.19× compared to the reference solution. For the rest of the paper we tune our game with W_PoF = W_PDP = 0.50, since our framework aims to balance the fault masking against the consequent mitigation cost.

After defining the weighting factors, we quantify the efficiency of the proposed methodology for different combinations of the QL and QH thresholds. This study mainly focuses on minimizing the execution time for solving the game theory approach, since we modify the area coverage of Region3. The results of this analysis are summarized in Fig. 9. The horizontal and vertical axes in this figure give the variation in terms of fault masking and PDP, respectively. For demonstration purposes, both axes are plotted in a normalized manner over the maximum corresponding value among the alternative


Fig. 9. Evaluation of proposed adaptive TMR under different QL and QH thresholds.

solutions. We have to mention that the values in Fig. 9 are geometric averages over the 20 biggest MCNC benchmarks.

Based on Fig. 9, we can conclude that the threshold values QL and QH highly affect the efficiency of the derived solutions, since they define the area where game theory is applied (Region3). The combination of thresholds marked as "optimal solution in terms of fault masking" corresponds to a game where QH = 1.00 and QL = 0.00. This implies that all the slices in Region3 are candidates for triplication, leading to up to 93% higher fault masking among the studied solutions. Similarly, if the threshold values are set to QH = QL = 0.50 ("optimal solution in terms of PDP"), then it is possible to achieve an almost 83% lower Power×Delay product. However, apart from these two border solutions, there are numerous combinations of QL and QH that balance fault masking with mitigation cost. For demonstration purposes, a subset of candidate solutions is highlighted with a circle in Fig. 9. For the rest of the paper, we incorporate a game where QL = 0.34 and QH = 0.66. This combination of threshold values belongs to the Pareto curve and provides an acceptable trade-off between the previously mentioned design criteria. The selected game leads to an average fault masking and PDP reduction of 58% and 79%, respectively, as compared to the corresponding "optimal" solutions.

Next, we quantify the execution overhead for identifying critical for failure resources. The results of this analysis are summarized in Fig. 10. More specifically, this figure plots the number of execution cycles per benchmark with the proposed technique (game theory), both for design-time and for run-time. For demonstration purposes, these values are normalized over the corresponding number of cycles required


Fig. 10. Comparison in terms of execution cycles for identifying suspicious for failure slices.

Next, we quantify the execution overhead for identifying resources that are critical for failure. The results of this analysis are summarized in Fig. 10, which plots the number of execution cycles per benchmark with the proposed technique (game theory), both at design-time and at run-time. For demonstration purposes, these values are normalized over the corresponding number of cycles required by the HotSpot tool [Kong et al. 2012]. Based on this analysis, we can conclude that our methodology identifies the slices that should be protected against reliability degradation considerably faster than the HotSpot tool: the use of partial sub-games at design-time requires on average 76% fewer execution cycles, whereas the corresponding savings at run-time reach 90%. It is also worth mentioning that this behavior is largely consistent across the studied benchmarks. The additional savings in execution cycles for solving the game at run-time are achieved because our framework takes into account the already computed thermal stress (evaluated either at design-time or at run-time) and applies an even more aggressive, localized thermal analysis.

Finally, the effectiveness of the proposed methodology is quantified by the number of sensitive bits in each design. The term sensitive bit refers to a user memory bit, or an FPGA configuration memory bit, that causes erroneous output when upset. Table VI reports the number of sensitive bits after applying the proposed adaptive TMR methodology. The references for this study are two border solutions: (i) an application implementation without fault tolerance (referred to as "No TMR") and (ii) an implementation with uniform insertion of redundancy, similar to Xilinx TMR [Xilinx 2011d]. To compute these values, a number of upsets equal to 5% of the bitstream file size is injected into the configuration data, whereas the modeling of reliability degradation (i.e., the temporal and spatial distribution of these upsets) is retrieved from AgeSim [Huang and Xu 2010]. As Table VI shows, the proposed adaptive TMR methodology reduces the number of sensitive configuration bits significantly: on average, our solution yields 92% fewer sensitive bits than the initial (unprotected) application implementation, whereas Xilinx TMR (maximum possible fault coverage) reduces this number further by only 8.26%.
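The fault-injection campaign behind Table VI can be approximated by the following sketch. It is a simplified reconstruction under our assumptions: the bitstream representation, the uniform upset sampling (the actual temporal and spatial distribution comes from AgeSim), and the run_design oracle are all placeholders, not the framework's real interfaces:

```python
import random

def count_sensitive_bits(bitstream_bits, run_design, upset_fraction=0.05):
    """Estimate sensitive configuration bits via fault injection.

    bitstream_bits: list of configuration bit values (0/1).
    run_design: callable executing the design with the given bitstream
        and returning True if all outputs are correct (a stand-in for
        the actual simulation).
    A bit is counted as sensitive when flipping it produces erroneous
    output, following the definition in the text.
    """
    n_upsets = int(upset_fraction * len(bitstream_bits))
    targets = random.sample(range(len(bitstream_bits)), n_upsets)
    sensitive = 0
    for bit in targets:
        bitstream_bits[bit] ^= 1            # inject a single upset
        if not run_design(bitstream_bits):  # erroneous output observed
            sensitive += 1
        bitstream_bits[bit] ^= 1            # restore the original bit
    return sensitive
```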


Table VI. Number of sensitive bits under different fault-masking schemes.

Benchmark   Total config.   Sensitive configuration Kbits                Difference vs.
            Kbits           No TMR    Adaptive TMR (WPoF = WPDP = 0.50)  Xilinx TMR
                                      Number      Gain (%)               [Xilinx 2011d]
-----------------------------------------------------------------------------------------
alu4           557.79        49.70      5.47      88.99%                 -11.01%
apex2          724.17        92.69      6.33      93.17%                  -6.83%
apex4          534.38        48.63      4.21      91.34%                  -8.66%
bigkey         953.74       115.40      8.05      93.02%                  -6.98%
clma         3,088.20       308.51     30.12      90.24%                  -9.76%
des          1,287.21       171.71     11.49      93.31%                  -6.69%
diffeq         508.84        63.45      5.67      91.06%                  -8.94%
dsip           923.05        73.84      4.01      94.57%                  -5.43%
elliptic     1,358.99       127.74     12.23      90.43%                  -9.57%
ex1010       1,686.67       185.53     11.95      93.56%                  -6.44%
ex5p           452.50        50.68      4.32      91.48%                  -8.52%
frisc        1,434.38       154.91     15.44      90.03%                  -9.97%
misex3         546.19        55.27      6.08      89.00%                 -11.00%
pdc          2,023.79       287.18     20.97      92.70%                  -7.30%
s298           662.35        75.51      3.67      95.14%                  -4.86%
s38417       2,067.88       248.15     21.30      91.42%                  -8.58%
s38584       2,119.48       254.34     19.09      92.49%                  -7.51%
seq            680.89        75.37      6.38      91.54%                  -8.46%
spla         1,520.26       163.43     12.90      92.11%                  -7.89%
tseng          369.69        32.53      3.53      89.15%                 -10.85%
Average:     1,175.02       131.73     10.66      91.74%                  -8.26%
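As a consistency check on Table VI, the Gain column follows directly from the two sensitive-bit counts; the table does not state the formula explicitly, but the numbers match

\[
\text{Gain} = \frac{\text{No TMR} - \text{Adaptive TMR}}{\text{No TMR}} \times 100\%
            = \frac{49.70 - 5.47}{49.70} \times 100\% \approx 88.99\% \quad (\text{alu4}).
\]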

However, such an improvement in fault masking comes at a considerable mitigation cost, which is usually not affordable for consumer products.

6. CONCLUSIONS & FUTURE WORK

A novel framework supporting adaptive TMR against aging phenomena in consumer products was introduced. Rather than protecting the whole design with redundancy (e.g., similar to Xilinx TMR), the proposed framework employs a game-theoretic approach to selectively insert TMR only for functionalities mapped onto slices with an increased probability of failure. Experimental results proved the efficiency of this solution, as it achieves fault masking comparable to existing approaches, but at a considerably lower mitigation cost. More specifically, our analysis showed that the number of sensitive bits in the configuration file after applying the introduced framework differs on average by only 8.26% from that achieved with uniform insertion of redundancy (i.e., Xilinx TMR). Since our framework targets consumer products, such an error can be considered affordable in view of the significant savings in Power×Delay product (PDP), namely 53% compared to the reference solution. Additionally, it is worth mentioning that our framework performs faster, as it speeds up the identification of resources suspected of failure on average by 76% compared to the HotSpot tool. This reduction in execution time, which can be further improved by selecting a different configuration for the game, enables our adaptive fault tolerant methodology to be applied at run-time as well.

As next steps, we aim to address two complementary extensions. The first concerns extending our methodology, as well as the supporting tool-flow, to handle upsets caused either by different reliability degradation effects (e.g., EMI, TDDB, etc.) or by additional parameters, apart from thermal stress, that lead to erroneous operation. The second direction for future work concerns the development of appropriate software tools in order to apply the proposed framework


directly to an existing reconfigurable platform, without requiring a Virtual FPGA.

ACKNOWLEDGMENT

The authors would like to thank Altera for their support of this work.

REFERENCES

Altera. 2011a. ALTERA QUARTUS II FRAMEWORK (URL: http://www.altera.com). (2011).
Altera. 2011b. Stratix V Device Handbook (URL: http://www.altera.com/literature/hb/stratix-v/stratix5_handbook.pdf). (2011).
Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr. 2004. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Trans. Dependable Secur. Comput. 1, 1 (January 2004), 11–33.
Vaughn Betz, Jonathan Rose, and Alexander Marquardt (Eds.). 1999. Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic Publishers, Norwell, MA, USA.
Debayan Bhaduri and Sandeep K. Shukla. 2004. NANOPRISM: a tool for evaluating granularity vs. reliability trade-offs in nano architectures. In GLSVLSI '04: Proceedings of the 14th ACM Great Lakes Symposium on VLSI. ACM, New York, NY, USA, 109–112. DOI:http://dx.doi.org/10.1145/988952.988980
J. R. Black. 1969. Electromigration – A brief survey and some recent results. IEEE Transactions on Electron Devices (April 1969), 338–347.
Nicola Campregher, Peter Y. K. Cheung, George A. Constantinides, and Milan Vasilko. 2005. Analysis of yield loss due to random photolithographic defects in the interconnect structure of FPGAs. In FPGA '05: Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays. ACM, New York, NY, USA, 138–148. DOI:http://dx.doi.org/10.1145/1046192.1046211
C. Carmichael. 2006. Triple Module Redundancy Design Techniques for Virtex FPGAs. Technical Report. Xilinx Application Note 197.
Jason A. Cheatham, John M. Emmert, and Stan Baumgart. 2006. A survey of fault tolerant methodologies for FPGAs. ACM Trans. Des. Autom. Electron. Syst. 11, 2 (2006), 501–533. DOI:http://dx.doi.org/10.1145/1142155.1142167
Abderrahim Doumar, Satoshi Kaneko, and Hideo Ito. 1999. Defect and Fault Tolerance FPGAs by Shifting the Configuration Data. In DFT '99: Proceedings of the 14th International Symposium on Defect and Fault-Tolerance in VLSI Systems. IEEE Computer Society, Washington, DC, USA, 377–385.
G. Gielen, P. De Wit, E. Maricau, J. Loeckx, J. Martín-Martínez, B. Kaczer, G. Groeseneken, R. Rodríguez, and M. Nafría. 2008. Emerging yield and reliability challenges in nanometer CMOS technologies. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE '08). ACM, New York, NY, USA, 1322–1327.
R. Gupta, B. Tutuianu, and L. T. Pileggi. 1997. The Elmore delay as a bound for RC trees with generalized input signals. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 16, 1 (Jan. 1997), 95–104.
Lin Huang and Qiang Xu. 2010. AgeSim: a simulation framework for evaluating the lifetime reliability of processor-based SoCs. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE '10). European Design and Automation Association, Leuven, Belgium, 51–56. http://dl.acm.org/citation.cfm?id=1870926.1870942
M. Hübner, P. Figuli, R. Girardey, D. Soudris, K. Siozios, and Jürgen Becker. 2011. A Heterogeneous Multicore System on Chip with Run-Time Reconfigurable Virtual FPGA Architecture. In IPDPS Workshops. 143–149.
ITRS. 2012. International Technology Roadmap for Semiconductors 2011 Edition (URL: http://www.itrs.net/Links/2011ITRS/Home2011.htm). (2012).
Rahul Jain, Anindita Mukherjee, and Kolin Paul. 2006. Defect-Aware Design Paradigm for Reconfigurable Architectures. In ISVLSI '06: Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures. IEEE Computer Society, Washington, DC, USA, 91. DOI:http://dx.doi.org/10.1109/ISVLSI.2006.32
Jonathan M. Johnson and Michael J. Wirthlin. 2010. Voter insertion algorithms for FPGA designs using triple modular redundancy. In FPGA '10: Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, New York, NY, USA, 249–258. DOI:http://dx.doi.org/10.1145/1723112.1723154
Fernanda Lima Kastensmidt, Luigi Carro, and Ricardo Reis. 2006. Fault-Tolerance Techniques for SRAM-Based FPGAs (Frontiers in Electronic Testing). Springer-Verlag New York, Inc., Secaucus, NJ, USA.


Joonho Kong, Sung Woo Chung, and Kevin Skadron. 2012. Recent thermal management techniques for microprocessors. ACM Comput. Surv. 44, 3, Article 13 (June 2012), 42 pages. DOI:http://dx.doi.org/10.1145/2187671.2187675
I. Koren and C. Krishna (Eds.). 2007. Fault-Tolerant Systems. Elsevier Inc.
S. Mahapatra, M. A. Alam, P. Bharath Kumar, T. R. Dalei, D. Varghese, and D. Saha. 2005. Negative bias temperature instability in CMOS devices. Microelectron. Eng. 80, 1 (June 2005), 114–121.
Prasanth Mangalagiri, Sungmin Bae, Ramakrishnan Krishnan, Yuan Xie, and Vijaykrishnan Narayanan. 2008. Thermal-aware reliability analysis for platform FPGAs. In ICCAD '08: Proceedings of the 2008 IEEE/ACM International Conference on Computer-Aided Design. IEEE Press, Piscataway, NJ, USA, 722–727.
John F. Nash. 1950. Equilibrium points in n-person games. In Proceedings of the National Academy of Sciences of the United States of America.
John von Neumann and Oskar Morgenstern. 2004. Theory of Games and Economic Behavior (Commemorative Edition) (Princeton Classic Editions). Princeton University Press.
K. Nikolic, A. Sadek, and M. Forshaw. 2002. Fault-tolerant techniques for nanocomputers. Nanotechnology 13 (2002), 357–362.
Takumi Okamoto and Jason Cong. 1996. Buffered Steiner tree construction with wire sizing for interconnect layout optimization. In ICCAD '96: Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design. IEEE Computer Society, Washington, DC, USA, 44–49.
Martin J. Osborne and Ariel Rubinstein. 1994. A Course in Game Theory. The MIT Press.
Kara Poon, Steven Wilton, and Andy Yan. 2005. A detailed power model for field-programmable gate arrays. ACM Trans. Des. Autom. Electron. Syst. 10, 2 (2005), 279–302. DOI:http://dx.doi.org/10.1145/1059876.1059881
B. Pratt, M. Caffrey, P. Graham, K. Morgan, and M. Wirthlin. 2006. Improving FPGA design robustness with partial TMR. In 44th Annual IEEE International Reliability Physics Symposium Proceedings. 226–232.
D. Puschini, F. Clermidy, P. Benoit, G. Sassatelli, and L. Torres. 2008. A Game-Theoretic Approach for Run-Time Distributed Optimization on MP-SoC. International Journal of Reconfigurable Computing 2008, Article ID 403086 (2008), 11 pages. DOI:http://dx.doi.org/10.1155/2008/403086
Raphael Rubin and André DeHon. 2009. Choose-your-own-adventure routing: lightweight load-time defect avoidance. In FPGA '09: Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, New York, NY, USA, 23–32. DOI:http://dx.doi.org/10.1145/1508128.1508133
K. Siozios, D. Rodopoulos, and D. Soudris. 2011. On Supporting Rapid Thermal Analysis. IEEE Comput. Archit. Lett. 10, 2 (July 2011), 53–56. DOI:http://dx.doi.org/10.1109/L-CA.2011.19
Kostas Siozios and Dimitrios Soudris. 2010. A Methodology for Alleviating the Performance Degradation of TMR Solutions. IEEE Embedded Systems Letters 2, 4 (Dec. 2010).
Satish Sivaswamy and Kia Bazargan. 2008. Statistical Analysis and Process Variation-Aware Routing and Skew Assignment for FPGAs. ACM Trans. Reconfigurable Technol. Syst. 1, 1 (2008), 1–35. DOI:http://dx.doi.org/10.1145/1331897.1331900
Jayanth Srinivasan, Sarita V. Adve, Pradip Bose, and Jude A. Rivers. 2004. The Case for Lifetime Reliability-Aware Microprocessors. SIGARCH Comput. Archit. News 32, 2 (March 2004), 276–. DOI:http://dx.doi.org/10.1145/1028176.1006725
Priya Sundararajan, Aman Gayasen, N. Vijaykrishnan, and T. Tuan. 2006. Thermal characterization and optimization in platform FPGAs. In Proceedings of the 2006 IEEE/ACM International Conference on Computer-Aided Design (ICCAD '06). ACM, New York, NY, USA, 443–447. DOI:http://dx.doi.org/10.1145/1233501.1233589
Wenping Wang, Zile Wei, Shengqi Yang, and Yu Cao. 2007. An efficient method to identify critical gates under circuit aging. In Proceedings of the 2007 IEEE/ACM International Conference on Computer-Aided Design (ICCAD '07). IEEE Press, Piscataway, NJ, USA, 735–740. http://dl.acm.org/citation.cfm?id=1326073.1326228
Xilinx. 2011a. Defense-grade Virtex-6Q FPGA Family (URL: http://www.xilinx.com/products/silicon-devices/fpga/virtex-6q/index.htm). (2011).
Xilinx. 2011b. Space-grade Virtex-5QV FPGA (URL: http://www.xilinx.com/products/silicon-devices/fpga/virtex-5qv/index.htm). (2011).
Xilinx. 2011c. VIRTEX-6 Family Overview. Technical Report DS150.
Xilinx. 2011d. Xilinx TMR Tool (URL: http://www.xilinx.com/ise/optional_prod/tmrtool.htm). (2011).


S. Yang. 1991. Logic Synthesis and Optimization Benchmarks, User Guide. Technical Report. Microelectronics Center of North Carolina.
Anthony J. Yu and Guy G. Lemieux. 2005. Defect-Tolerant FPGA Switch Block and Connection Block with Fine-Grain Redundancy for Yield Enhancement. In FPL. 255–262.
