Reduced Triple Modular Redundancy for Built-In Self-Repair in VLIW-Processors

Mario Schölzel
Department of Computer Science, Brandenburg University of Technology, P.O. 101344, 03013 Cottbus, Germany

Abstract: In this paper we propose a new idea for built-in self-repair of application specific VLIW processors, which relies on a special kind of triple modular redundancy that we call Reduced Triple Modular Redundancy (RTMR). The key idea is to employ the redundancy of operators in the data path of a VLIW processor. That is, every operation is executed twice by two different operators during normal program execution. Only if a mismatch between both computed results occurs is the operation executed by a third operator. Therefore, during most of the execution time, the third operator can be used for executing regular operations of the program. We propose modifications of the VLIW architecture in order to detect a mismatch in computed results. Necessary program transformations are introduced in order to obtain an internal representation for fault tolerant programs that can be scheduled to the proposed VLIW architecture. Furthermore, we propose the program execution model that is used in case a permanent fault in the data path has been detected, and give some preliminary results.

1 Introduction

In many signal processing applications high performance is required. If the application runs in an embedded system, in many cases power and/or area consumption should also be minimized in order to increase battery life and/or to decrease costs. Besides these area and power concerns, reliability is becoming an increasingly important factor in the design of embedded systems. Due to reduced feature sizes, hardware becomes more sensitive to radiation and to irregularities in the production process. While radiation causes transient faults, irregularities in the production process and aging of the circuit may cause permanent faults. Permanent faults may first appear in the field and not immediately after production. Transient and permanent faults appearing in the field must be handled by a reliable system in order to ensure correct execution of the application.

To reduce development costs, many embedded systems are composed of components off the shelf (COTS), including processors (e.g. DSPs). However, to meet all the goals mentioned above, processors off the shelf will not always be the best solution. There are two reasons for that. First, processors off the shelf do not always perfectly meet the demands of the application. Thus, area and power are wasted on unused features. Furthermore, many embedded systems include additional hardware components for performance improvement. These components must be designed, as well as the communication between them. This process is time consuming and error-prone. Second, mechanisms for handling transient and/or permanent faults are hardly available in processors off the shelf. In order to make these mechanisms available, modifications of the data and control path would be necessary.

Because such modifications are nearly impossible for processors off the shelf, triple modular redundancy (TMR) is often used to obtain a reliable system. TMR means that the hardware which should be hardened against faults, e.g. a DSP, is simply tripled and coupled with a voter. The application is executed simultaneously on each of the three processor instances and the computed results are compared. This can be done without changes to the data and control path of the processor. If there is at most one faulty processor in a TMR system, at least two processors compute the same result. The voter forwards the result of the majority to the system's output. Another option to handle transient faults is a re-computation of the result after detecting the fault. But real time systems must have short response times. That is, the correct result must be forwarded immediately to the system's output, even in case of a fault. Therefore, after detecting an error, a third result must be available to the voter immediately. This means that the result cannot be recomputed in every case, because it may depend on previously computed results and a re-computation from scratch would be too time consuming. TMR leads to an overhead in area and power consumption of approximately 200%. The problem here is that half of this overhead comes from a processor that performs computations whose results are needed only in case of a fault. During most of the runtime of the system, the area and power of that processor are wasted.

In order to overcome these disadvantages of processors off the shelf, it becomes beneficial to use application specific processors (ASPs), which are composed of hardware blocks like register files, functional units, etc. Such processors can be adapted to the given application in order to save area and power. Furthermore, modifying the data path of ASPs is much easier than for COTS. Thus, mechanisms for detecting and repairing permanent and transient faults can be added to them. As the architecture template, which is adapted to a given application, we use a statically scheduled Very Long Instruction Word (VLIW) architecture. The data path of a simple VLIW architecture is shown in figure 1.

Figure 1: Data path of a simple VLIW architecture. [Block diagram with the blocks Program Memory, Data Memory, Extern, Control Path, Control Logic, Instruction Pointer, Branch, Data Path, Register File and FU 1 ... FU n.]

Every functional unit (FU) contains at least one operator. All operators within the same FU have different types. Every operator can execute a certain operation of the corresponding type with at most two source operands and one destination operand. Every FU executes one operation per clock cycle, even if it contains more than one operator. All operations that are executed within the same clock cycle form an instruction. An example is given in figure 2.

Figure 2: Example of an instruction word of a VLIW architecture with n FUs. [Field layout: opcode1 | src1.1 | src1.2 | dst1 | opcode2 | ... | opcoden | srcn.1 | srcn.2 | dstn]

The whole data path is controlled by such an instruction word. Execution of instructions is done in a simple 4-stage pipeline. In the first stage the instruction is fetched from memory. In the second stage it is decoded. In the third stage register values are supplied by the register file and operations are executed by the operators. Results are written back to the register file during the fourth stage.

VLIW processors are well suited as ASPs for executing high performance signal processing applications. On the one hand they provide a high level of instruction parallelism. On the other hand they have a simple control path, because all parallelization of operations is done by the compiler. The compiler also has knowledge about the pipeline architecture and thus prevents pipeline hazards.

The outline of the rest of the paper is as follows: After a discussion of related work we describe in detail the hardware architecture of a VLIW processor that is necessary to support reduced triple modular redundancy (RTMR). After that, the necessary program transformations are introduced. Finally we explain how transient and permanent faults are detected and how operator redundancy in the VLIW data path can be used for built-in self-repair.

2 Related Work

Concurrent execution of operations in a VLIW in order to detect hardware faults has already been proposed in [2]. However, it is a purely software based methodology. Furthermore, the proposed idea only focuses on detecting faults and not on repairing them. For repairing permanent faults a Built-In-Self-Repair (BISR) mechanism is necessary, which detects a faulty processing element and manages the correct execution of the application even in the presence of a faulty element.

Some modified design space exploration (DSE) methodologies have been proposed to provide BISR functionality for ASICs. Modifications of the DSE process are necessary, because ASICs which are adapted to a specific application using a traditional DSE approach will not have any redundant processing element to support BISR. The BISR functionality is obtained either by accepting a performance degradation in case of a permanent fault or by providing additional hardware which is used in case of a fault. In the first case the execution order of operations must be changed, because a faulty ASIC cannot execute at the same time all the operations that a non-faulty ASIC can execute. One such approach of rescheduling the operations in the case of a fault is L/U-scheduling [1] [8].

Another approach, which provides additional hardware, is presented in [5]. There, a certain number of n > 1 applications must be executed by the same ASIC. Instead of synthesizing n subsystems, each of them dedicated to one of the applications, n subsystems are synthesized, each of them capable of executing at least two applications. Furthermore, every application can be executed by at least two subsystems. If an element fails permanently in one of the subsystems, at least one of the applications can still be executed by it, and the application which cannot is moved to another subsystem. This approach results in a hardware overhead, while the execution time of each application remains the same, even in the presence of faulty elements. Another problem of that approach is the possibly high number of schedules that must be available for all the different fault cases.

3 Reduced Triple Modular Redundancy

A VLIW processor naturally provides a certain amount of redundant operators. Operators of the same type are available in more than one FU in order to operate in parallel. If every operator appears at least three times in the data path we have a kind of TMR. However, executing each operation three times would not result in a better performance to area ratio than in a simple TMR system. Following the observation that three results are only necessary in case of a fault, every operation of the given application is executed only twice, by different operators of the same type. The same operation is executed by a third FU only if the first two results differ. Then the third result is used for voting. Therefore, as long as no fault appears, the third operator can be used for executing regular operations of the algorithm. This will result in a much better area to performance ratio than simple TMR. We expect an area overhead of only approximately 100%, compared to 200% in case of TMR. Furthermore, we are able to detect permanent as well as transient faults. If a permanent fault in an operator is detected, that operator is no longer used.
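The execution principle described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the function name `rtmr_execute` and the representation of operators as Python callables are hypothetical and are not part of the proposed hardware, which performs the comparison in the data path.

```python
from typing import Callable, Sequence, Tuple

Operator = Callable[[int, int], int]


def rtmr_execute(operators: Sequence[Operator], a: int, b: int) -> Tuple[int, int]:
    """Execute one operation in RTMR style.

    Returns (result, number_of_executions). The third operator is only
    used if the first two results differ (hypothetical software model).
    """
    first = operators[0](a, b)
    second = operators[1](a, b)
    if first == second:
        # No mismatch: the third operator stays free for regular operations.
        return first, 2
    # Mismatch: recompute on a third operator and take the majority.
    third = operators[2](a, b)
    if third == first:
        return first, 3
    if third == second:
        return second, 3
    # More than one faulty operator: outside the single-fault assumption.
    raise RuntimeError("no majority among the three results")
```

With three fault-free adders the call needs two executions; if one of the first two adders is faulty, the third adder is consulted and the majority result is returned.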

3.1 RTMR Program Representation

The proposed idea requires hardware and software support. The required hardware support is explained in section 3.2. Software support requires some modifications during the code generation process for the given application. We refer to the modified code as the fault tolerant version of the application. The modifications are discussed at the level of basic blocks, which are represented as a data flow graph (V, E). Here, V is the set of operation nodes and D ∪ A = E ⊆ V × V is a set of edges, where D and A are disjoint sets. D is a set of data dependencies between operations. I.e., if (u,v) ∈ E then operation v reads a value that is produced by operation u. Therefore, every edge in E corresponds to a value that must be stored in a register of the register file. Furthermore, A is also a set of operation dependencies, where (u,v) ∈ A means that operation v must be executed after the execution of operation u is finished. In contrast to an edge in E, an edge in A does not correspond to a value that is saved in a register.

In the following, (V, E) is a basic block of a non-fault-tolerant version of the given application. This representation can be obtained by standard compilation techniques. To obtain the data flow graph (V', E') for the fault tolerant version of the given application, (V, E) must be modified in the following way. Every operation v ∈ V is duplicated:

$$V' := \{(v,0),\,(v,1) \mid v \in V\}.$$

The original operation is labeled with 0 and the duplicated one with 1. We refer to (v,0) as the reference operation of (v,1) and vice versa. All dependencies between original operations are retained between the operations labeled with 0. Furthermore, every duplicated operation gets the same input values as its reference operation. Therefore we get:

$$E' := \{((u,0),(v,0)),\,((u,0),(v,1)) \mid (u,v) \in E\}.$$

Furthermore we require that every original operation and its duplicate are executed before the result of the original operation is used by any other operation. We get this by adding additional dependencies to the set A':

$$A' := \{((u,0),(v,0)),\,((u,0),(v,1)) \mid (u,v) \in A\} \;\cup\; \{((v,1),(u,0)),\,((v,1),(u,1)) \mid (v,u) \in E\} \tag{1.1}$$

This restriction ensures that an error is discovered as early as possible and that no unchecked values are used as input. Therefore, in case of an error, no complicated roll back mechanism is necessary. Please note that the results of duplicated operations (u,1) are never used. Nevertheless, these operations are not dead code.

3.2 Architecture Extensions

In order to support RTMR, some modifications of the data path, the control path and the operation coding are necessary. We extend the instruction word by additional bit fields, add a temporary register file (TRF) and, for every FU, fault detection and compensation logic (FD&C logic) to the data path, and add a voting control logic to the control path. The modifications are shown in figure 3.

Figure 3: Modified data path for implementing RTMR. [Block diagram with the blocks Program Memory, Data Memory, Extern, Control Path, Control Logic, Instruction Pointer, Branch, Voting Control Logic, Data Path, Regular Register File, Temporary Register File and FU 1 ... FU n, each FU followed by an FD&C Logic block.]

The result of every FU can be written through the FD&C logic to the TRF and to the regular register file. Thereby, the written value is extended by a fault bit, which is set to 1 if it is known that the value is erroneous. Among other things, the fault bit is used by the regular register file to avoid storing erroneous values in registers. Furthermore, the FD&C logic block has access to the extended output value of its FU and to an arbitrary extended register value from the TRF. Please note that the value which is written by a certain FU to the TRF can be used immediately by the FD&C logic of another FU. This can be accomplished by an appropriate forwarding network within the TRF. Details about the FD&C logic are shown in figure 4.

Figure 4: FD&C logic of FU i. [Block diagram with the labels FU i, Result of FU i, Write Port of RF, Fault Vector, opcode, Fault Remember, errOpc, errDet, error, i_Fault, Cmp, RefFU, mod, RefReg, Read Port TRF, Write Port TRF, Control of TRF Read Ports, from voting logic, to voting logic.]

The shown FD&C logic belongs to FU i. Fault Vector is a bit field where every bit corresponds to a certain operator in FU i. If it is already known that there is a permanent fault in the corresponding operator, the fault bit is set to 1. The opcode of the current operation in the execution stage is stored in the bit field opcode. Depending on the current opcode, a certain operator in FU i is used to execute that operation. Fault vector and opcode are required by the fault remember logic to determine whether a permanent fault has already been detected for the currently used operator. Signal errOpc is set to 1 in such a case.

By RefReg a reference register from the TRF is selected. This register is either used to deliver a value to the compare unit (Cmp) or to save the result of FU i together with its fault bit. Which action is performed depends on the value of the mod-bit. Therefore, the computed result of FU i can either be compared to a value which was previously produced by another FU, or written to the TRF for later comparison. The previously produced value is the result of the reference operation, which was executed with the same input operands. A mismatch of the compared results indicates an error, either a transient one or possibly a permanent one. If a mismatch is discovered, the result is recomputed by a third FU and the obtained result is used for voting (see section 3.3). However, this is only necessary if the fault was not detected previously, i.e. errOpc is currently set to 0. Otherwise a re-computation is not necessary, because it is known that the corresponding operator is faulty. In order to initiate a re-computation, the signal i_Fault is set to 1 by the errDet unit if a mismatch is discovered and errOpc = 0. This signal is read by the voting logic. Furthermore, the fault remember logic can be used to distinguish between permanent and transient faults. I.e., if a mismatch is expected, because it is known that the operator of the current operation is faulty, but does not appear, the corresponding bit in the fault vector can be set to 0.

The control bits RefFU, RefReg and mod are coded together with the currently executed operation in the instruction word. A modified part of an instruction word for FU i is shown in figure 5.

Figure 5: Modified instruction word. [Field layout: opcode | src1 | src2 | dst | mod | RefReg | RefFU]

The coding of the bit fields opcode, src1, src2 and dst remains the same as in the original instruction word in figure 2. The coding of RefFU, RefReg and mod is as follows: RefFU always encodes the number of the FU that computes the reference value. If the mod-bit is set to 0, RefReg is selected as target register for the value currently computed by FU i. If the mod-bit is set to 1, the value of RefReg is compared with the currently computed value of FU i. Please note that the selected value may be the currently computed value of another FU. By this coding scheme it is possible to execute an operation u and its reference operation v at the same time on different FUs and to compare their results. In that case the mod-bit of the original operation is set to 0 and the mod-bit of the duplicated operation to 1. If the original and the duplicated operation are executed at different times, the mod-bit of the earlier executed operation is set to 0.
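The behavior of the FD&C logic described above can be summarized in a small behavioral model. This is a minimal sketch, not the hardware itself: the names `TrfEntry` and `fdc_step` are hypothetical, the TRF is modeled as a dictionary, and the fault vector is indexed directly by opcode, which assumes that every opcode maps to exactly one operator of the FU.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class TrfEntry:
    value: int
    fault: bool = False  # fault bit stored together with the value


def fdc_step(result: int, opcode: str, fault_vector: Dict[str, bool],
             mod: int, ref_reg: int, trf: Dict[int, TrfEntry]) -> Tuple[bool, bool]:
    """Model one evaluation of the FD&C logic of an FU.

    Returns (errOpc, i_Fault) for the operation in the execution stage.
    """
    # Fault remember logic: is the operator used by this opcode known faulty?
    err_opc = fault_vector.get(opcode, False)
    if mod == 0:
        # Store the result with its fault bit in the TRF for later comparison.
        trf[ref_reg] = TrfEntry(result, err_opc)
        return err_opc, False
    # mod == 1: compare with the reference value selected by RefReg.
    mismatch = trf[ref_reg].value != result
    # errDet: request a re-computation only for faults not known so far.
    i_fault = mismatch and not err_opc
    return err_opc, i_fault
```

In this model a mismatch on an operator that is already marked in the fault vector does not raise `i_Fault`, which corresponds to the case in which no re-computation is initiated.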

3.3 Fault-Detection

In this section we discuss the situation in which a mismatch between an operation and its reference operation occurs. We assume that both operations have type t and that the result of operation v can be found in register r of the TRF and has been produced by FU f. Furthermore, the reference operation v' of operation v is in the execution stage in the current clock cycle and is executed by FU f'. Therefore, RefReg = r and the reference value is delivered via Read Port TRF in figure 4. The mismatch is detected immediately after the result of v' is available at the output of FU f'. The compare unit in figure 4 indicates an error, and the operation must be executed by a third FU if the corresponding fault bit in the fault vector is 0, i.e. errOpc = 0. For this reason, the f'_Fault signal is set to one and the processor is switched from normal operation mode to voting-mode by the voting control logic in figure 3.

In voting-mode any other FU f'' that contains an operator of type t may be used for re-computing the result of operation v. All operations which are currently in the fetch and decode stage are discarded. Furthermore, FU f'' may be executing another operation whose execution is not finished within the current clock cycle. Therefore, all operations whose execution stage is not finished within the current clock cycle are discarded. All information necessary for controlling the processor in voting mode is available in the data path itself, and no further instructions are fetched. The following steps are executed in voting mode:

1. The voting logic determines the number of an FU f'' which can execute operation v and is different from f and f'. FU f'' can be determined depending on the signal f'_Fault, RefFU (see figure 4) and knowledge of the data path configuration.

2. All FUs locally save their last input values. The voting logic forwards the complete instruction word part of operation v' (see figure 5) from FU f' to FU f''. Thus FU f'' executes operation v' and reads the values from the same registers as FU f' did. These values are unmodified, because the write back of operation v' has been stalled in FU f'. Furthermore, the result of f'' is compared again with the value of register r in the TRF.

3. If the compare unit of FU f'' detects a mismatch (i.e. f''_Fault is set to 1), the result of FU f' is assumed to be correct. Otherwise the result of FU f is assumed to be correct. In the latter case the correct result has already been written to the destination register dst in the register file, and the write back of the result of FU f' must be suppressed when the pipeline resumes (i.e. the fault bit of the result of FU f' is set to 1). In the first case, the wrong result has been written to the destination register. Therefore, the result of FU f' must be written to the destination register when the pipeline resumes. Finally the fault bit in the faulty FU is set to 1.

After these steps the processor switches back to normal operation mode. Note that discarded operations must be re-executed. This can be accomplished by remembering unfinished operations and their input values. The last input values were saved locally by each FU. Remembering unfinished operations can be implemented by saving the last k executed instruction words in an instruction cache. All finished operations are overwritten in the instruction cache by a nop-operation. When the processor resumes normal operation mode, instructions from the instruction cache are fetched first.
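The decision taken in step 3 of the voting mode can be sketched as a small function. The sketch assumes a single faulty FU among f, f' and f''; the name `vote` and the string labels used for the FUs are hypothetical and only serve the illustration.

```python
from typing import Callable, Tuple


def vote(result_f: int, result_f_prime: int,
         recompute_on_third_fu: Callable[[], int]) -> Tuple[int, str, bool]:
    """Decide which of two mismatching results is correct.

    result_f is the value stored in register r of the TRF (produced by FU f),
    result_f_prime is the stalled result of FU f'.
    Returns (correct_result, faulty_fu, rewrite_dst), where rewrite_dst says
    whether the destination register must be overwritten on resume.
    """
    result_third = recompute_on_third_fu()
    if result_third != result_f:
        # f''_Fault = 1: FU f was wrong, so its value already written to
        # dst must be replaced by the result of FU f'.
        return result_f_prime, "f", True
    # No mismatch: FU f was right, write back of f' is suppressed.
    return result_f, "f'", False
```

For example, if FU f produced 7, FU f' produced 8 and the third FU also computes 8, the function reports FU f as faulty and requests that the destination register is rewritten with 8.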

3.4 Program Execution in Permanent Fault Case

In case any operator of FU f has a permanent fault, the corresponding bit in the fault-register of FU f is set to one. Depending on the opcode of the current operation in the execution stage of FU f, an errOpc signal is generated, which is 1 iff it is known that the current result of f is wrong due to a permanent fault. Now, two situations are possible:

I. f executes an operation whose reference value is not available so far, i.e. mod-bit is set to 0.

II. f executes an operation whose reference value is available in any register of TRF.

The generated errOpc signal is used for two reasons. First, if errOpc indicates that FU f has an error in the current operation, the write back of the current result to the destination register is suppressed. By this:

- In situation I: The fault bit of the current result is set to 1. Therefore, no wrong result is written back to the destination register of the regular register file. The correct result is written back by the reference operation.

- In situation II: The correct result, which has already been written to the destination register by the reference operation, is not changed. Therefore, the correct result remains in the destination register.

Second, a detection of an already known fault must be avoided:

- If situation I holds, the fault bit of the value that is written to the TRF is set to 1, indicating a fault. If this value is read by the reference operation for comparison, the comparison can be suppressed in order to avoid switching to voting-mode.

- If situation II holds, the comparison in the current FU is suppressed.

4 Preliminary Results

We applied the described program transformations to a benchmark program in order to demonstrate the applicability of our proposed idea. The selected benchmark program is the inner loop of an auto regression filter (ARF). It was taken from [7]. The data flow graph of the non-fault-tolerant program is shown in figure 6. All dependencies belong to set E of the data flow graph. We applied our program transformation rules from section 3.1 to the shown data flow graph and obtained a graph for the fault tolerant program version.

Figure 6: Data Flow graph of ARF benchmark program. [Graph with 28 operation nodes. Multiplications (∗): 1, 2, 4, 5, 7, 8, 11, 12, 15, 16, 17, 18, 21, 22, 23, 24. Additions (+): 3, 6, 9, 10, 13, 14, 19, 20, 25, 26, 27, 28.]

For both data flow graphs we performed a design space exploration in order to determine the best VLIW architecture for executing them within several schedule lengths. For the design space exploration we used our tool DESCOMP [3, 4]. Please note that the introduced transformations do not increase the critical path length. Therefore, for both data flow graphs we generated well suited architectures for schedule lengths L from 8 to 16. As expected, the required number of operators in a fault tolerant VLIW processor increases approximately by a factor of two. The same holds for the number of register file ports. More details are given in table 1.

Table 1: Benchmark results.

       Non-fault tolerant     Fault tolerant
L      FUs   Add   Mul        FUs   Add   Mul
8      4     4     4          8     8     8
9      4     3     4          8     7     8
10     3     3     3          6     6     6
11     3     3     3          6     6     6
12     3     3     2          5     5     5
13     3     3     2          5     5     5
14     2     2     2          5     5     4
15     2     2     2          4     4     4
16     2     2     2          4     4     3
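The relative overhead implied by table 1 can be computed directly. The following sketch assumes that the values of table 1 are listed row by row for each schedule length L in the order FUs, Add, Mul; the names `TABLE_1`, `overhead_percent` and `fu_overheads` are hypothetical helpers for this illustration.

```python
# (non-fault-tolerant, fault-tolerant) resource counts per schedule length L,
# each given as (FUs, Add, Mul), as read from table 1 under the row-wise
# reading assumed above.
TABLE_1 = {
    8: ((4, 4, 4), (8, 8, 8)),
    9: ((4, 3, 4), (8, 7, 8)),
    10: ((3, 3, 3), (6, 6, 6)),
    11: ((3, 3, 3), (6, 6, 6)),
    12: ((3, 3, 2), (5, 5, 5)),
    13: ((3, 3, 2), (5, 5, 5)),
    14: ((2, 2, 2), (5, 5, 4)),
    15: ((2, 2, 2), (4, 4, 4)),
    16: ((2, 2, 2), (4, 4, 3)),
}


def overhead_percent(base: int, fault_tolerant: int) -> float:
    """Relative overhead of the fault tolerant design in percent."""
    return 100.0 * (fault_tolerant - base) / base


def fu_overheads(table):
    """Overhead in the number of FUs for every schedule length."""
    return {L: overhead_percent(base[0], ft[0]) for L, (base, ft) in table.items()}


if __name__ == "__main__":
    for L, value in sorted(fu_overheads(TABLE_1).items()):
        print(f"L = {L:2d}: FU overhead {value:.1f}%")
```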

The results show that in most cases we get an overhead of about 100% for functional units (FUs), adders (Add) and multipliers (Mul). In some cases (L = 12 and L = 13) the overhead of FUs is lower, and in one case (L = 14) it is higher than 100%. However, because register file complexity grows more than linearly with the number of required ports [6], the total overhead in area and power consumption in the data path is more than 100%. A deeper analysis would require more details of the hardware implementation and will be part of future work.
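The transformation rules of section 3.1 and the statement that they do not increase the critical path length can be checked on a small example graph. This is a sketch under assumptions: E is treated as the set of value-carrying edges and A as the set of additional ordering edges, as the formulas for E' and A' use them, and the critical path is measured as the number of operations on the longest dependency chain (unit latency). The names `make_fault_tolerant` and `critical_path_length` are hypothetical.

```python
from functools import lru_cache


def make_fault_tolerant(V, E, A):
    """Apply the duplication rules for V', E' and A' to a basic block."""
    V_ft = {(v, k) for v in V for k in (0, 1)}
    # Both copies of v read the value produced by the original operation (u, 0).
    E_ft = {((u, 0), (v, k)) for (u, v) in E for k in (0, 1)}
    # Ordering edges from A are retained for both copies ...
    A_ft = {((u, 0), (v, k)) for (u, v) in A for k in (0, 1)}
    # ... and every consumer must wait for the duplicate of its producer.
    A_ft |= {((v, 1), (u, k)) for (v, u) in E for k in (0, 1)}
    return V_ft, E_ft, A_ft


def critical_path_length(nodes, edges):
    """Number of operations on the longest path of an acyclic graph."""
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)

    @lru_cache(maxsize=None)
    def longest_from(node):
        return 1 + max((longest_from(s) for s in successors[node]), default=0)

    return max((longest_from(n) for n in nodes), default=0)


if __name__ == "__main__":
    V = {"a", "b", "c", "d"}
    E = {("a", "b"), ("b", "c"), ("d", "c")}
    A = set()
    V_ft, E_ft, A_ft = make_fault_tolerant(V, E, A)
    print(critical_path_length(V, E | A))
    print(critical_path_length(V_ft, E_ft | A_ft))
```

In this example both graphs have the same critical path length, because an operation and its duplicate have the same predecessors and can therefore be scheduled in the same step.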

5 Conclusions and Limitations

We can conclude that our generated fault tolerant program is executed without a loss of time even in the presence of a permanent fault in an operator. We proposed methodologies to detect and handle a single transient or permanent fault during regular execution of the program. Detecting and handling a single fault takes a few additional clock cycles the first time the fault is recognized. This is because the faulty operation must be re-executed. We achieve these properties with an overhead of approximately 100% for operators in the data path and ports in the register file. Furthermore, a smaller temporary register file and some control logic are required.

Furthermore, we are able to tolerate faults of multiple operators. The system can produce correct results as long as, for every operator, its reference operator works correctly. However, we lose the opportunity to check the current result for correctness if a certain operator o is faulty. This is because we assume that its reference operator o' works correctly. But there still exists an opportunity to check the correctness of o'. If there is another operation which is executed by operator o' and by another operator o'' ≠ o, a fault in operator o' can also be detected. Nevertheless, in such a case we are not able to execute the program correctly, because for all operations which are executed by o and o' no backup operator is available. However, this situation can be detected by the errDet logic if the fault bits of both input values are set to one.

Furthermore, it can be noticed that the proposed transformation scheme for data flow graphs helps to save operators and FUs (see table 1). This is because our proposed program transformation allows scheduling an operation and its reference operation at different times. Therefore, in some cases FUs and operators may be reused. But it can also be noticed that in some cases a simpler transformation scheme, which simply duplicates all FUs and their operators in the data path, would be sufficient. A deeper analysis of the benefits would be necessary.

So far an operation and its reference operation use the same source operands, i.e. the same registers. A fault in one of these registers cannot be detected. However, either the registers can be hardened by some error correction code or different source operands can be used. In the latter case, the number of required registers is increased. But other control and data lines will then also be used to read and write values to the redundant registers. Therefore, better fault detection is possible. However, this option has some drawbacks for the proposed transformations of the data flow graph and should be further investigated.

6 Outlook

In order to compare the area and power consumption of our proposed architecture, a synthesizable architecture model is necessary. Fault simulation techniques may be applied to the model in order to determine its fault coverage. Furthermore, the proposed architecture introduces limitations to the scheduling and resource allocation process during design space exploration. These limitations should be included in our DSE tool DESCOMP in order to support a DSE methodology for the proposed fault tolerant VLIW architecture.

7 References

[1] Alex Orailoglu: Microarchitectural Synthesis of Gracefully Degradable, Dynamically Reconfigurable ASICs. International Conference on Computer Design (ICCD'96), pp. 112-117, 1996.

[2] Cristiana Bolchini and Fabio Salice: A Software Methodology for Detecting Hardware Faults in VLIW Data Paths. Proc. of the 2001 IEEE Int. Symp. on Defect and Fault Tolerance in VLSI Systems (DFT'01), pp. 170-175, 2001.

[3] Mario Schölzel and Peter Bachmann: DESCOMP: A New Design Space Exploration Approach. Proc. of the 18th Int. Conference on Architecture of Computing Systems (ARCS'05), pp. 178-192, 2005.

[4] Mario Schölzel, Peter Bachmann et al.: Durchgängiger automatisierter Entwurf von der Prozessor-Architektur bis zur Anwendungs-Software. Proc. of the Workshop Dresdner Arbeitstagung Schaltungs- und Systementwurf (DASS'06), pp. 19-24, 2006.

[5] Ramesh Karri, Kyosun Kim et al.: Computer Aided Design of Fault-Tolerant Application Specific Programmable Processors. IEEE Transactions on Computers, 49(11), pp. 1272-1284, 2000.

[6] Scott Rixner, William J. Dally et al.: Register Organization for Media Processing. Proc. of the 6th Int. Symp. on High-Performance Computer Architecture (HPCA'00), pp. 375-386, 2000.

[7] Viktor S. Lapinskii: Algorithms for Compiler-Assisted Design-Space-Exploration of Clustered VLIW ASIP Datapaths. Dissertation, University of Texas at Austin, 2001.

[8] Wah Chan and Alex Orailoglu: High-Level Synthesis of Gracefully Degradable ASICs. Proc. of the European Design and Test Conference (ED&TC'96), pp. 50-54, 1996.
