

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 8, AUGUST 2011

Novel Low Overhead Post-Silicon Self-Correction Technique for Parallel Prefix Adders Using Selective Redundancy and Adaptive Clocking

Swaroop Ghosh and Kaushik Roy

Abstract—In this paper, we present a post-silicon self-correction technique that leverages the redundancy present in parallel prefix adders (PPA). Our technique is based on the fact that a set of carries in PPAs can be made mutually exclusive. Therefore, defects in a set of bits can only corrupt the corresponding set of Sum outputs, whereas the remaining Sums are computed correctly. To utilize this property of PPAs efficiently in the presence of defects, we perform the addition over multiple clock cycles. In cycle 1, one correct set of bits is computed and stored in the output registers. In the subsequent cycles, the operands are shifted by one bit at a time and the remaining sets of bits are recovered. This allows us to compute the correct output at the cost of throughput degradation and minor area and delay overhead while maintaining high frequency and yield. Finally, the proposed technique is applied to a superscalar processor, whereby the self-correcting adder is assigned lower priority than the fault-free adders to reduce the overall throughput degradation.

Manuscript received April 30, 2009; revised November 17, 2009; accepted April 14, 2010. Date of publication June 21, 2010; date of current version July 27, 2011. The authors are with the Department of Electrical Engineering, Purdue University, West Lafayette, IN 47906 USA (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2010.2051169

Fig. 1. Effect of fault in a 4-bit KSA. The redundant column is also shown.

Index Terms—Adaptive clocking, parallel prefix adders, self-correction, spatial and temporal redundancy.

I. INTRODUCTION

The success of a new technology depends on self-repair mechanisms that keep the chip operational even in the presence of defects. Several techniques have been proposed in the past to tolerate various kinds of defects in arithmetic and logic circuits [1]–[7], [16]. In this paper, we exploit the inherent spatial redundancy present in high-speed parallel prefix adder (PPA) circuits in an efficient manner in order to tolerate defects. For example, due to its structure, the even and the odd carries are mutually exclusive in the Kogge-Stone adder (KSA) [8]. As shown in Fig. 1, a defect in bit 1 can introduce errors only in Sum1 and Sum3; the other Sum outputs (i.e., Sum0 and Sum2) are computed in parallel and remain fault-free. To utilize this property of the KSA efficiently in the presence of defects, [5] suggests adding a small amount of overhead to the adder at design time: a redundant column is added to the adder (see Fig. 1). The fault-free adder operates normally (in a single clock cycle). However, if the adder is defective, the addition is performed in two clock cycles. In cycle 1, one correct set of bits (Sum0 and Sum2) is computed and stored in the output registers. In cycle 2, the operands are left-shifted by one bit and the remaining set of bits (Sum1 and Sum3) is recovered and stored. This allows [5] to tolerate any kind of defect at the cost of throughput degradation due to a few extra data-recovery cycles, while maintaining the rated frequency and yield.

It is worth noting that [5] deals only with the repair of the KSA, where the even and the odd carries are mutually exclusive; other PPAs do not have this property. In this paper, we propose selective duplication, combined with extra latency, to make any adder topology amenable to self-repair. The proposed self-repair technique has the following advantages over conventional techniques: 1) better than masking the faulty ALU in terms of throughput; 2) better than DMR/TMR in terms of area overhead; and 3) scalable to future technology nodes where DMR/TMR may lose effectiveness due to high defect density.
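To make the mutual-exclusivity property concrete, the following is a minimal behavioural sketch (ours, not taken from the paper, and independent of any gate-level implementation) of a Kogge-Stone prefix tree in Python. A single stuck-at-0 fault is injected into one prefix node, and the test checks that every Sum bit it corrupts lies at positions of a single parity. The exact parity depends on the indexing convention, so the sketch discovers it empirically rather than asserting the numbering of Fig. 1.

```python
import random

def ksa_sums(a, b, width, fault=None):
    """Behavioural Kogge-Stone adder returning the list of Sum bits.

    fault, if given, is a (stage, column) pair whose generate output is
    forced to 0 (a crude stuck-at-0 model).  This is an illustrative
    software model, not the circuit of [5] or Fig. 1.
    """
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]   # propagate
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]   # generate
    G, P = g[:], p[:]
    stage, dist = 1, 1
    while dist < width:                       # log2(width) prefix stages
        nG, nP = G[:], P[:]
        for i in range(dist, width):
            nG[i] = G[i] | (P[i] & G[i - dist])
            nP[i] = P[i] & P[i - dist]
            if fault == (stage, i):
                nG[i] = 0                     # inject the defect here
        G, P = nG, nP
        stage, dist = stage + 1, 2 * dist
    carry = [0] + G[:width - 1]               # carry into bit i (cin = 0)
    return [p[i] ^ carry[i] for i in range(width)]

if __name__ == "__main__":
    random.seed(1)
    WIDTH, FAULT = 8, (1, 1)                  # defect in one stage-1 node
    corrupted = set()
    for _ in range(20000):
        a, b = random.randrange(1 << WIDTH), random.randrange(1 << WIDTH)
        good = ksa_sums(a, b, WIDTH)
        bad = ksa_sums(a, b, WIDTH, FAULT)
        corrupted |= {i for i in range(WIDTH) if good[i] != bad[i]}
    # Every corrupted Sum position shares one parity, so the other half of
    # the outputs is always computed correctly -- the property used in [5].
    assert len({i % 2 for i in corrupted}) == 1
    print("corrupted Sum positions:", sorted(corrupted))
```

With this particular node and indexing the corrupted positions happen to be even, whereas Fig. 1 labels a bit-1 defect as corrupting Sum1 and Sum3; the difference is purely one of indexing convention. The assertion checks exactly the property the repair scheme relies on: all Sums corrupted by a single defect share one parity, so the other parity is always correct.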

Fig. 2. Self-correcting KSA [5].

II. SELF-CORRECTING KSA

The overall structure of an 8-bit self-correcting adder is illustrated in Fig. 2. The following components are required for self-correction.
1) Multiplexers at the Inputs: Multiplexers are required at the inputs to shift the operands to the left by one bit during the second cycle.
2) Extra Bit Computation: An extra redundant column is added to the Kogge-Stone tree so that, even if the fault affects the last bit, the correct Sum can be restored through this column in cycle 2.
3) Application of Non-Controlling Values in LSB and MSB: During cycle 2, non-controlling values must be applied to the LSB in order to prevent inadvertent masking of the carry chain.
4) Multiplexers at the Outputs: These are required at the outputs to shift the partially correct Sums to the right by one bit during the second cycle of a faulty adder.
5) Generation of the Shift-Enable Signal: The shift-enable signal must be generated properly to control the shifting of the operands and the Sum outputs; it is kept inactive in normal (single-cycle) mode.
6) Clocking of the Output Registers: In a defective adder, only the flip-flops corresponding to the incorrect output bits should be clocked during the recovery cycle. This prevents the correct data stored in the registers during the first cycle from being destroyed.

Although the above technique has very low overhead, it is not scalable to other adder topologies. Fig. 3(a) shows a 4-bit Han-Carlson adder (HCA) [10]. Since the even carries are generated from the odd carries, a defect in node (0, 1) corrupts three output bits. Such failures clearly cannot be recovered by shift-and-recomputation alone. In order to generalize this concept to any PPA, we adopt a two-step process. In the first step, we protect a part of the adder by duplication, which creates the required independence among the even and odd sets of output bits for the given PPA. Note that duplication is similar to DMR, except that the good unit is selected statically (at test time) instead of using a voter.
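Before generalizing to other PPAs, the two-cycle mode built from the six components above can be summarized behaviourally. The sketch below is ours, not the paper's RTL: it abstracts the defective adder as one that may corrupt Sum bits of a single, known parity, shifts the operands left by one bit in cycle 2 (non-controlling zeros entering the LSB, the extra MSB landing in the redundant column), shifts the result back, and "clocks" only the previously incorrect flip-flops. Names such as faulty_adder and bad_parity are illustrative assumptions.

```python
import random

N = 8  # adder width; one redundant MSB column is assumed (width N + 1)

def faulty_adder(a, b, bad_parity):
    """Abstract model of a defective adder: Sum bits whose position has
    parity bad_parity may be corrupted; the other parity is always correct."""
    bits = [((a + b) >> i) & 1 for i in range(N + 1)]
    for i in range(bad_parity, N + 1, 2):
        bits[i] ^= random.randint(0, 1)        # the defect may flip these bits
    return bits

def two_cycle_add(a, b, bad_parity):
    # Cycle 1: normal addition; every output flip-flop is clocked.
    regs = faulty_adder(a, b, bad_parity)[:N]
    # Cycle 2: input MUXes shift the operands left by one bit; the output
    # MUXes shift the result right again, and the gated clock updates only
    # the flip-flops that held incorrect bits after cycle 1.
    shifted = faulty_adder(a << 1, b << 1, bad_parity)
    for i in range(bad_parity, N, 2):
        regs[i] = shifted[i + 1]               # gated clock enabled only here
    return regs

if __name__ == "__main__":
    random.seed(0)
    for _ in range(10000):
        a, b = random.randrange(1 << N), random.randrange(1 << N)
        parity = random.randint(0, 1)
        expect = [((a + b) >> i) & 1 for i in range(N)]
        assert two_cycle_add(a, b, parity) == expect
    print("all sampled additions recovered correctly in two cycles")
```

The key observation encoded here is that the defect sits at fixed physical bit positions, so after the one-bit shift it lands on the opposite parity of the original operand bits; the bits latched in cycle 1 and the shifted-back bits of cycle 2 therefore cover each other.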



Fig. 3. Four-bit HCA: (a) a fault at node (0, 1) corrupts most of the outputs, which cannot be repaired by shifting alone, and (b) the self-correcting adder with the important node protected by duplication. The redundant columns and the clocking of the output flip-flops are also shown.

Next, shifting and recomputation are performed in the second step to recover the correct outputs in the presence of a fault in the unprotected part. In the example of Fig. 3(a), we protect node (0, 1) by duplication [see Fig. 3(b)]. The select lines of all duplication MUXes are tied together and controlled by a programmable register. If the original node is faulty, we switch to the redundant copy by programming this register. If the fault is anywhere else, shift-and-recomputation can be used to recover the correct output. Note that the MUX adds a small amount of delay to the critical path, but this is negligible compared to conventional DMR/TMR with a voter. The clocking requirement (for the faulty adder), the two redundant columns, and the input/output scan flip-flops are also shown. The self-correction mechanism is discussed further in Section III.

III. GENERIC FRAMEWORK FOR SELF-CORRECTING PARALLEL PREFIX ADDERS

In this section, we generalize the above concept to a set of other PPAs. We propose adding selective spatial redundancy at important locations and using adaptive latency control on top of it to make the adders self-corrective.

A. Basic Idea

Fig. 4(a) illustrates the structure of a generic PPA consisting of three stages: setup, prefix carry tree, and sum. The setup stage computes the (p, g) pair of each bit, the prefix carry tree computes the carry at each bit position, and the sum stage computes the final sum from the carry input and the propagate signal. For the sake of simplicity, we focus on only three possible faulty nodes (namely A, B, and C) and the corresponding fan-outs feeding the sum stage. A fault in node A can corrupt four successive intermediate carry outputs, i, i+1, i+2, and i+3, and cannot be recovered in fewer than four clock cycles. Since higher fault-recovery latency is detrimental to throughput, we limit the maximum recovery latency to three cycles; therefore, node A must be protected by duplication. A fault in node B or C, on the other hand, can corrupt only two successive bits, which can be repaired by shifting the operands twice and recomputing (at the cost of obtaining the correct result in three clock cycles). In the next subsection, we present a systematic approach to identify such important nodes for replication.

Fig. 4(b) shows the structure of the self-correcting PPA. Along with the self-correction components illustrated in Fig. 2, we add the following extra components: 1) a redundant copy of the important nodes; 2) a set of multiplexers that select the outputs of the fault-free partial tree; 3) redundant columns (one or two, depending on the adder structure); 4) modifications to the generation of the shift-enable signal in order to control the shifting of the operands for the required number of cycles (to perform the recomputation); and 5) a gated clock (gclk) for the output flip-flops so that the outputs are stored in the correct order after the recovery latency.

B. Spatial Redundancy Insertion

Fig. 4(c) shows the flowchart used to identify the important nodes of the adder for insertion of redundancy. The row (column) index of the adder is labeled k (i), as in Fig. 3(a). The algorithm traverses each node of the adder in a row-wise fashion and determines whether replication of the node is required. From each node (i, k), the algorithm traverses all fan-out branches and creates the list of Sum outputs (Flist) that would be affected if a fault occurred at this particular node. If the list contains more than two consecutive Sums, a fault at this node cannot be corrected by recomputation, and the node is marked as "important." The algorithm proceeds until all nodes in the prefix carry tree have been traversed. Finally, the adder is traversed in a column-wise fashion, redundant copies are inserted at the nodes marked "important," and multiplexers are inserted at the last important node of each column. The redundant nodes form an extra partial tree, and the multiplexers enable selection of the fault-free partial tree in the presence of defects [see Fig. 4(b)]. In this context, it is important to note that faults may originate in the multiplexers themselves, nullifying the effectiveness of the proposed approach; two copies of the multiplexers, or fault-tolerant multiplexers [17], can be employed for better reliability.
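A compact sketch of the identification pass just described is given below. It is our illustration, written under the assumption that the prefix carry tree is available as an explicit fan-out graph (each prefix node mapped to its successors, with integer entries standing for Sum outputs); the node names are purely hypothetical.

```python
def affected_sums(node, fanout):
    """Flist of Fig. 4(c): Sum output indices reachable from node through
    its fan-out branches.  fanout maps each prefix node to its successors;
    integer successors denote Sum outputs, anything else is a prefix node."""
    hit, seen, stack = set(), set(), [node]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        for succ in fanout.get(n, ()):
            if isinstance(succ, int):
                hit.add(succ)          # reached a Sum output
            else:
                stack.append(succ)     # another prefix node
    return hit

def more_than_two_consecutive(sums):
    run = best = 0
    for i in sorted(sums):
        run = run + 1 if (i - 1) in sums else 1
        best = max(best, run)
    return best > 2

def important_nodes(fanout):
    """Nodes whose failure would corrupt more than two consecutive Sums;
    per Fig. 4(c) these must be protected by duplication, because the
    recovery latency is capped at three cycles (two shifts)."""
    return [n for n in fanout
            if more_than_two_consecutive(affected_sums(n, fanout))]

# Toy fan-out graph (purely illustrative, not the HCA of Fig. 3): a fault
# at "x" reaches Sums 1, 2, and 3, so "x" must be duplicated.
toy = {"x": ["y", "z", 3], "y": [1], "z": [2]}
print(important_nodes(toy))    # -> ['x']
```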

C. Fault Localization

Since the proposed fault-tolerance technique is not based on voting, we have developed a methodology to isolate (bin) the fault into one of three categories: 1) the protected part; 2) the even bits; or 3) the odd bits. If only the even set of output bits is ever corrupted, the fault lies in an even bit of the unprotected part; the same holds for corrupted odd output bits. However, if both the even and the odd bits are corrupted, the fault lies in the protected part (under a single-fault assumption).
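This binning is easy to express as a test-time routine. The sketch below is our illustration; faulty_sums and golden_sums stand for the observed and expected Sum bits of a test vector and are assumptions of this sketch, not an API from the paper.

```python
def localize_fault(test_vectors, faulty_sums, golden_sums):
    """Bin a failing adder as 'protected', 'even', 'odd', or 'fault-free'
    from its test responses (single-fault assumption, Section III-C)."""
    parities = set()
    for a, b in test_vectors:
        observed, expected = faulty_sums(a, b), golden_sums(a, b)
        parities |= {i % 2 for i, (o, e) in enumerate(zip(observed, expected))
                     if o != e}
    if parities == {0, 1}:
        return "protected"   # both parities fail: switch to the redundant copy
    if parities == {0}:
        return "even"        # repairable by shift-and-recompute
    if parities == {1}:
        return "odd"         # repairable by shift-and-recompute
    return "fault-free"
```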
D. Application to Superscalar Pipeline

In superscalar processors, there are typically several functional units of the same type [integer ALUs, integer multipliers, floating-point arithmetic logic units (ALUs), etc.]. Consider a processor that has a faulty integer ALU. The manufacturer would either have to discard the faulty chip, incurring a significant yield loss, or disable the faulty functional unit. Disabling the faulty ALU is significantly more attractive, since it allows the faulty chip to be salvaged, albeit with a throughput penalty due to the availability of fewer ALUs [5]. For this work, we assume that the faulty functional unit is the adder of the integer ALU. Instead of completely disabling the faulty adder, we use it for computation but employ adaptive clocking to allow it to perform computations over multiple clock cycles. There are two major challenges in employing this scheme. The first is to ensure that the faulty functional unit is scheduled rarely [5]; this is achieved by assigning it the lowest priority, so that it is scheduled for computation only when all other good ALUs are in use. The second is to ensure that dependent instructions are not woken up before the faulty adder has completed its computation. Both tasks require slight modifications to the schedule and issue logic of the superscalar processor. Whenever an instruction is ready to be issued (all instructions it depends on have completed), the scheduler locates an available functional unit. If a functional unit is available, the scheduler issues the instruction to it and informs the wake-up logic to wake up instructions that depend on the issued one after a given number of cycles. In addition, each functional unit has a REQUEST signal to indicate that it is available for execution [5].
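The scheduling change can be pictured with a small model. This is our sketch of the policy, not SimpleScalar code; the FunctionalUnit fields and the two-cycle repair latency are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FunctionalUnit:
    name: str
    request: bool = True      # REQUEST: unit is available for execution
    fault: bool = False       # FAULT: set at test time, never changes at run time
    repair_cycles: int = 2    # extra cycles needed by the self-corrective adder

def select_unit(units: List[FunctionalUnit]) -> Optional[FunctionalUnit]:
    """Lowest priority for the faulty ALU: pick a fault-free available unit
    first; fall back to the faulty one only if nothing else is free."""
    available = [u for u in units if u.request]
    healthy = [u for u in available if not u.fault]
    return (healthy or available or [None])[0]

def issue(unit: FunctionalUnit, base_latency: int) -> int:
    """Issue an instruction and return the delay after which the wake-up
    logic may release dependent instructions."""
    unit.request = False
    return base_latency + (unit.repair_cycles if unit.fault else 0)

# Example: three good ALUs busy, the faulty one free -> it is used, and
# dependents are woken up base_latency + repair_cycles later.
alus = [FunctionalUnit(f"alu{i}", request=False) for i in range(3)]
alus.append(FunctionalUnit("alu3", fault=True))
chosen = select_unit(alus)
print(chosen.name, issue(chosen, base_latency=1))   # -> alu3 3
```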


Fig. 4. Structure of (a) a generic PPA and the effect of faults (a failure in node A cannot be corrected by shift-and-recomputation) and (b) a self-correcting PPA containing a redundant copy of the partial tree and a set of multiplexers for selection. The redundant columns are also shown. (c) Flowchart for identification of the important nodes where redundancy should be added.

Fig. 5. Self-corrective PPAs with added redundancy: (a) Han-Carlson, (b) Sklansky, (c) Ladner-Fischer, and (d) Brent-Kung. The redundant columns are circled. The shaded parts are duplicated, and multiplexers are placed at the interface.

Fig. 6. (a) Area and (b) delay overhead for adders of width 8, 16, 32, and 64 bits, and (c) throughput degradation for various numbers of faulty (but repairable) adders. A 64-bit HCA is assumed as the adder structure for these simulations. The throughput loss when the faulty adder is discarded is also plotted.

In order to implement the modification to the scheduling policy, each functional unit requires an additional status bit (FAULT) to indicate whether it is faulty. The scheduler checks this FAULT bit in addition to the REQUEST signal. If the FAULT bit is set, the scheduler wakes up dependent instructions after the desired number of repair clock cycles.


TABLE I PROCESSOR CONFIGURATION

Note that the FAULT bit is set during test and does not change during execution; thus the performance and power overhead is small. Furthermore, the FAULT bit is not set if the fault can be repaired by duplication alone. It is also worth noting that the proposed technique can hurt throughput in out-of-order execution processors (due to the bypass logic). To simplify the implementation and the analysis, we limit the application of the proposed technique to in-order execution processors.

E. Simulation Results

We modified SimpleScalar [11] to accommodate the changes in the scheduling policy. We used the ref inputs, fast-forwarded 500 million instructions, and simulated 1 billion instructions for the SPEC 2000 benchmarks. The processor configuration is shown in Table I. As shown in the configuration, the integer execution unit consists of four integer ALUs. Assuming that one of the ALUs has a faulty adder core, we simulated two scenarios: 1) disabling the faulty ALU and 2) using the self-corrective adder with the scheduling policy described above. For applying the rare multi-cycle operations on the self-corrective adder, we considered all instructions that use the adders (e.g., addition, subtraction, address calculation, shift, rotation, etc.).

We explored four PPA structures (HCA [10], BKA [12], LFA [13], and SKA [14]) to demonstrate the effectiveness of the proposed self-correction methodology; however, the technique is generic and can be applied to any PPA. For the estimation of area and delay overhead, we considered adders of width 8, 16, 32, and 64 bits. The self-correcting 8-bit adders (with the important nodes shaded for duplication and multiplexer insertion) are shown in Fig. 5. The redundant columns at the MSB side are also shown. Note that the faulty self-corrective HCA, LFA, and BKA consume three cycles, whereas the faulty SKA consumes two clock cycles, for correct computation. Furthermore, this multi-cycle operation is limited to self-corrective adders in which the fault is located in the unprotected part; if the fault is confined to the protected part, it is corrected by duplication alone.

The area and delay overhead associated with the self-correcting adders is computed by synthesizing the Verilog netlists using Synopsys Design Compiler [15]. For synthesis, we used IBM 90-nm process technology libraries. It can be noted from Fig. 6(a) that the area overhead for narrow-width adders is large compared to that of wide adders. This is attributed to the duplicated nodes and redundant columns forming a larger fraction of a narrow adder. The delay overhead due to the insertion of multiplexers is very small [