JOURNAL OF COMPUTERS, VOL. 6, NO. 11, NOVEMBER 2011


Using Platform FPGAs for Fault Emulation and Test-set Generation to Detect Stuck-at Faults

Carson Dunbar
Electrical and Computer Engineering, University of Maryland, College Park, MD, USA
Email:[email protected]

Kundan Nepal
School of Engineering, University of St. Thomas, St. Paul, MN, USA
Email:[email protected]

Abstract—This paper investigates the use of reconfigurable computing and readily available Field Programmable Gate Array (FPGA) platforms to expedite the generation of input patterns for testing integrated circuits after manufacture. Unlike traditional fault simulation approaches, our approach emulates single stuck-at fault behavior in a circuit and finds the minimum test pattern set that detects the faults. In this paper, we present a method to insert faults into a circuit netlist by identifying circuit fault sites. We then present our parallel method of fault emulation and describe our method to organize and compress the input patterns needed to identify all faults. Using circuits from the ISCAS and MCNC benchmark suites, we show that our approach does better than a commercial tool in test-set reduction.

Index Terms—Field Programmable Gate Arrays, reconfigurable computing, fault emulation, stuck-at faults, manufacturing defects, test pattern generation, test-set compression.

I. INTRODUCTION

In 1965, Intel co-founder Gordon Moore observed that the number of transistors that fit in a given area was doubling at a regular pace, a trend popularly quoted as a doubling every eighteen months. This prediction, now popularly known as Moore's Law, has been the driving force behind rapid innovation in the semiconductor industry. For the past few decades, transistor dimensions have steadily scaled down. The smaller dimensions have increased the number of transistors integrated on a single chip, allowing for higher packing density, better circuit performance and lower cost of production.

Building large circuit structures or integrating billions of devices at very small dimensions raises its own set of challenges. The precise transfer of circuit patterns onto a wafer using lithography is becoming difficult as transistor dimensions become much smaller than the wavelength of the optical sources. Precise control of doping, reliability of thin oxide structures, and decreased mobility of charge carriers are a few of the other challenges directly attributed to scaling [1]. As a result, faults introduced during the complex manufacturing steps manifest themselves as permanent defects in wires and transistors.

The energy consumption constraints imposed by the large number of devices will also require the lowest possible supply voltages. A supply voltage of 0.6V is the current ITRS prediction for low-power operation in 2021 [2]. This lower supply voltage also reduces the noise margin. This

© 2011 ACADEMY PUBLISHER

Fig. 1. Single stuck-at fault at an internal node of a circuit.

reduced noise margin exposes computation to higher soft error rates, since thermal fluctuations can easily alter the state of the devices [3]. The noise margin reduction also stems from the inherent variations in process and system parameters (e.g., threshold voltage, gate length and doping profile), which give rise to non-uniform switching behavior, degrading performance and increasing functional failures and susceptibility to noise.

While transient errors that occur during circuit operation require complex online error detection approaches [4]–[6], permanent faults can be checked for during production, and a defective chip can be discarded before it causes errors for the end user. Circuit designers use a variety of defect models to capture the behavior of a permanent defect in a chip. One such model, the stuck-at fault model, treats a wire within the circuit as permanently stuck at either logic 1 (stuck-at-1, SA1) or logic 0 (stuck-at-0, SA0). This fault model does not correspond to a specific physical cause; rather, it is an abstract model that captures the effect of numerous causes. Figure 1 shows a portion of a circuit where an internal line is permanently stuck at 0. The incorrect value at this node propagates through the rest of the circuit and produces an incorrect value at the output. The faulty value is shown in parentheses at each affected node.

In real circuits, faults can also occur at multiple locations at once. Handling more than one fault at a time is realistic but makes the detection algorithm very complex. However, Hughes and McCluskey showed that a test set that covers all single faults in a circuit also covers most multiple faults in the same circuit [7], [8]. Hence, the single stuck-at fault model is the focus of this paper.

A general way to test for stuck-at faults after manufacturing is to apply a set of input stimuli and compare the outputs of the integrated circuit with a set of expected outputs. Any deviation from the expected outputs indicates a functional error. Since


the number of input stimuli grows exponentially with the number of inputs to the circuit, finding a representative subset of the input patterns is of utmost importance. The goal of any fault-simulation algorithm is to find, quickly and efficiently, the smallest number of input patterns that covers the largest number of faults, using the single stuck-at fault to model errors in the system. Automatic Test Pattern Generation (ATPG) is a technique that aims to find a compact test set that detects all detectable faults in the system.

For test pattern generation, researchers have proposed numerous approaches for fault simulation in software and fault emulation in hardware. Fault simulation uses a computer program to determine the best possible vectors. It allows engineers to reproduce any detail of a circuit and gives them the option to focus on higher-level behavior before concentrating on the lower-level netlist. This is done primarily to check that the higher-level behavior falls within acceptable performance boundaries. Software techniques in popular use today include serial, semi-parallel, deductive, concurrent, differential-calculus, probabilistic and multi-threaded algorithms [8]–[11]. Fault emulation approaches in hardware have used vector processors [12], [13], multi-processors, graphics processing units [14], supercomputers [15] and reconfigurable computing platforms [16], [17]. These approaches have their own merits, but most require significant design time and effort as well as complex and very expensive specialized hardware. In contrast, the approach proposed in this paper performs fault emulation on an off-the-shelf commercial reconfigurable computing device, the Field Programmable Gate Array (FPGA).
An FPGA contains a collection of configurable logic blocks and programmable interconnects that the designer can configure to fit the design needs. An FPGA can be configured in a variety of ways: using a hardware description language (such as Verilog or VHDL); using a schematic built from logic gates and modules; or using a syntax similar to current software programming languages such as C++ and Java. Hardware description languages facilitate the transfer of simulation techniques to an FPGA and let programmers develop hardware more easily, using concepts familiar to software developers. By leveraging the FPGA's reconfigurability and parallel processing capabilities, fault detection can be sped up relative to previous computer simulation techniques. This paper discusses our automated fault insertion methodology and presents results for our parallel fault emulation technique, which finds the smallest set of input vectors required to find all detectable single stuck-at faults.

II. VIRTEX II PRO FPGA

The Xilinx Virtex-II Pro Development System (XUPV2P) serves as the platform for this work. This low-cost but powerful board houses a Xilinx XC2VP30 FPGA with 30,816 logic cells (grouped into Configurable Logic Blocks, CLBs), 136 18-bit multipliers, 2,448Kb of block RAM, and two PowerPC processor cores [18]. The PowerPC features a 64-bit architecture that can also run in


Fig. 2. Physical layout of the XC2VP30 FPGA with the design for C432 mapped onto it. The two PowerPC processor blocks are shown together with the used logic blocks (red) and unused logic blocks (white).

a 32-bit mode. This processor has 5 pipeline stages and 16 KB of instruction and data caches, and can run at clock rates in excess of 400 MHz. Communication between the processor and the custom logic cores built from the available CLBs takes place via the Processor Local Bus (PLB). Figure 2 shows the physical layout of the Virtex-II Pro FPGA used in this design. It shows the actual implementation of the fault detection algorithm (described later in Section IV) for benchmark circuit C432.

III. FAULT INSERTION

Fault testing and test vector generation require inserting faults into a copy of the circuit and comparing its outputs with those of the working version of the circuit. Even with a simple stuck-at fault model, this can be a daunting task because of the vast number of possible fault sites. Faults can appear on every line within the circuit: on every primary input and output, intermediate node, and fanout stem and branch. However, the number of faults can be brought down to a manageable count by exploiting the logical structure of the circuit under test. Based on the behavior of the logic gates, certain faults are equivalent to other faults because any test pattern that detects one also detects the others. Similarly, a fault is said to dominate another fault when the set of test patterns that detects it is a superset of the set that detects the other fault, so that any test for the dominated fault also detects the dominating one. These fault equivalence and dominance relations can be used to find the fault checkpoints of the circuit, a collapsed list of sites where faults need to be inserted. It has been shown that the checkpoints consist of all primary inputs of a circuit together with all fanout branches within the circuit [8].


module c17 (N1,N2,N3,N6,N7,N22,N23);
  input N1,N2,N3,N6,N7;
  output N22,N23;
  wire N10,N11,N16,N19;
  nand NAND2_1 (N10, N1, N3);
  nand NAND2_2 (N11, N3, N6);
  nand NAND2_3 (N16, N2, N11);
  nand NAND2_4 (N19, N11, N7);
  nand NAND2_5 (N22, N10, N16);
  nand NAND2_6 (N23, N16, N19);
endmodule

Fig. 3. C17 schematic and structural Verilog netlist.

TABLE I
DIRECTED GRAPH TABLE FOR C17

Name   Type   Parents
N1     PI     Null
N2     PI     Null
N3     PI     Null
N6     PI     Null
N7     PI     Null
N10    NAND   N1, N3
N11    NAND   N3, N6
N16    NAND   N2, N11
N19    NAND   N11, N7
N22    NAND   N10, N16
N23    NAND   N16, N19
Parents used more than once: Null, N3, N11, N16

module c17test(N1,N2,N3,N6,N7,N22,N23,
               SS0,SS1,SS2,SS3,SS4,SS5,
               SS6,SS7,SS8,SS9,SS10,
               errsig);
  input N1,N2,N3,N6,N7,
        SS0,SS1,SS2,SS3,SS4,SS5,
        SS6,SS7,SS8,SS9,SS10,
        errsig;
  output N22,N23;
  wire N10,N11,N16,N19,
       N1_0, N2_0, N3_0, N6_0, N7_0,
       N3_1, N3_2, N11_1, N11_2, N16_1, N16_2;
  mux Mux0(N1_0, SS0, errsig, N1);
  mux Mux1(N2_0, SS1, errsig, N2);
  mux Mux2(N3_0, SS2, errsig, N3);
  mux Mux3(N6_0, SS3, errsig, N6);
  mux Mux4(N7_0, SS4, errsig, N7);
  mux Mux5(N3_2, SS5, errsig, N3_0);
  nand NAND2_1 (N10, N1_0, N3_2);
  mux Mux6(N3_1, SS6, errsig, N3_0);
  nand NAND2_2 (N11, N3_1, N6_0);
  mux Mux7(N11_2, SS7, errsig, N11);
  nand NAND2_3 (N16, N2_0, N11_2);
  mux Mux8(N11_1, SS8, errsig, N11);
  nand NAND2_4 (N19, N11_1, N7_0);
  mux Mux9(N16_2, SS9, errsig, N16);
  nand NAND2_5 (N22, N10, N16_2);
  mux Mux10(N16_1, SS10, errsig, N16);
  nand NAND2_6 (N23, N16_1, N19);
endmodule

To insert faults into the netlist, we wrote a C++ program that takes in a structural Verilog netlist and outputs a fault-inserted netlist. Consider the schematic and structural Verilog netlist shown in Figure 3 for circuit C17 from the ISCAS benchmark suite. From the circuit netlist, a directed graph table is created with the name of each circuit node, its logic type and its parent nodes, as shown in Table I. After the table has been created, the wires that feed each node (the entries of the parent column) are all stored in a vector. This vector is scanned repeatedly, and any wire that appears more than once is stored in a second vector along with the number of times it appears. A node that appears in the parent column more than once is thus identified as a fanout stem, and any node with a NULL parent is a primary input. Since the checkpoints consist of all the fanout branches and primary inputs, it is now possible to add multiplexers at these nodes of the circuit to emulate stuck-at faults accurately. At every checkpoint a multiplexer is inserted, with one input connected to the original gate connection and the second input connected to the stuck-at-1 or stuck-at-0 value being emulated. A multiplexer select signal chooses between the faulty value and the correct value, as shown in Figure 4.
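The table construction and fanout search described above can be sketched in a few lines of Python. This is an illustrative reimplementation under simplifying assumptions (a flat netlist with one primitive gate per line), not the authors' C++ program; build_graph_table and find_checkpoints are hypothetical names.

```python
import re
from collections import Counter

C17_NETLIST = """
module c17 (N1,N2,N3,N6,N7,N22,N23);
input N1,N2,N3,N6,N7;
output N22,N23;
wire N10,N11,N16,N19;
nand NAND2_1 (N10, N1, N3);
nand NAND2_2 (N11, N3, N6);
nand NAND2_3 (N16, N2, N11);
nand NAND2_4 (N19, N11, N7);
nand NAND2_5 (N22, N10, N16);
nand NAND2_6 (N23, N16, N19);
endmodule
"""

GATE_RE = r"^\s*(nand|nor|and|or|xor|not)\s+\w+\s*\(([^)]*)\)\s*;"


def build_graph_table(netlist):
    """Return {node: (type, parents)}; primary inputs have no parents."""
    table = {}
    inputs = re.search(r"input\s+([^;]+);", netlist).group(1)
    for name in inputs.split(","):
        table[name.strip()] = ("PI", [])
    # Each gate line reads: <type> <instance> (<output>, <input>, ...);
    for gate, ports in re.findall(GATE_RE, netlist, re.M):
        out, *parents = [p.strip() for p in ports.split(",")]
        table[out] = (gate.upper(), parents)
    return table


def find_checkpoints(table):
    """Return (primary inputs, fanout stems, fanout branches)."""
    primary_inputs = [n for n, (kind, _) in table.items() if kind == "PI"]
    # Count how often each node appears in the parent column.
    uses = Counter(p for _, parents in table.values() for p in parents)
    stems = [n for n in table if uses[n] > 1]
    # One branch per (stem, consuming node) pair.
    branches = [(p, node) for node, (_, parents) in table.items()
                for p in parents if uses[p] > 1]
    return primary_inputs, stems, branches


table = build_graph_table(C17_NETLIST)
pis, stems, branches = find_checkpoints(table)
```

For C17 this gives five primary inputs and the fanout stems N3, N11 and N16 with two branches each, that is 11 checkpoints, which matches the 11 select signals SS0 to SS10 of the fault-inserted netlist.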

Fig. 4. Fault insertion multiplexer.

inputs to each multiplexer are the original signal and the error signal. The multiplexer output is named by appending _0 to the original input name. The same process is repeated for each fanout branch in the checkpoint list. At the end of the netlist, a module for the multiplexer is added. The generated netlist is shown in Figure 5. The multiplexers are numbered in the order they are created. In the netlist, the input errsig is either 0 (to emulate stuck-at-0) or 1 (to emulate stuck-at-1), and the inputs SS0 to SS10 are the multiplexer select signals. Since we are working with single stuck-at faults, only one of the SS0-SS10 signals is HIGH at any given time.
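The naming and numbering scheme of the generated netlist can be reproduced with a short Python sketch. The in-memory gate list and the function insert_fault_muxes are assumptions made for illustration and are not the authors' generator. The mux port order is copied from the c17test listing; reading it as (output, select, fault value, original signal) is an assumption, as is the rule that branch suffixes count down from the fanout count.

```python
from collections import Counter

PRIMARY_INPUTS = ["N1", "N2", "N3", "N6", "N7"]
# (instance name, output, inputs) for each NAND gate of C17, in netlist order.
GATES = [
    ("NAND2_1", "N10", ["N1", "N3"]),
    ("NAND2_2", "N11", ["N3", "N6"]),
    ("NAND2_3", "N16", ["N2", "N11"]),
    ("NAND2_4", "N19", ["N11", "N7"]),
    ("NAND2_5", "N22", ["N10", "N16"]),
    ("NAND2_6", "N23", ["N16", "N19"]),
]


def insert_fault_muxes(primary_inputs, gates):
    """Return the mux and gate instance lines of a fault-inserted netlist."""
    uses = Counter(p for _, _, ins in gates for p in ins)
    remaining = dict(uses)  # next branch suffix for each fanout stem
    lines, mux_id = [], 0
    for pi in primary_inputs:  # one mux per primary input, output <pi>_0
        lines.append(f"mux Mux{mux_id}({pi}_0, SS{mux_id}, errsig, {pi});")
        mux_id += 1
    for inst, out, ins in gates:
        ports = []
        for p in ins:
            driver = f"{p}_0" if p in primary_inputs else p
            if uses[p] > 1:  # fanout branch: insert a dedicated mux
                branch = f"{p}_{remaining[p]}"
                remaining[p] -= 1
                lines.append(
                    f"mux Mux{mux_id}({branch}, SS{mux_id}, errsig, {driver});")
                mux_id += 1
                driver = branch
            ports.append(driver)
        lines.append(f"nand {inst} ({out}, {', '.join(ports)});")
    return lines


lines = insert_fault_muxes(PRIMARY_INPUTS, GATES)
```

Applied to C17, the sketch emits Mux0 to Mux4 for the primary inputs and Mux5 to Mux10 for the fanout branches, interleaved with the NAND gates in the same order as the c17test module.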

Fig. 6. Fault free C17 and three instantiations of the faulty circuit (C17-1, C17-2 and C17-3). The faults are activated based on the SS10-SS0 vector.
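The behavior of the faulty copies in Fig. 6 can be mimicked in software. The Python sketch below models the fault-inserted C17 netlist with a one-hot select vector and compares its outputs against the fault-free circuit. The function names are hypothetical, and combining the two per-output XOR results with an OR into a single error flag is an assumption about how the ERROR signal is formed.

```python
def nand(a, b):
    return 1 - (a & b)


def mux(select, fault_value, original):
    """Fault-insertion multiplexer: pass the fault value when selected."""
    return fault_value if select else original


def c17test(inputs, ss, errsig):
    """Software model of the fault-inserted C17; ss is [SS0, ..., SS10]."""
    n1, n2, n3, n6, n7 = inputs
    n1_0 = mux(ss[0], errsig, n1)
    n2_0 = mux(ss[1], errsig, n2)
    n3_0 = mux(ss[2], errsig, n3)
    n6_0 = mux(ss[3], errsig, n6)
    n7_0 = mux(ss[4], errsig, n7)
    n10 = nand(n1_0, mux(ss[5], errsig, n3_0))
    n11 = nand(mux(ss[6], errsig, n3_0), n6_0)
    n16 = nand(n2_0, mux(ss[7], errsig, n11))
    n19 = nand(mux(ss[8], errsig, n11), n7_0)
    n22 = nand(n10, mux(ss[9], errsig, n16))
    n23 = nand(mux(ss[10], errsig, n16), n19)
    return n22, n23


def detected_faults(inputs):
    """Return the (select index, stuck value) faults detected by `inputs`."""
    good = c17test(inputs, [0] * 11, 0)  # all selects low: fault-free copy
    found = set()
    for errsig in (0, 1):
        for k in range(11):
            ss = [int(i == k) for i in range(11)]  # one-hot select vector
            bad = c17test(inputs, ss, errsig)
            # XOR each output pair, then OR the results into one flag.
            error = (good[0] ^ bad[0]) | (good[1] ^ bad[1])
            if error:
                found.add((k, errsig))
    return found
```

For example, with all five inputs at 1 the fault-free outputs are (N22, N23) = (1, 0), and activating SS10 with errsig = 0 flips N23, so that input pattern detects the stuck-at-0 fault on that fanout branch.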


Figure 6 shows an example of three instantiations of the C17 circuit with different multiplexer select signals. In the circuit labeled C17-1, the select lines are SS10-SS0 = 00000000001, which means that only the fault associated with SS0 is activated in that copy. Similarly, C17-2 and C17-3 have the faults associated with multiplexer select lines SS1 and SS2 activated, respectively. As shown in the figure, the output of each circuit with a fault is compared to that of the fault-free circuit using a series of XOR gates. This creates an ERROR signal indicating that the applied input pattern has detected a fault in

Fault insertion multiplexer.

Once the checkpoints have been identified, a new file is generated with the fault-inserted netlist. The module name and the input and output lines are copied directly from the original netlist into the new file. The port and wire declarations are also copied over, with the extra multiplexer control signals added. Multiplexers are then added for the nodes identified as primary inputs. The


Fig. 5. C17 schematic with faults inserted.


that particular module. During the first iteration of these three instantiations, the faults associated with SS0, SS1 and SS2 are tested. The next iteration tests the next set of three faults (i.e., SS3, SS4 and SS5), and so on until all faults have been excited and tested.

IV. FAULT TESTING

The fault testing algorithm went through a number of phases of implementation, assessment, and redesign. There were three main phases, in which the fault emulation algorithm was gradually moved from hardware/software co-simulation to full hardware emulation. In this section, we detail the different approaches taken and describe the algorithms used for fault detection and test-set compression. Approaches 1 and 2 combine hardware and software. In these two approaches, the FPGA core performed the fault detection algorithm quickly, but we noticed a severe bottleneck in the transmission of the results for further processing. Approach 3 alleviates the bottleneck by moving the post-processing task to the FPGA fabric.

A. Approach 1: Hardware emulation with support software on host computer

For the first part of this research, the test pattern generation algorithm was in its most basic form. The FPGA was connected to a host computer that supplied the input stimuli to the circuit under test and acted as a support machine that received output data from the FPGA and post-processed it. The basic configuration is shown in Figure 7.

Fig. 7. A block diagram of the FPGA emulation system supported through software running on a host computer. The host computer and the FPGA are connected by an RS232 link.

The key steps of this approach are outlined below:
1) Implement as many instantiations of a benchmark circuit on the FPGA as the FPGA fabric will permit.
2) Insert a unique fault in each instantiation.
3) Generate a random input vector on the host computer and send it to the FPGA board.
4) Apply the same random input vector to the original copy and the fault-inserted copies of the benchmark circuit.
5) Send the outputs from the FPGA to a HyperTerminal connection on the host computer through the serial port.
6) Repeat steps 2-5 until all possible faults are detected with the given number of random input vectors.
7) Save data to a text file.
8) Using the computer, analyze the data for the best possible input vectors based on output data.

After all of the benchmark circuit input vectors and fault detection output data were displayed on the terminal display,


a way of determining the most efficient input vectors was needed. To do this, a program was written in MATLAB that determined which input vector detected the most faults and then eliminated those faults from further consideration. This vector compression algorithm proceeds in the following steps:
1) Find the vector that detects the most faults.
2) Log the vector.
3) Eliminate the faults found by that vector from future consideration.
4) Repeat steps 1-3 until all detected faults are accounted for.

(a)
Vector #1   11011000100101001
Vector #2   11000111110000101
Vector #3   11111110111101100   (most faults detected)
Vector #4   11010100000011010

(b)
Vector #1   00000000000000001
Vector #2   00000001000000001   (most faults detected)
Vector #3   00000000000000000
Vector #4   00000000000010010

(c)
Vector #1   00000000000000000
Vector #2   00000000000000000
Vector #3   00000000000000000
Vector #4   00000000000010010   (most faults detected)

(d)
Vector #1   00000000000000000
Vector #2   00000000000000000
Vector #3   00000000000000000
Vector #4   00000000000000000

Fig. 8. Fault detection algorithm (rows are vectors, columns are detected faults).
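The compression steps listed above can be sketched in Python using the fault data of Fig. 8(a). This is a stand-in for the MATLAB program, which is not shown in the paper; compress_test_set is a hypothetical name, and the mask convention (1 for a fault that is not yet covered) follows the description of the binary mask in the text.

```python
def compress_test_set(fault_rows):
    """Greedy vector selection.

    fault_rows[i] is a bit string with '1' in every column (fault) that
    vector i detects. Returns the 1-based indices of the retained vectors.
    """
    width = len(fault_rows[0])
    rows = [int(r, 2) for r in fault_rows]
    mask = (1 << width) - 1  # 1 = fault not yet covered
    selected = []
    while True:
        remaining = [r & mask for r in rows]  # AND each row with the mask
        counts = [bin(r).count("1") for r in remaining]
        best = max(counts)
        if best == 0:  # every detected fault is accounted for
            break
        index = counts.index(best)  # the first vector wins a tie
        selected.append(index + 1)
        mask &= ~rows[index]  # clear the faults this vector covers
    return selected


FIG8A = [
    "11011000100101001",
    "11000111110000101",
    "11111110111101100",
    "11010100000011010",
]
retained = compress_test_set(FIG8A)
```

On the data of Fig. 8(a) the function retains Vectors #3, #2 and #4, in that order, and the intermediate masked rows are those of Figures 8(b) to 8(d).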

Consider the example shown in Figure 8, which displays the fault data of a circuit as vector and fault pairs. The circuit had a total of 17 faults and was exercised with four random vectors. Figure 8(a) is the original list received directly through the RS232 port from the FPGA; each row shows the faults detected by one input vector. Our goal is to find the minimum set of vectors that detects all 17 faults. Vector #3 detects 13 of the 17 faults, the largest number in this example. Our algorithm compares the four vectors, identifies Vector #3 as the one that detects the most faults, and retains it in the compressed test-set. We then create a binary mask as wide as the fault list. The mask has 0s at the locations of the faults already detected by the vectors saved in the compressed test-set and 1s at all other locations. After Vector #3 is retained, the binary mask (00000001000010011) is ANDed with each row of the fault data, giving the data shown in Figure 8(b). In Figure 8(b), Vectors #2 and #4 both detect the largest number of remaining faults, and the first of the two is chosen automatically. A new mask is created and the process repeats until all detected faults are accounted for, as shown in Figures 8(c) and (d). In this example, we started with four random vectors and the compressed test-set retained only three of them.

Ease of use is the main benefit of implementing fault emulation on the FPGA and vector compression on the computer. Xilinx ISE was a familiar program and provided the infrastructure to simulate and implement the designs with ease. It also allowed designs to be entered in schematic form, as an alternative to Verilog. This allowed for easy visualization of the modules and made the RS232 interface between the FPGA and the computer easier to understand and implement.
MATLAB provided a simple interface and a powerful mathematical tool to perform the compression algorithm. There were several issues with this approach to test pattern generation and compression. The first issue was the serial


interface between the FPGA and the computer. This first implementation was designed using Xilinx ISE and thus had no native RS232 serial port support. To work around this, an RS232 serial port interface was implemented in Verilog. Data was then formatted in groups of eight bits and sent to a relatively small buffer. In addition, the timing of the entire circuit had to be monitored and controlled so that data could be sent out correctly. The second major issue with this design was the need to take data from the HyperTerminal and convert it to a text file, and then run the MATLAB code on the text file containing all the collected data. The constant saving and opening of large text files took extra time. Although the compression algorithm was easy to write in MATLAB, the runtime was relatively slow because MATLAB is an interpreted language and usually runs slower than a compiled program. The next approach addresses these two issues by eliminating the need for a basic serial port design and the need to run external code to determine the best test vectors.

B. Approach 2: Hardware emulation with support software running on a PowerPC core within the FPGA

Fig. 9. A block diagram of the FPGA emulation system and Power PC supported through software running on a host computer.

The second approach was to use the embedded PowerPC processor present on the Virtex-II Pro FPGA. The PowerPC is able to run code written in C (or C++) and to exchange data with the FPGA cores through a dedicated data bus (described in Section II). Data transmission over serial lines between the FPGA core and the host computer was limited to the final transmission of the processed vectors to the computer, removing the serial communication bottleneck experienced during test pattern compression and processing in Approach 1. At the same time, the compression algorithm could be written directly in C and run on the PowerPC, so working with large text files in MATLAB was no longer needed, eliminating the slow runtime experienced in Approach 1.

The use of the PowerPC core required the use of Xilinx Platform Studio, a significantly more powerful utility for running different circuits on the Virtex-II Pro board. This design suite came with a preconfigured RS232 serial port that was significantly more versatile and user friendly than the one we created in Approach 1. The new setup is shown in Figure 9. The host computer is primarily used to compile and download the code to the PowerPC and FPGA via the JTAG cable. The approach can be summarized using the following steps:
1) Implement as many instantiations of a benchmark circuit on the FPGA as the FPGA fabric will permit.
2) Insert a unique fault in each instantiation.


3) Randomly generate an input vector on the PowerPC and store it in on-board RAM.
4) Send the input vector to the FPGA through the internal bus.
5) Run the original circuit and all instantiated faulty circuits using this random vector.
6) Send the outputs from the FPGA core back to the PowerPC to be stored in RAM.
7) Repeat steps 1-6 for the number of vectors that are to be tested.
8) Perform the test-vector compression algorithm described in Approach 1 on the PowerPC and store the compressed result in RAM.
9) Send only the compressed test-set to the host computer using the RS232 interface of the PowerPC and display the result using a HyperTerminal.

With this new setup, the program on the PowerPC sent randomly generated input vectors to the circuits created on the FPGA. The implemented FPGA core with the fault detection algorithm then processed the input vectors and sent back the detected faults. All of the inputs and faults were stored in a DDR RAM module present on the FPGA board. The DDR RAM module was used because local block RAM (BRAM) was too small to hold all the data that the PowerPC needed to store. Finally, the PowerPC ran the algorithm for deciding the most efficient inputs and then, using the RS232 serial port, reported in an easier to read format which inputs would cover the most faults.

There were some issues with this design as well. Despite the speed of the FPGA fault detection algorithm, there was a significant bottleneck in the PowerPC's speed when determining the best input vectors. With any large circuit, the algorithm for determining the best inputs took minutes to run. Because this design needed to access external RAM and was burdened with overhead processing, it could not process the sheer amount of data in a timely manner. Another revision was made to eliminate these issues.

C. Approach 3: Hardware emulation with built-in random vector generator

The culmination of previous experiences led to the final incarnation of this emulation algorithm and implementation technique. In this approach, most functionality is transferred to the implemented FPGA core so that hardware speeds can be effectively utilized. This required rethinking how the core would work and how data would be processed. The inputs and outputs were also changed and the need for DDR RAM was eliminated.

The flowchart for the core of this algorithm is shown in Figure 10. The algorithm goes through a number of states, with decisions and processes that are repeated several times. The core of this program runs on the FPGA fabric. The FPGA fabric is instantiated with a fault-free version of the circuit and several faulty instantiations. Fault testing starts with the instantiation of a Linear Feedback Shift Register (LFSR) whose width equals the number of inputs in the benchmark. The LFSR creates the pseudo-random patterns required for


Fig. 10. Flowchart illustrating the fault detection and vector selection algorithm.

testing the faults in the circuit. A test pattern generation module, which takes the results from the different circuit clones together with the input vectors generated by the LFSR, is also added to the FPGA fabric. After the circuits are instantiated, the core waits for the user to specify the number of randomly generated test vectors to examine. This user-specified number is received from the PowerPC core; until it arrives, the FPGA core remains in the idle state. As long as the number of test vectors specified by the user has not been reached, a new test vector is generated by the LFSR. In addition, a number of intermediate variables (e.g., total faults counted) are reset to zero, and the threshold counter is incremented by one. The algorithm used for test pattern generation is summarized below:

input test vectors with the use of the LFSR. b) WHILE all fault locations have NOT been tested:

1) Define and initialize all constants (e.g. number of random vectors to test). 2) Instantiate as many possible copies of fault injected benchmark circuits and one clean circuit into the FPGA. 3) a) Initialize all intermediary values and generate new

i) Overwrite stored previous best input vector with current input vector. ii) Overwrite previous best fault data with current fault data. iii) Overwrite previous best detected faults total

© 2011 ACADEMY PUBLISHER

i) Insert SA0 faults into the circuit. ii) Compare outputs of fault injected circuits and clean circuit. Record any discrepancy into memory location as fault data. iii) Change fault signal from an SA0 to an SA1 by tying the ERROR line from Logic 0 to Logic 1. iv) Repeat Step (ii) for SA1 fault. v) Change fault locations by increasing specific registers and go back to step (i). c) Count all NEW faults detected with the new input vector. d) If the faults detected thus far number larger than previous data:

JOURNAL OF COMPUTERS, VOL. 6, NO. 11, NOVEMBER 2011

TABLE II E XAMPLE O UTPUT FOR S TEP 3- F - III . T HE I NPUT VECTOR IS IN H EXADECIMAL NOTATION . Input Vector F12BCD A3DAF8 B98321C C18EB24

Total Faults Detected 123 205 267 301

Total vectors used 1 2 3 4

Time (sec) 0.20 0.40 0.61 0.81

Table II shows a typical display of how the output from the FPGA after being processed and sent to the host computer would look like. Each line represents step 3(f) of the algorithm being run one time. The data represents the newest useful input vector in a hexadecimal format, total faults detected so far, the total vectors needed and the total time taken. With this new version of the fault detection algorithm there are several improvements over the previous two approaches. First, there was a major increase in speed because all of the processing is done in hardware and not on the PowerPC. There is also an increase in simplicity of the interaction between the FPGA and PowerPC. The algorithm that controls the FIFO data transfers only needs to worry about one vector of data being sent to the FPGA. Every other data transfer is from the FPGA to the PowerPC. Because of this, after the FIFO control sends the first vector of data from the PowerPC to the FPGA, the FIFO that sends data to the FPGA can be turned off and only the FIFO that receives data needs to be active. Because of the primary use of the FPGA, no external RAM is needed, reducing the coding and run time used by the PowerPC by a significant amount. V. R ESULTS We ran experiments with a number of combinational circuits from the ISCAS and MCNC benchmark suites to validate the effectiveness of our approach. Table III summarizes the characteristics of the circuits tested. The first four circuits (c432, c499, c880 and c1355) are from the ISCAS85 suite and the remaining five circuits are from the MCNC benchmark suite. All experiments were run on the Xilinx Virtex-II Pro Development System. The design was synthesized using the Xilinx ISE Design Suite Version 10.1. The Verilog code generated by our C++ program was tested in a behavioral simulator to verify its correctness and to get a general idea of how long the test would take when implemented in hardware. Once completed, the algorithm was implemented on an FPGA.

© 2011 ACADEMY PUBLISHER

The Verilog netlist generator written in C++ took on average 0.5545 seconds to find all the fault sites and add the fault multiplexers as described in Section III. For our experiments, we varied the number of instantiations of the circuit netlist on the FPGA from 1 to 64. The FPGA board was operated with a clock frequency of 25 MHz; for the larger benchmark circuits, however, the clock rate had to be lowered to accommodate the extra-long routing of signals. Figure 11 shows the median reduction in test pattern generation time over all benchmark circuits as the number of circuit instantiations is increased from 1 to 64. It can be seen that the time taken for test pattern generation decreases at the rate of log2 of the number of instantiated modules.

with current detected faults total.
e) Repeat steps (a)-(d) until the user-defined number of vectors has been tested.
f) i) Add the newly detected faults to currently saved faults detected list.
   ii) Detect if any new faults were actually generated.
   iii) If new faults were found, output new data in the format of Table II.
   iv) Re-initialize data and move back to Step 3(a).
4) Repeat step 3 until all possible faults found are detected or another user-defined trigger is flagged.
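The loop in steps 3 and 4 can be sketched in software as follows. This is an illustration only, not the hardware implementation: it assumes that steps 3(a)-(d) keep, within each batch, the candidate vector that detects the most not-yet-detected faults, and it uses a hypothetical three-input toy netlist and Python's random module in place of the benchmark circuits, the LFSR and the FPGA fabric. All names are made up for the example.

```python
import random

# Hypothetical toy netlist y = (a AND b) OR c, with a fault site on every node.
FAULT_SITES = ("a", "b", "c", "n1", "y")
FAULTS = [(site, stuck) for site in FAULT_SITES for stuck in (0, 1)]


def circuit(vec, fault=None):
    """Evaluate the toy netlist; `fault` is (site, stuck_value) or None."""
    def mux(site, value):
        # Software stand-in for a fault multiplexer: the stuck value is
        # selected only when this site is the active fault.
        return fault[1] if fault is not None and fault[0] == site else value

    a, b, c = (mux(site, value) for site, value in zip("abc", vec))
    n1 = mux("n1", a & b)
    return mux("y", n1 | c)


def detected_by(vec):
    """Faults whose faulty output differs from the fault-free output."""
    golden = circuit(vec)
    return {f for f in FAULTS if circuit(vec, f) != golden}


def generate_test_set(vectors_per_pass, max_passes, seed=1):
    rng = random.Random(seed)
    found, test_set = set(), []
    for _ in range(max_passes):                 # step 4 (trigger: max_passes)
        best_vec, best_new = None, set()
        for _ in range(vectors_per_pass):       # step 3(e)
            vec = tuple(rng.getrandbits(1) for _ in range(3))
            new = detected_by(vec) - found
            if len(new) > len(best_new):        # assumed comparison step
                best_vec, best_new = vec, new
        if best_new:                            # step 3(f)
            found |= best_new
            test_set.append(best_vec)
        if len(found) == len(FAULTS):
            break
    return test_set, found
```

Calling `generate_test_set(vectors_per_pass=16, max_passes=100)` returns the retained vectors and the set of detected faults. Under the stated assumption, a larger batch gives each pass more candidates to choose from.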


[Figure 11 plot: % reduction in pattern generation time (0 to −100) versus # of instantiations (0 to 70).]

Fig. 11. Median reduction in time for test pattern generation using random vectors for benchmark circuits as a function of the number of instantiations.

We also compare the hardware resource usage on the FPGA as a function of the number of instantiations in Figure 12. The plot in Figure 12 shows the average usage across all benchmarks. To allow comparison across different circuit sizes, the plot was normalized to the single-instantiation case for each benchmark. The primary hardware resources on the FPGA fabric are the look-up tables (LUTs), the slices and the slice registers. Since we are dealing with combinational circuits, the number of slice registers stays constant regardless of the number of instantiations. It can be seen that the LUT and slice usage goes up as more copies of the circuit are instantiated in the fabric. The LUT usage goes up by almost 25% and the slice usage goes up by about 19% when the number of instantiations is increased from 1 to 64. This indicates that more parallelism is actually possible for these circuits. However, we found that the number of slices and LUTs used when only one module is instantiated is higher than when two modules are instantiated. In fact, instantiating 16 modules requires on average the same number of slices and look-up tables as the single-instantiation case. This is probably due to the synthesis tool's resource-sharing optimization routines.

A. Effect of LFSR random seed on test generation

In Section IV-C we mentioned the use of a Linear Feedback Shift Register (LFSR) to create the pseudo-random patterns required for testing faults in the circuit. The pseudo-random patterns generated by the LFSR are dependent on


TABLE III. CHARACTERISTICS OF THE BENCHMARK CIRCUITS.

| Circuit | Inputs | Outputs | # of Gates | Circuit Depth | Function | Benchmark Suite |
|---|---|---|---|---|---|---|
| c432 | 36 | 7 | 160 | 29 | 27-channel interrupt controller | ISCAS |
| c499 | 41 | 32 | 202 | 23 | 32-bit Error Correcting circuit | ISCAS |
| c880 | 60 | 26 | 383 | 23 | 8-bit Arithmetic Logic Unit | ISCAS |
| c1355 | 41 | 32 | 546 | 23 | 32-bit Error Correcting circuit | ISCAS |
| Clip | 9 | 5 | 144 | 11 | Unknown | MCNC |
| rd73 | 7 | 3 | 148 | 13 | Unknown | MCNC |
| t481 | 16 | 1 | 74 | 14 | Unknown | MCNC |
| Z5xp1 | 7 | 10 | 166 | 10 | Unknown | MCNC |
| z9sym | 9 | 1 | 147 | 12 | Unknown | MCNC |

[Figure 12 plot: normalized hardware usage (Slice and LUT curves, 0.8 to 1.25) versus # of instantiations (0 to 70).]

Fig. 12. Average FPGA hardware resources (Slices and Look-up tables) used for test pattern generation using random vectors for benchmark circuits as a function of the number of instantiations. The hardware usage is normalized to the single instantiation case.

the seed or the initial state of the LFSR. In this section, we look closely at circuit c1355 from the ISCAS suite and investigate how the initial seed might affect the final test pattern count. C1355 is a single-error correcting circuit consisting of 546 logic gates, 41 inputs, and 32 outputs. A Verilog netlist consisting of the circuit with different stuck-at faults was generated by our C++ program. A number of instantiations of C1355 ranging from 1 to 64 were tested on the FPGA board with 10 different seed values for the LFSR. Here we summarize the results obtained when 32 instantiations were used. Using 32 instantiations of the circuit and a varying number of random input vectors, we ran experiments to determine how many faults were found, the number of vectors needed to find these faults, and the time it took to process them. C1355 has 2^41 possible input vectors. We found that the time increased linearly with the number of tested vectors, with Time (sec) = 0.0005 × #vectors + 0.06. Hence, testing every input vector of C1355 would take approximately 2036 days, which is an unacceptable amount of time. Thus the goal of this experiment was to find how many random input vectors need to be tested to obtain a reliable result. The number of vectors to be tested for each iteration of the test was entered manually. The initial number was chosen so that a significant range of vectors could be tested given the maximum number of possible input vectors. For the next iteration, the number of vectors was doubled and the experiment was run again. Figure 13 shows the 10 iterations of the circuit with 10 different seeds for the LFSR. The red


[Figure 13 plot: # of faults found versus # of vectors tested (logarithmic axis, 10^0 to 10^6).]

Fig. 13. The number of vectors tested vs. faults found for 32 instantiations of C1355 using 10 different seeds for the LFSR. The red solid line is the mean of the 10 iterations.

solid line indicates the mean of the 10 iterations for each vector count. Each 'x' on the plot denotes the fault count obtained with a different seed for that vector count. As can be seen, for lower vector counts there is a large variation across the iterations, but as the vector count goes up, the variation introduced by the random seeds becomes almost non-existent. Simulation of other circuits from the benchmark suites also verifies this trend. From Figure 13, it is clear that for a small input test set (from 10 to 160 vectors), a significant number of faults are left undetected. However, once 5120 vectors are tested, all 1610 detectable faults in the circuit are found. As shown in the figure, there is no increase in detection even when 327680 vectors are tested. Figure 13 clearly shows the stability point to be around 5120 input vectors. This stability point is based on the number of faults found and the number of vectors that are needed to detect them. This confirms that a subset of the input vectors can be tested and will accurately represent all possible input vectors. Circuit C1355 has 809 checkpoint locations, which means there are 1618 possible faults. We were able to detect 1610 faults in total, leaving 8 faults unaccounted for. This is a 99.5% fault detection rate, which is generally considered acceptable. Upon further investigation, we found that these 8 missing faults were actually redundant faults. It is common for circuit designers to add redundancy to a design to counteract glitches due to logic hazards. These redundant logic structures do not change the overall function of the circuit; they merely allow the circuit output to remain stable and glitch


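| Circuit | FastScan Patterns | FastScan Fault Coverage (%) | FastScan Time (s) | This Work Patterns | This Work Fault Coverage (%) | This Work Time (s) |
|---|---|---|---|---|---|---|
| c432 | 50 | 99.2 | 0.4 | 32 | 98.7 | 12 |
| c499 | 54 | 98.9 | 0.6 | 52 | 98.6 | 0.34 |
| c880 | 43 | 100 | 0.4 | 25 | 100 | 32 |
| c1355 | 87 | 99.5 | 0.8 | 84 | 99.5 | 2.75 |
| Clip | 59 | 97.4 | 0.4 | 48 | 96.9 | 0.24 |
| rd73 | 71 | 100 | 0.4 | 72 | 100 | 0.37 |
| t481 | 35 | 100 | 0.4 | 33 | 100 | 0.17 |
| Z5xp1 | 47 | 98.7 | 0.4 | 46 | 98.7 | 0.23 |
| z9sym | 77 | 87.5 | 0.4 | 75 | 88.7 | 0.39 |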

free when different circuit paths switch with different delays. However, these redundant faults are not detected by single stuck-at testing.
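As a quick check of the numbers quoted above for C1355, assuming each checkpoint contributes one stuck-at-0 and one stuck-at-1 fault (which is what the count of 1618 implies), the fault total and the coverage work out as:

```latex
\[
N_{\text{faults}} = 2 \times 809 = 1618, \qquad
\text{coverage} = \frac{1610}{1618} \approx 0.995 = 99.5\%
\]
```

The 8 faults that separate the two counts are the redundant faults discussed above, so all 1610 detectable faults are covered.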

[Figure 14 plot: # of vectors needed in test-set (0 to 90) versus # of vectors tested (logarithmic axis, 10^0 to 10^6).]

Fig. 14. The number of vectors tested vs. the number of vectors needed in the test-set for C1355 using 10 different seeds for the LFSR. The red solid line is the mean of the 10 iterations.

Our next goal was to see how the seed for the LFSR affects the number of vectors needed for the actual test-set. Figure 14 shows that as the number of vectors tested increases, so does the final test-set size, up to a certain point. This is the point at which the maximum number of faults is found; after it, no new faults are detected, and testing more vectors makes no difference to the final test-set size. We also see that across the 10 iterations there is a slight variation in the number of vectors needed, but once the algorithm settles on a test-set of around 84 vectors, there is very little variation across the different seeds.
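To make the role of the seed concrete, the following is a minimal sketch of a pseudo-random pattern generator. It assumes a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13 and 11, a commonly used maximal-length configuration; the width and feedback polynomial of the LFSR used on the FPGA are not given in this section, and the function names are hypothetical.

```python
def lfsr16(seed):
    """Yield successive states of a 16-bit Fibonacci LFSR.

    Assumed feedback polynomial: x^16 + x^14 + x^13 + x^11 + 1.
    """
    if not 0 < seed < (1 << 16):
        raise ValueError("seed must be a non-zero 16-bit value")
    state = seed
    while True:
        yield state
        # XOR the tapped bits to form the bit shifted in at the top.
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)


def first_patterns(seed, count):
    """Return the first `count` patterns produced from `seed`."""
    gen = lfsr16(seed)
    return [next(gen) for _ in range(count)]


def period(seed):
    """Count the steps until the LFSR returns to its starting state."""
    gen = lfsr16(seed)
    start = next(gen)
    steps = 1
    while next(gen) != start:
        steps += 1
    return steps
```

Because this particular LFSR is maximal-length, every non-zero seed walks through the same cycle of 65535 states and only the starting point differs. That is consistent with the observation that the seed matters less as more vectors are tested.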

Columns 2 and 5 present the total number of compressed patterns in the test-set for the benchmark circuits. In all cases except rd73, our approach produces fewer patterns than FastScan. The average reduction in test-data volume over all benchmark circuits was 13%. Circuit c880 showed the maximum reduction at 42%, while rd73 needed 1 additional vector, increasing its test-data volume by 1.4%. The total fault coverage shown in columns 3 and 6 shows that our approach had the same fault coverage as FastScan for a number of circuits. For all circuits except z9sym, the fault coverage using our approach is at least 96.9%. The lowest fault coverage using our method was 88.7%, for z9sym. No additional faults can be detected for z9sym because all possible vectors were tested. We found that the coverage of the collapsed fault-set for this circuit using Mentor Graphics' FastScan was only 87.5%. Compared to FastScan, circuits c432, c499 and Clip showed a slight reduction in coverage of up to 0.5%, but z9sym showed an increase in coverage of 1.2%. The time for pattern generation and compression to create the test-set using our approach is lower for all the circuits except c432, c880 and c1355. These benchmark circuits were run longer even though all detectable faults were found with fewer vectors. Having more vectors gave us more choices of input patterns and allowed the final compression to be done more efficiently. For example, in circuit c432 we were able to detect all faults by simulating just 2560 input patterns in 0.23 seconds. However, the final test-set resulting from these 2560 vectors contained 39 vectors. By allowing extra time to simulate 163840 input patterns, we still detected the same number of faults, but the extra vectors gave the compression algorithm more flexibility, leading to a final test-set size of just 32. This additional reduction of 7 vectors can lead to a larger reduction in test cost.
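The reduction figures quoted above can be reproduced from the pattern counts in Table IV. The short sketch below does so, assuming that the "average reduction" is the mean of the per-circuit percentage reductions (the text does not state how the average was formed); the variable and function names are illustrative.

```python
# Pattern counts copied from Table IV: circuit -> (FastScan, this work).
PATTERNS = {
    "c432": (50, 32),
    "c499": (54, 52),
    "c880": (43, 25),
    "c1355": (87, 84),
    "Clip": (59, 48),
    "rd73": (71, 72),
    "t481": (35, 33),
    "Z5xp1": (47, 46),
    "z9sym": (77, 75),
}


def reduction_percent(baseline, ours):
    """Percentage reduction in pattern count relative to the baseline."""
    return 100.0 * (baseline - ours) / baseline


# Per-circuit reductions; a negative value means our test-set is larger.
per_circuit = {
    name: reduction_percent(baseline, ours)
    for name, (baseline, ours) in PATTERNS.items()
}

# Mean of the per-circuit percentages (assumed averaging method).
average = sum(per_circuit.values()) / len(per_circuit)

for name, value in per_circuit.items():
    print(f"{name:6s} {value:6.1f}%")
print(f"mean   {average:6.1f}%")
```

Under that assumption, the mean comes out at roughly 12.5%, which rounds to the 13% quoted in the text, with c880 as the largest reduction and rd73 as the only increase.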


TABLE IV. COMPARISON OF PATTERN COUNT, FAULT COVERAGE AND TEST GENERATION TIME BETWEEN FASTSCAN AND THIS WORK.



B. Comparison with commercial tool and prior work

So far we have shown our multiple-instantiation approach to be feasible both in the amount of time taken and in the number of test patterns produced. In this section, we compare our approach with a state-of-the-art commercial ATPG toolset: Mentor Graphics' FastScan v8.2008 2.10 ATPG tool running on an Intel Xeon X5355 at 2.66 GHz. Our approach was run on the Virtex-II boards with a system clock frequency of 25 MHz. The experiments consisted of generating and compressing patterns for a collapsed single stuck-at fault test. The results are tabulated in Table IV.


[Figure 15 plot: # of test patterns (0 to 90) for c432, c499, c880 and c1355; legend: [19], [20], [21], This work.]

Fig. 15. Comparison of test pattern counts for ISCAS benchmark circuits with previous work of [19]–[21].

To measure the effectiveness of our approach, we compare our results to existing results from [19], [20] and [21]. All circuits were tested with 32 instantiations on the FPGA with a system clock frequency of 25 MHz. Figure 15 compares and summarizes the results for the different benchmark circuits. Comparisons are made only for the ISCAS circuits, as the published works [19]–[21] do not include results for the


MCNC circuits. As shown in Figure 15, our approach produces the smallest test sets and matches the best pattern counts reported in [20], except for c880.

VI. CONCLUSION

We have described an efficient implementation of a test pattern generation and compression algorithm on a Virtex-II Pro FPGA. The hardware Verilog netlist is generated by a C++ program that automatically identifies circuit checkpoint nodes and adds multiplexers to enable generation of patterns that test for stuck-at faults at those nodes. Three approaches to hardware emulation were considered. The first two relied on hardware/software co-design, where fault insertion and detection are done on the FPGA fabric but the generation of input patterns and the post-processing of the test-set are done in software, either on a desktop computer or on the PowerPC core present in the Virtex-II Pro FPGA. The third approach used Linear Feedback Shift Registers (LFSRs) to generate patterns on the FPGA fabric. The fault emulation and the processing of the test patterns to compress the final test-set were done in the fabric as well. Our results show that, for the benchmark circuits, our approach produces the smallest test-set, and the set size is almost always smaller than those produced using commercial ATPG tools.

REFERENCES

[1] S. Luryi, J. M. Xu, and A. Z. eds., Future Trends in Microelectronics: The Nano, the Giga, and the Ultra. New York: Wiley, 2004.
[2] International Technology Roadmap for Semiconductors, “The latest update is at http://www.public.itrs.net.”
[3] K. Nepal, R. I. Bahar, J. Mundy, W. R. Patterson, and A. Zaslavsky, “Designing logic circuits for probabilistic computation in the presence of noise,” in DAC, Anaheim, CA, June 2005.
[4] R. Vemu, A. Jas, J. Abraham, S. Patil, and R. Galivanche, “A low-cost concurrent error detection technique for processor control logic,” in DATE, March 2008, pp. 897–902.
[5] K. Nepal, N. Alves, J. Dworak, and R. I. Bahar, “Using implications for online error detection,” in ITC, October 2008.
[6] N. Alves, A. Buben, K. Nepal, J. Dworak, and R. I. Bahar, “A cost effective approach for online error detection using invariant relationships,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, no. 5, pp. 788–801, 2010.
[7] J. Hughes and E. J. McCluskey, “Multiple stuck-at fault coverage of single stuck-at fault test sets,” in ITC, 1986, pp. 368–374.
[8] M. L. Bushnell and V. D. Agrawal, Essentials of Electronic Testing for Digital, Memory and Mixed-Signal VLSI Circuits. USA: Springer, 2005.
[9] J. P. Hayes, “Fault modeling,” IEEE Design and Test of Computers, no. 4, pp. 37–44, 1985.
[10] M. Abramovici, M. Breuer, and A. Friedman, Digital Systems Testing and Testable Design. USA: Computer Science Press, 1990.
[11] R. H. Klenke, R. D. Williams, and J. H. Aylor, “Parallel-processing techniques for automatic test pattern generation,” IEEE Computer, vol. 25, no. 1, pp. 71–84, 1992.
[12] F. Ozguner, C. Aykanat, and O. Khalid, “Logic fault simulation on a vector hypercube multiprocessor,” in The 3rd Conference on Hypercube Concurrent Computers and Applications, 1988, pp. 1108–1116.
[13] R. Raghavan, J. Hayes, and W. Martin, “Logic simulation on vector processors,” in ICCAD, 1988, pp. 268–271.
[14] K. Gulati and S. Khatri, “Towards acceleration of fault simulation using graphics processing units,” in DAC, 2008, pp. 822–827.
[15] F. Ozguner and R. Daoud, “Vectorized fault simulation on the Cray X-MP supercomputer,” in ICCAD, 1988, pp. 198–201.
[16] F. Kocan and D. G. Saab, “Dynamic fault diagnosis of combinational and sequential circuits on reconfigurable hardware,” Journal of Electronic Testing: Theory and Applications, vol. 23, no. 5, pp. 405–420, 2007.


[17] A. Parreira, J. P. Teixeira, and M. Santos, “A novel approach to FPGA-based hardware fault modeling and simulation,” in Design and Diagnostics of Electronic Circuits and Syst. Workshop, 2003, pp. 17–24.
[18] “Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete data sheet available at http://www.xilinx.com/support/documentation/data sheets/ds083.pdf.”
[19] I. Pomeranz, L. N. Reddy, and S. M. Reddy, “COMPACTEST: A method to generate compact test sets for combinational circuits,” in International Test Conference, 1991, pp. 194–203.
[20] M. C. Hansen and J. P. Hayes, “High-level test generation using physically-induced faults,” in IEEE VLSI Test Symposium, 1995, p. 20.
[21] E. Bareisa, V. Jusas, K. Motiejunas, and R. Seinauskas, “Test generation at the algorithm-level for gate-level fault coverage,” Microelectronics Reliability, vol. 48, no. 7, pp. 1093–1101, 2008.

Carson Dunbar is currently a Ph.D. student at the University of Maryland, USA. He received his BS and MS degrees in Electrical Engineering from Bucknell University, Lewisburg, PA in 2008 and 2010, respectively. His research interests include reconfigurable computing and digital VLSI Design.

Kundan Nepal received the BS degree in Electrical Engineering from Trinity College, Hartford in 2002, the MSEE from the University of Southern California in 2003 and the Ph.D. degree in Electrical and Computer Engineering from Brown University in 2007. He is currently an Assistant Professor in the School of Engineering at the University of St Thomas, Minnesota, USA. His research interests include defect/fault tolerant circuits and systems; nanometer digital VLSI system design and reconfigurable computing.
