To appear in IEEE VLSI Test Symposium, 1997

SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation

Dilip Krishnaswamy†, Elizabeth M. Rudnick†, Janak H. Patel†, Prithviraj Banerjee‡

† Center for Reliable and High-Performance Computing, University of Illinois, Urbana, IL
‡ Center for Parallel and Distributed Computing, Northwestern University, Evanston, IL

Abstract

We propose three synchronous parallel algorithms for scalable parallel test set partitioned fault simulation. The algorithms are based on a new two-stage approach to parallelizing fault simulation for sequential VLSI circuits in which the test set is partitioned among the available processors. The test set partitioning inherent in the algorithms overcomes the good circuit logic simulation bottleneck that exists in traditional fault partitioned approaches to parallel fault simulation. The implementations were done on a shared memory multiprocessor and on a network of workstations. Two of the algorithms show a small degree of pessimism in the fault coverage in a few cases, as compared with a uniprocessor run, while the third algorithm provides the same results as a uniprocessor run. All algorithms provide excellent speedups and perform much better than a traditional fault partitioned approach, on both shared and distributed memory parallel platforms.

1 Introduction

Fault simulation is an important step in the electronic design process and is used to identify faults that cause erroneous responses at the outputs of a circuit for a given test set. The objective of a fault simulation algorithm is to find the fraction of total faults in a sequential circuit (also referred to as the fault coverage) that is detected by a given set of input vectors. In its simplest form, a fault is injected into a logic circuit by setting a line or a gate to a faulty value (1 or 0), and then the effect of the fault is simulated using zero-delay logic simulation. Most fault simulation algorithms are of O(n^2) time complexity, where n is the number of lines in the circuit. Studies have shown that there is little hope of finding a linear-time fault simulation algorithm [1]. In a typical fault simulator, the good circuit (fault-free circuit) and the faulty circuits are simulated for each test vector.

If the output responses of a faulty circuit differ from those of the good circuit, then the corresponding fault is detected, and the fault can be dropped from the fault list, speeding up simulation of subsequent test vectors. A fault simulator can be run in stand-alone mode to grade an existing test set, or it can be interfaced with a test generator to reduce the number of faults that must be explicitly targeted by the test generator. In either environment, fault simulation can consume a significant amount of time, especially in random pattern testing and functional testing, for which millions of vectors may have to be simulated. Thus, parallel processing can be used to reduce the fault simulation time significantly. A new two-stage approach to parallelizing fault simulation for sequential VLSI circuits was recently proposed in which the test set is partitioned among the available processors [2]. This approach was the first to overcome the limitation imposed by the serial good circuit logic simulation that is necessary for fault simulation. In this paper, we propose three scalable synchronous parallel fault simulation algorithms, based on the algorithm proposed earlier, with the test vector set partitioned across processors. The first algorithm, SPITFIRE1, is a modification of the two-stage synchronous algorithm proposed earlier [2]; it eliminates some redundant computation that was present in the earlier algorithm. The second algorithm, SPITFIRE2, is a hybrid of the fault partitioning and test set partitioning algorithms which attempts to reduce the granularity of the work performed in each stage. Both algorithms may exhibit some degree of pessimism in the fault coverage obtained in a parallel run. The third algorithm, SPITFIRE3, is a synchronous pipelined parallel algorithm geared towards removing this pessimism. All three algorithms eliminate the excessive redundant computations required in the traditional fault partitioning approach and provide good speedups. The parallel algorithms presented can be used independently of which serial algorithm is used for fault simulation. In evaluating our proposed approach to parallelizing fault simulation, we used a fault simulator based on a modified version of the PROOFS algorithm [3].
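As a concrete illustration of the simulate, compare, and drop loop described above, the following minimal Python sketch injects stuck-at faults into a toy two-gate circuit and drops each fault once a vector exposes it. The circuit, fault list, and helper names are illustrative assumptions, not the simulator used in this paper.

```python
# Minimal sketch of the simulate/compare/drop loop for stuck-at
# fault simulation. The two-gate circuit (g1 = a AND b,
# out = g1 OR c) and its fault list are illustrative only.

def simulate(vector, stuck=None):
    """Zero-delay logic simulation; `stuck` optionally forces one
    named line to a constant faulty value, e.g. ('g1', 0)."""
    values = dict(vector)

    def v(name):
        # A stuck-at fault overrides the value of exactly one line.
        if stuck is not None and stuck[0] == name:
            return stuck[1]
        return values[name]

    values['g1'] = v('a') & v('b')
    values['out'] = v('g1') | v('c')
    return v('out')

faults = [('a', 0), ('b', 1), ('g1', 0), ('g1', 1), ('out', 0), ('out', 1)]
test_set = [{'a': 1, 'b': 1, 'c': 0}, {'a': 0, 'b': 0, 'c': 1}]

detected = []
for vector in test_set:
    good = simulate(vector)                      # fault-free response
    for fault in list(faults):                   # copy: we mutate below
        if simulate(vector, stuck=fault) != good:
            faults.remove(fault)                 # drop the detected fault
            detected.append(fault)

print(f"detected: {detected}; undetected: {faults}")
```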

This research was supported in part by the Semiconductor Research Corporation under Contract SRC 95-DP-109 and the Defense Advanced Research Projects Agency under contracts DAA-H04-94-G-0273 and DABT63-95-C0069 administered by the Army Research Office.

This paper is organized as follows. We begin with a brief description of various existing approaches to parallel fault simulation in Section 2. In Section 3, we describe test set and fault partitioning strategies in detail. In Section 4, we present the three SPITFIRE algorithms. The results are presented in Section 5, and all algorithms are compared. Section 6 concludes the paper.

2 Parallel Fault Simulation

Several algorithms have been proposed for parallelizing sequential circuit fault simulation [4]. A circuit partitioning approach to parallel sequential circuit fault simulation is described in [5]. The algorithm was implemented on a shared-memory multiprocessor. The circuit is partitioned among the processors, and since the circuit is evaluated level by level with barrier synchronization at each level, the gates at each level should be evenly distributed among the processors to balance the workloads. An average speedup of 2.16 was obtained for 8 processors, and the speedup for the ISCAS89 circuit s5378 was 3.29. This approach is most suitable for a shared-memory architecture and for circuits with many levels of logic. Algorithmic partitioning was proposed for concurrent fault simulation in [6][7]. A pipelined algorithm was developed, and specific functions were assigned to each processor. An estimated speedup of 4 to 5 was reported for 14 processors, based on software emulation of a message-passing multicomputer [7]. The limitation of this approach is that it cannot take advantage of a larger number of processors.

Fault partitioning is a more straightforward approach to parallelizing fault simulation. With this approach [8][9], the fault list is statically partitioned among all processors, and each processor must simulate the good circuit and the faulty circuits in its partition. Good circuit simulation on more than one processor is obviously redundant computation. Alternatively, if a shared-memory multiprocessor is used, the good circuit may be simulated by just one processor, but the remaining processors will lie idle while this processing is performed, at least for the first time frame. Fault partitioning may also be performed dynamically during fault simulation to even out the workloads of the processors, at the expense of extra interprocessor communication [8]. Speedups in the range 2.4-3.8 were obtained for static fault partitioning over 8 processors for the larger ISCAS89 circuits having reasonably high fault coverages (e.g., s5378) [8][9]. No significant improvements were obtained for these circuits with dynamic fault partitioning, due to the overheads of load redistribution [8]. However, in both the static and dynamic fault partitioning approaches, the shortest execution time is bounded by the time to perform good circuit logic simulation on a single processor. One observation that can be made about the fault partitioning experiments is that larger speedups are obtained for circuits having lower fault coverages [8][9]. These results highlight the fact that the potential speedup drops as the number of faults simulated drops, since the good circuit evaluation takes up a larger fraction of the computation time.

The good circuit evaluation is not parallelized in the fault partitioning approach, and therefore speedups are limited. For example, if good circuit logic simulation takes about 20 percent of the total fault simulation time on a single processor, then by Amdahl's law, one cannot expect a speedup of more than 5 on any number of processors. Parallelization of good circuit logic simulation, or simply logic simulation, is therefore very important and is known to be a difficult problem. Most implementations have not shown an appreciable speedup. Parallelizing logic simulation by partitioning the circuit has been suggested but has not been successful, due to the high level of communication required between parallel processors.

Recently, a new algorithm was proposed in which the test vector sequence is partitioned among the processors [2]. The partitions are not completely disjoint, and the overlapping vectors are used to initialize the circuit, as will be explained in a later section. Each processor simulates the vectors in its partition starting from an unknown state. Fault simulation proceeds in two stages. In the first stage, the fault list is partitioned among the processors, and each processor performs fault simulation using the fault list and test vectors in its partition. In the second stage, the undetected fault lists from the first stage are combined, and each processor simulates all faults in this list using the test vectors in its partition. With the test set partitioned, each processor needs to perform good circuit logic simulation only on the partition it owns. In a fault partitioned approach, each processor performs the good circuit logic simulation on the entire test set. Obviously, the test set partitioning strategy provides a more scalable implementation, since the good circuit logic simulation is also distributed over the processors. Test set partitioning is also used in the parallel fault simulator Zamlog [10], but Zamlog assumes that independent test sequences are provided which form the partition. If only one test sequence is given, Zamlog does not partition it. If, for example, only 4 independent sequences are given, it cannot use more than 4 processors. Our work does not make any assumption about the independence of test sequences and hence is scalable to any number of processors.
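The scalability limit quoted earlier in this section can be stated as a formula. With s the fraction of uniprocessor run time spent on the unparallelized good circuit simulation, Amdahl's law bounds the achievable speedup regardless of processor count:

```latex
% Amdahl's-law bound for fault partitioning: s is the serial
% (good circuit) fraction of the uniprocessor run time.
S(p) = \frac{1}{s + (1 - s)/p} < \frac{1}{s},
\qquad s = 0.2 \;\Rightarrow\; S(p) < 5 \ \text{for any } p.
```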

3 Test Set and Fault Partitioning

We now describe the different approaches to partitioning the test set and the fault list in a parallel processing context. Let the test set, or more precisely the test sequence, be denoted by T, the fault list by F, and the number of processors by p. Let us partition the test sequence T into p partitions {T1, T2, ..., Tp}, where each Tj is a subsequence of T. Let us also partition the fault list F into p partitions {F1, F2, ..., Fp}.

3.1 Test Sequence Partitioning

Parallel fault simulation through test sequence partitioning is illustrated in Figure 1 and in Figure 2(a).

[Figure 1: Test Sequence Partitioning. A test sequence of 5n vectors is partitioned among 5 processors P1 through P5, with an overlap of initializing vectors between adjacent segments.]

The test sequence partition Ti and the fault list F are allocated to the ith processor. Each processor performs the good and faulty circuit simulations for the subsequence in its partition only, starting from an all-unknown (X) state. Of course, the state would not really be unknown if we did not partition the vectors. Since the unknown state is a superset of the known state, the simulation will be correct but may have more X values at the outputs than the serial simulation. This is pessimistic simulation in the sense that the parallel implementation produces an X at some outputs which are in fact known to be 0 or 1. From a pure logic simulation perspective, this pessimism may or may not be acceptable. However, in the context of fault simulation, the effect of the unknown values is that a few faults which are detected in the serial simulation are not detected in the parallel simulation. Rather than accept this small degree of pessimism, the test set partitioning algorithm tries to correct it as much as possible, as illustrated in Figure 1.
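The pessimism described above comes from three-valued simulation. The sketch below shows 0/1/X gate evaluation with X encoded as None; the encoding is an illustrative assumption, not the representation used by PROOFS. A known controlling input still decides the output, so starting from an all-X state can only add unknowns, never wrong values.

```python
# Three-valued (0, 1, X) gate evaluation, with X encoded as None.
# This encoding is an illustrative assumption, not PROOFS's actual
# representation. A controlling input decides the output even when
# the other input is unknown, so an all-X starting state can only
# add pessimism (extra Xs), never incorrect values.

X = None

def and3(a, b):
    if a == 0 or b == 0:
        return 0      # a controlling 0 fixes the output
    if a == 1 and b == 1:
        return 1
    return X

def or3(a, b):
    if a == 1 or b == 1:
        return 1      # a controlling 1 fixes the output
    if a == 0 and b == 0:
        return 0
    return X

# A flip-flop starting at X keeps dependent lines at X until the
# overlap (initializing) vectors drive it to a known value.
print(and3(X, 0), and3(X, 1), or3(X, 1))   # -> 0 None 1
```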

[Figure 2: Test Set Partitioning and Fault Partitioning. (a) Test sequence partitioning assigns segment Ti and the full fault list F to processor Pi; (b) fault partitioning assigns fault partition Fi and the full test set T to processor Pi.]

To compute the starting state for each test segment, a few vectors are prepended to the segment from the preceding segment. This process creates an overlap of vectors between successive segments, as shown in Figure 1. Our hypothesis is that a few vectors can act as initializing vectors to bring the machine to a state very close to the correct state, if not exactly the same state. Even if the computed state is not close to the actual state, it still has far fewer unknown values than exist when starting from an all-unknown state. Results in [2] showed that this approach indeed reduces the pessimism in the number of fault detections. The number of initializing vectors required

depends on the circuit and how easy it is to initialize. If the overlap is larger than necessary, redundant computations will be performed in adjacent processors, and efficiency will be lost. However, if the overlap is too small, some faults that are detected by the test set may not be identified, and thus the fault coverage reported may be overly pessimistic.
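As a concrete sketch of segment construction, the helper below computes the 1-based vector range each processor simulates, prepending q overlap vectors to every segment but the first; the boundaries match the formulas given in Section 4.1. The function name is illustrative, and N is assumed divisible by p for clarity.

```python
# Sketch: 1-based vector ranges for test sequence partitioning with
# overlap, matching the segment boundaries given in Section 4.1.

def segment_range(i, n_vectors, p, q):
    """Inclusive range of vectors simulated by processor i (1 <= i <= p);
    q overlap vectors are prepended to every segment but the first."""
    size = n_vectors // p
    start = (i - 1) * size + 1
    if i > 1:
        start -= q                # prepend the initializing vectors
    return start, i * size

# Example: N = 10,000 vectors, p = 5 processors, overlap q = 20.
for i in range(1, 6):
    print(i, segment_range(i, 10_000, 5, 20))
# -> processor 1 simulates 1..2000, processor 2 simulates 1981..4000, ...
```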

3.2 Fault Partitioning

Fault partitioning for parallel fault simulation is illustrated in Figure 2(b). The figure shows that the fault partition Fi and the entire test set T are allocated to the ith processor. In this approach, each processor uses the entire test set T to target the fault partition Fi that it owns. This partitioning suffers from the problem that each processor has to perform the complete good circuit simulation for the entire test set T. This is a huge sequential bottleneck for any parallel implementation which employs such a partitioning strategy. Also, there is a potential problem of load imbalance across processors, depending on which faults are allocated to which processors. In the worst case, if all the hard-to-detect faults are allocated to a single processor, these faults may not be detected for a long time. Hence, the total execution time depends on how long the most heavily loaded processor takes to complete its task. It is possible to dynamically balance the load by employing a strategy where faults are migrated from busy processors to idle processors. However, one would either have to perform resimulation from the beginning [8] or migrate the circuit state information for the faulty circuits associated with these faults to an idle processor. For large circuits, migrating circuit state information is prohibitively expensive, and for large test sets, performing resimulation is very expensive as well. Hence, in general, a purely fault partitioning approach has limited scalability and does not provide good performance.
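The scalability gap between the two strategies can be seen with a back-of-the-envelope cost model; the unit costs and function names below are illustrative assumptions, not measurements from this paper.

```python
# Back-of-the-envelope cost model contrasting the two strategies.
# Illustrative assumption: good circuit simulation costs 1 unit and
# faulty circuit simulation 9 units of uniprocessor time.

def fault_partition_time(p, good=1.0, faulty=9.0):
    # Good circuit simulation is replicated on every processor,
    # so only the faulty-circuit work shrinks with p.
    return good + faulty / p

def test_partition_time(p, good=1.0, faulty=9.0):
    # Test set partitioning divides both components across processors.
    return (good + faulty) / p

for p in (1, 2, 4, 8, 16):
    print(p, fault_partition_time(p), test_partition_time(p))
# Fault partitioning saturates near the good-circuit cost; test set
# partitioning keeps scaling (ignoring the small overlap overhead).
```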

4 Parallel Test Set Partitioned Fault Simulation

We now describe three different synchronous algorithms, developed in the SPITFIRE project at the University of Illinois, for scalable parallel test set partitioned fault simulation. The first algorithm, SPITFIRE1, is a modified version of a two-stage synchronous approach that was proposed earlier [2]. The second algorithm, SPITFIRE2, is a new algorithm, similar to the first, but it attempts to reduce the amount of work done in the first algorithm by using a different partitioning strategy. These two algorithms may exhibit some degree of pessimism in the number of faults detected: a few faults may be missed if the overlap used is not sufficient. We therefore propose a new synchronous pipelined algorithm, SPITFIRE3, which avoids this pessimism.

4.1 SPITFIRE1: Synchronous Two-Stage Algorithm

In a simple parallel implementation employing the test set partitioning strategy, the test set is partitioned across the p processors as described in the previous section and illustrated in Figure 1. If N is the size of the entire test set T, then the size of each partition is approximately N/p. Assuming a vector overlap of q vectors, a processor with index i (1 < i ≤ p) is assigned vectors in the range (N(i - 1)/p - q) to Ni/p. The processor with index 1 is assigned vectors in the range 1 to N/p. The entire fault list is allocated to each processor. Thus, each processor targets the entire list of faults using a subset of the test vectors.

This simple algorithm is somewhat inefficient in that many faults are highly testable and are detected by most if not all of the test segments. Simulating these faults on all processors is a waste of time. Therefore, one can filter out these easy-to-detect faults in an initial stage in which both the fault set and the test set are partitioned among the processors. This results in a two-stage algorithm. In the first stage, each processor targets a subset of the faults using a subset of the test vectors, as illustrated in Figure 3.

[Figure 3: Partitioning in SPITFIRE1. In stage 1, processor Pi applies test segment Ti to fault partition Fi, leaving undetected fault list Ui; in stage 2, each processor applies its segment Ti to the faults left undetected by the other processors.]

A large fraction of the detected faults are identified in this initial stage, and only the remaining faults have to be simulated by all processors in the second stage. This algorithm was proposed in [2]. The overall algorithm is outlined below.

1. Partition the test set T among the p processors: {T1, T2, ..., Tp}.
2. Partition the fault list F among the p processors: {F1, F2, ..., Fp}.
3. Each processor Pi performs the first stage of fault simulation by applying Ti to Fi. Let the lists of detected and undetected faults in processor Pi after fault simulation be Ci and Ui, respectively.
4. Each processor Pi sends the detected fault list Ci to processor P1.
5. Processor P1 combines the detected fault lists from the other processors by computing C = C1 ∪ C2 ∪ ... ∪ Cp.
6. Processor P1 now broadcasts the total detected fault list C to all other processors.
7. Each processor Pi finds the list of faults it needs to target in the second stage: Gi = F - (C ∪ Fi).
8. Reset the circuit.
9. Each processor Pi performs the second stage of fault simulation by applying test segment Ti to fault list Gi.
10. Each processor Pi sends the detected fault list Di to processor P1.
11. Processor P1 combines the detected fault lists from the other processors by computing D = D1 ∪ D2 ∪ ... ∪ Dp.

The result after parallel fault simulation is the list of detected faults C ∪ D, and it is now available in processor P1. Note that Gi = ∪j≠i Uj is an equivalent expression for Gi. The reason a second stage is necessary is that every test vector must eventually target every fault that has not already been detected on some other processor. Thus, the initial fault partitioning phase is used to reduce the redundant work that may arise in detecting easy-to-detect faults. It can be observed, though, that one has to perform two stages of good circuit simulation with the test set partition on any processor. However, the first stage eliminates a lot of redundant work that might otherwise have been performed. Hence, the two-stage approach is preferred.

The test set partitioning approach for parallel fault simulation is subject to inaccuracies in the reported fault coverage only when the circuit cannot be initialized quickly from an unknown state at the beginning of each test segment. This problem can be avoided if the test set is partitioned such that each segment starts with an initialization sequence. The definitive redundant computation in the above approach is the overlap of test segments for good circuit simulation. However, if the overlap is small compared to the size of the test sequence partition assigned to a processor, then this redundant computation is negligible. Another source of redundant computation arises in the second stage, when each processor has to target the entire list of faults that remains (excluding the faults that were left undetected in that processor). In this situation, when one of the processors detects a fault, it may drop the fault from its own fault list, but the other processors will continue targeting the fault until they detect it or until they complete the simulation (i.e., until the second stage of fault simulation ends). This redundant computation overhead could be reduced by broadcasting the identifier of each fault to the other processors as soon as the fault is detected. However, the savings in computation might be offset by the overhead in communication costs.
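The following serial, set-based sketch mirrors steps 1 through 11. The detects callback stands in for a real fault simulator (such as PROOFS) and is an assumption for illustration; in the actual implementation each loop iteration runs on its own processor, and the combine and broadcast steps are interprocessor messages.

```python
# Serial, set-based sketch of the SPITFIRE1 two-stage flow.
# detects(segment, faults) is a stand-in for a real fault simulator:
# it returns the subset of `faults` detected by `segment`.

def spitfire1(T, F, detects, p):
    """T: p test segments; F: p fault partitions (iterables of faults)."""
    all_faults = set().union(*map(set, F))
    # Stage 1: processor i applies its segment T[i] to its own
    # fault partition F[i]; the detected lists C_i are combined (steps 3-5).
    C = set()
    for i in range(p):
        C |= detects(T[i], set(F[i]))
    # Stage 2 (after a circuit reset): processor i targets
    # G_i = F - (C ∪ F_i) with the same segment T[i] (steps 7-9).
    D = set()
    for i in range(p):
        G_i = all_faults - (C | set(F[i]))
        D |= detects(T[i], G_i)
    return C | D          # final detected fault list, gathered at P1
```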

4.2 SPITFIRE2: A Hybrid Approach

We now present a new algorithm which attempts to reduce the size of the partitions used in SPITFIRE1. Let us partition the test set T into (p + 1) partitions {T1, T2, ..., Tp, Tp+1}, where p is the number of processors. If N is the size of the entire test set T, then the size of each partition is now N/(p + 1). The partitioning for the two stages of fault simulation in SPITFIRE2 is illustrated in Figure 4. As can be seen from the figure, processor i uses T1 and Fi in the first stage of fault simulation. Since all faults are targeted in the first stage using the input vectors in T1, there is no need to resimulate these vectors in the second stage.

[Figure 4: Partitioning in SPITFIRE2. In stage 1, every processor applies the same segment T1 to its own fault partition Fi; in stage 2, processor Pi applies segment Ti+1 to the combined undetected fault list.]

Let G = U1 ∪ U2 ∪ ... ∪ Up be the set of undetected faults left at the end of stage 1. Then, in the second stage, processor i uses the test set Ti+1 and fault list G. Thus, in the first stage, processors target different sets of faults, and in the second stage, processors target the same list of undetected faults that was available at the end of the first stage. The advantage of this algorithm is that the number of vectors simulated in each stage is now reduced by a factor p/(p + 1) as compared to SPITFIRE1. A small additional advantage is that the faulty circuit states available for the undetected faults in the set U1 can be used for simulation with the test set T2 in the second stage of fault simulation in processor 1. It is possible, though, that one may not drop as many faults after the first stage of fault simulation as compared to SPITFIRE1.
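Under the same stubbed-simulator convention as the SPITFIRE1 sketch above, the SPITFIRE2 flow looks as follows; T now holds p + 1 segments.

```python
# Serial sketch of SPITFIRE2; detects(segment, faults) is again a
# stand-in for a real fault simulator.

def spitfire2(T, F, detects, p):
    assert len(T) == p + 1
    all_faults = set().union(*map(set, F))
    # Stage 1: every processor simulates the same segment T[0],
    # each against its own fault partition F[i].
    C = set()
    for i in range(p):
        C |= detects(T[0], set(F[i]))
    G = all_faults - C    # shared undetected list after stage 1
    # Stage 2: processor i continues with segment T[i + 1] against G;
    # the vectors of T[0] never need to be resimulated.
    D = set()
    for i in range(p):
        D |= detects(T[i + 1], G)
    return C | D
```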

4.3 SPITFIRE3: A Multistage Pipelined Synchronous Algorithm

SPITFIRE3 avoids the small degree of pessimism present in the other two algorithms. It was shown in [2] that, for a random input test vector set with an overlap of 20 vectors, only one fault in one circuit was missed, even though in certain cases the circuit state was not fully justified. We have performed two different experiments to further explore the pessimism in SPITFIRE1. The first experiment was performed with actual test vector sets from an ATPG, and the second with a random input test vector set. No faults were missed with random test vector sets when comparing a parallel run on 8 processors with a uniprocessor run. However, some faults were missed with the ATPG test vector sets (one fault missed for each of 3 circuits, three faults missed for one circuit, and no faults missed for the remaining 4 circuits) when comparing a parallel run on 8 processors with a uniprocessor run. The reason for this is partly that the ATPG test vector sets were small, resulting in small test set partitions that prevented circuit initialization. In any case, one cannot escape the fact that this pessimism may exist. This has motivated the need for a third parallel algorithm, which is a pipelined multistage version of SPITFIRE1. The algorithm is illustrated in Figure 5.

The algorithm initially executes the same algorithm as SPITFIRE1; the first stage of fault simulation is identical. Synchronization points are introduced in the second stage, at which processors exchange information about detected faults. This may reduce the amount of work that a processor has to do subsequently, since each processor does not need to target faults that have already been detected by other processors. However, the synchronization points introduce barriers which may slow down execution if the load is somewhat imbalanced across processors. Therefore, there is a tradeoff involved in using synchronization points in the second stage. At the end of the second stage, when processor i has finished executing the test vectors in test set Ti, all processors exchange information and drop all faults detected by other processors. At this point, processor i resumes execution and starts working on the vectors in partition Ti+1 without resetting its state. Hence, if processor i stopped execution at vector Ni/p, it resumes execution at vector Ni/p + 1. Note that this means processor p is now idle, since it has simulated the last vector in the test set T, while all other processors are busy. Once again, processors exchange information at synchronization points. In addition, information is stored regarding the number of faults that were detected at the end of that stage. If more faults have been detected by the next synchronization point, execution continues. If, however, there is no change in the number of faults detected between two synchronization points, execution is stopped, and the final fault coverage is the coverage available at that synchronization point. This approach has detected the faults that were found in a uniprocessor run but left undetected in a parallel run by the previous two approaches. Obviously, we pay a price in continuing execution with synchronization to identify a few more detected faults. If one is willing to tolerate the pessimism that may exist, this approach may not be essential.

Note that in this pipelined approach, at the end of the second stage of fault simulation, processor p becomes idle, and then after every N/p vectors have been simulated, processors (p - 1), (p - 2), ..., 3, 2, and 1 become idle in order. In the worst case, processor 1 may have to perform the good circuit logic simulation on the entire test set T, if more faults continue to be detected. Processors continue to work on fewer faults as more faults get detected. The caveat here is that if the synchronization points are too close to each other, some faults may still be missed. In our implementation, we introduced synchronization points at regular intervals of N/(4p) vectors, and it turned out that the faults that had not been detected were indeed detected before the second synchronization point after the end of the second stage. Thus, the extra overhead of continuing execution was not very high, and the fault coverage obtained was identical to that obtained in a uniprocessor run.
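The termination rule can be sketched independently of the simulator: processors keep advancing segment by segment, and the run stops at the first synchronization point whose global detected-fault count matches the previous one. The sync_counts iterator below stands in for the real barrier-plus-exchange step and is an illustrative assumption.

```python
# Sketch of the SPITFIRE3 termination rule: execution stops at the
# first synchronization point whose global detected-fault count
# equals the count at the previous synchronization point.

def run_until_stable(sync_counts):
    previous = None
    for count in sync_counts:
        if count == previous:
            return count        # no new detections: stop here
        previous = count
    return previous             # test set exhausted first

# Example: counts settle at 1407, so the run stops there.
print(run_until_stable(iter([1398, 1404, 1406, 1407, 1407])))
```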

5 Experimental Results

The four algorithms described in this paper were implemented using the MPI library [13]. The implementation is portable to any parallel platform which provides support for the MPI communication library. Results were obtained on

a SUN SparcCenter 1000E shared memory multiprocessor with 8 processors and 512 MB of memory, and on a network of SUN Sparc5 workstations (NOWs), each with 48 MB of memory. Results are provided for 8 circuits: s5378, s526, s1423, am2910, pcont2, piir8, mult16, and div16. These circuits were chosen because the test sets available for them were the largest among the circuits available. The mult16 circuit is a 16-bit two's complement multiplier; div16 is a 16-bit divider; am2910 is a 12-bit microprogram sequencer; pcont2 is an 8-bit parallel controller used in DSP applications; and piir8 is an 8-point infinite impulse response filter. s5378, s526, and s1423 are circuits taken from the ISCAS89 benchmark suite. Parallel fault simulation was done with a random test set of size 10,000 (i.e., a sequence of 10,000 randomly generated input test vectors) and with actual test sets obtained from an ATPG tool [11]. In practice, for large circuits, the test set sizes are very large, typically in the millions of vectors. Note that the ATPG test set sizes are still quite small for the circuits considered, except for s5378. For example, the circuit piir8 has an ATPG test set size of only 768. An overlap of 20 vectors (i.e., q = 20), as suggested in [2], was used for the algorithms here.

[Figure 5: Multistage Synchronous Pipelined Algorithm Execution (After First Stage). The number above each synchronization point indicates the vector to be simulated after synchronization; filled and open symbols denote busy and idle processors; q = vector overlap, N = number of test vectors in the test set, p = number of processors (p = 4 in the figure). The program terminates at the synchronization point where no new faults are detected.]

Table 1. Uniprocessor Execution Times (in seconds)

                                             Random Test Set (size 10,000)    Actual Test Set from an ATPG tool
Circuit   Faults   Gates  PIs  POs  FFs      Time      Time     Faults        Test     Time      Time     Faults
                                             (Shared)  (NOWs)   Det           Size     (Shared)  (NOWs)   Det
s1423     1515     754    17   5    74       195       314      802           5999     115       169      1407
s526      555      224    3    6    21       57        97       52            3734     11        17       448
mult16    1708     755    18   33   55       143       202      1468          2419     24        34       1665
div16     2141     974    33   34   50       111       145      1640          5114     53        69       1802
am2910    2391     1055   20   16   87       147       200      2115          3364     46        62       2198
pcont2    11,300   4249   9    8    24       756       1089     6829          3399     324       406      6837
piir8     19,920   8904   9    8    56       1434      1970     15,004        768      127       179      15,070
s5378     4603     3043   35   49   179      511       776      3031          31,802   1347      2073     3486

Table 1 shows the characteristics of each circuit, including the number of faults, gates, primary inputs (PIs), primary outputs (POs), and flip-flops (FFs), as well as the execution times on a single processor and the number of faults detected for both types of input test sets and for both platforms, viz., a single-thread execution on the shared memory machine and a SUN Sparc5 workstation. For the subsequent tables, SPF1, SPF2, and SPF3 refer to algorithms SPITFIRE1, SPITFIRE2, and SPITFIRE3, respectively, and FPAR refers to a fault partitioned parallel implementation. Table 2 shows the execution times in seconds on 8 processors, on the SUN SparcCenter 1000E and on the network of SUN Sparc5 workstations, obtained using the four algorithms for the ATPG test sets. Nearly the same number of

faults are detected by all algorithms. It can be seen from the table that SPF2 usually has the lowest execution time on both platforms. FPAR always has the highest execution time. SPF1 has a slightly higher execution time than SPF2 but lower than SPF3. Since the test set partitions are quite small for the ATPG test sets, one can expect some pessimism in the fault coverage obtained for algorithms SPF1 and SPF2. Table 3 shows the faults detected and the speedups on 8 processors with ATPG test sets. For SPF1 and SPF2, it can be seen that for circuits s1423, div16, and piir8, one fault was not detected compared to a uniprocessor run; for circuit s526, 3 faults were not detected; while for circuits mult16, am2910, pcont2, and s5378, the fault coverage for all algorithms was the same. We can also see that the speedups are highest for algorithms SPF2 and SPF1, slightly lower for algorithm SPF3, and lowest for FPAR. Table 4 shows the execution times in seconds on 8 processors on the SUN SparcCenter 1000E and on the network of SUN Sparc5 workstations, obtained using the four algorithms for random test sets. Similar trends can be observed here as in Table 2. Table 5 shows the number of faults detected and the speedups on 8 processors with random test sets for both shared and distributed memory parallel platforms. One striking result with the random test sets is that the fault coverage is the same for all algorithms for all circuits. Also, since the test set sizes are larger with random test sets, better speedups are obtained in general. In practice, test set sizes are very large, and therefore one could obtain even better performance and scalability with the proposed algorithms as test set sizes increase.

6 Conclusion

Parallel fault simulation has been a difficult problem due to the limited scalability and parallelism that previous algorithms could extract. Parallelization of fault simulation is limited by the serial logic simulation of the fault-free machine. By partitioning the test set across processors, we have achieved a scalable parallel implementation and have thus avoided the serial logic simulation bottleneck. We have presented three parallel algorithms developed for scalable parallel test set partitioned fault simulation. The implementations were done in MPI and were ported to a shared memory multiprocessor and to a network of workstations. The algorithms were studied with both random test sets and ATPG test sets. All algorithms showed good scalability and speedups. All algorithms provided the same fault coverage for random test sets, but a few faults were missed for four circuits with ATPG test sets for algorithms SPITFIRE1 and SPITFIRE2. The algorithm SPITFIRE3, however, provided the same result, in terms of fault coverage, as a uniprocessor run in all cases. We thus conclude that algorithms SPITFIRE1 and SPITFIRE2 can provide excellent speedups and fault coverages but may have a very small degree of pessimism in the fault coverage. These two algorithms would certainly be useful in situations where a fast fault grade of functional vectors is required. The algorithm SPITFIRE3 is a more conservative approach geared towards obtaining the exact fault coverage when that is required, while providing good speedups at the same time.

Acknowledgement

We would like to thank Mark Johnson of VLSI Technology Inc. for suggestions which led to the SPITFIRE2 algorithm.

References

[1] D. Harel and B. Krishnamurthy, "Is there hope for linear time fault simulation," Proc. Fault Tolerant Computing Symp., pp. 28-33, June 1987.
[2] E. M. Rudnick and J. H. Patel, "Overcoming the serial logic simulation bottleneck in parallel fault simulation," Proc. 10th Int. Conf. VLSI Design, pp. 495-501, 1997.
[3] T. M. Niermann, W.-T. Cheng, and J. H. Patel, "PROOFS: A fast, memory-efficient sequential circuit fault simulator," IEEE Trans. Computer-Aided Design, pp. 198-207, February 1992.
[4] P. Banerjee, Parallel Algorithms for VLSI Computer-Aided Design. Englewood Cliffs, NJ: PTR Prentice Hall, 1994.
[5] S. Patil, P. Banerjee, and J. H. Patel, "Parallel test generation for sequential circuits on general-purpose multiprocessors," Proc. Design Automation Conf., pp. 155-159, 1991.
[6] P. Agrawal, V. D. Agrawal, K. T. Cheng, and R. Tutundjian, "Fault simulation in a pipelined multiprocessor system," Proc. Int. Test Conf., pp. 727-734, 1989.
[7] S. Bose and P. Agrawal, "Concurrent fault simulation of logic gates and memory blocks on message passing multicomputers," Proc. Design Automation Conf., pp. 332-335, 1992.
[8] S. Parkes, P. Banerjee, and J. Patel, "A parallel algorithm for fault simulation based on PROOFS," Proc. Int. Conf. Computer Design, pp. 616-621, 1995.
[9] M. B. Amin and B. Vinnakota, "ZAMBEZI: A parallel pattern parallel fault sequential circuit fault simulator," Proc. VLSI Test Symp., pp. 438-443, 1996.
[10] M. B. Amin and B. Vinnakota, "Zamlog: A parallel algorithm for fault simulation based on Zambezi," Proc. Int. Conf. Computer-Aided Design, pp. 509-512, 1996.
[11] M. S. Hsiao, E. M. Rudnick, and J. H. Patel, "Automatic test generation using genetically-engineered distinguishing sequences," Proc. VLSI Test Symp., pp. 216-223, 1996.
[12] A. Warshawsky and J. Rajski, "Distributed fault simulation with vector set partitioning," VLSI Design Laboratory, McGill University, Montreal, Canada, 1991.
[13] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface. Cambridge, MA: MIT Press, 1994.

Table 2. Execution Time on 8 Processors with Actual ATPG Test Sets

          Shared Memory Multiprocessor      Network of Workstations
          Execution Time (secs)             Execution Time (secs)
Circuit   FPAR   SPF1   SPF2   SPF3         FPAR   SPF1   SPF2   SPF3
s1423     51     21     21     29           66     33     28     45
s526      9.8    5.1    4.0    4.5          13.9   6.7    5.7    7.8
mult16    19.9   6.4    6.1    6.2          27.3   9.0    8.0    10.1
div16     45     28     25     24           63     31     29     34
am2910    35     11     11     13           45     15     12     15
pcont2    144    110    70     110          179    166    89     161
piir8     73     48     39     59           86     71     47     86
s5378     809    242    255    271          1005   355    322    429

Table 3. Faults Detected and Speedups on 8 Processors with Actual ATPG Test Sets

          Faults Detected                     Shared Memory Speedup      NOWs Speedup
Circuit   FPAR    SPF1    SPF2    SPF3        FPAR  SPF1  SPF2  SPF3     FPAR  SPF1  SPF2  SPF3
s1423     1407    1406    1406    1407        2.3   5.4   5.5   4.0      2.5   5.1   5.9   3.8
s526      448     445     445     448         1.1   2.1   2.7   2.4      1.2   2.6   3.0   2.2
mult16    1665    1665    1665    1665        1.2   3.8   4.0   3.9      1.2   3.8   4.3   3.4
div16     1802    1801    1801    1802        1.2   1.9   2.1   2.2      1.1   2.2   2.3   2.0
am2910    2198    2198    2198    2198        1.3   4.2   4.1   3.6      1.4   4.1   5.1   4.0
pcont2    6837    6837    6837    6837        2.2   2.9   4.7   2.9      2.3   2.4   4.6   2.5
piir8     15,070  15,069  15,069  15,070      1.7   2.6   3.2   2.2      2.1   2.5   3.8   2.1
s5378     3486    3486    3486    3486        1.7   5.6   5.3   5.0      2.1   5.8   6.4   4.8

Table 4. Execution Time on 8 Processors with Random Test Sets

          Shared Memory Multiprocessor      Network of Workstations
          Execution Time (secs)             Execution Time (secs)
Circuit   FPAR    SPF1   SPF2   SPF3        FPAR   SPF1   SPF2   SPF3
s1423     112     38     39     45          113    57     49     65
s526      30      10     11     17          42     19     17     30
mult16    78.16   32     33     37          117    47     42     58
div16     105     26     27     29          137    32     29     37
am2910    126     28     41     32          139    40     43     43
pcont2    421     138    172    187         512    205    180    242
piir8     800     271    316    354         1060   378    341    490
s5378     279     93     94     118         330    159    120    172

Table 5. Faults Detected and Speedups on 8 Processors with Random Test Sets

          Faults Detected                     Shared Memory Speedup      NOWs Speedup
Circuit   FPAR    SPF1    SPF2    SPF3        FPAR  SPF1  SPF2  SPF3     FPAR  SPF1  SPF2  SPF3
s1423     802     802     802     802         1.7   5.2   5.0   4.3      2.8   5.5   6.4   4.8
s526      52      52      52      52          1.9   5.5   5.2   3.4      2.3   5.2   5.7   3.2
mult16    1468    1468    1468    1468        1.8   4.5   4.3   3.9      1.7   4.3   4.8   3.5
div16     1640    1640    1640    1640        1.1   4.3   4.0   3.8      1.1   4.4   5.0   3.9
am2910    2115    2115    2115    2115        1.2   5.2   3.6   4.5      1.4   5.0   4.6   4.7
pcont2    6829    6829    6829    6829        1.8   5.5   4.4   4.0      2.1   5.3   6.0   4.5
piir8     15,004  15,004  15,004  15,004      1.8   5.3   4.5   4.0      1.9   5.2   5.8   4.0
s5378     3031    3031    3031    3031        1.8   5.5   5.4   4.3      2.3   4.9   6.5   4.5
