Workshop on Parallel and Distributed Simulation, 1997

Asynchronous Parallel Algorithms for Test Set Partitioned Fault Simulation

Prithviraj Banerjee‡, Elizabeth M. Rudnick†, Janak H. Patel†, Dilip Krishnaswamy†

†Center for Reliable and High-Performance Computing, University of Illinois, Urbana, IL
‡Center for Parallel and Distributed Computing, Northwestern University, Evanston, IL

This research was supported in part by the Semiconductor Research Corporation under Contract SRC 95-DP-109 and by the Advanced Research Projects Agency under contracts DAA-H04-94-G-0273 and DABT63-95-C-0069, administered by the Army Research Office.

Abstract

We propose in this paper two new asynchronous parallel algorithms for test set partitioned fault simulation. The algorithms are based on a new two-stage approach to parallelizing fault simulation for sequential VLSI circuits in which the test set is partitioned among the available processors. These algorithms provide the same result as the previous synchronous two-stage approach. However, due to their dynamic characteristics and because they perform very little redundant work, they run faster than the previous synchronous approach. A theoretical analysis comparing the various algorithms is also given to provide insight into these algorithms. The implementations were done in MPI and are therefore portable to many parallel platforms. Results are shown for a shared-memory multiprocessor.

1 Introduction

Fault simulation is an important step in the electronic design process and is used to identify faults that cause erroneous responses at the outputs of a circuit for a given test set. The objective of a fault simulation algorithm is to find the fraction of the total faults in a sequential circuit that is detected by a given set of input vectors (also referred to as the fault coverage). In its simplest form, a fault is injected into a logic circuit by setting a line or a gate to a faulty value (1 or 0), and then the effects of the fault are simulated using zero-delay logic simulation. Fault simulation algorithms are typically of O(n^2) time complexity, where n is the number of lines in the circuit, and studies have shown that there is little hope of finding a linear-time fault simulation algorithm [1]. In a typical fault simulator, the good circuit (fault-free circuit) and the faulty circuits are simulated for each test vector. If the output responses of a faulty circuit differ from those of the good circuit, then the corresponding fault is detected, and the fault can be dropped from the fault list, speeding up simulation of subsequent test vectors. A fault simulator can

be run in stand-alone mode to grade an existing test set, or it can be interfaced with a test generator to reduce the number of faults that must be explicitly targeted by the test generator. In a random pattern environment, the fault simulator helps in evaluating the fault coverage of a set of random patterns. In either of the two environments, fault simulation can consume a significant amount of time, especially in random pattern testing, for which millions of vectors may have to be simulated. Thus, parallel processing can be used to reduce the fault simulation time significantly. We propose in this paper two scalable asynchronous parallel fault simulation algorithms with the test vector set partitioned across processors. This paper is organized as follows. In Section 2, we describe the various existing approaches to parallel fault simulation and we motivate the need for a test set partitioned approach to parallel fault simulation. In Section 3, we discuss our approach to test sequence partitioning. In Section 4, we present the various algorithms that have been implemented including the two proposed asynchronous algorithms. A theoretical analysis of the sequential and parallel algorithms proposed is given in Section 5 to provide a deeper insight into the algorithms. The results are presented in Section 6, and all algorithms are compared. Section 7 is the conclusion.

2 Parallel Fault Simulation

Due to the long execution times for large circuits, several algorithms have been proposed for parallelizing sequential circuit fault simulation [2]. A circuit partitioning approach to parallel sequential circuit fault simulation is described in [3]. The algorithm was implemented on a shared-memory multiprocessor. The circuit is partitioned among the processors, and since the circuit is evaluated level by level with barrier synchronization at each level, the gates at each level should be evenly distributed among the processors to balance the workloads. An average speedup of 2.16 was obtained for 8 processors, and the speedup for the ISCAS89 circuit s5378 was 3.29. This approach is most suitable for a shared-memory architecture and for circuits with many levels of logic. Algorithmic partitioning was proposed for concurrent fault simulation in [4][5]. A pipelined algorithm was developed, and specific functions were assigned to each processor. An estimated speedup of 4 to 5 was reported for 14 processors,

based on software emulation of a message-passing multicomputer [5]. The limitation of this approach is that it cannot take advantage of a larger number of processors. Fault partitioning is a more straightforward approach to parallelizing fault simulation. With this approach [6][7], the fault list is statically partitioned among all processors, and each processor must simulate the good circuit and the faulty circuits in its partition. Good circuit simulation on more than one processor is obviously redundant computation. Alternatively, if a shared-memory multiprocessor is used, the good circuit may be simulated by just one processor, but the remaining processors will lie idle while this processing is performed, at least for the first time frame. Fault partitioning may also be performed dynamically during fault simulation to even out the workloads of the processors, at the expense of extra interprocessor communication [6]. Speedups in the range 2.4–3.8 were obtained for static fault partitioning over 8 processors for the larger ISCAS89 circuits having reasonably high fault coverages (e.g., s5378 ) [6][7]. No significant improvements were obtained for these circuits with dynamic fault partitioning due to the overheads of load redistribution [6]. However, in both the static and dynamic fault partitioning approaches, the shortest execution time will be bounded by the time to perform good circuit logic simulation on a single processor. One observation that can be made about the fault partitioning experiments is that larger speedups are obtained for circuits having lower fault coverages [6][7]. These results highlight the fact that the potential speedup drops as the number of faults simulated drops, since the good circuit evaluation takes up a larger fraction of the computation time. The good circuit evaluation is not parallelized in the fault partitioning approach, and therefore, speedups are limited. For example, if good circuit logic simulation takes about 20 percent of the total fault simulation time on a single processor, then by Amdahl’s law, one cannot expect a speedup of more than 5 on any number of processors. Parallelization of good circuit logic simulation, or simply logic simulation, is therefore very important and it is known to be a difficult problem. Most implementations have not shown an appreciable speedup. Parallelizing logic simulation based on partitioning the circuit has been suggested but has not been successful due to the high level of communication required between parallel processors. Recently, a new algorithm was proposed, where the test vector set was partitioned among the processors [8]. We will call this algorithm, SPITFIRE1. Fault simulation proceeds in two stages. In the first stage, the fault list is partitioned among the processors, and each processor performs fault simulation using the fault list and test vectors in its partition. In the second stage, the undetected fault lists from the first stage are combined, and each processor simulates all faults in this list using test vectors in its partition. Obviously, the test set partitioning strategy provides a more scalable implementation, since the good circuit logic simulation is also distributed

over the processors. Test set partitioning is also used in the parallel fault simulator Zamlog [9], but Zamlog assumes that independent test sequences are provided which form the partition. If only one test sequence is given, Zamlog does not partition it, and if, for example, only 4 independent sequences are given, it cannot use more than 4 processors. Our work does not make any assumption about the independence of test sequences and hence is scalable to any number of processors. It was shown in [8][10] that the synchronous two-stage algorithm, SPITFIRE1, performs better than fault-partitioned parallel approaches. Other synchronous algorithms, SPITFIRE2 and SPITFIRE3, which are extensions of the SPITFIRE1 algorithm, were presented in [10]. SPITFIRE3, in particular, is a synchronous pipelined approach which helps in overcoming any pessimism that may exist in a single-stage or two-stage approach. We propose in this paper two new asynchronous algorithms based on the test set partitioning strategy for parallel fault simulation. We will demonstrate that the asynchronous algorithms perform better than their synchronous counterparts and will explain why. The first algorithm, SPITFIRE4, is a two-stage algorithm and is a modification of the SPITFIRE1 algorithm described above. It leaves the first stage unchanged, but the second stage is implemented with asynchronous communication between processors. The second algorithm, SPITFIRE5, obviates the need for two stages: the entire parallel fault simulation strategy is accomplished in one stage with asynchronous communication between processors.

3 Test Sequence Partitioning

Parallel fault simulation through test sequence partitioning is illustrated in Figure 1. We use the terms test set and test sequence interchangeably here, and both are assumed to be an ordered set of test vectors.

[Figure 1. Test Sequence Partitioning. Example: a test sequence of 5n vectors is divided among 5 processors P1-P5, one segment of n vectors per processor, with a small overlap of vectors between successive segments.]

The test set is partitioned among the available processors, and each processor performs the good and faulty circuit simulations for vectors in its partition only, starting from an all-unknown (X) state. Of course, the state would not really be unknown for segments other than the first if we did not partition the vectors. Since the unknown state is a superset of the known state, the simulation will be correct but may have more X values at the outputs than the serial simulation. This is considered pessimistic simulation in

the sense that the parallel implementation produces an X at some outputs which are in fact known to be 0 or 1. From a pure logic simulation perspective, this pessimism may or may not be acceptable. However, in the context of fault simulation, the effect of the unknown values is that a few faults which are detected in the serial simulation are not detected in the parallel simulation. Rather than accept this small degree of pessimism, the test set partitioning algorithm tries to correct it as much as possible. To compute the starting state for each test segment, a few vectors from the preceding segment are prepended to the segment. This process creates an overlap of vectors between successive segments, as shown in Figure 1. Our hypothesis is that a few vectors can act as initializing vectors to bring the machine to a state very close to the correct state, if not exactly the same state. Even if the computed state is not close to the actual state, it still has far fewer unknown values than exist when starting from an all-unknown state. Results in [8] showed that this approach indeed reduces the pessimism in the number of fault detections. The number of initializing vectors required depends on the circuit and how easy it is to initialize. If the overlap is larger than necessary, redundant computations will be performed in adjacent processors, and efficiency will be lost. However, if the overlap is too small, some faults that are detected by the test set may not be identified, and thus the fault coverage reported may be overly pessimistic.
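As a concrete illustration of this partitioning (our own sketch in C; the function and parameter names are not from the paper), each processor's slice of the test sequence can be computed as its base segment plus a fixed number of initializing vectors taken from the end of the preceding segment:

/* Illustrative computation of processor `rank`'s portion of the test
   sequence: vectors [first, last) are simulated, of which the first
   `overlap` vectors (absent for processor 0) only serve to warm up the
   circuit state and are also simulated by the preceding processor.
   All names here are ours, chosen for the example. */
void test_segment(int rank, int nprocs, int num_vectors, int overlap,
                  int *first, int *last)
{
    int seg   = num_vectors / nprocs;                      /* base segment length */
    int start = rank * seg;
    int end   = (rank == nprocs - 1) ? num_vectors : start + seg;
    *first = (rank == 0) ? 0 : start - overlap;            /* prepend overlap vectors */
    if (*first < 0)
        *first = 0;                                        /* overlap larger than a segment */
    *last = end;
}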

4 Parallel Test Set Partitioned Algorithms

We now describe four different algorithms for test set partitioned parallel fault simulation. The first two algorithms are parallel single-stage and two-stage synchronous approaches which have been proposed earlier [8][10]. The third and fourth algorithms are parallel two-stage and single-stage asynchronous approaches.

4.1 SPITFIRE0: Single-Stage Synchronous Algorithm

In this approach, the test set is partitioned across the processors as described in the previous section. This algorithm is presented as a baseline for the various test set partitioning approaches to be described later. The entire fault list is allocated to each processor; thus, each processor targets the entire list of faults using a subset of the test vectors. Each processor proceeds independently and drops the faults that it detects, and the results are merged at the end.
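As a rough sketch of that final merge step (our own illustration, not the authors' code): if each processor keeps one flag per fault marking detection, the per-processor results can be combined with a single MPI reduction:

#include <mpi.h>

/* Illustrative merge step for SPITFIRE0 (names are ours): each processor
   fault-simulates the full fault list on its own test segment, setting
   detected[f] = 1 for every fault f it detects; a logical-OR reduction
   then leaves the union of all detections in merged[] on every processor. */
void merge_detected(const int *detected, int *merged, int num_faults)
{
    MPI_Allreduce(detected, merged, num_faults, MPI_INT, MPI_LOR, MPI_COMM_WORLD);
}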

4.2 SPITFIRE1: Synchronous Two-Stage Algorithm

The simple algorithm described above is somewhat inefficient in that many faults are very testable and are detected by most if not all of the test segments. Simulating these faults on all processors is a waste of time. Therefore, one can filter out these easy-to-detect faults in an initial stage in which both the

fault set and the test set are partitioned among the processors. This results in a two-stage algorithm. In the first stage, each processor targets a subset of the faults using a subset of the test vectors, as illustrated in Figure 2.

[Figure 2. Partitioning in SPITFIRE1. In Stage 1, the fault list is partitioned into F1-F5 and the test set into T1-T5, and processor Pi applies Ti to Fi, leaving undetected faults Ui. In Stage 2, the undetected fault lists U1-U5 are combined, and each processor Pi applies its test segment Ti to the undetected faults originating from the other processors' partitions.]

A large fraction of the detected faults are identified in this initial stage, and only the remaining faults have to be simulated by all processors in the second stage. This algorithm was proposed in [8]. The overall algorithm is outlined below.

1. Partition the test set T among the p processors: {T1, T2, ..., Tp}.
2. Partition the fault list F among the p processors: {F1, F2, ..., Fp}.
3. Each processor Pi performs the first stage of fault simulation by applying Ti to Fi. Let the lists of detected and undetected faults in processor Pi after fault simulation be Ci and Ui, respectively.
4. Each processor Pi sends the detected fault list Ci to processor P1.
5. Processor P1 combines the detected fault lists from the other processors by computing C = C1 ∪ C2 ∪ ... ∪ Cp.
6. Processor P1 now broadcasts the total detected fault list C to all other processors.
7. Each processor Pi finds the list of faults it needs to target in the second stage: Gi = F − (C ∪ Fi).
8. Reset the circuit.
9. Each processor Pi performs the second stage of fault simulation by applying test segment Ti to fault list Gi.
10. Each processor Pi sends the detected fault list Di to processor P1.
11. Processor P1 combines the detected fault lists from the other processors by computing D = D1 ∪ D2 ∪ ... ∪ Dp.

The result after parallel fault simulation is the list of detected faults C ∪ D, and it is now available in processor P1. Note that Gi can equivalently be written as the union of the undetected fault lists Uj over all j ≠ i. A second stage is necessary because every test vector must eventually target every fault that has not already been detected on some other processor. Thus, the initial fault partitioning phase serves to reduce the redundant work that arises in detecting easy-to-detect faults. It can be observed, though, that each processor has to perform two stages of good circuit simulation with its test segment. However, the first stage eliminates

a lot of redundant work that might otherwise have been performed. Hence, the two-stage approach is preferred. The test set partitioning approach for parallel fault simulation is subject to inaccuracies in the reported fault coverage only when the circuit cannot be initialized quickly from an unknown state at the beginning of each test segment. This problem can be avoided if the test set is partitioned such that each segment starts with an initialization sequence. One definite source of redundant computation in the above approach is the overlap of test segments for good circuit simulation. However, if the overlap is small compared to the size of the test segment assigned to a processor, this redundant computation will be negligible. Another source of redundant computation arises in the second stage, when each processor has to target the entire list of faults that remains (excluding the faults that it left undetected in the first stage). In this situation, when one of the processors detects a fault, it may drop the fault from its own fault list, but the other processors may continue targeting the fault until they detect it or until they complete the simulation (i.e., until the second stage of fault simulation ends). This redundant computation could be reduced by broadcasting the identifier of a fault to the other processors as soon as the fault is detected. However, the savings in computation might be offset by the overhead in communication costs.
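To make steps 4-7 of the outline concrete, the following sketch (ours, not the authors' code; the flag-array representation of fault lists and all names are illustrative) shows how the detected lists can be combined and each processor's Stage 2 target list Gi formed in MPI:

#include <mpi.h>
#include <stdlib.h>

/* Illustrative sketch of steps 4-7 of SPITFIRE1.  detected_stage1[f] = 1 if
   this processor detected fault f in Stage 1; in_my_partition[f] = 1 if
   fault f was in Fi.  On return, target[f] = 1 exactly for the faults in
   Gi = F − (C ∪ Fi) that this processor must simulate in Stage 2. */
void build_stage2_list(const int *detected_stage1, const int *in_my_partition,
                       int *target, int num_faults)
{
    int *c = malloc(num_faults * sizeof(int));   /* C: detected on any processor */
    /* Steps 4-6: combine the detected lists.  A single all-reduce plays the
       role of the gather at P1 followed by the broadcast in the outline above. */
    MPI_Allreduce(detected_stage1, c, num_faults, MPI_INT, MPI_LOR, MPI_COMM_WORLD);
    /* Step 7: Gi = F − (C ∪ Fi). */
    for (int f = 0; f < num_faults; f++)
        target[f] = !c[f] && !in_my_partition[f];
    free(c);
}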

4.3 SPITFIRE4: A Two-Stage Asynchronous Algorithm

We now describe an asynchronous version of the SPITFIRE1 algorithm. Consider the second stage of fault simulation in SPITFIRE1. All processors work on almost the same list of undetected faults that was available at the end of the first stage, each excluding only the faults that it itself failed to detect in Stage 1. It would therefore be advantageous for each processor to periodically communicate to all other processors a list of any faults that it detects. Thus, each processor asynchronously sends a list of newly detected faults to all other processors provided that it has detected at least MinFaultLimit new faults. Each processor periodically probes for messages from other processors and drops any faults that are received through messages. This helps in reducing the load on a processor if it has not detected these faults yet. Thus, by allowing each processor to asynchronously communicate detected faults to all other processors, we dynamically reduce the load on each processor. It should be observed that in the first stage of SPITFIRE1, all processors are working on different sets of faults. Hence, there is no need to communicate detected faults during Stage 1, since this would not have any effect on the workload of any processor. It therefore makes sense to communicate all detected faults only at the end of Stage 1. The asynchronous algorithm used for fault simulation in Stage 2 by any processor Pi is outlined below.

Set NumberOfNewFaultsDetected = 0
For each vector in the test set Ti
    FaultSimulate vector
    if (NumberOfNewFaultsDetected > MinFaultLimit) then
        Send the list of newly detected faults to all processors
            using a buffered asynchronous send
        Set NumberOfNewFaultsDetected = 0
    end if
    while (CheckForAnyMessages())
        Receive new message using a blocking receive
        Drop newly received faults (if not dropped earlier)
    end while
end for

The routine CheckForAnyMessages() is a non-blocking probe which returns a 1 only if there is a message pending to be received.
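One way to realize this exchange in MPI (a sketch under our own naming and conventions; the paper does not give implementation code) is to pair a buffered send with a non-blocking probe, so that a processor never waits on its peers:

#include <mpi.h>
#include <stdlib.h>

#define TAG_DETECTED 1   /* illustrative message tag */

/* Send a batch of newly detected fault identifiers to every other processor.
   MPI_Bsend returns as soon as the message is copied into a buffer previously
   attached with MPI_Buffer_attach, so the sender does not wait. */
void send_new_faults(const int *fault_ids, int count, int rank, int nprocs)
{
    for (int p = 0; p < nprocs; p++)
        if (p != rank)
            MPI_Bsend(fault_ids, count, MPI_INT, p, TAG_DETECTED, MPI_COMM_WORLD);
}

/* Drain any pending fault lists from other processors.  MPI_Iprobe is the
   non-blocking probe playing the role of CheckForAnyMessages(); each pending
   message is then received with a blocking receive (which no longer waits,
   since the probe reported a message) and its faults are dropped from the
   local fault list by marking them detected. */
void receive_new_faults(int *detected, int max_batch)
{
    int flag, count;
    MPI_Status st;
    int *buf = malloc(max_batch * sizeof(int));
    MPI_Iprobe(MPI_ANY_SOURCE, TAG_DETECTED, MPI_COMM_WORLD, &flag, &st);
    while (flag) {
        MPI_Recv(buf, max_batch, MPI_INT, st.MPI_SOURCE, TAG_DETECTED,
                 MPI_COMM_WORLD, &st);
        MPI_Get_count(&st, MPI_INT, &count);
        for (int i = 0; i < count; i++)
            detected[buf[i]] = 1;            /* drop fault if not dropped earlier */
        MPI_Iprobe(MPI_ANY_SOURCE, TAG_DETECTED, MPI_COMM_WORLD, &flag, &st);
    }
    free(buf);
}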

4.4 SPITFIRE5: A Single-Stage Asynchronous Algorithm

It is possible to employ the same asynchronous communication strategy used in algorithm SPITFIRE4 for algorithm SPITFIRE0. In the latter algorithm, all processors start with the same list of undetected faults, which is the entire fault list F. Only the faults that a processor itself detects are dropped from its list, and each processor continues to work on a large set of undetected faults. Once again, it makes sense for each processor to communicate detected faults periodically to the other processors provided that it has detected at least MinFaultLimit new faults. The value of MinFaultLimit is circuit dependent. It also depends on the parallel platform used for parallel fault simulation. For a very small circuit with mostly easy-to-detect faults, it may not make sense to set MinFaultLimit too small, as this may result in too many messages being communicated. On the other hand, if the circuit is reasonably large, or if faults are hard to detect, the granularity of computation between two successive communication steps will be large, and it may make sense to use a small value of MinFaultLimit. Similarly, it may be more expensive to communicate often on a distributed parallel platform such as a network of workstations, whereas this factor may not matter as much on a shared-memory machine. Our results were obtained on a shared-memory multiprocessor, where the value of MinFaultLimit was empirically chosen to be 5, as we will show. This means that whenever a processor detects at least 5 new faults, it communicates them to the other processors to possibly reduce the load on processors that may still be working on these faults. It is therefore important to keep the computation-to-communication ratio high; hence, depending on the parallel platform used, one needs to arrive at a suitable frequency at which faults are communicated between processors. One may also use the number of vectors in the test set that have been simulated, say MinVectorLimit,

as a control parameter to regulate the frequency of synchronization. This may be useful towards the end of fault simulation, when faults are detected very slowly. One can also use both parameters, MinFaultLimit and MinVectorLimit, simultaneously and communicate faults if either control parameter is exceeded. As long as the granularity of the computation is large enough compared to the communication costs involved, one can expect good performance from an asynchronous approach. If communication costs were zero, one would ideally communicate faults to the other processors as soon as they are detected; if the frequency of communication is reduced, one may have to perform more redundant computation. There is a tradeoff between algorithms SPITFIRE4 and SPITFIRE5. In SPITFIRE4, we have a completely communication-independent phase in Stage 1 followed by an asynchronous, communication-intensive phase. In SPITFIRE5, however, we have only one stage of fault simulation, which means that the good circuit simulation with test set Ti on processor Pi needs to be performed only once. Thus, although we may have continuous communication in algorithm SPITFIRE5, we may obtain substantial savings by performing only one stage of fault simulation. We will see in the next section that this is indeed the case. The same approach to asynchronous communication that was discussed in the previous section is used for this algorithm; however, it is applied to the first and only stage of fault simulation.
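As an illustration of the combined trigger (our own fragment; MIN_VECTOR_LIMIT and its value are assumed for the example, while the value 5 for MIN_FAULT_LIMIT is the empirical shared-memory setting mentioned above):

/* Illustrative trigger: communicate newly detected faults when either at least
   MIN_FAULT_LIMIT new faults have accumulated or at least MIN_VECTOR_LIMIT
   vectors have been simulated since the last exchange.  MIN_FAULT_LIMIT = 5
   is the value reported for the shared-memory runs; MIN_VECTOR_LIMIT = 64 is
   an assumed value, for illustration only. */
#define MIN_FAULT_LIMIT   5
#define MIN_VECTOR_LIMIT  64

static int new_faults_detected = 0;   /* incremented by the fault simulator */
static int vectors_since_send  = 0;   /* incremented once per vector        */

int should_communicate(void)
{
    if (new_faults_detected >= MIN_FAULT_LIMIT ||
        vectors_since_send  >= MIN_VECTOR_LIMIT) {
        new_faults_detected = 0;
        vectors_since_send  = 0;
        return 1;   /* caller sends its batch of newly detected faults */
    }
    return 0;
}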

5 Analysis of Algorithms

A theoretical analysis of the various algorithms is now presented. We first provide an analysis of serial fault simulation and then extend the analysis for various test set partitioning approaches and for a fault partitioning approach.

remaining, U(n), when the n'th vector has to be simulated is given by U(n) = F(1 - r(1 - e^(-(n-1)))), where F is the total number of faults in the circuit. Let us assume a unit cost of execution, in seconds per gate evaluation, and assume that a fixed fraction of the total number of gates G in the circuit is simulated for each faulty circuit. Then the cost of simulating all the faulty circuits left at the n'th vector is proportional to G U(n). Assume also that a fixed fraction of the gates G is simulated for the good circuit logic simulation of each vector. (Usually
