Functional F max test-time reduction using novel DFTs for circuit ...

2 downloads 86 Views 288KB Size Report
{ujjwal.guin, tehrani}@engr.uconn.edu tapanc@qualcomm.com. Abstract—Using functional test for Fmax analysis is still the only effective method used in ...
Functional Fmax Test-Time Reduction using Novel DFTs for Circuit Initialization 6KKXBM(VJO1∗,5BQBO$IBLSBCPSUZ2,and.PIBNNBE5FISBOJQPPS1∗ 1 ECE

Dept., University of Connecticut {ujjwal.guin, tehrani}@engr.uconn.edu

2 Qualcomm

[email protected]

Figure 1 shows conventional functional testing with initialization (IN) and computation (COMP) phases. Two distinct scenarios may occur. We will refer to these two scenarios as Case I and Case II, where, • Case I: The circuit is always initialized to the same initialization state before the computation phase starts (see Figure 1(a)). • Case II: The circuit is initialized to different initialization states for different computation phases (see Figure 1(b)). We will later develop our proposed approaches based on these two cases.

Abstract—Using functional test for Fmax analysis is still the only effective method used in practice in spite of the fact that the test cost associated with functional Fmax test remains to be a major problem. In this paper, we develop novel design-for-testability (DFT) structures to considerably reduce the cost of initializing the circuit during functional test. The proposed architectures take advantage of existing DFT structures to reduce the overall cost of hardware and have no impact on the circuit timing. Our implementations of these DFT structures for initializing ITC’99 benchmark circuit b19 demonstrate the effectiveness of these techniques in reducing test time and thus the overall test cost. Keywords—Functional Fmax , Initialization, Design for Testability, and Compression.



I. I NTRODUCTION Over the past decade, there has been a major shift toward using structural tests to reduce the overall test cost. Structural tests are now used for stuck-at faults, open faults, bridging faults, transition delay faults and path delay faults. In recent years, several research attempts have been made to estimate the functional Fmax using structural tests [1] [2] [3] [4] [5]. However, due to the increased noise and crosstalk effects in scan-based testing, the inaccuracy of performance verification using path and transition delay faults have kept the functional Fmax test an unavoidable approach for many semiconductor companies. The major issues associated with the functional test include: (1) difficulty of generating effective test patterns targeting long paths in the circuit, (2) may lead to an increase of up to 30-40% in the total production test cost, and (3) the requirement of a high-speed tester in order to apply functional test patterns to the circuit under test. Further analyzing the cost of functional Fmax test shows that the initialization phase of the test contributes significantly to the total test time. To start functional testing, the design must be initialized to a desired functional state, i.e., initialization state (IS) or initialization pattern (IP). The design reaches the initialization state at the end of the initialization phase, and then the computation phase begins. The speed-path parameters of the design are checked in the computation phase where the functional vectors are applied to the design, and the response is collected and compared with the expected one. A significant amount of test time is spent on the initialization phase, which creates a time overhead to the functional tests. ∗









(a) Functional testing with same initialization phase. 













(b) Functional testing with different initialization phases. Figure 1.

Functional testing with initialization and computation phases.

Today’s complex designs are highly modular and consist of multiple modules or IPs (intellectual properties). In functional mode, these modules work inter-connectedly to produce the functional response. Let us consider the example of a microprocessor. It generally consists of a core, a floating point unit, a memory, a dram controller, a cache controller etc. To test the floating point unit, we must initialize the dram controller for proper memory operations. A set of functional test vectors are required to test a module when the other modules help to produce to the correct inputs and pass the response to the primary outputs. To do so, each module has to be initialized properly such as the finite state machine corresponding to that module has to be set to a desired functionally valid state. Setting the initialization state of a complex industrial circuit through primary inputs becomes a major challenge and could potentially take millions of clock cycles [6] depending on the size and type of the circuit. Moreover, it wastes resources by requiring functional test engineers to write the test patterns. On the other hand, design for testability (DFT) based on scan

This work was supported in part by a research gift from Qualcomm.

978-1-4799-2987-0/13/$31.00 ©2013 IEEE





1

and automatic test pattern generation (ATPG) has been widely accepted by the industry to obtain high test coverage for structural tests. Shifting the initialization state through DFT structure into the scan chain is a possible approach. Figure 2(a) shows the functional testing process to initialize the circuit through primary inputs (PIs). The circuit is set to the initialization state after n functional clock cycles (initialization cycles) and the corresponding initialization time is referred to as TI . Figure 2(b) shows the initialization of the circuit through DFT structure. The IP is shifted into the scan chains through the test inputs (test si). Then the circuit is set to functional mode and the computation phase starts. This initialization time is referred to as the DFT-based initialization time (TDFT I ).





 



 

 

 



Today, almost every industrial circuit uses compression as a part of DFT to compress the test data and thus reduce test time and cost. Compression is based on either in the form of linear feedback shift register (LFSR) [7] [8] [9] [10] or adaptive scan [11] [12] [13]. The obvious question is, can we use compression to initialize the design for functional testing? The standard compression technique relies entirely on the high percentage of don’t-cares (Xs) in the test cubes. The compression scheme fails to find a solution when the test cube is dominated by care bits [10]. The problem with using compression in the initialization phase is that the IP is completely specified and there are no don’t-care bits in it. The results in Table I do not consider compression during the initialization phase. If we are able to incorporate the compression during initialization, our gain will increase to the compression ratio (k) times the improvement mentioned above. In this case the break even will only be 1K initialization cycles when k = 30 and we can save significant amount of initialization time when using DFT. In our proposed approaches, the initialization is carried out through the existing DFT structure, taking compression-like behavior into consideration to reduce initialization time. Currently, there is not enough research carried out to address the initialization during functional testing. In [6], the authors mention the initialization for functional testing using a fullhold scan system. They used DFT to initialize the design. In their approach, the design is loaded with all 0s during reset pin assertion and then a scan store operation is executed to initialize the state of 200K nonarray sequential elements in the design to 0. The authors do not consider compression during scan store, which results in a slower shift-in frequency.





    

  ,3

(a) Timing waveforms for functional testing without using DFT.  

  



 

 









(b) Timing waveforms for functional testing using DFT. Figure 2.

Timing waveforms for functional testing.

Let us now consider the test parameters of a design to calculate the test time. Assume that the shift-in frequency (structural frequency, fs ), functional frequency ( f f , f f  fs ), length of the scan chain (L), and compression ratio (k = M/N) are 40 MHz, 200 MHz, 200, and 30, respectively. Here, N and M are the number of tester pins and the number of internal scan chains, respectively. We will follow these notations throughout the rest of the paper. The benefit of using DFT will be obvious if the initialization time through scan chains is much less than that of the initialization through primary inputs. Table I shows the time for one initialization phase with different numbers of initialization cycles. It is clear from the table that TDFT I is constant and equal to (LM/N fs ) whereas TI increases linearly with n. The break-even point is 30K initialization cycles when TDFT I and TI become equal. If the number of initialization cycles is 1M, the improvement using DFT for initialization becomes 33.3X. The gain becomes increasingly higher with a greater number of initialization cycles.

A. Contributions and Paper Organization

TI (sec) (n/ f f )

TDFT I (sec) (kL/ fs )

Improvement (TI /TDFT I )

In this paper, we will describe two different approaches using the existing DFT structure to expedite the initialization phase. First is the pattern dependent (PD) approach corresponding to the previously mentioned Case I where the IP is unique. We will describe two different architectures for this approach. Second is a pattern independent (PI) approach corresponding to either Case I or Case II. We will present an architecture that helps any design to initialize with any functionally valid state (i.e., IP). We are able to achieve the Best Case initialization time for the PD approach and an intermediate initialization time between the Best Case and the Worst Case for PI approach; where, • Best Case: TDFT I is L shift clock cycles. This is the minimum number of clock cycles required to shift in an IP into the scan chains. • Worst Case: The regular DFT compression is bypassed and TDFT I is kL shift clock cycles. In this case N bits are shifted into N scan chains directly from the tester. The process of shifting an IP into the scan chains through N pins will be repeated k = M/N times.

5x10−6 5x10−5 1.5x10−4 5x10−3 5x10−2

1.5x10−4 1.5x10−4 1.5x10−4 1.5x10−4 1.5x10−4

1/30 1/3 1 33.3 333.3

These initialization techniques represent a novel direction in functional testing. To the best of our knowledge, this is the first attempt to address the initialization of functional tests to reduce test time and cost. We provide comprehensive solutions

Table I I NITIALIZATION T IME C OMPARISON

# Initialization cycles (n) 1K 10K 30K 1M 10M

2

to initialize a design considering area overhead and power consumption. Finally, any one of our proposed techniques can be applied to any design without modifying the core logic and impacting the design flow. This paper will be organized as follows: in Section II, we will present the underlying principles of the PD approach and its two different architectures for the initialization. In Section III, we will describe the theory and architecture of PI approach. In Section IV, we will present the simulation results. We will conclude with final remarks in Section V.

         !"#

II. PATTERN D EPENDENT A PPROACH Figure 1(a) showed the functional testing where the design is initialized to the same functionally valid initialization state (IP) during the initialization phase and different computation phase starts afterwards. This approach is called pattern dependent as the IP is fixed. In the following, we will present two different architectures, namely, decoder-based and ROMbased, to generate this fixed IP.

/, $     +

+    

' ()      *+  &  

    

1, +  **  9

", + 

      * +  +    

2, $)   **

$%     &   



0 3, 4 

Figure 3.

A. Decoder-Based Architecture Figure 3 describes the proposed flow for the decoder-based architecture based on PD approach. It starts with the random selection of few bits from the IP, generally less than 5% of the total bits present in it and assumes the rest of the bits are don’t-cares. Each selection bit represents a distinct equation which results in a set of linear equations. The objective is to find a solution set rather than a single solution from that set of equations. If the number of independent variables is greater than the number of equations, we will get a solution set. The detailed process of forming the set of linear equations and finding a solution is described in [7] [10]. Then, we apply these solutions from the set one by one to the decompressor to generate the pattern. The generated pattern and the IP are compared bit-wise to find the conflicts. Here, a conflict represents a mismatch of the value of a particular bit position of the generated pattern with the IP. Finally, we select the solution from the set that gives the least conflicts. The complexity of this approach is O(p), where p is the number of solutions present in the solution set. By flipping these conflicting bits, the generated pattern and IP will match exactly and our objective will be fulfilled to generate the fully specified IP. The XOR function can satisfy our requirement of flipping the bits. Thus, we have introduced an additional XOR network in front of the internal scan chains to select the output or the inverted output of the decompressor. If there is a conflict, the other input of the XOR gate will be “1” and the decompressor output will be inverted. To generate the other input of XOR network, we have introduced a custom decoder with output V where, V = [V1 V2 . . . VL ]. Vi (0 ≤ i ≤ L) represents the decoder output at ith clock cycle. Vi can be represented as Vi = [v1 v2 . . . vM ]T where,  1, if there is a conflict vj = 0, otherwise

-, (   %*.

Decoder-based PD approach.

obtained before, is applied to test pins (test si). The output of the decompressor is connected to one of inputs of the XOR network. The other input of the XOR network comes from the decoder. This design needs one extra pin IN EN from the tester to enable the initialization phase. The number of inputs of the decoder is log2 (L) and is generated by a modulo L counter which is driven by the same shift clock. The counter will be initialized to zero when IN EN = 0. The decoder output will always be zero when IN EN = 0, i.e., except for the initialization phases in which the decompressor output can be sent to the scan chains without inversion. The entire initialization process will take L scan clock cycles.

5

6787$$8







67678

 

9:78

9L

  

Figure 4.

Decoder-based architecture.

B. ROM-Based Architecture In our proposed ROM-based architecture, the IP is stored in an on-chip ROM. During the initialization phase, the decompressor is bypassed and the ROM output is selected using a MUX network. In every shift-in clock cycle, the data output of the ROM is loaded into the scan chain. The initialization time for this approach is L clock cycles, which

Figure 4 presents the proposed decoder-based architecture for the initialization during functional testing. The compressed test data, i.e., the solution corresponding to the least conflicts

3

in frequency ( fs ). Thus, the initialization through a slow shiftin frequency results in increased initialization time. However, we cannot increase the shift-in frequency due to the switching power dissipation in the scan chain. Our objective will be fulfilled if we accept the high speed data from the tester and load the data into the scan chains with the slower shift-in clock. This conversion can be done by a serial-in parallel-out shift register. The register will accept the high speed data and load at a much slower rate. The performance improvement will be the ratio of f f to fs compared to the Worst Case and the initialization time will be ffsf · k · L.

is the Best Case initialization time. The block diagram for this proposed approach is shown in Figure 5.

Store in internal memory (ROM)

Figure 5.

Load data in every shift clock cycles into the scan chain

ROM-based PD approach.

Figure 6 presents the proposed ROM-based architecture. This design needs one extra pin, IN EN, to select the ROM output and is controlled by the tester to enable the initialization phase. The address input of the ROM will be generated by a modulo L counter, which will be driven by the same shift-in clock. The counter will be initialized to zero when IN EN=0. The structure of the ROM-based design is similar to the decoder-based design except that the XOR network of the decoder- based design is replaced with MUX network. Also, this design bypasses the decompressor as the IP is stored in the ROM. There is no test data (test si) required for the initialization to be applied by the tester.

  



5

  

Initialization Pattern (IP)





  



 

















;

;







;

 

      



:  f f / fs , then the shift-in clock must be slowed down to match the capture and load data rate. As a result, the new shift-in frequency will be, fsnew = f f /k. Again, if k < f f / fs , then the functional clock must be slowed down to match

 

Figure 6.



ROM-based architecture.

III. PATTERN I NDEPENDENT A PPROACH Figure 1(b) showed the functional testing with different initialization and computation phases, as in Case II. Each set of of the initialization patterns corresponding to different initialization phases is distinct, which directly leads to a pattern independent initialization approach. The idea of developing a structure utilizing compression to shift any initialization pattern during functional testing is more promising than the pattern dependent one. However, after studying the constraints from the decompressor, we have concluded that it is not possible to construct a fully specified (all care bits) test cube, i.e., the initialization pattern using a decompressor. If we bypass the decompressor and shift the initialization pattern through the existing DFT structure, we can initialize the design to any functionally valid state. The initialization time will then increase to Worst Case initialization time. The core idea of the pattern independent approach is based on the slow shift-in process for the structural testing. Generally, the functional frequency ( f f ) is much faster than the shift-

4

a compression ratio of 30. The decrease in prediction accuracy with the increase of compression ratio does not affect the area of the decoder as it is proportional to the scan length (inversely proportional to the compression ratio); this is illustrated in Figure 10.

the capture and load data rate which needs the functional frequency during initialization to be, f fnew = k fs . The initialization time will be,  L shift clock cycles, when k ≤ f f / fs = kL( fs / f f ) shift clock cycles, when k > f f / fs





Figure 8.









Prediction Accuracy (%)

Figure 8 shows the timing waveforms of the pattern independent architecture. The shift register captures the data at a frequency of f f . After k clock cycles, the functional clock will be disabled. During this time, the signal at IN EN will make a positive transition and the k-bit data captured by the shift register will be available to the internal scan chain and will be captured with the shift-in clock. After that, IN EN will make a negative transition and the functional clock will be enabled again for k clock cycles. The data from the tester and the functional clock must be synchronized for proper operation, as shown in this figure.

        













Compression Ratio (k)

Figure 9. Prediction accuracy vs. compression ratio analysis for decoderbased architecture.



The area overhead versus compression ratio is shown in Figure 10. It shows that the area overhead decreases linearly with the increase of compression ratio (except for k < 5). With the increase of compression ratio, the length of the scan chain becomes shorter for a specific design, resulting in a fewer number of decoder inputs (log2 (L)). With a compression ratio of 30, the area overhead due to the decoder is around 0.63% where as it is 0.91% with a compression ratio of 5.

Timing waveforms for PI architecture.

IV. S IMULATION R ESULTS

 Area Overhead (%)

The proposed initialization schemes were implemented on the ITC 99 b19 benchmark which has 6042 flip-flops and 53373 gates. All simulations were carried out in the Synopsys environment [14]. We used 90nm technology to implement our proposed architectures. A. Pattern Dependent Approach In this section, we will discuss the decoder-based architecture, as ROM-based architecture is an extension of it. In the decoder-based architecture, we have implemented a decoder to generate the IP. The algorithm tries to find the minimum size decoder by selecting the solution that leads to the minimum number of conflicts. In this experiment, we have selected LFSR-based compression architecture and simulated it using C/C++. The XOR equations corresponding to the selection bits are solved using prolog solver [15]. Each solution from the solution set obtained from the solver was applied to the compression architecture and conflicts were computed using C/C++. Figure 9 describes  # con f licts the percentage prediction accuracy 1 − Total bits in IP versus compression ratio. It shows that prediction accuracy decreases almost linearly (except for k < 5) with the increase of compression ratio. This result leads to the fact that in every clock cycle, we can inject fewer variables into the system and can specify fewer care bits. When there is no compression, i.e., k = 1, we can specify all the bits in the IP and, as a result, the prediction accuracy goes to 100%. It eventually goes down to 52.2% with

         













Compression Ratio (k)

Figure 10. Area overhead vs. compression ratio analysis for PD architecture.

B. Pattern Independent Approach The design described in Figure 7 has a k-stage serial-in parallel-out shift register for each tester pin. It is important to note that the size of the shift register exactly matches the compression ratio. It is obvious that the area overhead will not change with the compression ration as it is one shift register stage and three buffers per scan chain. However, the area overhead will vary with the length of the scan chain, as shown in Figure 11. The area overhead decreases with the increase of scan length. We have simulated the design with a varying scan length (L) of 100 to 250 as in the case of modern industrial designs. The area overhead is around 0.22% for a scan length of 250.

5

DFT are the F where TFmax and TFmax max test time without and with DFT. The improvement becomes greater with higher initialization cycles and eventually reaches 50%.

Area Overhead (%)

  

Table III F UNCTIONAL Fmax



# IN cycles (n) 1K 10K 30K 1M 10M

   













Scan Length (L)

Figure 11.

Area overhead vs. scan length analysis for PI architecture.

The comparisons between all three designs are shown in Table II. We have calculated DFT-based initialization time (TDFT I ), area/power overhead, and test data (bits for one IP) for comparison. The initialization time for both PD architectures is L shift clock cycles. In worst case, PI architecture provides an initialization time of kL( fs / f f ) cycles. The area overhead is extremely small for the PI architecture (0.22%), whereas it is close to 1% for both PD architectures. However, we must keep in mind that both PD architectures can generate only one IP. The power overhead is insignificant (

Suggest Documents