
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 32, NO. 4, APRIL 2013

Test Clock Domain Optimization to Avoid Scan Shift Failure Due to Flip-Flop Simultaneous Triggering

Yu-Chiuan Huang, Min-Hong Tsai, Wei-Sheng Ding, James Chien-Mo Li, Member, IEEE, Ming-Tung Chang, Min-Hsiu Tsai, Chih-Mou Tseng, and Hung-Chun Li, Member, IEEE

Abstract—This paper presents a design for testability technique to avoid scan shift failure due to flip–flop simultaneous triggering. The proposed technique changes the test clock domains of flip–flops in the regions where severe IR-drop problems occur. A massively parallel algorithm running on a graphics processing unit is adopted to speed up the IR-drop simulation during optimization. The experimental data on large benchmark circuits show that peak IR-drop values are reduced by 15% on average compared with the circuit after a simple MD-SCAN partition. Our proposed technique optimizes a half-million-gate design within two hours.

Index Terms—Design for testability (DfT), parallel IR-drop simulator, peak IR drop, test clock domain optimization.

I. Introduction

Excessive IR drop during scan testing may cause severe yield loss problems [1]–[3]. Because a large portion of the circuit under test (CUT) is activated simultaneously during scan, IR drop during scan can be even more serious than that during normal operation. Excessive IR drop can occur either in capture cycles or shift cycles when scan test patterns are applied [4]. There have been tremendous research efforts dedicated to the former problem [2], [5], [6]. This paper focuses on the latter problem because we observed real industrial data where many circuits failed the scan chain integrity test. This test simply shifts in an alternating sequence of zeros and ones, “1100110011...,” followed by an immediate shift out without capture. Since the scan chain integrity test is the first test pattern applied, overheating caused by high test power is ruled out. Because the scan chain integrity test does not capture any response from the logic, slowing down test speed does not cure this problem. It is therefore suspected that IR drop caused by flip–flop simultaneous triggering is the culprit. A similar problem was addressed by a recent publication [4].

To verify this point, dynamic IR-drop simulation is performed on the b19 ITC benchmark circuit. Fig. 1 shows an IR-drop map and a transient waveform of power supply current in a scan cycle simulated by a commercial tool in 0.13-μm technology. It is observed that the peak current (> 700 mA) occurs right at the clock edge, which can cause flip–flops to fail. Please note that, in this case, IR drop caused by logic switching (long tail) is relatively small compared with the peak. Extra logic delay caused by IR drop can be mitigated by slower test speed, while scan shift failure cannot be cured in the same way. Also note that, by showing this particular case, we do not attempt to claim that the peak IR drop of every circuit always occurs at the clock edge in every cycle. In this paper, we simply want to draw attention to such an IR-drop problem caused by simultaneous flip–flop triggering and propose a solution.

Fig. 1. Power supply current in a scan cycle.

Traditional low-power testing approaches may not be sufficient to avoid IR drop due to simultaneous flip–flop triggering. Overdesign of the power supply network, such as adding decoupling capacitors or widening the power supply network, is a waste of resources because most chips are tested only once. Low-power automatic test pattern generation (ATPG), which reduces logic switching activity, is good at reducing IR drop for capture cycles but may not be effective for shift cycles. Although minimum transition X-filling is widely used to reduce test power during scan shift, experiments in [7] showed that it has little impact on peak IR drop.

Manuscript received March 21, 2012; revised August 23, 2012; accepted October 5, 2012. Date of current version March 15, 2013. This paper was recommended by Associate Editor K. Chakrabarty.
Y.-C. Huang is with Mentor Graphics Corporation, Hsinchu 30078, Taiwan (e-mail: [email protected]).
M.-H. Tsai is with MediaTek Company, Ltd., Hsinchu 30078, Taiwan (e-mail: [email protected]).
W.-S. Ding and J. C.-M. Li are with National Taiwan University, Taipei 10617, Taiwan (e-mail: [email protected]; [email protected]).
M.-T. Chang, M.-H. Tsai, C.-M. Tseng, and H.-C. Li are with Global Unichip Corporation, Hsinchu 30078, Taiwan (e-mail: karl.chang@guc-asic.com; [email protected]; [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2012.2228741

0278-0070/$31.00 © 2013 IEEE

HUANG et al.: TEST CLOCK DOMAIN OPTIMIZATION TO AVOID SCAN SHIFT FAILURE

Test clock domain optimization (TCDO) has been shown to be effective for reducing peak IR drop during scan shifting [7]. TCDO changes the clock domains of flip–flops so that the maximum number of flip–flops triggered simultaneously in a region is reduced. Because iterative IR-drop simulation is very time consuming during optimization, flip–flop density (FFD) was proposed as a substitute metric [7].

This paper presents a new TCDO algorithm that changes the clock domains of flip–flops where the worst IR drop occurs due to simultaneous triggering of flip–flops. The new algorithm improves the effectiveness and also reduces the routing overhead of [7]. It solves linear equation systems to identify IR-drop hotspots induced by simultaneous test clocking during scan. Large systems of linear equations can be solved quickly by the preconditioned conjugate gradient (PCG) method implemented on a graphics processing unit (GPU). A Jacobi preconditioning matrix is used as a robust preconditioner to speed up convergence. The contributions of this paper over [7] are as follows.

1) This technique is based on IR-drop simulation results, instead of FFD [7].
2) Locations of flip–flops, power supply, resistance, and capacitance are extracted from the physical layout.
3) GPU-accelerated preconditioned conjugate gradient is applied to speed up the computation.

The structure of this paper is as follows. Section II gives a brief review of the past research in low-power DfT and IR-drop analysis. Section III describes our proposed technique. Section IV shows the experimental data on large benchmark circuits. Section V concludes this paper.

II. Preliminaries

A. Past Research

Most research in low-power DfT has focused on reducing weighted switching activities [3] or flip–flop toggle count [8]. Reordering the sequence of the scan cells to reduce test power is proposed in [9] and [10]. Inserting gates into scan chains can minimize the toggling when scan chains shift [11]. The toggle suppression technique separates the data outputs and scan outputs of scan cells so that the toggle activity in logic is reduced [12]. None of the above techniques considered the peak IR drop at clock edges.

Scan chains can be partitioned into several segments, each of which is enabled one at a time [4], [13]. Although scan segmentation reduces peak IR drop during scan shifting, additional adaptor logic is needed. Staggered clocking, gating test clocks [14], or skewed test clocks (called MD-SCAN) [15] reduce the number of flip–flops triggered simultaneously with fine-grained clock staggering. The above three DfT techniques did not consider physical information such as scan flip–flop location and parasitic RC of the power grid. If the segmentation is not done carefully, peak IR drop can still occur in a local IR-drop hotspot (see Section IV).

There has been intensive research on low-power ATPG in the past decade [5]. X-filling of test cubes successfully reduces the peak power during scan test [16]–[19]. Some hardware solutions have been proposed to divide scan chains for both shift and capture power reduction [12]. Recently, low IR-drop ATPG has been gaining more and more attention. A regional model for ATPG is presented in [20], and a critical path-aware X-filling is shown in [21]. The simultaneous capture and shift power reduction ATPG in [8] reduces both the capture power and the shift power. A layout-aware ATPG that utilizes local switching activity for delay testing is presented in [22]. Although ATPG patterns effectively reduce IR drop during capture, simulation results in [7] show that they are not very effective at reducing peak IR drop during scan shifting.

Dynamic IR-drop simulation requires intensive computation to solve systems of linear equations. Many techniques have been proposed to accelerate this process, including multigrid [23], [24], preconditioned Krylov subspace [25], and so on. Recently, massively parallel algorithms have been developed on GPU [24], [26]. Parallel GPU software packages for linear systems, such as CUBLAS for dense matrices [27] and CUSPARSE for sparse matrices [27], have also been developed for public use.

B. TCDO

A test clock domain optimization algorithm is proposed in [7] to reduce the maximum IR drop during scan. First, scan flip–flops are evenly separated into k partitions, each of which belongs to an individual test clock domain with a slight phase shift. This effectively reduces the number of flip–flops triggered simultaneously by a factor of k. Second, an optimization algorithm is applied to adjust test clock domains so that the maximum FFD belonging to the same test clock domain is reduced. Because the placement of flip–flops remains mostly unchanged, this has little impact on functional timing. There is little change in ATPG patterns and fault coverage (except that test compression has to be redone).

Unfortunately, the TCDO technique presented in [7] still has drawbacks and needs to be improved. First, it uses FFD to save computation time. Experimental data showed some mismatches between FFD and peak IR-drop regions. We found IR-drop hotspots where FFD is low and some IR-drop coldspots where FFD is high. The former reduces the optimization effectiveness while the latter increases routing overhead.

III. Proposed Technique

A. Overall Flow

Fig. 2. Overall flow.

Fig. 2 shows the overall flowchart of the proposed flow, where our test clock domain optimization tool is inserted after power planning and before clock tree synthesis (CTS). The initial test clock domain partition is done by a commercial DfT tool. After that, ATPG test patterns are generated using the initial scan chain partition. The circuit is then floorplanned and all cells are placed according to their functional timing constraints. Power planning is then performed, followed by an RC extraction. Our TCDO technique optimizes the test clock domains to reduce the peak IR drop during scan shifting. After that, clock trees are synthesized for the new design and then wires are routed. To minimize the routing wire length, scan flip–flops are re-stitched by the physical design tool. Finally, the original ATPG test patterns are modified by simply shuffling scan data according to the re-stitched order. Please note that if we use test compression, then test patterns may have to be regenerated.

B. TCDO Algorithm

Fig. 3. TCDO flow.

Fig. 3 shows the flow of our proposed TCDO. There are four input files to our flow: netlist (Verilog), test patterns (standard test interface language), cell placement information (design exchange format, DEF), and power grid RC model (detailed standard parasitic format, DSPF). The outputs are the modified netlist and test patterns. The details of each step are described as follows.

1) In the initialization step, physical locations of flip–flops are extracted from the DEF file and the original test clock domain information is parsed from the Verilog netlist. The whole circuit is covered by a 2-D array of windows indexed (i, j). A window is a square of size w × w, where w is a user-defined parameter (window size). During optimization, windows are swept across the whole circuit in step size s, which is smaller than the window size w so that windows overlap. In 0.13-μm technology, we choose step size s = 15 μm, which is about two times the row height in the standard cell design. We set the window size w to 5% of the die width. The window size does not affect the accuracy of IR-drop simulation. However, the selection of window size is a tradeoff between runtime and routing overhead. A larger window size makes test clock domain optimization faster at the cost of larger routing overhead. This is because we have more flip–flops to adjust in a larger window, but the routing can be longer. In this paper, we made the decision according to our past experimental data in [7].

2) After extracting the RC model of the power network, an IR-drop simulation is performed (for details see Section III-D). Flip–flop IR drop, FIR(d, f), is the peak IR drop (and ground bounce) at the location of flip–flop f in test clock domain d. For window (i, j), window IR drop (WIR) is the worst FIR among all flip–flops in the window

$$\mathrm{WIR}(i,j) = \max_{\forall d,\ \forall f \in (i,j)} \mathrm{FIR}(d,f). \tag{1}$$

Global peak IR drop (GIR) is defined as the maximum IR drop of all windows

$$\mathrm{GIR} = \max_{\forall (i,j)} \mathrm{WIR}(i,j). \tag{2}$$

3) and 4) Algorithm 1 shows the pseudocode of test clock domain adjustment (TCDA). All windows are sorted according to their WIR. If a window has a larger WIR than a predefined threshold IR drop (TIR), this window is called a violating window. The test clock domains of all violating windows will be adjusted. The threshold coefficient a, between 0 and 1, determines how many windows should be optimized. The larger a is, the fewer violating windows are selected. a is initially set to 0.95 in our implementation

$$\mathrm{TIR} = a \cdot \mathrm{GIR}. \tag{3}$$

Violating windows are processed one by one in decreasing order of their WIR. For a violating window (i, j), we change a number of flip–flops from test clock domain D_from to D_to. Typically, D_from is simply the test clock domain where peak IR drop occurs in the current violating window. Let N_F2C be the expected number of flip–flops to be changed from D_from. Equation (4) means that the closer WIR is to the threshold TIR, the fewer flip–flops would be changed

$$N_{F2C} = [\text{total number of FF in } D_{from}] \times \frac{\mathrm{WIR}(i,j) - \mathrm{TIR}}{\mathrm{GIR} - \mathrm{TIR}}. \tag{4}$$

Let D denote the set of destination clock domains to which flip–flops could be changed. Every clock domain d in D must satisfy two criteria: 1) d should be a neighbor clock domain, to avoid long clock routing, and 2) the IR drop of d is sufficiently smaller than that of D_from. Equation (5) describes the second criterion, where the threshold coefficient b is 0.8 in our implementation. This threshold guarantees that the FIR in the destination clock domain is less than 80% of the FIR in the original clock domain. Coefficient b helps select a safe destination clock domain that has much lower IR drop than the original clock domain. Our observation is that clock domains have very different IR drops, so we can usually find a neighboring clock domain to change to. Because of this guard-band parameter b, we are not likely to create a new IR-drop violation after changing. Even if we select an inappropriate clock domain in this iteration, we have a chance to correct the mistake in the next iteration

$$\mathrm{FIR}(d,f) < b \cdot \mathrm{FIR}(D_{from}, f). \tag{5}$$
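The quantities defined in (1)–(5) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `FlipFlop` record, the function names, the dictionary-of-windows layout, and the choice to floor N_F2C to an integer are all assumptions made here, and every window is assumed to contain at least one flip–flop.

```python
import math
from dataclasses import dataclass

@dataclass
class FlipFlop:          # hypothetical record, not from the paper
    name: str
    domain: str          # test clock domain the flip-flop currently belongs to
    fir: dict            # FIR(d, f) in mV, keyed by test clock domain d

def wir(window):
    """Eq. (1): worst FIR over all domains and all flip-flops in one window."""
    return max(max(ff.fir.values()) for ff in window)

def gir(windows):
    """Eq. (2): worst WIR over all windows."""
    return max(wir(w) for w in windows.values())

def violating_windows(windows, a=0.95):
    """Eq. (3): return TIR = a * GIR and the windows above it, worst first."""
    tir = a * gir(windows)
    bad = [(wir(w), key) for key, w in windows.items() if wir(w) > tir]
    return tir, [key for _, key in sorted(bad, reverse=True)]

def nf2c(n_ff_in_dfrom, wir_ij, gir_val, tir_val):
    """Eq. (4); flooring to an integer is an assumption of this sketch."""
    fraction = (wir_ij - tir_val) / (gir_val - tir_val)
    return math.floor(n_ff_in_dfrom * fraction)

def candidate_domains(ff, d_from, neighbors, b=0.8):
    """Eq. (5): neighbor domains whose FIR is below b * FIR(D_from, f)."""
    return [d for d in neighbors
            if d != d_from and ff.fir[d] < b * ff.fir[d_from]]
```

Which domains count as neighbors is left to the caller here, since the text does not spell out how neighborhood is determined.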

We randomly choose a test clock domain c from D as our destination D_to, according to the probability in (6). If c has a larger FIR, which means more serious IR drop, then the probability of choosing c is smaller

$$\mathrm{Prob}(c,f) = 1 - \frac{\mathrm{FIR}(c,f)}{\sum_{d \in D} \mathrm{FIR}(d,f)}. \tag{6}$$
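A minimal sketch of the random destination choice in (6) follows. Two points are assumptions of this sketch rather than statements from the paper: the values given by (6) sum to |D| − 1 rather than 1, so they are treated as relative weights and normalized by the sampler; and when D holds a single domain, (6) evaluates to zero, so that domain is simply returned. The function names are hypothetical.

```python
import random

def destination_weights(fir_by_domain):
    """Eq. (6) evaluated for every candidate c in D (values in mV)."""
    total = sum(fir_by_domain.values())
    if total == 0:
        # Degenerate case (assumption): no IR drop anywhere, weigh all equally.
        return {c: 1.0 for c in fir_by_domain}
    return {c: 1.0 - v / total for c, v in fir_by_domain.items()}

def choose_destination(fir_by_domain, rng=random):
    """Pick D_to from D; returns None when D is empty."""
    if not fir_by_domain:
        return None
    if len(fir_by_domain) == 1:
        # Eq. (6) gives 0 for a lone candidate; taking it anyway is an assumption.
        return next(iter(fir_by_domain))
    weights = destination_weights(fir_by_domain)
    domains = list(weights)
    # random.choices normalizes the relative weights internally.
    return rng.choices(domains, weights=[weights[d] for d in domains], k=1)[0]
```

For example, with FIR values of 300 mV for CK2 and 100 mV for CK3, the weights are 0.25 and 0.75, so CK3 is chosen three times as often as CK2.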

After a flip–flop is changed from D_from to D_to, the window is locked to prevent future change on the same flip–flop in the same iteration. The N_F2C counter is decremented by one.

At the end of TCDA, the size (number of flip–flops) of each test clock domain has changed, which may increase the test application time. It is therefore necessary to restore the number of flip–flops in each test clock domain. Test clock domain (TCD) size restoration changes some flip–flops from clock domains that gained flip–flops (D_gained) to clock domains that lost flip–flops (D_lost) during the last TCDA. TCD size restoration is implemented by a simple greedy algorithm. The windows are picked in increasing order of their WIR, beginning from the lowest-WIR window. If a window has flip–flops of both domains, then a flip–flop of D_gained is changed to D_lost. This process is repeated until all test clock domains are restored to their original sizes. The rationale of this simple heuristic is that we observed many windows of very low WIR (e.g., windows very close to power pads) in our experiments. These windows are all good candidates for TCD size restoration. Even if the WIR of a window becomes too large after TCD size restoration, which is very rare in our experiments, we can still fix it in the next iteration.

Steps 2–4 are iteratively executed. If no improvement is achieved in an iteration, a is decremented by 0.05 in the next iteration. Finally, TCDO ends if a predefined number of iterations has been reached or the IR-drop improvement is good enough.
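The greedy TCD size restoration described above might look like the following sketch. The data layout is hypothetical: each window is a `(WIR, flip-flops)` pair, each flip–flop is a small dictionary, and `surplus` maps a domain to how many flip–flops it gained (positive) or lost (negative) in the last TCDA. Moving at most one flip–flop per window per pass is an assumption; the text does not say how many are moved at a time.

```python
def restore_domain_sizes(windows, surplus):
    """Greedy restoration: visit windows from lowest WIR upward and move
    flip-flops from domains that gained members to domains that lost them.

    windows : list of (wir_mV, [ {"name": str, "domain": str}, ... ])
    surplus : dict domain -> +n (gained n flip-flops) or -n (lost n)
    Returns the list of moves as (name, from_domain, to_domain).
    """
    remaining = dict(surplus)      # work on a copy, leave the input untouched
    moved = []
    progress = True
    while progress and any(n > 0 for n in remaining.values()):
        progress = False
        for _wir, ffs in sorted(windows, key=lambda w: w[0]):
            present = sorted({ff["domain"] for ff in ffs})
            # A domain that lost flip-flops and already appears in this window.
            d_lost = next((d for d in present if remaining.get(d, 0) < 0), None)
            # A flip-flop in this window that belongs to a domain that gained.
            ff = next((f for f in ffs if remaining.get(f["domain"], 0) > 0), None)
            if d_lost is None or ff is None:
                continue
            d_gained = ff["domain"]
            ff["domain"] = d_lost
            remaining[d_gained] -= 1
            remaining[d_lost] += 1
            moved.append((ff["name"], d_gained, d_lost))
            progress = True
            if not any(n > 0 for n in remaining.values()):
                break
    return moved
```

The outer loop stops either when every domain is back to its original size or when a full pass over the windows makes no move, so the sketch cannot loop forever on an input that cannot be balanced.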


Algorithm 1 Test clock domain adjustment (TCDA)

1.  sort windows according to WIR
2.  foreach unlocked violating window (i, j)
3.      D_from = test clock domain of most severe IR drop
4.      N_F2C = [number of flip-flops in D_from in window (i, j)] × fraction
5.      foreach flip-flop f in D_from
6.          if N_F2C = 0 break
7.          if FIR(D_from, f) ≥ TIR
8.              D = {d | d is neighbor clock domain and FIR(d, f) < b · FIR(D_from, f)}
9.              Choose a test clock domain c from D, according to Prob(c, f)
10.             if c != φ
11.                 D_to = c
12.                 Change test clock of f from D_from to D_to
13.                 lock windows that contain f
14.                 N_F2C = N_F2C − 1
15. test clock domain size restoration
16. unlock all windows
17. return

C. TCDA Example

Fig. 4 gives an illustrative example of TCDA. Suppose that the circuit contains three test clock domains: CK1, CK2, and CK3. Fig. 4 shows many windows in the circuit, among which window (i, j) has the worst IR drop (734 mV). According to our definition, GIR is 734 mV and TIR is 697 mV (= 734 mV × 0.95). The center of Fig. 4 zooms in on window (i, j) to show the FIR of every flip–flop. Suppose a particular flip–flop f has the worst FIR. The left of Fig. 4 shows the FIR of f for each test clock domain, and its peak IR drop (734 mV) occurs at CK1. We see a large difference in IR drop among the three clock domains, so we can change flip–flop f to one of the other clock domains.

Fig. 4. TCDA example (initial).

Fig. 5 shows how we change flip–flop f for the same example as Fig. 4. Since both CK2 and CK3 are in the neighborhood of CK1 and their FIRs are sufficiently low, i.e., they satisfy (5), they are both candidates for D_to. According to (6), CK3 has a higher probability of being chosen because its IR drop is smaller than that of CK2. In this example, we change flip–flop f from CK1 to CK3. The left and right figures show the window before and after TCDA, respectively. Window (i, j) is locked and the new FIR will be updated in the next TCDA iteration.

Fig. 5. TCDA example (change f from CK1 to CK3).

Fig. 6 continues the same example to show the test clock domain size restoration. Suppose window (i′, j′) has the lowest WIR in this circuit because it is very close to a VDD pad. There are two clock domains, CK1 and CK3, in this window. Since we have changed one flip–flop from CK1 to CK3, we now need to change a flip–flop from CK3 back to CK1. Since the WIR of this window is very low, it usually does not matter which flip–flop we pick.
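As a quick check on the numbers in this example, the threshold in (3) and the fraction in (4) for the worst window can be worked out directly. This is only arithmetic on the values quoted above (a = 0.95, GIR = 734 mV) and assumes that (3) and (4) are applied exactly as written.

```latex
\begin{align*}
\mathrm{TIR} &= a \cdot \mathrm{GIR} = 0.95 \times 734\,\mathrm{mV}
              = 697.3\,\mathrm{mV} \approx 697\,\mathrm{mV},\\
\frac{\mathrm{WIR}(i,j) - \mathrm{TIR}}{\mathrm{GIR} - \mathrm{TIR}}
             &= \frac{734 - 697.3}{734 - 697.3} = 1
              \quad \text{for the worst window, where } \mathrm{WIR}(i,j) = \mathrm{GIR}.
\end{align*}
```

The fraction is therefore 1 for the worst window and approaches 0 for a window whose WIR is only slightly above TIR.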


Fig. 6. TCDA example (TCD size restoration).

D. Parallel IR-Drop Simulation

This technique uses the nodal analysis method to convert the power grid RC model to a linear equation system

$$G V(t) = J(t) \tag{7}$$

where G is the conductance matrix, V(t) is a column vector of nodal voltages, and J(t) is a column vector of currents flowing into the nodes. In dynamic IR-drop simulation, both V(t) and J(t) are time-variant. We use the backward Euler integration approximation with time step Δt. By the Norton theorem, illustrated in Fig. 7, a capacitance C is converted into an equivalent conductance (C/Δt) in parallel with a current source i + Cv/Δt, where i is the current through the capacitance and v is the voltage across the capacitance in the last iteration. Similarly, a voltage source V in series with a resistance R (set to 10^−4) is converted into a current source V/R in parallel with a conductance R^−1. After this conversion, (7) becomes a symmetric positive definite system [25].

Fig. 7. Norton theorem for (a) capacitance and (b) voltage source.

The dynamic IR-drop simulator is a two-level nested loop. The outer loop iterates over all time steps and the inner loop iterates the conjugate gradient method for a single time step. Each flip–flop is modeled as a current sink with a triangular waveform. The width and height of the triangle are estimated based on the cell power model provided in the Synopsys .lib file. For the 0.13-μm technology in our experiment, the width is 120 ps and the height is 118 μA for a scan flip–flop. Because our tool only simulates the IR drop at the clock edge, we assume all flip–flops are triggered simultaneously at time zero. The details of this estimation are shown in the Appendix.

To improve the efficiency of the conjugate gradient method, we use a Jacobi preconditioner, M, which is simply a diagonal matrix equal to the diagonal elements of G. With preconditioning, (7) now becomes

$$M^{-1} G V(t) = M^{-1} J(t). \tag{8}$$

Algorithm 2 Preconditioned conjugate gradient

1.  r_0 = J − G v_0
2.  for i = 1, 2, ..., MAX_ITERATION
3.      z_{i−1} = M^{−1} r_{i−1}    // solve M z_{i−1} = r_{i−1}
4.      if i = 1
5.          p_1 = z_0
6.      else
7.          β_{i−1} = r_{i−1}^T z_{i−1} / r_{i−2}^T z_{i−2}
8.          p_i = z_{i−1} + β_{i−1} p_{i−1}
9.      α_i = r_{i−1}^T z_{i−1} / p_i^T G p_i
10.     v_i = v_{i−1} + α_i p_i
11.     r_i = r_{i−1} − α_i G p_i
12.     if (r_i^T r_i < tolerance) return
13.     else next i
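For readers who want to experiment with Algorithm 2, the following is a small CPU-only sketch in plain Python. It is not the authors' GPU code and does not use CUBLAS or CUSPARSE; the function names and the argument order are assumptions made for illustration. The matrix is passed in compressed sparse row form (`indptr`, `indices`, `data`), and the stopping test compares r^T r with the tolerance, as in line 12 of Algorithm 2.

```python
def csr_matvec(indptr, indices, data, x):
    """y = G x for a matrix stored in compressed sparse row (CSR) form."""
    return [sum(data[k] * x[indices[k]] for k in range(indptr[i], indptr[i + 1]))
            for i in range(len(indptr) - 1)]

def csr_diagonal(indptr, indices, data):
    """Diagonal of G; this is the Jacobi preconditioner M."""
    diag = [0.0] * (len(indptr) - 1)
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            if indices[k] == i:
                diag[i] = data[k]
    return diag

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pcg_jacobi(indptr, indices, data, J, v0, tol=1e-4, max_iter=1000):
    """Solve G v = J for symmetric positive definite G (Algorithm 2).
    Returns (v, iterations used)."""
    m_inv = [1.0 / d for d in csr_diagonal(indptr, indices, data)]
    v = list(v0)
    r = [j - g for j, g in zip(J, csr_matvec(indptr, indices, data, v))]
    if dot(r, r) < tol:            # initial guess already good enough
        return v, 0
    z = [mi * ri for mi, ri in zip(m_inv, r)]   # z = M^-1 r (elementwise)
    p = list(z)
    rz = dot(r, z)
    for i in range(1, max_iter + 1):
        Gp = csr_matvec(indptr, indices, data, p)
        alpha = rz / dot(p, Gp)
        v = [vi + alpha * pi for vi, pi in zip(v, p)]
        r = [ri - alpha * gi for ri, gi in zip(r, Gp)]
        if dot(r, r) < tol:
            return v, i
        z = [mi * ri for mi, ri in zip(m_inv, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return v, max_iter
```

Because M is diagonal, applying M^−1 is an elementwise product, which is why the Jacobi preconditioner maps so easily onto parallel hardware.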

The pseudocode of preconditioned conjugate gradient is shown in Algorithm 2. The initial solution v_0 is set to the ideal voltage (VDD or GND) for each node. r_i is the residual vector in the ith iteration. In each iteration, M z_{i−1} = r_{i−1} can be solved by a simple matrix–vector multiplication because M^−1 is a diagonal matrix. p_i is a G-orthogonal basis vector generated in the ith iteration. α_i and β_i are scalar values that represent the lengths of vectors. In each iteration, a new p_i is computed from z_{i−1} and β_{i−1} p_{i−1} in line 8. A new solution v_i is updated by v_{i−1} + α_i p_i in line 10. The iteration ends if the norm of the residual is smaller than a predefined tolerance (e.g., 10^−4 in our experiment). More details about PCG can be found in [28].

The preconditioned conjugate gradient algorithm has been implemented on an NVIDIA GPU, using the CUBLAS and CUSPARSE open software packages. The former calculates the scalar–vector and vector–vector multiplications while the latter calculates the matrix–vector multiplication. The vectors (such as J, V, r, z, and p) are stored in a dense format. The matrix (G) is stored in a compressed sparse row format [28]. CUBLAS and CUSPARSE are widely used in many applications and have been proven stable.

IV. Experimental Results

To demonstrate the effectiveness of our proposed technique, experiments are performed on ISCAS’89, ITC’99, and IWLS’05 benchmark circuits. These benchmark circuits are mapped to TSMC 0.13-μm standard cell technology (VDD = 1.2 V) and then placed by the commercial tool SOC Encounter. The power grid RC models (including wire conductance and capacitance) are extracted by the commercial tool QRC. To improve the condition of the matrix, we ignore resistors smaller than 10^−5. The circuits are placed, routed, and then simulated by commercial tools. Each circuit has eight pairs of VDD/GND pads. Power stripes and grids are added to each circuit when needed. A zero-one alternating test pattern “0101...” is shifted in scan mode to create 100% switching activity. This is the worst-case IR-drop scenario that can occur during scan.


TABLE I
Experimental Results of Benchmark Circuits

                                            Peak IR drop (mV)
Circuit    Number     Number    1 TCK   MD-SCAN   Reference [7]   This Paper
           of Gates   of FF             4 TCK     4 TCK           4 TCK
s15850     10 840     534       344     112       121 (+8%)       112 (0%)
s38417     23 815     1636      1075    433       464 (+7%)       433 (0%)
s38584     20 679     1426      853     357       439 (+23%)      357 (0%)
b18        115 741    3320      748     744       937 (+26%)      683 (−8%)
b19        233 578    6642      764     760       691 (−9%)       426 (−44%)
leon3mp    545 836    108 839   1444    1011      748 (−26%)      646 (−36%)
Average    –          –         –       0%        +4%             −15%

TABLE II
Compare IR Drop of s38417 and b19

                     IR drop in s38417 (mV)            IR drop in b19 (mV)
Circuit              TCK 1   TCK 2   TCK 3   TCK 4     TCK 1   TCK 2   TCK 3   TCK 4
TCK org              456     460     457     455       564     750     224     218
First iteration      456     460     457     455       641     645     203     218
Fifth iteration      456     460     457     455       524     482     419     217
Tenth iteration      456     460     457     455       389     411     403     276

TABLE III
Runtime and Error

           Before TCDO                      After TCDO
           Commercial   This Paper          Commercial              This Paper
Circuit    IR Drop      IR Drop (mV)        IR Drop    Time (s)     IR Drop (mV)     Time (s)
           (mV)         (Error %)           (mV)                    (Error %)
s15850     112          102 (−9%)           112        N/A          102 (−9%)        45
s38417     433          462 (+7%)           433        N/A          462 (+7%)        130
s38584     357          417 (+17%)          357        N/A          417 (+17%)       89
b18        744          252 (−66%)          683        5039         164 (−76%)       920
b19        760          750 (−1%)           426        9877         411 (−4%)        1030
leon3mp    1011         970 (−4%)           646        30 568       595 (−8%)        6272
Average    0%           18%                 0%         –            20%              –

Table I shows the peak IR drop (commercial simulator results) of four techniques. The “1 TCK” column shows the peak IR drop when all flip–flops are connected to one single test clock domain. The “MD-SCAN 4 TCK” column shows the peak IR drop when flip–flops are arbitrarily partitioned into four test clock domains. This column corresponds to the peak IR drop when a simple physical-unaware MD-SCAN is applied. The “Reference [7] 4 TCK” column shows the peak IR drop when flip–flops are optimized by the method proposed in [7]. The last column shows the peak IR drop when flip–flops are optimized by the method proposed in this paper. The default TCDO stop criterion is 10 iterations.

For small circuits, such as s15850, a simple MD-SCAN alone is very good. This is because the physical sizes and the numbers of flip–flops are small, so IR-drop hotspots can be easily eliminated by simply partitioning flip–flops into four domains. Unlike [7], which performed unnecessary optimization, our new TCDO chooses not to optimize those small circuits. This demonstrates that our new technique makes smarter decisions than [7]. But for larger circuits, such as b19, a simple MD-SCAN alone is still not sufficient to reduce peak IR drop. Because a simple MD-SCAN ignores the physical information, serious IR drop still occurs in hotspots where many flip–flops are clustered. On average, this paper reduces the peak IR drop by 15% compared with the results of MD-SCAN 4 TCK.

Table II compares the IR-drop improvement progress of the four clock domains. It is observed that, for small circuits like s38417, the peak IR drops of the four test clock domains are almost the same, so there is no more room for TCDO. On the contrary, the peak IR drops of the four test clock domains of large circuits, such as b19 and leon3mp, are initially very different, so the improvement is significant after TCDO.

Table III compares the runtime and peak IR drop simulated by a commercial tool and our IR-drop simulator. On average, the error between our simulator and the commercial tool is no more than 20%. Our tool spent only 2 h optimizing leon3mp, while the commercial tool took over 8 h to simulate its IR drop. Please note that this comparison of accuracy and runtime is not fair since the commercial tool simulates both flip–flops and logic gates, while our tool simulates only flip–flops. However, for our purpose of identifying IR hotspots due to simultaneous flip–flop triggering, the contribution of logic gates can often be ignored.

Table IV shows the scan chain wire length. It is observed that our routing overhead is much smaller than that of [7]. Please note that the utilization rates of most benchmark circuits are 80% except leon3mp, which could not be routed at a utilization rate higher than 60% within a reasonable run time. In addition, the wire length of leon3mp actually decreases after optimization. This is probably due to the randomized routing algorithm.

TABLE IV
Scan Chain Wire Length

           Scan Wire Length (μm)
Circuit    1 TCK        Reference [7] 4 TCK    This Paper 4 TCK
s15850     10 923       12 870 (+18%)          10 923 (+0%)
s38417     23 987       27 932 (+16%)          23 987 (+0%)
s38584     26 390       31 994 (+21%)          26 390 (+0%)
b18        49 793       59 688 (+20%)          58 451 (+17%)
b19        95 132       119 475 (+26%)         106 303 (+12%)
leon3mp    1 982 333    2 274 445 (+15%)       1 958 203 (−1%)
Average    0%           +19%                   +5%
∗ Utilization rate = 80%, except leon3mp = 60%.

Although our tool only simulates the flip–flops, we plot the IR-drop maps of our tool and the commercial tool for comparison. Fig. 8 compares the IR-drop maps of leon3mp produced by our simulator and by the commercial tool. The four white circles indicate IR-drop hotspots of leon3mp. Fig. 9 (WIR map by our tool) and Fig. 10 (IR-drop map by commercial tool) compare IR-drop maps during the optimization progress. Before optimization, we can see serious IR-drop hotspots even though flip–flops are already divided into four MD-SCAN test clock domains. After 10 and 21 iterations, peak IR drop is significantly reduced. The reason why hotspots appear larger in our WIR maps than in the commercial IR-drop maps is that the former show peak IR drop of all time while the latter only show peak IR drop when flip–flops are triggered.

Fig. 8. IR-drop hotspots of leon3mp. (a) Our IR-drop simulator. (b) Commercial tool.

Fig. 9. WIR map (our tool). (a) Original. (b) 10 iterations. (c) 21 iterations.

Fig. 10. IR-drop map (commercial). (a) Original. (b) 10 iterations. (c) 21 iterations.

After conducting experiments on b18, we found that the large error is caused by clock buffers, which are added by CTS. In b18, three clock buffers, which drive unusually large numbers of flip–flops (> 200 per buffer), are clustered together. These three clock buffers cause serious IR drop even though there is no flip–flop nearby. Because our TCDO is inserted before CTS, this problem was not properly solved. There are three possible ways to improve the accuracy of our simulator. First, we can refine the test clock domains after CTS, so that clock buffers can be considered. Second, we can add a constraint on the clock buffer load so that a single clock buffer cannot consume too much current. Third, we can add levels of clock buffers to reduce the clock buffer load.

Fig. 11. IR-drop waveforms of s38417, 90 nm (dotted line: our tool, solid line: commercial tool).

V. Conclusion

This paper presented a new TCDO technique to avoid scan shift failure caused by simultaneous flip–flop triggering. This technique implemented a parallel preconditioned conjugate gradient technique for fast IR-drop simulation. In each iteration, flip–flops that cause the worst IR drop were changed to test clock domains where IR drop was less serious. Because IR-drop information was updated in each iteration, peak IR drop was effectively reduced without much wiring overhead. The experimental data on large benchmark circuits showed that peak IR drop was reduced by 15% on average compared with circuits after a simple MD-SCAN.

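As a rough numerical check of the triangular current model derived in the Appendix, the sketch below evaluates the pulse height from the parameter values quoted there. The function name `triangle_height` is hypothetical. With the rounded inputs the Appendix quotes, the formula gives about 117 μA and 98 μA, close to but not identical to the 118 μA and 103 μA stated in the Appendix; the gap presumably reflects rounding of p and C in the text.

```python
def triangle_height(p_watts, c_farads, vdd_volts, w_seconds):
    """Peak current h (in A) of a triangular pulse of width w that carries
    the transition energy E = p*w + 0.5*C*VDD^2."""
    energy = p_watts * w_seconds + 0.5 * c_farads * vdd_volts ** 2
    # The charge under the triangle is 0.5*w*h, so E = 0.5*w*h*VDD.
    return 2.0 * energy / (w_seconds * vdd_volts)


# 0.13-um values from the Appendix: p = 0.01 mW, C = 0.01 pF, VDD = 1.2 V, w = 120 ps.
h_130 = triangle_height(0.01e-3, 0.01e-12, 1.2, 120e-12)
# 90-nm values from the Appendix: p = 9 uW, C = 8 fF, VDD = 1.0 V, w = 100 ps.
h_90 = triangle_height(9e-6, 8e-15, 1.0, 100e-12)
print(f"{h_130 * 1e6:.1f} uA, {h_90 * 1e6:.1f} uA")
# → 116.7 uA, 98.0 uA
```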

Appendix

The total energy of a gate transition can be divided into internal energy and switching energy [29]. The former is consumed by the short-circuit current, while the latter is consumed by charging or discharging the load capacitance. The internal power (p) can be looked up in the power model (e.g., a Synopsys .lib file). Given the duration of the transition (w), the internal energy is pw. The switching energy to charge or discharge a load capacitance (C) is $0.5\,C V_{DD}^2$. Therefore, the total energy is

$$E = pw + \frac{1}{2} C V_{DD}^2.$$

The height h of the triangular current waveform follows from

$$E = \frac{1}{2} w h V_{DD} = pw + \frac{1}{2} C V_{DD}^2$$

$$h = \frac{2p}{V_{DD}} + C \frac{V_{DD}}{w}.$$

For 0.13-μm technology (VDD = 1.2 V), p is about 0.01 mW according to the power model in the .lib file. We analyzed the netlist and determined that the average C is 0.01 pF (about three gate input loads). Based on this model, we set the width to 120 ps and the height to 118 μA for a scan flip–flop. For 90-nm technology (VDD = 1.0 V), p is 9 μW and C is 8 fF, and we set the width w to 100 ps and the height h to 103 μA.

To validate our current model, we performed a simulation on the ISCAS’89 s38417 circuit, mapped to 90-nm technology (VDD = 1.0 V). Fig. 11 shows the IR-drop waveforms of our tool (dotted line) and a commercial tool (solid line). The former considers only flip–flop current consumption, while the latter considers both logic and flip–flops. The peak IR drop reported by the commercial tool is 68 mV, while that of our simulator is 58 mV (−14%).

References

[1] J. Saxena, K. M. Butler, V. B. Jayaram, S. Kundu, N. V. Arvind, P. Sreeprakash, and M. Hachinger, “A case study of IR-drop in structured at-speed testing,” in Proc. IEEE Int. Test Conf., vol. 1. Sep. 2003, pp. 1098–1104.
[2] M. Tehranipoor and K. M. Butler, “Power supply noise: A survey on effects and research,” IEEE Design Test Comput., vol. 27, no. 2, pp. 51–67, Mar. 2010.
[3] C. Tirumurti, S. Kundu, S. Sur-Kolay, and Y.-S. Chang, “A modeling approach for addressing power supply switching noise related failures of integrated circuits,” in Proc. Design, Autom. Test Eur., vol. 2. Feb. 2004, pp. 1078–1083.
[4] Y. Yamato, X. Wen, M. A. Kochte, K. Miyase, S. Kajihara, and L.-T. Wang, “A novel scan segmentation design method for avoiding shift timing failure in scan testing,” in Proc. IEEE Int. Test Conf., Sep. 2011, pp. 1–8.
[5] P. Girard, “Survey of low-power testing of VLSI circuits,” IEEE Design Test Comput., vol. 19, no. 3, pp. 80–90, May 2002.
[6] P. Girard, N. Nicolici, and X. Wen, Power-Aware Testing and Test Strategies for Low-Power Devices. Berlin, Germany: Springer, 2009.
[7] J.-Y. Wen, Y.-C. Huang, M.-H. Tsai, K.-Y. Liao, J. C.-M. Li, M.-T. Chang, M.-H. Tsai, C.-M. Tseng, and H.-C. Li, “Test clock domain optimization for peak power supply noise reduction during scan,” in Proc. IEEE Int. Test Conf., Sep. 2011, pp. 1–8.
[8] H.-T. Lin and J. C.-M. Li, “Simultaneous capture and shift power reduction test pattern generator for scan testing,” IET Comput. Digital Techniques, vol. 2, no. 2, pp. 132–141, Mar. 2008.
[9] V. Dabholkar, S. Chakravarty, I. Pomeranz, and S. Reddy, “Techniques for minimizing power dissipation in scan and combinational circuits during test application,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 17, no. 12, pp. 1325–1333, Dec. 1998.


[10] N. Badereddine, P. Girard, S. Pravossoudovitch, A. Virazel, and C. Landrault, “Power-aware scan testing for peak power reduction,” in Proc. IFIP VLSI-SOC, 2005, pp. 441–446.
[11] O. Sinanoglu, I. Bayraktaroglu, and A. Orailoglu, “Test power reduction through minimization of scan chain transitions,” in Proc. VLSI Test Symp., 2002, pp. 166–171.
[12] A. Hertwig and H. J. Wunderlich, “Low-power serial built-in self-test,” in Proc. Eur. Test Workshop, 1998, pp. 49–53.
[13] L. Whetsel, “Adapting scan architectures for low-power operation,” in Proc. IEEE Int. Test Conf., Oct. 2000, pp. 863–872.
[14] Y. Bonhomme, P. Girard, L. Guiller, C. Landrault, and S. Pravossoudovitch, “A gated clock scheme for low-power scan testing of logic ICs or embedded cores,” in Proc. Asian Test Symp., 2001, pp. 253–258.
[15] T. Yoshida and M. Watari, “MD-SCAN method for low-power scan testing,” in Proc. Asian Test Symp., Nov. 2002, pp. 80–85.
[16] X. Wen, Y. Yamashita, S. Kajihara, and L. T. Wang, “On low-capture-power test generation for scan testing,” in Proc. VLSI Test Symp., May 2005, pp. 265–270.
[17] W. Li, S. M. Reddy, and I. Pomeranz, “On reducing peak current and power during test,” in Proc. Symp. VLSI, 2005, pp. 156–161.
[18] X. Wen, Y. Yamashita, S. Morishima, S. Kajihara, L.-T. Wang, K. K. Saluja, and K. Kinoshita, “Low-capture-power test generation for scan-based at-speed testing,” in Proc. IEEE Int. Test Conf., Nov. 2005, p. 1028.
[19] N. Badereddine, P. Girard, S. Pravossoudovitch, C. Landrault, A. Virazel, and H. J. Wunderlich, “Minimizing peak power consumption during scan testing: Structural technique for don’t care bits assignment,” in Proc. IEEE Conf. Ph.D. Res. Microelectron. Electron., Sep. 2006, pp. 65–68.
[20] V. R. Devanathan, C. P. Ravikumar, and V. Kamakoti, “Glitch-aware pattern generation and optimization framework for power-safe scan test,” in Proc. VLSI Test Symp., May 2007, pp. 167–172.
[21] X. Wen, K. Miyase, T. Suzuki, S. Kajihara, Y. Ohsumi, and K. K. Saluja, “Critical-path-aware X-filling for effective IR-drop reduction in at-speed scan testing,” in Proc. Des. Autom. Conf., Jun. 2007, pp. 527–532.
[22] J. Lee, S. Narayan, M. Kapralos, and M. Tehranipoor, “Layout-aware, IR-drop tolerant transition fault pattern generation,” in Proc. Des. Autom. Test Eur., Mar. 2008, pp. 1172–1177.
[23] S. R. Nassif and J. N. Kozhaya, “Fast power grid simulation,” in Proc. Des. Autom. Conf., 2000, pp. 156–161.
[24] Z. Feng and P. Li, “Multigrid on GPU: Tackling power grid analysis on parallel SIMT platform,” in Proc. Int. Conf. Comput.-Aided Design, 2008, pp. 647–654.
[25] T.-H. Chen and C. C.-P. Chen, “Efficient large-scale power grid analysis based on preconditioned Krylov-subspace iterative methods,” in Proc. Design Autom. Conf., 2001, pp. 559–562.
[26] Z. Feng, X. Zhao, and Z. Zeng, “Robust parallel preconditioned power grid simulation on GPU with adaptive runtime performance modeling and optimization,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 30, no. 4, pp. 562–573, Apr. 2011.
[27] NVIDIA Corporation. (2012). Libraries [Online]. Available: http://developer.nvidia.com/technologies/libraries
[28] G. Golub and C. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins Univ. Press, 1996.
[29] Synopsys Library Compiler User Guide, 2008.

Yu-Chiuan Huang was born in Taiwan on September 10, 1985. He received the B.S.E.E. degree in mathematics from National Central University, Taoyuan, Taiwan, in 2009, and the M.S.E.E. degree in electronics engineering from National Taiwan University, Taipei, Taiwan, in 2011. He is currently a Research and Development Engineer with Mentor Graphics Corporation, Hsinchu, Taiwan. His current research interests include low-power testing, voltage drop simulation, and numerical matrix analysis.

Ming-Hong Tsai received the B.S. degree in electrical engineering from National Central University, Taoyuan, Taiwan, in 2010, and the M.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 2012. He is currently an Engineer with MediaTek Company, Ltd., Hsinchu, Taiwan.


Wei-Sheng Ding received the B.S. degree in computer science from National Tsing Hua University, Hsinchu, Taiwan, in 2011. He is currently pursuing the M.S. degree in electrical engineering with National Taiwan University, Taipei, Taiwan.

James Chien-Mo Li (S’93–M’02) received the B.S.E.E. degree from National Taiwan University, Taipei, Taiwan, in 1993, and the M.S.E.E. and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, in 1997 and 2002, respectively. He is currently an Associate Professor with the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei. He has coauthored more than 50 technical papers and two book chapters. His current research interests include test generation, low-power testing, and diagnosis.

Ming-Tung Chang received the M.Sc. degree in information and electrical engineering from Feng Chia University, Taichung, Taiwan, in 2007. He joined SynTest Technologies, Hsinchu, Taiwan, in 1996. Since 2007, he has been with Global Unichip Corporation, Hsinchu. His current research interests include design for test solutions. He has been involved with the electronic design automation and test industry for over 15 years.

Min-Hsiu Tsai received the B.S. and M.S. degrees in electrical engineering from Tamkang University, Taipei, Taiwan, in 1995 and 2000, respectively. Since 2000, he has been with the ASIC and design service vendor Global Unichip Corporation, Hsinchu, where he is currently in charge of advanced design methodology development.

Chih-Mou Tseng received the B.S. degree in electronic engineering from National Tsing Hua University, Hsinchu, Taiwan, in 1993, and the M.S. degree in electronic engineering from National Chiao Tung University, Hsinchu, in 1996. In 1996, he joined the Industrial Technology Research Institute of Taiwan as an EDA Engineer. Since 2000, he has been with Global Unichip Corporation, Hsinchu, where he leads the Design Methodology Division to develop SoC implementation methodology.

Hung-Chun Li (M’03) was born in Tainan, Taiwan, in 1968. He received the B.E. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1990, and the M.S. degree in electrical engineering from the University of Florida, Gainesville, in 1994. In 1995, he joined the Nonvolatile Memory Division of ISSI U.S. as a Circuit Design Engineer working on flash memory integrated circuit design. From 1996 to 2000, he was the CAD Engineer of the Design Service Division, Taiwan Semiconductor Manufacturing Company, Hsinchu, Taiwan. From 2000 to 2001, he was a Physical Design Engineer with the startup company Allayer, which targeted 1 Gb/10 Gb network switching integrated circuits and has since been acquired by Broadcom. From 2002 to 2007, he was with Cadence Systems as an R&D Architect in charge of developing digital design auto place and route and static timing analysis tools. Since 2007, he has been with the ASIC and design service vendor Global Unichip Corporation, Hsinchu, where he is currently in charge of advanced design methodology development from 65-nm technology to the latest 20-nm technology node.
