IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 28, NO. 2, FEBRUARY 2009
Algorithms for State Restoration and Trace-Signal Selection for Data Acquisition in Silicon Debug

Ho Fai Ko, Student Member, IEEE, and Nicola Nicolici, Member, IEEE
Abstract—To locate and correct design errors that escape pre-silicon verification, silicon debug has become a necessary step in the implementation flow of digital integrated circuits. Embedded logic analysis, which employs on-chip storage units to acquire data in real time from the internal signals of the circuit-under-debug, has emerged as a powerful technique for improving observability during in-system debug. However, as the amount of data that can be acquired is limited by the on-chip storage capacity, the decision on which signals to sample is essential when it is not known a priori where the bugs will occur. In this paper, we present accelerated algorithms for restoring circuit state elements from the traces collected during a debug session, by exploiting bitwise parallelism. We also introduce new metrics that guide the automated selection of trace signals, which can enhance the real-time observability during in-system debug.

Index Terms—Embedded logic analysis, silicon debug, state restoration, trace-signal selection.
I. INTRODUCTION
TO ENSURE that a complex digital integrated circuit is error free, various checkpoints are enforced at different stages of the implementation flow to validate a design, as shown in Fig. 1. Pre-silicon verification techniques, such as formal verification and simulation, help designers check the correctness of the circuit implementation against its specification [2]. To detect physical defects such as shorts and opens, manufacturing test is applied to each fabricated circuit [3]. However, with the growing complexity of integrated circuits, the time required to thoroughly validate a circuit using extensive simulation or formal verification in the pre-silicon stage has become prohibitive. In addition, the accuracy of circuit models is inadequate to guarantee that the first silicon is error free. To avoid the escalating mask costs caused by respinning a design, as well as to reduce the time to market, it is important that undetected design bugs are fixed as soon as the first silicon is available [4], [5]. As a result, silicon debug has emerged in recent years as a powerful technique to detect and locate design errors in silicon [6]–[8].

Manuscript received June 17, 2008. Current version published January 21, 2009. An earlier version of this paper was published at the IEEE/Association for Computing Machinery Design, Automation and Test in Europe Conference in 2008 [1]. This paper was recommended by Associate Editor R. F. Damiano. The authors are with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON L8S 4K1, Canada (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TCAD.2008.2009158

Silicon debug can be divided into two main steps: data acquisition and analysis. Physical probing techniques, such as time-resolved photon emission, are widely used to acquire
Fig. 1. Implementation flow for digital integrated circuits.
circuit data for failure analysis [9]. However, decreasing feature sizes, flip-chip technologies, and the growing complexity of integrated circuits make data acquisition using physical probing cumbersome, unless it is augmented by complementary design techniques that can help narrow down the location to be probed. Design-for-debug (DFD) hardware, such as an embedded logic analyzer (ELA) [10], improves the observability of internal signals by sampling data into on-chip trace buffers. The acquired data can then be transported through low-bandwidth device pins, so that postprocessing algorithms can analyze them and identify design errors off-chip. In this paper, we focus only on data acquisition during silicon debug.

Despite the numerous DFD techniques that have been proposed in the literature, as summarized in the following section, only a small amount of debug data can be transferred off-chip. This limited observability of internal signals may lengthen the debug process, and it motivates our work. Instead of introducing yet another technique for increasing observability by acquiring more data on-chip, our objective is to better utilize the limited storage available during data acquisition. By consciously selecting the trace signals when designing the ELA, one will be able to restore missing data from the state elements that are not traced. For instance, it is common practice for microprocessor designers to manually identify which signals to capture (e.g., pipe control signals), such that additional information (e.g., values in pipe data registers) can be reconstructed off-chip [11]. It is important to note that, although this additional information cannot guarantee a successful identification of design errors, it will nevertheless increase the observability of internal signals and, thus, can aid the search for bugs. Motivated by the lack of information in this area (that is disseminated in the public domain), the aim of this paper is
0278-0070/$25.00 © 2009 IEEE
to provide a first step toward understanding how computer-aided design technology can assist structured silicon debug. By proposing automated algorithms for selecting trace signals and restoring state information, we aim to develop a structured method for debugging microprocessors, as well as circuits manufactured using standard cells or implemented in field-programmable gate arrays. It should also be noted that our method complements the existing data acquisition techniques, because when more signals are traced, more data can be acquired and subsequently used by our state restoration algorithms. It is also important to note that the proposed technique focuses only on silicon debug for functional errors, i.e., we assume that the circuit-under-debug (CUD) has successfully passed the manufacturing test [3].

In the rest of this paper, Section II discusses several relevant DFD techniques for silicon debug. Section III explains the principles of a novel state restoration algorithm. Section IV details the technique for selecting trace signals to enhance observability during silicon debug. The experimental results and conclusion are given in Sections V and VI, respectively.

II. RELATED WORK AND MOTIVATION

To address the problem of limited observability in silicon debug, a number of ad hoc DFD solutions for improving data acquisition in microprocessors have been proposed. A comprehensive review of these data acquisition solutions can be found in [4]. They can be divided into two main categories: scan-based and trace-buffer-based techniques.

Reusing the internal scan chains, which are the most widely used mechanism for increasing the observability of a circuit during manufacturing test, is the primary goal of scan-based silicon debug. When a specific triggering event occurs, all the internal state elements are captured using the scan technique; the captured data can then be off-loaded through the scan chains.
Postprocessing algorithms such as latch divergence analysis [12] or failure propagation tracing [13] can then be applied to identify the failing state elements. Although there are proposals for improving the controllability and observability of the circuit (e.g., [14]), one cannot acquire data in real time using the scan-based technique during silicon debug, because the circuit has to stop and then resume its execution during each scan dump. Since functional bugs can sometimes appear in circuit states that are exercised thousands of clock cycles apart [15], it is desirable to maintain circuit execution during scan dumps. Although this limitation can be overcome by double buffering the scan elements, doing so leads to a substantial area penalty [16]. Even if this penalty were acceptable, sampling data in consecutive clock cycles using only the available scan chains would not be possible. However, the ability to acquire data continuously is an essential requirement for identifying timing-related problems in a design during silicon debug.

To acquire data in real time during silicon debug, the trace-buffer-based technique [17], which is complementary to scan, can be employed. The debug flow for this approach is shown in Fig. 2. The first step of this approach is to design the ELA during the chip realization process. The ELA includes various trigger units, which determine when data should be
Fig. 2. Debug flow when using the trace-buffer-based technique.
acquired, and sample units, which monitor a small set of signals (i.e., trace signals) using on-chip trace buffers such as embedded memories [18], [19]. After the circuit has been implemented with the ELA, the debug engineer initiates the debug cycle by setting up the trigger events. Then, the CUD is put into operational mode, and the ELA monitors the trigger events, upon which data are sampled in real time into the on-chip trace buffers. The sampled data are subsequently transferred off the chip via a low-bandwidth interface to the postprocessing stage [20]. This stage includes organizing the sampled data so that they can be fed to a simulator, where the debug engineer can analyze them to identify functional bugs. It has been shown in [21] that a complex design can contain tens to hundreds of bugs. As a result, it is very likely that the debug engineer will have to iterate steps 2–7 in Fig. 2 during debug to gather the additional data needed to identify all the bugs of concern.

The amount of data that can be acquired is limited by the trace-buffer depth, which bounds the number of samples that can be stored, and its width, which bounds the number of trace signals sampled in each clock cycle. In order to reduce the total debug time, the number of iterations through steps 2–7 should be lowered. In this sense, a number of solutions have been proposed to improve the design of ELAs (e.g., [10], [22]–[24]). However, the reluctance to invest additional area in large trace buffers solely for the purpose of silicon debug limits the amount of data that can be acquired on-chip. As a result, trace compression techniques (e.g., [25]–[27]) have been proposed to compress debug data on-chip before storing them into the trace buffers. Although compression can increase the number of samples stored for each trace signal, the number of signals that can be monitored is still limited.
Thus, it is desirable to find an automated way to identify the essential signals to be traced (called trace-signal selection), such that, after the acquired data are sent off-chip, they can be used to reconstruct as much missing data as possible (for both combinational nodes and state elements). This should be done in such a manner that any postprocessing algorithm could search for design bugs using the enlarged set of data in step 7. To the best of the authors' knowledge, the only solution available in the public domain with a similar goal is that in [28]. However, based on the description in [28], their algorithm restores data only in the combinational logic nodes of the circuit. In this paper, we show that restoring data in state elements (called state restoration) can enlarge the set of debug data. Given the lack of public knowledge in this area, the purpose of this paper is to develop the understanding of the key factors that enable efficient state restoration and trace-signal
Fig. 4. Principal operations for state restoration. (a) Forward. (b) Backward. (c) Combined. (d) Not defined.
Fig. 3. Sample circuit for state restoration. (a) CUD. (b) Restored data in sequential elements.
selection. We first show how state restoration can be accelerated by exploiting the inherent parallelism of the problem at hand. We then define circuit-dependent metrics that guide the automatic selection of trace signals used by the DFD hardware, such that the sampled data can help restore a large amount of missing data in the circuit during silicon debug.

III. STATE RESTORATION FOR SILICON DEBUG

In this paper, we use the circuit shown in Fig. 3 to illustrate the key points of our solution. Although it is unlikely that one would need our method to debug such a simple circuit in a practical environment, we use it to demonstrate that our algorithms, based on circuit analysis, can indeed automatically select the same trace signals as an experienced designer would choose manually. The easiest way to debug the circuit in Fig. 3(a) is to sample all five flip-flops (FFs) for five consecutive clock cycles. In this case, one needs a trace buffer of size 5 × 5 bits to store all the data during silicon debug. Instead of monitoring all the signals, if only FFC is sampled, one will still be able to reconstruct some of the missing data of the circuit.

The basic idea of state restoration is to forward propagate and backward justify known values from a trace signal to other nodes in a circuit in order to reconstruct the missing data. This is achieved by applying the Boolean relations between state elements. Although this may seem similar to the automatic test pattern generation (ATPG) problem from manufacturing test, the state restoration problem does not require any decisions to be made. The algorithm only checks whether data can be reconstructed at a circuit node; no branching or backtracking is performed (if reconstruction is unsuccessful, the value is concluded to be undefined). After state restoration, the designer can use the expanded set of data to verify the behavior of the circuit against its specification.
In this paper, we assume that a gate-level netlist is available while debugging a circuit and that the trace signals are FFs. When a netlist cannot be obtained for blocks of logic in a circuit (e.g., hard cores in a system), one will not be able to apply the proposed state restoration method to the hidden logic. However, state restoration can still be applied
to the surrounding logic for reconstructing the input stimuli and output responses of the protected circuit blocks. One may argue that the set of expanded data obtained from the netlist does not reflect the actual circuit responses from the chip. However, as we focus only on the identification of functional bugs, it is assumed that the behavior of the manufactured circuit matches the behavior of the circuit netlist. Note that this assumption is correct if the circuit has successfully passed the manufacturing test, and it has already been validated in practice. For example, state restoration based on behavioral circuit models and manually selected trace signals has been employed successfully for microprocessor systems [11], where real-time stimuli are used for detecting hard-to-find bugs (either hardware or software) that have escaped pre-silicon verification. In this paper, we aim to show that, with structural circuit models and a new algorithmic framework for automated trace-signal selection, we can restore missing data for random logic blocks.

A. Principal Operations for State Restoration

The combinational logic part of any digital circuit can be decomposed into a network of two-input primitive gates (e.g., AND, OR, XOR, and NOT). Our proposed algorithm relies on applying two principal operations to each logic gate in the translated circuit netlist for state restoration. We call them the forward and backward operations, and they are shown in Fig. 4. The authors acknowledge that these are very simple operations that have been discussed before from different perspectives (e.g., for ATPG); we describe them here for the sake of completeness.

A forward operation is applied to a gate when the input values of the gate are known, and it tries to determine the output value of the gate using Boolean algebra. This is similar to what is normally done in functional simulators. An example is shown in Fig. 4(a) using an AND gate and an OR gate.
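As a concrete illustration, the three-valued forward evaluation of two-input gates can be sketched as follows (a minimal Python sketch, not the authors' implementation; `None` stands for the unknown value X, and the function names are hypothetical):

```python
# Three-valued forward evaluation for two-input primitive gates.
# A value is 0, 1, or None (None plays the role of the unknown X).

def forward_and(a, b):
    # A controlling 0 decides the output even if the other input is X.
    if a == 0 or b == 0:
        return 0
    if a == 1 and b == 1:
        return 1
    return None  # insufficient information

def forward_or(a, b):
    # Dually, a controlling 1 decides the OR output.
    if a == 1 or b == 1:
        return 1
    if a == 0 and b == 0:
        return 0
    return None
```

Note how a single controlling value resolves the output without examining the other input, which is exactly the case exploited by the forward operation in Fig. 4(a).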
In the case of an AND gate, when one of the inputs is logic 0, the output can be concluded to be logic 0 without analyzing the other input of the gate. On the other hand, when the output value of a gate is known, the backward operation can be applied to determine the values on the inputs of the gate. This is similar to backward justification in ATPG, as elaborated with two examples in Fig. 4(b). When the output of an AND gate is logic 1, both of its inputs can be immediately justified to logic 1. Likewise, the inputs of an OR gate evaluate to logic 0 if the output is logic 0. When the forward and backward operations alone are not sufficient, a combined operation, during which both the known input and output values are evaluated, can be employed to reconstruct the missing value. This operation, which is explored
for ATPG, is shown in Fig. 4(c). In the AND gate, when the output and one of the inputs are known to be logic 0 and 1, respectively, the remaining input can be set to logic 0. Similar behavior can be observed for the OR gate. Note that the same principles from Boolean algebra for the forward and backward operations can also be applied to the other primitive gates (i.e., NAND, NOR, XOR, XNOR, and NOT).

Obviously, the principal operations from Fig. 4(a)–(c) will not always be able to reconstruct the missing values of a gate. For example, as shown in Fig. 4(d), when the output and the known input of an AND gate are both logic 0, there is insufficient information to conclude the missing value on the unknown input.

Fig. 3 shows an example that demonstrates how the principal operations are applied to a circuit that has both combinational and sequential elements. Fig. 3(a) shows the simple circuit with five FFs, and Fig. 3(b) shows the data in the state elements after the restoration algorithm is applied. In this example, only FFC is sampled during clock cycles 0–3. The X's in the table in Fig. 3(b) refer to values that cannot be restored using only the available sampled data.

Before applying the state restoration algorithm shown in Algorithm 1, the circuit netlist is translated into a graph where the nodes represent logic gates, state elements, primary inputs, and primary outputs, and the directed edges represent signal dependencies. For example, a two-input logic gate is translated into a single node, with two directed edges connecting from its parent nodes, which represent the two circuit elements that drive the logic gate. Likewise, a directed edge is added from the newly created node to each of its child nodes, which represent the circuit elements that it drives.
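The worklist traversal over such a graph can be sketched as follows. The toy circuit below, a chain of inverting FFs satisfying q[t+1] = NOT p[t], is a hypothetical stand-in for a real circuit graph, and all class and function names are illustrative only, not the authors' implementation:

```python
from collections import deque

class Chain:
    """Hypothetical circuit graph: a linear chain of nodes where
    each child q satisfies q[t+1] = NOT p[t] for its single parent p."""
    def __init__(self, names, n_cycles):
        self.names = names
        self.n = n_cycles
        self.val = {s: {} for s in names}   # known values: cycle -> 0/1

    def parents(self, s):
        i = self.names.index(s)
        return [self.names[i - 1]] if i > 0 else []

    def children(self, s):
        i = self.names.index(s)
        return [self.names[i + 1]] if i + 1 < len(self.names) else []

    def forward(self, src, dst):
        # dst[t+1] = NOT src[t]; returns True if any new value is restored.
        new = False
        for t, v in self.val[src].items():
            if t + 1 < self.n and t + 1 not in self.val[dst]:
                self.val[dst][t + 1] = 1 - v
                new = True
        return new

    def backward(self, src, dst):
        # dst[t] = NOT src[t+1]; returns True if any new value is restored.
        new = False
        for t, v in self.val[src].items():
            if t - 1 >= 0 and t - 1 not in self.val[dst]:
                self.val[dst][t - 1] = 1 - v
                new = True
        return new

def restore(circ, traced):
    """Worklist loop mirroring the structure of Algorithm 1."""
    work = deque(traced)
    while work:
        cur = work.popleft()
        for p in circ.parents(cur):
            if circ.backward(cur, p):
                work.append(p)
        for c in circ.children(cur):
            if circ.forward(cur, c):
                work.append(c)
```

Tracing the middle node of a three-node chain and calling `restore` fills in the missing cycles of both of its neighbors, exactly as backward justification and forward propagation do in the paper's Fig. 3 example.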
With the translated circuit graph, the principal operations are applied to each node repeatedly until no more data can be reconstructed for any signal from the given subset of data. This can be shown by applying Algorithm 1 to the circuit in Fig. 3(a). Using backward justification (line 5), whenever a logic 1 is captured in FFC, the values of FFA and FFB can be evaluated as logic 1 and 0, respectively, in the previous clock cycle. In addition, the inverted values of FFC can be forward propagated to become the values of FFD in the following clock cycles (line 9). The values of FFE in the current clock cycle can only be reconstructed when the values of FFB and FFC are known in the previous clock cycle. This is done in the next iteration of the loop (line 2) with the updated information on FFB. Note that, whenever the BackwardOperation or the ForwardOperation is applied to a node, the operation tries to perform state restoration from that node for all the considered clock cycles. This lowers the number of times a node has to be revisited, thus reducing the CPU runtime of the state restoration algorithm. As will be discussed in the following section, bitwise parallelism can also be exploited in this implementation.

By forward propagating and backward justifying known data between gates in the circuit, data can be restored for other state elements one clock cycle at a time. It is essential to note that our proposed algorithm aims at restoring data for sequential elements across multiple time frames, which is different from [28], where only values in the combinational logic are reconstructed from the known data in the sequential elements. When comparing the amount of data available before and after state restoration, 14 data values are available in the entire circuit after applying the restoration process to only four initial data values from FFC. This gives a restoration ratio of 14/4 = 3.5X for the elaborated example. The amount of data that can be reconstructed using state restoration depends on the initial set of sampled data. For instance, if only FFE is sampled in Fig. 3(a), no new data can be reconstructed for any other state element in the circuit.

The computation time of the state restoration algorithm is proportional to the number of nodes present in the circuit graph. In addition, unlike ATPG, when restoring state data for a circuit, a large number of clock cycles has to be considered, which affects the CPU runtime for reconstructing the missing data. Although bitwise parallelization has been explored to speed up logic simulation across multiple clock cycles, it has only been applied to the forward operation [29]. In the following section, we therefore show how bitwise parallelization can be exploited for all the principal operations shown in Fig. 4 by eliminating any branching decisions from the state restoration process.

Algorithm 1: Algorithm for state restoration
Input: Circuit_Graph, Trace_Signal_List
Output: Circuit_Graph with restored data
 1  search_list = Trace_Signal_List;
 2  while search_list is not empty do
 3      cur_node = first node in search_list;
 4      for each (parent_node of cur_node) do
 5          BackwardOperation(cur_node, parent_node);
 6          if (new data are restored for parent_node) then
 7              Put parent_node at end of search_list
 8      for each (child_node of cur_node) do
 9          ForwardOperation(cur_node, child_node);
10          if (new data are restored for child_node) then
11              Put child_node at end of search_list

B. Exploiting Bitwise Parallelism for State Restoration

For the state restoration algorithm to be applicable to large circuits, it must be computationally efficient. This is because the designer may need to test the circuit with different stimuli when iterating through steps 2–7 in Fig. 2 during the debug process; thus, it is desirable to restore data as fast as possible for each stimulus to reduce debug time. To reduce the computation time of the state restoration algorithm, we exploit two observations about applying the principal operations to a node. First, to restore data for one clock cycle in a given node, a branch decision has to be made to check whether the data can be reconstructed. Thus, the computation time for restoring data across all the circuit nodes for a large number of clock cycles depends on how well these branches are predicted during program execution. Second, the algorithm can be parallelized by evaluating multiple branch decisions at the same time. This is feasible for state restoration because, when performing the principal operations on a node for multiple clock cycles, the results for the data points in different clock cycles are independent of each other. This idea can be better explained using the example shown
TABLE I
TWO-BIT CODES FOR DATA REPRESENTATION

  Code (bit 0, bit 1) | Represented value
  00                  | logic 0
  11                  | logic 1
  01 or 10            | X (undefined)
in Fig. 3. In order to restore data for the circuit for five clock cycles, one can iteratively apply the principal operations to each node in each clock cycle. However, the same outcome can be achieved by evaluating five different branch decisions at the same time with the corresponding data for the specific clock cycles. Nevertheless, since a circuit under debug can contain tens of thousands of logic gates and data are usually restored over thousands of clock cycles, speeding up the algorithm by parallelizing the branch decisions may still incur a large computation time due to the execution penalty caused by mispredictions. As a result, we derive new logic operations such that the principal operations can be applied concurrently at a node across multiple clock cycles, without the need to evaluate any branch decisions during state restoration.

We exploit the integer data type in ANSI C on a 32-bit platform to enhance the performance of our algorithm by storing the data for 32 consecutive clock cycles in two integers (8 bytes) for each node. For example, to represent the data [0, 1, 1, 0, X] for clock cycles 0–4 of FFC in Fig. 3(a) using the two-bit codes in Table I, we can store the data for FFC in two integer variables as follows:

int0 = 0, 1, 1, 0, 1, ..., 1
int1 = 0, 1, 1, 0, 0, ..., 0
In these equations, the first 5 bits of the two variables store the data for clock cycles 0–4 of FFC, and the remaining 27 bits store the code for undefined data. By working with two integer variables, the algorithm can restore data for 32 consecutive clock cycles at a time using a sequence of logic equations, based on the bitwise operations provided by ANSI C, for each of the primitive gates. For each principal operation, two equations (one for each integer) are developed in such a way that the number of two-operand bitwise operations is minimized. Although the formalism of multivalued logic and input/output encoding from logic synthesis can be used to derive these systems of equations [30], the following discussion relies on the illustrative advantage of the Karnaugh map (K-map) representation.

Fig. 5(a) and (b) shows the K-maps for deriving the logic equations for the forward operation at the output z, while Fig. 5(c) and (d) shows the K-maps for the backward equations at the input a of a two-input AND gate. Note that the inputs of the AND gate are labeled a and b, and since two bits are needed for data representation, the variables of the logic equations are labeled a0, a1, b0, and b1 for the inputs and z0 and z1 for the output. From Boolean algebra, we know that, when any input of an AND gate is 0, the output must also be 0. This is why the entries are set to 0 in the first rows and the leftmost columns of the K-maps for z0 and z1 in Fig. 5(a) and (b). Also, when both inputs are logic 1, z0 and z1 are both set to 1 to represent a logic 1 on the output of the AND gate. The shaded regions of the K-maps in Fig. 5(a) and (b) show that the output
Fig. 5. Derivation of forward and backward equations for the AND gate. (a) K-map for z0 . (b) K-map for z1 . (c) K-map for a0 . (d) K-map for a1 .
port of the AND gate is inconclusive due to insufficient data on the input ports. In these regions, the values of z0 and z1 can be filled consciously in such a way that the resulting code is 01 or 10 (in order to represent the undefined values according to the two-bit codes in Table I) and, at the same time, the number of bitwise operations is minimized. The K-maps for deriving the logic equations for the backward operation can be constructed using the same principle. For instance, as can be seen in Fig. 5(c) and (d), when the output of the AND gate is logic 1, the inputs of the gate can be justified to logic 1. Also, input a can be concluded to be logic 0 when the output z is 0 and the other input b is 1. In all other cases, the value of a is inconclusive, as shown by the shaded regions in Fig. 5(c) and (d). Note that, for the backward operation, four logic equations (two equations for each input port) are derived; the equations for the two input ports are identical up to an exchange of variables. In this example, for an AND gate, the equations for the forward and backward operations are (with ¬ denoting the bitwise complement)

z0 = a0 a1 b0 b1
z1 = (a0 + a1)(b0 + b1)
a0 = z0 + ¬(z1 + b0 b1)
a1 = z1
b0 = z0 + ¬(z1 + a0 a1)
b1 = z1.
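Assuming the two-bit codes of Table I (00 = logic 0, 11 = logic 1, 01/10 = X), the backward equations for input a can be checked with ordinary bitwise operators; a minimal Python sketch (the function name is hypothetical), with `~` playing the role of the complement:

```python
MASK = 0xFFFFFFFF  # emulate 32-bit ANSI C integers in Python

def and_backward_a(z0, z1, b0, b1):
    """Backward equations for input a of the AND gate, applied to all
    32 cycles at once: a0 = z0 + NOT(z1 + b0 b1), a1 = z1."""
    a0 = (z0 | ~(z1 | (b0 & b1))) & MASK
    a1 = z1 & MASK
    return a0, a1
```

Reading off a single bit position: when z = 1 (code 11) the input is justified to 1; when z = 0 and b = 1 the combined operation yields a = 0; and when z = 0 with b = 0 the result is the undefined code 10, matching Fig. 4(d).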
Using these logic equations, if the input ports of an AND gate have the following data for 32 clock cycles:

a = 0, X, 1, ..., 1, X, ..., X
b = X, 1, 1, ..., 1, 0, ..., 0

then, using our two-bit codes in Table I, the inputs to the logic equations translate into four integer variables:

a0 = 0, 1, 1, ..., 1, 1, ..., 1
a1 = 0, 0, 1, ..., 1, 0, ..., 0
b0 = 1, 1, 1, ..., 1, 0, ..., 0
b1 = 0, 1, 1, ..., 1, 0, ..., 0

The output z of the AND gate is then obtained by applying the a0, a1, b0, and b1 data to the forward equations:

z0 = 0, 0, 1, ..., 1, 0, ..., 0
z1 = 0, 1, 1, ..., 1, 0, ..., 0

Referring back to the two-bit codes, the output z of the AND gate for the 32 clock cycles is

z = 0, X, 1, ..., 1, 0, ..., 0.
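The worked example above can be reproduced directly. The sketch below packs value streams into the two-integer code of Table I and applies the forward equations for the AND gate (Python is used here in place of ANSI C purely for illustration; `encode`, `decode`, and `and_forward` are hypothetical names):

```python
MASK = 0xFFFFFFFF  # emulate 32-bit ANSI C integers in Python

def encode(vals):
    # Pack 0/1/'X' per cycle into the two-integer code of Table I:
    # 00 = logic 0, 11 = logic 1, 10 = X; bit i holds cycle i.
    v0 = v1 = 0
    for i, v in enumerate(vals):
        if v == 1:
            v0 |= 1 << i
            v1 |= 1 << i
        elif v == 'X':
            v0 |= 1 << i
    return v0, v1

def decode(v0, v1, n):
    # Map each bit pair back to 0, 1, or 'X' (01 and 10 are both X).
    out = []
    for i in range(n):
        pair = ((v0 >> i) & 1, (v1 >> i) & 1)
        out.append(0 if pair == (0, 0) else 1 if pair == (1, 1) else 'X')
    return out

def and_forward(a0, a1, b0, b1):
    # Forward equations for the AND gate: z0 = a0 a1 b0 b1,
    # z1 = (a0 + a1)(b0 + b1); all cycles evaluated at once.
    z0 = a0 & a1 & b0 & b1
    z1 = (a0 | a1) & (b0 | b1)
    return z0 & MASK, z1 & MASK
```

Running an eight-cycle version of the example, a = [0, X, 1, 1, 1, X, X, X] and b = [X, 1, 1, 1, 1, 0, 0, 0], decodes back to z = [0, X, 1, 1, 1, 0, 0, 0], in agreement with the hand calculation above.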
In the aforementioned equations, if the values of the output variables are already known, they will be overwritten by the results of the equations. For example, in the forward equations introduced for z0 and z1, if z0 and z1 are known from sampling the circuit while a0, a1, b0, and b1 are not known, applying the forward equations will overwrite the sampled values in z0 and z1 with unknowns, because the existing values of the output variables are not considered in the equations. As a result, additional operations have to be added to preserve the existing values of the output variables when they are known. The modified forward and backward equations for the AND gate then become

z0 = ¬(z0 ⊕ z1)z0 + (z0 ⊕ z1)(a0 a1 b0 b1)
z1 = ¬(z0 ⊕ z1)z1 + (z0 ⊕ z1)(a0 + a1)(b0 + b1)
a0 = ¬(a0 ⊕ a1)a0 + (a0 ⊕ a1)(z0 + ¬(z1 + b0 b1))
a1 = ¬(a0 ⊕ a1)a1 + (a0 ⊕ a1)z1
b0 = ¬(b0 ⊕ b1)b0 + (b0 ⊕ b1)(z0 + ¬(z1 + a0 a1))
b1 = ¬(b0 ⊕ b1)b1 + (b0 ⊕ b1)z1

Note that the ⊕ symbol represents the XOR operation and ¬ the bitwise complement. With these equations, the values of the output variables are only overwritten when their existing values are unknown (i.e., when the variable pair holds the code 01 or 10 defined in Table I).

By replacing the implementations of BackwardOperation and ForwardOperation in Algorithm 1 with these logic equations to restore data in 32 clock cycles at a time, the total number of bitwise operations for the AND gate can be calculated as follows. For z0, one operation is needed for the XOR between its previous values; note that the result of this XOR can be reused to reduce the number of bitwise operations in the other equations. In addition, one inversion is needed for the XNOR, and five bitwise AND operations and one bitwise OR operation are needed. On the other hand, three bitwise AND operations and three bitwise OR operations are required for z1. Thus, 14 bitwise operations are performed for forward propagating data through the AND gate.
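The value-preserving refinement can be sketched the same way: the XOR of the two code words marks the positions where the output is still unknown, and only those positions receive the newly computed values (a minimal Python sketch under the Table I encoding, not the authors' code; the function name is hypothetical):

```python
MASK = 0xFFFFFFFF  # emulate 32-bit ANSI C integers in Python

def and_forward_preserving(z0, z1, a0, a1, b0, b1):
    """Modified forward equations for the AND gate: keep z where it is
    already known (codes 00 or 11, i.e. z0 XOR z1 == 0) and evaluate
    the gate only in the unknown positions (z0 XOR z1 == 1)."""
    unk = (z0 ^ z1) & MASK      # 1 where z is currently unknown
    keep = ~unk & MASK          # 1 where z must be preserved
    nz0 = (keep & z0) | (unk & a0 & a1 & b0 & b1)
    nz1 = (keep & z1) | (unk & (a0 | a1) & (b0 | b1))
    return nz0, nz1
```

In a position where z was sampled as logic 1 (code 11) but both inputs are unknown, the sampled value survives; in a position where z is unknown, the plain forward equations take effect.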
For backward justifying data through the AND gate, there are one XOR, two NOT, three AND, and three OR bitwise operations for a0, while a1 requires only two AND and one OR operations. Together, 12 bitwise operations are needed to backward justify data for one input; thus, a total of 24 bitwise operations are necessary for backward justifying data on a0, a1, b0, and b1 for the AND gate. Although the derivation of the forward and backward equations for the other primitive gates is not shown, they can be obtained using the same concepts as in the elaborated example. They range from ten bitwise operations for the forward operation of NOT to 33 bitwise operations for the backward operation of XOR, as shown in Table II.

It should be noted that the proposed method requires multiple CPU instructions to restore data for 32 clock cycles; in fact, it needs 33 CPU instructions if the 32 data values are reconstructed using the backward operation on the XOR gate. On the other hand, one could implement only 32 if-then-else instructions to reconstruct the same amount of data. However, these 32 if-then-else instructions do not necessarily translate into only 32 CPU instructions, since
TABLE II NUMBER OF BITWISE OPERATIONS FOR THE TWO-INPUT PRIMITIVE GATES
mispredictions during branching incur an execution penalty. As will be discussed in the experimental results, the proposed method, which eliminates branching, reduces the execution time of the state restoration algorithm. Moreover, the proposed method scales further when the executing platform has a higher bitwidth (e.g., 64-bit CPUs): the number of CPU instructions required to perform the principal operations does not change, while the number of clock cycles for which data can be reconstructed at once grows with the bitwidth of the platform. It should also be emphasized that digital circuits often contain more complex logic gates or gates with higher fan-in. These complex gates can either be decomposed into a hierarchy of two-input primitive gates (in which case the derived operations are applied to a circuit graph with more nodes, prolonging the computation time), or additional equations specific to their behavior can be generated.

IV. TRACE-SIGNAL SELECTION

The amount of data that can be restored by the state restoration algorithm depends on the initial set of data acquired from the DFD hardware. For instance, for the circuit shown in Fig. 3(a), if only FFE is sampled, no data can be restored for any other node in the circuit. In this section, two metrics that influence the selection of trace signals are presented. The first metric accounts for the topology of the CUD, while the second considers the logic behavior in addition to the circuit topology.

A. Restorability Using Circuit Topology

If a trace signal has large input and output logic cones, the likelihood of restoring data for other signals through forward and backward operations on this signal will be higher, because more parent and child nodes can be reached from the trace signal.
As a result, by analyzing the topology of a circuit, one can select a set of trace signals that can help restore missing data for the other signals in the circuit. This observation is captured by the equations shown in Fig. 6 for calculating the restorability of all the nodes in a circuit when a particular signal is monitored by the DFD hardware. We define the forward restorability of a node to be the likelihood of restoring the data of that node through forward propagation [Fig. 4(a)], while backward restorability represents the chance of restoring the data of the node from backward justification [Fig. 4(b)] or the combined operation [Fig. 4(c)]. When a node can be fully restored through forward (backward)
Fig. 6. Equations for restorability calculation using only topology.
operations, the forward (backward) restorability will be 1. If data cannot be reconstructed, the restorability will be 0. Fig. 6 shows the equations for calculating the forward restorability of the output, as well as the backward restorability of the inputs, of a gate. Although the figure shows an AND gate, the same set of equations applies to all the gates in a design. The forward restorability of a gate is calculated by summing the forward restorability of its inputs and dividing the sum by the number of inputs of the gate: the more parent signals of a node are monitored, the higher the chance that the data for that node can be restored by the proposed state restoration algorithm through forward operations. The backward restorability of an input of a gate is calculated by first finding the maximum backward restorability among its children, since the input can be restored through any one of its fan-out branches (shown in Fig. 6 as the term max{B(z)}). The result is then added to the forward restorability of the other inputs of the gate, because it is often insufficient to restore the input values from only the output values of a gate; if the values of the output and the other inputs are known, the chances of restoring the value of the targeted input are higher. The sum is then divided by the number of inputs of the gate to normalize the restorability to 1. With the equations that examine the topology of a circuit defined, Algorithm 2 shows how they are applied to aid the selection of trace signals. The algorithm uses a breadth-first-search approach to calculate the restorability values for all the nodes. The calculation starts by computing the forward restorability of all the child nodes of the first node in the search list (line 8).
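As a sketch, the two update rules of Fig. 6 for a two-input gate can be written as small C helpers; the function names and the array-based fan-out representation below are illustrative, not from the paper.

```c
/* Restorability update rules in the style of Fig. 6 for a two-input
   gate z = g(a, b).  Values lie in [0, 1]; 1 means the node's data can
   be fully restored. */

/* Forward restorability of the output: average over the gate inputs,
   since knowing more parents raises the chance of forward restoration. */
static double forward_restorability(double Ra, double Rb)
{
    return (Ra + Rb) / 2.0;
}

/* Backward restorability of input a: take the best backward value over
   all fan-out branches of a (any one branch can justify it), add the
   forward restorability of the other input b, and normalize by the
   number of gate inputs. */
static double backward_restorability_a(const double *Bz, int nfanout, double Fb)
{
    double best = 0.0;
    for (int i = 0; i < nfanout; i++)
        if (Bz[i] > best) best = Bz[i];
    return (best + Fb) / 2.0;
}
```

For instance, an FF feeding a two-input gate whose output is fully restorable (B = 1) while the other input is unknown gets a backward restorability of (1 + 0)/2 = 0.5, matching the second-iteration values of FFC and FFB in Fig. 7(a).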
It then computes the backward restorability of all the parent nodes of the same node (line 12). When sequential loops are found in a circuit, the algorithm iterates the forward and backward calculations for the nodes in the loop, because the state restoration algorithm may be able to restore data for multiple clock cycles by iterating in the loop. Fig. 7 illustrates the greedy nature of the algorithm when selecting the first trace signal using Algorithm 2 for the circuit shown in Fig. 3(a). In the figure, the F and B values represent the forward and backward restorability of each FF, respectively, for three iterations. Note that, for the sake of clarity, only the restorability values of the FFs are shown; in fact, the restorability of the logic gates between FFs is also calculated.

Algorithm 2: Algorithm for identifying trace signals incrementally
Input: Circuit, TB_width, Threshold
Output: signal_selection_list
 1  while cur_width < TB_width do
 2    while not all nodes in Circuit are calculated do
Fig. 7. Restorability calculation for three iterations using the topology metric. (a) Restorability values of circuit nodes when F F E is selected. (b) Restorability values of circuit nodes when F F C is selected.
 3      search_list = Get chosen nodes;
 4      Set initial values for chosen nodes;
 5      while search_list is not empty do
 6        cur_node = first node in search_list;
 7        for each (child_node of cur_node) do
 8          CalculateForward(child_node);
 9          if (new_value − old_value ≥ Threshold) then
10            Put child_node at end of search_list
11        for each (parent_node of cur_node) do
12          CalculateBackward(parent_node);
13          if (new_value − old_value ≥ Threshold) then
14            Put parent_node at end of search_list
15      Sum the restorability of all nodes in the circuit;
16    Select the node with highest restorability
17    cur_width++
18  Return signal_selection_list
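The greedy selection step (evaluate circuit restorability for every candidate, then keep the best; lines 15 and 16) can be sketched on a toy example. The four-node chain, the halving per propagation hop, and all names below are purely illustrative assumptions; the real algorithm applies the Fig. 6 equations over the full circuit graph until the Threshold test stops propagation.

```c
#define NODES 4

/* Circuit restorability of a hypothetical 4-node chain n0->n1->n2->n3
   when node 'traced' is sampled.  Every node is modeled as a two-input
   gate whose second input is never restorable, so each forward or
   backward hop halves the value, mimicking the Fig. 6 normalization. */
static double circuit_restorability(int traced)
{
    double r[NODES] = {0.0};
    r[traced] = 1.0;
    for (int i = 1; i < NODES; i++) {          /* forward propagation */
        double f = r[i - 1] / 2.0;
        if (f > r[i]) r[i] = f;
    }
    for (int i = NODES - 2; i >= 0; i--) {     /* backward justification */
        double b = r[i + 1] / 2.0;
        if (b > r[i]) r[i] = b;
    }
    double sum = 0.0;
    for (int i = 0; i < NODES; i++) sum += r[i];
    return sum;
}

/* Greedy selection of one trace signal: evaluate every candidate and
   keep the one yielding the highest circuit restorability. */
static int select_one_signal(void)
{
    int best = 0;
    double best_sum = circuit_restorability(0);
    for (int n = 1; n < NODES; n++) {
        double s = circuit_restorability(n);
        if (s > best_sum) { best_sum = s; best = n; }
    }
    return best;
}
```

On this chain, an interior node wins over an endpoint because its value propagates in both directions, which is exactly the intuition behind favoring signals with large input and output logic cones.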
In Fig. 7(a), the restorability values of all the signals in the circuit are shown when FFE is selected as the trace signal using Algorithm 2. In the first iteration, all F and B values are set to 0, except for FFE, whose values are set to 1 since it is traced. In the second iteration, the backward restorability values of FFC and FFB are calculated to be 0.5 according to the backward equation in Fig. 6. It should be noted that, although the forward restorability values of FFC and FFB should be 0 according to the forward equation, the values are updated to 0.5 at the end of the iteration. This reflects that, regardless of whether data are restored through a forward or a backward operation, the restored data can still be used for state restoration in the next iteration. The updated values are then used in the third iteration of Algorithm 2 for further calculation. It can be seen in Fig. 7 that, as the algorithm iterates further to propagate the metric, the restorability value of each node either stays the same or gradually increases (it never decreases). As a result, a user-defined parameter Threshold is provided to control the amount of metric propagation among circuit nodes and to limit the computation time of the trace-signal-selection algorithm (lines 9 and 13). This threshold checks the newly computed values against the ones from the previous
iteration. Obviously, the lower the threshold, the more effort the algorithm spends on calculating the restorability of each signal in the circuit. After the restorability values of all nodes are calculated and the Threshold parameter is satisfied, the calculated values are summed to give the restorability of the circuit if FFE is selected as a traced signal (line 15). To decide which node to select as the trace signal, the algorithm calculates the circuit restorability (the sum of the restorability values of all nodes in the circuit) for each candidate node, and then chooses the node that produces the highest circuit restorability (line 16). To select the targeted number of trace signals, Algorithm 2 incrementally calculates circuit restorability to select one signal at a time in a greedy manner. Using the circuit in Fig. 3(a) as an example when choosing two signals, and assuming that signal FFE is chosen in this iteration, Algorithm 2 then selects the second trace signal by trying the combinations FFE and FFA, FFE and FFB, . . ., until all other signals have been tried together with FFE as trace signals. It then chooses the signal pair that produces the highest restorability values across the circuit as trace signals. This gradual approach to trace-signal selection follows the same philosophy as how new states are chosen to be probed during microprocessor debug, where signals are also selected incrementally to determine what additional data can be gathered from the microprocessor. Although Algorithm 2 can identify trace signals that give a high restoration ratio by repeatedly propagating the restorability metric with each incremental signal selection, this can lead to a prohibitively high computation time for large circuits.
Thus, we introduce Algorithm 3, which greedily selects trace signals by estimating how much data from other signals can be restored when each signal is chosen as a trace signal.

Algorithm 3: Algorithm for selecting trace signals using coverage estimation
Input: Circuit, TB_width
Output: signal_selection_list
 1  while not all signals in Circuit are evaluated do
 2    search_list = one of the nonevaluated signals
 3    Set initial values for chosen signal;
 4    while search_list is not empty do
 5      cur_node = first node in search_list;
 6      for each (child_node of cur_node) do
 7        if (number of parent nodes with nonzero restorability values has increased since the last visit) then
 8          CalculateForward(child_node);
 9          Put child_node at end of search_list;
10      for each (parent_node of cur_node) do
11        if (number of child nodes with nonzero restorability values has increased since the last visit) then
12          CalculateBackward(parent_node);
13          Put parent_node at end of search_list;
14    Record restorability values of all nodes in the circuit for current signal selection;
15  cur_high_value = 0.9;
16  while (cur_width < TB_width) do
17    chosen_signal = FindMaxCover(cur_high_value)
18    if (chosen_signal == NULL) then
19      cur_high_value = cur_high_value − 0.1
20    else
21      Put chosen_signal in signal_selection_list;
22      cur_width++;
23  Return signal_selection_list;

To estimate how much data can be restored when a signal is traced, Algorithm 3 first sets the restorability value of the selected signal to 1 (line 3). It then employs a breadth-first-search approach to propagate the restorability value, using the equations in Fig. 6, to its child and parent nodes through forward propagation (line 8) and backward justification (line 12), respectively, in the same way as Algorithm 2 propagates the restorability metric among circuit nodes when selecting the first signal. However, instead of a user-defined parameter, another means is provided to control the amount of metric propagation among circuit nodes and thus to limit computation time. Whenever a node is visited, the algorithm only proceeds to update the restorability value of the current node through forward propagation when the number of parent nodes with nonzero restorability values has increased since the last time the node was visited (line 7). Likewise, the restorability values of its child nodes are checked to determine whether backward justification should be performed (line 11). These checks ensure that a node is revisited only if the changes in its restorability value come from a different circuit path than the last time the node was visited. Algorithm 3 selects trace signals based on how many other signals are likely to be covered after state restoration. Using the restorability values in Fig.
7(b) as an example, when FFC is selected as the trace signal, FFD will also be covered, since data for FFD can be fully reconstructed during state restoration, as indicated by the restorability value of 1 for FFD. As a result, although either FFC or FFD would likely help reconstruct data for other signals if selected as a trace signal, selecting both together will not help reconstruct more data for the circuit shown in Fig. 3(a). This is why Algorithm 3 analyzes how many signals will be covered (line 17) and greedily selects trace signals so that the maximum number of signals is covered, in order to achieve a high restoration ratio. It should be noted that the restorability value of a signal need not be 1 before it can be classified as covered. For example, FFD is the only signal that achieves a restorability value of 1 when FFC is selected as the trace signal, as shown in Fig. 7(b). In this case, Algorithm 3 gradually relaxes the requirement by lowering the targeted restorability value used for classifying signal coverage whenever no signal can be found that covers additional signals (line 19). This coverage-based technique for selecting trace signals in Algorithm 3 differs from the incremental trace-signal-selection technique of recalculating the restorability values of the circuit with each additional signal selection
Fig. 8. Equations for restorability calculation using topology + logic.
in Algorithm 2. As will be shown by the experimental results, although the signal selection method in Algorithm 2 may select trace signals that yield a higher restoration ratio during silicon debug, the computation time for repeatedly recalculating restorability values with each incremental signal selection can become prohibitively high for large circuits. On the other hand, by selecting trace signals that give high coverage for restoring data in other signals, without re-evaluating the restorability values of the circuit, a good restoration ratio can still be achieved, while the computation time of Algorithm 3 stays within an acceptable range.

B. Restorability Using Topology and Logic Gate Behavior

To further refine the equations for calculating the restorability of a node, the logic behavior of each gate can be taken into consideration. The new metric, which considers both the topology and the logic behavior of a gate, is shown in Fig. 8. In the new equations, the restorability is further divided into F0 and F1, which represent the likelihood of restoring a logic 0 and 1, respectively, on the output of a gate using forward operations. Likewise, B0 and B1 denote the chance of restoring a logic 0 and 1, respectively, on the input of a gate using backward operations. Note that the equations for the AND, OR, and XOR gates differ, as shown in Fig. 8, due to their different logic behaviors. For the AND gate, when any input is logic 0, the output can be concluded to be logic 0, whereas to restore a logic 1 on the output through forward operations, both inputs have to be logic 1. On the other hand, an input of an AND gate can be justified as logic 1 whenever the output is logic 1, while to justify an input of an AND gate to logic 0, the output of the gate has to be logic 0 and the other inputs have to be logic 1. The equations for the other primitive gates are derived using similar reasoning.
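The per-value refinement for the AND gate can be illustrated with the following sketch. These formulas are only a plausible concretization of the behavior described above (a 0 on either input forces the output to 0; a 1 output needs both inputs at 1; an output of 1 justifies an input of 1; justifying an input 0 needs an output 0 plus the other input at 1); the exact equations in Fig. 8 may differ in their normalization, and all names here are illustrative.

```c
/* Split restorability values for the topology + logic metric:
   F0/F1 = chance of restoring a 0/1 forward, B0/B1 = backward.
   Hypothetical concretization; the exact Fig. 8 forms may differ. */
typedef struct { double F0, F1, B0, B1; } restor;

/* Forward restorability of z = a AND b. */
static void and_forward_restor(const restor *a, const restor *b, restor *z)
{
    z->F0 = (a->F0 > b->F0) ? a->F0 : b->F0;  /* any one input 0 forces z = 0 */
    z->F1 = (a->F1 + b->F1) / 2.0;            /* z = 1 needs both inputs at 1 */
}

/* Backward restorability of input a, given the best backward values seen
   over a's fan-out branches (maxB0z, maxB1z) and the other input b. */
static void and_backward_restor_a(double maxB0z, double maxB1z,
                                  const restor *b, restor *a)
{
    a->B1 = maxB1z;                           /* z = 1 alone implies a = 1 */
    a->B0 = (maxB0z + b->F1) / 2.0;           /* needs z = 0 and b = 1 */
}
```

The asymmetry between F0 and F1 (and between B0 and B1) is the extra information this metric carries over the purely topological one.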
It should be noted that, although both Algorithms 2 and 3 can be used with either of the proposed metrics, the CPU runtime increases when signals are chosen using the second metric, because the restorability equations for each node are more complex as a result of the increased accuracy of the metric.
Fig. 9. Restorability calculation for three iterations using the topology + logic metric. (a) Restorability values of circuit nodes when F F E is selected. (b) Restorability values of circuit nodes when F F C is selected.
Fig. 9 shows the restorability values calculated using the equations from Fig. 8 for all the FFs in the circuit from Fig. 3(a) for three iterations. It can be seen from Figs. 7 and 9 that FFC should be chosen as the trace signal, since in both cases, it covers more signals with higher restorability values in the circuit. Note the differences between restorability values in other signals in Figs. 7 and 9 when using the two proposed metrics for signal selection. This is due to the further refinement in the equations that consider the topology and logic behavior of the circuit for restorability calculation. When working with large circuits, this refinement can help better evaluate the signal coverage when choosing the trace signals. One may argue that the proposed restorability equations that consider both the topology and behavior of logic gates (shown in Fig. 8) resemble the Sandia Controllability/Observability Analysis Program (SCOAP) concept that guides the ATPG process for manufacturing tests [31]. There is, however, a fundamental difference between the proposed restorability metrics and the SCOAP metrics. SCOAP captures the controllability/ observability of the internal circuit nodes by measuring how the assignments on neighboring signals will affect the targeted signal. Thus, the controllability/observability of a node does not give any information on how the specific value obtained for the targeted node can help restore data for other circuit nodes. For instance, for the circuit in Fig. 3(a), the SCOAP measure will identify FFE as a hard-to-control signal due to the presence of the XOR gate. However, tracing only FFE will not be able to restore data for any other signals in the circuit. Unlike SCOAP, the proposed restorability metrics capture how the data acquired on a node during a debug session will give more information about the internal state of the circuit. V. 
EXPERIMENTAL RESULTS

Experimental studies from [7] indicate that trace buffers with sizes from 1 k × 8 (i.e., a depth of 1024 and a width of 8 bits) to 8 k × 32
TABLE III STATE RESTORATION RESULTS FOR S38584 WHEN TRACE SIGNALS ARE SELECTED RANDOMLY AND CONTROL SIGNALS ARE DRIVEN RANDOMLY
TABLE IV STATE RESTORATION RESULTS FOR S38417 WHEN TRACE SIGNALS ARE SELECTED RANDOMLY AND CONTROL SIGNALS ARE DRIVEN RANDOMLY
are acceptable in practice today. If one needs to debug larger logic blocks, it is common for the trace buffer to be used as a time-shared resource [22]. Time sharing is also justified by the fact that, if at most 32 bits per sample are acquired, it is difficult to restore values for more than 2000 FFs (which normally belong to a logic block in the range of 50 000 gates). Given this expected logic block size, we perform our experiments on the three largest ISCAS89 benchmark circuits [32] (i.e., s38584, s38417, and s35932), which fit the gate-count range for acceptable trace-buffer sizes today. Moreover, since the ISCAS89 benchmark circuits are publicly available, we hope that future proposals in this emerging area can benchmark their algorithms against ours. For our experiments, the state restoration algorithm and the two trace-signal-selection algorithms are implemented in ANSI C and executed on a PC with dual Xeon processors at 2.4 GHz and 1 GB of RAM. Also, all high fan-in gates are decomposed into two-input logic gates when translating the ISCAS circuits into circuit graphs for both the state restoration and trace-signal-selection algorithms. Tables III–V compare the state restoration ratio and the data reconstruction time between the implementation exploiting bitwise parallelism, as discussed in Section III-B (columns labeled Nonbranching), and the if-then-else implementation
TABLE V STATE RESTORATION RESULTS FOR S35932 WHEN TRACE SIGNALS ARE SELECTED RANDOMLY AND CONTROL SIGNALS ARE DRIVEN RANDOMLY
(columns labeled Branching) for s38584, s38417, and s35932, respectively. In these tables, the trace-signal selection was done randomly, using five different seeds for the pseudorandom generator in ANSI C. Also, five different sets of random data are generated for all the primary inputs (both control and data inputs) of the circuits. The random data on the primary inputs are then fed to a simulator to obtain the debug data on the trace signals for state restoration. As shown in these tables, the computation time of the data restoration algorithm using the parallel equations is, on average, about four to five times less than that of the nonaccelerated method. The speedup falls short of the theoretical upper bound of 32× (since the accelerated algorithm restores data for 32 clock cycles at a time on a 32-bit machine) because restoring data for 32 clock cycles with the accelerated method involves more than one CPU instruction, as shown in Table II. On the other hand, the nonaccelerated method needs one if-then-else statement to perform the principal operations for each data point, resulting in 32 if-then-else statements to restore data for 32 clock cycles. However, the number of CPU instructions executed to restore each data point with the nonaccelerated method depends on how branches are taken during program execution. As verified by the results, the total number of CPU instructions required to perform data restoration across multiple clock cycles using the nonaccelerated method is higher than that of the accelerated method. Moreover, when machines with higher bitwidths are available, the accelerated method scales to utilize the larger bitwidth and further speeds up the state restoration process compared with the if-then-else implementation, as discussed in Section III-B.
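The two implementation styles compared here can be reproduced with the small sketch below for a single AND gate: a branch-free word-parallel step versus a per-cycle if-then-else loop over the same 32 cycles. The encoding and names are the same illustrative assumptions as before (logic 0 as the pair 00, logic 1 as 11, unknown as 01/10); both variants compute identical results, and only their instruction-level behavior differs.

```c
#include <stdint.h>

typedef struct { uint32_t v0, v1; } sig32;

/* Branch-free: one fixed sequence of bitwise operations restores the
   unknown cycles of z for all 32 clock cycles at once. */
static sig32 and_forward_parallel(sig32 z, sig32 a, sig32 b)
{
    uint32_t unk = z.v0 ^ z.v1;
    sig32 r;
    r.v0 = (~unk & z.v0) | (unk & (a.v0 & a.v1 & b.v0 & b.v1));
    r.v1 = (~unk & z.v1) | (unk & ((a.v0 | a.v1) & (b.v0 | b.v1)));
    return r;
}

/* Branching: one if-then-else per clock cycle; branch mispredictions,
   not the instruction count alone, now determine the runtime. */
static sig32 and_forward_branching(sig32 z, sig32 a, sig32 b)
{
    for (int c = 0; c < 32; c++) {
        uint32_t m = 1u << c;
        if ((z.v0 ^ z.v1) & m) {               /* overwrite only unknowns */
            uint32_t nz0 = (a.v0 & a.v1 & b.v0 & b.v1) & m;
            uint32_t nz1 = ((a.v0 | a.v1) & (b.v0 | b.v1)) & m;
            z.v0 = (z.v0 & ~m) | nz0;
            z.v1 = (z.v1 & ~m) | nz1;
        }
    }
    return z;
}
```

Because the branch-free variant is a fixed instruction sequence, widening the word type (e.g., to uint64_t on a 64-bit CPU) doubles the number of cycles restored per call at no extra instruction cost, which is the scaling argument made above.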
Another interesting point to note is that in Table IV, with the same trace-buffer depth and width, by randomly selecting different trace signals, the restoration time increases significantly even though the restoration ratio decreases (which indicates less data to be restored). This is because, if the sampled signal resides in a sequential loop and the missing data in the loop cannot be reconstructed by the side signals that are connected to the gates from this loop, the restoration algorithm may have to iterate in the loop to restore data one clock cycle at a time. In this case, more decisions will need to be made by the algorithm
TABLE VI STATE RESTORATION RESULTS FOR S38584 WHEN GLOBAL CONTROL INPUTS ARE DRIVEN RANDOMLY AND DETERMINISTICALLY
TABLE VII STATE RESTORATION RESULTS FOR S38417 WHEN GLOBAL CONTROL INPUTS ARE DRIVEN RANDOMLY AND DETERMINISTICALLY
TABLE VIII STATE RESTORATION RESULTS FOR S35932 WHEN GLOBAL CONTROL INPUTS ARE DRIVEN RANDOMLY AND DETERMINISTICALLY
to check if the newly restored data can help reconstruct any other data in the neighboring nodes in every clock cycle, thus increasing the restoration time. It is also interesting to note that when the trace-buffer depth and width are fixed, sampling different signals varies the amount of data that can be reconstructed even if the set of input data is identical. This is why, in this paper, two metrics are proposed in Section IV to guide the trace-signal-selection process. Tables VI–VIII give the state restoration ratios and the restoration times for s38584, s38417, and s35932, respectively, when comparing different signal selection methods. For the results generated using Algorithm 2, a threshold parameter of 0.1 is used. In all the tables, the restoration ratios are obtained by comparing the total number of restored values versus the number of trace signals allowed on-chip. It is obvious that the
lower the number of trace signals, the more data needs to be restored. The experiments are carried out for trace-buffer depths of 1, 4, and 8 k, and trace-buffer widths of 8, 16, and 32. As in the previous experiments discussed in Tables III–V, all but the last column of the reported results in Tables VI–VIII are based on five different sets of debug data obtained by simulating randomly generated data on all primary inputs. Although using random data on the primary inputs to obtain the debug data on the trace signals is sufficient for evaluating the proposed metrics and algorithms for trace-signal selection, one might argue that, during an actual debug experiment, the global control signals of a circuit will not be driven randomly. For example, a global reset signal that is active low should be driven to zero for a few clock
cycles to bring the CUD into a known state. After that, a constant 1 should be applied to the reset signal for the remainder of the debug experiment. This is why, in the last column of Tables VI–VIII, the best restoration ratios from the two proposed trace-signal-selection metrics and algorithms are reported when the debug data are obtained by driving the global control signals deterministically. We have identified a global synchronous reset signal in both s38584 and s35932, and two global control signals, which define the operating mode, in s35932. It can be seen that, when the control signals are driven deterministically, the restoration ratios are smaller than with randomly generated data on all the primary inputs. This is because randomly controlling the reset signals in s38584 and s35932 brings the circuits into the reset state multiple times; each time a reset state is reached, the data in all FFs can easily be restored, resulting in very high restoration ratios. When no such control signals are present in a circuit, as in the case of s38417, the last column of Table VII simply gives the largest restoration ratio for each trace-buffer configuration across the different trace-signal-selection metrics and algorithms. Note that, for s38584 and s35932, the last column gives the best restoration ratios when the global control inputs are held constant (at their noncontrolling values) during the debug experiments. There are several important points to note. First, using either of the proposed metrics for trace-signal selection gives higher restoration ratios than random signal selection. Second, using the metric that considers both circuit topology and logic behavior (Fig. 8) helps Algorithms 2 and 3 select trace signals with higher restoration ratios than the less sophisticated metric that considers only circuit topology (Fig.
6), except when selecting 16 trace signals for s38584 using Algorithm 3. This is because, when selecting 16 signals for s38584 using either of the proposed metrics, the differences in signal coverage between candidate signals are very small. In this case, the algorithm may conclude that the coverage from each candidate signal is virtually the same, and the greedy nature of the algorithm will not help select the proper trace signals. On the other hand, when the topology- and logic-behavior-based metric yields a higher variation in the calculated restorability values (as in the case of s35932), Algorithm 3 is able to identify the high-coverage signals and thus select a set of trace signals that yields high restoration ratios. Another interesting point is that, for s38584, even though the restoration ratios are lower, which indicates less data to restore, the restoration times are actually longer. This is for the same reason discussed for the variations in restoration time in Table IV: if the chosen trace signals reside in a sequential loop, data may have to be reconstructed one step at a time. It is also interesting to note that increasing the trace-buffer depth for s35932 does not improve the restoration ratios, due to the low sequential depth of the circuit. This is unlike s38584, where the sequential depth is larger, and it is visible from the results that, as the trace-buffer depth increases, better ratios are achieved for the same trace-buffer widths. This obviously comes at the expense of higher restoration time, which is
still within an acceptable range of a few minutes. Another factor that significantly contributes to a high restoration ratio is the presence of large fan-ins, as in the case of s35932. When using the metric that considers only the topology of a circuit for signal selection, one can already improve the restoration ratios compared with random signal selection. Another interesting point is that, when the trace-buffer width is increased, more data are restored, however, at a lower rate than the increase in the number of trace signals; one notable exception is s35932 when increasing the number of trace signals from 8 to 16 and driving the global control inputs deterministically. When comparing the restoration ratios between the trace-signal-selection algorithms, it can be seen that the results from Algorithm 2 are better when the metric that considers both the circuit topology and logic behavior is used. This is because, with the threshold parameter set to 0.1, Algorithm 2 spends more effort on recalculating the restorability values whenever a new trace signal is selected, which gives a more accurate evaluation of how much data may be restored with the chosen trace signals. However, this increased accuracy comes with two limitations. First, when the less accurate metric that considers only the circuit topology is used for signal selection, Algorithm 2 does not perform well against Algorithm 3 for s38584 and s35932: combining the monotonically increasing nature of the metric with the high computation effort, Algorithm 2 may overestimate the restorability values of signals placed in sequential loops. Algorithm 3, which calculates the restorability values only once, is not prone to inflating the restorability values in sequential loops. The second limitation of Algorithm 2 is its prohibitively high computation time, which is in the range of tens of hours for s35932 when selecting 32 trace signals.
This is significantly larger than for Algorithm 3, which calculates the restorability values only once and then selects trace signals by evaluating how many signals will be covered during state restoration. When Algorithm 3 is used to select 32 trace signals, the runtime is reduced to within one hour. Another interesting observation when comparing the runtimes of the two trace-signal-selection algorithms is that the runtime of Algorithm 2 scales proportionally with the number of trace signals, while the runtime of Algorithm 3 stays the same. This is because Algorithm 2 selects signals incrementally by re-evaluating the restorability metrics during each selection, while Algorithm 3 calculates the restorability values only once, no matter how many trace signals are selected. One last important point to discuss is the runtime of the state restoration algorithm. Unlike the trace-signal-selection algorithms, which are run only once during implementation, the state restoration algorithm is run repeatedly after each debug session, as shown in Fig. 2. Thus, it is important for the state restoration algorithm to be compute-efficient. As shown in Tables VI–VIII, the runtime of this algorithm is in the range of a few seconds to minutes. These results have been obtained using the bitwise parallelism exploited by the technique from Section III-B. Note that the state restoration runtimes are greater than those observed for the same trace-buffer capacity in Tables III–V. However, this increase in runtime is due to the larger restoration ratios caused by the improved choice of trace signals, where
the random trace-signal selection has been replaced by the deterministic Algorithms 2 and 3. Having more useful debug data extracted from the circuit obviously causes the state restoration algorithm to process more circuit nodes over the same number of clock cycles. Nonetheless, despite this increase, the runtime is practical, and the state restoration algorithm fits seamlessly between the data acquisition and analysis steps.

VI. CONCLUSION

Chip designers rely primarily on design knowledge and intuition to decide which signals to probe and how many state elements to observe. Because design complexity will continue to increase, structured debug methods with automated support will become crucial for shortening the debug cycle during silicon debug. In this paper, we have introduced a compute-efficient algorithm for state restoration. We have also presented two new metrics and two algorithms for automatically selecting trace signals. These algorithms show how, by consciously choosing only a small number of signals to be probed in real time, the observability of the CUD can be improved.

REFERENCES

[1] H. F. Ko and N. Nicolici, "Automated trace signals identification and state restoration for improving observability in post-silicon validation," in Proc. IEEE/ACM Des. Autom. Test Eur., 2008, pp. 1298–1303.
[2] W. K. Lam, Hardware Design Verification: Simulation and Formal Method-Based Approaches. Englewood Cliffs, NJ: Prentice-Hall, 2005.
[3] M. L. Bushnell and V. D. Agrawal, Essentials of Electronic Testing for Digital, Memory and Mixed-Signal VLSI Circuits. Boston, MA: Kluwer, 2000.
[4] B. Vermeulen and S. K. Goel, "Design for debug: Catching design errors in digital chips," IEEE Des. Test Comput., vol. 19, no. 3, pp. 35–43, May 2002.
[5] H. Balachandran, K. Butler, and N. Simpson, "Facilitating rapid first silicon debug," in Proc. IEEE Int. Test Conf., Oct. 2002, pp. 628–637.
[6] B. Vermeulen, "Special session: Why doesn't my system work?" in Proc. 43rd IEEE/ACM Des. Autom. Conf., 2006.
[7] M. Abramovici, "Experience and opinion (design for debug)," in Proc. IEEE Int. Workshop Silicon Debug Diagnosis, 2006.
[8] A. Khoche and D. Conti, "TRP in action: Embedded instrumentation in FPGAs," in Proc. 24th IEEE VLSI Test Symp., 2006, pp. 152–153.
[9] N. Nataraj, T. Lundquist, and K. Shah, "Fault localization using time resolved photon emission and STIL waveforms," in Proc. IEEE Int. Test Conf., Sep. 2003, pp. 254–263.
[10] M. Abramovici, P. Bradley, K. Dwarakanath, P. Levin, G. Memmi, and D. Miller, "A reconfigurable design-for-debug infrastructure for SoCs," in Proc. IEEE/ACM Des. Autom. Conf., 2006, pp. 7–12.
[11] ARM Limited, Embedded Trace Macrocells, Apr. 2007. [Online]. Available: http://www.arm.com/products/solutions/ETM.html
[12] P. Dahlgren, P. Dickinson, and I. Parulkar, "Latch divergency in microprocessor failure analysis," in Proc. IEEE Int. Test Conf., 2003, pp. 755–763.
[13] O. Caty, P. Dahlgren, and I. Bayraktaroglu, "Microprocessor silicon debug based on failure propagation tracing," in Proc. IEEE Int. Test Conf., 2005, pp. 284–293.
[14] G. Van Rootselaar and B. Vermeulen, "Silicon debug: Scan chains alone are not enough," in Proc. IEEE Int. Test Conf., 1999, pp. 892–902.
[15] D. Josephson, "The manic depression of microprocessor debug," in Proc. IEEE Int. Test Conf., Oct. 2002, pp. 657–663.
[16] D. Josephson and B. Gottlieb, "The crazy mixed up world of silicon debug," in Proc. IEEE Custom Integr. Circuits Conf., 2004, pp. 665–670.
[17] C. MacNamee and D. Heffernan, "Emerging on-chip debugging techniques for real-time embedded systems," IEE Comput. Control Eng. J., vol. 11, no. 6, pp. 295–303, Dec. 2000.
[18] Altera Verification Tool, SignalTap II Embedded Logic Analyzer, 2006. [Online]. Available: http://www.altera.com/products/software/products/quartus2/verification/signaltap2/sig-index.html
[19] Xilinx Verification Tool, ChipScope Pro, 2006. [Online]. Available: http://www.xilinx.com/ise/optional_prod/cspro.htm
[20] K. Morris, "On-chip debugging—Built-in logic analyzers on your FPGA," J. FPGA Structured ASIC, vol. 2, no. 3, Jan. 2004.
[21] S. Sarangi, S. Narayanasamy, B. Carneal, A. Tiwari, B. Calder, and J. Torrellas, "Patching processor design errors with programmable hardware," IEEE Micro—Special Issue: Micro's Top Picks from Computer Architecture Conferences, vol. 27, no. 1, pp. 12–25, Jan./Feb. 2007.
[22] M. Riley and M. Genden, "Cell broadband engine debugging for unknown events," IEEE Des. Test Comput., vol. 24, no. 5, pp. 486–493, Sep./Oct. 2007.
[23] R. Leatherman and N. Stollon, "An embedded debugging architecture for SOCs," IEEE Potentials, vol. 24, no. 1, pp. 12–16, Feb./Mar. 2005.
[24] A. Mayer, H. Siebert, and K. McDonald-Maier, "Boosting debugging support for complex systems on chip," Computer, vol. 40, no. 4, pp. 76–81, Apr. 2007.
[25] M. Burtscher, I. Ganusov, S. Jackson, J. Ke, P. Ratanaworabhan, and N. Sam, "The VPC trace-compression algorithms," IEEE Trans. Comput., vol. 54, no. 11, pp. 1329–1344, Nov. 2005.
[26] E. Anis and N. Nicolici, "On using lossless compression of debug data in embedded logic analysis," in Proc. IEEE Int. Test Conf., 2007, pp. 1–10.
[27] E. Anis and N. Nicolici, "Low cost debug architecture using lossy compression for silicon debug," in Proc. IEEE/ACM Des. Autom. Test Eur., 2007, pp. 1–6.
[28] Y.-C. Hsu, F. Tsai, W. Jong, and Y.-T. Chang, "Visibility enhancement for silicon debug," in Proc. IEEE/ACM Des. Autom. Conf., 2006, pp. 13–18.
[29] J. Huang, C. James, and D. Walker, VLSI Test Principles and Architectures: Design for Testability. New York: Academic, 2006.
[30] G. D. Micheli, Synthesis and Optimization of Digital Circuits. New York: McGraw-Hill, 1994.
[31] L. H. Goldstein, "Controllability/observability analysis of digital circuits," IEEE Trans. Circuits Syst., vol. CAS-26, no. 9, pp. 685–693, Sep. 1979.
[32] F. Brglez, D. Bryan, and K. Kozminski, "Combinational profiles of sequential benchmark circuits," in Proc. IEEE Int. Symp. Circuits Syst., 1989, pp. 1929–1934.
Ho Fai Ko (S’05) received the M.A.Sc. degree from McMaster University, Hamilton, ON, Canada, in 2004. He is currently working toward the Ph.D. degree in the Department of Electrical and Computer Engineering, McMaster University. His research interests are in the area of computer-aided design, with special emphasis on test and debug.
Nicola Nicolici (S’99–M’00) received the Dipl.-Ing. degree in computer engineering from the University of Timisoara, Timisoara, Romania, in 1997 and the Ph.D. degree in electronics and computer science from the University of Southampton, Southampton, U.K., in 2000. He is currently an Associate Professor with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON, Canada. His research interests are in the area of computer-aided design and test. He is the author of a number of papers in this area.