Debugging of Systems-on-a-Chip
Gert Jan van Rootselaar, Frank Bouwman, Erik Jan Marinissen, Math Verstraelen Philips Research Laboratories, Prof. Holstlaan 4, M/S WAY-41, 5656 AA Eindhoven, The Netherlands tel. +31 (0)40 274 2720 fax. +31 (0)40 274 4657
[email protected]
Abstract: For today's multi-million transistor designs, existing design verification techniques cannot guarantee that first silicon is designed error free. Therefore, techniques are necessary to efficiently debug first silicon. In this article we show how scan-based debug can be used in a multiple clock domain system-on-a-chip. Both the necessary hardware measures, referred to as design for debug (DfD), and the required (debugger) software are considered. Special attention is paid to clock controller requirements.
Keywords: Debug, Test, Validation.
I. Introduction
Modern process technologies and design tools allow the realization of very large and complex systems on a single die. Because of the increased system complexity, improvements in integrated circuit (IC) verification techniques are necessary. For today's multi-million transistor designs, existing design verification techniques such as simulation, formal verification, static timing analysis, and emulation cannot guarantee that first silicon is designed error free. There are two reasons for this: (1) the methods are applied to a model of the IC, and (2) none of the methods can be applied exhaustively because of the computational cost involved. Time-to-market pressure demands that design errors are found quickly. Therefore, techniques are necessary to efficiently debug first silicon.
A number of techniques exist to debug implemented ICs. These include e-beam probing [1], [2], the use of needles, dedicated pins that bring signals from inside the chip to the outside [1], and scan-based debug [3], [1], [2], [4]. Because direct probing using needles or an e-beam is often difficult or impossible, e.g. for a multiple metal layer IC, we want some way of accessing the memory elements through the pins of the IC. In this article we show how scan-based debug can be used in a multiple clock domain system-on-a-chip. Both the necessary hardware measures, referred to as design for debug (DfD), and the required (debugger) software are considered. This paper does not discuss software debugging.
First the general requirements for debugging ICs are introduced in Section II. Design for Debug is treated in Section III. Debugger software is considered in Section IV. Conclusions and acknowledgments follow in Sections V and VI, respectively.
II. Requirements
In order to debug ICs, it must be possible to apply known stimuli to, and observe responses from, the terminals of the IC and the terminals of internal state-holding elements, such as flip-flops and RAMs. Errors are found by applying stimuli and comparing the actual responses with expected responses. The pins can be accessed through a VLSI tester or other hardware, such as an application test board. To allow internal access through the pins, additional hardware must be present in the IC. We distinguish between a transport process and an interface process, both in the IC. During transport, data is copied from the flip-flops inside the IC to an interface block and vice versa. The interface process communicates the data with the outside world via the pins.
There are a number of ways in which the transport can be accomplished: registers can be accessed via functional paths in the IC (e.g. register files can be accessed via the bus), or dedicated debug hardware can be used. For hardware debugging, it must be possible to access flip-flops. Typically, only a subset of the flip-flops can be accessed directly via functional paths. In scan-testable designs, however, the scan chains (which are present for production testing) can provide an access path to a much larger set of flip-flops. In full scan
Proceedings of the ProRISC Workshop on Circuits, Systems and Signal Processing 1997
designs there is even a path to all flip-flops. The disadvantage of using the scan chains for transport is that all the flip-flops in a scan chain are used during scanning. This means that if we want to access some flip-flop in scan chain c, all clock domains in the IC that contain flip-flops which are part of chain c must be halted. In the following, we consider only the scan-based approach for providing transport.
In a typical scenario, the IC first executes functionally for a while, after which (part of) it is halted to provide access to the internal state. After this, single-step operation can be used to read and write internal signal traces. There are two reasons not to single-step from t = 0: (1) single-stepping is several orders of magnitude slower than normal operation, and (2) the interaction of a halted subsystem in the IC with the environment is impaired if the environment does not support single-stepping.
The requirements placed on the interface process depend on the environment in which the IC must be debugged. If the IC must be debugged on an application board, the interfacing must take place via a dedicated port because the other pins are not accessible for writing.
The entire debugging process must be controlled from software executing on a computer. The software must provide the user access to the terminals of internal flip-flops. In order to make debugging practical, it must be possible for users to specify their accesses in terms of VHDL [5] signal names, Verilog signal names, or flip-flop names, just as they do during simulations. It must be possible to communicate signal traces between the IC and a simulator to allow comparisons and silicon/model co-execution.
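The requirement that users address flip-flops by their simulator names implies some table that resolves a name to a physical scan location. The following Python fragment is a minimal sketch of such a table, not a description of any actual tool. The chain contents, the hierarchical names, and the helper names `ScanPosition` and `build_name_map` are all made up for illustration.

```python
from typing import Dict, List, NamedTuple


class ScanPosition(NamedTuple):
    """Location of one flip-flop in the scan structure (hypothetical layout)."""
    chain: int   # scan chain number, 1..C
    offset: int  # position within that chain, 0 = nearest the scan input


def build_name_map(chains: Dict[int, List[str]]) -> Dict[str, ScanPosition]:
    """Index every flip-flop name by its chain number and offset."""
    name_map: Dict[str, ScanPosition] = {}
    for chain, names in chains.items():
        for offset, name in enumerate(names):
            if name in name_map:
                raise ValueError(f"duplicate flip-flop name: {name}")
            name_map[name] = ScanPosition(chain, offset)
    return name_map


# Made-up example: two chains with simulator-style hierarchical names.
CHAINS = {
    1: ["core/alu/carry_reg", "core/alu/zero_reg"],
    2: ["dma/state_reg[0]", "dma/state_reg[1]", "dma/busy_reg"],
}
NAME_MAP = build_name_map(CHAINS)
```

With such a table, a request for `dma/busy_reg` resolves to chain 2, offset 2, which is the information the transport process needs in order to know how far to scan.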
III. Design for Debug
While accessing the data via the scan chains, parts of the IC cannot be functionally active. Data can be written to and read from the flip-flops by repeatedly (1) scanning in and out the desired data, and (2) applying functional clock cycles. The transport through scan chains is explained in Section III-A. The application of clock cycles is considered in more detail in Section III-B. The details of halting (parts of) the IC are considered in Section III-C. In reference [6] similar issues are considered for a single clock domain system.
A. Transport through scan chains
In order to preserve the state of the flip-flops that are not to be overwritten, their original values must be scanned back into them before a functional clock cycle is applied. This can be achieved by making the scan chains circular. This concept is illustrated in Figure 1. In this figure we see C scan chains, numbered 1 through C, in the inner dashed box. These are the scan chains as present for production testing. Generally, ICs contain multiple scan chains in order to reduce the test time. By adding a debug shell around the set of scan chains, a circular chain can be made. For production testing, the concatenate signal is made 0, and the shell is transparent. When scanning data during debugging, concatenate is made 1. This creates one long scan chain from the leftmost s_i in scan chain 1 to the rightmost s_o in scan chain C.
[Figure 1: a DEBUG_SHELL wrapped around a CLOCK_DOMAIN containing the flip-flops of scan chains 1, 2, ..., C, each with scan input s_i and scan output s_o. Shell signals: clock, scan_enable, concatenate, circular, probe_i, and probe_read. 'clock' is connected to the clock input of every flip-flop (not drawn); 'scan_enable' is connected to the scan_enable input of every flip-flop (not drawn).]
Fig. 1. Circular scan chain
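The state-preserving property of the circular chain can be checked with a small behavioral model. The Python sketch below assumes that index 0 of a chain is the flip-flop nearest the scan input and that the last index drives the output. This indexing, the function names, and the example bit values are illustrative assumptions, not taken from the design.

```python
from typing import List, Tuple


def concatenate_chains(chains: List[List[int]]) -> List[int]:
    """Model concatenate=1: chains 1..C behave as one long chain."""
    return [bit for chain in chains for bit in chain]


def scan_clock(chain: List[int], circular: int, probe_i: int = 0) -> Tuple[List[int], int]:
    """Apply one clock pulse with scan_enable=1.

    Returns the new chain state and the bit that was visible at the output
    before the pulse. With circular=1 that bit is fed back to the input;
    with circular=0 the value on probe_i is shifted in instead.
    """
    out_bit = chain[-1]
    in_bit = out_bit if circular else probe_i
    return [in_bit] + chain[:-1], out_bit


def full_rotation(chain: List[int]) -> Tuple[List[int], List[int]]:
    """Shift a circular chain by its own length; return (final state, observed bits)."""
    state = list(chain)
    observed = []
    for _ in range(len(chain)):
        state, out_bit = scan_clock(state, circular=1)
        observed.append(out_bit)
    return state, observed


# Made-up contents for three short chains.
LONG_CHAIN = concatenate_chains([[1, 0, 1], [0, 0], [1, 1, 0]])
RESTORED, OBSERVED = full_rotation(LONG_CHAIN)
```

After as many pulses as there are flip-flops, `RESTORED` equals `LONG_CHAIN`, and every flip-flop value has appeared exactly once in `OBSERVED`, starting with the flip-flop nearest the output.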
Now consider that we want to read the value of some flip-flop. This can be achieved by applying the correct number of clock cycles to the clock input while concatenate=1, circular=1, and scan_enable=1. The value of the flip-flop will become visible at the probe_read output. After this, the application of additional clock cycles scans the data of all the flip-flops back to their original position. Data can be written into the scan chain by applying data to the probe_i input while temporarily setting circular=0.
When using the configuration shown, all the flip-flops in the concatenated scan chain should be driven by a single clock tree to avoid clock skew problems. This means that there should be at least one debug shell per clock tree. If there are multiple debug shells present in an IC, it must be possible to apply clock cycles to the clock input of each domain independently of the others during scanning (while scan_enable=1). This must be possible so that in each domain the values can be restored without affecting the state in the other domains, i.e. without scanning the other domains. There is no need for locking the different clocks to each other during scanning, because there is no inter clock domain communication during scanning if each scan chain is limited to one clock tree (and hence one clock domain).
B. Single-stepping
In order to obtain traces, functional clock cycles are issued in between scan operations. In a single clock domain system, this can be achieved simply by applying a clock pulse after every access operation. In a multiple clock domain system, however, not all clock domains should receive a clock cycle after every scan operation. This is illustrated in Figure 2. This figure shows two clocks. We assume they are used in a system with only positive edge triggered flip-flops. The thick arrows are the last functional (scan_enable=0) clock pulses that are received on the clock inputs of the flip-flops. After this, the clocks are stopped and the flip-flops can be accessed using the procedure described. After the desired data has been accessed, new functional clock pulses must be issued. If we look at the dotted line labeled 'A', we see that a clock pulse must be given for the domain with clock 2, but no clock pulse must be issued for the other clock domain.
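Which domains must be pulsed at each single step follows from the clock periods and from the position of the last functional edge in each domain. The Python sketch below illustrates that bookkeeping under simplifying assumptions: time is counted in integer units so that coinciding edges compare exactly, and the periods and edge times in the example are invented rather than read from Figure 2.

```python
from typing import Dict, List, Tuple


def single_step_schedule(
    periods: Dict[str, int],
    last_edge: Dict[str, int],
    steps: int,
) -> List[Tuple[int, List[str]]]:
    """List the next `steps` instants that have a rising edge, with the clocks to pulse.

    `last_edge` holds the time of the last functional rising edge that each
    clock delivered before it was stopped.
    """
    next_edge = {name: last_edge[name] + periods[name] for name in periods}
    schedule = []
    for _ in range(steps):
        now = min(next_edge.values())
        pulsing = sorted(name for name, t in next_edge.items() if t == now)
        schedule.append((now, pulsing))
        for name in pulsing:
            next_edge[name] += periods[name]
    return schedule


# Invented numbers: clock 1 has period 6, clock 2 has period 4,
# and both delivered their last functional edge at time 0.
SCHEDULE = single_step_schedule(
    periods={"clock 1": 6, "clock 2": 4},
    last_edge={"clock 1": 0, "clock 2": 0},
    steps=4,
)
```

In this example the first step pulses only clock 2, which is the situation at the dotted line 'A'. The fourth step is an instant at which both domains must be pulsed together.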
[Figure 2: waveforms of clock 1 and clock 2, the instant at which the clocks are stopped, and dotted lines 'A' and 'B' marking later rising edges.]
Fig. 2. Two clocks with different frequencies
Thus, it must be possible to apply controlled numbers of functional (scan_enable=0) clock cycles to each clock domain independently of the others. There is one large difference with the application of clock cycles during scanning (scan_enable=1), however: whenever clock pulses must be issued simultaneously during the functional mode (as is the case at the dotted line labeled 'B'), they must arrive at the clock inputs of the flip-flops in the different domains within a very tight margin, or clock skew problems may occur. The clocks must, therefore, come from the same circuitry as used during functional execution.
C. Halting
Halting (part of) the IC is non-trivial for two reasons: (1) the phase relationship of the clocks must be known when stopped, and (2) multiple clock domains must be halted simultaneously.
That the phase relationship of the clocks must be known when stopped follows from the discussion in Section III-B. The phase relationships of clocks can be made known after stopping by measuring and storing the phase relationship when halting, or by delaying a halt until a known phase relationship occurs.
We now consider why multiple clock domains must be halted simultaneously. Figure 3 shows two communicating clock domains with clocks of equal frequency. Again we use the thick arrows to denote the clock pulses that are passed in functional mode. Here we see that in clock domain 2, one more functional cycle is executed than in the other domain. If data calculated at rising edge 'A' is to be communicated to clock domain 1, it is lost because it is not read by clock domain 1, and it is overwritten at clock edge 'B'. We conclude that, in general, communicating clock domains must be halted simultaneously.
[Figure 3: waveforms of clock 1 and clock 2 with edges labeled 1 to 4, 'A', and 'B'.]
Fig. 3. Clocks of communicating domains
In order to halt two clock domains simultaneously, the output of their clock controllers must be gated almost simultaneously. If there is communication from clock domain 1 to clock domain 2, the window within which both the clocks must be gated is less wide than the distance between a rising edge in domain 1 and the next edge in domain 2 which clocks in the values produced by the rising edge in domain 1. This is illustrated in Figure 4. If the clock gate circuits of both domain 1 and domain 2 receive a halt pulse within this gate window, the domains are halted simultaneously.
[Figure 4: waveforms of clock 1 and clock 2 with the intervals t_s (setup time) and t_GW (gate window) marked.]
Fig. 4. The window during which halt pulses must arrive
The problem is that when the different clocks have different frequencies or phases, the distance in time between a rising edge in domain 1 and a rising edge in domain 2 can be smaller than the smallest period. This is illustrated in Figure 5. Here we see two clocks, one with period T, and one with period 1½T. The smallest distance, in time, between a rising edge in domain 1 and a rising edge in domain 2 is smaller than the smallest period (T), namely ½T. We conclude that the gate window can be narrower than the smallest clock period.
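The ½T figure can be reproduced by enumerating rising edges over one common repetition period of the two clocks. The Python sketch below does this for ideal clocks with integer periods and phase offsets. The function names and the choice of time unit (T = 2 units, so that 1½T = 3 units) are assumptions made for the example. It computes only the edge-to-edge distance, which the text gives as an upper bound on the gate window.

```python
from math import gcd


def next_edge_after(t: int, period: int, offset: int) -> int:
    """First rising edge strictly later than t, for edges at offset + k * period."""
    return offset + ((t - offset) // period + 1) * period


def min_launch_to_capture(period_1: int, period_2: int,
                          offset_1: int = 0, offset_2: int = 0) -> int:
    """Smallest distance from a rising edge in domain 1 to the next one in domain 2.

    The edge pattern repeats after the least common multiple of the two
    periods, so checking the domain 1 edges of one such interval suffices.
    """
    hyper_period = period_1 * period_2 // gcd(period_1, period_2)
    distances = []
    for launch in range(offset_1, offset_1 + hyper_period, period_1):
        capture = next_edge_after(launch, period_2, offset_2)
        distances.append(capture - launch)
    return min(distances)


# T = 2 units, so periods T and 1.5 T become 2 and 3 units.
SMALLEST_DISTANCE = min_launch_to_capture(period_1=2, period_2=3)
```

`SMALLEST_DISTANCE` comes out as 1 unit, i.e. ½T, even though the smallest clock period is 2 units.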
[Figure 5: waveforms of clock 1 and clock 2.]
Fig. 5. Clocks with different frequencies giving a distance between rising edges that is smaller than the smallest clock period
Because the physical distance between the different clock controllers can be large, signal propagation delays can cause a time between the reception of halt requests at the clock controller gates that is larger than the width of the gate window. This problem can be solved by using a two-step approach. The idea is to provide a wider time window within which halt requests may be received by the clock controllers. Halt requests are latched. The output of the latch is applied to the clock gate circuitry at the rising edge of a halt-clock signal. The width of the window in which requests may be received is equal to the period of the halt-clock signal minus a setup time. For each clock controller, the halt-clock is locked to the halt-clock in all the other clock controllers using a phase-locked loop (PLL). This assures that all clocks are gated within a very narrow window. The generation of halt requests by embedded trigger blocks is treated in [7].
IV. Debugger Software
In this section we describe the main features of a prototype tool named INCIDE. INCIDE is an acronym for INtegrated CIrcuit Debugging Environment. The primary use of the INCIDE software is debugging first silicon; it can be used to detect design errors in actual prototype ICs. The software allows access to the actual silicon in a way similar to the access of a design being simulated using a VHDL or Verilog simulator. The software can be used for (1) constructing and controlling a test bench (so that stimuli can be applied to the IC, and responses can be collected), and (2) providing internal access to the IC (such as accessing the state of flip-flops or embedded memories).
A. Test bench
Figure 6 shows an example of a test bench with an implemented IC. In the middle of this picture we see an IC with one signal input pin and one signal output pin (power, ground, and clock pins are not considered here). The IC is placed on a verification tester. This tester interfaces the real world with the debugger software. The VLSI tester block receives its stimuli from an input file. The output of the tester is directed to an output file. Test benches are represented using a netlist which can be loaded into INCIDE. In such a netlist, the various components that must be connected are represented as cells. Different cell types provide different functionality. In this example we have four cell types: a file reader, a file writer, a VLSI tester, and a waveform viewer. But in principle any data source or sink can be encapsulated in a cell. With a simulator cell, for example, silicon/model co-execution can take place.
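The idea of cells as connectable sources and sinks can be illustrated with a few Python generators. This is only a sketch of the concept: it does not reproduce the INCIDE netlist format, which is not described here. The cell functions, the in-memory stimulus list standing in for the input file, and the inverting stand-in for the IC are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def reader_cell(samples: Iterable[int]) -> Iterator[int]:
    """Stand-in for a file reader cell: yields one stimulus per clock cycle."""
    yield from samples


def tester_cell(stimuli: Iterable[int], device: Callable[[int], int]) -> Iterator[int]:
    """Stand-in for the VLSI tester cell: applies each stimulus, yields the response."""
    for stimulus in stimuli:
        yield device(stimulus)


def writer_cell(responses: Iterable[int], sink: List[int]) -> None:
    """Stand-in for a file writer cell: stores every response in `sink`."""
    for response in responses:
        sink.append(response)


def run_test_bench(stimuli: Iterable[int], device: Callable[[int], int]) -> List[int]:
    """Connect reader -> tester -> writer and run the bench to completion."""
    collected: List[int] = []
    writer_cell(tester_cell(reader_cell(stimuli), device), collected)
    return collected


def inverting_device(bit: int) -> int:
    """Toy replacement for the IC: one input pin, one output pin."""
    return bit ^ 1


RESPONSES = run_test_bench([0, 1, 1, 0], inverting_device)
```

Because every cell only consumes and produces a stream, replacing `inverting_device` by a call into a simulator would give the kind of silicon/model comparison mentioned above without touching the other cells.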
[Figure 6: test bench netlist with a file reader, a VLSI tester holding the IC, a file writer, and a waveform viewer.]
Fig. 6. Test bench
B. Internal access
Consider that an IC contains a design error. In such a case we want to 'zoom in' on the error by accessing the internal data streams within the IC.
[Figure 7: the test bench extended with file reader and file writer cells attached to the JTAG port of the IC.]
Fig. 7. Test bench providing internal access via the JTAG port
Figure 7 shows an example of a test bench with an IC which has DfD measures implemented. In this example five extra pins provide access to the state of the internal flip-flops through the JTAG [8] port, which is used as an interface between the internal transport and the external tester. If we want to access some internal flip-flop, we have to apply a sequence of bits to the JTAG port. This can be done by applying the stream directly from a file as shown in Figure 7. However, this is impractical: for every access to be made, the corresponding sequence has to be figured out. Instead, INCIDE provides methods giving internal access directly. There are cells that perform the transformations of access requests into stimuli, and transform collected responses back into a meaningful form. The access requests are given in a command stream.
Figure 8 illustrates this concept. Here we see that the VLSI tester is encapsulated in a transformation cell. The transformation cell allows virtual connections to be made from pins to the inside of the IC. Consider that the IC has some internal flip-flop named x. The transformation cell can be instructed to 'connect' the flip-flop to an external pin (p in this example). The transformation cell contains a sub-cell that performs the JTAG interfacing. Now when the circuit starts executing, appropriate commands are sent to the JTAG port of the IC, so that after every clock cycle the value of flip-flop x is read. The result is collected from the JTAG port and written to the pin p.
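The virtual connection from an internal flip-flop to a pin can be sketched as a wrapper that, after every clock cycle, asks a JTAG sub-cell for the flip-flop value and appends it to a trace for the pin. The Python classes below are hypothetical and do not reflect the INCIDE interfaces. In particular, `FakeJtagInterface` looks values up in a toy device model instead of generating real JTAG bit sequences.

```python
from typing import Callable, Dict, List


class FakeJtagInterface:
    """Stand-in for the JTAG interfacing sub-cell, backed by a toy device model."""

    def __init__(self, device_state: Dict[str, int]) -> None:
        self._state = device_state

    def read_flip_flop(self, name: str) -> int:
        # A real sub-cell would translate this request into a bit sequence
        # for the JTAG port and decode the scanned-out response.
        return self._state[name]


class TransformationCell:
    """Wraps the tester and exposes internal flip-flops as virtual pins."""

    def __init__(self, step_device: Callable[[int], None], jtag: FakeJtagInterface) -> None:
        self._step_device = step_device
        self._jtag = jtag
        self._connections: Dict[str, str] = {}  # pin name -> flip-flop name
        self.pin_traces: Dict[str, List[int]] = {}

    def connect(self, flip_flop: str, pin: str) -> None:
        """Make a virtual connection from an internal flip-flop to a pin."""
        self._connections[pin] = flip_flop
        self.pin_traces[pin] = []

    def clock_cycle(self, stimulus: int) -> None:
        """Run one functional cycle, then sample every connected flip-flop."""
        self._step_device(stimulus)
        for pin, flip_flop in self._connections.items():
            self.pin_traces[pin].append(self._jtag.read_flip_flop(flip_flop))


# Toy device: flip-flop x toggles whenever the input pin is 1.
STATE = {"x": 0}


def step_toy_device(stimulus: int) -> None:
    STATE["x"] ^= stimulus


CELL = TransformationCell(step_toy_device, FakeJtagInterface(STATE))
CELL.connect("x", "p")
for bit in [1, 0, 1, 1]:
    CELL.clock_cycle(bit)
```

After the four cycles, `CELL.pin_traces["p"]` holds the per-cycle values of x, so the trace on the virtual pin p can be written to a file or shown in a waveform viewer like any other pin.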
Fig. 8. The transformation cell
V. Conclusions
In this paper, the hardware and software requirements for silicon debugging of Systems-on-a-Chip are considered. It is shown how the scan chains can be used to access the internal IC state during debugging. Data on the IC is accessed in a single-step mode by repeatedly (1) scanning in and out the desired data, and (2) applying functional clock cycles. Circular scan chains facilitate the restoration of the internal chip state after scan-based access. The clock control for both scanning and the application of functional clock cycles is explained. It is also shown how a functionally
running IC can be halted to allow internal data access at a suitable point in time. Finally, we introduce a prototype debugger tool that lets the user control the debug process at a suitable level of abstraction.
VI. Acknowledgments
The authors thank Erik Hermans of Data Sciences for programming large parts of the INCIDE software. They thank Paul Merkus of Philips ED&T for his involvement in the software development process. Furthermore, they thank the members of the CPA team at Philips Research for the stimulating discussions on Design for Debug, and the reviewers for their comments and suggestions.
References
[1] Marc Levitt, Srinivas Nori, Sridhar Narayanan, GP Grewal, Lynn Youngs, Arnjali Jones, Greg Billus, and Siva Paramanandam, "Testability, debuggability, and manufacturability features of the UltraSPARC™-I microprocessor," in Proceedings IEEE International Test Conference, 1995, pp. 157-166.
[2] Hong Hao and Rick Avra, "Structured design-for-debug - the SuperSPARC™-II methodology and implementation," in Proceedings IEEE International Test Conference, 1995, pp. 175-183.
[3] Kalon Holdbrook, Sunil Joshi, Samir Mitra, Joe Petolino, Renu Raman, and Michelle Wong, "microSPARC™: A case-study of scan based debug," in Proceedings IEEE International Test Conference, 1994, pp. 70-75.
[4] Wayne Needham and Naga Gollakota, "DFT strategy for Intel microprocessors," in Proceedings IEEE International Test Conference, 1996, pp. 396-399.
[5] Zainalabedin Navabi, VHDL: Analysis and Modeling of Digital Systems, McGraw-Hill, Inc., 1993.
[6] Hong Hao and Kanti Bhabuthmal, "Clock controller design in SuperSPARC™-II microprocessor," in Proceedings IEEE International Conference on Computer Design, 1995, pp. 124-129.
[7] Gert Jan van Rootselaar, Frank Bouwman, Erik Jan Marinissen, and Math Verstraelen, "Debugging of systems on a chip: Embedded triggers," in Informal digest of papers distributed at the IEEE High Level Design Validation and Test Workshop, 1997, (To appear).
[8] Harry Bleeker, Peter van den Eijnden, and Frans de Jong, Boundary-Scan Test: A Practical Approach, Kluwer Academic Publishers, 1993.