Synergies for Design Verification
TPartition: Testbench Partitioning for Hardware-Accelerated Functional Verification

Young-Il Kim
Korea Advanced Institute of Science and Technology

Chong-Min Kyung
Korea Advanced Institute of Science and Technology; Integrated Circuit Design Education Center

Editor's note: This hybrid dynamic simulation scheme implements part of the simulator in software running on a processor and maps the rest onto a programmable hardware accelerator. An algorithm for hardware synthesis of behavioral testbenches enables better partitions, resulting in lower communication costs between the two components. —Sharad Malik, Princeton University

AS SOC DESIGN COMPLEXITY INCREASED, the verification process became a critical bottleneck in the design process.1 Although software simulation is the most common and familiar verification method, software simulator performance is inadequate for today's hardware design complexity. Simulators run on general-purpose computers, which process tasks sequentially, whereas hardware signals propagate simultaneously along numerous paths in the actual circuits. To overcome this performance limitation, designers have used special-purpose hardware. Most hardware simulators are used with real target boards for in-circuit emulation. Although emulation has high performance capability, it has low flexibility because a behavioral testbench cannot be applied to the emulator. Researchers have tried to execute behavioral testbenches in processor-based emulators,2 but this event-driven simulation degrades emulator performance. Most important, it cannot match the flexibility of a software simulator, which can execute system calls, such as terminal display and file I/O, and high-level testbenches, such as Vera and C++. Therefore, the software simulator must be combined with a hardware emulator after all.3-7 However, the communication overhead between the software simulator and the hardware accelerator has become a new critical bottleneck. To reduce communication overhead, designers use a transaction-level interface to reduce the amount of communication data.8,9 Accellera introduced the Standard Co-Emulation Modeling Interface (SCE-MI), which defines APIs for testbench and hardware interfaces for transactor description.10 Commercial emulators use SCE-MI for transaction-level interfaces.3,4,7 However, this method requires designers to describe the transactor in a synthesizable fashion and rewrite the testbench in a high-level language. In addition, when the verification result and the expected result don't match, designers cannot be sure whether the design under test (DUT) or the transactor is wrong, and they must make additional efforts to verify the transactor and the new testbench design. What designers want is not a changed verification environment but increased simulation speed. In another emulation method, the system analyzes hardware description language (HDL) code and classifies HDL components into synthesizable and unsynthesizable parts.4,11 The unsynthesizable part runs on a software simulator in the host machine, and the synthesizable part executes in the hardware accelerator. The designer can apply an existing test environment without remodeling efforts. Although this method increases performance by offloading the synthesizable part of the testbench from the software simulator, the partition between the soft-
0740-7475/04/$20.00 © 2004 IEEE. Copublished by the IEEE CS and the IEEE CASS.
IEEE Design & Test of Computers, November–December 2004
ware simulator and the accelerator does not account for communication efficiency, and performance is limited to 100,000 cycles per second (cps).6 Another method stores stimuli in the memory located in the emulator. When the designer performs additional simulation, the system applies the prestored patterns to the DUT and compares the outputs with the expected values.5 This method avoids communication overhead by not interacting with the testbench. Although the method is useful for fast regression testing, the designer must perform cosimulation at least once to get the stimuli patterns and the expected results. Moreover, it is not applicable to designs that are self-driven and have nondeterministic behavior. In this article, we present TPartition, a new scheme for accelerating functional simulation. Starting with conventional hardware-accelerated simulation, we exploit the characteristics of the channel between the simulator and the accelerator. To speed up functional simulation, we use hardware to offload calculation-intensive tasks from the software simulator. To reduce communication overhead, we identify a part of the testbench involved in the input port’s data dependency on the output port and move it into the hardware accelerator. We do not raise the abstraction level or modify communication data. Thus, our method accelerates simulation speed without a designer’s modeling efforts and without losing compatibility with the original test environment.
System architecture
Figure 1a shows a typical system architecture for hardware-accelerated simulation. The software-simulated design (SSD) runs on the HDL simulator in the host computer, and the hardware-accelerated design (HAD) is mapped to the FPGA in the hardware accelerator. The coemulation interface between the HDL simulator and the FPGA combines the HAD proxy, the programming language interface (PLI), the device driver, the system bus, the system bus interface, and the transactor, all in a serial connection. The designer-supplied testbench interfaces with the coemulation interface via the HAD proxy, which receives the HAD's input port values from the testbench and sends them as a message to the hardware accelerator through all the components of the coemulation interface. Similarly, the HAD proxy reads the HAD's output port values through the same path and feeds them to the testbench.

Synchronization
Crossing the coemulation interface is a burdensome task because of its inherently long path. However, the task is necessary for synchronizing the software simulator with the hardware accelerator. Here, we describe the detailed synchronization mechanism and propose a new method of reducing communication overhead.

Conventional synchronization scheme
In the conventional scheme, as Figure 1b shows, the designer executes the testbench on the HDL simulator and maps the DUT to the FPGA. Synchronization of a clock cycle requires four steps:

1. When the testbench clock event occurs, the coemulation interface delivers the input port value and indicates which clock will be advanced to the DUT.
2. The hardware accelerator then advances the DUT one clock cycle and evaluates it.
3. After the hardware accelerator stabilizes the result value at the DUT's output port, the coemulation interface delivers the output port value to the testbench.
4. Finally, the testbench checks the DUT's output results and calculates the input port value for the next clock cycle.

Two operations, steps 1 and 3, occur at every clock cycle across the coemulation interface. The CPU time for a single clock cycle consists of three distinct components:

t_total = t_simulator + t_sync + t_accelerator

where t_simulator denotes the CPU time consumed in the host computer for processing the testbench in step 4, t_sync denotes the synchronization time for steps 1 and 3, and t_accelerator denotes the emulation time for evaluating the DUT circuit in step 2. Synchronization time t_sync consists of three components, the synchronization times for the input, output, and clock ports, denoted t_inport, t_outport, and t_clkport:

t_inport = t_setup + (BW_inport / BW_bus) t_payload
t_outport = t_setup + (BW_outport / BW_bus) t_payload
t_clkport = t_setup + t_payload

where BW_inport, BW_outport, and BW_bus denote the bit widths of the input port, output port, and system bus.
Figure 1. Functional verification system target architecture (a). A conventional acceleration system performs synchronization between the software simulator and the hardware accelerator through the coemulation interface at every clock cycle (b); the proposed system performs synchronization through the coemulation interface every N simulation cycles (c). Each circled number (1, 2, 3, 4) represents a step in the synchronization of a clock cycle.
The term t_setup denotes the time required to set up a transaction in the coemulation interface; it comprises five terms, as follows:

t_setup = t_PLI_setup + t_driver_setup + t_bus_setup + t_bus_interface_setup + t_transactor_setup

The term t_payload is the time required to send one additional word of data through the coemulation interface and is defined as follows:

t_payload = max(t_PLI_payload + t_driver_payload, t_bus_payload, t_bus_interface_payload, t_transactor_payload)   (1)

For example, to make one peripheral component interconnect (PCI) transaction, the sender must conduct an arbitration process to use the bus; once it acquires the bus, it can transmit a number of words in a burst data transfer. In this case, t_bus_setup corresponds to the bus arbitration time, and t_bus_payload is the time needed to transmit one additional word through the bus. Finally, we define t_sync as

t_sync = t_inport + t_outport + t_clkport
       = 3 t_setup + (BW_inport / BW_bus + BW_outport / BW_bus + 1) t_payload   (2)

Note that we obtain t_setup by summing the setup times of all the components in the coemulation interface. In contrast, as Equation 1 shows, we obtain t_payload by finding the maximum transmit time among the components. Because the components of the coemulation interface are serially connected, the setup time accumulates across all of them. Data transmission, on the other hand, is pipelined, so the total payload time is that of the component with the lowest bandwidth (the PLI and the driver execute sequentially in software, so their payload times add). Therefore, setup time is the dominant factor in total synchronization time in Equation 2.
Proposed synchronization scheme
We know that reducing t_setup is essential to minimizing total simulation time. Unfortunately, the amount of data needed to synchronize one clock cycle is fixed, short of a compression technique, which is beyond the scope of this article. To reduce the first term of Equation 2, we instead perform synchronization only once for every several clock cycles. We can do this by moving into the hardware accelerator the part of the testbench that is involved in the input port's data dependency on the output port's data. Figure 1c shows each clock cycle operation of the proposed acceleration system. The system is based on the architecture depicted in Figure 1a. Unlike the conventional system's operation (Figure 1b), crossing the coemulation interface does not occur at every clock cycle in our scheme. Rather, the testbench's source part (TBsource) executes for a number of clock cycles but sends data through the coemulation interface only once. N denotes the number of clock cycles within the synchronization interval. When TBsource sends synchronization data corresponding to N clock cycles, the accelerator executes the DUT and the testbench's loop part (TBloop) very quickly by advancing N clock ticks. Finally, the testbench's sink part (TBsink) receives the DUT result and performs its own operation. Because we perform all these procedures in a pipelined manner using direct memory access, the software simulator and the accelerator can execute in parallel, which is not possible in the conventional method. Therefore, the CPU time for a single clock cycle becomes

t_total = max(t_simulator, t_sync, t_accelerator)   (3)

We describe the time consumed by synchronizing the input, output, and clock ports as follows:

t'_inport = (1/N) t_setup + (BW'_inport / BW_bus) t_payload
t'_outport = (1/N) t_setup + (BW'_outport / BW_bus) t_payload
t'_clkport = (1/N) t_setup + t_payload

Our approach performs synchronization at intervals of N clock cycles, reducing the first term by a factor of N but, compared with the conventional method, leaving the second term unchanged. Note that these values are normalized to one clock cycle. Finally, we write the time required for synchronization as

t'_sync = t'_inport + t'_outport + t'_clkport + t_buffer
        = (3/N) t_setup + (BW'_inport / BW_bus + BW'_outport / BW_bus + 1) t_payload + t_buffer   (4)
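To make the cost model concrete, the per-cycle synchronization times of Equations 2 and 4 can be evaluated in a few lines of Python. The timing and bit-width values below are illustrative assumptions, not measurements from our experiments:

```python
# Per-cycle synchronization-time model: Equation 2 (conventional scheme)
# and Equation 4 (proposed scheme). Times are in arbitrary units;
# bit widths are in bits. All concrete numbers are assumptions.

def t_sync_conventional(t_setup, t_payload, bw_in, bw_out, bw_bus):
    # t_sync = 3*t_setup + (BW_in/BW_bus + BW_out/BW_bus + 1) * t_payload
    return 3 * t_setup + (bw_in / bw_bus + bw_out / bw_bus + 1) * t_payload

def t_sync_proposed(t_setup, t_payload, bw_in, bw_out, bw_bus, n, t_buffer=0.0):
    # t'_sync = (3/N)*t_setup + (BW'_in/BW_bus + BW'_out/BW_bus + 1)*t_payload
    #           + t_buffer
    return ((3 / n) * t_setup
            + (bw_in / bw_bus + bw_out / bw_bus + 1) * t_payload
            + t_buffer)

conv = t_sync_conventional(t_setup=10.0, t_payload=0.1,
                           bw_in=64, bw_out=64, bw_bus=32)
prop = t_sync_proposed(t_setup=10.0, t_payload=0.1,
                       bw_in=64, bw_out=64, bw_bus=32, n=256)
# Because setup cost dominates, the ratio approaches N for large N
# until t_buffer becomes significant.
print(conv, prop, conv / prop)
```

Running the model with a nonzero t_buffer shows the saturation effect discussed later: past some N, the (3/N)t_setup term is negligible and t_buffer caps the achievable reduction.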
Figure 2. Testbench splitting and placement procedure in TPartition: original testbench (a), testbench partition (b), and placement of each part of the testbench (c).
where t_buffer denotes the buffering time, which we introduce because our approach performs synchronization every N simulation cycles. The software side of the coemulation interface includes two buffers, which store the input and output synchronization data, of N × BW'_inport and N × BW'_outport bits, respectively. Within a synchronization interval, the input port buffer fills and the output port buffer drains over N clock cycles. At the synchronization point, the input port buffer is flushed to the accelerator and the output port buffer is refilled. We can make t_buffer negligible compared with t_setup by locating the buffers as close as possible to the testbench to
reduce buffer access time. In our experiment, we placed the buffers in the HAD proxy.
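The interval buffering just described can be sketched as follows. The function names and the increment-only "DUT" are hypothetical stand-ins for the HAD proxy and the accelerator round trip:

```python
# Sketch of N-cycle batching: the proxy-side input buffer fills for N
# cycles, then a single crossing of the coemulation interface delivers
# N cycles of stimuli and returns N cycles of DUT outputs.

N = 4  # synchronization interval (the article's experiments use N = 256)

def accelerator_round_trip(input_words):
    # Stand-in for the FPGA side: this toy "DUT" increments each word.
    return [w + 1 for w in input_words]

def run(stimuli, n=N):
    in_buf, results = [], []
    for word in stimuli:
        in_buf.append(word)          # buffer fills during the interval
        if len(in_buf) == n:         # synchronization point: one crossing
            results.extend(accelerator_round_trip(in_buf))
            in_buf.clear()
    if in_buf:                       # flush a final partial interval
        results.extend(accelerator_round_trip(in_buf))
    return results

print(run([0, 1, 2, 3, 4, 5]))  # → [1, 2, 3, 4, 5, 6]
```

With six stimuli and N = 4, only two interface crossings occur instead of six, which is exactly the (3/N)t_setup reduction of Equation 4.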
Partitioning the testbench
Our testbench-partitioning method enables the proposed synchronization scheme. Figure 2 shows the testbench splitting and placement procedure in the TPartition scheme. We model the testbench structure as a directed graph consisting of vertices and edges. A vertex denotes an HDL construct, and an edge denotes a relationship between HDL constructs. The source vertex generates test patterns, and the sink vertex consumes result patterns. For example, system task functions such as $random and $display can be source and sink vertices, respectively. We define a path as one or more connected edges. There are two kinds of paths: forward and backward. A forward path starts from the source vertex and ends at the sink vertex. A backward path starts from the DUT's output port and ends at the input port. As Figure 2b shows, TPartition splits the original testbench into three parts:

■ TBsource, the testbench part that includes the source vertices;
■ TBsink, the testbench part that includes the sink vertices; and
■ TBloop, the testbench part that includes the vertices involved in a backward path.

Figure 3. Example operation sequence of the simulation model (a), waveform of the proposed hardware implementation (b), and equivalent hardware translation of a Verilog statement (c). The Verilog source in (a) is

forever Clk = #1 ~Clk;
always @(posedge Clk) a = a + 1;
assign b = a;

and its simulation operations are: 1E, evaluate RHS (~Clk) and schedule the update of LHS (Clk) after #1; 1U, update LHS (Clk) and activate the RHS evaluation of statements 1 and 2 (~Clk, a+1); 2E, evaluate RHS (a+1) and activate the update of LHS (a); 2U, update LHS (a) and activate the RHS evaluation of statement 3 (a); 3E, evaluate RHS (a) and activate the update of LHS (b); 3U, update LHS (b). Operations within each period appear to occur simultaneously in the context of simulation time.
We first perform the split procedure by finding TBloop and splitting the rest of the testbench into TBsource and TBsink. Then we map TBloop to the accelerator and directly attach it to the DUT. Consequently, changes occur in the boundary between the software simulator and the hardware emulator. To cope with this, we generate new ports instead of using existing ports connected to the backward path. Finally, TBsource has only outgoing ports, and TBsink has only incoming ports. In this situation, TBsource and TBsink can run independently. TBsource can run through many clock cycles without waiting for data from the accelerator, whereas TBsink executes when the output port data becomes available from the accelerator.
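The split procedure can be sketched as a reachability computation on the testbench graph. This is our own reconstruction of the idea, not the authors' implementation, and the vertex names are hypothetical: TBloop is every vertex on a backward path (reachable from the DUT output and reaching the DUT input); of the remaining vertices, those that drive the DUT input form TBsource and the rest form TBsink.

```python
# Reconstruction of the TPartition split as two reachability passes over
# a directed graph of testbench constructs.

def reachable(graph, start):
    seen, stack = set(), [start]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(graph.get(v, []))
    return seen

def tpartition(edges, dut_in, dut_out):
    fwd, rev = {}, {}
    for a, b in edges:
        fwd.setdefault(a, []).append(b)
        rev.setdefault(b, []).append(a)
    from_out = reachable(fwd, dut_out)   # descendants of the DUT output
    to_in = reachable(rev, dut_in)       # ancestors of the DUT input
    loop = (from_out & to_in) - {dut_in, dut_out}
    rest = {v for e in edges for v in e} - loop - {dut_in, dut_out}
    source = {v for v in rest if v in to_in}   # feeds the DUT input
    sink = rest - source                       # consumes the DUT output
    return source, loop, sink

# Toy testbench: a stimulus generator, a checker, and one feedback block
# sitting on the backward path from "out" to "in".
edges = [("stimulus", "in"), ("out", "feedback"), ("feedback", "in"),
         ("out", "checker")]
print(tpartition(edges, "in", "out"))
```

Here the feedback block is the only vertex on a backward path, so it becomes TBloop and moves to the accelerator, while the stimulus generator (TBsource) and checker (TBsink) stay in the software simulator.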
Making the testbench synthesizable We propose a method for translating an unsynthesizable HDL description into a synthesizable one. We showed earlier that the loop part of the testbench moves into the hardware accelerator. However, we can describe the testbench in any style defined in an HDL. Because a behavioral testbench composed of time control, event control, nonstatic loops, and sequential statements is not synthesizable, we need a method for mapping this description into the FPGA-based accelerator. Therefore, we developed an automated translator that converts behavioral HDL code into synthesizable code.
Basic principle
According to the IEEE 1364-2001 Verilog HDL standard,12 an HDL simulator should follow the Verilog simulation reference model. We also based our approach on this model, to support all possible HDL syntax in the HDL testbench. At the beginning of a simulation time step, the right-hand side (RHS) of an assignment statement is evaluated, and then the left-hand side (LHS) variable is updated. The updated LHS variable might cause additional evaluate events for sensitive statements. The evaluate and update phases alternate until all events in the current time step have executed and no active processes remain. Figure 3a shows an example of simulation operations. The Verilog source code includes three statements. For each statement, the simulator performs operations in two phases: evaluate and update. 1E denotes the evaluate operation of statement 1, and 1U denotes the update operation of statement 1. Figure 3b shows the operation sequence for all three statements. Statements 1 and 2 are sensitive to the Clk signal, so 1E and 2E execute after 1U. Similarly, 3E executes after 2U. In the operation sequence, we introduce an emulation clock ("Eclk" in Figure 3a). Each operation is performed at a positive edge of Eclk, and there can be several Eclk cycles in a simulation time step. Simulation time advances when there are no more active events. To implement this behavior in the FPGA-based accelerator, we translate the assignment (indicated by the "=" sign) of an HDL statement into a 2-bit shift register in which two enabled registers are connected in cascade. As Figure 3c shows, we use the left register for the evaluate operation and the right register for the update operation. The Evaluate_Trigger signal enables the evaluate register, and the Update_Trigger signal enables the update register.

Figure 4. Example of testbench translation into synthesizable equivalent hardware: We analyze the three statements to extract the statement information table, from which we translate the statements into synthesizable equivalent statements in hardware. The extracted table is

              LHS    RHS    Sensitivity (evaluate)    Assignment (update)
Statement 1   Clk    ~Clk   Clk                       = #1
Statement 2   a      a+1    Clk                       =
Statement 3   b      a      a                         =
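The evaluate/update alternation for the three statements of Figure 3a can be mimicked by a small software model of the reference-model scheduling. This is an illustrative sketch of ours, not the authors' translator:

```python
# Miniature two-phase (evaluate/update) model of the three statements of
# Figure 3a. Each loop iteration is one #1 time step; within a step,
# pending (signal, value) updates are applied until no events remain.

def simulate(steps):
    sig = {"Clk": 0, "a": 0, "b": 0}
    trace = []
    for _ in range(steps):
        # Statement 1: Clk = #1 ~Clk, evaluated and scheduled for this step.
        pending = [("Clk", 1 - sig["Clk"])]
        while pending:                       # iterate until quiescent
            name, value = pending.pop(0)
            old, sig[name] = sig[name], value
            if name == "Clk" and old == 0 and value == 1:
                pending.append(("a", sig["a"] + 1))   # statement 2: a = a + 1
            if name == "a" and old != value:
                pending.append(("b", sig["a"]))       # statement 3: assign b = a
        trace.append((sig["Clk"], sig["a"], sig["b"]))
    return trace

print(simulate(4))  # → [(1, 1, 1), (0, 1, 1), (1, 2, 2), (0, 2, 2)]
```

The inner while loop plays the role of the Eclk cycles: a time step takes as many iterations as there are cascaded events, exactly the 1U, 1E/2E, 2U, 3E, 3U chain in the figure.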
Translation Figure 4 shows an example translation process for three HDL statements. For each statement, we extract the statement information table, which contains four fields: LHS, RHS, sensitivity, and assignment type. Our translator maps each information component to an internal block of equivalent-statement hardware. That is, it maps RHS, sensitivity, and assignment type to Evaluate_Logic, Evaluate_Trigger, and Update_Trigger, respectively.
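A translator front end along these lines can extract the statement information table from the statements of Figure 4. The sketch below is a deliberately simplified, hypothetical parser that handles only these three statement forms; a real HDL front end needs a full grammar:

```python
import re

# Simplified extraction of the statement information table (Figure 4)
# from single-assignment Verilog statements. Hypothetical sketch only.

def statement_info(stmt):
    m = re.match(r"always\s*@\((?:posedge\s+)?(\w+)\)\s*(\w+)\s*(<?=)\s*(.+);", stmt)
    if m:
        sens, lhs, assign, rhs = m.groups()
        return {"LHS": lhs, "RHS": rhs.strip(),
                "sensitivity": sens, "assignment": assign}
    m = re.match(r"assign\s+(\w+)\s*=\s*(\w+);", stmt)
    if m:
        lhs, rhs = m.groups()
        # A continuous assignment is sensitive to its RHS signal.
        return {"LHS": lhs, "RHS": rhs, "sensitivity": rhs, "assignment": "="}
    m = re.match(r"forever\s+(\w+)\s*=\s*(#\d+)\s*(.+);", stmt)
    if m:
        lhs, delay, rhs = m.groups()
        # The clock generator re-triggers on its own LHS update.
        return {"LHS": lhs, "RHS": rhs.strip(),
                "sensitivity": lhs, "assignment": "= " + delay}
    raise ValueError("unsupported statement: " + stmt)

for s in ["forever Clk = #1 ~Clk;", "always @(Clk) a=a+1;", "assign b=a;"]:
    print(statement_info(s))
```

On the three statements of Figure 4, this reproduces the LHS/RHS/sensitivity/assignment rows of the statement information table.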
Translated testbench architecture
Figure 5 shows the overall architecture of the translated testbench. The architecture contains two different paths: value (solid lines) and event (dashed lines). The blocks function as follows:

■ Evaluate_Logic generates the RHS value.
■ Evaluate_Trigger determines the time to enable the evaluate register. The Evaluate_Trigger signal is generated by sensitive-signal events.
■ Update_Trigger determines the time to enable the update register. For a blocking assignment, the Update_Trigger signal becomes active immediately after the Evaluate_Trigger signal becomes active. For a nonblocking assignment, Update_Trigger is delayed until all events in the current time step have executed. In addition, in cases using time control (indicated by the "#" sign), the update time becomes the simulation time specified by the time control statement's delay.
■ Event_Detector generates Event signals from Value signals. An Event signal is high for only one emulation clock (Eclk) cycle. We implement this block using an XOR gate with two inputs: a Value signal and the same signal delayed by one clock cycle.
■ Local_Interconnection routes the Value and Event signals from the evaluate registers to the update register. When only one statement drives the LHS variable, the Value and Event signals simply pass through this block. However, when multiple statements drive the LHS variable, all equivalent-statement hardware shares one update register, as shown in the bottom three boxes of Figure 5. Whenever any Update_Trigger signal is activated, the update register should be updated. When two or more Update_Trigger signals are activated simultaneously, this condition is a race condition, and the translator can select any statement to feed the update register.

Figure 5. Translated testbench architecture.
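The equivalent-statement hardware can also be modeled cycle by cycle in software to check the trigger behavior. This is a behavioral sketch with our own naming, not RTL from the article:

```python
# Cycle-level model of one equivalent-statement block: an evaluate
# register and an update register in cascade, each with an enable, plus
# an Event_Detector (value XOR its one-cycle-delayed copy, 1-bit wide).

class EquivalentStatement:
    def __init__(self):
        self.eval_reg = 0
        self.update_reg = 0
        self._prev_update = 0

    def eclk(self, rhs_value, evaluate_trigger, update_trigger):
        # Update first so it samples the evaluate register's old value,
        # modeling two registers clocked on the same Eclk edge.
        if update_trigger:
            self.update_reg = self.eval_reg   # update phase: LHS takes value
        if evaluate_trigger:
            self.eval_reg = rhs_value         # evaluate phase: sample RHS
        event = self.update_reg ^ self._prev_update   # Event_Detector
        self._prev_update = self.update_reg
        return event

stmt = EquivalentStatement()
print(stmt.eclk(1, evaluate_trigger=True, update_trigger=False))   # → 0
print(stmt.eclk(1, evaluate_trigger=False, update_trigger=True))   # → 1
```

The event output is high for exactly one Eclk cycle after the LHS changes, which is what lets downstream Evaluate_Trigger blocks fire once per update.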
Table 1. Synchronization time and simulation speed of conventional and proposed methods for N = 256.

             TBloop size        Synchronization time (t_sync)        Simulation speed (1/t_total)
Test case    Gate    Memory     Conventional   Proposed             Conventional    Proposed       Speedup
             count   (Kbits)    method (ms)    method (ms)   Ratio  method (Kcps)   method (Kcps)  factor
GIO          0       0          19.5           0.47          41.2   48              646            13.3
IDCT         0       0          21.7           0.48          45.1   44              701            15.9
VLD          400     4.1        22.4           0.92          24.4   42              420            10.0
PCM          0       0          21.5           0.39          55.0   44              529            12.2
USB          100     0          22.1           0.65          33.8   42              415            9.9
ALU          0       0          22.7           0.79          28.6   41              404            9.8
Z80          200     2.0        21.9           0.65          33.6   43              512            12.1
DES (EBC)    1,100   16.4       24.0           2.00          12.0   39              272            7.0
DES (CBC)    1,600   16.4       24.7           2.31          10.7   39              286            7.4
DES3 (EBC)   1,100   3.2        23.4           1.76          13.3   40              309            7.7
DES3 (CBC)   1,600   1.0        25.0           2.33          10.7   38              276            7.3

Parallel processing
Although our approach uses the sequential operation of the Verilog simulation reference model, we can exploit parallelism for operations that have no precedence relationship between them. For example, Figure 3a shows three statements. Statements 1 and 2 are both sensitive to the Clk signal, so we perform 1E and 2E after 1U. Because operations 1E and 2E have no precedence relationship with each other, we can perform them in parallel. A software simulator cannot exploit this kind of parallelism because it runs on a sequential machine. Our method, on the other hand, implements each statement as individual equivalent-statement hardware, so we can execute 1E and 2E simultaneously. This parallel-processing capability reduces t_accelerator in Equation 3.
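Grouping operations without precedence relationships into parallel levels can be sketched as a topological leveling of the operation graph. This is our own illustration, using the operation names of Figure 3a:

```python
# Topological leveling: operations in the same level have no precedence
# relationship between them and can execute in the same Eclk cycle.
# deps maps each operation to the operations that must complete first.

def levels(deps):
    level = {}
    def depth(op):
        if op not in level:
            level[op] = 1 + max((depth(d) for d in deps[op]), default=0)
        return level[op]
    for op in deps:
        depth(op)
    out = {}
    for op, lv in level.items():
        out.setdefault(lv, set()).add(op)
    return out

# 1U updates Clk; 1E and 2E are both sensitive to Clk, so each depends
# only on 1U and they share a level; 2U then enables 3E, and so on.
deps = {"1U": [], "1E": ["1U"], "2E": ["1U"], "2U": ["2E"],
        "3E": ["2U"], "3U": ["3E"]}
print(levels(deps))
```

Level 2 contains both 1E and 2E, matching the simultaneous execution described above; in hardware, each level costs one Eclk cycle regardless of how many statements it contains.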
Experimental results
We performed hardware-software coemulation experiments to evaluate TPartition. For the hardware accelerator, we implemented a PCI card featuring an 8-million-gate Xilinx Virtex-II FPGA. For the software side, we used the Cadence NC-Sim HDL simulator running on an Intel Pentium 2.8-GHz processor. To focus on the communication overhead between the simulator and the accelerator, we compared the proposed approach with a conventional hardware acceleration scheme rather than with pure software simulation.

As Equation 4 shows, simulation speed increases in proportion to N. In our experiments, simulation speed increased until N reached 256 but saturated once N reached 512: (3/N)t_setup is the main influence on speed for smaller N, but the effect of t_buffer becomes dominant as N increases. We got the best performance when N was 256, and we used this value of N for the following experiments.

Table 1 shows the experimental results of the conventional and proposed hardware acceleration methods for various designs from OpenCores and industry. Using the proposed method, we split the testbench into three parts, of which TBloop was translated and synthesized for mapping to the FPGA-based accelerator. The table's second and third columns display the size of the synthesized testbench part (TBloop). For some test cases (GIO, IDCT, PCM, and ALU), the testbench has no backward path; in these cases, we split the testbench into only two parts (TBsource and TBsink), and TBloop did not exist. In other cases (VLD, Z80, DES, and DES3), in which TBloop includes two-dimensional arrays, we used FPGA memory blocks to implement the synthesized testbench. The fourth through ninth columns show synchronization times and overall simulation speeds. A commercial simulation accelerator can increase simulation speed up to 100,000 cps.6,7 However, this figure is the maximum speed; in our experiments, we achieved speeds up to 48,000 cps using the conventional method. The proposed method, on the other hand, reduced communication time by a factor of 10.7 to 55.0 and performed simulation at up to 701,000 cps, 15.9 times faster than the conventional method. The communication time reductions differed among the various designs. As shown previously, the buffer size is proportional to the DUT port's bit width. Large-bit-width designs such as DES and triple DES have large buffering times (t_buffer), which degrade the effect of the setup time (t_setup) reduction as the value of N increases in Equation 4.
TPARTITION IMPROVES the performance of hardware-accelerated simulation without a designer's remodeling effort and without losing compatibility with the original testbench. We are currently working on further experiments, applying the methodology to industrially relevant examples. In future work, we plan to extend our HDL translation technique to gate-delay-annotated DUTs for hardware-accelerated timing verification. To do this, we must develop optimization techniques to reduce hardware resource use. ■

References
1. International Technology Roadmap for Semiconductors, Semiconductor Industry Assoc., 2001.
2. J. Bauer et al., "A Reconfigurable Logic Machine for Fast Event-Driven Simulation," Proc. 35th Design Automation Conf. (DAC 98), ACM Press, 1998, pp. 668-671.
3. "Incisive Palladium Datasheet," Cadence, 2004; http://www.cadence.com/datasheets/IncisivePalladium_ds.pdf.
4. "VStation HDL Link Datasheet," Mentor Graphics, 2002; http://www.mentor.com/vstation/datasheets/HDL_Link_ds.pdf.
5. Celaro User's Manual, Mentor Graphics, 2002.
6. "Xcite Product Description," Verisity Design; http://www.verisity.com/products/xcite.html.
7. "ZeBu Product Description," Eve Emulation and Verification Engineering; http://www.eveteam.com/product_description.html.
8. M. Bauer et al., "A Method for Accelerating Test Environments," Proc. 25th Euromicro Conf., vol. 1, IEEE Press, 1999, pp. 477-480.
9. R. Henftling et al., "Re-use-centric Architecture for a Fully Accelerated Testbench Environment," Proc. 40th Design Automation Conf. (DAC 03), ACM Press, 2003, pp. 372-375.
10. Standard Co-Emulation Modeling Interface Reference Manual, Version 1.0, Accellera, 2003; http://www.eda.org/itc/scemi.pdf.
11. P.-S. Tseng and S.S.-P. Lin, Coverification System and Method, US patent 6,389,379 B1, Patent and Trademark Office, 2002.
12. IEEE Std. 1364-2001, Verilog Hardware Description Language, IEEE Press, 2001.

Young-Il Kim is a PhD candidate in the Department of Electrical Engineering and Computer Science at the Korea Advanced Institute of Science and Technology (KAIST). His research interests include hardware-accelerated simulation, reconfigurable systems, and SoC design. Kim has a BS from Korea University and an MS from KAIST, both in electrical engineering. He is a student member of the IEEE.

Chong-Min Kyung is a professor in the Department of Electrical Engineering and Computer Science at KAIST. He is also a director of IDEC (Integrated Circuit Design Education Center), Korea. His research interests include microprocessor and DSP architecture, chip design, and verification methodology. Kyung has a BS in electronic engineering from Seoul National University and an MS and a PhD, both in electrical engineering, from KAIST. He is a senior member of the IEEE.

Direct questions and comments about this article to Young-Il Kim, VLSI Systems Lab., CHiPS, Korea Advanced Institute of Science and Technology, Daejeon, Korea; [email protected].