Synergies for Design Verification
TPartition: Testbench Partitioning for Hardware-Accelerated Functional Verification

Young-Il Kim
Korea Advanced Institute of Science and Technology

Chong-Min Kyung
Korea Advanced Institute of Science and Technology; Integrated Circuit Design Education Center

Editor's note: This hybrid dynamic simulation scheme implements part of the simulator in software running on a processor and maps the rest onto a programmable hardware accelerator. An algorithm for hardware synthesis of behavioral testbenches enables better partitions, resulting in lower communication costs between the two components. —Sharad Malik, Princeton University

AS SOC DESIGN COMPLEXITY INCREASED, the verification process became a critical bottleneck in the design process.1 Although software simulation is the most common and familiar verification method, software simulator performance is inadequate for today's hardware design complexity. Simulators run on general-purpose computers, which process tasks sequentially, whereas hardware signals propagate simultaneously along numerous paths in the actual circuits. To overcome this performance limitation, designers have used special-purpose hardware. Most hardware simulators are used with real target boards for in-circuit emulation. Although emulation has high performance capability, it has low flexibility because a behavioral testbench cannot be applied to the emulator. Researchers have tried to execute behavioral testbenches in processor-based emulators,2 but this event-driven simulation degrades emulator performance. Most important, it cannot match the flexibility of a software simulator, which can execute system calls, such as terminal display and file I/O, and high-level testbenches, such as Vera and C++. Therefore, the software simulator must be combined with a hardware emulator after all.3-7 However, the communication overhead between the software simulator and the hardware accelerator has become a new critical bottleneck. To reduce communication overhead, designers use a transaction-level interface to reduce the amount of communication data.8,9 Accellera introduced the Standard Co-Emulation Modeling Interface (SCE-MI), which defines APIs for testbench and hardware interfaces for transactor description.10 Commercial emulators use SCE-MI for transaction-level interfaces.3,4,7 However, this method requires designers to describe the transactor in a synthesizable fashion and rewrite the testbench in a high-level language. In addition, when the verification result and the expected result don't match, designers cannot be sure whether the design under test (DUT) or the transactor is wrong, and they must make additional efforts to verify the transactor and the new testbench design. What designers want is not a changed verification environment but increased simulation speed. In another emulation method, the system analyzes hardware description language (HDL) code and classifies HDL components into synthesizable and unsynthesizable parts.4,11 The unsynthesizable part runs on a software simulator in the host machine, and the synthesizable part executes in the hardware accelerator. The designer can apply an existing test environment without remodeling efforts. Although this method increases performance by offloading the synthesizable part of the testbench from the software simulator, the partition between the soft-
0740-7475/04/$20.00 © 2004 IEEE. Copublished by the IEEE CS and the IEEE CASS.
IEEE Design & Test of Computers, November–December 2004
ware simulator and the accelerator does not account for communication efficiency, and performance is limited to 100,000 cycles per second (cps).6 Another method stores stimuli in the memory located in the emulator. When the designer performs additional simulation, the system applies the prestored patterns to the DUT and compares the outputs with the expected values.5 This method avoids communication overhead by not interacting with the testbench. Although the method is useful for fast regression testing, the designer must perform cosimulation at least once to get the stimuli patterns and the expected results. Moreover, it is not applicable to designs that are self-driven and have nondeterministic behavior. In this article, we present TPartition, a new scheme for accelerating functional simulation. Starting with conventional hardware-accelerated simulation, we exploit the characteristics of the channel between the simulator and the accelerator. To speed up functional simulation, we use hardware to offload calculation-intensive tasks from the software simulator. To reduce communication overhead, we identify a part of the testbench involved in the input port’s data dependency on the output port and move it into the hardware accelerator. We do not raise the abstraction level or modify communication data. Thus, our method accelerates simulation speed without a designer’s modeling efforts and without losing compatibility with the original test environment.
System architecture
Figure 1a shows a typical system architecture for hardware-accelerated simulation. The software-simulated design (SSD) runs on the HDL simulator in the host computer, and the hardware-accelerated design (HAD) is mapped to the FPGA in the hardware accelerator. The coemulation interface between the HDL simulator and the FPGA combines the HAD proxy, the programming language interface (PLI), the device driver, the system bus, the system bus interface, and the transactor, all in a serial connection. The designer-supplied testbench interfaces with the coemulation interface via the HAD proxy, which receives the HAD's input port values from the testbench and sends them as a message to the hardware accelerator through all the components of the coemulation interface. Similarly, the HAD proxy reads the HAD's output port values through the same path and feeds them to the testbench.

Synchronization
Crossing the coemulation interface is a burdensome task because of its inherently long path. However, the task is necessary for synchronizing the software simulator with the hardware accelerator. Here, we describe the detailed synchronization mechanism and propose a new method of reducing communication overhead.

Conventional synchronization scheme
In the conventional scheme, as Figure 1b shows, the designer executes the testbench on the HDL simulator and maps the DUT to the FPGA. Synchronization of a clock cycle requires four steps:

1. When the testbench clock event occurs, the coemulation interface delivers the input port value and indicates which clock will be advanced to the DUT.
2. The hardware accelerator then advances the DUT one clock cycle and evaluates it.
3. After the hardware accelerator stabilizes the result value at the DUT's output port, the coemulation interface delivers the output port value to the testbench.
4. Finally, the testbench checks the DUT's output results and calculates the input port value for the next clock cycle.

Two operations, steps 1 and 3, occur at every clock cycle across the coemulation interface. The CPU time for a single clock cycle consists of three distinct components:

t_total = t_simulator + t_sync + t_accelerator

where t_simulator denotes the CPU time consumed in the host computer for processing the testbench in step 4, t_sync denotes the synchronization time for steps 1 and 3, and t_accelerator denotes the emulation time for evaluating the DUT circuit in step 2. Synchronization time t_sync consists of three components, the synchronization times for the input, output, and clock ports, denoted t_inport, t_outport, and t_clkport:

t_inport = t_setup + (BW_inport / BW_bus) t_payload
t_outport = t_setup + (BW_outport / BW_bus) t_payload
t_clkport = t_setup + t_payload

where BW_inport, BW_outport, and BW_bus denote the bit widths of the input port, output port, and system bus.
Figure 1. Functional verification system target architecture (a). A conventional acceleration system performs synchronization between the software simulator and the hardware accelerator through the coemulation interface at every clock cycle (b); the proposed system performs synchronization through the coemulation interface every N simulation cycles (c). Each circled number (1, 2, 3, 4) represents a step in the synchronization of a clock cycle.
The term t_setup denotes the time required to set up a transaction in the coemulation interface; it comprises five terms, as follows:

t_setup = t_PLI_setup + t_driver_setup + t_bus_setup + t_bus_interface_setup + t_transactor_setup

The term t_payload is the time required to send one additional word of data through the coemulation interface and is defined as follows:

t_payload = max(t_PLI_payload + t_driver_payload, t_bus_payload, t_bus_interface_payload, t_transactor_payload)   (1)

For example, to make one peripheral component interconnect (PCI) transaction, the sender must conduct an arbitration process to use the bus; once it acquires the bus, it can transmit a number of words in a burst data transfer. In this case, t_bus_setup corresponds to the bus arbitration time, and t_bus_payload is the time needed to transmit one additional word through the bus. Finally, we define t_sync as

t_sync = t_inport + t_outport + t_clkport
       = 3 t_setup + (BW_inport / BW_bus + BW_outport / BW_bus + 1) t_payload   (2)

Note that we obtain t_setup by summing the setup times of all the components in the coemulation interface. In contrast, as Equation 1 shows, we obtain t_payload by finding the maximum transmit time among the components. Because the components of the coemulation interface are serially connected, the setup time accumulates across all of them. Data transmission, on the other hand, is pipelined, so the total payload time is that of the component with the lowest bandwidth (the PLI and the driver execute sequentially in software, so their payload times add). Therefore, setup time is the dominant factor in total synchronization time in Equation 2.
Proposed synchronization scheme
We know that reducing t_setup is essential to minimizing total simulation time. Unfortunately, the amount of data needed to synchronize one clock cycle is fixed, short of a compression technique, which is beyond the scope of this article. To reduce the first term of Equation 2, we instead perform synchronization only once for every several clock cycles. We can do this by moving into the hardware accelerator the part of the testbench that is involved in the input port's data dependency on the output port's data. Figure 1c shows each clock cycle operation of the proposed acceleration system. The system is based on the architecture depicted in Figure 1a. Unlike the conventional system's operation (Figure 1b), crossing the coemulation interface does not occur at every clock cycle in our scheme. Rather, the testbench's source part (TBsource) executes for a number of clock cycles but sends data through the coemulation interface only once. N denotes the number of clock cycles within the synchronization interval. When TBsource sends synchronization data corresponding to N clock cycles, the accelerator executes the DUT and the testbench's loop part (TBloop) very quickly by advancing N clock ticks. Finally, the testbench's sink part (TBsink) receives the DUT result and performs its own operation. Because we perform all these procedures in a pipelined manner using direct memory access, the software simulator and the accelerator can execute in parallel, which is not possible in the conventional method. Therefore, the CPU time for a single clock cycle becomes

t_total = max(t_simulator, t_sync, t_accelerator)   (3)

We describe the time consumed by synchronizing the input, output, and clock ports as follows:

t'_inport = (1/N) t_setup + (BW'_inport / BW_bus) t_payload
t'_outport = (1/N) t_setup + (BW'_outport / BW_bus) t_payload
t'_clkport = (1/N) t_setup + t_payload

Our approach performs synchronization at intervals of N clock cycles, reducing the first term by a factor of N but, compared with the conventional method, leaving the second term unchanged. Note that these values are normalized to one clock cycle. Finally, we write the time required for synchronization as

t'_sync = t'_inport + t'_outport + t'_clkport + t_buffer
        = (3/N) t_setup + (BW'_inport / BW_bus + BW'_outport / BW_bus + 1) t_payload + t_buffer   (4)
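To make the cost model concrete, the per-cycle synchronization times of Equations 2 and 4 can be evaluated in a few lines of Python. The timing and bit-width values below are illustrative assumptions, not measurements from our experiments:

```python
# Per-cycle synchronization-time model: Equation 2 (conventional scheme)
# and Equation 4 (proposed scheme). Times are in arbitrary units;
# bit widths are in bits. All concrete numbers are assumptions.

def t_sync_conventional(t_setup, t_payload, bw_in, bw_out, bw_bus):
    # t_sync = 3*t_setup + (BW_in/BW_bus + BW_out/BW_bus + 1) * t_payload
    return 3 * t_setup + (bw_in / bw_bus + bw_out / bw_bus + 1) * t_payload

def t_sync_proposed(t_setup, t_payload, bw_in, bw_out, bw_bus, n, t_buffer=0.0):
    # t'_sync = (3/N)*t_setup + (BW'_in/BW_bus + BW'_out/BW_bus + 1)*t_payload
    #           + t_buffer
    return ((3 / n) * t_setup
            + (bw_in / bw_bus + bw_out / bw_bus + 1) * t_payload
            + t_buffer)

conv = t_sync_conventional(t_setup=10.0, t_payload=0.1,
                           bw_in=64, bw_out=64, bw_bus=32)
prop = t_sync_proposed(t_setup=10.0, t_payload=0.1,
                       bw_in=64, bw_out=64, bw_bus=32, n=256)
# Because setup cost dominates, the ratio approaches N for large N
# until t_buffer becomes significant.
print(conv, prop, conv / prop)
```

Running the model with a nonzero t_buffer shows the saturation effect discussed later: past some N, the (3/N)t_setup term is negligible and t_buffer caps the achievable reduction.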
Figure 2. Testbench splitting and placement procedure in TPartition: original testbench (a), testbench partition (b), and placement of each part of the testbench (c).
where t_buffer denotes the buffering time, which we introduce because our approach performs synchronization every N simulation cycles. The software side of the coemulation interface includes two buffers, which store the input and output synchronization data, of N × BW'_inport and N × BW'_outport bits, respectively. Within a synchronization interval, the input port buffer fills and the output port buffer drains over N clock cycles. At the synchronization point, the input port buffer is flushed to the accelerator and the output port buffer is refilled. We can make t_buffer negligible compared with t_setup by locating the buffers as close as possible to the testbench to
reduce buffer access time. In our experiment, we placed the buffers in the HAD proxy.
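The interval buffering just described can be sketched as follows. The function names and the increment-only "DUT" are hypothetical stand-ins for the HAD proxy and the accelerator round trip:

```python
# Sketch of N-cycle batching: the proxy-side input buffer fills for N
# cycles, then a single crossing of the coemulation interface delivers
# N cycles of stimuli and returns N cycles of DUT outputs.

N = 4  # synchronization interval (the article's experiments use N = 256)

def accelerator_round_trip(input_words):
    # Stand-in for the FPGA side: this toy "DUT" increments each word.
    return [w + 1 for w in input_words]

def run(stimuli, n=N):
    in_buf, results = [], []
    for word in stimuli:
        in_buf.append(word)          # buffer fills during the interval
        if len(in_buf) == n:         # synchronization point: one crossing
            results.extend(accelerator_round_trip(in_buf))
            in_buf.clear()
    if in_buf:                       # flush a final partial interval
        results.extend(accelerator_round_trip(in_buf))
    return results

print(run([0, 1, 2, 3, 4, 5]))  # → [1, 2, 3, 4, 5, 6]
```

With six stimuli and N = 4, only two interface crossings occur instead of six, which is exactly the (3/N)t_setup reduction of Equation 4.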
Partitioning the testbench
Our testbench-partitioning method enables the proposed synchronization scheme. Figure 2 shows the testbench splitting and placement procedure in the TPartition scheme. We model the testbench structure as a directed graph consisting of vertices and edges. A vertex denotes an HDL construct, and an edge denotes a relationship between HDL constructs. The source vertex generates test patterns, and the sink vertex consumes result patterns. For example, system task functions such as $random and $display can be source and sink vertices, respectively. We define a path as one or more connected edges. There are two kinds of paths: forward and backward. A forward path starts from the source vertex and ends at the sink vertex. A backward path starts from the DUT's output port and ends at the input port. As Figure 2b shows, TPartition splits the original testbench into three parts:

■ TBsource, the testbench part that includes the source vertices;
■ TBsink, the testbench part that includes the sink vertices; and
■ TBloop, the testbench part that includes the vertices involved in a backward path.

Figure 3. Example operation sequence of the simulation model (a), waveform of the proposed hardware implementation (b), and equivalent hardware translation of a Verilog statement (c). The Verilog source in (a) is

forever Clk = #1 ~Clk;
always @(posedge Clk) a = a + 1;
assign b = a;

and its simulation operations are: 1E, evaluate RHS (~Clk) and schedule the update of LHS (Clk) after #1; 1U, update LHS (Clk) and activate the RHS evaluation of statements 1 and 2 (~Clk, a+1); 2E, evaluate RHS (a+1) and activate the update of LHS (a); 2U, update LHS (a) and activate the RHS evaluation of statement 3 (a); 3E, evaluate RHS (a) and activate the update of LHS (b); 3U, update LHS (b). Operations within each period appear to occur simultaneously in the context of simulation time.
We first perform the split procedure by finding TBloop and splitting the rest of the testbench into TBsource and TBsink. Then we map TBloop to the accelerator and directly attach it to the DUT. Consequently, changes occur in the boundary between the software simulator and the hardware emulator. To cope with this, we generate new ports instead of using existing ports connected to the backward path. Finally, TBsource has only outgoing ports, and TBsink has only incoming ports. In this situation, TBsource and TBsink can run independently. TBsource can run through many clock cycles without waiting for data from the accelerator, whereas TBsink executes when the output port data becomes available from the accelerator.
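The split procedure can be sketched as a reachability computation on the testbench graph. This is our own reconstruction of the idea, not the authors' implementation, and the vertex names are hypothetical: TBloop is every vertex on a backward path (reachable from the DUT output and reaching the DUT input); of the remaining vertices, those that drive the DUT input form TBsource and the rest form TBsink.

```python
# Reconstruction of the TPartition split as two reachability passes over
# a directed graph of testbench constructs.

def reachable(graph, start):
    seen, stack = set(), [start]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(graph.get(v, []))
    return seen

def tpartition(edges, dut_in, dut_out):
    fwd, rev = {}, {}
    for a, b in edges:
        fwd.setdefault(a, []).append(b)
        rev.setdefault(b, []).append(a)
    from_out = reachable(fwd, dut_out)   # descendants of the DUT output
    to_in = reachable(rev, dut_in)       # ancestors of the DUT input
    loop = (from_out & to_in) - {dut_in, dut_out}
    rest = {v for e in edges for v in e} - loop - {dut_in, dut_out}
    source = {v for v in rest if v in to_in}   # feeds the DUT input
    sink = rest - source                       # consumes the DUT output
    return source, loop, sink

# Toy testbench: a stimulus generator, a checker, and one feedback block
# sitting on the backward path from "out" to "in".
edges = [("stimulus", "in"), ("out", "feedback"), ("feedback", "in"),
         ("out", "checker")]
print(tpartition(edges, "in", "out"))
```

Here the feedback block is the only vertex on a backward path, so it becomes TBloop and moves to the accelerator, while the stimulus generator (TBsource) and checker (TBsink) stay in the software simulator.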
Making the testbench synthesizable We propose a method for translating an unsynthesizable HDL description into a synthesizable one. We showed earlier that the loop part of the testbench moves into the hardware accelerator. However, we can describe the testbench in any style defined in an HDL. Because a behavioral testbench composed of time control, event control, nonstatic loops, and sequential statements is not synthesizable, we need a method for mapping this description into the FPGA-based accelerator. Therefore, we developed an automated translator that converts behavioral HDL code into synthesizable code.
Basic principle
According to the IEEE 1364-2001 Verilog HDL standard,12 an HDL simulator should follow the Verilog simulation reference model. We also based our approach on this model, to support all possible HDL syntax in the HDL testbench. At the beginning of a simulation time step, the right-hand side (RHS) of an assignment statement is evaluated, and then the left-hand side (LHS) variable is updated. The updated LHS variable might cause additional evaluate events for sensitive statements. The evaluate and update phases alternate until all events in the current time step have executed and no active processes remain. Figure 3a shows an example of simulation operations. The Verilog source code includes three statements. For each statement, the simulator performs operations in two phases: evaluate and update. 1E denotes the evaluate operation of statement 1, and 1U denotes the update operation of statement 1. Figure 3b shows the operation sequence for all three statements. Statements 1 and 2 are sensitive to the Clk signal, so 1E and 2E execute after 1U. Similarly, 3E executes after 2U. In the operation sequence, we introduce an emulation clock ("Eclk" in Figure 3a). Each operation is performed at a positive edge of Eclk, and there can be several Eclk cycles in a simulation time step. Simulation time advances when there are no more active events. To implement this behavior in the FPGA-based accelerator, we translate the assignment (indicated by the "=" sign) of an HDL statement into a 2-bit shift register in which two enabled registers are connected in cascade. As Figure 3c shows, we use the left register for the evaluate operation and the right register for the update operation. The Evaluate_Trigger signal enables the evaluate register, and the Update_Trigger signal enables the update register.

Figure 4. Example of testbench translation into synthesizable equivalent hardware: We analyze the three statements to extract the statement information table, from which we translate the statements into synthesizable equivalent statements in hardware. The extracted table is

              LHS    RHS    Sensitivity (evaluate)    Assignment (update)
Statement 1   Clk    ~Clk   Clk                       = #1
Statement 2   a      a+1    Clk                       =
Statement 3   b      a      a                         =
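The evaluate/update alternation for the three statements of Figure 3a can be mimicked by a small software model of the reference-model scheduling. This is an illustrative sketch of ours, not the authors' translator:

```python
# Miniature two-phase (evaluate/update) model of the three statements of
# Figure 3a. Each loop iteration is one #1 time step; within a step,
# pending (signal, value) updates are applied until no events remain.

def simulate(steps):
    sig = {"Clk": 0, "a": 0, "b": 0}
    trace = []
    for _ in range(steps):
        # Statement 1: Clk = #1 ~Clk, evaluated and scheduled for this step.
        pending = [("Clk", 1 - sig["Clk"])]
        while pending:                       # iterate until quiescent
            name, value = pending.pop(0)
            old, sig[name] = sig[name], value
            if name == "Clk" and old == 0 and value == 1:
                pending.append(("a", sig["a"] + 1))   # statement 2: a = a + 1
            if name == "a" and old != value:
                pending.append(("b", sig["a"]))       # statement 3: assign b = a
        trace.append((sig["Clk"], sig["a"], sig["b"]))
    return trace

print(simulate(4))  # → [(1, 1, 1), (0, 1, 1), (1, 2, 2), (0, 2, 2)]
```

The inner while loop plays the role of the Eclk cycles: a time step takes as many iterations as there are cascaded events, exactly the 1U, 1E/2E, 2U, 3E, 3U chain in the figure.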
Translation Figure 4 shows an example translation process for three HDL statements. For each statement, we extract the statement information table, which contains four fields: LHS, RHS, sensitivity, and assignment type. Our translator maps each information component to an internal block of equivalent-statement hardware. That is, it maps RHS, sensitivity, and assignment type to Evaluate_Logic, Evaluate_Trigger, and Update_Trigger, respectively.
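A translator front end along these lines can extract the statement information table from the statements of Figure 4. The sketch below is a deliberately simplified, hypothetical parser that handles only these three statement forms; a real HDL front end needs a full grammar:

```python
import re

# Simplified extraction of the statement information table (Figure 4)
# from single-assignment Verilog statements. Hypothetical sketch only.

def statement_info(stmt):
    m = re.match(r"always\s*@\((?:posedge\s+)?(\w+)\)\s*(\w+)\s*(<?=)\s*(.+);", stmt)
    if m:
        sens, lhs, assign, rhs = m.groups()
        return {"LHS": lhs, "RHS": rhs.strip(),
                "sensitivity": sens, "assignment": assign}
    m = re.match(r"assign\s+(\w+)\s*=\s*(\w+);", stmt)
    if m:
        lhs, rhs = m.groups()
        # A continuous assignment is sensitive to its RHS signal.
        return {"LHS": lhs, "RHS": rhs, "sensitivity": rhs, "assignment": "="}
    m = re.match(r"forever\s+(\w+)\s*=\s*(#\d+)\s*(.+);", stmt)
    if m:
        lhs, delay, rhs = m.groups()
        # The clock generator re-triggers on its own LHS update.
        return {"LHS": lhs, "RHS": rhs.strip(),
                "sensitivity": lhs, "assignment": "= " + delay}
    raise ValueError("unsupported statement: " + stmt)

for s in ["forever Clk = #1 ~Clk;", "always @(Clk) a=a+1;", "assign b=a;"]:
    print(statement_info(s))
```

On the three statements of Figure 4, this reproduces the LHS/RHS/sensitivity/assignment rows of the statement information table.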
Translated testbench architecture
Figure 5 shows the overall architecture of the translated testbench. The architecture contains two different paths: value (solid lines) and event (dashed lines). The blocks function as follows:

■ Evaluate_Logic generates the RHS value.
■ Evaluate_Trigger determines the time to enable the evaluate register. The Evaluate_Trigger signal is generated by sensitive-signal events.
■ Update_Trigger determines the time to enable the update register. For a blocking assignment, the Update_Trigger signal becomes active immediately after the Evaluate_Trigger signal becomes active. For a nonblocking assignment, Update_Trigger is delayed until all events in the current time step have executed. In addition, in cases using time control (indicated by the "#" sign), the update time becomes the simulation time specified by the time control statement's delay.
■ Event_Detector generates Event signals from Value signals. An Event signal is high for only one emulation clock (Eclk) cycle. We implement this block using an XOR gate with two inputs: a Value signal and the same signal delayed by one clock cycle.
■ Local_Interconnection routes the Value and Event signals from the evaluate registers to the update register. When only one statement drives the LHS variable, the Value and Event signals simply pass through this block. However, when multiple statements drive the LHS variable, all equivalent-statement hardware shares one update register, as shown in the bottom three boxes of Figure 5. Whenever any Update_Trigger signal is activated, the update register should be updated. When two or more Update_Trigger signals are activated simultaneously, this condition is a race condition, and the translator can select any statement to feed the update register.

Figure 5. Translated testbench architecture.
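The equivalent-statement hardware can also be modeled cycle by cycle in software to check the trigger behavior. This is a behavioral sketch with our own naming, not RTL from the article:

```python
# Cycle-level model of one equivalent-statement block: an evaluate
# register and an update register in cascade, each with an enable, plus
# an Event_Detector (value XOR its one-cycle-delayed copy, 1-bit wide).

class EquivalentStatement:
    def __init__(self):
        self.eval_reg = 0
        self.update_reg = 0
        self._prev_update = 0

    def eclk(self, rhs_value, evaluate_trigger, update_trigger):
        # Update first so it samples the evaluate register's old value,
        # modeling two registers clocked on the same Eclk edge.
        if update_trigger:
            self.update_reg = self.eval_reg   # update phase: LHS takes value
        if evaluate_trigger:
            self.eval_reg = rhs_value         # evaluate phase: sample RHS
        event = self.update_reg ^ self._prev_update   # Event_Detector
        self._prev_update = self.update_reg
        return event

stmt = EquivalentStatement()
print(stmt.eclk(1, evaluate_trigger=True, update_trigger=False))   # → 0
print(stmt.eclk(1, evaluate_trigger=False, update_trigger=True))   # → 1
```

The event output is high for exactly one Eclk cycle after the LHS changes, which is what lets downstream Evaluate_Trigger blocks fire once per update.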
Table 1. Synchronization time and simulation speed of conventional and proposed methods for N = 256.

             TBloop size        Synchronization time (t_sync)        Simulation speed (1/t_total)
Test case    Gate    Memory     Conventional   Proposed             Conventional    Proposed       Speedup
             count   (Kbits)    method (ms)    method (ms)   Ratio  method (Kcps)   method (Kcps)  factor
GIO          0       0          19.5           0.47          41.2   48              646            13.3
IDCT         0       0          21.7           0.48          45.1   44              701            15.9
VLD          400     4.1        22.4           0.92          24.4   42              420            10.0
PCM          0       0          21.5           0.39          55.0   44              529            12.2
USB          100     0          22.1           0.65          33.8   42              415            9.9
ALU          0       0          22.7           0.79          28.6   41              404            9.8
Z80          200     2.0        21.9           0.65          33.6   43              512            12.1
DES (EBC)    1,100   16.4       24.0           2.00          12.0   39              272            7.0
DES (CBC)    1,600   16.4       24.7           2.31          10.7   39              286            7.4
DES3 (EBC)   1,100   3.2        23.4           1.76          13.3   40              309            7.7
DES3 (CBC)   1,600   1.0        25.0           2.33          10.7   38              276            7.3

Parallel processing
Although our approach uses the sequential operation of the Verilog simulation reference model, we can exploit parallelism for operations that have no precedence relationship between them. For example, Figure 3a shows three statements. Statements 1 and 2 are both sensitive to the Clk signal, so we perform 1E and 2E after 1U. Because operations 1E and 2E have no precedence relationship with each other, we can perform them in parallel. A software simulator cannot exploit this kind of parallelism because it runs on a sequential machine. Our method, on the other hand, implements each statement as individual equivalent-statement hardware, so we can execute 1E and 2E simultaneously. This parallel-processing capability reduces t_accelerator in Equation 3.
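Grouping operations without precedence relationships into parallel levels can be sketched as a topological leveling of the operation graph. This is our own illustration, using the operation names of Figure 3a:

```python
# Topological leveling: operations in the same level have no precedence
# relationship between them and can execute in the same Eclk cycle.
# deps maps each operation to the operations that must complete first.

def levels(deps):
    level = {}
    def depth(op):
        if op not in level:
            level[op] = 1 + max((depth(d) for d in deps[op]), default=0)
        return level[op]
    for op in deps:
        depth(op)
    out = {}
    for op, lv in level.items():
        out.setdefault(lv, set()).add(op)
    return out

# 1U updates Clk; 1E and 2E are both sensitive to Clk, so each depends
# only on 1U and they share a level; 2U then enables 3E, and so on.
deps = {"1U": [], "1E": ["1U"], "2E": ["1U"], "2U": ["2E"],
        "3E": ["2U"], "3U": ["3E"]}
print(levels(deps))
```

Level 2 contains both 1E and 2E, matching the simultaneous execution described above; in hardware, each level costs one Eclk cycle regardless of how many statements it contains.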
Experimental results
We performed hardware-software coemulation experiments to evaluate TPartition. For the hardware accelerator, we implemented a PCI card featuring an 8-million-gate Xilinx Virtex-II FPGA. For the software side, we used the Cadence NC-Sim HDL simulator running on an Intel Pentium 2.8-GHz processor. To focus on the communication overhead between the simulator and the accelerator, we compared the proposed approach with a conventional hardware acceleration scheme rather than with pure software simulation.

As Equation 4 shows, simulation speed increases in proportion to N. In our experiments, simulation speed increased until N reached 256 but saturated once N reached 512: (3/N)t_setup is the main influence on speed for smaller N, but the effect of t_buffer becomes dominant as N increases. We got the best performance when N was 256, and we used this value of N for the following experiments.

Table 1 shows the experimental results of the conventional and proposed hardware acceleration methods for various designs from OpenCores and industry. Using the proposed method, we split the testbench into three parts, of which TBloop was translated and synthesized for mapping to the FPGA-based accelerator. The table's second and third columns display the size of the synthesized testbench part (TBloop). For some test cases (GIO, IDCT, PCM, and ALU), the testbench has no backward path; in these cases, we split the testbench into only two parts (TBsource and TBsink), and TBloop did not exist. In other cases (VLD, Z80, DES, and DES3), in which TBloop includes two-dimensional arrays, we used FPGA memory blocks to implement the synthesized testbench. The fourth through ninth columns show synchronization times and overall simulation speeds. A commercial simulation accelerator can increase simulation speed up to 100,000 cps.6,7 However, this figure is the maximum speed; in our experiments, we achieved speeds up to 48,000 cps using the conventional method. The proposed method, on the other hand, reduced communication time by a factor of 10.7 to 55.0 and performed simulation at up to 701,000 cps, 15.9 times faster than the conventional method. The communication time reductions differed among the various designs. As shown previously, the buffer size is proportional to the DUT port's bit width. Large-bit-width designs such as DES and triple DES have large buffering times (t_buffer), which degrade the effect of the setup time (t_setup) reduction as the value of N increases in Equation 4.
TPARTITION IMPROVES the performance of hardware-accelerated simulation without a designer's remodeling effort and without losing compatibility with the original testbench. We are currently working on further experiments, applying the methodology to industrially relevant examples. In future work, we plan to extend our HDL translation technique to gate-delay-annotated DUTs for hardware-accelerated timing verification. To do this, we must develop optimization techniques to reduce hardware resource use. ■

References
1. International Technology Roadmap for Semiconductors, Semiconductor Industry Assoc., 2001.
2. J. Bauer et al., "A Reconfigurable Logic Machine for Fast Event-Driven Simulation," Proc. 35th Design Automation Conf. (DAC 98), ACM Press, 1998, pp. 668-671.
3. "Incisive Palladium Datasheet," Cadence, 2004; http://www.cadence.com/datasheets/IncisivePalladium_ds.pdf.
4. "VStation HDL Link Datasheet," Mentor Graphics, 2002; http://www.mentor.com/vstation/datasheets/HDL_Link_ds.pdf.
5. Celaro User's Manual, Mentor Graphics, 2002.
6. "Xcite Product Description," Verisity Design; http://www.verisity.com/products/xcite.html.
7. "ZeBu Product Description," Eve Emulation and Verification Engineering; http://www.eveteam.com/product_description.html.
8. M. Bauer et al., "A Method for Accelerating Test Environments," Proc. 25th Euromicro Conf., vol. 1, IEEE Press, 1999, pp. 477-480.
9. R. Henftling et al., "Re-use-centric Architecture for a Fully Accelerated Testbench Environment," Proc. 40th Design Automation Conf. (DAC 03), ACM Press, 2003, pp. 372-375.
10. Standard Co-Emulation Modeling Interface Reference Manual, Version 1.0, Accellera, 2003; http://www.eda.org/itc/scemi.pdf.
11. P.-S. Tseng and S.S.-P. Lin, Coverification System and Method, US patent 6,389,379 B1, Patent and Trademark Office, 2002.
12. IEEE Std. 1364-2001, Verilog Hardware Description Language, IEEE Press, 2001.

Young-Il Kim is a PhD candidate in the Department of Electrical Engineering and Computer Science at the Korea Advanced Institute of Science and Technology (KAIST). His research interests include hardware-accelerated simulation, reconfigurable systems, and SoC design. Kim has a BS from Korea University and an MS from KAIST, both in electrical engineering. He is a student member of the IEEE.

Chong-Min Kyung is a professor in the Department of Electrical Engineering and Computer Science at KAIST. He is also a director of IDEC (Integrated Circuit Design Education Center), Korea. His research interests include microprocessor and DSP architecture, chip design, and verification methodology. Kyung has a BS in electronic engineering from Seoul National University and an MS and a PhD, both in electrical engineering, from KAIST. He is a senior member of the IEEE.

Direct questions and comments about this article to Young-Il Kim, VLSI Systems Lab., CHiPS, Korea Advanced Institute of Science and Technology, Daejeon, Korea; [email protected].