A Timed HW/SW Coemulation Technique for Fast Yet Accurate System Verification

Hoeseok Yang, School of EECS, Seoul National University, Seoul, Korea, [email protected]

Youngmin Yi, Dept. of EECS, University of California, Berkeley, CA, USA, [email protected]

Soonhoi Ha, School of EECS, Seoul National University, Seoul, Korea, [email protected]

Abstract: In System-on-Chip (SoC) design, it is essential to verify the correctness of a design before a chip is fabricated. While conventional hardware emulators validate the functional correctness of hardware components quickly, few studies have used hardware emulators for timing verification, since the cost of synchronizing the hardware emulator with the other parts of the system easily outweighs the speed gain of the emulator. In this paper we propose a novel hardware/software coemulation framework for fast yet accurate system verification based on the virtual synchronization technique. For virtual synchronization, an interface protocol and interface logic between a hardware emulator and the HW/SW coemulation kernel are proposed. Experiments with real-life examples demonstrate the effectiveness of the proposed technique.

Keywords: verification; coemulation; cosimulation; synchronization

I. INTRODUCTION

In System-on-Chip (SoC) design, it is essential to verify the correctness of a design before a chip is fabricated. For this purpose, cycle-accurate hardware/software (HW/SW) cosimulation techniques have been developed that integrate instruction-set simulators (ISSs) and a register-transfer-level (RTL) simulator to simulate software and hardware components, respectively [1]. As the RTL simulator is much slower than the ISS, hardware simulation is usually the bottleneck of overall system verification. Since hardware emulation is often preferred for faster hardware component verification [2][3][4], HW/SW coemulation techniques have also been developed by replacing the slow hardware simulator with a fast hardware emulator. However, hardware emulators are typically used only for functional verification, not for system verification that must verify timing as well. To verify both functional and timing correctness, the hardware emulator and the other component simulators must be synchronized. Synchronizing them at every clock cycle may nullify the performance gain obtained by adopting hardware emulation instead of simulation. A previous work [5] reports that over half of the time is consumed in communication at synchronization points during HW/SW coemulation if cycle-accurate validation is needed, which shows how critical the synchronization overhead is to the efficiency of system validation.

This work was supported by BK21 project, System IC 2010 project of Korean Ministry of Knowledge Economy, and Acceleration Research sponsored by KOSEF research program (R17-2007-086-01001-0). The ICT at Seoul National University and IDEC provided research facilities for this study.

Thus, reducing the synchronization overhead between a hardware emulator and the component simulators is crucial for fast system verification, and it is the main theme of this paper. In HW/SW cosimulation research, several techniques have been proposed to reduce the synchronization overhead between component simulators. They can be categorized into two approaches: conservative and optimistic. In the conservative approach, the next synchronization point is predicted in order to lengthen the interval between consecutive synchronizations. In the optimistic approach, on the other hand, simulators are executed for a predetermined amount of time without any synchronization. Afterwards, the simulation engine checks whether a synchronization failure has occurred and rolls back in the case of failure. However, some techniques in these approaches cannot be applied to HW/SW coemulation because they require special support that the hardware emulator cannot provide.

In this paper, we use the virtual synchronization technique proposed in [6] to reduce the synchronization overhead in HW/SW coemulation. Instead of synchronizing the component simulators directly, it synchronizes the traces generated by the component simulators in the simulation kernel. Since the component simulators need not be synchronized with each other, the hardware can be emulated at full speed for most of the running time. Synchronization overhead is added only when data communication with the other component simulators occurs. Because the communication protocol between component simulators is kept the same, the simulation kernel of virtual synchronization can also be used as the HW/SW coemulation kernel. This enables smooth migration from HW/SW cosimulation to HW/SW coemulation without changing the other component simulators. The proposed technique can therefore also be regarded as a way to accelerate HW/SW cosimulation by replacing the slowest simulator with an emulator.
The key technical challenge is to design the interface protocol between the hardware emulator and the emulation kernel. Based on this protocol, we design the hardware interface logic and the coemulation interface module.

The rest of this paper is organized as follows. In the next section, related work is reviewed to clarify the contribution of this paper. The virtual synchronization technique is briefly reviewed in section III, and the system verification performance is analyzed in section IV to identify the gain from the proposed technique. Section V explains the proposed technique in detail and section VI shows the experimental results. Conclusions and future work are presented in section VII.

II. RELATED WORK

Most commercial hardware emulator solutions conform to SCE-MI (Standard Co-Emulation API: Modeling Interface) [7], which provides a standardized software/hardware interface for efficient communication in HW/SW coemulation. However, as its main use is to bridge an untimed software simulation environment with a hardware emulator, it does not address system timing validation. The contribution of this paper is to make hardware emulation applicable to a timing validation framework by using the virtual synchronization technique. To clarify this, related research is reviewed in this section.

A. Efficient System Timing Verification

Speed and accuracy are in a trade-off relationship in system verification. The most popular compromise is to raise the abstraction level of the simulated component models. For software, the estimated elapsed time may be annotated in host code instead of using a cycle-accurate ISS. Similarly, the Transaction Level Model (TLM) is widely used with system-level design languages like SystemC [8]. However, cycle-accurate system verification with ISSs and RTL simulators is still preferred at the final stage of system design in practice, since TLM does not provide enough timing accuracy for system verification.

The lock-step approach, which synchronizes the simulators at every cycle, is widely adopted since it is the simplest and most accurate, but it suffers from very poor speed. There has been extensive research on accelerating simulation by reducing the synchronization overhead. One approach is to increase the interval between consecutive synchronizations. In the optimized conservative approach, simulators proceed without any synchronization up to the point where the component behavior is affected by other components. Such synchronization points are estimated conservatively by analyzing the software code or the system specification [9]. Unfortunately, this kind of estimation is not always possible.

The optimistic approach [10] is based on the assumption that past events never arrive during the simulation. Once the assumption is broken, the current simulation must roll back to the latest checkpoint to handle the events safely without any causality problem. If a component simulator does not support a roll-back mechanism, as is usually the case, the optimistic approach cannot be applied.

Recently, a novel technique called virtual synchronization has been proposed [6], in which component simulators are not synchronized explicitly. While the virtual synchronization technique improves the simulation speed significantly, the overall system verification performance is still bounded by the slow hardware simulator, as reported in [6]. As mentioned above, virtual synchronization is used in the proposed technique to guarantee the timing correctness of the system efficiently in HW/SW coemulation.

B. System Verification with Hardware Emulation

Several works have deployed a hardware emulator in a system validation framework. Their primary concern is how to synchronize the emulator with the other simulators. Nakamura et al. proposed a lock-step based HW/SW coemulation technique based on shared register communication [11]. This technique incurs heavy synchronization overhead because it synchronizes too frequently. Kim et al. proposed an improved approach that synchronizes simulators every N cycles [12]. Although it improves emulation performance by reducing the number of synchronizations, it may cause causality problems if N is chosen too large. It is therefore not general enough to validate a system for which N cannot be predicted safely.

Chung and Kyung's approach [13] also reduces the number of synchronizations for efficient timing validation in HW/SW coemulation. They showed that reducing the number of synchronization points could make coemulation an order of magnitude faster than a conservative lock-step approach. In their approach, the emulator predicts the synchronization points dynamically on both hardware and software at run-time so that simulators can run independently without synchronization during the predicted period. However, prediction modules must be attached to the hardware IP, and designing them is a major obstacle to the general application of their approach. Moreover, the prediction is possible only when the gate-level netlist is known. In contrast, the proposed approach is generally applicable to any hardware IP, since the required information can be obtained easily without knowledge of the internal design of the hardware module.

III. VIRTUAL SYNCHRONIZATION

This section explains the virtual synchronization technique that is used in the proposed coemulation framework. In virtual synchronization, a component simulator does not synchronize its own local clock to the global clock. Instead, it reports to the simulation kernel the special events that affect the behavior of the other components. Memory accesses and context switches are examples of the events that a component simulator reports. When an event occurs, the simulator calculates the relative time difference between the current event and the previous one and buffers the event trace along with the time difference information. A simulator is blocked when it meets an event that must be satisfied by the rest of the system. For example, if it meets a shared memory access, it is blocked until the access is satisfied. When it is blocked, it sends the buffered trace of events to the simulation kernel.

The simulation kernel receives the event traces from all component simulators, aligns the events, and services them in chronological order. The simulation kernel keeps track of the global clock. When an event is serviced, the simulation kernel wakes up the component simulator that awaits the event. Since each component simulator communicates with the simulation kernel only when it is blocked, the amount of inter-simulator communication is significantly reduced.

A trace-driven virtual synchronization framework consists of two parts, as shown in Figure 1. The event generation part extracts event traces from the component simulators, while the event alignment part assembles the events and sorts them in chronological order. The coemulation kernel also models operating systems, the underlying communication architecture, and the memory image for more accurate system validation.
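The trace generation side described above can be illustrated with a minimal Python sketch. It is not the implementation of [6]; the class `ComponentSimulator`, its methods, and the `context_switch` event at local time 12 are hypothetical names and values chosen for illustration. Only the write `w(x)1` at time 29 is taken from the example discussed later in this section. The sketch shows events being buffered with the time elapsed since the previous event, and the buffer being handed over only when the simulator blocks.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ComponentSimulator:
    """Hypothetical component simulator that buffers (delta, event) pairs."""
    name: str
    local_time: int = 0        # local clock, never synchronized to the global clock
    last_event_time: int = 0   # local time of the most recently recorded event
    buffer: List[Tuple[int, str]] = field(default_factory=list)

    def run(self, cycles: int) -> None:
        """Advance the local clock without talking to the kernel."""
        self.local_time += cycles

    def record(self, event: str) -> None:
        """Buffer an event together with the time since the previous event."""
        delta = self.local_time - self.last_event_time
        self.buffer.append((delta, event))
        self.last_event_time = self.local_time

    def block_and_flush(self) -> List[Tuple[int, str]]:
        """Called when the simulator blocks: hand the buffered trace to the kernel."""
        trace, self.buffer = self.buffer, []
        return trace


def to_absolute(start: int, trace: List[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Kernel-side helper: turn relative time differences into timestamps."""
    timeline, t = [], start
    for delta, event in trace:
        t += delta
        timeline.append((t, event))
    return timeline


sim = ComponentSimulator("proc1")
sim.run(12)
sim.record("context_switch")   # illustrative event, not from the paper
sim.run(17)
sim.record("w(x)1")            # the write that the example places at time 29
trace = sim.block_and_flush()
timeline = to_absolute(0, trace)
print(trace)
print(timeline)
```

Because only time differences are stored, the simulator never needs to know the global clock; the kernel reconstructs timestamps when it aligns the traces.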

[Figure 1: block diagram. The component simulators (SW simulator and HW simulator, each with a simulator interface) perform trace generation and send data and traces over sockets to the cosimulation/coemulation engine. Inside the engine, the simulation kernel performs event alignment using SW and HW task representatives, an OS modeler, a communication architecture modeler, and a memory image model.]

Figure 1. Trace-driven virtual synchronization coemulation framework.

Figure 2 compares the virtual synchronization technique and a conventional conservative approach using a simple example that consists of a hardware emulator and a software simulator. Suppose that the software simulator sends an event to the hardware emulator at point 3, and the hardware starts its execution and runs until point 4. In a conservative lock-step approach, as shown in (a), the two are synchronized at every clock cycle. With the virtual synchronization technique of Figure 2(b), the hardware emulator requests a read event from the kernel and is blocked on it from the start. It is resumed when the kernel receives the write event from the software simulator, services the event, and delivers it to the hardware emulator. Similar data communication also occurs at point 4 in the reverse direction (from hardware to software). Note that only two synchronization points are observed with virtual synchronization, while the lock-step approach synchronizes the simulators every cycle. Moreover, the idle cycles (1~2) of the hardware, in which the internal state of the hardware never changes, are not even executed in virtual synchronization.

[Figure 2: timelines of the HW simulator/emulator and the SW simulator over points 1 to 5, marking idle periods, active periods, and synchronization points for each approach.]

Figure 2. Synchronization methods: (a) Conservative approach and (b) Virtual Synchronization approach.

Figure 3 illustrates how virtual synchronization works with a simple example. Assume there are two processors, proc1 and proc2, each of which may be either software or hardware. Task1 is mapped to proc1, and task2 and task3 are mapped to proc2. We assume that task3 has a higher priority than task2 and that they are scheduled by a priority-based preemptive scheduler. Inter-task communications through shared memory are denoted by w(x)a or r(x)b, which stand for 'write value a to address x' and 'read value b from address x', respectively. For instance, w(x)1 at time 29 means that task1 wrote the value 1 to address x. We also specify the memory access latency in Figure 3(a), assuming a uniform delay for simplicity. The actual execution sequence in the simulators under the virtual synchronization approach is shown in Figure 3(b). The trace alignment phase produces the final result shown in Figure 3(a).

Figure 3. Virtual Synchronization example scenario: (a) The real behavior and (b) Execution sequence of simulators.

Suppose the cosimulation kernel schedules the proc1 simulator first. Then task1 is executed until it encounters w(x)1. Before this operation, the proc1 simulator returns to the kernel, since a simulator has to be synchronized before executing inter-task communication. The kernel cannot align the traces yet because the queues of both task2 and task3 are still empty: the traces of all the components in the system must be compared to advance the global clock safely. It schedules and executes task3, as it has the higher priority (①). Likewise, the proc2 simulator returns to the kernel before r(y) (②). Now that the kernel has the traces of both simulators, it aligns the traces conservatively up to time 10, as in Figure 3(a). It cannot go further than time 10 because there is no trace left in the queue of proc2. So it goes back to the trace generation phase again.
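The conservative alignment step just described can be sketched in a few lines of Python. This is an illustrative simplification, not the kernel of [6]: the function `align`, the queue layout, and the rule "advance only up to the earliest last-known timestamp" are assumptions made for the sketch, and the timestamps 29 and 10 are the ones mentioned in the text. It omits the OS modeler, bus contention, and interrupts.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str]  # (absolute timestamp, event label)


def align(queues: Dict[str, List[Event]], global_clock: int):
    """Service events in chronological order up to a safe horizon.

    The horizon is the smallest 'last known timestamp' over all queues.
    If any queue is empty, that component's future is unknown and the
    global clock cannot advance at all.
    """
    if any(len(q) == 0 for q in queues.values()):
        return [], global_clock
    horizon = min(q[-1][0] for q in queues.values())
    serviced = []
    for name, q in queues.items():
        while q and q[0][0] <= horizon:
            t, ev = q.pop(0)
            serviced.append((t, name, ev))
    serviced.sort()  # chronological order across all components
    return serviced, max(global_clock, horizon)


# proc1 has reported its trace up to w(x)1; proc2 has reported nothing yet.
queues = {"proc1": [(29, "w(x)1")], "proc2": []}
first = align(queues, 0)

# After proc2 runs task3 and returns before r(y), its trace reaches time 10.
queues["proc2"].append((10, "r(y)"))
events, clock = align(queues, 0)
print(first, events, clock)
```

In the second call the clock stops at 10 and the event of proc1 at time 29 stays queued, which mirrors why the kernel has to return to the trace generation phase.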

Suppose that r(y) is a blocking read operation. Task3 is blocked while performing the read operation, since there has been no w(y) before ③. The proc2 simulator interface informs the kernel of the status of task3, the OS modeler in the kernel now schedules task2 to be executed (④), and the proc2 simulator synchronizes before r(x). The kernel can now advance the global clock again, conservatively up to time 29, by comparing the events of all the components in the system (⑤). Note that, while it advances the global clock, the communication architecture modeler in the kernel detects that both task1 and task2 try to access the same bus at time 20. It models the contention by serializing the accesses according to the arbitration policy of the given bus model; the memory access that is initially requested by task2 at time 20 is actually granted at time 23 and is finally completed at time 26, as in Figure 3(a).

In this way, the cosimulation proceeds up to time 35, repeating the two phases (⑥-⑦). Suppose that w(y)2 by task1 at time 40 triggers an interrupt to proc2, waking up task3 from blocking. As a result, task3 preempts task2 immediately. However, in Figure 3(b), task2 is executed non-preemptively until it encounters r(z), and task1 completes its execution after performing w(y)2 (⑧-⑪). The kernel finds out that the interrupt has occurred while aligning the events of task1. Trace alignment pauses and the OS modeler schedules task3 to be executed (⑫). Although tasks are executed non-preemptively in virtual synchronization, the occurrence of interrupts is correctly modeled, since tasks cannot execute inter-task communication without first synchronizing with the kernel. Since proc2 is synchronized before r(z), w(z)3 is executed first by task3 and r(z)3 is then executed by task2, performing correct cosimulation (⑬-⑰).

Note that simulation is performed out of order even within the same processor simulator while preserving the causality constraints. Sorting the traces from all component simulators in chronological order guarantees that the simulated result is free of causality problems. Events on different processors are simulated out of order only when they have no causality relationship. For the proof of validity and more details, the interested reader is referred to [6].

The virtual synchronization technique is extensible: any component simulator can be integrated into the framework as long as it can communicate with the cosimulation kernel. The main requirement on a component simulator for virtual synchronization is that it generate events with inter-event time difference information. Since a typical ISS supports a foreign interface for modeling the memory access behavior, it can be integrated into the cosimulation framework without internal modification. In this work, a commercial ARM ISS, RVDS [14], was attached to the framework by modifying only its external memory interface (flatmem.c), whose source code is available.

IV. SYSTEM VERIFICATION PERFORMANCE ANALYSIS

To evaluate the speed gain of HW/SW coemulation with virtual synchronization, this section analyzes the simulation times of the lock-step and virtual synchronization approaches. The following terms are defined to formulate the cosimulation time of each approach:

$N$ : Number of simulators.
$T$ : Total simulated cycles.
$st_i$ : Simulation time to advance a cycle of simulator $i$.
$sync$ : Overhead per time synchronization.
$T_{trans}$ : Total number of communication transactions.
$st_{trans}$ : Simulation time to process a transaction.
$u_i$ : Utilization of simulator $i$.
$T^{sh}_{trans}$ : Total number of communication transactions to the shared memory.
$st_{trace}$ : Overhead per trace generation.
$st_{eval}$ : Overhead for trace evaluation.

The system verification time of the lock-step approach can be formulated as equation (1). As shown, the synchronization overhead ($sync$) is added to the simulation time at every cycle, and the simulation time to process a transaction ($st_{trans}$) is added for every transaction.

$$\sum_{\forall i}^{N} \left\{ T \times (st_i + sync) \right\} + T_{trans} \times st_{trans} \tag{1}$$

Equation (2) explains how virtual synchronization achieves a significant performance improvement. The total number of cycles to be simulated is reduced by the fraction $1 - u_i$, and time synchronization occurs only when data are exchanged ($T^{sh}_{trans}$). On the other hand, additional overhead for trace generation ($st_{trace}$) and evaluation ($st_{eval}$) is added for every transaction.

$$\sum_{\forall i}^{N} \left\{ T \times u_i \times st_i \right\} + T^{sh}_{trans} \times (sync + st_{trans}) + T_{trans} \times (st_{trace} + st_{eval}) \tag{2}$$

To clarify the performance gain of virtual synchronization over the lock-step approach, equation (3) is obtained by subtracting equation (2) from equation (1):

$$\sum_{\forall i}^{N} \left\{ T \times (1 - u_i) \times st_i \right\} + sync \times \left( T \times N - T^{sh}_{trans} \right) + st_{trans} \times \left( T_{trans} - T^{sh}_{trans} \right) - T_{trans} \times (st_{trace} + st_{eval}) \tag{3}$$
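Equations (1) to (3) can be cross-checked numerically. The following Python sketch evaluates the three expressions and confirms that (3) equals (1) minus (2). The function names and all parameter values are made up for illustration and are not measurements from this paper.

```python
import math


def lockstep_time(T, st, sync, T_trans, st_trans):
    """Equation (1): every simulator pays st_i + sync on every cycle."""
    return sum(T * (st_i + sync) for st_i in st) + T_trans * st_trans


def virtual_sync_time(T, st, u, sync, T_trans, T_sh, st_trans, st_trace, st_eval):
    """Equation (2): only active cycles are simulated; sync only on shared accesses."""
    return (sum(T * u_i * st_i for u_i, st_i in zip(u, st))
            + T_sh * (sync + st_trans)
            + T_trans * (st_trace + st_eval))


def gain(T, st, u, sync, T_trans, T_sh, st_trans, st_trace, st_eval):
    """Equation (3): gain of virtual synchronization over lock-step."""
    N = len(st)
    return (sum(T * (1 - u_i) * st_i for u_i, st_i in zip(u, st))
            + sync * (T * N - T_sh)
            + st_trans * (T_trans - T_sh)
            - T_trans * (st_trace + st_eval))


# Illustrative values only (seconds and counts); two simulators, i.e. N = 2.
p = dict(T=1000, st=[2e-6, 50e-6], u=[0.9, 0.4], sync=100e-6,
         T_trans=200, T_sh=20, st_trans=5e-6, st_trace=1e-6, st_eval=1e-6)

t_lock = lockstep_time(p["T"], p["st"], p["sync"], p["T_trans"], p["st_trans"])
t_vs = virtual_sync_time(**p)
g = gain(**p)
print(t_lock, t_vs, g)
```

With these sample values the synchronization term dominates the gain, which is consistent with the motivation given in the introduction.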

The positive terms are the gain, while the negative one is the overhead of virtual synchronization. There are three main kinds of gain. The first term accounts for the removal of the idle duration from simulation. For instance, the utilization of the hardware simulator in Figure 2 is 0.4 because only 2 of the 5 cycles are active, so the 3 idle cycles need not be simulated. The second term shows how many synchronization points have been removed: synchronization occurs only when data are exchanged between tasks. The third term indicates the removal of local memory transaction simulation. In exchange, the trace generation and evaluation overhead is added. This overhead is insignificant: even for a complex application like a video/audio codec, which is usually memory intensive, it is negligible (below 3%) relative to the total time.

Whereas the gain of virtual synchronization comes from the three factors above, we seek further improvement by reducing the per-cycle simulation time itself through hardware emulation. That is, $st_i$ in the first term of equation (3) is the main target to be reduced. Reducing it can again cut the simulation time significantly, since it is added at every cycle during system validation.

V. HW/SW COEMULATION

A. Proposed Virtual Prototyping System Using an Emulator

Figure 4 shows a simple target architecture and the corresponding virtual prototyping system with a hardware emulator. While the proposed technique can be applied to a more complex system that contains many processing cores, we assume here that the system consists of a single processor core and hardware components. In the proposed prototyping environment, the target microprocessor running software tasks is mapped to a software simulator (ISS), and the hardware component with its interface logic runs on a hardware emulator. Note that the hardware interface logic (the 'BUS IF' and 'sync' modules) of the target architecture is replaced with the 'HW IF' module in the virtual prototyping system. The underlying communication architecture (a shared bus in the example) and the shared memory (SRAM) are modeled in the coemulation kernel.

[Figure 4: (a) SW tasks on a microprocessor (uP) and a HW component with BUS IF and sync logic; (b) SW tasks on an ISS, and the HW with HW IF and the coemulation IF on the HW emulator, both exchanging traces with the coemulation kernel.]

Figure 4. (a) Target architecture and (b) Its virtual prototype with hardware emulator.

As reviewed in the previous section, a hardware emulator that replaces the hardware simulator in virtual synchronization must be able to (1) measure the execution cycles of the hardware (for time stamp computation of events), (2) generate memory access traces, and (3) interact with the coemulation kernel. Role (1) is realized in the hardware interface, which is attached to the synthesized hardware component, while roles (2) and (3) are implemented in the coemulation interface, which is software running on the emulation board. The details of each are explained in the following subsections.

While the proposed technique does not assume any specific hardware emulator architecture, the emulator used in the experiments of this paper consists of an ARM Versatile PB926EJ-S and a LogicTile XC2V6000. The PB926EJ-S is a general development board that has an ARM926EJ-S processor, AMBA buses, and peripherals such as a UART and an Ethernet controller. Figure 5 shows the structure of the hardware emulator. The HW IF module of Figure 4(b) is synthesized and attached to the hardware component in the FPGA of the LogicTile in Figure 5, while the coemulation interface program is also synthesized and runs on the ARM926EJ-S processor in Figure 5. For socket communication with the coemulation kernel and easy management of peripherals, Linux 2.6 is ported to the hardware emulator board. The interface program communicates with the coemulation kernel via sockets over Ethernet, and reads/writes data from/to the hardware through the off-chip SRAM on the board.

The sequence of the proposed hardware/software coemulation can be summarized as follows:
1. The coemulation kernel triggers the hardware component to start.
2. The coemulation kernel sends data to the interface program to feed data to the hardware component.
3. The coemulation interface writes input data to the SRAM, which is accessible by the FPGA.
4. The coemulation interface reads output data from the SRAM and timing information from the HW interface module.
5. Data, timing information, and generated traces are sent to the coemulation kernel.

As the coemulation interface is very simple and small, it has very little effect on the system validation speed.

[Figure 5: Versatile PB926EJ-S board with the ARM926EJ-S CPU running the coemulation IF, and the LogicTile XC2V6000 FPGA containing the HW IF and HW IP, connected through AHB, a bridge, and APB, with SRAM and a LAN connection to the coemulation kernel.]

Figure 5. Basic structure of the used hardware emulator.
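The five-step coemulation sequence of this subsection can be walked through with a small host-side mock in Python. Everything here is hypothetical: `MockHardware` stands in for the FPGA-side hardware component with its HW IF, the "hardware" simply doubles its inputs, and the 64-cycle latency is an invented number. The sketch only shows the order of operations and what is returned to the kernel, not the actual board software.

```python
class MockHardware:
    """Stand-in for the emulated HW component plus HW IF (hypothetical)."""
    CYCLES_PER_RUN = 64  # assumed latency, for illustration only

    def __init__(self):
        self.counter = 0      # cycle counter kept by the HW IF
        self.started = False

    def start(self):
        self.started = True

    def run(self, sram):
        if not self.started:
            raise RuntimeError("hardware was not triggered")
        sram["out"] = [2 * v for v in sram["in"]]  # toy function of the HW block
        self.counter += self.CYCLES_PER_RUN


class CoemulationInterface:
    """Mock of the interface program running on the emulator's processor."""

    def __init__(self, hw):
        self.hw = hw
        self.sram = {}          # models the SRAM shared with the FPGA
        self.prev_counter = 0

    def trigger(self):
        # Step 1: the kernel triggers the hardware component to start.
        self.hw.start()

    def process(self, data):
        # Step 2: data arrives from the kernel (here: the function argument).
        # Step 3: write the input data to the shared SRAM.
        self.sram["in"] = list(data)
        self.hw.run(self.sram)
        # Step 4: read output data from SRAM and timing info from the HW IF.
        out = self.sram["out"]
        elapsed = self.hw.counter - self.prev_counter
        self.prev_counter = self.hw.counter
        trace = [("write", "in", len(data)), ("read", "out", len(out))]
        # Step 5: data, timing information, and traces go back to the kernel.
        return out, elapsed, trace


iface = CoemulationInterface(MockHardware())
iface.trigger()
out, elapsed, trace = iface.process([1, 2, 3])
print(out, elapsed, trace)
```

In the real system the return value of `process` would travel over a socket to the coemulation kernel rather than being returned to a local caller.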

B. Hardware Interface Module

The HW IF module models the bus interface and synchronization logic of the target architecture, as shown in Figure 4. In the model-based hardware/software codesign environment [15], the hardware interface logic is automatically synthesized. Figure 6 shows an example in which the hardware block has two input ports and one output port: a receiver (RCV) interface is associated with each input port and a sender (SND) interface with each output port. While the implementation is based on such an automated design process from a dataflow specification, the proposed technique is not restricted to this design flow.

In addition to providing the bus interface and synchronization logic, the HW IF logic must fulfill two conditions for virtual synchronization. First, execution time must be measured in order to extract the relative time difference between events. Second, it must be possible to pause and resume the time measurement easily and freely. These two requirements can be easily satisfied with a commercial RTL simulator like ModelSim [16] by monitoring the bus activities in a HW/SW cosimulation environment. For HW emulation, however, we have to synthesize extra logic in the hardware interface to fulfill them.

[Figure 6: the HW component with receiver blocks RCV 0 and RCV 1 and sender block SND 0, a counter with an 'en' input and a 'check' signal, and a BUS wrapper; the legend distinguishes data transfer, active signal, event signal, and counter value.]

Figure 6. Basic Measurement unit of the hardware emulation.

To determine how many cycles are consumed between events in the emulated hardware, we attach an auxiliary counter to the hardware component, as drawn in Figure 6. Note that every event (task start/end and communication with the other components) occurs only through the RCV/SND blocks. So all we need to do is record the counter value whenever an event occurs in the RCV/SND blocks and send it to the coemulation interface module, which computes the time difference between two neighboring events. The only cost of the measurement is one auxiliary counter and the control signals from the RCV/SND blocks.

Since pausing/resuming the hardware component is physically impossible in emulation, one more control signal is needed. When the hardware component is blocked waiting for data, no active signal exists and the enable ('en') input of the counter is deasserted so that idle cycles are not accumulated. Once data has arrived, an active signal asserts the 'en' input and the counter starts to increase again.

Each input and output port is assigned a fixed number that indicates the number of data samples. For example, an input port (the Y input port of the MC block in Figure 8) may receive four data samples of macro-block size before it triggers a single event. The hardware interface logic then waits until all data samples are received before triggering an event to the coemulation interface. We omit the detailed logic structure due to space limitations. Whenever a RCV/SND block triggers an event, the 'check' signal is asserted to record the event triggering time. Additional channels through which the measured data are delivered are also needed, as drawn in Figure 6. The additional logic does not modify the hardware component itself.

C. Coemulation Interface Module

The primary role of the coemulation interface is to generate the event traces and interact with the coemulation kernel. These functions are realized as a program running on the microprocessor in the emulator. The pseudo code of the program is shown in Figure 7. The coemulation interface module blocks or resumes the hardware component depending on the arrival of data from the coemulation kernel, as shown at lines 4-5 and 15-16. Before every write (or read), it checks whether the channel is ready by examining the channel status. If not, it blocks on socket communication while keeping the counter's active signal deasserted so that idle cycles are not accumulated, as explained earlier. Once the channel is ready, it updates the channel status and writes to (or reads from) the hardware component at lines 6-7 and 17-18. The time difference between the current event time and the previous event time is calculated at lines 8-9 and 19-20 and is sent to the coemulation kernel. Besides the data and the time difference information, the event trace carries the channel information, including the address and the size. The generated traces are delivered to the coemulation kernel through socket communication.

     1: time_info prev;
     2: void write_to_HW(int ch_id){
     3:   time_info now;
     4:   while( not enough data arrived )
     5:     Request to Coemulation kernel and wait;
     6:   update_ch(ch_id);        // consume data in channel
     7:   hw_write(ch_id);         // hw shared memory access
     8:   now = read_counter(ch_id);
     9:   snd_time(now-prev);      // send time information
    10:   prev = now;
    11:   log_trace(ch_id);        // generate and send traces for channel
    12: }
    13: void read_from_HW(int ch_id){
    14:   time_info now;
    15:   while( data is not consumed yet from SW )
    16:     Request to Coemulation kernel and wait;
    17:   update_ch(ch_id);        // consume data in channel
    18:   hw_read(ch_id);          // hw shared memory access
    19:   now = read_counter(ch_id);
    20:   snd_time(now-prev);      // send time information
    21:   prev = now;
    22:   log_trace(ch_id);        // generate and send traces for channel
    23: }

Figure 7. Pseudo code of coemulation interface.
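The interplay between the gated counter of the HW IF and the `now - prev` computation of Figure 7 can be illustrated with a cycle-level Python sketch. The function `measure` and the sample waveforms are hypothetical, and the sketch assumes that the counter increments before 'check' samples it within a cycle; the actual logic structure is omitted in this paper.

```python
def measure(en_per_cycle, check_per_cycle):
    """Simulate a counter that counts only while 'en' is asserted.

    en_per_cycle    -- 1 when some active signal enables the counter
    check_per_cycle -- 1 when an RCV/SND block triggers an event
    Returns the recorded counter values and the inter-event differences.
    """
    counter = 0
    stamps = []
    for en, check in zip(en_per_cycle, check_per_cycle):
        if en:
            counter += 1            # idle (blocked) cycles are not accumulated
        if check:
            stamps.append(counter)  # record the counter value at the event
    # Same computation as 'now - prev' in Figure 7, starting from prev = 0.
    deltas = [now - prev for prev, now in zip([0] + stamps, stamps)]
    return stamps, deltas


# Three active cycles, three cycles blocked for data, then two active cycles.
en = [1, 1, 1, 0, 0, 0, 1, 1]
check = [0, 0, 1, 0, 0, 0, 0, 1]
stamps, deltas = measure(en, check)
print(stamps, deltas)
```

Although eight wall-clock cycles pass, the second event is reported only two cycles after the first, because the three blocked cycles are excluded from the count.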

VI. EXPERIMENTS

In this section, we compare a cosimulation based on the virtual synchronization technique with the proposed coemulation in terms of accuracy and speed. Note that the base cosimulation performance is already significantly higher than that of a lock-step cosimulation technique, as reported in [6]. A real-life example, an H.263 decoder processing 10 QCIF (176x144) frames, is used as the benchmark application for both frameworks. We consider two cases: (1) the Inverse Discrete Cosine Transform (IDCT) block is implemented as a hardware component, and (2) the Motion Compensation (MC) block is implemented as hardware. Figure 8 displays a sub-graph of the H.263 decoder algorithm showing how these blocks are connected.

As shown in Figure 8, the IDCT and MC blocks have very different characteristics. IDCT has a very simple data rate: it reads 1 macro-block (MB) and writes 1 MB. MC, in contrast, has varying data rates: for each invocation, the Y input port receives 4 macro-blocks and the U and V input ports receive 1 macro-block each. After being executed 99 times, the MC block produces one frame of data through its output port. In addition, the MC block needs a reference frame (the additional buffer on the frame buffer) for its internal execution. In summary, the IDCT block has a simple pattern of event generation (or I/O activity) and triggers events frequently (at every macro-block). The MC block, on the other hand, triggers events at the port boundary as well as during its execution when it refers to the reference frame. Therefore, the hardware interface logic for the MC block is more complex, since it must determine when to generate events under various triggering conditions.

Figure 8. Motion Compensation (MC) block and Inverse Discrete Cosine Transform (IDCT) block in H.263 decoder. [Diagram labels: Inv.Z, IDCT, MC with Y, U, and V input ports, MB size buffer, frame size buffer, additional buffer.]

As the synchronization logic synthesized in the FPGA (as shown in Figure 4) manages the complex rate control, the execution time of the coemulation kernel is reduced significantly. In the cosimulation approach, the IDCT case experiences about 5 times longer delay in the cosimulation kernel than the MC case due to more frequent communication.

In the experiments, the hardware block is simulated in ModelSim in the base cosimulation, while it is synthesized in the FPGA of the hardware emulator in the coemulation. All other function blocks are implemented as software blocks on an ARM9 processor, which is simulated on an ISS on a host machine with 4 Intel Xeon 3GHz processors and 4GB of main memory. RVDS 2.1 is used for the ARM9 ISS, while the emulation board in Figure 5 is deployed for the hardware emulation. Each simulator/emulator communicates with the coemulation kernel through a socket interface via Ethernet.

TABLE I. PERFORMANCE COMPARISON FOR CASE(1): IDCT HW IP

Unit: sec         | Cosimulation | Coemulation | Speedup
Software          | 28.10        | 26.93       | -
Hardware (IDCT)   | 53.68        | 13.13       | 4.1x
Cosim/emu. kernel | 19.85        | 0.71        | -
Sum               | 101.63       | 40.77       | 2.5x

TABLE II. PERFORMANCE COMPARISON FOR CASE(2): MC HW IP

Unit: sec         | Cosimulation | Coemulation | Speedup
Software          | 26.32        | 25.96       | -
Hardware (MC)     | 287.63       | 27.51       | 10.5x
Cosim/emu. kernel | 4.87         | 0.56        | -
Sum               | 318.82       | 54.03       | 5.9x

TABLE III. TIMING ACCURACY OF COEMULATION APPROACH

            | Cosimulation | Coemulation | Difference
IDCT(Total) | 26,184,588   | 26,188,408  | 0.015%
MC(Total)   | 13,231,805   | 13,201,970  | 0.002%

TABLE I shows the comparison of the two approaches for case (1), in which the IDCT block is implemented as a hardware component. Note that there is little difference in software simulation time. For the hardware part, on the other hand, using the hardware emulator yields a 4.1 times speed-up. The overall performance gain of system validation through the hardware emulator is 2.5 times. TABLE II shows the case with a more complicated hardware component (MC). We observed a larger speed-up (10.5x) through hardware emulation, which boosts the overall validation speed by 5.9 times. For the MC block, the RTL simulator is clearly the bottleneck of system validation, as observed from the second column. After the hardware emulator is used for the MC block, its execution time becomes comparable to the ISS execution time.

When we scrutinized the hardware emulation time further, we observed that about half of the total elapsed time is spent on communication between the emulator and the coemulation kernel. Even though the execution time of the IDCT hardware component is very small, it triggers a large number of communications. While decoding 10 frames, it needs to exchange events with the coemulation kernel 3,960 times. In our experiment, the network Round Trip Time (RTT) was measured at 1.267 ms, which is not negligible. If a faster communication medium such as PCI is adopted, further speedup is expected.
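The speedup figures and the weight of the communication overhead can be checked with a few lines of C. This is a rough sketch: the helper names are hypothetical, and `comm_time_sec` assumes that every event exchange costs exactly one network round trip, which the paper does not state.

```c
#include <assert.h>

/* Speedup of coemulation over cosimulation for one table row. */
double speedup(double cosim_sec, double coemu_sec) {
    assert(coemu_sec > 0.0);
    return cosim_sec / coemu_sec;
}

/* Rough communication time in seconds, ASSUMING each event exchange
   costs exactly one round trip (an assumption, not a measured fact). */
double comm_time_sec(long exchanges, double rtt_ms) {
    return (double)exchanges * rtt_ms / 1000.0;
}
```

With the reported values, `speedup(53.68, 13.13)` is about 4.1 and `speedup(318.82, 54.03)` is about 5.9, matching the tables. `comm_time_sec(3960, 1.267)` gives roughly 5.0 s, which is of the same order as the share of the 13.13 s IDCT hardware emulation time attributed to communication.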

As shown in TABLE III, the timing accuracy of the coemulation technique is very close to that of cosimulation. Differences in simulated cycles are inevitable because of how memory access is modeled in the cosimulation. Whereas the hardware emulator contains a physical SRAM, the cosimulation kernel uses a simple memory delay model, which may result in differences. The hardware emulator, on the other hand, does not guarantee perfect accuracy in memory access either, as its SRAM shares the bus with other peripherals during timing measurement. We also observed that the IDCT hardware component execution times measured in both frameworks are identical, while the MC block shows different execution times. This is because the MC block accesses the previous frame heavily during execution.
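The difference column of TABLE III can be reproduced from the two cycle counts. The sketch below assumes the difference is taken relative to the cosimulation count; the paper does not state the formula, and the helper name is hypothetical.

```c
#include <assert.h>

/* Relative difference between two cycle counts, in percent, taken
   relative to the cosimulation count (the reference is an assumption). */
double rel_diff_percent(long long cosim_cycles, long long coemu_cycles) {
    assert(cosim_cycles > 0);
    long long d = cosim_cycles - coemu_cycles;
    if (d < 0) d = -d; /* absolute difference */
    return 100.0 * (double)d / (double)cosim_cycles;
}
```

For the IDCT row, `rel_diff_percent(26184588LL, 26188408LL)` gives about 0.015%, as tabulated. For the MC counts as printed, the same formula gives about 0.23%, that is, a fraction of about 0.002, so the MC entry may be expressed as a fraction rather than a percentage.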

VII. CONCLUSION

In this paper, we proposed a fast yet accurate system verification framework using timed hardware/software coemulation. The virtual synchronization technique is used for the coemulation backplane, as it reduces synchronization overhead significantly. The proposed hardware emulator is implemented on an ARM Versatile PB926EJ-S prototyping board augmented with the hardware interface and the coemulation interface that support the virtual synchronization technique. The hardware interface enables time measurement of the hardware component running on the board, while the coemulation interface generates the event traces and interacts with the coemulation kernel. These interfaces can be attached easily without any modification of the hardware component itself. Experiments with a real-life example demonstrated the effectiveness of the proposed system validation framework: it boosts the validation speed by up to 5.9 times while preserving timing accuracy. Analysis of the elapsed time shows that communication between the cosimulation framework and the hardware emulator incurs significant overhead, which could be reduced by a faster communication channel.

ACKNOWLEDGMENT

The authors thank Kwangsoo Ahn and Sungjin Yoon for their valuable efforts on this work.

REFERENCES

[1] Mentor Graphics, SeamlessCVE, http://www.mentor.com/seamless
[2] Cadence, Palladium Accelerator/Emulator, http://www.cadence.com/products/functional_ver/palladium/
[3] Tharas Systems, Hammer SX and MX hardware accelerators, http://www.tharas.com/products/
[4] EVE, Zebu hardware emulator, http://www.eve-team.com/products.html
[5] M. Chung and C.-M. Kyung, “Enhancing Performance of HW/SW Cosimulation and Coemulation by Reducing Communication Overhead,” IEEE Transactions on Computers, Vol. 55, No. 2, Feb. 2006.
[6] Y. Yi, D. Kim, and S. Ha, “Fast and Accurate Cosimulation of MPSoC Using Trace-Driven Virtual Synchronization,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 26, No. 12, Dec. 2007.
[7] Accellera, “Standard Co-Emulation Modeling Interface Reference Manual,” May 29, 2003.
[8] L. Benini, D. Bertozzi, D. Bruni, N. Drago, F. Fummi, and M. Poncino, “SystemC Cosimulation and Emulation of Multiprocessor SoC Designs,” Computer, Vol. 36, No. 4, pp. 53-59, Apr. 2003.
[9] J. Jung, S. Yoo, and K. Choi, “Performance Improvement of Multi-Processor Systems Cosimulation Based on SW Analysis,” Proc. of Design Automation and Test in Europe, pp. 749-753, 2001.
[10] D. R. Jefferson and H. A. Sowizral, “Fast concurrent simulation using the Time Warp mechanism, part I: Local control,” Rand Note N-1906AF, the Rand Corp., Santa Monica, Calif., Dec. 1982.
[11] Y. Nakamura, K. Hosokawa, I. Kuroda, K. Yoshikawa, and T. Yoshimura, “A Fast Hardware/Software Co-Verification Method for System-on-Chip by Using a C/C++ Simulator and FPGA Emulator with Shared Register Communication,” Proc. of Design Automation Conference, June 2004.
[12] Y. Kim, W. Yang, Y.-S. Kwon, and C.-M. Kyung, “Communication-Efficient Hardware Acceleration for Fast Functional Simulation,” Proc. of DAC, June 2004.
[13] M. Chung and C.-M. Kyung, “Enhancing Performance of HW/SW Cosimulation and Coemulation by Reducing Communication Overhead,” IEEE Transactions on Computers, Vol. 55, No. 2, Feb. 2006.
[14] ARM Inc., “RealView ARMulator ISS User Guide,” http://infocenter.arm.com/help/topic/com.arm.doc.dui0207c/DUI0207C_rviss_user_guide.pdf
[15] S. Ha, S. Kim, C. Lee, Y. Yi, S. Kwon, and Y. Joo, “PeaCE: A Hardware-Software Codesign Environment for Multimedia Embedded Systems,” ACM Transactions on Design Automation of Electronic Systems, Vol. 12, No. 3, Article 24, Aug. 2007.
[16] Mentor Graphics, ModelSim, http://www.model.com
