Concurrent-Simulation-Based Remote IP Evaluation over the Internet for System-on-a-Chip Design*

Hung-Pin Wen, Chien-Yu Lin, Youn-Long Lin
Department of Computer Science
National Tsing Hua University
Hsinchu 30043, Taiwan, R.O.C.

ABSTRACT
We propose an Internet-based concurrent-simulation scheme that eases the IP evaluation process between IP vendors and users. Complex system-on-a-chip designs require more and more IP modules from third-party vendors. What the vendor can disclose without compromising its trade secrets and what the user needs to examine to gain a satisfactory level of confidence are at odds with each other. Via PLI interface functions and Internet protocols, our software enables HDL (Verilog) simulators residing at the vendor’s and the user’s sites to concurrently simulate the IP and the SOC together. Only the stimuli and responses defined on the IP’s I/O are exchanged between the sites. Therefore, the vendor need not create a functional model (or encrypted code) for the IP, while the user is assured that what he/she simulates is what he/she will purchase. Apart from the simulation speed degradation due to communication overhead, the SOC design/debug process is exactly the same as if the IP were in the user’s hands. Our contribution will help all IP providers expose their IPs to all potential users without human intervention and without concern about infringement of IP rights.

Keywords
Concurrent Simulation, Intellectual Property (IP), IP Evaluation, System-on-a-Chip.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISSS’01, October 1-3, 2001, Montréal, Québec, Canada. Copyright 2001 ACM 1-58113-418-5/01/00010…$5.00.

* Supported in part by a grant from the National Science Council, R.O.C., under contract no. NSC 89-2218-E-007-054 and NSC 89-2218-E-007-054

1. INTRODUCTION

Ever-increasing functionality demands from the application side and exponentially growing integration capability from the technology side together steer system implementation towards system-on-a-chip (SOC) realization. Nowadays we, the system synthesis community, face a large productivity gap, shorter time-to-market, and a more serious shortage of designers. IP (intellectual property) reuse has been considered one of the most feasible methodologies for addressing these predicaments. Intuitively, integrating existing designs to form an SOC design can bridge the productivity gap. Many ASIC design houses/divisions also try to repackage their existing designs into IPs. In the meantime, most SOC design companies (IC or system houses) are willing to adopt external IPs. These trends have led to an IP business segment in the semiconductor industry.

However, the integration of IPs into an SOC design is not trivial. Unlike evaluating IC samples in the system-on-a-board era, IP evaluation requires not only a data sheet but also simulatable models. Once a user is interested in a particular IP, he/she wants to verify its functionality first. The user would then like to integrate the IP into the complete SOC design and test it for speed, area, power, etc., in order to compare it with other alternatives. Users will not purchase an IP design until they are sure of the accuracy of the entire design. During the evaluation process, the vendor must provide a model of its IP that demonstrates the capability but protects the real design from being stolen. The further the evaluation process goes, the more resources the vendor has to spend. Presently, two approaches are employed. In the first, the vendor provides a fake model, called a bus transaction model or functional model, for evaluation purposes. In the second, some encryption is performed on the design data. The fake model approach requires extra effort, and it is very difficult to guarantee a complete match between the real and the fake model. The encryption approach cannot assure 100% safety and limits the descriptive power of the HDL.

The widespread use of the Internet has affected the VLSI design community. Several previous works [1,2,3] exploit the power of the Internet by offering CAD tool usage as a service through the WWW. In [4,5], the Java language is used to model an IP design, which requires additional remodeling of the IP from its HDL description. An HDL-based methodology is proposed in [6], but it requires a commercial tool and suffers from the abortion problem of its synchronization scheme and from a speed problem due to socket conversion.

We envision a more automated scenario in which the entire design system takes advantage of the Internet more effectively. We have developed a concurrent simulation framework for remote IP evaluation. Our method is based on a client–server architecture [7] in which the client and the server stand for the IP user and the IP vendor, respectively. This architecture has been employed by many Internet-based applications. We focus on the automatic generation of wrapper programs for remote concurrent simulation.

An AES (Advanced Encryption Standard) Rijndael cipher case study presented herein demonstrates our proposed scheme’s effectiveness and efficiency. Experimental results show the viability of our technique.

The rest of the paper is organized as follows. In Section 2, we describe how our tool will be used. Section 3 describes and analyzes the technical details. The case study is presented in Section 4. Finally, in Section 5 we draw conclusions and point to possible directions for future research.

2. PROPOSED APPROACH
We propose for the IP vendor a software system that takes the IP description as its input and generates two Verilog modules and two C programs. The usage of these modules and programs at the user’s site and at the vendor’s site is illustrated in Figure 1 and Figure 2, respectively. When the user wants to evaluate a remote IP within the context of an SOC design, he/she just instantiates the module (RemoteIP() in the illustration) like any other local module and includes the vendor-provided shell file, which is generated automatically by our tool. The shell file describes the callee module, which uses a PLI interface function ($RemoteIP_clt) to communicate with the vendor’s site. The communication consists of four major functions in the automatically generated C program (PLI_routines). They are presented in the next section.

We have also developed a WWW-based user interface so that the IP user can use a browser to download the Verilog module and the C program as well as to invoke the simulator at the IP vendor’s site.

    /*==============================================*/
    /* This part is original user design.           */
    /*==============================================*/
    module SOC()
      :
      RemoteIP(); /* Instantiation of the IP under evaluation */
      :
    endmodule /* End of module SOC */

    /*==============================================*/
    /* This part is automatically generated.        */
    /* It is included into SOC simulation.          */
    /*==============================================*/
    module RemoteIP()
      :
      $RemoteIP_clt(…); /* This will call compiled C function thru PLI */
      :
    endmodule /* End of module RemoteIP */

    /*==============================================*/
    /* This is automatically generated and compiled */
    /* into object code to be called by             */
    /* $RemoteIP_clt() during Verilog simulation    */
    /* thru PLI.                                    */
    /* It consists of four main functions.          */
    /*==============================================*/
    PLI_routines() {
      function socket_initialization()
      function port_updating()
      function data_transfer()
      function socket_closing()
    }

Figure 1. The usage of generated modules and programs in the user’s site

    /*==============================================*/
    /* This part is automatically generated.        */
    /* It includes real IP into SOC simulation      */
    /*==============================================*/
    module RemoteIP()
      :
      RemoteIP(); /* Instantiation of the IP under evaluation */
      :
      $RemoteIP_svr(…); /* This will call compiled C function thru PLI */
      :
    endmodule /* End of module RemoteIP */

    /*==============================================*/
    /* This part is original vendor’s real IP       */
    /*==============================================*/
    module RemoteIP()
      :
      :
    endmodule /* End of module IP */

    /*==============================================*/
    /* This is automatically generated and compiled */
    /* into object code to be called by             */
    /* $RemoteIP_clt() during Verilog simulation    */
    /* thru PLI.                                    */
    /* It consists of four main functions.          */
    /*==============================================*/
    PLI_routines() {
      function socket_initialization()
      function port_updating()
      function data_transfer()
      function socket_closing()
    }

Figure 2. The usage of generated modules and programs in the vendor’s site

3. REMOTE CONCURRENT SIMULATION
In order to exchange data across the Internet, we apply the client/server paradigm to every concurrent simulation session. The IP vendor is the server side and the user is the client side. Our client/server architecture is composed of Verilog Programming Language Interface (PLI) [8,9] functions and a pair of shell modules built on top of the original design system. The details are described below.

3.1 PLI Function Implementation
The PLI functions, named $modulename_clt and $modulename_svr, are programmed for interprocess communication using TCP/IP. The IEEE 1364 standard provides the TF/ACC routines and the VPI routines (referred to as PLI 1.0 and PLI 2.0) for calling a C function from Verilog code. We use PLI 1.0/2.0 to accomplish four main operations: (1) opening and initializing a socket, (2) updating values for the input/output ports, (3) transferring data using the socket, and (4) closing the socket. The PLI functions ($modulename_clt and $modulename_svr) take the re-ordered module ports as their arguments. We order the ports such that all input ports appear before output ports to ease the encoding and decoding of signal values.
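The port re-ordering described in Section 3.1 can be illustrated with a small C sketch. It is not the generated PLI code (which the paper does not list); the `port_desc` type and the function names `order_ports` and `total_width` are hypothetical, introduced here only to show one way of placing all input ports before all output ports while keeping the declaration order within each group.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical port descriptor; names and fields are illustrative only. */
typedef enum { PORT_INPUT, PORT_OUTPUT } port_dir;

typedef struct {
    const char *name;   /* port name as declared in the module header */
    port_dir    dir;    /* direction as seen from the IP */
    int         width;  /* number of bits */
} port_desc;

/* Copy `n` ports from `decl` to `ordered`, inputs first, then outputs.
 * Declaration order is preserved within each group (a stable partition).
 * Returns the number of input ports. */
size_t order_ports(const port_desc *decl, size_t n, port_desc *ordered)
{
    size_t k = 0;
    size_t n_inputs;

    assert(decl != NULL && ordered != NULL);
    for (size_t i = 0; i < n; i++)
        if (decl[i].dir == PORT_INPUT)
            ordered[k++] = decl[i];
    n_inputs = k;
    for (size_t i = 0; i < n; i++)
        if (decl[i].dir == PORT_OUTPUT)
            ordered[k++] = decl[i];
    return n_inputs;
}

/* Sum of the bit widths of the ports in the half-open range [first, last). */
int total_width(const port_desc *ports, size_t first, size_t last)
{
    int bits = 0;
    for (size_t i = first; i < last; i++)
        bits += ports[i].width;
    return bits;
}
```

With the ports grouped this way, the client only needs the returned input count and the two width sums to know how many characters to send and how many to expect back. For the Rijndael IP of the case study (6 inputs, 3 outputs, with the widths declared in Figure 5), the inputs add up to 15 bits and the outputs to 137 bits.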

3.2 Remote Shell Module Implementation
We program the client/server shell modules in Verilog and use them to call the PLI functions. IP users include the client shell module as if the real IP were at hand. Whenever a user instantiates the IP by name in the SOC design, executing $modulename_clt makes the client-site simulator send the input values to the server site and then receive the output values produced by the vendor’s real IP. The server shell module behaves similarly, but it sends output values after receiving input values. The server shell module is responsible for invoking the server-site simulator to evaluate the real IP, while the client-site simulator is invoked by the user’s testbench or design. At the vendor’s site, because the real design can be viewed as an external module, it is quite simple to provide different abstraction levels, such as behavioral or netlist, for remote simulation by changing the included IP filename in the server shell module. The other primary advantage of the shell module implementation is that once the user obtains the real Verilog code for the IP, no alteration is necessary besides replacing the file name of the client shell module with that of the real IP. Integrating the entire SOC design becomes effortless.

3.3 Synchronization Scheme
The synchronization scheme must guarantee the correctness of remote concurrent simulation. To avoid the abortion problem in [6], we utilize a cycle-evaluation technique. Because a socket is message driven between the client and the server, the packaged data should be synchronized to the socket. A naive way is to transmit every input/output value through a socket to the opposite side in every time step, and then decide the related port updates and the new event schedule. The activity in a simulation time step, as defined in the IEEE 1364 Verilog standard, is conceptually illustrated in Figure 3. Simulation does not proceed to the next slot until it has executed all scheduled events within the current slot. In the beginning, $modulename_svr is called to wait at time step 0. In each time step, all valid input values can be read in the client during slot 3, after all known events have been executed, and are transmitted to the server at slot 0. Slot 0 is not defined in the original organization; it is a means of holding the time wheel before any event is executed in the time step. Afterwards, the server proceeds to process the scheduled events, and finally the valid output values can be accessed during slot 4 and sent back to the client. Then the time wheels of both sides go forward to the next time step. The scheme is shown in Figure 4. Note that the server does not start to evaluate in slot 0 at time step 0 because the PLI interface function is only invoked in slot 1 of that time step.

    Current Time Step:
      Slot 1: (all actions in a slot are intermixed in any order)
              Evaluate RHS of non-blocking assignments
              Evaluate RHS & change LHS of blocking assignments
              Evaluate RHS & change LHS of continuous assignments
              Evaluate output from $display and $write
              Call PLI calltf routines for system tasks and system functions
      Slot 2: Change LHS of non-blocking assignments evaluated in slot 1
      Slot 3: Call registered simulation callbacks which have cbReadWriteSynch reason
      Slot 4: Print output from $monitor and $strobe
              Call registered simulation callbacks which have cbReadOnlySynch reason
    Next Time Step: …

Figure 3. The activity in a Verilog simulation time step

[Figure 4: the time wheels of the client and the server, shown side by side for time steps T=0 and T=1. In each time step the simulators pass through the slots S0 (before events), S1 (calltf routines), S2 (change non-blocking LHS), S3 (read-write callbacks), and S4 (read-only callbacks).]

Figure 4. The cycle-evaluation technique

3.4 Packet Packing Scheme
The speed of remote simulation is also a great concern to the user. The conversion between input/output values and data packets may contribute more to the total run time than the transmission rate does. Moreover, owing to the cycle-evaluation technique, in every time step we have to send and receive port values back and forth between the client and the server, which would seem to aggravate the speed problem. In practice, packet conversion does not restrict the total run time too badly as long as no extra control messages are attached and a packet packing scheme is applied properly.

Our cycle-evaluation technique guarantees the correctness of synchronization, so only the input/output values need to be transmitted. We encode each bit of the input/output ports as one character (one byte). The Verilog 4-state logic values are represented with “0”, “1”, “x” and “z” in C strings. Since the total number of bits of the input/output ports is usually far smaller than the longest data that a packet can carry, we make use of this fact to eliminate a great number of packet conversions. Using the re-ordering of input/output ports mentioned earlier, we concatenate all input bits, or all output bits, into one data stream in that order. Then in every time step only one packet is sent and one packet is received. The packet packing scheme takes effect as soon as the total number of input/output ports is larger than two. For instance, if an IP has N inputs and M outputs, we convert only two packets instead of (M+N) packets. The time for (M+N-2) packet conversions is therefore saved.
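The packing scheme of Section 3.4 can be sketched in C as follows. This is a minimal illustration rather than the generated code: the function names `pack_ports`, `unpack_ports` and `packets_saved` are assumptions made for this example, and each port value is assumed to arrive as a C string holding one of the characters 0, 1, x or z per bit, as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Nonzero if c is one of the four Verilog logic-value characters. */
static int is_logic_char(char c)
{
    return c == '0' || c == '1' || c == 'x' || c == 'z';
}

/* Concatenate n per-port value strings (one character per bit) into a
 * single NUL-terminated payload. Returns the payload length, or -1 if a
 * character is not a logic value or the payload does not fit in cap bytes. */
int pack_ports(const char *const values[], size_t n, char *payload, size_t cap)
{
    size_t len = 0;

    assert(values != NULL && payload != NULL);
    if (cap == 0)
        return -1;
    for (size_t i = 0; i < n; i++) {
        for (const char *p = values[i]; *p != '\0'; p++) {
            if (!is_logic_char(*p) || len + 1 >= cap)
                return -1;
            payload[len++] = *p;
        }
    }
    payload[len] = '\0';
    return (int)len;
}

/* Split a payload back into n fields of the given bit widths.
 * fields[i] must have room for widths[i] + 1 bytes.
 * Returns 0 on success, -1 if the widths do not match the payload length. */
int unpack_ports(const char *payload, const int widths[], size_t n,
                 char *const fields[])
{
    size_t pos = 0;
    size_t len = strlen(payload);

    for (size_t i = 0; i < n; i++) {
        if (widths[i] < 0 || pos + (size_t)widths[i] > len)
            return -1;
        memcpy(fields[i], payload + pos, (size_t)widths[i]);
        fields[i][widths[i]] = '\0';
        pos += (size_t)widths[i];
    }
    return pos == len ? 0 : -1;
}

/* Packet conversions saved per time step: (M + N) without packing,
 * two with packing. */
int packets_saved(int n_inputs, int m_outputs)
{
    return n_inputs + m_outputs - 2;
}
```

Because both sides know the port order and widths from the module header, the payload needs no delimiters or control fields, which matches the requirement that no extra control messages be attached. For an IP with 6 inputs and 3 outputs, `packets_saved(6, 3)` gives 7 conversions avoided per time step.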

3.5 IP Protection Scheme
IP protection is a key issue in the SOC business. The dilemma is how to let users evaluate a real IP without revealing its content to them. A common way is to write a new simulation model in C or to encrypt the IP’s content. Our remote IP co-simulation method requires neither of them. Our client/server shell modules separate the user’s design from the IP. Furthermore, the PLI interface functions permit only our socket protocol, and any attempt to access the real IP is blocked. The client and the server can see only each other’s interfaces; therefore, a user can truly evaluate an IP but cannot obtain any detailed information about it except its output values.
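To make concrete how little crosses the connection, the following C sketch runs the lock-step exchange of Section 3.3 over a local socket pair. It is a toy under stated assumptions: both ends live in one process and use a POSIX `socketpair` instead of a TCP connection between two simulators, the "IP" is a hypothetical 4-state AND gate named `hidden_ip`, and `run_lockstep`, `send_all` and `recv_all` are names invented for this example. The point is only that the client side never sees anything but the response characters.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

enum { IN_BITS = 2, OUT_BITS = 1 };  /* toy IP: two 1-bit inputs, one output */

/* Write exactly n bytes; returns 0 on success, -1 on error. */
static int send_all(int fd, const char *buf, size_t n)
{
    while (n > 0) {
        ssize_t k = write(fd, buf, n);
        if (k <= 0)
            return -1;
        buf += k;
        n -= (size_t)k;
    }
    return 0;
}

/* Read exactly n bytes; returns 0 on success, -1 on error or EOF. */
static int recv_all(int fd, char *buf, size_t n)
{
    while (n > 0) {
        ssize_t k = read(fd, buf, n);
        if (k <= 0)
            return -1;
        buf += k;
        n -= (size_t)k;
    }
    return 0;
}

/* Stand-in for the vendor's real IP (hypothetical): a 4-state AND gate. */
static char hidden_ip(char a, char b)
{
    if (a == '0' || b == '0')
        return '0';
    if (a == '1' && b == '1')
        return '1';
    return 'x';
}

/* Server side of one time step: receive inputs, evaluate, send outputs. */
static int server_step(int fd)
{
    char in[IN_BITS], out[OUT_BITS];

    if (recv_all(fd, in, IN_BITS) != 0)
        return -1;
    out[0] = hidden_ip(in[0], in[1]);
    return send_all(fd, out, OUT_BITS);
}

/* Run n_steps time steps in lock step. stimulus holds IN_BITS characters
 * per step; response receives OUT_BITS characters per step plus a NUL.
 * Returns 0 on success, -1 on a socket error. */
int run_lockstep(const char *stimulus, size_t n_steps, char *response)
{
    int sv[2];
    int rc = 0;
    size_t done = 0;

    assert(stimulus != NULL && response != NULL);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    for (size_t t = 0; t < n_steps && rc == 0; t++) {
        /* client: send this step's inputs */
        rc = send_all(sv[0], stimulus + t * IN_BITS, IN_BITS);
        /* server: evaluate the hidden IP and reply */
        if (rc == 0)
            rc = server_step(sv[1]);
        /* client: receive this step's outputs before advancing */
        if (rc == 0)
            rc = recv_all(sv[0], response + t * OUT_BITS, OUT_BITS);
        if (rc == 0)
            done++;
    }
    response[done * OUT_BITS] = '\0';
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

In a real deployment the two halves of the loop body would run in separate simulators, each blocking inside its PLI call until the peer's packet arrives; that blocking is what keeps the two time wheels aligned.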

3.6 Performance Analysis
The total simulation time is the total evaluation time of all events in the time wheel. We therefore create a model to analyze the performance of our scheme. First, we assume that simulating the IP and simulating its interface to the rest of the SOC contribute 100α and 100β percent, respectively, of the total time (Τ) of a simulation on a single local machine, while the remaining SOC consumes (1-α-β)Τ. The transmission rate is γ times slower than that on the single local machine. Table 1 compares the total simulation time of our scheme with that of the original local simulation.

Table 1. Simulation time comparison between our scheme and the original local simulation

    Design Part                 IP    IP/SOC interface   Remaining SOC   Total
    Original local simulation   αΤ    βΤ                 (1-α-β)Τ        Τ
    Concurrent simulation       αΤ    γβΤ                (1-α-β)Τ        (1+(γ-1)β)Τ
    Ratio                       1     γ                  1               (1+(γ-1)β)

For example, if a MAC consumes 10% and its interface with the outer DSP design consumes 1% of the total simulation time, the Internet transmission rate (assumed to be 1 KB/sec) is 1600X slower than that on the single local machine, and the original simulation time is 1,000,000 ms, then

    α = 10%, β = 1%, γ = 1600, and Τ = 1,000,000 ms

We can calculate the new simulation time Τ’ and the slow-down ratio ϕ:

    Τ’ = (1+(γ-1)β)Τ = (1 + (1600-1) × 1%) × 1,000,000 = 16,990,000 ms
    ϕ = (1+(γ-1)β) = (1 + (1600-1) × 1%) = 16.99

We can see that the new simulation time is dominated by two factors: the percentage of simulation time spent on the IP/SOC interface, and the transmission rate. In this example, the 1600X Internet delay is multiplied by the small proportion of time spent simulating the IP/SOC interface, so the total simulation time increases only by a factor of approximately 17. Even setting aside possible improvements in Internet speed, the delay is hidden further as the SOC design grows: the larger the SOC design becomes, the smaller the proportion of total simulation time the IP/SOC interface occupies. Therefore, for evaluating an IP in an SOC design, the delay due to our scheme will be confined to a reasonable range.

4. CASE STUDY
We use a synchronous AES Rijndael cipher chip at the RTL level [10] to test the proposed approach. This cipher IP module has 6 input and 3 output ports. Figure 5 and Figure 6 give the program listings of the generated modules for the client and the server site, respectively. The generated C programs are too long (527 lines) to be included here. The simulation is performed on a SUN ULTRA SPARC60 equipped with Cadence Verilog-XL.

Table 2 shows the effectiveness of the packet packing scheme. In this experiment, the synchronous AES Rijndael cipher encrypts the input data and the encrypted output data are then obtained. The three schemes in the second column are all simulated on the same workstation: a single process, two processes without the packet packing scheme (method 1), and two processes with the packet packing scheme (method 2). The next three columns give the user CPU time, the system CPU time, and the total run time in seconds, respectively. The final column shows the normalized total run time. It is clear that our packet packing scheme reduces the total run time significantly for client/server concurrent simulation.

Table 2. Simulation with different synchronization schemes

    Pattern  Synchronization Scheme               Side     User CPU  System CPU  Total Run  Normalized Total
    Length                                                 Time (s)  Time (s)    Time (s)   Run Time
    160      Single process                       -            7.00      0.23       8.00       1.00
    160      Two processes w/o packet packing     Client       0.00      0.00    1236.00     154.50
    160      Two processes w/o packet packing     Server       3.00      0.00    1241.00     155.13
    160      Two processes w/ packet packing      Client       2.94      0.83      13.02       1.63
    160      Two processes w/ packet packing      Server       9.75      0.90      13.48       1.69
    1600     Single process                       -           31.46      0.28      32.91       1.00
    1600     Two processes w/o packet packing     Client       0.00      0.00    2676.00      81.31
    1600     Two processes w/o packet packing     Server      17.00      0.00    2676.00      81.31
    1600     Two processes w/ packet packing      Client       6.60      1.75      46.57       1.42
    1600     Two processes w/ packet packing      Server      37.46      1.92      47.12       1.43
    16000    Single process                       -          275.59      0.50     279.22       1.00
    16000    Two processes w/o packet packing     Client       4.00      9.00   15738.00      56.36
    16000    Two processes w/o packet packing     Server     165.00     27.00   15739.00      56.37
    16000    Two processes w/ packet packing      Client      41.25     12.16     363.01       1.30
    16000    Two processes w/ packet packing      Server     315.72     12.36     363.58       1.30

Moreover, we use this cipher to build an encrypt/decrypt system. Encryption is done on the server while decryption is done on the client. This remote concurrent simulation is analyzed under three different conditions:

Single machine with single process. Both encryption and decryption are executed on the same local machine. The simulation ran on one SUN ULTRA SPARC60 called IC6 with Cadence Verilog-XL.

Intranet with two processes. The client sent out the input data and ran the decryption on a SUN ULTRA SPARC60 called IC5 while the server ran the encryption on IC6. The client and server machines communicate with sockets through a 100Mbit Ethernet LAN connection. This result exhibits the effectiveness of the packet packing scheme: the transmission delay influences the total run time only slightly.

Internet with two processes. The client sent out the input data and ran the decryption on a SUN ULTRA SPARC30 called Gucsun01 while the server ran the encryption on IC6. They communicate with sockets through different ISPs, with about 6.0 Kbyte/sec of traffic under normal network loading. This setting represents the most practical Internet environment.
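The slow-down model of Section 3.6 and the normalized run times reported for the case study are simple enough to check mechanically. The sketch below is only an arithmetic aid, not part of the tool; the function names `slowdown_ratio`, `concurrent_time` and `normalized_run_time` are assumptions made for this example, and `gamma_slow` stands for the factor γ.

```c
#include <assert.h>

/* Slow-down ratio of Section 3.6: phi = 1 + (gamma - 1) * beta, where
 * beta is the fraction of local simulation time spent on the IP/SOC
 * interface and gamma_slow is how many times slower the transmission is. */
double slowdown_ratio(double beta, double gamma_slow)
{
    assert(beta >= 0.0 && beta <= 1.0 && gamma_slow >= 1.0);
    return 1.0 + (gamma_slow - 1.0) * beta;
}

/* Predicted concurrent simulation time T' = phi * T, in the units of T. */
double concurrent_time(double t_local, double beta, double gamma_slow)
{
    return slowdown_ratio(beta, gamma_slow) * t_local;
}

/* Normalized total run time as used in the result tables: the total run
 * time divided by the single-process (single-machine) total run time. */
double normalized_run_time(double total, double baseline)
{
    assert(baseline > 0.0);
    return total / baseline;
}
```

With β = 0.01 and γ = 1600, `slowdown_ratio` reproduces the ratio 16.99 of the worked example, and `concurrent_time(1000000.0, 0.01, 1600.0)` gives the 16,990,000 ms quoted there. Note that α does not appear: in the model, the time spent inside the IP itself is unchanged, so only the interface share and the transmission slow-down matter.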

Table 3 shows the results. The intranet with two processes is the fastest. In this condition, packet conversion and the transmission rate do not limit the speed of concurrent simulation, and the simulation takes advantage of distributed computation. However, the results for the Internet with two processes interest us most. Our remote IP concurrent simulation method is constrained only by the transmission rate, which may vary under different situations. We can see that the Internet condition consumes 10 to 36 times more total run time. This result is acceptable to both the IP vendor and the IP user given the convenience it brings to both parties. We expect the total run time to drop significantly when video-grade bandwidth becomes available.

Table 3. Remote IP concurrent simulation under different network conditions

    Pattern  Simulation      Machine             User CPU  System CPU  Total Run  Normalized Total
    Length   Type                                Time (s)  Time (s)    Time (s)   Run Time
    160      Single machine  IC6                    13.13      0.32      14.62       1.00
    160      Intranet        IC5 (Client)            2.66      1.09      12.00       0.82
    160      Intranet        IC6 (Server)            9.48      0.87      12.26       0.84
    160      Internet        Gucsun01 (Client)       6.95      1.78     546.66      37.39
    160      Internet        IC6 (Server)           10.04      1.40     547.49      37.45
    1600     Single machine  IC6                    64.67      0.47      67.27       1.00
    1600     Intranet        IC5 (Client)            6.02      2.19      44.13       0.66
    1600     Intranet        IC6 (Server)           36.45      1.92      44.56       0.66
    1600     Internet        Gucsun01 (Client)      25.40      1.86     780.01      11.60
    1600     Internet        IC6 (Server)           41.84      2.49     781.62      11.62
    16000    Single machine  IC6                   579.57      0.93     589.73       1.00
    16000    Intranet        IC5 (Client)           36.76     13.38     363.64       0.62
    16000    Intranet        IC6 (Server)          308.51     12.54     363.97       0.62
    16000    Internet        Gucsun01 (Client)     225.09     13.29    7235.84      12.27
    16000    Internet        IC6 (Server)          337.02     13.84    7240.15      12.28

5. CONCLUSIONS AND FUTURE WORK
We have demonstrated the feasibility of remote IP evaluation using an Internet-based concurrent simulation scheme. Our software can automatically generate wrapper programs and modules for the remote exchange of simulation data. Our contribution will help both the IP vendor and the IP user, and will therefore ease the adoption of external IPs in SOC design projects.

Our proposed system is built on a client/server architecture, which mainly comprises some PLI functions and a pair of shell modules. They also provide security by permitting each side to access only the interface values of the other. The cycle-evaluation technique is responsible for timing synchronization and guarantees correctness. In addition, the packet packing scheme removes much of the conversion overhead, so that the total run time stays within a tolerable range.

Our scheme also allows a large number of clients, potentially thousands, to evaluate multiple IPs on the vendor’s machines if the vendor has enough licenses. It eases the concurrent simulation of multiple IPs for SOC design. Many kinds of simulators are available on the market. To provide a standard interface, we rely only on IEEE 1364, so any simulator conforming to IEEE 1364 can be used with our method. Internet bandwidth is the only major bottleneck of our proposed scheme. Although we expect it to be relieved in the broadband era, we would like to study whether the amount of communication can be further reduced.

6. REFERENCES
[1] F. Chan, M. Spiller and R. Newton, “WELD-An environment for Web-based electronic design”, Proc. of the Design Automation Conference, 1998, pp. 146-151.
[2] L. Benini, A. Bogliolo and G. De Micheli, “Distributed EDA tool integration: the PPP paradigm”, Proc. of the International Conference on Computer Design, 1996, pp. 448-453.
[3] Y.L. Lin, “Computing Brokerage and Its Application in VLSI Design”, Proc. of Asia and South Pacific Design Automation Conference ’97, 1997.
[4] M. Dalpasso, A. Bogliolo and L. Benini, “Specification and Validation of distributed IP-based designs with JavaCAD”, Proc. of Design, Automation and Test in Europe Conference & Exhibition, 1999, pp. 684-688.
[5] M. Dalpasso, A. Bogliolo and L. Benini, “Virtual Simulation of distributed IP-based designs”, Proc. of the Design Automation Conference, 1999, pp. 50-55.
[6] A. Fin and F. Fummi, “A Web-CAD Methodology for IP-Core Analysis and Simulation”, Proc. of the Design Automation Conference, 2000, pp. 597-600.
[7] D.E. Comer and D.L. Stevens, Internetworking With TCP/IP Volume III: Client-Server Programming and Application, Prentice-Hall, Inc., 1996, pp. 9-19.
[8] S. Mittra, Principles Of Verilog PLI, Kluwer Academic Publishers, 1999.
[9] S. Sutherland, The Verilog PLI Handbook: a user’s guide and comprehensive reference on the Verilog programming language interface, Kluwer Academic Publishers, 1999.
[10] An AES Rijndael Cipher Design, refer to AES web page: http://csrc.nist.gov/encryption/aes/rijndael.


    module top // rij_shell_clt.v shell
    (
      clk,        // clock : CLK
      reset,      // reset : RST
      mode,       // mode : Mode
      en_de,      // encrypt or decrypt
      key_length, // length of key:128 192 256
      din,        // data input : Din
      ready,      // ready : Ready
      dout,       // data output : Dout
      sftout
    );
      // declaration of i/o type and width
      input clk, reset, en_de;
      input [1:0] mode, key_length;
      input [7:0] din;
      output ready;
      output [7:0] dout;
      output [127:0] sftout;
      // declaration of variable type and width
      wire clk;
      wire reset, en_de;
      wire [1:0] mode, key_length;
      wire [7:0] din;
      wire [7:0] din_d;
      wire ready;
      wire [7:0] dout;
      wire [127:0] sftout;

      assign din_in = din;
      initial begin
        $rijndael_clt(clk, reset, mode, en_de, key_length, din_d,
                      ready, dout, sftout);
      end
    endmodule

Figure 5: The generated shell module for client site

    `include "rijndael.v"
    module rij_shell_svr_vpi;
      parameter ClkRateOfRijndael = 5;
      parameter EndofCoSimulation = 175000;
      // declaration of i/o type and width
      /*
      input clk, reset, en_de;
      input [1:0] mode, key_length;
      input [7:0] din;
      output ready;
      output [7:0] dout;
      output [127:0] sftout;
      */
      // declaration of variable type and width
      reg clk;
      wire reset, en_de;
      wire [1:0] mode, key_length;
      wire ready;
      wire [7:0] din;
      wire [7:0] dout;
      wire [127:0] sftout;
      reg clk_r, reset_r, en_de_r;
      reg [1:0] mode_r, key_length_r;
      reg [7:0] din_r;

      top u1( clk, reset, mode, en_de, key_length, din,
              ready, dout, sftout );

      always #(ClkRateOfRijndael) clk = ~clk;
      assign clk = clk_r;
      assign reset = reset_r;
      assign mode = mode_r;
      assign en_de = en_de_r;
      assign key_length = key_length_r;
      assign din = din_r;

      initial begin
        $rijndael_svr( clk_r, reset_r, mode_r, en_de_r, key_length_r, din_r,
                       ready, dout, sftout );
      end
      initial begin
        #(EndofCoSimulation) $finish;
      end
    endmodule

Figure 6: The generated shell module for server site
