In Proc. of the 4th Int. Workshop on Computer Systems: Architectures, Modeling, and Simulation (SAMOS'04), Samos, Greece, July 2004. Volume 3133 of Lecture Notes in Computer Science (LNCS), pp. 519-529. © Springer-Verlag, 2004.
High-Speed Event-Driven RTL Compiled Simulation

Alexey Kupriyanov, Frank Hannig, Jürgen Teich
Department of Computer Science 12, Hardware-Software-Co-Design, University of Erlangen-Nuremberg, Germany
{kupriyanov, hannig, teich}@cs.fau.de
Abstract. In this paper we present a new approach for generating high-speed optimized event-driven register transfer level (RTL) compiled simulators. The generation of the simulators is part of our BUILDABONG [7] framework, which aims at architecture and compiler co-generation for special-purpose processors. The main focus of the paper is on the transformation of a given architecture's circuit into a graph and on applying to it an essential graph decomposition algorithm that transforms the graph into subgraphs denoting the minimal subsets of sequential elements which have to be reevaluated during each simulation cycle. As a second optimization, we present a partitioning algorithm which introduces intermediate registers to minimize the number of evaluations of combinational nodes during a simulation cycle. The simulator's superior performance compared to an existing commercial simulator is shown. Finally, we demonstrate the pertinence of our approach by simulating a MIPS processor.
1 Introduction
Today, due to the increasing complexity of processor architectures, 70-80% of the development cycle is spent in validation. Here, besides formal verification methodologies, simulation-based validation plays an important role. Furthermore, cycle-accurate and bit-true simulations are necessary for the debugging process when mapping applications to an architecture that is not physically available. Moreover, in the domain of architecture/compiler co-design for application-specific instruction set processors (ASIPs), fast estimation and simulation methodologies are essential to explore the enormous design space of possible architecture/compiler co-designs. In the following, we list some significant approaches aiming at architecture simulation. Today's simulation techniques include the flexible but often very slow interpretive simulation and the faster compiled simulation. The simulation can be performed at different levels of abstraction, from the gate level, through the register transfer (RT) level, up to the instruction-set (IS) level. The lower the abstraction level, the higher the simulation flexibility, but at the same time the simulation speed slows down dramatically. Nowadays, IS simulators are widely used in the domain of application-specific instruction set processors because of their extremely high simulation speed. In [4], a fast retargetable simulation technique improves traditional static compiled simulation by utilizing the host machine resources. The FastSim [11] and Embra [14] simulators use dynamic binary translation and result caching to improve simulation performance. Embra provides high performance, but it is restricted to simulating only the MIPS R3000/R4000 [1] architectures.
The architecture description language LISA [10] is the basis for a retargetable compiled approach aiming at the generation of fast simulators for microprocessors, even those with complex pipeline structures. LISA is also used as the entry language in the MaxCore framework [3], which automatically generates fast and accurate processor simulation models. In the BUILDABONG framework ([12], [6], [7]), a cycle-accurate model of the register transfer architecture is generated using the formalism of Abstract State Machines (ASMs). Here, a convenient debugging environment is generated, but the functional simulation is not optimized and can be performed only for relatively simple architectures. The problem of this approach is that the simulation engine is based on a library operating on bitstrings, which is relatively slow. One of the largest drawbacks of all of these approaches is that some high-level machine description language must additionally be used in order to generate highest-speed simulators. In this paper, we propose a mixed register-transfer/instruction-set level compiled simulation technique in which the simulator is automatically extracted from an RTL description and the application program is compiled prior to simulator run-time. Hence, there is no need to use any particular machine description language. Furthermore, we present two new approaches to optimize the simulation speed: (a) a novel graph decomposition algorithm to transform the circuit graph into subgraphs denoting the minimal subsets of sequential elements which have to be reevaluated during one simulation cycle, and (b) a partitioning algorithm to minimize the number of evaluations of combinational nodes during a simulation cycle by the introduction of intermediate registers. The rest of the paper is structured as follows: In Section 2, the basic concepts for generating high-speed optimized event-driven bit-true and cycle-accurate compiled simulators are described.
Here, a given graphical architecture description is transformed into a graph representation. Subsequently, the main focus of the paper is on the above-mentioned RTL graph decomposition and partitioning algorithms. In Section 3, experiments and a case study simulating a MIPS processor are presented, and the results of our simulation approach are compared to an existing commercial simulator. Finally, in Section 4, the paper is concluded with a summary of our achievements and an outlook on future work.
2 Event-Driven Compiled Simulation
The flow of conventional compiled simulation, in particular Levelized Compiled Code (LCC) [13] simulation, consists of circuit description, code generation, compilation, and running the stand-alone compiled simulator, which reads input vectors and generates the circuit's output values. In our case, the circuit description is extracted from the ArchitectureComposer [7] tool, where the architecture is graphically entered using a library of customizable building blocks such as register files, memories, arithmetic and logic units, busses, etc. The simulator is compiled from a C++ program, which is generated automatically using the SimulatorGenerator, a tool built into ArchitectureComposer. The structure of the generated simulator is similar to that of a conventional LCC simulator. It includes the main simulation loop in which the stimuli vectors are
read from an external stimuli file, the simulation routine sensitivity-update (definition follows) is called to perform the current simulation cycle, and the selected tracing values are stored. In order to be able to simulate arbitrary bitwidths accurately, our simulation engine uses the GNU Multiple Precision (GMP) library [8], which performs efficient arbitrary word-length arithmetic and logic computations. The simulation engine itself is a set of nested inline functions which evaluate each output of the hardware components of the circuit. This approach allows the C++ compiler to optimize the program execution speed by reducing the number of intermediate variables in the data flow graph, compared to conventional approaches, where the simulation engine most of the time uses a table of functions for the hardware elements and a table of the interconnections between them. As our objective is a cycle-accurate, pipeline-accurate, and bit-true simulation of a given computer architecture, the architecture's behavior has to be simulated at the RT level. Such a behavior may be described by a set of guarded register transfer patterns [9]. From a hardware-oriented point of view, register transfers (RT) take place during the execution of each machine cycle. During an RT, input values are read from a set of storage elements (registers, memory cells, etc.), a computation is performed on these values, and the result is assigned to a destination, which is again a set of storage elements. Often, not all regions of a given circuit are involved in its functional process during a significant time interval. For example, in architectures such as FPGAs or reconfigurable arrays, some inactive cells do not exhibit any signal changes from cycle to cycle, especially those parts which are responsible for the reconfiguration process. It is therefore very reasonable not to recompute or update these regions in each simulation cycle.
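To illustrate the idea of nested evaluation functions and arbitrary-precision arithmetic, the following sketch is given. It is an illustrative Python sketch, not the generated C++ code: Python's built-in integers stand in for GMP's arbitrary-precision numbers, the nested calls mimic the generated inline functions, and all circuit and signal names are hypothetical.

```python
# Illustrative sketch: nested evaluation functions for a tiny RT circuit,
# mimicking the generated inline C++ code. Python ints play the role of
# GMP arbitrary-precision numbers; MASK emulates a 40-bit datapath.
BITS = 40
MASK = (1 << BITS) - 1  # non-standard bitwidth, as GMP would allow

# current register state (sequential elements)
regs = {"r1": 0, "r2": 0, "r3": 0}

# combinational elements as nested functions: evaluating the output of
# c2 transparently pulls in its predecessor c1 -- no function tables and
# no interconnection tables, so intermediates stay in local variables
def c1():
    return (regs["r1"] + regs["r2"]) & MASK      # 40-bit adder

def c2():
    return (c1() ^ regs["r3"]) & MASK            # XOR stage fed by c1

def step(stimuli):
    """One simulation cycle: read stimuli, evaluate, latch the result."""
    regs.update(stimuli)                         # external inputs
    regs["r3"] = c2()                            # register transfer r3 <- c2

step({"r1": (1 << 39) + 5, "r2": 7})
print(hex(regs["r3"]))  # -> 0x800000000c
```

Note how the 40-bit value exceeds any standard fixed-width integer type; in the generated simulator this role is played by GMP, while architectures with standard bitwidths can fall back to plain machine integers.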
Hence, an event-driven simulation is applied in order to optimize the simulation speed by evaluating only those parts of the circuit whose values are affected from one simulation cycle to the next. This is also called forward event-driven simulation.

Definition 1 (Sensitivity-update-mapping). A sensitivity-update-mapping defines a set of sequential elements {u_r1, ..., u_rn} of the circuit which must be recomputed if at least one sequential element of a set {v_r1, ..., v_rm} has changed its value compared to the previous simulation cycle. It can be represented by the following notation: {v_r1, ..., v_rm} ↦ {u_r1, ..., u_rn}.

The RTL circuit is represented as a directed graph G(V, E). In the following sections, we present two essential algorithms that enable highest-speed simulations on this graph: (i) a graph decomposition algorithm whose purpose is to divide a given description of an RTL architecture into subgraphs which denote the minimal subsets of sequential elements that have to be recomputed during one simulation cycle (i.e., the determination of an optimal set of sensitivity-update-mappings), and (ii) a partitioning algorithm to minimize the computational cost of the sensitivity-update-mappings. But first, we show how a given netlist is transformed into such a graph representation.

Definition 2 (Netlist). A netlist N = (V, F) is a set V of logic elements and a set F of nets interconnecting the elements v ∈ V with each other. It defines a circuit in terms of
Figure 1. Netlist representation. In (a), schematic graphical representation; in (b), BLIF textual representation; and in (c), netgraph representation of the given netlist.
basic logic elements. A unidirectional¹ net f ∈ F, which interconnects n + m elements, is represented as f = ({v1, ..., vn}, {u1, ..., um}), where v1, ..., vn ∈ V are the source nodes and u1, ..., um ∈ V are the target nodes of the net f.

¹ A bidirectional net can be modeled by two unidirectional nets.

Example 1. In Fig. 1, a netlist is shown with |V| = 10 elements. Net f1 is given by f1 = ({r4}, {c2, c3}). Nodes named ri denote sequential elements, whereas nodes named ci denote combinational (i.e., state-free) logic elements.

A netlist can be given as a schematic graphical entry or in the form of a textual description (e.g., in BLIF [5]). A netlist can be seen as a hypergraph, where the elements and nets are vertices and edges, respectively. This hypergraph can be transformed into a graph by the introduction of the netgraph concept.

Definition 3 (Netgraph). A netgraph G = (V, E), E ⊆ V × V, is a directed graph containing two disjoint sets of vertices V = Vr ∪ Vc, representing the sequential elements or registers Vr and the combinational elements Vc of a given netlist N = (V, F). Netlist interconnections are represented by directed edges e = (v1, v2) ∈ E.

Example 2. In Fig. 1, a netlist and its netgraph G = (V, E) are shown. The subset of combinational elements Vc = {c2, c3, c4, c6} is shown as circles, and the subset of registers or sequential vertices Vr = {r3, r4, r5, r6, r9, r10} is represented by rectangles.

In order to transform a given netlist N into a netgraph G, all elements of the netlist must be analyzed first and, according to each element's type (combinational or sequential),
they are either included in the subset Vc or in the subset Vr. In the generated simulator, the subset Vc is represented by a set of nested inline functions implementing the functionality of the combinational elements, and the subset Vr is represented by a set of variables of an appropriate data type. Furthermore, all nets are represented as directed edges e ∈ E of the netgraph G. In the case where a net contains an n : m connection, it is transformed into n × m directed edges of the netgraph (in Example 1, f1 represents the case of a 1 : 2 connection, which is transformed into the edges (r4, c2) and (r4, c3) of the graph G in Fig. 1 (c)). Given such a netgraph, a simple procedure to determine the initial sensitivity-update-mappings is as follows: if there exists a directed path from one register v_r1 to another register v_r2 with no other sequential elements on the path between them, then, whenever the value of v_r1 changes, v_r2 has to be updated by evaluating the path of combinational elements in between. A set of such initial sensitivity-update-mappings and evaluation paths can be obtained by a search algorithm such as depth-first search (DFS). For the example in Fig. 1 (c), the initial sensitivity-update-mappings extracted in this way are shown in (a):

(a) {r3} ↦ {r6, r9}            (b) {r3, r4} ↦ {r6}
    {r4} ↦ {r6, r9, r10}   →       {r4, r5} ↦ {r10}
    {r5} ↦ {r10}                   {r3, r4, r6} ↦ {r9}
    {r6} ↦ {r9}
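The netgraph construction and the DFS-based extraction of initial sensitivity-update-mappings can be sketched as follows. This is a Python sketch with hypothetical helper names; the edge set below is one wiring that is consistent with the mappings in (a), not necessarily the exact wiring of Fig. 1.

```python
# Sketch (hypothetical names): build a netgraph from a netlist and extract
# the initial sensitivity-update-mappings with a DFS that stops at (and does
# not traverse past) sequential nodes.
from collections import defaultdict

registers = {"r3", "r4", "r5", "r6", "r9", "r10"}
# nets as (source, {targets}); an n:m net expands into n*m directed edges
nets = [("r4", {"c2", "c3"}), ("r5", {"c3"}), ("r3", {"c4"}),
        ("c2", {"c4"}), ("c3", {"r10"}), ("c4", {"r6", "c6"}),
        ("r6", {"c6"}), ("c6", {"r9"})]

succ = defaultdict(set)
for src, tgts in nets:
    succ[src] |= tgts                  # hypergraph net -> graph edges

def reachable_regs(v):
    """Registers reachable from v through combinational nodes only."""
    found, stack = set(), list(succ[v])
    while stack:
        u = stack.pop()
        if u in registers:
            found.add(u)               # sequential node: stop here
        else:
            stack.extend(succ[u])
    return found

mappings = {r: reachable_regs(r) for r in registers if reachable_regs(r)}
# -> r3: {r6, r9}, r4: {r6, r9, r10}, r5: {r10}, r6: {r9}, as in (a)
```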
In a technical implementation, for instance, the registers r6 and r9 would have to be updated twice if both r3 and r4 had changed their values compared to the previous simulation cycle. In order to avoid multiple updates, we present a graph decomposition algorithm which groups the sensitivity-update-mappings such that each of the sequential elements is updated at most once during each simulation cycle. For the above example, the grouping result is then represented by the sensitivity-update-mappings in (b).

2.1 RTL Circuit Graph Decomposition
Now, we propose a graph decomposition into n subgraphs Gi = (Vi, Ei), i = 1, ..., n. Each subgraph consists of three regions: (i) the sequential input region, a subset of sequential nodes V_i^inp ⊆ Vi, V_i^inp ⊆ Vr; (ii) the combinational region, a subset of combinational nodes V_i^comb ⊆ Vi, V_i^comb ⊆ Vc; and (iii) the sequential output region, a subset of sequential nodes V_i^out ⊆ Vi, V_i^out ⊆ Vr. The set V_i^inp specifies which input sequential elements v ∈ V_i^inp have to be analyzed such that, if at least one of them differs from its value in the previous simulation cycle, then only the output sequential elements v ∈ V_i^out have to be updated. An update of a sequential element v ∈ V_i^out here means (i) the evaluation of all combinational elements c ∈ V_i^comb which affect v, in other words, the combinational elements that lie on the paths from the inputs u ∈ V_i^inp to v in the subgraph Gi, and (ii) an assignment of the newly evaluated result to the sequential element v. However, in practice, the sequential element v ∈ V_i^out can be updated by a single function call of the corresponding combinational element c ∈ V_i^comb,
which is the unique direct predecessor of v, since the call of one nested function leads to the calls of all dependent functions. Finally, the conditions of the following routine can be directly extracted from the set of subgraphs, and the generated simulator code looks as follows:

Sensitivity-update routine in generated simulator
1 FOR all subgraphs Gi DO
2   IF (∃ v ∈ V_i^inp : v(t_cur) ≠ v(t_old)) THEN
3     FOR all nodes u ∈ V_i^out DO
4       update(u);
5     ENDFOR
6   ENDIF
7 ENDFOR

In the above routine, t_cur and t_old denote the current and the previous simulation step, respectively. We propose a novel decomposition algorithm which (i) guarantees that each register is updated at most once during each simulation cycle and, as a secondary goal, (ii) minimizes the number of input registers v ∈ V_i^inp which have to be analyzed (see line 2 of the above routine) as to whether a register has changed with respect to the previous simulation cycle. In Fig. 2, the decomposition algorithm to derive the set of subgraphs Gi(Vi, Ei) is given as pseudo-code. The worst-case running time of the decomposition algorithm is O(|V|² + |E||V|). First, the algorithm performs a DFS² in order to extract the unique set of sensitivity-update-mappings such that each of their left-hand sides contains only one register. Then, the set of sensitivity-update-mappings is represented by an adjacency matrix M (|Vr| × |Vr|). Each row represents a register on the left-hand side of a sensitivity-update-mapping, and each column represents a register on the right-hand side. Afterwards, the rows and columns of this matrix are sorted in descending order. The sorting of the adjacency matrix groups together those input registers which have to be analyzed as to whether a register has changed with respect to the previous simulation cycle; thereby, a heuristic minimization of the number of input registers in each subgraph is performed. Finally, the subgraphs Gi are generated.

Theorem 1.
The algorithm DECOMP in Fig. 2 satisfies the condition that, during each simulation cycle, each register that has to be recomputed is updated at most once.

Proof. Each column of the adjacency matrix M represents exactly one register that must be updated. Furthermore, according to lines 20-32 of DECOMP, each column is uniquely assigned to one sensitivity-update-mapping with index l. □

Example 3. For the netgraph shown in Fig. 3 (a), the result of the decomposition is shown in Fig. 3 (d). Three subgraphs have been extracted:

² The DFS is modified such that, if a sequential node v is found, no outgoing edges of v are traversed any further.
DECOMP
 1  IN: G(V, E)
 2  OUT: Set of subgraphs Gi(Vi, Ei)
 3  BEGIN
 4    // Find the sensitivity-update-mappings list Li = (v_ri ↦ {u_r1, ..., u_rn})
 5    integer i ← 0;
 6    FOR all sequential elements v ∈ Vr DO
 7      i ← i + 1; v_ri ← v;
 8      {u_r1, ..., u_rn} ← DFS(G, v);
 9      Li ← (v_ri, {u_r1, ..., u_rn});
10    ENDFOR
11    // Build adjacency matrix M (|Vr| × |Vr|)
12    m_ij ← 1, if ∃ Li : (v_ri ↦ u_rj); 0, otherwise;
13    // Sort the rows of M in descending order (criterion: sum of the elements in each row)
14    sum_in_rows[1..|Vr|] ← get_sum_in_rows(M);
15    sort_rows(M, sum_in_rows);
16    // Sort the columns of M in descending order (criterion: sum in each column)
17    sum_in_columns[1..|Vr|] ← get_sum_in_columns(M);
18    sort_columns(M, sum_in_columns);
19    // Cluster graph G into subgraphs
20    integer j ← |Vr|; l ← 0;
21    WHILE (j > 1) DO
22      l ← l + 1; V_l^inp ← ∅; V_l^out ← ∅;
23      WHILE (columns[j] = columns[j−1]) && (j > 1) DO
24        V_l^out ← V_l^out ∪ {v_rj};
25        j ← j − 1;
26      ENDWHILE
27      V_l^out ← V_l^out ∪ {v_rj};
28      FOR i = 1 TO |Vr| DO
29        IF (m_ij > 0) THEN V_l^inp ← V_l^inp ∪ {v_ri};
30      ENDFOR
31      j ← j − 1;
32    ENDWHILE
33  END

Figure 2. RTL circuit graph decomposition algorithm.
G1: (V_1^inp = {r1, r2, r3, r4}, V_1^comb = {c1, c2, c4, c5}, V_1^out = {r6, r7, r8}),
G2: (V_2^inp = {r3, r4, r6}, V_2^comb = {c2, c6}, V_2^out = {r9}), and
G3: (V_3^inp = {r4, r5}, V_3^comb = {c3}, V_3^out = {r10}).
Here, the corresponding set of sensitivity-update-mappings is: {r1, r2, r3, r4} ↦ {r6, r7, r8}, {r3, r4, r6} ↦ {r9}, and {r4, r5} ↦ {r10}.
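The grouping step of DECOMP can be sketched in a behaviorally equivalent way: after sorting, identical columns of the adjacency matrix end up adjacent and are merged, so grouping output registers whose input-register sets coincide yields the same subgraphs. The following Python sketch (hypothetical names, data taken from Example 3) illustrates this; it is not the matrix-sorting implementation itself.

```python
# Behaviorally equivalent sketch of DECOMP's clustering step: group output
# registers whose adjacency-matrix columns (input-register sets) are equal.
from collections import defaultdict

# initial sensitivity-update-mappings for the netgraph of Fig. 3
initial = {
    "r1": {"r6", "r7", "r8"},
    "r2": {"r6", "r7", "r8"},
    "r3": {"r6", "r7", "r8", "r9"},
    "r4": {"r6", "r7", "r8", "r9", "r10"},
    "r5": {"r10"},
    "r6": {"r9"},
}

def decompose(initial):
    """Group target registers by identical input sets: each target ends up
    in exactly one group, so it is updated at most once per cycle."""
    inputs_of = defaultdict(set)       # per-target column of the matrix
    for src, targets in initial.items():
        for t in targets:
            inputs_of[t].add(src)
    groups = defaultdict(set)
    for t, inp in inputs_of.items():
        groups[frozenset(inp)].add(t)  # equal columns -> same subgraph
    return dict(groups)

subgraphs = decompose(initial)
# -> {r1,r2,r3,r4} ↦ {r6,r7,r8}; {r3,r4,r6} ↦ {r9}; {r4,r5} ↦ {r10}
```

The result matches the three sensitivity-update-mappings of Example 3, and every output register appears in exactly one group, which is the at-most-once update property of Theorem 1.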
2.2 RTL Circuit Graph Partitioning
As a result of the decomposition algorithm, we obtain a set of subgraphs from which the sensitivity-update routine can be directly extracted and the compiled simulator code can be generated as shown in the previous section. Considering one subgraph Gi, an update of each output register v ∈ V_i^out can lead to multiple evaluations of the corresponding combinational elements c ∈ V_i^comb in the same simulation cycle. Obviously, this dramatically slows down the simulation speed.

Figure 3. RTL circuit graph decomposition. In (a), a netlist is shown; in (b), the corresponding graph G. In (c), the set of initial sensitivity-update-mappings for the sequential elements of the graph G is depicted. Finally, in (d), the decomposition into the subgraphs G1, G2, and G3 is shown.

Example 4. In Fig. 4 (a), if the value of register r1 has changed compared to its value in the previous simulation cycle, then the RT update for the registers {r2, r3, r4} must be performed. Namely, for r2 the set of combinational elements {c1, c2, c3, c4}, for r3 the set {c1, c2, c4}, and for r4 the set {c1, c2, c4, c5} must be reevaluated. The sets overlap, which implies multiple evaluations of the combinational elements c1, c2, and c4 during the same simulation cycle.

To avoid this effect and to minimize the subset of combinational elements which have to be evaluated during a simulation cycle, a graph partitioning algorithm is presented in the following. The algorithm inserts so-called virtual intermediate registers (VIRs) at certain places of the graph. The function of a VIR is to store a required intermediate circuit value (assignment operation), thereby reducing the overall number of function calls during the register transfer updates (see Fig. 4). In the following, we consider only the subgraph Gi = (V_i^inp, V_i^comb, V_i^out, Ei), in particular only the subset of combinational elements V_i^comb. Let n = |V_i^comb| be the number of nodes in V_i^comb. Let W = (w_c1, w_c2, ..., w_cn) be the vector of weights of the corresponding combinational elements, denoting the simulation time in milliseconds of each combinational element. In the vector L = (l_c1, l_c2, ..., l_cn), each element l_ci denotes how often each
Figure 4. RTL circuit graph partitioning. In (a), before partitioning: if the value of r1 has changed compared to its value in the previous simulation cycle, then the RT update for {r2, r3, r4} must be performed. The overlapping subsets U(r2), U(r3), and U(r4) contain the combinational nodes which must be reevaluated for each output register during the RT. In (b), after a first step of partitioning: the virtual register VIR1 has been inserted and, as a result, the common subsets contain only the two combinational elements c1 and c2.
combinational element ci is called. Let w_vir be a constant weight denoting the simulation time of one VIR, assuming that all VIRs have the same bitwidth. Then, the total simulation time depends on the number m of inserted VIRs: S(m) = W · L + w_vir · m. Consider the case m = n, w_cmin = min_{i=1,...,n} {w_ci}, and l_ci ≥ 2 (i.e., each combinational element is called at least twice). Since the simulation time of one assignment operation (a VIR) is always less than or equal to the simulation time of any other combinational element, we have w_cmin ≥ w_vir, i.e., w_vir = w_cmin − δ with δ > 0, δ → 0. In this dominant case, the insertion of the maximal possible number of VIRs (i.e., at all places where the output degree of a combinational node is > 1) results in the minimal simulation time. Theoretically, the case w_cmin < w_vir could occur, for instance in a system with hard memory restrictions, where each additional insertion of a VIR could be very expensive with regard to memory constraints. In practice, however, the insertion of a VIR is nothing more than one assignment; on the other hand, the implementation of any combinational element already includes such an assignment as well, which means that a VIR will never consume more memory than a combinational element. Moreover, since our simulation engine is a set of inline functions, the insertion of a VIR reduces repeated calls of the same function in the same simulation cycle, which, as a consequence, also reduces the program code size of the generated simulator.
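The cost model above can be made concrete with a small numeric sketch. The call counts are taken from Example 4 (c1, c2, and c4 appear in all three update sets U(r2), U(r3), U(r4)); the weights and the VIR placement are made-up illustration values, not measurements from the paper.

```python
# Sketch of the partitioning cost model S(m) = W . L + w_vir * m:
# total evaluation time with m inserted virtual intermediate registers.
# Weights (ms) are invented; call counts l_ci follow Example 4, where
# c1, c2, c4 lie in all three update sets U(r2), U(r3), U(r4).
w = {"c1": 3.0, "c2": 2.0, "c3": 1.5, "c4": 2.5, "c5": 1.0}  # made up
l_before = {"c1": 3, "c2": 3, "c3": 1, "c4": 3, "c5": 1}
w_vir = 0.1   # one assignment: cheaper than any combinational element

def sim_time(l, m):
    """Total simulation time of one cycle for call counts l and m VIRs."""
    return sum(w[c] * l[c] for c in w) + w_vir * m

# After inserting VIRs at the fan-out nodes (say, after c2 and c4), each
# combinational element is evaluated at most once per cycle.
l_after = {c: 1 for c in w}
m = 2
saved = sim_time(l_before, 0) - sim_time(l_after, m)
```

With these numbers, S drops from 25.0 ms to 10.2 ms per cycle, illustrating why inserting VIRs at fan-out nodes pays off whenever w_vir is below the cheapest combinational weight.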
3 Experimental Results and Case Study
First, we performed a number of tests to measure the speedup obtainable with the proposed graph partitioning algorithm. Here, the simulation times were reduced by 7-12%. Second, we present simulation results for various architectures of different complexity using different simulation engine libraries at link-time. Performance results of
Figure 5. Simulation speedups with respect to the generated simulators in [12] for different linked libraries: a) integer, b) GMP static, and c) GMP shared library.

Table 1. MIPS processor simulation speed. The simulation speed in machine cycles per second (CPS) is compared to the existing commercial RTL Scirocco simulator in event-driven mode.

MIPS R3000            | SUN (GMP) | SUN (Integer) | SUN Scirocco (VHDL) | Linux (GMP) | Linux (Integer)
Simulation speed, CPS | 253 807   | 3 448 280     | 9 940               | 540 541     | 9 090 920
different generated simulators were obtained using a Pentium IV at 2.5 GHz with 1 GB RAM running Linux SuSE 8.0. In Fig. 5, different linked libraries have been used and compared in the generated simulator versions. The speedup is shown with respect to the Simcore library [12], which is based on bitstrings. The use of the GNU Multiple Precision (GMP) arithmetic library showed simulation speedups of a factor of 24 for 8-bit architectures, a factor of 27 for 16-bit architectures, and a factor of 34 for 32-bit architectures. The use of standard data structures such as, for instance, integers (without linking any library) showed even better results, with simulation speeds a factor of 4 faster than with the GMP arithmetic library. In this case, however, we are able to simulate only architectures with standard fixed bitwidths (8, 16, 32 bit), as the standard data types operate only on values of these bitwidths. Since such architectures are widespread, this option is far from unimportant for us and can be considered during simulator generation. Third, as a realistic case study, a simulator for the MIPS R3000 32-bit processor has been generated and evaluated. Table 1 shows the simulation performance of the MIPS processor using our approach compared to the existing commercial RTL Scirocco simulator (in event-driven mode) (Synopsys) [2]. In the table, the performance results were obtained using a Sun workstation running Solaris 2.9 with a 900 MHz UltraSPARC-III processor, whereas the simulation speeds reported in the last two columns were achieved using the described Linux environment. A notable improvement in simulation speed of almost two orders of magnitude compared to Scirocco is shown. Up to 9 million machine cycles per second were demonstrated here at the RT level, which is equally fast or faster than most compiled instruction-level simulators. Additionally, the binary code size of the generated simulators is about a factor of 50 smaller than that of the commercial one.
4 Conclusions and Future Work
In this paper, we proposed a new mixed register-transfer/instruction-set level compiled simulation technique in which the simulator is automatically extracted from an RTL architecture description and the application program is compiled prior to simulator run-time. Furthermore, several optimization techniques to accelerate the simulation speed have been introduced: event-driven simulation; efficient generation of simulator code using inlining to enable the highest compiler optimizations; a graph decomposition algorithm that divides a netgraph description into subgraphs which denote optimal sets of update rules; and a partitioning algorithm to minimize the subset of combinational elements which have to be evaluated during one simulation cycle. The results show a high simulation speed combined with the flexibility, cycle-accuracy, and bit-trueness of an entirely RT-level simulation. In the future, we would like to work on the visualization of the simulation process and its results. Further optimizations can be achieved by using a combination of different data types (arbitrary- and fixed-width) in one generated simulator.
References

1. MIPS homepage. http://mips.com/.
2. Synopsys homepage. http://synopsys.com/.
3. Axys Design Automation. http://www.axysdesign.com.
4. J. Zhu and D. D. Gajski. A retargetable, ultra-fast instruction set simulator. In Proceedings of the European Design and Test Conference, 1999.
5. E. M. Sentovich et al. SIS: A system for sequential circuit synthesis. Technical Report UCB/ERL M92/41, University of California, Berkeley, May 1992.
6. D. Fischer, J. Teich, M. Thies, and R. Weper. Efficient architecture/compiler co-exploration for ASIPs. In ACM SIG Proceedings International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES 2002), pages 27-34, Grenoble, France, 2002.
7. D. Fischer, J. Teich, M. Thies, and R. Weper. BUILDABONG: A framework for architecture/compiler co-exploration for ASIPs. Journal for Circuits, Systems, and Computers, Special Issue: Application Specific Hardware Design, pages 353-375, 2003.
8. T. Granlund. The GNU multiple precision library, edition 2.0.2. Technical report, TMG Datakonsult, Sodermannagatan 5, 11623 Stockholm, Sweden, 1996.
9. R. Leupers. Retargetable Code Generation for Digital Signal Processors. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1997.
10. S. Pees, A. Hoffmann, and H. Meyr. Retargeting of compiled simulators for digital signal processors using a machine description language. In Proceedings of Design, Automation and Test in Europe (DATE 2000), Paris, March 2000.
11. E. Schnarr and J. R. Larus. Fast out-of-order processor simulation using memoization. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 283-294, 1998.
12. J. Teich, P. Kutter, and R. Weper. Description and simulation of microprocessor instruction sets using ASMs. In International Workshop on Abstract State Machines, Lecture Notes in Computer Science (LNCS), pages 266-286. Springer, 2000.
13. L.-T. Wang, N. E. Hoover, E. H. Porter, and J. J. Zasio. SSIM: A software levelized compiled-code simulator. In Proceedings of the Design Automation Conference, pages 2-8.
14. E. Witchel and M. Rosenblum. Embra: Fast and flexible machine simulation. In Measurement and Modeling of Computer Systems, pages 68-79, 1996.