Fast Cycle-Accurate Simulation and Instruction Set Generation for Constraint-Based Descriptions of Programmable Architectures

Scott J. Weber 1, Matthew W. Moskewicz 1, Matthias Gries 1, Christian Sauer 2, Kurt Keutzer 1
1 University of California, Electronics Research Lab, Berkeley
2 Infineon Technologies, Corporate Research, Munich, Germany
{sjweber, moskewcz, gries, sauer, keutzer}@eecs.berkeley.edu

Abstract
State-of-the-art architecture description languages have been successfully used to model application-specific programmable architectures limited to particular control schemes. In this paper, we introduce a language and methodology that provide a framework for constructing and simulating a wider range of architectures. The framework exploits the fact that designers are often only concerned with data paths, not the instruction set and control. In the framework, each processing element is described in a structural language that only requires the specification of the data path and constraints on how it can be used. From such a description, the supported operations of the processing element are automatically extracted and a controller is generated. Various architectures are then realized by composing the processing elements. Furthermore, hardware descriptions and bit-true cycle-accurate simulators are automatically generated. Results show that our simulators are up to an order of magnitude faster than other reported simulators of this type and two orders of magnitude faster than equivalent Verilog simulations.

Categories and Subject Descriptors: C.0 [Computer Systems Organization]: General -- Modeling of computer architecture; I.6.7 [Simulation and Modeling]: Simulation Support Systems; D.3.2 [Programming Languages]: Language Classifications – Constraint and logic languages, design languages, specialized application languages.

General Terms: Algorithms, Design, Languages.
Keywords: Instruction set extraction, automatic control generation, cycle-accurate simulation.

1. Introduction
The primary focus of the designer of an application-specific programmable processor is often only the data path, not the control and instruction set. Control should be automatically generated whenever possible to ease the design process and to avoid potential errors. Likewise, the instruction set should reflect the capabilities of the underlying data path, not define them. Attempting to define an instruction set can be complicated by architectural complexities such as multiple memories and forwarding paths. The problem, however, is that existing architecture description languages (ADLs) require the complete specification of either the control or the instruction set in addition to the data path. The MIMOLA ADL [1][2], for instance, provides the ability to automatically extract the instruction set from a description of the data path, but requires the specification of control signals. On the other hand, ADLs such as LISA [3], EXPRESSION [4], and nML [5] provide the ability to automatically generate the control, but require the specification of the instruction set. It is for this reason that we introduce a new language and supporting framework that both extracts the instruction set and generates a controller from a description of the data path.

In our framework, each processing element is a statically-scheduled horizontally-microcoded machine. Such machines require neither hazard detection logic nor dynamic control, but instead rely on static scheduling. In such a scenario, components are composed in a modular manner so that the control for one is provided by another. Moreover, composing in this manner does not limit one to the vertically controlled machines created by today's ADLs. Due to space constraints, we will focus only on machines with horizontally-microcoded control. Such machines are the core component for defining more sophisticated machines. To deal with the potential code size explosion of the microcode, we will also outline an encoder/decoder strategy.

As a first step to analyzing the utility of our methodology, we have developed a type-polymorphic, parameterized language for describing processing elements in terms of the data path and constraints on how the data path can be used. From this description, we are able to extract the supported operations, generate assemblers, generate synthesizable RTL Verilog, and generate fast bit-true cycle-accurate interpretive and compiled-code simulators.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CODES+ISSS'04, September 8–10, 2004, Stockholm, Sweden.
Copyright 2004 ACM 1-58113-937-3/04/0009...$5.00.
In fact, our generated simulators are as fast as or faster than those produced by other ADL-based frameworks. The paper is organized as follows. In Section 2, we provide a brief overview of our design flow. Our language and operation extraction procedure are discussed in Section 3. In Section 4, we discuss the generation of simulators. An overview of the generation of hardware is covered in Section 5. In Section 6, we present the simulation performance for two designs. We discuss related work in Section 7, and conclude in Section 8.

2. Overview
We have developed a correct-by-construction framework for designing statically-scheduled horizontally-microcoded programmable architectures. From a constraint-based description of the design, the primitive operations of the architecture are extracted. From these operations, bit-true cycle-accurate simulators and a synthesizable RTL description are automatically generated. The complete flow of the design methodology is shown in Figure 1.

[Figure 1 (Design Methodology): flow diagram. The Programmer's Application Source Code is compiled to Object Code. The Architect's Architecture Description, together with a component Lib, feeds Operation Extraction, which drives generation of Verilog RTL, an Interpretive Simulator, a Compiled Code Sim, and Hardware.]

3. Architecture Design
Our methodology simplifies the way an RTL designer approaches a programmable design. First, designers specify the data path. Second, to enable operation extraction and to make verification a first-class citizen during design, constraints must be specified on how particular components of the data path can be used. These constraints force the designer to design in a correct-by-construction manner. An instruction set representing the source (inputs or register reads) to sink (outputs or register writes) operations of the data path is then automatically extracted. A controller that implements the instruction set is also generated. Finally, in order to finish the implementation, the designer writes a program either at the level of the instruction set or in a higher-level language that is then compiled to the architecture.

As an illustrative example, we explore an architecture that is capable of incrementing or decrementing an input by a step value. The architecture is shown in Figure 2.

[Figure 2 (Inc/Dec Architecture): a demux D routes the input din to the din port of either the inc or the dec component; each component also receives a step input. A mux M selects between the out ports of inc and dec to drive the architecture's out. The sel ports of D and M are left unconnected.]

3.1 Describing an Architecture
The components themselves are described in a constraint-based language in terms of constraints on signals and a set of rules. Rules have an activation and an action section. The activation section indicates which signals have to be present and/or absent in order for the rule to be activated (parenthesized after the rule name). The action section describes what operations to perform if the rule is activated. An incrementer component is shown in Figure 3 (the type annotation indicates that step is a 5-bit unsigned integer).

inc (input din, input step, output out) {
  rule fire(din, step) { out = din + step; }
  rule no_fire(-din, -step) {}
  inc fire || no_fire;
}

Figure 3. Incrementer Description

The incrementer is interpreted as follows. The inc component is valid if and only if the constraints of rule fire or (inclusive) the rule no_fire are satisfied. The rule fire is satisfied if din and step are "present". A signal is "present" if it is assigned a value. When fire is satisfied, then out is set to "present" and is assigned the value din + step. The rule no_fire is satisfied if din and step are not "present". When no_fire is satisfied, then out is set to not "present" (i.e., nothing is assigned). Note that due to these rules, inc cannot be valid if only one of the input signals (din or step) is present. In Figure 4, the component is shown in terms of quantifier-free first-order logic on rules and the presence of signals. This formulation is shown since internally it is how components are represented for the operation extraction procedure described later.

inc (input din, input step, output out) {
    ( fire ⇔ ( din ∧ step ) )
  ∧ ( fire ⇒ ( out ∧ "out = din + step" ) )
  ∧ ( no_fire ⇔ ( ¬din ∧ ¬step ) )
  ∧ ( no_fire ⇒ ¬out )
  ∧ ( inc ⇔ ( fire ∨ no_fire ) ) }

Figure 4. Incrementer Constraints

Accordingly, a decrementer is defined in Figure 5.

dec (input din, input step, output out) {
    ( fire ⇔ ( din ∧ step ) )
  ∧ ( fire ⇒ ( out ∧ "out = din - step" ) )
  ∧ ( no_fire ⇔ ( ¬din ∧ ¬step ) )
  ∧ ( no_fire ⇒ ¬out )
  ∧ ( dec ⇔ ( fire ∨ no_fire ) ) }

Figure 5. Decrementer Constraints
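As a concrete illustration of these presence semantics, the incrementer's two rules can be sketched in C++. This is a hypothetical model (not the paper's generated code): a signal is modeled as std::optional, where an empty optional means the signal is not "present".

```cpp
#include <optional>
#include <stdexcept>

using Signal = std::optional<unsigned>;  // empty == not "present"

// Hypothetical model of the inc component: valid iff rule fire or rule
// no_fire is satisfied. Any other presence combination violates the
// component's constraints.
Signal inc(const Signal& din, const Signal& step) {
    bool fire    = din.has_value() && step.has_value();   // both present
    bool no_fire = !din.has_value() && !step.has_value(); // both absent
    if (fire)    return *din + *step;   // out becomes present: din + step
    if (no_fire) return std::nullopt;   // out not "present": nothing assigned
    // exactly one input present: no rule is satisfied, inc is invalid
    throw std::logic_error("inc: constraints violated");
}
```

Calling inc with only one present input throws, mirroring the observation that inc cannot be valid when only one of din or step is present.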

The definition in Figure 6 of the mux demonstrates a number of useful features of the language. First, the mux uses a list port @din and foreach expressions in order to make the mux a generic N:1 mux, where N is determined by the number of signals connected to @din. Second, the N:1 mux is type-polymorphic. A type resolution routine is used to determine the type of din and out. The type rules are the same as in Verilog, but types can also be determined as expressions on constants and other types (i.e., din is defined to have the same type as out). Third, the input sel is an enumerated signal. Enumerated signals are constrained to take a unique value for each rule in which they appear. Finally, only the data path has been specified. The control will be generated for the unconnected port(s), which for the mux is only sel.

mux (input @din, input {enum} sel, output out) {
  foreach(i) {
      ( fire ⇔ ( din[$i] ∧ sel ∧ foreach(j) { if ($i != $j) { ¬din[$j] } } ) )
    ∧ ( fire ⇒ ( out ∧ "out = din[$i]" ) ) }
  ∧ ( no_fire ⇔ ( ¬@din ∧ ¬sel ) )
  ∧ ( no_fire ⇒ ¬out )
  ∧ ( mux ⇔ ( or(fire) ∨ no_fire ) ) }

Figure 6. Type-Polymorphic N:1 Mux

If out is a 22-bit unsigned integer, and there are two signals connected to din, then the mux component would be automatically expanded to the constraints of the 2:1 mux shown in Figure 7. Note that the sel.e signal is constrained to take a single value, and that the type of sel indicates that it is not used in the data path. Furthermore, signals (e.g., out) are constrained to be assigned only one value.

mux (input din[0], input din[1], input sel, output out) {
    ( fire0 ⇔ ( din[0] ∧ ¬din[1] ∧ sel ∧ sel.e = α ) )
  ∧ ( fire0 ⇒ ( out ∧ "out = din[0]" ) )
  ∧ ( fire1 ⇔ ( din[1] ∧ ¬din[0] ∧ sel ∧ sel.e = β ) )
  ∧ ( fire1 ⇒ ( out ∧ "out = din[1]" ) )
  ∧ ( no_fire ⇔ ( ¬din[0] ∧ ¬din[1] ∧ ¬sel ∧ sel.e = χ ) )
  ∧ ( no_fire ⇒ ¬out )
  ∧ ( sel.e ∈ { α, β, χ } )
  ∧ ( "out = din[0]" ⇒ out ) ∧ ( "out = din[1]" ⇒ out )
  ∧ ( "out = din[0]" ⇒ ¬"out = din[1]" )
  ∧ ( mux ⇔ ( fire0 ∨ fire1 ∨ no_fire ) ) }

Figure 7. 22-bit 2:1 Mux

The demux actor is defined in a way similar to the mux so that it is a type-polymorphic 1:N demux with no control specified. An expanded 22-bit 1:2 demux is shown in Figure 8.

demux (input din, input sel, output out[0], output out[1]) {
    ( fire0 ⇔ ( din ∧ sel ∧ sel.e = α ) )
  ∧ ( fire0 ⇒ ( out[0] ∧ ¬out[1] ∧ "out[0] = din" ) )
  ∧ ( fire1 ⇔ ( din ∧ sel ∧ sel.e = β ) )
  ∧ ( fire1 ⇒ ( ¬out[0] ∧ out[1] ∧ "out[1] = din" ) )
  ∧ ( no_fire ⇔ ( ¬din ∧ ¬sel ∧ sel.e = χ ) )
  ∧ ( no_fire ⇒ ( ¬out[0] ∧ ¬out[1] ) )
  ∧ ( sel.e ∈ { α, β, χ } )
  ∧ ( demux ⇔ ( fire0 ∨ fire1 ∨ no_fire ) ) }

Figure 8. 22-bit 1:2 Demux

To complete the architecture in Figure 2, the appropriate ports on the components are connected. This allows for the propagation of "present" signals. Hierarchical composition is available, but is not used in the example. The two sel ports that are left unconnected will have their values set appropriately for each operation that is extracted. Since the step input ports are not connected to any signal but are used by the data path, their values will be provided by operation parameters. We assume that D.din is provided by a signal from the environment.

Although not shown in any of the components in Figure 2, a component can also contain any number of state elements. A state element is either a register or a flip-flop. A register holds the value written to it until the value is overwritten. A flip-flop holds the value written to it for one cycle. With these primitives, one can build type-polymorphic, parameterized RAMs, register files, ROMs, pipeline registers, and other useful state elements.

After composing either user-defined or pre-defined components from a library, the designer invokes the operation extraction routine. At this point, the designer will see what operations the data path supports. If components are utilized in an unexpected way, the designer must inspect and reformulate the constraints to get the desired behavior.
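The distinction between the two state primitives can be sketched as follows. The class names and the end-of-cycle commit discipline are a hypothetical model consistent with the simulator semantics described later (reads at the beginning of a cycle, writes committed at the end); we also assume a flip-flop reverts to 0 once its one-cycle hold expires.

```cpp
#include <optional>

// Hypothetical sketch of the two state primitives. Writes are buffered and
// committed at the end of the cycle.
struct Register {
    unsigned value = 0;
    std::optional<unsigned> next;
    void write(unsigned v) { next = v; }
    unsigned read() const { return value; }
    void end_of_cycle() {               // holds its value until overwritten
        if (next) { value = *next; next.reset(); }
    }
};

struct FlipFlop {
    unsigned value = 0;
    std::optional<unsigned> next;
    void write(unsigned v) { next = v; }
    unsigned read() const { return value; }
    void end_of_cycle() {               // holds a written value for one cycle
        value = next.value_or(0);       // assumed reset value when not written
        next.reset();
    }
};
```

With this discipline, a register written in cycle k is readable in every later cycle until overwritten, while a flip-flop written in cycle k is readable only in cycle k+1.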

3.2 Extracting the Operations
In the previous section, we demonstrated how the constraints on components are formulated in quantifier-free first-order logic. When extracting operations, we create additional constraints to equate connected ports and to assert each component's constraint literal (e.g., inc, dec, mux, demux). We then find the set of satisfying solutions using an iterative SAT procedure, called FindMinimalOperations (FMO), shown in Figure 10. FMO finds all "minimal" paths through the data path; these are the valid operations of the design. The procedure must be restricted to find "minimal" solutions, as there are generally an exponential number of solutions, scaling with the amount of independent parallelism in the design. A solution is "minimal" when it has the least number of "present" ports and still satisfies the constraints. This means that no other ports can be set to not "present". The inner loop of FMO does this minimization. Following the creation of an operation, we restrict the model formula so that subsequent "minimal" operations are not simply combinations of previously found operations. By limiting FMO in this manner, we can quickly find the set of supported "minimal" operations. For the example architecture in Figure 2, the operations shown in Figure 9 are found.

NOP
INC(step)    M.out = D.din + step
DEC(step)    M.out = D.din - step

Figure 9. Operations Extracted for Inc/Dec Arch

Operation extraction also creates a conflict table that indicates which operations cannot occur on the same cycle due to the constraints. For the operations in Figure 9, INC(step) and DEC(step) are in conflict. The assembler and compiler then use this information to create appropriate schedules. After extraction, we can further restrict the architecture by removing unwanted operations. This is sometimes an easier way to restrict particular paths than creating the appropriate constraints on the data path; in either case, the resulting control is equivalent.

After extracting the operations, we now have a description of a completely statically-scheduled horizontally-microcoded machine. Architectures in this class require neither dynamic scheduling nor hazard detection control, since all conflicts can be found at compile time. More complex architectures can be created by composing and coupling these machines. Such coupling approaches have been successfully used by the Intel x86 machines to translate CISC instructions into more RISC-like instructions for the superscalar core, and more recently by Transmeta to translate x86 code into statically-scheduled VLIW code. From this description we can now generate bit-true cycle-accurate simulators as well as synthesizable RTL for any architecture in this class. For the RTL, we will also need to synthesize a controller.

BASE is the CNF formulation of the model
PORT is the set of port literals
CERT is a certificate, i.e., a set of literals (satisfying BASE)
present : (PORT × CERT) → Boolean (true if port present in cert)
isSatisfiable : BASE → Boolean (true if CNF is satisfiable)
getCertificate : the last certificate that made isSatisfiable true

OPERATIONS = {}, OPPORTS = {}
while (isSatisfiable(BASE)) {
  do {
    C = getCertificate()
    "remove the constraints added in (1)"
    BASE ∧= (∑_i {¬port_i | port_i ∈ PORT ∧ present(port_i, C)})        (1)
    BASE ∧= (∏_i {¬port_i | port_i ∈ PORT ∧ ¬present(port_i, C)})       (2)
  } while (isSatisfiable(BASE))
  "remove the constraints added in (2)"
  "create new operation named 'op' based on the certificate"
  OPPORTS ∪= {op → {port | port ∈ PORT ∧ ¬present(port, C)}}
  OPERATIONS = OPERATIONS ∪ op
  BASE ∧= (∏_i {port_i | port_i ∈ PORT ∧ present(port_i, C)} ⇔ op)      (3)
  BASE ∧= (∑_i {port_i | port_i ∈ PORT} ∧ (∏_j {¬op_j | op_j ∈ OPERATIONS ∧ port_i ∈ OPPORTS(op_j)}))   (4)
}
"remove the constraints added in (3), and (4)"

Figure 10. FindMinimalOperations (FMO)
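The idea behind FMO can be illustrated with a toy brute-force stand-in. This sketch is not FMO itself: it enumerates all port-presence assignments of the inc/dec example directly instead of iterating a SAT solver, and the port numbering and constraint predicate are our own hypothetical encoding of the composed design. FMO's solver-based blocking of unions of found operations is what makes the real procedure scale past toy sizes.

```cpp
#include <bitset>
#include <vector>

// Toy port-presence model of the inc/dec design (hypothetical port order):
// bit 0: D.din   bit 1: inc.step  bit 2: inc.out
// bit 3: dec.step bit 4: dec.out  bit 5: M.out
using Assign = std::bitset<6>;

// Constraint literal of the composed design: each component fires fully or
// not at all, and the mux M forbids inc and dec both producing a value.
bool satisfies(const Assign& a) {
    bool inc_fire = a[0] && a[1] && a[2];
    bool inc_off  = !a[1] && !a[2];
    bool dec_fire = a[0] && a[3] && a[4];
    bool dec_off  = !a[3] && !a[4];
    bool mux_ok   = (a[5] == (a[2] ^ a[4])) && !(a[2] && a[4]);
    bool din_used = a[0] == (a[2] || a[4]);  // D routes din to exactly one side
    return (inc_fire || inc_off) && (dec_fire || dec_off) && mux_ok && din_used;
}

// Brute-force analogue of operation extraction: enumerate the satisfying
// presence assignments; for this design they are exactly NOP, INC, and DEC.
std::vector<Assign> find_operations() {
    std::vector<Assign> ops;
    for (unsigned long m = 0; m < 64; ++m)
        if (satisfies(Assign(m))) ops.push_back(Assign(m));
    return ops;
}
```

Running this yields three assignments: the empty assignment (NOP), the inc path {D.din, inc.step, inc.out, M.out}, and the dec path {D.din, dec.step, dec.out, M.out}, matching Figure 9.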

4. Simulator Generation
The structure of our generated simulators is straightforward. For each cycle, we run a statically-scheduled instruction. An instruction is defined as an unordered set of operations that has been checked for conflicts. Each operation may contain a set of statically determined parameters and may read inputs from and write outputs to the interface of the architecture. Decoding an operation simply requires jumping to an appropriate label based on the name of the operation. For each operation label, we execute the appropriate source-to-sink transformations. Since multiple operations can be executed in a single cycle, we commit writes to state elements at the end of a cycle. The resulting simulator is equivalent to a discrete-event simulation of an FSM in an RTL simulator. However, our simulator is much faster because we statically determine the schedule. The generated simulators are implemented in C++ for performance reasons. In order to get the utmost performance from our simulators, we also use a number of constructs that would not be found in hand-written code. For example, we use computed-gotos, template meta-programming, and inlining. These techniques, coupled with a good compiler, result in high-performance simulators.
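For instance, decode-by-label dispatch with computed gotos (a GCC/Clang labels-as-values extension) looks roughly like the sketch below. The operation stream, opcodes, and accumulator are illustrative placeholders, not the generated code.

```cpp
// Illustrative computed-goto dispatch: each operation code indexes a label
// table, so "decoding" is a single indirect jump (GCC/Clang extension,
// not portable to all compilers).
unsigned run(const unsigned* ops, unsigned n, unsigned acc) {
    static void* const dispatch[] = { &&op_nop, &&op_inc, &&op_dec };
    unsigned pc = 0;
    while (pc < n) {
        goto *dispatch[ops[pc++]];  // jump directly to the operation's label
    op_nop: continue;
    op_inc: ++acc; continue;
    op_dec: --acc; continue;
    }
    return acc;
}
```

Compared with a switch, this removes the bounds check and jump-table indirection most compilers emit, at the cost of portability.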

4.1 Interpretive Simulator
Generating an interpretive simulator is useful when the instruction stream is dynamic. The interpretive simulator executes either instructions from object files generated by an assembler or dynamic instructions produced by another processing element. The assembler is parameterized by the instruction set and conflict table that were found during operation extraction. Since instructions are conflict free, the execution order of the operations within an instruction does not matter. Also, since the program is statically scheduled on a cycle-by-cycle basis, no dynamic scheduling or hazard detection control is required. State is maintained appropriately by performing reads at the beginning and writes at the end of a cycle. Finally, a testbench is generated for each simulator that is used to advance time, terminate the simulation, and provide inputs and outputs for the simulator. The control of the simulation is orchestrated through a special port called "instruction". At the beginning of each cycle, the last value written to "instruction" is interpreted as the instruction to execute. The mechanism of how instructions are placed on this port is up to the designer. Possible implementations may include having another processing element with its own instruction set produce the instructions or having the processing element produce the instructions itself through the use of a program counter. If nothing is connected to the "instruction" port, then the environment must provide an instruction trace.
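The assembler's conflict check can be sketched as a simple pairwise test against the extracted conflict table. The types and function name here are hypothetical; the real assembler is parameterized by the extracted instruction set as well.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical assembler-side check: an instruction (an unordered set of
// operations) is legal iff no pair of its operations appears in the
// conflict table produced by operation extraction.
using Op = std::string;
using ConflictTable = std::set<std::pair<Op, Op>>;

bool conflict_free(const std::vector<Op>& instr, const ConflictTable& table) {
    for (size_t i = 0; i < instr.size(); ++i)
        for (size_t j = i + 1; j < instr.size(); ++j)
            if (table.count({instr[i], instr[j]}) ||
                table.count({instr[j], instr[i]}))   // table pairs are unordered
                return false;
    return true;
}
```

For the inc/dec example, a table containing the pair (INC, DEC) accepts {NOP, INC} but rejects {INC, DEC}.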
while (!testbench->exit()) {
  instruction = testbench->read_instruction();
  foreach (operation ∈ instruction) {
    switch (operation) {
      case NOP:
        break;
      case INC:
        uint din = testbench->read_din();
        uint step = operation->get(0);
        uint out = din + step;
        testbench->write_out(out);
        break;
      case DEC:
        uint din = testbench->read_din();
        uint step = operation->get(0);
        uint out = din - step;
        testbench->write_out(out);
        break;
      default:
        report_error();
    }
  }
  // commit state writes (if they exist)
}

Figure 11. Interpretive Simulator for Inc/Dec Arch

For each operation, the actual expressions to be performed are automatically extracted from the action sections of the components that are activated for a given operation. We apply copy propagation on the network of activated expressions for a given operation. If we did not do this, we would have a number of unnecessary temporaries for the signals between components. Dead code elimination is also applied to improve the quality of the generated code. Although compilers will attempt to perform these optimizations, we have found that applying them before compilation is beneficial. The interpretive simulator for the architecture of Figure 2 is shown in Figure 11.

4.2 Compiled-Code Simulator
Compiled-code techniques can be utilized to further improve the performance of the simulator. If we know the program that is going to be run, we can hard-code the program into the simulator. If the program trace, {NOP(), INC(6)}; {NOP(), DEC(10)}, is executed for the architecture in Figure 2, then the compiled-code simulator would be as shown in Figure 12.

_0: uint din = testbench->read_din();
    uint out = din + 6;
    testbench->write_out(out);
    testbench->increment_clock();
    // computed-goto would go here
_1: uint din = testbench->read_din();
    uint out = din - 10;
    testbench->write_out(out);
    testbench->increment_clock();
    testbench->exit();

Figure 12. Compiled-Code Simulator for Inc/Dec Arch

Before runtime, we know what operations are included in each instruction. Therefore, we can combine the operations to create a single set of expressions for each cycle. Although in our example we trivially combine NOP(), any number of conflict-free operations (the assembler checks this) can be combined into a single set. Combining the operations of an instruction removes the overhead of iterating through the list of operations in an instruction. We then apply the same optimizations as we did for the interpretive simulator, plus we can propagate constant operation arguments. Furthermore, in cases where a program counter is used to determine the next instruction, we utilize computed-gotos to jump between runtime-computed labels. A further optimization that we will make in the future is to remove these gotos when we can statically determine that the simulation simply proceeds to the next label.

4.3 Interfacing with the Simulator
Three methods are used to interface with the simulator. First, probe components, which are guaranteed not to add any new semantics to the design, can passively capture a trace of the simulation. Second, when synthesis is not required, black-box components can be used. This is useful for modeling components with verified implementations (e.g., IP integration) and for using analysis components (e.g., a cache analyzer). Finally, a testbench is generated that can be used to interface the simulator with a system simulation (e.g., SystemC based).
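A minimal testbench interface consistent with the calls used in Figures 11 and 12 might look like the following. This is a hypothetical sketch: the generated testbench is richer, and here the environment's din input and the captured output are just plain fields.

```cpp
// Hypothetical testbench interface: advances time, terminates the
// simulation, and provides inputs and outputs for the simulator.
struct Testbench {
    unsigned cycle = 0;      // current simulated cycle
    unsigned din_value = 0;  // input supplied by the environment
    unsigned last_out = 0;   // last output written by the simulator
    bool     done = false;

    bool     exit() const          { return done; }
    unsigned read_din()            { return din_value; }
    void     write_out(unsigned v) { last_out = v; }
    void     increment_clock()     { ++cycle; }
};
```

A SystemC wrapper would implement the same calls, forwarding reads and writes to signals of the surrounding system model.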

5. Hardware Generation
A key component of our design methodology is the ability to produce synthesizable RTL from our architecture descriptions; we only discuss the simulation of the RTL, since describing the synthesis of hardware is beyond the scope of this paper. Since the components are specified in a structural manner using syntax and type rules consistent with Verilog, simple syntax transformations are used to generate the appropriate RTL for the design. However, unlike our simulators, we do not implement each operation as a hardware path. Instead, we preserve the structure of the data path, and synthesize a controller that multiplexes the paths in the architecture appropriately. For each operation, we can determine what control signals and write enables are required in order to activate the appropriate paths in the architecture. We then use this information to create a horizontally-microcoded controller. In order to simulate a program, we also generate the appropriate control words to be embedded in the program memory.

Currently, we are exploring an encoder/decoder scheme that compresses the instruction stream based on the analysis of static program traces. Although the details of the approach are beyond the scope of this paper, the basic idea, as shown in Figure 13, is to compress the microcode and then decompress it to get the appropriate control.

Software: program → compiler → encoder → program store
Hardware: program store → decoder → microcode buffer → control

Figure 13: Encoder/Decoder Strategy

The decoder is either manually specified or automatically generated as another component in the system with its own instruction set. The decoder demonstrates how multiple architectures can be coupled to create a more complex architecture. The benefit of this general approach is that the data path and control do not change when the encoder and decoder change. We are actively exploring encoder and decoder strategies, but for this paper our encoder/decoder strategy performs the identity function.
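As an illustration, a horizontal control word for the inc/dec example could directly carry one field per unconnected control port plus the operation parameter. The field layout and select polarities below are a hypothetical encoding (the paper does not give one); with the identity encoder/decoder, the stored word equals the control word.

```cpp
#include <cstdint>

// Hypothetical horizontal microcode word for the inc/dec architecture:
// one field per unconnected control port plus the step parameter.
struct MicroWord {
    uint8_t d_sel;  // demux D select: assumed 0 = route din to inc, 1 = to dec
    uint8_t m_sel;  // mux M select:   assumed 0 = take inc.out,     1 = dec.out
    uint8_t step;   // 5-bit step operation parameter
};

// With the identity encoder/decoder strategy, decoding is a pass-through.
inline MicroWord decode(const MicroWord& stored) { return stored; }
```

A compressing decoder would instead map short stored codes to full MicroWord values, leaving the data path and this control word format unchanged.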

6. Results
In order to test the quality of our generated simulators, we developed a DLX processor and a channel encoding processor. We designed the architectures in our language and then automatically extracted the operations. We then generated interpretive and compiled-code simulators and synthesizable RTL Verilog. Since our C++ simulators are equivalent to RTL simulation, we compare our simulation with Cadence's nc-verilog simulation of our generated RTL Verilog. Our simulators were compiled using gcc v3.2 with -O3 and were run on a 2.4 GHz Pentium 4 with 1 GB RAM. The nc-verilog simulation ran on a dual 900 MHz 64-bit UltraSPARC III with 2 GB of memory. In order to compare the results, we liberally scaled the nc-verilog numbers by a factor of 2.67 (2400 MHz / 900 MHz).

6.1 DLX Processor
Although our framework is mainly targeted at application-specific programmable cores, in order to compare the effectiveness of our approach with existing methods, we have modeled a general-purpose processing core. This means our model incorporates the characteristic elements of the micro-architecture of the DLX processor [6]. However, since we extract the set of supported operations automatically from the description of the data path, we do not match the binary encoding. Furthermore, the modeled 32-bit DLX is a horizontally-microcoded core supporting arithmetic and logical operations with a five-stage pipeline. Conditional jumps are supported. The model includes instruction and data memory, as well as program counter logic. We extracted 113 operations. Each operation represented one path through a single stage of the pipeline. We then generated the DLX instruction set by defining macro-operations that combine operations from different stages of the pipeline. These macro-operations make programming easier; however, in the end, the assembler expands the macro-operations back to the set of extracted operations.

We have implemented three representative benchmark kernels in assembly code: Cyclic Redundancy Check (CRC), Inverse Discrete Cosine Transform (IDCT), and a signal processing multiply-and-accumulate filter loop including masking (FIR), as used by established benchmarks (EEMBC, DSPStone, and Mediabench). The lookup-based 32-bit CRC requires only a few arithmetic operations, but relatively frequent memory accesses, whereas the complex IDCT has more arithmetic and program flow constructs, but fewer memory accesses. The FIR filter loop, in particular, allows us to stress the pipeline. As a corner case, we also simulate executing NOP operations only. The achieved simulation speed results are listed in Table 1.

design   nc-verilog   interpretive   compiled   ops/inst
NOP      2.5 MHz      6.9 MHz        588 MHz    1
CRC      455.7 KHz    4.6 MHz        85.5 MHz   9.12
IDCT     444.5 KHz    4.6 MHz        49.0 MHz   8.65
FIR      361.7 KHz    4.0 MHz        40.8 MHz   5.5

Table 1: DLX Simulation Speed Results

The results are reported for 2 billion simulated cycles. We report the virtual running speed on the host in cycles per second, and the ratio of the average number of primitive operations to equivalent DLX pipelined instructions. For instance, a DLX add instruction needs ten operations to execute within five cycles.

6.2 Channel Encoding Processor
We also developed a channel encoding processor capable of performing CRC, UMTS Turbo/convolutional encoding, and 802.11a convolutional encoding. The design is composed of approximately 60 components that include PC and zero-overhead looping logic, a register file, an accumulator, and a bit manipulation unit. A total of 46 operations were extracted from the design. Half of these were removed after extraction.

bit width   nc-verilog   interpretive   compiled
2           533.6 KHz    5.5 MHz        169.5 MHz
4           512.5 KHz    5.7 MHz        169.5 MHz
7           516.5 KHz    5.7 MHz        158.7 MHz
10          507.2 KHz    5.5 MHz        157.5 MHz
13          505.3 KHz    5.4 MHz        157.5 MHz
16          505.3 KHz    5.4 MHz        157.5 MHz
32          491.2 KHz    5.5 MHz        168.1 MHz
33          485.6 KHz    5.2 MHz        132.5 MHz
64          468.8 KHz    5.5 MHz        131.6 MHz

Table 2: CEP Simulation Speed Results

For this design, we experimented with various bit widths for the data path. We only had to write the convolutional encoding program in assembly once, since our tools automatically adjust for bit width changes when possible. The results of the experiment are shown in Table 2. The results are reported for 2 billion simulated cycles. We report raw speeds since the ratio of instructions to operations is not as relevant for this type of processor. The processor was running an average of five primitive operations per cycle. The running times are only slightly affected by the bit width. However, there was a noticeable drop in performance for bit widths greater than the native 32-bit data path of the Pentium 4.

6.3 Discussion
The actual time taken to create these designs was on the order of an hour. The extraction of operations, generation of simulators, and generation of Verilog were performed in a few seconds. The majority of the design effort was focused on specifying the programs. This effort required renaming the operations for debugging purposes, controlling the pipeline on a cycle-by-cycle basis, and specifying macro-operations to ease programming. We have a compiler that alleviates the need to perform these tasks. The compiler is still in development, so we did not use it for our experiments.

The typical speed of the compiled simulators is about a factor of 20 to 60 slower in cycles per second than the native host speed. The compiled C++ simulator is approximately one order of magnitude faster than the interpretive version and two orders of magnitude faster than the highly-optimized commercial Verilog simulator. Most of the speedup can be attributed to the fact that the simulation can be completely statically scheduled.

Comparing to results in related work, the speed of our simulators meets or exceeds the speed of similar simulation techniques. In the domain of ADLs, recent instruction set simulation results have been reported for ARM7, SPARC, and VLIW cores [7][8]. When we scale the reported MIPS results to our simulation host, the performance of the interpretive simulators is comparable. However, our compiled-code simulators are at least a factor of two faster than ADL-based compiled simulators (this may be a side-effect of our small kernels). When compared to the MIMOLA-based JACOB simulator [9], to which we are most closely related, we find that our simulators are an order of magnitude faster.

Our speedup can be attributed to a number of factors. First, since each operation is a statically-scheduled state-to-state transformation, we treat each as a basic block. We can then apply a number of compiler transformations such as copy propagation, dead code elimination, and inlining. Second, we do not need to decode operations; instead, we simply use each one as a label to jump to the corresponding optimized basic block. Using computed-gotos in the compiled-code simulator further improves the jump efficiency. Third, unlike JACOB [9], we do not depend on chaining primitive operations, but instead apply optimizations directly on the C code extracted from the action sections of the component descriptions and handle arbitrary bit-width types with the GNU MP library [10]. Finally, gcc is applied to the resulting simulator to efficiently map the code to the host.

7. Related Work

A number of approaches to retargetable simulation based on ADLs have been proposed. Frameworks such as FACILE [11], ISDL [12], and Sim-nML [13] are optimized for particular architectural families and cannot capture the range of architectures that we can. More flexible modeling that supports both interpretive and compiled-code simulation is provided by the LISA [7] and EXPRESSION [8] frameworks. All of these approaches require that the designer specify the instruction set, and thus are more suitable for modeling architectures where the instruction set is known. Although we have not applied the just-in-time techniques presented in [7], we have applied a number of optimizations including compiled-code techniques, static analysis, and compiler optimizations. The MIMOLA framework most closely resembles our approach to retargetable simulation. Interpretive and compiled-code simulators have been generated from structural MIMOLA descriptions [9]. Since the structure of both the data path and the control are specified in MIMOLA, hardware generation is straightforward. The key difference between our approach and MIMOLA lies in the fact that we do not require the control to be specified, and we extract instructions using SAT rather than BDDs [2].

8. Conclusion

Freeing the designer from concerns about the instruction set and control, while providing high-performance, automatically generated tools, greatly increases designer productivity. Our new design language obviates the need to specify the control and an instruction set, allowing designers to focus on the data path. From a description of a data path, we automatically extract the control and instruction set, generate bit-true cycle-accurate interpretive and compiled-code simulators, and generate synthesizable RTL Verilog. A simple horizontally-microcoded control scheme is also generated. Our results show that our simulators are one to two orders of magnitude faster than an equivalent NC-Verilog simulation of our generated Verilog. Furthermore, our simulators are up to an order of magnitude faster than existing simulators that use similar generation techniques.

9. References

[1] R. Leupers, P. Marwedel, "Retargetable Code Generation Based on Structural Processor Description." Design Automation for Embedded Systems, vol. 3, no. 1, Jan. 1998, pp. 1-36.
[2] R. Leupers, "Instruction-Set Extraction." In Retargetable Code Generation for Digital Signal Processors, Kluwer Academic Publishers, 1997, pp. 45-83.
[3] A. Hoffmann, H. Meyr, R. Leupers. Architecture Exploration for Embedded Processors with LISA. Kluwer Academic Publishers, 2002.
[4] A. Halambi, P. Grun, V. Ganesh, A. Khare, N. Dutt, A. Nicolau, "EXPRESSION: A Language for Architecture Exploration through Compiler/Simulator Retargetability." DATE 1999.
[5] A. Fauth, J. Van Praet, M. Freericks, "Describing Instruction Set Processors Using nML." ED&TC 1995.
[6] D.A. Patterson, J.L. Hennessy. Computer Organization & Design: The Hardware/Software Interface. Morgan Kaufmann, 1994.
[7] A. Nohl, G. Braun, O. Schliebusch, R. Leupers, H. Meyr, A. Hoffmann, "A Universal Technique for Fast and Flexible Instruction-Set Architecture Simulation." DAC 2002.
[8] M. Reshadi, N. Bansal, P. Mishra, N. Dutt, "An Efficient Retargetable Framework for Instruction-Set Simulation." CODES+ISSS 2003.
[9] R. Leupers, J. Elste, B. Landwehr, "Generation of Interpretive and Compiled Instruction Set Simulators." ASP-DAC 1999.
[10] GNU MP, http://www.swox.com/gmp.
[11] E. Schnarr, M. Hill, J.R. Larus, "Facile: A Language and Compiler for High-Performance Processor Simulators." PLDI 2001.
[12] G. Hadjiyiannis, S. Hanono, S. Devadas, "ISDL: An Instruction Set Description Language for Retargetability." DAC 1997.
[13] M. Hartoog, J.A. Rowson, P.D. Reddy, S. Desai, D.D. Dunlop, E.A. Harcourt, N. Khullar, "Generation of Software Tools from Processor Descriptions for Hardware/Software Codesign." DAC 1997.