Implementing a Stack-Based Processor with FPGAs

Michael Gschwind, Christian Mautner
fmike,[email protected]

Institut für Technische Informatik
Technische Universität Wien
Treitlstraße 3-182-2, A-1040 Wien, AUSTRIA
Tel: (+43)(1)58801 Ex. 8156, 8151
Fax: (+43)(1)56 96 97

Abstract

This paper describes the design of a stack-based CPU using field-programmable gate array technology. The architecture to be implemented was already defined by a compiler, which had been implemented previously. We describe what tools and strategies were used to implement different parts of the processor, as well as the final integration process.

Introduction

FPGAs offer a unique opportunity to prototype chip implementations. To study this option more closely, we have built a prototype board based on Xilinx FPGAs [Hub92] and conducted several implementation experiments. We implemented our first design, JAPROC, as part of the JAMIE project. JAPROC is a micro-controller upwardly compatible to the PIC16C57 [GJ92].

Field programmable gate arrays (FPGAs) can be used to allow fast implementation of chip designs [GAO92], [Gsc94]. This allows for a fast debug cycle, as designs can be altered and downloaded in a matter of hours. As FPGAs are pretested, only logic functionality has to be validated, which considerably reduces the time needed to get a workable implementation of a chip.

Since this has proved to be a remarkable success, we have started to use FPGAs in student projects for logic design courses (building circuits such as multipliers and dividers) and to build more complex designs, such as the stack-based microprocessor presented here. The advantage of this approach is that students do not have to deal with the electrical intricacies of silicon implementations or breadboarding. Also, the implementation cost is reduced dramatically.

Currently, we use ViewLogic to enter designs at the schematic/function block level. The ViewLogic environment offers a wide array of cell libraries, including macros modeling TTL parts and more advanced function blocks. Thus, students can concentrate on architecture design, without having to spend a large portion of their time designing low-level function blocks. The TTL emulation library allows students familiar with conventional board design to work much more efficiently, as their existing expertise in this design style can be applied.

Architecture

To maximize the understanding of the interaction of all levels of computer design (hardware, compilers, OS), we emphasize the integration of system design considerations in student designs. Thus, the architecture presented here was used earlier in a compiler construction class [BF92].

The processor implements a stack machine, with all operands being addressed relative to the top-of-stack pointer or a frame pointer (local-pointer) which is used to access local variables [Mau94]. The memory model is that of a Harvard architecture, i.e. separate data and program memories. Memory addresses, like all other data, are 16 bits wide. Thus the processor can address 2^16 = 64k words in each memory segment. The data memory is 16 bits wide, and the instruction memory is 24 bits wide. This allows each instruction word to encode a full 16-bit immediate constant or address.
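As a rough illustration of these word sizes, the following Python sketch packs an opcode and a 16-bit immediate into one 24-bit instruction word. The field layout (opcode in the upper 8 bits, immediate in the lower 16) and the opcode values are assumptions made for this example; the paper does not state the actual encoding.

```python
ADDR_BITS = 16                          # addresses and data are 16 bits wide
INSTR_BITS = 24                         # width of one instruction memory word
OPCODE_BITS = INSTR_BITS - ADDR_BITS    # 8 bits remain for the opcode (assumed layout)

WORDS_PER_SEGMENT = 1 << ADDR_BITS      # 2^16 addressable words per memory


def encode(opcode, immediate=0):
    """Pack an opcode and a 16-bit immediate into one 24-bit word."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode does not fit in the opcode field")
    if not 0 <= immediate < (1 << ADDR_BITS):
        raise ValueError("immediate does not fit in 16 bits")
    return (opcode << ADDR_BITS) | immediate


def decode(word):
    """Split a 24-bit instruction word into (opcode, immediate)."""
    return word >> ADDR_BITS, word & ((1 << ADDR_BITS) - 1)
```

Under this assumed layout every instruction occupies exactly one word, so a jump target or constant never needs a second fetch.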

Implementation

Control Unit

The stack machine was implemented using a finite state machine (FSM) controlling the data path (see figure 1). The finite state machine was modeled using a microprogram-like mnemonic representation (see figure 2). We decided to automatically generate the controller part from a high-level description. This had several advantages: The high-level description can be used as specification of the controller behavior. No inconsistencies can arise between the specification and the actual implementation, as the implementation is automatically generated from the specification.

Instruction     Description
NOP             no operation
PSHc const      push const
PSHl offset     push value at (FP + offset)
PSHli           pop offset, push value at (FP + offset)
STOl offset     pop value and store at (FP + offset)
STOli           pop offset, pop value and store at (FP + offset)
MVTc const      move top-pointer (SP) by const
MLT             local-pointer (FP) := top-pointer (SP)
PSL             push local-pointer (FP)
PPL             pop local-pointer (FP)
GET             read value from I/O port and push
PUT             write top stack element to I/O port
ADD             add
SUB             subtract
MUL             multiply
SWP             swap the top elements
JMP address     jump
JPE address     jump on equal
JPG address     jump on greater-than
JSR address     jump to subroutine: push PC+1, jump
RET             return: pop return address and jump
STP             stop execution

Table 1: Instruction set architecture
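To make the table concrete, here is a small Python sketch that interprets a subset of these instructions. Several details are assumptions, not taken from the paper: the stack is assumed to grow upward with SP pointing at the top element, SUB is assumed to compute (second element - top element), results wrap to 16 bits, and the function name `run` and the tuple-based program format are hypothetical.

```python
MASK = 0xFFFF  # data words are 16 bits wide


def run(program, mem_size=64):
    """Run a list of (mnemonic, operand) pairs; return (data memory, SP)."""
    mem = [0] * mem_size   # data memory, separate from the program
    sp = -1                # top-pointer; assumed to grow upward
    fp = 0                 # local-pointer (frame pointer)
    pc = 0

    def push(value):
        nonlocal sp
        sp += 1
        mem[sp] = value & MASK

    def pop():
        nonlocal sp
        value = mem[sp]
        sp -= 1
        return value

    while True:
        op, arg = program[pc]
        pc += 1
        if op == "NOP":
            pass
        elif op == "PSHc":
            push(arg)
        elif op == "PSHl":
            push(mem[fp + arg])
        elif op == "STOl":
            mem[fp + arg] = pop()
        elif op == "MVTc":
            sp += arg
        elif op in ("ADD", "SUB", "MUL"):
            b = pop()          # top element
            a = pop()          # second element
            if op == "ADD":
                push(a + b)
            elif op == "SUB":
                push(a - b)    # operand order is an assumption
            else:
                push(a * b)
        elif op == "SWP":
            b = pop()
            a = pop()
            push(b)
            push(a)
        elif op == "JMP":
            pc = arg
        elif op == "STP":
            return mem, sp
        else:
            raise ValueError("instruction not covered by this sketch: " + op)


# (3 + 4) * 5 evaluated on the stack
EXPRESSION = [("PSHc", 3), ("PSHc", 4), ("ADD", None),
              ("PSHc", 5), ("MUL", None), ("STP", None)]
```

Calling `run(EXPRESSION)` leaves the result in the memory word addressed by the returned SP. The I/O, subroutine, and conditional-jump instructions are omitted because their exact semantics are not spelled out in the table.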

[Figure 1 (block diagram): the FSM with instruction decode drives the control signals; the ALU (inputs a and b), temp register, frame pointer, PC, stack pointer, instruction-memory data and address buffers, and data-memory address and data buffers are connected by an internal bus and attached to the system bus.]

Figure 1: Block level diagram of stack machine

    LABEL(ADD)
    COM( top_to_dmemadd | dmemadd_le | dmem_rd | dbus_to_alu_a )
    COM( top_direct_down | top_clk_en )
    COM( top_to_dmemadd | dmemadd_le | dmem_rd | dbus_to_alu_b )
    COM( top_to_dmemadd | dmemadd_le | dmem_wr | alu_e_to_dbus | \
         alu_cntrl_add | pc_inc | fetch )

Figure 2: FSM code for adding the top two stack elements

If the specification changes, a new implementation can be generated with little effort, whereas a manual translation process such as used for JAPROC requires a complete re-design of the controller.

To describe the design, we used a simple language with two primitives, one to define the output signals to be generated in a particular state, a second primitive to symbolically name state numbers. The language primitives have C macros associated with them, so that the formal specification of the FSM can be executed. By executing the specification, a bit stream is generated which describes the control unit [Gsc93].

To implement the control unit, we use a ROM storing all state transitions and control signals. This ROM is implemented using the Xilinx memgen tool [Xil92], which allows automatic generation of ROM- and RAM-like structures for FPGAs. The bit stream generated by the formal specification is used to initialize this ROM.
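The idea of an executable specification can be sketched in Python rather than with the C macros the authors used. The two functions below mirror the LABEL and COM primitives and build a list of control words when the specification is run. The signal names are taken from figure 2, but the bit positions, the word format, and the helper `rom_image` are assumptions for illustration; the real bit stream also carries state-transition information, which is left out here.

```python
# Control signals appearing in figure 2; the bit order is an assumption.
SIGNALS = [
    "top_to_dmemadd", "dmemadd_le", "dmem_rd", "dmem_wr",
    "dbus_to_alu_a", "dbus_to_alu_b", "alu_e_to_dbus", "alu_cntrl_add",
    "top_direct_down", "top_clk_en", "pc_inc", "fetch",
]
BIT = {name: 1 << position for position, name in enumerate(SIGNALS)}

control_rom = []   # one control word per state, indexed by state number
labels = {}        # symbolic name -> number of the first state of a sequence


def LABEL(name):
    """Give the next state to be emitted a symbolic name."""
    labels[name] = len(control_rom)


def COM(*signals):
    """Emit one state that asserts the given control signals."""
    word = 0
    for signal in signals:
        word |= BIT[signal]
    control_rom.append(word)


# The ADD sequence from figure 2, transcribed into this sketch.
LABEL("ADD")
COM("top_to_dmemadd", "dmemadd_le", "dmem_rd", "dbus_to_alu_a")
COM("top_direct_down", "top_clk_en")
COM("top_to_dmemadd", "dmemadd_le", "dmem_rd", "dbus_to_alu_b")
COM("top_to_dmemadd", "dmemadd_le", "dmem_wr", "alu_e_to_dbus",
    "alu_cntrl_add", "pc_inc", "fetch")


def rom_image():
    """Return the ROM contents as fixed-width binary strings."""
    width = len(SIGNALS)
    return [format(word, "0{}b".format(width)) for word in control_rom]
```

Because the specification is ordinary code, changing an instruction means editing its COM lines and re-running the script, which is the property the text contrasts with the hand-derived JAPROC controller.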

For JAPROC, we used espresso-optimized random logic to generate the FSM controlling the data path. As a result, the complete state machine had to be specified as a set of boolean equations, and changes to the original control structure were much harder to achieve.

Due to the simplicity of the instruction set we implemented, each instruction is implemented by a linear sequence of two to eight states. Each state has exactly one successor state. The only time when control is not transferred to a well-defined next state is during the decoding stage of a new macro-instruction: to decode a macro-instruction, the opcode is fed to a decoder (also implemented as ROM and automatically generated from the same executable specification) which decodes an instruction by setting the controller state to the beginning of the state sequence which implements the macro-instruction.
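The sequencing scheme described above can be modeled in a few lines of Python. This is only a sketch under stated assumptions: states are numbered consecutively, the decoder is keyed by mnemonic instead of a numeric opcode, and the two-state length given to NOP is hypothetical (only the four-state ADD sequence is documented in figure 2). The function names `assemble` and `trace` are illustrative.

```python
def assemble(sequences):
    """Build next-state and decoder tables.

    sequences maps a mnemonic to the number of states in its linear sequence.
    next_state[s] is the single successor of state s, or None when s is the
    last state of an instruction and the decoder picks the following state.
    """
    next_state = []
    decode_rom = {}
    for mnemonic, length in sequences.items():
        if not 2 <= length <= 8:
            raise ValueError("sequences are two to eight states long")
        start = len(next_state)
        decode_rom[mnemonic] = start                      # entry point
        next_state.extend(range(start + 1, start + length))
        next_state.append(None)                           # hand over to decoder
    return next_state, decode_rom


def trace(instructions, next_state, decode_rom):
    """List the controller states visited while executing the instructions."""
    visited = []
    for mnemonic in instructions:
        state = decode_rom[mnemonic]      # decode: jump to start of sequence
        while state is not None:
            visited.append(state)
            state = next_state[state]     # exactly one successor per state
    return visited


NEXT_STATE, DECODE_ROM = assemble({"NOP": 2, "ADD": 4})
```

For example, `trace(["ADD", "NOP"], NEXT_STATE, DECODE_ROM)` walks the four ADD states and then the two NOP states, with the decoder consulted only at the instruction boundary.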

Data Path

The design of the data path was straightforward, using Xilinx-supplied macros (soft- and hard-macros), the TTL emulation library and our own generic bit-slice ALU. Integration of the design was seamless, but the usage of hard macros and of multiple XNF modules complicated things somewhat: to generate an FPGA description which can be simulated, the design has to be translated first to the XNF level, where all XNF modules are merged. The merged design is then translated to the LCA level, where hard macros can be integrated. Then the whole translation process is reversed to generate a VSM-type file for simulation. This lengthy translation process exposed a number of interfacing bugs in the Xilinx software and between the Xilinx and ViewLogic environments which have to this date not been resolved.

The simulation was largely successful, but exhibited occasional unexpected behavior, like erroneous incrementing of the PC; this was tracked down to hazards in the automatically generated ROM. The control signals had been stabilized by latching the current state, allowing hazards to propagate to all functional units in the data path. By latching the control signals of the current state instead, these hazards were masked out.

After this final verification, the original compiler was adapted to reflect the changes made to the architecture at the beginning phase of the project. Thus, a fully functional microprocessor environment was available, including a compiler and a hardware prototype, implemented on one Xilinx XC4006 FPGA.

Results and Experiments

We simulated a whole system by integrating this CPU design in a ViewLogic schematic which also contains instruction and data memories, and all the necessary glue logic. This board-level design was then simulated using ViewSim (footnote 1).

Footnote 1: One problem we encountered with this approach was that the ViewLogic tools try to regenerate all description files from the schematic level, but have a rather simplistic view of the process involved in merging a complex FPGA design.

Simulation shows that the CPU designed here will run at 12.5 MHz, and that the processor speed is limited by the memory subsystem. The circuit itself could operate at a much higher clock rate. The XC4006 FPGA showed 100% utilization of CLBs, with the large majority of flip-flops left unused:

    Partitioned Design Utilization Using Part 4006pg156-5 (Note 3)

                                      Used   Max   Util
    Occupied CLBs (Note 4)             256   256   100%
    Packed CLBs (Note 4)               254   256    99%
    Package Pins (Note 5)               73   128    57%
    FG Function Generators             508   512    99%
    H Function Generators              167   256    65%
    Flip Flops (Note 5)                131   768    17%
    Memory Write Controls                0   256     0%
    3-State Buffers                    176   640    27%
    3-State Buffer Output Lines         64    64   100%
    Address Decoders                     0   192     0%
    Address Decoder Output Lines         0    32     0%
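The utilization column can be reproduced from the Used and Max figures, which also makes the flip-flop imbalance explicit. In the Python sketch below, the use of truncating integer division is an inference from the reported numbers (176 of 640 is listed as 27%, not 28%), not documented behavior of the Xilinx tools; the dictionary and function names are illustrative.

```python
# (used, max) pairs copied from the utilization report above.
REPORT = {
    "Occupied CLBs": (256, 256),
    "Packed CLBs": (254, 256),
    "Package Pins": (73, 128),
    "FG Function Generators": (508, 512),
    "H Function Generators": (167, 256),
    "Flip Flops": (131, 768),
    "3-State Buffers": (176, 640),
    "3-State Buffer Output Lines": (64, 64),
}


def utilization(used, maximum):
    """Percentage in whole percent, truncated (assumed to match the tool)."""
    return used * 100 // maximum


def unused(resource):
    """Number of units of a resource that the design leaves idle."""
    used, maximum = REPORT[resource]
    return maximum - used


for name, (used, maximum) in REPORT.items():
    print("{:<30} {:>3}%".format(name, utilization(used, maximum)))
```

Here `unused("Flip Flops")` evaluates to 637, while no CLB is left unoccupied, which is the mismatch the text points out.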

It is interesting to note that more than a third of the available CLBs were used for implementing the two ROMs used in the control unit. For larger designs, CLB-based FPGAs should probably not be used to implement large look-up tables. An alternative is to use dedicated parts for memory-type resources, as described in [KNZB93].

We tried to optimize our design with the ASYL netlist optimizer (ASYL versions 2.4.3 and 3.0.4) [INP93] available to us from EuroChip, but the output generated by this tool could not be processed by the ppr partitioner.

Related Work

Intel Corp. used 14 Xilinx-based Quickturn RPMs (footnote 2) to fully simulate its current top-of-the-line Pentium(TM) microprocessor as part of the Pentium(TM) pre-silicon validation process [KNZB93]. The simulated Pentium(TM) microprocessor achieved an emulation speed of 300 kHz and booted all major operating systems for Intel's x86 processor family.

Conclusion and Future work

We have shown that FPGAs are a useful tool for CPU prototyping. We are currently embarking on a project to model the MIPS R3000 CPU using FPGAs as target technology and VHDL for design specification. This design will be targeted towards an enhanced board featuring multiple Xilinx FPGAs and local static RAM.

Footnote 2: Each Quickturn RPM contains approximately 50k gates using Xilinx FPGAs.

Acknowledgement

We wish to thank Alexander Jaud for his help with the ViewLogic and the Xilinx design environments.

References

[BF92] Manfred Brockhaus and Andreas Falkner. Übersetzerbau. Vorlesungsskriptum, TU Wien, 1992.

[GAO92] T. Gal, K. Agusa, and Y. Ohno. Educational purpose microprocessors implemented with user-programmable gate arrays. In Proc. of the 2nd International Workshop on Field-Programmable Logic and Applications, Vienna, Austria, August 1992.

[GJ92] Herbert Grünbacher and Alexander Jaud. JAPROC - an 8 bit microcontroller and its test environment. In Proc. of the 2nd International Workshop on Field-Programmable Logic and Applications, Vienna, Austria, August 1992.

[Gsc93] Michael Gschwind. Automatic generation of finite state machines for data path control. Technical report, TU Wien, 1993.

[Gsc94] Michael Gschwind. Reprogrammable hardware for educational purposes. In Proc. of the 25th ACM SIGCSE Symposium, Phoenix, AZ, March 1994. ACM.

[Hub92] Ernst Huber. Eine Einsteckkarte für den IBM-PC/AT zur Programmierung von Xilinx FPGAs. Diplomarbeit, Institut für Technische Informatik, Technische Universität Wien, Vienna, Austria, September 1992.

[INP93] INPG. ASYL - A Multi-Target Synthesis System. Institut National Polytechnique de Grenoble, Grenoble, France, 1993.

[KNZB93] Wern-Yan Koe, Harish Nayak, Nazar Zaidi, and Azam Barkatullah. Presilicon validation of Pentium CPU. In Hot Chips V - Symposium Record, Palo Alto, CA, August 1993. TC on Microprocessors and Microcomputers of the IEEE Computer Society.

[Mau94] Christian Mautner. Entwurf eines Stackprozessors als Konfiguration eines Xilinx FPGA Serie 4000. Technical report, Institut für Technische Informatik, Technische Universität Wien, April 1994.

[Xil92] Xilinx. XACT Reference Guide. Xilinx, October 1992.
