An IEEE 754 Floating Point Engine designed with an Electronic System Level Methodology
V. A. Chouliaras
Department of Electronic and Electrical Engineering, Loughborough University, Loughborough, Leicestershire, UK
[email protected]
Jose Luis Nunez-Yanez
Department of Electrical and Electronic Engineering, University of Bristol, Bristol, UK
[email protected]
Abstract: This paper presents the design and implementation of an IEEE 754-compliant, single-precision, coarse-level pipelined floating point engine designed using a new Electronic System Level (ESL) tool. The starting point was the Softfloat ANSI-C implementation of the standard. Minimal modifications, in the form of two parallel processes, were introduced to map the ANSI-C code to the Input/Output interface of a Cardbus-based Field-Programmable Gate Array (FPGA) prototyping board. The remaining code was synthesized directly to gates using Cebatech’s C2R compiler, and a number of configurations were studied with various degrees of functional-unit sharing and capabilities. Preliminary results clearly identify the potential of next-generation ESL tools in transforming control-oriented applications written in ANSI-C directly to synthesizable Verilog in a short time. At the same time, the generated silicon exhibits up to two orders of magnitude better cycle performance than the Softfloat code executing on a simulated in-order superscalar processor, making a strong case for the suitability of the ESL tool for the silicon synthesis of control-dominated applications. The design was validated both at the C level and on an E14 PicoComputing FPGA board, where it achieved an operating frequency of 64 MHz.
I. INTRODUCTION
Current and near-future silicon technology nodes allow designers to integrate complete systems on a single chip. A typical such system consists of a multitude of processing engines based, as a minimum, on scalar 32-bit CPUs, augmented with data-parallel infrastructure in the form of standalone coprocessors or instruction set architecture extensions [1], connected via high-bandwidth point-to-point links and incorporating possibly hundreds of local memories for data streaming. This ever-increasing circuit complexity, combined with tight design deadlines driven by time-to-market imperatives, means that designers are almost always under great pressure and frequently deliver a sub-optimal System-on-Chip (SoC) architecture. Another reason for the sub-optimality of the resulting SoC is the increasing productivity gap: the chip complexity that can be handled by current design teams falls well short of the possibilities offered by recent technological advances. As a result, architecture, microarchitecture and software space exploration is limited in most instances by the design abstraction gap, manifested as the hardware-software divide.
The electronics industry has long realized that critical decisions must be made long before development teams engage in the hardware and software design of new System-on-Chip products. In recent years, it has become clear that hardware-software co-design and verification must form part of a single, unified effort, whereas the established design methodologies exist primarily to assist hardware-only or software-only development. That these tools are no longer adequate for modern SoC designs is evidenced by the recent emergence of new concepts that are disrupting the traditional design flow; these include system-level specification (specification capture), functional and architectural analysis, high-level estimation, partitioning and, most importantly, Electronic System Level (ESL) synthesis flows.
This work is a case study in applying the latter form of high-level tools to designing, from first principles, a coarse-level pipelined floating-point engine for SoC applications. It should be read in conjunction with [2], which details the design and implementation of a transmission-line-modeling (TLM) accelerator designed solely with ESL methodologies. In particular, this work studies the Floating Point Engine (FPE) implemented in [2] but this time chooses a different flow, based on the Cebatech C2R compiler [3], as the specification/implementation medium.
II. REVIEW OF ESL TOOLS
Commercial vendors and research institutions in the area of Electronic System Level (ESL) design strive to fulfill the promise of fast, silicon-efficient specification and implementation of integrated systems by offering hierarchy-aware, streaming, high-level-language (HLL)-based tools to complement or replace the established modeling and design methodologies. Though primarily perceived as a system specification and modeling medium, SystemC had early adopters as a synthesis input and integrated systems design vehicle ([4], now discontinued). Other vendors followed [5], resulting in tools that either require explicit cycle and resource designation or are capable of probing the microarchitectural space autonomously [6]. Various flavors of C/C++-based input languages have been proposed as the specification and implementation medium for such systems, resulting in a multitude of solutions: tools offering very good support for data streaming [7], at the expense however of a restricted input C subset; C subsets extended with CSP-derived parallel constructs [8]; and full UML-based VLSI synthesis environments [9]. Other interesting efforts are the implicitly parallel Mitrion-C [10], which assumes an underlying parallel hardware architecture, and the PICO platform [11], a VLIW/hardwired engine combination originally conceived at the Hewlett Packard Laboratories. Finally, in terms of offering reasonably good support for a canonical C/C++ application, there are the examples of the SPARK compiler [12] (now discontinued), the Trident Floating Point Compiler [13], Catapult [14] and C2R [3].
This work consolidates experience gained during the high-level design of a full floating point engine directly from an open-source IEEE 754 implementation in ANSI-C, making use of the new Cebatech C2R compiler. Various configurations have been implemented and conclusions drawn as to the suitability of a new breed of ESL tools for rapid VLSI systems design.
III. IEEE754 ENGINE
The test case for C2R was the Softfloat (release 2b) IEEE 754 ANSI-C implementation [15]. This is a full implementation of the floating point standard and has been used to validate the correctness of IEEE 754 FP units, either standalone [16] or as implemented in high-performance desktop processors. Starting with Softfloat, its source code was instrumented with multiple pre-processor directives (switches) to enable various microarchitectural features such as resource (function-level) sharing, address space mapping (for the target FPGA board), and coarse-grained control of the cycles allocated per floating point operation. Beyond this instrumentation, no source code modifications were necessary, as C2R successfully parsed the reference code. This is in contrast to a similar floating point engine, designed for a TLM accelerator, which used another commercial SystemC compiler and required both source code adaptation for successful parsing and explicit cycle designation via wait() statements. It is understood, however, that substantially more performance could be extracted should additional modifications (such as fully enforcing a single-assignment regime at the C level) be applied at source level.
IV. PERFORMANCE RESULTS
Overall, eight configurations of the Softfloat-based FPE were studied: four with fully arbitrated (shared) functions and four with major functions inlined. These configurations are named Hier1 through Hier4 (shared functions) and Inline1 through Inline4 (inlined functions). The supported single-precision operations and the naming convention are given in Table I.
TABLE I. FPENGINE CONFIGURATIONS (SUPPORTED FLOATING POINT OPERATIONS)
| Config | SPADD | SPSUB | SPMUL | SPCMP | Conversions |
|---|---|---|---|---|---|
| **Hier1** | Y | Y | - | - | - |
| Hier2 | Y | Y | Y | - | - |
| Hier3 | Y | Y | Y | Y | - |
| **Hier4** | Y | Y | Y | Y | Y |
| **Inline1** | Y | Y | - | - | - |
| Inline2 | Y | Y | Y | - | - |
| Inline3 | Y | Y | Y | Y | - |
| **Inline4** | Y | Y | Y | Y | Y |
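The configurations of Table I were obtained through pre-processor switches in the Softfloat source (Section III). The sketch below illustrates one way such switches might be organized. The macro names FPE_CONFIG, FPE_HAS_MUL, FPE_HAS_CMP and FPE_HAS_CONV, the enumeration fpe_op and the helper fpe_supports() are hypothetical; they are not taken from the original source code or from the C2R tool.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical configuration level 1..4, mirroring the rows of Table I
 * (e.g. level 2 would correspond to Hier2 or Inline2). */
#ifndef FPE_CONFIG
#define FPE_CONFIG 4
#endif

/* Every level supports SPADD and SPSUB; higher levels add operations. */
#define FPE_HAS_MUL  (FPE_CONFIG >= 2)
#define FPE_HAS_CMP  (FPE_CONFIG >= 3)
#define FPE_HAS_CONV (FPE_CONFIG >= 4)

/* Operation classes, named after the columns of Table I. */
enum fpe_op { OP_SPADD, OP_SPSUB, OP_SPMUL, OP_SPCMP, OP_CONV };

/* Run-time counterpart of the macros above: returns true when the
 * configuration at `level` (1..4) supports operation class `op`. */
static bool fpe_supports(int level, enum fpe_op op)
{
    assert(level >= 1 && level <= 4);
    switch (op) {
    case OP_SPADD:
    case OP_SPSUB: return true;
    case OP_SPMUL: return level >= 2;
    case OP_SPCMP: return level >= 3;
    case OP_CONV:  return level >= 4;
    }
    return false;
}
```

Under these assumptions, compiling with -DFPE_CONFIG=2 would select the operation set of the Hier2 and Inline2 rows; whether functions are shared or inlined would need a separate switch, which is not modelled here.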
Out of the total of eight, four configurations (Inline1, Inline4, Hier1 and Hier4; shown in bold in Table I) were chosen for further study in order to probe the microarchitecture (cycle performance) and silicon implementation space for both FPGA and standard-cell target technologies. In all cases, typical parsing/compilation times were of the order of a few minutes on a low-performance Linux-based laptop computer; the output of each C2R compilation run was a Verilog netlist, one per configuration, ready to be synthesized to gates via established (RTL-based) tool flows.
TABLE II. CYCLE PERFORMANCE (FSM-BASED FP, ALL CONFIGURATIONS)
| Design | SPADD | SPSUB | SPMUL | SPCMP EQ/LEQ/LT | INT2FLOAT/FLOAT2INT |
|---|---|---|---|---|---|
| Inline1 | 7 | 8 | - | - | - |
| Inline4 | 7 | 8 | 9 | 1/1/1 | 2/1 |
| Hier1 | 12 | 13 | - | - | - |
| Hier4 | 12 | 13 | 12 | 2/1/2 | 2/1 |
| InOrderSS (IPC) | 572 (0.549) | 580 (0.549) | 503 (0.55) | 287 (0.552) / 260 (0.531) / 221 (0.555) | 447 (0.55) / 288 (0.55) |
| OutOrderSS (IPC) | 530 (1.261) | 548 (1.258) | 477 (1.273) | 303 (1.294) / 274 (1.291) / 233 (1.304) | 455 (1.28) / 268 (1.30) |
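As a worked check of the figures in Table II, the following sketch computes the software-to-hardware cycle ratio and the cycle reduction obtained by inlining. The cycle counts are copied from the table; the structure and function names (cycles, cycle_ratio, percent_reduction) are illustrative only. For single-precision addition the ratio between InOrderSS and Inline4 is about 82, and inlining reduces the Hier4 add latency by roughly 42%.

```c
#include <assert.h>

/* Cycle counts copied from Table II (SPADD, SPSUB and SPMUL columns). */
struct cycles {
    const char *name;
    int spadd, spsub, spmul;
};

static const struct cycles inline4   = { "Inline4",     7,   8,   9 };
static const struct cycles hier4     = { "Hier4",      12,  13,  12 };
static const struct cycles inorderss = { "InOrderSS", 572, 580, 503 };

/* Ratio of software cycles to hardware cycles for one operation. */
static double cycle_ratio(int sw_cycles, int hw_cycles)
{
    assert(hw_cycles > 0);
    return (double)sw_cycles / (double)hw_cycles;
}

/* Percentage reduction in cycle count when going from `before` to `after`. */
static double percent_reduction(int before, int after)
{
    assert(before > 0);
    return 100.0 * (double)(before - after) / (double)before;
}
```

For example, cycle_ratio(inorderss.spadd, inline4.spadd) evaluates 572/7, and percent_reduction(hier4.spadd, inline4.spadd) evaluates the drop from 12 to 7 cycles.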
Table II shows the cycle performance of the four chosen ESL-designed configurations as well as two CPU-based configurations (InOrderSS and OutOrderSS, for 8-wide in-order and out-of-order superscalar microarchitectures respectively). These were evaluated for all their supported floating point operations. From the table, a non-pipelined, fully IEEE 754-compliant add operation (with user-selectable rounding modes) takes 7 cycles in a fully inlined configuration or 12 cycles in the case of arbitrated (shared) functions. Similar observations can be made for the remaining operations. As can be seen, function inlining reduces the cycle count by up to 50% for simpler functions (floating-point comparisons) and by approximately 40% for more complicated functions such as addition and subtraction. These results are encouraging; function inlining, however, leads to larger implementations with substantially more state (registers), as every non-arbitrated (non-shared) function is 'instantiated' at the calling site.
A very interesting aspect of these results is the substantial cycle-performance improvement when moving from a software implementation (Softfloat executing on the sim-outorder simulator of the SimpleScalar toolset [17]) to a full ESL solution. Though in reality an advanced CPU would utilize a pipelined floating point unit to achieve near-1 IPC for floating-point operations, the results clearly demonstrate the maturity of the C2R compiler when translating control-oriented applications into custom silicon.
Figure 3. FLI-based Validation Environment (block diagram of the hybrid VHDL/Foreign-Language-Interface (FLI) testbench; labeled blocks: FLI-based stimulus process, I_FPE2:FPE2 with Clk and reset, vectors, response, Good_result, Pipe_equalizer, Check_process)
V. SILICON IMPLEMENTATION
The four configurations were implemented both on a high-performance 0.13 um 1-poly, 8-copper silicon process from TSMC and on the Xilinx Virtex4-FX60-based E14 board from PicoComputing. The programmer’s model and supported operations of the four chosen configurations are depicted in Figs. 1 and 2 respectively.
Figure 1. Programmer's model (memory-mapped registers CMD_R, OPR1_R, OPR2_R, OPR3_R, RES_R and BUSY_R with their read/write access, at addresses 0x10b00000, 0x10b00004, 0x10b00008, 0x10b0000C, 0x10b00010, 0x10b00014 and 0x10b00018)
| Command | Value |
|---|---|
| SP_ADD | 0 |
| SP_SUB | 1 |
| SP_MUL | 2 |
| SP_CMPE | 5 |
| SP_CMPLE | 6 |
| SP_CMPLT | 7 |
| INT_TO_SP | 8 |
| SP_TO_INT | 9 |
Figure 2. Operations
A number of registers were specified, including three 32-bit operand registers, a command register, a result register and a single-bit busy register. This infrastructure allowed for easy integration of the FPE into the FPGA board, which in turn facilitated the at-speed validation (on real silicon) of the designs. Due to the disruptive nature of the chosen implementation flow, thorough RT-level verification was important to ensure that the C2R compiler correctly transformed the Softfloat source code into an equivalent Verilog netlist. Validation of the designs took place in a hybrid VHDL/Foreign-Language-Interface (FLI)-based environment, as depicted in Fig. 3.
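The programmer's model described above (operand, command, result and busy registers, with the command encodings of Fig. 2) suggests a simple host-side access sequence: write the operands, write the command, poll the busy flag and read the result. The following is a minimal sketch of that sequence under several assumptions that do not come from the paper: the field order of struct fpe_regs, the convention that writing the command starts the operation, the encoding of a true comparison as 1, and the software stand-in fpe_model(), which uses host float arithmetic (assumed to be IEEE 754 binary32 with round-to-nearest) in place of the real engine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Command encodings as listed in Fig. 2. */
enum fpe_cmd {
    SP_ADD = 0, SP_SUB = 1, SP_MUL = 2,
    SP_CMPE = 5, SP_CMPLE = 6, SP_CMPLT = 7,
    INT_TO_SP = 8, SP_TO_INT = 9
};

/* Hypothetical view of the register block. The field order is an
 * assumption; on real hardware the pointer would be volatile and
 * mapped onto the board's address space. */
struct fpe_regs {
    uint32_t cmd, opr1, opr2, opr3, res, busy;
};

/* Bit-exact conversions between float and its 32-bit pattern. */
static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float u2f(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Software stand-in for the engine; only a subset of commands is modelled. */
static void fpe_model(struct fpe_regs *r)
{
    float a = u2f(r->opr1), b = u2f(r->opr2);
    switch (r->cmd) {
    case SP_ADD:   r->res = f2u(a + b);        break;
    case SP_SUB:   r->res = f2u(a - b);        break;
    case SP_MUL:   r->res = f2u(a * b);        break;
    case SP_CMPE:  r->res = (uint32_t)(a == b); break;
    case SP_CMPLE: r->res = (uint32_t)(a <= b); break;
    case SP_CMPLT: r->res = (uint32_t)(a < b);  break;
    default:       r->res = 0;                  break; /* conversions not modelled */
    }
    r->busy = 0;
}

/* Issue one two-operand command and wait for its result. */
static uint32_t fpe_execute(struct fpe_regs *r, enum fpe_cmd cmd,
                            uint32_t a, uint32_t b)
{
    assert(r != NULL);
    r->opr1 = a;
    r->opr2 = b;
    r->cmd = (uint32_t)cmd;
    r->busy = 1;
    fpe_model(r);          /* stand-in only: real hardware clears busy itself */
    while (r->busy) { }    /* poll the busy flag */
    return r->res;
}
```

A call such as fpe_execute(&regs, SP_ADD, f2u(1.5f), f2u(2.25f)) returns the bit pattern of the sum in this model.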
Figure 4. VLSI Layouts
The FLI code included the full implementation of the Softfloat reference code along with a random floating-point number generator. This code was instantiated within a procedure and produced random floating point vectors, per supported command, as well as the correct response. The latter was pipelined (held) up to the point where the C2R-generated Verilog entity (I_FPE2) had computed its result. At that point, the good_result pre-computed by the FLI code was compared to the response produced by the FPE2. Any disagreement would cause the testbench to fail so that the failing vectors could be investigated. It is interesting to note that no such failures were observed, either in simulation (Modelsim environment) or on real hardware (the E14 Virtex4 FX60 FPGA), thus increasing confidence in the C2R tool.
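The checking scheme described above can be sketched in plain C. This is only an illustration of the compare-against-reference idea, not the authors' FLI testbench: host float addition stands in for Softfloat, the device under test is an ordinary function, and the names xorshift32, ref_add, faulty_dut_add and run_campaign are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A two-operand operation on raw 32-bit floating point patterns. */
typedef uint32_t (*fp_binop)(uint32_t a, uint32_t b);

/* Small xorshift generator producing raw 32-bit patterns, so that
 * NaNs, infinities and denormals also appear as operands. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Stand-in reference model (the paper uses Softfloat in this role). */
static uint32_t ref_add(uint32_t a, uint32_t b)
{
    float fa, fb, fr;
    uint32_t r;
    memcpy(&fa, &a, sizeof fa);
    memcpy(&fb, &b, sizeof fb);
    fr = fa + fb;
    memcpy(&r, &fr, sizeof r);
    return r;
}

/* Deliberately broken device: flips the least significant result bit. */
static uint32_t faulty_dut_add(uint32_t a, uint32_t b)
{
    return ref_add(a, b) ^ 1u;
}

/* Apply n_vectors random operand pairs and count the disagreements
 * between the reference and the device under test. */
static unsigned run_campaign(fp_binop reference, fp_binop dut,
                             unsigned n_vectors, uint32_t seed)
{
    unsigned mismatches = 0;
    uint32_t state = seed ? seed : 1u;  /* xorshift state must be non-zero */

    assert(reference != NULL && dut != NULL);
    for (unsigned i = 0; i < n_vectors; i++) {
        uint32_t a = xorshift32(&state);
        uint32_t b = xorshift32(&state);
        uint32_t good_result = reference(a, b); /* held until the DUT responds */
        uint32_t response = dut(a, b);
        if (response != good_result)
            mismatches++;
    }
    return mismatches;
}
```

In this sequential model the role of the Pipe_equalizer is played simply by keeping good_result in a local variable until the response is available; in the real testbench the delay spans the cycles the engine needs to compute its result.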
The first implementation campaign targeted the TSMC 0.13 um standard cell process. Synthesis was carried out top-down using Synopsys Design Compiler (compile_ultra switch) with a conservative wireload (tsmc13_wl50). Following the logical synthesis step, the four netlists for the four configurations, along with their associated constraint files, were read into Cadence SoC Encounter, where placement, clock-tree synthesis, routing and IPO took place. Finally, power was measured (statistical power) on the post-route netlists. The standard-cell implementation campaign targeted 200 MHz as a reasonable post-route frequency; results are tabulated in Table III and the resulting VLSI cells (with their relative area indicated) are shown in Fig. 4.
TABLE III. VLSI SYNTHESIS CAMPAIGN (VLSI CELLS, FSM-BASED FP, ALL CONFIGURATIONS)
| Cell | Fmax (MHz) | Gates | Dimensions (um) | Std cell rows | Statistical power (mW) |
|---|---|---|---|---|---|
| Inline1 | 199.2 | 118469 | 926 x 1670 | 405 | 52.14 |
| Inline4 | 201.6 | 199886 | 1142 x 2086 | 517 | 87.66 |
| Hier1 | 137.5 | 67386 | 718 x 1249 | 290 | 26.91 |
| Hier4 | 203.4 | 84617 | 827 x 1474 | 349 | 38.85 |
The second implementation campaign targeted the E14 Cardbus-based FPGA board. In this case, the Verilog netlists were encapsulated in a further logical hierarchy, which implemented the interface to the Cardbus host (a Linux x86 system), and were then taken through a fully scripted flow. The initial clock target was 33 MHz (the Cardbus frequency); however, the largest configuration (Inline4) was also successfully synthesized for the same FPGA device at a clock speed of 64 MHz. Results from the FPGA campaign are shown in Table VI.
VI. FPGA RESULTS
| Cell | Map Flops | Map LUTs | Equiv gate count |
|---|---|---|---|
| Inline1 | 9002/50560 (17%) | 24938/50560 (48%) | 933340 |
| Inline4 | 11528/50560 (22%) | 31592/50560 (62%) | 1003831 |
| Hier1 | 5936/50560 (11%) | 11723/50560 (23%) | 820916 |
| Hier4 | 7807/50560 (15%) | 15516/50560 (30%) | 862292 |
VII. CONCLUSIONS AND FUTURE WORK
A coarse-level pipelined floating point engine was designed and implemented using a fully ANSI-C-based methodology and a novel ESL compiler. Performance and silicon area/frequency/power results are very encouraging, while at the same time the time to market was reduced substantially, from an estimated number of weeks to a few days, by virtue of adopting a C-based specification and implementation flow. Results have clearly demonstrated the superiority, in cycle performance, of ESL-designed silicon for control-dominated applications compared to a pure software implementation. Further research is underway to design high-capacity speech coders from algorithmic source. It is anticipated that ESL-based toolflows and methodologies will play a very important role in the design of next-generation integrated systems.
ACKNOWLEDGMENT
The authors acknowledge the support of Cebatech Incorporated in providing early access to the C2R compiler.
REFERENCES
[1] V. A. Chouliaras, V. M. Dwyer, S. Agha, J. L. Nunez-Yanez, D. Reisis, K. Nakos and K. Manolopoulos, “Customization of an embedded RISC CPU with SIMD extensions for video encoding: A case study”, accepted for publication in Elsevier Press Integration: The VLSI Journal
[2] V. A. Chouliaras, J. A. Flint, Y. Li, “A Transmission Line Modelling VLSI processor designed with a novel Electronic System Level Methodology”, submitted to ICECS2007
[3] C2R Compiler product brief, www.cebatech.com
[4] Synopsys SystemC compiler product brief, www.synopsys.com
[5] Agility Compiler product brief, http://www.celoxica.com/products/agility/default.asp
[6] http://www.forteds.com/products/cynthesizer.asp
[7] ImpulseC Compiler, http://www.impulsec.com/C_to_fpga.htm
[8] DK Design Suite, http://www.celoxica.com/products/dk/default.asp
[9] R. Thomson, V. Chouliaras and D. Mulvaney, “From UML to structural hardware designs”, accepted for presentation at DAC 2007, San Diego, California, USA
[10] A. Dellson, “Programming FPGAs for High Performance Computing Acceleration”, Xilinx XCELL Journal, http://www.xilinx.com/publications/xcellonline/xcell_55/xc_pdf/xc_mitrion55.pdf
[11] PICO Express product brief, http://www.synfora.com/products/picoexpress.html
[12] S. Gupta, N. D. Dutt, R. K. Gupta, A. Nicolau, “SPARK: A High-Level Synthesis Framework For Applying Parallelizing Compiler Transformations”, International Conference on VLSI Design, January 2003
[13] J. L. Tripp, K. D. Peterson, C. Ahrens, J. D. Poznanovic, M. B. Gokhale, “Trident: an FPGA compiler framework for floating-point algorithms”, International Conference on Field Programmable Logic and Applications, 24-26 Aug. 2005, pp. 317-322
[14] Catapult Compiler product brief, http://www.mentor.com/products/esl/high_level_synthesis/catapult_synthesis/index.cfm
[15] Softfloat release 2b, http://www.jhauser.us/arithmetic/SoftFloat.html
[16] Floating Point Unit project, http://www.opencores.org/projects.cgi/web/fpu/overview
[17] D. Burger, T. Austin, “Evaluating Future Microprocessors: The SimpleScalar Tool Set”, http://www.simplescalar.com