Optimizing the Implementation of Floating Point Units for FPGA Synthesis
Irvin Ortiz Flores
Advisors: Manuel Jiménez and Domingo Rodríguez
Electrical and Computer Engineering Department
University of Puerto Rico, Mayagüez Campus
Mayagüez, Puerto Rico 00681-5000
[email protected]

Abstract
This article reports work done on the optimization of scalable floating-point addition and multiplication operators. Both operators had been implemented in previous work, but several of their characteristics offered room for improvement. Their structure and main components are discussed and characterized in order to quantify the improvements achieved.
1. Introduction
Scalable floating-point cores allow the range and precision of computations to be tailored to the exact needs of the user. To achieve this scalability, Hardware Description Languages (HDLs) are used. Precision and range can be adjusted by controlling the sizes of the mantissa and exponent fields, respectively, by means of parameters passed to the HDL at compile time. The implementation of these cores on a Field Programmable Gate Array (FPGA) allows a significant reduction in the turnaround time of many applications, enabling their rapid prototyping. Moreover, scalable cores provide added flexibility when used to speed up the design cycle. Two floating-point cores were developed in a previous work [Jiménez98]. The features of these units had been reported, but the results obtained offered some room for improvement [Jiménez01]. This paper reports the optimization results on these units.
2. FPGAs and VHDL Prototypes
Field Programmable Gate Arrays (FPGAs) have evolved enormously, growing from roughly 10,000 to 10,000,000 logic gates. This density increase, along with overall technology improvements, makes FPGAs a good choice for implementing DSP applications. FPGAs are an attractive tool for rapid prototyping because of their high programmability and fast reconfiguration.
The prototypes for the optimized units were designed using VHDL, taking advantage of the language features that allow for reusability. Several typical steps are performed when synthesizing an algorithm or application through a hardware description language. Figure 1 shows a typical design flow, which starts with an HDL source and ends with a netlist downloadable to a programmable device such as an FPGA. Scalability parameters are specified through generic VHDL parameters at the beginning of the synthesis process. When a circuit is synthesized, timing, functionality, and resource-consumption data become available through the synthesis reports. CLBs are configured and interconnected when the synthesized code is downloaded, creating a circuit with the specifications given in the HDL code.
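For illustration, the sketch below shows how such compile-time scalability is typically expressed through VHDL generics. The entity, generic, and port names are hypothetical, since the paper does not list the cores' actual interfaces; only the interface is sketched.

  library ieee;
  use ieee.std_logic_1164.all;

  -- Hypothetical interface: EXP_W and MANT_W stand for the scalability
  -- parameters that are resolved by the synthesis tool at compile time.
  entity fp_unit is
    generic (
      EXP_W  : natural := 8;    -- exponent field width
      MANT_W : natural := 23    -- mantissa field width
    );
    port (
      clk    : in  std_logic;
      a, b   : in  std_logic_vector(EXP_W + MANT_W downto 0);  -- sign, exponent, mantissa
      result : out std_logic_vector(EXP_W + MANT_W downto 0)
    );
  end entity fp_unit;

A 64-bit configuration would then be obtained simply by instantiating the unit with generic map (EXP_W => 11, MANT_W => 52), without changing the source code.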
Figure 1: VHDL design flow. The HDL source file and scalability parameters are functionally simulated and then synthesized (read HDL files, insert pads, apply area/delay constraints, compile design) using the synthesis libraries; the flow continues with back-annotated VHDL simulation with CLB models, XNF netlist generation, partition, place and route, and program file generation (MakeBits).
3. Floating Point Operators
A typical floating-point representation includes three fields: sign, exponent, and mantissa. The format used in the optimized operators resembles in most aspects the IEEE 754 standard for floating-point representation [Parhami00]. The developed units allow numbers of varying precision and range by reconfiguring the sizes of their mantissa and exponent fields.
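As a minimal sketch of this field arrangement (the entity and signal names are illustrative, not taken from the original cores), the three fields of a packed operand of width 1 + exponent + mantissa can be recovered by slicing:

  library ieee;
  use ieee.std_logic_1164.all;

  -- Illustrative field unpacking for the scalable format.
  entity fp_unpack is
    generic (
      EXP_W  : natural := 8;
      MANT_W : natural := 23
    );
    port (
      x        : in  std_logic_vector(EXP_W + MANT_W downto 0);
      sign     : out std_logic;
      exponent : out std_logic_vector(EXP_W - 1 downto 0);
      mantissa : out std_logic_vector(MANT_W - 1 downto 0)
    );
  end entity fp_unpack;

  architecture rtl of fp_unpack is
  begin
    sign     <= x(EXP_W + MANT_W);
    exponent <= x(EXP_W + MANT_W - 1 downto MANT_W);
    mantissa <= x(MANT_W - 1 downto 0);
  end architecture rtl;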
The floating-point adder is structured as a three-stage pipeline. The first stage swaps the operands and extracts the information needed for exponent equalization. The second stage equalizes the exponents and performs the addition or subtraction of the mantissas. The third stage sets the status conditions and flags and post-normalizes the result. The bottlenecks of this operator are the shifter and the fixed-point addition units.

The floating-point multiplier is also structured as a three-stage pipeline. The first stage performs the mantissa multiplication (the critical stage) and the exponent addition. The second stage normalizes the result, while the third sets the status conditions and flags.
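The three-stage organization can be sketched as a clocked process with one register bank per stage. The fragment below is a structural skeleton only: the per-stage logic is reduced to comments and placeholder assignments, and all names are assumptions rather than the cores' actual identifiers.

  library ieee;
  use ieee.std_logic_1164.all;

  entity fp_add_pipe is
    generic (
      EXP_W  : natural := 8;
      MANT_W : natural := 23
    );
    port (
      clk    : in  std_logic;
      a, b   : in  std_logic_vector(EXP_W + MANT_W downto 0);
      result : out std_logic_vector(EXP_W + MANT_W downto 0)
    );
  end entity fp_add_pipe;

  architecture pipeline of fp_add_pipe is
    signal stage1_a, stage1_b : std_logic_vector(EXP_W + MANT_W downto 0);
    signal stage2_sum         : std_logic_vector(EXP_W + MANT_W downto 0);
  begin
    process (clk)
    begin
      if rising_edge(clk) then
        -- Stage 1: swap the operands and record the exponent difference
        -- needed for equalization.
        stage1_a <= a;            -- placeholder for the larger operand
        stage1_b <= b;            -- placeholder for the smaller operand

        -- Stage 2: align the smaller mantissa and add/subtract.
        stage2_sum <= stage1_a;   -- placeholder for the equalize-and-add result

        -- Stage 3: post-normalize and set the status flags.
        result <= stage2_sum;     -- placeholder for the normalized result
      end if;
    end process;
  end architecture pipeline;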
4. Optimization of Base Components
The general procedure for component optimization was based on a careful inspection of the VHDL code in search of inefficient constructs. For example, a for-loop is a typical resource-consuming structure, because the synthesis tool unrolls every loop and generates hardware for each iteration. Another step was to inspect the components most commonly used in both FP operators and to research more efficient approaches for them. Figure 2 shows the common structures subjected to the optimization task. Xilinx FPGAs were used as the synthesis target.
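As a constructed illustration of the loop problem (this is not code from the original units), a variable shifter written with a for-loop makes the tool generate a comparator and shift logic for every possible shift amount:

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  -- Inefficient description: the for-loop is fully unrolled by the
  -- synthesis tool, producing hardware for every iteration even though
  -- only one shift amount is used per operation.
  entity shift_loop is
    generic ( N : natural := 32 );
    port (
      a     : in  std_logic_vector(N - 1 downto 0);
      shamt : in  natural range 0 to N - 1;
      o     : out std_logic_vector(N - 1 downto 0)
    );
  end entity shift_loop;

  architecture behavioral of shift_loop is
  begin
    process (a, shamt)
      variable tmp : std_logic_vector(N - 1 downto 0);
    begin
      tmp := a;
      for i in 1 to N - 1 loop
        if shamt = i then          -- one comparator per iteration
          tmp := std_logic_vector(shift_left(unsigned(a), i));
        end if;
      end loop;
      o <= tmp;
    end process;
  end architecture behavioral;

The multiplexer-based scheme described in Section 4.4 avoids this kind of replication.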
Figure 2: Optimization of the floating-point units and components. The FPM and FPA are decomposed into their base components (array multiplier, RCLA adder, comparators, shifter, and glue logic), which are optimized and reassembled into the improved FPM* and FPA*.

Figure 3: CLB consumption and delay characteristics of the evaluated comparators (by subtraction, by the synthesis tool, the modular comparator, the modular comparator with synthesis tool support, and the old comparator) as a function of data size in bits.
4.1. Comparators
Comparators are frequently used in floating-point arithmetic. They are associated with operand swapping and with the decisions behind if-then-else constructs. Several schemes were tested: comparison by subtraction, comparison by the synthesis tool, two modular approaches based on a two-bit number comparator, and the old scheme, which is described below. These schemes were evaluated for speed and consumed resources over different operand sizes, as seen in Figure 3.
The old scheme uses a function that converts the incoming vectors to integers and then sets the outputs by operating on those integers. Its results are very similar to those of the synthesis tool, which provides built-in support for comparison operations; this suggests the tool may use a similar approach internally. The subtraction scheme subtracts the two numbers and evaluates the sign of the result to determine the comparison outcome. The module comparator is based on a 2-bit, two-number comparator cell arranged as a binary tree. The idea behind this scheme is to reduce the size of the numbers to be compared at each stage while preserving their relative order; the size of the numbers is roughly halved after each stage until they are reduced to one bit. A variant of this scheme lets the synthesis tool implement the basic cell while maintaining the same structure as the module comparator.
The synthesis tool and the module comparator were the best in terms of delay. The module approach (with synthesis tool support) and the subtraction approach were the best in terms of area. The selection among these approaches should consider the data size and the maximum allowable area and delay of the design.
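The exact VHDL of the evaluated schemes is not reproduced in the paper; the sketch below only illustrates two of them, comparison through the synthesis tool's relational operator and comparison by subtraction, using hypothetical entity and signal names.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity compare_gt is
    generic ( W : natural := 32 );
    port (
      a, b : in  std_logic_vector(W - 1 downto 0);
      gt   : out std_logic
    );
  end entity compare_gt;

  -- "By the synthesis tool": rely on the built-in relational operator.
  architecture by_tool of compare_gt is
  begin
    gt <= '1' when unsigned(a) > unsigned(b) else '0';
  end architecture by_tool;

  -- "By subtraction": examine the borrow and magnitude of a - b.
  architecture by_subtraction of compare_gt is
    signal diff : unsigned(W downto 0);   -- extra bit holds the borrow
  begin
    diff <= ('0' & unsigned(a)) - ('0' & unsigned(b));
    gt   <= '1' when diff(W) = '0' and diff /= 0 else '0';
  end architecture by_subtraction;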
4.2. Adder
Fixed-point addition is a fundamental operation that serves as a building block for more complex arithmetic algorithms. To optimize this operation, two alternatives were evaluated: a Carry Look-Ahead (CLA) adder and the adder provided by the synthesis tool. The fixed-point adder provided by the synthesis tool was selected over the CLA. Table 1 shows that this adder performs better than the CLA in both speed and used slices, in part because the slice structure supports the operation with fast carry logic. Three fixed-point adders are incorporated into the FP adder; they perform the mantissa addition, the exponent equalization, and the post-normalization. The FP multiplier incorporates two fixed-point adders, which perform the exponent addition and the exponent adjustment required by mantissa normalization.
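A minimal sketch of the selected alternative is shown below, assuming the behavioral description style that lets the tool infer the adder and map it onto the slices' fast carry chains; the entity and port names are illustrative.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  -- Letting the synthesis tool infer the adder: the "+" operator maps
  -- directly onto the dedicated carry logic of the slices.
  entity add_inferred is
    generic ( W : natural := 32 );
    port (
      a, b : in  std_logic_vector(W - 1 downto 0);
      s    : out std_logic_vector(W downto 0)   -- includes the carry out
    );
  end entity add_inferred;

  architecture rtl of add_inferred is
  begin
    s <= std_logic_vector(('0' & unsigned(a)) + ('0' & unsigned(b)));
  end architecture rtl;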
Table 1: Integer adder optimization results
Operand size | CLA Slices | CLA Delay (ns) | Tool's Adder Slices | Tool's Adder Delay (ns) | % Opt. Area | % Opt. Delay
32  | 74  | 51  | 16 | 27 | 76 | 47
64  | 150 | 78  | 32 | 27 | 78 | 65
96  | 218 | 114 | 48 | 27 | 77 | 76
128 | 285 | 157 | 64 | 28 | 77 | 82

4.3. Multipliers
Multiplication is a resource-consuming arithmetic operation in which time is at stake. Two schemes were tested in search of the best option for the FP units: an array multiplier and the multiplier provided by the synthesis tool. An array multiplier had been used to perform the high-speed operation in the FP multiplier. It was found that the synthesis tool supports this operation more efficiently than the array multiplier in both speed and resources, as seen in Table 2.

Table 2: Fixed-point multiplier optimization
Operand size | Array Mult. Slices | Array Mult. Delay (ns) | Tool's Multiplier Slices | Tool's Multiplier Delay (ns) | % Opt. Area | % Opt. Delay
32  | 1069 | 127 | 544   | 43  | 49 | 66
64  | 4812 | 202 | 3100  | 72  | 35 | 64
96  | --   | --  | 7013  | 105 | -- | --
128 | --   | --  | 12572 | --  | -- | --
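Similarly, a minimal sketch of a tool-inferred mantissa multiplier is given below, here with a single output register since the mantissa product is the critical path of the FP multiplier; the names and the register placement are assumptions, not the cores' actual code.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  -- Letting the synthesis tool implement the multiplier from the "*" operator.
  entity mult_inferred is
    generic ( W : natural := 24 );
    port (
      clk  : in  std_logic;
      a, b : in  std_logic_vector(W - 1 downto 0);
      p    : out std_logic_vector(2*W - 1 downto 0)
    );
  end entity mult_inferred;

  architecture rtl of mult_inferred is
  begin
    process (clk)
    begin
      if rising_edge(clk) then
        p <= std_logic_vector(unsigned(a) * unsigned(b));
      end if;
    end process;
  end architecture rtl;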
4.4. Shifter
An efficient and fast shifter is needed for floating-point operations. When the previously used shifter was synthesized, hardware was created for every one of the shifting possibilities, together with a multiplexer to select among them. This scheme was unacceptable for an FPGA because of its high space consumption. Figure 4 shows the internal structure of the developed scalable shifter, which follows the scheme of a log-2 shifter structure based on multiplexers [Heo00]. Input A (of size n) is the vector to be shifted. E (of size m) specifies the number of shifting positions, where m is given by ceiling(log2(n)). The value m also equals the number of multiplexer stages (s = m). Vector O, with n bits, is the shifted output value.
Figure 4: Diagram of the scalable shifter, built from m stages of 2:1 multiplexers, each controlled by one bit of E.

This scheme has the advantage of shifting in constant time regardless of the number of positions specified by E.
In addition, its space complexity grows linearly with the size of the data, as seen in Table 3. Furthermore, this shifter can be pipelined by inserting latches at the multiplexers' outputs.
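A possible VHDL description of this structure is sketched below, for a left-shifting variant and with illustrative names; each generate iteration produces one row of 2:1 multiplexers controlled by one bit of E, in contrast with the unrolled loop shown in Section 4.

  library ieee;
  use ieee.std_logic_1164.all;

  entity log_shifter is
    generic (
      N : natural := 32;   -- data width n
      M : natural := 5     -- number of stages, m = ceiling(log2(n))
    );
    port (
      a : in  std_logic_vector(N - 1 downto 0);
      e : in  std_logic_vector(M - 1 downto 0);  -- shift amount
      o : out std_logic_vector(N - 1 downto 0)
    );
  end entity log_shifter;

  architecture rtl of log_shifter is
    type stage_array is array (0 to M) of std_logic_vector(N - 1 downto 0);
    signal st : stage_array;
  begin
    st(0) <= a;

    gen_stages : for s in 0 to M - 1 generate
      constant ZEROS : std_logic_vector(2**s - 1 downto 0) := (others => '0');
    begin
      -- Stage s: a row of 2:1 multiplexers selecting between the unshifted
      -- value and the value shifted by 2**s positions.
      st(s + 1) <= st(s)(N - 1 - 2**s downto 0) & ZEROS when e(s) = '1'
                   else st(s);
    end generate gen_stages;

    o <= st(M);
  end architecture rtl;

For n = 32 this gives m = 5 rows of multiplexers, consistent with the constant shift time noted above.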
Table 3: Area and delay data of the shifter
Operand size | Slices | Delay (ns)
8   | 10  | 15.2
16  | 3   | 18.4
32  | 75  | 36.8
64  | 184 | 34.5
128 | 128 | 43.5
5. Results
Space and delay testing has been carried out on the FP operators. The FP units have been optimized, and some scalability failures have been repaired. The FP adder still needs optimization in its leading-zero detection stage. Table 4 shows the characteristics of this operator when synthesized for the 32-bit and 64-bit IEEE standard formats. The FP multiplier has been synthesized for various sizes, including the two IEEE standard formats. Table 5 shows the scalability data for this operator; there is an almost quadratic relation between its space complexity and the operand size.

Table 4: Optimization of the FP adder operator
Op. size | Mant. size | Exp. size | Slices non-opt | Slices opt | Delay non-opt (ns) | Delay opt (ns) | % Opt. Area | % Opt. Delay
32 | 23 | 8  | 470 | 424  | 21 | 18.4 | 9.8 | 12.4
64 | 52 | 11 | --  | 1232 | -- | 28.7 | --  | --

Table 5: Optimization of the FP multiplier operator
Op. size | Mant. size | Exp. size | Slices non-opt | Slices opt | Delay non-opt (ns) | Delay opt (ns) | % Opt. Area | % Opt. Delay
32  | 23  | 8  | 705  | 398  | 14.7 | 15.1 | 43 | -2
64  | 52  | 11 | 3605 | 1647 | 22.7 | 22.1 | 54 | 2.6
96  | 81  | 14 | 7311 | 5382 | 24   | 24.5 | 27 | -2
128 | 110 | 17 | --   | 9985 | --   | 19.9 | -- | --

The optimized components have been integrated into the FP operators, yielding better results than the previously used structures. Area reduction was achieved in all the structures, and delay reduction was achieved in almost all of them. Table 6 compares the units with other implementations.

Table 6: Comparison of the developed units vs. other implementations
Unit | Design | Delay (ns) | Area
FPA | 32-bit fixed | 20.6 | 317 slices
FPA | This paper   | 18.4 | 424 slices; 2402 LUTs
FPA | [Jaenicke01] | 35.7 | --
FPM | 32-bit fixed | 11.1 | 403 slices
FPM | This paper   | 15.1 | 398 slices; 1062 LUTs
FPM | [Jaenicke01] | 35.7 | 1750 LUTs
6. Conclusion & Future Work
New scalable components have been generated, including comparators, shifters, adders, and multipliers. Some of them rely on the synthesis tool, since the FPGA structure gives special support to operations such as carry propagation. The internal approaches used by the synthesis tool for the comparison, multiplication, and addition components could not be determined.
Further optimization and speed improvements of the cores remain to be done, mainly in the leading-zero detection stage of the FP adder. This stage is currently synthesized from a sequential algorithm whose delay does not allow the FP unit to operate at a higher frequency. Three additional floating-point cores will be developed: division, exponentiation, and square root operators. Research on scalable pipelining will also be conducted to determine the feasibility of including this feature in the cores.
References
[Heo00] Heo, S., "A Low-Power 32-bit Datapath Design," Master's thesis, Massachusetts Institute of Technology, August 2000.
[Jaenicke01] Jaenicke, A.; Luk, W., "Parameterised Floating-Point Arithmetic on FPGAs," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2001), USA, 2001, vol. 2, pp. 897-900.
[Jiménez98] Jiménez, M.; Santiago, N.; Rover, D., "Development of an FPGA-Based Scalable Floating Point Multiplier," Proceedings of the Fifth Canadian Workshop on Field Programmable Devices (FPD'98), June 1998, pp. 145-150.
[Jiménez01] Jiménez, M.; Rodríguez, D.; Santiago, N., "Scalable Floating Point FPGA Cores for Digital Signal Processing," Seminario Anual de Automática, Electrónica Industrial e Instrumentación (SAAEI 2001), Matanzas, Cuba, September 2001.
[Parhami00] Parhami, B., Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, 2000, pp. 300-303.