Very low resource table-based FPGA evaluation of elementary functions

Horácio C. Neto
INESC-ID/IST/ULisboa, Lisbon, Portugal
Email: [email protected]

Mário P. Véstias
INESC-ID/ISEL/IPL, Lisbon, Portugal
Email: [email protected]
Abstract—This paper analyzes the FPGA implementation of polynomial-based function evaluation, specifically considering the embedded block RAMs and multiplier-adders available in today's technologies. In particular, the computation of the reciprocal, square root and inverse square root functions using first- and second-order polynomial approximations is discussed. In each case, the most appropriate sizes for the interpolation intervals are selected according to the maximum polynomial approximation errors. Upper bounds for the truncation errors are formally derived in order to find the most appropriate sizes for the polynomial coefficients and fixed-point operands. The bit-sizes of the polynomial coefficients are optimized so that all the required values fit in only one 36Kbit BRAM. Further, the word lengths and the numbers of fractional bits of the operands are adjusted so that the fixed-point multiplications and additions can be implemented with the 17×24 unsigned multipliers and 48-bit adders available in the FPGA DSP blocks. The experimental results confirm that a straightforward implementation of the function evaluator using one BRAM and two DSP blocks can provide more than single-precision accuracy. Additionally, an implementation with one BRAM and three DSPs can provide a precision of 28 bits, which is more than adequate to generate the seed for a double-precision operator using one additional Newton-Raphson iteration.
I. INTRODUCTION
The objective of this work is to investigate an efficient FPGA scheme to compute elementary functions. These function evaluators will be part of an FPGA-based many-core architecture for scientific computing [1]. A significant set of matrix computations (e.g. LU and QR matrix factorizations) requires the efficient computation of the reciprocal, square root and/or inverse square root. Therefore, in this work we investigate low-resource FPGA-based architectures to effectively implement these three functions using the same base structure. A main objective was to find the maximum accuracy that can be achieved using a minimal number of specific FPGA embedded blocks, BRAMs and DSPs, and without any additional logic. The implementation of other elementary functions will be considered in future work. The most effective scheme to evaluate an elementary function is, in most cases, to use a piecewise polynomial approximation over a range-reduced input interval. In fact, a large number of table- and polynomial-approximation-based methods for the hardware evaluation of elementary functions
have been proposed and published since the early days of computing. An extensive survey of these is available in [2]. Most proposed methods, such as [3], [4], [5], targeted VLSI implementations, where the resource consumption metrics differ substantially from those of FPGAs. It is also possible to avoid the use of the table altogether by using a single polynomial approximation for the full input interval, at the cost of increasing the number of iterations [6]. Some works, such as [7], [8], [9], provide more or less complex methods to optimize the truncation errors of the polynomial approximations in order to reduce the size of the tables and/or of the arithmetic operators. Only a few, such as [10], [11], have targeted FPGA implementations, but these also did not specifically adjust the operands to the embedded multiplier-adder blocks available in a specific FPGA technology. In this work we specifically optimize the word lengths and the numbers of fractional bits of the operands so that the fixed-point multiplications and additions can be efficiently implemented with the 17×24 unsigned multipliers and 48-bit adders provided by the Xilinx FPGA DSP blocks. The optimization of the bit-sizes of the polynomial coefficients is further constrained such that all the required values fit in only one 36Kbit block RAM. In each case, we derive upper bounds for the truncation errors of 1st- and 2nd-order polynomial approximations and use these bounds to formally validate the coefficient optimization. Finally, the optimization analysis is confirmed by the experimental results presented in section IV.

II. FUNCTION EVALUATION
We consider the evaluation of functions for an x-input represented by s-bit significands. Without loss of generality, we consider the input to be normalized in the interval [0.5, 1[:

$x = 0.\underbrace{1\,x_{-2}\,x_{-3}\,x_{-4} \cdots x_{-s}}_{s\ \text{bits}}$   (1)
For operands available in the interval [1, 2[, as in the IEEE standard, this implies a simple one-bit shift-right at the input. The function result f(x) will also be represented with s-bit significands. In the case of the reciprocal and inverse square root, the significand of the result will be directly available in the interval [1, 2[, while in the case of the square root a final one-bit shift-left will be necessary to normalize it to the IEEE standard.
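For illustration only, the range reduction assumed in (1) can be sketched in Python with math.frexp, which already returns a significand in [0.5, 1[ and a binary exponent; the exponent handling shown for the reciprocal is an assumption for generic floating-point inputs, not part of the evaluator itself:

```python
import math

def reduce_input(v):
    m, e = math.frexp(v)   # v = m * 2^e with 0.5 <= m < 1
    return m, e

m, e = reduce_input(13.7)
print(m, e)                # 0.85625 4
# e.g. reciprocal: 1/v = (1/m) * 2^-e, with 1/m in ]1, 2]
```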
TABLE I. MAXIMUM APPROXIMATION ERRORS FOR DIFFERENT INTERVAL SIZES

Function              Interval Size   max |E_p1|      max |E_p2|
Reciprocal            2^-9            0.99 × 2^-18    0.99 × 2^-28
                      2^-10           0.99 × 2^-20    0.99 × 2^-31
                      2^-11           0.99 × 2^-22    0.99 × 2^-34
Square Root           2^-9            0.71 × 2^-22    0.70 × 2^-33
                      2^-10           0.71 × 2^-24    0.71 × 2^-36
                      2^-11           0.71 × 2^-26    0.71 × 2^-39
Inverse Square Root   2^-9            0.53 × 2^-19    0.88 × 2^-30
                      2^-10           0.53 × 2^-21    0.88 × 2^-33
                      2^-11           0.53 × 2^-23    0.88 × 2^-36

TABLE II. FIXED-POINT FORMAT NOTATION

Number (Decimal)   Format   Number (Binary)   Word Length
3.375              Q2.3     11.011            5
3.375              Q4.7     0011.0110000      11
0.078125           Q-3.6    0.000101          3
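As an illustrative cross-check of Table I (not part of the paper's method), the minimax linear error for the convex function 1/x has a closed form: the best line is parallel to the secant, and the error equioscillates at the endpoints and at the point c where f'(c) equals the secant slope. A minimal Python sketch for the worst subinterval:

```python
from math import sqrt, log2

def minimax_linear_error(u, v):
    """Max error of the best degree-1 approximation of 1/x on [u, v]."""
    s = (1/v - 1/u) / (v - u)   # secant slope
    c = sqrt(u * v)             # f'(c) = -1/c^2 = s (equioscillation point)
    return ((1/u - s*u) - (1/c - s*c)) / 2

print(log2(minimax_linear_error(0.5, 0.5 + 2.0**-11)))  # ~ -22.0, cf. Table I
```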
A. Polynomial approximation

The function is evaluated using an approximation obtained through a piecewise minimax polynomial. The minimax polynomial is the approximating polynomial which has the smallest maximum error from the given function [12]. We analyze the precisions that can be obtained using first-order

$p_1(x) = a + b\,(x - x_0)$   (2)

and second-order

$p_2(x) = a + b\,(x - x_0) + c\,(x - x_0)^2$   (3)

polynomials. The input operand interval [0.5, 1[ is divided into subintervals of size 2^-w, such that the polynomial coefficients for each subinterval can be stored in a ROM, addressed by the lower w − 1 bits of x0:

$x = 0.\underbrace{1\,x_{-2}\,x_{-3} \cdots x_{-w}}_{x_0\ \text{(ROM address bits)}}\,x_{-(w+1)} \cdots x_{-s}$   (4)

Table I shows the upper bounds for the function approximation errors, when computed with full precision, using the first- and second-order minimax polynomials. As shown, subintervals of size 2^-11 can provide first-order approximations with errors lower than 2^-22, while subintervals of size 2^-10 can provide second-order approximations with errors lower than 2^-31. The worst cases always correspond to the reciprocal function; for the other functions the error limits are always lower.

B. Fixed-point arithmetic

In the following we represent a fixed-point number format as Q[QI].[QF], where QI is the number of integer bits and QF is the number of fractional bits. The word length is given by WL = QI + QF. We consider negative values of QI to represent extended resolution for fractional-only numbers, as in [13]. Thus, a negative QI represents the number of leading fractional zeros of the unsigned number. Table II shows some format notation examples.

The usual rules for full-precision arithmetic remain valid. For fixed-point multiplication, the numbers of integer and fractional bits in the result are the sums of the corresponding operand bit sizes. When adding/subtracting fixed-point operands, the points must be aligned, and the integer/fractional part of the result (without carry out) equals the larger of the operands' integer/fractional parts:

$QI_{a \times b} = QI_a + QI_b$   (5)
$QF_{a \times b} = QF_a + QF_b$   (6)
$QI_{a+b} = \max(QI_a, QI_b)$   (7)
$QF_{a+b} = \max(QF_a, QF_b)$   (8)

C. Truncation Errors

In practice, and to reduce the size of the lookup table and of the multipliers and adders, the arithmetic operators are implemented with less than full precision. In this case, a multiply-add term

$p = l + m\,n$   (9)

becomes

$p = (l_T + \epsilon_L) + (m_T + \epsilon_M)(n_T + \epsilon_N) = \underbrace{l_T + m_T n_T}_{p_T} + \underbrace{\epsilon_L + m_T \epsilon_N + n_T \epsilon_M + \epsilon_M \epsilon_N}_{\epsilon_T}$   (10)
where ε_L, ε_M and ε_N are the truncation errors of l, m and n, respectively. The truncation error of the multiply-add term (9) is therefore upper-bounded by

$\epsilon_T < \epsilon_L + |m|_{MAX}\,\epsilon_N + |n|_{MAX}\,\epsilon_M + \epsilon_M \epsilon_N$   (11)

or, considering the numbers of fractional bits kept after truncation of l, m and n,

$\epsilon_T < 2^{-QF_l} + |m|_{MAX}\,2^{-QF_n} + |n|_{MAX}\,2^{-QF_m} + 2^{-QF_m}\,2^{-QF_n}$   (12)
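Bound (12) is easy to evaluate numerically; a minimal Python helper (illustrative only — an untruncated operand can be modeled by passing QF = ∞, making its truncation error zero):

```python
from math import inf, log2

def trunc_bound(QF_l, QF_m, QF_n, m_max, n_max):
    """Upper bound (12) on the truncation error of p = l + m*n."""
    return (2.0**-QF_l + m_max * 2.0**-QF_n
            + n_max * 2.0**-QF_m + 2.0**-QF_m * 2.0**-QF_n)

# Example: MAdd1 of the two-DSP architecture of section III (l = b with QF 21,
# m = c with QF 10 and |c|max < 8, n = y untruncated with |y|max = 2^-10):
print(log2(trunc_bound(21, 10, inf, m_max=8, n_max=2**-10)))  # ~ -19.4, cf. (47)
```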
III. ARCHITECTURAL EXPLORATION
Xilinx DSP blocks can implement a multiply-add operation with a 17×24 unsigned multiplication (18×25 signed) and a 48-bit addition/subtraction [14]. This constrains the sizes of the operands in the multiply-add term (9), such that

$WL_l \le 48$   (13)
$WL_m \le 17$   (14)
$WL_n \le 24$   (15)

but it also constrains their numbers of fractional bits because, for fixed-point operands, the points must be aligned for the final addition/subtraction:

$QF_m + QF_n \le QF_l$   (16)

We target ROM implementations using Xilinx block RAMs. Each BRAM can store 36 Kbits and can be configured as 1K×36 or 512×72 bits, among other possible configurations with smaller word lengths [15]. Therefore, we analyze implementations with 1K intervals (w = 11) for first-order approximations and with 512 intervals (w = 10) for second-order approximations.
This BRAM constraint further implies that, for first-order polynomials, the sum of the word lengths of the two coefficients a and b must be no larger than 36 and, for second-order polynomials, the sum of the word lengths of the three coefficients a, b and c must be no larger than 72.
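For illustration, a sketch of how two coefficient magnitudes could be packed into one 36-bit BRAM word, assuming the Q1.22/Q2.11 formats selected in subsection III-A below; the field order is an arbitrary choice of this sketch, and signs are handled by the DSP add/subtract configuration rather than stored:

```python
def pack36(A, B):
    """A: 23-bit integer (a in Q1.22), B: 13-bit integer (|b| in Q2.11)."""
    assert 0 <= A < 1 << 23 and 0 <= B < 1 << 13
    return (A << 13) | B          # 23 + 13 = 36 bits, one 1K x 36 BRAM line

def unpack36(word):
    return word >> 13, word & ((1 << 13) - 1)

word = pack36(0x5A5A5A, 0x0AAA)
print(unpack36(word) == (0x5A5A5A, 0x0AAA))   # True
```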
Considering y = (x − x0) in (2) and (3), we want to implement (17) and (19) with the word-length constraints indicated below:

1st order:  $p_1 = a + b \times y$   (17)
            $WL_a + WL_b \le 36$   (18)

2nd order:  $p_2 = a + b \times y + c \times y^2$   (19)
            $WL_a + WL_b + WL_c \le 72$   (20)

The polynomial coefficients can be positive or negative, depending on their order and on the function to approximate. However, only their absolute values are stored in the BRAM. Negative values are simply handled by configuring the DSP block with a subtraction instead of an addition. In the subsections below we analyze implementations of 1st-order approximations with one multiplier-adder, and implementations of 2nd-order approximations with two and three multiplier-adders. The analysis implicitly considers the coefficient bit sizes and truncation errors required by an implementation of the reciprocal function, because this is the function with the larger polynomial coefficients and thus with the larger truncation errors (for the same evaluating subintervals and the same coefficient bit sizes). The truncation error values presented are therefore upper bounds of the lower truncation errors implied by the less stringent square root and inverse square root functions. For simplicity of notation and without loss of generality, we present the formulae considering the multiply-add terms always in the form l + mn, although in some cases they may, in fact, be computed as l − mn.

A. One multiplier-adder

The implementation of a first-order approximation with one multiplier-adder is straightforward. The values of a and b in (17) are upper-bounded by 2 and 4, respectively (the worst case is the first subinterval of the reciprocal function):

$|a|_{max} < 2$   (21)
$|b|_{max} < 4$   (22)

such that

$QI_a = 1$   (23)
$QI_b = 2$   (24)

and therefore, considering the size constraint from (18),

$QF_a + QF_b \le 33$   (25)

Using subintervals of size 2^-11, y has 11 leading fractional zeros, that is, QI_y = −11. Given that the larger multiplier input supports WL_y ≤ 24, the y-input does not need to be truncated for x-inputs with significand sizes s ≤ 35, which already exceeds the precision warranted by the best achievable approximation error of 2^-22 (from Table I). Therefore, for practical significand sizes of x, y does not need to be truncated.

The truncation error of y being zero, ε_Y = 0, the upper bound of the truncation error of the first-order approximation can be evaluated considering only the first two terms in (12), that is,

$\epsilon_{T1} < 2^{-QF_a} + 2^{-11-QF_b}$   (26)

This upper bound is minimized by making

$QF_a = 11 + QF_b$   (27)

Therefore, and considering (25), we select

$QF_a = 22$   (28)
$QF_b = 11$   (29)

which, from (26), gives an estimated upper bound for the truncation error of

$\epsilon_{T1} < 2^{-20.6}$   (30)

These values were confirmed experimentally as providing the lowest global errors. The experimental results obtained are discussed in section IV.
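To make the datapath concrete, the following minimal Python sketch emulates this first-order evaluator bit-accurately for the reciprocal, assuming a 24-bit input significand and the Q1.22/Q2.11 formats above; the coefficients are computed on the fly with the closed-form minimax line for the convex function 1/x (in hardware they would come from the BRAM), and rounding rather than truncation is used for simplicity:

```python
from math import sqrt

W, S = 11, 24                  # subinterval exponent (w) and significand bits (s)
QF_A, QF_B = 22, 11            # selected formats: a -> Q1.22, b -> Q2.11

def minimax_line(u, v):
    """Best linear approximation of the convex f(x) = 1/x on [u, v]."""
    s = (1/v - 1/u) / (v - u)             # secant slope
    c = sqrt(u * v)                       # equioscillation point: f'(c) = s
    b0 = ((1/u - s*u) + (1/c - s*c)) / 2  # intercept midway secant/tangent
    return s, b0                          # line p(x) = s*x + b0

def recip_p1(X):
    """X: integer significand, 2^23 <= X < 2^24, representing x = X * 2^-24."""
    idx = (X >> (S - W)) & ((1 << (W - 1)) - 1)  # lower w-1 bits of x0 (address)
    x0 = 0.5 + idx * 2.0**-W
    slope, b0 = minimax_line(x0, x0 + 2.0**-W)
    A = round((slope * x0 + b0) * 2**QF_A)       # a = p(x0) in Q1.22
    B = round(-slope * 2**QF_B)                  # |b| in Q2.11; b < 0: DSP subtracts
    Y = X & ((1 << (S - W)) - 1)                 # y = x - x0, QI_y = -11, QF_y = 24
    P = (A << (QF_B + S - QF_A)) - B * Y         # align a to QF = 11 + 24 = 35
    return P * 2.0**-(QF_B + S)

X = (3 << 22) + 12345                            # an arbitrary input, x ~ 0.7507
x = X * 2.0**-S
print(abs(1/x - recip_p1(X)))                    # ~2^-21, within the ~2^-20.6 budget
```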
B. Two multiplier-adders

The second-order approximation may be directly mapped onto a two multiplier-adder architecture by using the Horner scheme

$p_2 = a + y \times (b + c \times y)$   (31)

that is,

$q = b + c \times y$   (MAdd1)   (32)
$p = a + q \times y$   (MAdd2)   (33)

as long as the operand bit-lengths fit the arithmetic operators and the numbers of fractional bits satisfy the point-alignment constraints. The first multiplier-adder, MAdd1, imposes the following bit-size constraints:

$WL_y \le 24 \;\rightarrow\; QF_y \le 34$   (34)
$WL_c \le 17 \;\rightarrow\; QF_c \le 14$   (35)
$WL_b \le 48 \;\rightarrow\; QF_b \le 47$   (36)
$QF_y + QF_c \le QF_b \;\rightarrow\; QF_y + QF_c \le 47$   (37)

which are easy to fulfill because the second-order polynomial coefficient c has a relatively small word length. The second multiplier-adder, MAdd2, imposes the following bit-size constraints:

$WL_y \le 17 \;\rightarrow\; QF_y \le 27$   (38)
$WL_q \le 24 \;\rightarrow\; QF_q \le 22$   (39)
$WL_a \le 48 \;\rightarrow\; QF_a \le 47$   (40)
$QF_y + QF_q \le QF_a \;\rightarrow\; QF_y + QF_q \le 47$   (41)

We consider first the largest possible significand size for the input x such that it is not necessary to truncate y in (32) and (33), that is, WL_x = QF_y = 27. In this case, q must be truncated to 20 fractional bits, that is, QF_q = 47 − 27 = 20.
If y does not need to be truncated, the truncation errors of the two multiply-add operations are upper-bounded by

$\epsilon_{T2q} < 2^{-QF_b} + 2^{-10-QF_c}$   (42)
$\epsilon_{T2p} < 2^{-QF_a} + 2^{-10-QF_q}$   (43)

We select QF_c = 10 and QF_b = 21, such that the theoretical upper bound of the truncation error for MAdd1 is approximately equal to 2^-20. Then, from (20), the number of fractional bits of a must be at most 35, QF_a ≤ 35, so that all three coefficients fit in one memory position. Therefore, we use the following coefficient formats:

$a \rightarrow Q1.35$   (44)
$b \rightarrow Q2.21$   (45)
$c \rightarrow Q3.10$   (46)

From (42) and (43), this gives estimated upper bounds for the truncation errors of the multiply-adders of

$\epsilon_{T2q} < 2^{-19.4}$   (47)
$\epsilon_{T2p} < 2^{-29.3}$   (48)

These coefficient bit-sizes were confirmed experimentally as providing the lowest global errors.

If the significand size of the input x is such that y must be truncated, the truncation error increases significantly. In fact, considering the y-truncation error in the estimation formulae,

$\epsilon_{T2q} < 2^{-QF_b} + 2^{-10-QF_c} + |c|_{MAX}\,2^{-QF_y} + 2^{-QF_c}\,2^{-QF_y}$   (49)
$\epsilon_{T2p} < 2^{-QF_a} + 2^{-10-QF_q} + |q|_{MAX}\,2^{-QF_y} + 2^{-QF_q}\,2^{-QF_y}$   (50)

the estimated truncation error upper bounds become

$\epsilon_{T2q} < 2^{-19.3}$   (51)
$\epsilon_{T2p} < 2^{-24.9}$   (52)

This abrupt 22× increase in the estimated truncation error clearly indicates that, for input significand sizes larger than 27, the second-order polynomial must be computed with three multiplier-adders, as analyzed in the following subsection.
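Plugging the selected formats into (49) and (50) reproduces these bounds numerically, assuming the magnitude bounds |c|max < 8 (Q3.10), |q|max < 4 (Q2.20), |y|max = 2^-10, and QF_y = 27 for a 53-bit input:

```python
from math import log2

e_q = 2**-21 + 2**-10 * 2**-10 + 8 * 2**-27 + 2**-10 * 2**-27   # (49)
e_p = 2**-35 + 2**-10 * 2**-20 + 4 * 2**-27 + 2**-20 * 2**-27   # (50)
print(log2(e_q), log2(e_p))   # ~ -19.4 and ~ -24.9, matching (51) and (52)
```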
C. Three multiplier-adders

For input operands x with larger word lengths, we maintain the word lengths of the polynomial coefficients a, b and c used in the two multiplier-adder architecture, but must increase the numbers of fractional bits of q and y used to compute (33) in order to significantly reduce the existing truncation error. As the bit size of q and/or y increases, the computation of (33) has to be implemented with at least one additional multiplier-adder. We propose to divide the second multiplication in two, by considering separately an upper term qU and a lower term qL of q:

$q = b + c \times y$   (MAdd1)   (53)
$q = q_U + q_L$   (54)
$p_1 = a + q_U \times y$   (MAdd2)   (55)
$p = p_1 + q_L \times y$   (MAdd3)   (56)

However, the need to align the points in the fixed-point operations still imposes constraints on the numbers of fractional bits of y, qU and qL in the second and third multiplier-adders. The second multiplier-adder, MAdd2, imposes the following bit-size constraints:

$WL_y \le 24 \;\rightarrow\; QF_y \le 34$   (57)
$WL_{q_U} \le 17 \;\rightarrow\; QF_{q_U} \le 19$   (58)
$WL_a \le 48 \;\rightarrow\; QF_a \le 47$   (59)
$QF_y + QF_{q_U} \le QF_a \;\rightarrow\; QF_y + QF_{q_U} \le 47$   (60)

Selecting the largest possible value WL_y = 24 for y imposes QF_qU ≤ 47 − 34 = 13. Therefore, qU will be in the format Q2.13, and qL will have 13 leading fractional zeros, QI_qL = −13. The bit-size constraints imposed by the third multiplier-adder, MAdd3, are

$WL_y \le 24 \;\rightarrow\; QF_y \le 34$   (61)
$WL_{q_L} \le 17 \;\rightarrow\; QF_{q_L} \le 30$   (62)
$WL_a \le 48 \;\rightarrow\; QF_a \le 47$   (63)
$QF_y + QF_{q_L} \le QF_a \;\rightarrow\; QF_y + QF_{q_L} \le 47$   (64)

The upper-bound truncation error of the third multiply-add is

$\epsilon_{T3} < \epsilon_{P1} + 2^{-w}\,\epsilon_{Q_L} + \epsilon_Y\,|q_L|_{MAX} + \epsilon_{Q_L}\,\epsilon_Y$   (65)

In order to minimize ε_T3 we equalize the two main truncation error terms,

$|y|_{MAX}\,\epsilon_{Q_L} = \epsilon_Y\,|q_L|_{MAX}$   (66)

or

$2^{-10-QF_{q_L}} = 2^{-QF_y-13}$   (67)

and, from constraint (64), we select

$QF_{q_L} = 25$   (68)
$QF_y = 22$   (69)

In fact, y is truncated to a lower number of fractional bits in the third multiplier. However, as this is a multiplication by a small value (qL), it has a low impact on the overall truncation error.

For this architecture, the estimated upper bounds for the truncation errors in the three multiply-adders are

$\epsilon_{T_q} = 2^{-19.9}$   (70)
$\epsilon_{T_{p_1}} = 2^{-29.9}$   (71)
$\epsilon_{T_p} = 2^{-29.8}$   (72)

which, again, indicates that precisions near (about 2× lower than) the maximum achievable with a 2nd-order polynomial are possible using only three multiplier-adders.
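A small integer demo of the q = qU + qL split (illustrative only; it assumes q is produced by MAdd1 with QF = 44, i.e. the c × y product alignment QF_c + QF_y = 10 + 34, and an arbitrary sample value):

```python
Q = (3 << 43) + 0x12345678     # q ~ 1.5 in Q2.44 (arbitrary sample)
QU = Q >> (44 - 13)            # Q2.13 upper term, feeds MAdd2
R  = Q - (QU << (44 - 13))     # remainder, value < 2^-13
QL = R >> (44 - 25)            # lower term kept with QF = 25 (QI = -13)
lost = R - (QL << (44 - 25))   # what the split discards
print(QU * 2.0**-13,           # 1.5
      QL * 2.0**-25 < 2**-13,  # True: qL has 13 leading fractional zeros
      lost < 2**(44 - 25))     # True: |q - qU - qL| < 2^-25
```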
[Fig. 1. Architecture I: Error of reciprocal in [0.5, 0.5 + 2^-11]]
[Fig. 2. Architecture II: Error of reciprocal in [0.5, 0.5 + 2^-11]]
[Fig. 3. Architecture III: Error of reciprocal in [0.5, 0.5 + 2^-11]]
[Fig. 4. Architecture I: Maximum absolute errors of reciprocal]
[Fig. 5. Architecture II: Maximum absolute errors of reciprocal]
IV. EXPERIMENTAL RESULTS
The global errors for the three functions, reciprocal, square root and inverse square root, were evaluated in Python(x,y) [16] using a fixed-point arithmetic library. For x-inputs with lengths such that the number of (x − x0) points in each subinterval was less than 2^15, we tested all points in the interval [0.5, 1[. For larger significand lengths we tested, also over the full [0.5, 1[ interval, using 2^15 uniformly generated random x-inputs per subinterval. Figures 1 to 6 show the error results evaluated for the reciprocal function. Figures 1, 2 and 3 show the errors obtained in the first subinterval for architectures I, II and III, respectively. Figures 4, 5 and 6 show, for the three architectures, the maximum absolute errors in each of the 2^(w-1) subintervals of [0.5, 1[.
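For illustration, a minimal Python sketch of this sampling methodology; f_hat(X, s) is a hypothetical interface standing for a bit-accurate model of one of the architectures (e.g. the recip_p1 sketch of section III), and f_ref is the exact function. It is slow in pure Python and is shown only for clarity:

```python
import random

def max_error_sampled(f_hat, f_ref, s, w=11, samples=2**15, seed=0):
    """Max |f_ref(x) - f_hat(X, s)| over the 2^(w-1) subintervals of [0.5, 1[."""
    rng = random.Random(seed)
    worst = 0.0
    for idx in range(1 << (w - 1)):               # every subinterval
        base = (1 << (s - 1)) + (idx << (s - w))  # leading 1 + address bits
        for _ in range(samples):                  # random points per subinterval
            X = base + rng.randrange(1 << (s - w))
            worst = max(worst, abs(f_ref(X * 2.0**-s) - f_hat(X, s)))
    return worst

# e.g.: max_error_sampled(lambda X, s: recip_p1(X), lambda x: 1/x, s=24)
```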
[Fig. 6. Architecture III: Maximum absolute errors of reciprocal]
We also evaluated the experimental global errors using small variations on the coefficient bit sizes derived in section III. These variations confirmed that the derived values provide the lowest global errors and, therefore, we only present herein the results obtained using these specific coefficient formats, as summarized in Table III. For generic floating-point numbers, the handling of the exponent term is trivial in all cases. The only minor issue is that the computation of the square root has to consider different approximations for odd and even exponents, which is equivalent to using two separate tables. For this reason, we evaluated the global errors for the square root and inverse square root functions using subintervals 2× larger than the size used for the reciprocal function.
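For illustration, a minimal Python sketch of this odd/even exponent handling, assuming frexp-style range reduction (the actual operator would select between two coefficient tables rather than call sqrt):

```python
import math

def sqrt_reduced(v):
    m, e = math.frexp(v)       # v = m * 2^e, 0.5 <= m < 1
    if e & 1:                  # odd exponent: use the table for sqrt(2m)
        m, e = 2 * m, e - 1    # m now in [1, 2[
    return math.sqrt(m) * 2.0**(e // 2)

print(sqrt_reduced(2.0), math.sqrt(2.0))   # both ~1.41421356
```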
TABLE III. FIXED-POINT FORMATS OF THE POLYNOMIAL COEFFICIENTS

Architecture   a       b       c
I              Q1.22   Q2.11   -
II             Q1.35   Q2.21   Q3.10
III            Q1.35   Q2.21   Q3.10

TABLE IV. MAXIMUM GLOBAL ERRORS (EXPERIMENTAL)

                                  1 MAdd    2 MAdd    2 MAdd    3 MAdd
Reciprocal
  Interval size                   2^-11     2^-10     2^-10     2^-10
  Input significand bits          21        27        53        53
  Max error                       2^-21.1   2^-29.1   2^-24.9   2^-29.8
Square Root / Inverse Square Root
  Interval size                   2^-10     2^-9      2^-9      2^-9
  Input significand bits          21        26        53        53
  Square root max error           2^-20.8   2^-28.1   2^-27.0   2^-28.1
  Inverse square root max error   2^-20.8   2^-28.1   2^-26.2   2^-28.1
The error results obtained are shown in Table IV. For significand sizes lower than 27, the architecture with one BRAM and two DSP blocks provides results with global errors lower than 2^-28. This is more than adequate to support single-precision operations (24-bit input and output significands, with the output faithfully rounded). The architecture with one BRAM and three DSPs provides results with global errors lower than 2^-28 (2^-29.8 in the case of the reciprocal) for double-precision inputs (53-bit significands). This is more than adequate to generate a double-precision result using only one additional Newton-Raphson iteration.

All the architectures were synthesized with Xilinx ISE 13.4 and implemented in a Virtex-7 FPGA (speed grade -3). The architecture implementations were verified using a VHDL testbench with a subset of the x-input values. The mapping and timing results after place and route are shown in Table V. The delays shown correspond to non-pipelined polynomial evaluators. Naturally, the DSP multiplier-adders can be significantly pipelined, if required.

TABLE V. FPGA IMPLEMENTATION RESULTS

Architecture   BRAMs   Configuration   DSPs   Delay (ns)
I              1       1K×36           1      3.3
II             1       512×72          2      5.3
III            1       512×72          3      6.3
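A quick check of the Newton-Raphson claim (illustrative, using exact rational arithmetic rather than the FPGA datapath): one iteration of y' = y(2 − xy) roughly squares the relative error of a reciprocal seed, so ~28 correct bits become well over 53:

```python
from fractions import Fraction
from math import log2

x = Fraction(7, 10)                              # a significand in [0.5, 1[
seed = Fraction(round((1 / x) * 2**28), 2**28)   # reciprocal seed, ~28-bit accurate
y = seed * (2 - x * seed)                        # one Newton-Raphson iteration
print(log2(abs(y - 1 / x)))                      # ~ -62 here: well beyond 2^-53
```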
V. CONCLUSION
We presented a new analysis for the efficient computation of the reciprocal, square root, and inverse square root functions in FPGAs. The proposed architectures have been specifically derived to take full advantage of the embedded blocks available in the target FPGA technology. We have opted to keep the architectures as simple as possible, using only 1 BRAM and 1, 2 or 3 DSP blocks, and no additional random logic or arithmetic operators implemented using LUTs. The theoretical analysis and the experimental results provided demonstrate that a 2nd-order polynomial approximation consuming only 1 BRAM and 2 DSP blocks can provide more than (24-bit) single precision. The architecture with 1 BRAM and 3 DSP blocks can provide precisions up to 28 bits, in the worst cases, for double-precision (53-bit) inputs. These precisions are more than adequate to use the proposed 2nd-order polynomial approximation as an initial value for a double-precision operator with only one Newton-Raphson iteration.

ACKNOWLEDGMENT
This work was supported by national funds through FCT, Fundação para a Ciência e a Tecnologia, under projects PEst-OE/EEI/LA0021/2013 and PTDC/EEA-ELC/122098/2010.

REFERENCES

[1] W. Maltez, A. R. Silva, H. C. Neto, and M. P. Véstias, "Analysis of matrix multiplication on high density Virtex-7 FPGA," in Proceedings of the 23rd International Conference on Field Programmable Logic and Applications (FPL), Sep. 2013.
[2] J.-M. Muller, Elementary Functions - Algorithms and Implementation. Birkhäuser, 2006.
[3] M. Ercegovac, T. Lang, J.-M. Muller, and A. Tisserand, "Reciprocation, square root, inverse square root, and some elementary functions using small multipliers," IEEE Transactions on Computers, vol. 49, no. 7, pp. 628–637, 2000.
[4] V. Jain and L. Lin, "High-speed double precision computation of nonlinear functions," in Proceedings of the 12th Symposium on Computer Arithmetic (ARITH), 1995, pp. 107–114.
[5] J.-A. Piñeiro and J. Bruguera, "High-speed double-precision computation of reciprocal, division, square root, and inverse square root," IEEE Transactions on Computers, vol. 51, no. 12, pp. 1377–1388, 2002.
[6] A. Habegger, A. Stahel, J. Goette, and M. Jacomet, "An efficient hardware implementation for a reciprocal unit," in Proceedings of the 5th IEEE International Symposium on Electronic Design, Test and Application (DELTA), 2010, pp. 183–187.
[7] J.-M. Muller, "Partially rounded small-order approximations for accurate, hardware-oriented, table-based methods," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH), 2003, pp. 114–121.
[8] S. Tawfik and H. Fahmy, "Algorithmic truncation of minimax polynomial coefficients," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 2006, 4 pp.
[9] J. W. Hauser, "Approximation of nonlinear functions for fixed-point and ASIC applications using a genetic algorithm," Ph.D. dissertation, University of Cincinnati, 2001.
[10] J. Detrey and F. de Dinechin, "Table-based polynomials for fast hardware function evaluation," in Proceedings of the 16th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 2005, pp. 328–333.
[11] S. Lachowicz and H.-J. Pfleiderer, "Fast evaluation of the square root and other nonlinear functions in FPGA," in Proceedings of the 4th IEEE International Symposium on Electronic Design, Test and Applications (DELTA), 2008, pp. 474–477.
[12] W. Fraser, "A survey of methods of computing minimax and near-minimax polynomial approximations for functions of a single independent variable," J. ACM, vol. 12, no. 3, pp. 295–314, Jul. 1965.
[13] E. L. Oberstar, "Fixed-point representation & fractional math," Oberstar Consulting, Rev. 1, 2007.
[14] "7 Series DSP48E1 Slice," Xilinx User Guide UG479 (v1.4), Oct. 2012.
[15] "7 Series FPGAs Memory Resources," Xilinx User Guide UG473 (v1.7), Oct. 2012.
[16] "Python(x,y)," Scientific-oriented Python Distribution, https://code.google.com/p/pythonxy/.