Circuits Syst Signal Process (2013) 32:15–27 DOI 10.1007/s00034-012-9457-3

VLSI Implementation of Double-Precision Floating-Point Multiplier Using Karatsuba Technique

Manish Kumar Jaiswal · Ray C.C. Cheung

Received: 4 June 2011 / Revised: 3 July 2012 / Published online: 19 July 2012 © Springer Science+Business Media, LLC 2012

M.K. Jaiswal · R.C.C. Cheung
Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China

Abstract  Double-precision floating-point arithmetic, specifically multiplication, is a widely used operation in many scientific and signal processing applications. In general, a double-precision floating-point multiplier requires a large 53 × 53 mantissa multiplication to obtain the final result, and this mantissa multiplication limits both the area and the performance of the operation. This paper presents a novel way to reduce this large multiplication. The proposed approach uses less multiplication hardware than the traditional method; the multiplication is performed using the Karatsuba technique. The design specifically targets Field Programmable Gate Array (FPGA) platforms, and it has also been evaluated on an ASIC flow. The proposed module gives excellent performance with efficient use of resources, and the design is fully compatible with the IEEE standard precision. The proposed module shows better performance than the best multipliers reported in the literature.

Keywords  Floating-point multiplication · Karatsuba · Reconfigurable computing · Arithmetic · High performance computing

1 Introduction

Floating-point arithmetic is widely used in many areas, especially in scientific computation, numerical processing and signal processing (e.g. digital filters, FFT, image processing). The IEEE defines the standard [12, 13] for the single-precision and double-precision formats. Hardware implementation of arithmetic operations for the


IEEE floating-point standard has become a crucial part of almost all processors. Among all the floating-point arithmetic operations, multiplication is a representative and core operation. Application areas are always looking for high-performance and area-efficient implementations of floating-point arithmetic, and thus an efficient implementation of floating-point multipliers is of major concern. Over the last few decades, a lot of work has been dedicated to improving the performance of floating-point computations, at both the algorithmic and the implementation level. Several works have also focused on implementation on FPGA platforms [3, 11, 19, 20, 22]. In spite of this tremendous effort, this arithmetic is still often the bottleneck in many computations.

FPGAs (field programmable gate arrays) are now becoming a major focus for high performance computing. Their available speed, amount of logic and several available on-board intellectual property (IP) cores make them suitable for a large set of applications. They are now used in various fields of numerical and scientific computation [14, 15, 24, 26], image processing [1, 9], communications [18, 21] and cryptographic computations [4, 17] with significant performance. Even current-era supercomputers use FPGAs [6, 23, 25] to off-load and accelerate parallelizable complex routines. As a result, this work is primarily aimed at an improved implementation of double-precision floating-point multiplication on the FPGA platform.

The crucial part of the floating-point multiplication lies in the mantissa multiplication, which is the main performance bottleneck. The mantissa of a double-precision floating-point number is 53 bits in length, and in general this would require the implementation of a 53 × 53 multiplier in hardware, which is very expensive. In this work, an approach for the multiplication of double-precision floating-point numbers has been proposed which uses fewer multiplications, achieving excellent performance at a relatively low cost in hardware resources. The results are compared with the optimized implementations of the Sandia [10, 11] and NEU [3, 19] (Northeastern University, Boston) floating-point library multipliers, among others [2, 28, 29]. The implementation is confined to normalized numbers only, and is designed for an optimum latency that balances area and performance. The design has been carried out using the Xilinx ISE synthesis tool, the ModelSim SE simulation tool, and Xilinx Virtex-IV (xc4vlx160-12ff1513) and Virtex-5 (xc5vlx155-3ff1760) FPGA platforms for implementation and comparison of the results. The design has also been targeted to the ASIC platform: we present the ASIC synthesis result of the design using the UMC 130-nm Standard Cell Library with a target speed of 500 MHz, using the Synopsys Design Compiler synthesis toolset.

The contribution of this paper can be summarized as follows:

1. Application of the Karatsuba technique for an area-efficient implementation of a double-precision floating-point multiplier.
2. Wise use of FPGA resources (embedded DSP48, slices) for better area utilization.
3. Reduction of at least 33 % in embedded multiplier blocks.
4. A best operating frequency achieved with the help of balanced pipelining.
5. Best area and performance numbers compared with the most recent implementations in the literature.
6. A direct implementation on a 130-nm Standard-Cell ASIC target has also been shown.


The next section (Sect. 2) discusses the basic flow of how a floating-point multiplication is carried out. Section 3 explains our idea of mantissa multiplication with a reduced number of multiplications and its implementation. Section 4 discusses the complete implementation with the further processing required in floating-point multiplication. Section 5 gives the implementation details (hardware utilization and performance measures) of the design. Comparisons with previously reported implementations are shown in Sect. 6, and finally the paper concludes in Sect. 7.

2 Background

The IEEE (Institute of Electrical and Electronics Engineers) standard for floating point (IEEE-754) defines a binary format for representing floating-point numbers [12]. This standard specifies how single-precision (32-bit) and double-precision (64-bit) floating-point numbers are to be represented. Some aspects of how certain arithmetic operations are to be carried out are also defined in this standard. Goldberg [8] provides a lot of useful information on floating-point arithmetic and its implementation. The format of a floating-point number is as follows:

    Single precision (32-bit):   1-bit sign | 8-bit exponent  | 23-bit mantissa
    Double precision (64-bit):   1-bit sign | 11-bit exponent | 52-bit mantissa

Floating-point arithmetic is widely used in many scientific and signal processing applications. The FPU arithmetic operations include addition, subtraction, multiplication, inverse (reciprocal), division, square root, inverse square root, etc. In the context of the present work, we have designed a double-precision multiplier for an FPGA platform. In general, a floating-point arithmetic implementation processes the sign, exponent and mantissa parts of the operands separately and then combines them after rounding and normalization. A brief overview of the computational flow of this operation is given below.

The floating-point multiplier is a relatively simple arithmetic unit, except that it requires a large multiplier for the mantissa multiplication, which limits the performance and area of the hardware design. The product can be computed in the following steps (a behavioral sketch of these steps is given after the list):

1. XOR the sign bits of both numbers to get the sign bit of the product.
2. Add both operands' exponents.
3. Perform the multiplication of both mantissas.
4. Perform the rounding of the mantissa product.
5. Finally, normalize to adjust the exponent and mantissa.
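As an illustration only, the following behavioral Verilog sketch walks through the five steps for normalized operands; it is not the proposed hardware. Exceptional cases are ignored, the rounding step is replaced by simple truncation (rounding is discussed in Sect. 4.3), and the module and signal names are ours.

    module fp64_mul_steps (
      input  [63:0] a, b,
      output [63:0] p
    );
      wire         sign = a[63] ^ b[63];                        // step 1: sign
      wire [10:0]  expo = a[62:52] + b[62:52] - 11'd1023;       // step 2: exponent (no overflow/underflow check)
      wire [105:0] prod = {1'b1, a[51:0]} * {1'b1, b[51:0]};    // step 3: 53 x 53 mantissa product
      wire         c    = prod[105];                            // product >= 2.0, needs normalization
      wire [51:0]  frac = c ? prod[104:53] : prod[103:52];      // steps 4-5 simplified: truncate and normalize
      assign p = {sign, expo + c, frac};
    endmodule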


Fig. 1 Flow Chart for Floating-Point Multiplication

Rounding of the result is an essential part of floating-point arithmetic. Even with floating point, there is still a finite precision to the computations; the results of calculations therefore need to be restricted to the given precision, and this makes rounding necessary. The IEEE 754 standard defines a number of different "rounding modes." These are: round toward zero (also called truncation); round toward positive infinity (round towards +∞ regardless of the value); round toward negative infinity (round towards −∞ regardless of the value); and round to the nearest (use the representation nearest to the desired value). The details can be obtained from Refs. [8, 12]. Normalization in all the operations is used to bring the final result into the standard floating-point format. The complete processing of a floating-point multiplication is shown in Fig. 1.

3 Design Approach

The basis of this design is the Karatsuba multiplication technique [16]. Karatsuba multiplication is a fast multiplication algorithm: it reduces the multiplication of two n-digit numbers from the n^2 single-digit multiplications of the schoolbook method to at most 3n^(log_2 3) ≈ 3n^(1.585) single-digit multiplications. The algorithm is based on the divide-and-conquer paradigm and proceeds in the following way.


Let W and X be two n-digit numbers. By breaking these numbers at some base B, we can write them as

    W = W1·B^m + W0,    X = X1·B^m + X0,

where W0 and X0 are m-digit numbers. The product of W and X can then be written as

    W·X = (W1·B^m + W0)(X1·B^m + X0)
        = W1·X1·B^(2m) + (W1·X0 + W0·X1)·B^m + W0·X0
        = α·B^(2m) + β·B^m + γ,

where

    α = W1·X1,    β = W1·X0 + W0·X1,    γ = W0·X0.

Thus, computed directly, we need four multiplications to get the complete result. But, following Karatsuba, only three multiplications are needed. This works as follows. We can rewrite β as

    β = (W1·X0 + W0·X1) + (W1·X1 + W0·X0) − (W1·X1 + W0·X0)
      = (W1 + W0)(X1 + X0) − W1·X1 − W0·X0
      = (W1 + W0)(X1 + X0) − α − γ,

which requires only one multiplication instead of two, with some extra overhead of addition and subtraction. Thus, to get the complete product of W and X we need three multiplications instead of four. The size of each multiplier can be reduced further by applying this technique recursively to each individual multiplication, down to single-bit operands if desired.

Toom and Cook [5, 27] extended Karatsuba's idea by splitting the operands into more than two parts. Splitting the operands into three parts in this way reduces the number of multiplications from 9 to 5, compared with the 4-to-3 reduction of Karatsuba, but it requires complex post-processing of the data. In the present work, we have extended the Karatsuba technique by splitting the operands into three parts, as described below (a sketch of the basic two-way identity is given first).
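To make the three-multiplication identity concrete, the following parameterized combinational Verilog module is an illustrative sketch of our own (it is not part of the paper's RTL); it computes a WIDTH-bit product from the two half-size products α and γ and the single middle product (W1 + W0)(X1 + X0), with WIDTH assumed even.

    module karatsuba2 #(parameter WIDTH = 16) (
      input  [WIDTH-1:0]   w, x,
      output [2*WIDTH-1:0] p
    );
      localparam M = WIDTH/2;                        // WIDTH assumed even
      wire [M-1:0]   w0 = w[M-1:0],     x0 = x[M-1:0];
      wire [M-1:0]   w1 = w[WIDTH-1:M], x1 = x[WIDTH-1:M];

      // the three multiplications of the Karatsuba scheme
      wire [2*M-1:0] alpha = w1 * x1;                // high x high
      wire [2*M-1:0] gamma = w0 * x0;                // low  x low
      wire [2*M+1:0] beta  = (w1 + w0) * (x1 + x0)   // one (M+1)-bit multiplication
                             - alpha - gamma;        // correction, result stays non-negative

      // recombination: p = alpha*B^2 + beta*B + gamma, with B = 2^M
      assign p = (alpha << (2*M)) + (beta << M) + gamma;
    endmodule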

We divide the numbers W and X as

    W = W2·B^(2m) + W1·B^m + W0,    X = X2·B^(2m) + X1·B^m + X0,

where W1, X1, W0 and X0 are m-digit numbers. The product of W and X then becomes

    W·X = (W2·B^(2m) + W1·B^m + W0)(X2·B^(2m) + X1·B^m + X0)
        = W2·X2·B^(4m) + (W2·X1 + W1·X2)·B^(3m) + (W2·X0 + W0·X2)·B^(2m)
          + W1·X1·B^(2m) + (W1·X0 + W0·X1)·B^m + W0·X0
        = α2·B^(4m) + α1·B^(2m) + α0 + β2·B^(3m) + β1·B^(2m) + β0·B^m,

where

    α2 = W2·X2,    α1 = W1·X1,    α0 = W0·X0,

and

    β2 = W2·X1 + W1·X2,    β1 = W2·X0 + W0·X2,    β0 = W1·X0 + W0·X1.

Up to this level we need 9 multiplications to accomplish the task. The number of multiplications can be reduced by rewriting β2, β1 and β0 as

    β2 = (W2·X1 + W1·X2) + (W2·X2 + W1·X1) − (W2·X2 + W1·X1)
       = (W2 + W1)(X2 + X1) − α2 − α1,

and similarly

    β1 = (W2 + W0)(X2 + X0) − α2 − α0,
    β0 = (W1 + W0)(X1 + X0) − α1 − α0.

In this way the number of multiplications is reduced from 9 to 6 while still giving the complete multiplication result. The overhead is some extra additions and subtractions, whose cost is much lower than that of a multiplier. We have adopted this scheme for the mantissa multiplication of the double-precision floating-point multiplier. Both 53-bit mantissas (including the hidden bit) are broken into three parts:

    W = W2 | W1 | W0 = 1x···xx (17-bit) | xx···xx (18-bit) | xx···xx (18-bit)
    X = X2 | X1 | X0 = 1x···xx (17-bit) | xx···xx (18-bit) | xx···xx (18-bit)

With this split, α2 = W2·X2 requires a 17-bit unsigned multiplier, while α1 = W1·X1 and α0 = W0·X0 each need an 18-bit unsigned multiplier. The computation of each of β2, β1 and β0 requires a 19-bit unsigned multiplier. The construction of the 18-bit and 19-bit multipliers is shown in Fig. 2.

Fig. 2 18-bit and 19-bit Multipliers
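For illustration, the six products and their recombination for the 53-bit mantissas (m = 18, B = 2^18) can be written as the flat combinational Verilog sketch below; the module and signal names are ours, and this model ignores the pipelining and the DSP48 mapping of the actual design.

    module mant53_mul (
      input  [52:0]  w, x,       // 53-bit mantissas with the hidden '1'
      output [105:0] p           // full 106-bit product
    );
      // three-way split: one 17-bit MSB part and two 18-bit parts (m = 18)
      wire [16:0] w2 = w[52:36], x2 = x[52:36];
      wire [17:0] w1 = w[35:18], x1 = x[35:18];
      wire [17:0] w0 = w[17:0],  x0 = x[17:0];

      // the three "alpha" products: one 17-bit and two 18-bit multipliers
      wire [33:0] a2 = w2 * x2;
      wire [35:0] a1 = w1 * x1;
      wire [35:0] a0 = w0 * x0;

      // the three "beta" products: 19-bit multipliers plus correction subtractions
      wire [37:0] b2 = (w2 + w1) * (x2 + x1) - a2 - a1;
      wire [37:0] b1 = (w2 + w0) * (x2 + x0) - a2 - a0;
      wire [37:0] b0 = (w1 + w0) * (x1 + x0) - a1 - a0;

      // recombination: p = a2*B^4 + b2*B^3 + (a1 + b1)*B^2 + b0*B + a0, equal to w * x
      assign p = (a2 << 72) + (b2 << 54) + ((a1 + b1) << 36) + (b0 << 18) + a0;
    endmodule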

Table 1 Effective use of DSP48 in a 19-bit multiplication (on Virtex-4)

                 Effective use    Simple use
    Slices       79               98
    Delay (ns)   6.481            7.429

In these multipliers, the partial product P0 is computed by a dedicated hard multiplier IP core, whereas all the other partial products (P1, P2, and P3) are computed using logic resources; their implementation is simple and straightforward (a set of AND gates and adder logic). In general, all the partial products can simply be added to obtain the multiplication result. However, the implementation can be optimized when a DSP48 IP is used for the 17-bit multiplication of partial product P0 (instead of the MULT18x18 of Virtex-II Pro or older FPGAs). The built-in 48-bit adder of the DSP48 IP can be used to save some logic resources: the sum of the partial products P1, P2, and P3 is supplied to the DSP48 IP (when used) and added to P0 inside the block to obtain the multiplication result. The difference between the implementations with and without DSP48 for the 19-bit multiplication can be seen from Table 1; the implementation details were taken after post-PAR analysis. We can see that both the resources and the delay become smaller with an effective use of the DSP48. By using these 18-bit and 19-bit multipliers and some extra adders/subtractors, we can compute all the α's and β's, and, finally, by combining them we obtain the complete 53-bit mantissa multiplication result using only 6 multiplier IPs, a reduction of at least 33 % (compared with 9 multiplier IPs for a straightforward 53-bit multiplier).
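One plausible way to realize the 19-bit multiplier around a single 17 × 17 hard multiplier is sketched below. The labels P0-P3 follow the text, but the exact partitioning in Fig. 2 and the way the design drives the DSP48 post-adder may differ from this assumption.

    module mul19 (
      input  [18:0] a, b,
      output [37:0] p
    );
      wire [16:0] al = a[16:0],  bl = b[16:0];   // low 17 bits
      wire [1:0]  ah = a[18:17], bh = b[18:17];  // high 2 bits

      wire [33:0] p0 = al * bl;   // 17 x 17: candidate for the hard multiplier (DSP48 / MULT18x18)
      wire [18:0] p1 = ah * bl;   // 2 x 17: AND gates plus a small adder
      wire [18:0] p2 = al * bh;   // 17 x 2
      wire [3:0]  p3 = ah * bh;   // 2 x 2

      // The bracketed sum can be fed to the DSP48 so that the final addition with p0
      // reuses the block's internal 48-bit adder instead of extra slice logic.
      assign p = p0 + ((p1 + p2 + (p3 << 17)) << 17);
    endmodule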

4 Implementation

The implementation of the floating-point multiplication requires processing the sign, exponent and mantissa separately (as shown in Fig. 1). Each of these is discussed below.

4.1 Sign and Exponent Computation

These computations are done in a straightforward manner. The output sign is the logical XOR of the sign bits of both operands:

    Sign_out = Sign_in1 ⊕ Sign_in2.

The output exponent is given by the addition of both input exponents, adjusted by the BASE, i.e.

    Exp_out = Exp_in1 + Exp_in2 − 1023.

For double-precision floating-point numbers the BASE is 1023 (2^(11−1) − 1). The BASE for any floating-point format is given by 2^(exponent bits − 1) − 1.
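A minimal RTL sketch of this sign and exponent data path is given below; the port names are assumed, and no exceptional cases are handled here. The exponent is kept wide and signed so that the logic of Sect. 4.2 can test for underflow and overflow.

    module sign_exp (
      input  [63:0]        op1, op2,
      output               sign_out,
      output signed [12:0] exp_out    // <= 0 signals underflow, > 2046 signals overflow
    );
      assign sign_out = op1[63] ^ op2[63];
      // biased exponents added, then the BASE (1023) subtracted
      assign exp_out  = $signed({2'b00, op1[62:52]}) + $signed({2'b00, op2[62:52]}) - 13'sd1023;
    endmodule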


4.2 Exceptional Case Handling

As defined by the IEEE standard [12], several exceptional cases such as NaN, INFINITE, ZERO, UNDERFLOW and OVERFLOW can appear in any floating-point arithmetic. Thus, the main computation has been combined with the detection of all the exceptional cases, and the final output is determined as per the standard. All the exceptional cases are handled in line with the IEEE-754 standard. For example, if either or both of the operands are infinite, an infinity is produced as output (with the computed sign bit). If either of the input operands is denormalized, the output will be zero (with the respective sign bit). If the output exponent goes to ZERO or below ZERO, UNDERFLOW is activated, and if it goes beyond 11'h7fe (2046 in decimal), OVERFLOW is activated. The entire procedure is shown in Algorithm 1 (where EXP_OUT is the output exponent, MANT_OUT is the output mantissa, and the other keywords have their literal meaning). In addition, when one operand is infinite and the other is denormalized, an INVALID operation is indicated (Algorithm 2), and the result is the NaN output.

Algorithm 1 Exceptional Case Handling
  if any operand is infinite then
    EXP_OUT = 11'h7ff; MANT_OUT = 52'h0000000000000;
  else if any operand is denormalized then
    EXP_OUT = 0; MANT_OUT = 0;
  else if output exponent ≤ ZERO then
    EXP_OUT = 0; MANT_OUT = 0; UNDERFLOW = 1;
  else if output exponent > 11'h7fe then
    EXP_OUT = 11'h7ff; MANT_OUT = 0; OVERFLOW = 1;
  else
    EXP_OUT = estimated exponent; MANT_OUT = estimated mantissa;
    UNDERFLOW = 0; OVERFLOW = 0;
  end if
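Algorithm 1 maps naturally onto a priority-ordered combinational block. The following is our own illustrative rendering with assumed port names; NaN inputs and the INVALID case of Algorithm 2 are not covered here.

    module exception_handle (
      input                in1_inf,    in2_inf,     // operand is +/- infinity
      input                in1_denorm, in2_denorm,  // operand is denormalized
      input  signed [12:0] exp_est,                 // estimated output exponent (may be <= 0)
      input         [51:0] mant_est,                // estimated (rounded) mantissa
      output reg    [10:0] exp_out,
      output reg    [51:0] mant_out,
      output reg           underflow, overflow
    );
      always @(*) begin
        underflow = 1'b0;
        overflow  = 1'b0;
        if (in1_inf | in2_inf) begin                    // any operand infinite
          exp_out = 11'h7ff;  mant_out = 52'h0;
        end else if (in1_denorm | in2_denorm) begin     // any operand denormalized
          exp_out = 11'h0;    mant_out = 52'h0;
        end else if (exp_est <= 0) begin                // output exponent <= ZERO
          exp_out = 11'h0;    mant_out = 52'h0;  underflow = 1'b1;
        end else if (exp_est > 13'sd2046) begin         // output exponent > 11'h7fe
          exp_out = 11'h7ff;  mant_out = 52'h0;  overflow = 1'b1;
        end else begin
          exp_out  = exp_est[10:0];                     // normal case
          mant_out = mant_est;
        end
      end
    endmodule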


Algorithm 2 Invalid Operation
  if one operand is infinite and the other is denormalized then
    INVALID_OPERATION = 1;
    EXP_OUT = 11'h7ff; MANT_OUT = 52'h8000000000000;
  else
    INVALID_OPERATION = 0;
  end if

Table 2 Hardware utilization and performance for the design

                  V4                          V5
    Latency       10                          10
    Slices        840 (863 LUTs, 847 FFs)     419 (848 LUTs, 1002 FFs)
    Freq (MHz)    377                         441

    All cases have used 6 DSP48.

4.3 Normalization and Rounding

The requirement of putting the final result back into the 64-bit sign-exponent-mantissa format demands the normalization of the result. Often the mantissa multiplication produces an extra bit in the MSB, before the binary point; sometimes the same situation appears after rounding. These results need to be fixed to obtain the mandatory format of the result.

So, whenever an extra carry is generated after the multiplication or the rounding, the product is right-shifted by one bit and the exponent is incremented by one to normalize the result. Rounding is required in order to trim the 106-bit mantissa multiplication result back to 53 bits. We have implemented only the round-to-nearest rounding specified by the IEEE standard. The error performance has been tested on 5 million random test cases and found to be fully compatible with the expected rounding results. The other rounding methods can also be used, depending on the requirements of the application; since our aim in this work is mainly to reduce the expensive multiplication cost, we have focused on only one rounding method.
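The normalization and rounding path described above can be sketched as follows. This is an illustration only: it rounds on the guard bit alone (no sticky or tie-to-even handling), so it does not necessarily reproduce the exact round-to-nearest behavior of the implemented design, and the port names are assumed.

    module round_norm (
      input  [105:0] prod,      // 53 x 53 mantissa product, value in [1, 4)
      input  [12:0]  exp_in,    // biased exponent before normalization
      output [51:0]  frac_out,  // stored 52-bit fraction (hidden bit dropped)
      output [12:0]  exp_out
    );
      // If the product is >= 2.0 its MSB (bit 105) is set: shift right, bump the exponent.
      wire [105:0] norm  = prod[105] ? (prod >> 1) : prod;
      wire [12:0]  exp_n = exp_in + prod[105];

      // Keep 53 bits (bits 104:52 of the normalized product) and round on the guard bit.
      wire [53:0]  mant_r = {1'b0, norm[104:52]} + norm[51];

      // Rounding may carry out (mantissa becomes exactly 2.0): renormalize once more.
      assign frac_out = mant_r[53] ? mant_r[52:1] : mant_r[51:0];
      assign exp_out  = exp_n + mant_r[53];
    endmodule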

5 Results

The design has been implemented using Verilog-HDL. It has been synthesized and placed and routed on Virtex-IV (xc4vlx160-12ff1513) and Virtex-5 (xc5vlx155-3ff1760) FPGA targets using Xilinx ISE. Simulation results have been analyzed in ModelSim-SE. Hardware utilization and performance of the proposed implementation on the Virtex-IV and Virtex-5 FPGAs are shown in Table 2. All the reported hardware resource estimates were obtained after the place-and-route process of FPGA synthesis. The complete map options are: -timing -ol high -global_opt on -retiming on -register_duplication -equivalent_register_removal off -logic_opt on -xe n -c 100 -cm speed -ignore_keep_hierarchy, and the PAR options are -ol high -xe n. The synthesis options were targeted towards speed optimization of the design. With an area-optimization synthesis target (-c 1 in map) on Virtex-4, the design takes only 646 slices, with a speed of 290 MHz.

Table 3 Mantissa multiplication using the proposed method and the Xilinx core (on Virtex-4)

                 Proposed method    Xilinx core
    Slices       376                146        1500
    DSP48        6                  13         0
    Delay (ns)   15.428             9.564      7.514

To assess the logic complexity of the design without the DSP48 blocks, we also carried out a synthesis on the Virtex-5 platform without DSP48. The entire design then occupies 927 slices (1192 FFs, 3137 LUTs), with an operating speed of 365 MHz. On Virtex-4, the same approach takes 1728 slices (1192 FFs, 3049 LUTs), with an operating speed of 350 MHz. The latency of the design in this case is 9, instead of 10 with DSP48. As the core of the design is an effective implementation of the mantissa multiplication, its hardware utilization and delay, compared against the Xilinx Core 53-bit multiplier, are presented in Table 3. Clearly, the proposed design uses less hardware, and the delay can be optimized by suitable pipelining. An ASIC implementation of the design has been done with Synopsys Design Compiler using the UMC 130-nm (umce13h210t3_wc_108V_125C) Standard Cell Library. The speed was targeted at a period of 2 ns (500 MHz) and was achieved successfully with a required period of 1.87 ns (534 MHz), with a total cell area of 122,957 (616 cells).

6 Comparison

In this section, a comparison with the most optimized reported implementations of double-precision floating-point multipliers is discussed. The comparison with the Xilinx, Sandia [10, 11] and NEU [2, 3, 19, 28, 29] multipliers is shown in Tables 4 and 5. Sandia [11] reported results for both normalized and denormalized implementations, whereas the Xilinx, NEU [2] and proposed methods support only normalized numbers. The proposed design requires slightly more logic than Sandia [11], but with the benefit of a 33 % reduction in the number of multiplier units, and it also achieves improved performance. Hemmert and Underwood [11] proposed an FPGA-level optimization which could also be used in the design proposed here to gain further slice savings. Although the application of the Karatsuba method reduces the costly multiplier blocks, it needs some extra adders/subtractors which increase the slice count of our design. In the current design, the reduction of three multiplier blocks requires some extra logic in the form of three 18-bit adders, three 36-bit adders, and three 38-bit subtractors. An N-bit adder usually takes N LUTs, so this amounts to 18 × 3 + 36 × 3 + 38 × 3 = 276 extra LUTs. That is why the current design has a larger slice count; however, once the three saved multiplier blocks are counted in equivalent logic, the proposed design has a clear area advantage. From a simple synthesis with the Xilinx ISE tool, the equivalent hardware for three embedded DSP48 multiplier blocks amounts to 548 slices (1065 LUTs), where a 17 × 17-bit multiplier and a 34-bit adder are used as an equivalent to a DSP48 block.


Table 4 Comparison on Virtex-IV FPGA

    Method                     Latency   DSP48   Slices                      Freq (MHz)
    Hemmert [10]               11        9       448                         294
    Sandia [11] Denorm.        14        9       737                         274
    Sandia [11] Non-denorm.    10        9       384                         275
    NEU [19]                   5         13      1048                        98
    Venishetti [28]            11        9       2471 (LUTs)                 228
    Wang [29]                  5         13      1048                        98
    Banescu [2]                16        10      729                         338
    Xilinx [2, 30]             22        16      561                         321
    Proposed (area)            10        6       646 (760 FFs, 865 LUTs)     290
    Proposed (speed)           10        6       840 (847 FFs, 863 LUTs)     377

Table 5 Comparison on Virtex-V FPGA

    Method           Latency   DSP48   LUTs   FFs    Freq (MHz)
    Xilinx [2, 30]   18        10      339    482    319
    Banescu [2]      14        9       804    804    407
    Banescu [2]      13        9       1184   1080   407
    Proposed         10        6       848    1002   441

Thus, from a total equivalent hardware perspective, the proposed design saves much hardware by incorporating the Karatsuba technique, whereas the balanced pipelining of the design leads to a better operating speed. The benefit of this method will also have a major impact in an ASIC or custom realization of the circuit. A comparison with recent literature [2] is also shown here, on both Virtex-4 and Virtex-5; Banescu et al. [2] is an extension of de Dinechin and Pasca's work [7]. On both platforms, the proposed design is better in terms of hardware as well as speed.

The major benefit of the current design is the reduction in the number of multiplier IP cores together with relatively higher performance at a low latency. The balanced pipelining of the critical data path into several equivalent slots helps to achieve a better operating frequency. In the current work, the primary critical-path concern comes from the mantissa multiplication, with several multiplier blocks and a large adder/subtractor to accumulate the partial products. First, the application of Karatsuba reduces the number of multiplier blocks, which reduces the amount of logic and thus the related data-path and routing delay. Further, partitioning the partial-product accumulation, by incorporating a carry-save adder/subtractor scheme for the large operand sizes into balanced data-path stages, helps in achieving a better performance number. Moreover, the use of the fast internal adders of the DSP48 IP block, along with its multiplier, helps to reduce some slices and delays in the data path. All other related floating-point operations, for the exponent, normalization, rounding and exceptional-case handling, are relatively simpler, and it is much easier to partition their data paths, which generally contain some comparators, small adders/subtractors and multiplexers.

Thus, the proposed work achieves the implementation of double-precision floating-point multiplication on an FPGA platform with 33 % fewer multiplier blocks, which is a major achievement in terms of total equivalent area requirement. Moreover, the proper partitioning of the data path and the efficient use of FPGA resources help in getting a better performance number. Thus, the end user can get a significant benefit in terms of area as well as frequency by incorporating the present work in their relevant target applications.

7 Conclusion

We have presented an efficient architecture for the implementation of double-precision floating-point multiplication on FPGAs, together with Standard-Cell based ASIC results. The proposed module achieves high performance compared with other modules previously reported in the literature on the FPGA platform. The major benefit of the design is the reduced number of multipliers (a reduction of at least 33 %) compared with other reported works. Adopting the proposed approach leads to a significant reduction in resources, along with an improvement in performance. Furthermore, this approach can improve parallelization by allowing more multipliers on an FPGA, as well as an area reduction on the ASIC platform.

Acknowledgement The authors would like to thank the anonymous reviewers, whose comments and suggestions helped considerably to improve the paper. This work is supported by the City University of Hong Kong (Project No. 7200179).

References

1. V. Aggarwal, A.D. George, K.C. Slatton, Reconfigurable computing with multiscale data fusion for remote sensing, in Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays (FPGA-06) (ACM, New York, 2006), p. 235. doi:10.1145/1117201.1117261
2. S. Banescu, F. de Dinechin, B. Pasca, R. Tudoran, Multipliers for floating-point double precision and beyond on FPGAs. Comput. Archit. News 38, 73–79 (2011). doi:10.1145/1926367.1926380
3. P. Belanovic, M. Leeser, A library of parameterized floating-point modules and their use, in 12th International Conference on Field-Programmable Logic and Applications (FPL-02) (Springer, London, 2002), pp. 657–666
4. W. Chelton, M. Benaissa, Fast elliptic curve cryptography on FPGA. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 16(2), 198–205 (2008). doi:10.1109/TVLSI.2007.912228
5. S.A. Cook, On the minimum computation time of functions, Ph.D. thesis, Harvard University, Department of Mathematics, 1966, http://cr.yp.to/bib/1966/cook.html
6. Cray XD1 Supercomputers (2008). http://www.cray.com/
7. F. de Dinechin, B. Pasca, Large multipliers with fewer DSP blocks, in International Conference on Field Programmable Logic and Applications (2009), pp. 250–255. doi:10.1109/FPL.2009.5272296
8. D. Goldberg, What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. 23(1), 5–48 (1991). doi:10.1145/103162.103163
9. Z. Guo, W. Najjar, F. Vahid, K. Vissers, A quantitative analysis of the speedup factors of FPGAs over processors, in Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays (FPGA-04) (ACM, New York, 2004), pp. 162–170. doi:10.1145/968280.968304
10. K.S. Hemmert, K.D. Underwood, Open source high performance floating-point modules, in 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM-06) (2006), pp. 349–350. doi:10.1109/FCCM.2006.54
11. K.S. Hemmert, K.D. Underwood, Fast, efficient floating point adders and multipliers for FPGAs. ACM Trans. Reconfigurable Technol. Syst. 3(3), 11 (2010)
12. IEEE, IEEE standard for binary floating-point arithmetic. ANSI/IEEE Std 754-1985 (1985). doi:10.1109/IEEESTD.1985.82928
13. IEEE, IEEE standard for floating-point arithmetic. IEEE Std 754-2008, pp. 1–58 (2008). doi:10.1109/IEEESTD.2008.4610935
14. M.K. Jaiswal, N. Chandrachoodan, A high performance implementation of LU decomposition on FPGA, in 13th VLSI Design and Test Symposium (VDAT-2009) (2009), pp. 124–134
15. M.K. Jaiswal, N. Chandrachoodan, FPGA based high performance and scalable block LU decomposition architecture. IEEE Trans. Comput. 61, 60–72 (2012). doi:10.1109/TC.2011.24
16. A. Karatsuba, Y. Ofman, Multiplication of many-digital numbers by automatic computers, in Proceedings of the USSR Academy of Sciences, vol. 145 (1962), pp. 293–294
17. C.H. Kim, S. Kwon, C.P. Hong, FPGA implementation of high performance elliptic curve cryptographic processor over GF(2^163). J. Syst. Archit. 54(10), 893–900 (2008). doi:10.1016/j.sysarc.2008.03.005
18. A. Koohi, N. Bagherzadeh, C. Pan, A fast parallel Reed–Solomon decoder on a reconfigurable architecture, in First IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (2003), pp. 59–64
19. M. Leeser, VFloat: the Northeastern variable precision floating point library (2008), http://www.ece.neu.edu/groups/rpl/projects/floatingpoint/
20. G. Lienhart, A. Kugel, R. Manner, Using floating-point arithmetic on FPGAs to accelerate scientific n-body simulations, in 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'02) (IEEE Comput. Soc., Los Alamitos, 2002)
21. H. Parizi, A. Niktash, A. Kamalizad, N. Bagherzadeh, A reconfigurable architecture for wireless communication systems, in Third International Conference on Information Technology: New Generations (2006), pp. 250–255. doi:10.1109/ITNG.2006.16
22. S. Paschalakis, P. Lee, Double precision floating-point arithmetic on FPGAs, in 2nd IEEE International Conference on Field Programmable Technology (FPT'03) (2003), pp. 352–358
23. SGI Supercomputers. http://www.sgi.com/
24. M. Smith, J. Vetter, X. Liang, Accelerating scientific applications with the SRC-6 reconfigurable computer: methodologies and analysis, in Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (2005), p. 157b
25. SRC Supercomputers (2008). http://www.srccomp.com/
26. O. Storaasli, R.C. Singleterry, S. Brown, Scientific Computation on a NASA Reconfigurable Hypercomputer (2002)
27. A.L. Toom, The complexity of a scheme of functional elements realizing the multiplication of integers, in Soviet Math., vol. 4 (1963), p. 4 (translations of Dokl. Akad. Nauk SSSR). http://www.de.ufpe.br/~toom/articles/rusmat/Multipli.pdf
28. S. Venishetti, A. Akoglu, A highly parallel FPGA based IEEE-754 compliant double-precision binary floating-point multiplication algorithm, in International Conference on Field-Programmable Technology (ICFPT 2007) (2007), pp. 145–152. doi:10.1109/FPT.2007.4439243
29. X. Wang, M. Leeser, VFloat: a variable precision fixed- and floating-point library for reconfigurable hardware. ACM Trans. Reconfigurable Technol. Syst. 3, 16 (2010). doi:10.1145/1839480.1839486
30. Xilinx, Xilinx floating-point IP core. http://www.xilinx.com
