
IJSART - Volume 1

Issue 10 –OCTOBER 2015

ISSN [ONLINE]: 2395-1052

FPGA Implementation of Multiply-Add Unit based on 2^n+1 Modulo Arithmetic

E. Jebamalar Leavline, N. Renukadevi, R. Sivapriyanka, G. Santhiya

Department of Electronics and Communication Engineering

Bharathidasan Institute of Technology, Anna University, Tiruchirappalli – 620 024, India

ABSTRACT- In real-time applications, digital signal processors (DSPs) need a dedicated unit for fast computation of convolution, correlation, autocorrelation and cross-correlation. The basic arithmetic operations involved in these computations are addition and multiplication, so efficient adder and multiplier circuits need to be designed. In this paper, we present the design of a dot product unit, a multiply-add unit and a multiplier, and compare their performance in terms of delay, power and device utilization.

Keywords: Convolution, multiply-add unit, DSP processor, dot product.

I. INTRODUCTION

In digital signal processing, the basic operations behind most signal processing tasks are convolution, correlation, autocorrelation and cross-correlation. For fast computation of these operations, DSP processors need a dedicated unit. So far, DSP processors (e.g., the TMS320 family) have a MAC (Multiply-Accumulate) unit, and FPGA families contain multipliers for processing signals. This work presents a fused architecture for computing the dot product using modular arithmetic. A fused architecture combines the multiplier and the adder in the same unit.

Modular arithmetic is a system of arithmetic for integers in which numbers "wrap around" upon reaching a certain value, the modulus. The modern approach to modular arithmetic was developed by Carl Friedrich Gauss. Modular arithmetic is widely used in digital signal processing.

This paper reports the analysis of various existing architectures, such as the Booth architecture, the modulo 2^n + 1 architecture and the array architecture. The array architecture proposed by C. Efstathiou is very efficient compared to the other architectures because it does not use a separate unit for correction factors; it simply uses an inverted end around carry (IEAC). Partial products are normally formed by a multiply-and-shift process, which is tedious for signal processing. To solve this problem, this architecture arranges all the partial products in a matrix format. So, whenever n-bit signals are given to this unit, it needs only n + 1 bits to process them under the modulo 2^n + 1 concept.

II. ADDER AND MULTIPLIER ARCHITECTURES

This section discusses the IEAC CSA adder, the modulo 2^n + 1 adder and the modulo 2^n + 1 multiplier architectures.

A. IEAC (Inverted End Around Carry) CSA Adder

The inverted EAC CSA tree reduces the partial products. The CSA tree is usually constructed with full adders (FA). The IEAC adder accepts two n-bit operands and provides their sum. An IEAC adder can be implemented using an integer adder whose carry output is connected back to its carry input through an inverter.

B. MODULO 2^n+1 ADDER

The moduli set (2^n - 1, 2^n, 2^n + 1) and its extensions have received significant attention because they provide simple and efficient implementations. In many residue number system (RNS) based systems, the modulo 2^n + 1 arithmetic units become a bottleneck because they have to deal with (n + 1)-bit operands, whereas the remaining modulo 2^n - 1 and modulo 2^n units operate on operands of width n or less. The diminished-1 representation of binary numbers is used to speed up modulo 2^n + 1 arithmetic operations [4]. Since only n bits are required to represent the magnitudes, the diminished-1 representation can lead to implementations with delay and area approaching those of the modulo 2^n - 1 and modulo 2^n units. However, translating between the weighted and the diminished-1 representations costs time and power, which makes efficient modulo 2^n + 1 functional units that operate directly on conventional (weighted) operands preferable in many applications [2].

C. MODULO 2^n+1 MULTIPLIER

Multiplication is an important operation in today's electronic systems and is used extensively in DSP algorithms such as FIR and IIR filters. Of the multipliers considered here, the modulo 2^n + 1 multiplier for weighted operands is the most efficient. The required multiplication array can be reduced to n x n because several groups of partial product bits cannot be at 1 simultaneously. The multiplier architecture is based on merging the correction factors that result from the formation and the reduction of the new partial products into a single correction factor [3].

Partial product formation: Let A = a_n a_(n-1) ... a_1 a_0 and B = b_n b_(n-1) ... b_1 b_0 denote two (n + 1)-bit numbers in the range [0, 2^n + 1). The objective is to reduce the multiplication array from (n + 1) x (n + 1) bits down to n x n bits.

Partial-product reduction: The n partial products and the correction factor, that is, n + 1 operands in total, must be added modulo 2^n + 1. This can be performed using a carry save adder (CSA) array. The carry outputs at the most significant bit position of each stage are used as carry inputs of the subsequent stage: each carry out of the most significant bit position is complemented and added at the least significant bit position of the next stage, forming an inverted EAC CSA tree.

Final-stage addition: A straightforward implementation of the multiplier uses an extra partial product, along with a fast modulo 2^n + 1 adder which accepts the two summands produced by the reduction scheme (array/tree) and produces the product. Fig. 1 shows the block diagram of the modulo 17 multiplier.

Fig. 1. Modulo 17 multiplier
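The inverted end-around-carry reduction and the final modulo 2^n + 1 addition described above can be sketched in Python as a bit-level behavioural model. This is only an illustrative sketch, not the authors' HDL: it assumes diminished-1 operands (each value A in [1, 2^n] is stored as A - 1), it assumes zero operands are handled separately, and the function names ieac_csa, ieac_add and add_many are hypothetical.

```python
def ieac_csa(x, y, z, n):
    """One inverted-EAC carry-save stage on three n-bit diminished-1 vectors.

    Returns (s, c) such that s + c == x + y + z + 1 (mod 2^n + 1), which keeps
    the pair in diminished-1 form.
    """
    mask = (1 << n) - 1
    s = x ^ y ^ z                          # bitwise sum of the full adders
    carry = (x & y) | (x & z) | (y & z)    # bitwise carry of the full adders
    msb = (carry >> (n - 1)) & 1           # carry out of the top bit position
    # Shift carries left; the top carry re-enters at the LSB, inverted.
    c = ((carry << 1) & mask) | (msb ^ 1)
    return s, c


def ieac_add(a, b, n):
    """Final diminished-1 addition: n-bit adder with inverted carry fed back.

    Returns (sum_diminished_1, is_zero); is_zero flags a result of 0,
    which the diminished-1 form cannot encode in n bits.
    """
    mask = (1 << n) - 1
    total = a + b
    carry_out = total >> n
    is_zero = total == mask
    return (total + (carry_out ^ 1)) & mask, is_zero


def add_many(values, n):
    """Add two or more residues in [1, 2^n] modulo 2^n + 1 (assumed nonzero)."""
    vecs = [v - 1 for v in values]         # convert to diminished-1
    while len(vecs) > 2:
        s, c = ieac_csa(vecs[0], vecs[1], vecs[2], n)
        vecs = vecs[3:] + [s, c]
    dim, is_zero = ieac_add(vecs[0], vecs[1], n)
    return 0 if is_zero else dim + 1       # back to the weighted value


# Modulo 17 (n = 4): 16 + 16 + 5 + 9 = 46, and 46 mod 17 = 12.
print(add_many([16, 16, 5, 9], 4))
```

Each carry-save stage needs no separate correction logic because the inverted carry supplies exactly the +1 that the diminished-1 form requires when three operands are compressed into two.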

D. MODULO 2^n+1 DOT PRODUCT UNIT

The modulo 2^n + 1 dot product units are also called inner product units. The residue number system (RNS) confines carry propagation to the individual residue channels, thus offering significant speedup and area reduction over the conventional binary system. This characteristic is advantageous when repetitive arithmetic operations on long operands have to be performed. RNS has been adopted in the design of digital signal processors (DSPs), finite impulse response (FIR) filters, discrete cosine transform (DCT) and fast Fourier transform (FFT) processors, communication components and cryptography.

Dot product: A dot product is an algebraic operation. It is obtained by taking two sequences of numbers of equal length, multiplying the corresponding entries and summing the products [1]:

$$a \cdot b = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$

So, this unit directly computes the sums of products that make up convolution and correlation.
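As a small illustration of the dot product taken modulo 2^n + 1, the following Python sketch reduces the sum of products without a division by using the identity 2^n ≡ -1 (mod 2^n + 1). It is a behavioural model written for this explanation, not the architecture of the paper; the names fold_mod and dot_product_mod and the sample operand values are assumptions.

```python
def fold_mod(x, n):
    """Reduce a non-negative integer modulo 2^n + 1 without division.

    Because 2^n is congruent to -1 modulo 2^n + 1, successive n-bit chunks
    of x are alternately added and subtracted.
    """
    modulus = (1 << n) + 1
    mask = (1 << n) - 1
    acc, sign = 0, 1
    while x:
        acc += sign * (x & mask)
        sign = -sign
        x >>= n
    # Bring the signed accumulator into the range [0, 2^n].
    while acc < 0:
        acc += modulus
    while acc >= modulus:
        acc -= modulus
    return acc


def dot_product_mod(a, b, n):
    """Return |a1*b1 + ... + am*bm| modulo 2^n + 1 for equal-length inputs."""
    if len(a) != len(b):
        raise ValueError("operand vectors must have equal length")
    return fold_mod(sum(x * y for x, y in zip(a, b)), n)


# Modulo 17 (n = 4) with four operand pairs:
# 3*5 + 7*9 + 12*2 + 16*16 = 358, and 358 mod 17 = 1.
print(dot_product_mod([3, 7, 12, 16], [5, 9, 2, 16], 4))
```

The alternating add and subtract of n-bit chunks is the software counterpart of wrapping high-order partial product bits back into the low-order positions with inversion.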

Fig. 2. Architecture of the modulo 17 dot product unit

E. MODULO 2^n+1 MULTIPLY-ADD UNIT

Efficient multiply-add units, which perform the operation X x Y + Z, can facilitate common digital signal processing and cryptography routines. Microprocessors used in embedded systems typically contain a fast multiply-add unit. Long integer arithmetic would profit from a multiply-add-add unit, since this kind of operation is performed in the inner loop of various long integer arithmetic algorithms [5].

Fused multiply-add units, also known as multiply-accumulate units, are common in practice and compute expressions of the form X x Y + Z without computing the product X x Y separately. The modulo 2^n + 1 multiply-add units compute expressions of the form

$$T = |X \times Y + Z|_{2^n+1}$$

The generalized multiply-add (or generalized multiply-accumulate) units compute expressions of the form

$$R = |X_1 \times Y_1 + \cdots + X_m \times Y_m + Z|_{2^n+1}$$

The partial products are reduced to m matrices. The FAs which have an input equal to 0 are simplified to HA modules, while those which have an input equal to 1 are simplified to modified HA modules, which implement the XNOR and OR functions and have the same complexity as the HA modules.
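A golden reference model for the multiply-add expressions T and R can be written in a few lines of Python, for example to check HDL simulation outputs against expected values. This is a minimal sketch under that assumption; the function names and the sample operands are hypothetical and do not come from the paper.

```python
def multiply_add_mod(x, y, z, n):
    """Return T = |x*y + z| modulo 2^n + 1."""
    modulus = (1 << n) + 1
    return (x * y + z) % modulus


def generalized_multiply_add_mod(xs, ys, z, n):
    """Return R = |x1*y1 + ... + xm*ym + z| modulo 2^n + 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have equal length")
    modulus = (1 << n) + 1
    return (sum(x * y for x, y in zip(xs, ys)) + z) % modulus


# Modulo 17 (n = 4): 13*11 + 9 = 152, and 152 mod 17 = 16.
print(multiply_add_mod(13, 11, 9, 4))

# 13*11 + 2*8 + 5*4 + 9 = 188, and 188 mod 17 = 1.
print(generalized_multiply_add_mod([13, 2, 5], [11, 8, 4], 9, 4))
```

With z set to 0, the generalized form reduces to the dot product, which is why the two units share most of their partial product reduction logic.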

Fig. 3. Architecture of the modulo 17 multiply-add unit

III. EXPERIMENTAL RESULTS AND DISCUSSION

A. EXPERIMENTAL SETUP

The experiment is carried out in an HDL environment with the Xilinx ISE 14.5 design suite and a Spartan 3 development board with the specifications shown in Table 1.

B. EXPERIMENTAL RESULTS

The simulation outputs of the dot product unit, the multiply-add unit and the multiplier are shown in Fig. 4, Fig. 5 and Fig. 6 respectively. Table 2 shows the device utilization summary (estimated values) for the dot product unit. The device utilization summaries of the multiply-add unit and the multiplier are given in Table 3 and Table 4 respectively. Table 5 compares the delay and memory usage of the three computation units discussed. Fig. 7 shows the output on the Spartan 3 kit.

Table 1. Specifications of Spartan 3 development board

Fig. 4. Simulation output for dot product unit (No. of operands: 4, No. of inputs: 16 (4 bits for each operand))

Fig. 5. Simulation output for multiply-add unit (No. of operands: 3, No. of inputs: 12 (4 bits for each operand))

Fig. 6. Simulation output for multiplier

Table 2. Device utilization summary (estimated values) for dot product unit

Table 3. Device utilization summary (estimated values) for multiply-add unit

Table 4. Device utilization summary (estimated values) for multiplier

Table 5. Comparison of delay and memory usage

From the experimental results we find the delay for each unit. Although the dot product unit has a higher delay than the other units, it also processes more operands than they do, so it does not take much time in spite of the larger operand count. Hence, digital signal processing operations such as convolution and correlation can be performed at dot-product speed with these hardware architectures.

Fig. 7. Output on Spartan 3 kit

IV. CONCLUSION

In our work, partial products and additive operands are added efficiently using inverted end around carry save adders. This leads to savings in delay compared to discrete units. The dot product unit is very efficient for signal processing operations such as convolution and correlation compared to the multiplier and the multiply-add unit. Our work handles only 4-bit operands. The architecture can be extended to wider operands. A pipelined architecture can also be incorporated to better suit high-speed, low-delay digital signal processing.

REFERENCES

[1] Efstathiou, C., Moschopoulos, N., Voyiatzis, I., & Pekmestzi, K. (2013). On the design of modulo 2^n+1 dot product and generalized multiply-add units. Computers & Electrical Engineering, 39(2), 410-419.

[2] Efstathiou, C., & Voyiatzis, I. (2011). On the diminished-1 modulo 2^n+1 fused multiply-add units. In IEEE 6th International Conference on Design & Technology of Integrated Systems in Nanoscale Era (DTIS) (pp. 1-5).

[3] Vergos, H. T., & Efstathiou, C. (2007). Design of efficient modulo 2^n+1 multipliers. IET Computers & Digital Techniques, 1(1), 49-57.

[4] Vergos, H. T., & Dimitrakopoulos, G. (2012). On modulo 2^n+1 adder design. IEEE Transactions on Computers, 61(2), 173-186.

[5] Efstathiou, C., Moshopoulos, N., Axelos, N., & Pekmestzi, K. (2014). Efficient modulo 2^n+1 multiply and multiply-add units based on modified Booth encoding. Integration, the VLSI Journal, 47(1), 140-147.
