SEGMENTATION BASED DESIGN OF SERIAL PARALLEL MULTIPLIERS

P. Bougas, A. Tsirikos, K. Anagnostopoulos, I. Sideris, K. Pekmestzi
School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece
{paul, andy, kanagno, isidoros, pekmes}@microlab.ntua.gr

Abstract— In this paper, a novel architecture for the implementation of Serial Parallel Multipliers (SPM) is proposed. The proposed multiplier is based on the segmentation of a simple SPM into blocks of equal bit length. This multiplier achieves higher throughput because it requires only a small number of zeros to start a new multiplication cycle, at a moderate hardware expense, and achieves a significant hardware reduction compared to the Double Precision SPM. The proposed technique permits the optimization of the area-time product.
I. INTRODUCTION

Real-time processing systems are implemented using dedicated hardware architectures. Different sample rates require different implementation styles. Bit-serial systems are considered a fair compromise between the required area, the clock rate and the interconnection complexity of the architecture for low and medium rate applications. In cases like cryptography, where long number computations are involved, bit-serial arithmetic remains the most common technique to reduce the wiring down to a reasonable level [1]. Furthermore, in applications targeting programmable logic (FPGAs, SoCs), hardware resources are limited and expensive, while the clock frequency can be increased up to a reasonable level at no extra cost. Parallel architectures for complex algorithms (cryptography or filter structures with a considerable number of taps) cannot be implemented using such devices. The serial stream approach is a better match for their structure and limited resources [2]. Multiplication is the most frequently used arithmetic operation in the above algorithms. The serial/parallel multiplier (SPM) [3] is widely used to perform bit-serial multiplication. The first multiplier architectures date back to the late 1970s [4], [5]. However, their serial implementation has the disadvantage that zero bits must be inserted between two successive multiplications. Thus, the effective throughput of the circuit is decreased. Alternative architectures were presented in [6] and [7] where the full product is generated
0-7803-9390-2/06/$20.00 ©2006 IEEE
with no latency and 100% throughput, at the expense of extra hardware. An interesting classification of the SPM architectures is presented in [9]. The SPMs are divided into systolic and non-systolic structures. The non-systolic structures consist of the quasi-serial structures [10] and the most commonly used family of carry-save add-shift (CSAS) based structures [11], [12], which add one of the raw partial products at each iteration and save the carry for the next step. Systolic SPMs are derived from the non-systolic structures with proper retiming and pipelining techniques. They are divided into unidirectional [13], [14], bidirectional [13], [15] and contraflow [16], [17] structures, according to the flow directions of the operands and the partial product. Another scheme based on modified Booth encoding is presented in [8].
Figure 1. Simple SPM Architecture
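The carry-save add-shift operation and the need for inserted zeros can be illustrated with a small behavioural model. The sketch below is an assumption-based Python model of a generic unsigned CSAS multiplier, not a transcription of the circuit in Fig. 1; the function name `simple_spm` and the register layout are hypothetical. Each cell adds one partial product bit to the sum bit of its left neighbour and to its own saved carry, and n zero bits follow the m bits of the serial operand so that the sum and carry registers are flushed.

```python
def simple_spm(x: int, h: int, m: int, n: int) -> int:
    """Behavioural model of an unsigned carry-save add-shift SPM.

    x is the m-bit serial operand (fed LSB first), h is the n-bit
    parallel operand. One product bit is produced per clock cycle.
    """
    s = [0] * (n + 1)  # sum registers; s[n] is the constant 0 fed to the leftmost cell
    c = [0] * n        # carry registers, one per cell
    product = 0
    for t in range(m + n):                   # m data cycles followed by n zero cycles
        x_t = (x >> t) & 1 if t < m else 0   # inserted zeros after the m-th bit
        new_s = [0] * (n + 1)
        new_c = [0] * n
        for i in range(n):
            # full adder: partial product bit + sum of left neighbour + own carry
            total = (x_t & (h >> i) & 1) + s[i + 1] + c[i]
            new_s[i] = total & 1
            new_c[i] = total >> 1
        s, c = new_s, new_c
        product |= s[0] << t                 # rightmost sum bit is the serial output
    return product
```

For example, `simple_spm(13, 5, m=4, n=3)` returns 65. In this model the multiplier is busy for m + n cycles per product, which is the throughput penalty the later sections aim to reduce.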
Gnanasekaran [11] proposed an n-bit parallel ripple-carry or carry-lookahead adder to perform the addition of the sum and carry vectors. This structure is known as the Fast Serial Parallel Multiplier (FSPM) and produces the n most significant bits of the product in parallel. A parallel-to-serial converter can be used to transform the result into a serial stream. The main disadvantage of this architecture is the increased critical delay induced by the n-bit parallel adder. Others [1] proposed the use of two n-bit shift registers to store the sum and carry vectors. The outputs of the shift registers are connected to a bit-serial adder which computes the high order bits of the product. At the same time a new multiplication operation can be initiated. The critical delay of the circuit remains unaltered, no inserted zeros are required and 100% throughput is achieved. The disadvantage of this
ISCAS 2006
architecture is the area required by the shift registers, which practically increases the overall area of the circuit by 2/3. This architecture is referred to as the Double Precision SPM (DPSPM). All the above schemes have either a simple SPM architecture with reduced throughput or 100% throughput with significant hardware overhead. In order to optimize the design of the SPM, the throughput to area ratio must be maximized. Based on this idea, a new SPM architecture is proposed that permits variable throughput, so that an optimum architecture can be designed for any given combination of operand bit lengths. The organization of the paper is as follows: In Section II, the proposed schemes are presented. In Section III, the area and the effective throughput are calculated and their ratio is optimized. Finally, the optimum architecture is compared with previously presented schemes.

II. PROPOSED SPM
As presented in Section I, the two vectors (sum and carry) must be added either by inserting n zeros, which degrades the throughput, or by downloading them to a double shift register, which significantly increases the hardware. In the proposed SPM schemes the above techniques are combined. Instead of inserting n zeros, the SPM is divided into k blocks of q bits, where k ⋅ q = n, in order to partially add the corresponding bits of the two vectors. By inserting q zeros, each block generates its partial sum as a single vector, reducing the required double shift register to a single one. At the same time each block empties its content after q clock cycles instead of n. The proposed SPM is presented for k = 2 (divided into two blocks) in Fig. 2. The timing diagram for this case is shown in Fig. 3. It is noted that the result is produced with one clock cycle latency. During the first m clock cycles, the circuit operates like the simple SPM and the least significant bits of the product are produced at the PL output. In the following q clock cycles zeros are inserted, emptying each block. MB2 produces the next q bits of the product at the PL output, while MB1 produces the other q bits of the product, which are stored in the shift register of MB2. After the addition and the shifting of the sum and carry vectors, only two bits (S1, C0) remain in MB2, of the same weight as the output of MB1 generated at the m-th clock cycle. These three bits are combined at the FA-C of MB2, producing the next bit of the product at clock cycle m+q. During the following clock cycles FA-C produces the remaining bits of the product. Meanwhile the SPM is empty and ready to start a new multiplication cycle.
Figure 2. The proposed SPM for n=6 bits, k=2 blocks and q=3 bits.
The proposed architecture can be easily extended to a scheme of multiple blocks. A k-block SPM is formed by connecting one MB1 block with k-1 MB2 blocks, as depicted in Fig. 6.
Figure 3. Timing diagram for the proposed SPM.
As shown in Fig. 2 and Fig. 3, the Cn1 signal is zero while the q zeros enter the circuit, and controls the storing of the bits of MB1 in the shift register of MB2. The Cn2 signal is zero during clock cycle m+q, directing the bits S1 and C0 to FA-C, and at the same time it clears the registers of FA0.
The PH and PL outputs of each MB2 block are connected to the Bin and SRin inputs of the next MB2. The SRin input of the leftmost MB2 block is set to 0 and its Bin input is connected to the Bout of MB1. As in the case of two blocks, the result is obtained from the PL output of the rightmost block for the first m+q clock cycles, and the remaining bits of the product are obtained from the PH output of the same block for the next n-q clock cycles (Fig. 3). For synchronization purposes, in the case that q is not a divisor of n, $\lfloor n/q \rfloor$ is the number of MB2 blocks and the remainder q1 of the division is the length of MB1.
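The block partition described above can be written down as a short helper. This is a minimal sketch under the stated reading of the text: when q divides n, there is one MB1 block of q bits and k-1 MB2 blocks; otherwise the quotient gives the number of MB2 blocks and the remainder gives the MB1 length. The function name `segment` and its return layout are hypothetical.

```python
def segment(n: int, q: int) -> tuple[int, int, int]:
    """Split an n-bit SPM into blocks of q bits.

    Returns (k, mb2_blocks, q1): total number of blocks, number of
    MB2 blocks, and the bit length q1 of the single MB1 block.
    """
    if not 1 <= q <= n:
        raise ValueError("q must satisfy 1 <= q <= n")
    quotient, remainder = divmod(n, q)
    if remainder == 0:
        # q divides n: MB1 has the same length as every MB2 block
        return quotient, quotient - 1, q
    # q does not divide n: MB1 takes the remainder of the division
    return quotient + 1, quotient, remainder
```

For the n = 6, q = 3 example of Fig. 2 this gives two blocks with a 3-bit MB1, and for n = 512, q = 27 it gives 18 MB2 blocks and a 26-bit MB1.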
The circuit remains idle for q clock cycles, instead of the n clock cycles of the simple SPM. A new result is produced every m+q clock cycles and the effective throughput is $m/(m+q)$. As the number of blocks increases, the effective throughput also increases, at the expense of an extra serial adder in each block.
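As a quick numerical illustration of the throughput expression, the sketch below compares the simple SPM (n idle cycles) with a segmented one (q idle cycles). The function name `effective_throughput` and the example operand lengths are chosen here for illustration only.

```python
from fractions import Fraction


def effective_throughput(m: int, idle_cycles: int) -> Fraction:
    """Fraction of clock cycles that carry serial operand bits.

    m is the serial operand length; idle_cycles is the number of zeros
    inserted between two multiplications (n for the simple SPM, q for
    the segmented SPM).
    """
    return Fraction(m, m + idle_cycles)


m, n, q = 64, 64, 8
simple = effective_throughput(m, n)     # simple SPM: n inserted zeros
segmented = effective_throughput(m, q)  # segmented SPM: q inserted zeros
print(simple, segmented)                # prints: 1/2 8/9
```

With m = n = 64, segmentation into 8-bit blocks raises the effective throughput from 1/2 to 8/9.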
The proposed architecture can also be adapted for operands in 2’s complement form. In this case, the blocks MB1 and MB2 are converted as shown in Fig. 4 and Fig. 5 respectively. For the addition of the negative terms produced by the sign bits, an inversion of these terms and the addition of a ‘1’ with the corresponding weight are required. This is implemented by an XOR gate controlled by Cn3 for the addition of the last negative partial product term $x_{m-1}h$, where $x_{m-1}$ is the sign bit of x. Also, $h_{n-1}$ represents the sign bit of h, and the leftmost cell, which handles the negative terms $x_i h_{n-1} 2^i$, requires an inverter (NAND gate). For the correct 2’s complement addition of these negative terms, apart from the inversion, the quantity $2^{m-1} + 2^{n-1}$ must be added. When n = m, the Cn3 signal must be 1 during the second clock cycle (to add $2^{m-1} + 2^{n-1} = 2^n$). When n > m, the Cn3 signal must be 1 during the first clock cycle (to add $2^{n-1}$) and the delay element of the sum of FAm-1 must already be initialized to 1 (to add $2^{m-1}$). Finally, when n < m, the Cn3 signal must be 1 at the first clock cycle and at the m-th clock cycle.

Figure 4. The MB1 block for the proposed 2’s complement serial parallel multiplier, q=3.

Figure 5. The MB2 block for the proposed 2’s complement serial-parallel multiplier, q=3.

III. HARDWARE COMPARISON

In Table I, the overall hardware complexity of the proposed SPM, DPSPM and simple SPM is summarized. An n-bit SPM and a length of q bits for each MB2 block are assumed. The overall transistor count is derived according to a 0.13 µm standard cell library [18]. The q1 factor is the MB1 block bit length and is in general not equal to q. Each SPM is divided into $n/q = k$ pieces.

The optimum fragmentation of the SPM architecture is derived from the maximization of the effective throughput to hardware ratio or, equivalently, from the minimization of the area to throughput ratio:

$$\alpha(q) = (85n + 79k - 21q_1 - 79)\,\frac{m+q}{m}$$

In the case that m >> n, the number of inserted zeros does not degrade the overall performance of the circuit, since the effective throughput approaches one. The best choice for q becomes an integer close to n or, equivalently, the optimum architecture tends to the simple SPM architecture.
When q=n and q1=n, the expression of the hardware complexity becomes equal to that of the simple SPM. However, when q=1, the expression does not correspond to the hardware of the DPSPM but rather to a pipelined version of the FSPM architecture with increased complexity. In all other cases, where m is comparable to n, a proper value of q must be computed in order to minimize α(q). Table II summarizes the optimum choice of the SPM architecture for typical values of the operand lengths m and n and the overall gain in α(q) compared to that of the DPSPM and the simple SPM architectures.
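The minimization of α(q) can be sketched as an exhaustive search over the block length. Several assumptions are made in the sketch below and should be checked against the original design: the block partition follows the rule given for MB1 and MB2 in Section II; the simple SPM is taken as 64n transistors with n inserted zeros (the q = q1 = n limit of the expression); the DPSPM is taken as 122n+43 transistors with 100% throughput; and the gain is assumed to be the relative reduction of α. The area-time expression is printed with the constant 79, while the component counts of Table I lead to 89, so the constant is exposed as the parameter `c`. All function names are hypothetical.

```python
def segment(n, q):
    """Return (k, q1): total number of blocks and MB1 length for block size q."""
    quotient, remainder = divmod(n, q)
    return (quotient, q) if remainder == 0 else (quotient + 1, remainder)


def alpha_proposed(n, m, q, c=79):
    """Area to throughput ratio of the segmented SPM (c is 79 or 89, see lead-in)."""
    k, q1 = segment(n, q)
    area = 85 * n + c * (k - 1) - 21 * q1  # same as 85n + c*k - 21*q1 - c
    return area * (m + q) / m


def alpha_simple(n, m):
    """Simple SPM: assumed 64n transistors, n zeros inserted per product."""
    return 64 * n * (m + n) / m


def alpha_dpspm(n):
    """DPSPM: 122n + 43 transistors at 100% throughput."""
    return 122 * n + 43


def best_q(n, m, c=79):
    """Exhaustive search for the block length that minimizes alpha."""
    return min(range(1, n + 1), key=lambda q: alpha_proposed(n, m, q, c))


def gains(n, m, q, c=79):
    """Assumed gain definition: relative reduction of alpha vs. each reference."""
    a = alpha_proposed(n, m, q, c)
    return 1 - a / alpha_simple(n, m), 1 - a / alpha_dpspm(n)
```

Under these assumptions and with c = 79, `gains(64, 64, 8)` evaluates to roughly 20.0% against the simple SPM and 16.5% against the DPSPM. Calling `best_q(n, m)` for other operand lengths gives the block length that minimizes the same expression.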
Figure 6. A serial-parallel multiplier consisting of k-blocks each of them q-bit wide.
TABLE I. HARDWARE COMPLEXITY OF PROPOSED ARCHITECTURE

| SPM type | Hardware complexity | Hardware complexity in transistors |
|---|---|---|
| Proposed SPM | (n+k-1)FA + (3n-q1+k-1)D + 5(k-1)AND + 2(k-1)MUX | 85n+89k-21q1-89 |
| DPSPM | (n+1)FA + (4n+1)D + 2nMUX | 122n+43 |

FA: Full-Adder (22 transistors); D: D Flip-Flop (21 transistors); AND: 2-input AND gate (6 transistors); MUX: 2-to-1 MUX (8 transistors)
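The transistor expressions in Table I follow directly from the cell counts and the per-cell transistor numbers in the legend. The short sketch below recomputes them; the dictionary and function names are hypothetical and used only for this check.

```python
# Transistors per cell, taken from the legend of Table I.
TRANSISTORS = {"FA": 22, "D": 21, "AND": 6, "MUX": 8}


def proposed_cells(n, k, q1):
    """Cell counts of the proposed SPM as listed in Table I."""
    return {
        "FA": n + k - 1,
        "D": 3 * n - q1 + k - 1,
        "AND": 5 * (k - 1),
        "MUX": 2 * (k - 1),
    }


def dpspm_cells(n):
    """Cell counts of the DPSPM as listed in Table I."""
    return {"FA": n + 1, "D": 4 * n + 1, "MUX": 2 * n}


def transistor_count(cells):
    """Total transistor count for a dictionary of cell counts."""
    return sum(TRANSISTORS[name] * count for name, count in cells.items())


# Example: n = 64 split into k = 8 blocks with q1 = 8
print(transistor_count(proposed_cells(64, 8, 8)))  # 5895 = 85*64 + 89*8 - 21*8 - 89
print(transistor_count(dpspm_cells(64)))           # 7851 = 122*64 + 43
```

Summing the per-cell contributions gives 85n+89k-21q1-89 for the proposed SPM and 122n+43 for the DPSPM, in agreement with the third column of the table.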
TABLE II. OPTIMISED SPM ARCHITECTURES

| Case No. | m | n | Optimum q | Blocks | Gain to SPM | Gain to DPSPM |
|---|---|---|---|---|---|---|
| 1 | 64 | 48 | 12 | 4 | 10.2% | 18.2% |
| 2 | 64 | 64 | 8 | 8 | 20% | 16.5% |
| 3 | 128 | 128 | 16 | 8 | 23.8% | 20.3% |
| 4 | 256 | 256 | 16 | 16 | 26.7% | 23.2% |
| 5 | 512 | 512 | 27 | 18 | 28.7% | 25.2% |
| 6 | 1024 | 1024 | 32 | 32 | 30.1% | 26.7% |

Considering that MB1 does not include a shift register, maximizing its bit length minimizes the number of registers required. Therefore, in most cases the optimum q is an exact divisor of n, because q1 then takes its maximum value (q). The only case in Table II in which q is not a divisor of n is n=m=512. Even in this case, the optimum segmentation requires blocks of approximately equal bit length (q1=26, q=27).

As explained earlier in cases 4, 7, 8 and 9, where m is significantly greater than n, the simple SPM is the best choice.

IV. CONCLUSIONS

In this paper, a new SPM architecture is proposed that adds the sum and carry vectors of the n most significant bits of the product in an efficient manner. The architecture uses a single shift register instead of the double shift register of the DPSPM, and fewer inserted zeros than the simple SPM architecture. For a given number of bits n and m of the parallel and serial operand, an optimum SPM architecture can be designed, in the sense that it minimizes the area-time product. The proposed architecture achieves high gains as the bit length n increases, making it the ideal choice for long number SPMs.

REFERENCES

[1] K. Z. Pekmestzi, P. Kalivas, and N. Moshopoulos, “Long Unsigned Number Systolic Serial Multipliers and Squarers,” IEEE Transactions on Circuits and Systems II, vol. 48, no. 3, pp. 316-321, March 2001.
[2] J. Valls and E. Boemo, “Efficient FPGA-Implementation of Two’s Complement Digit-Serial/Parallel Multipliers,” IEEE Transactions on Circuits and Systems II, vol. 50, no. 6, pp. 317-322, June 2001.
[3] L. Dadda, “On serial-input multipliers for two’s complement numbers,” IEEE Transactions on Computers, vol. 38, pp. 1341–1345, Sept. 1989.
[4] C. Baugh and B. A. Wooley, “A two’s complement parallel array multiplication algorithm,” IEEE Transactions on Computers, vol. 22, pp. 1045–1047, 1973.
[5] P. E. Blankenship, “Comments on a two’s complement parallel array multiplication algorithm,” IEEE Transactions on Computers, vol. 23, pp. 1327, 1974.
[6] G. Even, “Two’s complement pipeline multipliers,” Integration, no. 22, pp. 23–38, 1997.
[7] L. Dadda and L. Breveglieri, “A modular bit-serial convolver,” in Wafer Scale Integration, III, M. Sami and F. Distante, Eds. Amsterdam, the Netherlands: Elsevier Science, 1990, pp. 279–289.
[8] C. Wu, “A fast 1-D serial-parallel systolic multiplier,” IEEE Transactions on Computers, vol. 36, pp. 1243–1247, Oct. 1987.
[9] M. A. Ashour and H. I. Saleh, “An FPGA implementation guide for some different types of serial–parallel multiplier structures,” Microelectronics Journal, Elsevier, vol. 31, pp. 161-168.
[10] E. J. Swartzlander, “The quasi-serial multiplier,” IEEE Transactions on Computers, vol. 22, pp. 317–321, 1973.
[11] R. Gnanasekaran, “A fast serial–parallel binary multiplier,” IEEE Transactions on Computers, vol. 34, pp. 741–744, 1985.
[12] S. Sunder, F. El-Guibaly, and A. Antoniou, “Two’s complement fast serial–parallel multiplier,” IEE Proceedings of Circuits Devices System 142, pp. 41–44, 1995.
[13] P. E. Danielson, “Serial–parallel convolvers,” IEEE Transactions on Computers, vol. 33, pp. 652–667, 1984.
[14] D. Ait-Boudaoud, M. K. Ibrahim, and B. R. Hayes-Gill, “Novel pipelined serial/parallel multiplier,” Electronics Letters, vol. 26, pp. 582–583, 1991.
[15] D. Ait-Boudaoud, M. K. Ibrahim, and B. R. Hayes-Gill, “Novel cell architecture for bit level systolic arrays multiplication,” IEE Proceedings E 138, pp. 21–26, 1991.
[16] M. B. Tosic and M. K. Stojcev, “Pipelined serial/parallel multiplier with contraflowing data streams,” Electronics Letters, vol. 27, pp. 2361–2363, 1991.
[17] K. Z. Pekmestzi and C. G. Caraiscos, “A class of systolic serial-parallel multipliers,” Int. Journal of Electronics, vol. 76, no. 3, pp. 463-468, 1994.
[18] “TSMC 0.13um (CL013G) Process 1.2-Volt SAGE-X Standard Cell Library Databook,” Artisan Components, Inc.