An Efficient Modified Booth Multiplier Architecture

2008 International Conference on Electronic Design

December 1-3, 2008, Penang, Malaysia

Razaidi Hussin1, Ali Yeon Md. Shakaff2, Norina Idris1, Zaliman Sauli1, Rizalafande Che Ismail1 and Afzan Kamarudin1
1 School of Microelectronic Engineering, 2 School of Computer and Communication Engineering, Universiti Malaysia Perlis (UniMAP), P.O Box 77, d/a Pejabat Pos Besar, 01007 Kangar, Perlis, Malaysia

Abstract

In this paper, we present the design of an efficient multiplication unit. The multiplier architecture is based on the Radix-4 Booth multiplier. To improve this architecture, we have made two enhancements. The first is a modification of Wen-Chang's Modified Booth Encoder (MBE), which was chosen because it is the fastest scheme for generating partial products. However, when this MBE is implemented with the Simplified Sign Extension (SSE) method, the multiplication output is incorrect, so the encoder has been modified. The second is a reduction of the delay in the 4:2 compressor circuit. The redesigned 4:2 compressor reduces the delay of the Carry signal; this was achieved by rearranging the Boolean equation of the Carry signal. The architecture has been designed using Quartus II. The Gajski rule has been adopted to estimate the delay and size of the circuit. The total transistor count of the new multiplier is slightly larger, because the new MBE uses more transistors. In terms of speed, however, the multiplier performs well: the propagation delay is reduced by about 2% to 7% compared with other designs.

I. INTRODUCTION

With the constant growth of computer applications such as computer graphics and signal processing, fast arithmetic units, especially multipliers, are becoming increasingly important. Advanced VLSI technology has given designers the freedom to integrate many complex components, which was not possible in the past. Various high-speed multipliers have been proposed and realized [1, 2, 3, 5, 10, 11, 12].

Implementing a Modified Booth Multiplier involves two operations: generating the partial products and accumulating all of them. The partial products are produced by the Booth Encoder (BE) and Booth Selector (BS) circuits, and they are accumulated by an adder or compressor circuit. The performance and size of the multiplier depend on these two operations. The BE can be designed in many ways, and each design is a tradeoff between area and speed. Since our objective is a fast multiplier, the BE with the best speed is used in our design. However, this BE creates a problem when the Simplified Sign Extension (SSE) method is implemented.

In the accumulation part, all partial products must be summed to obtain the final result. A fast multi-operand adder such as the Wallace tree [4] or a carry save adder (CSA) tree using multi-input counters and compressors [5, 6] should be employed for high-speed accumulation. In this paper we evaluate Booth encoding with respect to the use of the 4:2 compressor. We propose a redesigned MBE that overcomes the incompatibility with the SSE method, and a redesigned 4:2 compressor that reduces the delay. Section II discusses the Booth architecture, Section III presents the result analysis and Section IV concludes this research.

978-1-4244-2315-6/08/$25.00 ©2008 IEEE.

II. BOOTH ARCHITECTURE

The conventional Booth multiplier and the Hsin-Lei and Wen-Chang architectures [1, 10, 11] consist of three basic components: the Booth Encoder (BE), the Booth Selector (BS) and the adder tree. The BE decodes the multiplier bits, and its outputs are used by the BS to generate the partial products. The adder tree then accumulates all the partial products to produce the result. With this architecture, the 2's complement error correction is usually implemented in the adder tree. As a result, the architecture has n/2 + 1 partial product rows for an n x n multiplier.

In the conventional technique for signed multiplication, the sign bit of a partial product row has to be extended all the way to the MSB position, so the sign bit must drive that many output loads (each bit position up to the MSB has the same value as the sign). As a result, the partial product rows are unequal in length. In the example shown in Figure 1, the first row spans 16 bits (pp_00 to the leftmost pp_80), the second row spans 14 bits (pp_01 to the leftmost pp_81), the third row spans 12 bits (pp_02 to the leftmost pp_82) and the fourth row spans 10 bits (pp_03 to the leftmost pp_83).
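As a rough illustration of the partial product generation step, the following Python sketch recodes a two's complement multiplier into radix-4 Booth digits and sums the resulting rows. It is a behavioural model only, not the gate-level BE/BS circuit. The function names `booth_radix4_digits` and `booth_multiply` are hypothetical, and negative rows are formed arithmetically, so the extra row for 2's complement error correction does not appear here.

```python
def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Recode an n-bit two's complement multiplier y into n/2 digits in {-2..2}."""
    assert n % 2 == 0, "radix-4 recoding assumes an even word length"
    # Python's >> sign-extends negative ints, so this yields two's complement bits.
    bits = [(y >> i) & 1 for i in range(n)]
    digits = []
    prev = 0  # implicit bit to the right of the LSB
    for i in range(0, n, 2):
        low, high = bits[i], bits[i + 1]
        digits.append(-2 * high + low + prev)
        prev = high
    return digits


def booth_multiply(x: int, y: int, n: int = 8) -> int:
    """Sum one partial product row per Booth digit (behavioural model)."""
    rows = [d * x * (4 ** i) for i, d in enumerate(booth_radix4_digits(y, n))]
    return sum(rows)


if __name__ == "__main__":
    print(booth_radix4_digits(7, 8))
    print(booth_multiply(-13, 7))
```

For an 8 x 8 multiplication the sketch produces four digit rows, each shifted two bit positions further left than the previous one, which is why the sign-extended rows in Figure 1 shrink by two bits per row.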

Ercegovac [15] designed a method that eliminates certain bits in the partial products in order to make the partial product array smaller and faster to accumulate. Figure 2 shows this sign extension prevention method [15].

Booth Encoder and Booth Selector development

Several MBEs were evaluated in [12]. Based on those results, Wen-Chang's MBE was the most efficient and is chosen for our design. In the BS circuit in Figure 3, the S signal from the BE is needed first to generate the partial product. The BE generates the S signal from the C2 signal alone. As a result, the combination of BE and BS produces the partial product faster. However, when the SSE method was implemented in this multiplier, the S signal from the BE was found to be incompatible: the multiplication result is incorrect when the BE is in state '111'. To overcome this problem, Wen-Chang's BE was modified by improving the Z signal. The combination of the S and Z signals drives the Simplified Sign Extension and the 2's complement error correction to produce the correct multiplication result. Figure 3 and Figure 4 show the MBE of Wen-Chang and the new BE, while Table 1 shows the truth table for both BEs.

The 4:2 Compressor

Several attempts to find repeatability in the Wallace tree [4] have been made, leading to the notion of compressors such as 5:2 or 9:2 [5, 6]. The compressor is a major departure from the traditional Dadda counter, since it requires Carry In and Carry Out signals. However, the propagation of these signals is limited to 1 bit by making the Carry In and the corresponding Carry Out independent. The most popular compressor is the 4:2 compressor, introduced by Weinberger [7].

The structure of the Original 4:2 compressor is shown in Figure 5. The major advantage of this cell is that it allows a highly regular layout: the 2-to-1 reduction of the cell leads to a symmetric and regular compression tree. However, since this cell is built with Full Adders, there is no improvement compared to the Wallace tree. Several designers have modified the 4:2 compressor cell in order to reduce the critical path [8, 9], and such a cell was used by Hsin-Lei et al. [10]. An example of a Modified 4:2 compressor with a critical path of 3 XOR gates is shown in Figure 6.

In this paper, we make further modifications to the redesigned 4:2 compressor circuit. The modification targets the Carry signal, which has the longest delay, and is made by rearranging the Boolean equation of the Carry signal. Let H = B XOR C XOR D, and G = ~(H). Based on the Modified 4:2 Compressor, the Carry signal can be written as:

$$
\begin{aligned}
\text{Carry} &= \overline{H}\,(B + C_{in}) + H\,(B \cdot C_{in}) \\
             &= G\,(B + C_{in}) + H \cdot B \cdot C_{in} \\
             &= G \cdot B + G \cdot C_{in} + H \cdot B \cdot C_{in} \\
             &= G \cdot B + G \cdot C_{in} + B \cdot C_{in}
\end{aligned}
$$
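The rearrangement of the Carry equation can be checked exhaustively. The short Python sketch below treats H, B and Cin as free one-bit inputs (with G = ~H) and compares the first and last forms of the equation over all eight input combinations. The function names are illustrative and not taken from the paper.

```python
from itertools import product


def carry_first_form(h: int, b: int, cin: int) -> int:
    """Carry = ~H (B + Cin) + H (B Cin)."""
    g = 1 - h  # G is the complement of H
    return (g & (b | cin)) | (h & b & cin)


def carry_rearranged(h: int, b: int, cin: int) -> int:
    """Carry = G B + G Cin + B Cin."""
    g = 1 - h
    return (g & b) | (g & cin) | (b & cin)


def forms_agree() -> bool:
    """Compare both forms on every combination of H, B and Cin."""
    return all(
        carry_first_form(h, b, cin) == carry_rearranged(h, b, cin)
        for h, b, cin in product((0, 1), repeat=3)
    )


if __name__ == "__main__":
    print(forms_agree())
```

The last form is the majority function of G, B and Cin. The step that replaces H·B·Cin with B·Cin relies on the term G·B already covering the case B = 1, H = 0.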

Figure 7 shows the new redesigned 4:2 Compressor, which has 4 fewer transistors than the Modified 4:2 Compressor and 2 fewer transistors than the Original 4:2 Compressor built with Full Adders.
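For reference, the behaviour that any 4:2 compressor must preserve can be modelled in a few lines of Python. The sketch below builds the cell from two full adders, in the spirit of the Original 4:2 Compressor. The assignment of inputs to the two adders is an assumption, since the schematic of Figure 5 is not reproduced here, and the function names are hypothetical.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (sum, carry) of three one-bit inputs."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(a: int, b: int, c: int, d: int, cin: int) -> tuple[int, int, int]:
    """Return (sum, carry, cout) for a 4:2 compressor made of two full adders."""
    s1, cout = full_adder(a, b, c)        # cout does not depend on cin
    total, carry = full_adder(s1, d, cin)
    return total, carry, cout


def preserves_weight() -> bool:
    """Check a + b + c + d + cin == sum + 2 * (carry + cout) for all inputs."""
    for a, b, c, d, cin in product((0, 1), repeat=5):
        total, carry, cout = compressor_4_2(a, b, c, d, cin)
        if a + b + c + d + cin != total + 2 * (carry + cout):
            return False
    return True


if __name__ == "__main__":
    print(preserves_weight())
```

Because Cout is computed without Cin, a carry entering the cell cannot ripple further than one bit position, which is the property described above for compressors.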

III. RESULT ANALYSIS

A normalized gate delay model is used to analyze the circuit performance. As in [12], [13], the delay of an inverter gate is taken as one unit delay to simplify the analysis, and the delays of the other CMOS logic gates are normalized with respect to this unit delay. Table 2 summarizes the information for the logic gates used in this project.

Based on the analysis, the new multiplier uses about 1% fewer transistors than the original Wen-Chang multiplier. This is due to the redesigned 4:2 compressor, which uses fewer transistors. In the MBE circuit, the number of transistors remains the same after the modification. Since Wen-Chang's BE is of average size, the new multiplier has a disadvantage in size: it uses 1% to 2% more transistors than the Hsin-Lei and Normal multipliers.

The propagation delays of the system have been evaluated using the Gajski rules [14]. Table 4 summarizes the delay analysis for the various multipliers and shows that the new multiplier is the fastest. The propagation delay of the new multiplier is 2% to 7% lower than that of the other designs.
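The normalized delay model amounts to adding up per-gate figures along a path. The Python sketch below uses the gate costs and delays listed in Table 2. The helper names and the example gate lists are hypothetical and only show how such an estimate could be computed; they are not the actual critical paths of the multipliers.

```python
# Per-gate figures from Table 2: (transistor count, delay in inverter units).
GATES = {
    "INV": (2, 1.0),
    "NAND2": (4, 1.4),
    "NAND3": (6, 1.8),
    "NAND4": (8, 2.2),
    "NOR2": (4, 1.4),
    "NOR3": (6, 1.8),
    "XNOR2": (12, 3.2),
    "XOR2": (14, 4.2),
}


def path_delay(path: list[str]) -> float:
    """Sum the normalized delays of the gates along one signal path."""
    return sum(GATES[name][1] for name in path)


def transistor_count(gates: list[str]) -> int:
    """Sum the transistor counts of a list of gates."""
    return sum(GATES[name][0] for name in gates)


if __name__ == "__main__":
    three_xor_path = ["XOR2", "XOR2", "XOR2"]   # e.g. a 3-XOR critical path
    print(round(path_delay(three_xor_path), 1))
    print(transistor_count(three_xor_path))
```

Under this model a path of three XOR2 gates costs 12.6 inverter delays, which gives a sense of why shortening the Carry path of the compressor matters.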

IV. CONCLUSION

An efficient multiplier has been developed using a redesigned BE based on Wen-Chang's BE and a redesigned 4:2 compressor. The objective of developing a fast multiplier has been achieved, with delays of 136.4 ns, 246.8 ns and 467.6 ns for the 8-bit, 16-bit and 32-bit multipliers respectively. This is a 2% to 7% improvement compared to other designs. Although the improvement seems marginal, the accumulated saving in delay becomes significant when the multiplier is used in a large system. The new multiplier has a slight disadvantage in size because the faster BE is not the smallest scheme. However, the increase in size can be tolerated since the difference from the other multipliers is only about 1%.

ACKNOWLEDGEMENT

The authors acknowledge Universiti Malaysia Perlis (UniMAP) for providing the financial support that enabled the production of this article.

REFERENCES

1. A. D. Booth, "A Signed Binary Multiplication Technique", Quarterly J. Mech. Appl. Math., vol. 4, part 2, pp. 236-240, 1951.
2. O. L. Mac Sorley, "High Speed Arithmetic in Binary Computers", Proceedings of IRE, Vol. 49, No. 1, January 1961.
3. D. Villeger and V. G. Oklobdzija, "Analysis of Booth Encoding Efficiency in Parallel Multipliers Using Compressor for Reduction of Partial Products", Proceedings of the 27th Asilomar Conference on Signals, Systems and Computers, pp. 781-784, 1993.
4. C. S. Wallace, "A Suggestion for a Fast Multiplier", IEEE Transactions on Computers, Vol. EC-13, pp. 14-17, February 1964.
5. R. S. Lim, "High Speed Multiplication and Multiple Summand Addition", 4th International Symposium on Computer Arithmetic, Santa Monica, California, June 1978.
6. P. Soong, G. De Michelli, "Circuit and Architecture Trade-offs for High Speed Multiplication", IEEE Journal of Solid State Circuits, Vol. 26, No. 9, September 1991.
7. A. Weinberger, "4:2 Carry Save Adder Module", IBM Technical Disclosure Bulletin, Vol. 23, January 1981.
8. J. Mori et al., "A 10ns 54x54bit Parallel Structure Full Array Multipliers with 0.5u CMOS technology", IEEE Journal of Solid State Circuits, Vol. 26, No. 4, April 1991.
9. T. Soulas, D. Villeger, V. G. Oklobdzija, "An ASIC Multiplier for Complex Numbers", Proceedings of EURO-ASIC93, The European Event in ASIC Design, Paris, France, February 22-25, 1993.
10. Hsin-Lei Lin, "Design of a Novel Radix-4 Booth Multiplier", the 2004 IEEE Asia-Pacific Conference on Circuits and Systems, December 2005.
11. Wen-Chang, Y. and Chein-Wei, J., "High-speed Booth encoded parallel multiplier design", IEEE Transactions on Computers, 2000.
12. Razaidi et al., "Analysis of various Modified Booth Encoder (MBE) and proposal for an efficient Modified Booth Encoder", IEEE Regional Symposium on Microelectronics, December 2007.
13. Rizalafande Che Ismail, "A Complex Multiplier Using Booth Wallace Algorithm", M.Eng., RMIT, 2005.
14. D. Gajski, Principles of Digital Design, Prentice Hall, 1997.
15. M. D. Ercegovac and T. Lang, Digital Arithmetic, Morgan Kaufmann Publishers, California, USA, 2003.

Figure 1: The array of partial products for signed multiplication using conventional technique

Figure 2: Simplified Sign Extension method [15]

Figure 3: Wen-Chang BE and BS

Figure 4: New BE based on the Wen-Chang architecture

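Figure 5: Structure of the Original 4:2 Compressor built with Full Adders

| C2 | C1 | C0 | Wen-Chang's BE: M | 2M | S | Z | New BE: M | 2M | S | Z |
|----|----|----|-------------------|----|---|---|-----------|----|---|---|
| 0  | 0  | 0  | 1                 | 0  | 0 | 1 | 1         | 0  | 0 | 1 |
| 0  | 0  | 1  | 0                 | 1  | 0 | 1 | 0         | 1  | 0 | 0 |
| 0  | 1  | 0  | 0                 | 1  | 0 | 0 | 0         | 1  | 0 | 0 |
| 0  | 1  | 1  | 1                 | 0  | 0 | 0 | 1         | 0  | 0 | 0 |
| 1  | 0  | 0  | 1                 | 0  | 1 | 0 | 1         | 0  | 1 | 0 |
| 1  | 0  | 1  | 0                 | 1  | 1 | 0 | 0         | 1  | 1 | 0 |
| 1  | 1  | 0  | 0                 | 1  | 1 | 1 | 0         | 1  | 1 | 0 |
| 1  | 1  | 1  | 1                 | 0  | 1 | 1 | 1         | 0  | 1 | 1 |

Table 1: The truth table for Wen-Chang's BE and New BE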
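The truth table can be summarized by compact Boolean expressions. The Python sketch below encodes the rows of Table 1 and checks them against expressions that were inferred from the table values, not read from the schematics of Figure 3 and Figure 4, so they should be treated as an assumption. The function names `wen_chang_be` and `new_be` are hypothetical.

```python
# Rows of Table 1: (C2, C1, C0) -> ((M, 2M, S, Z) Wen-Chang, (M, 2M, S, Z) New BE)
TABLE_1 = {
    (0, 0, 0): ((1, 0, 0, 1), (1, 0, 0, 1)),
    (0, 0, 1): ((0, 1, 0, 1), (0, 1, 0, 0)),
    (0, 1, 0): ((0, 1, 0, 0), (0, 1, 0, 0)),
    (0, 1, 1): ((1, 0, 0, 0), (1, 0, 0, 0)),
    (1, 0, 0): ((1, 0, 1, 0), (1, 0, 1, 0)),
    (1, 0, 1): ((0, 1, 1, 0), (0, 1, 1, 0)),
    (1, 1, 0): ((0, 1, 1, 1), (0, 1, 1, 0)),
    (1, 1, 1): ((1, 0, 1, 1), (1, 0, 1, 1)),
}


def wen_chang_be(c2: int, c1: int, c0: int) -> tuple[int, int, int, int]:
    """Inferred expressions: M = XNOR(C1, C0), 2M = XOR(C1, C0), S = C2, Z = XNOR(C2, C1)."""
    return 1 - (c1 ^ c0), c1 ^ c0, c2, 1 - (c2 ^ c1)


def new_be(c2: int, c1: int, c0: int) -> tuple[int, int, int, int]:
    """Same M, 2M and S; Z is 1 only when all three inputs are equal (inferred)."""
    m, m2, s, _ = wen_chang_be(c2, c1, c0)
    return m, m2, s, int(c2 == c1 == c0)


def matches_table() -> bool:
    """Compare both encoders with every row of Table 1."""
    return all(
        (wen_chang_be(*bits), new_be(*bits)) == outputs
        for bits, outputs in TABLE_1.items()
    )


if __name__ == "__main__":
    print(matches_table())
```

In this reading the two encoders differ only in the Z output, and only for the inputs 001 and 110, which is consistent with the statement that the Z signal was the part of Wen-Chang's BE that was changed.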

Figure 6: Logic of a Modified 4:2 Compressor

| Gate Name | Function         | Hardware cost (number of transistors) | Normalized gate delay (in inverter delays) |
|-----------|------------------|---------------------------------------|--------------------------------------------|
| INV       | 1-input Inverter | 2                                     | 1                                          |
| NAND2     | 2-input NAND     | 4                                     | 1.4                                        |
| NAND3     | 3-input NAND     | 6                                     | 1.8                                        |
| NAND4     | 4-input NAND     | 8                                     | 2.2                                        |
| NOR2      | 2-input NOR      | 4                                     | 1.4                                        |
| NOR3      | 3-input NOR      | 6                                     | 1.8                                        |
| XNOR2     | 2-input XNOR     | 12                                    | 3.2                                        |
| XOR2      | 2-input XOR      | 14                                    | 4.2                                        |

Table 2: Normalized Gate Delay and Hardware cost

Figure 7: New Redesign of 4:2 Compressor

| Multiplier     | 8-bit | 16-bit | 32-bit |
|----------------|-------|--------|--------|
| Hsin Lei       | 3534  | 12566  | 47350  |
| Wen-Chang      | 3584  | 12818  | 48004  |
| New Multiplier | 3546  | 12680  | 47474  |
| Normal         | 3552  | 12428  | 46308  |

Table 3: Total Transistor for Modified Booth Multiplier

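| Multiplier     | 8-bit Generate PP | 8-bit Adder tree | 8-bit Total Delay | 16-bit Generate PP | 16-bit Adder tree | 16-bit Total Delay | 32-bit Generate PP | 32-bit Adder tree | 32-bit Total Delay |
|----------------|-------------------|------------------|-------------------|--------------------|-------------------|--------------------|--------------------|-------------------|--------------------|
| Hsin Lei       | 9.4 ns            | 138 ns           | 147.4 ns          | 9.4 ns             | 256.4 ns          | 265.8 ns           | 9.4 ns             | 493.2 ns          | 502.6 ns           |
| Wen-Chang      | 8.4 ns            | 132 ns           | 140.4 ns          | 8.4 ns             | 245.6 ns          | 254 ns             | 8.4 ns             | 472.8 ns          | 481.2 ns           |
| New Multiplier | 8.4 ns            | 128 ns           | 136.4 ns          | 8.4 ns             | 238.4 ns          | 246.8 ns           | 8.4 ns             | 459.2 ns          | 467.6 ns           |
| Normal         | 13.6 ns           | 132 ns           | 145.6 ns          | 13.6 ns            | 245.6 ns          | 259.2 ns           | 13.6 ns            | 472.8 ns          | 486.4 ns           |

Table 4: Delay Analysis for Generate Partial Product, Adder tree and Total Delay for each multiplier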
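The quoted 2% to 7% improvement can be reproduced from the total delays in Table 4. The Python sketch below computes the relative reduction of the new multiplier against each of the other designs. The dictionary layout and the function name `improvement_percent` are illustrative choices, not part of the original analysis.

```python
# Total delays from Table 4, in nanoseconds, keyed by word length.
TOTAL_DELAY_NS = {
    8: {"Hsin Lei": 147.4, "Wen-Chang": 140.4, "New Multiplier": 136.4, "Normal": 145.6},
    16: {"Hsin Lei": 265.8, "Wen-Chang": 254.0, "New Multiplier": 246.8, "Normal": 259.2},
    32: {"Hsin Lei": 502.6, "Wen-Chang": 481.2, "New Multiplier": 467.6, "Normal": 486.4},
}


def improvement_percent(width: int, baseline: str) -> float:
    """Relative delay reduction of the new multiplier versus a baseline design."""
    delays = TOTAL_DELAY_NS[width]
    return 100.0 * (delays[baseline] - delays["New Multiplier"]) / delays[baseline]


if __name__ == "__main__":
    for width in TOTAL_DELAY_NS:
        for baseline in ("Hsin Lei", "Wen-Chang", "Normal"):
            print(f"{width}-bit vs {baseline}: {improvement_percent(width, baseline):.1f}%")
```

The smallest reductions are against the Wen-Chang multiplier (a little under 3% at every word length) and the largest are against the Hsin Lei multiplier (about 7%).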
