2008 International Conference on Electronic Design
December 1-3, 2008, Penang, Malaysia
An Efficient Modified Booth Multiplier Architecture
Razaidi Hussin1, Ali Yeon Md. Shakaff2, Norina Idris1, Zaliman Sauli1, Rizalafande Che Ismail1 and Afzan Kamarudin1
1 School of Microelectronic Engineering, 2 School of Computer and Communication Engineering, Universiti Malaysia Perlis (UniMAP), P.O Box 77, d/a Pejabat Pos Besar, 01007 Kangar Perlis, Malaysia
[email protected]
Abstract
In this paper, we present the design of an efficient multiplication unit. The multiplier architecture is based on the Radix-4 Booth multiplier. To improve this architecture, we have made two enhancements. The first is a modification of Wen-Chang’s Modified Booth Encoder (MBE), chosen because it is the fastest scheme for generating a partial product. However, when this MBE is implemented with the Simplified Sign Extension (SSE) method, the multiplication output is incorrect, so the encoder had to be modified. The second is a reduction of the delay in the 4:2 compressor circuit. The redesigned 4:2 compressor reduces the delay of the Carry signal; this was achieved by rearranging the Boolean equation of the Carry signal. The architecture has been designed using Quartus II. The Gajski rule has been adopted to estimate the delay and size of the circuit. The total transistor count of the new multiplier is slightly larger, because the new MBE uses more transistors. In terms of speed, however, the multiplier performs well: the propagation delay is reduced by about 2% to 7% compared with other designs.
I INTRODUCTION
With the constant growth of computer applications such as computer graphics and signal processing, fast arithmetic units, especially multipliers, are becoming increasingly important. Advanced VLSI technology has given designers the freedom to integrate many complex components, which was not possible in the past. Various high-speed multipliers have been proposed and realized [1, 2, 3, 5, 10, 11, 12]. Implementing a Modified Booth Multiplier involves two operations: generating the partial products and accumulating them. The partial products are produced by the Booth Encoder (BE) and Booth Selector (BS) circuits, and they are accumulated by an adder or compressor circuit. The performance and size of the multiplier circuit depend on these two operations. The BE can be designed in many ways, and each design is a trade-off between area and speed. Since our objective is a fast multiplier, we use the BE with the best speed performance. However, this BE creates a problem when the SSE method is implemented.
In the accumulation part, all partial products must be accumulated to obtain the final result. A fast multi-operand adder such as the Wallace tree [4] or a carry save adder (CSA) tree using multi-input counters and compressors [5, 6] should be employed for high-speed accumulation. In this paper we evaluate Booth encoding with respect to the use of the 4:2 compressor. We propose a redesigned MBE to overcome the incompatibility with the SSE method and a redesigned 4:2 compressor to reduce the delay. Section II discusses the Booth architecture, Section III presents the result analysis, and Section IV concludes this research.
II BOOTH ARCHITECTURE
The conventional Booth multiplier and the Hsin-Lei and Wen-Chang architectures [1, 10, 11] consist of three basic components: the Booth Encoder (BE), the Booth Selector (BS) and the adder tree summation. The BE decodes the multiplier signal, and its output is used by the BS to generate the partial products. The adder tree then accumulates all the partial products to produce the result. With this architecture, the 2’s complement error correction is usually implemented in the adder summation. As a result, the architecture has n/2 + 1 partial products for an n x n multiplier.
In the conventional technique for signed multiplication, the sign bit of each partial product row has to be extended all the way to the MSB position, which requires the sign bit to drive that many output loads (each bit position up to the MSB carries the same value as the sign). As a result, the partial product rows are unequal in length. In the example shown in Figure 1, the first row spans 16 bits (pp_00 to the leftmost pp_80), the second row 14 bits (pp_01 to the leftmost pp_81), the third row 12 bits (pp_02 to the leftmost pp_82) and the fourth row 10 bits (pp_03 to the leftmost pp_83).
978-1-4244-2315-6/08/$25.00 ©2008 IEEE.
Ercegovac [15] describes a method that eliminates certain bits of the partial products so that the partial product array becomes much smaller and faster to accumulate. Figure 2 shows this sign extension prevention method [15].
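The partial product generation described above can be illustrated with a short behavioural sketch. It is not the gate-level BE/BS circuit of this paper: the function names are hypothetical, the digit set {-2, -1, 0, 1, 2} is the standard radix-4 Booth recoding, and Python's unbounded signed integers stand in for the sign extension, so the separate 2's complement correction row is not modelled.

```python
def booth_radix4_digits(multiplier: int, n_bits: int) -> list[int]:
    """Recode an n-bit two's-complement multiplier into radix-4 Booth digits.

    Digit i is decoded from the overlapping triplet
    (y[2i+1], y[2i], y[2i-1]) with y[-1] = 0, and lies in {-2, -1, 0, 1, 2}.
    """
    if n_bits % 2:
        raise ValueError("this sketch assumes an even operand width")
    y = multiplier & ((1 << n_bits) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # y[-1]
    for i in range(0, n_bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(multiplicand: int, multiplier: int, n_bits: int = 8) -> int:
    """Sum the n/2 weighted partial products selected by the Booth digits."""
    partial_products = [
        (digit * multiplicand) << (2 * i)  # row i is shifted left by 2*i bits
        for i, digit in enumerate(booth_radix4_digits(multiplier, n_bits))
    ]
    return sum(partial_products)
```

For an 8 x 8 multiplication the sketch produces four rows, each shifted two bit positions further left than the previous one. In hardware every row must additionally be sign-extended to the MSB of the 16-bit result, which is the cost that the sign extension prevention method removes.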
Booth Encoder and Booth Selector development
Several MBEs have been evaluated in [12]. Based on those results, Wen-Chang’s MBE was the most efficient and is chosen for our design. In the BS circuit in Figure 3, the S signal is needed first from the BE to generate the partial product. The BE generates the S signal from the C2 signal alone, so the combination of BE and BS produces the partial product faster. However, when the SSE method was implemented in this multiplier, the S signal from the BE was found to be incompatible: the multiplication result is incorrect when the BE is in state ‘111’. To overcome this problem, Wen-Chang’s BE was modified by improving the Z signal. The combination of the S and Z signals drives the Simplified Sign Extension and the 2’s complement error correction to produce the correct multiplication result. Figure 3 and Figure 4 show Wen-Chang’s MBE and the new BE, and Table 1 gives the truth table for both BEs.
The 4:2 Compressor
Several attempts to find repeatability in the Wallace tree [4] have been made, leading to the notion of compressors such as 5:2 or 9:2 [5, 6]. The compressor is a major departure from the traditional notion of Dadda counters, since it requires Carry In and Carry Out signals. However, the propagation of the signal is limited to 1 bit by rendering the Carry In and the corresponding Carry Out independent. The most popular compressor is the 4:2 compressor, introduced by Weinberger [7]. The structure of the Original 4:2 compressor is shown in Figure 5. The major advantage of this cell is that it allows a highly regular layout: the 2-to-1 reduction of the cell leads to a symmetric and regular compression tree. However, since this cell is built with Full Adders, there is no improvement compared to the Wallace tree. Several designers have modified the 4:2 compressor cell to reduce the critical path [8, 9], and such a cell was used by Hsin-Lei et al. [10]. An example of a Modified 4:2 compressor with a critical path of 3 XOR gates is shown in Figure 6.
In this paper, we further modify the 4:2 compressor. The modification targets the Carry signal, which has the longest delay, and is made by rearranging its Boolean equation. Let H = B XOR C XOR D, and G = ~(H). Based on the Modified 4:2 Compressor, the Carry signal can be written as:
Carry = ~(H) (B + Cin) + H (B * Cin)
      = G (B + Cin) + H * B * Cin
      = G * B + G * Cin + H * B * Cin
      = G * B + G * Cin + B * Cin
The last step holds because the term G * B * Cin is already covered by G * B, so H * B * Cin can be widened to B * Cin without changing the function.
As Figure 7 shows, the redesigned 4:2 compressor uses 4 fewer transistors than the Modified 4:2 Compressor and 2 fewer transistors than the Original 4:2 Compressor built from Full Adders.
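Two properties of the 4:2 compressor discussed in this section can be checked exhaustively with a small behavioural model. The sketch below is an illustration under assumptions, not the schematic of Figure 5 or Figure 7: the way the inputs A, B, C, D and Cin are wired to the two full adders is assumed, and the Carry check treats H, B and Cin as free bits so that it tests only the Boolean rearrangement Carry = ~(H)(B + Cin) + H(B * Cin) = G * B + G * Cin + B * Cin with G = ~(H).

```python
from itertools import product


def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """Return (sum, carry) for three input bits."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def compressor_4to2(a: int, b: int, c: int, d: int, cin: int) -> tuple[int, int, int]:
    """4:2 compressor modelled as two cascaded full adders (assumed wiring)."""
    s1, cout = full_adder(a, b, c)         # cout does not depend on cin
    total, carry = full_adder(s1, d, cin)
    return total, carry, cout


def carry_before(h: int, b: int, cin: int) -> int:
    """Carry = ~(H)(B + Cin) + H(B * Cin)."""
    g = 1 - h
    return (g & (b | cin)) | (h & b & cin)


def carry_after(h: int, b: int, cin: int) -> int:
    """Carry = G*B + G*Cin + B*Cin, with G = ~(H)."""
    g = 1 - h
    return (g & b) | (g & cin) | (b & cin)


def check_all() -> bool:
    """Exhaustively verify the arithmetic invariant and the Carry identity."""
    for a, b, c, d, cin in product((0, 1), repeat=5):
        total, carry, cout = compressor_4to2(a, b, c, d, cin)
        # Five input bits of weight 1 equal Sum + 2 * (Carry + Cout).
        if a + b + c + d + cin != total + 2 * (carry + cout):
            return False
    return all(
        carry_before(h, b, cin) == carry_after(h, b, cin)
        for h, b, cin in product((0, 1), repeat=3)
    )
```

Calling `check_all()` covers all 32 input combinations of the compressor and all 8 combinations of the Carry expression. Because Cout is produced by the first full adder alone, the model also shows why the carry chain between neighbouring cells is limited to one bit position.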
III RESULT ANALYSIS
A normalized gate delay model is used to analyze the circuit performance. As in [12], [13], the delay of an inverter gate is taken as one unit delay to simplify the analysis, and the delays of the other CMOS logic gates are normalized with respect to this unit delay. Table 2 summarizes this information for the logic gates used in this project.
Based on the analysis, the new multiplier uses 1% fewer transistors than the original Wen-Chang multiplier. This is due to the redesigned 4:2 compressor, which uses fewer transistors. The transistor count of the MBE circuit is unchanged by the modification. Since Wen-Chang’s BE is of average size, the new multiplier has a disadvantage in size: it uses 1% to 2% more transistors than the Hsin-Lei and Normal multipliers.
The propagation delays have been evaluated using Gajski’s rules. Table 4 summarizes the delay analysis for the various multipliers and shows that the new multiplier is the fastest. Its propagation delay is 2% to 7% lower than that of the other designs.
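The delay comparison can be reproduced from the figures in Table 4 with a few lines of arithmetic. The sketch below only restates that table: the dictionary layout and the function names are hypothetical, and the total delay is assumed to be the sum of the partial product generation delay and the adder tree delay, which agrees with the totals listed in the table.

```python
# Delays in ns copied from Table 4 as (generate PP, adder tree) pairs.
DELAYS = {
    8: {
        "Hsin Lei": (9.4, 138.0),
        "Wen-Chang": (8.4, 132.0),
        "New Multiplier": (8.4, 128.0),
        "Normal": (13.6, 132.0),
    },
    16: {
        "Hsin Lei": (9.4, 256.4),
        "Wen-Chang": (8.4, 245.6),
        "New Multiplier": (8.4, 238.4),
        "Normal": (13.6, 245.6),
    },
    32: {
        "Hsin Lei": (9.4, 493.2),
        "Wen-Chang": (8.4, 472.8),
        "New Multiplier": (8.4, 459.2),
        "Normal": (13.6, 472.8),
    },
}


def total_delay(width: int, design: str) -> float:
    """Total delay = partial product generation + adder tree (in ns)."""
    generate_pp, adder_tree = DELAYS[width][design]
    return round(generate_pp + adder_tree, 1)


def reduction_percent(width: int, baseline: str) -> float:
    """Delay reduction of the new multiplier relative to a baseline design."""
    new = total_delay(width, "New Multiplier")
    base = total_delay(width, baseline)
    return 100.0 * (base - new) / base


if __name__ == "__main__":
    for width in DELAYS:
        for baseline in ("Hsin Lei", "Wen-Chang", "Normal"):
            print(f"{width}-bit vs {baseline}: "
                  f"{reduction_percent(width, baseline):.1f}% faster")
```

The smallest reductions are against the Wen-Chang multiplier and the largest against the Hsin-Lei multiplier, since the new design shares Wen-Chang's partial product delay and gains only in the adder tree.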
IV CONCLUSION
An efficient multiplier has been developed using a redesigned BE based on Wen-Chang’s BE and a redesigned 4:2 compressor. The objective of developing a fast multiplier has been achieved, with delays of 136.4 ns, 246.8 ns and 467.6 ns for the 8-bit, 16-bit and 32-bit multipliers respectively. This is a 2% to 7% improvement over other designs. Although the improvement seems marginal, the accumulated saving becomes much more significant when the multiplier is used in a large system. The new multiplier has a slight disadvantage in size because the faster BE is not the smallest scheme. However, the increase in size can be tolerated, since the difference from the other multipliers is only about 1%.
ACKNOWLEDGEMENT
The authors acknowledge Universiti Malaysia Perlis (UniMAP) for providing the financial support that enabled the production of this article.
REFERENCES
1. A. D. Booth, “A Signed Binary Multiplication Technique”, Quarterly J. Mech. Appl. Math., vol. 4, part 2, pp. 236-240, 1951.
2. O. L. Mac Sorley, “High Speed Arithmetic in Binary Computers”, Proceedings of IRE, Vol. 49, No. 1, January 1961.
3. D. Villeger and V. G. Oklobdzija, “Analysis of Booth Encoding Efficiency in Parallel Multipliers Using Compressor for Reduction of Partial Products”, Proceedings of the 27th Asilomar Conference on Signals, Systems and Computers, pp. 781-784, 1993.
4. C. S. Wallace, “A Suggestion for a Fast Multiplier”, IEEE Transactions on Computers, Vol. EC-13, pp. 14-17, February 1964.
5. R. S. Lim, “High Speed Multiplication and Multiple Summand Addition”, 4th International Symposium on Computer Arithmetic, Santa Monica, California, June 1978.
6. P. Soong, G. De Michelli, “Circuit and Architecture Trade-offs for High Speed Multiplication”, IEEE Journal of Solid State Circuits, Vol. 26, No. 9, September 1991.
7. A. Weinberger, “4:2 Carry Save Adder Module”, IBM Technical Disclosure Bulletin, Vol. 23, January 1981.
8. J. Mori et al., “A 10ns 54x54bit Parallel Structure Full Array Multipliers with 0.5u CMOS Technology”, IEEE Journal of Solid State Circuits, Vol. 26, No. 4, April 1991.
9. T. Soulas, D. Villeger, V. G. Oklobdzija, “An ASIC Multiplier for Complex Numbers”, Proceedings of EURO-ASIC93, The European Event in ASIC Design, Paris, France, February 22-25, 1993.
10. Hsin-Lei Lin, “Design of a Novel Radix-4 Booth Multiplier”, the 2004 IEEE Asia-Pacific Conference on Circuits and Systems, December 2005.
11. Wen-Chang, Y. and Chein-Wei, J., “High-speed Booth Encoded Parallel Multiplier Design”, IEEE Transactions on Computers, 2000.
12. Razaidi et al., “Analysis of Various Modified Booth Encoder (MBE) and Proposal for an Efficient Modified Booth Encoder”, IEEE Regional Symposium on Microelectronics, December 2007.
13. Rizalafande Che Ismail, “A Complex Multiplier Using Booth Wallace Algorithm”, M.Eng., RMIT, 2005.
14. D. Gajski, Principles of Digital Design, Prentice Hall, 1997.
15. M. D. Ercegovac and T. Lang, Digital Arithmetic, Morgan Kaufmann Publishers, California, USA, 2003.
Figure 1: The array of partial products for signed multiplication using conventional technique
Figure 2: Simplified Sign Extension method [15]
Figure 3: Wen-Chang BE and BS
Figure 4: New BE based on the Wen-Chang architecture
Figure 5: Structure of the Original 4:2 Compressor built with Full Adders

              Wen-Chang’s BE        New BE
C2  C1  C0    M   2M  S   Z         M   2M  S   Z
0   0   0     1   0   0   1         1   0   0   1
0   0   1     0   1   0   1         0   1   0   0
0   1   0     0   1   0   0         0   1   0   0
0   1   1     1   0   0   0         1   0   0   0
1   0   0     1   0   1   0         1   0   1   0
1   0   1     0   1   1   0         0   1   1   0
1   1   0     0   1   1   1         0   1   1   0
1   1   1     1   0   1   1         1   0   1   1
Table 1: The truth table for Wen-Chang’s BE and New BE

Figure 6: Logic of a Modified 4:2 Compressor

Gate Name   Function           Hardware cost (number of transistors)   Normalized gate delay (in inverter delays)
INV         1-input Inverter   2                                       1
NAND2       2-input NAND       4                                       1.4
NAND3       3-input NAND       6                                       1.8
NAND4       4-input NAND       8                                       2.2
NOR2        2-input NOR        4                                       1.4
NOR3        3-input NOR        6                                       1.8
XNOR2       2-input XNOR       12                                      3.2
XOR2        2-input XOR        14                                      4.2
Table 2: Normalized Gate Delay and Hardware cost

Figure 7: New Redesign of 4:2 Compressor

          Hsin Lei   Wen-Chang   New Multiplier   Normal
8 bit     3534       3584        3546             3552
16 bit    12566      12818       12680            12428
32 bit    47350      48004       47474            46308
Table 3: Total Transistor for Modified Booth Multiplier
                       Hsin Lei    Wen-Chang   New Multiplier   Normal
8 bit   Generate PP    9.4 ns      8.4 ns      8.4 ns           13.6 ns
8 bit   Adder tree     138 ns      132 ns      128 ns           132 ns
8 bit   Total Delay    147.4 ns    140.4 ns    136.4 ns         145.6 ns
16 bit  Generate PP    9.4 ns      8.4 ns      8.4 ns           13.6 ns
16 bit  Adder tree     256.4 ns    245.6 ns    238.4 ns         245.6 ns
16 bit  Total Delay    265.8 ns    254 ns      246.8 ns         259.2 ns
32 bit  Generate PP    9.4 ns      8.4 ns      8.4 ns           13.6 ns
32 bit  Adder tree     493.2 ns    472.8 ns    459.2 ns         472.8 ns
32 bit  Total Delay    502.6 ns    481.2 ns    467.6 ns         486.4 ns
Table 4: Delay Analysis for Generate Partial Product, Adder tree and Total Delay for each multiplier