A COMPLEX MULTIPLIER USING OVERTURNED-STAIRS ADDER TREE

Weidong Li and Lars Wanhammar
Electronics Systems, Dept. of EE., Linköping University, SE-581 83 Linköping, SWEDEN
E-mail: {weidongl, larsw}@isy.liu.se

ABSTRACT

In this paper we describe a new complex multiplier based on Distributed Arithmetic (DA) using an overturned-stairs adder tree (OS-tree). The OS-tree yields the same speed as the optimal Wallace tree, but has the advantage of a regular layout. A 17×13 complex multiplier is implemented in Mietec™ 0.35 µm standard CMOS technology. It can execute 30 Mmult./s and dissipates about 15 mW at 25 Mmult./s while operating at 1.5 V.

1. INTRODUCTION

Modern broadband communication systems on cable and radio, such as xDSL modems and DAB, are based on OFDM, which requires huge amounts of complex arithmetic operations in the FFT/IFFT. Multiplication is often the most expensive arithmetic operation and one of the dominant factors in determining the performance in terms of speed and power consumption. As observed by Melander [1] and Widhe [2], the complex multiplier may consume more than 70% of the power in an FFT/IFFT processor. Therefore, an effective design of the complex multiplier is vital in low-power applications. The aim here is to develop a complex multiplier for use in a wide-band OFDM radio modem for 20 Mbit/s with wide area coverage.

A straightforward implementation of a complex multiplication requires four real multiplications, one addition and one subtraction. However, the number of multiplications can be reduced to three by using a transformation at the cost of extra pre- and post-additions. A more efficient way to reduce the cost of multiplication is to utilize distributed arithmetic [3]. Distributed arithmetic is based on successive addition of pre-computed coefficient values. However, these values can also be added using either an adder array or an adder tree, as is done in real multipliers. The difference is that the partial products in the latter are replaced by a simple logic circuit that selects the proper pre-computed value. In this paper we choose to use an adder tree.

The Wallace tree [4] is optimal in terms of speed, but it yields a complicated and irregular wire routing, which makes it less practical for implementation. In order to simplify the routing, several modifications of the Wallace tree have been suggested in the literature. Among these we select the overturned-stairs adder tree [5] because it has a regular structure, which yields regular routing and high speed and simplifies the development of a layout generator.

The distributed arithmetic and the overturned-stairs adder tree are introduced in section 2, and the design of a 17×13 low-voltage two's-complement complex multiplier is shown in section 3. Section 4 shows the simulation results of the complex multiplier.

2. DISTRIBUTED ARITHMETIC AND OVERTURNED-STAIRS ADDER TREE

2.1 Distributed Arithmetic

Distributed arithmetic uses pre-computed partial products for more efficient computation of inner products. In the case of complex multiplication, we treat the real part and the imaginary part separately. Consequently, the complex multiplication can be considered as two inner products of two vectors. Each part can be implemented directly with an adder tree and pre-computed partial products. The partial products can be stored in a look-up table with only four words. A further reduction of the number of words in the look-up table to only two can be achieved with offset binary coding. Offset binary coding uses an equivalent expression for a number $X$:

$$X = \frac{1}{2}\left[X - (-X)\right] \tag{1}$$

The negation of $X$ is obtained in two's-complement representation by inverting all bits and adding 1 at the least significant bit position. Let $X$ be less than 1 and the word length be $W_d$, then

$$X = \frac{1}{2}\left[X - (-X)\right] = \frac{1}{2}\left[X - \left(\bar{X} + 2^{-W_d+1}\right)\right] = -(x_0 - \bar{x}_0)2^{-1} + \sum_{i=1}^{W_d-1}(x_i - \bar{x}_i)2^{-i-1} - 2^{-W_d} \tag{2}$$

where $X = -x_0 + \sum_{i=1}^{W_d-1} x_i 2^{-i}$, $x_i \in \{0, 1\}$, and $\bar{x}_i$ is the bit-inverse of $x_i$.

Let $U$ and $V$ be two complex numbers of which $V$ is the coefficient, $U = A + jB$ and $V = C + jD$. Without any loss of generality, we assume that $A$, $B$, $C$, and $D$ are less than 1. The word length of $A$ and $B$ is $W_d$. Then, the product of $U$ and $V$ can be written

$$
\begin{aligned}
U \cdot V &= (AC - BD) + j(AD + BC) \\
&= \left[-C(a_0 - \bar{a}_0)2^{-1} + \sum_{i=1}^{W_d-1} C(a_i - \bar{a}_i)2^{-i-1} - C\,2^{-W_d}\right] \\
&\quad - \left[-D(b_0 - \bar{b}_0)2^{-1} + \sum_{i=1}^{W_d-1} D(b_i - \bar{b}_i)2^{-i-1} - D\,2^{-W_d}\right] \\
&\quad + j\left[-D(a_0 - \bar{a}_0)2^{-1} + \sum_{i=1}^{W_d-1} D(a_i - \bar{a}_i)2^{-i-1} - D\,2^{-W_d}\right] \\
&\quad + j\left[-C(b_0 - \bar{b}_0)2^{-1} + \sum_{i=1}^{W_d-1} C(b_i - \bar{b}_i)2^{-i-1} - C\,2^{-W_d}\right] \\
&= \left[-F(a_0, b_0)2^{-1} + \sum_{i=1}^{W_d-1} F(a_i, b_i)2^{-i-1} + F(0, 0)2^{-W_d}\right] \\
&\quad + j\left[-G(a_0, b_0)2^{-1} + \sum_{i=1}^{W_d-1} G(a_i, b_i)2^{-i-1} + G(0, 0)2^{-W_d}\right]
\end{aligned} \tag{3}
$$

where $F(a_i, b_i) = C(a_i - \bar{a}_i) - D(b_i - \bar{b}_i)$ and $G(a_i, b_i) = D(a_i - \bar{a}_i) + C(b_i - \bar{b}_i)$. The sign-bit terms carry the weight $-2^{-1}$, as the sign bit does in (2).

The look-up table is shown in Table 1. Only two coefficients are sufficient if the symmetry in rows and the interchangeability between F and G are exploited [3].

a_i   b_i   F(a_i, b_i)   G(a_i, b_i)
0     0     -(C-D)        -(C+D)
0     1     -(C+D)        (C-D)
1     0     (C+D)         -(C-D)
1     1     (C-D)         (C+D)

Table 1: Partial Product Generation.

2.2 Overturned-Stairs Adder Tree

The overturned-stairs adder tree was suggested by Mou and Jutand [5]. Its main features are:
1. A recursive structure that yields regular routing and simplifies the design of the layout generator.
2. A low tree height, i.e., $O(\sqrt[p]{N})$, where p depends on the type of overturned-stairs tree.

There are several types of overturned-stairs adder trees in [5]. The first-order overturned-stairs adder tree, which has the same speed bound as the Wallace tree when the number of operands is less than 19, is chosen.

The construction of the overturned-stairs tree is illustrated in Figure 1, which shows the trees of height 1 to 3. When the height is more than three, we can construct the tree with only three building blocks: body, root, and connector. The body can be constructed recursively according to Figure 1. The body of height j (j > 2) consists of a body of height j - 1, a branch of height j - 2, and a connector. The branch of height j - 2 is formed by using j - 2 CSAs "on top of each other" with proper interconnections [5]. The connector connects three feed-throughs from the body of height j - 1 and two outputs from the branch of height j - 2 to construct the body of height j. A root (CSA) is connected to the outputs of the connector to form the whole tree of height j + 1.

[Figure 1: Tree construction. Panels: Tree 1, Tree 2, and Tree 3 built from CSAs; a branch formed by n CSAs; the body of height j, composed of a body of height j - 1, a branch of height j - 2, and a connector; the tree of height j + 1, formed by the body of height j and a root.]
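The Wallace-tree speed bound that the overturned-stairs tree is compared against can be illustrated by counting carry-save levels. The sketch below is not taken from the paper: it assumes a plain Wallace reduction in which every level passes each full group of three rows through 3:2 CSAs and lets leftover rows pass unchanged, and the function name `wallace_levels` is hypothetical.

```python
def wallace_levels(n_operands: int) -> int:
    """Number of 3:2 CSA levels needed to reduce n_operands rows to two.

    Assumed model (not from the paper): at each level every full group of
    three rows is compressed to two rows; leftover rows pass through.
    """
    if n_operands < 1:
        raise ValueError("need at least one operand")
    levels = 0
    rows = n_operands
    while rows > 2:
        rows -= rows // 3  # each group of three loses exactly one row
        levels += 1
    return levels


if __name__ == "__main__":
    for n in (3, 4, 6, 9, 13, 18, 19, 20):
        print(n, wallace_levels(n))
```

Under this model up to 19 operands fit in six CSA levels and a seventh level is first needed at 20 operands. If one assumes a data word length of W_d = 17 (the paper only gives the size as 17×13), the expansion in (3) supplies 17 look-up terms plus one constant term, i.e. 18 operands per part, which stays within the six-level range.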

3. CIRCUIT DESIGN

Lowering the supply voltage is an efficient way to reduce the power consumption [6]. However, when the supply voltage is as low as twice the threshold voltage, the speed degrades significantly. The tradeoff between speed loss and power consumption must be considered with care. The requirement in our target application is 25 Mmult./s, which allows a reduction of the supply voltage to 1.5 V.
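The benefit of voltage scaling can be made concrete with the usual first-order model of dynamic CMOS power. The worked relation below is an illustration rather than a result from the paper: the 3.3 V reference supply is a hypothetical comparison point, and the activity factor $\alpha$, the switched capacitance $C_L$, and the clock frequency $f$ are assumed to stay fixed.

$$
P_{dyn} = \alpha\, C_L\, V_{DD}^{2}\, f
\qquad\Longrightarrow\qquad
\frac{P_{dyn}(1.5\ \text{V})}{P_{dyn}(3.3\ \text{V})} = \left(\frac{1.5}{3.3}\right)^{2} \approx 0.21
$$

Under these assumptions the quadratic dependence alone would cut the dynamic power to roughly one fifth, which is why the speed loss near the threshold voltage is the limiting factor rather than the power saving.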

3.1 Adder Cell

The adder cells are an essential part of the multiplier. Three types of adder cells have been designed and verified at the layout level [7].

The first type of adder is a conventional static CMOS full adder (Figure 2). When the supply voltage is as low as 1.5 V, the conventional static CMOS full adder, with its large stack height, is too slow. It is also not competitive from a power consumption point of view.

[Figure 2: Conventional static CMOS full adder. Inputs x, y, z; outputs S, C.]

The second type of adder cell is a full adder with transmission gates (TG), shown in Figure 3. This adder realizes the complex function (i.e. XOR) with transmission gates, and both its power consumption and chip area are smaller than those of a conventional static CMOS adder.

[Figure 3: Transmission gates full adder. Inputs x, y, z; outputs S, C.]

The third type of full adder is the Reusens full adder [8] (Figure 4). This adder is fast and compact but requires buffers for the outputs. The buffer insertion is usually considered a drawback since it introduces delay and increases the power consumption. However, in the multiplier the buffer insertion is necessary anyway in order to drive the long interconnections. There is no direct path from VDD or VSS in this full adder, which tends to reduce the power consumption.

In the multiplier, the delay difference between the carry and the sum bit in an adder cell should be minimized. A smaller delay difference tends to reduce glitches and thereby the switching activity factor, which leads to lower power consumption. From this point of view, we select the third type of adder cell instead of the second one. The performance of the adders is presented in Table 2.

Adder type    Transistor count   Delay (ns) @ 1.5 V   Power (µW) @ 1.5 V
Static CMOS   24                 4.2                  4.3
TG            16                 3.5                  2.5
Reusens       16                 3.2                  2.1

Table 2: Full Adder Comparison.

3.2 Partial Product Generation

The selection of pre-computed values from Table 1, which corresponds to the partial product generation in a real multiplier, can be realized with a 2:1 multiplexer and an XOR gate as shown in Figure 5.

[Figure 5: Partial product generation circuits, variants (i) and (ii). Signals shown: a_i, b_i, a_i⊕b_i, -(C-D)_i, -(C+D)_i, and the output F_i(a_i, b_i).]

An alternative is to use a 4:1 multiplexer circuit. The benefit of this implementation is that the delay is reduced since the generation of the select signal (a_i⊕b_i) is not required.
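The selection scheme and the expansion in (3) can be checked numerically. The following Python sketch is a behavioral model under stated assumptions, not the authors' circuit: it uses exact rational arithmetic, selects between the two stored words C-D and C+D with a_i⊕b_i, and applies the sign from a_i (for F) or b_i (for G), which is one reading of Table 1. The sign-bit entry is given the weight -2^(-1), as the expansion of (2) implies. The names `to_bits`, `select`, and `da_complex_multiply` are hypothetical.

```python
from fractions import Fraction


def to_bits(x: Fraction, wd: int) -> list:
    """Two's-complement bits x_0..x_{wd-1} (sign bit first) of x in [-1, 1).

    Assumes x * 2**(wd - 1) is an integer.
    """
    n = int(x * 2 ** (wd - 1)) & ((1 << wd) - 1)
    return [(n >> (wd - 1 - i)) & 1 for i in range(wd)]


def select(a: int, b: int, c_minus_d: Fraction, c_plus_d: Fraction):
    """Return (F(a, b), G(a, b)) from Table 1 using a 2:1 choice plus a sign."""
    sel = a ^ b                                # select signal a XOR b
    f_mag = c_plus_d if sel else c_minus_d     # F uses C+D when the bits differ
    g_mag = c_minus_d if sel else c_plus_d     # G uses the other stored word
    f = f_mag if a else -f_mag                 # sign of F follows a
    g = g_mag if b else -g_mag                 # sign of G follows b
    return f, g


def da_complex_multiply(A, B, C, D, wd: int):
    """Evaluate (A + jB)(C + jD) as the sum of table entries in (3)."""
    a, b = to_bits(A, wd), to_bits(B, wd)
    cmd, cpd = C - D, C + D                    # the two stored coefficients
    f0, g0 = select(a[0], b[0], cmd, cpd)
    re, im = -f0 / 2, -g0 / 2                  # sign-bit terms, weight -2^-1
    for i in range(1, wd):
        f, g = select(a[i], b[i], cmd, cpd)
        w = Fraction(1, 2 ** (i + 1))          # weight 2^(-i-1)
        re += f * w
        im += g * w
    fc, gc = select(0, 0, cmd, cpd)            # constant terms F(0,0), G(0,0)
    w = Fraction(1, 2 ** wd)
    return re + fc * w, im + gc * w


if __name__ == "__main__":
    A, B = Fraction(-5, 16), Fraction(3, 8)
    C, D = Fraction(7, 16), Fraction(-1, 4)
    re, im = da_complex_multiply(A, B, C, D, wd=5)
    assert (re, im) == (A * C - B * D, A * D + B * C)
    print(re, im)
```

In hardware the summation loop corresponds to the operands fed into the adder tree; the model only confirms that two stored words per coefficient pair are enough to reproduce the full complex product.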

3.3 Final Adder

[Figure 4: Reusens full adder. Outputs S, C.]

A final adder is needed to produce the final output after the summation of partial products in the adder tree. Carry-look-ahead adders (CLA) are often used, but their size grows large for large word lengths. We therefore choose to use the Brent-Kung adder [9] instead.

The Brent-Kung adder has a regular structure. Its drawback is the large fan-outs, but this can be solved by inserting buffers.
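The regular structure of the Brent-Kung adder comes from its two-phase parallel-prefix network. The Python sketch below is a bit-level behavioral model of that prefix scheme, offered as an illustration rather than a description of the authors' circuit; the function name `brent_kung_add` and the word length parameter `n` are assumptions, and buffering is not modeled.

```python
def brent_kung_add(x: int, y: int, n: int) -> int:
    """Add two n-bit unsigned integers with a Brent-Kung style prefix network.

    Returns the (n + 1)-bit sum including the carry out.
    """
    # Bitwise generate and propagate signals, index 0 is the LSB.
    g = [(x >> i) & (y >> i) & 1 for i in range(n)]
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]

    # Up-sweep: build group signals over blocks of size 2, 4, 8, ...
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            G[i] |= P[i] & G[i - d]
            P[i] &= P[i - d]
        d *= 2

    # Down-sweep: fill in the remaining prefixes from the block results.
    d //= 2
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            G[i] |= P[i] & G[i - d]
            P[i] &= P[i - d]
        d //= 2

    # G[i] is now the carry out of bit i; the carry into bit 0 is zero.
    carry = [0] + G
    total = 0
    for i in range(n):
        total |= (p[i] ^ carry[i]) << i
    return total | (carry[n] << n)


if __name__ == "__main__":
    assert brent_kung_add(0b1011, 0b0110, 4) == 0b1011 + 0b0110
    print(brent_kung_add(0b1011, 0b0110, 4))
```

Each prefix node in the two sweeps combines only two (generate, propagate) pairs, so the network uses about 2n nodes arranged in a repeating pattern, which is what makes the layout regular.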

3.4 Layout

The main layout difficulty is the interconnections in the adder tree. However, the layout problem is reduced since the OS-tree has at most three feed-throughs between two branches and since five metal layers are available in the 0.35 µm process. The layout strategy is to embed the interconnections in the higher metal layers and minimize the routing channels.

We divide the interconnections into two groups: intercolumn connections and intracolumn connections. Intercolumn connections are connections between two different columns (different weights) in the bitmap, and intracolumn connections are connections inside the same column (same weight) in the bitmap. The intracolumn connections utilize metal layers 1 and 2 while the intercolumn connections utilize metal layers 3 and 4. The local supply voltage and the local clock signal use metal layer 2. The partial products are generated locally inside the adder tree to reduce the routing overhead.

The bit-slice achieves a density of over 93k transistors/mm². The whole adder tree has a density of 83k transistors/mm². Future work will include a layout generator based on the current leaf cells, following an unconstrained-cell approach.

4. SIMULATION RESULTS

Simulating the whole multiplier is both slow and inefficient due to its size. Fortunately, the multiplier has a regular structure, which allows the simulation time to be reduced. The simulation of the multiplier has been divided into two parts: the bit-slice simulation and the whole-circuit simulation. The bit-slice simulation is done with Hspice™ and the whole-circuit simulation is done with Lsim™.

In the bit-slice simulation, we connect the feed-throughs together to form a tree. By varying the delays at the leaves of the tree, we found the worst case. The worst-case delay is 26 ns at 1.5 V and 25 °C. Allowing for a technology variation of 20%, the delay is less than 32 ns. Hence, the complex multiplier can be expected to operate at up to 30 MHz at a supply voltage of 1.5 V.

The power dissipation depends on the input data. We used Lsim Power Analyst™ to estimate the power consumption of the multiplier with random test vectors and found that the RMS power was 15 mW at 1.5 V and 25 MHz.
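The figures above can be tied together with a short worked calculation. The delay margin and clock bound use only the numbers quoted in this section; the energy per multiplication is a derived quantity that the paper does not state, and it assumes that the 15 mW figure can be treated as the average power at one multiplication per clock cycle.

$$
t_{max} = 1.2 \times 26\ \text{ns} = 31.2\ \text{ns} < 32\ \text{ns},
\qquad
f_{max} \ge \frac{1}{32\ \text{ns}} \approx 31\ \text{MHz} > 30\ \text{MHz}
$$

$$
E_{mult} \approx \frac{15\ \text{mW}}{25 \times 10^{6}\ \text{mult./s}} = 0.6\ \text{nJ per complex multiplication}
$$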

5. ACKNOWLEDGEMENT

The authors would like to thank Mark Vestbacka and Thomas Johansson for useful discussions. The project is financially supported by SSF, the Foundation for Strategic Research in Sweden.

6. REFERENCES

[1] J. Melander, Design of SIC FFT Architectures, Linköping Studies in Science and Technology, Thesis No. 618, Linköping University, Sweden, 1997.
[2] T. Widhe, Efficient Implementation of FFT processing Elements, Linköping Studies in Science and Technology, Thesis No. 619, Linköping University, Sweden, 1997.
[3] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.
[4] C.S. Wallace, "A Suggestion for a Fast Multiplier", IEEE Trans. Electron. Comput., pp. 14-17, Feb. 1964.
[5] Z. Mou and F. Jutand, "'Overturned-Stairs' Adder Trees and Multiplier Design", IEEE Trans. Computers, vol. C41, pp. 940-948, 1992.
[6] A. P. Chandrakasan, S. Sheng, and R. Brodersen, "Low-power CMOS digital design", IEEE J. Solid-State Circuits, vol. SC-27, pp. 1082-1087, April 1992.
[7] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, second edition, 1993.
[8] P. P. Reusens, High Performance VLSI Digital Signal Processing Architecture and Chip Design, Cornell University, Thesis, August, 1983.
[9] R. Brent and H.T. Kung, "A Regular Layout for Parallel Adders", IEEE Trans. Computers, vol. C-31, no. 3, pp. 260-264, March 1982.
