
VLSI Implementation of a High Performance and Low Power 32-Bit Multiply-Accumulate Unit

Yuyun Liao, Intel Corporation, Chandler, Arizona, U.S.A. [email protected]

David Roberts, Intel Corporation, Chandler, Arizona, U.S.A. [email protected]

Eric Hoffman, Intel Corporation, Chandler, Arizona, U.S.A. [email protected]

Abstract

A high performance and low power 32-bit multiply-accumulate unit (MAC) is described in this paper. The fast mixed length-encoding scheme used in the MAC retains the advantage of a 16-bit encoding scheme without adding extra delay to the faster four-stage Wallace tree of a 12-bit encoding scheme. A mixture of static CMOS logic and complementary pass-gate logic (CPL) was used to achieve high speed while still meeting the low power goal. Several power saving techniques were also implemented in this MAC.

1. Introduction

A powerful digital signal processing (DSP) extension to a traditional RISC processor is a significant competitive advantage. The extension enables companies already experienced with RISC development tools and core libraries to easily expand their products into the fast-growing world of DSP. A high performance, low power 32-bit multiply-accumulate unit (MAC) is described in this paper. It is a coprocessor to the Intel® XScale™ microarchitecture. The processor was fabricated in a six-layer-metal 0.18 µm CMOS process. The processor dissipates 450 mW at 1.3 V and 600 MHz, 40 mW at 0.7 V and 150 MHz, and 900 mW at 1.6 V and 800 MHz. Compared with the SA-110 [1], which was acquired from Digital Equipment Corporation, the major innovations of this new MAC implementation are as follows:

• A new encoding scheme allows the MAC to encode 16 bits of the multiplier in the first cycle at very high speeds.

• Performance is especially tuned for 16-bit signed DSP. One-cycle 16 x 16 and 32 x 16 MAC instructions are implemented to increase the throughput of many DSP algorithms. To handle media streams more efficiently, the Single Instruction Multiple Data (SIMD) format and a Multiply with Implicit Accumulate (MIA) are added. As a result, the overall throughput of this MAC increases dramatically.

• To achieve high performance and still meet the low power goal, the MAC circuit was implemented with a mixture of Static CMOS logic and CPL [2].

2. Fast Mixed Length-Encoding Scheme

For a MAC, the latency and throughput depend on how many bits of the multiplier are encoded each cycle. Two conventional implementations are widely used in today's 32-bit high-speed multiply-accumulate units. The first method encodes 12 bits of the multiplier per cycle [1]. Its Wallace tree [3], which compresses nine partial product vectors into two vectors, contains four stages of 3-to-2 carry save adders (CSAs). The second method encodes 16 bits of the multiplier per cycle. Its Wallace tree, which compresses eleven vectors into two vectors, contains five stages of 3-to-2 CSAs. Using the modified Booth's algorithm, both implementations need one to three cycles to create the final sum and carry vectors, depending on the value of the multiplier. The final sum and carry vectors are then sent to the carry look-ahead adder (CLA) to generate the final result.

The main advantage of the 12-bit encoding scheme is that its Wallace tree is about 25% faster than that of the 16-bit encoding scheme. Its main disadvantage is that it needs two cycles to create the final sum and carry vectors for 16-bit signed DSP applications. The 16-bit encoding scheme needs only one cycle to do the same thing, but it is too slow for high-speed designs.

In the first cycle, however, no intermediate sum and carry vectors are fed back into the Wallace tree, and this leaves room for improvement. A new scheme was implemented for the MAC that inherits the advantage of the 16-bit encoding scheme without adding any extra delay to the much faster four-stage Wallace tree of the 12-bit encoding scheme. The new scheme encodes 16 bits of the multiplier in the first cycle and 12 bits of the multiplier in each of the remaining cycles. It therefore generates eight partial products in the first cycle. In the other cycles it generates six partial products, which together with the intermediate feedback carry and sum vectors fill the same eight slots. The eight vectors and one accumulate data vector can then be compressed into two vectors by a four-stage 3-to-2 Wallace tree.

Once 16-bit encoding is enabled in the first cycle, many SIMD instructions can be easily implemented. This gives the Intel® XScale™ microarchitecture a significant performance advantage in handling media streams. The MAC is fast enough on many DSP algorithms that the cost of an extra DSP engine can be avoided.
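The vector and stage counts quoted in this section can be reproduced with a short sketch. The Python below is illustrative only and is not taken from the MAC design. The names `booth_radix4_digits` and `csa_stages` are hypothetical. It assumes the standard radix-4 form of the modified Booth algorithm (one signed digit per two multiplier bits) and a tree that groups vectors in threes at every stage. The split of the nine and eleven vectors of the conventional schemes into partial products, two feedback vectors, and one accumulate vector is inferred from the description of the new scheme, not stated explicitly in the text.

```python
def booth_radix4_digits(value, bits):
    """Radix-4 modified Booth recoding of a signed `bits`-wide multiplier.

    Returns bits // 2 digits in {-2, -1, 0, 1, 2}, least significant first.
    Each digit selects one partial product (0, +/-A or +/-2A).
    """
    assert bits % 2 == 0
    u = value & ((1 << bits) - 1)     # two's-complement bit pattern
    digits = []
    prev = 0                          # implicit bit to the right of bit 0
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def csa_stages(vectors):
    """Number of 3-to-2 CSA stages needed to reduce `vectors` to two."""
    stages = 0
    while vectors > 2:
        # Every full group of three becomes two; leftovers pass through.
        vectors = 2 * (vectors // 3) + vectors % 3
        stages += 1
    return stages


# 12 encoded bits give 6 partial products, 16 encoded bits give 8.
pp12 = len(booth_radix4_digits(0, 12))
pp16 = len(booth_radix4_digits(0, 16))

FEEDBACK = 2     # intermediate sum and carry vectors (assumed breakdown)
ACCUMULATE = 1   # accumulate data vector

conventional_12 = pp12 + FEEDBACK + ACCUMULATE   # 9 vectors
conventional_16 = pp16 + FEEDBACK + ACCUMULATE   # 11 vectors
mixed_first = pp16 + ACCUMULATE                  # no feedback in cycle 1
mixed_later = pp12 + FEEDBACK + ACCUMULATE       # later cycles

print(csa_stages(conventional_12))   # 4
print(csa_stages(conventional_16))   # 5
print(csa_stages(mixed_first))       # 4
print(csa_stages(mixed_later))       # 4
```

Under these assumptions the mixed scheme never presents more than nine vectors to the tree, which is why four CSA stages suffice in every cycle.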

3. Architecture

In real-time DSP systems, many applications (such as the speech and audio codecs G.729b, GSM-AMR, MP3, and AC-3) require 16-bit MAC operations. Offering high performance while consuming less power for these 16-bit DSP applications is a significant competitive advantage for a 32-bit MAC in the emerging wireless and handheld markets. The high-speed mixed length-encoding scheme enables the Intel® XScale™ microarchitecture to add several 16-bit DSP features to its coprocessor to meet the specific needs of various markets. These 16-bit DSP extensions include a Single Instruction Multiple Data (SIMD) format and a Multiply with Implicit Accumulate (MIA).

The MAC architecture is illustrated in Fig. 1, where A[31:0] is the 32-bit multiplicand, B[31:0] is the 32-bit multiplier, and C[31:0] is the 32-bit accumulate data. A 40-bit CLA is used to increase the performance and precision of audio coding algorithms. Two 40-bit accumulators are introduced to increase the throughput of the MIA and SIMD instructions.

Fig. 1 New MAC Architecture (block diagram not reproduced; its labeled blocks are: Multiplicand A[31:0], Multiplier B[31:0], Control Signals, Accumulate Data C[31:0], F-F, Mux, Booth Encoder, Mux Array, Early Termination, MAC Control Logic, CSA Wallace Tree (Four Stages), Sign_EXT, 40-b CLA, 40-bit Accumulators, Data output, Control Outputs)

One of the basic SIMD operations, called MIAPH, is shown in Fig. 2. Two 32-bit registers are treated as two pairs of 16-bit values. The upper 16 bits of each register are multiplied together, and the lower 16 bits are also multiplied together. The two products are then added to the contents of the 40-bit accumulator, and the result is written back to the 40-bit accumulator.

Fig. 2 SIMD MIAPH instruction performs two 16 x 16-bit multiplications and a 40-bit addition.

The SIMD MIAxy MAC instruction is more complex than MIAPH. It involves a single multiplication of two 16-bit values, each taken from either the upper or the lower 16 bits of its source register. This instruction is illustrated in Fig. 3. The combination of the other basic MAC instructions and the SIMD instructions allows a programmer to create tight code for handling media streams.
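The arithmetic of the two instructions described above can be sketched in a few lines of Python. This is a behavioral sketch only, not the hardware data path. The function names and the `a_upper` / `b_upper` selectors are hypothetical. It assumes that the 16-bit halves are signed (the text describes 16-bit signed DSP), that MIAxy adds its product to the 40-bit accumulator (suggested by the MIA name and by Acc0 in Fig. 3), and that the accumulator wraps in two's complement at 40 bits, which the paper does not specify.

```python
ACC_BITS = 40


def to_signed(value, bits):
    """Interpret the low `bits` of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def halves(reg):
    """Split a 32-bit register into signed (upper, lower) 16-bit values."""
    return to_signed(reg >> 16, 16), to_signed(reg, 16)


def miaph(acc, a, b):
    """acc + upper(a) * upper(b) + lower(a) * lower(b), kept to 40 bits."""
    a_hi, a_lo = halves(a)
    b_hi, b_lo = halves(b)
    return to_signed(acc + a_hi * b_hi + a_lo * b_lo, ACC_BITS)


def miaxy(acc, a, b, a_upper, b_upper):
    """acc + one 16 x 16 product; each operand half is chosen by a flag."""
    x = halves(a)[0 if a_upper else 1]
    y = halves(b)[0 if b_upper else 1]
    return to_signed(acc + x * y, ACC_BITS)


a = 0x0003FFFE   # upper half = 3, lower half = -2
b = 0x00040005   # upper half = 4, lower half = 5

print(miaph(10, a, b))              # 10 + 3*4 + (-2)*5 = 12
print(miaxy(0, a, b, True, False))  # 3 * 5 = 15
print(miaxy(0, a, b, False, True))  # -2 * 4 = -8
```

The two Boolean selectors give the four operand combinations that the MIAxy description refers to.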

Fig. 3 Four combinations of 16-bit entities mean that the SIMD MIAxy instruction can be four different instructions.

4. Circuit Implementations

The high performance and low power circuit implementation is a big challenge for the MAC design. Several circuit families were evaluated for the MAC Wallace tree: Dual Rail Domino (DRD) logic, which was used by the first IA-64 microprocessor [7]; Cascode logic [4] [5], which was used by the SA-110; Static CMOS logic [6]; and Complementary Pass-Gate logic (CPL). The schematics of the CSA sum and CSA carry circuits, which are based on CPL, are shown in Fig. 4 and Fig. 5 respectively. Table 1 shows the delay times and power consumption of the different circuit implementations of the Wallace tree, normalized to the CPL implementation. The simulation results in Table 1 are based on a 0.18 µm process at a 1.3 V power supply.

Below are the results of the different circuit implementations for the Wallace tree:

• Static CMOS logic meets the power requirement, but it does not meet the speed requirement.

• Cascode logic meets the power requirement. It is faster than Static CMOS logic, but it still cannot meet the speed requirement.

• Dual Rail Domino (DRD) logic meets the speed requirement, but it does not meet the power requirement.

• Complementary Pass-Gate logic (CPL) meets both the speed and power requirements.

The MAC Wallace tree was implemented in CPL. The 40-b CLA was implemented in Static Pass-Gate logic. The rest of the MAC is implemented in either Static Pass-Gate logic or Static CMOS logic.

Fig. 4 Schematic of CSA SUM Circuit Using CPL

Fig. 5 Schematic of CSA Carry Circuit Using CPL
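Functionally, each 3-to-2 CSA cell is a full adder: the sum output is the XOR of the three inputs and the carry output is their majority. The Python sketch below models only this logic function, not the pass-gate circuits of Fig. 4 and Fig. 5. The function names are hypothetical, the 40-bit default width is an assumption chosen to match the 40-bit CLA, and the dual-rail helper simply reflects the fact that CPL produces each output together with its complement.

```python
def csa_bit(a, b, c):
    """One 3-to-2 CSA cell: (sum, carry) for three input bits."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def csa_bit_dual_rail(a, b, c):
    """Same cell with true and complement outputs, as a CPL gate provides."""
    s, carry = csa_bit(a, b, c)
    return (s, s ^ 1), (carry, carry ^ 1)


def csa_vectors(x, y, z, width=40):
    """Compress three `width`-bit vectors into a sum and a carry vector."""
    mask = (1 << width) - 1
    s = (x ^ y ^ z) & mask
    # Each carry bit has twice the weight of its column, hence the shift.
    carry = (((x & y) | (y & z) | (x & z)) << 1) & mask
    return s, carry


# The compressed pair keeps the same total (modulo 2**width).
x, y, z = 123456789, 987654321, 555555555
s, carry = csa_vectors(x, y, z)
print((s + carry) == (x + y + z) % (1 << 40))   # True
```

Because no carry propagates along the word inside a CSA, the delay of one stage is that of a single cell, and only the final two vectors need a carry-propagating adder.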

Table 1. Normalized delay times and powers for different Wallace Tree implementations

Logic type    Delay    Power
CMOS          1.44     0.96
Cascode       1.36     0.93
DRD           0.96     2.2
CPL           1        1

5. Low Power Considerations

Due to the growth of portable electronic products, low power design has become a major consideration. There are two types of power dissipation: static and dynamic. Static power dissipation is due to leakage current, whereas dynamic power dissipation is due to the switching transient current and the charging and discharging of the load capacitance. The dynamic power dissipation of CMOS circuits is represented as $P_d = a \cdot C \cdot V^2 \cdot f$, where $a$ is the activity factor, $C$ is the load capacitance, $V$ is the power supply voltage, and $f$ is the frequency. The power saving techniques used in this MAC are as follows:

• Extensive clock gating is implemented as early in the clock spine as feasible to limit the clock switched capacitance.

• The use of pulse-clocked latches as master-slave flip-flops reduces the clock power by approximately 50% while minimizing clock-to-Q delay and setup time.

• Compared with the DRD implementation, the activity factor of the CPL implementation is greatly reduced.

• In standby mode, a reverse body-bias technique is used to cut the leakage.

6. Results

Compared with the SA-110 MAC on a per-cycle basis, the throughput of the MAC in this paper increases by 25% to 100% for 32-bit applications and by 100% to 200% for 16-bit operations. The overall throughput comparisons for the SA-110 and the new MAC are given in Table 2. The SIMD feature reduces register pressure and memory traffic, since two 16-bit data items can be loaded into a single register. The Intel® XScale™ microarchitecture MAC can achieve 0.8 MMAC/MHz (650 MMAC at 800 MHz), whereas typical CPUs can only provide 0.2 to 0.4 MMAC/MHz. The dot product performance and the FIR benchmark show that the performance of the Intel® XScale™ microarchitecture MAC is three times better than that of the SA-110 at the same frequency. The maximum power estimate of the new MAC is 86 mW at 800 MHz and a 1.6 V power supply. The size of the MAC layout is 780 µm x 840 µm. Results from real silicon measurements were better than the simulation results.

Table 2. Throughput Comparisons

                      16-bit (MAC/cycle)    32-bit (MAC/cycle)
SA-110 MAC            0.33 to 0.5           0.2 to 0.5
Intel® XScale™ MAC    1                     0.25 to 1
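As a rough cross-check, the dynamic power relation from the low power discussion (power proportional to a, C, V squared, and f) can be compared with the processor operating points quoted in the Introduction: 450 mW at 1.3 V and 600 MHz, 40 mW at 0.7 V and 150 MHz, and 900 mW at 1.6 V and 800 MHz. The sketch below assumes that the activity factor and load capacitance are the same at every operating point and ignores static power. The paper states neither assumption, and the function name is hypothetical.

```python
def scale_dynamic_power(p_ref_mw, v_ref, f_ref_mhz, v, f_mhz):
    """Scale a reference power with V**2 * f, holding a and C constant."""
    return p_ref_mw * (v / v_ref) ** 2 * (f_mhz / f_ref_mhz)


# Reference operating point from the paper: 450 mW at 1.3 V, 600 MHz.
REF = (450.0, 1.3, 600.0)

low = scale_dynamic_power(*REF, v=0.7, f_mhz=150.0)
high = scale_dynamic_power(*REF, v=1.6, f_mhz=800.0)

print(f"0.7 V, 150 MHz: {low:.0f} mW predicted, 40 mW reported")
print(f"1.6 V, 800 MHz: {high:.0f} mW predicted, 900 mW reported")
```

With these assumptions the relation predicts roughly 909 mW at the high-voltage point and roughly 33 mW at the low-voltage point. The difference from the reported 40 mW falls outside what this simple model covers, and the paper does not break the figure down further.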

7. Summary

To meet the high throughput requirements of many advanced DSP applications, a high performance and low power 32-bit multiply-accumulate unit has been designed. It reaches 600 MHz at 1.3 V, 150 MHz at 0.7 V, and 800 MHz at 1.6 V. The high throughput rate is achieved by using a mixed length-encoding scheme and a new MAC architecture with enhanced DSP features. The low power consumption is achieved by using CPL on the critical data paths and Static CMOS on the rest of the MAC, together with several power saving techniques.

Acknowledgements

The authors would like to acknowledge and thank the following people for their contributions to this paper: Tom Hameenanttila, Tom Adelmeyer, Steve Strazdus, Paul Meyer, Jayanti Gandhi, Manish Biyani, and Jay Heeb.

References

[1] J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.

[2] K. Yano and T. Yamanaka, “A 3.8-ns CMOS 16 x 16-b multiplier using complementary pass-transistor logic,” IEEE J. Solid-State Circuits, vol. 25, pp. 388-395, Apr. 1990.

[3] C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Trans. Electron. Comput., vol. EC-13, pp. 14-17, Feb. 1964.

[4] L. G. Heller, W. R. Griffin, J. W. Davis, and N. G. Thoma, “Cascode voltage switch logic: a differential CMOS logic family,” in ISSCC Dig. Tech. Papers, pp. 16-17, 1984.

[5] F.-S. Lai, “Design and implementation of differential cascode voltage switch with pass-gate (DCVSPG) logic for high-performance digital systems,” IEEE J. Solid-State Circuits, vol. 32, no. 4, 1997.

[6] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, MA: Addison-Wesley, 1992.

[7] S. Rusu and G. Singer, “The first IA-64 Microprocessor,” IEEE J. Solid-State Circuits, vol. 35, no. 11, pp. 1539-1544, Nov. 2000.
