Power Optimization in a Parallel Multiplier using Voltage Islands

Seok Won Heo
Computer Science Department, University of California at Los Angeles, CA, USA 90095
Email: [email protected]

Suk Joong Huh
Samsung Electronics, Suwon, Korea
Email: [email protected]

Miloš D. Ercegovac
Computer Science Department, University of California at Los Angeles, CA, USA 90095
Email: [email protected]

Abstract—Minimizing the power dissipation of parallel multipliers is important for mobile digital signal processing. In this paper, we present an approach to reducing power dissipation in parallel multipliers that uses voltage islands to exploit the non-uniform arrival of inputs to the carry propagate adder. In a tree-type parallel multiplier, our approach reduces dynamic power dissipation by up to approximately 20% with little delay penalty, and allows a fast simple adder to be used instead of a hybrid adder.

I. INTRODUCTION

The multiplier is an expensive core component of Digital Signal Processors (DSPs) and Graphics Processing Units (GPUs): studies on power dissipation in DSPs and GPUs indicate that the multiplier is one of the most power-hungry components on these chips. Therefore, research on low power multipliers remains critical. With the increasing complexity of VLSI systems and a growing number of mobile applications, minimizing power consumption has been growing in importance. Dynamic power dissipation is the dominant factor in the total power consumption of a CMOS circuit and typically contributes over 60% of the total system power dissipation. Although the effect of static power dissipation increases significantly as VLSI manufacturing technology shrinks, dynamic power dissipation continues to dominate [1]. It can be described by

$$P_{dynamic} = 0.5 \times C_L \times V_{DD}^2 \times f_p \times N$$

where $C_L$ is the load capacitance, $V_{DD}$ is the power supply voltage, $f_p$ is the clock frequency, and $N$ is the switching activity. The equation indicates that the power supply voltage has the largest impact on dynamic power dissipation because of its squared term. Unfortunately, lowering the power supply voltage causes speed penalties. A great deal of effort has been expended in recent years on developing techniques that use a low power supply voltage while minimizing the performance degradation. Using voltage islands is one way to mitigate such performance degradation through architectural changes to the circuit [2]. This paper proposes a scheme to achieve power savings in a tree-type parallel multiplier by utilizing voltage islands.

This paper is organized as follows. Section II presents an in-depth view of recent research in the design of low power multipliers. Section III addresses the problem of parallel multipliers. In Section IV, the paper focuses on power savings utilizing voltage islands. Section V analyzes how to reduce power, and Section VI discusses current problems. Finally, a summary is given in Section VII.

978-1-4673-5762-3/13/$31.00 ©2013 IEEE

II. RELATED WORK

To exploit parallelism with a scaled power supply voltage, the clustering and partitioning technique was proposed in [3]. The cluster width is defined as the distance between the first and the last nonzero bits. Ignoring the positions outside the cluster and performing the multiplication with a collection of smaller multipliers that run in parallel at scaled supply voltages, while maintaining a given throughput, can achieve significant power savings. Another approach to power savings uses pipelining [4]. Compared to non-pipelined schemes, the pipelined technique can achieve a higher operating frequency at a given supply voltage or, alternatively, a lower supply voltage for a desired throughput. These power-efficient schemes for parallel multipliers, however, have larger areas.

To disable the operations in some rows (or columns), bypassing techniques were proposed in [5]–[7]. If the bits of a multiplier (or multiplicand) are zero, the corresponding partial products are also zero. As a result, the multipliers need not perform the summation of zero partial products. These multipliers bypass inputs to outputs when the corresponding partial products are zero, and therefore avoid unnecessary transitions. These architectures can significantly reduce dynamic power dissipation with little area penalty.

Exploiting the typically large fraction of zero and small-valued inputs, signal gating approaches achieve power savings by deactivating slices [8]–[10]. The multiplier is divided into several slices and detects the parts of the operands with zero values.

The approaches mentioned above mainly reduce either 1) the power supply voltage, with large area overhead, or 2) the switching activity, with small area overhead. The proposed approach reduces the power supply voltages with little delay and area penalties. Therefore, the proposed approach is more efficient and economical than previous approaches.
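To make the supply-voltage dependence concrete, the dynamic power expression from Section I can be evaluated at the two supply voltages used later in the experiments (1.08V and 1.32V). The following Python sketch is illustrative only: the load capacitance, clock frequency, and switching activity are hypothetical placeholder values, not figures from this work, and they cancel in the ratio. The sketch also ignores level shifters and the delay penalty of the lower voltage.

```python
def dynamic_power(c_load, v_dd, f_clk, activity):
    """Dynamic power: 0.5 * C_L * V_DD^2 * f_p * N (expression from Section I)."""
    return 0.5 * c_load * v_dd ** 2 * f_clk * activity


# Hypothetical block parameters, chosen only for illustration.
C_LOAD = 1e-12    # load capacitance in farads (assumed)
F_CLK = 500e6     # clock frequency in hertz (assumed)
ACTIVITY = 0.2    # switching activity (assumed)

V_HIGH = 1.32     # high supply voltage used in the experiments
V_LOW = 1.08      # low supply voltage used in the experiments

p_high = dynamic_power(C_LOAD, V_HIGH, F_CLK, ACTIVITY)
p_low = dynamic_power(C_LOAD, V_LOW, F_CLK, ACTIVITY)

# The ratio depends only on the two voltages: (V_LOW / V_HIGH) ** 2.
ratio = p_low / p_high
print(f"P_low / P_high = {ratio:.4f}")
print(f"reduction for a block moved to the low voltage = {100 * (1 - ratio):.2f}%")
```

With everything else held equal, a block moved from 1.32V to 1.08V dissipates about 67% of its original dynamic power, since (1.08/1.32)^2 ≈ 0.669. The saving for a whole unit is smaller, because only part of the unit can be moved to the low voltage.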


III. PROBLEM

The parallel multiplier can be divided into three parts: 1) the radix-4 encoder and the Partial Product Generator (PPG), 2) the Partial Product Reduction Tree (PPRT), and 3) the carry propagate adder (CPA). The power dissipated by the radix-4 encoder and the PPG, and by the CPA, is relatively small compared to that of the PPRT, which is the dominant component of the power dissipation in tree multipliers. As shown in Table I, the PPRT accounts for approximately 60% of the total power dissipation. Therefore, power savings in the PPRT translate into a major reduction in the power of the parallel multiplier. In this paper, we focus on reducing the power of the PPRT.

The non-uniform arrival of inputs to the adder produced by the PPRT has been used to improve the cost and delay of the adder by decomposing it into several subadders. Arithmetic units have usually been designed under the assumption that all input signals arrive at the same time. The design of the CPA is critical in high performance parallel multipliers, because the delay of the CPA adds significantly to the total multiplication time. To reduce the delay of the CPA, optimal design schemes consisting of blocks of Ripple Carry, Carry Skip, and Carry Select adders that exploit the non-uniform arrival of the adder inputs generated by the PPRT were developed [11][12]. This approach significantly reduces the adder delay compared to a standard final adder and, consequently, reduces the delay of the multiplier.

IV. SOLUTION

We propose to exploit the non-uniform arrival time profile of the tree multiplier to achieve power savings with minimal performance degradation. Specifically, we apply the voltage islands technique to the regions of non-uniform input arrival generated by the PPRT. That is, adders are partitioned into blocks that operate with different power supply voltages.
A voltage island occupies a contiguous physical space and operates at one supply voltage. Voltage islands are applied to the tree multiplier so that the units of the multiplier receive different supply voltage levels according to their performance requirements. The slowest region of the tree multiplier is the middle region, where the arrival time is large and constant. It requires the higher supply voltage level in order to maximize performance. The other regions may run at the lower supply voltage level because they are not on the critical path. These regions are (1) the Least Significant (LS) part, where the arrival time increases from the Least Significant Bit (LSB) towards the middle region, and (2) the Most Significant (MS) part, where the arrival time decreases from the middle region towards the Most Significant Bit (MSB). An example of a partition into Low-High-Low islands is shown in Fig. 1.

V. EXPERIMENTAL RESULTS

A. The PPRT utilizing voltage islands

We implemented the conventional and the proposed tree multipliers in Verilog using a top-down methodology. We used (3 : 2) counters for the Wallace trees with radix-4 modified Booth recoding in 16 × 16 and 32 × 32 bit multipliers [13]. The designs were verified using Cadence NC-Verilog and synthesized using Synopsys Design Compiler with the Samsung 65 nanometer CMOS standard cell low power library. The proposed designs were synthesized with two supply voltages, 1.08V and 1.32V, supported by the technology. Voltage level shifters are needed wherever a signal crosses from one supply voltage to another. Synopsys Design Compiler can now insert the voltage level shifters automatically to support voltage islands. To reduce the effects of changes made by the synthesis tool to the structure of the original Verilog code, we used the same technology and the same Synopsys Design Compiler constraints in all designs. Place & Route was performed using Synopsys Astro to obtain more precise results. Delays were obtained from Synopsys PrimeTime, and power values were obtained from the Samsung in-house power estimation tool, CubicWare.

Fig. 2 shows the non-uniform arrival time profiles generated by the PPRT with the high supply voltage and with voltage islands in a 32 × 32 bit multiplier. To partition the PPRT, we analyzed the slopes of the arrival times of the signals. For the LS region, the delay increases linearly between consecutive bits from the LSB toward the MSB. For the middle region, the delay is constant and large. For the MS region, the slope of the signal delay profile is negative.

Fig. 1. The Partition of the PPRT for Voltage Islands. (LS, MS region: short path → low power supply voltage. Middle region: long path → high power supply voltage.)
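The slope-based partition described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the arrival-time profile is synthetic, and the rule used to find the middle region (positions whose arrival time lies within a tolerance `tol` of the peak) is a hypothetical stand-in for the slope analysis, not the procedure used in the experiments. The function names are likewise illustrative.

```python
def partition_regions(arrival, tol=0.05):
    """Split bit positions into LS / middle / MS regions.

    Assumption for illustration: the middle region spans the first to the
    last position whose arrival time is within `tol` (fractional) of the
    maximum arrival time. Positions below it form the LS region, positions
    above it the MS region.
    """
    peak = max(arrival)
    near_peak = [i for i, t in enumerate(arrival) if t >= (1.0 - tol) * peak]
    lo, hi = near_peak[0], near_peak[-1]
    last = len(arrival) - 1
    return {
        "LS": (0, lo - 1) if lo > 0 else None,
        "middle": (lo, hi),
        "MS": (hi + 1, last) if hi < last else None,
    }


def assign_voltages(regions, v_low=1.08, v_high=1.32):
    """Low-High-Low assignment: only the middle region gets the high supply."""
    return {
        name: (v_high if name == "middle" else v_low)
        for name, span in regions.items()
        if span is not None
    }


# Synthetic 16-position profile (arbitrary time units): a linear rise,
# a flat plateau, then a rapid fall. Not measured data.
profile = [0.1 * i for i in range(1, 6)] + [0.6] * 8 + [0.45, 0.30, 0.15]

regions = partition_regions(profile)
voltages = assign_voltages(regions)
print(regions)
print(voltages)
```

For this synthetic profile the plateau (positions 5 to 12) is assigned 1.32V, and the rising and falling flanks are assigned 1.08V, which mirrors the Low-High-Low arrangement of Fig. 1.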

TABLE I. POWER DISSIPATION OF MULTIPLIER COMPONENTS

| Components (32 × 32 bit multiplier) | Power Dissipation (%) |
|---|---|
| radix-4 encoder and PPG | 21.34 |
| PPRT (Wallace, (3 : 2) counter) | 61.47 |
| CPA (Carry Skip Adder) | 17.19 |

TABLE II. RESULTS FOR PARTITION

| Region | Simulation results, 16 × 16 bit (bit positions) | Simulation results, 32 × 32 bit (bit positions) | Input Characteristics |
|---|---|---|---|
| LS region | 0 ∼ 9 | 0 ∼ 11 | Linear Increase |
| Middle region | 10 ∼ 27 | 12 ∼ 55 | Constant/Large |
| MS region | 28 ∼ 31 | 56 ∼ 63 | Rapid Decrease |
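The bit ranges in Table II can be turned into the share of adder input positions that each region covers. The Python sketch below only counts bit positions. It is a rough proxy, because it assumes nothing about how many counters, or how much capacitance, each column of the PPRT contains.

```python
# Inclusive bit ranges copied from Table II.
TABLE_II = {
    "16x16": {"LS": (0, 9), "middle": (10, 27), "MS": (28, 31)},
    "32x32": {"LS": (0, 11), "middle": (12, 55), "MS": (56, 63)},
}


def region_shares(regions):
    """Return the fraction of bit positions covered by each region."""
    widths = {name: hi - lo + 1 for name, (lo, hi) in regions.items()}
    total = sum(widths.values())
    return {name: width / total for name, width in widths.items()}


shares = {size: region_shares(regions) for size, regions in TABLE_II.items()}
for size, share in shares.items():
    low_voltage = share["LS"] + share["MS"]
    print(f"{size}: middle = {share['middle']:.2%}, LS + MS = {low_voltage:.2%}")
```

By this count the middle region grows from 56.25% of the positions in the 16 × 16 bit multiplier to 68.75% in the 32 × 32 bit multiplier, so a smaller share of the positions can run at the low supply voltage as the operand size increases.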

TABLE III. POWER, DELAY AND AREA COMPARISONS OF PPRT UTILIZING HIGH SUPPLY VOLTAGE AND VOLTAGE ISLANDS

Values in parentheses are normalized to the lower value of each conventional / voltage islands pair.

| Metric | Design | 16 × 16 bit | 32 × 32 bit | ARM7TDMI-S 32 × 8 bit |
|---|---|---|---|---|
| Power (µW) | Conventional (only 1.32V) | 1472.17 (1.1983) | 8485.92 (1.0957) | 1678.31 (1.0687) |
| Power (µW) | Voltage islands (1.08, 1.32V) | 1228.55 (1) | 7744.75 (1) | 1570.42 (1) |
| Delay (ns) | Conventional (only 1.32V) | 1.03 (1) | 1.16 (1) | 0.99 (1) |
| Delay (ns) | Voltage islands (1.08, 1.32V) | 1.12 (1.0874) | 1.18 (1.0172) | 1.00 (1.0101) |
| Area (NAND2) | Conventional (only 1.32V) | 1294 (1) | 5700 (1) | 1384 (1) |
| Area (NAND2) | Voltage islands (1.08, 1.32V) | 1309 (1.0115) | 5728 (1.0049) | 1392 (1.0058) |
| Power-Delay Product (fJ) | Conventional (only 1.32V) | 1501.61 (1.0913) | 9843.67 (1.0592) | 1661.53 (1.0279) |
| Power-Delay Product (fJ) | Voltage islands (1.08, 1.32V) | 1375.98 (1) | 9293.70 (1) | 1616.37 (1) |
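The normalized power entries of Table III are ratios of the conventional value to the voltage islands value, so the percentage they imply depends on which design is taken as the baseline. The Python sketch below recomputes both readings from the absolute power values in the table. The dictionary and function names are illustrative.

```python
# Power values (µW) copied from Table III: (conventional, voltage islands).
POWER_UW = {
    "16x16": (1472.17, 1228.55),
    "32x32": (8485.92, 7744.75),
    "ARM7TDMI-S 32x8": (1678.31, 1570.42),
}


def compare(conventional, islands):
    """Return (ratio, extra_pct, saving_pct) for one pair of power values.

    ratio      : conventional / islands (the normalized entry in the table)
    extra_pct  : how much more the conventional design dissipates,
                 relative to the voltage islands design
    saving_pct : how much less the voltage islands design dissipates,
                 relative to the conventional design
    """
    ratio = conventional / islands
    extra_pct = (ratio - 1.0) * 100.0
    saving_pct = (1.0 - islands / conventional) * 100.0
    return ratio, extra_pct, saving_pct


results = {name: compare(*pair) for name, pair in POWER_UW.items()}
for name, (ratio, extra_pct, saving_pct) in results.items():
    print(f"{name}: ratio = {ratio:.4f}, "
          f"conventional is {extra_pct:.2f}% higher, "
          f"voltage islands are {saving_pct:.2f}% lower")
```

For the 16 × 16 bit PPRT, for example, the ratio of 1.1983 means the conventional design dissipates 19.83% more power than the voltage islands design, or equivalently that the voltage islands design dissipates about 16.5% less than the conventional one.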

Fig. 2. Adder Input Profiles of the PPRT with High Supply Voltages and Voltage Islands using (3 : 2) Counters in a 32 × 32 Bit Multiplier. (Regions labeled in the plot: decrease, constant / large, increase.)

Table II shows the simulation results for the partition into the 3 regions of the multipliers. The LS and MS regions do not require the higher supply voltage, and thus the power of the PPRT can be significantly reduced. To put our results in context, we also compare our approach with another improved parallel multiplier. A good example is the ARM7TDMI processor, which is widely used as a low power architecture [14]. The ARM7TDMI-S is a synthesizable core, which includes an enhanced single 32 × 8 Wallace Tree multiplier. Table III summarizes the results for the proposed and conventional multipliers. The proposed multipliers dissipate 19.83% and 9.57% less power than conventional multipliers for operand sizes of 16 and 32 bits, respectively, with 8.74% and 1.72% increases in delay. As the operand size increases, the relative reduction in power dissipation decreases, because the share of the middle region of the PPRT, which requires the higher supply voltage level, increases. Furthermore, our designs are better than conventional multipliers in terms of power-delay product. The ARM7TDMI-S multiplier shows a similar result: the voltage islands technique reduces power by 6.87%, while delay and area are little affected. Compared to the other multipliers, the ARM7TDMI-S multiplier saves less power because of its small LS and MS regions. The overall results indicate that the proposed design approach has good potential for power savings while maintaining the multiplier latency.

B. Simple CPA

The hybrid final adders are designed under the assumption of a non-equal signal arrival profile. With the use of voltage islands, the non-uniform input arrivals to the CPA are transformed into a uniform input profile, because the delays increase in the LS and MS regions while the delay in the middle region is maintained. Thus, the hybrid final adder is unnecessary for parallel multipliers utilizing voltage islands. Furthermore, hybrid adders have power and critical path delay very similar to those of fast adders such as the carry skip adder, as shown in Table IV. Thus, it is better to use a simple fast adder instead of a hybrid adder in a parallel multiplier, because of its simpler structure. Experimental results indicate that multipliers utilizing voltage islands with a simple CPA reduce the power by 19.07% and 12.02%, with 11.98% and 7.06% increases in delay, for operand sizes of 16 and 32 bits, respectively, in comparison with multipliers utilizing the high supply voltage with a hybrid final adder. Table V summarizes the results for the proposed and conventional multipliers. These comparisons include the power, delay and area of the additional circuitry needed to implement voltage islands. The voltage level shifters add power and delay. However, a significant reduction in the power dissipation of multipliers can still be achieved by implementing voltage islands with voltage level shifters.

TABLE IV. POWER, DELAY AND AREA COMPARISONS OF HYBRID AND CARRY SKIP ADDER

Values in parentheses are normalized to the lower value of each hybrid / carry skip pair.

| Metric | Adder | 32-bit | 64-bit |
|---|---|---|---|
| Power (µW) | Hybrid adder | 66.28 (1) | 179.54 (1.09) |
| Power (µW) | Carry skip adder | 67.20 (1.01) | 164.64 (1) |
| Delay (ns) | Hybrid adder | 4.80 (1) | 7.24 (1) |
| Delay (ns) | Carry skip adder | 5.27 (1.10) | 7.67 (1.06) |
| Area (NAND2) | Hybrid adder | 334 (1.05) | 748 (1.17) |
| Area (NAND2) | Carry skip adder | 317 (1) | 641 (1) |

TABLE V. POWER, DELAY AND AREA COMPARISONS OF MULTIPLIER UTILIZING VOLTAGE ISLANDS AND HIGH SUPPLY VOLTAGE

Values in parentheses are normalized to the lower value of each pair.

| Metric | Design | 16 × 16 | 32 × 32 |
|---|---|---|---|
| Power (µW) | The PPRT utilizing high supply voltage with hybrid adder | 1639.28 (1.1907) | 8798.92 (1.1202) |
| Power (µW) | The PPRT utilizing voltage islands with carry skip adder | 1376.70 (1) | 7854.34 (1) |
| Delay (ns) | The PPRT utilizing high supply voltage with hybrid adder | 3.84 (1) | 5.38 (1) |
| Delay (ns) | The PPRT utilizing voltage islands with carry skip adder | 4.30 (1.1198) | 5.76 (1.0706) |
| Area (NAND2) | The PPRT utilizing high supply voltage with hybrid adder | 1672 (1.0108) | 6546 (1.0098) |
| Area (NAND2) | The PPRT utilizing voltage islands with carry skip adder | 1654 (1) | 6482 (1) |

VI. DISCUSSION

A. The Problems of Our Design

Our proposed design does not scale well in terms of power reduction, and thus is not suitably power-efficient when applied to large precisions. It would probably not yield a power reduction in 64 × 64 bit or wider parallel multipliers.

Designing for voltage islands requires consideration of additional circuits. If a chip is connected to a single power supply, a chip utilizing voltage islands requires voltage regulators to generate one or more additional power supply levels. If a chip is connected to two or more power supplies, signal I/O pad cells must be designed for each supply voltage level.

Voltage islands also create requirements for verification. The voltage range significantly affects gate delay characteristics and the paths that traverse the boundaries of voltage islands. Thus, Static Timing Analysis (STA) must handle timing calculation for paths that span multiple supply voltages, and current tools do not yet fully support this. Design For Test (DFT) for voltage islands also increases the potential testing complexity, because scan and test logic must be isolated while all islands are powered on, and each island is then tested independently.

B. The Use of Voltage Islands for Commercial Chips

Multiple supply voltages are already used in most commercial chips. One example is Samsung NAND flash memory with three supply voltages: 3.3V, 1.8V, and 1.2V. Digital cores require an operating voltage of 1.2V, while some modules, such as high speed analog cores that coexist with the digital cores, are specified at 3.3V and 1.8V. Further, the NAND flash memory utilizes the 1.8V supply voltage. Thus, it does not seem hard to use voltage islands in parallel multipliers, because most commercial chips already have built-in multiple voltage sources. This allows us to reduce the total power dissipation of a chip.

VII. CONCLUSION

In this paper, we have discussed ways of saving power in the multiplier. The PPRT is performance-critical only in its middle region. We have examined ways of utilizing voltage islands and shown that the proposed approach reduces power. Compared to conventional multipliers, power savings of up to approximately 20% are obtained in the PPRT. Furthermore, a simple CPA can be used instead of the hybrid final adder. The techniques presented in this paper can also be applied to other arithmetic units with non-uniform signal arrival profiles.

REFERENCES

[1] J. Rabaey, Low Power Design Essentials, Springer, 2009.
[2] D. E. Lackey, P. S. Zuchowski, T. R. Bednar, D. W. Stout, S. W. Gould, and J. M. Cohn, “Managing power and performance for system-on-chip designs using voltage islands,” in Proc. ICCAD, Nov. 2002, pp. 195–202.
[3] A. A. Fayed and M. A. Bayoumi, “A novel architecture for low-power design on parallel multipliers,” in Proc. IEEE Comput. Soc. Workshop on VLSI, Apr. 2001, pp. 149–154.
[4] J. Di and J. S. Yuan, “Power-aware pipelined multiplier design based on 2-dimensional pipeline gating,” in Proc. GLSVLSI, Apr. 2003, pp. 64–67.
[5] G. Economakos and K. Anagnostopoulos, “Bit level architectural exploration technique for the design of low power multipliers,” in Proc. ISCAS, May 2006, pp. 1483–1486.
[6] J.-N. Ohban, V. G. Moshnyaga, and K. Inoue, “Multiplier energy reduction through bypassing of partial products,” in Proc. APCCAS, vol. 2, Oct. 2002, pp. 13–17.
[7] M.-C. Wen, S.-J. Wang, and Y.-N. Lin, “Low power parallel multiplier with column bypassing,” in Proc. ISCAS, May 2005, pp. 1638–1641.
[8] K. Kim, P. A. Beerel, and Y. Hong, “An asynchronous matrix-vector multiplier for discrete cosine transform,” in Proc. ISLPED, Jul. 2000, pp. 256–261.
[9] J. Choi, J. Jeon, and K. Choi, “Power minimization of functional units by partially guarded computation,” in Proc. ISLPED, Jul. 2000, pp. 131–136.
[10] Z. Huang and M. D. Ercegovac, “Two-dimensional signal gating for low-power array multiplier design,” in Proc. ISCAS, vol. 1, Aug. 2002, pp. 489–492.
[11] V. G. Oklobdzija, “Design and analysis of fast carry-propagate adder under non-equal input signal arrival profile,” in Proc. Asilomar Conf. Signals, Syst., and Comput., Nov. 1994, pp. 1398–1401.
[12] P. F. Stelling and V. G. Oklobdzija, “Design strategies for optimal hybrid final adders in a parallel multiplier,” Journal of VLSI Signal Processing, vol. 14, pp. 321–331, Dec. 1996.
[13] C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Trans. Electron. Comput., vol. EC-13, pp. 14–17, Feb. 1964.
[14] ARM, “ARM7TDMI Technical Reference Manual.”
