Bus-Switch Coding for Reducing Power Dissipation in Off-Chip Buses

Mauro Olivieri
Department of Electronic Engineering, University of La Sapienza, Rome, Italy
Email: [email protected]

Francesco Pappalardo and Giuseppe Visalli
Advanced System Technology, ST Microelectronics, Catania, Italy
Email: [email protected]
Email: [email protected]

June 3, 2005

Abstract

We present a novel coding scheme for reducing bus power dissipation. The approach is well suited to driving off-chip buses, where the line capacitance is a dominant factor. A distinctive feature of the technique is the dynamic reordering of bus line positions, in order to minimize the toggling activity on physical bus wires. The effectiveness of the approach is demonstrated through cycle-accurate simulation of industrial benchmarks in conjunction with post-layout evaluation of speed, power and area overhead.

KEY WORDS
Bus transfer, Encoding, Power demand, Very-large-scale integration

1 Introduction and Background

Unlike technological solutions, such as voltage swing reduction and charge recycling [7], [13], bus data encoding aims at lowering the average energy consumption by reducing the switching activity of bus lines [2]. In bus drivers, especially off-chip bus drivers, the large load capacitance keeps dynamic (i.e., switching) power the main target to be defeated, also in emerging technologies like systems-in-packages [11]. Bus encoding for low power started with bus-invert (BI) [8] and clustered bus-invert (CBI) coding. In [1] and [4], the idea of adaptive coding was introduced. In particular, adaptive partial BI (APBI) applies BI only to lines with high toggling probability. In [3], Ramprasad showed a theoretical lower bound on the achievable reduction of bus activity. Other more recent works address the problem of reducing power consumption caused by interwire coupling capacitance [5], [6], [12], which is a technologically growing factor in on-chip buses (thin and tall wires), while it remains a second-order effect in off-chip buses.

This paper introduces a novel point of view in off-chip bus encoding, which is not embraced by Ramprasad's baseline scheme. The method, which we call bus-switch (BS), is based on dynamically reordering the bus lines to obtain a physical bit vector as similar as possible to the previous one. The method has been functionally simulated in C language and implemented in synthesizable Verilog targeting 0.13-µm and 0.18-µm CMOS standard cell libraries [9], [10]. Here we report the technique, the switching activity analysis and the post-layout power consumption analysis.

2 The Bus-Switch Coding Mechanism

In principle, the Bus Switch technique can be logically expressed as a four-step process:

1. A large bus is divided into several identical clusters of M (cluster depth) lines each.

2. Each M-line bus is coded by swapping the input lines using a particular reordering pattern.

3. A tentative data encoding is obtained by applying to the swapped M lines a fixed coding function.

4. The process is repeated M! times from step 2 until the optimal reordering pattern is found, i.e., the one that minimizes the output switching activity in the encoded data of the whole bus.

In the following a formal definition of the process is given. Let $b(t)$ be the input data word to the bus encoder and $B_{opt}(t)$ the encoded data word on the bus, at clock cycle $t$. The single bits of any M-bit data word $x(t)$ will be indicated as $x(t)(0), x(t)(1), \ldots, x(t)(M-1)$.

Definition 1 A reordering pattern $p(t)$ is an ordered set of M indices $i_0 \ldots i_{M-1}$, associated with clock cycle $t$. Given a data word $x(t)$, the swap operator $S_w$ (abbreviated $S$ below) with reordering pattern $p$ is a combinational logic function producing a swapped data word $y(t) = S_w(x(t), p(t))$, such that $y(t)(0) = x(t)(i_0)$; $y(t)(1) = x(t)(i_1)$; ...; $y(t)(M-1) = x(t)(i_{M-1})$.

As an example, if $p(t)$ = "1,2,3,0" and $x(t)$ = "0100", then $S_w(x(t), p(t))$ = "1000". Note that each reordering pattern $p(t)$ has a unique inverse $p^{-1}(t)$, such that $x(t) = S_w(S_w(x(t), p(t)), p^{-1}(t))$. For instance, if $p(t)$ = "1,2,3,0" then $p^{-1}(t)$ = "3,0,1,2".

Definition 2 A coding function is a combinational logic function producing a data word $B(t)$, applying swapping to $b(t)$ and employing any other words resulting from input or output observation.

In our work, we evaluated several different coding functions, arriving at satisfactory performance with the following:

$B(t) = S(b(t), p) \oplus S(B_{opt}(t-1), p^{-1})$

where the symbol $\oplus$ represents the bit-wise XOR operator.

Definition 3 The optimal reordering pattern $p_{opt}(t)$ is the reordering pattern that minimizes the switching activity (the score, see Fig. 1) $H[B_{opt}(t-1) \oplus B(t)]$, where $H$ is the Hamming distance between the previous transmission $B_{opt}(t-1)$ and the coding function result, varying the reordering pattern.

From the definition of coding function and inverse swapping pattern, we derive the following definition of decoding function to obtain the original word $b(t)$. This function is implemented in the decoder receiving the encoded words $B_{opt}(t)$:

$b(t) = S(S(B_{opt}(t-1), p^{-1}(t)) \oplus B_{opt}(t), p^{-1}(t))$

In the BS implementation, the optimal sorting pattern is searched by making M! attempts in parallel or partially in parallel, depending on the constraints on the allowed area overhead. The inverse sorting pattern, the swapped words and the coding/decoding functions can be obtained by combinational logic.

3 Switching Activity Reduction Analysis

3.1 Evaluation Methodology

As a benchmark for switching activity analysis, we considered the transfer of binary executable and multimedia files on the bus. We used C language models of the BS algorithm and of previously proposed approaches. Fig. 1 shows the reference architecture scheme of BS. Some extra bus lines are present, to transmit the swapping pattern $p(t)$ used for the current clock cycle. Such extra lines are encoded by means of conventional BI. The switching activity of the extra lines is taken into account in the analysis. In addition to the pure BS procedure, we studied the performance of a reduced pattern set, i.e., employing a subset composed of the most frequently used patterns resulting from a statistical analysis. This is relevant for area and speed requirements in practical implementations.
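The C models used for the evaluation are not reproduced in this paper. As a rough stand-in, the following Python sketch illustrates the swap operator, the inverse pattern, the coding function of Definition 2, the exhaustive pattern search of Definition 3, and the decoding function for a single M-bit cluster. It is an illustration under stated assumptions, not the authors' code: words are tuples of bits indexed left to right (so that "0100" is `(0, 1, 0, 0)`), all function names are hypothetical, ties between equally good patterns are broken by taking the first one found, and the transmission of the pattern on BI-encoded extra lines is omitted.

```python
from itertools import permutations

def swap(x, p):
    """Swap operator S_w: output bit k is input bit p[k]."""
    return tuple(x[i] for i in p)

def inverse(p):
    """Inverse pattern, so that swap(swap(x, p), inverse(p)) == x."""
    inv = [0] * len(p)
    for k, i in enumerate(p):
        inv[i] = k
    return tuple(inv)

def xor(a, b):
    """Bit-wise XOR of two equal-length words."""
    return tuple(u ^ v for u, v in zip(a, b))

def hamming(a, b):
    """Hamming distance: number of bit positions that differ."""
    return sum(u != v for u, v in zip(a, b))

def encode(b, prev, patterns=None):
    """Return (B_opt, p_opt) for input word b and previous bus word prev.

    Every candidate pattern p yields B = S(b, p) XOR S(prev, p^-1);
    the candidate closest to prev in Hamming distance is kept.
    By default all M! patterns are tried (assumption: no reduced set).
    """
    if patterns is None:
        patterns = permutations(range(len(b)))
    best = None
    for p in patterns:
        p = tuple(p)
        cand = xor(swap(b, p), swap(prev, inverse(p)))
        score = hamming(prev, cand)
        if best is None or score < best[0]:
            best = (score, cand, p)
    return best[1], best[2]

def decode(B, prev, p):
    """Recover b = S(S(prev, p^-1) XOR B, p^-1) from the received word B."""
    inv = inverse(p)
    return swap(xor(swap(prev, inv), B), inv)
```

Passing an explicit `patterns` list to `encode` mimics a reduced pattern set; the decoder only needs the pattern that was actually used, together with the previously received word.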

3.2 Results and Comparison With Existing Encoding Schemes

Table 1 reports the switching activity reduction obtained by BS and previous approaches. For BS, the table shows several cluster sizes M and several reduced pattern set sizes. BS performs better than the compared approaches in nearly all cases. It turns out that increasing the cluster size beyond 6 bits does not improve the performance significantly. A further analysis has been done on the effect of the total bus width on BS, considering buses from 8 bits up to 64 bits wide. The only case where BS was slightly inferior to a previous approach (CBI) is the 8-bit bus width.

4 VLSI Architecture Design and Implementation

4.1 Encoder

The swapping patterns can be sequentially generated by a finite state machine (FSM) very similar to a binary counter. The direct binary representation of a swapping pattern is a vector of M binary numbers, each ranging from 0 to M − 1, therefore requiring M · log2(M) bits. The swap operation is performed by a set of multiplexers as in Fig. 2. Referring to M = 4, the 8-bit pattern is partitioned into four 2-bit numbers, namely A, B, C, and D in the left part of Fig. 2. In practice, the extra lines to transmit the pattern are drastically reduced by means of a combinational pattern encoder, exploiting the fact that the allowed individual patterns are at most M!. The proposed coding function (Definition 2) is implemented by a twin swap unit, illustrated in Fig. 3; the conversion from a swapping pattern to its inverse is directly implemented by a dedicated two-level combinational logic unit, PConv.

In order to perform the M! attempts to find the best pattern, a partially or fully parallel implementation of the BS encoder can be pursued, employing L units, each performing M!/L attempts. We refer to such a solution as an L-way parallel architecture. The architecture of the single unit is shown in Fig. 4. PatGen is the FSM that generates the set of allowed patterns to be tried, and H produces the Hamming distance between two words by performing a population count after XORing. The Cmp unit compares the actual Hamming distance with the temporary minimum. When all the patterns have been tried and the minimum distance found, the threshold unit stores the pattern, the encoded word and the distance value on output registers. Fig. 5 shows the top view of the encoder architecture.

4.2 Decoder

Referring to Definition 3, the decoder architecture is depicted in Fig. 6. The PConv combinational block performs the pattern conversion to obtain the inverse pattern. In addition, a BI decoding unit and a combinational pattern decoder reconstruct the direct representation of the transmitted pattern.

5 Post-Layout Performance Evaluation

5.1 Area Overhead and Timing

We have implemented two versions of BS with 30-bit bus, 5-bit cluster size, and 16 patterns. The bus frequency was 50 MHz for all the implementations. The first version is a four-way parallel architecture with 200-MHz unit frequency. The other version is a fully (i.e., 16-way) parallel architecture, with 50-MHz unit frequency. Both architectures have been implemented in two CMOS technologies: 1.6 V supply 0.18-µm HCMOS8 and 1.32 V supply 0.13-µm HCMOS9 from STMicroelectronics, and for each technology two different cell libraries have been used, low leakage and high speed. For comparison, we also implemented BI, CBI with 5-bit cluster size, and APBI with a 16-transfer mask update period. Table 2 summarizes the area results. The large result for the four-way architecture in 0.18-µm technology is due to the relatively high clock frequency (200 MHz), demanding fast cells with considerably large area. For the fully parallel architecture, the shortest sustainable bus cycle is 3.23 ns (about 310 MHz) in 0.13 µm, and for the four-way architecture it is 15.52 ns (about 68 MHz) (typical process and 40 °C temperature).

5.2 Power Consumption Overhead

Table 3 reports the estimated power consumption of BS and the other approaches. The power evaluation methodology is based on post-layout simulations with real stimuli, using power models of the target standard cell library. Table 3 also reports the average energy overhead consumed per bus cycle (independently of the bus frequency). The average energy saving per bus cycle is therefore

$E_{saved} = 0.5 \cdot \mathit{switching\_reduction} \cdot T \cdot C_{bus} \cdot V_{dd}^2 - \mathit{energy\_overhead}$

where $T$ is the toggling activity without bus encoding and $C_{bus}$ is the capacitance of a bus line. The energy consumption percentage ratio, with respect to the non-encoded bus, must be lower than 100% for the encoding to be effective. Fig. 7 shows the energy percentage ratio, E%, versus the bus line capacitance $C_{bus}$, referring to the low-leakage 0.13-µm library and to the average result among the benchmarks used in Table 1. The average break-even point for BS effectiveness corresponds to 6.5 pF bus capacitance for the 4-way implementation, and 3 pF for the 16-way implementation. For large bus capacitance the actual energy saving estimate is better than all the compared approaches.

6 Conclusions

The outcome of our work is that the BS approach can be effectively applied to off-chip buses, obtaining better performance than previous approaches. When the area limitation is not a primary issue, the fully parallel implementation is definitely preferable. Minor design improvements may still be introduced in the pattern transmission.

References

[1] L. Benini, A. Macii, E. Macii, M. Poncino, and R. Scarsi. Architectures and synthesis algorithms for power-efficient bus interfaces. IEEE Transaction on Computer-Aided Design, 19:969–980, Sept. 2000.

[2] A. Chandrakasan and R. Brodersen. Minimizing power consumption in digital CMOS circuits. Proc. IEEE, 83:498–523, Apr. 1995.

[3] S. Ramprasad, N. Shanbhag, and I. Hajj. Information-theoretic bounds on average signal transition activity. IEEE Transaction on VLSI Systems, 7:359–368, Sept. 1999.

[4] R. Siegmund, C. Kretzchmar, and D. Muller. Adaptive bus encoding technique for switching activity reduced data transfer over wide system buses. In Proc. of International Workshop on Power And Timing Modeling Optimization and Simulation, Gottingen, Germany, 2000.

[5] P. Sotiriadis and A. Chandrakasan. Bus energy minimization by transition pattern coding (tpc) in deep submicron technologies. In ACM/IEEE International Conference on Computer-Aided Design, pages 322–327, San Jose, CA, Nov. 2000.

[6] P. P. Sotiriadis and A. Chandrakasan. Low power bus coding techniques considering interwire capacitances. In IEEE Custom Integrated Circuits Conf., pages 507–510, May 2000.

[7] P. P. Sotiriadis, T. Konstantakopoulos, and A. Chandrakasan. Analysis and implementation of charge recycling for deep sub-micron buses. In ACM/IEEE Int. Symp. Low-Power Electronics Design, Huntington Beach, CA, Aug. 2001.

[8] M. Stan and W. Burleson. Bus-invert coding for low power I/O. IEEE Transaction on VLSI Systems, 3:49–58, Mar. 1995.

[9] STMicroelectronics. Corelib HCMOS8 DataBook and CB35000 Standard Cell Library Datasheet. 2001.

[10] STMicroelectronics. Corelib HCMOS9 DataBook and CB45000 Standard Cell Library Datasheet. 2002.

[11] K. L. Tai. System-in-package (sip): Challenges and opportunities. In Proc. IEEE ASP-DAC, pages 191–196, Jan. 2000.

[12] . Zhang, J. Lach, K. Skadron, and M. R. Stan. Odd/even bus invert with two-phase transfer for buses with coupling. In ACM/IEEE Int. Symp. Low-Power Electronics Design, Monterey, CA, Aug. 2002.

[13] H. Zhang, V. George, and J. M. Rabaey. Low-swing on-chip signaling techniques: effectiveness and robustness. IEEE Trans. VLSI Syst., 8:264–272, June 2000.

Files                             LaTeX   Spice   Gcc     JPeg    MP3     AVI
Stream length (KB)                363     1763    86      44      91      29
BI, 32 bits                       7.78    10.64   9.07    10.00   10.27   0.00
CBI, 32 bits, M=4                 14.53   16.56   14.15   15.26   15.53   1.70
CBI, 30 bits, M=5                 16.81   19.27   18.68   17.14   16.97   6.58
APBI, 32 bits                     6.70    10.20   8.37    6.54    6.40    6.12
BS, 30 bits, M=4, all patterns    22.10   21.63   18.39   15.15   16.56   23.16
BS, 30 bits, M=4, 16 patterns     22.94   22.00   19.17   14.30   15.70   24.45
BS, 30 bits, M=5, 8 patterns      25.81   24.36   25.00   13.94   13.58   35.95
BS, 30 bits, M=5, 16 patterns     26.56   26.08   26.34   17.06   16.62   35.66
BS, 30 bits, M=5, 32 patterns     26.60   26.66   26.91   19.36   18.81   33.34
BS, 30 bits, M=6, 8 patterns      25.69   24.53   24.71   13.05   12.74   35.80
BS, 30 bits, M=6, 16 patterns     26.48   25.84   25.96   15.92   15.53   35.00
BS, 30 bits, M=6, 32 patterns     26.78   26.70   26.83   18.36   17.68   32.26

Table 1: SWITCHING ACTIVITY REDUCTION (PERCENTAGE). IN BS, PATTERN EXTRA LINES ACTIVITY IS TAKEN INTO ACCOUNT
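To make the metric of Table 1 concrete, the sketch below shows one plausible way to measure a switching activity reduction percentage, using conventional bus-invert (BI) [8] as the encoder under test. It is an illustration only: the exact counting conventions of the C models (initial bus state, handling of the extra lines) are not given in the paper, so the all-zero initial state and the function names used here are assumptions.

```python
def toggles(words):
    """Total number of bit transitions on the bus over a word stream.

    Assumption: the bus starts in the all-zero state.
    """
    prev = 0
    total = 0
    for w in words:
        total += bin(prev ^ w).count("1")
        prev = w
    return total

def bus_invert(words, width):
    """Bus-invert coding of a stream of `width`-bit words.

    If more than half of the data lines would toggle, the inverted
    word is sent and the extra invert line (placed here as the bit
    above the data bits) is set to 1.
    """
    mask = (1 << width) - 1
    prev = 0  # previous state of the data lines
    encoded = []
    for w in words:
        if bin(prev ^ w).count("1") > width / 2:
            data, inv = ~w & mask, 1
        else:
            data, inv = w, 0
        encoded.append((inv << width) | data)
        prev = data
    return encoded

def reduction_percent(raw, encoded):
    """Switching activity reduction of `encoded` relative to `raw`, in percent."""
    return 100.0 * (1 - toggles(encoded) / toggles(raw))
```

Because the invert line is part of each encoded word, its own transitions are counted in the result, in the same spirit as the pattern extra lines being taken into account for BS.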

Technology             BI      CBI     APBI    BS 4-way   BS 16-way
0.18 µm Low Leakage    7593    7728    47544   208523     298455
0.13 µm Low Leakage    3671    3787    22818   78862      152817
0.18 µm High Speed     11172   11370   66785   189648     296042
0.13 µm High Speed     3665    3711    23597   62835      142303

Table 2: AREA REPORT OF DIFFERENT BUS ENCODERS (µm²)

Technology            BI            CBI           APBI          BS 4-way      BS 16-way
                      AvP    EpBC   AvP    EpBC   AvP    EpBC   AvP    EpBC   AvP    EpBC
                      (mW)   (pJ)   (mW)   (pJ)   (mW)   (pJ)   (mW)   (pJ)   (mW)   (pJ)
0.18 µm Low Leak.     1.79   36     1.66   33     2.38   48     8.90   178    6.14   123
0.13 µm Low Leak.     0.90   18     0.77   15     0.93   19     4.65   93     2.01   40
0.18 µm High Speed    1.91   38     1.67   33     2.10   42     9.27   185    7.19   144
0.13 µm High Speed    1.04   21     0.87   17     1.28   26     5.10   102    3.34   67

Table 3: Energy per bus cycle (EpBC) and Average Power (AvP) for different bus encoders, bus frequency 50 MHz
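The two quantities in Table 3 are linked by the bus frequency: the energy per bus cycle is the average power divided by 50 MHz. The sketch below shows this conversion, together with a break-even estimate that reads the energy-saving expression of Section 5.2 as the switching energy saved minus the encoder energy overhead. That reading, the function names, and the argument values in the comments are assumptions for illustration; the paper does not list the toggling activity T used for Fig. 7.

```python
BUS_FREQ_HZ = 50e6  # bus frequency stated in the caption of Table 3

def energy_per_cycle_pj(avg_power_mw, freq_hz=BUS_FREQ_HZ):
    """Energy per bus cycle in pJ from average power in mW."""
    return avg_power_mw * 1e-3 / freq_hz * 1e12

def energy_saved_pj(reduction, toggles_per_cycle, c_bus_pf, vdd, overhead_pj):
    """Net energy saved per bus cycle in pJ.

    Assumed form: 0.5 * reduction * T * C_bus * Vdd^2 - overhead,
    with reduction as a fraction (e.g. 0.25 for 25%), C_bus in pF
    and Vdd in volts, so that pF * V^2 gives pJ.
    """
    return 0.5 * reduction * toggles_per_cycle * c_bus_pf * vdd ** 2 - overhead_pj

def break_even_capacitance_pf(reduction, toggles_per_cycle, vdd, overhead_pj):
    """Bus line capacitance (pF) at which the net saving is zero."""
    return overhead_pj / (0.5 * reduction * toggles_per_cycle * vdd ** 2)

# Example with hypothetical inputs: 25% reduction, 15 toggles per cycle,
# 1.32 V supply, 40 pJ overhead (the 16-way, 0.13 µm low-leakage EpBC).
# break_even_capacitance_pf(0.25, 15.0, 1.32, 40.0)
```

Under this reading the break-even capacitance scales linearly with the encoder overhead, which is consistent with the 4-way implementation having both the larger energy per cycle and the larger break-even capacitance.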


Figure 1: Basic architecture scheme of BS coding/decoding system.

Figure 2: Implementation of the swap operation (M = 4). Left: swap box. Right: complete swap unit. N is the bus width.

Figure 3: Twin swap unit for the implementation of the coding function.

Figure 4: Architecture of the single unit inside the encoder.

Figure 5: Top level architecture of the BS encoder.

Figure 6: Decoder architecture.

Figure 7: Actual energy consumption ratio E% versus bus line capacitance for
