IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 55, NO. 7, AUGUST 2008
Skewed Repeater Bus: A Low-Power Scheme for On-Chip Buses
Maged M. Ghoneima, Muhammad M. Khellah, Member, IEEE, James Tschanz, Yibin Ye, Nasser Kurd, Javed S. Barkatullah, Member, IEEE, Srikanth Nimmagadda, Yehea Ismail, Member, IEEE, and Vivek K. De, Senior Member, IEEE
Abstract—This paper proposes a bus architecture called skewed repeater bus (SRB) for reducing on-chip interconnect energy in microprocessors. By introducing a dynamic relative delay between neighboring bus lines, SRB reduces both the average and worst-case coupling capacitance between those lines. SRB is compared to previously published techniques such as the delayed data bus (DDB) and the delayed clock bus (DCB). Simulation results in a 65-nm process show that a bus energy reduction of 18% is achieved when SRB is applied to a real microprocessor example, versus only 11% and 7% for DDB and DCB, respectively.
Index Terms—Coupling capacitance, interconnects, low-power design, on-chip bus, repeater.
Fig. 1. Technology scaling trend of M4 coupling capacitance.
I. INTRODUCTION
ON-CHIP interconnect RC delay is not scaling with technology [1], [2]. Moreover, microprocessor die size increases from one generation to the next. Deeper pipelining also causes clock frequency to improve at a faster rate. Therefore, the longest interconnect that can be accommodated in a single clock cycle between two flip-flops (flop distance) shrinks, and the number of clock cycles needed to propagate a signal from one point of the chip to another increases. This adversely impacts the overall performance of the processor. At the same time, energy gets worse due to the larger repeaters required to meet cycle time, and the higher number of flops, which magnifies clock energy. An example of a multicycle conventional static bus (CSB) is given in Fig. 2.
Coupling capacitance between adjacent lines in a bus is becoming a larger fraction of total interconnect capacitance because of higher interconnect aspect ratio and tighter pitch [3], [4]. This is illustrated in Fig. 1, which gives the technology scaling trend of the coupling capacitance ratio for Metal-4 interconnect. The relative switching behavior of neighboring lines affects the charging/discharging of their intermediate coupling capacitor, and hence impacts their energy dissipation and propagation delay.
Manuscript received February 17, 2007. First published August 13, 2008 (projected). This paper was recommended by Associate Editor R. Puri.
M. M. Ghoneima and J. S. Barkatullah are with NVIDIA Corporation, Santa Clara, CA 95051 USA (e-mail: m_ghoneima@ieee.org).
M. M. Khellah, J. Tschanz, Y. Ye, and V. K. De are with the Circuits Research Laboratory, Intel Corporation, Hillsboro, OR 97124 USA.
N. Kurd and S. Nimmagadda are with the Digital Enterprise Microprocessor Circuit Technology Group, Intel Corporation, Hillsboro, OR 97124 USA.
Y. Ismail is with the Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60201 USA.
Digital Object Identifier 10.1109/TCSI.2008.928527
The worst-case scenario for energy dissipation and delay occurs when the neighboring interconnects switch in opposite directions, causing the voltage across the coupling capacitor to swing from $-V_{DD}$ to $+V_{DD}$. The best-case scenario occurs when both interconnects switch in the same direction at the same time. In this case, the coupling capacitor neither charges nor discharges, and hence has no impact on the interconnect delay or energy dissipation. Thus, in order to model the impact of relative switching, an effective coupling capacitor $C_{c,\mathrm{eff}}$ is inserted in (1), which is a multiple of the physical coupling capacitance $C_c$. The coupling capacitor multiplier (CCM) takes the value of 1 when either neighboring line is switching while the other is idle, and takes the values of 0 or 2 when they are similarly or oppositely switching, respectively. The interconnect delay and energy dissipation of bus line $i$ are proportional to the total line capacitance $C_i$, given by

$$C_i = C_g + \sum_{j \in \{i-1,\, i+1\}} C_{c,\mathrm{eff}}^{(i,j)}, \qquad C_{c,\mathrm{eff}}^{(i,j)} = \mathrm{CCM}_{i,j}\, C_c \tag{1}$$

where $C_g$ and $C_{c,\mathrm{eff}}^{(i,j)}$ are the line-to-ground and effective coupling capacitances, respectively, and $\mathrm{CCM}_{i,j}$ is the coupling capacitor multiplier between line $i$ and its neighboring line $j$.
Shielding [5], [6], increasing line-to-line spacing, and nonuniform wire placement [5], [6] reduce worst-case CCM from 2 to 1 at the expense of dropping throughput per unit area. Alternatively, low CCM circuit techniques can be used to reduce CCM in multicycle buses. We compare SRB to two other low CCM schemes in Section II. Section III describes the bus optimization methodology. Section IV provides the energy comparison results. Section V gives a summary.
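The CCM rule described above can be illustrated with a minimal Python sketch. The transition encoding (+1 rising, -1 falling, 0 idle), the function names, and the capacitance values are assumptions made for illustration only; they do not come from the paper. With that encoding, the multiplier between two lines is simply the absolute difference of their transitions.

```python
# Transition encoding (an assumption of this sketch): +1 rising, -1 falling, 0 idle.
RISE, FALL, IDLE = 1, -1, 0


def ccm(line: int, neighbor: int) -> int:
    """Coupling capacitor multiplier between a line and one neighbor.

    Same-direction switching gives 0, one line idle gives 1,
    opposite switching gives 2 (two idle lines also give 0).
    """
    return abs(line - neighbor)


def total_line_capacitance(c_g: float, c_c: float,
                           left: int, line: int, right: int) -> float:
    """Total capacitance of a line with two neighbors, in the spirit of (1)."""
    return c_g + (ccm(line, left) + ccm(line, right)) * c_c


# Hypothetical capacitances in fF, chosen only for illustration.
C_G, C_C = 40.0, 60.0
worst = total_line_capacitance(C_G, C_C, FALL, RISE, FALL)  # both neighbors oppose
best = total_line_capacitance(C_G, C_C, RISE, RISE, RISE)   # all lines switch together
capped = C_G + 2 * 1 * C_C                                   # worst-case CCM limited to 1
print(worst, best, capped)
```

In this toy setting, capping the worst-case CCM at 1 for both neighbors removes two units of coupling capacitance from the worst-case load, which is the effect the low CCM schemes aim for.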
1549-8328/$25.00 © 2008 IEEE
GHONEIMA et al.: SRB: LOW-POWER SCHEME FOR ON-CHIP BUSES
Fig. 2. Multicycle CSB.
Fig. 3. Low CCM bus techniques. (a) DDB. (b) DCB. (c) SRB (one cycle is shown).
II. LOW CCM CIRCUIT TECHNIQUES
Introducing a relative delay between adjacent bus lines avoids their simultaneous switching in opposite directions, and hence eliminates the worst-case switching scenario [8]. This section describes some of the previous relative delay techniques [9], [10] and compares them to the proposed SRB technique.
A. Delayed Data Bus
The delayed data bus (DDB) scheme [9], [10] achieves a worst-case CCM of 1 by misaligning signal firing through delays introduced at the beginning of alternate bus lines [Fig. 3(a)]. There exists an optimal value for this extra delay: while a large separation improves CCM, it can itself increase total bus delay. The delay is inserted at the beginning of every cycle in a
multicycle bus since signals re-align at the end of the previous flop distance.
B. Delayed Clock Bus
A similar misalignment effect is achieved in the delayed clock bus (DCB) [10] by delaying clock inputs on alternate bus lines [Fig. 3(b)]. This is done by using alternate positive- and negative-edge triggered flops. In a multicycle bus, and unlike DDB where the extra delay has to be added at the beginning of every cycle, the delay on alternate DCB bits is introduced at the very first bus cycle and is then carried on by consecutive cycles. Since signals have to eventually align before reaching their final destination, the last flop distance in DCB is made shorter than the previous ones to make up for the extra delay. Delaying the clock in DCB can also be achieved using simple buffers as in DDB but on alternate clock lines, or by employing dual-phase clocks feeding alternate bus flops. The method presented here, while resulting in a maximum firing delay of 1 clock phase, is the preferred scheme since it has the least impact on clock network energy and routing. Furthermore, in a long multicycle bus, the impact of having to shorten the last flop distance is small compared to the savings in interconnect and clock energies.
C. Skewed Repeater Bus
Fig. 3(c) shows the proposed method for reducing CCM, called the skewed repeater bus (SRB). Unlike conventional repeater design, where rising and falling delays are made equal, SRB achieves low CCM by intentionally skewing the nMOS and pMOS devices in all inverters along a bus line. Consecutive inverters along the bus line are skewed in opposite directions. Rising and falling transitions in adjacent lines are no longer simultaneous, as one transition travels faster than the other, forcing worst-case CCM down from 2 to 1. SRB skewing can be achieved through a combination of device width (this paper), length, and body-bias for optimal energy reduction. Simulated waveforms of signal propagation across a CSB and an SRB are given in Fig. 4. Most signal separation in SRB is introduced by the first few inverters, with smaller separation added by repeaters down the line.
Fig. 4. Simulated waveforms in 65-nm process at 1.1 V.
Fig. 5. Effective CCM for studied schemes.
Fig. 6. Signal separation in low CCM schemes.
D. Qualitative Comparison
Similar to CSB, but unlike DDB and DCB, all bus bits are symmetrical in SRB. In addition, DDB and DCB delay both oppositely and similarly switching signals, wasting the energy
benefit of cases where CCM is originally smaller than 1. In SRB, however, transitions in the same direction in adjacent lines remain simultaneous, as shown in Fig. 5. Further notice that SRB separates opposite transitions by speeding up one transition and delaying the adjacent one, as illustrated in Fig. 6. Thus, SRB delay overhead is always smaller than that of either DDB or DCB. However, DCB holds an important advantage over both DDB and SRB since its delay overhead is introduced only once, at the first bus cycle. This is useful for a long multicycle bus, as will be shown later.
III. BUS OPTIMIZATION METHODOLOGY
A 4-bit cyclic bus model is used to emulate a very wide bus, as shown in Fig. 7. The model guarantees that signal transitions in adjacent bus lines will occur simultaneously (in the absence of any signal staggering). Transistor sizes in flip-flops and repeaters are optimized using a global optimizer and a circuit simulator. For a given flop distance, the optimizer selects the optimal segment length (distance between two consecutive repeaters) to minimize the total energy of the bus while meeting a target cycle time requirement of 250 ps, as given by

$$t_{cq} + t_{line} + t_{setup} + t_{skew} \le 250\ \mathrm{ps} \tag{2}$$

where $t_{cq}$ is the clock-to-out delay of the driver FF, $t_{line}$ is the line delay, $t_{setup}$ is the setup time at the receiver flop, and $t_{skew}$ is a fixed quantity that accounts for clock skew and jitter between the driver and the receiver flops.
Fig. 7. 4-bit cyclic bus model used in the optimization.
For each flop distance, the optimizer sweeps the number of interconnect segments. For each of these trials, it optimizes the sizing of the inserted buffer to minimize the overall energy dissipation measured by circuit simulations, while meeting the target cycle time requirement of 250 ps. The optimizer finally outputs the optimum number of segments (with the minimum energy dissipation) and its corresponding optimum buffer sizing.
Excitation data vectors are used to cover all possible CCM conditions between adjacent bits. $t_{line}$ is taken as the worst-case delay amongst the victim and attacker bits over all input vectors.
Furthermore, skew in the clock signals feeding neighboring driver flops creates inherent firing misalignment between adjacent bits. In the worst case, this separation can work against the signal staggering introduced by the different schemes. To model this effect, two separate clock signals for the victim and attacker driver flops are used. This creates two sets of vectors: one with perfectly aligned clocks, and the other with separated clocks. The worst-case delay among the two sets is taken as $t_{line}$.
Both switching and leakage currents are included in the total energy calculation. A single bus line is assumed to be switching 10% of the time (or 90% idle) with equal 0 and 1 probability, and without any spatial or temporal correlation. Switching energy per bit is taken as a weighted average over all possible CCM values of 0, 1, and 2.
Noise margin constraints are imposed for SRB to limit the maximum allowed repeater skewing. For all other schemes, a constraint of equal rise and fall delay through each inverter is used in order to prevent the optimizer from skewing inverters as in SRB. All techniques are required to meet a maximum signal slope constraint.
IV. PERFORMANCE AND ENERGY COMPARISONS
A. Single Cycle Bus Results
Energy per bit versus flop distance comparisons for single-cycle buses are given in Fig. 8. At a flop distance of 1700 µm, DDB reduces energy by 37% compared to CSB, and SRB improves this further by another 15%.
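The activity assumptions stated in Section III (10% switching, equal 0 and 1 probability, no correlation) determine how often each CCM value occurs. The Python sketch below works out those weights under an idealized reading that is an assumption of this example, not the paper's simulation setup: a line rises or falls with probability 0.05 each, SRB only turns a CCM of 2 into 1, and DDB/DCB stagger every pair of switching neighbors so that their CCM is always 1. All function names are hypothetical.

```python
from itertools import product

# Assumed per-cycle transition probabilities for one line: 10% switching,
# split equally between rising (+1) and falling (-1), otherwise idle (0).
P = {1: 0.05, -1: 0.05, 0: 0.90}


def ccm_csb(a: int, b: int) -> int:
    """Conventional bus: 0 (same direction), 1 (one idle), 2 (opposite)."""
    return abs(a - b)


def ccm_srb(a: int, b: int) -> int:
    """Idealized SRB: opposite transitions are separated, so 2 becomes 1."""
    return min(abs(a - b), 1)


def ccm_ddb(a: int, b: int) -> int:
    """Idealized DDB/DCB: staggered firing gives 1 whenever either line switches."""
    return 1 if (a != 0 or b != 0) else 0


def mean_ccm_when_switching(ccm_fn) -> float:
    """Average CCM seen by a switching line toward one uncorrelated neighbor."""
    num = den = 0.0
    for a, b in product(P, P):
        if a == 0:  # only cycles in which the line itself switches are counted
            continue
        weight = P[a] * P[b]
        num += weight * ccm_fn(a, b)
        den += weight
    return num / den


for name, fn in (("CSB", ccm_csb), ("DDB/DCB", ccm_ddb), ("SRB", ccm_srb)):
    print(name, round(mean_ccm_when_switching(fn), 3))
```

Under these assumptions the average CCM is the same for CSB and DDB/DCB, while SRB comes out slightly lower because it keeps the CCM of 0 for same-direction transitions. The sketch covers only the CCM weighting; it says nothing about repeater sizing or delay overhead, which the energy comparisons in this section also reflect.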
As compared to DDB, SRB employs a shorter delay overhead for signal separation and so requires smaller repeaters to send signals in the remaining cycle time, thus reducing switched capacitance. This is illustrated in Fig. 9, which gives the energy breakdown at a flop distance of 1700 µm. Note that part of the SRB energy advantage is due to lower effective switched capacitance, since CCM for SRB varies between 0 and 1. In terms of performance, the maximum flop distance traversable by SRB is 1900 µm, which is 12% longer than both CSB and DDB.
Fig. 8. Single-cycle energy comparison.
Fig. 9. Single-cycle energy breakdown at flop distance of 1700 µm.
As mentioned before, although DDB achieves a CCM of 1 by staggering the firing of alternate bits, the delay overhead needed to achieve signal separation is larger than in SRB. In fact, there exists an optimal number of buffers for delaying alternate bits. An example is depicted in Fig. 10 for a flop distance of 1700 µm. At lower buffer counts, signals are not perfectly separated, implying a CCM higher than 1. This results in delay degradation and requires repeater upsizing. At a higher buffer count, on the other hand, CCM is 1 but the delay is limited by the additional buffers, and this also requires repeater upsizing. For all energy points shown in Fig. 8, the optimal number of buffers is always selected to reduce the energy of the DDB bus.
Note that DCB appears to achieve the best single-cycle energy and performance results. However, this comparison is invalid, as DCB is inherently a multicycle bus scheme: the delay overhead introduced at the first bus cycle has to be accounted for (absorbed) in the last bus cycle by making it shorter. These comparisons are given in the next section.
B. Multicycle Bus Results
To drive a long on-chip interconnect while meeting cycle time, a number of single-cycle bus lines are cascaded together to form a multicycle bus [11]–[13], as illustrated earlier in Fig. 2. The total energy and total length of an $N$-cycle bus can be expressed in terms of a single-cycle bus as follows:

$$E_{total} = N \cdot E_{1}(d) \tag{3}$$
$$L_{total} = N \cdot d \tag{4}$$

where $E_{1}(d)$ is the energy consumed by a single-cycle bus at a given flop distance $d$ (as in Fig. 8).
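The cascading relation can be sketched in Python as follows. The sketch assumes that total energy and total length are simply the number of cycles times the single-cycle energy and the flop distance, which is what cascading identical stages implies. The function names, the 7600 µm bus length, and the per-stage energy are hypothetical; only the 1700 µm and 1900 µm flop distances are quoted from the single-cycle results. DCB's shortened last stage is not modeled.

```python
import math


def multicycle_energy(n_cycles: int, e_single: float) -> float:
    """Total energy of n_cycles identical cascaded single-cycle stages."""
    return n_cycles * e_single


def multicycle_length(n_cycles: int, flop_distance_um: float) -> float:
    """Total length covered by n_cycles stages of equal flop distance."""
    return n_cycles * flop_distance_um


def cycles_needed(bus_length_um: float, flop_distance_um: float) -> int:
    """Smallest number of equal stages that covers bus_length_um."""
    return math.ceil(bus_length_um / flop_distance_um)


# Hypothetical bus length; a longer flop distance can save a whole cycle.
BUS_LENGTH_UM = 7600.0
for d in (1700.0, 1900.0):
    n = cycles_needed(BUS_LENGTH_UM, d)
    print(d, n, multicycle_length(n, d))
```

A usage note: comparing schemes fairly also requires holding total energy equal, as Fig. 12 does, which this length-only example leaves out.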
Fig. 12. Cycle savings over CSB at equal energy.
Fig. 10. Optimal delay selection for DDB.
Fig. 13. Microprocessor interconnect energy reduction.
Fig. 11 gives the multicycle energy savings of the three schemes relative to CSB. As in CSB, a multicycle DDB or SRB bus is simply formed by cascading single-cycle stages. Thus, the total energy dissipation of these multicycle buses can be calculated using (3). Hence, the relative energy savings of DDB and SRB remain the same as those of a single-cycle bus. On the other hand, the energy savings of DCB are not constant but rather improve as the number of cycles $N$ increases. This is due to the fact that the data on the alternate bus lines have to align at the end of the bus, as explained previously. This means that the last DCB bus stage has only half a cycle (1 phase) of available time, which leads to a shorter last stage. This overhead becomes less significant, however, as $N$ increases. Fig. 11 shows that DCB is less energy efficient than SRB for $N \le 3$.
Fig. 11. Energy savings over CSB.
Fig. 12 compares the schemes in terms of performance. The metric used is the number of cycles saved over CSB at equal total energy. The figure shows that SRB delay savings are better than those of DCB up to a bus length of 6400 µm. However, to save at least 1 full clock cycle over CSB, DCB bus length needs to be longer than 6800 µm while SRB needs 400 µm more.
Low CCM schemes are applied to on-chip buses in a microprocessor example containing buses ranging from 2 cycles to 17 cycles with various flop distances. The results are summarized in Fig. 13. About 11%, 7%, and 18% energy savings are achieved by DDB, DCB, and SRB, respectively. The best savings of about 19% are attained when the better of SRB and DCB is applied depending on bus length and number of cycles.
V. SUMMARY
We introduced the SRB architecture for energy reduction of on-chip interconnect. We compared SRB to previously published low-CCM bus schemes: DDB and DCB. Simulation results in a 65-nm process show that SRB is more energy efficient than either DDB or DCB for multicycle buses three cycles long or less. Results show that a bus energy reduction of 18% is achieved when SRB is applied to a real microprocessor example, versus only 11% and 7% for DDB and DCB, respectively. A 19% overall reduction is achieved when the best combination of SRB and DCB is used.
REFERENCES
[1] M. Bohr, “Interconnect scaling: The real limiter to high performance ULSI,” in Proc. IEEE IEDM, 1995, pp. 241–244.
[2] J. D. Meindl, J. A. Davis, P. Zarkesh-Ha, C. S. Patel, K. P. Martin, and P. A. Kohl, “Interconnect opportunities for gigascale integration,” IBM J. Res. Develop., vol. 46, no. 2/3, pp. 245–264, 2002.
[3] D. Sylvester and C. Hu, “Analytical modeling and characterization of deep submicron interconnect,” Proc. IEEE, vol. ???, no. 5, pp. 634–664, May 2001.
[4] K. Rahmat, S. Nakagawa, S.-Y. Oh, and J. Moll, “A scaling scheme for interconnects in deep-submicron processes,” Hewlett Packard, San Jose, CA, Tech. Rep. HPL-95-77, 1995.
[5] P. Saxena and C. Liu, “A post-processing algorithm for crosstalk-driven wire perturbation,” IEEE Trans. Comput.-Aided Design, vol. 19, no. 6, pp. 691–702, Jun. 2000.
[6] A. B. Kahng et al., “Interconnect tuning strategies for high-performance ICs,” in Proc. DATE, 1998, pp. 471–478.
[7] R. Arunachalam, E. Acar, and S. R. Nassif, “Optimal shielding/spacing metrics for low power design,” in Proc. ISVLSI 2003, Dec. 2003, pp. 167–172.
[8] M. Ghoneima and Y. Ismail, “Effect of relative delay on the dissipated energy in coupled interconnects,” in Proc. 2004 IEEE Int. Symp. Circuits Syst., May 2004, vol. 2, pp. 525–528.
[9] K. Nose and T. Sakurai, “Two schemes to reduce interconnect delays in bi-directional and uni-directional buses,” in Proc. Symp. VLSI Circuits, 2001, pp. 193–194.
[10] K. Hirose and H. Yasuura, “A bus delay reduction technique considering crosstalk,” in Proc. DATE, 2000, pp. 441–445.
[11] L. Scheffer, “Methodologies and tools for pipelined on-chip interconnect,” in Proc. IEEE Int. Conf. Comput. Design (ICCD), 2002.
[12] R. Lu, G. Zhong, C.-K. Koh, and K.-Y. Chao, “Flip-flop and repeater insertion for early interconnect planning,” in Proc. Design, Autom. Test Eur. Conf. Exhib., Mar. 2002, pp. 690–695.
[13] P. Cocchini, “Concurrent flip-flop and repeater insertion for high performance integrated circuits,” in Proc. IEEE/ACM Int. Conf. Comput. Aided Design (ICCAD), 2002, pp. 268–273.
Maged M. Ghoneima (M’97) received the B.Sc. degree in electronics and communications engineering (with honors) and the M.Sc. degree in electronics from Ain Shams University, Cairo, Egypt, in 1997 and 2000, and the Ph.D. degree in computer engineering from Northwestern University, Evanston, IL, in 2006. In 1997, he was appointed as a Teacher Assistant in the Department of Electrical and Computer Engineering, Ain Shams University. He was with OEA International in 2002 and with the Circuit Research Laboratory, Intel Corporation, Hillsboro, OR, from 2003 to 2005. He is currently a Senior Circuit Design Engineer in the VLSI Circuit Design Group at NVIDIA Corporation, Santa Clara, in 2006, developing on-chip memory structures for the Tesla and Fermi GPUs. He has authored and co-authored over 25 technical papers in refereed international conferences and journals. He also has two patents granted and more than five patents filed in the area of low-power and high-performance circuit design. His research interests include on-chip interconnect architectures, on-chip memory structures, low-power circuit design, and related circuit-level issues in high-performance VLSI circuits. Dr. Ghoneima was awarded the Walter Murphy and Capell fellowship awards from Northwestern University in 2001 and 2005, in addition to the Intel PhD fellowship award in 2004.
Muhammad M. Khellah (M’93) received the B.Sc. degree in computer engineering from King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia, in 1991, the M.A.Sc. degree in electronics and computer engineering from the University of Toronto, Toronto, Canada, in 1994, and the Ph.D. degree in electronics and computer engineering from the University of Waterloo, Waterloo, Canada, in 1999. He joined Intel Corporation, Hillsboro, OR, in 1999 and worked on designing on-chip SRAM caches for the P3 and P4 microprocessor products. Currently, he is a Staff Research Engineer at Intel’s Microprocessor Technology Labs working on low-power and high-performance circuits. He has authored or co-authored 20 papers in refereed international conferences and journals. He has 5 patents granted and 31 more patents filed.
James Tschanz received the B.S. and M.S. degrees in electrical engineering from the University of Illinois at Urbana-Champaign in 1997 and 1999, respectively. Since 1999 he has been a Circuits Researcher at Intel’s Circuit Research Lab, Hillsboro, OR, where his research interests include low-power and variation-tolerant circuits. He also taught VLSI design for eight years as an adjunct faculty member at the Oregon Graduate Institute, Portland, OR.
Yibin Ye, photograph and biography not available at the time of publication.
Nasser Kurd is a Senior Principal Engineer in the Digital Enterprise Microprocessor Circuit Technology Group, Intel Corporation, Hillsboro, OR. He joined Intel in 1996 and has been involved in clock generation and distribution, in particular phase-locked loops (PLLs) and delay-locked loops (DLLs), and in input–output (I/O) designs for several microprocessor generations. He is currently the clock and I/O tech lead for next-generation microprocessors. He holds 26 granted patents.
Javed S. Barkatullah (S’88–M’91) received the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Iowa, Ames, in 1990 and 1992, respectively. He is currently a Principal Engineer at NVIDIA Corporation, Santa Clara, CA, where he is involved in designing custom and mixed-signal circuits for next-generation GPUs. From 1992 to 2005, he was with Intel Corporation, involved with the design and development of various microprocessor projects. Dr. Barkatullah served as an Associate Editor of the IEEE JOURNAL OF SOLID-STATE CIRCUITS from June 2004 to October 2005.
Srikanth Nimmagadda is a Principal Engineer in the Digital Enterprise Group, Intel Corporation, Hillsboro, OR, and is a part of Visual Computing Group focussing on next generation visual computing platforms. He joined Intel Corporation in 1998, and specializes in full-chip integration issues like floorplanning, area-power-frequency estimation and convergence, design integration, and verification, etc.
Yehea Ismail (M’00) was born in Giza, Egypt, on November 16, 1971. He received the B.Sc. degree in electronics and communications engineering (with distinction) and honors and the Master’s degree in electronics with distinction from the School Of Engineering, Department of Electronics and Communications, Cairo University, Cairo, Egypt, in 1993 and 1996, respectively, the second Master’s degree in electrical engineering, and the Ph.D. degree in electrical engineering from the University of Rochester, Rochester, NY, in 1998 and 2000, respectively. As one of the top of his class at Cairo University, he was appointed as a Teacher Assistant in the Department of Electrical and Computer Engineering, Cairo University, in August 1993. He is currently with Northwestern University as an Associate Professor. He has coauthored more than 100 technical papers
and a book, and several book chapters. He was with IBM Microelectronics from 1997 to 1999. His primary research interests include interconnect, noise, innovative circuit simulation, and related circuit-level issues in high-performance VLSI circuits. Prof. Ismail is the chair elect of the CAS VLSI Technical Committee and Associate Editor-in-Chief of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS. He was on the Editorial Board of the IEEE TRANSACTIONS ON CIRCUITS SYST.—I: REGULAR PAPERS, and was a guest editor for a special issue of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS on “On-Chip Inductance in High Speed Integrated Circuits.” He has also chaired many conferences. Prof. Ismail received the 2002 IEEE Circuits and Systems Society Outstanding Young Author Award and a National Science Foundation Career Award in 2002. He was given the best teacher award in the Department of Electrical Engineering and Computer Science, Northwestern University, in 2003.
Vivek K. De (SM’89) received the Bachelor’s degree in electrical engineering from the Indian Institute of Technology—Madras, Chennai, India, in 1985, the Master’s degree in electrical engineering from Duke University, Durham, NC, in 1986, and the Ph.D. degree in electrical engineering from Rensselaer Polytechnic Institute (RPI), Troy, NY, in 1992. He is an Intel Fellow and Director of Circuit Technology Research in the Corporate Technology Group, Intel Corporation, Hillsboro, OR. He joined Intel in 1996 as a staff engineer in the Circuits Research Lab (CRL). Since that time he has led research teams in CRL focused on developing advanced circuits and design techniques for low-power and high-performance processors. In his current role, he provides strategic direction for future circuit technologies and is responsible for aligning CRL’s circuit research with technology scaling challenges. He has published 163 technical papers in refereed international conferences and journals, and six book chapters on low-power circuits. He holds 146 patents, with 48 more patents filed (pending). He is currently an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS. Dr. De received an Intel Achievement Award for his contributions to a novel integrated voltage regulator technology. Prior to joining Intel, he was engaged in semiconductor devices and circuits research at Rensselaer Polytechnic Institute and Georgia Institute of Technology, and was a Visiting Researcher at Texas Instruments.