IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 35, NO. 11, NOVEMBER 2000
A Variable-Frequency Parallel I/O Interface with Adaptive Power-Supply Regulation
Gu-Yeon Wei, Jaeha Kim, Dean Liu, Stefanos Sidiropoulos, and Mark A. Horowitz, Fellow, IEEE
Abstract—This paper presents a low-power high-speed CMOS signaling interface that operates off of an adaptively regulated supply. A feedback loop adjusts the supply voltage on a chain of inverters until the delay through the chain is equal to half of the input period. This voltage is then distributed to the I/O subsystem through an efficient switching power-supply regulator. Dynamically scaling the supply with respect to frequency leads to a simple and robust design consisting mostly of digital CMOS gates, while enabling maximum energy efficiency. The interface utilizes high-impedance drivers for operation across a wide range of voltages and frequencies, a dual-loop delay-locked loop for accurate timing recovery, and an input receiver whose bandwidth tracks with the I/O frequency to filter out high-frequency noise. Test chips fabricated in a 0.35-µm CMOS technology achieve transfer rates of 0.2–1.0 Gb/s/pin with a regulated supply ranging from 1.3 to 3.2 V.

Index Terms—Adaptive control, data communication, dc–dc power conversion, delay-locked loops, digital communication, integrated circuits, interchip communication, power supplies.
I. INTRODUCTION
AGGRESSIVE CMOS scaling has enabled explosive growth in the IC industry, yielding cheaper and higher performance chips. Unfortunately, these advancements have also resulted in higher power consumption and increased bandwidth requirements for communication between chips. Higher communication bandwidth can be obtained by enabling higher I/O speed and using more I/O channels, but these approaches can quickly eat into the overall power budget of the chip. In addition, complexity and area become major design constraints when trying to integrate hundreds of links on a single chip. To address these concerns, this paper presents an I/O design that adaptively regulates the supply voltage to minimize energy consumption and lower design complexity to enable efficient wide parallel interfaces.

Manuscript received March 29, 2000; revised June 26, 2000. G.-Y. Wei, J. Kim, and D. Liu are with the Computer Systems Laboratory, Stanford University, Stanford, CA 94305. S. Sidiropoulos is with Rambus Inc., Mountain View, CA 94040 USA. M. A. Horowitz is with the Computer Systems Laboratory, Stanford University, Stanford, CA 94305, and also with Rambus Inc., Mountain View, CA 94040 USA. Publisher Item Identifier S 0018-9200(00)09437-3.

Fig. 1. Normalized frequency versus supply voltage.

Maken [1] and others [2]–[4] have demonstrated that adaptive power-supply regulation offers an effective technique for saving power in microprocessor and DSP cores. It dynamically reduces the supply down to the minimum voltage required by the chip to operate at a desired clock frequency. Adaptively scaling the supply requires a mechanism that locally monitors the critical path delay in the chip to adjust the supply voltage using a feedback loop, in order to guarantee timing. To understand roughly how this method works, we will initially assume that the critical path consists of a string of inverters.¹ Fig. 1 plots normalized frequency, found by inverting the delay of an inverter and normalizing it to the nominal peak frequency, versus supply voltage in a standard 0.35-µm CMOS process for typical (typical nMOS, typical pMOS, 25 °C), fast (fast nMOS, fast pMOS, 0 °C), and slow (slow nMOS, slow pMOS, 100 °C) corners. Due to voltage margin requirements in conventional designs, chips are normally run at frequencies that are lower than the fastest possible rate under nominal conditions. We assume the chip is slowed to around 70% of the peak frequency, as identified by the dotted horizontal line in Fig. 1. Notice that for the slow corner, the chip can just barely guarantee timing for this reduced frequency at 3.3 V, which is the maximum voltage for this process.

To see how power savings is possible, we take this data and apply it to the dynamic power equation for digital CMOS circuits and plot the resulting power (normalized to the power dissipated at the typical corner and at a normalized frequency of one) versus normalized frequency, as shown in Fig. 2. Since frequency is approximately linear with supply voltage, power consumed is proportional to $V^3$ (or $f^3$). For reference, the linear relationship between power and frequency for a conventional fixed-supply case is also plotted. Again assuming the chip is running at 70% of peak under typical conditions, when the voltage is dynamically scaled, the power consumed is cut by more than a factor of two.
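The factor-of-two claim can be checked with a few lines of arithmetic. The sketch below assumes the textbook dynamic power relation (power proportional to C·V²·f) and an idealized supply that scales exactly in proportion to frequency; the function names are illustrative and the measured curves in Fig. 2 will deviate from this simple model.

```python
def adaptive_supply_power(f_norm: float) -> float:
    """Normalized dynamic power when the supply tracks frequency.

    Assumes P ~ C * V^2 * f with V proportional to f, so P ~ f^3.
    """
    return f_norm ** 3


def fixed_supply_power(f_norm: float) -> float:
    """Normalized dynamic power at a fixed supply: V is constant, so P ~ f."""
    return f_norm


def savings_ratio(f_norm: float) -> float:
    """How many times less power the adaptive supply uses at this frequency."""
    return fixed_supply_power(f_norm) / adaptive_supply_power(f_norm)


if __name__ == "__main__":
    f = 0.7  # the 70%-of-peak operating point assumed in the text
    print(f"fixed supply:    {fixed_supply_power(f):.3f}")
    print(f"adaptive supply: {adaptive_supply_power(f):.3f}")
    print(f"ratio:           {savings_ratio(f):.2f}x")
```

Under these assumptions the ratio at 70% of peak frequency is 1/0.7², slightly above two, which agrees with the statement in the text.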
Power savings is even more significant for the fast case; however, the slow corner requires a higher supply to meet timing and therefore does not exhibit as much savings. The actual power savings may be less than what is shown in the plot, since the adaptive supply will need some margins as well.

¹Somewhat surprisingly, we will later show that a chain of inverters is a fairly good model for the critical path.

0018–9200/00$10.00 © 2000 IEEE

WEI et al.: VARIABLE-FREQUENCY PARALLEL I/O INTERFACE

Fig. 2. Normalized power versus normalized frequency.

Fig. 3. Signal magnitude versus clock period.

Fig. 4. Source-synchronous parallel link architecture.

In addition to saving the power required for overhead margins, significant power savings is possible when this technique is applied to the I/O subsystem, which has shorter delay paths compared to the rest of the chip. The critical path for digital chips usually resides in the core logic blocks, where most of the computation is performed, and can be on the order of 20 inverter delays per cycle. However, the I/O subsystem has much lower performance requirements since it only consists of the delay through latches to hold data, the transceiver to drive bits on and off the chip, and the cycle time required to sustain a full-swing signal through the clock distribution network for the I/O transceivers. Given the ability to aggressively pipeline the I/O datapath, the clock distribution becomes the limiting factor. Fig. 3 plots the signal magnitude, normalized to full-swing, through a six-stage inverter fan-up chain versus clock period normalized to an inverter delay. Significant attenuation occurs below a cycle time of six inverter delays, which sets the minimum cycle time required to ensure full-rail signals through the clock distribution network. Given the difference in performance requirements between the core logic and I/O subsystem, by separately operating the I/O interface off a lower dynamically regulated supply, the I/O subsystem's power consumption can be reduced even while operating both blocks at the same frequency. Alternatively, the I/O subsystem may operate at a higher clock rate or a higher bit rate (by transmitting on multiple phases of the clock [5]) to
meet high bandwidth requirements, and still guarantee energy efficiency with adaptive supply regulation.

The following section introduces the three main components required to realize such an adaptive I/O interface: a mechanism for setting the desired voltage with respect to frequency, an adaptive supply regulator that efficiently distributes this voltage, and the I/O transceiver that operates in this dynamically scaled supply environment. A full description of each of the components reveals how such an environment leads to a simple and robust design. Interestingly, using a dynamically regulated supply eliminates the need for many precision analog circuits, and we can replace them with standard digital CMOS gates. Section III verifies the performance and power savings that can be achieved by presenting experimental results measured from a fabricated test chip.

II. ARCHITECTURE

A conventional source-synchronous parallel link architecture is presented in Fig. 4. It consists of several parallel data links that run between two chips and has a separate parallel clock line whose delay through the channel matches the data-link delays [6]. Given this matching, a dual-loop delay-locked loop (DLL) [11] at the receiver uses the synchronous clock signal to accurately align the on-chip clocks to the receivers to sample the incoming data at the optimal point. One can build the DLL out of static inverters, and use it as the feedback system that determines the correct voltage for running the interface. Thus a similar I/O architecture, but one that uses adaptive supply regulation to save power, requires only one additional component: an efficient power-supply regulator to distribute this voltage to the rest of the I/O subsystem.

A. Core Delay-Locked Loop

In order to operate at the most energy-efficient point, we need to determine and provide the minimum supply voltage required to operate at a desired clock frequency.
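As a rough illustration of how a feedback loop can find that minimum voltage, the sketch below models a bang-bang loop that nudges the supply of an inverter chain until the chain delay just fits in half a clock period, which is the lock condition stated in the abstract. Everything numeric here is an assumption made for illustration: the alpha-power-law delay model, its constants `k`, `v_t`, and `alpha`, the six-stage chain, and the 5-mV step are hypothetical and are not fitted to the process used in this paper.

```python
def inverter_delay(v_supply: float, k: float = 0.15e-9,
                   v_t: float = 0.6, alpha: float = 1.3) -> float:
    """Hypothetical alpha-power-law delay (seconds) of one inverter.

    Only valid for v_supply > v_t; delay falls as the supply rises.
    """
    return k * v_supply / (v_supply - v_t) ** alpha


def lock_supply(f_clk_hz: float, n_stages: int = 6, v_start: float = 3.3,
                v_floor: float = 0.7, v_ceiling: float = 3.3,
                v_step: float = 0.005, max_steps: int = 10_000) -> float:
    """Step the supply up or down until the chain delay straddles T/2.

    Returns the lowest visited supply at which the chain still meets timing.
    """
    half_period = 0.5 / f_clk_hz
    v = v_start
    last_ok = None   # most recent supply that met timing
    prev_dir = 0     # previous step direction (+1 up, -1 down)
    for _ in range(max_steps):
        too_slow = n_stages * inverter_delay(v) > half_period
        if not too_slow:
            last_ok = v
        direction = 1 if too_slow else -1  # behaves like a phase detector
        if prev_dir != 0 and direction != prev_dir and last_ok is not None:
            return last_ok  # loop has started dithering around lock
        v = min(max(v + direction * v_step, v_floor), v_ceiling)
        prev_dir = direction
    raise RuntimeError("loop did not lock within the allowed supply range")


if __name__ == "__main__":
    for f in (100e6, 250e6, 500e6):
        print(f"{f / 1e6:5.0f} MHz -> {lock_supply(f):.3f} V (toy model)")
```

The point of the sketch is only the qualitative behavior: a lower clock frequency locks at a lower supply, so the voltage handed to the rest of the interface follows the operating frequency.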
Since circuit performance varies with process and operating conditions, some local mechanism that dynamically finds this minimum voltage is necessary. A replica of the critical delay path is an accurate way to measure delay variation with respect to different process corners and variations in operating conditions in a single chip [4]; however, it is often impractical to implement a replica for complex digital chips. Instead, we use an inverter as a unit delay element that accurately matches delays of various other circuit elements under the same operating conditions. A series chain of inverters whose delay nominally matches the critical path delays is used to measure circuit performance and is placed in the core DLL.

A simplified block diagram of the core DLL is shown in Fig. 5. A phase detector compares the 0° and 180° clock edges and drives a charge pump such that the delay through the delay line locks to half the input reference clock cycle. A unity-gain buffer replicates the control voltage out of the charge pump and supplies current to the inverters. With careful isolation of the supply nodes to the inverters and given the high power-supply rejection ratio (PSRR) achievable in the unity-gain buffer, a low-jitter DLL is possible. A detailed description of building DLLs and phase-locked loops (PLLs) with supply-controlled inverter delay elements can be found in [7]. With this accurately generated voltage with respect to frequency, all of the I/O circuitry operates properly as long as the critical path in the I/O subsystem is not longer than twice the delay-line's delay. Using supply-controlled inverters as delay elements also has the nice property that the delay of all circuit elements operating at the same control voltage will always be a fixed percentage of the clock period, which lets us replace precision analog circuit blocks with digital gates.

Fig. 5. Core DLL block diagram.

To verify that inverters are an accurate metric for measuring circuit performance, Fig. 6 plots the delay variation of several standard static and dynamic gates normalized to a fanout-of-4 (FO4) inverter delay versus process corners, temperature, and supply voltage. An FO4 inverter is an inverter driving a load with four times its input capacitance and represents the desirable fanout for an inverter chain driving a large load with minimum delay. The relatively flat lines for the delay of standard gates in Fig. 6(a) and (b) indicate that they track well with the delay of an inverter across process corners and temperature variations. However, the delay of complex gates with stacked nMOS devices speeds up more quickly than the inverter for higher supply voltages [Fig. 6(c)]. This effect can be attributed to velocity saturation, which is stronger for the minimum channel length devices in the inverter compared to longer effective channel lengths of stacked devices. Therefore, inverters with minimum channel length devices may not be a sufficient model for tracking a critical path consisting mostly of complex gates with stacked transistors. Modifying the unit delay element to have longer channel lengths reduces the worst-case mismatch from 28% to 17%, as shown in Fig. 6(d). However, there is still some mismatch with respect to supply voltage and some margin is unfortunately required to guarantee timing over the full range of operation. Fortunately, since the critical path in the I/O subsystem consists predominantly of inverters, as described earlier, using a chain of minimum length inverters is sufficient in this case.

B. Digitally Controlled Adaptive Power-Supply Regulator
Given a reliable way of determining the correct supply voltage, we need a way of replicating the control voltage out of the core DLL while efficiently driving the rest of the I/O subsystem. A linear regulator generally has poor conversion efficiency and would therefore counteract our power-saving efforts. Instead, a buck converter is used to efficiently power the I/O circuitry, as shown in Fig. 7. The off-chip inductor and capacitor act as a low-pass filter, and as long as its cutoff frequency is at least an order of magnitude less than the switching frequency of the input pulse-width modulated (PWM) rectangular wave, the output is an average value whose magnitude is set by the duty cycle of the input rectangular wave. Since the low-pass filtering is not perfect, a 1-MHz switching frequency with 9.2-µH and 10-µF inductor and capacitor sizes results in a small ripple at the output with a peak-to-peak magnitude less than 5% of the regulated voltage. The pMOS and nMOS transistors are large on-chip devices that deliver power to the load.

To set the duty-cycle input to the buck converter such that its output replicates the DLL's control voltage, the converter is placed in a standard proportional, integral, and derivative (PID) control loop, as shown in Fig. 8. A detailed description and analysis of this type of controller can be found in [8]. References [2]–[4] and [9] present other approaches to similar power-supply designs. While this loop could easily be implemented with analog blocks, using a ring oscillator to convert a voltage to a frequency gives an inherent A/D operation that converts an analog signal to a digital clock signal. Leveraging this property obviates the need for analog blocks and makes a completely digital controller possible.
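The filter and duty-cycle statements above can be sanity-checked numerically. The sketch below assumes an ideal second-order LC filter, an ideal buck converter in continuous conduction (output equals duty cycle times input), a 3.3-V input supply, and a best-case linear regulator whose efficiency is the ratio of output to input voltage; the helper names are illustrative, and real converters add losses that these formulas ignore.

```python
import math


def lc_cutoff_hz(l_henry: float, c_farad: float) -> float:
    """Corner frequency of an ideal LC low-pass filter: 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henry * c_farad))


def ideal_buck_duty(v_out: float, v_in: float) -> float:
    """Duty cycle of an ideal buck converter, where Vout = D * Vin."""
    return v_out / v_in


def ideal_linear_efficiency(v_out: float, v_in: float) -> float:
    """Best-case efficiency of a linear regulator, ignoring quiescent current."""
    return v_out / v_in


if __name__ == "__main__":
    f_sw = 1e6                         # switching frequency from the text
    f_c = lc_cutoff_hz(9.2e-6, 10e-6)  # inductor and capacitor from the text
    print(f"LC cutoff: {f_c / 1e3:.1f} kHz, f_sw / f_c = {f_sw / f_c:.0f}")
    for v_out in (1.3, 3.2):           # regulated range quoted in the abstract
        print(f"Vout = {v_out} V: duty = {ideal_buck_duty(v_out, 3.3):.2f}, "
              f"linear-regulator bound = {ideal_linear_efficiency(v_out, 3.3):.0%}")
```

With the quoted component values the cutoff lands in the tens of kilohertz, comfortably more than an order of magnitude below the 1-MHz switching frequency. The linear-regulator bound shows why a switching converter is preferred at the low end of the voltage range.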
An initial implementation of this type of digital controller, described in [8], uses a ring oscillator and a counter to convert an analog voltage to a binary number by counting pulses out of the oscillator over the buck converter's switching period. Albeit simple, in order to get high resolution, this approach requires the ring oscillator and counter to run at a fairly high rate and ends up consuming an appreciable amount of power. To address this inefficiency, we implemented an improved A/D structure, which is presented in Fig. 9. It relies on a delay line to achieve high resolution [9], with the least significant bit (LSB) equivalent to approximately two inverter delays. In order to avoid a geometrically increasing number of delay stages with each additional bit of the output, the 32-stage delay line is configured as a ring to reuse hardware. The higher order bits [9:5] are determined by the number of passes an edge makes through the ring over the switching period of 1 µs. Therefore, the ring oscillator's maximum frequency is 16 MHz at the maximum range of 3.3 V. By making the delay elements out of latches and latching the internal state of the ring at the end of the counting period, a decoding logic block determines how far an edge has traversed through the ring to determine the lower order bits of
the binary output. This configuration enables higher resolution while consuming less power.

Fig. 6. Inverter tracking versus (a) process corners, (b) temperature, (c) supply voltage, and (d) supply voltage with long channel length inverters to mitigate velocity saturation effects.

Fig. 7. Buck converter.

Fig. 8. PID control loop block diagram.

Fig. 9. A/D converter block diagram.

A pair of these A/D blocks is used to implement a completely digital controller, as shown in Fig. 10. The difference between the numbers generated with respect to the output of the buck converter and the reference voltage represents the error. This error feeds into a set of binary adders and shifters that implement the PID control functions on the error, and the output sum of these blocks is a binary representation of the duty cycle for the PWM rectangular-wave input to the buck converter. Given the relatively slow update rate of the controller at 1 MHz, which is equivalent to the switching frequency of the buck converter, the digital PID control blocks operate off the regulated variable supply. A subsequent D/A block generates the duty-cycle output rectangular wave, and its operation resembles that of the A/D block, but in reverse. A ring oscillator is driven off the fixed high supply voltage and oscillates at a constant frequency. A counter counts the pulses out of the oscillator and compares the counter output to the higher order bits of
the duty-cycle value. While the counter output is less than this value, the D/A block outputs a low level. Once the two match, a multiplexer selects a rising edge after some delay through a number of delay stages set by the lower order bits of the duty-cycle value. The output is held high until the counter saturates and then the comparison repeats. The 16-stage ring (with an inversion) oscillates at 32 MHz and counts 32 cycles to set the 1-MHz switching frequency of the PWM rectangular wave.

Fig. 10. Digital PID controller.

Fig. 11. I/O transceiver block diagram.

C. I/O Transceiver

The I/O transceiver consists of three components, as seen in the block diagram in Fig. 11. A digital peripheral loop dynamically selects and interpolates between a pair of clock edges out of the core DLL's delay line to accurately align the on-chip clock to the receivers. Due to variations in channel characteristics and skews in the clock distribution network, timing adjusters may be used to tune out variations for each receiver, as demonstrated in [10], to improve timing margins. Parallel data transceivers consist of a transmitter and receiver pair that share pads and can be configured to transmit or receive data for half-duplex operation. This I/O implementation uses high-impedance drivers to transmit low-swing data signals, and the receivers consist of a preamplifier and regenerative latch pairs. Except for the core DLL described in Section II-A, nearly all of the blocks in this I/O transceiver consist of static CMOS gates and operate off the dynamically regulated supply.

1) Timing Recovery: The timing recovery block relies on a dual-loop DLL architecture that uses the core DLL to generate evenly spaced clock edges that span 360° and drive into a peripheral loop that selects an adjacent pair of edges, and interpolates between them to finely align a clock edge relative to the input I/O clock [11]. The original implementation of this dual-loop architecture uses analog differential delay buffers in the core delay line and interpolator with a sophisticated replica biasing scheme [12]. However, the core DLL in this implementation serves a primary role by setting the required voltage of operation while using simple inverters as delay elements. Digital CMOS gates operating off of the regulated voltage can replace the precision analog blocks in the peripheral loop since the performance of these gates tracks with link frequency. A detailed block diagram of this loop and its components is presented in Fig. 12.

Fig. 12. Digital peripheral loop.

In order to generate twelve clock edges that are evenly spaced an inverter delay apart, the delay line in the core DLL consists of a parallel set of inverters driven with complementary reference clock inputs. Coupling through weak cross-coupled inverters between the two paths minimizes skew between them. Each of the inverters drives out to multiplexers that select a pair of adjacent clock edges, and the multiplexer outputs drive into a digital interpolator. As Fig. 13 illustrates, the digital interpolator consists of two parallel sets of tri-state buffers with their outputs shorted together. Adjusting the relative drive strengths of the two sides, by digitally controlling the number of buffers enabled, varies the contribution of each input edge to interpolate between them. In order to save area for 16 interpolation steps, the weighting is binary coded for the lowest three bits. Thermometer coding is used for the two higher-order bits to avoid nonmonotonic discontinuities that can arise from full binary coding. A slight nonlinearity shown by the measured histogram plot of all 16 interpolation steps in Fig. 13(b) (generated by manually stepping through the interpolator) is due to the superposition of two exponential waveforms separated by an inverter delay.
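The origin of that nonlinearity can be seen in a toy model of the interpolator output. The sketch below assumes each set of buffers contributes an ideal single-pole exponential step, weighted by the fraction of buffers enabled on that side, and that the output edge is defined by a 50% threshold crossing. These are simplifying assumptions for illustration, and the names `blended_edge`, `crossing_time`, and `interpolation_steps` are hypothetical.

```python
import math


def blended_edge(t: float, w: float, d: float, tau: float) -> float:
    """Normalized output of two shorted drivers.

    The early edge starts at t = 0 with weight (1 - w); the late edge starts
    at t = d with weight w. Each edge is a single-pole exponential step.
    """
    def step_response(x: float) -> float:
        return 1.0 - math.exp(-x / tau) if x > 0.0 else 0.0

    return (1.0 - w) * step_response(t) + w * step_response(t - d)


def crossing_time(w: float, d: float, tau: float,
                  threshold: float = 0.5) -> float:
    """Bisection search for the time the blended edge crosses the threshold."""
    lo, hi = 0.0, d + 20.0 * tau  # output is essentially settled at hi
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if blended_edge(mid, w, d, tau) < threshold:
            lo = mid
        else:
            hi = mid
    return hi


def interpolation_steps(n_steps: int = 16, d: float = 1.0,
                        tau: float = 1.0) -> list:
    """Crossing times for weights 0, 1/n, ..., 1 (time in units of d)."""
    return [crossing_time(k / n_steps, d, tau) for k in range(n_steps + 1)]


if __name__ == "__main__":
    times = interpolation_steps()
    for k in range(1, len(times)):
        print(f"step {k:2d}: {times[k] - times[k - 1]:.4f} inverter delays")
```

In this model the crossing time moves monotonically from the early edge to the late edge as the weight shifts, but the individual steps are not equal in size, which is the kind of mild nonlinearity the histogram shows.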
Better linearity is achievable if the output time constant of the tri-state buffers approximately doubles the inverter delay [11]. Since this output time constant tracks with an inverter delay with adaptive supply regulation, good linearity can be maintained over a wide range of frequencies. The control signals for the multiplexers and interpolator come from a digital finite state machine (FSM) that consists of a simple binary up/down counter and decoder. Each counter value corresponds to one of 192 possible edge positions within a clock
cycle. A receiver for the parallel I/O clock acts as a phase detector (RX/PD), cancels out the data receiver set-up time, generates up and down pulses for the FSM, and closes the loop.

Fig. 13. Digital interpolator. (a) Circuit schematic. (b) Measured interpolation histogram.

While operating off of the lower regulated supply is attractive, because precision analog blocks can be replaced with digital gates while reducing power consumption, the performance of this peripheral loop may potentially be degraded due to the ripple induced on the regulated supply by the switching regulator. All digital logic and clock distribution buffers operate off of this supply voltage and, because delay is directly affected by the supply magnitude, the clock may jitter relative to the ripple magnitude and degrade the timing margin for the receivers. However, as long as the slew rate of the induced supply ripple is slower than the rate at which the peripheral loop can respond, the loop can track out this jitter. The worst case is under low I/O frequency conditions, because the response of the loop is proportional to the operating frequency. Therefore, the peripheral loop is designed to respond with a slew rate higher than the power-supply ripple at the lowest target frequency.

Fig. 14. High-impedance transmitter.

2) Transmitter: Given two chips that want to communicate using this type of interface, the internally generated supply voltages can vary significantly for the same frequency of operation, and therefore the interface cannot use high-impedance transmitters with nMOS current sources and pull-up loads that reference their signals to the supply. To ensure that the transmitted and received signals swing relative to a common reference, this interface uses pMOS current sources, as shown in Fig. 14, and all signals swing relative to ground, which is also the current return path for the signals. A 2:1 multiplexer feeds data to the output driver for double data-rate transmission, transmitting data on both phases of the clock. A parallel set of binary weighted nMOS devices acts as termination resistance and is configured digitally. Since low power is one of the primary design targets, a pair of these drivers can be configured to transmit complementary signals for a differential mode of operation, to compare power consumption for single-ended versus differential modes of signaling.

Given a 50-Ω termination resistance that matches the channel impedance, the current-source current magnitude sets the output swing magnitude. Due to the threshold voltage of the pMOS device, the gate overdrive voltage quickly decreases with reducing supply voltage and reduces the output swing. To combat this effect and allow for configuration flexibility to adjust transmission swing, a parallel set of binary weighted current source legs was implemented and can be selectively enabled to vary the current magnitude.

In addition to low swing magnitudes for low power, a desirable feature for transmitters is to have the output transition gradually, because sharp transitions inject high-frequency noise into the channel, supply, and ground, which degrades performance. Edge rates from 25% to 33% of the bit period are ideal for reducing coupling and reflection noise. In most transmitter designs, extra engineering is required to control slew rate, either by sequentially turning on and off parallel current source legs or by shaping the driver input to get the desired transition times at the output with a process monitoring feedback control [13]. This feedback control is implicit for an adaptively scaled supply environment, because edge rates, as well as the delay of an inverter, are fixed relative to the bit time.
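A short calculation makes the "fixed relative to the bit time" argument concrete. It relies on two facts stated earlier: the core DLL produces twelve clock edges per cycle spaced one delay-line inverter delay apart, and data is sent on both phases of the clock. Treating the delay-line inverter delay as the unit for the predriver edges is an assumption of this sketch, and the constant and function names are illustrative.

```python
from fractions import Fraction

EDGES_PER_CYCLE = 12  # clock edges per cycle out of the core DLL (from the text)
BITS_PER_CYCLE = 2    # double data rate: one bit per clock phase


def bit_time_in_inverter_delays() -> Fraction:
    """Bit time expressed in delay-line inverter delays."""
    return Fraction(EDGES_PER_CYCLE, BITS_PER_CYCLE)


def edge_time_window_in_inverter_delays(low=Fraction(1, 4),
                                        high=Fraction(1, 3)):
    """Target edge time (25% to 33% of a bit) in inverter delays."""
    bit = bit_time_in_inverter_delays()
    return low * bit, high * bit


def inverter_delay_seconds(f_clk_hz: float) -> float:
    """Absolute inverter delay once the DLL has locked at f_clk_hz."""
    return 1.0 / (EDGES_PER_CYCLE * f_clk_hz)


if __name__ == "__main__":
    lo, hi = edge_time_window_in_inverter_delays()
    print(f"bit time: {bit_time_in_inverter_delays()} inverter delays")
    print(f"edge-time target: {float(lo)} to {float(hi)} inverter delays")
    for f in (100e6, 500e6):
        d = inverter_delay_seconds(f)
        print(f"{f / 1e6:.0f} MHz: {float(lo) * d * 1e12:.0f} to "
              f"{float(hi) * d * 1e12:.0f} ps")
```

Because the window is a fixed number of inverter delays at every operating point, a predriver sized once for that fanout keeps the same fractional edge rate as the supply and frequency scale together.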
So, the rise and fall output transitions of the inverter predrivers to the current source are always a fixed percentage of the bit time regardless of process and temperature. By sizing the proper fanout ratio of the inverter predriver to the current source for the desired slew rates, automatic slew-rate control is achieved, which obviates the need for any additional hardware.

Another factor, which arises from the threshold voltage of the pMOS device, determines the sizing of the inverter predrivers to the transmitter current source. Transmitter waveforms with pulses that consume half the bit period maximize timing margins for the data transmitted on both phases of the clock. While this would not be a problem for sharp edges, given purposely sloped edge transitions and because the input to the driver must swing beyond a threshold below the supply to turn on the current source, a 50% duty-cycle input pulse results in a smaller pulse width at the output. To circumvent this problem, two inverter stages pre-distort the pulse to the current source to generate the desired 50%
duty-cycle waveforms at the driver output. Using two inverter stages makes this approach immune to relative process skewing between pMOS and nMOS devices. Adaptive voltage regulation compensates for delay variation of these inverter predrivers across process and temperature; however, because the threshold voltage is fixed, this distortion must be adjusted with respect to frequency, since the regulated voltage varies with frequency.

Fig. 15. Receiver preamplifier and regenerative latch.

3) Receiver: One of the few parts of the I/O transceiver that is not composed entirely of digital gates is the receiver, illustrated in Fig. 15, which consists of a preamplifier that buffers the received signal before driving into a regenerative latch that generates a full-swing output signal. We again take advantage of an adaptively regulated supply environment to build a receiver with several attractive properties. The preamplifier evaluates during half a clock cycle while the clock is low and resets by shorting the outputs together while the clock is high. During evaluation, the preamplifier amplifies the input and drives into a regenerative latch that samples the signal on the falling edge of a delayed clock. Since data is transmitted on both phases of the clock, a pair of these preamplifier latch pairs that operate on complementary clock signals is required. The preamplifier is configured to receive differential signals or a single-ended signal with a reference voltage, by turning on the appropriate nMOS switches at the input. The reference voltage is generated on the transmitter side to half the swing magnitude by configuring a transmitter to always transmit a high level with half the current. To save on pin count, only a single reference is required and is shared by all the receivers. Since the transmitted data swings relative to ground, the preamplifier consists of a pMOS differential pair with nMOS linear loads to properly receive and amplify the incoming signals.
The preamplifier also operates off the regulated supply, and the current source to the differential pair is biased
through a replica self-biasing scheme [14]. A half-replica of the differential preamplifier is used in the bias generator and its output drives one input of an amplifier, while the voltage on the other input is self-biased with an inverter. The output of the amplifier drives the current source of the half-replica and preamplifiers, and with feedback, the output swing is clamped to an nMOS threshold voltage. Analysis of this design reveals that the bandwidth of the preamplifier is set by the RC product of the capacitive loading on the output of the preamp and its output resistance. This resistance is dominated by the nMOS load device throughout its swing. Since the supply to the gates of these nMOS loads is again the regulated voltage, their output resistance is also effectively regulated. To the first order, this resistance tracks the effective "on" resistance of inverters in the core DLL's delay line so that the input receiver's bandwidth tracks the link's bit rate. This is advantageous because the bandwidth can be set to only allow in frequency components up to the bit rate and filter out unwanted high-frequency noise. Although this receiver design requires two clocks, only one needs to be routed, because the delayed clock can be locally generated with inverters. Again, by leveraging the delay tracking nature of inverters in a dynamically scaled supply environment, the relative spacing of the clock edges can always be guaranteed to be a fixed percentage of the clock period.

III. ANALYSIS AND RESULTS

The I/O test chip was fabricated in a MOSIS 0.35-µm n-well CMOS process. The prototype die micrograph is shown in Fig. 16 and performance is summarized in Table I. The test chip consists of four parallel sets of data receiver and transmitter pairs and a parallel clock link along the top and right sides of the chip. Eight-bit data sequence generators and
a 20-bit pseudorandom bit sequence (PRBS) generator and verifier test functionality and performance of the links. The links were tested by transmitting and receiving signals between two chips through 8 inches of 50-Ω traces on FR4 printed circuit boards and 36-inch coax cables. A core DLL and digital peripheral loop generate clock signals for the receiver, aligned with the parallel clock transmitted along with the data. The clock signal for the transmitters is generated off-chip.

Fig. 16. Test chip photo micrograph.

TABLE I TEST CHIP PERFORMANCE SUMMARY

A digital PID controller and power transistors for the buck converter reside in the lower left portion of the chip. The digital controller consumes 1.5 mW, significantly less than its previous implementation in [7]. The power transistors have sizes of 7 and 3.5 m for the pMOS and nMOS devices, respectively. The overall converter efficiency of the switching power supply, while delivering 200 mW to the load at 2.7 V from a 3.3-V supply, is greater than 94%.

The core DLL is supplied off of the high supply and operates from 33–500 MHz, while the digital peripheral loop operates off of the voltage set by the core loop and operates from 100–500 MHz. The receiver used as a phase detector in the peripheral loop limits the lower frequency range of operation due to a minimum 1.3-V supply headroom required by the analog preamplifier, which keeps the differential pair and current source in saturation. Fig. 17 shows the core and dual-loop jitter histogram plots while running at 400 MHz under quiet supply conditions. The larger jitter in the dual loop can be attributed to the peripheral loop occasionally dithering between interpolation steps. The core loop dissipates 37 mW operating off the 3.3-V supply and the digital logic for the peripheral loop, supplied with a 2.7-V regulated voltage, dissipates 19 mW. Table II summarizes the measured dual-loop DLL characteristics.

Fig. 17. DLL jitter histogram plots. (a) Core loop. (b) Dual loop.

The I/O interface successfully operates over a 100–500-MHz frequency range, which translates to a 0.2–1-Gb/s range in bit rates across a 1.3–3.2-V range in regulated voltage levels. The fabrication run for the test chip turned out slower than expected from simulations, and therefore required a high supply boosted to 3.7 V to comfortably meet the headroom requirements of the linear regulator in the core DLL in order to operate at the high-frequency target of 500 MHz. Transmitting and receiving
a 20-bit PRBS at a transfer rate of 0.8 Gb/s verifies a bit error rate (BER) less than $10^{-14}$ (which corresponds to three days of operation without a single error).
Fig. 18 plots the total regulated power consumed per link operating across a range of frequencies for single-ended and differential modes of operation to demonstrate the power-saving potential of dynamically scaling the supply to the I/O subsystem. Since the regulated voltage varies with frequency, power consumption drops dramatically at lower bit rates. The overhead line represents the power consumed by the receiver and a portion of the power dissipated in the bias generator, peripheral loop, clock distribution network, and testing circuitry amortized across all the links. The difference between the total power and the overhead represents the power dissipated in the transmitter at the minimum transmission swings achieved for the different frequencies, which are also plotted in Fig. 18. Although differential signaling requires two parallel channels to transmit data, it requires less than half the swing magnitude of single-ended signaling, and thus consumes less power. Noise on the reference voltage that is shared by all the receivers is common mode and adversely affects the voltage margins for the receiver, thus requiring the larger swing magnitudes for single-ended signaling. The dotted line presents the estimated overhead power when operating off of a fixed high supply voltage (3.3 V) and emphasizes the power-saving potential of this approach.
TABLE II DLL PERFORMANCE SUMMARY AT 400 MHz
Fig. 18. I/O power per link and minimum transmission swing versus frequency.
Fig. 19. Total I/O power breakdown.
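The error-free run quoted above (three days at 0.8 Gb/s) can be turned into a rough BER figure with simple arithmetic. The sketch below is only a back-of-the-envelope check, not the authors' measurement procedure: it assumes a constant bit rate on one link for the full three days, and it uses the simple "fewer than one error in N bits" estimate instead of a statistical confidence bound. The function names are illustrative.

```python
"""Back-of-the-envelope BER estimate from an error-free test run."""

SECONDS_PER_DAY = 24 * 60 * 60


def bits_transferred(bit_rate_bps: float, duration_s: float) -> float:
    """Total number of bits sent at a constant bit rate."""
    return bit_rate_bps * duration_s


def ber_upper_estimate(bit_rate_bps: float, duration_s: float) -> float:
    """Fewer than one error in N bits implies a measured BER below 1/N.

    This is a simple estimate, not a statistical confidence bound.
    """
    return 1.0 / bits_transferred(bit_rate_bps, duration_s)


# Values taken from the text: 0.8 Gb/s for three days without an error.
BIT_RATE_BPS = 0.8e9
DURATION_S = 3 * SECONDS_PER_DAY

n_bits = bits_transferred(BIT_RATE_BPS, DURATION_S)
ber_estimate = ber_upper_estimate(BIT_RATE_BPS, DURATION_S)

print(f"bits transferred: {n_bits:.4e}")
print(f"BER estimate:     < {ber_estimate:.2e}")
```

Under these assumptions roughly 2 × 10^14 bits cross the link, so the estimate lands a little below 10^-14.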
Fig. 20. (a) Data-eye histogram. (b) Edge transitions versus bit rate.
Given the relatively low swing magnitudes required for transmission over short distances on a board, the power consumed by the digital logic in the I/O subsystem dominates for this type of interface. A breakdown of all the components of power in the I/O transceiver is presented in Fig. 19. A significant portion of the power is consumed by the clock distribution, clock recovery, and data generators, while only 11% is consumed by the transceiver itself. Therefore, this type of I/O subsystem especially benefits from the savings possible by reducing the dynamic power dissipated by digital gates with adaptive power-supply regulation.
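To make the benefit for digital gates concrete, the following is a minimal sketch and not the authors' analysis. It assumes the textbook dynamic-power model P = C·V²·f with a normalized effective capacitance, and it assumes the regulated voltage varies linearly with clock frequency between the reported endpoints (1.3 V at 100 MHz and 3.2 V at 500 MHz). The linear interpolation and all function names are illustrative assumptions; static bias currents, leakage, and converter losses are ignored.

```python
"""Dynamic power of digital gates: regulated supply vs. fixed 3.3-V supply."""

# Reported operating endpoints of the regulated supply.
F_MIN_HZ, V_MIN = 100e6, 1.3
F_MAX_HZ, V_MAX = 500e6, 3.2
V_FIXED = 3.3  # fixed high supply used for comparison


def regulated_voltage(f_hz: float) -> float:
    """Assumed linear voltage-versus-frequency map between the endpoints."""
    if not F_MIN_HZ <= f_hz <= F_MAX_HZ:
        raise ValueError("frequency outside the reported operating range")
    frac = (f_hz - F_MIN_HZ) / (F_MAX_HZ - F_MIN_HZ)
    return V_MIN + frac * (V_MAX - V_MIN)


def dynamic_power(v: float, f_hz: float, c_eff: float = 1.0) -> float:
    """Textbook switching power C * V^2 * f (c_eff is normalized)."""
    return c_eff * v * v * f_hz


def savings_vs_fixed(f_hz: float) -> float:
    """Fraction of dynamic power saved relative to the fixed supply."""
    v_reg = regulated_voltage(f_hz)
    return 1.0 - dynamic_power(v_reg, f_hz) / dynamic_power(V_FIXED, f_hz)


for f_mhz in (100, 200, 300, 400, 500):
    f_hz = f_mhz * 1e6
    print(
        f"{f_mhz:3d} MHz: V = {regulated_voltage(f_hz):.3f} V, "
        f"savings = {100 * savings_vs_fixed(f_hz):.1f}%"
    )
```

The assumed linear map gives about 2.7 V at 400 MHz, in line with the regulated voltage reported for that operating point. Because the frequency is the same in both cases, the savings in this model depend only on the squared voltage ratio.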
An additional advantage of adaptively scaling the supply comes from the automatic slew-rate control of the transmitter output. Fig. 20 presents the measured data-eye diagram for a PRBS data stream transmitted at 0.8 Gb/s. It verifies the 50% duty-cycle signals achieved by pre-distorting the predrivers to the output driver and shows the output slews for a third of the bit time. This slew-rate control also extends to different frequencies, demonstrated by the relatively constant rise and fall transition times as a percentage of bit time, tabulated over a range of bit rates.
IV. CONCLUSION
To meet the challenges of potential performance bottlenecks and constraints in aggressively scaled chips, we proposed a low-power high-speed I/O interface. In addition to providing a mechanism for operating with high energy efficiency, adaptively regulating the supply enables a very simple and robust design that leverages standard CMOS logic for most of its components. This design implements a core DLL that consists of static CMOS inverters as delay elements, instead of traditionally analog components. Since the critical path in the I/O consists mostly of inverters, the DLL’s delay line serves to accurately monitor the minimum supply voltage needed to operate the link and provides clock edges to the digital peripheral loop for accurate timing recovery. Given power’s cubed dependence on voltage, the measured power for the I/O subsystem scales down dramatically with frequency. Also, a breakdown of the power reveals that the I/O subsystem significantly benefits from adaptively regulating the supply for all its digital components. Additional features, such as automatic slew-rate control and a frequency tracking receiver bandwidth, are facilitated with a tracking power supply. A prototype implementation of the described I/O subsystem has been integrated in a 0.35-μm CMOS technology.
The link prototype achieves a nominal transfer rate of 0.8 Gb/s/pin with a BER less than $10^{-14}$, while operating at a regulated voltage level of 2.7 V. Although data is only transmitted on two phases of the clock in the prototype, even higher bandwidths are achievable by multiplexing the data over more clock phases. This is well suited for an adaptively scaled voltage environment since generating additional clocks is easily achieved with simple inverters whose delays track with frequency. The power impact is also minimized since the circuitry operates in an energy-efficient regime. However, one future challenge for this approach to interface design arises because the transistor threshold voltage does not scale as quickly with each technology generation. Therefore, improved transmitter and receiver designs that can circumvent the threshold-voltage limitation are needed to make adaptive supply scaling an attractive technique for building low-power high-bandwidth parallel links in future aggressively scaled CMOS technologies.
REFERENCES
[1] P. Macken, M. Degrauwe, M. Van Paemel, and H. Oguey, “A voltage reduction technique for digital systems,” in IEEE ISSCC Dig. Tech. Papers, Feb. 1990, pp. 238–239.
[2] V. Gutnik and A. P. Chandrakasan, “An efficient controller for variable supply-voltage low power processing,” in IEEE Symp. VLSI Circuits Dig. Tech. Papers, June 1996, pp. 158–159.
[3] T. Burd, T. Pering, A. Stratakos, and R. Brodersen, “A dynamic voltage scaled microprocessor system,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2000, pp. 294–295.
[4] K. Suzuki et al., “A 300-MIPS/W RISC core processor with variable supply-voltage scheme in variable threshold-voltage CMOS,” in IEEE Custom Integrated Circuits Conf., May 1997, pp. 587–590.
[5] C. K. Yang et al., “A 0.5-μm CMOS 4.0-Gbit/s serial link transceiver with data recovery using oversampling,” IEEE J. Solid-State Circuits, vol. 33, pp. 713–722, May 1998.
[6] S. Sidiropoulos and M. Horowitz, “A 700-Mb/s/pin CMOS signalling interface using current integrating receivers,” IEEE J. Solid-State Circuits, vol. 32, pp. 681–690, May 1997.
[7] S. Sidiropoulos, D. Liu, J. Kim, G.-Y. Wei, and M. Horowitz, “Adaptive bandwidth DLLs and PLLs using regulated supply CMOS buffers,” in Proc. IEEE Symp. VLSI Circuits, June 2000, pp. 124–127.
[8] G. Wei and M. Horowitz, “A fully digital energy efficient adaptive power-supply regulator,” IEEE J. Solid-State Circuits, vol. 34, pp. 520–528, Apr. 1999.
[9] A. P. Chandrakasan et al., “Data-driven signal processing: An approach for energy-efficient computing,” in IEEE ISLPED, Aug. 1996, pp. 347–352.
[10] E. Yeung and M. Horowitz, “A 2.4-Gb/s/pin simultaneous bidirectional parallel link with per-pin skew compensation,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2000, pp. 256–257.
[11] S. Sidiropoulos and M. Horowitz, “A semi-digital dual delay-locked loop,” IEEE J. Solid-State Circuits, vol. 32, pp. 1683–1692, Nov. 1997.
[12] J. Maneatis, “Low-jitter process independent DLL and PLL based on self-biased techniques,” IEEE J. Solid-State Circuits, vol. 28, Dec. 1993.
[13] B. Lau et al., “A 2.6-Gb/s multipurpose chip-to-chip interface,” in IEEE ISSCC Dig. Tech. Papers, Feb. 1998, pp. 162–163.
[14] M. Johnson, “Bias circuit and differential amplifier having stabilized output swing,” U.S. Patent 5 452 898, Sept. 1995.
Gu-Yeon Wei was born in Seoul, Korea, in 1972. He received the B.S. and M.S. degrees in electrical engineering from Stanford University, Stanford, CA, in 1994 and 1997, respectively. He is currently working toward the Ph.D. degree in the Computer Systems Laboratory, Stanford University. His current research interests are in low-power, high-performance circuits and systems design, with specific interest in energy-efficient adaptive supply regulation and high-speed link design.
Jaeha Kim received the B.S. degree in electrical engineering from Seoul National University, Seoul, Korea, in 1997. He received the M.S. degree in electrical engineering from Stanford University, Stanford, CA, in 1999, where he is currently working toward the Ph.D. degree. His research interests include high-speed and low-power CMOS circuits, inter- and intra-chip interconnects, and clock-recovery circuits.
Dean Liu received the B.S. degree in electrical engineering from the University of Washington, Seattle, in 1997, and the M.S. degree in electrical engineering from Stanford University, Stanford, CA, in 1999, where he is currently working toward the Ph.D. degree. His research interests include clock generation and distribution, clock skew analysis and deskewing circuits, high-performance circuit design, and parallel computer architecture.
Stefanos Sidiropoulos received the B.Sc. and M.Sc. degrees in computer science from the University of Crete, Greece, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, in 1997. He is currently with Rambus Inc., Mountain View, CA, where he manages a group designing high-speed memory interfaces. He has worked with Digital Equipment Corporation, SGI and IIT. His work interests are in circuit design and CAD tools.
Mark A. Horowitz (S’77–M’78–SM’95–F’00) received the B.S. and M.S. degrees in electrical engineering from the Massachusetts Institute of Technology, Cambridge, and the Ph.D. degree from Stanford University, Stanford, CA. He is Yahoo Founder’s Professor of Electrical Engineering and Computer Sciences and Director of the Computer Systems Laboratory at Stanford University. He is well known for his research in integrated circuit design and VLSI systems. His current research includes multiprocessor design, low-power circuits, memory design, and high-speed links. He is also co-founder of Rambus, Inc., Mountain View, CA. Dr. Horowitz received the Presidential Young Investigator Award and an IBM Faculty Development Award in 1985. In 1993, he was awarded Best Paper at the International Solid State Circuits Conference.