Asynchronous Interconnect for Synchronous SoC Design

Asynchronous circuits can efficiently interconnect system-on-chip modules with different clock domains. Fulcrum’s Nexus interconnect features a 16-port, 36-bit asynchronous crossbar that connects through asynchronous channels to clock-domain converters for each synchronous module. In TSMC’s 130-nm process, Nexus achieves 1.35 GHz and transfers 780 Gbps.
Andrew Lines
Fulcrum Microsystems
System-on-chip (SoC) designs integrate a variety of cores and I/O interfaces, which usually operate at different clock frequencies. Communication between unlocked clock domains requires careful synchronization, which inevitably introduces metastability and some uncertainty in timing. Thus, any chip with multiple clock domains is already globally asynchronous. The literature is rife with techniques for integrating multiple clock domains, most of which rely on a localized clock-domain-crossing circuit that lets one clock domain talk directly to another. Most designs still implement long-range communication with synchronous circuits, requiring a widely distributed clock, which can be challenging to implement at high frequency.

At Fulcrum Microsystems, my colleagues and I have devised a more elegant and efficient solution to the multiple-clock-domain problem. Instead of gluing synchronous domains directly to each other with clock-domain bridges, we use asynchronous-circuit design techniques to handle all clock-domain crossing as well as all cross-chip communication and routing. All synchronous modules then have their own clock, without any frequency or phase constraints between domains. The phase-locked loop (PLL) and clock distribution can be entirely local to each synchronous core, easing timing closure and improving the reusability of cores across multiple designs.

Our solution, Nexus, is a globally asynchronous, locally synchronous (GALS) interconnect that features a 16-port, 36-bit asynchronous crossbar. The crossbar connects through asynchronous channels to clock-domain converters for each synchronous module. The GALS approach has been the subject of academic research for many years. Recent publications describe support for clock-boundary crossing with pausable clocks1 and the use of asynchronous channels and routing circuits.2 Compared to previous systems, Nexus exhibits higher throughput and lower latency, and provides a comprehensive and verified solution. In Taiwan Semiconductor Manufacturing Company’s (TSMC’s) 130-nm process, a Nexus system achieved 1.35 GHz at 1.2 V in less than 5 mm² of area. Nexus is a vital part of several upcoming commercial chip designs from Fulcrum and our partners.
Asynchronous integrated pipelining

Because a system is asynchronous merely by virtue of lacking a clock, asynchronous design has many possible styles and approaches. Nexus is based on the quasi-delay-insensitive (QDI) timing model,3 which requires that the circuit function correctly regardless of any gate delay and most wire delays. (Some isochronic forks in wires must have bounded relative delays, but in our designs, these are local to small cells and quite safe.) The QDI model is very conservative, since it forbids all forms of timing races, delay assumptions, glitches, and even clocks. These characteristics are particularly attractive for SoC interconnects, because QDI circuits work robustly over the huge delay variations caused by power-supply droop, in-die variation, local heating, and crosstalk.

In a QDI system, designers can’t use a separate wire to indicate when a data wire is valid because they can’t make an assumption about the wires’ relative delay. Instead, a QDI system mixes the data value and validity onto two wires (dual-rail, or one-of-two) and includes a backward-going acknowledge wire for flow control. Together, these wires form an asynchronous channel. When both data wires are 0, the channel is neutral, with no data present. To send a bit, the sender raises either the 0 data rail or the 1 data rail to send a logical 0 or 1. Once the receiver has received and stored the data, it raises the acknowledge wire. Eventually the sender sets the data rails back to neutral, and the receiver then lowers the acknowledge. This process is a four-phase dual-rail handshake. It is also possible to transmit two bits at a time by raising one of four data rails (quad-rail, or one-of-four), which we prefer over one-of-two codes because it reduces power and the number of acknowledge wires per bit.

At Fulcrum, we use a QDI design style originally developed at the California Institute of Technology from 1995 to 1999.4,5 It is a smaller and faster implementation of the QDI timing model, in which circuits use precharge domino logic plus some control overhead to combine logic with pipelining. The forward data path is similar to that of full-custom synchronous designs like the Pentium 4, except without timing margins, race conditions, or explicit latches. A typical pipeline stage has a two-transition forward latency and an 18-transition cycle time.
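As a concrete illustration of the handshake just described, the following Python sketch steps through one four-phase transfer on a one-of-four channel. The dictionary-based channel representation and the function name are illustrative conveniences, not part of the Nexus circuitry.

```python
# A behavioral sketch of one four-phase transfer on a one-of-four channel.
# Wires are modeled as booleans; there is no notion of gate delay here.

def four_phase_send(channel, value):
    """Send a two-bit value (0-3) over a one-of-four channel with flow control."""
    assert not any(channel["rails"]) and not channel["ack"], "channel must start neutral"
    channel["rails"][value] = True            # 1. sender raises exactly one data rail
    received = channel["rails"].index(True)   #    receiver sees the value and stores it
    channel["ack"] = True                     # 2. receiver raises the acknowledge
    channel["rails"][value] = False           # 3. sender returns the rails to neutral
    channel["ack"] = False                    # 4. receiver lowers the acknowledge
    return received

channel = {"rails": [False] * 4, "ack": False}
print(four_phase_send(channel, 2))            # -> 2, and the channel is neutral again
```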
Figure 1. Typical Nexus-based SoC.
Figure 2. Nexus’ burst format.
Nexus architecture

As Figure 1 shows, a SoC using Nexus has a crossbar that interconnects locally synchronous modules on the same chip. Each module can have an independent clock frequency or phase. Clock-domain converters connect the asynchronous interconnect to each synchronous module. Asynchronous channels carry the data across the chip to and from the central crossbar. The crossbar includes routing and arbitration circuitry to resolve contention on output ports. All parts of the system are safely flow controlled.
Figure 3. Crossbar components with number of bits in parentheses.
Nexus relies on one-way data transfers, or bursts. Figure 2 shows the burst format. A burst contains a variable number of data words (Dx), terminated by a tail bit. Each burst is routed with a sideband To channel, which converts to a From channel when it leaves the crossbar. The crossbar routes bursts atomically and cannot fragment, interleave, duplicate, or drop them. It creates a link when the To control arrives and wins the arbitration; the link closes automatically when the last word of the burst exits the crossbar.

In the asynchronous portions, we encoded channels as bundles of one-of-four data rails plus acknowledgments. On the synchronous interfaces, channels use a simple request-and-grant FIFO (first-in, first-out) protocol: on the clock’s rising edge, if both the sender’s request line and the receiver’s grant line are asserted, data advances. Either the sender or the receiver can stall the transfer. This is functionally equivalent to asynchronous flow control.

All channels are unidirectional, so every module has both an ingress and an egress channel. Nexus supports round-trip transactions as split transactions, with a request burst going out and a completion burst returning. Because Nexus preserves various ordering properties, such as producer-consumer and global store ordering, SoC designers can tunnel many legacy bus protocols through Nexus. Performance improves because there is no bus contention.

Existing Nexus versions support 16 modules, a data path of 36 bits plus the tail bit, and four-bit To/From channels. The description in the rest of this article refers to this configuration, although many variations are possible.
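The burst format and its atomic routing can be summarized in a short Python sketch. The Burst class, its field names, and the route generator are illustrative, and the real channels carry one-of-four rails rather than Python integers.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

@dataclass
class Burst:
    to_port: int        # 4-bit To control: destination port, 0-15
    words: List[int]    # variable number of 36-bit data words
    # The tail bit is implicit: 0 on every word except the last.

def route(burst: Burst, source_port: int) -> Iterator[Tuple[int, int, int]]:
    """Deliver a burst atomically; the From control replaces the To control."""
    assert 0 <= burst.to_port < 16 and all(w < (1 << 36) for w in burst.words)
    for i, word in enumerate(burst.words):
        tail = 1 if i == len(burst.words) - 1 else 0
        # The crossbar link stays open until the word with tail == 1 passes.
        yield (word, tail, source_port)       # data, tail, and From control

for flit in route(Burst(to_port=7, words=[0x123456789, 0x9ABCDEF01]), source_port=2):
    print(hex(flit[0]), flit[1], flit[2])
```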
Figure 3 shows Nexus’ crossbar components. We implemented the crossbar by decomposing it into smaller circuits that communicate on internal channels. The largest part of the crossbar is its data path, which must multiplex all input data channels to all output data channels. Split control channel Si controls the data path for each input, specifying to which output that input should send its next word of a burst. Merge control channel Mj controls the data path for each output, specifying from which input to receive the next word of a burst. The split control comes from an input-control block for that port, which also receives the To control channel. The merge control comes from an output-control block for that port, which also sends the From control channel. Between the I/O control and the data path are repeat circuits that replicate the same split and merge controls until a tail bit of 1 passes through the link, thus routing a variable-length burst.

Data path. The heart of the crossbar is a 16 by 16, four-bit demultiplex-multiplex circuit organized in a grid. Each input port broadcasts its split control and input data across a row. Each output port broadcasts its merge control and collects its output data on a bus for each column. Figure 4 shows the transistors at each grid intersection.

We encoded each four-bit split and four-bit merge control as two one-of-four codes. A hit circuit for each grid point checks that both the split and merge controls have selected it. The hit signal enables the data transfer from input bus L to inverted output bus R. After the output bus latches the data, the output port routes an acknowledge signal backward through the same grid point. Both sides then complete the handshake, returning first the data and then the acknowledges to their neutral states.

Nine 16 by 16, four-bit crossbars compose the 36-bit data path; the design distributes the split and merge controls to each of these chunks with some pipelining. Consequently, extending the data path does not create a performance bottleneck: a 128-bit crossbar would run at just as high a frequency, with only a slight increase in latency. We chose four-bit crossbars to minimize wiring congestion.
Figure 4. Data path grid circuit for the crossbar. When the hit signal goes high, 4-bit data transfers from input bus L to inverted output bus R. An acknowledge signal then goes backward through the same grid point.
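A behavioral Python sketch of the grid follows, assuming split_sel maps each input row to its selected output column and merge_sel maps each output column to its selected input row (illustrative names). A grid point transfers data only when both controls point at it, which is the hit condition of Figure 4.

```python
def crossbar_cycle(split_sel, merge_sel, data_in, n=16):
    """Move one four-bit word across every grid point that both controls select."""
    out = {}
    for col in range(n):
        row = merge_sel.get(col)              # which input this output listens to
        if row is not None and split_sel.get(row) == col:
            out[col] = data_in[row]           # hit: data crosses; the acknowledge
                                              # returns through the same grid point
    return out

# Input 3 sends a word to output 7 while input 5 independently sends to output 0.
print(crossbar_cycle({3: 7, 5: 0}, {7: 3, 0: 5}, {3: 0xA, 5: 0x5}))
```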
Repeat until tail. A similar one-bit crossbar handles the tail bit, except that it also makes copies of the incoming and outgoing tail bits. These copies go to repeat units on each split or merge control channel, which repeat the same control values for each word of a burst until the tail is 1. In this way, the data path and repeat units route variable-length bursts atomically, and the crossbar’s input and output control units need not know the burst’s length.

Input control. Each input port has an input-control unit that receives the To control channel and copies it to the S control. It also sends a token on a request channel (which consists of one request wire and one acknowledge wire) to the selected output-control unit. Nexus has 256 request channels connecting all 16 input-control units to all 16 output-control units.

Output control. The output-control unit waits until it receives a request from one or more input ports. It picks the first one and sends its port number on the From and M channels.
If multiple requests arrive at exactly the same time, metastability occurs, so we included a metastability filter that makes the output-control unit wait for the metastability to resolve. In a QDI circuit, such metastability introduces only minor uncertainty in the latency of arbitration, not a chance of failure.

If an input port could make two requests to two output ports in parallel, it might win its second request first. If another input had also requested the same output ports but in the opposite order, it too could have its second request win first. Each input would then have won permission to send to its second-choice destination, but the data waiting at the head of its input port is intended for its first choice. This scenario results in deadlock.
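To make the circular wait concrete, here is a small Python illustration (not Fulcrum’s circuit) of the wait-for cycle that forms when each input has been granted only its second-choice output; the port names and the cycle check are hypothetical.

```python
# Head-of-line burst each input must send before it can use any other grant:
head_of_line = {"in0": "outX", "in1": "outY"}

# Suppose both inputs also issued a second request in parallel, and each
# arbiter happened to grant the *second* choice first:
granted_to = {"outX": "in1", "outY": "in0"}   # output -> input now holding its link

# Each input waits for whoever holds its head-of-line destination.
waits_for = {inp: granted_to[dest]
             for inp, dest in head_of_line.items()
             if granted_to[dest] != inp}

def has_cycle(graph):
    """Detect a circular wait, the classic deadlock condition."""
    for start in graph:
        seen, node = set(), start
        while node in graph and node not in seen:
            seen.add(node)
            node = graph[node]
        if node == start:
            return True
    return False

print(waits_for)              # {'in0': 'in1', 'in1': 'in0'}
print(has_cycle(waits_for))   # True: neither burst can ever advance
```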
Two solutions to this problem are possible, and both require that each input port have at most one request outstanding at a time. In other words, the input port must be certain it has won its first arbitration before starting a second. One solution is to introduce backward-flowing grant channels from the outputs to the inputs. The other solution, which is trickier, is to remove enough of the integrated pipelining so that the first request token blocks the progress of the second until the first has won the arbitration. We used the first approach in our early crossbar designs but adopted the second approach thereafter because it requires substantially less area. Both solutions introduce a performance bottleneck in the control circuitry, which causes this circuitry to have lower throughput than the rest of the system: 20 percent lower for the first version and 50 percent lower for the second. Fortunately, arbitration is also pipelined and occurs only once per burst, not once per word, so this bottleneck rarely affects performance and never matters for bursts of two or more words.

Synchronous crossbars. Synchronous crossbars could use a similar data path grid to transfer data from all inputs to all outputs. In this case, the input port would broadcast the input split control and the data across the row, and the output port would broadcast the merge control up the column, combining them in a hit circuit. To merge the output data, the design could use an OR tree or, in a full-custom flow, a dynamic logic bus. Input and output control circuits would be similar, but the design would require extra support for flow control. During floorplanning, designers would have to ensure that the N² intermediate links are very short.

Although synchronous crossbars are possible, we believe the asynchronous design is superior because it can trivially support links at frequencies from DC to maximum. Our asynchronous design has no timing assumptions to verify and operates correctly over a wide voltage and temperature range. Moreover, it consumes power only for the bandwidth actually used. In contrast, a synchronous design that latches 16 inputs and 16 outputs would require tricky clock-gating logic to save power.

Figure 5. Power consumption of asynchronous pipelined repeaters versus synchronous latch channels.

Pipelined repeater
Pipelined repeater is another name for a traditional QDI asynchronous half buffer. Designers implement such a repeater as inverting C elements followed by inverters for each data rail, with a NAND gate and inverter driving an inverted acknowledge backward. The pipelined repeater has only half a token of storage, since either its input or its output channel must be empty. The repeater’s purpose is to decouple the four-phase handshake over long wires. Without the pipelined repeater, and even with normal inverter repeaters, longer channels operate at lower frequency. A 130-nm design, for example, needs a pipelined repeater every 2 mm to maintain 1.35 GHz, but designers could increase the spacing for slower links. The pipelined repeater is analogous to a synchronous latch, and both designs can benefit from the automatic insertion of inverters on long wires.

The comparison is interesting and quite relevant, because the power cost of cross-chip communication can be significant. Figure 5 compares the power consumption of pipelined repeaters to synchronous latch channels. Utilization, in this context, is the fraction of peak bandwidth actually used, so invalid synchronous padding doesn’t count. Activity is the probability that an individual bit of data will change from one word to the next. Peak power consumption occurs at 100 percent utilization and 100 percent activity.

The asynchronous system always has constant activity, because a return-to-zero phase resets the channel between valid data. Sending a valid word over a synchronous channel requires zero to one transitions per bit, depending on the activity. In addition, the clock input to the latch goes up and down once and drives all bits.
The one-of-four asynchronous buffer takes two transitions to send two bits, plus it raises and lowers an acknowledge signal once, which gates all bits. So the asynchronous and synchronous systems actually transition roughly the same number of nodes at 100 percent activity.

The asynchronous system scales linearly with utilization, all the way down to zero power at 0 percent utilization. The synchronous system also scales linearly with utilization (assuming it holds the old data value for a padding cycle), but the clock load on the latches consumes a constant amount of power. The synchronous system also has a data-dependent power dissipation that varies widely, because its power also scales linearly with activity, unlike in asynchronous systems.

Two parts of this figure are particularly interesting. The first is peak power, at 100 percent utilization and 100 percent activity, which is the maximum power consumption that designers must consider for power distribution and heat dissipation. At a rough estimate, the worst-case power for an asynchronous design is near that of the synchronous design. The second interesting scenario is average power, with typical utilization and activity percentages, which affects battery life and energy costs. In this scenario, as the figure shows, the asynchronous circuit is more likely to consume less power. Of course, the exact results vary with assumptions and implementations,6 but the figure highlights the qualitative differences in power consumption between the two designs.
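The qualitative curves in Figure 5 follow from a simple transition-counting model. The Python sketch below uses arbitrary normalized constants, not measured values, to show how the two designs scale with utilization and activity.

```python
def async_power(utilization, peak=1.0):
    # Return-to-zero signaling keeps activity constant, so power tracks only
    # the bandwidth actually used and falls to zero when the link is idle.
    return utilization * peak

def sync_power(utilization, activity, clock_load=0.3, data_cost=0.7):
    # The latch clock toggles every cycle regardless of traffic; the data
    # wires toggle in proportion to activity on the cycles that carry data.
    return clock_load + utilization * activity * data_cost

print(async_power(0.0),  sync_power(0.0, 0.25))    # idle: 0.0 vs 0.3
print(async_power(0.25), sync_power(0.25, 0.25))   # typical: 0.25 vs ~0.34
print(async_power(1.0),  sync_power(1.0, 1.0))     # peak: 1.0 vs 1.0
```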
Clock domain converter

The clock domain converter consists of two independent circuits: synchronous-to-asynchronous converter S2A for outbound data and asynchronous-to-synchronous converter A2S for inbound data. In both directions, these converters propagate full sender and receiver flow control across the boundary. Figure 6 shows the structures of S2A and A2S.
Figure 6. S2A (a) and A2S (b) clock-domain converters.
Synchronization control circuit. Both S2A and A2S use the same control circuit to decide when a transfer should occur. The circuit accepts a valid signal from the input side and an enable from the output side, and must produce an enable to the input side and a valid signal to the output side. The asynchronous side interprets these signals as a one-of-one channel (a request rail plus an acknowledge). The synchronous side uses them as request and grant signals. The converter control also receives the clock from the synchronous module and produces a go signal to latch data.

The control circuit must decide whether a transfer should move forward on a given clock edge. To do that, it arbitrates, determining whether both sides are ready on the clock’s rising edge. We implemented this arbitration with a metastable circuit consisting of cross-coupled NAND gates followed by a metastability filter, allotting a half cycle for metastability resolution. The arbitration’s result determines whether the next cycle advances or holds the old data.
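A behavioral sketch of that decision follows, assuming the valid/enable names used above. The real circuit resolves near-simultaneous inputs with the cross-coupled NANDs and metastability filter; the sketch shows only the functional, post-resolution behavior.

```python
def converter(clock_edges):
    """Yield 'transfer' or 'hold' for each rising edge, given (valid, enable) pairs."""
    for valid_in, enable_out in clock_edges:
        # go is evaluated once per rising edge; if the inputs change too close
        # to the edge, the hardware may resolve either way, but only after the
        # metastability filter settles (budgeted at half a clock cycle).
        go = valid_in and enable_out
        yield "transfer" if go else "hold"

edges = [(True, False),   # data waiting, but no space on the output side -> hold
         (False, True),   # space available, but no data has arrived      -> hold
         (True, True)]    # both sides ready                              -> transfer
print(list(converter(edges)))
```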
Data path. The S2A data path uses the control output to latch the synchronous data into a flip-flop and perform the asynchronous handshake. It assumes that the asynchronous handshake will complete within a clock cycle without blocking. The A2S data path is similar; it takes an asynchronous one-of-four code and latches it in flip-flops. A2S operation assumes that once the transfer starts, all asynchronous data is present and will complete the handshake within a clock cycle.

To guarantee that the asynchronous data paths are ready, a completion detection circuit must check either that all bits of an asynchronous input have arrived or that the asynchronous output channel has enough space. Instead of trying to complete all bits in a single C-element tree, we used a pipelined completion circuit that can sustain high frequency regardless of data path width. A2S completion sends a one-of-one channel to the control circuit to indicate that all data is present. S2A completion sends a one-of-one channel to the control circuit to indicate that there is space in the output FIFO. The design initializes a few extra tokens on this channel to match the amount of storage available between the flip-flop and the completion detector. By checking that all input bits have arrived or that all output bits have buffer space, the converters can safely tolerate any relative skew between bits that the asynchronous interconnect introduces.

Latency. We measure latency from when an input channel becomes valid to when the output channel becomes valid, assuming the link is empty to begin with. On the synchronous side, latency is the time until the request is valid on a rising clock edge. S2A has very little latency because it creates the asynchronous token directly from the rising clock through only a few stages of logic. A2S also has little asynchronous latency in completing the incoming asynchronous bits and making the control decision, but it must also wait for metastability resolution. It could also have to wait an extra cycle if it just misses the sample window.
MTBF analysis. This design requires the association of a single potentially metastable synchronization with each transfer to or from a clock domain, regardless of data path width. To function correctly, the synchronization must resolve within about half the clock cycle. To check the mean time between failures (MTBF), we ran a Spice simulation of the exponential decay of the metastable state across various process corner cases, which let us predict resolution time as a function of input-event spacing. We assumed worst-case simultaneous arrival time with a linearly distributed jitter of a few picoseconds and computed the probability that the metastable state would last longer than the allotted resolution time. From this we computed an MTBF for each converter. We found that even with all converters encountering metastable events on every cycle, Nexus can achieve an MTBF of millions of years at frequencies up to 800 MHz. Because MTBF depends strongly on frequency, we believe that for clock frequencies above 800 MHz, a longer metastability resolution time is preferable for high-reliability systems. To that end, we have designed a new converter that designers can configure to allow either a half or a full clock cycle for metastability resolution, depending on module frequency.
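Readers who want to reproduce the flavor of this analysis can use the standard synchronizer-MTBF model; the time constant, capture window, and event rate below are illustrative assumptions, not the Spice-derived values used for Nexus.

```python
import math

def synchronizer_mtbf(t_resolve, tau, t_window, f_clock, f_data):
    """MTBF in seconds for one synchronizer: exp(t/tau) / (Tw * fc * fd)."""
    return math.exp(t_resolve / tau) / (t_window * f_clock * f_data)

f_clock = 800e6                # module clock frequency (Hz)
t_resolve = 0.5 / f_clock      # half a clock cycle allotted for resolution
tau = 12e-12                   # assumed regeneration time constant (s)
t_window = 5e-12               # assumed metastability capture window (s)
f_data = f_clock               # worst case: a potentially metastable event every cycle

years = synchronizer_mtbf(t_resolve, tau, t_window, f_clock, f_data) / (3600 * 24 * 365)
print(f"{years:.3g} years per converter")
```

With these illustrative numbers, the estimate comes out in the hundreds of millions of years for a single converter; even divided across every converter on the chip being busy on every cycle, the MTBF remains millions of years, consistent with the claim above.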
Comparison to standard techniques. Designers can position an S2A and an A2S back to back to create a clock-boundary bridge circuit, then stretch and pipeline the asynchronous channels between the converters to span arbitrary on-chip distances at full frequency.

In the standard technique for clock-boundary crossing, designers use a dual-ported register file to store the data (with writes and reads on the two clocks), with head and tail pointers in Gray code transferred across the clock boundary through metastable flip-flops. This approach is popular because designers can easily synthesize it using a standard application-specific IC or field-programmable gate array. Although the data path has no metastability, each bit of the head or tail pointers is potentially metastable, and the transfer of these pointers involves two metastable events per data transfer. Our converter is much simpler, with only a single metastable node and simultaneous computation of forward validity and backward acknowledges. The data path is essentially a synchronous flip-flop with a small amount of asynchronous buffering, which is smaller than a dual-ported register file. The latency of our design is also much lower. Others have developed similar optimized designs,1,7 but the standard technique is still the most widespread.

Verification and test

To ensure that Nexus will work robustly in a commercial application, we developed and applied many verification and test strategies, including novel variations of noise analysis, timing analysis, and fault and delay testing.
Noise analysis

Asynchronous circuits are susceptible to noise glitches for a larger portion of their cycle than synchronous designs with flip-flops and combinational logic. Although our digital design never generates glitches itself, analog noise from charge sharing or capacitive coupling can cause glitches. We confine dynamic logic that is susceptible to charge sharing to small cells, which we can check locally. We use various design guidelines and netlist transformations to fix any charge-sharing problems without resorting to precharging internal nodes. Capacitive coupling tends to affect longer wires, so we drive all long wires with strong inverters, and also insert inverters or pipelined repeaters spaced at reasonably short distances. Also, the frequent use of quad-rail (one-of-four) encoding means that only one in a group of four wires is likely to be an aggressor.

We ran exhaustive Spice simulations for both forms of noise, using an in-house tool to set up and simulate the relevant portions of large circuits. All circuits simulated correctly with a worst-case combination of simultaneous capacitive-coupling aggressors and charge sharing.
Timing analysis

Nexus’ QDI nature alleviates much of the timing-analysis problem because timing variances can affect performance but won’t result in failure. To select transistor sizes, we used a sizing tool that relies on floorplanning information to estimate wire capacitance and resistance. After layout, we used Spice simulation to verify performance over process, voltage, and temperature corner cases. If the wires between blocks did not satisfy the target delay budget, we inserted more pipelined repeaters. For the synchronous converters, normal setup and hold times apply.
Fault and delay testing

The extensive use of domino and dynamic logic presents challenges for fault modeling, although these are largely similar to the challenges in full-custom synchronous designs, which also use domino logic. We used an equivalent model for our dynamic gates and a commercial tool to fault-grade our circuits. We have not yet adapted a commercial tool to automatically generate test patterns, but given Nexus’ simple data path, we can generate patterns manually.

Fault testing requires some additional circuitry. In each converter, we included a loopback multiplexer on the synchronous side. The multiplexer can bounce an incoming burst back into the crossbar after swapping its From control with some data bits in the first word to create the To control and a modified first word. One of the Nexus ports has an asynchronous test-assist module that launches bursts into the interconnect, bounces them through two ports and back, and then around again for a configurable number of iterations. This process can test all the links between modules with various data patterns, running them at full speed. We used a standard scan chain to check the final burst data on the synchronous side. Many faults would result in deadlock, which would become evident from the lack of valid data at the expected time.

Despite Nexus’ asynchronous nature, we were able to make the test stimulus and response completely deterministic merely by waiting a suitable time before examining the results. This process is also suitable for detecting speed faults or performing speed binning. We have completed the design of this test-assist unit, but it is not in the latest test chip.
Characterization results

We have fabricated and characterized Nexus in numerous processes with minor design variations, and it was functionally correct and performed well in all of them. In TSMC’s 180-nm G (generic logic) process, the Nexus system operated at 450 MHz, running at 1.8 V and 25°C. In TSMC’s 150-nm G process, it operated at 480 MHz at 1.5 V and 25°C. Recently, we fabricated Nexus in TSMC’s 130-nm LV (low voltage) process, with both reduced-dielectric-constant (low-k) and fluorinated silicate glass (FSG) insulators. Figure 7 shows the layout of the crossbar in the 130-nm process.
Figure 7. Layout of the 16-port 36-bit crossbar in TSMC’s 130-nm process. From bottom to top: the I/O control, repeat-until-tail logic, the four-bit and one-bit crossbars, and the remaining 32 bits of crossbars.
The 16-port, 36-bit crossbar itself is 1.75 mm², and each pipelined repeater is 0.025 mm² for both directions, with 36 bits of data, a 1-bit tail, and 4 bits of control. The clock domain converters are 0.2 mm² in this design, but we plan to reduce them to 0.1 mm² by optimizing the mostly automated initial layout. The total area of a typical Nexus system using all 16 ports and an average of two pipelined repeaters per link is 4.15 mm², a small fraction of the total area of a typical SoC.

For the 130-nm process, we verified Nexus’ correct operation from –55°C to 125°C and from 0.7 V to 1.4 V. Figure 8 summarizes the results of measuring frequency over voltage at 25°C for both the low-k and FSG processes. We also measured the energy per bit transferred, in picojoules (pJ) per bit. The energy measurement was for a two-word burst from a synchronous module to itself and is independent of the data values transferred. Table 1 summarizes key frequency, latency, and energy results at 25°C.
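The 780-Gbps figure quoted at the start of the article follows directly from the measured frequency, assuming all 16 ports carry a 36-bit word on every cycle; the short calculation below is only that back-of-the-envelope product.

```python
ports, word_bits, f_ghz = 16, 36, 1.35
peak_gbps = ports * word_bits * f_ghz      # all ports busy on every cycle
print(peak_gbps, "Gbps")                   # 777.6, i.e. roughly 780 Gbps
```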
Figure 8. Nexus frequency versus voltage at 25°C in TSMC’s 130-nm process for both the low-k and FSG insulators.
As shrinking process geometries and higher frequencies increase the difficulty of global clock distribution, and as timing variability increases, we believe that a GALS approach is the best way to design large SoCs. The Nexus interconnect is a complete and proven solution to this problem. As other asynchronous components are introduced, Nexus will make the gradual migration from synchronous to asynchronous cores easier. Soon, a SoC designer will be free to choose whichever design style provides the best solution for each component.
Table 1. Key results of Nexus operation at 25°C for TSMC’s 130-nm process with both low-k and FSG insulators.

Process   Voltage (V)   Frequency (GHz)   Latency* (ns)   Energy per bit (pJ)
Low-k     1.2           1.35              2.0             10.4
Low-k     1.0           1.11              2.4             7.0
FSG       1.2           1.10              2.5             11.2
FSG       1.0           0.87              3.1             7.6

*Reported latency does not include the receiving module’s 1/2 to 3/2 clock period. Because latency is difficult to measure directly, we inferred it from a combination of lab measurements and Spice simulation.
References
1. G. Taylor et al., “Point to Point GALS Interconnect,” Proc. 8th Int’l Symp. Asynchronous Circuits and Systems (Async 02), IEEE CS Press, 2002, pp. 69-75.
2. W.J. Bainbridge and S.B. Furber, “Delay-Insensitive System-on-Chip Interconnect Using 1-of-4 Data Encoding,” Proc. 7th Int’l Symp. Asynchronous Circuits and Systems (Async 01), IEEE CS Press, 2001, pp. 118-126.
3. A.J. Martin, “The Limitations to Delay-Insensitivity in Asynchronous Circuits,” Proc. 6th MIT Conf. Advanced Research in VLSI, MIT Press, 1990, pp. 263-278.
4. A. Lines, Pipelined Asynchronous Circuits, master’s thesis, CS-TR-95-21, Caltech, 1995.
5. A.J. Martin et al., “The Design of an Asynchronous MIPS R3000 Processor,” Proc. 17th Conf. Advanced Research in VLSI, IEEE Press, 1997, pp. 164-181.
6. K. Stevens, “Energy and Performance Models for Clocked and Asynchronous Communication,” Proc. 9th Int’l Symp. Asynchronous Circuits and Systems (Async 03), IEEE CS Press, 2003, pp. 56-66.
7. A. Chakraborty and M. Greenstreet, “Efficient Self-Timed Interfaces for Crossing Clock Domains,” Proc. 9th Int’l Symp. Asynchronous Circuits and Systems (Async 03), IEEE CS Press, 2003, pp. 78-88.
Andrew Lines is CTO and cofounder of Fulcrum Microsystems, where his responsibilities include asynchronous-circuit design, architecture, CAD tools, and general technology development. Lines has an MS in computer science from the California Institute of Technology. He is a member of the IEEE.

Direct questions and comments about this article to Andrew Lines, Fulcrum Microsystems, 26775 Malibu Hills Rd., Calabasas, CA 91301; [email protected].