Pausible Clocking Based Heterogeneous Systems - CiteSeerX

0 downloads 0 Views 241KB Size Report
FIFO channel; communication between a module and the FIFO is done using a request/acknowledge handshaking. Synchronization of handshake signals to the ...
Pausible Clocking Based Heterogeneous Systems Kenneth Y. Yun, Member, IEEE

Ayoob E. Dooply, Student Member, IEEE

Abstract— This paper describes a novel communication scheme, which is guaranteed to be free of synchronization failures, amongst multiple synchronous and asynchronous modules operating independently. In this scheme, communication between every pair of modules is done through an asynchronous FIFO channel; communication between a module and the FIFO is done using a request/acknowledge handshaking. Synchronization of handshake signals to the local module clock is done in an unconventional way [1], [2], [3], [4] — the local clock built out of a ring oscillator is paused or stretched, if necessary, to ensure that the handshake signal satisfies setup and hold time constraints with respect to the local clock. In order to validate this scheme, we implemented a test chip in 0.5µm CMOS. This chip is designed as a ring, composed of two synchronous modules, an asynchronous module, and two asynchronous FIFO’s. Each module functions as a receiver to one module and a sender to another module. Test results show that the chip functions reliably up to 456MHz. Keywords— Globally asynchronous locally synchronous, Stretchable clock, Synchronization, Heterogeneous systems

I. I NTRODUCTION The next generation digital VLSI systems will necessarily be based on system-on-a-chip concepts, in order to satisfy unrelenting demands for higher performance and also to accommodate smaller packaging and low power requirements. These on-chip systems will consist of multiple independently synchronized modules: some may be clocked modules, such as synchronous processor cores, while others may be clockless (asynchronous) modules, such as peripheral controllers. These chip designs will resemble today’s complex board-level designs. On-chip modules will be held together by interface glue logic which must facilitate high speed communication between synchronous modules operating at different clock frequencies, between synchronous and asynchronous modules, and between asynchronous modules. The key difference between today’s board-level designs and the future system-on-a-chip designs, however, is the speed at which communications take place. A first step toward such heterogeneous system design on a chip is a reliable high-speed communication scheme among multiple synchronous modules operating independently. We examined a variety of communication schemes that attempt to mitigate synchronization failure without sacrificing communication throughput. They generally fall into one of two categories: (1) brute-force synchronization of communication signals to each module’s free-running clock with an acceptable level of synchronization failure; (2) adjustment of individual synchronous module’s local clock, when necessary, to avoid synchronization failure. The first category includes methods such as the well-known double-latching scheme and a natural extension of the doublelatching scheme called pipeline synchronization [5]. These methods reduce the probability of synchronization failure to This work was supported in part by a gift from Intel Corporation and by a National Science Foundation CAREER Award MIP-9625034. The authors are with ECE Dept., University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0407; email: {kyy,adooply}@ucsd.edu.

an acceptable level by resynchronizing communication signals with back-to-back latches. These methods are simple and inexpensive to implement, but a major drawback is the latency of communication. The methods in the second category [1], [2], [3], [4] generally rely on stopping or stretching each synchronous module’s local clock to guarantee that communication signals never violate setup and hold time constraints with respect to the local clock. Although these methods are robust and do not incur long communication latency, they involve designing a special clocking circuit, unfamiliar to most designers. The simplest example using this scheme one can conceive is a synchronous module communicating with an asynchronous peripheral. In this system, the synchronous module latches the handshake signals from the asynchronous module by stopping or stretching its own clock, when necessary. Both categories described above require some form of arbitration. However, if the phase relation between two synchronous modules can be predicted deterministically as in [6], then we can avoid synchronization failure, without arbitration, by timing communications when safe. In this paper, we describe a general method of communication amongst multiple synchronous and asynchronous modules operating independently, i.e., at different clock frequencies or phases, based on the pausible clocking scheme as shown in Fig. 1. Synchronous modules communicate with one another via asynchronous FIFO’s1 used as communication channels. The interfaces between the synchronous modules and the FIFO are pausible clocking control (PCC) circuits. In order to validate this scheme, we implemented a test chip in 0.5µm CMOS through MOSIS. This chip is designed as a ring, composed of two synchronous modules, an asynchronous module, and two asynchronous FIFO’s. Each module functions as a receiver to one module and a sender to another module. Each synchronous module has a frequency control with four settings, which is independently programmable externally. The chip functions reliably for all the settings, albeit substantially faster than expected by simulation. II. D ESIGN

OF

PAUSIBLE C LOCKING C ONTROL

The pausible clocking control is a scheme to avoid synchronization failure by adjusting the local clock. A synchronization failure at the module interface occurs when the arrival times of an external signal transition and a sampling edge of the clock are indistinguishable by the sampling latch at the module boundary. In our scheme, the synchronization failure is circumvented by pausing or stretching the local module clock when necessary. 1 Self-timed FIFO’s have been used to facilitate robust communication between synchronous modules operating at the same frequency elsewhere [7]. However, they have not been utilized in communication between synchronous modules operating independently at different clock frequencies.

IEEE TRANSACTIONS ON VLSI SYSTEMS

sender / receiver

Synch Module 1

PCC

R 1ρ A1ρ

2

R 2σ A2σ

FIFO

A1σ R 1σ

A2ρ R 2ρ

FIFO

sender / receiver

Synch PCC Module 2

Fig. 1. Two synchronous modules communicating via asynchronous FIFO channels.



R1

R

Arbiter

R2

Ring Oscillator

ME

AFSM Aρ FSM SR ρ G1



G

rclk

AFSM R σ FSM SAσ G2

sysclk

sysclk

Fig. 2. PCC.

A. Synchronization Strategy A block diagram of the PCC is shown in Fig. 2. This scheme uses a mutual exclusion element (ME) to force the temporal separation of the sampling edges of the clock and external signal transitions. A mutual exclusion element [2], [8] is a circuit that allows one request to pass through at a time on a first come first serve basis. When two inputs arrive simultaneously, it selects one to pass through arbitrarily. Because MEs require that requesters competing for shared resources must be persistent, the clock input to the ME must be “stretched” when it loses an arbitration. A ring oscillator is used instead of a crystal oscillator in order to be able to adjust the duration of off-phases of the clock. The local module clock, sysclk, is a buffered version of one of the outputs of the ME. It normally has a 50% duty cycle, except when the clock input loses an arbitration, in which case an off-phase of the clock is stretched. Consider a scenario in which a synchronous module equipped with the PCC is receiving inputs from the FIFO. The PCC needs

Rρ R1 R G G1 SRρ Aρ rclk sysclk

setup

sysclk paused Aσ R2 Fig. 3. PCC timing (bidirectional communication).

an arbiter, as depicted in Fig. 2, to decide which of the two independent inputs (the request from the sending FIFO [Rρ ] and the acknowledgment from the receiving FIFO [Aσ ]) to be presented to the ME. For this scenario, we assume that Rρ has won the arbitration. Thus, in this case, the arbiter simply forwards R1 to the ME and G from the ME. As shown in the shaded area of Fig. 3, Rρ is forwarded to the ME via the asynchronous finite state machine (AFSM) and the arbiter. If rclk is low when R rises, then the ME immediately raises G, which prompts the AFSM to generate an event on SR ρ . This event is effectively synchronized to sysclk, i.e., guaranteed not to induce a synchronization failure when sampled by the FSM, under a reasonable timing assumption as described below. Note that rclk may rise before the ME lowers G, but the ME will not allow sysclk to rise until G becomes low. On the other hand, if rclk is already high when R rises, the assertion of G is stalled until rclk is lowered. As soon as rclk falls, the ME raises G and the AFSM generates an event on SR ρ . Rclk may actually rise at about the same time R rises. In such situations — the situations in which temporal separation of R+ and rclk + becomes blurred — the ME simply “tosses a coin” to determine which signal to service first. If rclk + wins the “coin toss”, then sysclk rises first and G remains low until sysclk falls (which happens shortly after rclk falls). On the other hand, if R+ wins, then the ME raises G first and blocks sysclk from rising. In order to prevent sysclk from stalling indefinitely (until the next toggling of the request, Rρ ), the AFSM lowers R1 immediately after G1 rises, which in turn causes G to fall allowing sysclk to rise. The PCC does not differentiate rising edges of Rρ from falling edges — both edges enable R1 to be asserted and G1 to be asserted as a result. In fact, the AFSM effectively performs a two-phase to four-phase conversion from Rρ to R1 and a four to two-phase conversion from G1 to SR ρ . This conversion is independent of whether a two-phase or four-phase communication protocol is used between the FIFO and the synchronous module. It is merely done so that both edges of SRρ are synchronized to sysclk. In order for the synchronous FSM that generates Aρ to recognize the change in SR ρ , we need to ensure that SR ρ satisfies setup and hold time constraints with respect to sysclk. In order to recognize SR ρ + (SR ρ −)2 reliably, the path from G1 + to SR ρ + (SR ρ −) must be shorter than the path from G1 + to sysclk + via R− and G− by at least the data setup time for the FSM latches. This is easily satisfied in practice because G1 + to SR ρ + (SRρ −) delay is a complex gate delay (transitions on SR ρ are directly triggered by G1 +), which is much less than the delay from G1 + to R− to G− to sysclk +. The asynchronous finite state machine (see Fig. 4) is specified in burst-mode [9] and synthesized using the 3D-gC synthesis tool [10]. This burst-mode state machine has two inputs (Rρ , G1 ) and two outputs (R1 , SR ρ ). In state 0, when Rρ rises, the machine raises R1 and goes to state 1. In state 1, the machines waits for G1 to rise; when it does, the machine lowers R1 and raises SR ρ concurrently and goes to state 2. When G1 falls in state 2, the machine transitions to state 3. The machine transi2 We

use a+ and a− to denote rising and falling transitions of a respectively.

IEEE TRANSACTIONS ON VLSI SYSTEMS

0

1

2

4

G1−

/

5

G

R ρ + / R 1+

weak

G1

G1+ / R 1− SR ρ −

R

gME R1

/

R ρ − / R 1+

R1 G1

reset

G1+ / R 1− SR ρ + G1−

3

3



G1

G1

SR ρ

G2 R2 (a) Arbiter with fast−forward request

SR ρ

gME

weak

R1

SR ρ

G T1

G2

G1 Fig. 4. PCC asynchronous finite state machine specification and implementation.

G2 T2

tions through states 4 and 5 and back to 0, as Rρ − triggers a sequence of signal transitions ending with G1 −.

(b) Gated ME

B. Arbitration In order to simplify the interface to the ME, our design uses an arbiter to select just one external signal to pass through at a time. An arbiter [8], [11], [12] is a circuit that propagates one request at a time (as does the ME) but also acknowledges the requesters with grant signals as well. The arbiter used in our design [13] shown in Fig. 5 is different from conventional ones. When R1 or R2 arrives at the arbiter, the arbiter forwards a request to the ME, without determining which one has arrived first. As shown in Fig. 3, R1 and R2 (enabled by Rρ and Aσ respectively) may arrive at the same time, causing gME (gated ME inside the arbiter) to go into a metastable state. When R+ arrives at the ME, rclk + may also arrive at about the same time as well. If the ME determines that R+ has arrived first (after resolving its internal metastability), it raises G and blocks sysclk from rising (thus stretching it) until R falls. Note that the metastability resolution in the gME would be well underway, if not already finished, by the time G is asserted. Thus the metastability resolution in the gME and the ME are done concurrently. When G is asserted, G1 rises, which in turn resets R and R1 . R− enables G−, which in turn enables sysclk to rise. The pending request R2 enables R to be asserted again after G1 resets. However, it is blocked until rclk falls. Note that our arbiter design is not speed-independent, i.e., it requires a timing constraint on its environment for correct functionality. Consider an example depicted in Fig. 5(c), in which R2 + starts a chain of events, starting with R+. After G is asserted, G2 is raised. This event enables both R− and R2 −. If R2 side is slow to respond, i.e., R2 does not fall till G is negated, then this residual R2 may be viewed as a new request to the gME. In our design, the ASFM (see Fig. 4) lowers R2 immediately, once G2 rises, in order to prevent this problem. More precisely, the following timing constraint is always satisfied for both i = 1 and 2: tG+ →R− + tR− →G+ + tG+ →G− > tG+ →R− . i

G1

R2

i

i

i

(1)

R1 T1 G1 R2 T2 G2 R G (c) Arbiter timing Fig. 5. (a) Fast-forward arbiter; (b) Gated ME (ME with enabled outputs); (c) Fast-forward arbiter timing.

III. S YSTEM C ONFIGURATIONS

AND

L IMITATIONS

Using the pausible clocking scheme, it is conceivable that one can construct a heterogeneous multi-processor system with point-to-point links between every pair of nodes. Each link is a bidirectional FIFO as shown in Fig. 1. However, as fanouts from and fanins to each node increase, the arbiter block becomes larger making the system design impractical. However, we assert that it is possible to construct a ring configuration as shown in Fig. 6(a) similar to the systems proposed in Scalable Coherent Interface (SCI) specification [14]. In this structure, messages are always transmitted to one side and received from the other side, so that only one level of arbitration is required in the PCC. A major advantage of our ring configuration over other proposed systems, such as SCI system, is that it is a truly heterogeneous system with each node operating at its own speed. Another possible system configuration would be a conventional bus based architecture depicted in Fig. 6(b). In this config-

IEEE TRANSACTIONS ON VLSI SYSTEMS

Node

(1)

Sync Rρ

Node

R1 AFSM

Aρ FSM FIFO

FIFO

Sync

Sync

Async

Node

Node

Aσ (3)

R2

R σ FSM SAσ

AFSM

G2

G

Ring Oscillator

ME

G1

SR ρ

Node

R

Arbiter

Node

FIFO

Sync

FIFO

Sync

4

rclk G

(a) Heterogeneous ring configuration

FIFO

FIFO

sysclk

uP

Fig. 7. Performance constraints.

Node

Sync Node

FIFO

Async

FIFO

Asynchronous bus

Async Bus Arbiter

sysclk (2)

(b) Bus configuration

Fig. 6. (a) A heterogeneous message-passing multiprocessor using PCC’s; (b) A conventional bus based system.

uration, synchronous modules are connected to an asynchronous bus, e.g., Futurebus+, via FIFO’s and asynchronous modules are connected directly to the bus. Since the synchronous modules in this configuration interface to the bus only via two FIFO’s, the PCC’s in the synchronous modules require just one level of arbitration. Systems-on-a-chip should be designed with as many reusable components as possible. Standard modules, such as CPU cores, should be reused with little or no modification, because these modules are highly optimized for performance and sensitive to timing variation. For the systems proposed in this paper, ideally, the pausible clocking control circuit should simply replace a portion of the system clock generation unit. However, for state-of-the-art microprocessors, the system clock is produced by a phase locked loop (PLL). We cannot adjust the phase of the output of a PLL instantaneously in an analog fashion, as required in our pausible clocking control. Thus a ring oscillator should be used in place of a PLL. We then lose control of the nominal frequency. Tuning the ring oscillator frequency would require more control pins and hence is more expensive. Furthermore, the ring oscillator based clocks may be more susceptible to jitter problems. However, the ring oscillator does have an advantage that its frequency drift closely tracks logic components on the chip, e.g., if logic components slow down due to an increase in operating temperature, then so does the ring oscillator. IV. P ERFORMANCE C ONSIDERATIONS In this section, we examine performance limitations of our PCC scheme. The maximum clock frequency of the PCC is constrained by the FSM input setup and hold time requirements as follows (see Fig. 7). Assume that R is high (enabled by Rρ ) when rclk falls. This activates G+ and G1 + in sequence, which in turn enables SR ρ to toggle. We need to in-

sure that SR ρ satisfies setup and hold time requirements with respect to sysclk +. Assuming that the delays through paths and (1) and (2) are t1 = trclk − →G + + tG+ →G+ + tG+ →SR+ ρ 1 1 t2 = trclk + →sysclk + respectively, the following inequality must hold true: T T (2) − + th < t1 − t2 < − tsu 2 2 where tsu and th are the required setup and hold times of SRρ with respect to sysclk + and T is the nominal period of the ring oscillator clock with a 50% duty cycle. Therefore, the minimum clock period achievable is: Tmin = 2 max(t1 − t2 + tsu , t2 − t1 + th ).

(3)

In general, t2 >> t1 for a large clock tree; hence Tmin = 2(t2 − t1 + th ). In other words, the minimum clock period is about twice the clock tree delay. Assuming that ∆s + ∆h represents the metastability window of the ME (analogous to the setup and hold time window of a flip-flop) and ∆ represents the delay from the time R is granted an access to the ME to the time it relinquishes it, we can define “window of vulnerability” to be ∆ + ∆h (if ∆ > ∆s ). In + other words, sysclk may pause (due to A+ σ enabled by Rσ ) if the following condition is true: nT − ∆

+ + t + + tR+ →R+ < tsysclk + →Rσ+ + tR+ Aσ →R+ σ →Aσ 2 2 +trclk + →sysclk + < nT + ∆h for n ≥ 1, (4)

where ∆ = tR+ →G+ + tG+ →G+ + tG+ →R− . A similar inequal1 1 ity exists for Rρ (the same as Eq. 4, but with Rσ replaced by Aρ , Aσ by Rρ , and R2 by R1 ). Clearly, the frequency and duration of clock stretching depends on the size of this window (∆ + ∆h ) — the smaller this window, the higher the performance of the system. V. E XPERIMENTAL R ESULTS We constructed a heterogeneous ring on a chip as depicted in Fig. 8. The chip consists of two synchronous modules nicknamed “CPU” (CPU) and “Co-processor” (Coproc), an asynchronous module (Async), and two asynchronous FIFO’s (FIFO1 and FIFO2).3 Every module has two neighbors. Messages are always transmitted to one side and received from the 3 Our

implementation is a fully-decoupled FIFO described in [15].

IEEE TRANSACTIONS ON VLSI SYSTEMS

FIFO2

R3 A3

5

3.3V

Async

3.3V

4

3.3V

R4 A4 4

CPU

4 A2 R 2

A0 R0

3.3V

A1 R1

FIFO1

2

3.3V

3.3V

Coproc

2

3.3V 3.3V

Fig. 8. Test chip configuration. 3.3V 3.3V

Rin





Aout

R2A1 =00

PCC SRρ SAσ sysclk Rout

Ain

R1 R2 Coproc FSM A1 A2

Din

FF

Dout

(a)

R1=0

R1A2=01 R1A2=10

R1A2=11

R2A1 =01

A2=0 R2A1 =11

3.3V

Coproc CLK(162MHz) R2 A2 Coproc Dout[1] Coproc Dout[0] R3 A3 Async Dout[1] Async Dout[0] R4 A4 120ns

140ns

160ns

180ns

200ns

220ns

240ns

260ns

Fig. 10. Simulation trace (fCoproc = 162 MHz, fCPU = 327 MHz).

R1A2=11 A2=0 R1=0

A2=1

R2A1 =01 R1=1

(b) Fig. 11. Die photo.

Fig. 9. (a) Co-processor implementation; (b) Co-processor FSM specification.

other side. All communications are done using a four-phase return-to-zero protocol. CPU and Coproc are equipped with a PCC whose ring oscillator frequency can be set to one of 4 different frequencies ranging from 162MHz to 327MHz. CPU generates data and sends them to Coproc via FIFO1. It has an FSM that generates 2-bit data in Gray code sequence. Coproc generates 4-bit output by transforming the 2-bit input from CPU and passes them to Async. The Coproc FSM, depicted in Fig. 9(b), handles the flow control of data in a “semidecoupled” manner: i.e., it asserts an acknowledgment and a request to its input and output FIFO’s, upon receiving an input request and latching input data, but does not complete handshaking on the input side until its output FIFO acknowledges the receipt of data. Async is composed of an AFSM and a data processing unit. This AFSM controls the flow of data in a similar fashion as the FSM of Coproc but asynchronously. The data processing unit extracts two copies of the original 2-bit data and passes them back to CPU via FIFO2. FIFO1 and FIFO2 (both with depth of two) are included for a generic reason of “smoothening” the bursty data transfer between two modules operating at different clock rates and “decoupling” the operations of the modules. We performed extensive SPICE simulations of the chip after backannotating the layout parasitics into the schematic, using Mentor Graphics Accusim. The timing trace in Fig. 10 shows a simulation result including handshake and data signals. CPU and Coproc are programmed to run at 327MHz and 162MHz respectively in this case. The frequencies of R2 and A2 (handshake signals between Coproc and Async) are one fourth that of Coproc clock, because it takes one clock cycle to sample A2 and another clock cycle to toggle R2 . R2 and A2 have about 50%

duty cycle because they are “synchronized” to Coproc clock; however, R3 and A3 have much narrower pulses because FIFO2 (output side of Async) is never full and thus responds quickly. The chip (die photo shown in Fig. 11) was fabricated in 0.5µm HP CMOS14TB triple-metal process through MOSIS. Our test results show that the fabricated chips function correctly, but test results are found to be much faster than the simulated results (with a scale factor of 1.4). We believe that this is due, in part, to the conservative estimation of parasitics by the Mentor tool. Test results clearly indicate that the clocks do become stretched. Although we were not able to observe high frequency clock signals (due to the pad design limitations),4 we could clearly observe several instances of clock pausing when the CPU module was programmed to operate at 162MHz (simulation speed). An oscilloscope trace is shown in Fig. 12. Signals labeled CLK and REQ are sysclk and Rρ of the CPU module observed at the pins of the fabricated chip. CLK is paused if REQ toggles within the window of vulnerability of CLK+, which was calculated to be between −3.43ns and −2.43ns from the rising edge of CLK (with a scale factor of 1.4 taken into account). However, the time difference between CLK and REQ observed on the oscilloscope as shown in Fig. 12 is outside the calculated window. In fact, the instance captured in Fig. 12 shows that CLK pauses even when REQ rises 3.85ns before CLK rises. The source of this discrepancy is unclear. We conjecture that not all delays scale the same. Test measurements were made with HP 54720A Oscilloscope with two 4 GSa/s plug-in cards. Table I compares the nominal frequencies (without clock 4 We used pads readily available from MOSIS which could not support the high speed clock signals.

IEEE TRANSACTIONS ON VLSI SYSTEMS

Test Operating CPU (456MHz) (456MHz) 236MHz 236MHz

Coproc (456MHz) 236MHz (456MHz) 236MHz

6

Simulation Nominal Operating (without pause) (with pause) CPU Coproc CPU Coproc 327MHz 327MHz 290MHz 290MHz 327MHz 162MHz 290MHz 150MHz 162MHz 327MHz 150MHz 290MHz 162MHz 162MHz 150MHz 150MHz

No. of Handshake Input Transitions Per 1000ns 148 152 128 124

Fraction of Clock Cycle Stretched Per H/S Transition CPU Coproc 0.25 0.25 0.24 0.08 0.09 0.29 0.10 0.10

TABLE I C OMPARISON OF NOMINAL AND OPERATING FREQUENCIES (3.3V, 27◦ C, N68MAR RUN ).

VI. C ONCLUSION

Fig. 12. Oscilloscope trace illustrating clock stretching (sysclk frequency = 236 MHz tested at 3.3V and 22◦ C).

pausing) and the operating frequencies obtained from simulation and test. The number shown in parentheses (456MHz) was obtained by extrapolation, not from the actual test. This is due to the limitation imposed by I/O pads. However, all the data and handshake signals, measurable because of their low frequencies, have been tested to be correct, even when the clock is set to the maximum frequency. Thus we concluded that the clock internal to the chip does not result in any synchronization failures. We extrapolated the top clock speed from the simulation and lowerspeed clock signals measured from the test. The impact of clock pausing on performance can be deduced from Table I. For example, when the nominal frequencies of CPU and Coproc are programmed to be 327MHz and 162MHz, the actual operating frequencies are 290MHz and 150MHz respectively (according to simulation). Therefore, CPU clock is stretched 37 (nominal) clock cycles per 1000ns and Coproc clock 12 cycles per 1000ns. However, clock stretching potentially occurs only when handshake inputs toggle. We measured 152 handshake input transitions per 1000ns for this case. Thus, in this case, the fraction of clock cycle stretched per handshake 37 = 0.24. As the frequency is input transition for CPU is 152 lowered, the number of clock cycles stretched becomes smaller. This is because the “window of vulnerability” as a fraction of a clock period becomes smaller. Clearly, we are trading off system performance for communication reliability and latency. In fact, the degradation in CPU throughput is about 12% in this configuration. However, we assert that we can gain back this loss and possibly more by operating the system at a higher speed, i.e., by tuning the ring oscillator to a higher frequency. Systems with a ring oscillator clock do not require as much margin for the worst-case design as the ones with a fixed reference clock, because the operating condition changes affect both the logic and ring oscillators similarly.

We presented a new communication strategy, which is based on the pausible clocking scheme, for multiple synchronous modules operating independently. We described the design and implementation of the pausible clock unit in detail. In order to prove its feasibility, we implemented a 0.5µm CMOS chip, mimicking a heterogeneous ring. Our test results show that the chip functions correctly as simulated, up to the top clock speed of 456MHz. We suggested possible system configurations (heterogeneous rings and bus based systems) and practical limitations. We also determined analytical performance bounds and analyzed the effects of clock stretching on overall system performance. Acknowledgment The authors would like to thank Ryan Donohue, Reza Jalilizeinali, and LiJen Ko for their chip design effort. R EFERENCES [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]

M. J. Stucki and J. R. Cox Jr., “Synchronization strategies,” in Proceedings of the First Caltech Conference on Very Large Scale Integration, C. L. Seitz, Ed., 1979, pp. 375–393. C. L. Seitz, “System timing,” in Introduction to VLSI Systems, C. A. Mead and L. A. Conway, Eds., chapter 7. Addison-Wesley, 1980. D. M. Chapiro, Globally-Asynchronous Locally-Synchronous Systems, Ph.D. thesis, Stanford University, Oct. 1984. F. U. Rosenberger, C. E. Molnar, T. J. Chaney, and T. Fang, “Q-modules: Internally clocked delay-insensitive modules,” IEEE Transactions on Computers, vol. C-37, no. 9, pp. 1005–1018, Sept. 1988. J. N. Seizovic, “Pipeline synchronization,” in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, Nov. 1994, pp. 87–96. L. F. G. Sarmenta, G. A. Pratt, and S. A. Ward, “Rational clocking,” in Proc. International Conf. Computer Design (ICCD), Oct. 1995, pp. 271– 278. M. R. Greenstreet, “Implementing a STARI chip,” in Proc. International Conf. Computer Design (ICCD), 1995, pp. 38–43. C. L. Seitz, “Ideas about arbiters,” Lambda, vol. 1, no. 1, First Quarter, pp. 10–14, 1980. Steven M. Nowick, Automatic Synthesis of Burst-Mode Asynchronous Controllers, Ph.D. thesis, Stanford University, Department of Computer Science, 1993. K. Y. Yun and D. L. Dill, “Automatic synthesis of extended burst-mode circuits: part II (automatic synthesis),” IEEE Transactions on ComputerAided Design, vol. 18, no. 2, pp. 118–132, Feb. 1999. A. J. Martin, “On Seitz’s arbiter,” Tech. Rep. 5212:TR:86, Caltech Computer Science, 1986. T. Sakurai, “Optimization of CMOS arbiter and synchronizer circuits with submicron MOSFETs,” IEEE Journal of Solid-State Circuits, vol. 23, no. 4, pp. 901–906, Aug. 1988. M. B. Josephs and J. T. Yantchev, “CMOS design of the tree arbiter element,” IEEE Transactions on VLSI Systems, vol. 4, no. 4, pp. 472–476, Dec. 1996. IEEE Standard 1596-1992, “Scalable coherent interface (SCI),” 1992. S. B. Furber and P. Day, “Four-phase micropipeline latch control circuits,” IEEE Transactions on VLSI Systems, vol. 4, no. 2, pp. 247–253, June 1996.