A New FIFO Design Enabling Fully-Synchronous On-Chip Data Communication Network

Muhammad E. S. Elrabaa
Computer Engineering Department
King Fahd University for Petroleum and Minerals
Dhahran, Saudi Arabia
[email protected]

Abstract: A new FIFO design that enables fully synchronous circuits with unrelated clocks to communicate synchronously is proposed. Not only would every circuit be running on its own clock, but the interconnection network is also fully synchronous and runs at an unrelated clock of its own. With a relatively low gate count, the proposed FIFO allows communicating circuits to put/get data at their respective frequencies (1 datum/clock cycle) until it fills, after which the rates converge to the lower frequency. The maximum initial latency is 3 cycles of the consumer's clock. Several variants of the proposed FIFO have been developed for different design cases, including data width mismatch between producer and consumer. The operation of the different FIFOs has been verified using gate-level simulations for several ratios of clock frequencies. An 8-cell FIFO has been designed at the transistor level, and SPICE simulations using a 0.13 μm, 1.2 V technology have been carried out. They show proper operation at producer and consumer clock frequencies of 2 GHz and 3.125 GHz, respectively, with a data transfer rate of more than 2 Giga-datum/second and an average power of 721 μW.

Key Words: System-on-Chip, Data Synchronization, Network-on-Chip, GALS

I. INTRODUCTION

Currently, Systems-on-Chip (SoCs) are constructed using a wide range of pre-designed intellectual property modules (IPs) that are integrated together with a communication medium (typically a system bus). Each IP may have a different clock and different communication needs. This, coupled with the ever-increasing demand for shorter time-to-market, necessitates developing efficient design flows that can achieve timing closure of the whole SoC in a short time while satisfying the communication needs of its various components. Many bus-based SoC design methodologies have been proposed in the literature [1-6]. Several SoC bus standards have been developed, requiring either asynchronous or synchronous IP interfaces [7]. Due to the limitations of buses, a new interconnection paradigm has recently been proposed: Networks-on-Chip (NoCs) [8-10]. NoCs are being explored as scalable interconnect architectures that can route data between SoC IPs over shared interconnects. Also, due to the difficulty of globally synchronizing SoC components, another interconnection scheme has recently emerged: Globally Asynchronous Locally Synchronous (GALS) systems [11-13].

In GALS systems, synchronous blocks with separate clock domains are connected using asynchronous interconnects. GALS systems are categorized into three types based on their communication schemes [13]: pausible clock, asynchronous, and loosely synchronous.

Pausible clock systems stop (or pause) the clock of the IP block during data transfer. With each additional input/output channel to an IP block, the percentage of idle time increases. This goes against the fundamental concept of decoupling 'computations' from 'communications', rendering this design style impractical. A comprehensive evaluation of existing pausible clocking schemes revealed that these techniques are not well suited for interfacing large high-speed IP cores in SoCs [14].

Fully asynchronous interconnects can adapt to a wide range of temperature, process and voltage variations, as well as varying data rates. As such, they stand to offer the highest degree of robustness and decoupling of different SoC design activities. The data transfer rates and latencies of these systems, however, are limited by the required handshaking. It was shown in [15] that the fastest asynchronous repeaters can at best only match the speed of synchronous repeaters.

Loosely synchronous techniques with dedicated point-to-point connections require some form of FIFO buffer between the transmitter and receiver to move data across their clock domains. Communication throughput and latency depend on the design of the FIFO, the transmitter/receiver clock rates and the communication patterns. A major problem with using a conventional (textbook-style) FIFO is the need to exchange read/write pointers between the write/read controllers of the FIFO (which operate at different clock rates). This is required for detecting empty/full conditions.
Using a simple asynchronous FIFO would take at least three clock cycles of the slower of the two clocks to transfer a datum due to handshaking and synchronization between the two domains [13, 19]. Recently, several new FIFO designs have been proposed to facilitate data transfer between two different clock domains [16-20]. A self-timed FIFO for transferring data between two clock domains with arbitrary frequencies was proposed in [16]. To overcome the synchronization penalty on throughput, this FIFO implements training circuitry to estimate the frequency difference between the two domains before data transfer can begin. From that point on it requires that the clocks remain stable, and synchronization is carried out only for transfers considered high-risk. The circuit structure depends on which clock domain has the higher rate.

In [17] a FIFO with a maximum throughput of one datum per clock cycle (of the slower of the two clocks) was proposed. Both data and synchronization were pipelined alongside one another. This simple approach of implementing the FIFO as a pipeline greatly reduced the probability of failure due to metastability and eliminated the need for detecting full/empty conditions. However, it increased the latency of the interface, since the pipeline has to be filled before data can come out of it. It also imposed the constraint that the sender and receiver had to operate at the same data rate.

A better approach for data transfer between different clock domains, based on a general FIFO, was proposed in [18]. It allows the sender and receiver to put (or send) and get (or receive) data at their own clock rates simultaneously. In addition to the elaborate circuitry needed for detecting empty/full FIFO conditions, more circuits were added to detect when the FIFO is nearly full or nearly empty. These signals are necessary to maintain the data transfer rates while synchronizing the conventional empty/full signals. A point-to-point bidirectional link based on an asynchronous FIFO was proposed in [19], requiring a minimum of 3 clock cycles (of the slower of the two clocks) to transfer a datum.

A FIFO based on dual-port SRAM was proposed in [20]. Two address pointers are used to point to the beginning and end of the data in the FIFO. These pointers are Gray coded so that they can be conveyed from one clock domain to the other through synchronization. Configurable logic is used to reserve space in the FIFO to compensate for the synchronization latency incurred in exchanging the address pointers between the two sides. Configurable delay blocks are also used to control the skew of data and control signals on both sides of the FIFO and to reserve space in the FIFO. While this implementation is well suited for large buffers, it has a complex design and significant latency.

In this work a new GALS paradigm is proposed: polysynchronous systems. While each IP runs at its own clock, the interconnection medium itself is synchronous with its own separate clock. Each IP exchanges data using its own clock through the communication medium. This medium could be simple point-to-point interconnections or a full synchronous NoC. The proposed scheme stands to have several advantages: IPs are still designed as fully synchronous entities with simple synchronous ports, the communication medium itself is simply designed as a synchronous IP, computations are decoupled from communications, and maximum data throughputs can be achieved through synchronous pipelining of the interconnection medium.

Figure 1 below illustrates the architecture of the proposed polysynchronous system. A key component in this system is the data transfer interface (DTI) FIFO. Although the FIFO is shown in the figure as bidirectional, it is actually made of two FIFOs, one for transferring the data in each direction. In this work a novel FIFO design was developed for use in the proposed polysynchronous scheme. It allows independent data writing and reading at different and unrelated rates, with a maximum possible transfer throughput of one datum per cycle of the slower clock and minimum latency. The new DTI FIFO design, described in section II, is simpler than other FIFOs with similar capabilities (e.g., [16, 18, 20]). Simulation results that verify its operation are presented in section III, followed by conclusions in section IV.

Figure 1. The proposed Polysynchronous GALS system; a fully synchronous interconnect medium running at its own independent clock (CLKI) connects several IPs each with its own arbitrary local clock through the DTI FIFOs. (Block diagram: IP1 to IP4, clocked by CLK1 to CLK4, each attached to the Interconnect through a DTI FIFO.)
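As a brief aside on the pointer exchange used by conventional dual-clock FIFOs such as the one in [20], the following Python sketch shows the property that makes Gray-coded pointers suitable for synchronization: consecutive pointer values differ in exactly one bit. The function names and the 3-bit pointer width are illustrative assumptions and are not taken from [20].

```python
def to_gray(n: int) -> int:
    """Convert a binary-coded pointer value to its Gray-code equivalent."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Convert a Gray-coded value back to plain binary."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


WIDTH = 3  # assumed pointer width (8 FIFO locations), for illustration only
codes = [to_gray(i) for i in range(1 << WIDTH)]

# Every increment, including the wrap-around, flips exactly one bit.
single_bit_steps = all(
    bits_changed(codes[i], codes[(i + 1) % len(codes)]) == 1
    for i in range(len(codes))
)
```

Because only one bit changes per increment, a pointer sampled in the other clock domain during a transition resolves to either the previous or the next value rather than to an unrelated one.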

II. DTI FIFO DESIGN

The basic concept behind the new DTI FIFO is to use a simple 2-stage asynchronous pipeline in each FIFO cell (or stage). Data enters each cell (or pipeline) from one clock domain and leaves to the other. Synchronization and empty/full detection are taken care of within each cell without having to transfer pointers between the two clock domains. Synchronization latency is hidden by overlapping data transfers within several cells, thus allowing maximum PUT/GET rates. The design of the asynchronous pipeline that makes up a cell is introduced first, then the construction of the DTI FIFO is presented.

A. The Asynchronous Pipeline

Figure 2 below illustrates the basic structure of the asynchronous pipeline circuit and the signaling protocol required to transfer data between a producer and a consumer. Data transfer is illustrated for equal producer and consumer clock frequencies (CLKP and CLKC, respectively). This is the worst-case condition in terms of transfer latency. Two data latches are utilized, one on the producer side and another on the consumer side (ENP and ENC are the enable signals of these latches). A four-phase signaling protocol is used to simplify the circuit design. A conventional conservative synchronizer made of two flip-flops is utilized to minimize the probability of failure due to metastability [22].

A producer initiates a data transfer by setting up the data and raising the PUT signal. The producer-side and consumer-side controllers then take over, completing the transfer in 8 cycles using the signaling protocol of Fig. 2(b). At the end of the transfer, OK_to_TAKE is set high to indicate that a datum is ready for the consumer. This signal is reset when the consumer removes the data by setting TAKE high. If either the producer or the consumer has a higher clock frequency, the transfer takes fewer than 8 cycles (the minimum is four).
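The four-phase exchange described above can be sketched as a small rule-driven model in Python. It is an untimed abstraction: the two-flip-flop synchronizers are reduced to plain copies of ReqOut into PUTReq and of TAKEACK into PUTACK, so the sketch shows only the order of the signal transitions and makes no statement about cycle counts. The rule set, and the choice to raise OK_to_TAKE at the moment TAKEACK falls, are assumptions made for illustration.

```python
def four_phase_trace():
    """Return (trace, final_state) for one datum moving through a cell."""
    s = {"PUT": 1, "OK_to_PUT": 1, "ReqOut": 0, "PUTReq": 0,
         "TAKEACK": 0, "PUTACK": 0, "OK_to_TAKE": 0}
    # Each rule is (guard, updates); the first update names the traced event.
    rules = [
        # Producer latches the datum (ENP) and raises the request.
        (lambda: s["PUT"] and s["OK_to_PUT"] and not s["PUTACK"],
         lambda: {"ReqOut": 1, "OK_to_PUT": 0, "PUT": 0}),
        # Synchronizer (abstracted): request reaches the consumer domain.
        (lambda: s["PUTReq"] != s["ReqOut"],
         lambda: {"PUTReq": s["ReqOut"]}),
        # Consumer latches the datum (ENC) and acknowledges.
        (lambda: s["PUTReq"] and not s["TAKEACK"] and not s["OK_to_TAKE"],
         lambda: {"TAKEACK": 1}),
        # Synchronizer (abstracted): acknowledge reaches the producer domain.
        (lambda: s["PUTACK"] != s["TAKEACK"],
         lambda: {"PUTACK": s["TAKEACK"]}),
        # Producer withdraws the request once it sees the acknowledge.
        (lambda: s["ReqOut"] and s["PUTACK"],
         lambda: {"ReqOut": 0}),
        # Consumer withdraws the acknowledge; the datum is now available.
        (lambda: s["TAKEACK"] and not s["PUTReq"],
         lambda: {"TAKEACK": 0, "OK_to_TAKE": 1}),
        # Handshake has returned to zero: the cell can accept a new PUT.
        (lambda: not (s["ReqOut"] or s["PUTACK"] or s["OK_to_PUT"]),
         lambda: {"OK_to_PUT": 1}),
    ]
    trace = []
    while True:
        for guard, updates in rules:
            if guard():
                changed = updates()
                s.update(changed)
                name, value = next(iter(changed.items()))
                trace.append(f"{name}={value}")
                break
        else:
            # No rule can fire any more: the transfer is complete.
            return trace, s
```

Calling `four_phase_trace()` yields the request and acknowledge transitions in return-to-zero order (request up, acknowledge up, request down, acknowledge down), each followed by its synchronized copy in the other domain.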

The designs of the producer-side and consumer-side controllers are shown in Figure 3. Each controller is a simple two-state FSM implemented with a single FF and simple logic. If the producer asserts the PUT signal while PUTACK is low, the producer-side controller sets both ENP and ReqOut high (i.e., latches in the data) and transitions to state S1. This transition resets the OK_to_PUT signal. The ReqOut signal is kept high until the PUTACK signal goes high (indicating that the consumer has received the data). The OK_to_PUT signal is then set when the PUTACK signal goes back low. On the consumer side, if the controller receives a put request (high PUTReq) while OK_to_TAKE is low, it asserts both ENC and TAKEACK. TAKEACK remains high until the producer responds by lowering the PUTReq signal. After the data is latched, the OK_to_TAKE signal is set high. The consumer consumes the data by asserting the TAKE signal, which in turn resets the OK_to_TAKE signal. SR latches are used to produce the OK_to_PUT and OK_to_TAKE signals, which indicate the state of the asynchronous pipeline (empty/full).

Using two latches per cell, instead of a single latch or FF as in most FIFOs, greatly simplifies the design by decoupling the PUT and GET operations. It also provides a two-stage pipeline per FIFO stage, reducing the impact of the clock frequency difference on the PUT/GET rates. The number of transfer cycles could have been reduced by overlapping data transfers, but this would have complicated the control circuitry and made it slower to operate.

Figure 2. The asynchronous pipeline representing the basic cell of the DTI FIFO. (a) Block diagram of the asynchronous pipeline. (b) The signaling protocol of the asynchronous pipeline (signals shown: CLKP, OK_to_PUT, Di, ENP, PUT, CLKC, PUTReq, ENC, TAKEACK, OK_to_TAKE, TAKE).

B. DTI FIFO Construction

Figure 4 shows the block diagram of an n-stage FIFO constructed from the basic asynchronous pipeline described above. Input data lines (DIN) are connected to the inputs of all stages on the producer side. Two counters are used as pointers to the tail of the PUT queue and the head of the TAKE queue. The OK_to_PUT signal of the stage selected by the PUT pointer is routed to the producer through a MUX. Similarly, the PUT request signal from the producer is routed to the same stage through a MUX. For large FIFOs the counter-MUX combination could be replaced by a ring-connected shift register (that contains one token pointing to the tail of the PUT queue) and tri-state buffers. When the producer issues a PUT request while the OK_to_PUT signal of the current stage is high, an internal put signal (PUTi) is generated and routed to this stage, and the PUT pointer is incremented. An SR latch is used to keep the OK_to_PUT signal of the selected stage. It resets after one clock cycle so that the internal PUT signal does not evaporate before the end of the cycle, allowing the producer to put a datum every cycle (as long as the FIFO is not full). On the consumer side, a datum is removed from the stage selected by the TAKE pointer when a TAKE request is received while the corresponding OK_to_TAKE signal is high, and the TAKE pointer is incremented.

As explained earlier, depending on the two clock frequencies, it can take up to 8 cycles to complete a datum transfer within a single stage. Hence, using 8 stages allows data transfer at one datum per clock cycle for any producer/consumer clock ratio by overlapping the transfers within all stages.
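The argument for using 8 stages can be checked with a short Python sketch. It relies on a deliberately simplified model that is not part of the original design description: a stage that accepts a PUT keeps its OK_to_PUT low for a fixed number of producer cycles (the hypothetical parameter `transfer_cycles`), the consumer always drains data in time, and the PUT pointer advances round-robin.

```python
def sustained_puts(n_stages: int, transfer_cycles: int, n_cycles: int) -> int:
    """Count PUTs accepted when the producer attempts a PUT on every cycle."""
    # busy_until[i] is the first cycle at which stage i shows OK_to_PUT high again.
    busy_until = [0] * n_stages
    put_ptr = 0      # PUT pointer: index of the stage at the tail of the queue
    accepted = 0
    for cycle in range(n_cycles):
        if busy_until[put_ptr] <= cycle:             # selected stage is free
            busy_until[put_ptr] = cycle + transfer_cycles
            put_ptr = (put_ptr + 1) % n_stages       # advance round-robin
            accepted += 1
        # Otherwise the producer stalls this cycle and retries on the next one.
    return accepted


full_rate = sustained_puts(n_stages=8, transfer_cycles=8, n_cycles=80)
half_rate = sustained_puts(n_stages=4, transfer_cycles=8, n_cycles=80)
```

Under these assumptions, 8 stages with an 8-cycle transfer accept a PUT on all 80 cycles, while 4 stages accept only 40, because the pointer returns to a stage before its transfer has completed.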

Figure 3. Design of the asynchronous pipeline controllers. (a) The design of the producer-side (producer) controller. (b) The design of the consumer-side (consumer) controller. (Each controller is a two-state FSM with states S0: A=0 and S1: A=1, built from one FF and an SR latch.)

Figure 4. Block diagram of an n stage FIFO constructed from the asynchronous pipeline.

III. SIMULATION RESULTS

A. Gate-Level Simulations

Figure 5 shows gate-level simulation results for an 8-stage FIFO for three producer/consumer clock frequency ratios: 1:1, 1:2.5 and 2.5:1. These ratios were selected for convenience, but the FIFO would operate for any frequency ratio, as illustrated by the SPICE simulations. In the simulation setup a new datum is put into the FIFO whenever the OK_to_PUT signal is high, and a datum is taken out whenever OK_to_TAKE is high. Results show that for equal frequencies, both producer and consumer are able to put/get a datum per clock cycle. When the consumer's clock frequency is 2.5X the producer's, the producer is able to put data every cycle, but the data removal rate by the consumer is automatically reduced to 1/2.5 of the consumer clock frequency. When the producer's clock frequency is 2.5X the consumer's, the producer is initially able to put data at the maximum rate while the FIFO is empty. The rate gradually goes down until it reaches 1/2.5 of the producer's clock rate. This gradual reduction is due to the inherent pipelining within the cells and the fact that, for this clock ratio, it takes 4 consumer clock cycles to transfer data between the producer-side and consumer-side latches. Since the FIFO size is 8, there is enough time for several stages to complete their data transfers.

B. Comparison with similar FIFOs

Table 1 shows a comparison between the new FIFO and published FIFOs with similar capabilities, based on 8-cell FIFOs with 8-bit data width. Though the FIFO in [17] has the lowest gate count, its latency is directly proportional to the FIFO size. This is not the case for the other FIFOs. The new FIFO achieves a performance comparable to the best reported FIFO with a significantly lower gate count and a simpler design.

C. SPICE simulations

Transistor-level simulations of the DTI FIFO have been carried out using a 0.13 μm, 1.2 V CMOS technology. Transistors were sized to achieve the maximum possible consumer frequency at this technology node (3.125 GHz). The producer clock was set to 2 GHz. Figure 6 below shows all the waveforms during data transfer through one of the stages. The measured average power for a single stage was ~90 μW.
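The steady-state behaviour reported for the three gate-level clock ratios can be summarized with a little arithmetic, sketched below in Python. The sketch assumes only the rule stated in the text, namely that the sustained transfer rate equals the lower of the two clock frequencies; the function name and the normalized frequencies are illustrative and do not come from the simulations themselves.

```python
from fractions import Fraction


def steady_state(f_producer: Fraction, f_consumer: Fraction):
    """Return (rate, producer_duty, consumer_duty) once the rates have converged.

    rate is in data per unit time; each duty is the fraction of that side's
    clock cycles in which it actually puts or takes a datum.
    """
    rate = min(f_producer, f_consumer)
    return rate, rate / f_producer, rate / f_consumer


# Frequencies normalized to the slower clock in each case.
equal_clocks = steady_state(Fraction(1), Fraction(1))
fast_consumer = steady_state(Fraction(1), Fraction(5, 2))   # consumer at 2.5X
fast_producer = steady_state(Fraction(5, 2), Fraction(1))   # producer at 2.5X
```

In the 2.5X cases the faster side is active on 2/5 of its cycles, which is the 1/2.5 reduction described for the gate-level runs.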

TABLE 1. COMPARISON BETWEEN DIFFERENT FIFOS BASED ON 8 CELLS/FIFO AND 8-BIT DATA.

FIFO         # of Gates*    Initial Latency    Max. Throughput
This work    1,250          1~2 cycles         1 datum per slower clock cycle
[17]         ~300           8 cycles           1 datum per slower clock cycle
[18]         ~1,500         1~2 cycles         1 datum per slower clock cycle

* Estimated by the author based on logic diagrams in published work

Figure 5. Simulation results of the 8-stage FIFO with three Producer/Consumer clock frequency ratios. (a) Equal clock frequencies. (b) Consumer's clock frequency is 2.5X the producer's. (c) Producer's clock frequency is 2.5X the consumer's. (Signals shown: CLKP, PUT0 to PUT7, CLKC, OK_to_TAKE0 to OK_to_TAKE7, TAKE0 to TAKE7.)

IV. CONCLUSIONS

A new interface circuit that can transfer data efficiently between two unrelated clock domains has been developed. With a relatively low gate count, it allows fully synchronous data communication between the two domains at the maximum rate of 1 datum per cycle of the lower of the two frequencies, regardless of the frequency ratio between the two domains. The correct operation of this circuit was verified with both gate-level and transistor-level (SPICE) simulations.

Figure 6. Simulation waveforms of one of the FIFO stages for ... (Axes: V (V) versus Time (nS). Traces: CLKP, DIN, DOUT, PUTi, PUT, ReqOut, OK_to_PUT, ENP, PUTReq, PUTACK, ENC, OK_to_TAKE, TAKEACK, TAKE, CLKC.)

REFERENCES

[1] P. Coussy, A. Baganne and E. Martin, A design methodology for integrating IP into SOC systems, in Proc. Custom Integrated Circuits Conference (CICC), (2002) 307–310.

[2] M. Bocchi, C. Brunelli, C. De Bartolomeis, L. Magagni and F. Campi, A system level IP integration methodology for fast SOC design, in Proc. Int. Sym. On System-on-Chip (ISSOC'03), (2003) 127–130.

[3] F. Abbes, E. Casseau, M. Abid, P. Coussy and J. B. Legoff, IP integration methodology for SoC design, in Proc. 16th Int. Conf. Microelectronics (ICM2004), (2004) 343–346.

[4] J. AD.'O Filho, M. E. de Lima, P. R. Maciel, J. Moura and B. Celso, A fast IP-core integration methodology for SoC design, in Proc. 16th Sym. On Integrated Circuits and Systems Design (SBCCI2003), (2003) 131–136.

[5] A. P. Niranjan and P. Wiscombe, Islands of synchronicity, a design methodology for SoC design, in Proc. Design, Automation and Test in Europe Conference and Exhibition (DATE'04), (2004) 64–69 Vol.3.

[6] J. Wu, J. Williams and N. Bergmann, System Level Design Methodology for Hybrid Multi-Processor SoC on FPGA, in Proc. 16th Int. Sym. On Field-Programmable Custom Computing Machines (FCCM2008), (2008) 312–313.

[7] E. Salminen, V. Lahtinen, K. Kuusilinna, and T. Hamalainen, Overview of bus-based system-on-chip interconnections, in Proc. IEEE Int. Sym. On Circuits and Systems (ISCAS'02), (2002) 372–375 Vol.2.

[8] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Oberg, M. Millberg, and D. Lindqvist, Network on chip: an architecture for billion transistor era, in Proc. IEEE NorChip Conf., 2000.

[9] W. J. Dally and B. Towles, Route packets, not wires: On chip interconnection networks, in Proc. 38th Design Automation Conf., (2001) 684–689.

[10] J. Henkel, W. Wolf, and S. Chakradhar, On-chip networks: a scalable, communication-centric embedded system design paradigm, in Proc. 17th Int. Conf. VLSI Design, (2004) 845–851.

[11] D. M. Chapiro, Globally-Asynchronous Locally-Synchronous Systems, PhD thesis, Stanford University, October 1984.

[12] S. Moore, G. Taylor, R. Mullins, and P. Robinson, Point to Point GALS Interconnect, in Proc. 8th Int. Symp. Async. Cir. & Sys. (ASYNC'02), (2002) 69–75.

[13] P. Teehan, M. Greenstreet and G. Lemieux, A Survey and taxonomy of GALS design styles, IEEE Design & Test of Computers, Vol.24-5, (2007) 418–428.

[14] S. Dasgupta, and A. Yakovlev, Comparative analysis of GALS clocking schemes, IET J. Computers & Digital Techniques, Vol.1, No.2, (2007) 59–69.

[15] Muhammad E. S. Elrabaa, Robust Two-Phase RZ Asynchronous SoC Interconnects, To Appear in IEEE Trans. On VLSI Systems.

[16] A. Chakraborty and M. R. Greenstreet, Efficient self-timed interfaces for crossing clock domains, in Proc. Int. Symp. Asynch. Circuits Syst. (ASYNC'03), (2003) 78–88.

[17] J. Seizovic, Pipeline Synchronization, in Proc. Int. Symp. On Advanced Research in Asynchronous Circuits and Systems (ASYNC 94), (1994) 87–96.

[18] T. Chelcea and S. Nowick, Robust Interfaces for Mixed Timing Systems, IEEE Trans. On Very Large Scale Integration (VLSI) Systems, Vol. 12-8, (2004) 857–873.

[19] A. Chattopadhyay and Z. Zilic, GALDS: A Complete Framework for Designing Multiclock ASICs and SoCs, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 13, No. 6, (2005) 641–654.

[20] R. W. Apperson, Z. Yu, M. J. Meeuwsen, T. Mohsenin, and B. M. Baas, A Scalable Dual-Clock FIFO for Data Transfers Between Arbitrary and Haltable Clock Domains, IEEE Trans. On Very Large Scale Integration (VLSI) Systems, Vol. 15, No. 10, (2007) 1125–1134.

[21] Stephen B. Furber and Paul Day, Four-phase micropipeline latch control circuits, IEEE Transactions on VLSI Systems, Vol. 4, No. 3, (1996) 247–253.

[22] R. Ginosar, Fourteen ways to fool your synchronizer, in Proc. Of 9th Int. Symp. On Asynchronous Circuits and Systems (ASYNC'03), (2003) 1–8.
