Rapid single flux quantum fast-packet switching element

Dmitry Y. Zinoviev
State University of New York, Department of Physics
Stony Brook, New York 11794-3800

ABSTRACT

We present the design of a Rapid Single-Flux Quantum (RSFQ) $N \times N$ fast-packet TDM switching element that can be used in ATM packet switches. Using simple 3.5 µm niobium-trilayer technology, this device would allow an external exchange rate of $f_e = 10$ Gb/s per switched channel at an internal clock frequency of $f_0 = 40$ GHz. The element structure implies that a separate unit provides the address of the destination channel for each packet and resolves packet contentions.

Keywords: superconductivity, RSFQ, ATM, fast-packet switch.

1. INTRODUCTION

Superconducting electronics based on the Josephson effect provides powerful means for ultra-high-performance handling of digital data. Presently the development of this field is focused on the so-called Rapid Single-Flux Quantum devices [1]. In RSFQ, data are represented as the presence or absence of a magnetic flux quantum $\Phi_0 \approx 2.07 \times 10^{-15}$ Wb between two successive clock pulses, and superconductor quantum interferometers are used for data storage and processing. RSFQ logical units can transfer and process data at rates up to hundreds of Gb/s. Because the main difference between an ATM switch and a packet switch used in computer networks is speed [2], it is natural to use RSFQ devices in ATM switching. A number of attempts have been made to implement telecommunication equipment in various superconducting technologies [3-5]. Yet this seems to be the first purely RSFQ architecture of a bus-type fast-packet switching element with time-division multiplexing (TDM) [6,7].

2. SWITCHING ELEMENT OPERATION OVERVIEW

Figure 1. Overall structure of an $N \times N$ switching element. Blocks: $N$ input channels, input buffers, bit concentrator, serial-to-parallel converter, frequency multiplier, $N$ output channels. Signals: clock in, clock out, "slow" clock, "fast" clock, data, address.

In our TDM switch, the bit concentrator picks up one bit at a time from one of the $N$ input buffers in the required order and places it into the corresponding slot of the shared communication bus, a serial-to-parallel converter (see Figure 1). As soon as all (at most $N$) bits from the incoming channels have been pushed to the converter, its contents are read out in parallel into the output channels. The throughput of the converter defines the overall performance of the switch.

The switch operation is controlled by three clock signals. The external, or "slow", clock runs at frequency $f_e$ and determines the packet arrival rate in each channel. The same clock signal, phase-shifted by $2\pi (\log_2 N)/N$, controls the output serial-to-parallel interface. The internal shared bus operates at the higher frequency $f_0 = N \times f_e$. The corresponding internal, or "fast", clock is produced by the frequency multiplier (see section 3.4).

Figure 2. Sample operating sequence of a $4 \times 4$ switching element. Traces: "fast" clock, "slow" clock, address, load, input buffer, pick-up from inputs 1 to 4, output buffers 1 to 4. Here "address" is the address at any serial line and "load" is the signal at any "load" line.

The following example (Figure 2) illustrates the time sequence of events in a $4 \times 4$ switching element. For the sake of simplicity we assume that the packets are head-aligned, i.e. their first bits arrive in the same clock period. This need not be the case in real operation, where a newly arrived packet can immediately be routed to the destination channel. One can see that each external "slow" clock pulse yields $N$ internal "fast" clock pulses. The first $\log_2 N$ "slow" time slots are used to accept the physical address bits, followed by the "load" signal. During the next "slow" clock periods the incoming packets are read bit by bit, so that within each "fast" clock period only one bit (under the contention-free assumption) is taken from an input buffer to the concentrating tree and pushed to the FIFO-like serial-to-parallel converter. Lastly, the $(N + 1)$th "slow" clock flushes the contents of the converter to the outgoing channels.

The switch timing is organized in a counterflow pipelined manner: the clock and the data always run in opposite directions. It takes at most $(\log_2 N + N)$ "fast" clock periods to advance a bit from an input buffer to the destination slot in the output converter, so the read-out signal should be delivered to the serial-to-parallel converter at the $(\log_2 N)$th "fast" clock period rather than immediately. This delay affects only the latency of the switch, not its throughput.

The proposed switching element requires an external unit, the ATCR (address translator and contention resolver, see Figure 3a), which should extract the VCI/VPI or its equivalent from the packet (cell) header, translate it, and provide the translated physical address for each channel via a separate line. The physical address has a predetermined length of $L = \log_2 N$ bits, yet for the sake of simplicity a signal on an additional line (the "load" strobe) indicates the end of the address (see also section 3.1).
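The clock relations described above can be tabulated with a few lines of Python. This is only an illustrative sketch: the function names and the dictionary layout are hypothetical, $N$ is assumed to be a power of two, and the numbers simply restate the bounds given in the text (address length $\log_2 N$, worst-case transit of $\log_2 N + N$ "fast" periods).

```python
def address_bits(n: int) -> int:
    """Length L = log2(N) of the physical address for an N x N element."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    return n.bit_length() - 1


def worst_case_transit(n: int) -> int:
    """Upper bound, in "fast" clock periods, from input buffer to output slot."""
    return address_bits(n) + n


def timing_summary(n: int, fast_clock_ghz: float) -> dict:
    """Collect the clock relations stated in the text (hypothetical helper)."""
    fast_period_ps = 1000.0 / fast_clock_ghz  # one "fast" period in picoseconds
    return {
        "fast_pulses_per_slow_period": n,
        "address_slots": address_bits(n),
        "readout_delay_fast_periods": address_bits(n),
        "worst_case_transit_fast_periods": worst_case_transit(n),
        "worst_case_transit_ps": worst_case_transit(n) * fast_period_ps,
    }


if __name__ == "__main__":
    # The 4 x 4 element of Figure 2 at the 40 GHz internal clock of the abstract.
    print(timing_summary(4, 40.0))
```

For the $4 \times 4$ element at 40 GHz this gives a worst-case transit of 6 "fast" periods, i.e. 150 ps, which adds to latency only.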
Also for the sake of simplicity, the first proposed design of the switching element does not have internal congestion control. It is implicitly assumed that the flow of incoming packets is contention-free (this condition can be achieved at the ATCR stage, say, by keeping track of the used entries in the look-up tables). Due to the TDM nature of the element, there is a limit on the number of switched channels:

$$N \le \frac{f_0}{f_e}. \qquad (1)$$

However, this constraint can be overcome by organizing several elementary $N \times N$ switches into more complex hierarchical ($N^2 \times N^2$, etc.) switching fabrics (see Figure 3b).
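As a worked illustration of the channel limit in Eq. (1), the following sketch computes the largest admissible $N$ and the per-channel rate for a given internal clock. The function names are hypothetical. The two-level fabric helper assumes two stages of $N$ elements each, which matches the "4x2" entry of Table 1 and the "2*log(N) address lines" label of Figure 3b but is otherwise an assumption.

```python
import math


def max_channels(f0_ghz: float, fe_ghz: float) -> int:
    """Largest N allowed by Eq. (1), N <= f0 / fe."""
    if f0_ghz <= 0 or fe_ghz <= 0:
        raise ValueError("frequencies must be positive")
    return math.floor(f0_ghz / fe_ghz)


def external_rate_ghz(f0_ghz: float, n: int) -> float:
    """Highest per-channel rate for N channels sharing the bus: fe = f0 / N."""
    return f0_ghz / n


def two_level_fabric(n: int) -> dict:
    """Size of an N^2 x N^2 fabric built from N x N elements (assumed layout)."""
    bits_per_stage = n.bit_length() - 1  # log2(N) for a power-of-two N
    return {
        "channels": n * n,
        "elements": 2 * n,                # two stages of N elements each
        "address_bits": 2 * bits_per_stage,
    }


if __name__ == "__main__":
    print(max_channels(40, 10))       # channels supported at 10 Gb/s each
    print(external_rate_ghz(40, 4))   # rate per channel of a 4 x 4 element
    print(two_level_fabric(4))        # 16 x 16 fabric from 4 x 4 elements
```

With the figures quoted in the abstract ($f_0 = 40$ GHz, $f_e = 10$ Gb/s) the limit is $N = 4$.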

Figure 3. a) Top-level configuration of an ATM switch (simplified). ATCR: address translation and contention resolution module; SN: switching network or packet-switching element. b) Building an $N^2 \times N^2$ switching network from $N \times N$ elements. Labels in the drawing: clock in, clock out, address lines, "load" strobes, N*N inputs, N*N outputs, 2*log(N) address lines.

3. SWITCH PARTS

3.1. INPUT BUFFERS

An input buffer is associated with each incoming channel. It simultaneously performs buffering and routing functions, which are controlled by different ("slow" and "fast") clock signals. The input buffer consists of three major parts (see Figure 4a): a FIFO-type temporary address storage (S2P) organized as a serial-to-parallel converter, a counter modulo $N$ with a loading interface (CNT), and a data bit storage buffer (BUFF).

The temporary address storage sequentially accepts the address from the input A, one address bit per "slow" clock TIS, and flushes it to the counter part whenever the "load" (L) signal arrives. The storage is designed as a shift register built of RSFQ analogs [8] of the "call" cell [9] rather than of "conventional" RS flip-flops, so that one "reset" branch of each cell is used for transferring bits along the register, and the other is used for parallel read-out.

The ATCR module assigns a number $A$ to each input channel (and hence to each packet arriving at that input) according to its look-up tables:

$$A = O_i - 1, \quad i = 1 \ldots N, \qquad (2)$$

where $O_i$ is the number of the destination channel, counted from 1, from left to right (see Figure 1). The "load" signal causes the corresponding binary code of $A$ (with the least significant bit placed in the left-most T flip-flop) to be written to the counting loop CNT as its initial state. Thus, we can speak about two modes of the input buffer: a programming mode and a counting mode. The signal L switches the input buffer from the former mode to the latter explicitly, while the opposite transition is implicit and takes place when the last bit of the routed packet leaves the storage.

The "fast" clock permanently circulates in the counting loop, causing the generation of the "carry" signal each time the contents of all the upper T flip-flops change to zero. The period of this "carry" signal is $N/f_0 = 1/f_e$, i.e., it coincides with the period of the "slow" clock, while its initial offset depends on $A$ and, in turn, determines the number of the TDM slot assigned to the data bit stored in the BUFF D flip-flop.

Note that in the counting mode the S2P part of the input buffer is idle, and the "slow" clock rolls the empty shift register. Conversely, in the programming mode both the CNT and BUFF parts do useless work: the "fast" clock runs through the counting loop and sends nothing from the DI input to the DO output. Yet this small overhead saves a lot of hardware that would otherwise be needed to monitor the activity within the input buffers and switch the corresponding clock streams on and off.
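The way the counter offset selects a TDM slot can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the circuit itself: the counter is assumed to count up by one per "fast" clock and to emit "carry" when it wraps to zero, the output converter is assumed to shift from position 1 towards position $N$, and the uniform delay of the concentrator tree is ignored. All function names are hypothetical.

```python
from typing import List, Optional, Sequence


def emission_tick(dest: int, n: int) -> int:
    """"Fast" tick (1..N) at which a buffer routed to output `dest` fires."""
    if not 1 <= dest <= n:
        raise ValueError("dest must be in 1..n")
    counter = dest - 1                 # Eq. (2): A = O_i - 1 is the initial state
    tick = 0
    while True:
        tick += 1
        counter = (counter + 1) % n    # assumed: one increment per "fast" clock
        if counter == 0:               # assumed: "carry" on wrap-around to zero
            return tick


def route_one_slow_period(bits: Sequence[str], dests: Sequence[int],
                          n: int) -> List[Optional[str]]:
    """Contents of output slots 1..N after one "slow" period."""
    emitted = {}
    for bit, dest in zip(bits, dests):
        tick = emission_tick(dest, n)
        if tick in emitted:
            raise ValueError("contention: two inputs target the same output")
        emitted[tick] = bit
    register: List[Optional[str]] = [None] * n
    for tick in range(1, n + 1):
        # A new bit (or an empty slot) enters at position 1; the rest shift on.
        register = [emitted.get(tick)] + register[:-1]
    return register


if __name__ == "__main__":
    # Inputs 1..4 carry bits a..d destined for outputs 3, 1, 4, 2.
    print(route_one_slow_period(["a", "b", "c", "d"], [3, 1, 4, 2], 4))
```

Under these assumptions a bit destined for output $N$ leaves first and is shifted furthest, while a bit destined for output 1 leaves last, so each bit lands in the slot of its destination channel.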

Figure 4. a) Input buffer for an $8 \times 8$ switch. Pins: A (serial address line), L ("load" line), TIF ("fast" clock input), TIS ("slow" clock input), DI (serial data input), DO (serial data output). b) Bit concentrator for a $4 \times 4$ switch. Pins: TI ("fast" clock input), TO1...TO4 ("fast" clock outputs), DO (serial data output), DI1...DI4 (serial data inputs). Parts for both pictures: D (RSFQ D flip-flop with positive-only and complementary outputs), T (RSFQ T flip-flop with programmable initial state), C (RSFQ "call" cell).

3.2. BIT CONCENTRATOR

The task of the bit concentrating tree is to synchronously multiplex packets arriving from different input buffers into a single TDM stream. As shown in Figure 4b, it is a full binary tree with height $\log_2 N$ and width $N$, built of RSFQ D flip-flops and RSFQ asynchronous mergers (confluence buffers). The tree is controlled by the "fast" clock. Each clock signal advances packet bits one stage towards the root. Because the incoming packet stream is contention-free, there is at most one meaningful data bit in each column of the tree at a time (under certain conditions there may be a number of completely empty columns, because empty packets may occur in one or more input streams). Hence, the fraction of cells in the tree that are busy at a time is at most

$$\frac{\log_2 N}{2N - 1}. \qquad (3)$$

This factor is approximately 28% for $N = 4$ and decreases as $N$ grows.
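The occupancy bound can be checked numerically. The sketch below assumes that the tree has $2N - 1$ cells and that at most $\log_2 N$ of them hold a data bit at any time, which reproduces the figure of about 28% quoted for $N = 4$. The function names are hypothetical.

```python
import math


def tree_cells(n: int) -> int:
    """Cell count of a full binary tree with N leaves (assumed to be 2N - 1)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    return 2 * n - 1


def max_busy_fraction(n: int) -> float:
    """Upper bound on the busy fraction: at most one bit per tree level."""
    return math.log2(n) / tree_cells(n)


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, round(100 * max_busy_fraction(n), 1), "%")
```

The bound falls from about 28.6% at $N = 4$ to 20% at $N = 8$, so most of the tree idles in larger elements.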

3.3. SERIAL-TO-PARALLEL CONVERTER

The serial-to-parallel FIFO-type converter used as the main shared bus of the switching element is very similar in structure to the temporary address storage in the input buffers (see section 3.1). It serially accepts intermixed data bits from the input DI, one bit per "fast" clock period (fed from the input TIF), and flushes the entire contents of the shift register to the output channels DO1...DON at the request of the "slow" clock TIS (Figure 5). It is worth mentioning that the "call" cells used in the converter may be considered "RRS" flip-flops, because they have one "set" input and two independent, fully functional "reset" inputs with the corresponding outputs.

3.4. FREQUENCY MULTIPLIER

As shown above, the switching element requires three different clock streams that differ in frequency and in phase, yet it takes only one clock stream from the outside; the rest are generated in the internal frequency multiplier.

Figure 5. a) Serial-to-parallel converter for a $4 \times 4$ switch. Pins: TIF ("fast" clock input), TIS ("slow" clock input), TO ("fast" clock output), DI (serial data input), DO1...DO4 (data outputs). b) Frequency multiplier for an $8 \times 8$ switch. Pins: TI ("slow" clock input), TOS ("slow" clock output), TOS_L ("slow" clock shifted by $\log_2 N$), TOF ("fast" clock output). Parts for both pictures: C (RSFQ "call" cell), D (RSFQ D flip-flop), T (RSFQ T flip-flop), RSFQ inverter, and delay line ($\tau$).

The frequency multiplier (which serves as the "fast" clock generator) is organized as a delay loop with positive feedback and a counting loop that terminates the generation after $N$ "fast" clock signals have been derived. The frequency generated by the multiplier is determined by the delay time in the RSFQ inverter, $\tau_{inv}$, which is quite stable and depends on the nature and the design of the inverter, and by the delay time in the transmission lines, $\tau_t$, which can be adjusted according to our needs so that

$$f_e \times N < \frac{1}{\tau_{inv} + \tau_t} < f_0. \qquad (4)$$

The multiplier also produces the third clock signal used in the device, namely the "slow" clock shifted by $\log_2 N$ periods relative to the external clock. This signal, TOS_L, is taken from the $(\log_2 N - 1)$th stage of the multiplier, and then its frequency is divided by 2.
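A rough numerical sketch of the delay loop may help here. It assumes that the loop frequency is simply the reciprocal of the total loop delay $\tau_{inv} + \tau_t$ and that the first pulse of a burst coincides with the triggering "slow" pulse. The inverter delay of 10 ps used in the example is a hypothetical value chosen for illustration, not a figure from the text, and the function names are likewise hypothetical.

```python
from typing import List


def loop_frequency_ghz(tau_inv_ps: float, tau_t_ps: float) -> float:
    """Frequency of the delay loop, taken as 1 / (tau_inv + tau_t)."""
    total_ps = tau_inv_ps + tau_t_ps
    if total_ps <= 0:
        raise ValueError("total loop delay must be positive")
    return 1000.0 / total_ps


def line_delay_for(target_ghz: float, tau_inv_ps: float) -> float:
    """Transmission-line delay that tunes the loop to `target_ghz`."""
    tau_t_ps = 1000.0 / target_ghz - tau_inv_ps
    if tau_t_ps < 0:
        raise ValueError("inverter alone is too slow for the target frequency")
    return tau_t_ps


def burst_times_ps(n: int, tau_inv_ps: int, tau_t_ps: int) -> List[int]:
    """Times of the N "fast" pulses generated for one "slow" pulse at t = 0."""
    period = tau_inv_ps + tau_t_ps
    return [k * period for k in range(n)]  # the counting loop stops after N


if __name__ == "__main__":
    tau_inv = 10  # hypothetical inverter delay in picoseconds
    print(line_delay_for(40.0, tau_inv))   # line delay needed for 40 GHz
    print(burst_times_ps(4, tau_inv, 15))  # pulse times for a 4 x 4 element
```

Because only $\tau_t$ is adjustable, the achievable frequency range is bounded from above by the inverter delay itself.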

4. HARDWARE ESTIMATIONS

The following equation gives a crude estimate of the overall number of Josephson junctions in the $N \times N$ switching element:

$$n \approx 26 \times N \times \log_2 N + 7 \times \log_2 N + 28 \times N. \qquad (5)$$

Even for $N = 16$, Eq. (5) gives $n \approx 2400$. RSFQ circuits of this size can easily fit on a 5 mm × 5 mm chip. However, according to Eq. (1), a large $N$ may significantly decrease the bit rate in each channel. Using the hierarchical approach, say, composing an $N^2 \times N^2$ switch from $N \times N$ blocks, can significantly increase (namely, double) the frequency $f_e$, but this gain comes at the cost of an increased number of Josephson junctions per switch. Eq. (5) can be extended to the hierarchical case:

$$n \approx \alpha(d) \times 2^d \times N^{(1 - 1/2^d)} \left( 26 \times N^{1/2^d} \times \frac{\log_2 N}{2^d} + 7 \times \frac{\log_2 N}{2^d} + 28 \times N^{1/2^d} \right), \qquad (6)$$

where $d$ is the "depth" of the hierarchy ($d = 0$ for a plain $N \times N$ switch) and $\alpha$ is a factor that takes into account the overhead of the interconnections. We expect that it is a function of $d$ and increases as $d$ grows. As one can see from Table 1, the hierarchical architecture with depth $d = 1$ should be considered the optimal one, because in comparison with the flat architecture it has double the throughput and only slightly more complex circuitry. On the other hand, although a hierarchical switch of depth $d = 2$ allows an even larger arrival rate, the complexity of its interconnections may be tremendous.

One of the greatest advantages of RSFQ devices over "traditional" semiconductors is their low power consumption. Estimates of the power dissipated by a $16 \times 16$ switching element on a single chip give as little as $P \approx 300$ µW at a frequency of 40 GHz. The combination of high speed and low power may justify the use of RSFQ circuits in practical systems, despite the need for deep refrigeration (to 4 to 5 K).
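The junction estimates of Eqs. (5) and (6) can be evaluated with a short script. This sketch assumes that the hierarchical estimate amounts to the interconnection factor times the number of elements, $2^d N^{1 - 1/2^d}$, times Eq. (5) applied to a single element with $N^{1/2^d}$ channels. The interconnection factor is left as a parameter `alpha` defaulting to 1, since the text gives no values for it, and the function names are hypothetical.

```python
import math


def junctions_flat(channels: float) -> float:
    """Eq. (5): crude junction count of one flat element."""
    log_n = math.log2(channels)
    return 26 * channels * log_n + 7 * log_n + 28 * channels


def junctions_hierarchical(n: int, d: int, alpha: float = 1.0) -> float:
    """Assumed form of Eq. (6) for N overall channels and hierarchy depth d."""
    channels_per_element = n ** (1 / 2 ** d)
    elements = 2 ** d * n ** (1 - 1 / 2 ** d)
    return alpha * elements * junctions_flat(channels_per_element)


if __name__ == "__main__":
    for depth in (0, 1, 2):
        print(depth, round(junctions_hierarchical(16, depth)))
```

With `alpha = 1` and $N = 16$ the script yields roughly 2140, 2672 and 3680 junctions for $d = 0, 1, 2$, somewhat below the rounded figures of Table 1; the difference would have to be absorbed by the interconnection factor.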

Table 1. Comparison of different flat and hierarchical configurations for a $16 \times 16$ switch. External packet arrival rate derived from Eq. (1), assuming $f_0 = 40$ GHz.

| Overall switched channels $N$ | Hierarchy depth $d$ | Switching elements | Switched channels per element | External packet arrival rate $f_e$, GHz | Josephson junctions $n$ |
|---|---|---|---|---|---|
| 16 | 0 | 1 | 16 | 5 | 2,400 |
| 16 | 1 | 4×2 | 4 | 10 | 3,000 |
| 16 | 2 | 16×4 | 2 | 20 | 4,100 |

5. ACKNOWLEDGMENTS

The author would like to thank P. Bunyk, Dr. V. Semenov and especially Prof. K. Likharev for numerous helpful hints and fruitful discussions. This research has been inspired by Dr. E. Wikborg, LM Ericsson, Sweden, and supported by DoD's University Research Initiative (AFOSR Grant #F49620-92-J-0508).

REFERENCES

1. K. Likharev and V. Semenov, "RSFQ logic/memory family: a new Josephson junction technology for sub-terahertz clock frequency digital systems," IEEE Trans. on Appl. Supercond., vol. 1, pp. 3–28, Mar. 1991.
2. B. Lee, M. Kang, and J. Lee, Broadband Telecommunications Technology. Norwood, USA: Artech House, 1993.
3. M. Hosoya, T. Nishino, W. Hioe, S. Kominami, and K. Takagi, "Superconducting packet switch," IEEE Trans. on Appl. Supercond., vol. 5, pp. 3316–3317, June 1995.
4. S. Tahara, S. Yorozu, and H. Matsuoka, "A superconductive ring-pipelined network system," IEEE Trans. on Appl. Supercond., vol. 5, pp. 3164–3167, June 1995.
5. A. H. Worsham, J. X. Przybysz, J. Kang, and D. L. Miller, "A single flux quantum cross-bar switch and demultiplexer," IEEE Trans. on Appl. Supercond., vol. 5, pp. 2996–2999, June 1995.
6. R. Händel, M. Huber, and S. Schröder, ATM Networks. Concepts, Protocols, Applications. Addison-Wesley, 1994.
7. P. Newman, "ATM technology for corporate networks," IEEE Communications Magazine, pp. 90–101, Apr. 1992.
8. J.-C. Lin and V. Semenov, "Timing circuits for RSFQ digital systems," Applied Superconductivity, vol. 5, pp. 3472–3477, Sept. 1995.
9. E. Brunvand and R. Sproull, "Translating concurrent communicating programs into delay-insensitive circuits," tech. rep., School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, Apr. 1989.
