Feasibility Study of RSFQ-based Self-Routing Nonblocking Digital Switches

Dmitry Y. Zinoviev*,+ and Konstantin K. Likharev*

* Department of Physics and + Department of Computer Science, SUNY, Stony Brook, NY 11794-3800, USA

[email protected], [email protected]

Abstract: This paper describes the results of a preliminary analysis of ultra-fast low-power superconductor digital switches based on Rapid Single-Flux-Quantum (RSFQ) technology. In particular, RSFQ-based crossbar, Batcher-banyan, and shared bus switching cores have been considered, and the possible parameters of these circuits have been estimated. The results show that the proposed RSFQ digital switches, with an overall throughput of 7.5 Tbps, operating at an internal clock frequency of ≈ 60 GHz and dissipating very little power, could effectively compete with their semiconductor and photonic counterparts.

I. Introduction

The rapid increase of the speed of communication networks requires faster digital switching circuits. The only way the traditional semiconductor electronic technologies (silicon bipolar, CMOS, Bi-CMOS, and GaAs MESFET) can handle hyper-gigabit frequencies is broad parallelizing, leading to high power consumption. For example, parallel processing of a 2.5 Gbps stream in a 256 × 256 ATM switch may result in dissipation of 1,135 W [1], while a simple 10 Gbps serial-to-parallel converter and a 2 × 2 router may consume 10 W each [2]. Photonics technologies, promising fast data rates and a natural way of coupling to optical fibers [3], still lack integratable memory and logic circuits. This is why we believe that digital switching may provide a field for the application of superconductor electronics, especially the new RSFQ (Rapid Single-Flux-Quantum) family of ultrafast logic/memory devices [4], [5]. Simple RSFQ circuits (e.g., frequency dividers) have been demonstrated to operate at internal clock frequencies f_i beyond 300 GHz when implemented in a relatively simple 1.5 μm niobium-trilayer technology [6]. Scaling estimates show [4], [5] that with a 0.5 μm technology the maximum frequency may be increased to ≈ 700 GHz for simple devices and ≈ 200 GHz for LSI circuits.

(Manuscript received January 18, 1999. This work was supported by the DoD's University Research Initiative, AFOSR grant #F49620-95-I-0415. The URL of the home page of the RSFQ group at SUNY Stony Brook is http://pavel.physics.sunysb.edu/RSFQ/RSFQ.html)

In order to be somewhat conservative, in all the following estimates we will use the value f_i = 60 GHz, which may be reached using a 1.5 μm fabrication technology. Another important advantage of the RSFQ logic is its very low power consumption: for 1.5 μm junctions, the average power per grounded Josephson junction may be easily reduced to ≈ 0.34 μW [7]. Other important features of RSFQ circuits include low crosstalk between on-chip interconnects and the possibility of local clock signal generation (which makes skew-free timing possible at 60-GHz-range frequencies) [4], [5].

In order to be commercially successful, RSFQ circuits have to provide a substantial advantage in performance over the traditional electronics. We believe that ultrafast switching can be an application that reveals such an advantage. In what follows we will estimate the possible complexity and performance of several RSFQ switches, including the crossbar switch (Section III), the Batcher-banyan switching core (Section IV), and the original time-division switch (Section V). In the conclusion (Section VI) we will compare these options and summarize the possible performance advantages which may be achieved using RSFQ technology.

II. Feasibility Study: Task and Restrictions

We have considered various switching cores, trying to understand which of the existing switching architectures would be the most feasible for RSFQ implementation. In the case of an emerging technology like RSFQ, the feasibility depends mostly on hardware complexity, but power consumption and performance parameters such as aggregate throughput and latency are also important. In order to obtain comparable results, the following conditions have been imposed on the studied switching networks (namely, the crossbar core, the Batcher-banyan sorting-expanding network, and the time-division shared bus):

• The same total workload of 7.5 Tbps, applied in two different ways: A) 128 bit-serial channels with the extrinsic input/output rate f_e = 60 Gbps (the "optimistic", or "fiber-optic", environment), and B) 96 32-bit-parallel channels with f_e = 2.44 Gbps (the "realistic", or "semiconductor", environment).

In the latter case, superconductor parallel-to-serial and serial-to-parallel converters must also be included in the estimates (a quick numerical sanity check of both workloads follows this list).

• Self-routing, i.e., the ability to route a packet from the source port to the destination port in accordance with the provided physical address of the destination port.

• No address translation, no contention resolution, no broadcast or multicast features, no cyclic redundancy check. All these functions can be implemented using the RSFQ technology, but are beyond the scope of this paper.
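Both environments are intended to present the same aggregate load; the short Python sketch below (the helper function and its name are ours) verifies the arithmetic.

```python
# Sanity check of the two benchmark environments of Section II: both
# should carry approximately the same 7.5 Tbps aggregate workload.

def aggregate_tbps(channels: int, bits_per_channel: int, rate_gbps: float) -> float:
    """Total switch workload in Tbps (10^12 bits per second)."""
    return channels * bits_per_channel * rate_gbps / 1000.0

optimistic = aggregate_tbps(channels=128, bits_per_channel=1, rate_gbps=60.0)
realistic = aggregate_tbps(channels=96, bits_per_channel=32, rate_gbps=2.44)

print(f"optimistic (fiber-optic): {optimistic:.2f} Tbps")    # 7.68 Tbps
print(f"realistic (semiconductor): {realistic:.2f} Tbps")    # 7.50 Tbps
```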

III. Crossbar Switching Core

A. Top-level Design

The crossbar switching core (see, e.g., [8], Chapter 4) consists of a square array of N² elementary node switches ("×" in Fig. 1) capable of sending data from top to bottom and from left to right (the "open" state) and/or from left to bottom (the "closed" state). Note that under the assumption of contention-free input, at most one node switch in a column may be in the "closed" state, so there can be no collision of the signal "steered" from left to bottom with a signal arriving from the top. The design of the crossbar switching core is very regular and simple, which makes it attractive for implementation in an emerging technology such as RSFQ. Its disadvantages include quadratic hardware complexity, a linear (with respect to N) average message passing latency:

τ ∝ (1/N²) Σ_{i=1..N} Σ_{j=1..N} (N + j − i) = N,    (1)
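Eq. (1) is easy to confirm numerically; the sketch below (the function name is ours) averages the hop count N + j − i over all source/destination pairs and recovers exactly N.

```python
# Numerical check of Eq. (1): the mean number of node-switch hops,
# N + j - i, over all source rows i and destination columns j of an
# N x N single-plane crossbar is exactly N.

def mean_crossbar_latency(n: int) -> float:
    total = sum(n + j - i for i in range(1, n + 1) for j in range(1, n + 1))
    return total / n**2

for n in (4, 16, 128):
    assert mean_crossbar_latency(n) == n
print("average latency equals N, as Eq. (1) states")
```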

DY’ ×

×

×

×

IC

×

×

×

×

IC

×

×

×

×

Input Ports

IC

Single–Plane Crossbar Switch

A)

Fig. 2. Crossbar switching core: A) with path-dependent latency (single-plane) and B) with path-independent latency (double-plane).

and the dependence of the latency on the positions (i, j) of the source and destination channels. The latter drawback can be corrected by using two square switching matrices in series (the two-plane switching core, Fig. 2B). This will not increase the commutation delay, because the matrices can be commutated in parallel. However, the average latency is doubled as compared with the single-plane case.

We propose a pipelined architecture of a packet-oriented self-routing crossbar switch. Consider the routing of the ith inbound packet presented by picosecond SFQ pulses. We assume that the first L_A = log₂N bits of the packet (of the total L + L_A) stand for the binary representation A_i of the destination address (the number of the output port within the switch), while the remaining L bits carry the payload, or the body of the packet, D_i. When the (i+1)th packet arrives at the switching core, the input port controller (IC in Fig. 1) interprets the destination address of the packet, discards it from the packet, and produces a token (an SFQ signal) that is advanced along the corresponding address path to the A_{i+1}th column of the switching matrix (this corresponds to the Strip A_{i+1} and Load A_{i+1} boxes in Fig. 3), while the body of this packet is stored in a buffer within the input controller. Simultaneously, the body of the previous, ith packet is being transferred to its destination (box Send D_i). When the transfer is completed and the token is advanced to its position (whichever comes last), the clock is stopped, and externally induced SFQ signals Clean and Select propagate along every column in the matrix (box Cln/Sel). Clean clears all existing commutations, and Select establishes the connection between a row and a column whenever the token is found in the corresponding node of the matrix, thus settling the data path for the body D_{i+1} and freeing the address path for the next token A_{i+2}. Finally, the clock is resumed.

The internal clock frequency f_i may differ from the average external arrival rate f_e because of the finite setup overhead (see Fig. 3):

f_i = f_e (2 + max(L, N + L_A)) / (L + L_A).    (2)

If L ≥ N + L_A, which is true, e.g., for Asynchronous Transfer Mode (ATM) cells and N ≤ 415, and for Internet Protocol (IP) packets and N ≤ 32,000, then f_e can be reduced to match f_i by inserting a certain number of "dummy" bits in the packet's body.
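Because Eq. (2) had to be reconstructed from a garbled original, the sketch below should be read as our interpretation of it rather than the paper's own code; for ATM cells and N = 128, the condition L ≥ N + L_A holds and the required internal clock comes out just below f_e.

```python
# Our reading of Eq. (2): a packet of L + L_A bits arriving at rate f_e
# must be processed in 2 + max(L, N + L_A) internal clock cycles.

import math

def internal_clock_ghz(f_e_gbps: float, n_ports: int, body_bits: int) -> float:
    l_a = math.ceil(math.log2(n_ports))          # address length L_A
    cycles = 2 + max(body_bits, n_ports + l_a)   # setup + transfer time
    return f_e_gbps * cycles / (body_bits + l_a)

# ATM cells, L = 424: L >= N + L_A holds up to N ~ 415, as the text notes.
print(f"{internal_clock_ghz(60.0, n_ports=128, body_bits=424):.1f} GHz")  # 59.3
```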


Fig. 1. Single-plane crossbar switch.


Fig. 3. Timing diagram of an N × N crossbar switch, with delays given in numbers of clock cycles. Shaded boxes denote idle cycles. A_i and D_i are the address and the data parts of the ith packet, respectively.


B. Input Controllers

The task of the input port controller (Fig. 4) is to discard the address bits from a packet, interpret the destination address, generate the token, and store the body of the packet in a buffer. The controller consists of:

• a buffering shift register of length

L_buff ≥ 2 + L − L_A    (3)

to accommodate the pipelined data;

• a serial-to-parallel converter to translate the address from the initial bit-serial form into a bit-parallel pattern;

• an address counter;

• a 1 → 2 demultiplexer demux and a modulo-log₂N counter C1 that deliver the first log₂N address bits to the serial-to-parallel converter and the remaining L bits to the shift register;

• a modulo-N counter C2 and a switch that controls the delivery of the clock signal to the address counter.

Initially, the demultiplexer demux connects the packet input A+D to the converter. Each clock signal pushes an address bit into the converter, which consists of a string of D2 cells ("double flip-flops", see Appendix A). After log₂N clock periods, the counter C1 generates an overflow signal Sig that toggles the demultiplexer and reads out the address bits from the converter to the line of D flip-flops with complementary outputs. The next clock signal copies the contents of the D flip-flops to the T-RS flip-flops of the address counter, so that the bit pattern stored in the counter will be equal to A_i, and also allows the clock to feed the address counter. After N1 = N − A_i clock periods, the address counter overflows and produces the Address signal (the token). In the next N2 = A_i clock periods, the token will advance to the A_ith column of the matrix (note that for any A_i, N1 + N2 = N is a constant). After N clock periods, the counter C2 terminates the clock supply to the address counter. Finally, the input controller is reset to its initial state by the Clean signal, which toggles the input demultiplexer.
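A behavioral sketch of this token timing (our simplified model, not the SFQ circuit): whatever the destination address, the counter overflow and the subsequent token march always total exactly N clock periods.

```python
# Token timing in the crossbar input controller: the modulo-N address
# counter, preloaded with A_i, overflows after N1 = N - A_i clocks; the
# released token then advances A_i columns in N2 = A_i further clocks.

def token_schedule(n: int, a_i: int) -> tuple[int, int]:
    n1 = n - a_i     # clocks until the address counter overflows
    n2 = a_i         # clocks for the token to reach column A_i
    return n1, n2

n = 16
for a in range(n):
    n1, n2 = token_schedule(n, a)
    assert n1 + n2 == n      # constant commutation time, as noted above
print(f"token delivery always takes {n} clocks")
```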




Fig. 4. RSFQ implementation of the crossbar input port controller.

C. Node Switch

The node switch (Fig. 5) can be in one of two states, depending on the contents of its non-destructive read-out (NDRO) cell: "open" or "closed". In the "open" state it connects the horizontal input DX to the horizontal output DX' and the vertical input DY to the vertical output DY'. In the "closed" state, the horizontal input DX is connected to both DX' and DY'. The state of the switch can be cleared by the Clean signal and set by the Select signal, which connects the row to the column only if an SFQ token is stored in the D2 flip-flop. The schematics of the RSFQ elementary cells and components of the switch can be found in the literature: the counter in [9]; the demultiplexer, NDRO cell, D flip-flop, and shift register in [4]; the D flip-flop with complementary outputs and the T-RS flip-flop in [10]; the D2 flip-flop in Appendix A.
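The routing behavior of the node switch reduces to a simple rule, summarized by the toy functional model below (a sketch under our own naming, not the circuit of Fig. 5).

```python
# Functional model of the crossbar node switch: "open" passes both
# inputs straight through; "closed" steers the row input DX into the
# column output DY' (contention-free input guarantees no collision).

def node_switch(state: str, dx: int, dy: int) -> dict:
    if state == "open":
        return {"DX'": dx, "DY'": dy}
    if state == "closed":
        return {"DX'": dx, "DY'": dx}   # row steered into the column
    raise ValueError(f"unknown state: {state}")

print(node_switch("open", dx=1, dy=0))    # {"DX'": 1, "DY'": 0}
print(node_switch("closed", dx=1, dy=0))  # {"DX'": 1, "DY'": 1}
```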


D. Hardware and Power Accounting

In order to implement a two-plane N × N crossbar switch, we would need 2N input controllers, 2N × N node switches, and some overhead hardware. The overhead complexity cannot be estimated at the feasibility study stage, but is known to be O(N) + O(1), so we can safely ignore it if N ≫ 1. Our method of calculating the number of Josephson junctions and the dissipated power is described in Appendix B. The results of such a calculation for an input controller and a node switch are presented in Tables I and II.


Fig. 5. RSFQ implementation of the node switch.

TABLE I
HARDWARE AND POWER BUDGET FOR INPUT BUFFERS

Cell name | Quantity | #JJ per unit | P per unit, μW | #JJ total | P total, μW
Splitter | 3L_A | 1 | 0.34 | 3L_A | 1.02L_A
D2 cell | L_A | 7 | 1.02 | 7L_A | 1.02L_A
DFFC | L_A | 12 | 1.36 | 12L_A | 1.36L_A
T-RS | L_A | 12 | 1.36 | 12L_A | 1.36L_A
DFF | L_A − 1 | 3 | 0.68 | 3L_A | 0.68L_A
Demux | 1 | 11 | 1.7 | 11 | 1.7
Shift reg. | L − L_A | 3 | 0.68 | 3L | 0.68L
Grand total | | | | 11 + 31L_A + 3L | 4.5L_A + 0.7L

Fig. 6. Batcher-banyan switching core (for a particular case of N = 8). SA is an ascending sorting element; SD is a descending sorting element; AX is an ascending expanding element. The sorting network occupies log₂N (1 + log₂N)/2 columns and the expanding network log₂N columns, with N/2 rows of elements in each.

TABLE II
HARDWARE AND POWER BUDGET FOR NODE SWITCHES

Cell name | Quantity | #JJ per unit | P per unit, μW | #JJ total | P total, μW
Splitter | 4 | 1 | 0.34 | 4 | 1.36
D2 cell | 1 | 7 | 1.02 | 7 | 1.02
NDRO | 1 | 7 | 1.02 | 7 | 1.02
Merger | 1 | 4 | 0.68 | 4 | 0.68
Grand total | | | | 22 | 4.08

This results in overall ≈ 44N² Josephson junctions and ≈ 8.2N² μW of dissipated power per N × N part of the switching core. Equations (4) and (5) give the hardware complexity and power dissipation of the overall design:

N_JJ ≈ 2N (11 + 31L_A + 22N + 3L),    (4)

P ≈ 2N (4.5L_A + 4.1N + 0.7L + 2) μW.    (5)
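The sketch below evaluates Eqs. (4) and (5) for the ATM example used in Section VI (the wrapper function is ours; the formulas are the paper's estimates, so the outputs are approximate).

```python
# Hardware and power budget of the two-plane N x N crossbar switch,
# Eqs. (4)-(5), with L the packet body length and L_A = log2(N).

import math

def crossbar_budget(n: int, l: int) -> tuple[int, float]:
    l_a = math.ceil(math.log2(n))
    n_jj = 2 * n * (11 + 31 * l_a + 22 * n + 3 * l)      # junctions
    p_uw = 2 * n * (4.5 * l_a + 4.1 * n + 0.7 * l + 2)   # microwatts
    return n_jj, p_uw

jj, p = crossbar_budget(n=128, l=424)    # the ATM example of Section VI
print(f"{jj:,} JJs, {p / 1000:.0f} mW")  # ~1,105,000 JJs and ~219 mW,
                                         # close to Table III's 216 mW
```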

When considering the "semiconductor" case, it is possible to load data in parallel into the input controller rather than first serializing it using parallel-to-serial converters. This opportunity will not be considered here in detail. Some numerical examples will be discussed in Section VI.

IV. Batcher-Banyan Switching Core

A. Top-level Design

The Batcher-banyan switching core is a self-routing nonblocking network [11]. Packets are routed on a space-division basis: elementary 2 × 2 switches trace distinct paths from each input to each output, so that each path is used by at most one routed packet at a time. While usually having a larger routing latency than the crossbar switch, the Batcher-banyan network requires much less hardware (≈ N log N rather than ≈ N² elementary switches).

The Batcher-banyan network consists of the sorting and expanding parts (Fig. 6). The sorting element expects an input packet to contain a main part (which includes the protocol header and the payload), preceded by the physical address A_out of the destination output port. The element compares the complete addresses of two input packets and makes the routing decision (an empty packet is always considered as having the largest address). The ascending sorter (SA in Fig. 6) sends the packet with the larger A_out upwards, while the descending element SD sends such a packet downwards. As a result, the packets arriving at the expanding stage are sorted by their addresses, and there can be no contention in the expanding part of the network. Each expanding element AX routes the packets according to the first bits of their addresses, and also strips them of these bits, so the packets finally arrive at the appropriate output ports completely free of their physical address parts. A superconductor prototype of the elementary expanding switch has been implemented [12] using a latching logic (3-phase-clocked MVTL) and has been tested to operate fully at low frequencies and only partly at 4 GHz.

B. Sorter and Expander

Although the sorter and the expander perform different tasks, they have much in common. Fig. 7 and Fig. 8 show their possible block structures; descriptions of the Muller C element, D flip-flop, D2 flip-flop, and extended XOR cell can be found in [4], [10], Appendix A, and [13], respectively. In particular, both devices have the same set of inputs and outputs (the first letters D, C, and I in the names of the pins denote data, clock, and initialization ports, while the second letters I and O denote the direction, input or output, of the ports), the same 2 × 2 crossbar router (β-element), and the same propagation scheme of the clock and the initialization signals. For example, in both cases the signal from one clock input is synchronized with a signal from the other clock input and then feeds the device and propagates to both clock outputs. When both initialization signals arrive at either device, a binary "1" is written into the D2 flip-flop.

The difference between the sorter and the expander shows up when we consider their operation more closely. In the case of the sorter, the extended XOR cell compares the destination addresses of the arriving packets bit by bit (MSB first). An output is produced only if some two bits are not equal, which means that the addresses are not equal. In this case, the output of the XOR cell will indicate which address is greater, the contents of the D2 flip-flop are destructively read out to the output corresponding to the packet with the larger destination address, and the crossbar router is properly commutated. Shift registers of length L_A = log₂N are used in the sorter (rather than D flip-flops) to keep the first bits of the packets until the routing decision is taken. In the case of the expander, only the first bits of the addresses are compared: if the first bit in either channel is 1, it destructively reads out the contents of the D2 flip-flop and provides commutation of the router.
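The sorter's routing rule can be captured in a few lines (a functional model only; which physical input sits on the "up" output by default is our assumption, not taken from the paper).

```python
# Bit-serial MSB-first comparison in an ascending sorting element SA:
# the first unequal address bit pair fixes the router state, so the
# packet with the larger destination address exits on the "up" output.

def sa_decision(addr_up: str, addr_down: str) -> str:
    """Return 'bar' or 'cross'; we assume 'bar' keeps addr_up on top."""
    for b_up, b_down in zip(addr_up, addr_down):   # MSB first
        if b_up != b_down:
            return "bar" if b_up > b_down else "cross"
    return "bar"    # equal (or both empty) addresses: state unchanged

print(sa_decision("1011", "0111"))   # 'bar':   11 > 7, stays on top
print(sa_decision("0011", "0111"))   # 'cross': 7 must be routed up
```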


Fig. 7. Block structure of the Batcher-banyan sorting element. X is an extended XOR cell; C is the Muller C element; D2 is a D2 flip-flop; B is a 2 × 2 crossbar router (β-element).


Fig. 8. Block structure of the Batcher-banyan expanding element. The notation is the same as in Fig. 7.

C. β-elements

Fig. 9 shows a possible RSFQ implementation of the most complex component of the switches, the so-called β-element [14], or 2 × 2 cross-point router. In contrast to the circuit described in [15], our router is controlled by SFQ (rather than dc) signals Cross and Bar, allowing the RSFQ implementation of the other circuits of the switch.


Fig. 9. A version of the RSFQ β-element using inductive coupling. The directions of persistent currents shown correspond to the Cross state of the element.

The circuit has two inputs A and B, two outputs A' and B', and two control lines Cross and Bar. In the Bar state, counter-clockwise currents flow in the quantizing interferometers I3 and I6, the critical currents of the non-quantizing interferometers I4 and I5 are suppressed, and signals from inputs A and B are routed to the outputs A' and B', respectively. In the Cross state, persistent currents flow in interferometers I2 and I7, and signals from inputs A and B are routed to outputs B' and A'.

Consider, for example, the arrival of an SFQ signal A when the element is in the Cross state. First, the pulse is split at junction J1. The critical current of the non-biased interferometer I5 is larger than that of Josephson junction J4, so that the latter junction is switched, and the signal goes to the output B'. However, the critical current of interferometer I1 is suppressed by the magnetic field from interferometer I2, so that signal A switches both junctions of I1 rather than J3 and does not reach output A'. The signal B is routed to output A' (rather than B') in exactly the same manner. The element can be switched from the Cross state to the Bar state by applying the SFQ Bar signal, which will stop the persistent currents in interferometers I2 and I7 and induce currents in interferometers I6 and I3. The Cross signal switches the circuit back in a similar manner. Although such an inductive solution is hardware-saving, it cannot be easily implemented using existing RSFQ technologies, because they do not allow high mutual coupling to be achieved unless holes are cut under the transformers in the ground plane(s). These holes are potential attractors for external magnetic fields; also, we still lack reliable software tools for the geometrical design of such structures.

Fig. 10 shows that the β-element can also be constructed from four NDRO cells (switches) and two mergers (confluence buffers) [4]. The signal Bar closes the upper left and the lower right switches and opens the two other switches, so that signal B can propagate to the left (to output A') but not to the right, while signal A can propagate to the right (to output B'). The Cross signal performs the reverse commutation. This design is more suitable for immediate implementation because reliable designs for both NDRO cells and mergers are available. However, we have to pay by more than doubling the number of Josephson junctions.

Fig. 10. A version of the RSFQ β-element based on galvanically coupled NDRO cells.

D. Hardware and Power Accounting

Equations (6)-(8) give the number of Josephson junctions required to implement an N × N Batcher-banyan switching network using the galvanically coupled β-elements:

M_sort ≈ 78 (N/2) (L_A/2) (1 + L_A),    (6)

M_exp ≈ 68 (N/2) L_A,    (7)

M ≈ N L_A (53 + 20L_A).    (8)

Notice that because the Batcher-banyan switching core is pipelined at the bit level rather than at the packet/cell level (as the crossbar switching core is), the hardware complexity does not depend on the packet/cell length. The power that would be dissipated by the switching core is

P ≈ 3.2 N L_A (1 + 2.85L_A) μW.    (9)

In the "semiconductor" case, serializing and parallelizing overheads must be added (see Section III). M-bit serial-to-parallel and parallel-to-serial converters are actually composed of M connected D2 cells or SR flip-flops with two inputs, respectively, so their hardware complexity is 13MN, and the dissipated power is 1.7MN μW. Some numerical examples will be discussed in Section VI.
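A quick numerical cross-check of Eqs. (6)-(8) (our script; the equations themselves were reconstructed from a garbled original): the closed form (8) is evidently a rounding of M_sort + M_exp, and the two agree to within a few percent.

```python
# Junction count of the N x N Batcher-banyan core, Eqs. (6)-(9):
# (N/2) rows of sorters over L_A(1+L_A)/2 columns at ~78 JJs each, plus
# (N/2) rows of expanders over L_A columns at ~68 JJs each.

import math

def batcher_budget(n: int) -> tuple[float, int, float]:
    l_a = math.ceil(math.log2(n))
    m_sort = 78 * (n / 2) * (l_a / 2) * (1 + l_a)   # Eq. (6)
    m_exp = 68 * (n / 2) * l_a                      # Eq. (7)
    m = n * l_a * (53 + 20 * l_a)                   # Eq. (8), closed form
    p_uw = 3.2 * n * l_a * (1 + 2.85 * l_a)         # Eq. (9)
    return m_sort + m_exp, m, p_uw

term_sum, closed, p = batcher_budget(128)
print(f"{term_sum:,.0f} vs {closed:,} JJs; {p / 1000:.0f} mW")
# -> 170,240 vs 172,928 JJs (within ~2%), ~60 mW, matching Table III
```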

V. Shared Bus Switching Core

A. Top-level Design

The recently proposed shared bus time-division switch [16] seems to be the first RSFQ architecture of a bus-type fast-packet switching core (see, e.g., [17]). In the switch (Fig. 11), a tree-shaped bit concentrator picks up one bit at a time from one of N input controllers in the required order and places it into the corresponding slot of the shared communication bus, a serial-to-parallel converter. The order of the pickup is determined by the packet destination addresses, consisting of log₂N bits each, which are placed in advance into the input controllers. As soon as all N bits from the incoming channels are pushed into the converter, its contents are read out in parallel to the output channels.

The switch operation is controlled by three clock signals. The external, or "slow", clock TI runs at frequency f_e. It determines the packet arrival rate in each channel. The same clock signal (phase-shifted by 2π(log₂N)/N) controls the output serial-to-parallel interface S2P. Finally, the internal shared bus operates at a much higher frequency f_i = N f_e, which is produced by the frequency multiplier fmult. The switch has a limitation on the number of switched channels:

N ≤ f_i^max / f_e.    (10)

This constraint could be overcome by organizing several elementary N × N switches into more complex hierarchical (N² × N², etc.) switching fabrics, or by way of severe (by a factor of N) parallelizing of the switching core, which would lead, however, to a corresponding increase in hardware; a small feasibility calculation follows below.

The following example (Fig. 12) illustrates the time sequence of events in a 4 × 4 switching core. Each external "slow" clock produces 4 internal "fast" clocks. The first log₂N = 2 "slow" time slots are used to accept the physical address bits. Then the "load" signal is developed. During the next "slow" clock periods, incoming packets are transported bit-by-bit to the FIFO-like serial-to-parallel converter, one bit in each "fast" clock period. Finally, a "slow" clock reads out the contents of the converter to the outgoing channels. The timing is organized in a counterflow pipelined manner: the clock and the data always run in opposite directions. It takes at most (log₂N + N) "fast" clock periods to advance a bit from an input controller to the destination slot in the output converter, so the read-out signal should be delivered to the serial-to-parallel converter at the (log₂N)th "fast" clock period rather than immediately. This delay affects only the latency of the switch, not its throughput.
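Eq. (10) directly explains the shared-bus entries of Section VI; the sketch below (the helper name is ours) evaluates the port limit for the conservative f_i^max = 60 GHz in both benchmark environments.

```python
# Port-count limit of the flat shared-bus core, Eq. (10): the internal
# bus runs at f_i = N * f_e, so N cannot exceed f_i_max / f_e.

def max_ports(f_i_max_ghz: float, f_e_gbps: float) -> int:
    return int(f_i_max_ghz / f_e_gbps)

print(max_ports(60.0, 60.0))   # 1: the bit-serial case needs the full
                               #    N-fold parallelization of Section VI
print(max_ports(60.0, 2.44))   # 24: still short of 96 channels, hence
                               #     the "n/a" entries in Table IV
```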








Fig. 11. Shared bus switch (for the particular case N = 8).

"fast" clock "slow" clock address load input data pick-up from 1 pick-up from 2 pick-up from 3 pick-up from 4 output buffer

1 2 3 4 1 2 3 4

Fig. 12. Sample operating sequence of a 4 × 4 switching core. "address" is the address at any serial line; "load" is the signal at any "load" line.

B. Input Controller

The input controller performs buffering and routing functions simultaneously. It consists of three major parts (Fig. 13): a FIFO-type temporary address storage (TAS) organized as a serial-to-parallel converter, a modulo-N counter [9] with a loading interface (Cnt), and a one-bit data buffer (Buff). The temporary address storage sequentially accepts the address bits from input A, one bit per "slow" clock (TIS) cycle, and writes the whole address to the counter whenever the "load" (L) signal arrives. We can speak of two modes of the input controller: a programming mode and a counting mode. The signal L switches the input controller from the former mode to the latter explicitly, while the opposite transition is implicit and takes place when the last bit of the routed packet leaves the storage.


The "fast" clock permanently circulates in the counting loop, causing the generation of the "carry" signal each time the contents of all the upper T flip-flops change to zero. The period of this "carry" signal is N/f_i = 1/f_e, i.e., it coincides with the period of the "slow" clock, while its initial offset depends on A and, in turn, determines the number of the TDM slot assigned to the data bit stored in the Buff D flip-flop.

C. Bit Concentrator

The task of the bit concentrator is to synchronously multiplex packets arriving from different input controllers into a single TDM stream. As shown in Fig. 14, it is a full binary tree with height log₂N and width N, built up of RSFQ D flip-flops and asynchronous mergers (confluence buffers) [4]. The tree is controlled by the "fast" clock TI. Each clock signal advances a packet bit one stage towards the root. Because of the non-contending nature of the incoming packet stream, there is at most one valuable data bit in each column of the tree at a time.

D. Serial-to-Parallel Converter

The serial-to-parallel FIFO-type converter is used as the main shared bus of the switching core. It is very similar to the temporary address storage in the input controllers and is constructed of D2 cells in a similar fashion. The converter receives intermixed data bits from the serial input, one bit per "fast" clock period, and flushes the overall contents of the shift register to the output channels at the request of the "slow" clock.

E. Frequency Multiplier

As has been shown above, the shared bus switch needs three clock streams that differ in frequency and in phase. It takes only one clock stream from the outside, while the rest are generated in the internal frequency multiplier. The only difference between the multiplier used in our switch (Fig. 15) and that described in [9] is that the former also produces the third clock signal TOS_L, shifted by log₂N periods relative to the external clock TI. This signal is taken from the (log₂N − 1)th stage of the multiplier, and then its frequency is divided by 2.
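A behavioral model of the slot-assignment mechanism of subsections B and C (a sketch; the mapping slot = A mod N is our simplification of the counter-offset scheme described above).

```python
# One "slow" cycle of the shared bus: each input controller's preloaded
# modulo-N counter makes its "carry" fire in a distinct fast-clock slot,
# so every data bit lands in the right cell of the output converter.

def route_one_slow_cycle(addresses: list[int], bits: list[int]) -> list[int]:
    n = len(addresses)
    bus = [None] * n                   # the shared S2P converter
    for a, bit in zip(addresses, bits):
        slot = a % n                   # carry offset set by the address
        assert bus[slot] is None       # contention-free input assumed
        bus[slot] = bit
    return bus

# 4 x 4 example: input i sends bit i to output port addresses[i]
print(route_one_slow_cycle(addresses=[2, 0, 3, 1], bits=[1, 1, 0, 0]))
# -> [1, 0, 1, 0]: input 0's bit in slot 2, input 1's bit in slot 0, ...
```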


Fig. 13. Input controller for an 8 × 8 shared bus switch.


Fig. 14. Bit concentrator for a 4 × 4 shared bus switch.

F. Hardware and Power Accounting

The number of Josephson junctions needed for a flat N × N shared bus switch is calculated in the same way as in Section III and is given by the equations:

M_input.controller ≈ N (28 log₂N − 1),    (11)

M_bit.concentrator ≈ 8(N − 1),    (12)

M_S2P ≈ 8N,    (13)

M_fmult ≈ 12 + 7 log₂N,    (14)

M ≈ 28N log₂N − 12N + 7 log₂N + 4.    (15)

The dissipated power estimate is:

P ≈ 4.3N log₂N + 2.5N + 1.2 log₂N + 1.6 μW.    (16)

Both M and P must be multiplied by N in the case of a parallelized switching core.

VI. Discussion

We have estimated the complexity and performance of RSFQ implementations of several ultra-wide-band digital switching cores: the crossbar, Batcher-banyan, and shared bus time-division switching cores, as functions of the number of channels N. Now let us consider two numerical examples: the so-called "optical", or "optimistic", case (128 × 128 bit-serial 60 Gbps channels), and the "semiconductor", or "realistic", case (96 × 96 32-bit-parallel 2.44 Gbps channels), for each of the switches. The packet length necessary for the crossbar-switch calculation is selected to be 424 bits (the ATM standard). The results are summarized in Tables III and IV.

TABLE III
NUMBER OF JOSEPHSON JUNCTIONS (M) AND DISSIPATED POWER (P) FOR 128 × 128 BIT-SERIAL SWITCHING CORES

Switching core type | M | P, mW
Crossbar | 1,105,000 | 216
Batcher-banyan | 173,000 | 60
Shared bus | 3,020,000 | 520

TABLE IV
NUMBER OF JOSEPHSON JUNCTIONS (M) AND DISSIPATED POWER (P) FOR 96 × 96 32-BIT-PARALLEL SWITCHING CORES

Switching core type | M | P, mW
Crossbar | 733,000 | 142
Batcher-banyan | 170,000 | 45
Shared bus | n/a | n/a
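For reference, the following script recomputes both tables from Eqs. (4)-(5), (8)-(9), and (15)-(16). The script is ours: the shared-bus entry includes the N-fold parallelization forced by Eq. (10), and the 32-bit case adds the 13MN-junction, 1.7MN-μW converter overhead quoted in Section IV, so small deviations from the printed power figures are expected.

```python
# Cross-check of Tables III and IV against the paper's estimates.

import math

def crossbar(n, l):
    la = math.ceil(math.log2(n))
    return 2*n*(11 + 31*la + 22*n + 3*l), 2*n*(4.5*la + 4.1*n + 0.7*l + 2)

def batcher(n):
    la = math.ceil(math.log2(n))
    return n*la*(53 + 20*la), 3.2*n*la*(1 + 2.85*la)

def shared_bus(n):
    la = math.ceil(math.log2(n))
    return 28*n*la - 12*n + 7*la + 4, 4.3*n*la + 2.5*n + 1.2*la + 1.6

def show(name, m, p_uw):
    print(f"{name}: {m:,.0f} JJs, {p_uw / 1000:.0f} mW")

# Table III: 128 bit-serial 60 Gbps channels, ATM cells (L = 424 bits)
show("crossbar 128", *crossbar(128, 424))          # ~1.10M JJs, ~219 mW
show("Batcher-banyan 128", *batcher(128))          # ~173k JJs, ~60 mW
m, p = shared_bus(128)
show("shared bus 128 (xN)", 128 * m, 128 * p)      # ~3.02M JJs, ~535 mW

# Table IV: 96 32-bit channels; add the 13MN / 1.7MN converter overhead
cm, cp = 13 * 32 * 96, 1.7 * 32 * 96
m, p = crossbar(96, 424)
show("crossbar 96", m + cm, p + cp)                # ~733k JJs, ~144 mW
m, p = batcher(96)
show("Batcher-banyan 96", m + cm, p + cp)          # ~170k JJs, ~50 mW
```

The junction counts agree with the tables almost exactly; the computed powers land within a few milliwatts of the printed values.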

The tables show that the Batcher-banyan switching core is probably the best candidate for RSFQ implementation. In both cases such a circuit could fit onto a single 1 cm² RSFQ chip. In fact, in a typical modern RSFQ design for the 3.5 μm Hypres technology, 67 junctions occupy 0.13 mm², i.e., the density is close to 50,000 junctions/cm² [?]. Hence, taking into account the contact pin area and scaling to a 1.5 μm technology, a 1 cm² chip can accommodate a circuit with more than 180,000 junctions. The estimated dissipated power for the core may be as low as 45 mW at an internal clock frequency of 60 GHz, or ≈ 0.5 mW/channel. This is almost 4 orders of magnitude less than the power dissipated by similar devices fabricated using high-speed semiconductor [1] and photonic [2] technologies.

To summarize, we have compared the proposed RSFQ switching cores to systems using semiconductor and photonic technologies, and have found that the RSFQ approach would dissipate significantly less power per switched channel than semiconductor and optical devices, at no extra cost in chip area. Moreover, RSFQ switching cores can manipulate ultra-wide-band (≈ 60 Gbps) data streams that cannot be processed by any other integrated circuits.

Appendix A. The D2 Cell

The function of the D2 ("double D flip-flop") cell is to keep its internal state (as a "regular" D flip-flop would) and report it to one of the callers. Fig. 16 shows the schematics of an RSFQ D2 cell. It consists of a quantum interferometer (J7-L-J2-J4) and two comparators, J4/J5 and J2/J6. Input S can be used to write an SFQ into the interferometer via Josephson junction J7. Thereafter, the circulating current will subcritically bias junctions J2 and J4, so that sending an SFQ pulse to input R1 or R2 will cause a switching of J2 and J3, or J4 and J1, respectively, and the SFQ pulse will propagate to the corresponding output O1 or O2. Simultaneously, the circuit will be reset to the initial state.


Fig. 15. Frequency multiplier for an 8 × 8 shared bus switch.


Fig. 16. a) Notation and b) schematics of the RSFQ D2 cell.

Appendix B. Hardware and Dissipated Power Estimates for Primitive RSFQ Cells

In order to estimate the hardware complexity of a certain RSFQ circuit, we must first decompose it into primitive RSFQ cells that have a well-known low-level structure, and then count the overall number of Josephson junctions and the overall dissipated power (Table V).

TABLE V
NUMBER OF JOSEPHSON JUNCTIONS N_JJ AND ESTIMATED POWER DISSIPATION P FOR PRIMITIVE RSFQ CELLS

Cell name | N_JJ | P, μW
Single grounded Josephson junction | 1 | 0.34
Splitter | 1 | 0.34
Diode, or buffer key | 2 | 0.34
NDRO | 2 or 3 | 0.68
D flip-flop | 3 | 0.68
Muller C cell | 3 | 1.02
Merger, or confluence buffer | 4 | 0.68
T flip-flop | 4 | 0.68
SR flip-flop | 4 | 0.68
Inverter | 5 | 0.68
D flip-flop w/reset | 6 | 0.68
SR flip-flop with two S inputs | 6 | 1.02
XOR | 6 | 1.02
D flip-flop w/complementary outputs | 7 to 12 | 1.36
NDRO (1-bit register) | 7 | 0.68
D2 cell | 7 | 0.68
AND | 9 | 1.7
Demultiplexer | 11 | 1.7
XNOR | 9 | 0.68
X cell | 9 | 0.68
T-RS flip-flop | 12 | 1.36
β-element w/inductive coupling | 25 | 4.08

To estimate the minimum dissipated power, the number of "grounded" Josephson junctions is counted and multiplied by the power P0 dissipated in each such junction. For the 1.5 μm technology, with an average critical current I_c = 0.2 mA, a clock frequency f_i ≈ 60 GHz, and a dc bias voltage U = 2.4 mV (to guarantee that cell margins will not decrease by more than 5%), P0 ≈ 340 nW [7].
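The P0 figure can be reproduced with a one-line estimate, under the assumption (ours, not spelled out in the text) that the static dissipation per grounded junction equals the bias voltage times a bias current of about 0.7 I_c.

```python
# Back-of-the-envelope check of P0 ~ 340 nW for the 1.5 um technology.

I_C = 0.2e-3          # A, average critical current
U_BIAS = 2.4e-3       # V, dc bias voltage
BIAS_FRACTION = 0.7   # typical RSFQ bias point (our assumption)

p0 = BIAS_FRACTION * I_C * U_BIAS
print(f"P0 = {p0 * 1e9:.0f} nW")   # 336 nW, consistent with the quoted 340 nW
```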

Acknowledgment

This research has been inspired by discussions with Dr. E. Wikborg (LM Ericsson, Sweden). The authors would like to thank P. Bunyk, Dr. V. Semenov, Dr. F. Bedard, Dr. D. Smith, and Dr. L. Wittie for numerous helpful hints and fruitful discussions.

References

[1] T. Banwell, R. Estes, S. Habiby, G. Hayward, T. Helstern, et al., "Physical design issues for very large ATM switching systems," IEEE Trans. on Selected Areas in Commun., vol. 9, pp. 1227-1238, Oct. 1991.

[2] S. Hino, M. Togashi, and K. Yamasaki, "Asynchronous transfer mode switching LSIs with 10 Gbit/s serial inputs and outputs," in 1994 Symposium on VLSI Circuits, Digest of Technical Papers, pp. 73-74, 1994.

[3] K. Kitayama, "Ultrafast photonic asynchronous transfer mode switch based upon parallel signal processing," Optical Review, Sample Issue, pp. 1-6, 1994.

[4] K. Likharev and V. Semenov, "RSFQ logic/memory family: a new Josephson junction technology for sub-terahertz clock frequency digital systems," Applied Superconductivity, vol. 1, pp. 3-28, Mar. 1991.

[5] K. Likharev, "Rapid single-flux-quantum logic," in The New Superconducting Electronics (H. Weinstock and R. Ralston, eds.), pp. 423-452, Dordrecht, The Netherlands: Kluwer, 1993.

[6] P. Bunyk, V. Semenov, A. Oliva, M. Bhushan, K. Likharev, and J. Lukens, "High-speed single-flux quantum circuit using planarized niobium trilayer Josephson junction technology," Appl. Phys. Lett., vol. 66, pp. 646-648, Jan. 1995.

[7] A. Rylyakov and S. Polonsky, "New design of single-bit all-digital RSFQ autocorrelator," IEEE Trans. on Appl. Supercond., vol. 7, pp. 2709-2712, June 1997.

[8] B. Lee, M. Kang, and J. Lee, Broadband Telecommunications Technology. Norwood, USA: Artech House, 1993.

[9] J.-C. Lin and V. Semenov, "Timing circuits for RSFQ digital systems," Applied Superconductivity, vol. 5, pp. 3472-3477, Sept. 1995.

[10] S. Polonsky, V. Semenov, and A. Kirichenko, "Single flux quantum B flip-flop and its possible applications," Applied Superconductivity, vol. 4, pp. 9-18, Mar. 1994.

[11] K. Batcher, "Sorting networks and their applications," in AFIPS Proc. of Spring Joint Comput. Conf., pp. 307-314, 1968.

[12] M. Hosoya, W. Hioe, S. Kominami, H. Nagaishi, and T. Nishino, "Superconducting packet switch," in 5th Int'l Supercond. Electronics Conf. Extended Abstracts, pp. 37-39, Sept. 1995.

[13] D. Zinoviev and O. Mukhanov, "Novel tri-stable elements for binary RSFQ circuitry," IEEE Trans. on Appl. Supercond., vol. 5, pp. 2984-2987, June 1995.

[14] A. Acampora, An Introduction to Broadband Networks: LANs, MANs, ATM, B-ISDN, and Optical Networks for Integrated Multimedia Telecommunications. New York: Plenum Press, 1994.

[15] A. H. Worsham, J. X. Przybysz, J. Kang, and D. L. Miller, "A single flux quantum cross-bar switch and demultiplexer," IEEE Trans. on Appl. Supercond., vol. 5, pp. 2996-2999, June 1995.

[16] D. Zinoviev, "Rapid single flux quantum fast-packet switching element," in Emerging High-Speed Local-Area Networks and Wide-Area Networks (K. Annamalai, K. Bala, C. Traw, and R. Bianchini, Jr., eds.), Proc. SPIE 2608, Philadelphia, USA, pp. 190-196, SPIE, Oct. 1995.

[17] P. Newman, "ATM technology for corporate networks," IEEE Communications Magazine, pp. 90-101, Apr. 1992.