OBIG: The Architecture of an Output Buffered Switch with Input Groups for Large Switches

Wladek Olesinski, Hans Eberle, and Nils Gura
Sun Microsystems Laboratories, 16 Network Circle, Menlo Park, CA 94025, USA
{wladek.olesinski, hans.eberle, nils.gura}@sun.com

Abstract—Large, fast switches require novel approaches to architecture and scheduling. In this paper, we propose the Output Buffered Switch with Input Groups (OBIG). We present simulation results, discuss the implementation, and show how our architecture can be used to build single-stage (flat) switches with multi-terabit-per-second throughput and hundreds of ports.

Keywords—switching, routing
I. INTRODUCTION

Large switches with hundreds of ports and terabit-per-second throughput need scalable architectures. Switches often rely on crossbars – matrices of switching elements that transfer packets from N inputs to N outputs. Crossbars are attractive for relatively small switches, but they do not scale well because the number of crossbar elements grows quadratically with the number of ports. Scheduling in large switches is also hard: the time to compute a schedule grows with the number of ports.

To relax the requirements imposed on the scheduler, a solution called a buffered crossbar [7] adds a buffer to every crosspoint. This approach does not scale well because of the high memory capacity required by N² buffers. Another approach, a load-balanced switch [8], simplifies the scheduling problem by distributing the switching among three stages. The first stage evenly distributes cells among second-stage queues, which then forward cells to destination output ports in the third stage. Both switching stages operate according to a fixed schedule, removing the need for a scheduler. This solution scales better than others but suffers from several problems: high average latencies, out-of-order delivery of cells, and difficulties with the addition and removal of line cards.

Switches are often scaled up by using Clos networks [3]. In comparison with a crossbar switch, a non-blocking Clos network reduces the number of crosspoints and still allows for conflict-free forwarding of cells through the switch. However, such architectures have shortcomings such as high latency, intricate wiring between switch elements, and more complex routing. Given these shortcomings, we want to avoid multi-stage architectures and maintain a flat (single-stage) architecture whenever possible. Only when the application demands a
larger switch do we want to resort to Clos networks or other multi-stage solutions.

More recently, schemes have been proposed, such as PCIQ [9] and CIXOB-k [11], that partition a crossbar into several smaller crossbars. Our architecture involves a similar approach.

In this paper, we examine how to build a large, flat switch fabric with 256 ports and an aggregate bandwidth of 2.5 Tbps. To accomplish this, we propose the Output Buffered Switch with Input Groups (OBIG) architecture, which exploits a recently developed, novel chip interconnect technology called Proximity Communication. Proximity Communication relies on capacitive coupling between overlapping chips and provides two orders of magnitude higher bandwidth than traditional chip packaging technologies such as area ball bonding [4,6]. Notably, our architecture does not have to rely on Proximity Communication to offer some improvement over existing solutions. However, at this point, a truly large switch fabric with hundreds of ports is possible only with Proximity Communication.

Our paper is organized as follows. The next section briefly introduces Proximity Communication. Section III describes the OBIG architecture and discusses flow control, the scheduler, and the implementation. We also show how our solution scales considering current semiconductor technologies. In Section IV, we present simulation results for several types of traffic, compare the OBIG architecture with PCIQ proposed in [9], and look at the performance of a 256-port, 2.5 Tbps switch. Section V briefly describes how our switch can be built without Proximity Communication. Section VI discusses how OBIG can be tailored to different applications. Finally, Section VII presents a plan for future work, and Section VIII concludes the paper.
II. PROXIMITY COMMUNICATION
Traditionally, components such as memories, CPUs, and other chips are physically connected using macroscopic features such as pins, ball bonds, and solder bumps. These structures, however, are massive compared to the submicron features on the chip itself. Proximity Communication technology uses microscopic pads constructed out of standard top-layer metal structures during chip fabrication. These pads are then sealed, together with the rest of the chip components, under a micron-thin layer of insulator to protect
the chip from static electricity. Two chips, with receiver and transmitter pads, are then placed facing each other such that the pads are only a few microns apart (Figure 1). Each transmitter-receiver pad pair forms a plate capacitor, and voltage changes on the transmitter pad cause voltage changes on the receiver pad by means of capacitive coupling.
Figure 1. Proximity Communication.
Proximity Communication provides two orders of magnitude improvement in chip I/O density, reduced cost, and lower energy, making it possible to transfer tens of terabits per second into and out of a single VLSI chip. In contrast, current chip I/O technologies are limited to a few hundred gigabits per second. Proximity Communication has been demonstrated with a test chip manufactured in 180 nm technology providing 430 Gbps per mm² [6].
III. OUTPUT BUFFERED SWITCH WITH INPUT GROUPS (OBIG)
A. General OBIG Architecture

A logical view of the switch is shown in Figure 2. There are M crossbars, each with a separate set of inputs and shared outputs. A high-level physical view of the vector arrangement is shown in Figure 3. The switch consists of M identical chips arranged in a vector and connected via Proximity Communication links. Each link implements point-to-point connections with Proximity Communication interfaces at both ends. Finally, a concrete 12-port, 3-chip example is depicted in Figure 4. Note that we assume that external connections are implemented with conventional high-speed Serializer/Deserializer (SerDes) technology, and inter-chip connections are implemented with Proximity Communication.
Every chip has K input ports and K output ports, which gives a total of N = KM ports for the entire switch. Chip 1 has ports numbered 1...K, chip 2 has ports numbered K+1...2K, and so on. In general, chip C has ports numbered (C-1)K+1...CK, for 0 < C ≤ M.

Every chip implements M K×K crossbars, with K columns corresponding to input ports and K rows (buses) used to deliver cells locally or to a remote chip connected via Proximity Communication links (Figure 4). Crosspoints are denoted by pairs (c, r), where c is a column number, r is a row number, and 0 < c ≤ N, 0 < r ≤ N. Every crosspoint contains a switch that enables a flow of traffic from row to column.

Every crossbar has K buffers (also referred to as "output buffers"), one for each column. The buffers are numbered B(C, c, m), where C is a chip number, c is a column number, and 0 < m ≤ M is a crossbar number within a chip, counting from top to bottom. Each buffer collects cells received from the K rows of a given crossbar; the row is selected by a multiplexer at the input of the buffer. A chip has a total of KM = N buffers. Each chip also has its own independent input scheduler that provides a maximal matching between the K local input ports and the N buffers (as described in detail in Section III.C). Chip C with ports numbered (C-1)K+1...CK uses bus lines (C-1)K+1...CK to forward cells to any of the output ports.

Let us trace the path of a cell in chip C that was selected by the scheduler for forwarding from input port s to output port d (note that (C-1)K+1 ≤ s ≤ CK and 0 < d ≤ N). Using column s, the cell is first transmitted to row s and then forwarded to destination column d. From crosspoint (d, s), the cell is forwarded to buffer B(D, d, m), where D is the chip that contains column d and m is the crossbar fed by chip C's rows. Later, the output scheduler removes this cell and forwards it to the output port at column d.

For example, assume that in an N = 12 port switch using M = 3 chips with K = 4 ports per chip (Figure 4), the scheduler in chip 1 selected a cell from input port s = 3 to be forwarded to output port d = 7 in chip 2. This cell first goes to row s = 3 and then to destination column d = 7. From crosspoint (7,3), the cell is forwarded to buffer B(2,7,1), from which it is eventually removed and forwarded to the output port.
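To make the addressing concrete, here is a minimal Python sketch that maps a cell from input port s to output port d onto the chip, crosspoint, and buffer it traverses. The function name route_cell is ours, and the assumption that the crossbar index within the destination chip equals the source chip number is inferred from the two worked examples rather than stated explicitly in the text.

def route_cell(s: int, d: int, K: int, M: int):
    """Trace a cell from input port s to output port d in an OBIG switch
    with M chips and K ports per chip (ports are numbered 1..N = K*M).

    Returns the source chip, the crosspoint used, and the output buffer
    B(D, d, m) that receives the cell.  The crossbar index m is assumed
    to equal the source chip number, matching the paper's 12-port
    examples (B(2,7,1) and B(1,1,2)).
    """
    N = K * M
    assert 1 <= s <= N and 1 <= d <= N

    src_chip = (s - 1) // K + 1   # chip C that owns input port s
    dst_chip = (d - 1) // K + 1   # chip D that owns output port d (column d)

    row = s               # cell is injected on column s, then driven onto row s
    column = d            # row s crosses column d at crosspoint (d, s)
    crossbar = src_chip   # assumed: chip C's rows feed crossbar C on chip D

    return {
        "source_chip": src_chip,
        "crosspoint": (column, row),          # (c, r) notation from the text
        "buffer": (dst_chip, d, crossbar),    # B(D, d, m)
    }


# 12-port example from the text: N = 12, M = 3 chips, K = 4 ports per chip.
print(route_cell(s=3, d=7, K=4, M=3))   # expects buffer B(2, 7, 1)
print(route_cell(s=6, d=1, K=4, M=3))   # expects buffer B(1, 1, 2)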
Figure 2. Logical view of the OBIG architecture.
Figure 3. Vector of M chips connected with Proximity Communication (PxIO) links.
Figure 4. Example OBIG architecture with N=12 ports and M=3 chips.
In another example of the same 12-port switch, a cell from input port s = 6 (chip 2) addressed to output port d = 1 (chip 1) is first forwarded to row s = 6 and then to destination column d = 1. From crosspoint (1,6), the cell is forwarded to buffer B(1,1,2), from which it is eventually removed and forwarded to the output port.

Every chip has one output scheduler per output port (i.e., per column). This simple scheduler removes cells from the M buffers in the column in a round-robin fashion. As mentioned above, every chip also has one input scheduler that gathers and grants or rejects requests from the local input ports.

Note that in this architecture, groups of input ports share common buffers. Within the buffers, there is no static memory allocation per input port. Instead, each buffer stores in a shared space all cells flowing from the corresponding group of K input ports, which helps better utilize the available memory. For example, buffer B(2,7,1) is shared by traffic flowing from input ports 1, 2, 3, and 4 (see Figure 4). The output scheduler for a given output port (a column) drains the buffers, effectively dividing the bandwidth between groups of input ports.

Summarizing, this architecture divides one large N×N crossbar into several smaller K×K crossbars, and one N×N scheduler into several K×N schedulers. All these elements are distributed over multiple chips. This lowers the memory requirements per chip and greatly simplifies the schedulers, which arbitrate between just a subset of the total number of input ports.

B. Flow Control

To avoid buffer overflow, flow control needs to be implemented between the input scheduler and the buffers. Congestion at buffers on the local or on a remote chip must be communicated to the input scheduler. Flow control can be either credit-based or xon/xoff. Assuming a simple xon/xoff scheme, we calculated the minimum size of an output buffer necessary to avoid cell and bandwidth loss. Cell loss would occur due to cells in transit that arrived at a full buffer after backpressure was asserted; bandwidth loss would occur when backpressure was lifted and the buffer drained completely before new cells started arriving again from the backpressured input. According to our calculations, skipped here for the sake of brevity, the memory requirements for the 256-port, 2.5 Tbps switch described below can be met using 90 nm chip technology.
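As an illustration of how the per-column output scheduler and a simple xon/xoff scheme could fit together, the sketch below models one output column with M shared buffers drained in round-robin order. Since the paper omits its buffer-sizing calculation, the class name, buffer size, round-trip time, and thresholds are illustrative assumptions that only mirror the stated reasoning: keep one round-trip of headroom before asserting xoff, and lift xoff while at least one round-trip of cells remains.

from collections import deque

class ColumnOutput:
    """Sketch of one output column: M shared buffers (one per crossbar),
    drained round-robin, with xon/xoff backpressure per buffer.

    The thresholds follow the reasoning in the text: xoff must leave room
    for cells already in flight (one round-trip time, RTT), and xon must
    be raised while enough cells remain to cover the RTT so the buffer
    does not drain empty before the backpressured inputs restart.  The
    RTT and buffer size here are placeholders, not the paper's numbers.
    """

    def __init__(self, M: int, buf_cells: int, rtt_cells: int):
        self.buffers = [deque() for _ in range(M)]
        self.xoff = [False] * M
        self.buf_cells = buf_cells
        self.xoff_threshold = buf_cells - rtt_cells  # headroom for in-flight cells
        self.xon_threshold = rtt_cells               # refill before running dry
        self.rr = 0  # round-robin pointer over the M buffers

    def enqueue(self, m: int, cell) -> None:
        """Cell arriving from input group m (crossbar m)."""
        assert len(self.buffers[m]) < self.buf_cells, "sizing failed: cell loss"
        self.buffers[m].append(cell)
        if len(self.buffers[m]) >= self.xoff_threshold:
            self.xoff[m] = True   # backpressure input group m

    def dequeue(self):
        """Output scheduler: serve the M buffers in round-robin order."""
        for i in range(len(self.buffers)):
            m = (self.rr + i) % len(self.buffers)
            if self.buffers[m]:
                cell = self.buffers[m].popleft()
                if self.xoff[m] and len(self.buffers[m]) <= self.xon_threshold:
                    self.xoff[m] = False  # lift backpressure early enough
                self.rr = (m + 1) % len(self.buffers)
                return cell
        return None  # column idle


# Example with placeholder parameters: M=3 input groups, 64-cell buffers,
# and an assumed round trip of 8 cell slots between buffer and scheduler.
col = ColumnOutput(M=3, buf_cells=64, rtt_cells=8)
col.enqueue(0, "cell-A")
col.enqueue(1, "cell-B")
print(col.dequeue(), col.dequeue(), col.dequeue())  # cell-A cell-B None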
C. Input Scheduler

Note that for large K, the input scheduler can be difficult to implement because it has to find matchings between K input ports (each with N VOQs) and N output ports. However, with the application of our new Parallel Wrapped Wave Front Arbiter with Fast Scheduler (PWWFA-FS), which will be described in a separate paper, the input scheduler can easily handle the volume of incoming cells even in a very large switch.

The PWWFA-FS is based on the PWWFA scheduler that we described in [10]. PWWFA is a parallel version of the well-known Wrapped Wave Front Arbiter (WWFA) [12]. The PWWFA arbiter consists of a matrix of elements that maintain and process requests for outputs. With several "waves" of processing performed concurrently, the arbiter can schedule cells for multiple future slots. The PWWFA-FS scheme augments schedules calculated by PWWFA with grants to requests generated by cells that arrived most recently. This significantly decreases delays under light load and enables the efficient implementation of a centralized arbiter applicable to very large switches. With such small delays, our architecture is particularly attractive for applications that require very low latencies, such as cluster interconnects. Note, however, that any other scheduler capable of arbitrating between large numbers of input ports can be used in the OBIG architecture.

D. Scalability

Let us elaborate on the maximum size of a flat switch using the OBIG architecture. Assume a chip area of 100 mm², out of which 50% is allocated for buffers of size b = 4 kB that compensate for flow control. A 4 kB memory requires about 0.16 mm² of die area (synthesized with an SRAM generator tool from Artisan for TSMC's 90 nm process [2]). Hence, the total memory capacity per chip is B = (50/0.16) × 4 kB ≈ 1.22 MB. If we distribute a crossbar among M chips connected by Proximity Communication links, there are K = N/M ports per chip and KN = N²/M crosspoints per chip, but only KM = N buffers in every chip (see Figure 4). They need a total memory capacity of Nb, which must be less than B to fit on a chip. Thus, N ≤ B/b.
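As a back-of-the-envelope check of this bound, the sketch below redoes the arithmetic with the figures quoted above (100 mm² die, 50% of the area for buffers, 4 kB buffers at roughly 0.16 mm² each). The function name and its defaults are ours; the resulting limit of roughly 312 buffers follows from these numbers and is comfortably above the 256-port configuration evaluated later in the paper.

def max_flat_ports(chip_area_mm2: float = 100.0,
                   buffer_area_fraction: float = 0.5,
                   area_per_buffer_mm2: float = 0.16) -> int:
    """Upper bound on N for a flat OBIG switch, from the constraint
    N * b <= B, where B is the per-chip buffer memory capacity.

    Each chip holds N output buffers of b kB each, so the flat switch can
    grow only until the N buffers no longer fit in the buffer area.
    Default parameters are the 90 nm figures quoted in the text.
    """
    buffer_area = chip_area_mm2 * buffer_area_fraction       # 50 mm^2 for buffers
    num_buffers_fitting = buffer_area / area_per_buffer_mm2  # equivalent to B / b
    return int(num_buffers_fitting)


# With the quoted figures, B = (50 / 0.16) * 4 kB ~= 1.22 MB, and the bound
# N <= B/b works out to roughly 312 buffers, i.e. on the order of 300 ports.
print(max_flat_ports())  # ~312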