Experimental Demonstration of LIONS: A Low Latency Optical Switch for High Performance Computing

Yawei Yin, Roberto Proietti, Xiaohui Ye, Runxiang Yu, Venkatesh Akella and S. J. B. Yoo
Dept. of Electrical and Computer Engineering, Univ. of California, Davis, One Shields Ave., Davis, CA 95616
{yyin, sbyoo}@ucdavis.edu

Abstract—This paper experimentally demonstrates a low latency optical switch prototype leveraging an arrayed-waveguide-grating router (AWGR) based switching fabric and fast tunable lasers. The experimental results collected from the testbed are analyzed in detail and then compared with the data generated by our network-level switch simulator, verifying the accuracy of the simulator. The simulator is then used to project the switch performance to high port counts.

Keywords—AWGR; Optical Interconnects; Optical Switch; Optics in Computing; Experiments
I. INTRODUCTION
Scalable, low-latency and power-efficient switches supporting data centers and high-performance computing are highly desirable in light of the massive worldwide deployment of petaFLOP- and petabyte-scale computing systems. The critical performance bottleneck has shifted from the computing systems to the interconnection infrastructure, since traditional electrical interconnection networks (e.g., Fat Tree, Clos, Flattened Butterfly) struggle to scale in terms of end-to-end latency, throughput, and power consumption. Optical interconnects, on the other hand, offer promising solutions based on their inherent wavelength division multiplexing (WDM) capabilities, which can not only scale the line rate to tens or hundreds of gigabits per second but also relieve the severe I/O limitations of today's electrical data communication systems. Contention in the switching fabric and head-of-line blocking in queues, commonly seen in electronic switches, can also potentially be resolved by introducing optical parallelism at the input, output, and loopback buffering queues.
Among the proposed optical switching architectures, the Optical Shared Memory Supercomputer Interconnect System (OSMOSIS) [1, 2], Data Vortex [3, 4] and the Low-latency Interconnect Optical Network Switch (LIONS, previously named DOS) [5] are three pioneering architectures that have attracted a fair amount of attention. The simulation comparison in [5] shows that LIONS provides low-latency, high-throughput switching and does not saturate even at very high (approximately 90%) input load. In addition, the latency of LIONS is almost independent of the number of input ports, so it outperforms OSMOSIS and Data Vortex in terms of latency scalability and fairness.

In this paper, we experimentally demonstrate a 4x4 LIONS switch prototype with the shared loopback buffer architecture. The statistical data collected from the testbed are analyzed and broken down into detailed components. By feeding the delay parameters measured from the testbed back into the system-level simulator used in [5], we then compare the experimental results with the corresponding simulation results. The match between the two verifies the correctness and accuracy of the simulator and gives us confidence in projecting the switch performance to high port counts.
II. EXPERIMENTAL TESTBED
Figure 1 shows the testbed setup of the 4x4 LIONS optical switch. A 32x32, 50 GHz-spacing AWGR forms the core of the switch architecture. The switch also includes wavelength converters (WCs) based on cross-phase modulation (XPM) in a semiconductor optical amplifier Mach-Zehnder interferometer (SOA-MZI). Each WC accepts one continuous-wave (CW) input signal from a tunable laser diode (TLD) board. The TLDs guarantee nanosecond-scale switching times over the C band with a wavelength accuracy of 0.02 nm. By reading the 5-bit parallel control signals coming from the FPGA-based control plane, each TLD board tunes its wavelength according to a routing table stored on a CPLD chip mounted on the board itself. An optical path is thus established between each AWGR input and output on a packet basis, according to the destination address carried by each packet label.
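To make the wavelength selection concrete, the sketch below shows how a routing table of this kind could map a (switch input, destination output) pair to a TLD wavelength index and the corresponding 5-bit control word. It assumes the well-known cyclic routing property of an NxN AWGR; the exact table stored in the CPLD is not given in the paper, so the mapping and helper names here are illustrative assumptions rather than the implemented firmware:

# Illustrative sketch of an AWGR routing-table lookup (not the actual CPLD contents).
# Assumption: a cyclic-routing AWGR in which wavelength index k steers input i to
# output (i + k) mod N. The real device may use a different (but fixed) permutation.

AWGR_SIZE = 32          # 32x32 AWGR used in the testbed
CONTROL_BITS = 5        # width of the parallel control bus to each TLD board

def wavelength_index(input_port: int, output_port: int, n: int = AWGR_SIZE) -> int:
    """Return the wavelength index that routes input_port to output_port."""
    return (output_port - input_port) % n

def tld_control_word(input_port: int, output_port: int) -> str:
    """Encode the wavelength index as the 5-bit parallel control word sent to the TLD board."""
    k = wavelength_index(input_port, output_port)
    assert k < 2 ** CONTROL_BITS, "wavelength index must fit in the control bus"
    return format(k, f"0{CONTROL_BITS}b")

if __name__ == "__main__":
    # Example: a packet entering switch input 1 and destined to output 3.
    print(tld_control_word(1, 3))   # -> '00010'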
Figure 1. LIONS Testbed

This work was supported in part by the Department of Defense through contract #H88230-08-C-0202 and Google Research Awards.
Figure 2. LIONS Testbed (a) End Hosts, (b) AWGR switching fabric and FDLs, (c) Control plane and loopback buffers
The control plane and loopback buffers are implemented using a Xilinx Virtex 5 FPGA ML523 Characterization Platform, which can instantiate 8 high-speed RocketIO GTP tile transceivers connected to 16 pairs of differential SMA connectors. Four pairs of transceivers are used as the 4 channels of the control plane, each of which receives the labels from one end host. The control plane reads each label, checks a built-in lookup table for the input/output mapping, and then, after arbitration, generates the 5-bit control signals for the TLD boards. Contended packets that fail to win arbitration are directed to the inputs of the loopback buffer, which is implemented using another 4 pairs of RocketIO transceivers on the same board. The buffered packets keep requesting the arbiter in the control plane until they are granted. The LIONS testbed is also equipped with two Virtex 5 FPGA platforms that emulate a multiprocessor parallel computing system acting as the 4 end hosts. Four MicroBlaze soft processor cores [6] were instantiated on the Virtex 5 FPGAs with MPI interfaces capable of performing Remote Direct Memory Access (RDMA) operations. The generated data are first written to a BRAM block on the FPGA and then moved into the RocketIO transmitter output queue using a direct memory access (DMA) operation, as shown in Fig. 3. The packets are encapsulated there and sent out by the RocketIO at 1.25 Gbps after serialization. In the receiving direction, the received data packets are moved directly from the RocketIO input queue to the on-board DDR2 SDRAM memory using a DMA operation. The stored data are then analyzed by the MicroBlaze processor.
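As a rough illustration of the control-plane behavior described above, the following sketch models one arbitration round: labels arriving from the hosts are grouped by requested output, at most one requester per output wins a grant, and the losers are redirected to the loopback buffer to retry later. The data structures, function names, and fixed-priority policy are assumptions for illustration; the real logic runs in FPGA fabric, not software.

# Minimal software model of one control-plane arbitration round (illustrative only;
# the actual control plane is implemented in the Virtex 5 FPGA).

from collections import defaultdict

def arbitrate(requests):
    """requests: list of (input_port, output_port) pairs parsed from packet labels.
    Returns (grants, losers): grants maps an output port to the single winning input;
    losers are the contended requests redirected to the loopback buffer."""
    by_output = defaultdict(list)
    for inp, out in requests:
        by_output[out].append(inp)

    grants, losers = {}, []
    for out, inputs in by_output.items():
        winner = min(inputs)          # assumption: fixed-priority arbitration
        grants[out] = winner
        losers.extend((inp, out) for inp in inputs if inp != winner)
    return grants, losers

if __name__ == "__main__":
    # Hosts 0 and 2 both request output 3 -> one grant, one packet loops back.
    grants, losers = arbitrate([(0, 3), (2, 3), (1, 1)])
    print(grants)   # {3: 0, 1: 1}
    print(losers)   # [(2, 3)]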
III. STATISTICS COLLECTION
The testbed implementation was limited by the available off-the-shelf components. The line rate of the testbed was limited to 1.25 Gbps because the commercially available burst-mode clock and data recovery (BM-CDR) modules run at this line rate. A 10 Gbps BM-CDR that can be interfaced with the FPGA boards is currently under development. The implementation complexity and cost of LIONS with distributed loopback buffers and mixed loopback buffers [7] also restrict the range of experimental data we can obtain, leaving us with the low-cost and most feasible option: LIONS with the shared loopback buffer architecture (Fig. 2).

End-to-end latency is an important performance parameter. To measure it, a synthetic traffic model was first used in the testbed: the data streams at each host were encapsulated into fixed-size packets with uniformly random destination addresses. Each packet carries a 5-byte header (2-byte preamble, 1-byte destination address, 1-byte source address, and 1-byte packet length). Different offered loads can be generated by changing the guard time between packets. Note that a minimum guard time of 17 bytes has to be guaranteed due to hardware constraints in the testbed (i.e., the worst-case TLD tuning time, the burst-mode receiver settling time, the comma-alignment delay in the SERDES, etc.). Since the traffic is uniformly randomly distributed, the end-to-end latency statistics can be collected at any of the output ports; accordingly, only host 2 was used to collect data.
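The relationship between guard time and offered load in this traffic model can be sketched as follows. The packet framing (5-byte header plus payload) follows the description above, while the preamble pattern, payload size, and function names in the example are illustrative assumptions:

import random

HEADER_BYTES = 5        # 2-byte preamble + destination + source + length
MIN_GUARD_BYTES = 17    # worst-case TLD tuning + BM-Rx settling + comma alignment

def make_packet(src: int, dst: int, payload: bytes) -> bytes:
    """Encapsulate a payload with the 5-byte header used on the testbed.
    The 0x55 preamble value is an assumed pattern, not specified in the paper."""
    return bytes([0x55, 0x55, dst, src, len(payload)]) + payload

def guard_bytes_for_load(packet_bytes: int, offered_load: float) -> int:
    """Guard time (in byte times) so that packet_bytes / (packet_bytes + guard)
    equals the desired offered load, respecting the 17-byte hardware minimum."""
    guard = round(packet_bytes * (1.0 - offered_load) / offered_load)
    return max(guard, MIN_GUARD_BYTES)

if __name__ == "__main__":
    payload = bytes(64)                          # assumed 64-byte payload
    pkt = make_packet(src=0, dst=random.randrange(4), payload=payload)
    print(len(pkt), guard_bytes_for_load(len(pkt), offered_load=0.5))   # -> 69 69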
Figure 3. FPGA based End Hosts for Parallel Computing

The end-to-end latency of each packet travelling through the testbed can be divided into the components given in Eqs. (1)-(4), where T_l is the traverse delay on the fiber links belonging to path P, and the remaining terms are the delay of the control plane, the delay introduced by the loopback buffer, the delays added by the RocketIO transmitter (Tx) and receiver (Rx), the delay D_BCDR introduced by the burst-mode clock and data recovery chipset, the arbitration delay D_ARB in the control plane, the tunable-laser switching latency D_TLD, and the time D_buf that a contended packet stays in the electrical loopback buffer.
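Since the discussion below refers to Eqs. (1)-(4), and to Eq. (3) in particular, one plausible way to write this decomposition is sketched here in LaTeX. The symbols D_CP, D_LB, D_Tx and D_Rx are introduced as assumed names for the control-plane, loopback-buffer, and RocketIO transmitter/receiver delays, and the grouping of terms is an assumption consistent with the component definitions above rather than a verbatim reproduction of the original equations:

\begin{align}
D_{\mathrm{uncontended}} &= \sum_{l \in P} T_l + D_{Tx} + D_{CP} + D_{BCDR} + D_{Rx} \tag{1} \\
D_{CP} &= D_{ARB} + D_{TLD} \tag{2} \\
D_{LB} &= D_{Tx} + D_{BCDR} + D_{Rx} + D_{buf} \tag{3} \\
D_{\mathrm{contended}} &= D_{\mathrm{uncontended}} + D_{LB} \tag{4}
\end{align}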
Figure 4. LIONS Testbed (a) Chipscope captured packet, (b) Simulation & Experiment Comparison
As the equations indicate, the end-to-end latency of an uncontended packet is deterministic; it is the minimum latency a packet experiences when travelling through the testbed. In the contended case, the latency of a packet travelling through the loopback buffer is given by Eq. (3), with a nondeterministic component D_buf that changes with the load and the contention probability. In the experiment, the components in the four equations were measured as follows: T_l: ~12.5 clock cycles, corresponding to around 20 meters of fiber pigtails in the lab; two further components: 14 and 6 clock cycles; D_ARB: 4 clock cycles; D_TLD: 9.5 clock cycles; D_BCDR: 8 clock cycles. Note that the end-to-end latency of a packet is counted from first-bit-out to last-bit-in, so the transmission delay is also included in the results, making the absolute delay value vary with packet size as well.

Figure 4(a) shows one data packet received at the node 2 Rx and captured by Chipscope [8]. As depicted, the packet consists of the 5-byte header and the payload. The comma symbols between packets are used as guard time to compensate for the bit loss of clock and data recovery and the delay of the comma-alignment circuit in the RocketIO. Figure 4(b) shows the statistical results from both simulations and experiments. The red line shows the 4x4 experimental data with K=1, while the blue line shows the corresponding simulation results. Here K is the number of parallel wavelengths that can be received simultaneously by the same host from one output port of the switch [5]. The comparison shows a close match between the experimental and simulation data, which verifies the correctness and accuracy of the simulation tools we developed. The other curves in Figure 4(b) project the results to high port counts. As depicted, increasing the LIONS radix does not significantly affect the end-to-end latency, while K=2 reduces it dramatically since it lowers the contention probability at each output port.
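For a rough sanity check of the numbers above, the short sketch below adds up the measured per-component delays and the transmission delay for a given packet size. The 8 ns clock period (i.e., an assumed 125 MHz reference clock for the 1.25 Gbps RocketIO link), the attribution of the two unlabeled 14- and 6-cycle components, and the neglect of 8b/10b line-coding overhead are all assumptions of this sketch:

LINE_RATE_BPS = 1.25e9
CLOCK_PERIOD_NS = 8.0       # assumed 125 MHz reference clock for the 1.25 Gbps link

# Measured per-component delays in clock cycles (uncontended path).
DELAYS_CLKS = {
    "fiber_links_T_l": 12.5,
    "component_a": 14.0,    # unlabeled in the measurements above
    "component_b": 6.0,     # unlabeled in the measurements above
    "D_ARB": 4.0,
    "D_TLD": 9.5,
    "D_BCDR": 8.0,
}

def min_end_to_end_latency_ns(packet_bytes: int) -> float:
    """Deterministic minimum latency: fixed component delays plus the first-bit-out
    to last-bit-in transmission time of the packet itself (8b/10b overhead neglected)."""
    fixed_ns = sum(DELAYS_CLKS.values()) * CLOCK_PERIOD_NS
    transmission_ns = packet_bytes * 8 / LINE_RATE_BPS * 1e9
    return fixed_ns + transmission_ns

if __name__ == "__main__":
    # 69-byte packet = 5-byte header + assumed 64-byte payload.
    print(round(min_end_to_end_latency_ns(69)))   # -> 874 (ns)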
IV. CONCLUSION
This paper experimentally demonstrates a 4x4, AWGR-based LIONS switch prototype with the shared loopback buffer architecture. The statistical data collected from the testbed are compared with the corresponding results generated by the network-level simulator. The comparison verifies the correctness and accuracy of the simulator and gives us confidence in projecting the switch to high radix. Simulation studies of LIONS are presented in [5, 7] and show the benefits of leveraging WDM parallelism in the switch with different loopback buffer architectures (i.e., shared, distributed, and mixed loopback buffers). We are also investigating the feasibility and benefits of a buffer-less architecture for LIONS [9], since removing the electrical loopback buffer will facilitate the scalability of the switch in terms of port count, line rate, and power efficiency.

REFERENCES
[1] C. Minkenberg, et al., "Designing a Crossbar Scheduler for HPC Applications," IEEE Micro, vol. 26, pp. 58-71, 2006.
[2] R. Hemenway, et al., "Optical-packet-switched interconnect for supercomputer applications," Journal of Optical Networks, 2004.
[3] O. Liboiron-Ladouceur, et al., "The Data Vortex Optical Packet Switched Interconnection Network," Journal of Lightwave Technology, vol. 26, 2008.
[4] O. Liboiron-Ladouceur, et al., "Physical Layer Scalability of WDM Optical Packet Interconnection Networks," Journal of Lightwave Technology, vol. 24, pp. 262-270, 2006.
[5] X. Ye, P. Mejia, Y. Yin, R. Proietti, S. J. B. Yoo, and V. Akella, "DOS - A Scalable Optical Switch for Datacenters," in ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), October 2010.
[6] Xilinx MicroBlaze soft processor core, http://www.xilinx.com/tools/microblaze.htm
[7] X. Ye, R. Proietti, Y. Yin, S. J. B. Yoo, and V. Akella, "Buffering and Flow Control in Optical Switches for High Performance Computing," Journal of Optical Communications and Networking, vol. 3, no. 8, pp. A59-A72, August 2011.
[8] Xilinx ChipScope, http://www.xilinx.com/tools/cspro.htm
[9] R. Proietti, Y. Yin, R. Yu, X. Ye, C. Nitta, V. Akella, and S. J. B. Yoo, "All-optical Physical Layer NACK in AWGR-based Optical Interconnects," IEEE Photonics Technology Letters, vol. 24, no. 5, pp. 410-412, March 2012.