2009 17th IEEE Symposium on High Performance Interconnects
Adaptive Routing in Data Center Bridges

Cyriel Minkenberg, Mitchell Gusat
IBM Research, Zurich Research Laboratory, Säumerstrasse 4, CH-8803 Rüschlikon, Switzerland
{sil,mig}@zurich.ibm.com

German Rodriguez
Barcelona Supercomputing Center, Nexus II, C/ Jordi Girona, 29, 08034 Barcelona, Spain
[email protected]
Abstract— In an effort to drive down the cost of ownership of interconnection networks for data centers and high-performance computing systems, technologies enabling consolidation of existing networking infrastructure, which often comprises multiple, incompatible networks operating in parallel, are feverishly being developed. Such consolidation carries the promise of lower complexity, less maintenance overhead, and higher efficiency, which should translate into lower power consumption. 10-Gigabit Ethernet is one of the contending technologies to fulfill the role of universal data-center interconnect. One of the key features missing from conventional Ethernet is congestion management; this void is being filled by the standardization work of the IEEE 802.1Qau working group. Here, we build on the congestion management scheme defined in 802.1Qau to exploit a crucial property of many data-center networks, namely, multi-pathing, i.e., the presence of multiple alternative paths between any pair of end nodes. We adopt a two-tier approach: in response to congestion detection, an attempt is made to reroute “hot” flows (i.e., those detected as contributing to congestion) onto an alternative, uncongested path; only when no uncongested alternative exists are transmission rates of hot flows reduced at the sources. We demonstrate how this can lead to significant performance improvements by taking full advantage of path diversity. Moreover, we highlight the scheme’s practical usefulness by showing how it improves the performance of parallel benchmark programs on a realistic network.

I. INTRODUCTION

Simplicity, scalability, wide availability, and low cost, combined with sufficient performance, have made Ethernet the network of choice for LAN traffic. For other networks that impose more stringent performance requirements, such as storage area networks (SAN), clustering, and high-performance computing (HPC), specialized standard and proprietary technologies have been developed, such as Fibre Channel, InfiniBand, PCIe Advanced Switching, Myrinet, and QsNet. This has led to a situation in which a typical data center (DC) or HPC installation has a communication infrastructure comprising (at least) three disjoint networks: a LAN, a SAN, and a clustering network. To reduce cost, complexity, power consumption, as well as management and maintenance overhead, several efforts are underway to consolidate all traffic on a single, unified physical infrastructure.

One of the main contenders to fulfill this role is 10-Gigabit Ethernet (10GE). Several working groups within the IEEE and IETF standards bodies are addressing key issues to ensure that 10GE meets DC and HPC requirements. Within IEEE 802, the Data Center Bridging (DCB) Task Group is responsible for defining standards for congestion management (IEEE 802.1Qau), traffic differentiation (IEEE 802.1Qbb), and enhanced transmission selection (802.1Qaz).

A specific requirement for DC/HPC networks is that packet losses cannot be tolerated. Therefore, to avoid drops due to buffer overflows, all data-center networks employ some form of link-level flow control (LL-FC), usually a variation of either credit-based or stop/go-based flow control. The Ethernet standard provides an LL-FC mechanism called PAUSE (802.3x), which can temporarily pause the corresponding link when a buffer is filling up. An unwanted side effect of LL-FC in combination with typically short buffers and high speed is that local congestion may rapidly spread to neighboring switches. If congestion persists long enough, a saturation tree [1] of congested switches is induced, which can cause a catastrophic collapse of global network throughput [1], [2], because such a saturation tree affects not only flows directly contributing to the congestion, but also other flows caught in the ensuing backlog. Therefore, congestion management (CM) is considered essential to safeguard against such collapses.

The IEEE 802.1Qau working group [3] is busy defining a standard for CM in DCB networks. All CM protocols proposed in this context, such as Ethernet Congestion Management (ECM) [4] or Quantized Congestion Notification (QCN) [5], try to prevent saturation trees from building up by pushing the backlog to the edge of the network. By means of congestion notification (CN) frames, switches instruct the sources of flows contributing to a specific congestion point (hotspot) to adjust their send rate, with the objective of matching the aggregate send rate of all contributing flows to the bottleneck bandwidth.

Reducing send rates to eliminate congestion is unavoidable if the congestion is due to endpoint contention, i.e., multiple source nodes sending to the same destination node at a higher aggregate rate than that node can handle. However, congestion can also occur within the network because of routing conflicts, i.e., because multiple flows (possibly destined to different destinations) are mapped to a common port by the routing algorithm. Network topologies used in DCs and HPC systems often provide significant multi-path capabilities, i.e., multiple paths between any source–destination pair. Congestion due to routing conflicts could be resolved effectively by rerouting some or all of the congested flows onto an alternative path, thereby obtaining higher overall network throughput than with existing 802.1Qau approaches, which merely reduce the source rates.
If a congested flow can be routed on an alternative, uncongested path, its rate does not have to be reduced. Therefore, combining adaptive routing (AR) with CM can significantly increase the throughput of a congested network, because send rates have to be reduced only if no alternative uncongested path exists. AR provides a spatial alternative to the temporal reaction of the 802.1Qau schemes. The main contribution of this work is an online AR scheme built on top of the end-to-end CM schemes being defined in 802.1Qau. Each switch snoops for 802.1Qau CN frames and uses the information gleaned from them to annotate its routing table with a congestion level for each port through which a given destination is reachable. When routing a frame to a given destination, the port with the lowest congestion level is selected. A key advantage of the proposed scheme is that no changes are necessary to the Ethernet frame format, the existing CM schemes, or the Ethernet adapters; it operates exclusively by modifying the routing behavior of the Ethernet switches. Interoperation with 802.1Qau is seamless: if no alternative path exists, the send rates will eventually be reduced to match the bottleneck rate.

The paper is organized as follows. Sec. II reviews the IEEE 802.1Qau CM framework. In Sec. III, we discuss how to set up the switches’ routing tables to enable multi-path routing while avoiding routing loops. We describe our AR proposal for DCB networks in Sec. IV. Sec. V presents results for synthetic traffic scenarios on small test topologies, which demonstrate that significant performance gains in terms of throughput and latency can be achieved. Sec. VI shows how AR can improve performance for actual parallel benchmark programs on larger networks. We conclude in Sec. VII.
Fig. 1. Loop-free set of links to route to Host 5.
II. ETHERNET CONGESTION MANAGEMENT

This section provides an overview of the efforts of the IEEE 802.1Qau Task Group [3] towards defining a CM scheme for DCB. The schemes considered adopt an end-to-end approach aiming to push congestion out to the edge of the network by having the sources (in practice, network interface cards) impose and adjust rate limits for flows contributing to congestion. The primary goal is to achieve the aforementioned rate matching, which is done indirectly by observing and controlling the integrated rate mismatch, i.e., the switch buffer lengths. The key components are the following:

1) Congestion detection and signaling: Each switch samples the incoming frames with a (randomized) sampling interval Is. When a frame is sampled, the switch determines the output buffer length and computes a feedback value Fb, which is a weighted sum of the current queue offset Qoff with respect to the equilibrium threshold Qeq and the level change Qdelta since the preceding sample: Fb = Qoff − Wd · Qdelta. Depending on the value of Fb, the switch sends a CN to the source of the sampled frame. The congested queue that causes the generation of the message is called the congestion point (CP).
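For illustration, a minimal sketch of this detection step is given below (Python; the class and method names are ours, and the clamping constant used for quantization is an assumption rather than something specified in this paper or the standard; Qeq, Wd, and the 6-bit quantization width are taken from Table I).

```python
QEQ = 33000       # equilibrium threshold [bytes] (Table I)
W_D = 2           # weight of the queue-change term (Table I)
FB_BITS = 6       # feedback quantization width [bits] (Table I)


class CongestionPoint:
    """Per-output-queue congestion detection, loosely following the text above."""

    def __init__(self):
        self.prev_qlen = 0

    def sample(self, qlen):
        """Called when a frame is sampled; returns a negative quantized feedback
        value if a CN should be generated, or None otherwise."""
        q_off = QEQ - qlen                # offset with respect to equilibrium
        q_delta = qlen - self.prev_qlen   # queue growth since the previous sample
        self.prev_qlen = qlen
        fb = q_off - W_D * q_delta        # Fb = Qoff - Wd * Qdelta
        if fb >= 0:
            return None                   # at or below equilibrium: no CN
        # Clamp and quantize |Fb| to FB_BITS bits (scaling constant assumed).
        fb_max = QEQ * (2 * W_D + 1)
        level = min((-fb * ((1 << FB_BITS) - 1)) // fb_max, (1 << FB_BITS) - 1)
        return -max(int(level), 1)
```

A CN carrying the returned (negative) value would then be addressed to the source of the sampled frame.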
2) Source reaction: Each source has an associated reaction point (RP) that instantiates rate limiters for congested flows and adjusts the sending rates depending on the feedback received. When an RP receives a CN, it decreases the rate limit of the sampled flow if the feedback is negative, and increases it if the feedback is positive. In addition, the RP can autonomously increase rate limits based on timers or byte counting to avoid starvation, accelerate rate recovery when congestion is over, and improve the fairness of the rate allocation. The scheme is referred to as Quantized Congestion Notification (QCN), first proposed in [5] and detailed in [6]. Unlike the earlier ECM proposal [4], QCN is characterized by a lack of explicit positive feedback. Instead, a source autonomously increases the target rate of a rate-limited flow after a time interval T during which it has not received any negative feedback. For a given rate-limited flow, the duration of T is determined either by the time needed to transmit a certain number of bytes or by the expiry of a timer. Target-rate increments are performed in three phases: first, several steps are applied according to a binary increase function similar to BIC-TCP [7], followed by several linear increase steps, before finally moving to a superlinear increase regime. In each increase step, the distance between the current rate limit and the target rate is halved. Further details on ECM and QCN can be found in [4]–[6], [8].

QCN is able to control the queue length at the CP well up to a round-trip time of about 500 μs, as shown in [9]. However, the performance benchmarks used in 802.1Qau consider single-path topologies exclusively. Here, on the other hand, we will study congestion in multi-path topologies and use QCN as a basis for an AR scheme that can exploit alternative paths rather than just throttling the sources.
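The reaction-point behaviour can be approximated by the following simplified sketch (Python; the names are ours, the byte-counter/timer machinery and the extra-fast-recovery option of the QCN pseudo-code [8] are omitted, and the rate constants come from Table I).

```python
class ReactionPoint:
    """Simplified per-flow rate limiter, loosely following the QCN reaction point."""

    MIN_RATE = 100e3          # minimum rate [b/s] (Table I)
    ACTIVE_INC = 5e6          # active increase step [b/s] (Table I)
    HYPER_INC = 50e6          # hyperactive increase step [b/s] (Table I)
    GD = 1.0 / 128            # decrease gain Gd (Table I)
    FAST_RECOVERY_THR = 5     # fast recovery threshold [cycles] (Table I)

    def __init__(self, line_rate=10e9):
        self.rate = line_rate     # current rate limit
        self.target = line_rate   # target rate to recover towards
        self.cycles = 0           # increase cycles completed since the last CN

    def on_negative_feedback(self, fb):
        """fb < 0: multiplicative decrease proportional to the feedback severity."""
        self.target = self.rate
        self.rate = max(self.rate * (1 - self.GD * abs(fb)), self.MIN_RATE)
        self.cycles = 0

    def on_increase_event(self):
        """Called when the byte counter or timer expires without new negative feedback."""
        self.cycles += 1
        if self.cycles > 2 * self.FAST_RECOVERY_THR:
            self.target += self.HYPER_INC   # superlinear (hyperactive) regime
        elif self.cycles > self.FAST_RECOVERY_THR:
            self.target += self.ACTIVE_INC  # linear (active) increase
        # Binary increase: close half the distance to the target rate.
        self.rate = (self.rate + self.target) / 2
```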
III. ROUTE CONFIGURATION

Routing in Ethernet networks has two key characteristics. First, all traffic is routed on a loop-free spanning tree topology. The Spanning Tree Protocol (STP) or one of its variants transforms an arbitrary physical topology (which may contain loops) into a logical one without loops by enabling or disabling specific ports in each switch. Routing loops must be eliminated to avoid “broadcast storms,” i.e., endless replications of broadcast frames. Unfortunately, this also means that potential multi-pathing capabilities are negated. As any bidirectional network with multiple paths from some source to some destination
automatically has a loop in it, this problem applies to all bidirectional multi-path topologies. Second, switched Ethernet networks are self-learning, i.e., they do not require explicit programming of the switch routing tables. A switch automatically learns routes by observing the source MAC addresses of the frames entering it. If a frame from MAC m enters on port p, the switch makes an entry in its routing table to route frames destined to MAC m through port p. When a frame arrives for which no entry exists in the routing table, the switch broadcasts the frame to all enabled ports except the one on which the frame arrived. In this way, each switch will discover a route to each end node (assuming that all end nodes generate some traffic that reaches all switches). As this method involves broadcasting, it may also cause broadcast storms on a multi-path topology.

To prevent loops, we adopted an approach based on preconfiguring the routing tables of each switch (instead of using STP and self-learning) such that, for each destination node d, the directed graph formed by all routes leading to node d is free of loops, see Fig. 1. This guarantees that, regardless of the AR scheme used, no frame can ever be routed in a loop, without having to keep track of previously visited switches. Route configuration is performed at network initialization. Note that this also guarantees that the routing is deadlock-free, as there exists a routing subfunction that is free of cyclic channel dependencies [10, Sec. 3.1.3]. To enable multi-path routing, each switch implements a routing table that maps each destination MAC to a list of eligible ports to which frames for that address can be routed, ordered by preference. The first entry is the designated default port, which is chosen as long as there is no congestion. The initial preference ordering could be based on, for instance, the distance to the destination (hop count).

IV. ADAPTIVE ROUTING IN DCB

The basic concept of our proposal is to exploit congestion information conveyed by notifications generated by 802.1Qau-enabled switches. These CNs travel upstream from the congestion point to the originating sources of the hot flows. While relaying CNs, the upstream switches can also snoop their content to learn about downstream congestion. By marking ports as congested with respect to specific destinations, a switch can reorder its routing preference of eligible output ports for the affected destinations to favor uncongested ports over congested ones.

Fig. 2. Routing of congestion notifications.

A. Congestion notification snooping

To enable AR, each switch maintains a congestion information table that maps a congestion key (d, p), where d is the destination MAC address and p the local port number, to a small data structure that keeps track of the current congestion status of port p with respect to destination d. This data structure comprises the following four fields:
• A congested flag indicating whether congestion has been detected on port p for traffic destined to d.
• A local flag indicating (if congested is true) whether the congestion occurred locally, i.e., in the output queue attached to port p.
• A feedback counter fbCount indicating how many CNs have been snooped for (d, p).
• A feedback severity indication feedback providing an estimate of how severe the congestion is.

Whenever a switch receives or generates a CN for a flow destined to d, it updates the congestion information corresponding to (d, p), where p is the output port corresponding to the input port on which the CN was received (remote) or the output port that triggered the creation of the CN (local). If the entry has not been marked as congested (or did not exist so far), the congested flag is set and local is set according to whether the CN was generated remotely or locally; fbCount is incremented, and the product of fbCount and the feedback value carried by the CN is added to feedback. As CNs always carry negative feedback values, feedback will also be negative and decrease as more CNs are received. The lower the value of feedback, the more severe the congestion. We employ such a weighted update to assign more weight to recent CNs, gradually reducing the effect of older entries and false positives. In addition, this allows congestion points that generate small but frequent feedback values, and thus accumulate a significantly negative feedback value, to be considered congested. This is the case for a queue in equilibrium, i.e., one for which congestion is under control but load demand still exceeds link capacity. There is one minor difference in the update procedure if the entry has already been marked as congested: local is updated only if it has already been true, i.e., local congestion can be overridden by remote congestion, but not vice versa.
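A minimal sketch of this table and its update rule is given below (Python; the entry fields follow the four fields listed above, while the class and method names are ours).

```python
import time
from dataclasses import dataclass, field


@dataclass
class CongestionEntry:
    congested: bool = False   # congestion detected for this (destination, port)?
    local: bool = False       # valid only if congested: did it occur in this port's queue?
    fb_count: int = 0         # number of CNs snooped for this (destination, port)
    feedback: float = 0.0     # accumulated (negative) severity estimate
    updated_at: float = field(default_factory=time.monotonic)


class CongestionTable:
    """Per-switch congestion information, keyed by (destination MAC, output port)."""

    def __init__(self):
        self.entries = {}

    def update(self, dest, port, fb, is_local):
        """Record a received (remote) or locally generated CN with feedback fb < 0."""
        e = self.entries.setdefault((dest, port), CongestionEntry())
        if not e.congested:
            e.local = is_local              # first CN decides local vs. remote
        else:
            e.local = e.local and is_local  # remote overrides local, not vice versa
        e.congested = True
        e.fb_count += 1
        e.feedback += e.fb_count * fb       # weighted update: recent CNs count more
        e.updated_at = time.monotonic()
```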
B. Congestion expiration

QCN only signals negative feedback, i.e., the presence or increase of congestion, but not its absence or decrease. Therefore, we opted for a timer-based approach to expire remote entries in the congestion information table, whereas local entries can be expired as soon as the corresponding output queue is no longer congested. To this end, whenever an entry is updated as being congested, a timer is started. When the timer expires, the entry is reset, provided that it is not flagged as local. A local entry is reset when the length of the corresponding output queue drops below Qeq/2.
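Continuing the sketch above (and reusing its CongestionEntry and CongestionTable), expiration could look as follows; the timer value and Qeq are taken from Table I, while the method names are ours.

```python
AR_TIMER = 0.250   # congestion-entry lifetime [s] (AR timer, Table I)
QEQ = 33000        # equilibrium threshold [bytes] (Table I)


class CongestionTableWithExpiry(CongestionTable):
    def expire_remote(self):
        """Reset remote entries whose timer has run out."""
        now = time.monotonic()
        for key, e in list(self.entries.items()):
            if e.congested and not e.local and now - e.updated_at > AR_TIMER:
                self.entries[key] = CongestionEntry()   # reset to the default state

    def on_queue_drained(self, dest, port, qlen):
        """Reset a local entry once its output queue has drained below Qeq/2."""
        e = self.entries.get((dest, port))
        if e and e.congested and e.local and qlen < QEQ / 2:
            self.entries[(dest, port)] = CongestionEntry()
```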
C. Routing decisions

When a frame arrives, a switch performs a routing lookup for the frame’s destination MAC address d. If the default (most preferred) port p0 is not flagged as congested by the congestion table entry for (d, p0), the frame is routed to port p0. If the default port is flagged as congested, the next-preferred port is checked, and so on. If all ports are flagged as congested, the frame is routed to the port with the least severe congestion (i.e., with the feedback value closest to zero).

CN frames receive special treatment. They are not subjected to congestion checks. However, we need to ensure that all ports belonging to alternative paths leading to the congestion point are made aware of the congestion. If all CN frames were always routed on the same path to the reaction point (source), the flow might be rerouted onto an alternative path that eventually ends up at the same congestion point. We address this issue by having each switch randomly select one of the available ports when performing a lookup for a CN frame. As long as the congestion persists, the congested switch will keep generating CNs; by routing them randomly, each upstream switch will eventually be traversed.

Figure 2 shows an example in which H1 and H2 are sending at line speed to H3 and H4, respectively, causing severe congestion at port 2 of S4 when the shortest paths are taken. The shortest reverse path back to H1 is through switch S2. However, if all CNs for H1 traverse S2, S1 will only mark its port 2 as congested, but never port 1, so S1 will route its traffic on the second-shortest path through port 1 to S6 and S7, still ending up at the bottleneck in S4. Therefore, S3 should ensure that it also routes CNs on the reverse path through S7 and S6. Then, S1 will mark ports 1 and 2 as congested with respect to destination H3, and will proceed to route its traffic through the longest path via S8–S12 to S5, thus bypassing S4 and eliminating the congestion.
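A sketch of this forwarding decision is shown below (Python; eligible_ports is the preference-ordered port list from the routing table of Sec. III, table is the congestion table sketched in Sec. IV-A, and the function name is ours).

```python
import random


def select_output_port(dest, eligible_ports, table, is_cn_frame):
    """Pick an output port for a frame destined to dest."""
    if is_cn_frame:
        # CNs bypass the congestion check and are spread randomly over all
        # eligible reverse-path ports, so every upstream alternative learns
        # about the congestion point.
        return random.choice(eligible_ports)

    best_port, best_feedback = None, None
    for port in eligible_ports:              # ordered by preference; [0] is the default
        e = table.entries.get((dest, port))
        if e is None or not e.congested:
            return port                      # first uncongested port wins
        if best_feedback is None or e.feedback > best_feedback:
            best_port, best_feedback = port, e.feedback   # closest to zero = least severe
    return best_port                         # all ports congested: least severe one
```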
TABLE I
SIMULATION PARAMETERS

Parameter              Value     Unit
line rate              10        Gb/s
frame size             1500      B
min. rate              100       Kb/s
Qeq                    33000     B
Wd                     2         –
Gd                     1/128     –
quantization           6         bit
CM timer               15        ms
AR timer               250       ms
mean sample itvl.      150       KB
buffer size per port   150       KB
fast recovery thr.     5         –
byte count limit       150       KB
active incr.           5         Mb/s
hyperactive incr.      50        Mb/s
min. decr. factor      0.5       –
extra fast recovery    enabled   –

V. SYNTHETIC TRAFFIC SCENARIOS

We first study the proposed AR scheme with synthetic traffic scenarios in small networks.

A. Topologies and scenarios

Fig. 3. Evaluated topologies and traffic scenarios: (a) Topology 1 (T1), (b) Topology 2 (T2). Solid and dashed lines correspond to scenarios without and with endpoint contention, respectively.

Figure 3 shows the benchmark topologies T1 and T2 used in our evaluation. The purpose of these small, artificial topologies is to illustrate the shortcomings of CM without AR and to test the basic functionality of the proposed AR scheme. In Sec. VI we will study a more realistic fat tree topology. We applied traffic scenarios with and without endpoint contention, represented by solid and dashed arrows, respectively. To check whether the scheme marks all paths leading to the hotspot as congested, T2 comprises multiple such paths; only the longest path routes around the hotspot.

In the scenario without endpoint contention, each flow has a different destination. However, because the preferred shortest paths all lead through a single bottleneck link, shortest-path routing will lead to severe congestion. Note that in each topology a set of routes is available that can eliminate the congestion. In general, it is impossible to assign a priori a set of routes that eliminates congestion for any combination of flows. In the scenario with endpoint contention, all flows have the same destination, so that, although alternative routes are available, they cannot be routed without contention. The purpose of this scenario is to determine whether the interaction between AR and rate reduction works as intended, i.e., when no alternative paths can be found, the send rates should be reduced.

To determine whether the proposed AR scheme can increase the saturation load, we also applied i.i.d. Bernoulli traffic, with each host generating uncorrelated arrivals uniformly distributed over all other hosts. The input load was varied from 10 to 100% and each frame was 1500 bytes in size. In all experiments, we measured the transmission rate per host and the queue length of each switch output queue, all averaged over 1 ms intervals. To cope with out-of-order delivery caused by AR, each adapter performs resequencing at the receiving end. We also measured the length of the resequencing buffer at each host.

B. Route configuration

Figure 1 shows a possible route configuration forming a loop-free directed graph for host H5 of T1; bold arrows
indicate the set of links (output ports) that may be used to route to H5. Loop prevention implies that not all possible paths are enabled: for instance, S8 could also route to S2 to follow the path through S7, but this would create a loop. The paths for the other hosts and topologies are configured similarly.

Fig. 4. Uniform traffic: (a) throughput vs. load, T1; (b) latency vs. load, T1; (c) throughput vs. load, T2; (d) latency vs. load, T2. Curves: CM disabled (“None”), CM only (“CM”), CM and AR (“AR”).
C. Simulation parameters

The CM algorithm used for detection and reaction was QCN version 2.2 as specified in [8]. Table I lists the relevant simulation parameter settings. Each simulation run simulated two seconds’ worth of real time. During the hotspot experiments, non-congesting random Bernoulli traffic with a load of 50% (T1) or 60% (T2) was offered to the network during the initial 200 ms and the final 500 ms of the simulation. The hotspot occurred from 200 to 1500 ms, with each flow sending at 95% of the line rate (1187.5 MB/s). The switch architecture was output-queued, but with buffers partitioned per input to support PAUSE. Each input adapter used per-host virtual output queuing with round-robin service to ensure that rate limits applied to one flow do not affect other flows from the same adapter. In all simulations, PAUSE was enabled.

D. Results

This section presents simulation results obtained for uniform Bernoulli traffic as well as for the scenarios with and without endpoint contention.

a) Uniform traffic: The achievable throughput in T1 and T2 under uniform Bernoulli traffic is significantly limited when using only the shortest paths, because they all contend for the link between S7 and S9 (T1) or S8 and S12 (T2). We will show that the proposed AR scheme can provide automatic load balancing in such situations, leading to significantly higher overall throughput. Figure 4 shows results for uniform traffic, with Figs. 4(a, c) plotting the measured throughput vs. the offered load and Figs. 4(b, d) plotting the measured latency vs. the offered load. Each figure contains curves corresponding to CM disabled (“None”), only CM enabled (“CM”), and both CM and AR enabled (“AR”).

First, the results show that without CM, throughput saturated at about 55.6% in T1. This is because there were nine hot flows in each direction (the flows from hosts H1–H3 to H4–H6 and vice versa) contending for a bottleneck link, hence each flow received 1/9 of the link rate. As CM was disabled, the input buffers of the congested switches filled up rapidly, causing PAUSE to be applied. As each of the cold flows shared a link with at least one hot flow (e.g., in T1, all flows from hosts H1–H3 traversed a congested link to switch S7, and all flows from hosts H4–H6 traversed a congested link to switch S9), the cold flows effectively obtained the same throughput as the hot ones, leading to an aggregate throughput per host of 5 · 1/9 ≈ 55.6%. Similarly, the saturation throughput of T2 without CM can be shown to be 75%.

Second, for both topologies, enabling CM increased the throughput. The CM mechanism controlled the congested queue lengths such that the links leading to the congested switch did not have to be paused continuously. Therefore, the cold flows were unaffected, resulting in saturation throughputs of (T1) 3 · 1/9 + 2 · λ/5 for λ ≥ 5/9 and (T2) 2 · 1/4 + 1 · λ/3 for λ ≥ 3/4, where λ is the offered load ratio. These values correspond precisely to the results obtained, see Figs. 4(a, c).

Finally, enabling AR increased the saturation throughput significantly, namely, to 96.8% and 99.5% for T1 and T2, respectively, indicating that our AR scheme optimally exploited the available path diversity. Accordingly, the mean latency was also drastically reduced. Note that as the load approached saturation, the latencies exhibited a maximum value instead of growing without bound, because we limited the length of the queues in the input adapters to prevent the simulation’s memory usage from becoming excessively large.

b) Flows without endpoint contention: Figure 5 shows the results for the scenario without endpoint contention. Figures 5(a, d) plot the mean transmission rates for H1, H2, and H3 as a function of time using only CM, and Figs. 5(b, e) the corresponding rates when using AR on top of CM. Figures 5(c, f) show the queue length of the corresponding hot queue with and without AR. Figures 5(a, d) show that CM by itself was able to limit the send rates of the hot flows such that their sum matched the bottleneck link capacity. In addition, Figs. 5(c, f) show that the hot queue lengths converged on Qeq with high accuracy and stability. However, with AR enabled, the throughput of each flow increased to 1187.5 MB/s, i.e., the offered load. Congestion was avoided by routing each flow on a different path. Correspondingly, Figs. 5(c, f) show that the hot queues
were no longer congested. Small spikes in the queue length occurred every 250 ms, because the timers periodically reset the congestion table entries. Each time a timer expired, the switch reverted to the default path, which caused congestion if another flow was also present there. This led to the generation of new CNs, forcing the contending flow to again reroute onto an uncongested path. In T2, see Fig. 5(d), we observed that the flow rate allocation was not fair, with one flow receiving significantly more bandwidth than the other. This, however, was due to QCN itself, which was not designed to converge on a fair flow rate allocation; its main objective is to stabilize the queue length. Although occasional misorderings occurred, the overall effect on latency was negligible. The mean length of the resequencing buffers was less than one frame.

Fig. 5. Results for scenario without endpoint contention: (a) throughput, CM only, T1; (b) throughput, AR, T1; (c) queue length, T1; (d) throughput, CM only, T2; (e) throughput, AR, T2; (f) queue length, T2.

c) Flows with endpoint contention: Figure 6 shows the results for the scenario with endpoint contention using AR on top of CM, with Figs. 6(a, b) showing the per-host transmission rates and Figs. 6(c, d) the hot queue lengths; in T1, the hot queues are in S7 and S9, and in T2 in S8 and S12. These results demonstrate that in scenarios in which AR could not eliminate the congestion, the underlying CM was able to match the send rates of the contending flows to the bottleneck link rate. In these scenarios, two potential hotspots exist in each topology; therefore, the routes frequently changed from one to the other. This effect could be observed in an increased length of the resequencing buffer at the destination node of each hot flow, which in this case was around 10 frames on average.
VI. IMPACT AT APPLICATION LEVEL

To demonstrate that AR has significant potential benefits in practice, we studied the behaviour of particular HPC applications. From the NAS Parallel Benchmarks (NPB), we selected the Conjugate Gradient (CG) and Fast Fourier Transform (FT) applications as suitable candidates because of their high communication demands. The CG benchmark [11] computes an approximation to the smallest eigenvalue of a large, sparse, symmetric positive definite matrix. This kernel tests irregular long-distance communication, employing unstructured matrix–vector multiplication, and is as such typical of unstructured grid computations.

To accurately reflect the behaviour of the applications, we adopted a trace-based approach: First, the applications were run on a real parallel machine. During the run, all MPI calls of each of the application’s tasks were recorded in a trace file. Then, the trace was visually inspected to determine a representative communication phase. This phase was then cut and saved separately to be used as the input for our simulations.

The application traces thus obtained were replayed using an end-to-end simulation framework comprising the Dimemas and Venus simulators [12]. Dimemas is an MPI simulator that accurately models the semantics of the MPI calls. This simulator interfaces with Venus, which is a detailed interconnection network simulator. Whenever an MPI communication is produced, Dimemas passes the corresponding message to Venus, which takes care of segmentation, packetization, buffering, switching, routing, reassembly, and eventually delivery back to Dimemas. As Dimemas and Venus were originally developed independently, both simulators had to be extended with a co-simulation interface that implements a conservative parallel discrete event simulation method to keep their event lists synchronized, so that the causality of events is guaranteed.

Venus supports various network topologies (fat tree, torus, mesh, hypercube) and provides models of various networking hardware (Ethernet, InfiniBand, Myrinet, as well as generic switch and adapter models). For more details on the simulation framework, we refer the reader to [12].
Fig. 6. Results for scenario with endpoint contention: (a) throughput, AR, T1; (b) throughput, AR, T2; (c) queue length, T1; (d) queue length, T2.
TABLE II
XGFT CONFIGURATIONS USED

h    mi                 wi                 paths
2    16,8               1,16               16
2    8,16               1,8                8
3    4,4,8              1,4,4              16
4    4,4,4,2            1,4,4,4            64
7    2,2,2,2,2,2,2      1,2,2,2,2,2,2      64

A. Traffic pattern

We collected a trace of a 128-node run of CG (data-set class D) comprising five phases. Figure 7(a) shows a Paraver plot of an optimal execution of this CG trace (3.2 ms). Tasks are listed along the y-axis, and time proceeds along the x-axis. A line between two tasks indicates the start and end of a communication. Figure 7(b) shows the corresponding communication matrix, with source and destination tasks listed along the x- and y-axis, respectively. The matrix entries indicate the number of bytes sent per source–destination pair (white = zero, black > 0). In each phase, each node sends one message of 750 KB to one other node, in a one-to-one fashion. Because the messages are so large, each communication is executed in rendezvous mode, meaning that short request and reply protocol messages are exchanged before the data transfer starts. To prevent these protocol messages from being delayed by large data messages, they are sent at a higher priority than the data messages.

In addition, we collected a trace of a 128-node run of FT. The selected pattern comprises a full 128-way all-to-all personalized exchange of 128 KiB messages. Here also, all communications take place in rendezvous mode.

B. Network

A network topology found in many current HPC systems is the k-ary n-tree [13]. A k-ary n-tree comprises N = k^n leaf nodes (computing nodes) and n · k^(n−1) inner nodes (each corresponding to a switch with 2k ports). Formally, k-ary n-trees and their slimmed versions (i.e., those not providing full bisectional bandwidth) belong to the family of extended generalized fat trees (XGFT) [14]. This family includes many popular multi-stage interconnection networks, such as m-ary complete trees, k-ary n-trees, fat trees as described in [15], and slimmed k-ary n-trees. An XGFT(h; m1, ..., mh; w1, ..., wh) of height h has h + 1 levels, divided into N = ∏_{i=1}^{h} mi leaf nodes at level l = 0 and inner nodes at levels 1 ≤ l ≤ h serving as routers. Each non-leaf node at level i has mi child nodes, and each non-root has wi+1 parent nodes [14]. XGFTs are constructed recursively, with each sub-tree at level l having parents numbered from 0 to wl+1 − 1.

C. Routing in XGFTs

Routing a message on a minimal deadlock-free path from source node s to destination node d in an XGFT can be done by choosing any of the nearest common ancestors (NCAs) of the source–destination pair. Having selected the NCA, it is trivial to compute the unique ascending and descending paths (self-routing property) [13]. Several oblivious algorithms have been proposed to assign a path to each source–destination pair in k-ary n-trees. Random routing [16] assigns a path (s → d) from node s to node d by choosing a random NCA. Two other oblivious routing techniques have been proposed independently, without a common agreement on their names: what we will call source-mod-k routing [14], [15] and destination-mod-k routing [17]. Both techniques employ a modulo function to select routes; the former uses the source node identifier and the latter the destination identifier. For k-ary n-trees, s-mod-k and d-mod-k routing can be described as follows: to establish a path (s → d) from node s to node d, s-mod-k routing chooses parent ⌊s / k^(l−1)⌋ mod k at hop l, whereas d-mod-k routing chooses ⌊d / k^(l−1)⌋ mod k, for 0 < l < n.

Routing algorithms for k-ary n-trees can easily be adapted for XGFTs (see [14]). In particular, s-mod-k and d-mod-k can be adapted by replacing k^(l−1) by g_l = ∏_{i=0}^{l−1} mi in the modulo operation (with m0 = 1); we refer to these generalized variants as s-mod-m and d-mod-m. The parent at level l is then given by p_l = (⌊x / g_l⌋ mod m_l) mod w_{l+1}, where x is either the source or the destination number.

In addition to the above oblivious routing algorithms, several pattern-aware schemes have been proposed. A pattern-aware algorithm takes into account the connectivity matrix C = [c_{s,d}], which indicates the data volume c_{s,d} for each source–destination pair (s → d). This may correspond, e.g., to the traffic pattern produced by a given parallel application.
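For illustration, the generalized modulo selection can be written as follows (Python; the function name, the list-based representation of the mi and wi, and the exact indexing convention are our assumptions on top of the reconstructed formulas).

```python
from math import prod


def mod_m_parent(x, hop, m, w):
    """Parent chosen at hop 'hop' (1-based, 1 <= hop < len(m)) on the ascending path.

    m = [m1, ..., mh] and w = [w1, ..., wh] describe XGFT(h; m1..mh; w1..wh);
    pass the source identifier as x for s-mod-m, the destination identifier
    for d-mod-m.
    """
    g = prod(m[:hop - 1])                    # g_l = m0 * m1 * ... * m_{l-1}, with m0 = 1
    return (x // g) % m[hop - 1] % w[hop]    # p_l = (floor(x / g_l) mod m_l) mod w_{l+1}
```

For the XGFT(2; 16,8; 1,16) of Table II, for example, d-mod-m at hop 1 reduces to (d mod 16) mod 16, i.e., the destination identifier modulo 16.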
Fig. 7. Conjugate Gradient traffic pattern: (a) execution trace (tasks 1–128 vs. time, phases 1–5); (b) communication matrix.

Fig. 8. Performance of CG and FT on 128-node fat tree networks using AR, CM, and various static routing algorithms (slowdown factor per topology): (a) CG, linear task placement; (b) CG, random task placement; (c) FT, random task placement (None vs. AR with Qeq = 33k and Qeq = 6k).
Given a connectivity matrix C, a pattern-aware algorithm attempts to optimize the paths required to route all connections (s → d) for which cs,d > 0 such that the number of potential conflicts is minimized. For the special case where C is a permutation matrix, an efficient algorithm to produce a set of routes without conflicts was proposed in [18]. A general approach to find a set of routes with minimum conflicts for any XGFT given C was proposed in [19], using a per-port cost function that indicates the number of conflicts, i.e., the minimum of the number of distinct sources and distinct destinations across all flows sharing a particular switch port; the cost of a given route corresponds to the maximum cost across the ports it traverses. Based on this cost function, a heuristic, offline optimization algorithm, henceforth referred to as colored, tries to minimize the global cost across all routes. In the remainder of this section, we will compare oblivious routing (s-mod-m, d-mod-m, and random), pattern-aware routing (colored, which can be seen as an offline AR algorithm), and our proposed AR, using traffic patterns obtained from the NPB CG and FT benchmarks.
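The per-port cost function of [19] can be sketched as follows (Python; the helper names and the flows_by_port representation are ours).

```python
def port_cost(flows_on_port):
    """Conflict cost of one switch port.

    flows_on_port: list of (source, destination) pairs whose routes traverse
    the port. The cost is the minimum of the number of distinct sources and
    the number of distinct destinations among those flows.
    """
    sources = {s for s, _ in flows_on_port}
    destinations = {d for _, d in flows_on_port}
    return min(len(sources), len(destinations))


def route_cost(route_ports, flows_by_port):
    """Cost of a route = maximum port cost over the ports it traverses."""
    return max((port_cost(flows_by_port[p]) for p in route_ports), default=0)
```

The heuristic colored algorithm would then try to assign routes so that the global cost over all routes with c_{s,d} > 0 is minimized.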
D. CG results

Figures 8(a, b) compare the slowdown factors for the d-mod-m, s-mod-m, random, colored, and adaptive routing algorithms, obtained by replaying the CG trace on the topologies listed in Table II. For comparison, we also used d-mod-m routing in conjunction with CM. The slowdown factor is computed with respect to the minimum execution time obtained without any contention (about 3.2 ms). We employed linear (Fig. 8a) and random (Fig. 8b) task placements. The simulation parameters were the same as in Table I, except that Gd was set to zero when using AR; therefore, congestion notifications served only to trigger AR, but did not cause any rate limiting. For improved accuracy, we also employed variable-sized Ethernet frames with payloads between 46 and 1500 bytes and 22 bytes of overhead per frame. We configured the switch routing tables to allow all possible minimal routes, with the default entry for each destination corresponding to d-mod-m.

With linear placement, the first four phases of the CG pattern could be routed without any conflicts even by oblivious algorithms such as s-mod-m and d-mod-m. In the fifth phase, however, the modulo algorithms caused significant contention, which led to a slowdown by a factor of seven (in this phase, seven messages contended for the same path; the eighth was a self-message that did not enter the network). The first important observation is that when 802.1Qau CM was enabled without AR (d-mod-m), execution times were drastically worse than without CM: the congestion induced in the final phase led to transmission rate reductions without any hope of improvement, because the problem here was not throughput loss due to saturation-tree congestion. Rate reduction can only improve overall throughput if there are “victims” and “culprits,” but in this case every flow was a culprit. In addition, the time scale of the simulation is too
short for CM to be useful, as the rate recovery duration of QCN is in the range of tens of milliseconds. Only colored was able to route all phases without any contention. AR performed clearly better (20–40% faster) than random, d-mod-m, and s-mod-m for all topologies. Overall, AR achieved execution times no more than 25% above the theoretical minimum of 3.2 ms. It is also interesting to observe that execution times did not vary greatly as the number of topology levels increased. With random task placement, execution times were generally longer, because conflicts then also arose in the first four phases. However, in this case AR performed best on all topologies, even outperforming offline route optimization.
E. FT results

For FT, linear placement with d-mod-m routing did not cause any contention at all, so CM and AR played no role. Therefore, we applied five different random placements instead. We simulated the same set of topologies with and without AR. As the messages were significantly shorter than with CG, we also tried using a lower Qeq to accelerate the triggering of route changes in case of contention. However, setting Qeq too low might cause underruns, queue-length oscillations, and more missequencing. Figure 8(c) plots the slowdown factors for each topology, normalized with respect to the contention-free case (about 14.3 ms). Each data point shows the mean slowdown across the five different placements, as well as the minimum and maximum. AR improved performance by about 10 to 20%. Lowering Qeq improved performance further, especially for topologies with more levels.

VII. CONCLUSIONS

Based on the congestion management framework being standardized by the IEEE DCB Task Group, we proposed a fully adaptive routing scheme for DCB networks that provides a practical way to exploit the available path diversity to its full extent. This is, to the best of our knowledge, the first scheme to demonstrate the load-balancing potential of the upcoming DCB CM standard. It takes advantage of the presence of CM by modifying the switch routing behavior based on information snooped from congestion notifications. Our evaluation showed that performance can be improved significantly (increased throughput, reduced latency) under both specific congestion scenarios and uniform traffic. Moreover, we showed that our scheme can reduce the execution time of real benchmarks; in this case, we also found that enabling CM by itself can do more harm than good. We intend to extend this work along multiple dimensions: to study other benchmarks and parallel applications, and to consider other topologies, in particular slim trees, as well as direct networks such as meshes and tori. We also see further potential to optimize the routing decisions, and we will investigate the trade-offs between oblivious load balancing and “informed” AR as considered here.

ACKNOWLEDGMENTS

We thank Alessandra Scicchitano for her contributions.

REFERENCES

[1] G. Pfister and V. Norton, “Hot spot contention and combining in multistage interconnection networks,” IEEE Trans. Computers, vol. C-34, no. 10, pp. 933–938, Oct. 1985.
[2] G. Pfister, M. Gusat, W. Denzel, D. Craddock, N. Ni, W. Rooney, T. Engbersen, R. Luijten, R. Krishnamurthy, and J. Duato, “Solving hot spot contention using InfiniBand Architecture congestion control,” in Proc. HP-IPC 2005, Research Triangle Park, NC, July 24, 2005.
[3] “IEEE Standard for Local and Metropolitan Area Networks - Virtual Bridged Local Area Networks, Amendment 10: Congestion Notification,” IEEE 802.1Qau, http://www.ieee802.org/1/pages/802.1au.html.
[4] D. Bergamasco, “Ethernet congestion manager (ECM) specification,” Cisco Systems, initial draft EDCS-574018, Feb. 2007.
[5] R. Pan, B. Prabhakar, and A. Laxmikantha, “QCN: Quantized Congestion Notification,” May 17, 2007. [Online]. Available: http://www.ieee802.org/1/files/public/docs2007/au-prabhakar-qcn-description.pdf
[6] ——, “QCN: Quantized Congestion Notification,” May 29, 2007. [Online]. Available: http://www.ieee802.org/1/files/public/docs2007/au-pan-qcn-details-053007.pdf
[7] L. Xu, K. Harfoush, and I. Rhee, “Binary increase congestion control (BIC) for fast long-distance networks,” in Proc. INFOCOM 2004, vol. 4, Hong Kong, Mar. 7–11, 2004, pp. 2514–2524.
[8] R. Pan, “QCN pseudo code version 2.2,” Nov. 13, 2008. [Online]. Available: http://www.ieee802.org/1/files/public/docs2008/au-pan-QCN-pseudo-code-ver2-2.pdf
[9] C. Minkenberg and M. Gusat, “Congestion management for 10G Ethernet,” in Second Workshop on Interconnection Network Architectures: On-Chip, Multi-Chip (INA-OCMC 2008), Göteborg, Sweden, Jan. 2008.
[10] J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach. San Francisco, CA: Morgan Kaufmann, 2003, revised printing.
[11] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga, “The NAS parallel benchmarks,” NASA Ames Research Center, Moffett Field, CA, NAS Technical Report RNR-94-007, Mar. 1994.
[12] C. Minkenberg and G. Rodriguez Herrera, “Trace-driven co-simulation of high-performance computing systems using OMNeT++,” in Proc. SIMUTools 2nd International Workshop on OMNeT++ (OMNeT++ 2009), Rome, Italy, Mar. 6, 2009.
[13] F. Petrini and M. Vanneschi, “A comparison of wormhole-routed interconnection networks,” in Proc. Third International Conference on Computer Science and Informatics, Research Triangle Park, NC, USA, Mar. 1997.
[14] S. R. Öhring, M. Ibel, S. K. Das, and M. J. Kumar, “On generalized fat trees,” in Proc. 9th International Symposium on Parallel Processing (IPPS ’95), Washington, DC, 1995, p. 37.
[15] C. E. Leiserson, “The network architecture of the Connection Machine CM-5,” in Proc. Fourth Annual ACM Symposium on Parallel Algorithms and Architectures, San Diego, CA, June 1992, pp. 272–285.
[16] L. G. Valiant and G. J. Brebner, “Universal schemes for parallel communication,” in Proc. 13th Annual ACM Symposium on Theory of Computing, Milwaukee, WI, 1981, pp. 263–277.
[17] X.-Y. Lin, Y.-C. Chung, and T.-Y. Huang, “A multiple LID routing scheme for fat-tree-based InfiniBand networks,” in Proc. 18th International Parallel and Distributed Processing Symposium, pp. 11–, 2004.
[18] Z. Ding, R. R. Hoare, A. K. Jones, and R. Melhem, “Level-wise scheduling algorithm for fat tree interconnection networks,” in Proc. 2006 ACM/IEEE Conference on Supercomputing, New York, NY, USA, 2006, p. 96.
[19] G. Rodriguez Herrera, R. Beivide, C. Minkenberg, J. Labarta, and M. Valero, “Exploring pattern-aware routing in generalized fat tree networks for HPC,” in Proc. 23rd International Conference on Supercomputing (ICS 2009), New York, NY, June 9–11, 2009.