Exploring Pattern-aware Routing in Generalized Fat Tree Networks

German Rodriguez, Barcelona Supercomputing Center (BSC), Barcelona, Spain ([email protected])
Ramon Beivide, University of Cantabria, Cantabria, Spain ([email protected])
Cyriel Minkenberg, IBM Research GmbH, Zurich Research Laboratory, Rüschlikon, Switzerland ([email protected])
Jesus Labarta, Universitat Politècnica de Catalunya and BSC, Barcelona, Spain ([email protected])
Mateo Valero, Universitat Politècnica de Catalunya and BSC, Barcelona, Spain ([email protected])
ABSTRACT

New static source routing algorithms for High Performance Computing (HPC) are presented in this work. The target parallel architectures are based on the commonly used fat-tree networks and their slimmed versions. The evaluation of these proposals and their comparison against currently used routing mechanisms have been driven by realistic traffic generated by HPC applications. Our experimental framework is based on the integration of two existing simulators, one replaying an MPI application and another simulating the network details. The resulting simulation platform has been fed with traces from real executions. We have obtained several interesting findings: (i) contrary to the widely accepted belief, random static routing in k-ary n-trees (which is the default option for InfiniBand and Myrinet technologies) is not a good solution for HPC applications; (ii) some existing oblivious routing techniques can be very good for certain communication patterns present in applications, but clearly fail for others; and (iii) one of the proposed pattern-aware routing algorithms can be used to better utilize network resources and thus achieve higher performance, particularly in the case of cost-effective networks.

Categories and Subject Descriptors

C.2.2 [Computer-Communication Networks]: Network Protocols—Routing protocols; B.4.3 [Input/Output and Data Communications]: Interconnections (Subsystems)—Topology (Fat Trees); C.4 [Performance of Systems]: Design studies

General Terms

Performance, Algorithms

Keywords

Extended Generalized Fat Trees (XGFTs), k-ary n-trees, communication/traffic patterns, network topologies, routing algorithms, Clos networks

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICS'09, June 8–12, 2009, Yorktown Heights, New York, USA. Copyright 2009 ACM 978-1-60558-498-0/09/06 ...$5.00.

1. INTRODUCTION

Current High Performance Computing (HPC) systems consist of thousands of processors connected by customized interconnection networks. The generalized use of such massive parallelism has increased the impact of the network on overall system performance and cost. Although some of the fastest supercomputers in the Top500 list use Torus networks, a large number are built around indirect networks based mainly on fat-tree topologies. The present work focuses on enhancing the performance of this second class of networks by providing better routing algorithms.

Different papers have studied the effect of routing on the performance of regular and irregular indirect networks [20], [4]. Such works are mainly based on simulations fed by constant, randomly generated traffic managed under static or adaptive routing. Two main conclusions were obtained: (i) random routing is good because it uniformly distributes traffic, and (ii) intelligent adaptive routing can balance traffic and maximize memory utilization at the switches so as not to block packet injectors. It is tempting to extrapolate these results, which can be valuable in certain contexts, to the communication patterns of supercomputer applications. However, the bursty and causal nature of HPC traffic is quite unlike random non-reactive traffic. In general, supercomputer applications exhibit very regular and repetitive communication patterns alternated with computation phases that act as implicit traffic flow control. Hence, the main networking objective in the HPC domain is not congestion control under constant packet injection, which usually means taking decisions that directly or indirectly reduce the injection rate. In contrast, networks for HPC must provide the maximum possible peak rate for the transmissions involved in the various communication phases of applications. This can be achieved by minimizing packet contention over network resources.

In addition, many recent works [9], [23], [2] conclude that, given the communication requirements of HPC applications, current networks are overdesigned. These works are based on the observed resource usage of such "overdesigned" networks, but their conclusions do not necessarily hold for more cost-effective networks. Instead of a typical non-blocking network (as defined in Sec. 2), a slimmed blocking version can be used. An important question to ask in this context is how much a network can be cut down in switching resources without incurring a significant performance degradation.

The present work focuses on reducing the cost-performance ratio of interconnection networks by optimizing message routing. Good routing mechanisms should improve performance at little or no cost by removing or reducing the contention of packets for network ports. A bad routing scheme can underutilize an overdesigned network, and a good routing scheme for a non-blocking network is not necessarily good for its blocking version. Hence, our analysis addresses both the non-blocking and blocking network variants. The main contributions of this work are as follows: (i) we propose an offline contention metric to factor out the contention at the adapters from the contention in the network fabric; (ii) we devise two new pattern-aware routing heuristics that lead to optimized routes according to the offline contention metric; (iii) we compare these routing schemes with several well-known routing techniques for both rearrangeable non-blocking networks and their slimmed (blocking) versions; and (iv) we show that performance gains can be achieved in different realistic HPC scenarios at the negligible cost of modifying the routes. Moreover, the proposed pattern-aware routing opens the door to research targeted at its application to power-saving and fault-tolerant supercomputers.

It has been pointed out in [5] that the performance of a static routing scheme (D-mod-k) can be a lower bound for adaptive routing techniques. In our work, we have studied several static routing schemes in the HPC domain that can help to set and better understand these bounds, extending the study to slimmed (blocking) networks. Our study is based on a realistic experimental methodology. We obtained traces of communication patterns extracted from real applications running on a production machine. We fed these traces to a trace-driven MPI simulator integrated with a detailed network-level simulator. Finally, we evaluated the routing performance on a family of popular network topologies, of which k-ary n-trees are a sub-family.

The remainder of this paper is organized as follows. In Sec. 2 we review the background and related work on the topologies studied and on oblivious routing algorithms. In Sec. 3 we introduce pattern-aware routing, propose an offline contention metric, and propose two new heuristic pattern-aware routing techniques that attempt to minimize this contention metric. Sec. 4 explains our evaluation methodology, and Sec. 5 presents the results. We conclude in Sec. 6.

2. BACKGROUND AND RELATED WORK

Many current supercomputers employ k-ary n-tree networks [19], a popular parametric family of indirect multi-tree networks. A k-ary n-tree has N = k^n leaf nodes used as processing nodes and n·k^(n−1) inner nodes (2k-port switches). These full-bisection-bandwidth networks exhibit path redundancy, and also the property of being rearrangeable [7]. This means that any scheduled permutation of sources over destinations can be routed without blocking, i.e., no messages contend for the same network port. Each specific permutation needs an appropriate set of routes. Therefore, these networks will be referred to as non-blocking. As stated earlier, several recent works have identified a potential over-provisioning of bandwidth in k-ary n-trees [9], [2], [15]. Consequently, the use of "slimmed" k-ary n-trees has been considered. Slimmed k-ary n-tree topologies have fewer than n·k^(n−1) switches, losing both the full bisection bandwidth and the rearrangeable non-blocking properties.

Formally, k-ary n-trees and their slimmed versions belong to the family of Extended Generalized Fat Trees (XGFT) [16]. This family includes many popular Multi-stage Interconnection Networks (MIN), such as m-ary complete trees, k-ary n-trees [19], fat trees as described in [12], and slimmed k-ary n-trees. An XGFT(h; m_1, ..., m_i, ..., m_h; w_1, ..., w_i, ..., w_h) of height h has N = \prod_{i=1}^{h} m_i leaf processors, with the inner nodes serving only as routers. Each non-leaf node at level i has m_i child nodes, and each non-root has w_{i+1} parent nodes [16]. An XGFT of height h has h + 1 levels; leaf nodes are at level l = 0. XGFTs are constructed recursively, each sub-tree at level l having parents numbered from 0 to (w_{l+1} − 1). See Figure 1 for some examples.

A k-ary n-tree is an XGFT(n; k, ..., k; 1, k, ..., k), where h = n, w_1 = 1, m_1 = k, and m_i = k, w_i = k for all i with 2 ≤ i ≤ n. A slimmed k-ary n-tree is precisely defined by the vectors w_i and m_i when ∃i such that w_i < k, with 2 ≤ i ≤ n. Slimmed trees are blocking networks. In both cases, the number of inner switches I can be computed as

    I = \sum_{i=1}^{h} \left( \prod_{j=i+1}^{h} m_j \cdot \prod_{j=1}^{i} w_j \right).    (1)

[Figure 1: Several XGFTs]

Finding a minimal deadlock-free path for a connection (s → d) between source node s and destination node d in an XGFT network can be done by choosing any of their Nearest Common Ancestors (NCAs). Having selected the NCA, it is trivial to compute the unique ascending and descending paths to the nodes using their identifiers (self-routing property) [19].

Different oblivious algorithms have been proposed to fill the routing tables of k-ary n-trees. Random routing [22], [6], [4] is used as the default mechanism in Myrinet and InfiniBand interconnects. A path (s → d) from node s to node d is created by choosing any random NCA between both nodes. Two other oblivious routing techniques have been proposed independently, without a common agreement on their names: what we will call Source-mod-k routing [16], [12] and Destination-mod-k routing [13], [10], [5], [8]. Both techniques employ the same function to select routes; the difference is that the former uses the source node identifier and the latter the destination identifier. S-mod-k and D-mod-k routing can be concisely described for k-ary n-trees: to establish a path (s → d) from node s to node d, S-mod-k routing chooses parent ⌊s / k^(l−1)⌋ mod k at hop l, and D-mod-k routing chooses ⌊d / k^(l−1)⌋ mod k. Routing in any XGFT is identical to finding paths in k-ary n-trees (see [16]). The routing tables of XGFTs can be filled using straightforward adaptations of the algorithms used for k-ary n-trees. For instance, S-mod-k and D-mod-k can be adapted by replacing the denominator k^(l−1) by \prod_{j=1}^{l−1} m_j for l > 1 (and by 1 for l = 1), and k by w_l. We have used a roughly equivalent adaptation, using a self-routing variable-radix base to label the nodes as described in [13], applying the modulo w_l to the corresponding digit of the node label at each level. A route r from s to d that has its NCAs at level l_NCA is determined by the sequence of local output ports used to reach the NCA. Local output ports of switches at level l are numbered from 0 to w_{l+1} − 1; each local output port corresponds to one of the possible parents of the switch reached at level l. A route r is therefore described as <r_0, ..., r_l, ..., r_{l(s,d)−1}>, the path to the NCA.
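As a concrete illustration of Eq. (1) and of the S-mod-k/D-mod-k parent selection described above, here is a short Python sketch. The function and variable names are ours, not from the paper, and the k-ary n-tree case is assumed (denominator k^(l−1)):

```python
def inner_switches(m, w):
    """Eq. (1): number of inner switches of XGFT(h; m1..mh; w1..wh).
    m and w are Python lists with m[0] = m_1, w[0] = w_1, etc."""
    h = len(m)
    total = 0
    for i in range(1, h + 1):
        prod_m = 1                      # product of m_j for j = i+1 .. h
        for j in range(i + 1, h + 1):
            prod_m *= m[j - 1]
        prod_w = 1                      # product of w_j for j = 1 .. i
        for j in range(1, i + 1):
            prod_w *= w[j - 1]
        total += prod_m * prod_w
    return total

def nca_level(s, d, k):
    """Level of the Nearest Common Ancestors of leaves s and d
    in a k-ary n-tree (leaves are at level 0)."""
    l = 0
    while s != d:
        s //= k
        d //= k
        l += 1
    return l

def s_mod_k_path(s, k, lnca):
    """Upward parent choices of S-mod-k: floor(s / k^(l-1)) mod k at hop l."""
    return [(s // k ** (l - 1)) % k for l in range(1, lnca + 1)]

def d_mod_k_path(d, k, lnca):
    """Upward parent choices of D-mod-k: floor(d / k^(l-1)) mod k at hop l."""
    return [(d // k ** (l - 1)) % k for l in range(1, lnca + 1)]

# A 2-ary 3-tree has n * k^(n-1) = 3 * 4 = 12 inner switches:
assert inner_switches([2, 2, 2], [1, 2, 2]) == 12
# Leaves 1 and 6 differ in their top-level digit, so their NCA is at level 3:
assert nca_level(1, 6, 2) == 3
```

Note that the D-mod-k choice depends only on d, so all routes towards the same destination converge onto the same upward parent sequence; S-mod-k, symmetrically, gives all routes from the same source the same upward path.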
The second half of the route, from the NCA down to the destination, can easily be reconstructed from the first half by knowing the destination identifier d [3].

Very few pattern-aware routing schemes have been proposed for XGFT networks. A pattern-aware routing scheme tries to optimize the set of connections C = {(s → d)} present in an application, for any leaf nodes s and d. A special case of a set of connections is a permutation, in which every node sends to a single distinct destination. A very efficient algorithm to route a particular permutation without conflicts (i.e., realizing the rearrangeability property) is proposed in [3]. Optimizing a more general set of connections is a combinatorial problem: a brute-force search using a breadth-first or depth-first backtracking algorithm is impractical for the node counts of current supercomputers. Breadth-first search strategies have been used to find minimal-distance deadlock-free up*/down* routes in networks of workstations [20], [21], but no attempt was made to optimize the global set of routes for a particular communication pattern. Brute-force and greedy strategies have also been used to optimize multicast traffic [1]. Whereas our work also tries to minimize contention for a many-sources many-destinations set, it is fundamentally different because we cannot assume that the source data is the same for different destinations. Finally, we have not found any specific pattern-aware routing scheme for slimmed networks, except the obvious adaptations of the general ones discussed above for k-ary n-trees. An orthogonal work for more general network topologies, Application Specific Routing (APSRA) [17], uses the communication pattern to remove channel dependencies so that more deadlock-free paths can be found. In XGFTs, all minimal paths are deadlock-free.

In contrast to the aforementioned works, this one tries to optimize routing to adequately manage the communication patterns of HPC applications. Such patterns are much more general than permutations. Our main goal is to obtain a global set of fixed routes for the entire application execution if possible or, at least, for a reasonable time span, as globally re-programming the routing tables in the switches and adapters can be very costly, on the order of seconds [8].
3. PATTERN-AWARE ROUTING

A pattern-aware routing scheme takes the connectivity matrix M (N × N) of a communication pattern C as input and produces an optimized set of routes for this pattern as output. The connectivity matrix M of C records its set of connections, with elements m_{ij} ≠ 0 iff the connection (i → j) ∈ C. The actual value of m_{ij} can represent a useful cost metric of (i → j), for example, the number of bytes. The connectivity matrix of a permutation has at most N non-zero elements, namely a single non-zero element per row, such that no two non-zero elements are in the same column. The connectivity matrix is built from the set of connections within a certain time span of the execution of the application, which could range from a single instant to a complete communication phase or the entire application. The connectivity matrix has no timing information: when each communication started, or how long it took, is lost. Different executions of the same application will probably experience different communication timings. However, the structure of the connectivity matrix will be almost the same across different runs with the same number of processors, as evidenced by our own experimental analysis and by [9]. Whereas much effort has been devoted to minimizing path lengths while guaranteeing deadlock freedom, our work focuses on computing an optimized routing table that increases performance by reducing network fabric contention. To do so, we define a cost function that accounts only for network contention and eliminates the effect of endpoint contention.
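A connectivity matrix of this kind can be accumulated directly from a list of observed (source, destination, bytes) records. The sketch below is ours (the paper does not prescribe an implementation, and `connectivity_matrix` and `records` are hypothetical names):

```python
def connectivity_matrix(n_nodes, records):
    """Build M (N x N): M[i][j] accumulates the bytes sent over
    connection (i -> j), so M[i][j] != 0 iff (i -> j) is in C."""
    M = [[0] * n_nodes for _ in range(n_nodes)]
    for src, dst, nbytes in records:
        M[src][dst] += nbytes
    return M

# A permutation: every node sends to a single distinct destination,
# so M has one non-zero element per row and no two in the same column.
perm = [(0, 2, 1024), (1, 3, 1024), (2, 0, 2048), (3, 1, 512)]
M = connectivity_matrix(4, perm)
assert all(sum(1 for v in row if v) == 1 for row in M)
assert all(sum(1 for row in M if row[j]) == 1 for j in range(4))
```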
3.1 Cost function: Offline Contention Metric

We can differentiate between two kinds of contending messages in an application: those that contend for the network adapter because they were produced by, or are going to be consumed at, the same node (assuming that each node has a single network adapter), and those that were injected by different nodes and compete to go through some switch port. A routing scheme by itself can only address the latter kind. We have devised a cost metric for every switch port p that accounts solely for the network contention. We define routes(p) = {(s → d) such that (s → d) uses p} as the set of routes that go through port p. The cost function
is computed as follows:

    srcs(p) = {s | ∃(s → d) ∈ routes(p)}          (2)
    dsts(p) = {d | ∃(s → d) ∈ routes(p)}          (3)
    cost(p) = min(|srcs(p)|, |dsts(p)|)           (4)

The Max-flow Min-cut theorem tells us that, for a fully connected graph connecting the sources (2) to the destinations (3), the maximum flow of the set routes(p) is given by the minimum cut. Assuming all flows to be 1, the minimum cut corresponds to (4). In our case, the computed maximum flow (4) has to go through the single port p. Hence, the bandwidth loss caused by sharing port p, in comparison with a fully connected network, is at most (4). The cost function has a useful secondary property: it assigns low costs to very busy ports that would benefit very little from a fully connected network because their contention is concentrated at the endpoints. Finally, the global cost function of a partial or complete route r = <r_0, ..., r_{n−1}>, where n is the number of hops, is the maximum of the cost functions of the individual ports it traverses. The global cost function of a complete route includes both the upward and the downward path:

    Global_cost(r) = max_{i=0..n−1} cost(r_i)     (5)

The two heuristics we present next try to find a single routing table that minimizes the maximum global cost function over the entire connectivity matrix.

3.2 Non-backtracking Best-first Search with Branch and Bound Heuristic (BeFS)

BeFS is a greedy heuristic that takes the connections of M and, for each one, finds a route using a best-first search. A breadth-first search can be thought of as a best-first search with ties in the priority queue resolved as LIFO and equal costs for all search nodes [18]. We describe the internals of this heuristic below.

BeFS Heuristic. Input: connectivity matrix M. Output: optimized routes for M according to Global_cost (5).

Step À: Initialization. Insert the elements (s → d) with M(s, d) ≠ 0 into a list L sorted by source node. Initialize the set of routes found, S = ∅, and the port annotations (routes) P_i = ∅, where i is a global network port identifier with 0 ≤ i < I·K, I being the number of inner switches and K the number of ports per switch.

Step Á: for each (s → d) ∈ L, perform BeFS(s → d) with Branch and Bound to find the first route r whose value for (5) is close to 0, or the minimum that can still be achieved. Update S = S ∪ {r} and P_i = P_i ∪ {r} for all i such that r uses port i. An accepted route is never backtracked; only the path search inside BeFS(s → d) can backtrack. Subsequent calls to BeFS(s → d) use the globally updated port annotations P_i internally.

The function BeFS(s → d) searches for a path from s to d by inserting the ports reachable from the currently inspected switch or node into a priority queue. At each step towards a solution, the port that is first in the priority queue (ties are resolved in FIFO manner) is expanded. When all costs are equal, the algorithm behaves as a traditional breadth-first search. The search stops with the first route found that achieves a minimum value for (5). If a partial route already has a higher value of (5) than the best complete route found for (s → d), the partial route is not expanded further. Because a solution of BeFS(s → d) reaching an optimal value for (5) is not unique, the order in which routes are found (which depends on the ordering of the priority queue) is relevant. To reproduce the algorithm, it is therefore crucial to define the ordering of the priority queue. The priority queue has two levels of ordering: a first level based on the properties (kinds) of the ports, and a second, internal ordering based on either FIFO order or the cost function. The kinds are, from most preferable to least preferable, the following:

À Ports whose assigned routes (P_i) all have either s as source or d as destination. Formally: either ∀(s′ → d′) ∈ P_i, s = s′, or ∀(s′ → d′) ∈ P_i, d = d′.

Á Ports with no routes assigned yet.

 Ports whose assigned routes (P_i) share either s or d, among other conflicting routes (those that have neither s nor d in common).

à All other ports with conflicting routes.

Ports of kinds À and Á are kept in the default FIFO order. Ports of kinds  and à are ordered by the cost function (4) that would result from adding the current connection to the set of routes of that port. The ordering of the priority queue tries to reuse the same paths as much as possible if the communication topology is one-to-many or many-to-one. Ports of kind À are the non-conflicting busy ports (the value of the cost function is 1) that will not suffer added network fabric contention from adding the path (s → d), thereby saving links for other connections. Ports of kind Á are the free ports: a new path will be used if any busy path would cause contention. Ports of kind  are the conflicting ports that will not experience more contention from adding the path (s → d), and, finally, kind à comprises all remaining conflicting ports, ordered by the number of conflicting paths, in an attempt to distribute the conflicts evenly if they cannot be avoided. In summary, this heuristic tries to find paths that economize links without causing additional network fabric contention. This is done by leaving more room for the remaining connections, finding non-conflicting paths through free links, and eventually distributing the conflicts across ports for the conflicting connections.
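The offline contention metric of Sec. 3.1 (Eqs. (2)-(5)) can be computed directly from per-port route annotations. A minimal sketch with our own identifiers (`port_routes` maps a global port id to the list of (s, d) routes using it; this is an illustration, not the paper's implementation):

```python
def port_cost(routes_p):
    """Eq. (4): cost(p) = min(|srcs(p)|, |dsts(p)|) for the set of
    (s, d) routes annotated on port p (Eqs. (2) and (3))."""
    srcs = {s for s, d in routes_p}
    dsts = {d for s, d in routes_p}
    return min(len(srcs), len(dsts))

def global_cost(route_ports, port_routes):
    """Eq. (5): maximum, over the ports a route traverses, of their cost."""
    return max(port_cost(port_routes[p]) for p in route_ports)

# Two routes with distinct sources and distinct destinations sharing a
# port give cost 2 on it; many-to-one traffic through a port keeps its
# cost at 1, since that contention is concentrated at the endpoint.
port_routes = {0: [(0, 5), (1, 6)], 1: [(2, 7), (3, 7)]}
assert port_cost(port_routes[0]) == 2
assert port_cost(port_routes[1]) == 1
assert global_cost([0, 1], port_routes) == 2
```

The second assertion illustrates the "useful secondary property" above: a busy port serving a many-to-one pattern is not charged for network fabric contention.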
3.3 The Colored Heuristic (Colored)

Under certain conditions on the matrix M, the problem of assigning an optimized set of routes could be formulated as a graph-coloring problem (assigning an NCA to each communicating pair). However, the general case cannot be formulated as such in a practical way, because (i) it would have to be formulated as minimum weighted coloring (the weight being the cost function), and (ii) the weights depend on the coloring assignment. We have derived a heuristic that makes use of some properties specific to the recursive nature of the XGFT to approximate the original graph-coloring formulation. We will call this heuristic Colored.

The Colored heuristic relies on a routing property of XGFTs, proved in [3], which states that, regardless of the NCA chosen for (s → d), the relative parents of the complete route (upward and downward path) are symmetric. All complete routes in an XGFT have an odd number of hops, and the middle hop selects the NCA. Once in the NCA, it is no longer possible to choose a different set of relative parents of the smallest sub-trees down to the destination node d. The first hop downward from the NCA is determined by d, and the rest of the route will follow exactly the same relative sequence of parents (in inverse order) as the upward path.

Our algorithm explores the outgoing and incoming connections hierarchically, from the leaves towards the roots of the trees. The goal is to achieve the least conflicting assignments, level by level, considering all the nodes under a certain level of the tree as a cloud (SuperNode) that sends and receives messages. At level 0 there are as many clouds as leaf nodes; at level 1 there will be N/m_1 clouds of m_1 nodes each. At each level from 0 to (h − 1), the algorithm will assign some or all of the w_{l+1} parents to the different outgoing or incoming connections of the cloud. The assignment is done per level such that the cost function (5) is minimized for all connections. Note that the sequences of parents chosen in going up the network fabric are the resulting routes.

The implementation of the algorithm distinguishes between the sending and receiving communications of a cloud. We denote by SourceGroups and TargetGroups the sets of sending and receiving connections, respectively. SourceGroups and TargetGroups are ordered, and the assignment of parents is done group by group. The benefit of assigning parents (partial routes) by SourceGroups and TargetGroups is that the communication topology is taken into account. The embedding of routes into the physical topology is done in such a way that it optimizes both the conflicts and the use of resources. This is achieved by concentrating the contention of the "sending" nodes (SourceGroups) in the upward paths, and the contention of the "receiving" nodes (TargetGroups) in the downward paths. Next, we introduce some definitions and, after that, the Colored heuristic.

Definition 1: A SuperNode N_i^l of level l is the set containing the nodes that belong to a sub-tree of level l. Node i is SuperNode N_i^0, the nodes connected to first-level (l = 1) switch i constitute SuperNode N_i^1, and so on. SuperNodes capture the recursive nature of the physical topology.

Definition 2: The SourceGroup S_i^l is the set of communicating pairs (s_k → d_k) whose sources belong to SuperNode N_i^l and whose destinations belong to any other SuperNode, but not itself. Equivalently, a TargetGroup T_j^l is the set of communicating pairs whose targets belong to N_j^l and whose sources belong to any other SuperNode, but not itself. Note that a SourceGroup or TargetGroup of level l does not contain the communication pairs internal to the SuperNode, but only those that cross level l. SourceGroups and TargetGroups superpose the structure of the communication topology onto the physical topology.

Definition 3: A route for s → d is defined as the sequence of selected intermediate parents <r_0, r_1, ..., r_{l−1}> up to one of the NCAs at level l connecting s and d. The r elements are specific to each communication pair, but their pair indices will be omitted for brevity. The available routes for s → d depend on the number of roots. Given two nodes (s, d) whose common level is h_{s,d}, they can choose from as many as \prod_{i=0}^{h_{s,d}} w_i roots. If the route from s → d has already been set up to level l < h_{s,d}, the set of available roots at level h_{s,d} is restricted to \prod_{i=l}^{h_{s,d}} w_i. Moreover, two assignments <r_0, r_1, ..., r_l> and <r'_0, r'_1, ..., r'_l> for any two communicating pairs of any SourceGroup or TargetGroup will share the restricted set of common roots only if they have a common prefix: ∀i ≤ l, r_i = r'_i.

Algorithm: Colored Heuristic. Input: connectivity matrix M. Output: optimized routes for M according to Global_cost (5).

Step À: for each level l, 0 < l < (h − 1), of the XGFT, build a list L_l containing the SourceGroups and TargetGroups of level l ordered by their cardinality, i.e., by the number of outgoing/incoming communications in each group. The order is such that the groups (SourceGroups or TargetGroups) that potentially need more resources are analyzed first and direct the future assignments of roots for the remaining communicating pairs.

Step Á: Analysis of L_l: analyze every G_i^l ∈ L_l (G_i^l being either an S_i^l or a T_i^l). G_i^l is a collection of communicating pairs. The assignments r_l^j made to the communicating pairs (s_j → d_j) are tracked independently of G_i^l. When a G_i^l is analyzed, the communicating pairs are queried for their assignment of r_l^j. Some communicating pairs might have r_l^j assigned from a previous analysis of another group, and others might not.

Step Â: Analysis of G_i^l: for each set S = {(s → d) ∈ G_i^l} sharing the same route prefix, choose an r_l for each communicating pair. To decide which root r_l from the available roots should be assigned to each communicating pair, a matrix R is built. The rows of the matrix are labeled by the individual communicating pairs, and the columns represent the available parents 0, ..., (w_{l+1} − 1). The matrix is filled in with a weight indicating how good (positive value) or how bad (negative value) it is to choose the parent in column k for row j.

Step Ã: Rules to fill row j of matrix R, for a SourceGroup or TargetGroup. Given row j, with (s_j → d_j, <r_0^j, ..., r_{l−1}^j>) ∈ G_i^l, we will call S_{s_j}^l the SourceGroup containing s_j as a source and T_{d_j}^l the TargetGroup containing d_j as a destination.

À The communicating pairs (s_k → d_k) of SourceGroup S_{s_j}^l that have a root r_l^k assigned are analyzed:
• For each communicating pair k with d_j ≠ d_k and a previously assigned root r_l^k, penalize column r_l^k heavily.
• For each communicating pair k with d_j = d_k and a previously assigned root r_l^k, increase the preference of column r_l^k.

Á The communicating pairs (s_k → d_k) of TargetGroup T_{d_j}^l that have a root r_l^k assigned are analyzed:
• For each communicating pair k with s_j ≠ s_k and a previously assigned root r_l^k, penalize column r_l^k heavily.
• For each communicating pair k with s_j = s_k and a previously assigned root r_l^k, increase the preference of column r_l^k.
The row j having the highest positive value in the entire matrix, say in column k, is chosen: r_l = k is set for s_j → d_j. Step à is repeated with the non-assigned communicating pairs as long as pairs remain in S. At the first iteration of the algorithm, all values of matrix R are 0, and the parent chosen is the one with the smallest index, i.e., 0. As the algorithm iterates, more information for choosing the roots becomes available. At some point after Step Â, a particular G_i^l may have all the terms in its matrix negative (conflicting). When this happens, the current assignments of r_l are removed, and the group is given a second chance by putting it back at the end of the list to be analyzed. The rationale is that, as the algorithm progresses, there is more information for assigning the roots so as to better balance the conflicts. This second chance (a kind of backtracking) is given only once to each group, and only if all elements in the matrix were negative (conflicting). In the worst case, each group is analyzed twice. The complexity of this algorithm does not grow exponentially because, at each level l, (i) the communicating pairs that do not cross level l are ignored, and (ii) for each SourceGroup or TargetGroup, only the communicating pairs sharing the same prefix of assigned roots up to level l are analyzed together. The time needed to compute the pattern-aware routing tables for the applications studied does not exceed 8 s in the worst case, whereas typical runs of these HPC applications take several hours.

4. WORKLOADS AND EXPERIMENTAL METHODOLOGY

In this section, the applications chosen as benchmarks are described, and the methodology and tools employed are presented.

4.1 Applications

Most of the research done on routing has focused on maximizing the throughput of synthetic, flow-controlled, generated traffic. In this work, we simulate the MPI level of execution traces of the following applications:

1. WRF (Weather Research and Forecasting) is a numerical weather prediction system designed to serve the atmospheric research community. We include results with 256 processors (WRF-256).

2. Alya is a Finite Element Method (FEM) solver code. Alya uses the Metis partitioning library to balance the workload among threads. We include results with 101, 200 and 201 processors (Alya-101, Alya-200, and Alya-201). We have also executed a replay of Alya-200 replacing the synchronous Send calls with asynchronous ones; we refer to this run as Alya-200Isends.

4.2 Tools and Experimental Framework

To study the effect of the routing scheme on network contention, we have used two coupled simulators: Venus and Dimemas. Venus is an event-driven simulator developed at the IBM Zurich Lab that is able to simulate any generic network topology of nodes, switches and wires at the flit level. It can simulate the full range of XGFTs as well as many other topologies. Dimemas [11] is an MPI simulator driven by a post-mortem trace of a real application execution. The trace contains the MPI calls the application performed, which in turn capture the communication pattern as well as the causal relationships between messages. Dimemas reconstructs the temporal behavior according to a parametric bus network model. We have implemented a co-simulation approach between Venus and Dimemas to substitute the default network model of Dimemas with the detailed network model of Venus [14]. We have used an input/output-buffered switch model, a link speed of 2 Gbit/s, a flit size of 8 bytes, and a segment size of 1 KB with round-robin interleaving of messages at the network adapter.

We obtained execution traces from runs of the selected applications. Dimemas was fed with the execution trace, relying on Venus to do the detailed network simulation of the communications. We extracted the connectivity matrix M (source-destination pairs) for each communication phase. For each topology under study (instantiations of XGFTs), we fed our routing algorithms with (i) the connectivity matrix, (ii) the topology file, and (iii) the mapping of processes to nodes (sequential). The routes obtained were then supplied, along with the topology and mapping, to the Venus simulator. For better comparison, we have normalized the reported times to the time taken by a single ideal full crossbar connecting all the nodes. Simulating a full crossbar with hundreds of ports provides the best performance that can be obtained in the absence of network contention; a full crossbar does not need any routing algorithm.

5. EVALUATION

In this section we present the results obtained for non-slimmed and slimmed networks. We address here the question posed in Section 1 of how much a network can be trimmed without incurring a significant performance degradation. We take into account the effect of the routing schemes, comparing several well-known and some custom techniques. For the slimmed networks, we analyze every application in greater detail. Finally, we summarize the results to draw overall conclusions on the joint effects of topology and routing decisions.
5.1 Non-slimmed networks Figure 2 shows the relative degradation of the various routing schemes for the communication patterns of the applications we have evaluated. WRF performance is almost identical for all routing schemes, except for Random, which reduces it by more than a factor of 3. CG is also noticeably affected by the choice of the routing scheme: here S-mod-k, D-mod-k and BeFS are unable to obtain the optimum performance for CG, which only Colored achieves. The communication phases of Alya-101, Alya-201 and Alya-200Isends exhibit almost no difference for the different routing schemes compared with the Full Crossbar. None of
3. The NAS Parallel Benchmarks is a set of pseudoapplications and numerical kernels designed to compare the performance of HPC machines. We present the results for Conjugate Gradient (CG) from the NPB suite, which is one of the most demanding in terms of point-to-point communication performance. We include here results with 128 processors for data-set class D: CG.D-128.
281
the five routing schemes is able to achieve the Full Crossbar performance for Alya-200, and they all perform almost identically. A deeper analysis of the Alya-200 communication pattern reveals that the routing schemes are already close to the achievable optimum: the chain of several synchronous sends in the implementation causes dependencies that can only be resolved by appropriately scheduling the sends, which is beyond the scope of the routing scheme. Figure 2 shows that pattern-aware routing schemes work well on complete k-ary n-tree topologies. Random static routing is probably the least advisable technique, despite the fact that most of the literature favors it; that recommendation seems to be based mostly on research using continuous injection of synthetic traffic. As we can see, static routing techniques, including S-mod-k, which was proposed in the first works [12] [16] on these topologies, do reasonably well for non-slimmed networks.

Figure 2: Routing schemes in non-slimmed networks vs. Full Crossbar (no routing).

Figure 3: Routing schemes in slimmed versions of (a) 16-ary 2-trees (32-port switches, m1 = m2 = 16) and (b) 8-ary 3-trees (16-port switches, m1 = m2 = 8, m3 = 4), for WRF-256. The x axis indicates the number of inner switches of the progressively slimmed topologies with parameters (w2, w3, ...) of the corresponding slimmed XGFT derived from the complete k-ary n-tree.
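In a two-level tree, each of the oblivious schemes above reduces to choosing a middle switch for every (source, destination) pair. The sketch below contrasts those choices on a WRF-like ±16 exchange; the `mod w2` reduction for slimmed trees and the conflict metric are our own illustrative assumptions, not the paper's exact implementation:

```python
import random

def middle_switch(s, d, w2, scheme, rng):
    # Middle-switch choice for a (source, destination) pair in an
    # XGFT(2; 16, 16; 1, w2). S-mod-k and D-mod-k follow the text;
    # reducing "mod w2" when the tree is slimmed is an assumption.
    if scheme == "Random":
        return rng.randrange(w2)
    if scheme == "S-mod-k":
        return s % w2
    if scheme == "D-mod-k":
        return d % w2
    raise ValueError(scheme)

def max_uplink_load(pairs, w2, scheme):
    # Flows sharing the most-loaded upward link of any first-level
    # switch; 1 means the permutation is routed conflict-free.
    rng = random.Random(0)  # fixed seed for repeatability
    load = {}
    for s, d in pairs:
        link = (s // 16, middle_switch(s, d, w2, scheme, rng))
        load[link] = load.get(link, 0) + 1
    return max(load.values())

# WRF-like non-local phase: every node exchanges with the node 16 away.
pairs = [(s, (s + 16) % 256) for s in range(256)]
print(max_uplink_load(pairs, 16, "D-mod-k"))  # 1: conflict-free
print(max_uplink_load(pairs, 16, "Random"))   # >1: some uplinks carry 2+ flows
```

With 16 middle switches both deterministic schemes route this permutation conflict-free, whereas a random assignment almost always piles several flows onto some uplink, in line with the poor behavior reported for Random on WRF.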
5.2 Slimmed trees

When using the slimmed-tree versions, which constrain the availability of paths, the routing problem plays a more important role. So far, routing techniques in regular slimmed trees have not been studied using detailed simulations of the actual traffic generated by applications. The following subsections present the results obtained for the applications studied in this paper.
5.2.1 WRF

The WRF communication pattern consists of pairwise exchanges between neighboring nodes in a 2-D mesh (±1, ±16 nodes away). The phase analyzed here is the non-local one (±16 nodes away). Figure 3(a) shows how the routing algorithms perform when we slim the tree, using fewer switches at the second level. The first point (w2 = 16 middle switches) corresponds to the WRF histogram in Figure 2 (non-slimmed network). The X axis records the topology, whereas the Y axis is the slowdown with respect to a Full Crossbar. The network with only one middle switch (rightmost, w2 = 1) can be considered the worst case: in this minimum-cost network, routing decisions do not matter because there is only one path between each pair of nodes. The following effect can be observed: if a single middle switch is taken out (w2 = 15), the duration of the communication phase doubles, but if we take out 2, 3, 4, or even 8 middle switches, neither D-mod-k, S-mod-k, nor Colored suffers any additional decrease in performance. The performance degradation of these three routing schemes exhibits a step-wise behavior: once an additional switch has been removed, several more can be removed without degrading performance further. The vertical lines in Figure 3(a) are placed between the steps of the most efficient algorithm. The BeFS approach is very sensitive to the slimming, and the random approach shows high variability. Figure 3(b) shows the relative performance degradation for progressive slimming (of the second and third levels) of an XGFT(3; 8, 8, 4; 1, w2, w3). The top X axis shows the total number of switches of the topology whose (w2, w3) parameters appear on the bottom X axis. The case w2 = w3 = 8 corresponds to the 8-ary 3-tree. We draw conclusions similar to the case with 16 middle switches: Random is not advisable; S-mod-k, D-mod-k, and Colored exhibit stable behavior and performance close to the optimum achievable for each slimmed topology, which in turn is close to that of the Full Crossbar. We note that S-mod-k, D-mod-k, and Colored sustain the performance of the first step down to the slimmed topology with w2 = 4, w3 = 2, using only 56 switches for WRF-256 (compared with the 128 switches of topology w2 = w3 = 8). A closer look at Figure 3(b) shows that, from configuration (w2 = 7, w3 = 7) down to configuration (w2 = 4, w3 = 1), only the configurations with w3 = 1, i.e., a single top-most-level switch per sub-tree, suffer a degradation of more than a factor of four. WRF needs very little connectivity at the top-most level, but it needs w3 ≥ 2; otherwise, its performance halves again. If we could accept a performance degradation of four times relative to the Full Crossbar for WRF, we could choose configuration w2 = 3, w3 = 1, with only 47 switches. However, the penalty incurred by a bad routing scheme would then be huge.
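The switch counts quoted here (128 for w2 = w3 = 8, 56 for w2 = 4, w3 = 2, and 47 for w2 = 3, w3 = 1) follow from the XGFT construction, in which level l contains (m_{l+1} · · · m_h) · (w_1 · · · w_l) switches. A small sketch of this counting (the function name is ours):

```python
import math

def inner_switches(m, w):
    # Total inner switches of XGFT(h; m1..mh; w1..wh): level l has
    # prod(m[l+1..h]) * prod(w[1..l]) switches; sum over l = 1..h.
    h = len(m)
    return sum(math.prod(m[l:]) * math.prod(w[:l]) for l in range(1, h + 1))

# XGFT(3; 8, 8, 4; 1, w2, w3), 256 nodes:
print(inner_switches([8, 8, 4], [1, 8, 8]))  # 128 (the complete 8-ary 3-tree case)
print(inner_switches([8, 8, 4], [1, 4, 2]))  # 56
print(inner_switches([8, 8, 4], [1, 3, 1]))  # 47
```

The same formula gives 32 switches for the complete XGFT(2; 16, 16; 1, 16) of Figure 3(a).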
5.2.2 Alya

Figure 4 shows three executions of Alya with different numbers of processors (101, 201, and 200), whereas Figure 4(d) shows the Alya-200 case executed with asynchronous calls (MPI_Isend) instead of synchronous ones (MPI_Sendrecv). The results for Alya-101 and Alya-201 (Figures 4(a) and 4(b)) show that the routing schemes studied have almost no impact on Alya's performance with as few as w2 = 6 middle switches. With fewer middle switches, only the BeFS routing scheme performs badly. Alya-200 (Figure 4(c)) shows that all routing schemes except BeFS achieve a performance close to that of a Full Crossbar. The small impact of the routing algorithm, together with performance close to that of a Full Crossbar, would suggest that most communications are local. However, that is not the case: the communication pattern has many non-local communications. The small effect of routing decisions on the performance of the communication phases of all Alya variants stems instead from the implementation of the communication phase, which uses synchronous send/receive calls and thereby serializes each of the data exchanges between the nodes, underutilizing the network. We have tested how this communication pattern would perform if all calls were turned into asynchronous ones. This change is possible for this application because the causal dependencies introduced by the blocking nature of the MPI_Sendrecv calls are not inherent to the algorithm, but to the implementation alone. Figure 4(d) shows the performance results normalized to the Full Crossbar completion time for the synchronous case. In comparison with Figure 4(c), performance doubles with this implementation change. The variability among the routing schemes becomes only slightly more noticeable, however: the communication pattern of Alya is limited by endpoint contention, which the routing scheme cannot mitigate. As evidenced by Figures 4(c) and 4(d), most routing schemes do reasonably well with as few as w2 = 8 middle switches in the synchronous case and w2 = 11 in the asynchronous case. Colored sustains stable performance down to five middle switches in both the synchronous and the asynchronous case.

5.2.3 CG

The results for CG are plotted in Figure 5. CG has a communication pattern that consists of five exchanges of equal size, four of which are local to the first-level switch for the radix we have used (m1 = 16); the radix is the k parameter of a k-ary n-tree. Only the fifth phase is non-local, so whatever performance degradation this application suffers due to routing decisions corresponds exclusively to the fifth exchange phase. All routing schemes except Colored entail a huge performance degradation. The fifth phase of CG performs exchanges with destinations whose differences are multiples of the radix, which is precisely the kind of exchange that the D-mod-k algorithm cannot route without conflicts. When the tree is slimmed, i.e., with w2 = 15 middle switches, the best possible performance can no longer be obtained: at least one non-local communication will suffer contention, doubling the time of this fifth phase and therefore increasing the ideal time, i.e., that of a Full Crossbar, by 1/5. Colored can route the pattern with only nine middle switches without additional performance degradation. With eight middle switches, a new conflict arises, increasing the total time by an additional 1/5.
Figure 4: Progressive tree-slimming for (a) Alya-101, (b) Alya-201, (c) Alya-200, and (d) Alya-200-Isends (using Immediate Sends), normalized to the performance of a full crossbar with synchronous sends from (c).

Figure 5: CG.D, 128 processors, progressive tree-slimming of XGFT(2; 16, 16; 1, w2).

We performed a detailed analysis of the performance degradation incurred by D-mod-k in the non-slimmed case (w2 = 16), i.e., a k = 16-ary 2-tree. There is no contention in the first four phases, which are local to the switch. However, the degradation in the fifth phase (all messages of equal size, namely 750 KB) accounts for more than a factor of two. The simulated trace reveals that this last phase takes eight times longer with D-mod-k routing. This is due to the nature of the communication pattern of CG: each processor s inside a switch communicates with a processor

d = ⌊s/2⌋ · 16 + (s mod 2).    (6)

D-mod-k routing will choose r1 = (d mod 16) as the first local port going up into the tree. Given (6), r1 can only be either 0 for the eight sources within a switch where s ≡ 0 (mod 2), or 1 for the other eight sources, where s ≡ 1 (mod 2).
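The effect of (6) under D-mod-k can be reproduced in a few lines: all 16 flows of a switch fall on upward ports 0 and 1, so eight flows serialize on each uplink, matching the eight-fold slowdown observed in the trace. A sketch using the r1 = d mod 16 rule stated above:

```python
# Destinations of CG's fifth phase for the 16 processors of one switch,
# and the D-mod-k upward port each flow is assigned (r1 = d mod 16).
ports = {}
for s in range(16):
    d = (s // 2) * 16 + (s % 2)   # equation (6)
    r1 = d % 16                   # D-mod-k first upward port
    ports.setdefault(r1, []).append(s)

print(sorted(ports))                                # [0, 1]: only two ports used
print({p: len(srcs) for p, srcs in ports.items()})  # 8 flows per port
```

Only 2 of the 16 available upward ports ever carry traffic, which is why this permutation is a worst case for D-mod-k regardless of how many middle switches the tree provides.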
6. CONCLUSIONS AND FUTURE WORK

In this paper, a detailed analysis of the impact of routing on communication performance has been carried out for a broad family of commonly used networks. The workload considered is based on several benchmarks and production applications. This analysis allows us to draw several conclusions for both oblivious and pattern-aware routing schemes. Regarding oblivious routing, one of the most interesting conclusions is that a random distribution of paths is not advisable: it introduces great variability and yields poor performance. Even simple regular routings do better on a non-slimmed network. Both the S-mod-k and D-mod-k routing schemes are good, but their effectiveness depends strongly on both the communication pattern and the application mapping. There is a group of permutations common in parallel computing that cause network conflicts and degrade performance. Regarding pattern-aware techniques, the BeFS strategy suffers from its greediness: the first paths are routed without conflicts, but as more paths are added without the possibility of backtracking, the algorithm performs poorly. The nature of the algorithm concentrates contention on the upward paths. The Colored approach, in contrast, which tries to optimize both the conflicting paths and the resource usage, proved useful for improving the performance of the communication phases in complete and strongly slimmed trees for all the benchmarks studied. Routing is coupled with mapping, and production machines usually cannot offer a sequential mapping of processes to nodes, but only fragments scattered across the network. This makes routing policies even harder to evaluate. We plan to continue this work by studying the effect of such fragmentation on both oblivious and pattern-aware routing techniques. In addition, fault-tolerant and power-saving supercomputers could benefit from pattern-aware routing schemes, such as Colored, that try to optimally embed the communication topology of the application into the physical topology of the network.
7. ACKNOWLEDGEMENTS

This work has been partially supported by the Ministry of Science and Technology of Spain under contracts TIN2004-07739-C02-01, TIN-2007-60625, and TIN2007-68023-C02-01, the BSC-IBM MareIncognito research agreement, and the HiPEAC European Network of Excellence. Part of it was carried out during German Rodriguez's internship at the IBM Zurich Research Laboratory. We would also like to thank Phillip Stanley-Marbell of the IBM Zurich Research Laboratory for his thorough reading and valuable comments.

8. REFERENCES

[1] S. Coll, J. Duato, F. Petrini, and F. J. Mora. Scalable hardware-based multicast trees. In Proc. 2003 ACM/IEEE Conference on Supercomputing (SC '03), page 54, Washington, DC, USA, 2003. IEEE Computer Society.
[2] N. Desai, P. Balaji, P. Sadayappan, and M. Islam. Are nonblocking networks really needed for high-end-computing workloads? In Proc. 2008 IEEE International Conference on Cluster Computing, pages 152–159, Washington, DC, USA, 2008. IEEE Computer Society.
[3] Z. Ding, R. R. Hoare, A. K. Jones, and R. Melhem. Level-wise scheduling algorithm for fat tree interconnection networks. In Proc. 2006 ACM/IEEE Conference on Supercomputing, page 96, New York, NY, USA, 2006. ACM.
[4] J. Flich, M. P. Malumbres, P. López, and J. Duato. Improving routing performance in Myrinet networks. In Proc. of the 14th International Parallel and Distributed Processing Symposium, pages 27–32, Los Alamitos, CA, USA, 2000. IEEE Computer Society.
[5] C. Gomez, F. Gilabert, M. Gomez, P. Lopez, and J. Duato. Deterministic versus adaptive routing in fat-trees. In Proc. of the 21st International Parallel and Distributed Processing Symposium, pages 1–8, Mar. 2007.
[6] R. I. Greenberg and C. E. Leiserson. Randomized routing on fat-trees. In Proc. of the 26th Annual Symposium on Foundations of Computer Science, pages 241–249, 1985.
[7] A. Jajszczyk. Nonblocking, repackable, and rearrangeable Clos networks: fifty years of the theory evolution. IEEE Communications Magazine, 41(10):28–33, Oct. 2003.
[8] G. Johnson, D. J. Kerbyson, and M. Lang. Optimization of InfiniBand for scientific applications. In Proc. of the 22nd International Parallel and Distributed Processing Symposium, pages 1–8. IEEE, 2008.
[9] S. Kamil, J. Shalf, L. Oliker, and D. Skinner. Understanding ultra-scale application communication requirements. In Proc. IEEE International Symposium on Workload Characterization, pages 178–187, Oct. 2005.
[10] H. Kariniemi. On-Line Reconfigurable Extended Generalized Fat Tree Network-on-Chip for Multiprocessor System-on-Chip Circuits. PhD thesis, Tampere University of Technology, 2006.
[11] J. Labarta, S. Girona, V. Pillet, T. Cortes, and L. Gregoris. DiP: A parallel program development environment. In Proc. of the Second International Euro-Par Conference on Parallel Processing, volume II, pages 665–674, London, UK, 1996. Springer-Verlag.
[12] C. E. Leiserson et al. The network architecture of the Connection Machine CM-5. In Proc. of the Fourth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 272–285, San Diego, CA, USA, June 1992.
[13] X.-Y. Lin, Y.-C. Chung, and T.-Y. Huang. A multiple LID routing scheme for fat-tree-based InfiniBand networks. In Proc. of the 18th International Parallel and Distributed Processing Symposium, 2004.
[14] C. Minkenberg and G. Rodriguez Herrera. Trace-driven co-simulation of high-performance computing systems using OMNeT++. In Proc. 2nd International Workshop on OMNeT++, held in conjunction with the Second International Conference on Simulation Tools and Techniques (SIMUTools '09), 2009.
[15] J. Navaridas, J. Miguel-Alonso, F. J. Ridruejo, and W. Denzel. Reducing complexity in tree-like computer interconnection networks. Technical Report EHU-KAT-IK-06-07, UPV/EHU, 2007.
[16] S. R. Öhring, M. Ibel, S. K. Das, and M. J. Kumar. On generalized fat trees. In Proc. of the 9th International Parallel Processing Symposium, page 37, Washington, DC, USA, 1995. IEEE Computer Society.
[17] M. Palesi, R. Holsmark, S. Kumar, and V. Catania. Application specific routing algorithms for networks on chip. IEEE Trans. Parallel Distrib. Syst., 20(3), 2009.
[18] J. Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1984.
[19] F. Petrini and M. Vanneschi. A comparison of wormhole-routed interconnection networks. In Proc. Third International Conference on Computer Science and Informatics, Research Triangle Park, NC, USA, Mar. 1997.
[20] J. C. Sancho and A. Robles. Improving the Up*/Down* routing scheme for networks of workstations. In Proc. 6th International Euro-Par Conference on Parallel Processing, pages 882–889, London, UK, 2000. Springer-Verlag.
[21] J. C. Sancho, A. Robles, and J. Duato. Effective strategy to compute forwarding tables for InfiniBand networks. In Proc. of the International Conference on Parallel Processing, page 48, Los Alamitos, CA, USA, 2001. IEEE Computer Society.
[22] L. G. Valiant and G. J. Brebner. Universal schemes for parallel communication. In Proc. of the 13th Annual ACM Symposium on Theory of Computing (STOC), pages 263–277. ACM, 1981.
[23] J. S. Vetter and F. Mueller. Communication characteristics of large-scale scientific applications for contemporary cluster architectures. J. Parallel Distrib. Comput., 63(9):853–865, 2003.