In Proceedings of the 1994 Symposium on the Theory of Computing Montreal, Canada, May 23-25, 1994.
Scalable Expanders: Exploiting Hierarchical Random Wiring
Eric A. Brewer
Frederic T. Chong
F. Thomson Leighton
MIT Laboratory for Computer Science MIT Artificial Intelligence Laboratory MIT Department of Mathematics
Abstract

Recent work has shown many advantages to randomly wired expander-based networks. Unfortunately, the wiring complexity of such networks becomes physically problematic as they become large. This paper introduces a technique for scaling expanders that avoids this wiring complexity. Specifically, we make the following contributions:

1. We introduce hierarchical expanders, which use a method of scaling small expanders to larger ones while maintaining practical physical construction. We present an example of such a scalable network, called the metabutterfly, which is scaled from the randomly wired multibutterfly.

2. We present a proof that we can scale any $(\alpha, \beta, M, N)$-expander with $\alpha M \ge 1$ into an $(\alpha', \beta', kM, kN)$-expander with probability at least $1 - 2e^{-\alpha M}$, where $\alpha' = \alpha^2 / (\beta^2 e^4 + 4\alpha)$ and $\beta' = \beta - 2$.

3. We present empirical evidence that the performance and fault tolerance of metabutterflies equal those of traditional randomly wired multibutterflies, despite the greatly simplified wiring of the metabutterfly.

1 Introduction

This paper introduces hierarchical expanders, expanders that are scalable in practice. We construct these expanders with a transformation called a k-extension, which we prove can preserve expansion. We also introduce the metabutterfly, a novel hierarchical expander network constructed by k-extending a randomly wired multibutterfly. Finally, we present the results of detailed simulations of the metabutterfly that show that its performance and fault tolerance match those of the multibutterfly.

Although the theoretical results apply to expanders in general, this work was motivated by practical limitations on the scalability of randomly wired multibutterflies. In this section, we discuss the advantages of randomly wired multibutterflies and the difficulties encountered in the construction of large ones. We propose our solution to these problems, hierarchical expanders, in Section 2. In Section 3, we present a probabilistic analysis that proves that, with high probability, k-extension preserves expansion. Section 4 presents our empirical results, and Section 5 discusses some open issues.

Eric Brewer is supported in part by the National Science Foundation, grant CCR-8716884; by ARPA, contract N00014-91-J-1698; by an equipment grant from Digital Equipment Corporation; and by grants from AT&T and IBM. Fred Chong is supported in part by an Office of Naval Research Graduate Fellowship and ARPA contract N00014-91-J-1698. Tom Leighton is supported in part by Air Force Contract AFOSR F49620-92-J0125 and ARPA contracts N00014-91-J-1698 and N00014-92J-1799.

1.1 Definitions

A splitter network is composed of multiple stages of routers, organized into splitters. The canonical butterfly is an example of a splitter network. It is helpful to view routing a message through a splitter network as a sorting function through equivalence classes of fewer and fewer routers. Specifically, for the $i$th stage there are $r^i$ equivalence classes, each with $r^{s-i}$ routers, where $r$ is the radix of the routers (the number of directions among which the router selects) and $s$ is the number of stages in the network. Each equivalence class is connected to $r$ equivalence classes in the next stage. An individual splitter consists of an equivalence class and its $r$ associated equivalence classes in the next stage.

A bipartite graph with $M$ inputs and $N$ outputs is an $(\alpha, \beta, M, N)$-expander if every set of $m \le \alpha M$ inputs reaches at least $\beta m$ outputs. For a radix-$r$ splitter network to have expansion, each splitter must achieve expansion in each of the $r$ directions. To achieve expansion, a splitter network must have routers with redundant connections in each of its $r$ directions. We refer to this redundancy, $d$, as the multiplicity. The degree of any node in the splitter is then $dr$.
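The expander definition of Section 1.1 can be checked directly on small graphs. The following is a minimal brute-force sketch, not part of the original paper: the function name `is_expander`, the adjacency-dictionary representation, and the 4-input example graph are all illustrative assumptions, and the exhaustive search is only practical for very small M.

```python
import math
from itertools import combinations

def is_expander(neighbors, alpha, beta, num_inputs):
    """Brute-force test of the (alpha, beta, M, N)-expander property.

    neighbors[u] is the set of outputs adjacent to input u, for u in 0..M-1.
    Every set of m <= alpha*M inputs must reach at least beta*m outputs.
    """
    # Small epsilon guards against floating-point error in alpha * M.
    max_size = math.floor(alpha * num_inputs + 1e-9)
    for m in range(1, max_size + 1):
        for subset in combinations(range(num_inputs), m):
            reached = set().union(*(neighbors[u] for u in subset))
            if len(reached) < beta * m:
                return False
    return True

# Hypothetical 4-input, 4-output bipartite graph in which every input has degree 2.
example = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {3, 0}}
print(is_expander(example, alpha=0.5, beta=1.5, num_inputs=4))
print(is_expander(example, alpha=0.5, beta=2.0, num_inputs=4))
```

In this toy graph every pair of inputs reaches at least three outputs, so the property holds for beta = 1.5 but fails for beta = 2.0, where adjacent inputs share a neighbor.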
1.2 Multibutterflies

A multibutterfly is a splitter network with expansion. In particular, each $M$-input splitter of a multibutterfly is an $(\alpha, \beta, M, M/r)$-expander in each of the $r$ directions. Bassalygo and Pinsker [BP74] first studied splitter networks with expansion. Recently, numerous results have been discovered that indicate that multibutterflies are ideally suited for message-routing applications. Among other things, multibutterflies can solve any one-to-one packet routing [Upf89], circuit-switching [ALM90], or non-blocking routing problem [ALM90] in optimal time, even if many of the routers in the network are faulty [LM89]. No other networks are known to be as powerful.

The reason behind the power of multibutterflies is that expansion roughly implies that $\beta p$ outputs must be blocked or faulty for $p$ inputs to be blocked, and thus it takes $\beta^j$ faults to block one input $j$ levels back. In contrast, one fault in a radix-2 butterfly blocks $2^j$ inputs $j$ levels back. As a consequence, problems with faults and congestion that destroy the performance of traditional networks can be easily overcome in multibutterflies. (For a survey of the research on multibutterflies see [Pip93] and [LM92].)

1.3 Wiring Complexity

Multibutterflies are generally constructed by randomly wiring redundant connections between the equivalence classes of each splitter. Although deterministic constructions are known [WZ93], none are known to produce expansion comparable to random wiring.

Unfortunately, random wiring and the known deterministic constructions of good expanders scale poorly in practice. For example, a 4096-endpoint machine with multiplicity $d = 2$ has 8192 wires in the first stage, almost all of which would be long cables with distinct logical endpoints. For comparison, a fat-tree [Lei85] might have a similar number of cables for the root node, but there are few logical endpoints, so huge groups of wires can be routed together. The groups connect to many boards, but the boards are located together and the connection of cables to boards is arbitrary and thus low labor. In the multibutterfly, the cables cannot be grouped and the connection of cables to boards is constrained. The other early stages also suffer from this problem.

At first glance, it appears that this wiring complexity is inherent to both expanders and random wiring. Indeed, given a splitter with $M$ boards of input routers, $M$ boards of output routers, and $b$ routers per board, we can expect each board to be connected to about $\min(M, dbr)$ other boards when using random wiring. For typical values of $M$, $d$, $b$, and $r$, this means that we would need to connect every input board to every output board in a randomly wired splitter. Clearly, this becomes infeasible as $M$ gets large and thus the randomly wired multibutterfly does not scale well in the practical setting where the network consists of boards of chips. A similar problem arises at the level of cabinets of boards for very large machines.

In what follows, we show how to (randomly) construct a special kind of expander for which there is no explosion in cabling cost. In particular, we show how to build a multibutterfly for which each board is connected to only $dr$ other boards, no matter how large $M$ and $b$ become, thereby achieving full scalability. In effect, there are a few fat cables connected to each board instead of many thin cables. At the same time, the resulting network will still have all the same nice routing properties as a randomly wired multibutterfly. Hence, we gain scalability without sacrificing performance or fault tolerance.

In fact, once we have constructed a network with $dr$ cables per board and $k$ wires per cable, we then have the option to decrease the number of wires in each cable by multiplexing the logical connections among fewer physical wires. In effect, we can then have a few thin cables connecting to each board instead of a few fat cables, thereby decreasing the number of wires. This flexibility allows us to further reduce cabling and wiring cost; in particular, we can adjust the thickness of the cable based on the average load, which is significantly less than the peak load for a large group of wires. As long as the cable can handle the average load well, most traffic remains unaffected by the use of fewer wires.

In turn, if we are pin-limited on the board level (e.g., say each board has only $drb$ pins), decreasing the physical size of each cable, thereby cutting $b$, would allow us to increase $dr$ without altering the pin count. Increasing $d$ gives greater expansion, which results in better routing performance, and increasing $r$ allows for fewer levels in the network, which results in less routing delay and lower hardware cost. Such design options could prove to be very valuable and are not available with traditional multibutterflies, because they provide no physical locality for wires.
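To make the counts in Section 1.3 concrete, here is a small arithmetic sketch. The helper names and the board parameters in the example (64 boards of 64 routers, radix 4) are hypothetical; only the 4096-endpoint, $d = 2$ wire count and the expressions $\min(M, dbr)$ and $dr$ come from the text.

```python
def first_stage_wires(endpoints, multiplicity):
    """Wires in the first stage: one per endpoint for each of the d redundant connections."""
    return endpoints * multiplicity

def boards_reached_random(num_boards, routers_per_board, radix, multiplicity):
    """Approximate number of output boards wired to one input board
    under unconstrained random wiring: min(M, d*b*r)."""
    return min(num_boards, multiplicity * routers_per_board * radix)

def boards_reached_hierarchical(radix, multiplicity):
    """Boards wired to one board in the hierarchical construction: d*r,
    independent of M and b."""
    return multiplicity * radix

print(first_stage_wires(4096, 2))             # the 4096-endpoint, d = 2 example
print(boards_reached_random(64, 64, 4, 2))    # hypothetical M = 64, b = 64, r = 4
print(boards_reached_hierarchical(4, 2))
```

With these assumed parameters, a randomly wired input board touches all 64 output boards, whereas the hierarchical construction touches only $dr = 8$.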
2 Hierarchical Expanders

The wiring complexity of large expanders can be dramatically decreased by constructing them hierarchically. A hierarchical expander is an expander constructed from the application of a sequence of random k-extensions to an expander.

Given a directed graph $G(V, E)$, an integer $k \ge 1$, and a set of permutations of $[1, k]$, $\Pi = \{\pi_e \mid e \in E\}$, we define the k-extension of $G$ induced by $\Pi$ to be the graph $G'(V', E')$ where:

\[ V' = \{\langle v, i\rangle \mid v \in V,\ i \in [1, k]\} \quad\text{and}\quad E' = \{(\langle u, i\rangle, \langle v, j\rangle) \mid (u, v) \in E \text{ and } \pi_{(u,v)}(i) = j\}. \]

Note that $|V'| = k|V|$ and $|E'| = k|E|$.

For example, two 2-extensions of a three-cycle are shown in Figure 1. Note that 2-extension (A) results in two disconnected copies of the original graph. In general, if all $\pi_e \in \Pi$ are the identity permutation, then the k-extension consists of $k$ disjoint copies of the original graph.

Each edge in the original graph corresponds to $k$ edges in the extended graph. These groups of $k$ edges are called channels. The group of $k$ nodes that correspond to one node in the original graph form one metanode; metanodes are shown in gray in Figure 1. The metanode/channel structure of $G'$ is isomorphic to the vertex/edge structure of $G$.

We define a random k-extension of a graph $G$ to be a k-extension induced by some $\Pi$ such that each $\pi_e \in \Pi$ is an independently and uniformly chosen random permutation of $[1, k]$. Equivalently, a random k-extension of a graph $G$ can be obtained by selecting randomly and uniformly over all of the $(k!)^{|E|}$ possible k-extensions of $G$. In Section 3 we prove that random k-extensions preserve expansion, with very high probability, for any $k$.
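The k-extension defined in Section 2 is simple to build in code. The sketch below is illustrative only: the function names, the edge-list representation, and the use of 0-based indices (range(k) instead of $[1, k]$) are assumptions of this example rather than anything prescribed by the paper.

```python
import random

def random_permutations(edges, k, rng):
    """Pick one independent, uniformly random permutation of range(k) per edge."""
    perms = {}
    for edge in edges:
        perm = list(range(k))
        rng.shuffle(perm)
        perms[edge] = perm
    return perms

def k_extension(edges, k, perms):
    """Edges of the k-extension induced by perms.

    Node v of the original graph becomes the metanode {(v, 0), ..., (v, k-1)};
    edge (u, v) becomes the channel {((u, i), (v, perms[(u, v)][i]))}.
    """
    return [((u, i), (v, perms[(u, v)][i])) for (u, v) in edges for i in range(k)]

cycle = [("a", "b"), ("b", "c"), ("c", "a")]

# Identity permutations: the extension is k disjoint copies of the 3-cycle.
identity = {edge: [0, 1] for edge in cycle}
print(k_extension(cycle, 2, identity))

# A random 2-extension, seeded so the run is repeatable.
print(k_extension(cycle, 2, random_permutations(cycle, 2, random.Random(0))))
```

Because each channel is a permutation, every node of a metanode has exactly one wire in each channel incident to that metanode, which is why the extension preserves degrees and has $k|E|$ edges.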
2.1 Metabutterflies

A metabutterfly is a splitter network that is constructed from a multibutterfly through random k-extensions. Each splitter of the metabutterfly is a random k-extension of the corresponding splitter of the multibutterfly, with the possible exception of the last few stages. The value of $k$ may differ for each splitter.

For the late stages of an $M$-input, radix-$r$ multibutterfly, the splitters may be expanders only because $\alpha M_i < 1$, where $M_i = M/r^i$ is the input size of an $i$th-stage splitter. In other words, the late stages are not really providing expansion, since only sets of size zero get expansion. In this case, we construct the metabutterfly splitter, which has $M_i k$ inputs, out of an $M_i k$-input multibutterfly splitter. This avoids hierarchical wiring, but does not affect the practical scalability because $M_i$ is small. Furthermore, the replaced splitters are typically complete bipartite graphs, in which case hierarchical wiring does not reduce the wiring complexity. Alternatively, in practice it may be simpler and sufficient to k-extend these end stages, even though the resulting splitters may not provably have expansion. The simulations presented in Section 4 use this simplification.

If all of the stages actually expand sets of size at least one, then we replace the final output metanodes, which each have $k$ nodes, with $k$-input multibutterflies. This ensures that the network resolves destinations to the correct node rather than just the correct metanode. For example, a 1024-node metabutterfly can be implemented with a 64-extended 16-node multibutterfly plus 16 64-node multibutterflies for the output metanodes. The total counts of nodes, wires, and stages are each the same as for a 1024-node multibutterfly; the only difference is the wiring pattern.

Figure 2 shows a radix-2 64-input metabutterfly that is an 8-extended 8-input multibutterfly.
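The 1024-node example can be checked with a few lines of arithmetic. This is a sketch under assumptions: the helper names are hypothetical, the radix is a free parameter (the text does not fix it for this example), and all sizes are assumed to be exact powers of the radix.

```python
def stages(num_inputs, radix):
    """Number of stages of a radix-r splitter network with num_inputs inputs."""
    count = 0
    while num_inputs > 1:
        if num_inputs % radix:
            raise ValueError("num_inputs must be a power of radix")
        num_inputs //= radix
        count += 1
    return count

def metabutterfly_plan(total_nodes, k, radix):
    """Decompose a metabutterfly into a k-extended base multibutterfly
    plus one k-input multibutterfly per output metanode."""
    base_inputs = total_nodes // k
    return {
        "base_inputs": base_inputs,
        "output_multibutterflies": base_inputs,
        "inputs_per_output_multibutterfly": k,
        "stages": stages(base_inputs, radix) + stages(k, radix),
    }

plan = metabutterfly_plan(1024, 64, radix=4)
print(plan)
print(stages(1024, 4))
```

The stage count of the decomposition matches that of a plain 1024-node network of the same radix, which is the sense in which only the wiring pattern differs.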
Unlike the multibutterfly, in which the first-stage wiring is unconstrained, the connections are constrained into a two-level hierarchy. The top level of the hierarchy is the channel wiring, which reduces the number of inter-metanode connections from roughly $M \cdot M/r$ to at most $Md$, where $d$ is the multiplicity. The wires within the channels form the second level of the hierarchy and do not affect the number of inter-metanode connections. For example, for a 4096-processor machine with metanodes of size 64, the number of logical endpoints has been reduced from 4096 to 4096/64 = 64. With $d = 2$, this takes us from 8192 individual wires per stage to 128 groups of 64 wires, with each group routed as a unit. The wires within a group can be connected to the endpoint routers arbitrarily, since any (random) one-to-one mapping is sufficient.

This example has a two-level hierarchy, but deeper hierarchies are possible and actually make sense for very large networks. For example, if a two-level hierarchy requires that $k$ be very large, then it may not be possible to group $k$ nodes onto one board. A three-level hierarchy provides $k^2$ times as many nodes, which allows a much smaller $k$ for the same total number of inputs. For example, a 64K-node machine might be constructed as a (64, 16)-extended 64-input multibutterfly ($64 \times 16 \times 64 = 64\mathrm{K}$). Each board would contain 16 nodes, 64 boards would be assembled as one cabinet, and the 64 cabinets would be connected as a 64-input multibutterfly (with very thick inter-cabinet cables). The top level of the hierarchy simplifies the inter-cabinet wiring, and the second level allows the large inter-cabinet cables to be constructed as groups of 64 inter-board cables. Connecting the inter-board cables to the boards is trivial, since the assignment is random. Finally, note that the boards within a cabinet are not connected; they are located together only for wiring convenience. Likewise, the routers on a board are completely independent.

With large cables, the option of multiplexing becomes both more cost effective and more likely to deliver good performance. For example, optical fiber would provide enough bandwidth to replace very large cables. Note that a packet routing scheme across such fiber would provide little degradation in performance until the load reached bandwidth limits. The number and capacity of fibers can be designed to accommodate expected load. The larger the original cable being multiplexed, the more likely the average load will be significantly lower than the peak load. We can exploit such differences to build more cost-effective networks.

The relationship between metabutterflies and multibutterflies is quite interesting. The set of all metabutterfly wirings is a strict subset of the set of all randomly wired multibutterfly wirings. However, it does not follow automatically that a metabutterfly has expansion with high probability! Since the metabutterfly allows only a subset of the random wirings, the percentage of bad wirings may no longer be vanishingly small. A primary result of this paper is that metabutterflies are in fact multibutterflies; that is, all the splitters of the metabutterfly have expansion. We also present simulation results in Section 4 that show that the performance and fault tolerance of the metabutterfly are statistically indistinguishable from those of the multibutterfly. This is somewhat surprising since the metabutterfly constrains the randomness of the wiring in order to ensure that the network remains scalable in practice. Thus, the metabutterfly provides the size, performance and fault tolerance of a large multibutterfly, but with the wiring complexity of a small one.

Figure 1: 2-extensions of a 3-cycle. [Panels: Original Graph (nodes a, b, c); (A) Identity Permutation; (B) Different Permutation. The panels list the permutations $\pi_{(ab)}$, $\pi_{(bc)}$, $\pi_{(ca)}$ in two-row form, (1 2 / 1 2) for the identity and (1 2 / 2 1) for the swap.]

Figure 2: A radix-2, multiplicity-2, 64-endpoint metabutterfly with metanodes of size 8. Each circle on the left contains 8 inputs, and each oval metanode contains 8 routers. Each router (solid square) is a 4 × 4 switch. Each output metanode (hollow square), shown expanded at the top right, is an 8-input multibutterfly; each channel, shown expanded at the bottom right, contains 8 wires. Typically, the metanodes correspond to boards.

3 Theoretical Results

The k-extension of a graph inherits many of the properties of the underlying graph. For example, if the underlying graph is $d$-regular, then so is the k-extension. In this section, we prove a somewhat more difficult and important fact, namely, that if $G$ is an expander, then a random k-extension of $G$ is also an expander with very high probability.

It is useful to establish some intuition about the expansion of a k-extension given that the original graph is an $(\alpha, \beta, M, N)$-expander. First, if a node in the original graph has $d$ neighbors, then each of the $k$ nodes in the corresponding metanode has $d$ neighbors, all of which are distinct, for a total of $dk$ nodes. Extending this notion, if a set, $S$, of size $m \le \alpha M$ nodes in the original graph expands to a set $T$ of size $\beta m$, then the corresponding set of metanodes, which contain $km$ nodes, expands to $\beta km$ nodes covering $\beta m$ metanodes. This gives us an expansion factor of $\beta$ for any $k$.

But it does not follow that the k-extension is an $(\alpha, \beta, Mk, Nk)$-expander. In particular, if the k-extension were such an expander, then any set of at most $\alpha Mk$ nodes must achieve expansion. The argument given above requires that the $\alpha Mk$ selected nodes cover at most $\alpha M$ metanodes and be spread evenly among metanodes (although we can avoid the latter restriction). If the set covers more metanodes, then the expansion of the underlying graph does not apply, since $|S| > \alpha M$. However, for the k-extension the $\alpha Mk$ restriction does apply. Thus, the difficult part of showing that the k-extension is an expander is handling the case in which the selected set, with size at most $\alpha Mk$, covers more than $\alpha M$ metanodes.
It is easy to show that the k-extension is an $(\alpha/k, \beta, Mk, Nk)$-expander. By limiting $\alpha'$ to $\alpha/k$ we know that the selected set has size at most $(\alpha/k)Mk = \alpha M$ and thus can cover at most $\alpha M$ metanodes, which avoids the difficult case.

Naturally, we would like the expansion to be independent of $k$. If we keep $\alpha'$ independent of $k$, however, then not all k-extensions are expanders, since some of the extensions are not even connected, as shown in Figure 1(A). In particular, if we choose $k$ large enough so that the size of one copy is less than $\alpha' Mk$, then the copy must expand. However, the copy is disconnected from the rest of the graph and thus cannot expand. Thus, not all k-extensions of an expander are expanders.

Fortunately, the following result shows that the vast majority of k-extensions of an expander are also expanders, for any $k$. We will later use this fact to prove that given a multibutterfly with sufficient expansion, then, with very high probability, each splitter of a metabutterfly will have expansion, since it is a random k-extension of the corresponding multibutterfly splitter.

Theorem 1 If $G(U, V, E)$ is an $(\alpha, \beta, M, N)$-expander from $U$ to $V$, with $\alpha M \ge 1$, then for any $k \ge 1$, a random k-extension $G'(U', V', E')$ of $G$ is an $(\alpha', \beta', kM, kN)$-expander from $U'$ to $V'$ with probability at least $1 - 2e^{-\alpha M}$, where $\alpha' = \alpha^2/(\beta^2 e^4 + 4\alpha)$ and $\beta' = \beta - 2$.

Proof: In what follows, we refer to the edges of $G$ as channels in the k-extension of $G$. In addition, the nodes of $G$ correspond to metanodes in $G'$. Consequently, we use the sets $U$ and $V$ when referring to either the nodes of $G$ or the metanodes of $G'$. We use the sets $U'$ and $V'$ when referring to the nodes of $G'$.

Let $\mathcal{S}$ be any subset of $U'$ with at most $\alpha' kM$ nodes, and let $N(\mathcal{S}) \subseteq V'$ be the neighborhood of $\mathcal{S}$, which is the set to which $\mathcal{S}$ expands. In order to show that $G'$ is an expander, we must show that $|N(\mathcal{S})| \ge \beta' |\mathcal{S}|$ for all $\mathcal{S}$.

We define $s_i$ to be the number of nodes of $\mathcal{S}$ contained in the $i$th metanode of $U$ and we order the metanodes so that $s_1 \ge s_2 \ge \cdots \ge s_M$. Next, we arrange the $s_i$'s into a matrix, $F = \{f_{ij}\}$, with $\lfloor \alpha M \rfloor \ge 1$ rows, so that the values appear in column-major order. That is, $f_{ij} \ge f_{i'j'}$ if and only if $j < j'$ or $j = j'$ and $i \le i'$. Since there are $M$ metanodes and $\lfloor \alpha M \rfloor$ rows, there must be $C = \lceil M / \lfloor \alpha M \rfloor \rceil$ columns. Figure 3 shows the structure of $F$. Since $M$ may be less than $C \lfloor \alpha M \rfloor$, we pad the bottom of the rightmost column with zeroes; that is, $s_i = 0$ for all $i > M$. This matrix and the others used in this proof are strictly organizational tools: we exploit no properties of matrices other than their two-dimensional structure. Note that $S \equiv |\mathcal{S}| = \sum_{ij} f_{ij}$.

\[ F = \begin{bmatrix} s_1 & s_{\lfloor \alpha M \rfloor + 1} & \cdots & s_{(C-1)\lfloor \alpha M \rfloor + 1} \\ s_2 & s_{\lfloor \alpha M \rfloor + 2} & \cdots & s_{(C-1)\lfloor \alpha M \rfloor + 2} \\ \vdots & \vdots & & \vdots \\ s_{\lfloor \alpha M \rfloor} & s_{2\lfloor \alpha M \rfloor} & \cdots & s_{C\lfloor \alpha M \rfloor} \end{bmatrix} \]

Figure 3: The structure of $F$.

We partition the metanodes into $C$ groups corresponding to the columns of $F$; the $j$th group consists of the metanodes corresponding to the values $f_{1j}, f_{2j}, \ldots, f_{\lfloor \alpha M \rfloor j}$. Thus, the first group contains the $\lfloor \alpha M \rfloor$ metanodes that contain the most nodes in $\mathcal{S}$. For concreteness, we let $u_{ij}$ denote the metanode that corresponds to $f_{ij}$, for all $f_{ij} > 0$. (The restriction on $f_{ij}$ exists because we padded $F$ with zeroes; there may not be a corresponding metanode if $f_{ij} = 0$.)

For each group of metanodes (with the possible exception of the last, which may contain fewer than $\lfloor \alpha M \rfloor$ metanodes), we identify a particular set of $\beta \lfloor \alpha M \rfloor$ channels, such that each channel connects a metanode in the group to one of a set of $\beta \lfloor \alpha M \rfloor$ metanodes in $V$. We can always find such a set of channels since any set of size $m \le \alpha M$ in $U$ expands to a set of size $\beta m$ in $V$. The channels and metanodes that we select must satisfy certain additional properties, however. In particular, if we weight each channel with the value of $f_{ij}$ for the connected metanode in $U$, then we require that the weight of the heaviest $\beta l$ channels in the $j$th group each be at least $f_{lj}$, for all $l$ and $j$.

We can show that such a collection of channels and metanodes in $V$ can always be found by induction on $l$. The base case of $l = 1$ is trivial. Since, without loss of generality, $f_{1j} > 0$, we can just use the channels linking $u_{1j}$ to $\beta$ of its neighbors in $V$. Once we have found a set of channels satisfying the property for $l - 1$, we can augment it to a set that satisfies the property for $l$ as follows. If $f_{lj} = 0$ then we are done immediately. Otherwise, we examine the neighbors of $U_l = \{u_{1j}, u_{2j}, \ldots, u_{lj}\}$ in $V$. By the expansion properties of $G$, there are at least $\beta l$ neighbors of this set in $V$ and each channel linking $U_l$ to $N(U_l)$ has weight at least $f_{lj}$. Since, by induction, we have already found $\beta(l-1)$ nodes each with weight at least $f_{l-1,j}$, we can augment the set by choosing any $\beta$ previously unchosen metanodes from $N(U_l)$. The additional metanodes each have weight at least $f_{lj}$ and we are done.

We next construct an $N \times C$ matrix of weights $\bar{H} = \{\bar{h}_{ij}\}$ by setting $\bar{h}_{ij}$ to be the weight of the channel that connects the $i$th metanode of $V$ to the $j$th group of metanodes from $U$ just described. If there is no such connection, then we set $\bar{h}_{ij} = 0$. By the preceding analysis, we know that the $\beta l$ largest entries in the $j$th column each have size at least $f_{lj}$ for all $l$ and $j$. This means that we can define another $N \times C$ matrix $H = \{h_{ij}\}$ so that $0 \le h_{ij} \le \bar{h}_{ij}$ and so that there are precisely $\beta$ copies of each $f_{ij}$ in the $j$th column of $H$. Essentially, we take the $\beta$ largest items in the column, which each have size at least $f_{1j}$, and replace them with $f_{1j}$. Similarly, we take the next $\beta$ largest items and replace them with $f_{2j}$, and continue until we have $\beta$ copies of each $f_{ij}$. This gives us the following two properties:

\[ \sum_{i=1}^{N} h_{ij} = \beta \sum_{i=1}^{\lfloor \alpha M \rfloor} f_{ij} \ \text{ for all } j \qquad\text{and}\qquad \beta S = \sum_{ij} h_{ij}. \tag{1} \]

The matrix $H$ plays a crucial role in describing how many nodes in $V'$ are likely to be a neighbor of $\mathcal{S}$. In particular, from the definition of $h_{ij}$ we know that the $i$th metanode of $V$ is connected to a metanode in $U$ that contains at least $h_{ij}$ items from $\mathcal{S}$. Since every channel is wired in a one-to-one fashion, this means that the $i$th metanode in $V$ contains $h_{ij}$ neighbors of $\mathcal{S}$. As an immediate consequence of this fact, we can deduce that:

\[ |N(\mathcal{S})| \ge \sum_{i=1}^{N} h_{ij} \ \text{ for all } j. \tag{2} \]

Although the preceding fact is helpful, it is not sufficient, since we must show that (with high probability) $|N(\mathcal{S})|$ is close to $\beta S = \sum_{ij} h_{ij}$. To obtain the stronger bound, we rely on the fact that the $i$th metanode of $V$ does indeed contain at least $\sum_j h_{ij}$ neighbors of $\mathcal{S}$, when neighbors are counted according to multiplicity. Since each channel is wired with a random permutation, we use probabilistic methods to show that, with high probability, most of these $\sum_j h_{ij}$ neighbors are distinct (at least on average over the whole graph).

The analysis depends crucially on the following simple facts about row and column sums in $F$ and $H$. We define $a_{\min} = f_{\lfloor \alpha M \rfloor 1}$ to be the smallest item in the first column of $F$. We also define:

\[ b_{\min} = f_{\lfloor \alpha M \rfloor 2} + f_{\lfloor \alpha M \rfloor 3} + \cdots + f_{\lfloor \alpha M \rfloor C}, \qquad b_{\max} = f_{12} + f_{13} + \cdots + f_{1C} \]

to be the smallest and largest row sums in $F$ when the first column is excluded from the sum. The first key fact is that:

\[ b_{\max} \le a_{\min} + b_{\min}. \]

To see this, we expand it over two lines:

\[ \begin{array}{rcccccccc} b_{\max} &=& f_{12} &+& f_{13} &+ \cdots +& f_{1C} & & \\ a_{\min} + b_{\min} &=& f_{\lfloor \alpha M \rfloor 1} &+& f_{\lfloor \alpha M \rfloor 2} &+ \cdots +& f_{\lfloor \alpha M \rfloor, C-1} &+& f_{\lfloor \alpha M \rfloor C} \end{array} \]

The key is that column-major order ensures that for each term on top, the term below it is at least as large. In addition, since $S$ is the sum of the elements in $F$, we know that:

\[ S \ge \lfloor \alpha M \rfloor (a_{\min} + b_{\min}) \ge \lfloor \alpha M \rfloor\, b_{\max}. \tag{3} \]

Since $S \le \alpha' Mk$, we can therefore conclude that $\lfloor \alpha M \rfloor\, b_{\max} \le \alpha' Mk$ and, after defining $\alpha''$, that:

\[ b_{\max} \le \alpha'' k, \tag{4} \]

where $\alpha'' = \alpha' M / \lfloor \alpha M \rfloor$. (The value of $\alpha''$ will be determined later.)

Next, define $a_i = h_{i1}$ to be the first item in the $i$th row of $H$, and

\[ b_i = h_{i2} + h_{i3} + \cdots + h_{iC} \tag{5} \]

to be the sum of the remaining elements in the $i$th row of $H$. By the manner in which $H$ was constructed, it should be clear that $b_i \le b_{\max}$ for all $i$ and that:

\[ \sum_{i=1}^{N} \frac{b_i (a_i + b_i)}{k - b_i} \le \frac{b_{\max}}{k - b_{\max}} \sum_{i=1}^{N} (a_i + b_i), \]

and by applying Equations 1 and 4, we get:

\[ \sum_{i=1}^{N} \frac{b_i (a_i + b_i)}{k - b_i} \le \frac{\beta\, b_{\max}\, S}{(1 - \alpha'') k}. \tag{6} \]

We are now ready for the probabilistic analysis. Consider the $i$th metanode $v_i$ in $V$. By definition, $v_i$ is incident to a metanode from the first group of metanodes in $U$, which contains at least $a_i = h_{i1}$ nodes in $\mathcal{S}$. Since each channel is wired one-to-one, this means that $v_i$ contains at least $a_i$ nodes in $N(\mathcal{S})$. In addition, $v_i$ is incident to a metanode in the $j$th group that contains at least $h_{ij}$ nodes from $\mathcal{S}$ for each $j \ge 2$. Unfortunately, this does not mean that $v_i$ contains at least:

\[ a_i + \sum_{j=2}^{C} h_{ij} = a_i + b_i \]

nodes in $N(\mathcal{S})$ since there may be overlap among the neighbors of each group. However, since each channel is wired randomly and independently, we will be able to show that the amount of overlap is small with very high probability (at least on average over all $v_i$).

In the probabilistic analysis that follows, we will only account for $a_i$ distinct neighbors from the first group and $b_i$ distinct neighbors from the other groups. That is, we will assume, without loss of generality, that $v_i$ is connected to metanodes in $U$ that contain $a_i, h_{i2}, h_{i3}, \ldots, h_{iC}$ nodes of $\mathcal{S}$. In fact, $v_i$ may have more metanode neighbors in $U$ and each may contain more nodes of $\mathcal{S}$, but we will undercount by ignoring this potential for additional neighbors in $N(\mathcal{S})$.

In addition, we think of each channel as being randomly wired in sequence, starting with channels connecting to $u_{11}, u_{21}, \ldots$, and continuing in column-major order through the metanodes, and starting with the wires that are connected to nodes in $\mathcal{S}$ within each metanode. Then, regardless of the existing connections, the probability that the wire currently being connected (from a node in $\mathcal{S}$) connects to a node already in $N(\mathcal{S})$ (because of previous connections among those that we are counting) is at most:

\[ \begin{cases} 0 & \text{for the } a_i \text{ connections being made from the first group} \\[4pt] \dfrac{a_i + b_i}{k - b_i} & \text{for the } b_i \text{ connections being made for later groups} \end{cases} \]

This is because there is no chance for overlap for the first channel, and because for subsequent channels, there are still at least $k - b_i$ choices for nodes, at most $a_i + b_i$ of which can lead to previously selected nodes.

We can now use a Chernoff bound (see Lemma 1.7 of [Lei92]) to show that the probability that there are $T$ overlaps over all metanodes is at most:

\[ e^{-T \ln(\gamma / e)} \]

for any $\gamma = T/\mu > 1$, where by Equation 6:

\[ \mu = \sum_{i} \frac{b_i (a_i + b_i)}{k - b_i} \le \frac{\beta\, b_{\max}\, S}{(1 - \alpha'') k} \]

is an upper bound on the expected number of overlaps. When we consider this probability over all possible choices for $\mathcal{S}$ of size $S$, we find that with probability at most:

\[ P_S \le \binom{Mk}{S} e^{-T \ln(\gamma / e)} \]

there exists some $\mathcal{S}$ of size $S$ with at least $T$ overlaps. Thus, with probability $1 - P_S$ there is no set of size $S$ that has $T$ overlaps. In order to make $P_S$ small, we must make $\gamma$ and/or $T$ be large. On the other hand, we do not want $T$ to be too large since:

\[ |N(\mathcal{S})| \ge \sum_{i} (a_i + b_i) - T = \beta S - T \]

needs to be at least $\beta' S$. Hence, we define

\[ T = (\beta - \beta') S \tag{7} \]

to ensure that we achieve the required expansion. It now remains to select values for $\alpha''$ and $\beta'$ that ensure that $P_S \le e^{-S}$ and that $\gamma > 1$. We start this process by observing that:

\[ P_S \le \left( \frac{Mke}{S} \right)^{S} e^{-T \ln(\gamma / e)} = e^{\,S \ln\frac{Mke}{S} - (\beta - \beta') S \ln\frac{\gamma}{e}}. \]

This quantity is at most $e^{-S}$ provided that:

\[ \ln \frac{Mke}{S} - (\beta - \beta') \ln \frac{\gamma}{e} \le -1. \tag{8} \]

From Equations 6 and 7 we find that:

\[ \frac{\gamma}{e} \ge \frac{(\beta - \beta')(1 - \alpha'') k}{\beta\, b_{\max}\, e}, \]

and we find from Equation 3 that:

\[ \frac{Mke^2}{S} \le \frac{Mke^2}{\lfloor \alpha M \rfloor\, b_{\max}} = \frac{\alpha M}{\lfloor \alpha M \rfloor} \cdot \frac{ke^2}{\alpha\, b_{\max}} \le \frac{2ke^2}{\alpha\, b_{\max}}. \]

Hence Equation 8 is true provided that:

\[ \frac{2ke^2}{\alpha\, b_{\max}} \le \left( \frac{(\beta - \beta')(1 - \alpha'') k}{\beta\, b_{\max}\, e} \right)^{\beta - \beta'}, \]

which is satisfied when:

\[ \frac{2 e^3 \beta}{\alpha (\beta - \beta')(1 - \alpha'')} \le \left( \frac{(\beta - \beta')(1 - \alpha'') k}{\beta\, b_{\max}\, e} \right)^{\beta - \beta' - 1}. \]

By Equation 4, the latter inequality holds if:

\[ \frac{2 e^3 \beta}{\alpha (\beta - \beta')(1 - \alpha'')} \le \left( \frac{(\beta - \beta')(1 - \alpha'')}{\beta\, \alpha''\, e} \right)^{\beta - \beta' - 1} \tag{9} \]

and the dependence on $k$ is finally gone.

There are many ways to set $\alpha''$ and $\beta'$ so that Equation 9 is satisfied. In fact, we can make $\beta'$ arbitrarily close to $\beta - 1$ simply by making $\alpha''$ be a very small constant (assuming $\alpha$ and $\beta$ are constant). For the theorem we set $\beta' = \beta - 2$ and solve for $\alpha''$:

\[ \frac{2 e^3 \beta}{\alpha (2)(1 - \alpha'')} \le \frac{2 (1 - \alpha'')}{\beta\, \alpha''\, e}. \]

Simplifying:

\[ \alpha'' \beta^2 e^4 \le 2 \alpha (1 - \alpha'')^2. \]

We bound $(1 - \alpha'')^2$ with $(1 - 2\alpha'')$ and simplify:

\[ \alpha'' (\beta^2 e^4 + 4\alpha) \le 2\alpha, \]

which gives us the desired value:

\[ \alpha'' \le \frac{2\alpha}{\beta^2 e^4 + 4\alpha}. \tag{10} \]

Since $\alpha M \ge 1$ implies $\alpha'' = \alpha' M / \lfloor \alpha M \rfloor \le 2\alpha'/\alpha$, Equation 10 is satisfied by $\alpha' = \alpha^2 / (\beta^2 e^4 + 4\alpha)$.

We must also show that $\gamma > 1$. From Equations 6 and 7 we know:

\[ \gamma \ge \frac{T (1 - \alpha'') k}{\beta\, b_{\max}\, S} = \frac{(\beta - \beta')(1 - \alpha'') k}{\beta\, b_{\max}}, \]

and after substituting for $\beta'$ and $b_{\max}$ we get:

\[ \gamma \ge \frac{2 (1 - \alpha'')}{\beta\, \alpha''} > 1 \]

since $\alpha'' < 1/\beta^2$.

Finally, we observe that the probability that we fail to achieve the desired expansion for a random k-extension of an expander is $P_S$ summed over all possible sizes of $\mathcal{S}$. We can assume that $S \ge \alpha M$; otherwise, $\mathcal{S}$ can cover at most $\alpha M$ metanodes and there is no need for a probabilistic analysis. Thus, the probability that we fail to achieve the desired expansion is at most:

\[ \sum_{s = \lceil \alpha M \rceil}^{\alpha' Mk} e^{-s} \le 2 e^{-\alpha M}. \]
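The last step of the proof sums $e^{-s}$ over all set sizes from $\lceil \alpha M \rceil$ upward. As a quick numerical sanity check, the sketch below evaluates that sum and compares it with $2e^{-\alpha M}$, the failure probability quoted in Theorem 1. The function name and the sample parameters are hypothetical; the geometric-series reasoning is that the tail is at most $e^{-\lceil \alpha M \rceil}/(1 - e^{-1}) \approx 1.58\, e^{-\lceil \alpha M \rceil}$.

```python
import math

def union_bound_tail(alpha, num_inputs, upper):
    """Sum of e^{-s} for s from ceil(alpha * M) to upper, inclusive."""
    lower = math.ceil(alpha * num_inputs)
    return sum(math.exp(-s) for s in range(lower, upper + 1))

# Hypothetical parameters: alpha = 0.25, M = 64, and a generous upper limit of M*k with k = 8.
alpha, M, k = 0.25, 64, 8
tail = union_bound_tail(alpha, M, M * k)
bound = 2 * math.exp(-alpha * M)
print(tail, bound, tail <= bound)
```

Because the terms decay geometrically, the upper limit of the sum barely matters; the bound is governed by the first term, $e^{-\lceil \alpha M \rceil}$.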
3, we can convert an $M$-input multibutterfly with $(\alpha, \beta)$-expansion into an $Mk$-input metabutterfly in which each splitter has at least $(\alpha', \beta')$-expansion, where $\alpha'$ and $\beta'$ are those given in Theorem 1. This gives us a metabutterfly with a two-level hierarchy. Deeper hierarchies are obtained by k-extending a metabutterfly splitter, so that each "wire" in the original graph is itself a channel and each "node" in the original graph is itself a metanode. Although the k-extensions can be applied recursively, it should be noted that $\alpha'$ shrinks rapidly with each k-extension. Fortunately, practical applications should never need more than a three-level hierarchy.
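To see how quickly the guaranteed constants degrade under recursive k-extension, the following sketch iterates the map of Theorem 1. It rests on an assumption: that the theorem's constants read $\alpha' = \alpha^2/(\beta^2 e^4 + 4\alpha)$ and $\beta' = \beta - 2$, which is our reading of a formula that is partly illegible in the source. The function name and the starting values are hypothetical.

```python
import math

def extend_constants(alpha, beta):
    """One application of Theorem 1, under the assumed reading
    alpha' = alpha^2 / (beta^2 * e^4 + 4 * alpha) and beta' = beta - 2."""
    alpha_next = alpha ** 2 / (beta ** 2 * math.e ** 4 + 4 * alpha)
    return alpha_next, beta - 2

# Hypothetical starting expansion: alpha = 0.25, beta = 6.
alpha, beta = 0.25, 6
levels = [(alpha, beta)]
for _ in range(2):              # two extensions give a three-level hierarchy
    alpha, beta = extend_constants(alpha, beta)
    levels.append((alpha, beta))

for depth, (a, b) in enumerate(levels, start=1):
    print(f"{depth}-level hierarchy: alpha = {a:.3e}, beta = {b}")
```

These are worst-case guarantees from the proof, not observed behavior; the simulations in Section 4 suggest that practical performance does not degrade in this way.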
4 Empirical Results

In this section we present empirical evidence that the performance and fault tolerance of metabutterflies are identical to those of multibutterflies. We use the methodology of previous studies [CED92] [CK92] and investigate connectivity, partitioning, and performance with uniformly distributed router failures within each network.

We first measure the connectivity, which is the probability that all input-output pairs remain connected for a given percentage of failed routers. We assume that a failed router fails completely; that is, all of its inputs are blocked. We compared connectivity for a 1024-endpoint multibutterfly and for 1024-endpoint metabutterflies with metanode sizes of 4, 16, and 32. The routers had a radix of 4 and a multiplicity of 2. We found no significant differences in the connectivity of all four networks.

However, connectivity is not a good measure of fault tolerance because it makes no guarantees about the performance of the surviving input-output pairs. Under this metric, bottlenecks due to synchronization constraints have been shown to degrade application performance significantly. To avoid such bottlenecks, we choose a partition, a subset of endpoints to use, with the Leighton-Maggs Fault Propagation algorithm [LM92]. This algorithm treats a router as faulty unless it has at least one unblocked output in each direction; faults propagate backward when there is insufficient bandwidth through a router. The resulting partitions have been shown to have high bandwidth between all pairs of endpoints. Figure 4 shows the percentage of endpoints that remain connected under this more conservative definition. The partitionings of all four networks are statistically indistinguishable.

We simulated performance on these networks under these partitioning situations. The routers simulated were based upon the RN1, a full-custom, high-speed VLSI crossbar that performs source-responsible, pipelined, circuit-switched routing [MDK91]. We used a synthetic, barrier-synchronized network load that models shared-memory applications studied in [CFKA90]. In over 500 trials simulated on the CM5 [TMC91], we found that performance had a 0.9997 correlation to partitioning. This confirms our expectation that these partitionings guarantee high bandwidth between surviving input-output pairs. It also means that the performance of the four networks is statistically indistinguishable, even with many faults.

5 Open Issues

An open question is whether a small set of permutations would be sufficient to provide expansion for a k-extension of an expander. Clearly, one permutation would not be enough. If $\mathcal{S}$ consisted of the first node in each of $\alpha' Mk$ metanodes in $U$, then a single permutation would connect these $\alpha' Mk$ nodes to no more than $M/r$ nodes in $V$. This means that we lose a factor of $k$ in expansion. If there were such a set, then the network could be wired with only a few types of cables (one for each permutation). However, in practice, cable connectors are still attached manually, which means that it is actually easier to make a cable with a random permutation than it is to make a standard cable (which requires the identity permutation).

6 Conclusion

We have proven that, with high probability, random k-extensions preserve expansion. Random k-extensions allow us to build hierarchical expanders that are much more scalable than other expanders. An example of a hierarchical expander is the metabutterfly, which is based on the randomly wired multibutterfly. Results from detailed performance simulations indicate that the fault tolerance and performance of the metabutterfly match those of the multibutterfly, despite the metabutterfly's greatly simplified wiring and resulting scalability.
7 Acknowledgments We would like to thank Bobby Blumofe, Tom Knight and Charles Leiserson for their comments on this work. [ALM90]
[BP74]
[CED92]
It would also be interesting to show that a randomly cabled metabutterfly with multiplicity-2 can route any permutation in O(log n) steps. Multiplicity-2 multibutterflies do not have expansion but still route well. By the results in this paper, multiplicity-4 metabutterflies have sufficient expansion to guarantee O(log n)-time packet routing, but multiplicity-2 metabutterflies may not. However, multiplicity-2 metabutterflies do route well empirically.
[CFKA90]
[CK92]
On the practical side, it remains to be shown that the multiplexing allowed by hierarchical wiring significantly reduces the number of wires required to achieve a particular level of throughput. Since the average load per cable should be well less than the peak load of a full-thickness cable, we expect to be able to reduce the number of wires per cable significantly without reducing the effective bandwidth of the network. The use of randomness helps us here as well, since we know that the actual load per cable will not differ from the average load per cable by very much; that is, randomness ensures relatively even load.
[Lei85]
[Lei92] [LM89]
[LM92]
[MDK91]
Finally, we are also looking at dynamic random multiplexing, in which the routers randomly select a wire within the cable. This provides several advantages. First, the wiring within the cable need not be random, an off-the-shelf cable works fine. Second, the effective multiplicity goes up, which increases both the fault tolerance and the performance. For example, if a cable connects ten routers to ten routers, each input router can reach each output router, as opposed to only two routers for random wiring with a multiplicity of two. 9 of the When a output fails (or is busy), each input gets 10 bandwidth, rather than eight getting their full share and two getting half their bandwidth. The full practical benefits of this technique remain to be investigated.
[Pip93]
[TMC91] [Upf89]
[WZ93]
9
S. Arora, F. T. Leighton, and B. Maggs. On-line algorithms for path selection in a non-blocking network. In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing, pages 149–158, May 1990. L. A. Bassalygo and M. S. Pinsker. Complexity of optimum nonblocking switching networks without reconnections. Problems of Information Transmission, 9:64–66, 1974. F. T. Chong, E. Egozy, and A. DeHon. Fault tolerance and performance of multipath multistage interconnection networks. In T. F. Knight Jr. and J. Savage, editors, Advanced Research in VLSI and Parallel Systems 1992, pages 227–242. MIT Press, March 1992. D. Chaiken, C. Fields, K. Kurihara, and A. Agarwal. Directory-based cache-coherence in large-scale multiprocessors. IEEE Computer, 23(6):41–58, June 1990. F. T. Chong and T. F. Knight, Jr. Design and performance of multipath MIN architectures. In Symposium on Parallel Architectures and Algorithms, pages 286–295, San Diego, California, June 1992. ACM. C. E. Leiserson. Fat-trees: Universal networks for hardware efficient supercomputing. IEEE Transactions on Computers, C-34(10):892–901, October 1985. F. T. Leighton. Introduction to parallel algorithms and architectures. Morgan Kaufmann, San Mateo, CA, 1992. F. T. Leighton and B. Maggs. Expanders might be practical: Fast algorithms for routing around faults on multibutterflies. In IEEE 30th Annual Symposium on Foundations of Computer Science, 1989. F. T. Leighton and B. Maggs. Fast algorithms for routing around faults in multibutterflies and randomly-wired splitter networks. IEEE Transactions on Computers, 41(5):1–10, May 1992. H. Minsky, A. DeHon, and T. F. Knight Jr. RN1: Low-latency, dilated, crossbar router. In Hot Chips Symposium III, 1991. N. Pippenger. Self-routing superconcentrators. In 25th Annual ACM Symposium on the Theory of Computing, pages 355– 361. ACM, May 1993. Thinking Machines Corporation, Cambridge, MA. CM5 Technical Summary, October 1991. E. Upfal. 
An (log ) deterministic packet routing scheme. In 21st Annual ACM Symposium on Theory of Computing, pages 241–250. ACM, May 1989. A. Wigderson and D. Zuckerman. Expanders that beat the eigenvalue bound: explicit construction and applications. In 25th Annual ACM Symposium on the Theory of Computing, pages 245–251. ACM, May 1993.
O
N