Efficient Off-line Routing of Permutations on Restricted Access Expanded Delta Networks*
Isaac D. Scherson
Raghu Subramanian
[email protected]
[email protected]
Department of Information and Computer Science University of California Irvine, CA
92717-3425
U.S.A.
Abstract
This paper presents an off-line algorithm for routing permutations on expanded delta networks (EDNs) with restricted access. Restricted access means that the number of elements to be permuted may exceed the number of inputs to the EDN. For every $N$-element permutation on an $M$-input EDN, the algorithm computes a routing that takes exactly $3N/M$ passes (assuming $M$ divides $N$). On a certain class of EDNs, the number of passes can be reduced to $2N/M$. For example, for every 16K-element permutation on the 1K-input global network of the MasPar MP-1 and MP-2, the algorithm computes a routing that takes exactly 32 passes. The time complexity of the algorithm is $\Theta(N \log N)$ sequentially, and $\Theta(\log^2 N)$ on an $N$-processor PRAM.
1 Introduction
Expanded delta networks (EDNs) are traditional delta networks [13] [10, page 736] in which each wire is expanded to enable it to carry more than one message at a time. EDNs were introduced by Kruskal and Snir [9], and studied by Koch [8], Szymanski [14] and Alleyne [2]. The global network in the 16K-processor MasPar MP-1 and MP-2 is an EDN [5].
Permutations are interprocessor communication patterns in which each processor sends exactly one message and receives exactly one message. Permutations are an important class of communication patterns because they occur frequently in SIMD computations [3, page 333].
A problem instance may require more processors than the target machine has. A common solution is to solve the problem instance on an imaginary "virtual machine" with enough processors, and then simulate the virtual machine on the real machine (by simulating several virtual processors serially on each real processor). In this scenario, permutations are of less interest than virtual permutations, in which each virtual processor sends exactly one message and receives exactly one message. Virtual permutations are also called permutations with restricted access because all the virtual processors allocated to a particular real processor have to share a network input (and output) serially, and hence do not have free access to the network.
In this paper, we present an algorithm for routing permutations on expanded delta networks with restricted access. The algorithm is off-line, i.e. a separate machine takes the complete traffic pattern as input and yields a routing as output. For every $N$-element permutation on an $M$-input EDN, the algorithm computes a routing that takes exactly $3N/M$ passes (assuming $M$ divides $N$). On a certain class of EDNs, the number of passes can be reduced to $2N/M$. The running time of the algorithm is $\Theta(N \log N)$ sequentially, and $\Theta(\log^2 N)$ on an $N$-processor PRAM. As applied to the 16K-processor MasPar MP-1 and MP-2 parallel machines, the algorithm computes a routing that takes exactly 32 passes for every permutation on the global network.
Section 2 describes the structure of an EDN. Section 3 describes the routing algorithm, ignoring the issue of restricted access (i.e. assuming that the number of elements to be permuted equals the number of EDN inputs). Section 4 incorporates restricted access into the routing algorithm and is based on the ideas of Youssef [15]. Section 5 shows how the results of this paper apply to the MasPar MP-1 and MP-2.
* This research was supported in part by the Air Force Office of Scientific Research under grant number AFOSR-90-0144, by NASA under grant number NAG-5-1987, and by the NSF under grant numbers MIP-9106949 and MIP-9205737.
2 Expanded Delta Networks
An expanded delta network (EDN) is built using wires and switches. There are two kinds of wires: thin wires, which can carry only one signal at a time, and thick wires, which can carry up to $K$ signals at a time, where $K$ stands for capacity. There are three kinds of switches.
Thin-to-thick converters have $K$ thin wire inputs and one thick wire output. A thin-to-thick converter copies every incoming signal onto the (only) output wire. Observe that even though many inputs may be "driving" the output, there is no "collision" because the output wire is thick.
Thick-to-thin converters have one thick wire input and $K$ thin wire outputs. We assume that a connection request, which is a number from 0 to $K - 1$, accompanies every incoming signal. A thick-to-thin converter copies every incoming signal onto the output wire that the signal requested. However, if more than one incoming signal requests a particular output wire, then only one of the requests is honored.
Hyperbars have $S$ thick wire inputs and $S$ thick wire outputs. We assume that a connection request, which is a number from 0 to $S - 1$, accompanies every incoming signal. A hyperbar copies every incoming signal onto the output wire that the signal requested. However, if more than $K$ incoming signals request a particular output wire, then only $K$ of the requests are honored.
As an analogy, consider the thick wires as multilane freeways capable of accommodating many cars side by side, and thin wires as single lane streets. Then a thin-to-thick converter corresponds to a set of freeway on-ramps that merges cars from different origins onto a freeway. A thick-to-thin converter corresponds to a set of freeway exits that dispatches cars on the freeway to their various destinations. A hyperbar is analogous to a multi-freeway exchange.
2.1 Topology
Figure 1 shows an example of an EDN. The thin lines represent thin wires and the bold lines represent thick wires. The boxes tapering from left to right represent thin-to-thick converters, and have $K = 2$ thin wire inputs and one thick wire output. The boxes widening from left to right represent thick-to-thin converters, and have one thick wire input and $K$ thin wire outputs. The square boxes represent hyperbars. These have $S = 2$ thick wire inputs and the same number of thick wire outputs.
There are $M = 16$ thin wire input ports, which enter the first column of $M/K = 8$ thin-to-thick converters. Similarly, there are $M$ thin wire output ports, which emerge from the last stage of $M/K = 8$ thick-to-thin converters. Between these two stages, the network consists solely of thick wires and hyperbars. There are $\log_S (M/K) = 3$ stages of $M/(KS) = 4$ hyperbars each. Each hyperbar stage is preceded by thick wires connected as an $S$-way shuffle.
Figure 1. Example of an EDN.
Thus, the EDN's topology is a traditional delta network with the following differences: thin wires are replaced by thick wires; crossbars are replaced by hyperbars; there are additional stages of wire thickness converters at the beginning and at the end.
Rigorously, an EDN has $M$ thin wire input ports and $M$ thin wire output ports, both labelled $\{0, \ldots, M-1\}$. There are $M/K$ thin-to-thick converters and $M/K$ thick-to-thin converters, both labelled $\{0, \ldots, M/K - 1\}$. There are $\frac{M}{KS} \log_S (M/K)$ hyperbars, labelled by ordered pairs in $\{0, \ldots, M/(KS) - 1\} \times \{0, \ldots, \log_S (M/K) - 1\}$. The wires are connected as follows:
For each $i \in \{0, \ldots, M-1\}$, a thin wire connects input port $i$ to input $(i \bmod K)$ of thin-to-thick converter $\lfloor i/K \rfloor$.
For each $i \in \{0, \ldots, M/K - 1\}$, a thick wire connects output 0 (the only output) of thin-to-thick converter $i$ to input $(\mathrm{Shuffle}(i) \bmod S)$ of hyperbar $\langle \lfloor \mathrm{Shuffle}(i)/S \rfloor, 0 \rangle$.
For each $i \in \{0, \ldots, M/(KS) - 1\}$, $j \in \{0, \ldots, S-1\}$, and $\ell \in \{0, \ldots, \log_S (M/K) - 2\}$, a thick wire connects output $j$ of hyperbar $\langle i, \ell \rangle$ to input $(\mathrm{Shuffle}(iS + j) \bmod S)$ of hyperbar $\langle \lfloor \mathrm{Shuffle}(iS + j)/S \rfloor, \ell + 1 \rangle$.
For each $i \in \{0, \ldots, M/K - 1\}$, a thick wire connects output $(i \bmod S)$ of hyperbar $\langle \lfloor i/S \rfloor, \log_S (M/K) - 1 \rangle$ to input 0 (the only input) of thick-to-thin converter $i$.
For each $i \in \{0, \ldots, M-1\}$, a thin wire connects output $(i \bmod K)$ of thick-to-thin converter $\lfloor i/K \rfloor$ to output port $i$.
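The five wiring rules above can be sketched as a small table-building routine. This is only an illustrative sketch: the text does not define Shuffle, so the code assumes the usual $S$-way perfect shuffle (a left rotation of the base-$S$ digits of the wire index). The names `shuffle`, `build_wiring` and the tuple encoding of switch ports are hypothetical.

```python
def num_stages(M, K, S):
    """Number of hyperbar stages, log_S(M/K), using integer arithmetic only."""
    width, stages = M // K, 0
    while width > 1:
        assert width % S == 0, "M/K must be a power of S"
        width //= S
        stages += 1
    return stages


def shuffle(w, width, S):
    """ASSUMPTION: S-way perfect shuffle on `width` wires, i.e. rotate the
    base-S digits of w left by one place. The paper does not spell this out."""
    return (w * S) % width + (w * S) // width


def build_wiring(M, K, S):
    """Map each switch output (or input port) to the switch input it drives."""
    width = M // K                 # number of thick wires per column
    n = num_stages(M, K, S)        # assumed to be at least 1
    wires = {}
    # Rule 1: input ports to thin-to-thick converters.
    for i in range(M):
        wires[("in", i)] = ("thin_to_thick", i // K, i % K)
    # Rule 2: thin-to-thick converters to the first hyperbar stage.
    for i in range(width):
        s = shuffle(i, width, S)
        wires[("thin_to_thick", i, 0)] = ("hyperbar", (s // S, 0), s % S)
    # Rule 3: hyperbar stage l to hyperbar stage l + 1.
    for l in range(n - 1):
        for i in range(width // S):
            for j in range(S):
                s = shuffle(i * S + j, width, S)
                wires[("hyperbar", (i, l), j)] = ("hyperbar", (s // S, l + 1), s % S)
    # Rule 4: last hyperbar stage to thick-to-thin converters.
    for i in range(width):
        wires[("hyperbar", (i // S, n - 1), i % S)] = ("thick_to_thin", i, 0)
    # Rule 5: thick-to-thin converters to output ports.
    for i in range(M):
        wires[("thick_to_thin", i // K, i % K)] = ("out", i)
    return wires
```

For the parameters of Figure 1, `build_wiring(16, 2, 2)` yields one entry per wire, and every switch input appears as a target exactly once.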
2.2 Routing
For every wire that has a switch output at one end, assign the wire a label that is the number of the switch output, and associate with the wire a direction that originates from the switch output. Let $j \in \{0, \ldots, M-1\}$ be represented in mixed radix notation as $\langle \alpha, \beta_0, \ldots, \beta_{\log_S (M/K) - 1}, \gamma \rangle$, where $\alpha$ is a unary digit (that is, $\alpha$ is always 0), the $\beta_i$s are base-$S$ digits, and $\gamma$ is a base-$K$ digit.
Theorem 2.1. If the wires of the network must be traversed along the direction assigned (if any), then there is a unique path from every input port to every output port. Moreover, for every $i \in \{0, \ldots, M-1\}$, the sequence of labels encountered while traversing the unique path from input port $i$ to output port $j$ is $\langle \alpha, \beta_0, \ldots, \beta_{\log_S (M/K) - 1}, \gamma \rangle$ (which is just the mixed radix representation of $j$).
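As an illustration of Theorem 2.1, the following sketch computes the label sequence for a destination $j$ and then follows those labels through the network from an arbitrary input port. It is a sketch under one assumption that the text leaves open: Shuffle is taken to be the left rotation of base-$S$ digits. The function names `route_labels` and `trace` are hypothetical.

```python
def route_labels(j, M, K, S):
    """Mixed radix digits <alpha, beta_0, ..., beta_{n-1}, gamma> of output port j."""
    width, n = M // K, 0
    while width > 1:               # n = log_S(M/K)
        width //= S
        n += 1
    gamma, rest = j % K, j // K    # gamma: output of the thick-to-thin converter
    betas = []
    for _ in range(n):             # base-S digits, least significant first
        betas.append(rest % S)
        rest //= S
    betas.reverse()                # beta_0 is the most significant digit
    return [0] + betas + [gamma]   # alpha is always 0


def trace(i, j, M, K, S):
    """Follow the labels for j starting at input port i; return the port reached."""
    labels = route_labels(j, M, K, S)
    width = M // K
    w = i // K                     # thick wire leaving thin-to-thick converter i // K
    for beta in labels[1:-1]:
        s = (w * S) % width + (w * S) // width   # ASSUMED S-way shuffle
        w = (s // S) * S + beta    # leave hyperbar s // S on output beta
    return w * K + labels[-1]      # converter w, output gamma
```

With the parameters of Figure 1, `route_labels(13, 16, 2, 2)` gives `[0, 1, 1, 0, 1]`, and `trace(i, 13, 16, 2, 2)` reaches port 13 from every input port `i`.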
The proof of Theorem 2.1 is omitted because it is similar to the proof for ordinary (unexpanded) delta networks [13].
3 Routing Permutations on EDNs without Restricted Access
Let $P = \{P_0, \ldots, P_{M-1}\}$ denote a set of processors. A message is an ordered pair of processors. The first component of a message is its source, and the second component is its destination. A traffic pattern is a set of messages, i.e. a relation over $P$. A traffic pattern $\Gamma$ is EDN passable if
for each thin wire $w$ in the EDN, there is at most one message $\langle P_i, P_j \rangle \in \Gamma$ such that $w$ lies on the unique path from input port $i$ to output port $j$;
for each thick wire $w$ in the EDN, there are at most $K$ messages $\langle P_i, P_j \rangle \in \Gamma$ such that $w$ lies on the unique path from input port $i$ to output port $j$.
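The two conditions above can be checked mechanically by tracing the unique path of each message and counting the load on every wire. The sketch below does this under the assumption, not stated in the text, that Shuffle is the left rotation of base-$S$ digits. The name `edn_passable` and the encoding of wires as tuples are hypothetical.

```python
from collections import Counter


def edn_passable(messages, M, K, S):
    """messages: iterable of (source port, destination port) pairs."""
    width = M // K
    thin, thick = Counter(), Counter()
    for src, dst in messages:
        # Thin wires sit only at the input and output ports.
        thin[("in", src)] += 1
        thin[("out", dst)] += 1
        # Base-S digits of the destination converter, most significant first.
        conv, span, betas = dst // K, width, []
        while span > 1:
            betas.append(conv % S)
            conv //= S
            span //= S
        betas.reverse()
        # Thick wire leaving the thin-to-thick converter.
        w = src // K
        thick[(-1, w)] += 1
        # Thick wire leaving each hyperbar stage.
        for stage, beta in enumerate(betas):
            s = (w * S) % width + (w * S) // width   # ASSUMED S-way shuffle
            w = (s // S) * S + beta
            thick[(stage, w)] += 1
    return (all(c <= 1 for c in thin.values())
            and all(c <= K for c in thick.values()))
```

For the network of Figure 1, the identity permutation passes this check, whereas a permutation that sends ports 0, 1, 8 and 9 to ports 0, 1, 2 and 3 does not, because all four messages need the same thick wire after the first hyperbar stage and $K = 2$.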
A permutation is a traffic pattern in which each processor is the source of exactly one message, and the destination of exactly one message, i.e. a bijection from $P$ to $P$. We claim that every permutation can be expressed as the composition of three EDN passable permutations. Informally, this means that every permutation can be routed on an EDN in exactly three passes.
Theorem 3.1. For every permutation $\pi$, there exist EDN passable permutations $\pi_0$, $\pi_1$ and $\pi_2$ such that $\pi = \pi_0 \circ \pi_1 \circ \pi_2$.
The operator "$\circ$" denotes composition: if $\pi$ and $\pi'$ are permutations, then $\pi \circ \pi'$ is the permutation defined by $(\pi \circ \pi')(i) = \pi'(\pi(i))$.
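Note that this composition applies the left operand first. A minimal sketch of the convention, with permutations stored as Python lists (the representation and the name `compose` are choices made here for illustration):

```python
def compose(p, q):
    """Return p o q in the paper's convention: (p o q)(i) = q(p(i))."""
    return [q[p[i]] for i in range(len(p))]


def three_pass(p0, p1, p2):
    """Overall permutation realised by routing p0, then p1, then p2."""
    return compose(compose(p0, p1), p2)
```

For example, with `p = [1, 2, 0]` and `q = [0, 2, 1]`, `compose(p, q)` is `[2, 1, 0]` while `compose(q, p)` is `[1, 0, 2]`, so the order of the passes matters.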
Theorem 3.1 may not be optimal: a recent result by Cam and Fortes [6] suggests that it is possible to decompose every permutation into just two EDN passable permutations. However, the algorithm to find the decomposition has a prohibitive sequential time complexity of $2^{\Theta(N \log N)}$.
In order to prove Theorem 3.1, we first state some standard results in network theory. Let $\mathcal{N}$ be a multistage network with $P$ input ports labelled $\{I_0, \ldots, I_{P-1}\}$ and $P$ output ports labelled $\{O_0, \ldots, O_{P-1}\}$. $\mathcal{N}$ is rearrangeable if for every bijection $\pi : \{I_0, \ldots, I_{P-1}\} \to \{O_0, \ldots, O_{P-1}\}$, there exist edge disjoint paths connecting $I_i$ to $O_{\pi(i)}$ for $0 \le i \le P-1$. Some examples of rearrangeable networks are the Benes network [4], the Clos network [7] and the triple-delta network (the triple-delta network is just three copies of the traditional delta network [13] connected in series).
Lev et al. [11] give an algorithm which, given a bijection $\pi : \{I_0, \ldots, I_{P-1}\} \to \{O_0, \ldots, O_{P-1}\}$, constructs disjoint paths in the Benes network connecting every $I_i$ to output vertex $O_{\pi(i)}$. Their algorithm runs in $\Theta(P \log P)$ time sequentially, and in $\Theta(\log^2 P)$ time on a $P$-processor PRAM. The same paper describes how to reduce permutation routing on the Clos network to permutation routing on the Benes network. Similarly, a result by Parker [12] can be used to reduce permutation routing on the triple-delta network to permutation routing on the Benes network. (The reductions assume that all switch sizes are powers of two.) Both reductions take $\Theta(P)$ time sequentially, and $\Theta(1)$ time on a $P$-processor PRAM. These reductions, together with the algorithm of Lev et al. for routing permutations on the Benes network, yield algorithms for routing permutations on the Clos network and the triple-delta network which take $\Theta(P \log P)$ time sequentially and $\Theta(\log^2 P)$ time on a $P$-processor PRAM.
Proof outline of Theorem 3.1:
We transform an EDN so that thick wires, hyperbars and wire thickness converters are replaced by the more familiar thin wires and crossbars. The resulting network is weaker than the original EDN, i.e. the traffic patterns that are passable (in one pass) by the resulting network are a subset of the traffic patterns that are EDN passable. We explain the transformation using the example in Figure 1.
The first step of the transformation is shown in Figure 2. Every $S \times S$ hyperbar of the EDN is replaced by a sub-network of wire thickness converters and $S \times S$ crossbars. Since the sub-network is weaker than the hyperbar, the resulting network is weaker than the original EDN.
Figure 2. First step of the network transformation.
The second step of the transformation is shown in Figure 3. Every pair of adjacent wire thickness converters at the beginning and end of the network is replaced by a $K \times K$ crossbar. Every pair of adjacent wire thickness converters in the middle of the network is simply eliminated. Once again, the network is weaker than before.
Figure 3. Second step of the network transformation.
Figure 4 shows the resulting network drawn differently. The only difference between the bottom part of Figure 3 and Figure 4 is that overlapping copies of switches and wires have been separated out. Observe that if the first and last stages of $K \times K$ crossbars are ignored, then $K$ disjoint networks remain, each of which is an $(M/K)$-input traditional (unexpanded) delta network.
Figure 4. The resulting network redrawn.
To prove Theorem 3.1, it is sufficient to show that a network consisting of three EDNs connected in series is rearrangeable. Further, since the network in Figure 4 is weaker than the original EDN, it is sufficient to show that a network consisting of three copies of Figure 4 connected in series is rearrangeable.
Consider three copies of Figure 4 connected in series. In the first copy, set the $K \times K$ crossbars in the last stage to straight connections. In the middle copy, set the $K \times K$ crossbars in both the first and last stage to straight connections. In the last copy, set the $K \times K$ crossbars in the first stage to straight connections. In the resulting network, if the first and last stages of $K \times K$ crossbars are ignored, then $K$ disjoint networks remain, each of which is an $(M/K)$-input triple-delta network. Since a triple-delta network is rearrangeable, it can be replaced by a crossbar without affecting the routing capabilities of the whole network. The resulting network is an $M$-input Clos network, which is rearrangeable.
□
Observe that although Theorem 3.1 only asserts the existence of a suitable permutation decomposition, the proof tacitly describes an algorithm to find it. An inspection of the proof shows that the dominant steps in the algorithm are $K$ routings on an $(M/K)$-input triple-delta network, and one routing on an $M$-input Clos network. Using the algorithms mentioned above for routing on Clos and triple-delta networks, the algorithm for permutation decomposition takes $\Theta(M \log M)$ time sequentially and $\Theta(\log^2 M)$ time on an $M$-processor PRAM.
If an EDN has only two hyperbar stages, then Theorem 3.1 can be improved.
Theorem 3.2. If an EDN is such that $S^2 = M/K$, then for every permutation $\pi$, there exist EDN passable permutations $\pi_0$ and $\pi_1$ such that $\pi = \pi_0 \circ \pi_1$.
Proof outline:
Once again, transform the EDN into a weaker network similar to Figure 4, but with only two hyperbar stages. Consider two copies of the above network connected in series. In the first copy, set the $K \times K$ crossbars in the last stage to straight connections. In the second copy, set the $K \times K$ crossbars in the first stage and the $S \times S$ crossbars in the second stage to straight connections. In the resulting network, if the first and last stages of $K \times K$ crossbars are ignored, then $K$ disjoint networks remain, each of which is an $(M/K)$-input Clos network. Since a Clos network is rearrangeable, it can be replaced by a crossbar without affecting the routing capabilities of the whole network. The resulting network is now an $M$-input Clos network, which is rearrangeable.
□
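Which of the two theorems applies depends only on the number of hyperbar stages, $\log_S (M/K)$. The sketch below computes that number and the resulting pass counts: two or three passes per $M$-element permutation, and $2N/M$ or $3N/M$ passes for $N$ elements as stated in the Introduction. The function names are hypothetical, and the second parameter set in the usage note is an invented example, not a machine described in the paper.

```python
def hyperbar_stages(M, K, S):
    """log_S(M/K), computed with integers; M/K must be a power of S."""
    width, stages = M // K, 0
    while width > 1:
        assert width % S == 0, "M/K must be a power of S"
        width //= S
        stages += 1
    return stages


def passes_per_permutation(M, K, S):
    """Two passes if S^2 = M/K (Theorem 3.2), otherwise three (Theorem 3.1)."""
    return 2 if hyperbar_stages(M, K, S) == 2 else 3


def total_passes(N, M, K, S):
    """Passes for an N-element permutation on an M-input EDN (M divides N)."""
    assert N % M == 0, "M must divide N"
    return (N // M) * passes_per_permutation(M, K, S)
```

The network of Figure 1 ($M = 16$, $K = 2$, $S = 2$) has three hyperbar stages, so `passes_per_permutation(16, 2, 2)` is 3. A hypothetical EDN with $M = 16$, $K = 4$, $S = 2$ satisfies $S^2 = M/K$, so `total_passes(64, 16, 4, 2)` is 8.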
4 Routing Permutations on EDNs with Restricted Access
So far, we have described the routing algorithm while ignoring the issue of restricted access, by assuming that the number of elements to be permuted equals the number of inputs to the EDN. In this section, we incorporate restricted access into the routing algorithm.
Recall from Section 3 that $P = \{P_0, \ldots, P_{M-1}\}$ is the set of processors. A message is an ordered pair $\langle \mathit{Source}, \mathit{Destination} \rangle$ of processors. A traffic pattern is a set of messages. A permutation is a traffic pattern in which each processor is the source of exactly one message, and the destination of exactly one message.
Similarly, let $V = \{V_0, \ldots, V_{N-1}\}$ denote a set of virtual processors. A virtual message is an ordered pair of virtual processors. The first component of a virtual message is its source, and the second component is its destination. A virtual traffic pattern is a set of virtual messages, i.e. a relation over $V$. A virtual permutation is a virtual traffic pattern in which each virtual processor is the source of exactly one virtual message, and the destination of exactly one virtual message, i.e. a bijection from $V$ to $V$.
An allocation is a function from $V$ to $P$. For the remainder of this section, we will assume the following:
There exists a natural number $C$ such that $N = CM$. ($C$ stands for cluster size.)
The allocation function is $\phi$, defined by $\phi(V_i) = P_{\lfloor i/C \rfloor}$. Thus, each processor is allocated a block of $C$ consecutive virtual processors.
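The block allocation just described is integer division by the cluster size. A minimal sketch, with processors and virtual processors represented by their indices (the names `allocate` and `cluster` are hypothetical):

```python
def allocate(i, C):
    """Index of the processor that virtual processor V_i is allocated to."""
    return i // C


def cluster(p, C):
    """Indices of the C consecutive virtual processors allocated to P_p."""
    return list(range(p * C, (p + 1) * C))
```

With $C = 2$, for instance, `allocate(5, 2)` is 2 and `cluster(3, 2)` is `[6, 7]`.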
Figure 5 shows an example of restricted access. The rectangular box in the center represents an EDN. However, the results of this section do not depend on this fact, and are equally valid for any other network. There are $M = 4$ processors, represented by the big circles on the left. The big circles on the right are the same four processors, but have been drawn again for convenience. There are $N = 8$ virtual processors. Each processor is allocated $C = 2$ virtual processors.
Figure 5. Example of Restricted Access.
Two virtual messages $\langle s_0, d_0 \rangle$ and $\langle s_1, d_1 \rangle$ are conflicting if $\phi(s_0) = \phi(s_1)$ or $\phi(d_0) = \phi(d_1)$, that is, either their sources or their destinations are allocated to the same processor. A set of virtual messages is conflict free if no two messages in the set are conflicting. We claim that every virtual permutation can be partitioned into $C$ conflict free sets of messages. Informally, this means that routing a virtual permutation is equivalent to routing $C$ possibly different permutations. Since, by Theorem 3.1, every permutation can be routed on an EDN in three passes, it follows that every virtual permutation can be routed on an EDN in $3C = 3N/M$ passes. If the EDN has only two hyperbar stages, then by Theorem 3.2 this reduces to $2N/M$ passes.
Theorem 4.1. For every virtual permutation $\pi$, there exist conflict free sets of messages $S_0, \ldots, S_{C-1}$ such that $\bigcup_{i=0}^{C-1} S_i = \pi$ and for every $i, j \in \{0, \ldots, C-1\}$, if $i \neq j$ then $S_i \cap S_j = \emptyset$.
Theorem 4.1 is optimal in the following sense: it is impossible to partition any virtual permutation into fewer than $C$ conflict free sets of messages. To see this, consider the $C$ virtual messages $\langle V_0, \pi(V_0) \rangle, \langle V_1, \pi(V_1) \rangle, \ldots, \langle V_{C-1}, \pi(V_{C-1}) \rangle$. Every pair of these $C$ virtual messages is conflicting, because the source of each of these virtual messages is allocated to $P_0$. Therefore, any partition of $\pi$ into conflict free sets must have at least $C$ sets.
Proof outline of Theorem 4.1:
Consider an $N$-input Clos network with $M$ crossbars of size $C \times C$ in each of the first and third stages, and $C$ crossbars of size $M \times M$ in the middle stage. Since a Clos network is rearrangeable, there exists a set of edge disjoint paths connecting every input port $i$ to the output port $j$ such that $\pi(V_i) = V_j$. Let $\mathrm{Path}(i)$ denote the path in this set originating from input port $i$. For $0 \le k \le C-1$, define $S_k$ to be the set of all virtual messages $\langle V_i, \pi(V_i) \rangle$ such that $\mathrm{Path}(i)$ passes through switch $k$ in the middle stage of the Clos network.
We claim that every $S_k$ is conflict free. It suffices to prove that if virtual messages $\langle V_i, \pi(V_i) \rangle$ and $\langle V_j, \pi(V_j) \rangle$ are conflicting, then $\mathrm{Path}(i)$ and $\mathrm{Path}(j)$ pass through different switches in the second stage. Let virtual messages $\langle V_i, \pi(V_i) \rangle$ and $\langle V_j, \pi(V_j) \rangle$ be conflicting. Then, either the network inputs $i$ and $j$ are connected to the same switch in the first stage, or the same switch in the third stage is connected to network outputs $\pi(i)$ and $\pi(j)$. That is, $\mathrm{Path}(i)$ and $\mathrm{Path}(j)$ share a common switch either in the first stage or the third stage. If $\mathrm{Path}(i)$ and $\mathrm{Path}(j)$ share a common switch in the second stage as well, then they are no longer edge disjoint. Hence $\mathrm{Path}(i)$ and $\mathrm{Path}(j)$ pass through different switches in the second stage.
□
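One way to compute a partition of the kind asserted by Theorem 4.1 is sketched below. It does not use the Clos routing of the proof outline. Instead it swaps in a simpler and slower technique: it repeatedly extracts a perfect matching between source processors and destination processors using augmenting paths, which always exists because every processor sends and receives the same number of remaining virtual messages. The virtual permutation is given as a list `perm`, where `perm[i]` is the index of the destination of $V_i$. The function name and this representation are hypothetical.

```python
def partition_conflict_free(perm, C):
    """Split the messages (i, perm[i]) into C conflict free sets."""
    N = len(perm)
    M = N // C
    # remaining[p]: sources on processor p whose messages are not yet assigned.
    remaining = [list(range(p * C, (p + 1) * C)) for p in range(M)]
    sets = []
    for _ in range(C):
        # match_right[d]: source whose message currently claims destination
        # processor d in this round, or None.
        match_right = [None] * M

        def augment(p, seen):
            """Try to give source processor p a destination processor."""
            for v in remaining[p]:
                d = perm[v] // C
                if d in seen:
                    continue
                seen.add(d)
                holder = match_right[d]
                # Take d if it is free, or if its holder can move elsewhere.
                if holder is None or augment(holder // C, seen):
                    match_right[d] = v
                    return True
            return False

        for p in range(M):
            found = augment(p, set())
            assert found, "a regular bipartite multigraph has a perfect matching"
        for v in match_right:
            remaining[v // C].remove(v)
        sets.append([(v, perm[v]) for v in sorted(match_right)])
    return sets
```

Each returned set contains exactly one message per source processor and one per destination processor, so it can be routed as an ordinary $M$-element permutation. This sketch favours brevity over the $\Theta(N \log N)$ running time obtained from Clos routing.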
Observe that although Theorem 4.1 only asserts the existence of a suitable partition, the proof implicitly describes an algorithm to find it. The dominant step in the algorithm is a routing on an $N$-input Clos network. Using the algorithm mentioned above for routing on a Clos network, the algorithm for partitioning takes $\Theta(N \log N)$ time sequentially and $\Theta(\log^2 N)$ time on an $N$-processor PRAM.
There is an alternate proof of Theorem 4.1 that is more elegant, although not constructive. Define the graph $G$ whose vertex set is $P \times \{\mathrm{Left}, \mathrm{Right}\}$, and whose edge set is $\pi$ (thus each edge represents a virtual message). Every edge $\langle V_i, V_j \rangle$ connects the vertices $\langle \phi(V_i), \mathrm{Left} \rangle$ and $\langle \phi(V_j), \mathrm{Right} \rangle$. Two edges may connect the same pair of vertices, so $G$ may have multiple parallel edges. Note that two edges are adjacent if and only if the corresponding virtual messages conflict. So, a set of edges is a matching if and only if the corresponding set of virtual messages is conflict free.
It is easy to see that $G$ is $C$-regular and bipartite. By Hall's matching theorem [10, pages 190-196], every regular bipartite graph has a perfect matching. Therefore $G$ has a perfect matching. If the edges of this perfect matching are removed, then a bipartite and $(C-1)$-regular graph remains. Invoking Hall's theorem again, the remaining graph has a perfect matching. Continuing in this way, the set of edges can be partitioned into $C$ matchings. Rephrasing, $\pi$ can be partitioned into $C$ conflict free sets of messages.
5 Application to the MasPar MP-1 and MP-2
As pointed out in Section 1, the global network in the 16K-processor MasPar MP-1 and MP-2 is an EDN. However, while we have assumed that every input and output port of an EDN has one processor connected to it, every input and output port of the MasPar global router has several processors connected to it, which share the port serially. (This sharing is because of practical fabrication constraints, and also occurs in the Thinking Machines CM-1 and CM-2.) For the purposes of analysis, this difference between the MasPar machines and the model can be reconciled by treating physical processors in the MasPar machines as virtual processors in the model.
Since the MasPar global network has two hyperbar stages, our algorithm computes a routing that takes exactly $2N/M = 2 \times 16\mathrm{K}/1\mathrm{K} = 32$ passes for every 16K-element permutation on the 1K-input MasPar global network. In contrast, permutations on the MasPar global network currently take anywhere from 16 to over 256 passes [5, 1].
But the above comparison is unfair. Our algorithm is off-line, i.e. a separate machine is given the complete traffic pattern and used to pre-compute a routing. On the other hand, the MasPar algorithm is on-line, i.e. the switches make routing decisions on the fly, acting under local control. Off-line algorithms usually yield better routings than on-line algorithms, but incur the overhead of pre-computing a routing. Thus, off-line algorithms are useful when the time to pre-compute the routing is offset by the time gained due to the superiority of the routing. This is the case if the traffic pattern is known at compile time, because then the routing can be computed during the compilation, and can be used in every execution of the program. If the traffic pattern is known only at run time, but it is known that the traffic pattern recurs many times during the execution, then the routing need be computed only once, and can be used repeatedly during the execution of the program.
Acknowledgements
We thank Tom Blank, Russ Tuck, John Zapisek and Won Kim of MasPar Computer Corporation for valuable discussions about the "inner workings" of the MasPar MP-1 and MP-2 router.
References
[1] B.D. Alleyne. Personal communications, 1992.
[2] B.D. Alleyne and I.D. Scherson. Expanded delta networks for very large parallel computers. In Proceedings of the International Conference on Parallel Processing, pages I127-I131, Aug 1992.
[3] G.S. Almasi and A. Gottlieb. Highly Parallel Computing. Benjamin/Cummings, Redwood City, CA, 1989.
[4] V.E. Benes. Mathematical theory on connecting networks and telephone traffic. Academic Press, New York, 1965.
[5] T. Blank and R. Tuck. Personal communications, 1992.
[6] H. Cam and J.A.B. Fortes. Rearrangeability of shuffle-exchange networks. In Symposium on Frontiers of Massively Parallel Computation, pages 305-314, Oct 1990.
[7] C. Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32:406-424, 1953.
[8] R. Koch. Increasing the size of a network by a constant factor can increase performance by more than a constant factor. In Symposium on Foundations of Computer Science, pages 221-230, Oct 1988.
[9] C.K. Kruskal and M. Snir. The performance of multistage interconnection networks for multiprocessors. IEEE Transactions on Computers, C-32(12), Dec 1983.
[10] T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, San Mateo, CA, 1991.
[11] G.F. Lev, N. Pippenger, and L.G. Valiant. A fast parallel algorithm for routing in permutation networks. IEEE Transactions on Computers, pages 93-100, Feb 1981.
[12] D.S. Parker. Notes on shuffle/exchange type networks. IEEE Transactions on Computers, C-29(3):213-222, March 1980.
[13] J.H. Patel. Performance of processor-memory interconnections for multiprocessors. IEEE Transactions on Computers, C-30(10):771-780, Oct 1981.
[14] T.H. Szymanski and V.C. Hamacher. On the universality of multipath multistage interconnection networks. The Journal of Parallel and Distributed Computing, 7(3):541-569, Dec 1989.
[15] A. Youssef, B.D. Alleyne, and I.D. Scherson. Permutation routing in restricted access networks. In Proceedings of International Parallel Processing Symposium, pages 403-406, March 1992.