Optimal Algorithms for Total Exchange without Buffering on the Hypercube

K. Coolsaet
State University of Ghent, Seminarie voor Algebra en Functionaalanalyse, Galglaan 2, B-9000 Gent, Belgium

H. De Meyer and V. Fack
State University of Ghent, Laboratorium voor Numerieke Wiskunde en Informatica, Krijgslaan 281-S9, B-9000 Gent, Belgium
Email: [email protected]

Published in: BIT 32 (1992) 559-569

Abstract
Two methods are given for constructing total exchange algorithms for hypercubic processor networks. This is done by means of bit sequences with special properties. The algorithms are optimal with respect to a given time model, need no intermediate message buffering and are local in the sense that every processor executes basically the same program.

Keywords: hypercube, total exchange, optimal algorithm
CR categories: F.2.2, G.2.2
1 Introduction

Consider a network of parallel processors where each processor is connected to a number of neighbouring processors through a duplex communication link. We often treat the processor network as an abstract graph, consisting of a set of nodes $N$ (the processors) and a set of links $L$ (the communication links). Each duplex connection between two adjacent processors is treated as consisting of 2 different links, one link for each direction.

We are mainly interested in the so-called d-dimensional hypercube. This graph has $2^d$ nodes and $d\,2^d$ links. It can best be described by numbering the nodes from $00\ldots0$ up to $11\ldots1$ using d-bit binary notation. Two nodes are linked if and only if their binary representations differ in exactly one bit position. Hence, each node is connected to $d$ other nodes.

A sequence of nodes $i_0, i_1, \ldots, i_k$ in a graph with the property that $i_j$ and $i_{j+1}$ are neighbours, for $0 \le j < k$, is called a path of length $k$ between nodes $i_0$ and $i_k$. Two nodes are connected if and only if such a path exists between them. A graph is called connected if and only if a path exists between every pair of nodes in that graph. Henceforth we will only consider connected graphs. The distance between two nodes in a graph is defined to be the length of a shortest path that connects these nodes. For example, in a hypercube the distance between two nodes is equal to the number of bit positions in which their representations differ.

In this paper we discuss data exchange algorithms on such processor networks. A data exchange algorithm is a set of programs (one for each processor in the network) which are executed in parallel and whose purpose is to send data (in the form of messages) between these processors. A total exchange algorithm is a data exchange algorithm which allows every node $i$ in the network to send a message $m_{ij}$ to every other processor $j$ in the network. Note that the messages $m_{ij}$, $i \neq j$, may all be different. We will concentrate on the order and the paths by which the messages are sent across the network, and not on the lower level processing needed to make these transfers possible.
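The hypercube description above can be sketched in a few lines of Python. This is only an illustration: the function names `neighbours` and `distance` are our own, and nodes are represented as integers whose binary expansion is the d-bit label.

```python
def neighbours(node: int, d: int) -> list[int]:
    """Nodes whose d-bit labels differ from `node` in exactly one bit position."""
    return [node ^ (1 << bit) for bit in range(d)]


def distance(i: int, j: int) -> int:
    """Hypercube distance: the number of bit positions in which i and j differ."""
    return bin(i ^ j).count("1")


d = 3
num_nodes = 2 ** d
# Every duplex connection counts as two links, one per direction.
num_links = sum(len(neighbours(node, d)) for node in range(num_nodes))
```

For d = 3 this gives 8 nodes and 24 directed links, in agreement with the counts $2^d$ and $d\,2^d$.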
In order to allow a theoretical treatment of our algorithms we make the following assumptions with respect to timing: we assume that the transfer of a message from one node to one of its neighbours occurs only at discrete time intervals, that the time taken for extra processing of messages within each node is negligible, and that one node can transfer messages to all of its neighbouring nodes simultaneously but at most one message at the same time across a given link (i.e., one message for each direction). As a consequence, if the distance between two nodes is $n$, then a message that is sent from one node to the other must travel at least $n$ units of time.

The algorithms which we describe below do not need intermediate buffering of messages, i.e., they have the property that every message $m_{ij}$ which arrives at a node $k \neq i, j$ at a given time $T$ leaves this node at time $T + 1$.

Let us now compute a theoretical lower bound for the execution time of a total exchange algorithm on the hypercube. First note that at a given time, at most $e$ messages can travel across the links of the network, where $e = d\,2^d$ is the total number of interconnections in the network. On the other hand, at least $M = \sum_{i,j} d(i,j)$ such transfers must occur for every message to reach its destination (here $d(i,j)$ denotes the distance between nodes $i$ and $j$, and $d(i,i) = 0$). Hence, any total exchange algorithm must take at least time $T = M/e$.

In a hypercube of dimension $d$, for node $i$, exactly $d$ nodes $j$ have $d(i,j) = 1$, $\binom{d}{2}$ nodes $k$ have $d(i,k) = 2$, and in general $\binom{d}{n}$ nodes lie at distance $n$. Hence

$$\sum_j d(i,j) = d + 2\binom{d}{2} + \cdots + n\binom{d}{n} + \cdots = d\,2^{d-1}.$$
Summing this result for every node $i$, we get $M = d\,2^{2d-1}$, and hence $T = 2^{d-1}$. The algorithm which we describe below takes exactly time $2^{d-1}$ and is therefore optimal under the given timing assumptions.

Recently, Saad and Schultz [8] have treated the total exchange problem on a variety of parallel architectures, including the hypercube. They call it the data transposition problem because of its equivalence with the problem of transposing a matrix where the matrix elements or blocks of matrix elements are distributed over the available processors. For the hypercube two algorithms [7, 8] are presented which take $d\,2^d$ and $2^d$ execution time respectively. A first improvement has been obtained by Bertsekas et al. [2], who discuss in their book an algorithm that takes $2^d - 1$ execution time. In [3] we presented a total exchange algorithm for the hypercube which is optimal but, unlike the algorithms presented below, does not avoid intermediate message buffering. The present algorithms are also easier to implement. Optimal algorithms with buffering have also been obtained recently by Bertsekas et al. [1] and Edelman [4].

Finally, we would like to mention that some of the present results were presented on a poster at the VAPP IV - CONPAR 90 Conference [5].
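The lower bound $T = M/e = 2^{d-1}$ derived in this introduction can be checked numerically for small dimensions. The sketch below simply sums all pairwise distances by brute force; the function name `lower_bound` is our own choice and not part of the original text.

```python
def lower_bound(d: int) -> float:
    """Return M / e for the d-dimensional hypercube, computed by brute force."""
    n = 2 ** d
    # M: total number of link transfers needed, the sum of all pairwise distances.
    total_transfers = sum(bin(i ^ j).count("1") for i in range(n) for j in range(n))
    # e: number of directed links, i.e. the maximum number of transfers per time step.
    links = d * n
    return total_transfers / links


bounds = {d: lower_bound(d) for d in range(1, 7)}
```

For d = 3, for instance, M = 96 and e = 24, so no total exchange algorithm can finish in fewer than 4 time steps.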
2 Locally defined algorithms

We have stated before that an algorithm really consists of a set of programs, one for each processor in the network. Hence, in principle we need to describe a different program for every single processor in the network. In practice however, we want the programs that run on different processors to be, in a certain sense, `the same'. This is made possible by having a given processor refer to other processors (in particular, to its immediate neighbours) using a relative address. For instance, on the 4-dimensional hypercube, processor 0110 does not refer to processor 0100 by means of its 4-bit node number, but uses a designation like `my 3rd neighbour' instead. If by means of such a technique all processors can be made to run the same program, then we say that the algorithm can be defined locally. However, to allow a formal description of this concept, we need to restrict the class of network configurations under consideration.

Let us introduce some definitions. A 1-to-1 mapping $\hat\sigma : N \to N$ is an automorphism of a graph $(N, L)$ if two nodes $\hat\sigma(i)$ and $\hat\sigma(j)$ are linked if and only if the nodes $i$ and $j$ are linked. The set of automorphisms of a given graph forms a group with composition law $\hat\sigma_1\hat\sigma_2(i) = \hat\sigma_1(\hat\sigma_2(i))$. A subgroup $\hat G$ of the automorphism group of a graph acts regularly on the graph if for every pair $i, j$ of nodes there exists exactly one element $\hat g$ of $\hat G$ such that $\hat g(i) = j$. In what follows we will consider only networks for which such a group $\hat G$ exists.

For the hypercube such a group $\hat G$ can easily be found. Indeed, let $k$ denote any bitstring of exactly $d$ bits, then the mapping $\hat k : i \mapsto i \oplus k$, where $\oplus$ denotes the `bitwise exclusive or' operation, is an automorphism of the hypercube. Moreover, the set $\hat G$ of all such mappings $\hat k$ forms a group with composition rule $\hat k \hat l = \widehat{k \oplus l}$. Also, $\hat G$ acts regularly on the hypercube, for a node $i$ is mapped onto a node $j$ by the mapping $\widehat{i \oplus j}$, and not by any other element of $\hat G$.

Given a processor network with associated group $\hat G$, we can identify the network with the group in the following manner: choose any node $e$ and associate this node with the identity $\hat 1 \in \hat G$. Associate any other node $j$ with the unique element $\hat\jmath$ of $\hat G$ that maps $e$ onto $j$. (This notation corresponds to the notation in the previous paragraph if we choose $e = 00\ldots0$ for the hypercube.) Now, if $S = \{g_1, \ldots, g_d\}$ denotes the set of neighbours of $e$, then $i$ and $j$ will be neighbours in our network if and only if the mapping $\hat\imath^{-1}$ maps $j$ onto an element of $S$, i.e. $\hat\imath^{-1}(j) \in S$, which is equivalent to $\hat\imath^{-1}\hat\jmath \in \hat S$, or by symmetry, $\hat\jmath^{-1}\hat\imath \in \hat S$. Note that this implies that for every $\hat s \in \hat S$ also $\hat s^{-1} \in \hat S$.

Conversely, every group $G$ and set $S \subseteq G$ with the property $S^{-1} \subseteq S$ and $1 \notin S$ can be associated with a graph, if we treat group elements as nodes and define $g, h \in G$ to be linked iff $h^{-1}g \in S$. For each $g \in G$ we define the automorphism $\hat g$ by the rule $\hat g(h) = gh$ $(h \in G)$. Then the set of all $\hat g$ $(g \in G)$ constitutes a group $\hat G$ (isomorphic with $G$) which acts regularly and transitively on the graph. Notice that if the elements of $S$ generate the group $G$ then the graph is connected.

On account of the rule given above there is no need to distinguish between elements of $\hat G$ (automorphisms) and the corresponding elements of $G$ (nodes). Hence we will drop the hat notation for automorphisms.
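The two claims made above for the hypercube, namely that every translation $i \mapsto i \oplus k$ is an automorphism and that the group of translations acts regularly, can be verified exhaustively for a small dimension. The following sketch does this for d = 3; the helper name `linked` and the variable names are illustrative assumptions.

```python
d = 3
nodes = range(2 ** d)


def linked(i: int, j: int) -> bool:
    """Two hypercube nodes are linked iff their labels differ in exactly one bit."""
    return bin(i ^ j).count("1") == 1


# Automorphism property: i and j are linked iff (i XOR k) and (j XOR k) are linked.
translations_are_automorphisms = all(
    linked(i, j) == linked(i ^ k, j ^ k)
    for k in nodes for i in nodes for j in nodes
)

# Regular action: for every pair (i, j) exactly one translation maps i onto j,
# namely the one with k = i XOR j.
action_is_regular = all(
    [k for k in nodes if i ^ k == j] == [i ^ j]
    for i in nodes for j in nodes
)

# Neighbour set S of the node e = 00...0: the d single-bit masks.
S = {1 << bit for bit in range(d)}
# Since every element is its own inverse under XOR, g and h are linked iff g XOR h lies in S.
cayley_matches_hypercube = all(
    linked(g, h) == ((g ^ h) in S) for g in nodes for h in nodes
)
```

All three flags evaluate to `True`, which illustrates (but of course does not prove for general d) the identification of the hypercube with the group of d-bit strings under exclusive or.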
A data exchange algorithm on a network with group $G$ is locally defined if, for every $\sigma \in G$ and for every message $m_{ij}$ which is sent by the algorithm from node $e$ to an adjacent node $g_a$ at time $T$, the message $m_{\sigma(i),\sigma(j)}$ is also sent at time $T$ from node $\sigma(e)$ to $\sigma(g_a)$. Hence, when an algorithm is locally defined, it is sufficient to know the sequence of messages sent at every instant from node $e$ to its neighbours to reconstruct the sequences of messages sent at any instant from any other node.
1    a    b    ba
a    1    ba   b
b    ba   1    a
ba   b    a    1

[Graph: the nodes 1, a, b and ba form a 4-cycle whose edges are labelled alternately a and b.]

Figure 1: The group $G_1$ and the corresponding graph.
The group representation can also be used to indicate paths within the graph. Indeed, in a path $i_0 \cdots i_n$, the nodes $i_a$ and $i_{a-1}$ are always adjacent, hence we can find elements $g_a \in S$ such that $i_{a-1}^{-1} i_a = g_a$ or equivalently $i_a = i_{a-1} g_a$. Therefore every path can be represented by a word $g_{a_1} \cdots g_{a_n}$ of elements of $S$, with the additional property that $i_k = i_0 g_{a_1} \cdots g_{a_k}$. Conversely, every word $g_{a_1} \cdots g_{a_k}$ corresponds with a path that starts in the node $i_0$ and ends in the node $i_k = i_0 g_{a_1} \cdots g_{a_k}$.

Now consider any locally defined data exchange algorithm. Take a message $m$ which is sent from node $e$ across a path $g_{a_1} \cdots g_{a_k}$. At a given time $T$ this message is passed from $e$ to the adjacent node $e g_{a_1} = g_{a_1}$. Hence, because of the locality of the algorithm, also at time $T$ a message $m'$ is passed from node $e g_{a_1}^{-1} = g_{a_1}^{-1}$ to $e$. The message $m'$ travels a path which corresponds to the same group word $g_{a_1} \cdots g_{a_k}$ as $m$, but starts at another node. After time $T$ the message $m'$ behaves as a message sent from node $e$ but across a shorter path $g_{a_2} \cdots g_{a_k}$. If $k$ happens to be 1, then this means that the message $m'$ has arrived at its destination. This property can be used to deduce which messages arrive at a given node at a given time.

As an example, let us consider the graph corresponding to the group $G_1 = \langle a, b \mid a^2 = b^2 = (ba)^2 = 1 \rangle$, taking $S_1 = \{a, b\}$ (cf. figure 1). We now describe a total exchange algorithm on this network. The algorithm is such that the messages are sent across the paths $a$, $b$ and $ba$. (Note that, although $ab = ba$ in $G_1$, the words $ab$ and $ba$ correspond to different paths, albeit with the same destination.) From now on a message sent from node $e$ to another node will be described by means of the path the message follows through the network. Our algorithm is locally defined, hence we only describe what happens at node 1. At time $T = 1$, we send message $a$ towards node $a$ and message $ba$ towards node $b$. Because of the above reasoning we know that a message with path $a$ will arrive from $a$ and a message with path $ba$ will arrive from $b$. The first message has arrived at its destination, the second is transformed into a message with path $a$. At time $T = 2$, we therefore need to send a message with path $a$ towards $a$ and a message with path $b$ towards $b$, completing the algorithm.

T   Waiting     Sent     Received
1   a, b, ba    a, ba    a
2   a, b        a, b

Table 1: Total exchange algorithm for $G_1$.

Table 1 introduces a tabular notation for this algorithm. In this format, the `Sent' column determines the actual algorithm. The `Received' column contains the same words as the `Sent' column, but with the first letter stripped off (words that become empty have arrived and are omitted), and the `Waiting' column starts out with all possible messages, removing the messages sent and adding the messages received at each step. The first letter of each message word in the `Sent' column indicates the link across which that message is sent. These letters must all be different.

As another example we give the hexagonal network corresponding to the 6 element group $G_2$ generated by $S = \{a, b\}$, with $1 = a^2 = b^2 = (ab)^3$ (cf. figure 2). A total exchange algorithm for $G_2$ is given in table 2.
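The tabular notation can be replayed mechanically. The sketch below, written under the assumption that messages are represented as Python strings over the letters of $S$, rebuilds the `Waiting' and `Received' columns from a given `Sent' column and checks the rule that no two words sent at the same time share a first letter. The function name `run_schedule` and its error messages are our own.

```python
from collections import Counter


def run_schedule(messages, schedule):
    """Replay a locally defined algorithm as seen from node 1.

    messages: path words waiting at time T = 1.
    schedule: list of lists, the words sent at T = 1, 2, ...
    Returns a list of rows (T, waiting, sent, received).
    """
    waiting = Counter(messages)
    rows = []
    for t, sent in enumerate(schedule, start=1):
        first_letters = [word[0] for word in sent]
        if len(set(first_letters)) != len(first_letters):
            raise ValueError(f"two messages use the same link at T={t}")
        if Counter(sent) - waiting:
            raise ValueError(f"a message sent at T={t} is not waiting")
        # By locality, what arrives is what was sent with its first letter stripped;
        # words that become empty have reached their destination.
        received = [word[1:] for word in sent if len(word) > 1]
        rows.append((t, sorted(waiting.elements()), list(sent), received))
        waiting = waiting - Counter(sent) + Counter(received)
    if waiting:
        raise ValueError("some messages were never delivered")
    return rows


rows_g1 = run_schedule(["a", "b", "ba"], [["a", "ba"], ["a", "b"]])
rows_g2 = run_schedule(
    ["a", "b", "ab", "ba", "aba"],
    [["aba", "ba"], ["ba", "a"], ["a", "b"], ["ab"], ["a", "b"]],
)
```

Here `rows_g1` corresponds to the algorithm for $G_1$ described above and `rows_g2` to the `Sent' column given for $G_2$ in table 2.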
1    a    b    ab   ba   aba
a    1    ab   b    aba  ba
b    ba   1    aba  a    ab
ab   aba  a    ba   1    b
ba   b    aba  1    ab   a
aba  ab   ba   a    b    1

[Graph: the nodes 1, a, ab, aba, ba and b form a hexagon whose edges are labelled alternately a and b.]

Figure 2: The group $G_2$ and the corresponding graph.

T   Waiting             Sent       Received
1   a, b, ab, ba, aba   aba, ba    ba, a
2   a, a, b, ab, ba     ba, a      a
3   a, a, b, ab         a, b
4   a, ab               ab         b
5   a, b                a, b

Table 2: Total exchange algorithm for $G_2$.

3 The hypercube

In the case of the hypercube we use a slightly simpler notation. In the group $G$ defined earlier we see that $S$ consists of all those operations $g_i$ which flip a single bit of their argument (the $i$th bit for $g_i$). Now every element $g \in G$ can be written as a product $g = g_{a_1} \cdots g_{a_k}$ with $a_1 > a_2 > \cdots > a_k$. Indeed, if we represent $g$ as a bitstring, then $a_1$ is the position of the most significant 1 bit in $g$, $a_2$ is the position of the second most significant 1 bit in $g$, and so on up to $a_k$, which is the position of the least significant 1 bit in $g$.

We use the above group words to represent the paths for our messages. This means that a message from node $00\ldots0$ to a node $g$ will always be sent across the link which corresponds to the position of the most significant 1 bit in the bit representation of $g$. We adapt our tabular notation for the algorithm accordingly. As an example we give a total exchange algorithm for the 3-dimensional hypercube in table 3. With this notation, the `Received' column contains the elements of the `Sent' column with the most significant 1 bit changed to 0 and all $00\ldots0$ elements removed. In this particular example, the algorithm does not need intermediate buffering, i.e., every message received at time $T$ is immediately sent further at time $T + 1$ (every bitstring in the `Received' column is copied into the `Sent' column of the next row).

T   Waiting                       Sent              Received
1   111,110,101,100,011,010,001   001*,011*,111*    001,011
2   110,101,100,011,010,001       001,011,110*      001,010
3   101,100,010,010,001           001,010,101*      001
4   100,010,001                   001,010*,100*

Table 3: Total exchange algorithm for the hypercube of dimension 3. Newly launched messages are marked with an asterisk.

In the case of a total exchange algorithm without buffering we can reconstruct the entire algorithm from the `Sent' column, where any message which has just been received is removed. In the above example this means that we only need to specify the bitstrings which are marked in table 3. We can use the shorthand notation illustrated in table 4. Such a table is called a launching pattern for the algorithm, as it indicates which messages are `launched' from the given node at which time.

Message   T
001       1
011       1
111       1
110       2
101       3
010       4
100       4

Table 4: Shorthand notation for the algorithm on the 3-cube.

Note that we can associate a data exchange algorithm with every such sequence of bitstrings, assuming we launch messages in the order defined by the sequence, and each message only as soon as its corresponding link becomes available. Note however that not every sequence gives a valid data exchange algorithm without intermediate buffering, for it might occur that at some time $T$ two different messages are received which must be sent across the same link at time $T + 1$. The theorem below gives a sufficient condition for this not to happen.
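The rule for turning a sequence of bitstrings into an algorithm (launch in order, as soon as the required link is free, and forward every received message immediately) can be simulated directly. The following sketch represents messages as integers and identifies a link with the position of the most significant 1 bit; the function name `simulate` and the use of `ValueError` to signal a buffering conflict are our own assumptions.

```python
def simulate(launch_order):
    """Run the no-buffering algorithm at node 00...0 for a given launch order.

    Returns (launch_time, total_time), where launch_time maps each message to
    the time step at which it is launched. Raises ValueError if two received
    messages would need the same link, i.e. if buffering would be required.
    """
    pending = list(launch_order)
    in_transit = []          # messages received at the previous time step
    launch_time = {}
    t = 0
    while pending or in_transit:
        t += 1
        links = {}           # link number -> message sent across it at time t
        for message in in_transit:
            link = message.bit_length()   # position of the most significant 1 bit
            if link in links:
                raise ValueError(f"buffering would be needed at T={t}")
            links[link] = message
        # Launch new messages in order, as long as the required link is free.
        while pending and pending[0].bit_length() not in links:
            message = pending.pop(0)
            links[message.bit_length()] = message
            launch_time[message] = t
        # Each sent message loses its most significant 1 bit; 0 means "arrived".
        stripped = [m ^ (1 << (m.bit_length() - 1)) for m in links.values()]
        in_transit = [m for m in stripped if m != 0]
    return launch_time, t


order = [0b001, 0b011, 0b111, 0b110, 0b101, 0b010, 0b100]
launch_time, total_time = simulate(order)
```

For the sequence of the 3-cube example the simulation reproduces the launching pattern of table 4, whereas an order such as 011, 101, 001, ... fails because the messages 011 and 101 both turn into 001 after one step.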
4 The main theorem

In the example algorithm above, the bitstrings $s_1 = 001, s_2 = 011, \ldots, s_7 = 100$ have the special property that $s_{a+1}$ can be formed by shifting $s_a$ one position to the left and adding a 0 or 1 at the end. We will now study such sequences more closely. Consider any sequence of bits $b_1, \ldots, b_{M+d-1}$ with the following property:

$$b_1 = \cdots = b_{d-1} = 0, \quad b_d = 1, \quad b_M = \cdots = b_{M+d-1} = 0 \tag{1}$$

and define

$$s_k = b_k b_{k+1} \cdots b_{k+d-1}, \quad 1 \le k \le M \tag{2}$$

(hence $s_1 = 0 \cdots 01$ and $s_M = 0 \cdots 0$), then we have the following

THEOREM 1. If none of the $s_i$ with $i < M$ consists of all zeroes, then the sequence $s_1, \ldots, s_{M-1}$ corresponds to a valid data exchange algorithm. The time taken by this algorithm is equal to the number of 1 bits in the sequence $b_1 \cdots b_M$.

Proof. Consider a time $T$ at which the `Sent' column of the considered data exchange algorithm is of the following form:

$$\begin{array}{cccccc}
0 & 0 & \cdots & 0 & 0 & 1 \\
0 & 0 & \cdots & 0 & 1 & b_{k+1} \\
0 & 0 & \cdots & 1 & b_{k+1} & b_{k+2} \\
\vdots & & & & & \vdots \\
0 & 1 & b_{k+1} & \cdots & b_{k+d-3} & b_{k+d-2} \\
1 & b_{k+1} & b_{k+2} & \cdots & b_{k+d-2} & b_{k+d-1}
\end{array} \tag{3}$$

where the bitstring $1\, b_{k+1} b_{k+2} \cdots b_{k+d-1}$ corresponds to some $s_k$. Clearly, this is the case for $T = 1$ (with $k = d$). We will now prove that this will also be the case for time $T + 1$ when it is true for time $T$.

Now, note that the last message in (3) has most significant bit 1, and can therefore not be a message which was received at time $T - 1$ from some other node; in other words, it is a newly launched message. Let $p$ denote the number of consecutive 0 bits counting from $b_{k+1}$ onward, i.e.:

$$b_{k+1} = b_{k+2} = \cdots = b_{k+p} = 0, \quad b_{k+p+1} = 1, \quad 0 \le p \le d - 1. \tag{4}$$

With this definition, the first $p + 1$ messages of (3) only need to travel distance 1 and the other elements need to travel a distance $> 1$. The `Received' column at time $T$ therefore contains the following elements:

$$\begin{array}{cccccc}
0 & 0 & \cdots & 0 & 0 & 1 \\
0 & 0 & \cdots & 0 & 1 & b_{k+p+2} \\
\vdots & & & & & \vdots \\
0 & \cdots & 0 & 1 & b_{k+p+2} & \cdots \; b_{k+d-1}
\end{array} \tag{5}$$

We are constructing an algorithm with no intermediate buffering, hence the `Sent' column at time $T + 1$ contains all $d - p - 1$ messages of (5) together with $p + 1$ newly launched messages $s_{k+1}, \ldots, s_{p+k+1}$. This produces a new `Sent' list for time $T + 1$, which on account of (4) has the same structure as (3).

As a further consequence, we note that at every time interval up to the time $t$ when the last message $s_{M-1}$ is launched, a message is sent across each of the $d$ links. In other words, the links are always kept fully occupied. Moreover, as $s_{M-1}$ has the special form $10 \cdots 0$, we know that every message sent at time $t$ only has to travel a distance of 1 and therefore nothing remains to be sent at time $t + 1$. The algorithm therefore takes time $t$. It is an easy consequence of the above that $t$ is actually equal to the number of 1 bits in the sequence $b_1 \ldots b_{M+d-1}$.

As an immediate consequence we have the following
THEOREM 2. Given a bit sequence $b_1 \ldots b_{N+d-1}$, with $N = 2^d$, such that the corresponding bitstrings $s_1, \ldots, s_{N-1}$ are all different from each other and from $0 \cdots 0$, then the corresponding data exchange algorithm is a total exchange algorithm which takes time $2^{d-1}$ (and is therefore optimal).

Proof. This follows from the fact that every node receives a message from the given node $0 \cdots 0$ and that exactly $2^{d-1}$ of the strings $s_i$ have the most significant bit equal to 1.
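The hypotheses of theorems 1 and 2 are easy to test for a concrete bit sequence. The sketch below takes the sequence as a Python string of `0' and `1' characters (an assumption of this example), extracts the windows $s_k$ of definition (2), checks condition (1) and the distinctness requirement, and counts the 1 bits to obtain the running time. The names `windows` and `check_theorem2` are our own.

```python
def windows(bits: str, d: int) -> list[str]:
    """The bitstrings s_k = b_k ... b_{k+d-1}, listed for k = 1, 2, ..."""
    return [bits[k:k + d] for k in range(len(bits) - d + 1)]


def check_theorem2(bits: str, d: int) -> tuple[bool, int]:
    """Return (conditions hold, number of 1 bits in b_1 ... b_N) for N = 2^d."""
    n = 2 ** d
    if len(bits) != n + d - 1:
        return False, bits.count("1")
    s = windows(bits, d)                      # s_1, ..., s_N
    starts_ok = bits.startswith("0" * (d - 1) + "1")
    ends_ok = bits.endswith("0" * d)          # b_N = ... = b_{N+d-1} = 0
    launched = s[:n - 1]                      # s_1, ..., s_{N-1}
    distinct = len(set(launched)) == n - 1 and "0" * d not in launched
    running_time = bits[:n].count("1")
    return starts_ok and ends_ok and distinct, running_time


result = check_theorem2("0011101000", 3)
```

For the sequence 0011101000 underlying the 3-cube example, the check succeeds and the running time is 4, which equals $2^{d-1}$ for d = 3.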
d   f(x)               sequence
2   x^2 + x + 1        011 00
3   x^3 + x + 1        0010111 000
3   x^3 + x^2 + 1      0011101 000
4   x^4 + x + 1        000100110101111 0000
4   x^4 + x^3 + 1      000111101011001 0000
5   x^5 + x^2 + 1      0000100101100111110001101110101 00000

Table 5: Bit sequences generated by theorem 3.
5 Constructing valid sequences

The main question which now remains to be answered is `Do such sequences exist for every dimension $d$?' Up to now we have found two methods to construct them:

THEOREM 3. Consider a polynomial $f(x) = x^d + a_{d-1} x^{d-1} + \cdots + a_0 \in \mathbb{Z}/2\mathbb{Z}[x]$ which is a characteristic polynomial of a primitive element of $GF(2^d)$. Then the linear recurrence

$$b_1 = \ldots = b_{d-1} = 0, \quad b_d = 1, \quad b_{i+d} = (a_0 b_i + a_1 b_{i+1} + \cdots + a_{d-1} b_{d+i-1}) \bmod 2$$

yields a sequence $b_1 \ldots b_{N-1}$ which satisfies the conditions of theorem 2.

Proof. This is an immediate consequence of theorem 10.1, page 153 in [2].

Examples of such sequences for small $d$ are given in table 5. Table 6 (on the left) shows the algorithm on the 4-cube which is derived from the case $d = 4$, $f(x) = x^4 + x + 1$.
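The linear recurrence of theorem 3 can be sketched as follows. The coefficient list `coeffs = [a_0, ..., a_{d-1}]` and the function name `lfsr_sequence` are conventions chosen for this example; the $d$ trailing zeros are appended as required by condition (1), which matches the layout of table 5. The code does not check that $f$ is primitive, so that remains the caller's responsibility.

```python
def lfsr_sequence(coeffs: list[int], d: int) -> str:
    """Bits b_1 ... b_{N-1} of the recurrence, followed by d zeros (N = 2^d).

    coeffs holds a_0, ..., a_{d-1} of f(x) = x^d + a_{d-1} x^{d-1} + ... + a_0.
    """
    n = 2 ** d
    bits = [0] * (d - 1) + [1]                 # b_1 = ... = b_{d-1} = 0, b_d = 1
    while len(bits) < n - 1:
        i = len(bits) - d                      # 0-based index of b_i in the recurrence
        new_bit = sum(a * b for a, b in zip(coeffs, bits[i:i + d])) % 2
        bits.append(new_bit)
    bits += [0] * d                            # padding b_N = ... = b_{N+d-1} = 0
    return "".join(str(b) for b in bits)


seq_d4 = lfsr_sequence([1, 1, 0, 0], 4)        # f(x) = x^4 + x + 1
```

With $f(x) = x^4 + x + 1$ this yields the fourth row of table 5, written there with a space before the padding zeros.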
Message  T      Message  T
0001     1      0001     1
0010     1      0011     1
0100     1      0111     1
1001     1      1111     1
0011     2      1110     2
0110     2      1101     3
1101     2      1011     4
1010     3      0110     5
0101     4      1100     5
1011     4      1001     6
0111     5      0010     7
1111     5      0101     7
1110     6      1010     7
1100     7      0100     8
1000     8      1000     8

Table 6: Total exchange algorithms on the 4-dimensional hypercube

THEOREM 4. The first $N + d - 1$ elements of the sequence generated by the following rules
- $b_1 = \cdots = b_{d-1} = 0$, $b_d = 1$.
- $b_{i+d} = 1$ if $s_{i+1} = b_{i+1} \ldots b_{i+d-1} 1$ is not a subsequence of $b_1 \cdots b_{i+d-1}$.
- $b_{i+d} = 0$ otherwise.
satisfy the conditions of theorem 2.

Proof. As a consequence of the above rules, two bitstrings $s_i$ and $s_j$, with $i < j$, can only be equal if they both end in 0. In that case there must exist a third bitstring $s_k$, with $k < i < j$, which differs from $s_i$ and $s_j$ only in the last bit. Hence, provided $k > 1$, two of the strings $s_{k-1}$, $s_{i-1}$ and $s_{j-1}$ must also be equal. Repeating the argument we ultimately find amongst all pairs of equal strings the pair $(s_i, s_j)$ for which $i$ is the smallest. In that case the corresponding $k$ must be 1, and $s_i = 00 \cdots 0$. We will prove that $i = N$, which proves the theorem.
Assume $i < N$. In that case there must exist at least one bitstring $s \neq 0 \cdots 0$ which does not occur in the set $s_1, \ldots, s_i$. Write $s$ as $bu$ where $u$ is a string of $d - 1$ bits, and $b$ is either 0 or 1. Consider the string $s' = u0$ and assume that $s'$ occurs in the sequence, say $s' = s_p$ (note that $p > 1$). Because of the given rules, also $u1$ must occur somewhere in the sequence, say $u1 = s_q$. Note that also $q > 1$, or otherwise $s' = 0 \cdots 0$ and also $s = 10 \cdots 0 = s_{i-1}$. Hence $\{s_{p-1}, s_{q-1}\} = \{0u, 1u\}$, and therefore $s = bu$ must occur somewhere in the sequence. This contradicts our assumption, hence $s'$ does not occur in the sequence. Applying this reasoning again to the string $s'$, we find a string $s''$ that ends in two zeroes which does not occur in the sequence. Continuing in this manner, we eventually find that $00 \cdots 0$ may not occur in the sequence. This is a contradiction.

d   sequence
2   01100
3   0011101000
4   0001111011001010000
5   000011111011100110101100010100100000

Table 7: Sequences generated by theorem 4.

Table 7 contains some examples of sequences generated by this theorem. In table 6 (on the right) we give a total exchange algorithm on the 4-cube which is based on such a sequence.
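The rules of theorem 4 translate almost literally into code. In the sketch below the sequence is held as a Python string and `subsequence' is read as `contiguous substring', which is the interpretation consistent with the definition of the $s_k$; the function name `prefer_one_sequence` is our own.

```python
def prefer_one_sequence(d: int) -> str:
    """First N + d - 1 bits generated by the rules of theorem 4 (N = 2^d)."""
    n = 2 ** d
    bits = "0" * (d - 1) + "1"                 # b_1 = ... = b_{d-1} = 0, b_d = 1
    while len(bits) < n + d - 1:
        # Candidate window: the last d - 1 bits followed by a 1.
        candidate = bits[len(bits) - d + 1:] + "1"
        # Append 1 only if that window has not occurred before.
        bits += "0" if candidate in bits else "1"
    return bits


sequences = {d: prefer_one_sequence(d) for d in (2, 3, 4, 5)}
```

For d = 2, 3 and 4 the output coincides with the corresponding rows of table 7, and for each generated sequence the windows $s_1, \ldots, s_{N-1}$ can be checked to be pairwise distinct and nonzero.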
References

[1] D. Bertsekas, C. Ozveren, G. Stamoulis, P. Tseng and J. Tsitsiklis, Optimal Communication Algorithms for Hypercubes, Journal of Parallel and Distributed Computing 11, 263-275 (1991)
[2] D. Bertsekas and J. Tsitsiklis, Parallel and distributed computation, Numerical methods (Prentice Hall International Editions, 1989)
[3] K. Coolsaet and V. Fack, Total exchange algorithms on `sandwich graphs', Computers Math. Applic. 22, 45-48 (1991)
[4] A. Edelman, Optimal Matrix Transposition and Bit Reversal on Hypercubes: All-to-All Personalized Communication, Journal of Parallel and Distributed Computing 11, 328-331 (1991)
[5] V. Fack, K. Coolsaet and H. De Meyer, Optimal Algorithms for Total Exchange without Buffering on the Hypercube, poster presented at VAPP IV - CONPAR 90, Joint Conference on Vector and Parallel Processing, Zurich (Switzerland), September 10-13, 1990
[6] R.J. McEliece, Finite Fields for Computer Scientists and Engineers (Kluwer Academic Publishers, 1987)
[7] Y. Saad and M. Schultz, Data Communication in Hypercubes, Journal of Parallel and Distributed Computing 6, 115-135 (1989)
[8] Y. Saad and M. Schultz, Data Communication in Parallel Architectures, Parallel Computing 11, 131-150 (1989)