OPTIMAL ALL-TO-ALL BROADCAST SCHEMES IN DISTRIBUTED COMPUTING SYSTEMS

Ming-Syan Chen, Philip S. Yu and Kun-Lung Wu
IBM Thomas J. Watson Research Center
P.O. Box 704, Yorktown Heights, NY 10598

Abstract

Broadcast, the dissemination of a message from one originating node to all other nodes in a distributed system, is a very important issue in distributed computing. All-to-all broadcast is the process by which every node broadcasts its own piece of information to all other nodes. In this paper, we first develop the optimal all-to-all broadcast scheme for the case of one-port communication, in which each node can send out only one message in one communication step. We then extend our results to the case of multi-port communication, i.e., k-port communication, in which each node can send out k messages in one communication step. We prove that the proposed schemes are optimal for the model considered in the sense that they not only require the minimal number of communication steps, but also incur the minimal number of messages.

Index Terms: Distributed computing systems, all-to-all broadcast, NODUP, partitioning trees, minimal complete sets, multi-port communication.

1 Introduction

The availability of inexpensive, high-performance microprocessors has made it attractive to link together many powerful and autonomous computers to build a distributed computing system for better availability and cost performance [7] [17] [19] [21]. In such a system, instead of using a shared memory and a global clock, all synchronization and communication between the processing nodes are done via message passing [1] [22]. Since data are distributed, not shared, special schemes are generally required to perform various distributed computations. One such scheme is broadcast [6] [8] [9] [11] [16] [25], which refers to a process of information dissemination in a distributed system whereby a message originating from a certain node is sent to all other nodes in the system. An example of a one-to-all broadcast scheme can be found in Figure 1, where {1} denotes the message broadcast by the originator N1, and the broadcast is completed in 3 steps while incurring 4 messages.

In addition to one-to-all broadcast, all-to-all broadcast, where every node (rather than a single node as in one-to-all broadcast) has a piece of information to be shared with others, is also very important in numerous applications in distributed computing. Applications of all-to-all broadcast include decentralized consensus protocols [3], extrema finding, coordination of distributed checkpoints [14], acquisition of a new global state [5], and the broadcast of personalized information [12]. Similar to one-to-all broadcast, all-to-all broadcast schemes are characterized by the number of message steps required and the total number of messages incurred to complete the broadcast [11] [12] [23] [26]. Several studies have been conducted to minimize the number of message steps (or time) of one-to-all and all-to-all broadcast schemes for various communication networks and distributed environments. A recent survey can be found in [11] (see footnote 1).
Footnote 1: All-to-all broadcast is the same as the gossiping problem in [11], except that a two-way transmission, such as a phone conversation, is assumed for the latter.

Also, to reduce the number of messages, some broadcast schemes and consensus protocols were proposed in [14] [26] and shown to be efficient in terms of message complexity. To reduce the overhead of the scheme without compromising its efficiency, we would naturally like to complete the broadcast in the minimal number of steps while incurring as few messages as possible. However, to date, despite its importance, the problem of determining the minimal number of messages required for all-to-all broadcast in the minimal number of steps has not been solved. Consequently, we derive in this study optimal all-to-all broadcast schemes for a distributed system that complete the broadcast with not only the minimal number of steps but also the minimal number of messages. To facilitate our presentation, we start by considering the case of one-port communication, which means that each node can send out only one message in one communication step. We then show our results for the case of multi-port communication [12], i.e., k-port communication, meaning that each node can simultaneously send out k messages in one communication step. Specifically, we consider completely connected systems with synchronous communication. Such a model was employed in other related work [11]. (A detailed description of the model can be found in Section 2.)

Figure 1: A one-to-all broadcast scheme in a system of 5 nodes (3 steps and 4 messages). [Panels: (a) Step 1, (b) Step 2, (c) Step 3.]

For ease of exposition, we use the identification (id) of each node to denote the information that this node wants to broadcast to every other node (see footnote 2). Also, as in [11] [20] [24], to reduce the cost of transmission, the schemes investigated here are those without duplicate information, i.e., every message conveys only new information to its receiver. This sort of scheme is termed "NODUP" (for no duplication) in [11]. All-to-all broadcast is completed once each node has received the id's of all other nodes.

Footnote 2: Depending on the application, the real content of an id can be very general, such as personalized information or a database, the numbers to be sorted, a vector describing the local state, or a "yes" or "no" vote of the commit protocol in a distributed transaction, to name a few.

The problem studied in this paper can be best understood by considering the case of all-to-all broadcast among 5 nodes with one-port communication in Figure 2. The information collected thus far after each step is shown in the bracket next to each node. An arrow pointing from node Ni to node Nj represents that Ni is sending what it knows thus far to Nj. For example, in step 1 in Figure 2, nodes N1 and N2 simultaneously send their own id's to node N4. Thus, after step 1, node N4 will have the information {1,2,4}. Note that one message might consist of more than one id (see footnote 3), and each message takes exactly one communication step. We use black nodes to denote the ones that have gathered all the information from all other nodes, and white nodes to denote those that still have incomplete information. As shown in Figure 2, all-to-all broadcast is completed after 3 steps while incurring 12 messages. For illustration, another all-to-all broadcast for a system of 5 nodes is shown in Figure 3, where 4 steps and 8 messages are required, showing a trade-off between the number of messages and the number of communication steps.

Footnote 3: A message $\{a_1, a_2, \ldots, a_j\}$ should be viewed as $f(a_1, a_2, \ldots, a_j)$, where f is an application-dependent function.

Figure 2: An all-to-all broadcast scheme in a system of 5 nodes (3 steps and 12 messages). [Panels: (a) Step 1, (b) Step 2, (c) Step 3.]

Figure 3: Another all-to-all broadcast in a system of 5 nodes (4 steps and 8 messages). [Panels: (a) Step 1, (b) Step 2, (c) Step 3, (d) Step 4.]

To develop the optimal all-to-all broadcast scheme for one-port communication, we shall first introduce the concept of a balanced binary partitioning tree of a positive number to exploit the nature of NODUP schemes. In light of the balanced binary partitioning tree, we devise an addressing scheme, based on the topology of a hypercube, for the nodes in the distributed system. Using the addressing scheme and the concept of minimal complete sets to be introduced later, the optimal all-to-all broadcast scheme can be systematically executed according to the balanced binary partitioning tree, and completed by a system of p nodes in the minimal number of steps, i.e., $\lceil \log_2 p \rceil$ steps, while incurring $np + p - 2^n$ messages, which is proved to be the minimal number of messages required for all-to-all broadcast in n steps, where $n = \lceil \log_2 p \rceil$. Moreover, in light of the topology of generalized hypercubes [2], we extend our results to the case of multi-port communication. It is proved in Theorem 3 that the proposed scheme not only requires the minimal number of steps, i.e., $\lceil \log_{k+1} p \rceil$ steps, but also incurs the minimal number of messages required to complete k-port all-to-all broadcast in $\lceil \log_{k+1} p \rceil$ steps.

To the best of our knowledge, there is no prior work on determining the minimal number of messages required for all-to-all broadcast in the minimal number of steps for a distributed system of an arbitrary number of nodes, for either one-port or k-port communication. This feature distinguishes our work from others.

This paper is organized as follows. Preliminaries are given in Section 2. We develop optimal all-to-all broadcast schemes for 1-port and k-port communication in Sections 3 and 4, respectively. This paper concludes with Section 5.

2 Preliminaries

We shall use the hypercube topology to facilitate the presentation of our broadcast schemes. However, as will become clear later, this does not mean that the number of nodes in the system has to be equal to that of a hypercube. An n-dimensional hypercube, denoted by $Q_n$, can be defined as follows.

Definition 1: A $Q_n$ is defined recursively as follows [10].

(i). $Q_0$ is a trivial graph with one node, and
(ii). $Q_n = K_2 \times Q_{n-1}$, where $K_2$ is the complete graph with two nodes.
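The recursion in Definition 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `hypercube` and the representation of a graph as a dictionary from bit-string addresses to neighbor lists are choices made here, not notation from the paper.

```python
def hypercube(n):
    """Build Q_n following Definition 1: Q_0 is one node, Q_n = K_2 x Q_{n-1}.

    Returns a dict mapping each n-bit address (a string) to its neighbor list.
    """
    if n == 0:
        return {"": []}  # Q_0: a single node with the empty address
    smaller = hypercube(n - 1)
    graph = {}
    for bit in "01":  # the two nodes of K_2
        other = "1" if bit == "0" else "0"
        for addr, nbrs in smaller.items():
            # Edges inside this copy of Q_{n-1}, plus the single K_2 edge
            # joining the node to its twin in the other copy.
            graph[bit + addr] = [bit + v for v in nbrs] + [other + addr]
    return graph


q3 = hypercube(3)
print(len(q3), sorted(q3["000"]))
```

Each node of the resulting graph has exactly n neighbors, one per dimension.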

A $Q_n$ contains $2^n$ nodes. Let $\Sigma$ be the ternary symbol set {0, 1, *}, where * is a don't-care symbol. Every subcube in a $Q_n$ can then be uniquely represented by a string of symbols in $\Sigma$. Such a string of ternary symbols is called the address of the corresponding subcube. The rightmost coordinate of the address of a subcube will be referred to as dimension 1, the second rightmost coordinate as dimension 2, and so on. For a distributed system of p nodes, we shall address the p nodes with n-bit strings of ternary symbols, where $n = \lceil \log_2 p \rceil$. Also, we use $\bar{b}_i$ to denote the inverse of a bit $b_i$, so that $\bar{1} = 0$, $\bar{0} = 1$ and $\bar{*} = *$. A node $b_n \ldots \bar{b}_i \ldots b_1$ is called the i-th neighbor of node $b_n \ldots b_i \ldots b_1$, and vice versa.

We use the identification (id) of each node to denote the information that this node wants to broadcast to every other node. Also, the information at each node means the set of id's that the node has collected thus far, and the content of the message of a transmission is the information of the sender at the time of transmission. One message might contain many id's. The terms communication step and message step are used interchangeably. For example, in step 2 of Figure 2, the message from N3 to N1 is {3,5}, and the information at N1 after step 2 is {1,3,4,5}. An all-to-all broadcast is said to be completed if all nodes in the system have received all id's in the system. The system model we consider can be summarized as follows.

Model M

1. The system is completely connected with synchronous communication.
2. Every message sent in the system takes one communication step.
3. Only NODUP (no duplication) schemes, where each message conveys only new information to its receiver, are considered.
4. k-port communication means that each node is capable of sending k messages out to any k receivers in one step. (There is no restriction on the number of messages each node can receive in one step.)

Definition 2: An all-to-all broadcast scheme is called optimal if, under the above model M, the following two conditions are satisfied.

1. It completes the broadcast in the minimal number of steps.
2. It incurs the minimal number of messages required to complete the broadcast in the minimal number of steps.

Note that the assumption that each message takes one communication step can be justified by the technique of virtual cut-through for communication [13], which can be incorporated into multicomputer systems such as the iPSC/2 [4]. In addition, as will be proved in Section 4 later, under a NODUP scheme for a system of p nodes, the total number of id's sent in all messages during the entire scheme is $p(p-1)$, which is the minimal number of id's that need to be sent for an all-to-all broadcast. This is the reason that we study NODUP schemes. Consequently, the objective of this paper is to develop optimal all-to-all broadcast schemes under model M.

3 Optimal All-To-All Broadcast for 1-Port Communication

In this section, we develop the optimal all-to-all broadcast scheme for one-port communication. As will be proved in Theorem 2, for a system of p nodes with one-port communication, the minimal number of messages required for all-to-all broadcast in n steps is $np + p - 2^n$, where $n = \lceil \log_2 p \rceil$. To facilitate our presentation, we shall propose an addressing scheme for nodes in the system. In light of the addressing scheme, the proposed broadcast algorithm can be systematically executed and shown to incur the minimal number of messages in the minimal number of steps. The scheme developed in this section is extended to the case of k-port communication in Section 4. To describe the addressing scheme, it is necessary to define a partitioning tree and a balanced binary partitioning tree of a positive number as follows.

Definition 3: A partitioning tree of a positive number p is a tree where the root node is labeled with p, all leaf nodes are labeled with ones, and the number labeled in each non-leaf node (or internal node) is the sum of those labeled in its child nodes.

Definition 4: A balanced binary partitioning tree of a positive number p is a binary partitioning tree constructed as follows.

1. Label the root node with p.
2. For each node with a label $k \ge 2$, generate the left and right children of this node and label them with $\lceil k/2 \rceil$ and $\lfloor k/2 \rfloor$, respectively.

For example, the balanced binary partitioning tree with p = 6 is given in Figure 4. Clearly, there are n + 1 levels in the balanced binary partitioning tree of a number p, where $n = \lceil \log_2 p \rceil$. For convenience, the level of the root is called level 0. Using the balanced binary partitioning tree, the nodes in the system can be addressed as follows.

Addressing scheme A1

Step 1: For a system of p nodes, obtain the balanced binary partitioning tree of p.

Step 2: For every internal node, code the edge to its left child with a bit "0" and that to its right child with a bit "1".

Step 3: Determine the address of each leaf node by the coded bits in the edges on the path from the root to that node.

Step 4: Append a bit "*" to each leaf node in level n − 1.

Step 5: Assign arbitrarily the p nodes in the system with the addresses of the p leaf nodes in the balanced binary partitioning tree.
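Steps 1 to 4 of scheme A1 can be sketched as a short recursive Python function. The function name `a1_addresses` and the use of strings over the symbols 0, 1 and * are assumptions made for this illustration. Step 5 (the arbitrary assignment of addresses to physical nodes) is left out because any assignment is allowed.

```python
def a1_addresses(p):
    """Return the p leaf addresses produced by addressing scheme A1.

    The balanced binary partitioning tree of p is walked from the root:
    a node labeled k >= 2 has a left child ceil(k/2) (edge bit "0") and a
    right child floor(k/2) (edge bit "1"). Leaves that sit in level n-1
    are padded with "*", where n = ceil(log2 p).
    """
    n = (p - 1).bit_length()  # equals ceil(log2 p) for p >= 1
    addresses = []

    def descend(label, prefix):
        if label == 1:  # leaf: pad to n symbols if it lies in level n-1
            addresses.append(prefix + "*" * (n - len(prefix)))
            return
        descend((label + 1) // 2, prefix + "0")  # left child, ceil(k/2)
        descend(label // 2, prefix + "1")        # right child, floor(k/2)

    descend(p, "")
    return addresses


print(a1_addresses(6))
```

For p = 6 the function yields the six leaf addresses 000, 001, 01*, 100, 101 and 11*.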

Figure 4: The balanced binary partitioning tree and its addressing scheme when p = 6. [Tree labels by level: level 0: 6; level 1: 3, 3; level 2: 2, 1, 2, 1; level 3: 1, 1, 1, 1. Leaf addresses, left to right: 000, 001, 01*, 100, 101, 11*.]

An example of the above addressing scheme can be found in Figure 4. Note that while those nodes in level $n = \lceil \log_2 p \rceil$ of the balanced binary partitioning tree of a number p are addressed as hypercube nodes, i.e., $Q_0$'s, those leaf nodes in level n − 1 are addressed as 1-dimensional subcubes, i.e., $Q_1$'s. A $Q_3$ whose subcubes are used to address the 6 nodes in Figure 4 is given in Figure 5. In light of the addressing scheme, optimal all-to-all broadcast can be described by algorithm G below, where the primitive send(M, $b_n b_{n-1} \ldots b_1$) means sending the message M to the node $b_n b_{n-1} \ldots b_1$, receive(RM) means receiving the messages RM, and $M \cup RM$ denotes the union of the messages M and RM.

Algorithm G:

/* Let p be the number of nodes in the system and $n = \lceil \log_2 p \rceil$. */
Address each node according to scheme A1.
/* Node $b_n b_{n-1} \ldots b_1$ does the following. */
M := {$b_n b_{n-1} \ldots b_1$};
for j = n to 1 step −1 do
begin
    if $b_j \ne *$ then send(M, $b_n \ldots \bar{b}_j \ldots \hat{b}_1$);
        /* where $\hat{b}_1 = 0$ if $b_1 = *$; otherwise $\hat{b}_1 = b_1$. */
    receive(RM);
    M := M $\cup$ RM;
end
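The following Python sketch simulates algorithm G synchronously for a system of p nodes. It is an illustration under stated assumptions, not the authors' implementation: the names `a1_addresses`, `simulate_g` and `owner` are introduced here, and the sketch assumes that a message sent to a full n-bit address that is not itself a node address is delivered to the node whose address has * in dimension 1 and agrees on the other bits (for example, a message for 010 reaches node 01*).

```python
def a1_addresses(p):
    """Leaf addresses of the balanced binary partitioning tree (scheme A1)."""
    n = (p - 1).bit_length()  # ceil(log2 p)
    out = []

    def descend(label, prefix):
        if label == 1:
            out.append(prefix + "*" * (n - len(prefix)))
            return
        descend((label + 1) // 2, prefix + "0")
        descend(label // 2, prefix + "1")

    descend(p, "")
    return out


def simulate_g(p):
    """Run algorithm G; return (messages per step, final info, NODUP flag)."""
    n = (p - 1).bit_length()
    nodes = a1_addresses(p)
    info = {v: {v} for v in nodes}  # M := {own id}
    per_step, nodup = [], True

    def owner(target):
        # Assumption: an address with no node of its own belongs to the
        # node that carries "*" in dimension 1.
        return target if target in info else target[:-1] + "*"

    for j in range(n, 0, -1):       # dimensions n, n-1, ..., 1
        pos = n - j                 # string index of dimension j
        inbox = {v: set() for v in nodes}
        sent = 0
        for v in nodes:
            if v[pos] == "*":       # b_j = *: this node stays silent
                continue
            bits = list(v.replace("*", "0"))       # set * to 0
            bits[pos] = "1" if bits[pos] == "0" else "0"  # invert b_j
            dest = owner("".join(bits))
            nodup = nodup and info[v].isdisjoint(info[dest])
            inbox[dest] |= info[v]  # the message is the sender's information
            sent += 1
        for v in nodes:             # M := M union RM, applied synchronously
            info[v] |= inbox[v]
        per_step.append(sent)
    return per_step, info, nodup


steps, info, nodup = simulate_g(6)
print(steps, sum(steps), nodup)
```

The NODUP flag only checks that each message is disjoint from what its receiver already holds at the start of the step, which is the property required by Model M.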

Figure 5: Illustrating the addressing of 6 nodes by a $Q_3$. [The six node addresses 000, 001, 01*, 100, 101 and 11* are drawn as subcubes of a $Q_3$ with vertices 000 through 111.]

To show the operations of algorithm G, an example for a system of $8 = 2^3$ nodes is given in Figure 6, where the broadcast scheme can be described in light of the topology of a $Q_3$. Nodes exchange messages via dimension 3 first, then dimension 2, and then dimension 1. For illustration, the information collected by node 001 thus far is shown in the bracket next to 001. All nodes have received all id's (marked black) after 3 steps.

Figure 6: Optimal all-to-all broadcast for a system of 8 nodes. [Information at node 001: {001,101} after step 1 and {001,101,011,111} after step 2.]

To show the operations of G for a system whose number of nodes is not a power of two, consider a system of 6 nodes. Under the addressing scheme shown in Figure 4 and the operations of algorithm G, the 3 steps of the message passing are shown in Figure 7. It can be verified that the broadcast is completed in 3 steps and the total number of messages sent is 6 + 6 + 4 = 16. Note that a node with an address of the form $b_n b_{n-1} \ldots b_2 *$, such as node 01* in Figure 7, determines its message receiver by setting * to 0 and inverting the appropriate bit as described in algorithm G, so that each node sends out one message at a time (see footnote 4).

Figure 7: Optimal all-to-all broadcast for a system of 6 nodes. [Step 1: 6 messages, after which nodes 000 and 100 hold {000,100}, nodes 001 and 101 hold {001,101}, and nodes 01* and 11* hold {01*,11*}. Step 2: 6 messages, after which nodes 000 and 100 hold {000,100,01*,11*}. Step 3: 4 messages.]

Footnote 4: The purpose of setting * to 0 is mainly to provide a systematic procedure. It can be verified that algorithm G is also valid if a node such as $b_n b_{n-1} \ldots b_2 *$ determines its message receiver by setting * to 1 and inverting the appropriate bit accordingly.

Define a minimal complete set of nodes as a minimal set of nodes that together hold all information. For example, after step 1 in Figure 7, the node set {000,001,01*} is a minimal complete set since the nodes in the set have enough information to complete the broadcast, whereas {000,01*} is not, since it is not complete, nor is {000,001,01*,100}, since it is not minimal. As will become clear later, the number labeled in each internal node in level i of the balanced binary partitioning tree denotes not only the number of nodes in the corresponding minimal complete set, but also the number of messages sent by the nodes in that minimal complete set in step i of algorithm G. Using the concept of minimal complete sets, we obtain the following lemma for algorithm G.

Lemma 1: After step i of algorithm G, there are $2^i$ minimal complete sets, where $2^i$ is the number of nodes in level i of the balanced binary partitioning tree, for $0 \le i \le n-1$. Specifically, they are formed by nodes corresponding to the subcubes with the addresses $b_n b_{n-1} \ldots b_{n-i+1} * \ldots *$, where $b_j \in \{0, 1\}$ for $n-i+1 \le j \le n$.

Proof: When i = 0, i.e., before the operations in algorithm G begin, the minimal complete set is the set that contains all nodes in the system. In the execution of step 1, each node, say $b_n b_{n-1} \ldots b_1$, sends a message to its n-th neighbor, $\bar{b}_n b_{n-1} \ldots b_1$, in the $Q_n$. Thus, nodes in the two $Q_{n-1}$'s, $1 * \ldots *$ and $0 * \ldots *$, form two minimal complete sets, respectively, since after step 1, node $b_n b_{n-1} \ldots b_1$ already contains the information that both $b_n b_{n-1} \ldots b_1$ and $\bar{b}_n b_{n-1} \ldots b_1$ had in step 0. It can be verified from the addressing scheme of the balanced binary partitioning tree that the two subcubes, $1 * \ldots *$ and $0 * \ldots *$, are associated with the two child nodes of the root. Then, from the fact that nodes exchange messages via dimension n + 1 − i in step i, it follows that a minimal complete set associated with a $Q_{n-i+1}$ in level i − 1 is partitioned into two minimal complete sets associated with two $Q_{n-i}$'s in level i, thus proving this lemma by induction. Q.E.D.

The fact that each node in level i of the balanced binary partitioning tree is associated with a minimal complete set after step i of algorithm G can be seen in Figure 4 and Figure 7. Then, we have the following theorem.

Theorem 1: For a system of p nodes with one-port communication, algorithm G completes an all-to-all broadcast in n steps by incurring $np + p - 2^n$ messages, where $n = \lceil \log_2 p \rceil$.

Proof: From Lemma 1, we know that after $\lceil \log_2 p \rceil$ steps every node forms a minimal complete set by itself, meaning that all-to-all broadcast is complete. To determine the total number of messages incurred, consider the balanced binary partitioning tree of the number p. First, from the proof of Lemma 1, it can be seen that the number labeled in each node of the tree is the number of processing nodes (see footnote 5) in the corresponding minimal complete set. Next, we claim that for a minimal complete set of k nodes to be partitioned in one step into two minimal complete sets of cardinalities $k_1$ and $k_2$, where $k = k_1 + k_2$, each of the k nodes has to send out in that step a message containing the information it has collected thus far. Note that in a minimal complete set, the information of each node is not contained in that of any other node, meaning that if a node, say Ni, does not send out a message in that step, Ni should be included in both minimal complete sets, leading to a contradiction to $k_1 + k_2 = k$, thus proving this claim. It follows that the total number of messages sent in algorithm G is the sum of the numbers labeled in the internal nodes in the corresponding partitioning tree.

Since a number k is partitioned into $\lceil k/2 \rceil$ and $\lfloor k/2 \rfloor$ in the partitioning tree, all the leaf nodes are in either level n − 1 or level n. Then, the numbers labeled in the nodes in level n − 1 must be either 2's (labeled in non-leaf nodes) or 1's (labeled in leaf nodes), and the sum of those numbers is p, meaning that the number of 2's in level n − 1 is $p - 2^{n-1}$, since the number of nodes in level n − 1 is $2^{n-1}$. From this fact and the fact that the sum of the numbers labeled in the nodes in each level i, for $0 \le i \le n-2$, is p, it follows that the sum of the numbers labeled in the internal nodes in the balanced binary partitioning tree of p is $p(n-1) + 2(p - 2^{n-1}) = np + p - 2^n$.

Footnote 5: Processing nodes mean the computing nodes in the distributed system, and should not be confused with the nodes in a partitioning tree.

Q.E.D.
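The counting argument in the proof of Theorem 1 can be checked numerically. The sketch below, with function names chosen here for illustration, sums the labels of the internal nodes of the balanced binary partitioning tree and compares the result with the closed form $np + p - 2^n$.

```python
def internal_label_sum(p):
    """Sum of the labels of all internal nodes of the balanced tree of p."""
    if p < 2:
        return 0  # a node labeled 1 is a leaf and sends nothing
    # The node labeled p is internal; recurse into ceil(p/2) and floor(p/2).
    return p + internal_label_sum((p + 1) // 2) + internal_label_sum(p // 2)


def theorem1_messages(p):
    """Closed form np + p - 2^n with n = ceil(log2 p)."""
    n = (p - 1).bit_length()
    return n * p + p - 2 ** n


for p in (5, 6, 8):
    print(p, internal_label_sum(p), theorem1_messages(p))
```

For p = 6 both functions give 16, and for p = 8 both give 24, the message counts of the 6-node and 8-node examples.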

It can be verified that the number of messages for the example in Figure 6 is $24 = 3 \times 8 + 8 - 8$, and that for the example in Figure 7 is $16 = 3 \times 6 + 6 - 8$, agreeing with Theorem 1. Note that it takes at least $\lceil \log_2 p \rceil$ steps for one-to-all broadcast in a system of p nodes, since with one-port communication the number of nodes holding a given id can at most double in each step. This fact leads to the following proposition.

Proposition 1: In a system of p nodes with one-port communication, the minimal number of steps required for all-to-all broadcast is $\lceil \log_2 p \rceil$.

Hence, from the above proposition and Theorem 1, we have the following corollary.

Corollary 1.1: In a system of p nodes with one-port communication, algorithm G requires the minimal number of steps, $\lceil \log_2 p \rceil$, to complete an all-to-all broadcast.

It can be verified that in a system of p nodes, for one minimal complete set to be partitioned into two in one step, the total number of id's sent in all messages incurred in that step is p. Then, as stated in Corollary 1.2 below, the total number of id's sent in all messages incurred by algorithm G is $p(p-1)$, since there are p − 1 internal nodes in a balanced binary partitioning tree. This agrees with the fact that algorithm G is a NODUP scheme.

Corollary 1.2: In a system of p nodes with one-port communication, the total number of id's sent in all messages incurred by algorithm G is $p(p-1)$.

Next, we have the following theorem, which states that algorithm G is optimal in terms of the number of messages required for all-to-all broadcast in the minimal number of steps.

Theorem 2: For a system of p nodes with one-port communication, the minimal number of messages required for all-to-all NODUP broadcast in n steps is $np + p - 2^n$, where $n = \lceil \log_2 p \rceil$.

Proof: From the facts that the schemes are without duplicate information and that every node sends all the id's it has thus far to its receiver, it follows that optimal all-to-all broadcast schemes can be described by the generation of minimal complete sets resulting from the process of broadcast. Such a generation of minimal complete sets can be denoted by a binary partitioning tree. As pointed out in the proof of Theorem 1, the number labeled in each internal node of the partitioning tree is the number of nodes in the corresponding minimal complete set. Also, the number of nodes in a minimal complete set, say h, is the number of messages to be sent in the next step so that the broadcast can be completed by the nodes within the set in $\lceil \log_2 h \rceil$ steps. It in turn follows that the sum of the numbers labeled in the internal nodes of the partitioning tree is the total number of messages required for the all-to-all broadcast. Then, the problem of determining the minimal number of messages required in an all-to-all broadcast scheme can be transformed to that of determining the corresponding binary partitioning tree for which the sum of the numbers labeled in the internal nodes is minimal.

Note that such a binary tree can be constructed by the Huffman algorithm [18], which starts with p ones, and then repeatedly adds the two smallest numbers together and uses their sum to replace the two numbers. The binary tree produced by the Huffman algorithm is called the Huffman tree. An example of the Huffman tree for 6 ones is given in Figure 8. It has been proved that the sum of the numbers labeled in the internal nodes of the Huffman tree is minimal among all the binary trees with the same set of leaf nodes. Also, all the leaf nodes in a Huffman tree, labeled by ones, must be in either level $\lceil \log_2 p \rceil - 1$ or level $\lceil \log_2 p \rceil$ [18], implying that the sum of the numbers labeled in the internal nodes is $np + p - 2^n$, where $n = \lceil \log_2 p \rceil$. This theorem follows. Q.E.D.
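The Huffman construction used in the proof can be sketched with Python's standard `heapq` module. The function name `huffman_internal_sum` is an assumption of this illustration. Starting from p ones, the two smallest numbers are merged repeatedly and every merged sum, which is the label of a new internal node, is accumulated.

```python
import heapq


def huffman_internal_sum(p):
    """Sum of the internal-node labels of the Huffman tree built from p ones."""
    heap = [1] * p
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)  # smallest remaining number
        b = heapq.heappop(heap)  # second smallest
        total += a + b           # label of the newly created internal node
        heapq.heappush(heap, a + b)
    return total


print(huffman_internal_sum(6))
```

For p = 6 the merged sums are 2, 2, 2, 4 and 6, whose total is 16, the same value as $np + p - 2^n$ with n = 3.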

Note that the formula in Theorem 2 agrees with the lower bound of message complexity, $O(p \log_2 p)$, derived in [26], where, however, the minimal number of messages required was not determined. Theorem 1 and Theorem 2 lead to the following corollary.

Corollary 2.1: For a system of p nodes with one-port communication, algorithm G requires the minimal number of messages, $np + p - 2^n$, to complete an all-to-all broadcast in n steps, where $n = \lceil \log_2 p \rceil$.

Figure 8: The Huffman tree constructed from 6 ones. [Tree labels by level: level 0: 6; level 1: 2, 4; level 2: 1, 1, 2, 2; level 3: 1, 1, 1, 1.]

4 Optimal All-To-All Broadcast for k-Port Communication

As presented in algorithm G, the balanced binary partitioning tree can be used to describe optimal all-to-all broadcast for one-port communication. In fact, our scheme, based on the partitioning tree and the generation of minimal complete sets, can be extended to the case of k-port communication, meaning that each node is capable of sending k messages at a time. As can be seen below, the extension to k-port communication can be described in light of the generalized n-dimensional m-ary hypercube [2], where m is chosen to be k + 1. Using the product operation in Definition 1, a generalized n-dimensional m-ary hypercube, denoted by $H_n^m$, can be defined as follows.

Definition 5: An n-dimensional m-ary hypercube $H_n^m$ is defined recursively as follows.

(i). $H_0^m$ is a trivial graph with one node, and
(ii). $H_n^m = K_m \times H_{n-1}^m$, where $K_m$ is the complete graph with m nodes.
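As an illustration of Definition 5, the sketch below builds $H_n^m$ in Python. Instead of unrolling the recursion, it uses the equivalent characterization that two nodes of a product of complete graphs are adjacent exactly when their addresses differ in one digit. The function name `generalized_hypercube`, the string addresses and the restriction to m ≤ 10 (single-digit symbols) are assumptions made here.

```python
from itertools import product


def generalized_hypercube(n, m):
    """Build H_n^m as a dict from n-digit base-m addresses to neighbor lists.

    Two nodes are adjacent when their addresses differ in exactly one digit,
    which is what the product K_m x H_{n-1}^m amounts to.
    """
    digits = "0123456789"[:m]  # assumes m <= 10
    graph = {}
    for addr in product(digits, repeat=n):
        node = "".join(addr)
        graph[node] = [
            node[:i] + d + node[i + 1:]   # replace digit i by another symbol
            for i in range(n)
            for d in digits
            if d != node[i]
        ]
    return graph


h = generalized_hypercube(2, 3)
print(len(h), sorted(h["00"]))
```

Each node has n(m − 1) neighbors, m − 1 in every dimension, which is why m = k + 1 matches k-port communication: a node can reach all its neighbors in one dimension in a single step.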

An example of $H_2^3$ can be found in Figure 9, where the edges in dimension 2 and dimension 1 are drawn in Figure 9a and Figure 9b, respectively. It can be verified that Definition 1 is a special case of Definition 5 when m = 2, and in fact $Q_n = H_n^2$. In light of the generalized hypercubes, the scheme in Section 3 can be extended to the case of multi-port communication by modifying the partitioning tree and the addressing scheme accordingly. For example, for the case of 2-port communication in a system of 9 nodes, which corresponds to the case of $H_2^3$, we have the 3-ary partitioning tree shown in Figure 10, where Step 2 of the addressing scheme A1 is modified as below.

Step 2′: For every internal node, code the edge to its left child with a bit "0", that to its center child with a bit "1", and that to its right child with a bit "2".

From the same reasoning as in the proof of Lemma 1, it follows that all-to-all broadcast for 2-port communication can be developed from the above partitioning tree in such a way that each internal node in level i of the tree is taken as a minimal complete set generated after step i, and the minimal complete set associated with an internal node is partitioned into those associated with its child nodes in one step. For example, for a system of 9 nodes with 2-port communication, the operations of an all-to-all broadcast are shown in Figure 9. It can be verified from Figure 9 and Figure 10 that after step 1, all 9 nodes in the system (addressed by ** in Figure 10) are partitioned into three minimal complete sets, formed by nodes in 0*, 1* and 2*, respectively. Clearly, the above scheme based on the generation of minimal complete sets in the corresponding partitioning tree can be generalized to the case of k-port communication. An algorithm to build the corresponding optimal partitioning tree will be presented later.

Figure 9: The all-to-all broadcast scheme for 2-port communication when p = 9. [Panels: (a) step 1, (b) step 2, each drawn on the nine nodes 00, 01, 02, 10, 11, 12, 20, 21, 22.]

Figure 10: The partitioning tree and addressing scheme for 2-port communication when p = 9. [Level 0: root labeled 9 with address **; level 1: three nodes labeled 3 with addresses 0*, 1* and 2*, reached by edges 0, 1 and 2; level 2: nine leaves labeled 1.]

Let $N(p, k)$ be the number of message steps required by our scheme for all-to-all broadcast in a system of p nodes with k-port communication. It can be observed that the recursion $N(p, k) = 1 + N(\lceil p/(k+1) \rceil, k)$ holds, where $N(a, b) = 1$ if $a \le b + 1$, leading to $N(p, k) = \lceil \log_{k+1} p \rceil$. Note that it takes at least $\lceil \log_{k+1} p \rceil$ steps for one-to-all broadcast in a system of p nodes with k-port communication. This fact and the existence of our scheme lead to the following proposition, which was also proved by a different approach in [15], where no attempt was made to minimize the number of messages.

Proposition 2: In a system of p nodes with k-port communication, the minimal number of steps required for all-to-all broadcast is $\lceil \log_{k+1} p \rceil$.

It can be seen that the height of the corresponding partitioning tree for an all-to-all broadcast determines the number of message steps required. Thus, to complete the broadcast in the minimal number of steps, the partitioning tree must have the minimal height. Note that for a system of p nodes with k-port communication, there can be different partitioning trees with the same minimal height, i.e., $\lceil \log_{k+1} p \rceil$, whereas the corresponding numbers of messages incurred may differ from one to another. Recall that the total number of messages sent in algorithm G is the sum of the numbers labeled in the internal nodes of the balanced binary partitioning tree. Call the number of child nodes of an internal node z the degree of z, denoted by $d_s(z)$, and denote the number labeled in z by $w(z)$. Then, we have the lemma below, which follows from the fact that for a minimal complete set to be partitioned into r minimal complete sets in one step, each node in the original minimal complete set has to send out r − 1 messages in that step.

Lemma 2: The number of messages incurred in an all-to-all broadcast scheme is $\sum_{z \in V_I} w(z)(d_s(z) - 1)$, where $V_I$ is the set of internal nodes in the corresponding partitioning tree.
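The sum in Lemma 2 can be evaluated mechanically on any partitioning tree. The following is a minimal sketch, assuming a nested-list encoding (a leaf is the integer 1, an internal node is the list of its child subtrees); the helper names `star` and `lemma2` are illustrative only. The tree shapes are assumptions inferred from the sums the text quotes for Figures 4 and 11, not copied from the figures themselves.

```python
# A partitioning tree as nested lists: a leaf is the integer 1,
# an internal node is the list of its child subtrees (assumed encoding).

def star(r):
    """Internal node whose r children are all leaves, so w = d_s = r."""
    return [1] * r


def lemma2(tree):
    """Return (w, m): the label w(z) at the root of `tree` and the Lemma 2
    sum of w(z) * (d_s(z) - 1) over all internal nodes z of `tree`."""
    if not isinstance(tree, list):       # a leaf has label 1 and sends nothing
        return 1, 0
    w = m = 0
    for child in tree:
        child_w, child_m = lemma2(child)
        w += child_w                     # label = sum of the children's labels
        m += child_m
    return w, m + w * (len(tree) - 1)    # this node adds w(z) * (d_s(z) - 1)


# Shapes inferred from the sums quoted in the text (assumptions).
fig4 = [[star(2), 1], [star(2), 1]]             # 6*1 + 3*1 + 3*1 + 2*1 + 2*1
nine_a = [star(3), star(3), star(3)]            # 9*2 + 3*2*3
nine_b = [star(3), star(2), star(2), star(2)]   # 9*3 + 2*3 + 3*2
eight_a = [star(2)] * 4                         # 8*3 + 2*4
eight_b = [star(4), star(4)]                    # 8 + 4*3*2
fig11a = [eight_a, eight_b, nine_a, nine_b]
fig11b = [[star(4), star(4), star(3)],
          [star(4), star(4), star(3)],
          [star(4), star(4), star(4)]]

for name, tree in [("fig4", fig4), ("nine_a", nine_a), ("nine_b", nine_b),
                   ("eight_a", eight_a), ("eight_b", eight_b),
                   ("fig11a", fig11a), ("fig11b", fig11b)]:
    print(name, lemma2(tree))
```

Under these assumed shapes the sketch reproduces the totals stated in the text: 16 for the six-node binary tree, 36 and 39 for the two subtrees labeled 9, 32 for both subtrees labeled 8, and 241 versus 232 for the two trees with p = 34.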

For example, the number of messages required for the broadcast corresponding to Figure 4 is 6*1+3*1+3*1+2*1+2*1=16, which agrees with Figure 7, and the number of messages required for the broadcast corresponding to Figure 10 is 9*2+3*2*3=36, agreeing with Figure 9. Figure 11 shows two partitioning trees for p = 34 and k = 3, which have the same height but incur different numbers of messages. From Lemma 2, it can be verified that the number of messages associated with the tree in Figure 11a is 241, while that associated with the tree in Figure 11b is 232. It can also be seen that in Figure 11a, while the two subtrees under the two 9's incur different numbers of messages (9*2+3*2*3 ≠ 9*3+2*3+3*2), the two subtrees under the two 8's, in spite of their different structures, incur the same number of messages (8*3+2*4 = 8+4*3*2). The problem of determining the minimal number of messages required in an all-to-all broadcast scheme can then be transformed into that of determining the corresponding partitioning tree for which the sum given by Lemma 2 is minimal. Consequently, the following theorem is derived to solve this problem.

Theorem 3: For a system of p nodes with k-port communication, the minimal number of messages required for an all-to-all NODUP broadcast in $n = \lceil \log_{k+1} p \rceil$ steps is

$M(p, k) = (d-2) n_1 p + (d-1)[n_2 p + p - (d-1)^{n_1} d^{n_2}]$,

where $n_1 + n_2 = n = \lceil \log_{k+1} p \rceil$, and d is the smallest positive integer such that $p \le (d-1)^{n_1} d^{n_2}$ and $p > (d-1)^{n_1+1} d^{n_2-1}$.

Proof: As in the case of one-port communication, the optimal scheme can be described by a partitioning tree since there is no duplicate information in each transmission. Also, we learn from Proposition 2 that such an optimal tree is of height $\lceil \log_{k+1} p \rceil$. It is easy to see from Lemma 2 that in an optimal partitioning tree, each subtree, say under node z, is the optimal partitioning tree for a system of $w(z)$ nodes. We shall prove that an optimal tree, denoted by T(p,k), possesses the following two properties:

(A) All leaf nodes in T(p,k) are in the same lowest level, i.e., level $\lceil \log_{k+1} p \rceil$, except when T(p,k) is a binary tree.

(B) The set of distinct degrees of all internal nodes in T(p,k), denoted by $D_S$, contains either a single number or two consecutive numbers.

Property (B) means that for any two internal nodes in T(p,k), their degrees must either be the same or differ by one. From the above two properties, M(p,k) in Theorem 3 can then be derived by applying some algebraic operations.

We shall prove Property (A) by showing that if there is a leaf node y not in the lowest level of an optimal tree, then the tree must be binary and all the internal nodes in the same level as y are labeled with 2's, agreeing with Theorem 2. The fact that, except for binary trees, all leaf nodes of an optimal tree must be in the same lowest level thus follows. Call two nodes siblings of each other if they have the same parent node. Suppose there is a leaf node y which is not in the lowest level. Then, we know that for T(p,k) to be optimal, the number labeled in any sibling of y cannot be greater than 2, since if y has a non-leaf sibling, say $y_A$, with $w(y_A) > 2$, then the message number can be reduced by moving one unit from $y_A$ to y, i.e., $w(y_A)$ is reduced by one and y becomes an internal node with $w(y) = 2$. For illustrative purposes, example subtrees are given in Figure 12, where the message number associated with the subtree in (b) is less than that in (a). Denote the parent node of y as x. It can be observed that the degree of x must be two, since if $d_s(x)$ is greater than two, then from the fact that the degrees of all siblings of y are at most two, the message number can be reduced by rearranging the subtree under x in such a way that $d_s(x)$ becomes 2. From $d_s(x) = 2$, it in turn follows that all the internal nodes which are in the same level and under the same grandparent as y must be labeled with 2's, since if, among them, there is an internal node $y_B$ with $w(y_B) > 2$, then the message number can be reduced by moving one unit from $y_B$ to y. By the same reasoning, it can be seen that the degrees of all the internal nodes under the grandparent of y must be 2, showing by induction that the tree is binary. Property (A) thus follows.

Figure 11: The partitioning trees when p = 34 and k = 3. (a) One partitioning tree. (b) The optimal partitioning tree.

Figure 12: Example subtrees to illustrate Property (A).

To prove Property (B), we shall investigate the degrees of internal nodes in an optimal tree T(p,k) from the bottom up. Recall that the root of the tree is in level 0. Without loss of generality, we start by investigating a set of nodes in level $\lceil \log_{k+1} p \rceil - 1$ which are under the same parent node, say x. If there are any two nodes $z_1$ and $z_2$ under x with $w(z_1) \ge w(z_2) + 2$, it can be verified that such a tree is not optimal, since the message number can be reduced by moving one unit from $z_1$ to $z_2$. Note that

$(w(z_1) - 1)(w(z_1) - 2) + (w(z_2) + 1) w(z_2) < w(z_1)(w(z_1) - 1) + w(z_2)(w(z_2) - 1)$ for $w(z_1) \ge w(z_2) + 2$,

where $w(z_1) = d_s(z_1)$ and $w(z_2) = d_s(z_2)$ since $z_1$ and $z_2$ are parent nodes of leaf nodes. Indeed, the right-hand side exceeds the left-hand side by $2(w(z_1) - w(z_2) - 1)$, which is positive under this condition. Then, let d and $d - 1$ be the two possible degrees for any internal node under x. We claim that $d_s(x) \in \{d, d - 1\}$.

First, if $d_s(x) \ge d + 1$, then we can select a child node of x, say z, with $d_s(z) = d$, and detach z from x by distributing the d child nodes of z, one-to-one, among d siblings of z, so that each of the d selected siblings of z adopts one child from z. Clearly, since $d_s(x) \ge d + 1$, z must have at least d siblings. The numbers labeled in the nodes affected are modified accordingly. It can be verified by Lemma 2 that such a movement will reduce the message number, contradicting the optimality of T(p,k) and implying that $d_s(x) \le d$. Next, if $d_s(x) \le d - 2$, then we can take one child node out of each of the $d_s(x)$ existing child nodes of x, attach them under a new node, and arrange that node as a new child node of x. $d_s(x)$ is thus increased by one. It can be verified that such a movement will also reduce the message number determined by Lemma 2, and the claim that $d_s(x) \in \{d - 1, d\}$ thus follows.

Let $D_s(x)$ be the set of distinct degrees for all internal nodes under x. We have thus proved $D_s(x) \subseteq \{d - 1, d\}$. Note that the technique used above to form and decompose a subtree can also be applied to the internal nodes which are under the same grandparent. Similarly, we can obtain that for each sibling of x, say h, $D_s(h) \subseteq \{d - 1, d\}$, and then $D_s(v) \subseteq \{d - 1, d\}$ where v is the parent node of x, proving that $D_S \subseteq \{d - 1, d\}$ by induction. Property (B) thus follows.

From Properties (A) and (B), it can be seen that to generate p leaf nodes in level $\lceil \log_{k+1} p \rceil$ from the number p and minimize the message number, d has to be the smallest positive integer such that $p \le (d-1)^{n_1} d^{n_2}$ and $p > (d-1)^{n_1+1} d^{n_2-1}$, where $n_1 + n_2 = n = \lceil \log_{k+1} p \rceil$. From Lemma 2, it follows that an optimal partitioning tree can be constructed by first having $n_1$ levels of internal nodes with degree $d - 1$ (i.e., from level 0 to level $n_1 - 1$), and then $n_2 - 1$ levels of internal nodes with degree d (i.e., from level $n_1$ to level $n - 2$), followed by, in level $n - 1$, $(d-1)^{n_1} d^{n_2} - p$ internal nodes with degree $d - 1$ and $p - (d-1)^{n_1+1} d^{n_2-1}$ internal nodes with degree d. Then, we get

$M(p, k) = n_1 p (d-2) + (n_2 - 1) p (d-1) + [(d-1)^{n_1} d^{n_2} - p](d-1)(d-2) + [p - (d-1)^{n_1+1} d^{n_2-1}] d (d-1) = (d-2) n_1 p + (d-1)[n_2 p + p - (d-1)^{n_1} d^{n_2}]$,

thus proving Theorem 3. Q.E.D.

Therefore, to determine the minimal number of messages required for an all-to-all broadcast in a system of p nodes with k-port communication, the corresponding optimal partitioning tree can be obtained as follows, where $n_1 + n_2 = n = \lceil \log_{k+1} p \rceil$, and d is the smallest positive integer such that $p \le (d-1)^{n_1} d^{n_2}$ and $p > (d-1)^{n_1+1} d^{n_2-1}$.

Algorithm to build the optimal partitioning tree

Step 1: Build a tree from level 0 to level $n_1 - 1$ in such a way that each node is an internal node and has degree $d - 1$.

Step 2: In the next $n_2 - 1$ levels (i.e., from level $n_1$ to level $n - 2$), let each internal node have degree d.

Step 3: In the last level of internal nodes (i.e., level $n - 1$), let $(d-1)^{n_1} d^{n_2} - p$ internal nodes have degree $d - 1$, and $p - (d-1)^{n_1+1} d^{n_2-1}$ internal nodes have degree d.

Step 4: In level n, attach leaf nodes to the internal nodes in level $n - 1$, according to the degree of each internal node in that level.

Step 5: Label each leaf node with one, and determine the number labeled in each internal node in the tree bottom up, such that the number labeled in each node is the sum of those labeled in its child nodes.
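The five steps above can be sketched in code. This is an illustrative sketch rather than the authors' implementation: it assumes p >= 2 and k >= 1, encodes a tree as nested lists (a leaf is the integer 1), and finds d, n1 and n2 by a simple search over the two inequalities of Theorem 3. All function names are hypothetical. The algorithm does not say which level n-1 nodes receive degree d-1, so the sketch places them last; this arbitrary choice may give a tree shaped differently from Figure 11b, but every level still sums to p, so the Lemma 2 total is the same.

```python
def min_steps(p, k):
    """Smallest n with (k+1)**n >= p, i.e. ceil(log_{k+1} p), in integers."""
    n, reach = 0, 1
    while reach < p:
        reach *= k + 1
        n += 1
    return n


def tree_parameters(p, k):
    """Return (n, d, n1, n2) satisfying the inequalities of Theorem 3.
    Assumes p >= 2 and k >= 1."""
    n = min_steps(p, k)
    d = 2
    while d ** n < p:                              # smallest d with d**n >= p
        d += 1
    n2 = 1
    while (d - 1) ** (n - n2) * d ** n2 < p:       # smallest n2 that still covers p
        n2 += 1
    return n, d, n - n2, n2


def build_optimal_tree(p, k):
    """Steps 1-4: nested-list tree whose leaves are the integer 1."""
    n, d, n1, n2 = tree_parameters(p, k)
    low = (d - 1) ** n1 * d ** n2 - p                  # level n-1 nodes of degree d-1
    high = p - (d - 1) ** (n1 + 1) * d ** (n2 - 1)     # level n-1 nodes of degree d
    last_level = iter([d] * high + [d - 1] * low)      # placement order is arbitrary

    def build(level):
        if level == n - 1:                             # Steps 3 and 4
            return [1] * next(last_level)
        degree = d - 1 if level < n1 else d            # Steps 1 and 2
        return [build(level + 1) for _ in range(degree)]

    return build(0)


def label_and_count(tree):
    """Step 5 plus Lemma 2: return (label of the root, number of messages)."""
    if not isinstance(tree, list):
        return 1, 0
    w = m = 0
    for child in tree:
        child_w, child_m = label_and_count(child)
        w += child_w
        m += child_m
    return w, m + w * (len(tree) - 1)


print(tree_parameters(34, 3))                       # (n, d, n1, n2)
print(label_and_count(build_optimal_tree(34, 3)))   # (label of root, messages)
```

For p = 34 and k = 3 the sketch yields n = 3, d = 4, n1 = 2, n2 = 1 and a tree with 34 leaves whose Lemma 2 sum is 232.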

For example, consider all-to-all broadcast in a system of 34 nodes with 3-port communication. Then, we have p = 34, k = 3 and $n_1 + n_2 = n = 3$, leading to d = 4, $n_1 = 2$ and $n_2 = 1$. We can obtain the optimal partitioning tree in Figure 11b by the algorithm above. It follows from Theorem 3 that M(34,3) = 232, meaning that the number of messages required by the partitioning tree in Figure 11b is in fact the minimal one for completing the broadcast in 3 steps. Note that from an optimal partitioning tree obtained by Theorem 3, we can determine the address of each leaf node in the tree in light of Hnd. As in Section 3, the addresses of leaf nodes are then mapped one-to-one onto the nodes in the system. Using this addressing scheme, each node in the system can determine its message receiver in each communication step in such a way that all-to-all broadcast follows the generation of minimal complete sets in the partitioning tree.

It is interesting to see that to achieve the minimal number of messages in the minimal number of steps, it is not always necessary to use the maximal number of communication ports allowed. This is the very reason that d determined in Theorem 3 is not necessarily equal to $k + 1$. For the example of p = 20 and k = 3, the minimal number of steps is $3 = \lceil \log_4 20 \rceil$. However, from Theorem 3 we have d = 3, meaning that to minimize the message number in 3 communication steps, each node uses at most 2 communication ports in every step during the execution of the optimal scheme. Also, it can be verified that Theorem 2 is in fact a special case of Theorem 3. For the case of one-port communication, we have d = 2 and then $M(p, 1) = n_2 p + p - 2^{n_2}$ where $n_2 = \lceil \log_2 p \rceil$, agreeing with Theorem 2. To the best of our knowledge, the minimal number of messages derived in Theorem 3, together with its special case in Theorem 2, was previously unknown, and is first determined in this study.

It can be seen that the all-to-all broadcast scheme with k-port communication introduced in this section is a NODUP scheme. Specifically, we have the following corollary.

Corollary 3.1: For a system of p nodes with k-port communication, the total number of id's carried by all messages incurred in our all-to-all broadcast scheme is $p(p - 1)$, which is the minimal number required for all-to-all broadcast schemes.

Proof: Note that for a partitioning tree of p, $\sum_{z \in V_I} d_s(z) = |V_I| + p - 1$, where $V_I$ is the set of internal nodes and $d_s(z)$ is the degree of node z in the tree. Then, we have $\sum_{z \in V_I} (d_s(z) - 1) = p - 1$. Also, for a minimal complete set associated with node z to be partitioned into $d_s(z)$ minimal complete sets in one step, $p(d_s(z) - 1)$ id's have to be sent in all messages incurred in that step. The fact that the total number of id's sent in all messages incurred in our all-to-all broadcast scheme is $p(p - 1)$ thus follows. It can be seen that in any all-to-all broadcast scheme, every node needs to receive $p - 1$ id's in a system of p nodes, proving this corollary. Q.E.D.
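The closed form of Theorem 3 and the numerical examples discussed in this section can be checked directly. The sketch below is an illustration under stated assumptions (p >= 2, k >= 1); the function names are hypothetical, and the search for d, n1 and n2 is one straightforward way to satisfy the two inequalities of the theorem, not a procedure given in the paper.

```python
def min_steps(p, k):
    """ceil(log_{k+1} p), computed with integers to avoid rounding errors."""
    n, reach = 0, 1
    while reach < p:
        reach *= k + 1
        n += 1
    return n


def tree_parameters(p, k):
    """Return (n, d, n1, n2) with n1 + n2 = n, p <= (d-1)**n1 * d**n2 and
    p > (d-1)**(n1+1) * d**(n2-1). Assumes p >= 2 and k >= 1."""
    n = min_steps(p, k)
    d = 2
    while d ** n < p:
        d += 1
    n2 = 1
    while (d - 1) ** (n - n2) * d ** n2 < p:
        n2 += 1
    return n, d, n - n2, n2


def min_messages(p, k):
    """M(p, k) from Theorem 3."""
    n, d, n1, n2 = tree_parameters(p, k)
    return (d - 2) * n1 * p + (d - 1) * (n2 * p + p - (d - 1) ** n1 * d ** n2)


# p = 34, k = 3: three steps, d = 4, and M(34, 3) = 232 as stated in the text.
print(tree_parameters(34, 3), min_messages(34, 3))

# p = 20, k = 3: d = 3, so at most d - 1 = 2 of the 3 ports are used per step.
n, d, n1, n2 = tree_parameters(20, 3)
print(n, d, "ports used per step:", d - 1)

# One-port case: Theorem 3 reduces to n*p + p - 2**n with n = ceil(log2 p).
for p in range(2, 65):
    n = min_steps(p, 1)
    assert min_messages(p, 1) == n * p + p - 2 ** n
```

Because the search always returns d <= k + 1, the value d - 1 gives the number of ports a node actually needs per step, which can be smaller than k, as in the p = 20 example.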

5 Remarks

It is worth mentioning that, as with other optimization problems whose solutions depend closely on the models assumed, the optimal schemes derived in this paper are results for the model M described in Section 2. Specifically, the numbers of messages in the proposed schemes are proved minimal among all NODUP schemes. Clearly, without being restricted to NODUP schemes, and at the cost of having more id's transmitted in the broadcast, one may further reduce the number of messages required. Note that we exclude neither the possibility of two-way transmission between two nodes nor the capability of each node to participate in both sending and receiving messages in one communication step, thus distinguishing our work in this paper from that in [23]. Also, we do not assume that nodes in the system will be faulty or maliciously send wrong messages to others. Analyzing and improving the fault tolerance of these schemes is an important but not fully explored issue. In addition, the schemes proposed in this paper are developed under the assumption that the system is completely connected and every message takes one communication step. While their variations could provide fair performance for hypercube multicomputers, these schemes are not designed for all system interconnections. Certainly, the assumptions that the system has a predetermined topology for its interconnection and that every message may take more than one communication step are reasonable for some computing environments, and will lead to very different solutions. Last but not least, there is no restriction in our model on the number of messages each node can receive in one step. Imposing a constraint on the number of messages one node can receive in one step is an interesting direction and will be a matter of our future study.

6 Conclusion

In this paper, we developed optimal all-to-all broadcast schemes for a distributed processing system of an arbitrary number of nodes. The emphasis was on how to complete the broadcast in the system with not only the minimal number of message steps but also the minimal number of messages. The optimal all-to-all broadcast scheme for the case of one-port communication was first developed. The concept of the partitioning tree of a positive number was introduced to address the nodes in the system. Under this addressing scheme, optimal all-to-all broadcast can be systematically executed based on the generation of minimal complete sets, and completed in $\lceil \log_2 p \rceil$ steps for a distributed system of p nodes. It was proved that the number of messages incurred by the proposed scheme, $np + p - 2^n$, is the minimal number of messages required for all-to-all NODUP broadcast with one-port communication in n steps, where $n = \lceil \log_2 p \rceil$. Moreover, we extended our results to the case of k-port communication. The minimal number of messages required to complete all-to-all NODUP broadcast in the minimal number of steps, i.e., $\lceil \log_{k+1} p \rceil$ steps, was derived in Theorem 3. An algorithm to build the optimal partitioning tree was also presented. Note that we not only derived the theoretically minimal bounds for the numbers of steps and messages required, but also devised effective schemes to achieve them.

ACKNOWLEDGEMENT

The authors would like to thank J. Chen at IBM for her comments and assistance on improving the presentation of this paper.

References

[1] W. C. Athas and C. L. Seitz. Multicomputers: Message-Passing Concurrent Computers. IEEE Computer Mag., 21:9-24, August 1988.
[2] L. Bhuyan and D. P. Agrawal. Generalized Hypercube and Hyperbus Structures for a Computer Network. IEEE Transactions on Computers, C-33(4):323-333, April 1984.
[3] M.-S. Chen, K.-L. Wu, and P. S. Yu. Efficient Decentralized Consensus Protocols in a Distributed Computing System. Proceedings of the 12th International Conference on Distributed Computing Systems, pages 426-433, June 1992.
[4] Intel Corporation. iPSC/2 User's Guide. Intel Corporation, March 1988.
[5] S. B. Davidson, H. Garcia-Molina, and D. Skeen. Consistency in Partitioned Networks. ACM Computing Surveys, 17(3):341-370, September 1985.
[6] R. Dechter and L. Kleinrock. Broadcast Communications and Distributed Algorithms. IEEE Transactions on Computers, C-35(3):210-219, March 1986.
[7] P. J. Denning. Parallel Computing and its Evolution. Comm. of ACM, 29:1163-1167, December 1986.
[8] A. M. Farley. Minimal Broadcast Networks. NETWORKS, 9:313-332, 1979.
[9] J. Halpern and Y. Moses. Knowledge and Common Knowledge in a Distributed Environment. Journal of ACM, 37(3):549-587, July 1990.
[10] F. Harary. Graph Theory. Addison-Wesley, MA, 1969.
[11] S. M. Hedetniemi, S. T. Hedetniemi, and A. Liestman. A Survey of Broadcasting and Gossiping in Communication Networks. NETWORKS, 18:319-351, 1988.
[12] S. L. Johnsson and C. T. Ho. Optimum Broadcasting and Personalized Communication in Hypercubes. IEEE Transactions on Computers, C-38(9):1249-1268, September 1989.
[13] P. Kermani and L. Kleinrock. Virtual Cut-Through: A New Computer Communication Switching Technique. Computer Networks, 3:267-286, 1979.
[14] T. V. Lakshman and A. K. Agrawala. Efficient Decentralized Consensus Protocols. IEEE Transactions on Software Engineering, SE-12(5):600-607, May 1986.
[15] H. G. Landau. The Distribution of Completion Times for Random Communication in a Task Oriented Group. Bull. Math. Biophys., pages 187-201, 1954.
[16] S. Levitan. Algorithms for Broadcast Protocol Multiprocessor. Distributed Computing Systems, pages 666-671, 1982.
[17] D. A. Reed and D. C. Grunwald. The Performance of Multicomputer Interconnection Networks. IEEE Computer Mag., 20:63-73, June 1987.
[18] K. A. Ross and C. R. B. Wright. Discrete Mathematics. Prentice-Hall, NJ, 1985.
[19] C. L. Seitz. The Cosmic Cube. Comm. of ACM, 28:22-33, January 1985.
[20] A. Seress. Quick Gossiping without Duplicate Transmissions. Graphs and Combinatorics, 2:363-383, 1986.
[21] K. G. Shin. HARTS: A Distributed Real-Time Architecture. IEEE Computer, pages 25-35, May 1991.
[22] L. G. Valiant. A Scheme for Fast Parallel Communication. SIAM, Journal on Computing, 11(2):350-361, May 1982.
[23] K. N. Venkataraman, G. Cybenko, and D. W. Krumme. Simultaneous Broadcasting in Multiprocessor Networks. Proceedings of the International Conference on Parallel Processing, pages 555-558, 1986.
[24] D. B. West. Gossiping without Duplicate Transmission. SIAM, Journal on Alg. Disc. Meth., (3):418-419, 1982.
[25] C.-B. Yang, R. C. T. Lee, and W.-T. Chen. Parallel Graph Algorithms Based upon Broadcasting Communications. IEEE Transactions on Computers, 39(12):1468-1472, December 1990.
[26] S.-M. Yuan and A. K. Agrawala. A Class of Optimal Decentralized Commit Protocols. Proc. of 8th Int. Conference on Distributed Computing Systems, pages 234-241, 1988.
