Int. J. High Performance Computing and Networking, Vol. 5, No. 3, 2008


Collective operations for wide-area message-passing systems using adaptive spanning trees

Hideo Saito* and Kenjiro Taura
Department of Information and Communication Engineering, University of Tokyo, Tokyo, Japan
E-mail: [email protected]
E-mail: [email protected]
*Corresponding author

Takashi Chikayama
Department of Frontier Informatics and Department of Information and Communication Engineering, University of Tokyo, Tokyo, Japan
E-mail: [email protected]

Abstract: We propose a method for wide-area message-passing systems to perform broadcasts and reductions efficiently using latency- and bandwidth-aware spanning trees constructed at run-time. These trees are updated when processes join or leave a computation, allowing effective execution to continue. We have implemented our proposal on the Phoenix Message-Passing Library and performed experiments using 160 processors distributed across four clusters. Compared to a static Grid-aware implementation, the latency of our broadcast was within a factor of two, and the bandwidth was 82%. When some processes joined or left a computation, our broadcast temporarily performed poorly, but completed successfully even during that time.

Keywords: wide-area networks; WANs; message passing; collective operations; broadcast; reduction.

Reference to this paper should be made as follows: Saito, H., Taura, K. and Chikayama, T. (2008) 'Collective operations for wide-area message-passing systems using adaptive spanning trees', Int. J. High Performance Computing and Networking, Vol. 5, No. 3, pp.179–188.

Biographical notes: Hideo Saito is a graduate student at the University of Tokyo. His recent research interests include parallel and distributed systems and operating systems.

Kenjiro Taura is an Associate Professor at the University of Tokyo. His recent research interests include parallel and distributed systems and operating systems.

Takashi Chikayama is a Professor at the University of Tokyo. His recent research interests include parallel and distributed systems and operating systems.

1  Introduction

The recent increase in the bandwidth of Wide-Area Networks (WANs) has increased the possibilities for performing parallel computation in wide-area, distributed environments. Multiple clusters located at geographically separated sites can now provide much more processing power, memory and storage than a single cluster located at a single site can. In such an environment, however, systems designed for single Local-Area Networks (LANs) do not perform well. In particular, collective operations (e.g., broadcast and reduction) perform very poorly, because collective operations for LANs are designed under the assumption that all links have the same speed. In wide-area environments, the wide-area links are orders of magnitude slower than the local-area links, so collective operations must take link heterogeneity into account to deliver high performance.

Methods to perform collective operations efficiently in wide-area environments have been proposed in the past (Gabriel et al., 1998; Husbands and Hoe, 1998; Kielmann et al., 1999a, 2000; Karonis et al., 2000). Such methods assume that the computational environment (i.e., the processors involved in the computation and the performance of the network) is static, and rely on manually supplied topology information to achieve high performance.


However, this is undesirable because wide-area environments are dynamic: large-scale computations performed in wide-area environments will last a long time, so it must be possible to add or remove resources at run-time and to tolerate faults. Moreover, it will become harder and harder to obtain knowledge of the network topology, because it will become more and more common to use different resources upon each invocation of an application and to use resources allocated automatically by middleware.

Many methods to perform application-level multicasts using topology-aware overlay networks have been proposed in the context of Content Delivery Networks (CDNs) (Jannotti et al., 2000; Banerjee et al., 2002; Bozdog et al., 2003). However, such multicast mechanisms do not meet the needs of message passing, because they allow data loss or support only single-source multicasts (see Subsection 2.3). Thus, we propose a configuration-free method for message-passing systems to perform collective operations efficiently in the presence of joining and leaving processors. Ultimately, we plan to provide support for fluctuating network performance and for fault-tolerance, but in this paper, we only deal with the intentional joining and leaving of processes.

In our proposal, we create latency-aware spanning trees and bandwidth-aware spanning trees, and perform collective operations along the generated trees. Our collective operations satisfy the following properties:

•	they perform well when the processes taking part in the computation are not changing
•	they complete successfully even while processes are joining or leaving
•	they adapt to topology changes due to the joining or leaving of processes.

We have implemented the two spanning tree algorithms along with the broadcast and reduction operations that use the generated trees on the Phoenix Message-Passing Library (Taura et al., 2003), and performed experiments using real processors distributed over several clusters. This work was originally presented at the Grid 2005 Workshop (Saito et al., 2005). Since then, we have strengthened our proposal by performing simulations to help explain our latency-aware spanning tree algorithm.

The rest of this paper is organised as follows. We discuss related work in Section 2 and describe the Phoenix Message-Passing Library in Section 3. We then explain our proposal in Section 4 and present our experimental results in Section 5. Finally, we conclude in Section 6.

2  Related work

2.1 The collective operations of MPICH

MPICH is the most widely used implementation of the Message-Passing Interface (MPI) (Gropp et al., 1996). MPICH uses different algorithms for different collective operations, message sizes and numbers of processes, but the basic idea in most of the algorithms is to forward messages along a tree (Thakur and Gropp, 2003). Latency-aware trees are used for short messages, and bandwidth-aware trees are used for long messages. For example, short messages are broadcast along a binomial tree (Figure 1(a)). Meanwhile, a binomial tree is bandwidth-inefficient, because the same process must send the same data multiple times. Thus, long messages are broadcast by first scattering the data among all processes, and then collecting the data to all processes.

Although many implementations have extended MPICH for Grid execution, the original MPICH implementation is designed for LANs. Thus, its collective operations are optimised under the assumption that latency and bandwidth between processors are uniform. Therefore, its collective operations perform poorly in wide-area environments, where this assumption does not hold.

Figure 1  (a) Binomial tree; (b) binary tree and (c) flat tree

2.2 Wide-area collective operations

A number of projects have studied the use of topology information to perform collective operations efficiently in wide-area environments (Gabriel et al., 1998; Husbands and Hoe, 1998; Kielmann et al., 1999b, 2000; Karonis et al., 2000). Kielmann et al. (1999a) have proposed MagPIe, in which collective operations minimise the use of wide-area links by using different trees for intra-cluster and inter-cluster communication. One node called the coordinator node represents each cluster in inter-cluster communication. In a broadcast, for example, the root node sends a message to each of the coordinator nodes using a flat tree (Figure 1(c)), and then each coordinator node performs a broadcast within its cluster using a binary tree (Figure 1(b)).

While MagPIe focuses only on latency and thus short messages, Kielmann et al. (2000) have also proposed bandwidth-efficient collective communication. In this work, measured values of latency and bandwidth are given as input to the system so that the proper tree shape and message segment size can be determined from the message size.

MPICH-G2 is an MPI implementation that extends MPICH for Grid execution (Karonis et al., 2003). It provides multilevel topology-aware collective operations that perform better than earlier topology-aware approaches, such as MagPIe, that distinguish only between intra-cluster and inter-cluster communication.

Our work differs from these proposals in two ways: our work assumes that the computational environment is dynamic, and our work does not require information concerning the network to be supplied manually.

2.3 Application-level multicast for CDNs

In the context of CDNs, many methods to perform application-level multicasts using topology-aware overlay networks have been proposed (Jannotti et al., 2000; Banerjee et al., 2002; Bozdog et al., 2003). However, such multicast mechanisms do not meet the needs of message-passing applications, because they are designed for content delivery.

Banerjee et al. (2002) have proposed a low-overhead method to create and maintain a latency-aware overlay network that can be used to efficiently multicast short messages. An extension to this work that uses the latency-aware trees for long messages has also been proposed, but its performance is not as good as that of approaches that focus specifically on bandwidth – it does not actually measure the bandwidth of links, and the fanout of the tree that it uses is too large. Moreover, it allows data loss because it is designed for data stream applications, but message-passing applications cannot tolerate data loss.

Overcast efficiently multicasts large data by creating and maintaining a bandwidth-aware overlay network (Jannotti et al., 2000). It only supports single-source multicasts because it is designed for video streaming, but in message passing, every process must be able to become the root of a collective operation. Moreover, while multicast is a 1-to-N operation, message-passing systems need to support not only 1-to-N operations, but also N-to-1 operations and N-to-N operations.

3  Phoenix

We have implemented our collective operations in the context of the Phoenix Message-Passing System (Taura et al., 2003), which defines the semantics of message-passing programs in the presence of dynamically joining and leaving processes. Although this paper can mostly be understood without knowing Phoenix specifics, this section describes the salient features of Phoenix to state our problem setting, as well as our motivation, more precisely. In particular, Phoenix defines what each type of collective operation (e.g., broadcast and reduction) should accomplish when the processes participating in a computation change dynamically.

In Phoenix, messages are not addressed to processes, but to one of the virtual nodes in a virtual node name-space. This virtual node name-space is divided up among the processes by an arbitrary mapping between processes and virtual nodes. Phoenix maintains a routing table for point-to-point communication, and messages are delivered to the process to which the destination virtual node is mapped at the time of message delivery.

Applications can support the dynamic joining and leaving of processes during a computation by migrating virtual nodes. To join a computation in progress, a process connects to another process already involved in the computation and receives a set of virtual nodes. To leave a computation, a process first gives away its virtual nodes to another process, then leaves. Figure 2 demonstrates the joining of a process: before process C joins, messages addressed to virtual node 3 are delivered to process B; after process C joins and virtual node 3 migrates to process C, messages addressed to virtual node 3 are delivered to process C.

Figure 2  Joining of a process: (a) before process C joins and (b) after process C joins
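The virtual-node indirection can be pictured with a small sketch. The following Python fragment is purely illustrative: the class and method names (VirtualNodeSpace, migrate, deliver_to) are ours and not part of the Phoenix API, and it models only the mapping table, not routing or message transport; the concrete virtual node counts mirror Figure 2 only loosely.

```python
class VirtualNodeSpace:
    """Toy model of a Phoenix-style virtual node name-space."""

    def __init__(self, num_virtual_nodes, initial_process):
        # Initially, every virtual node is mapped to one process.
        self.owner = {v: initial_process for v in range(num_virtual_nodes)}

    def migrate(self, virtual_nodes, new_process):
        # A joining process receives a set of virtual nodes; a leaving
        # process gives all of its virtual nodes away in the same way.
        for v in virtual_nodes:
            self.owner[v] = new_process

    def deliver_to(self, virtual_node):
        # Delivery is resolved against the mapping at delivery time.
        return self.owner[virtual_node]


# Rough analogue of Figure 2 (the exact virtual node counts are made up):
space = VirtualNodeSpace(6, "A")          # virtual nodes 0-5 start on A
space.migrate([3, 4, 5], "B")             # nodes 3-5 are mapped to B
assert space.deliver_to(3) == "B"         # before process C joins
space.migrate([3], "C")                   # C joins and receives node 3
assert space.deliver_to(3) == "C"         # after the migration
```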

A broadcast in MPI is an operation that delivers a message to every process in a communicator. Meanwhile, a broadcast in Phoenix is an operation that delivers a message to every virtual node in a virtual node name-space. Thus, when a virtual node migrates from one process to another, a broadcast message must arrive at one of the two processes, but not both. A reduction in Phoenix is an operation that performs some mathematical operation, such as computing the sum or finding the maximum, on data mapped to all virtual nodes, and delivers the result to one virtual node. When a virtual node migrates from one process to another, one of the two processes, but not both, must contribute data to the reduction.

As Phoenix allows the processes involved in a computation to change dynamically, existing static strategies to perform collective operations efficiently need to be extended. With a static tree, processes that join after application start-up and the children of processes that leave after application start-up are not able to participate in collective operations. Thus, a dynamic strategy, such as the one we propose in this paper, is required. While our strategy is most useful in the presence of joining and leaving processes, the self-tuning property of our strategy makes it valuable even in the context of MPI, in which processes neither join nor leave.

4  Our proposal

4.1 Overview

In our proposal, we dynamically create and maintain topology-aware spanning trees, and perform broadcasts and reductions along the generated trees. It is difficult to handle short messages and long messages uniformly: the performance of short messages is determined by latency, while that of long messages is determined by bandwidth. Thus, we create latency-aware trees for short messages and bandwidth-aware trees for long messages. Moreover, each process needs to be able to become the root of a broadcast or a reduction, so we create a latency-aware tree and a bandwidth-aware tree for each process.

To create the spanning trees, each process searches for a suitable parent for each tree by measuring the latency and bandwidth between itself and other processes. Though each spanning tree is independent, performing separate measurements for every tree would be too expensive. Thus, when a process measures the latency and bandwidth between itself and another process, it uses the result to judge the suitability of that process for every tree. Latency is measured with a 1 byte message, and bandwidth is measured with a 128 kilobyte message.

Processes perform measurements and search for suitable parents not only at application start-up but also whenever processes join or leave. Measurements need to be made when a process joins, because a new tree needs to be constructed with the joining process at the root, and the joining process must find a suitable parent for each of the other trees. Measurements need to be made when a process leaves, because it may leave some of the other processes parent-less for some of the trees. No measurements are made when processes are neither joining nor leaving.

For both the latency-aware trees and the bandwidth-aware trees, each process continues performing measurements until it has found a parent for each of the trees, and then changes parents every time it finds a better parent. Though we could construct optimal trees by having each process perform measurements with every other process, this would take too much time. Thus, we aim to construct near-optimal trees by having each process perform measurements with a certain number of randomly selected processes. While increasing the number of extra measurements will yield trees that resemble the topology better, decent trees will be created if each process can select several processes within the same LAN. Therefore, each process performs measurements until it has found a parent for all of the trees, and then performs a certain number of extra measurements. In our experiments, we set the number of extra measurements to 10.

We present the parent-selection algorithm for the latency-aware trees in Subsection 4.2 and for the bandwidth-aware trees in Subsection 4.3. Then, we describe how broadcasts and reductions are performed using the generated trees in Subsections 4.4 and 4.5.
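As a rough sketch of this measurement schedule from the point of view of a single process: the helpers `probe()` (the combined 1 byte / 128 kilobyte exchange) and `try_adopt_parent()` (the tests of Subsections 4.2 and 4.3) are placeholders of ours, not functions of the paper or of Phoenix, and the loop simplifies the bookkeeping described in the text.

```python
import random
from dataclasses import dataclass
from typing import Optional

EXTRA_MEASUREMENTS = 10          # the value used in the paper's experiments


@dataclass
class TreeState:
    root: str                     # one tree per possible root process
    parent: Optional[str] = None  # filled in as measurements succeed


def search_for_parents(peers, trees, probe, try_adopt_parent):
    """Probe randomly selected peers, reusing each probe for every tree;
    stop EXTRA_MEASUREMENTS probes after all trees have some parent."""
    order = list(peers)
    random.shuffle(order)
    extra_left = EXTRA_MEASUREMENTS
    for cand in order:
        had_all_parents = all(t.parent is not None for t in trees)
        # One probe = a 1 byte ping-pong (latency) plus a 128 kilobyte
        # transfer (bandwidth); the result is reused for every tree.
        rtt, bw, cand_info = probe(cand)
        for tree in trees:
            try_adopt_parent(tree, cand, rtt, bw, cand_info)
        if had_all_parents:
            extra_left -= 1       # parents already existed everywhere,
            if extra_left == 0:   # so this probe counts as an extra one
                break
```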

4.2 Latency-aware spanning trees

In this section, we describe the parent-selection algorithm that we use for the latency-aware spanning trees.

4.2.1 Rationale

As we have mentioned in Section 2, collective operations require different strategies in LANs and WANs. This can be explained by the work of Bernaschi and Iannello (1998), in which the performance of collective operations is parameterised by the difference between the completion time t_s of a send and the completion time t_r of the matching receive. Bernaschi and Iannello (1998) show that when the difference between t_s and t_r is large, the optimal broadcast tree is a one-level flat tree (Figure 1(c)). They also show that when the difference is small, the optimal broadcast tree is a binomial tree (Figure 1(a)). In between these two extremes, there is a whole spectrum of trees. In WANs, high latency causes the difference between t_s and t_r to be large, so the optimal broadcast tree is a one-level flat tree. In LANs, the difference is smaller, so the optimal tree is deeper. While the exact shape of the optimal tree is different for each LAN, most existing message-passing systems use a binomial tree (e.g., MPICH) or a binary tree (e.g., MagPIe) for the sake of simplicity. Both binomial and binary trees are trees with a depth of log_2 p, where p is the number of processes.

Thus, in our algorithm, we aim to construct near-optimal broadcast trees with the following properties:

Property 1: Depth of 1 between LANs.
Property 2: Depth of approximately log_2 p within each LAN.

During the spanning tree creation phase, we focus only on broadcasts, because reductions can use broadcast trees in the reverse direction. Reduction-specific problems will be taken care of in Subsection 4.5.

4.2.2 Algorithm

For the tree rooted at root, a process p changes parents from its current parent parent to a parent candidate cand if both of the following conditions are satisfied:

Condition 1: rtt_{p,cand} < rtt_{p,parent}
Condition 2: dist_{cand,root} < dist_{p,root}.

The four values used in the decision are defined as follows (see Figure 3 for a graphical representation):

•	rtt_{p,cand}: the round-trip time between p and cand
•	rtt_{p,parent}: the round-trip time between p and parent

•	dist_{cand,root}: the sum of the round-trip times from root to cand along the spanning tree
•	dist_{p,root}: the sum of the round-trip times from root to p along the spanning tree via parent.

Here, rtt_{p,cand} is measured directly by performing 1 byte ping-pongs, while dist_{cand,root} is sent to p by cand inside the 128 kilobytes used to measure the bandwidth between p and cand. Moreover, rtt_{p,parent} and dist_{p,root} are values that are updated as follows when p changes parents:

•	rtt_{p,parent} := rtt_{p,cand}
•	dist_{p,root} := dist_{cand,root} + rtt_{p,cand}.

Figure 3  rtt and dist
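A compact restatement of this parent-change rule, with the state updates above, might look like the following. The data structure and function names are ours; initialising both cached values to infinity (so that the first candidate already connected to the root is accepted) is our choice rather than something stated in the text. Only the two conditions and the two update rules come from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LatencyTreeState:
    """State held by process p for the latency-aware tree rooted at `root`."""
    root: str
    parent: Optional[str] = None
    rtt_to_parent: float = float("inf")    # rtt_{p,parent}
    dist_to_root: float = float("inf")     # dist_{p,root}


def consider_candidate(state, cand, rtt_p_cand, dist_cand_root):
    """Adopt `cand` as the new parent when Conditions 1 and 2 both hold.

    rtt_p_cand     -- measured with 1 byte ping-pongs
    dist_cand_root -- reported by cand along with the bandwidth probe
    """
    cond1 = rtt_p_cand < state.rtt_to_parent           # Condition 1
    cond2 = dist_cand_root < state.dist_to_root        # Condition 2
    if cond1 and cond2:
        state.parent = cand
        state.rtt_to_parent = rtt_p_cand                     # update rule
        state.dist_to_root = dist_cand_root + rtt_p_cand     # update rule
        return True
    return False
```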

4.2.3 Memory requirements

To execute the algorithm described in Subsection 4.2.2, each process holds the following information for each of the spanning trees:

•	parent
•	rtt_{p,parent}
•	dist_{p,root}.

Moreover, each process remembers the following additional information, which is not necessary for creating the spanning trees, but necessary for actually performing broadcasts and reductions:

•	its children
•	for each child child, the set of virtual nodes mapped to child and its descendants.

4.2.4 Explanation

This algorithm constructs trees that satisfy Property 1 if the involved LANs are sufficiently separated. This is because Condition 1 will cause most parent–child relationships to be inside LANs, and the few that are between LANs to have the parent in the LAN in which the root is located.

To show that this algorithm constructs trees that also satisfy Property 2, we performed a simulation. In this simulation, we studied the local-area behaviour of our algorithm by simulating our algorithm for 1–400 nodes, assuming that the latency between nodes was given by a normal distribution. We show the results in Figure 4. In addition to our algorithm, which uses both Conditions 1 and 2 to select parents, we also simulated an algorithm that uses only Condition 1. While Property 1 can be satisfied with just Condition 1, Figure 4 shows that Property 2 cannot; the trees constructed with only Condition 1 are much deeper than log_2 p. In contrast, the trees constructed using our algorithm have depths of approximately log_2 p.

Figure 4  Depth of trees within a LAN

Our algorithm creates trees that are shallower than the algorithm that uses just Condition 1, because Condition 2 prevents a process from placing itself in a position that is much deeper than its current position. It is important to note, however, that our algorithm does not create excessively shallow trees either. To understand how this works, we assume that all the processes within a LAN have equal latencies. Then, if a process selects a parent candidate within its own LAN, Condition 1 will be satisfied with probability 1/2. Moreover, in a shallow tree, a process in a shallow position has many siblings, and if a process selects one of its siblings as a parent candidate, Condition 2 will also be satisfied with probability 1/2. Thus, a process in a shallow position in a shallow tree can easily move to a deeper position by becoming a child of a sibling (Figure 5).

Figure 5  Process becoming a child of a sibling
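A toy version of this simulation can be written in a few lines. The latency model below (mean, deviation and floor are arbitrary), the random probing order, the number of probes per node and the descendant check that keeps the toy structure acyclic are all our assumptions, so the absolute depths will not match Figure 4; only the qualitative gap between "Condition 1 only" and "Conditions 1 and 2" is the point.

```python
import math
import random


def simulated_tree_depth(n, use_condition2, probes_per_node=20, seed=1):
    """Grow one latency-aware tree over n nodes (node 0 is the root) with
    normally distributed pairwise latencies and return its depth."""
    rng = random.Random(seed)
    lat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Arbitrary latency model: mean 1.0, deviation 0.2, floored at 0.1.
            lat[i][j] = lat[j][i] = max(0.1, rng.gauss(1.0, 0.2))

    parent = [None] * n
    rtt_to_parent = [math.inf] * n
    dist_to_root = [math.inf] * n
    dist_to_root[0] = 0.0

    for _ in range(probes_per_node * n):
        p = rng.randrange(1, n)                 # the root never changes parent
        cand = rng.randrange(n)
        if cand == p or math.isinf(dist_to_root[cand]):
            continue                            # only attach below the root
        # Our addition: skip candidates that are p's descendants, so the
        # toy structure stays acyclic (the paper does not discuss this case).
        a = cand
        while a is not None and a != p:
            a = parent[a]
        if a == p:
            continue
        cond1 = lat[p][cand] < rtt_to_parent[p]
        cond2 = dist_to_root[cand] < dist_to_root[p]
        if cond1 and (cond2 or not use_condition2):
            parent[p] = cand
            rtt_to_parent[p] = lat[p][cand]
            dist_to_root[p] = dist_to_root[cand] + lat[p][cand]

    def depth(v):
        d = 0
        while parent[v] is not None:
            v, d = parent[v], d + 1
        return d

    return max(depth(v) for v in range(n))


for n in (50, 100, 200, 400):
    print(n, math.ceil(math.log2(n)),
          simulated_tree_depth(n, use_condition2=True),
          simulated_tree_depth(n, use_condition2=False))
```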

4.3 Bandwidth-aware spanning trees

In this section, we describe the parent-selection algorithm that we use for the bandwidth-aware spanning trees.


4.3.1 Rationale

The bandwidth-aware spanning trees must use bandwidth efficiently for long broadcasts. While the bandwidth utilisation of a particular tree can be measured by actually broadcasting a message from the root along a tree, it would take too much time to do this every time a process considered a parent change. Thus, in our algorithm, a process judges the suitability of a parent candidate based solely on information obtained from interaction between itself and the parent candidate.

4.3.2 Algorithm

Given a parent candidate cand, a processor p computes est_{p,cand}, the estimated rate at which p would receive data via cand during a broadcast, as follows:

est_{p,cand} = min(est_cand, bw_{p,cand} / (nchildren_cand + 1)).

Here, est_cand, bw_{p,cand} and nchildren_cand are defined as follows:

•	est_cand: the estimated rate at which cand would receive data via its parent during a broadcast
•	bw_{p,cand}: the measured point-to-point bandwidth between p and cand
•	nchildren_cand: the number of children that cand has (the number of siblings p would have if p became a child of cand).

Process p obtains bw_{p,cand} by measuring the time that it takes to download 128 kilobytes from cand. The other two values, est_cand and nchildren_cand, are sent to p by cand inside the 128 kilobytes. To decide whether to change parents or not, processor p compares est_{p,cand} to est_{p,parent}, the estimated rate at which p would receive data via its current parent parent during a broadcast. In other words, for the tree rooted at root, a processor p changes parents from its current parent parent to a parent candidate cand if the following condition is satisfied:

Condition: est_{p,cand} > est_{p,parent}.
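Written out, the estimate and the parent-change test fit in a few lines. This is a sketch of our reading of the scheme: the function names are ours, and the "+ 1" in the divisor reflects the parenthetical definition of nchildren_cand (p would become an additional child) rather than an explicit statement in the text.

```python
def estimated_rate(est_cand, bw_p_cand, nchildren_cand):
    """Estimated rate at which p would receive data via cand: cand cannot
    forward faster than it receives, and its interface is assumed to be
    shared among its current children plus p."""
    return min(est_cand, bw_p_cand / (nchildren_cand + 1))


def should_change_parent(est_p_parent, est_cand, bw_p_cand, nchildren_cand):
    """Change parents when the candidate's estimated rate is higher."""
    return estimated_rate(est_cand, bw_p_cand, nchildren_cand) > est_p_parent


# Illustrative numbers (not taken from the paper): cand receives at
# 40 MB/s, the measured p-cand bandwidth is 90 MB/s, and cand has 2 children.
print(estimated_rate(40.0, 90.0, 2))       # -> 30.0
```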

4.3.3 Memory requirements

To execute the algorithm described in Subsection 4.3.2, each process holds the following information for each of the spanning trees:

•	parent
•	est_{p,parent}
•	nchildren_p.

Moreover, each process remembers the following additional information, which is not necessary for creating the spanning trees, but necessary for actually performing broadcasts and reductions:

•	its children
•	for each child child, the set of virtual nodes mapped to child and its descendants.

4.3.4 Explanation

Our estimation scheme of dividing the point-to-point bandwidth by the number of children assumes that the interface of the parent becomes the bottleneck when a parent forwards a large message to many children. As there are cases when other parts of the network become the bottleneck, we need a more sophisticated estimation scheme for better performance. However, our experiments in Subsection 5.1 show that even with this simple estimation scheme, our broadcast performs much better than a Grid-unaware implementation.

4.4 Broadcast

As described in Section 3, a broadcast in Phoenix delivers a message to every virtual node. In our broadcast, we allow messages for multiple virtual nodes mapped to the same process to be delivered as a single message. Short messages are forwarded along the latency-aware spanning trees described in Subsection 4.2, and long messages are pipelined along the bandwidth-aware spanning trees described in Subsection 4.3. Ultimately, we plan to use the measured values of latency and bandwidth to determine the message size at which to switch from the latency-aware trees to the bandwidth-aware trees. Each processor p can use the size of the message, dist_{p,root} and est_{p,parent} to estimate the time it would take a broadcast message to reach it along a given tree. The estimate can then be used to determine whether the latency-aware tree or the bandwidth-aware tree would minimise the completion time of the broadcast. In this paper, however, we simply use the latency-aware trees for messages shorter than 256 kilobytes and the bandwidth-aware trees for messages longer than or equal to 256 kilobytes.

When the spanning trees are stable, a message is simply forwarded along a tree from the root to the leaves. When the spanning trees are changing, however, forwarding a message along a tree does not guarantee that all virtual nodes will receive the message. For example, if a process leaves during a broadcast, the message will not reach its descendants. Thus, when a process forwards a message to a child, it includes in the header of the message the set of virtual nodes that are to be reached through that child. When a process receives a message from its parent, it looks at the set of virtual nodes included in the header of the message and performs the following:

I	if any of the virtual nodes are mapped to itself, it delivers the message to the local user program
II	for virtual nodes that can be reached through a child, it forwards the message along the spanning tree
III	for virtual nodes that apply to neither I nor II, it forwards the message using point-to-point communication.

When the spanning trees are stable, message headers are consistent with the spanning trees, so broadcasts are performed efficiently using just I and II. Meanwhile, when the spanning trees are changing, inconsistencies between message headers and the spanning trees cause III to be used as well. For example, in Figure 6(a), the spanning trees are stable, so broadcasts are performed along the spanning tree. Meanwhile, in Figure 6(b), process F has given virtual node 5 to process D and left the computation after process A forwarded a message to process C. As process C does not have a child to forward virtual node 5 to, it sends a point-to-point message to virtual node 5.

Figure 6  Broadcast using our spanning trees: (a) stable tree and (b) changing tree

Resorting to point-to-point communication lowers the performance of a broadcast, but guarantees that a message will reach all virtual nodes. When the spanning trees stop changing, message headers once again become consistent with the spanning trees, and III stops occurring.
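The three forwarding cases can be summarised in a sketch like the following. The helper names (deliver_local, send_to_child, send_point_to_point) and the shape of the header are placeholders of ours for whatever the implementation actually uses; only the I/II/III case analysis is taken from the text.

```python
def handle_broadcast(header_vnodes, my_vnodes, child_vnodes,
                     payload, deliver_local, send_to_child, send_point_to_point):
    """Forward a broadcast message according to cases I, II and III.

    header_vnodes -- set of virtual nodes this message is responsible for
    my_vnodes     -- virtual nodes currently mapped to this process
    child_vnodes  -- dict: child process -> set of virtual nodes reachable
                     through that child (according to local tree state)
    """
    remaining = set(header_vnodes)

    # I: virtual nodes mapped to this process -> deliver locally.
    local = remaining & set(my_vnodes)
    if local:
        deliver_local(payload)
        remaining -= local

    # II: virtual nodes reachable through a child -> forward along the tree,
    # recording in the header which virtual nodes that child must cover.
    for child, reachable in child_vnodes.items():
        subset = remaining & set(reachable)
        if subset:
            send_to_child(child, subset, payload)
            remaining -= subset

    # III: anything left is not covered by the local tree state (e.g. the
    # responsible child has left) -> fall back to point-to-point messages.
    for vnode in remaining:
        send_point_to_point(vnode, payload)
```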

4.5 Reduction

As described in Section 3, a reduction in Phoenix performs some mathematical operation, such as computing the sum or finding the maximum, on data mapped to all virtual nodes, and delivers the result to one virtual node. Messages are forwarded along the spanning trees described in Subsection 4.3, with leaf processes beginning a reduction by each sending its own data to its parent; a process with multiple virtual nodes mapped to itself performs a local reduction first. The specified operation is performed on the data as they are forwarded along the tree from the leaves to the root, and the reduction completes when the result arrives at the process to which the root virtual node is mapped. As the order in which data are operated on differs depending on the tree used, we only consider operations that are both associative and commutative.

When the spanning trees are stable, each process simply waits for data from each of its children, performs the specified operation, and then sends the result to its parent. Meanwhile, when the spanning trees are changing, a process may end up waiting for a child that has already sent data to a different process. Thus, we use a timeout mechanism in which a process waits for only a certain amount of time for data from its children. Each process sets a timeout when the reduction function is called, and when the timeout expires, the process forwards whatever data it has received by that time. Any late data and data from non-child processes are forwarded as they arrive.

The timeout must be long enough so that it does not cause data to be forwarded prematurely when the topology is stable. One way to enforce this is to make the timeout longer than the time between the reduction root and the farthest leaf, but each process does not know this value, and it would require much communication to collectively compute this value and deliver it to each process. However, each process does know the time between itself and the root of each spanning tree, so it uses the maximum of that value as an estimate.
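A sequential sketch of the timeout rule might look as follows. The polling loop, the helper names (receive, send_up) and the choice of timeout value passed in are simplifications of ours; the policy itself (combine children's data, forward whatever has arrived when the timeout fires, forward late and non-child data as they arrive) follows the text, with the late-data path not shown.

```python
import time


def reduce_at_node(op, local_value, expected_children, receive, send_up,
                   timeout_s):
    """Combine local data with children's contributions and forward the
    partial result towards the root, giving up on missing children after
    `timeout_s` seconds.

    receive() is assumed to return (sender, value) or None if nothing has
    arrived yet; op must be associative and commutative.
    """
    acc = local_value
    pending = set(expected_children)
    deadline = time.monotonic() + timeout_s
    while pending and time.monotonic() < deadline:
        msg = receive()
        if msg is None:
            time.sleep(0.001)          # nothing yet; poll again
            continue
        sender, value = msg
        acc = op(acc, value)           # fold the child's contribution in
        pending.discard(sender)
    send_up(acc)                        # forward whatever arrived in time
```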

5  Experimental results

In this section, we present our experimental results. Each of the following experiments was performed using 128–201 real processors distributed over three or four clusters. The round-trip time of point-to-point messages using Phoenix was several hundred microseconds within a cluster and several milliseconds between clusters. We evaluate the performance of our collective operations when the spanning trees are stable in Subsections 5.1 and 5.2, and the behaviour of our collective operations when processes join and leave in Subsection 5.3.

5.1 Stable-state broadcast performance

In our first experiment, we evaluated the stable-state performance of our broadcast. We waited until all processes had finished performing measurements and the spanning trees had become stable, and performed MPI-like broadcasts by mapping one virtual node to each process.

5.1.1 Short messages

To evaluate the performance of short broadcasts, we measured the time it took a 1 byte message to reach each process. We compared our implementation with two other implementations: a Grid-unaware implementation that used the algorithm of MPICH (MPICH-like), and a static Grid-aware implementation that used the algorithm of MagPIe (MagPIe-like). Figure 7 shows our results for 201 processes distributed over three clusters.

Figure 7  1 Byte broadcast (201 Procs. in three clusters): (a) MPICH-like implementation; (b) MagPIe-like implementation and (c) our implementation

In MPICH-like, messages that stayed within a cluster arrived within several milliseconds, while messages that were forwarded between clusters many times took over 30 ms to arrive (Figure 7(a)). Meanwhile, in MagPIe-like, messages did not travel through wide-area links unnecessarily, so messages arrived at all processes within 6 ms (Figure 7(b)). This is seven times faster than MPICH-like, which confirms the results of Kielmann et al. In our implementation, every process received its message within 11 ms (Figure 7(c)). This is three times faster than MPICH-like, but almost twice as slow as MagPIe-like. One reason for this slowdown is that our spanning tree algorithm only makes trees that are near-optimal. For example, within a LAN, our trees have a depth of about log_2 p, which is comparable to that of the optimal tree. However, unlike the optimal tree, our trees may have a large fanout at a deep position. Another reason is the overhead incurred by listing virtual nodes in the message header.

5.1.2 Long messages

To evaluate the performance of long broadcasts, we measured the bandwidth of 32 kilobyte to 64 megabyte broadcasts. To compute the bandwidth, we divided the product of the number of bytes sent and the number of processes used (the minimum amount of communication required) by the completion time. We compared our implementation (Dynamic) with a Grid-unaware implementation (MPICH-like) and two static Grid-aware implementations (MagPIe-like and List). In Dynamic, the latency-aware tree was used for messages shorter than 256 kilobytes, and the bandwidth-aware tree for messages longer than or equal to 256 kilobytes. In MPICH-like, the long-message algorithm of MPICH was used. In MagPIe-like, the root sent the message to the coordinator node of each cluster as in MagPIe, and then the long-message algorithm of MPICH was used within each cluster. In List, the processes were lined up in a list that made the bottleneck as wide as possible, and messages were pipelined along that list.

Figure 8 shows our results for 137 processes distributed over four clusters. As expected, MagPIe-like and List performed much better than MPICH-like; List outperformed MagPIe-like for very long messages, because a process never sent the same data more than once in List. Our implementation was able to achieve 82% of the bandwidth of List, even though some processes sent the same data multiple times.

Figure 8  Long broadcast (137 Procs. in four clusters)
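Spelled out with symbols that do not appear in the original text: if S is the number of bytes in the broadcast message, P the number of processes and T the completion time, the figure reported above is

bandwidth = (S × P) / T,

i.e., the minimum amount of data that must be delivered, divided by the time the broadcast actually took.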

5.2 Stable-state reduction performance

In our next experiment, we evaluated the stable-state performance of our reduction. We performed an MPI-like reduction: we mapped one virtual node to each process, gave an array of integers to each process and measured the time to add the arrays together. Figure 9 shows the results for 128 processes distributed over three clusters. Just as with our broadcast, our reduction performed better than the Grid-unaware implementation, but not quite as well as the static Grid-aware implementation.

Figure 9  Reduction (128 Procs. in three clusters)

5.3 Transition-state behaviour

In our last experiment, we evaluated the transition-state behaviour of our broadcast by having some processes leave and then rejoin a computation while we repeatedly performed 4 megabyte broadcasts. We mapped one virtual node to each process in the beginning, and then changed the mapping during the computation as follows:

I	60 s after application start-up, half of the processes gave away their virtual nodes to the remaining processes and then left the computation
II	another 30 s later, the processes that gave away their virtual nodes in I rejoined the computation and re-acquired their original virtual nodes.

Thus, every process had one virtual node during the first 60 s, and half of the processes each had two virtual nodes during the next 30 s. Figure 10 shows the results for 160 processes distributed over four clusters. Immediately after application start-up, broadcasts performed poorly because the spanning trees were under construction, but 11 s after application start-up, processes finished searching for parents, and the performance of broadcasts stabilised.

Figure 10  Transition-state behaviour of our broadcast

When half of the processes left, the spanning trees broke, causing the performance of broadcasts to fall sharply. However, the spanning trees were quickly repaired, and the performance of broadcasts stabilised once again after 8 s. Similar behaviour was seen when the processes that left earlier rejoined the computation. In this way, our broadcast temporarily performs worse when processes join or leave, but messages are guaranteed to reach all virtual nodes, and our broadcast resumes effective execution once the spanning trees stabilise.

6  Conclusion

In this paper, we proposed a method for wide-area message-passing systems to perform collective operations using dynamically created spanning trees. We presented algorithms to create spanning trees using latency and bandwidth measured at run-time, and explained how to perform broadcasts and reductions using the generated trees. In our experiments, we used 128–201 real processors distributed over 3–4 clusters and showed the following:

I	In stable-state, our broadcast and reduction performed much better than those of a Grid-unaware implementation, although not quite as well as those of a static Grid-aware implementation. In particular, the latency of our 1 byte broadcast was one third of that of a Grid-unaware implementation, and within a factor of two of that of a static Grid-aware implementation. Moreover, our 64 megabyte broadcast achieved 82% of the bandwidth of a static Grid-aware implementation.
II	When some processes joined or left a computation, our broadcast temporarily performed worse for about 8 s while the spanning trees adapted to the new topology, but completed successfully even during this time.

In the future, we plan to optimise our broadcast and reduction to close the gap between our implementation and static Grid-aware implementations. Our latency-aware spanning tree can be improved by considering not just round-trip time but also forwarding overhead. Our bandwidth-aware spanning tree can be improved by using a more sophisticated estimation scheme. Moreover, our broadcast can be improved by considering the order in which each process forwards messages to its children; if we send to the child that has the most descendants first, we will be able to overlap more communication. Finally, we can reduce the forwarding overhead in our broadcast by compressing the virtual nodes listed in the message header.

Moreover, we plan to implement collective operations besides broadcast and reduction using our spanning trees. Other 1-to-N operations and N-to-1 operations can be implemented like the broadcast and reduction operations, but N-to-N operations, such as all-to-all and barrier, will require a new strategy.

Finally, in the spanning-tree algorithms proposed in this paper, processes do not share much information with each other. However, by sharing more information, processes may be able to create the spanning trees more efficiently. For example, instead of selecting parent candidates randomly, a process can use hints given by other processes to preferentially select nearby processes.

Acknowledgements This work is partially supported by “Precursory Research for Embryonic Science and Technology” of Japan Science and Technology Agency and by “Grants-in-Aid for Scientific Research” of Japan Society for the Promotion of Science. We are grateful to the Program Committee of the Grid 2005 Workshop for inviting our paper to this journal. We also thank the anonymous reviewers of the Grid 2005 Workshop for their comments, which helped us improve our paper.

References

Banerjee, S., Bhattacharjee, B. and Kommareddy, C. (2002) 'Scalable application layer multicast', Proceedings of the 2002 Conference on Applications, Technologies, Architectures and Protocols for Computer Communications, Pittsburgh, USA, pp.205–217.

Bernaschi, M. and Iannello, G. (1998) 'Collective communication operations: experimental results vs. theory', Concurrency: Practice and Experience, Vol. 10, No. 5, pp.359–386.

Bozdog, A., van Renesse, R. and Dumitriu, D. (2003) 'SelectCast – a scalable and self-repairing multicast overlay routing facility', Proceedings of the 2003 ACM Workshop on Survivable and Self-Regenerative Systems, Fairfax, USA, pp.33–42.

Gabriel, E., Resch, M., Beisel, T. and Keller, R. (1998) 'Distributed computing in a heterogeneous computing environment', Proceedings of the 5th European PVM/MPI User's Group Meeting, Liverpool, UK, pp.180–187.

Gropp, W., Lusk, E., Doss, N. and Skjellum, A. (1996) 'A high-performance portable implementation of the MPI message passing interface standard', Parallel Computing, Vol. 22, No. 6, pp.789–828.

Husbands, P. and Hoe, J.C. (1998) 'MPI-StarT: delivering network performance to numerical applications', Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, San Jose, USA, pp.1–15.

Jannotti, J., Gifford, D.K., Johnson, K.L., Kaashoek, M.F. and O'Toole Jr., J.W. (2000) 'Overcast: reliable multicasting with an overlay network', Proceedings of the Fourth Symposium on Operating System Design and Implementation, San Diego, USA, pp.197–212.

Karonis, N., de Supinski, B., Foster, I., Gropp, W., Lusk, E. and Bresnahan, J. (2000) 'Exploiting hierarchy in parallel computer networks to optimize collective operation performance', Proceedings of the 14th International Symposium on Parallel and Distributed Processing, Cancun, Mexico, pp.377–384.

Karonis, N.T., Toonen, B. and Foster, I. (2003) 'MPICH-G2: a grid-enabled implementation of the message passing interface', Journal of Parallel and Distributed Computing, Vol. 63, No. 5, pp.551–563.

Kielmann, T., Hofman, R.F.H., Bal, H.E., Plaat, A. and Bhoedjang, R.A.F. (1999a) 'MagPIe: MPI's collective communication operations for clustered wide area systems', Proceedings of the Seventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Atlanta, USA, pp.131–140.

Kielmann, T., Hofman, R.F.H., Bal, H.E., Plaat, A. and Bhoedjang, R.A.F. (1999b) 'MPI's reduction operations in clustered wide area systems', Proceedings of the Message Passing Interface Developer's and User's Conference, Atlanta, USA, pp.43–52.

Kielmann, T., Bal, H.E. and Gorlatch, S. (2000) 'Bandwidth-efficient collective communication for clustered wide area systems', Proceedings of the 14th International Symposium on Parallel and Distributed Processing, Cancun, Mexico, pp.492–499.

Saito, H., Taura, K. and Chikayama, T. (2005) 'Collective operations for wide-area message passing systems using adaptive spanning trees', Grid 2005 – 6th IEEE/ACM International Workshop on Grid Computing, Seattle, USA, pp.40–48.

Taura, K., Endo, T., Kaneda, K. and Yonezawa, A. (2003) 'Phoenix: a parallel programming model for accommodating dynamically joining/leaving resources', Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, USA, pp.216–229.

Thakur, R. and Gropp, W. (2003) 'Improving the performance of collective operations in MPICH', Proceedings of Euro PVM/MPI, Venice, Italy, pp.257–267.

Website

Message Passing Interface (MPI) Forum, http://www.mpi-forum.org/
