Extensible Communication Architecture for Grid Nodes
Nader Mohamed, Jameela Al-Jaroodi, Hong Jiang
Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0115
Abstract
Nodes in the Grid can be connected through a multigigabit network. However, the network interfaces in the nodes can become a bottleneck for the maximum achievable bandwidth among nodes. This paper proposes an extensible communication architecture that avoids the limitations of the network interfaces by integrating multiple network interfaces in each node. The proposed solution deploys a technique to balance the network traffic load among multiple network interfaces in a single node. While this technique enhances communication bandwidth among the nodes, it is distributed and does not require any coordination or reservation mechanisms. It achieves the maximum possible bandwidth between any pair of Grid nodes with multiple network interfaces. The experimental evaluation, in a LAN and in a simulated WAN environment, demonstrated the benefits of the proposed architecture.
1. Introduction
Many distributed high-performance and data-intensive applications such as data mining, scientific simulations, and visualization require enormous resources in terms of processing power, storage capacity, and communication bandwidth. This increasing demand for such resources led to great advances in the area of grid computing [4]. Grid nodes are in many cases heterogeneous in terms of their resource capabilities, architectures, operating systems, and function designations (e.g., nodes for storage, computation, visualization, sensors, or instruments). These nodes usually have different networking requirements too. In order to efficiently utilize the Grid resources for executing large-scale applications or for storing and retrieving large datasets, the resources should be balanced. For example, communication capabilities should be matched with commensurate processing and storage capabilities. Having high-power processing nodes or high-throughput storage nodes connected by low-capacity communication links, for example, leads to inefficient utilization of Grid resources.
Optical networks are used to provide high-bandwidth, multi-gigabit communication among Grid components geographically located at different sites. However, end-to-end bandwidth between any two nodes is limited by the maximum achievable bandwidth of the network interface card (NIC) and of the local area networks (LANs) that connect the nodes to the main Grid optical network. To avoid these limitations, multiple NICs connected to different LANs, which are in turn connected to the main multigigabit links, can be used to connect high-performance Grid nodes [8]. This helps avoid both NIC and LAN bottlenecks and enhances the achievable communication bandwidth among the nodes. One practical example that uses multiple NICs to increase achievable outbound bandwidth is IP Network Multipathing (IPNMP) [9], found in Sun Solaris. However, IPNMP only enhances outbound bandwidth. Other examples are Ethernet channel bonding [2] and Multirail Networks [3]. These are designed to enhance both inbound and outbound communication bandwidths. However, they are designed to work on multiple independent and homogeneous LANs and not to enhance inbound communication bandwidth for Grid nodes, which are usually geographically distributed.
This paper develops a load balancing technique to enhance achievable bandwidth for Grid nodes using multiple network interface cards. The proposed technique is distributed and does not require any coordination or reservation mechanisms. In addition to achieving the maximum possible bandwidth between any two Grid nodes with multiple network interfaces, the solution can deal with a different number of network interfaces in each node. This general solution can be implemented, with some modifications, at different network layers.
In the rest of this paper, we start in Section 2 with an outline of the limitations of network protocols and services in utilizing existing multiple NICs among Grid nodes. Section 3 describes the proposed solution architecture and load balancing technique. Then, in Section 4, we experimentally evaluate the solution. Section 5 discusses related work and Section 6 concludes the paper.
Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’04) 0-7695-2108-8/04 $ 20.00 © 2004 IEEE
2. Multiple-Network-Interface Nodes
Multiple NICs may exist in a Grid node, all of which are connected to the same fat network that connects all nodes in the Grid. In this case, any node in the Grid can communicate with a multiple-network-interface node through any of its interfaces. The common communication protocols and services have two main limitations in dealing with multiple-network-interface nodes: a coordination limitation and a utilization limitation. To explain these limitations, consider the Grid configuration in Figure 1.
Figure 1. Heterogeneous Grid nodes (Node1 to Node5) connected with fat network.
The Grid in Figure 1 connects multiple machines through a fat network. The machines need different network capacities to balance with their other system resources, such as processing and storage resources. Consider a network with high aggregate bandwidth in which all NICs are homogeneous; for example, all NICs are Fast Ethernet cards. Assume that Node1 and Node2 are storage nodes while Node3, Node4, and Node5 are computation nodes. High-performance and storage nodes usually need high network bandwidth. Therefore, the storage nodes can have more NICs to enhance their networking capabilities. All NICs are connected to the same fat network through a set of LANs. The machines in Figure 1 have 3, 2, 1, 1, and 1 NICs for Node1, Node2, Node3, Node4, and Node5, respectively. The limitations of the communication protocols and services can be described as follows:
1. Coordination Limitation: Consider Node3, Node4, and Node5 sending bulk data to Node1. The sending nodes cannot know to which interface in Node1 they should send the data. Moreover, several nodes may select the same NIC on Node1 at the same time, and this is unavoidable when more than three nodes send to Node1 simultaneously. The problem here is how to coordinate the traffic flow to enhance the average inbound communication bandwidth to Node1.
2. Utilization Limitation: Consider Node2 sending bulk data to Node1. The theoretical communication bandwidth is 200Mbps. However, communication protocols and services can achieve a maximum of 100Mbps, since most of them are designed to work with a single NIC. The problem here is how to simultaneously utilize all the available interfaces to enhance the bulk data transfer. Other utilization limitations can arise from the heterogeneity of the NICs. Assume Node5 has a gigabit card while Node1 has three Fast Ethernet cards. The theoretical communication bandwidth between Node1 and Node5 is 300Mbps; however, common communication services such as sockets can only achieve a maximum of 100Mbps.
Using a network resource server or proxy to coordinate the traffic flow can solve both the coordination and utilization limitations. However, these solutions are costly since they require either resource reservation or re-routing mechanisms in the network. With a large round-trip time (RTT), any centralized or distributed coordination that requires communication is very costly. In general, the limitations mentioned above have two sides, the sender and the receiver. Each of these requires some special treatment to achieve an acceptable solution. One practical solution for coordinating outbound traffic over multiple NICs exists in Sun Solaris IPNMP [9]. However, this solution is only available for Solaris machines and is implemented at the IP layer. Nevertheless, the techniques it uses could be adopted for other machines or implemented at different network layers. IPNMP does not solve the issue of coordinating the inbound traffic or maximizing the utilization of multiple NICs in the receiving nodes.
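The bandwidth figures quoted for the utilization limitation (200Mbps against 100Mbps, and 300Mbps against 100Mbps) can be reproduced with a small sketch. It assumes that the theoretical bound between two nodes is the smaller of their aggregate NIC capacities and that a single-NIC service is limited by one interface at each end; the function names are illustrative and do not come from the paper.

```python
FAST_ETHERNET = 100  # Mbps, nominal
GIGABIT = 1000       # Mbps, nominal


def theoretical_bandwidth(sender_nics, receiver_nics):
    """Assumed upper bound: the smaller aggregate NIC capacity of the two nodes."""
    return min(sum(sender_nics), sum(receiver_nics))


def single_nic_bandwidth(sender_nics, receiver_nics):
    """Best case when only one NIC is used at each end (e.g. one plain socket)."""
    return min(max(sender_nics), max(receiver_nics))


node1 = [FAST_ETHERNET] * 3   # storage node with three Fast Ethernet cards
node2 = [FAST_ETHERNET] * 2   # storage node with two Fast Ethernet cards
node5_gigabit = [GIGABIT]     # Node5 in the heterogeneous variant of the example

print(theoretical_bandwidth(node2, node1), single_nic_bandwidth(node2, node1))
# prints: 200 100
print(theoretical_bandwidth(node5_gigabit, node1), single_nic_bandwidth(node5_gigabit, node1))
# prints: 300 100
```

The gap between the two numbers in each line is the bandwidth left unused when the communication service drives only one interface.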
The next section explains our proposed technique to solve both the coordination and utilization limitations for maximizing inbound bandwidth of receiving nodes with multiple NICs without any coordination communication overhead. In general, this paper extends the Sun solution to include inbound traffic load-balancing for the Grid.
3. Striping with Load Balancing
Multiple channels may exist between any two nodes in the Grid through multiple network interfaces that are connected through a multi-gigabit WAN. We abstract the network resource between any two nodes as a set of channels. A channel (k,m) exists between nodes Ni and Nj in a Grid if there is a physical path between NIC k in node Ni and NIC m in node Nj, where both interfaces k and m are connected to the same fat network. If nodes Ni and Nj are connected through the same WAN, there are ni × nj channels between them, where ni and nj are the numbers of NICs connected to the network in nodes Ni and Nj, respectively. The maximum physical bandwidth between Ni and Nj is B × min(ni, nj), where B is the bandwidth of a single link, assuming all interfaces are of equal bandwidth.
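As an illustration of the channel abstraction, the following sketch enumerates the channels (k,m) between two nodes and evaluates the bound B × min(ni, nj) for the NIC counts of Figure 1. The dictionary of NIC counts and the helper names are assumptions made for this example; B is taken as the nominal 100Mbps of Fast Ethernet.

```python
from itertools import product

# NIC counts per node, as given for the configuration in Figure 1.
NICS = {"Node1": 3, "Node2": 2, "Node3": 1, "Node4": 1, "Node5": 1}
B = 100  # Mbps per link, assuming homogeneous Fast Ethernet interfaces


def channels(n_i, n_j):
    """All (k, m) pairs: NIC k of node Ni paired with NIC m of node Nj."""
    return list(product(range(n_i), range(n_j)))


def max_bandwidth(n_i, n_j, link_mbps=B):
    """Maximum physical bandwidth between Ni and Nj for equal-speed links."""
    return link_mbps * min(n_i, n_j)


pairs = channels(NICS["Node2"], NICS["Node1"])
print(len(pairs), pairs)
# prints: 6 [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
print(max_bandwidth(NICS["Node2"], NICS["Node1"]))
# prints: 200
```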
The proposed solution assumes the existence of congestion control and multiplexing mechanisms either below or above the solution (depending on the implementation approach). The solution can be implemented at the IP level below a reliable transport layer or at the middleware layer, immediately below the applications and above a reliable transport layer. Congestion control is needed to slow down the flow when there is high load on the interfaces. At the same time, the multiplexing mechanism is needed to ensure efficient aggregation of multiple traffic flows from different sources. We also assume that the reliable transport protocol in use provides good effective bandwidth over the WAN. Many solutions exist for the high bandwidth-delay product network problem [8]. We first explain the solution for the middleware layer and then show how to customize it for the IP level implementation. The advantage of implementing the proposed solution as middleware is that it can be introduced as a service that is easily ported to heterogeneous systems. The drawback is the overhead added by the middleware layer. On the other hand, implementing the solution at the IP layer has less overhead, but it requires kernel changes and is therefore not easily portable to heterogeneous systems.
Messages of different sizes are exchanged among nodes on the Grid. Some of these messages are as small as 64 bytes, while others are as large as 1GB. Using the proposed solution, sending large messages in parallel over multiple channels reduces the transfer time. The maximum utilization (and bandwidth) of the network interfaces between two nodes can be achieved if the load balancing among the available network interfaces is optimal. Optimal load balancing is achieved by having equal load on all network interfaces, i.e., the sender node's NICs and the receiver node's NICs. Efficient load balancing among network interfaces is achieved by striping all large messages. In striping, a large message is divided into small fragments of f bytes each. The fragments are sent through all available channels while maintaining a balanced load on all links. Load balancing is achieved by cycling through the sender NICs and the receiver NICs in sequence (e.g., round-robin). Consider a sender with 2 NICs, labeled 0 and 1, a receiver with 3 NICs, labeled 0, 1, and 2, and a 6-fragment message to be sent between the two nodes. To achieve load balancing at both ends, we send the fragments over the following sequence of NIC pairs (channels): (0,0), (1,1), (0,2), (1,0), (0,1), (1,2). We can determine the channel numbers (k,m) for sending fragment number s between nodes Ni and Nj as follows:
k = s mod ni and m = s mod nj
In some cases, we need to use fragment sizes smaller than f to achieve optimal load balancing, depending on the message size l. If a message of size l is to be sent through C channels and l/C < f, where f is the fragment size already in use, then a fragment size of l/C may be used. Since this method maintains outbound and inbound load balancing among all NICs of each node during striping, global load balancing can be achieved across the interfaces of all Grid nodes, which leads to better network utilization and performance. As a result, the utilization limitation is resolved.
Using reliable transport for all channels ensures the reliability and the FIFO (first in, first out) order of the fragments. These properties are used to predefine the order of the fragments. After communicating the message size, both sender and receiver use the formulas to determine over which channel each fragment will be exchanged. Therefore, there is no need to add headers or sequence numbers to the fragments. If the transport protocol does not guarantee FIFO order, then sequence numbers must be added. Moreover, most socket interfaces provide send and receive commands for direct memory access, which can be used for striping to avoid extra copying beyond that of the reliable transport protocol. Small messages, on the other hand, are sent without striping in a round-robin manner among available channels. Round-robin distribution is used to maintain load balancing on the interfaces. This distribution among available channels follows the same method described above, but uses multiple small messages rather than fragments of a single large message.
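The rule k = s mod ni, m = s mod nj lends itself to a short sketch. The code below is an illustration only, not the authors' implementation: reliable FIFO channels are modeled as in-memory queues, the function names are hypothetical, and rounding l/C up to a whole number of bytes is an assumption. It shows that a receiver applying the same rule can reassemble the message without any headers or sequence numbers.

```python
from collections import deque


def channel_for(s, n_i, n_j):
    """Channel (k, m) for fragment number s: k = s mod ni, m = s mod nj."""
    return s % n_i, s % n_j


def fragment_size(length, f, c):
    """Use f unless length / c < f, in which case use length / c (rounded up)."""
    per_channel = -(-length // c)  # ceiling division
    return min(f, max(1, per_channel))


def send_striped(message, n_i, n_j, f):
    """Split message into f-byte fragments and queue each on its channel."""
    queues = {}
    for s, start in enumerate(range(0, len(message), f)):
        key = channel_for(s, n_i, n_j)
        queues.setdefault(key, deque()).append(message[start:start + f])
    return queues


def receive_striped(queues, length, n_i, n_j, f):
    """Rebuild the message from the known length alone, relying on FIFO order."""
    count = -(-length // f)
    return b"".join(queues[channel_for(s, n_i, n_j)].popleft() for s in range(count))


message = bytes(range(60))
schedule = [channel_for(s, 2, 3) for s in range(6)]
print(schedule)
# prints: [(0, 0), (1, 1), (0, 2), (1, 0), (0, 1), (1, 2)]
queues = send_striped(message, 2, 3, 10)
print(receive_striped(queues, len(message), 2, 3, 10) == message)
# prints: True
```

In this sketch the only information exchanged ahead of the fragments is the message length, which mirrors the requirement that both ends agree on the message size before striping starts.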
If multiple nodes send messages to the same interfaces on the destination nodes, the messages will generate the same load on all destination NICs. Therefore, the bandwidth of the NICs and links will be shared evenly and the load will be balanced. Since each channel represents a reliable transport channel, multiple fragments from different sources can be efficiently multiplexed to the destination NICs. The same is true if two processes executing on the same node send two large messages. Both messages will be distributed equally among the node’s NICs. This technique ensures load balancing without any coordination, synchronization or reservation, which are usually costly.
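The even sharing among destination NICs can be checked by counting fragments. The sketch below assumes each sender applies the rule m = s mod nj independently, with no knowledge of the other senders; it counts fragments only and does not model timing or congestion. The sender NIC counts follow Figure 1, and the fragment count of 600 is an arbitrary value chosen for the example.

```python
from collections import Counter


def nic_load(n_sender, n_receiver, fragments):
    """Fragments handled by each sender NIC and each receiver NIC for one message."""
    sender_load, receiver_load = Counter(), Counter()
    for s in range(fragments):
        sender_load[s % n_sender] += 1
        receiver_load[s % n_receiver] += 1
    return sender_load, receiver_load


# Node2 (2 NICs) and Node3, Node4, Node5 (1 NIC each) all send to Node1 (3 NICs).
total_on_node1 = Counter()
for n_sender in (2, 1, 1, 1):
    _, receiver_load = nic_load(n_sender, 3, 600)
    total_on_node1 += receiver_load

print(dict(total_on_node1))
# prints: {0: 800, 1: 800, 2: 800}
```

Each destination NIC ends up with the same number of fragments even though the senders never exchange any coordination messages.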
If the solution is implemented at the IP layer, sending a large message causes the IP layer to receive multiple full-size packets and one smaller packet, which carries the last part of the message. The IP layer can use the same scheduling technique explained above with some modifications. The modifications include forwarding full-size packets instead of message fragments using the above scheduling technique. Small packets can be treated the same way small messages are handled by the middleware; that is, they are sent in a round-robin manner among available interfaces. If there are concurrent packets going to different machines, a free local interface is selected for each packet to be sent, while the destination interface in the destination machine is selected in a round-robin manner, with an individual round-robin schedule for each destination machine. The coordination needed at the IP level for selecting a local interface is already available in IPNMP. This solution has the same effect as the middleware implementation. However, in this implementation, TCP maintains the congestion control and multiplexing of outbound traffic.
4. Performance
Two sets of experiments were conducted to evaluate the performance gains of the proposed technique. The first set is in a simulated wide area network environment with high RTT for an IP layer implementation. The second set is for an actual implementation at the middleware layer over TCP, among machines connected through a single high-bandwidth network.
The IP layer solution was evaluated using a discrete event simulator that performs a time-step simulation of network operations. The configuration in Figure 1 was used, in which the optical network has high aggregate bandwidth. The assumption here is that the reliable transport protocol transmits efficiently: the sender and receiver buffer sizes and the congestion window are tuned for best transmission performance, and there is no packet loss. Although this assumption does not hold in practice, the goal is only to compare the performance of using a single NIC against using the proposed technique with multiple NICs, and the same assumptions were applied in both cases. The simulation does, however, consider the overhead added by the different network and protocol layers. Different scenarios were considered, in which one or more nodes from the set {Node2, Node3, Node4, Node5} simultaneously send bulk data to Node1. The RTT between any two nodes was set to 40ms. For all experiments, large messages of size 64MB were used when Node3, Node4, and Node5 transfer data to Node1, while a 128MB message was used when Node2 transfers data to Node1. The average transfer time per client is shown in Figure 2. In addition, the total aggregate bandwidth to Node1 is shown in Figure 3. As shown, the available inbound bandwidth in Node1 is shared among incoming traffic from different nodes. All NICs in Node1 have the same load all the time. Thus, the proposed technique solves the coordination and utilization limitations without adding any communication overhead.
Figure 2. Average Transfer Time for all Nodes. (Vertical axis: average transfer time in seconds; horizontal axis: sets of sending nodes, namely 5; 2; 4,5; 3,4,5; 2,3; 2,3,4; 2,3,4,5.)
In the second set of experiments, one of the nodes (the server node) was equipped with two Fast Ethernet cards that are both connected to the same switch. The other nodes (client nodes) have one Fast Ethernet NIC each, and all are connected to the same switch. All nodes have two AthlonMP 1.4GHz processors and 1GB RAM per node. Multiple 1MB messages were transferred simultaneously between the client nodes and the server node. The effective bandwidth for transferring the messages for each client and the aggregated effective bandwidth for different numbers of clients are shown in Table 1. The average effective bandwidth per client represents the outbound bandwidth each client sends to the server. Since the server can only handle a maximum of 200Mbps, three or more clients evenly share that 200Mbps maximum.
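For the LAN experiment reported in Table 1, the sharing behaviour can be compared against an idealized bound. The sketch below assumes nominal Fast Ethernet rates (100Mbps per client NIC, 200Mbps for the two-NIC server) and ignores all protocol overhead, so it gives an upper bound rather than a prediction; the function name is illustrative.

```python
def ideal_share(clients, client_mbps=100.0, server_mbps=200.0):
    """Idealized per-client and total inbound bandwidth with even sharing."""
    per_client = min(client_mbps, server_mbps / clients)
    return per_client, per_client * clients


for n in range(1, 5):
    per_client, total = ideal_share(n)
    print(n, round(per_client, 1), round(total, 1))
# prints:
# 1 100.0 100.0
# 2 100.0 200.0
# 3 66.7 200.0
# 4 50.0 200.0
```

The measured per-client values in Table 1 stay below these nominal bounds, as expected once protocol and middleware overhead are taken into account.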
Figure 3. Total Aggregated Bandwidth to Node 1. (Vertical axis: bandwidth to Node1 in Mbps; horizontal axis: sets of sending nodes.)
Table 1. In-bound Load Balancing.
No. of Clients    Avg. Effective Bandwidth per Client (Mbps)    Total Aggregated Bandwidth to the Server (Mbps)
1                 93.259                                        93.259
2                 91.131                                        182.262
3                 62.485                                        187.455
4                 46.181                                        184.181
5. Related Work
In our previous work on the Multiple-Network-Interface Socket (MuniSocket) [6][7][8], we implemented sockets that utilize multiple NICs to reduce data transfer time. One of the main differences between our previous work and the current work is that MuniSocket is designed to utilize available bandwidth in network interfaces where there is some uncontrollable load generated outside MuniSocket's control by other applications using standard sockets. This paper, in contrast, develops an end-to-end load balancing technique in which we assume that the entire communication load in the Grid passes through the proposed solution. Thus, all running applications use the solution. In addition, all the nodes follow the same striping schedule described in Section 3.
There are many solutions for utilizing multiple networks in clusters or between machines in local area networks, such as Ethernet channel bonding [2], the Reliable and Scalable Striping Protocol [1], and Multirail Networks [3]. These are designed to enhance both inbound and outbound communication bandwidths. However, they are designed to work on multiple independent and homogeneous local area networks and not to enhance inbound communication bandwidth for Grid nodes in wide area networks (WANs). Multirail Networks use a channel reservation technique, which is costly in a WAN, to coordinate traffic flow for large messages. These solutions require all nodes to have the same number of homogeneous interfaces, while in the Grid the situation is different.
6. Conclusion and Future Work
In this paper we developed a solution for the coordination and utilization limitations among Grid nodes with multiple network interfaces. The technique maintains load balancing among the nodes' NICs and maximizes communication bandwidth among Grid nodes with different numbers of network interfaces. This solution extends the Sun IPNMP solution to include inbound traffic load balancing for Grid nodes. One advantage of the proposed technique is that it does not require any coordination or reservation steps and thus does not add any extra overhead. Another advantage is that it allows easy addition of communication capability to any Grid node. Moreover, the proposed technique can be used as part of other Grid tools such as GridFTP [5], which performs machine striping for file transfer, to avoid the NIC limitations of any participating node. We also experimentally evaluated the proposed load balancing technique in both LAN and simulated WAN environments. The performance gains in terms of increasing the aggregate bandwidth between Grid nodes are very promising. The solution described in this paper is for Grid nodes in which each node has a single NIC or multiple homogeneous NICs. Different nodes may have interfaces of different types, but all interfaces at the same node must be homogeneous. In the future, we plan to develop coordination and utilization solutions that deal with heterogeneous interfaces within the same node.
Acknowledgement
This project was partially supported by a National Science Foundation Grant EPS-0091900 and an Academic Priority Grant of the University of Nebraska-Lincoln, for which we are grateful. We would also like to thank members of the secure distributed information (SDI) group and the research computing facility (RCF) at the University of Nebraska-Lincoln for their continuous help and support.
References
[1] H. Adiseshu, G. Parulkar, and G. Varghese, "A Reliable and Scalable Striping Protocol", Computer Communication Review, Volume 26, pp. 131-141, October 1996.
[2] Beowulf Ethernet Channel Bonding, http://www.beowulf.org/software/bonding.html, June 2002.
[3] S. Coll, E. Frachtenberg, F. Petrini, A. Hoisie, and L. Gurvits, "Using Multirail Networks in High-Performance Clusters", IEEE CLUSTER'01, 2001.
[4] I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure. San Francisco, Morgan Kaufmann Publishers, 1999.
[5] GridFTP: Universal Data Transfer for the Grid, at http://www.globus.org/datagrid/deliverables/GridFTP-Overview-200201.pdf.
[6] N. Mohamed, J. Al-Jaroodi, H. Jiang, and D. Swanson, "A User-Level Socket Layer Over Multiple Physical Network Interfaces", in Proceedings of the 14th International Conference on Parallel and Distributed Computing and Systems, Cambridge, Massachusetts, pp. 810-815, November 2002.
[7] N. Mohamed, J. Al-Jaroodi, H. Jiang, and D. Swanson, "A Middleware-Level Parallel Transfer Technique over Multiple Network Interfaces", ClusterWorld Conference and Expo, San Jose, California, June 2003.
[8] N. Mohamed, J. Al-Jaroodi, H. Jiang, and D. Swanson, "Scalable Bulk Data Transfer in Wide Area Networks", International Journal of High Performance Computing Applications, Special Issue on Grid Computing: Infrastructure and Applications, Guest editor: David Walker, Volume 17, No. 3, pp. 237-248, August 2003.
[9] Solaris IP Network Multipathing (IPNMP) Data Sheet, General Microsystems, Inc., at http://docs.sun.com/db/doc/8160850/6m7adiu4a?a=view.