
Reliable Multicast for Control in Distributed Routers

Markus Hidell, Peter Sjödin, and Olof Hagsand

Abstract—Growing traffic volumes and demands for new services rapidly increase the requirements imposed on network systems, such as IP routers. We argue that a decentralized modular system design would improve the scalability, flexibility, and reliability of future routers. We have designed and implemented such a distributed router, based on physical separation between control and forwarding elements. One challenge with the design concerns the internal communication between the elements constituting the router. This paper presents performance measurements of different internal transaction mechanisms between the control and forwarding planes. In particular, an existing protocol for reliable multicast has been integrated and evaluated in our experimental prototype. The prototype consists of one control element and up to 16 forwarding elements, interconnected by an internal control network based on Ethernet switches.

Index Terms—Router architectures, distributed systems, reliable multicast, performance measurements

Markus Hidell is with KTH – Royal Institute of Technology, ELECTRUM 229, SE-164 40 Kista, Sweden (phone: +46 8 790 42 51; fax: +46 8 752 65 48; email: [email protected]). Peter Sjödin is with KTH – Royal Institute of Technology, ELECTRUM 229, SE-164 40 Kista, Sweden (email: [email protected]). Olof Hagsand is with KTH – Royal Institute of Technology, SE-100 44 Stockholm, Sweden (email: [email protected]).

I. INTRODUCTION

The increasing traffic of the Internet, in combination with demands for new services, imposes requirements on routers and other network systems in terms of flexibility, reliability, and capacity. By sharing the routing functionality workload over multiple units in a distributed router architecture, it is possible to design routers that are better suited to meet the requirements for future network systems.

A distributed router appears as a single entity to the outside world, even though it internally consists of multiple interoperating elements. There are several potential benefits to such a design, but there is also a certain cost. The internal elements need to be kept synchronized so that they all share the same routing state, and the internal communication for distributing routing information takes time and occupies resources. Information that needs to be distributed to the elements can for example be forwarding tables, flow tables, and filter databases.

The control and data planes are separated in most routers, since it would not be acceptable if internal control traffic could be blocked during periods of heavy data traffic. A typical architecture is to use a specialized, high-speed backplane for data, and a much slower interconnect based on standard technology for control. This means that even in a high-capacity router, the bandwidth for internal control traffic may still be very limited. Current forwarding tables can have relatively large memory footprints, and the distribution of forwarding information to a large number of elements could potentially occupy the control network for long periods of time. Such communication takes place when routing is started or restarted, but also during normal operation when there are sudden changes in the routing topology (due to route flaps, for instance).

In this paper, we investigate mechanisms for internal communication in distributed routers. In particular, we study the communication overhead associated with sending large forwarding tables to many forwarding elements. This is done through experimentation in the distributed router testbed developed at the Royal Institute of Technology.

The communication pattern for distribution of forwarding tables consists of a sender (a router control element) that distributes the forwarding information to multiple receivers (forwarding elements). The same information is distributed to all receivers, which suggests that multicast communication could be appropriate. However, regular multicast based on UDP, IP multicast, and a data link layer with native support for multicast (such as Ethernet) may not be suitable. Such multicast is unreliable, which could give inconsistent state if, for example, some forwarding elements do not receive all forwarding information. TCP could be used instead, but since TCP is strictly unicast, the information needs to be duplicated at the sender and transmitted multiple times over the network.
This might be acceptable for small amounts of control information and for few receivers, but our experiments indicate that it gets costly for large tables in routers with several forwarding elements.

Due to these shortcomings of UDP multicast and TCP, reliable multicast is an appealing candidate for the internal communication. However, there are many different flavors of reliable multicast, and the efficiency of a generic reliable multicast protocol depends on the application (or, more specifically, on how well the application matches the assumptions underlying the protocol design). We focus on one particular reliable multicast protocol, NORM (NACK-oriented reliable multicast) [2]. NORM is intended for reliable bulk transfers in environments with support for native multicast services to a large number of receivers. In this respect, NORM appears to be suitable for distribution of forwarding tables in distributed routers, and our preliminary results are encouraging: the performance of NORM gets close to that of an optimal unreliable multicast transport mechanism, even when subject to (modest) packet loss.

The rest of this paper is organized as follows. Section II describes our distributed router system and gives a short background to the concept of distributed routers. Our measurement results are presented and discussed in Section III, and, finally, Section IV concludes the paper and outlines further work.

II. SYSTEM DESCRIPTION AND BACKGROUND

The networking system we consider is a distributed router consisting of different functional elements. Using the terminology of ForCES [8], there are two main types of elements: Control Elements (CEs) and Forwarding Elements (FEs). CEs implement functions such as routing protocols, signaling protocols, and network management, while FEs perform, for example, packet forwarding, classification, traffic shaping, and metering. Together the CEs and FEs form a Network Element (NE), that is, a distributed router, as shown in Fig. 1.

The term distributed router is sometimes used for single-chassis routers where the packet forwarding and lookup operations are performed on the line cards, which is a more limited degree of distributed functionality [3]. We use the term in the more general sense to denote a system with several independent elements, which are physically separated and interconnected by a network.

Exploring decentralized architectures is in line with both industry and research efforts to improve the scaling of Internet routers [4], [16], [18]. Recent commercial high-performance routers are based on distributed multi-chassis solutions, where line card chassis are connected to a switch fabric chassis.

In the area of software-based decentralization and modularization, a considerable amount of work has been done [5], [6], [10], [12], [13], [14], [17]. That work is focused on the programmability of network systems in the context of active and programmable networks.

The performance of control mechanisms for distributed routers has not received much attention previously. An early version of such a mechanism (Netlink2) was evaluated for a small-scale distributed system [11]. A different approach to speeding up the internal communication is to use algorithmic methods to reduce the amount of forwarding information that needs to be distributed [19].

A. System Design and Implementation

Our system design is based on the physical separation between control and forwarding, where CEs and FEs are interconnected using an internal network. The internal network carries control and data traffic between the elements.

[Figure: an NE containing CE #1 to CE #3 and FE #1 to FE #4, interconnected by an internal network]

Fig. 1 A distributed router as a Network Element consisting of Forwarding and Control Elements

CE functions are typically implemented in software running on a general-purpose CPU, while FEs can be based on different types of hardware, such as ASICs, FPGAs, network processors, and general-purpose CPUs. The internal network can be designed in a variety of ways, using for example high-speed optics or high-performance LAN switches. In fact, the internal network could even be a router-based IP network.

We have implemented different types of CEs and different types of FEs. The system used for the work presented here includes a version of a CE based on a UNIX system running the Zebra open source routing software [9], and an FE based on a general-purpose CPU. The CE supports manual configuration of remote network interfaces and static routes. The CE takes commands, given through Zebra's command line interface or a configuration file, and distributes them over the internal network to the remote FEs. The internal network is divided into two separate parts, one for control and one for data. The control network and the data network are both based on IP, and there is one single router hop between the elements.

B. Internal Communication – Forz Protocol

The physical separation into control and forwarding elements requires a communication protocol for coordination of activities between the different elements. There are protocols emerging within the IETF/ForCES in this area [8], but there are currently no implementations available that suit our needs. Therefore, we have taken the approach of using the emerging protocols as a starting point and adding extensions for our specific purposes. We call the result Forz, which is a protocol with three main parts: association, configuration, and data transfer.

In the association phase of the protocol, the constituting elements establish the communication and declare their capabilities. In particular, every element needs to know the IP addresses and port numbers used for control communication.

The purpose of the configuration part is to convey configuration commands and event notifications. For example, if a CE computes new routes for a destination, this information is conveyed to the FEs to be installed in the forwarding tables. Correspondingly, if a physical link on an FE fails, the CEs need to be notified.

The Forz configuration messages are based on the Netlink protocol [21]. Netlink is used locally on Linux systems for interaction between user space and kernel space. More specifically, the networking objects that reside in the Linux kernel, such as FIBs, interfaces, and filter rule-sets, are managed from user space using Netlink. In our distributed router, Forz encapsulates Netlink messages and adds information necessary in a distributed environment.

The data transfer part, finally, deals with the switching of data packets between FEs and with data packets that are either destined to or originating from the NE itself.

C. Transport Protocols and Reliable Multicast

There are different types of internal communication in a distributed router, with different service requirements. For example, the distribution of forwarding information requires efficient and reliable transfer of relatively large amounts of data, while simple port configuration operations are probably best performed using request-response transactions. Therefore, Forz is designed to use a variety of transport-level protocols, such as TCP, UDP, and reliable multicast. In this paper we focus on the internal communication that deals with transmission of large data objects to multiple receivers, that is, multicast distribution.

Multicast distribution can be achieved in several ways. A straightforward way is to open a TCP connection to each receiver and let the sender duplicate the data onto the connections. This may be acceptable for few receivers and small data objects, but would be prohibitive when scaling to larger systems.

A more efficient distribution mechanism can be achieved by using native multicast transmissions. The simplest way of doing this is to use UDP together with IP multicast over a data link with native multicast, such as Ethernet. The drawback with this approach is that UDP is unreliable, and as such does not guarantee that the data is correctly transmitted to all destinations. It is therefore not suitable for distribution of forwarding information: if not all forwarding elements receive the forwarding information correctly, forwarding will be inconsistent and packets may be routed the wrong way or dropped. The conclusion is that for this kind of application, we should use UDP augmented with support for reliable multicast distribution.

In general, the choice of reliable multicast protocol depends on the requirements of the target applications, which may differ significantly. In the work on standardization of reliable multicast protocols, the IETF has taken a modular approach by developing "building blocks" of protocol mechanisms, and then standardizing protocols as instantiations of combinations of building blocks. One such instantiation is NORM (NACK-Oriented Reliable Multicast) [1], [2]. NORM is based on the use of NACKs (Negative Acknowledgments) to send repair requests back to the sender when data is missing at the receiver side. To avoid NACK implosions, where many receivers simultaneously send NACKs back to the sender, NORM includes mechanisms for suppressing redundant NACK transmissions among a group of receivers. In addition, NORM provides different types of forward error correction (FEC).

NORM is designed for bulk data transfers and mainly intended for "flat" multicast topologies (in contrast to hierarchical, tree-based topologies). In our application of reliable multicast, the geographical size of the multicast group is limited and the group is also formed in a controlled environment (within the internal network of the distributed router). We feel that this environment fits well with the model underlying the design of NORM, and therefore we have incorporated support for transportation of Forz messages using NORM in our system. The NORM protocol implementation used is developed at INRIA [15], and we refer to this software module as INRIA NORM.

III. PERFORMANCE EVALUATION

In our performance evaluation, we measure the total time it takes to distribute a large routing table from one control element to a set of forwarding elements over the internal control network. We run the experiments for three different types of transport mechanisms, which can be used by the Forz protocol. These transport mechanisms are UDP over IP multicast (or UDP multicast for short), multiple TCP connections, and reliable multicast using NORM. The main purpose of the measurements is to investigate the extra cost for internal communication when distributing a large number of routes in a distributed router.

A. Experimental Set-up

The experimental set-up consists of one CE and up to 16 FEs. The CE and FEs are based on rack-mounted PCs with 1.70 GHz Intel Pentium 4 processors, running the OpenBSD operating system. The internal network consists of three 100 Mb/s Ethernet switches (Netgear FSM726S) that interconnect the CE and the FEs. The switches are wire-speed Ethernet switches, and none of the output ports is overloaded in our measurements. Thus, there is no congestion inside the internal network.
The structure of the experimental platform is shown in Fig. 2, and a photograph of the system implementation can be seen in Fig. 3.

[Figure: one CE and FE 1 through FE 16 connected through three Ethernet switches]

Fig. 2 Experimental platform, consisting of one CE and up to 16 FEs

In the experiments reported in this section, the CE reads 100,000 static routes from a configuration file and distributes the entries to the FEs via the configuration part of the Forz protocol. The time it takes for the entries to be installed on the FEs and acknowledged back to the CE is measured. The total amount of data that is distributed is 8.4 MB.
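As a rough sanity check on these figures, the sketch below derives the average size of a route entry and a lower bound on the time needed to push the table through a single 100 Mb/s link. It is only an estimate under stated assumptions: 1 MB is taken to be 10^6 bytes, protocol header overhead and processing time are ignored, and the function names are illustrative rather than part of Forz.

```python
# Back-of-the-envelope figures derived from the numbers quoted above.
# Assumptions (not from the paper): 1 MB = 10**6 bytes, header overhead
# and processing time at the CE and FEs are ignored.

TOTAL_BYTES = 8.4e6        # total data distributed (8.4 MB)
NUM_ROUTES = 100_000       # static routes read by the CE
LINK_BITS_PER_S = 100e6    # 100 Mb/s Ethernet control network


def bytes_per_route(total_bytes: float = TOTAL_BYTES,
                    routes: int = NUM_ROUTES) -> float:
    """Average number of bytes distributed per route entry."""
    return total_bytes / routes


def wire_time_multicast(total_bytes: float = TOTAL_BYTES,
                        link_bits_per_s: float = LINK_BITS_PER_S) -> float:
    """Seconds to send the table once, as native multicast would."""
    return total_bytes * 8 / link_bits_per_s


def wire_time_unicast(num_fes: int,
                      total_bytes: float = TOTAL_BYTES,
                      link_bits_per_s: float = LINK_BITS_PER_S) -> float:
    """Seconds to send one copy per FE over the sender's single link."""
    return num_fes * wire_time_multicast(total_bytes, link_bits_per_s)


if __name__ == "__main__":
    print(f"bytes per route:        {bytes_per_route():.0f}")
    print(f"multicast, any N:       {wire_time_multicast():.3f} s")
    print(f"unicast copies, N = 16: {wire_time_unicast(16):.3f} s")
```

Under these assumptions each route accounts for about 84 bytes, one copy of the table needs at least about 0.67 s of link time, and 16 unicast copies need at least about 10.8 s on the sender's link. Measured times include protocol and installation overhead, so they can be expected to be higher.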


Fig. 3 A rack with CE, FEs, and switches belonging to the internal network

B. Performance Measurements

Fig. 4 shows the measurement results for the three different transport mechanisms that we study. The graphs display the total time to distribute the routing table when we let the number of FEs vary from 1 to 16.

Fig. 4 Total distribution time for the three different transport mechanisms that have been evaluated

1) TCP

When TCP is used, the transport is reliable and there is no need for higher-level protocol functions to ensure reliable delivery of data to the receivers. The drawback is that multicast transfers cannot be used. From a scaling perspective, the ideal result would be an internal communication time that is independent of the number of FEs, as is the case with the transport mechanisms based on multicast. For TCP, the unicast-based transport mechanism, the total distribution time increases linearly with the number of FEs. Such an increase is far from ideal, but it is predictable and may be manageable for small systems. However, IETF/ForCES states the goal that the architecture should scale to hundreds or even thousands of FEs. Clearly, TCP would result in unacceptably long distribution times in such scenarios.

2) Unreliable UDP Multicast

In comparison with TCP, UDP lacks mechanisms such as flow control, segmentation, and congestion control. In our controlled environment, there is no congestion in the internal network, so congestion control is not needed. However, we quickly discovered that, with UDP multicast, flow control is needed to avoid losing information at the receivers. It was also clear that the flow control mechanism as well as the segmentation significantly affect the performance.

To implement flow control at the Forz level, a simple ACK (acknowledgment) strategy is used. We introduce an ACK interval specifying the number of Forz messages a sender can send before waiting for an ACK. We also implement a form of segmentation for UDP by letting the sender accumulate several Forz messages and send those messages in one UDP datagram. This can be compared with the way TCP tries to send as large segments as possible.

We experiment with different values of the ACK interval and the number of Forz messages per UDP datagram to find suitable values for our further experiments. When it comes to segmentation, the general rule is that the best efficiency is achieved with the largest possible datagram size, as long as it fits within the network's maximum transmission unit (MTU). The maximum number of Forz messages that fit into the MTU of the internal control network is 16. Further, there is an upper bound on the ACK interval: if the ACK interval is too large, packets will be lost. For example, in our experiment, we found that if more than 160 Forz messages are sent before waiting for an ACK, receiver buffers will overflow and packets will be lost. This ACK interval is specific to our configuration. A general solution would be to implement adaptive flow control, such as the sliding windows in TCP. To conclude, an ACK interval of 160 combined with a segment size of 16 Forz messages per UDP datagram gives the best possible performance in our experimental set-up.
We see the UDP multicast result as the upper bound on the performance that can be expected from NORM, since NORM introduces additional protocol activities to achieve reliable multicast distribution of data. For the following throughput analysis of the NORM protocol, the UDP multicast result will serve as a reference.

3) Reliable Multicast – NORM

The NORM protocol supports different types of content objects to be reliably transported. Basically, these objects can be either finite (but large) units of content (file-oriented) or stream-oriented. INRIA NORM supports the file-oriented reliable delivery service. Therefore, all Forz messages containing route updates are accumulated into one large block of 100,000 messages. Thereafter, this block is handed over to the NORM protocol module and is distributed as one large finite content object.

INRIA NORM supports user-defined segmentation into a desired UDP datagram size. We have chosen the maximum datagram size (1500 bytes including NORM, UDP, and IP headers) for our measurements. The reason is to achieve as high efficiency as possible, as was the case in the throughput analysis of UDP multicast described earlier.

The measurements shown in Fig. 4 indicate that the performance of NORM is close to that of UDP multicast. We believe that the difference is due to the additional information (control information as well as NACKs and retransmissions) that is communicated to provide reliability. However, our results indicate that this overhead has only little influence on the performance, and that NORM appears to be an efficient protocol for reliable multicast.

We also note that the total distribution time is independent of the number of receivers (FEs), which is desirable from a scaling perspective. It could be expected that the total distribution time would increase slightly with the number of FEs, since the total amount of packet losses and retransmissions would increase. This would lead to a larger total amount of data being sent from the CE to FEs. However, we have not observed any such effects in our measurements.

NORM supports both dynamic rate control through congestion control and static configuration of the transmission rate [2]. INRIA NORM supports one of these methods, namely the static rate control, so there is no automatic adjustment of the transmission rate based on feedback from the receivers. Instead, the user specifies a desired transmission rate at the beginning of a reliable multicast session. In order to study the effect of the transmission rate on the total distribution time, we experimented with different transmission rates for our scenario. The results of these measurements are depicted in Fig. 5.
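To make the notion of a statically configured transmission rate concrete, the sketch below computes the fixed spacing between datagrams that a given rate implies, and the resulting ideal transfer time. It is an illustration only: how INRIA NORM actually paces its packets is not described here, the function names are hypothetical, and NACKs, retransmissions, and header overhead are ignored.

```python
from typing import List


def inter_packet_gap(rate_bits_per_s: float, datagram_bytes: int = 1500) -> float:
    """Seconds between datagram starts for a fixed transmission rate."""
    return datagram_bytes * 8 / rate_bits_per_s


def ideal_transfer_time(total_bytes: float, rate_bits_per_s: float) -> float:
    """Seconds to send `total_bytes` once at the configured rate (no repairs)."""
    return total_bytes * 8 / rate_bits_per_s


def send_schedule(num_datagrams: int, rate_bits_per_s: float,
                  datagram_bytes: int = 1500) -> List[float]:
    """Departure times, relative to session start, under static rate control."""
    gap = inter_packet_gap(rate_bits_per_s, datagram_bytes)
    return [i * gap for i in range(num_datagrams)]


if __name__ == "__main__":
    for mbit in (50, 70, 90):
        rate = mbit * 1e6
        print(f"{mbit} Mbit/s: gap {inter_packet_gap(rate) * 1e6:.1f} us, "
              f"ideal time {ideal_transfer_time(8.4e6, rate):.3f} s")
```

Under these assumptions a higher configured rate always shortens the ideal transfer time. Any increase in measured distribution time at high rates therefore has to come from effects the model leaves out, such as packet loss and the retransmissions that follow.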

Fig. 5 Total distribution time for different NORM transmission rates

In this figure, the total distribution time decreases until we reach a level of roughly 90 Mbit/s. Above this level, the distribution time increases due to additional packet losses and retransmissions.

IV. CONCLUSIONS AND FURTHER WORK

We have evaluated three different transport mechanisms for the distribution of control information in distributed routers: TCP, unreliable UDP multicast, and reliable multicast with NORM. Our results are discouraging for using TCP in large systems with many forwarding elements, since TCP lacks support for multicast. UDP multicast is strictly speaking not an option for this kind of application, since it only provides unreliable services, but it has been included in our study since it represents a measure of the maximum achievable performance.

NORM, finally, turns out to be a promising alternative for distribution of forwarding information. It has more overhead than UDP multicast, but the measurements on our system do not indicate a significant performance penalty for this overhead. We have used a basic version of NORM with user-configured rate control, and for further work it would be interesting to study the effect of an adaptive transmission rate based on congestion control. Furthermore, we have used NACKs as the sole mechanism for error control. NORM also provides support for error control through Forward Error Correction (FEC), and it would be interesting to study the performance of NACKs in combination with FEC under different packet loss scenarios.

REFERENCES

[1] R. B. Adamson and J. P. Macker, "Quantitative Prediction of NACK-Oriented Reliable Multicast (NORM) Feedback", Proceedings of MILCOM 2002, vol. 2, pp. 964-969, October 2002.
[2] B. Adamson et al., "Negative-acknowledgment (NACK)-Oriented Reliable Multicast (NORM) Protocol", IETF Internet RFC 3940, November 2004.
[3] H. Chan, H. Alnuweiri, and V. Leung, "A framework for optimizing the cost and performance of next-generation IP routers", IEEE JSAC, vol. 17, no. 6, pp. 1013-1029, June 1999.
[4] Cisco, "Next Generation Networks and the Cisco Carrier Routing System", White paper, 2004.
[5] Computer Networks, Special Issue on Programmable Networks, vol. 38, no. 3, February 2002.
[6] D. Decasper, Z. Dittia, G. Parulkar, and B. Plattner, "Router Plugins: A Software Architecture for Next-Generation Routers", IEEE/ACM Transactions on Networking, 8, 2000.
[7] A. Doria et al., "ForCES Protocol Specification", IETF Internet Draft draft-ietf-forces-protocol-01.txt, October 2004.
[8] ForCES (Forwarding and Control Element Separation) IETF Working group, URL=http://www.ietf.org/html.charters/forces-charter.html.
[9] GNU Zebra, URL=http://www.zebra.org
[10] Y. Gottlieb and L. Peterson, "A Comparative Study of Extensible Routers", in 2002 IEEE Open Architectures and Network Programming Proceedings, pp. 51-62, June 2002.
[11] G. Goutaudier, "Enhancements and Prototype Implementation of the ForCES Netlink2 Protocol", IBM Research Report RZ 3482 (# 99522), September 2003.
[12] M. Handley, O. Hodson, and E. Kohler, "XORP: An Open Platform for Network Research", First Workshop in Networks, Princeton, New Jersey, October 2002.
[13] G. Hjálmtýsson et al., "Dynamic Packet Processors, A new abstraction for router extensibility", Proceedings of OPENARCH-2003, San Francisco, April 2003.
[14] IEEE Journal on Selected Areas in Communications on Active and Programmable Networks, vol. 19, no. 3, March 2001.
[15] INRIA – MCL website, http://www.inrialpes.fr/planete/people/roca/mcl/mcl.html.
[16] Juniper, "T-Series Routing Platforms: System and Packet Forwarding Architecture", White paper, 2002.
[17] S. Karlin and L. Peterson, "VERA: An Extensible Router Architecture", Computer Networks, vol. 38, no. 3, pp. 277-293, 2002.
[18] I. Keslassy et al., "Scaling Internet Routers Using Optics", ACM Sigcomm 2003, Karlsruhe, Germany, 2003.
[19] B. C. Kim et al., "Dynamic Aggregation Algorithms for Forwarding Information in Distributed Router Architecture", in Proc. HPSR 2003, 2003 Workshop on High Performance Switching and Routing, Torino, Italy, 2003.
[20] F. Kuhns et al., "Design and Evaluation of a High Performance Dynamically Extensible Router", Proceedings of the DARPA Active Networks Conference and Exposition, 5/02.
[21] J. Salim, A. Kleen, and A. Kuznetsov, "Linux Netlink as an IP Services Protocol", Internet RFC 3549, July 2003.
