in the control plane, i.e., the software controlling the router. This development tends ... and the Cisco CRS-1 [13]). The 100 Tb/s router project at ... ing protocols, and network management, while FEs perform for example packet for- warding ...
Control and Forwarding Plane Interaction in Distributed Routers Technical Report March 2005 Markus Hidell, Peter Sjödin, and Olof Hagsand TRITA-S3-LCN-0501 ISSN 1653-0837 ISRN KTH/S3/LCN/--05/01--SE Laboratory for Communication Networks Department of Signals, Sensors, and Systems KTH – Royal Institute of Technology Stockholm, Sweden
Abstract. The requirements on IP routers continue to increase, both from the control plane and the forwarding plane perspectives. To improve scalability, flexibility, and availability new ways to build future routers need to be investigated. This paper presents a decentralized, modular system design for routers. The design is based on a modularization of routing functions into control elements for control functionality, such as routing protocols, and forwarding elements for packet processing. Design alternatives are discussed and a system implementation is presented. Further, we present measurements on internal communication between control and forwarding elements in an experimental platform consisting of one control element and up to 16 forwarding elements. In particular, we present measurements on different transaction mechanisms for distribution of large routing tables from control element to forwarding elements.
Keywords; Router architectures, distributed systems, performance measurements
1 Introduction The growth of the Internet in combination with the demand for new services rapidly increase the requirements imposed on network systems, such as IP routers. The growing traffic volumes require higher performance of the forwarding plane, i.e., the router’s capacity to process and forward packets. This is a challenge considering the fact that the speed of logical circuits and memories used in routers increases slower than the traffic volumes and the capacity of the optical links connecting to the router. The demand for new services is a two-folded issue. First, new services often require more operations to be performed on each packet, making it even more challenging to keep forwarding speed on a par with increasing traffic and line rates. Second, new services require new protocols in the routers, thus leading to an increased complexity in the control plane, i.e., the software controlling the router. This development tends to slow down the performance of the control plane and make it more error prone. We believe that the monolithic structure of traditional routers is an architectural limitation when it comes to meeting future requirements. Software and hardware are intertwined into a single, complex system, where a change in one subsystem may affect many other subsystems. Therefore we take the approach to investigate architectures that allow network systems instead to be composed from multiple modules (or elements), which communicate through open, well-defined interfaces. Our hypothesis is that such architectures can significantly improve scalability, flexibility, and availability: Scalability is improved because modules can be added as capacity requirements increase. Flexibility comes from the ability to dynamically add, remove and modify modules. Availability is mainly due to two factors: First, the modularity makes it possible to use redundancy and replication of critical functionality over multiple modules. Second, the modular structure in itself tends to limit the impact of faults in individual modules, and encourages sound engineering design principles. One of the main challenges in designing such a decentralized, modular network system relates to its distributed nature. The network system should appear as a single monolithic system to the outside world, even though it is composed of many interoperating elements internally. In addition, there is a certain cost associated with the decentralized structure, since the internal communication incurs overhead. The purpose of this paper is to investigate different aspects of the internal communication. Specifically, we analyze the cost in terms of time for internal communication, as well as the scalability of different mechanisms for internal communication. We also describe the system design and implementation of a decentralized modular router. The rest of the paper is organized as follows: Section 2 describes related work in this area. Thereafter, in Section 3, we outline our design of a decentralized and modular router, and discuss important design issues. In Section 4, we introduce a new protocol for internal communication, which is a part of an experimental platform of a distributed router. Section 6 deals with performance evaluation performance of internal communication mechanisms. Finally, in Section 7, we draw conclusions from the performance evaluation and discuss future research.
2 Related Work A considerable amount of work has been done on modularization and programmability of network systems in the context of active and programmable networks [1], [2], [4], [5], [6], [7], [8], [9]. A main goal in that work is to support extensibility—the ability to dynamically reconfigure network systems to support new services and applications. The work has mainly been aimed at software extensibility, to allow the packet processing software to be dynamically modified in centralized single-system architectures. In contrast, our work aims at decentralized architectures that allow for physical and logical separation of different heterogeneous modules. Exploring decentralized architectures is in line with both industry and research efforts to improve the scaling of Internet routers. Recent commercial high-performance routers are based on distributed multi-chassis solutions (e.g., the Juniper T640 [12] and the Cisco CRS-1 [13]). The 100 Tb/s router project at Stanford University [11] targets a distributed architecture where line card chassis are connected to an optical switch fabric using fiber bundles spanning over 100s of yards. However, from a logical viewpoint, these are still closed systems without open, well-defined interfaces. Within the IETF, the ForCES (Forwarding and Control Element Separation) working group [10] aims at defining a protocol for communication between control functions in a router (routing protocols, management, etc) and packet forwarding functions [17]. The purpose is to replace proprietary interfaces between control plane and forwarding plane with standardized protocols. An early proposal for the ForCES protocol, called Netlink2, has been evaluated by Goutaudier [18]. The paper reports throughput analysis of communication between one control element and two forwarding elements using Netlink2. Finally, Feamster et al. suggest an approach to separation between routing and forwarding [16]. They propose that a Routing Control Platform (RCP) should be used to separate interdomain routing from the individual routers, which mainly should be concerned with route lookups and forwarding of packets. The motivation for this work is to deal with increasing complexity in routing as well as to improve flexibility and scalability. We have similar goals, but our work aims at extending the separation to include control functionality in general, not only routing.
3 System Description and Design Issues We consider a distributed router consisting of different functional elements, as depicted in Fig. 1. Using the terminology of ForCES [10], the distributed router constitutes a Network Element (NE), which consists of Control Elements (CEs) and Forwarding Elements (FEs). CEs implement functions such as routing protocols, signaling protocols, and network management, while FEs perform for example packet forwarding, classification, traffic shaping, and metering. CE functions are typically performed in software running on a general-purpose CPU, while FEs can be based on a variety of different hardware platforms such as ASICs, FPGAs, network processors, and general purpose CPUs.
3.1 Internal Network The internal network interconnects CEs and FEs. It can be built in many different ways, e.g., using high-speed optics or high performance switches (Gigabit Ethernet or 10 Gigabit Ethernet). In principle, the internal network could even be a router-based IP-network! The purpose of the internal network is to carry control and data traffic between the elements. We choose to divide the internal network into two separate parts—one for control and one for data. One reason for this separation is that data and control traffic have different requirements. The data traffic needs high capacity and it should be possible to increase the capacity if more forwarding elements are added to the system, while the capacity requirements for control traffic are more relaxed. Another reason for the separation is that control traffic needs to be transferred between elements independently of the current data traffic situation—it is not acceptable if control traffic can be starved out by large volumes of data passing through the NE. NE – Network Element CE #1
CE #2
CE – Control Element FE – Forwarding Element
Data Control
FE #1
FE #2
External Ports
Fig. 1 System description of a Network Element (such as a router) consisting of control and forwarding elements
Another aspect of the internal network concerns the choice of transport protocol. For simplicity, it is tempting to use UDP, which would also enable the use of IP multicast. However, it turns out that there are different requirements on the internal communication. For example, some information may require reliable transport (e.g., route updates) and others may prefer timeliness rather than reliability (e.g., heartbeat messages). There may even be information that is sent from one to many that still requires reliable transport. Our hypothesis is therefore that we cannot rely solely on UDP. Instead, we expect to use a mix of transport protocols, and one of the purposes of this work is to investigate how the overhead associated with internal communication depends on the choice of protocol.
4 The Internal Protocols—Forz The internal communication protocol coordinates activities between the different elements in a distributed router. Within the IETF, there are currently protocols emerging in this area, but we found that for the moment there are no protocols available that suit all our needs. Therefore we have taken the approach to use the emerging protocols as a starting point, and add extensions for our specific purposes. We call the result Forz, which is a protocol with three main parts: association, configuration, and data transfer. 4.1 Association The association part of Forz is related to discovery and configuration. The different elements in a distributed router must be able to recognize each other and establish communication with each other. For this purpose, all elements periodically send out Hello messages. These messages are sent using UDP multicast, and contain for example the IP address and port number of the sending element. This information can then be used by other elements that wish to communicate with the sending element. Hello messages are sent when an element is initiated, and then periodically as a form of “heart beat” signals. Hello messages also contain capabilities. Capabilities represent a way for elements to declare what services and functions they support. We envision that distributed routers consist of different types of CEs, for example, and an FE need a way to know which CE is responsible for what function. One CE could for example declare that it is willing to process OSPF messages, and another that it supports ARP. 4.2 Configuration The purpose of the configuration part of Forz is to convey configuration commands and event notifications between CEs and FEs. For example, if the routing protocols in the CE compute a new best route for some destination, then the FEs need to be informed of this so that they can change their forwarding tables. Correspondingly, if a link on an FE is disabled by a hardware event, the CEs need to be notified so that traffic can be rerouted. Forz configuration is based on Netlink [14]. Netlink is a message-based Application Programming Interface (API) in UNIX for networking applications. In general, there is a number of networking objects that reside in the UNIX kernel, such as FIBs, interfaces, filter rulesets, etc. Networking applications such as routing daemons and configuration programs (“ifconfig”; “netstat”, etc) use Netlink to access those objects. In our distributed router, networking objects and networking applications reside on different machines. So the purpose of the configuration part in Forz is to carry Netlink messages from one machine to another. As an example, consider a CE running a routing daemon. If a route changes, the routing daemon will generate a Netlink message with the new route. The Forz proto-
col will pick up this message and distribute it to all relevant FEs, which in turn will update their FIBs. In a similar way, if an interface goes down on an FE, the operating system kernel on that FE will generate a Netlink message with the new interface status. Forz will take this message and distribute it over the internal network to all CEs and FEs. 4.3 Data Transfer The purpose of the data transfer part of Forz is to switch packets between FEs—when the incoming and outgoing port of a packet reside on different FEs, the packet needs to be transported across the internal network. A particular problem that occurs here is that in the general case, the IP header of the packet cannot be used for routing over the internal network. The reason is that the IP header may not be modified within the router (except for the mandatory TTL and checksum updates). Furthermore, if IP addresses are used for internal addressing, those IP addresses have to be taken from a different address space than is used for the external routing. For this reason, data packets need to be encapsulated before they are transmitted across the internal network. This could be done in different ways, such as IP in IP encapsulation, GRE, MPLS, VLAN, etc, but we chose to use application-level encapsulation where data packets are carried as a special kind of Forz messages. We chose this solution mainly for simplicity—it can be implemented entirely at the application level, without any modifications at the operating system level.
5 Implementation We have implemented a prototype version of a distributed router consisting of regular PCs connected together by two switched Ethernets (one for control and one for data). So far we have implemented one version of a CE, based on a UNIX system running Zebra open source routing software [3], as shown in Fig. 2. Zebra routing software
netlink RIB user kernel
RIB – Routing Information Base FIB – Forwarding Information Base
forwarder FIB IF
IF
IF
IF
external links
Fig. 2 PC-based monolithic router
The CE is implemented by extending Zebra with the Forz protocol. We refer to the resulting applications as Zebra-CE, which can be described as an application-level
CE. By that we mean that the CE functionality reside entirely within the application. We are currently working on system-level implementations as well, where the Forz protocol is implemented as a service at the operating system level. Zebra-CE supports manual configuration of remote network interfaces and static routes. For example, Zebra-CE takes commands, given through Zebra’s command line interface or a configuration file, and translates them to Netlink messages that are distributed through the Forz protocol (see Fig. 3). We currently have one FE implementation, which is also based on PCs running UNIX. On such an FE, there is a Forz daemon, which listens for Forz commands on the control network. When a command is received, the daemon performs the corresponding operation on the local system, such as adding a route, changing the status of an interface, etc. Likewise, when the daemon detects a local status change, it will generate a Forz message onto the control network. When the CEs and FEs are started, they will automatically form an NE (a distributed router) during the association phase. Each element has a local configuration file specifying for example the identity of the NE to which it belongs. Once the association phase is completed, the operation of the NE is completely event-driven. Different events will lead to different types of control information being distributed between the FEs and CEs. For instance, FEs will report the status of their external interfaces (such as link up/link down). The CE will receive commands for e.g., setting of IP addresses on external interfaces located at the FEs, adding or removing static routes, etc. CE
FE
routing s/w Forz
netlink
RIB
netlink
Forz
forwarder FIB ctl
ctl
Internal control network
IF
IF
external links
Fig. 3 Physical separation between control element and forwarding element
6 Performance Evaluation In a decentralized architecture, there is an extra cost due to internal communication, compared to a centralized system. We believe that the choice of protocols and communication mechanisms has large influence on the efficiency of the internal transactions. The purpose of our performance evaluation is to investigate how different
mechanisms affect the cost in terms of time spent on internal communication. We are interested in both the absolute values of the time it takes to distribute information over the internal network and in how the internal communication scales with the number of elements in the system. There are many types of transactions that can take place over the internal network. We chose to study a communication-intensive operation, where a large routing table is distributed from the control element to the forwarding elements. Through this example we can examine the performance of the internal communication in terms of number of route updates per second as well as in terms of the total time it takes to communicate routing state to the forwarding elements. 6.1 Experimental Set-Up The experimental platform consists of one CE and up to 16 FEs. The CE and FEs are based on rack-mounted PCs with 1.70 GHz Intel Pentium 4 processors, running the OpenBSD operating system. The internal network consists of three 100 Mb/s Ethernet switches (Netgear FSM726S) that interconnect the CE and the FEs. The switches are wire-speed Ethernet switches, and none of the output ports is overloaded in our measurements. Thus, there is no packet loss inside the internal network. The measurements focus on the actual communication between the CE and the FEs. We do not include the time for updating the FIB in the FEs, since the FIB update time depends on the type of forwarding element and the design of the FIB. The FIB can be implemented in a variety of ways, such as binary trees or hash tables in software based FEs, and CAMs (Content Addressable Memory) in FEs based on specialized hardware. The time it takes to update the FIB may vary significantly between different designs. Therefore we chose to only consider the transaction times introduced by Forz and the internal network. The performance measurements are run in the following way. First, the FEs and the CE are started. The CE then reads a configuration file holding 100,000 routing table entries and distributes the entries to the FEs, and we measure the time it takes to send out those entries. So far, we only consider loss-free transactions. Retransmissions and performance under different levels of packet loss are subjects to further studies. 6.2 Transport Protocols for Forz Forz can run on top of UDP as well as TCP. The choice of transport protocol depends on the type of information that is communicated over the internal network. The distribution of a routing table means that information is duplicated from a sender (control element) to many receivers (forwarding elements). Thus, it appears to be an ideal candidate for multicast transmission. To investigate this, we compare UDP multicast to two unicast mechanisms: UDP unicast, and TCP (which is always unicast). In contrast to TCP, UDP lacks mechanisms for flow control, congestion control, and segmentation. So in order to use UDP, we have to add support for those mechanisms in Forz. For example, we quickly discovered that if we use UDP without any
flow control, a large portion of the route updates are lost since the receivers (the FEs) cannot process the incoming messages fast enough. Furthermore, we also found that the packet size has significant influence on UDP performance. However, congestion control has not been needed in our measurements, since they have been performed in a controlled environment without congestion. To implement flow control at the Forz level, a simple ACK (acknowledgement) strategy is used. We introduce an ACK interval specifying the number of Forz messages a sender can send before waiting for an ACK. We also implement a form of segmentation for UDP by letting the sender accumulate several Forz messages and send those messages in one UDP datagram. This can be compared with the way TCP tries to send as large segments as possible. We experiment with different values of the ACK interval and the number of Forz messages per UDP datagram to find suitable values for our further experiments. The purpose of the first measurement is to investigate how the ACK interval affects the total time to distribute the routing table (see Fig. 4). This measurement is done with UDP multicast to 16 FEs, and 8 Forz messages accumulated in each UDP datagram.
Fig. 4 Time measurements for different ACK intervals when 8 Forz messages are sent in one UDP datagram. This example is made for multicast transfers to 16 FEs
The graph shows that a significant performance improvement can be made by increasing the ACK interval, but that the relative gain decreases with an increasing ACK interval. Further, there is an upper bound on the ACK interval: if the ACK interval is too large, packets will be lost. For example, in our experiment, we found that if more than 48 Forz messages are sent before waiting for an ACK, receiver buffers will overflow and packets will be lost. This ACK interval is specific to our configuration. A general solution would be to implement an adaptive flow control, e.g., such as sliding windows in TCP.
Fig. 5 Time measurements for different UDP datagram sizes when the ACK interval is 48 Forz messages per ACK. This example is made for multicast transfers to 16 FEs
The next measurement investigates the UDP datagram size, in terms of number of Forz messages per UDP packet. The total distribution time as a function of UDP datagram size is shown in Fig. 5. This measurement is performed with UDP multicast to 16 FEs, and the ACK interval is 48 Forz messages per ACK. The figure shows that the total time is reduced with almost a factor of two by going from one to eight Forz messages per UDP datagram, but also that there is not much to be gained by increasing the UDP datagram size beyond that. There is a small irregularity between the second and the third measurement points. This behavior can be explained by the buffer management mechanisms in the operating system (OpenBSD), which has an optimization for small packets (less than 236 bytes). Accumulating three Forz messages per UDP datagram is just above the limit for this optimization to be applicable. Based on the results of this initial investigation of ACK intervals and packet sizes, we chose to use an ACK interval of 48 Forz messages per ACK, and 8 Forz messages per UDP datagram, for the rest of our experiments. 6.3 Performance Measurements In this subsection, we study how the routing table distribution time varies with the number of FEs for the three different transport protocols: UDP unicast, UDP multicast, and TCP. We let the number of FEs vary from 1 to 16, and the results are shown in Fig. 6 For UDP, we employ a rudimentary flow control mechanism, where we let one of the FEs be responsible for sending ACKs back to the CE. This is a simplification, which works in our homogeneous environment where all FEs are identical. It would not be a generally applicable flow-control mechanism for UDP; we use it merely as a means to control UDP packet rate in our measurements.
Fig. 6 Time measurements for distributing 100,000 routing table entries to a number of FEs, using different transport mechanisms
We observe that with UDP unicast, it takes slightly more than 2 seconds to communicate the 100,000 route updates to one single FE. When the number of FEs increases, there is a linear increase in the time it takes to distribute the routing table to all FEs. One might expect that the time would increase with 2 seconds for each additional FE, but a certain level of parallelism is achieved due to the fact that a set of route updates are distributed to all FEs before the CE starts waiting for an ACK. For UDP multicast, the time is constant independently of the number of FEs, and the value is roughly the same as for UDP unicast with one FE (slightly more than 2 seconds). Thus, in this experimental set-up, the system can distribute roughly 50,000 route updates per second when the Forz protocol uses UDP multicast as the transport mechanism. For comparison, we also measure TCP. Thus, the transport is reliable and there is no need for Forz level ACKs and segmentation. The drawback is that multicast transfers cannot be used. The results show that the total time to distribute the routing table is slightly lower for TCP than for UDP unicast, which can be explained by TCP’s more efficient flow control mechanism and more optimal segmentation. 6.4 Discussion We have examined how the time for internal communication depends on the number of FEs. From a scaling perspective, the ideal result would be if the internal communication time were independent of the number of FEs, as in the case with UDP multicast. For the unicast-based transport mechanisms (UDP unicast and TCP), the total distribution time increases linearly with the number of FEs. Such an increase is far from ideal, but it is predictable and may in some cases be manageable even for large systems. Further, from our experiments we conclude that for UDP unicast together with application-level support for flow control and segmentation, the distribution time is roughly equivalent to that of TCP.
7 Conclusions and Further Work To examine the internal communication overhead, we have measured the time it takes to distribute a large routing table with 100,000 entries from a control element to a set of forwarding elements using Forz. We have also investigated how the time varies with the number of FEs for different types of transaction mechanisms. We have evaluated three different transaction mechanisms, based on multiple UDP unicast, UDP multicast, and multiple TCP unicast respectively. From our measurements, it appears that UDP multicast would be the most suitable, mainly for two reasons: First, with UDP multicast the time to distribute the routing table remains constant as the number of FEs is increased. Second, for a small number of FEs, the time is approximately the same for all three mechanisms, which indicates that there is no or little penalty for using UDP multicast even when the number of receivers is small. From our experiments we also conclude that mechanisms such as flow control and segmentation are crucial for performance. This would be in favor of TCP, which provides support for those functions at the transport level. If UDP is used, those functions have to be provided at a higher level instead. Therefore, for the purpose of our experiments, we have implemented simplified support for flow control and segmentation in Forz. The implementation of flow control and segmentation is specific to the configuration we have measured. A more general solution would be to design a reliable multicast protocol for this. Several different reliable multicast protocols have been proposed and evaluated previously (e.g., for Grid applications [15]), and it is subject to further study to find a suitable reliable multicast protocol for transporting Forz messages in distributed routers.
References 1. 2. 3. 4. 5. 6. 7. 8. 9.
F. Kuhns et al, “Design and Evaluation of a High Performance Dynamically Extensible Router”, Proceedings of the DARPA Active Networks Conference and Exposition, 5/02. D. Decasper, Z. Dittia, G. Parulkar, and B. Plattner, ”Router Plugins: A Software Architecture for Next-Generation Routers”, IEEE/ACM Transactions on Networking, 8, 2000. GNU Zebra, URL=http://www.zebra.org Y. Gottlieb and L. Petersen, “A Comparative Study of Extensible Routers”, In 2002 IEEE Open Architectures and Network Programming Proceedings, Pages 51-62, June 2002. M. Handley, O. Hodson, and E. Kohler, “XORP: An Open Platform for Network Research”, First Workshop in Networks, Princeton, New Jersey, October 2002. G. Hjálmtýsson et al, ”Dynamic Packet Processors, A new abstraction for router extensibility”, Proceedings of OPENARCH-2003, San Francisco, April 2003. S. Karlin and L. Peterson, "VERA: An Extensible Router Architecture," Computer Networks, Volume 38, Issue 3, Pages 277-293, (2002). Computer Networks. Special Issue on Programmable Networks, Vol. 38, No. 3, February 2002. IEEE Journal on Selected Areas in Communications on Active and Programmable Networks, Vol. 19, No. 3, March 2001.
10. ForCES (Forwarding and Control Element Separation) IETF Working group, URL=http://www.ietf.org/html.charters/forces-charter.html. 11. I. Keslassy, et al, “Scaling Internet Routers Using Optics”, ACM Sigcomm 2003, Karlsruhe, Germany, 2003. 12. Juniper, “T-Series Routing Platforms: System and Packet Forwarding Architecture”, White paper, 2002. 13. Cisco, “Next Generation Networks and the Cisco Carrier Routing System”, White paper, 2004. 14. J. Salim, A Kleen, and A. Kuznetsov, “Linux Netlink as an IP Services Protocol”, Internet RFC 3549, July 2003. 15. M. P. Barcellos, “Reliable Multicast for the Grid: a comparison of protocol implementations”, e-Science All-Hands Meeting, Nottingham, UK, September 2004. 16. N. Feamster, et al, “The Case for Separating Routing from Routers”, ACM SIGCOMM FDNA Workshop, Portland, Oregon, USA, August 30, 2004. 17. A. Doria et al, “ForCES Protocol Specification”, IETF Internet Draft draft-ietf-forcesprotocol-01.txt, October 2004. 18. G. Goutaudier, “Enhancements and Prototype Implementation of the ForCES Netlink2 Protocol”, IBM Research Report RZ 3482 (# 99522), September 2003.