Extensible Routers for Active Networks Nadia Shalaby Larry Peterson Andy Bavier Yitzchak Gottlieb Scott Karlin Aki Nakao Xiaohu Qie Tammo Spalink Mike Wawrzoniak
{nadia,llp,abc,zuki,scott,nakao,qiexh,tspalink,
[email protected] Department of Computer Science Princeton University
Abstract

This paper describes our effort to build an extensible router in support of active networks. Our work is driven by two goals: (1) supporting the injection of new functionality into a router, and (2) exploiting commercially available hardware. Our approach is a hierarchical architecture, in which packet flows traverse a range of processing/forwarding paths. This paper both presents the architecture and describes our experiences implementing the architecture across a combination of general-purpose and network processors.
1. Introduction

Much of the success of the current Internet is due to the elegance of its architecture: routers in the middle of the network simply forward packets to implement a best effort service, while hosts at the edges of the network address the more complex end-to-end issues (reliability, ordered delivery, security), as well as implement application programs. This simple architecture has proven very robust, which has allowed the Internet to grow to the scale it is today. However, there is now a clear trend towards extending the set of functions that network routers support beyond the traditional forwarding service. For example, routers are programmed to filter packets, translate addresses, make level-n routing decisions, broker quality of service (QoS) reservations, thin data streams, run proxies, support computationally-weak home electronic devices, serve as the front-end to scalable clusters, and support application-specific overlay networks. Active networks are a generalization of this trend, in which users are allowed to inject code into the network, thereby tailoring the network in an application-specific way [2, 44, 49, 53].

Our work has focused on extensible routers, an enabling technology for both active networks and the myriad of other services migrating into the network. In addition to supporting extensibility, we have adopted a strategy that employs COTS hardware rather than a custom design. The result is a multi-level processor and execution hierarchy. The hierarchical framework spans all the paths a packet traverses, from the ports, across the processor hierarchy of a network processor, through the kernel space of a general-purpose processor, all the way to the active services and standard applications [39].

The main contribution of this paper is to conceptually partition an extensible router into a hierarchy of execution levels, where hardware and software are partitioned in concert. While the basic functionality at every execution level remains the same—forwarding and computing on packets—each level commands its own set of requirements and limits on system resources and code security. As such, we are able to introduce a unified framework addressing the central issues of resource allocation and scheduling across the entire execution level hierarchy.

The structure of this paper is as follows. Section 2 presents the unifying architecture, called VERA, and illustrates how it naturally yields an extensible router of a hierarchical nature. Moving from the bottom of the hierarchy up, Section 3 first describes the network processor level, and Section 4 describes the general-purpose processor level. Finally, Section 5 addresses the scope and functionality of the control plane, while Section 6 concludes the paper.

This work was supported in part by DARPA contract N66001–96–8518, NSF grant ANI-9906704, and Intel Corporation.
2. Architecture

We recognize two trends in router design: increasing pressure to extend the set of standard and active services provided by the router, and increasing diversity in the hardware components used to construct the router. As a consequence, it is becoming increasingly difficult to map the services onto the underlying hardware. We therefore define a virtual router architecture, called VERA [27], that hides the hardware details from the forwarding functions. VERA is intended to be extensible in adding new router functionality, to be compliant with RFC 1812 [6], and to allow for efficient implementations on the given hardware.
Figure 1 shows how the VERA framework constrains and abstracts the essence of both the routing function space and the hardware configuration space. VERA consists of a functional router abstraction, a hardware router abstraction, and a distributed router operating system. Each abstraction must be rich enough not to constrain its corresponding design space, yet remain at a level high enough to enable modeling and reasoning about the system. Thus, each abstraction also defines a programming interface (API). In this section, we cover the functional router and hardware router abstractions in more detail, illustrate how both abstractions naturally yield an extensible router of a hierarchical nature, and then show a cross-section of our prototype implementation with four distinct execution levels, resulting in four corresponding packet paths through the execution level hierarchy.

Figure 1. VERA constrains the space of routing function implementations and hardware exposure for a mapping between the two

2.1. Functional Router Abstraction
Packets flow from an input port to an output port. We define a switching path as the instantiation of a forwarder, along with its associated input and output ports. Figure 2 illustrates the four components of the functional router abstraction: a queue, with any queueing discipline; a forwarder, which applies a forwarding function to a packet; a classifier, which maps network flows onto switching paths; and a scheduler, which selects one of its non-empty input queues to send to the output port.

Figure 2. Functional router components
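To make these four components concrete, the sketch below shows one plausible C rendering of the abstraction. The type names and signatures are ours, chosen for illustration only; they are not VERA's published interface.

/* Illustrative sketch of the four functional router components.
 * All names (vera_queue, vera_forwarder, ...) are hypothetical. */
#include <stddef.h>

struct packet;                         /* opaque packet buffer */

struct vera_queue {                    /* queue: any queueing discipline */
    int (*enqueue)(struct vera_queue *q, struct packet *p);
    struct packet *(*dequeue)(struct vera_queue *q);
    int (*is_empty)(const struct vera_queue *q);
    void *state;
};

struct vera_forwarder {                /* forwarder: applies a forwarding function */
    int (*forward)(struct vera_forwarder *f, struct packet *p);
    void *state;                       /* e.g., flow-specific state */
};

struct vera_classifier {               /* classifier: maps flows onto switching paths */
    /* returns a switching-path index, or -1 to pass the packet up to the
     * next (slower) classification level */
    int (*classify)(struct vera_classifier *c, const struct packet *p);
    void *state;
};

struct vera_scheduler {                /* scheduler: picks a non-empty input queue */
    struct vera_queue *(*select)(struct vera_scheduler *s,
                                 struct vera_queue **queues, size_t nqueues);
    void *state;
};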
At router initialization time, each port has an associated classifier and scheduler. In addition, there is an initial set of pre-established switching paths. To support QoS flows and extensions, we provide API calls that allow paths to be dynamically created and removed by other paths, as well as to update the classifier (for example, to install new routing tables). In [18], we have employed this abstraction to model, evaluate and compare several different extensible router architectures, namely Princeton's Scout [33], MIT's Click router [28] and Washington University's Router Plugins [15]. These were contrasted with a best effort IP router and with a general extensible router, which we model in Figure 3 as an example. This figure depicts the various flows a packet might traverse from a router's input port to an output port.

Figure 3. Model of an extensible router
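The dynamic path operations just described might be declared roughly as follows; these signatures are an assumption for illustration and are not the exact VERA API calls.

/* Hypothetical shape of the switching-path management API. */
#include <stddef.h>

struct vera_forwarder;                 /* as sketched above */
typedef int path_id;

/* instantiate a switching path: input port, output port, forwarding function */
path_id vera_create_path(int in_port, int out_port, struct vera_forwarder *fwd);

/* tear down a previously created path */
int vera_remove_path(path_id id);

/* update a port's classifier, e.g., to install a new routing table */
int vera_update_classifier(int port, const void *rules, size_t rules_len);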
A major aspect of router extensibility is the hierarchical nature of packet classification. For example, a first-step classifier might identify malformed packets and ignore them altogether, passing the rest up the hierarchy. At the next level, a packet might be compared against a cache of known flows to determine its path within the router. Other packets are sent further up the classification hierarchy. At the next level, most routers run a prefix matching algorithm that maps packets based on some number of bits in the IP header [16, 29, 48, 51]. Eventually, packets move up to the highest level for full classification.

Figure 4. Classification on a hierarchically extensible router

This hierarchical structure of packet classification is modeled in Figure 4. The earlier a packet gets classified into a forwarding path, the faster it leaves the router. At the earliest stage, packets traverse the fastest paths; at the latest stage, packets will have taken the slowest path in the system.
2.2. Hardware Router Abstraction

This is a typical hardware abstraction layer (HAL) that defines an interface between the hardware and the device-independent software, such as the VERA operating system in Figure 1. We define three components for the VERA hardware router abstraction: virtual processors, which model a single processor, a symmetric multiprocessor, or a complex/hybrid processor such as the IXP 1200 [23]; ports, which model the device drivers of the media access controller (MAC) chips, each owned by a particular processor; and switches, which model passive shared communication devices such as memory or PCI busses. VERA's switch abstraction provides an interface for interprocessor data movement and for distributed queues whose head and tail are on different processors, thus enabling the distributed router OS to implement interprocess communication and message passing.
Figure 5. Testbed based on a Pentium III motherboard with both a PMC694 NIC and an IXP1200 Ethernet evaluation board

Consider Figure 5, where we depict our development router testbed, consisting of a commodity motherboard connected to two different off-the-shelf network interface cards (NICs) using a standard PCI bus, with an aggregate of ten ports. The motherboard is an Intel CA810E with a Pentium III CPU. The first NIC is a RAMiX PMC694 [41], containing its own 266 MHz PowerPC processor, two 100 Mbit/s Ethernet ports, and 32 Mbytes of memory. The second NIC is an Intel IXP1200 Ethernet Evaluation Board [24], containing a 199 MHz IXP1200 network processor, eight 100 Mbit/s Ethernet ports, and 32 Mbytes of memory.
Figure 6. The hardware abstraction of Figure 5. The solid lines indicate packet flow paths. The dashed arrows show the processor hierarchy.

To demonstrate the hardware router abstraction, we map Figure 5 onto Figure 6, which depicts the router topology by directly connecting the router components into a graph. By ignoring the switches and considering only the processors and ports, we can find a spanning tree with the master processor at the root and all the ports as leaves. This spanning tree is called the processor hierarchy. The HAL API defines such functions as those responsible for data movement between local and remote addresses, as well as those allocating and managing the distributed queues mentioned earlier.

Extensible routers built from a set of components thus provide us with a processor hierarchy. Furthermore, on a single processor, we can further subdivide execution into isolated components that are given a separate pool of resources, or that are segregated along a kernel/user space protection boundary; we loosely call these execution levels. We therefore establish a processor hierarchy, which subsumes an execution hierarchy, which in turn subsumes a classification hierarchy. In other words, within a hierarchically extensible router, one processor may span multiple execution levels, each of which may have several levels of classification. For example, Figure 4 depicts an extensible router with a one-to-one mapping of the execution hierarchy to the classification hierarchy; that is, each successive classification function is mapped onto a successively higher execution level, where packets traverse a successively slower path.
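As a rough illustration of the kind of interface this HAL exposes, data movement and distributed-queue management might be declared as follows. The names and signatures below are ours, not VERA's exact HAL API.

/* Hypothetical sketch of HAL calls for data movement and for distributed
 * queues whose head and tail live on different processors. */
#include <stdint.h>
#include <stddef.h>

typedef uint32_t proc_id;              /* virtual processor in the hierarchy */
typedef uint32_t dqueue_t;             /* handle for a distributed queue     */

/* move len bytes between local memory and a remote processor's memory */
int hal_copy_to(proc_id dst, uint64_t remote_addr, const void *local, size_t len);
int hal_copy_from(proc_id src, uint64_t remote_addr, void *local, size_t len);

/* allocate a queue produced on head_proc and consumed on tail_proc */
dqueue_t hal_dqueue_alloc(proc_id head_proc, proc_id tail_proc, size_t nelems);
int hal_dqueue_enqueue(dqueue_t q, uint64_t packet_addr);   /* non-blocking     */
int hal_dqueue_dequeue(dqueue_t q, uint64_t *packet_addr);  /* 0 if queue empty */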
2.3. Implementation: The Grand Scheme

The remaining component of the VERA architecture is the distributed router operating system. The distributed router operating system provides an execution environment for the forwarding functions that ties together the router abstraction and the hardware abstraction, a computation abstraction in the form of a thread API, and a memory abstraction in the form of both a buffer management API and an internal routing header datatype.
Figure 7. The cross section of an extensible router implementation: packets traverse at most four execution levels, traveling from network processor ports through the two levels of execution contexts of the IXP, through the Linux kernel, to standard and active applications in Linux user space. All levels perform the extensible router functions: classification among various flows, forwarding and scheduling the flows for output. The difference lies in the granularity of the forwarding function: from a string of code blocks in microcode, to a C function, to scout paths, to an entire application program. The Vera driver bridges the Pentium to the IXP 1200 across the PCI bus.
¹ Although, as shown in Figure 5, we have implemented our extensible router for both the RAMiX PMC694 and the IXP 1200 boards, Figure 7 portrays the abridged version, showing only the IXP 1200. This choice was made for the sake of clarity, as well as to highlight that the extra execution level provided by the IXP falls naturally into our hierarchically extensible framework.
We show a prototype implementation of the VERA architecture for a hierarchically extensible router in Figure 7, where the distributed router OS is the software residing on all levels of the processor hierarchy, tying them together to provide the router functionality. Figure 7 portrays a mapping of the router's execution levels onto the processor hierarchy, where we connect a network processor, the IXP 1200, to a general-purpose processor, the Pentium III motherboard, via a PCI bus. We will demonstrate that our design and implementation exhibit nearly an order of magnitude improvement in performance over a pure PC-based router, at a cost of roughly US$1500, based on an estimated US$700 for an IXP1200 board produced in low volume.

The key block accomplishing this connection across the PCI bus is the Vera driver, which acts as a bridge between the lower and higher execution levels of the hierarchy. From "below", the Vera driver interfaces to a PCI bus encapsulation module, enabling connection to any system area network (SAN) switch (e.g. Infiniband [22]), programmable line card (e.g. Alteon ACEnic [3], RAMiX PMC694 [41]), or network processor (e.g. Vitesse IQ2000 [50], IBM PowerNP [20], or the IXP 1200 [23], a case in point).¹ From "above", the Vera driver can either connect directly to the Linux network driver interface, enabling all applications invoking Linux networking to interface to Vera just as they do to other Ethernet or 802.11 network drivers; or it can connect to its own, richer Vera interface, where control and data are separately instantiated. This latter Vera interface is used by SILK, to be covered in Section 4.2.

Figure 7 emphasizes the uniformity of the router functionality across all execution levels of the hierarchy. Each level executes its own threads, and provides an encapsulation of the switching interface to the levels above and below it. Within the boundaries of this encapsulation, we observe the functional diagram of the hierarchically extensible router of Figure 4 mapped onto the distinct execution levels. Thus, at each level, a packet is classified into a particular flow locally, or sent up the execution hierarchy to the next classifier. A packet scheduler then picks a packet among the various flows to send further down the hierarchy, eventually reaching an output port. This packet path hierarchy is illustrated in Figure 8. At every execution level of the hierarchy, each flow is the instantiation of some forwarding function. The difference between the execution levels lies in the granularity of the forwarding function: from a string of code blocks in microcode at the MicroEngine context, to a C function at
the StrongARM context, to Scout paths in the Linux kernel space, to an entire application program in the Linux user space.

Figure 8. Four potential paths along the system hierarchy

Our prototype router implements both the data plane that forwards packets, and the control plane where protocols like RSVP, OSPF, LDP and BGP run. From the perspective of Figure 7, we make two distinctions between the data plane and the control plane. The first distinction is that the data plane must process packets at line speed, while the control plane is expected to receive far fewer packets (for instance, whenever routes change or new connections are established). The requirement that the data plane run at line speed is based on the need to receive and classify packets as fast as they arrive, so as to avoid the possibility of priority inversion; that is, not being able to receive important packets due to a high arrival rate of less important packets. The second distinction between the data and control planes is how much processing each packet requires. At one extreme, the data plane does minimal processing (e.g., IP validates the header, decrements the TTL, recomputes the checksum, and selects the appropriate output port). At the other extreme, the control plane often runs compute-intensive programs, such as the shortest-path algorithm to compute a new routing table. However, these are just two ends of a spectrum. In between, different packet flows require different amounts of processing, such as evaluating firewall rules, gathering packet statistics, processing IP options, and running proxy code. Note that this processing can happen in the data plane, in the sense that it is applied to every packet in a particular flow.

Taking both packet arrival rates and per-packet processing costs into account, the key is deciding at which execution level each processing step should run. At each execution level, the packet has access to some number of cycles, but there is overhead involved in reaching those cycles. Higher levels (e.g., the Pentium) offer more cycles, but packets also consume resources at lower execution levels to access them. Lower levels (e.g., the MicroEngines) have enough cycles to perform only certain operations at line speed. Throughout the paper, we address the resource allocation and scheduling problems across this execution level hierarchy. Overall, Figure 7 spans four execution levels, from the input/output ports all the way up to standard and active applications, thus providing a natural road map for the rest of the paper. Two execution levels physically reside on the IXP 1200, namely the MicroEngines' and StrongARM contexts, which we cover in Section 3. The Linux kernel space, which resides on the Pentium, is covered in Section 4. Finally, the control plane, which traditionally resides at the user space level, is addressed in Section 5.

3. Network Processor Level

Network processors are designed to operate under severe performance requirements. For example, a network processor assigned to an OC–48 link (2.5 Gbps) has to process up to 6.1M minimum-sized packets per second (pps). Copying an OC–48 bit stream into and out of memory requires 2 × 2.5 Gbps = 5 Gbps of memory bandwidth. Network processors commonly employ parallelism to hide memory latency. Our prototype network processor, the Intel IXP1200, contains six MicroEngines, each supporting four hardware contexts. The intention is that during regular execution one of these contexts is doing real work while the others are blocked on (hiding) a memory operation. The IBM PowerNP and Vitesse IQ2000 use similar designs [20, 50].

This section summarizes the design decisions and implementation results of [45], highlighting two key contributions. The first is that, despite our stringent requirement of running at line speed, we are able to extend our basic processing to running forwarders from a set of typical and often complex router services. This is achieved by carefully budgeting the router's resources and isolating the fixed infrastructure required by every packet from possible extensions (thus fully exploiting the IXP1200's parallelism), as well as by separating the data plane from the control plane and running the latter at higher execution levels. The second contribution is a direct benefit of our design of isolating the fixed infrastructure from the programmable extensions: this enables processing packets within the calculated minimal budget, which permits static allocation of the network processor's resources. This, in turn, makes the router impervious to variable workloads.

3.1. IXP 1200 Architecture

Our router runs on a PC using a 733 MHz Pentium III processor with the IXP1200 evaluation system illustrated in Figure 9 plugged into one PCI slot. The board consists of an IXP1200 network processor chip (shaded area), 32 MB of DRAM, 2 MB of SRAM, 4 KB of on-chip Scratch memory, a proprietary 64-bit 66 MHz IX bus, and a set of media access controller (MAC) chips implementing ten Ethernet ports (8 × 100 Mbps + 2 × 1 Gbps). Not shown is a 32-bit 33 MHz PCI bus interface.

Figure 9. Block diagram of the IXP 1200 evaluation system

The IXP1200 chip itself contains a general-purpose StrongARM processor core and six special-purpose MicroEngine cores, all running at 200 MHz (5 ns cycle time). Each of the six MicroEngines supports four hardware contexts for a total of 24 contexts. Not shown in the figure is a 4 KB instruction store (ISTORE) associated with each MicroEngine. The StrongARM is responsible for loading these MicroEngine instruction stores. As for the StrongARM itself, it fetches instructions from a 4 KB I-cache backed by the IXP's DRAM. The chip also has a pair of FIFOs used to transfer packets to and from the network ports across the IX bus. Each "FIFO" is an addressable 16-slot × 64-byte register file. It is up to the programmer to use these register files so that they behave as FIFOs.

Although not explicitly prescribed by the architecture, the most natural use of the DRAM is to buffer packets. This is a function of size (32 MB), but also of speed. The DRAM is connected to the processor by a 64-bit data path, implying a potential to move packets into and out of DRAM at 6.4 Gbps. In theory, this is sufficient to support 2 × (8 × 100 Mbps + 2 × 1 Gbps) = 5.6 Gbps, the total send/receive bandwidth of the network ports available on the evaluation board, although this rate exceeds the 4 Gbps peak capacity of the IX bus. Similarly, SRAM is a natural place to store the routing table, along with any necessary per-flow state. The SRAM data path has a peak transfer rate of 32-bit × 100 MHz = 3.2 Gbps.
3.2. Microengine Level

At this execution level, the forwarding function consists of code blocks strung together in memory. In order to ensure that our design can always process packets at line speed, we first engineer the fixed infrastructure needed to forward minimal-sized packets through the system without any packet processing—that is, we run only a null forwarder. Because we do not consider actual forwarders (including the forwarder that implements IP) until Section 3.5, this discussion is largely independent of IP, and so applies equally well to a router that supports, for example, MPLS [14]. The discussion focuses on the most interesting aspect of the IXP1200, namely managing the MicroEngines' parallel contexts. Although the description is necessarily tied to the details of the IXP1200, we believe the engineering decisions we made apply generally to any parallel, software-based switch. It turned out that many of the issues we faced have direct analogs in managing hardware switching fabrics, which are inherently parallel.

The common unit of data transferred through the IXP1200 is a 64-byte MAC-Packet (MP). As each packet is received, the MAC breaks it into separate MPs; tags each MP as being the first, an intermediate, the last, or the only MP of the packet; and stores the MP in an input FIFO slot. Similarly, the individual MPs that make up a packet must be loaded into output FIFO slots to be transmitted by the MAC. Since only a fixed number of input and output FIFO slots are available (16 of each), it is necessary to allocate slots to MAC ports, and it is the responsibility of the forwarding code running on the MicroEngines to drain the input slots and fill the output slots at a rate that keeps pace with each port's line speed. To avoid port contention, where two or more incoming packets are destined for the same output port, packets are placed into queues, and these queues are serviced asynchronously [30]. Within this framework, each MP must be processed upon arrival at an input port (this implements the router classifier and forwarder functions), pass through a processing pipeline to switch from an input context to an output context, and then be processed for its destination output port (this implements the router scheduler function). All the while, the MP is serviced by our particular queuing discipline. The manner in which our implementation allocates the MicroEngines' resources directly impacts processing speeds. Each of these aspects is summarized below.

3.2.1 Processing pipeline

MP processing is implemented as a two-stage pipeline, illustrated in Figure 10. The two pipeline stages are implemented using disjoint sets of MicroEngine contexts, with MPs transferred between the stages via DRAM. SRAM holds the actual queue data structure (each element in a queue is the address in DRAM where the packet is buffered). Assigning different contexts to each stage prevents MicroEngines from being idle during the time a packet is queued.

Figure 10. MicroEngine processing pipeline

3.2.2 Input processing

Each context assigned to input processing first determines whether a new MP has arrived on an input port. If so, a DMA state-machine copies the MP from the off-chip port memory into the on-chip input FIFO. Since there is only one DMA state-machine on the IXP1200 and requests to it are not hardware-serialized, we implemented a mutual exclusion mechanism, via token passing, to allow multiple MicroEngine contexts to safely execute input loops in parallel. Once the MP is in the FIFO, the MicroEngine copies the MP into its registers and performs all protocol-specific packet header or content modifications, namely classification and forwarding. (The null forwarder only modifies the destination MAC address.) Clearly, the processing of the first MP in a packet must determine the destination of the packet. Once the input stage has produced the first MP of a packet, the output stage may start processing it immediately and must have complete destination information available. Protocol processing is performed for each MP to facilitate operations that modify the entire packet, or that manipulate packet headers lying deeper into the packet. After protocol processing, the (possibly modified) MP is copied from registers to DRAM. Exceptional packets, for example those that incur a miss in the routing table or involve additional processing (e.g., IP options), are placed by the classifier code into a queue that is serviced by the next execution level, namely the StrongARM, instead of the usual output process.

3.2.3 Output processing
For each output port, we implement a scheduler which chooses a non-empty queue from among the set of queues associated with that port. A packet descriptor is then dequeued from the chosen queue. For each MP of the packet, the DRAM address of the MP is calculated, an available
output FIFO slot is selected, the MP is copied from DRAM to the FIFO, and the FIFO slot is activated to schedule a DMA from the on-chip FIFO memory to the actual off-chip port memory. Unlike the input FIFO, the slots of the output FIFO are strictly ordered and the DMA machine that moves data from the FIFO to network device memory consumes the slots in a circular fashion. To serialize output contexts, we use a token passing loop identical to that used by the input process.

3.2.4 Queuing Discipline

For each packet, protocol processing on the first MP chooses a destination queue for the whole packet. In our system, queues are contiguous circular arrays of 32-bit entries in SRAM. Head and tail pointers are simply indexes into the array, and they are stored in Scratch memory. Buffer pointers are inserted into the queue at the head and removed at the tail. Each output port must have one or more queues associated with it. If multiple queues are assigned to a single output context, the context may occasionally have multiple packets available for transmission. Our scheduler prioritizes the queues, such that each context drains its queues in priority order. However, any other policy implementable with little computation, which does not require looking deeper into the queues, can be used (e.g., round robin). If more complex packet scheduling policies are needed, they must be implemented by the input contexts. When multiple queues are available at each output context and these have fixed priority levels, the larger computing capacity available in input-side protocol processing could be used to select the appropriate priority queue and thereby approximate more complex schemes, such as weighted fair queuing. If input contexts share queues, contention can be managed by using the IXP1200's hardware mutex support operations (which are non-blocking) for mutually exclusive access to special SRAM regions.

3.2.5 Static Resource Allocation

Since it is impossible to fully predict packet traffic or arrival times, for the sake of robustness we must assume that packets arrive at line speed. This means we must be able to execute the input loop once for each MP at the maximal rate the system is being designed to support. Additionally, if more than one context is running the output loop concurrently, they need to cooperate to obey the FIFO ordering. Moreover, output contexts need to synchronize in servicing multiple queues. Finally, if multiple MicroEngine output contexts are servicing queues for the same output port, additional synchronization is required to ensure that all of the MPs for one packet are sent before those of the next packet.
We address these issues by choosing static allocation on all counts, which guarantees robustness and is almost always a performance win. Thus, a set of MicroEngine contexts (sufficiently large to meet the line speed requirements) is statically allocated to run only the input loop. By constraining input processing to use at most 16 of the 24 available contexts, we have a simple assignment of FIFO slots to contexts. Similarly, FIFO slots are statically allocated to output contexts. To avoid both additional synchronization costs and reading the tail pointer from memory, queues are also statically assigned to output contexts. This allows the output contexts to keep the queue tail pointers in registers and saves multiple memory operations on each loop iteration. Alternatively, if each output context services only a single queue, memory accesses to the queue head pointer might be avoided by batching packet transmissions. Finally, our development board has enough ports to enable us to statically allocate ports to contexts.
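The queue structure and fixed-priority drain described above can be summarized by the following C sketch; ordinary memory stands in for SRAM and Scratch, and the queue size is illustrative.

/* Sketch of one packet queue: a contiguous circular array of 32-bit DRAM
 * buffer addresses (SRAM on the IXP1200), with head/tail indexes (Scratch). */
#include <stdint.h>
#include <stdbool.h>

#define QSIZE 256                       /* illustrative queue length */

struct mp_queue {
    uint32_t slots[QSIZE];              /* stands in for the SRAM array   */
    uint32_t head;                      /* insertion index                */
    uint32_t tail;                      /* removal index                  */
};

static bool q_enqueue(struct mp_queue *q, uint32_t dram_addr)
{
    uint32_t next = (q->head + 1) % QSIZE;
    if (next == q->tail)
        return false;                   /* full: caller drops the packet  */
    q->slots[q->head] = dram_addr;      /* insert at the head ...         */
    q->head = next;
    return true;
}

static bool q_dequeue(struct mp_queue *q, uint32_t *dram_addr)
{
    if (q->tail == q->head)
        return false;                   /* empty                          */
    *dram_addr = q->slots[q->tail];     /* ... remove at the tail         */
    q->tail = (q->tail + 1) % QSIZE;
    return true;
}

/* Output-side scheduler: drain a port's queues in fixed priority order
 * (queues[0] is the highest priority). */
static int q_select(struct mp_queue *const *queues, int nqueues)
{
    for (int i = 0; i < nqueues; i++)
        if (queues[i]->tail != queues[i]->head)
            return i;
    return -1;                          /* all queues empty               */
}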
3.3. StrongARM Level

Deciding what forwarders to run on the StrongARM is complicated by the fact that the StrongARM must support the Pentium and by the fact that it shares SRAM and DRAM bandwidth with the MicroEngines. This means an arbitrary forwarder running on the StrongARM has the potential to interfere with the MicroEngines' ability to forward packets at line speed. As a consequence, the StrongARM must run within the same resource budget as the MicroEngines. It is for this reason that we elect not to run a general-purpose OS like Linux on this processor. Instead, the StrongARM runs a minimal OS that does three things: (1) it acts as a bridge that forwards packets to the Pentium, (2) it supports a small collection of local forwarders, and (3) it runs a piece of the code injector, which manages injecting forwarding code from higher execution levels into the MicroEngines.

At the execution level of the StrongARM, we demonstrate that standard 1500-byte packets can be processed at a rate of 43.6 Kpps, leaving the StrongARM with 4200 excess cycles for implementing a forwarding function. This allows the forwarder to be an entire C function, as depicted in Figure 7. For 64-byte packets, however, the StrongARM would exhibit a rate of 526 Kpps, but be left with no excess cycles for extra processing. Since in this scenario programming the StrongARM is straightforward, we only address the interface aspects of our implementation; namely, moving packets between the StrongARM and the MicroEngines, as well as between the StrongARM and the Pentium. Details of the forwarding code injector are discussed in Section 5.1.
3.3.1 MicroEngine/StrongARM Interface

The StrongARM can directly access DRAM, so packets are available for it to compute on with minimal additional overhead – namely the cost of a MicroEngine signalling the StrongARM to inform it that a packet is available. An input context processes the packet as usual, but upon detecting that the packet requires service by the StrongARM (e.g., there is a miss in the route cache or the packet contains IP options), it enqueues the packet in a StrongARM-specific queue instead of a queue assigned to an output port. At this point, we have two options: interrupt the StrongARM, or let the StrongARM poll to see if any packets have arrived. In both cases, the StrongARM dequeues the next packet from this queue, classifies it to a particular flow for forwarder processing, and then schedules the packet on the appropriate output queue. We measured the maximum rate at which the StrongARM can process packets by having it run a null forwarder, with the MicroEngine input contexts programmed to pass all their packets to the StrongARM. With this configuration, we achieve a maximum forwarding rate of 526 Kpps using polling, while interrupts yielded a significantly slower rate.
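A minimal sketch of the polling alternative on the StrongARM follows; the queue type and the callbacks stand in for the real dequeue, classification, forwarding, and output-scheduling code, so all names here are placeholders.

#include <stdint.h>
#include <stddef.h>

struct sa_queue;                                   /* StrongARM-specific packet queue */
struct pkt { uint32_t dram_addr; int flow; };

typedef int  (*dequeue_fn)(struct sa_queue *q, struct pkt *p);  /* 0 if empty          */
typedef int  (*classify_fn)(struct pkt *p);                     /* packet -> flow index */
typedef void (*forward_fn)(struct pkt *p);                      /* per-flow C forwarder */
typedef void (*schedule_fn)(struct pkt *p);                     /* hand to output queue */

/* Poll the MicroEngine-to-StrongARM queue and dispatch each packet. */
void strongarm_poll_loop(struct sa_queue *q, dequeue_fn deq, classify_fn classify,
                         forward_fn *forwarders, size_t nfwd, schedule_fn schedule)
{
    struct pkt p;
    for (;;) {
        if (!deq(q, &p))
            continue;                   /* nothing arrived; keep polling           */
        int flow = classify(&p);
        if (flow >= 0 && (size_t)flow < nfwd)
            forwarders[flow](&p);       /* run the chosen forwarding function      */
        schedule(&p);                   /* enqueue on the appropriate output queue */
    }
}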
3.3.2 StrongARM/Pentium Interface

Consistent with the uniform classification framework of our hierarchy, exceptional packets on the StrongARM, such as those pertaining to the control plane or those resulting in a miss in the route cache, are classified to a higher execution level, namely the Linux kernel level on the Pentium. Packets are moved between the IXP1200 and the Pentium over the PCI bus by means of the IXP1200's DMA engine and queue management hardware registers supporting the Intelligent I/O (I2O) standard [25]. For each logical queue from the IXP1200 to the Pentium, the implementation uses a pair of I2O hardware queues: one contains pointers to empty buffers and the other contains pointers to full buffers in Pentium memory. Moving packets from the Pentium to the IXP1200 works in an analogous way, and involves a second pair of I2O queues.

We measured the maximum packet processing rate of the Pentium by looping around reading packets of various sizes from the IXP1200 and then writing them back onto the IXP1200. (Due to a silicon error, the I2O mechanism does not work; we therefore had to simulate it in software.) The StrongARM is programmed to feed packets to the Pentium as fast as possible. We also inserted a delay loop on both sides to determine the number of spare cycles available — cycles not involved in the data transfer. The results are given in Table 1, which shows that the router is able to forward up to 534 Kpps through the Pentium. This rate saturates the StrongARM, but leaves 500 cycles per packet available on the Pentium.
Packet Size (Bytes)     64      1500
Rate (Kpps)             534.0   43.6
Pentium (Cycles)        500     800
StrongARM (Cycles)      0       4200

Table 1. Measured maximum forwarding rate and excess processor cycles per packet

Note that we have so far focused on 64-byte packets. This is because processing minimal-sized packets is the worst-case scenario. It is also the case that forwarding larger packets scales linearly on the MicroEngines: forwarding a 1500-byte packet involves forwarding twenty-four 64-byte MPs. Crossing the PCI bus is different, however, since the DMA engine runs concurrently with the StrongARM. Also note that even if 1500-byte packets arrive, we do not necessarily need to move them across the PCI bus, as many forwarders just need to inspect the packet header. To account for this likelihood, we move just the first 64 bytes across the PCI bus, along with an 8-byte internal routing header that informs the Pentium of (1) the classification decision made on the IXP1200, and (2) how to retrieve the rest of the message (lazily) should the forwarder running on the Pentium need to access the packet body. Table 1 shows that for 1500-byte packets, the maximum achievable forwarding rate is 43.6 Kpps through the Pentium, leaving 4200 excess cycles on the StrongARM. It is on the budget of these excess cycles that we are able to run a piece of the code injection mechanism, which controls forwarder code injection onto the MicroEngines.
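The 8-byte internal routing header might be laid out as below. The paper only states what it must convey (the classification decision and how to fetch the rest of the packet), so the specific field split is an assumption for illustration.

#include <stdint.h>

/* Hypothetical layout of the 8-byte internal routing header that accompanies
 * the first 64 bytes of each packet crossing the PCI bus. */
struct vera_routing_hdr {
    uint16_t flow_id;        /* classification decision made on the IXP1200 */
    uint16_t pkt_len;        /* total packet length in bytes                */
    uint32_t body_handle;    /* IXP-side buffer reference used to fetch the
                                remainder of the packet lazily, if needed   */
};
/* sizeof(struct vera_routing_hdr) == 8 */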
3.4. Performance
We initially measured the system using the 8 × 100 Mbps Ethernet ports on the evaluation board, with eight Kingston KNE100TX PCI Ethernet cards based on the 21143 Tulip chip-set as traffic sources. (A pair of these cards is plugged into each of four 450 MHz Pentium IIs running a packet generator.) When configured to generate minimum-sized (64-byte) packets, each card transmits 141 Kpps, which is 95% of the theoretical maximum of 148.8 Kpps (calculated from [21]). Given this traffic source, the MicroEngines are able to sustain line speed across all eight ports, resulting in a forwarding rate of 1.128 Mpps. This is an expected result, as the theoretical forwarding capacity of the processing and memory resources in the IXP1200 is much greater than the 800 Mbps of testbed traffic.

To determine the maximum forwarding rate of the IXP1200, independent of the ports configured onto the board, we modified the input process to factor out port interaction, thus emulating infinitely fast network ports. Packet classification still occurs based on the destination IP address, and we assume a hit in the route cache. Note that omitting the device interaction does not have a significant impact on the reported performance numbers, as even the worst case device (the 100 Mbps ports) accounted for less than 10% of the total per-packet delay.

Because queueing packets is the primary complexity in the router at this level, and the router's performance at this level is greatly influenced by the queueing discipline selected, we measured several combinations of queueing strategies. Table 2 lists the options analyzed. For these experiments, the system was configured with 4 MicroEngines (16 contexts) running the input loop and 2 MicroEngines (8 contexts) running the output loop. All 24 contexts were executing their assigned loop for all the measurements.
Input Processing (4 MicroEngines)
  (I.1) private queues in registers                  3.75 Mpps
  (I.2) protected public queues, no contention       3.47 Mpps
  (I.3) protected public queues, max. contention     1.67 Mpps
Output Processing (2 MicroEngines)
  (O.1) single queue with batching                   3.78 Mpps
  (O.2) single queue without batching                3.41 Mpps
  (O.3) multiple queues with indirection             3.29 Mpps

Table 2. Maximum packet rates by input and output process, and by queueing discipline.

The fastest feasible system (I.2 + O.1) is able to forward packets at a rate of 3.47 Mpps. This result corresponds to the situation where no two packets are destined for the same queue at the same time, and so represents an upper bound on performance. Row I.3 corresponds to the same configuration, but this time with all packets destined for the same output queue. Note that this configuration (independent of the workload) does not support QoS since a single queue is associated with each output port. In contrast, configuration I.2 + O.3 corresponds to a system that supports up to 16 queues for each output port, providing significant flexibility in differentiating service. It is able to forward packets at a maximum rate of 3.29 Mpps.

By carefully breaking down the instruction counts for processing one MP [45], we report that each packet requires 280 cycles of register instructions, plus 180 (DRAM) + 90 (SRAM) + 160 (Scratch) = 430 cycles of memory delay, which totals 710 cycles. This means that a given packet experiences 3550 ns of delay as it is forwarded by one or more contexts running at 200 MHz. Since the system as a whole is able to forward 3.47 Mpps—that is, it outputs a packet every 288 ns—the system is able to forward a little over 12 packets in parallel. In other words, if we factor out memory delay, we calculate that one MicroEngine can process 200 MHz / 280 cycles = 714 Kpps, for a system total of 4.29 Mpps. Our actual rate of 3.47 Mpps is 80% of this optimistic upper bound. In other words, we are within 20% of the maximal possible performance (for a system with our instruction counts).
3.5. Extensibility and Robustness

So far, we have established the maximum rate at which each of the three processors can forward packets with a null forwarder. That is, we have statically allocated the MicroEngines to forward minimum-sized packets at line speed. The next step is to evaluate adding more complex forwarders to the data plane and integrating the control plane into the system, without jeopardizing the robustness of the system in the face of different workloads. To this end, we characterize a budget that allows a maximum amount of resources for each MP, in terms of cycles, registers and memory, to run a forwarding function. A detailed analysis reveals that for our development board with 8 × 100 Mbps Ethernet ports, the MicroEngines are required to forward at most 1.128 Mpps, leaving a significant budget for arbitrary forwarders [45]. Namely, for 64-byte MPs held in 16 registers, the forwarder has access to 8 additional general purpose 32-bit registers and an additional register for the SRAM address of the flow-specific state; it can execute up to 240 cycles worth of instructions, perform up to 24 SRAM transfers (reads or writes) of 4 bytes each, and compute 3 hashes with support of the hardware hashing unit. In addition, there are 650 instruction slots in the ISTORE that must be allocated to the competing extensions.² Since packets passed to the StrongARM will not have yet consumed these resources on the MicroEngines—in particular, the available memory references—all this capacity is also available on the StrongARM. We stress that this evaluation is in the context of the worst-case load the router can experience—forwarding minimum-sized packets arriving at line speed.

² The next version of the chip will support 1024 additional instructions, giving room for 1674 instructions in the ISTORE.

Forwarder          SRAM Read/Write (bytes)   Register Operations (instructions)   Registers Needed
TCP Splicer        24                        45                                   7
Wavelet Dropper    8                         28                                   4
ACK Monitor        12                        15                                   4
SYN Monitor        4                         5                                    0
Port Filter        20                        26                                   2
IP                 24                        32                                   2

Table 3. Cycle, Memory and Register Requirements of Example Data Forwarders

To demonstrate that it is feasible to implement useful router services within our allocated resource budget, we have implemented five example data forwarders. Table 3 gives the memory and cycle requirements for each. We selected some router services that have separate control and data components to emphasize how our hierarchical architecture can exploit this separation in improving performance. Specifically, these services can be implemented by a pair of forwarders—a data forwarder running on the IXP1200 that processes every packet and a control forwarder on the Pentium that initializes and manages the data forwarder.

The first forwarder performs TCP splicing, a technique for optimizing proxy performance [46]. We run the full TCP and proxy code in a control forwarder on the Pentium (it operates on only a few packets per connection), while leaving the splicing code that patches the TCP headers to run in a data forwarder at the MicroEngine level (it operates on all subsequent packets). The Wavelet Dropper divides a wavelet-encoded video stream into multiple layers [13]. Depending on the level of congestion experienced at a router, packets carrying low-frequency layers are forwarded and packets carrying high-frequency layers are dropped. The data forwarder records the number of packets successfully forwarded for this flow, while the control forwarder uses this information to determine the available forwarding rate, and from this, the cutoff layer for forwarding, which it then sends back to the data forwarder. The third (ACK Monitor) watches a TCP connection for repeat ACKs in an effort to determine the connection's behavior [37]. The fourth (SYN Monitor) counts the rate of SYN packets in an effort to detect a SYN attack. Port Filter is a simple filter that drops packets addressed to a set of up to five port ranges. The last is minimal IP processing, which consists of decrementing the TTL, recomputing the checksum and replacing the Ethernet header. (Note that the IP header also needs to be validated—the checksum verified and the version and length fields checked—but this is done as part of the classifier rather than the forwarder.) In contrast, we have measured more complicated forwarders such as TCP proxies and full IP to require at least 800 and 660 cycles per packet, respectively. Also, the prefix matching algorithm we use [47] requires on average 236 cycles per packet. These forwarders clearly need to run on the StrongARM or Pentium.

To validate the robustness of the complete system, we configured the MicroEngines to run a synthetic suite of forwarders based on the examples in Table 3, which utilizes the full resource budget. We then programmed the MicroEngines to forward a variable number of packets to the Pentium. We found that the system was able to forward up to 310 Kpps (out of the 1.128 Mpps offered load) through the Pentium without dropping any packets at any level of the processor hierarchy. Each of the 310 Kpps routed through the Pentium, in turn, receives 1510 cycles of service. In a second experiment, we ran the base infrastructure described in Section 3.2 and treated an increasing percentage of the packets as exceptional, thereby simulating a flood of control packets. These exceptional packets had no effect on the router's ability to forward regular packets; in fact, up to the point that a processor higher in the hierarchy (e.g., the StrongARM) was unable to service the stream of exceptional packets, the router was able to sustain the full rate of 3.47 Mpps. This is because the MicroEngines budget enough resources to classify and enqueue every packet arriving at line speeds, and once enqueued for a particular forwarder, a given flow receives whatever level of service the scheduling policy dictates.
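To give a flavor of how small these data forwarders are, here is a hedged C rendering in the spirit of the SYN Monitor. The actual forwarders are hand-written IXP1200 microcode operating on MPs held in registers; this simplified version assumes a plain IPv4/TCP header layout and is not the deployed code.

#include <stdint.h>

/* Per-forwarder state; on the IXP1200 this would live in SRAM and be read
 * periodically by the control forwarder on the Pentium. */
struct syn_state {
    uint32_t syn_count;      /* SYNs seen in the current interval */
    uint32_t interval_id;    /* interval the count belongs to     */
};

#define IPPROTO_TCP_NUM 6
#define TCP_FLAG_SYN    0x02

static void syn_monitor(struct syn_state *st, const uint8_t *ip_hdr,
                        uint32_t now_interval)
{
    if (st->interval_id != now_interval) {      /* start a new counting interval */
        st->interval_id = now_interval;
        st->syn_count = 0;
    }
    if (ip_hdr[9] != IPPROTO_TCP_NUM)           /* IP protocol field             */
        return;
    uint32_t ihl = (ip_hdr[0] & 0x0f) * 4;      /* IP header length in bytes     */
    uint8_t flags = ip_hdr[ihl + 13];           /* TCP flags byte                */
    if (flags & TCP_FLAG_SYN)
        st->syn_count++;
}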
3.6. Towards Dynamic Resource Allocation

We have demonstrated that our choice of static resource allocation has resulted in performance robustness of the overall system. Unfortunately, this choice is not without a price. Our static approach means that the software needs to be re-designed for boards configured with different ports and port speeds. This is especially problematic for a non-homogeneous set of ports. For instance, we had chosen not to use the two 1 Gbps ports available on the IXP1200, which have different I/O behavior than the eight 100 Mbps ports. Having to write this code in assembly language only complicates the situation. Another issue is that the static allocation of output contexts to queues restricts the number of queues that each context can service to a maximum of 16, the number of available registers.

One solution would be to construct the software for a new port configuration from a collection of building-block components. This could ultimately result in a domain-specific compiler. Our current implementation takes the first step in this direction by defining a set of macros that can be used in different combinations. This necessitates a dynamic resource allocation model to partition the resources (contexts, FIFO slots and queues) in the most effective way for a given heterogeneous configuration, which we are currently developing.

Figure 11. Maximum packet rates achievable by the output and input processes when running independently. For each data point only the minimum number of MicroEngines are used, hence the dip in the graph.

As an example, Figure 11 provides some insight into how a system with different MicroEngine context allocations might function. This figure illustrates that output processing scales almost perfectly with the number of MicroEngines added to this stage. However, input processing benefits very little from more than 16 contexts. This is because of the serialized access to the DMA state machine, which dominates the performance of input processing once there are enough threads to keep it busy. Moreover, MicroEngine contexts are non-preemptive, which imposes a certain time granularity on their allocation. These are examples of some of the issues that need to be factored into our resource model.
4. General Purpose Processor Level

This section covers the next execution level of the hierarchy of Figure 7, namely the level of the Linux kernel space within the Pentium. Here, the forwarding function is an entire Scout path, which is the key abstraction at the core of our router design at this execution level. Our primary contribution at this level is to implement all the networking and device communication aspects of the router in SILK, which abstracts the Scout path along with packet classification and scheduling into a loadable Linux kernel module. Our second contribution is to provide a uniform interface to all system devices and file systems, such as networking, audiovisual, character and block devices, as well as the socket interface.

We encounter a problem analogous to the one on the IXP level: namely, how to allocate the processor's resources among the various packet flows. At the Pentium level of the processor hierarchy, this reduces to scheduling threads among the Scout paths. Therefore our third contribution is to devise a scheduling mechanism that simultaneously maximizes the throughput of best effort packets, provides different levels of service to QoS packets, exhibits robust behavior in the presence of varying workloads (such as packet flooding DoS attacks), and supports Scout paths of varying computational costs.

Active packets traversing our router and reaching the Pentium processor hierarchy will inevitably flow through a NodeOS path, which provides an interface for the active services running at the next execution level, Linux user space. In view of this, the fourth contribution at this level is the introduction of SNOW, an additional layer that decouples the NodeOS from its API, thereby making the API portable among different EEs from above, and distinct NodeOS implementations from below. We therefore present our NodeOS architecture and discuss various aspects of its implementation within SILK.
4.1. Scout Paths

Scout [33] is a modular, configurable, communication-oriented operating system developed for small network appliances. Scout was designed around the needs of data-centric applications with a variable spectrum of networking requirements. At the kernel level of the router hierarchy in Figure 7, we identify five key requirements in the networking architecture:

(1) Early demultiplexing of incoming packets to flow queues. This allows the system to isolate flows as early as possible, in order to prioritize packet processing and accurately account for resources.
(2) Early dropping when flow queues are full. The server can avoid overload by dropping packets before investing many resources in them.
(3) Accounting of the resources used by each data flow, including CPU, memory, and bandwidth. Knowledge of the resources used by a flow is necessary in order to provide overall fairness or to place resource limits on individual flows.
(4) Explicit scheduling of flow processing, including network processing. Scheduling and accounting are combined to provide resource guarantees to flows, such as CPU or bandwidth reservations.
(5) Extensibility to enable adding new protocols and constructing new network services.

Scout's main contribution is to combine all of the features listed above into a single, clean abstraction—the Scout path. A path is a structured system activity. Each Scout path encapsulates a particular flow of data, such as a single TCP connection. A path consists of a string of code modules that process and perhaps transform the data as it flows through the system, and all resources consumed on the flow's behalf are charged to the path.
Figure 12. Two Scout paths: a non-active TCP path and an active UDP path
Figure 12 portrays two examples of Scout paths. The first is a non-active path corresponding to a single TCP connection. The second is an active UDP path communicating with the application via NodeOS. Each path consists of a chain of protocol modules that process packets belonging to the connection, with input and output queues at each end.

Taking the non-active path of Figure 12 as an example, packets arrive via the Linux network interface, and are then classified into a corresponding path. If the packet belongs to the TCP connection of this path, it is placed in the input queue at the bottom of the path; if the queue is full, the packet is dropped. A path is considered ready to run once it has data in its input queue. When a path is run, a thread belonging to the path dequeues a piece of data from its input queue, runs the code modules in sequence, and deposits the result in the output queue at the opposite end of the path. In this case, the packets need to undergo Ethernet, IP and TCP processing, respectively. Both ends of the path are delineated by the Geni module, which abstracts the device and communication interfaces to the path framework, and is further described in Section 4.2. At the top, standard user space applications connect to this path via the socket interface. The fact that the socket connection is implemented by a Scout path and not by Linux is transparent to the application.

The active path of Figure 12 behaves in a similar fashion, except that in this case it is a UDP connection that has to be routed through the NodeOS module. This module implements the NodeOS API [4], to be further discussed in Section 4.4. In this example, packets arrive via the Vera interface, and at the top, the path connects to the Linux ioctl interface through which it communicates with the active applications running in Linux user space.
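Schematically, a Scout path as described above can be pictured with the following C sketch; module and field names are illustrative and do not reproduce Scout's actual interfaces.

#include <stddef.h>

struct msg;                                 /* data being processed by the path */
struct msg_queue;                           /* input/output queue at each end   */

/* One protocol module in a path (e.g., ETH, IP, TCP, UDP, NodeOS, Geni). */
struct path_module {
    const char *name;
    int (*process)(struct path_module *m, struct msg *data);  /* 0 = keep going */
    void *state;
};

/* A path encapsulates one flow: input queue -> module chain -> output queue,
 * with all consumed resources charged to the path. */
struct scout_path {
    struct path_module *modules;            /* e.g., { Geni, ETH, IP, TCP, Geni }   */
    size_t              nmodules;
    struct msg_queue   *inq, *outq;         /* packets are dropped when inq is full */
    unsigned long       cycles_charged;     /* per-path resource accounting         */
};

/* Run one piece of data through the module chain. */
static int path_run_once(struct scout_path *p, struct msg *data)
{
    for (size_t i = 0; i < p->nmodules; i++)
        if (p->modules[i].process(&p->modules[i], data) != 0)
            return -1;                      /* module dropped the message           */
    return 0;
}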
4.2. SILK: A Linux Kernel Module

Consonant with our spirit of designing extensible, general-purpose routers with COTS hardware and open operating systems, we chose to run the Linux OS on the Pentium processor hierarchy, rather than a specialized, internally designed, stand-alone OS such as Scout [33]. Yet, to incorporate all the benefits of Scout paths while preserving the ability to run standard Linux applications, we encapsulated the Scout path architecture into a loadable Linux module, which we call SILK (Scout in Linux Kernel). SILK runs on Linux versions 2.4 and above. SILK does more than replace the Linux networking stack with Scout paths. For one, it provides a self-contained thread package and CPU scheduler that co-exists with Linux. Second, SILK introduces the concept of an "extended path" into user space, thus crossing the execution level boundary. This encompasses co-scheduling of kernel level Scout paths and user level applications. Lastly, it provides a unified framework, called Geni, for Scout paths to access all interfaces and devices. We cover these concepts in more detail below.

4.2.1 Threads

Since our goal is to support different levels of packet service across any given devices, SILK needs the capability to prioritize among Scout paths and to provide them with CPU guarantees. To this end, SILK contains its own path scheduler and thread package that coexists with the Linux CPU scheduler. To describe how the Linux and SILK schedulers interact, we refer to Linux and SILK threads as L-threads and S-threads, respectively.

At initialization, SILK creates an L-thread set to the maximum realtime priority, the highest priority in the system. Therefore, unless it has to wait for another L-thread to yield, the SILK L-thread will run immediately. Kernel L-threads run non-preemptively in Linux; hence, another L-thread can be scheduled to run only when the SILK L-thread yields or sleeps. SILK multiplexes all of its S-threads onto this high-priority kernel L-thread. SILK temporarily transfers control back to Linux via a special "Linux" S-thread. This Linux S-thread is an actual S-thread in SILK, and SILK can schedule it like any other thread. When SILK executes the Linux S-thread, it causes the SILK kernel L-thread to yield, thus transferring control to the Linux scheduler. In Linux, an L-thread that yields cannot run again until another L-thread has run. So, after SILK executes the Linux S-thread, the Linux scheduler chooses one other L-thread to run and then transfers control back to SILK. This mechanism, which empowers SILK to allow Linux to run one L-thread at a time, is illustrated in Figure 13.
Figure 13. SILK controls CPU scheduling by encapsulating the Linux scheduler and its Lthreads to run under the Linux S-thread. Consequently, SILK shares CPU cycles with Linux. SILK controls this by assigning various scheduling parameters to the Linux S-thread. For example, if SILK assigns the Linux S-thread a CPU rate of 50%, SILK and Linux will evenly share the CPU. Moreover, since all S-threads hierarchically run beneath the initial SILK kernel L-thread, SILK can maintain Scout’s original suite of schedulers to schedule Scout paths. These schedulers include fixed priority, Earliest Deadline First (EDF), Weighted Fair Queueing (WFQ), and Best Effort Real Time (BERT). One caveat of this design is that we assume no other Lthreads running in Linux with realtime priority. Further investigation is required if this assumption does not hold.
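The interaction between the two schedulers can be summarized in a short C sketch; the helper names below are illustrative stand-ins, not SILK's actual interfaces.

    /* Hypothetical helpers, not SILK's real API. */
    struct s_thread;
    extern struct s_thread *pick_next_s_thread(void);   /* WFQ, EDF, priority, ... */
    extern void silk_lthread_yield(void);                /* yield the SILK kernel L-thread */

    struct s_thread {
        const char *name;
        void (*run)(struct s_thread *self);
    };

    /* The special "Linux" S-thread: running it makes the SILK kernel L-thread
     * yield, so the Linux scheduler picks exactly one other L-thread before
     * control returns to SILK. */
    static void linux_s_thread_run(struct s_thread *self)
    {
        (void)self;
        silk_lthread_yield();
    }

    static struct s_thread linux_s_thread = { "linux", linux_s_thread_run };

    /* SILK's scheduling loop, multiplexing all S-threads (including the Linux
     * S-thread) onto the single high-priority kernel L-thread. */
    static void silk_scheduler_loop(void)
    {
        for (;;) {
            struct s_thread *next = pick_next_s_thread();
            next->run(next);     /* ordinary S-threads process path packets;
                                    the Linux S-thread hands the CPU to Linux */
        }
    }

The share given to Linux is then just the scheduling parameters SILK assigns to linux_s_thread in whichever path scheduler is in use.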
4.2.2 Extending the Path
A major benefit of SILK is its ability to provide application-specific QoS to standard Linux applications. A scheduling-aware application can specify how it and its paths should be scheduled by the system. Conversely, non-scheduling-aware applications cannot. We would like to exploit SILK to also provide differentiated service to those latter applications in a transparent manner, without the application's knowledge or participation. To encompass co-scheduling of kernel-level Scout paths and user-level applications, SILK introduces the concept of an extended path.
Figure 14. Extending a Scout TCP socket path from kernel to user space to subsume scheduling of the web server application
Since applications run as Linux user processes, the key idea is to give the SILK scheduler the ability to control Linux's scheduling decisions, so as to coordinate processing between the networking or communication processing run by SILK and the application. Thus, to "extend the path", SILK must first be able to identify the L-thread associated with the application pertinent to a particular Scout path. For an application using sockets, this may simply be the L-thread that calls connect or accept. Second, SILK must cause the Linux scheduler to mirror the decisions made by the SILK scheduler. Exactly what form this cooperation takes depends on which scheduler SILK is running. Figure 14 shows a Scout TCP socket path extended to subsume a web server running in user space. To demonstrate this concept, we implemented path extension with SILK's fixed priority scheduler. Linux can be made to perform strict priority scheduling by using the realtime priorities. Concretely, a realtime L-thread with priority p runs at a higher priority than a realtime L-thread of priority less than p, as well as any non-realtime L-thread. Therefore, to extend a path, SILK mirrors the path's priority in the realtime priority of the L-thread corresponding to the path's application. For example, if a path has priority 2 (in SILK), then an L-thread reading from it would inherit a realtime priority of 2 (in Linux). Furthermore, the priority inherited by an L-thread from a path can change over time. For instance, an L-thread that blocks on accept first receives the priority of the SILK listen path. It then adopts the path priority of the socket returned by accept, finally returning to its original priority when closing the socket. SILK only changes the priorities of those L-threads that correspond to SILK paths. To avoid priority inversion, the SILK kernel L-thread yields to Linux when a runnable L-thread has a higher priority than any ready path.
4.2.3 Geni
Geni stands for Generic Interface and is the gateway interface into SILK [52]. Conceptually, Geni performs three functions: classifying all SILK packets into their appropriate paths (which abstract the forwarding functionality); scheduling the packets at the end of the paths onto their target interfaces; and providing a clean, uniform framework to all non-SILK interfaces. We depict Geni's interaction with other interfaces at the Linux kernel level in Figure 15 (Rover and Code Injection (CodeI), also shown there, are covered in Section 5). Reiterating our unified framework of the functional abstraction of an extensible router across all execution levels depicted in Figure 7, this is comparable to the fixed infrastructure that all packets have to consume at the MicroEngine level in Section 3. In a similar fashion, regardless of the path a packet takes, the packet traverses the Geni module twice, once upon entering and once upon exiting the path, as was shown in Figure 12. In addition to the standard classification functionality described in Section 2.1, Geni's classifier also incorporates a policy decision that determines exactly which flow of packets constitutes which path. Certainly, each QoS flow is treated as its own distinct path, even if the forwarding function is essentially the same as for some other path. On the other hand, multiple best effort flows that share the same forwarding function are classified into the same path. Figure 15 illustrates that Geni interacts not only with the interfaces to network devices, such as the Vera interface or the Linux network driver interface, but also with file systems, including the proc filesystem; the connections to these interfaces are all bidirectional. Geni also provides an interface to other system devices, such as a one-directional connection from Geni to the sound card and the video frame buffer, and into Geni from the device in the case of the microphone or camera.
Figure 15. Geni's partner interface connects to network & media devices and socket & ioctl interfaces
Additionally, much like the Vera driver provides a bridge between the IXP and the Pentium levels across the PCI bus, Geni bridges the Linux kernel and user space execution levels by providing an interface to sockets and ioctls. This is all accomplished by the Geni partner interface, which Geni exports to all other interfaces and devices. These partner interfaces are depicted as the small rectangles in Figure 15. Each such rectangle within Geni conceptually represents an instantiation of Geni specific to a particular driver. To illustrate this point, consider again the two Scout paths depicted in Figure 12. For the non-active path, the bottom and top Geni modules are instantiations of Geni's network partner driver interface and socket partner interface, respectively. In the case of the active path, the bottom module is an instantiation of Geni's Vera partner driver interface, which has richer functionality than its Linux network driver counterpart. The top module of the same path is an instantiation of Geni's ioctl partner interface. It is noteworthy that Geni's socket partner interface permits Linux applications that use standard networking protocols via the socket API, such as calls on the PF_INET family, to be intercepted and processed by SILK. This allows unmodified legacy applications to access TCP and UDP paths. For applications using experimental or non-standard protocols, or those requiring additional functionality, SILK provides a new PF_SCOUT protocol family via Geni's socket partner interface. The underlying mechanism is the conversion of all non-SILK data structures into a Message (SILK's internal packet abstraction) upon entering SILK, and the conversion of the Message (possibly modified by forwarder processing) into the data structure of the device it is destined for. For example, a packet arriving via Linux netfilter would be converted from an sk_buff (Linux's internal representation) into a Message upon entering SILK. If, say, this is an edge router and the packet is part of an MPEG video stream to be displayed on this router, Geni will convert the Message into frame-buffer data and send it out to the video device. Geni's partner interface functions can be categorized into two groups: buffer operations, which include buffer allocate, free and get; and device operations, which include device open, close, push, pull, flush and control. Thus, to interface with Geni, each device driver must provide its side of the partner interface, as depicted in Figure 15. This entails implementing the relevant subset of the above functions, which, in essence, are translators to the original device functionality, an exercise that, we claim, takes about an hour of programming time.
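As a rough illustration, a partner interface along the lines described above might be expressed as the following C operations table. The operation names mirror the buffer and device operations listed in the text; the exact signatures are assumptions of ours, not Geni's actual interface.

    #include <stddef.h>

    struct message;                       /* SILK's internal packet abstraction */

    struct geni_partner_ops {
        /* buffer operations */
        struct message *(*buf_alloc)(size_t len);
        void            (*buf_free)(struct message *m);
        struct message *(*buf_get)(void *native);      /* e.g. wrap an sk_buff */

        /* device operations */
        int             (*open)(const char *devname);
        void            (*close)(void);
        int             (*push)(struct message *m);    /* path -> device */
        struct message *(*pull)(void);                 /* device -> path */
        void            (*flush)(void);
        int             (*control)(int cmd, void *arg);
    };

A hypothetical partner for a Linux network driver would translate between sk_buff and Message in buf_get() and push(); a frame-buffer partner would instead render the Message to the video device in push(), matching the MPEG example above.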
4.3. Scheduling Scout Paths
At this execution level, we revisit the resource allocation issues that were encountered at the MicroEngine level in Section 3. In this case, however, instead of the 24 MicroEngine contexts, we have a SILK thread pool, the S-threads, to allocate among our processing elements, running on a 733 MHz Pentium III processor with a 512 KB L2 cache and 1 GB of RAM. Clearly, memory is not an issue with which we need to grapple in terms of buffering packets or storing instructions. However, the Pentium does a lot more than process packets. It runs the Linux OS and supports a wide range of standard and active applications, all of which tax the processor's resources. The central problem, however, remains the same: how to allocate the CPU resources to simultaneously maximize throughput, support different levels of packet service, and maintain extensibility and robustness of the system as a whole. To this end, we explore the design space for scheduling S-threads across the Scout paths [40]. (The experiments reported in [40] were actually conducted on a 450 MHz Pentium II processor.) In doing so, we have four overriding goals: (1) maximize the throughput of best effort packets; (2) provide different levels of service to QoS packets; (3) exhibit robust behavior in the presence of varying workloads, including packet-flooding DoS attacks; and (4) support Scout paths of varying computational costs. The strategy we propose first investigates how many S-threads to assign to a Scout path, and then applies a combination of two mechanisms: a proportional share scheduler and the batching throttle. Experiments with a prototype implementation demonstrate the effectiveness of the resulting framework. These experiments are conducted with a stream of 64-byte IP packets and three sources generating packets at rates of up to 140 Kpps, yielding an aggregate maximum offered load of 420 Kpps. Our experiments are designed to stress the CPU rather than the network; hence, the emphasis on small packets is the same as the one employed at the MicroEngine level in Section 3: a larger number of small packets places a greater load on the CPU than fewer large packets.
4.3.1 Threads per Scout Path
Following our strategy at the MicroEngine level of separating packet processing into a fixed part and a variable part, it is natural to split the work into a three-stage processing pipeline, depicted in Figure 16. Packet classification and scheduling, implemented in SILK by Geni, are labeled as the input and output processing stages, respectively, and are accounted for by a fixed budget of CPU cycles.
Figure 16. The packet processing pipeline mapped to the functional router abstraction and the corresponding elements of SILK
The question then arises of how many S-threads to assign per Scout path. Guided by our goals of simultaneously supporting Scout paths with varying computational costs while maintaining robustness in the wake of variable workloads, and maximizing performance for best effort flows, we would ideally like to assign one S-thread to each stage of the processing pipeline, amounting to three S-threads per path. This would enable us to isolate the fixed processing cost required for the input and output processes, and allow us to focus on the scheduling issues of the variable forwarding functions. Before settling on this strategy, however, we need to investigate whether this approach is prohibitively expensive because of context switching. We therefore conducted a series of experiments, increasing the aggregate load from 0 to 420 Kpps, and measuring how well the router sustains this traffic for the case of one, two and three S-threads per pipeline stage. In an interrupt-driven implementation of the servicing scheme, the router started to suffer from receive live-lock and thrash after 48 Kpps, which is consistent with the results in [31]. Therefore, we diverted our attention to a polling scheme instead. To reduce the overhead of context switches, we also experimented with having the forwarding process batch as many packets as possible up to an arbitrary limit of 16 packets, or alternatively, yield after handling one packet. Batching was always turned on for the input and output processes.
Number of S-threads   Maximum Batch Size   Max. Forwarding Rate (Kpps)   Normalized
1: I+F+O              16                   294                           1.00
2: I, F+O             16                   286                           0.97
3: I, F, O            16                   272                           0.93
3: I, F, O            1                    227                           0.77
Table 4. Max. Processing Rates with Polling
Table 4 shows the relative maximum forwarding rates achieved with the polling scheme, that is, the rates at which, in each scenario, the router began to drop packets and the corresponding curves turned into plateaus. We observe that each additional S-thread in the processing pipeline adds 3 to 4% overhead. The effects of batching are more significant, improving performance by approximately 16%. Further analysis showed that we are batching on the order of 10 packets at each stage. Comparing three S-threads with batching (our ideal scenario) to the single S-thread case, we see that the overall difference in performance is only 7%, which seems a tolerable overhead for the functionality we seek. It is noteworthy that having two full-blown context switches on the packet's processing path has the potential to add 10 μs to the forwarding time, whereas, as can be calculated from Table 4, we were able to achieve a total forwarding time of 3.3 μs for each packet. This is primarily due to an optimized implementation of context switches as inexpensive continuations.
4.3.2 Proportional Share Scheduler
To meet our goals in the face of more complex scenarios than the experiments described above, we employ a proportional share (PS) scheduling discipline that provides a cycle rate to a process. It abstracts the main features of a class of algorithms, such as Weighted Fair Queuing [7]. PS has four essential characteristics: (1) Each process reserves a cycle rate, e.g., 1 million cycles per second (Mcps), and is guaranteed to receive at least this rate when it is not idle. (2) Unused and unallocated capacity is fairly distributed to active processes in proportion to each process's reservation. An active process that receives extra cycles beyond its reservation is not charged for them. (3) An idle process cannot "save credits" to use when it becomes active; unused share is simply lost. A process whose input queue is not empty and whose output queue is not full is said to be eligible to run. (4) The guarantees made to processes provide isolation between them: each process gets its rate no matter what the other processes do.
The PS scheduling strategy allows us to distinguish between the S-threads for the input, forwarding and output processes by assigning them shares proportional to their processing costs. In this case, we achieve a balance that produces good system behavior. For example, micro-experiments run on the configuration described above indicate that in the case of three S-threads per pipeline, the input, forwarding and output processes spend 1.6 μs, 0.3 μs and 1.4 μs on each packet, respectively. Thus, the results of Table 4 were achieved with a balanced share assignment of 5:1:5 (I:F:O) for the three S-thread case. Unlike the MicroEngine execution level, where we could measure the minimum fixed infrastructure required to maintain processing at line speed and allocate the resources accordingly for all flows, in SILK we want to handle flows with different qualities of service. We therefore face two conflicting considerations as to what share to assign the input process. For the sake of best effort flows, we want to assign input process shares based on the balanced cycle distribution under overload, since overscheduling the input process share potentially leads to live-lock. On the other hand, should this rate be less than is required to read and classify packets at line speed, QoS flows become vulnerable to DoS attacks, because packets belonging to well-behaved QoS flows may be dropped on a line card if the input process does not have enough share to keep up with packets arriving at line speed. We resolve this conflict by means of a queue estimator (QE), a mechanism that dynamically estimates the device queue length based on previous observations. It does this by keeping a weighted average of the packets read during each execution of the polling S-thread. Each input process also has a sleep interval and a target range. If the weighted average of packets is less than the target range, the sleep interval is increased (up to some maximum); if it is greater, the interval is decreased. The process then sleeps for that interval before becoming eligible to run again. In this manner, the state of all queues, including the receive queue on the device, is available to be incorporated into the scheduling decision. To verify the router's robustness in forwarding best effort packets under system imbalances, we configure three different paths with varying forwarder costs, while the input process cost remains 1.6 μs. Table 5 demonstrates the router's behavior in three scenarios. In the first, under a balanced share, the forwarding rates are almost exactly proportional to the ideal balance. In the second scenario, the forwarder of path A is overscheduled and hence unaffected (its unused share is distributed upstream to the input process); whereas for path C, the system gives too much weight to the input process, resulting in a throughput drop. The problem is that the input process is running at a faster packet rate than the forwarding process, causing packets to drop off the tail of the forwarding process's input queue. Notice how introducing the QE readjusts the balance and remedies the problem.
Flow   Forwarding Cost   Ideal Balance   Balanced Share   1:10 Share w/o QE   1:10 Share w/ QE
A      8 μs              1:5             101 Kpps         101 Kpps            101 Kpps
B      16 μs             1:10            56 Kpps          56 Kpps             56 Kpps
C      24 μs             1:15            38 Kpps          35 Kpps             38 Kpps
Table 5. Best effort throughput for an input process cost of 1.6 μs
To further demonstrate the significance of the QE mechanism, we consider the scenario of conservatively overscheduling the input processes' share in an effort to protect against a flood of best effort traffic. We simulated this scenario by configuring three input ports, two of which had no traffic arriving and one on which packets arrive at full speed. We gave each input process a large enough CPU share to receive packets at line speed, and we set the processing cost for each packet to 6 μs, which fully utilizes the CPU. Without the QE, roughly 30% of the CPU is wasted polling idle input ports, thereby yielding a forwarding rate of 91 Kpps for the active port. With the QE enabled, the router was able to forward packets at 130 Kpps, the maximum achievable rate for this configuration.
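A minimal C sketch of the queue estimator logic described above is shown below; the constants and names are illustrative, not the values used in our prototype.

    struct queue_estimator {
        double avg;            /* weighted average of packets read per poll */
        int sleep_us;          /* current sleep interval before the next poll */
    };

    #define QE_ALPHA        0.25   /* weight of the newest observation */
    #define QE_TARGET_LO    4.0    /* target range for the average batch */
    #define QE_TARGET_HI    12.0
    #define QE_SLEEP_MIN_US 10
    #define QE_SLEEP_MAX_US 1000

    /* Called after each execution of the polling S-thread with the number of
     * packets it just read from the device queue. */
    static void qe_update(struct queue_estimator *qe, int pkts_read)
    {
        qe->avg = QE_ALPHA * pkts_read + (1.0 - QE_ALPHA) * qe->avg;

        if (qe->avg < QE_TARGET_LO && qe->sleep_us < QE_SLEEP_MAX_US)
            qe->sleep_us *= 2;           /* queue mostly idle: poll less often */
        else if (qe->avg > QE_TARGET_HI && qe->sleep_us > QE_SLEEP_MIN_US)
            qe->sleep_us /= 2;           /* queue building up: poll more often */

        /* the input S-thread then sleeps for qe->sleep_us before it becomes
         * eligible to run again */
    }

This is how idle ports end up being polled rarely while a flooded port is polled often enough to sustain line rate, which is the effect measured above.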
4.3.3 Batching
Even though, as seen in Table 4, batching packets in the forwarders significantly improves performance, it also leads to coarser scheduler granularity. A QoS path with finite buffer size requires its CPU share to be delivered within a certain period, that is, before its queue becomes full. With large batch sizes, an S-thread may be able to hog the CPU for a long period of time, thereby delaying the execution of other paths. If this service lag exceeds the maximum another path can buffer, future packets belonging to this path will be dropped even if they are within the path's reservation. Under such a scenario, the contract between the system and the packet flow is violated. To strike a balance between batching and scheduling granularity, we introduce a batching throttle mechanism. It considers each flow's processing cost and queue length, as well as the system's scheduling overhead, and dynamically adjusts the level of batching to trade off granularity for efficiency while simultaneously meeting QoS promises. The batching throttle affects the system's behavior in three ways. First, it preserves a specific scheduler granularity by requiring threads to surrender control at least every G time units, where G stands for scheduler granularity. Second, it tries to schedule threads that can process a full batch when they run, in an attempt to improve the efficiency of the system. Third, it allows for flows with latency requirements by including a timeout mechanism: after waiting a delay threshold, a non-efficient flow is allowed to run anyway. An S-thread can be in one of four possible states: Idle, Eligible, Active or Running. The scheduler transitions S-threads between these states subject to the system parameters of the batching throttle. For every flow, we define an overhead threshold T, which is bounded by a function of the average cost of a context switch C, the average per-packet processing cost, and the batching threshold. Foregoing a detailed analysis, the behavior of the batching throttle is mainly characterized by two parameters, the scheduler granularity G and the overhead threshold T, which are subject to the following constraint:
G ≥ C / T    (1)
Equation 1 formulates the conflict between fine-grained scheduling and lower scheduler-related overhead. In particular, improving system throughput by lowering the overhead threshold T must result in coarser scheduling granularity G. Detailed analysis and experiments portraying and evaluating system behavior under different conditions can be found in [40].
4.4. NodeOS
SILK commands a plethora of protocol modules that can be chained together to construct Scout paths (all delineated by the Geni module). Modules such as Ethernet, IP, TCP, UDP, ARP, DHCP, and RTP are some basic examples. The NodeOS module, however, warrants special attention in the context of this paper. A general architecture for active networks has evolved over the last few years [10, 44]. This architecture stipulates a three-layer stack on each active node. At the lowest layer, an underlying operating system (NodeOS) multiplexes the node's communication, memory, and computational resources among the various packet flows that traverse the node. At the next layer, one or more execution environments (EEs) define a particular programming model for writing active applications. To date, several EEs have been defined, including ANTS [53, 54], PLAN [2, 19], SNAP [32], CANES [8] and ASP [9]. At the topmost layer are the active applications (AAs) themselves. In the realm of an active network, security risks are heightened at the level of end-to-end active users, the active router node itself, the EEs, and the active code that traverses the network, thereby necessitating a security component at each of these layers [1, 34]. While each EE exports its own interface to the AAs, establishing a unified interface to the NodeOS is an essential element of the Active Networks architecture. We have contributed to a NodeOS API specification [4], where we define five primary abstractions: thread pools, memory pools, channels and files, encapsulating the node's computation, memory, communication and persistent storage, respectively; and a fifth abstraction, the domain, which aggregates control and scheduling for the other four abstractions.
Figure 17. Mapping the execution levels of the prototype hierarchical extensible router to the canonical Active Network architecture (only active applications are shown)
How exactly the canonical Active Network architecture layers map onto a prototype router is an implementation issue. Three different example implementations of the NodeOS have been reported [38]: within the Scout kernel [33], where no protection domains were maintained; using the OSKit component base [17], where the NodeOS and EEs shared a protection domain, running as a kernel on top of hardware or as one user space process on top of a Unix OS; and AMP, running above the exokernel [26], where the NodeOS was implemented within a user space OS library, with its own memory and thread management. In the realm of our router, Linux imposes a decision: on which side of the user/kernel boundary to locate the NodeOS. We view the NodeOS as the entire distributed router OS of the VERA architecture, which on our prototype router spans the lower three execution levels of Figure 7. To illustrate this point, a mapping between our prototype router and the Active Network architecture is given in Figure 17. EEs and AAs interface to the NodeOS via a portable set of header files, specifying the NodeOS API at the user level, in C or in Java. A NodeOS module within SILK implements the NodeOS specification, setting up channels and maintaining domains, thread pools and memory pools. Thus, on behalf of an AA, SILK sets up an active Scout path with this NodeOS module, as was illustrated in Figure 12. SILK NodeOS Wrappers (SNOW) provide a NodeOS encapsulation in user space, thus converting the EE or AA API calls into SILK,
thereby crossing the user/kernel boundary [43]. A recent change in the NodeOS interface rendered this design possible [38]. The two major data structures of the API, NodeOS-internal opaque objects and EE-visible specifications, were defined as pointer types in a uniform way, and all arguments within the API function calls reference them as such, that is, as pointer types to memory. Such a design offers us three advantages. First, not exposing the data objects within the API calls allows the API specification, in the form of a set of header files, to be portable among different EEs from above, and distinct NodeOS implementations from below. Second, as illustrated in Figure 17, it permits the NodeOS API abstractions to take advantage of the entire hierarchical OS of the prototype router, with its ability to process packets at line speeds, despite the added extensibility and functionality. Lastly, as some objects become better understood, such as the security credentials, they could be migrated into specifications without changing the API, thus preserving compatibility with all existing applications. Code injection, a key element in active networking, is handled in the NodeOS API by specifying a forwarding function during channel creation. To inject this code into the router, the NodeOS invokes Thrill, which we cover in the following section.
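The following C sketch illustrates the opaque-pointer style described above; the type and function names are illustrative stand-ins of ours, not the literal NodeOS API of [4].

    /* NodeOS-internal objects are visible to EEs only as opaque pointers, so
     * the same headers work over different NodeOS implementations. */
    typedef struct an_domain      *an_domain_t;     /* aggregates the other four */
    typedef struct an_thread_pool *an_thread_pool_t;
    typedef struct an_mem_pool    *an_mem_pool_t;
    typedef struct an_channel     *an_channel_t;
    typedef struct an_file        *an_file_t;

    /* EE-visible specifications are likewise passed by reference. */
    typedef struct an_chan_spec {
        const char *demux;        /* hypothetical demux string, e.g. "if0/ip/udp/9999" */
        void (*forward)(an_channel_t c, void *pkt, int len);  /* injected forwarder */
    } an_chan_spec_t;

    an_domain_t  an_domain_create(void);
    an_channel_t an_channel_create(an_domain_t d, const an_chan_spec_t *spec);

Because only pointers cross the interface, an object such as a security credential can later grow a concrete specification without changing any of the call signatures, which is the compatibility argument made above.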
5. Control Plane
For the most part, we have only been considering data plane packets up to this point. Even though the control plane is not explicitly depicted in Figure 7, control packets permeate all levels of the execution hierarchy from the input ports to the standard and active applications. Recall that in Section 3, we specifically chose to implement those forwarders with separate control and data components to emphasize how our hierarchical architecture can exploit this separation in improving performance. Namely, data plane forwarders were installed on the MicroEngine level, whereas control plane forwarders were assumed to live at a higher execution level of the hierarchy for performance purposes. Traditionally, on a classical router, both the data and control planes were handled by the single CPU of the router. Recently, to be able to sustain performance at line speeds on the data plane, commercial routers began dedicating a separate processor to the control plane. This processor would be responsible for running routing protocols, maintaining routing tables, and executing signalling and network management protocols. This is another "hard-wired" solution to the division of labor. Within the architecture of our extensible router, however, a particular control plane module can reside anywhere in the execution hierarchy. Where exactly it resides is subject to three factors. The first is the availability of resources at each level for the variable forwarding functions (this excludes the fixed
costs of input and output processing, which do not discriminate between control and data packets). For example, the forwarders implemented on the MicroEngines do not have enough resources budgeted for their control plane components, which is why control packets are classified further up the hierarchy. The second factor is the functionality included in the particular control plane algorithm. Some control plane algorithms are inherently distributed and would probably need to run in user space, whereas others are inherently local. For instance, part of the code injection library resides on the StrongARM, whereas the mechanism responsible for dynamically configuring a Scout path for a particular flow (Rover) is part of SILK, executing in Linux kernel space. The third factor has to do with the trustworthiness of the code that needs to be executed on behalf of the control plane algorithm. For example, the control plane components of the variable-cost MicroEngine forwarders are trusted modules implemented by the router programmers, which can consequently be executed in SILK, at the Linux kernel execution level. Other signalling code sent by remote services to run on this router might not be trusted, and would have to run in user space. Our extensible router architecture incorporates three mechanisms from the control plane domain, which we discuss separately below. The first mechanism provides the capability to dynamically inject forwarding code into the router in a transparent fashion. This is a trustworthy module providing local router functionality and is therefore designed to reside across all levels of the execution hierarchy but the lowest one; the design is extensible to any number of execution levels. The second mechanism, Rover, dynamically configures a Scout path from the available (or injected) choice of protocol modules for a particular flow. This is trusted code with exclusively local functionality and is implemented as part of the SILK kernel, thus residing at the Linux kernel execution level. PathFinder, the third mechanism, employs a three-level overlay approach over the entire network to construct an end-to-end path for media applications. Clearly, this is a mechanism of global scope, which the router does not need to trust; its implementation should therefore reside at the user space level. Finally, we address the issue of robustness in the realm of the control plane. Since this is a new and largely unexplored topic, we first formulate the problem and then propose a general solution which we are currently investigating.
5.1. Transparent Code Injection
A key element of active networks is the application's ability to inject arbitrary packet forwarding code into the router. In the realm of our extensible router, this necessitates the
ability to control the forwarding functions at all execution levels of the hierarchy. In fact, this is a more fundamental requirement. Standard non-active applications may be executing control plane code at the user space execution level that needs to communicate with its data plane counterpart further down the execution hierarchy. Even more elemental for our router, SILK, running at the Linux kernel execution level, may, when given the opportunity, decide to move some forwarding functionality further down the execution hierarchy. Our impresario for this functionality is the transparent code injection mechanism, which resides across all levels of the execution hierarchy but the lowest one. In our case, this spans the highest three levels of the hierarchy, from the user level to the StrongARM. The MicroEngines have no further level to control. The code injector mechanism has two separate components: a resource assessor and a trust assessor. The former weighs the resource requirements of the forwarding function to be injected, and assesses the resource availability within the router across the different execution levels. For example, on our prototype, the router might have enough resources to execute a forwarder within SILK, but not on the MicroEngines. Since resource availability within the router is constantly changing, the resource assessor maintains a database of the existing and currently available resources. This database is updated either at light loads, or upon request. The trust assessor assesses the safety of the code to be injected. For example, a piece of code might be sufficiently safe to run in user space, but not safe enough to inject into the kernel. Since active applications interface with the router through the NodeOS, the NodeOS security module performs the authentication and passes the results to the code injector, if the need for code injection arises. Code trustworthiness need not be limited to security concerns; it may also be a measure of the "bugginess" of a coded function. If, for instance, a piece of code has not been rigorously tested, it might be labeled as trustworthy only up to a certain execution level. Unlike resource assessment on a hierarchical router, trustworthiness is for the most part a static scale, which relieves the trust assessor from periodically updating its trust database. Like the resource assessor, the trust assessor reports the execution level(s) where a match occurred, if any. The code injector intersects the execution levels reported by the resource and trust assessors; if the intersection is non-empty, it proceeds with the injection. The injector's interface also provides functionality to remove functions and control their execution. As such, it provides a four-call interface at each execution level to enable injecting forwarders: install, remove, getData, and setData.
Additionally, we provide an initialization call which takes the device name of the network card.
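A sketch of what this per-level interface could look like in C is shown below. The call names come from the text, while the signatures and the codei_ prefix (used here only to avoid clashing with libc names) are assumptions of ours.

    #include <stddef.h>

    typedef int codei_level_t;    /* e.g. user space, SILK kernel, StrongARM */
    typedef int codei_handle_t;   /* handle to an installed forwarder */

    /* Initialization call named in the text: takes the network card's device name. */
    int codei_init(const char *netdev);

    /* The four injection calls, one set per execution level. */
    codei_handle_t codei_install(codei_level_t level, const void *code, size_t len);
    int codei_remove(codei_handle_t h);
    int codei_getData(codei_handle_t h, void *buf, size_t len);   /* read forwarder state */
    int codei_setData(codei_handle_t h, const void *buf, size_t len);

In this sketch the level argument would be the common denominator computed from the resource and trust assessors, so callers never see the router hierarchy directly.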
5.2. Rover
The problem of creating an end-to-end media path across a set of nodes is one of resource discovery and management. First, we must determine what resources are available in the network and map an end-to-end media path that satisfies the user's requirements onto these resources. Second, we must program the individual nodes along the path to implement the necessary services. From this perspective, establishing a path can be divided into two distinct levels. At the global level, end-to-end media paths are mapped onto the collection of nodes and devices in the network. This involves resource discovery throughout the network, and choosing a route for the end-to-end path that includes the resources it requires. At the node level, a piece of an end-to-end path is instantiated on a particular node. Here we program the node to implement one segment of the end-to-end path. This involves loading code modules and binding them to particular devices and packet flows. Figure 18 illustrates an example path that spans three nodes.
Figure 18. Global and local components of end-to-end path construction
In the architecture of our router, the global component of path creation is handled by PathFinder, to be discussed next, whereas the local component can be further subdivided into two pieces. The first piece deals with code injection into a hierarchically extensible router, handled by the code injector described above. The second piece involves binding the code modules to particular devices and establishing the paths for the packet flows, which is performed by Rover [36]. Alternatively, even a standard non-active application (explicitly or implicitly) specifies a sequence of protocols traversed by its flow. This application should not necessarily know the names of these protocol modules at a particular
node, or the particular devices. This demonstrates Rover's more general utility in binding protocols to code modules and establishing the Scout paths for any generic application. Rover is implemented as part of the SILK kernel, much like the CPU scheduler, and is depicted as such in Figure 15. Rover performs its function in two successive steps. First, it resolves the general protocol specification provided by the application into a sequence of actual code modules, bound to specific devices if necessary. For example, the protocol specification net_in/ip/video_out would be instantiated to vera/eth/ip/eth/frame_buffer. The second step is for Rover to actually initialize a Scout path for this particular flow. If one of the bound code modules in the resolved protocol specification needs to be downloaded, Rover invokes the code injector mechanism to perform this function. Finally, it invokes SILK's PathCreate operation to instantiate the path.
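As an illustration of the first step, the C sketch below resolves generic stages of the example specification into concrete, device-bound modules. The per-stage mapping and the helper name rover_resolve_stage are hypothetical; only PathCreate is named in the text.

    #include <string.h>

    /* Map one generic stage of a protocol specification to the concrete,
     * device-bound module(s) available on this node. */
    static const char *rover_resolve_stage(const char *stage)
    {
        if (strcmp(stage, "net_in") == 0)    return "vera/eth";         /* input device */
        if (strcmp(stage, "ip") == 0)        return "ip";
        if (strcmp(stage, "video_out") == 0) return "eth/frame_buffer"; /* output device */
        return 0;   /* unknown stage: the module may need to be injected first */
    }

Concatenating the resolved pieces for net_in/ip/video_out yields vera/eth/ip/eth/frame_buffer, after which Rover would invoke PathCreate on that module sequence.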
5.3. PathFinder
The recent trend to connect embedded, mobile, and consumer-electronic devices to the network—coupled with the desire to build new services from an environment filled with these devices—is putting new pressures on the network infrastructure. For example, heterogeneity among edge devices, and the data formats they support, implies that data must be transformed into an appropriate format somewhere along the end-to-end path. Similarly, limited bandwidth along the path might mean that a single stream must be thinned or multiple related streams must be aggregated. We expect stream processing to be done on intermediate nodes of the network due to the limited resources and programmability of the edge devices, and as a consequence, these emerging services often benefit from application-specific, sometimes called QoS-based, routing. This is because routing decisions are driven by resource availability, including both link capacity (bandwidth, latency, loss rate) and computational resources (CPU cycles, memory). PathFinder [35] develops a distributed mechanism that both yields effective routes and is not prohibitively expensive. In doing so, it makes two contributions. First, it illustrates the design of a system that employs multi-level overlays to facilitate route selection in supporting end-to-end media paths. Second, it describes and evaluates a new algorithm for constructing a self-organizing overlay network. Unlike RON [5] and Detour [42], our approach is fully distributed; it does not depend on resource attributes being collected by a central manager. Moreover, unlike recent distributed algorithms (e.g., Scattercast [11] and End System Multicast [12]), our approach does not need to acquire information about all other nodes in the overlay to discover the topology. Instead, it is able to construct a near-optimal topology by discovering information about just a fraction of the current nodes. Assuming the fully-connected Internet as the underlying network, our framework constructs three overlays, as illustrated in Figure 19: the Substrate, Connectivity and Functionality layers, respectively. Each layer can be viewed as a subgraph of the next lower layer, culminating in a path through the network, which could be an arbitrary mesh with multiple sources and sinks, configured to process a particular data stream.
Figure 19. Three overlay layers
In our prototype router, once this path mesh has been determined, PathFinder invokes the code injector to inject application-specific code into these intermediate nodes, and then relies on Rover to set up the appropriate protocol modules for the Scout path in SILK. Note that for each layer, we employ a different strategy to select the subgraph that defines the next higher layer. A summary of these strategies is briefly outlined below, followed by a discussion of PathFinder's overall scalability.
5.3.1 Substrate Layer
The purpose of the substrate layer is to define a reasonable topology on top of the underlying IP network. Nodes are added to the substrate as they elect to participate in PathFinder. These nodes self-organize by establishing links to "nearby" nodes. Once the substrate settles on a topology, each node measures the characteristics of its links, including bandwidth and loss rate. Unlike RON or Detour, each node in the substrate maintains this information locally; it is not forwarded to a central database. Nodes that join the substrate are called core nodes. They must be programmable in the sense that we can dynamically load code on them in support of different data streams. They may be either end-systems or extensible routers. Network-enabled devices attach to nodes in the substrate, but do not participate in the overlay algorithms. However, such edge devices are the ultimate sources or sinks of various data streams. An example collection of core nodes, along with a set of edge devices, is shown in Figure 20.
Figure 20. Overlay network & edge devices
When a node joins the substrate, it uses a breadth-first search via another already-joined node to select a small set of weak neighbor nodes, effectively adding links from itself to these neighbors. Our goal is to reduce the set of nodes that each node must consider to select its neighbors, yet come as close as we can to selecting the optimal neighbors for each node. When the set of weak neighbors converges, the node selects the closest node among its neighbors to be a strong neighbor. This neighbor differentiation allows the establishment of sound algorithms for nodes leaving the substrate layer, or in the case of a node failure. A detailed description of these algorithms is given in [35].
5.3.2 Connectivity Layer
This middle overlay defines a subgraph of the substrate that includes the source and sink node(s) of a particular application, along with k alternative paths between these nodes. A given connectivity overlay is constructed as an extension to a traditional shortest-path routing algorithm (i.e., link state or distance vector) that runs on the substrate. An example of the distance vector routing algorithm on the connectivity layer is depicted in Figure 21.
Figure 21. Connectivity layer
Unlike the substrate layer below it, the individual nodes are not aware that they are part of a particular connectivity subgraph. Instead, a given connectivity graph is maintained in a manner that is analogous to a source-route through a traditional network, the only difference being that the "route" is a mesh rather than a linear path.
5.3.3 Functionality Layer
This top-most overlay corresponds to the actual set of nodes and links that make up an end-to-end path. The path could be selected from the connectivity layer by a standard reservation protocol like RSVP or a QoS-based signalling protocol. In our case, an Active Signalling (AS) protocol selects a sequence of nodes with sufficient resources and decides what application-specific code to run on each. AS first sends a PATH message (analogous to RSVP's PATH message) from the source node to the sink node via the connectivity subgraph. This PATH message specifies a combination of Link Rules (Li) and Node Rules (Nj), where the former identify the required characteristics of the links along the end-to-end path, and the latter specify the necessary attributes of the nodes along this path. Since in general we do not know the number of nodes along the path ahead of time, the PATH message is really given by a regular expression, and we can think of AS as implementing a Deterministic Finite Automaton (DFA), of which an example is shown in Figure 22.
Figure 22. Deterministic finite automaton for the path rule: SRC (L1 N1)* L2 N2 (L3 N3)* L4 SINK
Whenever AS successfully applies a Link Rule or a Node Rule, it proceeds to the next link or node, and changes its current state of the DFA. Note that although this DFA is deterministic in terms of the alphabets of Node Rules and Link Rules, each link property and node capability could match several Link Rules and Node Rules, respectively. Therefore we still have non-determinism in the DFA. Whenever AS encounters this non-determinism, it spawns copies of the internal state of the PATH message to follow the non-deterministic branches in parallel. Eventually, multiple PATH messages may arrive at the sink, in which case the sink selects the best one; for example, the one that supports the highest quality or lowest cost data stream.
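To make the rule-matching step concrete, the following C sketch shows how a PATH message might advance the DFA over one (link, node) hop. The rule encoding and the helper predicates are illustrative assumptions, not the actual AS protocol data structures.

    struct link;                 /* measured link properties (bandwidth, loss, ...) */
    struct node;                 /* node capabilities (codecs, CPU, memory, ...) */

    #define MAX_TRANS 4

    struct transition {
        int link_rule;           /* Li that the next link must satisfy */
        int node_rule;           /* Nj that the next node must satisfy */
        int next_state;
    };

    struct dfa_state {
        struct transition t[MAX_TRANS];
        int ntrans;
    };

    /* Hypothetical predicates and send routine. */
    extern int  link_matches(int link_rule, const struct link *l);
    extern int  node_matches(int node_rule, const struct node *n);
    extern void forward_path_msg(const struct node *n, int next_state);

    /* Advance a PATH message over one (link, node) hop of the connectivity
     * subgraph.  Every transition that matches spawns its own copy of the
     * message, which is the non-determinism described above. */
    static void as_step(const struct dfa_state *states, int cur_state,
                        const struct link *l, const struct node *n)
    {
        const struct dfa_state *s = &states[cur_state];
        int i;

        for (i = 0; i < s->ntrans; i++)
            if (link_matches(s->t[i].link_rule, l) &&
                node_matches(s->t[i].node_rule, n))
                forward_path_msg(n, s->t[i].next_state);
    }

Copies that reach an accepting (SINK) state become candidate end-to-end paths, and the sink picks among them as described above.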
5.3.4 Overall Scalability
Alongside our prototype implementation of PathFinder's connectivity and functionality layers, we also implemented a simulator for the substrate layer in order to evaluate the scalability of our algorithm. For search depth 1 and K = 5 weak neighbors per node, we ran our experiments increasing the total number of nodes N in the network up to 1000 nodes. The average ratio of the weak neighbor nodes to the optimal set of nearest neighbor nodes (which we want to maximize) was consistently at 95%. On the other hand, the average ratio of nodes visited by the algorithm to the total number of nodes in the overlay, Rv, is a metric we want to minimize. We believe that the number of nodes we visit is roughly proportional to the diameter of the network, since the exploration for neighbors proceeds approximately along a straight line from the starting point to the nearest neighbors; in other words, Rv is proportional to 1/√N. We observed that Rv monotonically decreased with the total number of nodes, from around 90% for a handful of nodes to 10% for 1000 total nodes, and roughly followed the 1/√N curve in its distribution. To strike a balance between introducing sufficient redundancy into the overlay topology (for increased protection against node failures) and the added cost of visiting more nodes, we varied the out-degree parameter of weak neighbors, K. Our results show that there is no added benefit to increasing K beyond 5 neighbors, which is when the algorithm reaches saturation.
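As a concrete illustration of the weak/strong neighbor selection evaluated above, the following minimal C sketch keeps the K nearest candidates discovered during the breadth-first exploration as weak neighbors and then promotes the closest one to strong neighbor. The names are illustrative, and the distance field stands in for a measured quantity such as latency.

    #define K 5

    struct neighbor { int id; double dist; };    /* dist: measured, e.g. RTT */

    /* Consider one newly discovered candidate; keep only the K nearest. */
    static void consider_candidate(struct neighbor weak[K], int *nweak,
                                   struct neighbor cand)
    {
        int i, worst = 0;

        if (*nweak < K) { weak[(*nweak)++] = cand; return; }
        for (i = 1; i < K; i++)
            if (weak[i].dist > weak[worst].dist)
                worst = i;
        if (cand.dist < weak[worst].dist)
            weak[worst] = cand;                  /* replace the farthest weak neighbor */
    }

    /* After the weak set converges, the closest weak neighbor becomes strong. */
    static struct neighbor pick_strong(const struct neighbor weak[K], int nweak)
    {
        int i, best = 0;
        for (i = 1; i < nweak; i++)
            if (weak[i].dist < weak[best].dist)
                best = i;
        return weak[best];
    }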
5.4. Towards a Robust Control Plane
To orient the problem at hand, we first address the issue of control plane robustness at a high level. We view the robustness of a particular control plane protocol in two parts: scalability and resilience. Formulating the first means that a control plane protocol must be able to scale with network growth and perform gracefully under unexpectedly high loads. Currently, a typical core router on the present-day Internet has 16-32 ports with STS-48 links operating at 2.5 Gbps, 10 to 50 neighboring routers, and around 100,000 entries in its routing table. To sustain future growth, we contend that control plane protocols should be able to scale by an order of magnitude. This would entail supporting routers with hundreds or possibly thousands of ports, with STS-768 links operating at approximately 40 Gbps, thousands of neighbors (think of mobile devices), and one million entries in the routing table. The solution we propose lies along two dimensions. The first is to parallelize the control protocols to run on multiprocessors. If the parallelization algorithm is scalable, then so is the protocol. The second solution is to distribute portions of the control plane processing hierarchically, among the lower execution levels. One example would be to install filters and flip a switch to turn them on to process BGP packets at lower levels if an anomaly in BGP traffic is detected. A more general solution would be to invoke a mechanism such as Thrill to push down a BGP forwarding function to, say, the StrongARM or the MicroEngines. The main research question here is how to balance the cost of this distribution in maintaining protocol consistency across the system hierarchy.
The second part of the overall robustness problem is how to guarantee that a control plane protocol is resilient in the face of malicious attacks (such as DDoS) or implementation bugs. This is an issue for all control plane protocols; namely, signalling protocols such as RSVP and ASP, routing protocols such as RIP, OSPF, EGP, and BGP, as well as network management protocols such as SNMP. Our proposed solution is to provide a set of tool kits that would be successively applied to new or existing protocol implementations, as shown in Figure 23. Some possibilities would be to annotate the code to detect extraordinary events, insert memory leak checks, and tag certain parts of the code to raise security red flags. These tool kits may be implemented via compile-time directives or RPC calls.
Figure 23. Making a protocol implementation resilient to malicious attacks or implementation bugs
The strength of the proposed solution lies in the fact that it should not solely apply to control protocols but can be generalized to the entire protocol stack. Therefore, we can experiment to evaluate this approach on any data protocol. We have chosen TCP as an example and are currently investigating this approach in three steps. Our first step is to gather statistics with regard to the frequency of occurrence of different events, and under what conditions, for commonly taken forwarding paths. This provides a reference point for the various metrics and the ratios of the common versus the anomalous paths within each one. The second step is to instrument these metrics into the code. For example, if a leaky bucket counts the rate at which anomalies occur, it would be easy to catch a DDoS attack. The final step entails programming an appropriate action into the code for each suspicious scenario. The simplest action would be to drop the packet. More sophisticated techniques would delay and queue the packet, or set up a filter at an upstream router in the packet flow. The upstream filter could be in the form of a code module already running on the upstream router, in which case a packet is sent to turn the filtering on; or, the control protocol may send packets to actively inject the filtering module into the upstream router.
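As an illustration of the second step, a leaky-bucket anomaly counter of the kind mentioned above could be sketched in C as follows; the rates and depths are illustrative, not values from our ongoing investigation.

    #include <stdint.h>

    struct leaky_bucket {
        double   level;          /* current bucket fill, in "anomalies" */
        double   drain_per_sec;  /* tolerated sustained anomaly rate */
        double   depth;          /* tolerated burst of anomalies */
        uint64_t last_us;        /* timestamp of the previous update */
    };

    /* Record one anomalous event at time now_us; returns 1 if the flow should
     * be flagged as suspicious (e.g. a possible DDoS). */
    static int lb_anomaly(struct leaky_bucket *b, uint64_t now_us)
    {
        double elapsed = (now_us - b->last_us) / 1e6;
        b->last_us = now_us;

        b->level -= elapsed * b->drain_per_sec;      /* drain since last event */
        if (b->level < 0.0)
            b->level = 0.0;

        b->level += 1.0;                             /* the new anomaly */
        return b->level > b->depth;                  /* sustained rate too high */
    }

The third step would then attach an action to a positive return value, from simply dropping the packet to signalling an upstream filter as described above.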
6. Conclusion
We have described our effort to build an extensible router in support of active networks using COTS hardware and an open operating system. Our design supports the injection of new functionality into a router in a transparent fashion.
The central idea is to logically partition the router into a hierarchy of execution levels, where hardware and software are partitioned in concert. While the basic functionality at every execution level remains the same—forwarding and computing on packets—each level has its own set of requirements and limits on system resources and code security. As such, we were able to introduce a unified framework addressing the central issues of resource allocation and scheduling across the entire execution level hierarchy. At the network processor level, we demonstrate how to safely run extensions despite the stringent requirement of running at line speed. This is achieved by statically allocating the router's resources between the fixed infrastructure required by every packet and any extensions loaded onto the processor, as well as by separating the data plane from the control plane and running the latter at higher execution levels. At the kernel level on the general-purpose processor, we contribute four major ideas to router design. First, we encapsulate a particular flow of data, with all its communication, service and resource requirements, into a Linux kernel module, called SILK, thereby making it possible to deploy the system on any Linux machine. This is accomplished via an elaborate threading mechanism between SILK and Linux, which, apart from granting complete scheduling flexibility to SILK, allows this flow scheduling to be extended into user space. The second contribution is to abstract the interfaces of all devices within SILK into a unified framework, called Geni, thereby providing a uniform interface for end-to-end flows. Third, we design and implement a resource scheduling strategy within SILK that simultaneously maximizes the throughput of best effort packets, provides different levels of service to QoS packets, exhibits robust behavior in the presence of varying workloads (such as packet flooding DoS attacks), and supports forwarders of varying computational costs. The fourth contribution is the design of the NodeOS such that it takes advantage of the entire hierarchical framework of our architecture, while providing an API that is portable among different EEs from above, and distinct NodeOS implementations from below. As for the control plane, our three key contributions are subsumed in the three respective mechanisms that encapsulate them. The first mechanism provides the capability to dynamically inject forwarding code into the router; that is, we employ an interface that hides the router hierarchy from the application. The second is a mechanism that dynamically configures a path from the available (or injected) choice of protocol modules for a particular flow. Lastly, PathFinder employs a three-level overlay approach over the entire network to construct an end-to-end path for media applications, thereby presenting a novel algorithm for constructing a self-organizing overlay network.
References

[1] Active Networks Security Working Group. Security architecture for active nets. Available as ftp://www.ftp.tislabs.com/pub/activenets/secarch2.ps, July 1998.
[2] D. S. Alexander, M. Shaw, S. M. Nettles, and J. M. Smith. Active Bridging. In Proceedings of the ACM SIGCOMM '97 Conference, pages 101–111, September 1997.
[3] Alteon WebSystems, Inc., San Jose, California. ACEnic Server-Optimized 10/100/1000 Mbps Ethernet Adapters Datasheet, August 1999.
[4] AN NodeOS Working Group. NodeOS interface specification. Available as http://www.cs.princeton.edu/nsg/papers/nodeos.ps, January 2002.
[5] D. Andersen, H. Balakrishnan, F. Kaashoek, and R. Morris. Resilient Overlay Networks. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, pages 131–145, October 2001.
[6] F. Baker. Requirements for IP Version 4 Routers; RFC 1812. Internet Request for Comments, June 1995.
[7] J. C. R. Bennett and H. Zhang. Hierarchical Packet Fair Queueing Algorithms. In Proceedings of the ACM SIGCOMM '96 Conference, pages 143–156, August 1996.
[8] S. Bhattacharjee, K. Calvert, and E. Zegura. Congestion control and caching in CANES. In ICC '98, 1998.
[9] B. Braden, A. Cerpa, T. Faber, B. Lindell, G. Phillips, J. Kann, and V. Shenoy. Introduction to the ASP execution environment (v1.5). Available as http://www.isi.edu/active-signal/ARP/DOCUMENTS/ASP_EE.ps, November 2001.
[10] K. Calvert. Architectural framework for active networks. Available as http://www.cs.gatech.edu/projects/canes/papers/arch1-0.ps.gz, July 1999.
[11] Y. Chawathe, S. McCanne, and E. Brewer. An Architecture for Internet Content Distribution as an Infrastructure Service. http://yatin.chawathe.com/yatin/papers.
[12] Y.-H. Chu, S. G. Rao, and H. Zhang. A Case For End System Multicast. In Proceedings of the ACM SIGMETRICS 2000 Conference, pages 1–12, June 2000.
[13] M. Dasen, G. Fankhauser, and B. Plattner. An Error Tolerant, Scalable Video Stream Encoding and Compression for Mobile Computing. In Proceedings of ACTS Mobile Summit 96, pages 762–771, November 1996.
[14] B. Davie and Y. Rekhter. MPLS: Technology and Applications. Morgan Kaufmann Publishers, Inc., 2000.
[15] D. Decasper, Z. Dittia, G. Parulkar, and B. Plattner. Router plugins: A software architecture for next generation routers. IEEE/ACM Transactions on Networking, 8(1):2–15, February 2000.
[16] M. Degermark, A. Brodnik, S. Carlsson, and S. Pink. Small forwarding tables for fast routing lookups. In Proceedings of the ACM SIGCOMM '97 Conference, pages 3–14, September 1997.
[17] B. Ford, G. Back, G. Benson, J. Lepreau, A. Lin, and O. Shivers. The Flux OSKit: A substrate for OS and language research. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, pages 38–51, St. Malo, France, October 1997.
[18] Y. Gottlieb and L. Peterson. A comparative study of extensible routers. In Proceedings of the Open Architectures 2002 Conference, May 2002. To appear.
[19] M. Hicks, P. Kakkar, J. T. Moore, C. A. Gunter, and S. Nettles. PLAN: A packet language for active networks. In Proceedings of the 3rd ACM SIGPLAN International Conference on Functional Programming, pages 86–93, September 1998.
[20] IBM Microelectronics Division. IBM PowerNP NP4GS3 Network Processor Solutions Product Overview, April 2001.
[21] IEEE. Standard 802.3, October 2000.
[22] InfiniBand Trade Association. InfiniBand Architecture Specification, Release 1.0, October 2000.
[23] Intel Corporation. IXP1200 Network Processor Datasheet, September 2000.
[24] Intel Corporation. IXP12EB Intel IXP1200 Network Processor Ethernet Evaluation Kit Product Brief, 2000.
[25] Intelligent I/O (I2O) Special Interest Group. Intelligent I/O (I2O) Architecture Specification, Version 2.0, March 1999.
[26] M. F. Kaashoek, D. R. Engler, G. R. Ganger, H. Briceno, R. Hunt, D. Mazieres, T. Pinckney, R. Grimm, J. Jannotti, and K. Mackenzie. Application performance and flexibility on exokernel systems. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, pages 52–65, St. Malo, France, October 1997.
[27] S. Karlin and L. Peterson. VERA: An Extensible Router Architecture. Computer Networks, 38(3):277–293, 2002.
[28] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The Click modular router. ACM Transactions on Computer Systems, 18(3):263–297, August 2000.
[29] T. V. Lakshman and D. Stiliadis. High Speed Policy-based Packet Forwarding Using Efficient Multi-dimensional Range Matching. In Proceedings of the ACM SIGCOMM '98 Conference, pages 203–214, September 1998.
[30] H. C. Lauer and R. M. Needham. On the Duality of Operating System Structures. Operating Systems Review, 13(2):3–19, April 1979.
[31] J. C. Mogul and K. K. Ramakrishnan. Eliminating Receive Livelock in an Interrupt-Driven Kernel. ACM Transactions on Computer Systems, 15(3):217–252, August 1997.
[32] J. T. Moore, M. Hicks, and S. Nettles. Practical programmable packets. In Proceedings of the Twentieth IEEE Computer and Communication Society INFOCOM Conference, pages 41–50, April 2001.
[33] D. Mosberger and L. L. Peterson. Making Paths Explicit in the Scout Operating System. In Proceedings of the Second USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 153–167, Seattle, WA, USA, October 1996.
[34] S. Murphy, E. Lewis, R. Puga, R. Watson, and R. Yee. Strong security for active networks. In Proceedings of the Open Architectures 2001 Conference, pages 1–8, Anchorage, AK, USA, April 2001.
[35] A. Nakao and L. Peterson. PathFinder: Multi-level overlays for constructing end-to-end active services. November 2001.
[36] A. Nakao, L. Peterson, and A. Bavier. Constructing End-to-End Paths for Playing Media Objects. In Proceedings of the Open Architectures 2001 Conference, pages 117–128, Anchorage, AK, USA, April 2001.
[37] V. Paxson. Automated Packet Trace Analysis of TCP Implementations. In Proceedings of the ACM SIGCOMM '97 Conference, pages 167–179, September 1997.
[38] L. Peterson, Y. Gottlieb, M. Hibler, P. Tullmann, J. Lepreau, S. Schwab, H. Dandekar, A. Purtell, and J. Hartman. An OS Interface for Active Routers. IEEE Journal on Selected Areas in Communications, 19(3):473–487, March 2001.
[39] L. L. Peterson, S. C. Karlin, and K. Li. OS Support for General-Purpose Routers. In Proceedings of the 7th Workshop on Hot Topics in Operating Systems (HotOS-VII), March 1999.
[40] X. Qie, A. Bavier, L. Peterson, and S. C. Karlin. Scheduling Computations on a Software-Based Router. In Proceedings of the ACM SIGMETRICS 2001 Conference, pages 13–24, June 2001.
[41] RAMiX Incorporated, Ventura, California. PMC/CompactPCI Ethernet Controllers Product Family Data Sheet, 1999.
[42] S. Savage, T. Anderson, A. Aggarwal, D. Becker, N. Cardwell, A. Collins, E. Hoffman, J. Snell, A. Vahdat, G. Voelker, and J. Zahorjan. Detour: A Case for Informed Internet Routing and Transport. IEEE Micro, 19(1):50–59, January 1999.
[43] N. Shalaby, Y. Gottlieb, and L. Peterson. SNOW on SILK: A portable NodeOS interface for active services. Technical Report TR–641–02, Department of Computer Science, Princeton University, January 2002.
[44] J. M. Smith, K. L. Calvert, S. L. Murphy, H. K. Orman, and L. L. Peterson. Activating Networks: A Progress Report. IEEE Computer, 32(4):32–41, April 1999.
[45] T. Spalink, S. Karlin, L. Peterson, and Y. Gottlieb. Building a Robust Software-Based Router Using Network Processors. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, pages 216–229, Banff, Alberta, Canada, October 2001.
[46] O. Spatscheck, J. Hansen, J. Hartman, and L. Peterson. Optimizing TCP Forwarder Performance. IEEE/ACM Transactions on Networking, 8(2):146–157, April 2000.
[47] V. Srinivasan and G. Varghese. Fast address lookups using controlled prefix expansion. ACM Transactions on Computer Systems, 17(1):1–40, February 1999.
[48] V. Srinivasan, G. Varghese, S. Suri, and M. Waldvogel. Fast and Scalable Level Four Switching. In Proceedings of the ACM SIGCOMM '98 Conference, pages 191–202, September 1998.
[49] D. Tennenhouse and D. Wetherall. Towards an active network architecture. In Multimedia Computing and Networking 96, January 1996.
[50] Vitesse Semiconductor Corporation, Longmont, Colorado. IQ2000 Network Processor Product Brief, 2000.
[51] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner. Scalable High Speed IP Routing Lookups. In Proceedings of the ACM SIGCOMM '97 Conference, pages 25–36, October 1997.
[52] M. Wawrzoniak, N. Shalaby, and L. Peterson. Geni: A Generic Interface Abstraction for End-to-End Flows. Technical Report TR–642–02, Department of Computer Science, Princeton University, January 2002.
[53] D. Wetherall. Active network vision and reality: lessons from a capsule-based system. In Proceedings of the 17th ACM Symposium on Operating Systems Principles, pages 64–79, December 1999.
[54] D. Wetherall, J. Guttag, and D. Tennenhouse. ANTS: A toolkit for building and dynamically deploying network protocols. In IEEE OPENARCH 98, San Francisco, CA, April 1998.