Linux Software Router: Data Plane Optimization and Performance Evaluation
Raffaele Bolla and Roberto Bruschi
DIST - Department of Communications, Computer and Systems Science, University of Genoa
Via Opera Pia 13, 16145 Genoa, Italy
Email: {raffaele.bolla, roberto.bruschi}@unige.it
Abstract - Recent technological advances provide an excellent opportunity to achieve truly effective results in the field of open Internet devices, also known as Open Routers or ORs. Even though some initiatives have been undertaken over the last few years to investigate ORs and related topics, other extensive areas still require additional investigation. In this contribution we report the results of the in-depth optimization and testing carried out on a PC Open Router architecture based on Linux software and COTS hardware. The main focus of this paper is the evaluation of the forwarding performance of different OR Linux-based software architectures. This analysis was performed with both external (throughput and latencies) and internal (profiling) measurements. In particular, for the external measurements, a set of RFC 2544 compliant tests is also proposed and analyzed.

Index Terms - Linux Router; Open Router; RFC 2544; IP forwarding.
I. INTRODUCTION
Internet technology has been developed in an open environment, and all Internet-related protocols, architectures and structures are publicly created and described. For this reason, in principle, everyone can “easily” develop an Internet device (e.g., a router). On the contrary, and to a certain extent quite surprisingly, most professional devices are developed in an extremely “closed” manner. In fact, it is very difficult to acquire details about their internal operations and to perform anything more complex than a parametrical configuration. From a general viewpoint, this is not very strange, since it can be considered a clear attempt to protect the industrial investment. However, the “experimental” nature of the Internet and its diffusion in many contexts might sometimes suggest a different approach. Such a need is even more evident within the scientific community, which often runs into various problems when carrying out experiments, testbeds and trials to evaluate new functionalities and protocols. Today, recent technological advances provide an opportunity to do something truly effective in the field of open Internet devices, sometimes called Open Routers (ORs). Such an opportunity arises from the use of Open Source Operating Systems (OSs) and COTS/PC components. The attractiveness of the OR solution can be summarized as: multi-vendor availability, low cost, and continuous updating/evolution of the basic parts. As far as performance is concerned, the PC architecture is
general-purpose, which means that, in principle, it cannot attain the same performance level as custom, high-end network devices, which often use dedicated HW elements to handle and to parallelize the most critical operations. Nevertheless, the performance gap might not be so large and, in any case, may be more than justified by the cost differences. Our activities, carried out within the framework of the BORA-BORA project [1], are geared to facilitate this investigation by reporting the results of an extensive optimization and testing operation carried out on an OR architecture based on Linux software. We focused our attention mainly on the packet forwarding functionalities. Our main objective was the performance evaluation of an optimized OR by means of both external (throughput and latencies) and internal (profiling) measurements. In this regard, we identified a high-end reference PC-based hardware architecture and the Linux kernel 2.6 for the software data plane. Subsequently, we optimized this OR structure, defined a test environment and finally developed a complete series of tests with an accurate evaluation of the software modules' role in defining the performance limits. With regard to the state-of-the-art of OR devices, some initiatives have been undertaken over the last few years to develop and investigate ORs and related topics. In the software area, one of the most important initiatives is the Click Modular Router Project [2], which proposes an effective data plane solution. In the control plane area two important projects can be cited: Zebra [3] and Xorp [4]. Besides these custom developments, some standard Open Source OSs can also provide very effective support for an OR project. The most relevant OSs in this sense are Linux [5][6] and FreeBSD [7]. Other activities focus on hardware: [8] and [9] propose a router architecture based on a PC cluster, while [10] reports some performance results (in packet transmission and reception) obtained with a PC Linux-based testbed. Some evaluations have also been carried out on network boards (see, for example, [11]). Other fascinating projects involving Linux-based ORs can be found in [12] and [13], where Bianco et al. report some interesting performance results. In [14] a performance analysis of an OR architecture enhanced with FPGA line cards, which allow direct NIC-to-NIC packet forwarding, is introduced. [15] describes the Intel
I/OAT, a technology that enables DMA engines to improve network reception and transmission by offloading some low-level operations from the CPU. In [16] the virtualization of a multiservice OR architecture is discussed: the authors propose multiple Click forwarding chains virtualized with Xen. Finally, in [17], we proposed an in-depth study of the IP lookup mechanism included in the Linux kernel.
The paper is organized as follows. The software and hardware details of the proposed OR architecture are reported in Sections II and III, respectively, while Section IV contains a description of the performance tuning and optimization techniques. The benchmarking scenario and the performance results are reported in Sections V and VI, respectively. Conclusions are presented in Section VII.
II. LINUX OR SOFTWARE ARCHITECTURE
The OR architecture has to provide many different types of functionalities: from those directly involved in the packet forwarding process to the ones needed for control functionalities, dynamic configuration and monitoring. As outlined in [5], [18] and [19], all the forwarding functions are developed inside the Linux kernel, while most of the control and monitoring operations (the signaling protocols, such as routing protocols, control protocols, etc.) are daemons/applications running in user mode. As in the older kernel versions, the Linux networking architecture is essentially based on an interrupt mechanism: network boards signal the kernel upon packet reception or transmission through HW interrupts. Each HW interrupt is served as soon as possible by a handling routine, which suspends the operations currently being processed by the CPU. Until it has completed, the handling routine cannot be interrupted by anything, not even by other interrupt handlers. Thus, with the clear purpose of keeping the system reactive, the interrupt handlers are designed to be very short, while all the time-consuming tasks are performed afterwards by so-called “Software Interrupts” (SoftIRQs). This is the well-known “top half–bottom half” IRQ routine division implemented in the Linux kernel [18]. SoftIRQs are not real interrupts, but rather a form of kernel activity that can be scheduled for later execution. They differ from HW IRQs mainly in that a SoftIRQ is scheduled for execution by a kernel activity, such as an HW IRQ routine, and has to wait until it is called by the scheduler. SoftIRQs can be interrupted only by HW IRQ routines. The “NET_TX_SOFTIRQ” and the “NET_RX_SOFTIRQ” are two of the most important SoftIRQs in the Linux kernel and the backbone of the entire networking architecture, since they are designed to manage the packet transmission and reception operations, respectively. In detail, the forwarding process is triggered by an HW IRQ generated by a network device, which signals the reception or the transmission of packets. Then the corresponding routine performs some fast checks, and
schedules the correct SoftIRQ, which is activated by the kernel scheduler as soon as possible. When the SoftIRQ is finally executed, it performs all the packet forwarding operations. As shown in Figure 1, which reports a scheme of the Linux source code involved in the forwarding process, the operations computed during SoftIRQs can be organized in a chain of three different modules: a “reception API” that handles packet reception (NAPI¹), a module that carries out the IP layer elaboration and, finally, a “transmission API” that manages the forwarding operations to the egress network interfaces. In particular, the reception and the transmission APIs are the lowest-level modules, and are activated by both HW IRQ routines and scheduled SoftIRQs. They handle the network interfaces and perform some layer 2 functionalities. The NAPI [20] was introduced in the 2.4.27 kernel version, and was explicitly created to increase the scalability of the reception process. It handles network interface requests with an interrupt moderation mechanism, through which it is possible to adaptively switch from a classical interrupt-driven management of the network interfaces to a polling one. In greater detail, this is accomplished by inserting the identifier of the board generating the IRQ into a special list, called the “poll list”, during the HW IRQ routine, scheduling a reception SoftIRQ, and disabling the HW IRQs for that device. When the SoftIRQ is activated, the kernel polls all the devices whose identifier is included in the poll list, and a maximum of quota packets are served per device. If the board buffer (Rx Ring) is emptied, then the identifier is removed from the poll list and its HW IRQs are re-enabled. Otherwise, its HW IRQs are left disabled, the identifier remains on the poll list and another SoftIRQ is scheduled. While this mechanism behaves like a pure interrupt mechanism in the presence of a low ingress rate (i.e., we have more or less one HW IRQ per packet), when the traffic increases, the probability of emptying the Rx Ring, and thus of re-enabling the HW IRQs, decreases more and more, and the NAPI starts working like a polling mechanism. For each packet received during the NAPI processing, a descriptor, called skbuff [21], is immediately allocated. In particular, as shown in Figure 1, to avoid unnecessary and tedious memory transfer operations, the packets are left in the memory locations used by the DMA engines of the ingress network interfaces, and each subsequent operation is performed using the skbuffs. These descriptors in fact consist of pointers to the different key fields of the headers contained in the associated packets, and are used for all the layer 2 and 3 operations. A packet is elaborated within the same NET_RX SoftIRQ until it is enqueued in an egress device buffer, called Qdisc. Each time a NET_TX SoftIRQ is activated or a new packet is enqueued, the Qdisc buffer is served.
¹ In greater detail, the NAPI architecture includes a part of the interrupt handler.
[Figure 1 diagram: the chain of kernel functions involved in forwarding - NAPI reception (interrupt handler, e1000_clean_rx_irq, alloc_skb, netif_receive_skb, net_rx_action), IP processing (ip_rcv, ip_rcv_finish, ip_route_input, rt_hash_code, ip_forward, ip_forward_finish, ip_send with Netfilter hook, ip_output, ip_finish_output) and TX-API transmission (dev_queue_xmit, Qdisc, qdisc_restart, hard_start_xmit, e1000_xmit_frame, e1000_clean_tx_irq, net_tx_action, kfree), together with the per-CPU poll queue, the Rx/Tx rings, the completion queue, the DMA engines and kernel memory.]
Figure 1. Detailed scheme of forwarding code in 2.6 Linux kernel versions.
When a packet is dequeued from the Qdisc buffer, it is placed on the Tx Ring of the egress device. After the board successfully transmits one or more packets, it generates an HW IRQ, whose routine schedules a NET_TX SoftIRQ. The Tx Ring is periodically cleaned of the descriptors of transmitted packets, which are de-allocated, and is refilled with the packets coming from the Qdisc buffer. Another interesting characteristic of the 2.6 kernels (introduced to reduce the performance deterioration due to CPU concurrency) is the Symmetric Multi-Processor (SMP) support, which may assign the management of each network interface to a single CPU for both the transmission and reception functionalities.
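To illustrate the NAPI dynamics described above, the following self-contained C sketch models the quota-based poll loop in user space. It is only a toy model under the stated assumptions, not the actual kernel or e1000 driver code; all names (toy_dev, toy_poll, etc.) are hypothetical.

#include <stdbool.h>
#include <stdio.h>

/* Toy user-space model of the NAPI poll loop: the "Rx Ring" is just a
 * counter of pending packets, and quota bounds how many packets one
 * SoftIRQ invocation may serve per device. */
struct toy_dev {
    const char *name;
    int rx_pending;    /* packets sitting in the Rx Ring                 */
    bool irq_enabled;  /* HW IRQs are re-enabled only when ring is empty */
};

/* Returns the number of packets served in this poll round. */
static int toy_poll(struct toy_dev *dev, int quota)
{
    int done = 0;

    while (done < quota && dev->rx_pending > 0) {
        dev->rx_pending--;   /* "forward" one packet (IP processing + Qdisc) */
        done++;
    }

    if (dev->rx_pending == 0) {
        /* Ring emptied: leave the poll list, back to pure interrupt mode. */
        dev->irq_enabled = true;
    } else {
        /* Ring still backlogged: stay on the poll list, IRQs stay off and
         * another NET_RX_SOFTIRQ is scheduled -> polling-like behaviour. */
        dev->irq_enabled = false;
    }
    return done;
}

int main(void)
{
    struct toy_dev eth0 = { "eth0", 150, false };
    int round = 0, quota = 64;

    while (eth0.rx_pending > 0) {
        int served = toy_poll(&eth0, quota);
        printf("%s round %d: served %d, pending %d, irq %s\n",
               eth0.name, ++round, served, eth0.rx_pending,
               eth0.irq_enabled ? "on" : "off");
    }
    return 0;
}

At low ingress rates the ring is emptied at every round and the loop behaves like one interrupt per packet; at high rates the backlog keeps the device on the poll list, which is exactly the adaptive switch to polling described in the text.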
III. HARDWARE ARCHITECTURE
The Linux OS supports many different hardware architectures, but only a small portion of them can be effectively used to obtain high OR performance. In particular, we must take into account that, during networking operations, the PC internal data path has to use a centralized I/O structure consisting of the I/O bus, the memory channel (used by the DMA engines to transfer data from the network interfaces to the RAM and vice versa) and the Front Side Bus (FSB) (used by the CPU, together with the memory channel, to access the RAM during packet elaboration). The selection criteria for the hardware elements were very fast internal buses, RAM with very low access times, and CPUs with high integer computational power (packet processing does not generally require any floating-point operations).
In order to understand how the hardware architecture affects the overall system performance, we selected two different architectures that represent the current state-of-the-art of server architectures and the state-of-the-art of 3 years ago, respectively. In this regard, as the old HW architecture, we chose a system based on the Supermicro X5DL8-GG mainboard: it can support a dual-Xeon system with a dual memory channel and a 64-bit PCI-X bus at 133 MHz. The Xeon processors (32-bit and single-core) we utilized have a 2.4 GHz clock and a 512 KB cache. For the new OR architecture we used a Supermicro X7DBE mainboard, equipped with both PCI-Express and PCI-X buses and with an Intel Xeon 5050 (a dual-core, 64-bit processor). Network interfaces are another critical element, since they can heavily affect PC router performance. As reported in [11], the network adapters on the market offer different performance levels and configurability. With this in mind, we selected two types of adapters with different features and speeds: a high-performance and configurable Gigabit Ethernet interface, namely the Intel PRO 1000, equipped with either a PCI-X controller (XT version) or a PCI-Express controller (PT version) [22], and a D-Link DFE-580TX [23], a network card equipped with four Fast Ethernet interfaces and a PCI 2.1 controller.
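As a rough back-of-the-envelope reference (our own figures under the stated assumptions, not values reported by the authors), the peak bandwidth of the old architecture's I/O bus follows directly from its width and clock:

B_{PCI-X} = 64\,\mathrm{bit} \times 133 \times 10^{6}\,\mathrm{Hz} \approx 8.5\ \mathrm{Gbit/s} \approx 1.06\ \mathrm{GB/s}

Since each forwarded packet crosses the centralized I/O bus twice (once when the ingress DMA engine writes it into RAM and once when the egress DMA engine reads it back), only about half of this raw figure is available to the aggregate forwarding rate, before any arbitration and protocol overheads are considered.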
IV. SOFTWARE PERFORMANCE TUNING
The entire Linux networking kernel architecture is quite complex and has numerous aspects and parameters that can be tuned for system optimization. In particular, since in this environment the OS has been developed to act
as a network host (i.e., workstation, server, etc.), it is natively tuned for “general purpose” network end-node usage. In the latter case, packets are not fully processed inside kernel-space, but are usually delivered from the network interfaces to applications in user-space, and vice versa. When the Linux kernel is used in an OR architecture, it generally works in a different manner, and should be specifically tuned and customized to obtain the maximum packet forwarding performance. As reported in [19] and [25], where a more detailed description of the adopted tuning actions can be found, this optimization is very important for obtaining maximum performance. Some of the optimal parameter values can be identified through logical considerations, but most of them have to be determined empirically, since their optimal values cannot easily be derived from the software structure and since they also depend on the hardware components. So we carried out our tuning first by identifying the critical elements on which to operate, and then by finding the most convenient values with both logical considerations and experimental measurements.
As far as the adopted tuning settings are concerned, we used the 6.3.9 e1000 driver [24], configured with both the Rx and Tx ring buffers set to 256 descriptors, while the Rx interrupt generation was not limited. The Qdisc size for all the adapters was dimensioned to 20,000 descriptors, while the scheduler clock frequency was fixed at 100 Hz. Moreover, the 2.6.16.13 kernel images used to obtain the numerical results in Section VI include two structural patches that we created to test and/or optimize kernel functionalities. These patches are described in the following discussion.

A. Skbuff Recycling patch
We studied and developed a new version of the skbuff Recycling patch, originally proposed by R. Olsson [26] for the “e1000” driver. In particular, the new version is stabilized for the 2.6.16.13 kernel version and extended to the “sundance” driver. This patch intercepts the skbuff descriptors of transmitted packets before they are de-allocated, and reuses them for new incoming packets. As shown in [19], this architectural change significantly reduces the computational weight of the memory management operations, thus attaining a very high performance level (i.e., about 150-175% of the maximum throughput of standard kernels).
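To make the recycling principle concrete, the following simplified user-space C sketch shows the idea of a descriptor free-list reused across transmissions and receptions. It is only a hedged illustration of the concept, not the actual patch; all names (buf_desc, recycle_pool, etc.) are hypothetical.

#include <stdio.h>
#include <stdlib.h>

/* Toy model of the skbuff recycling idea: descriptors of transmitted
 * packets are parked on a free list and reused for new receptions,
 * instead of being freed and re-allocated for every packet. */
struct buf_desc {
    char data[2048];          /* stands in for the packet buffer */
    struct buf_desc *next;    /* free-list link                   */
};

static struct buf_desc *recycle_pool = NULL;

/* Called on reception: reuse a recycled descriptor if one is available. */
static struct buf_desc *desc_alloc(void)
{
    if (recycle_pool) {
        struct buf_desc *d = recycle_pool;
        recycle_pool = d->next;
        return d;                               /* cheap path: no malloc */
    }
    return malloc(sizeof(struct buf_desc));     /* slow path             */
}

/* Called on transmit completion: intercept the descriptor before it
 * would be de-allocated and park it on the recycle list instead. */
static void desc_recycle(struct buf_desc *d)
{
    d->next = recycle_pool;
    recycle_pool = d;
}

int main(void)
{
    /* Forward a few "packets": allocate on Rx, recycle on Tx completion. */
    for (int i = 0; i < 5; i++) {
        struct buf_desc *d = desc_alloc();
        printf("packet %d uses descriptor %p\n", i, (void *)d);
        desc_recycle(d);    /* the same descriptor is reused next time */
    }
    return 0;
}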
B. Performance Counter patch
To further analyze the OR’s internal behavior, we decided to introduce a set of counters into the kernel source code, in order to understand how many times a certain procedure is called or how many packets are handled each time. Specifically, we introduced the following counters:
• IRQ: number of interrupt handlers generated by a network card;
• Tx/Rx IRQ: number of Tx/Rx IRQ routines per device;
• Tx/Rx SoftIRQ: number of Tx/Rx software IRQ routines;
• Qdiscrun and Qdiscpkt: number of times the output buffer (Qdisc) is served, and number of packets served each time;
• Pollrun and Pollpkt: number of times the Rx Ring of a device is served, and number of packets served each time;
• Tx/Rx clean: number of times the Tx/Rx procedures of the driver are activated.
The values of all these counters have been mapped into the Linux “proc” file system.
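As a hedged sketch of how such counters can be exported through the proc file system (this is not the authors' patch; the file name fwd_counters and the counter subset shown are illustrative, and a 2.6.16 kernel would use the older create_proc_entry()/read_proc interface rather than proc_create_single()):

#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/atomic.h>

/* Hypothetical counters, incremented from the networking hot paths,
 * e.g. in the HW IRQ handler: atomic64_inc(&cnt_rx_irq); */
static atomic64_t cnt_rx_irq;
static atomic64_t cnt_rx_softirq;
static atomic64_t cnt_poll_run;
static atomic64_t cnt_poll_pkt;

/* Dumps the counters when /proc/fwd_counters is read. */
static int fwd_counters_show(struct seq_file *m, void *v)
{
    seq_printf(m, "rx_irq     %lld\n", (long long)atomic64_read(&cnt_rx_irq));
    seq_printf(m, "rx_softirq %lld\n", (long long)atomic64_read(&cnt_rx_softirq));
    seq_printf(m, "poll_run   %lld\n", (long long)atomic64_read(&cnt_poll_run));
    seq_printf(m, "poll_pkt   %lld\n", (long long)atomic64_read(&cnt_poll_pkt));
    return 0;
}

static int __init fwd_counters_init(void)
{
    /* Exposes the counters as /proc/fwd_counters (name is illustrative). */
    proc_create_single("fwd_counters", 0444, NULL, fwd_counters_show);
    return 0;
}

static void __exit fwd_counters_exit(void)
{
    remove_proc_entry("fwd_counters", NULL);
}

module_init(fwd_counters_init);
module_exit(fwd_counters_exit);
MODULE_LICENSE("GPL");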
V. BENCHMARKING SCENARIO
To benchmark the OR forwarding performance, we used a professional device, the Agilent N2X Router Tester [27], which can be used to obtain throughput and latency measurements with high availability and accuracy levels (i.e., the minimum guaranteed timestamp resolution is 10 ns). Moreover, with two dual Gigabit Ethernet cards and one 16-port Fast Ethernet card, we can analyze the OR behavior with a large number of Fast and Gigabit Ethernet interfaces. To better support the performance analysis and to identify the OR bottlenecks, we also performed some internal measurements using specific software tools (called profilers) placed inside the OR, which trace the percentage of CPU utilization for each software module running on the node. The problem is that, for many of these profilers, the considerable computational effort required perturbs system performance, thus making the results not very meaningful. We verified with many different tests that one of the best is Oprofile [28], an open source tool that continuously monitors system dynamics with frequent and quite regular sampling of CPU hardware registers. Oprofile effectively evaluates the CPU utilization of each software application and of each single kernel function running in the system, with very low computational overhead. With regard to the benchmarking scenario, we decided to start by defining a reasonable set of test setups (with increasing levels of complexity) and, for each selected setup, to apply some of the tests defined in RFC 2544 [29]. In particular, we chose to perform these activities by using both a core and an edge router configuration: the former consists of a few high-speed (Gigabit Ethernet) network interfaces, while the latter utilizes a high-speed gateway interface and a large number of Fast Ethernet cards which collect traffic from the access networks. More specifically, we performed our tests by using the following setups (see Figure 2):
1) Setup A: a single mono-directional flow crosses the OR from one Gigabit port to another;
2) Setup B: two full-duplex flows cross the OR, each one using a different pair of Gigabit ports;
3) Setup C: a full-meshed (and full-duplex) traffic matrix applied to 4 Gigabit Ethernet ports;
4) Setup D: a full-meshed (and full-duplex) traffic matrix applied to 1 Gigabit Ethernet port and 12 Fast Ethernet interfaces.
In greater detail, each OR forwarding benchmarking session essentially consists of three test sets, namely:
a) Throughput and latency: this test set is performed by using constant bit rate (CBR) traffic flows, consisting of fixed-size datagrams, to obtain: (i) the maximum effective throughput (in Kpackets/s and as a percentage of the theoretical value) versus the IP datagram size; (ii) the average, maximum and minimum latencies versus the IP datagram size.
b) Back-to-back: these tests are carried out by using burst traffic flows and by changing both the burst dimension (i.e., the number of packets comprising the burst) and the datagram size. The main results of this kind of test are: (i) the zero-loss burst length versus the IP datagram size; (ii) the average, maximum and minimum latencies versus the size of the IP datagrams comprising the burst (the “zero-loss burst length” is the maximum number of packets, transmitted with minimum inter-frame gaps, that the System Under Test (SUT) can handle without any loss).
c) Loss rate: this kind of test is carried out by using CBR traffic flows with different offered loads and IP datagram sizes; the results can be summarized as throughput versus both offered load and IP datagram size.
Note that all these tests have been performed by using different IP datagram sizes (i.e., 40, 64, 128, 256, 512, 1024 and 1500 bytes) and both CBR and burst traffic flows.
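For reference, the theoretical value used as the 100% mark can be derived from standard Ethernet framing overheads (this is our reconstruction; the convention is not spelled out in the text): an IP datagram of L bytes occupies max(L+18, 64)+20 bytes on the wire (14-byte MAC header and 4-byte FCS, padded to the 64-byte minimum frame, plus 8-byte preamble and 12-byte inter-frame gap), so at line rate C the maximum packet rate is

R_{\max}(L) = \frac{C}{8\,\bigl[\max(L+18,\,64)+20\bigr]}

At 1 Gbit/s this gives roughly 1.49 Mpackets/s for 40-Byte datagrams, 1.22 Mpackets/s for 64-Byte datagrams and about 81 Kpackets/s for 1500-Byte datagrams.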
Figure 2. Benchmarking setups (Setup A, Setup B, Setup C and Setup D).
VI. NUMERICAL RESULTS

A selection of the experimental results is reported in this section. In particular, the results of the benchmarking setups shown in Figure 2 are reported in Subsections A, B, C and D. In all such cases, the tests were performed with the “old” hardware architecture described in Section III (i.e., 32-bit Xeon and PCI-X bus). With regard to the software architecture, we decided to compare different 2.6.16 Linux kernel configurations and the Click Modular Router. In particular, we used the following versions of the 2.6.16 Linux kernel:
• single-processor 2.6.16 optimized kernel (a version based on the standard one, with single-processor support, that includes the descriptor recycling patch);
• dual-processor 2.6.16 standard kernel (a standard NAPI kernel version similar to the previous one, but with SMP support).
Note that we decided not to take into account the SMP versions of the optimized Linux kernel and of the Click Modular Router, since they lack a minimum acceptable level of stability. Subsection E summarizes the results obtained in the previous tests by showing the maximum performance for each benchmarking setup. Finally, the performance of the two hardware architectures described in Section III is reported in Subsection F, in order to evaluate how HW evolution affects forwarding performance.

A. Setup A numerical results
In the first benchmarking session, we performed the RFC 2544 tests by using setup A (see Figure 2) with both the single-processor 2.6.16 optimized kernel and Click. As we can observe in Figs. 3, 4 and 5, which report the numerical results of the throughput and latency tests, neither software architecture can achieve the maximum theoretical throughput in the presence of small datagram sizes. As demonstrated by the profiling measurements reported in Fig. 6, obtained with the single-processor optimized 2.6.16 kernel and 64-Byte datagrams, this effect is clearly caused by the computational CPU capacity, which limits the maximum forwarding rate of the Linux kernel to about 700 Kpackets/s (40% of the full Gigabit speed). In fact, even though the CPU idle time goes to zero at 40% of full load, the CPU occupancies of all the most important function sets appear to adapt their contributions up to 700 Kpackets/s; after this point their percentage contributions to CPU utilization remain almost constant.

More expressly, Fig. 6 shows that the computational weight of the memory management operations (like sk_buff allocations and de-allocations) is substantially limited, thanks to the descriptor recycling patch, to less than 25%. In other works of ours, such as [19], we have shown that this patch can be used to save a CPU time share of about 20%.

Figure 3. Throughput and latencies test, testbed A: effective throughput results for the single-processor 2.6.16 optimized kernel and Click.

Figure 4. Throughput and latencies test, testbed A: minimum and maximum latencies for both the single-processor 2.6.16 optimized kernel and Click.

Figure 5. Throughput and latencies test, testbed A: average latencies for both the single-processor 2.6.16 optimized kernel and Click.
The behavior of the IRQ management operations would appear to be rather strange: in fact, their CPU utilization level decreases with an increase in the input rate. There are mainly two reasons for such a behavior, related to the packet grouping effect in the TxAPI and in the RxAPI: in particular, when the ingress packet rate rises, NAPI tends to moderate the IRQ rate by making it operate more like a polling mechanism than an interrupt-driven one (and thus we have the first reduction in the number of interrupts), while the TxAPI, under the same conditions, can better exploit the packet grouping mechanism by sending more packets at a time (and so the number of interrupts for successful transmission confirmations decreases). When the IRQ weight becomes zero, the OR reaches the saturation point and operates like a polling mechanism.

With regard to all the other operation sets (i.e., IP and Ethernet processing, NAPI and TxAPI), their behaviour is clearly bound by the number of forwarded packets: the weight of almost all the classes increases linearly up to the saturation point, and subsequently remains more or less constant. This analysis is also confirmed by the performance counters reported in Figs. 7 and 8, in which both the Tx and Rx boards reduce their IRQ generation rates, while the kernel passes from polling the Rx Ring twice per received packet to about 0.22 times. The number of Rx SoftIRQs per received packet also decreases as the offered traffic load rises. As far as the transmission dynamics are concerned, Fig. 8 shows very low function occurrences: in fact, the Tx IRQ routines decrease their occurrences up to saturation, while the “wake” function, which represents the number of times that the Tx Ring is cleaned and the Qdisc buffer is served during an Rx SoftIRQ, exhibits a mirror-like behavior: this occurs because, when the OR reaches saturation, all the Tx functionalities are activated when the Rx SoftIRQ starts.

Figure 6. Profiling results of the optimized Linux kernel obtained with testbed setup A.

Figure 7. Number of IRQ routines, polls and Rx SoftIRQs (second y-axis) for the RX board for the skbuff recycling patched kernel, in the presence of an incoming traffic flow with only 1 IP source address.

Figure 8. Number of IRQ routines for the TX board, and number of Tx Ring cleanings performed by the Tx SoftIRQ (“func”) and by the Rx SoftIRQ (“wake”), for the skbuff recycling patched kernel, in the presence of an incoming traffic flow with only 1 IP source address. The second y-axis refers to “wake”.

Figure 9. Back-to-back test, testbed A: maximum zero loss burst lengths.

TABLE I. BACK-TO-BACK TEST, TESTBED A: LATENCY VALUES FOR BOTH THE SINGLE-PROCESSOR OPTIMIZED KERNEL AND CLICK.

Pkt Length   Optimized 2.6.16 kernel latency           Click latency
[Byte]       Min [us]  Average [us]  Max [us]          Min [us]  Average [us]  Max [us]
40           16.16     960.08        1621.47           23.47     1165.53       1693.64
64           14.95     929.27        1463.02           23.52     1007.42       1580.74
128          16.04     469.9         925.93            19.34     54.88         53.45
256          16.01     51.65         58.84             22.49     52.62         47.67
512          18.95     54.96         61.51             20.72     62.92         59.95
768          23.35     100.76        164.56            22.85     116.61        155.59
1024         25.31     123.68        164.21            32.02     128.85        154.72
1280         28.6      143.43        166.46            24.77     151.81        178.45
1500         30.38     142.22        163.63            32.01     154.79        181.43
Similar considerations can also be made for the Click modular router: the performance limitations in the presence of short-sized datagrams continue to be caused by a computational bottleneck, but the simple Click packet reception API, based on a polling mechanism, improves the throughput performance by lowering the weight of the IRQ management and RxAPI functions. For the same reasons, as shown in Figs. 4 and 5, the receive mechanism included in Click introduces higher packet latencies. In accordance with the previous results, the back-to-back tests, reported in Fig. 9 and Table I, also demonstrate that the optimized 2.6.16 Linux kernel and Click continue to be affected by small-sized datagrams. In fact, while for datagrams of 256 Bytes or larger the measured zero-loss burst length is quite close to the maximum burst length used in the tests carried out, it appears to be heavily limited in the presence of 40-, 64- and, only as far as the Linux kernel is concerned, 128-Byte-sized packets. An exception is the 128-Byte case, in which the computational bottleneck starts to affect NAPI while the Click forwarding rate continues to be very close to the theoretical one. The Linux kernel provides better support for burst traffic than Click: as a result, its zero-loss burst lengths are longer and the associated latency times are smaller. The loss rate test results are reported in Fig. 10.

Figure 10. Loss Rate test, testbed A: maximum throughput.

B. Setup B numerical results
In the second benchmarking session we analyzed the performance achieved by the optimized single-processor Linux kernel, the SMP standard Linux kernel and the Click modular router with testbed setup B (see Fig. 2). Fig. 11 reports the maximum effective throughput in terms of forwarded packets per second for a single router interface. From this figure it is clear that, in the presence of short-sized packets, the performance level of all three software architectures is not close to the theoretical one. More specifically, while the best throughput values are achieved by Click, the SMP kernel seems to provide better forwarding rates than the optimized kernel. In fact, as outlined in [25], if no explicit CPU-interface bindings are present, the SMP kernel processes the received packets (using, if possible, the same CPU for the entire packet elaboration) and attempts to dynamically distribute the computational load among the CPUs. Thus, in this particular setup, the computational load sharing attempts to manage the two interfaces to which a traffic pair is applied with a single fixed CPU, fully processing each received packet with only one CPU and thus avoiding any memory concurrency problems. Figs. 12 and 13 report the minimum, average and maximum latency values for different datagram sizes obtained with all three software architectures. In particular, we note that both Linux kernels, which in this case provide very similar results, ensure lower minimum latencies than Click. Instead, Click provides better average and maximum latency values for short-sized datagrams.

Figure 11. Throughput and latencies test, testbed setup B: effective throughput.

Figure 12. Throughput and latencies test, testbed B: minimum and maximum latencies.

Figure 13. Throughput and latencies test, testbed B: average latencies.

The back-to-back results, reported in Fig. 14 and Table II, show that the performance level of all the analyzed architectures is nearly comparable in terms of zero-loss burst length, while, as far as latencies are concerned, the Linux kernels provide better values. By analyzing Fig. 15, which reports the loss rate results, we note how the performance values obtained with Click and the SMP kernel are better, especially for low-sized datagrams, than the ones obtained with the optimized single-processor kernel. Moreover, Fig. 15 also shows that none of the three OR software architectures achieves the full Gigabit/s speed, even for large datagrams, with a maximum forwarding rate of about 650 Mbps per interface. To improve the readability of these results, in Fig. 15 and in all the following loss rate tests we report only the OR behavior with the minimum and maximum datagram sizes, since they are, respectively, the lower and upper performance bounds.

Figure 14. Back-to-back test, testbed B: maximum zero-loss burst lengths.
TABLE II. BACK-TO-BACK TEST, TESTBED B: LATENCY VALUES FOR ALL THREE SOFTWARE ARCHITECTURES (LATENCIES IN [us]).

Pkt Length   Optimized 2.6.16 kernel          Click                            2.6.16 SMP kernel
[Byte]       Min      Average   Max           Min     Average   Max            Min      Average   Max
40           122.3    2394.7    5029.4        27.9    5444.6    13268          70.48    1505      3483
64           124.8    1717.1    3320.4        30.1    2854.7    16349          89.24    1577      3474
128          212.1    1313.1    2874.2        46.9    2223.5    7390.0         67.02    1202      3047
256          139.6    998.8     2496.9        45.4    1314.9    5698.6         37.86    1005      2971
512          70.8     688.2     2088.7        21.2    574.5     2006.4         31.77    728       2085
768          55.7     585.3     2122.7        28.2    480.0     1736.5         35.01    587       1979
1024         71.0     373.8     1264.5        33.8    458.3     1603.3         37.19    361       1250
1280         58.6     427.7     1526.6        45.0    426.7     1475.5         38.58    482       1868
1500         66.4     485.6     1707.8        38.3    462.3     1524.3         36.68    478       1617

Figure 15. Loss Rate test, testbed B: maximum throughput versus both offered load and IP datagram sizes.

Figure 16. Throughput and latencies test, setup C: effective throughput results.

Figure 17. Throughput and latencies test, testbed C: minimum and maximum latencies.

Figure 18. Throughput and latencies test, results for testbed C: average latencies.
C. Setup C numerical results
In this benchmarking session, the three software architectures were tested in the presence of four Gigabit Ethernet interfaces with a full-meshed traffic matrix (Fig. 2). By analyzing the maximum effective throughput values in Fig. 16, we note that Click appears to achieve a better performance level than the Linux kernels while, unlike the previous case, the single-processor kernel provides larger maximum forwarding rates than the SMP version with small packets. In fact, the SMP kernel tries to share the computational load of the incoming traffic among the CPUs, resulting in an almost static assignment of each CPU to two specific network interfaces. Since, in the presence of a full-meshed traffic matrix, about half of the forwarded packets cross the OR between two interfaces managed by different CPUs, performance decreases due to memory concurrency problems [19]. Figs. 17 and 18 show the minimum, maximum and average latency values obtained during this test set. In observing these results, we note how the SMP kernel, in the presence of short-sized datagrams, continues to suffer from memory concurrency problems, which lower the OR performance while considerably increasing both the average and the maximum latency values. By analyzing Fig. 19 and Table III, which report the back-to-back test results, we note that all three OR architectures achieve a similar zero-loss burst length, while Click reaches very high average and maximum latencies with respect to the single-processor and SMP kernels when small packets are used. The loss-rate results in Fig. 20 highlight the performance decay of the SMP kernel, while a fairly similar behavior is achieved by the other two architectures. Moreover, as in the previous benchmarking session, the maximum forwarding rate for each Gigabit network interface is limited to about 600/650 Mbps.
Figure 19. Back-to-back test, testbed C: maximum zero loss burst lengths.

Figure 20. Loss Rate test, testbed C: maximum throughput versus both offered load and IP datagram sizes.

TABLE III. BACK-TO-BACK TEST, TESTBED C: LATENCY VALUES FOR THE SINGLE-PROCESSOR 2.6.16 OPTIMIZED KERNEL, THE CLICK MODULAR ROUTER AND THE SMP 2.6.16 KERNEL (LATENCIES IN [us]).

Pkt Length   Optimized 2.6.16 kernel          Click                              2.6.16 SMP kernel
[Byte]       Min     Average   Max            Min      Average   Max             Min     Average   Max
40           92.1    2424.3    5040.3         73.5     6827.0    15804.9         74.8    3156.8    6164.2
64           131.3   1691.7    3285.1         176.7    6437.6    16651.1         66.5    2567.5    5140.6
128          98.1    1281.0    2865.9         60.2     3333.8    9482.1          77.1    1675.1    3161.6
256          19.6    915.9     2494.6         16.7     1388.9    3972.2          44.1    790.8     1702.8
512          23.9    666.9     2138.9         15.9     649.2     2119.3          23.2    815.8     2189.0
768          22.3    571.3     2079.7         22.5     543.7     2002.6          23.6    737.3     2193.0
1024         22.0    353.7     1232.2         36.3     382.2     1312.7          30.0    411.7     1276.8
1280         25.9    436.4     1525.4         34.6     443.0     1460            29.8    447.7     1469.5
1500         27.4    469.5     1696.7         36.7     457.5     1525.7          30.0    482.6     1719.6
D. Setup D numerical results
In the last benchmarking session, we applied setup D, which provides a full-meshed traffic matrix between one Gigabit Ethernet and 12 Fast Ethernet interfaces, to the single-processor Linux kernel and to the SMP version. We did not use Click in this last test since, at the moment and for this software architecture, there are no drivers with polling support for the D-Link interfaces. By analyzing the throughput and latency results in Figs. 21, 22 and 23, we note how, in the presence of a high number of interfaces and a full-meshed traffic matrix, the performance of the SMP kernel version drops significantly: the maximum measured value of the effective throughput is limited to about 2400 packets/s, and the corresponding latencies appear to be much higher than those obtained with the single-processor kernel. However, the single-processor kernel also does not sustain the maximum theoretical rate: it achieves 10% of full speed in the presence of short-sized datagrams and about 75% for large datagram sizes.

To better understand why the OR does not attain full speed with such a high number of interfaces, we decided to perform several profiling tests. In particular, these tests were carried out using two simple traffic matrices: the first (Fig. 24) consists of 12 CBR flows that cross the OR from the Fast Ethernet interfaces to the Gigabit one, while the second (Fig. 25) consists of 12 CBR flows that cross the OR in the opposite direction (i.e., from the Gigabit to the Fast Ethernet interfaces). These simple traffic matrices allow us to separately analyze the reception and transmission operations.

Thus, Figs. 24 and 25 report the profiling results corresponding to the two traffic matrices. The internal measurements shown in Fig. 24 highlight the fact that the CPUs are overloaded by the very high computational load of the IRQ and TxAPI management operations. This is due to the fact that, during the transmission process, each interface must signal the state of both the transmitted packets and the transmission ring to the associated driver instance through interrupts. More specifically, and again referring to Fig. 24, we note that the IRQ CPU occupancy decreases up to 30% of the offered load and afterwards, while the OR reaches saturation, remains constant at about 50% of the computational resources. The initial decreasing behavior is due to the fact that, by increasing the offered traffic load, the OR can better exploit the packet grouping effects, while the constant behavior is due to the fact that the OR is managing the same quantity of packets. Referring to Fig. 25, we note how the presence of traffic incoming from many interfaces increases the computational weights of both the IRQ and the memory management operations. The decreasing behavior of the IRQ management computational weight is not due, as in the previous case, to the packet grouping effect, but to the typical NAPI structure that passes from an IRQ-based mechanism to a polling one. The high memory management values can be explained quite simply by the fact that the recycling patch does not operate with the Fast Ethernet driver.

Figure 21. Throughput and latencies test, setup D: effective throughput results for both Linux kernels.

Figure 22. Throughput and latencies test, results for testbed D: minimum and maximum latencies for both Linux kernels.

Figure 23. Throughput and latencies test, testbed D: average latencies for both Linux kernels.

Figure 24. Profiling results obtained by using 12 CBR flows that cross the OR from the Fast Ethernet interfaces to the Gigabit one.

Figure 25. Profiling results obtained by using 12 CBR flows that cross the OR from a Gigabit interface to 12 Fast Ethernet ones.
The back-to-back results, reported in Fig. 26 and Table IV, show a very particular behavior: in fact, even if the single-processor kernel can achieve longer zero-loss burst lengths than the SMP kernel, the latter appears to ensure lower minimum, average and maximum latency values. Finally, Fig. 27 reports the loss rate test results, which, consistently with the previous results, show that the single-processor kernel can sustain a higher forwarding throughput than the SMP version.

Figure 26. Back-to-back test, testbed D: maximum zero loss burst lengths.

Figure 27. Loss Rate test, testbed D: maximum throughput versus both offered load and IP datagram sizes.

TABLE IV. BACK-TO-BACK TEST, TESTBED D: LATENCY VALUES FOR THE SINGLE-PROCESSOR 2.6.16 OPTIMIZED KERNEL AND THE SMP 2.6.16 KERNEL (LATENCIES IN [us]).

Pkt Length   Optimized 2.6.16 kernel             2.6.16 SMP kernel
[Byte]       Min       Average    Max            Min      Average    Max
40           285.05    1750.89    2921.23        37.04    1382.96    2347.89
64           215.5     1821.81    2892.13        38.07    1204.43    1963.7
128          216.15    1847.76    3032.22        34.52    1244.87    1984.4
256          61.83     1445.15    2353.12        30.61    2082.76    3586.6
512          57.73     2244.68    4333.44        32.97    908.19     1661.18
768          101.78    1981.5     3497.81        50.64    1007.27    1750.45
1024         108.17    1386.19    2394.4         52.14    819.98     1642.13
1280         73.15     1662.13    3029.54        58.11    981.92     1953.62
1500         109.24    1149.76    2250.78        70.92    869.36     1698.68

E. Maximum Performance
In order to effectively synthesize and improve the evaluation of the proposed performance results, we report in Figs. 28 and 29 the aggregated² maximum values for each testbed of, respectively, the effective throughput and the maximum throughput (obtained in the loss rate test). By analyzing Fig. 28, we note that, in the presence of more network interfaces, the OR achieves values higher than 1 Gbps and, in particular, that it reaches maximum values equal to 1.6 Gbps with testbed D. We can also point out that the maximum effective throughputs of setups B and C are almost the same: in fact, these very similar testbeds have only one difference (i.e., the traffic matrix), which has an effect only on the performance level of the SMP kernel, but practically no effect on the behaviors of the single-processor kernel and Click.

The aggregated maximum throughput values, reported in Fig. 29, are obviously higher than the ones in Fig. 28. They highlight the fact that the maximum forwarding rates sustainable by the OR are achieved in setups B and C, with 2.5 Gbps. Moreover, while in setup A the maximum theoretical rate is achieved for packet sizes larger than 128 Bytes, in all the other setups the maximum throughput values are not much higher than half the theoretical ones.

Figure 28. Maximum effective throughput values obtained in the implemented testbeds.

Figure 29. Maximum throughput values obtained in the implemented testbeds.

² In this case, “aggregated” refers to the sum of the forwarding rates of all the OR network interfaces.

F. Hardware Architecture Impact
In the final benchmarking session, we decided to compare the performance of the two hardware architectures introduced in Section III, which represent the current state-of-the-art of server architectures and that of four years ago. The benchmarking scenario is the one used in testbed A (with reference to Fig. 2), while the selected software architecture is the single-processor optimized kernel.
It is clear that the purpose of these tests was to understand how the continuous evolution of COTS hardware affects the overall OR performance. Therefore, Figs. 30, 31 and 32 report the results of the effective throughput tests for the “old” architecture (i.e., 32-bit Xeon) and the “new” one (i.e., 64-bit Xeon) equipped with both PCI-X and PCI-Express buses. The loss rate results are shown in Fig. 33.

By observing the comparisons in Figs. 30 and 31, it is clear that the new architecture generally provides better performance values than the old one: more specifically, while using the new architecture with the PCI-X bus only slightly improves performance, when PCI-Express is used the OR effective throughput reaches an impressive 88% with 40-Byte-sized packets and achieves the maximum theoretical rate for all other packet sizes. All this is clearly due to the high efficiency of the PCI-Express bus. In fact, with this I/O bus, DMA transfers occur with a very low control overhead (since it behaves like a leased line), which probably leads to lighter accesses to the RAM and, consequently, to benefits in terms of memory accesses by the CPU. In other words, this large performance enhancement is caused by more effective memory access by the CPU, thanks to the features of the PCI-Express DMA.

Figure 30. Throughput and latencies test, setup A with the “old” HW architecture and the “new” one equipped with PCI-X and PCI-Express buses: effective throughput results for the single-processor optimized kernel. Note that the x-axis is in logarithmic scale.

Figure 31. Throughput and latencies test, results for testbed A with the “old” HW architecture and the “new” one equipped with PCI-X and PCI-Express buses: minimum and maximum latencies for the single-processor optimized kernel.

Figure 32. Throughput and latencies test, results for testbed A with the “old” HW architecture and the “new” one equipped with PCI-X and PCI-Express buses: average latencies for the single-processor optimized kernel.

Figure 33. Loss Rate test, testbed A for the “old” HW architecture and the “new” one equipped with PCI-X and PCI-Express buses: maximum throughput versus both offered load and IP datagram sizes.
VII. CONCLUSIONS

In this contribution we have reported the results of the in-depth optimization and testing carried out on a PC Open Router architecture based on Linux software and, more specifically, on the Linux kernel. We have presented a performance evaluation, in some common working environments, of three different data plane architectures, namely the optimized Linux 2.6 kernel, the Click Modular Router and the SMP Linux 2.6 kernel, with both external (throughput and latencies) and internal (profiling) measurements. External measurements were performed in an RFC 2544 [29] compliant manner by using professional devices [27]. Two hardware architectures were tested and compared for the purpose of understanding how the evolution of COTS hardware may affect performance. The experimental results show that the optimized version of the Linux kernel, with a suitable hardware architecture, can achieve performance levels high enough to effectively support several Gigabit interfaces. The results obtained show that the OR can achieve very interesting performance levels, attaining aggregated forwarding rates of about 2.5 Gbps with relatively low latencies.

REFERENCES

[1] Building Open Router Architectures Based On Router Aggregation project (BORA-BORA), homepage at http://www.tlc.polito.it/borabora.
[2] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek, "The Click modular router", ACM Transactions on Computer Systems, 18(3), Aug. 2000, pp. 263-297.
[3] Zebra, http://www.zebra.org/.
[4] M. Handley, O. Hodson, E. Kohler, "XORP: an open platform for network research", ACM SIGCOMM Computer Communication Review, Vol. 33, Issue 1, Jan. 2003, pp. 53-57.
[5] S. Radhakrishnan, "Linux - Advanced networking overview", http://qos.ittc.ku.edu/howto.pdf.
[6] M. Rio et al., "A map of the networking code in Linux kernel 2.4.20", Technical Report DataTAG-2004-1, FP5/IST DataTAG Project, Mar. 2004.
[7] FreeBSD, http://www.freebsd.org.
[8] B. Chen and R. Morris, "Flexible Control of Parallelism in a Multiprocessor PC Router", Proc. of the 2001 USENIX Annual Technical Conference (USENIX '01), Boston, USA, June 2001.
[9] C. Duret, F. Rischette, J. Lattmann, V. Laspreses, P. Van Heuven, S. Van den Berghe, P. Demeester, "High Router Flexibility and Performance by Combining Dedicated Lookup Hardware (IFT), off the Shelf Switches and Linux", Proc. of the 2nd International IFIP-TC6 Networking Conference, Pisa, Italy, May 2002, LNCS 2345, Ed. E. Gregori et al., Springer-Verlag, 2002, pp. 1117-1122.
[10] A. Barczyk, A. Carbone, J. P. Dufey, D. Galli, B. Jost, U. Marconi, N. Neufeld, G. Peco, V. Vagnoni, "Reliability of datagram transmission on Gigabit Ethernet at full link load", LHCb technical note, LHCB 2004-030 DAQ, Mar. 2004.
[11] P. Gray, A. Betz, "Performance Evaluation of Copper-Based Gigabit Ethernet Interfaces", Proc. of the 27th Annual IEEE Conference on Local Computer Networks (LCN'02), Tampa, Florida, November 2002, pp. 679-690.
[12] A. Bianco, R. Birke, D. Bolognesi, J. M. Finochietto, G. Galante, M. Mellia, M.L.N.P.P. Prashant, F. Neri, "Click vs. Linux: Two Efficient Open-Source IP Network Stacks for Software Routers", Proc. of the 2005 IEEE Workshop on High Performance Switching and Routing (HPSR 2005), Hong Kong, May 2005, pp. 18-23.
[13] A. Bianco, J. M. Finochietto, G. Galante, M. Mellia, F. Neri, "Open-Source PC-Based Software Routers: a Viable Approach to High-Performance Packet Switching", Proc. of the 3rd International Workshop on QoS in Multiservice IP Networks (QOS-IP 2005), Catania, Italy, Feb. 2005, pp. 353-366.
[14] A. Bianco, R. Birke, G. Botto, M. Chiaberge, J. Finochietto, G. Galante, M. Mellia, F. Neri, M. Petracca, "Boosting the Performance of PC-based Software Routers with FPGA-enhanced Network Interface Cards", Proc. of the 2006 IEEE Workshop on High Performance Switching and Routing (HPSR 2006), Poznan, Poland, June 2006, pp. 121-126.
[15] A. Grover, C. Leech, "Accelerating Network Receive Processing: Intel I/O Acceleration Technology", Proc. of the 2005 Linux Symposium, Ottawa, Ontario, Canada, Jul. 2005, vol. 1, pp. 281-288.
[16] R. McIlroy, J. Sventek, "Resource Virtualization of Network Routers", Proc. of the 2006 IEEE Workshop on High Performance Switching and Routing (HPSR 2006), Poznan, Poland, June 2006, pp. 15-20.
[17] R. Bolla, R. Bruschi, "The IP Lookup Mechanism in a Linux Software Router: Performance Evaluation and Optimizations", Proc. of the 2007 IEEE Workshop on High Performance Switching and Routing (HPSR 2007), New York, USA.
[18] K. Wehrle, F. Pählke, H. Ritter, D. Müller, M. Bechler, "The Linux Networking Architecture: Design and Implementation of Network Protocols in the Linux Kernel", Pearson Prentice Hall, Upper Saddle River, NJ, USA, 2004.
[19] R. Bolla, R. Bruschi, "A high-end Linux based Open Router for IP QoS networks: tuning and performance analysis with internal (profiling) and external measurement tools of the packet forwarding capabilities", Proc. of the 3rd International Workshop on Internet Performance, Simulation, Monitoring and Measurements (IPS MoMe 2005), Warsaw, Poland, Mar. 2005.
[20] J. H. Salim, R. Olsson, A. Kuznetsov, "Beyond Softnet", Proc. of the 5th Annual Linux Showcase & Conference, Oakland, California, USA, Nov. 2001.
[21] A. Cox, "Network Buffers and Memory Management", Linux Journal, Oct. 1996, http://www2.linuxjournal.com/lj-issues/issue30/1312.html.
[22] The Intel PRO 1000 XT Server Adapter, http://www.intel.com/network/connectivity/products/pro1000xt.htm.
[23] The D-Link DFE-580TX quad network adapter, http://support.dlink.com/products/view.asp?productid=DFE%2D580TX#.
[24] J. A. Ronciak, J. Brandeburg, G. Venkatesan, M. Williams, "Networking Driver Performance and Measurement - e1000: A Case Study", Proc. of the 2005 Linux Symposium, Ottawa, Ontario, Canada, July 2005, vol. 2, pp. 133-140.
[25] R. Bolla, R. Bruschi, "IP forwarding Performance Analysis in presence of Control Plane Functionalities in a PC-based Open Router", Proc. of the 2005 Tyrrhenian International Workshop on Digital Communications (TIWDC 2005), Sorrento, Italy, June 2005; also in F. Davoli, S. Palazzo, S. Zappatore, Eds., "Distributed Cooperative Laboratories: Networking, Instrumentation, and Measurements", Springer, Norwell, MA, 2006, pp. 143-158.
[26] The descriptor recycling patch, ftp://robur.slu.se/pub/Linux/net-development/skb_recycling/.
[27] The Agilent N2X Router Tester, http://advanced.comms.agilent.com/n2x/products/.
[28] Oprofile, http://oprofile.sourceforge.net/news/.
[29] Request for Comments 2544 (RFC 2544), http://www.faqs.org/rfcs/rfc2544.html.

Raffaele Bolla was born in Savona (Italy) in 1963. He received his Master of Science degree in Electronic Engineering from the University of Genoa in 1989 and his Ph.D. degree in Telecommunications from the Department of Communications, Computer and Systems Science (DIST) of the same university in 1994. From 1996 to 2004 he worked as a researcher at DIST where, since 2004, he has been an Associate Professor, teaching a course in Telecommunication Networks and Telematics. His current research interests focus on resource allocation, Call Admission Control and routing in multi-service IP networks, Multiple Access Control, and resource allocation and routing in both cellular and ad hoc wireless networks. He has authored or co-authored over 100 scientific publications in international journals and conference proceedings. He has been the Principal Investigator in many projects in the field of telecommunication networks.

Roberto Bruschi was born in Genoa (Italy) in 1977. He received his Master of Science degree in Telecommunication Engineering in 2002 from the University of Genoa and his Ph.D. in Electronic Engineering in 2006 from the same university. He is presently working with the Telematics and Telecommunication Networks Lab (TNT) in the Department of Communications, Computer and Systems Science (DIST) at the University of Genoa. He is also a member of CNIT, the Italian inter-university consortium for telecommunications. Roberto is an active member of various Italian research projects in the networking area, such as BORA-BORA, FAMOUS, TANGO and EURO. He has co-authored over 10 papers in international conferences and journals. His main interests include Linux Software Routers, network processors, TCP and network modeling, VPN design, P2P modeling, bandwidth allocation, and admission control and routing in multi-service QoS IP/MPLS networks.