Implementing a Scalable Self-Virtualizing Network Interface on an Embedded Multicore Platform

Himanshu Raj, Karsten Schwan
CERCS, Georgia Institute of Technology
Atlanta, 30332
(rhim, schwan)@cc.gatech.edu

Abstract

Some of the costs associated with system virtualization using a virtual machine monitor (VMM) or hypervisor (HV) can be attributed to the need to virtualize peripheral devices and to then manage the virtualized resources for large numbers of guest domains. For higher end, ‘smart’ IO devices, it is possible to offload selected virtualization functionality onto the devices themselves, thereby creating self-virtualizing devices. A self-virtualizing device is aware of the fact that it is being virtualized. It implements this awareness (1) by presenting to the HV an interface that enables the HV to manage virtual devices on demand, and (2) by presenting to a guest OS a virtual device abstraction that permits it to interact with the device with minimal HV involvement. This paper presents the design of a self-virtualizing network device and its implementation on a modern multi-core network platform, an IXP2400 network processor-based development board. Initial performance results demonstrate that the performance of virtual interfaces is similar to that of regular network interfaces. The longer term goal of this work is to better understand how modern multi-core platforms can be used to realize light-weight, scalable virtualization for future server platforms.
1 Introduction
Virtualization functionality is becoming an integral element of processor architectures, ranging from high-end server architectures to lower-end PowerPC and x86-based machines [4]. However, the hypervisors (HVs) or virtual machine monitors (VMMs) running on such machines must not only virtualize processors, but also carry out two equally important tasks: (1) virtualize all of the platform’s physical resources, including its peripheral devices, and (2) manage these virtualized components for the
multiple guest OSes (domains) running on the platform. This paper focuses on the virtualization of peripheral resources (i.e., I/O device virtualization). Current practice is to manage peripherals in the hypervisor itself or in some trusted control partition. In either case, the controlling entity must export to guest domains an interface to a virtualized device that provides multiplex/demultiplex access to the actual physical device, and all accesses to the device must go through this controlling entity. Current implementations of I/O device virtualization use a distinct I/O controller domain for all devices [13], a controller domain per device [15], or a scheme in which the driver code runs as part of the hypervisor itself [7]. The latter approach has been dismissed due to the need to avoid hypervisor failures caused by potentially faulty device drivers. The former approaches involve context switching among multiple domains for each device access. Recent advances in hardware-supported virtualization [4] do not offer specific support for peripheral virtualization. Our approach attempts to reduce the costs of peripheral virtualization with self-virtualizing devices. A self-virtualizing device (1) presents to the hypervisor (HV) an interface that enables the HV to manage virtual devices on demand, and (2) presents to a guest OS a virtual device abstraction that permits it to interact with the device with minimal HV involvement. One example of a device with similar functionality is a SCSI controller that supports RAID [2]; such a controller can provide multiple logical disk drives, each of which looks like a single physical disk drive to the OS. This paper presents the design of a self-virtualizing network interface realized on an IXP2400 network processor-based board [1]. This processor has multiple processing cores capable of independently running different execution threads. This internal concurrency provides a suitable platform for a scalable
implementation of self-virtualization functionality that offers to applications both high bandwidth and low end-to-end latency. Experimental results described in Section 5 demonstrate virtual devices capable of operating at full link speed for 100 Mbps ethernet links, with available TCP bandwidth and latency of 94 Mbps and 0.127 ms, respectively. At gigabit link speeds, PCI performance dominates the overall performance of virtual devices, with available TCP bandwidth and latency of 156 Mbps and 0.076 ms, respectively. The performance of virtual devices also scales with an increasing number of virtual devices on a host. In the remainder of this paper, we first provide the overall design of the self-virtualizing network interface and describe its functionality. This is followed by specific details of our prototype implementation. Next, we present the experimental setup, followed by experimental results. The paper concludes with a summary of results and a description of future work.
2 A Self-virtualizing Network Interface
This section describes the design of a self-virtualizing network interface using the IXP2400 network processor (NP)-based enp2611 board. This board resides in a host as a PCI add-on card. The board exports the NP's SDRAM and configuration registers to the host via a non-transparent PCI-to-PCI bridge [3]. The bridge also provides a PCI window for host memory access from the NP, which can be dynamically configured to map to a certain area of host memory. The board itself hosts an XScale (ARM) processing core used for control functions and 8 specialized processing cores, termed micro-engines, for carrying out network I/O functions. Each micro-engine supports 8 hardware contexts with minimal context switching overhead. As will become evident in the next two sections, this environment provides the flexibility (i.e., programmability) and concurrency needed to efficiently implement device virtualization functionality. In particular, the functionality implemented on this platform realizes virtual interfaces (VIFs) for the network device, where each VIF consists of two message queues, one for outgoing messages (send queue) and one for incoming messages (receive queue). These queues have configurable sizes that determine transmit and receive buffer lengths. Every VIF is assigned a unique id, and it has an associated pair of signals. One is sent by the NP to the host device driver indicating that a packet transmission initiated by the driver is complete; the other is sent by the NP to the host driver indicating that a packet has been received for the
associated VIF and enqueued for processing by the driver. Network packets are sent and received using PCI read/write transactions on the associated message queues. We next describe the self-virtualizing device's two main functionalities: (1) managing VIFs and (2) performing network I/O.
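To make this abstraction concrete, the following C sketch shows one plausible host-side view of a per-VIF descriptor with its two message queues, id, and signal pair. The names, field widths, and layout are illustrative assumptions for exposition; they are not the actual driver or firmware structures.

```c
#include <stdint.h>

#define VIF_MAX_FRAME 1518              /* standard ethernet frame size */

/* One message queue (send or receive); sizes are configurable per VIF. */
struct vif_ring {
    uint32_t base;                      /* offset of the buffer area (NP SDRAM or host memory) */
    uint32_t size;                      /* configurable number of slots */
    volatile uint32_t head;             /* producer index */
    volatile uint32_t tail;             /* consumer index */
};

/* Hypothetical per-VIF descriptor as the host driver might view it. */
struct vif {
    uint16_t id;                        /* unique VIF id assigned at creation */
    uint8_t  tx_done_signal;            /* signal: driver-initiated transmit completed */
    uint8_t  rx_ready_signal;           /* signal: packet enqueued on the receive queue */
    struct vif_ring send_q;             /* outgoing frames (host to NP) */
    struct vif_ring recv_q;             /* incoming frames (NP to host) */
};
```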
2.1 Virtual Interface Management
This functionality includes the creation of VIFs, their removal, and changing the attributes and resources associated with VIFs. Our current implementation utilizes both the XScale core and the micro-engines: control functionality is placed on the XScale core, while ‘fast path’ I/O functions are executed by the micro-engines. Specifically, a management request is initiated from the host to the XScale core. The XScale core appropriates resources according to the request and communicates this resource modification to the host and the micro-engines. The micro-engines apply these modifications to the network I/O subsystem. Subsequent I/O actions are executed by the micro-engines, as explained in more detail next.
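As a rough illustration of this control split, the sketch below shows how a VIF management request might be represented and dispatched on the XScale core. The opcodes, fields, and handler are hypothetical; the actual control protocol between the host, the XScale core, and the micro-engines is not shown in this paper.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical control message for VIF management, sent by the host driver
 * to the XScale core. Opcodes and fields are illustrative assumptions. */
enum vif_op { VIF_CREATE, VIF_DESTROY, VIF_SET_ATTR };

struct vif_mgmt_req {
    enum vif_op op;
    uint16_t    vif_id;        /* ignored for VIF_CREATE */
    uint32_t    send_q_size;   /* requested queue lengths */
    uint32_t    recv_q_size;
};

/* Sketch of the XScale-side dispatch: appropriate resources for the request,
 * then report the resulting configuration to the host and the micro-engines. */
static int handle_mgmt_request(const struct vif_mgmt_req *req)
{
    switch (req->op) {
    case VIF_CREATE:
        /* 1. reserve queue memory and a unique id
         * 2. tell the host driver where the queues and signals live
         * 3. tell the fast-path micro-engine contexts about the new VIF */
        printf("create VIF: send_q=%u recv_q=%u\n",
               req->send_q_size, req->recv_q_size);
        return 0;
    case VIF_DESTROY:
    case VIF_SET_ATTR:
        /* analogous resource changes, also propagated to host and micro-engines */
        return 0;
    }
    return -1;
}
```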
2.2 Network I/O
Network I/O is completely managed by the micro-engines. It can be further subdivided into two parts: egress and ingress. Egress deals with multiplexing packets from multiple VIFs' send queues onto physical network ports. Ingress deals with demultiplexing packets received from physical network ports onto the appropriate VIFs' receive queues. Since VIFs export a regular ethernet device abstraction to the host, this implementation models a software layer-2 switch. In our current implementation, the XScale core appropriates one micro-engine context per VIF for egress. This context is selected from a pool of contexts belonging to a single micro-engine in a round-robin fashion for simple load balancing. Ingress is managed for all VIFs by a pool of contexts on one micro-engine. Two other micro-engines are used for physical network input and output, respectively. Hence, we are still operating at 50% of the board's capacity; we plan to use the remaining micro-engines for a more scalable implementation in the future.
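Because each VIF appears to the host as an ordinary ethernet device, ingress demultiplexing reduces to a software layer-2 switch lookup on the destination MAC address. The following plain-C sketch illustrates that lookup; the table layout and the linear search are simplifications assumed for clarity, not the micro-engine implementation.

```c
#include <stdint.h>
#include <string.h>

#define MAX_VIFS 16

/* Illustrative MAC-to-VIF table for the software layer-2 switch. */
struct mac_entry {
    uint8_t mac[6];      /* MAC address assigned to the VIF */
    int     vif_id;      /* receive queue to enqueue on */
    int     valid;
};

static struct mac_entry vif_table[MAX_VIFS];

/* Return the VIF that should receive a frame, or -1 for an unknown destination. */
static int channel_demux(const uint8_t *frame)
{
    const uint8_t *dst = frame;          /* destination MAC is the first 6 bytes */
    for (int i = 0; i < MAX_VIFS; i++) {
        if (vif_table[i].valid && memcmp(vif_table[i].mac, dst, 6) == 0)
            return vif_table[i].vif_id;
    }
    return -1;
}
```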
3 Selected Implementation Details
The device supports two configurations. In the first, hereafter referred to as host-only, both of the message queues associated with a VIF are implemented in the NP's SDRAM. This configuration is useful when the firmware present on the device does not provide the capability for upstream PCI transactions, as was the case with our previous firmware
version. In this case, none of the host's resources are visible to the NP, and only the host can read or write data from the NP's SDRAM. This also precludes the use of the DMA engines present on the board for data transfers to/from host memory. Both egress and ingress are performed by the host via programmed I/O using PCI read/write transactions, which seriously impacts device performance, as discussed in Section 5. In the other configuration, hereafter referred to as host-NP, the send message queue is implemented in the NP's SDRAM, while the receive message queue is implemented in host memory. On the egress path, the host-side driver places frames into the send message queue using PCI write transactions, and these frames are read locally from the NP's SDRAM by the micro-engines. On the ingress path, the micro-engines place frames into the receive message queue using PCI write transactions, and these frames are read locally from host memory by the host-side driver. The signals associated with a VIF are implemented as PCI interrupts. The device has a master PCI interrupt and a 16-bit interrupt identifier. These bits are shared by multiple VIFs when the total number of VIFs exceeds 8 (since each VIF requires two signals). Thus, an interrupt can result in redundant signaling of VIFs sharing the same id bit. In the future, we plan to replace interrupt id sharing with a more asynchronous messaging system, where messages will contain the exact id of the VIF to be signaled. These messages will be managed in a message queue shared between the host and the NP.
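To illustrate the signaling scheme, the hedged sketch below shows one plausible way the 16 identifier bits could be shared among VIFs and decoded by the host when the master interrupt fires. The modulo bit assignment and function names are assumptions; the firmware's actual mapping is not specified here.

```c
#include <stdint.h>

#define ID_BITS 16

/* Each VIF needs two signals; with more than 8 VIFs the 16 identifier bits
 * must be shared. A simple modulo assignment is assumed for illustration. */
static inline int tx_done_bit(int vif_id)  { return (2 * vif_id)     % ID_BITS; }
static inline int rx_ready_bit(int vif_id) { return (2 * vif_id + 1) % ID_BITS; }

/* Host-side handler for the master PCI interrupt: every VIF whose signal maps
 * onto a set identifier bit must be checked, which is why bit sharing can
 * produce redundant (spurious) signaling of some VIFs. */
static void handle_master_irq(uint16_t id_reg, int nr_vifs)
{
    for (int vif = 0; vif < nr_vifs; vif++) {
        if (id_reg & (1u << tx_done_bit(vif))) {
            /* reclaim completed transmit buffers for this VIF (may be spurious) */
        }
        if (id_reg & (1u << rx_ready_bit(vif))) {
            /* drain this VIF's receive queue (may be spurious) */
        }
    }
}
```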
4 Experimentation Setup
In this paper, our focus is to evaluate the costs of device self-virtualization. Accordingly, the experiments compare non-virtualized to virtualized devices, ignoring the host-level use of virtualization technologies. Specifically, we run a standard Linux kernel on the host and evaluate virtual interfaces as if they were regular ethernet devices on the host. In this configuration, the host appears as a hotplug-capable system behind a switch that can provide virtual network interfaces to the host on demand. This switch is implemented by the NP-based board and connects all VIFs on the host to the rest of the network. The physical link connects one gigabit port on the NP-based board to the rest of the LAN segment. A host using a virtual interface provided by the NP-based board for network communication is hereafter termed a VNIC host. Other hosts, which use a regular Broadcom gigabit ethernet controller for network communication, are hereafter termed REG hosts. Figure 1 shows
Figure 1: Network topology (the VNIC host connects via PCI to the enp2611 board, which attaches over a 1 Gbps link to a GigE switch; REG hosts attach over 100 Mbps links to a 100 Mbps switch; the switches are joined by gigabit uplinks)
the abstract network topology. Note that REG hosts are connected via a 100 Mbps switch, which effectively limits the bandwidth of their interfaces to 100 Mbps. All hosts are dual-processor, 2-way SMT Pentium Xeon 2.80 GHz servers with 2 GB RAM, running Linux kernel 2.6.13.1 with the RHEL 4 distribution. The embedded boards are IXP2400-based with 256 MB SDRAM, running Linux kernel 2.4.18 with the MontaVista Preview Kit 3.0 distribution.
5 Experimental Results
5.1 Self-virtualizing network device without host virtualization
In this section, we primarily report the performance of an initial self-virtualized NIC implementation in the host-NP configuration. We discuss some performance implications of using the host-only configuration in Section 5.1.2. The basic thesis behind these experiments is that self-virtualization can be obtained at low cost and that the performance of virtual interfaces is at least similar to that of regular network interfaces with a proper IO configuration.

5.1.1 Latency
We use a simple libpcap [6] client-server application to measure the round trip time between two hosts, at least one of which is a VNIC host. The client sends 64-byte messages to the server using packet sockets in SOCK_RAW mode. Each packet is handed directly to the device driver without any network-layer processing. The server, which receives the packet directly from the device driver, echoes the message back to the client. The round trip time for this message send/receive is reported as the RTT. This RTT serves as an indicator of the inherent latency of the self-virtualizing network interface implementation. For a detailed discussion of the breakdown of overall self-virtualization costs, refer to Section 5.2.
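For reference, the following minimal C sketch shows the flavor of such an RTT probe using a Linux packet socket: a 64-byte raw ethernet frame is sent and timed until the echo returns. The interface name, peer MAC address, and EtherType are placeholders, error handling is omitted, and this is not the actual libpcap-based tool used in our measurements.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <linux/if_packet.h>
#include <net/ethernet.h>
#include <net/if.h>

#define PROBE_LEN       64        /* 64-byte probe, ethernet header included */
#define ETHERTYPE_PROBE 0x88B5    /* local-experimental EtherType, an assumption */

int main(void)
{
    /* Raw packet socket bound to one EtherType: frames bypass the network layer. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETHERTYPE_PROBE));

    struct sockaddr_ll addr = {0};
    addr.sll_family  = AF_PACKET;
    addr.sll_ifindex = if_nametoindex("eth1");   /* assumed name of the VIF */
    addr.sll_halen   = 6;

    unsigned char frame[PROBE_LEN] = {0};
    struct ether_header *eh = (struct ether_header *)frame;
    memcpy(eh->ether_dhost, "\x00\x11\x22\x33\x44\x55", 6);  /* placeholder peer MAC */
    eh->ether_type = htons(ETHERTYPE_PROBE);     /* source MAC left zero for brevity */

    struct timeval t0, t1;
    struct sockaddr_ll from;
    socklen_t fromlen;

    gettimeofday(&t0, NULL);
    sendto(fd, frame, sizeof frame, 0, (struct sockaddr *)&addr, sizeof addr);
    do {                                          /* skip the looped-back copy of our own frame */
        fromlen = sizeof from;
        recvfrom(fd, frame, sizeof frame, 0, (struct sockaddr *)&from, &fromlen);
    } while (from.sll_pkttype == PACKET_OUTGOING);
    gettimeofday(&t1, NULL);

    printf("RTT: %ld us\n",
           (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec));
    close(fd);
    return 0;
}
```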
Figure 2: RTT for the host-NP configuration (RTT in ms for REG_IF to REG_IF, VIF to VIF, and VIF to REG_IF traffic, measured with 1, 2, 4, and 8 VIFs)
For measurements across two VNIC hosts, we use an n:1x1 access pattern, where n is the number of VIFs on both hosts. In this pattern, n libpcap client/server pairs are run, each utilizing a different VIF on the request sender and a different VIF on the responder. For measurements across a VNIC host and a REG host, we can use either an n:nx1 or an n:1xn access pattern, where n is the number of VIFs on the VNIC host. In the n:nx1 access pattern, each of the n libpcap clients utilizes a different VIF on the request sender, and the libpcap server is run on the same interface on the peer node. In the n:1xn pattern, each of the n libpcap clients uses the same interface on the sender but a different VIF on the peer node. All libpcap clients are started (almost) simultaneously and are configured to send a packet every 0.2 seconds. We report the average of the per-session average RTTs over the n libpcap client-server sessions, where n is the number of VIFs on the VNIC host. Figure 2 shows the RTT results for the host-NP configuration. The RTT between two VIFs on separate VNIC hosts is smaller than the RTT between two REG IFs on separate REG hosts, which is likely due to the faster switch on the VNIC path (refer to Figure 1). The RTT between a VIF on a VNIC host and a REG IF on a REG host is slightly higher than the RTT between two REG hosts; we believe the difference can be attributed to the additional switch crossing. With an increasing number of VIFs, the end-to-end latency initially scales well. However, with a larger number of VIFs, performance degrades rapidly. We believe this degradation results from the host's inability to handle a large number of interrupts, since the cost of self-virtualization itself does not increase much with a large number of VIFs (refer to Section 5.2 for details). We hope to eliminate this issue via interrupt batching for multiple packets and by employing host-side polling at regular intervals for packet reception. In any case, these experiments demonstrate that the end-to-end latency of self-virtualizing interfaces is better than or close to that of regular hardware interfaces. This configuration also demonstrates good initial scalability in terms of latency with an increasing number of VIFs on the VNIC host.
5.1.2 Bandwidth
We use iperf [5] to measure the achievable bandwidth for virtual interfaces. In these experiments, the client sends data to the server for 50 seconds over a TCP connection with default parameters. Three separate cases are considered, each with both the host-only and host-NP configurations:
1. Both the iperf client and the server are bound to VIFs on different VNIC hosts.
2. The iperf client is bound to a VIF on a VNIC host, and the server runs on a REG host.
3. The iperf server is bound to a VIF on a VNIC host, and the client runs on a REG host.
For the host-only configuration, the measured average bandwidth is ∼25.6 Mbits/sec for Cases 1 and 3, while it is ∼94 Mbits/sec for Case 2. The bandwidth achieved for Case 2 is similar to the average bandwidth achieved between two REG hosts, whilst in the other cases there is a large performance difference. This difference is entirely due to the relatively high cost of PCI reads versus PCI writes from the host, as is evident from the measurements for the host-NP configuration. For the host-NP configuration, the measured average bandwidth for Case 1 is ∼156 Mbits/sec, while for both Cases 2 and 3 it is ∼94 Mbits/sec, which is similar to the performance between two REG hosts. This demonstrates that the cost of self-virtualization itself is low; it is the IO performance that dominates the overall cost, and hence the network IO path must be configured carefully. The average bandwidth degrades linearly as the number of VIFs is increased and simultaneous iperf measurements are performed. For example, with 2 VIFs, the average bandwidth of each stream is 47 Mbits/sec for Case 2.
5.2 Self-virtualization micro-benchmarks
In order to better assess the costs associated with self-virtualization, we micro-benchmark specific parts of the micro-engine code and the host code in the host-NP configuration.
On the micro-engines, we use cycle counting for performance monitoring. Figures 3(a) and 3(b) show the results for the egress path and the ingress path, respectively. The following sub-sections of the egress path are considered:
• msg recv - The time it takes for the context specific to a VIF to acquire information about a new packet queued up by the host-side driver for transmission. This involves polling the send queue in SDRAM. Additional delay can be perceived at the end application due to the scheduling of contexts by the micro-engine; although the scheduling is non-preemptive, contexts frequently yield the micro-engine for IO.
• pkt tx - Enqueueing the packet on the transmit queue of the physical port.
The time taken by the network IO micro-engine(s) to transmit the packet on the physical link is not shown here, as we consider it part of the network latency. For the ingress path, we consider the following sub-sections:
• pkt rx - Dequeuing the packet from the receive queue of the physical port.
• channel demux - Demultiplexing the packet based on its destination MAC address. This is the most crucial step in the self-virtualization implementation.
• msg send - Copying the packet into host memory via PCI write transactions and interrupting the host via a PCI write transaction (a simplified sketch of this step follows the list).
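The following self-contained toy model illustrates the msg send step: the frame is copied into the next free slot of the VIF's host-resident receive queue and the host is signaled. In the real ingress path the queue is reached through the PCI window and the signal is a PCI interrupt; here both are simulated, and all names are illustrative.

```c
#include <stdint.h>
#include <string.h>

#define RQ_SLOTS   64
#define RQ_SLOT_SZ 1518

/* Toy model of a VIF's host-resident receive queue. */
struct recv_queue {
    uint8_t  slot[RQ_SLOTS][RQ_SLOT_SZ];
    uint16_t len[RQ_SLOTS];
    volatile uint32_t head;      /* written by the NP (producer) */
    volatile uint32_t tail;      /* written by the host driver (consumer) */
};

static int irq_pending;          /* stand-in for asserting the PCI interrupt */

static int msg_send(struct recv_queue *rq, const uint8_t *frame, uint16_t len)
{
    uint32_t next = (rq->head + 1) % RQ_SLOTS;
    if (next == rq->tail)
        return -1;                               /* receive queue full: drop */
    memcpy(rq->slot[rq->head], frame, len);      /* PCI write burst in the real path */
    rq->len[rq->head] = len;
    rq->head = next;                             /* publish the new frame */
    irq_pending = 1;                             /* raise the VIF's receive signal */
    return 0;
}
```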
The time taken by the network IO micro-engine(s) for receiving the packet from the physical link is not shown, as we consider it a part of the network latency.

Figure 3: Latency micro-benchmarks for (a) the egress path (msg_recv, pkt_tx, total) and (b) the ingress path (pkt_rx, channel_demux, msg_send, total), reported in cycles and microseconds for #vifs=1 and #vifs=8
With an increasing number of VIFs, the cost of the egress path increases due to increased SDRAM polling contention among micro-engine contexts for message receive from the host. Similarly, the cost of the ingress path increases due to increased demultiplexing cost. The overall effect of this cost increase on the end-to-end latency is small. For host-side performance monitoring, we count cycles for message send and receive via the TSC register. For #vifs = 1, the host incurs a cost of ∼9.42µs for a message receive (which involves a memory copy) and an average cost of ∼14.47µs, with ∼1.3µs variability (in terms of standard deviation), for a message send (which involves a PCI write). For #vifs = 8, the average cost of a message send increases to ∼17.35µs, whilst its variability increases to ∼7.85µs. Possible reasons for the increased average cost and variability of message send are PCI bus contention from the two processors and the increased number of interrupts that arise with a larger number of VIFs. The cost of message receive remains similar to the previous case. Host message receive and message send, together with the egress and ingress paths on the NP, account for ∼59µs of the overall end-to-end latency (#vifs = 1). The rest is spent in the host (context switches, data copies to user buffers, and interrupt handling) and in the network (time on the wire and GigE switch latency).
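For completeness, the host-side cycle counts above can be collected with the usual rdtsc idiom shown below; the conversion to microseconds assumes the 2.80 GHz clock of our test hosts, and the code region being timed is only a placeholder.

```c
#include <stdint.h>
#include <stdio.h>

/* Read the IA-32 time-stamp counter (TSC) with GCC-style inline assembly. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    const double cpu_hz = 2.80e9;        /* assumed clock of the Xeon test hosts */
    uint64_t start = rdtsc();
    /* ... code under test, e.g., the driver's message-send path ... */
    uint64_t end = rdtsc();
    printf("elapsed: %llu cycles (%.3f us)\n",
           (unsigned long long)(end - start),
           (double)(end - start) / cpu_hz * 1e6);
    return 0;
}
```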
6 Related Work
Given current trends in processor performance growth, it is usually I/O that is the bottleneck for overall system performance. Specific hardware support, such as virtual channel processors [11], is envisioned in future architectures [9] for improving I/O performance. The self-virtualizing network interface
developed in our work is specifically designed to improve the performance of future virtualized systems by improving their I/O performance. In order to improve network performance for end-user applications, multiple configurable network interfaces have been designed, including programmable ethernet NICs [16], Arsenic [12], and an ethernet network interface based on the IXP1200 network processor [10]. In this paper, we design a network interface with self-virtualization support based on an IXP2400 network processor-based embedded board. Another example of such a device is the SCSI controller with ServeRAID technology [2], which can provide virtual block devices on demand. Deploying a separate core for running application-specific code has also been used for Virtual Communication Machines [14] and stream handlers [8].
7 Conclusions and Future Work
In this paper, we present the design and an initial implementation of a self-virtualizing network interface device using an IXP2400 network processor-based board. Initial performance analysis establishes the viability of the design. The performance of the virtual interfaces provided by the self-virtualizing network interface, in terms of end-to-end available bandwidth and latency, is similar to that of regular network interfaces, and it scales with an increasing number of virtual interfaces. The self-virtualizing device also allows a host to (re)configure virtual interfaces in terms of their available resources. These properties make the device an ideal candidate for virtualizing and sharing network resources in a server platform. We plan to use this device for system virtualization, where each guest OS running on top of a hypervisor will own one or more VIFs. We envision lighter-weight and more scalable virtualized systems when using self-virtualizing devices. In particular, the HV on the host will be responsible for managing the virtual interfaces using the API provided by the self-virtualizing device. Once a virtual interface has been configured, the major part of network IO will take place without any HV involvement. The HV will be responsible for routing the interrupt(s) generated by the self-virtualizing device to the appropriate guest domains. Future interrupt subsystem modifications, such as a larger interrupt space and hardware support for routing interrupts directly to guest domains [4], may relieve the HV of interrupt routing responsibility altogether. We also plan to make the network IO path more efficient. We will add support for DMA for data copying
across the PCI bus and benchmark its performance. We also plan to add support for jumbo frames in order to improve performance in terms of achievable bandwidth, specifically for gigabit links.
References

[1] ENP-2611 Data Sheet. http://www.radisys.com/files/ENP2611 07-1236-05 0504 datasheet.pdf.
[2] IBM eserver xSeries ServeRAID Technology. ftp://ftp.software.ibm.com/pc/pccbbs/pc servers pdf/raidwppr.pdf.
[3] Intel 21555 Non-transparent PCI-to-PCI Bridge. http://www.intel.com/design/bridge/21555.htm.
[4] Intel Virtualization Technology Specification for the IA-32 Intel Architecture. ftp://download.intel.com/technology/computing/vptech/C97063-002.pdf.
[5] Iperf. http://dast.nlanr.net/projects/Iperf.
[6] Tcpdump/libpcap. http://www.tcpdump.org/.
[7] Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Pratt, I., Warfield, A., Barham, P., and Neugebauer, R. Xen and the Art of Virtualization. In Proceedings of the ACM Symposium on Operating Systems Principles (October 2003).
[8] Gavrilovska, A., Mackenzie, K., Schwan, K., and McDonald, A. Stream Handlers: Application-specific Message Services on Attached Network Processors. In Proceedings of the 10th International Symposium of Hot Interconnects (HOT-I 2002) (2002).
[9] Hady, F. T., Bock, T., Cabot, M., Chu, J., Meinechke, J., Oliver, K., and Talarek, W. Platform Level Support for High Throughput Edge Applications: The Twin Cities Prototype. IEEE Network (July/August 2003).
[10] Mackenzie, K., Shi, W., McDonald, A., and Ganev, I. An Intel IXP1200-based Network Interface. In Proceedings of the 2003 Annual Workshop on Novel Uses of Systems Area Networks (SAN-2) (February 2003).
[11] McAuley, D., and Neugebauer, R. A Case for Virtual Channel Processors. In Proceedings of the ACM SIGCOMM 2003 Workshops (2003).
[12] Pratt, I., and Fraser, K. Arsenic: A User Accessible Gigabit Network Interface. In Proceedings of IEEE INFOCOM (2001).
[13] Pratt, I., Fraser, K., Hand, S., Limpach, C., Warfield, A., Magenheimer, D., Nakajima, J., and Mallick, A. Xen 3.0 and the Art of Virtualization. In Proceedings of the Ottawa Linux Symposium (2005).
[14] Rosu, M., Schwan, K., and Fujimoto, R. Supporting Parallel Applications on Clusters of Workstations: The Virtual Communication Machine-based Architecture. In Proceedings of Cluster Computing (1998).
[15] Uhlig, V., LeVasseur, J., Skoglund, E., and Dannowski, U. Towards Scalable Multiprocessor Virtual Machines. In Proceedings of the 3rd Virtual Machine Research and Technology Symposium (San Jose, CA, May 2004), pp. 43–56.
[16] Willmann, P., Kim, H., Rixner, S., and Pai, V. An Efficient Programmable 10 Gigabit Ethernet Network Interface Card. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (2005).