IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 31, NO. 12, DECEMBER 2013
SR-IOV Based Network Interrupt-Free Virtualization with Event Based Polling
HaiBing Guan, Member, IEEE/ACM, YaoZu Dong, Kun Tian, Jian Li, Member, IEEE/ACM
Abstract—Along with the developments of networking and virtualization technologies, high speed network connections have become one of the key components in cloud computing and datacenters. Single-Root I/O Virtualization (SR-IOV) pushes network throughput close to the line rate and achieves high scalability in 10Gbps and faster network environments. However, the overhead of SR-IOV interrupt virtualization remains significant due to the additional trap-and-emulation overhead on the virtual interrupt controller. The faster the virtualized network connection, the higher the interrupt frequency becomes on a high bandwidth network. To mitigate this problem, we propose a smart Event-Based Polling model (sEBP), which leverages existing system events to trigger regular packet polling so that network interrupts are eliminated from the critical I/O paths in the virtual environment. Thanks to the many varieties of system events, sEBP can deal with the network workload in a configurable and flexible manner. Within the hierarchical virtualized environment, it can be implemented either at the guest OS kernel level or at the Virtual Machine Manager (VMM) level. Since polling is much lighter than interrupt processing, sEBP significantly reduces the network processing overhead. The experimental results prove the efficiency of sEBP, which achieves up to a 59% performance improvement and a 23% better scalability ratio.
Fig. 1. Architecture of SR-IOV.
Index Terms—Virtualization, Event-based polling, Interrupt, SR-IOV.
I. INTRODUCTION
VIRTUALIZATION technology is widely employed in data-centers and cloud computing platforms in order to improve the utilization of the underlying hardware, including modern multi-core processors and high bandwidth networks. Along with the continuous increase of network speed, network virtualization has become one of the architectural foundations for service and application strategies, which is also a key issue that affects the whole system performance and scalability. Therefore, high-speed network connection performance and efficiency are critical for a virtualization-based cloud computing system. Software-based I/O virtualization solutions, such as device emulation [20] and the para-virtualized split driver model [14], have advantages in system migration and flexibility. However, they suffer from a high overhead resulting from either excessive trap-and-emulations or bulk data movements [14][22].
Manuscript received December 1, 2012; revised July 31, 2013. H.B. Guan is with the Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University, China (e-mail: [email protected]). YaoZu Dong and K. Tian are with the Intel China Software Center, and YaoZu Dong is currently a Ph.D. candidate at SJTU (e-mail: {Eddi.dong, kun.tian}@intel.com). J. Li is with the School of Software, Shanghai Jiao Tong University, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/JSAC.2013.1312xx.
The main issue is that their performance cannot scale to cope with high-speed networks (such as 10Gbps or higher). Single Root I/O Virtualization (SR-IOV) [1], created by the PCI SIG, defines a set of hardware enhancements targeting PCIe devices. It removes major Virtual Machine Monitor (VMM) interventions for performance-critical operations such as data movement, packet classification and address translation. SR-IOV inherits Direct I/O technology through the use of the Input/Output Memory Management Unit (IOMMU) to offload memory protection and address translation. As Fig. 1 shows, an SR-IOV Network Interface Card (NIC) assigns a Virtual Function (VF) driver to each virtual machine. Virtual machines can use VFs to transfer data packets directly with the NIC, bypassing the hypervisor. With the help of SR-IOV, virtualized network connections can achieve a nearly 10Gbps line-rate throughput while introducing reduced CPU overhead and higher scalability [11]. SR-IOV has therefore become the de facto standard I/O virtualization solution for Virtual Machine (VM) based high performance computing [26]. Although the data transmission overhead of SR-IOV has been reduced to a negligible degree, its notification overhead is still significant, due to the intervention of the VMM when handling interrupts, that is, interrupt virtualization [1] [5] [4] [10] [15]. The interrupt processing routine involves highly frequent
context switches due to occurrences of physical interrupts and trap-and-emulations on the virtual interrupt controller. Consequently, the interrupt handling path is lengthened and the cache is polluted. In the meantime, the interrupt frequency in the SR-IOV environment increases almost linearly with the number of VMs, as each Virtual Function (VF) generates its own interrupts in parallel [11]. Note that although SR-IOV offloads part of the network processing from the CPU workload to the PCI hardware device, the overhead of interrupt virtualization remains a major bottleneck in high performance network virtualization. Through different experiments we have observed that up to 39% of the CPU cycles are consumed by the interrupt virtualization path on a 10Gbps network using the SR-IOV NIC. A more detailed analysis follows in Section II. The current solutions to this problem follow two main strategies: reducing the cost of each interrupt [1] [5] [10] [21] [25] [15], and mitigating the interrupt frequency [4] [10] [11]. These reduce the overhead of interrupt virtualization in SR-IOV; however, interrupts are still used as the fundamental notification mechanism in the critical I/O paths, and their overhead remains significant. We call such mechanisms an interrupt-driven device driver model, or interrupt model for short. This paper is dedicated to finding an interrupt-free network processing scheme for high-speed network connections. The original purpose of interrupts in the I/O data path was to notify the CPU so that it could respond in a timely manner to incoming I/O events. However, the interrupt model in a virtual environment does not perform today as it does on bare metal: excessive interrupt processing overhead adds unnecessary latency and impacts the overall throughput of high bandwidth networks. Furthermore, VCPU scheduling creates additional difficulties with I/O latency. To this end, we propose sEBP, a smart event-based polling driver model, which completely eliminates interrupts from the critical I/O paths. Rather than relying on expensive device interrupts, sEBP takes advantage of a variety of system events to drive the device driver so that I/O events are processed in due time. sEBP delivers higher performance and uses CPU resources more efficiently than contemporary SR-IOV solutions, including hardware interrupt throttling (Intel 82599) and software interrupt mitigation schemes (NAPI driver). The major contributions of our work are as follows:
• We propose an event-based polling model for efficient I/O virtualization. This new model takes advantage of existing system events for polling incoming I/O events.
• We implement and study both sEBP-host (sEBP in the host or VMM) and sEBP-guest (sEBP in the guest kernel or VM). Up to a 59% performance improvement and a 23% better scalability ratio are achieved. For $i$ given VMs we define the performance scaling ratio as $(performance_{ivm} - performance_{1vm})/performance_{1vm}$.
• We introduce the event manager, which consists of a rate controller, a compensating timer and a cross-VM event sharing module. It ensures the performance of sEBP under any condition.
The rest of the paper is organized as follows. Section II
Fig. 2. Additional Interrupt Processing Overhead in Virtual Environment.
provides an overview of the overheads of interrupt virtualization in high-speed networks. In Sections III and IV we describe the design and implementation of sEBP, covering policies both at the VMM and at the VM level. Then, in Section V, we discuss experiments that evaluate sEBP and compare it to the interrupt model. Section VI presents related work while Section VII discusses future work and gives conclusions.
II. OVERHEADS OF INTERRUPT VIRTUALIZATION
Interrupt virtualization remains a key overhead source in high performance network virtualization. This is due to the additional trap-and-emulation overheads of the virtual interrupt controller, and to the high interrupt frequency on high bandwidth networks. This section introduces the heavy overhead of each virtual interrupt processing and the high occurrence frequency of interrupts in high-speed network connections. 1) The Cost of Handling an Interrupt: As shown in Fig. 2, the cost of handling an interrupt in a virtual environment is much higher than on bare metal [1] [10] [11] [15]. Unlike bare metal, where only two context switches occur between the interrupted context and the Interrupt Service Routine (ISR), interrupt virtualization has a longer path involving context switches between the VMM and the VM. The virtual interrupt processing routine is shown in Fig. 2: first the physical interrupt is handled by the ISR in the VMM, then a virtual interrupt is injected into the ISR in the VM. During the execution of the VM ISR, there are several more context switches, due to the trap-and-emulation of the virtual interrupt controller. Fig. 3 illustrates the cycle consumption distribution for handling interrupts in a virtual environment, where the ISR in the VM itself consumes only 14% of the total cycles. The KVM hypervisor [3] and Netperf [19] are used in the experiments. “ISR (VMM)” stands for the phase starting with the occurrence of the physical interrupt and ending with the VMM ISR; as such it includes the scheduling latency caused by KVM running the ISR in a kernel thread. “vIRQ injection (VMM)” covers the span from setting the virtual pending bits to the point where the target VM is actually scheduled, which again may include scheduling latency. “APIC emulation (VMM)” occurs when the VM ISR accesses the virtual interrupt controller. The interrupt handling path in a virtual environment thus costs an order of magnitude more than on bare metal.
Fig. 3. Anatomy of the interrupt overhead in the virtual environment, where the VM ISR itself only consumes 14% of the total CPU cycles.
Fig. 5. Candidate list of system events in the VM and VMM level.
Fig. 4. The cycles spent on handling interrupts occupy up to 39% of the total consumed cycles in the 8vm case.
2) Frequency of Interrupts: Many techniques have been proposed to reduce excessive interrupts. The Intel 82599 Niantic 10Gbps NIC [18] supports an interrupt throttling mode, through which the driver can configure the maximum interrupt rate allowed by the NIC. For example, the Intel 82599 ixgbe driver sets a default value of 8000 intr/s for the NIC, meaning that no more than 8000 interrupts can be delivered by the NIC within a second. NAPI [23] in Linux mitigates interrupts on the receiver side by dynamically switching between polling and interrupts. An interrupt is kept masked by NAPI until the polling phase exits, when the receiving buffer is empty or a pre-defined polling threshold is reached. Note that these interrupt mitigation schemes are taken as the experimental Baseline to measure the efficiency of our method; all the details can be found in Section V. Even when using the above interrupt mitigation techniques (NAPI and interrupt coalescing), the interrupt frequency remains high in high-speed SR-IOV environments. Fig. 4 shows the interrupt frequency and the CPU processing overhead while running network streaming (Netperf benchmark with a packet size of 1500B). A detailed experimental setup configuration is given in Section V. The interrupt frequency per VM becomes close to 8000 intr/s (∼32k for 4VM) when scaling from 1VM to 4VM, almost quadrupling the total interrupt
frequency. When hosting 8VM, the interrupt frequency per VM reduces to 5000 intr/s (∼40k for 8VM), reflecting the fact that NAPI and the NIC interrupt throttling eliminate some of the interrupts. However, the frequency of interrupts and the CPU overhead required to process them in virtual environments remain a great challenge. Moreover, Fig. 4 illustrates that interrupt processing consumes about one third of the CPU cycles, and can use up to 39% of the total cycles when hosting 8VM. Note that when dealing with network streaming of small packets the interrupt frequency may be multiplied tenfold, implying a significant increase of the processing overhead.
III. sEBP ARCHITECTURE DESIGN
A. Overall Design
The basic idea behind the design of sEBP is to leverage various events occurring in the system, such as system calls and VM EXITs, to replace heavy interrupt processing. These kinds of events happen frequently all over the system, so they can mimic the notification role fulfilled by interrupts. On the other hand, a considerable number of cycles is already spent on handling those system events, so the cost added to those paths for polling the NIC status is negligible.
Event Selection: In a virtualized hierarchical architecture, the existing events are located in the VMM or in the guest OS kernel. Fig. 5 shows the available frequently occurring events that can potentially be used to trigger network packet polling. 1) On the guest kernel layer, kernel events such as signals, exceptions, and interrupts as well as entering/leaving the idle loop can serve as driving events. 2) If the virtual environment is accelerated by hardware virtualization techniques (e.g. Intel VT-x [18]), the VM EXIT points on the host VMM layer can also be used. VT-x delivers all the exits to a common handler with the specific exit reason clarified. These events happen as long as the system is not idle, so they can serve as driving events at the hypervisor layer. There are several different reasons causing VM EXITs, such as interrupt or halt; however, they all reach the same code path, which we choose as the point of event collection.
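For illustration, the sketch below models such a common VM-EXIT dispatch point feeding the event collector. All names here (handle_vm_exit, sebp_collect_event, struct vcpu, the exit reasons) are hypothetical stand-ins rather than KVM's actual symbols; the dispatch shape is deliberately simplified.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the hypervisor's own structures. */
struct vcpu { int id; };

enum exit_reason { EXIT_EXTERNAL_INTERRUPT, EXIT_HLT, EXIT_IO, EXIT_OTHER };

static uint64_t sebp_events_seen;

/* Event collector entry point: count the event; the event manager later
 * decides whether it becomes an effective (polling-triggering) event. */
void sebp_collect_event(struct vcpu *v)
{
    (void)v;
    sebp_events_seen++;
}

/* Common VM-EXIT dispatch point: whatever the exit reason, every exit
 * passes through here, so this single hook covers all sEBP-host events. */
int handle_vm_exit(struct vcpu *v, enum exit_reason reason)
{
    sebp_collect_event(v);            /* notify the sEBP event collector */

    switch (reason) {                 /* then handle the exit as usual   */
    case EXIT_EXTERNAL_INTERRUPT:     /* ... */ break;
    case EXIT_HLT:                    /* ... */ break;
    default:                          /* ... */ break;
    }
    return 0;
}
```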
Fig. 6. Overall Architecture of sEBP.
sEBP can be implemented in either the VMM or the guest OS kernel, leading to two potential implementations: sEBP-guest and sEBP-host. Their detailed designs and implementations will be introduced in the next section. Here, we first give an overview of the functional design of sEBP. Additionally, multiple system events may come together when one event is correlated with another. Hardware timer interrupts and software timers are such an example: software timers are driven by hardware timer interrupts, so counting the occurrences of both does not generate more meaningful polling events than counting only the hardware timer interrupts. In fact, selecting two correlated events is unnecessary, since several polling actions too close to one another are meaningless.
The overall functional components of sEBP are the event collector and the event manager. The event manager is composed of a rate controller, a compensating timer module and a cross-VM event sharing module, as shown in Fig. 6. Since the number of events accumulated by the event collectors may not be as exact as expected, the event manager is introduced to throttle the number of effective events out of the event pool. The strategy is similar to the one employed in the interrupt throttling mechanism of the NIC. Those effective events finally fulfil the role of interrupts to drive the polling of the NIC status and the subsequent packet handling.
B. Event Collector
The event collector monitors the occurrence of a given system event at the entry point where its handling starts, and then notifies the event manager. To cooperate with the event collector, many hook functions are embedded in specific places in the kernel according to the event selection described in the previous section. Note that the event collector of sEBP-host is simpler: there is a common exit point in the host VMM layer if the virtual environment is accelerated by a hardware virtualization technique (e.g. Intel VT-x [24]).
C. Event Manager
In order for the interrupt model to be replaced by the existing system events, the selected events must occur frequently
and in due time. However, the events collected by the event collector may happen too frequently or not frequently enough. The event manager module is responsible for dealing with these two problems.
1) Rate Controller for Throttling Excessive Events: Some workloads generate many system events in a short interval, and if all these events drove polling they would introduce unnecessary overhead. The event manager includes a rate controller to throttle the polling frequency in order to solve the first problem. We introduce $ER_{threshold}$, an Event Rate threshold for the rate controller; this is an important parameter of sEBP. It constrains the polling to be triggered a certain number of times per second (the polling frequency). In order to produce the expected $ER_{threshold}$, the rate controller imposes a minimum interval between two consecutive polling actions. Once a polling action is completed, that is, there is no pending I/O packet, sEBP suspends any further polling action for a predefined time interval $I_t$. This scheme is implemented in a similar way to the hardware interrupt throttling mechanism. The lightweight processor Time-Stamp Counter (TSC) on the Intel x86 CPU is used as a high resolution time source. When a polling action is completed, a time-stamp ($T_{last}$) is recorded. When a new event happens, the event manager samples the current time-stamp ($T_{curr}$) and compares it to $T_{last}$; if the interval is not large enough, the polling action from the new event is suppressed until
$$T_{curr} - T_{last} \geq I_t. \qquad (1)$$
We use this formula to make sure that two polling actions are never too close to one another; it effectively limits the upper bound of the actual polling rate. An event which passes Formula (1) is called an effective event. Only effective events can finally trigger polling actions. After a polling is completed, the rate controller should wait for a time interval $I_t$ before starting the next polling. However, there will be a time gap between the expected $I_t$ and the occurrence time of the first effective event ($T_{curr}$), as shown in Formula (1). In order to avoid the accumulated delay caused by this gap, sEBP calculates $I_t$ as
$$I_t = 0.8/ER_{threshold}. \qquad (2)$$
Notice that we shorten $I_t$ by 20% here in order to avoid the accumulated polling delay caused by the system event arrival latency. We conducted a series of tests shortening the factor by 5%, 10%, ..., 35%, and concluded that 20% is adequate to achieve the expected $ER_{threshold}$. Using Formulas (1) and (2), the event manager can keep the effective event rate as close as possible to $ER_{threshold}$. The event manager counts the number of events offered by the event collector ($E_{total}$) and the number of effective events ($E_{eff}$) that pass the examination in a given interval. The effective rate $R$ over a period of time can thus be calculated as $R = E_{eff}/E_{total}$. $R$ is considered constant for a given event manager implementation, which is useful when dynamically adjusting the frequency of the compensating timer, as we are now going to study.
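Before moving on, a minimal user-space model of the rate controller just described is given below. The TSC time source and the 0.8 factor of Formula (2) come from the text; the structure layout, the calibrated tsc_hz input and the function names are assumptions made for this sketch only.

```c
#include <stdint.h>
#include <stdbool.h>
#include <x86intrin.h>                 /* __rdtsc(), x86 only */

struct rate_ctrl {
    uint64_t t_last;                   /* TSC stamp of the last completed poll */
    uint64_t min_interval;             /* I_t expressed in TSC cycles          */
};

/* Formula (2): I_t = 0.8 / ER_threshold, converted to TSC cycles.
 * The 20% shortening absorbs the arrival latency of system events. */
void rate_ctrl_init(struct rate_ctrl *rc, uint64_t tsc_hz,
                    unsigned int er_threshold)
{
    rc->t_last = 0;
    rc->min_interval = (uint64_t)(0.8 * tsc_hz / er_threshold);
}

/* Formula (1): an event is "effective" (allowed to trigger polling) only
 * if T_curr - T_last >= I_t; otherwise it is suppressed. */
bool rate_ctrl_effective(const struct rate_ctrl *rc)
{
    return __rdtsc() - rc->t_last >= rc->min_interval;
}

/* Record T_last when a polling action completes (no pending packets). */
void rate_ctrl_poll_done(struct rate_ctrl *rc)
{
    rc->t_last = __rdtsc();
}
```

In the real sEBP the same check runs in kernel context inside the event manager; the user-space form above only makes the arithmetic concrete.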
2) Compensating Timer for an Insufficient Number of Events: The previous subsection shows that shortening $I_t$ is enough when there is a sufficient number of events. Other workloads, however, may fail to provide enough system events, so that incoming I/O events cannot be dealt with in a timely manner. More precisely, $(T_{curr} - T_{last})$ may be extremely large for rare events. The event manager must deal with the case where an insufficient number of system events occurs. This can happen for workloads that spend the majority of their time simply sending/receiving network packets, with little time spent digesting or generating the packet content, for example the Netperf micro-benchmark. Once the VM becomes idle after handling the current network traffic, only occasional events separated by long intervals occur, which can severely impact the I/O performance. In sEBP, we use a timer-based compensation approach, arming a compensating timer when the OS cannot generate as many events as expected (shown in Fig. 6). The timer serves as the worst-case polling event if the event collectors cannot generate enough events to meet the $ER_{threshold}$ requirement. Though the timer-based method uses a physical timer interrupt, it is still more efficient than the interrupt mode since a timer interrupt is much lighter to process than an NIC interrupt.
For the compensating timer approach, controlling the timer configuration could be an issue. When the event generation status of the system changes, the frequency of the timer should be adjusted dynamically. We propose an adaptive algorithm that keeps the effective events as close as possible to our expectation when the event generation status of the system changes (Fig. 7). If the effective events that occurred in the last statistical observation window are fewer than expected, we increase the frequency of the timer so that more timer interrupts will be generated in the next window. If the effective events exceed our expectation, we decrease the frequency. When we observe that the number of events without the timer is sufficient, we simply stop it. This is the rough idea on which our compensating timer is based.
Dynamic Timer Adjustment: Let $\alpha$ be the number of effective events in a statistical observation window $p$. If $\alpha$ is less than our expectation $e$, we use a timer to generate effective interrupt events at a rate of $(e-\alpha)/p$ for the next window. However, if the number of system events changes in the next window, or if the rate controller discards or misses events that cannot pass the effectiveness test, then the timer must be adjusted again. This configuration is based on the statistical information of the last observation window, which is used to predict the adequate timer configuration for the next window. As mentioned above, we use a periodic statistical observation window $p = 500$ms, during which the expected number of effective events is $e = ER_{threshold} \cdot p$. We count the effective events in the last observation window $p$ and denote their number by $\alpha$. Let $old$ be the timer setting in the last window; the number of timer events occurring in the last window is then $p/old$. Since the effectiveness rate $R$ is stable for a given event manager, the number of kernel events provided by the event collector, denoted by $k$, can be obtained from Formula (3).
Fig. 7. Dynamic Timer Adjustment Method.
$$\left(\frac{p}{old} + k\right) R = \alpha. \qquad (3)$$
Assuming that the events provided by the event collector will not change significantly, we can adjust the timer setting ($new$) according to
$$\left(\frac{p}{new} + k\right) R = e. \qquad (4)$$
We can then obtain the new timer configuration through an easy deduction from these two formulas as
$$new = \frac{1}{\dfrac{e-\alpha}{p \cdot R} + \dfrac{1}{old}}, \qquad (5)$$
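A small numeric sketch of the adjustment loop of Fig. 7 is given below, transcribing Formulas (3)-(5) directly. The variable names mirror the text (p, e, alpha, R, old, new); the example numbers and the treatment of an initially disabled timer (taken as 1/old = 0) are our own assumptions. The positivity and cancellation conditions are discussed right after this sketch.

```c
#include <stdio.h>

/* Dynamic compensating-timer adjustment, Formulas (3)-(5).
 *   p     : statistical observation window (e.g. 0.5 s)
 *   e     : expected effective events per window, e = ER_threshold * p
 *   alpha : effective events actually observed in the last window
 *   r     : effectiveness rate R of the event manager
 *   old   : timer interval used in the last window (<= 0 means "off")
 * Returns the new timer interval, or a value <= 0 meaning "cancel the
 * timer": the kernel alone already supplies enough events.            */
double adjust_timer(double p, double e, double alpha, double r, double old)
{
    double inv_old = (old > 0.0) ? 1.0 / old : 0.0;
    double denom   = (e - alpha) / (p * r) + inv_old;   /* Formula (5) */

    if (denom <= 0.0)
        return -1.0;        /* new would be negative: stop the timer   */
    return 1.0 / denom;
}

int main(void)
{
    /* Assumed numbers: p = 0.5 s, ER_threshold = 8000 so e = 4000,
     * only 2500 effective events observed, R = 0.1, timer currently off. */
    double p = 0.5;
    double new_iv = adjust_timer(p, 4000.0, 2500.0, 0.1, -1.0);

    if (new_iv > 0.0)
        printf("new timer interval: %.6f s (%.0f timer events per window)\n",
               new_iv, p / new_iv);
    else
        printf("compensating timer cancelled\n");
    return 0;
}
```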
The adjustment in Formula (5) is applied when the computed value of $new$ is positive. The above dynamic timer adjustment is illustrated in Fig. 7. Next, if the number of effective events is sufficient, we need to determine whether to stop the timer or not. Obviously, the timer cannot be set to a negative value. When $new$ is adjusted to be negative, that is, if $\frac{p}{old} \cdot R \leq \alpha - e$, we can assume that the kernel itself is able to provide enough events without the help of the compensating timer, so we cancel it. Note that $R$ is assumed to be stable in the deduction, but the timer can adaptively compensate for its slight changes through the adaptive control loop shown in Fig. 7.
The compensating timer approach can be used in both sEBP-guest and sEBP-host. However, sEBP-host has a more efficient method, called cross-VM event sharing, to address this problem; this will be considered in the next section.
IV. IMPLEMENTATION
This section introduces the two different implementations of the sEBP model, and illustrates how they mitigate the interrupt virtualization overhead. The implementation is based on KVM [3][8], Intel virtualization technology [11] [18], and Linux VMs. KVM is a Linux subsystem that leverages these virtualization extensions to add a virtual machine monitor (or hypervisor) capability to Linux, where the hypervisor runs as a module in the host kernel. As mentioned earlier, sEBP has several implementation alternatives: one collects events and triggers polling at the guest layer while the other does it at the hypervisor layer. We call them sEBP-guest and sEBP-host respectively.
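As a preview of the guest-side variant detailed in Section IV-A, the kernel-style sketch below shows how an event hook could gate a NAPI poll through the rate controller. napi_schedule() and NET_RX_SOFTIRQ are the standard Linux interfaces named in the text; the sebp_dev structure, sebp_rate_allow(), the hook placement and the use of ktime instead of the TSC are assumptions of this sketch, not the authors' code.

```c
#include <linux/netdevice.h>      /* struct napi_struct, napi_schedule() */
#include <linux/timekeeping.h>    /* ktime_get_ns()                      */
#include <linux/types.h>

struct sebp_dev {
    struct napi_struct *napi;     /* the VF driver's NAPI context        */
    u64 t_last_ns;                /* last trigger (simplified: the paper
                                     records T_last at poll completion)  */
    u64 min_interval_ns;          /* ~0.8e9 / ER_threshold, Formula (2)  */
};

static bool sebp_rate_allow(struct sebp_dev *s)
{
    u64 now = ktime_get_ns();

    if (now - s->t_last_ns < s->min_interval_ns)
        return false;             /* suppressed by the rate controller   */
    s->t_last_ns = now;
    return true;
}

/* Called from hooks placed at system call entry, context switch, idle
 * entry/exit, etc.; the exact placement depends on the guest kernel.    */
void sebp_guest_event_hook(struct sebp_dev *s)
{
    if (sebp_rate_allow(s))
        napi_schedule(s->napi);   /* raises NET_RX_SOFTIRQ; the VF poll()
                                     callback then drains the RX ring    */
}
```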
Fig. 8. Architecture of sEBP-guest.
Fig. 9. Architecture of sEBP-host.
A. sEBP-guest Implementation
The architecture of sEBP-guest is shown in Fig. 8. It is a module in the guest kernel which collects system events in the guest VM, such as system calls, context switches and so on, and triggers NIC polling. sEBP-guest has one $ER_{threshold}$ that controls the event manager in this VM. The polling module in sEBP-guest simply utilizes the existing NAPI interface of the Linux VM, i.e. it invokes napi_schedule() to schedule a polling action and then triggers a NET_RX softirq. When the softirq is handled, the poll() callback of the VF driver is invoked to check the receiving buffer on the VF. The sending buffer usually needs less frequent monitoring, so the polling module triggers one NET_TX poll for every two RX polls, according to the observed TX/RX interrupt ratio. The physical interrupts of the VF are kept masked. The whole interrupt path is shown in Fig. 8. sEBP-guest can omit all the steps of the interrupt process: it does not involve the hypervisor at all, nor does it introduce any extra context switch. The event collector is so lightweight that its overhead can be ignored, but it requires modifications to the guest OS kernel to hook various functions for the event collector.
B. sEBP-host Implementation
sEBP-host is a module in the host kernel (VMM), as shown in Fig. 9. sEBP-host collects various VM EXITs at the hypervisor layer as events. In order to manage multiple VMs separately, sEBP-host maintains one $ER_{threshold}$ value for each running VM. This means that a different $ER_{threshold}$ value can be assigned to each VM according to its workload. The NAPI polling interface invoked inside the guest OS is hidden from the hypervisor, which forces a less efficient path: we use a virtual interrupt injection as the polling starting point. However, its overhead is only a little lower than that of the interrupt model, which means that sEBP-host on its own has a limited advantage in some cases. Fortunately, sEBP-host provides a more lightweight solution called cross-VM event sharing. Since sEBP-host collects events at the hypervisor layer, the events from different VMs can be shared. More specifically, as shown in Fig. 10,
Fig. 10. Cross-VM Event Sharing in sEBP-host.
when the hypervisor collects an effective event from a VM, it not only notifies that VM to poll its NIC, but also notifies all the other VMs running at the same time. In other words, one event can serve multiple VMs. This can significantly improve the performance when several VMs run at the same time and each VM has an insufficient number of events by itself. Besides, compared to the compensating timer approach, it induces almost no extra overhead.
C. sEBP-guest vs. sEBP-host Analytical Comparison
Two significant differences between sEBP-guest and sEBP-host make their performance quite different. 1) Different Event Sets: sEBP-guest uses system events in the guest kernel. Some events happening in the guest can be observed in the form of a VM EXIT at the hypervisor layer, for example timer interrupts and entering idle (they all cause a VM EXIT). Some other events, however, happen in the guest without causing a VM EXIT, for example context switches and system calls. Therefore, sEBP-host has fewer available events than sEBP-guest, but features a simpler event collector. In addition, context switch and system call events are good candidates as they are frequent and fairly distributed. They represent the major part of the whole set of events. Without them, sEBP-host might at times suffer from a lack of events. 2) Different Code Paths Involved When Polling: sEBP-guest collects events and triggers polling in the guest OS kernel. In detail, sEBP-guest directly calls the entry of the softirq
TABLE I
EXPERIMENTAL PLATFORM CONFIGURATION
Components | Version and Configuration
Host Linux | RHEL6.2 (kernel: 2.6.39.4)
Host NIC PF driver | Ixgbe-3.4.24
Guest Linux | RHEL6.1 (kernel: 2.6.32.7)
Virtual NIC VF driver | Ixgbevf-2.2.0
Memory per VM | Default 256MB (384MB for Memcached)
VCPUs per VM | 1
which is the last step of the whole procedure. sEBP-host needs to use a virtual interrupt injection as the starting point for the polling, which is a much longer path than that of sEBP-guest. Moreover, this path cannot exclude the APIC emulation, which incurs several expensive VM EXITs. As there are more event candidates and the polling path is shorter, sEBP-guest outperforms sEBP-host in some cases. Note, however, that sEBP-guest requires modifying the guest OS, which is contrary to the common virtualization strategy. Furthermore, when lacking events, the system can benefit from the cross-VM event sharing method across the many hosted VMs in sEBP-host, as described earlier. Cross-VM events cannot occur as regularly as timer interrupts, so the efficiency rate of the cross-VM events will be lower than that of the timer interrupt events in sEBP-guest. This results in an occasional lack of events to trigger polling in sEBP-host. In this case, we can use a larger receiving buffer size to remedy the occasionally irregular arrival times of cross-VM events. In general, both sEBP-host and sEBP-guest display strong points in specific cases that will be examined and evaluated in the next section.
V. EVALUATION
This section evaluates the performance of sEBP for various workloads in a virtual environment. The results show that sEBP achieves great performance improvements in all scenarios, with up to a 59% performance improvement in the WebBench experiments and a 23% better scalability ratio in the Memcached experiments.
A. Experimental Setup Configuration
1) Platform: The test environment is composed of two machines: one acts as a client that sends the network workload while the other acts as a server hosting the VMs. Both the client and the server have the same configuration: one 2.93GHz Intel Westmere socket (6 cores/12 threads), 16GB memory, and an Intel 82599 Niantic 10Gbps NIC. The two machines are connected to one another using a direct 10Gbps fiber. The detailed environment information is described in Table I. The implementation of sEBP is as introduced in Section IV. 2) Experimental Scenario: Baseline: In order to fairly evaluate the efficiency of sEBP, we compare it with the modern Intel SR-IOV solution, which we consider as the Baseline. Note that the Intel SR-IOV solution already features advanced interrupt mitigation methods. We use the Intel 82599 network interface card (NIC) together with the ixgbe based driver, where the NIC hardware provides the interrupt throttling method and the software driver integrates the NAPI technologies (shown in
Fig. 11. Baseline with Interrupt Mitigation Schemes
Fig. 11). Moreover, the evaluations are also compared against dynamic interrupt mitigation configurations such as a tuned interrupt throttle rate. The detailed description of the interrupt mitigation in the Intel SR-IOV solution was given in Section II. We set up the fair comparison of sEBP with the Baseline as follows. As mentioned in Section II.2, the Intel 82599 NIC VF driver sets the interrupt throttling rate to 8000 with NAPI enabled by default [18]. NAPI [23] mitigates interrupts on the receiving side by keeping an interrupt masked until the polling phase exits, either because the receiving buffer is empty or because a pre-defined polling threshold is reached. To make a fair comparison, we set $ER_{threshold}$ to 8000, which allows a maximum of 8000 effective events per second to fulfil the role of the interrupts. In all the experiments, we measure both the performance and the CPU utilization when scaling from 1VM to 8VM. We compare the event-based polling schemes to the Baseline strategies using modern interrupt mitigation schemes. We are then able to evaluate the advantages of the sEBP models in terms of network performance and CPU utilization. The experimental results are generally compared to the Baseline performance, and we show the advantages of sEBP-guest and sEBP-host relative to the Baseline.
Benchmark: Instead of using CPU-hungry loops and simple artificial workloads such as PING, we employ various benchmarking tools, namely Memcached [13], WebBench [2] and Netperf [19], in order to produce different event environments with practical, broad workloads as the experimental scenarios. These benchmarks are widely used and acknowledged by academic researchers as well as industrial enterprises (e.g., Youtube, Twitter, Wikipedia, HP, etc.) when testing network performance. 1) Memcached: Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) resulting from database calls, API calls, or page rendering [13]. It is used by many high-profile web sites, and here serves to evaluate overall performance and scalability. The processing work of a Memcached server consists of sending and receiving network traffic, as well as parsing high level network protocols, accessing disks, and so on, which all produce related system events. It therefore generates abundant system events and shows the performance improvement brought by sEBP when a sufficient number of events can be used to trigger polling actions. 2) WebBench: We also evaluate sEBP performance using WebBench [2], a stress testing tool for benchmarking web and proxy servers. We installed an Apache server on each VM on the server. Each WebBench instance simulates four concurrent clients. Like Memcached, the
WebBench test also simulates situations with enough events. 3) Netperf: Netperf is a benchmarking tool that can be used to measure the performance of many different networking types [19]. For streaming tests, it opens a TCP connection between the two machines of our platform and makes as many rapid write() calls of a given size as possible. When testing the receiving-side performance, the Netperf server simply receives packets from the line, without doing any post-processing on the received data. In such an environment, the majority of the available system events are actually the result of the current network traffic, and therefore they drop dramatically after the packets have been handled. Thus, Netperf is helpful to study an environment with an insufficient number of events.
These benchmarks emulate all the major workloads in cloud computing and data-centers. They cover the situations where CPU-intensive, database-accessing and I/O-intensive workloads co-exist, but also include pure network workloads. Note that the case consisting only of CPU computing workloads without any network connection is out of our scope. In Section V.B, we first evaluate the performance of sEBP-guest, sEBP-host and the interrupt model assuming there is a sufficient number of events. Then Section V.C evaluates sEBP when system events are lacking. Finally, we show the benefit of a dynamic $ER_{threshold}$.
B. sEBP Performance with Sufficient Number of Events
In this section, we evaluate the performance of sEBP in the case of a sufficient number of events, and use Memcached and WebBench to this end. When sEBP has enough available events for selection, it performs better; this situation therefore shows the real impact of sEBP. 1) Memcached Evaluation: Memcached [13] is a general-purpose, distributed memory caching system, which is often used to speed up dynamic database-driven websites by caching data and objects in RAM in order to reduce the number of times an external data source (such as a database or API) must be read. The system uses a client-server architecture, where Memslap [27] runs on the client and acts as a load generation tool by sending requests to the Memcached server. The Memcached server processes the requests and sends the replies to the client. We configure each Memslap client to make sixty-four concurrent requests using four threads. The Memcached server execution involves sending and receiving network traffic, parsing high level network protocols, accessing disks, and so on. We measure Memcached performance using the request-rate, which refers to how many requests the system can handle. sEBP-guest delivers a significantly higher request-rate for each VM configuration, as illustrated in Fig. 12. The CPU utilization comparison is not shown, since Memslap sends as many requests as possible to keep the server side busy; the CPU utilization in every VM is therefore saturated and no difference can be observed. Request-rate (Requests Per Second, RPS) improvements of 24.8% and 29% are observed in the 1VM and 4VM cases, respectively, and sEBP-guest achieves a 45% higher request-rate than the Baseline in the 8VM case.
Fig. 12. Memcached Throughput Comparison of sEBP-host, sEBP-guest and Baseline (Enough Events).
Besides the request-rate, another key metric of the Memcached benchmark is scalability. It is usually used to evaluate how an application is affected by shared lock contention when executed as multiple threads over a many-core server. Here, we use scalability to evaluate the case where multiple VMs run on a server, and define the scalability ratio as $(request_{8vm} - request_{1vm})/request_{1vm}$. As shown in Fig. 12, the request rate of the Baseline increases by a factor of 2.57 when the number of VMs increases from 1VM to 8VM. At the same time the request rate of sEBP-guest increases by a factor of 3.16. sEBP-guest therefore achieves a 23% ((3.16-2.57)/2.57) better scalability ratio. sEBP-host performs better than the Baseline, but worse than sEBP-guest when a sufficient number of events is available. This will now be discussed in more detail. On the other hand, Fig. 13 illustrates the total number of effective events and processed interrupts when executing the Memcached measurements. By setting $ER_{threshold}$ to the same value as the interrupt throttling rate on the VF (8000 intr/s), the rate controller in sEBP-guest generates almost the same event rate as the interrupt model (Baseline) in the 1VM and 4VM cases. This explains why sEBP-guest achieves the highest RPS in Fig. 12. In the 8VM case, sEBP-guest generates a higher event rate than the interrupt model. In fact, since eliminating the overhead of interrupt virtualization saves CPU resources to process more events, enough available events are issued. The processed interrupt number of the Baseline decreases from ∼8000 to ∼6500 since more CPU is occupied by other computing processes. On the sEBP-host side, the amount of events decreases when the number of VMs is scaled up. Note that sEBP-host achieves a higher throughput than the Baseline even with much lower event rates, and also achieves similar performance to sEBP-guest with far fewer events. This confirms the discussion in Section IV.C.1 and illustrates the efficiency of sEBP. We derive the statistics of the effective system events for each configuration, as shown for the 1VM case in Table II. Both the total collected events and the effective events are shown. Given an event type, we define the effectiveness ratio as the percentage of effective events out of all the collected events.
Fig. 13. Produced System Event Amount in Memcached.
Fig. 14. LLC Cache Miss Rate in Memcached Evaluation.
TABLE II
EVENT SELECTION IN THE 1VM MEMCACHED CASE
Event Types | Effective Events | Total Events | Effective Rate
sEBP-guest:
System Calls | 7864 | 300590 | 2.6%
Virtual Interrupts | 109 | 1005 | 10.8%
Page Faults | 24 | 1114 | 2.2%
Context Switches | 28 | 666 | 4.2%
Enter Idle Loop | 0 | 0 | n/a
Signals | 0 | 0 | n/a
sEBP-host:
VM EXIT | 2007 | 9573 | 21.0%
In Table II, there is a total of ∼303K events collected per second by sEBP-guest, with ∼8K effective events passing through the rate controller; the system call events are responsible for ∼98% of the effective events. This provides a sufficient, well-distributed system event set for polling the NIC status evenly, and thus efficiently replaces the role of the interrupts. The virtual interrupts are responsible for 1.4% of the effective events (109 events out of 8K). The vast majority of them come from the virtual timer interrupt. Timer interrupts normally occur at very regular intervals; for example, the timer that drives the scheduler fires every 1ms (Hz=1000). Because of this even distribution, the effectiveness ratio of the virtual interrupts is higher (10.8%). It is also important to note that Page Fault and Context Switch events only contribute a few effective events at a low effectiveness ratio. This may be because they appear alongside earlier effective events and are thus ignored. Also, note that there is neither an idle loop nor signals in this workload: the system is too busy to enter the idle loop and there is no inter-process communication generating signals in this test. Table II also exhibits a distinctive behavior between sEBP-host and sEBP-guest. In the sEBP-host case the system call events do not cause the VM to exit, and thus make no contribution. We observed ∼2K effective events among a total of ∼9.5K VM EXIT events. Furthermore, Fig. 14 compares the last level cache miss rate of sEBP-guest and the interrupt model in the Memcached test, normalizing the rate of the interrupt model to 1. A 12% lower average miss rate and a 15% lower store cache miss rate are achieved by sEBP-guest. This is explained by the fewer context switches incurred in the event-based polling model. Note that the current sEBP-host solution does not eliminate the influence of the extra VM EXITs. Therefore the cache miss rate of sEBP-host is similar to that of the Baseline; they share the same labelled column in Fig. 14.
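For concreteness, the effectiveness ratios and the system-call share quoted above follow directly from the Table II figures:
$$\frac{7864}{300590} \approx 2.6\%, \qquad \frac{2007}{9573} \approx 21.0\%, \qquad \frac{7864}{7864+109+24+28} = \frac{7864}{8025} \approx 98\%.$$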
Although sEBP-host also clearly outperforms the Baseline interrupt model, as previously shown in Fig. 12, it is more meaningful to compare sEBP-host to the Baseline under a matching interrupt rate. For a fair comparison, if the event rate of sEBP-host is observed to be lower than the interrupt rate, the interrupt rate is manually throttled down to this very rate. Accordingly, we set the per-VM interrupt rate of the interrupt model to 2000 for 1VM, 1500 for 4VM, and 1000 for 8VM, following the collected event rate in sEBP-host; this configuration is denoted 'Baseline-dyn' in Fig. 15. sEBP-host displays similar performance to Baseline-dyn in the 1VM and 4VM cases. In the 8VM case, Baseline-dyn achieves 1.315x more RPS than Baseline while sEBP-host achieves 1.38x more. This means that an extra 6.5% request-rate is obtained by sEBP-host in the 8VM case. This is reasonable since the interrupt overhead in Baseline-dyn is already at a moderate level due to the low interrupt rate. It also shows that sEBP-host exhibits its advantages when serving many concurrent VMs, as discussed in Section IV.C.2. Note that sEBP-guest achieves a higher RPS than sEBP-host, which in turn performs much better than Baseline-dyn in the 1VM, 4VM and 8VM cases; the corresponding data is shown in Fig. 12. In summary, sEBP-host achieves performance similar to the dynamic interrupt rate tuning version of the Baseline, and sEBP-guest performs better than Baseline-dyn. Note that sEBP can perform even better if one tunes $ER_{threshold}$; this will be introduced in Section V.D. 2) WebBench Evaluation: Besides the Memcached evaluation, we also evaluate sEBP performance using WebBench [2], a stress testing tool for benchmarking web or proxy servers. WebBench tests the server's capacity to handle a large number of concurrent client requests. Moreover, the transaction rate (request/response) also reveals the network latency. We installed an Apache server on each VM on the server; each WebBench instance simulates four concurrent clients. Like Memcached, the WebBench test also represents a situation where there are sufficiently many events. Fig. 16 illustrates the request/response rate measured by WebBench when varying the number of VMs hosted. A similar relative improvement can also be observed when hosting 1VM or 4VM. sEBP-guest achieves 6587 and 20878 RPS with
Fig. 15. sEBP-host Still Outperforms the Interrupt Model with a Matching Interrupt Rate.
1VM and 4VM, respectively. This is to be compared with the Baseline, which achieves 4130 and 13060 RPS with 1VM and 4VM, respectively. sEBP-guest thus serves 59% more requests than the Baseline when hosting 1VM and 4VM. In the 8VM case, although sEBP-guest has only a 4.4% improvement over the Baseline, there is a 12% CPU utilization reduction. This is due to a bottleneck on the client side: the client is incapable of sending enough packets to the server. For the reason already mentioned above, sEBP-host delivers a request-rate higher than the Baseline but lower than sEBP-guest. This can be observed through a lower bandwidth and a higher CPU utilization in the 1VM and 4VM cases. In the 8VM case, sEBP-host achieves a higher throughput and a higher CPU utilization in comparison to sEBP-guest. This shows that sEBP-host is only efficient when sharing the polling events between several concurrent VMs; a virtual interrupt injection is involved when triggering a polling, rendering it heavier than an sEBP-guest event (as discussed in Section IV.C.2).
We have evaluated sEBP performance using Memcached and WebBench to emulate the practical workloads of database accesses and web service applications. In summary, sEBP-guest and sEBP-host both significantly outperform the Baseline in terms of bandwidth and CPU utilization, and sEBP-host achieves performance similar to sEBP-guest with a much lower amount of events. Next, we evaluate sEBP performance in the case of a networking-only workload.
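The event-scarce scenarios examined next rely on the cross-VM sharing path of Section IV-B; as a reminder, the sketch below shows how the hypervisor could fan one effective event out to all co-resident VMs. Every structure and function name here is hypothetical, and sebp_trigger_poll() stands in for the virtual interrupt injection that actually starts the poll in sEBP-host.

```c
#include <stddef.h>

#define MAX_VMS 16

/* Hypothetical per-VM state kept by sEBP-host. */
struct sebp_vm {
    int id;
    int running;                      /* currently scheduled on a core */
};

static struct sebp_vm *vms[MAX_VMS];

/* Stub: in sEBP-host this would inject the virtual interrupt that
 * starts the polling inside the target VM. */
static void sebp_trigger_poll(struct sebp_vm *vm)
{
    (void)vm;
}

/* Cross-VM sharing: one effective event collected from 'src' is fanned
 * out to every running VM (including 'src' itself), so VMs that lack
 * events of their own still get polled, without any extra interrupt. */
void sebp_cross_vm_share(struct sebp_vm *src)
{
    size_t i;

    (void)src;
    for (i = 0; i < MAX_VMS; i++) {
        if (vms[i] != NULL && vms[i]->running)
            sebp_trigger_poll(vms[i]);
    }
}
```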
Fig. 16. sEBP Performance in WebBench Evaluation (sufficient number of events).
the line, and does not perform any post-processing on the received data. In such an environment, the majority of the available system events are actually the results of the current network traffics, and the event amount decreases dramatically once the packets have been handled. Once the VM enters the idle loop after handling the data, the available system events are significantly reduced, so there may be a long delay before the next effective system event. As sEBP does not have enough events, polling is not triggered in a timely manner. This causes sEBP to display much lower performance than the interrupt mode. We performed tests to highlight this drawback and to check the effectiveness of our two countermeasures to overcome this issue: the introduction of a compensating timer and the use of cross-VM event sharing. Netperf TCP STREAM is chosen and the message size set to 1500B and 54B for the large and small streaming traffic scenarios, respectively. We evaluate the efficiency of the compensating timer and of crossVM event sharing in the case of insufficiently many events. Since cross-VM can only be implemented in sEBP-host, it does not produce any interrupt by notifying all VMs when a VM EXIT event occurs. Cross-VM can reduce the system overhead, but it is only effective when hosting multiple VMs. The compensating timer method can be implemented in either sEBP-host or sEBP-guest, but there is no VM EXIT alike cross-VM event sharing scheme. Therefore, in what follows, we only illustrate the compensating timer efficiency in the sEBP-guest case, and the cross-VM efficiency in the sEBPhost case. 1) Compensating Timer Efficiency For sEBP-guest: As shown in Fig. 17, we evaluate sEBP-guest performance and the efficiency of the compensating timer in the case of an insufficient number of events, using Netperf streaming of 1500B packets. The total bandwidth achieved by sEBP-guest (without compensating timer) is only around 4Gbps, far from the peak ∼9.4Gbps achieved in the interrupt model. This is clearly due to the lacking of events. The experiment shows that the VM may stay in the idle loop for over 100ms before the next system event wakes it up, although there may already be pending packets in the receiving buffer of the VF. So sEBPguest has the lowest bandwidth and CPU utilization. As aforementioned, we armed a timer whose frequency is dynamically adjusted according to the algorithm described
Fig. 17. Efficiency of sEBP-guest-timer in Netperf 1500B Streaming (Lacking Events).
Fig. 18. Efficiency of sEBP-guest-timer in Netperf 54B Streaming (Lacking Events).
Fig. 19. Efficiency of sEBP-host-cv in Netperf 1500B Streaming (Lacking Events).
Fig. 20. Efficiency of sEBP-host-cv in Netperf 54B Streaming (Lacking Events).
in Section III. This timer generates compensating events that keep the event-based polling model progressing, and thus recovers the 9.4Gbps bandwidth peak, as shown by 'sEBP-guest-timer' in Fig. 17. At the same time, the total CPU utilization is largely reduced: as illustrated by Fig. 17, the Baseline uses ∼37% CPU while sEBP-guest-timer occupies only ∼24% (relatively saving ∼36% of the CPU resources). In fact, the timer is only activated on demand and its interrupt cost is lower than that of a physical NIC interrupt, which demonstrates the efficiency of the event-based polling scheme. Note that without the timer, sEBP-guest does not fit the case of lacking events at all. Fig. 18 evaluates sEBP-guest-timer performance in comparison with the Baseline when dealing with streaming traffic of small packets (54B). sEBP-guest-timer achieves almost the same bandwidth as the Baseline when hosting 1VM, 4VM and 8VM. The total bandwidth increases with the number of VMs hosted. This is because the small packet traffic causes much more frequent interrupts, and the VMs can then leverage the multicore resources more efficiently to process them. It is worth noticing that the CPU cost of sEBP-guest-timer for achieving the same bandwidth is much lower than that of the Baseline. For example, when hosting 8VM, sEBP-guest-timer utilizes only ∼19% of the CPU while the Baseline
requires ∼42% CPU (relatively saving ∼54% of the total CPU resources). This demonstrates that event-based polling has a lighter processing cost than the interrupt model. Note that the performance of sEBP-guest without the compensating timer is omitted here and in what follows, since its inefficiency with insufficiently many events has already been shown (much lower CPU utilization and bandwidth, and no scaling with the increasing number of VMs, as shown in Fig. 17). 2) Cross-VM Efficiency for sEBP-host: Another way to cope with an insufficient number of system events is to use the cross-VM event sharing mechanism of sEBP-host. We use Netperf tests to compare the interrupt model with sEBP-host using the cross-VM scheme, which we call sEBP-host-cv. The experiments focus on the streaming traffic of large and small packets, as shown in Fig. 19 and Fig. 20, respectively. Since cross-VM events occur only when there are multiple guest VMs, we evaluate the performance when hosting 4VM, 8VM and 16VM. It should be noticed that cross-VM events cannot occur as regularly as timer interrupts; they behave like jitter in periodic task arrival times. The uneven time distribution of the cross-VM events may lead to a lack of events to trigger polling in time, so the receiving buffer might overflow in this case. In the evaluation of sEBP-host-cv, we increase the receiving buffer size from 32K to 64K in order
to compensate for the uneven cross-VM event occurrence. Fig. 19 illustrates the bandwidth and CPU utilization achieved by sEBP-host-cv in comparison with the Baseline in the Netperf 1500B traffic streaming test when hosting 4VM, 8VM, and 16VM, respectively. Both the Baseline and sEBP-host-cv achieve full network bandwidth (∼9.4Gbps, with the sEBP-host-cv bandwidth being slightly lower than that of the Baseline), while sEBP-host-cv has much lower CPU utilization than the Baseline. For example, in the 8VM case, the Baseline uses 37% of the CPU while sEBP-host-cv has only 24% CPU utilization (relatively saving about 35% of the CPU resources). Fig. 20 shows the performance of sEBP-host-cv tested with the Netperf 54B streaming traffic. The same performance pattern can be observed. sEBP-host-cv achieves a throughput close to that of the Baseline, proving the efficiency of the cross-VM scheme in the case where events are lacking. Also, the CPU utilization of sEBP-host-cv is much lower than that of the Baseline, and the CPU resources saved by sEBP-host-cv are about 49.5%, 55% and 50% for 4VM, 8VM and 16VM, respectively. So, sEBP-host-cv can remedy the insufficiency of events and achieve a bandwidth similar to the interrupt model. sEBP-host-cv displays better scalability in terms of saved CPU resources when serving streams with small packets than with large packets. sEBP has thus been evaluated by streaming both large and small packets. Note that Netperf also provides a Request/Response (RR) model to test performance, but we omit it here since it is similar to the request-rate test of WebBench, a more sophisticated benchmark for web service simulation. Request submission without generating any computation is not very realistic, and the WebBench evaluation is therefore more representative. In summary, the sEBP scheme can effectively deal with a network I/O-intensive workload when a compensating timer and cross-VM events are used to trigger packet polling, overcoming the issue raised by an insufficient number of events. The CPU resources used are significantly reduced by the sEBP scheme while achieving a bandwidth similar to the Baseline. The saved CPU resources can be used to process computing, memory/disk accesses and so on. In particular, this explains why sEBP achieves higher performance when evaluated with Memcached and WebBench. Comparing sEBP-host-cv and sEBP-guest-timer, sEBP-host-cv requires a larger buffer size and is only effective when the number of VMs increases; in such a case it achieves better CPU resource efficiency since one cross-VM event is shared among multiple VMs for polling packets. On the other hand, its implementation only involves minor modifications to the hypervisor, while sEBP-guest-timer requires individually modifying each guest OS. Therefore, both sEBP-host and sEBP-guest have their own suitable and applicable cases.
D. Dynamic $ER_{threshold}$ Efficiency
For the interrupt model, adaptive tuning of the interrupt rate can significantly increase the performance [11] [17] [16]. It has been shown that sEBP with the default setting performs similarly to the Baseline with interrupt rate tuning, see Fig. 15. Moreover, in sEBP, the rate controller can be configured
[Fig. 21 plot: Memcached throughput (requests/s) for ERthreshold values of 8000/s, 6000/s, 4000/s, 2000/s, 1000/s, 500/s, 250/s, and 100/s.]
Fig. 21. Performance of the Memcached server in the 8-VM configuration using different ERthreshold values.
D. Dynamic ERthreshold Efficiency
For the interrupt model, adaptive tuning of the interrupt rate can significantly increase performance [11] [17] [16]. It has been shown that sEBP with the default setting performs similarly to the Baseline with interrupt rate tuning (see Fig. 15). Moreover, in sEBP the rate controller can be configured through the ERthreshold parameter, which also affects the overall network performance. In many workloads that treat throughput rather than latency as the first priority, excessive polling is of little value. Moreover, when the CPU or the memory is the performance bottleneck, excessive polling consumes more CPU resources and can even harm performance. On the other hand, insufficient polling will not drive the system to its full load, so the maximum performance cannot be achieved either. The best performance therefore comes from an appropriate polling rate that is neither too high nor too low.
The optimal ERthreshold value is VM specific and depends on the number of VMs. Both sEBP-guest and sEBP-host can be configured independently for each active VM, and ERthreshold can be adjusted dynamically based on the number of active VMs sharing the same NIC. For example, the threshold for each VM can be halved when the number of VMs doubles, since the bandwidth allocated to each VM is consequently halved. Service level agreements can also be taken into account: a higher polling frequency can be used for VMs with strict latency requirements. Because the overhead of EBP is much lower than that of the interrupt model, we expect the adaptive rate control of sEBP to further enhance system performance.
We study the influence of ERthreshold by running the Memcached tests in an 8-VM configuration (fixed number of VMs) and tuning ERthreshold from 8000 down to 100, as shown in Fig. 21. The results show that an ERthreshold of 2000 achieves the best throughput (657,358 requests/s). However, many factors are involved in setting this value properly, for example the number of VMs and the type of workload; we did not establish a mathematical model for choosing it, due to the complexity of the task. In a practical industrial deployment, the setting can be determined empirically from a series of tests (similar to Fig. 21) in order to achieve satisfactory performance. Note that the differences between the columns in Fig. 21 appear enlarged because the Y-axis starts at 580,000. By tuning ERthreshold as shown in Fig. 21, an ∼8% higher RPS can be achieved over the default setting of 8000. This gain differs in magnitude from what is usually observed with interrupt mitigation, because the event-based polling scheme has already decreased the packet processing cost, which further underscores the efficiency of the event-based polling strategy.
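As a rough illustration of the per-VM policy described above (halving the threshold when the number of VMs doubles, and polling latency-sensitive VMs more often), the sketch below shows one possible retuning routine. The structure, function name, base value, and the extra factor applied to latency-sensitive VMs are hypothetical; the paper states only the halving rule and the SLA consideration, not a concrete formula.

```c
#include <stdio.h>

#define ER_BASE_THRESHOLD 8000   /* default setting used in the paper's tests */

struct vm_cfg {
    int id;
    int latency_sensitive;       /* 1 if the VM's SLA requires low latency */
    int er_threshold;            /* events counted before a poll is forced */
};

/* Scale the per-VM threshold with the number of active VMs sharing the NIC:
 * doubling the VM count halves each VM's bandwidth share, so its threshold
 * is halved as well.  Latency-sensitive VMs get a further reduction (more
 * frequent polling); the factor of 4 is an arbitrary example. */
static void retune_er_threshold(struct vm_cfg *vms, int active_vms)
{
    for (int i = 0; i < active_vms; i++) {
        int t = ER_BASE_THRESHOLD / active_vms;
        if (vms[i].latency_sensitive)
            t /= 4;
        vms[i].er_threshold = t > 1 ? t : 1;
    }
}

int main(void)
{
    struct vm_cfg vms[8] = { { .id = 0, .latency_sensitive = 1 } };
    for (int i = 1; i < 8; i++)
        vms[i].id = i;

    retune_er_threshold(vms, 8);
    for (int i = 0; i < 8; i++)
        printf("VM%d: ERthreshold = %d\n", vms[i].id, vms[i].er_threshold);
    return 0;
}
```

Such a routine would be invoked from the hypervisor (for sEBP-host) or the guest kernel (for sEBP-guest) whenever a VM sharing the NIC is started or stopped, so the polling rate tracks the changing bandwidth share.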
VI. RELATED WORK
Many studies have aimed at reducing the overhead of interrupt virtualization using one or both of the following methods: (1) reducing the cost of handling individual interrupts, or (2) mitigating the interrupt frequency. Instead of following these directions, we set out to eliminate network interrupts entirely. Note that the Intel SR-IOV solution used as the Baseline in this paper already integrates the mainstream interrupt mitigation technologies introduced in Sections II and V.
Dong et al. [11] attributed the high dom0 utilization to the cost of emulating MSI mask/unmask operations; up to 27% of CPU cycles can be saved with the appropriate optimization. ELI [15] introduced an exitless interrupt mechanism that allows a physical interrupt to be injected directly into the VM without the intervention of the VMM. This technique greatly shortens the interrupt handling path, but its effect has not been thoroughly evaluated for multiple VMs. vIC [4] coalesces interrupts for virtual storage devices based on the number of "commands in flight"; unfortunately, it applies only to devices that provide information on commands in flight. Dong et al. [11] dynamically adjusted the interrupt rate on the NIC device based on the buffer overflow condition in the driver. NAPI [23] disables interrupts while polling in order to remove unnecessary interrupts. HIP [12] uses interrupts only under low network load and prefers polling otherwise. In general, algorithms that predict the desired future interrupt rate are complex and not always accurate.
Aron et al. [6] introduced the concept of soft timers: certain states in the execution of a system, including system calls and page faults, are used to drive network polling, with an event handler installed for each state at low cost. A 25% improvement in web server throughput is observed when interrupts are eliminated. Despite the similar concept, sEBP targets consolidated server environments with 10Gbps high performance networks, whereas soft timers were only evaluated on 100Mbps networks. Moreover, Aron et al. only studied the effect on bare metal, while sEBP was implemented and extensively evaluated at both the VM and VMM levels, which underscores the wide application scope of event-based polling.
Ben-Yehuda et al. [9] created a polling thread that runs on a dedicated core and continuously polls for packets pending in the receiving buffer. While this improves network performance for a single-VM environment, it does not scale to high performance network environments with many VMs. SplitX [20] adopts a similar idea by splitting guest and hypervisor execution onto different cores, so that the guest may poll for virtual interrupts without the hypervisor sending an IPI.
VII. CONCLUSION AND FUTURE WORK
Interrupt virtualization remains a key source of overhead in high performance network virtualization. Therefore, to date,
most of the research has focused on reducing the per-interrupt handling overhead or on mitigating the interrupt rate. Here we take a step further by adopting a new approach that completely eliminates interrupts from the critical I/O handling path and replaces them with a smart event-based polling model named sEBP. It collects various system events either at the VMM level (sEBP-host) or at the VM level (sEBP-guest); these system events are then used to fulfill the notification role of interrupts and to drive the polling of the NIC status. Experimental results show that sEBP achieves significant performance improvements in all scenarios, with up to a 59% performance improvement in the WebBench experiments and a 23% better scalability ratio in the Memcached experiments.
Future work will aim to improve the performance of sEBP-host. The main concern with this solution is that sEBP-host uses the virtual interrupt controller in KVM, which incurs a high overhead because it requires APIC emulation. A solution to investigate is an approach similar to the event notification mechanism used in Xen [7]: instead of relying on the virtual interrupt controller in the VM, notifications are delivered through a shared-memory communication interface. As a result, the performance of sEBP-host should be greatly improved.
ACKNOWLEDGMENT
This work was supported by NSFC grants (No. 61272100, 61202374), the 863 Program of China (No. 2012AA010905), the 973 Program of China (No. 2012CB723401), the International Cooperation Program of Shanghai (No. 12510706100), a HuaWei grant, and the Singapore NRF CREATE programme.
REFERENCES
[1] SR-IOV. PCI Special Interest Group, http://www.pcisig.com/home.
[2] WebBench. http://home.tiscali.cz/ cz210552/webbench.html.
[3] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori. KVM: the Linux virtual machine monitor. In Ottawa Linux Symposium (OLS), 2007.
[4] I. Ahmad, A. Gulati, and A. Mashtizadeh. vIC: interrupt coalescing for virtual machine storage device IO. In Proc. 2011 USENIX Annual Technical Conference, USENIXATC'11, pages 4–4, Berkeley, CA, USA, 2011. USENIX Association.
[5] N. Amit, M. Ben-Yehuda, D. Tsafrir, and A. Schuster. vIOMMU: efficient IOMMU emulation. In Proc. 2011 USENIX Annual Technical Conference, USENIXATC'11, pages 6–6, Berkeley, CA, USA, 2011. USENIX Association.
[6] M. Aron and P. Druschel. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst., 18(3):197–228, Aug. 2000.
[7] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proc. 19th ACM Symposium on Operating Systems Principles, SOSP '03, pages 164–177, New York, NY, USA, 2003. ACM.
[8] M. Ben-Yehuda, M. D. Day, Z. Dubitzky, M. Factor, N. Har'El, A. Gordon, A. Liguori, O. Wasserman, and B.-A. Yassour. The Turtles project: design and implementation of nested virtualization. In Proc. 9th USENIX Conference on Operating Systems Design and Implementation, OSDI'10, pages 1–6, Berkeley, CA, USA, 2010. USENIX Association.
[9] M. Ben-Yehuda, M. Factor, E. Rom, A. Traeger, E. Borovik, and B.-A. Yassour. Adding advanced storage controller functionality via low-overhead virtualization. In Proc. 10th USENIX Conference on File and Storage Technologies, FAST'12, pages 15–15, Berkeley, CA, USA, 2012. USENIX Association.
[10] Y. Dong, D. Xu, Y. Zhang, and G. Liao.
Optimizing network I/O virtualization with efficient interrupt coalescing and virtual receive side scaling. In 2011 IEEE International Conference on Cluster Computing (CLUSTER), pages 26–34, 2011.
[11] Y. Dong, X. Yang, X. Li, J. Li, K. Tian, and H. Guan. High performance network virtualization with SR-IOV. In 2010 IEEE 16th International Symposium on High Performance Computer Architecture (HPCA), pages 1–10, 2010.
[12] C. Dovrolis, B. Thayer, and P. Ramanathan. HIP: hybrid interrupt-polling for the network interface. SIGOPS Oper. Syst. Rev., 35(4):50–60, Oct. 2001.
[13] B. Fitzpatrick. Distributed caching with memcached. Linux Journal, no. 124, 2004.
[14] K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. Warfield, and M. Williamson. Safe hardware access with the Xen virtual machine monitor. In 1st Workshop on Operating System and Architectural Support for the on demand IT InfraStructure (OASIS), 2004.
[15] A. Gordon, N. Amit, N. Har'El, M. Ben-Yehuda, A. Landau, A. Schuster, and D. Tsafrir. ELI: bare-metal performance for I/O virtualization. In Proc. 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 411–422, New York, NY, USA, 2012. ACM.
[16] H. Guan, Y. Dong, R. Ma, D. Xu, Y. Zhang, and J. Li. Performance enhancement for network I/O virtualization with efficient interrupt coalescing and virtual receive-side scaling. IEEE Trans. Parallel Distrib. Syst., 24(6):1118–1128, 2013.
[17] Z. Huang, R. Ma, J. Li, Z. Chang, and H. Guan. Adaptive and scalable optimizations for high performance SR-IOV. In 2012 IEEE International Conference on Cluster Computing (CLUSTER), pages 459–467, 2012.
[18] Intel. 82599 10 Gigabit Ethernet Controller. http://www.intel.com/content/www/us/en/ethernet-controllers/82599-10gbe-controller-brief.html.
[19] R. A. Jones. A network performance benchmark. Tech. Rep. Revision 2.0, Hewlett-Packard, 1995.
[20] A. Landau, M. Ben-Yehuda, and A. Gordon. SplitX: split guest/hypervisor execution on multi-core. In Proc. 3rd Conference on I/O Virtualization, WIOV'11, pages 1–1, Berkeley, CA, USA, 2011. USENIX Association.
[21] J. Liu. Evaluating standard-based self-virtualizing devices: a performance study on 10 GbE NICs with SR-IOV support. In 2010 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pages 1–12, 2010.
[22] A. Menon, A. L. Cox, and W. Zwaenepoel. Optimizing network virtualization in Xen. In Proc. USENIX '06 Annual Technical Conference, ATEC '06, pages 2–2, Berkeley, CA, USA, 2006. USENIX Association.
[23] J. Salim. When NAPI comes to town. Linux Conf., 2005.
[24] R. Uhlig, G. Neiger, D. Rodgers, A. L. Santoni, F. C. M. Martins, A. V. Anderson, S. M. Bennett, A. Kagi, F. H. Leung, and L. Smith. Intel virtualization technology. Computer, 38(5):48–56, May 2005.
[25] P. Willmann, J. Shafer, D. Carr, A. Menon, S. Rixner, A. L. Cox, and W. Zwaenepoel. Concurrent direct network access for virtual machine monitors. In 2007 IEEE 13th International Symposium on High Performance Computer Architecture (HPCA), pages 306–317, 2007.
[26] Y. Zhai, M. Liu, J. Zhai, X. Ma, and W. Chen. Cloud versus in-house cluster: evaluating Amazon Cluster Compute instances for running MPI applications. In State of the Practice Reports, SC '11, pages 11:1–11:10, New York, NY, USA, 2011. ACM.
[27] M. Zhuang and B. Aker. memslap: load testing and benchmarking tool for memcached. http://docs.tangent.org/libmemcached/memslap.html.
HaiBing Guan received the Ph.D. degree in computer science from Tongji University, China. He is currently a professor with the Faculty of Computer Science, Shanghai Jiao Tong University, Shanghai, China. He is a member of the IEEE and ACM. His current research interests include, but are not limited to, computer architecture, compiling, virtualization, and hardware/software co-design.
YaoZu Dong is a software architect at the Open Source Technology Center of Intel Corporation, where he works on Linux virtualization, including KVM and Xen, covering the IA-32, Intel(R) 64, and Itanium(R) architectures. He is currently a Ph.D. candidate at Shanghai Jiao Tong University under the supervision of Professor H.B. Guan. He is an active participant in both industry and academic events and a frequent presenter at Xen, KVM, and academic conferences. Earlier in his 10+ years at Intel, he worked on the Linux kernel debugger and other OS enabling work for the XScale architecture.
Kun Tian received his M.E. degree in communication systems from the University of Electronic Science and Technology of China. He is now with the Intel Open Source Technology Center, where his research focuses on system architecture, in particular system virtualization and cloud computing.
Jian Li is a member of the ACM/IEEE. He is an Assistant Professor in the School of Software at Shanghai Jiao Tong University. Dr. Li obtained his Ph.D. in Computer Science from the Institut National Polytechnique de Lorraine (INPL), Nancy, France, in 2007. He received his M.S. degree in Telecommunication and Computer Science in 2003 from the University Henri Poincaré (France), and his B.S. degree in Electronics and Information Technology in 2001 from Tianjin University (China). He worked as a postdoctoral researcher at the University of Toronto and as an associate researcher at McGill University in 2007 and 2008, respectively. His research interests include embedded systems and virtualization, cyber-physical systems, real-time scheduling theory, network protocol design, and quality of service.