Full and Para-Virtualization with Xen: A Performance Comparison

Journal of Emerging Trends in Computing and Information Sciences, Vol. 4, No. 9, September 2013 (ISSN 2079-8407), http://www.cisjournal.org

1 Hasan Fayyad-Kazan, 2 Luc Perneel, 3 Martin Timmerman

1, 2 PhD Candidate, Department of Electronics and Informatics, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussel, Belgium
3 Professor, Department of Electronics and Informatics, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussel, Belgium

ABSTRACT
Xen is one of the most popular open-source virtualization solutions. It supports the two leading virtualization approaches, Full-Virtualization (FV) and Para-Virtualization (PV). A search of the public resources comparing the performance of these two approaches indicates that PV performs best, but most of the answers are theoretical, based on how each approach is implemented. For instance, the Xen community states the following: "PV delivers higher performance than full virtualization because the operating system and hypervisor work together more efficiently, without the overhead imposed by the emulation of the system's resources." This paper presents an experimental study, based on tests and scenarios, that quantitatively compares PV and FV. The quantitative results indeed show the predicted performance gap.

Keywords: Xen, Full-Virtualization, Para-Virtualization

1. INTRODUCTION
Virtualization is a powerful and convenient technique that has been widely adopted in recent years. It refers to the creation of a Virtual Machine (VM), called a domain, which acts as a real computer with an operating system (OS) [1]. It also allows the underlying physical machine resources to be shared among different VMs. The software layer providing the virtualization is called the Virtual Machine Monitor (VMM) or hypervisor. The hypervisor is inserted between the hardware and the VMs. It enables concurrent execution of multiple VMs, isolates them, and schedules them among the available resources [1]. The hypervisor can run either directly on the hardware (called bare-metal, or Type-1 virtualization) or on top of a host operating system (called hosted, or Type-2 virtualization) [2]. Since it has direct access to the hardware resources rather than going through an operating system, a native hypervisor is more efficient than a hosted architecture and delivers greater scalability, robustness and performance [3].

There are several ways to implement virtualization. Two leading approaches are Full-Virtualization (FV) and Para-Virtualization (PV) [4]. One of the most popular open-source virtualization solutions that can host both is Xen. The goal of this paper is to provide a quantitative performance comparison between the Full-Virtualization and Para-Virtualization approaches hosted by Xen.

This paper is organized as follows: Section 2 describes the Xen architecture together with the two virtualization approaches; Section 3 presents the experimental setup used for our evaluation; Section 4 explains the test metrics, scenarios and obtained results; and finally a conclusion is given.

2. XEN: A BRIEF BACKGROUND
Xen [5] is a well-known virtualization solution, originally developed at the University of Cambridge. It is the only bare-metal solution that is available as open source, and it is used as the basis for a number of different commercial and open-source applications [5]. It consists of several components that work together to deliver the virtualization environment, including the Xen hypervisor, the Domain 0 guest (referred to as Dom0), and the Domain U guests (referred to as DomU), which can be either PV guests or FV guests.

The Xen hypervisor is the software layer that sits directly on the hardware, below any operating system. It is responsible for CPU scheduling and memory partitioning of the various VMs [6]. It delegates the management of guest domains (DomU) to the privileged domain (Dom0) [7]. This structure allows it to remain a thin code layer rather than becoming a large and complex piece of machinery like a full kernel.

Dom0 is a privileged PV domain (a modified Linux kernel) that has the native device drivers to assist other domains in performing real I/O operations, and it interacts with the other virtual machines running on the system [8]. DomU guests have no direct access to the physical hardware and are therefore often referred to as unprivileged. DomU PV guests are modified operating systems such as Linux, Solaris, FreeBSD, and other UNIX operating systems. DomU FV guests do not require any modifications and run standard Windows or any other unchanged operating system [6].



Xen hosts both FV and PV VMs. Full-Virtualization [9] is designed to provide a complete simulation of the underlying physical system and creates a complete virtual system in which the guest operating systems can execute. No modification is required in the guest OS or its applications [4]. This approach can be advantageous because it enables complete decoupling of the software from the hardware. However, FV may incur a performance penalty, as the VMM must provide the VM with an image of an entire system, including virtual BIOS, virtual memory space, and virtual devices [4].

In contrast, Para-Virtualization [10] increases VM performance by reducing the proportion of emulated hardware resources relative to FV. Each VM is presented with an abstraction of the hardware that is similar, but not identical, to the underlying physical hardware. It also requires modifications to the guest operating system running in the VM. As a result, the guest operating systems are aware that they are executing on a VM, allowing for near-native performance [4].

The two approaches differ in the way they deal with sensitive instructions, which consist of privileged and non-privileged instructions [11]. In Full-Virtualization, the VMM traps sensitive instructions issued by the guest OS and emulates their functions. This trap-and-emulate action can cost hundreds to thousands of cycles [12]. Para-Virtualization, on the other hand, reduces this overhead by changing the source code of the guest OS: it replaces the sensitive instructions with hypercalls to the VMM, so the VMM can take over the sensitive operations on its own initiative. Para-Virtualization further provides optimizations such as combining several hypercalls into one to reduce the guest OS/VMM transition cost [12].

After this theoretical comparison between the two approaches, the following sections provide an experimental comparison aiming at quantifying the performance gap.

3. EXPERIMENTAL SETUP
Xen 4.2.1, the latest version at the time of the evaluation, is used here. Dom0 runs OpenSuse version 12.3. Linux PREEMPT-RT [13] is the guest OS running in both the FV and PV VMs; it was selected mainly because it is open source and can be configured for use in a PV VM. A PV VM and an FV VM are created, and the tests are performed in each one separately. The tested PV VM is referred to as the Under Test Para-Virtualized Machine (UTPVM), while the tested FV VM is called the Under Test Fully-Virtualized Machine (UTFVM). Each VM has one Virtual CPU (VCPU).

The hardware platform used for conducting the tests has the following characteristics: an Intel Desktop Board DH77KC with an Intel Xeon Processor E3-1220 v2, which has 4 cores each running at a frequency of 3.1 GHz and no hyper-threading support. The cache memory sizes are as follows: each core has 32 KB of L1 data cache, 32 KB of L1 instruction cache and 256 KB of L2 cache; the 8 MB L3 cache is accessible by all cores. The system memory is 8 GB.

4. TESTING PROCEDURES AND RESULTS
4.1 Measuring Process
The Time Stamp Counter (TSC) is used for obtaining (tracing) the measurement values. It is a 64-bit register present on all x86 processors since the Pentium. The RDTSC instruction returns the TSC value. This counting register provides an excellent high-resolution, low-overhead way of getting CPU timing information and runs at a constant rate. A minimal example of reading it is sketched below.
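As an illustration only (the paper does not publish its measurement code), a minimal C sketch of reading the TSC on GCC/Clang might look as follows; the helper name rdtsc_now() is our own.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* provides __rdtsc() on GCC/Clang for x86 */

/* Read the 64-bit Time Stamp Counter. On CPUs with a constant, invariant
 * TSC this ticks at a fixed rate regardless of frequency scaling. */
static inline uint64_t rdtsc_now(void)
{
    return __rdtsc();
}

int main(void)
{
    uint64_t start = rdtsc_now();
    /* ... code to be timed ... */
    uint64_t end = rdtsc_now();

    printf("elapsed: %llu TSC ticks\n", (unsigned long long)(end - start));
    return 0;
}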

4.2 Testing Scenarios
Before describing the tests and the results, we first explain the evaluation scenarios. Four scenarios are used:

4.2.1 Scenario 1: Affinity + CPU-Load
As shown in figure 1, this scenario has four VMs. Each one has a single VCPU that is pinned to one physical CPU (PCPU), i.e. one core. These VMs are: Dom-0, the UTVM, and two other VMs performing CPU stress tests on their corresponding PCPUs. The latter two VMs are called CPU-Load VMs. The CPU-stress test is an infinite loop of mathematical calculations. Dom-0 is in the idle state.

Fig 1: Affinity + CPU-Load scenario


4.2.2 Scenario 2: Affinity + Memory-Load
This scenario is exactly the same as scenario 1, except that the two VMs run Memory-Load tests instead of CPU-Load tests. The Memory-Load test is a program running an infinite loop of the memcpy() function, which copies 9 MB (a value larger than all the caches together) from one object to another. A minimal sketch of such a load generator is shown below.
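The following C sketch is our own illustration of the Memory-Load program described above (an infinite memcpy() loop over 9 MB buffers); the paper does not publish the exact code, so the buffer-handling details are assumptions.

#include <stdlib.h>
#include <string.h>

#define LOAD_SIZE (9 * 1024 * 1024)   /* 9 MB, larger than all cache levels */

int main(void)
{
    char *src = malloc(LOAD_SIZE);
    char *dst = malloc(LOAD_SIZE);
    if (src == NULL || dst == NULL)
        return 1;

    memset(src, 0xA5, LOAD_SIZE);     /* touch the source once so it is mapped */

    /* Infinite copy loop: constantly streams 9 MB through the memory
     * hierarchy, forcing traffic onto the shared system bus. */
    for (;;)
        memcpy(dst, src, LOAD_SIZE);

    return 0;                          /* never reached */
}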

4.2.3 Scenario 3: Contention + CPU-Load
As shown in figure 2 below, this scenario has three VMs: Dom-0, the UTVM, and a CPU-Load VM. The latter two run on the same core.

Fig 2: Contention + CPU-Load scenario

4.2.4 Scenario 4: Contention + Memory-Load
This scenario is exactly the same as scenario 3, except that a Memory-Load VM is used instead of the CPU-Load VM.

The aim of the "Affinity" scenarios is to detect the pure hypervisor overhead (as there is no contention). The "CPU-Load" and "Memory-Load" tests aim to detect the impact of system-bus contention between the VMs. The "Contention" scenarios aim to explore the scheduling mechanism of the hypervisor between competing VMs.

4.3 Testing Metrics
After describing the scenarios, an explanation of the tests and their corresponding results is provided now. Note that all the test metrics described below are first run on a non-virtualized machine (bare-machine) as a reference.

4.3.1 Clock Tick Processing Duration
This test examines the clock tick processing duration in the kernel. The results are extremely important, as the clock interrupt, being a high-level interrupt on the hardware platform used, will bias all other measurements. Using a tickless kernel would not prevent this from happening; it would only lower the number of occurrences. The kernel is not using the tickless timer option.

Test method: a real-time thread with the highest priority is created. This thread executes a finite loop of the following tasks: get the time using the RDTSC instruction, run a "busy loop" that does some calculations, and get the time again using the same instruction. Having the time before and after the "busy loop" gives the time it needs to finish its job. When this test runs on the bare-machine, the "busy loop" can only be delayed by interrupt handlers. As we remove all other interrupt sources, only the clock tick timer interrupt can delay the "busy loop". When the "busy loop" is interrupted, its execution time increases. Running the same test in a VM also shows when the VM is scheduled away by the VMM, which in turn increases the measured latency. A sketch of this measurement loop is given below.

Figure 3 presents the results of this test on the bare-machine, followed by an explanation. The X-axis indicates the time at which a measurement sample is taken, with reference to the start of the test. The Y-axis indicates the duration of the measured event; in this case the total duration of the "busy loop".
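A minimal C sketch of this measurement loop, reusing the rdtsc_now() helper sketched in Section 4.1, might look as follows; the sample count, the amount of work in the busy loop, and the output format are our own assumptions, and elevating the thread to the highest real-time priority (as the paper does) is omitted for brevity.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static inline uint64_t rdtsc_now(void) { return __rdtsc(); }

/* The "busy loop": a fixed amount of calculation whose duration only grows
 * when an interrupt handler or the hypervisor steals the CPU. */
static volatile double sink;
static void busy_loop(void)
{
    double x = 1.0;
    for (int i = 0; i < 200000; i++)
        x = x * 1.000001 + 0.3;
    sink = x;
}

int main(void)
{
    enum { SAMPLES = 128000 };
    static uint64_t duration[SAMPLES];   /* store results; printing inside the
                                            loop would distort the timing */

    for (long i = 0; i < SAMPLES; i++) {
        uint64_t t0 = rdtsc_now();
        busy_loop();
        duration[i] = rdtsc_now() - t0;  /* spikes above the baseline reveal
                                            clock ticks or scheduling delays */
    }
    for (long i = 0; i < SAMPLES; i++)
        printf("%ld %llu\n", i, (unsigned long long)duration[i]);
    return 0;
}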

Fig 3: Clock tick processing duration of the bare-machine (zoomed). The annotation in the figure reads: clock tick processing duration = 76 - 68 = 8 µs.

The bottom values (68 µs) in figure 3 are the "busy loop" execution durations when no clock tick occurs. When a clock tick interrupts the loop, its execution is delayed until the clock interrupt has been handled, giving 76 µs (top values). The difference between the two values is the time spent handling the tick (executing the handler), which is 8 µs. Note that the kernel clock is configured to run at 1000 Hz, which corresponds to a tick every 1 ms. This is clearly visible in Figure 3, which is a zoomed-in version of Figure 4 below.


Fig 4: Clock tick processing duration of the bare-machine

Figure 4 shows the test results for 128,000 captured samples, taken over a time frame of 9 seconds. For scaling reasons, the samples form a line. As shown in Figure 4, the "busy loop" execution time reaches 78 µs at some points. Therefore, a clock tick delays any task by 8 to 10 µs. The test was executed on the bare-machine for a few minutes, and the maximum measurement obtained was 78 µs; a worst-case overhead of 10 µs therefore occurred in the system during this testing period. Table 1 presents the results of executing the same test in the UTPVM and UTFVM for the four scenarios.

Table 1: Comparison between the machines' overheads

The values between brackets show the percentage increase with reference to the bare-machine. This percentage is calculated using the formula:

Percentage Increase = (Relative Difference / Reference Number) * 100

In scenarios 1 and 2, the same number of VMs is running, but with different tasks. Compared to scenario 1, the overhead in scenario 2 is increased by 310 percentage points (560 - 250) for the UTFVM and by 270 percentage points for the UTPVM. The explanation for this large increase between the two scenarios is given in the next section (System bus bottleneck in SMP systems).

Note that, in the contention cases (scenarios 3 and 4), the Xen CREDIT [14] scheduler assigns each VM a quantum of 30 ms to run. Therefore, the UTVM is scheduled away every 30 ms for a period of 30 ms, which explains the increase of 30 ms in the results. Decreasing this quantum to 1 ms would in turn decrease this high overhead to a value that depends on the number of VMs running on the same core.

This test is very useful, as it detects all the delays that may occur in a system at runtime. Therefore, we execute this test for a long duration (1 hour) and provide a statistical overview of the captured measurements. Figure 5 presents the statistical distribution of the samples obtained during the 1-hour test on the bare-machine.

Before looking at the figure, we explain how the statistical samples are obtained. The measured delay values are counted in binary-based bins. This can be done with very little overhead, because an assembler instruction exists to find the highest bit set in the measured delay. The highest bit set is used for the first-level bin selection, while the next couple of lower bits are used for the second-level bin selection. This makes it possible to collect a huge number of samples without significant overhead caused by the measurement system itself. Note that the bin distribution obtained with this method is logarithmic. A sketch of such a binning scheme is given below.
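As an illustration of this two-level logarithmic binning (the paper does not list its code, so the number of second-level bits and the table layout are our assumptions), a C sketch using the GCC/Clang builtin for finding the highest set bit could look like this:

#include <stdint.h>

#define SUB_BITS 2                      /* two extra bits -> 4 sub-bins per power of two */
#define TOP_BINS 64                     /* one group per possible highest set bit */

static uint64_t bins[TOP_BINS][1 << SUB_BITS];

/* Place one measured delay (in TSC ticks or µs) into its logarithmic bin.
 * The first-level index is the position of the highest set bit (found in a
 * single instruction via __builtin_clzll); the second-level index is taken
 * from the next SUB_BITS bits just below it. */
static void bin_sample(uint64_t delay)
{
    if (delay == 0) {
        bins[0][0]++;
        return;
    }
    int top = 63 - __builtin_clzll(delay);          /* highest set bit */
    unsigned sub = 0;
    if (top >= SUB_BITS)
        sub = (unsigned)((delay >> (top - SUB_BITS)) & ((1u << SUB_BITS) - 1));
    bins[top][sub]++;
}

bin_sample() would be called once per measured busy-loop duration; the resulting logarithmic bin boundaries correspond to the intervals shown on the X-axis of Figures 5 to 7.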

Fig 5: Distribution of samples during the 1-hour test

During the 1-hour test, 60 million samples are captured. The X-axis represents the delay values of the binary bins, while the Y-axis is a logarithmic representation of the number of obtained samples and their corresponding percentage.

Figure 5 shows that 93 % of the samples (55,873,426) fall between 66 µs and 69 µs. This is logical, as the "busy loop" execution time (68 µs) falls in this region.


The purpose of figure 4 is to show the exact tracing values and the moments of their occurrence, while figure 5 shows their distribution. Any sample above 68 µs is considered a delay. We see that 6.8 % of the samples are between 74 µs and 77 µs (which means a delay between 6 and 9 µs), while 0.002 % are between 77 µs and 79.5 µs (1110 samples were captured in this interval; the maximum value captured in this region is 77.98 µs, which corresponds to a delay of 9.98 µs). Therefore, the maximum overhead that occurs in the system is 9.98 µs.

The same statistical test is performed on the UTFVM and UTPVM. Figures 6 and 7 present the results of this test for scenario 1 only; the same explanation as for Figure 5 applies here.

Fig 6: Sample distribution for the UTFVM

Figure 6 shows that an overhead between 32 and 37 µs (the 100->105 µs interval) occurs in the system; the maximum value obtained in this interval is 35.01 µs. Note that most of the delays that occur in the system are between 8 and 10.5 µs (the 77->79.5 µs interval).

Fig 7: Sample distribution for the UTPVM

The maximum value captured in our measurements is 90.7 µs, which is the only sample in the region 90->95 µs. This means that the maximum overhead that occurs in our system is 21.7 µs. The largest share of the delays falls in the region 69->71.5 µs.


These two figures show that tasks in the UTPVM are affected by smaller delays than those in the UTFVM. This is due to the changes made to the kernel in the UTPVM.

4.3.2 System Bus Bottleneck in SMP Systems
Our test hardware platform is a Symmetric Multiprocessing (SMP) system with four identical processor cores connected to a single shared main memory over a system bus. They have full access to all I/O devices and are treated equally.

The system bus can be used by only one core at a time. If two cores execute tasks that need the system bus at the same time, one of them uses the bus while the other is blocked for some time. As the processor used has 4 cores, system bus contention occurs when all of them are running at the same time. Moreover, the system bus has a maximum bandwidth, which may cause collisions in the case of intensive data transfers from and to memory.

Scenario 1 does not cause high overheads because the CPU stress program is quite small and fits in the core cache together with its data. Therefore, the two CPU-Load VMs do not load the system bus intensively, and consequently do not greatly affect the UTVM.

Referring back to scenario 2, the two Memory-Load VMs use the system bus intensively. The UTVM is also running and requires the system bus from time to time, while Dom0 is idle. Therefore, the system bus is shared most of the time between three VMs (the UTVM and the two Memory-Load VMs), which causes extra contention. Thus, the more cores in the system access the system bus simultaneously, the more contention occurs and the higher the overhead becomes.

To show this effect explicitly, we created two additional scenarios, which are sub-scenarios of scenario 2. Three Memory-Load scenarios are therefore used: scenario 2.1 has one Memory-Load VM (Dom0), scenario 2.2 has two Memory-Load VMs, and scenario 2.3 has three Memory-Load VMs. Together with these two newly added scenarios, we have six scenarios in total instead of the four mentioned before. The figures below show the two newly added scenarios.

Fig 8: Scenario 2.1 with Dom0 doing the Memory-Load test

Fig 9: Scenario 2.3 with three Memory-Load VMs

Table 2 shows the "clock tick processing duration" test results, including those of the two newly added scenarios.

Table 2: Clock tick processing duration in all scenarios

Looking at scenarios 2.1, 2.2 and 2.3, we see that each extra bus-loading VM increases the system delays (overheads). The sketch below illustrates, on a bare multi-core machine, how such concurrent memory streams can be generated for a contention experiment.
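This is our own bare-metal analogue of the Memory-Load scenarios (the paper loads the bus from separate VMs pinned to PCPUs, not from threads); the thread count, core numbering and buffer size are assumptions. Compile with -pthread.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LOAD_SIZE (9 * 1024 * 1024)

/* Each loader thread is pinned to its own core and streams 9 MB copies
 * forever, mimicking one Memory-Load VM pinned to one PCPU. */
static void *memory_loader(void *arg)
{
    int core = *(int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    char *src = malloc(LOAD_SIZE), *dst = malloc(LOAD_SIZE);
    if (!src || !dst)
        return NULL;
    memset(src, 0x5A, LOAD_SIZE);
    for (;;)
        memcpy(dst, src, LOAD_SIZE);
    return NULL;
}

int main(int argc, char **argv)
{
    int loaders = (argc > 1) ? atoi(argv[1]) : 2;   /* number of loading cores */
    pthread_t tid;
    static int cores[16];

    for (int i = 0; i < loaders && i < 16; i++) {
        cores[i] = i + 1;               /* leave core 0 free for the measurement task */
        pthread_create(&tid, NULL, memory_loader, &cores[i]);
    }
    printf("started %d memory loaders; run the busy-loop test on core 0 now\n", loaders);
    pause();                            /* keep the loaders running */
    return 0;
}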


4.3.3 "Maximum Sustained Interrupt Frequency" Test
This test detects when an interrupt can no longer be handled due to interrupt overload. In other words, it reveals a system limit that depends on, for example, how long interrupts are masked, how long higher-priority interrupts (the clock tick or others) take, and how well the interrupt handling is designed. The test gives a very optimistic worst-case value: because of the high interrupt rate, the number of spare CPU cycles between interrupts is limited or nil and, depending on its length, the interrupt handler will mostly be present in the caches. In a real-world environment the worst-case duration will be longer.

In this test, 1 million interrupts are generated at a specific interval rate. Our test measures whether the system under test misses any of the generated interrupts. The test is repeated with smaller and smaller intervals until the system under test is no longer capable of handling the interrupt load. To perform this test, an external PCI device is connected to the hardware platform as the source of interrupts. PCI pass-through is used to give control of the attached PCI device to the UTVM, and the driver handling the generated interrupts is added explicitly to the VM, as it is the VM that receives the interrupts. Note that PCI pass-through for the UTFVM requires Intel VT-d (Intel Virtualization Technology for Directed I/O), whereas for the UTPVM it works without hardware support. Table 3 shows the results of this test.

Table 3: Interrupt test for the VMs

Looking at scenario 1 in Table 3, we notice a difference of 10 % between FV and PV, which is caused by the emulation layer.

4.3.4 Thread Switch Latency Between Threads of the Same Priority
This test measures the time needed to switch between threads having the same priority. Although real-time threads should normally be placed on different priority levels in order to apply rate-monotonic scheduling theory, this test is executed with threads on the same priority level so that the thread-switch latency can be measured without interference from anything else. For this test, threads must voluntarily yield the processor to other threads, so the SCHED_FIFO scheduling policy is used.

Test method: a "creating" thread starts creating 1000 threads with the same priority level, which is higher than its own. Whenever a thread is created, it immediately lowers its priority below that of the creating thread, so that the creating thread can continue creating all the desired threads. Once all the threads are created, the creating thread lowers its priority below that of the created threads. The first thread in the queue then starts executing, does its job, and yields the processor to the next thread (which does the same). The job of each created thread is to read the timer counter value at the beginning of its execution, do some calculations, and read the timer counter value again at the end of its execution. The difference between the ending counter value of the previous thread and the starting value of the next thread is the switch latency. A sketch of this procedure is given below.
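The following C sketch is a simplified version of this procedure and is our own code, not the authors': instead of the priority-raising and lowering dance, a barrier releases all equal-priority SCHED_FIFO threads, which are pinned to a single core so that they run strictly one after another. The thread count, per-thread work and priority value are assumptions; SCHED_FIFO creation usually requires root privileges.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define NTHREADS 1000

static uint64_t t_start[NTHREADS], t_end[NTHREADS];
static pthread_barrier_t barrier;
static atomic_int seq;

/* All workers share one SCHED_FIFO priority and one core, so they execute
 * strictly one after another; sched_yield() passes the CPU to the next one. */
static void *worker(void *unused)
{
    (void)unused;
    pthread_barrier_wait(&barrier);          /* start only when all threads exist */

    int slot = atomic_fetch_add(&seq, 1);    /* execution order on the single core */
    t_start[slot] = __rdtsc();
    volatile double x = 1.0;
    for (int i = 0; i < 10000; i++)          /* "do some calculations" */
        x *= 1.000001;
    t_end[slot] = __rdtsc();

    sched_yield();                           /* voluntarily hand the CPU over */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 10 };
    cpu_set_t one_core;

    CPU_ZERO(&one_core);
    CPU_SET(0, &one_core);                   /* serialize all workers on core 0 */

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);
    pthread_attr_setaffinity_np(&attr, sizeof(one_core), &one_core);

    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++) {
        if (pthread_create(&tid[i], &attr, worker, NULL) != 0) {
            fprintf(stderr, "pthread_create failed (SCHED_FIFO usually needs root)\n");
            return 1;
        }
    }
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    for (int i = 1; i < NTHREADS; i++)       /* gap between consecutive threads */
        printf("switch %d: %llu ticks\n", i,
               (unsigned long long)(t_start[i] - t_end[i - 1]));
    return 0;
}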

Table 4 shows the results of this test.

Table 4: Thread switch latency comparison

The values in this table present the maximum switch latency, which also depends on the clock tick processing duration. This explains the large difference between the FV and PV latencies.

4.3.5 Xen Power Management
By default, Xen power management is enabled. To clarify the impact of this option on performance, the "clock tick processing duration" test is run on the UTFVM in scenario 1 with power management left enabled. Figure 10 shows the results of this test.


Fig 10: Impact of power management on the performance

As shown, the test requires more execution time at the beginning of the run. This happens because the processor initially runs at a reduced clock speed and then ramps up. According to our measurements, it takes 25 ms before the processor executes the task at its maximum speed.

When power management is disabled, the maximum measurement captured in the UTFVM in scenario 1 is 103 µs (68 + 35), whereas it is 179 µs with power management enabled. This means that Xen power management can increase the overhead in our system by 74 %.

Note that all the other tests in this paper are performed with Xen power management disabled; this section only demonstrates its impact.

5. CONCLUSION
Xen is one of the most popular open-source virtualization solutions, and it supports both virtualization approaches: Full-Virtualization (FV) and Para-Virtualization (PV). The public resources state, on theoretical grounds, that PV performs better than FV. In this paper, an experimental evaluation is carried out to confirm this. The results show that Xen's default setting, with the power management option enabled, increases the worst-case latency of both approaches by 74 % on our system. Intensive use of the system bus also strongly affects both approaches: a heavy system bus load can increase latencies by up to a factor of 3 compared with a system with only one active core, and the impact of the system bus load is larger on FV than on PV. Disabling Xen power management and limiting the system bus usage improve the performance of both. Our tests show that the overhead of FV is at least 35 % larger than that of PV, and this value can increase depending on the test case performed.

REFERENCES

[1] Min Lee, A. S. Krishnakumar, P. Krishnan, Navjot Singh and Shalini Yajnik, "Supporting Soft Real-Time Tasks in the Xen Hypervisor", Proceedings of the 6th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '10), 2010.

[2] Zonghua Gu and Qingling Zhao, "A State-of-the-Art Survey on Real-Time Issues in Embedded Systems Virtualization", Journal of Software Engineering and Applications, 2012.

[3] VMware, "Understanding Full Virtualization, Paravirtualization and Hardware Assist", 2007. [Online]. Available: http://www.vmware.com/files/pdf/VMware_paravirtualization.pdf

[4] T. Abels, P. Dhawan and B. Chandrasekaran, "An Overview of Xen Virtualization". [Online]. Available: http://www.dell.com/downloads/global/power/ps3q05-20050191-abels.pdf

[5] Linux Foundation, "The Xen Project, the powerful open source industry standard for virtualization". [Online]. Available: http://www.xenproject.org/


[6] Linux Foundation, "How Xen Works". [Online]. Available: http://www-archive.xenproject.org/files/Marketing/HowDoesXenWork.pdf

[7] Hwanju K., Heeseung J. and Joonwon L., "XHive: Efficient Cooperative Caching for Virtual Machines", IEEE Transactions on Computers, Vol. 60, No. 1, 2011.

[8] Qu Xin and Chen Hao, "The research of inter-domain communication optimization under Xen hardware-assisted virtualization", International Conference on Business Computing and Global Informatization, 2011.

[9] Devine, S., Bugnion, E. and Rosenblum, M., "Virtualization system including a virtual machine monitor for a computer with a segmented architecture", US Patent 6,397,242, May 28, 2002.

[10] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I. and Warfield, A., "Xen and the Art of Virtualization", in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03), ACM, 2003, p. 177.

[11] Smith, J. E. and Nair, R., Virtual Machines: Versatile Platforms for Systems and Processes, San Francisco: Morgan Kaufmann Publishers, 2005.

[12] XiaoLin Wang, YiFeng Sun, YingWei Luo, ZhenLin Wang, Yu Li, BinBin Zhang, HaoGang Chen and XiaoMing Li, "Dynamic memory paravirtualization transparent to guest OS", Science in China Series F: Information Sciences, Vol. 53, No. 1, 2010, pp. 77-8.

[13] Linux Foundation, Real-Time Linux Wiki. [Online]. Available: https://rt.wiki.kernel.org/index.php/Main_Page

[14] Xen, "Credit Scheduler". [Online]. Available: http://wiki.xen.org/wiki/Credit_Scheduler
