Study of Virtual Machine Performance over Network File Systems
Roxana Geambasu, John P. John
June 9, 2006
Abstract

Important benefits of virtual machines have led to extensive research on using virtual machines over networks to provide services such as seamless computing. One way to access VMs remotely is to store the VM image on a network file system. In this context, knowing how VMs perform on network file systems is extremely relevant. We present a study of VM performance on network file systems under various network conditions. Our goal is to find the key characteristics and tuning of a network file system that favor remote VM performance. We experiment with three network filers (NFS, AFS, Coda). NFS and AFS each turn out to have desirable properties for holding VM images. NFS is widely available and behaves well when storing virtual images on low-delay networks, irrespective of the bandwidth. AFS caches aggressively and uses large block sizes, which make it desirable when the network has high latency. Finally, we sketch a possible VM-tuned NFS-like file system that combines the best of these file systems and that we believe would improve VM performance.

1 Introduction

Important benefits of virtual machines, such as platform independence and clean encapsulation of hardware and software state, have lately led to extensive research on VM-based migration and on how VMs can be used to provide seamless computing. Internet Suspend and Resume[15] proposes a way to implement seamless computing based on virtual machine suspend, transport, and resume. VMotion[17] introduces a mechanism to migrate live virtual machines in a totally transparent way and with no downtime. The Collective[19] envisions a network of hosts of virtual appliances in an attempt to gain global, uniform access to data and "hassle-free computing". Recently, there have been proposals to use virtual machines in grid computing[11]. All of the above rely on network or distributed file systems to store and transport virtual state. ISR considers aggressively caching distributed file systems, such as Coda[14] or AFS[13], to be the appropriate choice for storing virtual disks. VMotion relies on a storage area network as its virtual disk storage infrastructure. Given this context, understanding how VMs perform over network file systems is extremely relevant. It offers important insight into what characteristics of network file systems are appropriate for the workloads generated by VMs.

We present a study of the performance of virtual machines whose disks are stored on network file systems. Our goal is to identify good configurations for VMs over network file systems, under varying network conditions. The work is organized as follows. We first motivate our study and define the set of questions we intend to answer (section 2). We then present related work (section 3). In sections 4 and 5 we give a full description of the experiments we performed. In section 6 we answer the questions based on our experimental results. Section 7 proposes a VM-tuned network file system that we believe would improve the performance of VMs. Section 8 presents some interesting aspects that were not covered in our study due to time limitations.
2 Our Study
The behavior of virtual machines over network file systems is poorly understood. To the best of our knowledge, no prior work provides an analysis of VMs over network file systems. At the same time, we are dealing with a very complex system. It consists of two OSs layered on top of each other, both of which try to optimize the usage of their disks independently. Moreover, the virtual disk is stored remotely, which complicates things even further. The optimizations (the policies) of the guest OS may either be redundant or may affect performance adversely when used in conjunction with certain policies of the host OS. Given all this complexity, it is impossible to simply know the whole range of implications that the interaction between all these layers has upon VM performance. Some examples of such implications are double caching, double paging, and double scheduling, but many others surely exist. Faced with this problem, our approach is measurement followed by analysis. Measurement gives us solid ground for our suppositions about what would improve the performance of VMs over networks. As a first cut at understanding virtual machines over network file systems, our work concentrates on several key questions:
Question 1 What are some good and bad configurations for remote VMs on fast LANs over NFS? Here, we are interested in seeing how the different configurations possible for VMs over NFS affect their performance when the network imposes virtually no limitation (1Gbps, 0 delay). The goal is to single out a configuration that suits remote VMs over NFS best.

Question 2 How do virtual machines operate over different network file systems, under varying network conditions? The goal here is to find which file systems offer better performance for the VM under which network conditions. Knowing the characteristics of these file systems, we can then infer desirable properties of VM-tuned network filers. We have chosen to experiment with three key file systems: NFS, AFS, and Coda.

Question 3 How much does caching improve the performance of remote-disk virtual machines?

Question 4 How much locality is there in virtual disk accesses? By answering this question, we should be able to say whether prefetching and caching help the performance of remote VMs.

We have addressed the above questions through our experiments. In section 6 we present our findings related to each of these questions.

3 Related Work

Much research has been done on virtual machines since the notion was first introduced in the 1960s. An initial survey of this research is presented in [12]. A very good classification of virtual machine monitors is offered in [18], which we adopt in this document. They distinguish between Type I, or unhosted, VMMs (e.g. Xen[10], VMware[6] ESX Server), Type II, or hosted, VMMs (e.g. UMLinux[9]), and hybrid VMMs, which are hosted for I/O but unhosted for CPU (e.g. Microsoft VirtualPC[5], or VMware Workstation). In our study, we focus on the hybrid type of virtual machines.

Much of the work studying virtual machines has involved testing their efficiency in one way or another, finding bottlenecks and improvements. [20] presents the implications of hosted virtualization on the performance of I/O devices, with a focus on network adapter virtualization. It identifies sources of overhead and proposes solutions to enhance performance, most of which are based on modifying the kernel of either host or guest, and/or the VMM. Although the paper focuses on network I/O, it does identify some bottlenecks and solutions (e.g. buffering sends) that are possibly valid for file I/O as well, and it is therefore relevant to our work. [8] compares virtual disk performance for the unhosted VMware ESX Server to native disk performance, in an attempt to create a methodology to predict how virtualization impacts the performance of applications in general. Although the paper mentions an analysis (or rather a prediction) of remote disk performance as a future goal, no follow-up of the paper addresses this aspect, to the best of our knowledge. Nonetheless, the paper is very relevant to our work and presents important insight into how to structure a testing methodology. As part of the effort to make virtual machines a more feasible solution in grid computing, [22] suggests several improvements to NFS to make it a better transport for VMs. Although many enhancements in the article seem appealing at first glance, they are mostly based on assumptions that were never proven experimentally or referenced anywhere in the article. For example, they assume temporal and spatial locality of references to virtual disks, which is not a clear characteristic of a virtual disk. Our approach differs from theirs in that sense: our intent is to experimentally find bottlenecks and only then propose solutions.
[18] exposes some bottlenecks in running VMs on a hosted VMM and proposes solutions based on new host kernel support. Although some of the solutions might have wider applicability, most of them concentrate on a particular VMM implementation (UMLinux), which, from their results, we infer to be quite inefficient compared to other VMM technologies such as VMware.

A lot of valuable work has been done on making virtual machine migration fast. Although we are not directly interested in VM migration, we believe that some of the solutions to improve migration are relevant to us. Examples of such relevant solutions are background transfer of the virtual disk, which is similar in spirit to pre-fetching (VMotion), and aggressive caching strategies (ISR). We therefore briefly discuss some of this work that optimizes VM migration. Internet Suspend/Resume[15, 16] is a technology that makes an important step towards the goal of seamless mobile computing. It simulates the data "following" the user by migrating virtual machines by means of clever suspend, transport, and resume. ISR thus seems to offer an appealing means to convey VMs. Nevertheless, many of the assumptions that the ISR mechanisms are based on are not valid in more general scenarios. For example, the predictability of the user's next hop (or, in our terms, the predictability of which VM will be requested where) may be valid in the ISR scenario, but not in a more general one. VMotion[17] is a novel mechanism for transparently migrating running virtual machines. It comes with important methods to reduce migration time. Nevertheless, it relies on storage area networks to transport the largest part of a virtual machine (the virtual disk), which are only applicable to LANs.

To sum up, measuring the performance of remote-disk virtual machines with the goal of finding bottlenecks and improvements is a new idea, to the best of our knowledge. However, many people have looked into measuring other performance aspects of VMs. We have learned a lot from their work, both from their solutions and from their measurement methodology. Other people have built systems that migrate virtual machines under given assumptions (such as a specific scenario or supporting infrastructure). We believe that our work is complementary to theirs and that some of their solutions may suggest improvements to us and vice-versa.

4 Methodology
We ran multiple sets of experiments that helped us better understand the behavior of VMs over the network. We varied the workload, the network file system, the network conditions, and some characteristics of NFS and VMware. A complete list of the experiments whose results are used in this document is given in section 5. Each of these experiments is related to one or more of the questions enumerated in section 2. In the remainder of this section, we present our methodology and motivate our choices for the different settings.
4.1 The Virtual Machine Monitor
In all our experiments we used VMware Workstation 5.5.1[6]. It is currently the most popular virtual machine technology[7]. It is also fairly easy to install, configure, and use, it is available on a large variety of platforms, and it supports many operating systems as guests. We did not vary the VM technology, for several reasons. First, to the best of our knowledge, studies show that VMware has the best performance among hybrid virtual machine monitors. An example of such a study is [4], which offers a performance comparison between VirtualPC and VMware and finds the latter better. Second, comparing VMware's performance to that of a Type I VMM (such as Xen) would be unfair. Much of the performance penalty of hybrid virtual machines is incurred due to the double layering (two OSs on top of each other), which is not present in Type I systems. Therefore, a study of the performance of Type I VMs over the network would not support any comparison to our study of VMware VMs over the network. Although we admit that an independent evaluation of the performance of Xen VMs over networks would be interesting, lack of time prevented us from pursuing it.

Having motivated why we chose VMware for all our experiments, let us describe in more detail the VM configuration used throughout our tests. We created a VM image with a virtual disk capacity of 7GB (not pre-allocated by VMware, but available), and with an IDE adapter instead of a SCSI adapter (please see the midterm report, section 4.1, for the motivation of this choice). The VM saw a memory size of 256MB and a 512MB swap. The guest operating system was a minimal install of Fedora Core 4, kernel version 2.6.11-1.1369. The total size of the original virtual disks was 2.6GB. (We refer here to the original image we created, to which we rolled back before each test; this figure therefore does not include the disk occupied by the running workloads.)

4.2 File Systems

We varied the file system that stored the VM disk to see how this impacts performance. The network file systems we experimented with were NFS, AFS, and Coda (note that we did not run any actual tests on Coda – see below).

Each of these network file systems is attractive in its own way. NFS is by far the most widely-used network file system today. It is available on most platforms, and it is included in most flavors of Unix. It is also very easy to install, configure, and use (in this document we are not concerned with scalability issues, which would make NFS administration more difficult). In addition, it is known to have good performance on fast LANs. These advantages of NFS over most other network file systems make it attractive for storing virtual machine disks. However, NFS is known to have poor performance on low-bandwidth, high-delay networks, especially because it uses only in-memory caches. It is relevant, therefore, to find out how VMs over NFS perform across different network conditions.

AFS[13] and Coda[14], on the other hand, are not easy to install and configure, but they have aggressive caching strategies, which might be attractive for remote-disk VMs. Therefore, experimenting with them is also relevant. Although AFS and Coda are similar in their on-disk caching strategy, there are differences between the two which make each interesting to experiment with separately. AFS performs on-demand block access and caches blocks from each file in a per-file cache. It is interesting to see how this caching strategy improves the performance of virtual machines over various network conditions. Coda, on the other hand, performs whole-file fetch and on-disk caching. This means the open system call blocks until the whole file is fetched over the network and cached on the local disk. In our case, where files are huge (they are virtual disk files of more than 1GB), using Coda would probably lead to a large VM resume time, but the performance afterwards would be extremely close to local VM performance.

We did not perform experiments on Coda because we were not able to install and configure Coda on our Fedora Core host (the Fedora Core Linux distribution is not supported by the Coda group). Changing the host Linux distribution to the Coda-supported Red Hat 9 would have made the comparison with our previous FC4 experiments difficult. Moreover, some notes on the Coda forum indicated that Coda does not behave very well on large files, such as our virtual disks. In addition, we realized that we can easily give an estimate of the performance of VMs over Coda.

4.3 Workloads

In our experiments we used two workloads: the Bonnie++[1] file system benchmark and the compilation of a Linux kernel.

4.3.1 Bonnie

The first workload was a file system benchmark: Bonnie++[1]. The Bonnie benchmark was used only in tests related to Question 1 (see section 5 for the list of tests). Although a file system benchmark cannot be considered representative of a normal workload that users are likely to produce, we believe that the statistics reported by such a benchmark are valuable. First, a file system benchmark, especially Bonnie, is specifically designed for the study of system bottlenecks[21]. Second, the benchmark gives separate statistics for different types of operations (block/character reads/writes). In this way, we were able to experiment with different configurations and see how each kind of operation performs in each case. From the statistics reported by the file system benchmark, we were able to single out a good configuration for VMs over NFS on fast LANs (VMware cache on, NFS asynchronous). This allowed us to reduce the space of experiments we needed to run and to provide some simple rules for tuning VM performance on NFS.

We chose this particular benchmark because of several advantages it has over other file system benchmarks. It is fairly easy to use and its metrics are easy to understand (those of interest to us are the throughputs of per-character and per-block operations)[21]. In all tests, Bonnie operated on 4GB worth of data, so that the buffer cache of the host cannot hold all the data (please refer to the Bonnie Web page[1] for details on these parameters of the Bonnie benchmark). For each test, we ran three trials of Bonnie and took the mean, so that random variations do not affect the result.
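As an illustration, one such trial can be launched as follows (a sketch of the native-NFS variant; the mount point is a placeholder, inside the VM the target directory would be a local guest path instead, and we do not claim these were the exact flags we used):

# run Bonnie++ against the NFS-mounted directory, with a 4GB data set
# so that the host's buffer cache cannot hold all of it
bonnie++ -d /mnt/nfs/bonnie -s 4096 -u root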
4.3.2 Kernel Compilation

In addition to a file system benchmark, we also used a compilation workload. Such a workload, while still highly I/O-bound, is closer to a real user workload. People often run compile-like workloads, e.g. when installing a new program or when compiling LaTeX sources. We chose the compilation of the 2.6.12-1.1390 Fedora Core 4 kernel. It is large enough to impose some stress on the file systems. With some exceptions, we ran three trials of each kernel make (the exceptions were the native NFS tests with large delay, which took extremely long – 17 hours to 2 days).
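For concreteness, inside the guest this workload amounts to a timed build of the kernel tree; a sketch (the source path and configuration step are illustrative, and we do not claim these were the exact commands we used):

cd /usr/src/linux-2.6.12
make defconfig
time make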
4.4 Emulab

We chose to run all our experiments on Emulab[2]. Emulab is a standard environment to run experiments under controllable and customizable network settings. We already summarized the benefits of Emulab in our midterm report and will not repeat them here.
For our experiments, we chose the following configuration of host machines: 64-bit Xeon processor, with a speed of 3GHz, and 2GB RAM. A direct link between the two nodes was simulated by Emulab. We varied the characteristics of this link to simulate LAN and WAN conditions. The host operating system was Fedora Core 4, kernel 2.6. For all tests involving AFS, however, we used another environment: two machines in the Systems Lab of the CSE Department of the University of Washington. The reason for this was incompatibilities of AFS with the customized Fedora Core 4 offered by Emulab. The Systems Lab machines were Intel Pentium 4, 3.2GHz, 2GB RAM. The connection between the two machines went through a 100Mbps switch. The host operating system was Fedora Core 3, kernel 2.6.
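For reference, Emulab experiments of the kind described above are specified with ns-2 style scripts; a minimal sketch of our two-node topology (node names and the particular bandwidth/delay values are illustrative, and we omit the OS image selection) is:

# minimal Emulab experiment: two nodes joined by one shaped link
set ns [new Simulator]
source tb_compat.tcl

set client [$ns node]
set server [$ns node]
tb-set-hardware $client pc3000
tb-set-hardware $server pc3000

# the bandwidth and delay of this link are the knobs we vary
set link0 [$ns duplex-link $client $server 100Mb 20ms DropTail]

$ns run

The bandwidth and delay parameters of the duplex link are the two network knobs varied throughout the experiment set below.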
5 Experiment Set

Each of our experiments is related to one or more of the questions we enumerated in section 2.

Question 1. All experiments used the Bonnie++ workload. We ran both native and VM experiments. We used only NFS for the network file system. We varied the write cache state of VMware and the NFS operation mode (synchronous/asynchronous). All experiments took place at Emulab and were run on a network with a bandwidth of 1Gbps and a delay of 0ms. From the results reported by Bonnie, we found the influence of certain knobs, such as the NFS operation mode and the VMware write cache, on the performance of different I/O operations inside the VM. We were able to rule out some of the possible combinations and concentrate all our other tests on only one of them.

Question 2. All experiments used the kernel compilation workload. We varied the network conditions (bandwidth and delay). We experimented with NFS and AFS. The NFS experiments took place at Emulab and all the AFS experiments took place at the UW CSE Systems Lab. All experiments used the VM write cache on and NFS in asynchronous operation. The values we chose are 1000, 100, 10, and 1Mbps for bandwidth and 0, 20, and 50ms for delay. These are the most common characteristics of today's Internet connections (especially campus ones) and can therefore be considered of interest. These experiments allow us to see how the performance of virtual machines over NFS, AFS, and Coda is influenced by different network characteristics.

Question 3. All experiments used the kernel compilation workload and all were run on AFS. We varied the AFS cache size to see how this impacts performance under different network conditions. Together with the NFS VM experiments, these tests show how caching helps improve the performance of VMs over networks.

Question 4. For this question, we did not run any additional experiments. The locality conclusions were drawn through analysis of the experimental results from the previous two questions.

6 Measurement Results

In this section we answer each of the four questions enumerated in section 2, based on the results of our experiments.
6.1 Question 1: Good configurations for remote VMs over fast networks

In our first question, we are interested in seeing how different knobs affect the performance of VMs. We assume that the network does not impose any bottleneck (bandwidth is 1Gbps, delay is 0). We use NFS as the underlying network file system holding the virtual disk. As already motivated in section 4, NFS is employed in this setup because it is known to have good performance over fast networks and it is the most widely used network file system. We experiment with two knobs: the NFS operation mode and the VMware write cache.

6.1.1 NFS Operation Mode: Sync/Async

NFS has two modes of operation: synchronous and asynchronous. It is well known that asynchronous operation improves the performance of NFS in the native case. Is this true for VMs, though? Figure 1 shows the influence of the NFS operation mode on disk I/O inside the VM. We compare the disk throughput reported by Bonnie for writes and reads when NFS is operating in asynchronous versus synchronous mode. Block writes (the same results hold for character writes) benefit a lot from the asynchrony of NFS in the virtualized case: a 55% gain from async when the VM cache is off, and 21% when the VM cache is on. The reason is that, unlike with asynchronous NFS, with synchronous NFS, whenever VMware pushes a write to the virtual disk file, it has to wait for the server to commit it to stable storage. We note that the gain from asynchrony is less significant when VMware's cache is enabled. This is because the extra buffering performed by VMware itself hides delays due to synchrony from the VM. Still, the 21% gain of asynchrony in the cache-on case is valuable enough for us to consider asynchronous operation of NFS a good way to improve the performance of writes inside a remote-disk VM.

For VM character and block reads, however, asynchronous operation of NFS affects the throughput only slightly (figure 1(b); note that in this figure we introduced the new values for NFS native that we obtained from re-running the Bonnie tests on the same type of Emulab node, pc3000 – please refer to our email for a detailed explanation). We believe that the 6% penalty introduced by asynchrony for VM reads is due to transient network or machine conditions, and that this difference can undoubtedly be considered insignificant.

To conclude, the asynchrony of NFS increases the performance of remote-disk VMs. Despite the clear performance benefits of asynchrony even in the native case, Sun NFS' default option for an export is 'synchronous', because asynchrony can result in data loss if the server crashes. However, given that we are particularly interested in the performance of virtual machines over network file systems, we have decided to ignore this aspect.

6.1.2 Effect of VMware's Write Cache

VMware has a fixed-size cache that buffers writes to the virtual disk. We know that there already exist some levels of caching in the virtualized case: the guest OS's buffer cache and the host OS's buffer cache (which is used for writes only in the local VM case). Does this additional level of caching help? As shown in figure 2, VMware's write cache benefits write operations. We are comparing the disk throughput reported by Bonnie for local and remote VMs when the VMware write cache is on versus off. For writes (figure 2(a)), the difference is remarkable. In the NFS case, for instance, the write throughput with the VM cache on is almost 100% better than with the cache off. The explanation is that VMware's cache absorbs many of the disk accesses from the VM. For reads, though, the VMware write cache makes essentially no difference (2%), except in the local case. We attribute the larger penalty incurred by the local VM when the cache is on to the fact that VMware's write cache is used only to make writes seem asynchronous, and not to cache reads; however, it occupies memory that would otherwise be used by the buffer cache to improve read performance. We conclude that VMware's write cache helps writes and does not significantly affect reads.

6.1.3 Good configuration for running VMs over NFS

In the previous two subsections, we saw the influence of the NFS operation mode and the VMware write cache on the read/write throughput of virtual machines. We proposed asynchronous NFS operation and VMware cache enabled, respectively, as the best settings.
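For reference, the NFS operation mode discussed in section 6.1.1 is selected on the server side in /etc/exports; a minimal sketch of the export we have in mind (the path and client subnet are placeholders) is:

# export the directory holding the virtual disk image asynchronously;
# 'sync' is the safe default that trades performance for durability
/vmimages  192.168.1.0/24(rw,async,no_root_squash)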
(a) (b)
Figure 1: The influence of NFS operation: asynchronous versus synchronous

(a) (b)
Figure 2: The influence of VMware write cache
We now assess the benefits and penalties of choosing this configuration. Figure 3(a) shows the variance of the read and write throughput from one configuration to another. The horizontal lines represent the local native throughput for reads and for writes, respectively. What this figure clearly shows is that our configuration (number 3) is the best one for writes, but only the third-best for reads. One might argue that favoring writes at the expense of worse read performance would be a mistake. The graph, however, also shows our motivation for singling out the NFS-async and VMware-cache-on settings as our good configuration. First, we notice that reads are affected very little by the different configurations (there is a maximum penalty of 8% for the worst configuration relative to the best configuration for reads). Second, write performance is much more skewed (the maximum penalty is 58% for the worst configuration relative to the best).

Ultimately, we propose NFS async and VMware cache on as our good combination for remote VMs over NFS. In this case, writes achieve the best performance possible with virtualization over the network (even better than a local VM with the VMware cache off). The gain for writes is 22% over nfs-sync/vm-cache-on (the best option for reads). At the same time, reads are affected extremely little (the penalty of choosing this configuration instead of nfs-sync/vm-cache-on is less than 5%). Our conclusion – NFS async and VMware cache on is best – was confirmed by several similar kernel compilation tests, whose results we do not present here.

6.2 Question 2: How do VMs operate over network filers, on varying network conditions?

We study the performance of virtual machines over three network file systems: NFS, AFS, and Coda. Each of these possibly has a distinct desirable characteristic for VMs over networks (see section 4.2). To test the performance of VMs over these file systems, we vary the bandwidth and the delay.

6.2.1 NFS

Figure 3(b) shows the time taken by the kernel compilation workload inside the VM over NFS, for different combinations of bandwidth and delay. Two important conclusions can be reached about the performance of VMs over NFS: delay has a significant impact, and non-bottleneck bandwidths (for our workload) do not affect VMs over NFS at all.

Delay. The kernel compilation inside the VM is affected by delay quite a lot. With a bandwidth of 100Mbps, we see an increase of about 43% in the compilation time when the delay increases from 0ms to 50ms. The percentage increase is a little smaller when the bandwidth is 10Mbps (about 25%), but it is still significant. We also notice that the absolute value of the increase in compilation time from delay=0 to delay=50ms is approximately constant across all bandwidths.

The effect of delay can be explained. In general, the time penalty due to delay depends on the number of network transfers times the delay increase (this also explains why the absolute delay effect is constant across bandwidths). NFS uses relatively small blocks (4-32KB[3]; the default for kernel 2.6 is 32KB, which was our size as well) to fetch the parts of the virtual disk required by the compilation inside the VM (about 300MB). Therefore, many block transfers are needed and, on every fetch over the network, the delay adds to the global compilation time. A simple calculation (roughly 300MB / 32KB, i.e. on the order of 9,000-10,000 block transfers, each paying up to 50ms of extra delay) leads to a roughly 450-480-second penalty due to the delay increase from 0 to 50ms, which is consistent with the results obtained (figure 3(b)).

As a conclusion, the small default block size of NFS leads to an important effect of delay on the performance of VMs. We have verified this fact only for the kernel compilation workload. In general, the impact of delay will be proportional to the size of the working set of the VM workload times the delay. So, if the working set of the workload inside the VM is large, the performance of the VM will decrease drastically as the delay increases. Also, since NFS does only in-memory caching, a working set larger than the host memory may mean that some blocks are fetched more than once (if the workload re-reads data not in the cache). This would lead to an even larger impact of delay on the performance of the VM over NFS.

Bandwidth. As the bandwidth decreases from 1Gbps to 10Mbps, the increase in the compilation time is very small. The maximum increase in time is about 8%. (When the delay is 50ms, we even see an apparent decrease in time when the bandwidth goes from 100Mbps down to 10Mbps. We can only explain this slight drop as caused by some external temporary conditions and view it as insignificant.) The fact that these bandwidths (1000, 100, 10Mbps) have little impact on our kernel compilation workload inside the VM over NFS led us to think that 10Mbps is not a bottleneck bandwidth for this workload.

To verify this, we ran experiments with a bandwidth of 1Mbps. When the bandwidth goes down to 1Mbps, we observe a 200% increase in the compilation time. This suggests that the bottleneck for this particular workload (kernel compilation inside the VM) lies between 10Mbps and 1Mbps for NFS. We also notice that the three-fold increase in the compilation time is less than the ten-fold decrease in bandwidth. Although the 200% performance penalty due to bottleneck bandwidth seems large, we will see that this is much better than what the VM over AFS experiences in similar conditions.

NFS native vs. NFS VM. A completely unexpected result deserves mention. While VMs over NFS have quite acceptable performance as network conditions worsen, native compilation over NFS performs extremely badly, especially when delay increases. For a bandwidth of 100Mbps, native NFS completes the kernel compilation in 1282 seconds (about 21 minutes) when there is no delay. When the delay increases to 20ms (bandwidth remains 100Mbps), the compilation time jumps to 68352 seconds (about 19 hours), and with a delay of 50ms the compilation time is 197945 seconds (about 2 days and 7 hours). Similar results for 10Mbps confirm our observation that delay "kills" native NFS.

The striking difference in performance between NFS VM and NFS native is due to several reasons. First, the VM benefits from additional cache levels: the guest OS's buffer cache and VMware's write cache. As we have already seen, in practice, these caches do not conflict with each other even in the 1Gbps or local FS case (see section 6.1). The benefit of these caches only increases as network conditions worsen.
(a) (b)
Figure 3: (a) Comparative variance of read and write performance of VMs on NFS; (b) VMs over NFS when bandwidth and delay vary

The native compilation, on the other hand, has only NFS' cache (which is present in the VM case as well). Second, access control has a huge overhead in the native case. The native compilation deals with very many different files; when it opens each of them, NFS needs to verify authorization in the ACL. This consumes time and messages between the server and the client, and each of these messages incurs an overhead due to the delay on the network. In the VM case, however, VMware accesses only one giant file (the virtual disk), and so it has to go through the access control and authentication protocol of NFS only once (or a small number of times).

Let us summarize our results for VMs over NFS. We found that the performance of VMs over NFS is affected by delay mainly due to the small block sizes. Also, while non-bottleneck bandwidths have essentially no effect on the VM compilation performance, bottleneck links have quite a drastic effect. Still, the performance of VMs on NFS under bad network conditions (especially high delay) is much better than that of purely native NFS. Also, as we will see in the AFS section, VMs over NFS are likely to outperform VMs over AFS on low-bandwidth networks.

6.2.2 AFS

Figure 4(a) shows the time taken by the kernel compilation workload inside the VM over AFS, for different combinations of bandwidth and delay. The AFS cache size is set to 3GB, which is larger than the size of the VM (around 2.5GB). From this figure, we make two observations. First, delay does not seem to affect the performance at all. Second, non-bottleneck bandwidths for our workload affect the VM performance only a little, while bottleneck bandwidths affect AFS enormously (especially compared to NFS).

Delay. For the 100Mbps case, we see only a 3% increase in running time as the delay goes up from 0 to 50ms; for the 10Mbps case, the difference is even smaller. As already mentioned, performance is affected by delay only if the number of accesses to the network is large. With AFS, the number of network accesses is typically small (or, to be more precise, smaller than in the case of NFS), for two reasons. First, AFS caches every block in an on-disk client cache; thus, AFS accesses the network only if the required data block is not available in the client cache. Moreover, with the assumed large cache size that fits the entire VM, each block of data is brought over the network at most once. Second, the block size of AFS is 1MB, much larger than the NFS block size. Consequently, each time AFS accesses the network, it brings a large chunk of data into the local cache, thereby saving additional network accesses for all the data brought in. This effectively bounds the number of network accesses.
Consequently, the large on-disk cache of AFS and its large block size lead to no effective penalty imposed by delay in the network.

Bandwidth. We observe that when the bandwidth decreases from 100Mbps to 10Mbps, there is a noticeable increase in the compilation time. However, this increase (about 12%) is quite small when compared to the 10-fold decrease in the bandwidth. This was quite surprising, since we expected bandwidth to matter when transferring such large quantities of data (over 2GB) over the network (see below). To understand what was happening, we plotted the size of the client cache over time (with 100Mbps and no delay), shown in Figure 4(b). Since the cache is large and AFS caches every file block, the amount of data in the cache is equal to the total amount of data transferred over the network from the AFS server. In the graph, the size of the client cache starts at around 372MB and increases to 2.2GB at the end of the compilation. The 372MB is the amount of data cached when the guest operating system is loaded into VMware, before the compilation has started. The 10Mbps line in the graph is a reference line which shows the maximum amount of data that could be transferred into the cache at a speed of 10Mbps. We observe that, except for a small initial interval, the amount of data in the cache is always less than the amount that could have been transferred over a 10Mbps link. This shows that the 10Mbps link is not the bottleneck, and that the workload does not demand such a high throughput. Therefore, decreasing the bandwidth to 10Mbps does not affect performance much.

Since 10Mbps was not an effective bottleneck, we computed the average bandwidth used by the compilation, and it was around 7Mbps. Therefore, it is reasonable to expect that if the available bandwidth is lower than this value, we would see a hit in performance. To test our theory, we ran the same workload again, but this time with a bottleneck bandwidth of 1Mbps. As expected, the time increased drastically, to 15600 seconds. We also believe that the time to run this workload is then dominated by the network transfer time (since the bandwidth is low), which is inversely proportional to the bandwidth. Therefore, we can expect the compilation time to vary roughly inversely with the bandwidth.

6.2.3 Coda
Although we could not experiment with Coda (due to difficulties that precluded us from installing it), we believe that we can predict Coda's performance. Given Coda's whole-file transfer and caching strategy, Coda would lead to a very high VM resume time and very good performance (close to a local VM) once the VM is started. The total time to run a kernel compilation on Coda would thus be the resume time plus the local VM compilation time. As already seen, AFS also brings almost all of the virtual disk locally, which suggests that the total time for kernel compilation inside the VM would be similar in both cases. There are, however, a few points of distinction between Coda and AFS. Since AFS does only on-demand block fetching, it may require several accesses to the network. Coda, on the other hand, performs only a single, huge access (or a few, if the virtual disk is split across multiple files). This would probably benefit Coda when the delay is large. However, since the entire file has to be cached locally, it is likely that Coda would bring more of the file than AFS, which would fetch fewer blocks. This would probably benefit AFS if the bandwidth is low. Overall, we expect the performance of AFS to be comparable to, and in most cases better than, Coda's.

Apart from this, there are some other problems with using Coda. Due to the whole-file-caching semantics, interactive programs would suffer, since the entire gigantic file has to be fetched before operation can begin. Another problem is that Coda would not work if the local cache is not large enough to fit the entire file. Even if the cache is large enough to hold the largest file, unless it can hold all the virtual disk files simultaneously, there would be severe performance issues. Consider an example where the cache size is 1GB and the virtual disk is contained in two files of 1GB each. If there were ever a case where blocks had to be read alternately from each virtual disk file, then Coda would have to fetch the first 1GB file, read a block, then fetch the next file, read another block, and so on. This would be akin to 'thrashing', and would render the system unusable.

6.2.4 Comparison between NFS and AFS
(a) (b)
Figure 4: (a) VMs over AFS when bandwidth and delay vary; (b) Variance of the AFS cache size during the compilation (growth of the AFS cache over time for the kernel compilation: Cache Size (MB) versus Time (seconds), with a 10Mbps reference line)

The NFS experiments were run on Emulab, whereas the AFS experiments were run in the department. Since the hardware configurations are different, we feel that a direct comparison of their performance would be unfair. However, we believe we can give an approximate comparison based on an NFS experiment we performed in the department (we re-ran the 100Mbps, 0-delay NFS experiment in the lab). In figure 5(a), for NFS, the first bar, which is solid, shows this measured value. The bars for 20ms and 50ms delay are not solid; these represent expected values, estimated based on the variations observed in our experiments at Emulab. The main point is that when there is no latency, NFS and AFS are comparable, with AFS performing slightly better. However, as the latency increases, NFS performs worse, whereas the performance of AFS is not affected by the latency. Therefore, we believe that AFS would outperform NFS, and the difference would become more evident as delay increases.

All this holds when bandwidth is not a bottleneck. From the results of the 1Mbps run, however, we see a large difference in the performance of NFS and AFS, with NFS taking 4000 seconds compared to AFS's 15000 seconds (we mentioned this result in the bandwidth discussion of section 6.2.2, but did not plot it here because it would have rendered the other bars invisible). This difference can be explained by the block sizes used. Since NFS uses blocks of 4-32KB, it brings only about 400MB of data over the network. AFS, on the other hand, uses 1MB blocks and, as mentioned earlier, ends up transferring over 2GB over the network. When the bandwidth is so low, this additional 1.6GB of unneeded data that AFS transfers kills its performance.
6.3 Question 3: How does caching help VMs over networks?
In this section, we look at the effect of caching on performance. As expected, caching helps, and a large cache provides better performance than a smaller one. From figure 5(b), we see a 20% increase in compilation time when the cache size is reduced from 3GB to 300MB. Since the cache is now limited (much smaller than what AFS brings over the network), several blocks of data have to be brought over the network multiple times (as opposed to at most once with an effectively unlimited cache). We observed that the total amount of data transferred is around 2.2GB with the 3GB cache and 4.5GB with the 300MB cache. Again, delay does not significantly affect performance, since the number of network fetches (though roughly doubled, from about 2200 to 4500) is still not very large. We believe that the amount of data transferred would be extremely high with a very small cache, and that it would decrease exponentially as the size of the cache increases and eventually flatten out (i.e., the marginal returns would be high initially but low after a certain cache size). Due to time constraints, we were unable to verify this hypothesis.
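For reference, the AFS client cache size we varied is configured through the cacheinfo file read by afsd at startup; a sketch (standard OpenAFS paths; the last field is the cache size in 1KB blocks) corresponding to our 3GB configuration is:

/afs:/usr/vice/cache:3000000

For the restricted 300MB configuration, the last field would be 300000 instead; the same value can also be supplied with afsd's -blocks option.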
(a) (b)
Figure 5: (a) Comparison between NFS and AFS; (b) Effect of AFS cache size
6.4 Question 4: Is there locality in virtual disk accesses?
It is well known that locality of accesses is a desirable characteristic of a program. It has several important benefits. It allows good cache hit rates, which lead to good performance by eliminating slow disk reads. It also allows large reads, which avoid the even slower disk seeks. Moreover, it enables prefetching. Prefetching is especially valuable when the data lies remotely, as in our case, and the bandwidth is low and the delay is high. For us, pre-fetching would probably be very beneficial: while the VM is busy doing some computation with the data just read, the network file system or VMM would pre-fetch the blocks in the vicinity of the data read. Thus, the VM would not have to wait for the slow transfers caused by high delay and low bandwidth. Unfortunately, for our kernel compilation we can state that locality is not a property of the virtual disk accesses. We saw that in order to compile the kernel (which is around 300MB in size), VMware over AFS brought in 2.2GB of data, which is almost the entire virtual disk! This leads us to believe that the files are quite spread out on the virtual disk, and that there is not much locality to be exploited. A different scheme of arranging files on the virtual disk might result in better locality, which would mean performance could be improved further by prefetching neighboring virtual disk blocks.
7 A possible VM-tuned NFS
In the previous sections, we saw how both NFS and AFS have desirable properties that might be useful for VMs. NFS is extremely popular, comes by default with most OS installations, and has excellent support (please refer to section 4.2 for more on this). AFS, on the other hand, has large disk caches and block sizes, which make it relatively immune to large network delays and to redundant block fetches. At the same time, however, the large block size of AFS, combined with the lack of locality in virtual disk accesses, leads to much more data than needed being transferred over the network (recall that AFS transfers about 2.2GB for the kernel compilation workload). This, in turn, leads to AFS' bad performance at bottleneck bandwidths. One possible way to combine the 'best of both worlds' would be to implement a special version of NFS – call it VM-tuned NFS – which adds the relevant features of AFS without the extra baggage. Of course, in the process, such a system would most probably lose some of the properties and semantics of NFS and AFS: consistency for NFS and scalability for AFS. We assume there is no sharing of virtual machines, and thus consistency is not an issue. We also assume that we do not care about VM corruption due to server or client crashes; in this document, we are solely interested in tuning performance for VMs. Several recommendations can be made for our VM-tuned NFS. First, the VM-tuned NFS client will have an on-disk cache, inherited from AFS. The larger the cache the better, as we saw in the answers to questions 2 and 3. Since disk is cheap nowadays, we
can assume that the client cache is large enough to fit the whole VM image (note that we are not concerned here with the scalability of client storage if the client accesses multiple VMs; however interesting this aspect might be, it is outside the scope of our project). One implication of the large client cache is that our file system no longer guarantees any consistency semantics, but, as assumed above, this is not an issue for us.

Second, the block size is an important factor to be tuned for our file system. Its impact on performance was presented in the answer to question 2. Deciding on the block size is a difficult issue, because it involves a trade-off. A large block size means more data can be brought in per network access, and therefore fewer accesses; as a result, network latency would not affect performance much. However, given the lack of spatial locality in virtual disk accesses observed in question 4 (for kernel compilation at least), the additional data brought in along with the required data would be unnecessary overhead. If the bandwidth is high enough not to be a bottleneck for the specific workload, then the additional data brought in would not really cost much in terms of time, but in a bandwidth-starved network the additional overhead could kill the system. Small blocks have exactly the opposite effect: they bring in not much more than necessary, but in many transfers. We conclude that the choice of block size should be based on the delay and bandwidth characteristics of the network, as well as on the type of workload inside the VM. The best approach is probably to choose a workload that approximates the workload inside the VM and then experiment with different block sizes.

While on this topic, another related option to explore could be iSCSI. iSCSI is a transport layer protocol for establishing connections with remote storage devices over an IP network. Since our VM-tuned network file system is highly specialized, we could consider an implementation which does not provide all the features of a file system, but only a simple block device. This is quite feasible for a VM, since the VM typically uses only a handful of files (a couple of virtual disk files, a configuration file, and a memory file). So, a rudimentary 'file system' could be implemented at the client, which would keep track of the raw blocks on the remote disk. In such a scenario, we could envision using iSCSI to communicate with the remote SCSI device. This would be a much leaner 'file system', and since it has fewer features (but enough for our purpose), we could expect it to show improved performance.
8 Future work
Although we have reached a set of conclusions, much work remains to be done. First, experimenting with different block sizes for NFS and AFS would be relevant, since we have seen how important the block size is for the performance of VMs. Another interesting direction is to verify that our findings hold for other workloads. While we expect the high-level conclusions to be true for most other workloads (e.g. the insensitivity of AFS to delay and its poor performance at bottleneck bandwidths), certain specific findings may not hold (e.g. the value of the bottleneck bandwidth for NFS and AFS). Also, in order for our recommendations to be verified, a VM-tuned NFS as proposed in section 7 would need to be built and evaluated.
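For NFS, for instance, the block-size experiments we have in mind amount to varying the client's rsize/wsize mount options; a sketch (the server name and paths are placeholders):

# mount the export holding the virtual disk with 8KB transfers
# instead of the 32KB default used in our experiments
mount -t nfs -o rw,rsize=8192,wsize=8192 server:/vmimages /mnt/vmimages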
9 Conclusion
We presented a study of the performance of VMs over network file systems under various network conditions. We first described a good configuration for VMs over NFS when the network imposes no restrictions (1Gbps, 0 delay). We then explored how virtual machines perform on several file systems under various network conditions. We learned that AFS' on-disk caching strategy and large block size make it relatively immune to large network delays and to redundant fetches (given that the cache is large enough). At the same time, we saw that the large block size makes AFS behave very poorly at bottleneck bandwidths. Finally, we learned that, for the kernel compilation workload we used, there was not much locality in the references to the virtual disk. We also presented some guidelines on how a VM-tuned network file system could be built: one should combine NFS' simplicity with AFS' caching strategy and tune the block size according to the network characteristics and the workload type.
References

[1] Bonnie++: http://www.coker.com.au/bonnie++/.
[2] Emulab: http://www.emulab.net.
[3] How to configure NFS: http://wiki.ltsp.org/twiki/bin/view/ltsp/nfs.
[4] VirtualPC - VMware performance comparison: http://www.osnews.com/story.php?news_id=5887.
[5] VirtualPC: http://microsoft.com/windows/virtualpc/.
[6] VMware: http://vmware.com.
[7] Wikipedia, virtualization: http://en.wikipedia.org/wiki/virtualization.
[8] I. Ahmad, J. M. Anderson, A. M. Holler, R. Kambo, and V. Makhija. An analysis of disk performance in VMware ESX Server virtual machines. In IEEE International Workshop on Workload Characterization, 2003.
[9] K. Buchacker and V. Sieh. Framework for testing the fault-tolerance of systems including OS and network aspects. In Proceedings of the 3rd IEEE International High-Assurance Systems Engineering Symposium, pages 95-105, 2001.
[10] B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, I. Pratt, A. Warfield, P. Barham, and R. Neugebauer. Xen and the art of virtualization. In Proceedings of the ACM Symposium on Operating Systems Principles, 2003.
[11] R. Figueiredo, P. Dinda, and J. Fortes. A case for Grid computing on virtual machines. In Proceedings of the International Conference on Distributed Computing Systems (ICDCS), 2003.
[12] R. P. Goldberg. Survey of virtual machine research. IEEE Computer, pages 34-45, 1974.
[13] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West. Scale and performance in a distributed file system. ACM Transactions on Computer Systems, 6:51-81, 1988.
[14] J. J. Kistler and M. Satyanarayanan. Disconnected operation in the Coda file system. In Thirteenth ACM Symposium on Operating Systems Principles, volume 25, pages 213-225. ACM Press, 1991.
[15] M. Kozuch and M. Satyanarayanan. Internet suspend/resume. In Proceedings of the Fourth IEEE Workshop on Mobile Computing Systems and Applications, page 40. IEEE Computer Society, 2002.
[16] M. Kozuch, M. Satyanarayanan, T. Bressoud, C. Helfrich, and S. Sinnamohideen. Seamless mobile computing on fixed infrastructure. IEEE Computer, pages 65-72, 2004.
[17] M. Nelson, B.-H. Lim, and G. Hutchins. Fast transparent migration of virtual machines. In Proceedings of the USENIX Annual Technical Conference, 2005.
[18] S. T. King, G. W. Dunlap, and P. M. Chen. Operating system support for virtual machines. In Proceedings of the 2003 USENIX Technical Conference, 2003.
[19] C. P. Sapuntzakis and M. S. Lam. Virtual appliances in the Collective: A road to hassle-free computing. In Proceedings of the 9th Workshop on Hot Topics in Operating Systems, 2003.
[20] G. Venkitachalam and B. Lim. Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor. In Proceedings of the USENIX Annual Technical Conference, 2001.
[21] R. C. Wilson. UNIX test tools and benchmarks. Prentice Hall PTR, New Jersey, 1995.
[22] M. Zhao, J. Zhang, and R. Figueiredo. Distributed file system support for virtual machines in Grid computing. In Proceedings of High Performance Distributed Computing (HPDC), 2004.