Hypervisor Support for Efficient Memory De-duplication

Ying-Shiuan Pan, Industrial Technology Research Institute, Hsinchu, Taiwan, Email: [email protected]
Jui-Hao Chiang, Computer Science, Stony Brook University, Stony Brook, USA, Email: [email protected]
Han-Lin Li, Industrial Technology Research Institute, Hsinchu, Taiwan, Email: [email protected]
Po-Jui Tsao, Industrial Technology Research Institute, Hsinchu, Taiwan, Email: [email protected]
Ming-Fen Lin, Industrial Technology Research Institute, Hsinchu, Taiwan, Email: [email protected]
Tzi-cker Chiueh, Industrial Technology Research Institute, Hsinchu, Taiwan, Email: [email protected]
Abstract—Memory de-duplication removes the memory state redundancy among virtual machines that run on the same physical machine by identifying common memory pages shared by these virtual machines and storing only one copy of each common memory page. A standard approach to identifying common memory pages is to hash the content of each memory page and compare the resulting hash values. In a virtualized server, only the hypervisor is in a position to compute the hash value of every physical memory page on the server, but the memory de-duplication engine is best implemented outside the hypervisor for flexibility and simplicity reasons. A key design issue for the memory de-duplication engine is to minimize the performance impact of these hashing computations on the running VMs. To reduce this impact, memory page hashing should be performed with low overhead and when the CPU is idle. This paper describes why existing hypervisors do not provide adequate support for our memory de-duplication engine, how a new primitive called deferrable aggregate hypercall (DAH) fills the need, and what the resulting performance improvement is.
I. INTRODUCTION

Server virtualization isolates the CPUs, memory and I/O devices on a physical machine (PM) from the virtual machines (VMs) running on it, and enables such management flexibility and benefits as load balancing, fail-over, consolidation, etc. A major design goal of modern hypervisors is to enable as many VMs as possible to run on a PM with a given set of hardware resources. The maximal number of VMs that can run on a PM, also known as the virtualization ratio, is in most cases limited by the amount of memory on that PM. Therefore, making the most of the physical memory resource on a PM is the key to maximizing the virtualization ratio. There are three known ways to improve the memory utilization efficiency of a virtualized server. First, memory de-duplication removes the memory state redundancy among virtual machines that run on the same physical machine by identifying common memory pages shared by these virtual machines and physically storing only one copy of each common memory page. Second, memory compression keeps some of the memory-resident pages in compressed form, and decompresses these pages only when or immediately before they are accessed. Third, thin memory provisioning allocates to a VM a smaller number of physical memory pages than what is initially provisioned for the VM, but gives back the full provisioned amount when the VM later demonstrates that it needs all of it. The focus of this paper is only on memory de-duplication.

A standard way to implement memory de-duplication is to periodically compute a hash value for every physical memory page, check if the hash value is in a page hash value database, and put the hash value in the database if it is not already there. In a virtualized server, the best place to compute the per-page hash values is in the hypervisor, because only the hypervisor can access all physical pages and knows which pages have been modified since the last round. However, to minimize the hypervisor's code footprint and thus its trusted computing base, it is best to include only page content hashing in the hypervisor and leave most of the memory de-duplication engine outside the hypervisor. In addition, because memory de-duplication is meant to be a performance optimization, it is essential to keep its own performance impact on the running VMs to a minimum. Because the most time-consuming part of memory de-duplication is per-page hashing, which is done by the hypervisor, reducing memory de-duplication's performance impact means (a) minimizing the absolute performance overhead of per-page hashing, and (b) scheduling per-page hashing when VMs are not running on the server's CPUs. Unfortunately, the hypercall interfaces provided by modern hypervisors such as Xen [1] are inadequate because they are either inefficient or inflexible. More concretely, existing interfaces allow the developer to either minimize the performance impact of hypercalls on other VMs at the expense of additional performance overhead, or minimize the performance overhead of hypercalls without having control over their performance impact on other VMs, but not both. We propose a deferrable aggregate hypercall (DAH) mechanism that addresses this deficiency and achieves
both low overhead and low impact, implement it in the Xen hypervisor, and demonstrate its superiority by applying it to the implementation of a memory de-duplication engine for Xen.

The rest of this paper is organized as follows. In Section II, we outline the design of our memory de-duplication engine and motivate the need for a hypercall aggregation mechanism. In Section III, we describe the hypercall aggregation support currently in the Xen hypervisor, its deficiency, and the new design we propose that removes this deficiency. In Section IV, we present the performance comparison between the two memory de-duplication engine designs, and other evaluation results of the DAH mechanism. In Section V, we conclude this paper with a summary of the main research contributions of this work.

II. MEMORY DE-DUPLICATION

We have designed and implemented a memory de-duplication engine for the Xen hypervisor. At start-up time, the engine first computes a hash value for every physical memory page in the machine and stores the hash values in a persistent array. From then on, at the end of each fixed-size dedup interval, the engine computes the hash value of every physical memory page and compares the resulting hash value with the page's corresponding entry in the persistent array. If the two are different, the engine stores the new hash value in the persistent array; otherwise the engine looks up the new hash value in a separate in-memory hash table, which is emptied at the end of each dedup interval. If the lookup is a hit, the current physical memory page is a duplicate; otherwise the hash value of the current physical memory page is stored in the hash table. To de-duplicate a physical memory page, the engine uses the nominate() interface to mark this page, and optionally the page it hits in the hash table, as read-only, and follows with a share() call to indicate to the hypervisor that this page and the page it hits in the hash table can potentially be merged. The hypervisor then performs a byte-by-byte comparison between these two pages, and reclaims one of them after marking the remaining one as copy-on-write. The byte-by-byte comparison serves as a last-line defense against false de-duplication due to hash collisions.

In the above design, we assume that pages that remain read-only are the more promising candidates for de-duplication. To track pages that are dirtied within each dedup interval, we perform per-page hashing and compare each page's hash values at consecutive intervals. This mechanism is much simpler and more portable than the dirty page tracking mechanism built into the Xen hypervisor, and is quite efficient, around 5 µsec per page. Still, per-page hash computation is the most time-consuming part of the memory de-duplication engine. The memory de-duplication engine runs in the Dom0 domain of a physical machine virtualized by the Xen hypervisor, except for the per-page hash computation part, which is carried out by the hypervisor because only the hypervisor has the privilege of accessing the machine's physical pages. If the memory de-duplication engine computes the hash value of
each physical page by making a separate hypercall, the fixed overhead associated with each such hypercall is too high to be acceptable. Therefore the memory de-duplication engine requires the hypervisor to provide some sort of hypercall aggregation mechanism to reduce the per-page hashing cost.

In Xen, when a guest OS issues a hypercall, the hypervisor takes control via a trap (interrupt 82h on the x86 architecture), and everything inside the guest OS is blocked until the hypercall returns. To service a hypercall, the hypervisor calls the corresponding hypercall routine with the input parameters passed from the guest OS. When the hypercall routine is done, Xen returns the hypercall's result and control back to the calling guest OS. To reduce the total performance overhead associated with a series of multiple calls, Xen provides a multicall API that lets a guest OS batch multiple hypercalls into a single multicall and call into the hypervisor only once for all of them. This multicall API is mainly used by the memory management module (e.g., to update multiple entries of a memory page table) and by the front-end and back-end NIC drivers in the guest OS. The Xen hypervisor services each hypercall in a multicall iteratively, and returns the results of these hypercalls to the calling guest OS after all of them are completed. However, immediately after completing a hypercall of a multicall, the hypervisor checks whether there is any pending soft interrupt or event before moving on to the next hypercall in the multicall. If the hypervisor finds a pending soft interrupt or event, it stops servicing the current multicall and processes the pending soft interrupt or event, which in turn may cause the guest OS calling the multicall to be preempted (the details are explained in Section III) and another guest OS to be scheduled. After control is passed back to the virtual CPU associated with the calling guest OS, the hypervisor continues the service of the (interrupted) multicall, from the hypercall immediately after the last completed one.

The multicall mechanism greatly reduces the fixed overheads of making hypercalls because it can reduce the per-hypercall overhead by a factor of M if M hypercalls are batched into a multicall. Unfortunately, the multicall mechanism has a drawback: the hypervisor servicing a multicall could potentially starve other domains running on the same physical machine, because the hypervisor continues the execution of a multicall until certain hypervisor events occur. The Linux kernel also provides a multicall API, which allows an application to buffer multiple hypercalls in a queue inside the kernel and to flush the queued hypercalls by issuing a multicall at a later time. Linux's multicall API is therefore more flexible than Xen's because it decouples the accumulation of hypercalls from their issue to the hypervisor; however, it still suffers from the same starvation problem. In the next section, we describe a new hypercall aggregation mechanism that achieves the same performance benefit as Xen's multicall but removes its weakness of starving other domains.
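For reference, one dedup interval of the engine described above can be summarized by the following sketch. The helper names hash_page(), lookup_hash(), insert_hash(), and reset_hash_table() are hypothetical placeholders; nominate() and share() stand for the interfaces introduced earlier; and in the actual prototype the hashing step is delegated to the hypervisor through the aggregation mechanism discussed next.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers; the real engine obtains per-page hashes from the
 * hypervisor rather than computing them directly in Dom0. */
uint64_t hash_page(unsigned long pfn);                 /* hash the page's content      */
bool     lookup_hash(uint64_t h, unsigned long *peer); /* hit in per-interval table?   */
void     insert_hash(uint64_t h, unsigned long pfn);
void     reset_hash_table(void);
int      nominate(unsigned long pfn);                  /* mark the page read-only      */
int      share(unsigned long pfn, unsigned long peer); /* byte-compare, then merge     */

/* One dedup interval over all physical pages.
 * prev_hash[] is the persistent array of last-round hash values. */
void dedup_interval(uint64_t *prev_hash, unsigned long nr_pages)
{
    for (unsigned long pfn = 0; pfn < nr_pages; pfn++) {
        uint64_t h = hash_page(pfn);
        if (h != prev_hash[pfn]) {          /* page was dirtied since last round */
            prev_hash[pfn] = h;
            continue;
        }
        unsigned long peer;
        if (lookup_hash(h, &peer)) {        /* potential duplicate found */
            if (nominate(pfn) == 0 && nominate(peer) == 0)
                share(pfn, peer);           /* hypervisor byte-compares and merges */
        } else {
            insert_hash(h, pfn);
        }
    }
    reset_hash_table();                     /* table is emptied at interval end */
}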
Fig. 1. Flow chart of how the Xen hypervisor schedules tasklets, softirqs, and the schedule function. The two do_softirq boxes refer to the same code snippet, whose internal details are expanded in the figure.
III. DEFERRABLE LOW-OVERHEAD HYPERCALL

The proposed deferrable aggregate hypercall (DAH) mechanism works as follows: 1) A guest OS issues a DAH call. 2) The hypervisor inserts the DAH request into the DAH queue and immediately returns control to the calling guest OS, so the caller does not need to block. 3) Processing of the DAH is left completely to the discretion of the hypervisor. 4) The calling guest OS gets a notification when the DAH is finished.

Most operating systems support a deferrable function mechanism with which the kernel performs background work, and such a mechanism could serve as the basis for implementing the proposed DAH mechanism. The Xen hypervisor supports a similar deferrable function mechanism. However, there are two problems with Xen's deferrable functions. First, they are not exposed to guest operating systems. Second, there are certain flaws in their scheduling logic that render them inadequate for our purpose.

A. Xen's Deferrable Function

Traditional operating systems defer certain non-critical kernel tasks associated with an interrupt to outside the interrupt handler, because the interrupt handler should perform as little work as possible so as to keep the kernel as responsive as possible. To execute these non-critical kernel tasks, Linux supports two mechanisms, softirq and tasklet [2], both of which are referred to as deferrable functions.
The Xen hypervisor also supports tasklets and softirqs, which correspond to the functions do_tasklet and do_softirq, respectively, in Figure 1. Each tasklet corresponds to a callback function and is dynamically scheduled after a pointer to the callback function is placed in a per-PCPU (physical CPU) tasklet queue. When the do_tasklet function is called, Xen removes the first callback function from the tasklet queue and executes it. As for softirqs, Xen statically defines two important ones: SCHED_SOFTIRQ, which calls the schedule function, and TIMER_SOFTIRQ, which executes the software timer-related functions, i.e., functions registered by other hypervisor components to perform tasks at a certain time. The Xen hypervisor raises a softirq by setting a certain bit in a per-PCPU variable, putting the corresponding softirq in a pending state. As soon as do_softirq is called, all raised softirqs are processed one by one, and their pending bits are cleared. The current Xen implementation clears all the entries in the tasklet queue before moving on to the softirq queue, all of whose entries Xen will also clear before moving on.

Figure 1 shows the scheduling logic of the Xen hypervisor when executing tasklets and softirqs. Each PCPU is initialized with an idle VCPU context that loops inside the idle_loop function, which consists of two parts: do_tasklet, as discussed above, and do_softirq, which does the following:

• If TIMER_SOFTIRQ is raised due to a hardware timer interrupt, all registered software timer callback functions are checked to determine whether they need to be executed, SCHED_SOFTIRQ is raised, and the hardware timer (e.g., the local APIC timer) is re-programmed to set up the next timer firing.

• If SCHED_SOFTIRQ is raised, do_softirq calls the schedule function, which picks a new VCPU context based on the following rules. First, if the tasklet queue is not empty, the scheduler chooses the idle VCPU context to run on the current PCPU. Otherwise, the scheduler picks the next VCPU context based on the hypervisor's current scheduling policy, e.g., SEDF or Credit [3][4]. If the idle VCPU is chosen, the PCPU enters idle_loop; otherwise, the context of a non-idle VCPU is restored, and the corresponding domain continues from where it left off last time.
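The scheduling flow of Figure 1 and of the two cases above can be summarized by the following simplified model. It is an illustrative sketch only: the type names and helper functions (softirq_pending, pick_next_vcpu, and so on) are placeholders for the corresponding Xen internals, not the actual source.

/* Simplified model of the per-PCPU scheduling flow in Figure 1. */
enum softirq_bit { TIMER_SOFTIRQ_BIT, SCHED_SOFTIRQ_BIT };

struct pcpu;                                   /* opaque per-physical-CPU state */
struct vcpu;                                   /* opaque virtual-CPU context    */

int          softirq_pending(struct pcpu *p, enum softirq_bit b);
void         clear_softirq(struct pcpu *p, enum softirq_bit b);
void         raise_softirq(struct pcpu *p, enum softirq_bit b);
void         run_expired_software_timers(struct pcpu *p);
void         reprogram_hardware_timer(struct pcpu *p);
int          tasklet_queue_nonempty(struct pcpu *p);
void         do_tasklet(struct pcpu *p);       /* runs one queued callback      */
struct vcpu *idle_vcpu(struct pcpu *p);
struct vcpu *pick_next_vcpu(struct pcpu *p);   /* SEDF or Credit policy         */
void         context_switch(struct pcpu *p, struct vcpu *next);

void do_softirq_model(struct pcpu *p)
{
    if (softirq_pending(p, TIMER_SOFTIRQ_BIT)) {
        clear_softirq(p, TIMER_SOFTIRQ_BIT);
        run_expired_software_timers(p);
        raise_softirq(p, SCHED_SOFTIRQ_BIT);   /* force a scheduling decision */
        reprogram_hardware_timer(p);           /* e.g., local APIC timer      */
    }
    if (softirq_pending(p, SCHED_SOFTIRQ_BIT)) {
        clear_softirq(p, SCHED_SOFTIRQ_BIT);
        /* While tasklets are pending, the idle VCPU is always chosen. */
        struct vcpu *next = tasklet_queue_nonempty(p) ? idle_vcpu(p)
                                                      : pick_next_vcpu(p);
        context_switch(p, next);
    }
    /* ... other statically defined softirqs are processed here ... */
}

/* The idle VCPU loops here, draining tasklets before softirqs. */
void idle_loop_model(struct pcpu *p)
{
    for (;;) {
        while (tasklet_queue_nonempty(p))
            do_tasklet(p);
        do_softirq_model(p);
    }
}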
To ensure the execution priority of tasklets and softirqs, the Xen hypervisor uses the following heuristics:

• As long as the tasklet queue of a PCPU is non-empty, the schedule function selects the PCPU's corresponding idle VCPU to execute next.

• When a tasklet is scheduled, SCHED_SOFTIRQ is also raised so that the schedule function is called within do_softirq, which selects the idle VCPU to run if there are pending tasklets.

• When a PCPU is running in a non-idle VCPU context, do_softirq is called each time a hypercall or hardware interrupt returns.

The current implementation of Xen's deferrable functions has certain problems. Because do_softirq cannot be invoked until do_tasklet is completed, the scheduler may not get called and the hardware timer may not get re-programmed. Thus, a tasklet must not run for a long time; otherwise, softirqs may be blocked. Accordingly, developers have to separate a piece of work into multiple smaller tasklets, and each tasklet has to schedule itself for the next round, as shown in the following pseudo code:

void tasklet_fn(unsigned long data){
    /* do something or part of the job */
    if(my_job != done)
        tasklet_schedule(&my_tasklet);
}

Unfortunately, the above solution does not work because it would still starve other non-idle VCPUs queued on the same PCPU. The reason is that this approach generates a new tasklet every time an existing tasklet runs, and the scheduler always picks the idle VCPU over non-idle VCPUs when there are still tasklets to run. In other words, non-idle VCPUs will never get scheduled until the entire piece of work, which is implemented as a series of tasklets, is completed. To mitigate this starvation problem, the developer could instead schedule the new tasklet for the next round on another PCPU, as shown in the following code:

void tasklet_fn(unsigned long data){
    /* do something or part of the job */
    if(my_job != done){
        next_cpu = choose_next_cpu();
        tasklet_schedule_on_cpu(&my_tasklet, next_cpu);
    }
}
Fig. 2. Comparison of multicall and DAH for a multi-threaded program using a Fork-Join model.

While this workaround does work in a multi-core CPU environment, it does not work in a single-core environment because there is no other core on which to execute the new tasklets.

B. DAH Design

We propose a DAH mechanism in which a separate DAH queue is assigned to each guest domain. Each entry of the DAH queue contains a function pointer to a hypercall routine that a guest domain wants the hypervisor to run on its behalf. A guest domain makes a DAH call to insert entries into its DAH queue, and the call returns as soon as the insertion is done. After the scheduler selects a VCPU belonging to a guest domain, the hypervisor processes the entries in the guest domain's DAH queue before handing control to the guest domain's OS. This additional step is introduced at the place marked 2 in Figure 1.

Every time the hypervisor visits a guest domain's DAH queue, it processes nr_entries entries in the queue before transferring control to the guest domain. In terms of accounting, the resources used by processing these DAH entries are charged to the guest domain that inserted them, which is fair because these hypercall routines are executed on behalf of the guest domain rather than the hypervisor. For example, suppose a VCPU is given 30 ms of execution time by the scheduler and its hypercall routines cost 5 ms; then this VCPU should be given only 25 ms for its own execution. In contrast, the current Xen implementation always charges the resources used by tasklets to the idle VCPU.
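The following sketch illustrates this queue-draining and accounting step. The structure and helper names (dah_entry, dah_queue_pop, charge_time_to_domain, notify_via_event_channel) are hypothetical placeholders used for exposition; they approximate, rather than reproduce, our modification to the Xen scheduler.

/* Hypothetical per-domain DAH queue entry: one deferred hypercall routine. */
struct dah_entry {
    long (*routine)(void *args);   /* hypercall routine to run on the domain's behalf */
    void  *args;
};

struct domain;                                        /* opaque guest domain          */
struct dah_entry *dah_queue_pop(struct domain *d);    /* NULL when the queue is empty */
int   dah_queue_empty(struct domain *d);
unsigned long now_ns(void);
void  charge_time_to_domain(struct domain *d, unsigned long ns);
void  notify_via_event_channel(struct domain *d);     /* completion notification      */

/* Called by the scheduler just before control is handed to a VCPU of
 * domain d: process up to nr_entries queued hypercall routines and
 * charge the time they consume to d rather than to the idle VCPU. */
void process_dah_queue(struct domain *d, unsigned int nr_entries)
{
    unsigned long start = now_ns();

    for (unsigned int i = 0; i < nr_entries; i++) {
        struct dah_entry *e = dah_queue_pop(d);
        if (e == NULL)
            break;
        e->routine(e->args);
    }

    charge_time_to_domain(d, now_ns() - start);

    if (dah_queue_empty(d))
        notify_via_event_channel(d);    /* all batched hypercalls are done */
}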
To receive a notification when the work underlying a DAH call is done, the program issuing the DAH call can register an event channel with the hypervisor to indicate its intention to receive such a notification. The API of a DAH call includes four parameters, as shown below:

dah_call(call_list, nr_calls, callback_fn, nr_entries);

The first two parameters are the same as in a multicall: call_list is an array, each element of which stores the op code of a hypercall and the hypercall's parameters; nr_calls indicates the number of entries in call_list. The third parameter, callback_fn, is a callback function pointer, which is called by the hypervisor after all the hypercalls in call_list are done. The last parameter, nr_entries, is used to tune the processing granularity of each DAH call. This parameter gives the developer the flexibility to limit the amount of work done upon each visit to the DAH queue, thus preventing the tasklet-related starvation observed in the current Xen hypervisor.

Compared with multicall, DAH scales better for guest domains that are assigned multiple VCPUs, especially when multi-threaded applications with inter-thread dependencies run on these domains. Suppose a multi-threaded application using the Fork-Join model [5], [6] runs on a guest domain that is assigned 4 VCPUs, which can run on 4 PCPUs simultaneously. As shown in Figure 2, the application contains a master thread that creates a group of parallel threads (FORK), with all these threads eventually synchronizing with the master thread and then terminating (JOIN). Further assume that another program running in the same guest domain issues a deferrable function call consisting of ten hypercall routines. If this deferrable function call is implemented with a multicall, the ten hypercall routines run on the same VCPU, some of the parallel threads may be seriously delayed, and the entire application's execution time could thus increase significantly because of the JOIN at the end. In short, a multicall lacks the ability to leverage all available VCPUs in the calling guest domain: it blocks the VCPU from which the multicall is issued until all the hypercalls in the multicall are done. In contrast, DAH can distribute the ten hypercall routines over all four VCPUs, equalize the impact of this DAH call on all parallel threads, and thus minimize the overall delay introduced by this background DAH call.

IV. PERFORMANCE EVALUATION

We ran a series of experiments using our memory de-duplication engine prototype to evaluate the proposed DAH mechanism and compare it with Xen's multicall mechanism. The test machine contains an Intel Xeon E5640 processor, which is a 4-core processor with hyper-threading and VT [7], [8] enabled, 24 GB of RAM, and a 500-GB hard disk. The host runs Xen 4.1 with CentOS 5.5 as Dom0.
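For concreteness, the sketch below shows how a Dom0 component could batch a large number of per-page hashing requests and submit them either as one multicall or as one DAH call with the four parameters defined above. The entry layout, the op code HYPERCALL_OP_HASH_PAGE, the wrapper issue_multicall(), and the exact callback signature are illustrative assumptions rather than the exact Xen or prototype interfaces.

#include <stdlib.h>

/* Hypothetical batched-call entry: op code plus arguments, similar in
 * spirit to Xen's multicall entries. */
struct call_entry {
    unsigned long op;            /* e.g., a hypothetical HYPERCALL_OP_HASH_PAGE */
    unsigned long args[2];       /* here: page frame number, result slot        */
};

#define HYPERCALL_OP_HASH_PAGE 1 /* illustrative op code, not a real Xen op     */

/* Assumed wrappers around the two mechanisms compared in Section III. */
int issue_multicall(struct call_entry *list, unsigned int nr_calls);
int dah_call(struct call_entry *list, unsigned int nr_calls,
             void (*callback_fn)(void), unsigned int nr_entries);

static void hashing_done(void) { /* e.g., wake the de-duplication engine */ }

void batch_page_hashing(unsigned long nr_pages, int use_dah)
{
    struct call_entry *list = calloc(nr_pages, sizeof(*list));
    if (list == NULL)
        return;

    for (unsigned long pfn = 0; pfn < nr_pages; pfn++) {
        list[pfn].op      = HYPERCALL_OP_HASH_PAGE;
        list[pfn].args[0] = pfn;
    }

    if (use_dah)
        /* Returns immediately; hashing proceeds at the hypervisor's discretion,
         * 15 entries per visit to the DAH queue. The list is intentionally not
         * freed here: it must stay valid until callback_fn runs. */
        dah_call(list, nr_pages, hashing_done, 15);
    else
        /* Blocks the calling VCPU until all nr_pages hypercalls finish. */
        issue_multicall(list, nr_pages);
}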
In the first experiment, we executed 50,000 hypercalls by issuing a single DAH call and a single multicall from Dom0, and measured the amount of time required to complete these hypercalls under the two configurations. Each hypercall is designed to consume about 1.2 milliseconds of PCPU time on our test machine, which is slightly longer than most hypercalls in Xen. In addition, we varied the number of VCPUs assigned to Dom0 to examine its impact on the completion time of executing these 50,000 hypercalls. As shown in Table I, the completion time of DAH decreases with the number of VCPUs assigned to Dom0, because our DAH implementation is able to execute the hypercalls in a DAH call on all VCPUs assigned to the guest domain making the DAH call, in this case Dom0. In contrast, the number of VCPUs has no impact on the completion time of the multicall configuration, because the hypercalls in a multicall can only run on a single VCPU. When the number of VCPUs is fixed, the completion time of the DAH configuration decreases with nr_entries, because increasing the granularity of each visit to the DAH queue cuts down the number of visits and thus the overall elapsed time.

TABLE I
EXECUTION TIME (SECONDS) OF MULTICALL AND DAH WITH DIFFERENT NUMBERS OF VCPUS AND nr_entries

Mechanism               2 VCPUs    4 VCPUs    8 VCPUs
Multicall                131.89     131.49     132.08
DAH (nr_entries=1)       751.23     375.59     187.81
DAH (nr_entries=10)       75.13      37.57      18.79
DAH (nr_entries=15)       50.09      25.05      12.53
In the second experiment, we ran in a guest domain a parallel make application that builds the Linux kernel (version 2.6.37), and measured the impact of running 50,000 hypercalls from the same guest domain, using multicall and DAH, on the parallel make application's execution time. The number of VCPUs assigned to the guest domain on which the parallel make application runs is set to 8, and the nr_entries parameter of DAH is set to 15. All other settings were the same as in the first experiment. In order to utilize the VCPUs as much as possible, we passed the -j8 option to the make command. The results are listed in Table II. When there is no background hypercall, the Linux kernel build task takes 117.75 seconds to complete. When there are 50,000 background hypercalls made via a multicall, the total execution time of the parallel Linux kernel build task becomes 164.16 seconds, a 39.4% overhead. When these background hypercalls are made via a DAH call, the total execution time of the parallel Linux kernel build task becomes 131.07 seconds, an 11.3% overhead. This result demonstrates that DAH is indeed more effective in reducing contention with other concurrent applications when scheduling its execution.

TABLE II
PERFORMANCE IMPACT OF 50,000 BACKGROUND HYPERCALLS ON A LINUX KERNEL BUILD TASK USING DAH AND MULTICALL

Configuration          Execution Time (second)    Impact (%)
No Background Load     117.75                     -
Multicall              164.16                     39.4
DAH                    131.07                     11.3
In the third experiment, we built two memory de-duplication engines, one using DAH and the other using multicall, and compared their de-duplication speed when running on the same number of memory pages, as well as their impacts on other concurrently running guest domains. The memory de-duplication engine ran on Dom0, and we additionally ran two other guest domains on the same machine simultaneously. One guest domain ran the Sysbench threads test [9]. This test creates --num-threads threads and sets up --mutex-num mutexes; each thread executes requests --max-requests times, where each request consists of the following steps: lock the mutex, yield the CPU (which puts the thread back into the run queue), and unlock the mutex when the thread is rescheduled. In this experiment, --num-threads is set to 4, --mutex-num to 4096, and --max-requests to 150000. The other guest domain ran the Sysbench CPU test. This test creates --num-threads threads to execute requests concurrently until the total number of requests exceeds the value of --cpu-max-prime, where --num-threads is set to 4 and --cpu-max-prime to 150000. Both benchmarks are CPU-bound. To evaluate the performance impact of the memory de-duplication engine on these two guest domains, we configured the memory de-duplication engine to perform three rounds, which means that it scans and computes the hash values for the physical memory pages on the machine three times during this experiment.

Table III shows the performance impact comparison among three configurations: Baseline (no memory de-duplication), memory de-duplication using multicall, and memory de-duplication using DAH. When memory de-duplication is disabled, the execution times of the Sysbench threads and CPU tests are 155 and 143 seconds, respectively. When memory de-duplication using multicall is enabled, the execution times of the Sysbench threads and CPU tests become 169 and 147 seconds, respectively, corresponding to performance impacts of 9% and 2.8%. When memory de-duplication using DAH is enabled, the execution times of the Sysbench threads and CPU tests are 163 and 146 seconds, respectively, corresponding to performance impacts of 5.2% and 2%. The performance impact on the Sysbench threads test is higher than that on the Sysbench CPU test, because the Sysbench threads test uses a large number of mutex locks for synchronization and thus tends to suffer more, as explained by the Fork-Join model in Figure 2: when a multicall blocks a thread T, it also blocks all threads that directly or indirectly contend for locks already held by T. Nonetheless, the performance impacts of the memory de-duplication engine on the two guest domains are relatively minor, regardless of whether DAH or multicall is used for per-page hashing. This is because the hypercalls for per-page hashing only block the VCPUs of Dom0, but have no
effect on the VCPUs of the guest domains. Because these two Sysbench tests are CPU-bound, they perform little I/O and are therefore relatively immune to the blocking of VCPUs on Dom0.

TABLE III
PERFORMANCE IMPACTS OF DOM0-BASED MEMORY DE-DUPLICATION ON CPU-BOUND GUEST DOMAINS
(EXECUTION TIME IN SECONDS, IMPACT IN PARENTHESES)

Configuration    Sysbench Threads    Sysbench CPU
Baseline         155                 143
Multicall        169 (9%)            147 (2.8%)
DAH              163 (5.2%)          146 (2%)
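To illustrate why the threads test is more sensitive to blocking than the CPU test, the following sketch approximates the lock-yield-unlock pattern it exercises, using the thread, mutex, and request counts listed above. It is an illustrative approximation, not the actual Sysbench source.

#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

#define NUM_THREADS   4
#define NUM_MUTEXES   4096
#define MAX_REQUESTS  150000

static pthread_mutex_t mutexes[NUM_MUTEXES];

/* Each request locks a mutex, yields the CPU (so the thread re-enters the
 * run queue), and unlocks once it is scheduled again. */
static void *worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < MAX_REQUESTS / NUM_THREADS; i++) {
        pthread_mutex_t *m = &mutexes[rand() % NUM_MUTEXES];
        pthread_mutex_lock(m);
        sched_yield();
        pthread_mutex_unlock(m);
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[NUM_THREADS];

    for (int i = 0; i < NUM_MUTEXES; i++)
        pthread_mutex_init(&mutexes[i], NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}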
To evaluate the performance impact of de-duplicating memory pages in Dom0 on a guest domain running an I/O-bound application, we ran the dd command in a guest domain to create a 4-GB file, and measured the elapsed time with and without memory de-duplication. The results are shown in Table IV. The performance impact of multicall-based memory de-duplication varied from 30.9% to 47%, whereas the performance impact of DAH-based memory de-duplication varied only from 3.6% to 14.6%. This shows that DAH has much less impact than multicall on I/O-intensive applications running in guest domains. In Xen, Dom0 is responsible for all disk and network I/O requests from guest domains on the same physical machine. Whenever Dom0 is blocked because of hypercalls, the performance impact due to this blocking is more likely to propagate to guest domains that need more services from Dom0 (I/O-bound) than to those that don't (CPU-bound). Moreover, the performance impact due to hypercall blocking decreases as the number of VCPUs assigned to Dom0 increases, because a hypercall only blocks the VCPU from which it is issued, and the other Dom0 VCPUs can be used to service I/O requests from guest domains.

TABLE IV
PERFORMANCE IMPACTS OF DOM0-BASED MEMORY DE-DUPLICATION ON A DD TASK IN A GUEST DOMAIN
(EXECUTION TIME IN SECONDS)

Mechanism    1 VCPU    4 VCPUs    8 VCPUs
Baseline     110       103        110
Multicall    162       152        144
DAH          119       118        114
In summary, DAH is better than multicall because it can effectively leverage the multiple VCPUs assigned to a guest domain, and because it minimizes the impact of a large number of hypercalls on applications by spreading the hypercall processing out over time, avoiding the temporary starvation of applications running in the same domain as the hypercalls or in other concurrently running domains.

V. CONCLUSION AND FUTURE WORK

Memory de-duplication is a critical technology for maximizing the number of VMs that can run on a physical machine with a
given amount of resources. The most time-consuming part of memory de-duplication is per-page hashing, which applies a hash function to a physical memory page's contents to compute an "abstract" of the page. On a virtualized server, because the hypervisor is the only entity that has the privilege to access all the physical memory pages, it is natural to delegate the per-page hashing computation to the hypervisor while leaving the rest of the memory de-duplication engine outside the hypervisor. Existing interfaces provided by modern hypervisors, such as Xen, are inadequate for per-page hashing, because they either incur high per-invocation overhead or result in noticeable impact on the applications running in the calling domain or in concurrently running guest domains. Because memory de-duplication is meant to be an optimization, it is essential to keep its performance impact to a minimum, if not zero. Because the hypervisor has unique knowledge about guest domains, there are times when it is necessary to make a large number of hypercalls in a row. In this paper, we propose a DAH mechanism that is designed to achieve both low invocation overhead and low performance impact on running applications. Compared to multicall, which incurs low invocation overhead but high performance impact, DAH is more efficient because it is capable of effectively exploiting the multiple VCPUs assigned to the domain making the DAH call, and yet minimizes the impact of a large number of hypercalls on concurrently running applications by scheduling the execution of hypercalls at a lower priority than these applications.

The current DAH prototype requires the developer to explicitly specify the nr_entries parameter, which represents a trade-off between the extent of blocking and the execution efficiency of hypercalls. We are currently working on an automated mechanism that dynamically tunes the nr_entries parameter based on the physical machine's PCPU usage information and the calling guest domain's VCPU usage information. For example, DAH should process more entries in a DAH queue when the corresponding VCPU or PCPU is not highly utilized, and stop immediately when there is a pending event in the corresponding guest OS. This mechanism could hopefully reduce the performance impact of DAH to close to zero, regardless of the hypercalls contained within a DAH call.

REFERENCES

[1] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," SIGOPS Oper. Syst. Rev., vol. 37, pp. 164–177, October 2003. [Online]. Available: http://doi.acm.org/10.1145/1165389.945462
[2] D. Bovet and M. Cesati, Understanding the Linux Kernel. O'Reilly & Associates, 2005.
[3] L. Cherkasova, D. Gupta, and A. Vahdat, "Comparison of the three CPU schedulers in Xen," SIGMETRICS Perform. Eval. Rev., vol. 35, pp. 42–51, September 2007. [Online]. Available: http://doi.acm.org/10.1145/1330555.1330556
[4] Credit-Based CPU Scheduler, http://wiki.xensource.com/xenwiki/CreditScheduler.
[5] X. Martorell, E. Ayguadé, N. Navarro, J. Corbalán, M. González, and J. Labarta, "Thread fork/join techniques for multi-level parallelism exploitation in NUMA multiprocessors," in Proceedings of the 13th International Conference on Supercomputing, ser. ICS '99. New York, NY, USA: ACM, 1999, pp. 294–301. [Online]. Available: http://doi.acm.org/10.1145/305138.305206
[6] D. Lea, "A Java fork/join framework," in Proceedings of the ACM 2000 Conference on Java Grande, ser. JAVA '00. New York, NY, USA: ACM, 2000, pp. 36–43. [Online]. Available: http://doi.acm.org/10.1145/337449.337465
[7] R. Uhlig, G. Neiger, D. Rodgers, A. L. Santoni, F. Martins, A. Anderson, S. Bennett, A. Kaegi, F. Leung, and L. Smith, "Intel virtualization technology," Computer, vol. 38, no. 5, pp. 48–56, 2005.
[8] G. Neiger, A. Santoni, F. Leung, D. Rodgers, and R. Uhlig, "Intel virtualization technology: Hardware support for efficient processor virtualization," Intel Technology Journal, vol. 10, no. 3, Aug. 2006. [Online]. Available: http://download.intel.com/technology/itj/2006/v10i3/v10-i3-art01.pdf
[9] SysBench: a system performance benchmark, http://sysbench.sourceforge.net.