Energy-Efficient Memory Management in Virtual Machine Environments

Lei Ye, Chris Gniady, John H. Hartman
Department of Computer Science, University of Arizona, Tucson, USA
Email: {leiy, gniady, jhh}@cs.arizona.edu

Abstract—Main memory is one of the primary shared resources in a virtualized environment. Current trends in supporting a large number of virtual machines increase the demand for physical memory, making energy-efficient memory management more significant. Several optimizations for memory energy consumption have recently been proposed for standalone operating system environments. However, these approaches cannot be directly used in a virtual machine environment, because a layer of virtualization separates the hardware from the operating system and the applications executing inside a virtual machine. We first adapt existing mechanisms to run at the VMM layer, offering transparent energy optimizations to the operating systems running inside the virtual machines. Static approaches have several weaknesses, so we propose a dynamic approach that is able to optimize energy consumption for the currently executing virtual machines and adapt to changing virtual machine behaviors. Through detailed trace-driven simulation, we show that the proposed dynamic mechanisms can reduce memory energy consumption by 63.4% with only a 0.6% increase in execution time as compared to a standard virtual machine environment.

Keywords-Energy Management; Virtual Machine; Memory
I. INTRODUCTION

Current computing infrastructures use virtualization to increase resource utilization by deploying multiple virtual machines on the same hardware. Virtualization is particularly attractive for data centers, cloud computing, and hosting services; in these environments a computer system is typically configured with a large physical memory capable of supporting many virtual machines. For example, the HP ProLiant DL580 G7 Server can support up to 1TB of memory [1]. As a result, memory can consume a significant fraction of a computer system's energy, making it worthwhile to consider ways to improve memory energy efficiency.

In addition to improving memory energy efficiency by creating denser memory modules, memory hardware now provides low-power states that can be controlled by software. In a non-virtualized system this control is handled by the operating system [2], [3], [4]; however, in a virtualized system the additional layer of virtualization makes energy management more challenging. An operating system has a detailed view of the running applications, the demand they place on the system, and the power state of all memory in the
system. This global knowledge allows the operating system great flexibility and aggressiveness in managing memory.

Memory management in a virtualized environment is more challenging, since virtualization decouples the underlying physical hardware from the guest operating system that runs in the virtual machine. The operating system no longer has a detailed view of the hardware and the power state of all system memory. Neither is it easy to implement energy management techniques below the operating system, in the Virtual Machine Monitor (VMM) layer, since the VMM does not have process-level knowledge of the running applications.

In this paper we address the above challenges in a distinct way. We optimize energy efficiency in the VMM through efficient memory allocation and dynamic management that does not require any knowledge of process-level execution within the guest operating system. The optimizations are transparent to the guest operating system; as a result, energy efficiency is improved without any modifications to the guest operating system code. We make the following contributions in this paper: (1) we implement energy-aware memory allocation in Xen to minimize the total number of active memory ranks; (2) we design and implement lightweight memory usage tracing and dynamic power state transition mechanisms that preserve the performance of the memory system for the virtual machines while improving energy efficiency.

II. BACKGROUND AND MOTIVATION

A. Power-Aware Memory Module

Energy management in the operating system relies on hardware power states. In this paper, we consider the DDR2 RAM that serves as main memory in our system. DDR2 RAM is packaged into DRAM modules, each consisting of two ranks. Each rank includes several physical devices, registers, and a phase-locked loop (PLL). The smallest unit of power management in DDR2 RAM is a rank, and all devices in a rank operate at the same power state [5]. Each rank can independently operate in one of four power states: (1) the active state (ACT), in which memory is reading or writing data; (2) the pre-charge state (PRE) is a high-power idle state where the next memory I/O can take place immediately
Figure 1. Power state transitions and latencies for a Micron 1GB DDR2-800 memory rank.
at the next clock cycle; (3) the power down state (PD) is a lower power idle state with rank components, such as the sense amplifiers and the row/column decoders, disabled; and (4) the self-refresh state (SR) is the lowest power idle state, which additionally disables the PLL and registers in the rank and as a result incurs the longest delays. Ranks in the SR and PD states have to be transitioned to the PRE state before a memory I/O can be serviced. These state transitions incur delays, as shown in Figure 1, and can degrade performance if not taken into consideration. Therefore, energy management mechanisms have to carefully trade performance for energy by selecting appropriate power states for the ranks in the system.

B. Energy Management in Standard Operating Systems

Previous research has proposed several methods for reducing the energy consumption of main memory based on power-aware memory allocation and dynamic power state transitions. Standard memory allocators treat every request uniformly and assign it to any free region in physical memory. Huang et al. [3] noticed energy inefficiencies in this approach and proposed to allocate pages more compactly, using the NUMA software layer, to minimize the number of memory ranks that have to be on. Lee et al. [4] proposed similar ideas to reduce the number of active memory units for the buffer cache. Similarly, Tolentino et al. [6] proposed proactive page allocation that enables the kernel to allocate pages from a particular physical memory device and attempts to pack allocations to minimize the total number of active memory devices. We use similar ideas to allocate virtual machines to a minimum number of ranks before we apply dynamic energy management. Lebeck et al. further explored several policies, such as random, sequential first-touch, and frequency [7], to minimize the memory footprint of running processes. Finally, Marchal et al. [8] proposed dynamic memory allocators to handle bank assignment on shared multi-banked SDRAM memories for multimedia applications.

Research has also focused on optimizing power states for the accessed ranks. Lebeck et al. [7] introduced static policies that place a memory rank in a single power state and dynamic policies that transition a memory rank between
different power states according to runtime context. The power-mode transitions can be effectively hidden within the operating system scheduler, during context switches between processes [2]. Alternatively, Zhou et al. [9] utilized a page miss ratio curve to identify and power down memory chips that are not being accessed by any application. Tolentino et al. [6] proposed history-based predictors for memory energy management. Liu et al. [10] presented a distinct mechanism to optimize memory energy efficiency by tolerating errors in non-critical data.

C. Page Migration for Energy Savings

To further reduce the number of active ranks at runtime, we can consider data migration. Delaluz et al. proposed an automatic data migration strategy that dynamically places arrays with temporal affinity into the same set of banks, allowing the use of more aggressive energy-saving modes [11]. Ramamurthy et al. utilized a page migration mechanism for performance-directed energy management [12]. Page migration can be very expensive due to the time and energy overhead of moving data, and also the overhead of the tracking mechanisms used to classify page utilization. Classifying page utilization in hardware is the most efficient approach [9], [13], but such hardware is not available in general-purpose machines. Consequently, several software approaches have been explored [2], [14] that do not require specialized hardware but result in large runtime overheads.

D. Access Monitoring with Performance Counters

Both Intel and AMD include performance monitoring features in their processors [15], consisting of a set of registers that can be configured to track a variety of processor events. The performance monitoring counters can be used to track the memory behavior of applications by monitoring last-level cache misses, which result in memory accesses. The only overhead of performance counter monitoring is setting up and reading the counters, which is very small. However, the low overhead comes at the cost of reduced information content. The counters only report the number of monitored events that occurred during the monitoring period. We do not have exact timing information, such as when the events occurred or whether there was any clustering. Furthermore, we do not know what parts of memory have been accessed. Therefore, performance counter monitoring cannot directly be used to detect popular pages to enable page migration. It can also be challenging to accurately apply performance counters to power state prediction. Despite these limitations, we use performance counters in this paper because they offer a low-overhead solution.

III. DESIGN AND IMPLEMENTATION

In this section, we describe the design and implementation of our system for improving energy efficiency of virtualized
systems. First, we describe an energy-aware memory allocator that takes energy efficiency into account when allocating memory pages to virtual machines. Second, we present a Dynamic Power State Management (DPSM) mechanism that transparently optimizes the energy efficiency of main memory.

A. Energy-Aware Memory Allocation

A VMM typically allocates physical memory to individual virtual machines without considering memory ranks. VMMs use a standard memory allocator, such as Buddy, Slab, Slob, or TLSF (Two-Level Segregate Fit), that is not energy-aware and does not consider the rank to which a page belongs. Consequently, even small virtual address ranges can occupy several memory ranks, and accessing them efficiently requires all of those ranks to be fully powered. This behavior carries over to virtual machines: Figure 2 shows the number of memory ranks allocated for a given virtual machine memory size when using the standard memory allocator in Xen. Xen employs a binary buddy allocator [16], which fragments the allocation across more memory ranks than necessary. Furthermore, allocation becomes more fragmented across ranks the longer a system runs, due to physical address space fragmentation.

Figure 2. Memory ranks used by a virtual machine for standard rank allocation in Xen and energy-aware memory allocation in DPSM.

To improve memory allocation, we investigate the energy-aware allocation developed for PAVM [3]. The PAVM memory allocation was implemented in a standard operating system kernel by using the Non-Uniform Memory Access (NUMA) layer to handle memory allocation at the process level. The NUMA management layer allows us to partition physical memory into virtual memory nodes, each corresponding to a memory rank. We employ the NUMA layer in the Xen hypervisor to handle physical memory page allocations for virtual machines in our system, which is equipped with an AMD Phenom II X4 940 processor and 8GB of Micron DDR2-800 memory. The modified Xen NUMA layer adds memory rank information by adding address ranges for the 1GB memory ranks to numa_emulation(). The binary buddy allocator then supports memory allocation for NUMA optimization on eight virtual memory nodes rather than the default single node. As a result, we enable energy-aware memory allocation by selecting specific memory nodes.
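As a simplified illustration of this rank-aligned partitioning (not the actual Xen code), the following C fragment builds one node descriptor per 1GB rank and maps a machine frame number to the rank that contains it. The structure and function names are hypothetical and are introduced only for this sketch.

    #include <stdint.h>

    #define RANK_SIZE  (1ULL << 30)   /* 1GB per DDR2 rank in our configuration */
    #define NUM_RANKS  8              /* 8GB of physical memory */
    #define PAGE_SIZE  4096ULL

    /* One emulated NUMA node per memory rank. */
    struct rank_node {
        uint64_t base;        /* first machine address covered by this rank */
        uint64_t free_pages;  /* pages still unallocated in this rank */
    };

    struct rank_node ranks[NUM_RANKS];

    /* Partition the machine address space into rank-sized nodes. */
    void init_rank_nodes(void)
    {
        for (int r = 0; r < NUM_RANKS; r++) {
            ranks[r].base = (uint64_t)r * RANK_SIZE;
            ranks[r].free_pages = RANK_SIZE / PAGE_SIZE;
        }
    }

    /* Map a machine frame number to the rank that holds it (-1 if out of range). */
    int mfn_to_rank(uint64_t mfn)
    {
        uint64_t addr = mfn * PAGE_SIZE;
        return addr < (uint64_t)NUM_RANKS * RANK_SIZE ? (int)(addr / RANK_SIZE) : -1;
    }

With this per-rank bookkeeping in place, the allocator can restrict a virtual machine's pages to specific nodes, which is what the placement policies in the next subsection build upon.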
B. Page Allocation Policies

Using the NUMA layer, we consider two page allocation policies: sequential [7] and distributed. The sequential policy allocates pages to virtual machines in the order they are created, so that virtual machines with different memory configurations are packed into a minimal number of memory ranks. This scheme minimizes the total number of active memory ranks, allowing unused ranks to be put into a low power state. However, it can fragment a virtual machine's memory space over several ranks if the memory of a given virtual machine needs to be increased. To minimize the number of ranks allocated to each virtual machine, we use a distributed policy that starts memory allocation for each virtual machine at the beginning of a new rank. If the memory of a virtual machine needs to grow beyond a rank, we start allocation from a new memory rank. When the system runs out of empty ranks, we consider allocation to partially occupied ranks using a worst-fit policy. The worst-fit policy allows further growth of a virtual machine's memory while minimizing fragmentation of the virtual machine's memory space across ranks. This allocation policy is shown in Figure 2, and we can see that it minimizes rank allocation even for a statically allocated system that does not grow memory. In this paper, we allocate the maximum number of virtual machines for the given memory space, and as a result the memory of individual virtual machines does not grow. In a system where virtual machine memory growth is possible, the proposed allocation would offer even more benefit.

C. Improving Energy Efficiency

To take advantage of energy-efficient memory allocation, we must employ power management of individual ranks. The natural optimization is to keep the ranks that hold currently accessed data in the PRE state. This optimization was implemented in a standard operating system as PAVM [3]. We adopt this optimization at the VMM level: in the VMM scheduler, all memory ranks of the virtual machine being scheduled in are powered up to the PRE state. When the virtual machine is de-scheduled, all ranks occupied by the virtual machine are transitioned to the SR state. PAVM offers high performance, since the PRE state allows immediate servicing of memory I/Os, and it can offer energy savings if there are ranks that are not occupied by currently scheduled virtual machines. To improve energy efficiency further, we need to consider power optimizations of ranks that are occupied by data from currently scheduled virtual machines. We therefore consider the on-demand power-down (ODPD) and on-demand self-refresh (ODSR) policies [5], which maintain ranks in the
Figure 3. Entire system energy delay product for the SR, PD, and PRE power states of main memory as a function of the number of memory accesses; SR-PD and PD-PRE mark the crossover thresholds.
PD or SR state, respectively, during the execution of the virtual machine. The memory has to be transitioned to the PRE state before servicing a memory I/O. These transitions take time and can significantly degrade performance when the demand for memory I/Os is high. Similarly, keeping memory in a high power state, such as PRE in the case of PAVM, can hurt energy efficiency if the demand for memory I/Os is very low. Therefore, we need a dynamic approach that matches the power state to the demand placed on memory by the running virtual machine.

D. Capturing Memory Behavior

Due to the high overhead of detailed memory access tracking, we use the AMD CPU performance counters to track memory accesses. Each core has a separate set of performance counters [15], [17] that can be used to track L3 cache misses by setting the performance event-select registers (PerfEvtSeln). A miss in the last-level cache corresponds to a memory access, so by counting L3 cache misses we can count memory accesses for the core that is running a given virtual machine. We virtualize the performance counters for each VM so that concurrent VMs can be tracked separately, by restoring the counters for a scheduled VM and saving the counters for a de-scheduled VM at context switch time.

Since we can only count the number of memory accesses, we do not know in which part of memory they occur. The only information we have is how many memory accesses a given virtual machine performed during its scheduling period on the CPU. However, we can tell which subset of ranks the accesses went to, based on the ranks allocated to the current virtual machine. Energy-efficient memory allocation minimizes the number of ranks, and as a result we can attribute the memory access characteristics to the small set of ranks occupied by the currently executing virtual machine.

E. Dynamic Power State Management

Dynamic management of the power states requires accurate prediction of memory demand for the upcoming period and selection of the best power state for the predicted demand. To accomplish this, DPSM records memory access history for each virtual machine, and therefore for the ranks each
virtual machine occupies. DPSM uses an exponential moving average (EMA) to record the aggregate history of accesses for the virtual machine, calculated according to the following formula:

EMA_t = α × Access_prev + (1 − α) × EMA_(t−1)

Access_prev is the number of memory accesses in the previous scheduling interval, EMA_(t−1) is the previous EMA value, and α is a weighting coefficient that can be tuned to balance the amount of older and newer history in the EMA calculation. In our system, α is chosen to be 0.85. The EMA averages memory accesses across scheduling intervals, and DPSM uses the current EMA to predict the memory demand for the next scheduling interval.

Once the number of memory accesses is predicted for the upcoming interval, DPSM must select the power state that minimizes the system's energy delay product. The energy delay product quantifies both performance and energy impact; a lower energy delay product indicates a better combination of performance and energy savings. We consider the energy delay product of the entire system, since what matters are the energy savings of the entire system and not just the memory subsystem. If we considered only the energy delay product of the memory subsystem, longer delays might appear beneficial because they make the memory itself more energy efficient. However, those longer delays cause the entire system to run longer and consume more energy, making the energy delay product, and the resulting energy efficiency, of the entire system much worse. We measured the power of our quad-core system to be 179.5W, or 44.9W per core. We use this information, combined with the memory power specifications in Figure 1, to calculate the system energy delay product for the different power states. Figure 3 presents the resulting energy delay product curves for keeping memory in the SR, PD, and PRE power states during the scheduling intervals. We observe that if there are fewer memory accesses than the SR-PD threshold during the scheduling interval, the most efficient state is SR. Such an interval is not memory intensive, since most of the data fits in the CPU caches. The interval between the SR-PD and PD-PRE thresholds is best served by the PD state, as it has the lowest energy delay product; this is a wide range of accesses, and the low transition overhead of the PD state best accommodates them. When the number of memory requests rises above the PD-PRE threshold, we are faced with memory-streaming applications that quickly touch a large amount of data, and in this case the best state is PRE, which has the highest performance.

Figure 3 also illustrates the need for dynamic energy management. Keeping memory in one power state during the entire execution does not account for variance in performance demand, and therefore ODSR, ODPD, and PAVM are not the most energy efficient mechanisms for the wide range of applications that may run inside a virtual machine.
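As an illustration (not the actual implementation), the following C sketch shows how the per-VM history and state selection could be organized. The helper names and the numeric SR-PD and PD-PRE threshold constants are placeholders; α = 0.85 matches the value used above, and in our system the thresholds come from the intersection points of the measured energy delay product curves in Figure 3.

    #include <stdint.h>

    enum power_state { STATE_SR, STATE_PD, STATE_PRE };

    /* Crossover points from the system energy delay product curves (Figure 3);
     * the numbers here are placeholders, not measured values. */
    #define SR_PD_THRESHOLD   1000.0
    #define PD_PRE_THRESHOLD  100000.0

    #define EMA_ALPHA 0.85    /* weight of the most recent interval */

    struct vm_mem_history {
        uint64_t counter_at_schedule; /* virtualized L3-miss counter when the VM was scheduled in */
        double   ema;                 /* EMA of memory accesses per scheduling interval */
    };

    /* Called when a VM is de-scheduled: fold the L3 misses (memory accesses)
     * observed during the interval into the per-VM EMA. */
    void update_history(struct vm_mem_history *h, uint64_t counter_now)
    {
        uint64_t accesses = counter_now - h->counter_at_schedule;
        h->ema = EMA_ALPHA * (double)accesses + (1.0 - EMA_ALPHA) * h->ema;
    }

    /* Called when a VM is scheduled in: pick the power state for its ranks
     * that minimizes the whole-system energy delay product for the
     * predicted number of accesses. */
    enum power_state select_state(const struct vm_mem_history *h)
    {
        if (h->ema < SR_PD_THRESHOLD)
            return STATE_SR;
        if (h->ema < PD_PRE_THRESHOLD)
            return STATE_PD;
        return STATE_PRE;
    }

In this formulation the threshold comparison stands in for an explicit energy delay product computation at runtime: because the curves in Figure 3 cross only at the two thresholds, comparing the prediction against them selects the state with the lowest energy delay product.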
DPSM uses the intersection points of the energy delay product curves for the different power states to select the power state that best matches the predicted memory accesses for the upcoming scheduling interval. That is, DPSM selects the power state that has the lowest energy delay product for the predicted number of memory accesses.

IV. METHODOLOGY

We use trace-based simulation to analyze and evaluate the proposed mechanisms. We trace and then simulate a platform running 64-bit Xen 3.4.2 and pvops Linux 2.6.31.6 on an AMD Phenom II X4 940 processor with 8GB of Micron DDR2-800 memory and the power states specified in Figure 1. The monitoring system tracks memory accesses for every virtual machine during scheduling intervals and records the number of memory references along with a timestamp. We trace the execution of each virtual machine for 10 minutes, resulting in the same number of scheduling intervals for all virtual machines. We simulate the worst-case scenario for performance, with memory accesses distributed uniformly throughout the scheduling interval. In reality, some clustering of memory accesses occurs, resulting in better performance and higher energy savings for a given mechanism.

We selected a range of benchmarks to provide a variety of execution behavior across virtual machines. All benchmarks were installed in each virtual machine and executed randomly to mimic the workload of general-purpose applications.
• DaCapo [18] is a Java benchmark suite consisting of a set of open-source, real-world applications with non-trivial memory loads.
• SPEC CPU2000 benchmarks measure the performance of the processor, memory, and compiler on a given system.
• SPECjvm2008 is a benchmark suite for measuring the performance of a Java Runtime Environment (JRE), containing several real-life applications and benchmarks.
• Memtester [19] is a memory streaming utility that places high pressure on the memory subsystem. It sweeps the memory pages within a specified range, performing different algebraic and logical operations.

The 8GB of memory in our system is distributed between the VMM and the virtual machines as follows. Dom0 is configured with 512MB of memory to run the tracing program. Both the Xen hypervisor and Dom0 are allocated in the first rank. The last rank is occupied by the frame buffer of the integrated graphics device (IGD). The remaining six ranks are dedicated to the concurrent execution of six DomU virtual machines. The Xen 3.4.2 hypervisor does not support paging; therefore, each DomU virtual machine can only request a memory configuration no larger than the available physical memory in our system. As a result, each virtual machine was configured with one virtual CPU
and 1GB of memory, which is adequate to execute our studied benchmarks in each virtual machine concurrently. Rank 7, which is occupied by the frame buffer, always remains in the PRE state due to frequent accesses from the IGD. Rank 0, occupied by the Xen hypervisor and Dom0, is managed according to PAVM and is kept in the PRE state whenever the Xen hypervisor or Dom0 is scheduled. For the remaining ranks we explore the following energy management mechanisms:
• ALL (Always On) – represents a standard system without energy management that keeps all memory ranks in the PRE state.
• PAVM (Power-Aware Virtual Memory) – keeps the memory ranks of currently scheduled virtual machines in the PRE state and other ranks in the SR state [3].
• ODSR (On-Demand Self-Refresh) – keeps all memory ranks in the SR state during execution and only transitions them to the PRE state to serve an arriving memory I/O, and back to the SR state once the memory I/O completes [5].
• ODPD (On-Demand Power-Down) – keeps all memory ranks in the PD state during execution and only transitions them to the PRE state to serve an arriving memory I/O, and back to the PD state at the completion of the memory I/O [5].
• DPSM – the proposed mechanism, which dynamically selects a power state for the upcoming scheduling interval.
• OPT – an optimal mechanism that uses future knowledge to select the power state that will minimize the energy delay product of the system during the upcoming scheduling interval.

V. EVALUATION

In this section, we examine the performance and energy efficiency of the studied mechanisms.

A. Memory Energy Consumption

Figure 4 presents the memory energy distribution between power states, normalized to the energy consumed by keeping all six memory ranks occupied by virtual machines (VMs) always on. There are five contributors to energy consumption: idling in the SR, PD, and PRE states; switching to the PRE state from the SR or PD state before serving memory I/Os; and servicing memory I/Os. The energy consumed to service memory I/Os is the same for all mechanisms, because the memory has to serve the same amount of data no matter what energy saving mechanism is used. The remaining energy is consumed according to the given mechanism's power state selection.

To guarantee performance, the PAVM mechanism keeps the ranks of the currently executing virtual machines in the PRE state and the ranks of suspended virtual machines in the SR state, resulting in four ranks in the PRE state and the remaining two ranks, occupied by the currently not running
virtual machines, in the SR state. The resulting energy reduction is 30.4% without any performance degradation. Since the active ranks are kept in the PRE state at all times during the virtual machine scheduling intervals, the majority of the remaining energy consumption is attributed to the PRE state. The energy consumed by the two ranks in the SR state contributes only 1.9% to the total energy consumption, further supporting the argument that inactive ranks should be in the lowest power state. The switching overhead is entirely hidden by the context switch time; therefore, switching energy is negligible.

To reduce energy consumption further, we need to consider lower power states for the currently accessed memory ranks, which may expose switching overheads in terms of delays and energy consumption. The two static mechanisms we consider are ODSR and ODPD, which trade performance for energy savings to different degrees. The ODPD mechanism is closest to the PAVM mechanism, with almost the same behavior except that it keeps the currently active ranks in the PD state. Consequently, the energy is reduced by an additional 28.6% at the cost of exposing the power state transition to the PRE state before serving a memory I/O request. Similarly, the majority of the remaining energy is consumed idling in the PD state. The ODSR mechanism, on the other hand, utilizes the lower power SR state at the cost of higher delays. One might expect larger energy savings due to the use of the very low power SR state during idle periods; however, as we can see from Figure 4, this is not the case. The long power state transition latencies (500ns) translate into substantial energy consumption due to frequent power state transitions from the SR to the PRE state, resulting in the ODSR mechanism consuming much more energy than the ODPD mechanism. Therefore, a static ODSR mechanism is not desirable unless the virtual machine is almost idle, which is not a common case in our experiments. Such idle periods do exist, but they are usually short and need to be taken advantage of dynamically. Furthermore, certain execution periods have high memory activity that cannot tolerate any delays, and therefore the
Figure 4. Memory energy consumption normalized to the energy consumed by the standard Xen configuration that keeps all six memory ranks always in the PRE state.

Figure 5. Power state distribution determined by the OPT and DPSM mechanisms for the six virtual machines. SR, PD, and PRE represent the energy-delay-reducing power state; W_SR, W_PD, and W_PRE represent incorrect predictions of power states by the DPSM mechanism.
PRE state is desirable. The DPSM mechanism balances these states and achieves energy consumption comparable to the OPT mechanism. The DPSM mechanism reduces energy consumption by an additional 4.4% compared to the ODPD mechanism, which is only 0.4% away from the OPT mechanism. Furthermore, the switching overhead is negligible, resulting in lower energy consumption and better performance compared to the ODPD mechanism.

B. Prediction of Power States

The energy consumption distribution of the OPT and DPSM mechanisms can be correlated with the optimal and predicted power state selection shown in Figure 5. Figure 5 compares the power state selection of the OPT and DPSM mechanisms for multiple concurrent virtual machines. The OPT mechanism selects the best power state to minimize the overall energy delay product of the system; it compares the actual number of memory accesses with the thresholds defined in Figure 3 to set the memory rank power state for each scheduling interval. As Figure 5 shows, the PD state offers the best energy delay product the majority of the time, averaging 53.1% across the virtual machines.

The DPSM mechanism makes its prediction based on history and as a result can mispredict the memory behavior of the upcoming period. Figure 5 shows the mispredicted portion of each state: the incorrect portion above each state represents the number of periods in that state that were incorrectly predicted and should have been spent in other power states. Keeping ranks incorrectly in the SR state degrades performance, and keeping ranks incorrectly in the PRE state reduces energy savings. Incorrect predictions in the PD state indicate that the virtual machine would have been better off in other states, either PRE or SR, and as a result they can either degrade performance or increase energy consumption. In either case, incorrect predictions degrade energy efficiency.
However, Figure 5 shows that the mispredicted portions are small; as a result, the DPSM mechanism, on average, predicts the correct power state for 93.4% of the periods. The prediction performance and the period distribution vary across virtual machines. The benchmarks running in a virtual machine are randomly chosen by a script, resulting in a mix of memory- and compute-intensive virtual machines. Furthermore, some benchmarks show relatively more fluctuation in memory behavior across scheduling intervals in a virtual machine, making it difficult for the DPSM mechanism to correctly predict power state transitions. In virtual machine 3, benchmarks such as derby, scimark.fft.large, gzip, and gcc, which have high demand for memory bandwidth, were chosen by the random selector, resulting in a larger portion of intervals in the PRE state. For the other virtual machines, the random selector chose the benchmarks more uniformly, resulting in a relatively smaller portion of intervals in the PRE state. Nevertheless, the DPSM predictor performs similarly across virtual machines.

C. Execution Time Overhead

Incorrect state selection can translate into performance degradation and delays in application execution, as shown in Figure 6. Figure 6 shows delays in execution time normalized to the system that keeps memory in the PRE state all the time (ALL). PAVM keeps the currently active ranks in the PRE state and as a result does not show any degradation in performance. ODSR shows a prohibitive degradation in performance, since it has to transition from the SR to the PRE state before each memory access, exposing a 500ns transition latency for every memory access. This large delay translates into a significant increase in energy consumption, since the entire system must stay on for almost seven times longer. The remaining mechanisms increase execution time by less than 6% and are preferable over ODSR. ODPD increases execution time by 5.6% due to transitions between the PD and PRE states exposing a 2.5ns latency on every memory access. The delay that minimizes the overall energy delay product of the system, experienced by the OPT mechanism, is 0.2% of the running time. DPSM is again very close to OPT, with only a 0.6% increase in execution time. The small deviation between DPSM and OPT is attributed to mispredictions in the DPSM mechanism, which extend program execution. In Figure 5, mispredictions in the PRE state waste energy but do not introduce delay; the 2.8% and 1.8% of intervals mispredicted in the SR and PD states, respectively, on average across all six virtual machines, result in an additional 0.4% delay compared with the OPT mechanism.

Figure 6. Execution time overhead normalized to the standard Xen configuration that keeps all six memory ranks always in the PRE state.
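To make the relationship between per-access wake-up latency and execution time overhead concrete, the following back-of-envelope C sketch applies the 2.5ns PD-to-PRE and 500ns SR-to-PRE latencies from Figure 1 to an assumed, purely illustrative access count and interval length. This simple model ignores access clustering and the feedback of the added delay on total runtime, so it does not reproduce the exact percentages reported above.

    #include <stdio.h>

    int main(void)
    {
        double interval_ns = 30e6;   /* a 30ms scheduling interval (illustrative) */
        double accesses    = 1e6;    /* memory accesses in that interval (illustrative) */
        double pd_wake_ns  = 2.5;    /* PD -> PRE latency (Figure 1) */
        double sr_wake_ns  = 500.0;  /* SR -> PRE latency (Figure 1) */

        /* Overhead = total wake-up time added, relative to the interval length. */
        printf("ODPD overhead: %.1f%%\n", 100.0 * accesses * pd_wake_ns / interval_ns);
        printf("ODSR overhead: %.1f%%\n", 100.0 * accesses * sr_wake_ns / interval_ns);
        return 0;
    }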
D. Energy Consumption in the Entire System

A complete system contains many components that consume energy, including memory, CPUs, hard drives, network interfaces, and the motherboard. Figure 7 illustrates memory energy consumption as a fraction of overall system energy for various system sizes. To show the impact of memory optimizations in a larger system, we consider a dual Intel Xeon Processor X5667 system with Micron DDR2-800 memory. Each CPU is a quad-core processor that supports hyper-threading, allowing eight virtual cores. Each CPU has a peak power of 95W, and the rest of the system, not including memory, requires 220W. We consider four different memory configurations of 16, 24, 32, and 64 memory ranks. Each virtual machine is still assigned 1GB of memory, resulting in 16, 24, 32, and 64 concurrent virtual machines. We have a total of 16 virtual cores (8 cores with hyper-threading), and therefore at most 16 virtual machines can execute concurrently, which implies that the larger systems will put more ranks into the SR state for virtual machines that are not currently running. Finally, we recalculate the SR-PD and PD-PRE thresholds to account for the different system size.

We first observe that in the base system (ALL), which keeps all memory in the PRE state, increasing the system memory increases the fraction of energy consumed by memory, reaching 35.8% for the system with 64 memory ranks. In the system with 16 ranks, PAVM and ALL behave the same way, since there are 16 virtual machines running concurrently and PAVM keeps all 16 ranks active. We also note that as the number of ranks increases, the energy consumption of PAVM remains relatively constant: the number of ranks kept in the PRE state remains the same, and only the energy consumed in the SR state grows with the number of ranks. To provide any energy optimization in the case of 16 ranks, we must use low power states. The trends are similar to PAVM: once the energy savings on 16 ranks are obtained, an increase in the number of ranks simply increases the energy consumed in the SR state. We observe that DPSM offers a significant improvement in energy savings over ODPD and is comparable to OPT across system configurations. We do not consider ODSR, since its delays result in both performance degradation and higher energy consumption than even the base system (ALL).
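One plausible way to recompute these crossover points, sketched below in C, is to model the whole-system energy delay product of an interval as (system power + idle power of the managed ranks) × delay², where the delay grows by one wake-up latency per memory access, and then scan for the access count at which one state overtakes another. The model, the memory idle powers, and the interval length are illustrative assumptions, not the exact formulation used in the paper; only the 2.5ns/500ns latencies and the 179.5W and 410W (2 × 95W CPUs + 220W) system powers come from the text.

    #include <stdio.h>

    struct state {
        const char *name;
        double idle_power_w;  /* idle power of the managed ranks (placeholder values) */
        double wake_ns;       /* per-access wake-up latency to reach PRE */
    };

    /* Whole-system energy delay product for one scheduling interval. */
    double edp(double sys_power_w, const struct state *s,
               double interval_ns, double accesses)
    {
        double delay_s  = (interval_ns + accesses * s->wake_ns) * 1e-9;
        double energy_j = (sys_power_w + s->idle_power_w) * delay_s;
        return energy_j * delay_s;
    }

    /* Smallest access count at which state b beats state a (or -1 if never). */
    double crossover(double sys_power_w, const struct state *a,
                     const struct state *b, double interval_ns)
    {
        for (double n = 0; n < 1e8; n += 100)
            if (edp(sys_power_w, b, interval_ns, n) < edp(sys_power_w, a, interval_ns, n))
                return n;
        return -1;
    }

    int main(void)
    {
        struct state sr  = { "SR",  0.08, 500.0 };  /* placeholder idle powers in watts */
        struct state pd  = { "PD",  0.96, 2.5   };
        struct state pre = { "PRE", 1.88, 0.0   };
        double interval_ns = 30e6;                  /* 30ms interval (illustrative) */
        double systems[]   = { 179.5, 410.0 };      /* quad-core and dual-Xeon system power */

        for (int i = 0; i < 2; i++)
            printf("%.1fW system: SR-PD at %.0f accesses, PD-PRE at %.0f accesses\n",
                   systems[i],
                   crossover(systems[i], &sr,  &pd, interval_ns),
                   crossover(systems[i], &pd, &pre, interval_ns));
        return 0;
    }

Under this kind of model, the higher the system power, the more each nanosecond of added delay costs, so the crossovers move toward lower access counts and the larger system favors the higher-power, lower-latency states somewhat sooner.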
Figure 7. Fraction of energy due to memory in large memory systems, for configurations of 16, 24, 32, and 64 ranks.
Similar to Figure 6, DPSM has a very small impact on performance, keeping the execution time overhead relatively constant, while the ODPD mechanism incurs higher delays. As we have seen so far, the DPSM mechanism offers the best combination of performance and energy savings, resulting in close to optimal energy efficiency for any system configuration.

VI. CONCLUSIONS

As main memories increase in size, so does their energy consumption. Reducing energy consumption without decreasing performance is difficult in a virtualized environment. We propose a dynamic approach (DPSM) that transparently provides energy optimizations for main memory, building on existing mechanisms such as energy-aware memory allocation and static power state management. The mechanisms have low overhead, since they rely on performance counters, and are algorithmically simple, resulting in energy-efficient memory management. In all scenarios, DPSM is better than PAVM and the static ODPD and ODSR mechanisms. Transparency and low complexity allow the proposed mechanisms to be easily deployed in a range of virtualized environments.

VII. ACKNOWLEDGMENT

This material is based upon work supported by the National Science Foundation under Grant No. 0834179.

REFERENCES

[1] "HP ProLiant DL580 G7 Server series specifications," http://h10010.www1.hp.com/wwpc/us/en/sm/WF06a/15351-15351-3328412-241644-3328422-4142916.html, 2010.
[2] V. Delaluz, A. Sivasubramaniam, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin, "Scheduler-based DRAM energy management," in DAC, 2002.
[3] H. Huang, P. Pillai, and K. G. Shin, "Design and implementation of power-aware virtual memory," in ATC, 2003.
[4] M. Lee, E. Seo, J. Lee, and J.-S. Kim, "PABC: Power-aware buffer cache management for low power consumption," IEEE Transactions on Computers, 2007.
[5] Micron, "DDR2 SDRAM features," http://download.micron.com/pdf/datasheets/dram/ddr2/1GbDDR2.pdf, 2010.
[6] M. E. Tolentino, J. Turner, and K. W. Cameron, "Memory MISER: A performance-constrained runtime system for power-scalable clusters," in CF, 2007.
[7] A. R. Lebeck, X. Fan, H. Zeng, and C. Ellis, "Power aware page allocation," in ASPLOS, 2000.
[8] P. Marchal, J. I. Gomez, L. Pinuel, D. Bruni, L. Benini, F. Catthoor, and H. Corporaal, "SDRAM-energy-aware memory allocation for dynamic multi-media applications on multi-processor platforms," in DATE, 2003.
[9] P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar, "Dynamic tracking of page miss ratio curve for memory management," in ASPLOS, 2004.
[10] S. Liu, K. Pattabiraman, T. Moscibroda, and B. G. Zorn, "Flikker: Saving DRAM refresh-power through critical data partitioning," in ASPLOS, 2011.
[11] V. De La Luz, M. Kandemir, and I. Kolcu, "Automatic data migration for reducing energy consumption in multi-bank memory systems," in DAC, 2002.
[12] P. Ramamurthy and R. Palaniappan, "Performance-directed energy management using BOS," SIGOPS Oper. Syst. Rev., 2007.
[13] Y. Bao, M. Chen, Y. Ruan, L. Liu, J. Fan, Q. Yuan, B. Song, and J. Xu, "HMTT: A platform independent full-system memory trace monitoring system," in SIGMETRICS, 2008.
[14] W. Zhao and Z. Wang, "Dynamic memory balancing for virtual machines," in VEE, 2009.
[15] AMD, "AMD64 Architecture Programmer's Manual, Volume 2: System Programming," http://support.amd.com/us/Processor_TechDocs/24593.pdf, 2010.
[16] D. Chisnall, The Definitive Guide to the Xen Hypervisor. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2007.
[17] AMD, "AMD BIOS and Kernel Developer's Guide," http://support.amd.com/us/Processor_TechDocs/41256.pdf, 2010.
[18] "DaCapo benchmarks," http://dacapobench.org/, 2010.
[19] "Memtester," http://pyropus.ca/software/memtester/, 2010.