TECHNIQUES FOR COLLECTIVE PHYSICAL MEMORY UBIQUITY WITHIN NETWORKED CLUSTERS OF VIRTUAL MACHINES

BY
MICHAEL R. HINES
B.S., Johns Hopkins University, 2003
M.S., Florida State University, 2005

DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate School of
Binghamton University, State University of New York
2009

© Copyright by Michael R. Hines 2009
All Rights Reserved

Accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate School of Binghamton University, State University of New York, 2009. July 31st, 2009

Dr. Kartik Gopalan, Department of Computer Science, Binghamton University
Prof. Kanad Ghose, Department of Computer Science, Binghamton University
Dr. Kenneth Chiu, Department of Computer Science, Binghamton University
Dr. Kobus van der Merwe, AT&T Labs Research, Florham Park, NJ


ABSTRACT

This dissertation addresses the use of distributed memory to improve the performance of state-of-the-art virtual machines (VMs) in clusters with gigabit interconnects. Even with ever-increasing DRAM capacities, we observe a continued need to support applications that exhibit mostly memory-intensive execution patterns, such as databases, webservers, and scientific and grid applications. In this dissertation, we make four primary contributions. First, we survey the history of the solutions available for basic, transparent distributed memory support. We then document a bottom-up implementation and evaluation of a basic prototype whose goal is to move deeper into the kernel than previous application-level solutions. We choose a clean, transparent device interface capable of minimizing network latency and copying overheads. Second, we explore how recent work with VMs has brought back into question the memory management logic of the operating system. VM technology provides ease and transparency for imposing order on OS memory management (using techniques like full virtualization and para-virtualization). As such, we evaluate distributed memory in this context by optimizing our previous prototype at different places in the Xen virtualization architecture. Third, we leverage this work to explore alternative strategies for live VM migration. A key component that determines the success of migration techniques is exactly how memory is transmitted, and when. More specifically, this involves fine-grained page-fault management either before a VM’s CPU state is migrated (the current default) or afterwards. Thus, we design and evaluate the Post-Copy live VM migration scheme and compare it to the existing (Pre-Copy) migration scheme, realizing significant improvements. Finally, we promote the ubiquity of individual page frames as a cluster resource by integrating the use of distributed memory into the hypervisor (or virtual machine monitor). We design and implement CIVIC: a system that allows unmodified VMs to oversubscribe their DRAM, presenting a memory size larger than a given host’s physical memory. We complement this by implementing and evaluating network paging in the hypervisor for locally resident VMs. We evaluate the performance impact of CIVIC on various application workloads and show how CIVIC enables many possible VM extensions, such as better VM consolidation, multi-host caching, and the ability to better coordinate with VM migration.

ACKNOWLEDGEMENTS

First, I would like to thank a few organizations responsible for providing invaluable sources of funding which allowed me to work through graduate school. The AT&T Labs Research Fellowship Program, in cooperation with Kobus van der Merwe in New Jersey, provided support for a full 3 years. The Clark Fellowship Program at SUNY Binghamton also provided a full year of funding. The Departments of Computer Science at both Florida State and Binghamton made teaching assistantships available for a year. These deeds often go unsaid; without them I would not have been able to complete this degree. I would also like to thank the National Science Foundation and the Computing Innovation Fellows Project (cifellows.org). Through them, I will be continuing on an assistantship as a post-doctoral fellow for the next year. My advisor deserves his own paragraph. Not many graduate students can say what I can: I have one of the greatest advisors on the planet. Six years ago, he took a chance on me and stood patiently through the entire process: through the transfers, the applications, the bad papers, the good papers, the leaps of faith, the happy accomplishments, and the sad ones. Not only is he a fantastic researcher, but he is a strong teacher. I am very proud to be his student and I know many other students will be as well.


DEDICATION

To my father: for his unconditional support and love, and for all our tribulations. To my mother: for her strength, wisdom, and love, and for all of our struggles. To my brother: for his continuous perseverance and happiness. To my extended family: I stand on your shoulders.


BIOGRAPHICAL SKETCH

Michael R. Hines was born and raised in Dallas, Texas in 1983 and grew up playing classical piano. He began college in the Texas Academy of Math and Science program at the University of North Texas. Two years later he transferred to Johns Hopkins University in Baltimore, Maryland and received his Bachelor of Science degree in Computer Science in 2003. Subsequently, he entered Florida State University to complete an Information Security certification in 2004 and a Master’s degree in Computer Science in 2005. Immediately after that, he transferred to Binghamton University, State University of New York, where he finished his PhD in Computer Science in 2009. Michael will begin post-doctoral research at Columbia University in late 2009. He is a recipient of multiple awards, including the Jackie Robinson Undergraduate Scholarship (2 years), the AT&T Labs Foundation Fellowship (3 years), the Clark D. Gifford Fellowship (1 year) from Binghamton University, and the CIFellows CRA/NSF Award (1 year) for post-doctoral research. He is a member of the academic honor societies Alpha Lambda Delta and Phi Eta Sigma and the Computer Science honor society Upsilon Pi Epsilon. His hobbies include billiards, skateboarding, and yo-yos.


Contents

List of Figures

List of Tables

1 Introduction and Outline
  1.1 Distributed Memory Virtualization in Networked Clusters
  1.2 Virtual Machine Based Use for Distributed Memory
  1.3 Improvement of Live Migration for Virtual Machines
  1.4 VM Memory Over-subscription with Network Paging

2 Area Survey
  2.1 Distributed Memory Systems
    2.1.1 Basic Distributed Memory (Anemone)
    2.1.2 Software Distributed Shared Memory
  2.2 Virtual Machine Technology and Distributed Memory
    2.2.1 Microkernels
    2.2.2 Modern Hypervisors
  2.3 VM Migration Techniques
    2.3.1 Process Migration
    2.3.2 Pre-Paging
    2.3.3 Live Migration
    2.3.4 Non-Live Migration
    2.3.5 Self Ballooning
  2.4 Over-subscription of Virtual Machines

3 Anemone: Distributed Memory Access
  3.1 Introduction
  3.2 Design & Implementation
    3.2.1 Client and Server Modules
    3.2.2 Remote Memory Access Protocol (RMAP)
    3.2.3 Distributed Resource Discovery
    3.2.4 Soft-State Refresh
    3.2.5 Server Load Balancing
    3.2.6 Fault-tolerance
  3.3 Evaluation
    3.3.1 Paging Latency
    3.3.2 Application Speedup
    3.3.3 Tuning the Client RMAP Protocol
    3.3.4 Control Message Overhead
  3.4 Summary

4 MemX: Virtual Machine Uses of Distributed Memory
  4.1 Introduction
  4.2 Split Driver Background
  4.3 Design and Implementation
    4.3.1 MemX-Linux: MemX in Non-virtualized Linux
    4.3.2 MemX-DomU (Option 1): MemX Client Module in DomU
    4.3.3 MemX-DD (Option 2): MemX Client Module in Driver Domain
    4.3.4 MemX-Dom0 (Option 3)
    4.3.5 Alternative Options
    4.3.6 Network Access Contention
  4.4 Evaluation
    4.4.1 Latency and Bandwidth Microbenchmarks
    4.4.2 Application Speedups
    4.4.3 Multiple Client VMs
    4.4.4 Live VM Migration
  4.5 Summary

5 Post-Copy: Live Virtual Machine Migration
  5.1 Introduction
  5.2 Design
    5.2.1 Pre-Copy
    5.2.2 Design of Post-Copy Live VM Migration
    5.2.3 Prepaging Strategy
    5.2.4 Dynamic Self-Ballooning
    5.2.5 Reliability
    5.2.6 Summary
  5.3 Post-Copy Implementation
    5.3.1 Page-Fault Detection
    5.3.2 MFN Exchanging
    5.3.3 Xen Daemon Modifications
    5.3.4 VM-to-VM Kernel-to-Kernel Memory-Mapping
    5.3.5 Dynamic Self-Ballooning Implementation
    5.3.6 Proactive LRU Ordering to Improve Reference Locality
  5.4 Evaluation
    5.4.1 Stress Testing
    5.4.2 Degradation, Bandwidth, and Ballooning
    5.4.3 Application Scenarios
    5.4.4 Comparison of Prepaging Strategies
  5.5 Summary

6 CIVIC: Transparent Over-subscription of VM Memory
  6.1 Introduction
  6.2 Design
    6.2.1 Hypervisor Memory Management
    6.2.2 Shadow Paging Review
    6.2.3 Step 1: CIVIC Memory Allocation, Caching Design
    6.2.4 Step 2: Paging Communication and The Assistant
    6.2.5 Future Work: Page Migration, Sharing and Compression
  6.3 Implementation
    6.3.1 Address Space Expansion, BIOS Tables
    6.3.2 Communication Paths
    6.3.3 Cache Eviction and Prefetching
    6.3.4 Page-Fault Interception, Shadows, Reverse Mapping
  6.4 Evaluation
    6.4.1 Micro-Benchmarks
    6.4.2 Applications
  6.5 Summary

7 Improvements and Closing Arguments
  7.1 MemX Improvements
    7.1.1 Non-Volatile MemX Memory Descriptors
    7.1.2 MemX Internal Caching
    7.1.3 Server-to-Server Proactive Page Migration
    7.1.4 Increased MemX Bandwidth with Multiple NICs
  7.2 Migration Flexibility
    7.2.1 Hybrid Migration
    7.2.2 Improved Migration of VMs Through CIVIC
  7.3 CIVIC Improvements and Ideas
    7.3.1 How High Can You Go?: Extreme Consolidation
    7.3.2 Improved Eviction and Shadow Optimizations
  7.4 Conclusions

A CIVIC Screenshots
  A.1 Small-HVM Over-subscription
  A.2 Large-HVM Over-subscription

B The Xen Live-migration Process
  B.1 Xen Daemon
  B.2 Understanding Frame Numbering
  B.3 Memory-related Data Structures
  B.4 Page-table Management
  B.5 Actually Performing the Migration

Bibliography

List of Figures

3.1 Placement of distributed memory within the classical memory hierarchy.
3.2 The components of a client.
3.3 The components of a server.
3.4 A view of a typical Anemone packet header. The RMAP protocol transmits these directly to the network card from the BDI device driver.
3.5 Random read disk latency CDF.
3.6 Sequential read disk latency CDF.
3.7 Random write disk latency CDF.
3.8 Sequential write disk latency CDF.
3.9 Execution times of POV-Ray for increasing problem sizes.
3.10 Execution times of STL Quicksort for increasing problem sizes.
3.11 Execution times of multiple concurrent processes executing POV-Ray.
3.12 Execution times of multiple concurrent processes executing STL Quicksort.
3.13 Effects of varying the transmission window using Quicksort.
4.1 Split device driver architecture in Xen.
4.2 MemX-Linux: baseline operation of MemX in a non-virtualized Linux environment. The client can communicate with multiple memory servers across the network to satisfy the memory requirements of large memory applications.
4.3 MemX-DomU: inserting the MemX client module within DomU’s Linux kernel. The server executes in non-virtualized Linux.
4.4 MemX-DD: executing a common MemX client module within the driver domain, allowing multiple DomUs to share a single client module. The server module continues to execute in non-virtualized Linux.
4.5 I/O bandwidth for different MemX configurations, using a custom benchmark that issues asynchronous, non-blocking 4 KB I/O requests. “DIO” refers to opening the file descriptor with direct I/O turned on, to compare against bypassing the Linux page cache.
4.6 Comparison of sequential and random read latency distributions for MemX-DD and disk. Reads traverse the filesystem buffer cache. Most random read latencies are an order of magnitude smaller with MemX-DD than with disk. All sequential reads benefit from filesystem prefetching.
4.7 Comparison of sequential and random write latency distributions for MemX-DD and disk. Writes go through the filesystem buffer cache; consequently, all four latencies are similar due to write buffering.
4.8 Effect of filesystem buffering on random read latency distributions for MemX-DD and disk. About 10% of random read requests (issued without the direct I/O flag) are serviced at the filesystem buffer cache, as indicated by the first knee below 10µs for both MemX-DD and disk.
4.9 Quicksort execution times in various MemX combinations and disk. While clearly surpassing disk performance, MemX-DD trails regular Linux only slightly using a 512 MB Xen guest.
4.10 Quicksort execution times for multiple concurrent guest VMs using MemX-DD and iSCSI configurations.
4.11 Our multiple-client setup: five identical 4 GB dual-core machines, where one houses 20 Xen guests and the others serve as either MemX servers or iSCSI servers.
5.1 Pseudo-code for the pre-paging algorithm employed by post-copy migration. Synchronization and locking code omitted for clarity of presentation.
5.2 Prepaging strategies: (a) bubbling with a single pivot and (b) bubbling with multiple pivots. Each pivot represents the location of a network fault on the in-memory pseudo-paging device. Pages around the pivot are actively pushed to the target.
5.3 Pseudo-swapping (item 3): as pages are swapped out within the source guest itself, their MFN identifiers are exchanged and Domain 0 memory-maps those frames with the help of the hypervisor. The rest of post-copy then takes over after downtime.
5.4 The intersection of downtime within the two migration schemes. Currently, our downtime consists of sending non-pageable memory (which can be eliminated by employing shadow paging). Pre-copy downtime consists of sending the last round of pages.
5.5 Comparison of total migration times between post-copy and pre-copy.
5.6 Comparison of downtimes between pre-copy and post-copy.
5.7 Comparison of the number of pages transferred during a single migration.
5.8 Kernel compile with back-to-back migrations using 5-second pauses.
5.9 NetPerf run with back-to-back migrations using 5-second pauses.
5.10 Impact of post-copy on NetPerf bandwidth.
5.11 Impact of pre-copy on NetPerf bandwidth.
5.12 The application degradation is inversely proportional to the ballooning interval.
5.13 Total pages transferred for both migration schemes.
5.14 Page-fault comparisons: pre-paging lowers the network page faults to 17% and 21%, even for the heaviest applications.
5.15 Total migration time for both migration schemes.
5.16 Downtime for post-copy vs. pre-copy. Post-copy downtime can improve with better page-fault detection.
5.17 Comparison of prepaging strategies using multi-process Quicksort workloads.
6.1 Original Xen-based physical memory design for multiple, concurrently-running virtual machines.
6.2 Physical memory caching design of a CIVIC-enabled hypervisor for multiple, concurrently-running virtual machines.
6.3 Illustration of a full PPAS cache. All page accesses in the PPAS space must be brought into the cache before the HVM can use the page. If the cache is full, an old page is evicted from the FIFO maintained by the cache.
6.4 Internal CIVIC architecture: an Assistant VM holds two kernel modules responsible for mapping and paging HVM memory. One module directly (on demand) memory-maps portions of PPAS #2, whereas MemX does I/O. A modified, CIVIC-enabled hypervisor intercepts page faults to shadow page tables in the RAS and delivers them to the Assistant VM. If the HVM cache is full, the Assistant also receives victim pages.
6.5 High-level CIVIC architecture: unmodified CIVIC-enabled HVM guests have local reservations (caches) while small or large portions of their reservations actually expand out to nearby hosts.
6.6 Future CIVIC architecture: a large number of nodes would collectively provide global and local caches. The path of a page would potentially exhibit multiple evictions from Guest A to local to global. Furthermore, a global cache can be made to evict pages to other global caches.
6.7 Pseudo-code for the prefetching algorithm employed by CIVIC. On every page fault, this routine is called to adjust the window based on the spatial location of the current PFN address in the PPAS.
6.8 Page dirtying rate for different types of virtual machines, including HVM guests and para-virtual guests, with different types of shadow paging. This includes the overhead of creating new page tables from scratch.
6.9 Bus-speed page dirtying rate in gigabits per second. This is line-speed hardware memory speed once page tables have already been created, and shows throughput an order of magnitude higher than the previous graph.
6.10 Completion times for Quicksort on a CIVIC-enabled virtual machine and a regular virtual machine.
6.11 Completion times for sparse matrix multiplication with a resident memory footprint of 512 MB while varying the cache sizes.
6.12 Requests per second for the RUBiS auction benchmark with a resident memory footprint of 490 MB while varying the cache sizes.
A.1 A live run of an HVM guest on top of CIVIC with a very small PPAS cache size of 64 MB. The HVM has 2 GB. (Turn the page sideways.)
A.2 A live run of an HVM guest on top of CIVIC with a very large PPAS cache size of 2 GB. The HVM believes that it has 64 GB. (Turn the page sideways.)

List of Tables

3.1 Average application execution times and speedups for local memory, Distributed Anemone, and disk. N/A indicates insufficient local memory.
4.1 I/O latency for each MemX combination, in microseconds.
4.2 Execution time comparisons for various large memory application workloads.
5.1 Migration algorithm design choices in order of their incremental improvements. Method #4 combines #2 and #3 with the use of pre-paging. Method #5 combines all of #1 through #4, by which pre-copy is only used in a single, primer iterative round.
5.2 Percent of minor and network faults for flushing vs. pre-paging. Pre-paging greatly reduces the fraction of network faults.
6.1 Latency of a page fault through a CIVIC-enabled hypervisor to and from network memory at different stages.
6.2 Number of shadow page-faults to and from network memory with CIVIC prefetching disabled and enabled. Each application has a memory footprint of 512 MB and a PPAS cache of 256 MB.

Chapter 1

Introduction and Outline

Both the design and the use of main memory have changed dramatically over the last half-century. Because of fast-moving advances in hardware and software, the OS designer’s choices have also multiplied, especially as the performance gaps between the levels of the memory hierarchy grow larger. In this dissertation, we observe that the need to support large-memory, non-parallel applications still persists: applications whose memory access patterns are mostly singular and disjoint from one another. These continue to include many common applications such as databases and webservers, as well as scientific and grid applications. We describe a bottom-up attempt over the last few years to investigate solutions for these kinds of large-memory applications (LMAs) that can be applied across high-speed networked clusters of machines. The representative set of applications we benchmark in this dissertation includes:

• Large sorting
• Graphical ray-tracing
• Database workloads
• Support webserver
• E-commerce webserver
• Kernel compilation
• Parallel benchmarks
• Torrent clients
• Network throughput
• Network simulation


We refer to these applications as “large-memory applications” (LMAs). They tend to be somewhat CPU intensive. Across application boundaries (between individual running processes), they are either not parallelizable or not designed to be without explicit threading. Their computational behavior is such that when they do need to access portions of their large memory pools, they need them fast. These accesses are also usually made in a relatively “cache-oblivious” manner, such that the size of the working set in memory eventually converges to a size that fits within memory (before the application moves on to a new working set). For these kinds of applications, this work has investigated low-level memory management options across a number of different projects, and this chapter presents a high-level outline of them. The focus of this work is the virtualization of physical memory to support these applications. We organize this outline around three overarching goals:

1. Maximum Application Transparency: We want to improve the performance of these large memory applications with zero changes to the application. The last project of this dissertation extends this all the way to complete operating system transparency as well.

2. Clustered Memory Pool: We want to provide a potentially unlimited pool of cluster-wide memory to these applications with the help of distributed, low-latency communication.

3. Ubiquitous Resource Management: We ultimately want page-granular support for the arbitrary, transparent relocation of any single page frame in a cluster of machines.

This dissertation employs a combination of virtual machine technology, operating system modifications, and network protocol design to accomplish these three high-level goals for the aforementioned types of applications. The bottom-up process taken to explore the virtualization of physical memory in this dissertation is organized as follows: first we build a distributed memory virtualization system, followed by its evaluation in a virtual machine environment. Next, we develop alternative strategies for VM migration by leveraging distributed memory virtualization. Finally, we integrate these techniques to develop a system for VM memory oversubscription. We begin with a discussion of the basic distributed memory system.

1.1 Distributed Memory Virtualization in Networked Clusters

Chapter 3 begins by investigating the options available to large memory applications given basic, transparent distributed memory support in clusters of gigabit Ethernet-connected machines. Distributed memory itself is a very old idea, but our efforts at re-investigating it have revealed various unsolved performance issues as well as new applications. Additionally, implementing a distributed memory solution was a springboard for tackling low-level memory management issues in virtual machines. Our prototype was an effort to move further away from the application than previous work (very low into the kernel) by choosing a clean, familiar interface (the device) such that the needs of the application are still respected without any changes. It consists of a fully distributed, non-shared, Linux-based, all kernel-space distributed memory solution, including a custom networking protocol and a full performance evaluation. The solution exports an interface to any process that wants to map it, and hides the complexity of shipping those frames over gigabit Ethernet to other connected machines. It is not, however, a software distributed shared memory solution: it does not provide cache-coherence resolution protocols for simultaneous write access by parallel clients. That was not the focus of this work.
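
To make the interface concrete, here is a minimal user-space sketch of how a process might map such a device. The device name /dev/anemone0 and the region size are illustrative assumptions rather than the prototype’s actual interface; the point is that an ordinary open() and mmap() suffice, and the driver, not the application, decides which remote server holds each 4 KB frame.

    /* Sketch: memory-mapping a hypothetical /dev/anemone0 device so that a
     * process transparently backs a large buffer with remote memory. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 1UL << 30;               /* a 1 GB region */
        int fd = open("/dev/anemone0", O_RDWR);     /* hypothetical device name */
        if (fd < 0) { perror("open"); return 1; }

        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        memset(buf, 0xAB, len);                     /* page writes become 4 KB frames on the LAN */
        printf("first byte: 0x%02x\n", (unsigned char)buf[0]);

        munmap(buf, len);
        close(fd);
        return 0;
    }

Alternatively, the same device can simply be configured as a swap device, which is how the prototype preserves application and OS transparency in the rest of this dissertation.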

1.2 Virtual Machine Based Use for Distributed Memory

In Chapter 4, we investigate how distributed memory virtualization can benefit state-of-the-art virtual machine technology. We describe the design and implementation of a system that evaluates how distributed memory can enhance the transparency VMs provide. We did this by placing (and improving upon) the aforementioned distributed memory solution at different points within the virtual machine architecture and benchmarking applications within those VMs. At the end of 2005, a handful of virtual machine projects had already matured into both proprietary and open-source versions, and we began looking into how our distributed memory implementation could apply to virtual machine technology. Work with VMs over the last decade is interesting in that it has brought into question once again where exactly the memory management logic for LMAs should be placed, now that there is an extra level of indirection (called the virtual machine monitor, or “hypervisor”) below the OS, a well-known technique. Both hardware and software advances have created many ways to impose order on OS memory management while still maintaining a high degree of transparency to applications, through techniques such as full virtualization and para-virtualization.

1.3 Improvement of Live Migration for Virtual Machines

It soon became clear that virtual machine technology has succeeded tremendously at demonstrating the utility of transparent, live OS migration. In fact, it is likely that the increasing pervasiveness of VMs would never have happened without it. It is well known that many process migration prototypes, while very well built, were unable to become widespread due to fundamental limitations relating to transparency, process portability, and residual dependencies on the host OS. Changing the unit of migration to the OS itself has taken that problem out of the picture completely, even among different hypervisor vendors. The ability to run the VM transparently has shifted the base unit of computational containment from the process to the OS without changing the semantics of the application. A key component that determines the success of migration is exactly how the virtualization architecture migrates the VM’s memory, which is what initially led us to this particular problem. In Chapter 5, we apply some of the techniques developed for virtual machine based distributed memory to develop alternative strategies for live migration of VMs. We design, implement, and evaluate a new migration scheme and compare it to the existing migration schemes present in today’s virtualization technology. We were able to realize significant gains in migration performance with a new live-migration system for write-intensive VM workloads, as well as to point out some fundamental ways in which the management of VM memory could be improved over the pre-copy approach.

1.4 VM Memory Over-subscription with Network Paging

Our experience with the previous projects exposed the need for more fine-grained policies underneath the OS, particularly when VMs are consolidated from multiple physical hosts onto a single host and compete with each other for memory resources. In situations like this, determining better runtime placement and allocation of individual page frames among the VMs becomes important. This is where the idea of the ubiquity of individual frames of memory comes from: not only does virtualization remove the constraints on a page frame as to its location in memory, but it releases a page frame from even being on the same physical machine, even when the VM is still considered to have local ownership of the frame. We believe that, given the dynamics of a virtualized environment, the OS should consider its physical memory to be a ubiquitous “resource” without worrying about its physical location. This does not mean that it should not be aware of the contiguity of the physical memory space (with respect to kernel subsystems that handle memory allocation and fragmentation). Rather, it means that the source of that contiguous resource should be more flexible. Along the same lines, the interfaces that export this resource should maintain fast, efficient memory access and do so without duplicating implementation effort or functionality. With that, Chapter 6 presents the last contribution of this dissertation: a complete implementation and evaluation of a system that allows an unmodified VM to use more DRAM than is physically provided by the host machine, without any changes to the virtual machine. This is done through a combination of means. First, we alter the hypervisor under the VM and give the VM a view of a physical memory allocation that is larger than what is available at the host on which it is running. We then hook into the shadow paging mechanism, a feature provided by all modern hypervisors, to intercept page-table modifications performed by the VM. Finally, we supplement this by implementing a network paging system at the hypervisor level to allow for victim page selection when non-resident pages are accessed. This system is implemented while preserving the traditional concepts of paging and segmentation employed by an OS and by taking a page (pardon the pun) from microkernels by continuing to keep the hypervisor as small as possible. Our implementation also maintains the same transparency to the OS and its applications that all of our previous work has guaranteed. This system gives the system administrator and application programmer wide latitude: the option to arbitrarily cache, share, or move individual page frames for improved consolidation of multiple co-located VMs among physical hosts.
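
The fault path just described can be summarized in a few lines of code. The sketch below is purely illustrative and self-contained; it is not the CIVIC implementation, and the helper names, the tiny cache size, and the FIFO eviction policy are all assumptions made for the example. When a shadow page fault reveals a non-resident guest frame, a full local cache first evicts a victim over the network, and only then is the faulting frame fetched and mapped.

    /* Illustrative sketch (not the actual CIVIC code) of a hypervisor-level
     * pager handling a fault on a non-resident pseudo-physical frame. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t pfn_t;

    #define CACHE_CAPACITY 4                 /* tiny local cache for demonstration */

    static pfn_t fifo[CACHE_CAPACITY];
    static unsigned resident, head;

    static bool cache_contains(pfn_t pfn)
    {
        for (unsigned i = 0; i < resident; i++)
            if (fifo[(head + i) % CACHE_CAPACITY] == pfn)
                return true;
        return false;
    }

    /* Stand-ins for the real work: paging a frame to/from a memory server. */
    static void network_page_out(pfn_t v) { printf("  evict pfn %llu\n", (unsigned long long)v); }
    static void network_page_in(pfn_t p)  { printf("  fetch pfn %llu\n", (unsigned long long)p); }

    static void handle_shadow_fault(pfn_t pfn)
    {
        printf("fault on pfn %llu\n", (unsigned long long)pfn);
        if (cache_contains(pfn))
            return;                          /* minor fault: already resident locally */
        if (resident == CACHE_CAPACITY) {    /* cache full: pick a FIFO victim */
            network_page_out(fifo[head]);
            head = (head + 1) % CACHE_CAPACITY;
            resident--;
        }
        network_page_in(pfn);                /* bring the frame in over the LAN */
        fifo[(head + resident) % CACHE_CAPACITY] = pfn;
        resident++;                          /* guest resumes transparently */
    }

    int main(void)
    {
        pfn_t trace[] = { 1, 2, 3, 4, 5, 2, 6 };   /* a toy guest access pattern */
        for (unsigned i = 0; i < sizeof(trace) / sizeof(trace[0]); i++)
            handle_shadow_fault(trace[i]);
        return 0;
    }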

Chapter 2

Area Survey

Aside from the focus of this work discussed in Chapter 1, there is a great deal of related work. This chapter presents a literature survey of the supporting work to date. We walk through the three major steps taken in the work discussed in the introduction and explain how other literature is similar to, and how it differs from, our own, covering the Anemone system, the MemX system, the Post-Copy migration system, and our final system, CIVIC.

2.1 Distributed Memory Systems

Our distributed memory system, Anemone [50, 51], was the first system that provided unmodified large memory applications (LMAs) with completely transparent access to cluster-wide memory over commodity gigabit Ethernet LANs. One goal of our work was to make a concerted effort to bring all components of the implementation into the Linux kernel and to optimize for the network conditions in the LAN that were specific to network memory traffic: particularly the repeated flow control of 4-kilobyte page frames. As such, it can briefly be treated as distributed paging, distributed memory-mapping, or a remote in-memory filesystem, while the logic and design decisions are hidden behind a block device driver.


2.1.1 Basic Distributed Memory (Anemone)

The two most prominent systems designed to support distributed memory in the 1990s (both now dormant) were the NOW [15] project at Berkeley and the Global Memory System [37] at Washington. We decided to re-tackle this problem at the time for a few reasons: (a) neither of these two projects was available for use, (b) network and CPU speeds had increased by an order of magnitude since, and (c) both projects required extensive operating system support. The Global Memory System was designed to provide network-wide memory management support for paging, memory mapped files, and file caching. This system was closely built into the end-host operating system and operated over a 155 Mbps DEC Alpha ATM network. The NOW project [15] provided a broad range of services on top of the Digital Unix operating system. In the end, their solution included an OS-supported “cooperative caching” system, a type of distributed filesystem with the added responsibility of caching disk blocks (which could be memory mapped) into the memory of participating nodes. We describe cooperative caching systems later, but suffice it to say that these were very large implementations that could be functionally reduced to performing the task of distributed memory in an indirect manner. Our goal was to re-tackle just the distributed memory components of these systems, without any OS modifications, as low as possible within a device driver, in the hope that the project would be an enabling mechanism for more complicated projects in later years, which is exactly what happened. To explore these problems, we needed a working prototype in the Linux operating system, one that respected the design principles of current kernel development and was capable of functioning well over gigabit Ethernet networks. For all of those reasons, Chapter 3 describes the new system we designed. Although the previously mentioned projects were the most popular, they were by far not the only projects around in the 1990s. The earliest non-shared efforts [40, 21, 57] at using distributed memory aimed to improve memory management, recovery, concurrency control, and read/write performance for in-memory database and transaction processing systems. The first two distributed paging systems were presented in [28] and [38]. These projects also took the stance of incorporating extensive OS changes to both the client and the memory servers on other nodes. The Samson project (of which my advisor was a member) [90] was a dedicated memory server with a highly modified OS over a Myrinet interconnect that actively attempted to predict client page requirements. The Dodo project [59, 9] was another late-1990s attempt to provide a more end-to-end solution to the distributed-memory problem. They built a user-level, library-based interface that a programmer can use to coordinate all data transfers to and from a distributed memory cache. This obviously required legacy applications to be aware of the specific API in that library. For the Anemone project, this was a deal-breaker. The work that is probably the closest to our prototype was done by [68] and followed up in [39], implemented within the DEC OSF/1 operating system in 1996. They use a transparent device driver, just as we do, to perform paging. Again, our primary differences are as in the NOW case: a slow network, an out-of-date operating system, and no available code out of which we could build a broader research project. They do, however, have a recovery system built into their work, capable of tolerating single-node failures.

2.1.2 Software Distributed Shared Memory

For shared memory systems, typically called “Software Distributed Shared Memory” (DSM), a group of nodes participates in one of a host of different consistency protocols, not unlike the hardware requirements of cache-coherent Non-Uniform Memory Access (NUMA) shared memory machines. There are many of these systems. By their nature, the purpose of such cache-coherent systems is to provide a competing paradigm to parallel execution systems that depend on the Message Passing Interface (MPI). In general, DSM and MPI are competitors, each attempting to provide the means for parallel speedup across multiple physical host machines at different levels of the computing hierarchy. MPI attempts to provide the speedup through explicit data movement across each node through a series of calls, whereas a properly implemented DSM attempts to make this data movement inherent. This is typically done either at the language level or (like MPI) at the library level, in such a way that the DSM system handles shared writes (with proper ordering) so that the concurrently running programs on different nodes need only focus on locking critical sections that access shared data structures. As we mentioned, Anemone is not a DSM, nor are we trying to do research on parallel execution. Nevertheless, some of the more popular DSM projects in the 1990s included [35] and [14], which allow a set of independent nodes to behave as a large shared memory multi-processor, often requiring customized programming to share common data across nodes.

2.2 Virtual Machine Technology and Distributed Memory

Whole operating system VM technology, in which multiple independent, and possibly different, operating systems run simultaneously, has been re-invented in the last decade. The modern virtual machine monitor or hypervisor is inspired by three different kinds of OS virtualization: (a) library operating systems, (b) microkernels (versus monolithic kernels), and (c) the commodity OS virtualization work of the early 1970s. We briefly survey some of these ideas and how they have influenced choices in our work, which resulted in a project called “MemX” [49]. When that work was completed, MemX was the first system in a VM environment that provided unmodified LMAs with completely transparent, virtualized access to cluster-wide distributed memory over commodity gigabit Ethernet LANs. We begin our survey of virtual machine technology with microkernels and then discuss modern hypervisors.

2.2.1 Microkernels

Microkernels were attempts by the operating systems community in the 1980s and ’90s to shrink the size of the core OS base and move more of the subsystems of a traditional “macro” OS into user-land processes or servers. This decreased the privileges of these subsystems, improving fault isolation from foreign device drivers, and required fast communication mechanisms for the subsystems to talk to each other. Other motivations for the use of microkernels included the ability to provide UNIX-compatible environments without the need to constantly port drivers to new systems and without the need to port new systems to new CPU architectures. As long as the microkernel and the supporting communication framework are kept constant as a standard, one gains a great deal of interoperability, a source of headaches that continues to exist today. The advantages provided by microkernels and virtual machines are almost identical, and without going into too much of a philosophical debate, virtual machine designers add more hypervisor-aware code to current operating systems every year. One could almost consider modern hypervisors to be microkernels [45]. Probably the only reason that microkernels did not become more widespread is that industry support for these research prototypes never completely gained traction, whereas virtual machine technology has managed to do so. Nevertheless, the exploration of microkernels had a great deal of success beginning in the 1980s, including successful projects like Mach [8], Chorus [7], Amoeba [72], and L4 [64]. Notable work was also performed on “library” operating systems. These are based on the idea of having a root system “fork” off a smaller operating system in much the same way library code is stored and loaded on demand. Such systems do not fall cleanly into the definition of a microkernel, but they are closer to microkernels than to virtual machines because they also depend on fast communication primitives and their focus is not to provide full virtualization of multiple CPU architectures. Such systems included the Exokernel [36] and Nemesis [62].

2.2.2 Modern Hypervisors

The first hypervisors (“hypervisor” is the current term for the longer “virtual machine monitor”) have been around since the late 1960s [10] and were developed through the late 1970s (primarily by industry), until academic research began to focus on microkernels, which dominated until the mid-1990s. These early hypervisors were generally paired directly with specific hardware and meant to support multiple identical copies of the same operating system. After the microkernel movement slowed down, probably the first “revival” of hypervisor technology started with Disco [23]. The context of this work was cache-coherent NUMA machines, motivated by IBM’s work [10]. Their focus was similar: to support multiple commodity operating systems, but with as few changes as possible. A popular open-source attempt called “User Mode Linux” [2] also sprang up for a short while, but operated completely in userland. (We actually used this for a while to test our early distributed memory prototypes, but the developer base did not continue to grow.) At the turn of the century, two more hypervisors arrived, including Denali [6] (which was later modified to be a microkernel) and the familiar VMware system. Modern hypervisors are split into three categories at the moment: (a) full virtualization, (b) para-virtualization, and (c) pre-virtualization. Para-virtualization indicates that the OS has been modified to be aware that it is virtualized and to provide direct support to the underlying hypervisor, improving the speed of virtualizing memory accesses and device emulation. Full virtualization indicates that the guest operating system (the OS being virtualized by a hypervisor) has not been modified to support virtualization. Full virtualization can be supported in two ways: with or without hardware support. Both AMD [13] and Intel [3] provide hardware support for virtualization by enabling the processor to trap directly into the hypervisor when the guest attempts to execute a privileged instruction that must be emulated. Full virtualization systems like KVM [4] depend completely on hardware support. Projects like Xen [20] support both para-virtualized and fully-virtualized operating systems, with and without hardware support. The second way to perform full virtualization is binary translation, as is the case with VMware. Critiques of this approach note that the translation must be done at run time, incurring execution overheads of up to 20%. Similarly, pre-virtualization [63] is a related attempt to perform these translations offline, in a layered manner or with a custom compiler, but existing prototypes have not gained much traction in the community. Finally, para-virtualization takes the opposite approach, modifying the operating system itself. This technique met with great success in the Xen project [20], which is the hypervisor platform used in this work. Recently, the Linux and Windows communities have been updating these macro-kernels with hypervisor-aware hooks to mitigate the overhead of forward-porting. Such changes will also benefit many of the aforementioned full-virtualization technologies. Other para-virtualization techniques include operating-system-level virtualization, similar to [2], in which the OS itself and all processes are isolated into individual containers without the use of a true hypervisor [1].

2.3 VM Migration Techniques

Chapter 5 targets the performance of the live migration of virtual machines. The technique we use, accompanied by a handful of new optimizations, is called “Post-Copy”. Live migration is a mandatory feature of modern hypervisors. It facilitates server consolidation, system maintenance, and lower power consumption. Post-copy refers to the deferral of the “copy” phase of live migration until the virtual machine’s CPU state has already been migrated. Pre-copy refers to the opposite ordering and is currently the dominant way to migrate a process or virtual machine. A survey of the different units and types of migration follows.
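
As a quick orientation before the survey, the two orderings can be contrasted with a toy, self-contained sketch. The helpers below are stand-ins rather than any hypervisor’s actual API; real logic lives inside the migration daemon, and the page counts and convergence threshold are assumptions for the example.

    /* Pre-copy: memory first (iteratively), CPU state last.
     * Post-copy: CPU state first, memory afterwards, on demand. */
    #include <stdio.h>

    static int dirty_pages = 1000;

    static void copy_dirty_pages(void)
    {
        printf("pre-copy round: sending %d dirty pages\n", dirty_pages);
        dirty_pages /= 4;                      /* pretend most pages stay clean */
    }

    static void send_cpu_state_and_resume(void) { puts("downtime: CPU state sent, VM resumes"); }
    static void demand_fetch_and_prepage(void)  { puts("post-copy: faults and pre-paging pull pages in"); }

    static void migrate_pre_copy(void)
    {
        while (dirty_pages > 50)               /* iterate until the dirty set converges */
            copy_dirty_pages();
        send_cpu_state_and_resume();           /* final stop-and-copy round */
    }

    static void migrate_post_copy(void)
    {
        send_cpu_state_and_resume();           /* VM starts running at the target */
        demand_fetch_and_prepage();            /* every page is sent at most once */
    }

    int main(void)
    {
        migrate_pre_copy();
        dirty_pages = 1000;
        migrate_post_copy();
        return 0;
    }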

2.3.1 Process Migration

The post-copy algorithm (which has gone by different names) has appeared in the context of process migration in four previous incarnations: it was first implemented as “Freeze Free” using a file server [84] in 1996, simulated in 1997 [83] (where the term post-copy was first coined), and later followed by an actual Linux implementation in 2003 [74], which originated the “hybrid” assisted post-copy scheme that we summarize later. In 2008, a version under the openMosix kernel was again presented for process migration [85]. Our contributions instead address new challenges at the virtual machine level that are not seen at the process level, and benchmark an array of applications affecting the different metrics of full virtual machine migration, which these earlier approaches do not do. The closest work to Post-Copy is a report called SnowFlock [44]. They use a similar technique in the context of parallel computing by introducing “impromptu clusters”, which clone a VM to multiple destination nodes and collect results from the new clones. They do not compare their scheme to (or optimize upon) the original pre-copy system. Their page-fault avoidance heuristics also differ in that they para-virtualize Xen guests to avoid transmitting free pages, whereas we use ballooning, as it is less invasive and transparent to kernel operations. Process migration schemes, well surveyed in [71], have not become widely pervasive, though several projects exist, including Condor [30], Mosix [19], libckpt [80], CoCheck [91], Kerrighed [58], and Sprite [34]. The migration of entire operating systems is inherently free of residual dependencies while still providing a live and clean unit of migration. Techniques also exist to migrate applications [71] or entire VMs [17, 27, 73] to nodes that have more free resources (memory, CPU) or better data access locality. Both Xen [27] and VMware [73] support migration of VMs from one physical machine to another, for example, to move a memory-hungry enterprise application from a low-memory node to a memory-rich node. However, large memory applications within each VM are still constrained to execute within the memory limits of a single physical machine at any time. In fact, we have shown that MemX can be used in conjunction with VM migration in Xen, combining the benefits of both live VM migration and distributed memory access. MOSIX [19] is a management system that uses process migration to allow sharing of computational resources among a collection of nodes, as if in a single multiprocessor machine. However, each process is still restricted to using memory resources within a single machine.

2.3.2 Pre-Paging

The post-copy algorithm does its best (as pre-copy does) to identify the collective working set of the virtual machine’s processes, a concept first identified for individual processes in 1968 [32]. Pre-copy does this with shadow paging: the use of an additional read-only page table level that tracks the dirtying of pages. Post-copy does it by receiving page faults. We mitigate the effect of faults on applications through the use of pre-paging, a technique that also goes by different names. In virtual-memory and application-level solutions, it is called pre-paging. At the I/O level, or the actual paging-device level, it can also be referred to as “adaptive prefetching”. For process migration and distributed memory systems it can also be referred to as “adaptive distributed paging” (whereas ordinary distributed paging suffers from the residual dependency problem, and may or may not involve the use of pre-fetching). In any case, we use the term pre-paging to refer to a migration system that adaptively “flushes” out all of the distributed pages while simultaneously trying to hide the latency of page faults, as pre-fetching does. We do not use disks or intermediate nodes. Traditionally, the algorithms involved in pre-paging take both reactive and history-based approaches to anticipate, as well as possible, what the working set of the application may be. Pre-paging has experienced a brief resurgence this decade, though it goes back as far as 1968 [76]; a survey can be found in [94]. In our case, we implement a reactive approach with a few optimizations at the virtual machine level, described later. History-based approaches may benefit future work, but we do not implement them here.
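
A reactive approach of this kind can be illustrated with a small, self-contained sketch, loosely in the spirit of the “bubbling” strategy detailed in Chapter 5. The page count, radius, and helper names are assumptions made for the example only: after each network fault, the pages surrounding the faulting location (the pivot) are pushed before the background sweep continues.

    /* Toy reactive pre-paging pass: expand symmetrically around each pivot,
     * most recent fault first, then flush whatever remains. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_PAGES 32

    static bool sent[NUM_PAGES];               /* pages already at the target */

    static void push_page(int pfn)
    {
        if (pfn < 0 || pfn >= NUM_PAGES || sent[pfn])
            return;
        sent[pfn] = true;
        printf("pushed page %d\n", pfn);
    }

    static void bubble_around(int pivot, int radius)
    {
        push_page(pivot);
        for (int d = 1; d <= radius; d++) {    /* grow the window outward */
            push_page(pivot + d);
            push_page(pivot - d);
        }
    }

    int main(void)
    {
        int faults[] = { 5, 20, 7 };           /* toy sequence of network faults */
        for (unsigned i = 0; i < sizeof(faults) / sizeof(faults[0]); i++)
            bubble_around(faults[i], 3);

        /* Background sweep flushes whatever demand faults did not touch. */
        for (int pfn = 0; pfn < NUM_PAGES; pfn++)
            push_page(pfn);
        return 0;
    }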

2.3.3 Live Migration

System-level virtual machine migration has been revived by several projects, including architecture-independent approaches with VMware migration [73] and Xen migration [27], architecture-dependent projects using VT-x or VT-d chips with the KVM project in Linux [4], operating-system-level approaches that do not use hypervisors (similar to capsules/pods) with the OpenVZ system [1], and even wide-area-network approaches [22], all of which can potentially benefit from the post-copy method of VM migration presented in this dissertation. Furthermore, the self-migration of operating systems has much in common with the migration of single processes [48]. The same group built this project on top of their “Nomadic Operating Systems” [47] project, as well as a first prototype implementation on top of the L4 Linux microkernel using “NomadBIOS”. All of these systems currently use pre-copy-based migration schemes.

2.3.4 Non-Live Migration

There are several non-live approaches to migration, in which the dependent applications must be completely suspended during the entire migration. The term capsule was introduced by Schmidt in [87]. In this work, capsules were implemented by grouping together processes in Linux or Solaris operating systems and migrating all of their state as a group, as opposed to the full operating system. Along the same lines, Zap [78] uses units of migration called process domains (pods), which are essentially process groups along with their process-to-kernel interfaces such as file handles and sockets. Migration is done by suspending the pod and copying it to the target. Also, connections to active services are not maintained during transit. The Denali project [6, 5] dealt with migrating checkpointed VMware virtual machines across a network, incurring longer migration downtimes. Chen and Noble suggested using hardware-level virtual machines for user mobility [41]. The Capsules/COW project [24] addresses user mobility and system administration by encapsulating the state of computing environments as objects that can be transferred between distinct physical hosts, citing the example of transferring an OS instance to a home computer while the user drives home from work. The OS instance is not active during the transfer. The “Internet Suspend/Resume” project [66] focuses on the capability to save and restore computing state on anonymous hardware. The execution of the virtual machine is suspended during transit. In contrast to these systems, our aim is to transfer live, active OS instances on fast networks without stopping them.

2.3.5 Self Ballooning

Ballooning is the act of changing the view of the amount of physical memory seen by the operating system during runtime. Ballooning has already been used a few times in virtual machine technology, but none of these uses have been made continuous in production as of yet, nor has the use of ballooning been investigated in combination with different VM migration systems, which is the purpose of this work. Prior ballooning work includes VMware's 2002 publication [96], which was inspired by “self-paging” in the Nemesis operating system [46]. It is not clear, however, how their ballooning mechanisms interact with different forms of VM migration, which is what we investigate. Xen is also capable of simple one-time ballooning during migration and at system boot time. Additionally, an effort is being made to commit a general version of self-ballooning into the Xen upstream development tree by a group within Oracle Corp. [67]. Such contributions will help standardize the use of ballooning.

2.4 Over-subscription of Virtual Machines

The most notable attempt to oversubscribe virtual machine memory was presented in [96] for VMware and [33] for Xen. These projects work very well, but the amount of VM memory
is constrained to what is available on the physical host. Additionally, a couple of DSM-level attempts to present a Single-System Image (SSI) for unmodified VMs exist in [12] and [69]. Building an SSI was not the focus of this dissertation; rather, our goal is to allow local virtual machines to gain access to cluster memory, because we want to increase VM consolidation and migration performance rather than spread processing out into the cluster. Thus, the processor resources available to such VMs in our work are only those of one host. Ballooning, as described in the previous section, also allows VMs to oversubscribe virtual machine memory, but it requires direct operating system participation. Ballooning also does not allow access to non-resident memory, which requires a one-to-one static memory allocation throughout the virtual machine's lifetime. To date, the CIVIC system, described in Chapter 6, is the first attempt to apply distributed memory to unmodified virtual machines running applications with large memory requirements in a low-latency environment, through the use of network paging and shadow memory interception within the Xen hypervisor.

Chapter 3

Anemone: Distributed Memory Access

In this Chapter, we describe our initial distributed memory work, called the Anemone project, in detail. Because the performance of large memory applications degrades rapidly once the system hits the physical memory limit, such applications quickly begin paging or thrashing. We present the design, implementation, and evaluation of Distributed Anemone (Adaptive Network Memory Engine) – a lightweight and distributed system that pools together the collective memory resources of multiple Linux machines across a gigabit Ethernet LAN. Anemone treats distributed memory as another level in the memory hierarchy between very fast local memory and very slow local disks. Anemone enables applications to access potentially “unlimited” network memory without any application or operating system modifications (when Anemone is used as a swap device). Our kernel-level prototype features fully distributed resource management, low-latency paging, resource discovery, load balancing, soft-state refresh, and support for ‘jumbo’ Ethernet frames. Anemone achieves low page-fault latencies of 160µs on average and application speedups of up to 4 times for single processes and up to 14 times for multiple concurrent processes, when compared against disk-based paging.

3.1 Introduction

Performance of large-memory applications (LMAs) can suffer from large disk access latencies when the system hits the physical memory limit and starts paging to local disk.
At the same time, affordable, low-latency gigabit Ethernet is becoming commonplace, with support for jumbo frames (packets larger than 1500 bytes). Consequently, instead of paging to a slow local disk, one could page over gigabit Ethernet to the unused memory of distributed machines and use the disk only when distributed memory is exhausted. Thus, distributed memory can be viewed as another level in the traditional memory hierarchy, filling the widening performance gap between low-latency RAM and high-latency disk. In fact, distributed memory paging latencies of about 160µs or less can be easily achieved, whereas disk read latencies range anywhere between 6 and 13ms. A natural goal is to enable unmodified LMAs to transparently utilize the collective distributed memory of nodes across a gigabit Ethernet LAN. Several prior efforts [28, 38, 37, 59, 68, 39, 70, 90] have addressed this problem by relying upon expensive interconnect hardware (ATM or Myrinet switches), slow bandwidth-limited LANs (10Mbps/100Mbps), or heavyweight software Distributed Shared Memory (DSM) systems [35, 14] that require intricate consistency/coherence techniques and, often, customized application programming interfaces. Additionally, extensive changes were often required to the LMAs, the OS kernel, or both.

Our earlier work [50] addressed the above problem through an initial prototype, called the Adaptive Network Memory Engine (Anemone) – the first attempt at demonstrating the feasibility of transparent distributed memory access for LMAs over a commodity gigabit Ethernet LAN. This was done without requiring any OS changes or recompilation, and relied upon a central node to map and exchange pages between nodes in the cluster. Here we describe the implementation and evaluation of a fully distributed Anemone architecture. Like the centralized version, distributed Anemone uses lightweight, pluggable Linux kernel modules and does not require any OS changes. Additionally, it achieves the following significant improvements over the centralized system:

1. Full distribution: Memory resource management is distributed across the whole cluster. There is no single control node.

2. Low latency: The round-trip time from one machine to the other is reduced by over a factor of 3 when compared to disk access – to around 160µs.

3. Load balancing: Clients make intelligent decisions to direct distributed memory traffic across all available memory servers, taking into account their memory usage and paging load.

4. Dynamic Discovery and Release: A distributed resource discovery mechanism enables clients to discover newly available servers and track memory usage across the cluster. The protocol also has a mechanism for releasing servers and re-distributing their memory so that individual servers can be taken down for maintenance.

5. Large packet support: The distributed version incorporates the flexibility of whether or not ‘jumbo’ frames should be used based on the network hardware in use, allowing operation in networks with any MTU size.

Our protocol is custom built without the use of TCP. As far as the application is concerned, network transmission does not exist, so the end-to-end design of our protocol is built to satisfy the efficiency needs of code in the kernel. We evaluated our prototype using unmodified LMAs such as ray-tracing, network simulations, in-memory sorting, and k-nearest neighbor search. Results show that the system is able to reduce average page-fault latencies from 8.3ms to 160µs. Single-process applications (including those that internally contain threads) speed up by up to a factor of 4, and multiple concurrent processes by up to a factor of 14, when compared against disk-based paging.

3.2 Design & Implementation

Distributed Anemone has two major software components: the client module on low-memory machines and the server module on machines with unused memory. The client module appears to the client system simply as a block device that can be configured in multiple ways:

• Storage: the “device” can be treated like storage. One can place any filesystem on top of it and mount it like a regular filesystem.

• Memory Mapping: one can memory-map the Anemone device directly, creating the view of a linear array of addresses within the application itself. This is a standard practice in many applications, most popularly for the dynamic loading of libraries, but it can be made explicit through standard system calls.

• Paging Device: the system can be used for distributed memory paging directly by the operating system. This is the mode we use to evaluate the system later on.

Figure 3.1: Placement of distributed memory within the classical memory hierarchy.
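
As a small illustration of the memory-mapping mode above, the following userspace sketch maps the Anemone block device and treats it as one large, byte-addressable array. The device node /dev/anemone0 and the 1 GB length are hypothetical values chosen for the example, not names defined by the prototype.

/* Hypothetical sketch: memory-map the Anemone block device from userspace
 * and use it like ordinary memory.  Accesses that exceed local DRAM are
 * paged to remote servers by the client module underneath the device. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1UL << 30;                       /* map 1 GB of the device */
    int fd = open("/dev/anemone0", O_RDWR);       /* hypothetical node name */
    if (fd < 0) { perror("open"); return 1; }

    char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    memset(mem, 0, len);                          /* touch the whole region */
    mem[len - 1] = 42;
    printf("last byte: %d\n", mem[len - 1]);

    munmap(mem, len);
    close(fd);
    return 0;
}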


Whenever an LMA needs more virtual memory, the pager (swap daemon) in the client swaps out pages from the client to other server machines. As far as the pager is concerned, the client module is just a block device, not unlike a hard disk partition. Internally, however, the client module maps swapped-out pages to distributed memory servers. At a high level, our goal was to develop a prototype that realizes the view presented in Figure 3.1, where distributed memory represents a new level of the memory access hierarchy. The servers themselves are also regular machines that happen to have unused memory to contribute, and they can in fact switch between the roles of client and server at different times, depending on their memory requirements. Client machines discover available servers by using a simple distributed resource discovery mechanism. Servers provide regular feedback about their load to clients, both as a part of the resource discovery process and as a part of the regular paging process (piggybacked on acknowledgments). Clients use this information to schedule page-out requests by choosing the least loaded server node for each new page. Also, both the clients and servers use a soft-state refresh protocol to maintain the liveness of pages stored at the servers. The earlier Anemone prototype [50] differed in that the page-to-server mapping logic was maintained at a central Memory Engine instead of at individual client nodes. Although simpler to implement, this centralized architecture incurred two extra round-trip times on every request, besides forcing all traffic through the central Memory Engine, which can become a single point of failure and a significant bottleneck.

Figure 3.2: The components of a client.


Figure 3.3: The components of a server.

3.2.1 Client and Server Modules

Figure 3.2 illustrates the client module that handles paging operations. It has four major components: (1) the Block Device Interface (BDI), (2) a basic LRU-based write-back cache, (3) mapping logic that records the server location of swapped-out pages, and (4) a Remote Memory Access Protocol (RMAP) layer. The pager issues read and write requests to the BDI in 4KB data blocks. The device driver that exports the BDI is instructed to keep page write requests aligned on 4KB boundaries (the usual sector size of a block device is 512 bytes). The BDI, in turn, performs read and write operations against our write-back cache, so pages are not transmitted until they are evicted. When the cache is full, a page is evicted to a server using RMAP. Figure 3.3 illustrates the two major components of the server module: (1) a hash table that stores client pages
along with the client's identity (layer-2 MAC address) and (2) the RMAP layer. The server module can store and retrieve pages for any client machine. Once the server reaches capacity, it responds to the requesting client with a negative acknowledgment; it is then the client's responsibility to select another server, if available, or to page to disk if necessary. Page-to-server mappings are kept in a standard chained hash table. Linked lists contained within each bucket hold 64-byte entries that are managed using the Linux slab allocator (which performs fine-grained management of small, equal-sized memory objects). Standard disk block devices interact with the kernel through a request queue mechanism, which permits the kernel to group spatially consecutive block I/Os (BIOs) together into one “request” and schedule them using an elevator algorithm for seek-time minimization. Unlike a disk, Anemone is essentially a random-access device with a fixed read/write latency. Thus, the BDI does not need to group sequential BIOs; it can bypass request queues, perform out-of-order transmissions, and asynchronously handle unacknowledged, outstanding RMAP messages.
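
The fragment below sketches how such a request-queue-bypassing BDI might handle bios in a 2.6-era kernel. It is not the dissertation's actual driver code: block-layer signatures changed across 2.6.x releases (the two-argument bio_endio() shown here appeared around 2.6.24), it assumes each bio carries a single 4KB page as the BDI requires, and the anemone_cache_* and rmap_send_read() helpers are hypothetical stand-ins for the cache and protocol layers described above.

#include <linux/bio.h>
#include <linux/blkdev.h>

/* placeholder hooks into the write-back cache and RMAP layers */
extern void anemone_cache_insert(sector_t sector, struct page *page);
extern int  anemone_cache_lookup(sector_t sector, struct page *page);
extern void rmap_send_read(sector_t sector, struct bio *bio);

/* bio handler installed with blk_queue_make_request(): every 4KB request
 * is handled directly, with no elevator and no request-queue merging. */
static int anemone_make_request(struct request_queue *q, struct bio *bio)
{
    sector_t sector  = bio->bi_sector;      /* 512-byte units, 4KB-aligned */
    struct page *page = bio_page(bio);      /* single-segment assumption   */

    if (bio_data_dir(bio) == WRITE) {
        /* Writes land in the LRU write-back cache; the page is only sent
         * over RMAP when it is later evicted. */
        anemone_cache_insert(sector, page);
        bio_endio(bio, 0);                  /* 2.6.24+ two-argument form   */
    } else if (anemone_cache_lookup(sector, page)) {
        bio_endio(bio, 0);                  /* read hit served locally     */
    } else {
        /* Read miss: the bio completes asynchronously, possibly out of
         * order, when the server's RMAP reply arrives. */
        rmap_send_read(sector, bio);
    }
    return 0;
}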

3.2.2 Remote Memory Access Protocol (RMAP)

RMAP is a tailor-made, low-overhead communication protocol for distributed memory access within the same subnet. It implements the following features: (1) reliable packet delivery, (2) flow control, and (3) fragmentation and reassembly. While one could technically communicate over the TCP, UDP, or even IP protocol layers, that choice comes burdened with unwanted protocol processing. Instead, RMAP takes an integrated, faster approach by communicating directly with the network device driver, sending frames and handling reliability issues in a manner that suits the needs of the Anemone system. Every RMAP message is acknowledged except for soft-state and dynamic discovery messages. Timers trigger retransmissions when necessary (which is extremely rare) to guarantee reliable delivery. We cannot allow a paging request to be lost, or the application that depends on that page will fail altogether. RMAP also implements flow control to ensure that it does not overwhelm either the receiver or the intermediate network cards and switches. The performance of any distributed system is heavily influenced by the networking requirements imposed, including both the design of the network and the application's requirements.
To minimize latency and protocol-related processing overhead, a conscious choice was made to eliminate the use of TCP/IP and write a simpler, lightweight protocol. The subset of networking functions needed by our system in the kernel is significantly smaller than the full set provided by the combination of TCP and IP in a cluster of machines. Four of the most prominent features that we do not include are:

• Port Abstraction: Our system has no use for the concept of ports, application-level socket buffers, byte-streams, or in-order delivery. Since our system operates at the block-I/O level, these mostly application-driven requirements disappear.

• IP Addresses: The system does not operate across routed IP subnets, nor do we plan on supporting this feature due to its performance overheads. Routed subnets detract from the distributed nature of the system and create unwanted link-congestion bottlenecks with flows from other networks, which is not the kind of problem we are trying to attack. As a result, the ability of one node to address and communicate with another node is simplified. We also noticed that a custom protocol was much easier to maintain in the kernel, because the clients and servers can address each other over the network directly, without the need to juggle IP addresses and socket error handling.

• Fragmentation: With the right use of the Linux networking API, this turned out to be a far simpler problem to solve: today's Linux provides a good enough design abstraction to deploy a non-IP based, zero-copy fragmentation solution. Furthermore, our protocol can auto-detect the MTU of the system's NIC and automatically send larger packets (so-called ‘jumbo’ frames) if the card supports it, especially because we have no need for multi-network ICMP MTU discovery (assuming that all hops in the network support the same MTU size).

• Segmentation Offload: The performance of 10-gigabit and higher speed networks depends heavily on the use of TCP and checksum offloading. It is gradually becoming quite commonplace to find gigabit cards with offloading engines on them that the kernel can exploit. Recent 2.6 kernels have integrated the zero-copy use of segmentation into their TCP/IP APIs. We have observed that, under a highly active system, the network can easily exhibit full-speed workloads. Since we use RMAP, this potentially frees up segmentation offloading for application-level networking traffic that might be concurrently running within the same guest VM.

Figure 3.4: A view of a typical Anemone packet header. The RMAP protocol transmits these directly to the network card from the BDI device driver.

Figure 3.4 depicts what a typical Anemone packet header looks like. The last design consideration in RMAP is that while the standard memory page size is 4KB (although it is not uncommon for an operating system to employ 4 MB super-pages for better use of the translation lookaside buffer), the maximum transmission unit (MTU) in traditional Ethernet networks is limited to 1500 bytes. RMAP therefore implements dynamic fragmentation/reassembly for paging traffic. Additionally, RMAP also has the flexibility to use Jumbo frames, which are packets with sizes greater than 1500 bytes (typically between 8KB and 16KB). Jumbo frames enable RMAP to transmit complete 4KB pages to servers using a single packet, without fragmentation. Our testbed includes an 8-port switch that supports Jumbo frames (9KB packet size). We observe a 6% speedup in RMAP throughput by using Jumbo frames. However, in this Chapter, we conduct all experiments with 1500-byte MTU sizes, with fragmentation/reassembly performed by RMAP.
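
A plausible C rendering of the header in Figure 3.4 is sketched below. The field widths and names are assumptions; only the set of fields follows the figure.

#include <linux/types.h>

/* Illustrative layout of an RMAP message header (see Figure 3.4). */
struct rmap_header {
    u8  type;              /* page-out, page-in, ack, announcement, refresh */
    u8  status;            /* e.g. OK or negative acknowledgment            */
    u32 sequence;          /* per-message sequence number for reliability   */
    union {
        struct {           /* resource announcement / load feedback         */
            u32 session_id;
            u32 load_status;
            u32 load_capacity;
        } advert;
        struct {           /* page read/write request                       */
            u64 offset;    /* byte offset in the client's linear I/O space  */
            u32 size;
        } page;
    } u;
    u16 frag_flags;        /* fragment index/count when a 4KB page must be
                              split to fit a 1500-byte MTU                  */
} __attribute__((packed)); /* page data, if any, follows immediately        */

With a 1500-byte MTU, each 4KB page is carried in three such fragments; with 9KB jumbo frames it fits in a single one.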

3.2.3 Distributed Resource Discovery

As servers constantly join or leave the network, Anemone can (a) seamlessly absorb the increase/decrease in cluster-wide memory capacity, insulating LMAs from resource fluctuations and (b) allow any server to reclaim part or all of its contributed memory. This objective is achieved through distributed resource discovery described below, and soft-state refresh described next in Section 3.2.4. Clients can discover newly available distributed memory in the cluster and the servers can announce their memory availability. Each server periodically broadcasts a Resource Announcement (RA) message (1 message every 10 seconds in our prototype) to advertise its identity and the amount of memory it is willing to contribute. Besides RAs, servers also piggyback their memory availability information in their page-in/page-out replies to individual clients. This distributed mechanism permits any new server in the network to dynamically announce its presence and allows existing servers to announce their up-to-date memory availability information to clients.
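
The sketch below illustrates, under assumed data-structure and field names, how a client might fold an RA into its per-server bookkeeping; the real module also merges the load feedback piggybacked on page-in/page-out replies in the same way.

#include <linux/jiffies.h>
#include <linux/list.h>
#include <linux/types.h>

/* Hypothetical per-server record kept by each client. */
struct server_info {
    struct list_head list;         /* linked into the client's server list  */
    unsigned char    mac[6];       /* layer-2 identity of the server        */
    unsigned long    pages_stored; /* load criterion (1), see Section 3.2.5 */
    unsigned long    reqs_served;  /* load criterion (2)                    */
    unsigned long    free_pages;   /* advertised spare capacity             */
    unsigned long    last_heard;   /* jiffies when the last RA was seen     */
};

/* Fold a Resource Announcement into the server table. */
static void client_handle_ra(struct server_info *s,
                             const struct rmap_header *h)
{
    s->free_pages  = h->u.advert.load_capacity;
    s->reqs_served = h->u.advert.load_status;
    s->last_heard  = jiffies;      /* resets the three-period liveness window */
}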

3.2.4 Soft-State Refresh

Distributed Anemone also includes soft-state refresh mechanisms (keep-alives) to permit clients to track the liveness of servers and vice versa. Firstly, the RA message serves the additional purpose of informing the client that the server is alive and accepting paging requests. In the absence of any paging activity, if a client does not receive the server's RA for three consecutive periods, it assumes that the server is offline and deletes the server's entries from its hash tables. If the client also had pages stored on the server that went offline, it needs to recover the corresponding pages from a copy stored either on the local disk or in
another server’s memory. Soft-state also permits servers to track the liveness of clients whose pages they store. Each client periodically transmits a Session Refresh message to each server that hosts its pages (1 message every 10 seconds in our prototype), which carries a client-specific session ID. The client module generates a different and unique ID each time the client restarts. If a server does not receive refresh messages with matching session IDs from a client for three consecutive periods, it concludes that the client has failed or rebooted and frees up any pages stored on that client’s behalf.
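
A simplified, hypothetical rendering of the server-side check is shown below; struct client_record and the helper function are placeholders, and it compresses the restart and timeout cases into one routine.

#include <linux/jiffies.h>
#include <linux/types.h>

#define REFRESH_PERIOD (10 * HZ)   /* one Session Refresh every 10 seconds */

/* Hypothetical per-client record kept by each server. */
struct client_record {
    u32           session_id;     /* changes every time the client restarts */
    unsigned long last_refresh;   /* jiffies of the last matching refresh   */
};

extern void server_free_client_pages(struct client_record *c); /* placeholder */

/* Called when a refresh arrives, and from a periodic timer with the last
 * observed ID: a silent or restarted client forfeits its stored pages. */
static void server_check_client(struct client_record *c, u32 seen_id)
{
    if (seen_id != c->session_id ||
        time_after(jiffies, c->last_refresh + 3 * REFRESH_PERIOD)) {
        server_free_client_pages(c);
        c->session_id   = seen_id;
        c->last_refresh = jiffies;
    }
}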

3.2.5 Server Load Balancing

Memory servers themselves are commodity nodes in the network that have their own processing and memory requirements. Hence another design goal of Anemone is to avoid overloading any particular server node as far as possible by transparently distributing the paging load evenly. In the earlier centralized architecture, this function was performed by the memory engine which kept track of server utilization levels. Distributed Anemone implements additional coordination among servers and clients to exchange accurate load information. Section 3.2.3 described the mechanism to perform resource discovery. Clients utilize the server load information gathered from resource discovery to decide the server to which they should send new page-out requests. This decision process is based upon one of two different criteria: (1) The number of pages stored at each active server and (2) The number of paging requests serviced by each active server. While (1) attempts to balance the memory usage at each server, (2) attempts to balance the request processing overhead.
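
The selection step can be sketched as a simple scan over the discovered servers, using whichever of the two criteria is configured; the list and field names continue the hypothetical struct server_info from the earlier sketch.

#include <linux/list.h>

/* Pick the destination for a new page-out using one of the two criteria. */
static struct server_info *pick_server(struct list_head *servers,
                                       int balance_by_requests)
{
    struct server_info *s, *best = NULL;
    unsigned long load, best_load = ~0UL;

    list_for_each_entry(s, servers, list) {
        if (!s->free_pages)
            continue;                    /* server is at capacity           */
        load = balance_by_requests ? s->reqs_served : s->pages_stored;
        if (load < best_load) {
            best_load = load;
            best = s;
        }
    }
    return best;   /* NULL: no server available, fall back to local disk   */
}

Both counters can be maintained without extra control traffic, since servers piggyback their load information on announcements and acknowledgments.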

3.2.6 Fault-tolerance

The ultimate consequence of failure in swapping to distributed memory is no worse than failure in swapping to local disk. However, the probability of failure is greater in a LAN environment because of the multiple components involved in the process, such as network cards, connectors, and switches. Although RMAP provides reliable packet delivery at the protocol level, as described in Section 3.2.2, our future work plans to build two alternatives for
tolerating server failures: (1) maintain a local disk-based copy of every memory page swapped out over the network. This provides the same level of reliability as disk-based paging, but risks performance interference from local disk activity. (2) Keep redundant copies of each page on multiple distributed servers. This approach avoids disk activity and reduces recovery time, but consumes bandwidth, reduces the global memory pool, and is susceptible to network failures. In an ideal implementation, the memory servers would participate in a protocol similar to RAID-5 [26].
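
For illustration only, the RAID-5-style alternative could be as simple as the XOR sketch below; the prototype does not implement it.

#include <stddef.h>
#include <string.h>

/* Compute a parity page over the copies held by n servers; any single lost
 * copy can then be rebuilt by XOR-ing the parity with the remaining copies. */
static void xor_parity(unsigned char *parity,
                       unsigned char *const data[], int n, size_t len)
{
    size_t i;
    int s;

    memset(parity, 0, len);
    for (s = 0; s < n; s++)
        for (i = 0; i < len; i++)
            parity[i] ^= data[s][i];
}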

3.3 Evaluation

The Anemone testbed consists of one 64-bit low-memory AMD 2.0 GHz client machine containing 256 MB of main memory and nine distributed-memory servers. The DRAM on these servers consists of four 512 MB machines, three 1 GB machines, one 2 GB machine, and one 3 GB machine, totaling almost 9 GB of distributed memory. The 512 MB servers range from 800 MHz to 1.7 GHz Intel processors; the other five machines are all 2.7 GHz and above Intel Xeons, with mixed PCI and PCI Express motherboards. For disk-based tests, we used a Western Digital WD800JD 80 GB SATA disk with a 7200 RPM rotational speed, 8 MB of cache, and an 8.9ms average seek time (which is consistent with our results). This disk has a 10 GB swap partition reserved on it to match the equivalent amount of distributed memory available in the cluster, which we use exclusively when comparing our system against the disk. Each machine is equipped with an Intel PRO/1000 gigabit Ethernet card connected to one of two 8-port gigabit switches, one from Netgear and one from SMC.

The performance results presented below can be summarized as follows. Distributed Anemone reduces read latencies to an average of 160µs, compared to an 8.3ms average for disk and a 500µs average for centralized Anemone. For writes, both disk and Anemone deliver similar latencies due to write caching. In our experiments, Anemone delivers a factor of 1.5 to 4 speedup for single-process LMAs, and up to a factor of 14 speedup for multiple concurrent LMAs. Our system can successfully operate with both multiple clients and multiple servers; we also run experiments in which multiple client machines simultaneously access the memory system, and these results are equally as successful as the single-process cases.

Figure 3.5: Random read disk latency CDF

3.3.1 Paging Latency

To begin the experiments, we first characterize the microbenchmark behavior we observe for different types of I/O, for both read and write streams. The next four graphs present these results for both our memory system and the disk. Figures 3.5, 3.6, 3.7, and 3.8 show the distribution of observed read and write latencies for sequential and random access patterns with both Anemone and disk. Though real-world applications rarely generate purely sequential or completely random memory access patterns, these graphs provide a useful measure for understanding the underlying factors that impact application execution times. Most random read requests to disk experience a latency between 5 and 10 milliseconds, whereas most requests in Anemone experience only around 160µs of latency. Most sequential read requests to disk are serviced by the on-board disk cache within 3 to 5µs because sequential read accesses fit well with the motion of the disk head. In contrast, Anemone delivers a range of latency values, most below 100µs. This is because network communication latency dominates in Anemone even for sequential requests, though it is masked to some extent by the prefetching performed by the pager and the file-system within the Linux kernel. The write latency distributions for both disk and Anemone are comparable, with most latencies being close to 9µs, because writes typically return after writing to the local Linux buffer cache (which is now a unified page cache in Linux 2.6).

Figure 3.6: Sequential read disk latency CDF

Figure 3.7: Random write disk latency CDF

Figure 3.8: Sequential write disk latency CDF

Application    Size (GB)    Local Mem    Distr. Anemone    Disk     Speedup (Disk / Anemone)
Povray         3.4          145          1996              8018     4.02
Quicksort      5            N/A          4913              11793    2.40
NS2            1            102          846               3962     4.08
KNN            1.5          62           7.1               2667     3.7

Table 3.1: Average application execution times (in seconds) and speedups for local memory, Distributed Anemone, and disk. N/A indicates insufficient local memory.

3.3.2 Application Speedup

Single-Process LMAs: Table 3.1 summarizes the performance improvements seen by unmodified single-process LMAs using the Anemone system. This setup is similar to the previous microbenchmarks: a single LMA process on a single client node uses the memory system consisting of all nine available servers. The first application is a ray-tracing program called POV-Ray [81]. The memory consumption of POV-Ray was varied by rendering different scenes with an increasing number of colored spheres.

Figure 3.9: Execution times of POV-ray for increasing problem sizes.

Figure 3.9 shows the completion times of these increasingly large renderings, up to 3.4 GB of memory, versus the disk using an equal amount of local swap space. The figure clearly shows that Anemone delivers increasing application speedups with increasing memory usage and is able to improve the execution time of a single-process POV-ray rendering by a factor of 4 at 3.4 GB of memory usage. The second application is a large in-memory Quicksort program that uses a C++ STL-based implementation [89], with a complexity of O(N log N) comparisons. We sorted randomly populated large in-memory arrays of integers. Figure 3.10 shows that Anemone delivers a factor of 2.4 speedup for a single-process Quicksort using 5 GB of memory. The third application is the popular NS2 network simulator [75]. We simulated a delay partitioning algorithm [42] on a 6-hop wide-area network path using voice-over-IP traffic traces. Factors contributing to memory usage in NS2 include the number of nodes being simulated, the amount of traffic sent between nodes, and the choice of protocols at different layers. Table 3.1 shows that, with NS2 requiring 1 GB of memory, Anemone speeds up the simulation by a factor of 4 compared to disk-based paging.

Figure 3.10: Execution times of STL Quicksort for increasing problem sizes.

The fourth application is the k-nearest neighbor (KNN) search algorithm on large 3D datasets, using code from [29]. This algorithm is useful in applications such as medical imaging, molecular biology, CAD/CAM, and multimedia databases. Table 3.1 shows that, when executing the KNN search algorithm over a dataset of 2 million points consuming 1.5 GB of memory, Anemone speeds up the search by a factor of 3.7 over disk-based paging.

Multiple Concurrent LMAs: In this section, we test the performance of Anemone under varying levels of concurrent application execution. Multiple concurrently executing LMAs tend to stress the system by competing for computation, memory, and I/O resources and by disrupting any sequentiality in paging activity, including competition for buffer space on the network switch itself, particularly at gigabit speeds. Figures 3.11 and 3.12 show the execution time comparison of Anemone and disk as the number of POV-ray and Quicksort processes increases. The execution time measures the interval between the start of execution and the completion of the last process in the set. We try to keep each process at around 100 MB of memory. The figures show that the execution times using disk-based swap increase steeply with the number of processes: paging activity loses its sequentiality as the number of processes grows, making the disk seek and rotational overheads dominant. On the other hand, Anemone reacts very well, as execution time increases very slowly, because network latencies are mostly constant regardless of sequentiality. With 12–18 concurrent LMAs, Anemone achieves speedups of a factor of 14 for POV-ray and a factor of 6.0 for Quicksort.

Figure 3.11: Execution times of multiple concurrent processes executing POV-ray.

Figure 3.12: Execution times of multiple concurrent processes executing STL Quicksort.

Figure 3.13: Effects of varying the transmission window using a 1 GB Quicksort (bandwidth achieved in Mbit/s, number of retransmissions, and completion time in seconds, on a log scale, versus maximum window size).

3.3.3 Tuning the Client RMAP Protocol

One of the important knobs in RMAP's flow control mechanism is the client's transmission window size. Using a 1 GB Quicksort, Figure 3.13 shows the effect of changing this window size on three characteristics of Anemone's performance: (1) the number of retransmissions, (2) the paging bandwidth, represented in terms of “goodput”, i.e., the amount of bandwidth obtained after excluding retransmitted bytes and header bytes, and (3) the completion time. Recall that our implementation of the RMAP protocol uses a static window size, configured once before runtime. This means that the fully dynamic flow control you would expect from a TCP-style protocol is not provided. As a result, the window size is chosen empirically to be large enough to maintain network throughput but small enough to fit within the capabilities of the NIC's ring buffers. A complete implementation of RMAP would provide a dynamic flow-control window, but we leave that to future work. Figure 3.13 shows that as the window size increases, the number of retransmissions increases, because the number of packets that can potentially be delivered back-to-back also increases. For larger window sizes, the paging bandwidth also increases and then saturates, because the transmission link remains busy more often, delivering higher goodput in spite of the initial increase in the number of retransmissions. However, if driven too high, the window size causes the paging bandwidth to decline considerably due to an increasing number of packet drops and retransmissions. The application completion times depend upon the paging bandwidth: initially, an increase in window size increases the paging bandwidth and lowers the completion times; driven too high, it causes more packet drops, more retransmissions, lower paging bandwidth, and higher completion times.
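
A minimal sketch of this static-window behavior is shown below; the counters and helper functions are hypothetical, and locking and the retransmission timers are elided.

#include <asm/atomic.h>
#include <linux/types.h>

struct rmap_client;                   /* placeholder for per-client state   */
struct rmap_msg;                      /* placeholder for a queued message   */

/* placeholder helpers assumed to exist elsewhere in the module */
extern struct rmap_msg *dequeue_pending(struct rmap_client *c);
extern void rmap_xmit(struct rmap_client *c, struct rmap_msg *m);
extern int complete_outstanding(struct rmap_client *c, u32 sequence);

static int rmap_window = 8;           /* chosen empirically before runtime  */
static atomic_t rmap_outstanding = ATOMIC_INIT(0);

/* Put messages on the wire only while the static window has room. */
static void rmap_try_transmit(struct rmap_client *c)
{
    while (atomic_read(&rmap_outstanding) < rmap_window) {
        struct rmap_msg *m = dequeue_pending(c);
        if (!m)
            break;
        atomic_inc(&rmap_outstanding);
        rmap_xmit(c, m);
    }
}

/* An ACK closes one outstanding slot and may let the next message go out;
 * lost ACKs are recovered by the retransmission timers, elided here. */
static void rmap_handle_ack(struct rmap_client *c, u32 sequence)
{
    if (complete_outstanding(c, sequence)) {
        atomic_dec(&rmap_outstanding);
        rmap_try_transmit(c);
    }
}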

3.3.4 Control Message Overhead

To measure the control traffic overhead due to RMAP, we measured the percentage of control bytes generated by RMAP relative to the amount of data bytes transferred while executing a 1 GB POV-Ray application. Control traffic refers to page headers, acknowledgments, resource announcement messages, and soft-state refresh messages. We first varied the number of servers from 1 to 6, with a single client executing the POV-Ray application. Next, we varied the number of clients from 1 to 4 (each executing one instance of POV-Ray), with 3 memory servers. The percentage of control traffic overhead was consistently measured at 1.74% – a very small fraction of the total paging traffic.

3.4 Summary

In this Chapter, we presented Distributed Anemone – a system that enables unmodified large memory applications to transparently utilize the unused memory of nodes across a gigabit Ethernet LAN. Unlike its centralized predecessor, Distributed Anemone features fully distributed memory resource management, low-latency distributed memory paging, distributed resource discovery, load balancing, soft-state refresh to track the liveness of nodes, and the flexibility to use Jumbo Ethernet frames. We presented the architectural design and implementation details of a fully operational Anemone prototype. Evaluations using multiple real-world applications, including ray-tracing, large in-memory sorting, network simulations, and nearest neighbor search, show that Anemone speeds up single-process applications by up to a factor of 4 and multiple concurrent processes by up to a factor of 14, compared to disk-based paging. Average page-fault latencies are reduced from 8.3ms with disk-based paging to 160µs with Anemone.

Chapter 4

MemX: Virtual Machine Uses of Distributed Memory

In this Chapter, we present our experiences in developing a fully transparent distributed system, called MemX, within the Xen VM environment that coordinates the use of cluster-wide memory resources to support large memory workloads.

4.1 Introduction

In modern cluster-based platforms, VMs can enable functional and performance isolation across applications and services. VMs also provide greater resource allocation flexibility, improve utilization efficiency, enable seamless load balancing through VM migration, and lower the operational cost of the cluster. Consequently, VM environments are increasingly being considered for executing grid and enterprise applications over commodity high-speed clusters. However, such applications tend to have memory workloads that can stress the limited resources within a single VM by demanding more memory than the slice available to the VM. Clustered bastion hosts (mail, network-attached storage), data mining applications, scientific workloads, virtual private servers, and backend support for websites are common examples of resource-intensive workloads. I/O bottlenecks in these applications can quickly form due to frequent access to large disk-resident datasets, paging
activity, flash crowds, or competing VMs on the same node. Even though virtual machines with demanding workloads are here to stay as integral parts of modern clusters, significant improvements are needed in the ability of memory-constrained VMs to handle these workloads. I/O activity due to memory pressure can prove to be particularly expensive in a virtualized environment, where all I/O operations need to traverse an extra layer of indirection. Over-provisioning of memory resources (and in general any hardware resource) within a physical machine may not always be a viable solution, as it can lead to poor resource utilization efficiency, besides increasing operational costs. Although domain-specific out-of-core computation techniques [56, 65] and migration strategies [71, 17, 27] can also improve application performance up to a certain extent, they do not overcome the fundamental limitation that an application is restricted to using the memory resources within a single physical machine, particularly for the aforementioned applications that are not generally parallelized.

In this Chapter, we present the design, implementation, and evaluation of the MemX system for VMs, which bridges the I/O performance gap in a virtualized environment by exploiting low-latency access to the memory of other nodes across a gigabit cluster. MemX is fully transparent to user applications – developers do not need any specialized APIs, libraries, recompilation, or relinking for their applications, nor does the application's dataset need any special pre-processing, such as data partitioning across nodes. We compare and contrast the three modes in which MemX can operate with Xen VMs [20]:

1. MemX-DomU: the system within individual guest virtual machines. The letter ‘U’ in “DomU” refers to the guest domain; specifically, it refers to the fact that these domains are “unprivileged” relative to Dom0.

2. MemX-DD: the system within a common driver domain (DD); in this case, Dom0 functions as the DD. Here the system is shared by multiple guest OSes that co-reside with the DD. We use “DD” to indicate that the client module runs in the same place as in the MemX-Dom0 case (within Domain 0 itself), except that the client module is actually used by applications located
within guest VMs (DomU) rather than by applications within the driver domain (Dom0) itself.

3. MemX-Dom0: the distributed memory system within the driver domain itself (“Dom0” in Xen terms). This represents the base virtualization overhead without the presence of other guest virtual machines.

The proposed techniques can also work with other VM technologies besides Xen. We focus on Xen mainly due to its open source availability and para-virtualization support. In the performance section, we also compare all three options to the baseline case where a regular, non-virtualized Linux system is used, as described in Chapter 3.

4.2 Split Driver Background

As we stated in Chapter 2, Xen is an open source virtualization technology that provides secure resource isolation. Xen provides close to native machine performance through the use of para-virtualization [97] – a technique by which the guest OS is co-opted into reducing the virtualization overheads via modifications to its hardware-dependent components. The modifications enable the guest OS to execute over virtualized hardware and devices rather than over bare metal. In this section, we review the background of the Xen I/O subsystem as it relates to the design of MemX. Xen exports I/O devices to each guest OS (DomU) as virtualized views of “class” devices as opposed to real physical devices. For example, Xen exports a block device or a network device, rather than a specific hardware make and model. The actual drivers that interact with the native hardware devices execute within Dom0 – the privileged domain that can directly access all hardware in the system. Dom0 acts as the management VM that coordinates device access and privileges among all of the other guest domains. In the rest of the Chapter, we will use the terms driver domain and Dom0 interchangeably. Physical devices (and their device drivers) can be multiplexed among multiple concurrently executing guest OSes. To enable this multiplexing, the privileged driver domain and the unprivileged guest domains (DomU) communicate by means of a split device-driver architecture.

Figure 4.1: Split Device Driver Architecture in Xen.

This architecture is shown in Figure 4.1. The driver domain hosts the backend of the split driver for the device class and the DomU hosts the frontend. The backends and frontends interact using high-level device abstractions instead of low-level hardware-specific mechanisms. For example, a DomU only cares that it is using a block device; it does not worry about the specific type of driver that is controlling that block device. Frontends and backends communicate with each other via the grant table: an in-memory communication mechanism that enables efficient bulk data transfers across domain boundaries. The grant table enables one domain to allow another domain access to its pages in system memory. The access mechanism can include read, write, or mutual exchange of pages. The primary use of the grant table in device I/O is to provide a fast and secure mechanism for unprivileged DomU domains to receive indirect access to hardware devices. It enables the driver domain to set up a DMA-based data transfer directly to/from the system memory of a DomU, rather than performing the DMA to/from the driver domain's memory with additional copying of the data between the DomU and the driver domain. In other words, the grant table enables zero-copy data transfers across domain boundaries.
The grant table can be used to either share or transfer pages between the DomU and the driver domain, depending upon whether the I/O operation is synchronous or asynchronous in nature. For example, because block devices perform synchronous data transfer, the driver domain knows at the time of I/O initiation which DomU requested the block I/O. In this case, the frontend of the block driver in the DomU notifies the Xen hypervisor (via the gnttab_grant_foreign_access hypercall) that a memory page can be shared with the driver domain. A hypercall is the hypervisor's equivalent of a system call in the operating system. The DomU then passes a grant table reference ID via the event channel to the driver domain, which sets up a direct DMA to/from the memory page of the DomU. Once the DMA is complete, the DomU removes the grant reference (via the gnttab_end_foreign_access call). On the other hand, network devices receive data asynchronously. This means that the driver domain does not know the target DomU for an incoming packet until the entire packet has been received and its header examined. In this situation, the driver domain DMAs the packet into its own page and notifies the Xen hypervisor (via the gnttab_grant_foreign_transfer call) that the page can be transferred to the target DomU. The driver domain then transfers the received page to the target DomU and receives a free page in return from the DomU. In summary, Xen's I/O subsystem for shared physical devices uses a split driver architecture that involves an additional level of indirection through the driver domain and the Xen hypervisor, with efficient optimizations to avoid data copying during bulk data transfers.
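
The grant-table calls named above are used roughly as in the following sketch of a synchronous, block-style transfer from a frontend. struct memx_front, struct memx_ring_req, and ring_next_request() are hypothetical placeholders; the gnttab_* and event-channel calls are the interfaces exported by Xen-patched Linux kernels of this period, although header paths and details differ between the XenLinux and pvops trees.

#include <xen/grant_table.h>   /* <xen/gnttab.h> in the XenLinux tree      */
#include <xen/events.h>        /* notify_remote_via_irq()                  */
#include <asm/xen/page.h>      /* pfn_to_mfn(); path varies by tree        */

struct memx_front {            /* hypothetical frontend state              */
    domid_t backend_domid;     /* driver domain hosting the backend        */
    int     irq;               /* bound to the shared event channel        */
};

struct memx_ring_req {         /* hypothetical ring-buffer request         */
    grant_ref_t gref;
    u64         sector;
    int         write;
};

extern struct memx_ring_req *ring_next_request(struct memx_front *f);

static void frontend_queue_page(struct memx_front *f, struct page *page,
                                u64 sector, int write)
{
    unsigned long mfn = pfn_to_mfn(page_to_pfn(page));
    struct memx_ring_req *req = ring_next_request(f);

    /* Grant the backend access to the page: read-only when the backend
     * only needs to read it (a write request), writable otherwise. */
    req->gref   = gnttab_grant_foreign_access(f->backend_domid, mfn,
                                              write ? 1 : 0);
    req->sector = sector;
    req->write  = write;

    notify_remote_via_irq(f->irq);     /* kick the backend's event channel */
}

static void frontend_complete_page(grant_ref_t ref)
{
    /* Response seen on the ring: withdraw the backend's access. */
    gnttab_end_foreign_access(ref, 0, 0UL);
}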

4.3 Design and Implementation

The core functionality of the MemX system partially builds upon our previous work and is encapsulated within kernel modules that do not require modifications to either the Linux kernel or the Xen hypervisor. However, the interaction of the core modules with the rest of the virtualized subsystem presents several alternatives. In this section, we briefly discuss the different design alternatives for the MemX system, justify the decisions we make, and present the implementation details.

Figure 4.2: MemX-Linux: Baseline operation of MemX in a non-virtualized Linux environment. The client can communicate with multiple memory servers across the network to satisfy the memory requirements of large memory applications.

4.3.1 MemX-Linux: MemX in Non-virtualized Linux

Figure 4.2 shows the operation of MemX in a non-virtualized (vanilla) Linux environment. An earlier variant of MemX-Linux was published in [51]; MemX-Linux includes several additional features listed later in this section. For completeness, we summarize the architecture of MemX-Linux here and use it as a baseline for comparison with the other, virtualized versions of MemX – the primary focus of this work. The two main components of MemX-Linux are the client module on the low-memory machines and the server module on the machines with unused memory. The two communicate with each other using the Remote Memory Access Protocol (RMAP), described in detail in Chapter 3, Section 3.2.2. Both client and server components execute as isolated Linux kernel modules. Aside from optimizations, this code operates much the same way as described in Chapter 3. Nevertheless, there are a number of important changes to that work, and we present a brief summary of those components here.

Client and Server Modules: The client module provides a virtualized block device interface to the large dataset applications executing on the client machine. This block device can either be (a) configured as a low-latency primary swap device, (b) treated as a low-latency volatile store for large data sets accessed via the standard file-system interface, or (c) memory mapped into the address space of an executing large memory application. Internally, the client module maps the single linear I/O space of the block device to the unused memory of multiple distributed servers, using a memory-efficient radix-tree based mapping. The old system used a hashtable-based implementation, but we found that it used large amounts of memory for the table data structure (for buckets and entries), particularly as we acquired newer machines with substantially more memory than our old ones. A radix tree is a modified trie structure in which the tree is indexed by strings over an alphabet, one character at a time. This works perfectly for keys such as addresses and file offsets – key types that are used everywhere throughout the distributed memory system. As before, the memory system discovers and communicates with distributed server modules using a custom-designed, reliable protocol. Servers broadcast periodic resource announcement messages which the client modules can use to discover the available memory servers. Servers also include feedback about their memory availability and load during both resource announcements and regular page transfers with clients. When a server reaches capacity, it declines to serve any new write requests from clients, which then try to select another server, if available, or otherwise write the page to disk. Binding these modules together is the Remote Memory Access Protocol (RMAP), described later in much more detail than was provided in the previous Chapter. The server module is also designed to allow a server node to be taken down while live; our RMAP implementation can disperse, re-map, and load-balance an individual server's pages to any other servers in the cluster that are capable of absorbing those pages, allowing the server to shut down without killing any of its clients' applications. Getting this custom protocol to work properly in a virtualized environment exposed a great number of kernel bugs that were not originally present in the Anemone prototype, making the system much more robust.
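
A sketch of that per-device mapping, using the stock Linux radix-tree API keyed by the 4KB page index within the block device's linear space, is shown below. struct memx_mapping and its fields are assumptions made for illustration.

#include <linux/gfp.h>
#include <linux/radix-tree.h>
#include <linux/types.h>

struct server_info;               /* per-server record, defined elsewhere  */

/* Hypothetical per-page record: which server holds the page and where. */
struct memx_mapping {
    struct server_info *server;
    u64                 remote_offset;
};

/* One tree per exported client device, keyed by page index. */
static RADIX_TREE(memx_map, GFP_ATOMIC);

static int memx_map_insert(unsigned long page_index, struct memx_mapping *m)
{
    return radix_tree_insert(&memx_map, page_index, m);
}

static struct memx_mapping *memx_map_lookup(unsigned long page_index)
{
    return radix_tree_lookup(&memx_map, page_index);
}

static void memx_map_remove(unsigned long page_index)
{
    radix_tree_delete(&memx_map, page_index);
}

A lookup on the page index then yields the destination server for a page-in request in a handful of pointer dereferences, which matters when one such tree is kept per local VM, as discussed under Additional Virtualization Features below.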

Additional Virtualization Features: MemX also includes a couple of additional features that are not the specific focus of this work and which were not present in the original Anemone system either. The first is the ability to support named distributed memory data spaces that can be shared by multiple clients. This provides a read-only DSM system in which data stored on server nodes remains persistent through the life of the server, even when client nodes disconnect from the system altogether. When a client re-connects, all servers that have past records for that client re-forward the necessary mapping information, allowing the client to re-construct its radix-tree mappings and begin re-accessing the same persistent data. The system does not allow multiple concurrent writers, however, as this was not the focus of our work. There are two other features that turned out to be very important to the memory system as a whole for virtualization-specific reasons. First, because of the way the split-driver system described above works, the device driver needs to be able to support multiple major and minor block numbers. The driver is then responsible for mapping one device per local virtual machine on a physical host, allowing completely seamless, transparent access by multiple VM clients on the same host. This was part of the motivation behind switching to a radix tree over a hashtable: the mapping data stored to look up page locations would be relatively large for so many virtual machines, on the order of tens of megabytes. It turned out that the worst-case lookup time for the tree was comparable to the hashtable and did not take away from the efficient performance of the system. Second, we had to optimize the fragmentation implementation designed in the Anemone system: it had to be able to support zero-copy transmission and receipt of page fragments, which would not have worked properly in the original system.

4.3.2 MemX-DomU (Option 1): MemX Client Module in DomU

Figure 4.3: MemX-DomU: Inserting the MemX client module within DomU's Linux kernel. The server executes in non-virtualized Linux.

In order to support large dataset applications within a VM environment, the simplest design option is to place the MemX client module within the kernel of each guest OS (DomU), whereas the distributed server modules continue to execute within a non-virtualized Linux kernel on machines connected to the physical network. This option is illustrated in Figure 4.3. The client module exposes the block device interface to large memory applications within the DomU as in the baseline, but communicates with the distributed servers using the virtualized network interface (VNIC) exported by the network driver in the driver domain. The VNIC in Xen is also organized as a split device driver in which the frontend (residing in the guest OS) and the backend (residing in the driver domain) talk to each other using well-defined grant table and event channel mechanisms. Two event channels are used between the backend and frontend of the VNIC – one for packet transmissions and one for packet receptions. To perform zero-copy data transfers across the domain boundaries, the VNIC performs a page exchange with the backend for every packet received or transmitted using the grant table. All backend interfaces in the driver domain can communicate with the physical NIC as well as with each other via a virtual network bridge. Each VNIC is assigned its own MAC address, whereas the driver domain's own internal VNIC in Dom0 uses the physical NIC's MAC address. The physical NIC itself is placed in promiscuous mode by the driver domain to enable the reception of any packet addressed to any of the local virtual machines.
The virtual bridge demultiplexes incoming packets directed towards the target VNIC's backend driver. Compared to the baseline non-virtualized MemX-Linux deployment, MemX-DomU has the additional overhead of requiring every network packet to traverse domain boundaries in addition to being multiplexed or demultiplexed at the virtual network bridge. Additionally, the client module needs to be separately inserted within each DomU that might potentially execute large memory applications. Also note that each I/O request is typically 4KB in size, whereas our network hardware uses a 1500-byte MTU (maximum transmission unit), unless the underlying network supports Jumbo frames. Thus the client module needs to fragment each 4KB write request into (and reassemble a complete read reply from) at least 3 network packets. In MemX-DomU each fragment needs to traverse the domain boundary to reach the backend. Due to current memory allocation policies in Xen, buffering for each fragment ends up consuming an entire 4KB page worth of memory, which results in three times the actual memory needed within the machine. In the non-virtualized case, each of those fragments would still come from the same physical page because of the internal Linux slab allocator, but virtualization requires those fragments to be separated out. Newer Xen versions may offer solutions to this type of problem, but we leave it for now. We will contrast this performance overhead in greater detail with MemX-DD (option 2) below.

4.3.3 MemX-DD (Option 2): MemX Client Module in Driver Domain

A second design option is to place the MemX client module within the driver domain (Dom0) and allow multiple DomUs to share this common client module via their virtualized block device (VBD) interfaces. This option is shown in Figure 4.4. The guest OS executing within the DomU VM does not require any MemX-specific modifications. The MemX client module executing within the driver domain exposes a block device interface, as before. Any DomU whose applications require distributed memory resources configures a split VBD. The frontend of the VBD resides in the DomU and the backend in the block driver domain. The frontend and backend of each VBD communicate using event channels and the grant table, as in the earlier case of VNICs.

Figure 4.4: MemX-DD: Executing a common MemX client module within the driver domain, allowing multiple DomUs to share a single client module. The server module continues to execute in non-virtualized Linux.

(This splitting of interfaces is completely automated by the Xen system itself.) The MemX client module provides a separate VBD lettered slice (/dev/memx{a,b,c}, and so on) for each backend that corresponds to a distinct DomU. On the network side, the MemX client module attaches itself to the driver domain's VNIC, which in turn talks to the physical NIC via the virtual network bridge. For performance reasons, we assume here that the VNIC and the disk drivers are co-located - meaning both reside within the same privileged driver domain (Dom0). Thus the driver domain's VNIC does not need to be organized as another split driver; rather, it is a single software construct that can attach directly to the virtual bridge.

During execution within a DomU, read/write requests to distributed memory are generated in the form of synchronous I/O requests to the corresponding virtual block device frontend. These requests are sent to the MemX client module via the event channel and the grant table. The client module packages each I/O request into network packets and transmits them asynchronously to distributed memory servers using RMAP. Note that, although the network packets still need to traverse the virtual network bridge, they no longer need to traverse a split VNIC architecture, unlike in MemX-DomU. One consequence is that, while the client module still needs to fragment a 4KByte I/O request into 3 network packets to fit the MTU requirements, each fragment no longer needs to occupy an entire 4KByte buffer, unlike in MemX-DomU. As a result, only one 4KByte I/O request needs to cross the domain boundary across the split block device driver, as opposed to three 4KB packet buffers in Section 4.3.2. Finally, since the guest OSes within DomUs do not require any MemX-specific software components, the DomUs can potentially run any para-virtualized OS and not just XenoLinux. However, compared to the non-virtualized baseline case, MemX-DD still has the additional overhead of using the split VBD and the virtual network bridge, though still with highly acceptable performance.

Also note that, unlike MemX-DomU, MemX-DD does not currently support seamless migration of live Xen VMs using distributed memory. This is because part of the internal state of the guest OS (in the form of page-to-server mappings) resides in the driver domain of MemX-DD and is not automatically transferred by the migration mechanism in Xen. We plan to enhance Xen's migration mechanism to transfer this internal state information in a host-independent manner to the target machine's MemX-DD module.


Furthermore, our current implementation does not support per-DomU reservation of distributed memory, which can potentially violate isolation guarantees. This reservation feature is currently being added to our prototype.
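Such a per-DomU reservation could, for instance, reduce to a simple admission check in the client module before a page is shipped to a remote server. The sketch below is purely illustrative; the structure and function names are hypothetical and are not part of the current prototype.

/* Hypothetical per-DomU accounting for distributed memory (not in the current
 * prototype): reject a remote page write once a guest exceeds its quota. */
struct domu_quota {
    unsigned int  domid;
    unsigned long reserved_pages;   /* administrator-configured limit       */
    unsigned long used_pages;       /* pages currently held on remote hosts */
};

int memx_admit_page(struct domu_quota *q)
{
    if (q->used_pages >= q->reserved_pages)
        return -1;                  /* over quota: fall back to local disk  */
    q->used_pages++;                /* page will be sent to a remote server */
    return 0;
}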

4.3.4 MemX-Dom0 (Option 3)

As mentioned in the introduction, we also present the scenario in which the distributed memory system runs within Dom0 (the same as the driver domain), except that applications execute directly within this domain rather than inside a guest domain. This configuration represents the base virtualization overhead without the presence of other guest virtual machines.

4.3.5 Alternative Options

Guest Physical Address Space Expansion: Another alternative to supporting large memory applications with direct distributed memory is to enable this support indirectly, via a larger pseudo-physical memory address space than is normally available within the physical machine. This option would require fundamental modifications to the memory management in both the Xen hypervisor as well as the guest OS. In particular, at boot time, the guest OS would believe that it has a large "physical" memory – the so-called pseudo-physical memory space. It then becomes the Xen hypervisor's task to map each DomU's large pseudo-physical address space partly into guest-local memory, partly into distributed memory, and the rest to secondary storage. This is analogous to the large conventional virtual address space available to each process, which is managed transparently by traditional operating systems. The functionality provided by this option is essentially equivalent to that provided by MemX-DomU and MemX-DD. However, this option requires the Xen hypervisor to take a prominent role in the memory address translation process, something that the original design of Xen strives to minimize. Exploring this option is the focus of Chapter 6.

MemX Server Module in DomU: Technically speaking, we can also execute the MemX server module within a guest OS by itself, coupled with Option 1 or 2 above.


This could enable one to initiate a VM solely for the purpose of providing distributed memory to other low-memory client VMs, either across the cluster or even within the same physical machine. Practically, however, this option does not seem to provide any significant functional benefit, whereas the overheads of executing the server module within a DomU are considerable. It is also unnecessary because our system already supports the re-distribution of server memory to nearby servers, allowing a server to shut down if necessary. This obviates the need to run the server module within a virtual machine, so we do not pursue this option further.

4.3.6 Network Access Contention

Handling network contention within the physical machine itself was the biggest (solvable) difficulty arising from our decision to implement RMAP without TCP/IP. Three major factors contribute to network contention in our system:

• Inter-VM Congestion: MemX generates traffic at the block-I/O level. In a virtual machine environment, each guest VM on a given node assumes that it has full control of the NIC, when in reality that NIC is generally shared among multiple VMs. We elaborate on this simple but important problem of inter-VM congestion in Section 4.4.3 while evaluating multiple-VM performance.

• Flow Control: Currently, RMAP uses a static send window per MemX node. In a subnet with fairly constant round-trip times, this serves us well, although a reactive approach in which the receiver informs the client of the size of its receive window could be easily deployed. We have not yet observed a need for this feature.

• Switch/Server Congestion: MemX servers in the network can potentially be the destination for dozens of clients' pages. Two or more clients generating traffic towards a particular server can quickly overwhelm both the switch port and the server itself. As a partial solution to this problem, MemX clients perform load balancing across MemX servers by dynamically selecting the least loaded server for page write operations (a sketch of this selection appears after this list). Empirically, we have observed that congestion happens only when the number of clients significantly outweighs the number of servers.


If MemX were scaled to hundreds of switched nodes, a cross-bar or fat-tree design, in addition to more advanced switch-bound congestion control, would be mandatory; our 8-node cluster has not warranted this as of yet. We plan to handle this if our testbed scales to more nodes.
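A minimal sketch of the client-side mechanisms just listed, a static send window plus least-loaded server selection for page writes, is shown below. The structure, field, and constant names are hypothetical and only stand in for the actual RMAP client state.

#include <stddef.h>

/* Illustrative sketch: respect a static send window per server and pick the
 * least-loaded server for a page write.  Not the actual module code. */
#define SEND_WINDOW 64              /* assumed static per-node window size     */

struct memx_server {
    unsigned long pages_stored;     /* load estimate advertised by the server  */
    unsigned long inflight;         /* unacknowledged requests outstanding     */
};

struct memx_server *pick_write_target(struct memx_server *srv, int nsrv)
{
    struct memx_server *best = NULL;
    int i;

    for (i = 0; i < nsrv; i++) {
        if (srv[i].inflight >= SEND_WINDOW)        /* respect the send window  */
            continue;
        if (best == NULL || srv[i].pages_stored < best->pages_stored)
            best = &srv[i];                        /* least loaded so far      */
    }
    return best;   /* NULL => all windows full: caller blocks the requester */
}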

4.4 Evaluation

In this section we evaluate the performance of the different variants of MemX. Our goal is to answer the following questions:

• How do the different variants of MemX compare in terms of I/O latency and bandwidth?
• What are the overheads incurred by MemX due to virtualization in Xen?
• What type of speedups can be achieved by real large memory applications using MemX when compared to virtualized disk?
• How well does MemX perform in the presence of multiple concurrent VMs?

Our testbed consists of eight machines. Each machine has 4 GB of memory, an SMP 64-bit dual-core 2.8 GHz processor, and one gigabit Broadcom Ethernet NIC. Our Xen version is 3.0.4 and our XenoLinux version is 2.6.16.33. Backend MemX servers run vanilla Linux 2.6.20. Collectively, this provides us with over 24 GB of effectively usable cluster-wide memory after accounting for roughly 1 GB of local memory usage per node. We limit the local memory of client machines to a maximum of 512 MB under all test cases. In addition to the three MemX configurations described earlier, namely MemX-Linux, MemX-DomU, and MemX-DD, we also include a fourth configuration – MemX-Dom0 – for the sole purpose of performance evaluation. This additional configuration corresponds to the MemX client module executing within Dom0 itself, but not as part of the backend for a VBD. Rather, the client module in MemX-Dom0 serves large memory applications executing within Dom0, and helps to measure the basic virtualization overhead due to Xen. Furthermore, whenever we mention the "disk" baseline, we are referring to virtualized disk within Dom0. When MemX-DD or MemX-DomU is compared to virtualized disk in any experiment, it means that we exported the virtualized disk as a frontend VBD to the dependent guest VM, just as we exported the block device from MemX itself to applications.

              MemX-Linux   MemX-Dom0   MemX-DD    MemX-DomU   Virtualized Disk
Kernel RTT    85 usec      95 usec     95 usec    115 usec    8.3 millisec

Table 4.1: Kernel-level I/O round-trip latency for each MemX combination and for virtualized disk.

4.4.1 Latency and Bandwidth Microbenchmarks

Figure 4.5 and Table 4.1 characterize the different MemX combinations in terms of these two metrics. Table 4.1 shows the average round-trip time (RTT) for a single 4KB read request transmitted from a client module and replied to by a server node. The RTT is measured in microseconds, using the on-chip time stamp counter (TSC) register at the kernel level in the client module, immediately before transmission to the NIC and immediately after reception of the ACK from the NIC. Thus the measured RTT values include only MemX-related time components and exclude the variable time required to deliver the page to user level, put that process back on the ready queue, and perform a context switch. Moreover, this is the latency that the VFS (virtual filesystem) or the system pager would experience when sending I/O to and from MemX.

MemX-Linux, as a base case, provides an RTT of 85µs. Following close behind are MemX-Dom0, MemX-DD, and MemX-DomU, in that order. The virtualized disk base case performs as expected at an average of 8.3ms. These RTT numbers show that accessing the memory of a remote machine over the network is about two orders of magnitude faster than accessing local virtualized disk. Also, the Xen VMM introduces a negligible overhead of 10µs in MemX-Dom0 and MemX-DD over MemX-Linux. Similarly, the split network driver architecture, which needs to transfer 3 packet fragments for each 4KB block across the domain boundaries, introduces an overhead of another 20µs in MemX-DomU over MemX-Dom0 and MemX-DD.
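For reference, a kernel-level RTT measurement of this kind can be taken directly from the x86 TSC. The sketch below is illustrative only: the send/receive helpers are placeholders for the client module's actual paths, and the cycles-to-microseconds conversion assumes a known CPU frequency in MHz.

#include <stdint.h>

/* Placeholders standing in for the client module's transmit/receive paths. */
void send_page_request(void) { /* hand a 4KB read request to the NIC */ }
void wait_for_reply(void)    { /* block until the server's reply arrives */ }

/* Read the x86 time stamp counter (TSC). */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Timestamp immediately before the request is handed to the NIC and
 * immediately after the reply is received; elapsed cycles divided by the
 * CPU frequency in MHz gives microseconds. */
uint64_t measure_rtt_usec(uint64_t cpu_mhz)
{
    uint64_t start = rdtsc();

    send_page_request();
    wait_for_reply();

    return (rdtsc() - start) / cpu_mhz;
}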

Figure 4.5 shows throughput measurements using a custom benchmark [52] that issues long streams of random or sequential, asynchronous 4KB requests. We ensure that the range of requests is at least twice the size of the local memory of a client node (about 1 GB or more).

Figure 4.5: I/O bandwidth for different MemX configurations, using a custom benchmark that issues asynchronous, non-blocking 4KB I/O requests. "DIO" refers to opening the file descriptor with direct I/O turned on, to compare against bypassing the Linux page cache.

These tests give us insight during development into where bottlenecks might exist. The throughput in all of the tests generally reaches its maximum, less the effect of CPU overhead. A small loss of about 50 Mbits/second naturally occurs for MemX-DomU, which is to be expected. The only case that suffers is random reads, which hover around 300 Mbits/second. There is a very specific reason for this, which is a direct artifact of the way the VFS in the Linux kernel handles asynchronous I/O (AIO) [60].
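To give a flavor of the request stream involved, the following stripped-down sketch issues random 4KB reads through the Linux AIO interface (libaio). It is not the actual benchmark source [52]; the device path and request count are placeholders, and it keeps only one request in flight for brevity, whereas the real benchmark keeps many outstanding.

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK   4096UL
#define NBLOCKS (2UL * 1024 * 1024 * 1024 / BLOCK)    /* span ~2x local memory  */

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    void *buf;
    int fd, i;

    fd = open("/dev/memxa", O_RDONLY | O_DIRECT);     /* MemX block device      */
    posix_memalign(&buf, BLOCK, BLOCK);               /* O_DIRECT alignment     */
    io_setup(128, &ctx);                              /* create the AIO context */

    for (i = 0; i < 100000; i++) {
        off_t off = (off_t)(random() % NBLOCKS) * BLOCK;   /* random 4KB offset */

        io_prep_pread(&cb, fd, buf, BLOCK, off);      /* prepare one read       */
        io_submit(ctx, 1, cbs);                       /* asynchronous submit    */
        io_getevents(ctx, 1, 1, &ev, NULL);           /* reap the completion    */
    }

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}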

Asynchronous I/O and Scheduling in Linux: Block devices, by nature, handle all I/O asynchronously (AIO) unless otherwise instructed to by the Virtual Filesystem (VFS). In Linux, the AIO call stack is the fundamental atomic operation to the device (through the page cache) by which other types of I/O are realized. As of 2007, the AIO hierarchy in Linux uses a separate kernel thread that arranges to run in the same process context as the user application that submitted the I/O (for those file descriptors that are asynchronous). This lets the application continue doing other work and check for the results later. The core problem involves the kernel thread that handles the AIO system calls itself: it actually executes synchronously after the request handoff has been made. Linux (and perhaps other kernels) can accept a submission of multiple (sparse) AIO reads/writes in a single system call. After the system call returns, the thread then synchronously issues those I/Os to the device driver one by one (blocking and removing itself from the run queue). For devices with variable latencies (i.e., disks), this long-standing VFS design makes sense: I/O should block while the device is kept busy by dynamically generated amounts of parallel I/O from the read-ahead (prefetching) policies of the Linux page cache. But for random-access style devices, this behavior is useless. Additionally, the Linux I/O scheduler makes similar assumptions for devices that have request queues (per-driver queues that re-order I/Os for better fairness and latency guarantees). What this means for MemX, in both virtualized and non-virtualized environments, is that outbound randomly-spaced read block I/O bandwidth (not networking bandwidth) is cut by two thirds, to about one third of its normal speed. This causes a chain reaction for these kinds of randomly-spaced reads: rather than getting read performance of a full gigabit per second over the network, the application only experiences about three hundred megabits per second.


This does not significantly affect the speedups we observe in the next section, but it does explain some of the microbenchmark results presented above.

To solve the problem in the future, we propose a "re-plumbing" of the VFS and I/O scheduling subsystems to dynamically detect the underlying latency characteristics of the device (specifically, the unchanging nature of constant versus variable latency) in order to allow those subsystems to take alternate code paths that are capable of fully exploiting the deliverable performance of the underlying device. The actual blocking call is lock_page(), reached from do_generic_mapping_read() in the Linux AIO call stack. On the bright side, as of 2007, a patch [60] was in progress (contact information for the developers can be found in linux-2.6.xx/MAINTAINERS). The patch could be modified to handle the more specific case that MemX needs, rather than being a generic solution for all users of the page cache. We also noticed that, if the user is a userland C program (versus, say, a filesystem thread running within the kernel), then setting O_DIRECT on the file descriptor causes the system call to bypass the page cache and go directly to the BIO layer; maximum throughput is then realized. We also observed that, of the four I/O schedulers available in Linux, none has any effect on device drivers that do not use a request queue, which is the case for our client module implementation, since it exhibits random-access style latencies when pages are accessed through network memory.

Demonstrating this problem involved:
1. instrumenting the Linux AIO stack to print out TSC-based microsecond estimates,
2. logging the MemX outbound queue size,
3. recording the amount of time between successive requests being handed to the device driver,
4. forming a preliminary hypothesis from the observation that the dependent process was spending too much time idly waiting (inside mwait_idle()), and
5. finally receiving confirmation of the hypothesis from the mainline kernel developers.

Figures 4.6 through 4.8 compare the distributions of the total RTT measured from a user-level application that performs either sequential or random I/O on either MemX or the virtual disk, both with and without the O_DIRECT flag enabled. Note that these RTT values are measured from user-level synchronous read/write system calls, which adds a few tens of microseconds to the kernel-level RTTs in Table 4.1.

Figure 4.6 compares the read latency distribution for MemX-DD against disk-based I/O in both random and sequential reads via the filesystem cache. Random read latencies are an order of magnitude smaller with MemX-DD (around 160µs) than with disk (around 9ms). Sequential read latency distributions are similar for MemX-DD and disk, primarily due to filesystem prefetching. Figure 4.7 shows the RTT distribution for buffered write requests. Again, MemX-DD and disk show similar distributions, mostly less than 10µs, due to write buffering. Figure 4.8 demonstrates the effect of passing the O_DIRECT flag to the open() system call, which bypasses the filesystem buffer cache. The random read latency distributions without the flag display a distinct knee below 10µs, indicating that roughly 10% of the random read requests are serviced at the filesystem buffer cache and that prefetching benefits MemX as well as disk. We observed a similar trend for sequential read distributions, with and without the flag, where the first knee indicated that about 90% of sequential reads were serviced at the filesystem buffer cache.
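For clarity, the two open() modes being compared in Figure 4.8 differ only in the O_DIRECT flag; a minimal illustration follows. The file path is a placeholder, and O_DIRECT requires block-aligned buffers.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    posix_memalign(&buf, 4096, 4096);      /* O_DIRECT requires aligned buffers */

    /* Hypothetical data file residing on the MemX-backed (or disk-backed) device. */
    int fd_buffered = open("/mnt/memx/datafile", O_RDONLY);             /* via page cache */
    int fd_direct   = open("/mnt/memx/datafile", O_RDONLY | O_DIRECT);  /* bypass cache   */

    pread(fd_buffered, buf, 4096, 0);   /* may be satisfied by the buffer cache  */
    pread(fd_direct,   buf, 4096, 0);   /* always goes to the underlying device  */

    close(fd_buffered);
    close(fd_direct);
    free(buf);
    return 0;
}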

Figure 4.6: Comparison of sequential and random read latency distributions for MemX-DD and disk. Reads traverse the filesystem buffer cache. Most random read latencies are an order of magnitude smaller with MemX-DD than with disk. All sequential reads benefit from filesystem prefetching.

Figure 4.7: Comparison of sequential and random write latency distributions for MemX-DD and disk. Writes go through the filesystem buffer cache. Consequently, all four latencies are similar due to write buffering.

Figure 4.8: Effect of filesystem buffering on random read latency distributions for MemX-DD and disk. About 10% of random read requests (issued without the direct I/O flag) are serviced at the filesystem buffer cache, as indicated by the first knee below 10µs for both MemX-DD and disk.


Figure 4.9: Quicksort execution times in various MemX combinations and disk. While clearly surpassing disk performance, MemX-DD trails regular Linux only slightly using a 512 MB Xen guest.

4.4.2 Application Speedups

We now evaluate the execution times of a few large memory applications using our testbed. Again, we include both MemX-Linux and virtual disk as base cases, to illustrate the overhead imposed by Xen virtualization and the gain over virtualized disk, respectively. Figure 4.9 shows the performance of sorting increasingly large arrays of integers, using an in-house C implementation of the classic static-partitioning quicksort algorithm. We stopped using the STL version because of its inability to provide detailed runtime information about the progress of the sort. We record the execution times of the sort for each of the three mentioned cases. We also include an "extreme" base-case plot for local memory using one of the vanilla-Linux 4 GB nodes, where the sort executes purely in memory. As the figure shows, we omitted the disk case beyond 2 GB problem sizes due to the unreasonably large amount of time it takes to complete - potentially days. The sorts using MemX-DD, MemX-DomU, and MemX-Linux, however, finished within 90 minutes, and the distinction between the different virtualization levels is very small.

Table 4.2 lists execution times for some much larger problem sizes with both quicksort and a second large memory application – the same ray-tracing scene used in Chapter 3 [81]. Each row in the table describes an increasingly large problem size, as high as 13 GB. Again, both MemX cases behave similarly, while the disk lags far behind. These performance numbers show that MemX provides a highly attractive option for executing large memory workloads in both virtualized and non-virtualized environments. Furthermore, given the unquantified amount of randomized reads generated by the system's pager (which correlates with the recursive nature of the sort algorithm), the same asynchronous I/O serialization problem that we described in the previous section also applies here. If a fix were applied, the observed speedups in the figure could potentially double or triple. For now, the throughput observed from the system pager remains around 300 to 400 Mbits/sec.

Application        Client Mem   MemX-Linux   MemX-DD    Disk
5 GB Quicksort     512 MB       65 min       93 min     several hours
6 GB Ray-tracer    512 MB       48 min       61 min     several hours
13 GB Ray-tracer   1 GB         93 min       145 min    several hours

Table 4.2: Execution time comparisons for various large memory application workloads.

Figure 4.10: Quicksort execution times for multiple concurrent guest VMs using MemX-DD and iSCSI configurations.

Figure 4.11: Our multiple-client setup: five identical 4 GB dual-core machines, where one houses 20 Xen guests and the others serve as either MemX servers or iSCSI servers.

4.4.3 Multiple Client VMs

In this section, we evaluate the overhead of executing multiple client VMs using the MemX-DD combination. In a real data center, an iSCSI or FibreChannel network would be set up to provide backend storage for guest virtual machines. To duplicate this base case in our cluster, we use five of our dual-core 4 GB machines to compare MemX-DD against a 4-disk parallel iSCSI setup, illustrated in Figure 4.11. For the iSCSI target software, we used the open-source IET project [93], and we used the open-iscsi.org initiator software within Dom0, which acts as the driver domain for all the Xen guests. Our setup uses one of the five machines to execute up to twenty concurrently running 100 MB Xen guests. We vary the number of concurrent guest VMs from 1 to 20, and in each guest we run a 400 MB quicksort to completion. We perform the same experiment for both MemX-DD and iSCSI. Figure 4.10 shows the results of this experiment. At its highest point (about 10 GB of collective memory and 20 concurrent virtual machines), the execution time with MemX-DD is about 5 times smaller than with the iSCSI setup. Recall that we are using four remote iSCSI disks; one can observe a stair-step behavior in the iSCSI curve where the level of parallelism wraps around at 4, 8, 12, and 16 virtual machines. Even with concurrent disks and competing virtual machine CPU activity, MemX-DD provides clear benefits in providing low-latency I/O among multiple concurrent Xen virtual machines.



Inter-VM Congestion: In Section 4.3.6, we described the phenomenon of inter-VM congestion, which arises due to the absence of explicit congestion control across multiple guests within a Xen node. Here we discuss how inter-VM congestion is handled in the different MemX configurations.

1. MemX-Dom0 and MemX-Linux: Inter-VM congestion does not arise in these base cases because the only users of the client module are local application processes. These processes, controlled by a static send window, use semaphores and wait queues to place competing processes on the OS's blocked list when the client's send window is full. So there is no competition among multiple virtual machines - only between competing processes.

2. MemX-DD: Inter-VM congestion in MemX-DD is handled indirectly by Xen itself. Xen schedules block I/O backend requests in a strictly round-robin fashion. Since MemX is the destination of requests from the backend, Xen "stops" the delivery of requests to MemX when the queue (of some fixed size) is full. This stop is performed by placing the dependent guest VMs in a blocked state, in the same way that multi-programmed processes are blocked when waiting for I/O.

3. MemX-DomU: For MemX-DomU, recall that inter-VM congestion arises from multiple network front-end drivers rather than competing block front-ends. Xen handles this type of contention using credit-based scheduling, where each front-end is allocated a bandwidth share of the form x bytes every y microseconds (a minimal sketch of such a credit check follows this list). VMs that use up their credit are blocked. This leaves us to handle only the network contention at the switch and server level, which we plan to address as our testbed scales to more nodes.
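The credit-based check referenced in item 3 can be pictured with the following sketch; the structure and function names are hypothetical and only approximate the behavior of Xen's netback credit scheduler.

#include <stdint.h>

/* Hypothetical sketch of credit-based VNIC scheduling: each front-end may
 * transmit at most bytes_per_period bytes every period_usec microseconds;
 * a guest that exhausts its credit is blocked until the next refill. */
struct vif_credit {
    unsigned long bytes_per_period;   /* "x bytes ..."              */
    unsigned long period_usec;        /* "... every y microseconds" */
    unsigned long remaining;          /* credit left this period    */
    uint64_t      period_start;       /* timestamp of last refill   */
};

int vif_may_transmit(struct vif_credit *c, unsigned long pkt_len, uint64_t now_usec)
{
    if (now_usec - c->period_start >= c->period_usec) {
        c->period_start = now_usec;
        c->remaining    = c->bytes_per_period;   /* refill the credit   */
    }
    if (pkt_len > c->remaining)
        return 0;                                /* block the front-end */
    c->remaining -= pkt_len;
    return 1;                                    /* OK to transmit      */
}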

4.4.4 Live VM Migration

The MemX-DomU configuration has a significant benefit when it comes to migrating live Xen VMs [27] at runtime, even though it has lower throughput and higher I/O latency than MemX-DD. Specifically, a VM using MemX-DomU for fast I/O to distributed memory can be seamlessly migrated from one physical machine to another, without disrupting the execution of any large dataset applications within the VM. There are two specific reasons for this benefit. First, since MemX-DomU is designed as a self-contained pluggable module within the guest OS, any page-to-server mapping information is migrated along with the kernel state of the guest OS, without leaving any residual dependencies behind on the original machine. Second, RMAP, which is used for communicating read/write requests to distributed memory, is designed to be reliable. As the VM carries its link-layer MAC address with it during the migration process, any in-flight packets dropped during migration are safely retransmitted to the VM's new location, thereby enabling any large memory application to continue execution without disruption. What makes the MemX-DomU case interesting is that administrators of virtual hosting centers can exploit live-migration features by seamlessly transferring guest VMs to other physical machines at will to better utilize resources. Our work in Chapter 5 focuses exclusively on the optimization of virtual machine migration and elaborates on this in more detail.

4.5 Summary

State-of-the-art in virtual machine technology does not adequately address the needs of large memory workloads that are increasingly common in modern data centers and virtual hosting platforms. Such application workloads quickly become throttled by the disk I/O bottleneck in a virtualized environment where the I/O subsystem includes an additional level of indirection. In this Chapter, we presented the design, implementation, and evaluation of the MemX system in the Xen environment that enables memory and I/O-constrained VMs to transparently utilize the collective pool of memory within a cluster for low-latency I/O operations.


Large dataset applications using MemX do not require any specialized APIs, libraries, or any other modifications. MemX can operate as a kernel module within non-virtualized Linux (MemX-Linux), an individual VM (MemX-DomU), or a driver domain (MemX-DD). The latter option permits multiple VMs within a single physical machine to multiplex their memory requirements over a common distributed memory pool. Performance evaluations using our MemX prototype show that I/O latencies are reduced by an order of magnitude and that large memory applications speed up significantly when compared against virtualized disk. As an extra benefit, live Xen VMs executing large memory applications over MemX-DomU can be migrated without disrupting those applications. Our future work includes the capability to provide per-VM reservations over the cluster-wide memory, developing mechanisms to control inter-VM congestion, and enabling seamless migration of VMs in the driver domain mode of operation.

Chapter 5

Post-Copy: Live Virtual Machine Migration

In this Chapter, we present the design, implementation, and evaluation of the post-copy based approach for the live migration of virtual machines (VMs) across a gigabit LAN. Live migration is a mandatory feature of modern hypervisors. It facilitates server consolidation, system maintenance, and lower power consumption. Post-copy [53] refers to the deferral of the memory "copy" phase of live migration until after the VM's CPU state has been migrated to the target node. This is in contrast to the traditional pre-copy approach, which first copies the memory state over multiple iterations, followed by the transfer of CPU execution state. The post-copy strategy provides a "win-win" by approaching the baseline total migration time achieved with the stop-and-copy approach, while maintaining the liveness and low downtime benefits of the pre-copy approach. We facilitate the use of post-copy with a specific instance of adaptive prepaging (also known as adaptive distributed paging). Pre-paging eliminates all duplicate page transmissions and quickly removes any residual dependencies for the migrating VM from the source node. Our pre-paging algorithm is able to reduce the number of page faults across the network to 17% of the VM's working set. Finally, we enhance both the original pre-copy and post-copy schemes with the use of a dynamic, periodic self-ballooning (DSB) strategy, which prevents the migration daemon from transmitting unnecessary free pages in the guest OS.

DSB significantly speeds up both migration schemes with negligible performance degradation to the processes running within the VM. We implement the post-copy approach in the Xen VM environment and show that it significantly reduces the total migration time and network overhead across a range of VM workloads when compared against the traditional pre-copy approach.

5.1 Introduction

This Chapter addresses the problem of optimizing the live migration of system virtual machines (VMs). Live migration is a key selling point for state-of-the-art virtualization technologies. It allows administrators to consolidate system load, perform maintenance, and flexibly reallocate cluster-wide resources on the fly. We focus on VM migration within a cluster environment where physical nodes are interconnected via a high-speed LAN and also employ a network-accessible storage system (such as a SAN or NAS). State-of-the-art live migration techniques [73, 27] use the pre-copy approach, where the bulk of the VM's memory state is migrated even as the VM continues to execute at the source node. Once the "working set" has been identified through a number of iterative copy rounds, the VM is suspended and its CPU execution state plus remaining dirty pages are transferred to the target host. The overriding goal of the pre-copy approach is to keep the service downtime to a bare minimum by minimizing the amount of VM state that needs to be transferred during the downtime. We seek to demonstrate the benefits of another strategy for live VM migration, called post-copy, which was previously applied only in the context of process migration in the late 1990s, and to address the issues involved at the whole operating system level as well. We believe that modern hypervisors provide the means to employ alternative approaches without much additional complexity. At a high level, post-copy refers to the deferral of the memory "copy" phase of live migration until the virtual machine's CPU state has already been migrated to the target node. This enables the migration daemon to try different methods by which to perform the memory copy. Post-copy works by transferring a minimal amount of CPU execution state to the target node, starting the VM at the target, and then proceeding to actively push memory pages from the source to the target.


This active push component, also known as pre-paging, distinguishes the post-copy approach from both pre-copy and the demand-paging approach, in which the source node would passively wait for the memory pages to be faulted in by the target node across the network. Pre-paging is a broad term that was used in earlier literature [76, 94] in the context of optimizing memory-constrained disk-based paging systems, and refers to a more proactive form of page prefetching from disk. By intelligently sequencing the set of actively prefetched memory pages, the memory subsystem (or even a cache) can hide the latency of high-locality page faults or cache misses from live applications, while continuing to retrieve the rest of the address space out-of-band until the entire address space is complete. Modern memory subsystems do not typically employ pre-paging anymore due to the increasingly large DRAM capacities in commodity systems. However, pre-paging can play a significant role in the context of live VM migration, which involves the transfer of an entire physical address space across the network.

We design and implement a post-copy based technique for live VM migration in the Xen VM environment. Through extensive evaluations, we demonstrate how post-copy can improve live migration performance across each of the following metrics: pages transferred, total migration time, downtime, application degradation, network bandwidth, and identification of the working set. The traditional pre-copy approach does particularly well in minimizing two metrics – application downtime and degradation – when the VM is executing a largely read-intensive workload. These two metrics are important in preserving system uptime as well as the interactive user experience. However, all of the above metrics can be impacted adversely when pre-copy is confronted with even moderately write-intensive VM workloads during migration. Post-copy not only maintains VM liveness and application performance during migration, but also improves upon the other performance metrics listed above.

The two key ideas behind an effective post-copy strategy are: (a) transmitting each page across the network no more than once, in other words, avoiding the potentially non-converging iterative copying rounds in pre-copy, and (b) an adaptive pre-paging strategy that hides the latency of fetching most pages across the network by actively pushing pages from the source before the page is faulted in at the target node, and by adapting the sequence of pushed pages using any network page faults as hints.


We show that our post-copy implementation is capable of minimizing network-bound page faults to 17% of the working set. Additionally, we identified deficiencies in both the pre-copy and post-copy schemes with regard to the transfer of free pages in the guest VM during migration. We improved both migration schemes to avoid transmitting free pages through the use of a Dynamic Self-Ballooning (DSB) technique, in which the guest actively balloons down its memory footprint without human intervention. DSB significantly speeds up the total migration time, normalizes both approaches, and is capable of frequent ballooning, with intervals as small as 5 seconds, without adversely affecting live applications. Both the Xen and VMware hypervisors have demonstrated that migration itself is an essential tool. The original pre-copy algorithm does have other advantages: it employs a relatively self-contained implementation that allows the migration daemon to isolate most of the copying complexity to a single process at each node. Additionally, pre-copy provides a clean method of aborting the migration should the target node ever crash during migration, because the VM is still running at the source and not at the target host (whether or not this benefit is made obvious in current virtualization technologies). Although our current post-copy implementation does not handle target node failure, we will discuss a straightforward approach in Section 5.2.5 by which post-copy can provide the same level of reliability as pre-copy. Our contribution is to demonstrate a complete way in which, with a little more help from the migration system, one can preserve the liveness and downtime benefits of pre-copy while also breaking from the non-deterministic convergence phase inherent in pre-copy, ensuring that each page of VM memory is transferred over the network at most once.

5.2 Design

We begin with a brief discussion of the performance goals of VM migration. Afterwards, we will present our design of post-copy and how it improves those goals.

5.2.1 Pre-Copy

For a more in-depth performance summary of pre-copy migration, we refer the reader to [27] and [73]. For completeness, pre-copy migration works as follows. Pre-copy is an eager strategy in which memory pages are actively pushed to the target machine while the migrating VM continues to run at the source machine. Pages dirtied at the source that have already been transferred to the target are re-sent over several iterations, until the number of dirtied pages falls below a fixed threshold. (Note that this threshold is not dynamic; although one could imagine modern hypervisors designing a dynamic threshold, neither their vendors nor the literature have attempted to do so.) Furthermore, in all known implementations, if the threshold is never reached, an empirical "cap" on the total number of iterations (currently set to 30) is chosen by the migration implementer. Without this cap, it is possible that pre-copy may never converge at all. After the iterations complete, the VM is suspended and its state is transferred to the target machine, where it is restarted. This transfer of VM state is accompanied by a final flush of the remaining address space modified at the source host. The VM is then resumed at the target and the source copy of the VM is destroyed. (A minimal sketch of this control loop appears at the end of this subsection.)

Pre-copy migration involves the following performance goals:

1. Transparency: The pre-copy scheme can work transparently in both fully-virtualized and para-virtualized environments. Any new migration scheme must maintain that ability without requiring any application changes.

2. Preparation Time: Any required CPU or network activity within either the migrating guest VM or the maintenance VM contributes to preparation time. This includes most of the memory copying during the pre-copy rounds. There is no guarantee that this time ever converges to a stopping round; in fact, later we show that even with mildly active VMs, these rounds never converge.

3. Down Time: This time represents how long the migrating VM is stopped, during which no execution progress is made. Pre-copy uses this time for dirty memory transfer. Minimizing downtime is pre-copy's primary goal.


4. Resume Time: Any remaining cleanup required by the maintenance VM at the target host falls into this time period. Although pre-copy has nothing to do here besides rescheduling the migrating VM, the majority of our post-copy design operates primarily in this period. After this period is complete, regardless of which migration algorithm is used, all dependencies on the source VM must be eliminated.

5. Pages Transferred: This metric is the total count of memory pages transferred across all of the above time periods. For pre-copy it is dominated by the preparation time.

6. Total Migration Time: For pre-copy, the total time required to complete the migration is dominated by the preparation time. Total migration time is important because it affects the release of resources on both sides, within the individual hosts as well as within the VMs on both hosts. Until the migration completes, the unused memory at the source cannot be freed, and both maintenance VMs continue to consume network bandwidth and CPU cycles.

7. Application Degradation: This refers to the extent of slowdown experienced by application workloads executing within the VM due to the migration event. The slowdown occurs primarily due to CPU time taken away from normal applications to carry out the migration. Additionally, the pre-copy approach needs to track dirtied pages across successive iterations by trapping write accesses to each page, which significantly slows down write-intensive workloads. In the case of post-copy, access to memory pages not yet present at the target results in network page faults, potentially slowing down the VM workloads.

One of this Chapter's contributions is to reduce the number of pages transferred compared to pre-copy, which is likely to wastefully transfer pages that may never be used at the target machine. If the threshold on the number of dirty pages chosen to terminate the pre-copy phase is too small, then pre-copy may never converge and terminate. On the other hand, if the number of pages transferred during the final iteration is large, significant downtime can result.


Given that the number of pages transferred directly impacts all of the other metrics, our post-copy method aims to reduce this metric.
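For reference, the pre-copy control loop summarized at the start of this subsection can be sketched as follows; the threshold constant and helper names are illustrative and do not reflect Xen's actual migration code.

/* Simplified sketch of the iterative pre-copy loop described above; constants
 * and helper names are illustrative, not Xen's implementation. */
#define DIRTY_THRESHOLD  50U   /* stop iterating once this few pages remain dirty */
#define MAX_ITERATIONS   30    /* empirical cap in case the VM never converges    */

/* Placeholder helpers: the send_* calls return the number of pages still dirty. */
extern unsigned long send_all_pages(void);
extern unsigned long send_dirtied_pages(void);
extern void suspend_vm_at_source(void);
extern void send_cpu_state(void);
extern void resume_vm_at_target(void);

void precopy_migrate(void)
{
    int iter = 0;
    unsigned long dirty = send_all_pages();      /* round 0: whole memory image     */

    while (dirty > DIRTY_THRESHOLD && ++iter < MAX_ITERATIONS)
        dirty = send_dirtied_pages();            /* re-send pages dirtied meanwhile */

    suspend_vm_at_source();
    send_dirtied_pages();                        /* final flush during downtime     */
    send_cpu_state();
    resume_vm_at_target();
}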

5.2.2 Design of Post-Copy Live VM Migration

Post-copy is a strategy in which the migrating virtual machine is first suspended at the source, a minimal execution state is copied over to the target where the virtual machine is restarted, and then the memory pages that are referenced are faulted over the network from the source. VM execution experiences a delay during this period of faults, and that delay depends on the characteristics of the network connection and how fast the source machine can serve each request. As a result, this method incurs considerable resume time. Additionally, leaving any long-term residual dependencies on the source host is not acceptable. Thus, post-copy is not useful unless two additional goals are met:

1. Post-copy must effectively anticipate page faults from the target and allow VM execution to move forward, while hiding the latency of page faults.

2. Post-copy must flush the remaining clean pages from the source out-of-band while the VM is simultaneously faulting, so that no residual dependency remains on the source.

Note that both migration schemes must be normalized with respect to the unused (free) pages within the guest VM. This must be done so that any improvement is realized only through the treatment of pages that actually contribute to the guest VM's working set. We discuss this solution momentarily.

The post-copy algorithm can be designed in multiple ways, each of which provides an incremental improvement on the previous method across the aforementioned performance goals. Table 5.1 illustrates how each of these designs slightly increases in complexity over the previous one during a certain phase of the migration, with the common goal of improving the bottom line. Method 1 heads the table as the current form of migration.

Method 2: Post-Copy via Demand Paging: The demand-paging variant of post-copy is the simplest and slowest option. Once the VM resumes at the target, its memory accesses result in page faults that can be serviced by requesting the referenced page over the network from the source node.

    Method                    Preparation                           Downtime             Resume
1   Pre-copy Only             Multiple iterative memory transfers   Send dirty memory    CPU state transfer
2   Demand Paging             Pre-suspend time (if any)             CPU state transfer   Page-faults only
3   Basic Post-copy           Pre-suspend time (if any)             CPU state transfer   Flushing + page-faults
4   Pre-paging + Post-copy    Pre-suspend time (if any)             CPU state transfer   Bubbling + page-faults
5   Hybrid Pre + Post         Single pre-copy round                 CPU state transfer   Bubbling + page-faults

Table 5.1: Migration algorithm design choices in order of their incremental improvements. Method #4 combines #2 and #3 with the use of pre-paging. Method #5 actually combines all of #1 through #4, by which pre-copy is only used in a single, primer iterative round.

1.  let N         := total # of guest VM pages
2.  let page[N]   := set of all guest VM pages
3.  let bitmap[N] := all zeroes
4.  let pivot     := 0; bubble := 0

5.  ActivePush (Guest VM)
6.      while bubble < max (pivot, N-pivot) do
7.          let left  := max(0, pivot - bubble)
8.          let right := min(MAX_PAGE_NUM-1, pivot + bubble)
9.          if bitmap[left] == 0 then
10.             set bitmap[left] := 1
11.             queue page[left] for transmission
12.         if bitmap[right] == 0 then
13.             set bitmap[right] := 1
14.             queue page[right] for transmission
15.         bubble++

16. PageFault (Guest-page X)
17.     if bitmap[X] == 0 then
18.         set bitmap[X] := 1
19.         transmit page[X] immediately
20.     discard pending queue
21.     set pivot  := X    // shift pre-paging pivot
22.     set bubble := 1    // new pre-paging window

Figure 5.1: Pseudo-code for the pre-paging algorithm employed by post-copy migration. Synchronization and locking code omitted for clarity of presentation.


However, servicing each fault will significantly slow down the VM due to the network's round-trip latency. Consequently, even though each page is transferred only once, this approach considerably lengthens the resume time and leaves long-term residual dependencies in the form of un-fetched pages, possibly for an indeterminate duration. Thus, post-copy performance for this variant by itself would be unacceptable from the viewpoint of total migration time and application degradation.

Method 3: Post-Copy via Active Pushing: One way to reduce the duration of residual dependencies on the source node is to proactively "push" the VM's pages from the source to the target even as the VM continues executing at the target. Any major faults incurred by the VM can be serviced concurrently over the network via demand paging. Active push avoids transferring pages that have already been faulted in by the target VM. Thus, each page is transferred only once, either by demand paging or by an active push.

Method 4: Post-Copy via Prepaging: The goal of post-copy via prepaging is to anticipate the occurrence of major faults in advance and adapt the page-pushing sequence to better reflect the VM's memory access pattern. While it is impossible to predict the VM's exact faulting behavior, our approach works by using the faulting addresses as hints to estimate the spatial locality of the VM's memory access pattern. The prepaging component then shifts the transmission window of the pages to be pushed such that the current page-fault location falls within the window. This increases the probability that pushed pages will be the ones accessed by the VM in the near future, reducing the number of major faults. Various prepaging strategies are described in Section 5.2.3.

Method 5: Hybrid Live Migration: The hybrid approach was first described in [74] for process migration. It works by doing a single pre-copy round in the preparation phase of the migration. During this time, the VM continues running at the source while all its memory pages are copied to the target host. After just one iteration, the VM is suspended and its processor state and dirty non-pageable pages are copied to the target. Subsequently, the VM is resumed at the target and post-copy, as described above, kicks in, pushing in the remaining dirty pages from the source. As with pre-copy, this scheme can perform well for read-intensive workloads. Yet it also provides deterministic total migration time for write-intensive workloads, as with post-copy. This hybrid approach is currently being implemented and is not covered within the scope of this chapter. The rest of this chapter describes the design and implementation of post-copy via prepaging.

Figure 5.2: Prepaging strategies: (a) Bubbling with a single pivot and (b) Bubbling with multiple pivots. Each pivot represents the location of a network fault on the in-memory pseudo-paging device. Pages around the pivot are actively pushed to the target.

5.2.3 Prepaging Strategy

Prepaging refers to actively pushing the VM's pages from the source to the target. The goal is to make pages available at the target before they are faulted on by the running VM. The effectiveness of prepaging is measured by the percentage of the VM's page faults at the target that require an explicit page request to be sent over the network to the source node – also called network page faults. The smaller the percentage of network page faults, the better the prepaging algorithm. The challenge in designing an effective prepaging strategy is to accurately predict the pages that might be accessed by the VM in the near future, and to push those pages before the VM faults upon them. Below we describe different design options for prepaging strategies.

(A) Bubbling with a Single Pivot: Figure 5.1 lists the pseudo-code for the two components of bubbling with a single pivot – active push (lines 5–15), which executes in a kernel thread, and page-fault servicing (lines 16–21), which executes in the interrupt context whenever a page fault occurs. Figure 5.2(a) illustrates this algorithm graphically. The VM's pages at the source are kept in an in-memory pseudo-paging device, which is similar to a traditional swap device except that it resides completely in memory (see Section 5.3 for details).


The active push component starts from a pivot page in the pseudo-paging device and transmits symmetrically located pages around that pivot in each iteration. We refer to this algorithm as "bubbling" since it is akin to a bubble that grows around the pivot as its center. Even if one edge of the bubble reaches the boundary of the pseudo-paging device (0 or MAX), the other edge continues expanding in the opposite direction. To start with, the pivot is initialized to the first page in the in-memory pseudo-paging device, which means that initially the bubble expands only in the forward direction. Subsequently, whenever a network page fault occurs, the fault-servicing component shifts the pivot to the location of the new fault and starts a new bubble around this new location. In this manner, the location of the pivot adapts to new network faults in order to exploit the spatial locality of reference. Pages that have already been transmitted (as recorded in a bitmap) are skipped over by the edge of the bubble. Network faults that arrive at the source for a page that is in-flight (or has just been pushed) to the target are ignored to avoid duplicate page transmissions.

(B) Bubbling with Multiple Pivots: Consider the situation where a VM has multiple processes executing concurrently. Here, a newly migrated VM would fault on pages at multiple locations in the pseudo-paging device. Consequently, a single pivot would be insufficient to capture the locality of reference across multiple processes in the VM. To address this situation, we extend the bubbling algorithm described above to operate on multiple pivots. Figure 5.2(b) illustrates this algorithm graphically. The algorithm is similar to the one outlined in Figure 5.1, except that the active push component pushes pages from multiple "bubbles" concurrently. (We omit the pseudo-code for space constraints, since it is a straightforward extension of the single-pivot case.) Each bubble expands around an independent pivot. Whenever a new network fault occurs, the faulting location is recorded as one more pivot and a new bubble is started around that location. To save on unnecessary page transmissions, if the edge of a bubble comes across a page that has already been transmitted, that edge stops progressing in the corresponding direction. For example, the edges between bubbles around pivots P2 and P3 stop progressing when they meet, although the opposite edges continue making progress.


In practice, it is sufficient to limit the number of concurrent bubbles to those around the k most recent pivots. When a new network fault arrives, we replace the oldest pivot in a pivot array with the new network fault location (this bookkeeping is sketched below). For the workloads tested in our experiments in Section 5.4, we found that around k = 7 pivots provided the best performance.

(C) Direction of Bubble Expansion: We also wanted to examine whether the pattern in which the source node pushes the pages located around the pivot makes a significant difference in performance. In other words, is it better to expand the bubble around a pivot in both directions, only the forward direction, or only the backward direction? To examine this we included an option to turn off bubble expansion in either the forward or the backward direction. Our results, detailed in Section 5.4.4, indicate that forward bubble expansion is essential, dual (bi-directional) bubble expansion performs slightly better in most cases, and backwards-only bubble expansion is counter-productive.

When expanding bubbles with multiple pivots in only a single direction (forward-only or backward-only), there is a possibility that the entire active push component could stall before transmitting all pages in the pseudo-paging device. This happens when all active bubbles encounter already-sent pages at their edges and stop progressing. (A simple thought exercise can show that stalling of the active push is not a problem for dual-direction multi-pivot bubbling.) While there are multiple ways to solve this problem, we chose the simple approach of designating the initial pivot (at the first page in the pseudo-paging device) as a sticky pivot. Unlike other pivots, this sticky pivot is never replaced by another pivot. Further, the bubble around the sticky pivot does not stall when it encounters an already transmitted page; rather, it skips such a page and keeps progressing, ensuring that the active push component never stalls.
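The pivot-array bookkeeping mentioned above amounts to very little code. The sketch below (with hypothetical names) shows new faults replacing pivots in the array while slot 0 holds the sticky pivot, which is never evicted; replacement here is round-robin, which approximates oldest-first when insertions themselves arrive in order.

#define NUM_PIVOTS 7            /* k = 7 worked best for our workloads */

struct pivot_set {
    unsigned long pivot[NUM_PIVOTS];    /* slot 0 holds the sticky pivot      */
    unsigned long bubble[NUM_PIVOTS];   /* current radius of each bubble      */
    int next_victim;                    /* rotates over slots 1..NUM_PIVOTS-1 */
};

/* Record a new network fault as a pivot; the sticky pivot in slot 0 is never
 * evicted, so replacement rotates over the remaining slots. */
void add_pivot(struct pivot_set *ps, unsigned long fault_pfn)
{
    int slot = 1 + ps->next_victim;

    ps->pivot[slot]  = fault_pfn;
    ps->bubble[slot] = 1;                                /* start a fresh bubble */
    ps->next_victim  = (ps->next_victim + 1) % (NUM_PIVOTS - 1);
}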

5.2.4

Dynamic Self-Ballooning

The Free Memory Problem. As we touched on earlier, there can be an arbitrarily large number of free pages within the guest VM before migration begins - or there may be few or no free pages. Either way, it is wasteful to send free pages, regardless of which migration algorithm is being used. If we do not eliminate as many of these pages as possible from being migrated in the pre-copy algorithm, then we cannot properly compare it to post-copy, because there would be no way of distinguishing clean pages from free pages during each pre-copy iteration: if a clean page is freed, there is no way for the migration process to detect this. We observe that there are two ways to solve this problem. For post-copy, the first turns out to be quite easy: it leaves us with Method 5 (the hybrid method) in Table 5.1, which combines pre-copy with post-copy. It works by doing a single pre-copy round in the preparation phase of the migration, which allows the guest VM to continue running at the source while its free pages and clean pages are copied to the target host. Subsequently, the post-copy process kicks in immediately after downtime. There is no memory transfer during downtime, and post-copy operates just as we described. The second way to solve the free memory problem is through the use of ballooning. The hybrid scheme was first used in the literature in [74]. But since we are dealing with whole-system VM migration, the hybrid scheme presents a problem for a performance comparison against stand-alone pre-copy migration: it does not eliminate the transmission of free pages. Without eliminating them, we cannot determine the effectiveness of post-copy with respect to how well pre-paging improves VM execution time by hiding page-fault latency from the migrating guest VM. We cannot evaluate that effectiveness for two reasons. First, if a free page is transmitted (which is highly probable), it consumes bandwidth that might otherwise have been used both by pre-paging and by the iterative rounds used in pre-copy. Second, during pre-paging, if a free page is allocated by the guest VM and subsequently causes a page fault (as the result of a copy-on-write by the virtual memory system), this causes additional delay on the VM at the target when there need not have been any. Therefore, we cannot do a performance analysis of post-copy without eliminating the transmission of those empty page frames. Ballooning is the act of changing the view of physical memory (and pseudo-physical memory) such that the guest VM has a larger or smaller amount of allocatable memory than it had before. In current virtualization systems, ballooning is only used during guest VM boot time, when the VM is first created and initialized. If the maintenance VM cannot "reserve" enough memory for the new guest - henceforth referred to as a reservation - it steals some from the other VMs on the host.

It does so by enlarging a kind of balloon in the other VMs and giving the reclaimed memory to the new one. This is done by giving the existing VMs a "target" reservation and waiting for them to release enough pages from their own reservations to satisfy that smaller target. The system administrator can re-enlarge those diminished reservations at a later time should more memory become available, for instance as the result of a VM shutting down or even of migration itself. What we have implemented is a way for the migrating guest VM to perform this ballooning continuously by itself, called Dynamic Self-Ballooning (DSB). Making this effective for migration is two-fold. First, we must choose an appropriate interval between consecutive DSB attempts such that the CPU time consumed by the DSB process does not interfere with the applications running within the VM. Second, the DSB process must ensure that it can allow the balloon to shrink: when one or more memory-intensive applications begin to run and perform copy-on-writes within the guest VM, there must be a way for the DSB process to detect this and respond by releasing free pages from the balloon so that the applications can use them. We describe our approach in the next few sections; through performance experiments we chose an interval of about 5 seconds and determined that application performance is not adversely affected. During pre-copy migration, DSB runs continuously. Post-copy, on the other hand, performs DSB only once, right before the beginning of the downtime phase; after resume it is disabled and the rest of post-copy proceeds as described.

5.2.5

Reliability

As we touched on in the introduction, post-copy has a drawback with respect to the reliability of the target node. Either the source or the destination node can fail in the middle of VM migration. In both pre-copy and post-copy migration, failure of the source node implies permanent loss of the VM itself. Failure of the destination node has different implications in the two cases. For pre-copy, failure of the destination node does not matter, because the source node still holds an entire up-to-date copy of the VM's memory and CPU execution state, and the VM can be revived if necessary from this copy. However, with post-copy, the VM begins execution at the target node as soon as the minimal CPU execution state is transferred, which implies that the destination node has the more up-to-date version of the VM state, while the copy at the source is stale, except for pages not yet modified at the destination. Thus, failure of the destination node constitutes a critical failure of the VM during post-copy migration. We plan to address this problem by developing mechanisms to incrementally checkpoint the VM state from the destination node back at the source node, an approach taken by the Xen-based Remus system [18]. Based on their results, we believe that the increased network overhead of doing this would be negligible, but a thorough evaluation would first be required. One approach is as follows: while the active push of pages is in progress from the source node to the destination, we also propagate incremental changes to memory pages and execution state in the VM at the destination back to the source node. We do not need to propagate the changes from the destination on a continuous basis, but only at discrete points, such as when interacting with a remote client over the network or when committing an I/O operation to storage. This mechanism can provide a consistent backup image at the source node that we can fall back on in case the destination node fails in the middle of post-copy migration, although at the expense of some increase in reverse network traffic. Further, once the migration is over, the backup state at the source node can be discarded safely. The performance of this mechanism would depend upon the additional overhead imposed by the reverse network traffic from the destination to the source. In a different context, similar incremental checkpointing mechanisms have been used to provide high availability in the Remus project [18].

5.2.6

Summary

We have described Post-Copy and addressed four problems that are important for the improved migration of system virtual machines. Focusing on the total number of pages transferred, we use the following approaches: demand paging, flushing, pre-paging through what we call "bubbling", and dynamic self-ballooning (DSB), all working together at the same time. Demand paging ensures that we eliminate the non-deterministic copying iterations involved in pre-copy. Flushing ensures that no residual dependencies are left on the source host. Bubbling helps minimize both the number of page faults and the length of time spent in the resume phase. Self-ballooning allows us to normalize the two migration schemes for comparison by eliminating the transmission of free pages. Note that we do not implement the Hybrid scheme mentioned earlier, as it does not directly contribute to the comparison of the two schemes, although it would nonetheless significantly improve the treatment of clean pages during post-copy migration. We leave that to future work.

Figure 5.3: Pseudo-Swapping (item 3): As pages are swapped out within the source guest itself, their MFN identifiers are exchanged and Domain 0 memory-maps those frames with the help of the hypervisor. The rest of post-copy then takes over after downtime.

5.3

Post-Copy Implementation

We’ve implemented post-copy on top of the Xen 3.2.1 along with all of the optimizations introduced in Section 5.2. We use the para-virtualized version of Linux 2.6.18.8 as our base. We begin by first discussing how there are different ways of trapping page-faults within the Xen / Linux architecture and their trade-offs. Then we will discuss our implementation of dynamic self-ballooning.

5.3.1

Page-Fault Detection

The working set of the Virtual Machine can (and will) span multiple user applications and in-kernel data structures. We propose three different ways by which the demand-paging component of post-copy at the system virtual machine level can trap accesses to the WWS. These include: 1. Shadow Paging: Through the pre-existing, well designed use of an extra, read-only set of page tables underneath the VM, shadow paging provides multiple benefits to virtual machines in modern hypervisors. Support for shadow paging contributes to the use of both fully-virtualized VMs and para-virtualized VMs as well as the facilitation of pre-copy migration by detecting page dirtying. In the post-copy case, each attempt to write to a page at the target would be trapped by shadow-paging. The migration daemon would then use this information to retrieve that page before the read or write can proceed. 2. Page Tracking: The idea here is to use the downtime phase to mark all of the resident pages in the VM as not present within the corresponding page-table-entries (PTEs) for each page. This has the effect of forcing a real page-fault exception on the CPU. The hypervisor would then be responsible for propagating that fault to Domain 0 to be fixed up. The migration process would then bring in the page and fixup the page-table entry back to normal. x86 PTEs currently have 2 or 3 unused bits in their lower order bits that can be used to track this information for fixup. 3. Pseudo Swapping: This solution preserves the spirit of para-virtualization, but remains transparent to applications. The idea is to take the set of all pageable application and page cache memory within the guest VM and make it “suddenly appear” that it has been swapped out but without the actual cost of doing so - and without the use of any disks whatsoever. Although this sounds strange, recall that the source VM is not running during post-copy. Only the target VM is running. So the memory reservation that the source VM is occupying is essentially acting like a limited swap device. During resume time, the guest VM itself can be paravirtualized to request


those pages from a sort of pseudo swap device. In the end, we chose to use Pseudo Swapping because it was the quickest to implement; it is illustrated in Figure 5.3. We actually started with Page Tracking, but stopped working on it. We believe that Page Tracking is the fastest, most efficient form of demand paging at the system VM level, because the faults are true CPU exceptions. We began implementing it by trapping those exceptions directly within the hypervisor and then propagating a new virtual interrupt to Domain 0. The major problem with this scheme is that there exists no way in modern operating systems to detect when a physical page frame is no longer in use by the operating system. Ideally, one could imagine an architecturally defined bitmap structure that is managed by the OS, not unlike the way a page table is architecturally defined. This bitmap would allow the hardware to know which page frames actually contain real bytes and which are free. Once page tracking was initiated, Domain 0 could use this bitmap, in combination with the aforementioned page-table modifications, to determine whether or not it was still necessary to fix up the PTE at the given time. Page Tracking is not feasible without this feature. On the other hand, Shadow Paging provides a clear middle ground: although it would be slower than Page Tracking (due to the extra level of PTE propagation), it is more transparent than Pseudo Swapping. For the most part, such an implementation would remain relatively unchanged, except for making a hook available for trapping into Domain 0. Recently, a version of this type of demand paging for use in parallel cloud computing was demonstrated in a tech report [44] built on top of the Xen hypervisor. Our page-fault detection is implemented through the use of two loadable kernel modules: one sits inside the migrating VM and one sits inside Domain 0. These modules leverage our prior work called MemX [49], which provides distributed paging support for both Xen VMs and Linux machines at the kernel level. Once the target is ready to begin pre-paging in the post-copy algorithm, MemX is invoked to service page faults through the use of pseudo swapping as described. Figure 5.4 illustrates a high-level overview of how pre-copy and post-copy relate to each other.

Figure 5.4: The intersection of downtime within the two migration schemes. Currently, our downtime consists of sending non-pageable memory (which could be eliminated by employing shadow paging). Pre-copy downtime consists of sending the last round of pages.

Recall that in order to use Pseudo Swapping to implement demand paging, one can only apply it to the set of all pageable memory in the system. Thus, the remaining memory (which is typically made up of small in-kernel caches or pinned pages) must be sent to the target host during downtime. The drawback of Pseudo Swapping is that it places a small lower bound on the achievable downtime of our implementation of post-copy, but this is not a fundamental limitation of the post-copy method of migration by any means. In future work, we plan to switch to Shadow Paging as the means of implementing the demand-paging component of post-copy, which will eliminate this drawback. Nonetheless, we retain the worsened downtime values in our performance experiments; these downtimes typically range from 600 ms to a little over one second.
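As an aside on the Page Tracking alternative discussed above: the x86 architecture defines bit 0 of a page-table entry as the present bit and leaves bits 9-11 available to software, so a fixup scheme could, in principle, clear the present bit during downtime and record a "fetch over the network" marker in a software bit. The standalone sketch below illustrates the idea; the macro names are ours, not Xen's or Linux's.

/* Standalone illustration of the "page tracking" idea: mark a PTE not-present
 * during downtime and remember why in a software-available bit (11:9).
 * The macro names below are illustrative, not from Xen or Linux.
 */
#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT      (1ULL << 0)
#define PTE_SW_POSTCOPY  (1ULL << 9)   /* software bit: "fetch me over the network" */

static uint64_t mark_for_postcopy(uint64_t pte)
{
    return (pte & ~PTE_PRESENT) | PTE_SW_POSTCOPY;
}

static int needs_network_fetch(uint64_t pte)
{
    return !(pte & PTE_PRESENT) && (pte & PTE_SW_POSTCOPY);
}

static uint64_t fixup_after_fetch(uint64_t pte, uint64_t new_mfn)
{
    /* Install the freshly fetched frame (bits 12+) and make the PTE valid again. */
    pte &= 0xfffULL;                   /* keep the low attribute bits              */
    pte &= ~PTE_SW_POSTCOPY;
    return pte | (new_mfn << 12) | PTE_PRESENT;
}

int main(void)
{
    uint64_t pte = (0x1234ULL << 12) | PTE_PRESENT | 0x067;  /* typical attributes */

    pte = mark_for_postcopy(pte);
    printf("needs fetch: %d\n", needs_network_fetch(pte));

    pte = fixup_after_fetch(pte, 0x5678ULL);
    printf("present again: %d, mfn=0x%llx\n",
           (int)(pte & PTE_PRESENT), (unsigned long long)(pte >> 12));
    return 0;
}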

5.3.2

MFN Exchanging

Because of our expedient choice of implementation, it was necessary to devise a way of making it appear that the set of all pageable memory in the guest VM had been swapped out, without actually moving those pages anywhere. This can be accomplished in two ways: we can either transfer the pages out of the guest VM (and into the maintenance VM), or we can move each physical frame to a new location within the VM itself (with zero copying). We chose the latter because it does not place any extra dependencies on the maintenance VM. We accomplish this by performing what we call an "MFN exchange". This works by first doubling the memory reservation of the VM and allocating free pages from
the new memory, and then briefly suspending all of the running processes in the system. We then instruct the kernel to swap out each pageable frame. Each time a used frame is paged, we rewrite the hypervisor's PFN-to-MFN mapping table (called the "physmap") and exchange the two physical frames without actually copying them. We do the same for the kernel-level page-table entries of both physical frames. This is efficient because we batch the hypercalls necessary to perform these operations within the hypervisor. Once downtime has completed, we restart the applications and wait for page faults to the pseudo swap device to arrive.
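The following user-space sketch captures only the bookkeeping of an MFN exchange: it swaps the machine frames that back two pseudo-physical frames in a mock p2m table, without copying any page contents. The real implementation performs the equivalent updates to Xen's physmap and to the guest page tables through batched hypercalls, which are not shown; all names here are illustrative.

/* Mock of the MFN-exchange bookkeeping: swap the machine frames that back two
 * pseudo-physical frames (PFNs) in a p2m table, with no data copy.
 */
#include <stdio.h>
#include <stdint.h>

#define GUEST_PAGES 8

static uint64_t p2m[GUEST_PAGES];          /* pfn -> mfn mapping ("physmap")   */

static void mfn_exchange(uint64_t pfn_used, uint64_t pfn_free)
{
    /* The used PFN now points at a fresh frame, while the old frame is parked
     * behind the "pseudo swap" PFN in the doubled reservation.                */
    uint64_t tmp  = p2m[pfn_used];
    p2m[pfn_used] = p2m[pfn_free];
    p2m[pfn_free] = tmp;
}

int main(void)
{
    /* Pretend the first half of the reservation is in use and the second half
     * was just allocated after doubling the reservation.                      */
    for (uint64_t pfn = 0; pfn < GUEST_PAGES; pfn++)
        p2m[pfn] = 0x1000 + pfn;           /* arbitrary machine frame numbers  */

    /* "Swap out" pageable frame 1 into the pseudo-paging area (frame 5).      */
    mfn_exchange(1, 5);

    for (uint64_t pfn = 0; pfn < GUEST_PAGES; pfn++)
        printf("pfn %llu -> mfn 0x%llx\n",
               (unsigned long long)pfn, (unsigned long long)p2m[pfn]);
    return 0;
}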

5.3.3

Xen Daemon Modifications

A handful of modifications to the Xen Daemon were made to support page-fault detection. The Xen Daemon has the responsibility of initializing the migration and the initial memory-transfer process, including page tables and CPU state. For our system, the only memory transfer the daemon is responsible for is the transfer of non-pageable memory; all other pages are ignored until later. Additionally, the set of pages that are eliminated through self-ballooning must also be ignored. By default, however, the Xen Daemon has no way of knowing whether a particular memory page actually belongs to any of those three categories (pageable, non-pageable, or ballooned), because of the strict memory reservation policy employed by Xen (as it should be). This presents a problem for Post-Copy: non-pageable memory is transferred in our system using the same code that runs when the daemon is ready to execute a Pre-Copy iteration in the original system. Thus, to support our system, we patch this code to check a new bitmap data structure that indicates whether or not a particular frame should actually be sent (rather than just treating all pages as dirty or not dirty, as in the original system). This bitmap is populated by the kernel module running inside the guest VM itself at the source (before downtime begins). The next part is not so obvious upon first examination: the Xen Daemon (the management process running inside the co-located Domain 0 on the same host) needs to be able to read this bitmap from user space. Thus, we perform a memory mapping from the kernel
space of the guest VM to the user space of the Xen Daemon in the other virtual machine. Furthermore, in order to perform a successful memory mapping, the Xen Daemon needs to know in advance the physical frame numbers (the MFNs) of each page frame that physically backs that bitmap data structure. This is required because of the nature of performing a memory mapping: the Xen Daemon's page tables must be populated with physical MFNs, not virtual ones. As a result, we use a physically contiguous mapping to discover these MFNs: the physical-to-machine (p2m) mapping table. This is a table that translates every PFN of a guest (from 0 to the maximum) into a physical frame number (MFN) owned by the guest virtual machine. To complete the memory mapping, the Daemon therefore only needs to know two pieces of information: the location of the *first* virtual frame number and the total number of frames. Thus, the guest VM needs only to transmit these two values to the Xen Daemon before downtime begins. We accomplish this by exporting the address (the PFN, specifically) of the (virtually contiguous) first frame of the bitmap inside of the p2m table into the "Xen Store". The Xen Store is a messaging abstraction that allows Xen virtual machines to communicate small pieces of information to each other, organized into a directory structure for each co-located virtual machine on the same host. Recall that we also have a kernel module running inside the management VM that acts as the retrieval entity for the whole post-copy process and is responsible for facilitating pseudo-paging. This module reads the first bitmap frame address from the Xen Store and then communicates that information upwards to the Xen Daemon running inside the same virtual machine. The daemon then performs a memory mapping of this bitmap by grabbing each MFN out of the p2m table, one by one, based on this first frame number. Finally, once the bitmap and the physical frames are mapped, the Daemon can determine which frames should be transmitted to the target host and which ones can be ignored by simply checking the bitmap.
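The filtering step that the patched daemon performs then reduces to a per-frame bitmap test, roughly as in the sketch below. The names should_send() and skip_bitmap are illustrative, not the identifiers used in the actual patch.

/* Simplified version of the migration daemon's transmit filter: a frame is
 * sent only if the guest-populated bitmap does not mark it as pageable or
 * ballooned.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define GUEST_PAGES   1024
#define BITS_PER_LONG (8 * sizeof(unsigned long))

static unsigned long skip_bitmap[GUEST_PAGES / BITS_PER_LONG];

static int test_bit(unsigned long *map, unsigned long pfn)
{
    return (map[pfn / BITS_PER_LONG] >> (pfn % BITS_PER_LONG)) & 1UL;
}

static void set_bit(unsigned long *map, unsigned long pfn)
{
    map[pfn / BITS_PER_LONG] |= 1UL << (pfn % BITS_PER_LONG);
}

/* Only non-pageable, non-ballooned frames are copied during downtime. */
static int should_send(unsigned long pfn)
{
    return !test_bit(skip_bitmap, pfn);
}

int main(void)
{
    memset(skip_bitmap, 0, sizeof(skip_bitmap));
    set_bit(skip_bitmap, 7);               /* pretend frame 7 was pseudo-swapped */

    for (unsigned long pfn = 5; pfn < 10; pfn++)
        printf("pfn %lu: %s\n", pfn, should_send(pfn) ? "send" : "skip");
    return 0;
}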

5.3.4

VM-to-VM Kernel-to-Kernel Memory Mapping

During downtime, in addition to the Xen Daemon modifications in the previous section, the third-party module running within Domain 0 that is responsible for transmitting faulted pages to the target host also has the responsibility of memory-mapping the entirety of the guest VM's memory footprint. This is done to avoid copying memory. A problem arises, however, which is similar to the one presented in Section 5.3.3: in order to complete this memory mapping, we must again know the addresses of each page frame owned by the migrating guest VM. This is a much larger task, however, because we are not just exporting a bitmap to another virtual machine (where the total mapped data is one bit per page); instead, we are memory-mapping 8 bytes per page owned by the guest. Thus, for a common 512 MB guest virtual machine, we have a megabyte of data to transmit to the other virtual machine (512 MB constitutes 128K pages, so 64-bit page-frame identifiers require a megabyte of memory to store all of the physical frame numbers). The problem with this megabyte is that one cannot simply allocate a contiguous megabyte of memory in kernel space with any guaranteed certainty. Slab caches and kmalloc are not meant for that, which leaves the alloc_pages() family of routines in Linux. These routines allocate memory in power-of-two orders of 4 KB pages, and the largest contiguous order allowed by Linux is 12 (and that is under ideal circumstances). Even a simple 1 MB allocation requires an "order-8" memory allocation, and larger VM memory sizes would approach orders 9 and 10. Under a heavily utilized system, it is highly unlikely that the Linux buddy system would return success on such requests. This requires us to find another way to send this 1 MB of data to the third-party module inside Domain 0: through a second-level memory mapping. This solution involves constructing a kind of "impromptu" page-table structure. This structure has the exact same 3-level hierarchy as a regular page table, except that it is not architecturally defined; it still places the required MFN data at the leaves of the tree. We create this structure very quickly and pass the root of the table to the third-party module through the Xen Store, as was done in the previous section. During downtime,
the receiving module maps each frame of the page table structure itself in a recursive fashion beginning at the root. Once that is complete, it maps all of the page frame mfn numbers stored at the leaves of the table. These leaves collectively store the addresses of only those page frames that can potentially incur page faults. Thus, when a page fault actually occurs at the target host, the module needs to only consult this table and snap up that page to be ready for transmission without any copying whatsoever.
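The following self-contained sketch approximates the "impromptu" three-level table in user space: each level is a 4 KB page of 512 eight-byte slots, with MFNs stored at the leaves. In the real module each level holds machine frame numbers that Domain 0 memory-maps one at a time; here ordinary heap pointers stand in for those frames, and all names are ours.

/* User-space approximation of the 3-level table used to hand the guest's
 * frame list to Domain 0: each level is one 4 KB page of 512 eight-byte
 * slots, and the leaves hold the MFNs of potentially faulting pages.
 */
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define SLOTS 512                          /* 4096 bytes / 8-byte entries      */

typedef uint64_t page_t[SLOTS];

static page_t *root;

static void insert_mfn(uint64_t index, uint64_t mfn)
{
    uint64_t i1 = (index / (SLOTS * SLOTS)) % SLOTS;
    uint64_t i2 = (index / SLOTS) % SLOTS;
    uint64_t i3 = index % SLOTS;

    if (!root)
        root = calloc(1, sizeof(page_t));

    page_t *mid = (page_t *)(uintptr_t)(*root)[i1];
    if (!mid)  { mid  = calloc(1, sizeof(page_t)); (*root)[i1] = (uint64_t)(uintptr_t)mid; }

    page_t *leaf = (page_t *)(uintptr_t)(*mid)[i2];
    if (!leaf) { leaf = calloc(1, sizeof(page_t)); (*mid)[i2]  = (uint64_t)(uintptr_t)leaf; }

    (*leaf)[i3] = mfn;                     /* the MFN lives at the leaf        */
}

static uint64_t lookup_mfn(uint64_t index)
{
    page_t *mid  = (page_t *)(uintptr_t)(*root)[(index / (SLOTS * SLOTS)) % SLOTS];
    page_t *leaf = (page_t *)(uintptr_t)(*mid)[(index / SLOTS) % SLOTS];
    return (*leaf)[index % SLOTS];
}

int main(void)
{
    insert_mfn(42, 0x1234);
    insert_mfn(128 * 1024, 0xabcdef);      /* a 512 MB guest has 128K pages    */

    printf("mfn[42] = 0x%llx\n", (unsigned long long)lookup_mfn(42));
    printf("mfn[131072] = 0x%llx\n", (unsigned long long)lookup_mfn(128 * 1024));
    return 0;
}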

5.3.5

Dynamic Self-Ballooning Implementation

The Xen hypervisor has a set of hypercalls that allow a guest to change its memory reservation on demand. The general idea behind implementing DSB under Xen is three-fold. We discuss how each of these steps is implemented within our version of the post-copy system and also how we modified it to be used with the original Xen implementation of pre-copy:

1. Inflate the balloon: For migration, this is accomplished by allocating as much free memory as possible and handing those pages over to the "decrease reservation" hypercall. This results in those pages being placed under the ownership of the hypervisor.

2. Detect memory pressure: There are a few ways of doing this within Linux, which we describe shortly. Memory pressure indicates that either an application or the kernel needs a page frame right now. In response, the DSB process must deflate the balloon by the corresponding amount of memory pressure (but it need not destroy the balloon completely).

3. Deflate the balloon: This is accomplished by performing the reverse of step 1: the DSB process first invokes the "increase reservation" hypercall, then releases the list of free pages that were previously allocated (and handed to the hypervisor for re-use), giving them back to the kernel's free pool.

In order to rapidly inflate and deflate the balloon, we first had to determine where to initiate these operations. One can either place the DSB process within Domain 0 and
communicate the intent to modify the balloon to the migrating VM through Xen's internal communication mechanism (the XenStore), or one can place the DSB process within the VM itself. Because performing ballooning requires internal knowledge of the kernel anyway, we decided to go with the latter. The deciding factor in this placement, however, was actually the balloon driver that ships with the Xen source code. We found this driver to be a little slow: it does not batch the hypercalls required to perform ballooning, but instead executes them one by one. In our laboratory, we observed that a single hypercall can take as long as 2-3 microseconds. If we expect to perform DSB rapidly, these hypercalls must be batched together into a single hypercall - a feature that Xen already provides; it simply needed a little kick forward. Thus, we placed the DSB process within the guest VM itself and updated the existing driver to perform this batching. Memory Overcommitment. Memory over-commitment within an individual operating system is a method by which the virtual memory subsystem provides the illusion to an application that the physical memory in the machine is larger than it actually is. There are multiple operating modes of over-commitment within the Linux kernel, and these modes can be enabled or disabled at runtime. By default, Linux disables this feature, which has the effect of precluding application-level memory allocations in advance by returning a failure: if an application submits a memory allocation request without sufficient physical memory available, Linux will return an error. If over-commitment is enabled, however, the kernel will view the set of physical memory as effectively infinite. One could spend an entire paper arguing that the over-commitment feature should be enabled by default, but the Linux community has instead chosen to "err on the side of caution" and defer such a decision to experienced system administrators. Over-commitment is required for the transparent detection of memory pressure that we have developed for our version of the DSB process, which we describe next. Detecting Memory Pressure. Surprisingly enough, the Linux kernel already provides a transparent way of doing this: through the filesystem interface. When a new filesystem is registered with the kernel, one of the function pointers provided includes a callback to request that the filesystem free any in-kernel data caches that the filesystem may have
pinned in memory. Such caches typically include things like inode and directory entry caches. These callbacks are driven by the virtual memory system and are invoked when applications ask for more memory. Indirectly, the virtual memory system makes this determination when it is time to perform a copy-on-write on behalf of an application that has allocated a large amount of memory but has only recently decided to write to it for the first time. Consequently, the DSB process does not register a new filesystem, but we are still allowed to register a callback function that the virtual memory system can use. This worked remarkably well and does indeed provide very precise feedback to the DSB process on exactly when a memory-intensive application has become active. The Linux function in question is called set_shrinker(). Alternatively, one could periodically wake up the DSB process at an interval and scan the /proc/meminfo and /proc/vmstat files to determine this information by hand; we found the filesystem interface to be both more direct and more accurate. Whenever we get a callback, it already contains a numeric value of exactly how many pages it wants the DSB process to release at once. The size of this batch is typically 128 pages. The callback can happen very frequently, back to back, on behalf of active user applications. Each time the callback occurs, the DSB process deflates the balloon by the requested amount and goes back to sleep. Completing the DSB process. Finally, the DSB process, with the ability to detect memory pressure, must periodically reclaim free pages that may or may not have been released by running applications or the kernel itself. We perform this sort of "garbage collection" within a kernel thread. (Note: this is not true garbage collection - that is not our intention.) The kernel thread wakes up at periodic intervals, attempts to re-inflate the balloon as much as possible, and then goes back to sleep. If memory pressure is detected during this time, the thread preempts itself, ceases inflation, and goes back to sleep. The only other thing required to complete this is a 200-line patch to the Xen migration daemon running within Domain 0. Recall the operation of the DSB process with respect to pre-copy and post-copy: post-copy uses DSB only once - the kernel thread balloons a single time before downtime occurs and goes back to sleep - whereas DSB runs continuously for pre-copy. The migration daemon has a policy to which it strictly adheres: if a page frame has never been mapped before, it will not be migrated or transmitted. Note that this is not
the same as detecting whether or not a page frame has been allocated and subsequently freed - the policy applies only to whether a page has ever been allocated for the first time (by assigning a machine frame number to the corresponding pseudo-physical frame number). This information is stored in what Xen calls the "physmap", which we discussed earlier in the MFN-exchanging section. A property of this physmap is that the total number of valid entries in the map is monotonically increasing; it never decreases on the same host. This means that if the DSB process has inflated the balloon and the balloon contains a page frame that is mapped inside the physmap table, then the migration daemon will transmit that frame regardless, which defeats the purpose of the DSB process. So we modify the migration daemon by exposing to it the list of ballooned pages. As a result, whenever the migration daemon is ready to transmit a particular page, it first consults that list and skips transmission if the page is in the list. (This list is actually a bitmap.) Our suggestion to the Xen community is to develop a sort of watermarked "dynamic physmap garbage collection" such that the kernel would be responsible for clearing the physmap when it is no longer using a page. This is almost identical to the earlier suggestion in the Page Tracking scheme we devised, except that such use of the physmap would not be architecturally defined - nor would it necessarily be visible to the hardware. We believe that a garbage-collected physmap would allow for both the seamless implementation of Dynamic Self-Ballooning and the ability to implement Page Tracking without hardware support. But for now, we are using the cards we have been dealt.
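The control logic of the DSB process can be summarized with the following user-space sketch. The inflate and deflate helpers stand in for the batched increase/decrease-reservation hypercalls, and the shrink callback models the set_shrinker() hook described above; all names and constants here are illustrative rather than taken from our module.

/* Sketch of the DSB policy only: a periodic pass re-inflates the balloon, and
 * a memory-pressure callback deflates it by the amount the kernel asks for.
 */
#include <stdio.h>

#define INTERVAL_SECS    5       /* interval of about 5 seconds (Section 5.2.4) */
#define SHRINK_BATCH     128     /* pages the callback typically asks for       */

static long balloon_pages;       /* pages currently held by the balloon         */
static long free_pages = 4096;   /* pages the (mock) kernel could spare         */

static void inflate_balloon(void)
{
    /* Grab as much free memory as possible and hand it to the hypervisor.      */
    balloon_pages += free_pages;
    free_pages = 0;
}

/* Memory-pressure callback: release pages back to the kernel's free pool.      */
static long dsb_shrink(long nr_requested)
{
    long give_back = (nr_requested < balloon_pages) ? nr_requested : balloon_pages;
    balloon_pages -= give_back;
    free_pages    += give_back;
    return give_back;
}

int main(void)
{
    for (int tick = 0; tick < 3; tick++) {
        inflate_balloon();                       /* periodic kernel-thread wakeup */
        printf("t=%ds balloon=%ld pages\n", tick * INTERVAL_SECS, balloon_pages);

        if (tick == 1)                           /* pretend an application needs RAM */
            printf("  pressure: released %ld pages\n", dsb_shrink(SHRINK_BATCH));
    }
    return 0;
}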

5.3.6

Proactive LRU Ordering to Improve Reference Locality

During normal operation, the guest kernel maintains the age of each allocated page frame in its page cache. Linux, for example, maintains two linked lists in which pages are maintained in Least Recently Used (LRU) order: one for active pages and one for inactive pages. A kernel daemon periodically ages and transfers these pages between the two lists. The inactive list is subsequently used by the paging system to reclaim pages and write to the swap device. As a result, the order in which pages are written to the swap device reflects the historical locality of access by processes in the VM. Ideally, the active


push component of post-copy could simply use this ordering of pages in its pseudo-paging device to predict the page access pattern in the migrated VM and push pages just in time to avoid network faults. However, Linux does not actively maintain the LRU ordering in these lists until a swap device is enabled. Since a pseudo-paging device is enabled just before migration, post-copy would not automatically see pages in the swap device ordered in the LRU order. To address this problem, we implemented a kernel thread which periodically scans and reorders the active and inactive lists in LRU order, without modifying the core kernel itself. In each scan, the thread examines the referenced bit of each page. Pages with their referenced bit set are moved to the most recently used end of the list and their referenced bit is reset. This mechanism supplements the kernel’s existing aging support without the requirement that a real paging device be turned on. Section 5.4.4 shows that such a proactive LRU ordering plays a positive role in reducing network faults. Lines of Code. The kernel-level implementation of Post-Copy, which leverages the MemX system, is about 7000 lines of code within pluggable kernel modules. 4000 lines of that is part of the MemX system that is invoked during demand-paging. 3000 lines contribute to the pre-paging component, the flushing component, and the ballooning component combined. (The DSB implementation also operates within the aforementioned kernel modules and runs inside the guest OS itself as a kernel thread. There is no dom0 interaction with the DSB process). A 200 line patch is applied to the migration daemon to support ballooning and a 300-line patch is applied to the guest kernel so that the initiation of pseudo swapping can begin. When all is said and done, the system remains completely transparent to applications and approaches about 8000 lines. Neither the original pre-copy algorithm code, nor the hypervisor itself is changed at all. (As discussed before, alternative page-fault detection methods will require additional hypervisor support).
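Returning to the LRU reordering pass described above, the following self-contained C sketch illustrates the policy: pages whose referenced bit is set move to the most-recently-used end of the list and have the bit cleared. An array of small descriptors stands in for the kernel's active and inactive lists; none of the identifiers below come from the actual kernel module.

/* Illustration of the proactive LRU-ordering pass. */
#include <stdio.h>
#include <string.h>

struct page { int pfn; int referenced; };

#define NPAGES 6

static void lru_scan(struct page list[], int n)
{
    struct page reordered[NPAGES];
    int k = 0;

    for (int i = 0; i < n; i++)             /* keep un-referenced pages first   */
        if (!list[i].referenced)
            reordered[k++] = list[i];

    for (int i = 0; i < n; i++)             /* referenced pages move to the MRU end */
        if (list[i].referenced) {
            list[i].referenced = 0;         /* reset the bit, as aging would    */
            reordered[k++] = list[i];
        }

    memcpy(list, reordered, n * sizeof(*list));
}

int main(void)
{
    struct page list[NPAGES] = {
        {10, 0}, {11, 1}, {12, 0}, {13, 1}, {14, 0}, {15, 0}
    };

    lru_scan(list, NPAGES);                 /* referenced pages drift toward MRU */

    for (int i = 0; i < NPAGES; i++)
        printf("%s pfn %d\n", i ? "     " : "LRU->", list[i].pfn);
    return 0;
}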

5.4

Evaluation

In this section, we present the detailed evaluation of our post-copy implementation and compare it against Xen’s original pre-copy migration. Our test environment consists of two 2.8 GHz dual core Intel machines connected via a Gigabit Ethernet switch. Each


machine has 4 GB of memory. Both the guest VM in each experiment and Domain 0 are configured to use two virtual CPUs. Guest VM sizes range from 128 MB to 1024 MB; unless otherwise specified, the default guest VM size is 512 MB. In addition to the performance metrics mentioned in Section 5.2, we evaluate post-copy against an additional metric. Recall that post-copy is effective only when a large majority of the pages reach the target node before they are faulted upon by the VM at the target, in which case they become minor page faults rather than network-bound page faults. Thus, the fraction of network page faults relative to minor page faults is another indication of the effectiveness of our post-copy approach. Secondly, we quantify the pages transferred by pre-copy by scraping those numbers from the Xen logs; for post-copy, we export this information through proc files. That value is then added to the number of pages that make up "non-pageable memory" for a grand total.

5.4.1

Stress Testing

We start with a stress test of both migration schemes, using a simple, highly sequential, memory-intensive C program. The program accepts a parameter that changes the working set of memory accesses and a second parameter that controls whether it performs memory reads or writes during the test; a minimal sketch of such a program follows, and the seven test configurations we use are listed after it. The experiment is performed in a 1024 MB VM with its working set ranging from 8 MB to 512 MB; the rest is simply free memory.
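This sketch is illustrative only and is not the exact program used in our experiments:

/* Minimal sketch of the stress test: touch a working set of the requested
 * size in a tight, highly sequential loop, either reading or writing.
 * Usage: ./stress <working-set-MB> <r|w>
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <working-set-MB> <r|w>\n", argv[0]);
        return 1;
    }

    size_t bytes = (size_t)atol(argv[1]) * 1024 * 1024;
    int writes   = (argv[2][0] == 'w');
    volatile char *buf = malloc(bytes);
    volatile long sink = 0;

    if (!buf)
        return 1;
    memset((void *)buf, 1, bytes);              /* fault the pages in once      */

    for (;;) {                                  /* run until killed             */
        for (size_t i = 0; i < bytes; i += 4096) {
            if (writes)
                buf[i]++;                       /* dirty one byte per page      */
            else
                sink += buf[i];                 /* read one byte per page       */
        }
    }
    return 0;
}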

We perform the experiment with seven different test configurations:

1. Stop-and-Copy Migration: A non-live VM migration scenario that provides a baseline for comparing the total migration time and the number of pages transferred by post-copy.

2. Read-intensive Pre-Copy: The best-case workload for pre-copy. Its total migration time is expected to be roughly similar to that of pure stop-and-copy migration.

3. Write-intensive Pre-Copy: The worst-case workload for pre-copy, which worsens all performance metrics.

4. Read-intensive Post-Copy:

5. Write-intensive Post-Copy: Together with configuration 4, this stresses our pre-paging algorithm and flushing implementation; the two are expected to perform almost identically.

6. Read-intensive Pre-Copy without DSB:

7. Write-intensive Pre-Copy without DSB: These two configurations test the default implementation of pre-copy in Xen, which does not use DSB.

Unless we specify otherwise, the reader should assume that DSB is turned on for pre-copy; post-copy always uses DSB. For each figure, the plots in the legend appear in the same order, top to bottom, as they appear in the figure.

Figure 5.5: Comparison of total migration times between post-copy and pre-copy.

Total Migration Time: Figure 5.5 shows the variation of total migration time with increasing working set size. Notice that both post-copy plots are at the bottom, surpassed only by read-intensive pre-copy. Our first observation is that the read-intensive and write-intensive post-copy tests perform very similarly; our post-copy algorithm's performance is thus agnostic to the read- or write-intensive nature of the application workload. Future work might involve giving higher priority to page-fault writes over reads. Furthermore, we observe that without DSB activated, as in the default Xen implementation, the total migration time for read-intensive pre-copy is very high due to the unnecessary transmission of free guest pages over the network. This effect shows up in the three remaining plots as well.

Figure 5.6: Comparison of downtimes between pre-copy and post-copy.

Downtime: Figure 5.6 exhibits similar behavior for downtime as the working set size increases. Recall that our choice of page-fault detection in Section 5.3 increases the base downtime in post-copy; thus, the figure shows a roughly constant downtime that ranges from 600 milliseconds to over one second. As expected, the downtime for the write-intensive pre-copy test increases significantly as the size of the writable working set increases.

Pages Transferred and Page Faults: Figure 5.7 and Table 5.2 illustrate the utility of our pre-paging algorithm in post-copy across increasingly large working set sizes. Figure 5.7 plots the total number of pages transferred. As expected, post-copy transfers fewer pages than write-intensive pre-copy as well as pre-copy without DSB, the reduction being as much as 85%. It performs on par with read-intensive pre-copy with DSB and with stop-and-copy, all of which transfer each page only once over the network. Table 5.2 compares the fraction of network and minor faults in post-copy. We see that pre-paging reduces the fraction of network faults by 7 to 13 percentage points, from the 9-15% incurred with flushing alone down to 2-4%. To be fair, the stress test is highly sequential in nature and, consequently, pre-paging predicts this behavior almost perfectly. We expect the real applications in the next section to do worse than this optimal case.

Figure 5.7: Comparison of the number of pages transferred during a single migration.

Working Set    Pre-Paging (Net / Minor)    Flushing (Net / Minor)
8 MB           2% / 98%                    15% / 85%
16 MB          4% / 96%                    13% / 87%
32 MB          4% / 96%                    13% / 87%
64 MB          3% / 97%                    10% / 90%
128 MB         3% / 97%                    9% / 91%
256 MB         3% / 98%                    10% / 90%

Table 5.2: Percentage of minor and network faults for flushing vs. pre-paging. Pre-paging greatly reduces the fraction of network faults.

Figure 5.8: Kernel compile with back-to-back migrations using 5-second pauses.

5.4.2

Degradation, Bandwidth, and Ballooning

Next, we quantify the side effects of migration on a couple of sample applications. We want to answer the following questions: What kind of slow-down do VM workloads experience during pre-copy versus post-copy migration? What is their impact on network bandwidth received by applications? And finally, what kind of balloon inflation interval should we choose to minimize the impact of DSB on running applications? For application degradation and DSB interval, we use Linux kernel compilation. For bandwidth testing we use the NetPerf TCP benchmark.


Degradation Time: Figure 5.8 depicts a repeat of an interesting experiment from [73]. We initiate a kernel compile inside the VM and then migrate the VM repeatedly between two hosts, scripting the migrations to pause for 5 seconds each time. Although there is no exact way to quantify degradation time (due to scheduling and context switching), this experiment provides an approximate measure. As far as memory is concerned, we observe that kernel compilation tends not to exhibit many memory writes. (Once gcc forks and compiles, the OS page cache will only be used once more at the end, to link the kernel object files together.) As a result, this experiment is good for a post-copy comparison because it represents the best case for the original pre-copy approach, where there is not much repeated dirtying of pages. It is also a good worst-case tester for our implementation of Dynamic Self-Ballooning, due to the repeated fork-and-exit behavior of the kernel compile as each object file is created over time. (Interestingly enough, this experiment also gave us a headache, because it exposed the bugs in our code!) We were surprised to see how many additional seconds were added to the kernel compilation in Figure 5.8 just by executing back-to-back invocations of pre-copy migration. Nevertheless, we observe that post-copy matches pre-copy in the amount of degradation. Although we would have preferred to see less degradation than pre-copy, we can at least rest assured that we are not doing worse. This is in line with the competitive performance of post-copy in the read-intensive pre-copy tests in Figures 5.5 and 5.7. We suspect that a shadow-paging-based implementation of post-copy would perform much better, due to the significantly reduced downtime it would provide. Additionally, Figure 5.9 shows the same experiment using NetPerf. A sustained, high-bandwidth stream of network traffic causes slightly more page dirtying than the compilation does. The setup places the NetPerf sender inside the guest VM and the receiver on an external node on the same switch. Consequently, regardless of VM size, post-copy actually performs slightly better and reduces the degradation time experienced by NetPerf. The figure also shows an example of the severe degradation caused by the transmission of free pages when DSB is not used.

Figure 5.9: NetPerf run with back-to-back migrations using 5-second pauses.

Effect on Bandwidth: In their paper [27], the Xen project proposed a solution called "adaptive rate limiting" to control the bandwidth overhead due to migration. However, this feature is not enabled in the currently released version of Xen; in fact, it is compiled out, with no runtime options or pre-processor directives to enable it, and each of our experiments therefore operates in that default mode. This is likely because it is difficult, if not impossible, to predict beforehand the bandwidth requirement of any single guest in order to guide the behavior of adaptive rate limiting. We believe this choice makes sense: the migration daemon cannot really guess what the guest is hosting - if it is, say, a webserver, the webserver will likely take whatever size pipe it can get its hands on, which suggests that the migration daemon should just let TCP do what it normally does. On the other hand, the daemon might use up CPU cycles that would otherwise be granted to the guest itself. The point is that it is all guesswork without some kind of signal from the guest. Hence, there is no explicit arbitration of the network bandwidth contention between the simultaneous operation of the migration daemon and a network-heavy application. In fact, looking through the Xen daemon's code, the end of the pre-copy iteration process is guided by only two factors: a 30-iteration maximum combined with a minimum page-dirtying rate of 50 pages per pre-copy round; the daemon keeps iterating until one of those conditions is met, which is why even mildly write-intensive applications never converge. With that in mind, Figures 5.10 and 5.11 show a visual representation of the reduction in bandwidth experienced by a high-throughput NetPerf session. We conduct this experiment by measuring bandwidth values rapidly and invoking a VM migration in the middle. The impact of migration can be seen in both figures as a sudden reduction in the observed bandwidth during migration. This reduction is more sustained, and greater, for the pre-copy approach than for post-copy, because the total number of pages transferred by pre-copy is much higher. This is exactly the bottom line that we were targeting for improvement.

Figure 5.10: Impact of post-copy on NetPerf bandwidth. (Annotated phases: 1. normal operation; 2. DSB invocation; 3. CPU + non-paged memory; 4. resume + pre-paging; 5. migration complete.)

Figure 5.11: Impact of pre-copy on NetPerf bandwidth. (Annotated phases: 1. normal operation; 2. DSB invocation; 3. iterative memory copies; 4. CPU-state transfer; 5. migration complete.)

Figure 5.12: The application degradation is inversely proportional to the ballooning interval. (Dynamic ballooning effects on completion time; kernel compile baseline of 439 secs; 128 MB and 512 MB guests; balloon interval measured in jiffies.)

Dynamic Ballooning Interval: Figure 5.12 shows how we chose the DSB interval, the period at which the DSB process wakes up to reclaim available free memory. With the kernel compile as the test application, we execute the DSB process at intervals from 10 ms to 10 s. At each interval, we script the kernel compile to run multiple times and record the average completion time. The difference between that number and the base case is the degradation time
added to the application by the DSB process due to its CPU usage. As expected, application degradation is inversely proportional to the ballooning interval: the more often we balloon, the more the VM workload is affected. The graph indicates that we should choose an interval between 4 and 10 seconds to balance frequent reclamation of free pages against a significant impact on applications. Note that this graph represents only one type of mixed application. For more CPU-intensive workloads, it would be necessary to make the ballooning interval adaptive, so that it could increase for CPU-intensive applications or for applications that perform rapid memory allocation.

5.4.3

Application Scenarios

The last part of our evaluation re-visits the aforementioned performance metrics across four real applications:

1. SPECWeb 2005: This is our largest application. It is a well-known webserver benchmark involving at least two physical hosts. We place the system under test within the guest VM, while six separate client nodes bombard the VM with connections.

2. BitTorrent Client: Although this is not a typical server application, we chose it because it is a simple representative of a multi-peer distributed application. It is easy to initiate and does not immediately saturate a Gigabit Ethernet pipe; instead, it fills up the network pipe gradually, is slightly CPU intensive, and involves a somewhat more complex mix of page dirtying and disk I/O than a kernel compile.

3. Linux Kernel Compile: We consider this again for consistency.

4. NetPerf: As in the previous experiments, the NetPerf sender is placed inside the guest VM.

Using these applications, we evaluate the same four primary metrics that we covered in Section 5.4.1: downtime, total migration time, pages transferred, and page faults. Each figure for these applications represents one of the four metrics and contains results for a constant 512 MB virtual machine in the form of a bar graph for both migration schemes across each application. Each data point is the average of 20 samples. Just as before, the guest VM is configured to have two virtual CPUs, and all of these experiments have DSB activated.

Pages Transferred and Page Faults. Figures 5.13 and 5.14 illustrate these results. For all of the applications except SPECWeb, post-copy reduces the total pages transferred by more than half. The most significant result we have seen so far is in Figure 5.14, where post-copy's pre-paging algorithm is able to avoid 79% and 83% of the network page faults (which become minor faults) for the largest applications (SPECWeb, BitTorrent). For the smaller applications (kernel compile, NetPerf), we still manage to save 41% and 43% of the network page faults. There is a significant amount of additional prior work in the literature aimed at working-set identification, and we believe that these improvements can be even better if we employ both knowledge-based and history-based predictors in our pre-paging algorithm. But even with a reactive approach, post-copy appears to be a strong competitor.

Total Time and Downtime. Figure 5.15 shows that post-copy reduces the total migration time for all applications, when compared to pre-copy, in some cases by more than 50%. However, the downtime in Figure 5.16 is currently much higher for post-copy than for pre-copy. As we explained earlier, the relatively high downtime is due to our expedient choice of pseudo-paging for page-fault detection, which we plan to reduce through the use of shadow paging. Nevertheless, this tradeoff between total migration time and downtime may be acceptable in situations where network overhead needs to be kept low and the entire migration needs to be completed quickly.

Figure 5.13: Total pages transferred for both migration schemes.

Figure 5.14: Page-fault comparisons: pre-paging lowers the network page faults to 17% and 21%, even for the heaviest applications.

Figure 5.15: Total migration time for both migration schemes.

Figure 5.16: Downtime for post-copy vs. pre-copy. Post-copy downtime can improve with better page-fault detection.

5.4.4

Comparison of Prepaging Strategies

This section compares the effectiveness of different prepaging strategies. The VM workload is a Quicksort application that sorts a randomly populated array of user-defined size. We vary the number of processes running Quicksort from 1 to 128, such that 512 MB of memory is collectively used among all processes. We migrate the VM in the middle of its workload execution and measure the number of network faults during migration; a smaller network fault count indicates better prepaging performance. We compare a number of prepaging combinations by varying the following factors: (1) whether or not some form of bubbling is used; (2) whether the bubbling occurs in the forward-only or dual directions; (3) whether single or multiple pivots are used; and (4) whether the page cache is maintained in LRU order. Figure 5.17 shows the results; each vertical bar represents an average over 20 experimental runs. Our first observation is that bubbling, in any form, performs better than push-only prepaging. Secondly, sorting the page cache in LRU order performs better than the non-LRU cases by improving the locality of reference of neighboring pages in the pseudo-paging device. Thirdly, dual-directional bubbling improves performance over forward-only bubbling in most cases and never performs significantly worse, which indicates that it is always preferable to use dual-directional bubbling. (The performance of reverse-only bubbling was found to be much worse than even push-only prepaging, hence its results are omitted.) Finally, dual multi-pivot bubbling consistently improves performance over single-pivot bubbling, since it exploits locality of reference at multiple locations in the pseudo-paging device.

Figure 5.17: Comparison of prepaging strategies using multi-process Quicksort workloads. (Strategies compared: Push-Only, Push-Only-LRU, Forward, Dual, Forward-MultiPivot, Forward-LRU, Dual-LRU, Forward-MultiPivot-LRU, and Dual-MultiPivot-LRU, for process counts from 1 to 128.)

5.5

Summary

We have presented post-copy based live virtual machine migration using adaptive pre-paging and dynamic self-ballooning. Post-copy is a combination solution consisting of four pieces: demand paging, our pre-paging algorithm called "bubbling", flushing, and the use of dynamic self-ballooning. We have implemented and evaluated this system and shown that it achieves significant performance improvements over pre-copy based migration of system virtual machines by reducing the number of pages transferred between the source and target hosts. Our future work will explore the use of alternative page-fault detection mechanisms as well as future applications of dynamic self-ballooning. There is a great deal of additional work that remains to be done. As we mentioned in Section 5.3, there are three different methods by which one can implement page-fault detection to support demand paging at the virtual machine level. We would like to set aside our expedient choice of pseudo-swapping in favor of a shadow-paging based method of detection, and, if possible, investigate extensions to the Xen physmap (the array of mappings between pseudo-physical and real page frames), with the goal of implementing the more efficient use of real CPU exceptions, which we called "page tracking". Second, as stated in Section 5.2, we must take care to address the reliability issue for post-copy so that we may provide the same level of reliability that the original pre-copy scheme provides.

Chapter 6

CIVIC: Transparent Over-subscription of VM Memory

In this chapter, we describe the design, implementation, and evaluation of Collective Indirect Virtual Caching, or CIVIC for short. CIVIC provides significantly lower-level support for access to virtual cluster-wide memory than MemX. It is a memory oversubscription system for VMs, designed to integrate the techniques from the previous three systems described in this dissertation, by which the hypervisor can multiplex individual page frames of unmodified virtual machines in a fine-grained manner. Three primary uses of CIVIC are:

1. Higher Consolidation: to oversubscribe the limited memory of a single physical host for the purpose of running larger numbers of consolidated virtual machines with greater use of the hardware, without depending on para-virtualization or ballooning.

2. Large-Memory Pool: to provide large-memory applications transparent access to a cluster-wide, low-latency memory pool without any additional binary or operating system interfaces.

3. Improved Migration: to reduce the amount of resident main memory when the time comes to migrate individual virtual machines across the network. Due to time constraints, this feature has not been implemented, but CIVIC is designed for it.


The motivation for this work derives directly from the last few chapters: now that we have systems for both distributing and migrating individual page frames, the final step is to fully exploit virtualization technology to support VMs with more transparency and ubiquity - to let unmodified, commodity operating systems access a (potentially) unlimited memory resource backed by an entire cluster. The end goal of this chapter is to build a system underneath any commodity OS that gives a systems programmer arbitrary access to individual page frames located anywhere in the cluster, so that new VM memory management techniques can be designed with ease and efficiency. The CIVIC system does just that: it transparently allows a virtual machine to oversubscribe (or overcommit) its physical memory space without any participation from the VM operating system whatsoever. Any memory beyond the local cache is then paged out; in our case, it is paged out to MemX.

6.1 Introduction

One of the great rules in system design, applied frequently (and often implicitly) across computer science, is that if a piece of data is likely to be used again in the future, it usually pays to go out of your way to design your algorithm or data structure to cache or preserve that data. It is remarkable how often that rule shows itself. The transparency afforded to VMs by hypervisors provides good opportunities to exploit caching: virtual machines are almost entirely unaware that their low-level view of physical memory is being "toyed" with in significant ways. So, in order to achieve the kind of memory ubiquity that we described, we propose to combine the ability to do more fine-grained caching underneath VMs with the ability to virtualize cluster-wide memory (which was covered in earlier chapters). With CIVIC, hosts in the cluster cooperate with each other to transparently support VMs whose physical memory footprints can span multiple machines in the cluster. To re-iterate: CIVIC is not a Distributed Shared Memory (DSM) system. There are already two hypervisor-level DSM attempts: one by Virtual Iron in 2005 [12] and one at Open Kernel Labs in 2009 [69].


The purpose of those systems is to build a single-system image (SSI). Building an SSI is not the focus of this dissertation. Rather, our goal is to allow unmodified virtual machines to gain access to cluster memory, enabling greater VM consolidation and better migration performance rather than spreading processing out into the cluster. Thus, VMs in our work use only local processors. A simple view of CIVIC's role is that it does for VMs precisely what modern operating systems already do for processes in their virtual memory sub-systems: give a running process (nearly) unlimited access to virtual memory. The OS has a well-established mechanism for multiplexing virtual to physical memory accesses - the page table. We leverage a similar mechanism to manipulate a VM's view of physical memory, namely the "pseudo-physical" address space, hereafter referred to as the PPAS. (The "real" address space seen by the processor is correspondingly referred to as the RAS.) The hypervisor undertakes the responsibility of mapping pages in the PPAS to pages in the RAS. Technically, one could use a disk-based swap device to page the unused portions of the PPAS in and out, but that would lead to a significant slowdown in VM performance, as we have explored extensively in this dissertation. Instead, we use MemX to expand a VM's PPAS into the cluster-wide memory pool, minimizing the performance impact that a disk would otherwise incur, without changing the operating system at all. The hypervisor plays the role of an intermediary by (1) providing the VM with the view of an expanded PPAS, (2) intercepting memory accesses by the VM to non-resident PPAS pages, and (3) efficiently redirecting these memory accesses for servicing by MemX, which executes in a separate virtual machine.
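The indirection at the heart of this arrangement can be pictured with a small sketch. The fragment below is illustrative only, not CIVIC source code; the structure layout and the memx_fetch() hook are hypothetical names used to show how a PFN lookup either resolves to a resident machine frame or is redirected to MemX.

    #include <stdbool.h>

    struct ppas_entry {
        unsigned long mfn;   /* machine frame backing this PFN (valid when resident) */
        bool resident;       /* is the page currently cached in the RAS?             */
    };

    /* Hypothetical hook that asks MemX to bring a non-resident page in and
     * returns the machine frame it now occupies in the local cache.         */
    extern unsigned long memx_fetch(unsigned long pfn);

    /* Translate a guest PFN to an MFN, redirecting to MemX when necessary.  */
    unsigned long ppas_to_mfn(struct ppas_entry *ppas, unsigned long pfn)
    {
        struct ppas_entry *e = &ppas[pfn];

        if (!e->resident) {
            /* The page lives in the cluster-wide pool; fetch it via MemX.   */
            e->mfn = memx_fetch(pfn);
            e->resident = true;
        }
        return e->mfn;
    }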

6.2 Design

The design of CIVIC depends heavily on the virtualization platform, which in our case is Xen. Although we have covered the design of Xen frequently in previous chapters, none of those systems operated strictly at the hypervisor level. This requires a brief discussion of the hypervisor's memory management schemes, including memory allocation and shadow paging. After this discussion, we present the design choices for CIVIC within the hypervisor itself and its interactions with higher-level services, followed by implementation-specific details.

6.2.1 Hypervisor Memory Management

VM memory management is fairly straightforward, with an extra level of indirection through the PPAS. This address space sits between the virtual address space and the real physical address space (RAS) seen by the processor. Since the processor is no longer owned by a single operating system, this extra level allows multiple PPASes to be multiplexed on top of a single RAS. From here on, the frame numbers associated with the PPAS (in Xen terminology) are called "pseudo-physical" frame numbers, or PFNs. Similarly, real frame numbers are called "machine" frame numbers, or MFNs. PFNs are contiguously numbered, whereas the MFNs allocated to a VM in the RAS are almost guaranteed to be sparse. In modern VM technology, there are three ways to manage the PPAS:

1. Para-virtualization: A para-virtual VM (or guest) is one that has been modified so that the VM is aware of the hypervisor. It has been patched directly to inform the hypervisor explicitly whenever it intends to update any page table it owns. In such a guest, the OS maps page frames using machine frame numbers (MFNs) and has no actual concept of the PPAS (except for memory allocation and VM migration, discussed in the last chapter). Thus, the frame identifiers in a para-virtual guest's page table entries are the same ones seen by the processor. This has performance advantages because the guest OS can "batch" a number of page table updates in one hypercall (but only up to a limit, as we will see in option #3). Para-virtual support has recently been merged upstream into both Linux and Windows, which mitigates some of the transparency problems of maintaining compatibility with newly released operating system versions. Thus, para-virtualization is no longer a technological obstacle.

2. Shadow-paging: When modifying the guest is unacceptable (for older OS kernels), the hypervisor no longer places real MFNs into guest OS page tables. Instead, "pseudo" PFNs are used, so that the guest's page tables map virtual page numbers to PFNs. The hypervisor then traps write accesses to those tables (using CR3 register virtualization and by marking them read-only) while maintaining another set of "shadow" page tables underneath the virtual machine that map virtual page numbers to MFNs. These shadow page tables are the ones exposed to the processor. Thus, memory virtualization and device emulation can be done for arbitrary, unmodified operating systems. When this kind of memory management is used, we refer to the guest OS as a hardware virtual machine, or "HVM", as opposed to a para-virtual guest. Shadow paging is elaborated in Section 6.2.2.

3. Hardware-assisted Paging: This approach improves on shadow-paging by moving the shadow-paging translation logic from the hypervisor into the processor. Essentially this is an MMU expansion - making the MMU do a little more of what it is already doing. With this support, it is no longer necessary to trap into the hypervisor as frequently, allowing page-fault exceptions to be delivered directly to the guest OS. Such guests are also called HVM guests, with the internal distinction of hardware-assisted paging.

As of this writing, CIVIC depends exclusively on the hypervisor's ability to perform shadow-paging for unmodified HVM guest operating systems. The most basic ability required by CIVIC is to both create and intercept page-fault exceptions - exceptions that would not normally be seen by the guest OS - before they are propagated to the guest virtual machine. An unmodified HVM running on top of a CIVIC-enabled hypervisor with hardware-assisted paging (instead of shadow-paging) would require additional logic to force the processor to trap into the hypervisor during such CPU exceptions when a page is owned by CIVIC (a non-resident page frame). So, as of this writing, CIVIC depends on shadow paging alone, without the assistance of hardware-assisted paging. Section 6.4 describes how the use of shadow paging affects the baseline performance of a virtual machine running on top of a CIVIC-enabled hypervisor.

6.2.2 Shadow Paging Review

Next, we elaborate on shadow paging and some of the common Xen-specific data structures. All of the machines in our particular Xen cluster are 64-bit, so this discussion assumes that our HVM guests are also 64-bit virtual machines, requiring a standard 4-level page table hierarchy. When we say "L1" page tables, we mean the standard definition in which pointers to data pages are contained at the lowest level of the hierarchy (the leaves), and the root of the page table is at level L4. Every L4 table is pointed to by Control Register #3, or CR3, sometimes called the page-table base pointer. As usual, for any given process running on the CPU, the value of CR3 points to the root L4 table of a single process at a time - or to the kernel's page tables. A "resident" page table entry (PTE) at any level of the page table hierarchy is one whose lowest-order bit is set, indicating that the page beneath it (either data or another page table) is actually sitting in memory somewhere. During the shadow paging process, three things can happen:

1. Shadow-walk: The MMU, with access to a virtualized CR3 base pointer, attempts to walk the shadow page table hierarchy of a particular virtual machine. For every HVM page table, there is a corresponding shadow page table at each level of the hierarchy. If the MMU does not find a shadow PTE, a trap into the hypervisor occurs.

2. Guest-walk: The hypervisor then performs a manual walk of the real HVM tables, starting at what the HVM believes is the true CR3 base pointer. If the hypervisor finds the appropriate PTE, the page table is copied into the shadows and control returns to the CPU for that virtual machine.

3. Guest-walk-miss: Otherwise, a missing PTE in the guest signifies a real CPU exception and the fault is propagated to the HVM. At that point, it is the HVM's responsibility to service the fault and proceed as normal.

Furthermore, during the shadow-paging process, Xen employs upwards of a dozen "shadow optimizations" on top of this basic design to speed up memory access latency when going through the shadows, with respect to Windows virtual machines, HVMs, and more. For the current version of CIVIC, these optimizations are disabled; doing so was necessary to get an initial version of CIVIC working. Future versions of CIVIC could be made to take advantage of them. Thus, the rest of this chapter discusses our implementation under the assumption that these optimizations are disabled. This assumption also constitutes our base case for benchmarking during our evaluation.
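To make the three cases concrete, the following minimal sketch shows the shape of a shadow-fault handler. It is not Xen source code; guest_walk(), shadow_install(), and inject_guest_fault() are hypothetical stand-ins for the real hypervisor machinery.

    #include <stdbool.h>
    #include <stdint.h>

    struct vcpu;                                     /* opaque per-VCPU state */

    /* Hypothetical helpers standing in for the real hypervisor machinery.   */
    extern bool guest_walk(struct vcpu *v, uint64_t va, uint64_t *gfn_out);
    extern void shadow_install(struct vcpu *v, uint64_t va, uint64_t gfn);
    extern void inject_guest_fault(struct vcpu *v, uint64_t va);

    /*
     * Called when the MMU misses in the shadow tables (case 1, "shadow-walk")
     * and traps into the hypervisor.
     */
    void shadow_fault(struct vcpu *v, uint64_t va)
    {
        uint64_t gfn;

        if (guest_walk(v, va, &gfn)) {
            /* Case 2, "guest-walk": the guest's own tables hold a valid PTE,
             * so copy the mapping into the shadows and resume the VCPU.      */
            shadow_install(v, va, gfn);
        } else {
            /* Case 3, "guest-walk-miss": a genuine fault for the guest OS.   */
            inject_guest_fault(v, va);
        }
    }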

6.2.3 Step 1: CIVIC Memory Allocation and Caching Design

Figure 6.1 illustrates how memory is allocated to virtual machines in a typical virtualization architecture. Each VM gets a statically-allocated region of physical memory on the host (modulo ballooning). During normal operation, the size of the PPAS for each virtual machine does not change. Any number of VMs (limited by the amount of memory available) can be created by the administrator side by side, and the OS of each virtual machine manages the contiguous PPAS given to it without interruption. In this default design, if an operating system places a reference (a PFN) into one of its page tables for a page that it expects to be physically resident in memory, then that page will be there - no questions asked. All current VM technology works this way (except for our previous VM migration work in the last chapter, where dynamic ballooning is used). In Figure 6.1 we have four virtual machines, three of which are HVMs and one of which is para-virtualized. Regardless, the PPAS of all four virtual machines is static: from the moment the VMs are booted up to the time they shut down, their PPAS is fixed.

Figure 6.1: Original Xen-based physical memory design for multiple, concurrently-running virtual machines.

Figure 6.2: Physical memory caching design of a CIVIC-enabled hypervisor for multiple, concurrently-running virtual machines.

CIVIC relaxes the assumption that a page actually exists when the VM asks for it. The first step in designing CIVIC involves taking the unmodified operating system of an arbitrary virtual machine and growing its PPAS by some amount. We then add another level of indirection within the hypervisor that recognizes this expanded PPAS (by intercepting accesses through shadow-paging). Figure 6.2 illustrates how the hypervisor's memory allocation strategy has been modified in a CIVIC-enabled hypervisor. VMs #2 and #3 get a statically-allocated cache in which only a subset of their total PPAS is actually resident; the rest is out on the network. Hits in the cache are served from the RAS, whereas accesses to the rest of the PPAS go to the network. Note the difference between HVM #2 and HVM #3: the PPAS of an unmodified virtual machine need not be larger than the RAS of the physical host. This gives the administrator a choice: either grow the PPAS to be very large, or simply provide higher levels of consolidation by running more VMs on one host. However, the PPAS should be at least as large as the cache that CIVIC provides to it; it cannot be smaller, or that would preclude the need for CIVIC. Notice that Figure 6.2 also has two simultaneously running para-virtualized VMs: the current CIVIC implementation supports multiple PPAS strategies and does not require any VM to use CIVIC. One may choose to grow the PPAS of a virtual machine at boot time or leave it unchanged in its default mode of operation.

Figure 6.3 demonstrates the operation of an example CIVIC cache underneath an HVM. This HVM has three working sets (perhaps from three different processes, or three different data structures within one process). The figure represents the common case, where the cache is fully populated with accessed memory. In this example, two of the working sets are in the cache, and a page fault to frame #6 occurs in the {4, 5, 6} set. Since the {8, 9} set is older according to the FIFO, frame #9 is evicted to MemX. An old copy of page #9 may or may not already exist on MemX, but it will likely be there if the HVM has been running for a long time. The next section uses the same HVM to describe the hypervisor-level interactions between the cache and MemX.

Figure 6.3: Illustration of a full PPAS cache. All page accesses in the PPAS must be brought into the cache before the HVM can use the page. If the cache is full, an old page is evicted from the FIFO maintained by the cache.
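As a concrete illustration of the eviction behavior described above and in Figure 6.3, the following is a minimal sketch of a fixed-size PPAS cache with FIFO replacement. It is not CIVIC source code; memx_read(), memx_write(), and the structure layout are hypothetical names, and the four-slot size simply mirrors the figure.

    #define CACHE_SLOTS 4                       /* the Figure 6.3 example uses 4 slots */

    extern void memx_read(unsigned long pfn);   /* fetch page contents from MemX   */
    extern void memx_write(unsigned long pfn);  /* push an evicted page to MemX    */

    struct ppas_cache {
        unsigned long slot[CACHE_SLOTS];        /* resident PFNs, in FIFO order    */
        int head;                               /* oldest entry / next free slot   */
        int count;                              /* number of occupied slots        */
    };

    /* Handle a fault on a PFN that is not resident in the cache.                  */
    void cache_fault(struct ppas_cache *c, unsigned long pfn)
    {
        if (c->count == CACHE_SLOTS)
            memx_write(c->slot[c->head]);       /* full: evict the oldest page     */
        else
            c->count++;

        memx_read(pfn);                         /* bring the faulting page in      */
        c->slot[c->head] = pfn;                 /* it becomes the newest entry     */
        c->head = (c->head + 1) % CACHE_SLOTS;
    }

Calling cache_fault() on frame #6 with a full cache holding {5, 4, 8, 9} evicts frame #9 (the oldest) and leaves {6, 5, 4, 8}, matching the figure.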

6.2.4 Step 2: Paging Communication and the Assistant

In the modern virtualization stack, the devices and drivers that service popular devices for virtual machines are typically bundled into a VM commonly called "Domain 0", or "Dom0" for short. We will refer to this VM only when necessary to acknowledge its presence. During runtime this VM always exists; it typically hosts various drivers, has direct access to the corresponding devices, and acts as a relay for co-located virtual machines. There is a movement to break away from this unified, "monolithic" design, and CIVIC follows that philosophy [45]. Dom0 is not only a single point of failure during the development process, but also a performance bottleneck for the hypervisor's CPU scheduler, due to the fact that all I/O must go through Dom0 while the dependent VMs block.


Thus, CIVIC introduces a second VM that assists exclusively in the paging process and nothing else. We refer to this domain as the "Assistant". Figure 6.4 illustrates the host-level internal design of CIVIC and the placement of the Assistant within the virtualization stack. Observe that Dom0 still exists, but it is very thin: it still hosts device drivers for all of the virtual machines running on the host, while the Assistant handles the most time-critical component of the modified virtualization stack - transferring pages into and out of the PPAS. Another motivation behind this design is that the machines in our cluster have two network interfaces, so we can dedicate one interface to the Assistant and one to Dom0. Dom0 still handles regular network traffic for individual virtual machines. Thus, the Assistant can be scheduled independently by the hypervisor and context-switches into Dom0 much less often.

Figure 6.4: Internal CIVIC architecture. An Assistant VM holds two kernel modules responsible for mapping and paging HVM memory: one module directly (on demand) memory-maps portions of PPAS #2, while MemX performs the I/O. A modified, CIVIC-enabled hypervisor intercepts page-faults to shadow page tables in the RAS and delivers them to the Assistant VM. If the HVM cache is full, the Assistant also receives victim pages.

The final piece in Figure 6.4 is the design of the page-delivery and page-fault communication paths. The sequence of steps taken by CIVIC at this level is as follows:

1. When the host first boots up, Dom0 starts the Assistant.

2. If there are any non-CIVIC-dependent VMs, they can be started simultaneously as well.

3. Next, one or more HVMs are created. Almost immediately, they begin filling up their respective caches.

4. At some point during runtime, a shadow-level page-fault exception occurs for a page that is not in the cache. A page is allocated in the cache for the missing HVM page.

5. The CIVIC-enabled hypervisor causes the faulting HVM to block by de-scheduling the virtual CPU that caused the fault.

6. The hypervisor puts an entry into a piece of memory that is shared with the Assistant and delivers a virtual interrupt to the Assistant. If the cache for the faulting HVM is full, a victim is chosen from the cache and one or more additional entries are put into the shared memory.


7. The mmap() module in the Assistant receives the interrupt through a kernel-level interrupt handler and proceeds to memory-map the faulting pages and victim pages on demand.

8. The mmap() kernel module submits one or more I/O operations to the MemX client kernel module, which then uses the RMAP protocol to read or write the corresponding page frames to and from the network.

9. When the I/O is complete, the Assistant invokes a CIVIC-specific hypercall to notify the hypervisor that the fault exception has been fixed up.

10. The hypervisor un-blocks the faulting virtual CPU and schedules it for execution, and the HVM continues until the cycle repeats itself.

All in all, there are four primary sources of latency in the path of an individual page frame: (1) the virtual IRQ notification to the Assistant, (2) the time it takes for MemX to store or retrieve pages over the network, (3) the time it takes to fix up the exception and re-schedule the virtual machine after the Assistant notifies the hypervisor, and (4) the time it takes to evict pages out of the PPAS cache.

One additional thing to note regards the design of the mmap() module inside the Assistant. This kernel module is responsible for directly mapping page frames located within the cache of a CIVIC-dependent HVM guest (in order to hand them over to MemX). Recall that the page numbers within the virtual machine's PPAS are contiguous: all page frame numbers in the PPAS are sequentially chosen at startup and do not change (unless ballooning is activated). At first glance, one might simply memory-map the entire PPAS of the virtual machine during startup and get rid of this module altogether. This is not possible because, even though the PPAS is contiguous, the frames backing the PPAS are not all resident - only a subset of the PPAS is actually in the cache. As a result, at any given time during the execution of an HVM, pages are being evicted from and re-populated into the HVM's cache. When pages are re-populated, new RAS-level frame numbers (MFNs) are chosen. There is no guarantee that the MFN for the corresponding PFN (in the PPAS) is the same as it was before the page was evicted; in fact, we can almost guarantee that it will not be the same frame number.


Thus, if we were to memory-map the entire PPAS at the beginning, the majority of those mappings would be invalidated soon after, as each page in the PPAS is victimized. So, the mmap() module in the Assistant maps those pages on demand at fault time. We have optimized this module to batch as many pages as possible when this occurs (since these mappings require an additional hypercall to complete). There are two opportunities for batching memory mappings: first, any N faults on N virtual CPUs (across all HVMs) can be batched simultaneously; second, all pages that are pre-fetched into the cache (which we discuss later) can also be batched. This allows us to perform the mapping very quickly, with little overhead during the paging process.

At the cluster level, we employ MemX, as described in Chapter 4. MemX is a kernel-to-kernel distributed memory system designed for low-latency memory access. The same MemX kernel module is loaded into the Assistant and automatically detects available memory servers in the cluster. Once a CIVIC host is up and running, the administrator can choose any cluster design they like, such as the one illustrated in Figure 6.5. In this example, we have a cluster of virtual machines, where Hypervisors B and C host MemX servers. MemX is flexible enough that it can be loaded anywhere; the servers need not be virtualized at all, but we illustrate them this way for completeness.

Figure 6.5: High-level CIVIC architecture. Unmodified, CIVIC-enabled HVM guests have local reservations (caches), while small or large portions of their memory actually expand out to nearby hosts.
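Putting steps 7 through 9 together, the Assistant-side handling might take roughly the following shape. This is an illustrative sketch, not the Assistant's actual kernel modules; civic_next_desc(), mmap_hvm_page(), memx_io(), and civic_complete_hypercall() are hypothetical names standing in for the corresponding pieces.

    struct civic_desc;                            /* shared-memory work item       */

    extern struct civic_desc *civic_next_desc(void);       /* NULL when drained   */
    extern void *mmap_hvm_page(struct civic_desc *d);      /* on-demand mapping   */
    extern void  memx_io(struct civic_desc *d, void *page);/* read or write page  */
    extern void  civic_complete_hypercall(void);           /* wake faulting VCPU  */

    /* Virtual-IRQ handler: drain the shared descriptor area, then notify Xen.    */
    void civic_virq_handler(void)
    {
        struct civic_desc *d;

        while ((d = civic_next_desc()) != NULL) {
            void *page = mmap_hvm_page(d);        /* batched in the real system   */
            memx_io(d, page);                     /* fault-in or evict via MemX   */
        }
        civic_complete_hypercall();               /* step 9: fault is fixed up    */
    }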

6.2.5 Future Work: Page Migration, Sharing, and Compression

This design of CIVIC has many areas of possible improvement. In the next chapter, we discuss several interesting new systems that can be built on top of CIVIC. Here, we discuss some of the more obvious improvements to the base system implemented in this dissertation.

At some future time, one or more VMs will run a large-memory application and become active. At that point, the collective unused memory of various nodes becomes a partial global cache for the pages of the active node. (Caching nodes may themselves become active as well.) This next stage for CIVIC might involve migrating globally cached pages over the network into global caches on other hosts, should the caching node need that space for its own local cache. There are a handful of algorithms to support this type of page migration, involving both greedy approaches [31] and approaches that use hints about access behavior [86]. These approaches are typically applied in the context of a file system, under the name "cooperative caching", and some of these eviction techniques could be applied to virtual machine memory just as well; a survey can be found in [82]. This kind of caching allows neighboring nodes to cache potentially stale page frames in local memory. The intelligence in such a system lies in a coordinated algorithm that allows multiple nodes to decide which pages to keep in the global cache and which pages to evict. Example systems from the 1990s include ds-RAID-x [55], TickerTAIP [25], xFS [16], and Petal [61].

Page-granular cache migration. Consider the case of a single oversubscribed HVM guest (A), as depicted in Figure 6.6, a portion of whose physical memory is backed by a remote MemX server on another host across the network. We propose the following multi-stage path for an individual page frame:

1. VM to local: A third party (such as the Assistant) decides to evict a used page out of the VM's cache into a local cache. Such a local cache does not exist in CIVIC right now, but it could easily be implemented by placing a central, node-local cache within the Assistant itself to hold evicted pages from the PPAS caches of individual HVMs.


2. Local to global: Based on recency information, the Assistant-cached page can then be evicted into the global cache on another physical host, simply by pushing it out through the MemX client.

3. Global to global: Next, based on the page-migration heuristics mentioned in related work, the system may move the page from one global cache to another, depending on how it ages. This is already partially supported by MemX through existing support we wrote that allows MemX servers to shut themselves down for maintenance by re-distributing their memory to nearby servers.

4. Global to backing store: Should the sum total of all of the local and global caches exhaust the available physical memory in the cluster, a third-party disk (either centralized or distributed) should be available as a backing store.

5. Page Fault: Eventually, a page fault will occur on a cached page, at which point the page must be located and brought back into the VM's physical address space. When this happens, the Assistant would be responsible for invoking MemX to bring the page back in from one or more caches.

Figure 6.6: Future CIVIC architecture. A large number of nodes would collectively provide global and local caches. The path of a page could exhibit multiple evictions, from Guest A to a local cache to a global cache; furthermore, a global cache can be made to evict pages to other global caches.


In fact, a page-fault can happen on a page that has been evicted to nearly any level of this caching hierarchy. Recent work related to the local cache component of CIVIC was published in [79], where the authors perform "hypervisor caching": they evaluate the efficiency and performance of using the hypervisor to present a local pool of reserve pages to the VM. One could think of CIVIC as an extension of this work into the cluster. This completes the basic operation of CIVIC.

Page Sharing and Compression. The most ideal, cluster-level use of CIVIC would employ two techniques for reducing data duplication throughout the cluster. The first is content-based page sharing, initially proposed in the Disco system [23] and used in VMware's ESX server [96] as well as in the Difference Engine [33] and the Satori project [43]. These systems allow multiple virtual addresses (or virtual page frame numbers, in the context of VMs) to refer to the same physical location in memory, and report reductions in memory utilization of up to 40%. This is not surprising in the context of VMs, because even an under-provisioned physical server with a handful of guest VMs will hold many copies of the same binary executables, common libraries, and potentially similar parts of network filesystem data that gets cached on access. In the context of CIVIC, the opportunities for page sharing increase even further due to our collective use of indirect caching across multiple hosts. Furthermore, recent work from 2005 [95] provides a compelling case for the advantages of compressing physical page frames in an operating system. We propose to apply these techniques at the VM level in combination with CIVIC.
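For intuition, content-based page sharing amounts to finding byte-identical frames and collapsing them into one shared, copy-on-write machine frame. The fragment below is a minimal illustration and is not taken from ESX, the Difference Engine, or Satori; the hash choice and function names are assumptions.

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* A simple (non-cryptographic) content hash; real systems also compare
     * the full page contents before sharing, to rule out hash collisions.    */
    static uint64_t page_hash(const uint8_t *page)
    {
        uint64_t h = 1469598103934665603ULL;         /* FNV-1a offset basis   */
        for (size_t i = 0; i < PAGE_SIZE; i++) {
            h ^= page[i];
            h *= 1099511628211ULL;                   /* FNV-1a prime          */
        }
        return h;
    }

    /* Returns non-zero if the two frames are byte-for-byte identical and can
     * be collapsed into a single shared, copy-on-write mapping.              */
    static int pages_shareable(const uint8_t *a, const uint8_t *b)
    {
        return page_hash(a) == page_hash(b) && memcmp(a, b, PAGE_SIZE) == 0;
    }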

6.3 Implementation

Here we describe the low-level hurdles that were overcome to get CIVIC working and integrated with MemX. In the next chapter, we provide a better outline of the missing features that could be expanded into more wide-spread systems. All in all, the CIVIC code base comprises about 5000 lines of code in the Xen hypervisor, which effectively doubles the size of the entire MemX code base. It requires no patches whatsoever to the Dom0 kernel or the HVM guest. About 400 lines of new code support CIVIC in the Xen Daemon; the rest of the code is entirely inside the hypervisor. We implemented the system within Xen 3.3.0. The Assistant runs (para-virtualized) XenLinux 2.6.18 as usual, but the HVM guests in the performance section are completely unmodified: they run out-of-the-box Fedora Core 10 installations. We have also run OpenSolaris HVM guests. Microsoft Windows will actually start up, but due to some shadow-paging related bugs in the hypervisor, it stops at the login screen. All of the HVM guests in this chapter use a single virtual CPU; due to time and manpower constraints, an SMP implementation of CIVIC is not yet complete. Currently, the system is fully implemented except for the more forward-looking features described in the previous section's future work.

6.3.1 Address Space Expansion and BIOS Tables

In order to transparently provide an "oversubscribed" view of the PPAS to the virtual machine when it first boots up, we must "lie" about the actual amount of DRAM that the HVM thinks is available, by significantly expanding the size of the PPAS while keeping the size of the RAS cache small. With modern hypervisor technology there are multiple ways to lie to virtual machines, but none of them are completely transparent to HVM operation, and none of them allow the size of the RAS to differ from the size of the PPAS. One way to partially expand the PPAS is ballooning, which we discussed in Chapter 5. Ballooning allows one to increase and decrease the physical memory of an HVM, but it still requires that an equivalent amount of actual DRAM inside the RAS be statically mapped to that memory. Another way to expand the PPAS is memory hot-plugging: some operating systems are capable of receiving ACPI upcalls when new DRAM becomes available, and can subsequently add the new DRAM to the kernel and make it available for allocation by processes. A summary of how this could be supported in Xen can be found in [88]. Similarly, memory must be removed in a physically contiguous manner within the PPAS. Both of these solutions, however, require direct participation of, and modification to, the virtual machine.

In order to avoid these difficulties, CIVIC instead oversubscribes the PPAS at boot time of the HVM. Normally, a physical machine determines the amount of available DRAM by reading a page of memory populated by the BIOS during boot-up, called the "e820" page. This page contains a list of the usable and reserved areas of physical memory that the operating system must manage. After the BIOS populates this list, the OS reads it and initializes its own data structures during start-of-day, before other processes in the system begin to run. For a virtual machine there is no longer a physical BIOS but a virtual one, so the e820 page containing the list of usable memory ranges is virtualized. CIVIC constructs the e820 page with an artificial list of memory ranges that are available at the time the HVM is started, while taking into consideration how much memory is actually available on the host. Normally, the amount of memory listed in the e820 page is equal to the cache size that Xen allocates for the HVM (meaning that what would have been the RAS cache in our system is normally just a flat RAS equal in size to the PPAS). To oversubscribe the HVM, we patch the Xen Daemon to increase the size of the usable memory ranges in the e820 page, based on an additional parameter in the HVM's guest configuration file. This modification immediately takes effect for any kind of operating system, since reading the e820 map is a standard requirement for an OS to boot. This is how CIVIC bootstraps the oversubscription process, and it works quite well. Furthermore, the semantics of memory seen by the hypervisor are preserved, including the initial pre-allocation of memory: the Xen Daemon instructs the hypervisor to allocate only as much memory as the configuration file specifies for the cache. The HVM then proceeds to boot up and begins filling its cache as it faults on pages that are not yet cached.
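To make the idea concrete, the following sketch shows the shape of an inflated e820 map that advertises the full PPAS rather than the cache size. This is not the actual Xen Daemon patch; the function name, the constants, and the simplified two-entry layout are assumptions for illustration.

    #include <stdint.h>
    #include <stddef.h>

    #define E820_RAM       1                    /* usable RAM, per the e820 convention */
    #define E820_RESERVED  2

    struct e820_entry {
        uint64_t addr;                          /* start of the region (bytes)   */
        uint64_t size;                          /* length of the region (bytes)  */
        uint32_t type;                          /* E820_RAM, E820_RESERVED, ...  */
    };

    /*
     * cache_bytes: real memory (the RAS cache) actually allocated to the HVM.
     * ppas_bytes:  oversubscribed PPAS size advertised to the guest OS.
     * Only ppas_bytes is visible to the guest; CIVIC pages the rest to MemX.
     */
    static size_t build_oversubscribed_e820(struct e820_entry *map,
                                            uint64_t cache_bytes,
                                            uint64_t ppas_bytes)
    {
        size_t n = 0;

        if (ppas_bytes < cache_bytes)           /* the PPAS must cover the cache */
            ppas_bytes = cache_bytes;

        /* Conventional usable region below 640KB (details left to the vBIOS).  */
        map[n++] = (struct e820_entry){ 0x0, 0xA0000, E820_RAM };

        /* Advertise the full oversubscribed PPAS above 1MB, not the cache size. */
        map[n++] = (struct e820_entry){ 0x100000, ppas_bytes, E820_RAM };

        return n;                               /* number of entries written     */
    }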

6.3.2 Communication Paths

The hypervisor's primary interactions during the slow path for page-fault handling and cache evictions are with the Assistant VM. In general, there are two ways to communicate between the hypervisor and a para-virtualized VM such as the Assistant or Dom0:

• VM-to-hypervisor, hypercalls: The hypercall API available to a VM is vast. Hypercalls can be invoked by any code running with kernel-level privileges in a guest virtual machine.

• Hypervisor-to-VM, virtual IRQs: These IRQs are the virtualized equivalent of real ones, with the exception that a few "new" IRQs are introduced by the hypervisor for other purposes, such as alternate consoles, VM-to-VM communication, inter-processor messages, and more.

CIVIC uses a combination of both when talking to the Assistant. During start-of-day, the Assistant is asked to set up a common piece of shared memory for each HVM guest. This memory is shared only between the Assistant and the hypervisor and stores a fixed number of descriptors that indicate which pages are to be evicted from or faulted into the cache. At the moment, this memory holds around 2048 descriptors, using 32 kilobytes of memory. Through empirical experimentation, this proved to be a sufficient number of descriptors to maintain maximum throughput on a switched Gigabit Ethernet network. CIVIC manages this memory using a one-way, half-duplex producer-consumer relationship: the hypervisor is the sole producer of descriptors, and the Assistant is the only consumer - it does not pro-actively bring pages into or out of the HVM's cache unless instructed to do so by reading a descriptor from shared memory. We later experimented with a circular-ring model, expecting added concurrency from allowing the Assistant to asynchronously remove descriptors from shared memory while adding completed descriptors back onto the ring after MemX had completed the I/O. Because paging is only half-duplex, this proved to be equivalent to the simpler initial implementation, thanks to our use of prefetching, described in the next section. Although a complete proof would be required, we empirically observed no difference between an asynchronous, circular request/response notification and a synchronous one, so we stuck with the synchronous model. If CIVIC were to allow new pages into the HVM cache that it did not itself initiate (say, in our future-work design where other hosts could independently initiate global-to-global page cache transfers), then an asynchronous model would be mandatory. This is similar to the way the netfront/netback asynchronous rings work in Xen today, discussed in the last chapter, since the network has a natural full-duplex relationship in which the receiver on the other end of a socket can send data at the same time as the sender. Thus, for CIVIC, once the hypervisor has added descriptors to the shared memory, a virtual IRQ is delivered to the Assistant; once the Assistant is done, a hypercall is made to the hypervisor to signal completion of those descriptors. With more manpower, a fully asynchronous design could be implemented in the future, for example if we were to move to a 10-Gigabit Ethernet network.
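The shared memory just described can be pictured with the following sketch. It is illustrative only, not the CIVIC source: the structure layout and field names are assumptions, with the 2048-entry count chosen to match the roughly 32 kilobytes quoted above (16 bytes per descriptor).

    #include <stdint.h>

    #define CIVIC_NR_DESCS 2048

    enum civic_op {
        CIVIC_FAULT_IN  = 0,          /* bring this PFN into the HVM's cache   */
        CIVIC_EVICT_OUT = 1,          /* write this victim PFN out to MemX     */
    };

    struct civic_desc {
        uint64_t pfn;                 /* page in the HVM's PPAS                */
        uint32_t op;                  /* enum civic_op                         */
        uint32_t mfn;                 /* cache slot backing the page, if any   */
    };

    struct civic_shared {
        uint32_t count;               /* descriptors queued by the hypervisor  */
        struct civic_desc desc[CIVIC_NR_DESCS];
    };

    /*
     * Hypervisor side: queue one work item.  When the batch for this fault is
     * complete, the hypervisor raises a virtual IRQ; the Assistant consumes
     * all 'count' descriptors and then signals completion with one hypercall.
     */
    static int civic_queue(struct civic_shared *sh, uint64_t pfn,
                           uint32_t op, uint32_t mfn)
    {
        if (sh->count >= CIVIC_NR_DESCS)
            return -1;                /* no room: caller must flush first      */

        sh->desc[sh->count++] = (struct civic_desc){ .pfn = pfn, .op = op,
                                                     .mfn = mfn };
        return 0;
    }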

6.3.3 Cache Eviction and Prefetching

Central to maintaining HVM performance is CIVIC's ability to manage the PPAS caches scalably. Some papers mentioned earlier [79, 44] present approaches to recency detection, though not obvious ones: they use a para-virtual approach. The basic idea is to modify the kernel in a minimal fashion so that a third party is notified when a page's allocation status changes (from used to free or back again). This could be useful for us: once a third party (say, the Assistant) chooses to evict a page from the VM, it could use the modified kernel interface to receive notifications of future deallocations of that page, letting the cache know when it is safe to discard the page once it is no longer used. There is only a single problem with the approach: pinning evicted pages for future page faults (or cache hits). Note that an evicted page is a used page; it must be pinned in such a manner that the cache is notified when the kernel needs the page back. [79] proposes the following: during eviction from the VM address space, the page is locked for I/O, meaning that from the kernel's perspective, someone else is using the page. After eviction from the PPAS cache, when the kernel needs the page back in the future it will probe the lock (which is generally a semaphore); attempts to acquire that lock are delivered to the hypervisor and the page is released from I/O status. For CIVIC, however, our goal is to maintain maximum transparency, and since we now have a page-fault interception system in place, we no longer need para-virtual support from the operating system.

Recency Detection. When it is time to choose victim pages, however, it is no longer simply a matter of maintaining one or more least-recently-used lists within the hypervisor and keeping them updated. In this dissertation we do not explore complex page-frame reclamation policies; there is a large body of literature that does this already. The replacement policy used by CIVIC at the moment is a simple first-in, first-out (FIFO) queue: when the cache is full, pages are evicted from the front of the FIFO, and when page-faults occur, newly allocated pages are placed at the end of the FIFO. A proper characterization of the eviction scheme required to improve upon a FIFO in the context of multiple, concurrently-running virtual machines will be mandatory in a future incarnation of CIVIC.

Prefetching. During implementation, one of the bigger obstacles to a stable implementation was "capacity" cache misses. These occur when the HVM is just booting up for the first time, or when the HVM has just forked off a large application that is (mostly) sequentially touching large amounts of memory at once. These two common usage scenarios required a basic prefetching implementation to be added to CIVIC. Pre-fetching (a survey of which can be found in [77]) and pre-paging (Chapter 5, on VM migration) are well-explored concepts in computer science. Our implementation is a simple stride-prefetching algorithm; we will not cover all four states of stride prefetching, but instead describe the pseudo-code of our algorithm for detecting page-fault behavior, shown in Figure 6.7. CIVIC's prefetcher maintains a "window size" of allowed page faults in powers of two, starting with an initial value of one. Each time the prefetching algorithm is invoked, we update the window based on the location of the last page-fault, to adapt to how large or small the current stride of pages actually is. During any given page-fault, we ask the following question:

    let miss_count := 0
    let window     := 1

    AdjustWindow(PFN)                         // PFN is in the PPAS
        let last_fault_pfn    := current_fault_pfn
        let current_fault_pfn := PFN
        let last_miss_count   := miss_count
        if PFN falls outside the window ending at (last_fault_pfn + window * 2)
            window = max(window / 2, 1)       // halve the window
            return 1                          // do not prefetch
        return 0                              // go ahead and prefetch

    PrefetchFault(PFN)
        let next_pfn := PFN + 1
        let space    := max_shared_memory - available_shared_memory
        if AdjustWindow(PFN) == 1             // adjust the window
            return
        while (space >= 2) && ((next_pfn - PFN)