Performance Implications of Virtualizing Multicore Cluster Machines

Adit Ranadive, Mukil Kesavan, Ada Gavrilovska, Karsten Schwan

Center for Experimental Research in Computer Systems (CERCS), Georgia Institute of Technology, Atlanta, Georgia 30332
{adit262, mukil, ada, schwan}@cc.gatech.edu

Abstract

High performance computers are typified by cluster machines constructed from multicore nodes and using high performance interconnects like Infiniband. Virtualizing such ‘capacity computing’ platforms implies the shared use not only of the nodes and node cores, but also of the cluster interconnect (e.g., Infiniband). This paper presents a detailed study of the implications of sharing these resources, using the Xen hypervisor to virtualize platform nodes and exploiting Infiniband’s native hardware support for its simultaneous use by multiple virtual machines. Measurements are conducted with multiple VMs deployed per node, using modern techniques for hypervisor bypass for high performance network access, and evaluating the implications of resource sharing with different patterns of application behavior. Results indicate that multiple applications can share the cluster’s multicore nodes without undue effects on the performance of Infiniband access and use. Higher degrees of sharing are possible with communication-conscious VM placement and scheduling.

Categories and Subject Descriptors: D.4.7 [Operating Systems]: Organization and Design; C.2.4 [Computer-Communication Networks]: Distributed Systems; C.5.1 [Computer System Implementation]: Large and Medium Computers

General Terms: Design, Performance, Management, Reliability

Keywords: Virtualization, High-performance Computing, Infiniband

1. Introduction

In the enterprise domain, virtualization technologies like VMWare’s ESX server [29] and the Xen hypervisor [3] are becoming a prevalent solution for resource consolidation, power reduction, and dealing with bursty application behaviors. Amazon’s Elastic Compute Cloud (EC2) [2], for instance, uses virtualization to offer datacenter resources (e.g., clusters or blade servers) to applications run by different customers, safely providing different kinds of services to diverse codes running on the same underlying hardware (e.g., trading systems jointly with software used for financial analysis and forecasting).


Virtualization has also been shown to be an effective vehicle for dealing with machine failures, for improving application portability, and for helping debug complex application codes. In high performance systems, research has demonstrated virtualized network interfaces [12], shown the benefits of virtualization for grid applications [1, 17, 20, 10, 23, 35], and argued for the utility of these technologies in attaining high reliability for large scale machines [26]. Furthermore, key industry providers of HPC technology are actively developing efficient, lightweight virtualization solutions, an example being the close collaboration of vendors of high performance IO solutions like Infiniband, such as Cisco and Mellanox, with representatives of the virtualization industry, including VMWare and Xen. Here, a key motivator is the importance of virtualization for the ‘capacity’ systems in common use in both the scientific and commercial domains, the latter including financial institutions, retail, telecom and transportation corporations, providers of web and information services, and gaming applications [25]. In fact, an analysis of the Top500 list identifies over 30 application areas, most of which do not belong to the category of traditional HPC scientific codes. Finally, when industry uses large scale HPC systems, now even including IBM’s Blue Gene, as platforms for ‘utility’ or ‘cloud’ computing [4, 7], virtualization makes it possible to package client application components into isolated guest VMs that can be cleanly deployed onto and share underlying platform resources.

Despite these trends, scientists running traditional high performance codes have been reluctant to adopt virtualization technologies. In part, this is because of their desire to exploit all available platform resources to attain the performance gains sought by use of ‘capability’ HPC machines. Perhaps more importantly, however, it is because resource sharing can degrade the high levels of performance sought by HPC codes. As a result, the extent to which virtualization technologies will be adopted in the HPC domain remains unclear [9, 11].

This paper contributes experimental insights and measurements to better understand the effects of resource sharing on the performance of HPC applications. Specifically, for multiple virtual machines running on multicore platforms, we evaluate the extent to which their communications are affected by the fact that they share a single communication resource, using an Infiniband interconnect as the concrete instance of such a resource. Stated more precisely, using standard x86-based quad-core nodes and the Xen hypervisor, we evaluate the degree of sharing possible via Infiniband under a range of platform parameters and application characteristics. The purpose is (1) to understand the performance implications and overheads of supporting multiple VMs on virtualized multicore IB platforms; (2) to explore the performance implications of different inter- and intra-VM interaction patterns on such platforms; and (3) to devise suitable deployment, co-location, and scheduling strategies for individual VMs on shared virtualized resources.

Experimental results presented in the paper demonstrate that a high level of sharing, that is, a significant number of VMs deployed on each node, is feasible without noticeable performance degradation, despite the fact that VM-VM communications share a single Infiniband interconnect. Further, sharing is facilitated by methods for VM deployment and scheduling that are aware of VMs’ communication behaviors (i.e., communication-awareness) and of the requirements on communications imposed by VMs (i.e., awareness of the Service Level Agreements (SLAs) sought by VMs). Technically, this involves (1) manipulating hypervisor-level parameters like scheduling weights, (2) carrying out service-level actions like mapping VMs’ QoS requirements to Infiniband virtual lanes, and (3) devising suitable system-level policies for VM migration and deployment. This paper lays the foundation for such future technical work by providing experimental insights into the bottlenecks such mechanisms will need to avoid and/or the performance levels they can be expected to deliver.

Remainder of paper. The remainder of the paper is organized as follows. Section 2 describes our experimental testbed and methodology. Sections 3 and 4 discuss the experimental results gathered with various VM loads and deployments and different inter- and intra-VM communication patterns, for native RDMA communication and MPI applications, respectively. A brief survey of related work and concluding remarks appear in the last two sections.

2. Testbed

All experimental evaluations are performed on a testbed consisting of two Dell PowerEdge 1950 servers, each with two quad-core 64-bit Xeon processors running at 1.86 GHz. The servers have Mellanox MT25208 HCAs, operating in MT23108 (Tavor) compatibility mode, connected through a Cisco/Topspin 90 switch. Each server runs the RHEL 4 Update 5 OS (paravirtualized 2.6.18 kernel) in dom0 with the Xen 3.1 hypervisor. The virtualized Infiniband implementation available on the XenSource site is based on Xen 3.0 with the BVT scheduler [31] and uses kernel sockets for the initial Infiniband split driver setup. Since this implementation does not scale well for multiple VMs, we changed the initial driver setup to be performed over Xenbus instead, and we ported the entire implementation to Xen 3.1 in order to analyze the new credit scheduler’s [30, 6] impact on Infiniband performance. The guests run the RHEL 4 Update 5 OS with paravirtualized kernels, and each guest is allocated 256 MB of RAM. For running Infiniband applications within the guests, OFED (Open Fabrics Enterprise Distribution) 1.1 [18] is modified to use the virtualized IB driver.

Microbenchmarks include the RDMA benchmarks from the OFED 1.1 distribution and the Presta MPI benchmark suite from Lawrence Livermore National Laboratory [19]. These permit us to evaluate the performance impact of executing multiple VMs on shared virtualized resources, for both native IB RDMA and MPI communications, as well as to consider various VM-VM interaction patterns. To run the Presta MPI suite, OpenMPI 1.1.1 is installed in dom0 and in the domUs. A specific challenge in communication fabrics that support asynchronous IO, like Infiniband, is the inability to obtain accurate timing measurements without additional hardware support. Our results are based on time measurements gathered before posting an IO request and after the corresponding completion event is detected via a polling interface. This approach has been accepted in the community as a viable approximation of the exact timings of various asynchronous IO operations [16, 14].
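The timing pattern just described can be illustrated with a short sketch against the libibverbs API: a timestamp is taken immediately before posting a signaled RDMA Write work request, and again once the corresponding completion is reaped by polling the completion queue. This is only an illustrative sketch under stated assumptions, not the benchmark code used here; the queue pair, completion queue, registered memory region, and the peer's remote address and rkey are assumed to have been established beforehand (device and connection setup are omitted).

```c
/* Sketch: timing one signaled RDMA Write with libibverbs.
 * Assumes qp, cq, mr, buf, remote_addr, and rkey were set up by the caller. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

static double time_rdma_write_usec(struct ibv_qp *qp, struct ibv_cq *cq,
                                   struct ibv_mr *mr, void *buf, uint32_t len,
                                   uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);          /* timestamp before posting */

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1.0;

    struct ibv_wc wc;
    int n;
    do {                                          /* busy-poll for the completion */
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    clock_gettime(CLOCK_MONOTONIC, &t1);          /* timestamp after completion */

    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return -1.0;

    return (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
}
```

Average bandwidth for a given message size then follows from the bytes moved over many such timed iterations.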

3. Experimental Evaluation - Microbenchmarks

The first set of measurements evaluates the Infiniband RDMA communication layer. We do not include IPoIB measurements, since their performance is inferior to that of native RDMA support. Tests are run with different numbers and deployments of VMs per core and per IB node and with different scheduling criteria. Measurements are taken for the three basic Infiniband operations, RDMA Write, RDMA Send/Receive, and RDMA Read, in terms of average bandwidth and, for RDMA Write only, latency. Each test consists of 5000 iterations for each of the message sizes shown in the graphs (from 2B to 8MB). The MTU size in these experiments is 2KB.

Basic Benchmarks. For the graphs in Figure 1, the setup of the virtual machines is symmetric, i.e., an equal number of VMs runs on the two physical machines, denoted as a 2VM-2VM test. The motivation is to understand the performance effects of multiple VMs sharing the same Infiniband HCA. The first graph in Figure 1 shows that the differences in average bandwidth for the RDMA Write and Send/Recv tests, achieved running inside a VM vs. on a non-virtualized platform, are practically negligible. This shows that virtualization does not impose noticeable overheads on IB throughput. Varying the number of VMs on each machine from 1 to 6, we find that the bandwidths converge approximately to the total maximum bandwidth divided by the number of VMs. This occurs for larger message sizes, where the network link becomes saturated with data. As the number of VMs increases, saturation occurs at ever smaller message sizes. At the same time, the total bandwidth perceived by VMs in non-saturated cases (e.g., up to 64KB in the case of 2 VMs and 32KB for 3 VMs) is the maximum sustainable bandwidth. This implies that 1. the shared use of IB interconnects by multiple VMs is both viable and reasonable, as long as the total bandwidth required by all simultaneously running VMs remains below the maximum sustainable bandwidth. Further, 2. network bandwidth is divided equally among all VMs, with RDMA Write delivering the highest performance, followed by RDMA Send/Receive; RDMA Read performs worst, as well documented in other work [16]. Finally, 3. the maximum bandwidth achieved by any of the RDMA operations is 932 MBps, or approximately 7.5 Gbps.

Effects of Scheduling. The next test demonstrates the effects of pinning VMs to different and/or the same physical CPUs (PCPUs), thereby controlling the physical resources available to each VM. The Xen scheduler allows guest VMs either to use a specific CPU or to use any CPU that is free when the VM is scheduled. Specifically, with Xen 3.1's default Credit Scheduler [30], the same weight is assigned to each VM that is pinned to the same CPU, so that each VM receives an equal CPU share. Note that for these and all subsequent experiments, we show only the results for the RDMA Write microbenchmark, since it consistently delivers the highest performance of the three operations. The graphs in Figure 2 show that when all VMs are assigned to the same physical CPU, the bandwidth attained by each VM is highly variable. This is due to the fact that the Xen scheduler shares the CPU by continuously swapping these VMs in and out. In contrast, 4. the performance attained by VMs pinned to different PCPUs is both higher and more consistent, in terms of the average bandwidth achieved by each VM.
In both cases, however, average bandwidth converges to the maximum bandwidth divided by the number of VMs, as with the simple benchmark tests described above. Furthermore, 5. as the link becomes saturated with increasing message sizes, the average bandwidth attained by each VM decreases. Conclusions derived from these results include the following. First, even when co-locating VMs on the same physical CPU, performance degradation will not occur until the total required bandwidth exceeds the available IB resources.

[Figure 1. RDMA performance numbers: average bandwidth (MBps) vs. message size (2B to 8MB) for native Infiniband vs. 1VM-1VM (RDMA Write and Send/Recv), and for the RDMA Write, Send/Recv, and Read tests with 1, 2, 3, 4, and 6 VMs per machine.]

Second, the “plateau” in each of the graphs shows that even for the case of 6 VMs per machine, we can still achieve the maximum sustainable performance level, as in the native case. The width of this “plateau” depends on the number of VMs and the message sizes.

Latency Tests. Figure 3 shows the latencies recorded for different numbers of VMs. The latencies are measured for pairs of VMs communicating across two physical nodes. As a baseline, we also include measurements of communications between the dom0s on the virtualized machines. Results show that 6. the typical latency of an RDMA Write operation does not change much as the number of VMs increases. This is because VMM-bypass capable interconnects like Infiniband avoid the frontend-backend communication overheads experienced by other Xen devices. However, 7. as message sizes increase, latencies increase rapidly due to bandwidth saturation. For smaller message sizes, the difference in latencies between dom0 and the VMs is negligible (on the order of less than 10 µsec), demonstrating the effectiveness of Infiniband’s VMM-bypass implementation.
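Taken together, the bandwidth measurements above suggest a simple sharing model (our summary of the observed behavior, not a formula taken from the benchmarks): for $N$ VMs with offered loads $b_1, \dots, b_N$ sharing one HCA with maximum sustainable bandwidth $B_{max}$ (about 932 MBps here), the bandwidth seen by VM $i$ is approximately

\[
B_i \approx
\begin{cases}
b_i, & \text{if } \sum_{j=1}^{N} b_j \le B_{max}, \\
B_{max}/N, & \text{otherwise (saturated link, equal demands).}
\end{cases}
\]

For instance, three VMs saturating the link would each see roughly $932/3 \approx 311$ MBps, consistent with the convergence to $B_{max}/N$ described above.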

4. MPI Benchmarks

For the MPI benchmarks, we use the Presta MPI benchmark suite from Lawrence Livermore National Laboratory. The two tests used are (1) the com test, which analyzes the impact of virtualization on inter-process communication bandwidth and latency, and (2) the glob test, which analyzes the impact on collective operations across VMs or on processes within a VM.

MPI Com Test. The com test is an indicator of link saturation between pairs of communicating MPI processes. All of the results reported below are for the unidirectional test. The test configurations and the resulting trends are listed below:

1. Virtualization overhead measurement. The com test is run across two native Linux 2.6.18 kernels and across 2 VMs, with one process per machine, virtual or otherwise.

2. Xen credit scheduler effects on IB-based applications running in VMs. It is important to analyze the effects of virtual machine scheduling on applications running in VMs. We run one MPI process per VM and use two test configurations: in one, all VMs are pinned to different physical CPU cores; in the other, all VMs are pinned to the same physical CPU. These represent the ‘best’ and ‘worst’ cases concerning the effects of scheduling on communication performance. Tests are performed with 2 and 4 VMs, respectively, running on the same physical machine.

3. Latency variation due to VM load. To measure the variation in communication latency due to VM load resulting from different distributions of processes across VMs, we use 2, 4, 8, 16, and 32 communicating MPI processes on 2, 4, and 8 virtual machines, with a fair distribution of processes across VMs. We devise two tests: (1) all VMs pertaining to a measurement run are pinned across two physical cores, i.e., multiple VMs may share the same physical CPU core; and (2) 8 physical cores are used for the VMs pertaining to a measurement run, i.e., in some cases, a VM may have more than one VCPU available to it. The rationale is that the test results make it possible to compare the effects of load on the native Linux scheduler (the O(1) scheduler in the kernel version used in our tests) vs. the Xen credit scheduler. Measurements depict the com latency for a pre-configured number of operations for each pair of VMs or processes.

4. Bandwidth variation due to CPU capping. To simulate cases in which a VM running MPI shares the same CPU with other applications, we use CPU caps of 25, 50, 75, and 100, expressed as the percentage availability of the physical CPU. These caps are applied to each VM in 2-VM and 4-VM configurations, with 1 MPI process running on each VM.

[Figure 2. Effect of CPU pinning on RDMA operations with multiple VMs: average RDMA Write bandwidth (MBps) vs. message size for the 2VM-2VM, 3VM-3VM, 4VM-4VM, and 6VM-6VM cases, comparing VMs pinned to the same vs. different physical CPUs.]

[Figure 3. RDMA write latency: latency (µsec) vs. message size for the 1VM-1VM, 2VM-2VM, 3VM-3VM, and 4VM-4VM cases and for dom0-dom0 communication.]

[Figure 4. Virtualization overhead: MPI bandwidth (MBps) vs. message size for native IB, 2 dom0s, and 2 VMs.]

Figure 4 shows the unidirectional inter-process bandwidth achieved for pairs of MPI processes. Multiple message sizes are evaluated for a native Linux installation, for 2 dom0s (i.e., the base case), and for 2 VMs pinned to different physical processor cores.
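For concreteness, a minimal sketch of the kind of unidirectional point-to-point measurement the com test performs between a pair of MPI processes is shown below. It is a simplified, hypothetical stand-in rather than the Presta code itself; the message size and iteration count are illustrative only.

```c
/* Sketch: unidirectional MPI bandwidth between rank 0 and rank 1
 * (simplified stand-in for the Presta com test; build with mpicc, run with 2 ranks). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;          /* illustrative iteration count */
    const int bytes = 1 << 20;       /* illustrative message size: 1 MB */
    char *buf = malloc(bytes);

    MPI_Barrier(MPI_COMM_WORLD);     /* synchronize both ranks before timing */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0)
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    double secs = MPI_Wtime() - t0;

    if (rank == 0)
        printf("unidirectional bandwidth: %.1f MBps\n",
               (double)bytes * iters / secs / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```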

[Figure 5. Impact of pinning VMs on the same core: bandwidth (MBps) vs. message size for 2, 4, and 8 VMs pinned to different cores and for 2 and 4 VMs pinned to the same core.]

It is evident from the figure that virtualization does not cause additional overheads for MPI communications. Related work has already demonstrated the negligible overheads on MPI processes when deployed in a single VM per node [33, 15]. Figure 5 shows the bandwidths achieved for multiple pairs of MPI processes, each running in its own VM, where the VM (1) has its own physical CPU core or (2) shares a physical CPU core with the other VMs. The most notable trend in the graph is that when multiple VMs are all pinned to the same physical CPU core, the bandwidths for message sizes greater than 8KB drop drastically compared to the case when the VMs are pinned to different cores. This is primarily due to VM scheduling overheads: as the VMs share the single physical CPU for small time slices, 8. smaller messages are sent within a single time slice when the VM is scheduled, so that VM performance is not significantly affected, whereas larger messages (greater than 8KB for the tests considered) require the VMs to be de-scheduled and rescheduled, one or more times, to complete the data transfer.

[Figure 6. Latency comparison for multiple processes and VMs with 8 vs. 2 cores: latency (µsec) vs. number of MPI processes (2-32) for 2, 4, and 8 VMs.]

[Figure 7. Bandwidth variations due to CPU cap: bandwidth (MBps) vs. message size for 2 VMs and 4 VMs with caps of 25%, 50%, 75%, and 100%.]

The graphs in Figure 6 show the latencies measured for sets of communicating MPI processes performing a fixed number of operations, on multiple VMs sharing 8 and 2 physical cores, respectively. Details about these measurements include:

• VMs sharing 8 cores: when there are fewer than 8 VMs, the number of VCPUs available per VM is increased to distribute the 8 physical cores evenly among the VMs.

• VMs sharing 2 cores: when there are more than 2 VMs, multiple VMs are pinned to a single physical CPU core such that the load on the available cores is balanced.

Measurements indicate that latency increases as the number of processes per VM increases. In essence, a heavily loaded VM tends to perform poorly irrespective of the presence of an RDMA-based MPI implementation. Further, giving a VM more VCPUs for use by guest OS processes appears to be less effective than using a larger number of VMs. This is likely due to the interplay of the guest OS and VM schedulers.

Figure 7 shows the variation in bandwidth with 2 VMs and 4 VMs, respectively, for different CPU caps and with each VM running on a different core. Smaller CPU caps result in higher variations in total achieved bandwidths for large message sizes,
again due to scheduling effects (e.g., VMs losing the CPU while communicating).

MPI Glob Test. The glob test from the Presta MPI benchmark suite is used to measure the latencies of MPI collectives. The MPI Reduce, MPI Broadcast, and MPI Barrier collectives are measured, all of which are frequently used in high performance applications [28]. We perform these measurements to better understand how co-deploying interacting VMs on a virtualized infrastructure affects communication performance. Experimental evaluations consider the following configurations:

1. 1 MPI process per VM, with a varied number of VMs, compared with dom0 results;

2. 4 processes running on different dom0s vs. 4 processes in 4 VMs (all on the same physical multicore machine); and

3. 8 processes, with a varied number of VMs, i.e., 2, 4, and 8 VMs.

Experimental results for the first configuration are depicted by the graphs in Figure 8. The latencies shown in the broadcast and allreduce graphs are similar to earlier results, demonstrating that for smaller message sizes, the latencies of the MPI collectives do not vary much as the number of VMs increases. Even at a finer-grained scale, the latency differences between the dom0 and 8VM cases are less than 10 µsec for message sizes up to 64KB. For larger message sizes, as bandwidth is saturated, the latencies increase. In the barrier test in Figure 8, the number in brackets indicates the number of MPI processes running across the virtual/physical machines; the notation 4Dom0 indicates that 4 MPI processes run in dom0.
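The collective latencies reported by the glob test can be thought of as per-operation averages over a timed loop. The sketch below shows such a loop for MPI broadcast and barrier using the standard MPI API; it is a simplified, hypothetical stand-in for illustration, not the Presta glob code, and the iteration count and message size are arbitrary.

```c
/* Sketch: average per-operation latency of MPI_Bcast and MPI_Barrier
 * (simplified stand-in for the Presta glob test; build with mpicc). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;          /* illustrative iteration count */
    const int bytes = 64 * 1024;     /* illustrative message size: 64 KB */
    char *buf = malloc(bytes);

    MPI_Barrier(MPI_COMM_WORLD);     /* synchronize before timing */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Bcast(buf, bytes, MPI_CHAR, 0, MPI_COMM_WORLD);
    double bcast_usec = (MPI_Wtime() - t0) / iters * 1e6;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double barrier_usec = (MPI_Wtime() - t0) / iters * 1e6;

    if (rank == 0)
        printf("bcast: %.2f usec/op  barrier: %.2f usec/op\n",
               bcast_usec, barrier_usec);

    free(buf);
    MPI_Finalize();
    return 0;
}
```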

[Figure 8. Latencies of collective operations across VMs: Glob broadcast and allreduce latency (µsec) vs. message size for dom0 and for 2, 4, and 8 VMs (1 MPI process per VM), and Glob barrier times for the Dom0 (2), 2VMs (2), 4VMs (4), 8VMs (8), and 4Dom0 (4) configurations.]

[Figure 9. Latencies for collective operations for 4 MPI processes within one domain (dom0) or across 4 VMs: Glob broadcast and allreduce latency (µsec) vs. message size.]

[Figure 10. Latencies for collective operations for different numbers of processes per VM: Glob broadcast and allreduce latency (µsec) vs. message size for 8 processes on 2, 4, and 8 VMs.]

The increased overheads in the barrier case are expected because the increased amount of VM-VM interaction is not amortized by any gains in performance due to improved data movement between processes in the VM and the Infiniband network. We are planning additional tests to gather information from low-level performance counters, such as VM entry/exit operations, time spent in the hypervisor, etc., which we believe will help better explain the observed behaviors for these types of collective operations.

The experiments presented in Figure 9 compare the performance of MPI processes running in multiple VMs versus running in the same VM.

The performance of 4 VMs (1 MPI process per VM) versus 4 processes in dom0 shows little difference in latency for messages up to 64KB in size. In these tests, we use the default Xen scheduling policy. These results demonstrate that, depending on the types of interactions between application processes and the amount of IO performed, it can be acceptable to structure individual components as separate VMs, all deployed on the same platform. This can be useful for maintaining isolation between different application components, or for leveraging the Xen-level mechanisms for dynamic VM migration for reliability or load balancing purposes.

Similar tests, shown in Figure 10, investigate the impact of varying the number of processes running within a VM. Unlike the barrier case in Figure 8 above, results show that 9. broadcast and allreduce communication patterns benefit if they are structured across a larger number of VMs, particularly for larger message sizes. The best case is the one in which each process runs in its own VM, because this reduces the additional scheduling overheads within the guest VM (the Linux scheduler) and at the VMM level (the Xen scheduler). These results further strengthen our experimental demonstration that multiple VMs can easily share a single virtualized platform, even in the high performance domain.

5. Related Work

Other research efforts that have analyzed the performance overheads of virtualizing Infiniband platforms with the Xen hypervisor appear in [15, 24]. Our work differs in that it specifically focuses on the effects on communication performance when virtualized multicore platforms are shared by many collaborating VMs. For these purposes, the IB split driver was modified to enable guest VM-dom0 interactions via Xenbus, which made it possible for multiple VMs to be instantiated in an efficient and scalable manner, thereby enabling the experiments described in this paper.

The opportunities for virtualization in the HPC domain have been investigated in multiple recent research efforts. The work described in [33, 34] assesses the performance impact of Xen-based virtualization on high performance codes running MPI, specifically focusing on the Xen-related overheads. It does not take into account the effects of any specific platform characteristics, such as the multicore processing nodes or the Infiniband fabric considered by our work. Other efforts have used virtualization to ease reliability, management, and development and debugging for HPC systems and applications [26, 27, 8]. The results described in this paper complement these efforts. Finally, many research efforts use virtualization for HPC grid services [35, 23, 17, 20]; our complementary research focus is to understand the performance factors in deploying multiple VMs on the individual multicore resources and cluster machines embedded in such grids.

There is also much related work on managing shared data centers [5, 32], including deployment of mixes of batch and interactive VMs on shared cluster resources [13] and cluster management, co-scheduling, and deployment of cluster processes [22, 21]. Our future research will build on such work to create a QoS-aware management architecture that controls the shared use of virtualized high performance resources.

6. Conclusions and Future Work

This paper presents a detailed study of the implications of sharing high performance multicore cluster machines that use high end interconnection fabrics like Infiniband and that are virtualized with standard hypervisors like Xen. Measurements are conducted with multiple VMs deployed per node, using modern techniques for hypervisor bypass for high performance network access. Experiments evaluate the implications of resource sharing with different patterns of application behavior, including the number of processes deployed per VM, the types of communication patterns, and the amounts of available platform resources. Results indicate that multiple applications can share virtualized multicore nodes without undue performance effects on Infiniband access and use, with higher degrees of sharing possible with communication-conscious VM placement and scheduling. Furthermore, depending on the types of interactions between application processes and the amounts of IO performed, it can be beneficial to structure individual components as separate VMs rather than placing them into a single VM.

This is because such placements can avoid undesirable interactions between guest OS-level and VMM-level schedulers. Such placement can also bring additional benefits for maintaining isolation between different application components, or for load-balancing, reliability, and fault-tolerance mechanisms that can leverage the existing hypervisor-level (i.e., Xen-level) VM migration mechanisms.

Our future work will derive further insights from the experimental results discussed in Sections 3 and 4 by gathering additional low-level performance information, including time spent in the hypervisor, the number of ‘world switches’ between the VMs and the hypervisor, etc., using tools like Xenoprofile. The idea is to attain greater insight into the implications of the shared use of virtualized platforms and the manner in which the platforms’ resources should be distributed among running VMs. We hope to be able to include select results from such measurements in the final version of this paper. In addition, we plan to extend this work to analyze the ability of virtualized Infiniband platforms to meet different QoS requirements and honor SLAs for sets of collaborating VMs, by manipulating parameters such as VM deployment onto or across individual platform nodes, resource allocation and hypervisor-level scheduling parameters on these multicore nodes, and fabric-wide policies for Service Level (SL) to Virtual Lane (VL) mappings. Certain extensions of our current testbed are necessary to make these measurements possible. The longer term goal of our research is to devise new management mechanisms and policies for QoS-aware management architectures for shared high performance virtualized infrastructures.

References

[1] S. Adabala, V. Chadha, P. Chawla, R. Figueiredo, J. Fortes, I. Krsul, A. Matsunaga, M. Tsugawa, J. Zhang, M. Zhao, L. Zhu, and X. Zhu. From virtualized resources to virtual computing grids: the In-VIGO system. Future Generation Computer Systems, 21(6):896-909, 2005.
[2] Amazon Elastic Compute Cloud (EC2). aws.amazon.com/ec2.
[3] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. In SOSP, 2003.
[4] IBM Research Blue Gene. www.research.ibm.com/bluegene/.
[5] J. Chase, L. Grit, D. Irwin, J. Moore, and S. Sprenkle. Dynamic Virtual Clusters in a Grid Site Manager. In Twelfth International Symposium on High Performance Distributed Computing (HPDC-12), 2003.
[6] L. Cherkasova, D. Gupta, and A. Vahdat. Comparison of the Three CPU Schedulers in Xen. ACM SIGMETRICS Performance Evaluation Review, 35(2):42-51, 2007.
[7] Technology Review: Computer in the Cloud. www.technologyreview.com/Infotech/19397/?a=f.
[8] C. Engelmann, S. L. Scott, H. Ong, G. Vallée, and T. Naughton. Configurable Virtualized System Environments for High Performance Computing. In Proceedings of the 1st Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2007, in conjunction with the 2nd ACM SIGOPS European Conference on Computer Systems (EuroSys) 2007, Lisbon, Portugal, Mar. 20, 2007.
[9] R. Farber. Keeping “Performance” in HPC: A look at the impact of virtualization and many-core processors. Scientific Computing, 2006. www.scimag.com.
[10] R. Figueiredo, P. Dinda, and J. Fortes. A Case For Grid Computing on Virtual Machines. In Proc. of the IEEE International Conference on Distributed Computing Systems, 2003.
[11] A. Gavrilovska, S. Kumar, H. Raj, K. Schwan, V. Gupta, R. Nathuji, R. Niranjan, A. Ranadive, and P. Saraiya. Scalable Hypervisor Architectures for High Performance Systems. In Proceedings of the 1st Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2007, in conjunction with the 2nd ACM SIGOPS European Conference on Computer Systems (EuroSys) 2007, Lisbon, Portugal, Mar. 20, 2007.
[12] W. Huang, J. Liu, and D. Panda. A Case for High Performance Computing with Virtual Machines. In ICS, 2006.
[13] B. Lin and P. Dinda. VSched: Mixing Batch and Interactive Virtual Machines Using Periodic Real-time Scheduling. In Proceedings of ACM/IEEE SC 2005 (Supercomputing), 2005.
[14] J. Liu, B. Chandrasekaran, J. Wu, W. Jiang, S. Kini, W. Yu, D. Buntinas, P. Wyckoff, and D. K. Panda. Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics. In Supercomputing '03, 2003.
[15] J. Liu, W. Huang, B. Abali, and D. K. Panda. High Performance VMM-Bypass I/O in Virtual Machines. In ATC, 2006.
[16] J. Liu, J. Wu, S. P. Kini, P. Wyckoff, and D. K. Panda. High Performance RDMA-Based MPI Implementation over InfiniBand. In Int'l Conference on Supercomputing (ICS '03), 2003.
[17] A. Matsunaga, M. Tsugawa, S. Adabala, R. Figueiredo, H. Lam, and J. Fortes. Science gateways made easy: the In-VIGO approach. Concurrency and Computation: Practice and Experience, 19(1), 2007.
[18] OpenFabrics Software Stack - OFED 1.1. www.openfabrics.org/.
[19] Presta Benchmark Code. svn.openfabrics.org/svn/openib/gen2/branches/1.1/ofed/mpi/.
[20] P. Ruth, X. Jiang, D. Xu, and S. Goasguen. Virtual Distributed Environments in a Shared Infrastructure. IEEE Computer, Special Issue on Virtualization Technologies, 38(5):63-69, 2005.
[21] M. Silberstein, D. Geiger, A. Schuster, and M. Livny. Scheduling Mixed Workloads in Multi-grids: The Grid Execution Hierarchy. In Proceedings of the 15th IEEE Symposium on High Performance Distributed Computing (HPDC), 2006.
[22] M. S. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, and J. E. Moreira. Modeling and analysis of dynamic coscheduling in parallel and distributed environments. In SIGMETRICS, 2002.
[23] A. Sundararaj and P. Dinda. Towards Virtual Networks for Virtual Machine Grid Computing. In Proceedings of the Third USENIX Virtual Machine Technology Symposium (VM 2004), 2004.
[24] S. Sur, M. Koop, L. Chai, and D. K. Panda. Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms. In 15th Symposium on Hot Interconnects, 2007.
[25] Top500 SuperComputing Sites. www.top500.org.
[26] G. Vallée, T. Naughton, H. Ong, and S. Scott. Checkpoint/Restart of Virtual Machines Based on Xen. In HAPCW, 2006.
[27] G. Vallée and S. L. Scott. Xen-OSCAR for Cluster Virtualization. In ISPA Workshop on XEN in HPC Cluster and Grid Computing Environments (XHPC'06), Dec. 2006.
[28] J. Vetter and F. Mueller. Communication Characteristics of Large-Scale Scientific Applications for Contemporary Cluster Architectures. In Proc. of the Int'l Parallel and Distributed Processing Symposium, 2002.
[29] The VMWare ESX Server. http://www.vmware.com/products/esx/.
[30] Xen Credit Scheduler. wiki.xensource.com/xenwiki/CreditScheduler.
[31] XenSmartIO Mercurial Tree. xenbits.xensource.com/ext/xen-smartio.hg.
[32] J. Xu, M. Zhao, M. Yousif, R. Carpenter, and J. Fortes. On the Use of Fuzzy Modeling in Virtualized Data Center Management. In Proceedings of the International Conference on Autonomic Computing (ICAC), Jacksonville, FL, 2007.
[33] L. Youseff, R. Wolski, B. Gorda, and C. Krintz. Evaluating the Performance Impact of Xen on MPI and Process Execution For HPC Systems. In International Workshop on Virtualization Technologies in Distributed Computing (VTDC), with Supercomputing '06, 2006.
[34] L. Youseff, R. Wolski, B. Gorda, and C. Krintz. Paravirtualization for HPC Systems. In XHPC: Workshop on XEN in High-Performance Cluster and Grid Computing, 2006.
[35] M. Zhao, J. Zhang, and R. Figueiredo. Distributed File System Virtualization Techniques Supporting On-Demand Virtual Machine Environments for Grid Computing. Cluster Computing Journal, 9(1), 2006.
