HPC Cluster Readiness of Xen and User Mode Linux

Wesley Emeneker, Dan Stanzione
High Performance Computing Initiative
Ira A. Fulton School of Engineering
Arizona State University
[email protected], [email protected]

Abstract

This paper examines the suitability of different virtualization techniques in a high performance cluster environment. A survey of virtualization techniques is presented. Two representative technologies (Xen and User Mode Linux) are selected for an in-depth analysis of cluster readiness in terms of their performance, reliability, and overall impact on the complexity of cluster administration.

1 Introduction

Clusters of commodity processors are the dominant architecture in the high performance computing (HPC) world. As they continue to grow in complexity, making efficient use of a cluster without imposing undue burden on system administrators remains a challenge. From security updates to maintaining a consistent set of libraries and binaries, a cluster's software environment requires much of an administrator's time and effort. Virtualization technologies hold the promise of much simpler cluster administration by allowing for custom cluster images tailored to a particular user or application. Potentially, "virtual clusters" could also provide a more robust method of checkpointing and resource management. Software upgrades could be performed on virtual clusters with no downtime on the physical system. However, employing virtualization does involve trade-offs, both in the effort to maintain a set of virtual clusters and in performance. In the HPC arena, any penalty in run-time speed must have substantial benefit to offset it. In this paper, a thorough evaluation of two virtualization technologies is presented in terms of their potential costs and benefits in performance, system administration, and reliability. Section 2 provides a taxonomy of various available virtualization techniques. Section 3 examines the administrative challenges of maintaining a cluster environment. In section 4, the advantages of applying common virtualization techniques to cluster administration are examined, and the performance of the candidate virtualization technologies is assessed with the HPC Challenge (HPCC) suite [9] and Mpptest [5], two standard CPU and network benchmarks designed for clusters. Finally, in section 5, we discuss some of the possibilities this work may allow and future avenues of research, such as advanced networking, parallel checkpointing, and job spanning.

2 Related Work

Pioneered largely by IBM [15] for sharing large mainframe systems, virtualization has long been a technique applied to allow users to change the local software environment while sharing underlying physical hardware. However, virtualization has seldom been employed in the cluster computing world [4]. Many types of virtualization exist, from file system virtualization to complete machine emulation, and each system has its own advantages and disadvantages. The following sections will discuss a few major types of virtualization.

2.1 Classic Virtualization

Classic virtualization [12] is a well-known, mature technique which presents an abstracted machine architecture that may differ from the actual hardware of the host. It typically involves some emulation of hardware [12] in order to safely sandbox the guest environment from the host. Examples of this are VMware [16], which virtualizes an x86 environment on an x86 system, and Bochs [14], which is a complete IA-32 emulator. Both examples use emulated hardware to allow the virtual environment to look identical to a real environment.

2.2 Paravirtualization

Paravirtualization is a technique that presents the abstraction of a virtual machine with an interface almost identical to the underlying hardware [3]. This virtualization scheme does not emulate hardware; instead, it requires that some sensitive instructions be intercepted [11] by the hypervisor in order to ensure that the guest environment works within defined boundaries. The most notable example of this type of virtualization is Xen [3], which has been used to successfully run Linux, BSD, and Windows as guests.

2.3 OS-level Virtualization

This technique virtualizes at the operating system level and allows guest servers to have partitions of system resources that the host OS controls. It requires no hardware emulation and is generally used to isolate system resources and prevent processes in separate guests from interfering with each other. Usually, this kind of virtualization involves system call trapping, either at the kernel level, as in OpenVZ [13], or at the user level, as in User Mode Linux (UML) [2]. While OpenVZ modifies the kernel to isolate guests by trapping system calls and rewriting the results, UML traps system calls within the userspace guest environment and can arbitrarily modify the results [2]. One attraction of UML's approach is that new kernels, drivers, and environments can be tested without host downtime.

2.4 Other Techniques

There are many types of virtualization, but most do not have the flexibility necessary to address cluster issues. Other virtual machines, such as the Java virtual machine and chroot jails [12], provide virtualization in the forms of abstract computer architectures and filesystem jails, but do not provide the functionality or flexibility needed for cluster computing and administration. The major techniques presented here are among the most common and mature forms of virtualization. Applying one or more of these techniques to cluster computing will require the virtualization systems to solve numerous cluster administration issues while providing performance as close to the native OS as possible.

3 Virtualization on HPC Clusters

There are several requirements that a virtualization system applied to a cluster should satisfy.

1. Performance: The virtual machine must not have a significant impact on system performance.

2. Administration: A virtualized system must confer several advantages on administrators that make the overhead of virtualization acceptable. These advantages may include a homogeneous environment, upgrading through centralized system images, and system resource partitioning.

3. Reliability: A virtual machine setup should increase the overall reliability of the cluster or of individual application runs by isolating processes, partitioning resources, and minimizing downtime from crashes and upgrades.

Figure 1 illustrates the use of virtual machines in a cluster. As shown in the diagram, different images for the guest environments can provide different functionality, libraries and binaries, etc.

Figure 1. Virtual machines in a cluster environment

3.1 Cluster Administration Challenges

Administering a cluster is a time-consuming endeavor. Installing, updating, and fixing nodes can be tedious jobs, and these administrative responsibilities are still a large part of running a cluster. Because clusters are large, expensive computing systems, keeping utilization high maximizes their cost-effectiveness. Thus, the two major goals of cluster administration are minimizing downtime and maximizing throughput. The administration issues listed below are directly related to these two goals.

• Upgrades and new software: Security patches, node upgrades, and software installation are typical activities for a cluster administrator. Some activities (kernel upgrades, security patches, etc.) require node downtime, and in order to keep the cluster environment homogeneous, it may be necessary to shut down the cluster for hours or days.

• Node Reliability: Disk failures, memory errors, and software failures are all capable of making a host nonfunctional. In many parallel MPI applications, a node failure will cause the entire application to fail, losing all of its data. Application-level checkpointing can be used to address these failures, but this puts a large burden on the application programmer.

• Isolation and Resource Management: In order to manage a cluster efficiently, job schedulers typically assign one job per processor. Because many cluster nodes have multiple processors, two independent jobs are often scheduled on the same node. Even though each job may have its own processor, resources like RAM and disk are still shared, and effectively partitioning these resources is a major issue.

4 Evaluation and Results

This paper attempts to evaluate the effectiveness of Xen and UML in a cluster environment; their performance on MPI parallel applications is therefore extremely important. Two of the most important performance metrics for MPI applications are the computation and communication rates. Each virtualization system will be evaluated in terms of CPU and network performance with MPI applications.

4.1 Virtualization Evaluation

4.1.1 Virtual Machines Chosen

For this paper, Xen and UML were chosen as the two virtualization architectures to evaluate. These solutions represent two extremes of the virtualization spectrum: sub-kernel virtualization and userspace virtualization. User Mode Linux is a port of Linux that runs a kernel entirely in userspace [2]. The usermode kernel is able to intercept system calls and may arbitrarily modify the result of each call; the host kernel is only notified that it needs to act when the usermode kernel asks. While UML's architecture requires no host modifications, the overhead of its kernel calls may impact performance more than Xen's approach does. Xen is a hypervisor kernel designed only to sandbox and arbitrate between guest environments and partitioned system resources; it thus lacks most of the functionality of full-featured kernels like Linux, Mach, and Windows.

Xen's design creates some overhead in every environment, although this overhead can be quite small. Hardware such as video cards and sound cards is generally emulated by more conventional virtual machines like VMware; in an HPC cluster such emulation is usually unnecessary. By not emulating hardware, Xen and UML are able to avoid some overhead that exists in other virtualization systems. Figure 2 illustrates the extra overhead that hardware emulation can cause by displaying the percentage performance difference of VMware, UML, and Xen on a naive single processor HPCC benchmark. In this graph, Xen provides the best performance, with approximately a 1% performance difference from the native OS. UML suffers a 6% performance loss, while VMware encounters more than a 10% performance loss. This benchmark was run on a single 3.4 GHz Pentium IV with 3GB of RAM on an Ubuntu 5.04 filesystem.

Figure 2. Naive Virtual Machine benchmark
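The performance differences reported in Figure 2 and throughout the results are relative losses against the native measurement. A minimal sketch of that calculation, with hypothetical Gflop/s scores standing in for measured values:

```python
def percent_loss(native, virtualized):
    """Percent of performance lost relative to the native result."""
    return 100.0 * (native - virtualized) / native

# Hypothetical single-processor HPL scores in Gflop/s
native_score = 4.20
xen_score = 4.16   # roughly the ~1% loss reported for Xen
uml_score = 3.95   # roughly the ~6% loss reported for UML

print(round(percent_loss(native_score, xen_score), 1))  # → 1.0
print(round(percent_loss(native_score, uml_score), 1))  # → 6.0
```

A negative result from this function indicates the virtualized run outperformed native, which the STREAM results later show is occasionally possible.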

4.1.2 Evaluation of Administration Challenges - Advantages

For virtualization to be effective in a cluster it must, at least partially, address the challenges of cluster administration outlined in section 3.1.

• Updates and System Changes: A virtual machine's image may be updated while the physical cluster is running another virtual machine. This allows users to install software in an image and then use that customized image in a cluster without requiring downtime or administrator intervention.

• Ease of Administration: Security patches, library and binary updates, and kernel upgrades are also easy to install. A base image can be updated so that the next use of a guest machine will use the new image or kernel while all old running versions still exist. One advantage of this approach is that any job running inside an old guest will use that environment until the job is done, while any new job will use the updated image. Because only the guest has been updated, no downtime is required for the upgrade.

• Reliability: Xen (but not UML) provides the ability to checkpoint and migrate [7, 1] a guest machine from one physical host to another. With checkpointing and migration, any host that is predicted to fail can migrate its guest virtual machines (VMs) to another host. By taking advantage of component failure prediction, checkpointing and migration can be used to increase "machine" uptime.

• Isolation and Resource Management: As previously stated, virtualization presents a subset of physical computing resources to a guest virtual machine. This permits jobs to run inside independent VMs; if a job crashes the virtual machine, only that VM will be affected. Both Xen and UML are capable of partitioning CPU, RAM, and disk, and guests are completely isolated from any other guest environments running on the same host.

Table 1 lists several cluster administration issues and the ability of these virtual machines to address them.

Table 1. Administration issues that virtual machines address

Problem                                   Xen   UML
Allows system changes without downtime?   yes   yes
Ability to isolate concurrent jobs?       yes   yes
Partitions system resources (RAM, CPU)?   yes   yes
Ability to checkpoint guests?             yes   no
Maintains software homogeneity?           yes   yes
Ability to run without patching host?     no    yes

4.1.3 Evaluation of Administration Challenges - Disadvantages

While virtual machines provide many advantages, there are disadvantages inherent in using the abstractions virtualization provides.

• Creation, configuration, and updating: Creating an initial guest image, configuring it, and updating it can be time-consuming tasks. Two recurring sources of virtual machine overhead occur when a guest starts and shuts down; for short jobs, a significant fraction of the total run-time and CPU time can be attributed to this overhead.

• Staging and distribution: Any updated or customized image must be staged to every host it will run on. Virtual machine images range from roughly 100 megabytes to several gigabytes in size, and staging that amount of data to multiple nodes is costly in both network bandwidth and time.

• Consumed resources: A virtual machine has two major components: the virtual machine monitor and the guest machine. Both require system resources (RAM, CPU, and disk) that could otherwise be used by an application. If a virtualized system consumes more than 5-10% of extra system resources, it may be unsuitable for use in a cluster.

To quantify the disadvantages of using virtual machines, the following sources of common overhead were measured.

Table 2. Virtual machine overhead (1 GB image)

Overhead             Xen          UML
Guest startup        23 seconds   51 seconds
Guest shutdown       11 seconds   15 seconds
Guest staging        31 seconds   31 seconds
Extra consumed RAM   100MB        50MB

Each test was run 5 times (caching effects were eliminated), the highest and lowest values were discarded, and the remaining three values were averaged. The benchmarks included timing guest filesystem staging, booting guests, shutting down guests, and measuring the extra resources consumed by the virtualization. In each of the tests, the time to completion was less than 1 minute. The main reason for the difference between UML and Xen boot times is that UML must make many more system calls to the host and is thus slower. Another notable result is that staging alone took more than 30 seconds; when staging time is combined with startup and shutdown, the resulting "time-to-ready" is more than 1 minute for both UML and Xen. The 1 GB image used for these tests is larger than absolutely required, but is representative of a typical cluster node image such as those produced by Rocks [10] and OSCAR [6]. According to the Xen documentation, the hypervisor consumes 64MB of RAM [3].
This 64MB added to the guest environment consumption is approximately 100MB. With UML, the kernel consumes slightly more RAM than the Xen guest kernel, but the environment consumes almost exactly the same amount.
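The measurement protocol behind Table 2 (five runs, extremes discarded, remaining three averaged) and the quoted "time-to-ready" arithmetic can be sketched as follows; the five raw timings are hypothetical stand-ins, while the per-component totals come from Table 2:

```python
def trimmed_mean(samples):
    """Discard the single highest and lowest values and average the rest,
    as in the five-run protocol used for Table 2."""
    if len(samples) < 3:
        raise ValueError("need at least 3 samples to trim both extremes")
    trimmed = sorted(samples)[1:-1]
    return sum(trimmed) / len(trimmed)

# Hypothetical raw timings for one measurement (seconds)
print(trimmed_mean([22.8, 23.1, 23.0, 24.5, 21.9]))  # mean of 22.8, 23.0, 23.1

# Table 2 averages for a 1 GB image (seconds): staging + startup + shutdown
xen_time_to_ready = 31 + 23 + 11   # 65 s
uml_time_to_ready = 31 + 51 + 15   # 97 s
print(xen_time_to_ready, uml_time_to_ready)  # both exceed one minute
```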

4.2 Performance

Some loss of performance is inevitable with any virtual machine technology. A small loss (less than 5-10%) may be acceptable, and the advantages in resource management and administration gained by using virtualization may outweigh it. In this section we attempt to quantify the performance penalty of Xen and UML. Many MPI applications are sensitive to network latency and bandwidth; because of Xen's architecture, we expect to obtain near-native performance from any guest environment running a parallel MPI application. By the same measure, UML is not expected to perform as well as Xen, since every packet originating from UML must pass through both the usermode kernel and the real kernel.

4.2.1 Setup

Both Xen and UML require modified versions of the Linux kernel to work, so the newest kernel at the time, 2.6.16, was used for the base Linux environment, with patched versions of it used for both UML and Xen. The hardware and software used for these tests are specified below.

Hardware:

• Processor: 2x 3.2 GHz Xeon EM64T
• Memory: 6 GB DDR2 memory
• Network: Gigabit Ethernet

Software:

• Kernel: Linux 2.6.16, x86_64 architecture
• Host and guest environment: CentOS 4.2
• Compiler: mpicc (1.2.7) with gcc 3.4.4
• Benchmarks: HPCC 1.0.0, Mpptest 1.3a
• Linear algebra libraries: ATLAS 3.7-11

The HPCC suite [9] is a comprehensive set of synthetic benchmarks designed to profile the performance of several aspects of a cluster. The tests used in the results are listed here:

1. HPL: solves a linear system of equations
2. DGEMM: a double precision matrix-matrix multiplication
3. STREAM: measures sustainable memory bandwidth (in GB/s) and the corresponding computation rate for simple vector kernels
4. PTRANS: performs a parallel matrix transposition

For this benchmark suite, four problem sizes were evaluated. In an effort to keep the tests as similar as possible, the same per-process problem sizes, corresponding to 250MB, 500MB, 750MB, and 1000MB, were used for each test. The block and grid sizes chosen are common (block: 80, 88, and 112; grid: 1x[1,2,4,8,16], 2x[2,4,8], and 4x4) and generally provide reasonable benchmark results.

The Mpptest [5] benchmark is used to measure the performance of basic MPI message passing in terms of network latency and bandwidth. Along with the classic ping-pong test, Mpptest is capable of testing asynchronous messaging, broadcast messaging, and network bisection. For this benchmark, 6 parameter sets and several message sizes were evaluated. There are two major communication patterns in MPI, point-to-point and broadcast, and two major methods of communication, blocking and nonblocking. Each pattern was tested with each type of communication across a range of message sizes.

• Mpptest patterns: sync, async, broadcast, async broadcast, bisect, async bisect
• Mpptest message sizes: 0-1024 bytes (increments of 32 bytes), 8KB-256KB (increments of 8KB)

Each of the benchmarks shown was run five times on the same set of nodes. The highest and lowest values for each test were discarded, and the remaining three values were averaged. For the PTRANS and HPL benchmarks, multiple process grids were tested, and the best overall results from a single process grid were used for the graphs. Since some environments perform better than others with the same HPL block size, the 3 block sizes used were averaged together for each problem size.

4.3 Benchmark Results

The results in the following figures show the performance difference between native Linux, Xen, and UML by calculating the percentage of performance lost in each test.

Difficulties encountered: Attempts to allocate very large amounts of RAM (near the node's capacity) caused instability with UML. As a result, only three problem sizes for the PTRANS and HPL benchmarks were tested with UML. In addition, UML became unstable while testing larger numbers of processes, so the resulting UML graphs are only complete for the 1 and 2 processor cases.

Figure 3. HPCC HPL benchmark

The first graph, figure 3, shows the average percent performance difference for the HPL test. For single processor tests, the results for both UML and Xen show less than a 5% performance loss. However, as more processors are added, the average performance lost by the virtual machines increases.

Figure 4. Mpptest Latency benchmark

Figure 5. Mpptest Bandwidth benchmark

Figures 4 and 5 depict the network latency and bandwidth test results. In each test, Xen's average latency is approximately 1.6 times higher than native Linux's, while UML's latency is roughly 9 times higher. The bandwidth graph shows much the same, with Xen and UML achieving approximately 60% and 10% of native bandwidth, respectively. The only discrepancy occurs when UML shows improved latency and bandwidth performance for broadcasts. During those tests, the latency for each environment increased by approximately 80 microseconds. For Xen this increased latency by a factor of 3, and for Linux by a factor of 4; UML's latency, however, increased by only a factor of 0.3. Similarly, the bandwidth of both Xen and Linux decreases at a rate greater than UML's. Because latency is higher and bandwidth is lower in the virtual machines, and HPL is sensitive to these characteristics, it follows that HPL will perform worse in the virtual environments.

Figure 6. HPCC PTRANS benchmark

Following the HPL benchmarks is the PTRANS benchmark in figure 6. The transposition of a parallel matrix is network bound, since most participating processes must exchange all of their data. Given that the virtual machines' bandwidth and latency are worse than native, we expect the virtual machines to always under-perform Linux. As the number of processors increases, the average difference in performance increases, a direct result of the latency and bandwidth penalty. The mysterious drop in performance by UML with 2 processors is not understood at this time; however, the 4 and 8 processor tests are competitive with Xen.

The DGEMM benchmark shown in figure 7 illustrates the performance of a parallel matrix multiplication. This test is more CPU bound than the previous benchmarks, so the expected result is that both virtual machines will perform well. According to the graph, the average performance of both virtual machines varies by less than 5% from Linux.

The final benchmark, shown in figure 8, is the STREAM benchmark. This test measures both memory bandwidth and the computation rate for vector kernels. Both Xen and UML are able to access memory with very little overhead, and both are expected to perform almost identically to Linux. As seen in the figure, both virtual machines lose very little performance compared to Linux, and in a few cases actually outperform it.
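Mpptest's core measurement is a ping-pong exchange: a fixed-size message is bounced between two processes and the round trip is timed, giving latency at small message sizes and bandwidth at large ones. A minimal local sketch of the idea, using a pipe between two processes on one host rather than MPI over the cluster network (so the absolute numbers are not comparable to the paper's):

```python
import time
from multiprocessing import Pipe, Process

def _echo(conn, iters):
    # Peer side: bounce every message straight back.
    for _ in range(iters):
        conn.send_bytes(conn.recv_bytes())
    conn.close()

def pingpong(msg_size, iters=500):
    """Median round-trip time in seconds for msg_size-byte messages."""
    parent, child = Pipe()
    peer = Process(target=_echo, args=(child, iters))
    peer.start()
    payload = b"x" * msg_size
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        parent.send_bytes(payload)
        parent.recv_bytes()
        samples.append(time.perf_counter() - t0)
    peer.join()
    samples.sort()
    return samples[len(samples) // 2]
```

For example, `pingpong(32)` approximates small-message latency; sweeping sizes up to 256KB, as Mpptest does, and dividing the bytes moved by the elapsed time yields a bandwidth curve.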

Figure 7. HPCC DGEMM benchmark

Figure 8. HPCC Stream benchmark

4.4 Analysis

Several of the administrative advantages of virtual machines were put to the test in these experiments. First, the ability to make system changes without host downtime allowed new kernels to be compiled and used without rebooting the host. Second, modifying a single machine image and then propagating that image was much simpler than modifying each individual machine in order to perform necessary system changes (e.g. SSH keys, binary and library propagation, etc.). Partitioning system resources was also simple and effective: in both UML and Xen guests, attempting to use more memory than the environment was given caused out-of-memory errors that affected only the respective guest. The ability to partition resources and run independent guests is critical for resource management, scheduling, administration, and reliability. The reliability of the system was thus increased by not requiring an expensive node reboot or a node reinstall.

Checkpointing a single independent guest worked effectively with Xen. While this ability was not rigorously tested, timed, or profiled, a naive correctness test of the HPCC HPL benchmark demonstrated the possibility of using Xen to back up and restore work done by guest environments.

5 Conclusions and Future Work

Applying virtualization to clusters is a promising technique; however, much work remains. According to these initial tests, Xen is both stable and fast, with only a slight performance difference from a native Linux OS. Xen's stability, speed, and ability to checkpoint guest environments make it a solid choice for HPC cluster use once the initial hurdle of installation is overcome. UML's primary strength is the ability to implement a virtualization system on an unmodified cluster. However, stability issues, the significant performance drop, and the fact that UML does not yet support SMP show that UML is not yet ready for general use in high performance clusters. Initial work on using advanced network architectures, specifically InfiniBand, in Xen guest kernels has shown [8] latency and bandwidth measurements virtually identical to those of a stock Linux kernel. Future work will include testing virtual machines in a multi-cluster environment in order to span jobs over independent clusters with heterogeneous software stacks. Integrating virtual machines with resource managers and schedulers such as Moab, LSF, and Maui is an important related goal, and will be necessary to make transparent allocation of virtual machines in a cluster possible.

References

[1] C. Clark, K. Fraser, S. Hand, J. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live Migration of Virtual Machines, 2005.
[2] J. Dike. A User-mode Port of the Linux Kernel, 2000. http://user-mode-linux.sourceforge.net/als2000.tex.
[3] B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, I. Pratt, A. Warfield, P. Barham, and R. Neugebauer. Xen and the Art of Virtualization. In Proceedings of the ACM Symposium on Operating Systems Principles, October 2003.
[4] R. Figueiredo, P. Dinda, and J. Fortes. A Case for Grid Computing on Virtual Machines, 2003.
[5] W. Gropp and E. Lusk. Reproducible Measurements of MPI Performance Characteristics, 1999. http://www-unix.mcs.anl.gov/~gropp/bib/papers/1999/pvmmpi99/mpptest.pdf.
[6] Open Cluster Group. OSCAR: A Packaged Cluster Software Stack for High Performance Computing. http://oscar.sourceforge.net.
[7] J. Hansen and E. Jul. Self-migration of Operating Systems, 2004.
[8] J. Liu, B. Abali, W. Huang, and D. K. Panda. Virtualizing InfiniBand in Xen: Prototype Design, Implementation and Performance. Xen Summit, 2006.
[9] P. Luszczek, J. J. Dongarra, D. Koester, R. Rabenseifner, B. Lucas, J. Kepner, J. McCalpin, D. Bailey, and D. Takahashi. Introduction to the HPC Challenge Benchmark Suite, 2005. http://icl.cs.utk.edu/projectsfiles/hpcc/pubs/hpcc-challenge-benchmark05.pdf.
[10] P. Papadopoulos, M. Katz, and G. Bruno. NPACI Rocks: Tools and Techniques for Easily Deploying Manageable Linux Clusters. IEEE Cluster, 2001.
[11] J. S. Robin and C. E. Irvine, 2000.
[12] J. Smith and R. Nair. The Architecture of Virtual Machines. Computer, 38(5), May 2005.
[13] SWSoft. OpenVZ User's Guide. http://download.openvz.org/doc/OpenVZ-Users-Guide.pdf.
[14] Bochs Development Team. Bochs: IA-32 Emulator Project, 2006. http://bochs.sourceforge.net/.
[15] M. Varian. VM and the VM Community: Past, Present, and Future, 1997. http://www.os.nctu.edu.tw/vm/pdf/VM and the VM Community Past Present and Future.pdf.
[16] VMware. Virtualization Overview. http://www.vmware.com/pdf/virtualization.pdf.
