Maestro-VC: A Paravirtualized Execution Environment for Secure On-Demand Cluster Computing

Nadir Kiyanclar, Gregory A. Koenig, William Yurcik
Department of Computer Science and National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
{nadir,koenig,yurcik}@ncsa.uiuc.edu

Abstract

Virtualization, a technology first developed for partitioning the resources of mainframe computers, has seen a resurgence in popularity in the realm of commodity workstation computers. This paper introduces Maestro-VC, a system which explores a novel use of VMs as the building blocks of entire Virtual Clusters (VCs). Virtualization of entire clusters is beneficial because existing parallel code can run without modification in the virtual environment. At the same time, inserting a layer of software between a virtual cluster and native hardware allows for security enforcement and flexible resource management in a manner transparent to running parallel code. In this paper we describe the design and implementation of Maestro-VC and give the results of some preliminary performance experiments.

1 Introduction

Largely in the last decade, computing clusters have arisen as a scalable and cost-effective solution to the problem of ever-increasing demand for computing cycles. In addition to cost and performance, "traditional" clusters also provide scalable benefits to both system administrators and users. In the former case, the homogeneous nature of computing nodes in a cluster¹ simplifies computer maintenance. For consumers, APIs such as MPI and PVM provide a user API for parallel programming which is conceptually simple in terms of communication and synchronization between large numbers of cooperating processes. The benefits of low cost to cluster resource owners, and the advantages of a regular hardware platform for system administrators and system users, rest on some key assumptions about computing clusters, assumptions which cannot continue to hold if increasing performance continues to be achieved by purchasing an increasing number of increasingly powerful (and increasingly hot) machines.

Even if a practical limit is reached on the number of nodes under the control of any one provider, users can still acquire a large allocation by co-allocating nodes across multiple clusters. Sharing of nodes between clusters creates new problems in terms of security and policy for resource owners. Grid computing, first popularized in [10, 9], focuses on combating these problems by enabling cooperating resource owners to form goal-oriented virtual organizations. Grid computing also focuses on the policy problems in coordinating and aggregating distributed resources. This area of research is worth mentioning, as it overlaps with the focus of our own work: On-Demand computing. An On-Demand infrastructure requires fine-grained, timely control of a computing resource by the resource owner in conjunction with fairness to resource consumers.

In this paper, we present Maestro-VC (for "Virtual Cluster"). In short, Maestro-VC is a set of system software that creates an On-Demand environment by extending single-system virtualization to distributed systems. The programs comprising Maestro-VC run on top of a computing cluster and allocate entire virtual clusters, or collections of virtual machines, on top of the physical nodes of the resource. Because the virtualized machine interface appears identical at the machine and OS level to userspace code, standard applications can run unmodified under virtualization. This allows existing distributed scientific codes to transparently take advantage of features implemented at the VM level, such as checkpointing and migration².

Even assuming near-native speed in virtualized execution, however, significant performance problems resulting from an inefficient mapping of virtual to physical resources can still occur. To some extent, this inefficiency can be corrected by taking advantage of the virtualized environment. For example, network traffic between VMs in a cluster can be monitored, and nodes transparently migrated so as to reduce traffic on the physical network. In some cases, more information about virtualized code is necessary to provide optimal aggregate throughput for distributed tasks. We extend these advantages of information sharing to virtualized clusters by incorporating the concept of two-level scheduling to allow the exchange of "hints" between Maestro-VC and its virtualized clients. While there are advantages to such a dialog, it is important to note that this component of virtualized clusters (called the Local Scheduler in this paper) is completely optional. Thus all manner of existing parallel code, including MPI code, can run under virtualization and immediately take advantage of features such as checkpoint/restart. Later, this code can gradually be modified to optimize its performance under virtualization. This provides a low cost of entry to the developers of "conventional" parallel code in terms of the resources they must invest to take advantage of virtualization. At the same time, developers can gradually migrate their code to take advantage of the improved performance cooperative scheduling can provide.

The remainder of this paper is organized as follows: Section 2 describes On-Demand computing and explains some of the motivations and implications of work in this area. Section 3 lays out the conceptual architecture of Maestro-VC and summarizes the functionality of the system. In Section 4 we detail the implementation of our Maestro-VC prototype and discuss our strategies for VM allocation and management. We also discuss possible scalability and performance-oriented improvements to the existing system. Section 5 discusses the manner in which Maestro-VC enables a more secure computing infrastructure for resource providers. Performance-related experiments, results, and a discussion thereof make up Section 6. Section 7 describes related work in the areas of virtualization, distributed, and On-Demand computing. Last, we discuss long-term goals for Maestro-VC and conclude in Section 8.

¹ Most nodes in a typical HPC cluster fall into one of a few classes, such as compute node, login node, etc.

2 On-Demand Computing

On-Demand computing addresses the problem of computational resource sharing between different providers by treating computational cycles as a purchasable commodity and working to enable seamless access to such resources on demand. In other words, the goal of an On-Demand infrastructure is to give users immediate access to resources on the basis of their priority or the price they are willing to pay for use. By enabling users to easily access large amounts of resources, an On-Demand system will increase the attractiveness of "cycles on demand." This in turn will let dedicated resource providers become self-financing.

The full realization of an On-Demand computing economy faces a number of challenges. Three among these are security, reliability, and the heterogeneity of computing resources. Security is an obvious concern in any system where one's code can run on remote machines, and is dealt with in Section 5. Reliability can already be achieved in systems such as Charm [16] and Condor [1]. The run-time component of these systems can be called upon to periodically checkpoint a user's computation; upon a fault, the job is simply rolled back to the nearest checkpoint. The inherent problem with these systems is that their benefits are achieved at the application level. Code must be explicitly ported to use Charm's features, for example. Therefore, while a number of existing systems can provide checkpointing functionality, it is not a commonly used feature in the parallel code found today.

While most clusters consist of a large number of machines with identical configurations, allocations across clusters may be heterogeneous in terms of memory per node, processor speeds, or cluster interconnects. Therefore some mechanism is needed either for classifying jobs according to the machine type they run on, or for abstracting away these differences. We deal only with the case of a single physical cluster in this paper and leave inter-cluster allocation to future work.
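To make the reliability point above concrete, the application-level checkpoint/rollback pattern provided by systems such as Condor and Charm can be illustrated with a minimal sketch. This is an illustrative example only, not the API of either system; the helper names and checkpoint format are hypothetical.

import os
import pickle

CKPT = "job.ckpt"  # hypothetical checkpoint file

def init_state():
    # Stand-in for the application's real state.
    return {"partial_sum": 0.0}

def do_step(step, state):
    # Stand-in for one unit of the user's computation.
    state["partial_sum"] += step
    return state

def load_checkpoint():
    # After a fault, restart from the most recent checkpoint (rollback).
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return 0, init_state()

def save_checkpoint(step, state):
    # Write to a temporary file and rename so a crash never leaves a torn checkpoint.
    with open(CKPT + ".tmp", "wb") as f:
        pickle.dump((step, state), f)
    os.replace(CKPT + ".tmp", CKPT)

def run(total_steps=100, interval=10):
    step, state = load_checkpoint()
    while step < total_steps:
        state = do_step(step, state)
        step += 1
        if step % interval == 0:
            save_checkpoint(step, state)
    return state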

3 System Architecture

At a high level, the architecture of Maestro-VC is relatively simple, consisting of a handful of interacting components. Our vision for a cluster on which Maestro-VC will run is much like a cluster today, consisting of a small number of login or access nodes and a larger number of dedicated compute nodes. Depending on the cluster, dedicated storage nodes or a high-performance parallel filesystem may also be available. Our test setup consists of a small number of nodes of the first two types; the implications and tradeoffs of special-purpose storage are left to future work.

Maestro-VC control software is installed across all of the machines of a cluster, and can be thought of as consisting of two planes of management. Software which manages the starting, stopping, and other control of virtual machines (for example, setting or changing resource limits) is installed on compute nodes. For Maestro-VC, we use the Xen Virtual Machine Monitor [8] for hosting VMs. Xen uses a technique called paravirtualization to achieve high virtualized performance³. In Xen, two types of VMs run under the control of a hypervisor which manages and arbitrates access to computer resources: unprivileged VMs, which have no direct I/O device access, and the privileged domain 0, which handles all "real" I/O for itself and the other VMs sharing the machine. Domain 0 also runs management software for VM start, stop, and control. We build upon this software base to provide Maestro-VC's functionality. Higher-level software which performs start, stop, and control of entire virtual clusters, as well as software which interacts with users requesting resources, runs on an access node, which we refer to as the master node. In more detail, the components of Maestro-VC are as follows:

• Gateway - The Gateway handles all direct interaction with clients. Clients send resource requests to the Gateway via XML files describing a number of machine classes, where each class consists of a machine description and quantity. The Gateway is then responsible for informing the scheduling component of the incoming request. Responses to queries are handled in reverse order, with all interaction again coming through the Gateway.

• Global Scheduler - The Global Scheduler (GS) manages the resource reservation policies of a Maestro-VC cluster. That is, it has overall control over when virtual clusters are created and destroyed, and manages virtual cluster allocation to comply with local policies.

• Stager - This component has high-level control over the allocation of a virtual cluster. By allocation we refer to the setup of virtual disk devices for the VMs which will run on each machine, as well as any other VC-specific configuration.

• Allocator - The Allocator handles disk initialization for the virtual disk devices assigned to VC virtual machines. It can be viewed as a subcomponent of the Stager. We discuss the reasons for our conceptual separation of the Stager and Allocator when discussing optimizations in Section 4.

• Node Manager - The Node Manager (NM) is the set of software running on each physical machine, which we briefly described at the start of this section. As we stated, this software handles the creation, destruction, and control of individual VMs for one or more VCs. In addition, most of the work of disk allocation is handled by the NM. In other words, the Stager and Allocator can be viewed as cluster-level coordinators handling specific tasks.

• Local Scheduler (LS) - An optional component which runs in an unprivileged Xen domain inside a VC, this entity handles communication with the GS and can modify VC behavior based on events sent to it from the GS. All of this is done with the goal of improving overall VC performance or throughput by enabling the GS to gain useful information about the execution of tasks in a VC.

More details on the Maestro-VC system architecture can be found in [17].

² We use the Xen [8] Virtual Machine Monitor (VMM) as the basis for our implementation.
³ Though at some cost in transparency: guest OSs must be modified to run under Xen.

4 Implementation

Our initial implementation was structured as a distributed set of message-driven "actors" communicating via XML-RPC messages. Each actor consisted of a server which accepted messages and created threads to carry out tasks and send reply messages. Two features characterized the initial prototype: use of multithreading in each object, and the explicit separation of requests and responses into different messages. These features allowed for the potential handling of a large number of concurrent requests for cluster resources.

A number of factors contributed to our retreating from this heavily concurrent prototype implementation. First, the use of servers on each machine required changes to code in many files every time we wished to make any change in our messaging protocol. Secondly, the heavy use of messaging in a distributed setting required very careful planning in terms of synchronization, which again inhibited changes to our prototype. Finally, the amount of concurrency originally in the system was not necessary. As with the Xenoserver project, we expect (in fact our architecture implies) the use of co-located machines servicing user requests. However, the nature of the jobs we intend Maestro-VC to service is very different from that project: whereas [13, 11] mention the hosting of application services, NetBeans, and web services, our system is intended to accept traditional MPI jobs with strict limits on wall clock time⁴.

The above factors motivated us to take a more centralized approach in the later development of Maestro-VC. The only communication over XML-RPC now takes place between the clients and the Gateway, and between the Gateway and the Local Scheduler. While software is still installed on all nodes of a cluster, the functionality of the Node Managers is now activated via simple ssh commands. This has the advantage of being both more flexible and easier to secure than our previous strategy.

4.1 Disk Allocation

Guest VM instances on a Xen system use virtual block devices, or VBDs, as virtual disks or partitions. The guest OSs work with a simplified block driver interface, while the privileged domain handles backend access. This means that the backend for a VBD can be either a dedicated partition, an LVM volume, or even a file. For performance reasons, and because of the limited disk space on our nodes, we allocate fixed, dedicated partitions on each physical compute node for use as virtual disk backends. In our configuration, we use four such dedicated partitions per physical machine: our testbed compute nodes host up to two guest VMs apiece⁵. As part of the security requirements for this project, we wish to be able to reinitialize a partition after every use, so that no data is unintentionally leaked between successive clients on a physical node. In addition, the guest OSs supported by Xen support a wide variety of filesystem types, so a method is needed for quickly placing new filesystem images on compute node disks⁶.

We experimented with two simple methods for quickly initializing disks. Our initial efforts were influenced by a number of factors, primary among these our exclusive use of the Xen port of Linux as the OS for guest VMs, and our use of the Debian Linux distribution both on the head nodes of our cluster and in the privileged domains of the compute nodes. Our familiarity with Debian made it the logical choice as our initial guest domain candidate as well. Our preliminary strategy for node allocation was to use a caching proxy for Debian package repositories on the head node and the standard debootstrap program to install a minimal guest VM system. Following this, a set of scripts would perform additional package installation per user request, and finally perform virtual-cluster-specific configuration. Performing this operation in parallel worked well until the bandwidth limits of the head node (which also doubled as a storage node) were reached, which was sufficient for our tests. However, the proxy software we used displayed stability issues under high load, prompting us to adopt a different strategy.

Using our current method, the debootstrap tool is first used to initialize and configure a single guest image. To save time on subsequent configuration, we pre-install a number of packages (compilers, debuggers, and parallel libraries such as the MPICH distribution of MPI, to name a few) which we feel will commonly be desired on a node, given our intended audience. The generically configured image is then archived and compressed, and exported via NFS to the privileged domains on the compute nodes of our test cluster. Upon first downloading an image, each node caches it locally so that subsequent setups will not tax the image storage server. Virtual-cluster-specific configuration is then performed as before. Given that we expect the parallel jobs for which virtual clusters are assigned to have lifetimes on the order of hours to weeks or more, we feel that a good target for completely initializing a virtual cluster is five minutes. For small clusters, the NFS-based solution is sufficient to prime the compute-node-local caches, after which virtual cluster setup proceeds rapidly. For larger clusters, the bandwidth required from the machine serving VM disk images ensures that a naive parallel NFS copy is not a scalable solution. Instead, initialization through a distribution tree is necessary to maintain scalability.
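The staging path just described (copy the pre-built, compressed guest image from the NFS export into a node-local cache, then unpack it onto the partition backing a guest's VBD) can be sketched as follows. This is a minimal illustration under assumed paths and helper names, not the actual Maestro-VC scripts; VC-specific configuration would still be applied afterwards.

import shutil
import tarfile
from pathlib import Path

NFS_IMAGE = Path("/mnt/images/guest-base.tar.gz")            # hypothetical NFS-exported base image
LOCAL_CACHE = Path("/var/cache/maestro/guest-base.tar.gz")   # node-local copy primed on first use

def stage_partition(mount_point):
    """Unpack the cached base image onto a freshly formatted, mounted guest partition."""
    if not LOCAL_CACHE.exists():
        # First request on this node: pull the image once so later VCs skip the NFS server.
        LOCAL_CACHE.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(NFS_IMAGE, LOCAL_CACHE)
    with tarfile.open(LOCAL_CACHE, "r:gz") as image:
        image.extractall(mount_point)

# Example: stage the root filesystem for one guest VM onto its dedicated partition,
# assuming the partition has already been formatted and mounted at this path.
stage_partition("/mnt/vbd-root")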

4.2 Imaging Optimizations

Further speed gains are possible in our prototype, given that we can currently make the assumption that all guest VMs will run Linux. To reduce the time necessary to image a guest VM, we decouple the two phases of staging. Allocation (the copying of image data from the storage server or local cache) is done immediately after the last VM exits and frees a partition. This leaves only VC-specific configuration when a new VC is being constructed. In the case that the default set of parallel libraries is sufficient, only staging of application input is required before a job may start. In the under-utilized case, this optimization results in much shorter VC startup times, while in an over-utilized cluster, performance is no worse than in the unoptimized case. In larger clusters with a variety of guest types, this method can still provide a benefit. After a VM exits, a "guess" can be made as to the appropriate filesystem type for the next guest VM, based on a statistical analysis of previous requests to that cluster. Again, performance is never worse than in the unoptimized case, as a wrong "guess" is corrected by simply reinitializing the partition.

One valid question, given that VMs from different virtual clusters (assigned to different entities) can exist on the same machine, is what happens to a running guest VM when an initialization is in progress on the same node. [8] shows that even with a simple round-robin scheme for servicing virtual device requests between VMs on the same physical machine, isolation is maintained between virtual machines. That is to say, a machine with a 50% CPU reservation proceeds at approximately half the speed of a job with similar device usage and a 100% reservation. This holds even if one virtual machine is performing extremely I/O-intensive work⁷. It is software running in the privileged domain 0 VM which is responsible for imaging disks, so to limit the impact of management and control I/O, domain 0 can lower its own CPU reservation before beginning to stage disk data in the case that other guest VMs are already running on that machine.

⁴ Certainly, Maestro-VC can service requests for individual machines, and with provisions for automated renewal of resource requests, long-running services can be hosted on our platform. The availability of dedicated high-speed interconnects on HPC clusters may justify a premium on rates charged for resource usage on such a Maestro-VC cluster, however, which may largely preclude the running of services which do not explicitly need such interconnects on these systems.
⁵ Each machine requires one partition for a root disk and one for a swap disk.
⁶ It goes without saying that speed is important here, in order to maintain high cluster utilization in times of great demand.

4.3 Network Allocation

The discussion above regarding I/O isolation is relevant to network I/O as well, since both disk and network I/O are implemented using the same low-overhead zero-copy page exchange mechanism. Moreover, this mechanism is shown to display low overhead at network speeds far greater than that achievable via the 100Mbit Ethernet cards installed in our test platform [19]. As is the case with disk I/O, Xen is noted (in [8]) to use a round-robin packet selection scheme to deal with contending VMs on the same physical machine. Therefore we expect reasonable network isolation to be maintained in the case of VMs sharing a host.

Network addresses are assigned by virtual cluster. Currently we use the 10.0.0.0 class A non-routable IP address range for compute nodes on our test cluster, and perform NAT through the head node. The 10.94.0.0 class B subnet is reserved for all physical and virtual hosts. We further divide this class B network as follows: the 10.94.0.0 class C subnet is reserved for physical nodes, and every other address of the form 10.94.x.y represents host y of VC x. For larger clusters this mapping is limiting; a more robust solution would allow arbitrary IP address assignments to VMs.

5 Security

No discussion of a system designed to encourage On-Demand computing is complete without a discussion of security. Virtualization may be used to enhance protection in a computing cluster by isolating physical resources from direct access by users. This protects both resource maintainers and other users from potentially malicious users. The security isolation and control achievable over users' VMs with virtualization may also provide some benefit in the context of an On-Demand environment. In addition to security protection, a VMM provides fine-grained resource control over guest VMs. This type of control is needed in an On-Demand system where jobs must be allocated resources as a function of priority or payment.

5.1 Disk and Filesystem Security

As noted in Section 4.1, the disks or partitions seen by virtual machines in Xen are virtual block devices (VBDs). The important feature of virtual disks is that access to them, like access to other virtual resources for a Xen VM, must be explicitly granted by the privileged domain 0. This is important from a security standpoint: even if a root-level compromise of an unprivileged VM were to occur, filesystem damage would be limited to a known subset of the physical system's resources. In the case that a compromise is detected, the offending VM can be killed and its contents safely examined by a system administrator or automated software from the privileged domain. Even in the case that a compromise is undetected, however, the threat can be limited or eliminated by simply reinitializing the guest VM partition. This can also prevent potential information leakage between successive users of a VM partition.
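The partition reinitialization mentioned in Sections 4.1 and 5.1 amounts to overwriting the departed guest's block device before the next image is staged onto it. A minimal sketch of the wipe step is shown below; the device name is hypothetical, and a real deployment would follow this with the staging procedure of Section 4.1.

def wipe_partition(device, chunk_mb=4):
    """Overwrite a guest VBD backend with zeros so no data leaks to the next tenant."""
    zeros = bytes(chunk_mb * 1024 * 1024)
    with open(device, "wb") as dev:
        try:
            while True:
                dev.write(zeros)   # keep writing until the device is full
        except OSError:            # "no space left on device" marks the end of the partition
            pass

# Example: scrub the partition that backed a departed guest VM (hypothetical device name).
wipe_partition("/dev/sda5")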

5.2 Network Security

Strict network control in Xen is possible at low overhead since all traffic flow to a "real" network must be gated through a privileged domain. Our goal with regard to network security is inspired by the paravirtualization employed by Xen at the single-system level: guest VMs inside virtual clusters can be aware of their virtualized nature, but cannot circumvent the sandboxed environment in which they run at the expense of other virtual clusters.

In our current configuration all traffic is bridged to the guests on each physical host, as in Figure 1(a), via a software bridge in the management domain. This setup may be undesirable for a production system, as malicious hosts may be able to sniff the traffic of other hosts on the same physical machine, including hosts in other VCs. A more secure configuration is to assign static routes to guest domains, as in Figure 1(b). With static routing, all traffic is routed through the management domain, which can use standard firewalling rules to prevent packet sniffing and to keep malicious and spoofed packets from leaving the physical machine. Moreover, this intercession is guaranteed for all packets leaving an unprivileged domain. With static routing, it should therefore be possible for VMs within a VC to communicate with each other with no encryption, and be assured that VCs sharing the physical cluster will not be able to see this data.

⁷ This is according to the tests in the mentioned work. We note, however, that according to the Xen user manual [2], the default Xen scheduler (the Borrowed Virtual Time scheduler) is known "to penalize I/O intensive domains." As domains executing scientific codes can quite often be I/O intensive, we are exploring the use of various schedulers when executing more than one guest VM on a physical machine.
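Combining the addressing scheme of Section 4.3 with the routed configuration above, the management domain can derive each guest's address from its VC number and host index and install a per-guest host route, behind which standard firewall rules can then be applied. The sketch below is illustrative only: the virtual interface name is hypothetical and Maestro-VC's actual tooling may differ, though the route command shown is ordinary iproute2 usage.

import subprocess

def vc_address(vc_id, host_index):
    """Host host_index of virtual cluster vc_id gets 10.94.<vc>.<host> (Section 4.3)."""
    if not (1 <= vc_id <= 255 and 1 <= host_index <= 254):
        raise ValueError("VC id or host index outside the 10.94.x.y mapping")
    return "10.94.%d.%d" % (vc_id, host_index)

def add_guest_route(vc_id, host_index, vif):
    """Route one guest's address to its virtual interface in the management domain."""
    addr = vc_address(vc_id, host_index)
    # In the routed (non-bridged) configuration every packet to or from the guest
    # crosses domain 0, where firewall rules can block sniffing and spoofing.
    subprocess.run(["ip", "route", "add", addr, "dev", vif], check=True)

# Example: host 3 of VC 7, attached to a hypothetical virtual interface name.
add_guest_route(7, 3, "vif7.0")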

6 Experiments and Results

6.1 Experimental Setup

Our test setup consists of two racks of IBM eServer nodes which we configured as an "enclosed" cluster. One master node on each rack acts as a gateway to the compute machines and performs NAT for these machines to the outside network. Each rack contains 30 1U nodes, each a dual-processor 1GHz Pentium III Xeon machine. Each 1U machine is equipped with 1.5GB of memory, an 18GB SCSI disk, and two 100Mbit Ethernet ports. In addition, the nodes contain special-purpose low-latency Myrinet 2000 cluster interconnects, which allow 2+2 Gb/s full-duplex peak bandwidth between machines. Each rack also houses two management nodes. These contain CPUs identical to the compute nodes, but are equipped with more RAM (2GB), a RAID controller, and an additional Gigabit Ethernet interface.

We chose to dedicate one cluster to virtualization and one to native execution for our comparison. Due to the age of some of our equipment, we were unable to make either cluster fully functional. On the virtualized cluster, plutonium.cs.uiuc.edu, only one head node and 23 compute nodes are functional, while on the other cluster, uranium.cs.uiuc.edu, both head nodes and 26 compute nodes work properly. In order to keep our experimental comparisons between native and virtualized execution fair, we configured both clusters identically to as great a degree as possible. On both clusters, the head node performs NAT for the internal network, and the same machine exports user home directories over NFS. The extra node on Uranium, our "native" rack, is used for NFS-mounted scratch space and is not used in our experiments. Both head nodes run the Debian distribution of Linux, and cluster nodes on each are centrally installed and maintained via the FAI (Fully Automated Installation) toolkit.

On Plutonium, our virtualized rack, privileged domains on compute nodes occupy 128MB of available RAM, plus the overhead from Xen. We run our tests with 512MB or 1024MB RAM for each VM. Both the privileged and unprivileged domains under virtualization use Linux 2.6 kernels patched for use with Xen, specifically kernel version 2.6.11.12. For a fair comparison with the natively executing nodes, we installed a single-processor kernel on those machines and booted them with only 512MB or 1024MB of RAM visible via a kernel parameter. For the native nodes, we used the 2.6.8 version of Linux, as it was a standard Debian package.

A major limitation of our testing comes from hardware incompatibility: we were unable to get the Myrinet cards up and running on our virtualized nodes with any version of Xen we tested. We had hoped to get the Myrinet cards running in domain 0; this, in conjunction with the IP-over-Myrinet driver supplied in the Myrinet driver distribution, would have provided a high-speed network backend for guest VMs. While this scheme would most likely have erased the latency benefits of Myrinet, it would also have been a good test of VM performance under heavy I/O load. As it stands, we were only able to use the 100Mbit Ethernet interfaces in our tests, one per node.

We ran a number of tests comparing virtualized against unvirtualized execution, as well as different virtualization schemes. We used a subset of the NAS Parallel Benchmarks (NPB) for testing. NPB consists of a variety of tests which can be run on a number of datasets of different sizes. There are six classes of problem size: S (used only for testing), W, and A-D, which are strictly increasing in size. We chose problem class B for our tests; smaller problem sizes do not give informative benchmark results, but on the other hand, time constraints required us to scale back our testing somewhat. One consequence of our chosen problem size is that memory pressure does not seem to be an issue: we obtain near-identical results when testing a given setup and varying only the memory allocated to machines or VMs. A summary of the NPB benchmarks we tested follows; more detailed descriptions of the algorithms used are given in [7, 15, 22].

• EP - Embarrassingly Parallel - This benchmark consists of a floating-point computation with almost no communication, and as such tests the peak floating-point performance of a cluster.

• CG - Conjugate Gradient - This computes an approximation to the smallest eigenvalue of a sparse matrix using the conjugate gradient method. The nonzero entry locations in the matrix are chosen randomly, so that this benchmark tests "unstructured computation and communication," as noted in [15].

• IS - Integer Sort - Tests integer performance as well as communication, as a large amount of data is moved in this benchmark.

• MG - Multi-Grid - "Uses a V-Cycle Multi-grid method to compute the solution of a 3D scalar Poisson equation," according to [15]. This benchmark tests highly structured remote and local communication.

• LU - Lower-Upper Gauss-Seidel - This is a sample application, designed to test a wider variety of hardware features than the above benchmarks.

[Figure 1. Bridged and routed network configurations for virtualized guests. (a) Bridged configuration: guest VMs attach to the cluster network through a software bridge in domain 0. (b) Routed configuration: static routes in domain 0 connect each guest VM to the cluster network.]

[Figure 2. Staging times with and without cached disk images. Running time (seconds) versus number of nodes (1-23), with one curve for staging from a cached image and one for staging with no cached image.]
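The measurements reported in the next subsection were obtained by running each benchmark three times and averaging. A driver along the following lines would reproduce that procedure; it is a hypothetical sketch (binary names follow the usual NPB convention of benchmark.class.nprocs, and MPICH's mpirun is assumed), not the scripts we actually used.

import subprocess
import time

def run_once(bench, nprocs, problem_class="B"):
    """Time one NPB run, e.g. 'cg' on 16 processes -> mpirun -np 16 ./cg.B.16."""
    binary = "./%s.%s.%d" % (bench, problem_class, nprocs)
    start = time.time()
    subprocess.run(["mpirun", "-np", str(nprocs), binary], check=True)
    return time.time() - start

def average_runtime(bench, nprocs, repeats=3):
    """Average wall-clock time over several runs, as done for the results below."""
    return sum(run_once(bench, nprocs) for _ in range(repeats)) / repeats

for bench in ["ep", "cg", "is", "mg", "lu"]:
    for nprocs in [1, 2, 4, 8, 16]:
        print(bench, nprocs, round(average_runtime(bench, nprocs), 1), "seconds")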

6.2 Experimental Results

We first tested the VM staging strategy described above, for varying numbers of nodes up to 16, and in two configurations. In the first configuration, we arranged for no cached disk image to be available on a compute node. In this case, the disk image is copied to the compute node before imaging of a VM partition begins, so the additional overhead in these tests (versus the cached tests) is due to remote copying. In both cases, one VM was allocated and staged per physical node.

For a given virtual cluster, all disk staging is done in parallel. It can be seen from Figure 2 that for up to 8 nodes, the staging time remains relatively constant. This can be attributed to our testbed setup, as the head node's Gigabit interface is able to saturate multiple 100Mbit-equipped compute nodes. At 16 nodes, we begin to see a drop in performance. Testing with more nodes confirms that performance drops steadily after the Gigabit link is saturated, as expected. For the second test, we warmed the VM disk cache by pre-copying the disk image from the master to all compute nodes and then ran the same tests as before. Two things can be observed in Figure 2: first, the overall staging time for small numbers of nodes is shorter, since no initial copying from the master node occurs in this case; second, the staging time remains relatively constant as the number of nodes exceeds 16⁸.

For the computational tests we ran each of the benchmarks described above on a number of different machine configurations. Each benchmark was run three times and the average over all results was taken. First, native and virtualized execution were compared when using a single processor per machine. It was originally our intent to test both native and virtualized machines with 1024MB of RAM; however, the Xen Linux kernels we used had no high-memory support compiled in and could only address 896MB of RAM. To see the effect of memory size on our test results, we first tested identical virtual machines, one per physical node, with varying amounts of memory, in this case 896MB versus 512MB, as shown in Figures 3(a)-3(e). In general, execution times are very similar, as expected. On the single-node tests, execution time on the 512MB VMs is equal to or slightly longer than on VMs with 896MB RAM. This may be due to memory limitations in the VM with 512MB of memory in the single-node case. Somewhat surprisingly, execution times in cases where more than one VM is tested are equal to or slightly lower on the VMs with less memory. These results may be statistically insignificant, however, so more testing is required before any statements can be made on these figures.

Our tests of native versus virtualized execution had more interesting results. For these tests, we booted the natively executing machines with a uniprocessor kernel and set visible memory to 1024MB via a kernel boot parameter. The results were compared to those achieved under virtualization, with one VM per compute node and 896MB RAM per VM. These can be seen in Figures 4(a)-4(e). Looking for general trends in the graphs, we see that for the single machine/single VM case, virtualized execution is faster in all tests. From Figure 4(a), we see that in the conjugate gradient test, virtualization yields a test completion time 7.5% faster than native execution. In general, the advantage enjoyed by virtualization at lower node counts is erased and/or reversed at higher node counts.

There are a number of explanations for these observations. The most likely is processor usage. While Xen version 2 does not support SMP guest domains, the VMM itself supports SMP machines. By default the privileged domain runs on processor 0, whereas Xen schedules the guest domain on processor 1. Guest domains see an idealized device interface, so a not insignificant amount of I/O and processing is done in parallel with the guest VM in the virtualized case. Though a fairer test is needed here, we encountered stability issues when attempting to enforce fairness by "pinning" all domains to CPU 0. As for the change in VM performance when more nodes are used, we suspect that Xen's interrupt behavior may be responsible. Xen uses a lightweight event system in place of interrupts for virtualized guests [8, 5], and events to guests can be batched to increase inter-domain throughput. This may be happening in our tests at the expense of communication latency. Again, further testing is required to confirm these observations.

7 Related Work

On-Demand computing refers to on-demand access to resources: immediate access, or (if resources are oversubscribed) access reflecting job priorities or workflow-based deadlines. Using virtualization in the context of a computing cluster, virtual machines can be demand-booted on different physical machines. The difference between this and unvirtualized systems such as Cluster-on-Demand [20] is that the latter does not address on-demand scheduling of resources. Our goal is to address this very issue, scheduling resources using virtualization, by subdividing physical resources into virtual machines that can meet on-demand requirements.

Much research exists to realize virtualization, a concept which originated in the 1960s [12, 21]. Systems such as IBM's System/370 [23, 6] have characteristics which allow CPU and I/O virtualization to be carried out in hardware, while the commodity x86 architecture has until recently lacked this support⁹. To address the shortcomings of non-virtualizable architectures such as the x86, two techniques have emerged: full-system virtualization¹⁰ and paravirtualization¹¹.

⁸ In fact, the staging time does grow slowly with VC size, as some configuration files and packages must still be downloaded from the master in the cached case.
⁹ Intel processors have begun shipping with full virtualization support, while AMD processors with this feature are expected to ship later this year.
¹⁰ For lack of a better term which is not overly verbose, we borrow this terminology from [24].
¹¹ Two other methods deserve mention: emulation and what is called OS-level virtualization in [24]. The first we mention only for completeness; the low performance realized by fully simulating a processor in software, as done in an emulated setting, cannot justify its use in an HPC context. OS-level virtualization involves building isolation mechanisms into an existing OS. While this is more efficient than virtualized or paravirtualized systems, as kernel memory pages are shared between isolated processes, it is also less secure (a root-level compromise still endangers the whole system). Furthermore, OS-level isolation is usually implemented for security purposes; performance isolation must be dealt with by other mechanisms.

[Figure 3. Virtualized execution: VMs with 896MB RAM vs. 512MB RAM. Running time (seconds) versus number of nodes (1-16) for (a) Conjugate Gradient, (b) Embarrassingly Parallel, (c) Integer Sort, (d) LU Decomposition, and (e) Multi-Grid.]

[Figure 4. Native execution (1024MB RAM) vs. virtualized execution (896MB RAM). Running time (seconds) versus number of nodes (1-16) for (a) Conjugate Gradient, (b) Embarrassingly Parallel, (c) Integer Sort, (d) LU Decomposition, and (e) Multi-Grid.]

Full-system virtualization is exemplified by the popular VMWare [3] software. To as great an extent as possible, virtualized code in such systems is directly executed on the native machine. Userspace code can be run with no modification, but calls into kernel CPU mode are trapped; virtualized execution is then achieved via binary translation of non-virtualizable code sequences into explicit calls to the VMM. Caching of translated code segments can amortize the cost of translation so that fully virtualized code can approach native code in performance. I/O-intensive applications can experience heavy CPU utilization due to the need to faithfully emulate hardware for hosted VMs; advanced techniques as found in [25] are required to lessen this impact¹².

Paravirtualization solves the same problem as full-system virtualization. Instead of performing run-time binary translation, however, the guest OS is modified at the source level to replace problematic CPU instructions with explicit jumps into a VMM. Paravirtualized systems thus sacrifice the transparency of fully virtualized systems, in varying degrees, for the sake of more efficient utilization of resources on non-fully-virtualizable architectures. The most notable systems are Denali [26] and Xen [8]. Paravirtualized systems have a performance advantage over those such as VMWare; once virtualization is made explicit in hosted OSs, additional performance boosts can be achieved by limiting the additional context switches incurred by jumps into the VMM. In Xen, page table update batching and delayed interrupt handling are two examples of such techniques.

While it does not encompass virtualization in the manner referred to above, the Charm [16] system uses what is called processor virtualization. Programs in Charm are written as a set of intercommunicating message-driven objects called Chares, which represent the units of work in a Charm program. The Chare abstraction allows the Charm runtime system to transparently migrate work from one machine to another, either due to machine faults or for performance reasons (i.e., load balancing). Our work seeks to bring benefits similar to those of Charm to a wider array of distributed applications.

A number of other researchers have investigated systems to enable sharing of distributed computing resources for the purposes of Grid and On-Demand computing. In-VIGO [4] is a system for enabling "virtual computing grids," which builds on virtualized resource management mechanisms to present virtualized "grid sessions" to users. In-VIGO encompasses a stack of software whose scope is larger than that of our own work, but the underlying virtual machine management software, VMPlants [18], is similar to our own VM creation facility. SODA [14] is a virtualization-based system which targets application service providers. SODA uses virtualization in the form of User-Mode Linux.

Probably the most similar system to our own in terms of intention and functionality is another system based on Xen, Xenoserver [13, 11]. Similar to on-demand computing, Xenoserver seeks to enable a concept called utility computing, a self-financing computing infrastructure where users can lease resources on remote machines, and where conversely resource owners can safely execute such remote codes. As in our system, clients can poll a number of entities (here called Xenoservers) for available resources and submit requests. The Xenoserver work has a broader scope than our own, as considerations are made for pricing strategies and compensation. However, the system seems targeted at allocations at the individual-VM level. For example, latency tolerance is mentioned as a problem for applications run on geographically distant machines, but the proposed solution is to locate geographically close remote machines by polling a set of Xenoservers. [13] mentions that a usage scenario for Xenoserver-leased VMs is as a hosting platform for web servers, application servers, and Grid Services. Generally speaking, Xenoserver seems unsuitable as a platform for high-performance parallel jobs.

8 Conclusion and Future Work

In this paper, we have described Maestro-VC, a system for creating secure On-Demand virtual clusters, and have examined the performance of several scientific codes running under virtualization. Our primary goal for the near future is to perform rigorous testing of Maestro-VC with varied test parameters and newer versions of Xen. In particular, we wish to see the performance overhead incurred when running multiple virtual machines per physical node, in order to determine what kinds of compute jobs can benefit the most from virtualized execution, and what VM scheduling policies result in the best performance for jobs with varying requirements for I/O throughput, I/O latency, and computational power.

Beyond raw performance comparisons, we wish to further develop the features of Maestro-VC we feel will most benefit HPC resource users. First, we wish to explore features transparently applicable to unmodified distributed jobs. From experimentation, we have determined that it is possible to pause MPI jobs running over a TCP/IP transport. Our future plans are first to explore the feasibility of saving and restoring jobs in such a paused state, making checkpoint/restart possible for all manner of distributed jobs running over TCP/IP¹³. Future work will also cover performance benefits achieved by exchanging information via the GS-LS interface described briefly in Section 3. Finally, we also wish to develop a monitoring interface for guest VMs in a Maestro-VC cluster. Our initial experiments show that latency may be a key issue affecting virtualized distributed job performance. By monitoring the amount of I/O occurring between various VMs in a VC, it should be possible to improve overall job performance by migrating heavily communicating VMs onto the same physical machine, or onto machines separated by a low network latency, to increase throughput.

¹² To be fair, this work covered the hosted versions of VMWare, where an underlying OS was used for device access. This was another source of I/O overhead.
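As a rough illustration of the traffic-aware placement policy sketched above, the Global Scheduler could periodically inspect a matrix of bytes exchanged between VM pairs and nominate the most heavily communicating pair that currently resides on different physical hosts for co-location. This is a hypothetical sketch of the idea, not an implemented Maestro-VC feature; the names and threshold are invented for illustration.

def migration_candidate(traffic, placement, threshold=10**9):
    """Pick the busiest VM pair currently split across physical hosts.

    traffic   -- dict mapping frozenset({vm_a, vm_b}) to bytes exchanged in the last interval
    placement -- dict mapping vm name to physical host
    Returns a pair worth co-locating, or None if no split pair exceeds the threshold.
    """
    best, best_bytes = None, threshold
    for pair, nbytes in traffic.items():
        a, b = tuple(pair)
        if placement[a] != placement[b] and nbytes >= best_bytes:
            best, best_bytes = (a, b), nbytes
    return best

# Example with invented numbers: vm1 and vm3 communicate heavily but sit on different hosts.
traffic = {frozenset({"vm1", "vm2"}): 2 * 10**8,
           frozenset({"vm1", "vm3"}): 6 * 10**9}
placement = {"vm1": "node01", "vm2": "node01", "vm3": "node02"}
print(migration_candidate(traffic, placement))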

References

[1] The Condor Project. http://www.cs.wisc.edu/condor/index.html.
[2] Xen v2.0 User's Manual. http://www.cl.cam.ac.uk/Research/SRG/netos/xen/readmes-2.0/user/user.html.
[3] VMWare web site. http://www.vmware.com/, 2005.
[4] S. Adabala, V. Chadha, P. Chawla, R. Figueiredo, J. Fortes, I. Krsul, A. Matsunaga, M. Tsugawa, J. Zhang, M. Zhao, L. Zhu, and X. Zhu. From Virtualized Resources to Virtual Computing Grids: the In-VIGO System. Future Generation Computer Systems, 21, April 2005.
[5] P. Barham, B. Dragovic, K. Fraser, S. Hand, A. Ho, and I. Pratt. Safe Hardware Access with the Xen Virtual Machine Monitor. In 1st Workshop on Operating System and Architectural Support for On-Demand IT Infrastructure, May 2004.
[6] R. J. Creasy. The Origin of the VM/370 Time-Sharing System. IBM Journal of Research and Development, September 1981.
[7] R. Van der Wijngaart. NAS Parallel Benchmarks v. 2.4. NAS Technical Report NAS-02-007, October 2002.
[8] B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, I. Pratt, A. Warfield, P. Barham, and R. Neugebauer. Xen and the Art of Virtualization. In Proceedings of the ACM Symposium on Operating Systems Principles, October 2003.
[9] I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. Open Grid Service Infrastructure Working Group, Global Grid Forum (GGF), June 2002.
[10] I. Foster, C. Kesselman, and S. Tuecke. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International Journal of Supercomputer Applications, 15(3), 2001.
[11] K. Fraser, S. Hand, T. Harris, I. M. Leslie, and I. Pratt. The Xenoserver Computing Infrastructure: a Project Overview. Technical Report, 2002.
[12] R. Goldberg. Architecture of Virtual Machines. In Proceedings of the AFIPS Computer Conference, July 1973.
[13] S. Hand, T. Harris, E. Kotsovinos, and I. Pratt. Controlling the Xenoserver Open Platform. In Proceedings of the IEEE Conference on Open Architectures and Network Programming (OPENARCH), 2003.
[14] X. Jiang and D. Xu. SODA: a Service-on-Demand Architecture for Application Service Hosting Utility Platforms. In International Symposium on High Performance Distributed Computing (HPDC), 2003.
[15] H. Jin, M. Frumkin, and J. Yan. The OpenMP Implementation of NAS Parallel Benchmarks and its Performance. NAS Technical Report NAS-99-011, October 1999.
[16] L. Kale and S. Krishnan. CHARM++: A Portable Concurrent Object Oriented System Based On C++. In Proceedings of the Conference on Object Oriented Programming Systems, Languages and Applications, September-October 1993.
[17] N. Kiyanclar, G. A. Koenig, and W. Yurcik. Maestro-VC: On-Demand Secure Cluster Computing Using Virtualization. In 7th LCI International Conference on Linux Clusters, 2006.
[18] I. Krsul, A. Ganguly, J. Zhang, J. Fortes, and R. Figueiredo. VMPlants: Providing and Managing Virtual Machine Execution Environments for Grid Computing. In Proceedings of the IEEE/ACM Supercomputing Conference (SC), 2004.
[19] A. Menon, J. R. Santos, Y. Turner, G. Janakiraman, and W. Zwaenepoel. Diagnosing Performance Overheads in the Xen Virtual Machine Environment. In Proceedings of the 1st ACM Conference on Virtual Execution Environments, Chicago, IL, June 2005.
[20] J. Moore, D. Irwin, L. Grit, S. Sprenkle, and J. Chase. Managing Mixed-Use Clusters with Cluster-on-Demand. Duke University Department of Computer Science Technical Report, 2002.
[21] G. Popek and R. Goldberg. Formal Requirements for Virtualizable Third Generation Architectures. Communications of the ACM, 17(7), July 1974.
[22] S. Saini and D. H. Bailey. NAS Parallel Benchmark (Version 1.0) Results. NAS Technical Report NAS-96-18, November 1996.
[23] L. H. Seawright. VM/370 - A Study of Multiplicity and Usefulness. IBM Systems Journal, January 1979.
[24] J. Smith and R. Nair. Virtual Machines: Versatile Platforms for Systems and Processes. Elsevier Press, 2005.
[25] C. Waldspurger. Memory Resource Management in VMWare ESX Server. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), 2002.
[26] A. Whitaker, M. Shaw, and S. Gribble. Scale and Performance in the Denali Isolation Kernel. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), 2002.

¹³ Though robust systems using UDP or other unreliable transports should survive a checkpoint/restart using this method as well.
