Network performance in virtual machine infrastructures

Alex Giurgiu

February 6, 2010

Coordinators: Paola Grosso & Rudolf Strijkers

Abstract

This paper presents the network performance of Amazon EC2 virtual machines and the limitations of their network. The tests measured the latency, throughput and hop count between virtual machines in the same datacenter, and between virtual machines and a server outside the Amazon datacenter. The results show that the variability of the network performance is quite high, while the flexibility of the network is very low. Some possible solutions for the existing problems are discussed.


Contents

1 Introduction
  1.1 Research question
  1.2 Cloud computing
  1.3 Xen networking concepts
    1.3.1 Bridged networking
    1.3.2 Routed networking
    1.3.3 Virtual networking
  1.4 Experimental tests
  1.5 Approach

2 Running the tests
  2.1 Network topology
  2.2 Network isolation
  2.3 Security implications
  2.4 Network performance
    2.4.1 Latency
    2.4.2 Bandwidth
    2.4.3 Packet loss and jitter

3 Results summary
  3.1 Protocols
  3.2 Security
  3.3 Network performance

4 Possible solutions

5 Conclusions

6 Acknowledgements

A The test suite


1 Introduction

With the popularity of cloud computing on the rise and a number of companies starting to offer computing resources on demand, it is worth finding out the current state of their offerings in terms of network performance and flexibility. Resource sharing in the cloud is done by partitioning physical servers into multiple virtual machines and renting them out based on the users' needs. While resources like the CPU and memory are virtualized quite well, networking is an aspect that hasn't received much attention, making it an interesting research topic.

1.1 Research question

The goal of this project is to find out what level of network performance is offered by commercial cloud services like Amazon EC2, and whether any limitations are imposed by their network infrastructure. The following questions paint a more specific picture of what I am trying to find out:

• Are the virtual machines logically separated from a networking point of view? Does data leak between virtual machines residing on the same local network? This would have serious security implications.

• What is the level of performance in terms of network bandwidth and latency between virtual machine instances? Do we get best-effort performance, and if so, how good or bad is the QoS?

• Is the network performance consistent over a number of instantiations of two virtual machines in the same datacenter, or is it influenced by the network usage of other virtual machines and by the location where the virtual machines get created?

• Does the network topology change for each new pair of virtual machine instances?

1.2 Cloud computing

There are many definitions for cloud computing, but in this project the focus is on public cloud services that offer infrastructure as a service (IaaS). Usually IaaS providers make use of virtualization technologies in order to ease the process of resource allocation and make it dynamic, transparent and easy to use. Cloud services offer several advantages over more traditional infrastructures: easy scalability, high availability, cost effectiveness, almost no upfront investment, easy automation through APIs, etc. These services are particularly interesting for users or companies that don't want to make an initial investment in an IT infrastructure but still want to be able to scale in case of expansion. As of this writing there are many operators that provide IaaS services, including Amazon AWS, Joyent, Microsoft Azure, GoGrid, FlexiScale, Rackspace Cloud, ElasticHost and others.


One of the oldest and most widely used cloud computing service providers is Amazon, which introduced the EC2 platform in August 2006, offering computing power in the form of virtual machines based on the Xen hypervisor, called instances. There are 3 standard instance types offered by Amazon:

• Small Instance: 1.7 GB of memory, 1 EC2 compute unit, 160 GB of local storage, 32-bit platform

• Large Instance: 7.5 GB of memory, 4 EC2 compute units, 850 GB of local storage, 64-bit platform

• Extra Large Instance: 15 GB of memory, 8 EC2 compute units, 1690 GB of local storage, 64-bit platform

One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor[2]. There are other instance types, some designed for high memory usage and others for high CPU usage, but the ones of special interest to us are the standard ones, and especially the Small Instances, because those are the most popular. Because of their high usage we chose to run all the tests on Small Instances, increasing the chances of being allocated virtual machines on servers with high resource usage (CPU, memory, network), conditions that are important to the value and accuracy of the tests.

While computer virtualization technologies have evolved and matured well over the years, network virtualization hasn't entered the mainstream market and has largely remained in the academic world. The goal of network virtualization in the context of server virtualization is to allow running multiple types of protocols and architectures isolated from each other while sharing the same physical network. A series of problems need to be addressed: flexibility, programmability, scalability, manageability, security and, last but not least, legacy support. There are several projects that try to address these problems through network virtualization at different levels of the network and are relevant to this paper:

• Sun Crossbow - virtualizes the network stack and the NIC

• OpenFlow - allows running multiple networks on a shared network infrastructure

• X-Bone - deploys IP overlays and increases network component sharing

Ideas and concepts from these projects can act as a reference in bringing improvements and solving current limitations of the network architectures used by commercial cloud services.


1.3 Xen networking concepts

Because the paper is focused on virtual infrastructures powered by Xen (e.g. Amazon EC2), it is necessary to explain which networking modes exist and how they work. Xen provides network connectivity to the guest virtual machines by creating interface pairs, comprised of two virtual network cards connected to each other, one residing in dom0 and the other in domU. Once the virtual interfaces are created they need to be connected to the physical interface of the host machine in one of the networking modes. There are 3 different networking modes in Xen:

• bridged networking

• routed networking

• virtual networking

1.3.1 Bridged networking

Bridging is the default configuration for Xen; its purpose is to allow all virtual machines on the host machine to appear on the local network as individual hosts.

Figure 1: Bridged networking

A virtual bridge called xenbr0 is created in dom0 that connects all virtual interfaces to the physical interface of the host machine. Basically this mode works as a layer 2 switch that connects all the virtual interfaces.


By using this setup, all virtual machines will be part of the same layer 2 broadcast domain as the host machine, with broadcast packets being forwarded out of the eth0 interface.

1.3.2 Routed networking

The routed networking mode works by creating a point-to-point network link between dom0 and each domU. For this to work, the host operating system (dom0) must have a known static IP address, with routes added to dom0's routing table for each guest machine (domU). Because there are no routes initially, DHCP won't work in this mode. Only after a static IP is assigned to each domU can static routes be added to dom0's routing table, achieving connectivity. In routed mode dom0 works as a router for the virtual machines, which means this will only work if IP forwarding is enabled in the kernel.
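For reference, the mode is normally selected in dom0's xend configuration. The snippet below is only an illustrative sketch, as file locations and script names can differ between Xen versions and distributions, and a provider like Amazon may well use its own customised scripts, so the exact EC2 setup is not directly visible to the user. The relevant lines in /etc/xen/xend-config.sxp typically look like this:

# Bridged mode (the Xen default): dom0 creates xenbr0 and attaches the vifs to it
(network-script network-bridge)
(vif-script vif-bridge)

# Routed mode: dom0 routes between the vifs and eth0; requires net.ipv4.ip_forward=1
# (network-script network-route)
# (vif-script vif-route)

Switching modes means enabling one pair of script directives and restarting xend.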

1.3.3 Virtual networking

In this mode all domU machines are put on a virtual network with dom0, connected through a virtual bridge xenbr0. It is very similar to the bridged mode, with the difference that the virtual network is separated from the physical network through a dummy interface that acts as a gateway.

Figure 2: Virtual networking

In this setup, a DHCP server can be installed on dom0 to allocate IP addresses to the guest virtual machines, without allowing DHCP requests to escape onto the physical network.

1.4 Experimental tests

In order to answer the above research questions, a suite of tests will be created that tries to measure performance in an objective way and produce results that are comparable from one virtualization infrastructure to another. The first series of tests will be run on Amazon EC2. The tests consist of running two virtual machine instances at a time and doing the following measurements:

• response time between two VM instances

• response time between a VM instance and a server outside the virtualization infrastructure

• network throughput between two VM instances

• network throughput between a VM instance and a server outside the virtualization infrastructure

• traceroute between the two VM instances

• traceroute between a VM instance and a server outside the virtualization infrastructure

• packet loss between two VM instances

• a local network scan on each VM instance to determine the level of network isolation

The measurements will be taken and recorded on one pair of instances at a time, after which the two instances will be deleted. The test will be run multiple times (20 or more) and each time a new pair of instances will be created. The goal is to increase the randomness of the location where the VM instances are created, which in turn influences the following two factors:

• distance between the two VM instances, meaning the hop count and/or the round-trip time

• usage level of the hardware that hosts both VM instances
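A rough sketch of this per-pair loop is given below; the helpers launch_pair, run_tests, save_results and terminate_pair are hypothetical placeholders standing in for the EC2 API calls and test scripts described in Appendix A:

# Hypothetical sketch of the per-pair measurement loop (placeholder helpers).
RUNS = 20

RUNS.times do |i|
  vm_a, vm_b = launch_pair            # start two fresh Small Instances
  results = run_tests(vm_a, vm_b)     # latency, throughput, traceroute, packet loss, scan
  save_results("pair-#{i}.csv", results)
  terminate_pair(vm_a, vm_b)          # delete the pair so the next run lands elsewhere
end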

1.5 Approach

The first step in the project will be to make a set of tests that are formal and clearly defined. The tests should be generic in such a way that the results produced by them are comparable from one virtual infrastructure to another, independent of the virtualization technology used.


Once the test framework is complete, the tests will first run on the Amazon EC2 infrastructure, and all the resulting data will be gathered and stored on the OS3 experimental server. In the next phase all recorded data will be analysed and correlated to get a clear picture of how the network performs. Finally, a conclusion will be drawn and possible solutions to the problems that were found, if any, will be discussed.

2 Running the tests

2.1 Network topology

To find out how the network topology of Amazon EC2 is structured, I instantiated multiple virtual machines in the same datacenter and looked at the networks they were assigned to and the IP addresses they got. To get a clearer picture, I will first explain how networking works for each virtual machine on Amazon EC2. Each virtual machine has two IPs allocated: a non-routable private (RFC 1918) IP address and one public IP address. Once the virtual machine boots, the private IP gets assigned to the local network interface using DHCP. The private IP address is in class A space and is part of a /23 network (512 addresses). Access to the Internet goes through the assigned public IP address, which is mapped to the private IP using NAT. The following 10 virtual machines were instantiated to see if there are any patterns or rules when assigning the public or private IP addresses:

vm1 public ip address: 67.202.21.107, private ip address: 10.210.94.5/23
vm2 public ip address: 174.129.76.62, private ip address: 10.254.187.66/23
vm3 public ip address: 67.202.8.5, private ip address: 10.254.242.130/23
vm4 public ip address: 174.129.56.176, private ip address: 10.210.235.178/23
vm5 public ip address: 75.101.241.73, private ip address: 10.210.194.147/23
vm6 public ip address: 174.129.170.231, private ip address: 10.210.193.235/23
vm7 public ip address: 174.129.179.152, private ip address: 10.254.86.193/23
vm8 public ip address: 174.129.62.223, private ip address: 10.210.70.165/23
vm9 public ip address: 75.101.240.175, private ip address: 10.215.198.208/23
vm10 public ip address: 174.129.155.22, private ip address: 10.254.166.10/23

All virtual machines were assigned private IP addresses in different networks; even though the machines were instantiated in the same batch, the IP assignment appears to be very random.

For every /23 network there is a gateway that is used by all virtual machines, and it is always the first IP in that network. A graphical representation of the logical topology is shown below:

Figure 3: Logical topology
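Both addresses can be checked from inside an instance through the EC2 instance metadata service; the following Ruby snippet is a small illustrative sketch (error handling omitted):

require 'net/http'
require 'uri'

# The EC2 instance metadata service is reachable from inside an instance only.
BASE = 'http://169.254.169.254/latest/meta-data'

private_ip = Net::HTTP.get(URI.parse("#{BASE}/local-ipv4"))
public_ip  = Net::HTTP.get(URI.parse("#{BASE}/public-ipv4"))

puts "private: #{private_ip}  public: #{public_ip}"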

2.2 Network isolation

In order to find out how well the virtual machines are separated from one another, we looked at layers 2 and 3 of the local network. The goal was to find out if communication is possible between virtual machines on the same network using layer 2 and layer 3 protocols. Two tests were run in order to find the answer:

• an ARP scan for layer 2 discovery, using the arp-scan tool

• an open-port scan of all IPs on the local network, using the nmap tool

The results of the ARP scan indicated that either there were no other virtual machines on the same network or layer 2 protocols are filtered at the hypervisor level.

Interface: eth0, datalink type: EN10MB (Ethernet)
Starting arp-scan 1.6 with 512 hosts (http://www.nta-monitor.com/tools/arp-scan/)
10.254.170.0 fe:ff:ff:ff:ff:ff (Unknown)
10.254.170.1 fe:ff:ff:ff:ff:ff (Unknown)
...
10.254.171.255 fe:ff:ff:ff:ff:ff (Unknown)


It is highly unlikely that there are no other hosts on the same network, so most probably ARP scanning is filtered by the firewall that Amazon implements in the Xen hypervisor, or routed networking is used to provide network connectivity to each virtual machine. In the next step, each IP on the local network was scanned for open ports using nmap. Scanning was done using TCP SYN and ACK packets on the most widely used ports. The results of this scan were much more fruitful, with many hosts responding to the TCP probes, which confirms that only layer 3 connectivity can be achieved on the local /23 networks.
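A comparable layer 3 probe can also be scripted without nmap; the sketch below simply attempts a TCP connection to one common port on every address of the local /23 (the port and timeout are arbitrary example values):

require 'socket'
require 'timeout'
require 'ipaddr'

PORT    = 22      # arbitrary example port
TIMEOUT = 0.3     # seconds per host

# Iterate over every address in the instance's /23 and try to connect.
IPAddr.new('10.254.170.0/23').to_range.each do |addr|
  begin
    Timeout.timeout(TIMEOUT) { TCPSocket.new(addr.to_s, PORT).close }
    puts "#{addr} responds on tcp/#{PORT}"
  rescue Timeout::Error, SystemCallError
    # closed, filtered or unreachable - ignore
  end
end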

2.3 Security implications

Because layer 3 connectivity is possible between any two virtual machines in EC2, even when they are on different networks, it is very important to find out whether security is enforced by Amazon or whether it is left to the user to secure the services that run on their virtual machines. There are two interesting test scenarios:

• are all ports open between virtual machines of the same security group?

• are all ports open between virtual machines of different security groups?

To test the first scenario we ran iperf on different ports on virtual machines from the same account and security group, without opening those ports in the Amazon firewall. For this purpose, iperf was set to listen on ports 25, 80, 711 and 3550. Connectivity was achieved on all 4 ports without any problems. From a security point of view this has no implications, because it is expected that virtual machines from the same security group can communicate with each other without any filtering.

The second scenario was tested using the same ports, only this time between virtual machines from different security groups. Running the tests revealed that a connection cannot be established on any of the ports prior to opening them in the EC2 firewall. This means that security is enforced by default on any virtual machine created, and in order to have connectivity on any port from a source located outside the security group or on the Internet, that port must be explicitly opened in the Amazon firewall. Furthermore, virtual machines of different customers are isolated from each other because they are implicitly in different security groups. The EC2 firewall makes no distinction between traffic coming from the private networks and traffic coming from the Internet.
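The probing described above relies only on standard iperf options; as a hedged sketch, the listeners and probes could have been driven like this (the port list matches the one above, everything else is illustrative):

# On the server-side instance: start one iperf listener per test port.
# Ports below 1024 (25, 80) require root privileges.
PORTS = [25, 80, 711, 3550]
PORTS.each { |p| spawn('iperf', '-s', '-p', p.to_s) }

# On the client-side instance: try each port for 5 seconds.
# PEER_IP is a placeholder for the private address of the server-side instance.
PEER_IP = '10.210.94.5'
PORTS.each do |p|
  # exit status is only a rough success indicator; iperf's behaviour varies
  ok = system('iperf', '-c', PEER_IP, '-p', p.to_s, '-t', '5')
  puts "port #{p}: #{ok ? 'connected' : 'no connection (or iperf error)'}"
end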


2.4 Network performance

Network performance is a very important aspect of any infrastructure, virtual or not, and for that reason it is particularly interesting to do 2 kinds of tests: test performance between nodes in the cloud, and test performance between nodes and a host outside the cloud. To test network performance on the Amazon EC2 infrastructure, a suite of tests was made that does 6 different measurements:

• delay between nodes in the cloud

• traceroute between nodes in the cloud

• bandwidth between nodes in the cloud

• delay between a node and a host outside the cloud

• traceroute between a node and a host outside the cloud

• bandwidth between a node and a host outside the cloud

The delay test was done using the ping tool, by sending 10 ICMP echo requests to the target host and recording the RTT of all 10 packets together with the minimum, average and maximum delays. The traceroute was done to see if there are any changes in the hop count between the nodes themselves and between the nodes and an external host. Finally, the bandwidth tests were done by running iperf 5 times, with a duration of 5 seconds each time. All tests were repeated 20 times, on 10 different pairs of virtual machines.
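The delay measurement is essentially a thin wrapper around ping; a minimal Ruby sketch, assuming a Linux ping whose summary line has the form "rtt min/avg/max/mdev = ...":

# Send ICMP echo requests and pull min/avg/max RTT out of ping's summary line.
def measure_delay(host, count = 10)
  out = `ping -c #{count} #{host}`
  # Linux ping prints: rtt min/avg/max/mdev = 0.312/0.394/0.561/0.071 ms
  if out =~ %r{min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/}
    { min: $1.to_f, avg: $2.to_f, max: $3.to_f }
  end
end

p measure_delay('10.254.187.66')   # example: private address of vm2 from the table above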

2.4.1 Latency

The network latency was measured using the ping tool, by sending 10 ICMP echo requests to the target host and recording the RTT of all 10 packets. The first set of tests was done between the two virtual machines of each pair. Running ping between nodes in the cloud revealed that the round-trip times usually stay under 1 ms, with rare spikes to as much as 400 ms. Most of the time, the first packet sent by the ping tool had a latency hundreds of times bigger than the rest of the packets. The round-trip times of all the sent ICMP packets are represented in the following graph.


Figure 4: Latency average for all VM pairs

The red bars take into account the latency of the first packet, while the green ones don't. Although the high latency of the first packet causes a very big difference between the two averages, it doesn't influence the distribution graph noticeably; high-latency packets account for only a very small percentage of the overall number of packets. The distribution graph of the latency clearly shows that the vast majority (87.3%) of the packets have an RTT below 1 ms.


Figure 5: Distribution of the latency

The reason for the first packet having such a high latency might be that the hardware address of the target host is not in the ARP table of the final router. This means that an ARP resolution must be done on the target network, which adds considerable delay to the first ICMP packet. The second set of latency tests was targeted at an external host, more exactly google.com. One of the goals of the external latency tests is to find out if there are big variations in the round-trip times, so to minimise the chances of being influenced by external factors the target host should be as close as possible to the source virtual machine. Because the IP of google.com gets resolved based on the location of the resolver, it is a suitable choice for these tests.


Figure 6: Average for each virtual machine

Round-trip times to the external host range from 2.16 ms to 170 ms. Although the maximum recorded numbers are big, most of the round-trip times are below 10 ms. Looking at the average latency, we can see that there are 3 virtual machines with particularly high latency; the most probable reason is that those virtual machines were sharing the physical hardware with other virtual machines that either produced high network traffic or were heavily using the CPU, leading to a lack of resources to process the network packets.


Figure 7: Distribution of the external latency

Although some of the pairs experienced high latencies to the external host, the latency distribution graph shows that the vast majority of latency measurements stay in the 3 to 6 ms range. This does not mean that the high latency experienced on some of the virtual machines is no reason for concern; it only illustrates a disadvantage of the cloud. While the hardware resources aren't overused, performance is good; when resources are used at full capacity, network performance for the other virtual machines hosted on the same hardware begins to suffer.

2.4.2 Bandwidth

There are 3 kinds of tests that were run to measure the bandwidth inside the cloud and from the cloud to an external host: 1 TCP stream, 10 TCP streams and 1 UDP stream. All tests were done using the iperf tool running for 30 seconds, with the results recorded every 3 seconds. The first set of tests measures the performance inside the cloud, between the virtual machines of each pair. Starting with 10 TCP streams, it is obvious that the performance across all the instantiated virtual machines has big variations, ranging from 200 to 800 Mbit/s.
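These three measurements map directly onto standard iperf options; the exact invocations used by the test suite may have differed, but they would have looked roughly like this (the server side runs `iperf -s`, and PEER is a placeholder for the other instance's private address):

PEER = '10.210.94.5'            # placeholder address

system('iperf', '-c', PEER, '-t', '30', '-i', '3')                       # 1 TCP stream
system('iperf', '-c', PEER, '-t', '30', '-i', '3', '-P', '10')           # 10 parallel TCP streams
system('iperf', '-c', PEER, '-t', '30', '-i', '3', '-u', '-b', '1000M')  # 1 UDP stream at ~1 Gbit/s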

Figure 8: Average bandwidth for each VM pair

Virtual machine pairs maintain a relative consistency from one test to another, meaning that if they had bad performance on one of the tests, the other 2 also had bad performance. The best performance overall is obtained by running 10 TCP streams, followed by 1 UDP stream and then 1 TCP stream. Taking into account that the theoretical bandwidth is 1 Gbit/s, the network performance for some of the virtual machines is quite bad. As expected, the performance degrades considerably and fluctuates even more when using a single TCP stream. The bandwidth variations are bigger than with 10 TCP streams, starting from 60 Mbit/s and going up to 680 Mbit/s. There can be two reasons for the poor performance:

• high network usage in the datacenter

• high resource usage on the physical server that hosts the VM, leading to slow processing of the network packets

If either of these is the real cause of the poor bandwidth between the virtual machines, it is to be expected that using more TCP streams claims a bigger share of the resources (processing or network), which leads to better performance compared to a single TCP stream. The UDP bandwidth measurements were done by sending a 1 Gbit/s stream of data to the other virtual machine of the same pair. The resulting measurements show considerably less variation than the ones done using TCP, but the performance isn't very good either. The minimum UDP bandwidth starts at 220 Mbit/s and tops out at 380 Mbit/s. The patterns are very similar to the tests done with TCP; basically the performance varies from virtual machine to virtual machine and depends on how much the physical hardware is used.

Figure 9: Internal bandwidth distribution

The bandwidth distribution is very wide; the 1 TCP stream values claim the lower end of the spectrum, while the 10 TCP stream results are distributed towards the higher end of the graph. Because of the big variations in all the tests, it is safe to say that the probability of getting a pair of virtual machines with very good bandwidth between them is low. External bandwidth was measured by running the tests between a virtual machine and a server located in the OS3 laboratory.

The results are similar to the ones obtained in the internal tests, with the biggest bandwidth obtained by using 10 TCP streams and the lowest using 1 TCP stream.

Figure 10: Average bandwidth to the OS3 server

Some of the virtual machines had throughput as low as 2 Mbit/s to the OS3 server on one TCP stream. The gap between the two TCP tests is even bigger than on the internal network, and here the advantage of using multiple streams is very obvious. Compared to the internal tests, the UDP performance is much more stable relative to TCP, with averages ranging between 120 and 330 Mbit/s.


Figure 11: External bandwidth distribution

The distribution graph for the external bandwidth tests has a similar pattern to the one for the internal tests. Performance is lower and the whole graph has shifted down on the performance scale, but the structure remains the same.

2.4.3 Packet loss and jitter

Packet loss and jitter were measured while doing the UDP bandwidth test with the iperf tool; together they are a good indication of how the network performs in terms of quality. In computer networks, jitter refers to the variability of packet latency over time. Applications like VoIP and video streaming are sensitive to jitter and packet loss, so it is important to see what the average values are.
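Both values are taken from the report iperf prints for a UDP test; as a hedged sketch (the exact output format differs between iperf versions), they can be extracted with a simple regular expression:

# Extract jitter (ms) and packet loss (%) from an iperf UDP report line,
# which typically ends with something like: "0.030 ms  142/ 86000 (0.17%)".
def jitter_and_loss(report)
  if report =~ /([\d.]+)\s*ms\s+\d+\/\s*\d+\s+\(([\d.]+)%\)/
    { jitter_ms: $1.to_f, loss_pct: $2.to_f }
  end
end

line = '[  3]  0.0-30.0 sec  1.25 GBytes  358 Mbits/sec  0.030 ms  142/ 86000 (0.17%)'
p jitter_and_loss(line)   # => {:jitter_ms=>0.03, :loss_pct=>0.17}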


Figure 12: Jitter for each pair

The jitter average varies from 0.01 ms to 11 ms, so it's safe to say that this wouldn't influence any applications. On the other hand, the percentage of lost packets on a 1 Gbit/s stream is very big, with losses of up to 27% and an average of 4.96%. This kind of packet loss would be unacceptable for VoIP or video streaming applications, which need percentages near the 1% mark in order to work properly.


Figure 13: Packet loss for each pair (%)

For some unknown reason the packet loss between virtual machines is much higher than the packet loss between a virtual machine and the OS3 server, which is the opposite of my expectations and quite strange. It should be taken into consideration that the iperf tool was configured to send a full 1 Gbit/s stream from one virtual machine to another, which is not the most common usage scenario for VoIP applications.


3 Results summary

From all the tests that have been run, we can draw a few conclusions about the state of the network in Amazon's EC2 virtual infrastructure and, more generally, about other virtual infrastructures.

3.1 Protocols

The first set of tests tried to assess how Amazon EC2 handles network separation between virtual machines and what kind of connectivity you can get between them. The results show that the virtual machines have no layer 2 connectivity because of the way they are connected to the network, using the routed mode provided by Xen. This means that layer 2 protocols like PPP won't work, and this alone is a limitation that could prevent a big range of applications from migrating to the cloud. At the network layer, connectivity is also very limited because only the IPv4 protocol works, with no support for protocols like IPsec and IPv6. IPsec is widely used and a lot of companies heavily rely on it to secure and allow external access to their networks. Although IPv6 is not widely used in production, the speed of adoption is increasing and many academic and scientific environments have been using it for years. By only supporting IPv4, a considerable chunk of the market is left out and prevented from taking advantage of the cloud. The transport layer supports the use of TCP and UDP, and most of the time this is enough: apart from some legacy applications that still require SPX, most applications use TCP or UDP. The following list shows exactly what protocols you can run on the Amazon EC2 network infrastructure (layers 2, 3 and 4):

• data link (L2) - no connectivity

• network (L3) - IPv4 and ICMP

• transport (L4) - TCP and UDP

3.2 Security

The supported protocols are quite limiting, but from a security point of view the network design is good. Communication between the virtual machines and the rest of the network (or the Internet) goes through the Xen hypervisor, with each instance having a firewall that resides in the hypervisor itself, between the virtual interfaces of the virtual machines and the physical interfaces of the host server[3]. Because of this architecture there is no direct connectivity or data leakage between virtual machines that reside on the same physical server. Access control is based on security groups, and there are two firewall policies that govern access to each virtual machine:


• traffic inside the security group - all ports are open inside the same security group. Each user can have multiple security groups, and virtual machines of two separate users can't be in the same security group

• traffic from outside the security group - by default all ports are closed for traffic that comes from outside the security group. This can be the Internet or any other virtual machine that is in a different security group. Once a port has been opened in the Amazon firewall, connectivity can be achieved from anywhere.

3.3 Network performance

The tests I ran in order to determine the network performance revealed that results vary quite a lot from virtual machine to virtual machine. The biggest performance inconsistencies were found in the bandwidth tests, especially when using a single TCP stream. The following list shows the variation for each bandwidth test:

• 10 TCP streams - 200 to 800 Mbit/s

• 1 TCP stream - 60 to 680 Mbit/s

• 1 UDP stream - 220 to 380 Mbit/s

The theoretical link provided for each virtual machine is 1 Gbit/s. Comparing this to the results I got, it is very clear that the bandwidth is offered on a best-effort basis and that Amazon makes no guarantees on the performance one would get from the network. This should be taken into account by anyone who is planning to design an architecture that runs in the cloud and has high bandwidth requirements. With 1 Gbit/s connections becoming the de facto standard on most commercial switches and routers sold today, applications migrating from a purely physical infrastructure to the cloud would take a considerable performance hit that has to be taken into consideration.

Inside the cloud, the vast majority of the ICMP packets had a round-trip time under 1 ms, with occasional spikes as big as 400 ms. External latency was also good for the most part, with round-trip times staying under 10 ms. In both the internal and external latency tests there were some virtual machines that experienced very high latency averages. It is obvious that the underlying reason is high usage of either network resources or computing resources, which raises some serious concerns for the future: will performance further deteriorate as more and more users start using the Amazon EC2 infrastructure? Early signs hint that this might be the case. While a good quality link shouldn't have more than 1% packet loss, sending a 1 Gbit/s stream of UDP packets resulted in an average of 5% lost packets. This confirms that the connection between virtual machines in the cloud won't be able to approach the theoretical limit of 1 Gbit/s.

4 Possible solutions

In this section I provide some suggestions for improving the flexibility and performance of the network. From a flexibility point of view, the ultimate goal would be to give a user connectivity between the virtual machines on all layers, without any constraints on the network protocols he can use, while maintaining a very good separation between networks of different users. There are a few approaches that can be taken to reach that goal, or part of it:

• VLANs

• virtual private networks

• OpenFlow

VLANs are a first solution to this problem because the Xen hypervisor supports VLAN tagging (802.1Q) out of the box. By implementing VLAN support in the network infrastructure there is still one issue that isn't resolved, and that is routing. Each VLAN will need a router in order to route packets to the Internet and between VLANs, and the management of that router can be left either in the user's hands or offered as a service by the provider. There are two ways of approaching this:

• the service provider provides a router or a gateway that takes care of connectivity between the virtual machines

• users run their own routers using virtual machines and x86 router distributions like Vyatta or Untangle

Either of the two options can be used, or even a combination of the two. If a user wants more control over his network, running his own router would be the best solution, although this implies paying for another virtual machine. Currently Amazon only supports one network interface per instance, which imposes some limitations on running a router.

Virtual private networks are a second option that could be implemented right away using a software-based solution like OpenVPN. No changes to the Amazon infrastructure are necessary for this to work. Each user can create an overlay network on top of the current network and start running whatever network protocols he needs. In order to emulate layer 2 connectivity, OpenVPN must be configured to use the TAP driver, which operates on layer 2 frames. Although this solution can be implemented right away by any user, it does come with a performance penalty. Because OpenVPN encapsulates layer 2 or layer 3 traffic inside TCP or UDP, there is an unavoidable overhead that lowers performance and takes more CPU cycles to process, especially if encryption is used[4].
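As a hedged illustration of the OpenVPN option, a minimal static-key point-to-point tunnel could bridge two instances at layer 2 with configurations along the following lines (addresses, port and file names are placeholders):

# instance A (server side)
dev tap0
proto udp
port 1194
ifconfig 172.16.0.1 255.255.255.0
secret static.key

# instance B (client side)
dev tap0
proto udp
remote <public-ip-of-instance-A> 1194
ifconfig 172.16.0.2 255.255.255.0
secret static.key

UDP port 1194 would also have to be opened in the EC2 security group of the listening instance, as described in section 2.3.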


A third solution could be OpenFlow, a new open standard that allows network management based on network flows. This would allow the highest level of flexibility because it doesn't require a specific set of protocols in order to take routing and switching decisions. Instead, decisions can be taken based on any information from the packet headers, provided that there are custom rules in place for that kind of packet. In this way there are no restrictions on what protocols the network can run[5], giving much more flexibility than VLANs. This is probably the best solution of the three, but implementing it is very hard, mostly because it is new and most router and switch vendors haven't introduced OpenFlow support in their products yet.

In order to improve network performance between virtual machines, service providers can look in several directions:

• reserve CPU time for network packets

• Quality of Service

• allocate a user's virtual machines on servers that are close to each other

Because the network performance of the virtual machines is heavily influenced by the computing resources of the physical host machine, a good idea would be to reserve more CPU time for processing network packets from the virtual interfaces of the guest virtual machines. Even if all virtual machines on the same host server run CPU-intensive tasks, there would still be some resources left for network-related work. QoS could be another way of guaranteeing a certain level of network performance for applications that require it: VoIP, video streaming, gaming, etc. QoS has its disadvantages, one of them being the need for routers capable of applying it without losing performance (hardware implementation). Other disadvantages of QoS are the possibility of circumventing it through encryption or VPNs and the difficulty of engineering good QoS rules. The most cost-effective way of improving network performance is to allocate all virtual machines of the same user on servers that are physically very close to each other. In this way, network traffic between the virtual machines of the same user will share the physical network links with a smaller number of virtual machines from other users.

5 Conclusions

In this paper I have benchmarked the network performance of the Amazon EC2 infrastructure and assessed the limitations imposed by migrating to the cloud. The following conclusions can be drawn about the network performance:

• bandwidth capacity between virtual machines is provided on a best-effort basis, without any guarantees

• most of the time latency values are satisfactory, with values under 1 ms. There should be concerns for the future, though: when more users migrate to the cloud and hardware usage becomes higher, latency will begin to suffer, as I experienced on some of the virtual machines.

• packet loss levels are concerning, with a maximum of 10% and an average of 5% loss on the 1 Gbit/s stream. This has to be taken into account by anyone who is planning to run sensitive applications like VoIP or video streaming.

The authors of [6] have found similar results regarding the performance of the Amazon EC2 network infrastructure. The effects of virtualization, and more exactly of CPU sharing among several virtual machines, can be clearly observed in the variability of the measurements. Looking at the level of flexibility and freedom a user has in the cloud, I can conclude the following:

• there is no network connectivity on the link layer

• the network layer is limited to IPv4

• TCP and UDP can be used on the transport layer

Currently there are serious limitations from both a performance and a flexibility point of view. In order for virtual infrastructures, like the one provided by Amazon EC2, to be a viable solution for more than just a small number of companies, these limitations must be overcome. The main solutions I propose are VLANs, for improving network flexibility, and more CPU allocation to network-related tasks, for improving network performance.


6 Acknowledgements

There are a few people I have to thank for the ideas, help and sheer amount of feedback they provided during my research project: my two coordinators, Paola Grosso and Rudolf Strijkers. Without their help I am sure this project wouldn't have been possible. I also want to thank Rager Ossel from InTouch for the insightful and thought-provoking discussion we had, which helped put things a bit more into perspective.

A The test suite

A lot of work went into programming the suite of tests used to test the Amazon EC2 virtual infrastructure, so it is worth explaining how it works and how it was built. The programming language used was Ruby, mainly because of its ease of use, dynamic characteristics and my familiarity with it. The goal was to produce something that can be used independently of the type of infrastructure. The scripts used for the tests can be divided into 3 parts:

• test configuration file

• test scripts

• remote automation script

In order to make the test suite more flexible, the required tests are defined in a configuration file that is read by the test scripts at runtime. Based on the configuration file, the suite will execute all the defined tests. The configuration file is structured using the YAML data serialisation format and can contain 3 kinds of tests, delay, traceroute and bandwidth, each of them with various configuration fields. A sample configuration file is provided below:

test:
  name: Testing the Amazon EC2 infrastructure
  results: test.csv
  remotedir: /home/user/logs
  remotehost: nidoran.studlab.os3.nl
  remoteuser: logs
  remotepasswd: ******
  delay:
    host: 67.29.112.27
    nr: 30
    description: Delay to bogus machine, 30 ICMP packets
  traceroute:
    host: 67.29.112.27
    maxhops: 30
    description: Traceroute to bogus machine, maximum 30 hops
  bandwidth:
    host: 67.29.112.27
    protocol: tcp
    duration: 30
    nr: 1
    concurent: 10
    description: Bandwidth to bogus machine using TCP, 30 seconds and 10 streams

The test suite is divided into the following scripts:

• main.rb - reads parameters from the command line and launches a set of tests for each configuration file specified as a parameter

• test.rb - instantiates the tests as they are defined in the configuration file

• delay.rb - runs delay tests using the ping utility

• bandwidth.rb - runs bandwidth measurement tests using the iperf utility

• traceroute.rb - measures hop count using the traceroute utility

• results.rb - saves the results in a CSV file

• log.rb - sends status information to the terminal

The automation script works by instantiating 2 virtual machines at a time using the Amazon EC2 API; it connects to each virtual machine, installs all the required applications (ruby1.9, iperf, etc.) and uploads the test suite. Once the environment is ready, it runs the tests one at a time and uploads the results to the OS3 server.
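To give an idea of how the pieces fit together, the following is a hypothetical reconstruction, not the author's actual code, of how test.rb might read the sample configuration above and invoke the external tools:

# Hypothetical sketch of how test.rb could dispatch the configured tests.
require 'yaml'

TOOLS = {
  'delay'      => ->(c) { `ping -c #{c['nr']} #{c['host']}` },
  'traceroute' => ->(c) { `traceroute -m #{c['maxhops']} #{c['host']}` },
  # protocol/nr handling omitted for brevity
  'bandwidth'  => ->(c) { `iperf -c #{c['host']} -t #{c['duration']} -P #{c['concurent']}` }
}

cfg = YAML.load_file(ARGV[0])['test']
TOOLS.each do |name, runner|
  next unless cfg[name]                 # skip tests not present in the config
  puts "== #{name}: #{cfg[name]['description']}"
  puts runner.call(cfg[name])           # raw tool output; results.rb would parse and store it
end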

References

[1] Borja Sotomayor, Rubén S. Montero, Ignacio M. Llorente, Ian Foster, Virtual Infrastructure Management in Private and Hybrid Clouds

[2] http://aws.amazon.com/ec2/#instance

[3] Amazon AWS, AWS: Overview of Security Processes

[4] Jens Mache, Damon Tyman, Andre Pinter, Chris Allick, Lewis & Clark College, Performance Implications of Using VPN Technology for Cluster Integration and Grid Computing

[5] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Jonathan Turner, Scott Shenker, OpenFlow: Enabling Innovation in Campus Networks

[6] Guohui Wang, T. S. Eugene Ng, The Impact of Virtualization on Network Performance of Amazon EC2 Data Center
