IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING
Using Virtual Machine Allocation Policies to Defend against Co-resident Attacks in Cloud Computing

Yi Han, Jeffrey Chan, Tansu Alpcan, Christopher Leckie

Abstract—Cloud computing enables users to consume various IT resources in an on-demand manner, and with low management overhead. However, customers can face new security risks when they use cloud computing platforms. In this paper, we focus on one such threat − the co-resident attack, where malicious users build side channels and extract private information from virtual machines co-located on the same server. Previous works mainly attempt to address the problem by eliminating side channels. However, most of these methods are not suitable for immediate deployment due to the modifications they require to current cloud platforms. We choose to solve the problem from a different perspective, by studying how to improve the virtual machine allocation policy, so that it is difficult for attackers to co-locate with their targets. Specifically, we (1) define security metrics for assessing the attack; (2) model these metrics, and compare the difficulty of achieving co-residence under three commonly used policies; (3) design a new policy that not only mitigates the threat of the attack, but also satisfies the requirements for workload balance and low power consumption; and (4) implement, test, and prove the effectiveness of the policy on the popular open-source platform OpenStack.

Index Terms—cloud computing security, co-resident attack, virtual machine allocation policy, security metrics modelling
1 INTRODUCTION

Security is one of the major concerns about cloud computing. From the customer's perspective, migrating to the cloud means being exposed to the additional risks brought about by the other tenants with whom they share resources − are these neighbours trustworthy, or might they compromise the integrity of others? This paper concentrates on one form of this security problem: the co-resident attack (also known as the co-location, co-residence, or co-residency attack).

Virtual machines (VMs) are a commonly used resource in cloud computing environments. For cloud providers, VMs help increase the utilisation rate of the underlying hardware platforms. For cloud customers, they enable on-demand resource scaling, and outsource the maintenance of computing resources. However, apart from all these benefits, virtualisation also brings a new security threat [1]. In theory, VMs running on the same physical server (i.e., co-resident VMs) are logically isolated from each other. In practice, however, malicious users can build various side channels to circumvent the logical isolation, and obtain sensitive information from co-resident VMs, ranging from the coarse-grained, e.g., workloads and web traffic rates [1], to the fine-grained, e.g., cryptographic keys [2].
For clever attackers, even seemingly innocuous information like workload statistics can be useful. For example, such data can be used to identify when the system is most vulnerable, i.e., the time to launch further attacks, such as Denial-of-Service attacks.

A straightforward solution to this novel attack is to eliminate the side channels, which has been the focus of most previous works [3], [4], [5], [6]. However, most of these methods are not suitable for immediate deployment due to the modifications they require to current cloud platforms. In our work, we approach this problem from a completely different perspective. Before the attacker is able to extract any private information from the victim, they first need to co-locate their VMs with the target VMs. It has been shown that the attacker can achieve an efficiency rate as high as 40% [1], which means that 4 out of every 10 of the attacker's VMs can co-locate with the target. This motivates us to study how to effectively minimise this value. From a cloud provider's point of view, the VM allocation policy (also known as VM placement − we use these two terms interchangeably in this paper) is the most important and direct control that can be used to influence the probability of co-location. Consequently, we aim to design a secure policy that can substantially increase the difficulty for attackers to achieve co-residence.

————————————————
• Yi Han is with the Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia 3010. E-mail: [email protected].
• Jeffrey Chan is with the Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia 3010. E-mail: [email protected].
• Tansu Alpcan is with the Department of Electrical and Electronic Engineering, The University of Melbourne, Melbourne, Australia 3010. E-mail: [email protected].
• Christopher Leckie is with the Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia 3010. E-mail: [email protected].
————————————————

In our earlier work [7], we proposed a prototype of such a secure policy, called the previously-selected-servers-first policy (PSSF). However, this prototype policy only focuses on the problem of security, and hence has obvious limitations in terms of:

1. Workload balance − Workload here refers to the VM requests. From the cloud provider's point of view, spreading VMs among the servers that have
already been switched on can help reduce the probability of servers being over-utilised, which may cause SLA (service level agreement) breaches. From the customer's perspective, it is also preferable if their VMs are distributed across the system, rather than being allocated together on the same server. Otherwise, the failure of one server will impact all the VMs of a user.

2. Power consumption − It has been estimated that an average datacentre consumes as much power as 25,000 households [8], and that this figure is expected to double every 5 years [9]. Therefore, managing the servers in an energy efficient way is crucial for cloud providers in order to reduce the power consumption and hence the overall cost. This has also been the focus of many previous works [10], [11], [12], [13], [14].

In this paper, we take all three aspects of security, workload balance and power consumption into consideration, to make PSSF more applicable to existing commercial cloud platforms. Since these three objectives conflict to some extent, we improve our earlier policy by applying multi-objective optimisation techniques. In addition, we have implemented PSSF in the simulation environment CloudSim [15], [16], as well as on the real cloud platform OpenStack [17], and performed large-scale experiments involving hundreds of servers and thousands of VMs, to demonstrate that it meets the requirements of all three criteria.
Specifically, our contributions include: (1) we define security metrics that measure the safety of a VM allocation policy, in terms of its ability to defend against co-resident attacks; (2) we model these metrics under three basic but commonly used VM allocation policies, and conduct extensive experiments on the widely used simulation platform CloudSim [15], [16] to validate the models; (3) we propose a new secure policy, which not only significantly decreases the probability of attackers co-locating with their targets, but also satisfies the constraints on workload balance and power consumption; and (4) we implement and verify the effectiveness of our new policy using the popular open-source cloud software OpenStack [17], as well as on CloudSim.

The rest of the paper is organised as follows. In Section 2, we give a survey of previous work on co-resident attacks, and current VM allocation policies. In Section 3, we describe our research aim and formally define the problem. In Sections 4 and 5, we model the security metrics under three existing allocation policies, and give an experimental verification. In Section 6, we introduce our new policy and summarise the test results on CloudSim. In Section 7, we present the implementation on OpenStack. Finally, Section 8 concludes the paper, and gives directions for future work.
2 BACKGROUND AND RELATED WORK

In this section, we start by surveying previous work on co-resident attacks, including their definition, security risks and possible countermeasures. We then summarise commonly used VM allocation policies as context for our proposed method.
2.1 Survey on Co-resident Attacks

The co-resident attack discussed in this paper comprises the following two steps. First, the attacker has a clear set of target VMs, and their goal is to co-locate their VMs with these targets on the same physical servers. Second, after co-residence is achieved, the attacker constructs different types of side channels to obtain sensitive information from the victim. Note that this is different from [18], [19], [20], where attackers do not have specific targets, and their goal is to obtain an unfair share of the cloud platform's capacity.

In order to co-locate with the targets, the attacker can either use a brute-force strategy − start as many VMs as possible (the number may be limited by the cost) − or take advantage of the sequential and parallel locality in VM placement. It has been shown in [1] that in the Amazon EC2 cloud [21], if one VM is started immediately after another one is terminated, or if two VMs are launched almost at the same time, it is more likely that these two VMs are allocated to the same server.

Security risks

Theoretically speaking, co-resident VMs should not influence each other. However, this can still occur in real-world cloud systems, which is why attackers are able to build side channels between VMs, and obtain private information. We can categorise these side channels as either coarse grained or fine grained.

1. Coarse grained: Experiments in [1] show that because the cache utilisation rate has a large impact on the execution time of the cache read operation, attackers can infer the victim's cache usage and workload information by applying the Prime+Probe technique [22], [23]. Similarly, they can estimate the victim's web traffic rate, which also has a strong correlation with the execution time of cache operations. As we mentioned in the introduction, even such coarse-grained information can be useful to clever attackers to maximise the damage of further attacks.

2. Fine grained: It is demonstrated in [2] that attackers can exploit shared hardware resources, such as the instruction cache, to extract cryptographic keys. Specifically, they show how to overcome the following challenges: dealing with core migrations and determining if an observation is associated with the victim, filtering out hardware and software noise, and regaining access to the target CPU core with sufficient frequency.

In addition, a number of side channels have been explored [24], [25], [26], [27] in order to transfer sensitive information between VMs, which is prohibited by security policies. In particular, in contrast to exploiting side channels to launch attacks, Zhang et al. [28] discuss how to use side channels to detect whether the isolation of a VM is violated.

Existing countermeasures

Previous studies have proposed the following five
types of possible defense methods:

1. Eliminating the side channels. Side channel attacks are not unique to cloud systems. Prior to the popularisation of cloud platforms, different methods [29], [30] had already been proposed to mitigate the threat of side channels. However, these methods operate at the hardware layer, and hence are normally costly to adopt. In cloud environments, many side channels rely on high resolution clocks; therefore, Vattikonda et al. [3] propose to remove such clocks; Wu et al. [4] choose to add latency to potentially malicious operations; while the approach of Aviram et al. [5] is to “eliminate all internal reference clocks”. An alternative solution is to enforce isolation by preventing the sharing of sensitive resources, e.g., Shi et al. [6] use page colouring to limit cache-based side channels. In particular, Szefer et al. [31] take a further step by proposing to remove the hypervisor, and use hardware mechanisms to isolate access to shared resources. Nevertheless, the problem with these methods is that they often require substantial changes to existing cloud platforms, and hence are unlikely to be adopted by cloud providers any time soon. More recently, Zhang et al. [32] propose to perform periodic time-shared cache cleansing, in order to make the side channel noisy. In addition, Varadarajan et al. [33] show that a scheduling mechanism called the minimum run time (MRT) guarantee is effective in preventing cache-based side channel attacks. These two methods require fewer changes and hence are easier to deploy.

2. Increasing the difficulty of verifying co-residence. The easiest way to determine whether two VMs are on the same server is based on network measurements [1]: by performing a TCP traceroute operation, the attacker can obtain the IP address of a VM's Dom0, which is a privileged VM that manages all VMs on a host. If two Dom0 IP addresses are the same, the corresponding VMs are co-resident.
Cloud providers can prevent Dom0's IP address from being exposed to customers, so that attackers are forced to resort to other options that do not rely on network measurements, and often require greater effort. However, as more and more methods of detecting co-residence have been proposed [34], [35], [36], simply hiding Dom0's IP address is not sufficient.

3. Detecting the features of co-resident attacks. Sundareswaran and Squicciarini [37] and Yu et al. [38] observed that when attackers use the Prime+Probe technique to extract information from the victim, there are abnormalities in the CPU and RAM utilisation, system calls, and cache miss behaviours. They propose different methods to detect these features, and design the defense mechanisms accordingly.

4. Migrating VMs periodically. Li et al. [39] and Zhang et al. [40] tackle the problem by applying a Vickrey-Clarke-Groves (VCG) mechanism to migrate VMs periodically. Specifically, they discuss the number of VMs to be migrated as well as the destination hosts. In addition, they propose a method to generate a VM placement plan, in order to decrease the overall security risk. However, frequently migrating VMs causes extra power consumption, and may lead to performance degradation, which increases the probability of cloud providers breaking their SLAs.

5. Using the VM allocation policy to make it difficult to achieve co-residence. This is also the method we choose. To the best of our knowledge, only [41] adopts a similar method. Their co-location resistant (CLR) algorithm labels all servers as either open or closed, where open (closed) means the server can (cannot) receive more VMs. At any time, CLR keeps a fixed number (Nopen) of servers open, and allocates a new VM to one of these servers at random. If the selected server cannot take more VMs due to this allocation, it is marked closed, and a new server is opened. The larger Nopen is, the more co-location resistant the algorithm becomes. Therefore, the best case scenario is that all servers are open, in which case CLR behaves the same as the Random policy defined in the next subsection.
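The CLR scheme described above can be sketched as follows; this is our illustration of the idea, not code from [41], and the class and parameter names are ours:

```python
import random

class CLRAllocator:
    """Co-location resistant (CLR) allocation sketch: keep n_open servers
    open, place each new VM on a randomly chosen open server, and replace
    a server with a fresh one once it becomes full."""

    def __init__(self, capacity, n_open, total_servers):
        self.capacity = capacity            # max VMs per server (assumed)
        self.load = [0] * total_servers     # VMs currently on each server
        self.open = list(range(n_open))     # indices of the open servers
        self.next_server = n_open           # next never-opened server

    def place(self, vm_id):
        if not self.open:
            raise RuntimeError("no open servers left")
        server = random.choice(self.open)   # uniform draw over open servers
        self.load[server] += 1
        if self.load[server] == self.capacity:
            # The server is now full: close it and open a new one.
            self.open.remove(server)
            if self.next_server < len(self.load):
                self.open.append(self.next_server)
                self.next_server += 1
        return server

alloc = CLRAllocator(capacity=4, n_open=2, total_servers=10)
placements = [alloc.place(i) for i in range(12)]
print(placements)
```

Because every placement is a uniform draw over the open set, the attacker's chance of landing next to a given target on any single request is at most 1/Nopen, which is why a larger Nopen makes the scheme more co-location resistant.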
Our earlier work

We have been looking at this problem from a new perspective: instead of concentrating on countermeasures after attackers co-locate with their targets, we explore different ways to make it difficult for attackers to achieve co-residence. In [42], we proposed a game theoretic approach to compare three commonly used virtual machine allocation policies, in terms of security (i.e., the ability to defend against co-resident attacks), workload balance and power consumption. In [7], we defined security metrics for assessing the attack, and mathematically modelled these metrics under the three policies. In addition, we proposed a new policy, PSSF, which significantly decreases the probability of attackers co-locating with their targets. In this paper, we propose a more generally applicable extension of the PSSF policy, and provide a detailed analysis of its performance.
2.2 Summary of Virtual Machine Allocation Policies

In cloud computing environments, there are two kinds of VM placement: (1) initial placement, which admits new requests for VMs, and assigns them to specific hosts; and (2) live migration [43], which optimises the current allocation according to certain metrics. The initial placement can be further divided into two steps − searching for a data centre within the system, and then a host within the chosen data centre. In this paper, we focus on the initial VM allocation within a data centre.

Although dozens of algorithms [9], [10], [11], [12], [13], [14] have been proposed based on various criteria and goals, judging from the final allocation process they can generally be classified into two types: stacking and spreading. In other words, the VMs are either concentrated on a small number of physical servers, in order to decrease the power consumption and maximise the utilisation rate, or distributed across the whole data centre, for the purpose of workload balance and higher reliability. Table 1 summarises some commonly used or widely cited allocation policies. In this paper, we only consider three fundamental yet representative VM allocation policies [11]: Least VM policy (Workload Balancing), Most VM policy (Workload Stacking), and Random policy (Random).

TABLE 1
POPULAR VM ALLOCATION POLICIES

First Fit: All servers are ordered by their identifier, and a new VM is allocated to the legitimate server (i.e., the server with enough remaining resources and satisfying all other requirements, if there are any) with the smallest identifier.

Stacking − Workload Stacking: One example is to allocate a new VM to the legitimate server with the largest number of VMs (started by any user). This is what we call the "Most VM policy".

Stacking − Energy/Cost Aware: A more advanced type of policy than Workload Stacking, which also aims to minimise the (cost of) power consumption. Specifically, a new VM is allocated to the server that will cause the least additional power consumption/cost, the calculation of which differs from policy to policy.

Random: The simplest policy, which selects at random from the legitimate servers.

Next Fit: Similar to First Fit, except that the search begins from the server that was last selected.

Spreading − Workload Balancing: A group of similar policies that spread the VM requests based on different criteria. For example, a new VM is allocated to the legitimate server (1) with the smallest number of VMs (started by any user) − this is what we call the "Least VM policy"; (2) with the most free CPU cores; or (3) with the largest ratio of free CPU cores to total CPU cores.

3 PROBLEM FORMULATION AND ANALYSIS

In this section, we first summarise the aim of our research. Then we describe our problem definition, and briefly introduce our proposed solution.

3.1 Research Aim

The aim of this research is to design a secure VM allocation policy, in order to mitigate the threat of co-resident attacks. Specifically, we determine whether an allocation policy is secure based on the following three metrics (the definitions of all notation are given in Table 2):

1. Efficiency − For attackers, it is clearly desirable to co-locate with as many targets as possible by starting the minimum number of VMs. Hence, we define Efficiency as the gains divided by the costs. More precisely, it equals the number of servers on which malicious VMs are co-located with at least one of the T targets, divided by the total number of VMs launched by the attacker, i.e.,

Efficiency(|VM(A,t)|) = |Servers(SuccVM(A,t))| / |VM(A,t)|

The reason why we use |Servers(SuccVM(A,t))| instead of |SuccVM(A,t)| is that when two malicious VMs co-locate with the same target, the second VM should not be counted. Note that we focus on preventing attackers from co-locating with their targets, and consider that once co-residence is achieved, attackers are able to construct side channels. Although a second co-resident VM can make it easier for attackers to extract sensitive information from the victim, in this paper we focus on preventing any co-residence.

2. Coverage − Another criterion to measure the success of an attack is the percentage of conquered targets, i.e., Coverage, which equals the number of target VMs co-located with malicious VMs started in the attack, divided by the number of targets T, i.e.,

Coverage(|VM(A,t)|) = |SuccTarget(A,t)| / T

3. VMmin − This is defined as the minimum number of VMs that the attacker needs to start so that at least one of them co-locates with at least one target. It is an estimate of the minimum effort an attacker has to make in order to achieve co-residence.

TABLE 2
DEFINITIONS REGARDING THE SECURITY METRICS

K: The total number of servers
A: The attacker
L: A legal user. The target of A is the set of VMs started by L
VM(L,t): The set of VMs started by L at time t
VM(A,t): The set of VMs started by A during one attack at time t
Target(A): The target set of VMs that A intends to co-locate with, Target(A) = ∑tVM(L,t), |Target(A)| = T
SuccTarget(A,t): A subset of Target(A) that co-locates with at least one VM from VM(A,t)
SuccVM(A,t): A subset of VM(A,t) that co-locates with at least one of the T targets
Servers({a set of VMs}): The servers that host the set of VMs
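Both rates can be computed mechanically from a placement mapping. The sketch below (function and variable names are ours) encodes the two definitions, including the deduplication of malicious VMs that share a server, for a scenario matching Fig. 1:

```python
def attack_metrics(placement, attacker_vms, target_vms):
    """Compute (Efficiency, Coverage) from a VM -> server mapping.

    Efficiency = |Servers(SuccVM(A,t))| / |VM(A,t)|
    Coverage   = |SuccTarget(A,t)| / T
    """
    target_servers = {placement[v] for v in target_vms}
    attacker_servers = {placement[v] for v in attacker_vms}
    # Servers where at least one attacker VM co-locates with a target;
    # two attacker VMs on the same target server count only once.
    succ_servers = {placement[v] for v in attacker_vms
                    if placement[v] in target_servers}
    # Targets co-located with at least one attacker VM.
    succ_targets = {v for v in target_vms
                    if placement[v] in attacker_servers}
    return (len(succ_servers) / len(attacker_vms),
            len(succ_targets) / len(target_vms))

# A scenario like Fig. 1: four target VMs on servers 1-4; eight attacker
# VMs, two of which (A2, A4) share server 2 with the same target.
placement = {"L1": 1, "L2": 2, "L3": 3, "L4": 4,
             "A1": 1, "A2": 2, "A3": 5, "A4": 2,
             "A5": 6, "A6": 7, "A7": 3, "A8": 8}
eff, cov = attack_metrics(placement,
                          ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"],
                          ["L1", "L2", "L3", "L4"])
print(eff, cov)  # efficiency 3/8, coverage 3/4
```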
The following example illustrates the definitions of attack efficiency and coverage. As shown in Fig. 1, a legal user L starts four VMs (VM_L1, VM_L2, VM_L3 and VM_L4), and they are running on four different servers (Server 1, Server 2, Server 3 and Server 4). Then attacker A starts eight VMs (VM_A1, VM_A2, …, VM_A8), four of which co-locate with three VMs of L. In this case, the attack efficiency is 3/8 (instead of 4/8, as VM_A2 and VM_A4 co-locate with the same target VM_L2), and the coverage is 3/4.

In addition to security, we also take workload balance and power consumption into consideration, since they are two other important factors when cloud providers design their VM allocation policies. In other words, the new policy should not only mitigate the threat of co-resident attacks, but also satisfy the constraints on workload balance and power consumption.
Fig. 1. An example to explain attack efficiency and coverage

3.2 Problem Definition

Consider the following scenario: in a cloud computing system of K servers S = {s1, s2, …, sK}, M users U = {u1, u2, …, uM} start N virtual machines V = {v1, v2, …, vN}. A mapping X: U×V→S allocates each VM of each user to a specific server, XV×S×U = {xv,s,u | xv,s,u = 1 iff VM v of user u is allocated to server s}. An attacker A intends to co-locate their VMs with the VMs of legal user L, i.e., Target(A) = ∑tVM(L,t). During one attack started at time t, A launches |VM(A,t)| VMs, and the goal is to maximise the efficiency and/or coverage rate. Given this attack scenario, our new policy should satisfy the following objectives:

1. Security − Under the new policy, the attacker has to start a large number of VMs to achieve a non-zero efficiency or coverage rate, i.e., VMmin is high. In addition, the coverage rate does not increase, or increases very slowly, with |VM(A,t)|. In order to achieve these two points, one extreme case is that VMs of different users are never allocated to the same server. Based on such an idea, we minimise the average number of users per server, i.e.,

S: min (1/K) ∑s∈S |{u | u ∈ U, xv,s,u = 1, ∃v ∈ V}|

2. Workload balance − As we mentioned in the introduction, the importance of workload balance is twofold. For cloud providers, evenly distributing VMs helps decrease the probability of servers being over-utilised. For simplicity, in our new policy we use the number of running VMs per server as the criterion to spread the workload (the same as the Least VM policy). In addition, customers would also prefer that their VMs are not all allocated together on the same server. In other words, on average the number of servers that host a user's VMs should be maximised, i.e.,

W: max (1/M) ∑u∈U |{s | s ∈ S, xv,s,u = 1, ∃v ∈ V}|

3. Power consumption − How to effectively reduce power consumption is a crucial issue for cloud providers [12]. Techniques for energy aware VM placement have been widely discussed in previous papers [9], [10], [11], [12], [13], [14]. In order to simplify the problem, here we only consider the most straightforward approach − minimising the number of running servers, i.e.,

P: min |{s | s ∈ S, ∃u ∈ U, ∃v ∈ V, xv,s,u = 1}|

In addition, we make the following assumptions:

1. The capacity of a server is not explicitly included. However, when a new VM request is being processed, only the servers with sufficient resources left are considered − we refer to these as legitimate servers. In other words, we focus on designing an algorithm to sort these legitimate servers, and allocate the new VM to the top ranked server, so that the above three objectives can be satisfied;
2. The multi-objective optimisation is done for every incoming VM request when it arrives, such that only the current system state and the VM request are taken into consideration, i.e., no lookahead mechanism is used;
3. VM live migration is not taken into consideration, which means that once a VM is allocated to a server, it will run on that server until it is stopped or terminated by the user;
4. Cloud providers do not have any prior knowledge of the attackers or victims, and all VM requests are treated equally.

There are many different ways to solve this multi-objective optimisation problem, and since the security objective is the most important in our case, we choose the ε-constraints method [44].

Problem Statement

Formally, the problem is as follows: how to design a rule for the mapping X, such that the number of users per server is minimised, under the constraints on workload balance and power consumption, i.e.,

S: min (1/K) ∑s∈S |{u | u ∈ U, xv,s,u = 1, ∃v ∈ V}|
s.t. W: ∀s ∈ S, ∀u ∈ U, |{v | v ∈ V, xv,s,u = 1}| ≤ N*
     P: |{s | s ∈ S, ∃u ∈ U, ∃v ∈ V, xv,s,u = 1}| ≤ K*

where N* and K* are predefined positive thresholds. Note that we re-write the condition W as "no more than N* VMs of one single user are hosted on the same server", as this is more practical for implementation.

Because this multi-objective optimisation problem is NP-hard − it contains both the knapsack problem (the security objective and the workload balance constraint) and the bin-packing problem (the power consumption constraint) − we aim to find a heuristic solution for it. Specifically, in order to minimise the average number of users per server, PSSF gives a higher priority to (1)
servers that currently host VMs from the same user − assigning a new VM to one of these servers will not increase the number of different users on that server; and (2) servers that once hosted VMs from the same user − this covers the case where none of the user's VMs is running, but the user has created VMs before. Selecting one of these servers prevents attackers from constantly switching their VMs on and off to circumvent the restriction. In addition, the workload balance constraint, i.e., that no more than N* VMs of one single user are hosted on the same server, is directly implemented in PSSF − if a server already hosts N* VMs from a user, it will not be chosen again when that user creates more VMs, even if it has enough remaining resources. Finally, in order to decrease the power consumption, we combine the ideas of spreading and stacking workload. A more detailed description of our PSSF policy is given in Section 6.
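The priority order just described can be sketched as a ranking function. This is our simplified reading (all names are ours), not the full PSSF of Section 6, which additionally interleaves spreading and stacking to limit power consumption:

```python
def pssf_rank(servers, user, n_star):
    """Rank legitimate servers for a new VM of `user`, PSSF-style.

    servers: list of dicts with keys
       'current' : set of users with running VMs on the server
       'history' : set of users that ever had a VM on the server
       'count'   : number of this user's VMs currently on the server
    n_star: workload-balance threshold N* (max VMs of one user per server).
    Returns the candidate servers, best first.
    """
    # Hard constraint W: never place more than N* VMs of one user together,
    # even if the server has enough remaining resources.
    legit = [s for s in servers if s['count'] < n_star]

    def priority(s):
        if user in s['current']:
            return 0   # highest: already hosts this user's running VMs
        if user in s['history']:
            return 1   # next: hosted this user's VMs before
        return 2       # last resort: a server new to this user

    return sorted(legit, key=priority)

servers = [
    {'name': 'h1', 'current': {'bob'},   'history': {'bob'},          'count': 0},
    {'name': 'h2', 'current': set(),     'history': {'alice'},        'count': 1},
    {'name': 'h3', 'current': {'alice'}, 'history': {'alice', 'bob'}, 'count': 1},
]
ranked = pssf_rank(servers, 'alice', n_star=2)
print([s['name'] for s in ranked])  # ['h3', 'h2', 'h1']
```

Note how the history check closes the loophole mentioned above: stopping all of one's VMs and starting new ones still routes the user back to the servers they have touched before.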
4 SECURITY ANALYSIS OF BASIC VM ALLOCATION POLICIES

Before proposing our new VM allocation policy, we first carry out a comparison between three basic existing policies − the Least VM policy, Most VM policy and Random policy − in terms of their ability to defend against co-resident attacks. We start by analysing the attacker's strategies under different allocation policies to maximise their efficiency and coverage rates. Then we introduce our models of VMmin, Efficiency(|VM(A,t)|) and Coverage(|VM(A,t)|) for each policy.

4.1 Attackers' Strategies under Different VM Allocation Policies

We assume that the attacker can figure out which VM allocation policy (or at least its type) is being used in the system, and optimise their strategies accordingly. Therefore, before comparing the three policies, we first need to understand the worst case scenario, i.e., how the attacker maximises either the efficiency or the coverage rate under these policies. One straightforward approach is to spread the VMs, i.e., to occupy as many servers as possible with the minimum number of VMs. Under the Least VM policy, the attacker should start as many VMs as possible at once (the number is restricted by costs and other constraints), for the reason that a server will not be selected twice within a short period of time under this policy. Under the Most VM policy, instead of launching a large number of VMs at the same time, the attacker should start their VMs in batches, i.e., start B VMs at a time (B

4.2 Modelling VMmin: Minimum Number of VMs the Attacker Needs To Start

In the following two subsections, we present our models for the three security metrics. Note that hereafter all discussions are based on the assumption that the attacker applies the best strategies described in the last subsection. We start with VMmin, i.e., the minimum effort the attacker has to make in order to co-locate with their targets.

Let pi, 1 ≤ i ≤ VMmin, be the probability that the ith VM co-locates with at least one of the T targets. Then the probability that VMmin = n equals the probability that none of the first (n−1) VMs co-locates with any target, and the nth VM does:

P(VMmin = n) = pn ∏i=1..n−1 (1 − pi)    (1)

Let Kiʹ be the number of servers to which the ith VM can be assigned, and Tʹ be the number of servers that host the target T VMs; then pi = Tʹ ∕ Kiʹ. Note that Tʹ ≤ T, as a subset of these VMs may co-locate with each other. An analysis of Tʹ is given in Section 5.2.

Least VM allocation policy

Under this policy, if all n VMs are started at the same time, previously selected hosts will not be chosen again. In other words, the ith VM (i = 1, 2, …, n) can only be assigned to one of the remaining K − (i−1) servers, i.e., Kiʹ = K − (i−1). Hence:

pi = Tʹ ∕ (K − (i−1))    (2)
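Equations (1) and (2) can be cross-checked by simulation. The sketch below is ours, not from the paper; it assumes, as in the model, that the targets occupy Tʹ distinct servers and that each of the attacker's simultaneously started VMs lands on a distinct, uniformly chosen server:

```python
import random

def sample_vmmin(K, T_prime, rng):
    """One draw of VMmin under the Least VM policy: the attacker's VMs
    occupy distinct servers in a uniformly random order, and T_prime of
    the K servers host target VMs."""
    servers = list(range(K))
    rng.shuffle(servers)                  # order in which servers are hit
    for i, s in enumerate(servers, start=1):
        if s < T_prime:                   # ids 0..T_prime-1 host targets
            return i

def model_mean(K, T_prime):
    """E[VMmin] from P(VMmin = n) = p_n * prod_{i<n}(1 - p_i) with
    p_i = T_prime / (K - (i - 1)), i.e., equations (1) and (2)."""
    mean, surv = 0.0, 1.0
    for n in range(1, K - T_prime + 2):   # p_n reaches 1 at n = K - T' + 1
        p_n = T_prime / (K - (n - 1))
        mean += n * surv * p_n
        surv *= 1 - p_n
    return mean

rng = random.Random(42)
K, T_prime = 100, 5
samples = [sample_vmmin(K, T_prime, rng) for _ in range(20000)]
print(sum(samples) / len(samples), model_mean(K, T_prime))
```

With K = 100 and Tʹ = 5, the empirical mean and the model agree closely, matching the closed form (K+1)/(Tʹ+1) for the first success position when drawing servers without replacement.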