This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCC.2016.2535320, IEEE Transactions on Cloud Computing

On the Design and Implementation of an Integrated Security Architecture for Cloud with Improved Resilience

Vijay Varadharajan, Senior Member, IEEE, and Udaya Tupakula, Member, IEEE

Abstract— In this paper, we propose an integrated security architecture which combines policy based access control with intrusion detection techniques and trusted computing technologies for securing distributed applications running on virtualised systems. Our security architecture incorporates access control security policies for secure interactions between applications and virtual machines in different physical virtualized servers. It provides intrusion detection and trusted attestation techniques to detect and counteract dynamic attacks in an efficient manner. We demonstrate how this integrated security architecture is used to secure the life cycle of virtual machines including dynamic hosting and allocation of resources as well as migration of virtual machines across different physical servers. We discuss the implementation of the developed architecture and show how the architecture can counteract attack scenarios involving malicious users exploiting vulnerabilities to achieve privilege escalation and then using the compromised machines to generate further attacks. The feedback between the various security components of our security architecture plays a critical role in detecting sophisticated, dynamically changing attacks, thereby increasing the resilience of the overall secure system.

Index Terms—Integrated Security Architecture, Access Control, Intrusion Detection, Trusted computing, Resilience

The authors are with the Advanced Cyber Security Research Centre, Faculty of Science, Macquarie University, Sydney, Australia.

I. INTRODUCTION

SECURITY issues play a vital role in every organisation, as greater availability of and access to information in turn imply a greater need to protect it and to ensure that it is properly used in secure decision making. Many access control mechanisms, languages and systems [1] have been proposed over the years to address the issue of access to information within systems. Such access control systems make certain basic assumptions about the state of the platform that is hosting and running the applications and systems software. There is an inherent trust placed on the underlying platform when a user or an upper-level application is authenticated or authorised to perform actions. In the current open networked world, with heterogeneous platforms and numerous software applications and system software running on them, it is important that this underlying trust assumption about the system state be properly examined. There are several reasons for this. First, computing platforms have become very powerful and can run many applications simultaneously. In particular, as the number of software applications increases, so does the likelihood of security vulnerabilities. These vulnerabilities in turn make the platform more vulnerable to attacks. Second, attacks themselves are becoming increasingly sophisticated. Furthermore, attackers also have easier access to ready-made tools that make the exploitation of platform vulnerabilities more effective. Third, platforms are being shared by multiple users and applications (belonging to different users), both simultaneously and at different times. Therefore there is a greater probability of the platform being left in a vulnerable state by different users and applications as and when they run. Finally, with the increase in the complexity of today's platforms, users themselves could be unaware of their platform vulnerabilities. Hence we believe there is a strong need not only for different security techniques such as access control, intrusion detection and trust management but, more importantly, for integrating them in such a way that they can interact with each other dynamically to continuously enhance the security of the system.

Let us now consider the motivation behind the need for such integration using a simple scenario. Consider, for instance, a user wishing to access a service S1 from a cloud provider in a virtualized environment. Assume an application A1 running in a virtual machine on a platform is providing the service S1. The access control model evaluates the request to access A1 in the virtual machine. The intrusion detection model determines at any instant whether there is any malware in the virtual machine. The trust management model determines whether the virtual machine and the platform have the attested properties to be trusted. The novel aspect of the integrated architecture arises from the feedback between the different security components, which caters for dynamic changes.
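To make the interplay in this scenario concrete, the following is a minimal Python sketch. It is hypothetical and not the authors' implementation: the names `VMStatus`, `IntegratedGuard`, `authorize` and `on_intrusion_alert` are assumptions, the access control policy is reduced to a table mapping a (user, service) pair to the VMs allowed to provide it, and the intrusion detection and attestation results are reduced to boolean flags.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VMStatus:
    """Hypothetical per-VM state combining the three security models."""
    vm_id: str
    attested: bool                   # trust management: attestation matched a reference state
    malware_detected: bool = False   # intrusion detection verdict


@dataclass
class IntegratedGuard:
    """Hypothetical controller: access control policy plus feedback from intrusion detection."""
    # Access control policy: (user, service) -> VMs allowed to provide that service.
    policy: dict[tuple[str, str], set[str]] = field(default_factory=dict)

    def authorize(self, user: str, service: str, vm: VMStatus) -> bool:
        # 1. Access control model: may this user use this service on this VM?
        if vm.vm_id not in self.policy.get((user, service), set()):
            return False
        # 2. Intrusion detection model: is the VM free of detected malware?
        if vm.malware_detected:
            return False
        # 3. Trust management model: do the VM and platform have attested properties?
        return vm.attested

    def on_intrusion_alert(self, vm: VMStatus) -> None:
        # Feedback: an intrusion alert refines the access control policy so that
        # the infected VM is no longer used to provide any service.
        vm.malware_detected = True
        for allowed_vms in self.policy.values():
            allowed_vms.discard(vm.vm_id)


if __name__ == "__main__":
    vm = VMStatus("vm-A1", attested=True)
    guard = IntegratedGuard({("alice", "S1"): {"vm-A1"}})
    print(guard.authorize("alice", "S1", vm))  # True
    guard.on_intrusion_alert(vm)
    print(guard.authorize("alice", "S1", vm))  # False
    print(guard.policy)                         # {('alice', 'S1'): set()}
```

A request succeeds only if all three models agree, and an intrusion alert immediately removes the infected VM from the policy. The same hook could equally be driven by a failed re-attestation, reflecting a change in trust rather than a detected attack.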
Now assume that new malware or a new attack is detected in the virtual machine by the intrusion detection model. This dynamic change in context can be addressed by updating the access control policies, which in turn ensures that the newly malware-infected virtual machine is not used in the provision of the service. The unique feature of the proposed security architecture is its ability to respond dynamically to security attacks and changes in trust, and to modify and refine the access control policies to take these changes into account. In this paper, we propose an integrated security architecture that combines policy-based access control with intrusion detection techniques and trusted computing technologies for securing the lifecycle of distributed applications running on virtual machine (VM) based systems. Our architecture considers properties of virtual machines based on their specific characteristics, such as the applications running in the VMs, the resources allocated to the VMs, and the policies associated with groups of VMs corresponding to a distributed application.

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Increasingly, security attacks in an open networked environment exhibit multifaceted malicious behaviour. In fact, many attacks first exploit weaknesses in access control and then spread the malicious activity. However, as access control and intrusion detection techniques are often implemented as separate tools, this may lead not only to inefficient detection but also to poor prediction of the potential consequences of attacks. For example, Stuxnet spread via malicious insiders obtaining unauthorized access to critical components (using default passwords and USB drives), and then exploited zero-day vulnerabilities in the Windows operating system. Hence integrating multiple security technologies such as access control, intrusion detection and trusted computing to detect and counteract attacks in a dynamic and efficient manner has increasingly become a necessity. In fact, it is the feedback between the various security components that plays a critical role in the ability of the secure system to detect sophisticated, dynamically changing attacks. This enhances the resilience of systems, which we believe could provide the foundation of future adaptive secure systems. This has been the motivation behind our work and the design of the integrated security architecture with improved resilience.

The paper is organized as follows. Section II discusses the threat model and considers the different security requirements of the different parties involved in a cloud infrastructure. Section III describes a common cloud scenario that highlights the need for an integrated security approach to deal with dynamic situations in the current cloud environment.
Section IV proposes a comprehensive integrated security architecture and describes in detail its various security components and their functionalities. Section V discusses how the proposed security architecture is used for securing the life cycle of virtual machines. Section VI describes the implementation of the developed architecture and shows how our architecture can be used to counteract a range of attack scenarios. Related work is referred to throughout the paper; in addition, Section VII summarises some relevant work on intrusion detection techniques for virtualisation and cloud environments. Finally, Section VIII concludes the paper.

II. THREAT MODEL

Consider a typical scenario where we have a distributed system with applications running on virtual machines (VMs) on top of hypervisors/virtual machine monitors (VMMs). As shown in Figure 1, there can be several hundred servers (grouped into clusters and managed by Cluster Controllers) in an Infrastructure as a Service (IaaS) cloud provider's environment hosting many tenants' virtual machines. The tenants can dynamically request a cloud provider to host one or more virtual machines. When the cloud provider receives a request, the decision as to whether to host a tenant's virtual machines is based on a range of parameters, such as the number of virtual machines required by the tenant, the resources available on a hypervisor/VMM, the service requirements of the tenant and the revenue derived from hosting the virtual machines.

Figure 1: Cloud Scenario (a Cloud Controller receives VM hosting requests from tenants; Cluster Controller I manages Platforms I1, I2, ..., In and Cluster Controller J manages Platforms J1, J2, ..., Jn, which host tenant VMs VM1, VM2, ..., VMx)

Currently, organisations in sectors such as healthcare [19], banking [20], government [21] and utilities [22] are migrating their data and services to the cloud. The document [21] describes the policy of the Australian government to make use of the public cloud to minimise the cost of its IT spending, as a measure of responsible spending of taxpayers' money. For example, one healthcare company claims [19] that it has saved $1.2 million by migrating its services to a cloud environment. These services can include sensitive data as well as critical services, which makes it vital that cloud providers take into consideration the security requirements involved in hosting tenant virtual machines. Hence, in addition to performance and resource management requirements, it is critical that the cloud provider take the security requirements of tenants into account. From the point of view of tenants, the security requirements can vary; some tenants may require more security services than others. For example, a tenant running financial services on its virtual machines is likely to need more security measures than a tenant providing basic web hosting. At the same time, cloud providers need to deploy security services and mechanisms to protect their own infrastructures. The security attacks can vary with the cloud deployment model (IaaS, PaaS or SaaS). However, there are more challenges in the case of IaaS public clouds, as in IaaS the tenants can run their own operating systems and applications in their virtual machines, and the cloud service provider may not have any knowledge of these virtual machines. Hence, unless specifically stated, the discussion in this paper relates to public IaaS cloud infrastructures.
Our cloud system architecture includes cloud system administrators, tenant administrators (or operators) who manage tenant virtual machines, and tenant users (or tenants' customers) who use the applications and services running in the tenant virtual machines. Cloud providers are entities such as Amazon (with EC2) and Microsoft (with Azure Virtual Machines) that have a vested interest in protecting their reputations. The cloud system administrators are individuals from these corporations entrusted with system tasks and the maintenance of cloud infrastructure, and they have access to privileged domains. Since cloud providers have a vested interest in protecting their reputations, we assume them to be trusted.

The tenants can host one or more virtual machines in the cloud. Consider Figure 2, where the tenant virtual machines are hosted on multiple physical servers.

Figure 2: Tenant Virtual machines hosted on different servers (VMM1 hosts TVM11 and TVM12 of Tenant 1; VMM2 hosts TVM13 of Tenant 1 and TVM21 of Tenant 2; VMM3 hosts TVM22 and TVM23 of Tenant 2)

Virtual machines belonging to different tenants can be hosted on the same physical server. In Figure 2, VMM1 and VMM3 host virtual machines that belong to a single tenant (Tenant 1 and Tenant 2, respectively), whereas VMM2 hosts virtual machines that belong to two different tenants. This is common and happens for efficiency reasons as part of resource allocation, where the cloud provider hosts different customers' virtual machines on a single physical server. Successive virtual machine hosting requests, say from competing organizations, could naturally result in the virtual machines of these organizations being co-located [23] on the same physical server. In such multi-tenant scenarios, unless specific security policies are enforced to achieve effective isolation, a malicious tenant can attack the co-located machines and services. Tenants themselves may wish to deploy security services to protect their virtual machines from other co-located tenants. For example, in [24] Amazon states that the security of tenant virtual machines is the responsibility of the tenants, since they are free to run any operating systems or applications (though Amazon claims to secure the underlying infrastructure). For instance, tenants can use host-based security tools to secure their virtual machines. However, such host-based security tools are themselves vulnerable to attacks. In some cases, tenants may even explicitly request the cloud provider to host their virtual machines in a separate cluster, not co-located with their competitors. Such multi-tenant scenarios are possible even in private cloud deployments. For example, virtual machines belonging to different departments within the same organization can be co-hosted on the same physical server. Hence the private cloud administrators may want to ensure that the virtual machines belonging to specific departments (e.g. payroll and sales) are not co-hosted on the same physical server.

An approach to simplifying the management of virtual machines hosted on different physical servers involves the use of virtual domains. A Tenant Virtual Domain (TVD) allows security policy based grouping of related virtual machines running on separate physical machines. The domain security policy determines which types of communication are allowed and which are not. For instance, a policy could allow free communication between the virtual machines within the TVD, whereas any communication external to the TVD can be restricted according to the security and usage policies defined by the TVD administrator. A virtual machine can be a member of multiple virtual domains. For example, auditing virtual machines (belonging to auditors) can be part of all customers' virtual domains.
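The TVD policy just described can be sketched as a default-deny communication check. This is an illustrative Python model under assumed names (`TVD`, `may_communicate`, the auditor VM `AUDIT` and the external host `internet-web` are all hypothetical): members of a shared TVD communicate freely, a VM may belong to several TVDs, and any flow leaving a TVD needs an explicit rule from the TVD administrator.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TVD:
    """Hypothetical Tenant Virtual Domain: a policy-based group of VMs."""
    name: str
    members: set[str] = field(default_factory=set)
    # Flows leaving the TVD that the TVD administrator explicitly permits: (src_vm, dst).
    external_allow: set[tuple[str, str]] = field(default_factory=set)


def may_communicate(src: str, dst: str, tvds: list[TVD]) -> bool:
    """Default-deny check: free flow inside a shared TVD, explicit rules outside it."""
    src_domains = [t for t in tvds if src in t.members]
    # Free communication between members of the same TVD.
    if any(dst in t.members for t in src_domains):
        return True
    # External communication only if one of the source's TVDs permits this flow.
    return any((src, dst) in t.external_allow for t in src_domains)


if __name__ == "__main__":
    # The auditing VM belongs to both tenants' domains.
    tenant1 = TVD("Tenant1", {"TVM11", "TVM12", "TVM13", "AUDIT"},
                  external_allow={("TVM13", "internet-web")})
    tenant2 = TVD("Tenant2", {"TVM21", "TVM22", "TVM23", "AUDIT"})
    tvds = [tenant1, tenant2]
    print(may_communicate("TVM11", "TVM12", tvds))         # True: same TVD
    print(may_communicate("TVM11", "TVM21", tvds))         # False: different TVDs
    print(may_communicate("AUDIT", "TVM22", tvds))         # True: auditor is in both
    print(may_communicate("TVM13", "internet-web", tvds))  # True: explicit external rule
    print(may_communicate("TVM12", "internet-web", tvds))  # False: no rule
```

For simplicity the sketch consults only the source VM's domains; a stricter policy would also check that the destination's TVD accepts the flow. Note also that the free intra-domain rule is what lets a compromised member reach every other member of its TVD.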

There can be different client virtual machines (such as those of users, developers, sales and HR) and server virtual machines (such as web, SQL and mail servers) within a TVD, which can be used in the provision of both in-house and external online services. Let us now consider the various threats that can arise in such an environment. One can identify three domains in this type of architecture that are relevant to the threat model. There is the tenant domain, comprising tenant administrators and tenant users; each tenant has its own tenant domain. There is the cloud system domain, which consists of cloud system administrators and the hypervisor/VMM platform (with its privileged domain and hardware). Then there is the cloud cluster domain, comprising the cloud system domains that constitute the cloud infrastructure. Let us first consider the threats in the tenant domain. For example, there can be attacks from tenant users (customers). The tenant organization can use different access control techniques such as Discretionary Access Control (DAC), Mandatory Access Control (MAC) and Role Based Access Control (RBAC) to ensure that only authorized users have access to resources. Even with these access control models, vulnerabilities can arise from downloaded software, the propagation of malware or flaws introduced during implementation. This can lead to users misusing their privileges or exploiting weaknesses in the operating system and applications to escalate their privileges and generate attacks. For example, the pass-the-hash¹ attack allowed user credentials to be changed at runtime, thereby gaining additional privileges on Windows systems [2, 25]. Also, the Cloud Security Alliance report [26] indicates that “76% of respondents believe that the likelihood of malicious insiders in the cloud is possible, likely, or frequent.” Threats can also arise from the exploitation of vulnerabilities in the services running in the tenant domain.
For example, the servers within a tenant virtual domain could be vulnerable to attacks such as buffer overflow, cross-site scripting and SQL injection if they are not properly secured. The TVD administrators can use security tools such as host-based intrusion detection and anti-virus software to protect their virtual machines. However, since these tools are themselves implemented within the tenant virtual machines, malware (or malicious users) can exploit these tools to generate attacks as well as alter the logs so that the attacks go unnoticed. Furthermore, if the security policy of a tenant, such as Tenant 1 in Figure 2, allows free communication between its virtual machines, then a malicious tenant user will be able to exploit vulnerabilities in TVM13 to generate attacks on the other virtual machines (TVM11, TVM12). Then there are attacks from the Internet targeting the virtual machines in the TVDs, and attacks from virtual machines in the TVDs targeting other TVDs and hosts on the Internet. For instance, a tenant's virtual machines can be the victim of a denial of service attack (or an attacker can compromise the virtual machines in the TVD and use them to generate attack traffic). As the tenant is often charged based on resource usage, these attacks can lead to tenant victims incurring significant financial losses. The elastic nature of the cloud aggravates this situation, as it enables the attacker to dynamically increase the resources used to generate distributed denial of service attacks. Such attacks can also lead to disputes between the cloud provider and its tenants. For example, if a tenant virtual machine is compromised and used to generate attack traffic with spoofed addresses, the cloud provider would want to charge the tenant for the generated attack traffic, whereas the tenant will deny these charges since the attack traffic carries a spoofed identity. The Spoofer project study [3] confirms that such spoofing is readily possible². Furthermore, the attacker can use the compromised tenant virtual machines to generate a range of attack traffic such as ICMP floods, UDP floods, TCP SYN floods and Smurf attacks. In addition, in the distributed environment, virtual machines may need to be migrated to different physical machines. For example, TVM11 running on top of VMM1 may migrate to VMM2. Hence the security architecture needs to ensure that a virtual machine is migrated securely and without conflicting with security policies.

¹ The Microsoft system that was designed to counteract the Pass the Hash attack was named Windows Blue. The aim was to prevent leakage of information due to escalation of privilege and lateral traversal of administratively privileged accounts. Windows Blue first hardened the system against the compromise of credentials and then limited the usability of compromised credentials. It also enhanced the ability to detect and remediate compromise.

III. NEED FOR INTEGRATED SECURITY

Currently there are different types of security techniques and models for securing virtual machines. These include secure virtualization techniques, access control techniques, intrusion detection techniques and trusted computing technologies. There are various forms of access control, such as discretionary access control, mandatory access control, role-based access control and the Chinese wall, as well as type enforcement and information flow based models.
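As an illustration of the Chinese wall policy listed above applied to VM placement, the following Python sketch checks whether a VMM may host a tenant's VM. It is a simplified model under stated assumptions: the conflict sets, the tenant names and the functions `in_conflict`, `can_host` and `choose_vmm` are hypothetical, and a conflict is evaluated only against the tenants currently running on the VMM (so a blocked VM becomes placeable once the conflicting VM terminates), whereas the classical Brewer-Nash Chinese wall model is based on access history.

```python
from __future__ import annotations

# Hypothetical conflict-of-interest classes: tenants in the same class
# must never have VMs running on the same VMM at the same time.
CONFLICT_SETS: list[set[str]] = [{"Tenant1", "Tenant2"}]


def in_conflict(a: str, b: str, conflict_sets: list[set[str]]) -> bool:
    """Two different tenants conflict if some conflict set contains both."""
    return a != b and any(a in s and b in s for s in conflict_sets)


def can_host(tenant: str, residents: set[str], conflict_sets: list[set[str]]) -> bool:
    """A VMM may host the tenant's VM if no current resident conflicts with it."""
    return not any(in_conflict(tenant, r, conflict_sets) for r in residents)


def choose_vmm(tenant: str, placement: dict[str, set[str]],
               conflict_sets: list[set[str]]) -> str | None:
    """Return the first existing VMM that can host the VM, or None if a new
    physical server is needed."""
    for vmm, residents in placement.items():
        if can_host(tenant, residents, conflict_sets):
            return vmm
    return None


if __name__ == "__main__":
    # Tenants currently running on each VMM, loosely following Figure 2
    # before Tenant 2's TVM21 is deployed.
    placement = {"VMM1": {"Tenant1"}, "VMM2": {"Tenant1"}, "VMM3": {"Tenant2"}}
    print(can_host("Tenant2", placement["VMM2"], CONFLICT_SETS))  # False
    print(can_host("Tenant3", placement["VMM2"], CONFLICT_SETS))  # True
    print(choose_vmm("Tenant2", placement, CONFLICT_SETS))        # VMM3
    print(choose_vmm("Tenant3", placement, CONFLICT_SETS))        # VMM1
```

Since a migration also places a VM on another VMM, the same `can_host` check could be applied to the migration target before a VM such as TVM11 is moved to VMM2.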
For example, sHype [4] uses label-based access control mechanisms to enforce mandatory access control, as well as mechanisms for enforcing a Chinese wall policy that prevents conflicting virtual machines from being hosted on the same VMM. In Figure 2, if the Chinese wall policy being enforced places Tenant 1 and Tenant 2 in the same conflict set, then their virtual machines cannot be hosted on the same VMM, as there can be information leakage. Hence the security architecture should not allow Tenant 2's VM to be deployed on VMM2 if Tenant 1's VM is already residing there. The system administrator has to deploy a new physical server to host Tenant 2's VM TVM21, or host that VM only when Tenant 1's VM TVM13 terminates or shuts down. However, the security architecture will allow virtual machines belonging to another tenant, say Tenant 3, that is not in a conflict set with Tenant 1 under the Chinese wall policy.

² Something of the order of 25% of autonomous systems permit such spoofing attacks.

Intrusion detection techniques have been proposed that make use of virtualisation technology for efficient detection of attacks. For example, techniques such as Livewire [5] have been proposed for intrusion detection on virtual machines. Virtual Machine Introspection (VMI) [5] techniques have been developed to inspect virtual machines from the hypervisor. Note that deploying the intrusion detection tools at the VMM instead of in the VMs helps to ensure that the tools themselves do not succumb to attacks on the VMs being monitored. Hence an intrusion detection tool at the VMM is better able to detect whether an attacker is exploiting vulnerabilities in a VM to generate attacks. See Section VII on Related Work for a discussion of relevant work on intrusion detection. Our contention in this paper is that all these technologies need to be integrated so that they can deal with dynamic and emerging attacks and achieve a more resilient and secure system. Access control models enforcing security policies that achieve strong isolation need to be integrated with techniques for detecting attacks. Trusted computing technologies need to be combined with access control to achieve dynamic management of access policies. It is the feedback between the detection of attacks and the access control models that is key to updating and refining policies to deal with the dynamic nature of attacks. Access control systems are successful in preventing access to resources by unauthorised users. However, if the attacker succeeds in obtaining high-level privileges through some means (e.g. by generating buffer overflow attacks or using stolen credentials), then most access control systems do not enforce any restrictions on the actions performed by the malicious user. Consider, for instance, the Bell-LaPadula model, which enforces read and write permissions between different security levels; it does not provide mechanisms to detect whether an entity is maliciously writing useless information to consume CPU, memory or disk resources, or is causing denial of service attacks.
In other words, attackers who have obtained such privileges can go on to perform malicious activities such as installing malicious software or altering legitimate applications, and then use these compromised systems to generate attacks. For example, the Trinity and TFN DDoS tools exploit buffer overflow vulnerabilities in RPC services (such as “statd”, “cmsd” and “ttdbserverd”) on Solaris systems and install several backdoors so that the attacker retains control. The compromised machines are then used to generate DDoS floods such as ICMP, SYN and UDP floods and Smurf-style attacks. Similarly, worms such as Code Red exploit a buffer overflow vulnerability in the Internet Information Server (IIS) to obtain higher privileges and then install malicious code for spreading the worm and conducting denial of service attacks. Such attacks are often detected by intrusion detection systems in the traditional environment, using signatures or the detection of anomalous behaviour. Although the attacker has obtained higher privileges, the actions performed by the malicious user (such as installing rootkits and altering code to hide the malicious process) either match signatures stored in the attack signature database or deviate from normal system behaviour, leading to their detection. Hence the intrusion detection system is able to raise alarms on such suspicious activities. This shows the need to integrate access control and intrusion detection security mechanisms, leading to more efficient detection of emerging attacks. However, intrusion detection systems also pose some additional challenges. For example, signature-based systems cannot deal with zero-day attacks, and anomaly-based tools can have high false alarm rates. Furthermore, it could be difficult to relate the malicious functionality identified by the intrusion detection tools to the vulnerabilities exploited in the access control modules.

The next logical functionality to consider in the integrated security approach is trust management, which is related to the integrity and trustworthiness of the platform itself. Currently most desktops, servers, laptops and even mobile devices have a trusted platform module (TPM) [6] installed in them at the time of shipment. The TPM technology provides mechanisms to measure and validate the system state, which is then used in the establishment of trust between systems. The notion of trust is the expectation that an entity will behave in a particular manner for a specific purpose. The basic idea is that, if the physical machine has a TPM, its mechanisms can be used to measure the state of the VMM at boot time and to confirm that the VMM has been brought into a trustworthy state, if the measurement matches some reference state. Once the VMM, with its access control and intrusion detection functionalities, is checked to be in a trustworthy state, the guest virtual machines can be loaded onto the secure VMM. However, using the TPM only at boot time is not sufficient, as it is subject to limitations such as Time of Check to Time of Use (TOCTOU) [7]. Attackers can exploit such weaknesses at runtime to generate attacks. For example, the Slammer worm exploits a vulnerability in Microsoft SQL Server and spreads using UDP packets. However, it resides only in the memory of the vulnerable machine and does not perform any read or write operations on disk. Boot-time TPM measurements are therefore unable to detect such attacks at runtime. Once again, integrated security mechanisms involving access control, intrusion detection and trusted computing could provide a better means of dealing with dynamic attacks.
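The boot-time measurement idea can be sketched in Python as a software model of a PCR extend, where each new register value is the hash of the previous value concatenated with the next measurement. This is not a TPM API: a real TPM performs the extend in hardware and reports PCR values through signed quotes, and the component names and the choice of a SHA-256 bank are assumptions for illustration. The last lines show why boot-time checking alone leaves a TOCTOU gap.

```python
from __future__ import annotations

import hashlib
import hmac

PCR_SIZE = 32  # SHA-256 bank (assumed); a PCR holds all zero bytes after reset


def extend(pcr: bytes, measurement: bytes) -> bytes:
    """Software model of a PCR extend: new = H(old || measurement)."""
    return hashlib.sha256(pcr + measurement).digest()


def measure_boot(components: list[bytes]) -> bytes:
    """Hash each boot component in order and fold it into one PCR value."""
    pcr = bytes(PCR_SIZE)
    for blob in components:
        pcr = extend(pcr, hashlib.sha256(blob).digest())
    return pcr


def is_trustworthy(measured: bytes, reference: bytes) -> bool:
    """Compare against the known-good reference value in constant time."""
    return hmac.compare_digest(measured, reference)


if __name__ == "__main__":
    # Hypothetical boot chain: boot loader, VMM, and the VMM's security modules.
    good_chain = [b"bootloader-1.0", b"vmm-4.6", b"access-control", b"ids"]
    reference = measure_boot(good_chain)

    tampered = [b"bootloader-1.0", b"vmm-4.6-rootkit", b"access-control", b"ids"]
    print(is_trustworthy(measure_boot(good_chain), reference))  # True
    print(is_trustworthy(measure_boot(tampered), reference))    # False

    # TOCTOU: the PCR only changes when something is measured. A memory-resident
    # worm arriving after boot is never measured, so the reported value still
    # matches; only an explicit re-measurement would expose it.
    pcr_reported = measure_boot(good_chain)
    running_code = good_chain + [b"memory-resident worm"]
    print(is_trustworthy(pcr_reported, reference))               # True despite infection
    print(is_trustworthy(measure_boot(running_code), reference))  # False when re-measured
```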



IV. INTEGRATED SECURITY ARCHITECTURE A. Requirements In this section, first we summarize the requirements and design issues for the development of the integrated security architecture. • The architecture needs to support virtual machines belonging to different tenants on the same physical server. Also as each tenant is able to host multiple virtual machines, the cloud provider should provide facilities for the tenants to simplify the management of their virtual machines and enforce domain wide security policies on their virtual machines. • The TVD administrator determines the domain wide security policies that need to be enforced on each virtual machine. The TVD administrator needs to validate whether the physical platform has the capabilities to enforce domain wide security policies and specific security policies related to its virtual machines. Before a virtual machine is included within a domain, the security policy of the domain needs to be satisfied. • Even if multiple virtual machines running on the VMM have same operating system, they can have different configuration, different applications running on them and different amount of resources could have been allocated to these applications. Hence the attack characteristics (and





the attack surface) can considerably vary for each virtual machine. For example, if very few resources are allocated to a particular VM, the attack traffic threshold for that particular VM can be low. Hence there is a need to determine and define attack signatures specific to each virtual machine. The architecture should be able to adopt both preventive and proactive approach to deal with various types of attacks such as polymorphic and metamorphic and zeroday attacks in an efficient manner. Let us consider how the term efficiency relates to the tenants and the cloud service provider. From a tenant point of view, one efficiency measure is that it should enable the tenant to make use of and pay only for the security functionality that it requires and uses. Hence the modular security component based architecture enables different security functionality (such as access control, intrusion detection and trust management) can be selected on the basis of tenants’ requirements and need, contributing to their efficiency. From the cloud provider point of view, efficiency arises by accommodating a new VM hosting request onto physical servers that are already running virtual machines instead of hosting it on a new physical server. With our security architecture, such hosting is performed without violating the security requirement of the VMs that are already running on a server as well as meeting the security requirements of the incoming VM. Having said this, the main focus of this paper is on the design and implementation of the integrated security architecture, and not on the efficiency. A unique feature of our architecture is its ability to adapt to dynamic changes thereby achieving increased resiliency. For instance, if attacks are detected by the intrusion detection components then the access control policies are dynamically changed taking into account of these attacks. 
It is the feedback between the various security components that enables our security architecture to be adaptive and hence more resilient. We believe this feature is critically important for the design of future secure systems.

The ability of a VMM-based security architecture to detect and prevent attacks depends on knowledge of the operating system semantics and of the applications running in each virtual machine. Due to continuous updates to operating systems and applications, there is a need to keep track of these changes and to dynamically update the VMM's knowledge of the behavior of the applications in the virtual machines. Although the VMM has control over the virtual resources and has access to the contents of the different registers, there is a semantic gap between the knowledge of the VMM and that of the virtual machines.

The architecture should have mechanisms to gather information from multiple locations (such as the source and the destination) and combine it to detect potential attacks. For example, in the case of distributed denial of service (DDoS) attacks, each compromised machine may contribute only a small amount of attack traffic, and it may not be possible to detect the attack at the source. In such cases, the attacks can be detected at the destination. In the case of worms, the attacks can be detected by combining the information captured from multiple machines or

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
infected hosts. Attacks should be identified at a fine-grained level so that only the attack traffic is blocked while all other traffic is unaffected. Hence it is necessary to identify the malicious application or process dynamically and isolate only the malicious entities. Finally, the architecture should have mechanisms that allow feedback between the various security components, thereby enabling dynamic changes in context to refine or modify the access control policies.
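The DDoS example in the requirements above (attack traffic that is too small to flag at any single source but obvious at the destination) can be sketched as a simple aggregation, assuming each observation is a (source VM, destination, packets-per-second) triple and that fixed per-source and per-destination limits are used. The function name `detect` and both limits are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Flow = Tuple[str, str, float]  # (source VM, destination, packets per second)


def detect(flows: Iterable[Flow],
           per_source_limit: float,
           per_dest_limit: float) -> Tuple[List[str], List[str]]:
    """Return (sources over their limit, destinations over their limit)."""
    by_source: Dict[str, float] = defaultdict(float)
    by_dest: Dict[str, float] = defaultdict(float)
    for src, dst, pps in flows:
        by_source[src] += pps   # view available at each source VMM
        by_dest[dst] += pps     # combined view at the destination
    src_alerts = sorted(s for s, rate in by_source.items() if rate > per_source_limit)
    dst_alerts = sorted(d for d, rate in by_dest.items() if rate > per_dest_limit)
    return src_alerts, dst_alerts
```

For example, 50 compromised VMs each sending 100 packets/s stay under a 1000 packets/s per-source limit, yet the destination sees 5000 packets/s and exceeds a 2000 packets/s limit.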

B. Security Architecture Overview

Recall that the tenants can host one or more virtual machines in the cloud and that the TVD administrator determines both the security policies specific to each virtual machine and the domain-specific policies. VM-specific policies are developed by considering the operating system, the applications running in the virtual machine and the resources allocated to the virtual machine. Domain-specific policies are the policies of the domains of which the VM is a member. Since each virtual machine can belong to several trusted virtual domains, the VM interactions (both physical and virtual) have to be monitored against the security policies of every TVD of which the VM is a member.

[Figure 3. Security Architecture Components. Recoverable labels: TVM; VMM / Dom 0 hosting the Security Architecture, comprising the DEE (EV, ICL, INSP, TA, IS, SM), the ACM (DAC, MAC, RBAC, ...) and the IDE (Signature, Anomaly, TVD, ...); Physical Devices; TPM.]

Figure 3 shows the components of the security architecture, where each physical server is equipped with a hardware Trusted Platform Module (TPM). Within the VMM, the security functions of access control, intrusion detection and security decision evaluation are implemented as the Access Control Module (ACM), the Intrusion Detection Engine (IDE) and the Decision Evaluation Engine (DEE) respectively. In this section, we first describe these architecture components; in the next section we discuss how they are used to secure the life cycle of virtual machines, including secure hosting of a virtual machine, its secure operation, secure transactions and secure migration. In the case of a private cloud, the administrators are aware of the security requirements of their virtual machines. For example, multilevel security (e.g. the Bell-LaPadula model) may typically be used in a defence environment. In this case, subjects (including VM users, VM applications and cloud administrators) and objects (such as files in VMs) are assigned security labels, and the rules associated with the multilevel security model are applied. However, in the case of a public cloud, the cloud provider is usually not aware of the
security mechanisms of the tenants, which can vary from tenant to tenant. Hence we assume that each tenant provides the cloud provider with the security requirements for its VMs as part of the Service Level Agreement (SLA). The security-related information of a tenant includes the security labels, details of subjects (such as users, their roles and their privileges) and objects (files, registers and data), the critical applications running in the virtual machine, and the critical functions that may accept inputs from untrusted sources, all of which are relevant to the security of the tenant virtual machines. Note that a multilevel integrity model (such as the Biba model) is also applicable in a commercial environment. In this case the subjects and objects are associated with integrity levels, and access to an object by an application (subject) is determined by their respective integrity levels. For instance, if Internet Explorer (IE) has been assigned a lower integrity level than system objects (e.g. registry keys) in a Windows system, then IE will not be able to modify these higher-integrity objects. Such a scheme (Mandatory Integrity Control) has been used in Windows since Windows Vista. We also assume that the cloud service provider makes the security options (such as the ACM, IDE and TPM) and their pricing available to the tenants. It is important to note that all the security components incur additional overhead for the cloud provider. Hence the cloud provider offers pricing for the different security options, from which the tenants can select. The aim is to provide the best possible security for the tenants that require it. Note that the Security as a Service model described in [28] is of particular interest to cloud service providers, as it enables them to charge a premium for higher-level security while guaranteeing a default level of security to all tenants (thereby helping to improve the security of the ecosystem).

C. Decision Evaluation Engine

We first consider the Decision Evaluation Engine (DEE), as it is the critical component of the architecture that makes the security decisions related to the tenant virtual machines hosted on each physical server. The DEE captures the events that are of interest to the tenant as specified in the SLA. For example, let us consider how the DEE captures the events related to runtime state monitoring of virtual machines whenever the set of processes running in a virtual machine changes. The trigger for process changes is obtained from the CR3 register (the x86 page-table base register), whose value can be used to identify process creation and termination events. To acquire the CR3 value: when a process is executed, it first calls the execve syscall. By reading the ebx register when the syscall is executed (on 32-bit x86 Linux, ebx holds the first syscall argument, which for execve is a pointer to the filename), we determine the process name. However, the CR3 value at the time this syscall is made belongs to the parent process. During the execution of the syscall, all resources belonging to the parent process are released and the CR3 value is updated. Therefore, on a Linux system, by monitoring the CR3 update (a mov to CR3) in the execve syscall context, we obtain the new process's CR3, as no other CR3 update can occur while context switching is disabled. With Windows, one needs to monitor all CR3 values from boot time and detect each newly used CR3, which belongs to a new process. This enables us to maintain a mapping between CR3 values and processes. Whenever a process
dies, which is detected via the exit_group syscall, its CR3 is removed from the mapping. This procedure was used to identify the execution context of system calls and to obtain the state of the virtual machines. The DEE has several subcomponents and interacts with the ACM and IDE components to make security-related decisions. The subcomponents of the DEE include Entity Validation (EV), Information Capture and Logging (ICL), Introspection (INSP), Taint Analysis (TA), Information Sharing (IS) and Secure Migration (SM). EV and ICL are active for all events on the virtual machines, whereas TA is invoked only for specific actions: taint analysis [8] is invoked only when the INSP, the ACM and/or the IDE detects suspicious activity in the virtual machine. The Information Sharing (IS) component is used only for sharing attack information between different secure VMM-based physical servers in the distributed environment. Let us now consider the subcomponents in detail.

Entity Validation (EV): This component is responsible for identifying the entity at a fine-grained level and validating it against the specified policies. Note that the entity can vary depending on the action associated with the virtual machine. For example, before a VM is hosted on the VMM, the VM itself is considered the entity. After the VM is hosted, the processes running in the VM can be entities. The entity is decided based on the events. For example, in the case of validating online transactions, the entities are the processes that are creating and receiving the traffic as well as the virtual machines in which these processes reside. We have identified different events for enforcing security policies: virtual machine events (start, suspend, resume, restart, stop), validation events such as resource usage (CPU, memory and network), and transaction events (start, ongoing, complete).
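The CR3-based process tracking described at the start of this subsection (Linux guest case) can be sketched as a small state machine. This is a minimal sketch, assuming a hypothetical introspection layer that delivers execve entry/exit, CR3-write and exit_group events per vCPU; the class and hook names are illustrative, and the Windows approach (tracking every newly seen CR3 from boot) is not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Cr3ProcessMap:
    """Hypothetical per-VM bookkeeping of CR3 value -> process name (Linux guest)."""
    cr3_to_proc: Dict[int, str] = field(default_factory=dict)
    # vCPU id -> filename captured at execve entry, waiting for the new CR3
    pending_exec: Dict[int, str] = field(default_factory=dict)

    def on_execve_entry(self, vcpu: int, filename: str) -> None:
        # filename would be read from guest memory at the address held in ebx
        self.pending_exec[vcpu] = filename

    def on_cr3_write(self, vcpu: int, new_cr3: int) -> Optional[str]:
        # Only the CR3 load inside a pending execve identifies a new process image;
        # any other CR3 write is treated as an ordinary context switch.
        name = self.pending_exec.pop(vcpu, None)
        if name is not None:
            self.cr3_to_proc[new_cr3] = name
        return name

    def on_execve_exit(self, vcpu: int) -> None:
        # A failed execve never reloads CR3; drop the pending entry so the next
        # context switch on this vCPU is not misattributed.
        self.pending_exec.pop(vcpu, None)

    def on_exit_group(self, cr3: int) -> Optional[str]:
        # Process death: remove its CR3 from the mapping.
        return self.cr3_to_proc.pop(cr3, None)

    def process_for(self, cr3: int) -> Optional[str]:
        return self.cr3_to_proc.get(cr3)
```

In use, an execve of /bin/ls followed by a CR3 load of 0x1A2B000 on the same vCPU binds that CR3 to /bin/ls until the matching exit_group removes it.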
After the entity is determined by the entity validation component, the DEE makes a decision by considering the security policies in the ACM and IDE components.

Information Capture and Logging (ICL): This component is used for capturing information related to the virtual machine and its domain, and for logging specific events related to the virtual machine. We capture information such as which applications are running in each VM, the processes related to specific applications, and behaviour information such as which process within an application is used for checking for updates and the intervals between updates. For example, with Windows XP machines, we observed that the typical default behaviour is to check for security updates at 3:00 AM every day. If Sophos anti-virus is installed in the VM, then processes such as swi_service.exe, swc_service.exe, SAVadminservice.exe and savservice.exe are related to Sophos, and the process alupdate.exe is dynamically invoked (e.g. every 10 minutes) to check for updates of attack signatures from the remote Sophos server. This information is used in the specification of security policies for that VM. A process failing to check for updates at the expected intervals is considered suspicious behaviour. Hence if a VM with Sophos anti-virus software is infected with Conficker malware, it will fail to check for updates, since Conficker disables automatic updates and security tools in the infected system. In terms of logging, we have a default mode of logging for tenant virtual machine traffic, specified by the
cloud provider for its own purposes. There can be additional event logging specified by the tenants according to their requirements. For example, a tenant can request logging of user accesses to files that it considers important, as well as the type of each access. Even if an attacker succeeds in compromising a tenant virtual machine, the attacker will not be able to access the logs in the ICL. Hence the TVD administrator is able to use these logs for analyzing attacks on the tenant virtual machines. Furthermore, the tenant can also use these logs for monitoring the behavior of its users.

Information Sharing (IS): This component is used for sharing the information needed to make decisions on security-related events. For example, if the IDE blocks a user after detecting privilege escalation by one of the tenant's users, then the details of the blocked user are shared with all the servers that are hosting the tenant's virtual machines. If a new malicious process is detected on a specific OS or application of a tenant virtual machine, then the details have to be shared with all the virtual machines that are running the vulnerable applications and operating systems. This component is also useful for blocking attacks on tenants at the boundary device of the cloud service provider. For instance, if a tenant virtual machine is experiencing a DDoS flood attack with ICMP traffic, then the component can alert the boundary device to drop the malicious traffic.

Secure Migration (SM): This component is used to validate the capability of a remote physical server before a virtual machine is migrated to that server. SM determines the resource and security requirements of the virtual machine to be migrated. The requirements are determined either from the tenant's SLA or from the ICL, and are then validated against the remote server.
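The ICL update-interval heuristic described above (an updater such as alupdate.exe that stops checking in at its expected interval is suspicious) can be sketched as follows. This assumes the ICL records the last time each updater process was observed; `UpdateProfile`, `overdue_updaters` and the 50% tolerance are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class UpdateProfile:
    process: str            # e.g. "alupdate.exe" for Sophos signature updates
    interval_s: float       # expected interval between update checks, in seconds
    tolerance: float = 0.5  # hypothetical slack as a fraction of the interval


def overdue_updaters(profiles: List[UpdateProfile],
                     last_seen: Dict[str, float],
                     now: float,
                     monitor_start: float) -> List[str]:
    """Return the updater processes whose check-in is overdue.

    last_seen maps a process name to the time it was last observed running;
    a process never observed is measured from monitor_start.
    """
    overdue = []
    for p in profiles:
        seen = last_seen.get(p.process, monitor_start)
        if now - seen > p.interval_s * (1 + p.tolerance):
            overdue.append(p.process)
    return overdue
```

With a 600 s interval, a VM whose updater was last seen 1000 s ago would be reported, which is the behaviour expected of a Conficker-infected VM whose updates have been disabled.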
Taint Analysis (TA): This component is invoked if the DEE chooses to perform taint analysis on a tenant virtual machine. In our architecture, we have deployed the Pin tool developed by Luk et al. [8] for process-level taint analysis. Note, however, that in our current architecture taint analysis is used only in an offline manner.

D. Access Control Module

The access control module (ACM) is used to evaluate different security policies. In our architecture, we have made use of the sHype implementation to evaluate type enforcement and information flow security policies. We have extended the implementation to specify role-based and trust-level-based security policies. The DEE captures the events that are of interest to the tenant and redirects them through the ACM component for policy enforcement. For example, when an application running on a VM wants to access a resource, it uses a system call such as read or write to raise a request to the guest OS. These system calls are therefore redirected through the ACM component to determine whether the application is allowed to perform the requested operation on the resource. As mentioned above, our architecture supports several access control security policies. The type enforcement policy is used to control sharing between running VMs. This is achieved with the use of security labels: virtual machines and the applications running in them are assigned security labels. For initial labeling of virtual resources, the ACM exports an acm_init call that determines the security label of a virtual
resource based on the resource type and identity. The ACM also exports an acm-authorize function, which takes the requesting subject (e.g. a virtual machine or application), the virtual resource label and the operation type as parameters, and decides whether access is permitted or denied according to the security policy. The ACM stores security policy information locally in the hypervisor and provides security policy management through a privileged H_security hypervisor call. With the type enforcement policy, two virtual machines (or domains) can communicate with each other if and only if they have at least one security label in common. The normal policy rule for a requesting subject (virtual machine or application) accessing a virtual resource (such as a vLAN or virtual memory) is that access is permitted if and only if the label of the requesting subject dominates the label of the resource. The Chinese wall policy determines which VMs can run simultaneously on the same hardware platform supported by a VMM. This is achieved by specifying conflict sets, which define sets of types that are not permitted to run simultaneously on the same VMM: of the types in a conflict set, at most one can run on the same VMM at a time. We again define the types in terms of VMs and applications. The architecture is also able to support information flow policies. In our current architecture, each VM can have a single label or multiple labels. Information flow policies govern flows between labeled VMs and VMM resources. In the case of VMs with multiple labels, the VMM needs to trust the VM to correctly associate each label with the corresponding information flowing out of the VM and to prevent leaking of information between labeled entities within the VM. We have extended the sHype architecture to support role-based access control policies.
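The type enforcement and Chinese wall rules above can be expressed as two small predicates. This is a minimal sketch over plain label sets, not the sHype code; the function names and the example labels ("BankA", "BankB", "Web") are hypothetical.

```python
from typing import Iterable, List, Set


def te_can_communicate(labels_a: Set[str], labels_b: Set[str]) -> bool:
    """Type enforcement: two VMs may communicate iff they share at least one label."""
    return bool(labels_a & labels_b)


def chinese_wall_allows(candidate_types: Set[str],
                        running_types: Iterable[Set[str]],
                        conflict_sets: List[Set[str]]) -> bool:
    """Chinese wall: within each conflict set, at most one type may run on a VMM."""
    running: Set[str] = set().union(*running_types)
    for cs in conflict_sets:
        # Types from this conflict set that would be present if the candidate started
        if len((running | candidate_types) & cs) > 1:
            return False
    return True
```

With a conflict set {"BankA", "BankB"}, a BankB VM would be refused on a VMM already running a BankA VM, while a second BankA VM would be accepted, since only one type from the set would then be present.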
In designing the role-based access control model, we needed to specify several aspects: the different roles and the privileges associated with them, who specifies the role-based access control policies on which objects, and how the role-based policies are evaluated. In the case of public clouds, there can be role-based access policies specified by the tenants as well as role-based access policies defined by the public cloud provider. Our architecture provides an instance of each of these cases. For the case of tenant-specified role-based access control policies, we have developed a tenant hosting healthcare services. In this scenario, there are different roles within the tenant, such as nurses, doctors and admin staff, where each role has a different level of access to the patients' data stored in the cloud. For example, nurses only have read access to patients' clinical data records, whereas doctors have both read and write access to patient clinical data but cannot create new patient records. The practice-admin role is able to create new patient records and write to the personal data part of the patient records, but cannot update the clinical data part of the patient records. The tenant then shares its role-based access control policies and its role-based access evaluation engine with the cloud provider. The cloud provider integrates this access evaluation with its own role-based access evaluation to obtain the combined access decision. Additionally, the tenant can specify as part of the service level agreement that the provider needs to monitor and ensure that the
tenant's role-based access policies are satisfied. For the case of cloud-provider-defined role-based access policies, we have defined different roles for the users (in the cloud provider) who may need to access and manage the tenant virtual machines and the resources of the cloud infrastructure. Our architecture provides a simple role-based access control model for users in the cloud provider domain, in which we have identified four types of users. From the tenant perspective, there are the two traditional roles, cloud_tenant_user (ctu) and cloud_tenant_administrator (cta). From the cloud provider perspective, there are the cloud_provider_administrator (cpa) and the cloud_system_administrator (csa). For instance, the cloud_tenant_administrator has privileges to read alerts from the cloud infrastructure, such as role-based policy violations or intrusion detection policy violations; the administrator can in turn use these to stop the application or service that is attacking either the cloud infrastructure or other tenant virtual machines. The cloud_provider_administrator interacts with the tenants (e.g. the cloud_tenant_administrator) at the time of registration. The cloud provider administrator has privileges to specify and manage the security policies of the cloud provider. In our architecture, these include the specification of security labels and of label-based type enforcement, Chinese wall and information flow security policies, as well as intrusion detection policies. For instance, the cloud provider administrator role enables responses to attacks, such as rate limiting or dropping traffic when an attack is detected. The cloud provider administrator is also responsible for ensuring the implementation of the tenant's role-based policies. The cloud provider administrator role also has privileges to generate alerts and to access the logs of the cloud_tenant_administrator.
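The healthcare tenant's role-based policy described above can be written down as a role-to-permission table. This is a minimal sketch, assuming a default-deny rule for any permission the text does not mention; the role identifiers and the object names "clinical", "personal" and "record" (creation of a new patient record) are hypothetical.

```python
from enum import Enum
from typing import Dict, Set, Tuple


class Op(Enum):
    READ = "read"
    WRITE = "write"
    CREATE = "create"


# (object part, operation) pairs; object names are hypothetical
Permission = Tuple[str, Op]

ROLE_PERMISSIONS: Dict[str, Set[Permission]] = {
    "nurse": {("clinical", Op.READ)},
    "doctor": {("clinical", Op.READ), ("clinical", Op.WRITE)},
    "practice_admin": {("record", Op.CREATE), ("personal", Op.WRITE)},
}


def tenant_rbac_permits(role: str, obj: str, op: Op) -> bool:
    """Default deny: permit only operations explicitly granted to the role."""
    return (obj, op) in ROLE_PERMISSIONS.get(role, set())
```

Under this table a nurse may read but not write clinical data, a doctor may write clinical data but not create records, and the practice admin may create records but not update clinical data, matching the scenario in the text.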
The cloud_system_administrator role is concerned with the management of multiple platforms forming a cloud cluster; in this paper, we will not consider this role any further. Even such a simple role-based model allows us to partition the privileges into various categories, thereby ensuring that only users in appropriate roles can perform administrative operations and access sensitive information and services related to the tenants and the cloud infrastructure. This not only gives the benefits of a role-based system, such as flexible security management and a better ability to handle dynamic changes in the user population, but also helps to make the system more accountable. In terms of access evaluation, we combine the cloud provider security policies with the tenant access policies to derive an overall decision for a request. A request from a virtual machine will have a security label and a domain identifier associated with it. These are used in the evaluation of the cloud provider's type enforcement, Chinese wall, information flow and domain security policies. The request will also indicate whether it is a tenant user role request or a tenant administrator role request. This attribute is used in the cloud provider's role-based access policy, and the result is then combined with the tenant's role-based access policies. The request from the virtual machine will also have user and role attributes associated with it. These attributes are used in the evaluation of the tenant's role-based access policies, which have been shared by the tenant
with the cloud provider. In our current implementation of the security architecture, the cloud provider security policies are evaluated first for a request coming from a tenant virtual machine. If this evaluation succeeds, the tenant's role-based access policies are then evaluated for the request to derive the final result.

E. Intrusion Detection Engine

The IDE monitors the resources used by each virtual machine, including memory, CPU, network and disk space, as well as the interactions between virtual machines. The IDE carries out signature-based, anomaly-based and virtual-machine-introspection-based analysis to detect various types of attacks. If any packet from a virtual machine matches a known attack signature, the packet is dropped. In addition, the entity (e.g. an application such as a web browser) that generated the attack packet is identified by querying the knowledge base repository, and the entity is isolated and subjected to further analysis. If a packet does not match any of the attack signatures but is found to be suspicious by the anomaly engine, then the details of the virtual machine or entity that generated the packet are stored in the shared packet buffer and a copy of the packet is analyzed further. It is important to note that when a packet is found to be suspicious, the details of the virtual machine are stored in the shared buffer; this is to deal with the metamorphic and polymorphic characteristics of the attacks. The detection engine then carries out further analysis on all future packets from that particular virtual machine. In our architecture, we perform taint analysis on the packets that the detection engine considers suspicious. The suspicious packets are tagged and passed on to the destination, where the destination VMM and virtual machine determine the malicious activity being performed by the incoming suspicious packets.
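The IDE packet triage described above can be sketched as follows. This is a simplified illustration, assuming byte-pattern signatures and a pluggable anomaly predicate standing in for the anomaly engine; `Triage`, `Verdict` and the `clear` hook are hypothetical names, and entity isolation and the actual taint analysis are not modeled. The set `suspicious_vms` plays the role of the VM entries in the shared packet buffer.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Set


class Verdict(Enum):
    DROP = "drop"            # matched a known attack signature
    FORWARD_TAGGED = "tag"   # suspicious: forwarded tagged for further analysis
    FORWARD = "forward"      # no signature match and no anomaly


@dataclass
class Packet:
    src_vm: str
    payload: bytes


@dataclass
class Triage:
    signatures: List[bytes]                  # hypothetical byte-pattern signatures
    is_anomalous: Callable[[Packet], bool]   # stand-in for the anomaly engine
    suspicious_vms: Set[str] = field(default_factory=set)

    def inspect(self, pkt: Packet) -> Verdict:
        if any(sig in pkt.payload for sig in self.signatures):
            return Verdict.DROP
        if pkt.src_vm in self.suspicious_vms or self.is_anomalous(pkt):
            # Remember the source VM so that all its future packets are analyzed,
            # which helps against polymorphic/metamorphic variants.
            self.suspicious_vms.add(pkt.src_vm)
            return Verdict.FORWARD_TAGGED
        return Verdict.FORWARD

    def clear(self, vm: str) -> None:
        """Remove a VM once analysis finds its suspicious traffic benign."""
        self.suspicious_vms.discard(vm)
```

Note the design choice mirrored from the text: suspicion attaches to the source VM rather than to a single packet, so a later packet that looks harmless is still tagged until the VM is cleared.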
Taint analysis helps to identify different types of attacks, such as format string attacks, SQL injection, command injection, cross-site scripting and privilege escalation. While a decision is being made on the suspicious packets, all traffic from the suspicious virtual machine is either dropped or rate limited, depending on the properties exhibited by the suspicious packet. If the suspicious packet exhibits serious properties such as a spoofed source address, then the packets are dropped. On the other hand, if there is a sudden burst of traffic, the traffic is rate limited. The analysis can also be used to develop a new attack signature based on the properties exhibited by the malicious packet, using techniques such as those described in [9], and the attack signature database in the detection engine can be updated accordingly. If, after the analysis, the packet is found to be benign, the details of the suspicious virtual machine are removed from the shared packet buffer, and any future packets from that virtual machine are monitored by the detection engine only in the usual manner. In our architecture, an important application of the IDE is to detect attacks by monitoring for violations of the access control policies enforced by the ACM. Recall that with role-based access policies, a tenant updates both the ACM and the IDE. The IDE analyses the runtime state of the tenant virtual machine and compares it with the access control policies stored in the ACM. If there is any change in the privileges of
the users and/or unauthorized access to files by the users compared with the policies stored in the ACM, then the IDE terminates the malicious session and raises an alert to the tenant administrator. The IDE also detects attacks by monitoring processes for anomalous behavior, such as missing path information for a running process, the parent process that invoked a process, a change of path for important processes, the privileges of running processes, and a high-privilege process invoked by a low-privilege process. For example, a high-privilege process invoked by a low-privilege process indicates a privilege escalation attack, and a process with no path information indicates a code injection attack. Let us consider how the IDE module detects common privilege escalation attacks by users. One well-known class of privilege escalation attacks exploits features such as setuid on Unix platforms, where users are temporarily given higher privileges for specific actions such as changing a password. Attackers often exploit vulnerabilities in such programs to escalate their privileges and perform malicious actions. Hence, when the IDE analyses the tenant virtual machine, it can detect users with high-level privileges trying to exploit the system by comparing their privileges against the policies in the ACM. The IDE can then terminate this user's session. It can also request a dynamic change to the ACM to block the actions of the malicious user, and this update can be sent to the other servers (VMMs) that are hosting the tenant's virtual machines.

V. SECURING THE LIFE CYCLE OF VIRTUAL MACHINES

In this section we describe how our architecture is used for securing the life cycle of tenant virtual machines.

A. Secure Hosting

Before hosting a virtual machine, the EV component in the DEE determines whether the virtual machine conflicts with any of the access control policies enforced in the ACM module.
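The process-level IDE checks described just before this section (missing executable path, an important process running from an unexpected path, a privileged process started by an unprivileged parent) can be sketched over a snapshot of guest processes obtained by introspection. This is a minimal sketch under stated assumptions: `ProcInfo`, `process_alerts` and the `allowed_elevation` whitelist are hypothetical, and the whitelist exists because legitimate setuid programs such as passwd would otherwise trip the privilege-escalation rule.

```python
import posixpath
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class ProcInfo:
    pid: int
    ppid: int
    path: Optional[str]  # None if no backing executable path was found
    privileged: bool     # e.g. effective uid 0 in a Linux guest (assumption)


def process_alerts(procs: Dict[int, ProcInfo],
                   expected_paths: Dict[str, str],
                   allowed_elevation: Set[str]) -> List[str]:
    """Flag the process anomalies listed in the text for one snapshot."""
    alerts = []
    for p in procs.values():
        name = posixpath.basename(p.path) if p.path is not None else None
        if p.path is None:
            alerts.append(f"pid {p.pid}: no executable path (possible code injection)")
        elif name in expected_paths and p.path != expected_paths[name]:
            alerts.append(f"pid {p.pid}: {name} runs from unexpected path {p.path}")
        parent = procs.get(p.ppid)
        if (p.privileged and parent is not None and not parent.privileged
                and name not in allowed_elevation):
            alerts.append(f"pid {p.pid}: privileged child of unprivileged pid "
                          f"{parent.pid} (possible privilege escalation)")
    return alerts
```

For instance, a privileged "sshd" launched from /tmp by an unprivileged shell would raise both a path alert and a privilege escalation alert, while a whitelisted passwd would not.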
If a conflicting virtual machine is already running on the VMM, then the DEE prevents the new virtual machine from being hosted. If hosting the virtual machine does not conflict with any of the running virtual machines, then the DEE evaluates other factors, such as the available resources. If the DEE then decides to host the virtual machine, the TPM is used to measure the state of the virtual machine and ensure that it is trustworthy at boot time. A Trusted Platform includes a Trusted Platform Module (TPM) chip, a Core Root of Trust for Measurement (CRTM), the TCG Software Stack (TSS) and the related certification. When a platform is booted, the CRTM measures itself to ensure that it has not been compromised and records the measured value in a Platform Configuration Register (PCR) of the TPM; a PCR is extended rather than overwritten, i.e. its new value is the hash of its old value concatenated with the new measurement. The CRTM then passes control to the first measurement agent. A bootstrapping process follows, and whenever a software module is loaded, it is measured and the measurement is extended into the PCRs. Hence at every boot, the TPM records the measurement values of all the software components of the Trusted Platform. This allows a verifier (e.g. through attestation) to check that the VMM and the VMs are in a secure state at boot time.

B. Secure Operation

Now let us consider how the DEE ensures secure operation of
the virtual machines. The EV and ICL components in the DEE are invoked for all actions on the virtual machines. The entity validation component identifies the entities at a fine-grained level and determines which security policies in the ACM or the IDE need to be enforced on their interactions. The ICL is used for capturing the specific features of the virtual machine and its interactions. In addition to the security policies in the ACM and the IDE, the DEE determines whether additional policies need to be enforced on the VM entities. As we saw earlier, whenever suspicious behaviour is identified by the ACM or the IDE, the DEE can decide to perform taint analysis to determine whether the suspicious behaviour is actually malicious. The DEE can also be used for sharing information between the different secure VMM physical servers; for instance, this happens when new attacks are discovered by one physical server and shared with the others. Now let us return to the privilege escalation attack during system operation and consider how our architecture is able to respond to such an attack. Consider a user who has logged on with limited privileges, exploits some vulnerability in a virtual machine to obtain higher privileges, and then performs some malicious activity, such as disabling a security tool installed in the virtual machine or installing a malicious program in it. First, let us assume that the privilege escalation occurs by exploiting a vulnerability in an SQL server. This could be due to an attack such as the one mentioned in [10], where a user logged in with limited privileges obtains administrative privileges by changing three bytes in memory through a buffer overflow vulnerability. The SQL server validates the user id before giving access to any of the objects; if the user id is set to 1, the user is considered to have administrative privileges. The user can alter the id in memory in the vulnerable server after calling VirtualProtect().
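The DEE dispatch rule described above (EV and ICL run for every event, while taint analysis is invoked only when the introspection, ACM or IDE checks flag something) can be sketched as follows. This is an illustrative sketch with hypothetical names; since taint analysis is currently offline in the architecture, the `taint_analysis` callback here would in practice just queue the event for later analysis.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Event:
    vm: str
    kind: str  # e.g. "vm.start", "resource.cpu", "txn.start" (hypothetical names)


Check = Callable[[Event], bool]  # returns True if the event looks suspicious


@dataclass
class DecisionEngine:
    validate_entity: Callable[[Event], None]  # EV: always runs
    capture_and_log: Callable[[Event], None]  # ICL: always runs
    detectors: List[Check]                    # INSP, ACM and IDE checks
    taint_analysis: Callable[[Event], None]   # TA: invoked only on suspicion

    def handle(self, event: Event) -> bool:
        self.validate_entity(event)
        self.capture_and_log(event)
        # Run every detector (no short-circuit) so each one sees the event.
        flags = [check(event) for check in self.detectors]
        suspicious = any(flags)
        if suspicious:
            self.taint_analysis(event)
        return suspicious
```

Keeping TA behind the detector results reflects the cost argument in the text: the expensive analysis is paid only for events that another component has already found suspicious.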
The administrative privileges obtained by such a malicious user remain valid until SQL Server is restarted, so the user can exploit these temporary higher privileges to perform malicious activities. There is therefore a need to detect such attacks at runtime by detecting the escalation of the user's privileges. Note that when the user first logs in as a normal user and carries out his/her actions, we assume that these actions are allowed by the ACM because they satisfy the security policies.

Let us now consider how the runtime privilege escalation by the user is detected. This is done by the introspection module (INSP). Recall that the VMM is used to access and monitor the runtime state of the virtual machine, such as its registers and the processes and applications running in it. The VMM deals with three types of memory: machine, physical and virtual memory. Machine memory is the real memory, which is controlled by the VMM. Physical memory is the memory assigned to the virtual machine; the virtual machine is under the illusion that this physical memory is the actual memory. Virtual memory is used as in traditional operating systems. Translation between machine and physical addresses is performed using a lookup table in the VMM. The introspection component uses the xc_map_foreign_range function of the VMM to access the memory contents of the virtual machine. The runtime privileges of the logged-in users are determined by analysing the memory allocated to the virtual machine, while the actual privileges of the users are available in the user_store in the ACM. By comparing the two, the introspection module is able to detect the privilege escalation of the logged-in user.

Consider another attack scenario, in which a virtual machine is infected with malware (such as Conficker [11], Torpig or LOIC [12]). In this case the IDE module comes into play: such attacks are detected when the interactions of the virtual machine are found to be suspicious by the IDE. For example, LOIC floods the victim machines with TCP, UDP and HTTP messages, and such attacks are detected by the signature or anomaly detection component in the IDE. The infected virtual machine is then suspended and an alert is sent to the tenant administrator. For secure updates of the virtual machines, one approach is to apply the updates to a snapshot image and validate the image in an isolated environment before applying the updates to the virtual machine in the production environment.

C. Secure Transaction

Let us now consider the transactions from the tenant virtual machines and see how the architecture can help to secure them. Recall first that tenant customers can use TPM-based attestation to validate the state of a tenant virtual machine before performing any transaction. However, the tenant virtual machine can be compromised after the initial TPM attestation³. Let us consider how our architecture can secure the transactions between the tenant virtual machine and its customers. Whenever a new flow is initiated by a virtual machine, the traffic is validated by the DEE. The DEE invokes INSP to validate the runtime state of the virtual machine when it receives the first packet of each transaction. If the state of the virtual machine is found to be suspicious (e.g.
processes/users with excess privileges, hidden processes), then the DEE terminates the transaction and sends a termination report to the TVD administrator. Hence, at this stage, attacks are not possible if any user or process has escalated its privileges or if there are hidden processes. However, attacks are still possible from users misusing their legitimate privileges and from visible processes/applications. For example, the attacker could be a malicious employee, or an attacker who has obtained unauthorised access by exploiting weak passwords or the Heartbleed vulnerability, or by using social engineering techniques. The entity validation component determines the process or application that initiated the traffic flow and updates the ICL with the details of the entity that generated the packet. The ICL updates its information on the applications running in each virtual machine as those applications start interacting with other hosts or virtual machines. If the process or application reported by the EV is found in the ICL database, the ICL updates its details; if it is not listed, the ICL creates a new entry for this process/application and stores the corresponding traffic. New processes and applications are flagged for further validation.

³ Note that in this paper we have not considered DRTM (dynamic root of trust measurement), as our architecture did not have the virtualised TPM required to achieve this.

Figure 4: Protocols for integrated model

The IDE then validates the traffic against known attack signatures, anomaly based detection policies and domain-wide TVD policies. Note that the ICL holds specific details for each virtual machine (such as the resources allocated to the VM, the OS and applications running in it, its traffic logs and its TVD membership), and the IDE uses this information to specify the security policies for each virtual machine. The IDE evaluation proceeds as follows. If the source IP address of a packet is spoofed, the packets are dropped and the virtual machine is isolated from interacting with other hosts; it can rejoin only after further analysis by the cloud and/or TVD administrator. Hence virtual machines cannot generate attack traffic with a spoofed source address. At this stage, however, it is still possible to generate attack traffic with a correct source address. Such traffic is therefore checked against known attack signatures and by the anomaly detection component. If the signature and anomaly based detection components find the VM traffic to be legitimate, it is validated against the TVD security policies and passed to the destination. The TVD policies to be enforced on the VM traffic are determined from its source and destination addresses. For example, consider communications from OS, application and security tool vendors distributing security updates; such communications can be push based, pull based, or both, and TVD administrators can specify and enforce policies according to these requirements. If any packet from a virtual machine matches a known attack signature, is determined to be malicious by the anomaly detection algorithm, or is not permitted by the TVD policies, it is dropped. The entity that generated the attack packets is identified by querying the ICL, and it is then isolated or subjected to further analysis by the TVD administrator.

The only attacks still possible at this point are therefore those that use a correct source address, do not match any attack signature, are not detected by the anomaly detection engine and are permitted by the TVD policies. Let us see how our architecture deals with such attacks. If the destination (or customer) finds such traffic to be malicious, it reports the malicious behaviour to the source DEE. The DEE first checks whether the reported traffic actually originated from one of its tenant virtual machines. If the reported traffic is found in the ICL, the DEE by default reports the attack to the TVD administrator. The DEE can also invoke taint analysis [8] to check the malicious behaviour reported by the victim. Taint analysis enables automatic analysis of attack packets by monitoring how each byte of the packet payload is used by the vulnerable program at the processor instruction level. If taint analysis confirms the malicious behaviour reported by the victim, the DEE can alert the TVD administrator and/or terminate the virtual machine. Hence our architecture is able to secure transactions from the tenant virtual machines in a comprehensive manner.
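The first-packet checks described in this subsection (INSP state validation, EV/ICL bookkeeping, then the IDE's spoofing, signature, anomaly and TVD-policy checks) can be summarised in a minimal sketch. It assumes the introspection report, the ACM's expected process owners and the ICL record have already been reduced to plain Python values, and it passes the signature, anomaly and TVD checks in as callbacks; all names (`Verdict`, `VmState`, `IclRecord`, `validate_flow`) and the toy rules in the usage block are hypothetical and do not reflect our implementation's interfaces. Taint analysis and the destination-feedback path are not modelled.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, Set


class Verdict(Enum):
    FORWARD = "forward"      # legitimate: pass to the destination
    TERMINATE = "terminate"  # suspicious VM state: end transaction, report
    ISOLATE = "isolate"      # spoofed source: drop packets, isolate the VM
    DROP = "drop"            # signature, anomaly or TVD policy violation


@dataclass
class VmState:
    """Runtime state obtained through introspection (hypothetical layout)."""
    process_owner: Dict[str, str]  # process name -> account it runs as
    hidden_processes: Set[str] = field(default_factory=set)


@dataclass
class IclRecord:
    """Per-VM details held in the ICL (hypothetical layout)."""
    ip: str
    mac: str
    known_apps: Set[str] = field(default_factory=set)
    flagged_apps: Set[str] = field(default_factory=set)


@dataclass
class Packet:
    src_ip: str
    src_mac: str
    dst_ip: str
    app: str        # originating process/application, as identified by the EV
    payload: bytes


def validate_flow(pkt: Packet, state: VmState, acm_owner: Dict[str, str],
                  icl: IclRecord,
                  matches_signature: Callable[[bytes], bool],
                  is_anomalous: Callable[[Packet], bool],
                  tvd_permits: Callable[[str, str], bool]) -> Verdict:
    # 1. INSP: a process running under an account other than the one the ACM
    #    expects (e.g. explorer.exe as SYSTEM), or a hidden process, ends it.
    for proc, owner in state.process_owner.items():
        expected = acm_owner.get(proc)
        if expected is not None and owner != expected:
            return Verdict.TERMINATE
    if state.hidden_processes:
        return Verdict.TERMINATE
    # 2. EV/ICL: record the originating entity; unseen ones are flagged.
    if pkt.app not in icl.known_apps:
        icl.known_apps.add(pkt.app)
        icl.flagged_apps.add(pkt.app)
    # 3. IDE: a source address not assigned to this VM counts as spoofed.
    if (pkt.src_ip, pkt.src_mac) != (icl.ip, icl.mac):
        return Verdict.ISOLATE
    # 4. IDE: known signatures and anomaly detection, then TVD policies.
    if matches_signature(pkt.payload) or is_anomalous(pkt):
        return Verdict.DROP
    if not tvd_permits(pkt.src_ip, pkt.dst_ip):
        return Verdict.DROP
    return Verdict.FORWARD


if __name__ == "__main__":
    def sig(payload: bytes) -> bool:
        return b"ATTACK" in payload      # toy signature

    def anomalous(pkt: Packet) -> bool:
        return len(pkt.payload) > 1400   # toy anomaly rule

    def tvd(src: str, dst: str) -> bool:
        return dst.startswith("10.0.")   # toy intra-TVD policy

    icl = IclRecord(ip="10.0.0.5", mac="aa:bb:cc:dd:ee:01")
    acm = {"explorer.exe": "test-user"}
    ok = VmState({"explorer.exe": "test-user"})
    escalated = VmState({"explorer.exe": "SYSTEM"})
    pkt = Packet("10.0.0.5", "aa:bb:cc:dd:ee:01", "10.0.0.9", "app.exe", b"hi")
    print(validate_flow(pkt, ok, acm, icl, sig, anomalous, tvd))         # Verdict.FORWARD
    print(validate_flow(pkt, escalated, acm, icl, sig, anomalous, tvd))  # Verdict.TERMINATE
    spoofed = Packet("10.9.9.9", "aa:bb:cc:dd:ee:01", "10.0.0.9", "app.exe", b"hi")
    print(validate_flow(spoofed, ok, acm, icl, sig, anomalous, tvd))     # Verdict.ISOLATE
```

The ordering is the design point: the cheap, decisive state and spoofing checks run before signature and anomaly matching, so only traffic from a VM in a validated state with a correct source address reaches the TVD policy stage.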

D. Secure Migration

Let us now consider the situation where a virtual machine running on one (VMM-based) physical server is migrated to another server. Secure migration needs to ensure that the virtual machine is migrated to a secure platform. The capabilities of the remote server are first checked to ensure that the current level of VM security policies can be enforced at the remote server. Our architecture uses TPM-based validation to ensure that the remote server is capable of achieving a similar level of security for the virtual machine. We assume that the security policies are specified using a standard specification language; in our case, we use XML. Our policies check whether the remote server has the required resources (CPU, memory, network) for hosting the virtual machine, as well as the capabilities to enforce the required ACM and IDE policies. The migration is permitted only if the remote server is capable of enforcing the specific security policies related to the virtual machine under consideration.

E. Secure Store

The Secure Store is used for storing the images of the virtual machines. The tenant can store the default trusted image of its virtual machines, and can also use this component to store snapshot images of its virtual machines at regular intervals. In case of attacks on a tenant virtual machine, the default or snapshot images are used for fast restoration of the tenant's virtual machine services. Furthermore, the images are stored in encrypted form, and the keys for decrypting them are stored in the TPM.

F. Concrete Step by Step Protocol

Figure 4 presents a concrete step-by-step protocol that describes the operation of our integrated security architecture. Note that the steps are somewhat simplified to focus on the overall flow of the protocol. For example, even if a server satisfies the security policies for hosting a tenant VM, the availability of resources for hosting that VM on the server still needs to be checked. We have modelled this protocol using predicate-transition Petri nets and carried out validation analysis to detect security state based anomalies such as deadlocks and liveness violations, and to achieve a fail-safe mode of operation. We plan to describe this modelling and analysis in a separate paper due to space constraints.

VM hosting request from tenant
• Check if tenant requires security as a service
• If true, capture security requirements (SLA) for the VM
Use DEE to validate events and maintain logs as agreed in SLA

Setting up a virtual machine
• Check if ACM security option enabled
• If true, check if VM conflicts with other running VMs
  • If true, abort VM setup on server
  • If false, set up VM on the server

Loading applications
• Check if TPM security option enabled
• If true, perform CRTM using TPM/vTPM
• Calculate hash of the applications before loading
  • If hash valid, load application
  • If hash invalid, abort loading application

VM operation
• Monitor VM events (as per SLA)
  • e.g. obtain VM state using introspection
  • Validate VM state according to ACM and/or IDS policies
• Check for privilege escalation, unauthorised access to resources, abnormal consumption of resources, suspicious behaviour of logged user, ……, hidden processes in VM
  • If true, terminate user account and/or malicious session and alarm admin
• IC used for sharing details of malicious event with other VMM

Network based interaction with other virtual machines
• Default deny, permit authorised communication
  • e.g. intra domain, inter domain (bilateral policies between domains)
• TPM attestation of the remote host (optional)
• Check if files transferred violate ACM policies and/or traffic matches signatures or anomalies
  • If true, terminate user account and/or malicious session and alarm admin

VM migration
• Check if remote server has capabilities to enforce VM security requirements
  • If true, migrate VM to remote server
  • If false, abort migration to remote server

Etc.

VI. IMPLEMENTATION

We have implemented our architecture using the Xen VMM [13]. Xen is a native (hypervisor-based) VMM that runs directly on the hardware as the lowest and most privileged layer. In Xen terminology, a "domain" is a running virtual machine within which a guest OS executes. "Domain 0" (Dom0) boots with the hypervisor and acts as the control interface; it has special management privileges and direct access to the underlying physical hardware. All other virtual machines are called "domain U" (DomU).
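The "Setting up a virtual machine" step of the Figure 4 protocol, which aborts VM setup when the new VM conflicts with a VM already running on the server, can be sketched as a Chinese Wall runtime-exclusion check of the kind evaluated in Section VI-A. This is a minimal illustration under stated assumptions, not our sHype-based implementation: the conflict-set representation, the assumption that each VM carries a single CHWALL type, and the names `ConflictSet`, `Platform`, `can_start`, `start` and `stop` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class ConflictSet:
    """Chinese Wall conflict-of-interest set: these types must not co-reside."""
    name: str
    types: FrozenSet[str]


@dataclass
class Platform:
    """Hypothetical model of one physical platform and the VMs it runs."""
    conflict_sets: List[ConflictSet]
    running: Dict[str, str] = field(default_factory=dict)  # VM -> CHWALL type

    def can_start(self, vm_type: str) -> bool:
        # Refuse if a running VM has a *different* type from a conflict set
        # that also contains vm_type (runtime exclusion).
        for cs in self.conflict_sets:
            if vm_type not in cs.types:
                continue
            for other in self.running.values():
                if other != vm_type and other in cs.types:
                    return False
        return True

    def start(self, vm: str, vm_type: str) -> bool:
        if not self.can_start(vm_type):
            return False  # abort VM setup on this server
        self.running[vm] = vm_type
        return True

    def stop(self, vm: str) -> None:
        # Stopping (or migrating the VM away) releases the exclusion.
        self.running.pop(vm, None)


if __name__ == "__main__":
    cola = ConflictSet("cola", frozenset({"PepsiCo", "CocaCola"}))
    host = Platform([cola])
    print(host.start("pepsi-vm", "PepsiCo"))  # True
    print(host.start("coke-vm", "CocaCola"))  # False while PepsiCo runs
    host.stop("pepsi-vm")
    print(host.start("coke-vm", "CocaCola"))  # True
    print(host.start("pepsi-vm", "PepsiCo"))  # False while CocaCola runs
```

Adding a second conflict set containing, say, hypothetical marketing and payroll types would enforce the intra-organisation separation in the same way.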

Our current implementation makes use of sHype for enforcing MAC policies. However, sHype does not implement role-based access control or techniques for enforcing intrusion detection policies, so we have extended it with role-based access control and with support for enforcing intrusion detection policies. In the following sections, we describe practical results showing how our implemented architecture detects attacks in specific scenarios.

A. ACM Enforcement for Hosting Virtual Machines

Figure 5 shows the enforcement of the Chinese Wall policy for hosting virtual machines. In this case, the virtual machine that belongs to PepsiCo is running on one of the servers. While the PepsiCo VM is running, the run-time exclusion set of our policy prevents the CocaCola VM from starting: the label of the PepsiCo VM includes the CHWALL type PepsiCo, the label of the CocaCola VM includes the CHWALL type CocaCola, and the exclusion set places these two types in conflict. The run-time exclusion rule of our policy thus ensures that the PepsiCo and CocaCola VMs cannot run at the same time on the same hypervisor platform. Once the PepsiCo VM is stopped or migrated to another platform, the CocaCola VM can start; once the CocaCola VM has started, however, the PepsiCo VM can no longer start on this physical platform. Similarly, our model can enforce a Chinese Wall policy between different departments of the same organisation; for example, it can prevent virtual machines belonging to the marketing department from being co-hosted with virtual machines from the payroll department.

Figure 5: Chinese Wall policy for different tenants

B. Privilege Escalation Attacks

This section considers a privilege escalation attack scenario. A virtual machine running Windows XP was used for this scenario, in which a test user escalates his privileges to the SYSTEM level. Let us consider how users can escalate their privileges to SYSTEM level and how our architecture detects such attacks. In a legitimate scenario, the executable explorer.exe runs with the privileges of the logged-in user, since it provides access to his files. For example, under a guest account this process runs with the privileges of the guest, and under an administrator account it runs with the privileges of the administrator. Now let us consider how the attacker can run this process with SYSTEM privileges on a vulnerable machine.

Figure 6: Privilege Escalation Attack

As shown in Figure 6, logged in as the test user, we scheduled a new task to be launched at 18:23 from the command prompt. At that time, a new command prompt is started; however, since the task is launched by the system, the new command prompt runs with SYSTEM privileges. We then used this command prompt to launch explorer.exe, so the explorer process now runs with SYSTEM privileges. Figure 6 also shows that taskmgr.exe is still running with the privileges of the test user, whereas explorer.exe is now running with SYSTEM privileges. Hence the test user has complete access to the virtual machine. Our architecture detects such attacks during the runtime state validation of the virtual machine: the privileges of explorer.exe identified during state validation did not match the explorer.exe privileges for the test user stored in the ACM. The DEE therefore terminates the test user's session and disables the test-user account, so that the user no longer has login access to this machine. If the details of the test user are valid on multiple tenant virtual machines within the TVD, then the information sharing component in the DEE instructs all other physical servers hosting the TVD's virtual machines to disable the test-user account.

C. Exploiting a Security Tool in the VM

In this section we consider a malicious user with limited privileges who performs privilege escalation, compromises the anti-virus software running in the virtual machine, and then uses the compromised system to generate further attacks. Let us see how our security architecture deals with this attack scenario. In our scenario we used the Avira anti-virus software running on Windows 7. Avira serves only as an example of how a malicious user can subvert current security measures to conduct attacks; our research confirms that similar attacks are also possible with other anti-virus software. Avira Antivirus consists of several Windows services, user-level processes and kernel-mode drivers. Among these modules, the Realtime Protection service is crucial to Avira's protection mechanism, as it provides real-time protection not only to the system (e.g. on-access detection of malware) but also to Avira itself (self-protection, such as prevention of unauthorised alteration of Avira-related files). In particular, the Realtime Protection service blocks the unloading of the kernel-mode drivers and the filter driver. Avira's program folder is protected by the filter driver, so that even a user with administrator privileges cannot add, delete or modify any of its files.

Figure 7: Successful Attack

Figure 8: Flooding with Malicious Traffic

We used staged malware in this scenario to compromise Avira during updates. In this example, the installer's ultimate goal is to replace Avira's sqlite3.dll with a malicious one (the second stage) so as to subvert both Avira and the system. When an update begins, the installer monitors the status of Avira's Realtime Protection service. Once the service is deactivated during the update, the staged malware performs the following tasks:
i) For privilege escalation, it drops and executes a known or zero-day exploit that Avira would normally detect. Notice that this local privilege escalation (e.g. from admin to SYSTEM) is required only once; after this step, the malware has SYSTEM privilege on the target machine.
ii) It unloads Avira's filter driver, which is normally protected by the service.
iii) It drops the real payload (the fabricated sqlite3.dll) and replaces the original file in Avira's installation folder with the malicious one.
iv) The installer then deletes itself as a clean-up step to erase its existence; alternatively, the payload may delete the installer.

The result of these file replacement and loading operations is shown in Figure 7. The original sqlite3.dll (kept as sqlite3_ori.dll, 389 KB) has been replaced with the malicious version (sqlite3.dll, 612 KB), and the fabricated DLL has been loaded by the service; the original DLL was deliberately not removed, to show the replacement. Once the malicious sqlite3.dll is loaded by the service, it runs with SYSTEM privilege on the target machine, and any malicious activity becomes possible because it executes in the context of Avira's Realtime Protection service. We then used the compromised virtual machine to generate malicious traffic, as shown in Figure 8. Without our model, such an attack would succeed in a traditional datacenter, and the attacking source could remain anonymous: since the attacker has complete control of the host-based security tool and the OS in the virtual machine, s/he can alter the logs in the compromised system, and since the attack traffic carries no valid MAC or IP address, it is extremely difficult for the datacenter administrator to determine its source. The Trust Measurement mechanism in our model detects the changes to sqlite3.dll and prevents the modified file from being loaded when the system is restarted. Furthermore, the information sharing component in the DEE informs all other physical servers about the attack on Avira. Such attacks can also be detected at runtime in our architecture.
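Before turning to runtime detection, the load-time check performed by the Trust Measurement mechanism can be illustrated with a minimal sketch in the spirit of the "Loading applications" step of Figure 4 (calculate the hash before loading; load only if it is valid). It assumes a table of known-good SHA-256 digests recorded from the tenant's trusted image; in our architecture such measurements are anchored in the TPM, which this sketch does not model, and the names `measure` and `load_if_trusted` are hypothetical.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path
from typing import Dict


def measure(path: Path) -> str:
    """SHA-256 digest of a file, read in chunks (the 'measurement')."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_if_trusted(path: Path, reference: Dict[str, str]) -> bool:
    """Hypothetical measured-load gate: allow loading only on a digest match."""
    expected = reference.get(path.name)
    if expected is None:
        return False  # unmeasured module: refuse and flag for validation
    return hmac.compare_digest(measure(path), expected)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        dll = Path(d) / "sqlite3.dll"
        dll.write_bytes(b"original library bytes")
        reference = {dll.name: measure(dll)}     # recorded from trusted image
        print(load_if_trusted(dll, reference))    # True
        dll.write_bytes(b"fabricated library bytes")
        print(load_if_trusted(dll, reference))    # False: file was replaced
```

Because the reference digest is taken from the trusted image rather than from the running system, an attacker who controls the guest OS can replace the file but cannot make the replaced file match the stored measurement.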
In our architecture, such runtime attacks are prevented by the secure VMM when the ACM module detects the escalation of the logged-in user's privileges to SYSTEM level, or when the IDE module detects malicious traffic originating from the compromised virtual machine. Although the attacker succeeds in compromising the virtual machine, s/he does not have access to the security components in the VMM, so the attack does not succeed against our integrated security architecture. Indeed, the attack shown in Figure 8 is not possible in the first place: since the traffic carries no valid MAC or IP address, it is blocked by the IDE module and an alert is raised to the administrator. Hence our architecture can detect and prevent such an attack even before the attack traffic is placed on the network. Moreover, when the ACM detects privilege escalation or the IDE detects an attack in the traffic it receives, the logs are examined to extract the commands typed by the user (by analysing the execve calls) and to identify the malicious user who escalated his/her privileges and/or installed the malicious application. The access policy is then changed to terminate that user's account.

VII. RELATED WORK

In this section we briefly describe related work on intrusion detection for virtualised and cloud environments that is relevant to our architecture. Garfinkel et al. [5] proposed a Virtual Machine Introspection (VMI) based IDS architecture. The method of inspecting a virtual machine from the hypervisor level in order to analyse the software running inside it is called virtual machine introspection (VMI). The VMM isolates a virtual machine so that it cannot access or modify the software running in a separate VM or in the VMM itself. This ensures that the IDS, which runs outside the monitored VM, cannot be tampered with even if the monitored host is completely subverted. Another VMI technique has been described by Payne et al. [14]. They implemented XenAccess, a monitoring library for Xen that provides virtual memory introspection and virtual disk monitoring capabilities without any modification to the VMM or to the VM being monitored. XenAccess was further developed into LibVMI, which also supports the Kernel-based Virtual Machine (KVM). Our architecture makes use of LibVMI for detecting attacks on the virtual machines. Fu and Lin [15] described a technique that automatically generates VMI tools by monitoring system-wide instruction execution; it identifies introspection-related data and redirects these data accesses to the in-guest kernel memory. Let us now consider relevant work on intrusion detection techniques for cloud environments. Kholidy et al. [16] proposed a hierarchical and autonomous cloud-based intrusion detection system, in which the IDS is deployed in the VM management server and sends alarms, with an associated risk impact factor, to a controller. The autonomous controller then selects the most appropriate response to protect the cloud based on certain security parameters. Mohamed et al. [17] proposed a distributed approach that deploys IDS sensors at various locations in the cloud: at the network switches and at internal servers such as front-end and back-end servers. These sensors collaborate to correlate information from different sources, using both signature-based and anomaly-based detection.
Gupta and Kumar [18] proposed an anomaly detection algorithm based on Immediate System Call Sequences (ISCS) for detecting attacks in the cloud. Rather than performing complex learning to build a baseline of normal behaviour, this technique structures the system calls generated during program execution in a specific way for anomaly detection. The techniques proposed in [16]–[18] mainly focus on signature or anomaly based intrusion detection. Our integrated security architecture, in contrast, incorporates not only these intrusion detection techniques but also access control and hardware-based trusted computing using the TPM, which enables it to deal with dynamic attacks more efficiently.

VIII. CONCLUSIONS

We have developed an integrated security architecture that combines access control, intrusion detection techniques and trusted computing technologies for securing distributed virtual machine based systems, and we have described how this integrated model detects a range of attacks. We have described the implementation of the proposed architecture and shown in detail how it counteracts a range of attack scenarios, such as privilege escalation attacks followed by the use of the compromised machines to generate further attacks, exploitation of vulnerabilities in security tools, and attacks on tenant online services.

REFERENCES

[1] C.H. Vincent, D. Ferraiolo, and D. R. Kuhn, "Assessment of access control systems," US Department of Commerce, NIST, 2006.
[2] B. Ewaida, "Pass-the-hash attacks: Tools and Mitigation," January 2010. [Online]. Available: http://www.sans.org/reading-room/whitepapers/testing/passthe-hash-attacks-tools-mitigation-33283, last viewed: 31/03/2015.
[3] R. Beverly, R. Koga, and kc claffy, "Initial Longitudinal Analysis of IP Source Spoofing Capability on the Internet," 25 July 2013. [Online]. Available: http://www.internetsociety.org/doc/initial-longitudinal-analysis-ipsource-spoofing-capability-internet, last viewed: 31/03/2015.
[4] R. Sailer et al., "sHype: Secure Hypervisor Approach to Trusted Virtualized Systems," IBM Research Report RC 23511, 2005.
[5] T. Garfinkel and M. Rosenblum, "A virtual machine introspection based architecture for intrusion detection," Proc. 10th Annual Network and Distributed System Security Symposium, California, February 2003.
[6] Trusted Computing Group, "TCG Specification, Architecture Overview, Specification Revision 1.2," April 2004. [Online]. Available: http://www.trustedcomputinggroup.org
[7] A. Nagarajan et al., "Property based Attestation and Trusted Computing: Analysis and Challenges," Proc. 3rd Int. Conf. on NSS, 2009.
[8] C. Luk et al., "Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation," Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation, USA, June 2005.
[9] S. Kaur and M. Singh, "Automatic attack signature generation systems: A review," IEEE Security & Privacy, vol. 11, no. 6, pp. 54–61, Dec. 2013.
[10] D. Litchfield, "Threat Profiling Microsoft SQL Server." [Online]. Available: http://www.cgisecurity.com/lib/tp-SQL2000.pdf
[11] S. Shin and G. Gu, "Conficker and Beyond: A Large-Scale Empirical Study," Proc. 26th ACSAC, Texas, USA, Dec. 2010.
[12] LOIC. [Online]. Available: http://sourceforge.net/projects/loic/, last viewed: 25/08/2014.
[13] P. Barham et al., "Xen and the art of virtualization," Proc. ACM Symp. on Operating Systems Principles, 2003.
[14] B. Payne, M. Carbone, and W. Lee, "Secure and Flexible Monitoring of Virtual Machines," Proc. 23rd ACSAC, Florida, USA, Dec. 2007.
[15] Y. Fu and Z. Lin, "Space Traveling across VM: Automatically Bridging the Semantic Gap in Virtual Machine Introspection via Online Kernel Data Redirection," IEEE Symposium on Security and Privacy, 2012.
[16] H. Kholidy et al., "HA-CIDS: A hierarchical and autonomous IDS for cloud systems," Proc. 5th IEEE Int. Conf. on Computational Intelligence, Communication Systems and Networks, pp. 179–184, 2013.
[17] H. Mohamed et al., "A collaborative intrusion detection and prevention system in cloud computing," Proc. IEEE AFRICON, pp. 1–5, 2013.
[18] S. Gupta and P. Kumar, "An immediate system call sequence based approach for detecting malicious program executions in cloud environment," Wireless Personal Communications, pp. 1–21, 2014.
[19] [Online]. Available: http://www.cio.com.au/article/557722/mercy-health-saves-1-2million-by-moving-cloud/
[20] [Online]. Available: http://www2.alcatel-lucent.com/landing/financial-services/
[21] [Online]. Available: http://www.finance.gov.au/sites/default/files/australiangovernment-cloud-computing-policy-3.pdf
[22] [Online]. Available: http://www.cio.com.au/article/559228/queensland-urbanutilities-moves-cloud-implements-website-self-service
[23] T. Ristenpart et al., "Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds," Proc. 16th ACM Conf. on Computer and Communications Security, USA, Nov. 2009.
[24] "AWS security center." [Online]. Available: http://aws.amazon.com/security/
[25] [Online]. Available: https://technet.microsoft.com/en-us/dn785092.aspx
[26] Cloud Security Alliance (CSA), "STAR-205 CSA - Top Threats to Cloud Computing v2.0." [Online]. Available: http://365.rsaconference.com/docs/DOC-281
[27] Security as a Service Working Group, Cloud Security Alliance. [Online]. Available: https://cloudsecurityalliance.org/group/security-as-a-service/

Vijay Varadharajan is the Microsoft Chair Professor at Macquarie University. He is also the Director of the Advanced Cyber Security Research Centre. Vijay has published more than 360 papers in international journals and conferences, and has been/is on the editorial boards of several journals, including ACM TISSEC, IEEE TDSC, TIFS and TCC.

Udaya Tupakula is a Research Fellow at the Advanced Cyber Security Research Centre at Macquarie University. He completed his PhD at Macquarie University in 2006. Uday has 50 publications in international journals and conferences.
