Security Issues in On-Demand Grid and Cluster Computing
Matthew Smith, Michael Engel, Thomas Friese, Bernd Freisleben
Department of Mathematics and Computer Science, University of Marburg, Germany
Email: {matthew, engel, friese, freisleb}@informatik.uni-marburg.de
Gregory A. Koenig, William Yurcik
National Center for Supercomputing Applications (NCSA), University of Illinois, Urbana-Champaign, USA
Email: {koenig, byurcik}@ncsa.uiuc.edu
Abstract— In this paper, security issues in on-demand Grid and cluster computing are analyzed, a corresponding threat model is presented and the challenges with respect to authentication, authorization, delegation and single sign-on, secure communication, auditing, safety, and confidentiality are discussed. Three different levels of on-demand computing are identified, based on the number of resource providers, solution producers and users, and the trust relationships between them. It is argued that the threats associated with the first two levels can be handled by employing operating system virtualization technologies based on Xen, whereas the threats of the third level require the use of hardware security modules proposed in the context of the Trusted Computing Platform Alliance (TCPA). The presented security mechanisms increase the resilience of the service hosting environment against both malicious attacks and erroneous code. Thus, our proposal paves the way for large scale hosting of Grid or web services in commercial scenarios.
I. INTRODUCTION

The Grid computing paradigm [1], [2] is formed around the general goal of providing resources, including processor cycles, data sources, special equipment, and even people, as easily as electricity is provided through the electrical power grid. Grid computing, coupled with cost-effective and readily available computing power in the form of commodity clusters, has driven the development of closely related technologies such as utility computing and on-demand business activities. Utility computing is defined to be the on-demand delivery of infrastructure, applications, and business processes in a security-rich, shared, scalable, and standards-based computer environment over the Internet for a fee. Customers will tap into IT resources – and pay for them – as easily as they now get their electricity or water [3]. The overall goal is to provide services to a customer who can utilize those services on demand and be charged automatically for the usage, thus allowing the customer to provide more complex business services to end users. In its current state, however, Grid and utility computing focus on the unification of resources by a middleware layer to enable distributed computing within a fixed and preconfigured environment. Organizations or inter-organizational communities willing to share their computational resources typically
create a centrally planned Grid, where dedicated administrators manage computational platforms and the services offered. To gain access to Grid systems, users are typically required to obtain a certificate by personally going to a Certificate Authority (CA) and filling out a request form. During this process, the identity of the applicant is verified using some form of approved identification, such as a passport or driver's license. The application is then processed and a digitally signed certificate is issued to the user. This process, while not guaranteeing security, allows tight control on who can access shared Grid resources and provides administrators the ability to track malicious behaviour back to the person responsible. Standard security practices, such as firewalls and secure communication, coupled with the ability to link access to resources with responsible parties, form the basis for the trust necessary for organizations to release their software and data onto the Grid. In the on-demand paradigm, computational peak loads or free resources are outsourced to organizations offering computational power or the required resources. For this, it is desirable to be able to dynamically rent resources with only minimal administrative overhead. Thus, it must be possible to acquire and configure the needed resources without requiring the administrator to manually facilitate each and every transaction [4]. This creates a number of emergent security threats. For instance, in cluster computing, if a company wishes to dynamically rent computational resources, it is typical to rent exclusive access to a number of nodes or even the entire cluster for the time needed. Secure mechanisms are needed to share clusters more efficiently for on-demand computing. In contrast, in the Grid environment exclusive access further complicates the process of dynamically renting resources. In many business environments (such as automotive or pharmaceutical), the risk of losing software or data is such that it has hindered on-demand computing. There is a need to reduce the tension between on-demand functionality and requisite security measures. For more details about the concept of on-demand computing, including definitions, research projects, and commercial services, see [5]. In this paper, we present an analysis of threats within on-demand utility computing environments.
The threats presented here focus solely on new threats arising from the on-demand usage of Grid and cluster resources and should be seen as a complementary body of work in addition to the standard threats in Grid and cluster computing. Our analysis is based on a three-level hierarchy of the trust relationships in on-demand computing; these trust relationships involve interactions among resource providers, solution producers, and users. We also develop solutions for addressing the threats inherent to these three increasingly demanding levels by enabling trust management. Our solution involves applying a sandbox-based approach using virtual machine technology to ensure trust at the first two levels as well as a solution based on Trusted Computing Platform Alliance (TCPA) technology at the third level of the hierarchy. The remainder of this paper is organized as follows. In Section II, we define a three-level hierarchy of trust in on-demand computing environments. Next, Section III presents a threat model for on-demand computing. Based on this threat model, Section IV identifies several challenges involved with securing on-demand computing environments. We present a virtualization-based solution for providing security at the first two levels of the on-demand trust hierarchy in Section V and augment this solution for the remaining trust level by incorporating Trusted Computing hardware in Section VI. We conclude the paper by surveying related work in Section VII and presenting final comments and an outline for future research in Section VIII.

II. ON-DEMAND GRID AND CLUSTER COMPUTING

In order to create a threat model for on-demand computing, we must first identify the actors involved and the nature of the trust relationships among them. Before we describe on-demand scenarios, we discuss the trust relationships in traditional Grid or cluster usage scenarios. Trust relationships exist among three distinct actors:
1) Resource providers – owners of computational nodes or other physical resources
2) Solution producers – owners of software solutions and/or information databases which are deployed on a resource provider's assets
3) Users – owners of input data to a solution producer's product; users of a solution producer's product, hosted by a resource provider
In some cases, a solution producer and a resource provider, or a user and a solution producer, may be the same actor. In such cases, the trust requirements are somewhat reduced. Solution producers host software and data on resource providers' nodes where users consume them. In order for this to happen, a given resource provider must grant access to both the solution producer and the user involved. This access is usually granted in the form of a user account through which the provider's nodes are accessed. While it is possible to install most custom software in the solution producer's home directory, sometimes it is necessary to have root privileges to install required software. In such cases, the resource provider must either grant the solution producer temporary root access or perform these privileged operations on behalf of the solution producer.
Fig. 1. Trust relationships in current Grid and cluster computing systems
Figure 1 shows the trust relationships among the actors in current Grid and cluster computing systems. Resource providers must trust solution producers not to misuse the resources offered by the resource providers.^1 Solution producers must trust resource providers not to misuse the software or database hosted on the resource provider's assets.^2 Users who consume the service offered by solution producers through resource providers must trust both of these parties not to misuse the data entered into the solution producer's product.^3 It is not required that solution producers or resource providers trust the users, since the users' access to the system is restricted by the capabilities of the producer's software and the access rights granted by the provider. Standard administrative security mechanisms can be used to protect both producers and providers from malicious users. However, the solution producer and the resource provider must cooperate to create this protection, further underpinning the trust requirement between these two parties.

^1 Possible forms of misuse include using the rented nodes to send junk email, to launch a denial of service attack, to host illegal content, to download or steal account information from the resource provider, or to hijack other nodes in the system.
^2 Possible forms of misuse include stealing the software or the information contained in the database, altering the software or the information in the database, or allowing access to unauthorized persons.
^3 Possible forms of misuse by solution producers include stealing the input and output data, or modifying the results of a computation. Possible forms of misuse by resource providers are the same as for solution producers, but also include hijacking the user's account to use the solution producer's product at the user's expense.

In nearly all cases, Grid and cluster computing systems involve multiple solution producers and multiple users. Such cases present challenges for preventing exposure of confidential information when a single resource provider is involved. For example, if User A and User B both use compute resources provided by a common resource provider, the possibility exists of either user's data being exposed to the other. Two standard ways of addressing this challenge are typically employed. First, the data used by the various users or solution producers may simply be of a non-confidential nature, and thus exposure of this data is irrelevant. This situation is quite common in academic environments in which research data or software is commonly available to any party who cares to download it.
Second, the resource provider may be willing to grant exclusive access to its entire set of resources or to some subset of its resources to a specific solution producer or user. In this case, exposure of confidential information is not possible since no overlap of the resources used is possible. This solution is very unattractive to resource providers, however, since dedicating resources in this way greatly limits their use. On-demand computing environments pose additional challenges due to their dynamic nature and the sometimes competing objectives of the actors involved. For example, in on-demand computing, users must be able to dynamically acquire resources based on some criteria such as priority or job deadline. Likewise, solution producers must be able to dynamically acquire resources from resource providers and autonomously deploy their solution there in order to address user requests. Finally, resource providers typically attempt to schedule the resources under their control to maximize some goal, such as utilization of the resource or profit generated from resource usage. We now define a three-level hierarchy of increasingly strong security requirements to offset the trust requirements among actors in an on-demand Grid and cluster computing environment:

Level 1: The first level of the hierarchy encompasses the scenario described above in its most basic form. Multiple users and solution producers operate on the resources of a single resource provider. No trust relationship exists between users, or between a user and the solution producers used by other users. Also, no trust relationship exists between the solution producers, or between a solution producer and the users of other solution producers. The trust relationship between the user, solution producer and resource provider cooperating with each other is the same as in the traditional usage scenario. Figure 2 shows the "no trust" relationships of this level. The solution producers and resource providers do not need to trust the users of their resources, as they can use standard security mechanisms to protect themselves from their users in the same way they do in standard Grid and cluster systems. Thus, these "no trust" arrows are grey and will not be dealt with in this paper. For the sake of readability, the "trust" arrows are not depicted. Trust can be assumed between all parties not connected with a "no trust" arrow.

Level 2: In the previous level, on-demand computing still requires the resource provider to trust the solution producer. It is desirable to eliminate the trust requirement from the resource provider to the solution producer, to facilitate a more flexible and cost-effective business model. To enable level 2, on-demand computing requires security mechanisms to protect the resource provider's assets from the solution producers while at the same time granting the required access rights to the resources the solution producer and the users need. Figure 3 shows the "no trust" relationships of level 2. For the sake of readability, the "trust" arrows are not depicted. Trust can be assumed between all parties not connected with a "no trust" arrow.

Level 3: In the previous level, on-demand computing diminishes the need for the resource provider to trust the solution producer. The solution producer still must trust the resource provider.
Fig. 2. On-Demand Trust Relationships: Level 1
Fig. 3. On-Demand Trust Relationships: Level 2
Fig. 4. On-Demand Trust Relationships: Level 3
This hinders easy and cost-effective acquisition of resources from new resource providers, since a trust relationship must first be established. Level 3 on-demand computing removes the trust requirement between solution producers and resource providers completely by offering security measures protecting the solution producers' assets not only from other solution producers and users but also from the resource provider. This added security also removes the need for the users to trust the resource provider. The only trust requirement left in level 3 is that the users must trust their solution producer. This final trust requirement can never be fully removed, because the solution producer's software must be able to read the users' data to process it and thus will also be able to make illicit copies. Figure 4 shows all trust relationships in this final level of on-demand computing.

III. TOWARD A THREAT MODEL FOR ON-DEMAND COMPUTING

In order to better understand the security threats to on-demand computing, we seek to develop a model that can be used to describe and comprehensively categorize the range of different attacks. The key components of threat modeling are identifying potential attackers with their corresponding goals and capabilities. Attacker goals target assets of on-demand computing: tangible assets such as storage space, or intangible assets such as service reliability. We assume that attackers are similar in capability to on-demand users, with access to supercomputing cycles, storage, and network bandwidth. The result of our threat modeling is a systematic way of organizing threats which we will call "The Threat Tree for On-Demand Computing". For this threat analysis, we consider a shared on-demand service hosting environment consistent with the previous discussion in Section II. Since there are likely to be multiple users and solution producers participating in the same shared resource environment at the same time, it must be possible for those entities to access resources in real time without having the resource provider personally monitoring transactions. This creates a number of new security threats beyond the security threats of standard Grid and cluster systems. These new security threats arise from: (1) the greater number of participants, (2) the different usage model [4], [6], and (3) emergent properties from the combined interactions of many transactions resulting in complex, unintended behaviors not found in individual transactions [7]. The first categorization of threats to on-demand computing is internal versus external attacks. Internal attacks are committed by entities with legitimate access to a system, while external attacks are committed by entities which do not possess access rights and must therefore break into the system. According to [8], internal attacks are the most common form of attack on enterprise networks. Since the number of legitimate solution producers and users is typically larger and more dynamic in on-demand computing, it is to be expected that internal attacks will be an even larger threat for on-demand computing. External attacks are made possible by the changing nature of cluster computing and the emerging Grid computing paradigm.
Clusters have moved from closed/proprietary environments in a closed network (particularly in commercial settings) to open/standard systems that are often exposed to public networks. External attackers can probe the publicly available resources for vulnerabilities which can then be exploited [9], [10]. This change is exacerbated by Grid computing, which typically connects multiple open clusters together with Internet-accessible nodes. This change has resulted in exposing clusters to a variety of point-and-click attack tools that are easily available on the Internet. Just as the greater number of legitimate solution producers increases the potential for internal threats, it also increases the potential for external threats for two reasons. First, as the amount of legitimate activity on resources increases, it becomes more difficult to detect illegitimate activity. This is especially true for a special type of external attack where the external entity steals a legitimate entity's identity and masquerades as that user, or where a legitimate user's session is hijacked. Second, the amount of third party code is increased with each additional solution producer, along with the potential attack vectors associated with this code. The threat from collusion between inside attackers with privileged access and external attackers is both hard to detect and hard to defeat. Insiders may both enable access for external attacks and then cover these external attack traces. In on-demand computing, an insider in one computing relationship may be external in another computing relationship, so the concepts of internal and external are relative to both the asset at risk being considered and the relationships at any instant in time. A second categorization of threats to on-demand computing is software attacks against controlled code versus third party code. Most Grids and clusters run code from third party partners or software providers and, for practical reasons, it is infeasible to audit all of this code. This is a major change compared to traditional clusters running a controlled base of known source code [11]. Thus, even a Grid or cluster with assured hardware and software components must co-exist with non-assured third party components. Recently, this has led to attacks on third party components of Grids and clusters, as documented in the 2005 SANS Top-20 list, which added a new cross-platform threat category [12]. This change in the general security landscape away from operating-system-specific attacks against controlled code toward application-specific attacks against third party code is an issue of particular relevance for on-demand computing, since a multitude of third party applications will be running in this environment, opening a multitude of new attack vectors. An IBM security paper [13] examines the trends of attack from the 80's up to today: One trend in the hacking threats of particular interest from the client perspective is the trend of hackers to focus on attacking the client. In the 80's, hackers largely attacked the network, passively sniffing passwords, and actively hijacking network sessions. As applications increasingly encrypted data going across the network, hackers then turned their attention largely to attacking servers directly, mainly through vulnerabilities in misconfigured or buggy services, like web servers. As companies have responded with firewalls, intrusion detection, and security auditing tools to protect their servers, hackers have increasingly turned to hacking clients.
A third categorization of threats to on-demand computing is created by two new usage patterns in on-demand environments. While the first two categorizations of threats can be handled with existing security mechanisms, this third category emerges from the unique relationships of on-demand computing. First, solution producers need greater privileges to efficiently install their applications, and there must be mechanisms in place to make sure these privileges are not misused. We label these threats "privilege threats". Second, the traditional closed/proprietary Grid and cluster environment no longer holds for on-demand computing. Users from different organizations may rent resources from the same on-demand resource provider, making it critical that the resource provider enforces a strict separation between all participants. We label these threats "shared-use threats". Privilege threats arise from the fact that solution producers must be able to administer their system without a central administrator who is trusted by all participants and can perform a security audit on all code submitted into the system. If solution producers are allowed to install and configure custom software as needed in an on-demand fashion, it is also possible that malware such as spam-bots, root-kits, spyware, etc. may be installed either intentionally or unintentionally. There is also the possibility of collusion between solution producers and attackers. For shared-use threats we distinguish three types of attacks: (A) resource attacks, (B) data attacks, and (C) meta-data attacks. These attacks can be further subdivided into three types of attacks against (1) the user, (2) the solution producer, and (3) the resource provider. For illustration purposes, we describe these shared-use threats as follows:
• A.1 Shared-Use Resource Attacks against Users: Illegitimate use of resources owned by users. For example, an attacker may use software licenses for third party software that belongs to other users or masquerade as a certain user to consume their allotted CPU cycles.
• A.2 Shared-Use Resource Attacks against Solution Producers: Illegitimate use of software or physical resources owned by solution producers. For example, an attacker may invoke or modify programs from other solution producers without authorization.
• A.3 Shared-Use Resource Attacks against Resource Providers: Illegitimate use of CPU cycles, network bandwidth, storage, or other physical resources owned by resource providers. For instance, an attacker may send unsolicited bulk emails from the hosting network node or use multiple clusters in a brute-force attempt to decrypt credentials. In the extreme, illegitimate resource consumption attacks against a resource provider become denial-of-service attacks.
• B.1 Shared-Use Data Attacks against Users: Illegitimate access to or modification of data owned by users. For example, an attacker may read or modify input datasets or output data results owned by users. Identity stealing attacks also fall into this category.
• B.2 Shared-Use Data Attacks against Solution Producers: Illegitimate access to or modification of data owned by solution producers. For example, an attacker may modify a solution producer's data to corrupt the services offered by that solution producer.
• B.3 Shared-Use Data Attacks against Resource Providers: Illegitimate access to or modification of data owned by the resource provider. For example, an attacker may alter system software, such as the system password files or certificate files, to subvert operations. Faking log entries also falls into this category.
• C.1 Shared-Use Meta-Data Attacks against Users: Illegitimate access to or modification of meta-data owned by users. For example, an attacker may monitor meta-data describing user activity on a system in order to infer what the user is doing.
• C.2 Shared-Use Meta-Data Attacks against Solution Producers: Illegitimate access to or modification of meta-data owned by solution producers. For example, an attacker may monitor meta-data describing solution producer activity in order to characterize workloads in terms of who and how many users utilize which software packages over specific periods of time.
• C.3 Shared-Use Meta-Data Attacks against Resource Providers: Illegitimate access to or modification of meta-data owned by resource providers. For example, an attacker may monitor meta-data describing resource provider activity in order to characterize workloads in terms of who and how many users utilize which resources over specific periods of time.
Consistent with the previous graphical depictions of on-demand computing trust relationships in Figures 1-4, Figure 5 shows the threat tree we have described in this section, with threat types in grey boxes and the actors executing them and the actors endangered by them indicated by arrowed lines.
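To make the taxonomy concrete, the threat tree lends itself to a simple machine-readable encoding, for instance to drive automated classification of audit-log incidents. The following Python sketch is purely illustrative; the structure and names are ours and not part of any existing tool:

```python
# Illustrative encoding of the threat tree described above.
# Keys follow the A/B/C (attack type) and 1/2/3 (target) scheme.
THREAT_TREE = {
    "shared-use": {
        "A-resource": {"1": "users", "2": "solution producers", "3": "resource providers"},
        "B-data": {"1": "users", "2": "solution producers", "3": "resource providers"},
        "C-meta-data": {"1": "users", "2": "solution producers", "3": "resource providers"},
    },
    "privilege": ["malware installation", "privilege escalation", "collusion"],
}

def classify(attack_type: str, target: str) -> str:
    """Map an observed incident to a threat-tree label, e.g. classify('B-data', '1') -> 'B.1'."""
    assert target in THREAT_TREE["shared-use"][attack_type]
    return f"{attack_type.split('-')[0]}.{target}"
```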
IV. CHALLENGES FOR ON-DEMAND SECURITY

There are significant challenges for security in an on-demand Grid or cluster computing environment. Table I gives an overview of the primary challenges, followed by an individual discussion of each. The purpose of this discussion is to organize on-demand security challenges within an overall framework. Our discussion of each individual security challenge is brief – in-depth treatment of each challenge can be found in the literature referenced in each section. Authentication ensures that an entity is a valid user before access to resources is granted. In cluster computing, authentication is often done with passwords that a user enters on request at a login or head node. In Grid computing, certificates (e.g. X.509) are increasingly used. It must be ensured that both passwords and certificates can be tied to a real person in order to take action and assign responsibility in the case of account misuse. To obtain a user certificate, a Registration Authority (RA) must confirm the identity of the requesting entity and then facilitate the granting of a certificate signed by a Certificate Authority (CA). The entity can then use the certificate to identify itself to the Grid.
TABLE I: ON-DEMAND SECURITY CHALLENGES

Challenge | Description | Current Best Practices
Authentication | matches user to valid identity and corresponding passwords or X.509 certificates | user behavior monitoring to indicate behavior consistent with attackers
Authorization | grants access to resources only to entities that have the authority to use them | secure file systems, with audit trails using secure reliable logging
Delegation | allows a system to carry out functions on behalf of a user | audit trails with secure reliable logging
Confidentiality | ensures information is not read without proper authorization | combination of strong authentication, encrypted storage, and secure communications
Secure Communications | protects communications content and metadata privacy | encrypted content and anonymous communications channels
Data Availability | backs up data for short-term and long-term failure events | operating system versioning and journaling, hardware storage solutions such as RAID and immutable WORM storage media, temporary scratch space for executing jobs and archival storage for saving user input and output data
Auditing | tracks system state for forensic analysis | adding reliability and confidentiality to audit logs with windowing connection-oriented protocols and cryptography (respectively)

Fig. 5. The Threat Tree for On-Demand Computing
The EUGridPMA [14] establishes requirements and best practices for Grid identity providers to enable a common trust domain applicable to authentication of end-entities in inter-organizational environments. The following is taken from the EUGridPMA guidelines concerning the issuing of certificates for users: "In order for an RA to validate the identity of a person, the subject should contact the RA face-to-face and present photo-id and/or valid official documents showing that the subject is an acceptable end entity as defined in the CP/CPS document of the CA. In case of host or service certificate requests, the RA should validate the identity of the person in charge of the specific entities using a secure method. The RA should validate the association of the certificate signing request."
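For illustration, the technical half of this enrolment step – generating a key pair and a certificate signing request that the RA then vets – might look as follows in Python with the widely available cryptography package. The subject names below are placeholders; real Grid CAs prescribe their own distinguished-name layout:

```python
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

# Generate the key pair; the private key never leaves the user's machine.
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# Build a certificate signing request carrying the claimed identity.
csr = (
    x509.CertificateSigningRequestBuilder()
    .subject_name(x509.Name([
        x509.NameAttribute(NameOID.ORGANIZATION_NAME, u"Example Grid VO"),  # placeholder
        x509.NameAttribute(NameOID.COMMON_NAME, u"Alice Example"),          # placeholder
    ]))
    .sign(key, hashes.SHA256())
)

# The PEM-encoded CSR is what the user hands to the RA/CA for vetting.
print(csr.public_bytes(serialization.Encoding.PEM).decode())
```

The security of the whole scheme then rests on the RA's face-to-face identity check and on the user keeping the private key safe – which is exactly the key-hygiene problem discussed next.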
Certificates are often managed by users themselves. While passwords can be memorized, certificates must be stored in digital form, which creates a security risk since there is no enforcement of key hygiene by users. For this reason, a CA cannot guarantee that a particular certificate was not stolen after signing. A recent IBM security study found that 80% of Windows clients have spyware infestations and 30% already have back doors [15], thus opening the door for identity theft. Certificate theft usually goes undetected for a long time since there are no external indications (no loss of functionality or alerts) to a user that something bad has happened. The problem of storing certificates on unsecured end-user machines can be partially addressed by the introduction of credential repositories, which store certificates on behalf of the users, with a user accessing his/her certificates via a password. This system combines the benefits of passwords with the certificate trust framework. Of course, this only makes sense if the passwords to the credential repository are not stored on the end-user machines. Credential repositories also make tempting targets for attackers since they carry credentials which may be broken offline if compromised. However, while poorly managed credentials have caused innumerable security incidents, to the best of our knowledge, not a single centrally-maintained authentication server has been compromised in the past decade [16]. In on-demand computing, the problem of password and credential hygiene is amplified by the number of participants in the system. For a single cluster, a system administrator may recognize consistent user behavior and thus be able to identify any unusual behavior. With on-demand computing this is no longer possible. In [17], a system is proposed that automatically identifies types of users by analyzing their command patterns using process accounting. Such an automated system could be used to warn local administrators if command behaviors consistent with attackers are detected, so action can be taken to investigate. Such a system could prove invaluable to the on-demand Grid and cluster computing community, since it is capable of assisting in the detection of certificate theft across organizational boundaries.
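The core idea behind such command-pattern analysis can be sketched in a few lines of Python. This is not the mechanism of [17], only a simplified illustration of comparing a user's historical command profile from process accounting against recent activity; the 1% rarity threshold is arbitrary:

```python
from collections import Counter

def command_profile(history: list[str]) -> dict[str, float]:
    # Relative frequency of each command in the user's accounting history.
    counts = Counter(history)
    total = sum(counts.values())
    return {cmd: n / total for cmd, n in counts.items()}

def anomaly_score(profile: dict[str, float], recent: list[str]) -> float:
    # Fraction of recent commands that are rare or unseen for this user.
    return sum(1 for cmd in recent if profile.get(cmd, 0.0) < 0.01) / len(recent)

profile = command_profile(["ls", "cd", "mpirun", "ls", "vi", "mpirun"])
if anomaly_score(profile, ["nmap", "nc", "wget"]) > 0.5:  # attacker-style commands
    print("warning: command behavior inconsistent with this account's history")
```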
Cooperating resource providers could combine their knowledge of attack patterns and thus gain a broader basis for detecting malicious behaviour in unknown users. Users would also have a vested interest in sharing their command pattern information with resource providers to protect themselves against identity theft.

Authorization ensures that authenticated users can only access the resources that they are allowed to access. Most shared-use attacks described in Section III (A.1 - B.3) can be stopped by sufficiently tight authorization enforcement. In cluster computing, authorization is accomplished via user rights of the local operating system, typically manually configured by the local system administrator. In the Grid environment, there has been a rapid evolution of different authorization technologies (grid-map files used by Globus [18], VOMS [19], CAS [20], PERMIS [21]) that unfortunately do not interoperate. For on-demand computing, it is important that a standardized approach is adopted by resource providers so that solution producers can create requirement requests which fully capture the access rights for the necessary resources. A standardized approach would also allow solution producers to find resource providers whose security policies match the required access rights of their application. Another issue is the revocation of access rights. In current Grid systems, the target revocation delay is 10 to 60 minutes [16]. In an on-demand environment, it is desirable to have even lower delays so resource providers can quickly react to resource abuse and users can quickly react to account abuse. Currently, Certificate Revocation Lists (CRLs) are used to revoke access rights throughout a Grid environment; however, practical implementations of this system do not meet delay targets. In [22], Online Certificate Status Protocols (OCSPs) are proposed to allow the timely revocation of user rights. Just as with authentication, the resource providers would do well to combine their authorization frameworks to cooperatively respond more quickly to newly identified attacks and attackers. All of the requirements discussed to this point can be handled using standard cluster and Grid authorization frameworks. It is up to the resource provider to ensure that users do not get the chance to execute programs for which they do not have sufficient access rights. In many cases, solution producers and their software legitimately require root access to systems to execute on-demand jobs. It is not feasible to perform an audit of code in an on-demand environment in order to distinguish legitimate root access from privilege escalation. It is therefore very difficult to ensure that solution producers do not illegally access other solution producers' or users' assets which are hosted within the same operating system. For a single small cluster, privilege escalation is detectable given familiarity with typical user behavior. However, for on-demand computing, privilege escalation becomes more complex and does not scale to detect without automation. In [23], a system is proposed that automatically identifies privilege escalation attacks on large clusters. Such an automated system may prove invaluable to the on-demand Grid and cluster computing community, since many serious attacks escalate from local accounts to root using exploits.
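The flavor of such automated escalation detection can be sketched as follows. This is a toy illustration of the general idea, not the approach of [23], and the record schema is invented:

```python
def detect_escalations(process_records):
    # Each record: (user, command, uid_before, uid_after) -- illustrative schema
    # that a monitor might derive from process accounting data.
    sanctioned = {"su", "sudo", "login", "sshd"}  # legitimate uid-changing commands
    for user, command, uid_before, uid_after in process_records:
        # Flag any process that ends up as root without a sanctioned transition.
        if uid_before != 0 and uid_after == 0 and command not in sanctioned:
            yield (user, command)

records = [("alice", "mpirun", 1001, 1001), ("mallory", "exploit.bin", 1002, 0)]
for user, command in detect_escalations(records):
    print(f"possible privilege escalation by {user} via {command}")
```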
Delegation allows a system to carry out a range of functions on behalf of the user, who only has to log on to the system once. This is one of the major requirements of Grid computing and is an extension of the authentication and authorization process which is dealt with by systems like MyProxy [24]. It can be distilled down to the problem of online key management and bootstrapping in a distributed environment. There are new security risks introduced by the proxy delegation system used to enable single sign-on. The more credentials are stored online to facilitate single sign-on, the easier it becomes to steal a user's identity. Once an attacker gains access to a user's bootstrapping mechanism, it becomes very easy to move through an on-demand system, as there are fewer personal checks in place. The bootstrapping mechanism must therefore be highly protected. Possibilities are one-time passwords based on a shared secret passed in an out-of-band fashion or the use of biometric data. CryptoCard [25] or SecurID [26] based systems can also be utilized.

Confidentiality ensures that private data (including metadata about traffic and computation activity) of all participants is protected from all unauthorized entities. This is the area where on-demand computing is most challenged. The large number of users and solution producers creates the need for secure operations; otherwise, commercial use will be very restricted. Standard data access controls offered by the operating system can be configured to protect users' files from other users, but if an attacker gains root access to the system or is given root access legitimately (in the case of solution producers), secure storage systems can be circumvented. Furthermore, meta-information about traffic and computational workloads can be gathered via commands like ps, legitimate monitoring tools, or illegitimate traffic sniffing. The ability of solution producers to install custom software on systems makes it difficult to protect operational meta-data within a standard operating system. To ensure that the confidentiality requirements of on-demand customers are met, sandboxing is required to separate the different solution producers from each other and the different users from each other.

Secure Communications guarantees the integrity, confidentiality and non-repudiation of packet communications between entities such as HPC clusters over a Grid. This is an area where on-demand computing can leverage existing solutions; however, some of the existing solutions do not scale to large Grid and cluster computing scenarios. Integrity may be provided by a combination of error-detecting techniques and ARQ windowed connection-oriented protocols. Confidentiality and non-repudiation can be provided with cryptographic solutions such as symmetric/asymmetric encryption and digital signatures, respectively. The need for both connection-oriented protocols and cryptography can be met with VPNs and particularly IPSec, but there is a challenge for on-demand computing since VPNs are relatively long-lived and manually configured. Future work is needed for middleware to enable the setup of dynamic VPN tunnels to handle the dynamics of on-demand computing. Performance tradeoffs are a major issue, with tuning necessary for PKI and TCP/IP solutions to deliver the required high-speed bandwidth.
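To make the integrity/non-repudiation half concrete, here is a minimal digital-signature round trip in Python with the cryptography package. The message content is an invented placeholder, and a production Grid would wrap this in a full protocol such as TLS or IPSec rather than use raw signatures:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

# The sender signs the job result with a private key only it holds,
# giving the receiver both integrity and non-repudiation.
sender_key = ed25519.Ed25519PrivateKey.generate()
message = b"job 4711 result: 42"  # placeholder payload

signature = sender_key.sign(message)

# The receiver verifies with the sender's public key; any tampering
# with message or signature raises InvalidSignature.
try:
    sender_key.public_key().verify(signature, message)
    print("message authentic")
except InvalidSignature:
    print("message was tampered with")
```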
Data Availability is the ability to preserve user data in the presence of short-term and long-term failure events. Short-term failure events include operating system crashes, storage device failures, and system compromises with attacks deleting files. These failure events require backup over relatively short time periods and are typically handled with file system versioning and journaling or hardware solutions such as RAID or immutable WORM storage media. Long-term events include fires, floods, electrical outages, and insider attacks, which may take place over an extended period of time. These failure events require backup over longer time periods and are typically handled by archiving off-site, often using a service provider specializing in disaster recovery. For a survey of techniques and tradeoffs for providing data availability see [27]. In large Grid and cluster systems, the current best practice is to provide external storage separated into two categories: (1) temporary (scratch) work space that provides relatively fast access to moderate data volumes for use by current tasks, and (2) long-term storage systems such as tape libraries for long-term backup data retention. For on-demand computing, providing storage solutions for data availability is a challenge since different applications have different requirements while storage solutions are not on-demand – storage solutions require intensive human configuration and management and are relatively static in comparison with the dynamics of on-demand computing.

Auditing allows resource providers and solution producers to see what actions were executed, when, and by whom. While auditing does not directly secure a system to prevent attacks, it does serve a vital role in the overall security architecture by providing a history for analysis, data to determine cause-and-effect relationships, and a baseline of normal and abnormal behaviors. It is also a high priority target for attackers who want to delete attack traces or cause obfuscation/disruption by modifying audit logs. Deleting or modifying audit logs is particularly relevant for on-demand computing since audit data is the basis for billing (process accounting records). Most of the requirements for auditing on-demand computing are identical to traditional auditing requirements. However, one significant difference is that resource providers have typically been the only entity authorized to access audit data in single cluster environments, while for on-demand computing multiple solution producers are likely to require access to audit data for billing their services. Automated mechanisms are needed to allow access to audit data by one solution producer while at the same time protecting the audit data of other solution producers.

V. LEVELS 1 AND 2 ON-DEMAND SECURITY

Enforcing security policies in shared cluster and Grid environments is a complex task for system administrators. In an on-demand Grid environment, these tasks grow even more complex since the lifetime of deployed compute jobs (and, potentially, related user IDs) may be restricted to the jobs' run time. Thus, administrators neither gain sufficient experience with the expected behaviour of tasks nor are they able to rely on a fixed software configuration of the system, since newly deployed jobs may bring new dependencies for shared libraries and third-party software along with them.
One of the basic ideas of the Grid, namely sharing computing power over the Internet – an approach that makes "raw" CPU cycles available in a securely shared, but flexible way – sounds attractive. An efficient solution for sharing Grid compute resources on a single physical machine is to use a virtualization architecture [6], [28], [29], [30]. Next to commercial solutions like VMware GSX Server and Microsoft Virtual PC, the Xen hypervisor [31] – a free and open source virtualization system developed at the University of Cambridge, UK – has gained significant interest in the open source operating system community. Several Grid groups are currently working on utilizing Xen to provide secure sandboxes for their Grid environments [32], [33], [29]. Xen provides independent, secure virtual machines in which a modified kernel of an open source OS like Linux, NetBSD or Plan 9 forms the basis for an essentially unmodified system and application installation on top of it.^4 Several of these so-called XenU instances usually run in parallel on a single physical machine, protected from each other, under the control of a Xen0 master operating system instance that can create, suspend and terminate XenU instances on demand. The only instance gaining access to the providing system's hardware, like peripheral devices or physical disks, is Xen0 (to which only the administrator of the system has access); CPUs, network and disk devices are virtualized for XenU domains and thus controllable by Xen0. Since each XenU instance basically runs its own operating system instance, solution producers can now, if desirable, not only deploy an application onto a resource provider's node along with the required input data; rather, they are able to deploy a complete operating system installation along with all required shared libraries, third-party software, etc. This greatly reduces compatibility and dependency problems when running Grid applications and provides the solution producer with full root access for configuration and control of the instance, albeit restricted to their virtual environment. Thus, using a virtualization solution implies that many configuration and administration tasks are shifted from the resource provider to the solution producer. This, in turn, can in many cases prove to be a win-win situation: on the one hand, administrators now only have to provide systems running Xen along with a Xen0 controlling OS instance and network connectivity for the XenU nodes, usually tightly controlled by a firewalling or packet filtering solution. The Xen0 monitoring system can be used by the resource provider to bill solution producers and users for the resources used. Solution producers, on the other hand, can now provide a complete system environment in which they have administrative privileges without having to find destination systems in the Grid that provide exactly the required software infrastructure. Solution producers now only have to care about finding a Xen-enabled system with a specific type of CPU and sufficient amounts of main memory and disk space in order to run their jobs.

^4 With the upcoming release of Xen 3, new in-CPU hardware virtualization technologies from Intel (Vanderpool) and AMD (Pacifica), which will be available in early 2006, allow the use of unmodified OS kernels and thus also permit running e.g. Windows on top of Xen.
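For concreteness, a XenU domain of this kind is described to the Xen0 tools by a small configuration file (Xen's domain configuration files are plain Python). The following is a hedged sketch with invented names and paths; the exact option set depends on the Xen version installed:

```python
# Illustrative XenU domain configuration (Xen 2.x/3.x "xm create" style).
kernel = "/boot/vmlinuz-2.6-xenU"        # kernel supplied by the Xen0 administrator
name   = "grid-job-4711"                 # invented domain name
memory = 512                             # MB of main memory for the instance
disk   = ["file:/var/xen/images/job-4711.img,sda1,w"]  # job-specific filesystem image
root   = "/dev/sda1 ro"
vif    = ["bridge=xenbr0"]               # network access, filtered by the Xen0 firewall
```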
Fig. 6. Xen- and Linux-Based Grid Deployment Process (task description → task-specific filesystem image → + input data & config → job-specific filesystem image → deployment to Xen on a Grid node → XenU instance running on the user's filesystem image)
When we consider Grid users who do not intend to develop solutions on their own, providing complete OS instances also allows solution producers to deliver a completely autonomous system installation along with their specific application without any additional incurred cost for operating system licenses. In this case, the on-demand Grid user simply has to supply the required input data to be processed. The basic deployment process of a Xen-enabled on-demand Grid job is depicted in Figure 6. Creating and running a job in a Xen-based Grid environment requires the execution of six steps, which taken together enable Grid users to deploy OS instances on demand. The first step a solution producer has to perform is to analyze the requirements of his or her Grid application, which may range from third-party libraries (e.g., for high precision math) over runtime environments (e.g., a Java virtual machine in a specific version) up to complex third-party applications (e.g., MatLab or Mathematica). A standard solution for this complex task is readily available in most Linux distributions that support package management, like distributions using Debian's deb packages or RedHat's RPM. In these packages – which usually provide a specific component of an OS installation, like a certain shared library along with required header files, or a single application – dependency information is encoded that tells the distribution's package management system which other packages have to be installed in advance to enable the installation of a certain package. When all dependencies are resolved, the second step is to accumulate all required packages in order to create a customized, minimal Linux installation that is the basis for deploying the system on a Xen-based virtual machine. This system should be as small as possible, since it will be transferred over the Internet; thus, only installing software required by the specified dependencies guarantees a minimal system. One important point when creating the filesystem is that it contains no OS kernel – the kernel is provided by the Xen system for compatibility and security reasons. Compatibility problems could occur due to different versions of Xen installed on different Grid nodes, which each require a specifically adapted version of the Linux kernel running in a XenU domain. Security problems could not directly result from deploying a user-supplied kernel (since Xen takes care of sandboxing XenU kernels), but requiring a predefined kernel provided by the Xen0 administrator adds an additional level of security in case the Xen hypervisor itself suffers from a security problem.
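The dependency resolution in these first two steps amounts to computing a transitive closure over the package metadata. A minimal Python sketch of that closure, with an invented dependency map standing in for the distribution's real package database:

```python
# Toy package database; a real system would query dpkg/apt or rpm metadata.
DEPENDS = {
    "my-grid-app": ["libm-highprec", "java-vm"],   # invented package names
    "libm-highprec": ["libc"],
    "java-vm": ["libc", "libz"],
    "libc": [],
    "libz": [],
}

def closure(package: str, resolved=None) -> set[str]:
    # Depth-first walk collecting every package the application needs.
    resolved = resolved if resolved is not None else set()
    if package not in resolved:
        resolved.add(package)
        for dep in DEPENDS[package]:
            closure(dep, resolved)
    return resolved

# Everything in this set, and nothing else, goes into the minimal image.
print(sorted(closure("my-grid-app")))
```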
In the third step, the configured Linux filesystem images are customized further for the specific deployment. This step consists mostly of amending the task-specific filesystem image created in the second step with job-specific files, usually a specification of the input data to operate upon, license files for installed third-party software and potentially configuration files that specify network connections. Along with this, a configuration file is generated specifying the CPU, main memory and disk space requirements of the job and the required network connections for automatic configuration of the firewall that is provided by the Xen0 administrator. In the fourth step, the generated job-specific filesystem images can now be deployed on the target systems. A system on the Grid offering the required combination of CPU type, main memory and disk space is selected and the complete file system image is transferred to the destination system for execution. The fifth step, then, constitutes the initialization of the Grid job on the destination system. This involves the creation of a configuration file for the XenU instance, the creation of the XenU instance, and, finally, the booting of the instance. In the sixth step, after the system is booted, the Grid job itself is executed and returns the calculated results via the Internet. After running the job, it can be destroyed altogether or kept in storage on the Grid node for later reuse.
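The job description generated in the third step can be thought of as a small structured document. A hedged sketch of its content as a Python literal, with field names invented purely for illustration:

```python
# Illustrative job-specific deployment descriptor (step 3); all names invented.
job_spec = {
    "cpu": "x86_64",                   # required CPU type of the target node
    "memory_mb": 512,                  # main memory for the XenU instance
    "disk_mb": 2048,                   # disk space for the filesystem image
    "image": "job-4711.img",           # job-specific filesystem image to deploy
    "network": ["allow tcp out 443"],  # rules for the Xen0-administered firewall
    "input": "/data/input-4711.dat",   # input data packed into the image
}
```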
A. Trust Relationships Using Xen

In the Xen-based on-demand Grid scenario described above, the trust relationships between the involved parties must hold for levels 1 and 2. The tasks of the parties in an on-demand environment using Xen are defined as follows. The resource provider takes care of setting up the physical hardware running the jobs. On these systems, the provider installs the Xen hypervisor software along with a Xen0 Linux instance that is responsible for creating, controlling and destroying the XenU instances that are to be deployed on demand. The Xen0 instance is under the sole control of the resource provider; thus, he or she is able to ultimately control resource usage and network connectivity of all XenU instances. Solution producers create an application running on a Linux system together with a machine-readable description file indicating which other packages^5 the application depends on. Depending on the mode of operation, the application is either distributed as-is to users or, in addition, provided as part of a self-contained Linux image. The user, finally, is responsible for testing a solution locally and for creating a custom-tailored Linux image with all the dependencies required by the solution, which he or she then combines into job-specific images by adding job-specific input data and configuration files to the generic image for a solution or task.

^5 Usually provided by the Linux distribution the Grid environment is based on.
In addition, the user has to find suitable Xen-enabled compute nodes, apply for service and finally transfer the generated job-specific Linux image to the node determined suitable for the specific job's execution. Xen-based systems address the challenges for on-demand security as follows:

Authorization is basically similar to traditional cluster and Grid systems. Since the resources that are accessed are very generic (CPU time, main memory and disk space), deployed jobs have the illusion of running on a system of their own. The interaction patterns with the security-relevant part of the complete system – the Xen hypervisor itself and the controlling Xen0 domain – are well-defined; Xen's sandboxing takes care of restricting the consumed CPU time and disk space of a XenU instance, and a firewall or packet filter installed in the Xen0 domain can, in addition, constrain network connectivity. Thus, a compromised password or certificate used to gain access to a provider's system will in the worst case result in wasted CPU cycles. Data and applications from other users are protected since they live in completely separate XenU domains; the provider is safe from attacks coming from the compromised domain since he or she can always restrict or destroy the offending instance while keeping full control over the system's hardware and resource allocation.

Auditing in a Xen-based system can be encompassing, since every bit of information going into and leaving a XenU domain (be it network or disk transfers) has to pass through the Xen hypervisor, where it can be recorded or, if required, intercepted. However, interpreting the audit trails generated by Xen can be more difficult compared to standard Unix auditing methods, since the information available to Xen is on a lower abstraction level: Xen can only record network packet data received and sent and disk device read/write operations, whereas auditing on the kernel level can automatically extract corresponding file names or network protocol information related to a process.

Confidentiality is improved in Xen-based systems compared to traditional Unix systems, on which usually several users can have simultaneous access to the system. The sandboxing method used ensures that (unless the user decides otherwise) the OS installed in each XenU instance is exclusively used by one user, who in turn has exclusive access to all files in the corresponding virtual disk image. This principle also ensures that different solution producers are protected from each other and also that a user is protected from solution producers that are used by different users. Since the XenU sandboxes use an administrator-provided Linux kernel image, users effectively have no access to the controlling Xen0 instance or other users' XenU instances, thus protecting the provider from malicious users. This constellation, however, does not protect the user from the provider, since the user no longer has control over the data processing that takes place in the kernel. Since all information has to pass through the kernel, even the use of encrypted filesystems does not suffice, since keys can always be read out of the XenU address space by the provider. However, finding keys in the memory space used by applications to encrypt files takes significant effort, so file-level encryption used by applications could increase the obstacles the provider has to overcome to access users' application data.
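As an illustration of such application-level file encryption, here is a minimal sketch using the cryptography package's Fernet recipe, with the password-based key derivation shown explicitly. File contents and the passphrase are placeholders; as argued above, this raises the bar for a snooping provider but cannot defeat one who dumps the process memory:

```python
import base64, os
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

def derive_key(password: bytes, salt: bytes) -> bytes:
    # Derive a symmetric key from a user passphrase; the key lives only in memory.
    kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32, salt=salt, iterations=480000)
    return base64.urlsafe_b64encode(kdf.derive(password))

salt = os.urandom(16)
f = Fernet(derive_key(b"user passphrase", salt))  # placeholder passphrase

ciphertext = f.encrypt(b"confidential result data")  # written to the virtual disk
plaintext = f.decrypt(ciphertext)                    # recovered only inside the application
```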
B. Feasibility Analysis In [28], [31], [34], [29] performance measurements can be found for different virtualization solutions including Xen. Apart from runtime performance drawbacks which are also an issue in static Grid and cluster systems in a on-demand scenario the deployment cost of the virtual sandbox image is relevant. One drawback of sending complete file system images via the network is the induced transfer overhead for the image compared to sending a single application along with its input data. While a typical full-scale Linux today requires several gigabytes of disk space, these installations usually contain much software that may never be used - this is even more likely in a Grid environment where features like web browsers, a graphical user interface and possibly also the software development environment inherent in many standard installations are unnecessary for the execution of compute- or data-intensive jobs. In fact, a minimal special-purpose Linux system like Tinylinux or Tomsrtbt can be made to fit on a (compressed) floppy disk image, requiring less than two megabytes of transfer capacity. These systems, however, lack quite some functionality a Grid user may expect – e.g., a shell with extended scripting capabilities or standard Unix utilities like sed or awk. Thus, a Linux distribution usable for a typical C++ or Fortran-based application tends to be between 40 and 70 MB. This may be too much data to transfer for short-running Grid jobs with lifetimes of a few minutes; for more complex jobs, however, the benefits of a complete, self-contained runtime environment outweigh the induced data overhead by far. Additionally, in this area many possibilities for optimization exist. For example, certain classes of applications may require very similar configurations of the Linux system they are based upon. As a consequence, Grid system administrators can provide a set of preconfigured OS images, so that deployment of an application from this class does not require the transfer of the base Linux system, thus shifting the task of combining the base image and the applications’ code and data from the user to the provider. In this environment, the user must be able to verify the authenticity of an OS image, e.g. by generating a secure hash value of the image and comparing it to a set of hashes for trusted images. A further optimization could be achieved by providing generic operating system images in a peer-to-peer (P2P) system like BitTorrent [35], however, the new security threats introduced by such a system must then be taken into account. This relieves the user from increased network transfer due to repeatedly sending out a multi-megabyte image for each job deployed. Using P2P technology, images can be stored in a distributed and replicated way among all Grid nodes, resulting in a levelling of the network traffic between the various nodes as far as image transfer is concerned.
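The image verification step mentioned above is straightforward to sketch, here in Python with SHA-256. The trusted-hash list is an invented placeholder that would in practice be published by the provider over an authenticated channel:

```python
import hashlib

# Hashes of provider-published trusted images; the value here is a placeholder.
TRUSTED_IMAGE_HASHES = {
    "3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b",
}

def verify_image(path: str) -> bool:
    # Hash the image in 1 MB chunks so multi-gigabyte files need little memory.
    digest = hashlib.sha256()
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() in TRUSTED_IMAGE_HASHES

# Refuse to boot an image whose hash is not on the trusted list.
assert verify_image("base-image.img"), "untrusted or corrupted OS image"
```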
VI. Level 3 On-Demand Security

The step from level 2 to level 3 on-demand Grid and cluster computing removes the trust relationship of the user and the solution producer towards the resource provider (see Section II). This entails that the solution producer and the user must be able to remotely check the integrity of the remote system, to ensure that the solution producer's software has not been modified and that the data is stored securely. It should be noted that it is impossible to fully secure a remote system under foreign administrative control: as a last resort, the remote administrator can unhook the system during operation and then mount any number of off-line attacks against both software and hardware security mechanisms. Given enough time and resources, it is possible to break any encryption protecting the data. The goal of level 3 on-demand security is therefore to make the required time and effort so prohibitive that this is no longer a relevant threat to solution producers and users.

One way to make attacking a system more complex is to implement the security measures in hardware, since it is much harder for most attackers to work around hardware than to work around software security systems. This introduces a new entity into the trust model: the Trusted Platform Module (TPM) producer. Since the security hardware offers essential security features for level 3 on-demand computing, users, producers, and providers must trust the hardware producer that the security hardware does what it is supposed to do. However, since there are only very few hardware producers compared to the number of other actors in the on-demand world, it is possible to build a trust relationship to the hardware producers, either by reputation6 or by hardware audit.

6 Security hardware producers are usually large, well-known firms such as Intel, AMD, IBM, etc.

The Trusted Computing Platform Alliance (TCPA)7 has produced open specifications for such a hardware-based security system: a security chip (the TPM) and related software interfaces.

7 It should be noted that there has been a great deal of negative hype concerning TCPA, predicting the end of free speech and total control of the IT landscape by companies like Microsoft. This is in part due to the misconception that TCPA is the same thing as Palladium/NGSCB, the Next-Generation Secure Computing Base proposed by Microsoft. A detailed rebuttal to the accusations can be found in [36].

In [37], the functionality of the TPM chip is introduced as follows: the TPM chip provides several essential, hardware-based security services. First, it provides hardware-based public key management and authentication services whose private keys cannot be stolen by software-based attacks of any kind (including malware and phishing attacks), as the sensitive private keys are generated on the chip and are never visible in plain text outside of the chip. Second, it provides the essential boot-time hardware to measure a system's software integrity and report any sign of software tampering. Third, it provides for secure storage of data, including keys, protecting their secrecy across reboots and protecting them against theft. These integrity measurement and secure storage functions enable significant performance improvements in the verification of file authenticity, integrity, currency, and safety, so that these functions can be performed efficiently.
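The boot-time integrity measurement mentioned in the quotation rests on a simple hash-chain ("extend") operation: each loaded component is folded into a Platform Configuration Register (PCR), so the final register value commits to the entire boot sequence. The following software-only Python sketch mimics the TPM 1.2 extend semantics; the component names are placeholders.

import hashlib

def extend(pcr, component):
    # TPM 1.2 extend: PCR_new = SHA1(PCR_old || SHA1(component))
    return hashlib.sha1(pcr + hashlib.sha1(component).digest()).digest()

pcr = b"\x00" * 20  # PCRs are reset to zero at system reset
for component in (b"<bootloader>", b"<kernel>", b"<initrd>"):
    pcr = extend(pcr, component)

# A verifier who knows the expected component hashes can recompute this
# chain; any tampered component changes the final PCR value.
print(pcr.hex())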
Currently, IBM offers a TPM-enabled Linux [38] as a work-in-progress release; not all features of the TPM 1.2 specification are implemented and stable yet. Nevertheless, we believe that such technology is an interesting way of enabling level 3 on-demand security. Apart from the secure boot process, the IBM TPM Linux offers three security modules (EVM, SLIM, and IMA) which are of interest for level 3 security.

After the secure boot operation, the unmodified TPM-secured operating system is up and running. Next, the Extended Verification Module (EVM), which is part of the operating system, can be used to check whether an application that is to be loaded has been modified, in a fashion similar to the boot-time checks. The EVM module verifies that all files are authentic, unmodified, current, and not known to be malicious. EVM does not (and cannot) determine whether files are correct, that is, whether they will operate properly given any (possibly malicious) input data. The Simple Linux Integrity Module (SLIM) deals with this issue. This is, however, not a pressing problem here: since each solution producer operates within his or her own XenU instance, any weaknesses within the unmodified application are his or her concern alone and need not be checked by the resource provider. SLIM does, however, help the solution producer to ensure that the software conforms to the requirements of the resource provider.

The Trusted Computing Architecture (TCA) facilitates remote attestation, with which the status of the remote operating system and the applications hosted within it can be checked. The Integrity Measurement Architecture (IMA) extends the TPM-based attestation into the system runtime: EVM provides the measurements, and SLIM identifies all executables (programs, scripts) and all system-level integrity objects (configuration files). IMA attests the software stack and updates protected hardware storage slots to maintain integrity values for the measurement list. Using this remote attestation facility, it is possible for solution producers and users to remotely check whether a resource provider's system meets their security requirements. Possible requirements are that software is stored only in encrypted form and is decrypted into memory only while it is being executed, to reduce the risk of the resource provider stealing the software. Using TCPA-enabled network devices, it will be possible to securely transfer the keys needed to run an encrypted application directly from one TPM to another. This gives the solution producer added security, since the resource provider would have to break the TPM chip itself to illegally gain access to the software or data. The user gains additional security through the remote attestation framework, since it is possible to check whether the application from the trusted solution producer is authentic and unmodified. Figure 7 shows the steps involved in remote attestation as described in [15]; a simplified verifier-side sketch is given after the figure. While the remote attestation feature of TPM-enabled operating systems does not create absolute security for solution producers and users, it makes it extremely difficult for the resource provider to tamper with their software and data, thus decreasing the need for solution producers and users to trust the resource providers.
Fig. 7. Remote Attestation
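The essence of these attestation steps on the verifier's side is sketched below; checking the TPM's signature over the quote and the freshness of the challenge nonce are omitted for brevity, and the data structures are illustrative rather than taken from the IBM implementation.

import hashlib
import hmac

def replay(measurement_list):
    # Recompute the PCR value by replaying the reported measurement log.
    pcr = b"\x00" * 20
    for digest in measurement_list:
        pcr = hashlib.sha1(pcr + digest).digest()
    return pcr

def attestation_ok(measurement_list, quoted_pcr, known_good):
    # Every measured object must be on the verifier's allowlist ...
    if any(d not in known_good for d in measurement_list):
        return False
    # ... and the log must be consistent with the PCR value reported
    # in the TPM-protected quote.
    return hmac.compare_digest(replay(measurement_list), quoted_pcr)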
VII. Related Work

Various approaches for protecting UNIX-like operating systems such as Linux, FreeBSD, NetBSD, and OpenBSD against untrusted applications have been proposed. One very popular method is the virtualization of the entire system hardware, allowing a guest operating system to run in a virtual machine environment created by the host operating system. Such virtual machine systems include User-Mode Linux [39], OpenVZ [40], and Xen [31]. Xen, in particular, has seen a great increase in popularity due to the small performance overhead caused by its virtualization technology. Its efficient performance, combined with the high level of security it offers, led us to adopt it as the basis for our security framework.

In [28], the integration of virtualization techniques into the Globus Grid framework is presented. UNIX accounts, VServer, and VMware are presented as virtual workspace solutions, and their advantages and disadvantages are judged with respect to five features: generality, non-invasiveness, protection, enforcement, and state. The focus of the paper is on the integration of virtual workspaces into the service-oriented framework of Globus and on performance issues of the different workspace implementations. In [29], a similar system based on Xen is introduced for the Condor Grid framework. The focus of that paper is on the strategies that can be used to deploy virtual machines onto the Grid nodes: virtual-grid-node sandboxes, eager-prefetching whole-file-caching sandboxes, lazy-block-caching sandboxes, and lazy-prefetching whole-file-caching sandboxes are introduced as possible deployment strategies, and their performance characteristics are discussed. Security issues are not discussed in depth.

Another sandboxing approach is the chroot mechanism found in many UNIX-like operating systems. Chroot confines the file system access of a process to a different base in the file system; what appears to be the root directory of the file system to a chroot'd process is actually some lower-level directory within the real file system. Thus, access to files is limited to those within the chroot portion of the real file system. While chroot is a step in the right direction of strengthening system security, several well-known methods exist for processes to access files outside the chroot environment. These vulnerabilities have been addressed by the FreeBSD implementation of the jail system call. Jails partition a FreeBSD environment into isolated containers in terms of both file system access and system calls. That is, a process can have root (uid 0) access within a jail environment, but this access is constrained to the jail to which the process belongs. A jail guarantees that every process placed in it, along with its descendant processes, will remain in the jail, thus severely restricting its access to the real system environment. The accessible file name space is confined in the style of chroot (i.e., access is restricted to a configurable new root of the file system inside the jail). Each jail is bound to a single IP address for outgoing and incoming connections, and it is also possible to control which network services a process within a jail may offer. Certain network operations associated with privileged calls are disabled to prevent IP spoofing or the generation of disruptive traffic. Finally, the ability to interact with other processes is limited to processes in the same jail.

Systrace [41] has become a popular mechanism for restricting system calls as well as for controlling privilege elevation on a fine-grained scale without the need for running entire processes in a privileged context. Systrace functionality has been implemented in the base NetBSD and OpenBSD operating systems, with ports available for Linux and FreeBSD as well. System call interposition is used to enforce security policies for processes running under the control of systrace. Systrace is implemented in two parts. First, a kernel extension intercepts system calls, compares them to a kernel-level policy map, and denies calls that map to a restricted entry (or for which no entry is present in the map). The kernel-level code is assisted by a user-level portion that reads and interprets policy specifications before entering them into the kernel-level policy map, reports policy enforcement decisions to user applications, and provides the capability of calling GUI applications for the interactive generation of policies.

UNIX-like operating systems support the notion of an "effective user ID" (EUID) for a process, used instead of the process's real user ID (UID) for the purposes of determining access to system resources. Using this mechanism, a privileged process (UID 0) can perform whatever initialization functions require privileges and then lower its privileges by setting its EUID to a non-privileged (non-zero) value. In the event that a security vulnerability in the running program is exploited, an attacker's access to the system is constrained to the access available under the EUID, not the privileged access of UID 0. The suEXEC capability of the Apache 1.2 web server [42], which sets the EUID of CGI programs to a user ID other than that of the web server, is an example of this functionality.

The Progressive Deployment System (PDS) project [43] provides a virtual execution environment for software deployment. In contrast to a Xen-based virtualization, PDS provides only partial virtualization by intercepting a set of system calls, enabling software deployment on networked machines while allowing management from a central location. Additional components required by the system can be loaded on demand at runtime. While PDS provides a system for software deployment from a central location, it does not provide a mechanism for short-lived deployment of
software. In addition, the virtualization features of PDS are not adequate for the requirements of on-demand deployment, since they affect only a small subset of system calls.

VMPlants [44] is a tool for providing and managing virtual machine execution environments for Grid computing. It supports the automated configuration and creation of virtual machines that are configured on a single system and subsequently cloned and instantiated across the Grid. VMPlants is a middleware service that supports various virtualization solutions; its main focus, however, is not security but the planning of production processes in Grids.

The Entropia Virtual Machine [45] provides a system for creating desktop Grids, focusing on using the idle cycles of desktop systems in a home or office environment. Because potentially unknown and untrusted applications may run on a user's desktop, Entropia provides some limited support for preventing applications from accessing and modifying data outside of the Entropia Virtual Machine. Furthermore, Entropia provides some protection for the distributed application's program and data to prevent malicious desktop owners from snooping on a cycle consumer's information.

VIII. Conclusions

In this paper, we have presented an analysis of threats existing within a shared environment for on-demand cluster and Grid computing, where users may submit jobs that are dynamically scheduled and executed without human intervention. Our analysis addressed three different levels of on-demand computing, based on the number of resource providers, solution producers, and users, as well as the trust relationships between them. We showed how the security threats at these three levels can be handled by constructing a set of security solutions that enable trust management at increasingly demanding levels. While the first two levels can be handled by a sandboxing approach based on the Xen hypervisor system, handling the third level requires hardware security mechanisms based on Trusted Computing Platform Alliance (TCPA) technology.

There are several issues for future work, such as (a) the implementation of the presented architecture in a service-oriented Grid platform, (b) an extensive analysis of the performance characteristics of the presented solutions for the three levels in the context of real-world applications, and (c) an evaluation of the interplay of the security architecture with other Grid software development issues, such as the build and deployment process.

IX. Acknowledgements

This work is partially supported by the German Ministry of Education and Research (BMBF) (D-Grid Initiative, In-Grid Project), Siemens AG (Corporate Technology, München), and IBM (Eclipse Innovation Grant). The authors would like to thank the anonymous reviewers for their valuable comments.

References

[1] I. Foster, C. Kesselman, J. Nick, and S. Tuecke, "The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration," Open Grid Service Infrastructure WG, Global Grid Forum, 2002, pp. 1–31.
[2] I. Foster and C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2003.
[3] IBM Global Services, "FAQs: IBM e-Business On Demand," 2002, http://www.ibm.com/services/ondemand/files/Q&A.pdf.
[4] M. Smith, T. Friese, and B. Freisleben, "Towards a Service-Oriented Ad Hoc Grid," in Proc. of the 3rd International Symposium on Parallel and Distributed Computing, Cork, Ireland, 2004, pp. 201–208.
[5] G. A. Koenig and W. Yurcik, "Design of an Economics-Based Software Infrastructure for Secure Utility Computing on Supercomputing Clusters," in Proceedings of the 12th International Conference on Telecommunication Systems Modeling and Analysis (ICTSM), Monterey, CA, USA, July 2004.
[6] M. Smith, T. Friese, and B. Freisleben, "Intra-Engine Service Security for Grids Based on WSRF," in Proc. of Cluster Computing and Grid, Cardiff, UK, 2005.
[7] W. Yurcik, G. Koenig, X. Meng, and J. Greenseid, "Cluster Security as a Unique Problem with Emergent Properties: Issues and Techniques," in Proceedings of the 5th LCI Intl. Conference on Linux Clusters, 2004.
[8] Computer Security Institute, "CSI/FBI Computer Crime and Security Survey," 2004.
[9] ITSS, "Multiple Unix Compromises on Campus," Stanford ITSS Security Alert, April 10, 2004.
[10] B. Krebs, "Hackers Strike Advanced Computing Networks," Washington Post, April 13, 2004.
[11] M. Pourzandi, D. Gordon, W. Yurcik, and G. A. Koenig, "Clusters and Security: Distributed Security for Distributed Systems," in Cluster Security (Cluster-Sec) Workshop at the 5th IEEE International Symposium on Cluster Computing and the Grid (CCGrid), 2005.
[12] SANS Institute, "The Twenty Most Critical Internet Security Vulnerabilities," November 2005, http://www.sans.org/top20/.
[13] D. Safford, "The Need for TCPA," White Paper, IBM Research, 2002.
[14] EUGridPMA, "The European Policy Management Authority for Grid Authentication in e-Science," 2006, http://eugridpma.org/.
[15] D. Safford, M. Zohar, and A. Boulanger, "Presentation: Trusted Computing for Linux," IBM Research, August 2005.
[16] The Enabling Grids for E-Science (EGEE) Project, "Global Security Architecture," 2004, EU Deliverable DJRA3.1.
[17] W. Yurcik and C. Liu, "A First Step Toward Detecting SSH Identity Theft in HPC Cluster Environments: Discriminating Masqueraders Based on Command Behavior," in Cluster Security (Cluster-Sec) Workshop at the 5th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid), 2005.
[18] The Globus Project, "The Globus Toolkit 4," 2006, http://www.globus.org/toolkit/.
[19] R. Alfieri, R. Cecchini, V. Ciaschini, L. dell'Agnello, Á. Frohner, A. Gianoli, K. Lörentey, and F. Spataro, "VOMS, an Authorization System for Virtual Organizations," in Proc. of the European Across Grids Conference, ser. Lecture Notes in Computer Science, vol. 2970. Santiago de Compostela, Spain: Springer, 2003, pp. 33–40.
[20] L. Pearlman, V. Welch, I. Foster, C. Kesselman, and S. Tuecke, "A Community Authorization Service for Group Collaboration," in Proceedings of the IEEE 3rd International Workshop on Policies for Distributed Systems and Networks, Monterey, CA, USA, 2003, pp. 50–59.
[21] D. W. Chadwick, A. Otenko, and E. Ball, "Role-Based Access Control With X.509 Attribute Certificates," IEEE Internet Computing, vol. 7, no. 2, pp. 62–69, 2003.
[22] M. Myers et al., "RFC2560: X.509 Internet Public Key Infrastructure Online Certificate Status Protocol (OCSP)," http://www.ietf.org/rfc/rfc2560.txt.
[23] M. Treaster, G. A. Koenig, X. Meng, and W. Yurcik, "Detection of Privilege Escalation for Linux Cluster Security," in 6th LCI Intl. Conference on Linux Clusters, 2005.
[24] J. Novotny, S. Tuecke, and V. Welch, "An Online Credential Repository for the Grid: MyProxy," in Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing, 2001, pp. 104–111.
[25] "CryptoCard Corp.," http://www.cryptocard.com.
[26] "SecurID," http://www.ctrl-key.co.uk/index.htm.
[27] J. Tucek, P. Stanton, E. Haubert, R. Hasan, L. Brumbaugh, and W. Yurcik, "Trade-offs in Protecting Storage: A Meta-Data Comparison of Cryptographic, Backup/Versioning, Immutable/Tamper-Proof, and Redundant Storage Solutions," in 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST), 2005.
[28] K. Keahey, K. Doering, and I. Foster, "From Sandbox to Playground: Dynamic Virtual Environments in the Grid," in 5th International Workshop on Grid Computing, 2004.
[29] S. Santhanam, P. Elango, A. Arpaci-Dusseau, and M. Livny, "Deploying Virtual Machines as Sandboxes for the Grid," in Second Workshop on Real, Large Distributed Systems, 2005, p. 712.
[30] M. Clement, Q. Snell, and G. Judd, "High-Performance Computing for the Masses," in Workshop on Java for Parallel and Distributed Computing, IPDPS, 1999, p. 712.
[31] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the Art of Virtualization," in Proc. of the ACM Symposium on Operating Systems Principles (SOSP), Bolton Landing, NY, USA, 2003, pp. 164–177.
[32] Distributed Systems, University of Marburg, "MAGE," 2006, http://ds.informatik.uni-marburg.de/MAGE/.
[33] The Globus Project, "Virtual Workspaces," 2006, http://workspace.globus.org/.
[34] B. Clark, T. Deshane, E. Dow, S. Evanchik, M. Finlayson, J. Herne, and J. N. Matthews, "Xen and the Art of Repeated Research," in Proceedings of the USENIX 2004 Annual Technical Conference, 2004, pp. 135–144.
[35] B. Cohen, "Incentives Build Robustness in BitTorrent," in Proc. of the 1st Workshop on Economics of Peer-to-Peer Systems, Berkeley, CA, USA, 2003.
[36] D. Safford, "Clarifying Misinformation on TCPA," White Paper, IBM Research, 2002.
[37] D. Safford and M. Zohar, "A Trusted Linux Client (TLC)," Technical Paper, IBM Research, 2005.
[38] IBM Watson Research, Global Security Analysis Lab, "TCPA Resources," http://www.research.ibm.com/gsal/tcpa/.
[39] J. Dike, "User-Mode Linux," in Proc. of the 5th Annual Linux Showcase and Conference, Oakland, USA, 2001.
[40] SWsoft, "OpenVZ," 2006, http://openvz.org/.
[41] N. Provos, "Improving Host Security with System Call Policies," in Proc. of the 12th USENIX Security Symposium, Washington, USA, 2003, pp. 257–272.
[42] The Apache Software Foundation, "Apache HTTP Server Documentation: suEXEC," http://httpd.apache.org/docs/suexec.html.
[43] B. Alpern, J. Auerbach, V. Bala, T. Frauenhofer, T. Mummert, and M. Pigott, "PDS: A Virtual Execution Environment for Software Deployment," in Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments (VEE), Chicago, IL, USA, 2005, pp. 175–185.
[44] I. Krsul, A. Ganguly, J. Zhang, J. A. B. Fortes, and R. J. Figueiredo, "VMPlants: Providing and Managing Virtual Machine Execution Environments for Grid Computing," in Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC), Pittsburgh, PA, USA, 2004, p. 7.
[45] B. Calder, A. A. Chien, J. Wang, and D. Yang, "The Entropia Virtual Machine for Desktop Grids," in Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments (VEE), Chicago, IL, USA, 2005, pp. 186–196.