Clusters and Security: Distributed Security for Distributed Systems

Makan Pourzandi, David Gordon
Open Systems Laboratory, Ericsson Research
8400 Decarie Blvd, Town of Mont-Royal, QC, Canada
{makan.pourzandi, david.gordon}@ericsson.com

William Yurcik, Gregory A. Koenig
National Center for Supercomputing Applications (NCSA)
University of Illinois at Urbana-Champaign, USA
{byurcik, koenig}@ncsa.uiuc.edu

Abstract

Large-scale commodity clusters are used in an increasing number of domains: academic, research, and industrial environments. At the same time, these clusters are exposed to an increasing number of attacks coming from public networks. Therefore, mechanisms for efficiently and flexibly managing security have now become an essential requirement for clusters. However, despite the growing importance of cluster security, this field has been only minimally addressed by contemporary cluster administration techniques. This paper presents a high-level view of existing security challenges related to clusters and proposes a structured approach for handling security in clustered servers. The goal of this paper is to identify various necessarily-distributed security services and their related characteristics as a means of enhancing cluster security.

1. Introduction

Large-scale commodity clusters are used in an increasing number of domains: academic, research, and industrial environments. These clusters share and coordinate the use of resources (CPU, storage, etc.) for a wide range of users. Furthermore, the functionality provided by these clusters varies from carrier-class applications with tight requirements on availability and real-time response to High-Performance Computing clusters where availability and real-time response are secondary issues. At the same time, these clusters are exposed to an increasing number of attacks coming from public networks. Therefore, mechanisms for efficiently and flexibly managing security have now become an essential, but at the same time challenging, requirement for clusters.

The main difficulty in cluster security results from the fact that even though many security mechanisms exist for single nodes in a cluster, the issues related to securing a cluster as a whole are not the same as those related to securing the independent nodes that make up the cluster. Even though the behavior of individual nodes may be simple and could be approached with traditional security techniques, we believe that effective security management in the context of cluster systems requires tools that evaluate the state of the cluster as an indivisible entity. Simply put, securing a 100-node cluster is different from securing 100 standalone nodes. To illustrate this, consider the example of a traditional security monitoring tool that examines the flow of communication into and out of individual cluster nodes. This tool is limited to evaluating security based only on streams of data that it considers independently of any cluster-specific context. On the other hand, a cluster-aware security monitoring tool could evaluate whether a given node should even be communicating at all, based on information from sources such as the cluster's job management system. That is, if no job is currently scheduled for execution on a given node, that node should most likely not be sending or receiving data on the network.

The idea that cluster security must be considered as a whole is further underscored by realizing that while the behavior of individual cluster components may be simple, the combined interactions of multiple components may result in complex, unintended, and non-intuitive behaviors that are difficult or impossible to predict. That is, even if certain hardware or software components that make up a cluster are certified as assured, these components must co-exist in a cluster environment that most likely consists of non-assured components. Furthermore, even if a cluster were built entirely from certified components, it is unlikely that the entire cluster, considered as a single entity, would have been evaluated in any kind of certification process. Simple combinatorics make it infeasible to use formal methods to identify and protect against all known vulnerabilities arising from component interactions. Cluster security is an emergent property because it arises from the independent security aspects of the individual cluster nodes and is at the same time irreducible with regard to the overall cluster system [25].

Within this paper, we leverage this perspective on cluster security in order to identify various necessarily-distributed security services and their related characteristics. Our goal is to develop techniques that can be used to enhance the security of clusters in domains ranging from carrier-class telecommunications environments to High-Performance Computing (HPC) environments.

The remainder of this paper is organized as follows. We first define a threat model specific to the cluster environment in Section 2. In Section 3 we explain why cluster security is a challenging task. In Section 4 we state the requirements for different cluster security services, including authentication, access control, and monitoring. Sections 5 and 6 introduce the authors' research projects, DSI and NVisionCC, respectively. Section 7 addresses deployment issues with the practical use of these cluster security solutions. We end with a summary and conclusions in Section 8.

2. Threat Model

In order to prioritize our efforts in protecting clusters, we present a threat model that guides our work. The goal of a threat model is not to cover every possible attack scenario, an exercise that is impossible given the new threats and new vulnerabilities that continuously appear. Rather, the goal of a threat model is to understand a security posture given that attacks are numerous, that no protection system is perfectly secure, and that protection resources are finite. A threat model can be reduced to risk management: what are the likely attacks and, knowing these likely attacks, what tradeoffs are you willing to make to protect against them? Attackers have a given set of capabilities; cluster vulnerabilities exist. When the capabilities generally available to attackers match the exploitable cluster vulnerabilities, there is a higher probability of security breaches in the cluster.

The first threat to consider is the insider attack, which most empirical studies report as the most likely mode of attack [3]. Authorized users with privileged access may attempt to access unauthorized resources, perform denial-of-service attacks on shared resources, or delete or modify shared data sets. Another type of insider threat arises when a legitimate user's authentication credentials (password or keys) are stolen, allowing an attacker to masquerade as a legitimate user. Masquerade attacks are particularly dangerous since they can lead to further damage beyond the initially compromised account, and there is little indication of a problem to cluster security administrators.

External attacks that probe and then exploit cluster vulnerabilities are a new reality after the Spring 2004 attacks on HPC infrastructures worldwide [8, 12]. Attackers seek to steal cluster services, eavesdrop on cluster messages, and disrupt cluster operations. There has been at least one reported case where a cluster's computational power was used to stage a brute-force effort to decrypt stolen password files [15]. Cluster operations can be disrupted using external denial-of-service attacks on cluster nodes that are Internet accessible or by attacks against communications between remote users and clusters.

The largest cluster security threat is actually the combination of multiple threats, in what has been referred to as cascading threats or dependent risk [25]. The security of resources in a cluster environment is dependent on the integrity of all nodes. If one node is compromised, either by internal or external means, there is a dramatically increased risk to the rest of the cluster nodes since they often share identical configurations and common protection mechanisms. There is also a risk to peer resources in other security domains (often other clusters) since cluster users tend to coordinate access across different resources.

We highlight three special points about the cluster security threat model:

• Changing Nature of Clusters – Clusters have moved from closed/proprietary environments (particularly in commercial settings) to open/standard systems that are often exposed to public networks. This change has exposed clusters to a variety of point-and-click attack tools that are easily available on the Internet. Furthermore, many clusters run code from third-party partners or software providers. Due to time and cost constraints, it is almost impossible to perform a security audit on all of this code. Thus, many clusters run untrusted third-party software. This is a major change from traditional clusters running a controlled base of known source code.

• Shift from Random Reliability Failures to Intentional Attacks – Traditional high-availability clusters relied on redundancy to address random failures of hardware and software. However, this approach is not applicable to intentional attacks, where faults are targeted and dependent (as opposed to random and independent).

• Security as a Service versus Security as an Obligation – With the increasing use of clusters in several fields, there has been a gradual change in the way security is handled. In many traditional, security-sensitive fields (e.g., banking, government) the client is bound to the offering (for example, when was the last time you threatened your tax organization to provide a decent, fast, and secure interface for submitting your tax forms, or else you would change suppliers?). In many new fields, security is a service or an add-on that enhances other service offerings. This means that security should be provided in a way that does not invalidate or conflict with other requirements. For example, in the case of handling on-line transactions, security should support real-time response-time requirements. If transactions are too slow, the client can choose not to use the service and the supplier loses the business opportunity. This puts extra pressure on security requirements and changes the way security should be implemented.

3. Challenges for Cluster Security

Given the threat model we have described, implementing security for a cluster is difficult in multiple dimensions [25]. A cluster encompasses a collection of distributed resources: multiple layers including applications, middleware, operating systems, and network interconnects must all be coherently protected. While locking down a cluster by disabling services is desirable from a security perspective, cluster resources are meant to be used, so there is the resource management challenge of allowing users to consume resources in an authorized way. Clusters represent a heterogeneous management environment composed of different hardware and software node configurations, presenting the challenge of integrating different security solutions (vendor or open source) toward comprehensive security across the entire cluster. Further, there are large-scale management requirements. As the size of clusters continues to increase, installing, monitoring, and maintaining clusters becomes a challenge, since any misconfiguration or inconsistency potentially becomes an exploitable vulnerability. We are beyond the point where typically-sized clusters can be managed manually without automation support. The current state of the art offers automated cluster tools for performance management; the challenge is developing automated cluster tools for security management.

4. Distributed Security Services

In the previous sections of this paper we presented the reasoning behind the need for a common infrastructure to implement security in a coherent way throughout an entire cluster. This section concentrates on cluster platform-level security. By platform security we mean the security mechanisms that are deployed at the user level across the entire cluster (with possible support at the operating system level).

To simplify, the distributed security functionality needed for a secure service invocation/communication between two objects on different nodes in the cluster can be summarized as follows:

1. Authenticating the source and target objects. This is fundamental in order to be able to securely define the credentials/privileges for each object in the system.

2. Deciding whether the source object can perform this action on the target object. This should be done according to the security policy already defined.

3. Auditing the action. For many clusters, this is optional based on the system functionality needed. Even though auditing is an often-needed function, its use is based on a trade-off between performance and security needs.

4. Protecting the data flow (requests and responses, the data exchanged, etc.) from being modified or eavesdropped on in transit between nodes.

The above functionality is often implemented in many clusters even though it is not clearly defined. Our approach is to qualify each piece of functionality through a service and to provide the needed functionality by that service. Therefore, we define, respectively, the following services: distributed authentication, distributed access control, distributed monitoring/auditing, and secure communications between different nodes. This service-oriented approach provides more flexibility (as services evolve, a service can be replaced or enhanced with new capabilities), scalability (since several instances of the same service can run on the same node), and fault tolerance (as high-availability techniques can be used to provide service availability [22]). This approach also allows us to take a systematic approach to deploying security functions. Services are inherently distributed across a cluster, interacting to maintain the distributed secure state of the cluster (Figure 1). In turn, each local service instance on a given node should be built on top of existing security mechanisms at the operating system level or other node-level security mechanisms (Figure 2).

Microsoft recently initiated an effort to provide distributed security services for Windows 2000 [19]. Microsoft's approach is heavily based on PKI, Kerberos, and Active Directory. Active Directory is a repository for account information that enhances and scales the use of account information in different domains and stores the security policy for different domains. Microsoft added support for some of these services at the operating system level.

Unfortunately, their approach is heavily tied to specific technologies and lacks the flexibility necessary to be adapted to clusters running an operating system other than Windows.

[Figure 1. Distributed Security Services at cluster level: each node runs Application Security, Middleware Security, Serial Security Services, and Operating System Security layers, with Distributed Security Services and Secure Management spanning the nodes of the cluster.]

[Figure 2. Distributed Security Services at node level: Application Security, Middleware Security, Distributed Security Services, Secure Management, Serial Security Services, and Operating System Security within a single node.]

There are also efforts toward leveraging distributed security services from Grid computing [20]. These services cope with nodes being dynamically added to the Grid and support a wide scope of interoperability, since a Grid generally runs across heterogeneous environments. However, clusters usually fall under a single administrative ownership, and their nodes are not scattered as dynamically as they are in Grids. This fundamentally changes the scope of these services: in the case of clusters, the narrower scope makes it possible to implement more specific mechanisms.
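To make the four functions enumerated at the start of this section more concrete, the sketch below shows one way a middleware layer might compose authentication, access control, auditing, and data-flow protection around a single service invocation. It is a minimal illustration only: the function names, the shared key, and the dictionary-based policy are assumptions made for this sketch and do not correspond to any particular system discussed in this paper. In a real deployment each step would be backed by the corresponding distributed service described in the following subsections.

```python
import hashlib
import hmac
import logging

logging.basicConfig(level=logging.INFO)
CLUSTER_KEY = b"shared-cluster-key"   # placeholder; a real cluster would use per-node credentials

# Illustrative policy: which source objects may perform which actions on which targets.
POLICY = {("web_frontend", "db_service"): {"read"},
          ("scheduler", "compute_node"): {"dispatch", "read"}}

def authenticate(credential: str) -> str:
    """Step 1: map a presented credential to an object identity (stub)."""
    # A real system would verify a certificate or Kerberos ticket here.
    return credential.removeprefix("cred:")

def authorize(source: str, target: str, action: str) -> bool:
    """Step 2: check the action against the distributed security policy."""
    return action in POLICY.get((source, target), set())

def audit(source: str, target: str, action: str, allowed: bool) -> None:
    """Step 3: optional auditing, traded off against performance."""
    logging.info("audit: %s -> %s on %s allowed=%s", source, action, target, allowed)

def protect(payload: bytes) -> tuple[bytes, bytes]:
    """Step 4: integrity-protect the data flow between nodes (MAC only, for brevity)."""
    return payload, hmac.new(CLUSTER_KEY, payload, hashlib.sha256).digest()

def secure_invoke(src_cred: str, tgt_cred: str, action: str, payload: bytes):
    source, target = authenticate(src_cred), authenticate(tgt_cred)
    allowed = authorize(source, target, action)
    audit(source, target, action, allowed)
    if not allowed:
        raise PermissionError(f"{source} may not {action} {target}")
    return protect(payload)

if __name__ == "__main__":
    print(secure_invoke("cred:web_frontend", "cred:db_service", "read", b"SELECT ..."))
```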

4.1 Distributed Authentication

The distributed authentication service has the goal of providing a homogeneous framework for the entire cluster in order to supply authentication information for objects in the cluster. The local authentication of objects at the node level is well understood; the greater challenge is to propagate the authentication information in a transparent way across the cluster.

Kerberos mechanisms may implement such an infrastructure [7]. Several deployment efforts [19] have resulted in many improvements, leading to a reliable and proven protocol. However, deployments of Kerberos in real-world environments show that it does not scale well when applied to large clusters. Furthermore, the protocol presents some single points of failure, which is unacceptable for many high-availability applications.

The use of digital certificates issued and verified by a Certificate Authority (CA) as part of a Public-Key Infrastructure (PKI) is considered likely to become the standard way to perform authentication on the Internet. Widespread acceptance of PKI must still address practical issues such as deployment and trust management. The deployment of a PKI inside a cluster, however, is much easier, since it avoids the major problems of traditional PKI deployment: scalability and trust management. There are a variety of methods for using PKI in this case. Mainly, certificate servers can be used to create and securely propagate certificates while maintaining the certificate revocation lists.

The target unit for authentication can be Unix users, nodes, or even applications/processes inside the cluster. This choice depends largely on the type of cluster environment. The most straightforward approach is to use Unix users as the basic granularity. In this case, every user in the system holds a key pair, typically distributed with certificates issued by a trusted CA. In many applications, possibly only a few users exist on dedicated clusters running a pre-defined set of software. However, a user-based security system does not support authentication and authorization checks for interactions between two processes belonging to the same user. This situation leads to an all-or-nothing approach, as all users within a group, or all processes owned by the same user, have the same rights. This is quite inconvenient when one wishes to compartmentalize these rather large distributed applications by restricting access to some resources, processes, or users within the same group. Therefore, user-level granularity used as the basic entity for access control in these distributed applications may not be sufficient. In this scenario there is a need for a security mechanism with finer granularity, one which uses the individual process as the basic entity being secured.

Although user authorization remains a fundamental preoccupation, applying security to applications has always depended directly on the user's permissions. For a given application, distributed authorization should determine its scope of execution throughout the cluster. Therefore, we work toward better authentication of programs (i.e., binaries) instead of concentrating authentication efforts solely on Unix users. As for Unix users, the Unix user ID should be managed in a consistent way throughout the entire cluster. In the end, a combination of process- and user-level security is the first step toward implementing better security for clusters. Current systems offer all the tools and infrastructure for user authentication; for authentication with process granularity, the DSI project addresses this issue by offering distributed security policies based on process classifications.
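As an illustration of authentication at process rather than user granularity, the sketch below has a cluster authority sign a small credential binding a Unix user to a process class, which a node then verifies before trusting it. The Ed25519 key handling uses the Python cryptography package; the credential fields and helper names are assumptions for illustration and do not describe DSI's actual format.

```python
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

# The cluster authority's key pair; in practice only the public key is distributed to nodes.
authority_key = ed25519.Ed25519PrivateKey.generate()
authority_pub = authority_key.public_key()

def issue_credential(unix_user: str, process_class: str) -> tuple[bytes, bytes]:
    """Authority side: bind a Unix user and a process class into one signed credential."""
    claim = json.dumps({"user": unix_user, "class": process_class}).encode()
    return claim, authority_key.sign(claim)

def verify_credential(claim: bytes, signature: bytes) -> dict:
    """Node side: accept the credential only if the cluster authority signed it."""
    try:
        authority_pub.verify(signature, claim)
    except InvalidSignature:
        raise PermissionError("credential not issued by the cluster authority")
    return json.loads(claim)

claim, sig = issue_credential("appuser", "db_worker")
print(verify_credential(claim, sig))   # {'user': 'appuser', 'class': 'db_worker'}
```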

4.2 Distributed Access Control

Distributed access control presents a unified and consistent implementation of access control in the cluster, providing the platform with mechanisms to control access to system resources uniformly throughout the cluster. Standard Unix access control is based on Discretionary Access Control (DAC) for users. DAC means that users are in charge of defining permissions for the different objects which belong to them. This approach requires correctly setting permissions for different services for each user throughout the entire cluster: file permissions, network service configuration, and so on must be set correctly for each user. More importantly, DAC provides no protection against flaws in the system software or malicious software installed on the system. There is also the legacy problem of services running as privileged users with coarse-grained privilege control. Recently, many alternatives have been developed: capabilities, access control lists (ACLs), sandboxing (BSD jails, chroot), and so on.

In contrast to DAC, Mandatory Access Control (MAC) is based on security policies and attributes defined by an administrator for the different objects in the system. MAC may alleviate the risks related to DAC by providing for the confinement of programs as defined by the administrator in a security policy. As further detailed in [17], a process-level granularity approach based on MAC can dramatically improve the overall security of a cluster. The mechanisms used to enforce process-level access control in a cluster need improvement. To this end, SELinux introduced the concept of network security identification tags (NSIDs) [13]. The NSID is inserted into network communications between applications, providing a mechanism to validate the permissions of a given application within its current context of execution. This again raises the question of how such measures should be implemented.

The DSI project showed the feasibility of this approach by implementing a Linux kernel-level module that performs real-time security verification based on the LSM hooks [5]. In terms of usability, our experience with SELinux and DSI has shown that process-level security involves an error-prone configuration task. Such a complicated task will eventually lead to administrative mistakes or simple misconfigurations. Therefore, we believe that with process granularity, a simplified scheme based on a higher level of abstraction is key to general acceptance. In the DSI project, we developed a high level of abstraction for access control by separating network, administrative, and computational processes into different security zones. However, this approach should be further extended to all aspects of distributed security. There are also a few ongoing projects providing tools to simplify SELinux policy administration [21].

In summary, distributed access control should still be based on user-level access control, if only for legacy reasons. However, when possible, or when the cluster handles sensitive data, process-level granularity based on MAC in addition to DAC is a better solution. To use MAC efficiently over a cluster, we should then extend MAC to the entire cluster. Some research projects show the feasibility of this approach without major performance impact [5].
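The layering described above, in which a MAC policy defined by the administrator confines processes no matter how permissive the DAC settings are, can be illustrated with the following minimal sketch. The zone names, resource classes, and rule table are hypothetical.

```python
# Administrator-defined MAC policy: which security zone may perform which
# operations on which resource class, regardless of Unix ownership (DAC).
MAC_POLICY = {
    ("network_zone", "payload_data"): {"read"},
    ("compute_zone", "payload_data"): {"read", "write"},
    ("admin_zone",   "config_files"): {"read", "write"},
}

def dac_allows(owner: str, requester: str, mode: str, perms: frozenset) -> bool:
    """Crude stand-in for a DAC check: owners get every permission in the set, others read only."""
    return mode in perms if owner == requester else (mode == "read" and "read" in perms)

def mac_allows(zone: str, resource_class: str, mode: str) -> bool:
    """MAC check: confinement defined by the administrator, not by the resource owner."""
    return mode in MAC_POLICY.get((zone, resource_class), set())

def access(requester, zone, owner, resource_class, mode, perms=frozenset({"read", "write"})):
    # Both checks must pass; a permissive DAC setting cannot override the MAC policy.
    return dac_allows(owner, requester, mode, perms) and mac_allows(zone, resource_class, mode)

# DAC would let the owner write, but the MAC policy confines the network zone to reads.
print(access("alice", "network_zone", "alice", "payload_data", "write"))  # False
print(access("alice", "compute_zone", "alice", "payload_data", "write"))  # True
```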

4.3 Distributed Monitoring

One of the major challenges in protecting a cluster is monitoring a set of distributed resources. Monitoring an entire cluster environment involves examining the state of several cluster resources, including authentication mechanisms (Section 4.1), access control mechanisms (Section 4.2), activity on individual cluster nodes, software configuration across the cluster, network traffic (both internal and external to the cluster), and user behavior. Due to the inherently decentralized architecture of clusters, monitoring these various entities typically involves creating some type of unified view of the status of all cluster resources.

To create a unified view of the status of cluster resources, messages must be sent between nodes. Often, these messages are sent to a designated management node that is heavily protected and accessed only by the system administrator. This management node collates independent status reports from cluster resources and synthesizes a cohesive view of the entire cluster system. Two cluster-specific monitoring projects that employ such an architecture are Clumon [1] and Ganglia [11]. At a more basic level, however, each cluster node may simply be configured to send all Unix syslog messages to the management node, which collates messages from all cluster resources into a single log file.

From another perspective, a cluster may be treated as a black box by observing only the network traffic entering and leaving the cluster, but not the actual activity taking place within the cluster itself. For example, one may examine the status of smart hubs, routers, or network Intrusion Detection System sensors to analyze the activity of the cluster. The idea here is that most malicious activity involving a cluster must at some point pass between the (external) attacker and the cluster, and can be observed at that point.

Finally, clusters typically contain a number of resources that can be leveraged to determine an overall view of cluster activity. One of the most obvious places to obtain information about the state of a cluster is the cluster's batch job scheduler. This technique is particularly powerful when coupled with other monitoring techniques, because information about cluster batch jobs can be correlated with other information to obtain a richer view of activity.

Each of these monitoring approaches presents information from across the cluster. While these approaches provide integrity checks and enhanced operation, their distributed nature makes them vulnerable to attacks, since they involve some type of message passing. Encryption, properly implemented, can solve many of these problems but not all. Monitoring protocols should be formally verified using standard implementations. Monitors must be careful not to do more harm than good by themselves being vulnerable to attack and subversion.
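The cluster-aware check mentioned in the introduction, correlating observed network activity with the batch scheduler's allocations, might look roughly like the sketch below. The allocated_nodes() helper and the connection-report format are hypothetical stand-ins for a real scheduler query and a per-node reporting agent.

```python
from typing import Iterable

def allocated_nodes() -> set[str]:
    """Hypothetical scheduler query: nodes that currently have a batch job assigned."""
    return {"compute-03", "compute-07"}

def flag_unexpected_traffic(connection_reports: Iterable[tuple[str, str, int]]) -> list[str]:
    """Each report is (node, remote_address, port). A compute node with no scheduled
    job should normally not be talking on the network at all."""
    busy = allocated_nodes()
    alerts = []
    for node, remote, port in connection_reports:
        if node.startswith("compute-") and node not in busy:
            alerts.append(f"{node}: unexpected connection to {remote}:{port} with no job scheduled")
    return alerts

reports = [("compute-03", "storage-01", 2049),    # expected: this node has a job
           ("compute-12", "198.51.100.7", 6667)]  # suspicious: idle node talking to the outside
for alert in flag_unexpected_traffic(reports):
    print(alert)
```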

4.4 Distributed Secure Communications

Protecting the integrity of cluster communications is important. In this context, we include both communications related to the computational goals of the cluster (e.g., the communications of parallel and distributed applications running on the cluster) and communications related to managing and monitoring the cluster. Further, because protecting communications often incurs a high performance cost (e.g., the cost of encryption), it is sometimes useful to differentiate between intra-cluster and inter-cluster communication. In the case of intra-cluster communication, where messages stay completely within the boundary of the cluster's System Area Network, an optimization may be to assume that this network is less prone to attack and to simply allow these communication operations to remain unsecured. This is in contrast to inter-cluster communication, where messages likely travel over a public network well outside the control of cluster administrators, suggesting that securing such communication is important. In either case, having the granularity to determine which communication operations are secured and which are not is useful. Various well-known and usable solutions, including IPsec and SSL/TLS, exist and can address these issues effectively.
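One way to act on the intra- versus inter-cluster distinction is to decide at connection setup whether to wrap the channel in TLS, as in the sketch below. The internal subnet and CA bundle path are assumptions; a production deployment would more likely rely on IPsec policies or mutually authenticated TLS rather than this simplified logic.

```python
import socket
import ssl
from ipaddress import ip_address, ip_network

SYSTEM_AREA_NET = ip_network("10.10.0.0/16")   # assumed internal cluster subnet
CLUSTER_CA = "/etc/cluster/ca.pem"             # assumed CA bundle for inter-cluster peers

def open_channel(host: str, port: int) -> socket.socket:
    """Return a plain socket for intra-cluster peers and a TLS-wrapped socket otherwise."""
    sock = socket.create_connection((host, port), timeout=5)
    if ip_address(socket.gethostbyname(host)) in SYSTEM_AREA_NET:
        return sock                             # trusted System Area Network: leave unencrypted
    ctx = ssl.create_default_context(cafile=CLUSTER_CA)
    return ctx.wrap_socket(sock, server_hostname=host)

# Example usage (requires reachable peers, so left commented out):
# chan = open_channel("compute-07.cluster.local", 5000)   # plain, stays on the SAN
# chan = open_channel("peer-cluster.example.org", 5000)   # TLS, crosses a public network
```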

There are several research projects showing the feasibility of implementing some of the above services in a cluster. In the following sections, we detail two existing research projects: DSI in Section 5 and NVisionCC in Section 6.

5. Distributed Security Infrastructure (DSI)

The DSI project [5] targets the distributed access control service. DSI began as a research project to support different security mechanisms addressing the needs of telecommunications applications running on carrier-class Linux clusters. For the time being, DSI provides distributed mechanisms for access control, security management, and authentication.

The Distributed Security Infrastructure contains one security server (SS) and a security manager (SM) on each of the remaining cluster nodes. The SS is responsible for the distributed security management of the cluster. It propagates the security policy and communicates via alarms and messages with the SMs on the nodes. Communication is done over the Secure Communication Channel (SCC); SCC communications are encrypted using SSL/TLS over CORBA.

The versatility of DSI lies in the fine-grained control that the SMs can enforce on each node. Various structures in the kernel, such as sockets and processes, can be assigned a security context identifier (ScID). ScIDs are global over the cluster and persistent. ScIDs are meant to group together processes that have the same security context; thus, contrary to PIDs, ScIDs do not uniquely identify processes but security contexts. Similarly, each node is assigned a security node identifier (SnID). Hence, the distributed security policy (DSP) consists of a list of rules to be applied to (SnID, ScID) pairs.

For security mechanisms to be effective, users should not be able to bypass them. Hence, the best place to enforce security is at the kernel level. Therefore, when necessary, all security decisions are implemented at the kernel level, in the DSI Security Module (DSM). The DSM is a set of kernel functions enforcing the distributed security policy, implemented using LSM [23] as a Linux kernel module. As future work, in order to use mainstream Linux tools, we are considering using SELinux instead of our internally developed DSM kernel module.

As discussed in Section 4.1, there is a need for compartmentalization in large distributed applications. In order to compartmentalize large applications, DSI uses ScIDs to implement different virtual security zones. These security zones are defined with process-level granularity across the entire cluster. They are based on the process type and the node on which the process is executing. A process instance can belong to different security zones; for example, instances of the same process type can be placed in different security zones depending on which cluster node they are running on. ScIDs do not identify different instances of a process type, but rather define the security zone they belong to.

The security rules are defined in a central security policy file, the Distributed Security Policy (DSP). They define the possible interactions between different security zones in the entire cluster. The DSP file can be used by the administrator to define a homogeneous view of the cluster. This is particularly convenient for carrier-class clusters, which do not run a wide range of applications; this makes it possible to predefine the interactions between different zones. This flexible mechanism can be used to confine untrusted software or, in extreme cases, to run it inside a sandbox. DSP changes are automatically propagated to all nodes of the cluster. The security managers are in charge of communicating these new rules to the local DSM, providing a dynamic evolution of the security behavior of the cluster. A more detailed presentation of DSI can be found in [16, 5].
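In highly simplified form, the DSP rules that DSI propagates can be thought of as a table keyed on source and target (SnID, ScID) pairs, as in the sketch below. The rule contents here are invented for illustration and do not reproduce DSI's actual policy format.

```python
# Simplified DSP: each rule allows a set of operations between a source
# (SnID, ScID) pair and a target (SnID, ScID) pair. ScIDs name security
# zones, not individual processes; SnIDs name nodes.
DSP_RULES = {
    (("node1", "SCID_NETWORK"), ("node2", "SCID_COMPUTE")): {"send_request"},
    (("node2", "SCID_COMPUTE"), ("node1", "SCID_NETWORK")): {"send_reply"},
}

def dsm_decision(src_snid, src_scid, tgt_snid, tgt_scid, operation) -> bool:
    """Decision a kernel-level module such as the DSM would make before allowing an interaction."""
    allowed = DSP_RULES.get(((src_snid, src_scid), (tgt_snid, tgt_scid)), set())
    return operation in allowed

# The same process type can fall into different zones depending on its node,
# so the ScID is looked up per (node, process type), not per PID.
print(dsm_decision("node1", "SCID_NETWORK", "node2", "SCID_COMPUTE", "send_request"))  # True
print(dsm_decision("node1", "SCID_NETWORK", "node2", "SCID_COMPUTE", "exec"))          # False
```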

6. Scalable Cluster Security Monitoring

The trend in cluster computing, particularly in High-Performance Computing settings where large computational problems require huge amounts of processing power, is toward larger clusters with increasing node counts. Further, decreasing per-node costs have accelerated this trend. As the average size of clusters grows, however, security monitoring techniques that may have worked well for smaller clusters are often no longer effective. There are three primary reasons why traditional techniques do not scale:

1. Security management tools are predominantly command-line interfaces designed for monitoring a small number of entities.

2. Human cognitive abilities to perceive, understand, decide, and react do not scale in the same way as cluster size, speed, and functionality.

3. Existing security management tools are designed for enterprise environments. Currently only one security management tool, NVisionCC, has been specifically designed for the unique cluster environment [24].

As clusters grow in size, the combinatorics resulting from monitoring an increasing number of statistics about the cluster nodes, the aggregate processes running across all nodes, the installed software packages, file system and network I/O, and so on, become intractable without a new security monitoring paradigm specifically designed for clusters.

NVisionCC is a security monitoring tool specifically designed to address the challenges presented by large clusters in High-Performance Computing environments. NVisionCC implements a new security monitoring paradigm based on the observation that although the average number of nodes in clusters is increasing, these nodes can be divided into distinct classes that exhibit relatively homogeneous behavior for each node in a given class. For example, login nodes, compute nodes, storage nodes, and management nodes are all common classes of nodes found in most clusters. All cluster nodes within a class typically have similar operational characteristics, such as the list of expected processes, installed software, network traffic patterns and port activity, and user behavior [25]. This allows a profile of a given node's steady-state behavior to be created, a feature that is feasible in the unique HPC cluster environment (as opposed to an enterprise network environment) because a dedicated cluster represents a constrained set of circumscribed activities and states that can be enumerated. Thus, the problem of monitoring the security of a cluster of hundreds or thousands of nodes is reduced to the much more feasible problem of scanning each cluster node for deviations from its expected behavior profile. For example, instead of monitoring hundreds or thousands of compute nodes in a cluster independently, a single profile can be configured for all compute nodes which defines the list of expected processes, installed software, network traffic patterns and port activity, and user behavior. In this scalable way, hundreds or thousands of compute nodes can be compared to this one profile and analyzed for unexpected activity.

While the scalable processing of potential security events is a focus, ultimately it is the scalable communication of security information to the human operator that may represent the most difficult security challenge. Leveraging the successful Clumon cluster management GUI developed at the National Center for Supercomputing Applications, NVisionCC provides a visualization framework that presents information to a human operator with the following characteristics:

1. All nodes within an entire cluster are shown on a single screen. Nodes are shown as adjacent in space, not stacked in time or scrolled at the bottom.

2. An overview of the entire cluster is given along with the ability to "drill down" to areas of interest, revealing raw data details at the individual node level.

3. The smallest effective differences in shape and color convey information.

4. Different icons show different levels of process security status: critical, bad, suspicious, and normal [24].

While the NVisionCC profiling approach provides dramatically increased security monitoring scalability, there is a limit to this approach (security monitor processing power versus cluster node count). NVisionCC relies on polling an agent, Performance Co-Pilot (PCP) [14], running on each cluster node. As the number of cluster nodes increases beyond a certain point, or as the aggregate node activity increases beyond a certain level, this polling approach no longer scales. The exact scalability breaking point for the polling approach is an open question currently being investigated.

An alternative to polling is an interrupt-based approach that sends only significant change events to the central monitoring process for analysis. The advantage of this model is that security events are analyzed in a more timely fashion, as they occur, rather than at discrete polling intervals. The significant drawback of this approach is that current operating systems are not directly instrumented for such monitoring, and implementing it would entail deploying loadable kernel modules onto each cluster node. These modules must be extensively tested before being accepted into a production environment, and must also typically be upgraded with each kernel revision. Further, these modules may lead to unacceptable decreases in the processor cycles delivered to application software running on cluster nodes.
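The class-profile comparison described above can be illustrated with the following sketch, in which every node of a class is checked against a single profile of expected processes and listening ports. The profile contents and data-gathering interface are invented; the real tool obtains its observations by polling Performance Co-Pilot agents.

```python
# One profile per node class; hundreds of compute nodes are all checked against the same entry.
CLASS_PROFILES = {
    "compute": {"processes": {"sshd", "pbs_mom", "mpirun"}, "ports": {22}},
    "login":   {"processes": {"sshd", "bash", "scp"},       "ports": {22}},
}

def deviations(node_class: str, observed_procs: set[str], observed_ports: set[int]) -> list[str]:
    """Report anything running or listening that the class profile does not expect."""
    profile = CLASS_PROFILES[node_class]
    issues = [f"unexpected process: {p}" for p in observed_procs - profile["processes"]]
    issues += [f"unexpected open port: {p}" for p in observed_ports - profile["ports"]]
    return issues

# A compute node running an IRC bouncer and listening on a high port stands out immediately.
print(deviations("compute", {"sshd", "pbs_mom", "psybnc"}, {22, 31337}))
```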

7. Deployment Issues

Deploying clusters outside the HPC environment and into the mainstream has been slower than it should be due to installation and management challenges; clusters are not trivial to set up and manage. Cluster installation packages like OSCAR [4] and ROCKS [18] are making cluster installation easier. Of further help is the fact that a large percentage of system software is common within a particular cluster node class (login node, compute node, storage node, management node, etc.). However, while clusters may start out as homogeneous within a node class, this homogeneity can quickly erode as hardware and software are added and replaced. After installation, cluster management consists of performance tuning by benchmarking and configuring CPU capabilities, memory subsystems, I/O subsystems, and compiler options. The time and learning investment in properly tuning a cluster can be substantial. Unified policies and centralized management can help by lowering barriers to cluster deployment: the so-called "rule of one", one system administrator with one plan, one set of user policies, and one help desk for handling questions and problems, especially those related to security management.

A cluster implementation is typically not performed by the cluster security software developers, especially in the carrier-class commercial case, but coordination and training between developers and implementers is important, since even a carefully designed cluster security system is worthless if not properly deployed. Some issues that need to be coordinated include the expected load range on the cluster, in order to tune security monitor performance and minimize performance hits; the expected number of users, for tuning the cluster authentication system; and the expected storage behavior (interactive or write-once) and storage size, to tune the cluster security access policy. One way to handle this is to define attributes in a configuration file where values can be set, as sketched below. Reflecting the contrast between the approaches presented in this paper, these attribute values will be very different for carrier-class clusters versus general-purpose HPC clusters.
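As a purely illustrative example of such a configuration, the attribute sets below contrast plausible values for the two ends of the spectrum discussed in this paper; the attribute names and values are assumptions, not part of any shipped tool.

```python
# Illustrative deployment attributes handed from implementers to the security subsystem.
CARRIER_CLASS = {
    "expected_load": "steady, near real-time",
    "max_users": 50,                    # few operators, pre-defined applications
    "storage_policy": "write-once",
    "monitor_poll_interval_s": 1,       # tight monitoring, lock-down posture
    "mac_enforcement": "enforcing",
}

HPC_GENERAL = {
    "expected_load": "bursty batch jobs",
    "max_users": 5000,                  # large, dynamic user constituency
    "storage_policy": "interactive",
    "monitor_poll_interval_s": 60,      # profile-based monitoring at coarser intervals
    "mac_enforcement": "permissive",
}
```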

8. Summary

In this paper we have presented complementary security approaches for clusters across a range from carrier-class clusters to High-Performance Computing clusters. At the carrier-class end of the spectrum, clusters must be locked down to the maximum extent, with an emphasis on production reliability. The corresponding security approach we present for this environment is a unified security model including distributed authentication and distributed access control. At the HPC end of the spectrum, clusters must be flexible enough to handle a dynamic user constituency executing a wide range of applications. The corresponding security approach we present for this environment focuses on multi-dimensional detection designed to enable real-time cluster security management that adapts quickly and automatically to changing situations. NVisionCC is the first security intrusion detection system specifically designed for the unique HPC cluster environment.

Issues specific to cluster security have traditionally not been studied extensively, and it is our hope that this paper sparks discussion. There is much more work to be done in areas such as scalable cluster monitoring, intuitive human interfaces to security tools, interconnect security, masquerade detection, and flexible protection that evolves with incremental cluster growth.

References

[1] Clumon - The Cluster Monitoring System. http://clumon.ncsa.uiuc.edu/.

[2] CORBA Security Service Specification, Object Management Group, version 1.8, March 2002.

[3] CSI/FBI Computer Crime and Security Survey, Computer Security Institute, 2004.

[4] B. des Ligneris, S. Scott, T. Naughton, and N. Gorsuch, Open Source Cluster Application Resources (OSCAR): Design, Implementation, and Interest for the [Computer] Scientific Community, First OSCAR Symposium, 2003.

[5] Distributed Security Infrastructure Open Source Project. http://disec.sourceforge.net.

[6] B. Hartman, D. Flinn, and K. Beznosov, Enterprise Security with EJB and CORBA, Wiley, 2001.

[7] J. Kohl and C. Neuman, The Kerberos Network Authentication Service (V5), IETF RFC 1510, September 1993. http://cryptnet.net/mirrors/rfcs/rfc1510.txt.

[8] B. Krebs, Hackers Strike Advanced Computing Networks, Washington Post, April 2004.

[9] B. LaMacchia, S. Lange, M. Lyons, R. Martin, and K. Price, .NET Framework Security, Pearson Education, 2002.

[10] U. Lang, Access Policies for Middleware, University of Cambridge Technical Report UCAM-CL-TR-564, May 2003.

[11] M. L. Massie, B. N. Chun, and D. E. Culler, The Ganglia Distributed Monitoring System: Design, Implementation, and Experience, Parallel Computing, Vol. 30, Issue 7, 2004.

[12] Multiple Unix Compromises on Campus, Stanford ITSS Security Alert, April 2004. http://securecomputing.stanford.edu/alerts/multiple-unix-6apr2004.html.

[13] Network Packet Labeling. http://www.nsa.gov/selinux/papers/module/x2794.html.

[14] Performance Co-Pilot. http://oss.sgi.com/projects/pcp/.

[15] T. Perrine and D. Kowatch, Teracrack: Password Cracking Using TeraFLOP and Petabyte Resources, San Diego Supercomputer Center Security Group Technical Report, 2003. http://security.sdsc.edu/publications/teracrack.pdf.

[16] M. Pourzandi, I. Haddad, C. Levert, M. Zakrzewski, and M. Dagenais, A Distributed Security Infrastructure for Carrier Class Linux, Fourth Annual Ottawa Linux Symposium, 2002.

[17] M. Pourzandi, A New Distributed Security Model for Linux Clusters, USENIX, 2004.

[18] ROCKS Cluster Distribution. http://www.rocksclusters.org/Rocks/.

[19] Secure Networking Using Windows 2000: Distributed Security Services, Microsoft White Paper, 1999.

[20] T. Seki, OGSA Introductory Session by IBM, Framework for Commercial Grids, IBM Japan, 2002. www.gridforumkorea.org/workshop/2002/2002_winter/01-Tutorial1-TakanoriSeki(IBM).pdf.

[21] SELinux Policy Tools. http://www.tresys.com/selinux/selinux_policy_tools.html.

[22] Service Availability Forum. http://www.saforum.org/home.

[23] C. Wright, C. Cowan, S. Smalley, J. Morris, and G. Kroah-Hartman, Linux Security Modules: General Security Support for the Linux Kernel, USENIX Security Symposium, 2002. http://lsm.immunix.org.

[24] W. Yurcik, X. Meng, and N. Kiyanclar, NVisionCC: A Visualization Framework for High Performance Cluster Security, ACM CCS Workshop on Visualization and Data Mining for Computer Security (VizSEC/DMSEC), 2004.

[25] W. Yurcik, G. A. Koenig, X. Meng, and J. Greenseid, Cluster Security as a Unique Problem with Emergent Properties: Issues and Techniques, 5th LCI International Conference on Linux Clusters, 2004.
