Chapter 16
Security in Data Intensive Computing Systems
Eduardo B. Fernandez
Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
e-mail: [email protected]
1 Introduction
Many applications, e.g., scientific computing, weather prediction, and medical image processing, require the manipulation of large amounts of data [6]. Analysis of web traffic, sales, travel, and all kinds of human activities can bring valuable insights for business and science [27]. Until now, this work has been done on large multiprocessors in the computing centers of large institutions, whose increasing power allows more aspects to be analyzed in greater detail [29]. Recently, the cloud has brought the possibility of processing and storing large amounts of data at relatively low cost and from anywhere in the world. However, this wide accessibility increases the vulnerability of these systems, and the emphasis on fast processing often leads to sacrificing security. We survey here the security implications of data intensive applications in these new environments. A more specific discussion, considering only clouds, is given in [35]. Data is a valuable resource, and many attacks are reported every day. We don't want the data handled by institutions to be seen by those who could misuse it: enterprises want their information hidden from competitors, and patients don't want their medical records seen by unauthorized people. We don't want that information changed either. Data and other resources are considered assets; security is the protection of assets, including institutional and individual information [1, 12, 17]. We have to accept the fact that there are people who intentionally try to misuse information, either for their own gain, to make a point, or for the sake of doing damage.
Some of these actions use viruses and similar attacks to corrupt information; others aim to access or modify information. Attacks can be external or can come from employees looking for revenge or monetary gain. Clearly, all the usual aspects of security apply to data intensive systems; we concentrate here only on those aspects peculiar to them, i.e., those due to the nature of these applications and their corresponding platforms. In particular, we leave out the privacy aspects of data mining (see [45] in this book). Also, we concentrate on recent developments, since much has already been written about these subjects. We first provide a short introduction to security for those readers who are not experts on this topic. This is followed by a brief discussion of privacy. The next section discusses web architectures. We then look at a few data-intensive applications, including a discussion of their security requirements. This is followed by a look at the platform architectures used to execute these applications, to see what security they provide. We end with some conclusions, where we summarize the situation and propose some actions for the future.
2 Security
Security is the protection against:
• Unauthorized data disclosure (confidentiality or secrecy) – Loss of secrecy may bring economic losses, privacy violations, and other types of damage.
• Unauthorized data modification (integrity) – Unauthorized modification of data may result in inconsistent or erroneous data. Data destruction may bring all kinds of losses.
• Denial of service – Attackers or other systems may prevent legitimate users from using their system. Denial of service is an attack on the availability of the system.
• Lack of accountability – Users should be responsible for their actions and should not be able to deny what they have done (non-repudiation).
This list defines the objectives of a secure system: to avoid or mitigate the effects of these attacks or misuses of information. These aspects, sometimes called security goals or properties, try to anticipate the possible risks to the system, that is, the types of attacks against which we have to be prepared. Different applications may emphasize some of these aspects over others. For example, the military worry most about confidentiality, a financial institution may emphasize accountability and integrity, and homeland security systems usually emphasize availability. The generic attacks mentioned above are realized in two basic ways: by direct attacks from a person trying to exploit a vulnerability or flaw in the system, or by using malware, software that contains code that exploits one or more of these flaws. Sometimes a combination of the two is used, and some attacks come from insiders. We need to find defenses or countermeasures against these attacks.
The definition of security above describes security as defense against some types of attacks. The generic types of defenses that we can use include:
• Identification and Authentication (I&A) – Identification is a user or system action in which the user provides an identity. Authentication implies some proof that a user or system is the one it claims to be. The result of authentication may be a set of credentials, which can later be used to prove identity and may describe some attributes of the authenticated entity.
• Authorization and Access Control (A&A) – Authorization defines permitted access to resources depending on the accessor (user, executing process), the resource being accessed, and the intended use of the resource. Access control implies some mechanism to enforce authorization.
• Logging and Auditing – These functions imply keeping a record (log) of security-relevant actions and analyzing it later. They can be used to collect evidence for prosecution (forensics) and to improve the system by analyzing why an attack succeeded.
• Hiding of information – This is usually performed using cryptography, but steganography is another option. The idea is to hide the information in order to protect it.
• Intrusion detection – Intrusion Detection Systems (IDS) alert operators or users in real time when an intruder is trying to attack the system.
Before studying possible defenses we need to know the types of attacks we will receive and why they can succeed. A vulnerability is a state or condition that may be exploited by an attack. People study source code and find flaws that can lead to possible attacks (threats). An attack is an attempt to take advantage of a vulnerability, and it may result in a misuse, a violation of security. An actual attack is an incident. The outcome of a misuse can be loss of confidentiality or integrity, theft of services, denial of service, or repudiation of some action. An attack has a perpetrator, who has a motivation (goal) and possibly the appropriate skill. This motivation may be monetary gain (lucre), a need to show off skills, or a political or religious cause. The attack has a method of operation (modus operandi) to accomplish a mission with respect to a target. The damage of a mission can be loss of assets, money, or even lives. Vulnerabilities can be the result of design or implementation errors, although not every program error is a vulnerability. Design vulnerabilities are very hard to correct (one needs to redesign and reimplement the system), while implementation vulnerabilities can be corrected through patches. Examples of design errors include lack of checks, lack of modularity, and not using appropriate defenses. Examples of bad implementation include not checking size bounds for data structures and giving too much privilege to a process. Deployment, configuration, administration, and user errors are also sources of vulnerabilities. In a direct attack, the attacker takes advantage of a vulnerability to gain access to information. A typical attack is SQL injection, where the attacker inserts a database query in an input form; an example is shown below. Direct attacks are often prepared through malicious code.
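To make SQL injection concrete, here is a minimal sketch in Java/JDBC (the table, columns, and connection are hypothetical) contrasting a query built by string concatenation, which is injectable, with a parameterized query, which treats the input strictly as data:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class LoginCheck {

    // VULNERABLE: entering  ' OR '1'='1  as the name turns the WHERE
    // clause into a tautology, returning every row in the table.
    public ResultSet findUserUnsafe(Connection conn, String name)
            throws SQLException {
        Statement stmt = conn.createStatement();
        return stmt.executeQuery(
            "SELECT id, role FROM users WHERE name = '" + name + "'");
    }

    // SAFE: a PreparedStatement sends the query text and the user input
    // separately, so the input can never be parsed as SQL.
    public ResultSet findUserSafe(Connection conn, String name)
            throws SQLException {
        PreparedStatement stmt = conn.prepareStatement(
            "SELECT id, role FROM users WHERE name = ?");
        stmt.setString(1, name);
        return stmt.executeQuery();
    }
}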
The objective of an attack is a misuse of the information to reach some goal of the attacker. An external attacker may try to find a way to transfer the list of customers of an institution; an insider can take advantage of incomplete authorization to modify a patient's prescription; an external attacker can infiltrate many computers and convert them into zombies to be used later for a denial-of-service attack; a legitimate customer of a financial institution may deny having given an order to buy some type of stock. A variety of programs have been written with the purpose of preparing or performing a misuse; they constitute what is called malicious software, or malware. These include worms, viruses, and Trojan horses; other varieties of malicious code include logic bombs, rootkits, etc. [12]. To secure a system, we first need to enumerate its possible threats. The misuse activities approach [5] is a systematic way to identify system threats and determine policies to stop and/or mitigate their effects. It is applied in two stages. The first stage is an analysis of the flow of events in a use case or a group of use cases, in which each activity is analyzed to uncover related threats; this analysis should be performed for all the system use cases. The second stage is the selection of appropriate security policies that can stop and/or mitigate the identified threats. Realization of the policies leads to security mechanisms.
3 Privacy
Privacy is the right of individuals to decide about the use of information about themselves. This right is recognized by all civilized societies and is considered a fundamental human right. While there are laws to protect privacy against government intrusion, in the US there is little to protect people against businesses collecting private information, beyond the voluntary decisions of those businesses. Obviously, privacy laws mean nothing without ways to enforce their restrictions in the institutions that handle individuals' data. The approaches of these institutions to data security are therefore fundamental to any protection of privacy. Without effective authentication and access control systems, privacy is left to the good will of employees and other individuals. It should be noted that attempts to make information more readily available for improved use, as well as the large amounts of information now collected, may have a negative effect on privacy. For example, in the US the Health Insurance Portability and Accountability Act (HIPAA) regulations require the computerization and exchange of medical records over the Internet to improve service to patients. However, this easier availability also makes the data easier for unauthorized parties to access. Unless there is a corresponding investment in security, privacy will be negatively affected. Social networks are another source of privacy violations.
4 Web Architectures
There is a tradeoff here between speed and security. Two commonly used protocols illustrate it.
SOAP is a protocol for exchanging structured information in the implementation of web services. It relies on the Extensible Markup Language (XML) for its message format and usually relies on other application protocols, most notably Remote Procedure Call (RPC) and HTTP, for message negotiation and transmission. SOAP provides a basic messaging framework that allows web services to communicate.
Representational State Transfer (REST) is an architectural style for web systems. REST-style architectures consist of clients and servers: clients initiate requests to servers; servers process requests and return appropriate responses. Requests and responses are built around the transfer of representations of resources. A resource can be essentially any meaningful entity that may be addressed; a representation of a resource is typically a document that captures the current or intended state of that resource. Most of the web functionality on the Internet now uses REST: Twitter and Yahoo's web services use REST, and others include Flickr, del.icio.us, pubsub, Bloglines, and Technorati. eBay and Amazon offer web services for both REST and SOAP. SOAP is mostly used in enterprise applications to integrate a wide variety and number of applications and to integrate with legacy systems. The emphasis of REST is speed, and it has almost no provisions for security; SOAP and web services provide a rich set of security standards, at the expense of adding complexity and losing speed.
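In practice, securing a REST interaction therefore means borrowing protection from the layers around it: TLS for the channel and an HTTP header for the credential, with nothing protecting the message itself once it leaves the encrypted channel. A minimal sketch in Java (the endpoint and token are hypothetical):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestCall {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // All the security here is external to the REST style itself:
        // HTTPS encrypts the channel, and a bearer token (hypothetical)
        // identifies the caller.
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://api.example.com/resources/42"))
            .header("Authorization", "Bearer <token>")
            .GET()
            .build();

        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}

By contrast, WS-Security signs and encrypts parts of the SOAP message itself, so protection survives intermediaries, at the cost of the complexity mentioned above.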
4.1 Service-Oriented Architectures (SOA)
SOA is an architectural style in which a system is composed of a set of loosely coupled services that interact with each other by sending messages. In order to interoperate, each service publishes its description, which defines its interface and expresses the constraints and policies that must be respected in order to interact with it. A service (or set of services) represents a business activity and is thus a building block for service-oriented applications. Applications are built by coordinating and assembling services. A key principle is that services should be easily reusable and discoverable, even in an inter-organizational context. Furthermore, the channels of communication between the participating entities in a service-oriented application are much more vulnerable than those within operating systems or within the boundaries of an organization's intranet, since they are established on public networks. Web services (WS) are the most popular approach to implementing SOA; they define an abstract framework, called the Web services platform, which is composed of several parts.
Many of these parts address a particular aspect of the common SOA and are defined by standards organizations and implemented on proprietary technology platforms. These organizations have defined a rich set of security standards for web services interaction. We have represented most of those standards as patterns [10, 13].
4.2 Grid Computing
A computing grid is typically heterogeneous in nature (nodes can have different processor, memory, and disk resources) and consists of multiple disparate computers distributed across organizations, often geographically, using wide-area network communications, usually with relatively low bandwidth. Grids are typically used to solve complex computational problems which are compute-intensive, requiring only small amounts of data for each processing node. A variation known as data grids allows shared repositories of data to be accessed by a grid and utilized in application processing; however, the low bandwidth of data grids limits their effectiveness for large-scale data-intensive applications. In contrast, data-intensive computing systems are typically homogeneous in nature (nodes in the computing cluster have identical processor, memory, and disk resources), use high-bandwidth communications between nodes, such as gigabit Ethernet switches, and are located in close proximity in a data center using high-density hardware. Grid computing is a distributed computing environment composed of resources from different administrative domains. Grids focus on integrating existing resources with their hardware, operating systems, local resource management, and security infrastructure [16]. Some grid applications for data intensive computing are shown in [44]. GDIA, built on top of a meta-scheduler and a data grid, is a scalable grid infrastructure for data intensive applications that applies security controls [25]. Hameurlain et al. [21] indicate that security is one of the challenges of grid computing for data intensive systems, but no solutions are discussed.
4.3 Cloud Computing
Cloud computing is the latest computational paradigm, and one of its main uses is data intensive computing. Clouds consist of three layers: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). Figure 16.1 shows these layers. In Software as a Service the user rents software from the cloud provider instead of buying it. The software is hosted in a data center managed by the provider, and it is available over the Internet.
Fig. 16.1 Cloud computing layers
Typical examples of SaaS include word processors, spreadsheets, and database applications; the Salesforce Customer Relationship Management (CRM) system and Google Apps are two examples of SaaS.
Platform as a Service offers a development environment in which developers can run their applications. A programming platform is offered to the user (the developer), providing complete software lifecycle management, from planning and design to building, testing, and maintenance. Two good examples are Google App Engine and Force.com. Google App Engine offers APIs for storage, platform, and database services; it supports applications written in different programming languages, such as Java and Python. The Force.com platform is similar, but it uses a custom programming language called Apex. It offers a cloud platform where users can develop, package, and deploy applications without any infrastructure of their own.
Infrastructure as a Service is the foundation layer that provides computing infrastructure to customers, who would otherwise have to buy servers, software, or networking components. The IaaS layer virtualizes the computing power, storage, and network connectivity of the data centers and offers them as services to customers. Amazon, GoGrid, and Google are some of the companies that offer infrastructure services. Amazon's EC2 provides resizable compute capacity in the cloud; GoGrid offers cloud computing services that enable automated provisioning of virtual and hardware infrastructure over the Internet. A discussion of cloud use for data intensive computing is given in [20].
Service Level Agreements (SLAs) are fundamental to specify the duties of the provider and the customer. They define the benefits and responsibilities of each party, and they are the only means by which the provider can gain the trust of a client. Due to the dynamic nature of the cloud, continuous monitoring of Quality of Service (QoS) attributes is necessary to enforce SLAs [36].
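As an illustration of what such monitoring can look like, the sketch below (the endpoint, latency bound, and probe interval are hypothetical) periodically measures a service's response time and flags violations of the agreed bound; a real monitor would aggregate these events into the QoS reports used to enforce the SLA:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class SlaMonitor {
    private static final long MAX_LATENCY_MS = 500;   // hypothetical SLA bound

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest probe = HttpRequest.newBuilder(
                URI.create("https://service.example.com/health"))
            .timeout(Duration.ofSeconds(5))
            .GET()
            .build();

        while (true) {
            long start = System.nanoTime();
            HttpResponse<Void> resp =
                client.send(probe, HttpResponse.BodyHandlers.discarding());
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            // Record every violation of the agreed response-time bound.
            if (resp.statusCode() != 200 || elapsedMs > MAX_LATENCY_MS) {
                System.err.printf("SLA violation: status=%d latency=%dms%n",
                                  resp.statusCode(), elapsedMs);
            }
            Thread.sleep(60_000);   // probe once a minute
        }
    }
}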
5 Data Intensive Applications
Purely data-intensive applications process multiterabyte- to petabyte-sized datasets. This data may come in several different formats and is often distributed across multiple locations. Processing these datasets typically takes place in multistep analytical pipelines that include transformation and fusion stages.
Research issues involve data management, filtering and fusion techniques, and efficient querying and distribution [18]. This study, like most others, does not mention security as an important issue. We enumerate below a few important applications and their security requirements.
5.1 Scientific Collaboration
An important use of large amounts of data is the collaborative analysis and transformation of data by groups of scientists performing experiments or analyzing their results. A similar example is the manipulation of climate model data. Researchers must assemble and analyze large datasets that are archived in different formats on disparate platforms, and must extract portions of datasets to compute statistical or diagnostic metrics in place. This leads to a common virtual environment that provides access to both climate model datasets and analysis tools, supporting not only ready access to climate data but also the use of visualization software, diagnostic algorithms, and related resources. Another application of this type is the analysis of planetary data [30–32]. Given that this data is used in scientific research, the main security concern is that the data be used only by legitimate registered scientists and their associates, which calls for strong authentication. Read-only enforcement is also important, because alterations to the data could make it useless; this calls for a basic authorization system. Logging and non-repudiation are not important in this environment; availability is important but not critical.
5.2 Computerization and Exchange of Medical Information
Medical information is one of the most sensitive types of information. Its misuse could have a very serious effect on an individual's life: leakage of information about a psychiatric treatment could ruin a career, and an incorrect change in a medical record may result in a wrong prescription, with damage to the patient. In the past this information was collected and stored at physicians' offices and hospitals, and relatively few people even knew it existed. In most instances it was not computerized and was protected by its isolation and by ignorance of its existence. All this is changing fast: most or all doctors' offices use computers, hospitals have large information systems, and a good part of this information is becoming accessible through distributed systems, including the Internet. Specialized databases containing information about prescriptions or treatments are being linked together through the Internet, and genomic databases are starting to appear. Recent plans from President Obama call for interconnecting all medical records. This means that the number of people who can potentially access information about patients has increased by orders of magnitude.
This dramatically increases the potential for misuse. There is a clear requirement here for privacy, which implies a strict need for authentication, access control, and logging. A problem is the lack of common standards for representing medical records, which makes the application of security controls much more complicated than it should be. Logging and non-repudiation enforcement are also needed to make health care providers accountable for their decisions. Availability is also very important, because a specific medical record may be needed at any time and in any place.
5.3 Social Networks
Social networks provide facilities for collaboration between people who have some common interest, belong to a similar group, or have some social connection. A social network service includes a representation of each user (a profile), his or her social links, and a variety of additional services. These networks are characterized by constant activity, and their users place large amounts of personal information in them. The large amount of information about individuals held by social networks creates a significant privacy problem. There is also a variety of security threats to social networks, and several serious incidents have already occurred. The state of security in social networks is rather primitive: although they contain mostly information about individuals, there is little protection against privacy violations, and their platforms are also easy for external attackers to penetrate. We defined in [11] the security requirements that should be imposed on these organizations to protect their users. In summary, there is a need for authentication and for some basic authorization controls. Availability is not an important aspect, although its lack may be annoying. There is hardly a need for logging and non-repudiation.
5.4 Data Warehousing
A data warehouse is by nature an open, accessible system to support managerial decision-making [24]. However, because of its sensitive information there is a need for security. A UML profile is a UML package that can extend either a metamodel or another profile; Villarroel et al. [43] showed how to incorporate security aspects into the conceptual modeling of data warehouses using UML profiles. Commercial data warehouses, however, may not have strict security controls. The need here is for enforcing read-only access, plus authorization controls for sensitive data. Logging and non-repudiation are normally not needed. Availability is important but not critical.
5.5 Financial Systems
Financial systems include stock and commodities trading, tax record analysis, financial planning, and analysis of stock data. In addition to its large volume, this data may be highly distributed and need to be collected in one place for analysis. Financial information is very sensitive. Because it refers to individuals, there is a big privacy concern; moreover, these applications handle money, which is a big incentive for attackers. We need here authentication, authorization, availability, logging, and non-repudiation controls. The last two are needed for legal reasons and because of regulations such as Sarbanes-Oxley. Other applications of this type include oil exploration, analysis of sensor network information, and spam filters. As we can see, the security requirements vary widely depending on the type of data used in the application. This means that instead of one-size-fits-all security, we need to analyze the specific threats of each application, using an approach such as [5] or a similar one. The next section discusses how the existing platforms satisfy these requirements.
6 Data Intensive Architectures
We separate complete architectures from database systems. Database systems are available through the Internet and can be used directly. A complete architecture provides facilities for storage, data filtering, and querying, together with development tools.
6.1 Database Systems
Most databases used in institutions are still relational databases, which store their data in the form of tables, where each entry (tuple) represents a record. The relational model uses several basic (relational algebra) operations to manipulate data: selection, projection, union, minus (set difference), Cartesian product, and join. There are also rules for transaction integrity (the ACID properties), and the basic operations can be used to define views that present users with a subset of the database. These views are normally used to enforce access control [8], as the sketch after this paragraph illustrates. Almost all relational databases use SQL as their data manipulation language (SQL is an ANSI and ISO standard).
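As a brief illustration of view-based access control (the table, view, role, and connection details are all hypothetical), the following JDBC snippet defines a view that exposes only selected columns and rows, and then grants a role access to the view but no privileges on the underlying table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ViewBasedAccessControl {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/hospital", "dba", "secret");
             Statement stmt = conn.createStatement()) {

            // Projection and selection: the view hides the diagnosis column
            // and shows only one department's rows.
            stmt.execute(
                "CREATE VIEW ward_patient_list AS " +
                "SELECT patient_id, name, room " +
                "FROM patients WHERE department = 'cardiology'");

            // Authorization: the clerk role may query the view only;
            // it receives no privileges on the patients base table.
            stmt.execute("GRANT SELECT ON ward_patient_list TO clerk");
        }
    }
}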
NoSQL is a new category of data management technologies that uses nonrelational database architectures. The main reason for the emergence of NoSQL systems is that typical relational databases have shown poor performance on certain data-intensive applications, including indexing a large number of documents, retrieving pages from high-traffic web sites, and delivering streaming media. NoSQL databases are often better suited to the requirements of high-performance, web-scalable systems and intensive data analysis. Organizations like Facebook, Twitter, Netflix, and Yahoo have used NoSQL solutions to gain greater scale and performance at lower cost than relational database systems [33]. Some of these systems may also have some SQL capabilities; in a broader sense, this definition includes any system that does not rely exclusively on SQL for data selection. These systems include native XML databases, graph stores, column stores, object stores, in-memory caches, and multi-dimensional OLAP cubes using MDX. Typically, NoSQL systems do not attempt to provide ACID guarantees. SQL databases have well-developed security features, including authentication, authorization, logging, and encryption [8, 12]. As we will see below, the NoSQL databases lack strong security features.
A common programming model for the new databases is MapReduce, which lets programmers use a functional programming style to create a map function that processes a key-value pair associated with the input data to generate a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key [20]. MapReduce programs can be used to compute derived data from documents, such as inverted indexes, and the processing is automatically parallelized by the system, which executes on large clusters of machines, usually scalable to thousands of machines. Since the system automatically takes care of details like partitioning the input data, scheduling and executing tasks across a processing cluster, and managing the communications between nodes, nonspecialized programmers can conveniently use a large distributed processing environment [38].
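As a minimal sketch of this model (written against Hadoop's Java MapReduce API; the job driver that sets input and output paths is omitted, and the class names are illustrative), the classic word count emits a (word, 1) pair per token in the map function and sums the pairs per word in the reduce function:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map: processes one input record (here, a line of text keyed by its
    // byte offset) and emits an intermediate (word, 1) pair per token.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // reduce: receives all intermediate values sharing one key
    // (all the 1s for one word) and merges them into a single count.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            ctx.write(word, new IntWritable(sum));
        }
    }
}

Partitioning the input, scheduling map and reduce tasks across the cluster, and routing each word's pairs to a single reducer are all done by the framework; this is exactly why nonspecialized programmers can use it, and also why any security must come from the surrounding platform rather than from the model itself.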
Used together with the MapReduce architecture is the Google File System (GFS). GFS was designed as a high-performance, scalable distributed file system for very large data files and data-intensive applications, providing fault tolerance and running on clusters of commodity hardware. Apache has produced an open-source MapReduce platform called Hadoop. Hadoop includes two software components (each administered by the open-source Apache Software Foundation): a distributed file manager known as HDFS (the Hadoop Distributed File System) and a parallel programming framework derived from MapReduce. HDFS runs on large clusters of commodity nodes, and Hadoop uses Java as its programming language. Hadoop is built and used by a global community of contributors. Yahoo has been the largest contributor to the project and uses Hadoop extensively; other users include IBM, Facebook, The New York Times, and eHarmony.
MapReduce does not seem to have any predefined security provisions [37]; it is up to the implementers to add security features. For example, Amazon Elastic MapReduce starts data instances in two Amazon EC2 security groups, one for the master and another for the slaves. The master security group has a port open for communication with the service. It also has the Unix Secure Shell (SSH) port open, allowing access to the instances via SSH using the key specified at startup. The slaves start in a separate security group, which only allows interaction with the master instance. By default, both security groups are set up to disallow access from external sources, including Amazon EC2 instances belonging to other customers. Amazon S3 also provides authentication mechanisms. Amazon Elastic MapReduce customers can also choose to send data to Amazon S3 using the HTTPS protocol for secure transmission; in addition, Amazon Elastic MapReduce always uses HTTPS to send data between Amazon S3 and Amazon EC2. For added security, customers may encrypt the input data before they upload it to Amazon S3. However, there is no full authorization system, no availability protection, no logging, and no control of repudiation.
As of version 0.19, Hadoop had security flaws that limit how data can be handled and what kind of data can be handled. First, HDFS, the Hadoop file system, has no authorization system: any authenticated user can read or write anything. Second, Hadoop authenticates a user for access control by using the output of the “whoami” command, which is not secure. Third, HBase, the “database” that Hadoop uses, has no access control at all; any job running on a Hadoop cluster can access any data on that cluster. Schlesinger [38] provides some advice, one piece of which is “don't store very sensitive data in Hadoop”, an advice not heeded by some companies, who store medical records and other sensitive information. Hadoop 0.20.2 has Unix access control lists for MapReduce; these are implemented as Java libraries, which are well known to be deficient as a way to enforce authorization [12, 17], so there is not much improvement. In the Unix security model there is no way to define access controls for arbitrary individual users; rights can be assigned only to the owner, the group, and everybody else. A typical Java class hierarchy for Hadoop security looks like:
java.lang.Object
  org.apache.hadoop.security.UserGroupInformation
    org.apache.hadoop.security.UnixUserGroupInformation
A specialized system for web applications in which the database component is the bottleneck was presented in [28]. A Database Scalability Service Provider (DSSP) caches application data and supplies query answers on behalf of the application. Cost-effective DSSPs need to cache data from many applications, which brings security concerns. That work studies the security-scalability tradeoff, both formally and empirically, by providing a method for statically identifying segments of the database that can be encrypted without impacting scalability. Another specialized system is a metadata catalog for data intensive applications [40]; this catalog supports authentication and authorization functions. Subsystems like these are useful, but they do not solve the lack of security in the main databases. Secure search of large files is discussed in [41]. In summary, other than for specialized functions, the situation does not look good here: the security of MapReduce and Hadoop cannot satisfy the requirements of applications handling, for example, medical or financial records.
6.2 Complete Architectures
A complete data handling architecture must have components to store and filter information, components to query information, and tools for developers; a few such architectures exist or are appearing. Some were intended for grid systems but could be adapted for use in the cloud. An architecture proposed for scientific work is the Virtual Data Grid [15]. In this environment data, procedures, and computations are entities that can be published, discovered, and manipulated. Its security emphasis is on authentication mechanisms. A system for planetary information, also used for medical research, is OODT [30, 32]. OODT's objectives include accessing data, discovering specific data, correlating data models, and integrating software interfaces. For these purposes it includes product, profile, and query servers, but there is no mention of any security facilities. Siebenlist et al. [39] describe the Earth System Grid (ESG), which includes an advanced authentication system. Legrand et al. [26] describe MonALISA (Monitoring Agents using a Large Integrated Services Architecture), a framework intended to control and optimize data intensive applications. It uses Java and is intended for work in High-Energy Physics (HEP). No security aspects are discussed.
6.2.1 LexisNexis HPCC
The LexisNexis approach utilizes clusters of hardware and includes custom system software and middleware components, developed and layered on a base Linux operating system, to provide the execution environment and distributed file system support required for data-intensive computing [4, 23]. Its architecture uses the following components:
• Thor (the Data Refinery Cluster) is responsible for consuming vast amounts of data and transforming, linking, and indexing that data. It functions as a distributed file system with parallel processing power spread across the nodes. A cluster can scale from a single node to thousands of nodes.
• Roxie (the Query Cluster) provides separate high-performance online query processing and data warehouse capabilities.
• ECL (Enterprise Control Language) is a programming language that manipulates the data. It is a non-procedural, dataflow-oriented language.
• There are also tools to support programmers, to interface with web services, and to apply web services security standards.
There is no explicit description of security in their web documents, so it is impossible to judge their security. Of course, for web services applications one can use standards such as SAML and XACML, but their support in the architecture is not clear. ECL has not been published in the research community, so it is not clear how convenient it is for enforcing security.
One of the nice aspects of SQL is the possibility of using queries to define views that enforce access control [8, 12]; it is not clear whether something like this is possible in ECL.
7 How to Improve Security in Current Systems?
What do we need? We need systematic ways to build secure data intensive systems. The work of Mattmann [31] and Gorton [19] was a big advance in that, before them, all these applications were concerned only with performance and provided ad hoc solutions for security; Mattmann and Gorton emphasized the need for systematic architectures. Past work on metadata catalogs is also important in this context [40]. The next step is to add security to these architectures. It is possible to do the same with some commercial offerings, e.g., HPCC. Patterns are a way to simplify software development which we have long been using in our work. A pattern is an encapsulated solution to a software problem in a given context, and a good catalog of patterns can improve the quality of software. Patterns are fundamental when dealing with complex systems, and security patterns can help apply security to complex systems [13]. Reference architectures are important to provide a context for the patterns. A reference architecture, or domain-specific architecture, is a standardized, generic architecture valid for a particular domain. It is reusable, extendable, and configurable; that is, it is a kind of pattern for whole architectures, and it can be instantiated into a specific software architecture by adding implementation-oriented aspects [3]. Architectural styles are larger patterns and can be used as building blocks; for example, [30] uses product, profile, and query services. There is also a need for access control models combining different expressions for rules, using different security models and languages. Most modern security systems use Role-Based Access Control (RBAC); there is nothing like that in the surveyed systems. These models can be used as part of a systematic methodology. We are now building a reference architecture for cloud security; we started by defining patterns that describe attacks on clouds [22]. We have proposed a systematic methodology for building secure applications [9], which can be adapted to data intensive systems. This methodology is based on patterns. A main idea in the proposed methodology is that security principles should be applied at every stage of the software lifecycle and that each stage can be tested for compliance with those principles. We also consider all the architectural levels of the system. We have considered the use of databases in the lifecycle; we only need to take into account the special aspects of very large databases.
Fig. 16.2 Secure systems design methodology
Its stages are shown in Fig. 16.2 and include:
Domain analysis stage: A business model is defined. Legacy systems are identified and their security implications analyzed. Domain and regulatory constraints are identified and used as global policies that should be enforced in all the applications derived from this model [14]. The suitability of the development team is assessed, possibly leading to added training. This phase may be performed only once for each new domain or team. The need for specialized database architectures should be determined at this point. The approach (general DBMS or application-oriented system) should also be defined at this stage.
Requirements stage: Use cases define the required interactions with the system. Each activity within a use case is analyzed to see which threats are possible. Activity diagrams indicate created objects and are a good way to determine which data should be protected [5]. Since many possible threats may be identified, risk analysis helps prune them according to their impact and probability of occurrence. Any requirements for the degree of security should be expressed as part of the use cases.
Analysis stage: Analysis patterns can be used to build the conceptual model in a more reliable and efficient way. The policies defined in the requirements can now be expressed as abstract security models, e.g., an access matrix. The model selected must correspond to the type of application; for example, multilevel models have not been successful for medical applications. One can build a conceptual model where repeated applications of a security model pattern realize the rights determined from the use cases. In fact, analysis patterns can be built with predefined authorizations according to the roles in their use cases. Patterns for authentication, logging, and secure channels are also specified at this level. Note that the model and the security patterns should define precisely the requirements of the problem, not its software solution. UML is a good semi-formal approach for defining policies, avoiding the need for ad hoc policy languages. The addition of OCL (Object Constraint Language) can make the approach more formal.
Design stage: Once the needed policies have been defined, one can select mechanisms to stop the attacks that would violate them. A specific security model, e.g., RBAC, is now implemented in terms of software units; a minimal sketch of such a unit is shown below.
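As a hint of what such a software unit can look like, here is a minimal RBAC sketch in Java (a simplification; all user, role, and permission names are hypothetical) acting as a reference monitor that the application consults before each sensitive operation:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RbacMonitor {
    // RBAC core: permissions attach to roles, never to users directly.
    private final Map<String, Set<String>> rolePermissions = new HashMap<>();
    // Users acquire rights only through the roles assigned to them.
    private final Map<String, Set<String>> userRoles = new HashMap<>();

    public void grant(String role, String permission) {
        rolePermissions.computeIfAbsent(role, r -> new HashSet<>()).add(permission);
    }

    public void assign(String user, String role) {
        userRoles.computeIfAbsent(user, u -> new HashSet<>()).add(role);
    }

    // Reference monitor check, called before every sensitive operation.
    public boolean isAuthorized(String user, String permission) {
        for (String role : userRoles.getOrDefault(user, Set.of())) {
            if (rolePermissions.getOrDefault(role, Set.of()).contains(permission)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RbacMonitor rbac = new RbacMonitor();
        rbac.grant("physician", "write:prescription");   // role -> permission
        rbac.grant("nurse", "read:prescription");
        rbac.assign("alice", "physician");                // user -> role

        System.out.println(rbac.isAuthorized("alice", "write:prescription")); // true
        System.out.println(rbac.isAuthorized("alice", "read:vitals"));        // false
    }
}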
Fig. 16.3 The operating system controls the storage of databases
User interfaces should correspond to use cases and may be used to enforce the authorizations defined in the analysis stage. Secure interfaces enforce authorizations when users interact with the system. Components can be secured by using authorization rules for Java or .NET components. Distribution provides another dimension where security restrictions can be applied. Deployment diagrams can define secure configurations to be used by security administrators. A multilayer architecture is needed to enforce the security constraints defined at the application level; in each level one can use patterns to represent appropriate security mechanisms, and security constraints must be mapped between levels. The persistent aspects of the conceptual model are typically mapped into relational databases [7]. The design of the database architecture is done according to the requirements from the use cases for the level of security needed and the security model adopted in the analysis stage. Two basic choices for the enforcement mechanism are query modification, as in INGRES, and views, as in System R. Data intensive systems do not generally follow this model, so we need to modify this stage. A tradeoff is using an existing DBMS as a Commercial Off-The-Shelf (COTS) component, although in this case security will depend on the security of that component. Database system architectures can be structured in different ways. Traditional systems had the operating system controlling the file system that stored the whole database, as shown in Fig. 16.3. In the newer architectures there is a web application server that combines a web server and an application integration server (WAS) (Fig. 16.4).
Fig. 16.4 Architecture using a web application server
The WAS contains a common model of the possibly many databases and applies security controls. The architectures surveyed earlier, e.g., [23, 30, 34], are ad hoc architectures using some type of middleware resembling a WAS.
Implementation stage: This stage requires reflecting in the code the security rules defined in the design stage. Because these rules are expressed as classes, associations, and constraints, they can be implemented as classes in object-oriented languages. In this stage one can also select specific security packages or COTS components, e.g., a firewall product or a cryptographic package. Some of the patterns identified earlier in the cycle can be replaced by COTS components (these can be tested to see whether they include a similar pattern). Performance aspects now become important and may require iterations. As indicated, a whole DBMS could be such a component.
8 Conclusions
The same security constraints and principles that apply to more general environments also apply to data intensive systems, and their objectives with respect to security are also similar. What is different is that large amounts of information may exist scattered across diverse places and that many people may need legitimate access to them; in other words, there is a large data exposure. New architectures, such as the cloud, bring new types of attacks [22]. In addition, convenient use requires access through the Internet, which is a rather insecure environment.
The needs of the applications we considered are quite different: social networks are (or should be) concerned with user privacy, some scientific computing has few security needs, and financial and medical applications are highly sensitive. In other words, they span all degrees of security. This indicates that one should study the specific threats of each environment and then apply an appropriate security approach. If we look at the current ways of manipulating large amounts of data, we come to the conclusion that the situation with respect to security is not good and that much needs to be done before we can trust these systems. There is almost no concern for security and privacy in the recent commercial offerings; for example, the security of a very popular system like Hadoop is currently rather poor. Also, many changes are happening: SciDB is touted as an improvement for scientific databases [42], but again there is no mention of security. In traditional systems security has been applied in a reasonable way, but in the new systems the only emphasis until now has been performance. When many serious breaches happen, we will start worrying about security. At that moment, complete solutions will become important, because they are the only ones that can provide a high degree of security. We have made some recommendations to secure the new systems. Our approach to security is based on application semantics. We believe that there is no security system that fits all applications; one should analyze the specific threats of each application and then build the application and its platform together. Emphasis on protecting only the lower levels of the architecture has not been effective in the past, and if we don't understand the applications there is not much hope that we will be able to build secure systems. The use of Service Level Agreements (SLAs) and other legal restrictions is necessary to protect customers and the individuals whose information is kept in these systems. More work is needed to define precise SLAs, together with ways to monitor that the prescriptions of the SLAs are followed. Service certification may be needed in some cases [2]; not much has been done in this direction, and it should become important for critical services. In fact, SLAs can be a byproduct of secure application development, and they can be integrated into the system life cycle.
References
1. R. Anderson, Security Engineering (2nd ed.), Wiley, 2008.
2. M. Anisetti, C.A. Ardagna, and E. Damiani, “Container-level security certification of services”, International Workshop on Business System Management and Engineering (BSME 2010).
3. P. Avgeriou, “Describing, instantiating and evaluating a reference architecture: A case study”, Enterprise Architecture Journal, June 2003.
4. D. Bayliss, “HPCC systems: Aggregated data analysis: The paradigm shift”, LexisNexis white paper, May 2011, http://hpccsystems.com
5. F. Braz, E.B. Fernandez, and M. VanHilst, “Eliciting security requirements through misuse activities”, Procs. of the 2nd Int. Workshop on Secure Systems Methodologies using Patterns (SPattern’08), Turin, Italy, September 1–5, 2008, 328–333.
6. R.E. Bryant, “Data intensive supercomputing”, slide presentation, http://www.cs.cmu.edu/~bryant
7. S. Ceri, P. Fraternali, and M. Matera, “Conceptual modeling of data-intensive web applications”, IEEE Internet Computing, July–August 2002, 20–30.
8. R. Elmasri and S. Navathe, Fundamentals of Database Systems (6th ed.), Pearson, 2010.
9. E.B. Fernandez, M.M. Larrondo-Petrie, T. Sorgente, and M. VanHilst, “A methodology to develop secure systems using patterns”, Chapter 5 in Integrating Security and Software Engineering: Advances and Future Vision, H. Mouratidis and P. Giorgini (Eds.), IDEA Press, 2006, 107–126.
10. E.B. Fernandez, “Security patterns and a methodology to apply them”, in Security and Dependability for Ambient Intelligence, G. Spanoudakis and A. Maña (Eds.), Springer Verlag, 2009.
11. E.B. Fernandez, C. Marin, and M.M. Larrondo Petrie, “Security requirements for social networks in Web 2.0”, in Handbook of Social Networks: Technologies and Applications, B. Furht (Ed.), Springer, 2010.
12. E.B. Fernandez, E. Gudes, and M. Olivier, The Design of Secure Systems, Addison-Wesley, to appear.
13. E.B. Fernandez, Designing Secure Architectures Using Security Patterns, under contract with J. Wiley; to appear in the Wiley Series on Software Design Patterns.
14. E.B. Fernandez and S. Mujica, “Model-based development of security requirements”, accepted for the CLEI (Latin-American Center for Informatics Studies) Journal.
15. I. Foster, J. Voeckler, M. Wilde, and Y. Zhao, “The Virtual Data Grid: A new model and architecture for data-intensive collaboration”, Procs. of the 15th International Conference on Scientific and Statistical Database Management (SSDBM ’03), IEEE Computer Society, Washington, DC, USA, 2003.
16. I. Foster, Y. Zhao, I. Raicu, and S. Lu, “Cloud computing and grid computing 360-degree compared”, CoRR, vol. 0901, 2009.
17. D. Gollmann, Computer Security (2nd ed.), Wiley, 2006.
18. I. Gorton, P. Greenfield, A. Szalay, and R. Williams, “Data-intensive computing in the 21st century”, IEEE Computer, vol. 41, no. 4, 2008, 30–32.
19. I. Gorton, “Software architecture challenges for data intensive computing”, Procs. of the Seventh Working IEEE/IFIP Conference on Software Architecture (WICSA 2008), 4–6.
20. R.L. Grossman and Y. Gu, “On the varieties of clouds for data intensive computing”, Bull. of the IEEE Comp. Soc. Tech. Comm. on Data Eng., 2009, 1–7, http://sites.computer.org/debull/A09mar/issue1.htm
21. A. Hameurlain, F. Morvan, and M. El Samad, “Large scale data management in grid systems: a survey”, Information and Communication Technologies: From Theory to Applications (ICTTA 2008), 1–6.
22. K. Hashizume, E.B. Fernandez, and N. Yoshioka, “Misuse patterns for cloud computing”, accepted for the 23rd International Conference on Software Engineering and Knowledge Engineering (SEKE 2011), Miami Beach, USA, July 7–9, 2011.
23. LexisNexis HPCC, Data-Intensive Computing Solutions, http://wpc.423a.edgecastcdn.net/00423A/whitepapers/wp_data_intensive_computing_solutions.pdf (last retrieved June 30, 2011).
24. N. Katic, G. Quirchmayr, J. Schiefer, M. Stolba, and A.M. Tjoa, “A prototype model for data warehouse security based on metadata”, Int. Workshop on Security and Integrity of Data Intensive Applications, in conjunction with the 9th Int. Conf. on Database and Expert Systems Applications (DEXA ’98), University of Vienna, Austria, August 24–28, 1998.
25. B. Lang, I. Foster, F. Siebenlist, R. Ananthakrishnan, and T. Freeman, “A flexible attribute based access control method for grid computing”, 2009.
26. J. Legrand et al., “Monitoring and control of large systems with MonALISA”, Comm. of the ACM, vol. 52, no. 9, Sept. 2009, 49–55.
27. S. Lohr, “New ways to exploit raw data may bring surge of innovation, a study says”, The New York Times, May 13, 2011, B3.
28. A. Manjhi, A. Ailamaki, B.M. Maggs, T.C. Mowry, C. Olston, and A. Tomasic, “Simultaneous scalability and security for data intensive web applications”, Procs. of SIGMOD 2006, June 27–29, 2006, Chicago, Illinois, USA.
29. J. Markoff, “Digging deeper, seeing farther: Supercomputers alter science”, The New York Times, April 26, 2011, D1 and D3.
30. C. Mattmann, D. Crichton, J.S. Hughes, S.C. Kelly, and P.M. Ramirez, “Software architecture for large-scale, distributed, data-intensive systems”, Procs. of the 4th Working IEEE/IFIP Conf. on Software Architecture (WICSA 4), Oslo, Norway, June 2004.
31. C. Mattmann, D. Crichton, N. Medvidovic, and S. Hughes, “A software architecture-based framework for highly distributed and data intensive scientific applications”, Procs. of the 28th International Conference on Software Engineering (ICSE 2006), Shanghai, China, May 20–28, 2006, 721–730.
32. C. Mattmann, D. Crichton, A. Hart, S. Kelly, and J.S. Hughes, “Experiments with storage and preservation of NASA’s planetary data via the cloud”, IEEE IT Professional, Special Theme on Cloud Computing, vol. 12, no. 5, September/October 2010, 28–35.
33. D. McCreary and D. McKnight, The CIO’s Guide to NoSQL, http://www.Dataversity.net
34. C. Miceli et al., “Programming abstractions for data intensive computing on clouds and grids”, Procs. of the 9th IEEE/ACM Int. Symp. on Cluster Computing and the Grid, 2009, 478–483.
35. A. Nourian, M. Maheswaran, and M. Pourzandi, “Privacy and security requirements of data intensive applications in clouds”, Chapter 20, this book.
36. P. Patel, A. Ranabahu, and A. Sheth, “Service level agreement in cloud computing”, Cloud Workshops at OOPSLA, 2009.
37. I. Roy, S.T.V. Setty, A. Kilzer, V. Shmatikov, and E. Witchel, “Airavat: Security and privacy for MapReduce”, http://www.cs.utexas.edu/~shmat/shmat_nsdi10.pdf
38. J. Schlesinger, “Cloud security in MapReduce: An analysis”, http://www.defcon.org/images/defcon-17/dc-17-presentations/defcon-17-jason_schlesinger-cloud_security.pdf
39. F. Siebenlist, R. Ananthakrishnan, D.E. Bernholdt, L. Cinquini, I.T. Foster, D.E. Middleton, N. Miller, and D.N. Williams, “Enhancing the Earth System Grid security infrastructure through single sign-on and autoprovisioning”, Procs. of the 5th Grid Computing Environments Workshop (GCE ’09), ACM, New York, NY, USA, 2009.
40. S. Singh et al., “A metadata catalog service for data intensive applications”, Procs. of the ACM/IEEE SC2003 Conference, ACM, 2003.
41. A. Singh, M. Srivatsa, and L. Liu, “Efficient and secure search of enterprise file systems”, Procs. of WWW 2007, Banff, Canada, May 2007.
42. M. Stonebraker, “SciDB: An open source data base project”, presentation, 2008.
43. R. Villarroel, E. Fernandez-Medina, M. Piattini, and J. Trujillo, “A UML 2.0/OCL extension for designing secure data warehouses”, Journal of Research and Practice in Information Technology, vol. 38, no. 1, February 2006, 31–43.
44. X. Wei et al., “GDIA: A scalable grid infrastructure for data intensive applications”, Int. Conf. on Hybrid Information Technology (ICHIT ’06), Nov. 2006.
45. B. Zhou and J. Pei, “Privacy preserving data mining and social computing in large-scale social networks”, Chapter 13, this book.