Chapter 19
Privacy and Security Requirements of Data Intensive Computing in Clouds Arash Nourian and Muthucumaru Maheswaran
1 Introduction
Cloud computing is evolving into a popular computing paradigm for deploying a variety of data-intensive applications. Although there are many definitions of cloud computing [1], it is largely an evolution of computing paradigms that have been in use for many years. Cloud computing offers the opportunity to commoditize several information technology (IT) services. As more services are ported to the clouds, the amount of data held by the clouds increases. When migrating to the cloud, customers need to understand how it differs from their existing environments. Clouds are shared and largely virtual environments. Data owners should understand the implications of their data residing in the cloud service provider’s data center under the protection of its security policies [2]. Further, the data security policies adopted by a cloud provider change over time. As these policies change, customers need to follow their evolution to understand potential breaches in the security of their data. Cloud computing systems emerged from the need to create a platform that can provide on-demand computing resources at low management cost. Therefore, service level agreements (SLAs) that define the resource characteristics are central to cloud computing. The SLAs are used to measure the compliance of the cloud resource management activities, and providers or consumers violating them are
A. Nourian School of Computer Science, McGill University, McConnell Engineering Building, Room 318, 3480 University Street, Montreal, QC, H3A 2A7 Canada e-mail:
[email protected] M. Maheswaran () School of Computer Science, Department of Electrical and Computer Engineering, McGill University, McConnell Engineering Building, Room 754, 3480 University Street, Montreal, QC, H3A 2A7 Canada e-mail:
[email protected] B. Furht and A. Escalante (eds.), Handbook of Data Intensive Computing, DOI 10.1007/978-1-4614-1415-5 19, © Springer Science+Business Media, LLC 2011
penalized accordingly. Because SLAs are crafted mainly with resource management concerns in mind, they are not suitable for dealing with the safety of data stored and processed by cloud computing over long time scales. A cloud ecosystem consists of many different participants. Currently, the most important participants are cloud providers (e.g., Amazon, Rackspace) and cloud consumers (e.g., banks, e-commerce sites). The cloud model does not yet include an independent regulator to enforce best practices or arbitrate conflicts among the participants. Unlike computations, data processing can have security concerns that span long periods of time. Such concerns are affected by variations in the policies adopted by the cloud providers and by the trust among the participants. Because cloud computing is still taking shape, regulatory efforts are still recommendatory in nature (e.g., FedRAMP [3]). Trust in clouds is still a fledgling concept that is yet to be defined in a widely accepted manner. Consequently, customers do not have standard ways of evaluating and comparing the trust and reputation scores of different cloud operators. This means that one cloud provider might not be equivalent to, and replaceable with, another in terms of the trust that can be placed on its resources.
2 Data Cloud Computing
With data-intensive applications, cloud computing becomes an even more compelling proposition. Most data-intensive applications gather data from sources such as sensor devices (e.g., cameras), diagnostic services (e.g., medical imaging systems), transaction management systems (e.g., e-commerce archives), and social network systems. In these applications, the rate at which data is generated increases with time. Also, many data-intensive computing applications have archival requirements, at least for a given duration. Therefore, the data load presented by these applications increases with time. The dynamic resource allocation model, where the consumer pays based on the amount of resource usage, is ideal for data-intensive applications because it allows the consumer to keep the resource utilization very high. This helps small companies with limited resources to launch data-intensive services at low cost with minimum launch time [4]. They can successfully compete with high-investment companies because the underlying infrastructure needs are taken care of by the clouds [5].
2.1 Model for Data Clouds
Cloud computing can be categorized into four different types [6]: public cloud, private cloud, community cloud, and hybrid cloud. In a public cloud, resources are dynamically provisioned over the Internet. In these clouds, different customers’ data sit on cloud machines that are shared between customers. A private cloud is built using private computing resources and networks. These types of
clouds are built by organizations that need full control over their data and services. A community cloud is shared by a group of organizations. A hybrid cloud environment combines multiple public and private clouds. Although cloud computing brings scalability, high quality of service, user-friendly interfaces, and low cost [7–9], it still suffers from several fundamental security drawbacks. The security requirements for cloud computing providers typically include firewall isolation, network segmentation, intrusion detection and prevention systems, and security monitoring tools. Realizations of these requirements in clouds are based on virtual machines and are vulnerable to attacks targeting virtual machine operating systems. Attackers can remotely exploit vulnerabilities in the OSes to intrude into the cloud infrastructure and siphon off valuable data. Often, virtual machines are co-located on the physical machines on which they are instantiated, which increases the effectiveness of intrusion attacks [10]. To address this type of attack, either traditional security techniques should be extended to address security breaches at the virtual machine level or new paradigms should be introduced to tackle the cloud security issues.
2.2 Example Data Clouds
Cloud computing has various flavours of implementation that are relevant to data clouds. Basically, it is composed of three service models: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).
1. Software as a Service (SaaS)
Software as a Service refers to software that is deployed over the Internet by a provider and to which customers can subscribe on demand. Customers are charged based on the number of services they use along with the duration of usage. The most successful examples of such services are customer relationship management (CRM) systems (e.g., Salesforce.com), which can be used by anyone, regardless of the number of customers they might have, without spending a lot of money on deploying such a system. Basically, in SaaS, software developers use the clouds to rent out their software instead of selling it, and customers can rent the software for as long as they need it by paying a fixed subscription fee. This model is highly appropriate for generic applications and is still not suitable for customized applications.
2. Platform as a Service (PaaS)
Platform as a Service refers to platforms that enable users to create their own cloud applications by providing the business logic. PaaS provides high-level access to system functions such as database management. PaaS equips users with tools for creating new applications rapidly and at low cost. The major concern here is to avoid application lock-in, as migrating between platforms is not an easy task. Although there are many popular PaaS stacks such as Amazon EC2,
Eucalyptus, Microsoft Azure, and Google App Engine, an application developed for one platform may require significant effort to migrate to another platform.
3. Infrastructure as a Service (IaaS)
Infrastructure as a Service refers to computing resources offered as a service. An IaaS cloud provides virtual processing, network, and data storage elements on top of which customers can deploy custom system images to implement their applications. Companies need not create their own data centers for their services. They can deploy their services on top of an IaaS provider and get billed for the resources according to their usage. The IaaS model does not have the problem of lock-in, as providers normally export compatible virtual machine interfaces. Amazon Web Services and Rackspace are two key players providing IaaS services.
Data-intensive computing applications can be implemented on clouds falling into any of the three models given above. IaaS can be the preferred approach for deploying data-intensive applications when precise control is needed over the computing facilities provided by the clouds. For instance, the parallel processing requirements of data-intensive applications can favour IaaS clouds. In contrast, PaaS clouds provide complete development environments for cloud applications. Various services essential for web-oriented applications, such as database and web services, are typically made available by PaaS clouds. Besides the generally adopted cloud computing model described above, other models have been proposed, such as the Linthicum [11] and the Jericho Cloud Cube [12] models. The Linthicum model consists of ten categories as follows:
• Information as a Service: using any type of information hosted in the cloud.
• Database as a Service: using and sharing hosted databases in the cloud.
• Storage as a Service: using and sharing physical storage provided by the cloud.
• Application as a Service: using and sharing any hosted application inside the cloud.
• Platform as a Service: using and sharing the platform provided by the cloud.
• Process as a Service: using hosted resources to create the resources required for business processes.
• Security as a Service: using and sharing security services provided by the cloud.
• Testing as a Service: using testing software and services hosted in the cloud to test services provided inside and outside the cloud.
• Management as a Service: using hosted controllers to manage cloud resources and services.
• Integration as a Service: using the stack provided by the cloud to control integration processes.
While the Linthicum model refines the simple cloud model discussed previously, the Jericho Cloud Cube model adds the idea of dimensions to the traditional model, as follows.
• Insourced Versus Outsourced: services under the consumer’s control are called insourced, whereas services controlled by a third party are called outsourced.
• Internal Versus External: describes the physical location of data.
• Perimeterized Versus De-perimeterized: inside or outside of the traditional IT domain.
• Proprietary Versus Open: using open technology or a black-box one.
3 Security Concerns in Data Clouds
This section describes some of the key security concerns in data clouds. These concerns arise in the context of the overall safety of the data stored and processed by cloud systems.
3.1 Data Confidentiality
Confidentiality refers to the prevention of unauthorized disclosure of users’ information. It requires cloud providers to ensure that such disclosure cannot occur. Confidentiality concerns include two major issues: authentication and access control. Most cloud installations use weak forms of authentication, such as username and password, to anchor the credentials that are used for controlling access to data and resources. Further, the access control mechanisms in most clouds do not support fine-grained control. Authentication and access control work only in uncompromised systems. In a compromised cloud computing system, preventing unauthorized information disclosure is much harder. A data cloud that deals with highly sensitive data can use encryption and/or data segmentation to limit the confidentiality breach when the cloud computing system is compromised. Data segmentation helps minimize the amount of data disclosed if any of the servers that store the data gets compromised. Data segmentation brings the following advantages to the cloud’s security capabilities:
• Consumers’ sensitive data leaks only when an entire system of cloud servers is compromised.
• Downtime associated with the compromise of an individual node is negligible.
Covert channels are another issue for the confidentiality of data inside a cloud. A covert channel is an unauthorized communication path that enables information leakage. Covert channels can be created through the timing of messages or the inappropriate use of storage mechanisms.
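The first advantage of data segmentation listed above can be illustrated with a small sketch. The chapter does not prescribe a particular segmentation scheme, so the code below assumes one simple possibility, XOR-based secret splitting, in which every share is needed to rebuild the data. The function names and the sample record are hypothetical.

```python
import secrets
from functools import reduce


def xor_bytes(x: bytes, y: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(x, y))


def split_into_shares(data: bytes, n_servers: int) -> list[bytes]:
    """Split data into n_servers shares; all of them are needed to rebuild it."""
    if n_servers < 2:
        raise ValueError("need at least two servers")
    # The first n-1 shares are uniformly random pads as long as the data.
    shares = [secrets.token_bytes(len(data)) for _ in range(n_servers - 1)]
    # The last share is the data XORed with every pad.
    shares.append(reduce(xor_bytes, shares, data))
    return shares


def combine_shares(shares: list[bytes]) -> bytes:
    """XOR all shares together to recover the original data."""
    return reduce(xor_bytes, shares)


# Hypothetical record, with one share stored on each of three servers.
record = b"account=1234;balance=5000"
shares = split_into_shares(record, 3)
recovered = combine_shares(shares)
```

With this particular scheme, an attacker holding fewer than all shares learns nothing about the record. The trade-off is that losing any single share makes the record unrecoverable, so a deployment that also wants the availability benefit would need replication or a threshold scheme.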
3.2 Data Integrity
Information held by data clouds is often used by many parties. In addition, many parties often have the capability to update the information held in data clouds.
To maintain the integrity of the data, the capability to update the information should be carefully controlled. Like data confidentiality, data integrity depends on the authentication and access control. If the authentication is based on credentials that are anchored on weak usernames and passwords, the authentication process can be easily compromised. This can lead to unauthorized updates to the data. Similarly, if the cloud is compromised, data becomes unprotected from unauthorized updates.
3.3 Data Provenance
Data has integrity if it is not changed in an unauthorized manner or by an unauthorized person. Provenance means not only that the data has integrity, but also that it was computed in the correct manner. Data clouds are used to hold data that undergo a series of computational transformations to produce useful results. The computations are often applied in a data-driven manner where data sets are piped through computational elements. In this process, the metadata associated with the data sets is used in applying the appropriate transformations. The metadata indicates the “computational history” of a data set.
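One way to picture such a computational history is as a list of records, one per transformation, each linked to the previous record by a hash. The following is a minimal sketch under that assumption. The record layout and the function names are hypothetical and are not taken from any particular data cloud.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def record_digest(record: dict) -> str:
    """Digest of a provenance record, using a canonical JSON encoding."""
    return digest(json.dumps(record, sort_keys=True).encode())


def apply_step(data: bytes, history: list, name: str, transform) -> tuple:
    """Apply a transformation and append a provenance record describing it."""
    result = transform(data)
    record = {
        "step": name,
        "input_sha256": digest(data),
        "output_sha256": digest(result),
        # Linking to the previous record makes dropped or reordered steps detectable.
        "prev": record_digest(history[-1]) if history else None,
    }
    return result, history + [record]


def verify_history(history: list) -> bool:
    """Check the links between records and that each input matches the prior output."""
    for i, record in enumerate(history):
        if i == 0:
            if record["prev"] is not None:
                return False
            continue
        prev = history[i - 1]
        if record["prev"] != record_digest(prev):
            return False
        if record["input_sha256"] != prev["output_sha256"]:
            return False
    return True


raw = b"  Sensor,READING\n"
data, history = apply_step(raw, [], "strip", bytes.strip)
data, history = apply_step(data, history, "lowercase", bytes.lower)
history_ok = verify_history(history)
```

An unkeyed hash chain like this only detects accidental or partial tampering. A party able to rewrite the whole history could recompute every link, so a real deployment would likely also sign the records.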
3.4 Data Availability
Data availability is one of the major benefits provided by cloud computing systems. Using massive scale and sufficient safeguards against various forms of denial-of-service attacks, clouds are able to provide very high availability that rivals the best provided by dedicated systems. However, several highly publicized data losses have happened in data clouds. Therefore, data availability is an issue that still needs careful consideration when data cloud deployments are rolled out.
3.5 Data Accountability
Data accountability is the ability to verify the agents behind the different actions in data clouds. To implement accountability, identifying information about the agents needs to be logged at different locations in the data clouds. Simply associating all the actions with a customer account may not always be sufficient. For instance, if an account is compromised, then the customer may not be accountable for the actions performed after the compromise. Therefore, sufficient information should be logged to detect anomalous behaviour after an account compromise.
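As a rough illustration of logging more than the customer account, the sketch below records the acting agent and the request source with every action, and flags actions whose agent and source were not previously seen for that account. The entry fields, the baseline structure, and the flagging rule are all assumptions made for this example. Real anomaly detection would be considerably more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEntry:
    account: str  # customer account the action is attributed to
    agent: str    # user or service credential that performed the action
    source: str   # network address the request came from
    action: str   # what was done


def flag_unfamiliar(entries: list, baseline: dict) -> list:
    """Return entries whose (agent, source) pair is not in the account's baseline.

    baseline maps an account name to the set of (agent, source) pairs that
    were observed for it during normal operation.
    """
    return [
        entry
        for entry in entries
        if (entry.agent, entry.source) not in baseline.get(entry.account, set())
    ]


# Hypothetical baseline and log; the addresses come from documentation ranges.
baseline = {"acme": {("alice", "203.0.113.7")}}
log = [
    AuditEntry("acme", "alice", "203.0.113.7", "read report"),
    AuditEntry("acme", "alice", "198.51.100.9", "delete volume"),
]
suspicious = flag_unfamiliar(log, baseline)
```

Here only the second entry is flagged, because the same credential is used from an address that was never seen for the account.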
3.6 Data Placement
Large cloud operators such as Amazon have resource installations on different continents. Some countries stipulate that certain sensitive data (e.g., healthcare data [13]) be stored within specific geographical locations. Data clouds should take such stipulations into account when creating replicas of the stored data. Data placement concerns also arise because cloud operators can be forced by court orders to disclose data stored on their servers.
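A placement constraint of this kind can be sketched as a filter applied before replicas are created. The data center names, the country codes, and the function below are hypothetical. The sketch only shows the idea of restricting replica sites to permitted jurisdictions.

```python
def choose_replica_sites(candidate_sites: dict, allowed_countries: set, replicas: int) -> list:
    """Pick replica sites only from data centers located in allowed countries.

    candidate_sites maps a data center name to the country it is located in.
    """
    eligible = sorted(
        name for name, country in candidate_sites.items()
        if country in allowed_countries
    )
    if len(eligible) < replicas:
        # Refuse to place data rather than spill into a forbidden jurisdiction.
        raise ValueError("not enough data centers inside the allowed jurisdictions")
    return eligible[:replicas]


# Hypothetical data centers and a rule that the data must stay in Canada.
sites = {
    "dc-montreal": "CA",
    "dc-toronto": "CA",
    "dc-virginia": "US",
    "dc-dublin": "IE",
}
placement = choose_replica_sites(sites, {"CA"}, 2)
```

Failing loudly when too few eligible sites exist is a design choice made for the sketch. An operator might instead lower the replica count and notify the customer.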
3.7 Data Remanence
Data remanence is the residual data left behind after a delete operation. It arises because the delete operation merely unlinks an allocated block and adds it to the free list. Although data remanence is an important issue, data cloud operators are yet to state explicitly in their policies how their systems handle this concern. In an IaaS cloud, the customer can fill the storage volume with zeros before releasing it to reduce data remanence. However, the customer has less control over data remanence in SaaS and PaaS clouds. Data encryption or segmentation can help with remanence.
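The zero-filling step available to an IaaS customer can be sketched as follows. A temporary file stands in for the storage volume, and the function name is hypothetical. Note that overwriting through a file system does not guarantee that the underlying physical blocks are overwritten, for example on SSDs or on copy-on-write and snapshotting storage.

```python
import os
import tempfile


def zero_fill(path: str, chunk_size: int = 1024 * 1024) -> None:
    """Overwrite the full length of a file with zeros and flush it to the device."""
    remaining = os.path.getsize(path)
    zeros = bytes(chunk_size)
    with open(path, "r+b") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())


# Demo on a temporary file standing in for a storage volume.
secret = b"sensitive customer data"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(secret)
    volume_path = tmp.name

zero_fill(volume_path)
with open(volume_path, "rb") as f:
    contents_after = f.read()
os.remove(volume_path)  # release the "volume" only after it has been zeroed
```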
4 Threat Modelling in Data Clouds
Threats in data clouds can be categorized as external or internal. External threats emerge from parties outside the cloud or from customers’ applications running on the cloud. Internal threats emerge from the cloud itself or from third-party system or management services that are integrated into the cloud. When external threats emerge from outside parties, countermeasures developed for protecting network assets on the Internet (e.g., intrusion detection and prevention systems) can be useful in handling the threats. Threats emerging from application programs are a little harder to handle. In IaaS clouds, applications run inside virtual machines, which can be used as capability-constrained sandboxes to limit the data that can be accessed and manipulated by the application. In PaaS clouds, the application programs use the APIs provided by the cloud provider to access the data and services managed by the cloud. To safeguard sensitive information, APIs often support API keys that restrict access to data to authorized programs. Cloud computing, and in particular the PaaS and SaaS variants, assumes that the customer trusts the cloud service provider to keep the data secure. In many situations, the cloud service providers have peering arrangements with other service providers to share information regarding customers. This means a customer’s
activity information could be shared with others without the customer’s consent. If a customer prefers a high level of privacy, such information sharing among the providers can cause security and privacy concerns. Internal threats such as those mentioned above are difficult to handle using existing security measures. Two main approaches can be used to handle internal threats: trust modelling on clouds and distributing data across independent clouds. Trust modelling is an idea that is yet to be explored fully on clouds. Although trust modelling is well studied in online transaction management systems, cloud computing poses new challenges. The primary challenge is the lack of observability in the clouds. Customers can pay careful attention to cloud providers’ data confidentiality and availability records while selecting a cloud provider for their needs. While data availability can be measured by outside parties, quantifying data confidentiality is a hard problem. One approach is to examine the data confidentiality policies offered by a provider, the evolution of such policies, and historical data on the provider’s data confidentiality violations. Another approach is to make cloud computing auditable. Many cloud computing providers are large companies with multi-country operations. These cloud providers are not likely to open their installations for a customer’s examination. However, a regulatory organization with a sufficient mandate can act as an auditor and examine a cloud provider’s activities on behalf of the customers. Cloud computing systems have threat scenarios that do not exist in traditional computing systems. In clouds, multiple users can share a physical device (computing element) because their virtual machines are mapped onto the same machine. Attackers can therefore bypass the intricate access control mechanisms if they can obtain co-tenancy with their victims. Once the attackers gain co-tenancy, they could monitor various activities of the victims.
5 Data Intensive Applications on Clouds
5.1 E-Banking
E-banking is a broad sector with many different institutions and users managing money in the clouds. For example, banks, credit card companies, and personal banking institutions create web-based portals so that their users can manage transactions through them. Personal banking is becoming a popular sector where people use applications running on desktops or mobile devices to manage their financial portfolios. Unlike traditional money management applications, which have existed for a long time, cloud-based money management applications are active and available all the time. This means the user is able to access the application from any device at any time. Further, the application is able to gather information from sources such as banks and credit card agencies regarding the client’s transactions.
Most e-banking services fall into the SaaS cloud model. Cloud computing provides many benefits to e-banking, such as always-on access, professionally managed secure computing environments, and seamless connectivity with other services. There are significant advantages to running the money management software on the clouds instead of on the client’s desktop. A cloud-resident application can receive notifications from banks and credit card agencies without any connectivity issues. In addition to providing a facility for their users to manage money, several money management applications have started sharing analytics among their users. These analytics are measures computed over the anonymized transactions that users perform with their banks, credit card agencies, and other business partners. These analytics measures can be useful for users as well as businesses. For e-banking, data confidentiality is a key concern. User authentication remains one of the most heavily exploited vulnerabilities for breaking data confidentiality in e-banking. Data integrity is not much of a concern here because users do not generate content in e-banking and their opportunities to update existing information are very limited. Data accountability and provenance are important concerns as well. Information access should be carefully logged to detect fraud and address any legal conflicts that may arise. Similarly, data provenance is important because different services obtain data from other services. For instance, personal banking applications can obtain credit card transactions and bank account information on behalf of the user from the respective institutions. Data provenance is important to ensure that the data is appropriately used as it flows from one service to another.
5.2 E-Health
E-health clouds are emerging as many countries streamline their healthcare platforms by removing unnecessary blockages to information exchange. In the simplest scenario, an e-health cloud will have patients, healthcare providers, and patient record hosting services. The simple scenario is sufficient only if the patient is trusted to maintain the health record. Such trust is not feasible in many situations, such as when insurance claims are made for healthcare services. Therefore, a practical e-health cloud will be more complex. It will have healthcare providers responsible for maintaining the health records. There will be billing services and insurance services in addition to the previous three entities. The key challenge is to maintain patient privacy while allowing efficient information interchange for the purposes of providing the necessary healthcare to the patient. In most cases, the e-health cloud will be a SaaS-type cloud. With multiple parties in the e-health cloud, information access needs to be restricted according to the need-to-know principle. For instance, an insurance service provider need not know the particular type of ailment the patient is suffering from. Instead, the insurance service provider needs to know the particulars of the types of healthcare services rendered to the patient and the cost associated with those services.
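The need-to-know restriction described above can be sketched as a per-role filter over the fields of a patient record. The role names, the field names, and in particular the fields granted to the billing role are assumptions made for this illustration. Only the insurance example follows the text directly.

```python
# Hypothetical mapping from a party's role to the record fields it needs to know.
NEED_TO_KNOW = {
    "healthcare_provider": {"patient_id", "diagnosis", "services", "cost"},
    "insurance": {"patient_id", "services", "cost"},  # no diagnosis, as in the text
    "billing": {"patient_id", "cost"},                # assumed for illustration
}


def view_for(role: str, record: dict) -> dict:
    """Return only the fields of a record that the given role is allowed to see."""
    allowed = NEED_TO_KNOW.get(role, set())  # unknown roles see nothing
    return {field: value for field, value in record.items() if field in allowed}


record = {
    "patient_id": "p-001",
    "diagnosis": "asthma",
    "services": ["spirometry"],
    "cost": 120,
}
insurance_view = view_for("insurance", record)
```

The insurance view contains the services rendered and their cost but not the diagnosis. Denying everything to unknown roles is a default-deny choice made for the sketch.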
E-health clouds are subjected to various governmental regulations on the way information is stored, transferred, archived, and released to different parties. If the cloud service provider fails to meet the regulations, there might be penalties imposed against the operator.
5.3 Video Surveillance (VSaaS)
Video surveillance services can benefit from the processing and storage capabilities of cloud computing. In traditional implementations, videos obtained from surveillance cameras are fed to a local storage server. The capacity of the storage server can limit the number of cameras and consequently the surveillance area. Further, with limited storage, old video feeds are overwritten by new video feeds. This means the duration of surveillance available is often limited by the amount of storage available at the servers. With their elastic storage facilities, clouds provide cheap, high-volume storage that is suitable for the video surveillance application. Also, clouds provide a remote and secure facility for archiving the surveillance videos that is not affected by the insecurities of the site being monitored by the surveillance system. The processing capabilities of the clouds can be handy for performing computations on the surveillance videos. In its simplest form, video surveillance needs a video storage and retrieval facility. With the advancements in video analysis software, it will be possible to search through videos for particular scenes. A cloud-based video surveillance system can facilitate such search functions by providing the necessary computational capability to run the video processing algorithms included in the search functions. Videos taken for surveillance purposes should be accessible only to authorized users. Because videos are large files, traditional encryption techniques cannot be applied efficiently to implement confidentiality. New techniques for efficiently handling the large datasets created by the video surveillance application are needed to implement data confidentiality. Data integrity is not a major concern here. The integrity aspect is covered by the data provenance concerns that deal with the injection of false feeds or loopbacks.
Video surveillance shares the availability concern with the other applications featured in this section.
5.4 E-Government
Most federal agencies have aging computing systems based on the traditional in-house approach. To cut costs, quickly revamp the infrastructure, and avoid being locked in to a particular vendor, they are beginning to consider cloud computing as a feasible approach [3]. If cloud computing is adopted by federal agencies, several mission-critical applications will be loaded onto remote cloud servers that are managed by third-party cloud providers. Along with the application code, the cloud servers will
Fig. 19.1 Comparison of the security and privacy requirements of the different applications
be holding valuable data. Therefore, the security and privacy of the data are a top priority. By adopting the cloud platform, the responsibility for implementing data protection is passed to the vendor [14], but the responsibility for enforcing accountability is retained by the federal agencies that own the data. With Internet access becoming ubiquitous, most international corporations have consolidated their IT services and are using private or public clouds for their purposes. Many governments are challenged to use the lessons learned by the corporations to roll out their own systems in a cost-efficient and secure way. The security and privacy of data in the cloud is turning out to be the most important barrier to the adoption of cloud services by government agencies. To address these concerns, the US government initiated a project called FedRAMP [3]. The Federal Risk and Authorization Management Program, or FedRAMP, has been established to provide a standard approach to assessing and authorizing cloud computing services and products. FedRAMP allows joint authorizations and continuous security monitoring services for government and commercial cloud computing systems intended for multi-agency use. Joint authorization of cloud providers results in a common security risk model that can be leveraged across the federal government. The use of this common security risk model provides a consistent baseline for cloud-based technologies.
5.5 Comparison of the Different Application Requirements
Figure 19.1 summarizes the data security and privacy requirements of the applications discussed in this section. From the figure it can be observed that confidentiality,
availability, and remanence are universally important for all applications. Concerns such as integrity, accountability, and placement are important for certain applications and less so for others.
6 Example Cloud Systems and their Security Policies
The lack of a standardized approach for establishing security and privacy parameters and enforcing them on the clouds is one of the fundamental concerns. Currently, security and privacy parameters are evaluated on an ad hoc basis according to the requirements of the customers. For example, a customer might ask the following questions when choosing a cloud service provider:
• How are customers’ data protected from unauthorized access?
• What forms of authentication schemes are supported to authorize access?
• How is data privacy ensured?
• What tools are offered to notify customers about access to their data?
• If customers decide to change clouds, what happens to their data and is there a secure way of exporting the data?
• Is there any data segmentation process to limit the amount of unintended data exposure if one of the servers gets compromised?
• What are the security policies and how often have they changed in the past?
This section analyzes various cloud offerings on the Internet and examines the mechanisms provided by them to address customers’ security and privacy concerns.
6.1 Amazon Web Services (IaaS)
Amazon [15] is at the forefront of cloud computing, offering a variety of services in this domain based on the IaaS cloud service model [16]:
1. Amazon EC2 (Amazon Elastic Compute Cloud), which provides virtual environments with dynamic computing capacities.
2. Amazon S3 (Amazon Simple Storage Service), which provides scalable data storage as a web service.
3. Amazon SimpleDB, which provides core database functions as a web service.
4. Amazon CloudFront, which provides content delivery as a web service for content providers.
5. Amazon Elastic MapReduce [17], which provides a hosted Hadoop [18] framework for processing large amounts of data.
Although Amazon’s cloud offerings are well established in the industry, they do not have a dedicated privacy policy. They use the same privacy policy as Amazon’s e-commerce site, Amazon.com. The AWS S3 does not encrypt a customer’s
data. Customers are able to encrypt their data themselves prior to uploading, but S3 does not provide encryption. Stored information is not guaranteed to be kept confidential and can be released in the following cases:
• Affiliated businesses not controlled by the cloud provider
• Third-party service providers
• Promotional offers
• Business transfers
• With customer’s consent
Some of the data sharing scenarios mentioned above apply more to e-commerce than to clouds. Because Amazon does not have a different set of policies for its cloud offerings, the same policies apply to the clouds as well. It is interesting to note that the policy does not cover data placement and purging. As cloud computing continues to gain popularity, Amazon may need cloud-specific data sharing policies instead of reusing policies from B2C or B2B services.
6.2 Google AppEngine (PaaS)
The Google AppEngine [19] privacy policy does not address all the security and privacy concerns delineated above. Google is one of the few cloud operators to provide an archive of its privacy policies so that customers can investigate their evolution. According to the recent version of the privacy policy, Google may share a customer’s information with other companies outside Google if they have contractual agreements or collaborations with them. Any party receiving the information is required to abide by Google’s terms of service, which require them to comply with Google’s privacy policy and any other appropriate confidentiality and security measures. But there is no information on how Google could verify the compliance of the collaborators. In addition, Google also discloses customer information under the following circumstances:
• Satisfy any applicable law, regulation, legal process or enforceable governmental request.
• Enforce applicable terms of service, including investigation of potential violations.
• Detect, prevent, or otherwise address fraud, security or technical issues.
• Protect against harm to the rights, property or safety of Google, its users or the public as required or permitted by law.
Also, if Google’s cloud operations become involved in a sale of any kind, Google ensures the confidentiality of the transferred data and provides a notice before making such a transfer. Google employees, contractors and agents are subject to discipline, including termination and criminal prosecution, if they fail to meet information disclosure obligations. However, there is no way for customers to verify whether such breaches took place. In most cases, customers need to trust the procedures Google has put in place to implement its policies.
6.3 Microsoft Azure (PaaS)
Microsoft Azure [20] is a member of the TRUSTe privacy seal program. TRUSTe is an independent organization whose mission is to build trust and confidence in the Internet by promoting the use of fair information practices. Neither the short version nor the complete version of the privacy policy contains any cloud-specific provisions for Azure. Instead, as explicitly stated on the Azure website, Azure is covered by Microsoft's general privacy policy. Customers' information may be revealed to third parties who are working with Microsoft. In addition to the information provided by users, Microsoft webpages contain web beacons, which help Microsoft gather information without the customer's awareness. Such information may be shared with third parties who are collaborating with Microsoft. Microsoft informs customers of any changes to its privacy policy by updating the corresponding pages. However, there is no archive of earlier privacy policies for customers to review.
6.4 Proofpoint (SaaS, IaaS)
Proofpoint [21] provides cloud-based solutions for enterprise email services, such as email storage management, archival, and protection against spam and phishing, using its cloud infrastructure as well as other services. While the common use case is outbound email filtering to restrict information leakage, it can also be used to detect sensitive information in inbound traffic from business partners. Proofpoint provides a solution called "smart identifier" that automatically scans outbound messages for sensitive information based on customizable rules and blocks or encrypts the messages as appropriate. The problem with this approach is that Proofpoint needs to analyze confidential corporate data to apply the rules. Therefore, for this approach to work, customers (enterprises hosting their email services in Proofpoint clouds) need to trust Proofpoint. Proofpoint has had a privacy policy in effect since September 2007, and this policy has not changed substantially since then. According to the policy, Proofpoint reserves the right to use and process customers' information without any limitation and to disclose and transfer information within its organization or among its affiliates. The privacy policy also states that customers' contact information may be shared with partners who can provide the customers with goods or services that may be of interest to them. According to the data placement policy, customers' information may be transferred or maintained outside the customer's state, province, or country, where the privacy laws may not be as protective as those in the customer's jurisdiction. Also, for customers residing outside the US, their information can be transferred to the US and processed there.
Although the privacy policy mentions that customers can delete their contact information by contacting Proofpoint, there is no information regarding data purging and data remanence.
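The rule-based outbound scanning described above can be illustrated with a small sketch. This is not Proofpoint's implementation: the `Rule` type, the two sample rules, and the `classify` function are hypothetical, and a real deployment would use far more careful detection than two regular expressions. The sketch does make one point from the discussion concrete: the scanner has to read the message plaintext in order to apply the rules.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    action: str  # "block" or "encrypt"


# Hypothetical, customer-defined rules (illustration only).
RULES = [
    # Sixteen digits in groups of four, optionally separated by space or hyphen.
    Rule("card_number", re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b"), "block"),
    Rule("confidential_marker",
         re.compile(r"\bconfidential\b", re.IGNORECASE), "encrypt"),
]


def classify(message: str, rules=RULES) -> str:
    """Return 'block', 'encrypt', or 'deliver' for an outbound message.

    'block' takes precedence over 'encrypt' when several rules match.
    """
    actions = {rule.action for rule in rules if rule.pattern.search(message)}
    if "block" in actions:
        return "block"
    if "encrypt" in actions:
        return "encrypt"
    return "deliver"
```

Because `classify` operates on the unencrypted message body, whoever runs it, whether the enterprise or the cloud provider, necessarily sees the confidential content.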
6.5 Salesforce (SaaS, PaaS)
Salesforce [22] provides customer relationship management (CRM) services based on cloud solutions. With more than two million satisfied customers, it offers service and sales clouds as its core services to businesses, including small businesses, enterprises, and healthcare providers. Salesforce is a member of TRUSTe, as indicated in its privacy policy. Salesforce may share customers' information with other companies that work on its behalf. Salesforce guarantees data confidentiality and maintains appropriate administrative, physical, and technical safeguards for enforcing the confidentiality and integrity of its customers' data. Salesforce may use customers' data for marketing purposes. The privacy policy also mentions that customers can customize the security measures to protect their data from unauthorized access, maintain data accuracy, and help ensure its appropriate use. There is no information regarding data purging or data remanence. However, the policy states that Salesforce will return a customer's data upon request within 30 days after the effective date of termination.
6.6 Sun Open Clouds
Sun provides an open cloud platform that minimizes vendor lock-in (data lock-in) as long as customers use the open standard APIs [23] provided as part of the cloud. Sun provides storage and compute clouds. The storage cloud is a set of web services that provide programmatic access to Sun's storage infrastructure. Developers can build and run their own cloud-based data center from pre-built components by dragging and dropping servers, switches, and firewalls. They can define a server's processor features, such as core memory and clock speed, under a pay-as-you-go model. A cloud created using this platform is compatible with the Amazon S3 and EC2 platforms at the API level, which makes migration between these two cloud providers easy. The Sun cloud platform does not have a dedicated privacy policy. It uses the privacy policy used by other Oracle services. The following are some of the salient features of the Oracle privacy policy with regard to cloud computing. Oracle participates in a TRUSTe privacy program. The policy regarding collection, usage, and sharing of customer information is very similar to those of other cloud providers. The Sun cloud uses various security measures to protect customer data from unauthorized access. All employees are required to sign an agreement to keep customers' data confidential. However, the privacy policy does not cover data purging, remanence, and placement.
Fig. 19.2 Comparison of the supported features by different cloud providers
6.7 Rackspace
Rackspace [24] is a storage cloud provider that hosts online content. With more than 40,000 cloud customers, it offers unlimited online storage for files and media objects and, by using Limelight Networks' content delivery network (CDN), delivers the content to clients at high speed. Signing up with Rackspace implies that the customer agrees with its privacy policies. If customers refuse to provide certain profile information while signing up, some features are disabled based on the missing information. All employees of Rackspace, as well as its agents, have access to customers' data. Rackspace uses best-practice approaches to prevent the loss, misuse and alteration of the information in its possession. Data integrity is provided, but there is no information regarding data purging, data remanence, and data exporting.
6.8 Comparison of Operating Policies of Different Clouds
One interesting observation that can be made from Fig. 19.2 is that almost all clouds support data confidentiality while none supports data purging. If a customer deems data purging important, the customer needs to encrypt the data before storing it in the clouds, so that discarding the key leaves any remaining copies unreadable. Data placement is another important parameter that is not supported by most of the cloud providers. This can bring both legal issues and security risks to consumers' data. Data placement can also create other burdens for consumers' businesses, such as different taxes. Data exportability is also among the features missing from the studied clouds. This feature is related to the issue of data lock-in, which is unfavourable to most consumers, who want to retain the freedom to change cloud services. Consumers certainly prefer a cloud that gives them more freedom over their data rather than one that locks them in.
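The advice to encrypt data before storing it in a cloud that offers no purging can be sketched as follows. The idea is that the customer keeps the keys locally and hands only ciphertext to the provider, so that destroying a key makes every copy held by the provider unreadable, wherever it has been replicated. Everything in this sketch is an assumption made for illustration: the class `ClientSideVault` is hypothetical, a plain dictionary stands in for the cloud storage service, and a one-time pad built on the standard `secrets` module stands in for a vetted authenticated cipher, which a real deployment should use instead.

```python
import secrets


class ClientSideVault:
    """Encrypts locally; only ciphertext is handed to the untrusted store."""

    def __init__(self, cloud_store: dict):
        self._keys = {}            # kept on the customer's side only
        self._cloud = cloud_store  # stand-in for a cloud storage service

    def put(self, name: str, plaintext: bytes) -> None:
        # One-time pad: a fresh random key as long as the data, never reused.
        # (Illustration only; use a vetted cipher library in practice.)
        key = secrets.token_bytes(len(plaintext))
        self._keys[name] = key
        self._cloud[name] = bytes(p ^ k for p, k in zip(plaintext, key))

    def get(self, name: str) -> bytes:
        key = self._keys[name]  # raises KeyError once the key is purged
        return bytes(c ^ k for c, k in zip(self._cloud[name], key))

    def purge(self, name: str) -> None:
        # The provider may keep the ciphertext indefinitely; without the
        # key it can no longer be decrypted.
        del self._keys[name]
```

Note that this shifts the purging problem from the provider to the customer's own key management: the guarantee is only as strong as the customer's ability to destroy every copy of the key.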
7 Trust Management in Clouds
Trust certification is becoming an important issue for cloud computing. To facilitate the trust certification process, the Federal Trade Commission has formulated the Fair Information Practice Principles (FIPPs). The FIPPs set out the following requirements:
• Need to notify the consumer when personal data is collected and used.
• Need to obtain consent from the consumer before data collection.
• Need to provide a facility for the consumer to examine the collected data for accuracy and misrepresentation.
• Need to protect the consumers' data against unauthorized access, destruction, and disclosure.
• Need to provide mechanisms to enforce compliance.
TRUSTe is a private certification company that certifies online businesses for their best practices with regard to customer privacy. Recently, TRUSTe has started a notification service that alerts customers to events that can impact their privacy. TRUSTe also allows customers to set limits on the distribution of their personal information. Any disputes with regard to privacy can be addressed directly with the service provider or with TRUSTe. TRUSTe services started with websites that collect user profile information. With clouds coming in different types (IaaS, PaaS, and SaaS), it is not clear how TRUSTe would be able to detect privacy violations. In particular, in IaaS clouds, different entities might be involved in providing a service, whereas in SaaS a single cloud provider can be held responsible for the service even though multiple parties could have contributed to its deployment.
8 Conclusions
Cloud computing is becoming an increasingly attractive alternative to large in-house data centers. For cloud computing to maintain its relevance for data-intensive applications, it is important that related standards are developed and adopted, especially in the security domain. This chapter highlighted various data privacy concerns that emerge in the context of representative data-intensive applications.
Further, data privacy measures supported by existing cloud computing systems were examined. Although significant work has been done in securing cloud computing systems, much more work is needed to ensure true data privacy. In particular, long-term security of the massive amounts of data accumulated by data-intensive applications is still weak.
References
1. “National Institute of Standards and Technology,” http://www.nist.gov/.
2. M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, “Above the Clouds: A Berkeley View of Cloud Computing,” University of California at Berkeley, Tech. Rep., Feb. 2009.
3. “FedRAMP: Governmentwide Approach to Cloud Security,” http://www.cio.gov/pagesnonnews.cfm/page/Federal-Risk-and-Authorization-Management-Program-FedRAMP.
4. D. Abramson, R. Buyya, and J. Giddy, “A computational economy for grid computing and its implementation in the Nimrod-G resource broker,” Future Generation Computer Systems (FGCS) Journal, vol. 18, no. 8, 2002, pp. 1061–1074.
5. R. T. Kouzes, G. A. Anderson, S. T. Elbert, I. Gorton, and D. K. Gracio, “The changing paradigm of data-intensive computing,” Computer, vol. 42, pp. 26–34, 2009.
6. P. Mell and T. Grance, “The NIST definition of cloud computing,” National Institute of Standards and Technology, 2009.
7. C. Wang, Q. Wang, K. Ren, and W. Lou, “Ensuring data storage security in cloud computing,” Cryptology ePrint Archive, Report 2009/081, 2009.
8. R. L. Grossman, “The case for cloud computing,” IT Professional, vol. 11, pp. 23–27, 2009.
9. R. L. Grossman and Y. Gu, “On the varieties of clouds for data intensive computing,” Mar. 2009.
10. T. Wood, G. Tarasuk-Levin, P. Shenoy, P. Desnoyers, E. Cecchet, and M. D. Corner, “Memory buddies: Exploiting page sharing for smart colocation,” in 5th ACM International Conference on Virtual Execution Environments, 2009.
11. D. S. Linthicum, Cloud Computing and SOA Convergence in Your Enterprise: A Step-by-Step Guide, 1st ed. Addison-Wesley Professional, 2009.
12. “Jericho Cloud Cube Model,” http://www.opengroup.org/jericho/cloud cube model v1.0.pdf.
13. C. A. Yfoulis and A. Gounaris, “Tc3 health case study: Amazon web services,” 2009.
14. R. Gellman, “Privacy in the clouds: Risks to privacy and confidentiality from cloud computing,” Tech. Rep., Feb. 2009.
15. D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov, “Eucalyptus: A technical report on an elastic utility computing architecture linking your programs to useful systems,” 2008.
16. “Amazon Web Services,” http://aws.amazon.com.
17. J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, January 2008.
18. D. Borthakur, The Hadoop Distributed File System: Architecture and Design, The Apache Software Foundation, 2007.
19. “Google AppEngine,” http://code.google.com/appengine.
20. “Microsoft Azure: Windows Azure,” http://www.microsoft.com/windowsazure/.
21. “Proofpoint,” http://www.proofpoint.com.
22. “Salesforce: CRM & Cloud Computing,” http://www.salesforce.com.
23. “Sun Open cloud Platform,” http://www.sun.com.
24. “Rackspace: Rackspace Managed Hosting,” http://www.rackspace.com/.