2010 IEEE 3rd International Conference on Cloud Computing

Preventing Information Leakage from Indexing in the Cloud

Anna Squicciarini†, Smitha Sundareswaran†, Dan Lin‡

† College of Information Sciences and Technology, The Pennsylvania State University, USA
[email protected], [email protected]
‡ Department of Computer Science, Missouri University of Science and Technology, USA
[email protected]
To allay users' privacy concerns, technological mechanisms are desired that strongly enforce users' privacy policies whenever the users' data is accessed by the service providers. Besides enforcing users' privacy policies on their actual data files, we observe another interesting and very important privacy problem caused by data indexing. Indexes may contain a great amount of information concerning the data itself. Since indexes are usually constructed after the service provider receives the user's data and decides to build indexes to improve search performance, users may not even be aware of such usage of their data, which probably leaks much more information than the users intended. In this paper, we aim to address this new and critical privacy problem caused by data indexing. We propose a Three-Tier Data Protection Architecture to accommodate various needs from different users. For example, users are able to specify different access control rights over sensitive and non-sensitive portions of their data files and to request different levels of privacy protection. To implement this architecture, we propose the portable data binding technique, which leverages nested JAR (Java ARchive) files to achieve tight coupling of users' data and indexing prevention policies as well as strong enforcement, while relying on JAAS for authentication. Our approach has the following major advantages. First, it ensures full control from the user end and does not rely on trusted computing architectures. Second, server authentication is achieved by leveraging the SAML infrastructure [11]. SAML is widely used to ensure user and server authentication; hence, our approach does not introduce any further authentication mechanisms to guarantee proper enforcement. Third, it introduces little overhead since it does not require a specific service provider to apply any encryption-based approach.
The rest of the paper is organized as follows. Section II introduces related work, including a brief background about the cloud. Section III gives the problem statement. Section IV presents our proposed Three-Tier Data Protection Architecture, followed by the detailed enforcement techniques elaborated in Section V. Section VI provides a case study, and Section VII shows initial experimental results. Finally, Section VIII concludes the paper.
Abstract—Cloud computing enables highly scalable services to be easily consumed over the Internet on an as-needed basis. While cloud computing is expanding rapidly and is used by many individuals and organizations internationally, data protection issues in the cloud have not been carefully addressed at the current stage. Users' fear of confidential data (particularly financial and health data) leakage and loss of privacy in the cloud may become a significant barrier to the wide adoption of cloud services. In this paper, we explore a newly emerging problem of information leakage caused by indexing in the cloud. We design a three-tier data protection architecture to accommodate various levels of privacy concerns by users. Based on this architecture, we develop a novel portable data binding technique to ensure strong enforcement of users' privacy requirements at the server side.
I. INTRODUCTION

Cloud computing is a means by which highly scalable, technology-enabled services can be easily consumed over the Internet on an as-needed basis [19]. This new and exciting paradigm has generated significant interest in the marketplace and the academic world [22], [23], resulting in a number of notable commercial and individual cloud computing services, e.g., from Amazon, Google, Microsoft, Yahoo, and Salesforce [15]. Also, top database vendors, like Oracle, are adding cloud support to their databases. The concept of the cloud covers a number of implementations, based on the services they provide, from application service provisioning, grid and utility computing, to Software as a Service [19], [21], [29]. Regardless of the specific architecture, the overarching concept of this computing model is that customers' data, which may belong to individuals, organizations, or enterprises, is processed remotely on unknown machines that users do not own or operate. The convenience and efficiency of this approach, however, come with privacy and security risks [15], [16], [17]. A significant barrier to the adoption of cloud services is users' fear of confidential data (particularly financial and health data) leakage and loss of privacy in the cloud [14], [19], which may prove fatal to many different types of cloud services [14], [24].
Such issues are due to the fact that, in the cloud, users' data and applications reside, at least for a certain amount of time, on the cloud cluster which is owned and maintained by a third party. Concerns arise since in the cloud it is not clear to individuals why their personal information is requested or how it will be used or passed on to other parties. Despite increased awareness of the privacy issues in the cloud, little work has been done in this space. Recently, Pearson et al. have proposed accountability mechanisms to address privacy concerns of end users [24] and have then developed a simple solution, a privacy manager, relying on obfuscation techniques [25]. Their basic idea is that the user's private data is sent to the cloud in an obfuscated form and the processing is done on the obfuscated data; the output of the processing is de-obfuscated by the privacy manager to reveal the correct result. However, the privacy manager provides only limited features, in that it does not guarantee protection once the data is disclosed. In addition, general outsourcing techniques have been investigated over the past few years [2], [18], [34], and several cryptographic approaches for ensuring remote data integrity have been proposed. Although only [38] is specific to the cloud, some of the outsourcing protocols may also be applied in this realm. We would like to mention that, in this work, we do not cover issues of data storage security, which are a complementary aspect of the privacy issues.
II. BACKGROUND

A. The Cloud

Amazon [36], Google [13], Microsoft [3], Salesforce.com [31], and Sun are considered among the key players in the cloud computing market, but they represent only a small portion of the providers in this space. Other emerging cloud providers are Proofpoint [28], RightScale [12], and Workday [39]. We now summarize the features of some of the most well-known providers, highlighting the major differences among them. Amazon Web Services (AWS) offers a number of infrastructure-related web services, including the Elastic Compute Cloud (EC2), Simple Storage Service (S3), CloudFront, SimpleDB, and the Simple Queue Service (SQS). EC2 provides resizable computing capacity in the cloud. It allows scalable deployment of applications by providing web services interfaces through which customers can create virtual machines. S3 is used to store and retrieve unlimited amounts of data at any time from the web. CloudFront is a content delivery network. SimpleDB provides core database functions of data indexing and querying. SQS is a distributed queue messaging service that supports the programmatic sending of messages via web services applications. AWS can be used for several purposes, in that it supports various operating systems and programming languages. For example, organizations can leverage the AWS worldwide network of edge servers to minimize degradation of content delivery and services by using CloudFront and S3. Organizations can also leverage AWS as an option for managing internal backups as an alternative to on-site storage infrastructure. While AWS offers an infrastructure, Google App Engine and Microsoft Azure Services offer platforms as a service for building and hosting web applications on the web infrastructure. They can be used for multiple purposes, such as messaging, securing email systems, collaboration, and application development. For example, customers can now develop their applications by using the cloud services without having their own infrastructure installed locally. Recently, Sun has been promoting an open cloud philosophy, targeting primarily the developer community. Their Open Cloud Platform plans to offer an infrastructure related to servers, storage, and databases. Developers will access the Sun public cloud services from a web browser to provision resources on their platform of choice. Compared with its competitors, it is expected to stand out for supporting various operating systems and programming languages as well as virtual data center capabilities.
C. Indexing

The most common scheme for supporting efficient search over distributed content is to build a centralized inverted index. The index maps each term to the set of documents that contain the term, and is queried by the searcher to obtain a list of matching documents. This scheme is usually adopted by web search engines [7] and mediators. As suggested in [4], the scheme can be extended to support access-controlled search by propagating access policies along with content to the indexing host. The index host must apply these policies for each searcher to filter search results appropriately. Since only the indexing host needs to be contacted to completely execute a search, searches are highly efficient. A centralized index, however, exposes content providers to anyone who has access to the index structure. This violation of access control may not be tolerable in the cloud, where assumptions on the trustworthiness of the indexing server no longer hold. Further, compromise of the index host by hackers could lead to a complete and devastating privacy loss. Decentralized indexing is an alternative architecture, in which indexes are used to identify a set of documents (or hosts of documents) matching the searcher's query. These hosts are then contacted directly by the searcher to retrieve the matching documents. Access control can be supported by simply having the providers enforce their access policies before providing the documents. However, indexes are still hosted by untrusted machines over which the providers themselves have no control.
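For concreteness, the following minimal Java sketch (ours, not taken from the cited systems) illustrates a centralized inverted index that filters query results per searcher, in the spirit of the access-controlled search described above; the document identifiers and reader names are illustrative.

    import java.util.*;

    // Minimal sketch of a centralized inverted index that filters results per searcher.
    public class AccessControlledIndex {
        private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();
        private final Map<String, Set<String>> allowedReaders = new HashMap<String, Set<String>>();

        public void addDocument(String docId, Collection<String> terms, Set<String> readers) {
            allowedReaders.put(docId, readers);
            for (String term : terms) {
                Set<String> docs = postings.get(term);
                if (docs == null) {
                    docs = new HashSet<String>();
                    postings.put(term, docs);
                }
                docs.add(docId);
            }
        }

        // Returns only the matching documents the searcher is authorized to see.
        public Set<String> search(String term, String searcher) {
            Set<String> result = new HashSet<String>();
            Set<String> docs = postings.get(term);
            if (docs == null) return result;
            for (String docId : docs) {
                if (allowedReaders.get(docId).contains(searcher)) result.add(docId);
            }
            return result;
        }

        public static void main(String[] args) {
            AccessControlledIndex index = new AccessControlledIndex();
            index.addDocument("doc1", Arrays.asList("invoice", "acme"),
                    new HashSet<String>(Arrays.asList("alice")));
            index.addDocument("doc2", Arrays.asList("invoice", "meeting"),
                    new HashSet<String>(Arrays.asList("alice", "bob")));
            System.out.println(index.search("invoice", "bob")); // prints [doc2]
        }
    }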
B. Privacy Issues in Cloud Computing

Cloud computing raises a range of important privacy issues, as acknowledged by a number of recent works [9], [15], [16], [17], [19], [24].
In order to overcome the aforementioned issues, recent works have explored the possibility of creating private indexes by relying on predicate-based cryptography [10], [40]. While notable, these works lack concrete applicability due to their key management requirements and computational overhead. Our approach deals with the privacy problem from a different perspective, since it empowers users to gain better control over the indexes and to adapt the level of searchability of the protected data to the unique privacy needs of the data itself and the trust placed in the remote servers. Finally, Agrawal et al. [4] recently proposed an interesting approach to private indexing. The authors introduce a distributed access-control enforcing protocol, which differs from our approach in the assumptions and the type of architecture employed.
III. PROBLEM STATEMENT

A user who subscribes to a certain cloud service usually needs to send his data, as well as associated access control policies (if any), to the service provider. Once the data is received, the service provider has been granted access rights on the data. As long as the access rights include "read", it is then possible for the service provider to construct indexes on the user's data for the purpose of improving search performance. Such indexes may contain sensitive keywords, extracted from the user's actual data and stored in separate files. However, index files are usually not protected by means of access control policies, nor are they constructed with privacy in mind, often causing information leakage. In other words, users have no control over indexes generated by the service provider. Therefore, even if the user's data file is well protected, sensitive information can still be disclosed through unprotected index files. To ensure users' data privacy, we aim to develop techniques which satisfy the following requirements.
1) Users' actual data files should be protected according to the specified access control policies. The proposed techniques should go beyond the traditional access control mechanism, which simply verifies the service provider's authentication and has no control over the file after the access rights are granted to the service provider. Therefore, strong policy enforcement is needed to guarantee the protection stated by the policies.
2) Users should have a certain degree of control over the server's usage of indexing techniques on their data files. For example, if a user disallows any indexing on his data file, the proposed technique should not only protect the actual data file but also prevent the service provider from indexing the file.
3) A user may have different levels of privacy concerns over his data file. For example, only a portion of the user's data file may contain sensitive information. The user may want to obtain the highest level of protection for that particular portion while allowing indexing on other parts of the file. Therefore, the proposed techniques should be able to accommodate a variety of data protection needs.
4) The proposed technique should not intrusively monitor the recipient's system, nor should it introduce too much communication and computation overhead, which would otherwise hinder its feasibility and adoption in practice. Further, the proposed technique should be flexible and adapt to possible modifications of the access control policies without requiring full recomputation at each policy update.
D. Policy Enforcement

Our research on policy enforcement shares some similarity with approaches aiming at protecting objects through a strong coupling of policies and data. One notable proposal is represented by sticky policies [20], [27], [33], which associate disclosure policies with personal data and increase the accountability of the involved parties [20], [27]. Sticky policies rely on cryptographic algorithms to guarantee content protection [37]. Unlike our enforcement mechanism, sticky policies do not go beyond providing access control, i.e., once the access rights are evaluated and access is granted, the users' data is fully available to the authorized party, and there is no way to control its usage. A well-known approach to deploying sticky policies is based on identity-based encryption technology and requires trusted computing platforms to ensure accountability [20]. However, these requirements are difficult to satisfy in the cloud. Further, since encryption is applied only to text files that adhere to a predefined structure, policies are relatively easy to corrupt, and a skilled hacker may tamper with the file and make the policy illegible. Other proposed approaches to deploying sticky policies [37] resulted in stronger protection at the cost of poor scalability, especially in dynamic environments. Our envisioned system overcomes these limitations by proposing a substantially different and more effective approach to back-end protection of users' data. Though cryptography is required for guaranteeing the confidentiality of the data and the policy, our enforcement strategy is carried out by executable policies that are safely encoded and tamper-proof. Other related content protection schemes are watermarking schemes [26], anonymization/obfuscation techniques [5], and access control mechanisms [8]. However, none of these approaches provides any support for protecting the data after its access. Although this limitation may be overcome by adding auditing or monitoring techniques [24], the resulting solution is not suitable for the cloud, due to the stringent architectural requirements mandated by monitoring techniques and the lack of user control.
IV. THREE-TIER DATA PROTECTION ARCHITECTURE FOR PREVENTING INFORMATION LEAKAGE

To prevent information leakage from indexing, we propose a three-tier data protection architecture. This architecture provides three degrees of privacy protection.
V. PORTABLE DATA BINDING

The portable data binding is an executable enforcement mechanism. Our basic idea is to leverage the programmable capability of JAR files to enclose users' data and associated policies so that access rights can be converted into executable code. The detailed techniques include authentication in indexing prevention policies, binding of policies and data, and policy enforcement.
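As a minimal illustration of this JAR-based packaging idea (the nested sub-JAR/super-JAR structure is detailed in Section V-B), the following Java sketch wraps an already-encrypted data file and its IPP into a sub-JAR and then places the sub-JAR inside a super-JAR. The entry names and placeholder contents are our assumptions, not a layout prescribed by the system.

    import java.io.ByteArrayOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.jar.JarEntry;
    import java.util.jar.JarOutputStream;

    public class NestedJarBuilder {
        // Writes one named entry into an open JAR stream.
        static void writeEntry(JarOutputStream jar, String name, byte[] content) throws IOException {
            jar.putNextEntry(new JarEntry(name));
            jar.write(content);
            jar.closeEntry();
        }

        public static void main(String[] args) throws IOException {
            byte[] encryptedData = "...ciphertext bytes...".getBytes("UTF-8");      // placeholder
            byte[] ippPolicy = "grant ... { permission ...; };".getBytes("UTF-8");  // placeholder IPP

            // Build the sub-JAR in memory: encrypted data plus its IPP.
            ByteArrayOutputStream subBytes = new ByteArrayOutputStream();
            JarOutputStream subJar = new JarOutputStream(subBytes);
            writeEntry(subJar, "data.enc", encryptedData);
            writeEntry(subJar, "ipp.policy", ippPolicy);
            subJar.close();

            // Wrap the sub-JAR (plus any siblings) into the super-JAR on disk.
            JarOutputStream superJar = new JarOutputStream(new FileOutputStream("super.jar"));
            writeEntry(superJar, "JARcore.JAR", subBytes.toByteArray());
            superJar.close();
            System.out.println("Wrote super.jar containing one sub-JAR");
        }
    }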
[Figure 1. Overview of the Three-Tier Data Protection Architecture: the user's data and policies pass through the request analyzer to the strong, medium, or low protection component, and the resulting JARs are output to the service provider.]

A. Authentication in Indexing Prevention Policies
The indexing prevention policy (IPP) contains a set of individual rules addressing how a service provider should handle the user's data. This policy is composed as a result of a service level agreement with the cloud customers, who indicate their privacy concerns with respect to the data files they release to the servers. The IPP is written as a Java policy, which consists of access control rights that permit read, write, or execute actions on selected system resources. To compose an IPP, we employ the Role-Based Access Control (RBAC) model [32]. Each service provider is treated as a unique entity that fulfills a particular role at any given point in time. The IPP specifies what role a particular service provider has to be assigned so as to obtain access to a particular resource, such as a file. Example policies are reported in Section VI. As for authentication, we leverage the Java Authentication and Authorization Service (JAAS) infrastructure [1] and combine it with the RBAC model. In particular, the role of a service provider is retrieved from an Identity Provider (IDP) in the form of a SAML assertion [11]. In our context, the IDP is used for server-based authentication and is trusted to issue, upon request, a unique SAML assertion related to a given provider. Two attributes carried by the SAML assertion are relevant: (i) the domain that the service provider belongs to; (ii) whether the service provider performs a particular function. This information is more important than some hardware-based "identity" of the service provider, since our aim is to enforce IPPs based on whether or not a given service provider can carry out a given function or provide a particular service. In addition, the same SAML IDP can also be employed for end-user authentication. This does not require any extension or addition to the SAML infrastructure, which can be used as is, upon proper configuration [30]. The service provider's unique identity (SAML assertion) is then mapped into a JAAS subject for ensuring enforcement. JAAS not only manages identities but also supports a lightweight policy specification. Specifically, JAAS allows access rights to be specified in a form that is handled directly by the Java Virtual Machine without any need for serialization.
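To make this mechanism concrete, the following minimal Java sketch shows the kind of principal-based check JAAS supports: a role string taken from the provider's SAML assertion is wrapped in a Subject, and a FilePermission check is evaluated against a principal-based policy such as the IPP grants shown in the case study. The RolePrincipal class and the role string are illustrative stand-ins for the implementation's own principal class, and the JVM is assumed to be started with the IPP installed as its Java policy (e.g., via -Djava.security.policy) so that the grants take effect.

    import java.io.FilePermission;
    import java.security.AccessController;
    import java.security.Principal;
    import java.security.PrivilegedAction;
    import java.util.Collections;
    import javax.security.auth.Subject;

    // Illustrative principal; the policies in the case study use
    // sample.principal.SamplePrincipal in the same role.
    final class RolePrincipal implements Principal, java.io.Serializable {
        private final String name;
        RolePrincipal(String name) { this.name = name; }
        public String getName() { return name; }
        public boolean equals(Object o) {
            return (o instanceof RolePrincipal) && name.equals(((RolePrincipal) o).name);
        }
        public int hashCode() { return name.hashCode(); }
    }

    public class IppCheck {
        public static void main(String[] args) {
            // Role extracted from the provider's SAML assertion (assumed input).
            Subject provider = new Subject(true,
                    Collections.<Principal>singleton(new RolePrincipal("Indexing:SP1")),
                    Collections.emptySet(), Collections.emptySet());

            // Evaluate a permission check under the provider's identity; the installed
            // Java policy (the IPP) decides whether "read" on the sub-JAR is granted.
            Subject.doAsPrivileged(provider, new PrivilegedAction<Void>() {
                public Void run() {
                    AccessController.checkPermission(
                            new FilePermission("JARauxiliary.JAR", "read"));
                    System.out.println("read on JARauxiliary.JAR permitted for this role");
                    return null;
                }
            }, null);
        }
    }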
An overview of the architecture is given in Figure 1. According to the user's confidentiality requirements and trust assumptions on the server, the request analyzer selects one of the following three components: (i) strong protection; (ii) medium protection; (iii) low protection. The output to be sent to the service provider consists of JAR files which enclose both data and policies.
• Strong protection: The service provider is not allowed to read the sensitive portion of the user's file, so as to negate the risk of indexing being conducted on the sensitive portion of the document, which could lead to privacy leaks.
• Medium protection: The service provider is prevented from "effective" indexing. In general, the purpose of indexing is to speed up the search for a desired data item through random access. Once random access is disallowed, indexes become useless. Therefore, we propose an approach to disable random access to the data items in the user's file. Our approach does not rely on access control policies. Instead, we prevent random access by forcing the server to read data in sequential order (a minimal sketch of such a sequential-only reader is given below). Even if an index is constructed on data copies, its effectiveness will be compromised when the data is periodically updated by the user.
• Low protection: The user specifies clearly in the policy the usage of his data file and the usage of indexing. The service provider is assumed to be trusted and will inform and negotiate with the user the keywords to be used for indexing purposes.
We propose a novel and generic technique, called portable data binding, to enforce the strategies adopted by the three components. In particular, we first define the indexing prevention policy to specify the access rights that a service provider obtains to deal with the user's data. The policies are tightly coupled with the user's data by physically attaching the two, so that the data is not left unprotected at any time. Then policies and data are transported together. Our technique ensures that policy enforcement can be carried out at any point in time at any service provider. In doing so, the confidentiality of both data and policies is always guaranteed. Details of the portable data binding are presented in the following section.
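The sequential-only reader forward-referenced in the medium-protection item above could be sketched as follows. This is a simplified illustration of the idea (the actual mechanism works on decrypted partitions inside the sub-JAR): the stream wrapper simply refuses the operations a reader would need to jump around in the data, so content can only be consumed front to back.

    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Minimal sketch (ours) of the "sequential access only" idea behind medium protection.
    public class SequentialOnlyInputStream extends FilterInputStream {
        public SequentialOnlyInputStream(InputStream in) { super(in); }

        @Override public long skip(long n) throws IOException {
            throw new IOException("random access disabled: skip() not permitted");
        }
        @Override public boolean markSupported() { return false; }
        @Override public synchronized void mark(int readlimit) { /* marking disabled */ }
        @Override public synchronized void reset() throws IOException {
            throw new IOException("random access disabled: reset() not permitted");
        }
    }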
As a result, IPPs specified using JAAS are much lighter than the equivalent XML files that are traditionally used for policy specification. For example, a JAAS policy can be ten times lighter, at just 2 KB, compared to a similar XML policy file of about 20 KB. Besides these attractive properties, we also employ JAAS due to its superiority over traditional Java authentication. JAAS augments traditional authentication to allow for authentication in a pluggable fashion. That is, JAAS permits applications to remain independent from the underlying authentication mechanisms, so that even newer authentication mechanisms can be integrated into an application without changing its underlying structure or other major components. Also, JAAS allows us to compose more expressive and user-centric policies. Traditional Java authentication is based merely on code characteristics, while JAAS allows us to specify policies based also on attributes belonging to users or servers, such as users' names and service providers' functions.

B. Binding Policies and Data

In order to closely bind the user's data with the corresponding IPP for the purpose of strong enforcement, we employ nested JARs. Each protected data file is encapsulated into two types of JAR files: a sub-JAR (or inner JAR) and a super-JAR (or outer JAR). A super-JAR can contain multiple sub-JARs. In what follows, we present the details of the JAR construction. A sub-JAR encloses the actual data file and programmable constraints converted from the access control rights specified in the IPP. The data file is in an encrypted form, along with the class files required to decrypt it. The key used for encryption relies on a random secret (S) entered at creation time by the owner of the data. The password is generated by repeated hashing with a pseudo-random cipher as per the specifications of Password-Based Key Derivation Functions [6], thus ensuring a strong cryptographic key (a hedged sketch of this derivation is given after the list below). As for the associated access control rights, the construction differs slightly across the three levels of data protection.
• Strong protection: Users need to provide the sensitive fields of their data files, which can be specified in the IPP and serve as input to the sub-JAR. The sub-JAR performs the function of selecting which fields are to be read. It also carries out the functions of running the applet, displaying the file, and copying the file, excluding the protected fields, to a temporary file. The protected fields are simply skipped during the sequential reading of the file by identifying the position where each protected field starts. To identify the position, each field name is compared with the user input, and the corresponding field is skipped if there is a match.
• Medium protection: This option disables random access to the data file in order to prevent effective indexing
over the file. The service provider is forced to read the file in sequential order before it can locate the content that it needs. To achieve this, we ensure that the data stored in the sub-JAR is only ever accessible sequentially. While this does not tamper with the process of forming an index itself, it ensures that the index is rendered ineffective. In particular, the sub-JAR partitions the encrypted file into multiple parts, while retaining another master copy of the file which is not partitioned and is encrypted using a different key. The sub-JAR decrypts the file on the fly and creates a decrypted temporary file which contains the content of only one of the partitions. The temporary file is displayed to the service provider using the Java application viewer. Since the partitions are always displayed in sequential order, the service provider is not able to randomly access them. Further, the sub-JAR performs a temporal update on the file's partitions in which it scrambles the order of the partitions, thus rendering any indexes made on copies of the data useless. The sub-JAR can retrieve the original order of the data by referring to the master copy of the file. To further decrease the risk of the data being copied so as to eventually perform indexing on it, the original file is periodically updated, to minimize the usefulness of copied file versions.
• Low protection: The sub-JAR simply decrypts and presents the entire file unedited to the service provider in the form of a temporary file. In this case, the sub-JAR provides basic data protection functions such as disallowing unauthorized access or copying operations.
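The key-derivation step referenced before the list could look like the following minimal Java sketch; the iteration count, key length, and the use of AES are illustrative assumptions rather than parameters mandated by the system.

    import java.security.SecureRandom;
    import javax.crypto.SecretKey;
    import javax.crypto.SecretKeyFactory;
    import javax.crypto.spec.PBEKeySpec;
    import javax.crypto.spec.SecretKeySpec;

    public class SubJarKeyDerivation {
        // Derives the sub-JAR content-encryption key from the owner's random secret S.
        // The iteration count (10,000) and 128-bit AES key size are illustrative only.
        public static SecretKey deriveKey(char[] ownerSecret, byte[] salt) throws Exception {
            PBEKeySpec spec = new PBEKeySpec(ownerSecret, salt, 10000, 128);
            SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1");
            byte[] keyBytes = factory.generateSecret(spec).getEncoded();
            return new SecretKeySpec(keyBytes, "AES");
        }

        public static void main(String[] args) throws Exception {
            byte[] salt = new byte[16];
            new SecureRandom().nextBytes(salt);
            SecretKey key = deriveKey("owner-random-secret".toCharArray(), salt);
            System.out.println("Derived key length (bytes): " + key.getEncoded().length);
        }
    }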
The super-JAR is responsible for managing the authentication of the data recipient (i.e., the service provider). The authorization is carried out by one of the enclosed sub-JARs, according to the constraints specified in the associated IPP. As discussed in the next subsection, the super-JAR obtains the SAML assertion from the SAML IDP and checks the service provider's attributes in order to verify whether it is authorized. Finally, all the sub-JARs and the super-JAR are signed and sealed in order to prevent the recipients from modifying the code or trying to access the class files from outside the package [35]. This additional layer of protection helps us prevent malicious attacks against confidentiality.

C. Enforcement Process

We now proceed to describe the process of the enforcement mechanism. We assume a pull mode, which means that a service provider is provided with the JAR files or can retrieve them independently. Each service provider has a JAR-handler script (a simple PHP-based script) installed. When a service provider receives or attempts to access a JAR, the JAR-handler executes the JAR. The service provider is allowed to manage the user's
data in the JAR only according to the IPPs embedded in it. The detailed steps are the following. First, the enforcement engine in the JAR requires a proof of authentication from the service provider. The service provider needs to query the SAML IDP in order to obtain the SAML assertion, which verifies the service provider based not only on its unique identifier but also on its domain and functionality. Once the SAML assertion is received (i.e., authentication succeeds), the service provider can execute the corresponding sub-JAR, and the data in the sub-JAR is opened according to the specific IPP that applies. Notice that the same data can be protected by different IPPs while being stored in the same JAR. For example, if some data is available to multiple service providers belonging to different domains, the IPPs may differ according to the service provider's domain. In such a case, the super-JAR will contain multiple sub-JARs, each of which stores the IPPs for a possible service provider.
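As an illustration of the last point, a super-JAR holding one sub-JAR per provider domain might select the appropriate sub-JAR as in the following sketch; the entry-naming convention and the use of java.util.jar here are our assumptions, not the layout prescribed by the implementation.

    import java.io.InputStream;
    import java.util.jar.JarEntry;
    import java.util.jar.JarFile;

    public class SubJarSelector {
        // assertedDomain would be taken from the provider's SAML assertion; the
        // "ipp-<domain>.JAR" naming convention is purely illustrative.
        public static InputStream selectSubJar(String superJarPath, String assertedDomain)
                throws Exception {
            JarFile superJar = new JarFile(superJarPath);
            JarEntry entry = superJar.getJarEntry("ipp-" + assertedDomain + ".JAR");
            if (entry == null) {
                throw new SecurityException("No sub-JAR provisioned for domain " + assertedDomain);
            }
            return superJar.getInputStream(entry);
        }

        public static void main(String[] args) throws Exception {
            // Usage: java SubJarSelector <path-to-super.jar> <asserted-domain>
            InputStream subJar = selectSubJar(args[0], args[1]);
            System.out.println("Selected sub-JAR, first byte: " + subJar.read());
        }
    }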
[Figure 2. Overview of the Email Integrated Cloud Indexing Service]
More specifically, JARs are flexible and able to handle files encrypted selectively, using any standard algorithm negotiated between the email server and the enterprise. With selective encryption, indexes can be constructed only over the data that is actually rendered at the service provider. The authentication protocol embedded in the JARs enables only the authorized server to actually access the content and extract it in its encrypted form. The data owners can retain the cryptographic key used for encryption and remain free to access such sensitive portions of the data upon request. As for emails with lower confidentiality requirements, such as emails related to business relationships with customers or suppliers but not containing any transaction information, the request analyzer will select the medium protection option. This option allows the service provider to obtain full access to the data but prevents it from creating effective indexes over the sensitive portion of the data. Moreover, access to the data by third parties is also disabled. There are two key advantages of using the medium protection. First, there is no overhead caused by additional encryption. Second, the user (the enterprise) does not need to run any audit or keep monitoring the data, while being assured that indexes of the sensitive portion of the data cannot be physically generated. Lastly, for general emails exchanged within the organization that carry only insensitive information, such as meeting reminders, the organization may not have any stringent requirements on data privacy. Hence, any service level agreement negotiated with the cloud service provider can be applied. The JARs will still provide unique protection features, by guaranteeing that the data is only indexed by the authorized service providers. From an architectural standpoint, tight integration with the cloud archiving infrastructure can be achieved by attaching the request analyzer component to the email appliance used by the cloud service provider to provide the archiving service.
VI. CASE STUDY

Cloud technology is a promising approach to deploying Internet email services, such as archiving business email data to ease the burden on enterprises' servers. An example of this type of email-related cloud service is given by ProofPoint [28]. ProofPoint provides a number of services, including legal compliance, standard encryption, and data management (in terms of data retention), which are well integrated with Microsoft Exchange. ProofPoint stores email archives in a distributed fashion across its grid infrastructure and provides email search using parallel search technology. Enterprises using ProofPoint's service are given only general information concerning how email data is indexed, and whether the indexes are stored anywhere and/or made available to third parties (for performance reasons). That means any user relying on the cloud for similar services has to trust that the cloud service provider will not generate privacy-leaking indexes for data search. Compared to the aforementioned situation, our proposed data protection approach provides users much better control over the usage of their data by other parties in the cloud. In what follows, we use the same example of an email service to illustrate how our approach achieves better privacy protection while minimizing the trust assumptions on the remote servers. Within the same enterprise, the stored email archives may have different degrees of privacy, according to the email content, the involved parties (i.e., senders and recipients), the time the emails originated, and whether or not they convey messages internal to the organization. For highly sensitive emails, e.g., data that requires protection because of the parties involved or the sensitive content, full protection can be guaranteed using the strong protection option in the Three-Tier Data Protection Architecture. Our approach prevents the cloud service provider from reading and/or indexing portions of the sensitive data, while still guaranteeing the data storage service.
The request analyzer will act on top of the functionality offered by the appliance, ensure the security of archived emails by assessing the type of indexing protection to apply, and render an integrated service with the other archiving products offered by the cloud. For example, after checking for email compliance, it will apply the cryptographic protocols supported by the provider using the JARs for stronger assurance. A sketch of the integrated architecture is shown in Figure 2. Finally, we give example IPPs associated with emails. The names of the principal and the sub-JARs in the policies are presented in clear text for readability, while in the actual implementation they are randomized. According to the first policy shown below, when an email is considered insensitive and the service provider is authenticated as an indexing server (SP1) with complete access to the email, the super-JAR is granted the "read" and "execute" permissions for JARauxiliary, which is the sub-JAR used for granting Low Protection.
[Figure 3. Overhead of IPPs]
We performed two sets of experiments to evaluate our initial prototype. In the first round of experiments, we evaluated the overhead of using Indexing Prevention Policies (IPPs). In our approach, data files are encrypted and decrypted based on the access control rights specified in IPPs. Figure 3 shows the total time taken to encrypt and decrypt images when IPPs are used along with different encryption and decryption algorithms (i.e., AES, 3DES, PBECipher, RSA, RC2). As for content encryption, we used the RSA-based implementation of Password-Based Key Derivation [6]. Observe that the 3DES algorithm (DESede), along with RC2, added the least amount of overhead (up to 890 ms), whereas the PBE cipher incurred the maximum overhead (up to 0.5 s) as the size of the JAR files increased. The results show that the addition of IPPs does not introduce much time overhead (less than 1 s). The second set of experiments evaluated the processing time of the strong-protection strategy, which is the most demanding and time-consuming of the three protection strategies. Recall that the strong-protection strategy removes protected fields from the user's data file and creates a new file for indexing. In the experiments, we varied the file size and the number of words to be protected (i.e., the product of the number of input words and the number of occurrences of the words), and we tested different file types, including emails, chat archives, text files, and PDF files. Table I shows the input parameters (the first four columns) and the results (the last two columns). The size of the file for indexing is determined by the length of the word/field selected as input multiplied by the number of times it occurs in the document. We can observe that the time increases with the original file size and the amount of file reduction (i.e., the difference between the original file and the indexing file). This is mainly due to the increased amount of data that needs to be removed for privacy protection. However, we can also see that the time taken to process a 51 MB file is only 62 ms, which indicates that even with strong protection the overhead is negligible.
grant codebase "file:./SampleAction.JAR",
  Principal sample.principal.SamplePrincipal "Indexing:SP1" {
    permission java.io.FilePermission "JARauxiliary.JAR", "read";
    permission java.io.FilePermission "JARauxiliary.JAR", "execute";
};
According to the second policy below, when an email is considered sensitive and should never be disclosed, a service provider authenticated as an indexing service provider (SP2) obtains only very limited access: the super-JAR is granted the "read" and "execute" permissions for JAR-Core, which is the sub-JAR used for granting Strong Protection.

grant codebase "file:./SampleAction.JAR",
  Principal sample.principal.SamplePrincipal "Indexing:SP2" {
    permission java.io.FilePermission "JARcore.JAR", "read";
    permission java.io.FilePermission "JARcore.JAR", "execute";
};
VII. INITIAL EVALUATION

We constructed a simple testbed by creating a small cloud composed of three SAML-enabled servers: two service providers and one Identity Provider (IDP). Our tests were conducted using one Dell laptop and two HP desktops. The HP desktops are HP Pavilion A6700F PCs with a 1.8 GHz AMD Phenom X4 9150e quad-core processor, 4 GB RAM, a 500 GB hard drive, and a Vista Premium OS. The Dell laptop has an Intel Core 2 Duo T7250 CPU @ 2.00 GHz, 2.00 GB RAM, a 250 GB hard drive, and a Windows XP Professional OS.
[12] RightScale. Cloud Computing. Delivered. http://www.rightscale.com/.
[13] Google Application Engine. http://code.google.com/appengine/.
[14] R. Gellman. Privacy in the clouds: Risks to privacy and confidentiality from cloud computing. World Privacy Forum, 2009.
[15] P. T. Jaeger, J. Lin, and J. M. Grimes. Cloud computing and information policy: Computing in a policy cloud? Journal of Information Technology and Politics, 5(3), 2009.
[16] B. R. Kandukuri, R. P. V., and A. Rakshit. Cloud security issues. In IEEE International Conference on Services Computing (SCC), pages 517–520, 2009.
[17] L. M. Kaufman. Data security in the world of cloud computing. IEEE Security and Privacy, 7(4):61–64, 2009.
[18] M. Lillibridge, S. Elnikety, A. Birrell, M. Burrows, and M. Isard. A cooperative Internet backup scheme. In USENIX Annual Technical Conference, pages 29–41, 2003.
[19] T. Mather, S. Kumaraswamy, and S. Latif. Cloud Security and Privacy: An Enterprise Perspective on Risks and Compliance (Theory in Practice). O'Reilly, first edition, 2009.
[20] M. C. Mont, S. Pearson, and P. Bramhall. Towards accountable management of privacy and identity information. In Proc. of the European Symposium on Research in Computer Security (ESORICS), pages 146–161, 2003.
[21] Cloud computing: Clash of the clouds. The Economist, 2009.
[22] IEEE International Conference on Cloud Computing. http://thecloudcomputing.org/2009/2/.
[23] First Workshop on Scientific Cloud Computing. http://dsl.cs.uchicago.edu/ScienceCloud2010/, 2010.
[24] S. Pearson and A. Charlesworth. Accountability as a way forward for privacy protection in the cloud. Hewlett-Packard Development Company (HPL-2009-178), 2009.
[25] S. Pearson, Y. Shen, and M. Mowbray. A privacy manager for cloud computing. In CloudCom, pages 90–106, 2009.
[26] L. Perez-Freire, P. Comesana, J. Troncoso-Pastoriza, and F. Perez-Gonzalez. Watermarking security: A survey. In LNCS Transactions on Data Hiding and Multimedia Security, 2006.
[27] H. C. Pöhls. Verifiable and revocable expression of consent to processing of aggregated personal data. In Proc. of the International Conference on Information and Communications Security (ICICS), pages 279–293, 2008.
[28] Proofpoint. http://www.proofpoint.com/.
[29] B. P. Rimal, E. Choi, and I. Lumb. A taxonomy and survey of cloud computing systems. In International Conference on Networked Computing and Advanced Information Management, pages 44–51, 2009.
[30] A. Rybicki. Delegated SAML authentication library for portlet developers. http://www.ja-sig.org/wiki/display/UPM31/02+-+Delegated+SAML+Authentication+Library+for+Portlet+Developers.
Table I. Processing Time When Using Strong Protection

File Type | Size of original file (KB) | No. of input words (or field names) | Total no. of occurrences of word/field | File size for indexing (KB) | Processing time (ms)
email |   345 |  5 |   5 |   137 | 0.0001
email |  3290 |  5 |  16 |  1570 | 16
email |  4551 |  5 |   5 |  4375 | 32
email |  1474 |  6 |   7 |  1300 | 16
email |   478 |  6 |   7 |   192 | 0.0001
chat  |  5664 |  8 | 145 |  5099 | 62
chat  |   440 |  7 |   7 |   276 | 0.0001
chat  |  5664 |  9 | 112 |  5224 | 62
chat  |   397 |  6 |   6 |   232 | 0.0001
text  | 45876 |  7 | 105 | 45238 | 62
text  | 48978 |  3 |  73 | 48520 | 62
text  | 37752 |  4 |  65 | 37316 | 62
text  | 50886 |  5 |  85 | 50362 | 62
pdf   | 47654 |  8 | 100 | 47231 | 62
pdf   | 37361 | 10 |  63 | 37301 | 62
pdf   | 51879 |  5 | 112 | 51362 | 62
VIII. CONCLUSION AND FUTURE RESEARCH
In this paper, we addressed the newly emerging data privacy problems in the cloud caused by indexing. We proposed a three-tier data protection framework consisting of three protection strategies which differ according to the level of privacy required by the end users. We also developed strong enforcement techniques to guarantee that users' privacy requirements are actually fulfilled by the service provider. This work constitutes a first step towards a comprehensive approach to data protection in the cloud, and our approach can be extended along several dimensions. First, we plan to extend our testbed and evaluate, in a larger cloud, all the strategies discussed in the paper. Second, we will conduct an extensive security analysis in order to evaluate the ability of our approach to withstand possible malicious attempts to overcome our indexing protection. Finally, temporal-based data binding approaches will be explored.

REFERENCES
[1] http://java.sun.com/products/archive/jaas/.
[2] G. Ateniese, R. Burns, R. Curtmola, J. Herring, L. Kissner, Z. Peterson, and D. Song. Provable data possession at untrusted stores. In Proc. of the ACM Conference on Computer and Communications Security, pages 598–609, 2007.
[3] Microsoft Azure Platform. http://www.microsoft.com/windowsazure/?wt.srch=1.
[4] M. Bawa, R. J. Bayardo, R. Agrawal, and J. Vaidya. Privacy-preserving indexing of documents on the network. VLDB Journal, 18(4):837–856, 2009.
[5] R. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In Proc. of the International Conference on Data Engineering, pages 217–228, 2005.
[6] Black Duck Koders.com. PBKDF2 Java. http://forums.sun.com/thread.jspa?threadID=5306039.
[7] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Seventh International World-Wide Web Conference (WWW 1998), 1998.
[8] G. Bruns, D. S. Dantas, and M. Huth. A simple and expressive semantic framework for policy composition in access control. In Proc. of the 5th ACM Workshop on Formal Methods in Security Engineering (FMSE), 2007.
[9] A. Cavoukian. Privacy in the clouds. Identity in the Information Society, 1, 2008.
[10] Y.-C. Chang and M. Mitzenmacher. Privacy preserving keyword searches on remote encrypted data. February 2004.
[11] OASIS Security Services Technical Committee. Security Assertion Markup Language (SAML) 2.0. http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=security.
[31] Salesforce. http://www.salesforce.com/.
[32] R. S. Sandhu, E. J. Coyne, H. L. Feinstein, and C. E. Youman. Role-based access control models. Computer, 29(2):38–47, 1996.
[33] M. Schunter and M. Waidner. Simplified privacy controls for aggregated services - suspend and resume of personal data. In Proc. of Privacy Enhancing Technologies, pages 218–232, 2007.
[34] T. J. E. Schwarz and E. L. Miller. Store, forget, and check: Using algebraic signatures to check remotely administered storage. In IEEE International Conference on Distributed Systems, page 12, 2006.
[35] Sealing packages within a JAR file. http://java.sun.com/docs/books/tutorial/deployment/jar/sealman.html.
[36] Amazon Web Services. http://aws.amazon.com/.
[37] W. Tang. On using encryption techniques to enhance sticky policies enforcement. Technical Report TR-CTIT-08-64, Centre for Telematics and Information Technology, 2008.
[38] Q. Wang, C. Wang, J. Li, K. Ren, and W. Lou. Enabling public verifiability and data dynamics for storage security in cloud computing. In ESORICS, pages 355–370, 2009.
[39] Workday. http://www.workday.com/.
[40] Z. Yang, S. Zhong, and R. N. Wright. Towards privacy-preserving model selection. In PinKDD, pages 138–152, 2007.