A Scheme to Manage Encrypted Data Storage with Deduplication in Cloud

Zheng Yan¹,², Wenxiu Ding¹, Haiqi Zhu¹

¹ State Key Lab of Integrated Services Networks, Xidian University, Xi'an 710071, China
² Dept. of Comnet, Aalto University, Espoo 02150, Finland
[email protected],
[email protected],
[email protected]
Abstract. Cloud computing offers a new way of service provision by rearranging various resources and IT structures over the Internet. Private user data are often stored in the cloud in encrypted form in order to preserve the privacy of data owners. Encrypted data sharing introduces new challenges for cloud data deduplication. Existing deduplication solutions suffer from high computational complexity and cost, so few of them can really be deployed in practice. In this paper, we propose a scheme to deduplicate encrypted data stored in the cloud based on proxy re-encryption. We evaluate its performance and advantages through extensive analysis and implementation. The results show the efficiency and effectiveness of the scheme for potential practical deployment.

Keywords: Cloud Computing; Data Deduplication; Proxy Re-encryption
1 Introduction
Cloud computing offers a new way of providing Information Technology services by rearranging various resources (e.g., storage, computing) and providing them to users based on their demands. Cloud computing provides a big resource pool by linking network resources together. It has desirable properties, such as scalability, elasticity, fault-tolerance, and pay-per-use. Thus, it has become a promising service platform. The most important and typical cloud service is data storage. Cloud users upload personal data to a Cloud Service Provider (CSP) and allow it to maintain these data. Rather than fully trusting the CSP, existing research proposes to outsource only encrypted data to the cloud in order to ensure data privacy. But duplicated data in encrypted form could be stored by the same or different users, especially for shared data. Although cloud storage space is huge, this kind of duplication greatly wastes networking resources, consumes a lot of energy, and complicates data management. Deduplication has proved to achieve high space and cost savings: it can reduce storage needs by up to 90-95% for backup applications [9] and up to 68% in standard file systems [10]. Saving storage is becoming a crucial task of
CSP. Obviously, the savings, which can be passed back directly or indirectly to cloud users, are significant to the economics of the cloud business. How to manage encrypted data storage with deduplication in an efficient way is a practical issue. However, current industrial deduplication solutions cannot handle encrypted data. Existing deduplication solutions suffer from brute-force attacks [7, 11-14]. They cannot flexibly support data access control and revocation [16, 18-20]. Most existing solutions cannot ensure reliability, security and privacy with sound performance [23-25]. In practice, it is hard to let data holders manage deduplication themselves, for a number of reasons. First, a data holder may not always be online or available for such management, which could cause long storage delays. Second, a holder-managed deduplication system could be very complicated in terms of communication and computation. Third, discovery of duplicated data by the CSP may intrude on the privacy of data holders. Fourth, owing to data super-distribution, a data holder may have no way to issue access rights or deduplication keys to a user it does not know. Therefore, in many situations the CSP cannot cooperate with data holders on storage deduplication. In this paper, we propose a scheme based on Proxy Re-Encryption (PRE) to manage encrypted data storage with deduplication. We aim to solve the problem of deduplication in situations where the data holder is unavailable or hard to involve. Specifically, the contributions of this paper are twofold. First, we motivate the problem of saving cloud storage while preserving the privacy of data owners, and propose a scheme to manage encrypted data storage with deduplication. Our scheme flexibly supports data update and sharing with deduplication even when the data holder is offline, and it does not intrude on the privacy of data owners.
Second, we prove the security and justify the performance of our scheme through analysis and implementation. The rest of the paper is organized as follows. Section 2 gives a brief overview of related work. Section 3 introduces the system model and preliminaries. Section 4 gives a detailed description of our scheme, followed by security analysis and performance evaluation in Section 5. Finally, a conclusion is presented in the last section.
2 Related Work
Reconciling deduplication and client-side encryption is an active research topic [1]. Cloud storage providers such as Dropbox [2], Google Drive [3], Mozy [4], and others perform deduplication to save space by storing only one copy of each uploaded file. However, if clients conventionally encrypt their data with their own keys, the storage savings of deduplication are lost, because identical data encrypted under different keys yield different ciphertexts. Existing industrial solutions thus fail at encrypted data deduplication. DeDu [17] is an efficient deduplication system, but it cannot handle encrypted data.
Message-Locked Encryption (MLE) intends to solve this problem [5]. The most prominent manifestation of MLE is Convergent Encryption (CE), introduced by Douceur et al. [6] and studied further in [7, 11, 12]. CE has been used within a wide variety of commercial and research storage systems. Letting 𝑀 be a file's contents, hereafter called the message, the client first computes a key 𝐾 ← 𝐻(𝑀) by applying a cryptographic hash function 𝐻 to the message, and then computes the ciphertext 𝐶 ← 𝐸(𝐾, 𝑀) via a deterministic symmetric encryption scheme. A second client 𝐵 encrypting the same file 𝑀 produces the same 𝐶, enabling deduplication. However, CE is subject to an inherent security limitation: susceptibility to offline brute-force dictionary attacks [13, 14]. Knowing that the target message 𝑀 underlying a target ciphertext 𝐶 is drawn from a dictionary 𝑆 = {𝑀1, …, 𝑀𝑛} of size 𝑛, the attacker can recover 𝑀 in the time for 𝑛 = |𝑆| offline encryptions: for each 𝑖 = 1, …, 𝑛, it simply CE-encrypts 𝑀𝑖 to get a ciphertext 𝐶𝑖 and returns the 𝑀𝑖 such that 𝐶 = 𝐶𝑖. (This works because CE is deterministic and keyless.) CE can therefore be secure only when the target message is drawn from a space too large to exhaust. Another problem of CE is that it cannot flexibly support data access control by data holders, especially data revocation, since it is impossible for data holders to generate the same new key for data re-encryption [18, 19]. An image deduplication scheme adopts two servers to achieve verifiability of deduplication [18]. The CE-based scheme described in [19] combines file content and user privilege to obtain a file token, which achieves token unforgeability. However, both schemes directly encrypt data with a CE key and thus inherit the problem described above. To resist manipulation of data identifiers, Meye et al. proposed to adopt two servers for intra-user and inter-user deduplication [20].
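The CE construction and the dictionary attack it enables can be sketched in a few lines of Python. This is only a minimal illustration: the hash-derived keystream is a toy stand-in for a deterministic block-cipher mode (not real AES), and the salary-guessing dictionary is a made-up example.

```python
import hashlib

def _keystream(key: bytes, n: int) -> bytes:
    # Toy deterministic keystream (stand-in for a deterministic AES mode).
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def ce_encrypt(message: bytes) -> bytes:
    # Convergent encryption: the key is derived from the message itself,
    # so identical plaintexts always yield identical ciphertexts.
    key = hashlib.sha256(message).digest()  # K <- H(M)
    return bytes(a ^ b for a, b in zip(message, _keystream(key, len(message))))

def brute_force(target_ct: bytes, dictionary):
    # Offline dictionary attack: CE is deterministic and keyless, so the
    # attacker simply re-encrypts every candidate and compares ciphertexts.
    for m in dictionary:
        if ce_encrypt(m) == target_ct:
            return m
    return None

# Deduplication works because encryption is deterministic...
assert ce_encrypt(b"same file") == ce_encrypt(b"same file")
# ...but a low-entropy message is recovered by exhausting its dictionary.
ct = ce_encrypt(b"salary=52000")
guesses = [b"salary=%d" % s for s in range(50000, 60000, 1000)]
assert brute_force(ct, guesses) == b"salary=52000"
```

The attack needs no key material at all, which is exactly why CE is safe only for unpredictable messages.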
The ciphertext 𝐶 of CE is further encrypted with a user key and transferred to the servers. However, this does not deal with data sharing after deduplication among different users. ClouDedup [16] also aims to cope with the inherent security exposures of CE, but it cannot solve the issue caused by data deletion: a data holder that removes the data from the cloud can still access the same data, since it still knows the data encryption key if the data is not completely removed from the cloud. Bellare et al. [1] proposed DupLESS, which provides secure deduplicated storage that resists brute-force attacks. In DupLESS, a group of affiliated clients (e.g., company employees) encrypt their data with the aid of a Key Server (KS) that is separate from the Storage Service (SS). Clients authenticate themselves to the KS without leaking any information about their data to it. As long as the KS remains inaccessible to attackers, high security can be ensured. However, DupLESS cannot control the data access of other users in a flexible way. Alternatively, a policy-based deduplication proxy scheme [15] was proposed, but it considered neither the management of duplicated data (e.g., deletion, owner change) nor an evaluation of scheme performance. As stated in [21], reliability, security and privacy should all be taken into consideration when designing a deduplication system. The strict latency requirements of primary storage have led to a focus on offline deduplication systems [22]. Recent studies propose techniques to improve their performance, especially restore performance [23-25]. Fu et al. [23] proposed the History-Aware Rewriting (HAR) algorithm to accurately identify and rewrite fragmented chunks, which improves restore performance.
Kaczmarczyk et al. [24] focused on inter-version duplication and proposed Context-Based Rewriting (CBR) to improve the restore performance of the latest backups by shifting fragmentation to older backups. Other work [25] even proposed to forfeit deduplication in order to reduce chunk fragmentation through container capping. In this paper, we apply PRE and symmetric encryption to deduplicate encrypted data. Our scheme resists the attacks mentioned above and achieves good performance without requiring data owners to stay online all the time. Meanwhile, it ensures the confidentiality of stored data and supports digital rights management. We aim to achieve the mandatory requirements of encrypted cloud data deduplication.
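Among the approaches above, DupLESS's server-aided key derivation [1] is instructive: the key server evaluates a deterministic function of 𝐻(𝑀) without ever seeing 𝐻(𝑀), so all holders of the same file derive the same key, yet offline dictionary attacks require the key server's cooperation. The core idea, a blinded RSA evaluation, can be sketched as follows (toy RSA parameters far too small to be secure; names and encodings are illustrative, not DupLESS's actual wire format):

```python
import hashlib, math, secrets

# Toy RSA parameters for the key server (insecure, for illustration only).
P, Q = 10007, 10009
N, E = P * Q, 65537
D = pow(E, -1, (P - 1) * (Q - 1))   # key server's secret exponent

def ks_evaluate(blinded: int) -> int:
    # The key server signs the blinded digest; it never sees H(M) itself.
    return pow(blinded, D, N)

def client_derive_key(message: bytes) -> bytes:
    x = int.from_bytes(hashlib.sha256(message).digest(), "big") % N
    while True:                      # pick a blinding factor coprime to N
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            break
    blinded = (x * pow(r, E, N)) % N            # x * r^e mod N
    signed = ks_evaluate(blinded)               # (x * r^e)^d = x^d * r mod N
    unblinded = (signed * pow(r, -1, N)) % N    # remove r to obtain x^d mod N
    return hashlib.sha256(unblinded.to_bytes(8, "big")).digest()

# Two clients holding the same file derive the same key, enabling
# deduplication, while the random blinding hides H(M) from the key server.
assert client_derive_key(b"report bytes") == client_derive_key(b"report bytes")
```

Because every key derivation requires a round trip to the key server, the server can rate-limit requests, which is what blocks the offline dictionary attack that breaks plain CE.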
3 Problem Statements
3.1 System Model and Security Model
We propose a scheme to deduplicate encrypted data at CSP by applying PRE to issue keys to different authorized data holders. It targets the scenario in which the data holder is not available to be involved in deduplication control.

Fig. 1. System model
As shown in Fig. 1, the system contains three types of entities:
1) CSP, which offers storage services and cannot be fully trusted since it is curious about the contents of stored data, but performs honestly on data storage in order to gain commercial profit;
2) data holders (𝑢𝑖), which upload and save their data at CSP; the system may contain a number of eligible data holders (𝑢𝑖, 𝑖 = 1, …, 𝑛) that could save the same encrypted raw data;
3) an authorized party (AP), which does not collude with CSP and is fully trusted by the data holders to handle data deduplication.
AP cannot know the raw data stored at CSP, and CSP should not know the plain user data in its storage. Additional assumptions include the following. The hash code of data 𝑀 is applied as its indicator, which is used to check for duplication during data uploading and storage. We assume that the data holder signs the correct hash code honestly for origin verification at CSP. This hash code is protected and cannot be obtained by system attackers. In another line of our work, we improve the security of data origin verification based on a hash-chain challenge. We further assume that the first data uploader (holder) has the highest priority, and that a data holder must provide a valid proof in order to request special treatment. Users, CSP and AP communicate with each other through secure channels.
3.2 Preliminary and Notations
Preliminary
A PRE scheme is represented as a tuple of (possibly probabilistic) polynomial-time algorithms (𝐾𝐺, 𝑅𝐺, 𝐸, 𝑅, 𝐷): (𝐾𝐺, 𝐸, 𝐷) are the standard key generation, encryption, and decryption algorithms. On input the security parameter 1𝑘, 𝐾𝐺 outputs a public and private key pair (𝑝𝑘𝐴, 𝑠𝑘𝐴) for entity A. On input 𝑝𝑘𝐴 and data 𝑚, 𝐸 outputs a ciphertext 𝐶𝐴 = 𝐸(𝑝𝑘𝐴; 𝑚). On input 𝑠𝑘𝐴 and ciphertext 𝐶𝐴, 𝐷 outputs the plain data 𝑚 = 𝐷(𝑠𝑘𝐴; 𝐶𝐴). On input (𝑝𝑘𝐴, 𝑠𝑘𝐴, 𝑝𝑘𝐵), the re-encryption key generation algorithm 𝑅𝐺 outputs a re-encryption key 𝑟𝑘𝐴→𝐵 for a proxy. On input 𝑟𝑘𝐴→𝐵 and ciphertext 𝐶𝐴, the re-encryption function 𝑅 outputs 𝑅(𝑟𝑘𝐴→𝐵; 𝐶𝐴) = 𝐸(𝑝𝑘𝐵; 𝑚) = 𝐶𝐵, which can be decrypted with private key 𝑠𝑘𝐵.
Notations
Table 1 summarizes the notations used in the proposed scheme.

Table 1. System Notations

Key               | Description
(𝑝𝑘𝑢, 𝑠𝑘𝑢)         | The public key and secret key of user 𝑢 for PRE
(𝑃𝐾𝑢, 𝑆𝐾𝑢)         | The public key and secret key of user 𝑢 for public-key cryptographic operations
𝐷𝐸𝐾𝑢              | The symmetric key of 𝑢
𝐻()               | The hash function
𝐶𝑇                | The ciphertext
𝐶𝐾                | The cipherkey
𝑀                 | The user data
Encrypt(𝐷𝐸𝐾, 𝑀)   | The symmetric encryption function on 𝑀 with 𝐷𝐸𝐾
Decrypt(𝐷𝐸𝐾, 𝐶𝑇)  | The symmetric decryption function on 𝐶𝑇 with 𝐷𝐸𝐾
During system setup, every data holder 𝑢 generates 𝑝𝑘𝑢 and 𝑠𝑘𝑢 for PRE. Meanwhile, it also generates a key pair 𝑃𝐾𝑢 and 𝑆𝐾𝑢 for a Public-Key Cryptosystem (PKC), e.g., for signature generation and verification. Generating the above keys is the task of 𝑢. The keys (𝑝𝑘𝑢, 𝑠𝑘𝑢) and (𝑃𝐾𝑢, 𝑆𝐾𝑢) of 𝑢 are bound to the unique identity of the user (which can be a pseudonym of user 𝑢). This binding is crucial for the verification of the user identity. At system setup, 𝑝𝑘𝑢 and 𝑃𝐾𝑢 are certified by an authorized third party as 𝐶𝑒𝑟𝑡(𝑝𝑘𝑢) and 𝐶𝑒𝑟𝑡(𝑃𝐾𝑢), which can be verified by CSP, AP, and any CSP user. AP independently generates 𝑝𝑘𝐴𝑃 and 𝑠𝑘𝐴𝑃 for PRE and broadcasts 𝑝𝑘𝐴𝑃 to CSP users.
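To make the (𝐾𝐺, 𝑅𝐺, 𝐸, 𝑅, 𝐷) interface concrete, the following is a minimal, insecure sketch of an ElGamal-style PRE in Python. The group parameters are tiny toy values, the "DEK" is a small integer stand-in for a real symmetric key, and, unlike a deployable PRE scheme such as the one adopted in this paper [8], this toy 𝑅𝐺 computes the re-encryption key from both parties' secret keys (a simplification noted in the comments):

```python
import secrets

# Toy group: a safe prime p = 2q + 1 (far too small to be secure).
Q = 1019
P = 2 * Q + 1   # 2039, prime
G = 4           # generator of the order-q subgroup of squares mod p

def KG():
    sk = secrets.randbelow(Q - 1) + 1
    return pow(G, sk, P), sk                 # (pk, sk)

def E(pk, m):
    # ElGamal-style encryption; m must be an integer in [1, P-1].
    r = secrets.randbelow(Q - 1) + 1
    return (m * pow(G, r, P) % P, pow(pk, r, P))   # (m*g^r, pk^r)

def D(sk, ct):
    c1, c2 = ct
    g_r = pow(c2, pow(sk, -1, Q), P)         # c2^(1/sk) = g^r
    return c1 * pow(g_r, -1, P) % P

def RG(sk_a, pk_b_unused, sk_b):
    # Toy re-encryption key rk_{A->B} = sk_b / sk_a mod q. A real PRE
    # scheme computes this WITHOUT handing B's secret key to anyone;
    # the simplification only keeps the sketch short.
    return sk_b * pow(sk_a, -1, Q) % Q

def R(rk, ct):
    c1, c2 = ct
    return (c1, pow(c2, rk, P))              # pk_A^r  ->  pk_B^r

# Mirroring the scheme: DEK is encrypted under AP's key; the proxy (CSP)
# re-encrypts the cipherkey so that holder u' can decrypt it.
pk_ap, sk_ap = KG()
pk_u, sk_u = KG()
dek = 1234                                   # toy stand-in for a symmetric key
ck = E(pk_ap, dek)                           # CK = E(pk_AP, DEK)
rk = RG(sk_ap, pk_u, sk_u)                   # rk_{AP->u}
assert D(sk_u, R(rk, ck)) == dek             # u' recovers DEK
assert D(sk_ap, ck) == dek                   # AP could also decrypt directly
```

Note that the proxy only exponentiates the second ciphertext component; it never sees 𝑚, which is the property the scheme relies on when CSP acts as the proxy.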
4 Scheme Design and Algorithms
Our scheme contains the following main aspects:
1) A data holder encrypts its data using a randomly selected symmetric key 𝐷𝐸𝐾 and stores the encrypted data at CSP together with the signed hash code of the data;
2) Data duplication occurs when another data holder tries to store the same raw data at CSP; the CSP cooperates with AP to perform deduplication;
3) When a data holder deletes data from CSP, CSP updates the records of duplicated data holders to forbid this holder from accessing the data later on;
4) If the real data owner uploads the data later than a data holder, the CSP can update the records and keys to transfer ownership to the real owner;
5) If 𝐷𝐸𝐾 is updated by a data owner and new encrypted raw data is provided to CSP to achieve better security, CSP can decide to issue the re-encrypted new key 𝐷𝐸𝐾′ to all data holders with the support of AP.

4.1 Schemes
Each entity in the system has a public and private key pair under the PRE scheme. Let (𝑝𝑘𝐴𝑃, 𝑠𝑘𝐴𝑃) denote the key pair of AP and (𝑝𝑘𝑢, 𝑠𝑘𝑢) the key pair of data holder 𝑢. A data holder 𝑢 encrypts its secret key 𝐷𝐸𝐾 as 𝐸(𝑝𝑘𝐴𝑃, 𝐷𝐸𝐾) and publishes it along with its encrypted data Encrypt(𝐷𝐸𝐾, 𝑀) to the CSP. If another data holder 𝑢′ uploads the same raw data encrypted using a different key, CSP contacts AP to grant 𝐷𝐸𝐾 to the eligible duplicated data holder 𝑢′ as follows.

Algorithm: Grant access to duplicated data - grant(𝒖′)
Input: 𝑝𝑘𝑢′, 𝑃𝑜𝑙𝑖𝑐𝑦(𝑢), 𝑃𝑜𝑙𝑖𝑐𝑦(𝐴𝑃)
- CSP requests AP to grant access to the duplicated data for 𝑢′ by providing 𝑝𝑘𝑢′.
- AP checks 𝑃𝑜𝑙𝑖𝑐𝑦(𝐴𝑃) and, if the check is positive, issues CSP 𝑟𝑘𝐴𝑃→𝑢′ = 𝑅𝐺(𝑝𝑘𝐴𝑃; 𝑠𝑘𝐴𝑃; 𝑝𝑘𝑢′).
- CSP transforms 𝐸(𝑝𝑘𝐴𝑃; 𝐷𝐸𝐾) into 𝐸(𝑝𝑘𝑢′; 𝐷𝐸𝐾) if 𝑃𝑜𝑙𝑖𝑐𝑦(𝑢) authorizes 𝑢′ to share the same data storage of 𝑀 encrypted by 𝐷𝐸𝐾: 𝑅(𝑟𝑘𝐴𝑃→𝑢′; 𝐸(𝑝𝑘𝐴𝑃; 𝐷𝐸𝐾)) = 𝐸(𝑝𝑘𝑢′; 𝐷𝐸𝐾).
Note: the 𝑟𝑘𝐴𝑃→𝑢′ calculation can be skipped if it has already been performed and the values of (𝑝𝑘𝑢′; 𝑠𝑘𝑢′) and (𝑝𝑘𝐴𝑃; 𝑠𝑘𝐴𝑃) remain unchanged.
- Data holder 𝑢′ obtains 𝐷𝐸𝐾 by decrypting 𝐸(𝑝𝑘𝑢′; 𝐷𝐸𝐾) with 𝑠𝑘𝑢′: 𝐷𝐸𝐾 = 𝐷(𝑠𝑘𝑢′; 𝐸(𝑝𝑘𝑢′; 𝐷𝐸𝐾)). It can then access data 𝑀 at CSP.
In this scheme, CSP operates as the proxy of the PRE scheme. It indirectly distributes 𝐷𝐸𝐾 for data decryption to duplicated data holders without learning anything about the underlying secrets (e.g., 𝐷𝐸𝐾 and the raw data stored at CSP). Note that AP itself is not allowed by the CSP to access the user's personal data, due to business reasons and incentives.

4.2 Procedures
Data Deduplication
Fig. 2 illustrates the procedure of data deduplication at CSP with the support of AP based on the proposed scheme. We suppose that user 𝑢1 saves its sensitive personal data 𝑀 at CSP under the protection of 𝐷𝐸𝐾1, while user 𝑢2 is a data holder who tries to save the same data at CSP.
Step 1 - System setup: each user of CSP generates personal credentials and two key pairs, (𝑝𝑘𝑖; 𝑠𝑘𝑖) for PRE and (𝑃𝐾𝑖; 𝑆𝐾𝑖) for PKC (𝑖 = 1, 2, …, 𝑛). Meanwhile,
each user gets the certificates of its public keys, 𝐶𝑒𝑟𝑡(𝑝𝑘𝑖) and 𝐶𝑒𝑟𝑡(𝑃𝐾𝑖). AP independently generates its own key pair (𝑝𝑘𝐴𝑃, 𝑠𝑘𝐴𝑃) and publishes 𝑝𝑘𝐴𝑃.
Step 2 - Data storage: user 𝑢1 saves data 𝑀 at CSP. In this step, 𝑢1 encrypts 𝑀 for privacy and security protection with a randomly selected symmetric key 𝐷𝐸𝐾1 to get 𝐶𝑇1 = Encrypt(𝐷𝐸𝐾1, 𝑀). It then encrypts 𝐷𝐸𝐾1 with 𝑝𝑘𝐴𝑃 to get 𝐶𝐾1 by calling 𝐸(𝑝𝑘𝐴𝑃, 𝐷𝐸𝐾1). User 𝑢1 calculates 𝐻(𝑀) and signs it with 𝑆𝐾1 as 𝑆𝑖𝑔𝑛(𝐻(𝑀), 𝑆𝐾1). Then 𝑢1 sends CSP the data package 𝐷𝑃1 = {𝐶𝑇1, 𝐶𝐾1, 𝐻(𝑀), 𝑆𝑖𝑔𝑛(𝐻(𝑀), 𝑆𝐾1), 𝐶𝑒𝑟𝑡(𝑝𝑘1), 𝐶𝑒𝑟𝑡(𝑃𝐾1)}.

Fig. 2. A procedure of data deduplication
Step 3 - Duplication check: after receiving the data package, CSP verifies 𝐶𝑒𝑟𝑡(𝑝𝑘1) and 𝐶𝑒𝑟𝑡(𝑃𝐾1), then checks whether duplicated data is stored as follows: it verifies 𝑆𝑖𝑔𝑛(𝐻(𝑀), 𝑆𝐾1) and checks whether the same data is already saved by looking for the same 𝐻(𝑀) in the stored records. If the check is negative, it saves 𝐶𝑇1, 𝐶𝐾1 and 𝐻(𝑀), as well as other related information such as 𝐶𝑒𝑟𝑡(𝑝𝑘1) and 𝐶𝑒𝑟𝑡(𝑃𝐾1). If the check is positive and the pre-stored data is from the same user, it informs the user about this situation. If the same data is from a different user, refer to Step 5 for deduplication.
Step 4 - Duplicated data upload: user 𝑢2 later saves the same data 𝑀 at CSP, following the same procedure as Step 2. That is, 𝑢2 sends CSP the data package 𝐷𝑃2 = {𝐶𝑇2, 𝐶𝐾2, 𝐻(𝑀), 𝑆𝑖𝑔𝑛(𝐻(𝑀), 𝑆𝐾2), 𝐶𝑒𝑟𝑡(𝑝𝑘2), 𝐶𝑒𝑟𝑡(𝑃𝐾2)}. The scheme also supports large-scale data storage at CSP; in that case, the data holder first sends 𝐷𝑃2 = {𝐻(𝑀), 𝑆𝑖𝑔𝑛(𝐻(𝑀), 𝑆𝐾2), 𝐶𝑒𝑟𝑡(𝑝𝑘2), 𝐶𝑒𝑟𝑡(𝑃𝐾2)} for a duplication check before uploading the encrypted data itself. If duplication happens, 𝑢2 only needs to get 𝐸(𝑝𝑘2, 𝐷𝐸𝐾1) from CSP through re-encryption with the support of AP.
Step 5 - Deduplication: after receiving the data package, CSP performs a duplication check as in Step 3. If the check is positive, CSP contacts AP by sending 𝐶𝑒𝑟𝑡(𝑝𝑘2) (which contains 𝑝𝑘2) for deduplication. AP verifies the policy for data holding and storage regarding 𝑢2. If the verification is positive, it generates 𝑟𝑘𝐴𝑃→𝑢2 by calling 𝑅𝐺(𝑝𝑘𝐴𝑃; 𝑠𝑘𝐴𝑃; 𝑝𝑘2) and issues it to CSP (over a secure channel) for re-encrypting 𝐶𝐾1 by calling 𝑅(𝑟𝑘𝐴𝑃→𝑢2; 𝐸(𝑝𝑘𝐴𝑃; 𝐷𝐸𝐾1)) = 𝐸(𝑝𝑘2; 𝐷𝐸𝐾1). CSP then sends 𝐸(𝑝𝑘2; 𝐷𝐸𝐾1) to 𝑢2, which can recover 𝐷𝐸𝐾1 with its secret key 𝑠𝑘2. 𝑢2 confirms the
success of data deduplication to CSP. After receiving this notification, CSP records the corresponding deduplication information in the system and discards 𝐶𝑇2 and 𝐶𝐾2. From this moment, both 𝑢1 and 𝑢2 can access the same data 𝑀 saved at CSP. User 𝑢1 uses 𝐷𝐸𝐾1 directly, while 𝑢2 obtains 𝐷𝐸𝐾1 by calling 𝐷(𝑠𝑘2; 𝐸(𝑝𝑘2; 𝐷𝐸𝐾1)).
Data Deletion at CSP
When data holder 𝑢2 wants to delete its data from CSP, it sends a deletion request to CSP: {𝐶𝑒𝑟𝑡(𝑃𝐾2), 𝐻(𝑀)}. CSP checks the validity of the request, then removes the deduplication record of 𝑢2 and blocks 𝑢2's later access to 𝑀. CSP further checks whether the deduplication record is empty. If it is, CSP deletes the encrypted data 𝐶𝑇 and related records.
Data Owner Management
If the real data owner 𝑢1 uploads the data later than a data holder 𝑢2, the CSP can save the real data owner's encrypted data in the cloud and allow the owner to share the storage. In this case, the CSP contacts AP and provides the 𝑝𝑘𝑢 of every data holder (e.g., 𝑝𝑘2) for which CSP does not yet know the corresponding re-encryption key 𝑟𝑘𝐴𝑃→𝑢 (e.g., 𝑟𝑘𝐴𝑃→𝑢2). AP issues 𝑟𝑘𝐴𝑃→𝑢 (e.g., 𝑟𝑘𝐴𝑃→𝑢2) to CSP if the validity check is positive. CSP performs re-encryption on 𝐶𝐾1, gets 𝐸(𝑝𝑘2, 𝐷𝐸𝐾1), sends the re-encrypted 𝐷𝐸𝐾1 to all related data holders (e.g., 𝑢2), deletes 𝐶𝑇2 and 𝐶𝐾2 by replacing them with 𝑢1's encrypted copy 𝐶𝑇1 and 𝐶𝐾1, and updates the corresponding deduplication records.
Encrypted Data Update
In some cases, a data holder could update the encrypted data stored at CSP by generating a new key 𝐷𝐸𝐾′ and uploading the data newly encrypted with 𝐷𝐸𝐾′ to CSP.
Fig. 3. A procedure of encrypted data update
As illustrated in Fig. 2, 𝑢1 uploads its data and the CSP manages the encrypted data with deduplication. User 𝑢2 shares the data with 𝑢1. Suppose 𝑢1 wants to update encrypted data stored at CSP with new symmetric key 𝐷𝐸𝐾1′ . Fig. 3 shows the procedure of encrypted data update. User 𝑢1 sends an update request as below: 𝐷𝑃1′ = {𝐶𝑇1′ , 𝐶𝐾1′ , 𝐻(𝑀), 𝑆𝑖𝑔𝑛(𝐻(𝑀), 𝑆𝐾1 ), 𝐶𝑒𝑟𝑡(𝑝𝑘1 ), 𝐶𝑒𝑟𝑡(𝑃𝐾1 ), 𝑢𝑝𝑑𝑎𝑡𝑒 𝐶𝑇1 }. The CSP verifies 𝑆𝑖𝑔𝑛(𝐻(𝑀), 𝑆𝐾1 ) and contacts AP for updating re-encryption keys. AP checks its policy for generating and sending corresponding re-encryption keys (e.g., 𝑟𝑘𝐴𝑃→𝑢2 ), which are used by the CSP to perform re-encryption on 𝐶𝐾1′ for generating encrypted keys that can be decrypted by all eligible data holders (e.g.,
𝐸(𝑝𝑘2 , 𝐷𝐸𝐾1′ )). The re-encrypted keys (e.g., 𝐸(𝑝𝑘2 , 𝐷𝐸𝐾1′ )) are sent back to the eligible data holders for future access on data 𝑀. Any data holder can perform the encrypted data update. Based on storage policy and service agreement between the data holder and CSP, CSP decides if such an update can be performed.
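Setting aside the cryptographic operations, the CSP-side bookkeeping behind the procedures above (duplication check by 𝐻(𝑀), per-record holder lists, deletion, and removal of the last copy) can be sketched as follows. This is an illustrative in-memory model only; signatures, certificates and the AP interaction are omitted, and all names are made up.

```python
import hashlib

class CSP:
    """Toy in-memory store sketching the CSP-side record keeping:
    one ciphertext per distinct H(M), plus a holder set per record
    used for access control and deletion."""

    def __init__(self):
        self.records = {}   # H(M) -> {"ct": ..., "ck": ..., "holders": set()}

    def upload(self, user, ct, ck, h):
        if h not in self.records:            # duplication check is negative:
            self.records[h] = {"ct": ct, "ck": ck, "holders": {user}}
            return "stored"
        rec = self.records[h]                # positive: discard CT2/CK2 and
        rec["holders"].add(user)             # record u2 as a holder; in the
        return "deduplicated"                # full scheme AP now issues rk

    def can_access(self, user, h):
        return h in self.records and user in self.records[h]["holders"]

    def delete(self, user, h):
        rec = self.records.get(h)
        if rec and user in rec["holders"]:
            rec["holders"].discard(user)     # block this holder's access
            if not rec["holders"]:           # last holder gone:
                del self.records[h]          # remove CT and related records

def H(m: bytes) -> str:
    return hashlib.sha256(m).hexdigest()

csp = CSP()
m = b"shared file contents"
assert csp.upload("u1", b"CT1", b"CK1", H(m)) == "stored"
assert csp.upload("u2", b"CT2", b"CK2", H(m)) == "deduplicated"
csp.delete("u2", H(m))
assert not csp.can_access("u2", H(m)) and csp.can_access("u1", H(m))
csp.delete("u1", H(m))
assert H(m) not in csp.records               # data fully removed
```

The single `records` index keyed by 𝐻(𝑀) is what makes the duplication check independent of the file name, matching the behavior observed in the experiments of Section 5.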
5 Security Analysis & Performance Evaluation
5.1 Security Analysis
Our scheme realizes a secure way to protect and deduplicate the data stored in the cloud by concealing the plaintext from both CSP and AP. The security of the proposed scheme rests on PRE, symmetric key encryption and PKC. CSP has no way to learn the raw data since it is always stored in encrypted form. CSP knows 𝐷𝐸𝐾 encrypted with 𝑝𝑘𝐴𝑃, but AP does not share its own secret key 𝑠𝑘𝐴𝑃 with CSP. Thus CSP can learn neither 𝐷𝐸𝐾 nor the raw data. AP has no way to access the data since such access is blocked by CSP, even though AP could obtain 𝐷𝐸𝐾. In addition, we apply proper management protocols to support data storage management and data owner management while deduplication is achieved.
Data confidentiality is achieved by symmetric key encryption (e.g., AES) and PRE. The user data is presented in two different forms. One is 𝐻(𝑀). The hash function is assumed to be secure, i.e., it is computationally infeasible to find collisions, so the raw data cannot be recovered from the hash code. The other is the ciphertext of symmetric key encryption, 𝐶𝑇 = Encrypt(𝐷𝐸𝐾, 𝑀). We assume that the symmetric key encryption is secure. Therefore, 𝐷𝐸𝐾 becomes the key point. In the proposed scheme, we adopt PRE to protect 𝐷𝐸𝐾. First, 𝐷𝐸𝐾 is encrypted with 𝑝𝑘𝐴𝑃 through PRE to obtain the cipherkey 𝐸(𝑝𝑘𝐴𝑃, 𝐷𝐸𝐾). We assume that CSP and AP do not collude with each other. Although CSP holds this cipherkey, it cannot obtain 𝐷𝐸𝐾 because it knows nothing about the secret key of AP. Re-encryption with key 𝑟𝑘𝐴𝑃→𝑢 transforms the cipherkey under 𝑝𝑘𝐴𝑃 into the cipherkey under 𝑝𝑘𝑢. Likewise, CSP can extract nothing from this cipherkey, since it knows nothing about 𝑠𝑘𝑢. The authorized user 𝑢 can obtain the data 𝑀 by decrypting 𝐸(𝑝𝑘𝑢, 𝐷𝐸𝐾) with its own secret key 𝑠𝑘𝑢, which guarantees that the raw data is secure and can be shared by eligible users.
The cooperation of CSP and AP without collusion guarantees that only authorized users can access the plain data and that the data can be deduplicated in a secure way.

5.2 Computation Complexity
The proposed scheme involves four kinds of system roles: data owner, data holder, CSP and AP. To present the computation complexity in detail, we adopt AES for symmetric encryption, RSA for signatures and the PRE proposed in [8]. We analyze the complexity of uploading one data file as below.
Data Owner: it is in charge of five operations: system setup, data encryption, symmetric key encryption, hash computation and signature generation. The key generation of PRE includes 1 exponentiation. RSA key generation needs 1 modular inversion. The cost of encrypting data with 𝐷𝐸𝐾 depends on the size of the data, which is inevitable in any cryptographic method for protecting data. Likewise, the cost of hashing depends on the size of the data, but hashing is very fast. Signing requires 1 exponentiation, and the encryption of 𝐷𝐸𝐾 using PRE needs 2 exponentiations. Thus, the computation complexity of the data owner is 𝒪(1). In addition, the burden of system setup can be amortized over all data storage operations.
CSP: each user uploads its data to CSP. CSP first checks the signature and chooses to save the appropriate copy of the data, which needs 1 exponentiation. If a data holder uploads the same data, CSP contacts AP to obtain a re-encryption key. In this case, CSP has to perform the re-encryption operation of PRE, which requires 1 pairing. If the same data is uploaded by 𝑛 data holders, the computational complexity is 𝒪(𝑛). CSP is responsible for allowing all data holders to access the same data while avoiding storing the same data twice in the cloud.
Data Holder: when it uploads data that has already been stored at CSP, the data holder does the same as the data owner. In addition, the data holder performs one more decryption for accessing the data, which involves 1 exponentiation. The computational complexity of PRE decryption for the data holder is 𝒪(1).
AP: it is responsible for re-encryption key management. It checks the policy and issues keys to authorized users. Re-encryption key generation needs 1 exponentiation. AP needs to issue keys to all authorized data holders who upload the same data, so its computational complexity is 𝒪(𝑛).
Note that the computational burden of system setup, especially the generation of key pairs, can be amortized over the lifetime of the entities. Thus, our scheme is very efficient. Table 2 lists the computation operations and complexity of the different roles.

Table 2. Computation Complexity of System Entities

Role        | Algorithm                    | Computations     | Complexity
Data owner  | Setup                        | 1*ModInv + 1*Exp | 𝒪(1)
            | Data upload                  | 3*Exp            |
CSP         | Re-encryption                | 1*Pair           | 𝒪(𝑛)
Data holder | System setup                 | 1*ModInv + 1*Exp | 𝒪(1)
            | Data upload                  | 3*Exp            |
            | Decryption for access        | 1*Exp            |
AP          | System setup                 | 1*Exp            | 𝒪(𝑛)
            | Re-encryption key generation | 1*Exp            |

Notes: Pair: bilinear pairing; Exp: exponentiation; ModInv: modular inversion; n: number of data holders who share the same data.
5.3 Performance Analysis
Implementation and Testing Environment
We implemented the proposed scheme and tested its performance. Table 3 describes our implementation and testing environment. We could not test our scheme directly in the cloud; instead, we applied a MySQL database to store data files and related information. In our tests, we did not take into account the time of data uploading and downloading; we focused on the performance of the deduplication procedure and the algorithms designed in our scheme.

Table 3. Test Environment

Hardware Environment | CPU: Intel Core 2 Quad CPU Q9400 2.66 GHz; Memory: 4 GB SDRAM
Software Environment | Operating System: Ubuntu v14.04, Windows Home Ultimate edition 64-bit; Programming Environment: Eclipse Luna CDT, Java; Secure Protocol: OpenSSL; Library: Pairing Based Cryptography (http://crypto.stanford.edu/pbc/); Database: MySQL v5.5.41
Experiments
We set up two users (User 1: zhq; User 2: cyl) in the system for testing purposes, in order to illustrate that the designed scheme works.
Correctness of Scheme Implementation
Case 1: user zhq uploads a new file named chandelier.txt.
Fig. 4. New file upload process
Fig. 5. Reject duplicated saving
User zhq encrypts the encryption key 𝐷𝐸𝐾 with 𝑝𝑘𝐴𝑃 and generates a key file named key_zhq_chandelier.txt. As the cloud has not yet stored this file, the file is encrypted and successfully uploaded to the cloud as an encrypted file C_chandelier.txt, as shown in Fig. 4. When user zhq accesses the data, it obtains the plaintext file M_chandelier.txt.
Case 2: user cyl uploads the same file chandelier.txt. As shown in Fig. 5, the CSP finds through the duplication check that the file has already been uploaded by user zhq. Notice that no matter what the file name is, the CSP can detect duplication if the file content is the same as that of a file saved already. The CSP does not save the file but asks AP to issue a re-encryption key to user cyl for accessing the same file. Finally, AP issues a re-encryption key 𝑟𝑘1→2 that allows user cyl to share the file chandelier.txt that has already been stored.
Case 3: user cyl accesses the shared file chandelier.txt.
Fig. 6. Access shared duplicated data
Fig. 7. Data owner management
User zhq uploaded chandelier.txt and is denoted as the data owner. User cyl uploaded the same file and now tries to access this shared file. As shown in Fig. 6, the CSP and AP allow cyl to access the pre-uploaded file. User cyl can download the file and obtain the data successfully.
Case 4: data owner management. User cyl has uploaded the file summertime.txt as a data holder. When the data owner zhq uploads this file, CSP removes the data record of cyl and replaces it with the corresponding record of zhq, as shown in Fig. 7.
Case 5: data deletion. User zhq is the data owner of the file chandelier.txt. If user cyl, as a data holder, wants to delete the file, CSP simply blocks the access of cyl, as shown in Fig. 8. But if the data owner zhq wants to delete the file, CSP deletes the record of zhq but keeps the record of cyl, and blocks the access of zhq to the file, as shown in Fig. 9.
Fig. 8. Data deletion by a data holder
Fig. 9. Data deletion by a data owner
Efficiency Test
Test 1: efficiency of PRE. We test the efficiency of each PRE operation with different AES symmetric key sizes (128 bits, 192 bits and 256 bits). Fig. 10 shows the operation times. We observe that our scheme is very efficient. The time spent on PRE key pair generation (KeyGen), re-encryption key generation (ReKeyGen), encryption (Enc), re-encryption (ReEnc) and decryption (Dec) is not related to the length of the input. For the three tested AES key sizes, the encryption time is less than 6 milliseconds. The decryption time is trivial, which implies that our scheme does not impose a heavy processing load on data holders. We also observe that the computation time of each operation does not vary much across the different AES key sizes. Therefore, our scheme can be efficiently adapted to various security requirements in various scenarios.
Test 2: efficiency of file encryption and decryption. In this experiment, we tested the operation time of file encryption and decryption with AES, applying different AES key sizes (128 bits, 192 bits and 256 bits) and different file sizes (from 10 megabytes to 600 megabytes). We observe from Fig. 11 that even when the file is as big as 600 MB, the encryption/decryption time is less than 30 seconds when applying a 256-bit AES key. Applying symmetric encryption for data protection is therefore a reasonable and practical choice.
Fig. 10. The execution time of PRE operations
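The five operations timed in Test 1 can be illustrated with a toy ElGamal-style proxy re-encryption in the spirit of BBS-type schemes. This is an illustrative sketch over a freshly generated 160-bit safe-prime group, not the pairing-based construction of [8] used in our implementation, and the parameters are far too small for real security.

```python
import secrets

def _probably_prime(n, rounds=16):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = secrets.randbelow(n - 3) + 2
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def _gen_group(bits=160):
    """Find a safe prime p = 2q + 1; g = 4 generates the order-q subgroup."""
    while True:
        q = secrets.randbits(bits) | (1 << (bits - 1)) | 1
        if _probably_prime(q) and _probably_prime(2 * q + 1):
            return 2 * q + 1, q

P, Q = _gen_group()
G = 4

def keygen():
    sk = secrets.randbelow(Q - 1) + 1           # a in [1, q-1]
    return sk, pow(G, sk, P)                    # (a, g^a)

def rekeygen(sk_a, sk_b):
    return sk_b * pow(sk_a, -1, Q) % Q          # rk = b / a mod q

def enc(pk, m):
    r = secrets.randbelow(Q - 1) + 1
    return m * pow(G, r, P) % P, pow(pk, r, P)  # (m * g^r, g^(a*r))

def reenc(rk, ct):
    c1, c2 = ct
    return c1, pow(c2, rk, P)                   # g^(a*r*b/a) = g^(b*r)

def dec(sk, ct):
    c1, c2 = ct
    g_r = pow(c2, pow(sk, -1, Q), P)            # recover g^r
    return c1 * pow(g_r, -1, P) % P
```

A 128-bit AES key encoded as an integer can then be encrypted under a data owner's public key, re-encrypted by the CSP for an authorized holder, and decrypted by that holder, without the CSP ever seeing the key in plaintext.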
Fig. 11. Operation time of file encryption and decryption with AES

6 Conclusion
Managing encrypted data storage with deduplication is important for achieving a successful cloud storage service in practice. In this paper, we proposed a practical scheme to manage encrypted data in the cloud with deduplication based on PRE. Our scheme can flexibly support data update and sharing with deduplication even when the data holder is offline. The encrypted data can be securely accessed because only authorized data holders can obtain the symmetric key used for data decryption. Extensive performance analysis and tests show that our scheme is secure and efficient under the described security model. The implementation further demonstrates its practicability. Future work includes further optimizing the current implementation for practical deployment and usage.
References

1. Bellare, M., Keelveedhi, S., Ristenpart, T.: DupLESS: Server-aided Encryption for Deduplicated Storage. In: 22nd USENIX Conference on Security, pp. 179-194. USENIX (2013)
2. Dropbox: A File-storage and Sharing Service, http://www.dropbox.com/
3. Google Drive, http://drive.google.com
4. Mozy: A File-storage and Sharing Service, http://mozy.com/
5. Douceur, J.R., Adya, A., Bolosky, W.J., Simon, D., Theimer, M.: Reclaiming Space from Duplicate Files in a Serverless Distributed File System. In: 22nd International Conference on Distributed Computing Systems, pp. 617-624. IEEE (2002)
6. Wallace, G., Douglis, F., Qian, H., Shilane, P., Smaldone, S., Chamness, M., Hsu, W.: Characteristics of Backup Workloads in Production Systems. In: FAST, p. 4. USENIX (2012)
7. Wilcox-O'Hearn, Z.: Convergent Encryption Reconsidered (2011), http://www.mailarchive.com/[email protected]/msg08949.html
8. Ateniese, G., Fu, K., Green, M., Hohenberger, S.: Improved Proxy Re-encryption Schemes with Applications to Secure Distributed Storage. ACM Transactions on Information and System Security, vol. 9, pp. 1-30. ACM (2006)
9. Opendedup, http://opendedup.org/
10. Meyer, D.T., Bolosky, W.J.: A Study of Practical Deduplication. ACM Transactions on Storage, vol. 7, pp. 1-20. ACM (2012)
11. Pettitt, J.: Hash of Plaintext as Key? http://cypherpunks.venona.com/date/1996/02/msg02013.html
12. The Freenet Project: Freenet, https://freenetproject.org/
13. Bellare, M., Keelveedhi, S., Ristenpart, T.: Message-locked Encryption and Secure Deduplication. In: Johansson, T., Nguyen, P.Q. (eds.) Advances in Cryptology - EUROCRYPT 2013. LNCS, vol. 7881, pp. 296-312. Springer, Heidelberg (2013)
14. Perttula: Attacks on Convergent Encryption, http://bit.ly/yQxyvl
15. Liu, C.Y., Liu, X.J., Wan, L.: Policy-based De-duplication in Secure Cloud Storage. In: Yuan, Y.Y., Wu, X., Lu, Y.M. (eds.) Trustworthy Computing and Services, pp. 250-262. Springer, Heidelberg (2013)
16. Puzio, P., Molva, R., Onen, M., Loureiro, S.: ClouDedup: Secure Deduplication with Encrypted Data for Cloud Storage. In: 5th International Conference on Cloud Computing Technology and Science, pp. 363-370. IEEE (2013)
17. Sun, Z., Shen, J., Yong, J.M.: DeDu: Building a Deduplication Storage System over Cloud Computing. In: 15th International Conference on Computer Supported Cooperative Work in Design, pp. 348-355. IEEE (2011)
18. Wen, Z.C., Luo, J.M., Chen, H.J., Meng, J.X., Li, X., Li, J.: A Verifiable Data Deduplication Scheme in Cloud Computing. In: 2014 International Conference on Intelligent Networking and Collaborative Systems, pp. 85-90. IEEE (2014)
19. Li, J., Li, Y.K., Chen, X.F., Lee, P.P.C., Lou, W.J.: A Hybrid Cloud Approach for Secure Authorized Deduplication. IEEE Transactions on Parallel and Distributed Systems, vol. 26, pp. 1206-1216. IEEE (2015)
20. Meye, P., Raipin, P., Tronel, F., Anceaume, E.: A Secure Two-Phase Data Deduplication Scheme. In: HPCC/CSS/ICESS 2014, pp. 802-809. IEEE (2014)
21. Paulo, J., Pereira, J.: A Survey and Classification of Storage Deduplication Systems. ACM Computing Surveys, vol. 47, pp. 1-30. ACM, New York (2014)
22. Li, Y.-K., Xu, M., Ng, C.-H., Lee, P.P.C.: Efficient Hybrid Inline and Out-of-Line Deduplication for Backup Storage. ACM Transactions on Storage, vol. 11, pp. 2:1-2:21. ACM (2014)
23. Fu, M., Feng, D., Hua, Y., He, X., Chen, Z.N., Xia, W., Huang, F., Liu, Q.: Accelerating Restore and Garbage Collection in Deduplication-based Backup Systems via Exploiting Historical Information. In: 2014 USENIX Annual Technical Conference, pp. 181-192. USENIX Association (2014)
24. Kaczmarczyk, M., Barczynski, M., Kilian, W., Dubnicki, C.: Reducing Impact of Data Fragmentation Caused by In-line Deduplication. In: 5th Annual International Systems and Storage Conference, pp. 15:1-15:12. ACM (2012)
25. Lillibridge, M., Eshghi, K., Bhagwat, D.: Improving Restore Speed for Backup Systems that Use Inline Chunk-based Deduplication. In: FAST, pp. 183-198. USENIX (2013)