A Novel Encryption Technique for Data De-Duplication

1Ninni Singh
Department of Computer Science and Engineering, Jaypee University of Information Technology, Waknaghat, Solan 173234
[email protected]
Abstract— Data de-duplication is an emerging technology that offers the benefit of storing only a single instance (a single copy) of duplicated data on a storage disk. As the number of users in a network increases, data grows explosively, so a mechanism is needed to deal with this situation efficiently. To protect the confidentiality of data in the network while still supporting de-duplication, a convergent encryption algorithm is applied before the data is uploaded into the network. However, several issues remain to be addressed, and some of them are addressed in this proposal. In this paper, to further enhance security, we derive the private (secret) key from the user's password, which fulfils one of these issues, namely data origin authentication. In this construction, the integrity of the information is preserved by deriving the convergent key (secret key + symmetric key, where the symmetric key is derived from the chunks). We also address the issue of keyword search over encrypted data. Our analysis reveals that the proposal is secure in terms of the definitions specified in the proposed secure data de-duplication model.

Keywords— Data De-Duplication, Private Key, Symmetric Key, Keywords, Encryption.
I. INTRODUCTION
Nowadays, storage devices play a vital role in every field. As technology develops, the demand for storage devices also increases. Today we deal with digital content, which increases the need for storage devices that provide data security in a cost-effective manner. Hard disks are inexpensive and are commonly employed for keeping historical data, but it is hard to manage disks in a distributed environment while upholding their integrity. Many users and organizations use storage devices to store data, and it often happens that different users store the same information on the storage device, which is simply a waste of storage. Data de-duplication addresses this problem. Data de-duplication, also known as single-instance storage, is applied to increase the utilization of storage devices by removing duplicate data before it is stored on the device. De-duplication finds duplicates within a file or between files (using file-level or block-level de-duplication) and stores a single copy of the data irrespective of the number of times it occurs. By
2Anum Javeed Zargar, 3Geetanjali Rathee, 4Satya Prakash Ghrera
2, 3, 4Department of Computer Science and Engineering, Jaypee University of Information Technology, Waknaghat, Solan 173234
[email protected], [email protected], [email protected]
doing this, data de-duplication reduces the disk space demanded by large files and scales down the bandwidth required to send the same duplicate chunks of data again and again over the network [1][2], as shown in Fig. 1.
Fig. 1. Data De-Duplication
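To make the idea in Fig. 1 concrete, the sketch below (Python; the helper names and the toy chunk size are our own illustrative choices, not taken from the paper) contrasts the two granularities mentioned above: whole-file de-duplication keys the store by a hash of the entire file, whereas block-level de-duplication keys it by a hash of each fixed-size chunk, so a block that appears in several files is stored only once.

    import hashlib

    CHUNK_SIZE = 4  # tiny chunk size purely for illustration


    def dedup_whole_file(files):
        # One stored entry per distinct file content.
        return {hashlib.sha256(f).hexdigest(): f for f in files}


    def dedup_block_level(files):
        # One stored entry per distinct chunk, regardless of how often it occurs.
        store = {}
        for f in files:
            for i in range(0, len(f), CHUNK_SIZE):
                chunk = f[i:i + CHUNK_SIZE]
                store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
        return store


    # The block b"AAAA" appears in both files but is stored only once.
    print(len(dedup_block_level([b"AAAABBBBCCCC", b"AAAADDDD"])))  # prints 4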
Information protection is another field that has drawn the attention of researchers in the area of storage management. Users generally store data on a storage device for future use, and the same device is accessed by many users; therefore a number of attacks are possible against storage devices. To secure data from several types of attack, we combine cryptographic techniques with data de-duplication. Data de-duplication avoids storing duplicate copies, which reduces the required space, while encryption converts plaintext into an unreadable format, which protects it from insider and remote attacks. In this paper, we introduce an approach to secure data de-duplication. Our proposed security model comprises four strata. The first layer is the authentication layer, which provides two features, namely data origin authentication and information integrity. The second layer is the encryption and confidentiality layer; as its name indicates, in this layer we perform encryption using convergent encryption [3]. The third layer is the data de-duplication layer, in which we identify duplicate copies of the data and store a single representative of each, and the fourth layer is the data store, which is used to keep the information in a data center or any other storage device, as shown in Fig. 2. The proposed model works with both a single file server and distributed servers: with a single file server we store the data along with its metadata, whereas with distributed servers the metadata is stored on an independent server and the file is kept on a storage device. In the proposal, we first derive the secret key from the user password using HMAC-SHA96 and then authenticate
Fig. 2. Four-Tier Architecture (Authentication; Encryption and Confidentiality; Data De-Duplication; Data Store).
each other; in this way we achieve two features: first, data origin authentication, and second, the same ciphertext for the same block of data (if the user wishes to store a file on the storage device, we first derive the symmetric key from each chunk and then apply the convergent encryption scheme, which performs encryption while still allowing de-duplication). Since the symmetric key is derived from the chunk (block) of information, applying the same key for encryption returns the same ciphertext for the same block of data. Chunking and encryption are executed entirely on the client side, so the information is protected from insider and remote attacks. Lastly, the trapdoors that link the chunks to a file are encrypted with a secret symmetric key known only to the user: the user only needs to know the password, and the client system automatically derives the secret key from it, so the system is protected from key compromise. The remainder of the paper is organized as follows. In Section II we discuss related work; in Section III we describe our proposal and its execution; in Section IV we show how the system is space efficient (through de-duplication) and provides data security; and in Section V we conclude the paper.
II. RELATED WORK
Currently, there is an increasing focus on the use of data de-duplication techniques. There are essentially three strategies employed in data de-duplication: whole file, fixed-size chunks, and variable-size chunks. In whole-file de-duplication, the whole file is treated as a single chunk and a hash of it is used as the chunk identifier. The disadvantage of this approach is that two different files may produce the same hash value, in which case a single instance of one file is stored and the other file may be rejected. This strategy is used in the system of Douceur et al. [3], the EMC Centera system [5] and the Windows Single Instance Store [4]. The second strategy is fixed-size chunking: files are partitioned into fixed-size blocks, and de-duplication is then applied to the partitioned chunks. Here identical chunks yield the same ciphertext [6] and hence save storage space. The third strategy is variable-size chunking; it is more flexible, and Rabin fingerprinting [7] makes it efficient. It is primarily used in Shark [1], LBFS [8] and Deep Store [9].
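As an illustration of the third strategy, the sketch below performs content-defined (variable-size) chunking. It is only a sketch under stated assumptions: a simple polynomial rolling hash stands in for a true Rabin fingerprint, and the window size, boundary mask and size limits are illustrative values rather than parameters of any of the cited systems.

    WINDOW = 48                        # bytes in the sliding window
    BASE = 257                         # polynomial base of the rolling hash
    MOD = (1 << 61) - 1                # large prime modulus
    MASK = (1 << 13) - 1               # expected average chunk size of about 8 KiB
    POW_WIN = pow(BASE, WINDOW, MOD)   # contribution of the byte leaving the window


    def variable_size_chunks(data, min_size=2048, max_size=65536):
        # Yield chunks whose boundaries depend on the content, not on fixed offsets.
        start, h = 0, 0
        for i, byte in enumerate(data):
            h = (h * BASE + byte) % MOD
            if i - start >= WINDOW:                        # window is full: slide it
                h = (h - data[i - WINDOW] * POW_WIN) % MOD
            size = i - start + 1
            if ((h & MASK) == 0 and size >= min_size) or size >= max_size:
                yield data[start:i + 1]                    # content-defined boundary
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]                             # trailing chunk

Because boundaries depend on content, inserting a few bytes near the start of a file shifts only the affected chunk, and the remaining chunks still de-duplicate against previously stored copies.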
The issues related to data de-duplication in a multi-user environment were first identified by Douceur et al. [3], who proposed a security model in which the key is derived from the plaintext (by applying a hash function to the plaintext); this construction is called convergent encryption. Storer et al. [10] then identified some security issues in Douceur's scheme and gave a new security model for data de-duplication. The problem with this model is that it focuses only on server-side de-duplication and does not consider client-side de-duplication, and it is susceptible to malicious attacks such as eavesdropping and data modification. To prevent information leakage at the client side, Halevi et al. [11] proposed a model named proofs of ownership (PoW), also known as a challenge-response protocol, which is used to identify the owner of a particular piece of information. It is used by the storage server to determine whether the requesting user really is the owner of that information: the server challenges the user to present a valid path for a subset of the leaves of a Merkle tree [12]. Both of their constructions build a Merkle tree; the only difference is that in the former we first apply erasure coding to the data of the file and the encoded version acts as the input for the construction of the Merkle tree, whereas in the latter, instead of erasure coding, the data is first pre-processed with a (universal) hash function. A great deal of further work has been done on proofs of ownership. Ng et al. [13] proposed a scheme in which the file is divided into fixed-size chunks; each chunk has its own commitment, and a hash-based tree is built on these commitments. This scheme is more secure because the user proves possession of a data block without disclosing any secret information. The strategy runs efficiently, but it incurs a high computation cost by generating commitments for each proof request. Di Pietro et al. [14] proposed a scheme in which, for PoW, the whole file is projected onto a small number of bits, but it has the disadvantage that privacy is violated. Xu et al. [15] proposed a scheme that provides authentication while preserving confidentiality in cross-user client-side de-duplication; to provide de-duplication they utilize a convergent encryption algorithm under a weak leakage model, but unfortunately their model is susceptible to malicious storage-server attacks. In an enterprise environment, storage has to be shared: data from different branches of the same organization is stored in a common place called a data center. It sometimes happens that employees from different branches store the same data, which is inefficient from a storage-space point of view. To overcome this, we propose a novel scheme based on a four-level architecture.
III. ARCHITECTURE

Assumptions: Our proposed scheme makes the following assumptions. First, we assume a secure channel between the users and the data center or store; this channel supports integrity, confidentiality and mutual authentication. Second, since hash functions are used in our proposal, we assume a hash function that is strongly collision resistant. Third, a user performs searches only over data belonging to himself.

Fig. 3. Working Architecture of Our Proposed Approach.

The first level is the authentication level, which provides two features: data origin authentication and integrity. The second layer is the encryption and confidentiality layer; as its name indicates, in this layer we perform encryption using convergent encryption [3]. The third layer is the data de-duplication layer, in which we identify duplicate copies of the data and store a single representative of each, and the fourth layer is the data store, which is used to keep the information in a data center or any other storage device. Our proposed approach is quite similar to an email portal such as Gmail: we provide a portal where the user enters a user ID and password before performing any operation (storing or retrieving information from the data center). Once the user logs into the system, we derive the private (secret) key from the password using HMAC-SHA96, and the two parties authenticate each other; in this way we achieve data origin authentication. After that, if the user wishes to store a file in the data store, the file is partitioned into fixed-size chunks. From each chunk we derive a symmetric key, and we then combine the two keys (symmetric and secret). The secret key remains the same for all files, whereas the symmetric key changes for each chunk of data. This combined key is used to encrypt the chunks with AES symmetric encryption. Since the symmetric key is the hash of the chunk, identical chunks produce the same hash, and encrypting identical data with the same key yields the same ciphertext. We then upload the metadata, which contains two primitives: a keyed hash of every encrypted file, and the file ID together with the encrypted keywords. To address each chunk of data, an identifier is needed that uniquely identifies a particular chunk of a file. After encryption we therefore compute the chunk ID. To avoid file-name conflicts, we first identify the keywords in the files and then encrypt these keywords using the secret key. For unique identification of files, we encrypt the file content with the user's private (secret) key.
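The sketch below illustrates this client-side pipeline. All concrete choices in it are our own assumptions rather than the paper's specification: HMAC-SHA96 is taken to be HMAC-SHA-1 truncated to 96 bits, the 12-byte secret key is cycled to the length of the chunk hash before the XOR, AES-256 in CBC mode with a fixed all-zero IV stands in for the deterministic encryption step, and the chunk size and the Python cryptography package are illustrative.

    import hashlib
    import hmac
    from itertools import cycle, islice

    from cryptography.hazmat.primitives import padding
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    CHUNK_SIZE = 4096  # fixed-size chunking; the size is an assumption


    def derive_secret_key(password, salt=b"user-id"):
        # Secret key = HMAC-SHA-1(password, salt) truncated to 96 bits (12 bytes).
        return hmac.new(password.encode(), salt, hashlib.sha1).digest()[:12]


    def convergent_key(chunk, secret_key):
        # Symmetric key = SHA-256(chunk); combined key = chunk hash XOR stretched secret key.
        chunk_hash = hashlib.sha256(chunk).digest()
        stretched = bytes(islice(cycle(secret_key), len(chunk_hash)))
        return bytes(a ^ b for a, b in zip(chunk_hash, stretched))


    def encrypt_chunk(chunk, key):
        # Deterministic AES-256: identical chunks under the same key give identical ciphertext.
        padder = padding.PKCS7(128).padder()
        padded = padder.update(chunk) + padder.finalize()
        encryptor = Cipher(algorithms.AES(key), modes.CBC(b"\x00" * 16)).encryptor()
        return encryptor.update(padded) + encryptor.finalize()


    def upload(data, password):
        # Chunk the file, encrypt each chunk, and index the ciphertexts by chunk ID.
        secret = derive_secret_key(password)
        store = {}
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            ciphertext = encrypt_chunk(chunk, convergent_key(chunk, secret))
            chunk_id = hashlib.sha256(ciphertext).hexdigest()  # identifier used for de-duplication
            store.setdefault(chunk_id, ciphertext)              # duplicates are stored only once
        return store

Because the combined key depends only on the chunk content and the user's secret key, identical chunks encrypt to identical ciphertexts and receive the same chunk ID, which is what allows de-duplication to operate over encrypted data.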
Fig. 4. Flow Diagram of Security in De-Duplication (Our Proposal).
The procedure is as follows.
Inputs: P = the user's password.
Terminology: Ksym = symmetric key (derived from each chunk of the file); Ksec = secret key derived from the password; Kconv = key obtained by combining the two (Ksec + Ksym).
Step 1: The user enters a user ID and password.
Step 2: We derive the secret key Ksec from the password using HMAC-SHA96.
Step 3: If the user wants to upload a file, we first scan the whole system on the basis of its keywords; if no such keyword has previously been uploaded, we perform the following operations.
Step 4: Partition the whole file into fixed-size chunks, i.e. F = {c1, c2, ..., cn}.
Step 5: Apply a hashing algorithm to each chunk of data to obtain the key Ksym.
Step 6: Use the AES symmetric encryption technique to encrypt each chunk of data with Kconv.
Step 7: After encryption, compute the chunk identifier. We first identify the keywords in the file and then encrypt these keywords using Ksec.
Step 8: Otherwise, if a match is found, the system returns the file ID that matches the keyword; the user then scans the system again on the basis of the chunks of the file ID that the system returned.

Suppose the user wishes to access some content of a file. After the request is submitted, we first encrypt the keyword using the private (secret) key; the system then
scans the whole system; if there is a match, it returns the file ID that contains the keyword. After this we again scan that file ID on the basis of the chunks that contain the keyword in
the file; we perform this operation to find the most relevant file among several files.
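A minimal sketch of this keyword handling is given below, under the assumption (ours, not necessarily the paper's exact construction) that the encrypted keyword, or trapdoor, is an HMAC of the keyword under the user's secret key, so equal keywords map to equal tokens and can be matched in the metadata index without decrypting anything.

    import hashlib
    import hmac


    def trapdoor(keyword, secret_key):
        # Deterministic token for a keyword: equal keywords yield equal tokens.
        return hmac.new(secret_key, keyword.lower().encode(), hashlib.sha256).hexdigest()


    def index_file(file_id, keywords, secret_key, index):
        # Store the encrypted keywords against the file ID in the metadata index.
        for kw in keywords:
            index.setdefault(trapdoor(kw, secret_key), set()).add(file_id)


    def search(keyword, secret_key, index):
        # Return the IDs of files whose encrypted keywords match the query trapdoor.
        return index.get(trapdoor(keyword, secret_key), set())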
Proposed Algorithm. Inputs: F = the files the user previously uploaded; S = the chunk size.
Step 1: In our proposed approach we provide a GUI, where the user has to submit a user ID and password before entering the system (as shown in figure 5).
Fig 5. Login Page
File_upload() {
    // The user uploads files added since the last time he logged into the system.
    File_imp[] files ← files_from_user;
    for each (file in File_imp) {
        partition the file into chunks;          // same as that of Step 2
        calculate the hash of each chunk;
        for each chunk {
            XOR the calculated hash and the secret key;
            encrypt the chunk using the XORed key;
        }
    }
}
Step 5: In this step we compare the file chunks within the file itself, so that we can remove duplicates among the chunks and store a single instance of each chunk of the file.
Step 2: From the user-entered password, we generate a secret key (private key) using HMAC-SHA96.
Compare_different_file(file A, file B, file C) {   // files B and C are already stored
    for each (chunk in A) {
        if (the chunk ID already exists in B or C) {   // duplicate across files
            store a reference to the existing chunk;
        } else {
            store the chunk to the disk;
        }
    }
}
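A small executable sketch of this cross-file comparison is given below; the helper names and the fixed chunk size are our own illustrative choices rather than identifiers from the paper.

    import hashlib

    CHUNK_SIZE = 4096


    def chunk_ids(data):
        # Chunk identifiers are the hashes of each fixed-size chunk.
        return [hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
                for i in range(0, len(data), CHUNK_SIZE)]


    def compare_different_file(new_file, stored_files):
        # Split the new file's chunk IDs into duplicates (kept as references)
        # and new chunks (written to disk).
        existing = {cid for f in stored_files for cid in chunk_ids(f)}
        duplicates, to_store = [], []
        for cid in chunk_ids(new_file):
            (duplicates if cid in existing else to_store).append(cid)
        return duplicates, to_store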
IV. RESULTS

So far we have discussed our proposed algorithm together with the previous approaches and shown how the proposed algorithm works.
Fig 6. Secret key retrieval algorithm
Step 3: Suppose the user now wishes to upload some files; to accomplish this, he performs the following operations.
Fig. 6. Existing Files in the System
Step 4: Suppose the user had already uploaded some files the last time he logged into the system.
In this section we have simulated our work in MATLAB on different file sizes and shown the corresponding bar charts in terms of file size and file type. Figure 6 shows the files already existing in the system. Suppose a user logs into the system and wants to upload a particular file; the size of this file is shown in figure 7. After uploading, we chunk the whole file and perform the operations elaborated above. After chunking, we compare the chunks of the file against the file itself, which reduces the size of the file as shown in figure 8. After this computation, we take the resulting New_File and again perform the comparison operation, but this time the comparison is performed against the existing files. This operation is also shown in figure 8 in the form of bar charts.
Fig. 7. New file that the user wishes to upload
Fig 8. Data De-duplication Operation
V. CONCLUSION

In this paper, we have shown several possible approaches for security in data de-duplication. We have proposed a robust mechanism that addresses some of the issues in secure de-duplication, and we have presented the algorithm, performance and analysis of our proposal. To offer data origin authentication, we derive the secret key from the user's password. The integrity of the information is preserved by deriving the convergent key (secret key + symmetric key, where the symmetric key is derived from the chunks). We also addressed the issue related
to keyword search on encrypted data. Our analysis reveals that the proposal is secure in terms of the definitions specified in the proposed secure data de-duplication model.

REFERENCES
[1] S. Annapureddy, M. J. Freedman, and D. Mazières, "Shark: Scaling file servers via cooperative caching," in Proceedings of the 2nd Symposium on Networked Systems Design and Implementation (NSDI), 2005.
[2] D. Bhagat, K. Pollack, D. D. E. Long, E. L. Miller, J. F. Paris, and T. Schwarz, S. J., "Providing high reliability in a minimum redundancy archival storage system," in Proceedings of the 14th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Sept. 2006.
[3] J. R. Douceur, A. Adya, W. J. Bolosky, D. Simon, and M. Theimer, "Reclaiming space from duplicate files in a serverless distributed file system," in Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS '02), pp. 617–624, Vienna, Austria, July 2002.
[4] W. J. Bolosky, S. Corbin, D. Goebel, and J. R. Douceur, "Single instance storage in Windows 2000," in Proceedings of the 4th USENIX Windows Systems Symposium, pp. 13–24, USENIX, Aug. 2000.
[5] H. S. Gunawi, N. Agrawal, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and J. Schindler, "Deconstructing commodity storage clusters," in Proceedings of the 32nd Int'l Symposium on Computer Architecture, pp. 60–71, June 2005.
[6] S. Quinlan and S. Dorward, "Venti: A new approach to archival storage," in Proceedings of the 2002 Conference on File and Storage Technologies (FAST), pp. 89–101, Monterey, California, USA, 2002, USENIX.
[7] M. O. Rabin, "Fingerprinting by random polynomials," Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
[8] A. Muthitacharoen, B. Chen, and D. Mazières, "A low-bandwidth network file system," in Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), pp. 174–187, Oct. 2001.
[9] L. L. You, K. T. Pollack, and D. D. E. Long, "Deep Store: An archival storage system architecture," in Proceedings of the 21st International Conference on Data Engineering (ICDE '05), Tokyo, Japan, Apr. 2005, IEEE.
[10] M. W. Storer, K. Greenan, D. D. E. Long, and E. L. Miller, "Secure data deduplication," in Proceedings of the 4th ACM International Workshop on Storage Security and Survivability (StorageSS '08), pp. 1–10, New York, NY, USA, 2008, ACM.
[11] S. Halevi, D. Harnik, B. Pinkas, and A. Shulman-Peleg, "Proofs of ownership in remote storage systems," in Proceedings of the 18th ACM Conference on Computer and Communications Security (CCS '11), pp. 491–500, New York, NY, USA, 2011, ACM.
[12] R. C. Merkle, "A digital signature based on a conventional encryption function," in A Conference on the Theory and Applications of Cryptographic Techniques on Advances in Cryptology (CRYPTO '87), pp. 369–378, London, UK, 1988, Springer-Verlag.
[13] W. K. Ng, Y. Wen, and H. Zhu, "Private data deduplication protocols in cloud storage," in Proceedings of the 27th Annual ACM Symposium on Applied Computing (SAC '12), pp. 441–446, New York, NY, USA, 2012, ACM.
[14] R. Di Pietro and A. Sorniotti, "Boosting efficiency and security in proof of ownership for deduplication," in Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security (ASIACCS '12), pp. 81–82, New York, NY, USA, 2012, ACM.
[15] J. Xu, E.-C. Chang, and J. Zhou, "Weak leakage-resilient client-side deduplication of encrypted data in cloud storage," in Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security (ASIA CCS '13), pp. 195–206, New York, NY, USA, 2013, ACM.
[16] N. Singh and N. K. Singh, "Information security in cloud computing using encryption techniques," International Journal of Scientific & Engineering Research, vol. 5, issue 4, pp. 1111–1113, April 2014.
Ninni Singh is a Teaching Assistant (TA) in the Department of Computer Science Engineering (CSE) & Information Communication Technology (ICT) at Jaypee University of Information Technology (JUIT), Waknaghat, Solan-173234, Himachal Pradesh, India. She received her B.E. degree in Computer Science and Engineering (CSE) from Hitkarini College of Engineering and Technology, Jabalpur, Madhya Pradesh, in 2009. She is now pursuing her Master's degree under the supervision of Dr. Hemraj Saini at Jaypee University of Information Technology (JUIT), Waknaghat, Solan-173234. Her research interests include cryptography and network security, distributed systems, and wireless sensor and mesh networks.

Geetanjali Rathee is a Teaching Assistant (TA) in the Department of Computer Science Engineering (CSE) & Information Communication Technology (ICT) at Jaypee University of Information Technology (JUIT), Waknaghat, Solan-173234, Himachal Pradesh, India. She received her B.Tech degree in Computer Science and Engineering (CSE) from Bhagwan Mahavir Institute of Engineering and Technology (BMIET), Haryana, in 2011. She completed her Master's degree in June 2014 under the supervision of Dr. Nitin Rakesh at Jaypee University of Information Technology (JUIT), Waknaghat, and is currently a PhD scholar at the same university. Her research interests include resiliency in wireless mesh networking, routing protocols, networking, and security in wireless mesh networks.

Anum Javeed Zargar is a Teaching Assistant (TA) in the Department of Computer Science Engineering (CSE) & Information Communication Technology (ICT) at Jaypee University of Information Technology (JUIT), Waknaghat, Solan-173234, Himachal Pradesh, India. She received her B.Tech degree in Computer Science and Engineering (CSE) from the Islamic University of Science and Technology, Awantipor, Jammu and Kashmir 192122, in 2012. She is now pursuing her Master's degree under the supervision of Mr. Amit Kumar Singh at Jaypee University of Information Technology (JUIT), Waknaghat, Solan-173234. Her research interests include digital watermarking, digital halftoning and network security.