International Journal of Research in Computer Engineering and Electronics, ISSN 2319-376X, Vol. 3, Issue 6

Cross-user Source-based Data Deduplication in Distributed System without Violating Data Privacy of Individual User

Md. Nadim[1], S.M Motakabber Billah[2], Md. Shahnoor Alam[3], Ashis Kumar Mandal[4], Md. Palash Uddin[5], Md. Rashedul Islam[6]

Abstract— Data deduplication has become one of the most challenging issues for web storage providers because of the continuous, exponential growth in the number of users and in the storage their data requires. In this paper, we propose a cross-user, source-based deduplication solution for any distributed system to increase the effective storage capacity of web servers, which may hold a large amount of redundant data uploaded by different users. The proposed solution also preserves clients' confidential data by applying a private-key encryption technology. A hash algorithm, SHA-1, is used to find the duplicate files uploaded by the clients. Only one of the duplicate copies is stored, and this copy is shared with all the clients who uploaded it through a pointer to the single retained file. The experimental results show that a substantial amount of space is saved as the amount of redundant data uploaded by different users grows.

Index Terms— Cloud Computing, Data deduplication, Distributed System, Data Redundancy, Web based system, Hash Function, Redundant Data, SHA1


1 INTRODUCTION

Cloud computing is becoming increasingly popular because it offers low-cost, on-demand use of large storage and processing resources [3]. In the coming years, as the data stored in cloud storage by digitized organizations grows rapidly, the amount of redundant data will also increase. Hence, one of the major challenges is to manage deduplication of that data. Data deduplication is a procedure that reduces the amount of data that needs to be stored physically by removing redundant information and replacing subsequent copies of it with a pointer to the original [1]. Data deduplication can therefore also be viewed as a data compression method that eliminates redundant data and improves storage and bandwidth efficiency. In recent years, data deduplication has emerged as one of the fastest growing data center technologies. Cross-user deduplication is used in practice to maximize these benefits: it detects redundant data among different users, eliminates the redundancy, and keeps only a single copy of the duplicated data.

[1] Md. Nadim is a Lecturer in the Department of Computer Science and Information Technology, HSTU*. PH: +8801723918502, E-mail: [email protected]
[2] S.M Motakabber Billah completed his B.Sc. in Computer Science and Engineering in 2012 from HSTU. E-mail: [email protected]
[3] Md. Shahnoor Alam is a Jr. Programmer at Live Technologies, Dhaka, Bangladesh.
[4] Ashis Kumar Mandal is an Assistant Professor in the Department of Computer Engineering, HSTU. E-mail: [email protected]
[5] Md. Palash Uddin is a Lecturer in the Department of Computer Science and Information Technology, HSTU. PH: +8801722841941, E-mail: [email protected]
[6] Md. Rashedul Islam is a Lecturer in the Department of Computer Science and Information Technology, HSTU. PH: +8801737334004, E-mail: [email protected]
*HSTU = Hajee Mohammad Danesh Science and Technology University, Dinajpur-5200, Bangladesh.

1.1 Deduplication Approaches

Block-level and file-level are the commonly used deduplication methods. In file-level deduplication, only a single copy of each duplicated file is stored; two or more files are recognized as duplicates if they have the same hash value. This is a very popular type of service offered in multiple products [15], [16]. Block-level deduplication segments files into blocks and stores only a single copy of each duplicate block. This technique can use either fixed-sized blocks [17] or variable-sized chunks [18], [19].

In terms of architecture, there are two basic approaches: source-based (inline) deduplication and target-based (post-processing) deduplication. If the data is deduplicated at the source, redundancies are removed before transmission to the backup target. Source-based deduplication uses less bandwidth for data transmission, but it increases server workload and can increase the time it takes to complete backups. Post-processing is the analysis and removal of redundant data after a backup is complete and the data has been written to storage. This approach improves storage utilization but does not save bandwidth.

1.2 Deduplication Algorithms

Deduplication solutions use a variety of methods to eliminate redundant data by inspecting data down to the bit level and determining whether it has been stored before. The common methods are bit-level comparison between two blocks, hash-based algorithms, or a combination of the two.

1.2.1 Bit-level Comparison

In the bit-level method, the best way to compare two chunks of data is to perform a bit-level comparison on the two blocks [1].


The cost associated with doing this is the input/output required to read and compare them.

1.2.2 Hash-based Algorithms

Hash-based deduplication breaks data into chunks, either fixed or variable length, for block-level deduplication. It then processes the chunks or the whole files with a hashing algorithm to create a hash. The commonly used algorithms for creating a hash of a chunk or file are Secure Hash Algorithm 1 (SHA-1) and Message Digest Algorithm 5 (MD5). A hash is a bit string, typically 128 bits for MD5 and 160 bits for SHA-1, that represents the processed data. If the same data is processed through the hashing algorithm multiple times, the same hash is created each time. If the hash already exists in the index, the data is deemed a duplicate and is not stored; if the hash does not exist, the data is stored and the hash index is updated with the new hash. The working principle of hash-based deduplication is shown below.

Fig. 1. Hash-based data deduplication

In Figure 1, the first data chunks to arrive, namely A, B, C, D, and E, are processed by the hash algorithm, which creates the hashes HA, HB, HC, HD, and HE respectively. Later, the subsequent data chunks A, B, C, D, and F are processed, and a new hash HF is generated only for F. Since the hashes for A, B, C, and D have already been generated, those chunks are deemed to be the same data and are not stored again. Since F generates a new hash, the new hash and the new data are stored. Figure 2 shows the deduplication process.

Fig. 2. Data deduplication process
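To make this lookup concrete, the following minimal PHP sketch (PHP being the language of our prototype) mirrors the flow of Figures 1 and 2: each chunk is hashed with SHA-1 and stored only if its hash is not already in the index. The in-memory arrays and the single-letter chunks are illustrative stand-ins, not part of the actual implementation.

<?php
// Minimal sketch of hash-based deduplication (Fig. 1 and Fig. 2).
// The associative array $index stands in for the persistent hash index;
// $storage stands in for the chunk store. Both are illustrative only.
$index   = array();   // hash => true, for chunks already stored
$storage = array();   // hash => chunk data

function store_chunk($chunk, &$index, &$storage) {
    $hash = sha1($chunk);          // 160-bit SHA-1 digest of the chunk
    if (isset($index[$hash])) {
        return $hash;              // duplicate: reference the existing copy
    }
    $index[$hash]   = true;        // new hash: update the index
    $storage[$hash] = $chunk;      // and store the chunk once
    return $hash;
}

// First batch: A, B, C, D, E are all new and get stored.
foreach (array('A', 'B', 'C', 'D', 'E') as $chunk) {
    store_chunk($chunk, $index, $storage);
}
// Second batch: only F is new; A, B, C, D are deduplicated.
foreach (array('A', 'B', 'C', 'D', 'F') as $chunk) {
    store_chunk($chunk, $index, $storage);
}
echo count($storage) . " unique chunks stored\n";   // prints 6

The same check applies unchanged whether the unit being hashed is a fixed-size block, a variable-size chunk, or a whole file.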

We studied the privacy implications of cross-user deduplication and demonstrated how deduplication can be used as a side channel that reveals information about the contents of other users' files. In the developed distributed system, identical data belonging to different clients appears in a list. The storage provider keeps one copy of each duplicated file, deletes the rest of the list, and shares the remaining file with all the clients who had uploaded the same file, which corresponds to post-process deduplication. Thus a great deal of space is saved and the clients' data remains confidential in the storage. We developed this solution to provide stronger security for clients' data and a backup capability for it. Clients' data is encrypted before being uploaded to the server, so the storage provider is not able to see the content of the clients' files.

2 RELATED WORKS

Several works have been done on data deduplication. The authors of [2] proposed a data deduplication system framework for cloud environments, named DDSF. A proxy-based, policy-driven deduplication mechanism that enables different trust relations among cloud storage providers, deduplication modules, and different security requirements is proposed in [3]. Some researchers have tried to enhance cloud storage performance through real-time feedback control combined with deduplication [4]. In [5] the authors described how deduplication can be used as a side channel that reveals information about the contents of other users' files. The work presented in [6] focused on techniques to speed up the deduplication process. Some researchers have also proposed different chunking algorithms to improve the accuracy of detecting duplicates [7]–[11]. Other research considers the problem of deduplication of multiple data types [12], [13].

3 PROPOSED TECHNIQUE

In the proposed technique, normal users and the admin need to log in to the system through a login panel. After a successful login, the basic functionalities (view, delete, download, upload) are available to normal users. The admin gets the privilege of analyzing the server status with our deduplication tool. If duplicate files exist on the server, he/she receives a message and is prompted to take an action to dedupe the files. When he/she confirms the dedupe process, only one copy of each duplicated file is kept on the server, and a pointer to that file is shared with all the users who had uploaded it. Figure 3 shows the basic mechanism of the proposed deduplication procedure, and Figure 4 shows the activity flow of a normal user and the admin, where the admin can perform the dedupe functionality as part of his/her maintenance activities.

In the working principle of the flowchart for detecting duplicate files shown in Figure 3, the first arriving data files, for example F1, F2, F3, F4, and F5, uploaded by the users are processed by the hash algorithm SHA-1, which creates the hashes HF1, HF2, HF3, HF4, and HF5 for the files respectively. Later, the subsequent data files F1, F2, F3, F4, and F6 uploaded by other users are also processed. A new hash HF6 is generated


only for the file F6. Since the hashes for F1, F2, F3, and F4 have already been generated, those files are deemed to be the same data and are not stored again. Since F6 generates a new hash, the new hash and the new data are stored. In parallel, the proposed technique marks the duplicate files. When the list of all newly uploaded files has been processed in the same manner, the final list of duplicate files is generated for the admin to dedupe.
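The duplicate-detection and dedupe steps of Figure 3 can be sketched in PHP roughly as follows. The sketch assumes the uploaded copies are accessible as server-side files and uses a simple in-memory pointer map; the function names, the owner/path layout, and the pointer map are illustrative only and stand in for the MySQL-backed bookkeeping of the actual system.

<?php
// Sketch of the proposed duplicate detection and dedupe step (Fig. 3).
// $uploads maps an owner key such as "user1/report.pdf" to the path of
// that user's uploaded copy on the server; names are illustrative only.
function find_duplicates(array $uploads) {
    $groups = array();                       // SHA-1 => list of owner keys
    foreach ($uploads as $owner => $path) {
        $hash = sha1_file($path);            // hash the stored file content
        $groups[$hash][] = $owner;
    }
    // Keep only groups with more than one owner: this is the duplicate
    // list shown to the admin before the dedupe is confirmed.
    return array_filter($groups, function ($owners) {
        return count($owners) > 1;
    });
}

function dedupe(array $duplicates, array &$uploads, array &$pointers) {
    foreach ($duplicates as $hash => $owners) {
        $kept = $uploads[$owners[0]];        // keep the first uploaded copy
        foreach (array_slice($owners, 1) as $owner) {
            unlink($uploads[$owner]);        // delete the redundant copy
            $uploads[$owner] = $kept;        // and point the user at the kept file
        }
        foreach ($owners as $owner) {
            $pointers[$owner] = $kept;       // every owner shares one pointer
        }
    }
}

In the implemented system this bookkeeping lives in the MySQL database rather than in memory, so each user record simply references the single retained copy.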

Fig. 3. Data deduplication process of the proposed technique

Fig. 4. User activity diagram

3.1 Database Design

The Entity-Relationship diagram for the developed distributed system supporting deduplication is shown in Figure 5.

Fig. 5. Database design

4 EXPERIMENTAL RESULT AND DISCUSSION

4.1 Sample Data and Calculations

The overall proposed technique was simulated on the Apache HTTP Server using PHP 5.3.8 and MySQL Server 3.1.3. To test our deduplication technique, we took ten random users who uploaded their data to the server and calculated the amount of memory they use to store their uploaded data. Because the users stored their data on the remote storage from multiple sources and no user was aware of the stored data of any other user, we found several duplicate files when we ran the deduplication tool. After applying the deduplication process, we found a remarkable decrease in the amount of memory occupied by the users. To explain the procedure, the sample processed data of the ten users is given in Table 1 before deduplication and in Table 2 after deduplication. Thus, the amount of saved storage is:

Total space reduced = ((22873 - 18174) * 100) / 22873 = 20.54 %

From this experimental data set and result we conclude that the proposed technique saved 20.54% of the storage capacity. So, we can say that the proposed technique is efficient in reducing the storage requirement and thus saves disk space for storing more files. Moreover, the proposed technique preserves the privacy of the data uploaded by each user because it applies an encryption technology, base_64, to the data files before deduplication. Thus, the proposed technique saves a large amount of space for the cloud storage while keeping the privacy of each user.
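For reference, the reported reduction can be reproduced directly from the totals of Tables 1 and 2 with a few lines of PHP; this is a minimal sketch, and the two constants are simply the totals given in the tables.

<?php
// Storage reduction achieved by the dedupe run, using the totals of
// Table 1 (before) and Table 2 (after).
$before = 22873;   // KB, total size of all uploaded files
$after  = 18174;   // KB, total size after deduplication
$saved  = $before - $after;            // 4699 KB
$pct    = ($saved * 100) / $before;    // 20.54...%
printf("Space saved: %d KB (%.2f%%)\n", $saved, $pct);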


TABLE 1
DATA BEFORE DEDUPLICATION (TOTAL SIZE = 22873 KB)

User  File Name                      File Size (KB)  Total (KB)
1     1. Batikbabu.pdf               500             3999
      2. 5(mp3pk.com).mp3            3499
2     1. Desert.jpg                  825             1725
      2. Chrysanthemum.jpg           900
3     1. Tulips.jpg                  500             5000
      2. 03.Banno Ki.MP3             4500
4     1. FhireDekha 71.pdf           230             530
      2. Ekattorer Dinguli.pdf       300
5     1. Ekattorer_sei_Dinguli.pdf   300             510
      2. Earth.jpg                   210
6     1. Fhire _Dekha .pdf           500             520
      2. man_in_blue.gif             20
7     1. songs_pk.mp3                3499            3739
      2. Jellyfish.jpg               240
8     1. Lighthouse.jpg              400             700
      2. Ekattorer Dinguli.pdf       300
9     1. sara jahan.MP3              5000            5400
      2. janeses.jpg                 400
10    1. sonkat.pdf                  350             750
      2. Hydrangeas.jpg              400

TABLE 2
DATA AFTER DEDUPLICATION (TOTAL SIZE REDUCED TO 18174 KB)

User  File Name                      File Size (KB)  Total (KB)
1     1. Batikbabu.pdf               500             3999
      2. 5(mp3pk.com).mp3            3499
2     1. Desert.jpg                  825             1725
      2. Chrysanthemum.jpg           900
3     1. Tulips.jpg                  500             5000
      2. 03.Banno Ki.MP3             4500
4     1. FhireDekha 71.pdf           230             530
      2. Ekattorer Dinguli.pdf       300
5     1. Ekattorer_sei_Dinguli.pdf   300             510
      2. Earth.jpg                   210
6     1. Fhire _Dekha .pdf           _               20
      2. man_in_blue.gif             20
7     1. songs_pk.mp3                _               240
      2. Jellyfish.jpg               240
8     1. Lighthouse.jpg              400             400
      2. Ekattorer Dinguli.pdf       _
9     1. sara jahan.MP3              5000            5400
      2. janeses.jpg                 400
10    1. sonkat.pdf                  350             350
      2. Hydrangeas.jpg              _

4.2 Identification of Duplicate Files

Reading 1:
Type of the file: pdf
Number of duplicate files: 3
Size of all files: 900 KB
Space saved: 600 KB

Reading 2:
Type of the file: mp3
Number of duplicate files: 2
Size of all files: 6998 KB
Space saved: 3998 KB

Reading 3:
Type of the file: image
Number of duplicate files: 2
Size of all files: 800 KB
Space saved: 400 KB

4.3 Comparative Result and Discussion

The data uploaded by the ten users is represented in the graphs of Figures 6 and 7, before and after deduplication by the proposed technique. The graph in Figure 8 illustrates the comparative result, where we can see the remarkable reduction in memory requirement resulting from the implementation of the proposed deduplication technique. Even more storage capacity will be saved as the number of duplicate files from different users increases.

Fig. 6: Before deduplication


Fig. 7: After deduplication

Fig. 8: Comparative result before and after deduplication

Although the proposed technique greatly reduces storage use, it is not free from drawbacks. Since the proposed technique dedupes data at the file level, its deduplication performance is lower than that of block-level deduplication.
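As a rough sketch of the block-level alternative referred to above, a file can be split into fixed-size blocks and the same SHA-1 index applied per block. The 4 KB block size and the in-memory index below are illustrative assumptions and not part of the implemented system.

<?php
// Sketch of a fixed-size block-level alternative: files are split into
// 4 KB blocks and each block is deduplicated by its SHA-1 hash.
// The block size and the in-memory index are illustrative only.
define('BLOCK_SIZE', 4096);

function dedupe_blocks($path, array &$index) {
    $recipe = array();                       // ordered block hashes of the file
    $handle = fopen($path, 'rb');
    while (!feof($handle)) {
        $block = fread($handle, BLOCK_SIZE);
        if ($block === '' || $block === false) {
            break;
        }
        $hash = sha1($block);
        if (!isset($index[$hash])) {
            $index[$hash] = $block;          // store only unseen blocks
        }
        $recipe[] = $hash;                   // the file can be rebuilt from this list
    }
    fclose($handle);
    return $recipe;                          // stored as the file's metadata
}

Storing the per-file recipe is what allows identical blocks to be shared across otherwise different files, which is why block-level deduplication typically finds more redundancy than the file-level approach used here.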

5 CONCLUSION AND FUTURE WORK

As data grows rapidly in cloud storage, there is a high chance of data duplication. We proposed this solution to remove duplicate data in the cloud while the data also remains secured through the base_64 encryption technique. The solution finds duplicate data accurately because we used the Secure Hash Algorithm for this purpose. As a result, this solution is very effective at optimizing the use of storage space. The experimental results show that the proposed technique may help in the future to reduce the enormous requirements for storage. The proposed technique may also motivate storage researchers and engineers to investigate whether and how data reduction techniques such as data deduplication can be integrated into storage systems. Future work includes developing a bit-level comparison deduplication method for the developed distributed system rather than file-level deduplication. The impact of implementing this technique on very large live cloud storage services such as [14] may also be investigated in the future.

REFERENCES

[1] "Data deduplication technology review", http://www.computerweekly.com/report/Data-deduplicationtechnology-review
[2] Jianhua Gu, Chuang Zhang and Wenwei Zhang, "DDSF: A Data Deduplication System Framework for Cloud Environments", CLOSER 4th International Conference on Cloud Computing and Services Science, April 3-5, 2014, Barcelona, Spain, pp. 403-410.
[3] Chuanyi Liu, Yancheng Wang, Jie Lin, "A Policy-based Deduplication Mechanism for Encrypted Cloud Storage", Journal of Computational Information Systems, vol. 10, no. 6, 2014, pp. 2297–2304.
[4] Tin-Yu Wu, Wei-Tsong Lee, Chia Fan Lin, "Cloud Storage Performance Enhancement by Real-time Feedback Control and De-duplication", IEEE Wireless Telecommunications Symposium (WTS), 18-20 April 2012, London.
[5] Danny Harnik, Benny Pinkas, Alexandra Shulman-Peleg, "Side Channels in Cloud Services: Deduplication in Cloud Storage", IEEE Security & Privacy, vol. 8, no. 6, pp. 40-47, November/December 2010, doi:10.1109/MSP.2010.187.
[6] A. Sabaa, P. Kumar et al., "Inline Wire Speed Deduplication System", US Patent App. 12/797,032, 2010.
[7] L. L. You and C. Karamanolis, "Evaluation of efficient archival storage techniques", in Proceedings of the 21st IEEE/12th NASA Goddard Conference on Mass Storage Systems and Technologies, 2004.
[8] E. Kruus, C. Ungureanu, and C. Dubnicki, "Bimodal content defined chunking for backup streams", in Proceedings of the 8th USENIX Conference on File and Storage Technologies, 2010.
[9] J. Min, D. Yoon, and Y. Won, "Efficient deduplication techniques for modern backup operation", IEEE Transactions on Computers, 2011.
[10] A. Muthitacharoen, B. Chen, and D. Mazieres, "A low-bandwidth network file system", in ACM SIGOPS Operating Systems Review, 2001.
[11] K. Eshghi and H. K. Tang, "A framework for analyzing and improving content-based chunking algorithms", Hewlett-Packard Labs Technical Report TR, 2005.
[12] W. Xia, H. Jiang et al., "Silo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput", in Proceedings of the USENIX Annual Technical Conference, 2011.
[13] D. Bhagwat, K. Eshghi, D. D. Long, and M. Lillibridge, "Extreme binning: Scalable, parallel deduplication for chunk-based file backup", in Modeling, Analysis & Simulation of Computer and Telecommunication Systems, 2009.
[14] "Best Cloud Storage Providers and Reviews Online | Top 10 Cloud Storage", http://www.top10cloudstorage.com/
[15] J. R. Douceur, A. Adya, W. J. Bolosky, D. Simon, and M. Theimer, "Reclaiming space from duplicate files in a serverless distributed file system", International Conference on Distributed Computing Systems, 2002.
[16] H. S. Gunawi, N. Agrawal, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and J. Schindler, "Deconstructing commodity storage clusters", Proceedings of the 32nd Annual International Symposium on Computer Architecture, Washington, DC, USA, pp. 60–71, 2005.
[17] A. Muthitacharoen, B. Chen, and D. Mazieres, "A low-bandwidth network file system", in Symposium on Operating Systems Principles, 2001, pp. 174–187. [Online]. Available: http://citeseer.ist.psu.edu/muthitacharoen01lowbandwidth.html
[18] S. Quinlan and S. Dorward, "Venti: a new approach to archival storage", in First USENIX Conference on File and Storage Technologies, Monterey, CA, 2002. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.8085
[19] L. L. You, K. T. Pollack, and D. D. E. Long, "Deep Store: An archival storage system architecture", in Proceedings of the 21st International Conference on Data Engineering (ICDE), IEEE, 2005, pp. 804–815.

