2013 Fourth International Conference on Emerging Intelligent Data and Web Technologies
A Novel Triple Encryption Scheme for Hadoop-based Cloud Data Security
Chao YANG, Weiwei LIN*, Mingqi LIU
School of Computer Engineering and Science, South China University of Technology, Guangzhou, China
[email protected],
[email protected],
[email protected]
Abstract—Cloud computing has been flourishing in recent years because of its ability to provide users with on-demand, flexible, reliable, and low-cost services. With more and more cloud applications becoming available, data security protection becomes an important issue for the cloud. In order to ensure data security in cloud data storage, a novel triple encryption scheme is proposed in this paper, which combines HDFS file encryption using DEA with data key encryption using RSA, and then encrypts the user's RSA private key using IDEA. We implement the triple encryption scheme in Hadoop-based cloud data storage, and experimental studies were conducted to verify its effectiveness.

Keywords—cloud storage; data encryption; data security; Hadoop

I.
INTRODUCTION

Cloud computing is an emerging and increasingly popular computing paradigm that provides users with massive computing, storage, and software resources on demand [1]. Because system resources are essentially shared by many users and applications, an excellent task scheduling scheme is critical to resource utilization and system performance [2]. Cloud computing is currently receiving considerable attention in both academia and industry. With more and more cloud applications becoming available, data security becomes an important issue in cloud computing, and data protection is a critical concern in cloud environments. Studies on distributed storage systems and architectures used in clouds can be found in [3, 4]. A benchmarking approach to identify strengths and weaknesses in different cloud-based data management implementations appears in [5]. Cong Wang et al. [6] stated that data security is a problem in cloud data storage, which is essentially a distributed storage system; they proposed an effective and flexible distribution scheme to ensure the correctness of users' data in cloud data storage, with explicit support for dynamic data operations, including block update, delete, and append, relying on erasure-correcting codes in the file distribution preparation to provide redundancy parity vectors and guarantee data dependability. Siani Pearson et al. [7] gave an overview of privacy issues within cloud computing and a detailed analysis of privacy threats in different types of cloud scenarios; the level of threat varies according to the application area. A security enhancement for Hadoop [8] provides strong mutual authentication using Kerberos: a central server performs access control for files on the storage servers, but since files are stored in cleartext, data confidentiality is broken when a storage server is compromised by an attacker. Various public key encryption schemes with additional functionality have been proposed based on pairing computation. For example, Giuseppe Ateniese et al. [9] proposed proxy re-encryption schemes that support secure data forwarding. Hou Qinghua et al. [10] proposed a secure virtual machine monitor to protect user data privacy in cloud storage. Yu Shu-cheng et al. [11] presented techniques of attribute-based encryption (ABE), proxy re-encryption, and lazy re-encryption for data security and access control in cloud computing. Tahoe [12] is a secure distributed file system with a master-slave structure that supports data confidentiality: stored data are encrypted with the Advanced Encryption Standard (AES), and the decryption keys are managed by the data owner. Recently, Tahoe was adapted to support Hadoop [13], and the integration of Tahoe and Hadoop is a good candidate for a secure storage system in Hadoop. However, Tahoe is not the default file system of Hadoop. Moreover, the data owner has to manage the decryption keys for all files, since each file has a unique decryption key; when a user has many files, the task of key management becomes heavy. In HDFS [14], all files are stored in cleartext and controlled by a central server. Thus, HDFS is not secure against storage servers that may peep at data content. Additionally, Hadoop and HDFS have a weak security model; in particular, the communication between Datanodes, and between clients and Datanodes, is not encrypted. We have conducted studies of Hadoop [15, 16] and implemented a Hadoop-based efficient and economical cloud storage system [4], in the course of which we found data storage security issues. The main purpose of this study is to address the data security issue in Hadoop. In order to ensure data security in Hadoop-based cloud data storage, a novel triple encryption scheme is proposed and implemented, which combines HDFS file encryption using DEA (Data Encryption Algorithm) with data key encryption using RSA, and then encrypts the user's RSA private key using IDEA (International Data Encryption Algorithm).

The remainder of the paper is structured as follows. Section 2 presents HDFS file encryption using DEA, data key encryption with RSA, and the Hadoop implementation. Section 3 discusses the design and implementation of RSA private key encryption using the IDEA algorithm. Section 4 presents a performance evaluation. Finally, Section 5 concludes the paper.

* Corresponding author
978-0-7695-5044-2/13 $26.00 © 2013 IEEE DOI 10.1109/EIDWT.2013.80
II.
DATA HYBRID ENCRYPTION BASED ON DES AND RSA

A. Principle of Data Hybrid Encryption
HDFS files are encrypted using a hybrid encryption method: an HDFS file is symmetrically encrypted with a unique key k, and the key k is then asymmetrically encrypted with the owner's public key. Symmetric encryption is fast, while asymmetric encryption simplifies key distribution but is much more expensive; hybrid encryption is a compromise that combines the advantages of both. Hybrid encryption uses the DES algorithm with a generated Data key to encrypt files, and then uses the RSA algorithm to encrypt the Data key. The user keeps the private key in order to decrypt the Data key. The hybrid encryption principle is shown in Figure 1.

Figure 1. Hybrid encryption principle.

B. File Encryption and Decryption Process
When the user chooses to upload a file to HDFS with file encryption for the first time, the application server generates an RSA public/private key pair and sends the private key back to the client to keep. Figure 2 shows the file encryption and decryption process.
Encrypted file upload process: (1) the user uploads the file; (2) the Data key management module (see Figure 3) generates a DES key; (3) the file is encrypted with the DES key; (4) the API is called to upload the encrypted file to HDFS.
Encrypted file download process: (1) the user requests to download a file; (2) the API is called to fetch the file from HDFS to the application server; (3) the Data key management module is asked for the Data key; (4) the file is decrypted with the Data key; (5) and (6) the decrypted file is returned to the user.

Figure 2. File encryption and decryption process.

The Data key management module maintains the Data key for each encrypted file.

Figure 3. Data key management module.

1. Data key generation
(1) When the user needs to encrypt a file, the user asks the Data key management module for a Data key; (2) the module generates the Data key and returns it to the file encryption/decryption module; (3) according to the user's ID, the file encryption/decryption module finds the user's public key and uses it to encrypt the Data key; (4) the encrypted Data key is stored in the database.

2. Data key acquisition
(1) In order to decrypt a file, the user provides the (user's private key, file ID) record to locate the file's Data key; (2) according to the file ID, the server queries the database and finds the encrypted Data key; (3) the user's private key is used to restore the Data key; (4) the Data key is returned to the file encryption/decryption module to decrypt the file.
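The hybrid (envelope) flow described above — a fresh symmetric Data key per file, itself encrypted under the owner's RSA public key — can be sketched as follows. This is a minimal illustration only: textbook RSA with toy primes and an XOR keystream stand in for the paper's RSA and DES, and nothing here is cryptographically secure.

```python
# Illustrative sketch of the hybrid (envelope) encryption flow.
# Stand-ins: an XOR keystream instead of DES, textbook RSA on tiny primes.
import hashlib
import os

def keystream_cipher(data, key):
    """Toy symmetric cipher (DES stand-in): XOR with a hash-derived keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ s for b, s in zip(data, stream))

# Textbook RSA on small primes (illustration only).
P, Q = 61, 53
N = P * Q                           # modulus (3233)
E = 17                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent

def rsa_encrypt_key(data_key):
    # Encrypt the symmetric Data key byte by byte with the public key.
    return [pow(b, E, N) for b in data_key]

def rsa_decrypt_key(enc_key):
    return bytes(pow(c, D, N) for c in enc_key)

# Upload: generate a Data key, encrypt the file, encrypt the Data key.
data_key = os.urandom(8)                 # DES keys are 8 bytes
plaintext = b"HDFS block contents"
ciphertext = keystream_cipher(plaintext, data_key)
stored_key = rsa_encrypt_key(data_key)   # held by the key management module

# Download: recover the Data key with the private key, then decrypt the file.
recovered_key = rsa_decrypt_key(stored_key)
assert keystream_cipher(ciphertext, recovered_key) == plaintext
```

In a real deployment the per-file key would be a proper DES (or AES) key and the RSA operation would use a full-size modulus with padding; the sketch only shows the data flow between the two ciphers and the key store.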
III.
RSA PRIVATE KEY ENCRYPTION

Although the hybrid encryption method above clearly improves the security level of cloud data storage, the private key kept by the user remains vulnerable: it may be stolen by attackers. Therefore, a key security management module that uses IDEA (International Data Encryption Algorithm) to encrypt the user's private key is proposed and implemented in this study.

A. Status Quo of Private Key File Management and Its Security Risks
If the private key is stolen, the user's data can be cracked. Public/private key encryption systems have mainly focused on the management of public key files; the encrypted storage and management of private key files is hardly studied, mainly because the security of private key files is usually regarded as each company's own issue, to be addressed internally. If the private key is lost and cannot be recovered, the encrypted data can no longer be decrypted; if the private key is stolen, the encrypted data will be leaked. Therefore the private key must be encrypted and backed up in order to protect its security. The management of private key files mainly has three solutions, each with its own security risks:
(1) Local storage. The private key is stored locally and managed by an administrator. The risk is that if the administrator betrays the company and discloses the private key, the company faces huge losses.
(2) Smart-card storage. The private key is stored in a smart card and kept by the user. The risk is that if the smart card is lost, the user's data can easily be stolen.
(3) Web hosting box storage. The private key is hosted in a web hosting box. The risk is that the private key is not fully controlled by its owner; its security depends on the controller of the web hosting box.
Because of these risks, we put forward an encrypted storage management method for private keys, in order to generate, back up, safely restore, and effectively manage the user's private key.

B. Basic Method and Implementation of Encrypted Storage Management for the Private Key File
1. Basic method
Because of the importance of the private key, the system should provide encryption and decryption functions for the private key itself to ensure its security. The private key encryption process is as follows: (1) the password provided by the user is used to encrypt the private key; (2) the encrypted key is transmitted to the server; (3) the server receives the message and stores it in the database.
The private key decryption process is as follows: (1) the user sends a request to the server asking to download the private key; (2) the server authenticates the user's information; (3) if the user exists, the server takes the encrypted private key out of the database and sends it to the user; otherwise, a password error message is sent to the client; (4) the user receives the message and uses the password to decrypt it and obtain the private key. The process is shown in Figure 4, where the solid line represents the procedure that encrypts and stores the private key, and the dotted line represents the procedure that decrypts it.

Figure 4. Process of the private key's encryption and decryption

2. Implementation of encrypted storage management for the private key file
The implementation includes two processes: the encryption and storage of the private key, and the decryption of the private key.
The private key encryption and storage process: (1) the user enters a password, and the IDEA encryption/decryption module encrypts the private key based on the user's password; (2) the encrypted private key is uploaded to the server; (3) the server receives the encrypted private key and temporarily buffers it; the encrypted private key is then sent to HDFS by the server, and at the same time the server saves the (username, encrypted private key file ID) record in its database.
The private key decryption process: (1) the user sends the server a request to retrieve the private key; (2) the server searches for the record in its database; (3) if the record exists, the server takes the user's encrypted private key out of HDFS; (4) the server sends the encrypted private key to the user; (5) the IDEA encryption/decryption module uses the user's password to decrypt the private key; (6) after use, the user deletes the private key. The process is shown in Figure 5, where the solid line represents the private key's encrypted storage process and the dashed line represents the private key's decryption process.
Encrypting and managing the private key in this way achieves a higher security level for the key, but the client needs to install a dedicated key file security management system.
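The password-based protection of the private key described above can be sketched as follows. This is an illustration only: the standard library's PBKDF2 derives the wrapping key from the user's password, and an XOR keystream stands in for IDEA; the key bytes, username, and dictionary "database" are hypothetical placeholders.

```python
# Sketch of password-based private-key wrapping (IDEA stood in by a toy
# XOR keystream; not secure, illustration of the data flow only).
import hashlib
import os

def derive_key(password, salt):
    # Derive a wrapping key from the user's password.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

def xor_stream(data, key):
    """Toy cipher (IDEA stand-in): XOR with a hash-derived keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(4, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

# Client side: wrap the RSA private key under the password before upload.
private_key = b"-----BEGIN RSA PRIVATE KEY----- ..."   # placeholder bytes
salt = os.urandom(16)
wrapped = xor_stream(private_key, derive_key("user-password", salt))

# Server side: store (username, salt, wrapped key). The server never sees
# the password or the plaintext private key.
server_db = {"alice": (salt, wrapped)}

# Retrieval: the client downloads the wrapped key and unwraps it locally.
salt2, wrapped2 = server_db["alice"]
unwrapped = xor_stream(wrapped2, derive_key("user-password", salt2))
assert unwrapped == private_key
```

The design point the sketch preserves is that decryption happens only on the client: the server stores and serves an opaque blob, so a compromised server or database does not by itself reveal the private key.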
Figure 5. Implementation of private key's encryption and decryption

IV.
EXPERIMENTAL RESULTS AND ANALYSIS

A. Experimental Results
To evaluate the performance overhead caused by introducing the triple encryption scheme into HDFS, we conducted experiments on the following three HDFS cluster scenarios:
1. Pseudo-distributed mode, replica = 1. The cluster has only one machine, which acts as both Namenode and Datanode.
2. Fully-distributed mode, replica = 1. The cluster has three machines; one acts as both Namenode and Datanode, and the others act as Datanodes.
3. Fully-distributed mode. The cluster is the same as in Scenario 2 except that replica = 3.
For each scenario, we ran the experiment on 12 different files, repeating each run 10 times, and calculated the average I/O rate under the default HDFS and under the triple encryption scheme. The results are shown in Figures 6, 7, and 8. The reading and writing results for Scenario 1 are summarized in Figure 6; each figure contains two curves, one for each file system. Similarly, the results for Scenario 2 are summarized in Figure 7. The performance overhead of the triple encryption scheme under different replica factors is shown in Figure 8.

B. Analysis of Experimental Results
(1) File writing. Writing in HDFS means uploading files from the client to HDFS. As can be seen from the results of the three scenarios, the file writing rate of the triple encryption scheme is relatively low, staying around 5 MB/s. This is caused by the additional operations of the triple encryption scheme: a) generating a DES key for every uploaded file; b) symmetrically encrypting the file with the DES key; c) asymmetrically encrypting the DES key with the RSA public key; d) storing the encrypted DES key. Although the triple encryption scheme incurs this overhead, it remains in line with the HDFS characteristic that frequent writes to a file are rare.
(2) File reading. Reading in HDFS means downloading files from HDFS to the client. As can be seen from the results of the three scenarios, the reading rate of the triple encryption scheme is lower than that of the default HDFS, but as the cluster becomes more distributed, the relative overhead decreases. This is caused by the additional operations of the triple encryption scheme: a) reading the encrypted Data key; b) asymmetrically decrypting the encrypted Data key; c) symmetrically decrypting the file with the DES key. Although reading under the triple encryption scheme incurs this overhead, it is basically in line with the write-once, read-many characteristic of HDFS.
(3) Replica factor. As can be seen from Figure 8, multiple replicas improve the flexibility of data block selection and may slightly increase the reading rate; the replica factor hardly affects the writing rate.
V.
CONCLUSIONS
This paper focuses on data security protection in the cloud. To ensure data security in cloud data storage, a novel triple encryption scheme is proposed. In the triple encryption scheme, HDFS files are encrypted using hybrid encryption based on DES and RSA, and the user's RSA private key is encrypted using IDEA. The triple encryption scheme is implemented and integrated in Hadoop-based cloud data storage. Experimental results show that the proposed triple encryption scheme is feasible: it matches the reading and writing characteristics of HDFS and enhances the confidentiality of the default HDFS. As future work, we plan to parallelize encryption and decryption using MapReduce, in order to improve the performance of data encryption and decryption.
ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions, which have improved the presentation, quality, and correctness of this paper. This work is financially supported by the National Natural Science Foundation of China (grant No. 61202466), the Guangdong Natural Science Foundation (grants No. S2012030006242 and S2011010001754), Guangdong Provincial Science and Technology Projects (grant No. 2012B010100030), Student Research Projects of South China University of Technology (grants No. 1481 and 2979), and the Fundamental Research Funds for the Central Universities, SCUT (grant No. 201131).

REFERENCES
[1] Rajkumar Buyya, Chee Shin Yeo, and Srikumar Venugopal. Market-Oriented Cloud Computing: Vision, Hype, and Reality for Delivering IT Services as Computing Utilities. Keynote Paper, Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications (HPCC 2008), IEEE CS Press, Dalian, China, Sept. 25-27, 2008.
[2] Weiwei Lin, Chen Liang, James Z. Wang, and Rajkumar Buyya. Bandwidth-aware divisible task scheduling for cloud computing. Software: Practice and Experience, Wiley, 2013 (in press; article first published online 23 Nov. 2012).
[3] Qinlu He, Zhanhuai Li, and Xiao Zhang. Study on Cloud Storage System Based on Distributed Storage Systems. 2010 International Conference on Computational and Information Sciences (ICCIS), pp. 1332-1335, Dec. 2010.
[4] Weiwei Lin, Chen Liang, and Bo Liu. A Hadoop-based Efficient Economic Cloud Storage System. Third Pacific-Asia Conference on Circuits, Communications and System, pp. 1-4, Wuhan, July 2011.
[5] Yingjie Shi, Xiaofeng Meng, Jing Zhao, Xiangmei Hu, Bingbing Liu, and Haiping Wang. Benchmarking cloud-based data management systems. Proceedings of the Second International Workshop on Cloud Data Management (CloudDB '10), ACM, pp. 47-54, 2010.
[6] Rampal Singh, Sawan Kumar, and Shani Kumar Agrahari. Ensuring Data Storage Security in Cloud Computing. IOSR Journal of Engineering, 2(12):17-21, 2012.
[7] Siani Pearson. Taking account of privacy when designing cloud computing services. Proceedings of the 2009 ICSE Workshop on Software Engineering Challenges of Cloud Computing, pp. 44-52, 2009.
[8] Owen O'Malley, Kan Zhang, Sanjay Radia, Ram Marti, and Christopher Harrell. Hadoop security design. https://issues.apache.org/jira/secure/attachment/12428537/securitydesign.pdf, October 2009.
[9] Giuseppe Ateniese, Kevin Fu, Matthew Green, and Susan Hohenberger. Improved proxy re-encryption schemes with applications to secure distributed storage. ACM Transactions on Information and System Security, 9(1):1-30, 2006.
[10] Hou Qinghua, Wu Yongwei, Zheng Weimin, and Yang Guangwen. A Method on Protection of User Data Privacy in Cloud Storage Platform. Journal of Computer Research and Development, 48(7):1146-1154, 2011.
[11] Yu Shu-cheng, Cong Wang, Kui Ren, and Wenjing Lou. Achieving Secure, Scalable, and Fine-grained Data Access Control in Cloud Computing. Proceedings of the 29th Conference on Information Communications, IEEE Press, Piscataway, NJ, USA, pp. 534-542, 2010.
[12] Zooko Wilcox-O'Hearn and Brian Warner. Tahoe: the least-authority filesystem. Proceedings of the 4th ACM International Workshop on Storage Security and Survivability (StorageSS '08), ACM, pp. 21-26, 2008.
[13] Hsiao-Ying Lin, Shiuan-Tzuo Shen, Wen-Guey Tzeng, and Bao-Shuh P. Lin. Toward Data Confidentiality via Integrating Hybrid Encryption Schemes and Hadoop Distributed File System. Proceedings of the 2012 IEEE 26th International Conference on Advanced Information Networking and Applications, IEEE Computer Society, pp. 740-747, 2012.
[14] Apache Hadoop. http://hadoop.apache.org/, 2013.
[15] Lin Wei-wei and Liu Bo. Hadoop Data Load Balancing Method Based on Dynamic Bandwidth Allocation. Journal of South China University of Technology (Natural Science Edition), 40(9):42-47, 2012.
[16] Lin Wei-wei. An Improved Data Placement Strategy for Hadoop. Journal of South China University of Technology (Natural Science Edition), 40(1):152-158, 2012.
Figure 6. Performance comparison in Scenario 1.
Figure 7. Performance comparison in Scenario 2.
Figure 8. Performance comparison in Scenario 2 and Scenario 3.