Improving Encryption Performance using MapReduce

Sanket Desai*, Younghee Park*(1), Jerry Gao*, Sang-Yoon Chang(2), Chungsik Song*

* Computer Engineering Department, San Jose State University, San Jose, USA
{sanket.desai, younghee.park, jerry.gao, chungsik.song}@sjsu.edu
(1) Corresponding author.
(2) Advanced Digital Science Center, Singapore
[email protected]

Abstract. The advanced and readily available cloud infrastructure has resulted in significantly increased offloading of data to the cloud. In fact, many users have become completely reliant on cloud service providers without regard to the safety of their data. Encryption, the foundation of data protection for reliable and secure cloud environments, comes at a high cost as data size increases, presenting an obstacle to the provision of big data security. This paper proposes a framework to reduce encryption costs through MapReduce, which can boost parallel processing and parameter tuning. By using MapReduce, encryption performance is enhanced in terms of execution time with minimal usage of system resources. Our experiments demonstrate the performance benefits realized through MapReduce-based parallel encryption computation.

Keywords—Cryptography, Cloud, MapReduce, Big data security

I. INTRODUCTION
There are many compelling reasons to migrate applications and data to private or public clouds; among these are scalability, rapid elasticity, agility, and cost savings. Companies are increasingly moving their business-related applications and data to the public cloud and exploiting the benefits. As the cloud has evolved, massive amounts of digital information are now being generated and stored on the cloud, originating from various sources including online transactions, emails, posts to social media sites, sensors, and mobile devices. Enterprise and cloud data centers are under pressure to develop fast and effective solutions for communicating and utilizing big data in cloud storage, since being able to capture and analyze this data can benefit a wide range of business purposes. However, along with the benefits, the risk to the safety and privacy of personal and business-critical data has increased due to the lack of security techniques to protect data on the cloud. Encryption technology limits application functionality in using data and is computational-resource intensive. Because of the volume, velocity, and variety of big data, it has become more challenging for the encryption process to be scalable and efficient in the cloud environment.

The increased demand for enhanced security as well as for scalable and efficient encryption schemes for big data on the cloud has driven the adoption and implementation of several encryption algorithms [5], [10]. Encryption schemes involving big data suffer from huge CPU resource consumption and low throughput. The resulting critically important performance issues have led to many efforts to improve encryption schemes, including efforts to apply parallel computing technologies. Parallel computation is a method in which several computations are carried out simultaneously on multiple microprocessors. Multicore and multiprocessor computers, which have multiple processing elements within a single machine, have been used for such parallel processing.

In this paper, we propose a framework that uses MapReduce as a programming model for encrypting large amounts of data in a parallel and distributed fashion. The framework aims to improve the performance of encryption under various conditions. To achieve good performance, we carefully choose a set of configuration parameters used to set up the framework; these parameters affect encryption performance in the MapReduce framework. We evaluate AES encryption using the framework and compare the results to those from a standard sequential AES implementation without MapReduce. The results demonstrate that our approach achieves significant performance gains by using MapReduce along with the selected configuration parameters.

In Section II, we summarize the status of AES encryption in the cloud and recent developments in parallel computation of AES encryption. Section III describes the framework proposed to reduce encryption costs through MapReduce, which can boost parallel processing. We review the configuration parameters used in MapReduce and optimize them to improve performance in Section IV. We evaluate AES encryption using MapReduce in Section V and state our conclusions in Section VI.
II. RELATED WORK AND MOTIVATION

A. Advanced Encryption Standard

The Advanced Encryption Standard (AES) [3], designed by Daemen and Rijmen and standardized by NIST in 2001, has been adopted as the approved standard for a wide range of applications. The algorithm described in AES is a symmetric-key algorithm in which the same key is used for both encryption and decryption of data. The key size in AES can be 128, 192, or 256 bits. The algorithm is a block cipher that works on one fixed 128-bit block of data at a time (the underlying Rijndael design also supports other block sizes). AES is extensively used in practical secure applications for data in the cloud, as shown in Table I. The popularity of the scheme is due to its efficiency and proven security. But AES still has many performance limitations in memory requirements and execution time, especially when applied to big data in the cloud. It also limits many application functionalities, such as the search function, logic operations, and mathematical calculation. A secure and scalable key-management system for a symmetric AES encryption scheme in the cloud environment is another big challenge to be addressed. A minimal example of AES's symmetric encrypt/decrypt cycle follows the table.
TABLE I. POPULAR CRYPTOSYSTEMS IN INDUSTRY

Industry Product                             | Encryption
Cloudera, Navigator Encrypt                  | AES-256
SafeNet, ProtectDB                           | AES, 3DES, DES, RSA, RC4, SHA-1, ACSHA-1
Thales, Hardware Security Modules (HSM)      | AES (128, 192, 256), 3DES, RSA, ECC
CloudLink, RSA Data Protection Manager       | AES-256
HP, Atalla                                   | AES
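As a concrete illustration of the symmetric-key property described above, the following minimal Java sketch (our own, using the standard javax.crypto API; the class name is ours) encrypts and decrypts with the same 256-bit AES key.

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class AesRoundTrip {
  public static void main(String[] args) throws Exception {
    // Generate a 256-bit AES key; 128 and 192 bits are equally valid
    // (older JREs may require the "unlimited strength" policy for 256).
    KeyGenerator kg = KeyGenerator.getInstance("AES");
    kg.init(256);
    SecretKey key = kg.generateKey();

    // Whatever the key size, AES itself operates on fixed 128-bit
    // (16-byte) blocks; CBC mode chains those blocks with an IV.
    byte[] iv = new byte[16];
    new SecureRandom().nextBytes(iv);
    byte[] plaintext = "sensitive cloud data".getBytes(StandardCharsets.UTF_8);

    Cipher enc = Cipher.getInstance("AES/CBC/PKCS5Padding");
    enc.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
    byte[] ciphertext = enc.doFinal(plaintext);

    // The same key decrypts: AES is a symmetric-key algorithm.
    Cipher dec = Cipher.getInstance("AES/CBC/PKCS5Padding");
    dec.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
    System.out.println(Arrays.equals(plaintext, dec.doFinal(ciphertext))); // true
  }
}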
B. AES in Parallel Computation

Parallel implementation of the Advanced Encryption Standard (AES) cryptographic algorithm has been proposed in much research aimed at improving performance. Parallel computation can be performed using multiprocessor computers, that is, single machines with multiple processing elements. Graphics Processing Units (GPUs) offer large potential performance gains over a standard CPU for stream processing applications. This improvement in performance is due to the distributed architecture within the GPU, in the form of large numbers of simple processing units. The implementation of the AES block cipher on GPUs has been investigated by Harrison and Waldron [11]. They showed that the GPU performs best with large packet sizes, and thus it suits applications that require bulk data encryption and decryption. Their work also demonstrates that the GPU can be used effectively as a co-processor, contrary to operating system reports of 100% CPU load during GPU task execution. Nagendra and Sekhar [6] explore the implementation of AES on a dual-core processor using the OpenMP API to reduce execution time. OpenMP (Open Multi-Processing) is an application programming interface, supported by multicore architectures, that provides multithreaded shared-memory parallelism. They show that a parallel implementation of the AES block cipher on a dual-core (Intel Core 2 Duo) processor takes 40% ~ 45% less time to perform encryption and decryption than a sequential implementation.

III. OUR MAPREDUCE FRAMEWORK FOR PARALLELIZING ENCRYPTION

Parallel computation using multicore and multiprocessor computers has limitations when applied to big data in the cloud environment. First, dividing the work into equal-size pieces isn't always easy or obvious in the cloud environment. Second, combining the results from independent processes may require further processing. Third, we are still limited by the processing capacity of a single machine. When we start using multiple machines, a whole host of other factors comes into play, such as coordination and reliability.

MapReduce [4] is a programming model for efficiently processing and generating large data sets on a cluster with a parallel and distributed algorithm. Because of its parallel programming model, MapReduce expedites the processing of large amounts of data. Hadoop [12] is a popular open-source implementation of the MapReduce framework. Hadoop's MapReduce framework is designed for writing applications that process vast amounts of data distributed in the cloud and stored in the Hadoop Distributed File System (HDFS). The framework automatically partitions input data and handles all problems related to consistency and fault tolerance in a large cluster environment. It is useful for large, long-running jobs that cannot be handled within the scope of a single request. It has been used for tasks such as analyzing application logs, aggregating related data from external sources, transforming data from one format to another, and exporting data for external analysis.

Definition of MapReduce Programming: The computation in MapReduce consists of two functions, map and reduce, which take a set of input key/value pairs and produce a set of output key/value pairs. Map takes an input pair and produces a set of intermediate key/value pairs; keys and values are always binary strings. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the reduce function. The reduce function accepts an intermediate key and a set of values for that key, and merges these values together to form a possibly smaller set of values. Typically, zero or one output value is produced per invocation of reduce. An iterator supplies the intermediate values to the user's reduce function.
Fig. 1 illustrates the MapReduce framework: map, shuffle, and reduce. The map function, Map(k1, v1) ⇒ List(k2, v2), works in parallel on every key/value pair in the input dataset. In the shuffle, data is redistributed based on the output keys, such that all data belonging to one key is passed to the same reducer; this can be done by any key-distribution algorithm, such as hashing. The reduce function, Reduce(k2, list(v2)) ⇒ list(v3), is then applied to the list of values for each particular key and produces a collection of output values for that key. The small self-contained sketch below mimics this dataflow outside Hadoop.
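To make these signatures concrete, here is a purely illustrative Java sketch of our own (no Hadoop machinery; the class name is ours) that walks a word-count example through the map, shuffle, and reduce stages in memory.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceModel {
  public static void main(String[] args) {
    List<String> input = Arrays.asList("a rose is", "a rose");

    // Map(k1, v1) => List(k2, v2): emit an intermediate (word, 1)
    // pair for every word in every input line.
    List<Map.Entry<String, Integer>> intermediate = input.stream()
        .flatMap(line -> Arrays.stream(line.split(" ")))
        .map(word -> Map.entry(word, 1))
        .collect(Collectors.toList());

    // Shuffle: group all intermediate values by key (here via hashing).
    Map<String, List<Integer>> grouped = intermediate.stream()
        .collect(Collectors.groupingBy(Map.Entry::getKey,
            Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

    // Reduce(k2, list(v2)) => list(v3): merge the values for each key.
    grouped.forEach((word, ones) -> System.out.println(
        word + " -> " + ones.stream().mapToInt(Integer::intValue).sum()));
    // prints (in hash order): a -> 2, rose -> 2, is -> 1
  }
}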
Figure 1: MapReduce workflow

Using MapReduce for parallelizing encryption: Using MapReduce to parallelize the encryption process can certainly improve performance because, instead of encrypting blocks one by one, multiple mappers can work on encrypting different blocks [9], after which the reducer combines all the blocks and stores them back in HDFS. A number of variations of this process are possible. For instance, a different key can be used to encrypt each individual block, or the same key can be used to encrypt all blocks; a large number of keys can, however, slow the decryption process. In the MapReduce process, the mapper takes pairs of the form <block_id, block>, where block is a part of the data stored in HDFS and block_id uniquely identifies that block. The mapper produces output of the form <block_id, c_data>, where c_data represents the encrypted content of the block. The reducer then takes <block_id, c_data> pairs as input and stores them as contiguous blocks. If different keys are used for encryption, each reducer processes the blocks encrypted with the same key in HDFS files in sequential order. The number of reducers depends on the total number of keys used in encryption as well as on the size of the data. A sketch of this mapper/reducer pairing is given below.
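The following minimal sketch shows how such a mapper/reducer pair could look in the Hadoop Java API. It is illustrative only, not the exact implementation evaluated in Section V: the class names, the single job-wide key passed through the configuration entry aes.key.base64, the assumption that input arrives as <block_id, block> pairs (e.g., from a sequence file), and the derivation of the CTR counter from the block id are all our own choices.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AesMapReduce {

  /** Mapper: consumes <block_id, block> pairs, emits <block_id, c_data>. */
  public static class BlockEncryptMapper
      extends Mapper<LongWritable, BytesWritable, LongWritable, BytesWritable> {

    private SecretKeySpec key;
    private Cipher cipher;

    @Override
    protected void setup(Context ctx) throws IOException {
      try {
        // Hypothetical key distribution: one base64-encoded AES key shared
        // by all blocks (per-block keys are the other variant described above).
        byte[] raw = Base64.getDecoder()
            .decode(ctx.getConfiguration().get("aes.key.base64"));
        key = new SecretKeySpec(raw, "AES");
        cipher = Cipher.getInstance("AES/CTR/NoPadding");
      } catch (GeneralSecurityException e) {
        throw new IOException(e);
      }
    }

    @Override
    protected void map(LongWritable blockId, BytesWritable block, Context ctx)
        throws IOException, InterruptedException {
      try {
        // Derive a distinct CTR counter block from the block id so that each
        // block is encrypted under its own segment of the key stream.
        byte[] iv = ByteBuffer.allocate(16).putLong(blockId.get()).array();
        cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] data = Arrays.copyOf(block.getBytes(), block.getLength());
        ctx.write(blockId, new BytesWritable(cipher.doFinal(data)));
      } catch (GeneralSecurityException e) {
        throw new IOException(e);
      }
    }
  }

  /** Reducer: receives <block_id, c_data> pairs in sorted key order and
   *  writes them back out as contiguous encrypted blocks. */
  public static class BlockWriteReducer
      extends Reducer<LongWritable, BytesWritable, LongWritable, BytesWritable> {

    @Override
    protected void reduce(LongWritable blockId, Iterable<BytesWritable> cData,
        Context ctx) throws IOException, InterruptedException {
      for (BytesWritable c : cData) { // exactly one value per block id
        ctx.write(blockId, c);
      }
    }
  }
}

With per-block keys, the mapper would instead tag each output with a key identifier so that the shuffle routes all blocks encrypted under the same key to the same reducer.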
Figure 2: AES encryption using Hadoop MapReduce

IV. TUNING CONFIGURATION PARAMETERS IN MAPREDUCE

Tuning Hadoop configuration parameters directly affects the performance of MapReduce under various conditions. Table 1 lists the configuration parameters we investigate: eight parameters that impact the performance of AES in MapReduce.

Table 1. Configuration Parameters in MapReduce

Name          | Parameter
Threshold     | mapred.inmem.merge.threshold
Merge         | io.sort.factor
Memory        | mapred.job.shuffle.input.buffer.percent
              | io.sort.mb
              | mapred.job.shuffle.merge.percent
              | mapred.job.reduce.input.buffer.percent
Reducer       | mapred.reduce.tasks
Compression   | mapred.compress.map.output
Here is a detailed description of the configuration parameters with respect to performance tuning (a configuration sketch applying all eight follows this list):

a. Threshold: After a map task completes, its output is copied into the memory buffer of the reducer's tasktracker. When the buffer reaches a threshold number of map outputs, they are merged and written to disk. This threshold is specified by the mapred.inmem.merge.threshold parameter. An appropriate setting of this parameter can save some time in merging, which in turn reduces the overall time consumed by the reducer.

b. Merge: After the map phase completes, the map output is merged while maintaining the sort order. The files are selected for merging and sorting based on the sort factor (io.sort.factor). The default value of this parameter is 10. For jobs in which the mapper output is very large, with a huge number of writes to disk, this factor should be increased.

c. mapred.job.shuffle.input.buffer.percent: The mapper output is stored in the memory buffer of the reducer tasktrackers, and this parameter controls that buffer's size. It specifies the percentage of the heap to be used for storing mapper output for merging and sorting. The value of the parameter depends on the size of the mapper output; for larger mapper output, increasing this parameter decreases the number of disk spills.

d. io.sort.mb: This parameter indicates the size of the buffer used for sorting. The default buffer size is 100 MB. For encryption in larger clusters, the size should be increased, since a large buffer decreases writes to disk and hence decreases overall time.

e. mapred.job.shuffle.merge.percent: This parameter specifies the usage threshold for the tasktracker's buffer. Once the buffer is full, its contents are written to disk.

f. mapred.job.reduce.input.buffer.percent: During the reduce phase, the map output needs to be retained. This parameter specifies the percentage of heap memory to be used for this purpose. The greater the value of this parameter, the less merging spills to disk, reducing I/O time on the local disk during the reduce phase. For frequent I/O operations during the reduce phase, this parameter value should be increased.

g. Reducer: This parameter (i.e., mapred.reduce.tasks) indicates the number of reducers to run for a particular map-reduce job. It depends mainly on the hardware configuration and the amount of data to process. For large clusters, setting the maximum number of reduce tasks on a tasktracker can improve performance.

h. Compression: This parameter (i.e., mapred.compress.map.output) indicates whether the mapper output should be compressed. Its default value is false. Setting it to true saves disk space and can also reduce data transfer time. However, it adds an extra compression layer to the map and reduce processes. For performing encryption in larger clusters, this parameter should be set to true.
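As an illustration, the sketch below applies all eight parameters to a Hadoop configuration in Java, using the Hadoop 1.x parameter names discussed above. The numeric values are the ones reported later in Table 2 (Section V); they were derived for our cluster, so they are a starting point rather than universal settings, and the helper class name is our own.

import org.apache.hadoop.conf.Configuration;

public class TunedJobConfiguration {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    // Values mirror Table 2 in Section V, derived with Starfish for our
    // cluster; treat them as a starting point, not universal settings.
    conf.setInt("mapred.inmem.merge.threshold", 84);                  // (a)
    conf.setInt("io.sort.factor", 49);                                // (b)
    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.734f); // (c)
    conf.setInt("io.sort.mb", 573);                                   // (d)
    conf.setFloat("mapred.job.shuffle.merge.percent", 0.242f);        // (e)
    conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.382f);  // (f)
    conf.setInt("mapred.reduce.tasks", 2);                            // (g)
    conf.setBoolean("mapred.compress.map.output", false);             // (h) true for large clusters
    return conf;
  }
}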
V. IMPLEMENTATION AND EVALUATION

We evaluate the AES algorithm implemented with MapReduce and compare it to a conventional AES implementation. All tests were carried out with Starfish [14] on Hadoop to compare the performance of AES with and without MapReduce. Starfish is a self-tuning system with a profiler and an optimizer. We used the Starfish profiler to collect statistical information about MapReduce programs, and we manually set the value of each configuration parameter for a MapReduce job. The quantities we measured were the total execution time to encrypt a file and the partial execution times for each job. We used Hadoop 1.2.2 with Hadoop CryptoCodec Compressor 0.0.6. The testing machine had eight Intel Xeon E5606 cores at 2.13 GHz and 12M RAM.

In the experiments we used AES with a 256-bit key in Counter Mode (CTR). CTR is simple and creates a pseudorandom stream that is independent of the plaintext. To avoid duplication, different pseudorandom streams are obtained by counting up from different nonces, or initialization vectors (IVs), that are multiplied by a maximum message length. By using different nonces, encryption is possible without per-message randomness. Decryption and encryption are completely parallelizable, and transmission errors affect only the corrupted bits and nothing more. In the MapReduce setup, the mapper first breaks a file into chunks to be encrypted, and the reducer then takes the encrypted chunks as input; the reducers take care of encryption as well as decryption, writing the output file to HDFS. A conventional AES implementation is then used to encrypt the same file for comparison. Hadoop provides a codec framework for compression algorithms. Because encryption algorithms require some additional configuration and methods for key management, we introduced a crypto codec framework, known as Crypto Codec, that builds on the compression codec framework and distinguishes crypto algorithms from compression algorithms [2]. The short demonstration below illustrates why CTR mode parallelizes so naturally.
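The following self-contained Java sketch, our own illustration rather than part of the evaluated code, demonstrates the CTR property that makes this setup parallelizable: a chunk of the message encrypted independently, with its counter advanced by the chunk's block offset, produces exactly the same ciphertext as the corresponding slice of a single-pass encryption.

import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class CtrParallelDemo {
  // Big-endian increment of a 16-byte counter block by n, with carry.
  static byte[] addToCounter(byte[] iv, long n) {
    byte[] out = iv.clone();
    for (int i = 15; i >= 0 && n != 0; i--) {
      long sum = (out[i] & 0xFF) + (n & 0xFF);
      out[i] = (byte) sum;
      n = (n >>> 8) + (sum >>> 8);
    }
    return out;
  }

  public static void main(String[] args) throws Exception {
    KeyGenerator kg = KeyGenerator.getInstance("AES");
    kg.init(256);
    SecretKey key = kg.generateKey();
    byte[] iv = new byte[16]; // the nonce/initial counter (zero for the demo)

    byte[] msg = new byte[64]; // four 16-byte AES blocks
    new SecureRandom().nextBytes(msg);

    // Whole-message CTR encryption in one pass.
    Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
    c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
    byte[] whole = c.doFinal(msg);

    // Encrypt the second half independently: its counter starts 2 blocks in.
    c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(addToCounter(iv, 2)));
    byte[] secondHalf = c.doFinal(Arrays.copyOfRange(msg, 32, 64));

    // The independently encrypted half matches the one-pass ciphertext,
    // which is exactly why CTR encryption can be split across mappers.
    System.out.println(Arrays.equals(
        secondHalf, Arrays.copyOfRange(whole, 32, 64))); // true
  }
}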
Table 2. Values used for each configuration parameter

Parameter                               | Value
mapred.inmem.merge.threshold            | 84
io.sort.factor                          | 49
mapred.job.shuffle.input.buffer.percent | 0.734
io.sort.mb                              | 573
mapred.job.shuffle.merge.percent        | 0.242
mapred.job.reduce.input.buffer.percent  | 0.3820
mapred.reduce.tasks                     | 2
mapred.compress.map.output              | False (for smaller files)

We first evaluate the performance of AES encryption with and without MapReduce. We then investigate the impact of the selected configuration parameters on encryption performance. Table 2 shows the specific value for each configuration parameter; the values were derived with Starfish. A detailed investigation of the parameters is left for future work. Except for Fig. 3, we used a 10 MB file for our tests. We measure two main quantities: the average output written to disk by the reducer, and the execution time for the map and reduce jobs. Finally, we show the total encryption time with the selected configuration parameters.

1) Encryption Performance in MapReduce: Fig. 3 shows encryption performance in the MapReduce framework. Compared to conventional encryption without MapReduce, encryption in MapReduce achieved a 20% ~ 30% improvement in performance. We also found that the execution time for conventional AES to encrypt a 500 MB file was much greater than the time taken by the AES setup with MapReduce. Compared to encryption without MapReduce, we consistently obtained better performance using MapReduce. It is evident that MapReduce allows encryption to take advantage of parallel processing; parallelization is the key factor in MapReduce's ability to improve performance. Figure 3 demonstrates that MapReduce significantly reduces total encryption execution time. The parallel process of splitting the file for encryption at the mapper contributed to the reduction in execution time. Even though we used the crypto codec, the significant performance improvement is due to the way the mapper and reducer assign jobs: both master and slave nodes process in parallel, which makes the difference in encryption performance.
Figure 3. Total execution time for AES encryption without or with MapReduce (conventional AES-256 vs. Hadoop AES, for 1 MB, 5 MB, 50 MB, and 500 MB files; time in seconds)

2) Execution Time at Reducer: We first evaluate the average output written to disk by the reducer. Fig. 4 shows the difference in the average output written to disk by the reducer, in terms of execution time, before and after fine-tuning the configuration parameters. After configuring the parameters for the map-reduce job, the average output written to disk by the reducer decreased by 40%, resulting in an overall improvement in performance. In other words, as the number of writes to disk decreases, execution time and CPU cycles are correspondingly reduced.

Figure 4: Execution time of the reduce job before and after tuning (time in ms)

3) Total Execution Time at MapReduce: We compare the total execution time for the map and reduce jobs in performing encryption in HDFS. Fig. 5 shows the overall execution time for the map-reduce job before and after tuning the configuration parameters. It demonstrates that, in order to improve the execution time for the whole job, the duration of the reduce job should be taken into consideration: compared to the map job, the performance of the reduce job depends more on the configuration parameters. Fig. 6 compares the total time for the entire encryption job before and after tuning the configuration parameters. It clearly demonstrates that we can obtain a performance improvement by configuring the encryption job according to the eight configuration parameters.

Figure 5: Execution time for the map and the reduce job before and after tuning (time in ms)

Figure 6: Total execution time with MapReduce before and after tuning (time in ms)

4) Execution Time at MapReduce: We evaluate the total execution time for the map and reduce jobs. Fig. 7 compares the time of the map and reduce jobs in performing AES encryption in MapReduce. It shows that, after tuning the configuration parameters, the execution time of the mapper increased; but with the significant improvement in the execution time of the reducer, the total time for the whole job was reduced.

Figure 7. Execution time of the map and the reduce job before and after tuning (time in ms)
From our experiments, we drew two conclusions. First, encryption performance can be significantly improved by tuning the configuration parameters. Second, by using MapReduce, parallel processing improves encryption performance in terms of total execution time. As a benchmark, we evaluated the execution time of AES encryption for different file sizes using MapReduce with tuned configuration parameters, and compared the execution times to those obtained using CTR AES-256 in the conventional way. Our results show a 20% ~ 30% performance benefit from the use of the MapReduce framework, with the performance enhancement increasing with file size. We used two reducers in our tests. Because of the multiple reducers, the mapper needs to spend more time partitioning work among them; however, each reducer holds only part of the encrypted file, and the reducers' time can decrease thanks to the parameter settings.
VI. CONCLUSION AND FUTURE WORK

Maintaining confidentiality is of the utmost importance in the age of big data. It is necessary to protect data from leakage as a greater variety of data moves around the Internet. Many organizations are overwhelmed by the need to process big data to provide high-quality data services, and they suffer from shortcomings in designing scalable and efficient techniques both to operate on and to protect big data. Data encryption provides a basic solution for protecting all types of data, yet insufficient attention has been given to the thorough investigation and evaluation of different encryption techniques. Through this project, we have been able to engage in this much-needed scrutiny and analysis of available encryption algorithms through an intensive literature review and market research. This paper has tested the encryption performance of the popular parallel processing platform MapReduce. We encrypted large amounts of data in a parallel and distributed fashion. Our selected MapReduce configuration parameters directly and positively affected MapReduce job performance under various conditions; these parameters must be carefully considered to achieve maximum encryption performance. A fully homomorphic encryption scheme is promising for the cloud environment but is not yet practicable. Homomorphic evaluation of AES has interesting potential as a practical encryption scheme for data in cloud storage [1]. Future work will investigate and implement the proposed MapReduce framework for parallel processing in a fully homomorphic encryption scheme.

REFERENCES
[1] C. Gentry, S. Halevi, and N. P. Smart, "Homomorphic evaluation of the AES circuit," in Advances in Cryptology – CRYPTO 2012 (R. Safavi-Naini and R. Canetti, eds.), Springer, 2012.
[2] Issues.apache.org, "[HADOOP-9331] Hadoop crypto codec framework and crypto codec implementations – ASF JIRA," 2013. [Online]. Available: https://issues.apache.org/jira/browse/HADOOP-9331. [Accessed: 15-Mar-2015].
[3] J. Daemen and V. Rijmen, "Rijndael: The Advanced Encryption Standard," Dr. Dobb's Journal, March 2001.
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," OSDI, 2004.
[5] K. Lauter, M. Naehrig, and V. Vaikuntanathan, "Can homomorphic encryption be practical?" in CCSW, pp. 113–124, ACM, 2011.
[6] M. Nagendra and M. Chandra Sekhar, "Performance improvement of Advanced Encryption Algorithm using parallel computation," International Journal of Software Engineering and Technology, vol. 8, issue 2, pp. 287–296, 2014.
[7] N. Coffey, "Password-based encryption," Javamex.com, 2015. [Online]. Available: http://www.javamex.com/tutorials/cryptography/pbe_key_derivation.shtml. [Accessed: 15-Mar-2015].
[8] P. Rogaway, "Evaluation of some block cipher modes of operation," Technical Report, Cryptography Research and Evaluation Committees (CRYPTREC), 2009.
[9] G. Sujitha, M. Varadharajan, B. Raj Kumar, and S. Mercy Shalinie, "Provisioning MapReduce for improving security of cloud data," Journal of Artificial Intelligence, 6(3), pp. 220–228, 2013.
[10] C. Gentry, "Fully homomorphic encryption using ideal lattices," Symposium on the Theory of Computing (STOC), 2009, pp. 169–178.
[11] O. Harrison and J. Waldron, "Practical symmetric key cryptography on modern graphics hardware," in Proc. of the 17th USENIX Security Symposium, San Jose, CA, 2008, pp. 195–209.
[12] Apache Hadoop. [Online]. Available: https://hadoop.apache.org/
[13] Impetus, "Hadoop Performance Tuning," White Paper, Impetus Technologies Inc., Oct. 2009. [Online]. Available: www.impetus.com.
[14] H. Herodotou, H. Lim, et al., "Starfish: A self-tuning system for big data analytics," in the Fifth Biennial Conference on Innovative Data Systems Research (CIDR), pp. 261–272, 2011.