Proceedings of 2009 Student Conference on Research and Development (SCOReD 2009), 16-18 Nov. 2009, UPM Serdang, Malaysia
Serpent Encryption Algorithm Implementation on Compute Unified Device Architecture (CUDA)

Anas Mohd Nazlee, Fawnizu Azmadi Hussin and Noohul Basheer Zain Ali
Electrical and Electronics Engineering Department, Universiti Teknologi PETRONAS, 31750, Perak, Malaysia
[email protected], [email protected]

Abstract—CUDA is a platform developed by Nvidia for general-purpose computing on the Graphics Processing Unit (GPU) to exploit its parallel processing capabilities. The Serpent encryption algorithm offers a high security margin as its main advantage, but suffers from comparatively low speed. We present a methodology for transforming a CPU-based implementation of the Serpent encryption algorithm (in the C language) into a CUDA implementation that takes advantage of CUDA's parallel processing capability. The proposed methodology can be used to quickly port a CPU-based algorithm for an immediate gain in performance; further tweaking, as described in this paper through the use of a profiler, increases the gain further. Results based on the integration of multiple block encryptions in parallel show a throughput of up to 100MB/s, or more than a 7X performance gain over the CPU implementation.
Keywords—parallel computing; GPU computing

I. INTRODUCTION

Compute Unified Device Architecture (CUDA) was introduced by Nvidia alongside its supported graphics processor architecture for mass consumer graphics hardware. The programming language is an extended C, which provides a lower learning curve for developers than other GPGPU programming languages. The compiler and other toolkits from Nvidia have also proven sufficient for implementing and optimizing programs on CUDA.

The Serpent encryption algorithm was one of the finalists in the Advanced Encryption Standard (AES) selection. Although Serpent was not chosen as the AES, its authors presented a strong case against the selection decision [1]. The algorithm was subsequently released under the General Public License. Serpent's high security margin makes up for the high number of clock cycles and operations needed to perform the encryption.

II. LITERATURE REVIEW

Early GPU computing focused on floating-point computation and the use of parallelism for scientific calculation. Integer operations were performed using the mantissa of the floating-point unit, resulting in inferior performance compared to the CPU [2]. CUDA has overcome this performance degradation, and a successful attempt has been made to implement AES cryptography efficiently on CUDA, reaching a peak throughput of up to 8.28 Gbit/s [3] (equivalent to 1.035 GB/s). Current CUDA hardware delivers a bitwise-operation throughput of eight operations per clock cycle [4].

In the C programming language, a CUDA function is defined as a kernel that is called by specifying the number of thread blocks, the number of threads per block and the parameters needed by the function. Each thread executes the function in parallel and can communicate with the other threads in the same thread block. The threads are organized into thread blocks, and the thread blocks are scheduled onto the Streaming Multiprocessors. Simultaneous block encryption is therefore possible, with each thread encrypting its own data in parallel. All of this is summarized in Fig. 1.

Figure 1. Processing flow in CUDA
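As a concrete illustration of this execution model, the following minimal, self-contained sketch defines a kernel and launches it with an explicit grid and block configuration. The kernel xor_words, the mask value and the problem size are hypothetical examples, not part of the paper's Serpent code.

#include <cuda_runtime.h>

/* Hypothetical kernel: each thread XORs one 32-bit word with a constant.
 * It only illustrates the thread/block indexing, not Serpent itself. */
__global__ void xor_words(unsigned int *data, unsigned int mask, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread number */
    if (idx < n)                                      /* guard the last block */
        data[idx] ^= mask;
}

int main(void)
{
    const int n = 1 << 20;            /* 2^20 words, an arbitrary example size */
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    unsigned int *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(unsigned int));
    cudaMemset(d_data, 0, n * sizeof(unsigned int));

    /* The kernel call specifies the grid and block dimensions in <<< >>>. */
    xor_words<<<blocks, threadsPerBlock>>>(d_data, 0xDEADBEEFu, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}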
Serpent encryption operates as a 32-round SP-network that takes four 32-bit words as input and a key of up to 256 bits. The design of the Serpent algorithm exposes parallelism through bit-slicing [5]. The 4x4 S-boxes within the SP-network have been the focus of previous works [6, 7] seeking to reduce Serpent's clock-cycle count. Different approaches were taken: Gladman's Serpent S-boxes [7] are optimized for the Intel Pentium 4 MMX, while Osvik's Serpent [6] eliminates temporary variables to reduce register usage so that the working set fits in the registers of x86 architecture processors. Both works provide good examples of reducing the number of operations and managing memory for a CUDA implementation.
III. METHODOLOGY
Our approach started with a naïve implementation of the Serpent encryption, i.e., a direct port of the encryption code to CUDA. The reference source code used is the optimized Serpent implementation from the NIST submission. The whole process is summarized in the flow chart in Fig. 2.

Figure 2. Flow chart for methodology (Start → source code acquisition → redefine data structures → identify compute-intensive functions → port compute-intensive functions to CUDA → execute application → benchmark and profile application → End)

We describe the processes of redefining the data structures, executing the parallel encryption and profiling the application in the following sub-sections.

A. Data Structure

The data structures in the reference source code were changed to allow parallel encryption to take place in each thread. The original data structures consist of one-dimensional arrays of four elements for the plain-text and cipher-text blocks (e.g. text[4]), while the key materials are stored in two-dimensional arrays of 33 x 4 elements (e.g. keys[33][4]). Another dimension would need to be added to these arrays to assign a thread number to the plain text and key materials of each thread's encryption. Instead of adding a dimension, we defined data structures of four variables to hold the four 32-bit words of plain text and aligned them to 16-byte boundaries, as shown in Fig. 3, so that CUDA can read each block in a single instruction [4]. A similar structure is used for the keys, except that it holds an array of 33 elements of the four variables. The defined data structures are used with pointers as dynamic data structures to take up a variable number of blocks for encryption.

typedef struct __align__(16) {
    unsigned long x0, x1, x2, x3;
} SER_BLOCK;

typedef struct __align__(16) {
    unsigned long k0, k1, k2, k3;
} SUBKEY;

typedef struct {
    SUBKEY k[33];
} SER_KEY;

typedef struct {
    uint32_t rkey[8];
} RAW_KEYS;

Figure 3. Source code fragment for defined data structures.

B. Parallel Encryption

The encryption function was modified by adding thread number identification to its variables. This method is only applicable when using Electronic Codebook [8] as the block cipher mode, since no intermediate value is taken from a previous or concurrent encryption. Part of the parallel encryption source code is shown in Fig. 4. The variable "idx" in line 4 holds the computed thread number, which selects the data to be encrypted. Each round of the encryption involves keying, passing through the S-box and the linear transformation, as in lines 6, 7 and 8 respectively.

1  __global__ void cuda_encrypt(SER_BLOCK *enc_block,
2                               SER_KEY *keys) {
3
4    int idx = (blockIdx.x * blockDim.x + threadIdx.x);
5
6    enc_block[idx] = keying(enc_block[idx], keys[idx].k[1]);
7    enc_block[idx] = SBOX00(enc_block[idx]);
8    enc_block[idx] = transform(enc_block[idx]);
9    …
10   enc_block[idx] = SBOX31(enc_block[idx]);
11   enc_block[idx] = keying(enc_block[idx], keys[idx].k[32]);
12 }

Figure 4. Fraction of source code for encryption.

The threads are managed through blocks, identified by "blockIdx.x" and "blockDim.x" in Fig. 4. CUDA limits the maximum number of threads per block to 512. The resources used by the encryption are considerable, so the compiler limited the kernel to 256 threads per block. We performed a study of the effect of various numbers of threads per block; it is discussed in Section IV.

The data size determines the total number of threads used for the encryption. An example for 16KB of data is shown in Fig. 5. The number of threads is the data size divided into 16-byte blocks, as each encryption takes 16 bytes of input. The encryption also takes 32KB of user keys for the Key Scheduler to convert into the key materials used by the keying process in each round. In this implementation the user keys are unique to each thread; no two threads share the same key.

Our initial attempt considered only the encryption function for execution on the GPU. The Key Scheduler function was executed on the CPU and the resulting key materials were then transferred to the GPU. This transfer amounted to 528MB of key materials for 16MB of plain text (each thread's key material is 33 subkeys of 16 bytes, i.e. 528 bytes, and 16MB of plain text corresponds to 2^20 threads). It became the bottleneck that hampered the throughput performance of our initial attempt, which yielded only a 70% throughput performance gain.
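For concreteness, the host-side flow of this initial attempt can be sketched as follows. This is a hedged reconstruction rather than the authors' code: the CPU routine name serpent_key_schedule and the driver function are hypothetical, while the types and kernel are those of Figs. 3 and 4.

#include <cuda_runtime.h>

__global__ void cuda_encrypt(SER_BLOCK *enc_block, SER_KEY *keys);   /* Fig. 4 */
void serpent_key_schedule(const RAW_KEYS *uk, SER_KEY *out);         /* assumed CPU routine */

void encrypt_initial_attempt(SER_BLOCK *h_blocks, RAW_KEYS *h_user_keys,
                             SER_KEY *h_keys, int n_blocks)
{
    /* CPU key schedule: 32-byte user key -> 33 x 16 = 528 bytes per thread. */
    for (int i = 0; i < n_blocks; i++)
        serpent_key_schedule(&h_user_keys[i], &h_keys[i]);

    SER_BLOCK *d_blocks;
    SER_KEY *d_keys;
    cudaMalloc((void **)&d_blocks, n_blocks * sizeof(SER_BLOCK));
    cudaMalloc((void **)&d_keys, n_blocks * sizeof(SER_KEY));

    /* For 16MB of plain text (2^20 blocks) this key transfer alone moves
     * 2^20 x 528 bytes = 528MB, the bottleneck identified above. */
    cudaMemcpy(d_blocks, h_blocks, n_blocks * sizeof(SER_BLOCK),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_keys, h_keys, n_blocks * sizeof(SER_KEY),
               cudaMemcpyHostToDevice);

    int tpb = 256;  /* threads per block, the compiler-imposed limit */
    cuda_encrypt<<<(n_blocks + tpb - 1) / tpb, tpb>>>(d_blocks, d_keys);

    cudaMemcpy(h_blocks, d_blocks, n_blocks * sizeof(SER_BLOCK),
               cudaMemcpyDeviceToHost);
    cudaFree(d_blocks);
    cudaFree(d_keys);
}

Moving the key schedule into a GPU kernel, as described below, would replace the 528MB key-material copy with a copy of the 32-byte user keys only (32MB for 16MB of plain text).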
Figure 5. Block diagram for 16KB of data size (1024 threads, idx = 1 to 1024; each thread passes its 32-byte user key through the Key Scheduler to produce key materials Key 1 to Key 33, then takes a 16-byte plain-text block through Rounds 1 to 32 of keying and a final keying with Key 33 to produce a 16-byte cipher-text block; in total, 16KB of plain text plus 32KB of keys yield 16KB of cipher text).
In the complete application, the Key Scheduler is executed in each thread before the encryption. Our initial attempt suffered performance degradation caused by the large transfer of key materials from the host to the graphics card, since the Key Scheduler was executed on the CPU before the key materials were transferred. On the GPU, the Key Scheduler instead receives the user key, which is much smaller than the key materials, and this reduces the transfer time significantly.

C. Application Profile

The CUDA Visual Profiler provides programmers with data about the application's execution, including the execution time and the resources used. The performance of a CUDA application depends on code efficiency: the time spent on computation must be large compared to that spent on memory transfer. We therefore focus on the percentage of GPU time used to perform each task, comparing the application with and without the Key Scheduler executed on the graphics card to examine the fraction of time used for memory transfer.
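The same comparison can be approximated without the Visual Profiler by instrumenting the code with CUDA events. The sketch below is a hedged illustration under the earlier assumptions (types from Fig. 3, kernel from Fig. 4); it measures only the kernel and the cipher-text copy, not every transfer in the application.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void cuda_encrypt(SER_BLOCK *enc_block, SER_KEY *keys);   /* Fig. 4 */

/* Report the split between kernel time and cipher-text transfer time. */
void profile_split(SER_BLOCK *d_blocks, SER_KEY *d_keys,
                   SER_BLOCK *h_blocks, int n_blocks)
{
    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    cudaEventRecord(t0, 0);
    cuda_encrypt<<<(n_blocks + 255) / 256, 256>>>(d_blocks, d_keys);  /* compute */
    cudaEventRecord(t1, 0);
    cudaMemcpy(h_blocks, d_blocks, n_blocks * sizeof(SER_BLOCK),
               cudaMemcpyDeviceToHost);                               /* transfer */
    cudaEventRecord(t2, 0);
    cudaEventSynchronize(t2);

    float compute_ms = 0.0f, transfer_ms = 0.0f;
    cudaEventElapsedTime(&compute_ms, t0, t1);
    cudaEventElapsedTime(&transfer_ms, t1, t2);
    float total = compute_ms + transfer_ms;
    printf("compute: %.1f%%  transfer: %.1f%%\n",
           100.0f * compute_ms / total, 100.0f * transfer_ms / total);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
}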
The bar chart in Fig. 6 shows the percentage of time spent on the GPU performing each task. For the application without the Key Scheduler executed on the GPU, the majority of the time is spent on the memory transfer of the key materials: it occupies more than 50% of the overall time, which is very inefficient. The time taken for memory transfer is minimized by moving the Key Scheduler function onto the GPU. The complete application (i.e. with both the Key Scheduler and the encryption executed on the GPU) has a much larger computation-to-memory-transfer ratio: around 80% of the time is spent on computation and 20% on total memory transfer.

Figure 6. Percentage of time used by GPU (initial attempt vs. complete application, broken down into encryption, plain-text transfer, key transfer and cipher-text transfer).

IV. RESULTS AND DISCUSSIONS

Based on the presented methodology, we collected performance results to study the throughput while varying factors such as the input data size and the number of threads used. The test platform is an Intel Core 2 Duo processor at 2.2GHz with an Nvidia GeForce GTX 260+ 896MB. GNU C Compiler 4.4 and Nvidia CUDA Compiler 2.3 were used. The optimized reference source code, compiled with the "-O3 -march=core2" compiler flags, was used for the CPU comparison; the CUDA source code was compiled without any optimization flag. As the benchmark timings are mostly in the hundreds of milliseconds, we set a 5% range of accepted variation and took the average of three benchmark runs.
Figure 7. Throughput versus data size (CUDA: 100.000, 100.000, 98.765 and 100.629 MB/s at 2, 4, 8 and 16MB respectively; CPU: 14.085, 14.060, 14.085 and 14.072 MB/s).
The throughput results are shown in Fig. 7. The data size is varied to study the scalability of the CUDA encryption in handling different input sizes. The maximum data size tested is 16MB, at which the throughput is still maintained. As stated in Section III, the number of threads is scaled according to the data size; consequently the throughput is essentially flat across data sizes in Fig. 7. The amount of resources used to store the data is large and hardly fits into the other types of memory, which limits the optimization that can be done through memory management.

The improvement achieved by minimizing the memory transfer is significant compared with our initial implementation. The comparison between the complete application and the initial attempt for a data size of 16MB is shown in Fig. 8. The CPU throughput is lower in the second comparison because that benchmark also includes the key scheduling process on the CPU.
Figure 8. Throughput performance comparison for minimizing memory transfer, 16MB data size (CUDA: 100.629 MB/s for the complete application vs. 50.79 MB/s for the initial attempt; CPU: 14.072 MB/s when key scheduling is included vs. 32.52 MB/s for encryption only).
Although the total number of threads is determined by the data size, the number of threads per block must still be defined so that the application allocates the right number of blocks for the given block size. We varied the number of threads per block to study its effect on the throughput performance for further optimization.
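A hedged sketch of this experiment follows: the Fig. 4 kernel is launched with each candidate block size and the throughput is computed from the elapsed time. The helper name and the assumption that the device buffers are already populated are illustrative.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void cuda_encrypt(SER_BLOCK *enc_block, SER_KEY *keys);   /* Fig. 4 */

/* Launch the encryption with varying threads-per-block counts and print
 * the measured throughput; d_blocks/d_keys are assumed already populated. */
void sweep_threads_per_block(SER_BLOCK *d_blocks, SER_KEY *d_keys, int n_blocks)
{
    const int candidates[4] = {32, 64, 128, 256};
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 4; i++) {
        int tpb = candidates[i];
        int grid = (n_blocks + tpb - 1) / tpb;

        cudaEventRecord(start, 0);
        cuda_encrypt<<<grid, tpb>>>(d_blocks, d_keys);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double mbytes = n_blocks * 16.0 / (1024.0 * 1024.0);  /* 16B per block */
        printf("%3d threads/block: %.3f MB/s\n", tpb, mbytes / (ms / 1000.0));
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}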
Figure 9. Throughput for various numbers of threads per block (83.333 MB/s at 32 threads per block; approximately 98-102 MB/s at 64, 128 and 256 threads per block).

The graph in Fig. 9 shows only slight variation in throughput from 256 threads per block down to 64 threads per block. Performance degradation occurs as the number of threads per block is decreased further to 32. The number of threads per block recommended by [4] is at least 16, and decreasing the count further only results in additional performance degradation.

V. CONCLUSION

Direct integration of source code provides the fastest way to implement an algorithm on another architecture that uses the same programming language. The complete application achieved up to 100MB/s throughput, more than a 7X speedup over the original CPU-based implementation. This was achieved by using the profiler to identify the resource-hogging tasks that limited the performance of our initial attempt. The experimental results presented in this paper, supported by another related work [3], suggest that CUDA is capable of handling encryption tasks efficiently.

We have outlined the necessary steps for the implementation of Serpent encryption on CUDA. Currently, we have only implemented the algorithm using bitwise operations. It would be interesting to see whether other algorithms that rely on lookup tables for the permutations and S-boxes could utilize the constant memory cache for a performance gain.

REFERENCES
[1] R. Anderson, E. Biham, and L. Knudsen, "The case for Serpent," March 24, 2000. [Online]. Available: http://www.cl.cam.ac.uk/~rja14/Papers/serpentcase.pdf. Accessed: July 2, 2009.
[2] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha, "Fast computation of database operations using graphics processors," in Proc. ACM SIGMOD, 2004.
[3] S. A. Manavski, "CUDA compatible GPU as an efficient hardware accelerator for AES cryptography," in Proc. IEEE International Conference on Signal Processing and Communications (ICSPC 2007), pp. 65-68, 24-27 Nov. 2007.
[4] Nvidia CUDA Programming Guide 2.2. [Online]. Available: http://developer.download.nvidia.com/compute/cuda/2_21/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.1.pdf. Accessed: July 7, 2009.
[5] R. Anderson, E. Biham, and L. Knudsen, "Serpent: a proposal for the Advanced Encryption Standard." [Online]. Available: http://www.cl.cam.ac.uk/~rja14/Papers/serpent.pdf. Accessed: July 1, 2009.
[6] D. A. Osvik, "Speeding up Serpent," in AES Candidate Conference, New York, NY, USA, April 2000.
[7] B. Gladman. [Online]. Available: http://gladman.plushost.co.uk/oldsite/cryptography_technology/serpent/index.php. Accessed: July 26, 2009.
[8] M. Dworkin, "Recommendation for block cipher modes of operation: methods and techniques," NIST Special Publication 800-38A, 2001.