Real Time Communication between Multiple FPGA Systems in Multitasking Environment Using RTOS

Rourab Paul¹, Sangeet Saha², Suman Sau³, Amlan Chakrabarti⁴ (Student Member, IEEE³; Member, IEEE⁴)
¹Dept. of Electronic Science, ²Dept. of Computer Science and Engineering, ³,⁴A. K. Choudhury School of Information Technology, University of Calcutta, India
Email: {rourabpaul1,sangeet.saha872,sumansau3}@gmail.com, [email protected]

Abstract - The recent development of Field-Programmable Gate Array (FPGA) architectures with soft-core (MicroBlaze) and hard-core (PowerPC) processors, embedded memories and IP cores offers the potential for high computing power. FPGAs are now considered a major platform for high-performance embedded applications, as they offer reconfigurability as well as good clock speed and design resources. As the complexity of embedded applications increases, the use of an operating system brings many advantages. Most present-day embedded systems have real-time requirements that demand a real-time operating system (RTOS), which creates an environment in which real-time applications can be designed and extended easily. In an RTOS the design process is simplified by splitting the application code into separate tasks, which the scheduler executes according to a specific schedule so that real-time deadlines are met. In this work we propose the design and implementation of a real-time FPGA-based application that demonstrates the creation of real-time process tasks in FPGA systems for successful real-time communication between multiple FPGA systems. We have chosen the RSA encryption and decryption algorithm for this implementation, since security is one of the most important needs of data communication. We first demonstrate the real-time execution of multiple process tasks on a single FPGA system for the encryption and decryption of data. Next we describe the most challenging part of our work, in which we establish real-time communication between two FPGA systems, one running the encryption engine and the other the decryption engine, communicating over an RS232 link. The results show that our design achieves better execution speed than existing research works.

I. INTRODUCTION
Reconfigurable systems such as FPGA platforms have the potential to provide the performance benefits of ASICs together with the flexibility of processors. An FPGA is a collection of programmable gates embedded in a flexible interconnect network that can contain hard or soft microprocessors. FPGAs combine the programmability of processors with the performance of custom hardware. Because they offer a useful balance between performance and flexibility, FPGAs have become the primary source of computation in many critical embedded systems. Several works [1, 2, 3] show that FPGA-based embedded systems have become a platform for implementing cryptographic algorithms, which require a large number of bit-level operations that can be performed efficiently on an FPGA. The merit of our paper lies in implementing a cryptographic algorithm with threads run by an RTOS on FPGA systems, which is quite challenging in this reconfigurable architecture domain. An RTOS is advantageous in many respects: it is an efficient tool for optimizing software run time as code complexity grows, by distributing tasks across multiple threads, and it provides better and safer synchronization and resource management. We have chosen Xilkernel [4] as the RTOS and RSA [5], which is popular in the public-key cryptography domain, as the cryptographic algorithm. Compared with symmetric-key cryptography [6], public-key cryptography offers advantages in scalable communication, trusted identification and simpler key management [6]. When choosing an algorithm for FPGA hardware, the relevant factors are processing power, execution time and resource utilization. Existing implementations of the RSA algorithm [5] include solutions such as the Montgomery algorithm for modular multiplication [7, 8], and several Montgomery designs have been proposed for ASIC and FPGA platforms with limited resources to meet execution-time constraints [2, 9, 10]. In [1] the execution time depends on the value of the exponent (the public or private key), whereas the encryption and decryption time of our chosen algorithm, the binary method [11], does not depend on the key values but on the number of modular multiplications, which is governed by the bit length of the exponent. More recently, AES [12] has become one of the most widely used cryptographic algorithms in Internet traffic owing to its high security and parallel processing structure [13].
Parallel computing offers the highest flexibility in this domain; however, although parallel processing increases speed, it also increases heat dissipation and area, which makes it difficult to implement in constrained systems with limited resources, such as real-time embedded systems. This is one of the reasons we adopt RSA in our proposal. Our contributions in this paper can be summarized as follows:
• Establishing reliable communication between multiple FPGA devices is an essential component for developing complex real-time systems used in applications such as real-time data acquisition and processing [1].
• We show that two threads, running separately on two boards, can communicate with each other over an RS232 communication link.
• The algorithm is implemented in hardware using thread execution under an RTOS, and its hardware utilization shows that our implementation is better than existing work [1, 2, 3].
• A large secret key (exponent) increases the security of the encryption process, while the execution time of the binary method [11] that we use for RSA does not grow with the numerical value of the exponent.

The paper is organized as follows. Section II describes the design methodology of our experiments and explains the need for our proposed algorithm and for a real-time operating system. Section III details the real-time implementation of our four experiments. Section IV compares our proposal with existing works, and Section V presents concluding remarks.

II. DESIGN METHODOLOGY
Modular exponentiation is exponentiation performed over a modulus. It is particularly useful in computer science, especially in cryptography. The cryptographic algorithm used in this paper is RSA. The exponentiation heuristics developed for computing $M^e$ are also applicable to computing $M^e \bmod n$ [14]:
$$C = M^e \bmod n$$

A. Hardware-Based Modular Exponentiation
Modular exponentiation means computing the remainder when a positive integer M (the base) raised to the e-th power (e is the exponent) is divided by a positive integer n (the modulus). The first rule of modular exponentiation is that we never compute $M^e$ itself [11], because both M and e may be very large in RSA and storing $M^e$ would require an enormous amount of memory. The intermediate result must therefore be reduced modulo n at each step of the exponentiation.
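To make the last point concrete, the following C sketch computes $M^e \bmod n$ by plain repeated multiplication, reducing after every product so that intermediates stay below n. It is a hypothetical baseline for illustration only (the name `modexp_naive` is ours), not the method of this paper or of [1]. It needs e - 1 multiplications, so its running time grows with the value of e; the binary method of Fig. 1 avoids this. The bound n < 2^32 is an assumption of the sketch so that every product fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Naive baseline (hypothetical name): computes M^e mod n by repeated
 * multiplication, reducing modulo n after every product so that the
 * intermediate value always stays below n. It performs e - 1 modular
 * multiplications, so its cost grows with the value of e.
 * Assumes n < 2^32 so that each product fits in uint64_t. */
uint64_t modexp_naive(uint64_t m, uint64_t e, uint64_t n, uint64_t *mults)
{
    assert(n > 0 && n <= UINT32_MAX);
    *mults = 0;
    if (e == 0)
        return 1 % n;                  /* M^0 = 1 */
    uint64_t base = m % n;
    uint64_t c = base;                 /* M^1 mod n */
    for (uint64_t i = 1; i < e; i++) {
        c = (c * base) % n;            /* reduce at each step */
        (*mults)++;
    }
    return c;
}
```

With the toy key used later in the paper, decrypting with d = 2753 this way costs 2752 modular multiplications.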
If M and e are 256 bits each, storing $M^e$ needs up to $e \log_2 M \le 2^{256} \cdot 256 = 2^{264} \approx 10^{80}$ bits, which is approximately the number of particles in the universe [15]. For comparison, assume there are 512 million ($\approx 2^{29}$) computers in the world, each with 512 MB ($2^{32}$ bits) of memory; the total number of bits available on all computers is then about $2^{61} \approx 10^{18}$. Hence there is no way to compute $M^e$ directly. Many hardware-compatible algorithms exist for RSA exponentiation, e.g. the binary method and the m-ary method [11]. For our implementation the main concerns were resource usage and execution time; for both, the processor should perform as few modular multiplications as possible when computing $M^e \bmod n$. According to [11], the binary method is one of the fastest ways to compute the modular exponentiation without computing $M^e$. In [1], the execution time depends on the value of the exponent e. For highly secure data communication, where the exponents (public and private keys) are large, the encryption and decryption algorithm of [1] would take a long processing time, which would overwhelm the processor in high-speed communication such as [16].

B. Algorithm
We present the RSA algorithm in Fig. 1. The encryption and decryption algorithms have the same steps and differ only in the exponent. The proposed algorithm is a modified version of that in [1], modified to handle the exponentiation more efficiently. The real-time behaviour of the system is analyzed, and its hardware utilization shows that it compares favourably with related work [1].

Proposed RSA Algorithm for encryption
Input: M, e, n
Output: C = M^e mod n
1. if e_{k-1} = 1 then C := M else C := 1
2. for i = k-2 down to 0
   2a. C := C*C (mod n)
   2b. if e_i = 1 then C := C*M (mod n)
3. return C

Proposed RSA Algorithm for decryption
Input: C, d, n
Output: M = C^d mod n
1. if d_{k-1} = 1 then M := C else M := 1
2. for i = k-2 down to 0
   2a. M := M*M (mod n)
   2b. if d_i = 1 then M := M*C (mod n)
3. return M

Fig. 1: Proposed Algorithm
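The steps of Fig. 1 translate directly into C. The sketch below is a minimal software model under stated assumptions (operands below 2^32 so that products fit in `uint64_t`, and the exponent passed together with its bit length k); the function names are hypothetical and this is not the authors' MicroBlaze code. Encryption and decryption call the same routine with different exponents.

```c
#include <assert.h>
#include <stdint.h>

/* Left-to-right binary modular exponentiation following Fig. 1.
 * 'exp' must fit in k bits; n is assumed to be below 2^32 so that the
 * product of two residues fits in uint64_t (a sketch assumption). */
uint64_t modexp_binary(uint64_t base, uint64_t exp, uint64_t n, int k)
{
    assert(n > 1 && n <= UINT32_MAX);
    assert(k >= 1 && k <= 64);
    assert(k == 64 || (exp >> k) == 0);
    base %= n;
    /* Step 1: start from C = M if the top bit e_{k-1} is 1, else C = 1. */
    uint64_t c = ((exp >> (k - 1)) & 1u) ? base : 1;
    /* Step 2: scan the remaining bits from i = k-2 down to 0. */
    for (int i = k - 2; i >= 0; i--) {
        c = (c * c) % n;              /* 2a: modular squaring */
        if ((exp >> i) & 1u)
            c = (c * base) % n;       /* 2b: modular multiplication */
    }
    return c;                         /* Step 3 */
}

/* Encryption and decryption differ only in the exponent. */
uint64_t rsa_encrypt(uint64_t m, uint64_t e, uint64_t n, int k)
{
    return modexp_binary(m, e, n, k);
}

uint64_t rsa_decrypt(uint64_t c, uint64_t d, uint64_t n, int k)
{
    return modexp_binary(c, d, n, k);
}
```

With the key pair from Section IV, `rsa_encrypt(123, 17, 3233, 16)` returns 855 and `rsa_decrypt(855, 2753, 3233, 16)` returns 123. Choosing k larger than the exponent's true bit length is harmless: leading zero bits only square the initial C = 1.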

In our algorithm there are three inputs, M, n and e, each of k bits. The binary method scans the bits of the exponent from left to right [17]; for each bit a squaring is performed (step 2a), followed by a multiplication (step 2b) if the scanned bit is 1. As an example, let e = 01111011 (decimal 123), so k = 8. Since $e_{k-1} = 0$, we start with C = 1. The binary method then proceeds as shown in Table I and yields $C = M^{123}$ after 7 squarings and 6 multiplications, of which the first squaring ($1 \cdot 1$) and the first multiplication ($1 \cdot M$) are trivial. The number of clock cycles for the modular exponentiation therefore depends on the number of modular multiplications, which is bounded by the bit length of the exponent (at most $2(k-1)$), rather than on the numerical value of the exponent.

TABLE I: STEPWISE RESULT OF THE MODULAR EXPONENTIATION ALGORITHM

i       | 6 | 5   | 4   | 3    | 2    | 1    | 0
e_i     | 1 | 1   | 1   | 1    | 0    | 1    | 1
Step 2a | 1 | M^2 | M^6 | M^14 | M^30 | M^60 | M^122
Step 2b | M | M^3 | M^7 | M^15 | M^30 | M^61 | M^123
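Table I can be regenerated mechanically. This hypothetical helper runs the loop of Fig. 1 on exponents instead of values (squaring doubles the exponent j of $C = M^j$, multiplying by M adds one) and counts the operations issued, including the trivial ones performed while C is still 1. It is an illustration of the table, not part of the authors' implementation.

```c
#include <assert.h>
#include <stdio.h>

/* Result of replaying Fig. 1 symbolically on C = M^j. */
typedef struct {
    unsigned final_exp;   /* j such that C = M^j at the end */
    int squarings;        /* executions of step 2a */
    int multiplications;  /* executions of step 2b (bits equal to 1) */
} trace_result;

trace_result trace_binary(unsigned e, int k, int verbose)
{
    assert(k >= 1 && k <= 32);
    trace_result r = {0, 0, 0};
    /* Step 1: C = M (j = 1) if e_{k-1} = 1, else C = 1 (j = 0). */
    unsigned j = ((e >> (k - 1)) & 1u) ? 1u : 0u;
    for (int i = k - 2; i >= 0; i--) {
        unsigned bit = (e >> i) & 1u;
        j *= 2;                       /* 2a: C*C doubles the exponent */
        r.squarings++;
        unsigned after_2a = j;
        if (bit) {
            j += 1;                   /* 2b: C*M adds one to the exponent */
            r.multiplications++;
        }
        if (verbose)  /* M^0 in the output corresponds to "1" in Table I */
            printf("i=%d  e_i=%u  2a: M^%u  2b: M^%u\n", i, bit, after_2a, j);
    }
    r.final_exp = j;
    return r;
}
```

Calling `trace_binary(0x7B, 8, 1)` prints one line per column of Table I and ends at M^123 after 7 squarings and 6 multiplications.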

C. Real-Time Operating System (Xilkernel)
Real-time embedded systems are typically designed to control or process data while meeting specific deadlines, and real-time operating systems (RTOSes) are often used for this purpose. An RTOS can be defined as "a program that schedules execution in a timely manner, manages system resources, and provides a consistent foundation for developing application code" [18]; more concretely, an RTOS is a piece of software that offers a set of APIs with which users develop applications.

RTOSes are typically distinguished from general-purpose OSes by the following criteria: preemptive or priority-based scheduling, predictability in task synchronization, and deterministic behaviour [19]. Another key feature of an RTOS is multitasking, which usually means that the software is divided into tasks, i.e. smaller subsets of the total problem, and that at run time the RTOS creates an environment that provides each task with its own processor [20]. Various RTOSes are available for microcontrollers as well as for FPGA-based designs, e.g. VxWorks, QNX, eCos, LynxOS and RTLinux [21]. We have chosen Xilkernel, the RTOS provided by Xilinx. Xilkernel is a small, robust and modular kernel that is highly integrated with the Platform Studio framework and allows a very high degree of customization [22]. It supports the core features required in a lightweight embedded kernel, with a POSIX API, and runs on both the MicroBlaze and PowerPC 405 processors. Xilkernel has a very low memory footprint of 7-16 KB of BRAM in a multithreaded program [4], which is much smaller than that of the RTOSes used on microcontrollers [19]. Scheduling is also a major issue in a multitasking environment; Xilkernel supports priority-driven, preemptive scheduling with time slicing (SCHED_PRIO) or simple round-robin scheduling (SCHED_RR) [4]. Xilkernel is structured as a library, so user application source files must link with Xilkernel to access its functionality. In Xilkernel a thread is the unit of execution, analogous to a process. Threads are coded like functions, and at least one thread must be spawned by the system at kernel start [4].
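Because Xilkernel exposes a POSIX-style thread API [4], the parent/child structure used later in Experiment 2 (a parent thread that encrypts and then spawns a child thread that decrypts) can be sketched with standard POSIX threads. This is a minimal desktop sketch, not the authors' Xilkernel code: kernel start-up and static thread configuration on the board are not shown, `run_parent` is simply called from the main thread, and per-character encryption with the toy key of Section IV (n = 3233, e = 17, d = 2753) is our assumption. All names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Toy key pair taken from the results section. */
#define RSA_N 3233u
#define RSA_E 17u
#define RSA_D 2753u
#define MAX_CHARS 64

/* Buffers shared between the parent and the child thread. */
typedef struct {
    uint32_t plain[MAX_CHARS];
    uint32_t cipher[MAX_CHARS];
    uint32_t recovered[MAX_CHARS];
    size_t len;
} job_t;

/* Right-to-left square-and-multiply; any correct modexp works here. */
static uint32_t modexp(uint32_t b, uint32_t x, uint32_t n)
{
    uint64_t result = 1 % n, base = b % n;
    while (x > 0) {
        if (x & 1u)
            result = (result * base) % n;
        base = (base * base) % n;
        x >>= 1;
    }
    return (uint32_t)result;
}

/* Child thread: the decryption engine. */
static void *decrypt_thread(void *arg)
{
    job_t *job = arg;
    for (size_t i = 0; i < job->len; i++)
        job->recovered[i] = modexp(job->cipher[i], RSA_D, RSA_N);
    return NULL;
}

/* Parent thread body: encrypt every character, then create the child
 * thread and wait for it. The cipher buffer is complete before
 * pthread_create, so no further locking is needed. */
int run_parent(job_t *job)
{
    assert(job->len <= MAX_CHARS);
    for (size_t i = 0; i < job->len; i++)
        job->cipher[i] = modexp(job->plain[i], RSA_E, RSA_N);
    pthread_t child;
    if (pthread_create(&child, NULL, decrypt_thread, job) != 0)
        return -1;
    return pthread_join(child, NULL);
}
```

On a desktop this builds with `-pthread`. Joining the child keeps the sketch deterministic; on the board the child could instead hand its output to the UART.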

D. Hardware Architectural Design
This work is implemented using Xilinx EDK 11.1, and a Xilinx Spartan-3E FPGA prototyping board is used for the hardware implementation and testing. A soft-core 32-bit RISC processor, MicroBlaze, serves as the CPU of this embedded computing unit, and the required soft-core peripherals are UART 1 (used for the RS232 DCE (Data Circuit-terminating Equipment) port) and UART 2 (used for the RS232 DTE (Data Terminal Equipment) port). The blocks used to build the FPGA-based embedded computing unit are shown in Fig. 2.

E. Serial Communication between FPGAs
Serial communication between the FPGA systems is established with a UART (Universal Asynchronous Receiver Transmitter), whose block diagram is shown in Fig. 3. The BRG (Baud Rate Generator) controls the speed of data communication on the RS232 channel; the sender and the receiver must operate at the same baud rate, otherwise data will be lost. Under the timing of the BRG, received data is first stored in the receive FIFO, and the transmit data FIFO passes outgoing data to the transmitter module (TX module). The MicroBlaze Processor Local Bus (PLB) interface module connects the UART module to the processor and gives the processor access to it, as shown in Fig. 2.

Fig. 2: Block diagram architecture of each FPGA system

III. IMPLEMENTATION
The proposed architecture was synthesized using Xilinx ISE 11.1 [23] and implemented on an XC3S500E Spartan-3E FPGA board [23]. The software for this design is written with the C/C++ code editor and compilation environment provided by the Xilinx Software Development Kit (SDK), which supports creating software platforms and applications for Xilinx embedded processors such as MicroBlaze [23]. We tested the real-time execution of the program on hardware in four ways, described in Fig. 4.

Fig. 3: Universal Asynchronous Receiver Transmitter (UART) system module attached to the Processor Local Bus

EXPERIMENTS

SINGLE BOARD (16 BIT, 32 BIT)

1. WITHOUT RTOS

2. WITH RTOS

BOARD TO BOARD (8 BIT, 16 BIT)

3. WITHOUT RTOS

4. WITH RTOS

Fig. 4: Experiments performed in our implementation
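The experiments run the RS232 links at 9600 bps with no parity, and the sender and receiver must agree on the baud rate. The sketch below shows the kind of arithmetic a baud rate generator performs. It assumes, beyond what the paper states, a divide-by-16 oversampling BRG clocked at the 50 MHz system clock listed in Table III and a frame of 8 data bits and 1 stop bit; the UART core on the board may derive its rate differently, and the names are hypothetical.

```c
#include <assert.h>

/* Integer divisor chosen for the BRG and the rate it actually yields. */
typedef struct {
    unsigned divisor;      /* divider loaded into the BRG */
    double actual_baud;    /* rate actually produced */
    double error_pct;      /* deviation from the requested rate, in % */
} brg_setting;

/* Assumed BRG model: baud = clk / (oversample * divisor). */
brg_setting brg_compute(double clk_hz, double baud, unsigned oversample)
{
    assert(clk_hz > 0.0 && baud > 0.0 && oversample > 0);
    brg_setting s;
    /* Round clk / (oversample * baud) to the nearest integer divisor. */
    s.divisor = (unsigned)(clk_hz / (oversample * baud) + 0.5);
    assert(s.divisor > 0);
    s.actual_baud = clk_hz / ((double)oversample * s.divisor);
    s.error_pct = 100.0 * (s.actual_baud - baud) / baud;
    return s;
}

/* Characters per second for an asynchronous frame:
 * 1 start bit + data bits + parity bits + stop bits per character. */
double bytes_per_second(double baud, int data_bits, int parity_bits, int stop_bits)
{
    return baud / (1 + data_bits + parity_bits + stop_bits);
}
```

Under these assumptions the nearest divisor is 326, giving about 9586 bps (roughly -0.15% error, small compared with the few percent an asynchronous receiver usually tolerates), and an 8N1 frame carries 960 characters per second, about 1.04 ms per character.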

A. Experiments 1 and 2 (Single Board)

TABLE II: COMPARISON OF FPGA RESOURCES USED IN EXPERIMENTS 1 AND 2
Resource type    | Available | Used without RTOS | Used with RTOS | % without RTOS | % with RTOS
Slices           | 4656      | 712               | 757            | 15             | 16
Slice flip-flops | 9312      | 904               | 923            | 9              | 9
4-input LUTs     | 9312      | 1378              | 1440           | 14             | 15

In Experiment 1, real-time data typed on the keyboard is captured by the HyperTerminal application of a host computer and sent to the board over an RS232 serial cable (9600 bps, no parity) connected to the board's DCE port. On receiving the data, the encryption engine executes the encryption algorithm and creates the cipher text. The decryption engine then converts this cipher text back to plain text, which is transferred to the PC's HyperTerminal via the RS232 DCE port (9600 bps, no parity). Fig. 5 shows the HyperTerminal output of this experiment. Experiment 2 repeats the same process using an RTOS, with the encryption and decryption algorithms processed in two different threads run by the RTOS. The kernel starts by executing a main thread that runs the encryption algorithm; we refer to it as the parent thread because it later creates a child thread that runs the decryption algorithm. Fig. 6 gives the block diagram of the whole process. Table II compares the resource utilization of the implementations with and without RTOS.

Fig. 5: HyperTerminal output of the encryption and decryption process without RTOS [Experiment 1]

Fig. 6: Work flow using RTOS (Experiment 2). Blocks: Host PC (via RS232 link); Xilkernel (RTOS) library call; implementation of RTOS; parent thread creation by RTOS; execution of encryption algorithm by parent thread; cipher text creation; creation of child thread by RTOS; execution of decryption algorithm by child thread; plain text creation (via RS232 link to PC).

Fig. 7: HyperTerminal output of the RTOS encryption and decryption process [Experiment 2]

B. Experiments 3 and 4 (Multiple Boards)
This is the most challenging and interesting part of our work, in which two FPGA boards communicate with each other successfully. Real-time data typed on the keyboard is fed into board 1 through its RS232 port (DTE-to-DCE cable), where the encryption process runs. The encrypted plain text, i.e. the cipher text, is sent to board 2 over RS232 (DTE-to-DTE cable). After receiving the cipher text, board 2 decrypts it back into plain text, which is sent to the HyperTerminal of a second PC through RS232 (DCE-to-DTE cable). Fig. 7 shows the HyperTerminal output of these experiments. The architecture is tested both without RTOS and with RTOS. Fig. 8 shows the architectural picture of Experiments 3 and 4, and Fig. 9 shows the encryption result of the sender board in Experiments 3 and 4.

Fig. 8: Architectural picture of Experiment 4
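The paper does not specify how cipher words are framed on the 8-bit UART link between the two boards. One simple possibility, shown here purely as a hypothetical sketch, follows from the toy key of Section IV: with n = 3233 a cipher word can be as large as 3232, which needs 12 bits, so board 1 can send each word as two bytes, high byte first, and board 2 can reassemble them. The function names and byte order are our assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sender side (hypothetical framing): each 16-bit cipher word becomes
 * two bytes, most significant byte first. 'out' must hold 2 * count
 * bytes. Returns the number of bytes written. */
size_t pack_words(const uint16_t *words, size_t count, uint8_t *out)
{
    assert(words != NULL && out != NULL);
    for (size_t i = 0; i < count; i++) {
        out[2 * i]     = (uint8_t)(words[i] >> 8);    /* high byte */
        out[2 * i + 1] = (uint8_t)(words[i] & 0xFFu); /* low byte */
    }
    return 2 * count;
}

/* Receiver side: rebuild words from byte pairs. An odd trailing byte
 * (an incomplete word) is left unconsumed. Returns the word count. */
size_t unpack_words(const uint8_t *bytes, size_t nbytes, uint16_t *out)
{
    assert(bytes != NULL && out != NULL);
    size_t count = nbytes / 2;
    for (size_t i = 0; i < count; i++)
        out[i] = (uint16_t)((bytes[2 * i] << 8) | bytes[2 * i + 1]);
    return count;
}
```

The cipher word 855 (0x357) travels as the bytes 0x03, 0x57. A real link would also need a way to resynchronise after a lost byte, for example a start marker, which this sketch leaves out.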

IV. RESULTS
In our test case we used the public key (n = 3233, e = 17) and the private key (n = 3233, d = 2753), and sent the data 123 to the encryption engine. The encrypted result is $C = 123^{17} \bmod 3233 = 855$, where C is the cipher text (encrypted data). To decrypt C = 855, we used the decryption algorithm and obtained $M = 855^{2753} \bmod 3233 = 123$.
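As a check on these numbers, the encryption can be traced by hand through the steps of Fig. 1, with e = 17 = 10001 in binary (k = 5). The factorization 3233 = 61 × 53 used in the last line to confirm that e and d form a valid key pair is standard arithmetic, not something stated in the paper.

$$
\begin{aligned}
&\text{Step 1 } (e_4 = 1): && C = 123\\
&i = 3\ (e_3 = 0): && C = 123^2 \bmod 3233 = 15129 - 4 \cdot 3233 = 2197\\
&i = 2\ (e_2 = 0): && C = 2197^2 \bmod 3233 = 3173 \equiv -60\\
&i = 1\ (e_1 = 0): && C = (-60)^2 \bmod 3233 = 3600 - 3233 = 367\\
&i = 0\ (e_0 = 1): && C = 367^2 \bmod 3233 = 2136,\quad C = 2136 \cdot 123 \bmod 3233 = 855\\[4pt]
&\text{Key check:} && 3233 = 61 \cdot 53,\quad (61-1)(53-1) = 3120,\quad 17 \cdot 2753 = 46801 = 15 \cdot 3120 + 1
\end{aligned}
$$

Four squarings and one non-trivial multiplication suffice here, compared with 16 multiplications for repeated multiplication by 123.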

Fig. 9: Output of encryption engine [Experiments 3 and 4]

A. Comparison with Existing Works
The performance of our implementation is compared with existing methods in Table III and Table IV. Because the execution time of the proposed algorithm depends on the bit length of the exponent rather than on its value, the execution speed is far better than that of existing works for keys with large values. Table III shows that, for the same private and public keys, our decryption process is 156 times faster than that of [1] (0.143 ms versus 22.34 ms at 50 MHz). Our design uses 63% and 46% fewer slices than the existing designs [24] and [1], respectively.

TABLE III: COMPARISON OF EXECUTION SPEED BETWEEN THE PROPOSED DESIGN AND [1] (CLOCK FREQUENCY 50 MHz)
Process    | Existing [1] clock cycles | Existing [1] time (ms) | Proposed clock cycles | Proposed time (ms)
Encryption | 5176                      | 0.103                  | 4414                  | 0.088
Decryption | 1116878                   | 22.34                  | 7146                  | 0.143

TABLE IV: COMPARISON OF RESOURCE USAGE WITH EXISTING WORKS
Resource type | Proposed | Existing [3] | Existing [1] | Available
4-input LUTs  | 1440     | 3818         | 2667         | 9312

V. CONCLUSION
A highly secure and efficient cryptosystem is much needed, but it is difficult to combine the flexibility of software with the performance of hardware. In this paper, a new architecture is proposed for implementing cryptographic algorithms in a real-time scenario. The design splits a cryptosystem into a cryptography architecture (embedded processor cores and communication between multiple FPGAs) and task-based execution of algorithms (RSA) in an RTOS environment, integrating the advantages of the software and hardware co-design methodology. In future we wish to develop similar approaches for implementing AES, DES, RC4, etc. over an Ethernet communication channel, which we believe will play a key role in the hardware implementation of cryptosystems.

REFERENCES

[1] S. Sau, C. Pal and A. Chakrabarti, "Design and Implementation of Real Time Secured RS232 Link for Multiple FPGA Communication," Proc. of International Conference on Communication, Computing & Security, 2011, ISBN 978-1-4503-0464-1.
[2] C. D. Walter, "Montgomery's Multiplication Technique: How to Make It Smaller and Faster," Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science, No. 1717, Springer, August 1999, pp. 80-93.
[3] A. Mazzeo, L. Romano, G. P. Saggese and N. Mazzocca, "FPGA-Based Implementation of a Serial RSA Processor," Proceedings of the Conference on Design, Automation and Test in Europe, Volume 1, 2003, ISBN 0-7695-1870-2.
[4] xilkernel_v3.00.pdf, www.xilinx.com.
[5] R. L. Rivest et al., "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems," Communications of the ACM, Vol. 21, 1978, pp. 120-126.
[6] Behrouz A. Forouzan, Cryptography & Network Security.
[7] D. J. Guan, "Montgomery Algorithm for Modular Multiplication," August 25, 2003.
[8] John Fry and Martin Langhammer, "RSA & Public Key Cryptography in FPGAs," Altera Corporation, Europe.
[9] A. Tenca and C. Koc, "A Scalable Architecture for Montgomery Multiplication," Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science, No. 1717, 1999, pp. 94-108.
[10] A. Tenca, G. Todorov and C. Koc, "High-Radix Design of a Scalable Modular Multiplier," Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science, No. 2162, Springer, May 2001, pp. 185-201.
[11] Cetin Kaya Koc, "High-Speed RSA Implementation," Version 2.0, November 1994, ftp://ftp.rsa.com/pub/pdfs/tr201.pdf.
[12] http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf.
[13] http://www.design-reuse.com/articles/13981/fpga-implementation-of-aes-encryption-and-decryption.html.
[14] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, 2nd ed., New York: John Wiley and Sons, 1996.
[15] G. B. Arfken, D. F. Griffing, D. C. Kelly and J. Priest, University Physics, San Diego, CA: Harcourt Brace Jovanovich, 1989.
[16] http://www.techmaish.com/maximum-internet-speed-available-in-the-world/.
[17] D. E. Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, Section 4.6.3, 2nd ed., Reading, MA: Addison-Wesley, 1981.
[18] Qing Li and Caroline Yao, Real-Time Concepts for Embedded Systems.
[19] Tran Nguyen Bao Anh and Su-Lim Tan, "Survey and Performance Evaluation of Real-Time Operating Systems (RTOS) for Small Microcontrollers," Renesas Technology Singapore, Singapore Engineering Centre, Singapore 098632, and School of Computer Engineering, Nanyang Technological University, Singapore 639708.
[20] Awais M. Kamboh, Adithya H. Krishnamurthy and Jaya Krishna K. Vallabhaneni, "Demonstration of Multitasking using ThreadX RTOS on Microblaze and PowerPC."
[21] "Operating System for Xilinx Embedded Processor," http://www.em.avnet.com.
[22] Sarat Yoowattana, Chinnapat Nantajiwakornchai and Manas Sangworasil, "A Design of Embedded DMX512 Controller using FPGA and XILKernel," 2009 IEEE Symposium on Industrial Electronics and Applications (ISIEA 2009), Kuala Lumpur, Malaysia, October 4-6, 2009.
[23] http://www.xilinx.com.
[24] M. Ibrahimy, M. B. Reaz, K. Asaduzzaman and S. Hussain, "FPGA Implementation of RSA Encryption Engine with Flexible Key Size," International Journal of Communications, Vol. 1, Issue 3, 2007.
