High-Radix Systolic Modular Multiplication on Reconfigurable Hardware
Ciaran McIvor, Máire McLoone, John V. McCanny
The Institute of Electronics, Communications and Information Technology (ECIT), Queen's University Belfast, Northern Ireland Science Park, Queen's Road, Queen's Island, Belfast, BT3 9DT
E-mail: [email protected], [email protected], [email protected]

Modern cryptographic algorithms can be implemented in hardware, software, or firmware. Each solution has its advantages and disadvantages, but over the past five years some manufacturers of cryptographic solutions have increasingly been integrating hardware cryptographic accelerator chips into their products. There are a number of reasons for this. The first is that hardware accelerators, as the name suggests, increase the throughput rate of cryptographic algorithms when compared with software-only implementations; this is especially true for public-key computations, which tend to be slow in software [1]. This is an especially important characteristic for Internet applications that employ Virtual Private Networks (VPN) or the Secure Socket Layer (SSL) protocol, for example. More and more companies, such as Cisco Systems Inc. and Nortel Networks Ltd., are using VPNs to communicate with their employees in the field [2]. Likewise, the SSL protocol is vital for providing security for Internet browser-based transactions. Therefore, as bandwidth requirements increase, so too does the speed at which encryption must be performed. Companies are realising that the only option is to perform encryption in dedicated hardware, as it is far cheaper to implement a separate encryption system in this way than simply to install more servers [2]. The second reason is that hardware-based cryptographic solutions can provide significant security improvements over software solutions. For example, hardware can protect secret keys and other parameters, as these need only leave the tamper-proof microchip having already been encrypted; it is very difficult to protect such information in a software environment [3].
Abstract
The overall aim of the work presented in this paper has been to develop Montgomery modular multiplication architectures suitable for implementation on modern reconfigurable hardware. Accordingly, novel high-radix systolic array Montgomery multiplier designs are presented, as we believe that their inherently regular structure and absence of global interconnect make them well-suited for implementation on modern FPGAs. Unlike previous approaches, each processing element (PE) comprises both an adder and a multiplier. The inclusion of a multiplier in the PE means that the need to pre-compute or store any multiples of the operands is avoided. This also allows very high-radix implementations to be realised, further reducing the number of clock cycles per modular multiplication, while still maintaining a competitive critical delay. For demonstrative purposes, 512-bit and 1024-bit FPGA implementations using radices of $2^8$ and $2^{16}$ are presented. The resulting throughput rates are the fastest reported to date.
1. Introduction
In recent years the use of cryptography has become more widespread due to the advent and expansion of electronic communication and storage systems. This has been fuelled by the growing popularity of services such as e-mail and online shopping and banking. Therefore, the computer security community has a requirement to produce high-speed, low-cost, and high-strength cryptographic products in order to satisfy demands for real-time encryption.
0-7803-9407-0/05/$20.00 2005 IEEE
ICFPT 2005
interconnect associated with these make them well-suited for implementation on modern FPGAs.

There have been several general-purpose bit-level systolic array Montgomery architectures proposed in the technical literature to date [12–15]. These require an n×n matrix of processing elements (PEs), where n is the modulus bit length and is typically very large (n ≥ 1024 bits for RSA). However, no actual implementations of these arrays have been reported due to the unrealistically large resource requirements of the designs. Nevertheless, some important implementations of FPGA-based systolic Montgomery architectures have been reported. Blum and Paar [16] proposed a radix-2 design, which uses varying PE sizes of 4 bits, 8 bits and 16 bits. The use of larger PE sizes means that the silicon area required in an implementation is reduced significantly. Furthermore, they use only one row of PEs rather than an n×n matrix. Thus, they require n/u PEs in their implementations, where u is the PE size. Each PE requires a 4-bit, 8-bit or 16-bit adder, respectively, which is implemented using the fast carry logic chains located on their target FPGA (Xilinx XC4000 series).

Blum and Paar [17] later presented a high-radix systolic array based on their original system. This reduces the number of clock cycles per modular multiplication, albeit at the expense of an increase in the silicon area required. Their radix-$2^4$ design is based on pre-computing and storing (in registers or RAM) the 15 multiples of the operands B and $\tilde{M}$ (see Algorithm 2.1) and then calling upon these when required. However, there are obvious limitations to this approach. These include having to pre-compute either 15 or 30 ordinary multiplications at the beginning of each new modular multiplication, as well as having to store the 30 multiples (each ≈ n bits in length). Also, the design is not very suitable for radix-$2^k$ designs with k > 4, as this would result in unrealistic pre-computation and storage requirements for the $2^k - 1$ multiples of B and $\tilde{M}$.

The new systolic array architectures and implementations presented in this paper avoid these problems. Unlike previous approaches, each PE comprises both an adder and a multiplier. The inclusion of a multiplier in the PE means that the need to pre-compute or store any multiples of the operands is avoided. This also allows very high-radix implementations to be realised (k > 4), further reducing the number of clock cycles per modular multiplication, while still maintaining a competitive critical delay. For demonstrative purposes, 512-bit and 1024-bit implementations using radices of $2^8$ and $2^{16}$ are presented. The architectures require only one row of m+2 PEs, where the
The extremely fast evolution of Field Programmable Gate Arrays (FPGAs) over the last few years now means that hardware-based cryptography can also offer a great deal of flexibility, a characteristic previously monopolised by software solutions. Flexibility is an important design aspect of a good cryptographic implementation, as it ensures that different key sizes and protocols can be accommodated with relative ease. Together with the reasons already discussed, this makes hardware-based cryptographic solutions a very appealing alternative to software. Within the hardware environment, FPGA technology has some important advantages over custom processors or Application Specific Integrated Circuits (ASICs). Whilst ASICs are normally optimised to perform a specific function very efficiently, FPGAs provide simpler design flows and faster design cycle times, as well as greater flexibility [4], albeit at the expense of efficiency. However, within the last few years, FPGA performance and density have improved significantly, meaning that they are a viable and cost-effective alternative to ASICs [5]. Certain FPGA features, such as embedded arithmetic functions, memory and general-purpose microprocessor cores, make them a very attractive platform for implementing cryptographic applications. The overall hardware view is more complex than this, however, and there are many existing and newly emerging architectures that provide a combination of efficiency and programmability. These include Application Specific Instruction-set Processors (ASIPs), of which the Xtensa processor [6] is a good example, programmable Digital Signal Processors (DSPs), microcontrollers, and application-specific microprocessors or streaming processors such as the Imagine processor [7]. These processors must also be considered as potential platforms for the implementation of cryptographic algorithms, but this is beyond the scope of this paper; the focus here is on FPGA implementation.

Public-key cryptosystems, such as RSA [8] and elliptic curve cryptographic (ECC) schemes [9–10], play a crucial role in modern security protocols. They are based on complex finite field mathematics and, in particular, modular multiplication. The overall aim of the work presented in this paper has been to develop Montgomery modular multiplication [11] architectures suitable for implementation on modern reconfigurable hardware. Accordingly, novel high-radix systolic array Montgomery multiplier designs are presented, as we believe that the inherent regular structure and absence of global
modulus $M = \sum_{i=0}^{m-1} (2^k)^i m_i$, $m_i \in \{0, 1, \ldots, 2^k - 1\}$, and
perform high-radix Montgomery multiplication. Unlike previous approaches, we propose systolic array architectures wherein each PE calculates the k-bit (depending on the choice of radix) multiplications $q_i \tilde{M}$ and $a_i B$ and also performs the necessary additions. As mentioned, this avoids the need to pre-compute or store any multiples and also allows a large radix to be used, reducing the number of clock cycles required to compute a Montgomery multiplication. These architectures are described in Section 3.
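The loop in step 2 of Algorithm 2.1 can be sketched at the big-integer level as follows. This is a minimal Python sketch rather than the hardware design: the function name `montgomery_multiply` and its argument order are illustrative, and it assumes that $M' = -M^{-1} \bmod 2^k$, so that $\tilde{M} \equiv -1 \pmod{2^k}$ and the division by $2^k$ in each iteration is exact.

```python
def montgomery_multiply(A, B, M, k, m):
    """Return S with S = A*B*R^-1 (mod M), R = 2**(k*(m+2)), as in Algorithm 2.1.

    Assumptions (not taken from the paper): M is odd and
    M' = -M^-1 mod 2^k. Requires Python 3.8+ for pow(x, -1, n).
    """
    radix = 1 << k
    mask = radix - 1
    if A >> (k * (m + 2)):
        raise ValueError("A must satisfy a_(m+2) = 0")
    m_prime = (-pow(M, -1, radix)) % radix   # M' = -M^-1 mod 2^k (assumed)
    m_tilde = m_prime * M                    # M~ = (M' mod 2^k) * M

    S = 0                                    # step 1
    for i in range(m + 3):                   # step 2: i = 0 .. m+2
        a_i = (A >> (k * i)) & mask          # i-th radix-2^k digit of A
        q_i = S & mask                       # q_i = S_i mod 2^k
        # S_i + q_i*M~ is a multiple of 2^k because M~ = -1 (mod 2^k)
        S = (S + q_i * m_tilde) // radix + a_i * B
    return S                                 # step 3
```

The returned value is congruent to $ABR^{-1}$ modulo M but is not reduced to the range [0, M); a final reduction modulo M is left to the caller.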
m is the base-$2^k$ logarithm of M. This means that implementation is easily attainable on current FPGAs. Two modular multiplications can be performed in parallel, meaning that 100% utilisation of the PEs is achieved. Moreover, the throughput rate of the array is two full Montgomery multiplications every 3m+7 clock cycles.

The remainder of the paper is structured as follows. In Section 2 the Montgomery multiplication algorithm is described. Section 3 details the high-radix systolic modular multiplication architectures. In Section 4 implementation performance results of these architectures are discussed. A concluding summary is given in Section 5.
3. High-radix Systolic Array Architectures
This section details the novel systolic array architectures for performing Montgomery multiplication. We first describe the PEs before explaining how they are connected to form the systolic array.
2. High-radix Montgomery Modular Multiplication
Montgomery multiplication [11] is a method for performing modular multiplication without the need to perform division by the modulus M. A high-radix version of Montgomery's algorithm [17] is given as Algorithm 2.1 below. Again, $M = \sum_{i=0}^{m-1} (2^k)^i m_i$, $m_i \in \{0, 1, \ldots, 2^k - 1\}$ and $R = 2^{k(m+2)} \pmod{M}$.

3.1. First Processing Element
The systolic array comprises m+1 identical PEs as well as an initial PE, which computes $(S_i + q_i \tilde{M})/2^k$ using the k least significant bits of the operands $S_i$ and $\tilde{M}$. Thus, the division by $2^k$ is performed in this first cell only, leaving the other PEs in the array to perform the k-bit calculations $(S_i + q_i \tilde{M}) + a_i B + \mathrm{CARRY}$, as described in Section 3.2. Figure 3.1 provides a right-to-left diagram of the first PE. The k least significant bits of $S_i$ and $\tilde{M}$ are first multiplied together to form the product $q_i \tilde{M}$ (recall that $q_i = S_i \bmod 2^k$). This product is then added to $S_i$ and a trivial division of the sum by $2^k$ is performed. The k CARRY bits are then propagated to the PE to the left, as described in Sections 3.2 and 3.3.
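The behaviour of the first PE described in Section 3.1 can be illustrated with a small sketch. The function name `first_pe` and its interface are hypothetical. The sketch assumes that the least significant digit of $\tilde{M}$ is $2^k - 1$, which follows if $M' = -M^{-1} \bmod 2^k$; under that assumption the low k bits of the sum are always zero, so the division by $2^k$ reduces to discarding them.

```python
def first_pe(s0, mt0, k):
    """Model of the first PE: return (q_i, CARRY) for one loop iteration.

    s0  : k least significant bits of S_i
    mt0 : k least significant bits of M~ (assumed to equal 2^k - 1)
    """
    mask = (1 << k) - 1
    q_i = s0 & mask                  # q_i = S_i mod 2^k
    total = s0 + q_i * mt0           # k-bit multiply followed by an addition
    if total & mask:
        raise ValueError("low k bits are non-zero: mt0 is not -1 mod 2^k")
    return q_i, total >> k           # trivial division by 2^k leaves the CARRY
```

With k = 8 the CARRY returned for every possible input digit fits in k bits, which agrees with the k CARRY bits mentioned in the text.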
Algorithm 2.1. High-radix Montgomery multiplication

Input:
$A = \sum_{i=0}^{m+2} (2^k)^i a_i$, $a_i \in \{0, 1, \ldots, 2^k - 1\}$, $a_{m+2} = 0$;
$B = \sum_{i=0}^{m+1} (2^k)^i b_i$, $b_i \in \{0, 1, \ldots, 2^k - 1\}$;
$\tilde{M} = (M' \bmod 2^k) M$, $\tilde{M} = \sum_{i=0}^{m} (2^k)^i \tilde{m}_i$, $\tilde{m}_i \in \{0, 1, \ldots, 2^k - 1\}$;
$A, B < 2\tilde{M}$; $4\tilde{M} < 2^{km}$; $M' = -M^{-1}$;
Output: $S_{m+3} = A B R^{-1} \pmod{M}$;
1. $S_0 = 0$
2. for i in 0 to m+2 loop
      $q_i = S_i \bmod 2^k$
      $S_{i+1} = (S_i + q_i \tilde{M})/2^k + a_i B$
   end loop
3. return $S_{m+3} = A B R^{-1} \pmod{M}$

As can be seen from step 2 of Algorithm 2.1, both ordinary addition and multiplication are required to

3.2. General Processing Element
As discussed, the systolic array comprises m+1 identical general PEs, which perform the k-bit calculations $(S_i + q_i \tilde{M}) + a_i B + \mathrm{CARRY}$. Figure 3.2 provides a right-to-left diagram of one of these PEs. Each one processes a particular k bits of the operands $S_i$, $\tilde{M}$ and B, depending on the position of the PE in the array (see Section 3.3). Each PE computes two k-bit multiplications and one 2k-bit 4-input addition. The multiplications and additions are computed in the order shown in Figure 3.2. Once the arithmetic computations have been completed, the result is split into a CARRY and a sum, S. The CARRY bits are propagated from right to left through each PE of the array from the first (Figure 3.1) to the last. The S bits are pumped to the PE immediately to the right (see Section 3.3) to be reused in the next iteration of the for loop in step 2 of Algorithm 2.1. Also, the $q_i$, $a_i$, and RST bits, as well as being used in each PE, are passed from right to left through each PE in the array. The distribution of the CARRY, S, $q_i$, $a_i$, and RST bits is explained in more detail in Section 3.3. The RST signal sets the CARRY and S bits to zero, according to step 1 of Algorithm 2.1.
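The digit-level arithmetic performed by the row of PEs can be sketched as follows. This is a sequential sweep over the PEs for each loop iteration, not a cycle-accurate model of the pipelined array, and all function names are hypothetical. It assumes $M' = -M^{-1} \bmod 2^k$ and that general PE j reads the digits S[j], $\tilde{M}$[j] and B[j-1] and writes S[j-1], which is how the inputs are labelled in Figure 3.3.

```python
def to_digits(x, k, count):
    """Split x into `count` radix-2^k digits, least significant first."""
    mask = (1 << k) - 1
    return [(x >> (k * j)) & mask for j in range(count)]


def general_pe(s_j, mt_j, b_prev, q, a, carry_in, k):
    """Two k-bit multiplications and one 4-input addition, split into (S, CARRY)."""
    total = s_j + q * mt_j + a * b_prev + carry_in
    return total & ((1 << k) - 1), total >> k


def systolic_montgomery(A, B, M, k, m):
    """Digit-serial evaluation of Algorithm 2.1 using m+2 PE models."""
    radix = 1 << k
    m_prime = (-pow(M, -1, radix)) % radix        # assumed definition of M'
    m_tilde = m_prime * M
    if 4 * m_tilde >= 1 << (k * m) or A >= 2 * m_tilde or B >= 2 * m_tilde:
        raise ValueError("operands violate the bounds of Algorithm 2.1")
    mt = to_digits(m_tilde, k, m + 2)
    b = to_digits(B, k, m + 1)
    a = to_digits(A, k, m + 3)
    s = [0] * (m + 2)                             # S[0..m]; S[m+1] stays zero
    for i in range(m + 3):
        q = s[0]                                  # first PE: q_i = S_i mod 2^k
        carry = (s[0] + q * mt[0]) >> k           # first PE: divide by 2^k
        for j in range(1, m + 2):                 # general PEs 1 .. m+1
            s[j - 1], carry = general_pe(s[j], mt[j], b[j - 1], q, a[i], carry, k)
        if carry:
            raise ValueError("unexpected CARRY out of the last PE")
    return sum(d << (k * j) for j, d in enumerate(s))
```

Because each PE only exchanges values with its immediate neighbours, the same arithmetic maps onto the nearest-neighbour wiring of the array; the sketch ignores the clocking and the interleaving of two multiplications.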
3.3. Systolic Montgomery Multiplier
The first and general PEs described in Sections 3.1 and 3.2, respectively, have been connected to form a systolic Montgomery multiplier, as shown in Figure 3.3. For demonstrative purposes, only three general PEs are shown in the diagram (in reality there are m+1 general PEs). At the beginning of each new modular multiplication, the input RST signal is set high for one clock cycle to initialise S[0] to zero. The high value of the RST signal is then pumped from right to left through the array, so that after one clock cycle S[1] is set to zero, after two cycles S[2] is initialised to zero, and so forth. This ensures that the precondition at step 1 of Algorithm 2.1 is met. The $q_i$ and $a_i$ signals are pumped through the array in a similar manner so that the correct values are available at the corresponding PE when needed. The inputs $\tilde{M}$[0] and S[0] to the First PE correspond to the least significant k bits of the $\tilde{M}$ and S signals, respectively. Likewise, $\tilde{M}$[1] and S[1] correspond to the next k bits of $\tilde{M}$ and S, and so on.
Figure 3.1. Systolic array first PE
[Diagram not reproduced.]

Figure 3.2. Systolic array general PE
[Diagram not reproduced. Labelled signals: CARRYIN, CARRYOUT, SIN, SOUT, RSTIN, RSTOUT, Ai, B, M, qi, CLK.]
Figure 3.3. High-radix systolic Montgomery multiplier
[Diagram not reproduced. It shows the First PE, with inputs M[0] and S[0], followed to its left by general PEs in alternating odd and even positions with inputs M[1] B[0], M[2] B[1] and M[3] B[2]; the qi, Ai, RST and CARRY signals pass between neighbouring PEs.]
architecture is general and is suitable for implementation on alternative FPGA families such as those produced by Altera, Actel, or Lattice, and indeed on more recently developed Xilinx families such as the Virtex-4. Operand lengths of 512 bits and 1024 bits and radices of $2^8$ and $2^{16}$ are used. The XC2VP40-7-1148 and XC2VP70-7-1517 devices were used for the 512-bit and 1024-bit implementations, respectively. Tables 4.1 and 4.2 provide performance results obtained using Xilinx Foundation software v6.1.03i. The ordinary multiplications and additions have been implemented using the embedded 18×18-bit multipliers and fast carry logic chains (located on the Xilinx Virtex2 Pro devices), respectively. The registers shown in Figures 3.1 and 3.2 are implemented using the flip-flops available in each CLB slice.
Once a new modular multiplication has started, the CARRY from the First PE is ready after one clock cycle and is subsequently pumped to the PE to the left. Therefore, after two cycles S[0] and the other outputs from the first general PE are ready, after three cycles the outputs from the second general PE are ready, and so on. It is noted that during odd-numbered clock cycles the even-positioned PEs are idle, as these are awaiting the updated S signals from the PE to the left. Likewise, during even-numbered cycles the odd-positioned PEs are idle. However, 100% utilisation can be obtained by performing two modular multiplications in parallel, so that both the odd and even-numbered PEs are used on each cycle. This is achieved by beginning a second multiplication exactly one clock cycle after the first. After the m+3 iterations of the for loop of Algorithm 2.1 have executed, the k least significant bits S[0] of the result $S_{m+3} = A B R^{-1} \pmod{M}$ of the first modular multiplication are ready. This takes 2(m+3) clock cycles. After a further m cycles the full result is ready. Moreover, after one further clock cycle the full result of the second modular multiplication is also complete. Thus, two full Montgomery multiplications can be computed in 2(m+3) + m + 1 = 3m+7 clock cycles, using m+2 PEs.
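The interleaving of two multiplications can be checked with a small scheduling sketch. The timing model is an assumption made for illustration: PE j (0 for the First PE, 1 to m+1 for the general PEs) is taken to handle loop iteration i during clock cycle start + 2i + j, which is consistent with the description above but is not stated in this form in the text. The function names are hypothetical.

```python
def busy_slots(m, start):
    """(cycle, PE) pairs occupied by one multiplication starting at `start`.

    Assumed timing model: PE j handles iteration i in cycle start + 2*i + j.
    """
    return {(start + 2 * i + j, j) for i in range(m + 3) for j in range(m + 2)}


def cycles_for_two_multiplications(m):
    """Clock cycles needed for two interleaved Montgomery multiplications."""
    first = busy_slots(m, 0)
    second = busy_slots(m, 1)        # second multiplication starts one cycle later
    if not first.isdisjoint(second):
        raise RuntimeError("a PE would be needed twice in the same cycle")
    last_cycle = max(cycle for cycle, _ in second)
    return last_cycle + 1            # cycles are numbered from 0


def utilisation(m, cycle):
    """Fraction of the m+2 PEs that are busy in the given cycle."""
    busy = busy_slots(m, 0) | busy_slots(m, 1)
    return sum((cycle, j) in busy for j in range(m + 2)) / (m + 2)
```

Under this model the two multiplications never compete for a PE, the total comes to 3m+7 cycles, and in the steady state every PE is busy on every cycle.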
Table 4.1. Performance results for 512-bit Montgomery implementations

| Radix | Clock (MHz) | Area (Slices) | Mult 18×18 | Clock Cycles | Data Rate (Mb/s) |
|---|---|---|---|---|---|
| $2^8$ | 119.86 | 2,952 | 131 | 199 | 616.77 |
| $2^{16}$ | 108.70 | 2,923 | 67 | 103 | 1080.68 |
4. FPGA Implementation and Performance Results
The systolic modular multiplier architecture described in Section 3 has been captured generically in VHDL. For demonstration purposes, it has been implemented using the Xilinx Virtex2 Pro family of FPGAs [18], as we believe this represents a typical modern FPGA architecture. However, our

An intrinsic advantage of the designs being systolic is that there is only nearest-neighbour interconnect (see Figure 3.3), meaning that high clock speeds have been achieved, as shown in Tables 4.1 and 4.2. Also, because high radices have been used, relatively few clock cycles are required to complete two full modular multiplications. As a result, the throughput rates obtained are the fastest reported in
the literature to date. As discussed in Section 1, modular multiplication is very important for public-key cryptographic algorithms such as RSA, which requires modular exponentiation.
Table 4.2. Performance results for 1024-bit Montgomery implementations

| Radix | Clock (MHz) | Area (Slices) | Mult 18×18 | Clock Cycles | Data Rate (Mb/s) |
|---|---|---|---|---|---|
| $2^8$ | 104.70 | 5,797 | 259 | 391 | 548.40 |
| $2^{16}$ | 101.86 | 5,709 | 131 | 199 | 1048.29 |
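The clock cycle and data rate columns of Tables 4.1 and 4.2 can be cross-checked with a short calculation. The formulas below are inferred rather than quoted: the data rate is assumed to be two n-bit results per 3m+7 clock cycles with m = n/k, and the multiplier count is assumed to be two per general PE plus one for the first PE, with each k-bit multiplication mapped to one embedded 18×18 multiplier. The function names are hypothetical.

```python
def clock_cycles(n_bits, k):
    """Cycles for two Montgomery multiplications, with m = n_bits / k digits."""
    m = n_bits // k                  # assumes k divides n_bits
    return 3 * m + 7


def multiplier_count(n_bits, k):
    """Assumed count: two multipliers in each of m+1 general PEs, one in the first PE."""
    m = n_bits // k
    return 2 * (m + 1) + 1


def data_rate_mbps(n_bits, k, clock_mhz):
    """Assumed definition: two n-bit results every 3m+7 cycles."""
    return 2 * n_bits * clock_mhz / clock_cycles(n_bits, k)


for n_bits, k, clock_mhz in [(512, 8, 119.86), (512, 16, 108.70),
                             (1024, 8, 104.70), (1024, 16, 101.86)]:
    print(n_bits, k, clock_cycles(n_bits, k), multiplier_count(n_bits, k),
          round(data_rate_mbps(n_bits, k, clock_mhz), 2))
```

Under these assumptions the computed cycle and multiplier counts match the tabulated values, and the computed data rates agree with the tables to within rounding.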
It is estimated that a 1024-bit RSA decryption throughput rate of 2.1 Mb/s can be achieved using two 512-bit radix-$2^{16}$ multipliers in parallel with the well-known Chinese Remainder Theorem technique and the right-to-left binary exponentiation algorithm. The architectures can also be used in other public-key schemes such as ECC.
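One way to arrive at a figure of this size is sketched below. The derivation is an assumption, since the text does not state how the estimate was obtained: each of the two 512-bit multipliers is taken to handle one CRT half with a 512-bit exponent, the squaring and the multiplication of each right-to-left iteration are taken to share one 3m+7 cycle slot of the interleaved array, and the conversions into and out of the Montgomery domain and the CRT recombination are ignored. The function name and default arguments are illustrative.

```python
def rsa_crt_throughput_mbps(rsa_bits=1024, k=16, clock_mhz=108.70):
    """Rough RSA decryption throughput under the assumptions stated above."""
    half_bits = rsa_bits // 2              # CRT: two half-length exponentiations
    m = half_bits // k                     # radix-2^k digits per operand
    cycles_per_exponent_bit = 3 * m + 7    # square and multiply run interleaved
    total_cycles = half_bits * cycles_per_exponent_bit
    seconds = total_cycles / (clock_mhz * 1e6)
    return rsa_bits / seconds / 1e6        # message bits per second, in Mb/s


print(round(rsa_crt_throughput_mbps(), 2))
```

With the 108.70 MHz clock of the 512-bit radix-$2^{16}$ design in Table 4.1, this simplified model gives a value close to the 2.1 Mb/s quoted above.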
5. Conclusions
In this paper, Montgomery modular multiplication architectures suitable for implementation on modern reconfigurable hardware have been presented. These have been constructed using a novel high-radix systolic array Montgomery multiplier design. Unlike previous approaches, each PE comprises both an adder and a multiplier. The inclusion of a multiplier in the PE means that the need to pre-compute or store any multiples of the operands is avoided. This also allows very high-radix implementations to be realised, further reducing the number of clock cycles per modular multiplication, while still maintaining a competitive critical delay. The 512-bit and 1024-bit FPGA implementations using radices of $2^8$ and $2^{16}$ have produced the fastest Montgomery multiplication data throughput rates reported in the literature to date.
References
[1] Doud, R., “Hardware Crypto Solutions Boost VPN”, EE Times, April 1999.
[2] Wade, W., “Encryption Migrates to Silicon as Net Traffic Swells”, EE Times, May 2001.
[3] Shamir, A., Van Someren, N., “Playing Hide and Seek with Stored Keys”, In Proceedings of the Third International Conference on Financial Cryptography, pp. 118–124, 1999.
[4] Vahid, F., “The Softening of Hardware”, IEEE Computer, pp. 27–34, April 2003.
[5] Bohm, M., “FPGA Evolution: New Design Methods on the Horizon”, EE Times, January 2002.
[6] Gonzalez, R., “Xtensa: A Configurable and Extensible Processor”, IEEE Micro 20(2), pp. 60–70, 2000.
[7] Kapasi, U.J., Dally, W.J., Rixner, S., Owens, J.D., Khailany, B., “The Imagine Stream Processor”, In Proceedings of IEEE International Conference on Computer Design: VLSI in Computers and Processors, pp. 282–288, 2002.
[8] Rivest, R.L., Shamir, A., Adleman, L., “A Method for Obtaining Digital Signatures and Public-key Cryptosystems”, Communications of the ACM, 21(2): 120–126, February 1978.
[9] Miller, V.S., “Use of Elliptic Curves in Cryptography”, Proceedings of Advances in Cryptology (Crypto ’85), pp. 417–426, 1986.
[10] Koblitz, N., “Elliptic Curve Cryptosystems”, Math. Computing, Vol. 48, pp. 203–209, 1987.
[11] Montgomery, P.L., “Modular Multiplication without Trial Division”, Math. Computation, Vol. 44, pp. 519–521, 1985.
[12] Walter, C.D., “Systolic Modular Multiplication”, IEEE Transactions on Computers, (42), pp. 376–378, March 1993.
[13] Iwamura, K., Matsumoto, T., Imai, H., “Montgomery Modular Multiplication Method and Systolic Arrays Suitable for Modular Exponentiation”, Electronics and Communications in Japan, Part 3, Vol. 77, pp. 40–51, March 1994.
[14] Wang, P., “New VLSI Architectures of RSA Public-key Cryptosystems”, In Proceedings IEEE International Symposium on Circuits and Systems, Vol. 3, pp. 2040–2043, 1997.
[15] Tiountchik, A., “Systolic Modular Exponentiation via Montgomery Algorithm”, Electronics Letters, Vol. 34, pp. 874–875, April 1998.
[16] Blum, T., Paar, C., “Montgomery Modular Exponentiation on Reconfigurable Hardware”, In Proceedings 14th IEEE Symposium on Computer Arithmetic, pp. 70–77, 1999.
[17] Blum, T., Paar, C., “High-radix Montgomery Modular Exponentiation on Reconfigurable Hardware”, IEEE Transactions on Computers (50), pp. 759–764, July 2001.
[18] Xilinx, Inc., “Xilinx Data Sheets”, http://www.xilinx.com/xlnx/xweb/xil_publications_index.jsp, March 2005.