Scalable and systolic Montgomery multiplier over $GF(2^m)$ generated by trinomials

C.-Y. Lee, C.W. Chiou, J.-M. Lin and C.-C. Chang

Abstract: A Montgomery algorithm in $GF(2^m)$ based on the Hankel matrix–vector representation is proposed. The hardware architecture obtained from this algorithm yields low-complexity bit-parallel systolic multipliers for irreducible trinomials. The results reveal that the proposed multiplier saves approximately 36% of space complexity as compared to an existing systolic Montgomery multiplier for trinomials. A scalable and systolic Montgomery multiplier is also developed by applying the block-Hankel matrix–vector representation. The proposed scalable systolic architecture is demonstrated to have significantly less time–area product complexity than existing digit-serial systolic architectures. Furthermore, the proposed architectures have regularity, modularity and local interconnectability, making them highly appropriate for VLSI implementation.

1 Introduction

© The Institution of Engineering and Technology 2007. doi:10.1049/iet-cds:20060314. Paper first received 17th October 2006 and in final revised form 10th August 2007. C.-Y. Lee is with the Department of Computer Information and Network Engineering, Lunghwa University of Science and Technology, 2F 105, Lane 437, Chung-Pei Road Sec. 2, Taoyuan County 333, Taiwan, Republic of China. C.W. Chiou is with the Department of Computer Science and Information Engineering, Ching Yun University, Chung-Li 320, Taiwan, Republic of China. J.-M. Lin and C.-C. Chang are with the Department of Information Engineering and Computer Science, Feng Chia University, Taichung City 407, Taiwan, Republic of China. E-mail: [email protected]. IET Circuits Devices Syst., 2007, 1, (6), pp. 477–484

Finite field arithmetic operations, particularly for the binary field $GF(2^m)$, have been widely applied in cryptography and error-correcting codes. In particular, public-key cryptography systems are based on finite field arithmetic, for example, RSA, XTR, NTRU, ElGamal and Torus-based cryptography. Both software implementations and hardware architectures for the finite field $GF(2^m)$ have been explored extensively. In the finite field [1, 2], the performance of a cryptosystem is primarily determined by an efficient implementation of the arithmetic operations, for example, addition, multiplication and inversion. Inversion can be carried out by repeated multiplication and squaring. Therefore efficient architectures for multiplication over $GF(2^m)$ are desirable to reduce the complexity of elliptic curve cryptosystems. In VLSI design, systolic arrays are one- or two-dimensional arrays of simple processing elements performing a specific task, including matrix–vector multiplication. Almost all previous $GF(2^m)$ multiplication architectures are either bit-parallel [3–5] or bit-serial [6, 7]. Various systolic arrays have been studied based on different field representations, that is, polynomial (standard), dual and normal bases. In general, every representation basis has its own associated hardware architecture. For instance, a bit-parallel systolic multiplier for the standard basis in $GF(2^m)$ is especially appropriate for realising regular circuits and is not dependent on a specific value of m for $GF(2^m)$. The benefit of the dual basis representation is that the field multiplication can be expressed as a Hankel matrix–vector computation [4]. In the normal basis, field multiplication can typically be conducted with a multiplication matrix, which is an $m \times m$ matrix M with every entry $M_{ij}$ in GF(2). Each type of finite field operation has distinct features and is thus appropriate for specific applications. A multiplicative inverter can adopt any basis, but the normal basis is valuable for fast squaring. A systolic multiplier can utilise polynomial and dual bases. Exponentiation and inversion can be derived from both multiplication and squaring operations. The polynomial basis representation is well suited to the multiplication operation and the normal basis representation is suited to the squaring operation. A single-basis representation is thus not very efficient for both exponentiation and inversion. Bit-parallel systolic frameworks with bit-level pipelining can offer very high throughput. Several hardware-efficient bit-parallel systolic multiplier designs have been proposed [8–10]. Lee et al. [8, 9] first used a circulant matrix–vector representation to derive low-complexity bit-parallel systolic multipliers over $GF(2^m)$, where the field is constructed from all-one polynomials (AOPs) and equally spaced polynomials. A trinomial-based bit-parallel systolic multiplier [11] was also derived by adopting an LSB-first multiplication algorithm. Lee et al. [10] developed a transformation method that allows the circuit to be transformed from an AOP-basis systolic multiplier [8] into a bit-parallel systolic Montgomery multiplier for trinomials. Interestingly, such multipliers are good candidates for low-complexity systolic architectures owing to the simple multiplier structures.
Previous related investigations implemented these architectures with low-complexity bit-parallel systolic multipliers, in contrast to traditional bit-parallel systolic multipliers [12, 13] based on LSB-first and MSB-first schemes. The finite field $GF(2^m)$ is especially valuable for cryptographic applications when m is large. For example, the National Institute of Standards and Technology (NIST) [14] recommended five binary finite fields for ECDSA (elliptic curve digital signature algorithm) applications, in which two binary fields $GF(2^{233})$ and $GF(2^{409})$ are generated from $x^{233} + x^{74} + 1$ and $x^{409} + x^{87} + 1$, respectively. Multipliers for large m values require $O(m^2)$ space complexity and $O(m)$ latency, making them inappropriate for constrained hardware environments, including smart cards and portable devices. The Montgomery multiplication algorithm without a division operation was originally developed by Montgomery [15] to improve the performance of modular integer multiplications. Montgomery multiplication over GF(p) was presented for application in RSA cryptosystems 20 years ago. The merit of the Montgomery multiplication algorithm is that it restructures the multiplication operations such that the modular adjustment depends on the least significant digits, instead of the most significant digits as in traditional modular integer multiplications. The algorithm replaces division operations with straightforward addition and shifting operations. The Montgomery multiplication algorithm of [16] likewise performs modular reduction according to the least significant digit. Nibouche et al. [17] proposed a Montgomery multiplier systolic architecture over GF(p). Several modular multiplication algorithms and architectures for the field $GF(2^m)$ have been presented based on the Montgomery multiplication concept [18–20]. Mentens et al. [21] developed an elliptic curve processor using a Montgomery multiplier over $GF(2^m)$. The scalable architecture combines serial and parallel algorithms. It splits the original m-bit data into $\lceil m/k \rceil$ sub-words, where the selected digit size k is the scalable factor. Since the computation of each sub-word requires one clock cycle, the entire computation occupies $\lceil m/k \rceil$ clock cycles.
Hence, the scalable architecture can generate an optimal realisation in hardware implementations. Hybrid multipliers for the composite fields $GF((2^m)^k)$ have been proposed to improve the trade-off between throughput performance and hardware complexity [22], and digit-serial systolic architectures have also been presented [23–25]. For cryptography applications, ECDSA requires both $GF(2^m)$ and GF(p) arithmetic operations. Montgomery multipliers and inverters operating in both types of finite fields GF(p) and $GF(2^m)$, based on scalable and unified architectures, have also been presented [26, 27]. For the large word lengths frequently used in cryptography, the bit-serial approach is rather slow, while a bit-parallel realisation needs large circuit area and power consumption. Elliptic curve cryptosystems strongly depend on the implementation of finite field arithmetic. This work proposes a novel Montgomery multiplication algorithm over $GF(2^m)$ based on the Hankel matrix–vector representation. Because the field is generated using irreducible trinomials, the Montgomery multiplication can be decomposed into two Hankel matrix–vector multiplications. A low-complexity bit-parallel systolic multiplier is derived from this algorithm. Analytical results demonstrate that the proposed multiplier saves approximately 36% of space complexity as compared to existing multipliers for trinomials [10, 11]. To further save both time and space complexities, the Montgomery multiplication can also be represented by the block-Hankel matrix–vector representation. The proposed Montgomery multiplication can be realised by a scalable and systolic architecture. The results also indicate that the proposed scalable multiplier has significantly less time–area complexity than existing digit-serial systolic architectures.

The rest of this investigation is structured as follows. Section 2 briefly reviews Montgomery multiplication over $GF(2^m)$. Section 3 introduces the proposed bit-parallel systolic Montgomery multiplier based on the Hankel matrix–vector representation. Section 4 discusses a novel scalable and systolic architecture, which is based on the Hankel multiplier in Section 3. Section 5 analyses the time–area complexity. Conclusions are finally drawn in Section 6.

2 Conventional Montgomery multiplication over $GF(2^m)$

Let $GF(2^m)$ indicate a finite field of $2^m$ elements. $GF(2^m)$ is a vector space over GF(2) of dimension m. A set of m linearly independent vectors is selected to act as the basis of representation. Let $P(x) = p_0 + p_1 x + p_2 x^2 + \cdots + p_m x^m$ of degree m over GF(2) be an irreducible primitive polynomial, where $p_0 = p_m = 1$. Any element $A(x) \in GF(2^m)$ can be denoted by the following polynomial basis representation

$$A(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{m-1} x^{m-1}$$

Let A(x), B(x), C(x) denote three elements in $GF(2^m)$. The Montgomery multiplication efficiently determines $C(x) = A(x) \cdot B(x) \cdot R^{-1}(x) \bmod P(x)$, where R(x) satisfies $\gcd(R(x), P(x)) = 1$. Generally, $R(x) = x^k$ is selected as the Montgomery factor, since reduction modulo $x^k$ simply discards the terms of degree k or higher, and division by $x^k$ shifts the polynomial to the right by k places. Because P(x) and R(x) are relatively prime, two polynomials $R^{-1}(x)$ and $P'(x)$ exist with the property that $R(x) \cdot R^{-1}(x) + P(x) \cdot P'(x) = 1$. Thus, the Montgomery multiplication is computed as follows:

Step 1. $H(x) = A(x)B(x)$.
Step 2. $U(x) = H(x) \cdot P'(x) \bmod R(x)$.
Step 3. $C(x) = (H(x) + U(x) \cdot P(x))/R(x) \bmod P(x)$.
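
The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the architecture proposed in this paper: polynomials over GF(2) are stored as integers (bit i holds the coefficient of $x^i$), $R(x) = x^k$, and the helper names `clmul`, `polymod`, `inv_mod_xk` and `montgomery_mul` are hypothetical. In characteristic 2, $P'(x)$ is simply the inverse of $P(x)$ modulo $x^k$, which exists because $p_0 = 1$.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two GF(2)[x] polynomials stored as bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a: int, p: int) -> int:
    """Remainder of a modulo p in GF(2)[x]."""
    dp = p.bit_length() - 1
    while a.bit_length() - 1 >= dp:
        a ^= p << (a.bit_length() - 1 - dp)
    return a


def inv_mod_xk(p: int, k: int) -> int:
    """P'(x) = P(x)^-1 mod x^k, built one coefficient at a time (needs p0 = 1)."""
    inv = 1
    for i in range(1, k):
        if (clmul(p, inv) >> i) & 1:   # coefficient i of P * inv must be cleared
            inv ^= 1 << i
    return inv


def montgomery_mul(a: int, b: int, p: int, k: int) -> int:
    """A(x) B(x) x^-k mod P(x), following Steps 1 to 3 with R(x) = x^k."""
    mask = (1 << k) - 1
    h = clmul(a, b)                                 # Step 1: H = A * B
    u = clmul(h & mask, inv_mod_xk(p, k)) & mask    # Step 2: U = H * P' mod x^k
    c = (h ^ clmul(u, p)) >> k                      # Step 3: exact division by x^k
    return polymod(c, p)                            # final reduction modulo P


P = 0b100101                              # x^5 + x^2 + 1
print(bin(montgomery_mul(1, 1, P, 2)))    # x^-2 mod P, prints 0b1001 (x^3 + 1)
```

Multiplying the result by $x^k$ and reducing modulo P(x) should give back $A(x)B(x) \bmod P(x)$, which is a convenient way to test such a routine.
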
As stated above, an efficient multiplier architecture can be derived if R(x) is properly selected based on the irreducible polynomial P(x). For instance, if the field is generated with a trinomial $P(x) = x^m + x^k + 1$, then the selection of $R(x) = x^k$ enables the implementation of a bit-parallel systolic multiplier, as seen in [10].

Example 1: Assume that the finite field $GF(2^5)$ is built from $P(x) = x^5 + x^2 + 1$. Let $A(x) = a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0$ and $B(x) = b_4 x^4 + b_3 x^3 + b_2 x^2 + b_1 x + b_0 \in GF(2^5)$. Define the Montgomery factor by $R(x) = x^2$. For the Montgomery multiplication $C(x) = A(x)B(x)R^{-1}(x) \bmod P(x)$, the coefficients of C(x) can be determined with

$$
\begin{aligned}
c_0 &= a_0 b_0 + a_4 b_3 + a_3 b_4 + a_2 b_0 + a_1 b_1 + a_0 b_2 \\
c_1 &= a_0 b_1 + a_1 b_0 + a_4 b_4 + a_3 b_0 + a_2 b_1 + a_1 b_2 + a_0 b_3 \\
c_2 &= a_4 b_3 + a_3 b_4 + a_4 b_0 + a_3 b_1 + a_2 b_2 + a_1 b_3 + a_0 b_4 \\
c_3 &= a_4 b_4 + a_4 b_1 + a_3 b_2 + a_2 b_3 + a_1 b_4 + a_0 b_0 \\
c_4 &= a_4 b_2 + a_3 b_3 + a_2 b_4 + a_1 b_0 + a_0 b_1
\end{aligned}
$$
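
The coefficient formulas of Example 1 can be checked exhaustively against plain polynomial arithmetic. The following Python sketch is only an illustration: the function names are hypothetical, polynomials are stored as integers with bit i holding the coefficient of $x^i$, and division by x is done by adding P(x) when the constant term is 1 and then shifting right.

```python
from itertools import product

P = 0b100101  # x^5 + x^2 + 1


def mont_ref(a: int, b: int, p: int = P, k: int = 2) -> int:
    """A*B*x^-k mod P by direct polynomial arithmetic."""
    m = p.bit_length() - 1
    c = 0
    for i in range(m):                       # schoolbook product over GF(2)
        if (b >> i) & 1:
            c ^= a << i
    for d in range(c.bit_length() - 1, m - 1, -1):
        if (c >> d) & 1:                     # clear every term of degree >= m
            c ^= p << (d - m)
    for _ in range(k):                       # divide by x, k times
        if c & 1:
            c ^= p                           # make the constant term zero first
        c >>= 1
    return c


def example1(a, b):
    """c0..c4 of Example 1; a = (a0..a4) and b = (b0..b4) are tuples of bits."""
    a0, a1, a2, a3, a4 = a
    b0, b1, b2, b3, b4 = b
    c0 = (a0 & b0) ^ (a4 & b3) ^ (a3 & b4) ^ (a2 & b0) ^ (a1 & b1) ^ (a0 & b2)
    c1 = (a0 & b1) ^ (a1 & b0) ^ (a4 & b4) ^ (a3 & b0) ^ (a2 & b1) ^ (a1 & b2) ^ (a0 & b3)
    # c2 includes the cross term a3*b4, as required by the matrix form of Example 2
    c2 = (a4 & b3) ^ (a3 & b4) ^ (a4 & b0) ^ (a3 & b1) ^ (a2 & b2) ^ (a1 & b3) ^ (a0 & b4)
    c3 = (a4 & b4) ^ (a4 & b1) ^ (a3 & b2) ^ (a2 & b3) ^ (a1 & b4) ^ (a0 & b0)
    c4 = (a4 & b2) ^ (a3 & b3) ^ (a2 & b4) ^ (a1 & b0) ^ (a0 & b1)
    return (c0, c1, c2, c3, c4)


def to_int(bits) -> int:
    return sum(bit << i for i, bit in enumerate(bits))


# Count the input pairs for which the closed-form coefficients disagree
# with the direct computation.
mismatches = sum(
    to_int(example1(a, b)) != mont_ref(to_int(a), to_int(b))
    for a in product((0, 1), repeat=5)
    for b in product((0, 1), repeat=5)
)
print(mismatches)
```
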

3 Proposed bit-parallel systolic Montgomery multiplier over $GF(2^m)$ for all trinomials

This section first describes the proposed Montgomery multiplication algorithm, followed by the proposed architecture based on this algorithm. The time–area complexity analysis of the architecture is described in Section 5.

3.1 Algorithm

Let $A(x) = a_{m-1}x^{m-1} + \cdots + a_1 x + a_0$ and $B(x) = b_{m-1}x^{m-1} + \cdots + b_1 x + b_0$ represent two elements in $GF(2^m)$, where the field is built from an irreducible polynomial $P(x) = x^m + x^n + 1$ over GF(2). Suppose that the intermediate product $T(x) = t_{2m-2}x^{2m-2} + \cdots + t_1 x + t_0$ indicates the multiplication of A(x) and B(x), where

$$
\begin{aligned}
t_0 &= a_0 b_0 \\
t_1 &= a_1 b_0 + a_0 b_1 \\
&\;\;\vdots \\
t_{m-1} &= a_{m-1} b_0 + a_{m-2} b_1 + \cdots + a_0 b_{m-1} \\
t_m &= a_{m-1} b_1 + a_{m-2} b_2 + \cdots + a_1 b_{m-1} \\
&\;\;\vdots \\
t_{2m-3} &= a_{m-1} b_{m-2} + a_{m-2} b_{m-1} \\
t_{2m-2} &= a_{m-1} b_{m-1}
\end{aligned}
$$

Assume that the intermediate product T(x) is represented by $T(x) = T_1 + T_2 x^n + T_3 x^{m+n}$, where

$$
\begin{aligned}
T_1 &= t_0 + t_1 x + \cdots + t_{n-1} x^{n-1} \\
T_2 &= t_n + t_{n+1} x + \cdots + t_{m+n-1} x^{m-1} \\
T_3 &= t_{m+n} + t_{m+n+1} x + \cdots + t_{2m-2} x^{m-n-2}
\end{aligned}
$$

Let the Montgomery parameter be chosen as $R(x) = x^n$. The Montgomery multiplication of A(x) and B(x) can thus be rewritten as

$$
\begin{aligned}
C(x) &= A(x)B(x)x^{-n} \bmod (x^m + x^n + 1) \\
&= \frac{T_1 + T_2 x^n + T_3 x^{m+n} + T_1 (x^m + x^n + 1)}{x^n} \\
&= T_2 + T_3 x^m + T_1 (x^{m-n} + 1) \\
&= T_2 + T_3 + T_1 x^{m-n} + (T_1 + T_3 x^n) \\
&= C_0 + C_1
\end{aligned}
\tag{1}
$$

where

$$C_0 = T_2 + T_3 + T_1 x^{m-n} = c_{0,0} + c_{0,1} x + \cdots + c_{0,m-1} x^{m-1}$$

and

$$C_1 = T_1 + T_3 x^n = c_{1,0} + c_{1,1} x + \cdots + c_{1,m-2} x^{m-2}$$

Definition 1: An $m \times m$ matrix H is called a Hankel matrix if it satisfies the relation $H(p, q) = H(p-1, q+1)$, for $1 \le p$, $q < m-1$, where $H(p, q)$ represents the (p, q)-entry in matrix H. Such a matrix is determined by the $2m - 1$ entries appearing in the first row and the last column.

With the property of Definition 1, an $m \times m$ Hankel matrix H can be defined using the vector $\bar{H} = [h_0, h_1, \ldots, h_{2m-2}]$ over GF(2). With the Hankel matrix–vector representation, the product C(x) in (1) can be translated into the following equation

$$
\begin{bmatrix} c_n \\ \vdots \\ c_{m-1} \\ c_0 \\ c_1 \\ \vdots \\ c_{n-1} \end{bmatrix}
=
\begin{bmatrix}
b_0 & b_1 & \cdots & b_{m-1} \\
b_1 & b_2 & \cdots & b_0 \\
\vdots & \vdots & \ddots & \vdots \\
b_{m-1} & b_0 & \cdots & b_{m-2}
\end{bmatrix}
\begin{bmatrix} a_{m-1} \\ a_{m-2} \\ \vdots \\ a_0 \end{bmatrix}
+
\begin{bmatrix}
b_{n+1} & b_{n+2} & \cdots & b_{m-1} & 0 & \cdots & 0 \\
b_{n+2} & \cdots & b_{m-1} & 0 & \cdots & \cdots & 0 \\
\vdots & & & & & & \vdots \\
b_{m-1} & 0 & \cdots & \cdots & \cdots & \cdots & 0 \\
0 & \cdots & \cdots & \cdots & \cdots & \cdots & 0 \\
0 & \cdots & \cdots & \cdots & \cdots & 0 & b_0 \\
\vdots & & & & & & \vdots \\
0 & \cdots & 0 & b_0 & \cdots & b_{n-2} & b_{n-1}
\end{bmatrix}
\begin{bmatrix} a_{m-1} \\ a_{m-2} \\ \vdots \\ a_1 \\ a_0 \end{bmatrix}
= K_0 A + K_1 A
\tag{2}
$$

Both matrices $K_0$ and $K_1$ in (2) are Hankel matrices. Hence, they can be represented by the two vectors $\bar{K}_0 = [b_0, b_1, \ldots, b_{m-1}, b_0, b_1, \ldots, b_{m-2}]$ and $\bar{K}_1 = [b_{n+1}, b_{n+2}, \ldots, b_{m-1}, 0, 0, \ldots, 0, b_0, \ldots, b_{n-1}]$, respectively. Interestingly, the coefficients of the product C(x) are permuted from $c_0, c_1, \ldots, c_{m-1}$ into $c_n, c_{n+1}, \ldots, c_{m-1}, c_0, \ldots, c_{n-1}$.

Example 2: For ease of description, the Montgomery multiplier for irreducible trinomials of the form $x^m + x^n + 1$ is illustrated with the finite field $GF(2^5)$ generated by $P(x) = x^5 + x^2 + 1$. Let two field elements $A(x), B(x) \in GF(2^5)$ be denoted by $A(x) = \sum_{i=0}^{4} a_i x^i$ and $B(x) = \sum_{i=0}^{4} b_i x^i$. Their Montgomery multiplication can be indicated by $C(x) = \sum_{i=0}^{4} c_i x^i = A(x)B(x)x^{-2} \bmod (x^5 + x^2 + 1)$. From (2), the product C(x) can be written as

$$
\begin{bmatrix} c_2 \\ c_3 \\ c_4 \\ c_0 \\ c_1 \end{bmatrix}
=
\begin{bmatrix}
b_0 & b_1 & b_2 & b_3 & b_4 \\
b_1 & b_2 & b_3 & b_4 & b_0 \\
b_2 & b_3 & b_4 & b_0 & b_1 \\
b_3 & b_4 & b_0 & b_1 & b_2 \\
b_4 & b_0 & b_1 & b_2 & b_3
\end{bmatrix}
\begin{bmatrix} a_4 \\ a_3 \\ a_2 \\ a_1 \\ a_0 \end{bmatrix}
+
\begin{bmatrix}
b_3 & b_4 & 0 & 0 & 0 \\
b_4 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & b_0 \\
0 & 0 & 0 & b_0 & b_1
\end{bmatrix}
\begin{bmatrix} a_4 \\ a_3 \\ a_2 \\ a_1 \\ a_0 \end{bmatrix}
$$
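The matrix form of Example 2 can be checked numerically. The sketch below is a minimal Python illustration, not the authors' implementation: it builds the two Hankel vectors, adds them entry-wise (both products share the same vector A, so the two matrices can be summed before multiplying), and compares every output bit with a direct polynomial computation of $A(x)B(x)x^{-n} \bmod P(x)$. The helper names are hypothetical. One assumption is mine and is marked in the code: the $K_0$ vector is started at $b_{(2n+1) \bmod m}$. This equals $b_0$, the form printed in the text, whenever $m = 2n + 1$ (as in Example 2 with m = 5, n = 2); for other (m, n) the offset was chosen so that the direct check passes and is not taken from the paper.

```python
from itertools import product


def mont_ref(a: int, b: int, m: int, n: int) -> int:
    """A*B*x^-n mod (x^m + x^n + 1); bit i of an int is the coefficient of x^i."""
    p = (1 << m) | (1 << n) | 1
    c = 0
    for i in range(m):
        if (b >> i) & 1:
            c ^= a << i
    for d in range(c.bit_length() - 1, m - 1, -1):
        if (c >> d) & 1:
            c ^= p << (d - m)
    for _ in range(n):            # divide by x: clear the constant term, shift
        if c & 1:
            c ^= p
        c >>= 1
    return c


def hankel_matvec(vec, x):
    """y[i] = sum_q vec[i+q] * x[q] over GF(2); vec has 2m-1 entries, x has m."""
    m = len(x)
    return [sum(vec[i + q] & x[q] for q in range(m)) & 1 for i in range(m)]


def montgomery_hankel(a, b, n):
    """a = [a0..a_{m-1}], b = [b0..b_{m-1}] for the trinomial x^m + x^n + 1."""
    m = len(a)
    s = (2 * n + 1) % m           # ASSUMPTION (not from the paper); 0 when m = 2n+1
    k0 = [b[(s + j) % m] for j in range(2 * m - 1)]
    k1 = b[n + 1:] + [0] * m + b[:n]
    h = [u ^ v for u, v in zip(k0, k1)]        # entry-wise sum of the two vectors
    rows = hankel_matvec(h, a[::-1])           # A is ordered [a_{m-1}, ..., a_0]
    c = [0] * m
    for i, bit in enumerate(rows):             # row i carries c_{(n+i) mod m}
        c[(n + i) % m] = bit
    return c


def agrees(m: int, n: int) -> bool:
    """Exhaustively compare the Hankel form with the direct computation."""
    for a in product((0, 1), repeat=m):
        for b in product((0, 1), repeat=m):
            ai = sum(v << i for i, v in enumerate(a))
            bi = sum(v << i for i, v in enumerate(b))
            ci = sum(v << i for i, v in enumerate(montgomery_hankel(list(a), list(b), n)))
            if ci != mont_ref(ai, bi, m, n):
                return False
    return True


print(agrees(5, 2))   # the field of Example 2
```

For m = 5 and n = 2 the vectors built by `montgomery_hankel` are exactly the first row and last column of the two matrices shown in Example 2.
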

Remark 1: If the field $GF(2^m)$ is built from an irreducible trinomial $x^m + x^n + 1$, then the matrix addition $K_0 + K_1$ requires a total of $m - 1$ XOR gates.

Next, by applying Remark 1, the Montgomery multiplication in (2) can be represented by

$$C = K_0 A + K_1 A = H A \tag{3}$$

The matrix H in (3) is also a Hankel matrix. Assume that matrix H is defined by the vector $\bar{H} = [h_0, h_1, \ldots, h_{2m-2}]$. From the vector $\bar{H}$, the following symbol is defined: $\bar{Q}^{(i)} = [h_i, h_{i+1}, \ldots, h_{m-1+i}]$. Hence, from (3), the coefficient $c_{\langle n+i \rangle}$ can be rewritten as

$$c_{\langle n+i \rangle} = \bar{Q}^{(i)} \cdot A \tag{4}$$

where the operation '$\cdot$' indicates the inner product of two vectors and $\langle x \rangle$ denotes the operation x mod m. Notably, $\bar{Q}^{(0)} = [h_0, h_1, \ldots, h_{m-1}]$. Consequently, a Montgomery multiplication can be executed as shown in Fig. 1.

Fig. 1 Montgomery multiplication for the field $GF(2^m)$ generated by an irreducible trinomial with Hankel matrix–vector multiplication

3.2 Architecture

Considering irreducible trinomials, the Montgomery multiplication can be decomposed into two Hankel matrix–vector multiplications, as illustrated in (3). Step 1 initialises the two Hankel vectors using Algorithm 1. Both vectors are obtained by permuting only the element B(x). Step 2 performs a straightforward matrix addition to obtain $H = K_0 + K_1$. Constructing this matrix requires a total of $m - 1$ XOR gates. As in Example 2, consider the two Hankel vectors $\bar{K}_0 = [b_0, b_1, b_2, b_3, b_4, b_0, b_1, b_2, b_3]$ and $\bar{K}_1 = [b_3, b_4, 0, 0, 0, 0, 0, b_0, b_1]$. The matrix addition $H = K_0 + K_1$ can be represented with the vector $\bar{H} = [h_0, h_1, h_2, h_3, h_4, h_5, h_6, h_7, h_8] = [b_0 + b_3, b_1 + b_4, b_2, b_3, b_4, b_0, b_1, b_2 + b_0, b_3 + b_1]$, as depicted in Fig. 2. Step 3 is the final step to achieve the Montgomery multiplication for the field $GF(2^m)$ generated by an irreducible trinomial with Hankel matrix–vector multiplication.

Fig. 2 Circuit for performing a Hankel matrix addition

For clarity, $A = [a_0, a_1, a_2, a_3, a_4]$ and $\bar{H} = [h_0, h_1, h_2, h_3, h_4, h_5, h_6, h_7, h_8]$ are applied as an example to show the bit-parallel systolic multiplier. Applying the proposed algorithm, Fig. 3 depicts the five steps to handle the Hankel multiplication and the time assignment for such a bit-level Hankel multiplication. The arithmetic operation in each dotted-line box of Fig. 3 is taken to be one U-cell. Fig. 4 shows the resulting bit-parallel systolic Hankel multiplier. This circuit is composed of $5 \times 5$ U-cells, as presented in Fig. 5. In the initial step, $c_i = 0$ is assumed for $0 \le i \le 4$. A $U_{i,j}$-cell performs the following operation

$$c_{\langle n+i \rangle} = c_{\langle n+i \rangle} + a_j h_{i+j}$$

Fig. 3 Hankel multiplication with the time assignment

Fig. 4 Bit-parallel systolic Hankel multiplier

Fig. 5 Detailed circuit of a U-cell

Every U-cell comprises one AND gate, one XOR gate and two 1-bit latches, as illustrated in Fig. 5. The Montgomery multiplication requires a total of $2m - 1$ clock cycles when all $h_i$ and $a_i$ signals enter the proposed Hankel multiplier array from the top. As mentioned above, the proposed bit-parallel systolic Montgomery multiplier for $x^m + x^n + 1$ includes two modules, a matrix addition circuit and a Hankel multiplier, as depicted in Fig. 6. Let A(x) and B(x) represent two elements in $GF(2^m)$. In summary, the proposed multiplier based on Fig. 1 operates as follows:

1. In the initial step, the element B(x) is transformed into the two Hankel vectors $\bar{K}_0 = [b_0, b_1, \ldots, b_{m-1}, b_0, \ldots, b_{m-2}]$ and $\bar{K}_1 = [b_{n+1}, b_{n+2}, \ldots, b_{m-1}, 0, \ldots, 0, b_0, b_1, \ldots, b_{n-1}]$. Using these two Hankel vectors, the matrix addition circuit in Fig. 6 computes $H = K_0 + K_1$ and comprises $m - 1$ XOR gates.
2. Once the Hankel vector $\bar{H}$ is computed, the Hankel multiplier in Fig. 6 follows Step 3 of Fig. 1 to compute $C = HA$.

4 Proposed scalable systolic Montgomery multiplier over $GF(2^m)$ based on block-Hankel matrix–vector representation

As revealed in the previous section, the computation of Montgomery multiplication for all trinomials $x^m + x^n + 1$ can be decomposed into two Hankel matrix–vector computations. This Montgomery multiplier can be obtained via a low-complexity systolic architecture, in contrast to the multiplier presented by Lee et al. [10]. The circuit complexity of a bit-parallel systolic multiplier is nevertheless proportional to $O(m^2)$ for large values of m. This section introduces a scalable and systolic multiplier to reduce the circuit complexity. The proposed scalable and systolic Montgomery multiplication for all trinomials $x^m + x^n + 1$ first constructs a smaller-scale Montgomery multiplier of n data bits (i.e. an $n \times n$ Hankel multiplier) using the architecture proposed in the previous section. A scalable architecture is then derived to obtain the complete m-bit Montgomery multiplication by iteratively applying this $n \times n$ Hankel multiplier $k^2$ times, where $k = \lceil m/n \rceil$. The derivation is given in detail below.

Definition 2 (block-Hankel matrix–vector): Assume that H is an $m \times m$ Hankel matrix and V is an $m \times 1$ column vector. If $m = nk$, then matrix H and vector V can be split as follows

$$
H = \begin{bmatrix}
H_0 & H_1 & \cdots & H_{k-1} \\
H_1 & H_2 & \cdots & H_k \\
\vdots & \vdots & \ddots & \vdots \\
H_{k-1} & H_k & \cdots & H_{2k-2}
\end{bmatrix}
\quad \text{and} \quad
V = \begin{bmatrix} V_0 \\ V_1 \\ \vdots \\ V_{k-1} \end{bmatrix}
\tag{5}
$$

where each $H_i$ (for $0 \le i \le 2k - 2$) is an $n \times n$ matrix in Hankel form and each $V_j$ (for $0 \le j \le k - 1$) is an $n \times 1$ column vector. Considering a block-Hankel matrix–vector H, assume that the vector $C = [C_0, C_1, \ldots, C_{k-1}]^T$ is the result of the matrix–vector computation HV, where $C_i$, $0 \le i \le k - 1$, denote $n \times 1$ column vectors. The vector $C_i$ can be calculated as follows

$$
\begin{aligned}
C_0 &= H_0 V_0 + H_1 V_1 + \cdots + H_{k-1} V_{k-1} \\
C_1 &= H_1 V_0 + H_2 V_1 + \cdots + H_k V_{k-1} \\
&\;\;\vdots \\
C_{k-1} &= H_{k-1} V_0 + H_k V_1 + \cdots + H_{2k-2} V_{k-1}
\end{aligned}
$$
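The block decomposition of Definition 2 can be illustrated with a short Python sketch. It is only a software model under stated assumptions: m is an exact multiple of n, the Hankel matrix is given by its defining vector of length $2m - 1$, and the function names are hypothetical. Block $H_i$ is the $n \times n$ Hankel matrix defined by the $2n - 1$ entries of that vector starting at position $in$, so the full product needs exactly $k^2$ small Hankel products.

```python
import random


def hankel_matvec(vec, x):
    """y[i] = sum_q vec[i+q] * x[q] over GF(2); vec has 2*len(x) - 1 entries."""
    size = len(x)
    return [sum(vec[i + q] & x[q] for q in range(size)) & 1 for i in range(size)]


def block_hankel_matvec(vec, v, n):
    """Same product as hankel_matvec(vec, v), using only n x n Hankel products."""
    k = len(v) // n                    # assumes len(v) = n * k exactly
    out, calls = [], 0
    for i in range(k):                 # one round per output sub-vector C_i
        acc = [0] * n
        for j in range(k):
            start = (i + j) * n
            sub = vec[start:start + 2 * n - 1]            # defining vector of H_{i+j}
            part = hankel_matvec(sub, v[j * n:(j + 1) * n])
            acc = [s ^ t for s, t in zip(acc, part)]      # C_i = C_i + H_{i+j} V_j
            calls += 1
        out.extend(acc)
    return out, calls


# Small demonstration with n = 4, k = 3 (m = 12) on random bits.
rng = random.Random(0)
n, k = 4, 3
vec = [rng.randint(0, 1) for _ in range(2 * n * k - 1)]
v = [rng.randint(0, 1) for _ in range(n * k)]
result, calls = block_hankel_matvec(vec, v, n)
print(result == hankel_matvec(vec, v), calls)
```

The inner accumulation mirrors the summation of the k partial products that make up each $C_i$.
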

Fig. 6 Bit-parallel systolic Montgomery multiplier architecture for all trinomials

Remark 2: Assume that $H_i$ and $H_j$ are Hankel sub-matrices in (5). The matrix addition $(H_i + H_j)$ can be performed by a total of $2n - 1$ XOR gates.

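Therefore the Montgomery multiplication in (2) can also be expressed as

$$
\begin{bmatrix} C_0 \\ C_1 \\ \vdots \\ C_{k-1} \end{bmatrix}
=
\begin{bmatrix}
K_{0,0} & K_{0,1} & \cdots & K_{0,k-1} \\
K_{0,1} & K_{0,2} & \cdots & K_{0,k} \\
\vdots & \vdots & \ddots & \vdots \\
K_{0,k-1} & K_{0,k} & \cdots & K_{0,2k-2}
\end{bmatrix}
\begin{bmatrix} A_0 \\ A_1 \\ \vdots \\ A_{k-1} \end{bmatrix}
+
\begin{bmatrix}
K_{1,0} & K_{1,1} & \cdots & K_{1,k-1} \\
K_{1,1} & K_{1,2} & \cdots & K_{1,k} \\
\vdots & \vdots & \ddots & \vdots \\
K_{1,k-1} & K_{1,k} & \cdots & K_{1,2k-2}
\end{bmatrix}
\begin{bmatrix} A_0 \\ A_1 \\ \vdots \\ A_{k-1} \end{bmatrix}
\tag{6}
$$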

Applying (6), the proposed scalable Montgomery multiplication can be executed as shown in Fig. 7. Based on (6), Fig. 8 illustrates a scalable and systolic Montgomery architecture of size $k \times k$ for Fig. 7. The circuit includes only three registers, one matrix addition circuit, one $n \times n$ Hankel multiplier and one summation circuit. The Hankel multiplier can be implemented as an $n \times n$ systolic array, as shown in Fig. 4. As for the three registers, each of the registers $K_{0,i}$ and $K_{1,i}$ is a $(2n - 1)$-bit latch and the register $V_i$ is an n-bit latch. In Fig. 8, the MUX is responsible for shifting registers $K_0$ and $K_1$. The SW is applied to shift the outcome of the Montgomery multiplication. In the initial step, the three registers $K_0$, $K_1$ and V are loaded using Steps 1.1–1.3, respectively. In the first round, the computation $C_0 = (K_{0,0} + K_{1,0})V_0 + (K_{0,1} + K_{1,1})V_1 + \cdots + (K_{0,k-1} + K_{1,k-1})V_{k-1}$ is split into k Hankel multiplications. The Hankel multiplier in Fig. 8 performs the computation $(K_{0,i} + K_{1,i})V_i$ and the result is stored in the register C. In this round, the control signals ctr1 and ctr2 are assigned the value 0. The signal ctr1 in MUX controls the cyclic shifting of both registers $K_0$ and $K_1$. The signal ctr2 in SW controls the calculation of $C_i = C_i + HV_j$. The value of the signal ctr1 in MUX is changed to 1 when the three input data $K_{0,k-1}$, $K_{1,k-1}$ and $V_{k-1}$ are about to enter the Hankel multiplier. The signal ctr2 in SW is changed to the value 1 to control the output of $C_0$ in the register C when the Hankel multiplier outputs $(K_{0,k-1} + K_{1,k-1})V_{k-1}$. Similarly, the second round performs the computation $C_1 = (K_{0,1} + K_{1,1})V_0 + (K_{0,2} + K_{1,2})V_1 + \cdots + (K_{0,k} + K_{1,k})V_{k-1}$. Register C outputs the result of the computation $C_{k-1}$ after k rounds.

Fig. 8 Proposed scalable and systolic Montgomery multiplier over $GF(2^m)$

5 Time and space complexity

The proposed Montgomery multiplication algorithm yields two Hankel matrix–vector multiplications for all trinomials, as demonstrated in previous sections. Additionally, every Hankel multiplication can also be represented with a block-Hankel matrix–vector multiplication. This section presents the estimated area–time complexity of the proposed multiplier architectures and compares them with those of the corresponding existing structures.

Fig. 7 Proposed scalable Montgomery multiplier over $GF(2^m)$ based on block-Hankel matrix–vector representation

5.1 Complexity of bit-parallel systolic multiplier architecture

The proposed bit-parallel systolic Montgomery multiplier (see Fig. 6) comprises two modules, a matrix addition circuit and a Hankel multiplier. The matrix addition circuit consists of $m - 1$ XOR gates; the Hankel multiplier comprises $m^2$ U-cells and each U-cell includes one AND gate, one XOR gate and two 1-bit latches. Every cycle of duration $T_A + T_X$ yields a desired output word after a latency of $2m - 1$ cycles. Table 1 compares various bit-parallel systolic multipliers over $GF(2^m)$. To compare the area complexity, the transistor count based on the standard CMOS VLSI realisation [28] is employed. The basic logic gates, 2-input XOR, 2-input AND and 1-bit latch, are assumed to be composed of 6, 6 and 8 transistors, respectively [29]. The proposed multiplier in Fig. 6 has twice the time complexity and saves about 36% space complexity as compared to the structure of [10]. The structure of [11] has the same time complexity and (11/7) times the space complexity of Fig. 6.
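The transistor totals follow directly from the gate counts and the per-gate costs stated above. The following Python sketch is a hedged arithmetic check rather than part of the original analysis: the function names are hypothetical, the gate counts for [11] are those listed in Table 1, and m = 233 is chosen only because it is one of the NIST field sizes mentioned in the Introduction.

```python
T_XOR, T_AND, T_LATCH = 6, 6, 8   # transistors per 2-input XOR, 2-input AND, 1-bit latch


def transistors(n_and: int, n_xor: int, n_latch: int) -> int:
    """Total transistor count for the given numbers of gates and latches."""
    return T_AND * n_and + T_XOR * n_xor + T_LATCH * n_latch


def proposed(m: int) -> int:
    """m^2 U-cells (1 AND, 1 XOR, 2 latches each) plus m - 1 XOR gates for K0 + K1."""
    return transistors(m * m, m * m + (m - 1), 2 * m * m)


def lee11(m: int) -> int:
    """Gate counts for the multiplier of [11] as listed in Table 1."""
    return transistors(m * m, m * m + m - 1, 4 * m * m + 2 * m - 2)


m = 233
saving = 1 - proposed(m) / lee11(m)
print(proposed(m), lee11(m), round(100 * saving, 1))   # 1521484 2393820 36.4
```

For any m, `proposed(m)` equals $28m^2 + 6m - 6$ and `lee11(m)` equals $44m^2 + 22m - 22$, so the ratio of the two approaches 11/7 as m grows.
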

Table 1: Comparisons of various bit-parallel systolic multipliers over $GF(2^m)$

| Multipliers | #AND | #XOR | #Latch | Delay time per cell | Latency | Transistors |
|---|---|---|---|---|---|---|
| Fig. 6, for all trinomials | $m^2$ | $m^2 + m - 1$ | $2m^2$ | $T_A + T_X + T_L$ | $2m - 1$ | $28m^2 + 6m - 6$ |
| Lee et al. [10] for $x^m + x^{m-1} + 1$ | $m^2$ | $1.5m^2 + m$ | $4m^2$ | $T_A + T_X + T_L$ | $m + 1$ | $47m^2 + 8m$ |
| Lee et al. [10] for $x^m + x + 1$ | $m^2$ | $1.5m^2 + 0.5m$ | $4m^2$ | $T_A + T_X + T_L$ | $m + 1$ | $47m^2 + 5m$ |
| Lee [11] | $m^2$ | $m^2 + m - 1$ | $4m^2 + 2m - 2$ | $T_A + T_X + T_L$ | $2m - 1$ | $44m^2 + 22m - 22$ |
| Wang–Lin [12] | $2m^2$ | $2m^2$ | $7m^2$ | $T_A + T_X + T_L$ | $3m$ | $80m^2$ |

5.2 Complexity of scalable architecture

The complexity of the proposed scalable architecture in Fig. 8 depends on the selected digit size n. The architecture involves one matrix addition circuit, one Hankel multiplier, three common data registers and one summation module. The proposed scalable multiplier requires $k^2$ Hankel matrix–vector computations, where $k = \lceil m/n \rceil$. The summation module in every cycle performs the addition of the desired output sub-word following a latency of $2n - 1$ cycles. Hence, the latency of the proposed scalable multiplier is $k^2 + 2n - 2$ cycles. The Hankel multiplier in Fig. 8 requires $(1/k^2)$ times the space complexity of the Hankel multiplier in Fig. 6. Various digit-serial systolic multipliers based on cut-set systolisation techniques [30] have recently been presented [23–25] to enhance the trade-off between throughput performance and hardware complexity. In a digit-serial system, the data words are first partitioned into digits of several bits and then processed and transmitted on a digit-by-digit basis. An appropriately selected digit size benefits the digit-serial architecture by enhancing the trade-off between throughput performance and hardware complexity. Table 2 compares the performance of various digit-serial systolic multipliers over $GF(2^m)$. Some real circuits are used to compare time complexity in this work: M74HC86 (STMicroelectronics, XOR gate, $t_{PD}$ = 12 ns (TYP.)) for the 2-input XOR gate [31], M74HC08 (STMicroelectronics, AND gate, $t_{PD}$ = 7 ns (TYP.)) for the 2-input AND gate [32], M74HC279 (STMicroelectronics, SR latch, $t_{PD}$ = 13 ns (TYP.)) for the 1-bit latch [33] and M74HC257 (STMicroelectronics, Mux, $t_{PD}$ = 11 ns (TYP.)) for the 2-to-1 multiplexer [34]. These circuits are high-speed CMOS gates fabricated in silicon gate C²MOS technology. They have balanced propagation delays (i.e. $t_{PLH} = t_{PHL}$), low power dissipation and high speed. The typical propagation delay time ($t_{PD}$) is used, and all parts come from the same manufacturer, to ensure a fair comparison. Figs. 9 and 10 depict the results of the comparisons of the proposed scalable systolic multiplier and two digit-serial systolic multipliers [23, 25] in the finite field $GF(2^{233})$. The proposed scalable multiplier (as shown in Fig. 10) has lower time–area complexity than the two reported digit-serial systolic multipliers for digit size $k > 4$. Fig. 10 also reveals that the proposed scalable multiplier can achieve a better time–area complexity than traditional bit-parallel systolic multipliers [10, 12]. Furthermore, our scalable multiplier (as shown in Fig. 9) also has lower space complexity than the two reported digit-serial systolic multipliers [23, 25] and two bit-parallel multipliers [20, 35].

Fig. 9 Comparisons of transistor count for various digit-serial systolic multipliers over $GF(2^{233})$

Table 2: Comparisons of various digit-serial systolic multipliers over $GF(2^m)$

| Multipliers | Guo–Wang [25] | Kim et al. [23] | Fig. 8 |
|---|---|---|---|
| #AND | $k(2n^2 + n)$ | $k(2n^2 + n)$ | $n^2$ |
| #XOR | $2kn^2$ | $2kn^2 + n$ | $n^2 + 3n - 1$ |
| #MUX | $2kn$ | $2kn$ | |
| #SW | 0 | 0 | $n$ |
| #latch | $10kn$ | $10kn + k$ | $2n^2 + 5kn - 2k$ |
| Latency | $3k$ | $3k$ | $k^2 + 2n - 2$ |
| Critical path | $T_{AND} + 3T_{XOR} + (k - 1)(T_{AND} + 2T_{XOR} + 2T_{MUX}) + T_L$ | $T_{AND} + T_{XOR} + (k - 1)(T_{AND} + T_{XOR} + 2T_{MUX}) + T_L$ | $T_{AND} + T_{XOR} + T_L$ |

$k = \lceil m/n \rceil$; n: selected digit size

Fig. 10 Comparisons of the time–area product for various digit-serial systolic multipliers over $GF(2^{233})$

6 Conclusions

This investigation presents a novel means of realising bitparallel systolic Montgomery multipliers over GF(2m) under Hankel matrix–vector multiplication. Because the field is built from irreducible trinomials, a Montgomery multiplication can be decomposed into two Hankel matrix– vector multiplications. The proposed architecture can reduce the space complexity by up to 36% as compared with two existing multipliers [10, 11]. Moreover, the scalable and systolic multiplier architecture produces the Montgomery multiplier using a block Hankel matrix–vector. As compared with previous digit-serial systolic multipliers, the proposed scalable multiplier shows that it has significantly less time–area product complexity than the previous digit-serial systolic architectures. If the proposed scalable and systolic architecture is applied to ECC, which require large field size, then the trade-off between throughput performance and hardware complexity can be optimised by selecting the appropriate digit size. Moreover, because the multiplier has the features of regularity and modularity, it is well suited to VLSI implementations. 7

7 Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments, which significantly improved the revised version of this paper.

8 References

1 Lidl, R., and Niederreiter, H.: ‘Introduction to finite fields and their applications’ (Cambridge University Press, New York, 1994)
2 Lidl, R., Niederreiter, H., and Cohn, P.M.: ‘Finite fields’ (Cambridge University Press, New York, 1997)
3 Lee, C.Y., and Chiou, C.W.: ‘Design of low-complexity bit-parallel systolic Hankel multipliers to implement multiplication in normal and dual bases of GF(2^m)’, IEICE Trans. Fundam., 2005, E88-A, (11), pp. 3169–3179
4 Lee, C.Y., Chiou, C.W., and Lin, J.M.: ‘Concurrent error detection in a bit-parallel systolic multiplier for dual basis of GF(2^m)’, J. Electron. Test., Theory Appl., 2005, 21, pp. 539–549
5 Kim, N.Y., Kim, H.S., and Yoo, K.Y.: ‘Computation of AB^2 multiplication in GF(2^m) using low-complexity systolic architecture’, IEE Proc., Circuits Devices Syst., 2003, 150, (2), pp. 119–123
6 Zhou, B.B.: ‘A new bit-serial systolic multiplier over GF(2^m)’, IEEE Trans. Comput., 1988, 37, (6), pp. 749–751
7 Fenn, S.T.J., Taylor, D., and Benaissa, M.: ‘A dual basis bit-serial systolic multiplier for GF(2^m)’, Integr. VLSI J., 1995, 18, pp. 139–149
8 Lee, C.Y., Lu, E.H., and Lee, J.Y.: ‘Bit-parallel systolic multipliers for GF(2^m) fields defined by all-one and equally-spaced polynomials’, IEEE Trans. Comput., 2001, 50, (5), pp. 385–393
9 Lee, C.Y., Lu, E.H., and Sun, L.F.: ‘Low-complexity bit-parallel systolic architecture for computing AB^2 + C in a class of finite field GF(2^m)’, IEEE Trans. Circuits Syst. II, 2001, 50, (5), pp. 519–523
10 Lee, C.Y., Horng, J.S., Jou, I.C., and Lu, E.H.: ‘Low-complexity bit-parallel systolic Montgomery multipliers for special classes of GF(2^m)’, IEEE Trans. Comput., 2005, 54, (9), pp. 1061–1070
11 Lee, C.Y.: ‘Low-complexity bit-parallel systolic multiplier over GF(2^m) using irreducible trinomials’, IEE Proc., Comput. Digit. Tech., 2003, 150, (1), pp. 39–42
12 Wang, C.-L., and Lin, J.-L.: ‘Systolic array implementation of multipliers for finite fields GF(2^m)’, IEEE Trans. Circuits Syst., 1991, 38, (7), pp. 796–800
13 Kwon, S., Kim, C.H., and Hong, C.P.: ‘Unidirectional two dimensional systolic array for multiplication in GF(2^m) using LSB first algorithm’. 6th Int. Workshop on Fuzzy Logic and Applications, WILF 2005, (LNCS, 2006, 3849), pp. 420–426
14 ‘Digital signature standard’, National Institute for Standards and Technology Standard FIPS Publication 186-2, January 2000
15 Montgomery, P.L.: ‘Modular multiplication without trial division’, Math. Comput., 1985, 44, pp. 519–521
16 Koç, Ç.K., and Acar, T.: ‘Montgomery multiplication in GF(2^k)’, Des. Codes Cryptogr., 1998, 14, pp. 57–69
17 Nibouche, O., Bouridane, A., and Nibouche, M.: ‘Architectures for Montgomery’s multiplication’, IEE Proc., Comput. Digit. Tech., 2003, 150, (6), pp. 361–368
18 Chiou, C.W., Lee, C.Y., Deng, A.W., and Lin, J.M.: ‘Concurrent error detection in Montgomery multiplication over GF(2^m)’, IEICE Trans. Fundam., 2006, E89-A, (2), pp. 566–574
19 Chiou, C.W., Lee, C.Y., Deng, A.W., and Lin, J.M.: ‘Efficient VLSI implementation for Montgomery multiplication in GF(2^m)’, Tamkang J. Sci. Eng., 2006, 9, (4), pp. 365–372
20 Wu, H.: ‘Montgomery multiplier and squarer for a class of finite fields’, IEEE Trans. Comput., 2002, 51, (5), pp. 521–529
21 Mentens, N., Örs, S.B., and Preneel, B.: ‘An FPGA implementation of an elliptic curve processor over GF(2^m)’. Proc. 2004 Great Lakes Symp. on VLSI (GLSVLSI 2004), 2004, pp. 454–457
22 Paar, C., Fleischmann, P., and Soria-Rodriguez, P.: ‘Fast arithmetic for public-key algorithms in Galois fields with composite exponents’, IEEE Trans. Comput., 1999, 48, (10), pp. 1025–1034
23 Kim, C.H., Hong, C.P., and Kwon, S.: ‘A digit-serial multiplier for finite field GF(2^m)’, IEEE Trans. VLSI, 2005, 13, (4), pp. 476–483
24 Kim, N.-Y., and Yoo, K.-Y.: ‘Digit-serial AB^2 systolic architecture in GF(2^m)’, IEE Proc., Circuits Devices Syst., 2005, 152, (6), pp. 608–614
25 Guo, J.H., and Wang, C.L.: ‘Digit-serial systolic multiplier for finite fields GF(2^m)’, IEE Proc., Comput. Digit. Tech., 1998, 145, (2), pp. 143–148
26 Gutub, A.A.-A., Tenca, A.F., Savas, E., and Koç, Ç.K.: ‘Scalable and unified hardware to compute Montgomery inverse in GF(p) and GF(2^n)’, Cryptographic hardware and embedded systems – CHES 2002, (LNCS, 2002, 2523), pp. 484–499
27 Savas, E., Tenca, A.F., and Koç, Ç.K.: ‘A scalable and unified multiplier architecture for finite fields GF(p) and GF(2^m)’, Cryptographic hardware and embedded systems – CHES 2000, (LNCS, 2000, 1965), pp. 281–296
28 Weste, N., and Eshraghian, K.: ‘Principles of CMOS VLSI design, a system perspective’ (Addison-Wesley, Reading, MA, 1985)
29 Kang, S.M., and Leblebici, Y.: ‘CMOS digital integrated circuits-analysis and design’ (McGraw-Hill, 1999)
30 Kung, S.Y.: ‘VLSI array processors’ (Prentice-Hall, Englewood Cliffs, NJ, 1988)
31 ‘M74HC86, Quad exclusive OR gate’, STMicroelectronics 2001, http://www.st.com/stonline/books/pdf/docs/2006.pdf
32 ‘M74HC08, Quad 2-Input AND Gate’, STMicroelectronics 2001, http://www.st.com/stonline/books/pdf/docs/1885.pdf
33 ‘M74HC279, Quad S̄-R̄ Latch’, STMicroelectronics 2001, http://www.st.com/stonline/books/pdf/docs/1937.pdf
34 ‘M74HC257, Quad 2 Channel Multiplexer (3-State)’, STMicroelectronics 2001, http://www.st.com/stonline/books/pdf/docs/1932.pdf
35 Sunar, B., and Koç, Ç.K.: ‘Mastrovito multiplier for all trinomials’, IEEE Trans. Comput., 1999, 48, (5), pp. 522–527

IET Circuits Devices Syst., Vol. 1, No. 6, December 2007
