IEEE TRANSACTIONS ON COMPUTERS,
VOL. 54, NO. 9,
SEPTEMBER 2005
1061
Low-Complexity Bit-Parallel Systolic Montgomery Multipliers for Special Classes of $GF(2^m)$
Chiou-Yng Lee, Member, IEEE, Jenn-Shyong Horng, I-Chang Jou, Senior Member, IEEE, and Erl-Huei Lu
Abstract—Recently, cryptographic applications based on finite fields have attracted much interest. This paper presents a transformation method to implement low-complexity Montgomery multipliers for all-one polynomials and trinomials. Using this method, we propose a new bit-parallel systolic architecture for computing multiplications over $GF(2^m)$. These new multipliers have a latency of $m+1$ clock cycles, and each cell incorporates at most one 2-input AND gate, two 2-input XOR gates, and four 1-bit latches. Moreover, these new multipliers are shown to exhibit significantly lower latency and circuit complexity than the related systolic multipliers and are highly appropriate for VLSI systems because of their regular interconnection pattern, modular structure, and fully inherent parallelism.
Index Terms—Bit-parallel systolic multiplier, finite field, irreducible trinomial, Montgomery multiplication, irreducible AOP.
1 INTRODUCTION
Galois field arithmetic is important in error-correcting codes and public-key cryptography schemes [1], [2]. In particular, two public-key cryptography schemes, elliptic and hyperelliptic curve cryptosystems [3], require arithmetic operations to be performed in a finite field. Since elliptic scalar multiplication is based on multiplication over $GF(2^m)$, we developed and implemented an efficient hardware architecture for bit-parallel multiplication over $GF(2^m)$.
For computing multiplications over $GF(2^m)$, both software implementations and hardware architectures have been studied extensively. A good multiplication algorithm usually needs a suitable element representation in a polynomial basis, that is, a special irreducible polynomial chosen for the construction of the field. For example, $GF(2^m)$ multipliers using some popular polynomials, such as all-one polynomials (AOPs) and trinomials [4], [5], [6], [7], [8], have a low circuit complexity.
In VLSI designs, systolic architectures are fundamentally suited to rapid computation and depend on regular circuitry to perform arithmetic operations over finite fields $GF(2^m)$. Their common nature supports architectural characteristics such as concurrency, I/O balance, and simple and regular design. Most systolic multipliers perform array-type multiplication in which one operand is processed slowly. Generally, the array algorithms are classified as least-significant-bit first (LSB-first) and most-significant-bit first (MSB-first) schemes, such as those of Wang and Lin [9] and Yeh et al. [10], both of which connect identical cells regularly and require a latency of $3m$ clock cycles. However, these multipliers result from directly unrolling iterative algorithms and do not fully exploit inherent parallelism. Consequently, they require a large area and a large latency overhead to be fully pipelined.
Recently, Lee et al. [11], [12] used the inner product operation to implement efficient systolic multipliers with low-latency and low-complexity architectures for fields defined by all-one and equally spaced polynomials. Unfortunately, irreducible all-one polynomials and irreducible equally spaced polynomials are very rare. For example, for $m \le 100$, the values of $m$ for which an all-one polynomial of degree $m$ is irreducible are only 2, 4, 10, 12, 18, 28, 36, 52, 58, 60, 66, 82, and 100.
Recently, Koc and Acar [13] demonstrated a modular multiplication using the Montgomery technique for finite field multiplication over $GF(2^m)$. When $R(x) = x^m$, the Montgomery multiplication for computing $A(x)B(x)R^{-1}(x) \bmod P(x)$, where $A(x), B(x) \in GF(2^m)$ and $P(x)$ generates the field, is suitable for both hardware and software implementations. Wu [14] recently proposed a trinomial-based Montgomery multiplier. Bajard et al. [15] suggested a Montgomery multiplier over $GF(p^m)$. However, those multipliers were irregularly designed and are unsuitable for implementing systolic architectures.
In this paper, a transformation method is proposed to implement bit-parallel systolic Montgomery multipliers over $GF(2^m)$ for fields defined by irreducible AOPs and irreducible trinomials. Both multipliers exhibit lower hardware complexity and lower latency than the other systolic multipliers in [16], [17]. Finally, they are very appropriate for VLSI systems because of their regular interconnection and modular structure.
C.Y. Lee is with the Program Coordination Department, Chunghwa Telecommunications Laboratories, 2F, 105, Lane 437, Chung-Pei Road Sec. 2, Chung-Li, Tao-Yuan, Taiwan 320, ROC. E-mail: [email protected].
J.-S. Horng and I.-C. Jou are with the Department of Computer and Communication Engineering, National Kaohsiung First University, 2, Juoyue Road, Nantz District, Kaohsiung 811, Taiwan, ROC. E-mail: [email protected].
E.-H. Lu is with the Department of Electrical Engineering, Chang Gung University, 259 Wen-Hwa 1st Rd., Kwei-San, Tao-Yuan, Taiwan 333, ROC. E-mail: [email protected].
Manuscript received 13 Aug. 2003; revised 31 Dec. 2004; accepted 11 May 2005; published online 15 July 2005. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TC-0131-0803.
0018-9340/05/$20.00 © 2005 IEEE. Published by the IEEE Computer Society.
2 PRELIMINARIES
Finite fields have received a lot of attention due to their important and practical applications in high bit-rate digital communications. A finite field $GF(2^m)$ includes $2^m$ elements, where $m$ is a positive integer. These $2^m$ elements can be represented in three major types of bases: normal basis, dual basis, and polynomial basis. Using the polynomial basis representation, the generic field element $A(x) \in GF(2^m)$ is represented through the $m$-vector $(a_{m-1}, a_{m-2}, \ldots, a_0)$ with respect to the set $\{x^{m-1}, \ldots, x^2, x, 1\}$:
\[ A(x) = a_{m-1}x^{m-1} + \cdots + a_2x^2 + a_1x + a_0, \]
where $x$ is a generator for $GF(2^m)$ over $GF(2)$ and the coefficients $a_i$ are in the ground field $GF(2)$. Therefore, each field element of $GF(2^m)$ is a unique linear combination of the basis polynomials. Given this element representation, addition and multiplication in $GF(2^m)$ can be performed by polynomial addition and polynomial multiplication modulo $P(x)$ on field elements represented as polynomials of degree $m-1$ or less.
$GF(2^m)$ bit-parallel multipliers have been suggested in [5], [6], [8] to reduce the complexity of the field multiplication. Among these, finite fields generated from the AOP and the trinomial have been shown to admit low-complexity multipliers. A polynomial of the form $P(x) = x^m + \cdots + x + 1$ is called the AOP, which is irreducible if and only if $m+1$ is a prime and 2 is primitive modulo $m+1$. Moreover, an optimal normal basis of type I is generated from the AOP [18].
The existence, distribution, and other characteristics of trinomials have been comprehensively studied. For example, Blake et al. [19] studied some constructive problems associated with irreducible trinomials over $GF(2)$. Stahnke [20] revealed that, for $m \le 200$, almost primitive trinomials exist for slightly over one half of the values of $m$. Based on trinomials $x^m + x^n + 1$ to construct the finite field $GF(2^m)$, various multipliers have been developed in [6], [7], [8], [14]. Among these, nonsystolic bit-parallel multipliers with $1 \le n \le m/2$ have low-complexity architectures. Using Horner's rule, Lee [17] proposed a low-complexity bit-parallel systolic multiplier for all trinomials. Wu [14] used the Montgomery technique to derive a low-complexity Montgomery multiplier.
A polynomial $P(x)$ of degree $m$ is almost primitive (almost irreducible) if $P(0) \ne 0$ and $P(x)$ has a primitive (irreducible) factor of degree $r$, where $0 \le m - r < r$. We say that $P(x)$ has exponent $r$ and increase $m - r$. For example, the trinomial $x^{16} + x^3 + 1$ is almost primitive with exponent 13 and increase 3 because
\[ x^{16} + x^3 + 1 = (x^3 + x^2 + 1)D(x), \]
where $D(x) = x^{13} + x^{12} + x^{11} + x^9 + x^6 + x^5 + x^4 + x^2 + 1$ is primitive. From a computation viewpoint, it is more efficient to work in the ring $GF(2)[x]/(x^{16} + x^3 + 1)$ than in the field $GF(2)[x]/D(x)$. Brent and Zimmermann demonstrated in [21] that trinomials have the following characteristics:
1. If a trinomial of the form $x^m + x^n + 1$ is almost primitive, then $GCD(m, n) = 1$.
2. If a trinomial of the form $x^m + x^n + 1$ is almost irreducible, then $GCD(m, n)$ is odd.
Furthermore, Seroussi [22] showed that irreducible trinomials with $GCD(m, n) > 1$ are rare. Accordingly, this paper focuses on AOPs and trinomials with $GCD(m, n) = 1$ and uses the properties of these polynomials to derive low-complexity bit-parallel systolic Montgomery multipliers.
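The two concrete claims in this section, the AOP irreducibility criterion and the factorization of $x^{16} + x^3 + 1$, can be checked numerically. The sketch below is only illustrative and is not part of the original paper. It assumes a polynomial over $GF(2)$ is stored as a Python integer whose bit $i$ is the coefficient of $x^i$, and the helper names `clmul`, `is_prime`, `order_of_two`, and `aop_is_irreducible` are hypothetical.

```python
def clmul(a, b):
    """Carry-less product of two GF(2) polynomials stored as ints (bit i = coeff of x^i)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def is_prime(n):
    """Trial-division primality test; adequate for the small values used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def order_of_two(p):
    """Multiplicative order of 2 modulo an odd prime p."""
    order, value = 1, 2 % p
    while value != 1:
        value = (value * 2) % p
        order += 1
    return order


def aop_is_irreducible(m):
    """Criterion quoted in the text: m + 1 is prime and 2 is primitive modulo m + 1."""
    p = m + 1
    return p > 2 and is_prime(p) and order_of_two(p) == p - 1


# Degrees m <= 100 whose all-one polynomial passes the criterion.
aop_degrees = [m for m in range(2, 101) if aop_is_irreducible(m)]
print(aop_degrees)

# x^16 + x^3 + 1 = (x^3 + x^2 + 1) * D(x), with D(x) as quoted in the text.
D = sum(1 << e for e in (13, 12, 11, 9, 6, 5, 4, 2, 0))
factor = (1 << 3) | (1 << 2) | 1
trinomial = (1 << 16) | (1 << 3) | 1
print(clmul(factor, D) == trinomial)
```

The criterion is only applied here; the sketch does not test irreducibility or primitivity of $D(x)$ directly.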
3 CONVENTIONAL MONTGOMERY MULTIPLICATION OVER $GF(2^m)$
The Montgomery multiplication was presented in 1985 for efficient integer modular multiplications [23]. In 1998, it was extended by Koc and Acar [13] to multiplications in the finite field $GF(2^m)$. The Montgomery multiplication is defined by $A(x)B(x)R^{-1}(x) \bmod P(x)$, where $P(x)$ generates the field $GF(2^m)$ and $A(x)$, $B(x)$, and $R^{-1}(x) \in GF(2^m)$. Notably, $R^{-1}(x)$ denotes the multiplicative inverse of the element $R(x)$. For bit-parallel multiplication in $GF(2^m)$, Wu [14] showed that low-complexity Montgomery multipliers over $GF(2^m)$ can be obtained if the field is generated by $P(x) = x^m + x^k + 1$ and $R(x)$ is selected as $R(x) = x^k$. Since $P(x)$ and $R(x)$ are relatively prime, two polynomials $R^{-1}(x)$ and $\tilde{P}(x)$ exist such that $R(x)R^{-1}(x) + \tilde{P}(x)P(x) = 1$. Thus, the Montgomery multiplication of $A(x)$ and $B(x)$ is defined as follows:
Step 1. $T(x) = A(x)B(x)$.
Step 2. $U(x) = T(x)\tilde{P}(x) \bmod R(x)$.
Step 3. $C(x) = (T(x) + U(x)P(x))/R(x) \bmod P(x)$.
As these steps show, the Montgomery multiplication is a complicated arithmetic operation that incorporates three steps: conventional multiplication, modulo multiplication, and division. Because $R(x)$ is a power of $x$, the reduction modulo $R(x)$ and the division by $R(x)$ are simple truncation and shift operations, so the method avoids the division by $P(x)$ that is usually necessary for finding the remainder, at the cost of an additional multiplication. Since divisions are much more costly than multiplications, this method represents a significant improvement over regular modular multiplications. The following section addresses the implementation of a bit-parallel systolic architecture.
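Steps 1 to 3 can be traced in software with a minimal sketch, assuming $R(x) = x^k$ and polynomials stored as Python integers (bit $i$ is the coefficient of $x^i$). The function names `clmul`, `polymod`, `inverse_mod_xk`, and `montgomery_mul` are hypothetical, and computing $\tilde{P}(x)$ as the inverse of $P(x)$ modulo $x^k$ is one convenient choice that satisfies $R(x)R^{-1}(x) + \tilde{P}(x)P(x) = 1$; it is not taken from the paper.

```python
def clmul(a, b):
    """Carry-less product of two GF(2) polynomials stored as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def polymod(a, p):
    """Remainder of a(x) divided by p(x) over GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def inverse_mod_xk(p, k):
    """Inverse of p(x) modulo x^k, built one coefficient at a time; needs p(0) = 1."""
    inv = 1
    for i in range(1, k):
        # If coefficient i of p*inv is still 1, adding x^i to inv clears it.
        if (clmul(p, inv) >> i) & 1:
            inv |= 1 << i
    return inv


def montgomery_mul(a, b, p, k):
    """Return A(x)B(x)R^{-1}(x) mod P(x) for R(x) = x^k, following Steps 1-3."""
    low_mask = (1 << k) - 1
    p_tilde = inverse_mod_xk(p, k)
    t = clmul(a, b)                                  # Step 1: T = A * B
    u = clmul(t & low_mask, p_tilde) & low_mask      # Step 2: U = T * P~ mod x^k
    c = (t ^ clmul(u, p)) >> k                       # Step 3: exact division by x^k
    return polymod(c, p)


P = 0b100101                 # x^5 + x^2 + 1
A, B = 0b10110, 0b01101      # two arbitrary elements of GF(2^5)
for k in (2, 5):
    C = montgomery_mul(A, B, P, k)
    # C(x) * x^k must equal A(x)B(x) modulo P(x).
    print(k, polymod(clmul(C, 1 << k), P) == polymod(clmul(A, B), P))
```

In Step 3 the low $k$ coefficients of $T(x) + U(x)P(x)$ are zero by construction, which is why a right shift implements the division exactly.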
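4 BIT-PARALLEL SYSTOLIC MULTIPLIER FOR BINOMIALS
A special polynomial of the form $x^m + 1$ over $GF(2)$, called the binomial, produces simpler multipliers. Although the finite field $GF(2^m)$ cannot be constructed from this polynomial, an AOP-based multiplier over $GF(2^{m-1})$ frequently uses the binomial $x^m + 1$ as the reduction polynomial to obtain a low-complexity implementation. For example, the reduction process using the AOP $P(x) = x^{m-1} + x^{m-2} + \cdots + x + 1$ over $GF(2)$ is usually carried out using the binomial $x^m + 1$ since $(x + 1)P(x) = x^m + 1$. Then, any element $A(x)$ in the Galois field $GF(2^{m-1})$ can also be represented as $A(x) = a_{m-1}x^{m-1} + \cdots + a_2x^2 + a_1x + a_0$ over $GF(2)$. Applying this element representation, the polynomial multiplication modulo $x^m + 1$ is given in the following theorem.
Theorem 1. Assume that $A(x) = a_{m-1}x^{m-1} + \cdots + a_2x^2 + a_1x + a_0$ and $B(x) = b_{m-1}x^{m-1} + \cdots + b_2x^2 + b_1x + b_0$ are two elements in $GF(2^{m-1})$, where the field is constructed from the irreducible AOP of degree $m-1$. Since $x^m = 1$, the computation of the product of $A(x)$ and $B(x)$ involves
\[ A(x)B(x) = \sum_{i=0}^{m-1} \Biggl( \sum_{\substack{j=0 \\ i+j\ \mathrm{even}}}^{m-1} a_{\langle \frac{i-j}{2} \rangle} b_{\langle \frac{i+j}{2} \rangle} + \sum_{\substack{j=0 \\ i+j\ \mathrm{odd}}}^{m-1} a_{\langle \frac{i+j+1}{2} \rangle} b_{\langle \frac{i-j-1}{2} \rangle} \Biggr) x^i. \tag{1} \]
Proof. Since $x^m = 1$, the product of $A(x)$ and $B(x)$ is the cyclic convolution
\[ A(x)B(x) = \sum_{i=0}^{m-1} \Biggl( \sum_{j=0}^{m-1} a_{\langle i-j \rangle} b_j \Biggr) x^i, \tag{2} \]
where $\langle p \rangle$ denotes $p$ modulo $m$. In the following, the two cases of even $i$ and odd $i$ are discussed.
Case 1: even $i$. Assume that $k$ is an even value, where $0 \le k \le m-1$. Then $j = \langle \frac{i+k}{2} \rangle$ satisfies $\frac{i}{2} \le j \le \frac{i}{2} + \lfloor \frac{m-1}{2} \rfloor$. Thus, substituting $j = \langle \frac{i+k}{2} \rangle$ produces
\[ \sum_{j=\frac{i}{2}}^{\frac{i}{2} + \lfloor \frac{m-1}{2} \rfloor} a_{\langle i-j \rangle} b_j = \sum_{\substack{k=0 \\ k\ \mathrm{even}}}^{m-1} a_{\langle \frac{i-k}{2} \rangle} b_{\langle \frac{i+k}{2} \rangle}. \]
Next, let $k$ be an odd number. Then $j = \langle \frac{i-k-1}{2} \rangle$ satisfies $\frac{i}{2} + \lfloor \frac{m-1}{2} \rfloor + 1 \le j \le m-1$ or $0 \le j \le \frac{i}{2} - 1$. Thus, substituting $j = \langle \frac{i-k-1}{2} \rangle$ gives
\[ \sum_{j=0}^{\frac{i}{2}-1} a_{\langle i-j \rangle} b_j + \sum_{j=\frac{i}{2} + \lfloor \frac{m-1}{2} \rfloor + 1}^{m-1} a_{\langle i-j \rangle} b_j = \sum_{\substack{k=1 \\ k\ \mathrm{odd}}}^{m-1} a_{\langle \frac{i+k+1}{2} \rangle} b_{\langle \frac{i-k-1}{2} \rangle}. \]
Case 2: odd $i$. Assume that $k$ is an even value, where $0 \le k \le m-1$. Then $j = \langle \frac{i-k-1}{2} \rangle$ satisfies $\frac{i+1}{2} + \lfloor \frac{m-1}{2} \rfloor \le j \le m-1$ or $0 \le j \le \frac{i-1}{2}$. Thus, substituting $j = \langle \frac{i-k-1}{2} \rangle$, one has
\[ \sum_{j=0}^{\frac{i-1}{2}} a_{\langle i-j \rangle} b_j + \sum_{j=\frac{i+1}{2} + \lfloor \frac{m-1}{2} \rfloor}^{m-1} a_{\langle i-j \rangle} b_j = \sum_{\substack{k=0 \\ k\ \mathrm{even}}}^{m-1} a_{\langle \frac{i+k+1}{2} \rangle} b_{\langle \frac{i-k-1}{2} \rangle}. \]
Next, let $k$ be an odd number. Then $j = \langle \frac{i+k}{2} \rangle$ satisfies $\frac{i+1}{2} \le j \le \frac{i-1}{2} + \lfloor \frac{m-1}{2} \rfloor$. Thus, substituting $j = \langle \frac{i+k}{2} \rangle$ produces
\[ \sum_{j=\frac{i+1}{2}}^{\frac{i-1}{2} + \lfloor \frac{m-1}{2} \rfloor} a_{\langle i-j \rangle} b_j = \sum_{\substack{k=1 \\ k\ \mathrm{odd}}}^{m-1} a_{\langle \frac{i-k}{2} \rangle} b_{\langle \frac{i+k}{2} \rangle}. \]
In both cases, the terms with $i + k$ even have the form $a_{\langle \frac{i-k}{2} \rangle} b_{\langle \frac{i+k}{2} \rangle}$ and the terms with $i + k$ odd have the form $a_{\langle \frac{i+k+1}{2} \rangle} b_{\langle \frac{i-k-1}{2} \rangle}$. Renaming $k$ as $j$ and substituting into (2) gives
\[ A(x)B(x) = \sum_{i=0}^{m-1} \Biggl( \sum_{\substack{j=0 \\ i+j\ \mathrm{even}}}^{m-1} a_{\langle \frac{i-j}{2} \rangle} b_{\langle \frac{i+j}{2} \rangle} + \sum_{\substack{j=0 \\ i+j\ \mathrm{odd}}}^{m-1} a_{\langle \frac{i+j+1}{2} \rangle} b_{\langle \frac{i-j-1}{2} \rangle} \Biggr) x^i. \tag{3} \]
□
Now, for each inner sum $\sum_{j=0,\ i+j\ \mathrm{even}}^{m-1} a_{\langle \frac{i-j}{2} \rangle} b_{\langle \frac{i+j}{2} \rangle} + \sum_{j=0,\ i+j\ \mathrm{odd}}^{m-1} a_{\langle \frac{i+j+1}{2} \rangle} b_{\langle \frac{i-j-1}{2} \rangle}$, a column vector is defined as
\[ W_i = [w_{0,i}, w_{1,i}, \ldots, w_{(m-1),i}]^T, \]
where
\[ w_{j,i} = \begin{cases} a_{\langle \frac{i-j}{2} \rangle} b_{\langle \frac{i+j}{2} \rangle}, & \text{for } i + j = \text{even} \\ a_{\langle \frac{i+j+1}{2} \rangle} b_{\langle \frac{i-j-1}{2} \rangle}, & \text{for } i + j = \text{odd}. \end{cases} \]
The sum of all entries in the column vector $W_i$ is exactly this inner sum, and $W_i$ appears as the $i$th column vector in the $m$ by $m$ matrix $W = [w_{j,i}]$, whose columns correspond to the powers $1, x, \ldots, x^{m-1}$:
\[ W = \begin{bmatrix} w_{0,0} & w_{0,1} & \cdots & w_{0,(m-1)} \\ w_{1,0} & w_{1,1} & \cdots & w_{1,(m-1)} \\ \vdots & \vdots & \ddots & \vdots \\ w_{(m-1),0} & w_{(m-1),1} & \cdots & w_{(m-1),(m-1)} \end{bmatrix}. \tag{4} \]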
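The index pattern of Theorem 1 and the matrix $W$ in (4) can be checked with a short sketch. It is illustrative only: coefficient vectors are assumed to be Python lists with `a[i]` holding $a_i$, the names `w_indices`, `multiply_mod_binomial`, and `cyclic_convolution` are hypothetical, and the row-by-row accumulation is just one way to add up the columns of $W$.

```python
def w_indices(j, i, m):
    """Subscripts (s, t) of the product a_s * b_t stored in entry w_{j,i} of W."""
    if (i + j) % 2 == 0:
        return ((i - j) // 2) % m, ((i + j) // 2) % m
    return ((i + j + 1) // 2) % m, ((i - j - 1) // 2) % m


def multiply_mod_binomial(a, b):
    """Coefficients of A(x)B(x) mod (x^m + 1), adding one row of W at a time."""
    m = len(a)
    d = [0] * m                                   # running column sums
    for j in range(m):
        row = [w_indices(j, i, m) for i in range(m)]
        d = [d[i] ^ (a[s] & b[t]) for i, (s, t) in enumerate(row)]
    return d


def cyclic_convolution(a, b):
    """Reference product modulo x^m + 1, straight from the definition."""
    m = len(a)
    c = [0] * m
    for s in range(m):
        for t in range(m):
            c[(s + t) % m] ^= a[s] & b[t]
    return c


m = 5
# Print W for m = 5: row j, column i holds the product a_s b_t.
for j in range(m):
    print("  ".join("a%d b%d" % w_indices(j, i, m) for i in range(m)))

a = [1, 0, 1, 1, 0]
b = [0, 1, 1, 0, 1]
print(multiply_mod_binomial(a, b) == cyclic_convolution(a, b))
```

Each column $i$ of the printed matrix contains exactly the products $a_s b_t$ with $s + t \equiv i \pmod{m}$, which is what makes the column sums equal to the coefficients of the product.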
The structure of the matrix $W$ indicates that, if $i + j$ = even, then the coefficients $a_{\langle \frac{i-j}{2} \rangle}$ and $b_{\langle \frac{i+j}{2} \rangle}$ in the expression of $w_{j,i}$ are determined by the coefficients in the expressions of $w_{(j-1),(i-1)}$ and $w_{(j-1),(i+1)}$, respectively; if $i + j$ = odd, the coefficients $a_{\langle \frac{i+j+1}{2} \rangle}$ and $b_{\langle \frac{i-j-1}{2} \rangle}$ in the expression of $w_{j,i}$ are determined by the coefficients in the expressions of $w_{(j-1),(i+1)}$ and $w_{(j-1),(i-1)}$, respectively. From this observation, the proposed binomial-based multiplication is regular and simple and is well-suited to a systolic architecture that fully exploits the inherent parallelism of the input data. In the following example, Theorem 1 is applied to verify the correctness of the configuration of the binomial-based multiplication.
Example 1. Let two elements in $GF(2^4)$ be given by $A(x) = \sum_{i=0}^{4} a_i x^i$ and $B(x) = \sum_{i=0}^{4} b_i x^i$. Assume that $C(x) = \sum_{i=0}^{4} c_i x^i$ denotes the product of $A(x)$ and $B(x)$ over $GF(2^4)$. Using the structure of the matrix $W$ in (4) with $m = 5$, each $c_i$ is the sum of the entries in column $i$ of
\[ \begin{array}{ccccc} 1 & x & x^2 & x^3 & x^4 \\ \hline a_0b_0 & a_1b_0 & a_1b_1 & a_2b_1 & a_2b_2 \\ a_1b_4 & a_0b_1 & a_2b_0 & a_1b_2 & a_3b_1 \\ a_4b_1 & a_2b_4 & a_0b_2 & a_3b_0 & a_1b_3 \\ a_2b_3 & a_4b_2 & a_3b_4 & a_0b_3 & a_4b_0 \\ a_3b_2 & a_3b_3 & a_4b_3 & a_4b_4 & a_0b_4 \end{array} \]
Fig. 1. The bit-parallel systolic multiplier over $GF(2^4)$ based on AOP.
Fig. 2. The detailed circuit of the U-cell.
For clarity, $A(x) = a_4x^4 + a_3x^3 + a_2x^2 + a_1x + a_0$ and $B(x) = b_4x^4 + b_3x^3 + b_2x^2 + b_1x + b_0$ are used as an example to illustrate the systolic multiplier. Let $D_j(x) = d_{j,4}x^4 + d_{j,3}x^3 + d_{j,2}x^2 + d_{j,1}x + d_{j,0}$ represent the $j$th intermediate product of $A(x)$ and $B(x)$. Assuming the initial value $D_0(x) = 0$, $d_{0,i} = 0$ for $0 \le i \le 4$. Fig. 1 shows a parallel-in parallel-out systolic multiplier. The U-cell presented in Fig. 2 is made up of one 2-input AND gate, one 2-input XOR gate, and three 1-bit latches. The U-cell located in the $i$th column and $j$th row of the proposed multiplier computes $d_{j+1,i} = d_{j,i} + a_{\langle \frac{i-j}{2} \rangle} b_{\langle \frac{i+j}{2} \rangle}$ if $i + j$ = even or $d_{j+1,i} = d_{j,i} + a_{\langle \frac{i+j+1}{2} \rangle} b_{\langle \frac{i-j-1}{2} \rangle}$ if $i + j$ = odd. Fig. 1 shows that the latency is $m$ clock cycles and that the maximum computation delay of each cell is that of one 2-input AND gate and one 2-input XOR gate.
Applying the properties of an AOP, the AOP-based multiplier in [11] uses an inner-product multiplication to perform the multiplication over $GF(2^{m-1})$ and thereby constructs a fully bit-parallel systolic multiplier. The required latency is only $m$ clock cycles. Comparing the proposed AOP-based multiplier with Lee et al.'s multiplier in [11], both multipliers have the same hardware complexity and latency. Notably, the method underlying Lee et al.'s multiplier in [11] differs from the proposed method.
5 PROPOSED TRANSFORMATION METHOD
The preceding section discusses a new bit-parallel systolic multiplier for binomials. This section introduces a shuffling of the coefficients of two elements to perform the computation $A(x)B(x)x^{-n} \bmod (x^m + 1)$. Accordingly, the next section will present the Montgomery systolic multiplier for trinomials.
Let $m$ be a positive integer and let
\[ \sigma(i) = q + i(m - n) \bmod m, \tag{5} \]
where $1 \le n \le m-1$, $0 \le i \le m-1$, and $q = \langle \frac{m-n-1}{2} \rangle$. If the value of $n$ is fixed with $1 \le n \le m-1$ and $GCD(n, m) = 1$, then $\sigma(i)$, for $0 \le i \le m-1$, is a permutation of the complete residue set $\{0, 1, 2, \ldots, m-1\}$. Observing the $i$th column vector $W_i$ in the matrix $W$ in Example 1, for fixed $i$ with $0 \le i \le m-1$:
1. $\langle \langle \frac{i+j}{2} \rangle + \langle \frac{i-j}{2} \rangle \rangle = i$, for $i + j$ = even.
2. $\langle \langle \frac{i+j+1}{2} \rangle + \langle \frac{i-j-1}{2} \rangle \rangle = i$, for $i + j$ = odd.
Using (5), the above properties become
1. $\langle \sigma(\langle \frac{i+j}{2} \rangle) + \sigma(\langle \frac{i-j}{2} \rangle) \rangle = \langle q + \sigma(i) \rangle$, for $i + j$ = even.
2. $\langle \sigma(\langle \frac{i+j+1}{2} \rangle) + \sigma(\langle \frac{i-j-1}{2} \rangle) \rangle = \langle q + \sigma(i) \rangle$, for $i + j$ = odd.
3. If $i = m-1$, then $\langle q + \sigma(i) \rangle = m-1$.
Thus, applying the above properties, (1) can be rewritten as follows:
\[ A(x)B(x) = \sum_{i=0}^{m-1} \Biggl( \sum_{\substack{j=0 \\ i+j\ \mathrm{even}}}^{m-1} a_{\sigma(\langle \frac{i-j}{2} \rangle)} b_{\sigma(\langle \frac{i+j}{2} \rangle)} x^{\langle \sigma(\langle \frac{i+j}{2} \rangle) + \sigma(\langle \frac{i-j}{2} \rangle) \rangle} + \sum_{\substack{j=0 \\ i+j\ \mathrm{odd}}}^{m-1} a_{\sigma(\langle \frac{i+j+1}{2} \rangle)} b_{\sigma(\langle \frac{i-j-1}{2} \rangle)} x^{\langle \sigma(\langle \frac{i+j+1}{2} \rangle) + \sigma(\langle \frac{i-j-1}{2} \rangle) \rangle} \Biggr) = \sum_{i=0}^{m-1} W_{\langle q + \sigma(i) \rangle} x^{\langle q + \sigma(i) \rangle}, \tag{6} \]
where
\[ W_{\langle q + \sigma(i) \rangle} = \sum_{\substack{j=0 \\ i+j\ \mathrm{even}}}^{m-1} a_{\sigma(\langle \frac{i-j}{2} \rangle)} b_{\sigma(\langle \frac{i+j}{2} \rangle)} + \sum_{\substack{j=0 \\ i+j\ \mathrm{odd}}}^{m-1} a_{\sigma(\langle \frac{i+j+1}{2} \rangle)} b_{\sigma(\langle \frac{i-j-1}{2} \rangle)}. \]
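The permutation in (5), its three properties, and the rewritten product (6) can be checked with the sketch below. It is not from the paper: the names `make_sigma`, `check_properties`, and `multiply_permuted` are hypothetical, $m$ is assumed odd, and $\langle \frac{m-n-1}{2} \rangle$ is interpreted as halving modulo $m$ (multiplication by the inverse of 2), which agrees with ordinary halving whenever $m - n - 1$ is even.

```python
def w_indices(j, i, m):
    """Subscripts (s, t) of a_s * b_t in entry w_{j,i} of the matrix W of Theorem 1."""
    if (i + j) % 2 == 0:
        return ((i - j) // 2) % m, ((i + j) // 2) % m
    return ((i + j + 1) // 2) % m, ((i - j - 1) // 2) % m


def make_sigma(m, n):
    """Return (q, [sigma(0), ..., sigma(m-1)]) as in (5); m is assumed odd."""
    half = (m + 1) // 2                      # inverse of 2 modulo odd m (assumption)
    q = ((m - n - 1) * half) % m
    return q, [(q + i * (m - n)) % m for i in range(m)]


def check_properties(m, n):
    """Check that sigma is a permutation and that properties 1-3 hold."""
    q, sigma = make_sigma(m, n)
    ok = sorted(sigma) == list(range(m))
    for i in range(m):
        for j in range(m):
            s, t = w_indices(j, i, m)
            ok &= (s + t) % m == i
            ok &= (sigma[s] + sigma[t]) % m == (q + sigma[i]) % m
    ok &= (q + sigma[m - 1]) % m == m - 1
    return ok


def multiply_permuted(a, b, n):
    """Coefficients of A(x)B(x) mod (x^m + 1), accumulated column by column as in (6)."""
    m = len(a)
    q, sigma = make_sigma(m, n)
    c = [0] * m
    for i in range(m):
        column = (q + sigma[i]) % m          # power of x produced by column i
        for j in range(m):
            s, t = w_indices(j, i, m)
            c[column] ^= a[sigma[s]] & b[sigma[t]]
    return c


q, sigma = make_sigma(5, 2)
print(q, sigma)
print([(q + s) % 5 for s in sigma])          # column order of the permuted product
print(check_properties(5, 2), check_properties(7, 3))
```

For $m = 5$ and $n = 2$ the second printed list gives the powers of $x$ delivered by columns 0 to 4 of the permuted array.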
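Now, consider the Montgomery multiplication for computing $A(x)B(x)x^{-n} \bmod (x^m + 1)$. From [11],
\[ B(x)^{(-n)} = B(x)x^{-n} \bmod (x^m + 1) = \sum_{j=0}^{m-1} b_{\langle j+n \rangle} x^j. \tag{7} \]
Clearly, $B(x)^{(-n)}$ is easily determined by shifting $B(x)$ cyclically $n$ positions to the left, that is, by substituting $\langle j + n \rangle$ into the subscript of $b_j$ in the element $B(x)$. Accordingly, $A(x)B(x)x^{-n} \bmod (x^m + 1)$ based on (6) can be given by
\[ A(x)B(x)x^{-n} \bmod (x^m + 1) = A(x)B(x)^{(-n)} = \sum_{i=0}^{m-1} \Biggl( \sum_{\substack{j=0 \\ i+j\ \mathrm{even}}}^{m-1} a_{\sigma(\langle \frac{i-j}{2} \rangle)} b_{\langle \sigma(\langle \frac{i+j}{2} \rangle) + n \rangle} + \sum_{\substack{j=0 \\ i+j\ \mathrm{odd}}}^{m-1} a_{\sigma(\langle \frac{i+j+1}{2} \rangle)} b_{\langle \sigma(\langle \frac{i-j-1}{2} \rangle) + n \rangle} \Biggr) x^{\langle q + \sigma(i) \rangle}. \tag{8} \]
Example 2. Given $\sigma(i) = \frac{m-n-1}{2} + i(m - n) \bmod m$, $n = 2$, $m = 5$, and $0 \le i \le 4$, the values $\sigma(0) = 1$, $\sigma(1) = 4$, $\sigma(2) = 2$, $\sigma(3) = 0$, $\sigma(4) = 3$, and $q = 1$ can be readily determined. Substituting $\sigma(i)$ into the matrix $W$ in Example 1, let $C(x) = \sum_{i=0}^{4} c_{\langle q + \sigma(i) \rangle} x^{\langle q + \sigma(i) \rangle}$, where $c_{\langle q + \sigma(i) \rangle}$ is defined as the sum of all entries of the column vector $W_i$; then
\[ \begin{array}{ccccc} x^2 & 1 & x^3 & x & x^4 \\ \hline a_1b_1 & a_4b_1 & a_4b_4 & a_2b_4 & a_2b_2 \\ a_4b_3 & a_1b_4 & a_2b_1 & a_4b_2 & a_0b_4 \\ a_3b_4 & a_2b_3 & a_1b_2 & a_0b_1 & a_4b_0 \\ a_2b_0 & a_3b_2 & a_0b_3 & a_1b_0 & a_3b_1 \\ a_0b_2 & a_0b_0 & a_3b_0 & a_3b_3 & a_1b_3 \end{array} \]
The above results show that the sequence $c_2, c_0, c_3, c_1, c_4$ is a permutation of the sequence $c_0, c_1, c_2, c_3, c_4$, as compared with Example 1. Moreover, if $i + j$ = even, the $a$ and $b$ coefficients in the expression for $w_{j,i}$ are determined by the coefficients in the expressions for $w_{(j-1),(i-1)}$ and $w_{(j-1),(i+1)}$, respectively; if $i + j$ = odd, they are determined by the coefficients in the expressions for $w_{(j-1),(i+1)}$ and $w_{(j-1),(i-1)}$, respectively.
Now, consider the Montgomery multiplier for calculating $D(x) = A(x)B(x)x^{-n} \bmod (x^m + 1)$. Since $B(x)^{(-n)}$ can be determined by shifting $B(x)$ cyclically $n$ positions to the left, the product $D(x)$ can be obtained from (8) as follows:
\[ \begin{array}{ccccc} x^2 & 1 & x^3 & x & x^4 \\ \hline a_1b_3 & a_4b_3 & a_4b_1 & a_2b_1 & a_2b_4 \\ a_4b_0 & a_1b_1 & a_2b_3 & a_4b_4 & a_0b_1 \\ a_3b_1 & a_2b_0 & a_1b_4 & a_0b_3 & a_4b_2 \\ a_2b_2 & a_3b_4 & a_0b_0 & a_1b_2 & a_3b_3 \\ a_0b_4 & a_0b_2 & a_3b_2 & a_3b_0 & a_1b_0 \end{array} \]
As indicated above, the Montgomery multiplication for binomials can be summarized as follows:
1. The term located in position $(j, i)$, with rows and columns numbered from 0, is denoted as term $(j, i)$. All coefficients in term $(j, i)$ also appear in neighboring terms. For example, the term (2, 2) is $a_1b_4$, and the terms (1, 1) and (3, 3) include the coefficient $a_1$. Similarly, terms (3, 1) and (1, 3) include the coefficient $b_4$.
2. The Montgomery multiplication for computing $A(x)B(x)x^{-n} \bmod (x^m + 1)$ can be obtained from binomial-based multiplication. That is, Fig. 1 can also offer the multiplication of $A(x)B(x)x^{-n} \bmod (x^m + 1)$ when the two vectors $(a_{\sigma(0)}, a_{\sigma(1)}, \ldots, a_{\sigma(m-1)})$ and $(b_{\langle \sigma(0)+n \rangle}, b_{\langle \sigma(1)+n \rangle}, \ldots, b_{\langle \sigma(m-1)+n \rangle})$ are translated from the two vectors $(a_0, a_1, \ldots, a_{m-1})$ and $(b_0, b_1, \ldots, b_{m-1})$.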
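The summary above can be exercised with a small sketch of (7) and (8). It is illustrative only and assumes $m$ odd, list-based coefficient vectors (`a[i]` holds $a_i$), and modular halving for $q$; the names `term`, `montgomery_binomial`, and `reference` are hypothetical.

```python
def w_indices(j, i, m):
    """Subscripts (s, t) of a_s * b_t in entry w_{j,i} of the matrix W of Theorem 1."""
    if (i + j) % 2 == 0:
        return ((i - j) // 2) % m, ((i + j) // 2) % m
    return ((i + j + 1) // 2) % m, ((i - j - 1) // 2) % m


def make_sigma(m, n):
    """Return (q, [sigma(0), ..., sigma(m-1)]) as in (5); m is assumed odd."""
    half = (m + 1) // 2                      # inverse of 2 modulo odd m (assumption)
    q = ((m - n - 1) * half) % m
    return q, [(q + i * (m - n)) % m for i in range(m)]


def term(j, i, m, n):
    """Subscripts (s, t) of the product a_s b_t at position (j, i) of the array for (8)."""
    q, sigma = make_sigma(m, n)
    s, t = w_indices(j, i, m)
    return sigma[s], (sigma[t] + n) % m


def montgomery_binomial(a, b, n):
    """Coefficients of A(x)B(x)x^{-n} mod (x^m + 1), accumulated as in (8)."""
    m = len(a)
    q, sigma = make_sigma(m, n)
    d = [0] * m
    for i in range(m):
        column = (q + sigma[i]) % m          # power of x produced by column i
        for j in range(m):
            s, t = term(j, i, m, n)
            d[column] ^= a[s] & b[t]
    return d


def reference(a, b, n):
    """Direct computation: cyclic convolution, then multiplication by x^{-n}."""
    m = len(a)
    c = [0] * m
    for s in range(m):
        for t in range(m):
            c[(s + t) % m] ^= a[s] & b[t]
    return [c[(e + n) % m] for e in range(m)]


print(term(2, 2, 5, 2))                      # (1, 4), i.e. a1 b4 as in item 1 above
a = [1, 1, 0, 1, 0]
b = [0, 1, 1, 1, 1]
print(montgomery_binomial(a, b, 2) == reference(a, b, 2))
```

Only the wiring of the inputs changes relative to the plain binomial multiplier; the accumulation loop is the same, which is the point of item 2 in the summary.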
From the above summaries, the circuit in Fig. 1 can also be shown to compute $A(x)B(x)x^{-n} \bmod (x^m + 1)$. Thus, Fig. 1 is also called the AOP-based Montgomery multiplier. The following section will use the proposed multiplier to realize the trinomial-based Montgomery multiplication.
6 BIT-PARALLEL SYSTOLIC MONTGOMERY MULTIPLIER FOR IRREDUCIBLE TRINOMIALS OF THE FORM $x^m + x^n + 1$ WITH $GCD(m, n) = 1$
6.1 Algorithm
In [21], Brent and Zimmermann stated that almost primitive trinomials of the form $x^m + x^n + 1$ satisfy the condition $GCD(m, n) = 1$ and that most irreducible trinomials also satisfy $GCD(m, n) = 1$. Hence, this section addresses the reduction process using trinomials of the form $P(x) = x^m + x^n + 1$ with $GCD(m, n) = 1$ to derive the low-complexity bit-parallel systolic Montgomery multiplier.
Let $A(x) = a_{m-1}x^{m-1} + \cdots + a_1x + a_0$ and $B(x) = b_{m-1}x^{m-1} + \cdots + b_1x + b_0$ be two elements in $GF(2^m)$, where the field is constructed from the irreducible trinomial $P(x) = x^m + x^n + 1$ with $GCD(m, n) = 1$. Assume that the product $T(x) = t_{2m-2}x^{2m-2} + \cdots + t_1x + t_0$ is the general multiplication of $A(x)$ and $B(x)$, where
\[ \begin{aligned} t_0 &= a_0b_0 \\ t_1 &= a_0b_1 + a_1b_0 \\ &\ \ \vdots \\ t_{m-1} &= a_0b_{m-1} + a_1b_{m-2} + \cdots + a_{m-1}b_0 \\ t_m &= a_1b_{m-1} + a_2b_{m-2} + \cdots + a_{m-1}b_1 \\ &\ \ \vdots \\ t_{2m-2} &= a_{m-1}b_{m-1}. \end{aligned} \]
Consider the intermediate multiplication $T(x)$ decomposed as follows:
\[ T(x) = T_1(x) + T_2(x)x^n + T_3(x)x^{m+n}, \tag{9} \]
where
\[ \begin{aligned} T_1(x) &= t_{n-1}x^{n-1} + \cdots + t_1x + t_0 \\ T_2(x) &= t_{m+n-1}x^{m-1} + \cdots + t_{n+1}x + t_n \\ T_3(x) &= t_{2m-2}x^{m-n-2} + \cdots + t_{m+n+1}x + t_{m+n}. \end{aligned} \]
Given the intermediate product $A(x)B(x) = T_1(x) + T_2(x)x^n + T_3(x)x^{m+n}$ and using the Montgomery multiplication in [13], the Montgomery multiplication for computing $A(x)B(x)x^{-n} \bmod (x^m + 1)$ can be given by
\[ \begin{aligned} A(x)B(x)x^{-n} \bmod (x^m + 1) &= \frac{T_1(x) + T_2(x)x^n + T_3(x)x^{m+n} + T_1(x)(x^m + 1)}{x^n} \\ &= T_2(x) + T_3(x)x^m + T_1(x)x^{m-n} \\ &= T_2(x) + T_3(x) + T_1(x)x^{m-n}. \end{aligned} \tag{10} \]
Next, the Montgomery multiplication for trinomials $x^m + x^n + 1$ can be represented as
\[ \begin{aligned} A(x)B(x)x^{-n} \bmod (x^m + x^n + 1) &= \frac{T_1(x) + T_2(x)x^n + T_3(x)x^{m+n} + T_1(x)(x^m + x^n + 1)}{x^n} \\ &= T_2(x) + T_3(x)x^m + T_1(x)x^{m-n} + T_1(x) \\ &= (T_2(x) + T_3(x) + T_1(x)x^{m-n}) + (T_1(x) + T_3(x)x^n) \\ &= K(x) + G(x), \end{aligned} \tag{11} \]
where
\[ \begin{aligned} K(x) &= T_2(x) + T_3(x) + T_1(x)x^{m-n} = A(x)B(x)x^{-n} \bmod (x^m + 1) \\ G(x) &= T_1(x) + T_3(x)x^n. \end{aligned} \]
From the relation of $K(x)$ and $G(x)$, each term in $G(x)$ is included in the polynomial $K(x)$, that is, the polynomial $G(x)$ can be extracted from the computation of $K(x) = A(x)B(x)x^{-n} \bmod (x^m + 1)$. Thus, the two polynomials $K(x)$ and $G(x)$ are given by
\[ K(x) = k_{\langle q+\sigma(0) \rangle} x^{\langle q+\sigma(0) \rangle} + k_{\langle q+\sigma(1) \rangle} x^{\langle q+\sigma(1) \rangle} + \cdots + k_{\langle q+\sigma(m-1) \rangle} x^{\langle q+\sigma(m-1) \rangle}, \tag{12} \]
\[ G(x) = g_{\langle q+\sigma(0) \rangle} x^{\langle q+\sigma(0) \rangle} + g_{\langle q+\sigma(1) \rangle} x^{\langle q+\sigma(1) \rangle} + \cdots + g_{\langle q+\sigma(m-2) \rangle} x^{\langle q+\sigma(m-2) \rangle}. \tag{13} \]
Moreover, each term $g_{\langle q+\sigma(i) \rangle}$, for $0 \le i \le m-2$, in $G(x)$ can also be extracted from the computation of $k_{\langle q+\sigma(i) \rangle}$. In summary, the implementation of the bit-parallel systolic Montgomery multiplier for trinomials is established by the following steps:
Step 1. Since the function $K(x)$ in (11) is defined by $K(x) = T_2(x) + T_3(x) + T_1(x)x^{m-n} = A(x)B(x)x^{-n} \bmod (x^m + 1)$, Fig. 1 can be used to realize the computation of $K(x)$.
Step 2. From the relations of $K(x)$ and $G(x)$ in (12) and (13), respectively, Fig. 1 can be reconfigured to produce the two polynomials $K(x)$ and $G(x)$, as shown in the multiplication unit of Fig. 3.
Step 3. The final sum step is performed by summing $K(x)$ and $G(x)$, i.e., if the product is written as
\[ D(x) = A(x)B(x)x^{-n} \bmod (x^m + x^n + 1) = d_{\langle q+\sigma(0) \rangle} x^{\langle q+\sigma(0) \rangle} + d_{\langle q+\sigma(1) \rangle} x^{\langle q+\sigma(1) \rangle} + \cdots + d_{\langle q+\sigma(m-1) \rangle} x^{\langle q+\sigma(m-1) \rangle}, \]
then one has
\[ \begin{aligned} d_{\langle q+\sigma(i) \rangle} &= k_{\langle q+\sigma(i) \rangle} + g_{\langle q+\sigma(i) \rangle} \bmod 2, \quad \text{for } 0 \le i \le m-2 \\ d_{\langle q+\sigma(m-1) \rangle} &= k_{\langle q+\sigma(m-1) \rangle}. \end{aligned} \]
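The decomposition in (9) to (11) can be verified with a minimal sketch, assuming polynomials over $GF(2)$ are stored as Python integers (bit $i$ is the coefficient of $x^i$). The names `clmul`, `polymod`, and `trinomial_montgomery` are hypothetical, and the sketch models only the arithmetic, not the systolic array or its cell-level timing.

```python
def clmul(a, b):
    """Carry-less product of two GF(2) polynomials stored as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def polymod(a, p):
    """Remainder of a(x) divided by p(x) over GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def trinomial_montgomery(a, b, m, n):
    """Return (K, G, D) with D = K + G = A(x)B(x)x^{-n} mod (x^m + x^n + 1)."""
    t = clmul(a, b)
    t1 = t & ((1 << n) - 1)            # T1: coefficients t_0 .. t_{n-1}
    t2 = (t >> n) & ((1 << m) - 1)     # T2: coefficients t_n .. t_{m+n-1}
    t3 = t >> (m + n)                  # T3: coefficients t_{m+n} .. t_{2m-2}
    k = t2 ^ t3 ^ (t1 << (m - n))      # K(x) = T2 + T3 + T1 x^{m-n}
    g = t1 ^ (t3 << n)                 # G(x) = T1 + T3 x^n
    return k, g, k ^ g


m, n = 5, 2
P = (1 << m) | (1 << n) | 1            # x^5 + x^2 + 1
A, B = 0b11010, 0b00111                # two arbitrary elements of GF(2^5)
K, G, D = trinomial_montgomery(A, B, m, n)
# D(x) * x^n must equal A(x)B(x) modulo the trinomial.
print(polymod(clmul(D, 1 << n), P) == polymod(clmul(A, B), P))
# K(x) * x^n must equal A(x)B(x) modulo the binomial x^m + 1.
print(polymod(clmul(K, 1 << n), (1 << m) | 1) == polymod(clmul(A, B), (1 << m) | 1))
```

Both $K(x)$ and $G(x)$ already have degree below $m$, so their sum needs no further reduction; this is why the final step is a single XOR per coefficient.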
6.2 Hardware Implementation
In this section, we use the primitive trinomial $x^5 + x^2 + 1$ over $GF(2)$ to illustrate the novel bit-parallel systolic Montgomery multiplier over $GF(2^5)$. Let $A(x) = a_4x^4 + a_3x^3 + a_2x^2 + a_1x + a_0$ and $B(x) = b_4x^4 + b_3x^3 + b_2x^2 + b_1x + b_0$ be two elements of the field $GF(2^5)$ generated by the primitive trinomial $x^5 + x^2 + 1$ over $GF(2)$. The Montgomery multiplication for computing $K(x) = A(x)B(x)x^{-2} \bmod (x^5 + 1)$ is expressed, following (8) with $m = 5$ and $n = 2$, as
\[ K(x) = \sum_{i=0}^{4} \tilde{k}_i x^{\langle q + \sigma(i) \rangle}, \]
where $\tilde{k}_i = k_{\langle q + \sigma(i) \rangle}$ is the sum of the entries in the $i$th column of the array. From the above multiplication, based on (13), the polynomial $G(x)$ can be represented as
\[ \begin{aligned} G(x) &= a_0b_0 + (a_1b_0 + a_0b_1)x + (a_3b_4 + a_4b_3)x^2 + a_4b_4x^3 \\ &= (a_3b_4 + a_4b_3)x^{\langle q+\sigma(0) \rangle} + a_0b_0x^{\langle q+\sigma(1) \rangle} + a_4b_4x^{\langle q+\sigma(2) \rangle} + (a_1b_0 + a_0b_1)x^{\langle q+\sigma(3) \rangle}. \end{aligned} \]
Fig. 3. The bit-parallel systolic Montgomery multiplier for the field generated by $x^5 + x^2 + 1$.
Fig. 4. The detailed circuits for W, Q, and V cells.
TABLE 1. Comparison of the Related Systolic Multipliers over $GF(2^m)$
Obviously, each term $g_{\langle q+\sigma(i) \rangle}$ in $G(x)$ also appears in the corresponding term $k_{\langle q+\sigma(i) \rangle}$ in $K(x)$. As mentioned before, Fig. 3 shows the proposed trinomial-based Montgomery multiplier, which includes the multiplication unit and the sum unit. In the multiplication unit, each masked entry is implemented by a V-cell, which consists of one 2-input AND gate, two 2-input XOR gates, and four 1-bit latches, as shown in Fig. 4a. Each unmasked entry is implemented by a Q-cell, which is composed of one 2-input AND gate, one 2-input XOR gate, and four 1-bit latches, as shown in Fig. 4b. Based on these two basic cells, the multiplication unit in Fig. 3 produces the two polynomials $K(x)$ and $G(x)$. In addition, the coefficient $g_{\langle q+\sigma(i) \rangle}$ for $0 \le i \le 3$ is easily shown to be produced in the $(i+1)$th column. Finally, the sum unit in Fig. 3 is composed of $m$ W-cells that perform the sum of $K(x)$ and $G(x)$. Each W-cell consists of one 2-input XOR gate and one 1-bit latch to perform the computation $d_{\langle q+\sigma(i) \rangle} = g_{\langle q+\sigma(i) \rangle} + k_{\langle q+\sigma(i) \rangle}$, as shown in Fig. 4c. Therefore, Fig. 3 uses three types of cells to carry out the Montgomery multiplication for trinomials. As stated above, the proposed bit-parallel systolic Montgomery multiplier requires a latency of only $m + 1$ clock cycles. The maximum computation delay in each cell is that of one 2-input AND gate and one 2-input XOR gate.
7 CONCLUSIONS
This paper first addresses the use of the Montgomery technique to realize bit-parallel systolic multipliers for AOPs and trinomials. The proposed trinomial-based and AOP-based Montgomery multipliers are shown to be derivable from the proposed binomial-based multiplier as simple systolic multipliers. Both kinds of multipliers have a high throughput due to the low propagation delay in each cell. Moreover, the latency of both multipliers is only $m + 1$ clock cycles to compute the Montgomery multiplication in $GF(2^m)$.
The literature describes various bit-parallel systolic multipliers using the polynomial basis, including those of Wang-Lin [9], Yeh et al. [10], and Lee [17]. Their multipliers are based on Horner's rule. If the field $GF(2^m)$ is constructed from a primitive polynomial of general form, then the latency of the multiplier must still be $3m$ clock cycles. In particular, in some proposed designs, such as Lee's multiplier [17], the latency can be reduced from $3m$ to $2m - 1$. However, such multipliers cannot easily perform the Montgomery multiplication. Here, the binomial-based multiplication is considered first and is then used to realize the AOP-based and the trinomial-based Montgomery multipliers as bit-parallel systolic architectures. Table 1 indicates that the proposed multipliers require fewer logic gates and have a lower latency than the conventional systolic multipliers in [9], [10]. Moreover, Lee et al.'s multiplier in [11] used the inner-product method to develop an AOP-based systolic multiplier over $GF(2^m)$. This circuit typically performs a binomial-based multiplication. The major problem with Lee et al.'s multiplier is its lack of suitability for binomial multiplication of an even degree because of the inner-product multiplication. More importantly, the proposed multipliers require a latency of only $m + 1$ clock cycles. That is, the proposed multipliers were designed to have very low latency and very high throughput and to significantly reduce the time and space complexity. Moreover, our architectures are well-suited to VLSI systems because of their regular interconnection patterns, modular structures, and fully inherent parallelism, and are suitable for applications such as smart cards, mobile phones, or other portable devices with tight space constraints.
ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers for their valuable comments, which improved the revised version of this paper.
REFERENCES
[1] E.R. Berlekamp, Algebraic Coding Theory. New York: McGraw-Hill, 1968.
[2] M.Y. Rhee, Cryptography and Secure Communications. Singapore: McGraw-Hill, 1994.
[3] N. Koblitz, "Elliptic Curve Cryptosystems," Math. Computation, vol. 48, no. 177, pp. 203-209, Jan. 1987.
[4] C. Paar, "A New Architecture for a Parallel Finite Field Multiplier with Low Complexity Based on Composite Fields," IEEE Trans. Computers, vol. 45, no. 7, pp. 856-861, July 1996.
[5] C.K. Koc and B. Sunar, "Low-Complexity Bit-Parallel Canonical and Normal Basis Multipliers for a Class of Finite Fields," IEEE Trans. Computers, vol. 47, no. 3, pp. 353-356, Mar. 1998.
[6] B. Sunar and C.K. Koc, "Mastrovito Multiplier for All Trinomials," IEEE Trans. Computers, vol. 48, no. 5, pp. 522-527, May 1999.
[7] M. Diab and A. Poli, "New Bit-Serial Systolic Multiplier for $GF(2^m)$ Using Irreducible Trinomials," Electronics Letters, vol. 27, no. 20, pp. 1183-1184, June 1991.
[8] J.H. Guo and C.L. Wang, "A Low-Complexity Power-Sum Circuit for $GF(2^m)$ and Its Applications," IEEE Trans. Circuits and Systems II, vol. 47, no. 10, pp. 1091-1097, Oct. 2000.
[9] C.L. Wang and J.L. Lin, "Systolic Array Implementation of Multipliers for $GF(2^m)$," IEEE Trans. Circuits and Systems II, vol. 38, pp. 796-800, July 1991.
[10] C.S. Yeh, S. Reed, and T.K. Truong, "Systolic Multipliers for Finite Fields $GF(2^m)$," IEEE Trans. Computers, vol. 33, no. 4, pp. 357-360, Apr. 1984.
[11] C.Y. Lee, E.H. Lu, and J.Y. Lee, "Bit-Parallel Systolic Multipliers for $GF(2^m)$ Fields Defined by All-One and Equally-Spaced Polynomials," IEEE Trans. Computers, vol. 50, no. 5, pp. 385-393, May 2001.
[12] C.Y. Lee, E.H. Lu, and L.F. Sun, "Low-Complexity Bit-Parallel Systolic Architecture for Computing $AB^2 + C$ in a Class of Finite Field $GF(2^m)$," IEEE Trans. Circuits and Systems II, vol. 48, no. 5, pp. 519-523, May 2001.
[13] C.K. Koc and T. Acar, "Montgomery Multiplication in $GF(2^k)$," Designs, Codes and Cryptography, vol. 14, no. 1, pp. 57-69, Apr. 1998.
[14] H. Wu, "Montgomery Multiplier and Squarer for a Class of Finite Fields," IEEE Trans. Computers, vol. 51, no. 5, pp. 521-529, May 2002.
[15] J.C. Bajard, L. Imbert, C. Negre, and T. Plantard, "Efficient Multiplication in $GF(p^k)$ for Elliptic Curve Cryptography," Proc. IEEE Symp. Computer Arithmetic, pp. 181-187, June 2003.
[16] C.L. Wang, "Bit-Level Systolic Array for Fast Exponentiation in $GF(2^m)$," IEEE Trans. Computers, vol. 43, no. 7, pp. 838-841, July 1994.
[17] C.Y. Lee, "Low Complexity Bit-Parallel Systolic Multiplier over $GF(2^m)$ Using Irreducible Trinomials," IEE Proc. Computers and Digital Technology, vol. 150, pp. 39-42, Jan. 2003.
[18] A.J. Menezes, Applications of Finite Fields. Kluwer Academic, 1993.
[19] I.F. Blake, S. Gao, and R.L. Lambert, "Construction and Distribution Problems for Irreducible Trinomials over Finite Fields," Applications of Finite Fields, pp. 19-32, Oxford: Clarendon Press, 1996.
[20] W. Stahnke, "Primitive Binary Polynomials," Math. Computation, vol. 27, pp. 977-980, 1973.
[21] R.P. Brent and P. Zimmermann, "Algorithms for Finding Almost Irreducible and Almost Primitive Trinomials," Proc. Conf. in Honor of Professor H.C. Williams, May 2003.
[22] G. Seroussi, "Table of Low-Weight Binary Irreducible Polynomials," Visual Computing Dept., Hewlett Packard Laboratories, Aug. 1998, http://www.hpl.hp.com/techreports/98/HPL-98135.html.
[23] P.L. Montgomery, "Modular Multiplication without Trial Division," Math. Computation, vol. 44, pp. 519-521, 1985.
Chiou-Yng Lee received the bachelor's degree (1986) in medical engineering and the MS degree in electronic engineering (1992), both from the Chung Yuan University, Taiwan, and the PhD degree in electrical engineering from Chang Gung University, Taiwan, in 2001. Since 1988, he has been a research associate with Chunghwa Telecommunication Laboratory in Taiwan, where he joined the Department of Project Planning. He taught related field courses at Ching-Yun Technology University. His research interests include computations in finite fields, error-control coding, signal processing, and digital transmission systems. He is a member of the IEEE and the IEEE Computer Society. He also became an honor member of Phi Tao Phi in 2001.
Jenn-Shyong Horng received the BS degree in electrical engineering from National Cheng Kung University, Taiwan, in 1991 and the MS degree in electrical engineering from Kaohsiung Polytechnic Institute, Taiwan, in 1995. Since 1996, he has been employed by Southern Region Branch, Chunghwa Telecommunication Corporation, Taiwan, Republic of China, where he works in the Department of Corporate Services. He is currently a PhD candidate in the Department of Computer and Communication Engineering, National Kaohsiung First University of Science and Technology, Taiwan. His research interests include algorithms and architectures for computations in finite fields, data security and error-control coding, and digital communication networks. He has been awarded several Taiwan patents.
I-Chang Jou received the BS degree in electrical engineering from National Taiwan University, Taiwan, in 1969, the MS degree in geophysics and computer science from National Central University, Taiwan, in 1972 and 1983, respectively, and the PhD degree in electrical engineering from National Taiwan University, Taiwan, in 1986. He was with the Telecommunication Laboratories of the Ministry of Communications, Taiwan, from 1972 to 1997. Currently, he is the president of National Kaohsiung First University of Science and Technology. His major research fields are VLSI for DSP, digital signal processing, image processing, speech processing, and neural networks. He has published more than 125 papers in the areas of parallel computing, image processing, speech processing, and neural networks. He is a senior member of the IEEE.
Erl-Huei Lu received the BS and MS degrees in electrical engineering from Chung Cheng Institute of Technology, Taiwan, in 1974 and 1980, respectively, and the PhD degree in electrical engineering from National Cheng Kung University, Taiwan, in 1988. He is a professor in the Department of Electrical Engineering, Chang Gung University, Taiwan. His research interests include error-control coding, network security, and systolic architectures.