Scalable and Systolic Multipliers for Gaussian Normal Basis of $GF(2^m)$

Chiou-Yng Lee^1, Chin-Chin Chen^2, Wen-Yo Lee^1 and Erl-Huei Lu^2

^1 Department of Computer Information and Network Engineering, LungHwa University, Taiwan, R.O.C., [email protected]
^2 The Department of Electrical Engineering, Chang Gung University, Taiwan, R.O.C., [email protected]
Abstract. Two novel algorithms are presented for a type-$t$ Gaussian normal basis (GNB) of $GF(2^m)$, which are appropriate for VLSI implementation. The proposed algorithms are based on the Hankel matrix-vector representation and yield scalable and systolic multipliers that can adapt flexibly to the required precision. The analysis shows that the multipliers have a latency of $d + n^2 + n$ clock cycles, where $n = \lceil (mt+1)/d \rceil$ and $d$ denotes the selected digit size. The proposed architectures are suitable for implementing all type-$t$ GNB multiplications. A comparison of the latency indicates that the proposed multipliers for type-1 and type-2 GNBs of $GF(2^m)$ save around 60% and 40% latency, respectively, over related systolic multipliers with unscalable architectures. Moreover, because the new architectures are regular and modular, they are well suited to VLSI implementation.

Keywords: Bit-Parallel Systolic Multiplier, Hankel Matrix-Vector, Optimal Normal Basis, Gaussian Normal Basis
1 Introduction
Galois field arithmetic operations, i.e., addition, multiplication and inversion, have several applications in cryptography, including the decipherment operation of the RSA algorithm [1], the Diffie-Hellman key exchange algorithm [2] and elliptic curve cryptography [3]. Multiplication is the most important and time-consuming of the finite field $GF(2^m)$ arithmetic operations. Other complex finite field operations such as exponentiation, division and multiplicative inversion can be performed by repeated multiplications. The efficiency of finite field multiplication depends greatly on the basis chosen to represent the elements of $GF(2^m)$. There are three popular basis representations, termed polynomial basis (PB), normal basis (NB), and dual basis (DB). Each basis representation has its own distinct advantages. Normal basis multiplication is generally selected for cryptographic applications, because squaring an element of $GF(2^m)$ is simply a right cyclic shift of its coordinates. The normal basis multiplier over $GF(2^m)$, as discovered by Massey and Omura [4], depends on the selection of a key function for multiplication. A hardware implementation of normal basis multiplication is classified as either parallel or serial. Bit-serial multipliers [5,6] require less area, but are slow, taking $m$ clock cycles to multiply two elements. Conversely, bit-parallel multipliers [7,8] are typically fast, but have larger architectures. In particular, bit-parallel multipliers are the most efficient normal basis multipliers over $GF(2^m)$ when the field is built from an irreducible all-one polynomial, which gives the type-I optimal normal basis. An optimal normal basis of type II is based on the palindromic representation of polynomials of length $2m$. Kwon [9] and Lee-Chiou [10] recently developed efficient bit-parallel systolic multipliers for an optimal normal basis of type II. Both hardware implementations trade throughput performance against hardware complexity.

To implement a cryptosystem in a constrained environment such as a smart card, a good multiplication algorithm is necessary to realize the VLSI chip. For the elliptic curve digital signature algorithm (ECDSA) in the IEEE [11] and National Institute of Standards and Technology (NIST) [12] standards, Gaussian normal bases (GNBs) are defined to implement the field arithmetic operations. The GNB is a special class of normal basis, which exists for every positive integer $m$ not divisible by eight. The GNB for $GF(2^m)$ is determined by an integer $t$, and is called the type-$t$ Gaussian normal basis. When more than one GNB exists for a given $m$, small values of $t$ are generally chosen so that the field multiplication is implemented efficiently, since the complexity of a type-$t$ GNB multiplier is proportional to $t$ [13].
For cryptographic applications, a major design concern for multiplication units is the large number of operand bits, which produces large signal fan-out, long wire delays and complex routing. These problems are reduced in systolic architectures [14-17]. However, these architectures are normally tailored for fixed-precision computation. Hybrid multipliers for composite fields $GF((2^m)^k)$ have been presented to improve the trade-off between throughput performance and hardware complexity [18], and digit-serial architectures have also been developed [19]. These architectures use a cut-set systolization technique to speed up computation; however, such multipliers have a space complexity similar to that of the original bit-level multiplier designs. A scalable architecture combines serial and parallel algorithms. It separates the original $m$-bit data into $\lceil m/d \rceil$ subwords, where the selected digit size $d$ is the scalable factor. As the computation on each subword takes one clock cycle, the complete computation on the original data takes $\lceil m/d \rceil$ clock cycles. Therefore, considering the trade-off between throughput performance and hardware complexity, the scalable architecture can produce an optimal realization in hardware implementations. Novel scalable multiplication algorithms for Gaussian normal bases are proposed herein. The proposed architectures are built around Hankel matrix-vector multiplication, and achieve efficient scalable and systolic multipliers.
2 Preliminaries

2.1 Gaussian Normal Basis
The finite field $GF(2^m)$ is well known to be viewable as a vector space of dimension $m$ over $GF(2)$. A set $N = \{\alpha, \alpha^2, \cdots, \alpha^{2^{m-1}}\}$ is called a normal basis of $GF(2^m)$, and $\alpha$ is called a normal element of $GF(2^m)$. Any element $A \in GF(2^m)$ can be represented as

$$A = \sum_{i=0}^{m-1} a_i \alpha^{2^i} = (a_0, a_1, \cdots, a_{m-1}) \tag{1}$$
where $a_i \in GF(2)$, $0 \le i \le m-1$, denotes the $i$th coordinate of $A$. In hardware, squaring an element is performed by a cyclic shift of its binary representation. The multiplication of elements in $GF(2^m)$ is uniquely determined by the $m$ cross products $\alpha\alpha^{2^i} = \sum_{j=0}^{m-1} u_{ij}\alpha^{2^j}$, $u_{ij} \in GF(2)$. $M = \{u_{ij}\}$ is called a multiplication table. Let $A = (a_0, a_1, \cdots, a_{m-1})$ and $B = (b_0, b_1, \cdots, b_{m-1})$ indicate two normal basis elements in $GF(2^m)$, and $C = (c_0, c_1, \cdots, c_{m-1}) \in GF(2^m)$ represent their product, i.e., $C = AB$. Coordinate $i$ of $C$ can then be represented by

$$c_i = A^{(i)} \cdot M \cdot (B^{(i)})^T \tag{2}$$

where $A^{(i)}$ denotes the circular shift of the element $A$ by $i$ positions. The complexity of the normal basis $N$, represented by $C_N$, is the number of nonzero $u_{ij}$ values in $M$, and determines the gate count and time delay of the NB multiplier. Mullin et al. [24] demonstrated that two types of optimal normal basis (ONB), type-I and type-II, exist in $GF(2^m)$ if $C_N = 2m-1$.

Definition 1. Let $p = mt+1$ be a prime number and $\gcd(mt/k, m) = 1$, where $k$ denotes the multiplicative order of 2 modulo $p$. Let $\gamma$ be a primitive $p$th root of unity, and use $\alpha = \gamma + \gamma^{2^m} + \cdots + \gamma^{2^{(t-1)m}}$ to generate a normal basis $N$ for $GF(2^m)$ over $GF(2)$, called the Gaussian normal basis (GNB).

Significantly, GNBs exist for $GF(2^m)$ whenever $m$ is not divisible by 8. By Definition 1, each element $A$ of $GF(2^m)$ can also be given as

$$\begin{aligned}
A &= a_0\alpha + a_1\alpha^2 + \cdots + a_{m-1}\alpha^{2^{m-1}} \\
  &= a_0\gamma + a_1\gamma^2 + \cdots + a_{m-1}\gamma^{2^{m-1}} \\
  &\quad + a_0\gamma^{2^m} + a_1\gamma^{2^{m+1}} + \cdots + a_{m-1}\gamma^{2^{2m-1}} \\
  &\quad + \cdots \\
  &\quad + a_0\gamma^{2^{(t-1)m}} + a_1\gamma^{2^{(t-1)m+1}} + \cdots + a_{m-1}\gamma^{2^{tm-1}}
\end{aligned} \tag{3}$$
Given Eq. (3) and $\gamma^{mt+1} = 1$, the element $A$ for the type-$t$ GNB of $GF(2^m)$ can alternatively be written as

$$A = a_{F(1)}\gamma + a_{F(2)}\gamma^2 + \cdots + a_{F(p-1)}\gamma^{p-1} \tag{4}$$
The sequence $F(1), F(2), \cdots, F(p-1)$ needs to be pre-computed with $F(2^i u^j \bmod p) = i$, $0 \le i \le m-1$, $0 \le j \le t-1$, where $u$ denotes an integer of order $t$ mod $p$. Hence, the GNB can also be represented with the set $\{\gamma, \gamma^2, \cdots, \gamma^{p-1}\}$. For $t = 1$, it is well known that the type-1 GNB $\{\gamma, \gamma^2, \cdots, \gamma^m\}$ equals the field generated from an irreducible all-one polynomial (AOP) of degree $m$. For the type-$t$ GNB with an even $t$, a GNB element can be written as $A = a_1\gamma + a_2\gamma^2 + \cdots + a_{tm}\gamma^{tm}$, where $a_i = a_{p-i}$ since $\gamma^{tm+1} = 1$. For $t = 2$, the type-2 GNB can be transformed into the normal basis $\{\gamma+\gamma^{-1}, \gamma^2+\gamma^{-2}, \cdots, \gamma^m+\gamma^{-m}\}$. Therefore, type-1 and type-2 GNBs are the same as type-I and type-II ONBs, respectively.

Let $A = (a_0, a_1, \cdots, a_{m-1})$ and $B = (b_0, b_1, \cdots, b_{m-1})$ denote two normal basis elements in $GF(2^m)$, and let $C = (c_0, c_1, \cdots, c_{m-1}) \in GF(2^m)$ represent their product, i.e., $C = AB$. Suppose that a type-$t$ GNB exists for $GF(2^m)$. Rather than adopting $M$ in Eq. (2), the first coordinate of the product $C$ can be calculated as

$$c_0 = F(A, B) = \sum_{j=1}^{p-2} a_{F(j+1)} b_{F(p-j)} \tag{5}$$
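As a small illustration of the pre-computed sequence $F$ and of Eq. (5), the following Python sketch builds $F$ by brute force and evaluates the product coordinate by coordinate, taking $c_i$ as Eq. (5) applied to the operands cyclically shifted by $i$ positions. The function names (`gnb_f_sequence`, `first_coordinate`, `gnb_multiply`) are hypothetical and not from the paper, and the sketch assumes that $p = mt+1$ is prime and that a type-$t$ GNB exists. It was only checked by hand for an even type ($t = 2$, $m = 5$), so it should be read as a sketch under those assumptions.

```python
def gnb_f_sequence(m, t):
    """Return a list F of length p = m*t + 1 with F[k] for 1 <= k <= p-1 (F[0] is unused)."""
    p = m * t + 1
    # u: an integer of multiplicative order t modulo p, found by brute force.
    u = next(x for x in range(1, p)
             if pow(x, t, p) == 1 and all(pow(x, e, p) != 1 for e in range(1, t)))
    F = [None] * p
    for i in range(m):
        for j in range(t):
            F[(pow(2, i, p) * pow(u, j, p)) % p] = i  # F(2^i u^j mod p) = i
    return F


def first_coordinate(a, b, F):
    """Eq. (5): c_0 = sum_{j=1}^{p-2} a_{F(j+1)} b_{F(p-j)} over GF(2)."""
    p = len(F)
    return sum(a[F[j + 1]] & b[F[p - j]] for j in range(1, p - 1)) % 2


def gnb_multiply(a, b, m, t):
    """All coordinates: c_i is Eq. (5) applied to A and B cyclically shifted by i."""
    F = gnb_f_sequence(m, t)
    return [first_coordinate(a[i:] + a[:i], b[i:] + b[:i], F) for i in range(m)]


# Type-2 GNB of GF(2^5): p = 11 and u = 10.
F = gnb_f_sequence(5, 2)
print(F[1:])                      # [0, 1, 3, 2, 4, 4, 2, 3, 1, 0]

alpha = [1, 0, 0, 0, 0]           # the normal element alpha
alpha_sq = [0, 1, 0, 0, 0]        # alpha^2
print(gnb_multiply(alpha, alpha_sq, 5, 2))   # [1, 0, 0, 1, 0]
```

The printed sequence is symmetric, which matches the relation $a_i = a_{p-i}$ stated above for even $t$.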
2.2 Basis conversion
The GNB multiplication using Algorithm 1 can generally obtain an efficient hardware implementation for every GNB element represented by Eq. (4). Definition 1 reveals that a GNB element has two representations, given by Eqs. (3) and (4). Because $\gamma^p = 1$ and $F(2^i u^j \bmod p) = i$, the GNB element representation in Eq. (3) can easily be translated into the element representation of Eq. (4). Hence, the basis conversion from the GNB to the NB is described in the following two steps:

Step 1. $(a_0, \cdots, a_0, \cdots, a_{m-1}, \cdots, a_{m-1}) \leftarrow (a_{F(1)}, a_{F(2)}, \cdots, a_{F(p-1)})$
Step 2. $(a_0, a_1, \cdots, a_{m-1}) \leftarrow (a_0, \cdots, a_0, \cdots, a_{m-1}, \cdots, a_{m-1})$

2.3 Bit-parallel systolic Hankel multiplier
This section briefly introduces the bit-parallel systolic Hankel multiplier in [10].

Definition 2. An $m \times m$ matrix $H$ is known as a Hankel matrix if it satisfies the relation $H(k, p) = H(k+1, p-1)$, for $0 \le k \le m-2$, $1 \le p \le m-1$, where $H(i, j)$ represents the element at the intersection of row $i$ and column $j$.

A Hankel matrix is entirely determined by its first row and last column, and thus depends on $2m-1$ parameters. The entries of $H$ are constant along the diagonals parallel to the main anti-diagonal. A Hankel matrix $H$ may therefore be expressed by the vector formed from its first row and last column, $H = (h_0, h_1, \cdots, h_{2m-2})$.

[Fig. 1. The bit-parallel systolic Hankel multiplier: a $5 \times 5$ array of U- and V-cells with inputs $h_0, \cdots, h_8$ and $a_0, \cdots, a_4$ and outputs $c_0, \cdots, c_4$.]

Let $B = (b_0, b_1, \cdots, b_{m-1})$ denote a vector, and let $C = H \otimes B$ denote the product of a Hankel vector and a vector. The coordinates of the product $C = (c_0, c_1, \cdots, c_{m-1})$ are given by

$$c_i = \sum_{j=0}^{m-1} h_{i+j} b_j \tag{6}$$
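To make Eq. (6) concrete, the following Python sketch expands a Hankel vector into its matrix and evaluates the product $H \otimes B$ over $GF(2)$. The names `hankel_matrix` and `hankel_product` are hypothetical helpers introduced only for this illustration.

```python
def hankel_matrix(h, m):
    """Expand a Hankel vector (h_0, ..., h_{2m-2}) into the m x m matrix with H(i, j) = h_{i+j}."""
    assert len(h) == 2 * m - 1
    return [[h[i + j] for j in range(m)] for i in range(m)]


def hankel_product(h, b):
    """Eq. (6): c_i = sum_j h_{i+j} b_j, with additions taken over GF(2)."""
    m = len(b)
    assert len(h) == 2 * m - 1
    return [sum(h[i + j] & b[j] for j in range(m)) % 2 for i in range(m)]


h = [1, 0, 1, 1, 0]          # Hankel vector for m = 3
b = [1, 1, 0]
print(hankel_matrix(h, 3))   # [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
print(hankel_product(h, b))  # [1, 1, 0]
```

Only the $2m-1$ entries of the Hankel vector are stored; the matrix form is shown purely for comparison.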
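Definition 3. Let $i, j, m$ be integers with $0 \le i, j \le m-1$. The function $\sigma(i, j)$ is defined as

$$\sigma(i, j) = \begin{cases} \left\langle \dfrac{j-i}{2} \right\rangle & \text{for } i+j \text{ even} \\[2mm] \left\langle -\dfrac{j+i+1}{2} \right\rangle & \text{for } i+j \text{ odd} \end{cases}$$

where $\langle q \rangle$ denotes $q \bmod m$.

For any fixed integer $i$ in $\{0, 1, \cdots, m-1\}$, one verifies that the map $j \mapsto \sigma(i, j)$ is a permutation of $\{0, 1, \cdots, m-1\}$. Therefore, by substituting $\sigma(i, j)$ for $j$ in Eq. (6), the product $C$ can be written as

$$C = \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} h_{i+\sigma(i,j)} b_{\sigma(i,j)} \gamma^i \tag{7}$$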
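The reordering in Definition 3 can be checked numerically. The Python sketch below implements $\sigma(i, j)$ and evaluates the coordinates of Eq. (7) by visiting the terms of Eq. (6) in the order $\sigma(i, 0), \sigma(i, 1), \cdots$. The names `sigma` and `hankel_product_sigma` are hypothetical, and the sketch only illustrates the index map, not the cell-level timing of the systolic array.

```python
def sigma(i, j, m):
    """Definition 3, with <q> = q mod m."""
    if (i + j) % 2 == 0:
        return ((j - i) // 2) % m
    return (-(j + i + 1) // 2) % m


def hankel_product_sigma(h, b):
    """Coordinates of Eq. (7): c_i = sum_j h_{i+sigma(i,j)} b_{sigma(i,j)} over GF(2)."""
    m = len(b)
    c = []
    for i in range(m):
        acc = 0
        for j in range(m):
            k = sigma(i, j, m)
            acc ^= h[i + k] & b[k]
        c.append(acc)
    return c


m = 5
for i in range(m):
    # Each row of indices is a permutation of 0..m-1.
    print(i, [sigma(i, j, m) for j in range(m)])

print(hankel_product_sigma([1, 0, 1, 1, 0], [1, 1, 0]))  # [1, 1, 0]
```

Because each row of indices is a permutation, the reordered sum visits every term of Eq. (6) exactly once and gives the same coordinates.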
Figure 1 depicts the bit-parallel systolic Hankel multiplier given by Example 1. The multiplier comprises 25 cells, including 14 U-cells and 11 V-cells. Every $U_{i,j}$ cell (Fig. 2(a)) contains one 2-input AND gate, one 2-input XOR gate and four 1-bit latches to realize $c_i = c_i + a_{\sigma(i,j)} h_{\sigma(i,j)+i}$ for $\sigma(i,j) = \langle (j-i)/2 \rangle$, or $c_i = c_i + a_{\sigma(i,j)} h_{\sigma(i,j)+i}$ for $\sigma(i,j) = \langle -(j+i+1)/2 \rangle$. Each $V_{i,j}$ cell (Fig. 2(b)) is formed by one 2-input AND gate, one 2-input XOR gate and four 1-bit latches to realize the operation $c_i = c_i + a_{\sigma(i,j)} h_{\sigma(i,j)+i}$ for $\sigma(i,j) = \langle (j-i)/2 \rangle$, or the operation $c_i = c_i + a_{\sigma(i,j)} h_{\sigma(i,j)+i}$ for $\sigma(i,j) = \langle -(j+i+1)/2 \rangle$. As stated in the above cell operations, the latency is $m$ clock cycles, and the computation time per cell is the delay of one 2-input AND gate and one 2-input XOR gate.

[Fig. 2. (a) The detailed circuit of U-cell; (b) The detailed circuit of V-cell]
3 Proposed scalable and systolic coprocessor

Let $A = \sum_{l=0}^{p-1} a_l\gamma^l$ and $B = \sum_{l=0}^{p-1} b_l\gamma^l$ with $a_0 = b_0 = 0$ and $a_l, b_l \in GF(2)$ for $1 \le l \le p-1$ denote two type-$t$ GNB elements in $GF(2^m)$, where $\gamma$ represents the root of $x^p + 1$. Assume that the chosen digit size is $d$ bits and $n = \lceil p/d \rceil$. Both elements $A$ and $B$ can then also be expressed as follows:

$$A = \sum_{l=0}^{p-1} a_l\gamma^l = \sum_{i=0}^{n-1} A_i\gamma^{id}, \qquad B = \sum_{l=0}^{p-1} b_l\gamma^l = \sum_{i=0}^{n-1} B_i\gamma^{id}$$

where $A_i = \sum_{j=0}^{d-1} a_{id+j}\gamma^j$, $B_i = \sum_{j=0}^{d-1} b_{id+j}\gamma^j$ and $a_l = b_l = 0$ for $p \le l \le dn-1$. For the partial multiplication $C = AB_0$, the partial product $C$ can be written as

$$AB_0 = A_0B_0 + A_1B_0\gamma^d + \cdots + A_{n-1}B_0\gamma^{d(n-1)} \tag{8}$$
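As a rough illustration of the digit decomposition and of Eq. (8), the following Python sketch stores a polynomial in $\gamma$ as a list of $GF(2)$ coefficients (index equals exponent), splits $A$ into $n = \lceil p/d \rceil$ zero-padded digits, and rebuilds $AB_0$ from the digit products $A_iB_0$. The helper names (`split_digits`, `poly_mul`, `partial_product`) and the sample operands are assumptions made for this example only.

```python
def split_digits(coeffs, d):
    """Split (a_0, ..., a_{p-1}) into n = ceil(p/d) digits of d coefficients, zero-padded."""
    n = -(-len(coeffs) // d)
    padded = coeffs + [0] * (n * d - len(coeffs))
    return [padded[i * d:(i + 1) * d] for i in range(n)]


def poly_mul(x, y):
    """Schoolbook product of two coefficient lists over GF(2), with no reduction."""
    out = [0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        if xi:
            for j, yj in enumerate(y):
                out[i + j] ^= yj
    return out


def partial_product(a, b0, d):
    """Eq. (8): AB_0 = sum_i (A_i B_0) gamma^{id}."""
    digits = split_digits(a, d)
    out = [0] * (len(digits) * d + d - 1)
    for i, a_i in enumerate(digits):
        for k, bit in enumerate(poly_mul(a_i, b0)):
            out[i * d + k] ^= bit
    return out


a = [0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1]   # p = 11, a_0 = 0
b0 = [0, 1, 1, 0]                        # first digit B_0 for d = 4
print(split_digits(a, 4))                # [[0, 1, 0, 1], [1, 0, 0, 1], [1, 0, 1, 0]]
print(partial_product(a, b0, 4) == poly_mul(a + [0], b0))  # True
```

The comparison at the end checks the digit-wise sum against a direct product of the zero-padded operand.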
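Each term $A_iB_0$ in Eq. (8) represents the core computation and has degree at most $2d-2$. In general, the term $A_iB_0$ can be written as

$$A_iB_0 = S_i + D_i\gamma^d \tag{9}$$

where the degrees of both $S_i$ and $D_i$ are less than $d$. Therefore, Eq. (8) can be re-expressed as

$$\begin{aligned}
C = AB_0 &= (S_0 + D_0\gamma^d) + (S_1 + D_1\gamma^d)\gamma^d + \cdots + (S_{n-1} + D_{n-1}\gamma^d)\gamma^{d(n-1)} \\
&= S_0 + (D_0 + S_1)\gamma^d + \cdots + (D_{n-2} + S_{n-1})\gamma^{d(n-1)} + D_{n-1}\gamma^{dn} \\
&= C_0 + C_1\gamma^d + \cdots + C_{n-1}\gamma^{d(n-1)} + C_n\gamma^{dn}
\end{aligned} \tag{10}$$

where

$$C_0 = S_0, \qquad C_i = D_{i-1} + S_i \ \text{ for } 1 \le i \le n-1, \qquad C_n = D_{n-1}.$$

Considering Eq. (10), $C_i = (c_{di}, c_{di+1}, \cdots, c_{d(i+1)-1})$ can be expressed as the following matrix-vector product:

$$\begin{bmatrix} c_{di} \\ c_{di+1} \\ \vdots \\ c_{d(i+1)-1} \end{bmatrix}
= \begin{bmatrix} b_0a_{di} + b_1a_{di-1} + \cdots + b_{d-1}a_{d(i-1)+1} \\ b_0a_{di+1} + b_1a_{di} + \cdots + b_{d-1}a_{d(i-1)+2} \\ \vdots \\ b_0a_{di+d-1} + b_1a_{di+d-2} + \cdots + b_{d-1}a_{di} \end{bmatrix}
= \begin{bmatrix} a_{d(i-1)+1} & a_{d(i-1)+2} & \cdots & a_{di} \\ a_{d(i-1)+2} & a_{d(i-1)+3} & \cdots & a_{di+1} \\ \vdots & \vdots & \ddots & \vdots \\ a_{di} & a_{di+1} & \cdots & a_{di+d-1} \end{bmatrix}
\begin{bmatrix} b_{d-1} \\ b_{d-2} \\ \vdots \\ b_0 \end{bmatrix}
= H_iB_0 \tag{11}$$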
From the above matrix-vector product, the $d \times d$ Hankel matrix $H_i$ is defined by the Hankel vector $H_i = (a_{(i-1)d+1}, a_{(i-1)d+2}, \cdots, a_{id+d-1})$. Hence, $AB_0$ can be computed with the Hankel vector representation as follows:

$$AB_0 = H_0 \otimes B_0 + (H_1 \otimes B_0)\gamma^d + \cdots + (H_n \otimes B_0)\gamma^{dn} \tag{12}$$
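The Hankel form of Eqs. (11) and (12) can be sketched in Python as follows. The code extracts each Hankel vector $H_i$ from the coefficients of $A$ (treating out-of-range coefficients as zero), multiplies it by the reversed digit $(b_{d-1}, \cdots, b_0)$ as in Eq. (11), and concatenates the $n+1$ blocks. The function names and test operands are hypothetical, and the comparison against a schoolbook product is only a sanity check of the indexing, not a model of the hardware.

```python
def poly_mul(x, y):
    """Schoolbook product of two coefficient lists over GF(2), with no reduction."""
    out = [0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        if xi:
            for j, yj in enumerate(y):
                out[i + j] ^= yj
    return out


def hankel_vector(a, i, d):
    """H_i = (a_{(i-1)d+1}, ..., a_{id+d-1}); coefficients outside 0..len(a)-1 count as 0."""
    return [a[l] if 0 <= l < len(a) else 0 for l in range((i - 1) * d + 1, i * d + d)]


def hankel_product(h, b):
    """Eq. (6): c_r = sum_c h_{r+c} b_c over GF(2)."""
    d = len(b)
    return [sum(h[r + c] & b[c] for c in range(d)) % 2 for r in range(d)]


def partial_product_hankel(a, b0, d):
    """Eq. (12): the blocks C_0, ..., C_n of AB_0 from n + 1 Hankel multiplications."""
    n = -(-len(a) // d)
    rev = b0[::-1]                       # Eq. (11) uses the vector (b_{d-1}, ..., b_0)
    out = []
    for i in range(n + 1):
        out.extend(hankel_product(hankel_vector(a, i, d), rev))
    return out                           # coefficients c_0, ..., c_{(n+1)d-1}


a = [0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1]   # p = 11
b0 = [0, 1, 1, 0]                        # d = 4, so n = 3
ref = poly_mul(a, b0)
print(partial_product_hankel(a, b0, 4) == ref + [0] * (16 - len(ref)))  # True
```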
The above equation reveals that the partial multiplication $AB_0$ can be decomposed into $(n+1)$ Hankel multiplications. Because $A = a_0 + a_1\gamma + \cdots + a_{p-1}\gamma^{p-1} + a_p\gamma^p + \cdots + a_{nd-1}\gamma^{nd-1}$ with $a_0 = 0$ and $a_p = a_{p+1} = \cdots = a_{nd-1} = 0$, the result of $C = AB_0$ can be written as $C = c_0 + c_1\gamma + \cdots + c_{p+d-1}\gamma^{p+d-1}$. Thus, reducing $C$ modulo $\gamma^p + 1$ gives

$$C = \sum_{j=0}^{d-1} (c_j + c_{p+j})\gamma^j + \sum_{j=d}^{p-1} c_j\gamma^j = \sum_{j=0}^{p-1} c_j\gamma^j \tag{13}$$

where the $c_j$ in the last sum denote the reduced coefficients.
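The reduction in Eq. (13) amounts to folding every coefficient of $\gamma^{p+j}$ onto $\gamma^j$, since $\gamma^p = 1$. A minimal Python sketch, with the hypothetical helper name `reduce_mod` and an arbitrary sample input:

```python
def reduce_mod(x, p):
    """Reduce a coefficient list modulo gamma^p + 1 over GF(2): exponents wrap modulo p."""
    out = [0] * p
    for k, bit in enumerate(x):
        out[k % p] ^= bit
    return out


# p = 5: the coefficients of gamma^5 and gamma^6 fold onto gamma^0 and gamma^1.
print(reduce_mod([1, 0, 1, 1, 0, 1, 1], 5))   # [0, 1, 1, 1, 0]
```

In hardware this corresponds to XORing the top coefficients of the $X$ register into its lowest positions, so no general polynomial division is needed.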
As stated above, the proposed partial multiplication for computing $AB_0$ can be calculated using the following steps.

Step 1: The element $A = (a_0, a_1, \cdots, a_{p-1})$ is first converted into the Hankel vectors $H_i = (a_{(i-1)d+1}, a_{(i-1)d+2}, \cdots, a_{id+d-1})$, for $0 \le i \le n$.
Step 2: The $AB_0$ computation is based on Eq. (12) and performs $(n+1)$ Hankel multiplications; the result is stored in the register $X$.
Step 3: The degree of the register $X$ is $p+d$, so the final step computes $C = X \bmod (\gamma^p + 1)$.

[Fig. 3. The proposed scalable and systolic architecture for computing $AB_0$: the Hankel vectors $H_0, \cdots, H_n$ and the digit $B_0$ feed the bit-parallel systolic Hankel multiplier, followed by the $X$ register and the reduction $C = X \bmod (\gamma^p + 1)$.]

As stated above, computing $AB_0$ includes two core operations, namely the Hankel multiplication and the reduction by the polynomial $\gamma^p + 1$, as illustrated in Fig. 3. The $AB_0$ computation requires $(n+1)$ Hankel multiplications when applying the bit-parallel systolic Hankel multiplier shown in Fig. 1. The architecture in Fig. 3 for deriving $AB_0$ has a latency of $d+n$ clock cycles, and the results are stored in the $X$ register. Following the $n+1$ iterative Hankel multiplications, the result is reduced by the polynomial $\gamma^p + 1$. Significantly, the general partial multiplication $AB_i$, $0 \le i \le n-1$, can also be represented as

$$AB_i = H_0 \otimes B_i + (H_1 \otimes B_i)\gamma^d + \cdots + (H_n \otimes B_i)\gamma^{dn} \tag{14}$$
Consequently, Fig. 3 can realize every ABi computation. The following section describes the implementation of two scalable and systolic GNB multipliers using the proposed scalable and systolic coprocessor in Fig. 3.
4 Scalable and systolic GNB multipliers

Let $A = \sum_{l=0}^{p-1} a_l\gamma^l$ and $B = \sum_{l=0}^{p-1} b_l\gamma^l$ with $a_0 = b_0 = 0$ and $a_l, b_l \in GF(2)$ for $1 \le l \le p-1$ represent two type-$t$ GNB elements in $GF(2^m)$. If the element $C$ is the product of $A$ and $B$, $C = AB \bmod (x^p + 1)$, then the product $C$ takes the following form:

$$C = AB \bmod (x^p + 1) = \sum_{i=0}^{p-1} c_i\gamma^i \tag{15}$$

If the selected digit size is $d$, then element $B$ can be represented as $B = \sum_{i=0}^{n-1} B_i\gamma^{id}$, where $B_i = \sum_{j=0}^{d-1} b_{id+j}\gamma^j$. The product $C$ in Eq. (15) using the LSD-first multiplication algorithm can therefore be represented as

$$\begin{aligned}
C &= AB_0 \bmod (\gamma^p + 1) + AB_1\gamma^d \bmod (\gamma^p + 1) + \cdots + AB_{n-1}\gamma^{d(n-1)} \bmod (\gamma^p + 1) \\
&= A^{(0)}B_0 + A^{(1)}B_1 + \cdots + A^{(n-1)}B_{n-1}
\end{aligned} \tag{16}$$

where

$$A^{(i)} = A\gamma^{id} \bmod (\gamma^p + 1) = A^{(i-1)}\gamma^d \bmod (\gamma^p + 1) = \sum_{j=0}^{p-1} a^{(i-1)}_{\langle j-d \rangle}\gamma^j$$

and $\langle \cdot \rangle$ here denotes reduction modulo $p$.
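The LSD-first accumulation of Eq. (16) can be sketched in Python as follows. The code works purely on coefficient lists in the $\{1, \gamma, \cdots, \gamma^{p-1}\}$ representation: multiplying by $\gamma^d$ modulo $\gamma^p + 1$ is a cyclic shift by $d$ positions, and each round adds $A^{(i)}B_i$ to the running result. The helper names are hypothetical, the basis conversions to and from the normal basis are left out, and the partial products are formed by a plain schoolbook product rather than by the Hankel multiplier, so this is only a functional sketch of the recurrence.

```python
def poly_mul(x, y):
    """Schoolbook product of two coefficient lists over GF(2), with no reduction."""
    out = [0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        if xi:
            for j, yj in enumerate(y):
                out[i + j] ^= yj
    return out


def reduce_mod(x, p):
    """Reduce modulo gamma^p + 1: exponents wrap modulo p."""
    out = [0] * p
    for k, bit in enumerate(x):
        out[k % p] ^= bit
    return out


def rotate(a, d):
    """A * gamma^d mod (gamma^p + 1): coefficient j comes from coefficient (j - d) mod p."""
    p = len(a)
    return [a[(j - d) % p] for j in range(p)]


def lsd_first_multiply(a, b, d):
    """Eq. (16): C = sum_i A^(i) B_i with A^(i) = A^(i-1) gamma^d mod (gamma^p + 1)."""
    p = len(a)
    n = -(-p // d)                                  # n = ceil(p / d)
    b_pad = b + [0] * (n * d - p)                   # b_l = 0 for p <= l <= dn - 1
    c = [0] * p
    for i in range(n):
        b_i = b_pad[i * d:(i + 1) * d]              # digit B_i
        partial = reduce_mod(poly_mul(a, b_i), p)   # A^(i) B_i mod (gamma^p + 1)
        c = [x ^ y for x, y in zip(c, partial)]
        a = rotate(a, d)                            # A^(i+1)
    return c


a = [0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1]   # p = 11
b = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
print(lsd_first_multiply(a, b, 4) == reduce_mod(poly_mul(a, b), 11))  # True
```

The final comparison checks the digit-serial result against a direct product reduced modulo $\gamma^p + 1$.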
Applying Eq. (16), the proposed LSD-first digit-serial multiplication is stated as follows:

Algorithm 1.
Input: $A = (a_0, a_1, \cdots, a_{m-1})$ and $B = (b_0, b_1, \cdots, b_{m-1})$ are two normal basis elements in $GF(2^m)$
Output: $C = (c_0, c_1, \cdots, c_{m-1}) = AB$
1. initial step:
   1.1 $A = (a_0, a_1, \cdots, a_{p-1}) \leftarrow (a_0, a_1, \cdots, a_{m-1})$
   1.2 $B = (b_0, b_1, \cdots, b_{p-1}) \leftarrow (b_0, b_1, \cdots, b_{m-1})$
   1.3 $B = \sum_{i=0}^{n-1} B_i\gamma^{id}$, where $n = \lceil p/d \rceil$ and $B_i = \sum_{j=0}^{d-1} b_{id+j}\gamma^j$
   1.4 $C^{(0)} = 0$
2. multiplication step:
   2.1 for $i = 0$ to $n-1$ do
   2.2    $C^{(i+1)} = C^{(i)} + AB_i$
   2.3    $A = A\gamma^d \bmod (\gamma^p + 1)$
   2.4 endfor
3. basis conversion step:
   3.1 $(c_0, \cdots, c_0, \cdots, c_{m-1}, \cdots, c_{m-1}) \leftarrow (c_0, c_1, \cdots, c_{p-1})$
4. return $(c_0, c_1, \cdots, c_{m-1})$

The proposed GNB multiplication algorithm is split into $n$ partial multiplications. Figure 4 depicts the GNB multiplier based on the proposed partial multiplier in Fig. 3. Both NB elements $A$ and $B$ are initially transformed into the GNB representation given by Eq. (4), and are stored in registers $A$ and $B$, respectively. In round 0 (see Fig. 4), the systolic array in Fig. 3 is adopted to compute $C = A^{(0)}B_0$, and the result is stored in register $C$. In round 1, the element $A$ must be cyclically shifted to the right by $d$ digits. The result produced by the systolic array is added to the content of register $C$ from round 0. After $n$ iterations, the final register $C$ is translated from the GNB to the NB. The first round requires a latency of $d+n$ clock cycles. Each subsequent round requires a latency of $n+1$ clock cycles. Finally, the entire multiplication requires a latency of $d + n(n+1)$ clock cycles. The critical propagation delay of every cell is the total delay of one 2-input AND gate, one 2-input XOR gate and one 1-bit latch.

[Fig. 4. The proposed LSD-first digit-serial GNB multiplier over $GF(2^m)$: register $A$ updated as $A^{(i)} = A^{(i-1)}\gamma^d \bmod (\gamma^p + 1)$, digits $B_0, \cdots, B_{n-1}$, the $AB_i$ computation of Fig. 3, register $C$, and the basis conversion to the $m$-bit output.]

4.1 Complexity
Existing bit-parallel systolic NB multipliers address only type-I and type-II ONBs of $GF(2^m)$, as seen in Lee-Chiou [10], Kwon [9] and Lee et al. [7]. As is well known, a type-I ONB is built from an irreducible AOP, while a type-II ONB can be constructed from a palindromic representation of polynomials of length $2m$. However, an ONB of either type exists for only about 24.5% of the values $m < 1000$, as depicted in the IEEE P1363 Standard [11]. For ECDSA (Elliptic Curve Digital Signature Algorithm) applications, NIST [12] has recommended five binary fields, $GF(2^{163})$, $GF(2^{233})$, $GF(2^{283})$, $GF(2^{409})$ and $GF(2^{571})$, and of these only $GF(2^{233})$ has a type-2 normal basis. The ONB multipliers are implemented with unscalable architectures, and their latency is $m+1$ clock cycles. The space complexity of these architectures becomes proportional to $m^2$ as $m$ becomes large. Hence, for the NIST fields these architectures have limited application in cryptography. The proposed GNB multipliers do not have this problem.

Table 1 compares the circuits of the proposed multipliers with those of the other available multipliers. The table reveals that the proposed GNB multipliers can realize the hardware implementation in scalable and systolic architectures. Therefore, given a suitable digit size $d$, the proposed multipliers minimize the total execution time. The proposed multipliers for a type-2 GNB save about 40% of the latency as compared to Kwon's [9] and Lee-Chiou's [10] multipliers, and those for a type-1 GNB save about 60% of the latency as compared to the multipliers of Lee et al. [7]. Since the digit size $d$ is selected to minimize the total execution time, the proposed multipliers have low hardware complexity and low latency. Furthermore, the proposed architectures are suitable for all type-$t$ GNB multiplications.

Table 1. Comparison of the related bit-parallel systolic multipliers for normal basis of GF(2^m)

Multipliers                  Kwon [9]       Lee et al. [7]   Lee-Chiou [10]   Fig. 4
Basis                        Type-II ONB    AOP              Type-II ONB      Gaussian NB
Total complexity
  2-input XOR                2m^2+m         m^2+2m+1         2m^2+m           d^2+d+p
  2-input AND                2m^2+m         m^2+2m+1         2m^2             d^2
  1-bit latch                5m^2           3m^2+6m+1        7m^2             3.5d^2+5p+3nd
  1x2 SW                     0              0                0                p
Computation time per cell    T_A+2T_X       T_A+T_X          T_A+T_X          T_A+T_X
Latency                      m+1            m+1              m+1              d+n(n+1)

Note: d is the selected digit size; p = mt + 1 is a prime number; n = ⌈p/d⌉; T_X denotes the 2-input XOR gate delay; T_A denotes the 2-input AND gate delay.
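The dependence of the latency on the digit size can be explored with a few lines of Python. The sketch below evaluates the latency expression $d + n(n+1)$ from Table 1 with $n = \lceil p/d \rceil$ and searches exhaustively for the digit size that minimizes it. The function names `latency` and `best_digit_size` are hypothetical, and the search considers only the latency expression, not the area terms of Table 1.

```python
def latency(p, d):
    """Latency d + n(n + 1) in clock cycles, with n = ceil(p / d)."""
    n = -(-p // d)
    return d + n * (n + 1)


def best_digit_size(p):
    """Return (d, latency) for the digit size in 1..p with the smallest latency."""
    return min(((d, latency(p, d)) for d in range(1, p + 1)), key=lambda pair: pair[1])


print(latency(11, 4))        # 16
print(best_digit_size(11))   # (6, 12)
print(best_digit_size(467))  # (78, 120), for m = 233 and t = 2
```

For $m = 233$ and $t = 2$ (so $p = 467$), this search gives $d = 78$ with a latency of 120 clock cycles, against the $m + 1 = 234$ clock cycles listed for the unscalable designs in Table 1.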
5 Discussions
This work presents a new way to realize scalable and systolic GNB multipliers over $GF(2^m)$ based on Hankel matrix-vector multiplication. The gate count and time delay of the proposed multipliers are compared with those of three similar multipliers. This investigation shows that the time and space complexity of the proposed architectures for all type-$t$ GNBs depend on the selected digit size $d$. For type-1 GNBs, the proposed multipliers save about 60% latency as compared with existing multipliers. For type-2 GNBs, the proposed multipliers save about 40% latency as compared with existing multipliers. Finally, the proposed architectures combine digit-serial and digit-parallel finite field multipliers. Because the digit size $d$ is chosen to minimize the total execution time, the proposed multipliers have low hardware complexity and low latency. This offers a good trade-off between area and speed for implementing cryptographic schemes in embedded systems.
References

1. D.E.R. Denning, Cryptography and Data Security, Reading, MA: Addison-Wesley, 1983.
2. M.Y. Rhee, Cryptography and Secure Communications, McGraw-Hill, Singapore, 1994.
3. A. Menezes, P. V. Oorschot, S. Vanstone, Handbook of Applied Cryptography, CRC Press, Boca Raton, FL, 1997.
4. J.L. Massey and J.K. Omura, "Computational method and apparatus for finite field arithmetic," U.S. Patent Number 4,587,627, May 1986.
5. A. Reyhani-Masoleh and M. Anwar Hasan, "Low complexity word-level sequential normal basis multipliers," IEEE Trans. Computers, Vol. 54, No. 2, Feb. 2005.
6. C.Y. Lee and C.J. Chang, "Low-complexity linear array multiplier for normal basis of type-II," IEEE Intern. Conf. Multimedia and Expo, Vol. 3, pp. 1515-1518, June 2004.
7. C.Y. Lee, E.H. Lu, and J.Y. Lee, "Bit-Parallel Systolic Multipliers for GF(2^m) Fields Defined by All-One and Equally-Spaced Polynomials," IEEE Trans. Computers, Vol. 50, No. 5, pp. 385-393, May 2001.
8. M.A. Hasan, M.Z. Wang, and V.K. Bhargava, "A modified Massey-Omura parallel multiplier for a class of finite fields," IEEE Trans. Comput., Vol. 42, No. 10, pp. 1278-1280, Nov. 1993.
9. S. Kwon, "A low complexity and a low latency bit parallel systolic multiplier over GF(2^m) using an optimal normal basis of type II," Proc. of 16th IEEE Symp. Computer Arithmetic, pp. 196-202, June 2003.
10. C.Y. Lee and C.W. Chiou, "Design of low-complexity bit-parallel systolic Hankel multipliers to implement multiplication in normal and dual bases of GF(2^m)," IEICE Trans. Fund., Vol. E88-A, No. 11, pp. 3169-3179, Nov. 2005.
11. IEEE Standard 1363-2000, "IEEE Standard Specifications for Public-Key Cryptography," Jan. 2000.
12. Nat'l Inst. of Standards and Technology, Digital Signature Standard, FIPS Publication 186-2, Jan. 2000.
13. A. Reyhani-Masoleh, "Efficient algorithms and architectures for field multiplication using Gaussian normal bases," IEEE Trans. Computers, Vol. 55, No. 1, pp. 34-47, Jan. 2006.
14. C.Y. Lee, "Low-Latency Bit-Parallel Systolic Multiplier for Irreducible x^m + x^n + 1 with gcd(m, n) = 1," IEICE Trans. Fundamentals, Vol. E86-A, No. 11, pp. 2844-2852, Nov. 2003.
15. C.Y. Lee, J.S. Horng and I.C. Jou, "Low-complexity bit-parallel systolic Montgomery multipliers for special classes of GF(2^m)," IEEE Trans. Computers, Vol. 54, No. 9, pp. 1061-1070, Sep. 2005.
16. C.Y. Lee, "Systolic architectures for computing exponentiation and multiplication over GF(2^m) using polynomial ring basis," Journal of LungHwa University, Vol. 19, pp. 87-98, September 2005.
17. C.Y. Lee, "Low complexity bit-parallel systolic multiplier over GF(2^m) using irreducible trinomials," IEE Proc.-Comput. and Digit. Tech., Vol. 150, pp. 39-42, Jan. 2003.
18. C. Paar, P. Fleischmann, and P. Soria-Rodriguez, "Fast arithmetic for public-key algorithms in Galois fields with composite exponents," IEEE Trans. Comput., Vol. 48, No. 10, pp. 1025-1034, Oct. 1999.
19. C.H. Kim, C.P. Hong and S. Kwon, "A digit-serial multiplier for finite field GF(2^m)," IEEE Trans. VLSI, Vol. 13, No. 4, pp. 476-483, April 2005.