
Efficient Subquadratic Space Complexity Architectures for Parallel MPB Single- and Double-Multiplications for All Trinomials Using Toeplitz Matrix-Vector Product Decomposition

Chiou-Yng Lee, Senior Member, IEEE, and Pramod Kumar Meher, Senior Member, IEEE

Abstract—Subquadratic multiplication algorithms have received significant attention from cryptographic hardware researchers for the efficient implementation of public-key cryptosystems. In this paper, we derive a new shifted MPB (SMPB) representation based on the modified polynomial basis (MPB). We show that, by using MPB and SMPB, the proposed double basis multiplication can be transformed into a Toeplitz matrix-vector product (TMVP) structure. Furthermore, by employing this formulation of double basis multiplication, we show that three-operand multiplication over $GF(2^m)$ for all trinomials can be realized efficiently by a recursive TMVP (RTMVP) formulation. To perform the three-operand multiplication with the RTMVP formulation, we derive a new RTMVP decomposition scheme. The proposed single- and double-multiplications can, respectively, use TMVP and RTMVP decompositions to achieve subquadratic space complexity architectures. Theoretical analysis shows that the proposed subquadratic multipliers involve significantly less space complexity and less computation time than the existing subquadratic multipliers using TMVP and Karatsuba algorithms. Moreover, the proposed double-multiplication design can be used in several applications involving successive multiplications, such as exponentiation, inversion, and elliptic curve point multiplication.

Index Terms—Galois field, finite field, binary extension field, modified polynomial basis multiplication, elliptic curve cryptography, Toeplitz matrix-vector product

C.-Y. Lee is with the Department of Computer Information and Network Engineering, Lunghwa University of Science and Technology, Taoyuan 33306, Taiwan (e-mail: [email protected]). P. K. Meher is with the School of Computer Engineering, Nanyang Technological University, Singapore (e-mail: [email protected]).

I. INTRODUCTION

Finite field multiplication over $GF(2^m)$ is a basic field operation that is frequently used in elliptic curve cryptography (ECC) to perform point-addition and point-doubling operations on an elliptic curve. Multiplication over $GF(2^m)$ can also be used to perform division, exponentiation, and inversion. In finite field arithmetic, addition is the simplest operation, because the addition of any two bits is a logical XOR and there is no carry propagation. Division, on the other hand, can be implemented by a series of multiplications. The area and time complexity of the multiplications consequently account for most of the area and time required to implement ECC. It is therefore necessary to design finite field multipliers with greater efficiency in terms of area consumption and speed performance for ECC. The Weil and Tate pairings [1], which are based on elliptic curve arithmetic, involve extensive multiplications with operands in large finite fields. This generates further interest in hardware-efficient designs for high-performance multiplication over large finite fields.

A Toeplitz matrix is a matrix in which the elements of each descending diagonal from left to right are identical. It is encountered in many signal processing and image processing applications. It has been shown that the Toeplitz matrix-vector product (TMVP) approach can lead to efficient hardware architectures for multiplication in finite fields based on the normal basis (NB) [2], [3], [4], the shifted polynomial basis (SPB) [5], [6], and the dual basis (DB) (or modified polynomial basis, MPB) [7], [8], [9].

In binary extension fields $GF(2^m)$, multiplication is a two-step operation: naive polynomial multiplication followed by reduction using the irreducible polynomial $F(x)$ that generates the field. The naive polynomial multiplication involves $O(m^2)$ space complexity and $O(m)$ delay. When the chosen $F(x)$ is a low-weight irreducible polynomial, such as a trinomial or a pentanomial, the reduction is a simple operation. Therefore, naive polynomial multiplication is generally considered the major contributor to the hardware cost of $GF(2^m)$ multiplication. The complexity of naive polynomial multiplication is reduced from $O(m^2)$ to $O(m^\alpha)$, where $1 \le \alpha \le \log_2 3$, by employing divide-and-conquer techniques such as the Karatsuba-Ofman algorithm (KA) [10], [11], the Toom-Cook algorithm [12], and TMVP decomposition [13]. Recently, Lee et al. [14] have proposed a generalized $(a, 2)$-way KA decomposition with $a > 2$, which is suitable for implementing subquadratic digit-serial multiplication. Based on KA decomposition, a multi-partite digit-serial multiplier is introduced in [15].
Recently, it has been shown that three-operand multiplication can provide area-delay efficient architectures for applications involving successive multiplications, e.g., exponentiation, inversion, and elliptic curve point multiplication. Three-operand multiplication based on KA decomposition is suggested in [16], where it is also shown that M-ary exponentiation using three-operand multiplication can be performed in nearly √m/2 multiplication steps. Fast inversion based on Gaussian normal basis double-multiplication is suggested in [17]. Multi-operand multiplication has been found to be useful for hardware implementation of high-performance applications.


In this paper, we extend the TMVP decomposition of [13] to derive 2-way and 3-way recursive TMVP (RTMVP) decomposition schemes for the implementation of three-operand multiplication. Using the RTMVP decomposition approach, in the well-known MPB field representation, we explore a novel area-time efficient multiplication scheme in binary extension fields. We show that three-operand MPB multiplication can be efficiently realized by the proposed RTMVP formulation. The proposed RTMVP decomposition is found to be a suitable match for a three-operand modified polynomial basis (MPB) multiplier with subquadratic space complexity. By theoretical analysis, we show that the proposed subquadratic single- and double-multiplications have less computation time and less space complexity than the existing subquadratic multipliers.

The rest of this paper is organized as follows. Section II presents the preliminaries regarding 2-way and 3-way TMVP. In Section III, we present the proposed new RTMVP decompositions. In Section IV, we extend the well-known MPB representation to derive a new shifted MPB (SMPB) representation. In the same section, we present the architecture of the proposed three-operand multiplication based on MPB and SMPB to be used in the RTMVP formulation. In Section V, time and space complexities are analyzed. Finally, we conclude the paper in Section VI.

II. REVIEW OF TOEPLITZ MATRIX-VECTOR PRODUCT DECOMPOSITION

In linear algebra, an $n \times n$ Toeplitz matrix is a matrix $T = [t_{i,j}]_{0 \le i,j \le n-1}$ with the property $t_{i,j} = t_{i-1,j-1}$ for $1 \le i, j \le n-1$. The Toeplitz matrix-vector product is widely applied to compute multiplications in finite fields based on the dual basis (DB), the shifted polynomial basis, and the normal basis. A Toeplitz matrix has the following property:

Proposition 1. An $n \times n$ Toeplitz matrix $T$ is determined by the $2n-1$ entries in the first row and the first column.

We can use the vector $(t_0, t_1, \ldots, t_{2n-2})$ to define a Toeplitz matrix $T$. Using this vector representation, the addition $T_1 + T_2$ requires $2n-1$ XOR gates if $T_1$ and $T_2$ are two $n \times n$ Toeplitz matrices. In the following, we briefly review the 2-way and 3-way TMVP decompositions in [13].

A. 2-Way TMVP Decomposition

Let $V = (V_0, V_1)$ be an $n \times 1$ column vector and let the matrix vector $(T_0, T_1, T_2)$ define an $n \times n$ Toeplitz matrix $T$, where $V_0$ and $V_1$ are two $\frac{n}{2} \times 1$ column vectors, and $T_0$, $T_1$, and $T_2$ are three $\frac{n}{2} \times \frac{n}{2}$ Toeplitz matrices. A Toeplitz matrix-vector product $C = TV$ in this case is given by

$$C = \begin{bmatrix} C_0 \\ C_1 \end{bmatrix} = \begin{bmatrix} T_1 & T_2 \\ T_0 & T_1 \end{bmatrix} \begin{bmatrix} V_0 \\ V_1 \end{bmatrix}. \tag{1}$$

Using the divide-and-conquer method, (1) can be expressed as

$$\begin{bmatrix} C_0 \\ C_1 \end{bmatrix} = \begin{bmatrix} T_1(V_0 + V_1) + (T_2 + T_1)V_1 \\ T_1(V_0 + V_1) + (T_0 + T_1)V_0 \end{bmatrix}. \tag{2}$$

The original TMVP in (1) involves four sub-TMVPs, but the TMVP in (2) involves three sub-TMVPs. The TMVP in (2), therefore, provides better performance than the original TMVP in (1). Based on the decomposition scheme of (2), we can recursively generate four components of reduced-size matrices, namely the component matrix point (CMP), the component vector point (CVP), the point-wise multiply (PWM), and the reconstruction (R), as

$$\begin{aligned} CMP(T) &= (T_2 + T_1,\; T_1,\; T_0 + T_1) \\ CVP(V) &= (V_1,\; V_0 + V_1,\; V_0) \\ P &= PWM(CMP(T), CVP(V)) = (P_0, P_1, P_2) \\ C &= R(P) = (P_0 + P_1,\; P_1 + P_2) \end{aligned}$$

where $P_0 = (T_2 + T_1)V_1$, $P_1 = T_1(V_0 + V_1)$, and $P_2 = (T_0 + T_1)V_0$.

Figure 1. The subquadratic TMVP multiplier architecture [13]. (Inputs $V_0$, $V_1$, $T_0$, $T_1$, $T_2$ enter the EPG stage; the PWM stage produces $P_0$, $P_1$, $P_2$; the R stage outputs $C_0$ and $C_1$.)

As mentioned above, Fig. 1 shows the subquadratic TMVP architecture, which involves three stages: the evaluation point generation (EPG) stage, the point-wise multiplication (PWM) stage, and the reconstruction (R) stage. The EPG stage computes $CMP(T)$ and $CVP(V)$, the PWM stage computes $P = PWM(CMP(T), CVP(V)) = (P_0, P_1, P_2)$, and the R stage performs the operation $C = R(P) = (P_0 + P_1, P_1 + P_2)$. Let the symbols $S$ and $D$ denote "space" and "delay", respectively. For $n = 2^i$ ($i > 1$), let $S_\otimes(n)$ and $S_\oplus(n)$ denote the number of bit-multiplications and the number of bit-additions required for an $n \times n$ TMVP multiplication, and let $D_\otimes(n)$ and $D_\oplus(n)$ denote the number of AND gate delays and the number of XOR gate delays required for the TMVP multiplication. In [13], Fan and Hasan have shown that, for the 2-way TMVP decomposition, the CMP component involves $(\frac{3n}{2} - 1)$ XOR gates and $T_\oplus$ delay, the CVP component involves $\frac{n}{2}$ XOR gates and $T_\oplus$ delay, the PWM component involves $3S_\otimes(\frac{n}{2}) + 3S_\oplus(\frac{n}{2})$ space complexity and $D_\otimes(\frac{n}{2}) + D_\oplus(\frac{n}{2})$ delay, and the R component involves $n$ XOR gates and $T_\oplus$ delay. Since the CMP and CVP components operate in parallel, each recursion level adds $2T_\oplus$ to the XOR delay. Accordingly, we obtain the following recurrences on the complexities:

$$\begin{cases} S_\otimes(n) = 3S_\otimes(\frac{n}{2}), & S_\otimes(1) = 1 \\ S_\oplus(n) = 3S_\oplus(\frac{n}{2}) + 3n - 1, & S_\oplus(1) = 0 \\ D_\otimes(n) = D_\otimes(\frac{n}{2}), & D_\otimes(1) = T_\otimes \\ D_\oplus(n) = D_\oplus(\frac{n}{2}) + 2T_\oplus, & D_\oplus(1) = 0 \end{cases} \tag{3}$$
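The decomposition in (2) and the gate counts behind the recurrences in (3) can be sketched in software. The following Python sketch is only an illustration, not the hardware architecture of [13]: it assumes the index convention $T[i][j] = t_{j-i+n-1}$ for the defining vector, and the function names `tmvp2` and `naive_tmvp` are hypothetical.

```python
import random

def xor_vec(a, b):
    """Element-wise XOR of two equal-length bit lists."""
    return [x ^ y for x, y in zip(a, b)]

def naive_tmvp(t, v):
    """Schoolbook product C = T V over GF(2), assuming T[i][j] = t[j - i + n - 1]."""
    n = len(v)
    out = []
    for i in range(n):
        acc = 0
        for j in range(n):
            acc ^= t[j - i + n - 1] & v[j]
        out.append(acc)
    return out

def tmvp2(t, v, count):
    """Recursive 2-way TMVP following (2); len(v) must be a power of two.

    `count` is a dict with keys "and" and "xor" that accumulates gate counts.
    """
    n = len(v)
    if n == 1:
        count["and"] += 1
        return [t[0] & v[0]]
    h = n // 2
    v0, v1 = v[:h], v[h:]
    # CMP: T0+T1 and T1+T2 share entries, so 3n/2 - 1 XORs suffice.
    s = [t[k] ^ t[k + h] for k in range(3 * h - 1)]
    t01, t1, t12 = s[:n - 1], t[h:3 * h - 1], s[h:]
    # CVP: one vector addition of n/2 bits.
    v01 = xor_vec(v0, v1)
    count["xor"] += (3 * h - 1) + h
    # PWM: three half-size TMVPs P0, P1, P2.
    p0 = tmvp2(t12, v1, count)
    p1 = tmvp2(t1, v01, count)
    p2 = tmvp2(t01, v0, count)
    # R: C0 = P0 + P1 and C1 = P1 + P2, n XORs in total.
    count["xor"] += n
    return xor_vec(p0, p1) + xor_vec(p1, p2)

if __name__ == "__main__":
    random.seed(0)
    n = 8
    t = [random.randint(0, 1) for _ in range(2 * n - 1)]
    v = [random.randint(0, 1) for _ in range(n)]
    count = {"and": 0, "xor": 0}
    assert tmvp2(t, v, count) == naive_tmvp(t, v)
    print(count)
```

For $n = 8$ the counters follow the recurrences in (3) exactly, giving $S_\otimes(8) = 27$ AND operations and $S_\oplus(8) = 101$ XOR operations.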


Lemma 1. Assuming $b$ and $i$ to be two positive integers which satisfy $n = b^i$, the solution of the recurrence relation $R(n) = R(\frac{n}{b}) + d$ and $R(1) = e$ is $R(n) = d(\log_b n) + e$, where $d$ and $e$ are integer constants.

Lemma 2. Let $a$, $i$, and $b$ be three integers with $a \ne b$ and $n = b^i$. The solution of the recurrence relation $R(n) = aR(\frac{n}{b}) + dn + h$ and $R(1) = e$ is

$$R(n) = \left(e + \frac{db}{a-b} + \frac{h}{a-1}\right) n^{\log_b a} - \frac{db}{a-b}\, n - \frac{h}{a-1}.$$

We utilize Lemmas 1 and 2 to solve the recurrence equations in (3). The time and space complexities of the 2-way TMVP decomposition can be expressed as follows:

$$\begin{aligned} S_\otimes(n) &= n^{\log_2 3}, \\ S_\oplus(n) &= 5.5\, n^{\log_2 3} - 6n + 0.5, \\ D(n) &= T_\otimes + 2(\log_2 n) T_\oplus. \end{aligned}$$

B. 3-Way TMVP Decomposition

Let $V = (V_0, V_1, V_2)$ be an $n \times 1$ column vector and let the matrix vector $T = (T_0, T_1, T_2, T_3, T_4)$ define an $n \times n$ Toeplitz matrix, where each $V_i$ (for $i = 0, 1, 2$) is an $\frac{n}{3} \times 1$ column vector, and each $T_i$ (for $i = 0, 1, 2, 3, 4$) is an $\frac{n}{3} \times \frac{n}{3}$ Toeplitz matrix. The product $C = TV$ can be rewritten as

$$\begin{bmatrix} C_0 \\ C_1 \\ C_2 \end{bmatrix} = \begin{bmatrix} T_2 & T_3 & T_4 \\ T_1 & T_2 & T_3 \\ T_0 & T_1 & T_2 \end{bmatrix} \begin{bmatrix} V_0 \\ V_1 \\ V_2 \end{bmatrix} = \begin{bmatrix} P_0 + P_3 + P_5 \\ P_1 + P_4 + P_5 \\ P_2 + P_3 + P_4 \end{bmatrix} \tag{4}$$

where

$$\begin{aligned} P_0 &= T_{234} V_2, & P_3 &= T_2 V_{02}, \\ P_1 &= T_{123} V_1, & P_4 &= T_1 V_{01}, \\ P_2 &= T_{012} V_0, & P_5 &= T_3 V_{12}. \end{aligned}$$

$T_{xyz}$ in (4) denotes the sum $T_x + T_y + T_z$, and $V_{xy}$ denotes $V_x + V_y$. From (4), we can see that each recursion of the 3-way TMVP decomposition involves 6 sub-TMVPs. Accordingly, we obtain the complexities of the 3-way TMVP decomposition as follows:

$$\begin{aligned} S_\otimes(n) &= n^{\log_3 6}, \\ S_\oplus(n) &= \frac{24}{5}\, n^{\log_3 6} - 5n + \frac{1}{5}, \\ D(n) &= T_\otimes + 3(\log_3 n) T_\oplus. \end{aligned}$$

III. PROPOSED RECURSIVE TOEPLITZ MATRIX-VECTOR PRODUCT DECOMPOSITION

In this section, we extend the TMVP decomposition to derive two new recursive TMVP (RTMVP) decompositions for computing three-operand multiplication. For convenience of presentation, we use the term '3-mult' to refer to the three-operand multiplication in the rest of the paper. We derive the proposed 2-way and 3-way RTMVP decompositions as follows.

A. 2-Way RTMVP Decomposition

Let $A$ and $B$ be two $n \times n$ Toeplitz matrices defined by $A = [A_0, A_1, A_2]$ and $B = [B_0, B_1, B_2]$, where the $A_i$ and $B_i$ are $\frac{n}{2} \times \frac{n}{2}$ Toeplitz matrices. Let $C = [C_0, C_1]$ be an $n \times 1$ column vector, where $C_0$ and $C_1$ are $\frac{n}{2} \times 1$ column vectors. Here we consider the 3-mult of the form $E = ABC$ as a recursive TMVP multiplication. To realize the 3-mult computation, let us define a two-step operation: $D = BC$ and $E = AD$. Based on the 2-way TMVP decomposition, the intermediate product $D = BC$ is obtained in the first step as

$$\begin{bmatrix} D_0 \\ D_1 \end{bmatrix} = \begin{bmatrix} B_1 C_{01} + B_{12} C_1 \\ B_1 C_{01} + B_{01} C_0 \end{bmatrix}. \tag{5}$$

From the TMVP computation in (5), we get the intermediate product $D$ as a vector. Next, we again use the 2-way TMVP decomposition to compute $E = AD$. The product $E$ can then be obtained as follows:

$$\begin{aligned} \begin{bmatrix} E_0 \\ E_1 \end{bmatrix} &= \begin{bmatrix} A_1 D_{01} + A_{12} D_1 \\ A_1 D_{01} + A_{01} D_0 \end{bmatrix} \\ &= \begin{bmatrix} A_1(B_{12} C_1 + B_{01} C_0) + A_{12}(B_1 C_{01} + B_{01} C_0) \\ A_1(B_{12} C_1 + B_{01} C_0) + A_{01}(B_1 C_{01} + B_{12} C_1) \end{bmatrix} \\ &= \begin{bmatrix} A_1 B_{12} C_1 + A_2 B_{01} C_0 + A_{12} B_1 C_{01} \\ A_0 B_{12} C_1 + A_1 B_{01} C_0 + A_{01} B_1 C_{01} \end{bmatrix}. \end{aligned} \tag{6}$$

Figure 2. The proposed 2-way RTMVP decomposition. (Inputs $C_0$, $C_1$, $B_0$, $B_1$, $B_2$, $A_0$, $A_1$, $A_2$ enter the EPG stage; in the PWM stage, multiplier-1 units produce $D_0$, $D_1$, $D_2$ and multiplier-2 units produce $P_0, \ldots, P_5$; the R stage outputs $E_0$ and $E_1$.)
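The identity in (6) can be checked numerically for one level of recursion. The sketch below is a minimal illustration under stated assumptions: matrices are stored densely as lists of bit rows, the Toeplitz index convention $T[i][j] = t_{j-i+n-1}$ is assumed, and the helper names (`toeplitz`, `blocks`, `rtmvp2_one_level`) are hypothetical. The six sub-products are evaluated directly rather than recursively.

```python
import random

def toeplitz(t, n):
    """Dense n x n Toeplitz matrix over GF(2), assuming T[i][j] = t[j - i + n - 1]."""
    return [[t[j - i + n - 1] for j in range(n)] for i in range(n)]

def matvec(M, v):
    """Matrix-vector product over GF(2)."""
    return [sum(a & b for a, b in zip(row, v)) & 1 for row in M]

def madd(X, Y):
    """Matrix addition over GF(2)."""
    return [[a ^ b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def vadd(u, v):
    """Vector addition over GF(2)."""
    return [a ^ b for a, b in zip(u, v)]

def blocks(M, h):
    """Split the Toeplitz matrix [[M1, M2], [M0, M1]] into (M0, M1, M2)."""
    M1 = [row[:h] for row in M[:h]]
    M2 = [row[h:] for row in M[:h]]
    M0 = [row[:h] for row in M[h:]]
    return M0, M1, M2

def rtmvp2_one_level(A, B, c):
    """One level of the 2-way RTMVP identity (6) for E = A B C."""
    h = len(c) // 2
    A0, A1, A2 = blocks(A, h)
    B0, B1, B2 = blocks(B, h)
    c0, c1 = c[:h], c[h:]
    # EPG stage: evaluation points of A, B and C.
    A01, A12 = madd(A0, A1), madd(A1, A2)
    B01, B12 = madd(B0, B1), madd(B1, B2)
    c01 = vadd(c0, c1)
    # PWM stage, multiplier-1: three intermediate vectors.
    d0, d1, d2 = matvec(B1, c01), matvec(B12, c1), matvec(B01, c0)
    # PWM stage, multiplier-2: six sub-products P0..P5.
    p0, p1, p2 = matvec(A1, d1), matvec(A2, d2), matvec(A12, d0)
    p3, p4, p5 = matvec(A0, d1), matvec(A1, d2), matvec(A01, d0)
    # R stage: E0 = P0 + P1 + P2 and E1 = P3 + P4 + P5.
    return vadd(vadd(p0, p1), p2) + vadd(vadd(p3, p4), p5)

if __name__ == "__main__":
    random.seed(0)
    n = 8
    A = toeplitz([random.randint(0, 1) for _ in range(2 * n - 1)], n)
    B = toeplitz([random.randint(0, 1) for _ in range(2 * n - 1)], n)
    c = [random.randint(0, 1) for _ in range(n)]
    assert rtmvp2_one_level(A, B, c) == matvec(A, matvec(B, c))
    print("identity (6) holds for this sample")
```

Each intermediate vector from multiplier-1 is reused by two multiplier-2 products, which is the sharing that the cascaded structure exploits.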

The 3-mult in (6) involves 6 three-operand sub-RTMVPs ($P_0 = A_1 B_{12} C_1$, $P_1 = A_2 B_{01} C_0$, $P_2 = A_{12} B_1 C_{01}$, $P_3 = A_0 B_{12} C_1$, $P_4 = A_1 B_{01} C_0$, and $P_5 = A_{01} B_1 C_{01}$). We can use this formula to iteratively decompose the 3-mult. Based on the proposed multiplication using three individual stages (the evaluation point generation (EPG) stage, the point-wise multiplication (PWM) stage, and the reconstruction (R) stage), Fig. 2 shows the implementation of a 3-mult using the 2-way RTMVP identity. Using a cascaded product approach, the six 3-mults in (6) can be rewritten as $P_0 = A_1 D_1$, $P_1 = A_2 D_2$, $P_2 = A_{12} D_0$, $P_3 = A_0 D_1$, $P_4 = A_1 D_2$, and $P_5 = A_{01} D_0$, where $D_0 = B_1 C_{01}$, $D_1 = B_{12} C_1$, and $D_2 = B_{01} C_0$. Accordingly, the six 3-mults in the PWM stage in Fig. 2 involve three 'multiplier-1' units and six 'multiplier-2' units. Multiplier-1 calculates $D_i$ for $i = 0, 1, 2$. Multiplier-2 calculates $P_i$ for $0 \le i \le 5$. We define the complexity parameters in Table I to calculate the complexity of the proposed algorithm.

Table I
SYMBOL PARAMETERS FOR ESTIMATING THE COMPLEXITY OF THE PROPOSED MULTIPLIER

Parameter              Description
$S_{a,\oplus}(n)$      the number of XOR gates for multiplier-a, a = 1 and 2
$S_{3,\oplus}(n)$      the number of XOR gates for 3-mult
$S_{a,\otimes}(n)$     the number of AND gates for multiplier-a, a = 1 and 2
$S_{3,\otimes}(n)$     the number of AND gates for 3-mult
$D_a(n)$               the delay complexity for multiplier-a, a = 1 and 2
$D_3(n)$               the delay complexity for 3-mult
$T_\oplus$             XOR gate delay
$T_\otimes$            AND gate delay
Note: $n$ denotes the bit length of the polynomial.

In the following, we estimate the complexities of the three individual stages in Fig. 2:

• EPG Stage. Based on the 6 three-operand sub-RTMVPs in this stage, we compute two evaluation matrix points (EMP) and one evaluation vector point (EVP).
  – EMP component: For the two matrices $A$ and $B$, we define $EMP(A) = (A_0, A_1, A_2, A_{01}, A_{12})$ and $EMP(B) = (B_1, B_{01}, B_{12})$, respectively, where each $A_i$ and $B_j$ is an $\frac{n}{2} \times \frac{n}{2}$ Toeplitz matrix. In [13], it is shown that $(\frac{3n}{2} - 1)$ XOR gates are required to generate $EMP(A)$, which involves two additions: $A_{01} = A_0 + A_1$ and $A_{12} = A_1 + A_2$. Accordingly, the computation of $EMP(A)$ involves a space complexity of $(\frac{3n}{2} - 1)$ XOR gates and a computation time of $T_\oplus$. The complexity of $EMP(B)$ is the same as that of $EMP(A)$. Therefore, $EMP(A)$ and $EMP(B)$ in total require $(3n - 2)$ XOR gates and involve $T_\oplus$ delay.
  – EVP component: The column vector $C$ is split into two parts, $C = [C_0, C_1]$, where $C_0$ and $C_1$ are two $\frac{n}{2} \times 1$ column vectors. From (6), the EVP component is generated by $EVP(C) = (C_0, C_1, C_{01})$. Therefore, $EVP(C)$ involves $\frac{n}{2}$ XOR gates and $T_\oplus$ delay.
• PWM Stage. The proposed splitting method (as given in (6)) involves three multiplier-1 and six multiplier-2 units in each step. Note that we use a cascaded structure to implement the 3-mult, and each multiplier-1 is associated with two multiplier-2 units. For example, multiplier-1 computes $D_0 = B_1 C_{01}$, and the result is used as the input operand of two multiplier-2 units for computing $P_2 = A_{12} D_0$ and $P_5 = A_{01} D_0$. Based on this, the complexity of computing a 3-mult can be expressed as $S_{3,\oplus}(n) = S_{1,\oplus}(n) + 2S_{2,\oplus}(n)$ XOR gates and $S_{3,\otimes}(n) = S_{1,\otimes}(n) + 2S_{2,\otimes}(n)$ AND gates. Therefore, the PWM unit involves $3S_{3,\otimes}(\frac{n}{2})$ AND gates and $3S_{3,\oplus}(\frac{n}{2})$ XOR gates. It involves $D_3(\frac{n}{2}) = D_1(\frac{n}{2}) + D_2(\frac{n}{2})$ delay.
• R Stage. Each subproduct $P_i$ with $0 \le i \le 5$ is an RTMVP. The width of each of these subproducts is therefore $0.5n$ bits. The R stage needs $2n$ XOR gates to evaluate $E = (E_0, E_1) = (P_0 + P_1 + P_2, P_3 + P_4 + P_5)$, and requires $2T_\oplus$ delay.

Based on the above analysis, we estimate the XOR and AND complexities involved in the different steps as follows:

$$S_{3,\oplus}(n) = 3S_{3,\oplus}(\tfrac{n}{2}) + 5.5n - 2, \quad S_{3,\oplus}(1) = 0 \tag{7}$$

$$S_{3,\otimes}(n) = 3S_{3,\otimes}(\tfrac{n}{2}), \quad S_{3,\otimes}(1) = 2 \tag{8}$$

For estimating the time complexity, the evaluation step requires $T_\oplus$ delay, the PWM step requires $D_3(\frac{n}{2})$ delay, and the reconstruction step requires $2T_\oplus$ delay. Consequently, we obtain the following recurrence for the time complexity:

$$D_3(n) = D_3(\tfrac{n}{2}) + 3T_\oplus, \quad D_3(1) = 2T_\otimes \tag{9}$$

Using Lemmas 1 and 2 to solve the recurrences (7), (8), and (9), we find the following complexities for the 3-mult using the 2-way RTMVP decomposition:

$$S_{3,\oplus}(n) = 10\, n^{\log_2 3} - 11n + 1 \tag{10}$$

$$S_{3,\otimes}(n) = 2\, n^{\log_2 3} \tag{11}$$

$$D_3(n) = 2T_\otimes + (3 \log_2 n) T_\oplus \tag{12}$$
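As a cross-check, the recurrences (7), (8), and (9) can be iterated numerically and compared with the closed forms (10), (11), and (12). The sketch below is an illustrative check, not part of the derivation; it uses exact rational arithmetic, takes $n = b^i$ so that $n^{\log_b a} = a^i$, and the function names `lemma2_closed_form` and `iterate` are hypothetical.

```python
from fractions import Fraction

def lemma2_closed_form(a, b, d, h, e, i):
    """Closed form from Lemma 2 for R(n) = a R(n/b) + d n + h, R(1) = e, at n = b**i."""
    a, b, d, h, e = (Fraction(x) for x in (a, b, d, h, e))
    n = b ** i
    n_pow = a ** i  # n^{log_b a} equals a^i when n = b^i
    lead = e + d * b / (a - b) + h / (a - 1)
    return lead * n_pow - d * b / (a - b) * n - h / (a - 1)

def iterate(a, b, d, h, e, i):
    """Evaluate R(b**i) by unrolling the recurrence R(n) = a R(n/b) + d n + h."""
    r, n = Fraction(e), 1
    for _ in range(i):
        n *= b
        r = a * r + Fraction(d) * n + h
    return r

if __name__ == "__main__":
    for i in range(0, 10):
        n = 2 ** i
        # XOR count, recurrence (7) against closed form (10).
        xor_gates = iterate(3, 2, Fraction(11, 2), -2, 0, i)
        assert xor_gates == lemma2_closed_form(3, 2, Fraction(11, 2), -2, 0, i)
        assert xor_gates == 10 * 3 ** i - 11 * n + 1
        # AND count, recurrence (8) against closed form (11).
        and_gates = iterate(3, 2, 0, 0, 2, i)
        assert and_gates == 2 * 3 ** i
        # XOR delay in units of T_xor, recurrence (9) against (12) via Lemma 1.
        xor_delay = iterate(1, 2, 0, 3, 0, i)
        assert xor_delay == 3 * i
    print("recurrences (7)-(9) match closed forms (10)-(12)")
```

The same helper reproduces the 2-way TMVP count $5.5\,n^{\log_2 3} - 6n + 0.5$ when called with $a = 3$, $b = 2$, $d = 3$, $h = -1$, $e = 0$.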

B. 3-Way RTMVP Decomposition

Let $A$ and $B$ be two $n \times n$ Toeplitz matrices defined by $A = [A_0, A_1, A_2, A_3, A_4]$ and $B = [B_0, B_1, B_2, B_3, B_4]$, where the $A_i$ and $B_i$ are $\frac{n}{3} \times \frac{n}{3}$ Toeplitz matrices. Let $C = [C_0, C_1, C_2]$ be an $n \times 1$ column vector, where $C_0$, $C_1$, and $C_2$ are $\frac{n}{3} \times 1$ column vectors. The 3-mult of the form $E = ABC$ uses a recursive TMVP decomposition with $D = BC$ and $E = AD$. Based on the 3-way TMVP decomposition, the intermediate product $D = BC$ is obtained as

$$\begin{bmatrix} D_0 \\ D_1 \\ D_2 \end{bmatrix} = \begin{bmatrix} B_{234} C_2 + B_2 C_{02} + B_3 C_{12} \\ B_{123} C_1 + B_1 C_{01} + B_3 C_{12} \\ B_{012} C_0 + B_2 C_{02} + B_1 C_{01} \end{bmatrix} \tag{13}$$

Again we use the 3-way TMVP decomposition to compute $E = AD$ as

$$E = \begin{bmatrix} E_0 \\ E_1 \\ E_2 \end{bmatrix} = \begin{bmatrix} A_{234} D_2 + A_2 D_{02} + A_3 D_{12} \\ A_{123} D_1 + A_1 D_{01} + A_3 D_{12} \\ A_{012} D_0 + A_2 D_{02} + A_1 D_{01} \end{bmatrix} \tag{14}$$

Substituting (13) into (14), the product $E$ can be obtained according to the following formulation:

$$E = \begin{bmatrix} A_4 K_0 + A_2 K_1 + A_3 K_2 + A_{34} K_3 + A_{24} K_4 + A_{23} K_5 \\ A_3 K_0 + A_1 K_1 + A_2 K_2 + A_{23} K_3 + A_{13} K_4 + A_{12} K_5 \\ A_2 K_0 + A_0 K_1 + A_1 K_2 + A_{12} K_3 + A_{02} K_4 + A_{01} K_5 \end{bmatrix} \tag{15}$$

where $K_0 = B_{012} C_0$, $K_1 = B_{234} C_2$, $K_2 = B_{123} C_1$, $K_3 = B_1 C_{01}$, $K_4 = B_2 C_{02}$, and $K_5 = B_3 C_{12}$.

In the following, we analyze the complexity of the three stages of the 3-mult according to (15). In the EPG stage, we find three components:

$$CMP(A) = (A_0, A_1, A_2, A_3, A_4, A_{01}, A_{02}, A_{12}, A_{13}, A_{23}, A_{24}, A_{34}), \tag{16}$$

$$CMP(B) = (B_1, B_2, B_3, B_{012}, B_{123}, B_{234}), \tag{17}$$

$$CVP(C) = (C_0, C_1, C_2, C_{01}, C_{02}, C_{12}). \tag{18}$$

Using Lemma 3 in the Appendix, $CMP(A)$ involves $(3n - 2)$ XOR gates. In [13], it is shown that the matrix additions ($B_{012}$, $B_{123}$, and $B_{234}$) in (17) involve $(2n - 1)$ XOR gates and $2T_\oplus$ delay, and the vector additions $C_{01}$, $C_{02}$, $C_{12}$ in (18) involve $n$ XOR gates and $T_\oplus$ delay. Thus, the EPG unit in total requires $(6n - 3)$ XOR gates and $2T_\oplus$ delay. The PWM stage needs 18 3-mults, and the R stage involves 15 vector additions. Out of the 18 3-mults involved in (15), each multiplier-1 is associated with three multiplier-2 units. Therefore, the 18 3-mults can be clustered into 6 groups in the PWM stage. For example, $A_0 K_1$, $A_1 K_1$, and $A_2 K_1$, where $K_1 = B_{234} C_2$, can be clustered to form a group of 3-mults. In this group, multiplier-1 is used to compute $K_1$, and three multiplier-2 units are used to compute $A_0 K_1$, $A_1 K_1$, and $A_2 K_1$. In this case, the complexity of a 3-mult can be defined as $S_{3,\oplus}(n) = S_{1,\oplus}(n) + 3S_{2,\oplus}(n)$ XOR gates and $S_{3,\otimes}(n) = S_{1,\otimes}(n) + 3S_{2,\otimes}(n)$ AND gates. Therefore, the PWM unit in each step involves $6S_{3,\oplus}(\frac{n}{3})$ XOR gates and $6S_{3,\otimes}(\frac{n}{3})$ AND gates, and requires $D_3(\frac{n}{3}) = D_1(\frac{n}{3}) + D_2(\frac{n}{3})$ gate delays for the computation. Since each 3-mult produces an $\frac{n}{3}$-bit product word, we need $5n$ XOR gates for the R stage, which involves $3T_\oplus$ delay. Therefore, we obtain the following expressions for the complexities:

$$\begin{cases} S_{3,\otimes}(n) = 6S_{3,\otimes}(\frac{n}{3}), & S_{3,\otimes}(1) = 2 \\ S_{3,\oplus}(n) = 6S_{3,\oplus}(\frac{n}{3}) + 11n - 3, & S_{3,\oplus}(1) = 0 \\ D_3(n) = D_3(\frac{n}{3}) + 5T_\oplus, & D_3(1) = 2T_\otimes \end{cases} \tag{19}$$

Using Lemmas 1 and 2 to solve the recurrences (19), the subquadratic 3-mult based on the 3-way RTMVP decomposition is found to have the following complexities:

$$\begin{cases} S_{3,\otimes}(n) = 2\, n^{\log_3 6} \\ S_{3,\oplus}(n) = 10.4\, n^{\log_3 6} - 11n + 0.6 \\ D_3(n) = 2T_\otimes + (5 \log_3 n) T_\oplus \end{cases} \tag{20}$$

IV. NEW SUBQUADRATIC MPB SINGLE- AND DOUBLE-MULTIPLICATIONS BASED ON TMVP AND RTMVP DECOMPOSITIONS

The Toeplitz matrix-vector product approach can be used for efficient realization of multiplications in binary extension fields for some special classes of basis representation, such as the shifted polynomial basis, the modified polynomial basis, and the dual basis. In general, multiplication using a double basis representation involves a TMVP multiplier and a basis conversion to and from the polynomial basis, and the cost of the basis conversion depends on the chosen irreducible polynomial $F(x)$. Among these, the MPB representation involves significantly less space complexity if $F(x)$ is a trinomial. For the sake of simplicity, we use trinomials to derive a new double basis representation and use it for an efficient implementation of three-operand multiplication.

A. Formulation of New Shifted MPB Representation

Let the field be constructed from an irreducible trinomial $F(x)$ of the form $x^m + x^n + 1$. The MPB in [18] is defined as follows.

Definition 1. Let $N = \{1, x, x^2, \ldots, x^{m-1}\}$ be the polynomial basis (PB) of $GF(2^m)$, where $x$ is a root of the irreducible trinomial $x^m + x^n + 1$. The ordered set $N' = \{\alpha_i \mid \alpha_i = x^i \text{ for } 0 \le i \le m-n-1 \text{ and } \alpha_i = x^i + x^{i-m+n} \text{ for } m-n \le i \le m-1\}$ is called the modified polynomial basis with respect to the set $N$.

In the context of this basis representation, the MPB is equivalent to the revised order sequence of the triangular basis in [19]. From the formulation of the MPB, we can find that the elements $\alpha_i = x^i + x^{i-m+n}$ for $m-n \le i \le m-1$ satisfy $\alpha_{i+1} = x\alpha_i \bmod F(x)$ for $i < m-1$, and $x\alpha_{m-1} = 1 \bmod F(x)$. For this reason, we can define a new shifted MPB as follows.

Definition 2. If a given set $\{\alpha_0, \alpha_1, \ldots, \alpha_{m-1}\}$ is the MPB for the trinomial $x^m + x^n + 1$, then we define the ordered set $N'' = \{\beta_i = x^i \alpha_{m-n} \bmod F(x) \mid 0 \le i \le m-1\}$ to be the shifted MPB (SMPB).

Example 1. Let the field $GF(2^5)$ be constructed from the irreducible trinomial $x^5 + x^2 + 1$. The set $\{1, x, x^2, x^3 + 1, x^4 + x\}$ is the MPB, and the set $\{x^3 + 1, x^4 + x, 1, x, x^2\}$ is the SMPB.

Here, we discuss the basis conversion between MPB and SMPB. Assume that a field $GF(2^m)$ is constructed from $F(x) = 1 + x^n + x^m$. Let $A = a_0\beta_0 + a_1\beta_1 + \cdots + a_{m-1}\beta_{m-1}$ and $\bar{A} = \bar{a}_0\alpha_0 + \bar{a}_1\alpha_1 + \cdots + \bar{a}_{m-1}\alpha_{m-1}$ be the SMPB and MPB representations, respectively, of an element in $GF(2^m)$. From Definitions 1 and 2, we obtain $\beta_i = x^i\alpha_{m-n} = \alpha_{m-n+i}$ for $0 \le i \le n-1$, with $\alpha_{m-n} = x^{m-n} + 1$. Moreover, from $\beta_n = x^n\alpha_{m-n} = x^m + x^n = 1 = \alpha_0$, we obtain $\beta_i = \alpha_{i-n}$ for $n \le i \le m-1$. Thus, by basis conversion from SMPB to MPB, we obtain

$$\begin{aligned} A &= a_0\beta_0 + a_1\beta_1 + \cdots + a_{m-1}\beta_{m-1} \\ &= a_n\alpha_0 + a_{n+1}\alpha_1 + \cdots + a_{m-1}\alpha_{m-n-1} + a_0\alpha_{m-n} + \cdots + a_{n-1}\alpha_{m-1} \\ &= \bar{a}_0\alpha_0 + \bar{a}_1\alpha_1 + \cdots + \bar{a}_{m-1}\alpha_{m-1} \end{aligned} \tag{21}$$

where

$$\bar{a}_i = \begin{cases} a_{n+i}, & \text{for } 0 \le i \le m-n-1 \\ a_{i-m+n}, & \text{for } m-n \le i \le m-1. \end{cases}$$

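It can be noted that the basis conversion from SMPB to MPB is a permutation of the coordinate coefficients of the element in $GF(2^m)$. Based on the basis conversion in (21), we can define the transformation matrix $U$ as

$$U = \begin{bmatrix} 0_{(m-n) \times n} & I_{(m-n) \times (m-n)} \\ I_{n \times n} & 0_{n \times (m-n)} \end{bmatrix} \tag{22}$$

We use the matrix $U$ to perform the basis conversion on the coordinate vectors as

$$\bar{A} = UA, \qquad A = U^{-1}\bar{A} \tag{23}$$

where $U^{-1}$ is the inverse of the matrix $U$, given by

$$U^{-1} = \begin{bmatrix} 0_{n \times (m-n)} & I_{n \times n} \\ I_{(m-n) \times (m-n)} & 0_{(m-n) \times n} \end{bmatrix}. \tag{24}$$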
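Definitions 1 and 2 and the permutation in (21) can be illustrated with a short script. The following Python sketch is an illustration under stated assumptions: polynomials over $GF(2)$ are stored as integers whose bit $k$ is the coefficient of $x^k$, and the helper names (`reduce_mod`, `mpb`, `smpb`, `smpb_to_mpb`) are hypothetical.

```python
def reduce_mod(p, m, n):
    """Reduce the GF(2) polynomial p (integer bitmask) modulo x^m + x^n + 1."""
    f = (1 << m) | (1 << n) | 1
    while p.bit_length() > m:
        p ^= f << (p.bit_length() - 1 - m)
    return p

def mpb(m, n):
    """MPB of Definition 1: x^i for i < m-n, and x^i + x^(i-m+n) otherwise."""
    return [(1 << i) if i < m - n else (1 << i) | (1 << (i - m + n))
            for i in range(m)]

def smpb(m, n):
    """SMPB of Definition 2: beta_i = x^i * alpha_(m-n) mod F(x)."""
    alpha = mpb(m, n)
    return [reduce_mod(alpha[m - n] << i, m, n) for i in range(m)]

def smpb_to_mpb(a, m, n):
    """Coordinate permutation of (21): MPB coordinate j is SMPB coordinate (j+n) mod m."""
    return [a[(j + n) % m] for j in range(m)]

def to_poly(coords, basis):
    """Polynomial represented by a coordinate vector in the given basis."""
    p = 0
    for c, b in zip(coords, basis):
        if c:
            p ^= b
    return p

if __name__ == "__main__":
    m, n = 5, 2
    alpha, beta = mpb(m, n), smpb(m, n)
    # Example 1: MPB {1, x, x^2, x^3+1, x^4+x}, SMPB {x^3+1, x^4+x, 1, x, x^2}.
    assert alpha == [0b00001, 0b00010, 0b00100, 0b01001, 0b10010]
    assert beta == [0b01001, 0b10010, 0b00001, 0b00010, 0b00100]
    # The SMPB is a cyclic shift of the MPB, so conversion costs no gates.
    assert all(beta[i] == alpha[(i + m - n) % m] for i in range(m))
    a = [1, 0, 1, 1, 0]  # arbitrary SMPB coordinates
    assert to_poly(a, beta) == to_poly(smpb_to_mpb(a, m, n), alpha)
    print("Example 1 and the permutation in (21) are reproduced")
```

Because the conversion only re-indexes coordinates, it corresponds to rewiring in hardware, which is why the matrix $U$ contributes no XOR gates.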

B. MPB Multiplication

In this subsection, we use the two bases MPB and SMPB to derive a new double basis multiplication. We have shown that the basis conversion in (21) does not involve any cost for hardware implementation, and, therefore, the double basis multiplication is equivalent to MPB multiplication.

Let $A = \sum_{i=0}^{m-1} a_i\alpha_i$ and $B = \sum_{i=0}^{m-1} b_i\beta_i$ be two polynomials in $GF(2^m)$ represented by MPB and SMPB, respectively. From Definition 2, we obtain $\beta_i = x^i\alpha_{m-n} = x^i(1 + x^{m-n})$, and the polynomial $B$ can be represented by

$$B = (1 + x^{m-n})(b_0 + b_1 x + \cdots + b_{m-1}x^{m-1}) \tag{25}$$

Assume that the polynomial $C$ is represented by MPB and is the product of $A$ and $B$. It can be rewritten as

$$C = (1 + x^{m-n})(b_0 A + b_1 Ax + \cdots + b_{m-1}Ax^{m-1}) \tag{26}$$

Based on the definition of MPB (Definition 1), we obtain the following algebraic relations:

$$\begin{cases} \alpha_i x = \alpha_{i+1}, & \text{for } 0 \le i \le m-2,\ i \ne m-n-1 \\ \alpha_i x = \alpha_{i+1} + \alpha_0, & \text{for } i = m-n-1 \\ \alpha_{m-1} x = \alpha_0 \end{cases} \tag{27}$$

Let us denote $A^{(i)} = xA^{(i-1)} \bmod F(x)$, where $A^{(0)} = A$. By using the algebra of (27), $A^{(i)} = \sum_{j=0}^{m-1} a_j^{(i)}\alpha_j$ can be computed as

$$A^{(i)} = (a_{m-1}^{(i-1)} + a_{m-n-1}^{(i-1)})\alpha_0 + \sum_{j=1}^{m-1} a_{j-1}^{(i-1)}\alpha_j \tag{28}$$

Thus, based on (26), the product $C$ in matrix-vector representation can be obtained according to the following formulation:

$$\begin{aligned} C &= (1 + x^{m-n})(b_0 A^{(0)} + b_1 A^{(1)} + \cdots + b_{m-1}A^{(m-1)}) \\ &= b_0 S^{(0)} + b_1 S^{(1)} + \cdots + b_{m-1}S^{(m-1)} \\ &= [S^{(0)}, S^{(1)}, \ldots, S^{(m-1)}] \cdot B = T_A \cdot B \end{aligned} \tag{29}$$

where $S^{(i)} = A^{(i)} + A^{(i+m-n)}$, so that $S^{(0)} = S = A + A^{(m-n)}$. The matrix $T_A$ in (29) can be transformed into a Toeplitz matrix. For clarity, we illustrate a double basis multiplication $C = AB \bmod F(x)$ in Example 2.

Example 2. Let a field $GF(2^5)$ be generated by $F(x) = 1 + x^2 + x^5$. Assume that $A = a_0\alpha_0 + a_1\alpha_1 + a_2\alpha_2 + a_3\alpha_3 + a_4\alpha_4$ and $B = b_0\beta_0 + b_1\beta_1 + b_2\beta_2 + b_3\beta_3 + b_4\beta_4$ are two elements in $GF(2^5)$ represented by MPB and SMPB, respectively. We can pre-compute $S = A + A^{(3)} = a_2\alpha_0 + a_3\alpha_1 + a_4\alpha_2 + a_{03}\alpha_3 + a_{14}\alpha_4$, where $a_{03} = a_0 + a_3$ and $a_{14} = a_1 + a_4$. Using (29), the product $C = AB \bmod F(x)$ can be computed as follows:

$$\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix} = \begin{bmatrix} a_2 & a_3 & a_4 & a_{03} & a_{14} \\ a_1 & a_2 & a_3 & a_4 & a_{03} \\ a_0 & a_1 & a_2 & a_3 & a_4 \\ a_{24} & a_0 & a_1 & a_2 & a_3 \\ a_{13} & a_{24} & a_0 & a_1 & a_2 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix}$$

Referring to the above, to obtain the matrix $T_A$, it is required to compute the four terms $a_{13}$, $a_{24}$, $a_{03}$, $a_{14}$, which involves 4 XOR gates and $T_\oplus$ delay.

Figure 3. (a) Traditional MPB multiplier [7]; (b) The proposed MPB multiplier based on the subquadratic TMVP multiplier architecture of Fig. 1.
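The double basis product $C = \sum_i b_i S^{(i)}$ in (29) can be checked against ordinary polynomial multiplication modulo $F(x)$. The sketch below is a bit-level illustration rather than the subquadratic architecture: the multiply-by-$x$ rule in `times_x` is derived here from Definition 1 (the coefficient of $\alpha_0$ receives $a_{m-1} + a_{m-n-1}$ and all other coefficients shift by one position), and all function names are hypothetical.

```python
import random

def reduce_mod(p, m, n):
    """Reduce the GF(2) polynomial p (integer bitmask) modulo x^m + x^n + 1."""
    f = (1 << m) | (1 << n) | 1
    while p.bit_length() > m:
        p ^= f << (p.bit_length() - 1 - m)
    return p

def clmul(p, q):
    """Carry-less (GF(2) polynomial) product of two integer bitmasks."""
    r = 0
    while q:
        if q & 1:
            r ^= p
        p <<= 1
        q >>= 1
    return r

def mpb(m, n):
    """MPB of Definition 1 as integer bitmasks."""
    return [(1 << i) if i < m - n else (1 << i) | (1 << (i - m + n))
            for i in range(m)]

def to_poly(coords, basis):
    """Polynomial represented by a coordinate vector in the given basis."""
    p = 0
    for c, b in zip(coords, basis):
        if c:
            p ^= b
    return p

def times_x(a, m, n):
    """MPB coordinates of x*A mod F(x), derived from Definition 1."""
    return [a[m - 1] ^ a[m - n - 1]] + a[:m - 1]

def double_basis_mul(a, b, m, n):
    """C = A*B with A and C in MPB coordinates and B in SMPB coordinates, as in (29)."""
    shifted = a
    for _ in range(m - n):
        shifted = times_x(shifted, m, n)
    s = [x ^ y for x, y in zip(a, shifted)]  # pre-computation: S = A + A^(m-n)
    c = [0] * m
    for bi in b:
        if bi:
            c = [x ^ y for x, y in zip(c, s)]  # accumulate b_i * S^(i)
        s = times_x(s, m, n)                   # next column S^(i+1)
    return c

if __name__ == "__main__":
    random.seed(0)
    m, n = 5, 2
    alpha = mpb(m, n)
    beta = [reduce_mod(alpha[m - n] << i, m, n) for i in range(m)]  # SMPB
    a = [random.randint(0, 1) for _ in range(m)]
    b = [random.randint(0, 1) for _ in range(m)]
    c = double_basis_mul(a, b, m, n)
    expected = reduce_mod(clmul(to_poly(a, alpha), to_poly(b, beta)), m, n)
    assert to_poly(c, alpha) == expected
    print("double basis product matches polynomial multiplication")
```

With $m = 5$ and $n = 2$, the pre-computed vector `s` has coordinates $(a_2, a_3, a_4, a_0 + a_3, a_1 + a_4)$, which agrees with $S$ in Example 2.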

Fig. 3b shows the proposed MPB multiplier, which involves a TMVP multiplier, a matrix transformation unit $U$, and a pre-computation circuit. Note that, in the proposed architecture, the matrix transformation unit $U$ that implements the basis conversion from MPB to SMPB does not involve any hardware cost, whereas the traditional MPB multiplier [7] requires $(m - n)$ XOR gates to perform the basis conversion from the MPB to the PB.

C. Three-Operand MPB Multiplier using RTMVP Scheme

In Section III, we derived the proposed three-operand multiplication algorithm using RTMVP decomposition. Based on MPB multiplication, we derive here a new three-operand multiplication architecture using the proposed RTMVP decomposition. Let $A$, $B$, $C$, and $E$ be four elements in $GF(2^m)$, where $A$, $B$, and $E$ are represented by MPB, $C$ is represented by SMPB, and $E = ABC \bmod F(x)$. For three-operand multiplication, the product $E$ requires two multipliers in cascade, where the first multiplier computes $D = BC \bmod F(x)$, and the second multiplier computes an MPB multiplication $E = AD \bmod F(x)$. Based on the structure of the double basis multiplier, we can use Lemma 4 (in the Appendix) to express the MPB multiplication as a Toeplitz matrix-vector product. Thus, the first multiplier directly uses the proposed double basis multiplication in (29) to produce the intermediate result $D = T_B C$, which can be used for the computation of $E = M_A U T_B C \bmod F(x)$ to perform the three-operand multiplication, where $U$ converts the result of the first multiplication from the MPB to the SMPB. Therefore, based on Lemma 4, the three-operand multiplication can be expressed alternatively as

$$E = T_A T_B C \bmod F(x) \tag{30}$$

As shown above, three-operand multiplication using the proposed double basis multiplier can be realized by an architecture based on the RTMVP approach. Therefore, the 3-mult using the proposed RTMVP decomposition can be used to derive a multiplier architecture with subquadratic space complexity. Fig. 4a shows the proposed three-operand multiplier, which involves one RTMVP multiplier and two pre-computation units. Each pre-computation unit involves $(m - 1)$ XOR gates. The proposed architecture for multi-operand multiplication can be used for hardware-efficient and time-efficient realization of applications involving successive multiplications, such as inversion, exponentiation, and pairing computation.

Figure 4. (a) The proposed three-operand MPB multiplier based on the RTMVP decomposition of Fig. 2; (b) The three-operand multiplier using the traditional MPB multiplier approach with the subquadratic TMVP multiplier of Fig. 1.

Typically, dual basis multiplication is realized efficiently by a Toeplitz matrix-vector product structure, as shown in Fig. 3a. The traditional MPB multiplier [7] involves a TMVP multiplier, a basis conversion circuit, and a pre-computation circuit. In [7], it is shown that the MPB is a dual basis formulation with respect to the PB, and the implementation of double basis multiplication requires basis conversion from MPB to PB if the input operands are in MPB representation. The proposed dual basis multiplication has the following advantages compared to the traditional dual basis multiplier [7].

1) The proposed double basis multiplication is based on the two bases MPB and SMPB, while traditional double basis multiplication is based on the two bases MPB and PB.
2) The basis conversion circuit (BCC) for the proposed method converts from MPB to SMPB, which does not involve any cost for hardware implementation. The BCC for [7] converts from MPB to PB. In Table II, we list the complexity of the BCC for all trinomials.
3) The pre-computation circuit (PCC) is used to compute all entries of the Toeplitz matrix. As shown in Example 2, the proposed PCC involves $(m - 1)$ XOR gates and $T_\oplus$ delay, while the complexity of the PCC for [7] depends on the selected trinomial, as shown in Table II.
4) The proposed RTMVP approach can be used for a multi-operand multiplication scheme, while the traditional MPB multiplication scheme (as shown in Fig. 4b) cannot be used for efficient multi-operand multiplication.

Table II
COMPLEXITIES OF BCC AND PCC UNITS FOR THE PROPOSED AND THE EXISTING MPB MULTIPLIERS

                                       Proposed                                  [7]
Polynomial                             BCC (#⊕)   PCC (#⊕)   Delay (#T⊕)         BCC (#⊕)   PCC (#⊕)   Delay (#T⊕)
x^m + x + 1                            -          m − 1      1                   m − 1      m − 1      1
x^m + x^n + 1, 1 < n < m/2             -          m − 1      1                   m − n      m − 1      2
x^m + x^(m/2) + 1                      -          m − 1      1                   0.5m       0.5m       1

V. COMPLEXITY ANALYSIS

In this section, we analyze the complexity of the MPB single- and double-multipliers based on TMVP and RTMVP decompositions, respectively, and compare the corresponding subquadratic multipliers.

A. Comparison of Subquadratic Single-Multipliers

We use two different bases, MPB and SMPB, to build an efficient single-multiplier (Fig. 3b) for the field based on MPB, while the traditional MPB multiplier [7] (Fig. 3a) is based on PB and MPB. Both multipliers have similar architectures, which involve three units: a TMVP multiplier, a basis conversion circuit (BCC), and a pre-computation circuit (PCC). Although the traditional MPB multiplier is derived for quadrinomials, its architecture is also suitable for trinomials. Here, we analyze the complexity of both multipliers for trinomials. We assume that the TMVP multiplier in both designs is realized by the non-recursive TMVP decomposition approach to obtain a subquadratic multiplier. Therefore, we compare only the BCC and PCC units of the two multipliers, as shown in Table II. The BCC unit of our proposed MPB multiplier does not involve any hardware cost, while the traditional MPB multiplier requires additional hardware for basis conversion. For the case of $x^m + x^n + 1$ with $1 < n < m/2$, the delay of our architecture is less than that of the traditional MPB multiplier by $T_\oplus$. Other subquadratic parallel multipliers are suggested in [20] for the Winograd algorithm and in [21] for the Karatsuba algorithm. Those algorithms build on plain polynomial multiplication to obtain subquadratic space complexity architectures. In [22], it is shown that the polynomial reduction stage for the trinomial $x^m + x^n + 1$ with $1 < n < m/2$ involves $(2m-2)$ XOR gates and requires a delay of $2T_\oplus$. We add the cost of the reduction stage of [22] to the Winograd and Karatsuba algorithms when evaluating the complexity of $GF(2^m)$ multiplication. Table III compares the complexities of the proposed and the existing subquadratic parallel multipliers. As shown in this table, our proposed MPB multiplier has less computation time and less space complexity than the existing subquadratic multipliers.

Table III
COMPARISON OF SELECTED SUBQUADRATIC PARALLEL MULTIPLIERS FOR TRINOMIALS $x^m + x^n + 1$ WITH $m = b^i$ AND $1 < n < m/2$

| Design | $b$ | #⊕ | #⊗ | Gate delays |
|---|---|---|---|---|
| Karatsuba [21] | 2 | $6m^{\log_2 3} - 6m$ | $m^{\log_2 3}$ | $T_\otimes + (2 + 3\log_2 m)T_\oplus$ |
| Karatsuba [21] | 3 | $\frac{16}{3}m^{\log_3 6} - \frac{16}{3}m$ | $m^{\log_3 6}$ | $T_\otimes + (2 + 4\log_3 m)T_\oplus$ |
| Winograd [20] | 2 | $6m^{\log_2 3} - 6m$ | $m^{\log_2 3}$ | $T_\otimes + (2 + 3\log_2 m)T_\oplus$ |
| Winograd [20] | 3 | $\frac{16}{3}m^{\log_3 6} - \frac{16}{3}m$ | $m^{\log_3 6}$ | $T_\otimes + (2 + 4\log_3 m)T_\oplus$ |
| TMVP [13] | 2 | $5.5m^{\log_2 3} - 4m - n - 0.5$ | $m^{\log_2 3}$ | $T_\otimes + (2 + 2\log_2 m)T_\oplus$ |
| TMVP [13] | 3 | $4.8m^{\log_3 6} - 3m - n - 0.8$ | $m^{\log_3 6}$ | $T_\otimes + (2 + 3\log_3 m)T_\oplus$ |
| Proposed (Fig. 3b) | 2 | $5.5m^{\log_2 3} - 4m - 0.5$ | $m^{\log_2 3}$ | $T_\otimes + (1 + 2\log_2 m)T_\oplus$ |
| Proposed (Fig. 3b) | 3 | $4.8m^{\log_3 6} - 4m - 0.4$ | $m^{\log_3 6}$ | $T_\otimes + (1 + 3\log_3 m)T_\oplus$ |

B. Comparison of Subquadratic Double-Multipliers

It is well known that the subquadratic algorithms are derived for two-operand multiplication. For fast three-operand multiplication, the subquadratic algorithms require two separate multipliers in cascade. The multiplier of Lee et al. [16] uses the recursive Karatsuba algorithm to obtain subquadratic three-operand multiplication. Our proposed double-multiplication scheme is based on RTMVP decomposition to derive a subquadratic space complexity architecture. Table IV lists the complexity of the proposed and the existing subquadratic multipliers [16], [20], [21], [13] for three-operand multiplication. As shown in this table, the TMVP-based multiplier [13] is better than the other existing subquadratic architectures. The proposed 2-way RTMVP-based multiplier has nearly 25% less delay and about 9% less space complexity than the existing 2-way TMVP-based multiplier [13]; in the leading terms of Table IV, this corresponds to $3\log_2 m$ versus $4\log_2 m$ XOR delays and $10m^{\log_2 3}$ versus $11m^{\log_2 3}$ XOR gates. The proposed 3-way RTMVP-based multiplier has less delay and slightly more space complexity than the existing 3-way TMVP-based multiplier [13].
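The 2-way ($b = 2$) rows of Table IV can be evaluated numerically to see how the quoted savings arise. The following is a minimal sketch, assuming $m = 2^i$ so that $m^{\log_2 3} = 3^i$ exactly; the function names `table_iv_two_way` and `saving`, and the sample values of `i` and `n`, are illustrative choices that only exercise the formulas and are not tied to a particular irreducible trinomial.

```python
def table_iv_two_way(i: int, n: int) -> dict:
    """Evaluate the b = 2 rows of Table IV for m = 2**i.

    Returns, per design, the XOR gate count, the AND gate count, and the
    number of XOR levels on the critical path (every design additionally
    has two AND levels, i.e. the 2*T_AND term of the table).
    """
    m = 2 ** i
    p = 3 ** i  # m**log2(3) is exactly 3**i when m = 2**i
    karatsuba_21 = {"xor": 12 * p - 12 * m, "and": 2 * p, "xor_levels": 4 + 6 * i}
    return {
        "Karatsuba [16]": {"xor": 13 * p - 16 * m + n + 6, "and": 2 * p,
                           "xor_levels": 3 + 5 * i},
        "Karatsuba [21]": karatsuba_21,
        "Winograd [20]": dict(karatsuba_21),  # same formulas as [21] in Table IV
        "TMVP [13]": {"xor": 11 * p - 8 * m - 2 * n - 1, "and": 2 * p,
                      "xor_levels": 4 + 4 * i},
        "Proposed RTMVP": {"xor": 10 * p - 9 * m - 1, "and": 2 * p,
                           "xor_levels": 1 + 3 * i},
    }


def saving(new: float, old: float) -> float:
    """Relative reduction of `new` with respect to `old` (0.25 means 25% less)."""
    return 1 - new / old


if __name__ == "__main__":
    rows = table_iv_two_way(i=10, n=3)  # illustrative values only
    best_existing, proposed = rows["TMVP [13]"], rows["Proposed RTMVP"]
    print("XOR saving:  ", saving(proposed["xor"], best_existing["xor"]))
    print("delay saving:", saving(proposed["xor_levels"], best_existing["xor_levels"]))
```

For $i = 10$ the XOR-level count drops from 44 to 31; the ratio $(1 + 3i)/(4 + 4i)$ approaches $3/4$ as $i$ grows, which is where the figure of nearly 25% comes from.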
We consider three fields defined by trinomials of degree $m$, namely $x^{1223} + x^{255} + 1$, $x^{861} + x^{14} + 1$, and $x^{191} + x^{9} + 1$, to compare the proposed and the existing subquadratic multipliers for three-operand multiplication. To reduce the complexity, a hybrid subquadratic multiplication approach is introduced in [10], which combines 2-way and 3-way decomposition schemes. Following this approach, we use a hybrid of 2-way and 3-way decompositions to construct the proposed and the existing subquadratic multipliers for field orders of the form $m = 2^i 3^j$, namely $m = 1223 \approx 2^4 3^4$, $m = 861 \approx 2^5 3^3$, and $m = 191 \approx 2^6 \cdot 3$, respectively, for synthesis purposes.
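The three padded sizes quoted above can be reproduced with a short search. The sketch below is an illustration, not the authors' procedure: it assumes that the operand length is rounded up to the smallest value of the form $2^i 3^j$ that is at least $m$, which matches all three listed cases. The helper name `smallest_hybrid_size` is hypothetical.

```python
def smallest_hybrid_size(m: int) -> tuple[int, int, int]:
    """Return (size, i, j) where size = 2**i * 3**j is the smallest such value >= m."""
    best = None
    pow3, j = 1, 0
    while True:
        # For this power of three, double until the product reaches m.
        size, i = pow3, 0
        while size < m:
            size *= 2
            i += 1
        if best is None or size < best[0]:
            best = (size, i, j)
        if pow3 >= m:  # larger powers of three can only give larger sizes
            break
        pow3 *= 3
        j += 1
    return best


if __name__ == "__main__":
    for degree in (1223, 861, 191):
        size, i, j = smallest_hybrid_size(degree)
        print(f"m = {degree}: padded to {size} = 2^{i} * 3^{j}")
```

Here $i$ counts the 2-way decomposition levels and $j$ the 3-way levels of the hybrid scheme.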

We have used NanGate's Library Creator and the 45-nm FreePDK Base Kit from North Carolina State University (NCSU) [23] to synthesize the proposed double-multiplier and the corresponding existing multipliers. From the synthesis results, we obtain the delay, the number of gates, and the total gate equivalent (GE), as shown in Table V. In this table, the total GE is estimated based on the used cell area, i.e., a NAND gate (area = 0.798 µm²), an AND gate (area = 1.064 µm² and delay = 0.02 ns), and an XOR gate (area = 1.596 µm² and delay = 0.04 ns). We find that our proposed double-multiplier has significantly less time complexity than the existing subquadratic multipliers and, for the selected field orders of the form $m = 2^i 3^j$ with $i > j$, also less space complexity than the best of them.

VI. CONCLUSIONS

In this paper, we have derived a new SMPB representation. The proposed MPB multiplication is constructed from MPB and SMPB, while the traditional MPB multiplier is constructed from PB and MPB. We have shown that the basis conversion from MPB to SMPB does not involve any hardware cost. We have proposed three-operand MPB multiplication using a new formulation of the RTMVP decomposition. We have used the traditional TMVP decomposition to propose two new RTMVP decompositions, one 2-way and one 3-way. The proposed MPB single- and double-multiplications use TMVP and RTMVP decompositions, respectively, to achieve subquadratic space complexity architectures. Note that, based on the proposed MPB multiplication, we can derive a subquadratic multiplier using RTMVP decomposition, while the traditional MPB multiplication cannot utilize RTMVP decomposition for computing three-operand multiplication. From the theoretical analysis, it is shown that the proposed subquadratic multipliers have less computation time and less space complexity than the existing subquadratic multipliers based on the Karatsuba algorithm, the Winograd algorithm, and the splitting TMVP algorithm.
Moreover, our proposed three-operand multiplier can be used in several applications such as exponentiation, inversion, and elliptic curve point multiplication.

REFERENCES

[1] U. Bose, A. K. Bhattacharya, and A. Das, “GPU-Based Implementation of 128-Bit Secure Eta Pairing over a Binary Field,” in AFRICACRYPT, 2013, pp. 26–42.
[2] C.-Y. Lee and C. W. Chiou, “Scalable Gaussian Normal Basis Multipliers over GF(2^m) Using Hankel Matrix-Vector Representation,” Signal Processing Systems, vol. 69, no. 2, pp. 197–211, 2012.
[3] C.-Y. Lee, Y.-H. Chen, C. W. Chiou, and J.-M. Lin, “Unified Parallel Systolic Multiplier Over GF(2^m),” J. Comput. Sci. Technol., vol. 22, no. 1, pp. 28–38, 2007. [Online]. Available: http://dx.doi.org/10.1007/s11390-007-9003-0
[4] C.-Y. Lee and C. W. Chiou, “Efficient Design of Low-Complexity Bit-Parallel Systolic Hankel Multipliers to Implement Multiplication in Normal and Dual Bases of GF(2^m),” IEICE Transactions, vol. 88-A, no. 11, pp. 3169–3179, 2005.
[5] C.-Y. Lee, “Low-Complexity Parallel Systolic Montgomery Multipliers over GF(2^m) Using Toeplitz Matrix-Vector Representation,” IEICE Transactions, vol. 91-A, no. 6, pp. 1470–1477, 2008.
[6] J. Han and H. Fan, “Toeplitz matrix-vector product based GF(2^n) shifted polynomial basis multipliers for all irreducible pentanomials,” IACR Cryptology ePrint Archive, vol. 2013, p. 427, 2013. [Online]. Available: http://eprint.iacr.org/2013/427
[7] M. A. Hasan, A. H. Namin, and C. Nègre, “Toeplitz matrix approach for binary field multiplication using quadrinomials,” IEEE Trans. VLSI Syst., vol. 20, no. 3, pp. 449–458, 2012.
[8] J.-S. Pan, R. Azarderakhsh, M. M. Kermani, C.-Y. Lee, W.-Y. Lee, C. W. Chiou, and J.-M. Lin, “Low-Latency Digit-Serial Systolic Double Basis Multiplier over GF(2^m) Using Subquadratic Toeplitz Matrix-Vector Product Approach,” IEEE Trans. Computers, vol. 63, no. 5, pp. 1169–1181, 2014.
[9] S.-M. Park and K.-Y. Chang, “Fast Bit-Parallel Shifted Polynomial Basis Multiplier Using Weakly Dual Basis Over GF(2^m),” IEEE Trans. VLSI Syst., vol. 19, no. 12, pp. 2317–2321, 2011.
[10] A. Weimerskirch and C. Paar, “Generalizations of the Karatsuba algorithm for efficient implementations,” University of Ruhr, Bochum, Germany, Tech. Rep., 2003.
[11] Y. Li, G. Chen, and J. Li, “Speedup of bit-parallel Karatsuba multiplier in GF(2^m) generated by trinomials,” Inf. Process. Lett., vol. 111, no. 8, pp. 390–394, 2011.
[12] M. Bodrato, “Towards optimal Toom-Cook multiplication for univariate and multivariate polynomials in characteristic 2 and 0,” in WAIFI, 2007, pp. 116–133.
[13] H. Fan and M. Hasan, “A new approach to subquadratic space complexity parallel multipliers for extended binary fields,” IEEE Trans. Computers, vol. 56, no. 2, pp. 224–233, 2007.
[14] C.-Y. Lee, C.-S. Yang, B. K. Meher, P. K. Meher, and J.-S. Pan, “Low-Complexity Digit-Serial and Scalable SPB/GPB Multipliers Over Large Binary Extension Fields Using (b, 2)-Way Karatsuba Decomposition,” IEEE Trans. on Circuits and Systems, vol. 61-I, no. 11, pp. 3115–3124, 2014. [Online]. Available: http://dx.doi.org/10.1109/TCSI.2014.2335031
[15] J.-S. Pan, C.-Y. Lee, and P. K. Meher, “Low-latency digit-serial and digit-parallel systolic multipliers for large binary extension fields,” IEEE Trans. on Circuits and Systems, vol. 60-I, no. 12, pp. 3195–3204, 2013.
[16] C.-Y. Lee, P. K. Meher, and C.-P. Chang, “Efficient M-ary Exponentiation over GF(2^m) Using Subquadratic KA-Based Three-Operand Montgomery Multiplier,” IEEE Trans. Circuits and Systems I: Regular Papers, vol. 61, no. 11, pp. 3125–3134, 2014.
[17] R. Azarderakhsh, K. Järvinen, and V. S. Dimitrov, “Fast Inversion in GF(2^m) with Normal Basis Using Hybrid-Double Multipliers,” IEEE Trans. Computers, vol. 63, no. 4, pp. 1041–1047, 2014.
[18] C. Nègre, “Quadrinomial modular arithmetic using modified polynomial basis,” in ITCC (1), 2005, pp. 550–555.
[19] R. Furness, S. Fenn, and M. Benaissa, “Multiplication using the triangular basis representation over GF(2^m),” in Global Telecommunications Conference, 1996 (GLOBECOM ’96), vol. 2, 1996, pp. 788–792.
[20] B. Sunar, “A generalized method for constructing subquadratic complexity GF(2^k) multipliers,” IEEE Trans. Computers, vol. 53, no. 9, pp. 1097–1105, 2004.
[21] C. Paar, “A new architecture for a parallel finite field multiplier with low complexity based on composite fields,” IEEE Trans. Computers, vol. 45, no. 7, pp. 856–861, 1996.
[22] B. Sunar and Çetin Kaya Koç, “Mastrovito multiplier for all trinomials,” IEEE Trans. Computers, vol. 48, no. 5, pp. 522–527, 1999.
[23] “Nangate standard cell library,” http://www.si2.org/openeda.si2.org/projects/nangatelib/.

Table IV
COMPARISON OF SELECTED SUBQUADRATIC PARALLEL MULTIPLIERS FOR COMPUTING THREE-OPERAND MULTIPLICATION FOR TRINOMIALS $x^m + x^n + 1$ WITH $m = b^i$ AND $1 < n < m/2$

| Design | $b$ | #⊕ | #⊗ | Gate delays |
|---|---|---|---|---|
| Karatsuba [16] | 2 | $13m^{\log_2 3} - 16m + n + 6$ | $2m^{\log_2 3}$ | $2T_\otimes + (3 + 5\log_2 m)T_\oplus$ |
| Karatsuba [16] | 3 | $21m^{\log_3 6} - 31m + n + 13$ | $2m^{\log_3 6}$ | $2T_\otimes + (3 + 7\log_3 m)T_\oplus$ |
| Karatsuba [21] | 2 | $12m^{\log_2 3} - 12m$ | $2m^{\log_2 3}$ | $2T_\otimes + (4 + 6\log_2 m)T_\oplus$ |
| Karatsuba [21] | 3 | $\frac{32}{3}m^{\log_3 6} - \frac{32}{3}m$ | $2m^{\log_3 6}$ | $2T_\otimes + (4 + 8\log_3 m)T_\oplus$ |
| Winograd [20] | 2 | $12m^{\log_2 3} - 12m$ | $2m^{\log_2 3}$ | $2T_\otimes + (4 + 6\log_2 m)T_\oplus$ |
| Winograd [20] | 3 | $\frac{32}{3}m^{\log_3 6} - \frac{32}{3}m$ | $2m^{\log_3 6}$ | $2T_\otimes + (4 + 8\log_3 m)T_\oplus$ |
| TMVP [13] | 2 | $11m^{\log_2 3} - 8m - 2n - 1$ | $2m^{\log_2 3}$ | $2T_\otimes + (4 + 4\log_2 m)T_\oplus$ |
| TMVP [13] | 3 | $9.6m^{\log_3 6} - 6m - 2n - 1.6$ | $2m^{\log_3 6}$ | $2T_\otimes + (4 + 6\log_3 m)T_\oplus$ |
| Proposed RTMVP | 2 | $10m^{\log_2 3} - 9m - 1$ | $2m^{\log_2 3}$ | $2T_\otimes + (1 + 3\log_2 m)T_\oplus$ |
| Proposed RTMVP | 3 | $10.4m^{\log_3 6} - 9m - 1.4$ | $2m^{\log_3 6}$ | $2T_\otimes + (1 + 5\log_3 m)T_\oplus$ |

Table V
COMPARISON OF VARIOUS SUBQUADRATIC DOUBLE-MULTIPLIERS OVER $GF(2^m)$ IN TERMS OF DELAY (ns), NUMBER OF GATES, AND TOTAL GE

| Field size | Multiplier | #XOR | #AND | Total GE | Delay (ns) |
|---|---|---|---|---|---|
| $m = 1223 \approx 2^4 3^4$ | Fig. 4a | 2,076,934 | 279,936 | 2,356,870 | 1.36 |
| $m = 1223 \approx 2^4 3^4$ | [13] | 2,019,344 | 279,936 | 2,299,280 | 1.8 |
| $m = 1223 \approx 2^4 3^4$ | [16] | 3,252,398 | 279,936 | 3,532,334 | 2.08 |
| $m = 1223 \approx 2^4 3^4$ | [21] | 2,384,850 | 279,936 | 2,664,786 | 2.44 |
| $m = 861 \approx 2^5 3^3$ | Fig. 4a | 1,034,366 | 139,968 | 1,174,334 | 1.28 |
| $m = 861 \approx 2^5 3^3$ | [13] | 1,037,204 | 139,968 | 1,177,172 | 2.28 |
| $m = 861 \approx 2^5 3^3$ | [16] | 1,515,454 | 139,968 | 1,655,422 | 2 |
| $m = 861 \approx 2^5 3^3$ | [21] | 1,312,388 | 139,968 | 1,452,356 | 2.36 |
| $m = 191 \approx 2^6 \cdot 3$ | Fig. 4a | 84,022 | 11,664 | 95,686 | 1 |
| $m = 191 \approx 2^6 \cdot 3$ | [13] | 88,076 | 11,664 | 99,740 | 1.4 |
| $m = 191 \approx 2^6 \cdot 3$ | [16] | 113,270 | 11,664 | 124,934 | 1.48 |
| $m = 191 \approx 2^6 \cdot 3$ | [21] | 102,684 | 11,664 | 114,348 | 1.64 |

GE denotes gate equivalent in terms of number of 2-input NAND gates.
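The gate totals reported in Table V can be cross-checked with a few lines of Python. The sketch below copies the gate counts from Table V and confirms that every Total GE entry equals #XOR + #AND. As a separate, clearly hypothetical calculation, it also forms a NAND-equivalent count from the cell areas quoted in the synthesis discussion (XOR 1.596, AND 1.064, NAND 0.798), under which one XOR gate counts as 2 and one AND gate as 4/3. The function name `area_weighted_ge` and this weighting are assumptions made for illustration and are not taken from the paper.

```python
from fractions import Fraction

# Cell areas quoted in the text, scaled by 1000 so they are exact integers.
AREA = {"nand": 798, "and": 1064, "xor": 1596}

# (field size m, design) -> (#XOR, #AND, Total GE), as listed in Table V.
TABLE_V = {
    (1223, "Fig. 4a"): (2_076_934, 279_936, 2_356_870),
    (1223, "[13]"): (2_019_344, 279_936, 2_299_280),
    (1223, "[16]"): (3_252_398, 279_936, 3_532_334),
    (1223, "[21]"): (2_384_850, 279_936, 2_664_786),
    (861, "Fig. 4a"): (1_034_366, 139_968, 1_174_334),
    (861, "[13]"): (1_037_204, 139_968, 1_177_172),
    (861, "[16]"): (1_515_454, 139_968, 1_655_422),
    (861, "[21]"): (1_312_388, 139_968, 1_452_356),
    (191, "Fig. 4a"): (84_022, 11_664, 95_686),
    (191, "[13]"): (88_076, 11_664, 99_740),
    (191, "[16]"): (113_270, 11_664, 124_934),
    (191, "[21]"): (102_684, 11_664, 114_348),
}


def total_ge_matches_gate_sum() -> bool:
    """True if every Total GE entry of Table V equals #XOR + #AND."""
    return all(n_xor + n_and == ge for n_xor, n_and, ge in TABLE_V.values())


def area_weighted_ge(n_xor: int, n_and: int) -> Fraction:
    """NAND-equivalent count from cell areas (illustrative assumption only)."""
    return Fraction(n_xor * AREA["xor"] + n_and * AREA["and"], AREA["nand"])


if __name__ == "__main__":
    print("Total GE == #XOR + #AND in every row:", total_ge_matches_gate_sum())
    for (m, design), (n_xor, n_and, _) in TABLE_V.items():
        print(m, design, float(area_weighted_ge(n_xor, n_and)))
```

Because all designs for a given field size share the same AND count, the ranking of the designs by space complexity is the same under either way of counting.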

VII. APPENDIX

Lemma 3. The matrix additions $A_{01} = A_0 + A_1$, $A_{02} = A_0 + A_2$, $A_{12} = A_1 + A_2$, $A_{13} = A_1 + A_3$, $A_{23} = A_2 + A_3$, $A_{24} = A_2 + A_4$, and $A_{34} = A_3 + A_4$ in (16) can be performed using $(3n - 2)$ XOR gates.

Proof: Let $n = 3m$. Based on Proposition 1, we can use the polynomial $A = \sum_{i=0}^{6m-2} a_i x^i$ to represent the corresponding $n \times n$ matrix $A$. The matrix $A$ is split into $m \times m$ Toeplitz blocks, such that $A = [A_0, A_1, A_2, A_3, A_4]$. Using the polynomial representation, the five block matrices $A_i$ (for $i = 0, 1, 2, 3, 4$) can be represented by $A_i = \sum_{j=0}^{2m-2} a_{im+j} x^j$. We therefore have

$$
\begin{aligned}
A_{ij} &= A_i + A_j \\
&= \sum_{k=0}^{m-2} (a_{im+k} + a_{jm+k}) x^k + (a_{im+m-1} + a_{jm+m-1}) x^{m-1} + \sum_{k=0}^{m-2} (a_{im+m+k} + a_{jm+m+k}) x^{m+k}.
\end{aligned}
\tag{31}
$$

Matrix additions of this form are involved in the computation of $A_{01}$, $A_{02}$, $A_{12}$, $A_{13}$, $A_{23}$, $A_{24}$, and $A_{34}$. Based on (31), we can find the following reused terms; in each case the same coefficient sums appear in the upper part ($x^{m+k}$) of the first matrix and in the lower part ($x^{k}$) of the second:

1. The term $\sum_{k=0}^{m-2} (a_{m+k} + a_{2m+k}) x^{m+k}$ in $A_{01}$ also appears in $A_{12}$.
2. The term $\sum_{k=0}^{m-2} (a_{m+k} + a_{3m+k}) x^{m+k}$ in $A_{02}$ also appears in $A_{13}$.
3. The term $\sum_{k=0}^{m-2} (a_{2m+k} + a_{3m+k}) x^{m+k}$ in $A_{12}$ also appears in $A_{23}$.
4. The term $\sum_{k=0}^{m-2} (a_{2m+k} + a_{4m+k}) x^{m+k}$ in $A_{13}$ also appears in $A_{24}$.
5. The term $\sum_{k=0}^{m-2} (a_{3m+k} + a_{4m+k}) x^{m+k}$ in $A_{23}$ also appears in $A_{34}$.

Each of the seven sums requires $2m - 1$ XOR gates, and each of the five reused terms saves $m - 1$ XOR gates. The seven matrices $A_{01}$, $A_{02}$, $A_{12}$, $A_{13}$, $A_{23}$, $A_{24}$, and $A_{34}$ can therefore be computed by $7(2m - 1) - 5(m - 1) = 9m - 2 = 3n - 2$ XOR gates.

Lemma 4. Assume that the MPB multiplication is based on the matrix-vector product approach and is expressed in the form $C = M_B A$. Then $C = T_B \bar{A}$, where $T_B = M_B U$, $M_B$ is generated by $B$, $U$ is the matrix for the transformation from the SMPB to the MPB, and $\bar{A}$ denotes the SMPB representation of $A$.

Proof: Let $A, B, C \in GF(2^m)$ be represented by MPB polynomials, where $C = M_B A$ and the matrix $M_B$ is generated by $B$. Using (23), we have $A = U\bar{A}$, where $U$ performs the basis conversion from SMPB into MPB. Thus, the MPB multiplication can be written as $C = M_B U \bar{A}$. Since double basis multiplication is a Toeplitz matrix-vector product given by $C = T_B \bar{A}$, we find $T_B = M_B U$.
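The XOR count of Lemma 3 can be checked by brute force. The sketch below is an illustrative verification rather than part of the proof: it models each coefficient addition $a_{im+k} + a_{jm+k}$ in (31) as an unordered pair of coefficient indices, collects the pairs needed by the seven block sums, and counts the distinct ones, so that a sum shared by two block matrices is counted once. The function name `lemma3_xor_count` is hypothetical.

```python
# Block pairs (i, j) of the seven sums A_ij = A_i + A_j used in Lemma 3.
BLOCK_PAIRS = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]


def lemma3_xor_count(m: int) -> int:
    """Count the distinct two-input XORs needed for all seven block sums.

    Block A_i holds the coefficients a_{i*m + k} for k = 0 .. 2m-2, so the
    sum A_i + A_j needs the XOR of a_{i*m + k} and a_{j*m + k} for each k.
    Identical index pairs across different sums are shared (counted once).
    """
    xors = set()
    for i, j in BLOCK_PAIRS:
        for k in range(2 * m - 1):
            xors.add(frozenset((i * m + k, j * m + k)))
    return len(xors)


if __name__ == "__main__":
    for m in (2, 3, 5, 8):
        n = 3 * m
        print(f"m = {m}: {lemma3_xor_count(m)} XOR gates, 3n - 2 = {3 * n - 2}")
```

Without sharing, the seven sums would need $7(2m - 1) = 14m - 7$ XOR gates; the set-based count removes exactly the $5(m - 1)$ duplicates listed in the proof.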

Chiou-Yng Lee received the Bachelor’s degree (1986) in Medical Engineering and the M.S. degree in Electronic Engineering (1992), both from the Chung Yuan Christian University, Taiwan, and the Ph.D. degree in Electrical Engineering from Chang Gung University, Taiwan, in 2001. From 1988 to 2005, he was a research associate with Chunghwa Telecommunication Laboratory in Taiwan. From 2001 to 2005, he taught courses related to finite fields at Ching Yun University. Currently, he is a Professor in the Department of Computer Information and Network Engineering at Lunghwa University of Science and Technology. His research interests include computations in finite fields, error-control coding, signal processing, and digital transmission systems. He is a senior member of the IEEE and the IEEE Computer Society.

Pramod Kumar Meher (SM’03) received the B.Sc. (Honours) and M.Sc. degrees in physics, and the Ph.D. degree in science from Sambalpur University, India, in 1976, 1978, and 1996, respectively. Currently, he is a Senior Research Scientist with Nanyang Technological University, Singapore. Previously, he was a Professor of Computer Applications with Utkal University, India, from 1997 to 2002, and a Reader in electronics with Berhampur University, India, from 1993 to 1997. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and video processing, communication, bio-informatics and intelligent computing. He has contributed more than 200 technical papers to various reputed journals and conference proceedings. Dr. Meher has served as a speaker for the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society during 2011 and 2012, as Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: EXPRESS BRIEFS during 2008 to 2011, and as Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS during 2012-2013. Currently, he is serving as Associate Editor for the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, Journal of Circuits, Systems, and Signal Processing, and Integration, the VLSI Journal. Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for 1999. He has received the 2013 Sydney R. Parker Best Paper Award in the area of Signal Processing, and the 2013 M.N.S. Swamy Award for being the best paper amongst all the papers published in the Journal of Circuits, Systems, and Signal Processing.
