Dual-Field Arithmetic Core for High-Performance

2 downloads 0 Views 425KB Size Report
The hardware scheme interleaves multiplication and modulo reduction. ... based on integer modular arithmetic, similar in its implementation to GF(N) operations.
Dual-Field Arithmetic Core for High-Performance Cryptographic Operations Roberta Piscitelli

Alessandro Cilardo, Nicola Mazzocca, Nicola Petra

European Space Agency, Noordwijk, The Netherlands

University Federico II, Naples, Italy

Email: [email protected]

Email:

{acilardo,mazzocca,nicpetra}@unina.it

Abstract—In this paper we propose a novel unied architec-

tiplication and modulo reduction in a parallel scheme. Fur-

ture for parallel Montgomery multiplication supporting both GF (N ) and GF (2k ) nite eld operations, which are critical

thermore, it relies on a modied Booth recoding technique

for several public key cryptographic algorithms, such as RivestShamir-Adleman (RSA) and Elliptic Curve (EC) cryptosystems.

for the multiplicand and a radix-4 scheme for the modulus, enabling reduced time delays for moderately large operand

The hardware scheme interleaves multiplication and modulo

widths. We also present a pipelined architecture based on the

reduction. Furthermore, it relies on a modied Booth recoding

parallel component previously introduced, enabling very low

scheme for the multiplicand and a radix-4 scheme for the

clock counts and high throughput levels for long operands

modulus, enabling reduced time delays even for moderately large operand widths. In addition, we present a pipelined architecture based on the parallel blocks previously introduced, enabling very

used in cryptographic applications. Experimental results, based on

0.18µm

CMOS technology, prove the effectiveness of

low clock counts and high throughput levels for long operands

the proposed techniques, and substantially outperfom the best

used in cryptographic applications. Experimental results, based

results previously presented in the technical literature.

0.18µm

on

CMOS technology, prove the effectiveness of the

proposed techniques, and substantially outperfom the best results previously presented in the technical literature.

The paper is structured as follows. Section II provides a brief description of the Montgomery multiplication algorithm and presents the state-of-the-art of unied eld arithmetic. Section III describes the proposed parallel arithmetic compo-

I. I NTRODUCTION

nent, and then a high-throughput pipelined unit based on the

The increasing centrality of networking and Internet ap-

previously introduced core. Section IV presents our results and

plications are stimulating an ever-growing demand for high-

compares them to the state-of-the-art. Section V concludes the

performance implementations of cryptographic algorithms and

paper with some nal remarks.

protocols. Two widely adopted public-key cryptosystems, in II. BACKGROUND

particular, are the Rivest-Shamir-Adleman (RSA) [10] and the Elliptic Curve (EC) [1] cryptosystems. While various standardization bodies recommend prime elds nary extension elds

GF (N ) or bi-

GF (2k ) for elliptic curve cryptosystems,

A slight variant of standard modular multiplication, the Montgomery multiplication performs the following operation:

A · B · R−1 mod N

RSA cryptography is essentially based on integer modular arithmetic, similar in its implementation to

GF (N ) operations.

Both types of nite elds have in common that the multiplication of elements implies a reduction operation, either modulo a prime degree

N or modulo an irreducible binary polynomial f (x) of m. The so-called Montgomery algorithm [8] has proved

to be the most effective implementation technique for modular multiplication [16], [2]. It is in fact based on a slightly different denition of the modular product, which enables particularly efcient implementations. Originally introduced for integer numbers (and thus for

GF (N )

arithmetic), Montgomery multiplication has been ef-

fectively extended to binary elds

k

GF (2 )

[7]. As a conse-

quence, during the last years several works have addressed the problem of implementing unied arithmetic blocks, suitable for computing operations in both elds using the same underlying hardware [13], [4], [17], [5], [12], [11]. In this paper, we propose a novel unied architecture for parallel Montgomery multiplication supporting both and

GF (2k )

GF (N )

operations. The hardware unit interleaves mul-

R = 2n

where

is a power of two and

n

is equal to, or slightly

N . The value R−1 is the inverse of R modulo N , i.e. a number such that R−1 R mod N = 1. In order for such a number to exist, it sufces that gcd(N, R) = 1. Since in both Elliptic Curve larger than the number of bits in the modulus

cryptography based on prime elds and in RSA cryptography

N

is always an odd number, this condition is always satised

when

R

is a power of two. Montgomery multiplication can be

performed with the following algorithm. Algorithm 1: Montgomery Modular Multiplication

˜ such that R·R−1 −N · N ˜ = 1, A, B < 2N N , R and N −1 Output: P ≡ A · B · R mod N , P < N ˜ mod R 1. Q = AB · N 2. P = AB+Q·N R 3. if P > N then P = P − N

Input:

Montgomery algorithm enables very efcient implementation, as

it

naturally

allows

deeply

implementations [16], [15], [2], [9].

pipelined

and

systolic

An interesting property enabled by Montgomery multiplication is the possibility to work on the numbers, dened as

A = A · R mod N .

N -residues

of

It can be easily

NN -residue form: ABR−1 mod N = AR · BR · R−1 mod N = (AB) · R mod N = AB . This also holds true for modular addition: (A+B) mod N = A + B . All

the Montgomery algorithm in

GF (2k )

similar to the integer/GF (N ) case. The

turns out to be very

GF (2k )

counterpart

of Algorithm 2 is presented, for example, in [12].

seen that the Montgomery product of two numbers in residue form is still in

A. Unied eld arithmetic: state-of-the-art

operations used in RSA and EC cryptography can be reduced

GF (2k )

to a composition of modular multiplications and additions, and

hardware solutions for computing both operations with the

can thus always handle operands in Montgomery form.

same processing unit. To enable this approach, Savas ¸ et al.

Notice that Algorithm 1 requires a magnitude comparison (Step 3) in order to ensure the result is actually less than the

Since the structure of Montgomery variants for

GF (N ) and

are similar, several authors have proposed unied

proposed in [13] a basic building block able to perform a one-digit addition in both

GF (N )

and

GF (2k )

elds. The

N . However, when many consecutive multiplications

basic component is the Dual Field Adder, i.e. an ordinary

are to be performed, we can allow intermediate results to be

full adder whose carry input can be disabled, so that the sum

[0, 2N [ with a proper choice for R. In fact, if we choose R > 4N , it can be easily seen that the reduction algorithm accepts multiplicands A, B < 2N (not necessarily AB+Q·N Q less than N ): < 2N ·2NR+Q·N < 4N R R + R · N < 2N , so the algorithm preserves the invariant that inputs and output are less than 2N . By avoiding magnitude comparison,

output is simply the XOR of the two input bits (i.e., their

the above version of Montgomery algorithm greatly improves

by Wolkerstorfer in [17]. The author introduces a low power

performance. We will thus refer to this version of the algorithm

design enabling short critical paths and high clock frequencies

in the following.

by using carry save adders. In [5], the authors present the

modulus

in the range

For scalable implementations, a natural choice is to partition operands into words, and process them separately. Precisely, we will refer in this paper to the so-called nely integrated operand scanning (FIOS) method [6], reported below.

w-bit words Pm−1 Pm−1 w i w i A (2 ) , B = A = i i=0 Bi (2 ) , N = i=0 P Pm−1 m−1 ˜ w i w w i ˜ ˜ i=0 Ni (2 ) , Ai , Bi , Ni , Ni < 2 , i=0 Ni (2 ) , N = mw with 0 ≤ A, B < 2N , mw ≥ 2 + dlog2 N e (i.e. 2 > 4N ) −n Output: P ≡ A · B · 2 mod N , with n = mw 1. P = 0 2. for j = 0 to m − 1 3. C=0 ˜0 mod 2w 4. Qj = (P0 + Bj A0 )N 5. for i = 0 to m − 1 6. S := Pi + Bj Ai + Qj Ni + C w 7. if (i 6= 0) then Pi−1 := S mod 2 8. C := S/2w 9. Pm−1 := C The w -bit words of operands A, B , and N are processed Algorithm 2: FIOS method for

Input:

in two nested loops. During the execution of the algorithm,

full precision. Montgomery modular reduction is computed by interleaving the addition of partial products and the modulus. A hardware solution for dual-eld arithmetic is also presented

design of a low-power multiply/accumulate (MAC) unit for efcient arithmetic in nite elds. The unit combines integer and polynomial arithmetic into a single functional unit supporting both

GF (N )

and

GF (2k )

elds. The emphasis is mostly put

on power consumption, as the authors show that a properly designed unied multiplier may consume signicantly less power if used in polynomial mode compared to integer mode. The fastest solution for unied eld multiplication was proposed by Satoh and Takano [12]. They present a scalable

GF (N ) GF (2k ) nite elds. The core of the processor is a parallel

elliptic curve cryptographic processor supporting both and

dual-eld multiplier, based on a Wallace tree scheme. The delay for a multiplication is logarithmic in the input-size, although it is different for the two types of elds. In fact, a sub-portion of the Wallace tree is used for obtaining a

GF (2k )

product, while the whole structure, including a fast carry propagation adder, is required for

GF (N )

operations. The

authors evaluate different parallelisms, developing the multiplier for word sizes of 8, 16, 32, or 64 bits, depending on the desired trade-off between area requirements and performance. One advantage of their approach is that it does not require any special full adder, such as the dual-eld adder, unlike works in [13], [5], [4] and others. This makes it possible to optimize the partial product addition network. Furthermore, at

a full precision register since it is shared among consecutive

elliptic curve is improved by converting on-the-y the integer

and

w+1

and

C

a bit-serial unied multiplier processing the multiplicand in

2w + 1 bit variable P needs

temporary variables

S

GF (2) sum). Based on a similar idea, Großsch¨adl [4] proposed

can be stored in a

bit register, respectively, while

“rows” (i.e.,

m

iterations of the inner loop with constant

j ).

Authors in [7] extended Montgomery multiplication to binary elds

GF (2k ),

a higher level, the performance of point multiplication over an multiplicand in a redundant form. Finally, a recent solution proposes a fast modular arithmetic-

by adopting polynomial representation

logic unit [11] that is scalable in the digit size and the eld

R−1 = 2−n with x−n . With polyGF (2k ) eld elements can be handled

size. The datapath is based on chains of carry save adders to

as binary polynomials and multiplication can be performed

This enables efcient execution of modular multiplication

and replacing the factor nomial representation,

modulo an irreducible polynomial

N (x).

So, the structure of

speed up arithmetic operations over large integers in

GF (N ).

and addition/subtraction. The unit is prototyped in FPGA

Multiplier bits

Integer mode (fsel =1)

Polynomial mode (fsel =0)

b2k+1

b2k

b2k−1

AAk

inv

trp

shl

AAk (x)

inv

trp

shl

0

0

0

0

0

0

0

0

0

0

0

0

0

1

0

1

0

0

0

0

0

0

1

0

0

1

0

1

0

1

1

0

0

1

0

1

0

1

0

0

1

0

1

0

0

1

1

0

1

1

1

0

0

0

1

1

1

0

1

1

0

0

1

1

1

1

1

0

0

0

0

A(x) A(x) t · A(x) t · A(x) t · A(x) ⊕ A(x) t · A(x) ⊕ A(x)

0

0

+A +A +2A −2A −A −A

0

1

1

TABLE I PARTIAL

PRODUCT GENERATION FOR INTEGERS AND BINARY POLYNOMIALS .

technology achieving interesting throughput levels, although

In order to avoid the expensive calculation of the multiple

3A

inferior to the ASIC-based work presented in [12].

in the

AA's

value set, the Booth recoding scheme is

normally used. It takes a bit stream III. A RCHITECTURE

is dened to be

A. Parallel Montgomery Multiplier In this section, we propose a novel unied architecture for parallel Montgomery multiplication supporting both and

GF (2k )

GF (N )

operations. Unlike previously proposed parallel

multipliers such as the solution in [12], the hardware unit merges multiplication and Montgomery reduction, allowing a word-level modular multiplication to be performed is a single

set of

for modulo reduction. The number of partial products to be added can be approximately halved by using Booth recodig and a radix-4 Montgomery scheme for

GF (2k )

multiplication, similar to

the work in [5]. Unlike [5], however, we extend the radix-4 approach to integer arithmetic, as well. The basic full-precision algorithm for a radix-4 digit-serial interleaved Montgomery multiplication is given below (see for example [14]): Algorithm 3: Radix-4 Montgomery Modular Multiplication

˜ such that 4k+1 · 4−(k+1) − N · N ˜ = 1, 2 < N < 4k , N A < 2N , B < 2N −(k+1) Output: P ≡ A · B · 4 mod N , P < 2N 1. P = 0 2. for i = 0 to k ˜0 mod 4 3. Qi = (P0 + Bi · A0 ) · N Input:

4.

P = (P + Bi · A + Qi · N )/4 It can be easily proved that, by using k + 1 iterations (i.e., −(k+1) by computing A · B · 4 mod N , A, B < 2N ) the nal value of P is still less than 2N . Notice that Qi only depends on the two least signicant bits of (P0 + Bi · A0 ) and N , so it can be computed by a simple circuit or a look-up table. Its value is dened in such a way as to make the least signicant digit of

(P + Bi A + Qi N )4

zero at each iteration.

In the following, we will use partial product

(Bi A)

AA

and

NN

to represent a

and a multiple of the modulus

respectively. In the case of radix-4, numbers. Thus, the value sets of

AA ∈ {0, A, 2A, 3A},

AA

(Qi N ),

Bi and Qi are 2-bit N N are as follows:

and

N N ∈ {0, N, 2N, 3N }

requiring two extra adders to compute

3A

and

3N

on the y.

into

as input

according to Table I, where

set are calculated with simple operations such as bit inversion and/or bit-shift. For

GF (2k )

operations, elements are handled

as binary polynomials. In this case, a pure radix-4 polynomial multiplication is adopted, and multiples

AAk (x) are calculated

as in Table I.

cycle. The proposed multiplier relies on a modied Booth recoding scheme for the multiplicand, and a radix-4 approach

AA

AA

(bi+1 , bi , bi−1 )

b−1 0. Booth recoding scheme transforms the value {−2A, −A, 0, +A, +2A}. All elements in the

and generates a recoded

In [14] authors adopt a method named Montgomery recoding scheme to change the possible values of

NNs

so that

they can all be obtained by simple shifts and inversions. Let

(sp1 , sp0 )2 be the 2 bits in the least signicant digit (LSD) of SP = P + AA and (n1 , n0 )2 be the 2 bits in the LSD of N . According to the input condition that N has to be odd, n0 is always 1. Then, Montgomery recoding scheme takes a bit stream (sp1 , sp0 , n1 )2 as input and generates a recoded N N value according to Table II, where Qi represents the recoded quotient digit for a N N at the i-th iteration. Montgomery recoding scheme transforms the value set of NN into

{−N, 0, +N, +2N }.

Due to the two recoding schemes,

it is easy to calculate all the elements in the value sets of and

NN.

AA

Notice that this technique changes the range of the

Montgomery algorithm output, which may now be negative. In polynomial mode the addition becomes a bitwise XOR;

NN (sp1 , sp0 )2 of

for this reason, we need to sum a different value of in order to reduce the least signicant digits

SP = P +AA. In order to perform the modular multiplication GF (2k ) with the same recoding scheme, we need an additional control signal, f sel (eld select) which allows us to switch between integer-mode (f sel = 1) and polynomial mode (f sel = 0). In Table II we show the unied Montgomery in

recoding scheme. The structure of the Montgomery Module Generator for the recoding scheme (MMG) is similar to the Partial Product Generator (PPG) described in [5]. The MMG is controlled by an Encoder circuit via the three signals (transport), and quotients

shl

inv

(invert),

trp

(shift left), which represent the recoded

Qi described above. These signals depend on the two (sp1 , sp0 )2 of SP and n1 , according to

least signicant bits

Field

Three

Recoded

Recoded

Select

input bits

quotient

NN

Qi

NN

f sel

sp1

sp0

n1

0

0

0

0

0

0

0

0

1

0

0

1

0

0

0

Control signals

inv

trp

shl

0

0

0

0

0

0

0

0

0

0

+1

+N

0

1

0

1

1

-1

-N

1

1

0

1

0

0

+2

+2N

0

0

1

0

1

0

1

+2

+2N

0

0

1

0

1

1

0

-1

-N

1

1

0

0

1

1

1

+1

+N

0

1

0

1

0

0

0

0

0

0

0

0

1

0

0

1

0

0

0

0

0

1

0

1

0

-1

-N

1

1

0

1

0

1

1

+1

+N

0

1

0

1

1

0

0

+2

+2N

0

1

1

1

1

0

1

+2

+2N

0

1

1

1

1

1

0

+1

+N

0

1

0

1

1

1

1

-1

-N

1

1

0

TABLE II U NIFIED M ONTGOMERY B OOTH R ECODING .

the following equations, as can be easily derived from Table II.

inv = sp0 · [f sel · (sp1 · n1 + sp1 · n1 )] + f sel · [sp1 ⊕ n1 ] shl = sp1 · sp0 trp = sp0 (1) In a PPG the Modied Booth recoding is performed in two steps: Encoding and Decoding/Selection. The encoding step can be seen as a digit-set conversion: the Encoder takes a radix-4 number

B

to a number

with digits in

˜ B

{0, 1, 2, 3} and converts it {−2, −1, 0, 1, 2}. Thus, if radix-

with a digit set of

4 multiplication is performed with the encoded multiplier

˜, B A

only the multiples

±2A

and

±A

Fig. 1.

use of two additional CSA stages and two additional vectors, namely

corresponding module is negative, i.e.

trp = 1

N N = ±N (no shl = 1, a 1-bit leftshift has to be performed, i.e. N N = 2N . Finally, N N = 0 is generated by trp = shl = 0. A parallel (w × w)-bit means

left-shift). On the other hand, when

multiplier for signed/unsigned modular multiplication contains

bw/2c + 1

PPGs and

bw/2c + 1

MMGs and the same number

of Encoder circuits for PPG and MMG, respectively. The partial products

AA

and the modules

NN

are

w+2

bits long

as they are represented in two's complement form. Negative multiples can thus be obtained by a bitwise complement of the corresponding positive multiple, with a

1

added at the least

signicant position of the partial product. B. Design Optimizations 1) Addition of Partial Products and Montgomery Modules: In a non-optimized Montgomery multiplier with Modied Booth recoding, the sum of the partial products and the Montgomery modules in the CSA stage is given by:

(SP (i) , CP (i) ) = AA(i) + (S (i) , C (i) ) (S (i+1) , C (i+1) ) = (SP (i) , CP (i) ) + N (i) (SP (i+1) , CP (i+1) ) = P (i+1) + (S (i+1) , C (i+1) ) ···

Stmp

and

Ctmp:

Sp(0) , Cp(0) = AA(0) + Cp(0) + N N (0) S (0) , C (0) = AA(1) + Sp(0) + Cp(0) Stmp(0) , Ctmp(0) = Sp(0) + Cp(0) + AA(1) + cn(0) S (0) , C (0) = Stmp(0) + Ctmp(0) + Cp(1) Sp(1) , Cp(1) = S (0) + C (0) + N N (1) ···

of the multiplicand

will be required, all of which are easily obtained by

signal

w = 6.

(SP (i) , CP (i) ) is given by the sum of the i-th recoded partial product AAi and the previous AAj , 0 ≤ j < i with the recoded modules N Nj , 0 ≤ j < i. We also need to consider the LSB ca and cn of the output carry vector: this involves the

shifting and/or complementation. The MMG works as follows: When inv = 1, the N N = −N . Control

Addition of Partial Products and Montgomery Modules with Booth

Recoding in an optimized scheme for

For this reason, we need :

• • •

Sp(i) , Cp(i) Stmp(i) , Ctmp(i) (i) (i) CSA stages to compute S , C optimizations adopted (see in Figure 1 for w = 6)

bw/2c + 1 bw/2c + 1 bw/2c + 1

The main

CSA stages to compute CSA stages to compute

consist in:

• delaying the sum of the LSB carry vector in order to avoid

ca and cn of the output bw/2c + 1 CSA stages;

• postponing the sum of the least signicant bits (i) (i) (i) {s1 , s0 , c0 } of S (i) and C (i) respectively, to save area and CSA stages; these operations imply a complication of the MMG unit, which needs more inputs

• reversing the order of the sum of

AA(i) , N N (i) , in order

to improve the critical path. This operation does not alter the computation of

N N i,

due to the encoding network

previously described, which tests the bits needed for the computation of the modulus before the addition of the

AAi+1 The constant

vector.

K

is used to handle the sign extension of partial

products. Instead of replicating the MSB of all partial products

and propagating it to the left margin of the multiplier, each

Figure 3 describes how the pipelined unit is used to pro-

MSB is inverted so as to have a positive weight and then

cess multi-word operands, showing how the portions of the

compensated by constant

K

once for all rows [3].

operands are scheduled in the pipeline. Numbers in parentheses indicate which of the eight blocks in the unit works on which portion of operands

C. Pipelined solution Previous works (e.g. Satoh and Takano's 64-bit multiplier [12]) suggest that it is normally convenient to adopt a large parallelism for achieving higher throughput levels. Our parallel architecture has a relatively complex selection network and a linear critical path, which results in large time delays as the word size increases. In order to achieve high throughput levels and propose a scalable scheme, we present in this section a pipelined architecture, using the parallel unit

and

words on the same row are to be processed consecutively. The throughput of the architecture is one multiplication word per clock cycle in this case.

(m+1)

Qi

(m+7)

for

the whole row, in addition to adding them. The four “larger”

N

and

A,

(7) (9)

(8) (m+2)

(m+4) (m+6)

(4) (6)

(6)

word

(2)

(2) (4)

(8)

(m+8)

multipliers on the left side of Figure 2, on the other hand, only need to sum the multiples of

(3) (5)

.... .... .... .... .... .... .... .... ....

recoded as the signals

A

(m+4)

(m+6)

.... .... ....

A0/N0

(1) (3)

word

(m+5)

presented in the previous section: in other words, they can

A1/N1

(m)

(m+2)

(m+3)

ers on the right are in fact four instances of the parallel unit

and the multiplicand

m is less than 8.

multiplication on large multi-word operands, when many

composed of eight pipelined modules. The “smaller” multipli-

N

at which clock cycle

This makes the unit particularly suitable for high-performance

Am-1/Nm-1

generate the recoded multiples (i.e.

B

end of each row only if the number of words

Figure 2 shows the internal structure of a 64x64-bit unit

of the modulus

A, N ,

for the top rightmost multiplier). The

unit has a latency of eight cycles, introducing a stall at the

as the basic building block.

shl, trp, inv )

1

(starting from cycle

(5) (7)

B0

(m+1)

(m+3)

(m+5)

B1

(m+7)

as determined

by right-multipliers. Since left-multipliers are much simpler in

Fig. 3.

their structure and have consequently a shorter delay, they are

and

designed so that they process longer data. As we use two's complement representation in the CS form, it is desirable to keep intermediate sums in CS form and convert the nal result back to binary form only at the end of the pipelined structure. We thus need to transfer CS numbers between subsequent multiplier modules having different output/input sizes. This required the use of a suitable technique [18] to sign-extend the CS pair and properly propagate sign information.

Bj

Ai , Ni , w-bit unit

Scheduling for a multi-word Montgomery multiplication. are

w-bit

words. Word sub-portions enter the pipelined

according to the schedule indicated in parentheses.

16×16 bit size, while left 48×16 bit size. The architecture is designed so

Right modules in Figure 2 have a modules have a

that the single blocks, especially the smaller right-multipliers, can be optimized to minimize the clock period. Notice that, with a slight modication to the scheme of Figure 2, the rst and the last row (possibly connected to an external bus) may be designed with a smaller height than the multipliers in the second and third row, so as to balance the delay of each stage in the pipeline. After the nal stage, a Dual-Field Carry-LookAhead adder (not shown in Figure 2 and Figure 1) converts the Carry/Sum pair back to non-redundant form.

Fig. 4.

Overall architecture of the dual-eld multiplication unit

The overall architecture of the dual-eld multiplication unit is shown in Figure 4. From the scheme in Figure 3 it is clear that at the beginning of each row we need to drive in the

A0 , N0 , and Bj , while the P are stored internally in a

unit three different words, namely Fig. 2.

Pipelined architecture of the arithmetic core. Apexes in parentheses

indicate the different portions into which a single inside the pipelined unit.

w-bit

word is partitioned

words of the intermediate result

dedicated memory. This is the only case when we need three concurrent accesses to the external memory. To mitigate this

problem and limit the number of external buses, we observe

choice, especially for the reduction in clock count. As a

N,

future work, we plan to study new techniques to further

w additional

reduce the delay of the parallel Montgomery unit, described

ip-ops and some selection logic, independent of the full

in Section III-A, thereby improving the clock period and the

that it is convenient to store the rst word of the modulus

N0

in an internal register. This trick only requires

N0

size of the operands and the modulus.

is stored before

throughput achievable by the pipelined unit.

starting a multiplication (or a sequence of multiplications

R EFERENCES

sharing the same modulus). As a consequence, at the beginning of each row in the multiplication pipeline we only need and

Bj ,

while for the subsequent words we need

Ai

and

A0 Ni

(Bj is constant through the row), which are driven into the

rithms Multiplier-Accumulators”, Electronics Letters, vol. 26 no. 17, pp. 1413-1415, Aug 1990.

The pipelined multiplier core of Figure 2 was described in

0.18µm standard cell

library technology by using Cadence Build Gates synthesis tool. Post-synthesis area requirements are estimated to be while the minimum clock period is

on Recongurable Hardware,” IEEE Transactions on Computers, vol. 50, [3] N. Burgess, “Removal Of Sign-Extension Circuitry From Booth's Algo-

IV. E XPERIMENTAL RESULTS AND C OMPARISONS

1316kµm2 ,

Cambridge University Press 1999. [2] T. Blum and C. Paar, “High-Radix Montgomery Modular Exponentiation no. 7, pp. 759-764, July 2001.

multiplication unit through the same pair of buses.

VHDL and then synthesized for a CM0S

[1] I.F. Blake, G. Seroussi, and N.P. Smart, Elliptic Curves in Cryptography,

12.2ns.

[4] J. Großsch¨ adl, “A bit-serial unied multiplier architecture for nite elds

GF (p)

and

GF (2n ),”

in Cryptographic Hardware and Embedded

Systems: Proceedings of CHES'01. Lecture Note in Computer Science, Springer-Verlag, vol. 2162, pp. 206-223, 2001. [5] J. Großsch¨ adl and G.-A. Kamendje, “Low Power Design of a Functional Unit for Arithmetic in Finite Fields

Although there are different related works presenting unied Montgomery multiplication (see Section II-A), we only compare our results with the multiplier introduced in [12], since it achieves the highest throughput among the various works available in the literature. Both their work and ours are synthesized as a CMOS ASIC, but the design in [12] relies

GF (p)

and

GF (2m )”,

in Informa-

tion Security Applications - WISA'03. Lecture Notes in Computer Science, Springer-Verlag, vol. 2908, pp. 227-243, 2003. [6] C ¸ .K. Koc ¸ , T. Acar, and B.S. Kaliski, Jr., “Analyzing and Comparing Montgomery Multiplication Algorithms,” IEEE Micro, vol. 16, no. 3, pp. 26-33, June 1996. [7] C ¸ .K. Koc ¸ , T. Acar, “Montgomery Multiplication

GF (2k )”,

Designs,

Codes and Cryptography, vol. 14, no. 1, pp. 57-69, Apr. 1998. [8] P.L. Montgomery, “Modular multiplication without trial division,” Math-

0.18µm

ematics of Computation, vol. 44, no. 170, pp. 519-521, Apr. 1985. ¨ [9] S. B. Ors, L. Batina, B. Preneel, and J. Vandewalle, “Hardware Imple-

target used in our design. When implemented in the same

mentation of a Montgomery Modular Multiplier in a Systolic Array,”

technology, our solution is thus likely to enable even better

in Proceedings of the International Parallel and Distributed Processing

on a

0.13µm

technology, more advanced than the

Symposium (IPDPS03), p. 184b, 2003.

improvements than emphasized in the following discussion.

[10] R. L. Rivest, A. Shamir, and L. Adleman, “A Method for obtaining

The table below reports some results referred to integer (i.e.

Digital Signatures and Public-Key Cryptosystems”, Communications of

GF (N )) modular multiplication for different operand lengths, choosing the eld sizes indicated by NIST standards for elliptic curve cryptography. Performance improvements are

ASIC

0.13µm

Proc. IEEE International Symposium on Circuits and Systems (ISCAS [12] A. Satoh and K. Takano, “A Scalable Dual-Field Elliptic Curve Crypto-

This work ASIC

Modular Arithmetic Logic Unit and its Hardware Implementation,” in 2006), pp. 787-790, 2006.

especially evident in terms of clock counts. Satoh and Takano [12]

the ACM, vol. 21, pp. 120-126, 1978. [11] K. Sakiyama, B. Preneel, and I. Verbauwhede, “A Fast Dual-Field

graphic Processor,” IEEE Transanctions on Computers, vol. 52, pp. 449-

0.18µm

460, 2003. [13] E. Savas ¸ , A. F. Tenca and C ¸ . K. Koc ¸ , “A Scalable and Unied Multiplier

clock period: 7.26 ns

clock period: 12.2 ns

GF (N )

clock

throughput

clock

throughput

Architecture for Finite Fields

eld size

count

[Mbit/s]

count

[Mbit/s]

Hardware and Embedded Systems: Proceedings of CHES'00. Lecture

192

45

587.7

27

582.9

Note in Computer Science, Springer-Verlag, vol. 1965, pp. 281-296, 2000.

224

66

467.5

36

510.0

[14] H.K. Son and S.G. Oh “Design and Implementation of Scalable Low-

256

66

534.3

36

582.9

Power Montgomery Multiplier”, in Proceedings of the IEEE International

284

91

429.9

45

517.3

521

231

310.7

90

474.5

GF (2k )

case, to a subportion of the Wallace tree in the multiplier.

GF (2k )

operations

would be worse in our case, while remaining superior for the more critical integer/GF (N ) arithmetic. In the case a dual frequency implementation is not possible, on the other hand, our multiplier has better performance also for

GF (2m ),”

Cryptographic

Conference on Computer Design (ICCD'04), pp. 524-531, Oct. 2004.

mode, since the output of their unit is connected, in this If a dual clock frequency were allowed,

and

[15] W.-C. Tsai, C.B. Shung, and S.-J.Wang, “Two systolic architectures

Authors in [12] emphasize that a higher frequency could be used if the unied multiplier were used only in

GF (p)

GF (2m ),

and

comparisons with the multiplier in [12] appear similar to those given in the above table for the integer/GF (N ) case. V. C ONCLUSION The approach presented in this paper, based on dual-eld parallel Montgomery multiplication, proves to be a promising

for modular multiplication,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, no. 1, pp. 103-107, Feb. 2000. [16] C. D. Walter, “Systolic Modular Multiplication,” IEEE Transactions on Computers, vol. 42, no.3, pp. 376-378, March 1993. [17] J. Wolkerstorfer, “Dual-eld arithmetic unit for Cryptographic

Hardware

and

Embedded

GF (p)

Systems:

and

GF (2m ),”

Proceedings

of

CHES'02. Lecture Note in Computer Science, Springer-Verlag, vol. 2523, pp. 500-514, 2002. [18] A. F. Tenca, and L. A. Tawalbeh, “Carry-Save Representation Is ShiftUnsafe: The Problem and Its Solution,” IEEE Transanctions on Computers,vol. 55, no.5, may 2006.