Dual-Field Arithmetic Core for High-Performance Cryptographic Operations Roberta Piscitelli
Alessandro Cilardo, Nicola Mazzocca, Nicola Petra
European Space Agency, Noordwijk, The Netherlands
University Federico II, Naples, Italy
Email:
[email protected]
Email:
{acilardo,mazzocca,nicpetra}@unina.it
AbstractIn this paper we propose a novel unied architec-
tiplication and modulo reduction in a parallel scheme. Fur-
ture for parallel Montgomery multiplication supporting both GF (N ) and GF (2k ) nite eld operations, which are critical
thermore, it relies on a modied Booth recoding technique
for several public key cryptographic algorithms, such as RivestShamir-Adleman (RSA) and Elliptic Curve (EC) cryptosystems.
for the multiplicand and a radix-4 scheme for the modulus, enabling reduced time delays for moderately large operand
The hardware scheme interleaves multiplication and modulo
widths. We also present a pipelined architecture based on the
reduction. Furthermore, it relies on a modied Booth recoding
parallel component previously introduced, enabling very low
scheme for the multiplicand and a radix-4 scheme for the
clock counts and high throughput levels for long operands
modulus, enabling reduced time delays even for moderately large operand widths. In addition, we present a pipelined architecture based on the parallel blocks previously introduced, enabling very
used in cryptographic applications. Experimental results, based on
0.18µm
CMOS technology, prove the effectiveness of
low clock counts and high throughput levels for long operands
the proposed techniques, and substantially outperfom the best
used in cryptographic applications. Experimental results, based
results previously presented in the technical literature.
0.18µm
on
CMOS technology, prove the effectiveness of the
proposed techniques, and substantially outperfom the best results previously presented in the technical literature.
The paper is structured as follows. Section II provides a brief description of the Montgomery multiplication algorithm and presents the state-of-the-art of unied eld arithmetic. Section III describes the proposed parallel arithmetic compo-
I. I NTRODUCTION
nent, and then a high-throughput pipelined unit based on the
The increasing centrality of networking and Internet ap-
previously introduced core. Section IV presents our results and
plications are stimulating an ever-growing demand for high-
compares them to the state-of-the-art. Section V concludes the
performance implementations of cryptographic algorithms and
paper with some nal remarks.
protocols. Two widely adopted public-key cryptosystems, in II. BACKGROUND
particular, are the Rivest-Shamir-Adleman (RSA) [10] and the Elliptic Curve (EC) [1] cryptosystems. While various standardization bodies recommend prime elds nary extension elds
GF (N ) or bi-
GF (2k ) for elliptic curve cryptosystems,
A slight variant of standard modular multiplication, the Montgomery multiplication performs the following operation:
A · B · R−1 mod N
RSA cryptography is essentially based on integer modular arithmetic, similar in its implementation to
GF (N ) operations.
Both types of nite elds have in common that the multiplication of elements implies a reduction operation, either modulo a prime degree
N or modulo an irreducible binary polynomial f (x) of m. The so-called Montgomery algorithm [8] has proved
to be the most effective implementation technique for modular multiplication [16], [2]. It is in fact based on a slightly different denition of the modular product, which enables particularly efcient implementations. Originally introduced for integer numbers (and thus for
GF (N )
arithmetic), Montgomery multiplication has been ef-
fectively extended to binary elds
k
GF (2 )
[7]. As a conse-
quence, during the last years several works have addressed the problem of implementing unied arithmetic blocks, suitable for computing operations in both elds using the same underlying hardware [13], [4], [17], [5], [12], [11]. In this paper, we propose a novel unied architecture for parallel Montgomery multiplication supporting both and
GF (2k )
GF (N )
operations. The hardware unit interleaves mul-
R = 2n
where
is a power of two and
n
is equal to, or slightly
N . The value R−1 is the inverse of R modulo N , i.e. a number such that R−1 R mod N = 1. In order for such a number to exist, it sufces that gcd(N, R) = 1. Since in both Elliptic Curve larger than the number of bits in the modulus
cryptography based on prime elds and in RSA cryptography
N
is always an odd number, this condition is always satised
when
R
is a power of two. Montgomery multiplication can be
performed with the following algorithm. Algorithm 1: Montgomery Modular Multiplication
˜ such that R·R−1 −N · N ˜ = 1, A, B < 2N N , R and N −1 Output: P ≡ A · B · R mod N , P < N ˜ mod R 1. Q = AB · N 2. P = AB+Q·N R 3. if P > N then P = P − N
Input:
Montgomery algorithm enables very efcient implementation, as
it
naturally
allows
deeply
implementations [16], [15], [2], [9].
pipelined
and
systolic
An interesting property enabled by Montgomery multiplication is the possibility to work on the numbers, dened as
A = A · R mod N .
N -residues
of
It can be easily
NN -residue form: ABR−1 mod N = AR · BR · R−1 mod N = (AB) · R mod N = AB . This also holds true for modular addition: (A+B) mod N = A + B . All
the Montgomery algorithm in
GF (2k )
similar to the integer/GF (N ) case. The
turns out to be very
GF (2k )
counterpart
of Algorithm 2 is presented, for example, in [12].
seen that the Montgomery product of two numbers in residue form is still in
A. Unied eld arithmetic: state-of-the-art
operations used in RSA and EC cryptography can be reduced
GF (2k )
to a composition of modular multiplications and additions, and
hardware solutions for computing both operations with the
can thus always handle operands in Montgomery form.
same processing unit. To enable this approach, Savas ¸ et al.
Notice that Algorithm 1 requires a magnitude comparison (Step 3) in order to ensure the result is actually less than the
Since the structure of Montgomery variants for
GF (N ) and
are similar, several authors have proposed unied
proposed in [13] a basic building block able to perform a one-digit addition in both
GF (N )
and
GF (2k )
elds. The
N . However, when many consecutive multiplications
basic component is the Dual Field Adder, i.e. an ordinary
are to be performed, we can allow intermediate results to be
full adder whose carry input can be disabled, so that the sum
[0, 2N [ with a proper choice for R. In fact, if we choose R > 4N , it can be easily seen that the reduction algorithm accepts multiplicands A, B < 2N (not necessarily AB+Q·N Q less than N ): < 2N ·2NR+Q·N < 4N R R + R · N < 2N , so the algorithm preserves the invariant that inputs and output are less than 2N . By avoiding magnitude comparison,
output is simply the XOR of the two input bits (i.e., their
the above version of Montgomery algorithm greatly improves
by Wolkerstorfer in [17]. The author introduces a low power
performance. We will thus refer to this version of the algorithm
design enabling short critical paths and high clock frequencies
in the following.
by using carry save adders. In [5], the authors present the
modulus
in the range
For scalable implementations, a natural choice is to partition operands into words, and process them separately. Precisely, we will refer in this paper to the so-called nely integrated operand scanning (FIOS) method [6], reported below.
w-bit words Pm−1 Pm−1 w i w i A (2 ) , B = A = i i=0 Bi (2 ) , N = i=0 P Pm−1 m−1 ˜ w i w w i ˜ ˜ i=0 Ni (2 ) , Ai , Bi , Ni , Ni < 2 , i=0 Ni (2 ) , N = mw with 0 ≤ A, B < 2N , mw ≥ 2 + dlog2 N e (i.e. 2 > 4N ) −n Output: P ≡ A · B · 2 mod N , with n = mw 1. P = 0 2. for j = 0 to m − 1 3. C=0 ˜0 mod 2w 4. Qj = (P0 + Bj A0 )N 5. for i = 0 to m − 1 6. S := Pi + Bj Ai + Qj Ni + C w 7. if (i 6= 0) then Pi−1 := S mod 2 8. C := S/2w 9. Pm−1 := C The w -bit words of operands A, B , and N are processed Algorithm 2: FIOS method for
Input:
in two nested loops. During the execution of the algorithm,
full precision. Montgomery modular reduction is computed by interleaving the addition of partial products and the modulus. A hardware solution for dual-eld arithmetic is also presented
design of a low-power multiply/accumulate (MAC) unit for efcient arithmetic in nite elds. The unit combines integer and polynomial arithmetic into a single functional unit supporting both
GF (N )
and
GF (2k )
elds. The emphasis is mostly put
on power consumption, as the authors show that a properly designed unied multiplier may consume signicantly less power if used in polynomial mode compared to integer mode. The fastest solution for unied eld multiplication was proposed by Satoh and Takano [12]. They present a scalable
GF (N ) GF (2k ) nite elds. The core of the processor is a parallel
elliptic curve cryptographic processor supporting both and
dual-eld multiplier, based on a Wallace tree scheme. The delay for a multiplication is logarithmic in the input-size, although it is different for the two types of elds. In fact, a sub-portion of the Wallace tree is used for obtaining a
GF (2k )
product, while the whole structure, including a fast carry propagation adder, is required for
GF (N )
operations. The
authors evaluate different parallelisms, developing the multiplier for word sizes of 8, 16, 32, or 64 bits, depending on the desired trade-off between area requirements and performance. One advantage of their approach is that it does not require any special full adder, such as the dual-eld adder, unlike works in [13], [5], [4] and others. This makes it possible to optimize the partial product addition network. Furthermore, at
a full precision register since it is shared among consecutive
elliptic curve is improved by converting on-the-y the integer
and
w+1
and
C
a bit-serial unied multiplier processing the multiplicand in
2w + 1 bit variable P needs
temporary variables
S
GF (2) sum). Based on a similar idea, Großsch¨adl [4] proposed
can be stored in a
bit register, respectively, while
rows (i.e.,
m
iterations of the inner loop with constant
j ).
Authors in [7] extended Montgomery multiplication to binary elds
GF (2k ),
a higher level, the performance of point multiplication over an multiplicand in a redundant form. Finally, a recent solution proposes a fast modular arithmetic-
by adopting polynomial representation
logic unit [11] that is scalable in the digit size and the eld
R−1 = 2−n with x−n . With polyGF (2k ) eld elements can be handled
size. The datapath is based on chains of carry save adders to
as binary polynomials and multiplication can be performed
This enables efcient execution of modular multiplication
and replacing the factor nomial representation,
modulo an irreducible polynomial
N (x).
So, the structure of
speed up arithmetic operations over large integers in
GF (N ).
and addition/subtraction. The unit is prototyped in FPGA
Multiplier bits
Integer mode (fsel =1)
Polynomial mode (fsel =0)
b2k+1
b2k
b2k−1
AAk
inv
trp
shl
AAk (x)
inv
trp
shl
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
1
0
0
0
0
0
0
1
0
0
1
0
1
0
1
1
0
0
1
0
1
0
1
0
0
1
0
1
0
0
1
1
0
1
1
1
0
0
0
1
1
1
0
1
1
0
0
1
1
1
1
1
0
0
0
0
A(x) A(x) t · A(x) t · A(x) t · A(x) ⊕ A(x) t · A(x) ⊕ A(x)
0
0
+A +A +2A −2A −A −A
0
1
1
TABLE I PARTIAL
PRODUCT GENERATION FOR INTEGERS AND BINARY POLYNOMIALS .
technology achieving interesting throughput levels, although
In order to avoid the expensive calculation of the multiple
3A
inferior to the ASIC-based work presented in [12].
in the
AA's
value set, the Booth recoding scheme is
normally used. It takes a bit stream III. A RCHITECTURE
is dened to be
A. Parallel Montgomery Multiplier In this section, we propose a novel unied architecture for parallel Montgomery multiplication supporting both and
GF (2k )
GF (N )
operations. Unlike previously proposed parallel
multipliers such as the solution in [12], the hardware unit merges multiplication and Montgomery reduction, allowing a word-level modular multiplication to be performed is a single
set of
for modulo reduction. The number of partial products to be added can be approximately halved by using Booth recodig and a radix-4 Montgomery scheme for
GF (2k )
multiplication, similar to
the work in [5]. Unlike [5], however, we extend the radix-4 approach to integer arithmetic, as well. The basic full-precision algorithm for a radix-4 digit-serial interleaved Montgomery multiplication is given below (see for example [14]): Algorithm 3: Radix-4 Montgomery Modular Multiplication
˜ such that 4k+1 · 4−(k+1) − N · N ˜ = 1, 2 < N < 4k , N A < 2N , B < 2N −(k+1) Output: P ≡ A · B · 4 mod N , P < 2N 1. P = 0 2. for i = 0 to k ˜0 mod 4 3. Qi = (P0 + Bi · A0 ) · N Input:
4.
P = (P + Bi · A + Qi · N )/4 It can be easily proved that, by using k + 1 iterations (i.e., −(k+1) by computing A · B · 4 mod N , A, B < 2N ) the nal value of P is still less than 2N . Notice that Qi only depends on the two least signicant bits of (P0 + Bi · A0 ) and N , so it can be computed by a simple circuit or a look-up table. Its value is dened in such a way as to make the least signicant digit of
(P + Bi A + Qi N )4
zero at each iteration.
In the following, we will use partial product
(Bi A)
AA
and
NN
to represent a
and a multiple of the modulus
respectively. In the case of radix-4, numbers. Thus, the value sets of
AA ∈ {0, A, 2A, 3A},
AA
(Qi N ),
Bi and Qi are 2-bit N N are as follows:
and
N N ∈ {0, N, 2N, 3N }
requiring two extra adders to compute
3A
and
3N
on the y.
into
as input
according to Table I, where
set are calculated with simple operations such as bit inversion and/or bit-shift. For
GF (2k )
operations, elements are handled
as binary polynomials. In this case, a pure radix-4 polynomial multiplication is adopted, and multiples
AAk (x) are calculated
as in Table I.
cycle. The proposed multiplier relies on a modied Booth recoding scheme for the multiplicand, and a radix-4 approach
AA
AA
(bi+1 , bi , bi−1 )
b−1 0. Booth recoding scheme transforms the value {−2A, −A, 0, +A, +2A}. All elements in the
and generates a recoded
In [14] authors adopt a method named Montgomery recoding scheme to change the possible values of
NNs
so that
they can all be obtained by simple shifts and inversions. Let
(sp1 , sp0 )2 be the 2 bits in the least signicant digit (LSD) of SP = P + AA and (n1 , n0 )2 be the 2 bits in the LSD of N . According to the input condition that N has to be odd, n0 is always 1. Then, Montgomery recoding scheme takes a bit stream (sp1 , sp0 , n1 )2 as input and generates a recoded N N value according to Table II, where Qi represents the recoded quotient digit for a N N at the i-th iteration. Montgomery recoding scheme transforms the value set of NN into
{−N, 0, +N, +2N }.
Due to the two recoding schemes,
it is easy to calculate all the elements in the value sets of and
NN.
AA
Notice that this technique changes the range of the
Montgomery algorithm output, which may now be negative. In polynomial mode the addition becomes a bitwise XOR;
NN (sp1 , sp0 )2 of
for this reason, we need to sum a different value of in order to reduce the least signicant digits
SP = P +AA. In order to perform the modular multiplication GF (2k ) with the same recoding scheme, we need an additional control signal, f sel (eld select) which allows us to switch between integer-mode (f sel = 1) and polynomial mode (f sel = 0). In Table II we show the unied Montgomery in
recoding scheme. The structure of the Montgomery Module Generator for the recoding scheme (MMG) is similar to the Partial Product Generator (PPG) described in [5]. The MMG is controlled by an Encoder circuit via the three signals (transport), and quotients
shl
inv
(invert),
trp
(shift left), which represent the recoded
Qi described above. These signals depend on the two (sp1 , sp0 )2 of SP and n1 , according to
least signicant bits
Field
Three
Recoded
Recoded
Select
input bits
quotient
NN
Qi
NN
f sel
sp1
sp0
n1
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
Control signals
inv
trp
shl
0
0
0
0
0
0
0
0
0
0
+1
+N
0
1
0
1
1
-1
-N
1
1
0
1
0
0
+2
+2N
0
0
1
0
1
0
1
+2
+2N
0
0
1
0
1
1
0
-1
-N
1
1
0
0
1
1
1
+1
+N
0
1
0
1
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
1
0
1
0
-1
-N
1
1
0
1
0
1
1
+1
+N
0
1
0
1
1
0
0
+2
+2N
0
1
1
1
1
0
1
+2
+2N
0
1
1
1
1
1
0
+1
+N
0
1
0
1
1
1
1
-1
-N
1
1
0
TABLE II U NIFIED M ONTGOMERY B OOTH R ECODING .
the following equations, as can be easily derived from Table II.
inv = sp0 · [f sel · (sp1 · n1 + sp1 · n1 )] + f sel · [sp1 ⊕ n1 ] shl = sp1 · sp0 trp = sp0 (1) In a PPG the Modied Booth recoding is performed in two steps: Encoding and Decoding/Selection. The encoding step can be seen as a digit-set conversion: the Encoder takes a radix-4 number
B
to a number
with digits in
˜ B
{0, 1, 2, 3} and converts it {−2, −1, 0, 1, 2}. Thus, if radix-
with a digit set of
4 multiplication is performed with the encoded multiplier
˜, B A
only the multiples
±2A
and
±A
Fig. 1.
use of two additional CSA stages and two additional vectors, namely
corresponding module is negative, i.e.
trp = 1
N N = ±N (no shl = 1, a 1-bit leftshift has to be performed, i.e. N N = 2N . Finally, N N = 0 is generated by trp = shl = 0. A parallel (w × w)-bit means
left-shift). On the other hand, when
multiplier for signed/unsigned modular multiplication contains
bw/2c + 1
PPGs and
bw/2c + 1
MMGs and the same number
of Encoder circuits for PPG and MMG, respectively. The partial products
AA
and the modules
NN
are
w+2
bits long
as they are represented in two's complement form. Negative multiples can thus be obtained by a bitwise complement of the corresponding positive multiple, with a
1
added at the least
signicant position of the partial product. B. Design Optimizations 1) Addition of Partial Products and Montgomery Modules: In a non-optimized Montgomery multiplier with Modied Booth recoding, the sum of the partial products and the Montgomery modules in the CSA stage is given by:
(SP (i) , CP (i) ) = AA(i) + (S (i) , C (i) ) (S (i+1) , C (i+1) ) = (SP (i) , CP (i) ) + N (i) (SP (i+1) , CP (i+1) ) = P (i+1) + (S (i+1) , C (i+1) ) ···
Stmp
and
Ctmp:
Sp(0) , Cp(0) = AA(0) + Cp(0) + N N (0) S (0) , C (0) = AA(1) + Sp(0) + Cp(0) Stmp(0) , Ctmp(0) = Sp(0) + Cp(0) + AA(1) + cn(0) S (0) , C (0) = Stmp(0) + Ctmp(0) + Cp(1) Sp(1) , Cp(1) = S (0) + C (0) + N N (1) ···
of the multiplicand
will be required, all of which are easily obtained by
signal
w = 6.
(SP (i) , CP (i) ) is given by the sum of the i-th recoded partial product AAi and the previous AAj , 0 ≤ j < i with the recoded modules N Nj , 0 ≤ j < i. We also need to consider the LSB ca and cn of the output carry vector: this involves the
shifting and/or complementation. The MMG works as follows: When inv = 1, the N N = −N . Control
Addition of Partial Products and Montgomery Modules with Booth
Recoding in an optimized scheme for
For this reason, we need :
• • •
Sp(i) , Cp(i) Stmp(i) , Ctmp(i) (i) (i) CSA stages to compute S , C optimizations adopted (see in Figure 1 for w = 6)
bw/2c + 1 bw/2c + 1 bw/2c + 1
The main
CSA stages to compute CSA stages to compute
consist in:
• delaying the sum of the LSB carry vector in order to avoid
ca and cn of the output bw/2c + 1 CSA stages;
• postponing the sum of the least signicant bits (i) (i) (i) {s1 , s0 , c0 } of S (i) and C (i) respectively, to save area and CSA stages; these operations imply a complication of the MMG unit, which needs more inputs
• reversing the order of the sum of
AA(i) , N N (i) , in order
to improve the critical path. This operation does not alter the computation of
N N i,
due to the encoding network
previously described, which tests the bits needed for the computation of the modulus before the addition of the
AAi+1 The constant
vector.
K
is used to handle the sign extension of partial
products. Instead of replicating the MSB of all partial products
and propagating it to the left margin of the multiplier, each
Figure 3 describes how the pipelined unit is used to pro-
MSB is inverted so as to have a positive weight and then
cess multi-word operands, showing how the portions of the
compensated by constant
K
once for all rows [3].
operands are scheduled in the pipeline. Numbers in parentheses indicate which of the eight blocks in the unit works on which portion of operands
C. Pipelined solution Previous works (e.g. Satoh and Takano's 64-bit multiplier [12]) suggest that it is normally convenient to adopt a large parallelism for achieving higher throughput levels. Our parallel architecture has a relatively complex selection network and a linear critical path, which results in large time delays as the word size increases. In order to achieve high throughput levels and propose a scalable scheme, we present in this section a pipelined architecture, using the parallel unit
and
words on the same row are to be processed consecutively. The throughput of the architecture is one multiplication word per clock cycle in this case.
(m+1)
Qi
(m+7)
for
the whole row, in addition to adding them. The four larger
N
and
A,
(7) (9)
(8) (m+2)
(m+4) (m+6)
(4) (6)
(6)
word
(2)
(2) (4)
(8)
(m+8)
multipliers on the left side of Figure 2, on the other hand, only need to sum the multiples of
(3) (5)
.... .... .... .... .... .... .... .... ....
recoded as the signals
A
(m+4)
(m+6)
.... .... ....
A0/N0
(1) (3)
word
(m+5)
presented in the previous section: in other words, they can
A1/N1
(m)
(m+2)
(m+3)
ers on the right are in fact four instances of the parallel unit
and the multiplicand
m is less than 8.
multiplication on large multi-word operands, when many
composed of eight pipelined modules. The smaller multipli-
N
at which clock cycle
This makes the unit particularly suitable for high-performance
Am-1/Nm-1
generate the recoded multiples (i.e.
B
end of each row only if the number of words
Figure 2 shows the internal structure of a 64x64-bit unit
of the modulus
A, N ,
for the top rightmost multiplier). The
unit has a latency of eight cycles, introducing a stall at the
as the basic building block.
shl, trp, inv )
1
(starting from cycle
(5) (7)
B0
(m+1)
(m+3)
(m+5)
B1
(m+7)
as determined
by right-multipliers. Since left-multipliers are much simpler in
Fig. 3.
their structure and have consequently a shorter delay, they are
and
designed so that they process longer data. As we use two's complement representation in the CS form, it is desirable to keep intermediate sums in CS form and convert the nal result back to binary form only at the end of the pipelined structure. We thus need to transfer CS numbers between subsequent multiplier modules having different output/input sizes. This required the use of a suitable technique [18] to sign-extend the CS pair and properly propagate sign information.
Bj
Ai , Ni , w-bit unit
Scheduling for a multi-word Montgomery multiplication. are
w-bit
words. Word sub-portions enter the pipelined
according to the schedule indicated in parentheses.
16×16 bit size, while left 48×16 bit size. The architecture is designed so
Right modules in Figure 2 have a modules have a
that the single blocks, especially the smaller right-multipliers, can be optimized to minimize the clock period. Notice that, with a slight modication to the scheme of Figure 2, the rst and the last row (possibly connected to an external bus) may be designed with a smaller height than the multipliers in the second and third row, so as to balance the delay of each stage in the pipeline. After the nal stage, a Dual-Field Carry-LookAhead adder (not shown in Figure 2 and Figure 1) converts the Carry/Sum pair back to non-redundant form.
Fig. 4.
Overall architecture of the dual-eld multiplication unit
The overall architecture of the dual-eld multiplication unit is shown in Figure 4. From the scheme in Figure 3 it is clear that at the beginning of each row we need to drive in the
A0 , N0 , and Bj , while the P are stored internally in a
unit three different words, namely Fig. 2.
Pipelined architecture of the arithmetic core. Apexes in parentheses
indicate the different portions into which a single inside the pipelined unit.
w-bit
word is partitioned
words of the intermediate result
dedicated memory. This is the only case when we need three concurrent accesses to the external memory. To mitigate this
problem and limit the number of external buses, we observe
choice, especially for the reduction in clock count. As a
N,
future work, we plan to study new techniques to further
w additional
reduce the delay of the parallel Montgomery unit, described
ip-ops and some selection logic, independent of the full
in Section III-A, thereby improving the clock period and the
that it is convenient to store the rst word of the modulus
N0
in an internal register. This trick only requires
N0
size of the operands and the modulus.
is stored before
throughput achievable by the pipelined unit.
starting a multiplication (or a sequence of multiplications
R EFERENCES
sharing the same modulus). As a consequence, at the beginning of each row in the multiplication pipeline we only need and
Bj ,
while for the subsequent words we need
Ai
and
A0 Ni
(Bj is constant through the row), which are driven into the
rithms Multiplier-Accumulators, Electronics Letters, vol. 26 no. 17, pp. 1413-1415, Aug 1990.
The pipelined multiplier core of Figure 2 was described in
0.18µm standard cell
library technology by using Cadence Build Gates synthesis tool. Post-synthesis area requirements are estimated to be while the minimum clock period is
on Recongurable Hardware, IEEE Transactions on Computers, vol. 50, [3] N. Burgess, Removal Of Sign-Extension Circuitry From Booth's Algo-
IV. E XPERIMENTAL RESULTS AND C OMPARISONS
1316kµm2 ,
Cambridge University Press 1999. [2] T. Blum and C. Paar, High-Radix Montgomery Modular Exponentiation no. 7, pp. 759-764, July 2001.
multiplication unit through the same pair of buses.
VHDL and then synthesized for a CM0S
[1] I.F. Blake, G. Seroussi, and N.P. Smart, Elliptic Curves in Cryptography,
12.2ns.
[4] J. Großsch¨ adl, A bit-serial unied multiplier architecture for nite elds
GF (p)
and
GF (2n ),
in Cryptographic Hardware and Embedded
Systems: Proceedings of CHES'01. Lecture Note in Computer Science, Springer-Verlag, vol. 2162, pp. 206-223, 2001. [5] J. Großsch¨ adl and G.-A. Kamendje, Low Power Design of a Functional Unit for Arithmetic in Finite Fields
Although there are different related works presenting unied Montgomery multiplication (see Section II-A), we only compare our results with the multiplier introduced in [12], since it achieves the highest throughput among the various works available in the literature. Both their work and ours are synthesized as a CMOS ASIC, but the design in [12] relies
GF (p)
and
GF (2m ),
in Informa-
tion Security Applications - WISA'03. Lecture Notes in Computer Science, Springer-Verlag, vol. 2908, pp. 227-243, 2003. [6] C ¸ .K. Koc ¸ , T. Acar, and B.S. Kaliski, Jr., Analyzing and Comparing Montgomery Multiplication Algorithms, IEEE Micro, vol. 16, no. 3, pp. 26-33, June 1996. [7] C ¸ .K. Koc ¸ , T. Acar, Montgomery Multiplication
GF (2k ),
Designs,
Codes and Cryptography, vol. 14, no. 1, pp. 57-69, Apr. 1998. [8] P.L. Montgomery, Modular multiplication without trial division, Math-
0.18µm
ematics of Computation, vol. 44, no. 170, pp. 519-521, Apr. 1985. ¨ [9] S. B. Ors, L. Batina, B. Preneel, and J. Vandewalle, Hardware Imple-
target used in our design. When implemented in the same
mentation of a Montgomery Modular Multiplier in a Systolic Array,
technology, our solution is thus likely to enable even better
in Proceedings of the International Parallel and Distributed Processing
on a
0.13µm
technology, more advanced than the
Symposium (IPDPS03), p. 184b, 2003.
improvements than emphasized in the following discussion.
[10] R. L. Rivest, A. Shamir, and L. Adleman, A Method for obtaining
The table below reports some results referred to integer (i.e.
Digital Signatures and Public-Key Cryptosystems, Communications of
GF (N )) modular multiplication for different operand lengths, choosing the eld sizes indicated by NIST standards for elliptic curve cryptography. Performance improvements are
ASIC
0.13µm
Proc. IEEE International Symposium on Circuits and Systems (ISCAS [12] A. Satoh and K. Takano, A Scalable Dual-Field Elliptic Curve Crypto-
This work ASIC
Modular Arithmetic Logic Unit and its Hardware Implementation, in 2006), pp. 787-790, 2006.
especially evident in terms of clock counts. Satoh and Takano [12]
the ACM, vol. 21, pp. 120-126, 1978. [11] K. Sakiyama, B. Preneel, and I. Verbauwhede, A Fast Dual-Field
graphic Processor, IEEE Transanctions on Computers, vol. 52, pp. 449-
0.18µm
460, 2003. [13] E. Savas ¸ , A. F. Tenca and C ¸ . K. Koc ¸ , A Scalable and Unied Multiplier
clock period: 7.26 ns
clock period: 12.2 ns
GF (N )
clock
throughput
clock
throughput
Architecture for Finite Fields
eld size
count
[Mbit/s]
count
[Mbit/s]
Hardware and Embedded Systems: Proceedings of CHES'00. Lecture
192
45
587.7
27
582.9
Note in Computer Science, Springer-Verlag, vol. 1965, pp. 281-296, 2000.
224
66
467.5
36
510.0
[14] H.K. Son and S.G. Oh Design and Implementation of Scalable Low-
256
66
534.3
36
582.9
Power Montgomery Multiplier, in Proceedings of the IEEE International
284
91
429.9
45
517.3
521
231
310.7
90
474.5
GF (2k )
case, to a subportion of the Wallace tree in the multiplier.
GF (2k )
operations
would be worse in our case, while remaining superior for the more critical integer/GF (N ) arithmetic. In the case a dual frequency implementation is not possible, on the other hand, our multiplier has better performance also for
GF (2m ),
Cryptographic
Conference on Computer Design (ICCD'04), pp. 524-531, Oct. 2004.
mode, since the output of their unit is connected, in this If a dual clock frequency were allowed,
and
[15] W.-C. Tsai, C.B. Shung, and S.-J.Wang, Two systolic architectures
Authors in [12] emphasize that a higher frequency could be used if the unied multiplier were used only in
GF (p)
GF (2m ),
and
comparisons with the multiplier in [12] appear similar to those given in the above table for the integer/GF (N ) case. V. C ONCLUSION The approach presented in this paper, based on dual-eld parallel Montgomery multiplication, proves to be a promising
for modular multiplication, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, no. 1, pp. 103-107, Feb. 2000. [16] C. D. Walter, Systolic Modular Multiplication, IEEE Transactions on Computers, vol. 42, no.3, pp. 376-378, March 1993. [17] J. Wolkerstorfer, Dual-eld arithmetic unit for Cryptographic
Hardware
and
Embedded
GF (p)
Systems:
and
GF (2m ),
Proceedings
of
CHES'02. Lecture Note in Computer Science, Springer-Verlag, vol. 2523, pp. 500-514, 2002. [18] A. F. Tenca, and L. A. Tawalbeh, Carry-Save Representation Is ShiftUnsafe: The Problem and Its Solution, IEEE Transanctions on Computers,vol. 55, no.5, may 2006.