IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 12, DECEMBER 2013
TABLE V
COMPARISON OF MM-1024 WITH OTHER IMPLEMENTATIONS

| Ref. | Description | Platform | Freq. (MHz) | Clock Cycles | Execution Time (μs) |
|---|---|---|---|---|---|
| This brief (Platform I) | Mixed partitioning | 16 ASIP, 16 32-b MMU | 250 | 3826 | 15.3 |
| This brief (Platform II) | Mixed partitioning | 8 ASIP, 8 64-b MMU | 250 | 3554 | 14.2 |
| This brief (Platform III) | Mixed partitioning | 4 ASIP, 4 128-b MMU | 250 | 3460 | 13.8 |
| Reference (Platform IV) | pSHS | 4 ASIP, 4 32 × 32 multipliers | 250 | 15 477 | 61.9 |
| [11] | Single-core | MIPS32 20Kc | 533 | 23 377 | 43.9 |
| [5] | Single-core | ARM | 80 | 45 600 | 570 |
| [4] | pSHS | 4 MicroBlaze cores | 100 | 16 716 | 167.2 |
| [4] | pSHS | 8 MicroBlaze cores | 100 | 10 762 | 107.6 |
| [3] | Multiplicand-based partitioning | 4 custom processors, 4 32 × 32 multipliers | 93 | 4092 | 44 |

V. CONCLUSION

By exploiting parallelism and lowering area cost, this brief presented a mixed partitioning for the Radix-2 MM. Experimental results showed that the proposed partitioning achieves better performance than previous methods when implemented on the low-cost ASIP-based multicore system. The proposed partitioning is therefore suitable for speeding up public-key ciphers such as RSA and DSA.

REFERENCES

[1] S. Singh, “Challenges of programming multi-core microprocessors,” in Proc. IET Electron. Soc. Weekly Conf. Program. Hardw. Syst., Oct. 2008, pp. 1–29.
[2] P. L. Montgomery, “Modular multiplication without trial division,” Math. Comput., vol. 44, pp. 519–521, Apr. 1985.
[3] J. Fan, K. Sakiyama, and I. Verbauwhede, “Montgomery modular multiplication algorithm on multi-core systems,” in Proc. IEEE Workshop Signal Process. Syst., Shanghai, China, Oct. 2007, pp. 261–266.
[4] Z. Chen and P. Schaumont, “A parallel implementation of Montgomery multiplication on multicore systems: Algorithm, analysis, and prototype,” IEEE Trans. Comput., vol. 60, no. 12, pp. 1692–1703, Dec. 2011.
[5] A. F. Tenca and C. K. Koc, “A scalable architecture for modular multiplication based on Montgomery’s algorithm,” IEEE Trans. Comput., vol. 52, no. 9, pp. 1215–1221, Sep. 2003.
[6] C. K. Koc, T. Acar, and B. S. Kaliski, “Analyzing and comparing Montgomery multiplication algorithms,” IEEE Micro, vol. 16, no. 3, pp. 26–33, Jun. 1996.
[7] K. Yu and N. Audsley, “Combining behavioural real-time software modeling with the OSCI TLM-2.0 communication standard,” in Proc. IEEE 10th Int. Conf. Comput. Inf. Technol., Jun. 2010, pp. 1825–1832.
[8] H. W. M. van Moll, H. Corporaal, V. Reyes, and M. Boonen, “Fast and accurate protocol specific bus modeling using TLM 2.0,” in Proc. Design, Autom. Test Eur. Conf. Exhibit., Apr. 2009, pp. 316–319.
[9] OSCI TLM-2.0 Language Reference Manual, Open SystemC Initiative, New York, 2009.
[10] W. Huang, J. Han, S. Wang, and X. Zeng, “A low-complexity heterogeneous multi-core platform for security SoC,” in Proc. IEEE Asian Solid-State Circuits Conf., Jul. 2010, pp. 1–4.
[11] MIPS Technologies. (2002, Jun.). 64-Bit Architecture Speeds RSA by 4x, Sunnyvale, CA [Online]. Available: http://www.mips.com/media/files/white-papers/64bitarchitecturespeedsRSA4x.pdf
Novel Architecture for Efficient FPGA Implementation of Elliptic Curve Cryptographic Processor Over GF($2^{163}$)

Hossein Mahdizadeh and Massoud Masoumi

Abstract— A new and highly efficient architecture for elliptic curve scalar point multiplication is presented. To achieve the maximum architectural and timing improvements, we reorganize and reorder the critical path of the Lopez–Dahab scalar point multiplication architecture such that logic structures are implemented in parallel and operations in the critical path are diverted to noncritical paths. The results we obtained show that with G = 55 our proposed design is able to compute scalar multiplication over GF($2^{163}$) in 9.6 μs with the maximum achievable frequency of 250 MHz on Xilinx Virtex-4 (XC4VLX200), where G is the digit size of the underlying digit-serial finite-field multiplier. Another implementation variant for less resource consumption is also proposed; with G = 33, the design performs the same operation in 11.6 μs at 263 MHz on the same platform. The results of synthesis show that, in the first implementation, 17 929 slices or 20% of the chip area is occupied, which makes it suitable for speed-critical cryptographic applications, while in the second implementation 14 203 slices or 16% of the chip area is utilized, which makes it suitable for applications that may require a speed–area tradeoff.

Index Terms— Elliptic curve cryptography, FPGA implementation, scalar point multiplication.
I. INTRODUCTION

Elliptic curve cryptography (ECC) is a public key cryptography system superior to the well-known RSA cryptography; for the same key size, it gives a higher security level than RSA [1]. There are numerous advantages to using field-programmable gate array (FPGA) technology to implement in hardware the computationally intensive operations needed for ECC. These advantages have been comprehensively studied and listed by Wollinger et al. [2]. Several recent FPGA-based hardware implementations of ECC have achieved high throughput and efficiency. In this brief, we present a new architecture for efficient FPGA implementation of an ECC processor over GF($2^{163}$), which has considerable advantages over other implementations with regard to speed and area. The proposed architecture is based on a modified Lopez–Dahab (LD) elliptic curve point multiplication algorithm [3] in which we have reorganized and reordered the data path carefully to achieve maximum performance and efficiency.

In this brief, the execution delay of the LD algorithm has been reduced by parallelizing the multipliers that implement the calculations in projective coordinates. In addition, no separate architecture has been designed for obtaining the initial values of the LD algorithm; the related calculations are transferred to the main loop of the algorithm to reduce the complexity of the processor’s architecture. Point addition and doubling are performed separately, and consecutive point addition in the LD algorithm is carried out without the involvement of key bits, which in turn reduces the length of the critical path. The projective-to-affine coordinate conversion module has been designed to be very compact, and a separate clock source has been dedicated to this module to prevent it from affecting the main calculations of the processor.

The results obtained show that our proposed design is able to compute GF($2^{163}$) elliptic curve scalar point multiplication with G = 55 in 9.6 μs with the maximum achievable frequency of 250 MHz on Xilinx Virtex-4 (XC4VLX200) while occupying 17 929 slices or 20% of the chip area, which makes the design suitable for high-speed applications. With G = 33, the design performs the same operation in 11.6 μs at 263 MHz on the same platform while utilizing 14 203 slices or 16% of the chip area, which makes it suitable for applications that may require a speed–area tradeoff.

The rest of this brief is organized as follows. In Section II, the LD algorithm is presented briefly. In Section III, the proposed architecture for the ECC processor is illustrated. In Section IV, the implementation results and performance obtained are compared with those of other published works.

Manuscript received March 26, 2012; revised October 14, 2012; accepted November 21, 2012. Date of publication January 11, 2013; date of current version October 14, 2013.
The authors are with Islamshahr Azad University, Tehran 16846-13114, Iran (e-mail: [email protected]; [email protected]).
Digital Object Identifier 10.1109/TVLSI.2012.2230410
1063-8210 © 2013 IEEE

Algorithm 1 LD Scalar Point Multiplication Over GF($2^m$) [3]
INPUT: $k = (k_{t-1}, \ldots, k_1, k_0)_2$ with $k_{t-1} = 1$, $P = (x_P, y_P) \in E(F_{2^m})$.
OUTPUT: $Q = kP = (x_3, y_3)$.
/* Affine to Projective */
1. $X_1 \leftarrow x_P$, $Z_1 \leftarrow 1$, $X_2 \leftarrow x_P^4 + b$, $Z_2 \leftarrow x_P^2$. {Compute (P, 2P)}
/* Projective Scalar Multiplication */
2. For $i$ from $t-2$ downto 0 do
   2.1 If $k_i = 1$ then
       $T \leftarrow Z_1$, $Z_1 \leftarrow (X_1 Z_2 + X_2 Z_1)^2$, $X_1 \leftarrow x_P Z_1 + X_1 X_2 T Z_2$.
       $T \leftarrow X_2$, $X_2 \leftarrow X_2^4 + b Z_2^4$, $Z_2 \leftarrow T^2 Z_2^2$.
   2.2 Else
       $T \leftarrow Z_2$, $Z_2 \leftarrow (X_1 Z_2 + X_2 Z_1)^2$, $X_2 \leftarrow x_P Z_2 + X_1 X_2 Z_1 T$.
       $T \leftarrow X_1$, $X_1 \leftarrow X_1^4 + b Z_1^4$, $Z_1 \leftarrow T^2 Z_1^2$.
/* Projective to Affine */
3. $x_3 \leftarrow X_1 / Z_1$.
4. $y_3 \leftarrow (x_P + X_1/Z_1)\,[(X_1 + x_P Z_1)(X_2 + x_P Z_2) + (x_P^2 + y_P)(Z_1 Z_2)]\,(x_P Z_1 Z_2)^{-1} + y_P$.
5. Return $(x_3, y_3)$.

II. ELLIPTIC CURVE SCALAR POINT MULTIPLICATION (ECSM)

Scalar multiplication is by far the most important operation in elliptic curve cryptosystems.
ECSM is an operation that, given an integer $k$ and a point $P$ on an elliptic curve $C$, computes another point $Q$ such that $Q = kP$. In our ECSM architecture, we use a variant of the algorithm due to Lopez and Dahab, which is an improvement of the traditional Montgomery ECSM algorithm [3]. The algorithm consists of three stages: 1) conversion of $P$ from affine to projective coordinates; 2) computation of $Q = kP$ in projective coordinates; and 3) conversion of $Q$ from projective back to affine coordinates. This is shown in Algorithm 1.

III. PROPOSED ARCHITECTURE FOR THE ECC PROCESSOR

The most important operations for designing an efficient ECC processor are finite-field multiplication, inversion, and squaring. Field addition and subtraction in GF($2^m$) are not investigated since they are defined as polynomial addition and can be implemented simply as the XOR of the two m-bit operands [1]. The finite-field squarer over GF($2^{163}$) has been designed based on the proposal presented in [4]. For finite-field multiplication, we have designed an efficient least-significant-digit-first finite-field multiplier as proposed in [5]. For inversion, an efficient inverter based on the Itoh–Tsujii multiplicative inverse algorithm [6] has been implemented.

The architecture for ECSM consists of two parts: the first performs the calculations in the projective coordinate system, and the second converts projective coordinates to affine coordinates. The projective calculations cover steps 1 and 2 of the LD algorithm. In the design of this part of the processor, as proposed in [7], the number of computational units is chosen so that computations can be performed in parallel. Hence, we use three field multipliers to implement the main loop of the algorithm, in which point addition and doubling are carried out. According to step 2.1 of the LD algorithm, in the first stage the three multiplications $X_1 Z_2$, $X_2 Z_1$, and $T Z_2$ (with $T \leftarrow X_2$) are performed in parallel by the three multipliers, and in the second stage the three other multiplications $x_P Z_1$, $X_1 X_2 T Z_2$ (with $T \leftarrow Z_1$), and $b Z_2^4$ are performed in parallel, as shown in Fig. 1. Hence, the delay of each iteration is reduced from six field multiplication delays to two.

As mentioned, the most important modules in the design of an ECSM are the field multiplier, the field inverter, and the field squarer. The key point is that the critical path of the processor is determined by the longest path among these modules. In our design, the inverter is built so that its critical path coincides with that of the multiplier, and the multiplier’s path is longer than the squarer’s path; the critical path therefore lies in the multiplier.
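The field operations described above can be illustrated in software. The following is a minimal sketch, not the authors' VHDL design: it assumes the NIST reduction polynomial $x^{163} + x^7 + x^6 + x^3 + 1$ for GF($2^{163}$), stores field elements as Python integers, and uses hypothetical function names. It shows addition as XOR and a least-significant-digit-first multiplier that consumes G bits of one operand per step, so a multiplication takes ⌈163/G⌉ steps.

```python
M = 163
# Assumption: NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1.
F_POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf_add(a, b):
    """Addition (and subtraction) in GF(2^m) is a bitwise XOR."""
    return a ^ b


def gf_reduce(a):
    """Reduce a binary polynomial modulo F_POLY."""
    while a.bit_length() > M:
        a ^= F_POLY << (a.bit_length() - 1 - M)
    return a


def gf_mul_bit_serial(a, b):
    """Reference shift-and-add multiplier: one bit of b per step."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a = gf_reduce(a << 1)
        b >>= 1
    return acc


def gf_mul_digit_serial(a, b, g):
    """Least-significant-digit-first multiplier: g bits of b per step.

    Returns (product, steps) with steps = ceil(M / g).
    """
    steps = -(-M // g)
    mask = (1 << g) - 1
    acc = 0
    for _ in range(steps):
        digit = b & mask
        # Multiply the current (shifted) multiplicand by one g-bit digit.
        for j in range(g):
            if (digit >> j) & 1:
                acc ^= a << j
        a = gf_reduce(a << g)  # advance the multiplicand by one digit
        b >>= g
    return gf_reduce(acc), steps
```

In this model a larger digit size lowers the step count at the cost of more work per step, which mirrors the speed and area tradeoff discussed for G in the text.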
Another important strategy in the design of the projective calculations unit is that no separate calculations are performed to obtain the initial values of step 1 of the LD algorithm: designing further computational modules for these calculations would increase both the complexity of the critical path and the required area. Instead, we reuse the calculations of step 2 of the algorithm to obtain the results of step 1 without additional or unnecessary calculations. In the proposed design, the calculations of step 1 are performed whenever the most significant bit of the key is 1. So, when $k_i = 1$, if the values of (1) are used in the calculations of step 2.1, the required initial values are obtained in accordance with step 1 of the LD algorithm:

$$X_1 \leftarrow 1,\quad Z_1 \leftarrow 0,\quad X_2 \leftarrow x_P,\quad Z_2 \leftarrow 1. \tag{1}$$

The results of the calculation in step 2.1 of the LD algorithm using the above values are

$$X_1 \leftarrow x_P,\quad Z_1 \leftarrow 1,\quad X_2 \leftarrow x_P^4 + b,\quad Z_2 \leftarrow x_P^2. \tag{2}$$
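The substitution of (1) into step 2.1 can be checked in software. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it assumes the NIST reduction polynomial for GF($2^{163}$), uses a simple bit-serial multiplier, and groups the six multiplications of step 2.1 into two stages of three, as the text describes for the three-multiplier data path. All function names are hypothetical.

```python
M = 163
# Assumption: NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1.
F_POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^163); inputs must be reduced."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        if a >> M:  # degree reached 163: subtract (XOR) the modulus
            a ^= F_POLY
        b >>= 1
    return acc


def gf_sqr(a):
    return gf_mul(a, a)


def ladder_step_ki_1(X1, Z1, X2, Z2, xP, b):
    """Step 2.1 of Algorithm 1 as two stages of three multiplications."""
    # Stage 1: three independent products.
    m1 = gf_mul(X1, Z2)
    m2 = gf_mul(X2, Z1)
    m3 = gf_mul(X2, Z2)                   # T * Z2 with T = X2
    Z1_new = gf_sqr(m1 ^ m2)              # (X1 Z2 + X2 Z1)^2
    Z2_new = gf_sqr(m3)                   # T^2 Z2^2
    # Stage 2: three more products, using stage-1 results.
    m4 = gf_mul(xP, Z1_new)
    m5 = gf_mul(m1, m2)                   # X1 X2 T Z2 with T = old Z1
    m6 = gf_mul(b, gf_sqr(gf_sqr(Z2)))    # b Z2^4
    X1_new = m4 ^ m5
    X2_new = gf_sqr(gf_sqr(X2)) ^ m6      # X2^4 + b Z2^4
    return X1_new, Z1_new, X2_new, Z2_new


def initial_values(xP):
    """Values of (1): X1 = 1, Z1 = 0, X2 = xP, Z2 = 1."""
    return 1, 0, xP, 1
```

Calling `ladder_step_ki_1(*initial_values(xP), xP, b)` with any field elements `xP` and `b` returns `(xP, 1, xP^4 + b, xP^2)`, which is exactly (2).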
As seen in Fig. 2, whenever the key bit is equal to 1, the values 1, 0, and $x_P$ are fed into the multiplexers, which connect them to the appropriate inputs of the next stage to produce the terms of (2). The inputs and outputs of the point addition and doubling operations, together with their corresponding paths, are shown separately in Fig. 1.

Fig. 1. Architecture designed for the computation of point addition and point doubling in projective coordinates of the LD algorithm.

After the point addition and doubling in projective coordinates have been designed and implemented, the input and output ports of the architecture of Fig. 1 must be connected together in a new architecture that completes the iteration of the LD algorithm according to the key bits. Therefore, as seen in Fig. 2, the outputs of point addition and doubling are swapped to obtain the correct outputs and to perform the iteration in step 2 of the LD algorithm. Point addition and doubling in step 2 of the algorithm can be checked in this new architecture for key bits 0 and 1. For example, consider the following point addition operation for $k_i = 1$; its inputs are $X_1$, $X_2$, $Z_1$, and $Z_2$, and its outputs are saved in $X_1$ and $Z_1$:

$$T \leftarrow Z_1,\quad Z_1 \leftarrow (X_1 Z_2 + X_2 Z_1)^2,\quad X_1 \leftarrow x_P Z_1 + X_1 X_2 T Z_2. \tag{3}$$
When a key bit changes from 1 to 0, one might expect the terms $X_1 Z_2 + X_2 Z_1$ and $X_1 X_2 T Z_2$ to change. However, a change in the key bit only swaps $X_1$ and $Z_1$ with $X_2$ and $Z_2$, and both terms are symmetric under this swap, so they remain unchanged. The point addition operation can therefore be repeated in the iterative part of the algorithm without the involvement of key bits. Only after the end of the iterations are the registers swapped based on the key bits. The point doubling operation for $k_i = 1$ is performed by using (4):

$$T \leftarrow X_2,\quad X_2 \leftarrow X_2^4 + b Z_2^4,\quad Z_2 \leftarrow T^2 Z_2^2. \tag{4}$$
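The symmetry argument can be exercised with a short software model. The sketch below is a hedged illustration, not the authors' hardware: it assumes the NIST reduction polynomial for GF($2^{163}$), uses hypothetical function names, and treats `xP` and `b` as arbitrary field elements, since the identity being checked does not depend on the curve. It compares step 2.2 written out directly with a version that reuses the step 2.1 data path and only swaps registers, and it runs the whole ladder from the initial values (1) over every key bit.

```python
M = 163
# Assumption: NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1.
F_POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^163); inputs must be reduced."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        if a >> M:
            a ^= F_POLY
        b >>= 1
    return acc


def gf_sqr(a):
    return gf_mul(a, a)


def step_ki_1(X1, Z1, X2, Z2, xP, b):
    """Step 2.1: addition into (X1, Z1), doubling of (X2, Z2)."""
    T = Z1
    Z1 = gf_sqr(gf_mul(X1, Z2) ^ gf_mul(X2, Z1))
    X1 = gf_mul(xP, Z1) ^ gf_mul(gf_mul(X1, X2), gf_mul(T, Z2))
    T = X2
    X2 = gf_sqr(gf_sqr(X2)) ^ gf_mul(b, gf_sqr(gf_sqr(Z2)))
    Z2 = gf_mul(gf_sqr(T), gf_sqr(Z2))
    return X1, Z1, X2, Z2


def step_ki_0(X1, Z1, X2, Z2, xP, b):
    """Step 2.2 written out directly."""
    T = Z2
    Z2 = gf_sqr(gf_mul(X1, Z2) ^ gf_mul(X2, Z1))
    X2 = gf_mul(xP, Z2) ^ gf_mul(gf_mul(X1, X2), gf_mul(Z1, T))
    T = X1
    X1 = gf_sqr(gf_sqr(X1)) ^ gf_mul(b, gf_sqr(gf_sqr(Z1)))
    Z1 = gf_mul(gf_sqr(T), gf_sqr(Z1))
    return X1, Z1, X2, Z2


def step_ki_0_by_swap(X1, Z1, X2, Z2, xP, b):
    """Same result using only the step 2.1 data path plus register swaps."""
    X2, Z2, X1, Z1 = step_ki_1(X2, Z2, X1, Z1, xP, b)
    return X1, Z1, X2, Z2


def ladder_reference(k, xP, b):
    """Steps 1 and 2 of Algorithm 1 as printed."""
    X1, Z1 = xP, 1
    X2, Z2 = gf_sqr(gf_sqr(xP)) ^ b, gf_sqr(xP)
    for i in range(k.bit_length() - 2, -1, -1):
        step = step_ki_1 if (k >> i) & 1 else step_ki_0
        X1, Z1, X2, Z2 = step(X1, Z1, X2, Z2, xP, b)
    return X1, Z1, X2, Z2


def ladder_swapped(k, xP, b):
    """Start from (1), process every key bit, swap registers when the bit is 0."""
    X1, Z1, X2, Z2 = 1, 0, xP, 1
    for i in range(k.bit_length() - 1, -1, -1):
        step = step_ki_1 if (k >> i) & 1 else step_ki_0_by_swap
        X1, Z1, X2, Z2 = step(X1, Z1, X2, Z2, xP, b)
    return X1, Z1, X2, Z2
```

For any positive `k`, `ladder_swapped(k, xP, b)` and `ladder_reference(k, xP, b)` return the same projective values, which is the property that lets one data path serve both key-bit cases.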
The operation for $k_i = 0$ is carried out by swapping $X_2$ and $Z_2$ with $X_1$ and $Z_1$, respectively. The output registers are therefore swapped in order to provide the proper inputs for the point doubling operation, based on the key bits, in the iterative part of the LD algorithm. To determine when the initial values enter the calculations, and to track the iterations of the LD algorithm according to the key bits, the module of Fig. 1 must be combined with a key shift register in a new structure. The aim here is to demonstrate that, once all key bits have been scanned, the inputs and outputs of the architecture of Fig. 1 are properly connected to each other. The new structure is shown in Fig. 2. With the architecture of Fig. 2, in addition to executing the iterative part of the LD algorithm (step 2), the point addition operation is carried out without the involvement of key bits, which reduces the complexity of the processor.

Fig. 2. Architecture of point addition and doubling iteration based on key bits.

The second part of the processor performs the calculations that convert projective coordinates to affine coordinates. It is obvious from the LD algorithm that many sequential calculations are required to implement steps 3 and 4. In this brief, the standard projective-to-affine coordinate conversion algorithm, which needs 10 multiplications [1], is implemented with two field multipliers, so that all 10 multiplications are performed by these two multipliers. Using small word lengths in these two field multipliers decreases the required implementation area.

One of the important steps in the design of the ECSM is selecting the multiplier’s word length (G). Choosing the right word length plays an important role in reducing the delay and the occupied area of the scalar multiplication. When several values of G give the same value of $\lceil m/G \rceil$, the smallest one should be selected: all of them need the same number of cycles, so speed does not change, while a smaller G consumes fewer hardware resources and a larger G lengthens the critical path. For example, with m = 163, among the values G ∈ {41, ..., 54} we select G = 41.

Because of the iterative calculations in the projective coordinate system (step 2 of the LD algorithm), fast execution is very important in the design of an efficient ECC processor. Large values of G are therefore more appropriate for the multipliers used in the first part of the processor (i.e., the multipliers of Fig. 1, or the projective calculations). The word lengths used in this part of the processor are $G_1$ = 55, 41, and 33. As mentioned earlier, the calculations of steps 3 and 4 of the LD algorithm are used only once, at the end of the algorithm, with no iteration, so large values of G are not necessary there. Instead, since this part of the processor contains a relatively large number of computational units, a relatively small value of G should be chosen to reduce the required implementation area. The word length used in this part of the processor in both implementations is $G_2$ = 10. For word lengths $G_2 \le 10$, the obtained working frequencies were very close together. Our results showed that choosing $G_2$ = 10 leads to an appropriate and compact implementation with the smallest delay.
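The selection rule for G can be stated as a small computation. The sketch below assumes that a digit-serial multiplication takes ⌈m/G⌉ cycles; the function names are hypothetical. For each cycle count it records the smallest digit size that achieves it.

```python
def mult_cycles(m, g):
    """Cycles of a digit-serial multiplication, assumed to be ceil(m / g)."""
    return -(-m // g)


def smallest_digit_per_cycle_count(m, g_max):
    """Map each achievable cycle count to the smallest digit size G giving it."""
    best = {}
    for g in range(1, g_max + 1):
        # setdefault keeps the first, and therefore smallest, G per cycle count.
        best.setdefault(mult_cycles(m, g), g)
    return best


if __name__ == "__main__":
    best = smallest_digit_per_cycle_count(163, 64)
    for cycles in (3, 4, 5, 17):
        print(cycles, "cycles -> smallest G =", best[cycles])
```

Under this assumption, the smallest digit sizes for 3, 4, 5, and 17 cycles with m = 163 are 55, 41, 33, and 10, which are the values of $G_1$ and $G_2$ used in the text.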
Since the calculations in this part of the processor are done sequentially, and its implementation is more compact than that of the other parts, its critical path is longer. Therefore, another clock source is dedicated to this part to prevent it from affecting the main calculations of the processor. Since the delay of this part of the processor is much less than the delay of scalar point multiplication, its working frequency has no effect on the working frequency of the processor.

IV. IMPLEMENTATION RESULTS

The required parameters, including the curve order [the number of points defined on an elliptic curve over GF($2^m$)], the coefficients, and the base point coordinates, are selected based on the NIST proposal [8]. The ECC processor was implemented in synthesizable VHDL and synthesized, placed, and routed using Xilinx ISE 12.1. The performance of the proposed scalar multiplier for the selected word lengths is shown in Table I.

TABLE I
PERFORMANCE OF THE PROPOSED SCALAR MULTIPLIER

| G1 | G2 | Freq. (MHz) | Time (μs) | No. of Cycles | Area (Slices) | Area (LUT) | Efficiency |
|---|---|---|---|---|---|---|---|
| 55 | 10 | 250 | 9.6 | 2751 | 17 929 | 33 414 | 508 |
| 41 | 10 | 256 | 10.6 | 3077 | 16 544 | 30 895 | 497 |
| 33 | 10 | 263 | 11.6 | 3404 | 14 203 | 26 557 | 529 |

The proposed design completes the computations in projective coordinates in $326 \lceil m/G_1 \rceil + 1304$ cycles and the coordinate conversion in $15 \lceil m/G_2 \rceil + 214$ cycles. The term $\lceil m/G_1 \rceil$ is the number of cycles required for a finite-field multiplication in step 2 of the LD algorithm, i.e., in the projective calculations. The term $\lceil m/G_2 \rceil$ is the number of cycles required for a finite-field multiplication in steps 3 and 4 of the LD algorithm, i.e., in the conversion from projective to affine coordinates.

To decide how efficient a design is, we use the efficiency, defined as

$$\text{Efficiency} = \frac{\text{Throughput (Mbit/s)}}{\text{Area (slices)}}$$

as a figure of merit, where throughput is defined as

$$\text{Throughput} = \frac{\text{Working frequency} \times \text{Number of bits}}{\text{Number of cycles}}$$

and hardware area can be defined as the number of four-input LUTs as well as the number of CLB slices. The last column in the table shows the algorithmic efficiency, defined as throughput divided by area. It would be more accurate to use throughput divided by the number of slices, but slice counts were not reported by the authors of other designs. Therefore, we have used throughput divided by the number of LUTs.

TABLE II
PERFORMANCE OF THE SCALAR MULTIPLIERS

| Ref. | FPGA | Freq. (MHz) | Time (μs) | Area (Slices) | Area (LUT) | Efficiency |
|---|---|---|---|---|---|---|
| [4] | XC2V6000 | 93.3 | 34.11 | 13 376 | 2812 | 340 |
| [7] | XC4VLX80 | 143 | 10 | 24 363 | - | - |
| [9] | XC2V2000 | 100 | 46.5 | 3416 | 7559 | 532 |
| [12] | XCV400E | 76.7 | 210 | - | 3002 | 265 |
| [13] | XCV2000E | 66.4 | 144 | - | 20 068 | 56 |
| [14] | Virtex-II V8000 | 90.2 | 106 | 18 079 | - | 44 |
| [15] | XCV2600E | 46.5 | 63 | 18 314 + 24 RAMs | - | - |
| [16] | XC2V600-4 | 54 | 60 | - | - | - |
| [17] | Virtex II pro 30 | 100 | 280 | 8450 | - | - |
| [18] | Virtex-4 VLX200 | 153.9 | 19.55 | 16 209 | 26 364 | 316 |
| [19] | Virtex 2000E | 66 | 75 | - | 10 017 | 70 |
| [20] | Stratix II | - | 49 | - | - | - |

In Table II, a number of high-speed elliptic curve processors (ECPs) are compared with the proposed one. As seen from Table II, the proposed design is more efficient than the other designs reported in the open literature. Note that the work presented in [9] consumes almost half the resources of our implementation, but with $G_1$ = 33 the proposed design is four times faster than that implementation.
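The cycle-count expressions and the figure of merit can be evaluated numerically. The sketch below is an illustration under two assumptions that the text does not state explicitly: m/G is rounded up to the next integer, and the Efficiency column of Table I is expressed in bit/s per LUT computed from the reported execution time. Function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)


def projective_cycles(m, g1):
    """Cycles for the projective part: 326 * ceil(m / G1) + 1304."""
    return 326 * ceil_div(m, g1) + 1304


def conversion_cycles(m, g2):
    """Cycles for projective-to-affine conversion: 15 * ceil(m / G2) + 214."""
    return 15 * ceil_div(m, g2) + 214


def throughput_bits_per_s(freq_hz, bits, cycles):
    """Throughput as defined in the text: frequency * bits / cycles."""
    return freq_hz * bits / cycles


def efficiency_from_time(bits, time_us, luts):
    """Assumed reading of the Efficiency column: bit/s per LUT from reported time."""
    return bits / (time_us * 1e-6) / luts


if __name__ == "__main__":
    for g1 in (55, 41):
        total = projective_cycles(163, g1) + conversion_cycles(163, 10)
        print("G1 =", g1, "total cycles =", total)
    print(efficiency_from_time(163, 9.6, 33414))
```

With m = 163 and $G_2$ = 10, the sum of the two expressions gives 2751 cycles for $G_1$ = 55 and 3077 cycles for $G_1$ = 41, matching Table I; for $G_1$ = 33 the same arithmetic gives 3403, one fewer than the reported 3404. Evaluating `efficiency_from_time` with the times and LUT counts of Table I gives values within one unit of the Efficiency column, which suggests, but does not confirm, how that column was computed.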
Some other papers introduce FPGA-based processors for elliptic curve cryptography on Koblitz curves. For example, [10] presents a pipelined architecture that is able to compute a single scalar multiplication, on average, in 11.72 μs. However, such works are not compared with the proposed design since binary curves and Koblitz curves belong to two different classes. In addition, the proposed processor is more efficient than the latest implementations of Koblitz curves on FPGA [11].

REFERENCES

[1] F. Rodriguez-Henriquez, N. A. Saqib, A. D. Pérez, and C. K. Koc, Cryptographic Algorithms on Reconfigurable Hardware. New York: Springer-Verlag, 2006.
[2] T. Wollinger, J. Guajardo, and C. Paar, “Security on FPGAs: State-of-the-art and implementations attacks,” ACM Trans. Embedded Comput. Syst., vol. 3, no. 3, pp. 534–574, 2004.
[3] J. Lopez and R. Dahab, “Fast multiplication on elliptic curves over GF($2^m$) without precomputation,” in Proc. 1st Int. Workshop Cryptograph. Hardw. Embedded Syst., 1999, pp. 316–327.
[4] D. Yong-Ping, Z. Xue-Cheng, L. Zheng-Lin, H. Yu, and Y. Li-Hua, “High-performance hardware architecture of elliptic curve cryptography processor over GF($2^{163}$),” J. Zhejiang Univ. Sci. A, vol. 10, no. 2, pp. 301–310, 2009.
[5] S. Kumar, T. Wollinger, and C. Paar, “Optimum digit serial GF($2^m$) multipliers for curve based cryptography,” IEEE Trans. Comput., vol. 55, no. 10, pp. 1306–1311, Oct. 2006.
[6] T. Itoh and S. Tsujii, “A fast algorithm for computing multiplicative inverses in GF($2^m$) using normal basis,” J. Inf. Comput., vol. 78, no. 3, pp. 171–177, 1988.
[7] C. H. Kim, S. Kwon, and C. P. Hong, “FPGA implementation of high performance elliptic curve cryptographic processor over GF($2^{163}$),” J. Syst. Archit., vol. 54, no. 10, pp. 893–900, 2008.
[8] FIPS 186-2. (2009) [Online]. Available: http://csrc.nist.gov/publications/fips/
[9] B. Ansari and A. Hasan, “High-performance architecture of elliptic curve scalar multiplication,” IEEE Trans. Comput., vol. 57, no. 11, pp. 1443–1453, Nov. 2008.
[10] K. Jarvinen and J. Skytta, “On parallelization of high-speed processors for elliptic curve cryptography,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 9, pp. 1162–1175, Sep. 2008.
[11] K. Jarvinen, “Optimized FPGA-based elliptic curve cryptography processor for high-speed applications,” Integr., VLSI J., vol. 44, no. 4, pp. 270–279, Sep. 2011.
[12] G. Orlando and C. Paar, “A high-performance reconfigurable elliptic curve processor for GF($2^m$),” in Proc. Cryptograph. Hardw. Embedded Syst., 2000, pp. 41–56.
[13] N. Gura, S. C. Shantz, H. Eberle, S. Gupta, V. Gupta, D. Finchelstein, E. Goupy, and D. Stebila, “An end-to-end systems approach to elliptic curve cryptography,” in Proc. Cryptograph. Hardw. Embedded Syst., Redwood Shores, CA, 2002, pp. 349–365.
[14] K. Jarvinen, M. Tommiska, and J. Skytta, “A scalable architecture for elliptic curve point multiplication,” in Proc. IEEE Int. Conf. Field-Program. Technol., Brisbane, Australia, Dec. 2004, pp. 303–306.
[15] F. Rodriguez-Henriquez, N. A. Saqib, and A. Diaz-Perez, “A fast parallel implementation of elliptic curve point multiplication over GF($2^m$),” Microprocess. Microsyst., vol. 28, nos. 5–6, pp. 329–339, 2004.
[16] R. C. C. Cheung, N. J. Telle, W. Luk, and P. Y. K. Cheung, “Customizable elliptic curve cryptosystems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 9, pp. 1048–1059, Sep. 2005.
[17] K. Sakiyama, L. Batina, B. Preneel, and I. Verbauwhede, “Superscalar coprocessor for high-speed curve-based cryptography,” in CHES (Lecture Notes in Computer Science), vol. 4249. New York: Springer-Verlag, 2006, pp. 415–429.
[18] W. N. Chelton and M. Benaissa, “Fast elliptic curve cryptography on FPGA,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 2, pp. 198–205, Feb. 2008.
[19] J. Lutz and A. Hasan, “High performance FPGA based elliptic curve cryptographic coprocessor,” in Proc. Int. Conf. Inf. Technol., Coding Comput., 2004, pp. 486–492.
[20] K. Jarvinen and J. Skytta, “On parallelization of high-speed processors for elliptic curve cryptography,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 9, pp. 1162–1175, Sep. 2008.