CUDA based implementation of parallelized Pollard's Rho algorithm ...

FINAL WORKSHOP OF GRID PROJECTS, ”PON RICERCA 2000-2006, AVVISO 1575”


CUDA based implementation of parallelized Pollard's Rho algorithm for ECDLP

M. Chinnici¹, S. Cuomo², M. Laporta², A. Pizzirani², S. Migliori³

¹ ENEA-FIM-INFOPPQ, Casaccia Research Center, Via Anguillarese 301, 00123 S. Maria di Galeria, Italy, [email protected]
² Università degli studi di Napoli FEDERICO II, Dipartimento di Matematica e Applicazioni R. Caccioppoli, Via Cinthia, 80136 Napoli, Italy, {salvatore.cuomo, mlaporta}@unina.it, [email protected]
³ ENEA-FIM, Enea-Sede, Lungotevere Thaon di Revel n. 76, 00196 Roma, Italy

Abstract—The recent introduction by NVIDIA of the Compute Unified Device Architecture (CUDA) libraries for high performance computing on graphics processing units has started a trend of using video cards to solve computationally hard problems in areas such as fluid dynamics, molecular dynamics, computer vision and astrophysics. In this paper we show how CUDA libraries and hardware can be used in cryptography as a cryptanalytic tool. The growth of data communications has made cryptography a real necessity of daily life. Sometimes private-key cryptosystems are enough; more often, public-key cryptosystems are needed for communication over insecure channels. Cryptosystems based on the theory of elliptic curves offer schemes with a relatively low communication overhead. In general, security is strongly related to the presumed intractability of an arithmetic problem such as the Discrete Logarithm Problem (DLP). In particular, testing the security of elliptic curve cryptography reduces to testing the solvability of the DLP in the group of points of an elliptic curve (ECDLP). Various methods are known to be more or less efficient in solving instances of the general DLP. Some of them have deterministic running time. Methods with probabilistic running time, such as Pollard's rho method, have a better trade-off between space and time. We describe an implementation of a parallelized Pollard's rho attack on ECDLP, based upon recent results on the optimization of Pollard's rho method and enhanced by some ad hoc choices for CUDA.

Index Terms—Cryptanalysis, CUDA, Elliptic Curves, High Performance Computing.

I. INTRODUCTION

Cryptography is essentially aimed at protecting data from unauthorized access. This is particularly important when the data involve sensitive information and are transmitted over insecure channels. Typical examples are business conducted via the Internet and payment by credit card. The data involved in such transactions are usually encrypted to make it harder for an attacker to retrieve secret information. In the past, the key used for encryption was the same as the one used for decryption, which raised a serious key-distribution problem. In 1976 W. Diffie and M. Hellman [3] invented a key agreement protocol that allows two users to exchange a secret key over an insecure channel without any prior contact. This event is commonly considered the birth of public-key cryptography. Since then, many cryptosystems relying on some hard mathematical problem have been proposed. However, because attacks on some of these problems succeeded, most of these cryptosystems became insecure or simply impractical. At present, three mathematical problems are still considered to be hard: the integer factorization problem (IFP), the discrete logarithm problem in the multiplicative group of a finite field (DLP) and the discrete logarithm problem in the group of points of an elliptic curve (ECDLP).

There is no real proof that the aforementioned problems are intractable. However, a lot of work has been done to try to solve them efficiently. These efforts have led to subexponential-time algorithms for IFP and DLP, but none for ECDLP. Mainly for this reason, elliptic curve cryptography (ECC) has become more and more attractive. Moreover, the parameters of ECC are usually much smaller than those of cryptosystems based on IFP and DLP. Consequently, ECC has a lower communication overhead.

II. ELLIPTIC CURVES

Let $K$ be a field of characteristic $\neq 2, 3$. The set $E$ of points $(x, y) \in K \times K$ satisfying the equation $y^2 = x^3 + ax + b$, with $a, b \in K$, is called an elliptic curve whenever $x^3 + ax + b$ has no multiple roots in $K$. The definition of an elliptic curve is slightly more complicated when the characteristic of $K$ is 2 or 3. The set $E$, enriched with a so-called "point at infinity" $O_\infty$ and a well-defined addition $+$, becomes an elliptic group, denoted by $E(K)$, where the point $O_\infty$ acts as the group identity. If $K = \mathbb{R}$, the real field, then the addition can be described geometrically through the "chord-tangent" method (see [10], p. 55). The inverse of a point $P = (x, y)$ is $-P = (x, -y)$ by definition. Moreover, one has the following explicit formulas for the sum and the doubling of points on $E(\mathbb{R})$. If $P = (x_1, y_1)$, $Q = (x_2, y_2)$ and $P + Q = (x_3, y_3)$, then $x_1 \neq x_2$ implies

$$x_3 = \left(\frac{y_2 - y_1}{x_2 - x_1}\right)^2 - x_1 - x_2, \qquad y_3 = -y_1 + \left(\frac{y_2 - y_1}{x_2 - x_1}\right)(x_1 - x_3),$$

while $P = Q$ implies

$$x_3 = \left(\frac{3x_1^2 + a}{2y_1}\right)^2 - 2x_1, \tag{1}$$

$$y_3 = -y_1 + \left(\frac{3x_1^2 + a}{2y_1}\right)(x_1 - x_3). \tag{2}$$

More remarkable is the fact that these formulas remain valid in $E(K)$ for a generic ground field $K$, where the so-called elliptic curve discrete logarithm problem can be formulated as follows: given $P, Q \in E(K)$, determine the integer $k$ (if one exists) such that

$$Q = kP = \underbrace{P + P + \dots + P}_{k \text{ times}}.$$

This problem is significantly hard if the ground field is finite. Indeed, V. Miller [7] and N. Koblitz [5] independently proposed using the elliptic group $E(\mathbb{F}_p)$, defined over a finite field $\mathbb{F}_p$, as the arithmetic basis of a cryptosystem.
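As an illustration of the group law of this section, the following minimal sketch implements the addition and doubling formulas over a small prime field. The toy curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$, the base point $(5, 1)$ of order 19 and the helper names `on_curve`, `ec_add` and `ec_mul` are assumptions chosen for the example; they do not come from the implementation described in this paper.

```python
from typing import Optional, Tuple

Point = Optional[Tuple[int, int]]  # None stands for the point at infinity

# Toy parameters (assumed for illustration): y^2 = x^3 + 2x + 2 over F_17.
P_MOD, A, B = 17, 2, 2
G: Point = (5, 1)   # base point of order 19 on this toy curve
ORDER = 19


def on_curve(pt: Point) -> bool:
    """Check the curve equation; the point at infinity is always accepted."""
    if pt is None:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0


def ec_add(p1: Point, p2: Point) -> Point:
    """Chord-tangent addition in affine coordinates."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                       # P + (-P) is the identity
    if p1 == p2:                          # doubling: tangent slope
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:                                 # addition: chord slope
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def ec_mul(k: int, pt: Point) -> Point:
    """Compute k*pt by double-and-add (k >= 0)."""
    result: Point = None
    addend = pt
    while k:
        if k & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)
        k >>= 1
    return result
```

Note that every addition performs one modular inversion (`pow(..., -1, P_MOD)`, available from Python 3.8), which is the costly operation that projective coordinate systems are designed to avoid.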

III. ECDLP

Although ECDLP is a particular case of DLP, there is no known generic algorithm with subexponential running time that solves it. One reason is that there is a fundamental difference between the underlying algebraic structures, i.e. the multiplicative group $\mathbb{F}_p^*$ of a finite field $\mathbb{F}_p$ for DLP and the elliptic group $E(\mathbb{F}_p)$. Mainly, while $\mathbb{F}_p^*$ is embedded in a structure with two operations, the elliptic group has only its own addition. For example, the index-calculus methods, which solve DLP instances with subexponential running time, fail when it comes to elliptic groups (except in very special and well-understood cases). As usual, algorithms for ECDLP are classified as follows: generic algorithms, which are applicable to all instances of ECDLP, and special algorithms, which take advantage of the particular instance of the problem. Since all special attacks on the ECDLP can be easily avoided by means of a suitable choice of the parameters, it is more interesting to focus on generic algorithms. The most widely used generic methods are variations of Pollard's rho method. Indeed, in 1997 CERTICOM introduced a list of ECDLP challenge problems, offering a cash prize for each solution [1]. The problems solved so far were solved by means of a parallelized Pollard's rho method.


IV. POLLARD'S RHO ALGORITHM FOR ECDLP

Let us consider $P, Q \in E(\mathbb{F}_p)$ and assume that we want to compute $k$ such that $Q = kP$. The main idea of Pollard's rho algorithm is to determine distinct pairs $(c', d')$ and $(c'', d'')$ of integers modulo $n$ such that $c'P + d'Q = c''P + d''Q$, where $n$ is the order of the subgroup $\langle P \rangle$ generated by $P$. Then one can compute

$$(c' - c'')P = (d'' - d')Q = (d'' - d')kP,$$

which implies $(c' - c'') \equiv (d'' - d')k \pmod{n}$. Thus, provided that $d'' - d'$ is invertible modulo $n$, $k \equiv (c' - c'')(d'' - d')^{-1} \pmod{n}$.

The naive method generates $c, d \in [0, n - 1]$ at random, computes $cP + dQ$ and stores each triple $(c, d, cP + dQ)$ in a table sorted by the third element, until a point $cP + dQ$ is obtained twice (this occurrence is called a "collision"). The birthday paradox¹ helps to estimate the expected number of iterations (or, equivalently, the complexity of the algorithm) before a collision is found. This number is approximately $\sqrt{\pi n/2} \approx 1.2533\sqrt{n}$.

Instead of randomly generated points, Pollard [9] proposed an iteration function acting on $\langle P \rangle$ with pseudorandom behaviour. If this function is "random enough", then the algorithm has the same expected running time as the naive method. The original Pollard function partitions $\langle P \rangle$ into three subsets $S_1, S_2, S_3$ of approximately equal size. Then, from the starting point $P_0$, it iterates $P_1 = f(P_0), P_2 = f(P_1), \dots, P_{i+1} = f(P_i)$. More precisely, it is defined as

$$f(P_i) = \begin{cases} P_i + a_1 P + b_1 Q & \text{if } P_i \in S_1, \\ 2P_i & \text{if } P_i \in S_2, \\ P_i + a_2 P + b_2 Q & \text{if } P_i \in S_3. \end{cases}$$

Some experimental results by E. Teske [11][12] showed that the running time of Pollard's algorithm tends to the expected value $\sqrt{\pi n/2}$ as the number of subsets $S_i$ increases. Moreover, the storage can be reduced to a negligible amount at the cost of some extra computation.
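The collision idea and the three-subset iteration function can be tried out on a toy instance. The sketch below is only an illustration under stated assumptions: the curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ with base point $P = (5, 1)$ of prime order $n = 19$, the partition by $x \bmod 3$, the choice $a_1 = 0, b_1 = 1, a_2 = 1, b_2 = 0$ and the names `step` and `pollard_rho` are all hypothetical. For simplicity it stores every visited point in a dictionary, so it does not yet use any of the storage reductions discussed in this section.

```python
from typing import Optional, Tuple

Point = Optional[Tuple[int, int]]  # None stands for the point at infinity

# Toy parameters (assumed): y^2 = x^3 + 2x + 2 over F_17, P of prime order 19.
P_MOD, A = 17, 2
P: Point = (5, 1)
N = 19


def ec_add(p1: Point, p2: Point) -> Point:
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def ec_mul(k: int, pt: Point) -> Point:
    result: Point = None
    while k:
        if k & 1:
            result = ec_add(result, pt)
        pt = ec_add(pt, pt)
        k >>= 1
    return result


def step(X: Point, c: int, d: int, Q: Point):
    """One application of f; the invariant X = c*P + d*Q is preserved."""
    subset = 0 if X is None else X[0] % 3
    if subset == 0:                       # S1: add Q  (a1 = 0, b1 = 1)
        return ec_add(X, Q), c, (d + 1) % N
    if subset == 1:                       # S2: double the point
        return ec_add(X, X), (2 * c) % N, (2 * d) % N
    return ec_add(X, P), (c + 1) % N, d   # S3: add P  (a2 = 1, b2 = 0)


def pollard_rho(Q: Point) -> Optional[int]:
    """Return k with Q = k*P, or None if every restart gives a useless collision."""
    for c0 in range(1, N):                # deterministic restarts
        X, c, d = ec_mul(c0, P), c0, 0
        seen = {}
        while X not in seen:              # walk until a point repeats
            seen[X] = (c, d)
            X, c, d = step(X, c, d, Q)
        c1, d1 = seen[X]                  # (c', d'); the current (c, d) is (c'', d'')
        if (d - d1) % N != 0:             # d'' - d' must be invertible mod n
            return (c1 - c) * pow((d - d1) % N, -1, N) % N
    return None
```

For example, `pollard_rho(ec_mul(7, P))` walks through $P, 2P, 9P, 18P, O_\infty, 7P, 14P$ and then returns to $2P$; the two representations $(2, 0)$ and $(5, 5)$ of $2P$ give $k \equiv (2 - 5) \cdot 5^{-1} \equiv 7 \pmod{19}$.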
By using Floyd's cycle-finding algorithm, known as "tortoise and hare" ([4], exercises 6 and 7, page 7), the storage can be reduced to a constant. Van Oorschot and Wiener [13] proposed recording only the points that satisfy a precise condition (for example, the last 30 bits of the x coordinate must all be equal to zero) for a better trade-off between space and performance. Moreover, they showed that the algorithm can be efficiently parallelized on an arbitrary number of processors: while each of them generates a random walk from a different starting point, collision detection is carried out by another designated computer.

¹ The birthday paradox can be formulated as follows: how many people must be in a room in order to expect that at least two of them have the same birthday? Surprisingly, the number is small: $\sqrt{\pi \cdot 365/2} \approx 24$.

V. CUDA

CUDA is a general purpose parallel computing architecture developed by NVIDIA. CUDA devices are programmed mainly through "C for CUDA", an extension of the C language that gives the user access to CUDA capabilities. Even though C is the main language for programming CUDA hardware, third-party wrappers are available for Python, Fortran, Java and MATLAB. At present, as reported by NVIDIA, there are millions of CUDA-capable GPUs already installed, with prices ranging from a few euros for hardware with limited computing capabilities (20-30 € for an 8400GS 256 MB video card) to thousands of euros for high-end hardware with 4 teraflops of single-precision power (Tesla C1060 with 960 cores). Some advantages offered by the CUDA architecture are: scattered reads (code has access to all memory addresses), a fast shared memory region (a region that grants really high performance and can be used by all threads together), full support for integer and bitwise operations, and fast downloads and readbacks to and from the GPU. CUDA also has some limitations that must be considered while developing software: there is no support for recursive functions on the device, division and inversion are computationally expensive, and device memory management is difficult (threads using device memory should access it in a way that avoids multiple requests to the same bank).
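Before turning to the implementation, the distinguished-point collision search of Van Oorschot and Wiener [13] recalled in Section IV can be emulated sequentially on a toy instance. Everything specific in the sketch below is an assumption made for illustration: the curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ with $P = (5, 1)$ of order 19, the hypothetical multipliers `B_COEFFS`, an add-only walk selected by the two low bits of $x$, and a distinguished-point test on the same two bits (instead of 30). Each call to `walk` plays the role of one processor, and `parallel_rho` plays the role of the designated computer that detects collisions.

```python
from typing import Optional, Tuple

Point = Optional[Tuple[int, int]]  # None stands for the point at infinity

# Toy parameters (assumed): y^2 = x^3 + 2x + 2 over F_17, P of prime order 19.
P_MOD, A = 17, 2
P: Point = (5, 1)
N = 19


def ec_add(p1: Point, p2: Point) -> Point:
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def ec_mul(k: int, pt: Point) -> Point:
    result: Point = None
    while k:
        if k & 1:
            result = ec_add(result, pt)
        pt = ec_add(pt, pt)
        k >>= 1
    return result


B_COEFFS = (3, 5, 7, 11)                      # hypothetical multipliers, r = 4
STEP_POINTS = tuple(ec_mul(b, P) for b in B_COEFFS)


def is_distinguished(X: Point) -> bool:
    """Toy criterion: the two low bits of x are zero."""
    return X is not None and (X[0] & 3) == 0


def walk(c: int, d: int, Q: Point, max_steps: int = 4 * N):
    """Add-only walk from c*P + d*Q until a distinguished point is reached."""
    X = ec_add(ec_mul(c, P), ec_mul(d, Q))
    for _ in range(max_steps):
        if is_distinguished(X):
            return X, c, d
        j = 0 if X is None else X[0] & 3      # subset index from the low bits of x
        X, c = ec_add(X, STEP_POINTS[j]), (c + B_COEFFS[j]) % N
    return None                               # walk abandoned (no distinguished point)


def parallel_rho(Q: Point, walks: int = 8) -> Optional[int]:
    table = {}                                # distinguished point -> (c, d)
    for l in range(walks):                    # each l emulates one processor
        found = walk(l, l + 1, Q)
        if found is None:
            continue
        X, c, d = found
        if X in table and table[X][1] != d:   # useful collision between two walks
            c1, d1 = table[X]
            return (c1 - c) * pow((d - d1) % N, -1, N) % N
        table[X] = (c, d)
    return None
```

Since an add-only walk never changes the coefficient of $Q$, a collision is useful only between walks started with different values of $d$, which is why each emulated processor receives its own $d = l + 1$. With $Q = 11P$, the first two walks both end at the distinguished point $(0, 6)$ and `parallel_rho(ec_mul(11, P))` recovers 11.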


VI. CUDA BASED IMPLEMENTATION OF PARALLELIZED POLLARD'S RHO ALGORITHM

Given the high processing power of current GPUs, it makes sense to take advantage of them for a parallelized Pollard's rho algorithm. As already noted, integer division and modulo operations are really expensive. Hence, one has to find suitable solutions for efficient modular arithmetic, especially when handling multiword integers and dealing with formulas (1) and (2) on an elliptic curve. Moreover, integer words should be laid out carefully in memory, because simultaneous accesses by more than one thread to the same bank cause performance losses. The implementation of the program can be described as follows:

1) The host performs the precomputations needed for Pollard's rho algorithm.
2) The precomputed data is sent to the device.
3) The host starts threads on the GPU. These threads generate pseudorandom points through the iteration function. Then they look for distinguished points (those whose x coordinate has its last 30 bits all equal to zero) and report them to the host.
4) The distinguished points are stored in a hash table, where the host looks for collisions. The algorithm stops when a collision is found.

Modular arithmetic. Modular addition is performed naively by adding single words, starting from the least significant one and propagating the carry from each single-word addition to the next. Each addition is followed by the subtraction of one modulus if the sum is greater than or equal to it. Modular subtraction is performed in an analogous way. Here one has to record the borrow of each single-word subtraction. A negative result requires the addition of one modulus. In our CUDA implementation, the most delicate part is the modular multiplication via the Montgomery product ([14], pp. 395-397) with the coarsely integrated operand scanning method (CIOS) [6]. For all the details we refer the reader to [8].

The iteration function. The starting points are linear combinations of $P$ and $Q$. Each starting point is generated with a different multiple of $P$. If $t$ threads are executed, each of them is associated with a starting point $A_l = a_l P + q_l Q$, where $0 \le l < t$ and $0 \le a_l, q_l < n$. Our iteration function is a so-called "add only" function which partitions $\langle P \rangle$ into $r$ subsets. Let $A_{l,i} = (x_{l,i}, y_{l,i})$ be the point corresponding to the $i$-th iteration of the walk of the $l$-th thread. The iteration function is defined as

$$f(A_{l,i}) = A_{l,i+1} = A_{l,i} + b_j P \quad \text{if } x_{l,i} \equiv j \pmod{r}, \qquad j = 0, \dots, r - 1, \quad l = 0, \dots, t - 1.$$

The congruence $x_{l,i} \equiv j \pmod{r}$ can be easily checked through bitwise operators if $r$ is a power of 2. In our application we choose $r = 64$.

Points representation and storage. The array $(b_0 P, \dots, b_j P, \dots, b_{r-1} P)$ is stored in affine coordinates. More precisely, since the size of these coordinates turns out to be small and since the points do not vary within the program, the array fits into the constant memory of the graphics card. This avoids memory problems due to simultaneous accesses by more than one thread to the same point. Note, moreover, that all the curve data are stored in constant memory. All the other points are represented in so-called Jacobian coordinates ([2], 3.2), which allow us to avoid division and inversion when we add points on the elliptic curve. In order to get coalesced memory access, each word of a Jacobian coordinate is stored at a location that is a multiple of 16 (i.e. the size of the bank). In this way, each thread always accesses the same memory bank.

Using just a mid-level GPU, an 8800GTS with the G92 chip, a preliminary version of our application generated about 320,000 points per second on the curve over F109 listed on the CERTICOM site.

VII. CONCLUSIONS

CUDA-enabled GPUs turn out to be a very interesting platform for high performance computing for several reasons: their wide diffusion, their rapidly and continuously increasing power, a good performance/price ratio and the possibility of installing more than one GPU in a single workstation. In this paper we have shown that graphics cards can be useful for improving performance in solving the ECDLP. We believe that our work is new, because all CERTICOM problems have so far been attacked using only CPUs (in a distributed environment). Now a mixed CPU-GPU approach can also be considered (with the GPU acting as a co-processor). At the very least, the computations for our problem can be performed entirely on the GPU, while the CPU remains free and available for other jobs.

REFERENCES

[1] CERTICOM, Certicom ECC challenge, http://www.certicom.com/, (1997).
[2] A. J. Menezes, D. Hankerson and S. Vanstone, Guide to Elliptic Curve Cryptography, Springer-Verlag New York, Inc., 2003.
[3] W. Diffie and M. E. Hellman, New Directions in Cryptography, IEEE Transactions on Information Theory, IT-22 (1976), pp. 644–654.
[4] D. E. Knuth, Art of Computer Programming, Volume 2: Seminumerical Algorithms (3rd Edition), Addison-Wesley Professional, 1997.
[5] N. Koblitz, Elliptic Curve Cryptosystems, Mathematics of Computation, 48 (1987), pp. 203–209.
[6] Ç. K. Koç and T. Acar, Analyzing and Comparing Montgomery Multiplication Algorithms, IEEE Micro, 16 (1996), pp. 26–33.
[7] V. S. Miller, Use of Elliptic Curves in Cryptography, Advances in Cryptology - CRYPTO '85: Proceedings, (1986), pp. 417+.
[8] A. Pizzirani, Il problema del logaritmo discreto sulle curve ellittiche e relazioni con la crittografia, Tesi di Laurea presso Università degli studi di Napoli "Federico II", 2009.
[9] J. M. Pollard, Monte Carlo methods for index computation mod p, Mathematics of Computation, 32 (1978), pp. 918–924.
[10] J. H. Silverman, The arithmetic of elliptic curves, vol. 106 of Graduate Texts in Mathematics, Springer, 1986.
[11] E. Teske, Speeding up Pollard's rho method for computing discrete logarithms, Lecture Notes in Computer Science, 1423 (1998), pp. 541–554.
[12] E. Teske, On random walks for Pollard's rho method, Math. Comput., 70 (2001), pp. 809–825.
[13] P. C. Van Oorschot and M. J. Wiener, Parallel collision search with cryptanalytic applications, Journal of Cryptology, 12 (1999), pp. 1–28.
[14] H. C. A. van Tilborg, Encyclopedia of cryptography and security, Springer-Verlag, 2005.
