NTRU-MLS CUDA
NTRU Modular Lattice Signature Scheme on CUDA GPUs
Wei Dai, John Schanck, Berk Sunar, William Whyte and Zhenfei Zhang
Worcester Polytechnic Institute, Worcester, MA, USA {wdai, sunar}@wpi.edu
Security Innovation, Wilmington, MA, USA {jschanck, wwhyte, zzhang}@securityinnovation.com
Outline: Motivation, NTRU-MLS, Why GPUs?, Scheme Details, Implementation, Results
Motivation
- Lattice-based signature schemes are post-quantum secure.
- Rejection sampling slows down the signing procedure.
- Previous NTRU-MLS parameters are less secure against new attacks.
- Revised parameters offer higher security and smaller keys and signatures, but require more aggressive rejection sampling, which makes signing slow.
- This slowdown is mitigated by parallel computing on CUDA-enabled GPUs.
Notations
- Ring: $R = \mathbb{Z}[x] / (x^N - 1)$
- Polynomial/vector: $f = \sum_{i=0}^{N-1} a_i x^i = \langle a_0, a_1, \ldots, a_{N-1} \rangle$
- Norm: $\|f\| = \max_{0 \le i < N} |a_i|$
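The notation above can be made concrete with a small C sketch. It stores a polynomial as an array of its N coefficients, computes the norm as the largest absolute coefficient, and shows that multiplying by x in the ring is a cyclic rotation because x^N = 1. The function names `poly_norm_inf` and `poly_mul_x` are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Infinity norm ||f|| = max_i |a_i| of a polynomial given by n coefficients. */
static int64_t poly_norm_inf(const int32_t *a, size_t n) {
    int64_t best = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t v = a[i] < 0 ? -(int64_t)a[i] : (int64_t)a[i];
        if (v > best) best = v;
    }
    return best;
}

/* Multiply by x in Z[x]/(x^N - 1): since x^N = 1, the coefficient of x^(N-1)
 * wraps around to x^0, so the coefficient vector rotates by one position. */
static void poly_mul_x(const int32_t *a, int32_t *out, size_t n) {
    assert(n > 0 && a != out);
    for (size_t i = 0; i < n; i++) {
        out[(i + 1) % n] = a[i];
    }
}
```

For example, the vector ⟨3, −7, 2⟩ has norm 7, and multiplying it by x gives ⟨2, 3, −7⟩.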
Verify
Verify a signature $s$ on document $\mu$ with respect to the public key $h$:
1. Compute $t = s * h \pmod{q}$.
2. $(s_p, t_p) = \mathrm{Hash}(h, \mu)$.
3. If $(s, t) \not\equiv (s_p, t_p) \pmod{p}$, invalid.
4. If $\|s\| > \frac{q}{2} - B_s$ or $\|t\| > \frac{q}{2} - B_t$, invalid.
5. Valid.
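The verification steps can be sketched in plain C on the host. This is a minimal sketch under stated assumptions, not the authors' implementation: the hash output `sp`, `tp` is passed in rather than computed, coefficients modulo q are taken as centered representatives (an assumed convention), and `center_mod`, `norm_inf`, `conv_mod` and `mls_verify` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Centered representative of v modulo m (assumed convention): the result lies
 * in [-m/2, m/2) for even m and in [-(m-1)/2, (m-1)/2] for odd m. */
static int64_t center_mod(int64_t v, int64_t m) {
    int64_t r = v % m;
    if (r < 0) r += m;
    if (r >= (m + 1) / 2) r -= m;
    return r;
}

static int64_t norm_inf(const int64_t *a, size_t n) {
    int64_t best = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t v = a[i] < 0 ? -a[i] : a[i];
        if (v > best) best = v;
    }
    return best;
}

/* Step 1: t = s * h in Z[x]/(x^N - 1), coefficients reduced modulo q. */
static void conv_mod(const int64_t *s, const int64_t *h, int64_t *t,
                     size_t n, int64_t q) {
    for (size_t k = 0; k < n; k++) {
        int64_t acc = 0;
        for (size_t i = 0; i < n; i++) {
            acc += s[i] * h[(k + n - i) % n];
        }
        t[k] = center_mod(acc, q);
    }
}

/* Returns 1 if the signature passes steps 3 and 4, 0 otherwise.
 * sp and tp stand in for Hash(h, mu), which is outside this sketch. */
static int mls_verify(const int64_t *s, const int64_t *h,
                      const int64_t *sp, const int64_t *tp, size_t n,
                      int64_t p, int64_t q, int64_t Bs, int64_t Bt) {
    assert(n > 0 && p > 0 && q > 0);
    int64_t *t = malloc(n * sizeof *t);
    if (t == NULL) return 0;
    conv_mod(s, h, t, n, q);

    int ok = 1;
    /* Step 3: (s, t) must be congruent to (sp, tp) modulo p. */
    for (size_t i = 0; i < n && ok; i++) {
        if ((s[i] - sp[i]) % p != 0 || (t[i] - tp[i]) % p != 0) ok = 0;
    }
    /* Step 4: both norms must stay within the rejection bounds. */
    if (ok && (norm_inf(s, n) > q / 2 - Bs || norm_inf(t, n) > q / 2 - Bt)) {
        ok = 0;
    }
    free(t);
    return ok;
}
```

With toy values N = 3, p = 3, q = 64, h = x and s = ⟨4, −2, 1⟩, the product is t = ⟨1, 4, −2⟩, so the pair is accepted for s_p = t_p = ⟨1, 1, 1⟩ and B_s = B_t = 5, and rejected once either bound is tightened below the norm of 4.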
Sign
Assume $h \in R(q/2)$ and $g^{-1} \in R(p/2)$ are ready. Sign $\mu \in \{0, 1\}^*$ with $(f, g)$:
1. $(s_p, t_p) = \mathrm{Hash}(h, \mu)$ [hash function]
2. $r \leftarrow R\left(\left\lfloor \frac{q}{2p} + \frac{1}{2} \right\rfloor\right)$ [uniform RNG]
3. $s_0 = s_p + p r$ [uniform over $R(q/2)$]
4. $t_0 = s_0 * h \pmod{q}$ [mult. in $R(q/2)$]
5. $a = (t_p - t_0) * g^{-1} \pmod{p}$ [mult. in $R(p/2)$]
6. $(s, t) = (s_0, t_0) + a * (f, g)$ [mult. in $R$]
7. If $\|s\| > \frac{q}{2} - B_s$ or $\|t\| > \frac{q}{2} - B_t$, goto Step 2.
8. Output $s$ as a signature of $\mu$.
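Steps 2, 3 and 7 are the parts that drive the rejection loop, and they can be sketched in C as follows. This is only a sketch under assumptions: R(k) is taken to mean coefficients in [−k, k], the generator `toy_rng` is a non-cryptographic stand-in for the uniform RNG, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy 64-bit LCG standing in for the uniform RNG. NOT cryptographic. */
static uint32_t toy_rng(uint64_t *state) {
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (uint32_t)(*state >> 32);
}

/* Step 2: draw r with coefficients in [-k, k], k = floor(q/(2p) + 1/2).
 * Assumes R(k) means "coefficients bounded by k in absolute value".
 * The modulo reduction has a small bias, acceptable only in a sketch. */
static void sample_r(int32_t *r, size_t n, int32_t p, int32_t q,
                     uint64_t *state) {
    assert(p > 0 && q > 0);
    int32_t k = (q + p) / (2 * p); /* equals floor(q/(2p) + 1/2) */
    for (size_t i = 0; i < n; i++) {
        r[i] = (int32_t)(toy_rng(state) % (uint32_t)(2 * k + 1)) - k;
    }
}

/* Step 3: s0 = sp + p * r, so s0 is congruent to sp modulo p. */
static void compute_s0(const int32_t *sp, const int32_t *r, int32_t *s0,
                       size_t n, int32_t p) {
    for (size_t i = 0; i < n; i++) {
        s0[i] = sp[i] + p * r[i];
    }
}

static int32_t norm_inf32(const int32_t *a, size_t n) {
    int32_t best = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t v = a[i] < 0 ? -a[i] : a[i];
        if (v > best) best = v;
    }
    return best;
}

/* Step 7: returns 1 if (s, t) is accepted, 0 if signing must go back to
 * step 2 with a fresh r. */
static int accept_candidate(const int32_t *s, const int32_t *t, size_t n,
                            int32_t q, int32_t Bs, int32_t Bt) {
    return norm_inf32(s, n) <= q / 2 - Bs && norm_inf32(t, n) <= q / 2 - Bt;
}
```

A caller would loop over `sample_r`, `compute_s0`, the multiplications of steps 4 to 6, and `accept_candidate` until a candidate is accepted.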
Product-form Keys
- Introduced to NTRUEncrypt by Hoffstein and Silverman in 2003.
- Extra parameters and new keygen:
  - $d_1, d_2, d_3$: three small integers, e.g. between 6 and 13
  - $f = p(F_1 * F_2 + F_3 + 1)$
  - $g = G_1 * G_2 + G_3 + 1$
  - $F_i$ and $G_i$ have exactly $d_i$ coefficients equal to $+1$ and $d_i$ coefficients equal to $-1$.
- Only store indices of non-zero coefficients:
  - $f$ and $g$ are stored as $(F_1, F_2, F_3)$ and $(G_1, G_2, G_3)$.
  - Each $F_i$ or $G_i$ is stored as an array of $2d_i$ indices: the first $d_i$ are the indices of the $+1$ coefficients, the rest are those of the $-1$ coefficients.
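The index-only storage makes multiplication by a product-form polynomial a series of signed cyclic shifts. The C sketch below illustrates this for a·(F1∗F2 + F3 + 1), computed as (a∗F1)∗F2 + a∗F3 + a. It is a sketch under assumptions: the factor p and any modular reduction are left out, the caller supplies scratch buffers, and `sparse_mul` and `product_form_mul` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* c = a * F in Z[x]/(x^N - 1), where F is stored as 2*d indices:
 * idx[0..d-1] are positions of +1, idx[d..2d-1] are positions of -1.
 * Each index contributes one signed cyclic shift of a. */
static void sparse_mul(const int32_t *a, const uint16_t *idx, size_t d,
                       int32_t *c, size_t n) {
    assert(a != c);
    for (size_t k = 0; k < n; k++) c[k] = 0;
    for (size_t j = 0; j < 2 * d; j++) {
        int32_t sign = (j < d) ? 1 : -1;
        size_t shift = idx[j];
        assert(shift < n);
        for (size_t i = 0; i < n; i++) {
            c[(i + shift) % n] += sign * a[i];
        }
    }
}

/* c = a * (F1 * F2 + F3 + 1), using two scratch buffers of length n.
 * The buffers a, tmp1, tmp2 and c must not overlap. */
static void product_form_mul(const int32_t *a,
                             const uint16_t *f1, size_t d1,
                             const uint16_t *f2, size_t d2,
                             const uint16_t *f3, size_t d3,
                             int32_t *tmp1, int32_t *tmp2,
                             int32_t *c, size_t n) {
    sparse_mul(a, f1, d1, tmp1, n);    /* tmp1 = a * F1        */
    sparse_mul(tmp1, f2, d2, c, n);    /* c    = (a * F1) * F2 */
    sparse_mul(a, f3, d3, tmp2, n);    /* tmp2 = a * F3        */
    for (size_t i = 0; i < n; i++) {
        c[i] += tmp2[i] + a[i];        /* add a * F3 and a * 1 */
    }
}
```

The work is roughly 2(d1 + d2 + d3)·N additions and no general multiplications, which is why small d_i values keep the cost low compared with a full N² convolution.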
CUDA-enabled GPUs
Memory    Cached  Access  Scope               Lifetime
Register  N/A     R/W     one thread          Thread
Constant  Yes     R       threads + host      Application
Texture   Yes     R       threads + host      Application
Shared    N/A     R/W     threads in a block  Block
Global    No      R/W     threads + host      Application
CPU-GPU Workflow
- 82-bit security: 1.11% accepted ≈ 90 attempts
- Host:
  - Hash
  - Allocation
  - Data to device
- Device (each block):
  - RNG, Salsa20
  - Polynomial mult.
  - Check validity
  - (Write back)
- Host:
  - Repeat, or retrieve one signature.
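The "≈ 90 attempts" figure follows from the acceptance rate if each attempt is assumed to succeed independently with probability ρ = 0.0111. The same assumption gives the chance that a launch of T parallel trials yields at least one signature; the symbols ρ and T are introduced here for illustration.

```latex
\[
\mathbb{E}[\text{attempts}] = \frac{1}{\rho} = \frac{1}{0.0111} \approx 90,
\qquad
\Pr[\text{at least one of } T \text{ trials accepted}] = 1 - (1 - \rho)^{T}.
\]
```

Under that assumption, a launch of T = 90 trials succeeds with probability about 0.63, so a launch needs several times the expected attempt count to succeed in one pass most of the time.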
CPU-GPU Workflow
- Host → Device: $s_p$, $t_p$, $h$, $g^{-1}$, $F_1$, $F_2$, $F_3$, $G_1$, $G_2$, $G_3$. Another 48 bytes for Salsa20.
- int Pos[] (length: trials per launch): 0 ··· 0 1 0 1 0
- Blocks No. 1 and No. 3 have valid signatures. Retrieve only the No. 1 signature in Sig.

Input                   Type      Bytes
$s_p$, $t_p$, $g^{-1}$  int8_t    $N$
$F_i$, $G_i$            uint16_t  $4d_i$
$h$                     int32_t   $4N$
Salsa20                 uint32_t  48
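The host-side bookkeeping on this slide can be sketched in C. The sketch assumes that `Pos[b]` is nonzero exactly when block `b` produced a valid signature, so the host scans for the first such block, and that the table's byte counts apply per array (three arrays of N bytes, six index arrays of 4·d_i bytes, one array of 4N bytes, plus 48 bytes of Salsa20 input). Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the first block flagged as holding a valid signature, or -1 if
 * no trial in this launch was accepted and the launch must be repeated. */
static long first_valid_block(const int *pos, size_t trials) {
    assert(pos != NULL);
    for (size_t b = 0; b < trials; b++) {
        if (pos[b] != 0) return (long)b;
    }
    return -1;
}

/* Bytes copied from host to device for one message, following the table. */
static size_t host_to_device_bytes(size_t n, size_t d1, size_t d2, size_t d3) {
    size_t small_polys = 3 * n;                  /* sp, tp, g^-1: int8_t       */
    size_t key_indices = 2 * 4 * (d1 + d2 + d3); /* F_i and G_i: 4*d_i each    */
    size_t public_key  = 4 * n;                  /* h: int32_t                 */
    size_t rng_input   = 48;                     /* Salsa20 input              */
    return small_polys + key_indices + public_key + rng_input;
}
```

For illustrative values N = 100 and (d1, d2, d3) = (6, 7, 8), which are not parameters from the slides, the transfer is 916 bytes.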
Polynomial Multiplication
Convolution: $N^2$ integer multiplications with $N$ threads.
Compute: C = A * B
Input: int_t A[N], B[N]
Output: int_t C[tid]
t = 0; for (i=0; i
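The per-thread work described on this slide can be sketched in plain C, with a host loop standing in for the N GPU threads. This is a sequential illustration under assumptions, not the kernel from the slides: `conv_coeff` plays the role of one thread with index `tid`, the element type is fixed to `int32_t`, and no modular reduction is applied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What one thread would compute: the coefficient C[tid] of C = A * B in
 * Z[x]/(x^N - 1), i.e. the sum over i of A[i] * B[(tid - i) mod N].
 * Each call performs N multiplications, so N threads perform N^2 in total. */
static int32_t conv_coeff(const int32_t *A, const int32_t *B, size_t n,
                          size_t tid) {
    assert(tid < n);
    int32_t t = 0;
    for (size_t i = 0; i < n; i++) {
        t += A[i] * B[(tid + n - i) % n];
    }
    return t;
}

/* Host-side stand-in for launching N threads, one per output coefficient. */
static void conv_all(const int32_t *A, const int32_t *B, int32_t *C,
                     size_t n) {
    for (size_t tid = 0; tid < n; tid++) {
        C[tid] = conv_coeff(A, B, n, tid);
    }
}
```

Because each output coefficient depends only on the read-only inputs A and B, the N calls are independent, which is what makes the one-thread-per-coefficient mapping possible.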