
Pipeline Oriented Implementation of NORX for ARM Processors

Luan Cardoso dos Santos, Julio López
November 7, 2017
Institute of Computing - UNICAMP
LASCA

Table of contents
1. Introduction
2. Target architecture
3. NORX family of AEAD algorithms
4. Pipeline optimization
5. Results
6. Future work

Introduction

Authenticated encryption (with additional data)

• An AEAD scheme is an algorithm that uses a secret key and a public nonce to process a plaintext and additional plain data, and outputs a ciphertext and authentication data [Rog02].
• Such a scheme is useful, for example, to encrypt the body of a message, keep its header in plaintext, and authenticate both together.

Figure 1: Basic block design of an AEAD.

Authenticated encryption (with additional data)

Formally:
• An AEAD scheme is defined by $\Pi = (\mathcal{K}, \mathcal{E}, \mathcal{D})$ and the associated sets $\mathrm{Nonce} = \{0,1\}^n$, $\mathrm{Header} \subset \{0,1\}^*$ and $\mathrm{Message} \subseteq \{0,1\}^*$.
• The keyspace $\mathcal{K}$ is a non-empty set of strings.
• The message $M \in \mathrm{Message}$; the nonce $N \in \mathrm{Nonce}$; the header $H \in \mathrm{Header}$.
• The encryption algorithm $\mathcal{E}_K^{N,H}(M) \to C$.
• The decryption algorithm $\mathcal{D}_K^{N,H}(C) \to \{M, \bot\}$.
• It is required that $\mathcal{D}_K^{N,H}(\mathcal{E}_K^{N,H}(M)) = M$ for all $K \in \mathcal{K}$, $N$, $H$ and $M$.
• And $|\mathcal{E}_K^{N,H}(M)| = \ell(|M|)$ for some linear-time length function $\ell$.

Cryptographic competitions: CAESAR

• CAESAR (2013-present) stands for "Competition for Authenticated Encryption: Security, Applicability, and Robustness" [CAE13].
• CAESAR aims to select a portfolio of AEAD ciphers that are suited for widespread adoption and offer advantages over NIST's AES-GCM.
• Following in the footsteps of other cryptographic competitions, such as AES (1997-2000), eSTREAM (2004-2008) and SHA-3 (2007-2012), CAESAR also aims to promote research on AEAD algorithms.


Cryptographic sponges

• A cryptographic sponge function is an algorithm with a finite internal state that takes input strings of any length and produces an output of any desired length [BDPA11].
• Sponges can be used to create hash functions, MACs, stream ciphers, RNGs and AEAD schemes.

Figure 2: The basic design of a sponge function [BDPA11].
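
The absorb-then-squeeze structure of a sponge can be sketched in a few lines of C. This is only an illustration of the construction: the 8-byte state, the 4-byte rate, the padding bytes and `toy_permutation` are assumptions made up for this example, and the permutation is a placeholder with no cryptographic strength.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RATE  4  /* bytes absorbed or squeezed per permutation call (assumed) */
#define WIDTH 8  /* state size in bytes: rate + capacity (assumed) */

/* Placeholder permutation: NOT secure, it only stands in for a real one. */
static void toy_permutation(uint8_t s[WIDTH]) {
    for (int round = 0; round < 8; round++) {
        for (int i = 0; i < WIDTH; i++) {
            /* add a neighbouring byte and a round constant, then rotate */
            s[i] = (uint8_t)(s[i] + s[(i + 1) % WIDTH] + round + 1);
            s[i] = (uint8_t)((s[i] << 3) | (s[i] >> 5));
        }
    }
}

/* Absorb `len` bytes of `in`, then squeeze `outlen` bytes into `out`. */
static void toy_sponge(const uint8_t *in, size_t len,
                       uint8_t *out, size_t outlen) {
    uint8_t s[WIDTH] = {0};
    size_t i;

    /* Absorbing phase: XOR each full block into the rate part of the state. */
    while (len >= RATE) {
        for (i = 0; i < RATE; i++) s[i] ^= in[i];
        toy_permutation(s);
        in += RATE;
        len -= RATE;
    }
    /* Final (possibly empty) block, with a simple 10*1-style padding. */
    for (i = 0; i < len; i++) s[i] ^= in[i];
    s[len] ^= 0x01;
    s[RATE - 1] ^= 0x80;
    toy_permutation(s);

    /* Squeezing phase: read the rate part, permute, repeat as needed. */
    while (outlen > 0) {
        size_t n = outlen < RATE ? outlen : RATE;
        memcpy(out, s, n);
        out += n;
        outlen -= n;
        if (outlen > 0) toy_permutation(s);
    }
}
```

Only the first `RATE` bytes of the state are ever written to or read from directly; the remaining bytes (the capacity) are touched only by the permutation, which is the property the sponge construction relies on.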


Target architecture

ARM processors

• ARM (Advanced RISC Machine) is a mainly 32-bit architecture owned by the British company ARM Holdings.
• With more than 100 billion chips deployed as of 2017, it is one of the most widespread architectures today.¹
• ARM is a load/store architecture in which most instructions execute in a single clock cycle.
• In this work, we focused on the Cortex-A family: Cortex-A7, Cortex-A15 and Cortex-A53.

¹ https://community.arm.com/processors/b/blog/posts/inside-the-numbers-100-billion-arm-based-chips-1345571105

ARM processors: Target cores i

• Cortex-A7: The most efficient ARMv7-A core, with over a billion units shipped. It supports 40-bit physical addressing and features an eight-stage in-order pipeline. It can be paired with higher-performance cores in big.LITTLE configurations.
• Cortex-A15: A high-performance ARMv7-A core, well suited to consumer devices such as smartphones and to embedded applications. Like other processors of the same line, it supports 40-bit physical addressing. It features a fifteen-stage, out-of-order, superscalar integer pipeline.

ARM processors: Target cores ii

• Cortex-A53: An ARMv8-A core that runs both 32-bit and 64-bit code, designed as an efficient 64-bit core with a small area and power footprint. Like the Cortex-A7, it can be deployed alongside high-end CPUs in chips with heterogeneous cores. The Cortex-A53 uses an eight-stage, 2-way superscalar, in-order pipeline.

For completeness, our tests were also carried out on the Cortex-M4, M3 and M0.


NORX family of AEAD algorithms

NORX AEAD

• NORX is a family of AEAD algorithms, currently in the third round of CAESAR.
• Based on a sponge design, it is a simple yet fast algorithm, optimized for both 32-bit and 64-bit architectures.
• The design of NORX also allows for arbitrary parallelism in the payload processing.
• Based on ARX² primitives, NORX is optimized for both software and hardware implementations, with a SIMD-friendly core permutation and no secret-dependent memory accesses.

² Addition-Rotation-Xor

NORX AEAD

• The naming convention for NORX is NORXw-l-p-t, where:
  - w is the size in bits of the words in the internal state.
  - l is the number of rounds.
  - p is the parallelism degree.
  - t is the length in bits of the authentication tag. It is omitted when t = 4w.
• The key length of NORX is k = 4w. The 32-bit algorithm therefore has a security level of 128 bits, and the 64-bit algorithm a security level of 256 bits.
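
As a small worked example of the naming rule and of k = 4w, the sketch below formats an instance name and derives the key length from the parameters. The `norx_params` struct and both function names are hypothetical helpers introduced for illustration, and the hyphenated name format is assumed from the NORX specification.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical parameter record; field names follow the slide's w, l, p, t. */
typedef struct {
    unsigned w; /* word size in bits (32 or 64) */
    unsigned l; /* number of rounds */
    unsigned p; /* parallelism degree */
    unsigned t; /* tag length in bits */
} norx_params;

/* Key length in bits: k = 4w. */
static unsigned norx_key_bits(const norx_params *np) {
    return 4 * np->w;
}

/* Writes the instance name, appending "-t" only when t differs from 4w.
 * Returns the number of characters written, as snprintf does. */
static int norx_name(const norx_params *np, char *buf, size_t n) {
    if (np->t == 4 * np->w)
        return snprintf(buf, n, "NORX%u-%u-%u", np->w, np->l, np->p);
    return snprintf(buf, n, "NORX%u-%u-%u-%u", np->w, np->l, np->p, np->t);
}
```

With w = 64, l = 4, p = 1 and t = 256 this yields the name NORX64-4-1 and a 256-bit key; with w = 32 and a 96-bit tag the tag length stays in the name.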


NORX’s mode of operation i

The state is transformed in each step of the cipher using a nonlinear permutation $F^l$, that is, l applications of the round function F.

Figure 3: The layout of NORX [AJN14].


NORX’s mode of operation ii

Figure 4: The layout of NORX, with parallel payload processing [AJN14].


NORX’s core permutation

• The core of NORX is a 16-word internal state S, which can be viewed as a 4 × 4 matrix:

$$S = \begin{pmatrix} s_0 & s_1 & s_2 & s_3 \\ s_4 & s_5 & s_6 & s_7 \\ s_8 & s_9 & s_{10} & s_{11} \\ s_{12} & s_{13} & s_{14} & s_{15} \end{pmatrix}$$


Pipeline optimization

Original permutation

The permutation can be visually represented as:

Figure 5: Column (left) and diagonal (right) steps. Source: NORX v3.0 specification [AJN14].


Original permutation

The NORX permutation is built from a function G(), applied to the columns and diagonals of S:

Algorithm 1 NORX F round function
function F
    input: S, G()    ▷ NORX state s0 · · · s15 and the G() function
    s0, s4, s8, s12 ← G(s0, s4, s8, s12)    ▷ Processing the columns
    s1, s5, s9, s13 ← G(s1, s5, s9, s13)
    s2, s6, s10, s14 ← G(s2, s6, s10, s14)
    s3, s7, s11, s15 ← G(s3, s7, s11, s15)
    s0, s5, s10, s15 ← G(s0, s5, s10, s15)    ▷ Processing the diagonals
    s1, s6, s11, s12 ← G(s1, s6, s11, s12)
    s2, s7, s8, s13 ← G(s2, s7, s8, s13)
    s3, s4, s9, s14 ← G(s3, s4, s9, s14)
    output: S
end function

Original permutation

With G(a, b, c, d) being defined as:

Algorithm 2 NORX G permutation function
function G
    input: a, b, c, d    ▷ Four words of the state
    a ← (a ⊕ b) ⊕ ((a ∧ b) ≪ 1)
    d ← (a ⊕ d) ⋙ r0
    c ← (c ⊕ d) ⊕ ((c ∧ d) ≪ 1)
    b ← (c ⊕ b) ⋙ r1
    a ← (a ⊕ b) ⊕ ((a ∧ b) ≪ 1)
    d ← (a ⊕ d) ⋙ r2
    c ← (c ⊕ d) ⊕ ((c ∧ d) ≪ 1)
    b ← (c ⊕ b) ⋙ r3
    output: a, b, c, d
end function

Here ≪ is a left shift and ⋙ is a right rotation by the constants r0, ..., r3.
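
Algorithms 1 and 2 translate almost line for line into C. The sketch below assumes the 32-bit variant (w = 32) and the rotation offsets (8, 11, 16, 31) from the NORX specification, which the slides do not list. The names `G`, `F`, `H` and `ROTR` mirror the pseudocode and are not taken from the authors' code.

```c
#include <stdint.h>

/* NORX32 rotation offsets (assumed from the NORX specification). */
enum { R0 = 8, R1 = 11, R2 = 16, R3 = 31 };

/* Right rotation of a 32-bit word by n bits, 0 < n < 32. */
#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))
/* The nonlinear step (a xor b) xor ((a and b) << 1). */
#define H(a, b)    (((a) ^ (b)) ^ (((a) & (b)) << 1))

/* Algorithm 2: every line depends on the result of the previous one. */
static void G(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d) {
    *a = H(*a, *b);  *d = ROTR(*a ^ *d, R0);
    *c = H(*c, *d);  *b = ROTR(*b ^ *c, R1);
    *a = H(*a, *b);  *d = ROTR(*a ^ *d, R2);
    *c = H(*c, *d);  *b = ROTR(*b ^ *c, R3);
}

/* Algorithm 1: G on the four columns, then on the four diagonals. */
static void F(uint32_t s[16]) {
    G(&s[0], &s[4], &s[8],  &s[12]);
    G(&s[1], &s[5], &s[9],  &s[13]);
    G(&s[2], &s[6], &s[10], &s[14]);
    G(&s[3], &s[7], &s[11], &s[15]);
    G(&s[0], &s[5], &s[10], &s[15]);
    G(&s[1], &s[6], &s[11], &s[12]);
    G(&s[2], &s[7], &s[8],  &s[13]);
    G(&s[3], &s[4], &s[9],  &s[14]);
}
```

Within one call to `G` the eight assignments form a single dependency chain, while the four calls inside each half of `F` touch disjoint words and are independent of each other.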

How can we improve this permutation?


Code profiling

A synthetic test that encrypts random data was profiled to identify hotspots. The results point to roundF as the best target for optimization.

Figure 6: Profiling results

Optimizing the F() function

The G() function can be split and reorganized to make better use of the processor's pipeline:

Figure 7: Column and diagonal steps, with two-way pipeline optimization. Notice that each call to G2() operates on 8 words.
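
One way to read the two-way variant is as two independent G() computations interleaved statement by statement, so that consecutive instructions do not depend on each other. The sketch below follows that reading for 32-bit words; the `G2` macro body, the pairing of columns and diagonals, and the rotation offsets (taken from the NORX specification) are assumptions, not the authors' code. `F_ref` is the straightforward version, kept for comparison.

```c
#include <stdint.h>

enum { R0 = 8, R1 = 11, R2 = 16, R3 = 31 }; /* NORX32 offsets (assumed) */

#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))
#define H(a, b)    (((a) ^ (b)) ^ (((a) & (b)) << 1))

/* Reference G: a single chain of dependent operations. */
static void G(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d) {
    *a = H(*a, *b);  *d = ROTR(*a ^ *d, R0);
    *c = H(*c, *d);  *b = ROTR(*b ^ *c, R1);
    *a = H(*a, *b);  *d = ROTR(*a ^ *d, R2);
    *c = H(*c, *d);  *b = ROTR(*b ^ *c, R3);
}

/* G2: two G computations on disjoint words, interleaved line by line. */
#define G2(a0, b0, c0, d0, a1, b1, c1, d1) do {            \
    a0 = H(a0, b0);           a1 = H(a1, b1);              \
    d0 = ROTR(a0 ^ d0, R0);   d1 = ROTR(a1 ^ d1, R0);      \
    c0 = H(c0, d0);           c1 = H(c1, d1);              \
    b0 = ROTR(b0 ^ c0, R1);   b1 = ROTR(b1 ^ c1, R1);      \
    a0 = H(a0, b0);           a1 = H(a1, b1);              \
    d0 = ROTR(a0 ^ d0, R2);   d1 = ROTR(a1 ^ d1, R2);      \
    c0 = H(c0, d0);           c1 = H(c1, d1);              \
    b0 = ROTR(b0 ^ c0, R3);   b1 = ROTR(b1 ^ c1, R3);      \
} while (0)

/* Round function with eight sequential G calls. */
static void F_ref(uint32_t s[16]) {
    G(&s[0], &s[4], &s[8],  &s[12]);  G(&s[1], &s[5], &s[9],  &s[13]);
    G(&s[2], &s[6], &s[10], &s[14]);  G(&s[3], &s[7], &s[11], &s[15]);
    G(&s[0], &s[5], &s[10], &s[15]);  G(&s[1], &s[6], &s[11], &s[12]);
    G(&s[2], &s[7], &s[8],  &s[13]);  G(&s[3], &s[4], &s[9],  &s[14]);
}

/* Same round function with four G2 calls, 8 words each. */
static void F_g2(uint32_t s[16]) {
    G2(s[0], s[4], s[8],  s[12], s[1], s[5], s[9],  s[13]);
    G2(s[2], s[6], s[10], s[14], s[3], s[7], s[11], s[15]);
    G2(s[0], s[5], s[10], s[15], s[1], s[6], s[11], s[12]);
    G2(s[2], s[7], s[8],  s[13], s[3], s[4], s[9],  s[14]);
}
```

Because the two halves of each `G2` call never share a word, the interleaving changes only the instruction order, not the result; whether it actually helps depends on the core's issue width and on how the compiler schedules the code.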


Optimizing the F() function

Or even further, with a single G4() function operating on the whole state at once:

Figure 8: Column and diagonal steps, with four-way pipeline optimization, operating on the whole state at once. In the diagonal step (right), rows 1, 2 and 3 of the state appear rotated left by 1, 2 and 3 positions, so that the diagonals line up as columns.
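
The four-way variant can be sketched the same way: all four G() computations of a step are interleaved, so each line of the macro below carries four independent operations. The `G4` macro, the function names and the 32-bit rotation offsets (from the NORX specification) are assumptions for illustration; `F_ref` is the plain version used as a cross-check.

```c
#include <stdint.h>

enum { R0 = 8, R1 = 11, R2 = 16, R3 = 31 }; /* NORX32 offsets (assumed) */

#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))
#define H(a, b)    (((a) ^ (b)) ^ (((a) & (b)) << 1))

/* One step of G applied to four disjoint (a, b, c, d) groups at once. */
#define G4(a0,b0,c0,d0, a1,b1,c1,d1, a2,b2,c2,d2, a3,b3,c3,d3) do {           \
    a0 = H(a0,b0); a1 = H(a1,b1); a2 = H(a2,b2); a3 = H(a3,b3);               \
    d0 = ROTR(a0^d0,R0); d1 = ROTR(a1^d1,R0);                                 \
    d2 = ROTR(a2^d2,R0); d3 = ROTR(a3^d3,R0);                                 \
    c0 = H(c0,d0); c1 = H(c1,d1); c2 = H(c2,d2); c3 = H(c3,d3);               \
    b0 = ROTR(b0^c0,R1); b1 = ROTR(b1^c1,R1);                                 \
    b2 = ROTR(b2^c2,R1); b3 = ROTR(b3^c3,R1);                                 \
    a0 = H(a0,b0); a1 = H(a1,b1); a2 = H(a2,b2); a3 = H(a3,b3);               \
    d0 = ROTR(a0^d0,R2); d1 = ROTR(a1^d1,R2);                                 \
    d2 = ROTR(a2^d2,R2); d3 = ROTR(a3^d3,R2);                                 \
    c0 = H(c0,d0); c1 = H(c1,d1); c2 = H(c2,d2); c3 = H(c3,d3);               \
    b0 = ROTR(b0^c0,R3); b1 = ROTR(b1^c1,R3);                                 \
    b2 = ROTR(b2^c2,R3); b3 = ROTR(b3^c3,R3);                                 \
} while (0)

/* Reference G and round function, for comparison only. */
static void G(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d) {
    *a = H(*a, *b);  *d = ROTR(*a ^ *d, R0);
    *c = H(*c, *d);  *b = ROTR(*b ^ *c, R1);
    *a = H(*a, *b);  *d = ROTR(*a ^ *d, R2);
    *c = H(*c, *d);  *b = ROTR(*b ^ *c, R3);
}

static void F_ref(uint32_t s[16]) {
    G(&s[0], &s[4], &s[8],  &s[12]);  G(&s[1], &s[5], &s[9],  &s[13]);
    G(&s[2], &s[6], &s[10], &s[14]);  G(&s[3], &s[7], &s[11], &s[15]);
    G(&s[0], &s[5], &s[10], &s[15]);  G(&s[1], &s[6], &s[11], &s[12]);
    G(&s[2], &s[7], &s[8],  &s[13]);  G(&s[3], &s[4], &s[9],  &s[14]);
}

/* Round function as two G4 steps: columns, then diagonals. */
static void F_g4(uint32_t s[16]) {
    G4(s[0], s[4], s[8],  s[12],  s[1], s[5], s[9],  s[13],
       s[2], s[6], s[10], s[14],  s[3], s[7], s[11], s[15]);
    G4(s[0], s[5], s[10], s[15],  s[1], s[6], s[11], s[12],
       s[2], s[7], s[8],  s[13],  s[3], s[4], s[9],  s[14]);
}
```

The diagonal step needs no data movement in this form: the second `G4` call simply names the words in diagonal order. Keeping sixteen words live at once may exceed the register file on 32-bit ARM cores, so the gain over the two-way version is something to measure rather than assume.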


Additional optimizations

A few extra steps were taken to improve code performance:
• Extensive use of preprocessor macros and code inlining.
• Avoiding extra or temporary variables by encrypting and decrypting in place.
• Initializing the sponge by loading constant values instead of evaluating $F^2(0 \parallel 1 \parallel 2 \parallel \cdots \parallel 15)$.
• Where possible, combining shift and rotate operations with arithmetic ones, so as to allow the use of ARM's barrel shifter. For example a=a+b
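
To illustrate the barrel-shifter point, consider the nonlinear step of G(). Written as a single expression, the left shift can be folded into the final XOR on 32-bit ARM, since data-processing instructions accept a shifted second operand. The instruction sequence in the comment is one plausible compiler output with illustrative register names, not output captured from the authors' build, and the helper name `norx_h` is hypothetical.

```c
#include <stdint.h>

/*
 * (a ^ b) ^ ((a & b) << 1)
 *
 * A possible A32 encoding, three instructions with no separate shift:
 *   and  r2, r0, r1
 *   eor  r0, r0, r1
 *   eor  r0, r0, r2, lsl #1    @ shift applied by the barrel shifter
 */
static inline uint32_t norx_h(uint32_t a, uint32_t b) {
    return (a ^ b) ^ ((a & b) << 1);
}
```

Keeping the shift inside the same expression as the XOR gives the compiler the chance to emit the shifted-operand form; storing the shifted value in a named temporary first usually still works after optimization, but makes the intent less obvious.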
