Pipeline Oriented Implementation of NORX for ARM Processors
Luan Cardoso dos Santos, Julio López
November 7, 2017
Institute of Computing - UNICAMP
LASCA
Table of contents
1. Introduction
2. Target architecture
3. NORX family of AEAD algorithms
4. Pipeline optimization
5. Results
6. Future work
Introduction
Authenticated encryption (with additional data)
• An AEAD scheme is an algorithm that uses a secret key and a public nonce to process a plaintext and additional plain data, producing a ciphertext and authentication data [Rog02].
• Such a scheme is useful, for example, to encrypt the body of a message, keep its header in plaintext, and authenticate the whole.
Figure 1: Basic block design of an AEAD.
Authenticated encryption (with additional data)
Formally:
• An AEAD scheme is defined by $\Pi = (\mathcal{K}, \mathcal{E}, \mathcal{D})$ and the associated sets $\mathrm{Nonce} = \{0,1\}^n$, $\mathrm{Header} \subset \{0,1\}^*$ and $\mathrm{Message} \subseteq \{0,1\}^*$.
• The keyspace $\mathcal{K}$ is a non-empty set of strings.
• The message $M \in \mathrm{Message}$; the nonce $N \in \mathrm{Nonce}$; the header $H \in \mathrm{Header}$.
• The encryption algorithm $\mathcal{E}_K^{N,H}(M) \to C$.
• The decryption algorithm $\mathcal{D}_K^{N,H}(C) \to \{M, \bot\}$.
• It is required that $\mathcal{D}_K^{N,H}(\mathcal{E}_K^{N,H}(M)) = M$ for all $K \in \mathcal{K}$, $N$, $H$ and $M$.
• And $|\mathcal{E}_K^{N,H}(M)| = \ell(|M|)$ for some linear-time length function $\ell$.
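The interface above can be illustrated with a minimal sketch. This is a toy encrypt-then-MAC construction built from SHA-256 and HMAC, chosen only because it fits in a few lines; it is not NORX and not a vetted design. The names `encrypt`, `decrypt` and `TAG_LEN` are hypothetical, and `None` stands in for ⊥.

```python
import hashlib
import hmac

TAG_LEN = 32  # bytes of HMAC-SHA256 output appended to the ciphertext


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key || nonce || counter (illustration only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "little")).digest()
        counter += 1
    return out[:length]


def _tag(key: bytes, nonce: bytes, header: bytes, body: bytes) -> bytes:
    """Authenticate nonce, header and encrypted body, with length framing."""
    msg = len(nonce).to_bytes(8, "little") + nonce
    msg += len(header).to_bytes(8, "little") + header + body
    return hmac.new(key, msg, hashlib.sha256).digest()


def _subkeys(key: bytes):
    """Derive separate encryption and MAC keys from the secret key."""
    return (hashlib.sha256(b"enc" + key).digest(),
            hashlib.sha256(b"mac" + key).digest())


def encrypt(key: bytes, nonce: bytes, header: bytes, message: bytes) -> bytes:
    """E_K^{N,H}(M) -> C, with |C| = |M| + TAG_LEN."""
    enc_key, mac_key = _subkeys(key)
    stream = _keystream(enc_key, nonce, len(message))
    body = bytes(m ^ k for m, k in zip(message, stream))
    return body + _tag(mac_key, nonce, header, body)


def decrypt(key: bytes, nonce: bytes, header: bytes, ciphertext: bytes):
    """D_K^{N,H}(C) -> M, or None (standing in for ⊥) if authentication fails."""
    if len(ciphertext) < TAG_LEN:
        return None
    enc_key, mac_key = _subkeys(key)
    body, tag = ciphertext[:-TAG_LEN], ciphertext[-TAG_LEN:]
    if not hmac.compare_digest(tag, _tag(mac_key, nonce, header, body)):
        return None
    stream = _keystream(enc_key, nonce, len(body))
    return bytes(c ^ k for c, k in zip(body, stream))
```

Here the length function is ℓ(|M|) = |M| + 32 bytes, and changing either the header or the ciphertext makes `decrypt` return `None`.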
Cryptographic competitions: CAESAR
• CAESAR (2013-) stands for "Competition for Authenticated Encryption: Security, Applicability, and Robustness" [CAE13].
• CAESAR aims to select a portfolio of AEAD ciphers that are suited for widespread adoption and offer advantages over NIST's AES-GCM.
• Following in the footsteps of other cryptographic competitions, such as AES (1997-2000), eSTREAM (2004-2008) and SHA-3 (2007-2012), CAESAR also aims to promote research on AEAD algorithms.
Cryptographic sponges
• A cryptographic sponge function is an algorithm with a finite internal state that receives input strings of any length and produces an output of the desired length [BDPA11].
• Sponges can be used to create hash functions, MACs, stream ciphers, RNGs and AEAD schemes.
Figure 2: The basic design of a sponge function [BDPA11].
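The absorb/squeeze structure of a sponge can be sketched as follows. This is a minimal illustration under loud assumptions: `toy_permutation` is a made-up, insecure byte-mixing permutation, and the rate, capacity and padding rule are arbitrary choices for the example; none of them come from NORX or from [BDPA11].

```python
RATE = 8       # bytes absorbed or squeezed per permutation call (assumed)
CAPACITY = 8   # bytes of state never directly exposed (assumed)


def toy_permutation(state):
    """Hypothetical, insecure mixing step; each byte update is invertible."""
    s = list(state)
    for rnd in range(8):
        for i in range(len(s)):
            nxt = s[(i + 1) % len(s)]
            rot = ((nxt << 1) | (nxt >> 7)) & 0xFF  # rotate byte left by 1
            s[i] = (s[i] + rot + rnd + i) & 0xFF
    return s


def sponge(message: bytes, out_len: int) -> bytes:
    """Absorb `message` block by block, then squeeze `out_len` bytes."""
    # Padding: append 0x01, zero-fill to a multiple of RATE, set the top bit.
    padded = bytearray(message) + b"\x01"
    while len(padded) % RATE:
        padded.append(0)
    padded[-1] |= 0x80

    state = [0] * (RATE + CAPACITY)

    # Absorbing phase: XOR each block into the rate part, then permute.
    for off in range(0, len(padded), RATE):
        for i in range(RATE):
            state[i] ^= padded[off + i]
        state = toy_permutation(state)

    # Squeezing phase: read the rate part, permute, repeat as needed.
    out = bytearray()
    while len(out) < out_len:
        out += bytes(state[:RATE])
        state = toy_permutation(state)
    return bytes(out[:out_len])
```

Because the output length is a parameter, the same function can serve as a fixed-length hash or as an arbitrarily long stream; a shorter output is always a prefix of a longer one for the same input.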
Target architecture
ARM processors
• The Advanced RISC Machine (ARM) is a mainly 32-bit architecture owned by the British company ARM Holdings.
• With more than 100 billion chips deployed up to 2017, it is one of the most widespread architectures today.1
• ARM follows a load/store architecture, and most instructions execute in a single clock cycle.
• In this work, we focused on the Cortex-A family: Cortex-A7, Cortex-A15 and Cortex-A53.
1 https://community.arm.com/processors/b/blog/posts/inside-the-numbers-100-billion-arm-based-chips-1345571105
ARM processors: Target cores i
• Cortex-A7: The most efficient ARMv7-A core, with over a billion units shipped. It is capable of 40-bit physical addressing and features an eight-stage in-order pipeline. It can be paired with high-performance cores in big.LITTLE configurations.
• Cortex-A15: A high-performance ARMv7-A core, well suited to consumer devices such as smartphones and to embedded applications. Like other processors of the same line, it is capable of 40-bit physical addressing. It also features a fifteen-stage out-of-order superscalar pipeline for integer calculations.
ARM processors: Target cores ii
• Cortex-A53: An ARMv8-A core capable of seamlessly running both 32-bit and 64-bit code, designed as an efficient 64-bit core with a low area and power footprint. Like the Cortex-A7, it can be deployed together with high-end CPUs in chips with heterogeneous cores. The Cortex-A53 uses an efficient eight-stage, 2-way superscalar, in-order pipeline.
For completeness, our tests were also carried out on the Cortex-M4, M3 and M0.
NORX family of AEAD algorithms
NORX AEAD
• NORX is a family of AEAD algorithms, currently in the third round of CAESAR.
• Based on a sponge design, it is a simple yet fast algorithm, optimized for both 32-bit and 64-bit architectures.
• The design of NORX also allows for arbitrary parallelism in the payload processing.
• Based on ARX (Addition-Rotation-XOR) primitives, NORX is optimized for both software and hardware implementations, with a SIMD-friendly core permutation and no secret-dependent memory access.
NORX AEAD
• The naming convention for NORX is NORXw-l-p-t, where:
  • w is the bit size of the words in the internal state.
  • l is the number of rounds.
  • p is the parallelism degree.
  • t is the bit length of the authentication tag. When t = 4w, it is omitted.
• The key length of NORX is k = 4w; therefore, the 32-bit algorithm has a security level of 128 bits, while the 64-bit algorithm has a security level of 256 bits.
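The parameter rules above can be written out as a short sketch. It assumes instance names are spelled with hyphens, as in `NORX64-4-1`; the helper `parse_norx_name` and the `NorxParams` type are hypothetical names introduced only for this illustration.

```python
import re
from typing import NamedTuple


class NorxParams(NamedTuple):
    w: int  # word size in bits
    l: int  # number of rounds
    p: int  # parallelism degree
    t: int  # authentication tag length in bits

    @property
    def key_bits(self) -> int:
        """Key length k = 4w, as stated in the naming rules."""
        return 4 * self.w


def parse_norx_name(name: str) -> NorxParams:
    """Parse 'NORXw-l-p' or 'NORXw-l-p-t' (hyphenated spelling assumed)."""
    m = re.fullmatch(r"NORX(\d+)-(\d+)-(\d+)(?:-(\d+))?", name)
    if m is None:
        raise ValueError(f"not a NORX instance name: {name!r}")
    w, l, p = (int(m.group(i)) for i in (1, 2, 3))
    # The tag length is omitted from the name when it equals 4w.
    t = int(m.group(4)) if m.group(4) else 4 * w
    return NorxParams(w, l, p, t)
```

For example, `parse_norx_name("NORX64-4-1")` yields a 256-bit tag and a 256-bit key, while a 32-bit instance such as `NORX32-4-1` yields 128 bits for both.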
NORX’s mode of operation i
The state is transformed in each step of the cipher using a nonlinear permutation $F^l$, that is, $l$ rounds of the round function $F$.
Figure 3: The layout of NORX [AJN14].
NORX’s mode of operation ii
Figure 4: The layout of NORX, with parallel payload processing [AJN14].
NORX’s core permutation
• The core of NORX is a 16-word internal state $S$, which can be viewed as a 4 × 4 matrix:
$$S = \begin{pmatrix} s_0 & s_1 & s_2 & s_3 \\ s_4 & s_5 & s_6 & s_7 \\ s_8 & s_9 & s_{10} & s_{11} \\ s_{12} & s_{13} & s_{14} & s_{15} \end{pmatrix}$$
Pipeline optimization
Original permutation
The permutation can be visually represented as:
Figure 5: Column (left) and diagonal (right) steps. Source: NORX v3.0 specification [AJN14].
Original permutation
The NORX permutation is built from a function G(), applied to the columns and diagonals of S:

Algorithm 1 NORX F round function
function F
    input: S, G()    ▷ NORX state s0 ... s15 and G() function
    s0, s4, s8, s12 ← G(s0, s4, s8, s12)    ▷ Processing the columns
    s1, s5, s9, s13 ← G(s1, s5, s9, s13)
    s2, s6, s10, s14 ← G(s2, s6, s10, s14)
    s3, s7, s11, s15 ← G(s3, s7, s11, s15)
    s0, s5, s10, s15 ← G(s0, s5, s10, s15)    ▷ Processing the diagonals
    s1, s6, s11, s12 ← G(s1, s6, s11, s12)
    s2, s7, s8, s13 ← G(s2, s7, s8, s13)
    s3, s4, s9, s14 ← G(s3, s4, s9, s14)
    output: S
end function
Original permutation
With G(a, b, c, d) being defined as:

Algorithm 2 NORX G permutation function
function G
    input: a, b, c, d    ▷ Four words of the state
    a ← (a ⊕ b) ⊕ ((a ∧ b) ≪ 1)
    d ← (a ⊕ d) ≫ r0
    c ← (c ⊕ d) ⊕ ((c ∧ d) ≪ 1)
    b ← (c ⊕ b) ≫ r1
    a ← (a ⊕ b) ⊕ ((a ∧ b) ≪ 1)
    d ← (a ⊕ d) ≫ r2
    c ← (c ⊕ d) ⊕ ((c ∧ d) ≪ 1)
    b ← (c ⊕ b) ≫ r3
    output: a, b, c, d
end function

How can we improve this permutation?
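Algorithms 1 and 2 can be transcribed almost line by line. The sketch below is a plain reference version, not the optimized implementation discussed in this talk. The word size and the rotation offsets `R` are assumptions taken from the 64-bit variant of the NORX specification as recalled here, and should be checked against [AJN14] before use.

```python
W = 64                    # word size in bits (64-bit variant assumed)
MASK = (1 << W) - 1
R = (8, 19, 40, 63)       # r0..r3 for 64-bit words (assumed; verify in the spec)


def rotr(x: int, n: int) -> int:
    """Rotate a W-bit word right by n bits."""
    return ((x >> n) | (x << (W - n))) & MASK


def H(x: int, y: int) -> int:
    """The nonlinear step (x ⊕ y) ⊕ ((x ∧ y) ≪ 1), truncated to W bits."""
    return ((x ^ y) ^ ((x & y) << 1)) & MASK


def G(a: int, b: int, c: int, d: int):
    """Algorithm 2: mix four words of the state."""
    a = H(a, b); d = rotr(a ^ d, R[0])
    c = H(c, d); b = rotr(b ^ c, R[1])
    a = H(a, b); d = rotr(a ^ d, R[2])
    c = H(c, d); b = rotr(b ^ c, R[3])
    return a, b, c, d


COLUMNS = ((0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15))
DIAGONALS = ((0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14))


def F(state):
    """Algorithm 1: apply G to the four columns, then to the four diagonals."""
    s = list(state)  # work on a copy so the caller's state is untouched
    for i0, i1, i2, i3 in COLUMNS + DIAGONALS:
        s[i0], s[i1], s[i2], s[i3] = G(s[i0], s[i1], s[i2], s[i3])
    return s
```

Within each group of four G() calls, no call reads a word written by another, so the four calls are mutually independent.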
Code profiling
A synthetic test that encrypts random data was profiled to identify hotspots. The results show that roundF is the best target for optimization.
Figure 6: Profiling results
Optimizing the F() function
The G() function can be split and reorganized in order to make better use of the processor's pipeline:
Figure 7: Column and diagonal steps, with two-way pipeline optimization. Notice that each call to function G2() operates over 8 words.
Optimizing the F() function
Or even further, with a single G4() function operating on the whole state at once:
Figure 8: Column and diagonal steps, with four-way pipeline optimization, operating over the whole state at once.
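The reordering behind the four-way variant can be checked with a small sketch. It only demonstrates that interleaving the steps of four independent G() calls gives the same result as calling G() four times; Python cannot show the pipeline benefit itself. The names `G4`, `F_pipelined` and `F_reference` are hypothetical, the row rotations used for the diagonal step are one possible way to line the diagonals up as columns, and the 64-bit rotation offsets are assumed from the NORX specification.

```python
import random

W = 64
MASK = (1 << W) - 1
R = (8, 19, 40, 63)  # assumed 64-bit rotation offsets; verify in the spec


def rotr(x, n):
    return ((x >> n) | (x << (W - n))) & MASK


def H(x, y):
    return ((x ^ y) ^ ((x & y) << 1)) & MASK


def G(a, b, c, d):
    """Reference G: one (a, b, c, d) group at a time."""
    for r_d, r_b in ((R[0], R[1]), (R[2], R[3])):
        a = H(a, b); d = rotr(a ^ d, r_d)
        c = H(c, d); b = rotr(b ^ c, r_b)
    return a, b, c, d


def F_reference(state):
    s = list(state)
    groups = ((0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
              (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14))
    for i0, i1, i2, i3 in groups:
        s[i0], s[i1], s[i2], s[i3] = G(s[i0], s[i1], s[i2], s[i3])
    return s


def G4(A, B, C, D):
    """Each step of G is issued for all four lanes before the next step,
    so consecutive operations do not depend on each other."""
    lanes = range(4)
    for r_d, r_b in ((R[0], R[1]), (R[2], R[3])):
        A = [H(A[i], B[i]) for i in lanes]
        D = [rotr(A[i] ^ D[i], r_d) for i in lanes]
        C = [H(C[i], D[i]) for i in lanes]
        B = [rotr(B[i] ^ C[i], r_b) for i in lanes]
    return A, B, C, D


def rot_left(row, n):
    return row[n:] + row[:n]


def F_pipelined(state):
    A, B, C, D = (list(state[i:i + 4]) for i in (0, 4, 8, 12))
    A, B, C, D = G4(A, B, C, D)                               # column step
    B, C, D = rot_left(B, 1), rot_left(C, 2), rot_left(D, 3)  # align diagonals
    A, B, C, D = G4(A, B, C, D)                               # diagonal step
    B, C, D = rot_left(B, 3), rot_left(C, 2), rot_left(D, 1)  # restore rows
    return A + B + C + D


rng = random.Random(2017)
sample = [rng.getrandbits(W) for _ in range(16)]
assert F_pipelined(sample) == F_reference(sample)
```

In a C implementation the same ordering would typically be expressed with macros over sixteen named words, leaving the compiler free to schedule the independent operations back to back.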
Additional optimizations
A few extra steps were taken to improve code performance:
• Extensive use of preprocessor macros and code inlining.
• Avoiding extra or temporary variables, encrypting and decrypting in place.
• Initialization of the sponge via loads of constant values instead of evaluating $F^2(0 \parallel 1 \parallel 2 \parallel \cdots \parallel 15)$.
• Where possible, combining shift and rotate operations with arithmetic ones, so as to allow the use of ARM's barrel shifter. For example a=a+b