High-Throughput GCM VLSI Architecture for IEEE 802.1ae Applications

Chuan Zhang and Li Li
Institute of VLSI Design, Nanjing University
LAPEM, Nanjing University
Nanjing, Jiangsu 210093, China
Email: {chzhang, lili}@nju.edu.cn

Zhongfeng Wang
Broadcom Corporation
5300 California Avenue, Irvine, CA 92617, USA
Email: [email protected]

Abstract—This paper presents a high-throughput GCM VLSI architecture that is fully compliant with IEEE 802.1ae and can operate in all modes specified in the standard. Unlike previous works, the design uses a modified parallel GHASH module and therefore performs encryption efficiently without knowing the total number of data blocks in advance. Furthermore, a fully subpipelined, loop-free key expansion architecture is employed to support a key change in every clock cycle. An encryptor design example with a 2-parallel modified GHASH module is implemented and fabricated in Fujitsu 0.13 μm 1.2 V 1P8M CMOS technology. The ASIC implementation results show a maximum operating frequency of 764.5 MHz and a throughput of 97.9 Gb/s with 547 k gates.

I. INTRODUCTION

Galois/Counter Mode (GCM) [1] is an authenticated encryption algorithm designed by Viega and McGrew as an improvement to Carter-Wegman Counter (CWC) mode. It is a block cipher mode of operation that uses universal hashing over $GF(2^{128})$ to provide authenticated encryption. Combined with the well-known Counter (CTR) mode of encryption, GCM is well suited to wireless, optical, and magnetic recording systems because of its multi-Gb/s authenticated encryption speed, minimal computational latency, and high intrinsic degree of pipelining and parallelism. Communication standards such as IEEE 802.1ae [2], ANSI FC-SP, IEEE P1619.1, the IETF IPSec standards, and NIST 800-38D have considered employing GCM to enhance their performance. In the IEEE 802.1ae system, GCM is adopted as the Default Cipher Suite for Layer 2 transport security.

Although GCM lends itself to parallelism, high-speed GCM implementations compliant with IEEE 802.1ae are scarce. Han et al. [3] developed a GCM encryptor in a TSMC 0.18 μm process as part of a MAC Security chip for Ethernet Passive Optical Networks (EPON), but did not report the exact throughput or hardware cost. Yang et al. [4] proposed a high-throughput GCM implementation using a 0.18 μm CMOS standard cell library; by balancing the critical delays of the AES engine and the modular multiplier, the design achieved a throughput of 34 Gb/s. To achieve a throughput higher than 100 Gb/s, Satoh [5] proposed a parallel GCM hardware architecture with a parallel GHASH module. Although its speed is quite high, that architecture is not suitable for IEEE 802.1ae applications: to execute encryption properly, the design must know the number of input data blocks in advance, which is impossible in practical applications.

This work presents a high-throughput GCM VLSI architecture for IEEE 802.1ae applications that aims at very high hardware efficiency. A novel variation of the parallel GHASH module, optimized for IEEE 802.1ae applications, is employed. Moreover, a fully subpipelined, loop-free key expansion architecture is introduced to support key changes on the fly. Based on the proposed design techniques, an encryptor example with a 2-parallel modified GHASH module is implemented in Fujitsu 0.13 μm 1.2 V 1P8M CMOS technology to demonstrate the merits of the architecture. The proposed design achieves 97.9 Gb/s encryption throughput with a gate count of only 547 k, and thus a hardware efficiency comparable to that of [5], which does not support IEEE 802.1ae.

II. REVIEW OF GCM ALGORITHM

The GCM encryption has four input bit-strings: a secret key K, an initialization vector IV, a plaintext P, and additional authenticated data A. It has two output bit-strings: a ciphertext C and an authentication tag T. Here P consists of a sequence of n bit-strings denoted as $(P_1, P_2, \ldots, P_{n-1}, P_n^*)$, in which the bit length of the last bit-string is u and the bit length of each of the other bit-strings is 128. Similarly, C and A are denoted as $(C_1, C_2, \ldots, C_{n-1}, C_n^*)$ and $(A_1, A_2, \ldots, A_{m-1}, A_m^*)$, respectively, where the last bit-string of A has bit length v. The GCM encryption operation is defined as follows:

$$
\begin{aligned}
H &= E(K, 0^{128}) \\
Y_0 &= \begin{cases} IV \,\|\, 0^{31}1 & \text{if } \operatorname{len}(IV) = 96 \\ \operatorname{GHASH}(H, \{\}, IV) & \text{otherwise} \end{cases} \\
Y_i &= \operatorname{incr}(Y_{i-1}) \quad \text{for } i = 1, \ldots, n \\
C_i &= P_i \oplus E(K, Y_i) \quad \text{for } i = 1, \ldots, n-1 \\
C_n^* &= P_n^* \oplus \operatorname{MSB}_u(E(K, Y_n)) \\
T &= \operatorname{MSB}_t(\operatorname{GHASH}(H, A, C) \oplus E(K, Y_0)).
\end{aligned} \tag{1}
$$

With the input bit-strings A and C, the function GHASH is defined by $\operatorname{GHASH}(H, A, C) = X_{m+n+1}$, where the variables $X_i$ for $i = 0, \ldots, m+n+1$ are defined as follows:

$$
X_i = \begin{cases}
0 & \text{for } i = 0 \\
(X_{i-1} \oplus A_i) \cdot H & \text{for } i = 1, \ldots, m-1 \\
(X_{m-1} \oplus (A_m^* \,\|\, 0^{128-v})) \cdot H & \text{for } i = m \\
(X_{i-1} \oplus C_{i-m}) \cdot H & \text{for } i = m+1, \ldots, m+n-1 \\
(X_{m+n-1} \oplus (C_n^* \,\|\, 0^{128-u})) \cdot H & \text{for } i = m+n \\
(X_{m+n} \oplus (\operatorname{len}(A) \,\|\, \operatorname{len}(C))) \cdot H & \text{for } i = m+n+1.
\end{cases} \tag{2}
$$

Figure 1. 4-parallel GHASH architecture. (Diagram not reproduced; its labels are the inputs $A_{4i+1}$ to $A_{4i+4}$, the multiplier constants $H$ and $H^4$, MUXes, and registers.)

IEEE 802.1ae is the IEEE MAC Security (MACsec) standard, which defines connectionless data confidentiality and integrity for media access independent protocols. In its applications, K is the 128-bit Secure Association Key (SAK). The 64 most significant bits of the 96-bit IV are the octets of the Secure Channel Identifier (SCI), and the 32 least significant bits are the octets of the Packet Number (PN). T is the Integrity Check Value (ICV) and is 128 bits long. The bit-strings A, P, and C vary with the mode of the Default Cipher Suite, which is one of: 1) Integrity Protection, 2) Confidentiality Protection without a confidentiality offset, 3) Confidentiality Protection with a confidentiality offset.
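As an illustration of how the 96-bit IV and the counter blocks of (1) fit together in the MACsec setting, the following Python sketch assembles the IV from an SCI and a PN and derives $Y_0$ and its successors. The function names are hypothetical, big-endian octet order is an assumption, and `incr` follows the GCM specification [1], where it increments the rightmost 32 bits of a block modulo $2^{32}$.

```python
def build_iv(sci: bytes, pn: int) -> bytes:
    """Assemble the 96-bit IV: 64-bit SCI (most significant), then 32-bit PN.

    Big-endian packing of the PN is an assumption of this sketch.
    """
    if len(sci) != 8:
        raise ValueError("SCI must be exactly 8 octets")
    if not 0 <= pn < 2**32:
        raise ValueError("PN must fit in 32 bits")
    return sci + pn.to_bytes(4, "big")


def initial_counter(iv: bytes) -> bytes:
    """Y0 = IV || 0^31 1, valid only for a 96-bit IV."""
    if len(iv) != 12:
        raise ValueError("this sketch only covers 96-bit IVs")
    return iv + b"\x00\x00\x00\x01"


def incr(y: bytes) -> bytes:
    """Increment the rightmost 32 bits of a 128-bit block modulo 2^32."""
    if len(y) != 16:
        raise ValueError("counter block must be 16 octets")
    counter = (int.from_bytes(y[12:], "big") + 1) % 2**32
    return y[:12] + counter.to_bytes(4, "big")


def counter_blocks(sci: bytes, pn: int, n: int) -> list:
    """Return [Y1, ..., Yn], the counter blocks fed to E(K, .) for n data blocks."""
    y = initial_counter(build_iv(sci, pn))
    blocks = []
    for _ in range(n):
        y = incr(y)
        blocks.append(y)
    return blocks
```

Because the PN occupies the low 32 bits of the IV and the block counter occupies a separate 32-bit field, a new packet changes the upper 96 bits of every counter block while the per-block counter restarts from 1.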

III. MODIFIED PARALLEL GHASH BLOCK

To keep pace with the rapid development of Ethernet, there is a compelling need for high-throughput implementations of MACsec based on IEEE 802.1ae. Several approaches have been proposed to tackle the problem, among which the most promising is the parallel GHASH architecture. Recently, Satoh [5] introduced a q-parallel architecture to calculate the GHASH function. The processing equation can be rewritten as below, where the number of data blocks is pq:

$$
\begin{aligned}
X_{pq} ={}& (\cdots(A_1 H^q \oplus A_{q+1}) H^q \oplus \cdots \oplus A_{(p-1)q+1}) H^q \\
& \oplus (\cdots(A_2 H^q \oplus A_{q+2}) H^q \oplus \cdots \oplus A_{(p-1)q+2}) H^{q-1} \\
& \;\;\vdots \\
& \oplus (\cdots(A_q H^q \oplus A_{2q}) H^q \oplus \cdots \oplus A_{pq}) H.
\end{aligned} \tag{3}
$$

A 4-parallel hardware architecture corresponding to (3) is illustrated in Fig. 1.

However, if the number of data blocks is not a multiple of q, the total number must be known before encryption so that the processing pattern can be determined in advance. This is impractical in IEEE 802.1ae applications, because all input data blocks are transmitted sequentially and the number of data blocks is not available until the end of the input sequence. For more details, please refer to [5]. Hence a modified parallel GHASH block that is suitable for sequential transmission is proposed. Without loss of generality, we assume the number of data blocks is (pq+n), where 1≤n≤q. Then $X_{pq+n}$ can be calculated as follows:

$$
\begin{aligned}
X_{pq+n} ={}& (\cdots(((A_1 H^q \oplus A_2 H^{q-1} \oplus \cdots \oplus A_q H) \oplus A_{q+1}) H^q \oplus A_{q+2} H^{q-1} \oplus \cdots \oplus A_{2q} H) \oplus \cdots \\
& \oplus A_{pq+1}) H^n \oplus A_{pq+2} H^{n-1} \oplus \cdots \oplus A_{pq+n} H.
\end{aligned} \tag{4}
$$

Based on (4), the modified parallel GHASH module architecture can be easily obtained. A 4-parallel version is shown in Fig. 2.

Figure 2. Modified 4-parallel GHASH architecture. (Diagram not reproduced; its labels are the inputs $A_{4i+1}$ to $A_{4i+4}$, the multiplier constants $H$, $H^2$, $H^3$, and $H^4$, MUXes, and registers.)

To explain the operation more clearly, Fig. 3 shows how $X_6$ is calculated from the inputs $A_1$ to $A_6$ using the above architecture.

Figure 3. Calculation scheduling of X6.
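The algebraic equivalence behind (4) can be checked in software. The sketch below is a behavioral model only, not the authors' hardware: it implements the $GF(2^{128})$ multiplication as defined in the GCM specification [1], accumulates the hash block by block as in (2), and then recomputes it in chunks of q blocks as in (4), handling a short final chunk of n blocks once the stream ends. The function names and the representation of 128-bit blocks as Python integers are assumptions made for illustration.

```python
from typing import Iterable, List

# Reduction constant for x^128 + x^7 + x^2 + x + 1 in GCM's bit ordering.
R = 0xE1 << 120


def gf_mul(x: int, y: int) -> int:
    """Multiply two 128-bit field elements (GCM specification, bit 0 = MSB)."""
    z, v = 0, x
    for i in range(127, -1, -1):          # scan y from its most significant bit
        if (y >> i) & 1:
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def ghash_sequential(h: int, blocks: Iterable[int]) -> int:
    """One multiplication per block: X_i = (X_{i-1} xor A_i) * H."""
    x = 0
    for a in blocks:
        x = gf_mul(x ^ a, h)
    return x


def _absorb(x: int, chunk: List[int], powers: List[int]) -> int:
    """(X xor A_1) H^n xor A_2 H^(n-1) xor ... xor A_n H for a chunk of n blocks."""
    n = len(chunk)
    acc = gf_mul(x ^ chunk[0], powers[n - 1])
    for j in range(1, n):
        acc ^= gf_mul(chunk[j], powers[n - 1 - j])
    return acc


def ghash_chunked(h: int, blocks: Iterable[int], q: int) -> int:
    """Consume a stream q blocks at a time; the total length is never needed."""
    powers = [h]                          # powers[k] holds H^(k+1)
    for _ in range(q - 1):
        powers.append(gf_mul(powers[-1], h))
    x, chunk = 0, []
    for a in blocks:
        chunk.append(a)
        if len(chunk) == q:
            x = _absorb(x, chunk, powers)
            chunk = []
    if chunk:                             # final partial chunk of n < q blocks
        x = _absorb(x, chunk, powers)
    return x
```

For any stream length and any q, both functions should return the same value. Within a chunk the q multiplications are independent of each other, which is the property the parallel multipliers exploit, and only the size of the last chunk has to be known, which is available when the stream ends.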

IV. PROPOSED GCM ARCHITECTURE

The block diagram of the proposed high-throughput GCM architecture for IEEE 802.1ae applications is shown in Fig. 4. It mainly contains four modules. The information extraction (InfoExtn) module extracts the SAK, SCI, PN, Destination Address, and Source Address from the input data and converts them to K, IV, A, and P according to the selected mode of the Default Cipher Suite. The CIPH module executes the block cipher encryption and outputs the data. The key expansion (KeyExp) module generates round keys on the fly with a subpipelined architecture. To satisfy the high-throughput requirement, a loop-free architecture is used: KeyExp consists of 11 submodules, which operate at the same pace as the CIPH module. The GHASH module calculates T using the universal hash function GHASH.

Figure 4. Block diagram for the proposed GCM architecture. (Diagram not reproduced; its labels are input data, mode selection, InfoExtn, K, IV, P, A, KeyExp, round key, CIPH, C, GHASH, T, and output data, with 128-bit data paths.)

A. Architecture for CIPH Module Design

As shown in Fig. 5, 11 round units are included in the CIPH module to efficiently implement the cipher function recommended by IEEE 802.1ae. RoundUnit{1, 2, …, 9} share the same architecture, whereas RoundUnit0 and RoundUnit10 are each unique. RoundUnit0 simply implements the exclusive-OR operation. Each of RoundUnit{1, 2, …, 9} consists of four transformations: SubBytes, ShiftRows, MixColumns, and AddRoundKey. RoundUnit10 has the same transformations except for MixColumns. Both pipelining and subpipelining strategies are employed to achieve the required high throughput.

B. Architecture for KeyExp Module Design

A fully subpipelined, loop-free KeyExp module is used to support a key change every cycle according to IEEE 802.1ae [6]. In total, 11 key expansion units are included in this module. The same cutsets as in the CIPH module are added to the KeyExp module so that the two modules cooperate efficiently in parallel. KeyExpansionUnit{1, 2, …, 9} share the same architecture, whereas KeyExpansionUnit0 and KeyExpansionUnit10 differ, as shown in Fig. 6. Here the SubWord transformation applies the S-box used in SubBytes to each of the four bytes in a word, and the RotWord transformation performs a one-byte circular left shift on a word.

Figure 6. Architecture for KeyExp module design. (a) KeyExpansionUnit{1, 2, …, 9}. (b) KeyExpansionUnit0. (c) KeyExpansionUnit10. (Diagram not reproduced; its labels are 32-bit word paths, SubWord, RotWord, and Rcon.)
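To make the SubWord, RotWord, and Rcon steps concrete, here is a minimal software sketch of the standard AES-128 key schedule (FIPS-197), unrolled so that each round key is derived from the previous one by a separate function call, in the spirit of a loop-free chain of expansion units. It is not the authors' circuit and does not model subpipelining; the function names are hypothetical, and the S-box is computed from its mathematical definition rather than stored as a table.

```python
def _gf256_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def _build_sbox() -> list:
    """AES S-box: multiplicative inverse followed by the affine transformation."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = next(y for y in range(1, 256) if _gf256_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = _build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def rot_word(w: int) -> int:
    """One-byte circular left shift of a 32-bit word."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def sub_word(w: int) -> int:
    """Apply the S-box to each of the four bytes of a word."""
    return (
        (SBOX[w >> 24] << 24)
        | (SBOX[(w >> 16) & 0xFF] << 16)
        | (SBOX[(w >> 8) & 0xFF] << 8)
        | SBOX[w & 0xFF]
    )


def next_round_key(prev: list, round_index: int) -> list:
    """Derive round key `round_index` (1..10) from the previous round key."""
    t = sub_word(rot_word(prev[3])) ^ (RCON[round_index - 1] << 24)
    w0 = prev[0] ^ t
    w1 = prev[1] ^ w0
    w2 = prev[2] ^ w1
    w3 = prev[3] ^ w2
    return [w0, w1, w2, w3]


def expand_key(key: bytes) -> list:
    """Return the 11 round keys of AES-128, each as four 32-bit words."""
    if len(key) != 16:
        raise ValueError("AES-128 key must be 16 octets")
    round_key = [int.from_bytes(key[i:i + 4], "big") for i in range(0, 16, 4)]
    keys = [round_key]
    for r in range(1, 11):
        round_key = next_round_key(round_key, r)
        keys.append(round_key)
    return keys
```

Since each round key depends only on its predecessor, the ten derivations can be laid out as a feed-forward chain, so a new key can enter the chain while earlier keys are still propagating through later stages.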

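Figure 5. Architecture for CIPH module design. (Diagram not reproduced; its labels are the counter input $IV \,\|\, 0^{31}1$ with a +1 incrementer, MUXes, RoundUnit0 through RoundUnit10, Sub-Stage 1 through Sub-Stage n separated by registers, 128-bit data paths, P, T, C, and output data.)

C. Architecture for GHASH Module Design

Following the design approach described in Section III, the GHASH module can be easily implemented. The maximum hardware efficiency of the whole GCM architecture is achieved by choosing the concurrent degree (degree of parallelism) of the GHASH module appropriately. Suppose the critical path delays of the CIPH module and the GHASH module are $t_{CIPH}$ and $t_{GHASH}$, respectively. The concurrent degree N is then given by

$$
N = \left\lceil \frac{t_{GHASH}}{t_{CIPH}} \right\rceil. \tag{5}
$$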
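A small numeric illustration of (5) follows. The delay values are hypothetical and are not taken from the paper: $t_{CIPH}$ is assumed to equal the clock period at the reported 764.5 MHz, and the GHASH multiplier delay of 2.4 ns is an invented figure chosen only to show how a 2-parallel GHASH module would result.

```python
import math


def concurrent_degree(t_ghash_ns: float, t_ciph_ns: float) -> int:
    """Smallest N such that N CIPH cycles cover one GHASH multiplication, per (5)."""
    if t_ghash_ns <= 0 or t_ciph_ns <= 0:
        raise ValueError("delays must be positive")
    return math.ceil(t_ghash_ns / t_ciph_ns)


# Assumption: the CIPH critical path equals the clock period at 764.5 MHz.
t_ciph_ns = 1e3 / 764.5          # about 1.308 ns
# Hypothetical GHASH multiplier delay, NOT a figure from the paper.
t_ghash_ns = 2.4

n = concurrent_degree(t_ghash_ns, t_ciph_ns)
```

Under these assumptions any GHASH delay above one clock period and up to two clock periods gives N = 2, so the multipliers do not limit the clock rate set by the CIPH pipeline.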

V. ASIC IMPLEMENTATION RESULTS AND COMPARISON

A fully subpipelined encryptor with 70 substages based on the proposed GCM VLSI architecture is implemented in Fujitsu 0.13 μm 1.2 V 1P8M CMOS technology. A hierarchical design flow is followed with standard EDA tools: Cadence Verilog-XL for simulation and verification, Synopsys Design Compiler for synthesis, Synopsys PrimeTime for timing analysis, and Cadence SoC Encounter for floorplanning, placement, and routing. The final layout plot of the design with the 2-parallel modified GHASH block is shown in Fig. 7.
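The headline figures reported in Table I can be cross-checked with simple arithmetic. The sketch below assumes that each design processes a whole number of 128-bit blocks per clock cycle (one for the proposed design and [4], four for [5]; these counts are inferred from the reported numbers, not stated in the paper) and computes hardware efficiency as throughput divided by gate count, interpreted here as Gb/s per thousand gates. The function names are hypothetical.

```python
BLOCK_BITS = 128


def throughput_gbps(freq_mhz: float, blocks_per_cycle: int = 1) -> float:
    """Throughput in Gb/s when `blocks_per_cycle` 128-bit blocks finish each cycle."""
    return BLOCK_BITS * blocks_per_cycle * freq_mhz / 1000.0


def hardware_efficiency(throughput: float, gate_count_k: float) -> float:
    """Throughput (Gb/s) per thousand gates, numerically equal to Mb/s per gate."""
    return throughput / gate_count_k


# (frequency in MHz, assumed blocks per cycle, reported Gb/s, reported k gates)
designs = {
    "Proposed": (764.5, 1, 97.9, 547),
    "[4]": (271.0, 1, 34.7, 499),
    "[5]": (317.5, 4, 162.6, 980),
}

for name, (freq, lanes, reported_gbps, gates_k) in designs.items():
    computed = throughput_gbps(freq, lanes)
    efficiency = hardware_efficiency(reported_gbps, gates_k)
    print(f"{name}: computed {computed:.1f} Gb/s, "
          f"reported {reported_gbps} Gb/s, efficiency {efficiency:.3f}")
```

Under these assumptions the computed throughputs agree with the reported ones to one decimal place, and the efficiency values match the Hardware Efficiency row of Table I.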

Figure 7. Final layout of the proposed GCM design.

Table I lists the ASIC implementation results of the design and comparisons with other references. To simplify the comparison, a parameter named hardware efficiency is introduced, which is calculated as follows:

$$
\text{Hardware Efficiency} = \frac{\text{Throughput}}{\text{Gate Count}}. \tag{6}
$$

TABLE I. ASIC IMPLEMENTATION RESULTS AND COMPARISONS

| Reference | Proposed | [3] | [4] | [5] |
|---|---|---|---|---|
| Supporting IEEE 802.1ae | Yes | Yes | No | No |
| Data Length | 128 bits | | | |
| AES Architecture | 70-stage pipelined | | | |
| GHASH Architecture | 2-parallel | — | sequential | 4-parallel |
| Frequency | 764.5 MHz | — | 271.0 MHz | 317.5 MHz |
| Throughput | 97.9 Gb/s | — | 34.7 Gb/s | 162.6 Gb/s |
| Technology | 0.13 μm | 0.18 μm | 0.18 μm | 0.13 μm |
| Gate Count | 547 k | — | 499 k | 980 k |
| Hardware Efficiency | 0.179 | — | 0.070 | 0.166 |

Blank cells could not be recovered from the source layout, which preserves one further Data Length value of 128 bits without a clear column.

It can be observed that the proposed design, operating at 764.5 MHz, achieves a maximum throughput of 97.9 Gb/s. The design occupies an area of 2.9 mm^2 and integrates 547 k logic gates in total. As summarized in Table I, our design obtains slightly higher hardware efficiency than the state-of-the-art ASIC implementation. Furthermore, when compatibility with IEEE 802.1ae is taken into consideration, our architecture demonstrates advantages in both throughput and hardware cost over the prior literature. We therefore conclude that the proposed architecture is well suited for very high speed IEEE 802.1ae applications.

VI. CONCLUSION

A high-throughput GCM VLSI architecture for IEEE 802.1ae applications has been proposed. It exploits a modified parallel GHASH architecture and employs a loop-free, fully subpipelined KeyExp module to support key changes in every clock cycle. A 70-stage subpipelined encryptor design example with a 2-parallel modified GHASH module is implemented in Fujitsu 0.13 μm 1.2 V 1P8M CMOS technology, demonstrating a maximum throughput of 97.9 Gb/s at the cost of 547 k logic gates. Further speedup can be expected if a multi-channel environment is introduced.

ACKNOWLEDGMENT

The authors gratefully acknowledge support from the National 863 Program of China under Grant No. 2008AA01Z135, the National Natural Science Foundation of China under Grant No. 90307011, and the High-Tech Foundation of Guangdong Province of China under Grant No. 2006B50101003. The authors also thank Dr. Jun Xu for useful discussions.

REFERENCES

[1] D. A. McGrew and J. Viega. (2005, May). The Galois/Counter mode of operation (GCM). Inst. Stand. Technol. [Online]. pp. 1–8. Available: http://www.csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/gcm/gcm-revised-spec.pdf

[2] IEEE Standard for Local and Metropolitan Area Networks-Media Access Control (MAC) Security, IEEE Standard 802.1ae, 2006.

[3] K.-S. Han, K.-O. Kim, T. W. Yoo, and Y. Kwon, "The design and implementation of MAC security in EPON," in Proc. IEEE International Conference on Advanced Communication Technology (ICACT), Feb. 2006, pp. 20-22.

[4] B. Yang, S. Mishra, and R. Karri. (2005, June). High speed architecture for Galois/Counter mode of operation (GCM). Cryptology ePrint Archive. [Online]. pp. 1–15. Available: http://eprint.iacr.org/2005/146.pdf

[5] A. Satoh, "High-speed parallel hardware architecture for Galois counter mode," in Proc. IEEE Symp. Circuits and Systems (ISCAS), May 2007, pp. 1863-1866.

[6] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 9, pp. 957-967, Sept. 2004.
