A VLSI Implementation of the Blowfish Encryption/Decryption Algorithm†

Michael C.-J. Lin, Youn-Long Lin
Department of Computer Science
National Tsing Hua University
Hsin-Chu, Taiwan 30043, R.O.C.

Cryptography is widely applied to protect digital data. There are many kinds of cryptographic algorithms, and most of them require a secret key to encode digital data. Once such an algorithm has been applied to the data, others cannot easily recover the original data without the secret key, so the private data are protected.

The Blowfish algorithm was designed by Bruce Schneier in 1993. It is a symmetric block cipher with a 64-bit block. The secret key of Blowfish ranges from 32 bits to 448 bits. Blowfish has been examined for five years. Serge Vaudenay has examined weak keys in Blowfish. Vincent Rijmen's Ph.D. thesis includes a second-order differential attack on 4-round Blowfish [2]. With a 448-bit key, an exhaustive search has to examine 2^448 combinations.

The Blowfish algorithm has many advantages. It is suitable and efficient for hardware implementation. Besides, it is unpatented and no license is required. The proposed architecture can process 4 bits of data per clock. Scan chains are also included in this architecture for testing. The die size of the chip is 5.7 x 6.1 mm², and the maximum frequency is up to 50 MHz.

II. BLOWFISH ALGORITHM

The elementary operators of the Blowfish algorithm are table lookup, addition, and XOR. The tables comprise four S-boxes (256 x 32 bits each) and a P-array (18 x 32 bits). The Blowfish algorithm consists of four steps: table initialization, key initialization, data encryption, and data decryption. Fig. 1 shows the Blowfish encryption algorithm.

Divide X into two 32-bit halves: XL, XR
For i = 1 to 16:
    XL = XL ⊕ Pi
    XR = F(XL) ⊕ XR
    Swap XL and XR
Swap XL and XR (undo the last swap)
XR = XR ⊕ P17
XL = XL ⊕ P18
Concatenate XL and XR

Divide XL into four eight-bit quarters: a, b, c, and d
F(XL) = ((S1[a] + S2[b]) ⊕ S3[c]) + S4[d]

Fig. 1 Blowfish algorithm

† Supported in part by the National Science Council, R.O.C., under contract no. NSC 88-2215-E-007-025.

III. THE PROPOSED ARCHITECTURE

Fig. 2 DFG of the loop body (nodes: S-box1 to S-box4 and P-array lookups, carry generators (CG), and XORs)
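As a software reference for the loop and the function F of Fig. 1, the following Python sketch may be helpful. It is a minimal model under stated assumptions, not the chip's implementation: the tables produced by make_placeholder_tables are arbitrary placeholder values (real Blowfish tables are initialized from fixed constants and then mixed with the key, which is not reproduced here), and all function names are hypothetical. Decryption reuses the same loop with the P-array in reverse order.

```python
MASK32 = 0xFFFFFFFF


def make_placeholder_tables(seed=0x12345678):
    """Return (P-array, S-boxes) filled with placeholder words.

    These are NOT the real Blowfish tables; they only give the loop
    below something to operate on.
    """
    state = seed

    def next_word():
        nonlocal state
        # Small linear congruential generator, for illustration only.
        state = (state * 1664525 + 1013904223) & MASK32
        return state

    p = [next_word() for _ in range(18)]                        # 18 x 32 bits
    s = [[next_word() for _ in range(256)] for _ in range(4)]   # 4 x 256 x 32 bits
    return p, s


def f(xl, s):
    """F(XL) = ((S1[a] + S2[b]) xor S3[c]) + S4[d], additions modulo 2^32."""
    a = (xl >> 24) & 0xFF
    b = (xl >> 16) & 0xFF
    c = (xl >> 8) & 0xFF
    d = xl & 0xFF
    return ((((s[0][a] + s[1][b]) & MASK32) ^ s[2][c]) + s[3][d]) & MASK32


def crypt_block(block, p, s):
    """Run the 16-round loop of Fig. 1 on one 64-bit block."""
    xl = (block >> 32) & MASK32
    xr = block & MASK32
    for i in range(16):
        xl ^= p[i]
        xr ^= f(xl, s)
        xl, xr = xr, xl
    xl, xr = xr, xl          # undo the last swap
    xr ^= p[16]              # P17 in the paper's 1-based numbering
    xl ^= p[17]              # P18
    return (xl << 32) | xr


def encrypt_block(block, p, s):
    return crypt_block(block, p, s)


def decrypt_block(block, p, s):
    # Same datapath, P-array applied in reverse order.
    return crypt_block(block, p[::-1], s)
```

Because encryption and decryption differ only in the order of the P-array entries, a single datapath can serve both directions, which is consistent with the chip offering separate encrypt and decrypt modes on one core.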


I. INTRODUCTION

maximum frequency is up to 50MHz.


Abstract

We propose an efficient hardware architecture for the Blowfish algorithm [1]. The throughput is up to 4 bits per clock, which is 9 times faster than a Pentium. By applying an operator-rescheduling method, the critical path delay is reduced by 21.7%. We have successfully implemented the design using the Compass cell library targeted at a 0.6 µm TSMC SPTM CMOS process. The die size is 5.7 x 6.1 mm² and the maximum frequency is 50 MHz.


Fig. 3 DFG after rescheduling

A. Operator Rescheduling

When calculating s = a + b, the i-th bit of s is equal to a_i ⊕ b_i ⊕ c_i, where c_i is the carry-in of the i-th bit. An addition can therefore be replaced by a carry generator (CG) followed by XORs. Fig. 2 shows the original DFG of the loop body after replacing each add operation with CG and XOR functions.
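The bit-level identity s_i = a_i ⊕ b_i ⊕ c_i can be checked with a short Python sketch. The function names are hypothetical, and the carries are computed here with a simple ripple loop purely as a reference model. The last function illustrates one plausible reading of why rescheduling helps: since XOR is associative and commutative, an operand x that is XORed with the sum can be combined with a before the carries are available. The paper's actual schedule is the one shown in Fig. 3.

```python
MASK32 = 0xFFFFFFFF


def carry_generator(a, b, width=32):
    """Return c, where bit i of c is the carry into bit i of a + b."""
    c = 0
    carry = 0                      # carry into bit 0 is zero
    for i in range(width):
        c |= carry << i
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        # Carry out of bit i: generated (ai & bi) or propagated.
        carry = (ai & bi) | (carry & (ai ^ bi))
    return c


def add_via_cg(a, b):
    """32-bit addition expressed as CG followed by XORs."""
    return a ^ b ^ carry_generator(a, b)


def add_then_xor(a, b, x):
    """(a + b) xor x, with the XOR against x scheduled before the carries."""
    early = a ^ x                  # does not have to wait for the CG
    return early ^ b ^ carry_generator(a, b)
```

Both functions agree with ordinary modulo 2^32 addition, so reordering the XORs changes only the timing of the datapath, not its result.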

The operators include only carry generators and XORs, so we can use an operator-rescheduling method to reduce the critical path delay. Fig. 3 shows the result of operator rescheduling. The gray line in these figures shows the critical path. The original critical path delay is two CG delays plus five XOR delays. After rescheduling, the critical path delay is reduced to two CG delays plus two XOR delays; three 2-input XOR delays are hidden. According to the synthesizer's report, the critical path delay improves by about 21.7%.

B. Fast Carry Generator

The fast carry generator is based on a carry-lookahead adder [3]. We construct the carry generator from hierarchical 4-bit carry generators.

C. The System Configuration

Controller
The controller is implemented as a finite state machine and described in a behavioral Verilog model. See Fig. 4.
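The hierarchical carry generator of Section B can be modeled in software as follows. This is a sketch under assumptions: the paper does not give the number of levels or the exact grouping, so the tree of 4-input lookahead units below (8 groups of 4 bits, then groups of groups) is only one possible arrangement, and the function names are hypothetical. The loop in lookahead4 evaluates the same generate/propagate equations that hardware would flatten into two-level logic.

```python
def lookahead4(g, p, c0):
    """Lookahead unit for up to 4 positions.

    g, p: generate and propagate signals of each position (0 or 1).
    Returns (carry into each position, group generate, group propagate).
    """
    carries = []
    group_g, group_p = 0, 1
    for gi, pi in zip(g, p):
        carries.append(group_g | (group_p & c0))   # carry into this position
        group_g = gi | (pi & group_g)
        group_p = pi & group_p
    return carries, group_g, group_p


def lookahead_carries(g, p, c0=0):
    """Carries into every position, using a hierarchy of 4-input units."""
    if len(g) <= 4:
        return lookahead4(g, p, c0)[0]
    groups = [(g[i:i + 4], p[i:i + 4]) for i in range(0, len(g), 4)]
    # Group generate/propagate do not depend on the carry-in.
    group_g = [lookahead4(gg, pp, 0)[1] for gg, pp in groups]
    group_p = [lookahead4(gg, pp, 0)[2] for gg, pp in groups]
    # The next level up supplies the carry into each group.
    group_cin = lookahead_carries(group_g, group_p, c0)
    carries = []
    for (gg, pp), cin in zip(groups, group_cin):
        carries.extend(lookahead4(gg, pp, cin)[0])
    return carries


def carry_vector(a, b, width=32):
    """Pack the carry-in of every bit of a + b into one integer."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    return sum(c << i for i, c in enumerate(lookahead_carries(g, p)))
```

XORing a, b, and carry_vector(a, b) yields the 32-bit sum, so this unit can stand in for each CG node of the DFG.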


CORE implements the loop of the 16-round iteration. A pipeline stage is added to the output of the SRAM modules. The pipeline stage doubles the performance of the Blowfish hardware at the cost of additional area.

D. DFT Consideration

The controller is made testable by adding scan registers that store the controller signals and scan out the register contents in test mode. The datapath is described by a Verilog RTL model. All of the flip-flops of the datapath are replaced by scan flip-flops.

IV. EXPERIMENTAL RESULTS

Table 1 shows the features of this chip. The maximum frequency of this Blowfish cipher chip is 50 MHz. Fig. 6 shows the photomicrograph.

Table 1 The chip features

Die size                   5.7 x 6.1 mm²
Pad      Ext. power        5 vdd, 7 gnd
         Int. power        4 pairs
         Input             12
         Output            6
         Clock buffer      PC5C03
Macro    SRAM              256x32 (x4), 16x32 (x1)
         ROM               256x32 (x4)
Random logic               16K gates
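The reported figures imply a peak data rate that can be worked out directly. The short Python sketch below does the arithmetic, assuming that 4 bits are transferred on every clock at the maximum frequency; this is an upper bound derived from the stated numbers, not a measurement reported in the paper, and the constant names are hypothetical.

```python
BITS_PER_CLOCK = 4            # data rate stated in the paper
MAX_CLOCK_HZ = 50_000_000     # maximum frequency from Table 1 discussion
BLOCK_BITS = 64               # Blowfish block size

# Peak rate if 4 bits move on every clock cycle.
peak_bits_per_second = BITS_PER_CLOCK * MAX_CLOCK_HZ

# Clocks needed to move one 64-bit block through a 4-bit interface.
clocks_per_block = BLOCK_BITS // BITS_PER_CLOCK

# Blocks per second at that rate.
blocks_per_second = MAX_CLOCK_HZ // clocks_per_block

print(peak_bits_per_second, clocks_per_block, blocks_per_second)
```

Under these assumptions the peak rate is 200 Mbit/s, with one 64-bit block transferred every 16 clocks.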

Fig. 4 FSM of the controller (labels in the figure: idle, initial, encrypt, decrypt, e1 to e4, start, clear, load, mode = 0 to 3, !reset)

Datapath
The datapath includes ROM modules, SRAM modules, and the main arithmetic units of Blowfish. Fig. 5 shows the datapath architecture.


Fig. 6 Photomicrograph


V. CONCLUSION


Fig. 5 The architecture of the datapath (blocks in the figure: ROM_P, ROM_Sbox, SRAM_P, SRAM_Sbox, CORE, two shift registers, multiplexers, and flip-flops)

Because the size of an SRAM module is 2^n words, P1 and P18 are implemented as registers, and the other sixteen P-array entries are mapped to a 16 x 32-bit SRAM. We use a shift register below DataIn to expand the 4-bit input to a 64-bit input and a shift register above DataOut to reduce the 64-bit output to a 4-bit output.
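The behavior of the two shift registers can be sketched in Python as follows. This is an illustrative model only: the function names are hypothetical, and the nibble order (most significant nibble first) is an assumption, since the paper does not state the order in which the 4-bit groups are transferred.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def shift_in_nibbles(nibbles):
    """Assemble sixteen 4-bit inputs into one 64-bit word.

    Models the shift register on DataIn: each clock shifts the register
    left by 4 bits and loads the new nibble into the low bits.
    """
    if len(nibbles) != 16:
        raise ValueError("a 64-bit block needs exactly sixteen 4-bit inputs")
    word = 0
    for n in nibbles:
        word = ((word << 4) | (n & 0xF)) & MASK64
    return word


def shift_out_nibbles(word):
    """Split a 64-bit word into sixteen 4-bit outputs.

    Models the shift register on DataOut: each clock emits the top
    nibble and shifts the register left by 4 bits.
    """
    out = []
    for _ in range(16):
        out.append((word >> 60) & 0xF)
        word = (word << 4) & MASK64
    return out
```

With this ordering, loading a block takes 16 clocks and unloading it takes another 16, which matches the 4-bit-per-clock interface described above.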

The proposed hardware architecture of the Blowfish algorithm achieves a data rate of up to 4 bits per clock, which is 9 times faster than a Pentium. By applying an operator-rescheduling method, the critical path delay is improved by about 21.7%. DFT is also taken into consideration. In addition, the chip is cascadable: if two chips are used, the performance doubles. The test results show that the maximum frequency of this Blowfish cipher chip is 50 MHz. The proposed architecture satisfies the need for high-speed data transfer and can be applied to the security devices of a system.

REFERENCES

[1] Bruce Schneier, “Applied Cryptography”, John Wiley & Sons, Inc., 1996.
[2] The homepage of description of a new variable-length key, 64-bit block cipher, http://www.counterpane.com/bfsverlag.html
[3] Patterson and Hennessy, “Computer Organization & Design: The Hardware/Software Interface”, Morgan Kaufmann, Inc., 1994.
