A NEW COMMON SUBEXPRESSION ELIMINATION ALGORITHM WITH APPLICATION IN COMPOSITE FIELD AES S-BOX

M. M. Wong and M. L. D. Wong

Swinburne University of Technology (Sarawak Campus), Jalan Simpang Tiga, 93350, Kuching, Malaysia. E-mail: [email protected]

ABSTRACT

Common subexpression elimination (CSE) is a critical procedure in many multiplierless implementations of DSP algorithms. The aim of CSE is twofold: 1) to reduce the number of logic operators used and 2) to minimize the logic depth (critical path) of the DSP algorithm implemented in VLSI. In this work, a novel hybrid heuristic CSE algorithm that combines a greedy algorithm with exhaustive search to select the best set of common subexpressions is proposed. The proposed algorithm targets area optimization in linear transformations expressed as binary matrix multiplications. Its efficiency is demonstrated through a case study in constructing a composite field implementation of the Advanced Encryption Standard (AES). Experimental results show that the proposed algorithm achieves an average area reduction of 44.09% as well as an average logic depth reduction of 47.55%.

Index Terms: Common subexpression elimination (CSE), multiple constant multiplication (MCM), substructure sharing, Advanced Encryption Standard (AES), composite field arithmetic (CFA)

1. INTRODUCTION

Common subexpression elimination (CSE) is an optimization procedure that searches for instances of identical expressions and replaces them with a single variable that holds the computed value. The identical expressions are then computed only once, which saves hardware. CSE is often used in solving the MCM problem as well as other computer arithmetic problems that require substructure sharing optimization. Multiplication of a variable by a given set of fixed-point constants can be computed with a multiplier block that consists exclusively of adders/subtractors and shifters. Generating a minimal multiplier block from the set of constants is known as the multiple constant multiplication (MCM) problem. Avoiding costly multipliers is an essential design criterion in VLSI implementation. Therefore, MCM problems are often related to digital finite impulse response (FIR) filter implementation as well as DSP linear transformations. A simple demonstration of CSE on an FIR filter is illustrated in Figure 1.

Fig. 1. CSE in FIR filter design (panels (a) to (c)).

On the other hand, CSE is also used for substructure sharing optimization in combinational logic implementation, which is similar to the MCM problem. The combinational logic is expressed as bit-level equations, in which CSE can be exploited to extract the common factors across all the equations and so reduce the area cost. An example of substructure sharing optimization using CSE is shown in Figure 2.

Fig. 2. CSE for substructure sharing (panels (a) to (c)).

In general, any CSE algorithm involves the following general steps:

1. Identify the patterns (common factors) present in the transformation.
2. Select a pattern for elimination.
3. Remove all occurrences of the selected pattern.
4. Compute the eliminated pattern only once.
5. Repeat Steps 1-4 until no pattern occurs more than once.

One of the common issues regarding the feasibility of CSE is the optimality of the algorithm. Here, optimality refers to the elimination of logic operators. However, by eliminating one pattern, we are very likely to lose other possible patterns owing to the sharing of the nonzero bits [1]. Hence, the key issue is to select, among the competing patterns, the one whose elimination results in the maximal reduction. However, this pattern selection problem (Step 2) is claimed to be NP-complete [1]. Since no single exact algorithm is available to solve this problem, previous studies as well as our work adopt a heuristic approach to obtain a solution that is as close to optimal as possible. Apart from maximum pattern reduction, our algorithm also emphasizes another important feature: minimal logic depth. Minimal logic depth is highly desirable in VLSI as it enables deep sub-pipelining, which in turn leads to higher-speed implementations.

In this paper, we propose a novel CSE algorithm that yields an optimized substructure sharing scheme for the isomorphism function and affine transformation in the composite field AES S-box. The contributions of this work are as follows. First, the best set of common subexpressions is selected by exploiting both a greedy algorithm and exhaustive search. Exhaustive search is rarely used in pattern selection because of its combinatorial complexity. In this work, appropriate search constraints are imposed to speed up the tree search so that a satisfactory runtime is achievable. Second, our algorithm guarantees the shortest logic depth, in contrast to the existing CSE algorithm for substructure sharing. Third, we show that our new algorithm introduces a significant saving in both area (logic operators) and logic depth in the substructure sharing optimization used for the composite field AES S-box. The performance of our algorithm is benchmarked against an earlier algorithm proposed in [2].

The rest of this paper is structured as follows. In Section 2, we discuss previous work on CSE algorithms. Next, a detailed description of our CSE algorithm is given in Section 3, where the reason for combining exhaustive search with the greedy algorithm is also explained. After that, the implementation of CSE in the composite field AES S-box is discussed.
Exploitation of our CSE algorithm to achieve hardware cost reduction in the isomorphism function and affine transformation is presented in Section 4. Following this, a discussion and the detailed performance (in terms of efficiency and effectiveness) of our work in comparison with an existing algorithm are presented in Section 5. Finally, comments on future work and concluding remarks are given in Section 6.

2. RELATED WORKS

To date, several studies have addressed CSE for the optimization of constant multiplications in linear systems, particularly digital filters [1,3–8]. CSE specifically for binary linear transformation (Galois field isomorphic mapping), in contrast, was introduced in [2]. That technique uses a greedy algorithm along with three other criteria in pattern selection, which differs from the algorithm proposed herein. Moreover, its iterative two-term pattern selection does not guarantee minimal logic depth. The work reported in [1], on the other hand, employs both exhaustive search and a greedy algorithm, but in different stages. The former is performed to extract all the possible patterns multiplying a single variable (pattern identification), while only the latter is used in pattern selection. In [8], a polynomial transformation of the constant multiplications is employed to enable the detection of multi-variable patterns using a rectangle covering an intersection matrix. A heuristic ping-pong algorithm is then used to obtain a good prime rectangle, which yields inferior results in certain cases. As most linear DSP systems involve signed-integer matrix multiplication, the major effort in designing CSE algorithms has been on pattern identification from either single-variable [1, 3, 7] or multiple-variable [4, 8] multiplications. Hence, the elimination strategy is very specific to the structure of their patterns. Consequently, most of the reported CSE algorithms are incapable of directly optimizing binary linear transformations.

3. PROPOSED COMMON SUBEXPRESSION ELIMINATION ALGORITHM

In this section, we provide a detailed description of our CSE algorithm, which combines a greedy algorithm and exhaustive search in an iterative two-term pattern selection. First, the binary R × R matrix-vector product is expressed as R bit-level equations. An exhaustive search for all possible 2-bit patterns in the equations is performed and the complete statistics of the pattern frequencies are deduced. Several patterns will occur more than once, so a decision criterion is needed to choose one for elimination. Here, we employ the greedy algorithm: the pattern with the highest frequency of occurrence is selected for elimination. The question then is which pattern to choose when more than one pattern shares the same (highest) number of occurrences. Random selection among them may not lead to the best gate reduction. Therefore, exhaustive search is applied in addition to the greedy algorithm: all patterns that have the highest occurrence, and their consequent pairings, are derived. Every occurrence of the selected pattern is then renamed as a new 4-bit variable. The process iterates until no 2-bit pattern is found. The second stage follows similar logic, but instead of the original 2-bit patterns, the newly named 4-bit patterns are paired in order of priority. Again, the paired factors are renamed as a new 8-bit variable. This process repeats until only a single variable or one nonzero bit remains in each bit-level equation. The constructed tree is searched and the path that results in the least gate count is chosen.
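The pattern statistics described above can be illustrated with a minimal Python sketch. It assumes that each bit-level equation is represented as a set of variable names whose XOR forms one output bit; the function name `count_pairs` and the toy equations are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations

def count_pairs(equations):
    """Count how many equations contain each two-term pattern."""
    freq = Counter()
    for terms in equations:
        for pair in combinations(sorted(terms), 2):
            freq[pair] += 1
    return freq

# Toy bit-level equations (illustrative only): each set is an XOR of its terms.
equations = [
    {"x1", "x2", "x3"},
    {"x1", "x2", "x4"},
    {"x2", "x3", "x4"},
]

freq = count_pairs(equations)
f_max = max(freq.values())
# Patterns tied at the highest frequency: the greedy rule alone cannot rank them.
candidates = sorted(p for p, f in freq.items() if f == f_max)
print(f_max, candidates)
```

In this toy case three patterns share the highest frequency, which is the situation where a purely greedy choice is arbitrary and a search over the tied candidates becomes useful.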
The number of renaming operations gives the total number of adders needed, while the number of iterations in the second stage plus one gives the logic depth. The shortest achievable logic depth, P, of a Q-term equation satisfies

$$2^{P-1} < Q \le 2^{P}, \quad P = 1, 2, 3, \ldots \tag{1}$$
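As a worked illustration of (1), the bound can be evaluated numerically. The short Python sketch below (function names are hypothetical) computes the smallest P satisfying 2^(P-1) < Q ≤ 2^P and contrasts it with the depth of a purely serial chain of two-input additions, which is Q - 1.

```python
def tree_depth(q):
    """Smallest P with 2**(P-1) < q <= 2**P, for q >= 2 terms."""
    if q < 2:
        raise ValueError("at least two terms are needed for one addition")
    return (q - 1).bit_length()

def serial_depth(q):
    """Depth when the q terms are added one after another."""
    return q - 1

for q in (2, 3, 4, 5, 8, 9):
    p = tree_depth(q)
    assert 2 ** (p - 1) < q <= 2 ** p  # inequality (1)
    print(q, p, serial_depth(q))
```

For example, an 8-term equation needs depth 3 as a balanced tree but depth 7 as a serial chain.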

P can be obtained if and only if the subexpression (pattern) additions of the equations are fully arranged in a tree structure rather than a serial structure. Our algorithm fulfils this condition. The proposed CSE algorithm is summarized as follows:

1. Express the binary matrix-vector product as bit-level equations.
2. Initialize N_i = 2 (N is the bit size of the pattern), with i = 0 (i is the iteration number).
3. Perform an exhaustive search for all possible N_i-bit patterns in all bit-level equations and deduce the occurrence, f, of each pattern.
4. Determine the largest f value, f_max.
5. All patterns having f_max serve as tree nodes.
6. Every N_i-bit pattern with f_max is eliminated and renamed as a new 2N_i-bit pattern separately.
7. Repeat Steps 3-6 until no N_i-bit pattern is found.
8. Let N_{i+1} = 2N_i.
9. Repeat Steps 3-8 until only a single variable or one nonzero bit remains in each bit-level equation.
10. Traverse the constructed tree and determine the path with the shortest depth (minimal gate count).
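The selection logic in the steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' MATLAB implementation: it performs greedy two-term elimination with exhaustive branching over patterns tied at f_max, but it counts all two-term patterns in one pass and omits the staging by pattern size (Steps 2, 8 and 9), so it does not reproduce the logic depth guarantee. All function names and the toy equations are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_counts(equations):
    """Frequency of every two-term pattern over all equations."""
    freq = Counter()
    for eq in equations:
        freq.update(combinations(sorted(eq), 2))
    return freq

def eliminate(equations, pair):
    """Replace every occurrence of `pair` by one new shared variable."""
    a, b = pair
    shared = f"({a}^{b})"
    return [(eq - {a, b}) | {shared} if a in eq and b in eq else eq
            for eq in equations]

def xor_count(equations, exhaustive=True, used=0):
    """Total 2-input XORs after repeated two-term elimination.

    exhaustive=True branches over all patterns tied at f_max;
    exhaustive=False keeps only the first tied pattern (plain greedy).
    """
    freq = pair_counts(equations)
    f_max = max(freq.values(), default=0)
    if f_max < 2:  # nothing left to share: sum the terms of each equation
        return used + sum(max(len(eq) - 1, 0) for eq in equations)
    tied = sorted(p for p, f in freq.items() if f == f_max)
    if not exhaustive:
        tied = tied[:1]
    return min(xor_count(eliminate(equations, p), exhaustive, used + 1)
               for p in tied)

# Toy equations (illustrative only): each frozenset is an XOR of its terms.
equations = [frozenset("abc"), frozenset("abd"),
             frozenset("ace"), frozenset("bdf")]
baseline = sum(len(eq) - 1 for eq in equations)
greedy = xor_count(equations, exhaustive=False)
hybrid = xor_count(equations, exhaustive=True)
print(baseline, greedy, hybrid)  # 8 7 6
```

In this toy case the first tied pattern (a, b) destroys the two other tied patterns, so plain greedy saves one gate, whereas branching over the ties finds the two compatible patterns and saves two. Because the exhaustive variant always includes the greedy choice among its branches, its result can never be worse than the greedy one.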

4. CASE STUDY: SUBSTRUCTURE SHARING IN COMPOSITE FIELD AES S-BOX

The Rijndael Advanced Encryption Standard (AES) is an encryption standard selected by NIST in 2001 to replace the Data Encryption Standard. The Rijndael AES algorithm is a symmetric block cipher that performs four transformations, namely SubBytes (commonly referred to as the S-box), ShiftRows, MixColumns and AddRoundKey, throughout the encryption process. Composite field arithmetic (CFA) has been widely used in designing an optimized combinatorial S-box circuit (computation of multiplicative inversion and affine transformation) to mitigate the performance bottleneck in hardware architectures for AES. CFA (see [9–12]) can be built iteratively from the lower-order fields, allowing simpler mathematical manipulation to be performed in the lower field rather than in the original higher-order field. Hence, by using CFA, one can compute the multiplicative inverse for AES over a subfield (either GF(2^4) or GF(2^2)) rather than the initial field, GF(2^8). Therefore, an isomorphism function and its inverse are required for mapping the elements in GF(2^8) to their equivalents in the subfield and vice versa. These transformations are essential steps performed before and after any CFA manipulation. The isomorphism function and its inverse, as well as the affine transformation, are binary linear transformations, which are carried out by means of a vector product with an 8 × 8 binary matrix, as in (2),

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_8 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{18} \\ a_{21} & a_{22} & \cdots & a_{28} \\ a_{31} & a_{32} & \cdots & a_{38} \\ \vdots & \vdots & \ddots & \vdots \\ a_{81} & \cdots & \cdots & a_{88} \end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_8 \end{bmatrix} \tag{2}$$

with all the vector elements x, y and constant matrix entries a ∈ {0, 1}. A potential optimization is to express the multiplication as bit-level equations to save the cost of multipliers. Substructure sharing can then be performed to share the common factors among the equations. Note that, in CFA, addition is implemented as an XOR operation.

Most of the previous work on designing compact AES S-boxes focused on deriving the optimal composite field that results in the minimal multiplicative inversion circuit. However, it is worth noting that optimizing the binary linear transformations of the composite field AES S-box also contributes a significant hardware saving. Therefore, it is crucial to determine an optimized isomorphism function and affine transformation to ensure a minimal AES S-box as a whole. In this work, we apply our proposed CSE algorithm to achieve minimal complexity (minimal additions with minimal logic depth) for the isomorphism and its inverse together with the affine transformation. Deriving the isomorphism function requires a root β (an element of the respective subfield) that satisfies q(β) = 0, where q(x) is the field polynomial specified for AES. In addition, for each root β there exist seven other conjugates, which can be used to derive isomorphisms for the same composite field. Hence, we apply our CSE to all eight possible binary linear transformations for a single composite field AES S-box architecture.
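To illustrate how a matrix-vector product of the form (2) turns into bit-level XOR equations, the following Python sketch expands a small binary matrix row by row and reports the XOR count and logic depth before any sharing. The 4 × 4 matrix is a made-up example, not one of the AES isomorphism matrices; the function names are hypothetical, and the depth figure assumes each equation is summed as a balanced tree.

```python
def bit_level_equations(matrix):
    """Row i of a 0/1 matrix becomes y_i = XOR of the x_j with a_ij = 1."""
    return [[f"x{j + 1}" for j, a in enumerate(row) if a] for row in matrix]

def cost_without_cse(equations):
    """XOR gates and logic depth when each equation is built on its own."""
    gates = sum(max(len(eq) - 1, 0) for eq in equations)
    depth = max((len(eq) - 1).bit_length() for eq in equations if eq)
    return gates, depth

def apply_matrix(matrix, x):
    """Reference product over GF(2): y_i = sum_j a_ij * x_j mod 2."""
    return [sum(a & b for a, b in zip(row, x)) % 2 for row in matrix]

# Made-up 4 x 4 binary matrix (illustrative only).
A = [
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]

equations = bit_level_equations(A)
for i, eq in enumerate(equations, start=1):
    print(f"y{i} = " + " ^ ".join(eq))
gates, depth = cost_without_cse(equations)
print(gates, depth)  # 7 2
```

Here the four output bits need 2 + 2 + 3 + 0 = 7 XOR gates when computed independently; the shared terms between rows are what a CSE pass would then try to exploit.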

5. IMPLEMENTATION PERFORMANCE AND COMPARISON

The optimization of the binary linear transformations achieved by the proposed algorithm is tabulated in Table 1. Based on the experimental results, our approach gave an average area saving of 44.09% and an average logic depth reduction of 47.55%, which is nearly half of the original complexity. Furthermore, our results are benchmarked against the performance reported for an earlier CSE method proposed in [2]. Their algorithm gave a slightly larger average area saving (45.96%) than ours. However, the logic depth reduction it achieved was a mere 16.35%. In general, their algorithm was effective in promoting maximal area minimization, but this was traded off against a higher logic depth. This higher logic depth may become the bottleneck for the achievable clock frequency of the overall system. The higher logic depth is attributed to the fact that their algorithm arranges the pattern additions in both serial and tree architectures. A performance comparison of both algorithms is summarized in Figure 3.

Table 1. Performance analysis of complexity (total logic operator and logic depth) reduction achievable by using our CSE algorithm and the one reported in [2]. Optimizations are performed in isomorphism function and its inverse together with affine transformation for 8 possible roots β for a composite field AES S-box.

Logic Operator (achieved XOR gates)
                 β1     β2     β3     β4     β5     β6     β7     β8     Average
Without CSE      59     59     63     63     79     79     61     61     65.50
Our CSE          35     35     36     36     40     40     35     34     36.38
Reduction (%)    40.68  40.68  42.86  42.86  49.37  49.37  42.62  44.26  44.09
CSE in [2]       34     34     35     36     38     38     32     34     35.13
Reduction (%)    42.37  42.37  44.44  42.86  51.90  51.90  47.54  44.26  45.96

Logic Depth (achieved XOR gates)
                 β1     β2     β3     β4     β5     β6     β7     β8     Average
Without CSE      11     11     11     11     13     13     11     11     11.50
Our CSE          6      6      6      6      6      6      6      6      6.00
Reduction (%)    45.45  45.45  45.45  45.45  53.85  53.85  45.45  45.45  47.55
CSE in [2]       9      9      9      9      11     11     9      10     9.63
Reduction (%)    18.18  18.18  18.18  18.18  15.38  15.38  18.18  9.09   16.35

Fig. 3. Performance comparison between our CSE and previous work as summarized from Table 1. (Vertical axis: total logic operator and logic depth in XOR gates; horizontal axis: roots β1 to β8; series: total gate and logic depth deduced from our CSE and from the CSE in [2].)

In the second experiment, we investigated the advantage of combining exhaustive search with the greedy algorithm in the pattern selection process of our CSE. This hybrid mechanism may not produce the optimal solution, but its performance is guaranteed to be better than, or at the very least the same as, that of a sole greedy algorithm. Using MATLAB, we implemented two versions of our proposed algorithm (one with exhaustive search and one without) for optimizing the complexities of the binary linear transformations in a composite field AES S-box. The resulting performance of both versions in terms of effectiveness (total operator elimination) and efficiency (total time elapsed) was observed. The minimum and maximum time elapsed (in seconds), as well as the average and standard deviation, for both versions are tabulated in Table 2. Although the hybrid algorithm has a higher complexity, exhaustive search is still worthwhile considering the additional logic operator reduction gained and the fact that the runtime remains within a satisfactory period.

Table 2. Performance in terms of time elapsed (sec) deduced from our hybrid CSE technique and our CSE with sole greedy algorithm in pattern selection. The optimization algorithms are performed in binary linear transformations in composite field AES S-box. The simulation is repeated for 10 runs.

Our Work: time elapsed (sec)
                     β1     β2     β3     β4     β5     β6     β7     β8     Average
Minimum              0.58   0.60   0.54   0.54   0.43   0.44   4.81   4.03   1.50
Maximum              0.66   0.90   0.60   0.58   0.46   0.46   6.06   4.84   1.82
Standard Deviation   0.03   0.09   0.02   0.01   0.01   0.01   0.38   0.24   0.10
Average              0.62   0.64   0.56   0.56   0.45   0.45   4.99   4.17   1.56
Achieved XOR gate    35     35     36     36     40     40     35     34     36.38

Greedy Algorithm: time elapsed (sec)
                     β1     β2     β3     β4     β5     β6     β7     β8     Average
Minimum              0.02   0.02   0.02   0.02   0.02   0.02   0.03   0.02   0.02
Maximum              0.02   0.03   0.02   0.02   0.03   0.03   0.04   0.02   0.03
Standard Deviation   0.00   0.00   0.00   0.00   0.00   0.00   0.01   0.00   0.00
Average              0.02   0.02   0.02   0.02   0.02   0.03   0.03   0.02   0.02
Achieved XOR gate    38     38     38     38     40     41     37     35     38.13

From Table 2, our hybrid algorithm took an average of 1.56 s to complete, while the sole greedy algorithm required only a fraction of a second (0.02 s on average). This is expected, as our algorithm is based on the greedy algorithm and the extra time is contributed by the exhaustive search mechanism. However, this cost is justified by the additional operator reduction gained (36.38 versus 38.13 XOR gates on average). It is worth noting that the same complexity was obtained by both versions for the root β5. This can occur under three possible circumstances. First, only one pattern with the highest occurrence is found at every stage. Second, the optimal path derived from exhaustive search lies on the leftmost side of the tree (the first node of every branch), which is the same as the solution derived using the sole greedy algorithm. Third, all the paths derived by the exhaustive tree search lead to substructure sharing of the same complexity. Nevertheless, the truly optimal result is only achievable if a pure exhaustive search is used in finding the optimal factoring, as proposed by Canright in [13]. However, the biggest drawback of this method is that the search time is not deterministic and may range from a fraction of a microsecond to weeks, due to the combinatorial complexity of the algorithm [13]. Considering that development time is a critical requirement in the VLSI design cycle, it is undesirable to consume excessive design time in exchange for the reduction of a few operators. As such, we attempt to maximize both the efficiency and the effectiveness of the algorithm by introducing a CSE that combines both the greedy algorithm and exhaustive search.

6. CONCLUSION

In this work, a novel heuristic common subexpression elimination (CSE) algorithm for binary linear transformation was presented. The proposed CSE algorithm combines a greedy algorithm and exhaustive search in pattern selection in order to achieve maximal logic gate reduction for substructure sharing optimization in the composite field AES S-box. It was demonstrated that our CSE algorithm is capable of removing nearly half of the original complexity in the binary linear transformations of the composite field AES S-box. Furthermore, we have shown that combining exhaustive search with the greedy algorithm improves the performance of the greedy algorithm. This additional effort is worthwhile, as additional logic reduction is gained within a reasonable runtime. In addition, iterative two-term elimination is performed in such a way that minimal logic depth is achieved. Our experimental results further showed that the proposed CSE algorithm outperforms a previously proposed algorithm in terms of logic depth reduction. In future work, we will extend our CSE algorithm to solve MCM problems as found in digital filter design.

7. REFERENCES

[1] R. Pasko, P. Schaumont, V. Derudder, S. Vernalde, and D. Durackova, “A new algorithm for elimination of common subexpressions,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 18, no. 1, pp. 58–68, 1999.
[2] S. Hsiao, M. Chen, and C. Tu, “Memory-free low-cost designs of Advanced Encryption Standard using common subexpression elimination for subfunctions in transformations,” IEEE Trans. Circuits Syst. I, vol. 53, no. 3, pp. 615–626, 2006.
[3] R. I. Hartley, “Subexpression sharing in filters using canonic signed digit multipliers,” IEEE Trans. Circuits Syst. II, vol. 43, no. 10, pp. 677–688, 1996.
[4] M. Mehendale, S. D. Sherlekar, and G. Venkatesh, “Synthesis of multiplier-less FIR filters with minimum number of additions,” in IEEE/ACM International Conference on Computer-Aided Design (ICCAD-95), 1995, pp. 668–671.
[5] R. Pasko, P. Schaumont, V. Derudder, and D. Durackova, “Optimization method for broadband modem FIR filter design using common subexpression elimination,” in Tenth International Symposium on System Synthesis, 1997, pp. 100–106.
[6] S. Vijay, A. P. Vinod, and E. M. K. Lai, “A greedy common subexpression elimination algorithm for implementing FIR filters,” in IEEE International Symposium on Circuits and Systems (ISCAS), 2007, pp. 3451–3454.
[7] M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, “Multiple constant multiplications: efficient and versatile framework and algorithms for exploring common subexpression elimination,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 15, no. 2, pp. 151–165, 1996.
[8] A. Hosangadi, F. Fallah, and R. Kastner, “Common subexpression elimination involving multiple variables linear DSP synthesis,” in 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004, pp. 202–212.
[9] M. Hasan, “Efficient computation of multiplicative inverses for cryptographic applications,” in Proc. IEEE ISIT, 2001, pp. 66–72.
[10] C. Paar, “Some remarks on efficient inversion in finite fields,” in Proc. IEEE ISIT, 1995, pp. 5–8.
[11] J. L. Fan and C. Paar, “On efficient inversion in tower fields of characteristic two,” in Proc. IEEE ISIT, 1997, p. 20.
[12] D. R. Wilkins, “Part III: Introduction to Galois Theory,” in World Wide Web http://www.ercangurvit.com/abstractalgebr/galois.pdf, 2000.
[13] D. Canright, “A very compact Rijndael S-box,” Naval Postgraduate School, Tech. Rep. NPS-MA-04-001, 2005.
