Implementation Complexity for Achieving Bit Permutations in One or Two Cycles in Superscalar Processors

Xiao Yang and Ruby B. Lee
Princeton Architecture Laboratory for Multimedia and Security (PALMS)
Princeton University, Princeton, NJ 08544
{xiaoyang, rblee}@princeton.edu

Abstract—The increasing importance of secure information processing in publicly accessible Internet and wireless networks poses new challenges in the architecture of future general-purpose processors. Symmetric-key cryptography algorithms are an important class of algorithms used to achieve confidentiality. Many of them use a category of operations that are not well supported by today’s word-oriented microprocessors: bit-oriented permutations. In this paper, we show how arbitrary bit permutations within a word can be achieved in just one or two cycles. This improves upon the O(n) instructions needed to achieve any one of n! permutations of n bits in existing RISC processors; it also improves upon recent work that achieves this in O(log(n)) instructions and cycles. This paper contributes two new architectural solutions, one with only microarchitecture changes, and another with ISA support as well. Slightly enhanced 4-way and 2-way superscalar microprocessors can achieve arbitrary 64-bit permutations in one or two cycles. We also define new architectural concepts for (s,t) datapaths and functional units, and grouped instructions.

Keywords—cryptography; bit permutation; superscalar; processor architecture; instruction set architecture (ISA); microarchitecture

I. INTRODUCTION

The increasing use of the publicly accessible Internet and wireless networks in modern society poses a growing need for secure communications, computations, and storage. To provide basic security functions like data confidentiality, data integrity, and user authentication, different classes of cryptography algorithms can be used with security protocols at the network, system, or application level. Not only do network transactions need to be encrypted, authenticated, and digitally signed; all data and programs may also need these security functions. Symmetric-key cryptography algorithms achieve confidentiality of messages transmitted over public networks, and of data stored on disks, by encrypting the data. Examples are DES (Data Encryption Standard [1]) and AES (Advanced Encryption Standard [2]). As secure computing paradigms become more pervasive, it is likely that such cryptographic computations will become a major component of every processor’s workload, provided they can be accomplished quickly and painlessly. Understanding the new requirements of secure information processing is important for the design of all future programmable processors, whether general-purpose, application-specific, or embedded. In this paper, we especially target the needs of high-performance microprocessors.

Bit-oriented operations are very commonly used in symmetric-key cryptographic algorithms, but are not well supported in existing processors. Previously, the only bit-oriented operations that were frequent in general-purpose workloads were bit-parallel operations like AND, OR, XOR and NOT. These are very effectively performed in parallel on all bits in a word, in a single processor cycle. Symmetric-key cryptography introduces a new requirement: bit-level permutations. Arbitrary permutations of the bits within a word are indeed a new challenge for word-oriented processors. Previously, processors only supported a very restricted subset of bit-level permutations known as rotations, where every bit in the n-bit word is moved by the same shift amount, with wrap-around. While some n-bit permutations can be accomplished with fewer instructions, a generic way to achieve any n-bit permutation takes O(n) instructions, which is unacceptably slow. Recent work has reduced this to O(log(n)) instructions. Because of serial data dependencies among these O(log(n)) instructions, the execution time taken is also O(log(n)) cycles. The key contribution of this paper is to show that the execution time can be reduced from O(log(n)) cycles to one or two cycles for achieving an arbitrary bit permutation, with very small incremental complexity in the control of a superscalar machine. There is essentially no increase in the datapath complexity. We present two architectural alternatives for achieving arbitrary n-bit permutation instructions in one or two cycles, one with and the other without instruction-set architecture (ISA) changes. Since the incremental control complexity would be greatest for an out-of-order superscalar machine, rather than an in-order machine, we describe in detail the microarchitectural changes required. We feel that this essentially answers, in the affirmative, the question of whether bit permutations can be performed efficiently in word-oriented processors, since we have now shown how any one of n! permutations can be achieved in at most one or two cycles in a typical 4-way or 2-way superscalar processor. While this execution time could possibly be further reduced, there is no need to perform bit permutations in less than one cycle, i.e., the time taken by an ALU operation of the same width. Another contribution of this paper is the definition of new architectural concepts for describing microarchitectural features and extending the ISA. We define (s,t) functional units and datapaths and grouped instructions. These are very useful to expose the capabilities of microarchitectural implementations, and the needs of new instructions in the ISA.
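To make the contrast concrete, the following C sketch (ours, not taken from any cited ISA) shows the one bit permutation that conventional ISAs already support directly, a rotation, next to a generic bit permutation coded with ordinary shift, mask and OR instructions, which visibly costs on the order of n operations for an n-bit word:

    #include <stdint.h>

    /* Rotation: every bit moves by the same amount, with wrap-around. */
    static uint64_t rotl64(uint64_t x, unsigned s)
    {
        s &= 63;
        return (x << s) | (x >> ((64 - s) & 63));
    }

    /* Generic 64-bit permutation with conventional instructions:
     * perm[i] gives the source bit index of result bit i. O(n) work. */
    static uint64_t permute64_generic(uint64_t x, const uint8_t perm[64])
    {
        uint64_t r = 0;
        for (int i = 0; i < 64; i++)
            r |= ((x >> perm[i]) & 1ULL) << i;
        return r;
    }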

Princeton University Department of Electrical Engineering Technical Report CE-L2003-002, May 2003.

In Section II, we describe past work on permutation instructions, including how recent past work has reduced the time taken to achieve any n-bit permutation down from O(n) to O(log(n)) instructions and cycles. In Section III, we propose two new architectural methods for further bringing this down to one or two cycles. One method is purely microarchitectural, and the other involves new ISA. In Section IV, we describe a baseline out-of-order machine and the changes in the datapath and control path needed to implement our two methods. In Section V, we discuss performance, and we conclude in Section VI.

II. PAST WORK

Performing bit-level permutations has been a hard problem for word-oriented general-purpose processors. In the past, this has been achieved with conventional logical and shift instructions, which can achieve any one of n! permutations but take O(n) cycles. Alternatively, table lookup methods are used to achieve the permutation. However, this is limited to a few fixed permutations due to the memory required for each permutation. Cache misses can also cause permutations to take many more cycles than the actual number of instructions required to look up a table of permuted results. More recently, permutation instructions have been introduced into certain microprocessors as multimedia ISA extensions to handle the rearrangement of subwords packed in registers. Examples are MIX and PERMUTE in HP’s MAX-2 [3], VPERM in Motorola’s AltiVec [4], and MIX and MUX in IA-64 [5]. However, these instructions can only handle subword sizes down to eight bits. They do not provide a general solution for arbitrary bit-level permutations. Very recently, researchers have tackled the general bit permutation problem and defined new permutation instructions that can achieve any n-bit permutation with only log(n) instructions. Several approaches were proposed. The CROSS [6] and OMFLIP [7] permutation instructions perform the equivalent function of two stages of a “virtual” Benes or omega-flip network. With a sequence of log(n) CROSS or OMFLIP instructions, we can build a 2log(n)-stage virtual network of a butterfly network followed by an inverse butterfly network, or an omega network followed by a flip network. This can achieve any one of the n! possible permutations of the n input bits. Another approach was taken by the GRP instruction [8]. It partitions the data bits into left and right parts. At most log(n) GRP instructions are sufficient to achieve any one of n! permutations [8]. CROSS, OMFLIP and GRP all achieve arbitrary 64-bit permutations in six instructions. A third approach involves specifying the order of the indices of the bits in the permuted result. For 64-bit permutations, PPERM [8] specifies eight 6-bit indices per instruction, taking eight instructions to specify the 64 indices of the permuted result. In [9], a pair of SWPERM and SIEVE instructions is defined; the former specifies four bits of each of these 6-bit indices, while the latter specifies the remaining two bits of each 6-bit index. Although eleven instructions are required to perform a full 64-bit permutation, the cycles needed can be reduced from eleven to four in a 4-way superscalar machine. Unfortunately, CROSS, OMFLIP and GRP cannot achieve speedup on superscalar machines, due to the strict data dependency between the six permutation instructions needed for arbitrary 64-bit permutations.

In this paper, we show how this data dependency can be overcome, so that arbitrary 64-bit permutations can be achieved in one or two cycles, rather than six, on high-performance machines.

III. ACHIEVING ARBITRARY 64-BIT PERMUTATIONS IN ONE OR TWO CYCLES

On analyzing why the sequence of log(n) CROSS or OMFLIP instructions cannot be executed in fewer than log(n) cycles in a superscalar machine, we find that the main performance limiter is the ISA instruction format, which supports only two operands and one result per instruction. An n-bit permutation needs 1+log(n) operands, each n bits wide. The log(n) words of configuration specification are supplied in log(n) data-dependent instructions to avoid saving state between permutation instructions. We support the stateless implementation since this reduces context-switch and operating-system overhead, but we question the ISA constraint on the amount of data that can be sent to or produced by a functional unit due to standard RISC instruction formats. If we could send all 1+log(n) operands to a permutation functional unit (PU) in a single instruction, would the latency through this unit take more than one cycle? Although the definition of what constitutes a cycle is elusive, we assume it is defined relative to the latency of an ALU. A butterfly network with log(n) stages has a latency comparable to that of an n-bit adder, which also has about log(n) stages. If an n-bit add can be performed in one pipeline cycle, so can a butterfly network, which has simpler stages, because each stage contains just one level of two-to-one muxes (see Figure 1). A permutation functional unit containing a 64-input butterfly network with log(n)=6 stages can achieve three of the six steps needed for a 64-bit permutation, so that an entire 64-bit permutation can be done in two cycles, rather than six (see Figure 5). However, in order to utilize such a functional unit, we must be able to supply one 64-bit data source and 3*64 configuration bits to it in one cycle. This paper describes, in detail, two architectural techniques for accomplishing this. More generally, we show how ISA limitations can be overcome, so that a 4-operand functional unit can be accommodated, to achieve higher permutation performance in superscalar machines.

Figure 1 An 8-bit butterfly network. Each stage has one level of 2-1 muxes
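For readers who want to see the stage structure in executable form, the following C sketch models one stage of a 64-bit butterfly network and a full 6-stage pass. The pairing order (distance 32 down to 1) and the mapping of configuration bits to bit pairs are one plausible convention, not necessarily the exact one used by the CROSS or OMFLIP instructions; the function names are ours.

    #include <stdint.h>

    /* One butterfly stage: bits i and i+dist are swapped when the
     * corresponding configuration bit is 1, passed straight otherwise.
     * Each stage of a 64-bit network consumes 32 configuration bits. */
    static uint64_t butterfly_stage(uint64_t data, uint32_t cfg, int dist)
    {
        uint64_t out = data;
        int c = 0;                          /* index into the 32 cfg bits */
        for (int base = 0; base < 64; base += 2 * dist) {
            for (int i = base; i < base + dist; i++, c++) {
                if ((cfg >> c) & 1) {       /* swap bits i and i+dist */
                    uint64_t lo = (data >> i) & 1;
                    uint64_t hi = (data >> (i + dist)) & 1;
                    out &= ~((1ULL << i) | (1ULL << (i + dist)));
                    out |= (hi << i) | (lo << (i + dist));
                }
            }
        }
        return out;
    }

    /* One full pass through a 6-stage 64-bit butterfly network. */
    static uint64_t butterfly_pass(uint64_t data, const uint32_t cfg[6])
    {
        int dist = 32;
        for (int s = 0; s < 6; s++, dist >>= 1)
            data = butterfly_stage(data, cfg[s], dist);
        return data;
    }

Each stage is just one level of 2-1 muxes, which is consistent with the claim above that a full 6-stage pass has a latency no greater than that of a 64-bit add.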

A. Architectural definitions

We first define the new terms: (s,t) functional units and datapaths, and grouped instructions.

(s,t) functional units. We define an (s,t) functional unit in a word-oriented processor as a functional unit that takes s word-sized operands and produces t word-sized results. We define a standard functional unit as a (2,1) functional unit.

(s,t) datapaths. We define an (s,t) datapath in a word-oriented processor as a datapath with s source buses and t destination buses connected to functional units. If the datapath contains a register file used by the functional units, this has s read ports and at least t write ports for results coming from the functional units in a given processor cycle. In general, a k-way multi-issue processor has a (2k,k) datapath, supporting the simultaneous execution of k standard (2,1) functional units each cycle.

Grouped instructions. A sequence of consecutive instructions is defined as grouped instructions if the whole sequence can be executed simultaneously by an (s,t) functional unit. We observe that a k-way multi-issue processor with a (2k,k) datapath can support the execution of grouped instructions with operands of up to 2k words and results of up to k words. We will show in the next section that this can be done with relatively minor changes to the pipeline control logic. For example, a 2-way superscalar machine with a (4,2) datapath can support a (2,2), (3,1), (3,2), (4,1), or (4,2) functional unit. These functional units in turn support the execution of grouped instructions. Method 1 identifies grouped instructions dynamically with microarchitecture techniques, while Method 2 employs ISA techniques to identify grouped instructions statically.

B. Method 1: microarchitecture group detection

The definition of a permutation instruction is:

PERM rd,rc,rs

where rs contains the data source, rc contains configuration bits, and rd is the result. PERM can be either a CROSS or an OMFLIP instruction. Figure 2(a) shows a 64-bit permutation specified with a sequence of six PERM instructions. These can form two groups of three instructions each. Each group provides the data source and three sets of configuration bits for a (4,1) PU. The group is dynamically detected, and then its three instructions are transformed into two "internal" instructions. The first one supplies the data source and one set of configuration bits, and the second one provides the other two sets of configuration bits. When all the source operands for the grouped instructions are ready, they are issued for execution simultaneously on one (4,1) PU. Hence, the six instructions for a 64-bit permutation can be transformed into four internal instructions in two groups and executed over two cycles. If there are two or more (4,1) PUs, we can pipeline the execution of the two groups and achieve a throughput of one permutation per cycle. The microarchitecture changes needed for the pipeline control are described in Section IV.

(a) Original:
PERM rd,rc1,rs
PERM rd,rc2,rd
PERM rd,rc3,rd
PERM rd,rc4,rd
PERM rd,rc5,rd
PERM rd,rc6,rd

(b) Intermediate:
PERM,c rd,rc1,rs
PERMcont rd,rc2,rc3
PERM,c rd,rc4,rd
PERMcont rd,rc5,rc6

Figure 2 Instruction transformation for method 1
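As a purely illustrative software model of one such group, the sketch below shows what a single (4,1) PU invocation consumes: one data word plus three configuration words. We assume here that the three 64-bit configuration registers split into the six 32-bit stage settings of the butterfly pass sketched after Figure 1, low half first; the actual CROSS/OMFLIP bit assignment may differ, and the second group of a full permutation would use the inverse-butterfly PU instead.

    /* Model of one (4,1) PU invocation, i.e., one 2-instruction group:
     * PERM,c rd,rc1,rs supplies rs and rc1; PERMcont rd,rc2,rc3 supplies
     * the remaining two configuration words.
     * Uses butterfly_pass() from the sketch after Figure 1. */
    static uint64_t pu41_execute(uint64_t rs, uint64_t rc1,
                                 uint64_t rc2, uint64_t rc3)
    {
        uint32_t cfg[6] = {
            (uint32_t)rc1, (uint32_t)(rc1 >> 32),
            (uint32_t)rc2, (uint32_t)(rc2 >> 32),
            (uint32_t)rc3, (uint32_t)(rc3 >> 32),
        };
        return butterfly_pass(rs, cfg);
    }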

C. Method 2: new ISA for group identification

Only two RISC instructions should be sufficient to supply four operand words (one 64-bit data source and 3*64 configuration bits) to a (4,1) PU in one cycle [10]. Hence, in the second method, we enhance the conventional RISC instruction encoding by introducing two new subop bits, gs and gc, for identifying instructions which start a group (gs=1) or continue a group (gc=1). The meanings of these two bits are shown in Table I.

TABLE I. MEANINGS OF GS AND GC BITS

gs=0, gc=0: Normal instruction, not part of a group
gs=1, gc=0: Start of a group
gs=0, gc=1: Continuation instruction in a group
gs=1, gc=1: Reserved
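A trivial decode of these two bits, read straight from Table I (the type and function names are illustrative, not part of the proposed ISA):

    /* Classification of an instruction by its gs/gc subop bits. */
    typedef enum { NORMAL, GROUP_START, GROUP_CONT, RESERVED } group_kind_t;

    static group_kind_t classify_subop(unsigned gs, unsigned gc)
    {
        if (gs && gc) return RESERVED;      /* gs=1, gc=1: reserved           */
        if (gs)       return GROUP_START;   /* gs=1, gc=0: start of a group   */
        if (gc)       return GROUP_CONT;    /* gs=0, gc=1: continuation       */
        return NORMAL;                      /* gs=0, gc=0: normal instruction */
    }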

The permutation instruction is defined as:

PERM,subop rd,rs1,rs2

where subop contains the gs and gc bits. It is a normal (ungrouped) permutation instruction when neither gs nor gc is set. If gs is set, this instruction is the first instruction in a 2-instruction group, supplying the data source and one set of configuration bits to the (4,1) PU. If gc is set, it is the second instruction in a group, supplying two sets of configuration bits to the (4,1) PU. We specify this 2-instruction group as follows:

PERM,gs rd,rs,rc1
PERM,gc rd,rc2,rc3

(In the assembly code, gs or gc is listed if it is set, and not listed otherwise.) Method 2 does not need the dynamic group detection and instruction transformation required in Method 1. It can also help reduce static code size, since only four permutation instructions (rather than six) are required in the program. Similar to Method 1, when all the source operands for the two grouped instructions are ready, they are issued for execution together on one (4,1) PU.

IV. DETAILED MICROARCHITECTURAL CHANGES

We now consider the implementation complexity of the two methods. The underlying microarchitecture of the processor can be single-issue or multi-issue (superscalar), in-order or out-of-order. For a low-end single-issue processor, the performance of log(n) cycles for an n-bit permutation should be sufficient.

However, a high-end processor may want to improve this performance further, using one of our two methods. Hence, we first describe a superscalar, out-of-order processor as the baseline microarchitecture. We then detail the changes that must be made to its datapath and control path in Sections B and C, respectively. The changes needed by an out-of-order processor also bound the incremental control complexity of our two methods.

A. Baseline microarchitecture

Figure 3(a) shows a standard 2-way superscalar RISC processor with a (4,2) datapath, i.e., four register read ports, two write ports, and the associated data buses and bypass paths. It has at least two standard (2,1) Arithmetic Logic Units (ALUs) which can execute simultaneously. Figure 4 shows the pipeline frontend of a generic out-of-order superscalar processor [11]. A block of instructions is fetched from the instruction cache. These instructions are then decoded and their operands renamed before entering the issue window. They are issued for execution when all their source operands and required functional units become available. Certain stages of the pipeline may take multiple cycles. For an in-order issue processor, there are no rename or select stages.

B. Changes to the datapath

Figure 3(b) shows a (4,1) permutation unit (PU) added to the standard (4,2) datapath of a 2-way superscalar processor. Figure 5 shows one implementation of the PU based on the butterfly network. There are two separate PUs: one contains a 6-stage butterfly network and the other contains an inverse butterfly network. In a 2-way processor, we have both of them in the datapath, but only one PU is used at a time, resulting in a 2-cycle latency and a throughput of one permutation per two cycles. In a 4-way or wider processor, we can use both PUs in parallel. Then we can pipeline the permutation operation and achieve a throughput of one permutation per cycle (see Figure 6).

We observe that inclusion of a (4,1) functional unit in a 2-way superscalar processor causes minimal datapath overhead of one additional result multiplexor. All the expensive register ports, data buses and bypasses required have already been provided by the (4,2) datapath of the 2-way superscalar machine. The same holds for the inclusion of two (4,1) functional units in a 4-way superscalar processor. Furthermore, we need not consider a permutation unit with more than four operands; two (4,1) PUs leveraging the (8,4) datapath of a 4-way superscalar machine are sufficient to achieve the ultimate performance of a different 64-bit permutation every cycle. A key benefit of our solution is leveraging the existing resources of today’s microprocessors, essentially all of which are at least 2-way superscalar. Next, we show that even the required control path changes are minimal.

Figure 3 (a) Standard 2-way superscalar RISC processor datapath; (b) with a (4,1) PU added

Figure 5 One implementation of the (4,1) permutation FU

Figure 4 Generic out-of-order superscalar processor pipeline frontend


Figure 6 Two (4,1) permutation FUs in 4-way superscalar processor

C. Changes to the control path


Figure 7 Modified superscalar processor pipeline frontend. Sequence detection unit, code transformer and extra mux are not needed for method 2

Method 1 requires some modifications to the pipeline control frontend, as shown in Figure 7. The sequence detection unit detects sequences of three permutation instructions. These sequences are then transformed to groups of two instructions by the code transformer. The muxes pick the correct inputs to the issue window between the original instructions and the transformed instructions. A 1-bit field reserved for a C-bit is added to each entry in the instruction window to denote whether the corresponding instruction and the following one are in a group. The wakeup/select logic is also modified so that the two grouped instructions can be woken up and executed together. For Method 2, since instruction groups are explicitly identified by subop bits in the instructions, the sequence detection unit, code transformer and mux can be removed. The rest of the control path is the same as for Method 1.


Instruction wakeup. Figure 10 shows the modified wakeup logic needed by both methods to wake up the two grouped instructions together. This is necessary because otherwise the two instructions might be issued for execution separately, producing the wrong result. Normally, an instruction is ready to issue when both of its source operands are ready. The modifications ensure that grouped instructions become ready only when all the source operands in the group are ready. For the first instruction in a group (its own C-bit set), we check the ready bits of its two source operands and the two source operands of the following instruction. For the second instruction in a group (previous instruction's C-bit set), we check the ready bits of its two source operands and the two source operands of the previous instruction. If the instruction is not in a group, it is woken up normally.


Figure 9 Code transformer transforms 3-instruction sequence to 2-instruction group



Figure 8 Functions of the sequence detection unit

Figure 10 Modified instruction wakeup logic. The ready signal for issue-window entry i is:

Iready_i = (C_{i-1} & rdy1_{i-1} & rdy2_{i-1} & rdy1_i & rdy2_i)
         | (C_i & rdy1_i & rdy2_i & rdy1_{i+1} & rdy2_{i+1})
         | (!C_{i-1} & !C_i & rdy1_i & rdy2_i)
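The same ready condition, expressed as a small C helper for one issue-window entry (a sketch only; the signal names follow Figure 10, and the real logic is combinational hardware rather than software):

    #include <stdbool.h>

    /* Wakeup condition for entry i, given its neighbours i-1 and i+1. */
    static bool instr_ready(bool C_prev, bool rdy1_prev, bool rdy2_prev,
                            bool C_i,    bool rdy1_i,    bool rdy2_i,
                            bool rdy1_next, bool rdy2_next)
    {
        bool second_of_group = C_prev && rdy1_prev && rdy2_prev
                                      && rdy1_i && rdy2_i;
        bool first_of_group  = C_i && rdy1_i && rdy2_i
                                   && rdy1_next && rdy2_next;
        bool ungrouped       = !C_prev && !C_i && rdy1_i && rdy2_i;
        return second_of_group || first_of_group || ungrouped;
    }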

Group sequence detection. Dynamic group detection is needed only by Method 1. The group sequence detection unit (Figure 8) recognizes three consecutive permutation instructions in a fetch block that satisfy two criteria: they have the same opcode, and the data source operand of each permutation instruction is the result of the previous permutation instruction. It then sets the C-bits of the first instructions of the detected group sequences. For simplicity, sequences spanning two fetch blocks are not recognized, to avoid keeping additional state.
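A minimal sketch of this detection predicate, assuming the decoded fields of the PERM rd,rc,rs format are available after decode (the struct and field names are illustrative, not the actual hardware interface):

    #include <stdbool.h>

    typedef struct { bool is_perm; int opc; int rd, rc, rs; } decoded_t;

    /* True if three consecutive decoded instructions form a group:
     * identical PERM opcodes, and each data source (rs) is the previous
     * instruction's destination (rd). */
    static bool detect_group(const decoded_t *a, const decoded_t *b,
                             const decoded_t *c)
    {
        return a->is_perm && b->is_perm && c->is_perm
            && a->opc == b->opc && b->opc == c->opc
            && b->rs == a->rd && c->rs == b->rd;
    }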


Instruction transformation. The code transformer is also needed only by Method 1. It transforms the detected 3-instruction sequences into 2-instruction groups (see Figure 2). The two new instructions are generated according to the C-bits produced by the sequence detection unit and the renamed operands of the original three instructions. The code transformer replaces the data operand in the second instruction with the configuration operand from the third instruction before discarding the third instruction. Then, it updates the C-bits in the newly generated instructions. An instruction that has its C-bit set starts a group. Grouped instructions are adjacent in the issue window. Figure 9 shows the functions of the code transformer.
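A sketch of the transformation itself, following Figure 2 and the description above (the uop record and opcode constants are illustrative; in hardware this is a small amount of muxing on the renamed operand fields):

    typedef struct { int opc; int rd, rs1, rs2; int cbit; } uop_t;

    enum { OPC_PERM_C = 1, OPC_PERMCONT = 2 };  /* illustrative encodings */

    /* Collapse a detected sequence  PERM rd,rc1,rs ; PERM rd,rc2,rd ;
     * PERM rd,rc3,rd  into the two internal instructions of Figure 2(b). */
    static void transform_group(const uop_t seq[3], uop_t out[2])
    {
        out[0]      = seq[0];       /* PERM,c rd,rc1,rs : data + 1st config */
        out[0].opc  = OPC_PERM_C;
        out[0].cbit = 1;            /* C-bit marks the start of the group   */

        out[1]      = seq[1];       /* PERMcont rd,rc2,rc3                  */
        out[1].opc  = OPC_PERMCONT;
        out[1].rs2  = seq[2].rs1;   /* data operand replaced by rc3         */
        out[1].cbit = 0;
        /* seq[2] is discarded; its configuration operand now rides in out[1]. */
    }

Here rs1 holds the configuration register and rs2 the data register of the original PERM rd,rc,rs encoding.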

Instruction select. We can modify the select logic for the two existing ALUs, ALU1 and ALU2, to handle the permutation unit as well. This is achieved by adding C-bit propagation to the original select logic for ALU1 and ALU2, plus two small control units, as shown in Figure 11(a). Assume the select logic for ALU1 selects instruction i. Control unit 1 tests the C-bit of i. If i's C-bit is set, i.e., i is the first instruction in a group, then both instructions i and i+1 are granted and the select logic for ALU2 is bypassed. If i's C-bit is not set, then i is granted and we proceed to the select logic for ALU2. Suppose instruction j is selected for ALU2. Control unit 2 then tests the C-bit of j. If j's C-bit is set, no instruction is granted (ALU2 is left idle). Otherwise we grant instruction j. Alternatively, we can add a new set of select logic for the PU, which deals only with instructions with C-bits set, while the select logic for ALU1 and ALU2 deals with normal instructions. The arbitration unit picks the result of either the select logic for ALU1 and ALU2 or the new select logic (see Figure 11(b)).
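The grant decision described above can be summarized in the following sketch (the slot indices, the -1 "nothing selected" convention, and the helper names are illustrative; sel_alu1 and sel_alu2 stand for the unmodified select logic):

    /* grants[0..1]: either {ALU1 grant, ALU2 grant} for normal operation,
     * or {first, second} instruction of a group steered to the (4,1) PU. */
    static void select_with_groups(const int cbit[], int nslots,
                                   int (*sel_alu1)(void), int (*sel_alu2)(void),
                                   int grants[2])
    {
        grants[0] = grants[1] = -1;
        int i = sel_alu1();
        if (i >= 0 && cbit[i]) {    /* control unit 1: i starts a group     */
            grants[0] = i;          /* grant i and i+1 together to the PU   */
            grants[1] = (i + 1 < nslots) ? i + 1 : -1;
            return;                 /* bypass the select logic for ALU2     */
        }
        grants[0] = i;              /* normal grant to ALU1                 */
        int j = sel_alu2();
        if (j >= 0 && cbit[j])
            return;                 /* control unit 2: would split a group,
                                       so leave ALU2 idle this cycle        */
        grants[1] = j;              /* normal grant to ALU2                 */
    }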


Figure 11 (a) Select logic with modifications to the original select logic for ALU1 and ALU2; (b) Select logic with a new set of logic for the PU

If there are multiple issue queues, such as proposed in [11], we can devise an instruction steering method so that the two permutation instructions in a group are dispatched to the same queue. This is easy to achieve because the two grouped instructions are adjacent. If the C-bit of the instruction at the head of a queue is set, we grant this instruction together with the following one.

D. Complexity and delay of control path modifications

The modifications to the control path consist of a small amount of combinational logic, estimated at a few thousand gates for a 4-way superscalar processor. For comparison, the issue logic of the Compaq Alpha 21264 processor, a 4-way superscalar RISC processor, contains about 141,000 transistors [12], making the complexity of our modifications negligible. In terms of delay, the sequence detection unit and the code transformer run in parallel with the decode and rename logic. Due to their simple functions, they should have no impact on the processor cycle time. The modifications to the issue (wakeup and select) logic may increase the cycle time, since this logic is already on the critical path for back-to-back execution of dependent instructions. Many methods have been proposed to reduce the latency of issue logic, either by simplifying the instruction issue logic [11,13,14,15] or by breaking wakeup/select into multiple stages [16,17], in order to achieve fast instruction scheduling. By incorporating these methods, we can integrate our modifications without affecting the processor cycle time.

E. In-order issue processor and single-issue processor

In an in-order issue processor, because instructions are executed strictly in order, there is no need for register renaming and instruction selection. For the first method, we may need to insert a stage after instruction decode to perform the instruction transformation. This is not necessary for the second method. For both methods, care must be taken at instruction issue because the two grouped instructions have to be issued together.

Both of our methods are designed for superscalar execution. They should also handle the low-end single-issue case correctly. For the first method, because the original instructions in a sequence do not have to be executed together, they will execute correctly in a single-issue processor. For the second method, however, we have to add extra logic to break the two grouped instructions back into three instructions in series so that they can be executed separately. In fact, we only need to break the second instruction into two instructions. However, this may add one more stage to the simple pipeline.

V. PERFORMANCE

Although very fast bit permutation instructions will most benefit new ciphers and enhanced existing ciphers [18], they can also speed up important existing ciphers. Table II illustrates the performance of our new architecture with four DES test cases. DES is a widely used symmetric-key algorithm and has been the NIST standard for data encryption for the last 20 years [1]. A new key is generated for each of the 16 rounds of computation on each 64-bit block of data to be encrypted. We test DES encryption and key generation with respect to the best software program on existing processors, which uses table lookup to perform bit permutations (columns a and b). We then test DES with respect to an enhanced ISA that has an OMFLIP permutation instruction [7] added to it (columns c and d).


TABLE II. SPEEDUP OF EXECUTION TIME

                           a. DES enc   b. DES key   c. DES enc   d. DES key
Std 2-way vs. 1-way            1.49         1.04         1.50         1.19
Our 2-way vs. 1-way            1.89        17.64         1.70         1.42
Our 2-way vs. std 2-way        1.27        17.04         1.13         1.19

We implement these programs for a generic 64-bit RISC processor. First, we find the execution time, in cycles, of the programs running on a single-issue processor with one set of (2,1) functional units. Each set of functional units consists of a 64-bit ALU, a 64-bit shifter, and a 64-bit permutation unit (for columns c and d). Second, the same programs are executed on a standard 2-way superscalar processor with two sets of (2,1) functional units. Third, we simulate the programs on an enhanced 2-way superscalar processor with our new (4,1) butterfly permutation unit as detailed in Sections III and IV. Either Method 1 or Method 2 can be used for the DES programs (columns a-d). The cache parameters used in the DES simulations are a 16-Kbyte L1 data cache and a 256-Kbyte unified L2 cache, with 10-cycle and 50-cycle miss penalties, respectively.

The first row of Table II shows the speedup of a standard 2-way superscalar processor over a single-issue processor. The second row shows the speedup of our enhanced 2-way superscalar processor with a (4,1) permutation functional unit over a single-issue machine. In all cases, our new architecture achieves greater speedup over single-issue execution than the standard 2-way superscalar processor. This is highlighted in the third row, which shows the additional speedup provided by our methods over a standard 2-way superscalar processor. For DES, the performance gain is very pronounced for key generation (17X speedup in column b), where permutation operations are more frequent than for encryption. The speedup of our 2-way methods is smaller when compared to the enhanced ISAs (columns c and d) than when compared to existing ISAs (columns a and b).

This is because the introduction of new permutation instructions (CROSS or OMFLIP) in columns c and d already yields a huge speedup over the table lookup method (columns a and b). The number of instructions for a 64-bit permutation is reduced from over 20 to at most six, and most of the memory accesses are also eliminated, resulting in far fewer cache misses. Even then, our 2-way execution with a (4,1) PU achieves an additional speedup of 13% to 19% by further reducing the cycles needed for a 64-bit permutation from six to two. This type of performance improvement is important for a high-end server doing many encryptions and key generations simultaneously for clients. Furthermore, the latency through the (4,1) butterfly unit is less than that through a (2,1) OMFLIP or CROSS unit, allowing faster cycle times.

VI. CONCLUSIONS

This paper makes several new contributions. First, we present two architectural solutions for achieving arbitrary 64-bit permutations in one or two cycles. This is a significant result, since general and fast bit permutation is one of the very few remaining unsolved mathematical problems in Instruction Set Architecture. It is certainly one of the most challenging in terms of achieving single-cycle performance in microprocessor architectures. In fact, we show how the ultimate performance of a different 64-bit permutation every cycle can be achieved by a 4-way superscalar processor with minimal incremental control complexity and essentially no additional datapath complexity. Most high-end microprocessors today are at least 4-way superscalar. For 2-way superscalar processors, any arbitrary 64-bit permutation can be achieved in at most two cycles. It is reasonable to assume that most microprocessors will be at least 2-way superscalar in the future; this is even true today.

Our software general-purpose solution for achieving permutations is much more powerful than a hardware special-purpose circuit solution; the latter can only achieve a few statically-defined permutations, while our solution can achieve all possible dynamically-defined permutations. Furthermore, the incremental cost is minimal: we leverage common microarchitecture trends like superscalar processors, overcoming existing ISA limitations while achieving the highest performance. Our result is also significant because it implies that word-oriented processors have no problem supplying very high (single-cycle) performance for even the most challenging bit-oriented processing like arbitrary bit permutations. Cryptographers can use bit permutations freely in their new algorithms if microprocessor architectures include these bit permutation instructions.

Beyond solving the problem of fast bit permutations in word-oriented processors, we also define the concepts of (s,t) functional units and datapaths, and grouped instructions. The (2k,k) datapath of a k-way superscalar processor gives us the flexibility of supporting all grouped instructions and functional unit sizes covered by these existing datapath resources.

We have also shown that the control path modifications needed to support such grouped instructions are minimal when compared to the complex pipeline control in typical superscalar, out-of-order machines. These new architectural concepts can lead to higher performance in areas other than single-cycle bit permutations and cryptography acceleration.

VII. REFERENCES

[1] National Bureau of Standards (NBS), “Data Encryption Standard (DES) - FIPS Pub. 46-2”, December 1993.
[2] National Institute of Standards and Technology (NIST), “Advanced Encryption Standard (AES) - FIPS Pub. 197”, November 2001.
[3] Ruby B. Lee, “Subword Parallelism with MAX-2”, IEEE Micro, Vol. 16, No. 4, pp. 51-59, August 1996.
[4] Keith Diefendorff, Pradeep K. Dubey, Ron Hochsprung, and Hunter Scales, “AltiVec Extension to PowerPC Accelerates Media Processing”, IEEE Micro, Vol. 20, No. 2, pp. 85-95, March/April 2000.
[5] “IA-64 Application Developer’s Architecture Guide”, Intel Corp., May 1999.
[6] Xiao Yang, Manish Vachharajani, and Ruby B. Lee, “Fast Subword Permutation Instructions Based on Butterfly Networks”, Proceedings of SPIE Media Processors 2000, pp. 80-86, January 2000.
[7] Xiao Yang and Ruby B. Lee, “Fast Subword Permutation Instructions Using Omega and Flip Network Stages”, Proceedings of the International Conference on Computer Design, pp. 15-22, September 2000.
[8] Zhijie Shi and Ruby B. Lee, “Bit Permutation Instructions for Accelerating Software Cryptography”, Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors, pp. 138-148, July 2000.
[9] John P. McGregor and Ruby B. Lee, “Architectural Enhancements for Fast Subword Permutations with Repetitions in Cryptographic Applications”, Proceedings of the International Conference on Computer Design, pp. 453-461, September 2001.
[10] Withheld for blind review.
[11] Subbarao Palacharla, Norman P. Jouppi, and James E. Smith, “Complexity-Effective Superscalar Processors”, Proceedings of the 24th Annual International Symposium on Computer Architecture, pp. 206-218, 1997.
[12] James A. Farell and Timothy C. Fischer, “Issue Logic for a 600-MHz Out-of-Order Execution Microprocessor”, IEEE Journal of Solid-State Circuits, Vol. 33, Issue 5, pp. 707-712, May 1998.
[13] Soner Onder and Rajiv Gupta, “Superscalar Execution with Direct Data Forwarding”, Proceedings of the 1998 ACM/IEEE Conference on Parallel Architectures and Compilation Techniques, pp. 130-135, 1998.
[14] Dana S. Henry, Bradley C. Kuszmaul, Gabriel H. Loh, and Rahul Sami, “Circuits for Wide-Window Superscalar Processors”, Proceedings of the 27th Annual International Symposium on Computer Architecture, pp. 236-247, 2000.
[15] Ramon Canal and Antonio Gonzalez, “A Low-Complexity Issue Logic”, Proceedings of the 14th International Conference on Supercomputing, pp. 327-335, 2000.
[16] Jared Stark, Mary D. Brown, and Yale N. Patt, “On Pipelining Dynamic Instruction Scheduling Logic”, Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pp. 57-66, 2000.
[17] Mary D. Brown, Jared Stark, and Yale N. Patt, “Select-Free Instruction Scheduling Logic”, Proceedings of the 34th ACM/IEEE International Symposium on Microarchitecture, pp. 204-213, December 2001.
[18] Ruby B. Lee, Ron Rivest, Zhijie Shi and Yiqun Yin, “Cryptographic Properties of Bit Permutation Operations and Applications in Cipher Design”, submitted for publication.
