ISA Support for Fingerprinting and Erasure Codes

William Josephson, Ruby Lee, and Kai Li
[email protected], [email protected], [email protected]

Abstract

Using small, pre-computed tables is a well-known technique for improving the performance of expensive computations with small operands. However, as the performance gap between CPU and memory continues to increase, table lookup in main memory may no longer be beneficial. Instead of doing table lookups in memory, this paper proposes table lookup instruction support to accelerate Rabin fingerprinting and Reed-Solomon erasure coding over Galois fields. Both are core computations in emerging main-stream systems such as bandwidth optimized protocol engines, capacity optimized storage systems, and content-distribution networks. We show that the proposed instructions are both beneficial and easy to implement. A simple table lookup instruction that addresses four 256-entry tables in parallel can speed up Rabin fingerprinting and anchoring by factors of 3.1 and 2.6, respectively, and Reed-Solomon coding by a factor of 1.5.

1 Introduction

A common trend in next-generation software systems is to trade computation for storage space, network bandwidth, and reliability. In the last several years, systems making these tradeoffs have emerged in the marketplace as main-stream appliance products. Such systems typically run on a commodity server platform and require substantial CPU power. Given the compute-intensive nature of these applications, it is worth considering whether simple instruction set architecture (ISA) extensions are sufficient to substantially improve their performance. This paper considers ISA support for two kinds of core computations: Rabin fingerprinting and Reed-Solomon coding.

Recently, global compression based upon the detection and elimination of duplicate data segments has become a de facto standard technique for building bandwidth optimized network engines[15, 5], capacity optimized storage systems[11, 3], and wide-area file systems such as WAFS[14, 8]. The general method used by many of these systems is to break (anchor) the input data stream into segments (substrings) and then compute a fingerprint for each segment. The fingerprints are then used as content-based identifiers to fetch data segments and to eliminate redundant segments. Each unique segment can be further compressed with conventional compression algorithms. While these systems can achieve lossless compression rates of 10:1 to 20:1, the throughput is often limited by the performance of fingerprinting and anchoring on server platforms.

Reed-Solomon codes are the most general and best-known erasure codes for improving reliability[13]. Reed-Solomon codes can be configured for an arbitrary level of fault tolerance and are used in a wide range of applications, including disk arrays, storage devices, communication systems, and multicast and peer-to-peer protocols. From a performance standpoint, disk arrays pose the most challenges since they require high throughput. Traditional disk arrays, such as RAID-4 and RAID-5 arrays, are built with simple parity codes, but are vulnerable to multiple disk failures. Several next-generation storage arrays are designed to tolerate two or more simultaneous failures. The challenge in building such systems on commodity platforms is to support high throughput while consuming nominal CPU power, since the storage layer is often only one part of the whole system.

The state-of-the-art software method for Rabin fingerprinting and Reed-Solomon coding is to compute over small Galois fields in which additions are equivalent to exclusive-or and multiplications are computed using precomputed lookup tables[2, 9]. Using small, pre-computed tables is a well-known technique for accelerating expensive computations with small operands, but a table lookup on today's high-end processors is relatively expensive: on a modern processor, an L1 hit takes 3-4 cycles, an L2 hit 10-20, and a load from DRAM hundreds. The in-memory table lookup approach consumes substantial CPU and memory resources on today's commodity server platforms. On a capacity optimized storage system that runs global compression software and Reed-Solomon coding on a server platform with a 3.2GHz Intel Xeon processor, Rabin hash computations consume about 42% of the total CPU power: when such a system is delivering a throughput of about 60MB/sec via NFS, 14% of the CPU cycles are devoted to anchoring and 28% to fingerprinting segments[6]. Reed-Solomon coding configured to tolerate two simultaneous disk failures on the same system consumes about 5% of the CPU power to store compressed data. For other systems such as traditional filesystems and database systems, the CPU utilization would be much higher.

This paper investigates table lookup instruction support to accelerate Rabin fingerprinting and Reed-Solomon encoding. We have evaluated the proposed ISA support in two ways. First, we report the impact on the instruction counts in the inner-most loops for fingerprinting, anchoring, and Reed-Solomon coding. Second, we obtain running times by substituting existing instructions for the table lookup instruction while preserving the original data-flow dependencies and branch behavior. This method allows us to estimate the impact on real systems with real workloads. Our evaluation shows that a simple table lookup instruction can speed up anchoring by a factor of 2.6, fingerprinting by a factor of 3.1, and Reed-Solomon encoding by a factor of 1.5.

2 Fingerprinting and Anchoring

Rabin fingerprints or hashes are computed by treating the input data as a polynomial over a finite field and computing the residue of the resulting polynomial modulo a pre-determined, randomly chosen irreducible polynomial. Rabin has shown that reduction modulo a randomly chosen irreducible polynomial is a good hash function with provable collision bounds[12]. The main factors influencing the choice of a fingerprinting function are its collision resistance properties and the cost of computing the function. An ideal fingerprinting function is efficient to compute and has a collision probability that is much smaller than the error rates of hardware components. Many cryptographic hash functions meet the collision resistance criterion, but are expensive to compute. Rabin fingerprints satisfy both criteria: they are efficient to compute, and the probability that two out of n random, distinct strings of length m have the same k-bit fingerprint is bounded above by nm^2/2^k[12]. Rabin fingerprints also have the desirable property that the fingerprint of the concatenation of two segments, A and B, can be computed directly from the fingerprints for A and B alone, allowing builders to compute the fingerprints for large blocks efficiently by taking advantage of the storage hierarchy.

The standard approach to anchoring is to compute a fingerprint for every overlapping w-byte window of the data[7, 8]. Whenever a pre-selected set of k bits of each fingerprint matches a pre-determined constant, the anchoring algorithm starts a new segment. Assuming random data, the expected segment size is 2^k bytes. In practice, this approach works well, and systems builders can choose k on the basis of the desired expected segment size; they may also choose to impose an arbitrary minimum and maximum segment size. Since each anchor depends only on the content of a w-byte window of the data, the scheme is self-synchronizing in the presence of insertions, deletions, and modifications.

In addition to the collision bound, polynomial hashing has several other desirable properties for systems applications. First, the representations of a binary bit string and a polynomial over Z/2Z are identical. Second, polynomial hashes are efficiently computable: given the hash, H, of a binary string A = b_1 b_2 ... b_n, we can compute the hash, H', of A b_{n+1} = b_1 b_2 ... b_n b_{n+1} by:

    H' = b_1 x^n + b_2 x^{n-1} + ... + b_{n+1} mod p(x)    (1)
       = H x + b_{n+1} mod p(x)                            (2)

This amounts to shifting H left by one bit, inserting b_{n+1}, and, if the most significant bit of H is one, XOR-ing the result with p(x) - x^k. In practice, l > 1 bits are processed in parallel by computing a lookup table for the 2^l residues, reducing the cost to one l-bit shift, one table lookup, and one XOR for every l bits. Third, hashes for a sliding window of fixed size w are efficiently computable. Given the hash, H, of the bit string A = b_1 b_2 ... b_w, it is possible to compute the hash, H', of A' = b_2 b_3 ... b_w b_{w+1} in constant time since:

    H' = b_2 x^{w-1} + ... + b_{w+1} mod p(x)
       = (b_1 - b_1) x^w + b_2 x^{w-1} + ... + b_{w+1} mod p(x)
       = b_1 x^w + H x + b_{w+1} mod p(x)

The only change necessary to support a "rolling" hash is an additional XOR when b_1 is non-zero. As in the case of ordinary fingerprinting, it is possible to process l bits at a time using table lookup; for most applications l is eight. It is important to note that the rolling hash property enjoyed by Rabin's scheme relies on algebraic properties that most hash functions, including cryptographic hash functions such as MD5 and SHA-1, do not enjoy.
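To make the table-driven formulation concrete, the following C sketch (ours, not from the paper) implements the byte-at-a-time (l = 8) case of Equations 1-2 together with the rolling update. The 64-bit fingerprint width, table names, and window size are illustrative assumptions; T_in holds the residues that cancel the byte shifted out of the register, and T_out the residues that cancel the byte leaving the window.

    #include <stdint.h>

    #define WIN 48   /* sliding window size in bytes (example) */

    /* T_in[t]  = t(x) * x^64          mod p(x), for a degree-64 p(x) */
    /* T_out[t] = t(x) * x^(8*(WIN-1)) mod p(x)                       */
    extern uint64_t T_in[256], T_out[256];

    /* Append one byte: H' = H*x^8 + b mod p(x), i.e. Equation 2 applied
     * eight bits at a time: one shift, one lookup, one XOR. */
    static inline uint64_t fp_append(uint64_t h, uint8_t b)
    {
        uint8_t top = (uint8_t)(h >> 56);   /* byte shifted out of the register */
        return (h << 8) ^ b ^ T_in[top];
    }

    /* Rolling update over a WIN-byte window: cancel the residue of the
     * oldest byte, then append the newest one. */
    static inline uint64_t fp_roll(uint64_t h, uint8_t oldest, uint8_t newest)
    {
        return fp_append(h ^ T_out[oldest], newest);
    }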

3 Reed-Solomon Coding

Erasure codes have found wide application in data storage and transmission. Typical applications include wireless communications, high-speed modems, multicast protocols, distributed checkpointing, and disk array parity codes. Building storage systems to tolerate multiple disk failures is arguably the most challenging use of codes such as Reed-Solomon codes because of the high throughput requirements.

The codes Reed and Solomon proposed in 1960 remain the only general, space-optimal, maximum distance separable (MDS) erasure codes[13]. The erasure channel model is attractive for disk arrays since most storage devices already use error-correcting codes to detect and report media failures. As a result, a data block on an individual storage device is often assumed either to be corrupt or inaccessible, or, with very high probability, present and correct.

Figure 1: An example of encoding and decoding for a (4, 2) Reed-Solomon erasure code. The decoding case assumes that data disk D2 and coding disk C1 have failed.

An (n, m)-Reed-Solomon code encodes n data words as n + m encoded words and can recover from m erasures. Each data word is represented as an element of a finite field, most commonly GF(2^w). The parameter w is chosen to be the smallest integer such that n + m < 2^w. By computing in GF(2^w), additions and subtractions in the field are merely XOR. Multiplications and divisions are more expensive to compute, and the usual approach is to pre-compute log and anti-log lookup tables.

In a Reed-Solomon encoder, the n data words to be encoded are represented as a vector D = (D1, ..., Dn). To encode the data words, the vector D is multiplied by a distribution matrix, M. The distribution matrix must have the property that all n x n sub-matrices are invertible. In practice, M is often an (n + m) x n Vandermonde matrix. The product of M and D is an (n + m)-vector, the first n entries of which are identical to D and the last m of which are the coding words C1, ..., Cm. Figure 1 shows the encoding process for an (n, m) = (4, 2) Reed-Solomon code over the finite field GF(2^3).

Given the locations of up to m failed blocks, reconstruction proceeds by forming the n-vector E from the non-failed code words and the matrix M' from the distribution matrix M by removing the rows of M corresponding to erasures. The original data vector, D, can be recovered by inverting M' and multiplying by E.
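To make the encoding step concrete, the following minimal sketch (ours, not from the paper) computes the m check words as the product of the last m rows of M with D. The helper gf_mul stands for multiplication in GF(2^w), however it is implemented; all names are illustrative.

    #include <stdint.h>

    uint8_t gf_mul(uint8_t a, uint8_t b);   /* GF(2^w) multiply, w <= 8, assumed given */

    /* Compute check words C[0..m-1] from data words D[0..n-1] using the last m
     * rows of an (n + m) x n distribution matrix M, stored row-major. */
    void rs_encode(const uint8_t *M, const uint8_t *D, uint8_t *C, int n, int m)
    {
        for (int i = 0; i < m; i++) {
            uint8_t acc = 0;
            for (int j = 0; j < n; j++)
                acc ^= gf_mul(M[(n + i) * n + j], D[j]);  /* addition in GF(2^w) is XOR */
            C[i] = acc;
        }
    }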
4 Table Lookup Instructions

We propose adding a new functional unit to existing processor designs that supports lookups in a small table. The new instructions can accelerate fingerprinting, anchoring, and erasure codes at small incremental cost to the processor. Like many cryptographic algorithms, these algorithms use small tables extensively. Typically, these tables are stored in main memory and accessed through the usual data cache mechanisms. Although the tables are small, they are accessed very frequently and are still relatively large compared to the size of the L1 data cache on modern processors. As a result, table lookups in main memory can cause significant performance problems. Scratchpad memories have been proposed in the past and can be used to hold small tables. Typically, scratchpad memories have been treated as small memory units with memory-like addresses, whereas we propose a new functional unit in the processor chip using standard processor datapaths, i.e., two source registers and one destination register.

We first consider a Table Lookup (TLU) instruction that can perform a single table lookup in one cycle. We expect the performance enhancement to come mainly from reduced access time compared to the L1 data cache and from the elimination of cache conflict misses. We then consider a Parallel Table Lookup (PTLU) instruction, which has the additional advantage of being able to perform several independent table lookups with different indices in parallel in a single cycle.

4.1 Single Table Lookup  The single table lookup instruction is straightforward: there are two source operands, Rs1 and Rs2. The rightmost byte of source register Rs1 is used to index a single table with 256 entries, each of which is w bits wide, where w is the processor's native word size. The output of the table can be further manipulated in the TCOM combinatorial logic block before being written to the destination register, Rd. In this block, the table output can also be combined with the second source register, Rs2. For instance, Rs2 may be a mask that selects certain bits of the output and zeroes the other bits, or Rs2 can be XOR'ed with the table output (see Figure 2).

Figure 2: The TCOM Module: Inputs are taken from Rs2 and from table lookup (TLU) or the TMUX module (PTLU).

Although the TLU instruction can only look up in a single table at a time, we provide the ability to have multiple independent tables, all of the same size. Additional tables can be used independently or to simulate a single table that is a multiple of w bits wide. By default, two such tables are available. The table used for a given lookup is specified by a subfield of the instruction coding. For our applications, a single-bit subop field is sufficient to encode which of the two lookup tables is to be used. The instruction format for a table lookup is:

    TLU.ops.bK  Rs1, Rs2, Rd    (3)


Figure 3: The 32-bit ptlu Instruction

Figure 4: The TMUX Module and XMUX Control Bits

The parameter bK specifies that the TLU instruction should look up in table bank K. The subop ops specifies which function should be applied by the TCOM block. We reserve three bits for this subop field, but define only two combinatorial functions: bitwise AND and bitwise XOR. More combinatorial functions could be added in this block as the need arises.
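To pin down the semantics, here is a small C model (our sketch; the encoding of ops and all names are assumptions, not the paper's definitions) of what TLU computes:

    #include <stdint.h>

    #define TLU_XOR 0
    #define TLU_AND 1

    static uint32_t tlu_table[2][256];   /* two 256-entry banks by default */

    /* Model of TLU.ops.bK Rs1, Rs2, Rd on a 32-bit machine: index bank K with
     * the rightmost byte of Rs1, then combine the table output with Rs2 in
     * the TCOM block using XOR or AND. */
    static uint32_t tlu(unsigned ops, unsigned bank, uint32_t rs1, uint32_t rs2)
    {
        uint32_t t = tlu_table[bank][rs1 & 0xff];
        return (ops == TLU_XOR) ? (t ^ rs2) : (t & rs2);
    }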

4.2 Parallel Table Lookup  Instead of just allowing the number of table entries to grow for a single table, we can also allow parallel table lookups. To this end, we propose a PTLU instruction that performs four parallel table lookups in a single instruction, as illustrated in Figure 3. The instruction format for parallel table lookup is:

    PTLU.x.y.bK.tmux.tcom  Rs1, Rs2, Rd    (4)

Here, Rs1 supplies the indices of table entries from four tables in successive bytes. The subop tmux determines how the four table outputs are combined in the TMUX module, and the subop tcom specifies the operation of the TCOM module described above (XOR by default). The output of each table passes through the TMUX module shown in Figure 4, which also shows the available operations and their encodings. The resulting value becomes the first operand to the TCOM module and Rs2 the second. By default, the XMUXes in TMUX XOR their inputs and the TCOM module XORs the result with Rs2.

The subops x and y modify the parallel lookup by using the low x bytes of Rs1 to index the tables starting with table y instead of table 0 (tables are numbered from right to left starting at zero). What we gain from this form of the PTLU instruction is the ability to simulate having more than one "bank" of lookup tables, each of which can be indexed independently by the same bytes of the first source operand. In the case of fingerprinting, we use this approach to simulate table entries twice as wide as the hardware supports directly. One could also implement wider entries or multiple physical banks of tables. Both of these approaches eliminate the need for the input MUXes in the PTLU implementation, but do so at the cost of increasing the size of the PTLU unit. To minimize the size of the PTLU unit, we therefore prefer to restrict table entries to the native word size.

The subop bK specifies a bank of parallel tables. We reserve two bits for this subop, allowing a total of four table banks. Since additional table banks add to the cost, area, and delay of the PTLU unit, we present our results assuming only a single bank is available. If additional banks of tables are available, our algorithms can be modified to take advantage of them.
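In the same spirit, a C model of the 32-bit PTLU with the default subops (our sketch, with assumed names) performs four independent byte-indexed lookups and XOR-reduces the outputs with Rs2:

    #include <stdint.h>

    static uint32_t ptab[4][256];   /* one bank of four parallel tables */

    /* Model of a 32-bit PTLU with default tmux/tcom (XOR tree): byte i of Rs1
     * indexes table i (tables numbered right to left); the four outputs and
     * Rs2 are all XORed together. */
    static uint32_t ptlu(uint32_t rs1, uint32_t rs2)
    {
        uint32_t r = rs2;
        for (int i = 0; i < 4; i++)
            r ^= ptab[i][(rs1 >> (8 * i)) & 0xff];
        return r;
    }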


4.3 Initializing the Lookup Tables  The lookup tables can be initialized quickly with a table write instruction. We describe the version of the instruction for parallel tables since the single table version is a simpler case. The parallel table write instruction, PTW, can write a different value at the same index in each of the parallel tables. In the case of a 32-bit processor, this is 4 x 32 bits, or a single cache line worth of data. The instruction format is defined as follows:

    PTW.k.e  Rb, imm    (5)

The instruction thus defined reads the data at memory location M[Rb + imm] and writes it to entry e of the tables in table bank k. If Rb is R0, then the immediate value is written directly into the table.

In general, the tables used by the TLU and PTLU instructions are initialized once and then used for a long time. That is, the map stored in the lookup tables is assumed to be read-mostly, as it is in the applications described in this paper. Nonetheless, the overhead of saving and reloading the tables is a concern, as they are now explicitly a part of the CPU context. A number of optimizations are possible. First, a dirty bit may be associated with the tables so that they need not be saved if they have not been written; this is similar to the strategy used to cope with FPU and multimedia state on existing processors. Second, if the tables are initialized once and then treated as read-only, their contents need not be spilled as part of a context switch, since they can be reloaded from the original source in main memory. A third possibility is to expand the number of physical table banks so that each application that needs lookup tables can use a different bank. If there are not enough distinct table banks available for all of the running processes in the system, the contents of the tables may still need to be spilled.
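Continuing the model, a PTW write and a full bank load might look like the following sketch (ours; only a single bank is modeled, so the k subop is dropped, and the memory operand is passed as a pointer to one cache line of data):

    #include <stdint.h>

    static uint32_t ptab[4][256];   /* the four parallel tables, as above */

    /* Model of PTW.k.e Rb, imm: read one cache line (4 x 32 bits) and write
     * word i into entry e of table i. */
    static void ptw(unsigned e, const uint32_t line[4])
    {
        for (int i = 0; i < 4; i++)
            ptab[i][e] = line[i];
    }

    /* Loading a whole bank is then 256 table-write instructions. */
    static void load_bank(const uint32_t residues[256][4])
    {
        for (unsigned e = 0; e < 256; e++)
            ptw(e, residues[e]);
    }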

4.4 Area and Latency  The single TLU unit is quite small and fast: the simplest version, with two banks of tables, has 512 entries in total and takes 2KB of RAM on a 32-bit processor. In fact, the tables may be implemented as a 256-entry table with wider entries followed by a MUX. Although this is larger than a typical register file, it is at least a factor of four smaller than typical L1 data caches. Unlike processor register files in modern processors, which have multiple read and write ports and a lot of bypass circuitry, the table lookup unit requires only a single read port and a single write port. As a result, there should be no difficulty in implementing a TLU instruction that runs in a single cycle, where the ALU's latency is used to define one pipelined processor cycle. The area required by the unit should also be comparable to that of other processor functional units.

Both the TLU and PTLU modules apply the combinatorial logic in the TCOM module to the result of the table lookup and the second source operand. The PTLU instruction has several additional layers of combinatorial logic in the TMUX module. The timing-critical path consists of the table lookup followed by these two modules. The TMUX module consists of two levels of 4:1 multiplexors and two 2-input XOR stages (see Figure 4). The subsequent TCOM block does a simple XOR or AND operation with Rs2, followed by a 2:1 multiplexor. Assuming four levels of logic for each 4:1 multiplexor, two for the 2:1 multiplexor, and one for each of the XOR or AND stages, the total depth of the TMUX and TCOM modules is 13 logic levels. Much of this combinatorial logic can be overlapped with the table lookup step itself. Simulation studies have shown that it is possible to have the PTLU instruction operate in a single cycle if the ALU latency determines half to one processor cycle time[4]. This is usually the case in microprocessors, except for very deeply pipelined ones, where an ALU latency may take two or more cycles.

5 Using Table Lookup Instructions

This section describes how to use the proposed ISA extensions to accelerate fingerprinting, anchoring, and Reed-Solomon coding. We compare the implementations of the inner-most loops of these algorithms with and without the new table lookup instructions by counting the number of instructions executed per input byte.

5.1 Assumptions About Target Architecture  We assume that the target architecture supports two types of bit-manipulation instructions available on many, but not all, commodity processors. The first of these is the shift-right pair instruction, shrp, which shifts the concatenation of the two word-sized source operands right by an immediate operand and deposits the word-sized result in the destination register. The second is the pack instruction, which interleaves the 4-bit nibbles of the two source operands to form a result; even-numbered nibbles, starting with nibble zero, are drawn from the first source operand, odd nibbles from the second.

    ; Fingerprint in r3 & r4
loop:
    ld        (r1), r2         ; Load 8 bytes from *r1
    mov       r4, r5           ; Copy high word of hash
    shrp      $32, r3, r4, r4  ; Shift out high word
    shrp      $32, r2, r3, r3  ;  and shift in new
    ; Two sets of 4 lookups for first 32 bits
    ptlu.4.0  r5, r4, r4       ; 4 lookups at offset 0
    ptlu.4.4  r5, r3, r3       ; 4 lookups at offset 4
    shr       $32, r5, r5
    ; Another two sets of 4 lookups for second 32 bits
    ; process the remaining 32 bits
    add       $8, r1, r1
    cmp       r6, r1           ; Loop
    jne       loop

Figure 5: Fingerprinting with 64-bit PTLU

5.2 Fingerprinting  Fingerprinting with table lookup instructions is a straightforward application of Equation 2. Given a table lookup instruction that can look up in k tables in parallel, the natural loop implementation of fingerprinting can be unrolled k times. Multiplication by x^{8k} becomes a shift-right pair operation, addition is XOR, and modular reduction is accomplished via table lookup for the appropriate residues. For small choices of k, it is useful to unroll the resulting loop once again by a factor of two to further amortize the loop control overhead. Figure 5 shows the resulting assembly code for computing 128-bit Rabin hashes on a 64-bit microprocessor.

Since the hash is 128 bits but table entries are at most 64 bits wide on a 64-bit processor, the eight parallel tables are divided into two sets of four. One set contains the most significant 64 bits of the residues and the other the least significant 64 bits. By halving the number of parallel lookups, we halve the number of new bytes processed at once by the PTLU instruction. The advantage of this approach is that the two sets can be treated as separate "banks" as described in Section 4.2 above. As a result, a processor with word-sized table entries can compute Rabin hashes that are a small multiple of the processor word size without additional physical table banks. Figure 7 shows the number of instructions per input byte processed for the software and PTLU-based implementations.

    ; loop unrolled 4 times
    ; hash is in r5 (initialized outside loop)
loop:
    ld      (r1), r3        ; load leading edge
    ld      (r2), r4        ; load trailing edge
    tlu.b0  r4, r5, r5      ; subtract trailing residue
    shrp    $8, r3, r5, r6  ; insert leading edge into hash
    tlu.b1  r5, r6, r5      ; add new residue
    shr     $8, r3, r3      ; new leading byte; omit shr on last one
    cmp     $mask, r5       ; mask & test
    je      match
    ; ... 3 more unrolled iterations ...
    add     $4, r2, r2
    cmp     $end, r1        ; input exhausted?
    jne     loop

Figure 6: Anchoring with 32-bit TLU

Algorithm    128-bit Fingerprinting          32-bit Anchoring
             Inst. per Byte   Normalized     Inst. per Byte   Normalized
Software     7.25             1              16               1
TLU          4.13             1.75           7.25             2.2
PTLU         1.88             3.8            7                2.3

Figure 7: Instruction Counts for Fingerprinting and Anchoring

5.3 Anchoring  Implementing anchoring efficiently with table lookup support is slightly more involved than fingerprinting, for two reasons. The first is that anchoring involves computing a rolling hash, so two different table lookups are required: one for the residues that must be added to the current hash value as new bytes are incorporated into the window, and one for the residues that must be subtracted as old bytes fall out of the window. The second is that the termination condition is more complicated: the hash value for the current window must be checked after each input byte is processed.

Unlike fingerprinting, however, the hash value used for anchoring need not be large. A 32-bit hash is sufficient for anchoring since the size of the anchoring window is small and collisions are not catastrophic. As a result, each table used for computing hashes during anchoring can be smaller than that used for fingerprinting. Figure 6 shows the assembly for a single-table-lookup implementation of anchoring. In practice, it is still profitable to unroll the main loop by a small factor to amortize the loop control overhead. In addition to a compare and conditional branch for every input byte, the inner loop now also requires two loads per iteration: one for the leading edge of the window under consideration and one for the trailing edge. Given a sufficiently large register file and a small anchoring window, it is possible to eliminate the second load by storing the entire window in registers.

Parallel table lookup can be used to further improve the performance of anchoring; however, the benefit is small since every intermediate hash value must be computed and compared with the mask. Furthermore, since two sets of lookup tables are required, only half the parallelism available for fingerprinting is available for anchoring. Rather than compute each new rolling hash value from the previous one, we compute a new hash value from the preceding word-aligned hash value by recomputing the intermediate hash values. The PTLU instruction encoding can be used to specify the number of parallel lookups to perform and hence the number of intermediate hash values to recompute. Figure 7 shows the instruction counts for both implementations; we omit the source for PTLU-based anchoring due to space constraints.
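For reference, the anchoring loop in plain C looks like the sketch below (ours; window size, mask, and the match constant are illustrative, and fp_roll is the rolling-hash helper sketched in Section 2). It makes explicit the per-byte mask-and-test that limits the PTLU speedup:

    #include <stddef.h>
    #include <stdint.h>

    #define WIN   48        /* anchoring window in bytes (example) */
    #define MASK  0x1fffu   /* k = 13 selected bits -> 8KB expected segments */
    #define MAGIC 0x0000u   /* pre-determined constant to match against */

    uint64_t fp_roll(uint64_t h, uint8_t oldest, uint8_t newest);

    /* Return the index just past the first anchor, or len if there is none. */
    size_t find_anchor(const uint8_t *data, size_t len)
    {
        uint64_t h = 0;
        for (size_t i = 0; i < len; i++) {
            uint8_t oldest = (i >= WIN) ? data[i - WIN] : 0; /* trailing edge */
            h = fp_roll(h, oldest, data[i]);                 /* leading edge  */
            if (i + 1 >= WIN && (h & MASK) == MAGIC)         /* mask & test   */
                return i + 1;
        }
        return len;
    }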


5.4 Reed-Solomon Coding  The basic primitive required for conventional Reed-Solomon encoding is multiplication in a small finite field, GF(2^w). Recall that the number of data code words, n, plus the number of check words, m, must be strictly less than 2^w. Typical values for w are 4 and 8 since they divide the processor's word size in bits. For many storage applications, w = 4 is sufficient.

Unlike the primitives in fingerprinting and anchoring, field multiplication is a function of two operands rather than one. As a result, for k-bit operands, the size of the table required to store the function is much larger: 2^{2k} entries instead of 2^k. One common software approach is to store pre-computed log and anti-log tables and to use these tables to convert multiplication into modular addition. The disadvantage of this approach is that each multiplication requires three table lookups and a modular addition. In practice, using a full multiplication table is substantially faster. As a result, efficient software Reed-Solomon coders for RAID on existing commodity hardware store the entire pre-computed multiplication table in main memory. The inner loop of the encoding algorithm is arranged so that several bytes of the input data are multiplied by the same field element from the distribution matrix at once, improving locality of reference.

Although a large lookup table for field multiplication is acceptable in software, it is not practical for our table lookup extension since it would increase area and latency and the cost of reloading tables. Figure 8 shows the TLU implementation of field multiplication using log and anti-log tables stored in two banks. Since the table lookup unit cannot store the entire pre-computed multiplication table, and using a table with 8-bit indexes for anti-logs requires reduction modulo 2^8 - 1, the TLU-based implementation actually runs more slowly than the pure software one. One might consider working in a smaller field such as GF(2^4), as we do for PTLU below, but doing so involves additional bit manipulations that negate the advantages.
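As a concrete rendering of the log/anti-log approach just described (our sketch; the tables are assumed to be precomputed from a chosen generator g, and the doubled anti-log table is a standard trick that avoids the explicit reduction modulo 2^8 - 1):

    #include <stdint.h>

    extern uint8_t gf_log[256];   /* gf_log[g^i] = i */
    extern uint8_t gf_exp[510];   /* gf_exp[i] = g^i, doubled to skip the mod */

    /* Multiply in GF(2^8): three table lookups plus an addition of logs. */
    static inline uint8_t gf_mul(uint8_t a, uint8_t b)
    {
        if (a == 0 || b == 0)
            return 0;                          /* log(0) is undefined */
        return gf_exp[gf_log[a] + gf_log[b]];  /* g^(log a + log b)   */
    }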

    ; Log table in bank 0, anti-log table in bank 1
    tlu.b0  r1, r0, r2       ; look up log
    add     $alpha, r2, r2   ; add log of multiplicand
    cmp     $255, r2
    jle     next
    sub     $255, r2, r2     ; r2 % (2^8 - 1)
next:
    tlu.b1  r2, r0, r2       ; look up anti-log
    ror     $8, r2, r2
    ; Repeat 3 more times; omit final shr

Figure 8: Multiplication in GF(2^8) with 32-bit TLU

    pack    $even, r1, r2    ; Interleave multiplicands
    pack    r1, $odd, r3     ;  from r1 & dist. matrix
    ptlu    r2, r0, r2       ; Lookup products
    ptlu    r3, r0, r3       ;  in parallel
    shl     $8, r3, r3       ; Pack results
    pack    r2, r3, r2       ;

Figure 9: 32-bit PTLU Field Multiplication in GF(2^4)

There are at least two ways to improve the performance of the single table lookup encoding algorithm. First, one could remove the compare, jump, and subtract instructions by providing a table with 9-bit indexes. Second, one could reduce the amount of bit manipulation necessary by allowing the TLU instruction to specify the byte in the Rs1 argument that is used to index the table. Even with this extension, either more table banks or the ability to rotate the output in the TCOM block would be necessary to see a meaningful performance benefit. Since both of these solutions increase the complexity of the TLU instruction, it may make more sense to implement PTLU to support Reed-Solomon encoding.

The PTLU instruction is better suited to the Reed-Solomon field multiplication task than the TLU instruction is. Rather than store the complete multiplication table for GF(2^8), we instead store the complete multiplication table for GF(2^4), which does fit in a single 256-entry table. The input words are divided into 4-bit groups and combined with the pre-computed multiplicands from the distribution matrix. The 32-bit PTLU instruction can then be used to compute four products in parallel. The result, shown in Figure 9, requires five instructions to perform eight multiplications and process four input bytes, yielding a speedup of approximately 2.6 on a per-byte basis.
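The observation behind Figure 9 is that the full GF(2^4) multiplication table, indexed by a byte holding both 4-bit operands, fits in a single 256-entry table. A scalar C rendering of what the pack/ptlu sequence computes per nibble lane (our sketch; the table name is illustrative, and c is assumed to be a 4-bit field element):

    #include <stdint.h>

    /* gf16_mul_tab[(a << 4) | b] = a * b in GF(2^4); 256 entries in all. */
    extern uint8_t gf16_mul_tab[256];

    /* Multiply each 4-bit group of a 32-bit word by the field element c,
     * one nibble lane at a time; PTLU does four such lookups per cycle. */
    static uint32_t gf16_mul_word(uint32_t x, uint8_t c)
    {
        uint32_t r = 0;
        for (int i = 0; i < 8; i++) {
            uint8_t a = (x >> (4 * i)) & 0xf;
            r |= (uint32_t)gf16_mul_tab[(c << 4) | a] << (4 * i);
        }
        return r;
    }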

6 Implications for Real Hardware

We have implemented microbenchmarks for each of the three target applications on real hardware. Our target commodity platform is a 3GHz Pentium 4 Xeon with a 16KB L1 data cache and a 2MB L2 unified cache running FreeBSD 6.1. To estimate the performance impact of the table lookup instructions, we have also implemented versions of the microbenchmarks that simulate the performance of the proposed ISA extensions. The advantage of this approach is that it allows us to estimate the performance impact of the table lookup instruction on real hardware in the absence of cycle-accurate simulators.

Algorithm            Instr. per Byte   Normalized
Software, GF(2^8)    4.05              1.0
PTLU, GF(2^4)        1.50              2.60

Figure 10: Instruction Counts for Reed-Solomon Encoding

6.1 Methodology  Since table lookup instructions are not available on commodity hardware, we have substituted instructions with the same data-flow dependencies and cycle times and similar resource constraints to those of the proposed instructions. In particular, we assume that table lookup, shift-right pair (shrp), and pack instructions run in the variable-width shift unit of the Pentium 4, and therefore that at most one shift, shift-right pair, pack, or table lookup can execute per cycle.

Merely substituting an instruction with similar performance characteristics and the same data-flow dependencies as the original is not sufficient: we must also preserve the branch behavior to obtain meaningful results. For fingerprinting and Reed-Solomon encoding, the branch behavior of the microbenchmarks is independent of the input data. The dynamic branch behavior of the anchoring algorithm does depend on the input data, however. To account for this, we modify the input data for the anchoring simulation in such a way that the sequences of branches in the original and the simulation are identical. We achieve this by inverting the new, simulated hash function and computing an input stream such that the simulated hash function of every overlapping window of the input matches the pre-selected mask if and only if the original hash on the unmodified window does. Since the execution times of the instructions substituted for the table lookup instruction are independent of the operands, massaging the input data does not alter the performance of the simulated microbenchmark.

6.2 Results  The table in Figure 11 summarizes the estimated performance impact of the table lookup instructions.

The results show that table lookup instructions can significantly improve Rabin fingerprinting and anchoring computations. The single table lookup instruction is easy to implement, and it can improve fingerprinting by a factor of 1.76 and anchoring by a factor of 2.2. The PTLU instruction, although somewhat more complex to implement, offers a greater performance improvement: a factor of 3.1 for fingerprinting and 2.6 for anchoring. Although anchoring benefits from the parallel table lookup, the speedup is smaller than that for fingerprinting since it must perform a comparison and branch for every input byte. In the simulated version of the algorithm, the IA-32 performance monitoring counters indicate that the number of branch-related stalls is significantly larger in the PTLU implementation.

The performance improvement available from on-chip table lookup for Reed-Solomon RAID-6 encoding is smaller. Since we use a standard Vandermonde distribution matrix, the first check word is computed with a simple XOR, and more general field multiplications are necessary only to compute the second check word, significantly reducing the potential performance improvement. Moreover, available memory bandwidth constrains the performance of the encoding algorithm: even when the field multiplications are replaced with no-ops, the throughput of the algorithm only increases by a factor of two. Both the TLU and PTLU implementations of Reed-Solomon encoding are further hampered by the need to perform twice as many table lookups per byte as the software implementation, since the hardware lookup tables are much smaller.


             Best SW        TLU                       PTLU
             Cycles/byte    Cycles/byte   Speedup     Speedup
Fingerprint  11.5           6.5           1.76        3.1
Anchor       12.2           5.4           2.2         2.6
R-S Encode   3.6            10.3          0.35        1.5

Figure 11: Simulation Results on a 3.2GHz Pentium 4 Xeon

7 Related Work

Methods for computing Rabin fingerprints over Z/2Z = {0, 1} are well known[12]. Over this field, arithmetic operations become simple bit manipulations, and Rabin fingerprints can be computed with a linear-feedback shift register. Broder proposed implementing Rabin fingerprinting by using precomputed tables[2]. This approach turns the compute-intensive operations in a software implementation into memory operations, which worked well at the time.

Reed-Solomon codes are well-known erasure codes that must be computed over finite Galois fields[13]. The standard method for Reed-Solomon encoding and decoding of n data words with the ability to tolerate m failures is to use a distribution matrix derived from a Vandermonde matrix over GF(2^w), where m + n <= 2^w. In this setting, addition and subtraction again reduce to bitwise XOR, but Reed-Solomon coding requires general field multiplication. For small values of n, this can be accomplished with complete multiplication tables; in general, precomputed logarithm tables are used.

More recently, Blomer et al. have described a Reed-Solomon encoding method dubbed Cauchy Reed-Solomon coding[1] that replaces the Vandermonde distribution matrix with a Cauchy matrix and converts all encoding operations into bitwise XORs, at the cost of twice as many XOR operations and memory references per input byte. It also requires more space for internal data structures[10].

Fiskiran and Lee have studied the use of parallel lookup tables for block ciphers in cryptography[4].

8 Conclusion

This paper describes instruction set architecture support for accelerating Rabin fingerprinting, anchoring, and Reed-Solomon coding over small Galois fields. Each of these three applications shares table lookup as a core, performance-sensitive primitive. Our proposed ISA extension provides this primitive and is simple to implement. The table lookup instructions can also be used to accelerate other computations, such as symmetric Feistel ciphers and Reed-Solomon decoding; these applications also benefit from moving memory-based computations into on-chip hardware. As the performance gap between CPU and memory continues to increase, we believe it would be worth identifying and exploiting other such opportunities.

Acknowledgments

William Josephson is supported by an NSF Graduate Research Fellowship. This work was also supported in

References

[1] J. Blomer et al. An XOR-based erasure-resilient coding scheme. Technical report, International Computer Science Institute, 1995. TR-95-048.
[2] A. Broder. Some applications of Rabin's fingerprinting method. In Sequences II: Methods in Communications, Security, and Computer Science, 1993.
[3] Data Domain, Inc. Data Domain DD400 series: capacity optimized enterprise storage appliances, 2005. http://www.datadomain.com.
[4] A. M. Fiskiran and R. B. Lee. On-chip lookup tables for fast symmetric-key encryption. In Proceedings of the IEEE Int. Conf. on Application-Specific Systems, Architectures, and Processors, pages 356-363, 2005.
[5] Juniper Networks. Accelerating application performance across the WAN, 2005. http://www.juniper.net/welcome_peribit.html.
[6] E. Lee and U. Maheshwari. Private communication, 2006.
[7] U. Manber. Finding similar files in a large file system. In Proc. of the Winter 1994 USENIX Tech. Conf., Jan. 1994.
[8] A. Muthitacharoen, B. Chen, and D. Mazieres. A low-bandwidth network file system. In Proc. of the 18th ACM Symp. on Operating Systems Principles, October 2001.
[9] J. S. Plank. A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Software Practice & Experience, 27(9):995-1012, 1997.
[10] J. S. Plank. Optimizing Cauchy Reed-Solomon codes for fault-tolerant storage applications. Technical report, Dept. of Comp. Sci., Univ. of Tennessee, 2005. CS-05-569.
[11] S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In Proc. of the USENIX Conference on File and Storage Technologies, Jan. 2002.
[12] M. Rabin. Fingerprinting by random polynomials. Technical report, Harvard University, 1981. TR-15-81.
[13] I. S. Reed and G. Solomon. Polynomial codes over certain finite fields. J. Soc. Industrial Math., 8(2):300-304, 1960.
[14] Riverbed, Inc. The Riverbed optimization system (RiOS), 2006. http://www.riverbed.com.
[15] N. T. Spring and D. Wetherall. A protocol-independent technique for eliminating redundant network traffic. In Proc. of the ACM SIGCOMM, pages 87-95, August 2000.

