Bipartite Implementation of the Residue Logarithmic Number System

Mark G. Arnold and Jie Ruan

Lehigh University, 19 Memorial Drive West, Bethlehem, PA, USA

ABSTRACT

The Logarithmic Number System (LNS) has area and power advantages over fixed-point and floating-point number systems in some applications that tolerate moderate precision. LNS multiplication/division require only addition/subtraction of logarithms. Normally, LNS is implemented with ripple-carry binary arithmetic for manipulating the logarithms; however, this paper uses carry-free residue arithmetic instead. The Residue Logarithmic Number System (RLNS) has the advantage of faster multiplication and division. In contrast, RLNS addition requires table lookup, which is its main area and delay cost. The bipartite approach, which uses two tables and an integer addition, is introduced here to optimize RLNS addition. Using the techniques proposed here, RLNS with dynamic range and precision suitable for MPEG applications can be synthesized. Synthesis results show that bipartite RLNS achieves area savings and shorter delays compared to naïve RLNS.

Keywords: logarithmic number system, residue number system, interpolation, bipartite table, multipartite table

1. INTRODUCTION

Previous research shows the Logarithmic Number System (LNS) has area and power advantages over the fixed-point number system (scaled integer arithmetic) for applications that tolerate low-precision real arithmetic, such as MPEG decoding,3,4 MPEG encoding27,28 and N-body simulation.6,21 In conventional LNS, multiplication, division and exponentiation are easy to implement at low cost, since they need only fixed-point binary addition, subtraction and multiplication, respectively. However, LNS addition requires a table-lookup operation, so its implementation has area that grows exponentially with precision. Conventional LNS overcomes this by using interpolation based on binary multiplication to approximate the LNS addition function.1,20 Besides the table-lookup part, LNS addition implementations also contain fixed-point adders/subtractors.

The Residue Number System (RNS) is a number system with short delay for addition and multiplication of integers. The problem with RNS is that operations like overflow detection and scaling (division by a constant) can be difficult, making real (fixed- or floating-point18) arithmetic expensive. Both LNS23 and RNS8 have been shown to offer low power at high speed. Arnold1,5 introduced RNS into LNS, calling the combination the Residue Logarithmic Number System (RLNS). This combination differs from the similar-sounding Logarithmic Residue Number System (LRNS),26,35 which uses the log-like properties of index calculus to implement an integer-only system that has as much difficulty dealing with reals as RNS. In contrast, the RLNS designs discussed here reduce the delay for real LNS multiplications, divisions and additions, because RNS reduces the delay for the fixed-point additions that implement these LNS operations.
The problem is that the difficulty of residue scaling makes multiplier-based interpolation, and thus higher precision, unattractive for RLNS, despite the fact that integer multiplication is an inexpensive operation in residue. The bipartite approach is a widely-used approximation method30 for reducing the area of table-lookup implementations. Most commonly, bipartite has been used to implement reciprocal and reciprocal-square-root approximations on commercial binary floating-point processors; however, Pickett24 originally suggested bipartite as a means for LNS implementation, and Detrey and de Dinechin12 implemented HDL libraries (since enhanced by Garcia et al.15) using a generalization of bipartite with conventional LNS that depends on binary fixed-point internally.

Further author information: (Send correspondence to M.G.A.) M.G.A.: E-mail: [email protected], Telephone: 1 610 758 3285; J.R.: E-mail: [email protected].

This paper introduces the bipartite approach into RLNS to reduce the area and delay for addition compared to naïve table lookup. The problem is that bipartite was envisioned for binary arithmetic, where bit extraction is free. Despite needing extra effort to overcome this problem, bipartite is a viable implementation for RLNS because it avoids the difficulty of scaling that plagues multiplier-based interpolation. Section 2 briefly reviews the conventional Logarithmic Number System (LNS). Section 3 reviews the bipartite approach to reduce the area for table-lookup functions using binary arithmetic. Section 4 explains the properties of the Residue Number System relevant here. Section 5 defines the Residue Logarithmic Number System (RLNS). Section 6 introduces the bipartite approach into RLNS to reduce the area and delay for RLNS addition. Section 7 shows the synthesis results for the naïve and proposed RLNS addition approaches. Section 8 gives conclusions and points out future work.

2. LOGARITHMIC NUMBER SYSTEM (LNS)

The conventional Logarithmic Number System (LNS)1,19,31 represents a number by its exponent to a certain base b. Most applications need signed LNS values, in which one bit holds the sign of the represented number and the other bits hold this exponent value. The choice of b = 2 means that the precision of the system depends on the number of fractional bits, F, used to represent this fixed-point exponent.20,33 Let X be an arbitrary real value, xs be the sign of this value, x be the fixed-point exponent (which is the internal LNS representation manipulated by the binary arithmetic hardware), and X̄ be the approximate value that results from this manipulation, which for b = 2 we can summarize as X̄ = (−1)^xs · 2^x. Some unsigned LNS applications, like the N-body simulation21 which won the Gordon Bell Prize in 1999, are able to neglect xs. Typically, we want to minimize the relative error |(X − X̄)/X| by rounding x = ⌊2^F · log2|X| + 0.5⌋ / 2^F, although other rounding modes are possible.2 There is an isomorphic way23 to view these definitions in which b is selected to be a number very close to one, such as b = 2^(2^−F), so that x = ⌊logb|X| + 0.5⌋ is an integer. Thus the mapping from the signed LNS representation (xs, x) to the real value X̄ it represents is

    X̄ = (−1)^xs · b^x.    (1)
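As an illustration only (the function names and the choice F = 4 are ours, not the paper's), the rounding rule x = ⌊2^F · log2|X| + 0.5⌋ / 2^F and the mapping (1) can be sketched in Python:

```python
import math

def lns_encode(X, F=4):
    """Quantize a nonzero real X to signed LNS: a sign bit and a fixed-point base-2 exponent."""
    xs = 0 if X >= 0 else 1
    x = math.floor(2**F * math.log2(abs(X)) + 0.5) / 2**F  # round log2|X| to F fractional bits
    return xs, x

def lns_decode(xs, x):
    """Map the signed LNS representation (xs, x) back to the real value it approximates."""
    return (-1)**xs * 2**x

# encoding 3.0 with F = 4 keeps the relative error within the quantization step
xs, x = lns_encode(3.0)
assert xs == 0 and x == 25 / 16
assert abs((3.0 - lns_decode(xs, x)) / 3.0) < 2**-4
```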

The integer x has a finite machine representation, ranging between xmin and

    xmax = M + xmin − 1,    (2)

where M is a suitable constant. In conventional two's-complement binary, M is a power of two, and xmin is the negative of the next smaller power of two. Later in this paper, we will make a different selection for xmin and xmax. For applications of interest, the only requirement is to choose xmin and xmax so that overflow is avoided.

The multiplication of two numbers simply forms the sum of the two numbers' exponent parts; it needs only a fixed-point adder and an XOR gate, compared to the higher cost of a fixed-point multiplier. Division is analogous, requiring only a fixed-point subtractor. However, the addition of two numbers is not a simple operation in LNS. The basic method for signed LNS addition4,31 is:

1. z = y − x and zs = xs ⊕ ys;
2. If zs = 0, w = sb(z); otherwise w = db(z);
3. u = x + w.

xs and ys denote the sign bits of the two operands. In Step 1, the calculation of z = y − x is equivalent to calculating z = logb(Y/X). Step 2 is a table lookup. Tables sb(z) and db(z) correspond to addition or subtraction according to zs:

• sb(z) = logb(1 + b^z).
• db(z) = logb|1 − b^z|.

Thus, the final result is u = logb|X(1 ± Y/X)| = logb|X ± Y|. The complexity of the above algorithm is reduced for unsigned LNS, in which case we can neglect zs and do not need the db table. This paper will consider such an sb-only system, suitable for applications like the GRAvity PipelinE (GRAPE) designed by Makino.21

The most straightforward LNS implementation is to provide a table of M words that looks up sb(z) for all possible zs. We will refer to this as the naïve approach. It is clear that the naïve approach makes the cost of LNS implementation quite expensive for larger word sizes. Since M grows exponentially with the number of bits in a conventional binary representation, only very small implementations are economical with naïve tables. Several approaches have been proposed to improve this situation. One of the most effective is interpolation,20 in which we separate the binary word z = zH + zL into its high part, zH, and its low part, zL. The hardware then approximates the function using a polynomial whose coefficients come from a smaller table addressed only by zH, for example with a linear1 approximation:

    sb(z) = sb(zH + zL) ≈ sb(zH) + C(zH) · zL.    (3)

There are a variety of methods to choose the slope C(zH), such as Lagrange interpolation. The memory reduction of interpolation comes at the expense of having to provide a fixed-point multiplier in the hardware. A hidden aspect of (3) is that C(zH) is a fractional value between 0.0 and 1.0. To carry out this computation in fixed-point requires scaling the tabulated slope to be an integer c(zH) = ⌊C(zH) · K⌋, which can be done a priori. K is usually chosen to be a power of two for ease of implementation using binary arithmetic. Interpolation also requires scaling the product at run time:

    sb(z) = sb(zH + zL) ≈ sb(zH) + ⌊c(zH) · zL / K⌋.    (4)

Equation (4) poses no problem in binary, as this scaling occurs at no cost by shifting the product log2(K) bit positions to the right (assuming K is a power of two).
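To make the contrast concrete, here is a hedged Python sketch (our own scaled-integer model, not the paper's hardware) of a naïve sb table next to the linear interpolation of (4), with K a power of two and a simple secant choice of slope:

```python
import math

F = 4              # fractional bits: z is a scaled integer with real value z / 2**F
K = 16             # interpolation divisor, a power of two so division is a shift
b = 2.0

def sb_exact(z_int):
    """sb(z) = log_b(1 + b**z) on the scaled-integer grid, rounded to the grid."""
    z = z_int / 2**F
    return round(2**F * math.log(1 + b**z, b))

# naive approach: one table word per representable z
naive = {z: sb_exact(z) for z in range(-256, 1)}

def sb_interp(z_int):
    """Linear interpolation: small tables addressed by zH plus one multiply and a shift."""
    zH = z_int - (z_int % K)             # high part (a multiple of K)
    zL = z_int % K                       # low part
    c = sb_exact(zH + K) - sb_exact(zH)  # a secant choice of the scaled slope c(zH)
    return sb_exact(zH) + (c * zL) // K  # dividing by K is a free right shift in binary

# the interpolated values stay within a few units of the naive table
err = max(abs(sb_interp(z) - naive[z]) for z in range(-256, 1))
assert err <= 3
```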

3. BIPARTITE APPROACH

The bipartite approach is a widely-used linear interpolation method that reduces the area for table lookup.10,11,16 To the best of our knowledge, bipartite has only been used with functions whose values are encoded in weighted radix systems, like conventional binary. In such a conventional system, a bipartite circuit splits the binary input into three parts,30 denoted z2, z1 and z0, whose bit lengths are n2, n1 and n0. The bipartite function approximation is

    f(z) ≈ a21(z2, z1) + a20(z2, z0).    (5)

In the context of LNS implementation,12,15,24 the relationship between (5) and (4) is that zH = z2 + z1 and zL = z0. This makes

    a21(z2, z1) = sb(zH)    (6)

and

    a20(z2, z0) ≈ ⌊c(zH) · zL / K⌋.    (7)

An advantage of (7) is that a20 encapsulates not only the multiplication but also the scaling. (The latter advantage has not been noted before because scaling is not a significant issue in conventional binary implementations, although, as we will see, it is in residue implementations.) With the bipartite method, (7) is only an approximation because not all of zH is available to address the table. The bipartite method trades accuracy in representing the slope for elimination of the multiplier and the scaling. Put another way, the same slope is used for several different zH; in consequence the designer could choose any of these slopes, rather than the c(zH) used in (7). This does not reduce the accuracy of the bipartite method much, since the designer already had some latitude in choosing c(zH). For example, Schulte and Stine30 suggest:

    a21(z2, z1) = f(z2 + z1 + δ0),    (8)
    a20(z2, z0) = f′(z2 + δ1 + δ0) · (z0 − δ0),    (9)

where δ1 and δ0 are midpoints. The number of words of memory for the binary bipartite approach is 2^(n2+n1) + 2^(n2+n0), compared to M = 2^(n2+n1+n0) words required by the naïve approach.
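The construction of (8) and (9) can be sketched in Python for a toy case (our own choices of example function, n2 = n1 = n0 = 3 and midpoints; a sketch, not the synthesized design):

```python
import math

F = 4
n2, n1, n0 = 3, 3, 3          # bit lengths of the fields z2, z1, z0

def f(z_int):
    """Example function on the scaled grid: sb(z) = log2(1 + 2**(z/2**F)), scaled by 2**F."""
    return 2**F * math.log2(1 + 2**(z_int / 2**F))

def fprime(z_int, h=1e-4):
    """Numerical derivative of f, standing in for the tabulated slope."""
    return (f(z_int + h) - f(z_int - h)) / (2 * h)

d0 = (2**n0 - 1) / 2               # midpoint of the z0 field
d1 = ((2**n1 - 1) / 2) * 2**n0     # midpoint of the z1 field, pre-shifted into place

# table a21 is addressed by (z2, z1) and a20 by (z2, z0):
# 2**(n2+n1) + 2**(n2+n0) words versus 2**(n2+n1+n0) for the naive table
a21 = {(z2, z1): f((z2 << (n1 + n0)) + (z1 << n0) + d0)
       for z2 in range(2**n2) for z1 in range(2**n1)}
a20 = {(z2, z0): fprime((z2 << (n1 + n0)) + d1 + d0) * (z0 - d0)
       for z2 in range(2**n2) for z0 in range(2**n0)}

def f_bipartite(z):
    """Evaluate with two lookups and one addition: no multiplier, no scaling at run time."""
    z0 = z % 2**n0
    z1 = (z >> n0) % 2**n1
    z2 = z >> (n0 + n1)
    return a21[(z2, z1)] + a20[(z2, z0)]

# accuracy check over the full 9-bit input range
err = max(abs(f_bipartite(z) - f(z)) for z in range(2**(n2 + n1 + n0)))
assert err < 4
```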

4. RESIDUE NUMBER SYSTEM (RNS)

The Residue Number System (RNS)19,29,32 represents a number by a vector of remainders of division by a set of moduli. For example, a positive integer x is represented in RNS as (x0, x1, ..., xN−1) with respect to the set of moduli (m0, m1, ..., mN−1), where

    xi = |x|mi = x mod mi,  i = 0, 1, ..., N−1.    (10)

The unsigned numbers between 0 and M − 1 can be represented uniquely by the residues (x0, x1, ..., xN−1), where M = m0 · m1 · m2 ··· mN−1. Similarly, signed numbers can be represented using a constraint like (2). There are several choices for the encoding of each individual xj that composes the RNS number. The most common encoding, and the one used in this paper, is binary (⌈log2 mj⌉ wires, each using two-valued logic).22 It is the most compact for storage. Other encodings include mj-ary (a single wire using multi-valued logic),34 one-hot (mj wires, each using two-valued logic),8,9 index,35 quantum25 and, perhaps most interestingly, optical encoding.7

The advantage of RNS is that additions, subtractions and multiplications are carry-free. Because the RNS representation uses several short words instead of one long word as in the binary representation, the delay of addition and multiplication is smaller than in the binary representation. However, other operations, such as division, comparison, sign detection and overflow detection, are difficult because the moduli are equally weighted. To perform magnitude comparison, overflow detection and sign detection, RNS numbers need to be converted into their associated mixed-radix representations. The weighted mixed-radix representation is

    x = tN−1 · (mN−2 · mN−3 ··· m0) + ··· + t2 · (m1 · m0) + t1 · m0 + t0.    (11)

The digit set and encoding options for mixed radix are the same as for residue. To convert an RNS representation (x0, x1, ..., xN−1) into the associated mixed-radix representation, the following recursive procedure, known as Mixed-Radix Conversion (MRC), is performed19:

    ti = yi mod mi,
    yi+1 = (yi − ti) · |1/mi|mi+1,    (12)

where the multiplicative inverses |1/mi|mi+1 (i = 0, ..., N−2) are pre-computed and y0 = x initially.
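A direct Python rendering of the recurrence (12) may make the procedure clearer (a sketch: `pow(m, -1, mj)` stands in for the pre-computed inverses, and the residue vector is updated channel-wise):

```python
def mrc(residues, moduli):
    """Mixed-Radix Conversion: residues (x0..xN-1) -> mixed-radix digits (t0..tN-1)."""
    n = len(moduli)
    y = list(residues)                 # y0 = x, held channel-wise
    digits = []
    for i in range(n):
        ti = y[i] % moduli[i]
        digits.append(ti)
        # y_{i+1} = (y_i - t_i) * |1/m_i| in each remaining channel j > i
        for j in range(i + 1, n):
            inv = pow(moduli[i], -1, moduli[j])   # pre-computable multiplicative inverse
            y[j] = ((y[j] - ti) * inv) % moduli[j]
    return digits

def from_mixed_radix(digits, moduli):
    """Reconstruct x = t0 + t1*m0 + t2*(m1*m0) + ... from the mixed-radix digits."""
    x, w = 0, 1
    for t, m in zip(digits, moduli):
        x += t * w
        w *= m
    return x

moduli = (15, 8, 7)
x = 500
residues = [x % m for m in moduli]
assert from_mixed_radix(mrc(residues, moduli), moduli) == x
```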

5. RESIDUE LOGARITHMIC NUMBER SYSTEM (RLNS)

The hardware for multiplication or division of two LNS numbers is primarily a fixed-point adder. The hardware for addition of two LNS numbers consists of at least two fixed-point adders in the combinational part of the circuit, as well as a table. The short-delay property of residue addition can be applied to improve the speed of logarithmic multiplication and division and, to a certain extent, logarithmic addition. Arnold1,5 proposed the Residue Logarithmic Number System (RLNS) to reduce the delay for multiplication and division. Arithmetic in RLNS proceeds as in ordinary LNS except that the exponent parts are represented in residue. The consequences of using RLNS rather than conventional binary LNS are that the choices of xmin and xmax in (2) need to reflect the fact that M is a product of moduli, rather than a power of two, and that overflow detection and comparison are difficult. Many applications will tolerate these extra limitations in exchange for the higher speed RLNS offers. The distinction in xmin and xmax can be mitigated by choosing one modulus to be a power of two and the other moduli to be close to powers of two.

RLNS multiplication is straightforward. However, since RLNS addition involves table lookup and the RLNS representation usually has a longer word length than the binary representation, the naïve table size may become too large. This increases the total area for RLNS adders and lengthens the propagation delay, eroding the delay advantage brought by using RLNS. Arnold1 suggested using linear interpolation (4) to implement RLNS addition. This involves three steps: first, extract zH and zL; second, multiply zL by c(zH); finally, divide by K (now chosen to be a product of a subset of the moduli) and add the scaled product to sb(zH).

Extraction of zL = z mod K is a free operation (because K is now a product of moduli), just as extraction is a free operation in conventional binary (when K is a power of two). Unlike binary, extraction of zH is not completely straightforward: it requires computing zH = z − zL using an actual subtractor whose zL input is base extended. The residue representation of zH has zeros in the moduli associated with K, so these moduli do not need to be formed. The remaining moduli can address the tables of sb(zH) and c(zH). As discussed above, the interpolation multiplication is low cost in residue, but (4) has the hidden cost of scaling, which is a difficult operation in residue.
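As a hedged illustration of this split (the moduli (15, 8, 7) and K = 15 are assumed here for concreteness; in hardware the base extension of zL is itself a table lookup rather than the trivial reduction shown):

```python
moduli = (15, 8, 7)   # assumed example moduli; K = 15 is the first of them

def to_residues(z):
    return tuple(z % m for m in moduli)

def extract(z_res):
    """Split z into zL = z mod 15 (free: it is a residue channel) and zH = z - zL."""
    zL = z_res[0]                  # the mod-15 channel already holds zL
    # base-extend zL into the other channels, then subtract channel-wise
    zH = tuple((r - zL) % m for r, m in zip(z_res, moduli))
    return zH, zL

z = 100
zH, zL = extract(to_residues(z))
assert zL == z % 15
assert zH == to_residues(z - zL)   # zH is zero in the mod-15 channel
```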

6. COMBINING BIPARTITE WITH RLNS

The bipartite approach is used here to decrease the table size for sb(z) without needing an explicit multiply or scaling operation. However, the bipartite approach described in the literature requires the inputs to be weighted numbers. This restriction can be resolved by converting the residue representation to its associated mixed-radix representation. Figure 1 shows the bipartite sb(z) table for RLNS addition. We will assume the system uses three moduli, m0, m1 and m2. The three residues z0, z1 and z2 represent the residues of z with respect to these three moduli. The three numbers t0, t1 and t2 output from Mixed-Radix Conversion represent the associated mixed-radix representation of z. The tables sb21 = a21(t2, t1) and sb20 = a20(t2, t0) are the two bipartite tables; their outputs are summed to form sb(z).

Figure 1. Bipartite RLNS sb(z) Table.

The Mixed-Radix Conversion (MRC) part adds extra delay and area to the system. By choosing particular moduli, the mixed-radix conversion can be optimized. In this paper the moduli are chosen as 7, 8 and 15. With this set of moduli, the multiplicative inverse of each modulus with respect to the others is either 1 or −1, so the mixed-radix conversion involves no multiplications. Another advantage is that modular addition and subtraction are easier to compute, because the moduli are either powers of two or powers of two minus one. Figure 2 shows the hardware for calculating the sum of two numbers A and B modulo 2^n − 1. The MUX selects the result as (A + B) mod 2^n or (A + B + 1) mod 2^n, depending on whether A + B + 1 overflows modulo 2^n. The calculation of (A − B) mod (2^n − 1) has a similar structure. More research on modulo 2^n ± 1 arithmetic can be found in Kalampoukas,17 Efstathiou13 and Zimmermann.36

Figure 2. Hardware for Calculating (A + B) mod (2^n − 1).
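The selection rule in Figure 2 can be modeled in a few lines of Python (a behavioral sketch of the adder, not the gate-level design):

```python
def add_mod_2n_minus_1(a, b, n):
    """(A + B) mod (2**n - 1): the MUX picks (A+B) mod 2**n or (A+B+1) mod 2**n
    depending on whether A + B + 1 overflows modulo 2**n (the end-around carry)."""
    s0 = (a + b) % 2**n
    s1 = (a + b + 1) % 2**n
    overflow = a + b + 1 >= 2**n   # carry out of the incremented sum
    return s1 if overflow else s0

# exhaustive check against direct modular addition for n = 4 (modulus 15)
n = 4
for a in range(2**n - 1):
    for b in range(2**n - 1):
        assert add_mod_2n_minus_1(a, b, n) == (a + b) % (2**n - 1)
```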

Because the circuit described here is restricted to the properties of these three moduli, it will be convenient to change notation to emphasize the moduli. What was generically referred to as z0 will now be referred to as R15, where the "R" emphasizes that the subscript now acts as the residue modulus. Likewise, z1 will now be referred to as R8 and z2 will now be referred to as R7. The residue-to-mixed-radix conversion in (12) performs the three steps below, equations (13), (14) and (15), for the moduli 7, 8 and 15:

    t0 = R15,    (13)
    t1 = |1/15|8 · (R8 − R15) mod 8 = (R15 − R8) mod 8,    (14)
    t2 = |1/8|7 · (|1/15|7 · (R7 − R15) mod 7 − t1) mod 7
       = (R7 − R15 − ((R15 − R8) mod 8)) mod 7,    (15)

where the second forms follow because 15 ≡ −1 (mod 8) and 15 ≡ 8 ≡ 1 (mod 7).
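Because the inverses collapse to ±1, equations (13)–(15) reduce to modular adds and subtracts. The following check (our own Python, using the sign conventions written above) verifies that the digits reconstruct every x in [0, 840):

```python
def mrc_15_8_7(R15, R8, R7):
    """Multiplication-free MRC for moduli (15, 8, 7): all inverses are +1 or -1."""
    t0 = R15
    t1 = (R15 - R8) % 8
    t2 = (R7 - R15 - t1) % 7
    return t0, t1, t2

# exhaustive check: x = t0 + 15*t1 + 120*t2 over the full range M = 840
for x in range(7 * 8 * 15):
    t0, t1, t2 = mrc_15_8_7(x % 15, x % 8, x % 7)
    assert t0 + 15 * t1 + 120 * t2 == x
```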

In the three steps, R7, R8 and R15 denote the residue representation and t2, t1 and t0 denote the mixed-radix representation. Because the three moduli's inverses with respect to one another are either 1 or −1, no multiplication is involved during the conversion.

The bipartite approach shown in Figure 1 can be further optimized. Figure 3 is the optimized version of Figure 1. There are 9 units in Figure 3, numbered as units (1), (2), ..., (9). The tables t21 and t20 (units (4) and (6)) are the two bipartite tables, which perform the same function as the tables sb21(z) and sb20(z) in Figure 1 with minor variations that incorporate the mixed-radix conversion into the tables. The modular adders and modular subtractors are represented by "+" and "−" with the modulus written below, for example "mod 7". The three input channels on the left side are R7, R8 and R15, respectively. The table t01 (unit (3), which performs base extension), together with the left "mod 7" subtractor (unit (5)), performs the calculation of t2 as given in equation (15). The left "mod 8", "mod 7" and table t21 perform both the calculation of t1 and t2 and the higher-order bipartite table lookup. The outputs of the tables t21 and t20 are in residue format. The three modular adders on the right side (units (7), (8) and (9)) add the two bipartite tables' outputs.

Figure 3. Optimized Hardware for RLNS sb(z) Table.

The range represented by the moduli 7, 8 and 15 is M = 7 × 8 × 15 = 840. The total word length for moduli 7, 8 and 15 is 10 bits. This range is 18% smaller than the range of a 10-bit binary representation, which is 2^10 = 1024. In LNS DCT computations, this is enough to cover the required dynamic range, where the equivalent of F = 4 bits are used as the fractional part. If this range of 840 numbers is split symmetrically into two nearly equal subsets for the positive and negative ranges (as is done for two's complement), the high-order MRC digit will not completely identify the sign. For example, the hardware should be looking up sb(−z) ≈ 0 when it actually produces sb(z) ≈ z; what should be the RLNS addition of an insignificant fraction would thus be misclassified as the addition of a much larger number. To guarantee correct RLNS additions, the ranges assigned to positive and negative zs need to be asymmetric. There are several ways to do this. For example, of the 840 numbers represented by moduli 7, 8 and 15, 4 × 8 × 15 = 480 of them could represent negative numbers and 3 × 8 × 15 = 360 of them could represent non-negative numbers. This makes xmin = −480 and xmax = +359. Figure 4 shows this asymmetric range for positive and negative z. The range of RLNS values then is b^xmin to b^xmax. For example, with the equivalent of F = 4 bits of precision that is adequate for DCT calculations,28 b = 2^(1/16). Thus, the residue exponents −480 to +359 generate real RLNS values between 2^−30 ≈ 9.3 · 10^−10 and 2^+22.4 ≈ 5.7 · 10^6, which is larger than required in most multimedia applications.

positive range

negative range

-480

-360

-240

-120

0

120

240

360

Figure 4. Asymmetric Range of Positive/Negative RLNS Representations for M = 840, corresponding to reals between b−480 to b359 .
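As a sketch only (a CRT-based software model of the asymmetric split; the hardware deliberately avoids such reconstruction), the signed exponent can be recovered from its residues and folded into [xmin, xmax]:

```python
M = 7 * 8 * 15            # 840 representable exponents
xmin, xmax = -480, 359    # asymmetric split: 480 negative, 360 non-negative

def signed_exponent(residues, moduli=(15, 8, 7)):
    """Recover signed x from its residues by CRT, then fold into [xmin, xmax]."""
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # standard CRT reconstruction term
    x %= M
    return x if x <= xmax else x - M   # values above xmax represent negatives

x = -300
residues = tuple(x % m for m in (15, 8, 7))
assert signed_exponent(residues) == -300
```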

7. SYNTHESIS RESULTS

There are two memory structures that could be used to compare the memory requirements of the naïve and proposed approaches. The first alternative is to count only the memory actually used. For example, with the moduli here the naïve approach requires M = 840 words, compared to 7 × 8 + 7 × 15 + 8 × 15 = 281 words total for tables t21, t20 and t01, respectively, reflecting about a two-thirds savings. The second alternative is to use memories whose sizes must be powers of two, as is the case in FPGAs. Here, the naïve approach takes 1024 words versus 64 + 128 + 128 = 320 words for the proposed approach, reflecting similar savings.

The above memory costs neglect the extra hardware required by the proposed bipartite circuit. It would be possible to implement the six residue adders needed in the proposed system using tables taking 7 × 15 + 8 × 15 + 7 × 7 + 7 × 7 + 8 × 8 + 15 × 15 = 612 words for units (1), (2), (5), (7), (8) and (9), respectively. The total is 281 + 612 = 893 words, which is slightly larger than the naïve approach. The problem with this accounting is that it overestimates the area required by the adders. It is better to use synthesis for a fairer comparison. Leonardo Spectrum is used as the synthesis tool, with the ASIC library scl05u as the target library. Synthesis results show that with the bipartite RLNS approach, the sb(z) table has 30% area savings and 40% less delay than the RLNS sb(z) table without the bipartite approach.

8. CONCLUSIONS

A bipartite function generator was designed that is suitable for approximating the sb function needed for RLNS addition. The bipartite approach offers area savings compared to the naïve RLNS addition implementation. To our knowledge, this is the first attempt to implement bipartite using residue arithmetic. The bipartite method is useful in this context because the scaling (a difficult operation in residue) can be built into the two bipartite tables. The extra cost of residue-based bipartite is mixed-radix conversion. Although bipartite requires extra time and area to perform this conversion, proper selection of the moduli can mitigate the cost.

This paper has focused on the sb function used to compute logarithmic sums. The db function, used for logarithmic differences, is more difficult to implement using bipartite because of its singularity. We are currently investigating whether the techniques used to overcome the singularity with binary arithmetic are applicable to residue arithmetic. Although this paper has focused on logarithm-based functions, the residue bipartite techniques may be applicable to other functions. In particular, the modular properties of residue may be a good fit with bipartite approximation of trigonometric functions, eliminating the need for range reduction.

REFERENCES

1. M. G. Arnold, Extending the Precision of the Sign Logarithm Number System, M.S. Thesis, University of Wyoming, Laramie, 1982.
2. M. G. Arnold and C. Walter, "Unrestricted Faithful Rounding is Good Enough for Some LNS Applications," 15th IEEE International Symposium on Computer Arithmetic, Vail, Colorado, pp. 237–245, Jun. 2001.
3. M. G. Arnold, "Reduced Power Consumption for MPEG Decoding with LNS," Application-specific Systems, Architectures and Processors (ASAP), IEEE, San Jose, California, pp. 65–75, Jul. 2002.
4. M. G. Arnold, "LNS for Low Power MPEG Decoding," Proceedings of SPIE: Advanced Signal Processing Algorithms, Architectures and Implementations XII, vol. 4791, Seattle, pp. 369–380, Jul. 2002.
5. M. G. Arnold, "The Residue Logarithmic Number System: Theory and Implementation," 17th IEEE International Symposium on Computer Arithmetic, Cape Cod, MA, pp. 196–205, Jun. 2005. http://www.cse.lehigh.edu/~caar/rlnstool.html
6. M. G. Arnold and P. W. Leong, "Logarithmic Arithmetic for N-body Simulations," to appear in Work-in-Progress Session of 31st EuroMicro Conference, Porto, Portugal, Sep. 2005.
7. C. D. Capps, et al., "Optical Arithmetic/Logic Unit Based on Residue Arithmetic and Symbolic Substitution," Applied Optics, vol. 27, pp. 1682–1686, 1988.
8. W. A. Chren, "One-Hot Residue Coding for Low Delay-Power Product CMOS Design," IEEE Transactions on Circuits and Systems, vol. 45, no. 3, pp. 303–313, Mar. 1998.
9. R. Conway, T. Conway and J. Nelson, "New One-hot RNS Structures for High-speed Signal Processing," Proceedings of SPIE: Advanced Signal Processing Algorithms, Architectures and Implementations XII, vol. 4791, Seattle, pp. 381–392, Jul. 2002.
10. D. Das Sarma and D. W. Matula, "Faithful Bipartite ROM Reciprocal Tables," 12th IEEE International Symposium on Computer Arithmetic, pp. 17–28, 1995.
11. F. de Dinechin and A. Tisserand, "Multipartite Table Methods," IEEE Transactions on Computers, vol. 54, no. 3, pp. 319–330, Mar. 2005.
12. J. Detrey and F. de Dinechin, "A VHDL Library of LNS Operations," 37th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Sep. 2003. http://perso.ens-lyon.fr/jeremie.detrey/FPLibrary/
13. C. Efstathiou, H. T. Vergos and D. Nikolos, "Modulo 2^n ± 1 Adder Design Using Select-prefix Blocks," IEEE Transactions on Computers, vol. 52, no. 11, pp. 1399–1406, Nov. 2003.
14. P. G. Fernandez, A. Garcia, J. Ramirez, L. Parrilla and A. Lloris, "Fast RNS-based DCT Computation with Fewer Multiplication Stages," Proceedings of the XV Design of Integrated Circuits and Systems Conference (DCIS), Montpellier, pp. 276–281, Nov. 2000.
15. J. Garcia, M. G. Arnold, L. Bleris and M. V. Kothare, "LNS Architectures for Embedded Model Predictive Control Processors," Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), Washington, DC, pp. 79–84, Sep. 2004.
16. H. Hassler and N. Takagi, "Function Evaluation by Table Lookup and Addition," 12th IEEE International Symposium on Computer Arithmetic, pp. 10–16, Jun. 1995.
17. L. Kalampoukas, D. Nikolos, C. Efstathiou, H. T. Vergos and J. Kalamatianos, "High-speed Parallel-prefix Modulo 2^n − 1 Adders," IEEE Transactions on Computers, vol. 49, no. 7, pp. 673–680, Jul. 2000.
18. E. Kinoshita and K.-J. Lee, "A Residue Arithmetic Extension for Reliable Scientific Computation," IEEE Transactions on Computers, vol. 46, pp. 129–138, Feb. 1997.
19. I. Koren, Computer Arithmetic Algorithms, Brookside Court Publishers, Amherst, Massachusetts, 1998.
20. D. M. Lewis, "Interleaved Memory Function Interpolators with Application to an Accurate LNS Arithmetic Unit," IEEE Transactions on Computers, vol. 43, no. 8, pp. 974–982, Aug. 1994.
21. J. Makino and M. Taiji, Scientific Simulations with Special-Purpose Computers—the GRAPE Systems, John Wiley and Sons, Chichester, 1998.
22. V. Paliouras and T. Stouraitis, "Multifunction Architectures for RNS Processors," IEEE Transactions on Circuits and Systems, vol. 46, pp. 1041–1054, Aug. 1999.
23. V. Paliouras and T. Stouraitis, "Low Power Properties of the Logarithmic Number System," 15th IEEE International Symposium on Computer Arithmetic, Vail, pp. 229–236, Jun. 2001.
24. L. C. Pickett, "Method and Apparatus for Generating Mathematical Functions," United States Patent 5,184,317, 2 Feb. 1993.
25. K. Qing and M. J. Feldman, "Single Flux Quantum Circuits Using the Residue Number System," IEEE Transactions on Applied Superconductivity, vol. 5, pp. 2988–2991, Jun. 1995.
26. D. Radhakrishnan and Y. Yuan, "Novel Approaches to Design of VLSI RNS Multipliers," IEEE Transactions on Circuits and Systems II, vol. 39, pp. 52–57, Jan. 1992.
27. J. Ruan and M. G. Arnold, "Combined LNS Adder/Subtractors for DCT Hardware," First Workshop on Embedded Systems for Real-Time Multimedia, Newport Beach, CA, pp. 118–123, Oct. 2003.
28. J. Ruan and M. G. Arnold, "New Cost Function for Motion Estimation in MPEG Encoding Using LNS," Proceedings of SPIE: Advanced Signal Processing Algorithms, Architectures, and Implementations XIV, vol. 5559, Denver, Aug. 2004.
29. D. Soudris, et al., "A Methodology for Implementing FIR Filters and CAD Tool Development for Designing RNS-based Systems," IEEE International Symposium on Circuits and Systems, pp. 129–132, May 2003.
30. M. Schulte and J. Stine, "Symmetric Bipartite Tables for Function Approximation," 13th IEEE International Symposium on Computer Arithmetic, Asilomar, California, pp. 175–183, Jul. 1997. http://www.cse.lehigh.edu/~caar/SBTM.html
31. E. E. Swartzlander and A. G. Alexopoulos, "The Sign/Logarithm Number System," IEEE Transactions on Computers, vol. C-24, pp. 1238–1242, Dec. 1975.
32. N. S. Szabo and R. I. Tanaka, Residue Arithmetic and Its Applications to Computer Technology, McGraw-Hill, 1967.
33. F. J. Taylor, R. Gill, J. Joseph and J. Radke, "A 20 Bit Logarithmic Number System Processor," IEEE Transactions on Computers, vol. C-37, pp. 190–199, 1988.
34. S. Wei and K. Shimizu, "Residue Arithmetic Circuits Based on Signed-Digit Multi-Valued Arithmetic Circuits," 27th International Symposium on Multiple-Valued Logic, Fukuoka, Japan, pp. 276–281, May 1998.
35. G. Zelniker and F. J. Taylor, "A Reduced Complexity Finite Field ALU," IEEE Transactions on Circuits and Systems, vol. 38, pp. 1571–1573, Dec. 1991.
36. R. Zimmermann, "Efficient VLSI Implementation of Modulo 2^n ± 1 Addition and Multiplication," 14th IEEE International Symposium on Computer Arithmetic, pp. 158–167, Apr. 1999.
