A Single-Multiplier Quadratic Interpolator for LNS Arithmetic

Mark G. Arnold
UMIST
Manchester M60 1QD UK
[email protected]

Mark D. Winkel
XLNS Research
Boulder, CO 80308-1204 USA
[email protected]

Abstract

Linear interpolation requires a single multiplication but is significantly less accurate than quadratic interpolation. The latter requires two multiplications. Two novel quadratic interpolation schemes are shown here that approximate the functions required by the Logarithmic Number System (LNS) with more accuracy than linear interpolation using only a single multiplication. One method uses two ROMs to give the accuracy of quadratic interpolation, whilst the other uses one ROM to give four to six bits better accuracy than linear interpolation. These techniques save four- to eight-fold on memory compared to linear interpolation for the same accuracy. We illustrate the usefulness of these techniques for serial implementation with a clone of the ARM™ microprocessor (known as AWE) that we developed to have LNS instructions. We also show a novel technique for decreasing the propagation delay in both linear and quadratic interpolation that stores the logarithm of the derivative of the function in a ROM, rather than the function itself.

1. Introduction

The Logarithmic Number System (LNS) [13] represents real numbers with a sign bit and a fixed-point (usually base-two) logarithm. Except for input/output conversion, numbers stay in this format throughout all operations. Multiplication in LNS is transformed into addition of the logarithms and XORing of the signs. Likewise, division in LNS is transformed into subtraction of logarithms and XORing of the signs. As a fixed-point adder is considerably less expensive than a floating-point multiplier, this affords considerable savings over conventional techniques, such as the IEEE-754 standard, when more multiplications are performed than additions. Also, LNS multiplication and division cause no roundoff error, compared to the half of ulp error that we expect in floating-point arithmetic. The place where LNS has difficulty is with addition and subtraction. These operations require approximation of the functions s_b(z) = log_b(1 + b^z) and d_b(z) = log_b|1 − b^z| [1]. Several approaches have been discussed in the literature to approximate these functions [6, 10, 5]. The simplest is direct table lookup, in which all possible values of the function are tabulated [13]. Although this is feasible for low-precision systems where the word size is small, for practical word sizes that approach the precision afforded by IEEE-754, direct lookup requires memory that grows exponentially with the word size. Thus, more complicated approximation techniques are required. One such technique is interpolation, which requires some form of multiplication [1]. This paper compares several interpolation methods that have been proposed in the literature, and suggests two novel methods that replace some of the required interpolation multiplication(s) with addition of logarithms. This improvement is possible by selecting coefficients that can be computed from a single table access rather than requiring several accesses as with Chebyshev polynomials.

2. Linear Tangent (LT) Interpolation

One approach [1, 8] to approximate a function, F, uses the derivative, D = F′, with linear interpolation:

    F(−nΔ − δ) ≈ F(−nΔ) − δ · D(−nΔ).    (1)

As shown in Figure 1, the line¹ is tangent to F at −nΔ. The interpolation line only intersects the curve F at that one point, and diverges from the function in the remainder of the interval. This technique is commonly referred to as Taylor interpolation. For an LNS ALU, either F(z) = s_b(z) = log_b(1 + b^z) for logarithmic addition or F(z) = d_b(z) = log_b|1 − b^z| for logarithmic subtraction.

¹Following [5], −nΔ − δ is the input bus (z = z_H + z_L in [1, 3]). At no cost, this can be split into −nΔ and −δ.

Like most linear interpolation techniques [6, 12], (1) requires one multiplication in the hardware, indicated by the "·" in (1). The hardware splits the input bus into a high part, n, and a low part, δ. The fewer bits in n, the more bits in δ. n is used to address the F and D ROMs. We would like as narrow a bus for n as permitted. To obtain the desired level of accuracy, the size of the bus for n is chosen based on the value of D′(−nΔ − δ). Often the ROMs are partitioned at powers of two [7] to minimise storage requirements whilst maintaining the desired accuracy, but we will ignore this optimisation here. The hardware to implement such interpolation is shown in Figure 2, where ROM1 is F(−nΔ) and ROM2 is F′(−nΔ). The largest component in Figure 2, aside from the ROMs, is the fixed-point multiplier.
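As a concrete illustration of (1), the following is a minimal Python sketch of linear tangent interpolation for F = s_b with base b = 2. It is a numeric model, not the fixed-point hardware; the interval size Δ = 2^−10 and the sample argument are illustrative choices, not values prescribed at this point in the paper.

```python
import math

def sb(z):
    """s_b(z) = log2(1 + 2^z): the LNS addition function for base b = 2."""
    return math.log2(1.0 + 2.0 ** z)

def D(z):
    """D(z) = s_b'(z) = 2^z / (1 + 2^z)."""
    return 2.0 ** z / (1.0 + 2.0 ** z)

DELTA = 2.0 ** -10   # interval size (an illustrative choice)

def lt_interp(z):
    """Linear tangent interpolation of s_b, equation (1), for z <= 0."""
    n = math.floor(-z / DELTA)   # high part: would address the F and D ROMs
    delta = -z - n * DELTA       # low part: 0 <= delta < DELTA
    a = -n * DELTA               # the tangent point -n*DELTA
    return sb(a) - delta * D(a)

# The tangent touches s_b only at -n*DELTA, so the error grows with delta
# but stays O(DELTA^2) over the interval.
err = abs(lt_interp(-1.2345) - sb(-1.2345))
```

At a tabulated point (δ = 0) the interpolation is exact; in between, the error is bounded by max|s_b″| · δ²/2.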

3. Multiplierless LT Interpolation

By implementing the interpolation with logarithmic arithmetic instead of fixed-point arithmetic, the multiplication shown in Figure 2 can be eliminated:

    F(−nΔ − δ) ≈ F(−nΔ) − b^(log_b(δ) + log_b(D(−nΔ))).    (2)

As was first noted in 1982 [1] and later rediscovered [7],

    log_b(D(−nΔ)) = −nΔ − F(−nΔ)    (3)

for both F = s_b and F = d_b. Therefore, (2) can be implemented with only an additional ROM for log_b(δ) and another one for the anti-log:

    F(−nΔ − δ) ≈ F(−nΔ) − b^(log_b(δ) − nΔ − F(−nΔ)).    (4)

Lewis and Yu [7] used this approach to fabricate a 30-bit LNS ALU in CMOS. A problem with Lewis and Yu's approach is the propagation delay of the subtractor required to compute (3). We can turn this around:

    F(−nΔ) = −log_b(D(−nΔ)) − nΔ    (5)

and have the ROM contain the slope so that the system can begin the interpolation as soon as the output of the ROM stabilises. The hardware using this novel variation of multiplierless linear-tangent-line interpolation is shown in Figure 3, where ROM3 is log_b(D(−nΔ)), ROM4 is −log_b(δ) and ROM5 is the anti-log.
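The identity (3) and the multiplierless product of (4) can be checked numerically. This is a sketch for b = 2 and F = s_b; the sample operand values are arbitrary.

```python
import math

def sb(z):
    return math.log2(1.0 + 2.0 ** z)

def D(z):
    return 2.0 ** z / (1.0 + 2.0 ** z)

# Identity (3): log2(D(z)) = z - sb(z), because D = 2^z/(1+2^z) = 2^(z - sb(z)).
for z in (-8.0, -1.5, -0.25):
    assert abs(math.log2(D(z)) - (z - sb(z))) < 1e-12

# Equation (4): the product delta*D(-n*DELTA) becomes an anti-log of a sum,
# so no multiplier is needed (illustrative operand values).
a, delta = -1.5, 2.0 ** -11
product = 2.0 ** (math.log2(delta) + a - sb(a))
assert abs(product - delta * D(a)) < 1e-15
```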

4. Linear Secant (LS) Interpolation

A different approach for implementation of LNS arithmetic with interpolation uses a slightly different slope that causes the line to be a secant (Figure 4) that intersects F at −nΔ and (−n − 1)Δ:

    F(−nΔ − δ) ≈ F(−nΔ) − δ · (F(−nΔ) − F((−n − 1)Δ))/Δ.    (6)

Such a technique is often referred to as Lagrange interpolation. It was shown [1] that when implemented using identical hardware, (6) produces a result that is accurate to two additional bits beyond the result produced by (1) or (4) for F = s_b.

5. Quadratic Tangent (QT) Interpolation

Coleman [5] has extended the idea of tangent-line linear interpolation to quadratic interpolation using an "error correcting" term, E(n) · P(δ):

    F(−nΔ − δ) ≈ F(−nΔ) − δ · D(−nΔ) + E(n) · P(δ)    (7)

where E(n) is a function that is approximately D′(−nΔ) times a constant and P(δ) is a function that is δ² times the reciprocal of that constant. Since δ is a small fractional value (its k leading bits are zero) and δ² < 2^(−2k) is significantly smaller than δ, Coleman's technique is affordable because only the high-order bits of δ are required to obtain a satisfactory approximation for P(δ). Coleman's technique requires four ROMs (F, D, E and P) and two multipliers. A motivation for the quadratic approach is to reduce the width of the address bus (−nΔ) to be less than required by linear interpolation at a given accuracy. Doing this, however, increases the width of the other bus, −δ, fed to the multiplier. For this reason, it is impractical to use logarithmic arithmetic to replace the first multiplication of (7) since the size of the ROM for log_b(δ) would be too large. For a 32-bit LNS equivalent to IEEE-754, it would require 2^18 words.
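A numeric sketch of (7) for b = 2 and F = s_b follows. The split of the constant between E and P is a free choice that the paper does not fix; the split used here (E carries D′ scaled by Δ²/2, P is the squared fraction δ/Δ) is one admissible assumption, and Δ = 2^−5 is illustrative.

```python
import math

def sb(z):  return math.log2(1.0 + 2.0 ** z)
def D(z):   return 2.0 ** z / (1.0 + 2.0 ** z)                       # F'
def Dp(z):  return math.log(2.0) * 2.0 ** z / (1.0 + 2.0 ** z) ** 2  # F'' = D'

DELTA = 2.0 ** -5   # the wider interval a quadratic method affords

def qt_interp(z):
    """Quadratic tangent interpolation, equation (7), for F = s_b."""
    n = math.floor(-z / DELTA)
    delta = -z - n * DELTA
    a = -n * DELTA
    # One admissible split of the constant: E(n) = D'(a)*DELTA^2/2 and
    # P(delta) = (delta/DELTA)^2, so E*P is the Taylor term (delta^2/2)*D'(a).
    E = Dp(a) * DELTA ** 2 / 2.0
    P = (delta / DELTA) ** 2
    return sb(a) - delta * D(a) + E * P
```

Because P depends only on the ratio δ/Δ, truncating δ to its high-order bits changes E·P very little, which is why the P table stays small.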

6. Novel Single-Multiplier QT Interpolation

On the other hand, the second multiplication in (7) is a candidate for logarithmic arithmetic:

    F(−nΔ − δ) ≈ F(−nΔ) − δ · D(−nΔ) + b^(log_b(E(n)) + log_b(P(δ))).    (8)

This is possible because not all bits of δ are required for the input to the P(δ) ROM in Coleman's original technique (7). The same number of bits will be required for the input to the log_b(P(δ)) ROM in the new technique (8) shown above. There is a delay in converting the logarithm of the product back after doing the multiplication with logarithmic arithmetic.

Arnold [1] partially foresaw the possibility of using logarithmic arithmetic, similar to Figure 5, to implement the quadratic term in such interpolation, noting that

    log_b(D′(−nΔ)) = −nΔ − 2F(−nΔ) + log_b(ln b).    (9)

Using (9), (8) simplifies to have only one multiplication:

    F(−nΔ − δ) ≈ F(−nΔ) − δ · D(−nΔ) + b^(−nΔ − 2F(−nΔ) + log_b(ln b) + 2 log_b(δ)).    (10)

There is a constant, log_b(Δ²), that cancels out of log_b(P) and log_b(E), so (10) can be implemented using four ROMs: F(−nΔ), D(−nΔ), log_b(ln b) + 2 log_b(δ) and the anti-log. Again, to reduce propagation delay, (9) could be turned around:

    F(−nΔ) = (−log_b(D′(−nΔ)) − nΔ + log_b(ln b))/2.    (11)

Thus, the novel single-multiplier quadratic interpolator can be implemented as shown in Figure 5, where ROM6 is −log_b(D′(−nΔ)) + log_b(ln(b)), ROM7 is 2 log_b(δ) + log_b(ln(b)), ROM8 is D(−nΔ) and ROM9 is the anti-logarithm function.
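The key identity (9) and the resulting multiplier-free quadratic term can be verified numerically for b = 2 and F = s_b. In this sketch the "−1" in the exponent is the Taylor factor 1/2 expressed in base 2; treating it as a constant folded into the ROM contents is this sketch's assumption, and the operand values are arbitrary.

```python
import math

def sb(z):  return math.log2(1.0 + 2.0 ** z)
def Dp(z):  return math.log(2.0) * 2.0 ** z / (1.0 + 2.0 ** z) ** 2

LOG2_LN2 = math.log2(math.log(2.0))

# Identity (9): log2(D'(z)) = z - 2*sb(z) + log2(ln 2), since
# D'(z) = ln(2)*2^z/(1+2^z)^2 = ln(2)*2^(z - 2*sb(z)).
for z in (-6.0, -2.0, -0.5):
    assert abs(math.log2(Dp(z)) - (z - 2.0 * sb(z) + LOG2_LN2)) < 1e-12

# Quadratic term as a single anti-log (no multiplier); the "-1" is the
# Taylor factor 1/2 for b = 2, a constant that can fold into ROM7.
a, delta = -2.0, 2.0 ** -7
term = 2.0 ** (a - 2.0 * sb(a) + LOG2_LN2 + 2.0 * math.log2(delta) - 1.0)
assert abs(term - (delta ** 2 / 2.0) * Dp(a)) < 1e-15
```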

7. Quadratic Secant (QS) Interpolation

A quadratic generalization of (6) is possible:

    F(−nΔ − δ) ≈ F(−nΔ) + δ · (F((−n − 1)Δ) − F(−nΔ))/Δ
                 + 2(F(−nΔ) + F((−n − 1)Δ) − 2F((−n − 0.5)Δ)) · (δ²/Δ² − δ/Δ).    (12)

Lewis [9, 10] implemented something similar to (12) using an interleaved ROM for F. By application of Horner's rule, the number of multiplications can be reduced to two. The coefficients for the multiplications are derived by subtracting three adjacent elements of the ROM. The interleaved ROM allows this to occur more rapidly than if three sequential accesses occurred.
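Equation (12) is the Lagrange parabola through the three tabulated points −nΔ, (−n − 0.5)Δ and (−n − 1)Δ, which the following sketch confirms for b = 2 and F = s_b (Δ = 2^−5 is again an illustrative choice):

```python
import math

def sb(z): return math.log2(1.0 + 2.0 ** z)

DELTA = 2.0 ** -5

def qs_interp(z):
    """Quadratic secant interpolation, equation (12), for F = s_b."""
    n = math.floor(-z / DELTA)
    delta = -z - n * DELTA
    a = -n * DELTA
    f0 = sb(a)                 # F(-n*DELTA)
    f1 = sb(a - DELTA / 2.0)   # F((-n-0.5)*DELTA)
    f2 = sb(a - DELTA)         # F((-n-1)*DELTA)
    return (f0
            + delta * (f2 - f0) / DELTA
            + 2.0 * (f0 + f2 - 2.0 * f1) * (delta ** 2 / DELTA ** 2 - delta / DELTA))

# The parabola passes through all three tabulated points, e.g. the midpoint:
mid = -24.5 * DELTA
assert abs(qs_interp(mid) - sb(mid)) < 1e-12
```

Setting δ = 0, Δ/2 and Δ in (12) recovers f0, f1 and f2 exactly, which is the secant property the section describes.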

8. Novel Single-Multiplier QS Interpolation

The goal is now to eliminate two of the multiplications in (12) (i.e., one of the multiplications in [9]) by using logarithmic arithmetic. At first glance, it does not seem straightforward how to accomplish this. There are some novel manipulations of (12) required to make this feasible. First, note that

    D′(−nΔ) · Δ²/4 ≈ F(−nΔ) + F((−n − 1)Δ) − 2F((−n − 0.5)Δ),    (13)

thus,

    F(−nΔ − δ) ≈ F(−nΔ) + δ · (F((−n − 1)Δ) − F(−nΔ))/Δ + D′(−nΔ) · (δ² − δΔ)/2.    (14)

Of course, this still requires three multiplications. Using logarithmic arithmetic:

    F(−nΔ − δ) ≈ F(−nΔ) + δ · (F((−n − 1)Δ) − F(−nΔ))/Δ
                 − b^(log_b(D′(−nΔ)) + log_b(δΔ − δ²) − 1).    (15)

One multiplication can be eliminated, and the other one can be put into a ROM that contains log_b(δΔ − δ²). Substituting (9) into (15), we have

    F(−nΔ − δ) ≈ F(−nΔ) + δ · (F((−n − 1)Δ) − F(−nΔ))/Δ
                 − b^(−nΔ − 2F(−nΔ) + log_b(ln b) + log_b(δΔ − δ²) − 1).    (16)

This can be implemented using one multiplier and four ROMs: F, (F((−n − 1)Δ) − F(−nΔ))/Δ, log_b(ln b) + log_b(δΔ − δ²) − 1 and the anti-log. The sizes of the first two of these ROMs are the same as in Coleman's original technique. The size of the log_b(ln b) + log_b(δΔ − δ²) − 1 ROM is roughly the same as the P ROM in (7) or possibly smaller. Again, the propagation delay can be reduced by application of (11). This technique fits into the hardware of Figure 5 with the same ROM contents as before, except ROM7 is log_b(δΔ − δ²) − 1 + log_b(ln(b)) and ROM8 is (F((−n − 1)Δ) − F(−nΔ))/Δ. An alternate implementation of (16) could eliminate the slope ROM by using a simple two-out-of-two-bank interleaving scheme that accesses F(−nΔ) and F((−n − 1)Δ) in parallel. This would be more economical than the three-out-of-four-bank interleaving scheme used by Lewis [9].
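The manipulation from (12) to (14) rests on the second-difference identity (13). A sketch for b = 2 and F = s_b checks (13) and then evaluates (14) directly; Δ = 2^−5 and the sample points are illustrative.

```python
import math

def sb(z): return math.log2(1.0 + 2.0 ** z)
def Dp(z): return math.log(2.0) * 2.0 ** z / (1.0 + 2.0 ** z) ** 2

DELTA = 2.0 ** -5

# Identity (13): the second difference of three tabulated values
# approximates D'(-n*DELTA) * DELTA^2 / 4.
a = -1.0
second_diff = sb(a) + sb(a - DELTA) - 2.0 * sb(a - DELTA / 2.0)
assert abs(second_diff - Dp(a) * DELTA ** 2 / 4.0) < DELTA ** 3

def qs14(z):
    """Equation (14): the quadratic secant with its curvature term rewritten
    via D', which is what makes the logarithmic forms (15)-(16) possible."""
    n = math.floor(-z / DELTA)
    delta = -z - n * DELTA
    a = -n * DELTA
    return (sb(a)
            + delta * (sb(a - DELTA) - sb(a)) / DELTA
            + Dp(a) * (delta ** 2 - delta * DELTA) / 2.0)

assert abs(qs14(-1.2345) - sb(-1.2345)) < 2.0 ** -18
```

Since δ < Δ, the factor δ² − δΔ in (14) is negative; (15) makes its sign explicit so that the argument δΔ − δ² of the logarithm stays positive.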

9. Accuracy

For (10) or (16), the designer may choose the sizes of ROM7 and ROM9 (the anti-log ROM), and thereby choose the accuracy of the interpolation. When these ROMs are large enough, (10) and (16) are as accurate as their multiplier-based equivalents, such as (7). Assuming ROM7 is of comparable size to the P ROM in (7), a small anti-log ROM will then be the limiting factor in the accuracy of (10) and (16), and the corresponding errors can then be measured relative to (1) and (6), respectively. Since (1) is two bits less accurate than (6), (10) is also two bits less accurate than (16) when the anti-log ROM is small. If the anti-log ROM is only good for m significant bits, (10) has m bits and (16) has m + 2 bits more accuracy than (1).

10. Mitchell's Anti-log Method

The anti-log ROM can be eliminated altogether by using Mitchell's method [11], which substitutes a shifter for ROM9 by noting that 2^x ≈ 2^⌊x⌋(1 + (x − ⌊x⌋)). This has the advantage that the propagation delay of the system will be similar to simple single-multiplication linear interpolation. With Mitchell's method, m ≈ 4. Thus, (16) using Mitchell's method for 2^x can be up to six bits more accurate than (1). An exhaustive simulation of (16) with and without Mitchell's method was conducted for all values of −nΔ − δ that are sensible in an LNS analogous to IEEE-754 (23 bits of precision) using different sizes of ROM7 and different values of Δ. In the non-Mitchell simulation, the size for ROM9 was set to one quarter of that for ROM7. For comparison, the same values of Δ were used with LT (Taylor) interpolation. The following shows the size in words of ROM7 (which is not used with Taylor interpolation), the observed maximum error with and without Mitchell's method, and the number of bits improvement observed in (16) using Mitchell's method compared to LT interpolation:

    Δ = 2^−6:
    Method    ROM7   Error (ROM9)   Error (Mitchell)   Bits
    LT (1)      -    2.12e-5             -               -
    QS (16)    32    6.58e-7         6.80e-7           4.96
    QS (16)    64    3.35e-7         4.00e-7           5.73
    QS (16)   128    1.70e-7         3.46e-7           5.93
    QS (16)   256    8.61e-8         3.31e-7           6.00

    Δ = 2^−7:
    Method    ROM7   Error (ROM9)   Error (Mitchell)   Bits
    LT (1)      -    5.29e-6             -               -
    QS (16)    32    1.65e-7         1.70e-7           4.96
    QS (16)    64    8.38e-8         9.99e-8           5.73
    QS (16)   128    4.27e-8         8.64e-8           5.94
    QS (16)   256    2.15e-8         8.28e-8           6.00

    Δ = 2^−8:
    Method    ROM7   Error (ROM9)   Error (Mitchell)   Bits
    LT (1)      -    1.32e-6             -               -
    QS (16)    32    4.12e-8         4.25e-8           4.96
    QS (16)    64    2.10e-8         2.50e-8           5.73
    QS (16)   128    1.07e-8         2.16e-8           5.94
    QS (16)   256    5.37e-9         2.07e-8           6.00

The results confirm that when ROM7 is large enough, the proposed single-multiplier method using a shifter instead of ROM9 offers six additional bits of accuracy.

11. Memory Tradeoffs

Assuming Δ is constant, consider implementing an s_b interpolator using the prior methods (§2, 4, 7) that can achieve results about as good as 32-bit IEEE-754 floating point, that is 23-bit precision. In order to obtain about this precision with an LT interpolator [1] requires Δ = 2^−10. In addition to 10 address bits to the right of the radix point required by this choice of Δ, ⌈log₂(23)⌉ = 5 address bits will be needed on the left. Thus, the size of the −nΔ bus is 15 bits, which means devoting 2^15 words (32 bits wide) to the s_b table. By switching to the better linear method (LS), Δ = 2^−9 and the table size is cut in half (2^14). Both of those design alternatives use one multiplier. By increasing the resources for multiplication, two-multiplier QS interpolation [9] achieves slightly better precision using Δ = 2^−5, or 2^10 words.

To achieve similar precision, the novel single-multiplier QS method shown in Figure 5 uses more memory than the two-multiplier QS method, but less than either of the linear methods. For example, the single-multiplier quadratic interpolator in Figure 5 has the same address-bus size as a two-multiplier quadratic one since Δ is still 2^−5. However, Figure 5 requires two tables addressed by −nΔ (ROM6 and ROM8). In addition to ROM6 and ROM8, Figure 5 needs one table addressed by the high-order ten bits of −δ (ROM7) and an eight-bit table for ROM9 followed by a shifter. Thus, the total memory used is 2^10 + 2^10 + 2^10 + 2^8 words, which means the method proposed in Figure 5 offers more than a factor-of-four savings compared to the better linear (LS) method.

Figure 5 reduces propagation delay by starting the multiplication as soon as possible. Another design choice would be to eliminate ROM8 and obtain the linear slope by subtraction of adjacent elements of ROM6: D(−nΔ) ≈ (ROM6(−nΔ) − ROM6((−n − 1)Δ) − Δ)/(2Δ). This would reduce the memory requirements to 2^10 + 2^10 + 2^8 at the cost of sequential accesses [1], interleaved memory [9], or a dual-port block RAM in an FPGA [4]. This saves almost seven-eighths of the memory compared to the better linear (LS) method.

Using Mitchell's method instead of ROM9 gives the designer additional choices in this design space. To achieve similar precision to the earlier examples when ROM9 is replaced with a shifter (as explained in the last section) requires Δ = 2^−7. If Figure 5 is otherwise unchanged, this requires a total memory of 2^12 + 2^12 + 2^8 words, which is about half of the better linear (LS) method. If ROM9 is replaced with Mitchell's method and ROM8 is replaced with subtraction of adjacent elements in ROM6, the total memory will be 2^12 + 2^8 words, which is almost a factor of four improvement compared to the better linear (LS) method.
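The shifter-based anti-log that these design points rely on can be sketched in a few lines of Python. The sweep below is illustrative, not the paper's exhaustive simulation; it simply confirms the roughly four good bits (m ≈ 4) that Section 9 assumes for Mitchell's method.

```python
import math

def mitchell_pow2(x):
    """Mitchell's anti-log [11]: 2^x ~ 2^floor(x) * (1 + frac(x)).
    In hardware the fraction bits become the mantissa directly, so the
    anti-log ROM (ROM9) is replaced by a shifter."""
    i = math.floor(x)
    return (2.0 ** i) * (1.0 + (x - i))

# Worst-case relative error is about 6 percent (near frac(x) = 1/ln2 - 1),
# i.e. roughly 4 good bits, consistent with m ~ 4 in Section 9.
worst = max(abs(mitchell_pow2(k / 1024.0) - 2.0 ** (k / 1024.0)) / 2.0 ** (k / 1024.0)
            for k in range(1024))
assert 0.05 < worst < 0.07
```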

12. Multi-Cycle Implementation

The novel LNS techniques are shown in Figures 3 and 5 as feed-forward combinational logic/ROM for consistency with the high-performance VLSI LNS ALUs in the literature [8, 9, 5] and for ease of exposition. Although such logic can produce one result per cycle, it does so at the cost of large area and power consumption. It is difficult to give overall comparisons of accuracy, speed and size of such implementations under realistic circumstances without making many assumptions about the underlying VLSI technology. For example, the relative area saved by elimination of the second multiplier versus the area consumed by the addition of ROM7 and ROM9 (or its Mitchell-based shifter equivalent) depends heavily on the technology.

Consider cost-constrained applications for which LNS is well suited. Instead of expensive feed-forward combinational logic, more economical but lower-performance sequential implementations may be considered that use a portion of the system's RAM to hold the LNS tables. For example, elsewhere in [3] we describe how we extended our FPGA clone of a common RISC microprocessor² to implement LNS addition by linear-secant interpolation of the s_b function with a serial algorithm equivalent to Figure 2. This low-cost serial approach only costs 3 percent of the FPGA's LUTs above the basic integer RISC core because the CPU's internal resources are reused for the LNS algorithm. As described in the last section, to achieve precision equivalent to 32-bit IEEE-754 floating point with linear-secant interpolation, Δ = 2^−9. The state machine that was added to the processor requires an additional eight states. The slope is derived by subtraction of adjacent elements of the s_b table. The size of the δ bus is 23 − 9 = 14 bits wide, which implies that the serial multiply takes 14 cycles. Including the cycles to look up the function and so forth, it takes at most another 8 cycles to complete the addition algorithm, for a total of at most 22 cycles.

²Known as the AWE [2], our CPU is a partial clone, for experimental, academic purposes, of the popular ARM™ microprocessor core using implicit Verilog. The AWE implements data processing, branch, multiply and single data transfer instructions with no Thumb™ or interrupt register bank.

If this processor were modified to use the single-multiplier quadratic method shown in Figure 5 (with subtraction of adjacent elements instead of ROM8), the size of memory can be cut more than eight-fold compared to [3], due to Δ = 2^−5 as described in the last section. Because the serial multiplier now takes 23 − 5 = 18 cycles, during which time the RAM would otherwise be idle, the RAM accesses that replace ROM7 and ROM9 can occur in parallel to the serial multiplication. Thus, this technique takes at most 26 cycles. If ROM9 is replaced with Mitchell's method, Δ = 2^−7. Again assuming subtraction of adjacent elements is used instead of ROM8, the memory can be cut almost four-fold compared to [3]. The serial multiplier now takes 23 − 7 = 16 cycles. The shifting required for Mitchell's method can occur sequentially at low cost during the same time the multiplication is ongoing. Thus, this technique takes at most 24 cycles. By comparison, a serial implementation of IEEE-754 costing the same as our LNS techniques requires at least 24 cycles for multiplication, and a similar time for addition. LNS multiplication only takes one cycle; thus, with an equal mix of addition and multiplication, our LNS processor will be about twice the speed of the similar floating-point processor.

13. Conclusion

Two novel quadratic interpolation techniques for s_b and d_b were presented that capitalise on special properties of logarithmic arithmetic. The memory requirements are either similar or lower than a previous quadratic interpolator [5], with corresponding accuracy either similar or lower (but, in any case, better than linear interpolation). The advantage of these novel techniques is that only a single multiplication is required. In a serial implementation on our ARM clone, these techniques will reduce memory by four- to eight-fold with minor additional delay. Previous multiplierless linear interpolators stored F in a ROM, and computed the logarithm of the derivative, log_b(D), from this. A novel technique to reduce propagation delay stores log_b(D) in the ROM so that the interpolation can commence earlier, and then computes F from that in parallel.

References

[1] M. G. Arnold, "Extending the Precision of the Sign Logarithm Number System," M.S. Thesis, University of Wyoming, Laramie, 1982.
[2] M. G. Arnold, F. N. Engineer and M. D. Winkel, "AWE: The ARM Workalike Experiment," WESTCON, San Jose, CA, 21 Oct. 1999. (www.cs.uwyo.edu/ marnold/awe.html)
[3] M. Arnold and M. Winkel, "Reconfiguring an FPGA-based RISC for LNS Arithmetic," ITCon, Denver: SPIE, 20-24 Aug. 2001.
[4] M. Arnold, "Slide Rules for the 21st Century: Logarithmic Arithmetic as a High-speed, Low-cost, Low-power Alternative to Fixed Point Arithmetic," to appear in Second Online Symposium for Electronics Engineers, www.osee.com.
[5] J. N. Coleman, E. I. Chester, C. I. Softley, and J. Kadlec, "Arithmetic on the European Logarithmic Microprocessor," IEEE Trans. Comput., vol. 49, no. 7, pp. 702-715, July 2000.
[6] H. Henkel, "Improved Accuracy for the Logarithmic Number System," IEEE Trans. Acoust., Speech, Signal Proc., vol. ASSP-37, pp. 301-303, Feb. 1989.
[7] D. Lewis and L. Yu, "Algorithm Design for a 30-bit Integrated Logarithmic Processor," Proc. 9th Symposium on Computer Arithmetic, IEEE Press, pp. 192-199, 1989.
[8] D. M. Lewis, "An Architecture for Addition and Subtraction of Long Word Length Numbers in the Logarithmic Number System," IEEE Trans. Comput., pp. 1325-1336, 1990.
[9] D. M. Lewis, "Interleaved Memory Function Interpolators with Application to an Accurate LNS Arithmetic Unit," IEEE Trans. Comput., vol. 43, no. 8, pp. 974-982, Aug. 1994.
[10] D. M. Lewis, "114 MFLOPS Logarithmic Number System Arithmetic Unit for DSP Applications," International Solid-State Circuits Conference, pp. 1547-1553, San Francisco, Feb. 1995.
[11] J. N. Mitchell, "Computer Multiplication and Division using Binary Logarithms," IEEE Trans. Electron. Comput., vol. EC-11, pp. 512-517, Aug. 1962.
[12] A. S. Noetzel, "An Interpolating Memory Unit for Function Evaluation: Analysis and Design," IEEE Trans. Comput., vol. 38, pp. 377-384, Mar. 1989.
[13] E. E. Swartzlander and A. G. Alexopoulos, "The Sign/Logarithm Number System," IEEE Trans. Comput., vol. C-24, pp. 1238-1242, Dec. 1975.

Figure 1. Linear Tangent Line Interpolation

Figure 2. Multiplier-based Linear Interpolator

Figure 3. Multiplierless Linear Interpolator

Figure 4. Linear Secant Line Interpolation

Figure 5. Novel Single-Multiplier Quadratic Interpolator
