Division with Speculation of the Quotient Digits

Gianluca Cornetta
Department of Computer Architecture
Universitat Politècnica de Catalunya
08034 Barcelona, Spain

June 8, 1998

Abstract
Progress in VLSI technology has made possible the hardware implementation of all the basic arithmetic operations in the design of general-purpose as well as special-purpose processors. While operations such as multiplication and addition have been extensively studied and fast implementations are possible, the design of fast and efficient circuits for division is still challenging. The factor that limits the efficiency of this operation is the complexity of the digit generation function. To overcome this intrinsic limitation of the division algorithm, an extensive theory that includes quotient digit prediction, prescaling and redundant representation of the operands has been developed.

The time $T$ necessary to perform a division is $T = t_i \cdot n$, where $t_i$ is the latency of a single recurrence of the algorithm and $n$ is the number of iterations necessary to complete a division. In order to reduce the latency of a division we may try to reduce $n$; this may be achieved by using very high radices, since at each iteration $\log_2 r$ bits of the quotient are computed, with $r$ being the radix. Nevertheless, using high radices implies an increase of $t_i$ due to the greater complexity of both the quotient digit selection function and the logic for partial remainder updating. Thus a tradeoff between $t_i$ and $n$ must be found.

We may reduce the complexity of the quotient digit selection logic by assimilating a smaller number of bits of both partial remainder and divisor. With this approach we may achieve the desired speedup since $t_i$ is smaller; nevertheless, we may commit an error, since the selected digit is only a speculation and may not be the correct one. As a consequence, a post-correction of the partial remainder may be necessary; in order to increase the divider speed, the error detection is overlapped with the speculation of the next quotient digit.

The fewer the assimilated bits, the faster the speculation but the greater the probability of having to perform a post-correction, so a compromise must be found. In this paper we propose an analytical approach to quotient digit speculation for SRT division. The basic theory of SRT division is reformulated in order to obtain a set of equations that permit us to design speculative dividers with the desired amount of error.
1 The Division Algorithm

Division is a sequential operation; that is, the quotient is computed digit by digit by iterating a recursive formula. At each division step a quotient digit is computed and the partial remainder is updated. The most time-consuming operation to be performed at each step of the recurrence is, especially in the case of high-radix division, the comparison between the divisor and the shifted partial remainder in order to determine the quotient digit. The updating of the partial remainder is simply an add or subtract operation.

[Figure 1: Division Implementation: (a) Totally Sequential, (b) Totally Combinational, (c) Combined Implementation.]

Let $w[j]$ be the partial remainder computed at the $j$-th step (clearly, $x = w[0]$ is the dividend), $d$ the divisor, $r$ the radix and $q_j$ the quotient digit computed at the $j$-th step; each iteration consists of four operations that, in general, are executed sequentially:

1. An arithmetic left-shift of $w[j]$ to produce $r\,w[j]$;
2. Determination of the quotient digit $q_j$, performed by a quotient-digit selection function;
3. Generation of a divisor multiple $q_j d$;
4. Update of the partial remainder by adding the divisor multiple to, or subtracting it from, the shifted partial remainder $r\,w[j]$.
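The four operations listed above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the hardware described in the paper: exact rational arithmetic replaces the carry-save datapath, the selection function is an idealized full-precision rounding of $r\,w[j]/d$ clamped to a symmetric digit set $\{-a, \ldots, +a\}$, and the function names are hypothetical.

```python
from fractions import Fraction

def digit_recurrence_divide(x, d, r=4, a=2, n=8):
    """Illustrative radix-r digit recurrence (not the paper's hardware).

    Assumes 1/2 <= d < 1 and |x| <= (a/(r-1))*d so the residual stays bounded.
    Returns the list of signed quotient digits and the final residual w[n].
    """
    w = Fraction(x)
    d = Fraction(d)
    digits = []
    for _ in range(n):
        shifted = r * w                 # 1. left shift: r*w[j]
        q = round(shifted / d)          # 2. idealized full-precision selection
        q = max(-a, min(a, q))          #    clamp to the digit set {-a..+a}
        multiple = q * d                # 3. divisor multiple q*d
        w = shifted - multiple          # 4. residual update
        digits.append(q)
    return digits, w

def quotient_value(digits, r):
    """Value of the signed-digit quotient 0.q1 q2 ... in radix r."""
    return sum(Fraction(q, r ** (i + 1)) for i, q in enumerate(digits))
```

Whatever selection rule is used, the recurrence preserves the identity $x = d \cdot Q + w[n]\,r^{-n}$, so the residual bound alone determines the accuracy of the quotient.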
There are three possible ways of implementing the division algorithm; the choice of one of these design alternatives is influenced by cost, speed and throughput considerations. The implementation may be totally sequential (Figure 1(a)), if the hardware that performs a division step is reused for all the iterations and the partial remainder is updated in a register; totally combinational (Figure 1(b)), if the hardware that performs the division step is replicated $n$ times (where $n$ is the number of bits of the mantissa); or a combination of both these design approaches (Figure 1(c)). In the last case the hardware implementing a division step is replicated $k$ times, so that at each iteration $k$ quotient digits are generated and the overall number of iterations decreases from $n$ to $n/k$. The totally combinational approach is particularly attractive for pipelined implementations, since several divisions may be in execution at the same time, increasing the throughput.
2 SRT Division

The SRT method, developed independently by Sweeney, Robertson and Tocher at the end of the fifties [Rob58, Toc58, CS57], may be considered an improvement of non-restoring division. In a non-restoring division an addition or a subtraction is performed at each iteration; the SRT method, on the other hand, allows zero to be a valid quotient digit and thus permits skipping sequences of zeros in the partial remainder by performing only an arithmetic left-shift, without any add or subtract operation. The residual is updated in a non-restoring fashion; that is, negative residuals are allowed and, in case of a negative partial remainder, the quotient is corrected in the successive iterations. The following recurrence is applied repeatedly:
$$ w[j+1] = r\,w[j] - q_{j+1}\,d \tag{1} $$

where the quotient digit $q_{j+1} \in \mathcal{D}_s$. Furthermore, in order to guarantee the convergence of the algorithm, the residual must be bounded, that is:

$$ -\rho\,d \le w[j] \le +\rho\,d \qquad \forall j \tag{2} $$

For the sake of simplicity we will consider positive operands expressed in normalized form and symmetric digit sets, that is:

$$ \mathcal{D}_s = \{-a, -a+1, \ldots, -1, 0, +1, \ldots, +a-1, +a\} \tag{3} $$

A digit set is said to be over-redundant if $a \ge r$; otherwise, if $a \le r-1$, it is said to be non over-redundant. We also define a redundancy factor $\rho$:

$$ \rho = \frac{a}{r-1} \tag{4} $$

If $\rho = \frac{1}{2}$, the digit set is non-redundant; if $\rho = 1$, the digit set is said to be maximally redundant. If $\rho > 1$ the digit set is said to be over-redundant.
2.1 The Selection Function
The digit selection function $F$ generates at the $(j+1)$-th iteration the digit $q_{j+1}$ of the quotient after evaluating the partial remainder at the $j$-th iteration, $w[j]$, and the divisor $d$:

$$ q_{j+1} = F(w[j], d) \tag{5} $$

A digit $q_{j+1} = k$ is selected if and only if the shifted partial remainder at the $j$-th iteration is such that $L_k \le r\,w[j] \le U_k$, where:

$$ L_k = (-\rho + k)\,d \qquad U_k = (\rho + k)\,d $$

are respectively the lower and the upper bound of the selection interval for digit $q_{j+1}$. Thus quotient digit calculation is simply a bound checking operation. It is important to remark that, since we use a redundant representation [Avi61, Par90, Par93, Kor94, KKH88], the selection intervals overlap; in particular, the higher $\rho$, the larger the overlap. There are, essentially, three different ways of implementing a selection function:

1. using multiple comparisons;
2. using selection constants;
3. rounding the partial remainder.

The first method consists in finding a multiple of the divisor $d$ that falls in the overlapping region between two consecutive selection intervals. The selection is then performed by comparing the partial remainder with such multiples. The second method is similar to the first. The range of $d$ is divided into several intervals and, for each interval, a constant falling in the overlapping region of two consecutive selection intervals must be found. Such a constant must be an integer multiple of the partial remainder granularity. Also in this case the selection is performed by comparing the partial remainder with the selection constants. The last method consists in computing the reciprocal of the divisor $d$ and then obtaining the partial remainder as the product of the dividend by the inverse of the divisor. Digit $q_{j+1}$ is computed by rounding the partial remainder. All of these methods are reported in [EL94].

If the residual and the divisor are evaluated in their full precision, $F$ may be extremely complex. It is therefore desirable to simplify $F$, for example by considering only estimates $\hat{w}[j]$ and $\hat{d}$ of the residual and of the divisor, such that $q_{j+1} = F(\hat{w}[j], \hat{d})$. These estimates are obtained from $w[j]$ and $d$ by truncation. This clearly leads to an estimation error; however, if the estimate falls into the overlapping region of two consecutive selection intervals, the selected digit is always correct [Atk68].
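As a small numerical illustration of the selection intervals, the sketch below evaluates $L_k$ and $U_k$ with exact rationals and measures the region shared by two consecutive digits, $U_{k-1} - L_k = (2\rho - 1)\,d$. The function names are hypothetical and chosen only for this example.

```python
from fractions import Fraction

def selection_interval(k, d, r, a):
    """Return (L_k, U_k) = ((k - rho) d, (k + rho) d) with rho = a / (r - 1)."""
    rho = Fraction(a, r - 1)
    d = Fraction(d)
    return (k - rho) * d, (k + rho) * d

def overlap(k, d, r, a):
    """Width of the region in which both digit k-1 and digit k are valid."""
    _, upper_prev = selection_interval(k - 1, d, r, a)
    lower_k, _ = selection_interval(k, d, r, a)
    return upper_prev - lower_k

# Radix 4 with a = 2 (rho = 2/3): the overlap at d = 1/2 is (2*rho - 1) * d = 1/6.
# Radix 4 with a = 3 (rho = 1, maximally redundant): the overlap equals d.
```

Any truncation error of the estimates has to fit inside this overlap, which is why a larger $\rho$ tolerates coarser estimates.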
A redundant representation that is particularly appealing for hardware implementation is the carry-save form (see Figure 2). A number represented in carry-save form is split into two parts: the sum and the carry. The non-redundant representation of the number may be obtained by adding the sum and carry words.
[Figure 2: Partial Residual Carry-Save Representation and its Limited Precision Assimilation.]

The carry-save representation permits the use of a carry-save adder and hence results in fast hardware implementations, since we avoid the use of time-consuming carry-propagate adders. If we consider residuals represented in two's complement carry-save form, we may compute the truncation amounts $t$ and $\delta$ of residual and divisor respectively, using the following expression:

$$ (2\rho - 1)\,\frac{1}{2} - (-\rho + a)\,2^{-\delta} \ge 2^{-t} \tag{6} $$

Equation (6) expresses a necessary but not sufficient condition for convergence; in fact it guarantees a minimum $t$ but it does not ensure the continuity between consecutive selection intervals. Finally, Figure 3 depicts the architecture of a simple radix-2 SRT divider.

[Figure 3: Architecture for SRT Radix-2 Division.]

In this case, we may note that the selection function does not depend on $d$. This is not true for higher radices.
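Condition (6) can be explored numerically. The following sketch, with a hypothetical function name, returns the smallest $t$ that satisfies the inequality for a given radix, digit-set bound $a$ and divisor truncation $\delta$. Since (6) is only a necessary condition, the values it returns are lower bounds and not a complete design procedure.

```python
from fractions import Fraction

def min_residual_bits(r, a, delta, t_max=32):
    """Smallest t with (2*rho - 1)/2 - (a - rho) * 2**-delta >= 2**-t, or None.

    Evaluates inequality (6) only; continuity of the selection intervals
    is not checked here.
    """
    rho = Fraction(a, r - 1)
    margin = (2 * rho - 1) * Fraction(1, 2) - (a - rho) * Fraction(1, 2 ** delta)
    if margin <= 0:
        return None          # delta is too small: no t can satisfy (6)
    for t in range(t_max + 1):
        if Fraction(1, 2 ** t) <= margin:
            return t
    return None

# Radix 4, a = 2: delta = 3 leaves no margin, delta = 4 needs t >= 4,
# delta = 5 needs t >= 3.
```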
2.2 Higher Radix Division
Several techniques have been reported for improving the performance of SRT division [EL90b, Fan89, MMH93, MC93, MC94a, MC94c]. One way of speeding up the division process consists in considering groups of bits rather than a single one; this may be interpreted as using a radix greater than two. For example, in radix-4 division a pair of quotient bits is generated at each iteration; in radix-8 three bits are generated at each step of the process. This means that the higher the radix, the smaller the number of iterations necessary to complete a division. Nevertheless, the price of this design approach is the increased hardware complexity of the quotient digit selection logic and of the multiple generation logic as well. In general a string of $k$ bits is equivalent to $y$ radix-$r$ digits, where:

$$ y = \left\lceil \frac{k}{\log_2 r} \right\rceil \tag{7} $$

A common technique used to implement higher-radix division stages consists in using lower-radix stages [Atk70, Tay85]. For example, a radix-8 stage with $\rho = \frac{6}{7}$ may be implemented by overlapping a radix-2 and a radix-4 stage, as illustrated in Figure 4.

[Figure 4: Architecture for SRT Radix-8 Division.]

A quotient digit is obtained as the sum of two contributions, namely $q_{j+1} = q_h + q_l$. Digit $q_h \in \{-4, 0, +4\}$ is generated by the radix-2 selection function $F_2$; digit $q_l \in \{-2, -1, 0, +1, +2\}$, on the other hand, is generated by the radix-4 selection function $F_4$. As a consequence, digit $q_{j+1}$ falls in the range $[-6, +6]$. Another technique used to speed up the algorithm convergence consists in overlapping the division steps in order to reduce the overall cycle time [Fan89, WH91].

Implementing high-radix SRT dividers is, in general, a challenging task [OF97], since the complexity of the quotient digit selection function increases. A technique used to reduce the function complexity is operand scaling. This technique permits the design of selection functions that do not depend on the divisor $d$ and hence have simpler hardware implementations. With this technique it is possible to implement a radix-256 divider whose performance is similar to that of a multiplication-based scheme. Implementations with operand scaling are presented in [Svo63, Tun68, EL90b, Mat91, MC94a, SK95].

In general, very-high radix algorithms are multiplication-based; first an approximation of the reciprocal of $d$ is computed and then the quotient is obtained by multiplying this reciprocal by the dividend. Some methods to compute the reciprocal of the divisor $d$ are reported in [Alv91, SW94, SM95, Fer67, Fly70, FS89, ITY95, PV96, SOS94, SF93, ITY97] and are based on the Newton-Raphson method as well as on convergence methods and linear or polynomial approximations of the divisor reciprocal. The main drawback of such methods is the unavailability of the remainder; for this reason it is preferable to employ hybrid techniques combining convergence and SRT methods [ELM93, ELM94].
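Two of the statements in this section are easy to check with a short sketch: the digit count of equation (7) and the range covered by the overlapped radix-8 digit $q_h + q_l$. The helper names are hypothetical, and the radix is assumed to be a power of two so that $\log_2 r$ is an integer.

```python
def digits_needed(k_bits, radix):
    """Equation (7): y = ceil(k / log2 r), for a power-of-two radix."""
    bits_per_digit = radix.bit_length() - 1      # log2(radix) when radix = 2**m
    return -(-k_bits // bits_per_digit)          # integer ceiling division

def radix8_digit_values():
    """All values q_h + q_l with q_h in {-4, 0, +4} and q_l in {-2..+2}."""
    return sorted({qh + ql for qh in (-4, 0, 4) for ql in (-2, -1, 0, 1, 2)})

# A 52-bit significand needs 26 radix-4, 18 radix-8 or 13 radix-16 digits,
# and the overlapped radix-8 stage covers every digit from -6 to +6.
```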
3 Quotient Digit Speculation

Speculation of quotient digits as a technique to speed up the division algorithm was first reported in [CL93] and then extended to high-radix division and to square root in [CL94]. Other related works are [PKCW95] and [Now96, NYBD97], where the speculation technique is exploited to design high-performance asynchronous adders. Speculation must not be confused with prediction [EL85, EL89]. In a prediction algorithm, at the $j$-th iteration digits $q_j$ and $q_{j+1}$ are computed. This means that we are able to generate $q_{j+1}$ before knowing the value $w[j]$ of the partial remainder at the end of iteration $j$. Namely, $q_{j+1} = F(q_j, \hat{d}, \hat{w}[j-1])$ is a function of $q_j$, of an estimate of the divisor $d$ and of an estimate of the partial remainder at the beginning of the $j$-th iteration. In [MC94b] a prediction technique is reported that, unlike [EL85, EL89], does not need operand prescaling.

The main differences between the method proposed in [CL93, CL94] and the one we have developed are the following:

1. The method proposed in [CL93, CL94] applies to multiplicative division; namely, the speculation is performed by first computing an estimate of the divisor reciprocal, then performing a left-to-right multiplication using the technique reported in [EL87, EL90a], and finally rounding the result in order to obtain a speculation of the quotient digit. Our method follows the SRT algorithm; hence the speculation of the quotient digit is obtained by comparing the partial remainder with suitably computed selection bounds. Each speculated digit $q^s_{j+1} = k$ has an associated lower and upper bound ($L^s_k$ and $U^s_k$ respectively). If the partial remainder falls within these bounds, the selection function selects the digit $q^s_{j+1} = k$.
2. The method proposed by us is analytical; namely, the amount of divisor and partial remainder truncation, as well as the selection constants, are obtained using a set of equations, while in [CL93, CL94] the selection function is implemented empirically via simulation.
3.1 Basic Theory
In case of speculation of the quotient digit the recurrence for division becomes:

$$ w^s[j+1] = r\,w[j] - q^s_{j+1}\,d \tag{8} $$

where $q^s_{j+1}$ is the speculated digit and $w^s[j+1]$ the speculated partial remainder. If the speculation is correct, then $w[j+1] = w^s[j+1]$ and $q_{j+1} = q^s_{j+1}$. In case of a wrong speculation both the partial remainder and the speculated quotient digit have to be corrected. In particular, the correct digit $q_{j+1}$ is:

$$ q_{j+1} = q^s_{j+1} + q^c_{j+1} \tag{9} $$

where $q^c_{j+1}$ is the correction digit. The wrongly speculated remainder may be corrected using the following relation:

$$ w[j+1] = w^s[j+1] - q^c_{j+1}\,d \tag{10} $$

The speculation error $\varepsilon_s$ is:

$$ \varepsilon_s = w[j+1] - w^s[j+1] = -q_{j+1}\,d + q^s_{j+1}\,d = (q^s_{j+1} - q_{j+1})\,d \tag{11} $$

Digit $q_{j+1}$ is such that $q_{j+1} \in \{q^s_{j+1} - \theta, \ldots, q^s_{j+1} - 1, q^s_{j+1}, q^s_{j+1} + 1, \ldots, q^s_{j+1} + \theta\}$, whereas $q^c_{j+1}$ is such that $q^c_{j+1} \in \{-\theta, \ldots, -1, 0, +1, \ldots, +\theta\}$, where $\theta$ is the largest magnitude of the correction digit. Taking into account relation (11), the speculation error is bounded by $-\theta d \le \varepsilon_s \le +\theta d$. Furthermore, according to (2), the partial remainder must be bounded; consequently we obtain:

$$ -\rho\,d \le w^s[j+1] + \varepsilon_s \le +\rho\,d \tag{12} $$

from which it follows that:

$$ -\rho\,d - \theta\,d \le w^s[j+1] \le +\rho\,d + \theta\,d \tag{13} $$

Hence, for $q^s_{j+1} = k$ we obtain:

$$ -\rho\,d + k\,d - \theta\,d \le r\,w[j] \le +\rho\,d + k\,d + \theta\,d \tag{14} $$

Recalling that $L_k = (-\rho + k)\,d$ and $U_k = (\rho + k)\,d$, (14) becomes:

$$ L_k(d) - \theta\,d \le r\,w[j] \le U_k(d) + \theta\,d \tag{15} $$

As a consequence, we may define a new lower and upper bound, which we call respectively $L^s_k$ and $U^s_k$, for speculating a digit $q^s_{j+1} = k$:

$$ L^s_k = (-\rho + k)\,d - \theta\,d < L_k \qquad U^s_k = (\rho + k)\,d + \theta\,d > U_k \tag{16} $$

Since $L^s_k < L_k$ and $U^s_k > U_k$, the overlap between two consecutive selection intervals is larger; consequently the selection function will need a smaller number of bits of both $w$ and $d$.
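The widened bounds of equation (16) can be checked numerically. In the sketch below `theta` is an assumed name for the largest magnitude of the correction digit, and the helper names are hypothetical. The example also shows that the speculative interval of digit $k$ spans exactly the exact selection intervals of digits $k - \theta$ through $k + \theta$.

```python
from fractions import Fraction

def exact_interval(k, d, r, a):
    """(L_k, U_k) of the exact selection function."""
    rho = Fraction(a, r - 1)
    return (k - rho) * d, (k + rho) * d

def speculation_interval(k, d, r, a, theta):
    """(L^s_k, U^s_k) as in equation (16): exact bounds widened by theta*d."""
    lower, upper = exact_interval(k, d, r, a)
    return lower - theta * d, upper + theta * d

def speculation_overlap(k, d, r, a, theta):
    """U^s_k - L^s_{k+1}, which should equal (2*(rho + theta) - 1) * d."""
    _, upper_k = speculation_interval(k, d, r, a, theta)
    lower_next, _ = speculation_interval(k + 1, d, r, a, theta)
    return upper_k - lower_next

# Radix 4, a = 2, theta = 1, d = 1/2: the overlap grows from 1/6 to 7/6.
```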
3.2 The Selection Function
The overlap between two consecutive speculation intervals is:

$$ U^s_k - L^s_{k+1} = (\rho + k + \theta)\,d - (-\rho + k + 1 - \theta)\,d = (2(\rho + \theta) - 1)\,d \tag{17} $$

Let now $t$ be the minimum number of fractional bits of $w$ necessary to ensure convergence and $\tau$ the number of bits of $\hat{w}$ we discard to perform a speculation. The minimum number of bits of $w$ necessary to perform a speculation is obtained by imposing:

$$ (2(\rho + \theta) - 1)\,d \ge 2^{\tau - t + 1} \tag{18} $$

Since the worst case occurs for $d = \frac{1}{2}$, (18) becomes:

$$ 2(\rho + \theta) - 1 \ge 2^{\tau - t + 2} \tag{19} $$

Imposing the continuity condition between two consecutive speculation intervals, we may find the minimum number of bits $\gamma$ of $d$ necessary for speculation:

$$ L^s_k(d_i + 2^{-\gamma}) \le U^s_{k-1}(d_i) \tag{20} $$

and hence, since the worst case occurs for $d_i = \frac{1}{2}$ and $k = a$, we obtain:

$$ (-2(\rho + \theta) + 1)\,\frac{1}{2} + (-\rho + a - \theta)\,2^{-\gamma} \le 0 \tag{21} $$

Equations (19) and (21) are used to compute the truncation amounts ($\tau$ and $\gamma$) of partial remainder and divisor respectively. Let $w'$ and $d'$ be the truncated remainder and divisor that we use to perform a speculation. Digit $q^s_{j+1}$ is speculated if and only if the following condition is satisfied:

$$ q^s_{j+1} = k \iff m^s_k(i) \le r\,w'[j] < m^s_{k+1}(i) \tag{22} $$

where $m^s_k(i)$ is the speculation constant for digit $q^s_{j+1} = k$ and $0 \le i \le 2^{\gamma - 1} - 1$ is the generic selection interval into which the variation range of the divisor has been divided. Each constant must fall in the overlap region between two adjacent selection intervals, namely:

$$ \max(L^s_k(i), L^s_k(i+1)) \le m^s_k(i) \le \min(U^s_{k-1}(i), U^s_{k-1}(i+1)) \tag{23} $$

In order to minimize the speculation error we may choose the speculation constants according to one of the following criteria:

1. Calculate $m^s_k(i)$ as the average of all the exact selection constants $m_k(j)$ that fall in the range $[d'_i, d'_{i+1})$, namely all the $m_k(j)$ such that $d'_i \le d_j < d'_{i+1}$ (with $i\,2^{\delta - \gamma} \le j < (i+1)\,2^{\delta - \gamma}$ and $0 \le i < 2^{\gamma - 1}$). So, if $\delta$ is the exact number of bits of the divisor $d$ necessary to guarantee the algorithm convergence, we obtain:

$$ m^s_k(i) = \frac{1}{2^{\delta - \gamma}} \sum_{j=0}^{2^{\delta - \gamma} - 1} m_k\!\left(i\,2^{\delta - \gamma} + j\right) $$

2. Among all the possible $m^s_k(i)$ choose the one that maximizes the probability of correct speculation, that is, the one such that the interval $[m^s_k(i), m^s_{k+1}(i))$ contains the largest number of values of $w^s[j]$ for which $q^s_{j+1} = q_{j+1} = k$.

The first approach is suitable for relatively low radices, for which the difference between the value used in the speculation and the correct value is small. Since for high radices this difference may increase significantly, the second approach is more likely to give the best results.
3.3 Error Detection and Correction
A correct speculation is performed if and only if, at each step $j$ of the recurrence, the speculated partial remainder $w^s[j]$ is bounded, that is:

$$ -\rho\,d \le w^s[j] \le \rho\,d \tag{24} $$

Thus, in order to verify the correctness of a speculation we may check whether the partial remainder generated by a speculation falls within the bounds expressed by (24). The comparison may be made faster by assimilating only $f$ bits (with $f < \delta$) of the divisor and by reducing the precision of the speculated partial remainder, inspecting only $e$ fractional bits with $e < t$. Taking into account that the partial remainder is represented in redundant carry-save form, the error due to the truncation to the $e$-th fractional bit is $\varepsilon_e = 2^{-e+1}$. Let $w''$ be an assimilation of $w^s$ up to the $e$-th bit and let $d''$ be a truncation of the divisor $d$ to the $f$-th fractional bit. Then (24) may be rewritten as follows:

$$ -\rho\,d'' \le w'' + 2^{-e+1} \le \rho\,d'' \tag{25} $$

and hence, in the most conservative case:

$$ -\rho\,d'' \le w'' \le \rho\,d'' - 2^{-e+1} \tag{26} $$

Let us now suppose, for example, that we have obtained a negative value of $w''$ that exceeds the lower bound but is still very close to it; a correction by $-1$ is necessary in order to obtain a bounded partial remainder. Hence we obtain:

$$ -\rho\,d'' + 1 \cdot d \le \rho\,d'' - 2^{-e+1} \tag{27} $$

This condition guarantees the algorithm convergence, since it satisfies the sufficient condition for convergence, namely $|w[j+1]| < |w[j]|$. Since $d = d'' + 2^{-f}$ we obtain:

$$ -\rho\,d'' + d'' + 2^{-f} \le \rho\,d'' - 2^{-e+1} \tag{28} $$

In the worst case $d'' = \frac{1}{2}$, thus (28) becomes:

$$ (2\rho - 1)\,\frac{1}{2} \ge 2^{-e+1} + 2^{-f} \tag{29} $$

Moreover, a wrongly speculated partial remainder, since it exceeds the correct bounds, is likely to be greater than one and needs some integer bits to be represented. Since, according to (11), the maximum speculation error is $\theta d$, the bounds of a wrong remainder become:

$$ -(\rho + \theta)\,d \le w^s[j+1] \le (\rho + \theta)\,d $$

Thus, for $\rho \le 1$ and taking into account that in the worst case $d = 1$, the number of integer bits necessary to represent a wrong remainder is:
= dlog2(1 + )e + 1
r?1
(30)
Since, in order to simplify the error detection and correction logic, we impose that $q^c_{j+1} \in \{-1, 0, +1\}$, once an error has been detected by a bound checking operation, the correction may be performed by evaluating only the sign of the wrong remainder. Namely:

$$ q^c_{j+1} = \begin{cases} -1 & \text{if } w''[j] < -\rho\,d'' \\ +1 & \text{if } w''[j] > \rho\,d'' - 2^{-e+1} \end{cases} \tag{31} $$
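The interplay of equations (8) to (10) and of the sign-based correction rule (31) can be illustrated with a behavioural sketch. It rests on several assumptions that are not part of the paper: exact rational arithmetic is used instead of truncated carry-save operands (so the $2^{-e+1}$ term of (31) vanishes), mispredictions are injected artificially through the hypothetical `errors` pattern, and the correction is applied within the same loop iteration rather than in a separate cycle.

```python
from fractions import Fraction
from itertools import cycle

def speculative_divide(x, d, r=4, a=2, n=10, errors=(0, 1, -1)):
    """Behavioural model of speculate / detect / correct with q^c in {-1, 0, +1}."""
    rho = Fraction(a, r - 1)
    w, d = Fraction(x), Fraction(d)
    digits, corrections = [], 0
    for _, err in zip(range(n), cycle(errors)):
        shifted = r * w
        q_exact = max(-a, min(a, round(shifted / d)))   # idealized exact digit
        q_s = max(-a, min(a, q_exact + err))            # injected misprediction
        w_s = shifted - q_s * d                         # eq. (8)
        if w_s < -rho * d:                              # eq. (31), full precision
            q_c = -1
        elif w_s > rho * d:
            q_c = 1
        else:
            q_c = 0                                     # residual already bounded
        w = w_s - q_c * d                               # eq. (10)
        digits.append(q_s + q_c)                        # eq. (9)
        corrections += int(q_c != 0)
    return digits, w, corrections
```

A misprediction that still leaves the residual within $\pm\rho d$ needs no correction at all, because the redundancy of the digit set makes the speculated digit a valid choice.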
3.4 Extending the Correction Digit Set
In order to speed up the algorithm convergence we may try to reduce the number of iterations necessary to correct a wrongly speculated partial remainder. This may be achieved by enlarging the correction digit set. Until now we have supposed that $q^c_{j+1} \in \{-1, 0, +1\}$; thus, in case of an unbounded remainder, a correction by $\pm 1$ is performed. If we take into account the magnitude of the speculation error, it is possible to perform two types of correction, depending on whether the speculation error $\varepsilon_s$ exceeds a certain bound or not. Let us suppose, for the sake of simplicity, that $q^c_{j+1} \in \{-\theta', -1, 0, +1, +\theta'\}$ with $\theta' > 1$; then we obtain:

$$ q^c_{j+1} = \begin{cases} -\theta' & \text{if } w''[j] \le \beta^- \\ -1 & \text{if } \beta^- < w''[j] < -\rho\,d'' \\ +1 & \text{if } \rho\,d'' - 2^{-e+1} < w''[j] < \beta^+ \\ +\theta' & \text{if } w''[j] \ge \beta^+ \end{cases} \tag{32} $$

The lower bound $\beta^-$ and the upper bound $\beta^+$ may be calculated in a straightforward way. The lower bound $\beta^-$ must be such that a correction by $\theta' d$ produces a bounded partial remainder, namely:

$$ -\rho\,d'' \le \beta^- + \theta'\,d \le +\rho\,d'' - 2^{-e+1} \tag{33} $$

Hence, considering that $d = d'' + 2^{-f}$, we obtain:

$$ -(\rho + \theta')\,d'' - \theta'\,2^{-f} \le \beta^- \le (\rho - \theta')\,d'' - \theta'\,2^{-f} - 2^{-e+1} \tag{34} $$

Since we are computing a lower bound, to be conservative we have to choose the largest value; consequently $\beta^-$ is:

$$ \beta^- = (\rho - \theta')\,d'' - \theta'\,2^{-f} - 2^{-e+1} \tag{35} $$

As far as the upper bound $\beta^+$ is concerned, proceeding in the same fashion we obtain:

$$ \beta^+ = (\theta' - \rho)\,d'' - \theta'\,2^{-f} \tag{36} $$

These considerations may be easily extended to any correction digit set $\mathcal{D}_c$.
3.5 The Design Environment
The design framework we have used to map our algorithm onto a VLSI circuit is formed by three parts:

1. Emma, a tool implemented by us that simulates the algorithm and generates the truth table of the selection function;
2. Gyocro, a tool used to simplify Emma's output and to map it onto a PLA;
3. SIS [SSL+92], a tool used to map the simplified PLA into multilevel combinational logic of a specific technology, and to perform area and delay estimations.

The designs have been implemented using a 1 µm CMOS standard-cell library (ES2-ECPD10). Delays and area are normalized with respect to a 2-input NAND gate. For delay estimation we assume a fanout of three NAND gates. Delay and area due to the interconnections have been neglected. The physical parameters of a two-input NAND cell are reported in Table 1.

| Parameter | Value | Unit |
|---|---|---|
| Size | 12.5 × 47.5 | µm² |
| $C_{in}$ | 0.052 | pF |
| Fanout | 0.69 | pF |
| $t_{p,lh}$ | 0.24 | ns |
| $t_{p,hl}$ | 0.20 | ns |
| $t_{p,lh}$ | 1.43 | ns/pF |
| $t_{p,hl}$ | 1.06 | ns/pF |

Table 1: Two-input NAND physical Parameters.

From Table 1 we may deduce that a fanout of three NAND gates implies a load capacitance of 0.156 pF.
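Let now $C_w$ be the equivalent capacitance of the loading wire and $C_g$ the input capacitance of the loading gates. Delay $D$ is computed using the following expression:

$$ D = t_{p,xx} + t_{p,xx}\,(C_w + C_g) $$

where $xx$ stands for $lh$ or $hl$, the first term is the load-independent delay (in ns) and the coefficient of the second term is the load-dependent delay (in ns/pF) of Table 1. Since in our estimations we neglect the delay due to wiring, we suppose $C_w = 0$, whereas $C_g = 0.156$ pF. All the evaluations have been performed for double-precision floating-point numbers with a 52-bit significand.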
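As a worked instance of the delay expression, the sketch below plugs the Table 1 values into it for a fanout of three NAND gates and no wiring load. The function name and the pairing of each load-independent term with the load-dependent coefficient of the same transition are assumptions made for illustration.

```python
C_IN_PF = 0.052                      # NAND input capacitance from Table 1

def gate_delay_ns(t_fixed_ns, slope_ns_per_pf, c_wire_pf=0.0, fanout_gates=3):
    """D = t_fixed + slope * (C_w + C_g), with C_g = fanout_gates * C_in."""
    c_gate_pf = fanout_gates * C_IN_PF          # 3 * 0.052 = 0.156 pF
    return t_fixed_ns + slope_ns_per_pf * (c_wire_pf + c_gate_pf)

delay_lh = gate_delay_ns(0.24, 1.43)   # low-to-high transition, about 0.463 ns
delay_hl = gate_delay_ns(0.20, 1.06)   # high-to-low transition, about 0.365 ns
```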
| Radix | 4 | 8 | 16 |
|---|---|---|---|
| $a$ | 2 | 7 | 12 |
| # of CSAs | 1 | 2 | 2 |
| # bits of $w'$ (sum, carry) | (3,3) | (5,5) | (5,5) |
| # bits of $d'$ | 0 | 1 | 2 |
| # bits of $w''$ (sum, carry) | (6,6) | (5,5) | (6,6) |
| # bits of $d''$ | 3 | 1 | 2 |
| cycle/digit | 1.38 | 1.12 | 1.26 |
| cycle delay | 20.6 | 27.3 | 28 |
| delay/bit | 14.2 | 10.2 | 8.8 |
| area | 2360 | 3295 | 3815 |

Table 2: Characteristics of the implementations.
3.6 Performance Evaluation
In order to evaluate the performance of a design we use two parameters:

1. cycles per digit ($C_d$);
2. delay per bit ($D_b$).

Let $N_d$ be the number of divisions simulated, $N_c$ the number of correction cycles that have been performed and $m$ the size of the significand (in bits); the number of cycles per quotient digit may be defined as:

$$ C_d = 1 + \frac{N_c}{N_d \left\lceil m / \log_2 r \right\rceil} \tag{37} $$

If $D$ is the cycle delay of the design, the delay per bit may be defined as follows:

$$ D_b = \frac{C_d\,D}{\log_2 r} \tag{38} $$
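Both metrics are simple enough to evaluate directly. The sketch below, with hypothetical function names, implements (37) and (38) and recomputes the delay per bit from the cycle/digit and cycle delay figures of Table 2; the simulation counts $N_c$ and $N_d$ are not given in the paper, so the first function is shown only with made-up inputs.

```python
import math

def cycles_per_digit(n_corrections, n_divisions, m, r):
    """Equation (37): C_d = 1 + N_c / (N_d * ceil(m / log2 r))."""
    digits_per_division = math.ceil(m / math.log2(r))
    return 1 + n_corrections / (n_divisions * digits_per_division)

def delay_per_bit(c_d, cycle_delay, r):
    """Equation (38): D_b = C_d * D / log2 r."""
    return c_d * cycle_delay / math.log2(r)

# Table 2 figures: (radix, cycle/digit, cycle delay) -> delay/bit
for radix, c_d, delay in ((4, 1.38, 20.6), (8, 1.12, 27.3), (16, 1.26, 28)):
    print(radix, round(delay_per_bit(c_d, delay, radix), 1))
```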
3.7 Implementation
We have performed several designs implementing our algorithm and the results are reported in Table 2. If we compare the results shown in Table 2 with those reported in [CL94] for speculation without partial advance, we may conclude that, in the radix-16 case, our algorithm produces a speedup of about 7% with respect to the radix-16 design described in [CL94] and has the same performance as a speculative radix-512 architecture. Nevertheless, the main drawback of our design approach is its unsuitability for very high radices, since the increase of the hardware complexity of the selection table introduces very high delays that affect the performance of the divider, increasing the execution time per quotient digit, as also shown in Figure 5.

[Figure 5: Delay/Digit vs. Radix for Quotient Digit Selection Function. Vertical axis: $D_d$; horizontal axis: radix.]

Analyzing Figure 5 we may note that the radix that offers the best performance is $r = 16$. For higher radices the delay per digit starts increasing (for example, for $r = 64$ we have $D_d = 10.35$).

Figure 6 shows the implementation of the speculative radix-16 divider with $\rho = \frac{12}{15}$. The dashed line indicates the critical path. In order to reduce the cycle time, digits $q^s_{j+1} = -11$ and $q^s_{j+1} = +11$ are not speculated even though they belong to $\mathcal{D}_s$, since three CSAs would be necessary to obtain them; consequently, when $q_{j+1} = \pm 11$ an error occurs and the correction unit is exploited to obtain the correct digit. Analogously, to avoid the use of a decoder that could introduce an extra delay along the critical path, digits $q^s_{j+1} = -12$ and $q^s_{j+1} = +12$ are not speculated either. In this way we avoid the conflicts that could arise when the speculated digits are $\pm 12$, $\pm 6$ or $\pm 5$. Also, the PLA implementing the speculation function is split into two smaller PLAs, separating the product terms according to the delay they introduce, so that the speculated digit $q^s_{j+1}$ is the sum of two contributions, namely:

$$ q^s_{j+1} = q_h + q_l $$

The faster PLA generates the digit $q_h \in \{\pm 8, \pm 4\}$ and drives the first multiplexer. The slower PLA generates the digit $q_l \in \{\pm 2, \pm 1\}$ and drives the second multiplexer. A similar design is followed for the error detection and correction logic. It must also be remarked that, since we use multiplexers with decoded control, digit $q^s_{j+1} = 0$ is implicit and is not generated by the speculation function. Finally, Table 3 reports area and delays for the radix-16 speculative divider.

[Figure 6: Implementation of the Radix-16 Speculative Divider.]
3.8 Reducing the Number of Outputs
The selection function may be simplified by reducing the digit set. This design approach leads to a reduction of the number of outputs of the selection logic. The error correction logic may be used to generate the remaining digits. Since the speculation interval of the generic digit $q^s_{j+1} = k$ contains the selection intervals of all the digits from $q_{j+1} = k - \theta$ to $q_{j+1} = k + \theta$, the number of outputs $n_S$ of the selection logic is:

$$ n_S = \left\lceil \frac{2a + 1}{2\theta + 1} \right\rceil \tag{39} $$

This also affects the number of integer bits necessary to represent the wrongly speculated partial remainder, which becomes:

r ? 1 = dlog2(1 + 2)e + 1 2 (40)

In fact, reducing the number of outputs of the speculation logic means reducing the set of digits that have to be speculated and hence increasing the speculation error; as a consequence, the main drawback of this design approach is the increased number of corrections necessary for the algorithm to converge. It is however possible to reduce this number by extending the correction digit set $\mathcal{D}_c$. Although this approach simplifies the hardware of the selection logic and produces a faster execution time per cycle, it does not lead to a speedup of the architecture, since the performance is heavily penalized by the increase of $C_d$.
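Equation (39) can be evaluated for the digit sets considered in this paper. In the sketch below `theta` is an assumed name for the number of neighbouring digits covered on each side of a speculated digit, and the function name is hypothetical.

```python
def selection_outputs(a, theta):
    """Equation (39): n_S = ceil((2a + 1) / (2*theta + 1))."""
    return -(-(2 * a + 1) // (2 * theta + 1))    # integer ceiling division

# Radix 16 (a = 12): 25 digits collapse to 5 outputs for theta = 2
# and to 3 outputs for theta = 4.
# Radix 64 (a = 37): 75 digits collapse to 15 and 9 outputs respectively.
```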
| Module | Area | Delay |
|---|---|---|
| digit speculation ($q_h$) | 140 | 9.2* |
| digit speculation ($q_l$) | 218 | 10.9 |
| error detection | 104 | 9.4 |
| error correction | 134 | 10.3 |
| MUX (for $q_j d$) | 2 × 319 | 1.4* |
| MUX (for $w[j]$) | 342 | 1.4* |
| MUX (for $q_l$) | 12 | 1.4 |
| CSA (57 bits) | 2 × 382 | 4.2* (from residual), 2.2* (from $q_j d$) |
| Buffer | 12 × 3 | 1.8* |
| Registers | 6 × 228 | 7.8* |
| Total | 3815 | 28 |

(* denotes delays in the critical path)

Table 3: Area and Delay of the Radix-16 Speculative Divider.

From (38) we obtain:

$$ D = \frac{D_b \log_2 r}{C_d} \tag{41} $$

Thus, in order to obtain a radix-16 divider with a delay per bit $D_b$ of, say, 8 and $\theta = 2$, according to (41) the cycle delay should be approximately 17.6. This means that, according to the values reported in Table 3, the quotient digit selection logic should have a delay of 0.6, which is impossible to achieve.
4 Conclusions

Quotient digit speculation is an alternative to conventional division techniques. Assimilating a reduced number of bits of both residual and divisor may lead to fast selection functions and hence to a shorter execution time. Nevertheless, the speculated digit may be incorrect. The correctness of the selection is checked by an error detection function that verifies whether the speculated residual falls within the correct bounds or not. In case of incorrect speculation the algorithm rolls back and the digit is corrected. This operation is performed by an error correction function. Because of the possible rollbacks, the execution time is variable. We have developed an analytical approach to quotient digit speculation, reformulating the basic theory of SRT division. This means that the numbers of bits of residual and divisor that have to be assimilated are computed by a set of equations that take into account the desired speculation error.
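| Radix | $\theta$ | $a$ | # bits of $w$ (sum, carry): exact | spec. | corr. | # frac. bits of $d$: exact | spec. | corr. | cyc./dig. |
|---|---|---|---|---|---|---|---|---|---|
| 16 | 4 | 12 | (9, 9) | (3, 3) | (9, 9) | 5 | 0 | 2 | 2.34 |
| 16 | 4 | 12 | (9, 9) | (4, 4) | (9, 9) | 5 | 0 | 2 | 2.24 |
| 16 | 2 | 12 | (9, 9) | (4, 4) | (8, 8) | 5 | 1 | 2 | 1.91 |
| 16 | 2 | 12 | (9, 9) | (5, 5) | (8, 8) | 5 | 2 | 2 | 1.81 |
| 32 | 4 | 22 | (12, 12) | (4, 4) | (10, 10) | 6 | 2 | 2 | 2.34 |
| 32 | 4 | 22 | (12, 12) | (5, 5) | (10, 10) | 6 | 2 | 2 | 2.25 |
| 32 | 2 | 22 | (12, 12) | (5, 5) | (9, 9) | 6 | 3 | 2 | 1.93 |
| 32 | 2 | 22 | (12, 12) | (6, 6) | (9, 9) | 6 | 3 | 2 | 1.87 |
| 64 | 4 | 37 | (14, 14) | (6, 6) | (12, 12) | 8 | 2 | 3 | 2.25 |
| 64 | 4 | 37 | (14, 14) | (7, 7) | (12, 12) | 8 | 2 | 3 | 2.24 |
| 64 | 2 | 37 | (14, 14) | (6, 6) | (11, 11) | 8 | 4 | 3 | 1.9 |

Table 4: Truncation Amounts for Quotient Digit Speculation, Error Detection and Correction Reducing the Outputs Number of the Selection Logic.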
In order to reduce the number of corrections to be performed, it is possible to extend the correction digit set so that its largest digit covers the largest speculation error; that is, in the case of maximum error, a correction by that digit times d produces a bounded partial remainder. Clearly, extending the correction digit set increases the hardware complexity of the correction function, so a tradeoff must be found. We have performed several designs implementing the proposed algorithm, and we have found that the architecture with the best execution time per quotient digit is the radix-16 with a redundancy factor of 12/15. For higher radices we have noted a decrease of performance due to the increasing complexity of the selection function as well as the decrease of the probability of correct speculation. Presently we are examining several design issues to extend our algorithm to higher radices; among these we are considering the use of over-redundant digit sets to increase the probability of correct speculation, simple low-radix overlapped speculation stages, and polynomial approximations of the divisor reciprocal.