Circuits Syst Signal Process DOI 10.1007/s00034-013-9586-3

Decimal SRT Square Root: Algorithm and Architecture

Amir Kaivani · Seok-Bum Ko

Received: 15 November 2012 / Revised: 21 March 2013 © Springer Science+Business Media New York 2013

Abstract Given the popularity of decimal arithmetic, hardware implementation of decimal operations has been a hot topic of research in recent decades. Besides the four basic operations, the square root can be implemented as an instruction directly in the hardware, which improves the performance of the decimal floating-point unit in the processors. Hardware implementation of decimal square rooters is usually done using either functional or digit-recurrence algorithms. Functional algorithms, entailing multiplication per iteration, seem inadequate to use for decimal square roots, given the high cost of decimal multipliers. On the other hand, digit-recurrence square root algorithms, particularly SRT (this method is named after its creators, Sweeney, Robertson, and Tocher) algorithms, are simple and well suited for decimal arithmetic. This paper, with the intention of reducing the latency of the decimal square root operation while maintaining a reasonable cost, proposes an SRT algorithm and the corresponding hardware architecture to compute the decimal square root. The proposed fixed-point square root design requires n + 3 cycles to compute an n-digit root; the synthesis results show an area cost of about 31K NAND2 and a cycle time of 40 FO4. These results reveal the 14 % speed advantage of the proposed decimal square root architecture over the fastest previous work (which uses a functional algorithm) with about a quarter of the area. Keywords Decimal arithmetic · Square root · SRT algorithm

A. Kaivani · S.-B. Ko
Department of Electrical & Computer Engineering, University of Saskatchewan, Saskatoon, SK, Canada
S.-B. Ko e-mail: [email protected]
A. Kaivani e-mail: [email protected]


1 Introduction The annals of computer arithmetic are replete with facts uncovering the crucial applications of both binary and decimal arithmetic processors. Binary digital circuits have dominated the processor industry due to their undisputed cost effectiveness. Decimal arithmetic, however, has been supported by the supremacy of its human-centric applications [1]. The (partially) software implementation of decimal arithmetic operations over binary logic devices had been the most common approach until lately, when IBM started to reveal the all-hardware solution for decimal processors [2]. Moreover, the gradual growth of decimal hardware arithmetic began to gather pace at the time when the IEEE was spurred to support the revitalized decimal floating-point number system in the IEEE 754-2008 standard [7]. Hardware implementation of decimal arithmetic operations calls for elaborate observations due to the inherent intricacy of the decimal number system with respect to that of the binary system. Besides the popular four dyadic decimal operations (addition, subtraction, multiplication, and division), the unary square root operation can be implemented as an instruction directly in the hardware. This boosts the performance of the decimal floating-point unit in the processors, particularly when the square root is implemented, sharing hardware with the decimal divider. Decimal square rooters are usually implemented in hardware using functional algorithms such as Newton–Raphson [13, 17] and [12]. However, these methods require a multiplication per iteration. Consequently, given the high cost of parallel decimal multipliers [6, 8] and [16], the functional algorithms seem inadequate to employ for the decimal square root. Digit-recurrence algorithms, conversely, are conceptually simple and well suited for decimal square root due to their low hardware complexity. Moreover, using these algorithms paves the way for a shared decimal division/square root unit. 
The digit-recurrence square root algorithm is exemplified in Eq. (1), where $0.01 \le X < 1$, $0.1 \le Q < 1$, and $0 \le R < \mathrm{ulp} = 10^{-n}$ (where $n$ is the number of fractional digits; see footnote 1) are the normalized radicand, the root, and the remainder, respectively. Note that for the floating-point representation the radicand should be scaled so that its exponent is even.

$$\sqrt{X} = Q + R \qquad (1)$$

Two recent pertinent works based on the Newton–Raphson algorithm are [17] and [12]; the latter is the fastest method available in the literature. The former presents a hardware design for decimal floating-point square root in which the size of the required look-up table is reduced with respect to other similar designs. The most recent work [12, 13] has reduced the latency of the decimal square root operation by taking advantage of a parallel fused multiply-add (FMA) unit, but at the expense of high area consumption.

Footnote 1: ulp means unit in the least (significant) position.


The work by Ercegovac and McIlhenny [3], which is based on the digit-recurrence algorithm, uses a look-up table to compute a rough approximation of the root and then corrects the result via a division operation. Moreover, one may use the CORDIC algorithm to compute the decimal square root [15] and [10]. This paper, with the purpose of reducing the latency of the decimal square root operation while not consuming a large area, proposes a digit-recurrence algorithm and the corresponding hardware architecture to compute the decimal square root. The main advantage of the proposed algorithm over previous works is to remove the slow and costly look-up tables. This, however, entails generating inconstant comparison multiples to be used in the proposed SRT algorithm. The rest of the paper is organized as follows. Section 2 provides some background on the decimal digit-recurrence SRT square root algorithm which is a preface to the proposed design discussed in Sect. 3. In this section, the proposed algorithm is explained, and then the corresponding hardware architecture is presented. An evaluation of this architecture and a comparison to the fastest previous method are provided in Sect. 4. Section 5 presents the conclusions.

2 Decimal Digit-Recurrence Square Root

The decimal digit-recurrence square root estimates the root $Q$ with an error of less than $\mathrm{ulp} = 10^{-n}$. In this approach, one root digit $q_i$ ($0 \le i \le n$) is generated per iteration, such that the root in the $i$th iteration, $Q[i]$, is as in Eq. (2). It should be noted that $0.1 \le Q = \sqrt{X} - R < 1$ forces $q_0 = 0$ in the case of a nonredundant representation of $Q$.

$$Q[i] = \sum_{j=0}^{i} q_j \, 10^{-j} \qquad (2)$$

Therefore, the error of the root estimation in the $i$th iteration, $\varepsilon_q[i]$, is

$$\varepsilon_q[i] = \sqrt{X} - Q[i] \qquad (3)$$

The bounds of the final error (i.e., after $n$ iterations), $0 \le |\varepsilon_q[n]| = \sqrt{X} - Q[n] < 10^{-n} = \mathrm{ulp}$, impose (as a conclusion of Eqs. (2) and (3) for $i + 1$) Eq. (4) as the required condition on the root estimation error, assuming a root digit with a redundant representation, i.e., $q_i \in [-\alpha, \beta]$ ($\alpha, \beta \ge 0$ and $\alpha + \beta + 1 > 10$):

$$10^{-i}\left(\frac{-\alpha}{9}\right) < \varepsilon_q[i] < 10^{-i}\left(\frac{\beta}{9}\right) \qquad (4)$$

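Define $H_n[i]$ ($H_p[i]$) as the lower (upper) bound of the $i$th partial remainder $w[i]$; then $w[i]$ must satisfy Eq. (5), derived by substituting Eq. (3) in Eq. (4).

$$
\begin{aligned}
& 10^{-i}\left(\frac{-\alpha}{9}\right) < \sqrt{X} - Q[i] < 10^{-i}\left(\frac{\beta}{9}\right) \\
\Rightarrow\; & Q[i]^2 + 10^{-2i}\left(\frac{-\alpha}{9}\right)^2 + 2Q[i]\,10^{-i}\left(\frac{-\alpha}{9}\right) < X < Q[i]^2 + 10^{-2i}\left(\frac{\beta}{9}\right)^2 + 2Q[i]\,10^{-i}\left(\frac{\beta}{9}\right) \\
\Rightarrow\; & 10^{-2i}\left(\frac{-\alpha}{9}\right)^2 + 2Q[i]\,10^{-i}\left(\frac{-\alpha}{9}\right) < X - Q[i]^2 < 10^{-2i}\left(\frac{\beta}{9}\right)^2 + 2Q[i]\,10^{-i}\left(\frac{\beta}{9}\right) \\
\Rightarrow\; & H_n[i] = 10^{-i}\left(\frac{-\alpha}{9}\right)^2 + 2\left(\frac{-\alpha}{9}\right) Q[i] < w[i] = 10^{i}\left(X - Q[i]^2\right) < 10^{-i}\left(\frac{\beta}{9}\right)^2 + 2\left(\frac{\beta}{9}\right) Q[i] = H_p[i]
\end{aligned}
\qquad (5)
$$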
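The bounds of Eq. (5) can be evaluated numerically to see what the convergence condition means in practice. The following is a minimal sketch using exact rational arithmetic; the function names `remainder_bounds` and `partial_remainder` are hypothetical, the choice $\alpha = \beta = 5$ anticipates Sect. 3, and the sample values ($X = 0.3521986$, $Q[2] = 0.59$) are borrowed from Example 1.

```python
from fractions import Fraction as F

def remainder_bounds(Q, i, alpha=5, beta=5):
    """Return (H_n[i], H_p[i]) as defined in Eq. (5)."""
    a, b = F(-alpha, 9), F(beta, 9)
    scale = F(1, 10**i)
    return scale * a * a + 2 * a * Q, scale * b * b + 2 * b * Q

def partial_remainder(X, Q, i):
    """w[i] = 10^i * (X - Q[i]^2)."""
    return 10**i * (X - Q * Q)

# Sample point taken from Example 1 (illustrative only).
X = F(3521986, 10**7)
Q2 = F(59, 100)
w2 = partial_remainder(X, Q2, 2)
Hn, Hp = remainder_bounds(Q2, 2)
print(float(Hn), float(w2), float(Hp))  # w2 should lie strictly between the bounds
```

Exact fractions are used so that the comparison against the bounds is not blurred by binary floating-point rounding.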

The admissible range for the partial remainder, defined in Eq. (5), is also known as the convergence condition of the decimal digit-recurrence square root algorithm. The recurrence equation of the decimal square root algorithm is determined, in Eq. (6), by substituting Eq. (2) in Eq. (5) for $w[i + 1]$. Here $\hat{Q}[i]$ denotes the subtrahend of the recurrence, as distinct from the partial root $Q[i]$.

$$
\begin{aligned}
w[i+1] &= 10^{i+1}\left(X - Q[i+1]^2\right) \\
&= 10^{i+1}\left(X - Q[i]^2 - (q_{i+1})^2 \times 10^{-2(i+1)} - 2Q[i] \times q_{i+1} \times 10^{-(i+1)}\right) \\
&= 10w[i] - 2Q[i] \times q_{i+1} - (q_{i+1})^2 \times 10^{-(i+1)} \\
\Rightarrow\; w[i+1] &= 10w[i] - \hat{Q}[i]; \qquad \hat{Q}[i] = Q[i] \times 2q_{i+1} + (q_{i+1})^2 \times 10^{-(i+1)}
\end{aligned}
\qquad (6)
$$

The recurrence should be performed such that $w[i + 1]$ is bounded as in Eq. (5). This requires a careful computation of $\hat{Q}[i]$ and hence a suitable selection of $q_{i+1}$, based on the values of $10w[i]$ and $Q[i]$, i.e., root digit selection (RDS). The correct selection is guaranteed if computing Eq. (6) satisfies Eq. (5) for the next partial remainder. For this purpose, selection intervals $(L_k[i], U_k[i])$ are defined such that if $10w[i] \in (L_k[i], U_k[i])$ with $k \in [-\alpha, \beta]$, then $q_{i+1} = k$ is admissible, i.e., it keeps the next partial remainder within the required bounds. Therefore, according to Eqs. (5) and (6), the boundaries of the selection intervals ($U_k[i]$ and $L_k[i]$) are

$$
\begin{aligned}
& q_{i+1} = k,\; L_k[i] < 10w[i] < U_k[i] \;\Rightarrow\; H_n[i+1] < 10w[i] - \hat{Q}[i] < H_p[i+1] \\
\Rightarrow\; & U_k[i] = H_p[i+1] + Q[i] \times 2k + k^2 \times 10^{-(i+1)}, \qquad L_k[i] = H_n[i+1] + Q[i] \times 2k + k^2 \times 10^{-(i+1)}
\end{aligned}
\qquad (7)
$$

For any value of the shifted partial remainder $10w[i]$ there must exist at least one admissible root digit. Therefore, the selection intervals must overlap, i.e., $L_k[i] < U_{k-1}[i]$. Hence Eq. (8) must hold in all iterations, i.e., for $0 \le i \le n$:

$$
\begin{aligned}
L_k[i] < U_{k-1}[i] \;\Rightarrow\; & H_n[i+1] + Q[i] \times 2k + k^2 \times 10^{-(i+1)} < H_p[i+1] + Q[i] \times 2(k-1) + (k-1)^2 \times 10^{-(i+1)} \\
\Rightarrow\; & 2Q[i] + (2k-1) \times 10^{-(i+1)} < H_p[i+1] - H_n[i+1]
\end{aligned}
\qquad (8)
$$

Fig. 1 The selection intervals and the comparison multiples
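The recurrence of Eq. (6) can be checked against the defining relation $w[i+1] = 10^{i+1}(X - Q[i+1]^2)$ for a concrete digit. The sketch below is illustrative only: the function name `recurrence_step` and the variable `Q_hat` (the subtrahend of Eq. (6)) are hypothetical, and the numbers are the first iteration of Example 1 ($Q[0] = 1$, $q_1 = -4$).

```python
from fractions import Fraction as F

def recurrence_step(w, Q, q, i):
    """Apply Eq. (6) once: return (w[i+1], Q[i+1]) for the chosen digit q = q_{i+1}."""
    ulp = F(1, 10 ** (i + 1))
    Q_hat = 2 * Q * q + q * q * ulp      # subtrahend of the recurrence
    return 10 * w - Q_hat, Q + q * ulp

X = F(3521986, 10**7)
Q0 = F(1)
w0 = X - Q0                              # w[0] = X - Q[0]
w1, Q1 = recurrence_step(w0, Q0, -4, 0)

# The recurrence and the closed form 10^(i+1) * (X - Q[i+1]^2) agree exactly.
print(w1 == 10 * (X - Q1 * Q1))
```

The point of the recurrence form is that the squaring of $Q[i+1]$ never has to be carried out; only a digit multiple of $Q[i]$ and a one-digit square are needed per step.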

To make RDS less costly, it is common to select $q_{i+1}$ by comparing the truncated shifted partial remainder, written here as $\lfloor 10w[i] \rfloor$, with comparison multiples $M_k[i]$ for $k \in (-\alpha, \beta]$. These comparison multiples should be bounded as in Eq. (9) (also shown in Fig. 1) in order to compensate for the error brought about by using truncated operands. Therefore, the next root digit is selected based on Eq. (10), and hence the maximum absolute admissible selection error is equal to $\min[(M_k[i] - L_k[i]), (U_{k-1}[i] - M_k[i])]$.

$$
\begin{aligned}
L_k[i] < M_k[i] < U_{k-1}[i] \;\Rightarrow\; & H_n[i+1] + Q[i] \times 2k + k^2 \times 10^{-(i+1)} \\
& < M_k[i] < H_p[i+1] + Q[i] \times 2(k-1) + (k-1)^2 \times 10^{-(i+1)}
\end{aligned}
\qquad (9)
$$

$$q_{i+1} = k, \quad \text{if } M_k[i] \le \lfloor 10w[i] \rfloor < M_{k+1}[i] \qquad (10)$$

3 The Proposed Decimal Square Root

We opt for $\alpha = \beta = 5$ for the proposed square root algorithm for two reasons: first, to reduce the complexity of computing $(q_{i+1})^2$, required in Eq. (6); second, for the sake of fewer comparison multiples. Consequently, the bounds of the partial remainder (according to Eq. (5)) are as in Eq. (11). These bounds, according to Eq. (8), require $Q[i] > 0.05$, which always holds given the assumed radicand and root, i.e., $0.1 \le Q < 1$.

$$H_n[i] = \frac{25}{81} \times 10^{-i} - \frac{10}{9} Q[i] < w[i] < \frac{25}{81} \times 10^{-i} + \frac{10}{9} Q[i] = H_p[i] \qquad (11)$$

The comparison multiples should be bounded as in Eq. (12), derived from Eq. (9) by substituting $H_n[i+1]$ and $H_p[i+1]$ from Eq. (11) and $Q[i+1] = Q[i] + q_{i+1} \times 10^{-(i+1)}$.

$$
\left(2k - \frac{10}{9}\right) Q[i] + 10^{-i}\left(\frac{k^2}{10} - \frac{k}{9} + \frac{25}{810}\right) < M_k[i] < \left(2k - \frac{8}{9}\right) Q[i] + 10^{-i}\left(\frac{(k-1)^2}{10} + \frac{k}{9} + \frac{25}{810}\right)
\qquad (12)
$$

Therefore, we determine the comparison multiples $M_k[i]$ as

$$
M_k[i] = \frac{L_k[i] + U_{k-1}[i]}{2} \;\Rightarrow\; M_k[i] = (2k-1)\,Q[i] + 10^{-i}\left(\frac{k^2 + (k-1)^2}{20} + \frac{25}{810}\right)
\qquad (13)
$$
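The step from Eq. (12) to Eq. (13) can be verified numerically. The sketch below rebuilds both sides of Eq. (12) from Eqs. (9) and (11) and checks that the closed form of Eq. (13) is their midpoint. The function names are hypothetical, and the code assumes, as the $+k/9$ term in Eq. (12) suggests, that both bounds are evaluated with $Q[i+1] = Q[i] + k \times 10^{-(i+1)}$; the sample $Q[i] = 0.59$, $i = 2$ is taken from Example 1.

```python
from fractions import Fraction as F

def H_bounds(Q, i):
    """(H_n[i], H_p[i]) for alpha = beta = 5, as in Eq. (11)."""
    s = F(25, 81) * F(1, 10**i)
    return s - F(10, 9) * Q, s + F(10, 9) * Q

def interval_bounds(Q, i, k):
    """Lower and upper bounds on M_k[i], i.e., L_k[i] and U_{k-1}[i] of Eq. (12)."""
    ulp = F(1, 10 ** (i + 1))
    Hn, Hp = H_bounds(Q + k * ulp, i + 1)   # assumption: Q[i+1] uses digit k on both sides
    lower = Hn + 2 * k * Q + k * k * ulp
    upper = Hp + 2 * (k - 1) * Q + (k - 1) ** 2 * ulp
    return lower, upper

def comparison_multiple(Q, i, k):
    """Closed form of Eq. (13)."""
    return (2 * k - 1) * Q + F(1, 10**i) * (F(k * k + (k - 1) ** 2, 20) + F(25, 810))

Q, i = F(59, 100), 2
for k in range(-4, 6):
    lower, upper = interval_bounds(Q, i, k)
    M = comparison_multiple(Q, i, k)
    print(k, float(lower), float(M), float(upper))
```

Placing $M_k[i]$ at the midpoint gives the same error margin on both sides, namely half the width of the overlap between adjacent selection intervals.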

In the RDS we subtract the truncated comparison multiples $\lfloor M_k[i] \rfloor$ from the truncated shifted partial remainder $\lfloor 10w[i] \rfloor$, where the maximum admissible error, for all values of $k$, is defined as $|\varepsilon[i]| < \min[(M_k[i] - L_k[i]), (U_{k-1}[i] - M_k[i])]$. Given that

$$M_k[i] - L_k[i] = U_{k-1}[i] - M_k[i] = \frac{Q[i]}{9} + 10^{-i}\left(\frac{2k+9}{180}\right),$$

we have $|\varepsilon[i]| < \frac{Q[i]}{9} + 10^{-i}\left(\frac{2k+9}{180}\right)$. For $i = 0$ this gives

if $Q[0] = 0$ then $1 \le q_1 = k \le 5 \;\Rightarrow\; |\varepsilon[0]| < \frac{11}{180} = 0.0611\ldots$
if $Q[0] = 1$ then $-5 \le q_1 = k \le 0 \;\Rightarrow\; |\varepsilon[0]| < \frac{1}{9} - \frac{1}{180} = 0.1055\ldots$

For $i \ge 1$, given $0.1 \le Q[i] < 1$, the admissible error of the RDS ($\varepsilon[i]$) is bounded as

$$|\varepsilon[i]| < \frac{1}{90} - \frac{10^{-i}}{180} \;\overset{i \ge 1}{\Longrightarrow}\; |\varepsilon[i]| < 0.01055\ldots \qquad (14)$$

The initial values of the square root recurrence (Eq. (6)) are determined, given the minimally redundant root digit set, as (see footnote 2)

$$Q[0] = q_0 = \begin{cases} 0 & \text{if } 0.01 \le X < 0.3 \\ 1 & \text{if } 0.3 \le X < 1 \end{cases}$$

and $w[0] = X - Q[0]$.

Footnote 2: The upper bound of $X$ with $q_0 = 0$ is $(0.55\ldots)^2 = (5/9)^2 = 25/81 \approx 0.3$.

The proposed decimal square root algorithm is summarized in Fig. 2. Example 1 presents the required steps, based on Fig. 2, to compute a decimal square root.

3.1 Architecture

The most straightforward architecture to implement the recurrence stage of Fig. 2 is shown in Fig. 3, where $K = \frac{k^2 + (k-1)^2}{20} + \frac{25}{810}$. For the sake of a faster design, Fig. 3 can be modified as shown in Fig. 4 such that the next root digit is generated partially in parallel with the partial remainder computation. The details of each constituent block in Fig. 4 are presented in the following.

Initialization: if $X < 0.3$ then $Q[0] = q_0 = 0$ else $Q[0] = q_0 = 1$; $w[0] = X - Q[0]$
Recurrence: For $0 \le i \le n$ do
(1) $M_k[i] = (2k-1)Q[i] + 10^{-i}\left(\frac{k^2 + (k-1)^2}{20} + \frac{25}{810}\right)$ for $-4 \le k \le 5$.
(2) $q_{i+1} = \mathrm{RDS}(\lfloor 10w[i] \rfloor, \lfloor M_k[i] \rfloor) \Rightarrow q_{i+1} = k$ if $\lfloor M_k[i] \rfloor \le \lfloor 10w[i] \rfloor < \lfloor M_{k+1}[i] \rfloor$.
(3) $\hat{Q}[i] = Q[i] \times 2q_{i+1} + (q_{i+1})^2 \times 10^{-(i+1)}$; $Q[i+1] = Q[i] + q_{i+1} \times 10^{-(i+1)}$.
(4) $w[i+1] = 10w[i] - \hat{Q}[i]$.
Termination: Perform the rounding, normalization, and conversion of $Q[n+1]$ to BCD format.

Fig. 2 The proposed decimal square root algorithm

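Example 1 Decimal square root

Initialization: $X = 0.3521986 \Rightarrow Q[0] = q_0 = 1$ and $w[0] = 0.3521986 - 1 = -0.6478014$

Recurrence:
$i = 0$: $-7.00 = \lfloor M_{-4}[0] \rfloor < \lfloor 10w[0] \rfloor = -6.47 < \lfloor M_{-3}[0] \rfloor = -6 \Rightarrow q_1 = -4$; $w[1] = -0.078014$.
$i = 1$: $-1.80 = \lfloor M_{-1}[1] \rfloor < \lfloor 10w[1] \rfloor = -0.78 < \lfloor M_{0}[1] \rfloor = -0.6 \Rightarrow q_2 = -1$; $w[2] = 0.40986$.
$i = 2$: $+2.95 = \lfloor M_{3}[2] \rfloor < \lfloor 10w[2] \rfloor = 4.09 < \lfloor M_{4}[2] \rfloor = 4.14 \Rightarrow q_3 = 3$; $w[3] = 0.5496$.
$i = 3$: $+5.33 = \lfloor M_{5}[3] \rfloor < \lfloor 10w[3] \rfloor = 5.49 \Rightarrow q_4 = 5$; $w[4] = -0.4365$.
$i = 4$: $-5.34 = \lfloor M_{-4}[4] \rfloor < \lfloor 10w[4] \rfloor = -4.36 < \lfloor M_{-3}[4] \rfloor = -4.15 \Rightarrow q_5 = -4$; $w[5] = 0.38284$.
$i = 5$: $+2.96 = \lfloor M_{3}[5] \rfloor < \lfloor 10w[5] \rfloor = 3.8284 < \lfloor M_{4}[5] \rfloor = 4.15 \Rightarrow q_6 = 3$; $w[6] = 0.267631$.
$i = 6$: $+1.78 = \lfloor M_{2}[6] \rfloor < \lfloor 10w[6] \rfloor = 2.67 < \lfloor M_{3}[6] \rfloor = 2.96 \Rightarrow q_7 = 2$; $w[7] = 0.302458$.
$i = 7$: $+2.96 = \lfloor M_{3}[7] \rfloor < \lfloor 10w[7] \rfloor = 3.02 < \lfloor M_{4}[7] \rfloor = 4.15 \Rightarrow q_8 = 3$; $w[8] = -0.536203$.

Termination: $Q[8] = 1.\bar{4}\,\bar{1}\,3\,5\,\bar{4}\,3\,2\,3 \;\overset{\text{Conversion}}{\Longrightarrow}\; Q[8] = 0.59346323 \;\overset{\text{Rounding}}{\Longrightarrow}\; Q = 0.5934632$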
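The steps of Fig. 2 can be sketched in software to reproduce the digit sequence of Example 1. This is a behavioral model under stated assumptions, not the hardware design: it uses exact rational arithmetic instead of truncated redundant operands, it selects the largest $k$ whose comparison multiple does not exceed $10w[i]$ (which matches Eq. (10) when the multiples increase with $k$), and the names `comparison_multiple` and `decimal_sqrt_digits` are hypothetical.

```python
from fractions import Fraction as F

def comparison_multiple(Q, i, k):
    """M_k[i] from Eq. (13)."""
    K = F(k * k + (k - 1) ** 2, 20) + F(25, 810)
    return (2 * k - 1) * Q + F(1, 10**i) * K

def decimal_sqrt_digits(X, iterations):
    """Run the recurrence of Fig. 2; return (signed digits q_0.., final Q, final w)."""
    Q = F(0) if X < F(3, 10) else F(1)      # initialization
    w = X - Q
    digits = [int(Q)]
    for i in range(iterations):
        shifted = 10 * w
        q = -5                               # default when 10w[i] is below M_{-4}[i]
        for k in range(-4, 6):
            if shifted >= comparison_multiple(Q, i, k):
                q = k                        # keep the largest k that qualifies
        ulp = F(1, 10 ** (i + 1))
        Q_hat = 2 * Q * q + q * q * ulp      # step (3)
        w = shifted - Q_hat                  # step (4)
        Q = Q + q * ulp
        digits.append(q)
    return digits, Q, w

digits, Q, w = decimal_sqrt_digits(F(3521986, 10**7), 8)
print(digits)
print(float(Q))
```

Because the comparison here is exact, the model tolerates no selection error; the hardware instead compares truncated values and relies on the error margin of Eq. (14).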

3.1.1 Step 3

This block, according to Fig. 2, is meant to compute $\hat{Q}[i] = Q[i] \times 2q_{i+1} + (q_{i+1})^2 \times 10^{-(i+1)}$, which is performed in three parts.

• Part I: Compute $Q[i] \times 2q_{i+1}$, primarily, as two minimally redundant decimal numbers (i.e., $U_Q + V_Q = Q[i] \times 2q_{i+1}$). For this purpose, the required easy multiples of $Q[i]$ (i.e., $\pm 2Q[i]$, $\pm 4Q[i]$, and $\pm 10Q[i]$) are generated; next, $q_{i+1}$ selects the appropriate multiples to be assigned as $U_Q$ and $V_Q$.
• Part II: Compute $(q_{i+1})^2$ via simple combinational logic. Given the minimally redundant digit set $[-5, 5]$, we have $0 \le 10C + S = (q_{i+1})^2 \le 25$; hence $0 \le C \le 2$ and $-5 \le S \le 5$.
• Part III: Compute $U_Q + V_Q + C$ via a minimally redundant decimal adder [4, 5], where $C$ fits into the adder as the least significant bits of $V_Q$ and $U_Q$, due to their even value.

Figure 5 shows how the aforementioned parts are connected to generate $\hat{Q}[i]$. The decimal redundant adder shown in Fig. 4, which generates the partial remainder, receives two inputs and produces an output, all in the $[-6, 6]$ digit set. The details of this (and other) redundant adders are extensively discussed in [4, 5] and [9].

Fig. 3 The straightforward architecture

Fig. 4 Block diagram of the proposed architecture

Fig. 5 Details of Step 3 in Fig. 4
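Parts I and II above amount to two small decompositions, which can be illustrated as follows. This is only a sketch: the text does not say which pair of easy multiples is assigned to $U_Q$ and $V_Q$ for each digit, so the search order below is an arbitrary assumption, and the function names are hypothetical.

```python
# Coefficients of the easy multiples of Q[i] named in Part I (0 means "not used").
EASY = (0, 2, -2, 4, -4, 10, -10)

def split_double_digit(q):
    """Find coefficients (u, v) from EASY with u + v == 2*q, so U_Q = u*Q and V_Q = v*Q."""
    for u in EASY:
        for v in EASY:
            if u + v == 2 * q:
                return u, v
    raise ValueError("digit outside [-5, 5]")

def split_square(q):
    """Return (C, S) with 10*C + S == q*q, 0 <= C <= 2 and -5 <= S <= 5 (Part II)."""
    for C in range(3):
        S = q * q - 10 * C
        if -5 <= S <= 5:
            return C, S
    raise ValueError("digit outside [-5, 5]")

for q in range(-5, 6):
    print(q, split_double_digit(q), split_square(q))
```

Every digit in $[-5, 5]$ is covered by at most two easy multiples, which is why Part I needs only a selection step and a single two-operand addition.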

3.1.2 Comparison Multiples Generation

This block, according to Fig. 2, is responsible for generating

$$M_k[i+1] = (2k-1)\,Q[i+1] + 10^{-(i+1)} K \quad \text{for } -4 \le k \le 5, \qquad \text{where } K = \frac{k^2 + (k-1)^2}{20} + \frac{25}{810}.$$

For this purpose we first generate $(2k-1)Q[i+1]$ by means of easy multiples of $Q[i+1]$. The required multiples are $\pm Q[i+1]$, $\pm 2Q[i+1]$, $\pm 3Q[i+1]$, and $\pm 10Q[i+1]$, which are combined into 10 interim sums $W_k[i+1]$ as follows, where the addition is performed via a redundant adder whose inputs are in $[-5, 5]$ and $[-6, 6]$ and whose output is in $[-6, 6]$. In essence, $\pm Q[i+1]$, $\pm 2Q[i+1]$, and $\pm 10Q[i+1]$ are generated in a $[-5, 5]$ digit set, while $\pm 3Q[i+1]$ is in $[-6, 6]$.

$W_{-4}[i+1] = -10Q[i+1] + Q[i+1]$
$W_{-3}[i+1] = -10Q[i+1] + 3Q[i+1]$
$W_{-2}[i+1] = -3Q[i+1] - 2Q[i+1]$
$W_{-1}[i+1] = -3Q[i+1]$
$W_{0}[i+1] = -Q[i+1]$
$W_{1}[i+1] = Q[i+1]$
$W_{2}[i+1] = 3Q[i+1]$
$W_{3}[i+1] = 2Q[i+1] + 3Q[i+1]$
$W_{4}[i+1] = 10Q[i+1] + (-3Q[i+1])$
$W_{5}[i+1] = 10Q[i+1] + (-Q[i+1])$
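The list of interim sums can be checked mechanically: each $W_k[i+1]$ must equal $(2k-1)Q[i+1]$. The sketch below encodes the list as a table of multiple coefficients and rebuilds $M_k[i+1]$ from it. The table name `W_TERMS` and the function names are hypothetical, and the arithmetic is exact rather than redundant-digit.

```python
from fractions import Fraction as F

# Coefficients of the easy multiples of Q[i+1] that form each W_k[i+1],
# transcribed from the list of interim sums.
W_TERMS = {
    -4: (-10, 1), -3: (-10, 3), -2: (-3, -2), -1: (-3,), 0: (-1,),
    1: (1,), 2: (3,), 3: (2, 3), 4: (10, -3), 5: (10, -1),
}

def interim_sum(Q_next, k):
    """W_k[i+1], built only from the easy multiples of Q[i+1]."""
    return sum(m * Q_next for m in W_TERMS[k])

def comparison_multiple(Q_next, i_next, k):
    """M_k[i+1] = W_k[i+1] + 10^-(i+1) * K."""
    K = F(k * k + (k - 1) ** 2, 20) + F(25, 810)
    return interim_sum(Q_next, k) + F(1, 10**i_next) * K

Q_next = F(59, 100)   # illustrative value, Q[2] of Example 1
for k in range(-4, 6):
    print(k, float(interim_sum(Q_next, k)), float(comparison_multiple(Q_next, 2, k)))
```

No interim sum needs more than two easy multiples, so a single two-operand redundant addition per $k$ suffices before the constant is added.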

Next, each $W_k[i+1]$ is added to the constant value $10^{-(i+1)}K$ by a redundant decimal adder with $[-6, 6]$ as the digit set. See Fig. 6.

Fig. 6 Comparison multiples generation

3.1.3 Root Digit Selection (RDS)

Given the admissible error of the RDS (Eq. (14)), we require the four most significant digits of the comparison multiples and of the shifted partial remainder to be involved in the RDS. This block is meant to generate the output carries (i.e., $\in \{-1, 0, 1\}$) of the addition $10w[i+1] - M_k[i+1]$. With the purpose of reducing the latency and complexity of this carry-generation block, $(10w[i+1] - M_k[i+1])$ is represented as shown in Fig. 7, where white (black) dots symbolize negative- (positive-) weighted bits. Consequently, only 13 bits of each operand are required to meet the error bounds in Eq. (14) (i.e., $|\varepsilon[i]| < 0.01055\ldots$). Next, the comparison signal is produced, based on the values of the carry bits, to indicate whether $10w[i+1] \ge M_k[i+1]$. Finally, an encoder is used to generate $q_{i+2}$, given the 10 comparison signals. The overall architecture of the recurrence stage is shown in Fig. 8.

Fig. 7 Bit representations used in RDS

3.1.4 Initialization and Termination

In the initialization stage, according to Fig. 2, we need to convert the BCD representation of the radicand to the redundant decimal encoding ($[-6, 6]$) and to compare the most significant digit of the radicand with 3. Next, based on this comparison result, the most significant digit of the redundant radicand takes a value in $\{-1, 0, 1\}$. In the termination stage, some operations are needed to convert the minimally redundant decimal root $Q$ to the final BCD result, namely conversion to BCD, rounding, and normalization. These operations are performed in the standard manner explained in division papers [11], where conversion and normalization are done on-the-fly and the rounding mode is RoundTiesToEven.
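The conversion of the signed-digit root to conventional digits can be illustrated with the generic on-the-fly conversion idea, in which a second digit string holding "value minus one unit in the last place" is kept alongside the result so that a negative digit never triggers a borrow ripple. The sketch below is a software illustration of that general technique, not the circuit of this design or of [11]; the function name is hypothetical, and it assumes the running value stays positive (so the second string is never read before it becomes valid).

```python
def on_the_fly_convert(signed_digits):
    """Convert signed digits (each in [-5, 5], most significant first) to digits 0..9.

    q  holds the conventional digits of the value accumulated so far;
    qm holds the digits of that value minus one unit in the last place.
    """
    first = signed_digits[0]
    q, qm = [first], [first - 1]          # qm is only meaningful when first > 0
    for d in signed_digits[1:]:
        next_q = q + [d] if d >= 0 else qm + [10 + d]
        next_qm = q + [d - 1] if d > 0 else qm + [9 + d]
        q, qm = next_q, next_qm
    return q

# Signed-digit root Q[8] of Example 1: 1.(-4)(-1)35(-4)323
root = on_the_fly_convert([1, -4, -1, 3, 5, -4, 3, 2, 3])
print(root)
```

Each step only appends one digit to one of the two stored strings, which is why the conversion can run in step with the recurrence instead of requiring a final carry-propagate subtraction.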

4 Evaluations

In this section the evaluation results of the proposed architecture, in terms of latency and area, are presented and compared with the fastest previous methods. We synthesized the proposed design with Synopsys Design Compiler using the STM 90 nm CMOS standard library [14] at 1.00 V VDD and 25 °C, for which the FO4 latency (see footnote 3) is 45 ps and the area of a NAND2 is 4.4 µm². According to Fig. 8, the recurrence stage of the proposed square rooter consists of three main parts: comparison multiples generation, partial remainder computation, and RDS. The results show that the critical delay path consists of comparison multiples generation and RDS. Moreover, the number of cycles required for a 16-digit root is 1 + 17 + 1 = 19. Table 1 lists the critical delay path, the number of cycles, and the total latency of the 16-digit proposed decimal square root architecture.

Footnote 3: Fan-out of 4, i.e., the latency of an inverter driving four similar inverters at its output.

Fig. 8 The proposed architecture of the recurrence stage

Table 1 Critical delay path of the proposed design (16 digits)

| | Register | ×(−3) | Adder 1 | Adder 2 | Comparator | Encoder | Cycle time | # of cycles | Total latency |
|---|---|---|---|---|---|---|---|---|---|
| Delay (ns) | 0.17 | 0.22 | 0.51 | 0.54 | 0.26 | 0.10 | 1.80 | 19 | 34.2 |

The area consumption of the proposed 16-digit architecture is evaluated as the sum of the area costs of its constituent parts and is tabulated in Table 2. The cycle time of the proposed decimal square root architecture is 40 FO4 and, for the 16-digit root, the total latency and area are 760 FO4 and about 31,000 NAND2, respectively.

Table 2 Area consumption of the proposed 16-digit architecture (NAND2)

| | Combinational | Registers | Total |
|---|---|---|---|
| Initialization | 276 | 441 | 717 |
| Recurrence | 20,870 | 4,743 | 25,613 |
| Termination | 3,662 | 1,000 | 4,662 |
| Whole design | 24,808 | 6,184 | ≈31K |

The fastest previous pertinent method [12, 13] is based on the Newton–Raphson iterative method, where a decimal FMA is the main building block. This is a floating-point decimal square root unit with a cycle time of 62.22 FO4 [13], requiring 15 cycles to compute a 16-digit root, and thus a total latency of 933.3 FO4. The area of this design is reported as 157,284 NAND2.

For a fair comparison we need to estimate the latency and area of the proposed design for floating-point (FP) computation. In this case, a pre-processing unit is required to convert the radicand from Densely Packed Decimal (DPD) to BCD encoding, determine the number of leading zeros, and normalize the radicand. This pre-processing unit adds one extra cycle to the proposed fixed-point design and consumes about 2378 NAND2 of area [17]. Moreover, a post-processing unit is required to deal with the exponents, handle the exceptions, and convert the BCD root back to the IEEE 754-2008 format. This can be performed in the termination stage with an extra area of 3027 NAND2 [17].

Consequently, using the proposed decimal square root architecture for floating-point computation leads to a total latency of 40 × 20 = 800 FO4 and consumes an area of about 36,400 NAND2. Table 3 compares the proposed design with that of [12, 13] and with the work based on the CORDIC algorithm [10], in terms of latency and area. There is also another work, based on the digit-recurrence algorithm [3], which uses look-up tables, small adders, and multipliers. However, that work is optimized for and implemented on FPGA, and hence no ASIC evaluation results are available for comparison.

Table 3 Comparison of the FP architectures

| | Cycle time (FO4) | # of cycles | Total latency (FO4) | Ratio | Area (NAND2) | Ratio |
|---|---|---|---|---|---|---|
| Proposed | 40.00 | 20 | 800.0 | 1.00 | 36,400 | 1.00 |
| [13] | 62.22 | 15 | 933.3 | 1.16 | 157,284 | 4.32 |
| [10] | 34.63 | 35 | 1211 | 1.51 | 18,826 + 4.5KB | |

According to the comparison results, the proposed design is about 14% faster than the fastest previous method and uses about a quarter of the area. This implies that, due to the high latency and area cost of decimal multipliers, using digit-recurrence algorithms for computing the decimal square root is more efficient.
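The headline figures of this section follow from a few lines of arithmetic, reproduced below as a sanity check. All inputs are the values reported in Tables 1 to 3 and in the text; the variable names are hypothetical, and delays are held in integer picoseconds to avoid floating-point noise in the sum.

```python
FO4_PS = 45                                       # FO4 delay of the 90 nm library, in ps
stage_delays_ps = [170, 220, 510, 540, 260, 100]  # Table 1 critical path, in ps

cycle_ps = sum(stage_delays_ps)                   # 1800 ps = 1.80 ns
cycle_fo4 = cycle_ps // FO4_PS                    # cycle time in FO4

fixed_cycles = 1 + 17 + 1                         # 16-digit fixed-point root
fixed_latency_fo4 = cycle_fo4 * fixed_cycles
fp_latency_fo4 = cycle_fo4 * (fixed_cycles + 1)   # one extra pre-processing cycle

fixed_area = 24808 + 6184                         # combinational + registers (Table 2)
fp_area = fixed_area + 2378 + 3027                # plus pre- and post-processing units

ref_latency_fo4 = 62.22 * 15                      # Newton-Raphson design [13]
ref_area = 157284

speed_advantage = 1 - fp_latency_fo4 / ref_latency_fo4
area_fraction = fp_area / ref_area
print(cycle_fo4, fixed_latency_fo4, fp_latency_fo4, fp_area)
print(round(100 * speed_advantage, 1), round(area_fraction, 2))
```

The computed floating-point area of 36,397 NAND2 is what the text rounds to about 36,400.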

5 Conclusions

This paper proposes a decimal square root architecture based on the SRT algorithm, in which the root is computed iteratively. A root digit is determined by comparing the partial remainder to ten inconstant comparison multiples. The radix-10 minimally redundant representation of the root (i.e., a digit set of $[-5, 5]$) brings about a simple RDS unit. The partial remainder, however, is represented in the digit set $[-6, 6]$ so as to reduce the complexity of the intermediate redundant decimal adders. This endeavor leads to a fixed-point (floating-point) decimal square root architecture with a cycle time of 40 FO4 and an area of about 31K (36K) NAND2. Given the $n + 3$ ($n + 4$) cycles consumed by this design to compute the $n$-digit root, the proposed architecture is 14% faster than the fastest previous work [12, 13] (which is based on the Newton–Raphson algorithm) with about a quarter of the area.

The results of this paper suggest that implementing the decimal square root using digit-recurrence algorithms is more efficient than designs based on functional methods, e.g., Newton–Raphson. The main reason is that functional methods require a decimal multiplication per iteration, hence their high cost in area and delay. A natural next step in this line of research is the design of a combined decimal square root/divider based on the SRT algorithm.

Acknowledgements This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). The authors would like to express their appreciation for the comments of the anonymous reviewers for improving this paper.

References

1. M.F. Cowlishaw, Decimal floating-point: algorithm for computers, in Proceedings of the 16th IEEE Symposium on Computer Arithmetic, June 2003, pp. 104–111
2. L. Eisen et al., IBM POWER6 accelerators: VMX and DFU. IBM J. Res. Dev. 51(6), 663–684 (2007)
3. M.D. Ercegovac, R. McIlhenny, Design and FPGA implementation of radix-10 algorithm for square root with limited precision primitives, in Proceedings of the 43rd Asilomar Conference on Signals, Systems and Computers (2009), pp. 935–939
4. S. Gorgin, G. Jaberipur, Fully redundant decimal arithmetic, in Proceedings of the 19th IEEE Symposium on Computer Arithmetic (2009), pp. 145–152
5. S. Gorgin, G. Jaberipur, A family of signed digit adders, in Proceedings of the 20th IEEE Symposium on Computer Arithmetic (2011), pp. 112–120
6. L. Han, S. Ko, High speed parallel decimal multiplication with redundant internal encodings. IEEE Trans. Comput. 62(5), 956–968 (2013)
7. IEEE Standards Committee, 754-2008 IEEE Standard for Floating-Point Arithmetic, pp. 1–58, August 2008. doi:10.1109/IEEESTD.2008.4610935
8. G. Jaberipur, A. Kaivani, Improving the speed of parallel decimal multiplication. IEEE Trans. Comput. 58(11), 1539–1552 (2009)
9. A. Kaivani, G. Jaberipur, Fully redundant decimal addition and subtraction using stored-unibit encoding. Integration 43(1), 34–41 (2010)
10. A. Kaivani, G. Jaberipur, Decimal CORDIC rotation based on selection by rounding. Comput. J. 54(11), 1798–1809 (2011)
11. T. Lang, A. Nannarelli, A radix-10 digit-recurrence division unit: algorithm and architecture. IEEE Trans. Comput. 56(6), 727–739 (2007)
12. R. Raafat et al., Decimal Floating-Point Square-Root Unit Using Newton–Raphson Iterations. US Patent Application Publication, US 2012/0011182 (2012)
13. SilMinds, DFP Newton–Raphson Square Root Units. IP Core Product Data Sheet, NRDecDiv64/128
14. STMicroelectronics, 90nm CMOS090 Design Platform, 2007
15. A. Vazquez, J. Villalba, E. Antelo, Computation of decimal transcendental functions using the CORDIC algorithm, in Proceedings of the 19th IEEE Symposium on Computer Arithmetic (2009), pp. 179–186
16. A. Vazquez, E. Antelo, P. Montuschi, Improved design of high-performance parallel decimal multipliers. IEEE Trans. Comput. 59(5), 679–693 (2010)
17. L.K. Wang, M.J. Schulte, Decimal floating-point square root using Newton–Raphson iteration, in Proceedings of the 16th International Conference on Application Specific Systems, Architecture and Processors (2005), pp. 309–315