RISC-Linz Research Institute for Symbolic Computation Johannes Kepler University A-4040 Linz, Austria, Europe
Bidirectional Exact Integer Division Werner KRANDICK, Tudor JEBELEAN (August 11, 1994)
RISC-Linz Report Series No. 94-50
Editors: RISC-Linz Faculty: E.S. Blurock, B. Buchberger, C. Carlson, G. Collins, H. Hong, F. Lichtenberger, H. Mayr, P. Paule, J. Pfalzgraf, H. Rolletschek, S. Stifter, F. Winkler.
Supported by: National Science Foundation (FWF), project no. M0135-PHY, Lise-Meitner Stipendium, and project no. P10002.
Published in: Proceedings, PASCO'94, World Scientific Publ. Comp.
Copyright notice: This paper is published elsewhere. As a courtesy to the publisher distribution of the paper is strictly limited.
Proceedings of PASCO'94
© 1994 by World Scientific Publishing Company
Bidirectional Exact Integer Division

Werner Krandick*, Tudor Jebelean†
Research Institute for Symbolic Computation
Johannes Kepler University, A-4040 Linz, Austria

Abstract: Division of integers is called exact if the remainder is zero. We show that the high-order part and the low-order part of the exact quotient can be computed independently from each other. A sequential implementation of this algorithm is up to twice as fast as ordinary exact division and four times as fast as the general classical division algorithm if the dividend is twice as long as the divisor. A shared-memory parallel implementation on two processors gains another factor of two in speed.
Keywords: exact arithmetic, parallel arithmetic
1 Introduction

Division of integers is called exact if the remainder is zero. Exact division arises systematically in exact calculations, e.g. when rational numbers are added or when the primitive part of an integral polynomial is computed. Traditionally, these divisions are performed using a general quotient-remainder algorithm (e.g. Knuth's "Algorithm D" [6]), and then discarding the remainder. The number of digit multiplications required by this algorithm is

$$T_D(m,n) = n(m-n+1),$$

where $m, n$ are the numbers of digits in dividend and divisor, respectively. The amount of work is suggested by the shaded area in Figure 1.

In 1993, Jebelean [4] proposed an algorithm for exact division which determines the quotient digits from right to left and requires only

$$T_J(m,n) = \begin{cases} \frac{m-n+4}{2}(m-n+1) - 1 & \text{if } m \le 2n-1 \\ n\left(m - \frac{3}{2}n + \frac{1}{2}\right) + m & \text{otherwise} \end{cases}$$

digit multiplications (see Corollary 2).

* Supported by the Austrian Science Foundation (Grant M0135-PHY).
† Supported by the Austrian Science Foundation (Grant P10002-PHY).
[Figure 1: Classical division with remainder. Labels: A, B, Q.]

[Figure 2: Bidirectional exact division. Labels: A, B_H, Q_H, B_L, Q_L.]
PASCO'94: First International Symposium on Parallel Symbolic Computation
In 1994, Krandick devised an algorithm for the seemingly unrelated problem of multiple precision floating point division. His method requires typically

$$T_K(m,n) = \begin{cases} n(m-n+1) & \text{if } n \le 3 \\ \frac{m-n+6}{2}(m-n+1) & \text{if } n > 3 \wedge m \le 2n-4 \\ n\left(m - \frac{3}{2}n + \frac{7}{2}\right) - 3 & \text{otherwise} \end{cases}$$

digit multiplications (see Corollary 1). The method is modeled after Knuth's algorithm, but it does not compute the full remainder. Instances where this would lead to an incorrect result are detected by testing a condition sufficient for correctness. The condition is efficiently computable, and it is satisfied with very high probability if dividend and divisor are chosen at random. If, however, the remainder is zero, the condition will not be satisfied. Thus, Krandick's method cannot be used to compute all the digits of the exact quotient.

We will present an algorithm which uses Krandick's method to compute the high-order part of the exact quotient, and Jebelean's method to compute the low-order part. The combined method requires

$$T_{KJ}(m,n) \le \begin{cases} m & \text{if } n = 1 \\ \frac{(m-n)(m-n+11)}{4} + \frac{19}{8} & \text{if } n > 3 \wedge m \le 3n-6 \\ n(m-2n+5) - 5 & \text{otherwise} \end{cases}$$

digit multiplications (see Theorem 1). In particular, if the dividend is twice as long as the divisor our method is almost four times as fast as the traditional method ($T_{KJ}(m,n) \le n(n+11)/4 + 19/8$, see also Figure 2). The high order part $Q_H$ of the quotient is computed using the high order part $B_H$ of the divisor, while the low order part $Q_L$ of the quotient is computed using the low order part $B_L$ of the divisor.

The method is well suited for coarse-grain parallelization, because the two computations are completely independent. When run in parallel on two processors, each processor has to compute at most

$$T^{\parallel}_{KJ}(m,n) \le \begin{cases} \frac{2m+1}{3} & \text{if } n = 1 \\ \frac{n^2(m-n+1) + mn}{2n+1} & \text{if } n = 2, 3 \\ \frac{(m-n+11)(m-n+1)}{8} & \text{if } n > 3 \wedge m \le 3n-6 \\ \frac{n(m-2n+6)}{2} - 3 & \text{if } n > 3 \wedge 3n-6 < m \le 4n-6 \\ \frac{n(m-2n+2)}{2} + \frac{m}{2} & \text{if } n > 3 \wedge 4n-6 < m \end{cases}$$

digit multiplications (see Theorem 2). In particular, if the dividend is twice as long as the divisor, the parallel version of our method is almost eight times as fast as the traditional method ($T^{\parallel}_{KJ}(m,n) \le (n+1)(n+11)/8$).

The idea of combining a high-order algorithm with Jebelean's exact division was first suggested to us by Schönhage, who recently investigated an implementation and applications together with Vetter [10]. In contrast to his approach we use a different method for computing the high-order part of the quotient, we analyze the method in terms of the required number of digit products, we provide detailed empirical data, and we discuss and implement a parallel version of the method.

Coarse level parallelization of long integer division is apparently not treated in the literature. Parallel algorithms for division refer mostly to fixed-point fractions, and are designed at the level of bit processing. One research direction in this area is the theoretical investigation of time and area complexity of division on parallel computing models such as parallel random access memory (PRAM); for a survey see [9], Section 3.7. Research with practical applications is mainly VLSI-oriented (see e.g. [11]), again at bit-level. Word-level algorithms are based on the systolic approach (see e.g. [5]), which is too fine-grained for shared-memory architectures. Our algorithm, although not scalable, is suitable for coarse-grain parallelization on shared-memory machines and it will increase the performance of parallel algebraic algorithms which contain exact division as a subalgorithm.

Under a title quite similar to the title of this paper Vacariu [12] treats the computation of the quotient of fixed-point numbers at bit level. He uses the term "exact quotient" to refer to the representation of the quotient as a periodic fraction. The computation is performed from both directions, on two parallel processors, but it is organized according to a "master-slave" scheme, with communication at each step. In our algorithm only final synchronization is needed.

Section 2 describes Krandick's algorithm, Section 3 reviews Jebelean's method. Section 4 combines the two methods to maximize the performance of a sequential implementation. Section 5 shows how a parallel implementation best combines the two methods. Section 6 compares the new method empirically with the algorithms by Knuth and Jebelean.
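The independence of the two quotient parts can be illustrated with ordinary arbitrary-precision integers. The following is a minimal sketch and not the digit-level algorithms analyzed in this paper: it assumes that the divisor is odd (so that it is invertible modulo a power of two), and the function name `exact_div_bidirectional`, the default radix and the split parameter `l` are illustrative choices. The high-order part comes from a truncating division, the low-order part from a modular multiplication that touches only the low-order digits of both operands.

```python
def exact_div_bidirectional(A, B, l, beta=2**32):
    """Exact quotient A // B, assembled from two independent parts.

    Assumptions (for illustration only): B divides A, B is odd,
    and 1 <= l <= number of beta-digits of the quotient.
    Requires Python 3.8+ for pow() with a negative exponent.
    """
    mod = beta ** l
    # High-order part: floor(Q / beta^l), obtained from the left.
    q_high = A // (B * mod)
    # Low-order part: Q mod beta^l, obtained from the low l digits only.
    q_low = (A % mod) * pow(B % mod, -1, mod) % mod
    return q_high * mod + q_low


Q = 123456789123456789123456789
B = 987654321987654321           # odd divisor
A = Q * B
for l in (1, 2, 3):
    assert exact_div_bidirectional(A, B, l, beta=2**16) == Q
```

Neither of the two assignments reads a value produced by the other, which is the property that allows them to run on separate processors with only a final synchronization.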
2 Computing the high-order part
Given positive integers

$$A = \sum_{i=0}^{m-1} a_i \beta^i, \qquad B = \sum_{j=0}^{n-1} b_j \beta^j \qquad (1)$$

of $\beta$-length $m, n \ge 1$ and with $\beta$-digits $0 \le a_i, b_j < \beta$ and $a_{m-1}, b_{n-1} > 0$, we want to find the $h$ high-order digits of $Q = \lfloor A/B \rfloor$, where $1 \le h \le m-n+1$. Here we consider $Q$ as having $m-n+1$ digits, with the high-order digit being possibly zero. In other words, we want to compute

$$\bar{Q} = \lfloor A/\bar{B} \rfloor, \qquad \bar{B} = \beta^{m-n+1-h} B.$$

[…] 3 and $h$ […] $n-3$. The savings are obtained by suppressing the computation of the remainder. Instead of computing $\bar{Q}$ and $\bar{R}$, with

$$A = \bar{Q}\bar{B} + \bar{R}, \qquad 0 \le \bar{R} < \bar{B},$$

we compute $Q^*$ and $R^*$ such that

$$A = Q^* \otimes \bar{B} + R^*,$$

where $Q^* \otimes \bar{B}$ is a well-defined approximation to $Q^*\bar{B}$. Computing $Q^* \otimes \bar{B}$ requires fewer digit products than computing the exact product $Q^*\bar{B}$. We will define the approximate product and derive an error bound $\varepsilon \ge Q^*\bar{B} - Q^* \otimes \bar{B}$. We will state a condition involving $\varepsilon$ and $R^*$ which will be easy to test and which will imply equality of $Q^*$ and $\bar{Q}$. We will argue that $Q^*$ can be determined in such a way that the sufficient condition will be satisfied with high probability.

Definition 1 Let $A$, $B$ be integers as in (1) with $B$ having $n$ digits. We define the approximate product

$$A \otimes B = \sum_{i+j \ge n-3} a_i b_j \beta^{i+j}.$$

Clearly, for $n = 1, 2$ or $3$, the approximate product coincides with the exact product. In the notation of Krandick and Johnson [8], $A \otimes B$ is expressed via "the short product with respect to $(n-1)$". The deviation from the exact product can be estimated as follows.

Proposition 1 Let $A$, $B$ as in (1) with $B$ having $n$ digits, let

$$\varepsilon = \begin{cases} 0 & \text{if } n = 1, 2 \\ (n-3)\beta^{n-2} & \text{if } n \ge 3. \end{cases}$$

Then

$$A \otimes B \le AB \le A \otimes B + \varepsilon.$$

Proof. The cases $n = 1, 2$ are trivial. By induction on $n \ge 3$,

$$\sum_{i+j < n-3} (\beta-1)^2 \beta^{i+j} \le (n-3)\beta^{n-2} + (2-n)\beta^{n-3} + 1.$$

The right hand side is $\le \varepsilon$. This proves the second inequality. □

The proposed division method uses the approximate product of $Q^*$ and $\bar{B}$. Noting that $\bar{B}$ has $m-h+1$ digits we obtain

$$Q^* \otimes \bar{B} = \sum_{i+j \ge m-h-2} q_i \bar{b}_j \beta^{i+j}, \qquad (2)$$

with $Q^* = \sum_{i=0}^{h-1} q_i \beta^i$, $0 \le q_i < \beta$ and with $\bar{B} = \sum_{j=0}^{m-h} \bar{b}_j \beta^j$, $0 \le \bar{b}_j < \beta$. Furthermore, we have

$$Q^* \otimes \bar{B} \le Q^*\bar{B} \le Q^* \otimes \bar{B} + \varepsilon, \qquad (3)$$

where

$$\varepsilon = \begin{cases} 0 & \text{if } m-h = 0, 1 \\ (m-h-2)\beta^{m-h-1} & \text{if } m-h \ge 2. \end{cases} \qquad (4)$$

We now give a sufficient condition for the success of the method.

Proposition 2 Let $\varepsilon$ as in (4). If

$$\varepsilon \le R^* < \bar{B} \qquad (5)$$

then $Q^* = \bar{Q}$.

Proof. Let $R' = A - Q^*\bar{B}$ and assume (5). Then (3) implies

$$R' = (Q^* \otimes \bar{B} + R^*) - Q^*\bar{B} \ge (Q^* \otimes \bar{B} + \varepsilon) - Q^*\bar{B} \ge 0$$

and

$$R' \le A - Q^* \otimes \bar{B} = R^* < \bar{B}.$$

Hence, $A = Q^*\bar{B} + R'$ with $0 \le R' < \bar{B}$, so $Q^*$ must be the desired quotient $\bar{Q}$. □
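Proposition 1 can be checked numerically. The sketch below is an illustration under assumptions: digit vectors are little-endian Python lists, and the names `value`, `approx_product` and `epsilon` are hypothetical helpers chosen here, not identifiers from the paper. It keeps exactly the digit products $a_i b_j$ with $i + j \ge n - 3$ and compares the result with the exact product.

```python
import random


def value(digits, beta):
    """Integer represented by little-endian digits in radix beta."""
    return sum(d * beta**i for i, d in enumerate(digits))


def approx_product(a, b, beta):
    """Approximate product of Definition 1: keep a_i*b_j only if i + j >= n - 3."""
    n = len(b)
    return sum(ai * bj * beta**(i + j)
               for i, ai in enumerate(a)
               for j, bj in enumerate(b)
               if i + j >= n - 3)


def epsilon(n, beta):
    """Error bound of Proposition 1 for a second operand with n digits."""
    return 0 if n <= 2 else (n - 3) * beta**(n - 2)


rng = random.Random(1994)        # fixed seed so the check is reproducible
beta = 10
for _ in range(1000):
    m, n = rng.randint(1, 8), rng.randint(1, 8)
    a = [rng.randrange(beta) for _ in range(m)]
    b = [rng.randrange(beta) for _ in range(n)]
    exact = value(a, beta) * value(b, beta)
    approx = approx_product(a, b, beta)
    assert approx <= exact <= approx + epsilon(n, beta)
```

Radix 10 is used only to keep the numbers readable; the same check works for any radix $\beta \ge 2$.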
The proposed division method will produce a $Q^*$ such that $A = Q^* \otimes \bar{B} + R^*$. It will be shown that $Q^* = \bar{Q}$ with high probability. Condition (5) will be used to establish $Q^* = \bar{Q}$ with certainty. Therefore, the following question has to be discussed. Given $Q^* = \bar{Q}$, what is the probability that condition (5) is satisfied? Because of (3) we have

$$A - Q^* \otimes \bar{B} \ge A - Q^*\bar{B} \ge A - Q^* \otimes \bar{B} - \varepsilon.$$

Hence, letting $Q^* = \bar{Q}$ in the middle,

$$R^* \ge \bar{R} \ge R^* - \varepsilon$$

or, equivalently,

$$\bar{R} \le R^* \le \bar{R} + \varepsilon.$$

Thus, condition (5) will be satisfied if

$$\varepsilon \le \bar{R} < \bar{B} - \varepsilon. \qquad (6)$$

If all possible values $0, \ldots, \bar{B}-1$ for $\bar{R}$ are equally likely, (6) is true with probability

$$\frac{\bar{B} - 2\varepsilon}{\bar{B}} = 1 - \frac{2(m-h-2)\beta^{m-h-1}}{\bar{B}} \ge 1 - \frac{2(m-h-2)}{\beta}.$$

This value is very close to 1 if $\beta$ is the word size of a computer (e.g. $\beta = 2^{32}$).

The digits of $Q^*$ are determined by Algorithm H in Figure 3 analogously to Knuth's "Algorithm D" ([6], p. 257f). We assume radix $\beta$ to be a power of 2, for example the word size of the computer. We represent $\beta$-digits as words; hence we may refer in step H1 to the number of leading 0-bits of a $\beta$-digit. The normalization is effected by a binary shift which is applied to all digits of $B$, but only to those digits of $A$ that will be needed in step H9. Steps H3 and H4 together subtract $[b_{n-1}, \ldots, b_J] \cdot q_k$ from $[a_i, \ldots, a_{k+J}]$ with $J = \max(0, m-h-2-k)$. In total, the loop subtracts

$$[q_{m-n}, \ldots, q_{m-n-h+1}] \otimes [b_{n-1}, \ldots, b_0, 0, \ldots, 0] = [q_{m-n}, \ldots, q_{m-n-h+1}] \otimes [\bar{b}_{m-h}, \ldots, \bar{b}_{m-n-h+1}, \bar{b}_{m-n-h}, \ldots, \bar{b}_0] = Q^* \otimes \bar{B}$$

from $[a_m, \ldots, a_0]$. This motivates the analysis in Proposition 3. Steps H8 and H9 test for the condition of Proposition 2.

Algorithm H (High-order part of quotient). Let $A, B$ as in (1) with $m$ and $n$ digits, respectively, and let $1 \le h \le m-n+1$. The algorithm will compute the $h$ high-order digits $q_{m-n}, \ldots, q_{m-n-h+1}$ of $\lfloor A/B \rfloor$ or fail. The probability of a failure is very small. We may assume $m \ge n > 1 \wedge (m = n \Rightarrow a_{m-1} \ge b_{n-1})$.

H1. [Normalize.] Set $d \leftarrow$ the number of leading 0-bits in $b_{n-1}$, $I \leftarrow \max(0, m-n-h)$. Then perform the left-shifts $[a_m, a_{m-1}, \ldots, a_I] \leftarrow [a_{m-1}, \ldots, a_I] \cdot 2^d$ and $[b_{n-1}, \ldots, b_0] \leftarrow [b_{n-1}, \ldots, b_0] \cdot 2^d$.
H2. [Initialize loop.] Set $k \leftarrow m-n$, $i \leftarrow m$.
H3. [Calculate quotient digit.] If $a_i = b_{n-1}$, set $q_k \leftarrow \beta - 1$; otherwise set $q_k \leftarrow \lfloor [a_i, a_{i-1}] / b_{n-1} \rfloor$. Subtract $b_{n-1} q_k$ from $[a_i, a_{i-1}]$; then subtract $b_{n-2} q_k$ from $[a_i, a_{i-1}, a_{i-2}]$. If this leaves $a_i$ negative, decrement $q_k$ by 1 and add $[b_{n-1}, b_{n-2}]$ to $[a_i, a_{i-1}, a_{i-2}]$.
H4. [Multiply and subtract.] Set $J \leftarrow \max(0, m-h-2-k)$; then subtract $[b_{n-3}, \ldots, b_J] \cdot q_k$ from $[a_i, \ldots, a_{k+J}]$.
H5. [Test remainder.] If $a_i \ge 0$, go to step H7.
H6. [Add back.] Decrement $q_k$ by 1; add $[b_{n-1}, \ldots, b_J]$ to $[a_i, \ldots, a_{k+J}]$.
H7. [Loop on k.] If $k > m-n-h+1$, decrement $k$ and $i$ by 1 and go back to H3.
H8. [Remainder too small?] If $a_{m-h} = 0 \wedge a_{m-h-1} < n-2$, fail.
H9. [Remainder too large?] If $[a_{m-h}, \ldots, a_{m-n+1-h}] \ge [b_{n-1}, \ldots, b_0]$, fail.

Figure 3: Computing the high-order part of the quotient.

We will now argue that the loop in Algorithm H will produce $\bar{Q}$ with high probability. Knuth shows in exercise 4.3.1.21 of his book [6] that step D3 of his algorithm will fail to supply the correct quotient digit with approximate probability $2/\beta$. This number is very small when $\beta$ is the word size of a computer. For Algorithm H this means that quotient digit $q_{i-n}$ calculated in step H3 will be correct in almost all cases where the three leading digits $a_i$, $a_{i-1}$, $a_{i-2}$ of the current "remainder" are correct. We will use this result in an induction argument to show that all digits $q_{i-n}$ are most likely correct.

We start with the observation that $q_{m-n}$ is correct. The induction argument states that if the quotient digits $q_{m-n}, \ldots, q_{i-n}$ are correct for an $m \ge i \ge m-h+2$, it is highly probable that $[a_i, a_{i-1}, a_{i-2}]$ is correct. Indeed, the probability that the correctness of $q_{m-n}, \ldots, q_{i-n}$ implies the correctness of $[a_i, a_{i-1}, a_{i-2}]$ is smallest for $i = m-h+2$. For $i = m-h+2$ this probability can be bounded by the probability that the first $h+1$ digits of $Q^* \otimes \bar{B}$ agree with the first $h+1$ digits of $Q^*\bar{B}$. This is certainly the case if adding $\varepsilon$ to $Q^* \otimes \bar{B}$ does not produce a carry into the $h+1$ high-order digits. In fact, adding $\varepsilon$ to $Q^* \otimes \bar{B}$ will, except with a probability of less than $\frac{m-h-2}{\beta}$, not even generate a carry into the $h+2$ high-order digits: If $a_{m-h-1} < \beta - (m-h-2)$, $[a_{m-h+1}, a_{m-h}]$ will be correct; assuming that all numbers $0, \ldots, \beta-1$ are equally likely as values of $a_{m-h-1}$ (cf. [7]), the probability that $a_{m-h-1} < \beta - (m-h-2)$ is $1 - \frac{m-h-2}{\beta}$. Finally, since $[a_{m-h+1}, a_{m-h}, a_{m-h-1}]$ deviates from its true value by at most $m-h-2$, even $q_{m-n-h+1}$ will most likely be correct. Thus, for all practical purposes, the loop in Algorithm H will produce $\bar{Q}$ with a probability in the neighborhood of $1 - 1/\beta$.

Proposition 3 The number $\eta(n,h)$ of digit products in formula (2) is

$$\eta(n,h) = \begin{cases} nh & \text{if } n \le 3 \\ \frac{h(h+5)}{2} & \text{if } n > 3 \wedge h \le n-3 \\ \frac{2hn - n^2 + 5n - 6}{2} & \text{if } n > 3 \wedge h > n-3. \end{cases} \qquad (7)$$

Proof. For each index $i = 0, \ldots, h-1$ index $j$ ranges from $\max(0, m-h-2-i)$ to $m-h$, but since $\bar{b}_0 = \ldots = \bar{b}_{m-h-n} = 0$, only the $j \ge m-h-n+1$ have to be considered. Hence the number $\eta_i$ of digit products for a given $i$ is

$$\eta_i = (m-h) + 1 - \max((m-h)-(n-1),\ 0,\ (m-h)-(2+i)) = (m-h) + 1 - \max(0,\ (m-h) - \min(n-1, 2+i)).$$

The expression can be simplified by distinguishing two cases.

1. In case $i \ge n-3$ we have $n-1 \le 2+i$, so $\eta_i = (m-h) - \max(0, (m-h)-(n-1)) + 1$. Noting that $(m-h)-(n-1) \ge 0$ since $h \le m-n+1$, we obtain $\eta_i = (m-h) - ((m-h)-(n-1)) + 1 = n$.

2. In case $i < n-3$ we have $n-1 > 2+i$, so $\eta_i = (m-h) - \max(0, (m-h)-(2+i)) + 1$. Since $i < n-3 \le m-h-2$, the expression evaluates to $\eta_i = (m-h) - ((m-h)-(2+i)) + 1 = 3+i$.

Knowing the $\eta_i$ we can now verify the three cases of the proposition.

1. In case $n \le 3$ we have $n-3 \le 0 \le h-1$, so all $i$ are $\ge n-3$, hence all $\eta_i = n$ and so there are $\eta(n,h) = \sum_{i=0}^{h-1} \eta_i = nh$ digit products.

2. In case $n > 3 \wedge h \le n-3$ we have $0 \le h-1 < n-3$, so all indices $i$ are $< n-3$, hence all $\eta_i = 3+i$ and so there are $\eta(n,h) = \sum_{i=0}^{h-1} (3+i) = \frac{h(h+5)}{2}$ digit products.

3. In case $n > 3 \wedge h > n-3$ we have $0 < n-3 \le h-1$, so there are $\eta(n,h) = \sum_{i=0}^{n-4} (3+i) + \sum_{i=n-3}^{h-1} n = \frac{2hn - n^2 + 5n - 6}{2}$ digit products. □

Corollary 1 Letting $h = m-n+1$ in Proposition 3 we obtain

$$T_K(m,n) = \begin{cases} n(m-n+1) & \text{if } n \le 3 \\ \frac{m-n+6}{2}(m-n+1) & \text{if } n > 3 \wedge m \le 2n-4 \\ n\left(m - \frac{3}{2}n + \frac{7}{2}\right) - 3 & \text{otherwise.} \end{cases}$$

3 Computing the low-order part

Let $A$ and $B$ as in (1), and assume that $B$ divides $A$. Jebelean's exact division algorithm [4] exploits the implication

$$(A'\beta + a_0) = (B'\beta + b_0)(Q'\beta + q_0) \;\Longrightarrow\; a_0 = (b_0 q_0) \bmod \beta.$$

The latter equation can be used to compute $q_0$,

$$q_0 = \left(a_0 \, (b_0)^{-1}_{\bmod \beta}\right) \bmod \beta,$$

provided $\mathrm{GCD}(b_0, \beta) = 1$. When $\beta$ is a power of 2 this condition can be ensured by shifting $A$ and $B$ to the right until $b_0$ becomes odd. After the least-significant quotient digit $q_0$ has been found, $A$ is replaced by

$$(A - q_0 B)/\beta = Q'B,$$

and the procedure is repeated to find $q_1$, and so on. This method is faster than the traditional classical algorithm, because only the $l$ low order digits of the intermediate results $A - q_0 B$ etc. have to be computed in order to determine the $l$ low order digits of the quotient. Algorithm L in Figure 4 takes advantage of this insight. We may assume that the least-significant $\beta$-digit of $B$ is non-zero; indeed, $A$ must have at least as many trailing zero $\beta$-digits as $B$, and common trailing zeros can be deleted without affecting the quotient.

When analyzing Algorithm L, we will not consider the (constant) cost of finding the modular inverse of $b_0$ in step L2. It is shown in [4] that $b_0$ can be inverted using one or two digit multiplications and a table look-up when $\beta$ is a power of 2; the extended Euclidean algorithm need not be applied. In our experiments the modular inverse costs 2.25 digit products.

Algorithm L (Low-order part of quotient). Let $A, B$ be as in (1) with $m$ and $n$ digits, respectively, with $A \bmod B = 0$ and with $b_0 \ne 0$, and let $1 \le l \le m-n+1$. The algorithm will compute the $l$ low-order digits $q_{l-1}, \ldots, q_0$ of $Q = A/B$.

L1. [Right-shift.] Set $d \leftarrow$ the number of trailing 0-bits in $b_0$, and set $L \leftarrow \min(n, l)$. Then perform the right-shifts $[a_{l-1}, \ldots, a_0] \leftarrow [a_l, a_{l-1}, \ldots, a_0]/2^d$ and $[b_{L-1}, \ldots, b_0] \leftarrow [b_L, b_{L-1}, \ldots, b_0]/2^d$.
L2. [Compute modular inverse.] Set $b' \leftarrow (b_0)^{-1}_{\bmod \beta}$.
L3. [Initialize loop.] Set $k \leftarrow 0$.
L4. [Calculate quotient digit.] Set $q_k \leftarrow (b' a_k) \bmod \beta$.
L5. [Test termination.] If $k = l-1$ then STOP.
L6. [Multiply and subtract.] Set $J \leftarrow \min(L, l-k)$; then subtract $[b_{J-1}, \ldots, b_0] \cdot q_k$ from $[a_{l-1}, \ldots, a_k]$.
L7. [Loop.] Increment $k$ and go back to step L4.

Figure 4: Computing the low-order part of the quotient.

Proposition 4 The number $\lambda(n,l)$ of digit products in Algorithm L is

$$\lambda(n,l) = \begin{cases} \frac{l(l+3)}{2} - 1 & \text{if } l \le n \\ l(n+1) - \frac{n^2 - n + 2}{2} & \text{if } l > n. \end{cases} \qquad (8)$$

Proof. Let $L = \min(n, l)$. We will show that

$$\lambda(n,l) = l(L+1) - \frac{L^2 - L + 2}{2}. \qquad (9)$$

From Algorithm L we obtain

$$\lambda(n,l) = l + \sum_{k=0}^{l-2} \min(L, l-k) = l + \sum_{k=0}^{\min(l-2,\, l-L)} L + \sum_{k=\max(0,\, \min(l-2,\, l-L)+1)}^{l-2} (l-k) = l + \sum_{k=0}^{l-\max(2,L)} L + \sum_{k=\max(0,\, l-\max(2,L)+1)}^{l-2} (l-k).$$

1. If $L = 1$ we have $\lambda(n,l) = l + \sum_{k=0}^{l-2} 1 + \sum_{k=l-1}^{l-2} (l-k) = l + (l-1) + 0 = 2l-1$.

2. If $L = 2$ we have $\lambda(n,l) = l + \sum_{k=0}^{l-2} 2 + \sum_{k=l-1}^{l-2} (l-k) = l + 2(l-1) + 0 = 3l-2$.

3. If $L > 2$ we have $\lambda(n,l) = l + \sum_{k=0}^{l-L} L + \sum_{k=l-L+1}^{l-2} (l-k) = l + (l-L+1)L + \frac{(L-2)(L+1)}{2}$.

In all three cases equation (9) is satisfied; and equation (9) implies equation (8). □

Corollary 2 Letting $l = m-n+1$ in Proposition 4 we obtain

$$T_J(m,n) = \begin{cases} \frac{m-n+4}{2}(m-n+1) - 1 & \text{if } m \le 2n-1 \\ n\left(m - \frac{3}{2}n + \frac{1}{2}\right) + m & \text{otherwise.} \end{cases}$$

This result corrects the analysis given in the original paper [4], which did not account for the digit multiplications in step L4 of the algorithm.
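The right-to-left computation and its digit-product count can be sketched at digit level as follows. This is an illustrative rendering of steps L2 to L7 under stated assumptions, not the SACLIB code used in the experiments: digits are little-endian Python lists, the right-shift of step L1 is omitted by requiring that $b_0$ is already coprime to the radix, and the names `to_digits`, `lam` and `low_quotient_digits` are hypothetical. The counter tallies one product in step L4 and $J$ products in step L6, so it can be compared with equation (9).

```python
def to_digits(x, beta):
    """Little-endian digits of a non-negative integer x in radix beta."""
    digits = []
    while x:
        x, d = divmod(x, beta)
        digits.append(d)
    return digits or [0]


def lam(n, l):
    """Digit-product count of Proposition 4, in the form of equation (9)."""
    L = min(n, l)
    return l * (L + 1) - (L * L - L + 2) // 2


def low_quotient_digits(a, b, l, beta):
    """Return (q_0..q_{l-1}, number of digit products) for Q = A/B.

    Assumes B divides A and gcd(b[0], beta) == 1 (step L1 already done).
    Requires Python 3.8+ for pow() with a negative exponent.
    """
    L = min(len(b), l)
    work = (a + [0] * l)[:l]             # only the l low-order digits of A
    b0_inv = pow(b[0], -1, beta)         # step L2
    q, products = [], 0
    for k in range(l):                   # steps L3 and L7
        qk = (b0_inv * work[k]) % beta   # step L4
        products += 1
        q.append(qk)
        if k == l - 1:                   # step L5
            break
        J = min(L, l - k)                # step L6
        borrow = 0
        for j in range(J):
            t = work[k + j] - qk * b[j] - borrow
            products += 1
            work[k + j] = t % beta
            borrow = -(t // beta)
        for idx in range(k + J, l):      # propagate the borrow, truncating at digit l
            if borrow == 0:
                break
            t = work[idx] - borrow
            work[idx] = t % beta
            borrow = -(t // beta)
    return q, products


beta = 256
Q, B = 0x123456789ABC, 0x10F3            # B is odd
a, b = to_digits(Q * B, beta), to_digits(B, beta)
for l in range(1, len(a) - len(b) + 2):
    q, count = low_quotient_digits(a, b, l, beta)
    assert q == (to_digits(Q, beta) + [0] * l)[:l]
    assert count == lam(len(b), l)
```

Because every update is truncated at digit $l$, the work per quotient digit shrinks as $k$ grows, which is where the saving over the classical algorithm comes from.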
4 Sequential exact division

The digits of the exact quotient can be computed sequentially by first using Algorithm H to calculate the high-order part of the quotient and then Algorithm L for the low-order part. This is most efficient when the quotient is split in such a way that the combined number of digit products is minimized.

Definition 2 Let $\eta(n,h)$ as in (7), $\lambda(n,l)$ as in (8), and let $\eta(n,0) = \lambda(n,0) = 0$. Now define

$$T_{KJ}(m,n) = \min_{0 \le h \le m-n+1} \eta(n,h) + \lambda(n, m-n+1-h).$$

In order to avoid a profusion of unproductive case distinctions we only give an upper bound for $T_{KJ}(m,n)$.

Theorem 1

$$T_{KJ}(m,n) \le \begin{cases} m & \text{if } n = 1 \\ \frac{(m-n)(m-n+11)}{4} + \frac{19}{8} & \text{if } n > 3 \wedge m \le 3n-6 \\ n(m-2n+5) - 5 & \text{otherwise.} \end{cases}$$

Proof. Let

$$h = \begin{cases} m & \text{if } n = 1 \\ \left\lfloor \frac{m-n}{2} \right\rfloor & \text{if } n > 3 \wedge m \le 3n-6 \\ m-2n+2 & \text{otherwise,} \end{cases} \qquad (10)$$

and let $l = m-n+1-h$. Now the desired inequality is obtained by bounding $\eta(n,h) + \lambda(n,l)$ from above. For easy application of equations (7) and (8) we distinguish the following cases.

1. In case $n = 1$ the result is straightforward.

2. In case $n > 3 \wedge m \le 3n-6$ let $\tilde{h} = (m-n)/2$ and $\tilde{l} = (m-n+3)/2$. Then $h \le \tilde{h} \le n-3$, $l \le \tilde{l} < n$ and
$$\eta(n,h) + \lambda(n,l) \le \eta(n,\tilde{h}) + \lambda(n,\tilde{l}) = \frac{m-n}{4}(m-n+11) + \frac{19}{8}.$$

3. In case $n > 3 \wedge m = 3n-5$ we have $h = n-3$, $l = n-1$, and $\eta(n,h) + \lambda(n,l) = n^2 - 5 = n(m-2n+5) - 5$.

4. In case $n > 3 \wedge m > 3n-5$ we have $h > n-3$, $l = n-1$ and $\eta(n,h) + \lambda(n,l) = n(m-2n+5) - 5$.

5. In case $n = 2, 3$, $h = m-2n+2$, $l = n-1$ and $\eta(n,h) + \lambda(n,l) = \frac{2mn - 3n^2 + 5n - 4}{2}$. For $n = 2, 3$ this equals $n(m-2n+5) - 5$. □

5 Parallel exact division

The high-order and the low-order part of the exact quotient can be computed by executing Algorithm H and Algorithm L in parallel on two processors. This is most efficient when the quotient is split in such a way that the number of digit products in either algorithm is minimized.

Definition 3 Let $\eta(n,h)$ and $\lambda(n,l)$ as in Definition 2. Define

$$T^{\parallel}_{KJ}(m,n) = \min_{0 \le h \le m-n+1} \max(\eta(n,h),\ \lambda(n, m-n+1-h)).$$

For simplicity we only give an upper bound for $T^{\parallel}_{KJ}(m,n)$.

Theorem 2

$$T^{\parallel}_{KJ}(m,n) \le \begin{cases} \frac{2m+1}{3} & \text{if } n = 1 \\ \frac{n^2(m-n+1) + mn}{2n+1} & \text{if } n = 2, 3 \\ \frac{(m-n+11)(m-n+1)}{8} & \text{if } n > 3 \wedge m \le 3n-6 \\ \frac{n(m-2n+6)}{2} - 3 & \text{if } n > 3 \wedge 3n-6 < m \le 4n-6 \\ \frac{n(m-2n+2)}{2} + \frac{m}{2} & \text{if } n > 3 \wedge 4n-6 < m. \end{cases}$$

Proof. Let

$$h = \begin{cases} \left\lceil \frac{n+1}{2n+1}(m-n) \right\rceil & \text{if } n \le 3 \\ \left\lceil \frac{m-n}{2} \right\rceil & \text{otherwise,} \end{cases} \qquad (11)$$

and let $l = m-n+1-h$.

We first prove the theorem for the case $n \le 3$.

1. The first branch of $\lambda$ in (8) is only relevant if $n = 1$ and $m = 1, 2, 3$ or if $n = 2$ and $m = 2, \ldots, 6$ or if $n = 3$ and $m = 3, \ldots, 9$. In each of these 15 cases the theorem can be verified explicitly.

2. If $m$ and $n$ are such that $\lambda$ is defined by its second branch in (8), we handle the ceiling function in the definition of $h$ by letting
$$\tilde{h} = \frac{(n+1)(m-n) + 2n}{2n+1} \ge h.$$
Since the first branch of $\eta$ in (7) is monotone increasing in $h$ we have
$$\eta(n,h) \le \eta(n,\tilde{h}) = \frac{n^2(m-n+1) + mn}{2n+1}.$$
Furthermore, let
$$\tilde{l} = m-n+1 - \frac{n+1}{2n+1}(m-n) \ge l.$$
Now $\lambda(n,l) \le \lambda(n,\tilde{l})$, where
$$\lambda(n,\tilde{l}) = \frac{n(2mn + 2m - 4n^2 + 3n + 3)}{4n+2}.$$
Now note $\eta(n,\tilde{h}) \le \lambda(n,\tilde{l})$ if $n = 1$, and $\lambda(n,\tilde{l}) \le \eta(n,\tilde{h})$ if $n = 2, 3$.

We now prove the theorem for the case $n > 3$. Here we let

$$\tilde{h} = \frac{m-n+1}{2} \ge h \quad \text{and} \quad \tilde{l} = m-n+1 - \frac{m-n}{2} \ge l.$$

1. In case $m \le 3n-7$ we have $\tilde{h} \le n-3$ and $\tilde{l} < n$; hence the second branch of $\eta$ and the first branch of $\lambda$ have to be used. Thus,
$$\eta(n,h) \le \eta(n,\tilde{h}) = \frac{m-n+11}{8}(m-n+1),$$
$$\lambda(n,l) \le \lambda(n,\tilde{l}) = \frac{m-n}{8}(m-n+10) + 1,$$
and $\eta(n,\tilde{h}) \ge \lambda(n,\tilde{l})$.

2. In case $m = 3n-6$ we have $h = n-3$, $l = n-2$, and the theorem can be verified explicitly.

3. In case $3n-5 \le m \le 3n-2$ we have to use the third branch of $\eta$ and the first branch of $\lambda$. We obtain
$$\eta(n,h) \le \eta(n,\tilde{h}) = \frac{n}{2}(m-2n+6) - 3$$
and
$$\lambda(n,l) \le \lambda(n,\tilde{l}) = \frac{m-n}{8}(m-n+10) + 1.$$
For each $3n-5 \le m \le 3n-2$, $\eta(n,\tilde{h}) \ge \lambda(n,\tilde{l})$.

4. In case $m = 3n-1$ we have $h = l = n$, and $\eta(n,h) > \lambda(n,l)$ can be bounded as in the previous case.

5. In case $m > 3n-1$ we have $h \ge \frac{m-n}{2} > n - \frac{1}{2}$ and $l \ge m-n+1 - \frac{m-n+1}{2} > n$. Using the third branch of $\eta$ and the second branch of $\lambda$ we have
$$\eta(n,h) \le \eta(n,\tilde{h}) = \frac{n}{2}(m-2n+6) - 3$$
and
$$\lambda(n,l) \le \lambda(n,\tilde{l}) = \frac{n}{2}(m-2n+2) + \frac{m}{2}.$$
Now, if $m \le 4n-6$, $\eta(n,\tilde{h}) \ge \lambda(n,\tilde{l})$; if $m > 4n-6$, $\lambda(n,\tilde{l}) > \eta(n,\tilde{h})$. □

6 Experiments

Our method for exact division on two processors will be most useful on a shared-memory machine when invoked by an algebraic algorithm with a higher level of parallelism. When several exact divisions have to be executed in parallel, our method will add another level of parallelism to the program.

We ran a sequential and a parallel implementation of our method on the shared-memory architecture of the Sequent Symmetry. We used the PACLIB environment [3] which combines the computer algebra library SACLIB [2] with the parallel features of the µSystem library [1].

Table 1 lists computing times and computing time ratios for inputs of various lengths. The row heading 20/15 refers to a dividend of 20 words and a divisor of 15 words. The column heading IQR stands for the SACLIB implementation of Knuth's integer quotient-remainder algorithm, Algorithm D; IEQ stands for the SACLIB implementation of Jebelean's integer exact quotient method; Sequential and Parallel refer to a sequential and a parallel implementation of our new method. The sequential implementation splits the quotient as in the proof of Theorem 1; the parallel implementation splits the quotient as in the proof of Theorem 2.

Table 2 has the same structure as Table 1, but instead of the computing time it lists the number of digit products that were computed. Those numbers agree very well with the bounds given in Theorems 1 and 2. The ratios of those numbers with respect to the number of digit products required in the classical algorithm are a measure for the expected speed-up.

The observed speed-up agrees well with the expected speed-up when the quotient is more than 30 words long. When the quotient is shorter, certain linear-time operations are significant. In particular, since PACLIB integers are represented as linked lists, we copy the inputs from lists to arrays and the output from an array to a list.

Surprisingly, the observed speed-up of IEQ and Sequential in the third section of Table 1 exceeds the expectations. This can be explained by noting that IQR and Algorithm H use digit divisions in order to determine the quotient digits. Table 2 counts those digit divisions as digit products, but the true cost of digit division is about 2.5 times the cost of a digit product in the SACLIB implementation we used. Hence the unexpected speed-up is due to the replacement of a linear number of divisions by multiplications.

Finally we note that the parallel algorithm provides a significant speed-up even when the quotient is only 10 words long. In our experiments the efficiency of the parallel implementation exceeds 83% for quotients longer than 25 words and reaches 93% in some cases.

Table 1: Computing times in milliseconds (left) and speed-up ratios (right) with respect to the classical algorithm.

Lengths     IQR      IEQ             Sequential      Parallel
20/15       4.7      1.5    3.1      1.5    3.1      1.5    3.1
40/30       16.1     3.7    4.4      3.4    4.7      2.4    6.7
60/45       34.4     7.2    4.8      5.8    5.9      3.7    9.3
100/75      91.2     17.5   5.2      12.3   7.4      7.4    12.3
150/112     201.5    37.6   5.4      23.7   8.5      13.1   15.4
200/150     351.5    62.6   5.6      37.7   9.3      20.6   17.1

20/10       6.0      3.7    1.6      3.0    2.0      2.3    2.6
40/20       20.6     14.4   1.4      8.1    2.5      4.7    4.4
60/30       44.7     24.2   1.8      14.9   3.0      8.6    5.2
100/50      119.3    61.5   1.9      35.0   3.4      20.0   6.0
150/75      262.5    133.3  2.0      72.1   3.6      40.6   6.5
200/100     462.3    233.4  2.0      122.3  3.8      66.8   6.9

20/5        5.0      4.2    1.2      4.2    1.2      2.9    1.7
40/10       16.4     13.8   1.2      12.6   1.3      7.2    2.3
60/15       34.8     28.8   1.2      25.1   1.4      14.5   2.4
100/25      91.4     75.0   1.2      62.8   1.5      35.1   2.6
150/37      198.1    162.8  1.2      133.0  1.5      72.8   2.7
200/50      351.0    286.9  1.2      229.6  1.5      123.9  2.8
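The split minimizations of Definitions 2 and 3 are easy to evaluate directly. The sketch below is an illustration, with the function names `eta`, `lam`, `t_sequential` and `t_parallel` chosen here for the counts of Propositions 3 and 4 and for the two minima; it is not the code that produced the measurements. It enumerates every split $h$ of the $m-n+1$ quotient digits.

```python
def eta(n, h):
    """Digit products for the h high-order quotient digits (Proposition 3)."""
    if h == 0 or n <= 3:
        return n * h
    if h <= n - 3:
        return h * (h + 5) // 2
    return (2 * h * n - n * n + 5 * n - 6) // 2


def lam(n, l):
    """Digit products for the l low-order quotient digits (Proposition 4)."""
    if l == 0:
        return 0
    L = min(n, l)
    return l * (L + 1) - (L * L - L + 2) // 2


def t_sequential(m, n):
    """Definition 2: minimize the sum of both parts over all splits."""
    total = m - n + 1
    return min(eta(n, h) + lam(n, total - h) for h in range(total + 1))


def t_parallel(m, n):
    """Definition 3: minimize the larger of the two parts over all splits."""
    total = m - n + 1
    return min(max(eta(n, h), lam(n, total - h)) for h in range(total + 1))


for m, n in [(20, 15), (20, 10), (20, 5)]:
    classical = n * (m - n + 1)
    print(f"{m}/{n}: classical {classical}, "
          f"sequential {t_sequential(m, n)}, parallel {t_parallel(m, n)}")
```

For these three input lengths the sketch yields 20 and 12, 51 and 26, and 70 and 37 digit products, which are the Sequential and Parallel entries of Table 2 for the rows 20/15, 20/10 and 20/5.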
Table 2: Count of digit products (left) and expected speed-up ratios (right) with respect to the classical algorithm.

Lengths     IQR      IEQ             Sequential      Parallel
20/15       90       26     3.5      20     4.5      12     7.5
40/30       330      76     4.3      51     6.5      26     12.7
60/45       720      151    4.8      95     7.6      52     13.8
100/75      1950     376    5.2      220    8.9      117    16.7
150/112     4368     818    5.3      457    9.6      229    19.1
200/150     7650     1376   5.6      751    10.2     376    20.3

20/10       110      75     1.5      51     2.2      26     4.2
40/20       420      250    1.7      151    2.8      76     5.5
60/30       930      525    1.8      301    3.1      151    6.2
100/50      2550     1375   1.9      751    3.4      376    6.8
150/75      5700     3000   1.9      1595   3.6      817    7.0
200/100     10100    5250   1.9      2751   3.7      1376   7.3

20/5        80       85     0.9      70     1.1      37     2.2
40/10       310      295    1.1      245    1.3      130    2.4
60/15       690      630    1.1      520    1.3      267    2.6
100/25      1900     1675   1.1      1370   1.4      697    2.7
150/37      4218     3665   1.2      2992   1.4      1514   2.8
200/50      7550     6475   1.2      5245   1.4      2650   2.8

References

[1] P. A. Buhr, H. I. Macdonald, and R. A. Stroobosscher. µSystem Annotated Reference Manual. Version 4.4.1. Technical report, Department of Computer Science, University of Waterloo, Ontario, October 1991.
[2] George E. Collins et al. SACLIB user's guide. Technical Report 93-19, RISC-Linz 1993.
[3] H. Hong et al. PACLIB User Manual. Technical Report 92-32, RISC-Linz 1992.
[4] T. Jebelean. An algorithm for exact division. Journal of Symbolic Computation, 15(2):169-180, February 1993.
[5] T. Jebelean. Systolic algorithms for exact division. Mitteilungen - Gesellschaft für Informatik e. V. Parallel-Algorithmen und Rechnerstrukturen, Nr. 12, July 1993, pp. 40-50.
[6] D. E. Knuth. The Art of Computer Programming, volume 2: Seminumerical Algorithms, 2nd edition. Addison-Wesley 1981.
[7] W. Krandick and J. R. Johnson. Efficient multiprecision floating point multiplication with exact rounding. Technical Report 93-76, RISC-Linz 1993.
[8] W. Krandick and J. R. Johnson. Efficient multiprecision floating point multiplication with optimal directional rounding. In E. Swartzlander, Jr., M. J. Irwin, and G. Jullien, editors, Proceedings of the 11th IEEE Symposium on Computer Arithmetic, Windsor, Ontario, July 1993. IEEE Computer Society Press 1993, pages 228-233.
[9] S. Lakshmivarahan and S. K. Dhall. Analysis and design of parallel algorithms: Arithmetic and matrix problems. McGraw-Hill, 1990.
[10] A. Schönhage and E. Vetter. A new approach to resultant computations and other algorithms with exact division. In Jan van Leeuwen, editor, Proceedings of the 2nd Annual European Symposium on Algorithms, Utrecht, The Netherlands, September 1994, Lecture Notes in Computer Science. Springer-Verlag, 1994. To appear.
[11] E. E. Swartzlander, editor. Computer Arithmetic, volume 1, part IV, and volume 2. IEEE Computer Society Press, 1990.
[12] C. T. Vacariu. Method and symmetrical architecture circuit for performing bidirectional exact division through step by step approximation, in various ways and formats (German). Patent Application 892/92, Österreichisches Patentamt, Vienna, 1992.