Design and Implementation of Long-Digit Karatsuba's Multiplication Algorithm Using Tensor Product Formulation*

Chin-Bou Liu (1), Chua-Huang Huang (1), Chin-Laung Lei (2)

(1) Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan, R.O.C.
(2) Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.

* This work was supported in part by National Science Council, Taiwan, R.O.C. under grant NSC 91-2213-E-035-015.

Abstract

Karatsuba's multiplication algorithm uses three single-digit multiplications to perform one two-digit multiplication. If we apply Karatsuba's multiplier recursively, it takes only $3^n$ single-digit multiplications to multiply a pair of $2^n$-digit numbers, a significant improvement over the $4^n$ single-digit multiplications required by the grade-school multiplier. In this paper, we use tensor product formulation to express Karatsuba's multiplication algorithm in both recursive and iterative form. Karatsuba's algorithm is usually implemented as a recursive program. With the iterative tensor product formula of Karatsuba's algorithm, we can derive an iterative (for-loop) program that performs multiplication of long-digit numbers. Furthermore, the $3^n$ single-digit multiplications can be fully parallelized.

Keywords: tensor product, block recursive algorithm, programming methodology, Karatsuba's multiplication algorithm.

1 Introduction

Many applications of modern science, such as cryptography and the fast Fourier transform, need to compute with large numbers. The numbers used in these applications may run to hundreds or thousands of digits, and multiplying such numbers requires a large amount of time and space. The study of fast multiplication algorithms has therefore long been an important task for mathematicians and computer scientists. In the 1960's, Russian mathematicians discovered a faster way of multiplying two numbers [10]. The technique became attributed to Karatsuba (although his development of the idea is disputed), and so it is called Karatsuba's multiplication.

We use the tensor product (also known as Kronecker product) notation [3] to formulate Karatsuba's multiplication algorithm. The tensor product notation has been used to design and implement block recursive algorithms such as the fast Fourier transform [8, 9], Strassen's matrix multiplication [6, 13], parallel prefix algorithms [2], and Hilbert space-filling curves [14]. Tensor product formulas can be directly translated to computer programs. For different architecture characteristics, such as vector processors, parallel multiprocessors, and distributed-memory multiprocessors, tensor product formulas can be manipulated using appropriate algebraic theorems and then translated to high-performance programs [1, 4]. Tensor product formulas can also be used to specify data allocation and to generate efficient programs for multi-level memory hierarchies, including cache memory, local memory, and external memory [12].

In this paper, tensor product formulas of Karatsuba's multiplication algorithm are developed. Both the recursive form and the iterative form are presented. We also use the iterative formula to generate a C program for Karatsuba's multiplication. The program employs Karatsuba's algorithm to multiply two integers of $2^n$ digits. Karatsuba's multiplication is applied recursively down to $2^k$ digits, and then the grade-school multiplication is used to compute the products of the $2^k$-digit numbers. The resulting program is tested for various combinations of $n$ and $k$, and the execution times are compared with the grade-school multiplication. The results show that Karatsuba's multiplication is much more efficient than the grade-school multiplication and that there exists a cutoff value $2^k$ that gives the best performance.

Many works on the application of Karatsuba's multiplication have been reported since Karatsuba published his algorithm. Shand and Vuillemin applied Karatsuba's multiplication to the optimal radix choice of modular products in RSA cryptography [16]. Hollerbach presented a generalization of Karatsuba's algorithm for multiplying very large numbers [5]. Jebelean developed sequential and parallel versions of Karatsuba's multiplication and implemented them in the PACLIB computer algebra system on a Sequent Symmetry shared-memory multiprocessor architecture [7]. Zimmermann, Lorraine, and Monagan implemented Karatsuba's multiplication for univariate polynomials in the general-purpose computer algebra system MuPAD and used it to alleviate the cost of modular integer arithmetic for large integers in Maple [18].

The paper is organized as follows. In Section 2, we give an overview of the algebraic theory of tensor products. The tensor product formulation of Karatsuba's algorithm is presented in Section 3. Program generation of Karatsuba's multiplier is explained in Section 4. Performance analysis and the experimental results are reported in Section 5. Conclusions and future work are given in Section 6.

2 The Tensor Product Representation

In this section, we give an overview of the tensor product notation and the properties which are used in the formulation of Karatsuba's multiplication algorithm. First, the tensor product is a bilinear operation defined as below.

Definition 2.1 (Tensor Product) Let $A$ and $B$ be $m \times n$ and $p \times q$ matrices, respectively. Then the tensor product of $A$ and $B$, denoted by $A \otimes B$, is the $mp \times nq$ matrix

$$ A \otimes B = \begin{bmatrix} a_{0,0}B & a_{0,1}B & \cdots & a_{0,n-1}B \\ a_{1,0}B & a_{1,1}B & \cdots & a_{1,n-1}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{m-1,0}B & a_{m-1,1}B & \cdots & a_{m-1,n-1}B \end{bmatrix}. $$

If the operands are vector bases, say $e^m_i \otimes e^n_j$, the tensor product is called a tensor basis. According to the definition of the tensor product, the tensor basis $e^m_i \otimes e^n_j$ is equal to $e^{mn}_{in+j}$. The tensor product is a bilinear operation which constructs a "large" matrix from two "small" matrices; therefore, it can be viewed as a matrix operation. The following theorem says that a matrix product of two tensor products is equal to a tensor product of two matrix products. We state the theorem without proof.

Theorem 2.1 (Multiplicative Theorem) Let $A$ be an $m \times n$ matrix, $B$ a $p \times q$ matrix, $C$ an $n \times r$ matrix, and $D$ a $q \times s$ matrix. We have

$$ (A \otimes B)(C \otimes D) = AC \otimes BD. $$

From Theorem 2.1, if two of the matrices are identity matrices, we obtain the following lemma with two special forms.

Lemma 2.1 Let $A$ be an $m \times n$ matrix and $B$ a $p \times q$ matrix. We have

$$ A \otimes B = (I_m \otimes B)(A \otimes I_q) = (A \otimes I_p)(I_n \otimes B). $$

The two special forms are called the parallel form and the vector form. The parallel form and the vector form are tensor products with an identity matrix as the first operand and the second operand, respectively. Let $I_n$ be the $n \times n$ identity matrix and $A$ be a $p \times q$ matrix. The parallel form and the vector form are given as:

$$ I_n \otimes A = \begin{bmatrix} A & 0 & \cdots & 0 \\ 0 & A & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A \end{bmatrix}, \qquad A \otimes I_n = \begin{bmatrix} a_{0,0}I_n & a_{0,1}I_n & \cdots & a_{0,q-1}I_n \\ a_{1,0}I_n & a_{1,1}I_n & \cdots & a_{1,q-1}I_n \\ \vdots & \vdots & \ddots & \vdots \\ a_{p-1,0}I_n & a_{p-1,1}I_n & \cdots & a_{p-1,q-1}I_n \end{bmatrix}. $$

Usually, a tensor product in parallel form is implemented as parallel operations on a multi-processor computer, and a tensor product in vector form is implemented as vector operations on a vector architecture.

We state other properties of tensor products below without proof. Note that $\prod_{i=0}^{n-1} F_i$ denotes the matrix product $F_{n-1} F_{n-2} \cdots F_1 F_0$, because matrix product is not commutative.

1. $(\alpha A) \otimes B = A \otimes (\alpha B) = \alpha (A \otimes B)$

2. $A \otimes (B + C) = (A \otimes B) + (A \otimes C)$

3. $A \otimes (B \otimes C) = (A \otimes B) \otimes C$

4. $(A_1 \otimes \cdots \otimes A_k)(B_1 \otimes \cdots \otimes B_k) = (A_1 B_1 \otimes \cdots \otimes A_k B_k)$

5. $(A_1 \otimes B_1)(A_2 \otimes B_2) \cdots (A_k \otimes B_k) = (A_1 A_2 \cdots A_k) \otimes (B_1 B_2 \cdots B_k)$

6. $\prod_{i=0}^{n-1} (I_m \otimes A_i) = I_m \otimes \prod_{i=0}^{n-1} A_i$

7. $\prod_{i=0}^{n-1} (A_i \otimes I_m) = \Big(\prod_{i=0}^{n-1} A_i\Big) \otimes I_m$
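As a concrete illustration of Definition 2.1, the following C sketch builds the tensor product of two dense matrices stored in row-major order. It is our own illustrative code, not part of the generated programs discussed later; the function name and the storage layout are choices made only for this example.

    #include <stdio.h>

    /* Tensor (Kronecker) product of an m x n matrix A and a p x q matrix B.
     * Matrices are stored in row-major order; C must have room for
     * (m*p) * (n*q) elements.  C[i*p+k][j*q+l] = A[i][j] * B[k][l].        */
    void tensor_product(const double *A, int m, int n,
                        const double *B, int p, int q, double *C)
    {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < p; k++)
                    for (int l = 0; l < q; l++)
                        C[(i*p + k)*(n*q) + (j*q + l)] = A[i*n + j] * B[k*q + l];
    }

    int main(void)
    {
        /* Small example: a (2 x 2) tensor (2 x 2) product gives a 4 x 4 matrix. */
        double A[4] = {1, 2, 3, 4}, B[4] = {0, 1, 1, 0}, C[16];
        tensor_product(A, 2, 2, B, 2, 2, C);
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++)
                printf("%4.0f", C[i*4 + j]);
            printf("\n");
        }
        return 0;
    }

For the 2 x 2 inputs in main, the output is the 4 x 4 block matrix whose $(i, j)$ block is $a_{i,j}B$, exactly as in the definition above.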

3 Karatsuba's Multiplication Algorithm

In 1962, Karatsuba discovered an algorithm for multiplying two numbers using a divide-and-conquer method [10]. The algorithm reduces the computational complexity by using fewer basic multiplications than the grade-school algorithm.

Suppose two $n$-digit radix-$b$ numbers $X$ and $Y$ are expressed as $x_1 b^{n/2} + x_0$ and $y_1 b^{n/2} + y_0$, respectively, where $0 \le x_1, x_0, y_1, y_0 < b^{n/2}$. The grade-school multiplication algorithm computes the product $Z$ of $X$ and $Y$ as below:

$$ Z = X \cdot Y = (x_1 b^{n/2} + x_0)(y_1 b^{n/2} + y_0) = x_1 y_1 b^n + (x_1 y_0 + x_0 y_1) b^{n/2} + x_0 y_0. $$

Karatsuba proposed a multiplication algorithm that computes the product $Z$ of $X$ and $Y$ as below:

$$ Z = X \cdot Y = x_1 y_1 b^n + [(x_1 + x_0)(y_1 + y_0) - x_1 y_1 - x_0 y_0] b^{n/2} + x_0 y_0. $$

If we count the number of products and sums, the grade-school algorithm requires four products $x_1 y_1$, $x_1 y_0$, $x_0 y_1$, and $x_0 y_0$; Karatsuba's algorithm requires three products $x_1 y_1$, $(x_1 + x_0)(y_1 + y_0)$, and $x_0 y_0$. At first glance, Karatsuba's algorithm uses one less multiplication but three more additions than the grade-school algorithm. However, Karatsuba's algorithm gives better complexity than the grade-school algorithm if both are applied recursively down to single digits. To multiply two $n$-digit numbers, the grade-school multiplication algorithm has time complexity $O(n^2)$ and Karatsuba's multiplication algorithm has time complexity $O(n^{\log_2 3})$ [11, 17].

In this paper, we present a tensor product formulation of Karatsuba's multiplication algorithm. We begin with the base case of two-digit number multiplication. Let $X$ and $Y$ be two-digit numbers, and $X \cdot Y = (x_1 b + x_0)(y_1 b + y_0) = x_1 y_1 b^2 + [(x_1 + x_0)(y_1 + y_0) - x_1 y_1 - x_0 y_0] b + x_0 y_0$, where $b$ is the digit radix. Suppose the two digits $x_0$ and $x_1$, and the two digits $y_0$ and $y_1$, are represented as two vectors:

$$ X = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} \quad \text{and} \quad Y = \begin{bmatrix} y_0 \\ y_1 \end{bmatrix}. $$

First, we define $S_a$ as a matrix operation for computing the three terms $x_0$, $x_1$, and $x_0 + x_1$, and $S_b$ as a matrix operation for computing the three terms $y_0$, $y_1$, and $y_0 + y_1$. The definitions of $S_a$ and $S_b$ are given as:

$$ S_a = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \quad \text{and} \quad S_b = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}. $$

Next, we use the notation "$*$" to denote pairwise multiplication of two vectors. Therefore, we have:

$$ S_a X * S_b Y = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} * \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} y_0 \\ y_1 \end{bmatrix} = \begin{bmatrix} x_0 \\ x_1 \\ x_0 + x_1 \end{bmatrix} * \begin{bmatrix} y_0 \\ y_1 \\ y_0 + y_1 \end{bmatrix} = \begin{bmatrix} x_0 y_0 \\ x_1 y_1 \\ (x_0 + x_1)(y_0 + y_1) \end{bmatrix}. $$

The result is the three products that appear in Karatsuba's multiplication algorithm. Then, we need to add (subtract) the three products together to obtain the final result. This step is formulated using the matrix operation $S_{c_1}$:

$$ S_{c_1} = \begin{bmatrix} 1 & 0 & 0 \\ -1 & -1 & 1 \\ 0 & 1 & 0 \end{bmatrix}. $$

To ensure correctness of the result, we have to place the result of $S_{c_1}$ in the proper digit locations. An additional operation $S_{c_2}$ is introduced to place the result of $S_{c_1}$ at the proper positions:

$$ S_{c_2} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & b & 0 \\ 0 & 0 & b^2 \end{bmatrix}. $$

Combining the two operations $S_{c_1}$ and $S_{c_2}$, we define the operational step $S_c$ as:

$$ S_c = S_{c_2} S_{c_1} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & b & 0 \\ 0 & 0 & b^2 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ -1 & -1 & 1 \\ 0 & 1 & 0 \end{bmatrix}. $$

Karatsuba's multiplication algorithm for two-digit numbers is then summarized as:

$$ Z = X \cdot Y = S_c \left[ (S_a X) * (S_b Y) \right]. $$

Note that the input operands $X$ and $Y$ are represented as vectors of length 2 such that each vector element is a single-digit number. After the pairwise multiplication "$*$", the digit length is doubled. The result $Z$ is a vector whose elements are two-digit numbers with placement factors $1$, $b$, and $b^2$. This reflects the fact that the digit length of the final product is up to four.
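To make the two-digit base case concrete, the following C sketch carries out $Z = S_c[(S_a X) * (S_b Y)]$ literally for radix $b = 10$: it forms the three sums, the three pairwise products, and then applies $S_{c_1}$ and $S_{c_2}$. This is our own illustrative code, not the generated program of Section 4, and the variable names are chosen only for the example.

    #include <stdio.h>

    /* Two-digit Karatsuba base case in radix b = 10.
     * X = x1*10 + x0, Y = y1*10 + y0; returns X*Y as an ordinary int. */
    int karatsuba_2digit(int x1, int x0, int y1, int y0)
    {
        const int b = 10;

        /* Sa X and Sb Y: the vectors (x0, x1, x0+x1) and (y0, y1, y0+y1). */
        int ax[3] = { x0, x1, x0 + x1 };
        int by[3] = { y0, y1, y0 + y1 };

        /* Pairwise multiplication "*": the three products. */
        int p[3] = { ax[0] * by[0], ax[1] * by[1], ax[2] * by[2] };

        /* Sc1: combine the products; the middle entry becomes x0*y1 + x1*y0. */
        int c[3] = { p[0], -p[0] - p[1] + p[2], p[1] };

        /* Sc2: placement factors 1, b, b^2. */
        return c[0] + c[1] * b + c[2] * b * b;
    }

    int main(void)
    {
        /* 47 * 83 = 3901 */
        printf("%d\n", karatsuba_2digit(4, 7, 8, 3));
        return 0;
    }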

Karatsuba's algorithm can be applied recursively to multiply long-digit numbers. Let us consider multiplication of $2^n$-digit numbers. We will derive the tensor product formulation of Karatsuba's algorithm for multiplying two $2^n$-digit numbers. A $2^n$-digit number $D = d_{2^n-1} d_{2^n-2} \cdots d_1 d_0$ is represented as a vector of length $2^n$:

$$ D \Longleftrightarrow \begin{bmatrix} d_0 \\ d_1 \\ \vdots \\ d_{2^n-2} \\ d_{2^n-1} \end{bmatrix}. $$

The recursive Karatsuba's algorithm for multiplying $2^n$-digit numbers is then given as:

Algorithm 1 (Recursive Karatsuba's Algorithm) $Z = X \cdot Y = S_c^n (S_a^n X * S_b^n Y)$, where

$$ S_a^n = (I_3 \otimes S_a^{n-1})(S_a \otimes I_{2^{n-1}}), \quad S_b^n = (I_3 \otimes S_b^{n-1})(S_b \otimes I_{2^{n-1}}), \quad S_c^n = (S_{c,n-1} \otimes I_{3^{n-1}})(I_3 \otimes S_c^{n-1}), $$

and $S_a^1 = S_a$, $S_b^1 = S_b$, $S_c^1 = S_c$.

A minor revision of $S_c$ is necessary to adjust the placement factors correctly. This is caused by the fact that $S_{c_2}$ depends on the level of recursion. At a recursive expansion of $S_c^i$, where $1 \le i \le n$, the placement factors are $1$, $b^{2^{i-1}}$, and $b^{2 \cdot 2^{i-1}}$. We use the notation $S_{c,i-1}$ to represent the $S_c$ operation in the recursive expansion of $S_c^i$, whose placement factors are given as:

$$ S_{c_2,j} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & b^{2^j} & 0 \\ 0 & 0 & b^{2^{j+1}} \end{bmatrix}. $$

In the recursive tensor product formula of Karatsuba's algorithm, $S_a^n$ and $S_b^n$ can be interpreted as computing single-digit additions of $X$ and $Y$ in two steps:

1. $S_a \otimes I_{2^{n-1}}$ and $S_b \otimes I_{2^{n-1}}$: applying $S_a$ and $S_b$ to the two input vectors $X$ and $Y$, respectively. Each input vector of $2^n$ elements is divided into two segments of length $2^{n-1}$. The result expands the vector to three segments of $2^{n-1}$ elements each and makes the total vector length $3 \cdot 2^{n-1}$.

2. $I_3 \otimes S_a^{n-1}$ and $I_3 \otimes S_b^{n-1}$: recursively applying $S_a^{n-1}$ and $S_b^{n-1}$ three times to the segments of data obtained from Step 1. The results of $S_a^n X$ and $S_b^n Y$ are two vectors of length $3^n$.

The additions in $S_a^n$ and $S_b^n$ may extend the digit length due to the carry effect. The maximum length of the resulting digits with carries is $n + 1$. In the tensor product formulation, we do not deal with the carry effect; it is handled in program generation. Thus, the pairwise multiplication "$*$" performs $3^n$ pairs of single-digit products. We assume the product of each pair is a two-digit number. Operation $S_c^n$ is interpreted as applying $S_c^{n-1}$ three times, followed by applying $S_{c,n-1}$ to segments of $3^{n-1}$ elements.

The recursive Karatsuba's algorithm can be modified to obtain the following iterative tensor product formula:

Algorithm 2 (Iterative Karatsuba's Algorithm) $Z = X \cdot Y = S_c^n (S_a^n X * S_b^n Y)$, where

$$ S_a^n = \prod_{i=0}^{n-1} (I_{3^i} \otimes S_a \otimes I_{2^{n-i-1}}), \quad S_b^n = \prod_{i=0}^{n-1} (I_{3^i} \otimes S_b \otimes I_{2^{n-i-1}}), \quad S_c^n = \prod_{i=0}^{n-1} (I_{3^{n-i-1}} \otimes S_{c,i} \otimes I_{3^i}). $$

Algorithm 1 and Algorithm 2 apply Karatsuba's algorithm all the way down to single-digit numbers. However, we can combine Karatsuba's algorithm and the grade-school algorithm by applying Karatsuba's algorithm recursively down to $2^k$-digit numbers, $0 \le k < n$, and applying the grade-school algorithm to multiply those $2^k$-digit numbers. This hybrid algorithm is called Karatsuba's algorithm with cutoff number. Let "$*_k$" denote pairwise multiplication of $2^k$-digit numbers.

Algorithm 3 (Karatsuba's Algorithm with Cutoff Number) $Z = X \cdot Y = S_c^{n,k} (S_a^{n,k} X *_k S_b^{n,k} Y)$, where

$$ S_a^{n,k} = \prod_{i=0}^{n-k-1} (I_{3^i} \otimes S_a \otimes I_{2^{n-k-i-1}} \otimes I_{2^k}) = \prod_{i=0}^{n-k-1} (I_{3^i} \otimes S_a \otimes I_{2^{n-i-1}}), $$
$$ S_b^{n,k} = \prod_{i=0}^{n-k-1} (I_{3^i} \otimes S_b \otimes I_{2^{n-k-i-1}} \otimes I_{2^k}) = \prod_{i=0}^{n-k-1} (I_{3^i} \otimes S_b \otimes I_{2^{n-i-1}}), $$
$$ S_c^{n,k} = \prod_{i=0}^{n-k-1} (I_{3^{n-k-i-1}} \otimes S_{c,i} \otimes I_{3^i} \otimes I_{2^k}) = \prod_{i=0}^{n-k-1} (I_{3^{n-k-i-1}} \otimes S_{c,i} \otimes I_{3^i 2^k}). $$

Operations in tensor product formulas can be translated into programming language constructs. Hence, our programming methodology provides a powerful means for generating computer programs of block recursive algorithms. In the next section, we explain program generation of Karatsuba's multiplication algorithm. Algorithm 3 is used to illustrate the rules of program generation, and the generated program is an iterative program containing for loops.
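Before turning to the generated loop-based program, it may help to see the same hybrid idea, Karatsuba recursion down to a cutoff length with grade-school multiplication below it, written as a conventional recursive C sketch. This is our own illustration under simplifying assumptions (radix 10, little-endian digit arrays, operand lengths that are powers of two and at most 64 digits); it is not the tensor-product-generated program of Section 4.

    #include <stdio.h>
    #include <string.h>

    /* Grade-school product of two len-digit numbers (little-endian, radix 10).
     * z must have room for 2*len entries and is fully overwritten. */
    static void gradeschool(const int *x, const int *y, int *z, int len)
    {
        memset(z, 0, 2 * len * sizeof(int));
        for (int i = 0; i < len; i++)
            for (int j = 0; j < len; j++)
                z[i + j] += x[i] * y[j];
    }

    /* Hybrid Karatsuba: recurse on halves, use grade school at the cutoff.
     * len is a power of two; z gets 2*len positions.  Carries are not
     * propagated here, matching the tensor product formulation. */
    static void karatsuba(const int *x, const int *y, int *z, int len, int cutoff)
    {
        if (len <= cutoff) { gradeschool(x, y, z, len); return; }

        int h = len / 2;
        int p0[2 * 64], p1[2 * 64], p2[2 * 64];   /* scratch; assumes len <= 64 */
        int sx[64], sy[64];

        karatsuba(x, y, p0, h, cutoff);            /* x0 * y0                 */
        karatsuba(x + h, y + h, p1, h, cutoff);    /* x1 * y1                 */
        for (int i = 0; i < h; i++) { sx[i] = x[i] + x[i + h]; sy[i] = y[i] + y[i + h]; }
        karatsuba(sx, sy, p2, h, cutoff);          /* (x0 + x1) * (y0 + y1)   */

        memset(z, 0, 2 * len * sizeof(int));
        for (int i = 0; i < 2 * h; i++) {
            z[i]         += p0[i];                 /* placement factor 1      */
            z[i + h]     += p2[i] - p0[i] - p1[i]; /* placement factor b^h    */
            z[i + 2 * h] += p1[i];                 /* placement factor b^(2h) */
        }
    }

    int main(void)
    {
        /* 1234 * 5678 = 7006652, digits stored little-endian, cutoff 2 */
        int x[4] = {4, 3, 2, 1}, y[4] = {8, 7, 6, 5}, z[8];
        karatsuba(x, y, z, 4, 2);
        for (int i = 0; i < 7; i++) { z[i + 1] += z[i] / 10; z[i] %= 10; }  /* carries */
        for (int i = 7; i >= 0; i--) printf("%d", z[i]);
        printf("\n");
        return 0;
    }

As in the generated program of Section 4, a final carry-propagation pass converts the accumulated values in z back into proper decimal digits; the small main above performs this pass before printing.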

4 Program Generation of Karatsuba's Algorithm

We consider multiplication of decimal numbers of $2^n$ digits. A number is represented as an integer array with one digit occupying one array element. Since an input number has $2^n$ digits, it may be stored in an integer array of size $2^n$. Because $S_a^{n,k}$ and $S_b^{n,k}$ expand the number of digits from $2^n$ to $3^{n-k} 2^k$, we allocate two arrays X and Y of length $3^{n-k} 2^k$ for the input vectors $X$ and $Y$ and use only the first $2^n$ elements at initialization.

We write the generated program using the syntax of the C language. First, the formula $Z = X \cdot Y = S_c^{n,k}(S_a^{n,k} X *_k S_b^{n,k} Y)$ is modularized as the following pseudo code:

    Z = Sc^{n,k}(Sa^{n,k} X *_k Sb^{n,k} Y) ==>
      Code(X = Sa^{n,k} X);
      Code(Y = Sb^{n,k} Y);
      Code(Z = X *_k Y);
      Code(Z = Sc^{n,k} Z);
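The array layout described above can be set up with a few lines of C. The sketch below is our own illustration (the helper name and the use of a decimal string as input are assumptions): it stores a $2^n$-digit number into an integer array of length $3^{n-k} 2^k$ in little-endian digit order and zero-pads the remaining elements.

    #include <string.h>

    /* Load the decimal string s (most significant digit first, 2^n digits)
     * into array X of length 3^(n-k) * 2^k, one digit per element,
     * least significant digit first; unused elements are set to 0. */
    static void load_digits(const char *s, int *X, int ndigits, int arraylen)
    {
        memset(X, 0, arraylen * sizeof(int));
        for (int i = 0; i < ndigits; i++)
            X[i] = s[ndigits - 1 - i] - '0';
    }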

We explain program generation of Code(X = Sa^{n,k} X) in detail. According to Algorithm 3, the formula is rewritten to:

    Code(X = Sa^{n,k} X) ==>
      Code(X = prod_{i=0}^{n-k-1} (I_{3^i} ⊗ Sa ⊗ I_{2^{n-i-1}}) X);

In program generation, the matrix product $\prod_{i=0}^{n-k-1}$ is mapped to a for loop as below:

    Code(X = prod_{i=0}^{n-k-1} (I_{3^i} ⊗ Sa ⊗ I_{2^{n-i-1}}) X) ==>
      for (i = 0; i <= n - k - 1; i++) {
        Code(tempX = (I_{3^i} ⊗ Sa ⊗ I_{2^{n-i-1}}) X);
        X = tempX;
      }

The loop body processes the entire array. In each iteration, the resulting data are stored in a temporary array tempX of size $3^{n-k} 2^k$, and the beginning address of tempX is assigned to X at the end of the iteration. Hence, X is the input array of the next iteration. The loop body contains two identity matrices $I_{3^i}$ and $I_{2^{n-i-1}}$. They are also translated to for loops with loop variables i1 and i2:

    Code(tempX = (I_{3^i} ⊗ Sa ⊗ I_{2^{n-i-1}}) X) ==>
      for (i1 = 0; i1 < pow(3, i); i1++) {
        for (i2 = 0; i2 < pow(2, n - i - 1); i2++) {
          Code(tempX = Sa X);
        }
      }

where tempX and X denote the partial array elements used in operation $S_a$. Recall that $S_a$ is a $3 \times 2$ matrix operation:

$$ S_a = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}. $$

Hence, tempX is a vector of three elements and X is a vector of two elements. We generate a sequence of assignment statements for the code of $S_a$:

    Code(tempX = Sa X) ==>
      tempX[q0] = X[p0];
      tempX[q1] = X[p1];
      tempX[q2] = X[p0] + X[p1];

where p0 and p1 are indices of array X, and q0, q1, and q2 are indices of array tempX. The final step is to determine these index functions. We consider the input and output tensor bases of $I_{3^i} \otimes S_a \otimes I_{2^{n-i-1}}$:

$$ \text{input basis:}\quad e^{3^i}_{i_1} \otimes e^{2}_{j} \otimes e^{2^{n-i-1}}_{i_2} = e^{3^i \cdot 2^{n-i}}_{i_1 \cdot 2^{n-i} + j \cdot 2^{n-i-1} + i_2}, $$
$$ \text{output basis:}\quad e^{3^i}_{i_1} \otimes e^{3}_{j} \otimes e^{2^{n-i-1}}_{i_2} = e^{3^{i+1} \cdot 2^{n-i-1}}_{i_1 \cdot 3 \cdot 2^{n-i-1} + j \cdot 2^{n-i-1} + i_2}, $$

where i1 and i2 are the indices corresponding to the loop variables, and j is the index corresponding to the p's and q's, with values {0, 1} and {0, 1, 2}, respectively. Hence, we obtain the array indices as below:

    p0 = i1 * pow(2, n - i) + i2;
    p1 = i1 * pow(2, n - i) + pow(2, n - i - 1) + i2;
    q0 = i1 * 3 * pow(2, n - i - 1) + i2;
    q1 = i1 * 3 * pow(2, n - i - 1) + pow(2, n - i - 1) + i2;
    q2 = i1 * 3 * pow(2, n - i - 1) + 2 * pow(2, n - i - 1) + i2;

We summarize the program of Code(X = Sa^{n,k} X) as below:

    Code(X = Sa^{n,k} X) ==>
      for (i = 0; i <= n - k - 1; i++) {
        for (i1 = 0; i1 < pow(3, i); i1++) {
          for (i2 = 0; i2 < pow(2, n - i - 1); i2++) {
            p0 = i1 * pow(2, n - i) + i2;
            p1 = i1 * pow(2, n - i) + pow(2, n - i - 1) + i2;
            q0 = i1 * 3 * pow(2, n - i - 1) + i2;
            q1 = i1 * 3 * pow(2, n - i - 1) + pow(2, n - i - 1) + i2;
            q2 = i1 * 3 * pow(2, n - i - 1) + 2 * pow(2, n - i - 1) + i2;
            tempX[q0] = X[p0];
            tempX[q1] = X[p1];
            tempX[q2] = X[p0] + X[p1];
          }
        }
        X = tempX;
      }

Note that the index computation can be further simplified by eliminating common terms and accumulating loop induction variables.

The translation result of Code(Y = Sb^{n,k} Y) has the same loop structure as the above program segment. We may implement Sa^{n,k} and Sb^{n,k} in the same nested loop by adding the following assignment statements to the body of the innermost loop:

    tempY[q0] = Y[p0];
    tempY[q1] = Y[p1];
    tempY[q2] = Y[p0] + Y[p1];

and adding Y = tempY at the end of the outermost loop.

We continue by generating code for the pairwise multiplication Code(Z = X *_k Y). The final output basis of Sa^{n,k} and Sb^{n,k} is $e^{3^{n-k}}_{i} \otimes e^{2^k}_{j}$. That is, pairwise multiplication "$*_k$" performs $3^{n-k}$ grade-school multiplications of $2^k$ digits. The $2^k$ digits used in a grade-school multiplication are stored consecutively. The pairwise multiplication is translated into the following program segment:

    Code(Z = X *_k Y) ==>
      for (i = 0; i < pow(3, n - k) * pow(2, k); i += pow(2, k)) {
        for (j1 = 0; j1 < pow(2, k); j1++) {
          for (j2 = 0; j2 < pow(2, k); j2++) {
            Z[2 * i + j1 + j2] += X[i + j1] * Y[i + j2];
          }
        }
      }

The index of the resulting product, Z[2 * i + j1 + j2], has the term 2 * i. This term reflects the fact that the size of a grade-school product is double the size of the operands. Therefore, the size of array Z is $3^{n-k} \cdot 2^{k+1}$.

Program generation of Code(Z = Sc^{n,k} Z) is similar to that of Sa^{n,k}, with additional code generation for the placement factors. We use an integer array P of size $3^{n-k} \cdot 2^{k+1}$ to store the placement factors. All elements of P are initialized to 0. Let the indices of the array elements in the loop body be q0, q1, and q2. The placement factors specified in $S_{c,i}$ are computed as below:

    P[q0] = P[q0];
    P[q1] = P[q1] + pow(2, i);
    P[q2] = P[q2] + pow(2, i + 1);

We show the program segment of Code(Z = Sc^{n,k} Z):

    Code(Z = Sc^{n,k} Z) ==>
      for (i = 0; i <= n - k - 1; i++) {
        for (i1 = 0; i1 < pow(3, n - k - i - 1); i1++) {
          for (i2 = 0; i2 < pow(3, i) * pow(2, k + 1); i2++) {
            q0 = i1 * pow(3, i + 1) * pow(2, k + 1) + i2;
            q1 = i1 * pow(3, i + 1) * pow(2, k + 1) + pow(3, i) * pow(2, k + 1) + i2;
            q2 = i1 * pow(3, i + 1) * pow(2, k + 1) + 2 * pow(3, i) * pow(2, k + 1) + i2;
            tempZ[q0] = Z[q0];
            tempZ[q1] = -Z[q0] - Z[q1] + Z[q2];
            tempZ[q2] = Z[q1];
            P[q0] = P[q0];
            P[q1] = P[q1] + pow(2, i);
            P[q2] = P[q2] + 2 * pow(2, i);
          }
        }
        Z = tempZ;
      }

Some extra work needs to be done. After completion of Sc^{n,k}, the placement factors are used as indirect indexes to place the resulting elements of Z in their exact locations. In Algorithm 3, each basic operation $S_{c,i}$ is applied to a segment of $2^k$ digits, so the indirect index must be multiplied by $2^k$. We use an array R of size $2^{n+1}$ to store the final result. All elements of R are initialized to 0.

    for (i = 0; i < pow(3, n - k) * pow(2, k + 1); ) {
      for (j = 0; j < pow(2, k + 1); j++) {
        R[P[i] * pow(2, k) + j] += Z[i];
        i++;
      }
    }

In the tensor product formulation of Karatsuba's algorithm, we do not consider the carry effect. To ensure that the final result is correct, a carry processing loop is performed at the end:

    for (i = 0; i <= pow(2, n + 1) - 1; i++) {
      R[i + 1] += R[i] / 10;
      R[i] = R[i] % 10;
    }

This concludes program generation of the tensor product formulation of Karatsuba's multiplication algorithm. The program has been implemented, and the experimental results are reported in the next section.

5 Experimental Results

The program is compiled using gcc on a 1.3 GHz Pentium III processor running the FreeBSD operating system. The performance of the program is tested using problem sizes from 8 to 2048 digits. The length of the cutoff number ranges from 2 to 128 digits. The execution times are reported in Table 1 and Table 2. Because the execution time for some problem sizes is quite small, we run the program for multiple trials on small problem sizes and report the average execution time. The execution times are also plotted in Figure 1: Figure 1(a) shows Karatsuba's algorithm with cutoff lengths from 2 to 16 digits together with the grade-school algorithm, and Figure 1(b) shows cutoff lengths from 16 to 128 digits. From Table 1 and Table 2, the grade-school algorithm performs better than Karatsuba's algorithm when the problem size is less than 128 digits, and its performance becomes worse as the problem size grows. We also observe that Karatsuba's algorithm gives the best performance when the length of the cutoff number is 16 digits, independent of the problem size.

Table 1: Execution times (ms) of Karatsuba's Algorithm (1)

    # of digits   Grade school   Cutoff 2   Cutoff 4   Cutoff 8
              8         0.0007     0.0076     0.0069        ---
             16         0.0022     0.0104     0.0095     0.0090
             32         0.0074     0.0194     0.0153     0.0150
             64         0.0265     0.0453     0.0342     0.0337
            128         0.1012     0.1281     0.0900     0.0895
            256         0.3875     0.3683     0.2687     0.2585
            512         1.5406     1.1113     0.7773     0.7664
           1024         6.0789     3.6812     2.3648     2.3031
           2048        24.4601    16.6304     9.9929     8.0265

Table 2: Execution times (ms) of Karatsuba's Algorithm (2)

    # of digits   Cutoff 16   Cutoff 32   Cutoff 64   Cutoff 128
              8         ---         ---         ---          ---
             16         ---         ---         ---          ---
             32      0.0148         ---         ---          ---
             64      0.0343      0.0356         ---          ---
            128      0.0873      0.0954      0.1151          ---
            256      0.2527      0.2769      0.3296       0.4203
            512      0.7472      0.8148      0.9875       1.2468
           1024      2.2539      2.4437      2.9718       3.7367
           2048      6.8500      7.3445      8.9109      11.2242

[Figure 1: Execution times of Karatsuba's algorithm. Both plots show time (ms) versus number of digits (up to 2048). (a) Grade-school algorithm and Karatsuba's algorithm with cutoff lengths 2, 4, 8, and 16. (b) Grade-school algorithm and Karatsuba's algorithm with cutoff lengths 16, 32, 64, and 128.]

6 Conclusions and Future Work

We presented the derivation of recursive and iterative tensor product formulas for Karatsuba's multiplication algorithm. A hybrid algorithm combining Karatsuba's algorithm and the grade-school algorithm is also given. These formulas can be directly used to generate computer programs, and the process of program generation is explained in the paper. The performance of Karatsuba's algorithm is measured by running the program for various problem sizes with different lengths of cutoff numbers.

The generated program uses integer arrays to represent long-digit integers. The size of these integer arrays can be reduced if the tensor product formulas are modified and then used to generate memory-saving programs. Also, the computation of $S_a^n$, $S_b^n$, and $S_c^n$ consists of several copy operations, which can be eliminated. The experimental results presented in this paper are based on a program which is tailored to eliminate copy operations.

The programming methodology based on the tensor product theory is not only powerful for generating software programs; it can also be used to generate hardware description language programs, e.g., in VHDL and Verilog [15]. Hence, it can be employed to design VLSI circuits for fast multipliers. In a software program, a digit is represented as an integer element which can store a number much larger than a digit value; therefore, the carry effect can be postponed until the computation is completed. However, in a VLSI circuit design, a carry must be handled as soon as it occurs. Using the tensor product formula, we can determine the carry effect at every basic operation (an adder or a multiplier) step. This will make the circuit design more feasible and more effective.

In the future, we will extend this methodology to design VLSI circuits for fast multipliers and integrate it with other algorithms to design hardware solutions for cryptography applications.

References

[1] D. L. Dai, S. K. S. Gupta, S. D. Kaushik, J. H. Lu, R. V. Singh, C.-H. Huang, P. Sadayappan, and R. W. Johnson. EXTENT: A portable programming environment for designing and implementing high-performance block-recursive algorithms. In Proceedings of Supercomputing '94, pages 49-58, 1994.

[2] M.-H. Fan, C.-H. Huang, Y.-C. Chung, J.-S. Liu, and J.-Z. Lee. A programming methodology for designing parallel prefix algorithms. In Proceedings of the 2001 International Conference on Parallel Processing, pages 463-470, 2001.

[3] A. Graham. Kronecker Products and Matrix Calculus: With Applications. Ellis Horwood Limited, 1981.

[4] S. K. S. Gupta, C.-H. Huang, P. Sadayappan, and R. W. Johnson. A framework for generating distributed-memory parallel programs for block recursive algorithms. Journal of Parallel and Distributed Computing, 34(2):137-153, 1996.

[5] U. Hollerbach. Fast multiplication & division of very large numbers. In Sci. Math. Research Posting, 1996. http://forum.swarthmore.edu/epigone/sci.math.research/zhouyimpzimp/x1ybdbxz5w4v@forum.swarthmore.edu.

[6] C.-H. Huang, J. R. Johnson, and R. W. Johnson. A tensor product formulation of Strassen's matrix multiplication algorithm. Appl. Math. Letters, 3(3):104-108, 1990.

[7] T. Jebelean. Using the parallel Karatsuba algorithm for long integer multiplication and division. In European Conference on Parallel Processing, pages 26-29, 1997.

[8] J. R. Johnson, R. W. Johnson, D. Rodriguez, and R. Tolimieri. A methodology for designing, modifying and implementing Fourier transform algorithms on various architectures. Circuits Systems Signal Process, 9(4):450-500, 1990.

[9] R. W. Johnson, C.-H. Huang, and J. R. Johnson. Multilinear algebra and parallel programming. The Journal of Supercomputing, 5(2-3):189-217, 1991.

[10] A. Karatsuba and Y. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics-Doklady (English translation), 7(7):595-596, 1962.

[11] D. E. Knuth. The Art of Computer Programming, volume 2 of Computer Science and Information Processing, chapter 4, pages 278-279. Addison-Wesley, 2nd edition, 1981.

[12] B. Kumar, C.-H. Huang, P. Sadayappan, and R. W. Johnson. An algebraic approach to cache memory characterization for block recursive algorithms. In 1994 International Computer Symposium, pages 336-342, 1994.

[13] B. Kumar, C.-H. Huang, P. Sadayappan, and R. W. Johnson. A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction. Scientific Programming, 4(4):275-289, 1995.

[14] S.-Y. Lin, C.-S. Chen, L. Liu, and C.-H. Huang. Tensor product formulation for Hilbert space-filling curves. Technical Report FCU-IECS-2003-001, Dept. of Information Engineering and Computer Science, Feng Chia University, March 2003.

[15] W.-R. Lin, M.-H. Fan, C.-H. Huang, Y.-C. Chung, and D.-S. Chen. Synthesizing VHDL programs from tensor product formulas. In Proceedings of the 2002 VLSI Conference, pages 134-140, 2002.

[16] M. Shand and J. Vuillemin. Fast implementations of RSA cryptography. In Proc. of the 11th IEEE Symposium on Computer Arithmetic, pages 252-259, 1993.

[17] S. Winograd. Arithmetic Complexity of Computations. SIAM, 1980.

[18] P. Zimmermann, I. Lorraine, and M. Monagan. Polynomial factorization challenges. In International Symposium on Symbolic and Algebraic Computation, 1996.
