On the Factor Refinement Principle and its Implementation on Multicore Architectures
(Spine title: Factor Refinement Principle on Multicore Architectures) (Thesis format: Monograph)
by Md. Mohsin Ali Graduate Program in Computer Science
A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science
The School of Graduate and Postdoctoral Studies The University of Western Ontario London, Ontario, Canada
© M. M. Ali 2011
THE UNIVERSITY OF WESTERN ONTARIO THE SCHOOL OF GRADUATE AND POSTDOCTORAL STUDIES CERTIFICATE OF EXAMINATION Supervisor:
Examination committee:
Dr. Marc Moreno Maza
Dr. Éric Schost
Dr. Sheng Yu
Dr. Lex E. Renner
The thesis by Md. Mohsin Ali entitled: On the Factor Refinement Principle and its Implementation on Multicore Architectures is accepted in partial fulfillment of the requirements for the degree of Master of Science
Date
Chair of the Thesis Examination Board
Abstract

The factor refinement principle turns a partial factorization of integers (or polynomials) into a more complete factorization represented by basis elements and exponents, where the basis elements are pairwise coprime. This refinement technique has many applications, such as simplifying systems of polynomial inequations and, more generally, speeding up certain algebraic algorithms by eliminating redundant expressions that may occur during intermediate computations. Successive GCD computations and divisions are used to accomplish this task until all the basis elements are pairwise coprime. Moreover, square-free factorization (the first step of many factorization algorithms) is used to remove repeated patterns from each input element; differentiation, division and GCD computations are required to complete this pre-processing step. Both factor refinement and square-free factorization often rely on plain (quadratic) algorithms for multiplication, but can be improved substantially with asymptotically fast multiplication on sufficiently large input.

In this work, we review the working principles and complexity estimates of factor refinement, for plain arithmetic as well as asymptotically fast arithmetic. Following this review, we design, analyze and implement parallel adaptations of these factor refinement algorithms. We apply several algorithm optimization techniques, such as data locality analysis and subproblem balancing, to fully exploit modern multicore architectures. The Cilk++ implementation of our parallel algorithm, based on the augment refinement principle of Bach, Driscoll and Shallit, achieves linear speedup for input data of sufficiently large size.

Keywords. Factor refinement, Coprime factorization, Square-free factorization, GCD-free basis, Parallelism, Multicore architectures.
Acknowledgments

Firstly, I would like to thank my thesis supervisor, Dr. Marc Moreno Maza of the Department of Computer Science at The University of Western Ontario. He was always ready to help me complete this research work, supported me consistently along the way, and guided me in the right direction whenever he thought I needed it. I am grateful for his excellent support at every step of this research.

Secondly, I would like to thank Dr. Yuzhen Xie of the Department of Computer Science at The University of Western Ontario for helping me successfully complete this research work.

Thirdly, my sincere thanks and appreciation go to all the members of the Ontario Research Centre for Computer Algebra (ORCCA) lab in the Computer Science Department for their invaluable teaching and assistance, and to all the members of my thesis examination committee.

Finally, I would like to thank all of my friends for their consistent encouragement, and to express my heartfelt gratitude to my family for their continued support.
Contents

Certificate of Examination  ii
Abstract  iii
Acknowledgments  iv
Table of Contents  v
List of Algorithms  vii
List of Figures  viii
List of Tables  x

1 Introduction  1

2 Background  3
  2.1 Systems of polynomial equations and inequations  3
  2.2 Square-free factorization  5
  2.3 Fast polynomial evaluation and interpolation  7
    2.3.1 Fast polynomial evaluation  7
    2.3.2 Fast polynomial interpolation  10
  2.4 The fork-join parallelism model  11
    2.4.1 The work law  12
    2.4.2 The span law  13
    2.4.3 Parallelism  13
    2.4.4 Performance bounds  13
    2.4.5 Work, span and parallelism of classical algorithms  14
  2.5 Cache complexity  14
  2.6 Multicore architecture  17
  2.7 Systolic arrays  18
  2.8 Graphics processing units (GPUs)  19
  2.9 Random access machine (RAM) model  20

3 Serial Algorithms for Factor Refinement and GCD-free Basis Computation  21
  3.1 Factor refinement and GCD-free basis  21
  3.2 Quadratic algorithms for factor refinement  22
  3.3 Fast algorithms for GCD-free basis  25

4 Parallel Algorithms for Factor Refinement and GCD-free Basis Computation  29
  4.1 Parallelization based on the naive refinement principle  30
  4.2 Parallelization based on the augment refinement principle  40

5 Implementation Issues  51

6 Experimental Results  55
  6.1 Integers of type int inputs  56
  6.2 Polynomial type inputs  56
  6.3 Integers of type my big int inputs (work in progress)  58

7 Parallel GCD-free Basis Algorithm Based on Subproduct Tree Techniques  62
  7.1 Algorithms and parallelism estimates  62
  7.2 Asymptotic analysis of memory consumption  66
  7.3 Challenges toward an implementation  67

8 Conclusion  68

Curriculum Vitae  72
List of Algorithms

1 Square-free Factorization  7
2 SubproductTree  8
3 TopDownTraverse(f, M_{k,h})  9
4 MultipointEvaluation  10
5 LinearCombination(M_{k,v}, n)  10
6 FastInterpolation  11
7 Refine  23
8 PairRefine  24
9 AugmentRefinement  24
10 multiGcd(f, A)  26
11 pairsOfGcd(A, B)  26
12 gcdFreeBasisSpecialCase(A, B)  27
13 gcdFreeBasis(A)  27
14 GcdOfAllPairsInner(A, B, G)  31
15 GcdOfAllPairs(A, B)  31
16 MergeRefinement(A, E, B, F)  33
17 ParallelFactorRefinement(A)  34
18 PolyRefine(a, e, b, f)  41
19 MergeRefinePolySeq(a, e, B, F)  42
20 MergeRefineTwoSeq(A, E, B, F)  43
21 MergeRefinementDNC(A, E, B, F)  44
22 ParallelFactorRefinementDNC(A)  44
23 parallelPairsOfGcd(A, B)  64
24 parallelGcdFreeBasisSpecialCase(A, B)  65
25 parallelGcdFreeBasis(A)  66
List of Figures

2.1 Subproduct tree for the multipoint evaluation algorithm.  8
2.2 A directed acyclic graph (dag) representing the execution of a multithreaded program. Each vertex represents an instruction while each edge represents a dependency between instructions.  12
2.3 The ideal-cache model.  15
2.4 Scanning an array of n = N elements, with L = B words per cache line.  16
5.1 Demonstration of unpacking and packing.  51
5.2 Balancing polynomials for data traffic during divide-and-conquer.  53
5.3 Balancing polynomials for GCD calculation and division during divide-and-conquer.  53
6.1 Scalability analysis of the augment refinement based parallel factor refinement algorithm for 200,000 int type inputs by Cilkview.  56
6.2 Running time comparisons of the augment refinement based parallel factor refinement algorithm for int type inputs.  57
6.3 Scalability analysis of the naive refinement based parallel factor refinement algorithm for 4,000 dense square-free univariate polynomials by Cilkview.  58
6.4 Running time comparisons of the naive refinement based parallel factor refinement algorithm for dense square-free univariate polynomials.  59
6.5 Scalability analysis of the augment refinement based parallel factor refinement algorithm for 4,000 dense square-free univariate polynomials by Cilkview.  59
6.6 Running time comparisons of the augment refinement based parallel factor refinement algorithm for dense square-free univariate polynomials.  60
6.7 Scalability analysis of the augment refinement based parallel factor refinement algorithm for 4,120 sparse square-free univariate polynomials by Cilkview when the input is already a GCD-free basis.  60
6.8 Running time comparisons of the augment refinement based parallel factor refinement algorithm for sparse square-free univariate polynomials when the input is already a GCD-free basis.  61

List of Tables

2.1 Work, span and parallelism of classical algorithms.  14
Chapter 1

Introduction

Systems of non-linear equations and inequations are of great practical importance in many fields, such as theoretical physics, chemistry, and robotics. Solving such a system means describing the common solutions of the polynomial equations and inequations that define it. Since the number of solutions of such systems grows exponentially with the number of unknowns, this process is hard for both numerical methods and computer algebra methods. The latter scenario, to which this work subscribes, is even more challenging due to the infamous phenomenon of expression swell.

The implementation of non-linear polynomial system solvers is a very active research area. It has been stimulated during the past ten years by two main advances. Firstly, methods for solving such systems have been improved by the use of so-called modular techniques and asymptotically fast polynomial arithmetic. Secondly, the democratization of supercomputing, thanks to hardware acceleration technologies (multicores, general-purpose graphics processing units), creates the opportunity to tackle harder problems.

One central issue in the implementation of polynomial system solvers is the elimination of redundant expressions that frequently occur during intermediate computations, with both numerical methods and computer algebra methods. Factor refinement, also known as coprime factorization, is a popular technique for removing repeated factors among several polynomials, while square-free factorization is used to remove repeated factors within a given polynomial. Coprime factorization has many applications in number theory and polynomial algebra; see the papers [8, 9] and the web site http://cr.yp.to/coprimes.html by Bernstein.

Algorithms for coprime factorization have generally been designed with algebraic complexity as the complexity measure to optimize. However, with the evolution of
computer architecture, parallelism and data locality are becoming major measures of performance, in addition to the traditional ones, namely serial running time and allocated space. This compels us to revisit many fundamental algorithms in computer science, such as those for coprime factorization.

In this thesis, we revisit the factor refinement algorithms of Bach, Driscoll and Shallit [6] based on quadratic arithmetic. We show that their augment refinement principle leads to a highly efficient algorithm in terms of parallelism and data locality. Our approach takes advantage of the paper “Parallel Computation of the Minimal Elements of a Poset” [19] by Leiserson, Li, Moreno Maza and Xie. Our theoretical results are confirmed by experimentation based on a multicore implementation in the Cilk++ concurrency platform [10, 14, 20, 22, 26]. For problems on integers, we use the GMP library [1], while for problems on polynomials, we use the “Basic Polynomial Algebra Subroutines (BPAS)” library [30].

We also revisit the GCD-free basis algorithm of Dahan, Moreno Maza, Schost and Xie [15]. Their algorithm, which is serial, is nearly optimal in terms of algebraic complexity. However, our theoretical study, combined with experimental results from the literature [13, 29], suggests that this algorithm is not appropriate for a parallelization targeting multicore architectures.

This thesis is organized as follows. Chapter 2 introduces the background material of this research. Chapter 3 then presents existing serial algorithms for factor refinement and GCD-free basis computation, with their working principles as well as complexity estimates. It is out of the scope of this thesis to give an exhaustive description of the vast amount of previous work; we only provide details for the algorithms that we will use. Chapter 4 describes parallel factor refinement and GCD-free basis algorithms in detail, with estimates of work, span, parallelism and cache complexity. Some implementation issues of the parallel algorithms are presented in Chapter 5. Chapter 6 contains our experimental results, performance analysis and benchmarking of the parallel algorithms. The application of subproduct tree techniques to the parallel computation of GCD-free bases is presented in Chapter 7, with some analysis as well as a discussion of the challenges toward an implementation. Finally, the conclusion of this thesis and future work appear in Chapter 8.
Chapter 2

Background

This chapter gathers background material which is used throughout this thesis. Section 2.1 is a brief introduction to polynomial system solving, which motivates the work reported in this thesis. Section 2.2 is a review of the notion of square-free factorization, which is one of the two “simplification techniques” for polynomial systems that are discussed in this thesis. The other one is coprime factorization, to which the rest of this document is dedicated and which is introduced in Chapter 3. Section 2.3 is a review of asymptotically fast algorithms for univariate polynomial evaluation and interpolation. It follows closely the presentation in Chapter 10 of [33]. Sections 2.4 and 2.5 describe our two models of computation: the fork-join parallelism model and the idealized cache model, which were proposed in the papers [21, 22]. Here, we follow closely the lecture notes of the UWO course CS9624-4435 available at http://www.csd.uwo.ca/~moreno/CS9624-4435-1011.html.
2.1 Systems of polynomial equations and inequations
A system of polynomial equations and inequations is defined as

    f_1(x) = 0
        ⋮
    f_m(x) = 0
    h_1(x) ≠ 0
        ⋮
    h_s(x) ≠ 0        (2.1)
where x = (x_1, x_2, ..., x_n) refers to n variables, and f_1, ..., f_m, h_1, ..., h_s are multivariate polynomials in x with coefficients in a field K. Let L be a field extending K. For instance, K and L could be the fields of real and complex numbers, respectively. A solution over L of System (2.1) is a tuple z = (z_1, z_2, ..., z_n) with coordinates in L such that we have

    f_1(z) = f_2(z) = ··· = f_m(z) = 0  and  h_1(z) ≠ 0 and ··· and h_s(z) ≠ 0.    (2.2)
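Condition (2.2) can be checked mechanically. The following is a tiny Python sketch of that membership test; the two-variable system used here (f_1 = x_1 + x_2 − 3, h_1 = x_1 − x_2) is a hypothetical example chosen for illustration, not one taken from the thesis.

```python
# Equations f_i(x) = 0 and inequations h_j(x) != 0 of a toy system.
fs = [lambda x1, x2: x1 + x2 - 3]   # f1 must vanish at a solution
hs = [lambda x1, x2: x1 - x2]       # h1 must NOT vanish at a solution

def is_solution(z, fs, hs):
    """A tuple z solves the system iff every f vanishes at z and
    no h vanishes at z, exactly as in condition (2.2)."""
    return all(f(*z) == 0 for f in fs) and all(h(*z) != 0 for h in hs)

print(is_solution((1, 2), fs, hs))      # True:  f1 = 0 and h1 = -1 != 0
print(is_solution((1.5, 1.5), fs, hs))  # False: h1 vanishes
```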
A first difference between the case of linear equations and that of polynomial equations is the fact that a non-linear system may have a number of solutions which is both finite and greater than one. Actually, this is often the case in practice, in particular for square systems, that is, for systems with as many equations as variables. When finite, this number of solutions, counted with multiplicities, can be as large as d^n, where d is the common total degree of f_1, f_2, ..., f_m. This fact leads to another difference: “describing” the solutions of System (2.1) is a process that requires a number of arithmetic operations in L which is (at least) exponential in n, the number of variables.

There are two different ways of computing the solutions of System (2.1). The first one relies on so-called numerical methods, which compute numerical approximations of the possible real or complex solutions. These methods have the advantage of being quite fast in practice while flexible in accuracy, thanks to iterative schemes. However, due to rounding errors, numerical methods are inherently inexact and often unstable in unpredictable ways; sometimes they do not find all the solutions, and they have difficulties with overdetermined systems. Some of these approaches are presented in [16, 17, 28].

The alternative way is to represent the solutions symbolically. More precisely, the solution set is decomposed into so-called components such that each component is represented by a polynomial system of a special kind, called a regular chain [5, 25, 27], which possesses a triangular shape and remarkable properties. For this reason, such a decomposition is called a triangular decomposition of the input system. Methods computing triangular decompositions are exact by nature. However, they are in practice more costly than numerical methods and, as a result, applicable only to relatively “small” input systems.
Nevertheless, they are applicable to any polynomial system of any dimension, and for zero-dimensional systems they can provide the precise number of complex solutions (counted with multiplicities) and specify which solutions have real coordinates. The practical efficiency as well as the theoretical time complexity of triangular
decompositions depend on various techniques for removing superfluous expressions (components, polynomial factors, . . . ) during the computations. For instance, if a, b, c are three polynomials which satisfy the following system of inequations

    a b ≠ 0,  b c ≠ 0,  c a ≠ 0,

one may want to replace it simply by

    a ≠ 0,  b ≠ 0,  c ≠ 0,

since the two systems are equivalent and the second one is simpler. As a result, algorithms which perform this type of simplification are very important. There are two well-developed techniques for simplifying systems of equations and inequations: one is square-free factorization [7, 35] and the other is coprime factorization [6, 15]. The former is described in the next section and the latter in the next chapter.
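To make the coprime factorization idea concrete before its formal treatment in Chapter 3, here is a minimal Python sketch of naive factor refinement on integers: repeated GCD-and-divide steps until the basis is pairwise coprime. This is only an illustration of the refinement principle, not the thesis's Cilk++/GMP implementation; the function name and the quadratic pairwise strategy are choices made here.

```python
from math import gcd

def factor_refine(nums):
    """Turn a list of integers > 1 into a coprime basis with exponents:
    the product of base**exp over the result equals the product of the
    inputs, and the bases are pairwise coprime (not necessarily prime)."""
    pairs = [[n, 1] for n in nums if n > 1]   # [base, exponent]
    changed = True
    while changed:
        changed = False
        for i in range(len(pairs)):
            for j in range(i + 1, len(pairs)):
                a, e = pairs[i]
                b, f = pairs[j]
                d = gcd(a, b)
                if d > 1:
                    # a^e * b^f == (a/d)^e * (b/d)^f * d^(e+f); the product
                    # of the base values strictly decreases, so this halts.
                    pairs[i][0] = a // d
                    pairs[j][0] = b // d
                    pairs.append([d, e + f])
                    changed = True
        pairs = [p for p in pairs if p[0] > 1]   # drop trivial bases
    return sorted((a, e) for a, e in pairs)

print(factor_refine([12, 18]))      # [(2, 3), (3, 3)]
print(factor_refine([4, 9]))        # already coprime: [(4, 1), (9, 1)]
```

Note that a GCD-free basis need not be a prime factorization: 4 and 9 above stay composite, since refinement only exposes the common factors of the inputs.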
2.2 Square-free factorization
Square-free factorization is a first step in many factorization algorithms. It factors a non-square-free polynomial into square-free factors that are relatively prime. It can separate factors of different multiplicities, but not factors with the same multiplicity. Formal definitions related to square-free factorization are summarized hereafter. For a more complete presentation, please refer to Chapter 14 in [33].

Definition 1. Let D be a UFD (unique factorization domain) and P ∈ D[x] be a non-constant univariate polynomial. The polynomial P is said to be square-free if for every non-constant polynomial Q ∈ D[x], the polynomial Q^2 does not divide P.

If D is a field and P ∈ D[x] is square-free, it is easy to show that there exist pairwise distinct monic irreducible polynomials P_1, P_2, ..., P_k such that their product equals P/lc(P), where lc(P) denotes the leading coefficient of P. We observe that P_1, P_2, ..., P_k are necessarily pairwise coprime, that is, for all P_i, P_j with 1 ≤ i < j ≤ k, there exist A_i, A_j ∈ D[x] such that A_i P_i + A_j P_j = 1 holds. Since D is a field, by the (extended) Euclidean Algorithm, this latter property is equivalent to the fact that for all P_i, P_j with 1 ≤ i < j ≤ k, the polynomials P_i and P_j have no common factors other than elements of D.
Definition 2. Let K be a field and P ∈ K[x] a non-constant univariate polynomial. A square-free factorization of P consists of pairwise coprime polynomials P_1, P_2, ..., P_k such that we have P = P_1 · P_2^2 ⋯ P_k^k and each P_i is either 1 or a square-free non-constant polynomial.

It is easy to see that a square-free factorization of P is obtained from the irreducible factors F_1, ..., F_e of P by taking for P_i the product of those F_j such that F_j^i divides P but F_j^{i+1} does not. We illustrate the above definitions with a few examples.

Example 1. Let P = (x + 1) ∈ K[x]. From Definition 1, the polynomial P is square-free, because there is no non-constant polynomial Q ∈ K[x] such that Q^2 divides P. On the other hand, let P = x^2 + 4x + 4 ∈ K[x]. From Definition 1, this polynomial P is not square-free, because, for Q = (x + 2) ∈ K[x], the polynomial Q^2 divides P.

Example 2. Let P = x^15 + 55x^14 + 1400x^13 + 21868x^12 + 234290x^11 + 1822678x^10 + 10629552x^9 + 47283632x^8 + 161614309x^7 + 424015067x^6 + 845928448x^5 + 1258456700x^4 + 1348952000x^3 + 981360000x^2 + 432000000x + 86400000 ∈ R[x], where R is the field of real numbers. Then, a square-free factorization of P is given by P = (x + 1)(x + 2)^2(x + 3)^3(x + 4)^4(x + 5)^5, where we have P_1 = (x + 1), P_2 = (x + 2), P_3 = (x + 3), P_4 = (x + 4), and P_5 = (x + 5).

There are several well-known algorithms for computing the square-free factorization of a univariate polynomial over a field (and, more generally, of multivariate polynomials over a field). For instance, Yun proposed an algorithm for computing the square-free factorization of univariate polynomials over a field of characteristic zero in 1976 [35]. Yun's algorithm is a basic building block for other polynomial factorization algorithms and is described in Algorithm 1. An asymptotic upper bound on the algebraic complexity of this algorithm is stated in the following proposition.

Proposition 1.
The cost (number of bit operations) of Algorithm 1 is O(k^4 (n^2 d + n d^2)), where d = deg(P_i) (thus assuming that all P_i have the same degree), n is the maximum bit size of a coefficient in P_1, P_2, ..., P_k, and the algorithm used for computing GCDs is a quadratic-time algorithm (say, the Euclidean Algorithm). On the other hand, if M(d) is a multiplication time (that is, a running time estimate, in terms of field operations, for multiplying two polynomials in K[x] of degree less than d), then the cost becomes O(k^2 M(d) log(d)) operations in K, where M(d) ∈ O(d log(d) log log(d)) for FFT-based multiplication.
Algorithm 1: Square-free Factorization
Input: A polynomial P ∈ K[x], where K is a field of characteristic zero.
Output: A square-free factorization of P, that is, pairwise coprime square-free polynomials P_1, P_2, ..., P_k such that P = P_1 · P_2^2 ⋯ P_k^k holds in K[x].

1: G ← gcd(P, d/dx P)         /* G = P_2 P_3^2 ⋯ P_k^{k−1} */
2: C_1 ← P / G                /* C_1 = P_1 P_2 ⋯ P_k */
3: D_1 ← (d/dx P)/G − d/dx C_1
   /* D_1 = P_1 (d/dx P_2) ⋯ P_k + 2 P_1 P_2 (d/dx P_3) ⋯ P_k + ⋯ + (k−1) P_1 P_2 ⋯ (d/dx P_k) */
4: for i = 1, step 1, until C_i = 1 do
5:    P_i ← gcd(C_i, D_i)     /* C_i = P_i P_{i+1} ⋯ P_k */
6:    C_{i+1} ← C_i / P_i
      /* D_i = P_i ((d/dx P_{i+1}) ⋯ P_k + 2 P_{i+1} (d/dx P_{i+2}) ⋯ P_k + ⋯ + (k−i) P_{i+1} ⋯ (d/dx P_k)) */
7:    D_{i+1} ← D_i / P_i − d/dx C_{i+1}
8: return (P_1, P_2, ..., P_k)
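As a cross-check of Algorithm 1, here is a small Python transcription of Yun's method over the rationals. It is a sketch for illustration only: dense coefficient lists (lowest degree first), naive Euclidean GCD, and exact Fraction arithmetic; the helper names are chosen here and the thesis's actual implementations (BPAS, GMP) are not reflected.

```python
from fractions import Fraction

# Dense univariate polynomials over Q: coefficient lists, low degree first.

def trim(p):                       # drop trailing zero coefficients
    while len(p) > 1 and p[-1] == 0:
        p.pop()
    return p

def deriv(p):                      # formal derivative d/dx p
    d = [i * p[i] for i in range(1, len(p))]
    return trim(d) if d else [Fraction(0)]

def sub(a, b):                     # a - b
    get = lambda p, i: p[i] if i < len(p) else Fraction(0)
    return trim([get(a, i) - get(b, i) for i in range(max(len(a), len(b)))])

def divmod_poly(a, b):             # quotient and remainder of a by b != 0
    a = a[:]
    q = [Fraction(0)] * max(1, len(a) - len(b) + 1)
    while len(a) >= len(b) and any(a):
        c = a[-1] / b[-1]
        q[len(a) - len(b)] = c
        for i, bc in enumerate(b):
            a[len(a) - len(b) + i] -= c * bc
        trim(a)
    return trim(q), trim(a)

def pgcd(a, b):                    # monic GCD via the Euclidean Algorithm
    a, b = trim(a[:]), trim(b[:])
    while any(b):
        a, b = b, divmod_poly(a, b)[1]
    return [c / a[-1] for c in a]

def yun(P):
    """Algorithm 1: returns [P1, ..., Pk] with P/lc(P) = P1 * P2^2 ... Pk^k."""
    P = [Fraction(c) / P[-1] for c in P]          # work with the monic part
    dP = deriv(P)
    G = pgcd(P, dP)                               # G = P2 * P3^2 ... Pk^(k-1)
    C = divmod_poly(P, G)[0]                      # C1 = P1 P2 ... Pk
    D = sub(divmod_poly(dP, G)[0], deriv(C))      # D1
    out = []
    while C != [Fraction(1)]:
        Pi = pgcd(C, D)                           # Pi = gcd(Ci, Di)
        C = divmod_poly(C, Pi)[0]                 # C_{i+1} = Ci / Pi
        D = sub(divmod_poly(D, Pi)[0], deriv(C))  # D_{i+1} = Di/Pi - C_{i+1}'
        out.append(Pi)
    return out

# (x+1)(x+2)^2 = x^3 + 5x^2 + 8x + 4  ->  P1 = x+1, P2 = x+2
print(yun([4, 8, 5, 1]))   # [[1, 1], [2, 1]] as Fractions
```

Note how a skipped multiplicity yields the placeholder factor 1, as Definition 2 allows: for (x+1)(x+3)^3 the algorithm returns P_1 = x+1, P_2 = 1, P_3 = x+3.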
2.3 Fast polynomial evaluation and interpolation
This section presents asymptotically fast algorithms for univariate polynomial evaluation and interpolation. These are based on the concept of a subproduct tree. For a more complete presentation, please refer to Chapter 10 in [33].
2.3.1 Fast polynomial evaluation
Multipoint evaluation of a (univariate) polynomial primarily means evaluating that polynomial at multiple points, and can formally be stated as follows. Given a univariate polynomial P = Σ_{j=0}^{n−1} p_j x^j ∈ K[x], with coefficients in the field K, and evaluation points u_0, ..., u_{n−1} ∈ K, compute P(u_i) = Σ_{j=0}^{n−1} p_j u_i^j, for i = 0, ..., n − 1. It is useful to introduce the following objects. Define m_i = x − u_i ∈ K[x], and m = ∏_{0≤i
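The recursive remaindering idea behind fast multipoint evaluation can be sketched in Python as follows. This is a plain quadratic-arithmetic illustration, assuming exact Fraction coefficients and naive multiplication and division; the helper names are chosen here, and an efficient version would precompute the subproduct tree once (and use fast multiplication) instead of rebuilding the subproducts at every recursive call.

```python
from fractions import Fraction
from functools import reduce

def poly_mul(a, b):
    """Product of two coefficient lists (low degree first)."""
    r = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            r[i + j] += ai * bj
    return r

def poly_mod(a, b):
    """Remainder of a modulo b (b has an invertible leading coefficient)."""
    a = a[:]
    while len(a) >= len(b):
        c = a[-1] / b[-1]
        for i in range(len(b)):
            a[len(a) - len(b) + i] -= c * b[i]
        a.pop()
    return a or [Fraction(0)]

def lin(u):
    return [Fraction(-u), Fraction(1)]           # the moduli m_i = x - u_i

def multipoint_eval(P, points):
    """Evaluate P at all points: split the point set in halves, reduce P
    modulo the product of each half's linear moduli, and recurse."""
    def rec(f, pts):
        if len(pts) == 1:
            return [poly_mod(f, lin(pts[0]))[0]]  # f mod (x - u) = f(u)
        mid = len(pts) // 2
        mL = reduce(poly_mul, map(lin, pts[:mid]))  # subproducts of the tree
        mR = reduce(poly_mul, map(lin, pts[mid:]))
        return rec(poly_mod(f, mL), pts[:mid]) + rec(poly_mod(f, mR), pts[mid:])
    return rec([Fraction(c) for c in P], list(points))

print(multipoint_eval([0, 0, 1], [1, 2, 3]))   # x^2 at 1, 2, 3 -> [1, 4, 9]
```

The correctness rests on the identity P(u_i) = (P mod (x − u_i)), and on the fact that reducing P modulo a subproduct first does not change the later residues, since each m_i divides the subproduct above it.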