HLinear: Exact Dense Linear Algebra in Haskell

arXiv:1605.02532v1 [cs.MS] 9 May 2016

Alexandru Ghitza¹ and Martin Westerholt-Raum²

Abstract: We present an implementation, in the functional programming language Haskell, of the PLE decomposition of matrices over division rings. Our benchmarks show that in a relevant number of cases it is significantly faster than the C-based implementation provided in FLINT. Describing the guiding principles of our work, we introduce the reader to basic ideas from high-performance functional programming.

Linear algebra pervades modern algorithms. Today's multitude of applications of linear algebra has spawned an equal multitude of refinements of linear algebra. Dense (as opposed to sparse) linear algebra refers to computation with matrices or vectors with few expected zero entries. Exact linear algebra (as opposed to approximate linear algebra [KLN96] and numerical linear algebra) refers to computation admitting no approximation error. The need for exact dense linear algebra arises from, among other areas, cryptography, compression, and "inner" mathematical³ problems. It is the backbone of symbolic and algebraic-geometric computation facilities.

Major open-source implementations of exact dense linear algebra are available within LinBox [LinBox] and Flint [Flint] (Fast LIbrary for Number Theory). LinBox covers functionality to find solutions to linear equations, to compute invariants of linear operators, and to compute various canonical forms of matrices. The focus is on computation over finite fields and the integers, and extends to the rationals via a technique called rational reconstruction. The library is based on black box algorithms [KT90]. Flint offers functionality similar to LinBox, but implements it differently; it is mostly based on classical and Strassen-type approaches [Str69].

Questions about vector spaces and systems of linear equations are mostly addressed via matrix factorizations. One instance of matrix factorization is the PLE decomposition of a matrix M, which in particular provides an echelon form E. In this work, we present an implementation, HLinear, of PLE decomposition in the functional programming language Haskell. It is competitive with Flint and in some cases outperforms it by a factor of 10. At the same time, it enjoys the typical benefits of programs written in functional languages: for instance, it opens doors to formal verification and to painless distributed computation.

¹ The first author was partially supported by ARC Discovery Grant DP120101942.
² The second author was partially supported by Vetenskapsrådet Grant 2015-04139.
³ We use the adjective "inner" to describe questions that occur in purely mathematical contexts, rather than those geared mainly toward applications.

1 Background

§1.1 Motivation. The need for an efficient PLE decomposition grew out of the second author's project to compute with (Siegel) modular forms; cf. the last section of [BWR14]. Matrices arising from this application are comparatively large, with between 10,000 and several hundred thousand rows. To complicate matters, they have entries in number fields. On the plus side, algebraic-geometric methods show that these matrices have PLE decompositions with rather small denominators. The urgent need for parallelization and distributed computing, in conjunction with the authors' desire to formally verify as much of their future computation as possible, ruled out the use of available implementations. The verification requirement, specifically, suggested the use of a functional programming language.

§1.2 PLE decomposition. Let R be a (unital) ring. Given a matrix M ∈ Mat_{m,n}(R), we say that M = PLE is a PLE decomposition of M if P is a permutation matrix, L is a lower triangular matrix with diagonal entries 1, and E is a matrix in echelon form. Jeannerod, Pernet, and Storjohann [JPS13] explain the PLE and related decompositions in the context of rank profiles. As an example of PLE decomposition, we record that a


4 × 6 matrix could admit the following factorization:

  M = P \begin{pmatrix} 1 & & & \\ \ast & 1 & & \\ \ast & \ast & 1 & \\ \ast & \ast & \ast & 1 \end{pmatrix} \begin{pmatrix} \ast' & \ast & \ast & \ast & \ast & \ast \\ & & \ast' & \ast & \ast & \ast \\ & & & \ast' & \ast & \ast \\ & & & & & \end{pmatrix},

where ∗ denotes an arbitrary entry of R and ∗′ a non-zero entry.

If R = K is a field (or a division ring), then it is possible to pass to normalized echelon forms. In this case, we allow arbitrary non-zero entries on the diagonal of L, and in exchange demand that the pivot entries of E be 1:

  M = P \begin{pmatrix} \ast' & & & \\ \ast & \ast' & & \\ \ast & \ast & \ast' & \\ \ast & \ast & \ast & \ast' \end{pmatrix} \begin{pmatrix} 1 & \ast & \ast & \ast & \ast & \ast \\ & & 1 & \ast & \ast & \ast \\ & & & 1 & \ast & \ast \\ & & & & & \end{pmatrix}.

Slightly ambiguously, this decomposition is also called a (normalized) PLE decomposition. To obtain a normal form of M with respect to the action of invertible matrices from the left, one may proceed to the reduced echelon normal form by applying to E an upper triangular matrix U with diagonal entries 1. We thus obtain the PLUE decomposition associated with the previous example:

  M = P \begin{pmatrix} \ast' & & & \\ \ast & \ast' & & \\ \ast & \ast & \ast' & \\ \ast & \ast & \ast & \ast' \end{pmatrix} \begin{pmatrix} 1 & \ast & \ast & \ast \\ & 1 & \ast & \ast \\ & & 1 & \ast \\ & & & 1 \end{pmatrix} \begin{pmatrix} 1 & \ast & & & \ast & \ast \\ & & 1 & & \ast & \ast \\ & & & 1 & \ast & \ast \\ & & & & & \end{pmatrix}.

Gaussian elimination is the most classical algorithm to compute PLE decompositions. It proceeds by iterating through columns: (1) picking a non-zero element in the current column, if possible; (2) permuting the corresponding row to the top unprocessed position; (3) normalizing that row; (4) eliminating the entries in the current column below that row. Despite its age, this algorithm continues to be the fundamental building block in the computation of PLE decompositions of modest-sized matrices.

One alternative to Gaussian elimination is the sliced PLE decomposition [ABP11], which is a hierarchical approach. Splitting up M into column slices M = ( M_0 ⋯ M_{r−1} ), one computes M_0 = P_0 L_0 E_0 and sets M'_i = L_0^{−1} P_0^{−1} M_i for i ≥ 1. Then one decomposes M'_i = {}^t( {}^t E''_i \; {}^t M''_i ) row-wise, where the number of rows of E''_i equals the rank of E_0. Setting M'' = ( M''_1 ⋯ M''_{r−1} ), one finds M'' = P'' L'' E'' by recursion, and thus builds the PLE decomposition of M by rearranging

  M = P L \begin{pmatrix} E''_1 \ \cdots \ E''_{r-1} \\ P'' L'' E'' \end{pmatrix}.

This approach and its iterative counterpart have been implemented in M4RI [M4RI] and in a not-yet-released version of LinBox. Any algorithm of this flavor calls for thinking about it in terms of directed acyclic graphs (DAGs). These have not yet been applied to exact dense linear algebra, but they do appear, for example, in [KE14].

Given that LinBox employs rational reconstruction, we revisit multi-modular linear algebra. The key observation is that the decomposition M = PLE for M ∈ Mat_{m,n}(Q) can be reconstructed from its reductions modulo a large enough integer N coprime to the denominators of M, L, and E. In practice, N can be chosen to be a product of distinct word-sized primes, which leads to vastly reduced coefficient sizes. Of course, this comes at the expense of additional reconstruction steps.

§1.3 Implementations of PLE decompositions. From the three approaches to linear algebra presented above, we see that dense linear algebra requires optimization on at least two scales. First of all, it involves frequent additions and multiplications in the coefficient ring R, and occasional divisions. Such operations are typically optimized at low level. For exact computation, this is addressed in libraries like GMP [GMP], MPIR [MPIR], and FFLAS [DGP08]. Second, the structural dependencies among the operations require optimization at high level. They are traditionally met by studying the algorithms from a theoretical point of view. Modern compilers and libraries facilitate the exploitation of fusion, dependency analysis, and even term rewriting.

Other aspects that receive increasing attention are reliability, security, and correctness. They are of considerable importance to cryptography and to "inner" mathematical applications. Correctness is typically addressed by testing, which is supported by various frameworks available for major programming languages.
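The reconstruction step underlying the multi-modular approach above can be made concrete in a few lines. The following toy sketch (ours, not code from LinBox or Flint; the names egcd, minv, and crt are our choice) recovers an integer from its residues modulo pairwise coprime moduli via the Chinese remainder theorem:

```haskell
-- Extended gcd: egcd a b = (g, x, y) with a*x + b*y = g = gcd a b.
egcd :: Integer -> Integer -> (Integer, Integer, Integer)
egcd a 0 = (a, 1, 0)
egcd a b =
  let (q, r)    = a `quotRem` b
      (g, x, y) = egcd b r
  in  (g, y, x - q * y)

-- Modular inverse, assuming gcd a n == 1.
minv :: Integer -> Integer -> Integer
minv a n = let (_, x, _) = egcd (a `mod` n) n in x `mod` n

-- Chinese remainder reconstruction from (residue, modulus) pairs with
-- pairwise coprime moduli: the unique x with 0 <= x < product of moduli.
crt :: [(Integer, Integer)] -> Integer
crt rms = sum (map term rms) `mod` bigN
  where
    bigN = product (map snd rms)
    term (r, n) = let m = bigN `div` n
                  in  r * m * minv m n

main :: IO ()
main = print (crt [(2, 3), (3, 5), (2, 7)])  -- prints 23
```

Multi-modular linear algebra applies this idea entrywise to the matrices P, L, and E, followed by rational reconstruction of the denominators.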


Formal verification provides further reassurance, but it is hard to apply to today's most popular languages due to, for example, insufficient type systems.

We have mentioned the two implementations LinBox and Flint of exact dense linear algebra. For computation over the rationals, Flint mostly performs a little better than LinBox. Competition between LinBox and Flint has produced tremendous progress in the field of computational linear algebra. However, both also suffer from certain deficits. Flint has had two "severe bugs" in the last two years, affecting primality testing and gcd computation.⁴ Parts of LinBox were for a certain time excluded from the computer algebra distribution Sage [Sage]⁵ because of incorrect results or segmentation faults.

LinBox is written in C++, while Flint is written in C. We notice that Flint cannot rely on the compiler's ability to simplify structure at a larger scale. Indeed, it focuses on low-level optimizations, and deals with high-level optimizations by hand. LinBox can and does rely on template metaprogramming for achieving a certain level of generality and structural optimization. In connection with correctness, we quote the Google style guide for C++ [Goo16], which says "Avoid complicated template programming", and reasons that:

  The techniques used in template metaprogramming are often obscure to anyone but language experts. Code that uses templates in complicated ways is often unreadable, and is hard to debug or maintain.

⁴ See the News section at flintlib.org.
⁵ See tickets 6296 and 12629 at trac.sagemath.org.

Both C and C++ have excellent properties when it comes to program overhead; LinBox and Flint make use of them. On the other hand, neither supports the programmer with the optimization of large-scale structures, with testing, or even with verification.

§1.4 Haskell. Haskell is a functional programming language, the most popular one besides OCaml. It is successful due to, among other things, the highly developed and aggressively optimizing Glasgow Haskell Compiler (GHC). While functional programming languages suffer from a reputation of being relatively slow, recent progress on fusion [MLPJ13] and plenty of highly developed libraries have allowed for the implementation of the high-performance web server Warp [YSV13], for beating C code on some numerical applications [MLPJ13], and for software employed in the financial industry.⁶ Haskell supports the developer by providing compositional code, term rewriting rules, and a strong type system. As a result, code reusability, testing, and verification (in connection with Coq [Coq]) are superior to any other language encountered in an industrial setting. Haskell has important weak points: (a) it is infamous for its steep learning curve; (b) it does not have a type system as strong as, say, Agda [Nor09]; (c) there is low-hanging fruit left to pick in the optimization of its parallelization and distributed computation libraries. Regardless of these imperfections, it currently appears to be the best possible choice for functional (in the sense of functional programming) implementations.

We refer the reader who is unfamiliar with Haskell to [Wik16], and illustrate code reusability in Haskell by a design pattern that we will encounter later. It is a common scheme to (i) decompose a data structure into a sequence of relatively simple data structures, and then (ii) recombine these simple structures. In Haskell, this is supported by unfolding and folding. The respective type signatures are⁷

  unfoldr :: (b -> Maybe (a, b)) -> b -> [a]
  foldl   :: (b -> a -> b) -> b -> [a] -> b

⁶ Examples include Barclays Capital [Fra+09], Credit Suisse [Man06], and Deutsche Bank [Pol08].
⁷ In some of the code listings, we had to violate Haskell's indentation rules in order to make the lines fit.

That is, unfolding is based on a function that decomposes an instance of a data structure b, if possible, into an easier piece of type a and a remainder, which is again of type b. As a simple example, we formulate the extended Euclidean algorithm for non-negative integers


(corresponding to the type Natural) in terms of fold and unfold. Thinking of a pair (a, b) as a row vector, a single reduction step of the Euclidean algorithm can be viewed as writing ( a  b ) = ( b  r ) T for the matrix T = \begin{pmatrix} t & 1 \\ 1 & 0 \end{pmatrix}, where a = t b + r is the result of division of a by b. This reduction step is implemented as:

  reduce (_, 0) = Nothing
  reduce (0, b) = Just (-1, (b, b))
  reduce (a, b) = let (t, r) = quotRem a b
                  in  Just (t, (b, r))

From a pair (a, b) we thus obtain a list [T_1, …, T_r] such that ( a  b ) = ( g  0 ) T_r ⋯ T_1, where g = gcd(a, b). The matrix (T_r ⋯ T_1)^{-1} = T_1^{-1} ⋯ T_r^{-1} is then computed by folding via:

  mulinv (a, b, c, d) t = (b, a - t*b, d, c - t*d)

Its first column (x, y) satisfies a x + b y = g. The extended Euclidean algorithm on a pair (a, b) can thus be cleanly written as

  let (x, _, y, _) = foldl mulinv (1, 0, 0, 1) $
                       unfoldr reduce (a, b)
  in  (x, y)
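Assembled into a complete program, the listings above read as follows (the wrapper name euclidExt is ours, and we specialize to Integer rather than Natural for brevity):

```haskell
import Data.List (foldl', unfoldr)

-- One reduction step: (a, b) = (b, r)·T with T = (t 1; 1 0) and a = t*b + r.
reduce :: (Integer, Integer) -> Maybe (Integer, (Integer, Integer))
reduce (_, 0) = Nothing
reduce (0, b) = Just (-1, (b, b))
reduce (a, b) = let (t, r) = quotRem a b
                in  Just (t, (b, r))

-- Right-multiply the accumulated inverse product by T^(-1) = (0 1; 1 -t).
mulinv :: (Integer, Integer, Integer, Integer) -> Integer
       -> (Integer, Integer, Integer, Integer)
mulinv (a, b, c, d) t = (b, a - t * b, d, c - t * d)

-- Bezout coefficients: euclidExt a b = (x, y) with a*x + b*y = gcd a b.
euclidExt :: Integer -> Integer -> (Integer, Integer)
euclidExt a b =
  let (x, _, y, _) = foldl' mulinv (1, 0, 0, 1) $ unfoldr reduce (a, b)
  in  (x, y)

main :: IO ()
main = print (euclidExt 7 5)  -- prints (-2,3), since 7*(-2) + 5*3 = 1
```

The unfold produces the quotients t one at a time, and the fold consumes them immediately; no intermediate list needs to be materialized once the compiler fuses the two.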

All intermediate steps are exposed directly to the compiler, which can optimize more aggressively.

§1.5 Implicit configuration via reflection. The configuration problem in functional programming is that data given at the outer level of a program needs to be accessed at the inner level. Since functional programming style strives to use pure functions, that is, to exclude side effects, this would a priori require one additional configuration parameter in all functions. For example, the two-argument function

  f :: a -> b -> c

would be augmented to

  f' :: cfg -> a -> b -> c

and one would need to carry the argument cfg through all function calls. One solution to the dynamic configuration problem, proposed in [KS04], is to let the type system assist. One introduces a pair of functions

  data Proxy s = Proxy

  reify   :: a -> (forall (s :: *). Reifies s a => Proxy s -> r) -> r
  reflect :: Reifies s a => Proxy s -> a

Observe that Proxy carries no runtime information, since it has one constructor without any parameter. The first argument of reify is a configuration parameter, and its second argument is a function that requires configuration. Inside that function, one can use reflect to recover the configuration parameter from an instance of Proxy. A prototypical application of this idea would be as follows:

  import Data.Reflection
  import Data.Proxy

  reify 4 $ \(_ :: Proxy s) ->
    3 + reflect (Proxy :: Proxy s)

Edward Kmett's library reflection provides a fast implementation of reflection.

2 HLinear

We have split HLinear into three packages: algebraic-structures, HFlint, and the main package HLinear. They rely on various packages authored by others, most prominently the vector package for tuned vector and array manipulation, and Edward Kmett's reflection package. Testing relies on a combination of QuickCheck and SmallCheck, bundled conveniently in the testing framework Tasty. Benchmarking is based on Criterion.

§2.1 algebraic-structures. The new package algebraic-structures provides classes for algebraic structures ranging from magmas, groups, and actions, to rings, modules, and algebras. For example, magmas are sets together with a binary operator; no further conditions are imposed. The class

  class MultiplicativeMagma a where
    (*) :: a -> a -> a

therefore captures completely the definition of a magma with its binary operator written multiplicatively. At the other extreme, (unital) R-algebras can be characterized as sets A with binary operators + and ·, such that (i) R is commutative, (ii) A is a left R-algebra, (iii) A is a right R-algebra, (iv) A is an R-module. A type a representing an


algebra with base ring represented by r is thus described by the class

  class ( Commutative r
        , LeftAlgebra r a, RightAlgebra r a
        , Module r a )
        => Algebra r a

Commutativity is implemented by a similarly looking class

  class MultiplicativeMagma a
        => Commutative a

However, it includes the axiom that a₁ · a₂ = a₂ · a₁ for all instances a₁, a₂ of type a. There is no general way to establish that such axioms are valid for a given Haskell function. For this reason, the package algebraic-structures provides Tasty combinators to test that implementations respect the relevant mathematical axioms. To invoke only the combinator for commutativity inside a test group, we can write

  testGroup "Algebraic properties of a" $
    (`runTestR` testProperty) $
    fmap concat $ sequence
      [ isCommutative (Proxy :: Proxy a) ]

Instances for some built-in Haskell types (Integer, Int, Natural, and Rational) are provided. Two typical instances are given by

  mkEuclideanDomainInstanceFromIntegral
    (return []) [t| Integer |]
  mkRingInstance (return []) [t| Int |]

from which the pattern for invoking the corresponding Template Haskell should become clear. As a side remark, note that the type Int, as opposed to Integer, does not represent an integral domain, due to overflow.

The package algebraic-structures makes it easier to implement mathematical ideas in the greatest possible generality. For example, the normalized PLE decomposition can be defined for all division rings. And this is exactly the level of generality that HLinear meets.

We conclude with one vague remark. Experience with HLinear suggests that the performance of some programs (specifically HLinear's) can profit from intermediate objects that are defined at a mathematical level of rigor. If true, the roots of this observation might be the compiler's ability to rearrange intermediate steps more effectively, i.e. to optimize more aggressively. It can definitely not be related to rewrite rules, which are not included in the current version of algebraic-structures.

§2.2 HFlint. HFlint is a wrapper around some parts of Flint. Specifically, it wraps the integers Z, the rationals Q, the polynomial algebras Z[x] and Q[x], and the finite fields F_p for primes p. Given the current capabilities of Flint, it is possible to extend this to number fields (via Antic [Har16]), to the p-adics Q_p and Z_p and their finite extensions, to all finite fields F_q for prime powers q, and to the real and complex numbers R and C (via Arb [Joh13]). In light of the layout of HFlint, this extension would be feasible with rather little effort.

In Section 1.3, we discussed low-level and high-level optimizations of linear algebra. While papers like [CSS03; MLPJ13] suggest that the performance barrier between C and Haskell for elementary operations is low or even non-existent, state-of-the-art implementations of, say, integer arithmetic (GMP, MPIR) are very hard to beat. This is not the topic of this paper; consequently, we rely on Flint for fundamental arithmetic. Note that we purposely wrap Z, which is also represented by Integer. However, Rational (built on top of Integer) cannot compete with the corresponding Flint implementation.

Flint objects are either synonyms for an elementary data type (int, long, etc.) or pointers to C structures. Functions for Flint objects may accept a context object storing information about the type rather than about an individual object. For example, rationals are usually accessed via

  typedef fmpq fmpq_t[1];

and the signature of a typical function is

  void fmpq_add ( fmpq_t res,
                  const fmpq_t op1, const fmpq_t op2 );

with self-evident meaning of the arguments. There are no context objects attached to rationals. Finite fields F_p do have a context associated to them that keeps track of the prime p.
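As an aside, the flavor of the algebraic-structures classes from §2.1 can be imitated in a few self-contained lines. The instances below and the naive check isCommutativeOn are ours, standing in for the package's QuickCheck/SmallCheck-based Tasty combinators; they are not part of algebraic-structures:

```haskell
import Prelude hiding ((*))
import qualified Prelude as P

-- A multiplicative magma: a set with one binary operator, no axioms.
class MultiplicativeMagma a where
  (*) :: a -> a -> a

instance MultiplicativeMagma Integer where
  (*) = (P.*)

-- 2x2 integer matrices: a magma that is not commutative.
newtype M2 = M2 (Integer, Integer, Integer, Integer)
  deriving (Eq, Show)

instance MultiplicativeMagma M2 where
  M2 (a, b, c, d) * M2 (e, f, g, h) =
    M2 ( a P.* e + b P.* g, a P.* f + b P.* h
       , c P.* e + d P.* g, c P.* f + d P.* h )

-- Naive stand-in for the commutativity axiom test: check a1*a2 == a2*a1
-- on a finite sample instead of generated test cases.
isCommutativeOn :: (MultiplicativeMagma a, Eq a) => [a] -> Bool
isCommutativeOn xs = and [ x * y == y * x | x <- xs, y <- xs ]

main :: IO ()
main = print ( isCommutativeOn [1 .. 5 :: Integer]
             , isCommutativeOn [M2 (0,1,0,0), M2 (0,0,1,0)] )
```

The class carries the operator; the axiom lives outside the type system and can only be checked by testing, which is exactly the gap the Tasty combinators fill.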


Elements of finite fields of small moduli are encoded by means of an elementary data type mp_limb_t, which on most common architectures resolves to unsigned long. A typical signature for such finite fields is

  mp_limb_t nmod_add(
    mp_limb_t a, mp_limb_t b, nmod_t mod);

The first two arguments are the summands, and the third one is a pointer to a context.

The disciplined interface style of Flint allows for systematic wrapping. As mentioned above, there is a context attached to finite fields. Clearly, it is desirable to disallow, say, the addition of elements of F_p and F_{p′} for p ≠ p′. A context parameter on the left-hand side of the data type declaration takes care of this. Note that the context does not appear on the right:

  type FlintLimb = CULong

  newtype NMod ctxProxy = NMod { unNMod :: FlintLimb }

To operate with elements of finite fields, Flint requires the context reference nmod_t. The implementation of addition, with type signature

  (+) :: NMod ctx -> NMod ctx -> NMod ctx

can therefore be viewed as a dynamic configuration problem. We make NMod an instance of the class FlintPrim, which contains the function

  withFlintPrimCtx
    :: ReifiesFlintContext NModCtx ctxProxy
    => NMod ctxProxy
    -> (CFlintPrim NMod -> Ptr (CFlintCtx ctx) -> IO b)
    -> IO b

The condition ReifiesFlintContext is related to the reflection library and is discussed in the next subsection. The second argument corresponds to a wrapped C function. Concretely, addition could be implemented as

  (+) a b =
    withFlintPrimCtx a $ \ac _ ->
    withFlintPrimCtx b $ \bc ctxptr ->
    nmod_add ac bc ctxptr

The inner variables ac and bc arise from the data represented by a and b, but the context originates from their type via reflection. In particular, it is semantically correct to ignore the context pointer provided by the first call to withFlintPrimCtx, since a and b have the same type, and therefore they have the same context.

To create and employ the context of a finite field, one uses withNModContext, whose type signature includes the condition NFData b for the type b of the result. Its first argument is the modulus of the finite field.

  withNModContext
    :: NFData b
    => FlintLimb
    -> ( forall ctxProxy .
         ReifiesFlintContext NModCtx ctxProxy
         => Proxy ctxProxy -> b )
    -> b

Reflection and dynamic types. The goal of this section is to explain the condition NFData b, which might appear unnecessary. Context objects are generally represented by a pointer to a C structure. Most commonly, one would use a ForeignPtr, whose finalizer frees the C structure. A typical computation in F_7 might look as follows.

  unsafePerformIO $ do
    ctx <- …

Invoking in this way an implementation based on foreign pointers and Kmett's reflection library can and will produce segmentation faults. The concept of reflection is mathematically sound, but it is incompatible with finalizers. A priori, the inner function in the second line does not contain any reference to the C context instance. This might make the finalizer free it before an actual reference is created by

  reflect (Proxy :: Proxy ctx)

The point is that the latter call does create runtime data out of type information, which the finalizer of a ForeignPtr cannot keep track of.

Dynamic configuration via reflection is fast, since it allows us to move context information away from the element information.
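The pattern of carrying a context in a phantom type parameter can be imitated with the reflection package alone. The following toy (ours; it mirrors the spirit of [KS04] and of NMod, but uses none of HFlint's actual types, and it requires the reflection package) stores a modulus in the type and recovers it with reflect:

```haskell
{-# LANGUAGE RankNTypes, ScopedTypeVariables #-}

import Data.Proxy      (Proxy (..))
import Data.Reflection (Reifies, reflect, reify)

-- An element of Z/N; the modulus N lives in the phantom type s,
-- not next to every element.
newtype Mod s = Mod Integer

instance Reifies s Integer => Num (Mod s) where
  Mod a + Mod b  = Mod ((a + b) `mod` reflect (Proxy :: Proxy s))
  Mod a * Mod b  = Mod ((a * b) `mod` reflect (Proxy :: Proxy s))
  negate (Mod a) = Mod (negate a `mod` reflect (Proxy :: Proxy s))
  fromInteger n  = Mod (n `mod` reflect (Proxy :: Proxy s))
  abs            = id
  signum _       = fromInteger 1

-- Run a computation in Z/N for a modulus chosen at runtime.
withModulus :: Integer -> (forall s. Reifies s Integer => Mod s) -> Integer
withModulus n x = reify n (\(_ :: Proxy s) -> let Mod a = x :: Mod s in a)

main :: IO ()
main = print (withModulus 7 (3 + 5 * 2))  -- prints 6, i.e. 13 mod 7
```

Since the modulus is pure type information, elements remain a single machine word at runtime, and mixing elements of different moduli is a type error, just as with NMod.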


away from the element information. Instances of a hypothetical data type

  data NMod' = NMod' NModContext FlintLimb

would not only double the memory footprint, but also prevent some optimizations for the elementary data type FlintLimb. Dynamic configuration is also convenient, since it transparently prevents the accidental combination of elements of different fields. It thus seems worthwhile to introduce an NFData b condition to maintain it. Context objects are then implemented by plain Ptr instances, and memory allocation is taken care of manually. The implementation of withNModContext in HFlint is

  withNModContext n f = unsafePerformIO $ do
    ctx <- …

The core of the PLE computation in HLinear is a function splitOffHook with signature

  splitOffHook :: Matrix a -> Maybe (PLEHook a, Matrix a)

Applied to a non-zero matrix, it splits off a first PLE hook: schematically, it finds P, L, and E such that P is a permutation matrix and

  L = \begin{pmatrix} \ast' & 0 & 0 & 0 \\ \ast & 1 & 0 & 0 \\ \ast & 0 & 1 & 0 \\ \ast & 0 & 0 & 1 \end{pmatrix}, \quad E = \begin{pmatrix} 1 & \ast & \ast & \ast & \ast & \ast \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \quad M = \begin{pmatrix} 0 & M' \end{pmatrix},

where M′ is the matrix passed on to the next unfolding step.

PLE hooks. PLE hooks consist of three elements, P, L, and E. The implementation is designed for general division rings, with no particular optimizations for the rational numbers. For permutations, we rely on the library permutation, which internally makes use of IntArray. Their action on Vector is implemented directly by invoking functionality of the vector library. Two PLE hooks can be multiplied via

  (P_1, L_1, E_1) · (P_2, L_2, E_2) = ( P_1 P_2, \; P_2^{-1} L_1 P_2 L_2, \; E_1 + E_2 )

if r′_1 ≥ r_2 + r′_2. One verifies that this condition is satisfied for all sequences of PLE hooks that arise from unfolding a matrix with splitOffHook. As a result, we can formulate the PLE decomposition of a matrix m as

  foldl (*) (firstHook nrs ncs) $
    unfoldr splitOffHook m

Columns of left transformations keep track of their column index j, which is referred to as the offset (from the top). The headNonZero of a column is its j-th element. The newtype wrapper NonZero ensures that it is not zero, which over division rings is equivalent to being invertible. All remaining entries of a left transformation column are stored in a vector tail. Implementing left transformations with offsets ensures that they are stored as compactly as possible. The separate storage of the column index, which would a priori be deducible from the container columns, makes some operations more localizable.

Echelon forms are stored in a way that is similar to left transformations:

  data EchelonForm a = EchelonForm
    { nmbRows :: Natural
    , nmbCols :: Natural
    , rows    :: Vector (EchelonFormRow a) }

  data EchelonFormRow a = EchelonFormRow
    { offset :: Natural
    , row    :: Vector a }
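To make the fold-unfold scheme tangible, here is a toy PLE decomposition over the rationals, written from scratch for illustration. Unlike HLinear, it computes the unnormalized variant (L unit lower triangular, pivots left in E), it uses plain list matrices rather than the packed hook structures above, and it makes no performance claims:

```haskell
import Data.List (transpose)

type Mat = [[Rational]]

mmul :: Mat -> Mat -> Mat
mmul a b = [ [ sum (zipWith (*) r c) | c <- transpose b ] | r <- a ]

identity :: Int -> Mat
identity n = [ [ if i == j then 1 else 0 | j <- [0 .. n-1] ] | i <- [0 .. n-1] ]

swapRows :: Int -> Int -> Mat -> Mat
swapRows a b m =
  [ if k == a then m !! b else if k == b then m !! a else r
  | (k, r) <- zip [0 ..] m ]

-- Toy Gaussian elimination: ple m = (p, l, e) with m == p `mmul` l `mmul` e,
-- p a permutation matrix, l unit lower triangular, e in echelon form.
ple :: Mat -> (Mat, Mat, Mat)
ple m0 = go 0 0 (identity nr) (identity nr) m0
  where
    nr = length m0
    nc = length (head m0)
    go r j p l e
      | r >= nr || j >= nc = (transpose p, l, e)
      | otherwise =
          case [ i | i <- [r .. nr - 1], e !! i !! j /= 0 ] of
            []      -> go r (j + 1) p l e
            (i : _) ->
              let e1 = swapRows r i e
                  p1 = swapRows r i p
                  l1 = swapBelow r i l
                  pv = e1 !! r !! j
                  fs = [ (k, (e1 !! k !! j) / pv) | k <- [r + 1 .. nr - 1] ]
                  e2 = foldl (\acc (k, f) -> addMul k r (negate f) acc) e1 fs
                  l2 = foldl (\acc (k, f) -> setE k r f acc) l1 fs
              in  go (r + 1) (j + 1) p1 l2 e2
    -- swap the multipliers recorded in columns < r of rows r and i
    swapBelow r i l =
      [ [ if c < r && k == r then l !! i !! c
          else if c < r && k == i then l !! r !! c
          else x
        | (c, x) <- zip [0 ..] row ]
      | (k, row) <- zip [0 ..] l ]
    addMul k r f m =    -- row_k += f * row_r
      [ if idx == k then zipWith (\x y -> x + f * y) row (m !! r) else row
      | (idx, row) <- zip [0 ..] m ]
    setE k c v m =      -- m[k][c] := v
      [ if idx == k
          then [ if jdx == c then v else x | (jdx, x) <- zip [0 ..] row ]
          else row
      | (idx, row) <- zip [0 ..] m ]
```

The property mmul p (mmul l e) == m, checked on small inputs, is the kind of assertion one would hand to QuickCheck in a test suite for such a decomposition.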


We need to keep track of both the number of rows and the number of columns of E. A row of an echelon form has an offset, as left transformation columns do.

3 Usage

We illustrate the usage of HLinear via the computation of one example.

  import HFlint.FMPQ
  import HLinear.Matrix            as M
  import HLinear.PLE.FoldUnfold    as FU
  import HLinear.PLE.Decomposition as D
  import HLinear.PLE.Hook          as Hk

  let m = M.fromListsUnsafe
        [ [   84,   168,   588,  -252,   336,    49 ]
        , [  672,  1344,  4704, -1992,  4722,  2552 ]
        , [ -504, -1008, -3528,  2100, -1575, -4998 ]
        , [  168,   336,  1176,  -168,  1428, -2002 ] ]
        :: Matrix FMPQ

We start by invoking the fold-unfold implementation of PLE decomposition directly. It returns a PLE decomposition object, which we immediately convert to a PLE hook. The PLE hook allows for the extraction of the permutation, the left transformation, and the echelon form. We convert them to matrices directly.

  let hk = D.unPLEDecomposition $
             FU.pleDecompositionFoldUnfold m
  Hk.toMatrices hk

Reformatting the output slightly, this yields

  P = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad L = \begin{pmatrix} 84 & 0 & 0 & 0 \\ 672 & 24 & 0 & 0 \\ -504 & 588 & -49392 & 0 \\ 168 & 336 & -27720 & 1 \end{pmatrix}, \quad E = \begin{pmatrix} 1 & 2 & 7 & -3 & 4 & 7/12 \\ 0 & 0 & 0 & 1 & 339/4 & 90 \\ 0 & 0 & 0 & 0 & 1 & 7/6 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}.

Alternatively, we can obtain the respective data structures by

  let p = Hk.permutation hk
  let l = Hk.leftTransformation hk
  let e = Hk.echelonForm hk

Using a more general interface (not described in this work), it is also possible to obtain the PLE decomposition without referring to the fold-unfold implementation. To define the PLE hook hk, we could have written

  let hk = D.unPLEDecomposition $
             D.pleDecomposition m

4 Performance

We compared the performance of HLinear to that of Flint via a suite of benchmarks (all run on one core of a 2.70 GHz Intel Xeon E5-4650 processor).

Increasing fractions. The first test case considers the family of special matrices of size n whose (i, j) entry is given by

  \frac{i^2 + 2}{(n - j)^3 + 1}.

A comparison of Flint and HLinear is given in Table 1 on page 11.

Random matrices. The other benchmarks use random square matrices. To accommodate our focus on matrices with bounded denominators, we generate random matrices whose denominators are products d_1 ⋯ d_n for random numbers d_i. We use the following parameters:

• matrix size;
• snum: upper bound on the size of the numerators of the matrix entries (in bytes);
• nden: upper bound on the number of factors used to generate the denominators of the matrix entries;
• sden: upper bound on the size of the factors used to generate the denominators of the matrix entries (in bytes).

Both Flint and HLinear are run on the same random matrices. Tables 2 through 5 show the results for one varying parameter at a time.

5 Conclusion

We have demonstrated that an implementation of Gaussian elimination in a functional programming language can compete with C implementations and even outperform them. The design of


our implementation was guided by the algebraic structure of the intermediate steps. In particular, we exposed the iteration scheme of Gaussian elimination by explicitly unfolding a matrix to a vector of PLE hooks. We believe that this feature made it possible for the compiler to rearrange them more easily while optimizing the code. Potential for such rearrangement is generally advertised as a fundamental advantage of functional programming, and our example shows how it comes into effect in a practical case.

Our fold-unfold implementation of Gaussian elimination is general enough to cover all division rings. With slight modification, it can be optimized for the rationals (fraction-free PLE) or extended to discrete valuation rings (e.g. the local ring Z_p). Despite being very general, it performs well in practice.

The splitting up into algebraically modeled intermediate steps also opens doors to formal verification. The main obstacle for this is the partially defined multiplication of PLE hooks. The introduction of some type parameters in conjunction with type literals would allow us to remedy this, but it would lead to a heterogeneously typed list. While the implementation of such a list poses no problem in Haskell, it would not profit from the extensive optimizations in the vector library. It would thus defeat our central aim of providing an implementation of linear algebra that can compete with major contemporary ones. Annotations, e.g. Liquid Haskell, might provide a usable compromise.

References

M. Albrecht, G. Bard, and C. Pernet. Efficient dense Gaussian elimination over the finite field with two elements. 2011.

[BWR14]

J. Bruinier and M. Westerholt-Raum. Kudla’s Modularity Conjecture and Formal Fourier-Jacobi Series. arXiv:1409.4996. 2014.

[Coq]

The Coq development team. The Coq proof assistant reference manual, version 8.4. 2014.

[CSS03]

K. Claessen, M. Sheeran, and S. Singh. “Using Lava to design and verify recursive and periodic sorters”. International Journal on Software Tools for Technology Transfer 4.3 (2003).

[DGP08]

J-G. Dumas, P. Giorgi, and C. Pernet. “Dense linear algebra over word-size prime fields: the FFLAS and FFPACK packages”. ACM Transactions on Mathematical Software (TOMS) 35.3 (2008).

[Flint]

W. Hart, F. Johansson, and S. Pancratz. FLINT: Fast Library for Number Theory Version 2.5.2. http://flintlib.org. 2015.

[Fra+09]

S. Frankau, D. Spinellis, N. Nassuphis, and C. Burgard. “Commercial Uses: Going Functional on Exotic Trades”. J. Funct. Program. 19.1 (Jan. 2009).

[GMP]

T. Granlund and the GMP development team. GNU MP: The GNU Multiple Precision Arithmetic Library. http://gmplib.org.

[Goo16]

Google. Google C++ style guide.

http://google.github.io/ styleguide/cppguide.html. 2016. [Har16]

W. Hart. “Antic – Algebraic Number Theory In C”.

https://github.com/wbhart/antic. 2016. [Joh13]

F. Johansson. “Arb: a C library for ball arithmetic”. ACM Communications in Computer Algebra 47.4 (2013).

[JPS13]

C-P. Jeannerod, C. Pernet, and A. Storjohann. “Rank-profile revealing Gaussian elimination and the CUP matrix decomposition”. J. Symbolic Comput. 56 (2013).

[KE14]

K. Kim and V. Eijkhout. “A Parallel Sparse Direct Solver via Hierarchical DAG Scheduling”. ACM Trans. Math. Softw. 41.1 (Oct. 2014).

[KLN96]

V. Kreinovich, A. Lakeyev, and S. Noskov. “Approximate linear algebra is intractable”. Linear Algebra Appl. 232 (1996).

[KS04]

O. Kiselyov and C. Shan. “Functional pearl: implicit configurations–or, type classes reflect the values of types”. Proceedings of the 2004 ACM SIGPLAN workshop on Haskell. ACM. 2004.

[KT90]

E. Kaltofen and B. M. Trager. “Computing with polynomials given by black boxes for their evaluations: greatest common divisors, factorization, separation of numerators and denominators”. J. Symbolic Comput. 9.3 (1990).

[LinBox]

The LinBox Group. LinBox – Exact computational linear algebra, Version 1.3.2. http://linalg.org. 2015.

[M4RI]

M. Albrecht, G. Bard, and the M4RI Team. The M4RI Library. http://m4ri.sagemath.org.

[Man06]

H. Mansell. “Why functional programming matters to Credit Suisse”. Commercial Users of Functional Programming, Portland. 2006.

[MLPJ13]

G. Mainland, R. Leshchinskiy, and S. Peyton Jones. “Exploiting vector instructions with generalized stream fusion”. ACM SIGPLAN Notices 48.9 (2013).

[MPIR]

The MPIR Group. MPIR: Multiple Precision Integers and Rationals. http://www.mpir.org.

[Nor09]

U. Norell. “Dependently Typed Programming in Agda”. Proceedings of the 4th International Workshop on Types in Language Design and Implementation. TLDI ’09. 2009.

[Pol08]

J. Polakow. “Is Haskell ready for everyday computing?” Commercial Users of Functional Programming, Victoria. 2008.

[Sage]

The Sage Developers. Sage Mathematics Software (Version 6.9). http://www.sagemath.org. 2015.

[Str69]

V. Strassen. “Gaussian elimination is not optimal”. Numer. Math. 13 (1969).

[Wik16]

Haskell Wiki. “Learning Haskell”. https://wiki.haskell.org/Learning_Haskell. 2016.

[YSV13]

K. Yamamoto, M. Snoyman, and A. Voellmy. “Warp”. The Performance of Open Source Applications. 2013.

Table 1: Matrices of increasing fractions
(CPU time in milliseconds)

Matrix size n      Flint    HLinear
  100                143        249
  200              2 111        736
  300              6 599      3 312
  400             26 804      4 466
  500             38 704      7 979
  600             77 866     21 929
  700            132 205     33 923
  800            229 307     39 300
  900            371 995     46 787
1 000            541 234     56 798
1 200                  —    103 509
1 400                  —    174 117
1 600                  —    255 548
1 800                  —    370 306
2 000                  —    514 827
2 500                  —  1 078 127
3 000                  —  2 329 201

Alexandru Ghitza
School of Mathematics and Statistics, University of Melbourne, Parkville, VIC 3010, Australia
E-mail: [email protected]
Homepage: http://aghitza.org

Martin Westerholt-Raum
Chalmers tekniska högskola och Göteborgs Universitet, Institutionen för Matematiska vetenskaper, SE-412 96 Göteborg, Sweden
E-mail: [email protected]
Homepage: http://raum-brothers.eu/martin

Table 2: Varying matrix size: snum = 50, nden = 10, sden = 20
(CPU time in milliseconds, Flint and HLinear)

Matrix size n: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150
[timing values illegible in the extracted source]


Table 3: Varying numerator size (in bytes): matSize = 71, nden = 10, sden = 20
(CPU time in milliseconds, Flint and HLinear)

Size of numerator: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100
[timing values illegible in the extracted source]

Table 4: Varying number of factors in denominator: matSize = 71, snum = 50, sden = 20
(CPU time in milliseconds)

Factors in denominator      Flint    HLinear
 5                          4 208      6 176
10                          3 216      2 961
15                      1 471 252     51 824
20                      2 148 879    200 091

Table 5: Varying denominator size (in bytes): matSize = 71, snum = 50, nden = 10
(CPU time in milliseconds)

Size of denominator       Flint    HLinear
10                      189 859     22 582
20                        3 179      2 914
30                    1 112 938    131 965
40                    4 136 898    228 384
50                    5 578 141    140 004
60                    5 297 293    110 420
70                   11 836 301    201 168
80                   13 269 407    441 943
