A Case Study in High-Performance Mixed-Language Programming

Hans Petter Langtangen^{1,2}

1 Simula Research Laboratory, Oslo, Norway
2 Dept. of Informatics, University of Oslo, Norway
[email protected], http://folk.uio.no/hpl
Abstract. Several widely used and promising programming tools and styles for computational science software are reviewed and compared. In particular, we discuss function/subroutine libraries, object-based programming, object-oriented programming, generic (template) programming, and iterators in the context of a specific example involving sparse matrix-vector products. A key issue in the discussion is to hide the storage structure of the sparse matrix in application code. The role of different languages, such as Fortran, C, C++, and Python, is an integral part of the discussion. Finally, we present performance measures of the various designs and implementations. These results show that high-level Python programming, with loops migrated to compiled languages, maintains the performance of traditional implementations, while offering the programmer a more convenient and efficient tool for experimenting with designs and user-friendly interfaces.
1 Introduction
During the last fifty years we have experienced a tremendous speed-up of hardware and algorithms for scientific computing. An unfortunate observation is that the development of programming techniques suitable for high-performance computing has lagged significantly behind the development of numerical methods and hardware. This is quite surprising, since the difficulty of handling the complexity of scientific software applications is widely recognized as an obstacle to solving complicated computational science problems. As a consequence, technical software development is a major money consumer in science projects. The purpose of this paper is to review and compare common scientific programming techniques and to outline some new promising approaches.

The first attempts to improve scientific software development took place in the 1970s, when it became common to collect high-quality and extensively verified implementations of general-purpose algorithms in Fortran subroutine libraries. The subroutines then constitute the algorithmic building blocks in more advanced applications, where data represented by array, integer, real, and string variables are shuffled in and out of subroutines. Such libraries are still perhaps the most important building blocks in scientific software (LAPACK being the primary example).
In the 1980s and 1990s some transition from Fortran to C took place in the scientific computing community. C offers the possibility to group primitive variables together to form new compound data types. Fortran 90 also provides this functionality. The result is that the function arguments are fewer and reflect a higher abstraction level. With C++ or Fortran 90, and the class or module concept, compound data types could also be equipped with functions operating on the data. Application code may then define some class objects and call their functions instead of shuffling long lists of variables in and out of function/subroutine libraries as in C and Fortran 77. C++ also offers object-oriented programming and templates, two constructs that can be used to parameterize types away such that one single function may work with different types of input data. This reduces the amount of code in libraries significantly and increases the flexibility when using the library in a specific application code. The upcoming Fortran 2003/2008 will enable the same constructs.

An important feature of the present paper is that we, along with the above mentioned programming strategies in Fortran, C, and C++, also illustrate the use of the Python language. Python supports all major programming styles and serves as a more convenient and more productive implementation environment than the traditional compiled languages used in high-performance computing. A particularly important feature of Python is its clean syntax, which many refer to as "executable pseudocode". This feature helps to bridge the gap between a mathematical expression of a solution algorithm and the corresponding computer implementation. Add-on modules equip Python with functionality close to that of basic Matlab. In contrast to Matlab, Python offers an advanced, full-fledged programming language with strong support for clean programming with user-defined objects. Another advantage of Python is its rich set of libraries for non-numerical/administrative tasks, and such tasks tend to fill up large portions of scientific codes.

The paper is divided into two parts. First, we exemplify major programming techniques, tools, and languages used in scientific computing and demonstrate how Python is both a complement and an alternative to Fortran, C, and C++. The second part presents some computational performance assessments of the various programming techniques. We remark that the paper primarily addresses readers with some technical knowledge of C, C++, and Python.
2 A Review of Programming Tools and Techniques
Programming tools and techniques are often best illustrated by applying them to a specific problem. We have chosen to focus on the matrix-vector product because the numerics of this problem is simple, the codes become short, and yet different implementation styles exhibit diverse properties. Mathematically, we want to compute r = Ax, where A is a square n × n matrix, and x and r are n-vectors. This operation is a key ingredient in many numerical algorithms. In particular, if A arises from discretization of a partial
differential equation with finite difference, element, or volume methods, A is sparse in the sense that most matrix entries are zero. Various matrix storage schemes enable the programmer to take computational advantage of the fact that A is sparse. One such storage scheme is the compressed row storage format. Given a matrix

    A = \begin{pmatrix}
          a_{0,0} & 0       & 0       & a_{0,3} & 0       \\
          0       & a_{1,1} & a_{1,2} & 0       & a_{1,4} \\
          0       & a_{2,1} & a_{2,2} & 0       & 0       \\
          a_{3,0} & 0       & 0       & a_{3,3} & a_{3,4} \\
          0       & a_{4,1} & 0       & a_{4,3} & a_{4,4}
        \end{pmatrix},

the compressed row storage scheme consists of a one-dimensional array, a, of all the nonzeroes, stored row by row, plus two indexing integer arrays, where the first, rowp ("row pointer"), tells where each row begins, and the second, coli, holds the column indices:

    a    = (a_{0,0}, a_{0,3}, a_{1,1}, a_{1,2}, a_{1,4}, a_{2,1}, a_{2,2}, a_{3,0}, a_{3,3}, ..., a_{4,4})
    rowp = (0, 2, 5, 7, 10, 13)
    coli = (0, 3, 1, 2, 4, 1, 2, 0, 3, 4, 1, 3, 4)

Notice here that we index A with base zero, i.e., the entries of A are a_{i,j} for i, j = 0, ..., n-1. Also notice that rowp has length n+1, with last entry equal to the number of nonzeroes. The straightforward algorithm for computing the matrix-vector product goes as follows:

    r = 0
    for i = 0, ..., n-1:
        s = 0
        for j = rowp[i], ..., rowp[i+1]-1:
            s = s + a[j]*x[coli[j]]
        r[i] = s
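To make the format concrete, the following is a small self-contained C++ sketch (not taken from the paper) that stores the 5 x 5 example matrix above in compressed row storage, with invented numerical values a_{i,j} = 10i + j + 1 for the nonzeroes, and applies the algorithm to x = (1, ..., 1):

    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 5;
        // CRS arrays for the example matrix (numerical values are made up).
        std::vector<double> a    = {1, 4, 12, 13, 15, 22, 23, 31, 34, 35, 42, 44, 45};
        std::vector<int>    rowp = {0, 2, 5, 7, 10, 13};
        std::vector<int>    coli = {0, 3, 1, 2, 4, 1, 2, 0, 3, 4, 1, 3, 4};
        std::vector<double> x(n, 1.0), r(n, 0.0);     // x = (1, ..., 1)

        for (int i = 0; i < n; ++i) {                 // loop over rows
            double s = 0.0;
            for (int j = rowp[i]; j < rowp[i+1]; ++j)
                s += a[j]*x[coli[j]];                 // only nonzeroes contribute
            r[i] = s;
        }
        for (int i = 0; i < n; ++i)
            std::printf("r[%d] = %g\n", i, r[i]);
        return 0;
    }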
The next subsections exemplify how different programming techniques result in different levels of complexity in codes that need to call matrix-vector product functionality for different matrix storage schemes. Two schemes are exemplified: a two-dimensional array for a (full) n × n matrix and compressed row storage of a sparse matrix.
2.1 Classical Fortran 77 and C Implementations
In Fortran 77 the matrix-vector product is naturally implemented as a subroutine where A, x, r, and the array size n are subroutine arguments. Typically, the product is computed by

    call matvec_dense(A, n, x, r)
    call matvec_sparse(a, rowp, coli, n, nz, x, r)
for the two cases that A is stored in a two-dimensional n × n array A or A is a sparse matrix stored in the a, rowp, and coli arrays (nz is the number of
nonzeroes, i.e., the length of a and coli). As seen, the calls are quite different although the mathematics is conceptually identical. The reason is that Fortran 77 exposes all details of the underlying data structures for matrix storage. An implication for a linear solver, say some Conjugate Gradient-like method, where the matrix-vector product is a key ingredient in the algorithm, is that the code depends explicitly on the way we store the matrix. Adding a new storage scheme implies that all matrix-vector product calls in the whole code need modification.

A similar C implementation differs mainly by its use of pointers instead of proper arrays:

    void matvec_dense(double** A, int n, double* x, double* r);
    matvec_dense(A, n, x, r);                  /* call */

    void matvec_sparse(double* a, int* rowp, int* coli, int n, int nz,
                       double* x, double* r);
    matvec_sparse(a, rowp, coli, n, nz, x, r); /* call */
2.2 Object-Based Implementation in Terms of C-Like Structs
A significant language feature of C is the struct type, which allows several basic data types to be grouped together. It makes sense that the arrays a, rowp, and coli used to represent a sparse matrix are grouped together and equipped with information on the sizes of the arrays:

    struct MatSparseC {
      double* a;  int* rowp;  int* coli;
      int n;  int nz;
    };
    MatSparseC A = MatSparseC();
We can now access the internal sparse matrix arrays through the dot notation: A.a, A.rowp, etc. A dense matrix representation of A in terms of a C array can be implemented similarly:

    struct MatDenseC {
      double** A;  int n;
    };
The calls to matrix-vector product functions

    void matvec_dense (MatDenseC*  A, double* x, double* r);
    void matvec_sparse(MatSparseC* A, double* x, double* r);
can now be made more uniform:

    matvec_dense (&A, x, r);
    matvec_sparse(&A, x, r);
That is, only the fundamental mathematical objects A, x, and r enter the function that computes r = Ax. Applying C++ as a "better C", we may use the same name for the matrix-vector product function (known as function overloading) such that the call hides the way A is stored:
    void matvec(const MatDenseC*  A, const double* x, double* r);
    void matvec(const MatSparseC* A, const double* x, double* r);

    matvec(&A, x, r);   // same syntax when A is MatDenseC or MatSparseC
The const keyword contributes to increased readability (input arguments) and safety (no possibility to change A and x inside the functions). Actually, in C++ we would use references instead of pointers and probably also wrap primitive arrays in a class. A natural choice is to employ the ready-made vector class std::valarray from the C++ Standard Template Library (STL) [6]. Alternatively, a home-made vector class can easily be implemented; a sketch of such a class is shown below. For the rest of the paper we introduce the short form Vecd for std::valarray<double>. A new vector of length n is then created by Vecd(n), while Vecd(a,n) borrows data from an existing double* a array of length n. With help of the Vecd type the function signatures and associated call become

    void matvec(const MatDenseC&  A, const Vecd& x, Vecd& r);
    void matvec(const MatSparseC& A, const Vecd& x, Vecd& r);

    matvec(A, x, r);   // same syntax when A is MatDenseC or MatSparseC
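The home-made alternative mentioned above is not listed in the paper; the following is a minimal invented sketch of what such a class could look like, supporting the operations used later in the text (Vecd(n), Vecd(a,n), size(), ptr(), and subscripting through operator()). Memory management and copy semantics are deliberately omitted to keep the sketch short:

    // Minimal invented sketch of a home-made vector class (not the paper's code).
    class Vecd {
      double* data;
      int     len;
    public:
      explicit Vecd(int n)   : data(new double[n]), len(n) {}  // allocate n entries
      Vecd(double* a, int n) : data(a), len(n) {}               // borrow existing array
      int     size() const { return len; }
      double* ptr()  const { return data; }                     // raw pointer for C/Fortran calls
      double& operator()(int i)       { return data[i]; }
      double  operator()(int i) const { return data[i]; }
    };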
When data are grouped together using a struct or class, one often speaks about object-based programming, since the fundamental variables are objects (here A) encapsulating more primitive data types.

In Python we may mirror all the implementations above. The relevant array data structure is supported by Numerical Python [3], and this structure contains both the array data and the length of each array dimension. Fortran-style functions can be implemented in Python and called as

    matvec_dense2(A, x, r)
    matvec_sparse2(a, rowp, coli, x, r)
A more "Pythonic" implementation would return the result r rather than having it as an argument (and performing in-place modifications of r):

    r = matvec_dense(A, x)
    r = matvec_sparse(a, rowp, coli, x)
This latter design requires the matvec_* routines to allocate a new r vector in each call. This work introduces some overhead that we will quantify in Section 3.

Python's class construction can be used to group data structures into objects as one does in C with a struct. Here is a possible example:

    class MatSparseC:
        pass    # start with empty class

    A = MatSparseC()
    A.a = some_a;  A.rowp = some_rowp;  A.coli = some_coli
    r = matvec_sparse_C(A, x)

    class MatDenseC:
        pass

    A = MatDenseC();  A.a = some_a
    r = matvec_dense_C(A, x)
Function overloading is not supported in Python, so each function must have a unique name. Also observe that in Python, we may declare an empty class and later add desired attributes. This enables run-time customization of interfaces.
2.3 Object-Based Implementation in Terms of Classes
The class concept, originally introduced in SIMULA and made popular through languages such as Smalltalk, C++, Java, C#, and Python, groups data structures together as in C structs, but the class may also contain functions (referred to as methods) operating on these data structures. An obvious class implementation of sparse and dense matrices would have the matrix-vector product as a method. The idea is that the internal data structures are invisible outside the class and that the methods provide algorithms operating on these invisible data. In the present example, the user of the class is then protected from knowing how a matrix is stored in memory and how the algorithm takes advantage of the storage scheme. A typical C++ implementation of a sparse matrix class may be outlined as follows.

    class MatSparse {
    private:
      Vecd a;
      Vec  rowp;
      Vec  coli;
      int  n, nz;
    public:
      MatSparse(double* a_, int* rowp_, int* coli_, int nz, int n)
        : a(a_, nz), rowp(rowp_, n+1), coli(coli_, nz) {}

      Vecd matvec(const Vecd& x) const {
        Vecd r = Vecd(x.size());
        matvec2(x, r);
        return r;
      }

      void matvec2(const Vecd& x, Vecd& r) const {
        matvec_sparse(a.ptr(), rowp.ptr(), coli.ptr(),
                      rowp.size()-1, a.size(),
                      x.ptr(), r.ptr());
      }

      Vecd operator* (const Vecd& x) const { return matvec(x); }
    };
There are three matrix-vector multiplication methods: one that takes a preallocated r array as argument, one that returns the result r in a new array, and one (similar) that overloads the multiplication operator such that r = Ax can be written in C++ code as r=A*x. Observe that if we have the plain C function matvec_sparse from Section 2.1, class MatSparse above may be thought of as a wrapper of an underlying C function library. The idea is now that all matrix storage formats are encapsulated in classes such as MatSparse, i.e., we may have classes MatDense, MatTridiagonal, MatBanded,
etc. All of these have the same interface from an application programmer's point of view:

    // A is MatSparse, MatDense, etc., x is Vecd
    Vecd r(n);
    r = A*x;
    r = A.matvec(x);
    A.matvec2(x, r);
This means that we may write a solver routine for a Conjugate Gradient-like method by using the same matrix-vector product syntax regardless of the matrix format:

    void some_solver(const MatSparse& A, const Vecd& b, Vecd& x)
    {
      ...
      r = A*x;
      ...
    }
Nevertheless, the arguments to the routine must be declared with specific object types, and here the explicit name MatSparse appears and thereby ties the function to a specific matrix type. Object-oriented programming, as explained in the next section, can remove the explicit appearance of the matrix object type.

What we did above in C++ can be exactly mirrored in Python:

    class MatSparse:
        """Class for sparse matrix with built-in arithmetics."""
        def __init__(self, a, rowp, coli):
            self.a, self.rowp, self.coli = a, rowp, coli   # store data

        def matvec(self, x):
            return matvec_sparse(self.a, self.rowp, self.coli, x)

        def matvec2(self, x, r):
            matvec_sparse2(self.a, self.rowp, self.coli, x, r)

        def __mul__(self, x):
            return self.matvec(x)
The __mul__ method defines the multiplication operator in the same way as operator* does in C++. Also this Python class merely acts as a wrapper of an underlying function library. Contrary to C++, there is no explicit typing in Python, so a single solver routine like

    def some_solver(A, x, b):
        ...
        r = A*x
        ...
will work for all matrix formats that provide the same interface. In other words, there is no need for object orientation or template constructs as we do in C++
to reach the same goal of having one solver that works with different matrix classes. We should add that the matrix classes shown above will normally be equipped with a more comprehensive interface. For example, it is common to provide a subscripting operator such that one can index sparse matrices by a syntax like A(i,j) or A[i,j]. More arithmetic operators and I/O support are also natural extensions.
2.4 Object-Oriented Implementation
The key idea behind object-oriented programming is that classes are organized in hierarchies and that an application code can treat all classes in a hierarchy in a uniform way. For example, we may create a class Matrix that defines (virtual) methods for carrying out the matrix-vector product (like matvec, matvec2, and the multiplication operator), but there are no data in class Matrix so it cannot be used to store a matrix. Subclasses implement various matrix formats and utilize the specific formats to write optimized algorithms in the matrix-vector product methods. Say MatDense, MatSparse, and MatTridiagonal are such subclasses. We may now write a solver routine where the matrix object is parameterized by the base class:

    void some_solver(const Matrix& A, const Vecd& b, Vecd& x)
    {
      ...
      r = A*x;
      ...
    }
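The class Matrix itself is only described in words here; the following is a minimal invented sketch of what such a hierarchy could look like (using std::valarray for Vecd as in the text), with a dense subclass whose storage details are made up for illustration:

    #include <valarray>
    typedef std::valarray<double> Vecd;

    // Abstract base class: defines the interface, but holds no matrix data.
    class Matrix {
    public:
      virtual ~Matrix() {}
      virtual Vecd matvec (const Vecd& x) const = 0;           // returns r = A*x
      virtual void matvec2(const Vecd& x, Vecd& r) const = 0;  // fills preallocated r
      Vecd operator* (const Vecd& x) const { return matvec(x); }
    };

    // One concrete subclass; MatSparse, MatTridiagonal, etc. follow the same
    // pattern with loops tailored to their own storage schemes.
    class MatDense : public Matrix {
      std::valarray<double> A;   // row-major n x n storage (invented detail)
      int n;
    public:
      MatDense(int n_) : A(0.0, std::size_t(n_*n_)), n(n_) {}
      double& operator()(int i, int j) { return A[i*n + j]; }
      virtual void matvec2(const Vecd& x, Vecd& r) const {
        for (int i = 0; i < n; ++i) {
          double s = 0.0;
          for (int j = 0; j < n; ++j) s += A[i*n + j]*x[j];
          r[i] = s;
        }
      }
      virtual Vecd matvec(const Vecd& x) const {
        Vecd r(0.0, x.size());
        matvec2(x, r);
        return r;
      }
    };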
Through the Matrix reference argument we may pass any instance of any subclass in the Matrix hierarchy, and C++ will at run-time keep track of which class type the instance corresponds to. If we provide a MatSparse instance as the argument A, C++ will automatically call the MatSparse::operator* method in the statement r=A*x. This hidden book-keeping provides a kind of "magic" which simplifies the implementation of the flexibility we want, namely to write a single function that can work with different types of matrix objects.

Python offers the same mechanism as C++ for object-oriented programming, but in the present case we do not need this construction to parameterize the matrix type away. The dynamic typing (i.e., the absence of declaring variables with types) solves the same problem as object orientation in C++ aims at. However, in some cases where some matrix classes can share data and/or code, it may be beneficial to use inheritance and even object-oriented programming. This is exemplified in Section 2.6.
2.5 Generic Programming with Templates in C++
C++ offers the template mechanism to parameterize types. With templates we can obtain much of the same flexibility as dynamic typing provides in Python.
For example, we can write the solver function with a template for the matrix type:

    template <class Matrix>
    void some_solver(const Matrix& A, const Vecd& b, Vecd& x)
    {
      ...
      r = A*x;
      ...
    }
Now we may feed in totally independent classes for matrices as long as these classes provide the interface required by the some_solver function. Sending (say) a MatSparse instance as A to this function causes the compiler to generate the necessary code where Matrix is substituted by MatSparse. Writing algorithms where the data are parameterized by templates is frequently referred to as generic programming.

This style encourages a split between data and algorithms (see Section 2.7), while object-oriented programming ties data and algorithms more tightly and forces classes to be grouped together in hierarchies. A consequence is that libraries based on object-oriented programming are often harder to reuse in new settings than libraries based on generic programming. The class methods in object-oriented programming are virtual, which implies a (small) overhead for each method call since the methods to call are determined at run-time. With templates, the methods to call are known at compile time, and the compiler can generate optimally efficient code.
2.6 Combining Python and Compiled Languages
Python offers very compact and clean syntax, which makes it ideal for experimenting with interfaces. The major deficiency is that loops run much more slowly than in compiled languages, because aggressive compiler optimizations are impossible when the variables involved in a loop may change their type during execution of the loop. Fortunately, Python was originally designed to be extended in C, and this mechanism can be used to easily migrate time-consuming loops to C, C++, or Fortran.

Integration of Python and Fortran is particularly easy with the aid of F2PY [4]. Say we have the matvec_sparse routine from Section 2.1. After putting this (and other desired routines) in a file somecode.f, it is merely a matter of running

    Unix/DOS> f2py -m matvec -c somecode.f
to automatically parse the Fortran code and create a module matvec that can be loaded into Python as if it were written entirely in Python. Either in the Fortran source code, or in an intermediate module specification file, the programmer must specify whether the r parameter is a pure output parameter or if it is both input and output (all parameters are by default input). Because all arguments are classified as input, output, or both, F2PY may provide both the interfaces matvec_sparse and matvec_sparse2 from Section 2.2 by calling the same underlying
Fortran routine. It is the F2PY-generated wrapper code that, in the former case, allocates a new result array in each call.

Class MatSparse can now be modified to call Fortran routines for doing the heavy work. The modification is best implemented as a subclass that can inherit all the functionality of MatSparse that we do not need to migrate to Fortran:

    class MatSparse_compiled(MatSparse):
        def __init__(self, a, rowp, coli):
            MatSparse.__init__(self, a, rowp, coli)
            import matvec;  self.module = matvec

        def matvec(self, x):
            # call Fortran routine, let wrapper allocate result:
            return self.module.matvec_sparse(self.a, self.rowp,
                                             self.coli, x)

        def matvec2(self, x, r):
            # call Fortran routine, use pre-allocated result r:
            self.module.matvec_sparse2(self.a, self.rowp,
                                       self.coli, x, r)
Note that it is not necessary to redefine __mul__ since the inherited version (from MatSparse) calls matvec, which behaves as a virtual method, i.e., the correct version of matvec in the subclass will be automatically called (we thus rely on object-oriented programming when we do not repeat __mul__ in the derived class). To summarize, the classes MatSparse and MatSparse_compiled have the same interface and can be used interchangeably in application code.

F2PY can also be used to wrap C code. However, it is hard to parse C code and detect which function arguments are arrays because (i) a pointer can be both an array and the address of a scalar, and (ii) there is no mechanism to associate array dimensions with a pointer. By expressing the function signatures in C in a Fortran 90-like module (an interface or .pyf file in F2PY terminology), F2PY can automatically generate extension modules from C sources. It is a simple matter to extend class MatSparse_compiled so that self.module either refers to a Fortran or to a C extension module. We have done this in the experiments to see if there are performance differences between Fortran and C.

There exist several techniques for migrating the matrix-vector product to C++. One technique is to apply a tool like SWIG [7] to automatically wrap the C++ class MatSparse such that it is mirrored in Python. Another alternative is to make a straight C function taking pointers and integers as arguments and then pass the arrays to a C++ function that creates a MatSparse instance for computation. The latter strategy has been implemented and will be referred to as Python calling C, calling C++ MatSparse::matvec2; a sketch of such a wrapper function is shown at the end of this section. The C function was made callable from Python by F2PY. A third technique is to use tools like Weave, Pyrex, PyInline, or Instant to write inline C or C++ code in Python.

When Python is combined with Fortran, C, or C++, it is natural and convenient to allocate memory on the Python side. Numerical Python arrays have contiguous memory layout and can be efficiently shuffled to compiled code as pointers.
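The wrapper function used for the Python calling C, calling C++ MatSparse::matvec2 strategy is not listed in the paper; the following invented sketch indicates how a C-callable function (with a made-up name) could forward plain arrays to the C++ class MatSparse from Section 2.3. The explicit copy back to r at the end avoids relying on whether Vecd borrows or copies the array data:

    // Invented sketch: a C-callable wrapper that builds a C++ MatSparse
    // object from plain arrays and lets it perform the matrix-vector product.
    extern "C" void matvec_sparse_cpp(double* a, int* rowp, int* coli,
                                      int n, int nz, double* x, double* r)
    {
      MatSparse A(a, rowp, coli, nz, n);   // wrap the raw arrays in the class
      Vecd xv(x, n);                       // vector view/copy of the input
      Vecd rv(n);                          // result vector managed by Vecd
      A.matvec2(xv, rv);                   // compute r = A*x via the C++ class
      for (int i = 0; i < n; ++i)          // copy the result back to the
        r[i] = rv[i];                      // caller's (Python-allocated) array
    }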
2.7 Iterators
Classical Fortran and C code applies do or for loops over integer indices to traverse array data structures. This coding style works in most other languages as well, but C, C++, and Fortran compilers are very good at optimizing such constructions. In recent years, iterators have become a popular alternative to loops with integer indices. The Standard Template Library (STL) [6] in C++ promoted the use of iterators in the mid 1990s, and the construction has gained wide use also in scientific computing (three extensive examples being Boost [1], MTL [5], and Dune [2]). One may view iterators as generalized loops over data structures. A matrix-vector product can be implemented as something like

    template <class Matrix>
    void matvec(const Matrix& A, const Vecd& x, Vecd& r)
    {
      r = 0;
      typename Matrix::iterator element;
      for (element = A.begin(); element != A.end(); ++element) {
        r(element.i()) += (*element)*x(element.j());
        //  index i          value       index j
      }
    }
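The iterator class assumed by this function is not shown in the paper. The following invented sketch indicates how a compressed-row-storage matrix class could expose an iterator with the required interface (begin, end, operator++, operator*, i, and j); the class name and all details beyond those mentioned in the text are made up:

    #include <vector>

    class MatSparseIter {                     // invented class name
      std::vector<double> a;                  // nonzero values
      std::vector<int> rowp, coli;            // row pointers and column indices
      int n;
    public:
      MatSparseIter(const std::vector<double>& a_, const std::vector<int>& rowp_,
                    const std::vector<int>& coli_)
        : a(a_), rowp(rowp_), coli(coli_), n(int(rowp_.size()) - 1) {}

      class iterator {
        const MatSparseIter* m;
        int row, pos;                         // current row, position in a/coli
        void locate_row() {                   // advance row until pos belongs to it
          while (row < m->n && pos >= m->rowp[row+1]) ++row;
        }
      public:
        iterator() : m(0), row(0), pos(0) {}
        iterator(const MatSparseIter* m_, int pos_) : m(m_), row(0), pos(pos_) {
          locate_row();
        }
        iterator& operator++() { ++pos; locate_row(); return *this; }
        double operator*() const { return m->a[pos]; }       // matrix value
        int i() const { return row; }                        // row index
        int j() const { return m->coli[pos]; }                // column index
        bool operator!=(const iterator& o) const { return pos != o.pos; }
      };

      iterator begin() const { return iterator(this, 0); }
      iterator end()   const { return iterator(this, int(a.size())); }
    };

With a class along these lines, the generic matvec function above traverses only the stored nonzeroes of the matrix.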
The idea is to have an iterator instance, here Matrix::iterator element, which acts as a generalized pointer to the elements of the data structure. The for loop starts with setting element to point to the beginning of the data structure A. The loop continues as long as element has not reached one step beyond the data in A. In each pass we update the element "pointer" by the unary Matrix::iterator::operator++ method. For a sparse matrix, this operator++ must be able to pass through all the nonzeroes in each row of A. The element instance also provides pointer dereferencing, through the unary operator*, for accessing the corresponding data element of A. The mathematical matrix indices are available through the methods element.i() and element.j().

We emphasize that there are many design considerations behind a particular iterator implementation. The one shown here is just an example. High-quality numerical libraries based on iterators provide different designs (MTL [5] applies separate row and column iterators, for instance). The idea of using iterators to implement a matrix-vector product is that the same function can be used for different matrix formats, as long as the provided matrix instance A offers an iterator class with the described interface and functionality. Using inline functions for the operator++, operator*, i, and j methods, the iterator class may with a clever implementation be as efficient as classical loops with integers. In the present example, with one iterator running through the whole sparse matrix, optimal efficiency is non-trivial.

In some ways iterators are both complementary and an alternative to templates, object-oriented programming, or dynamic typing. Iterators are here used to implement the matrix-vector product such that the same code, and the same call of course, can be reused for various matrix objects. The previous constructions in Sections 2.1–2.6 aimed at encapsulating different matrix-vector
implementations in terms of loops tailored to the underlying representation array of the matrix, such that the product looks the same in application code and the matrix type is parameterized away. Iterators and templates in combination (generic programming) provide a quite different solution to the same problem.

Iterators may be somewhat comprehensive to implement in C++. To play around with iterators to see if this is the desired interface or not, we strongly recommend using Python, because the language offers very convenient constructs for quickly prototyping iterators. For example, the sparse matrix iterator can be compactly implemented through a so-called Python generator function as follows:

    class MatSparse:
        ...
        def __iter__(self):
            """Iterator: for value, i, j in A:"""
            for i in xrange(len(self.rowp)-1):
                for c in xrange(self.rowp[i], self.rowp[i+1]):
                    yield self.a[c], i, self.coli[c]
We can now write the matrix-vector product function in a generic way as

    def matvec(A, x):
        r = zeros(len(x), x.dtype)
        for value, i, j in A:
            r[i] += value*x[j]
        return r
This implementation works well with all matrix instances that implement an iterator that returns a 3-tuple with the matrix value and the two indices. Observe how this compact for loop replaces the traditional nested loops with integer indices, and also notice how simple the mapping between the traditional loops and the new iterator is (in the MatSparse.__iter__ method).

A faster but also more memory-consuming implementation can be made by precomputing a (new sparse matrix) storage structure that holds the (value, i, j) tuple for each matrix entry:

    def __iter__(self):
        try:
            return iter(self._ap)   # use a precomputed list as iterator
        except AttributeError:
            # first time: precompute the list of (value, i, j) tuples
            self._ap = [(self.a[c], i, self.coli[c])
                        for i in xrange(len(self.rowp)-1)
                        for c in xrange(self.rowp[i], self.rowp[i+1])]
            return iter(self._ap)
3 Performance Experiments
We have run experiments comparing the various implementations sketched in Section 2 on an IBM X30 laptop with a 1.2 GHz Pentium III processor and 500 MB memory, running Ubuntu Linux (kernel 2.6.12), GNU compilers (gcc, g++, g77) version 4.0 with -O3 optimization, Python version 2.4, numpy version 1.0, and Numeric version 24.2.
The results in Table 1 show that there is no significant difference between implementations of the matrix-vector product r=A*x in C, C++, and Fortran (and especially no overhead in wrapping arrays in the C++ class MatSparse). Using a preallocated result array r is, as expected, most efficient in these tests where a large number (160) of matrix-vector products are performed. Letting the result array be allocated in each call gives 10-15% overhead, at the gain of getting more "mathematical" code, either r=matvec_sparse(A,x) or r=A*x.

The pure Python implementation may be useful for experimenting with alternative software designs, but the CPU time is increased by two orders of magnitude, so for most applications of Python in scientific computing one must migrate loops to compiled code. We also note that the new Numerical Python implementation numpy is still slower than the old Numeric for pure Python loops, but the efficiency of numpy is expected to increase in the future. Iterators are very conveniently implemented in Python, but these run slower than the plain loops (which we used internally in the iterator implementation). Even a straight for loop through a precomputed data structure tailored for the iterator is about as slow as plain loops. The overhead of loops in Python also drowns the expected positive effect of preallocating arrays.

When our Python class for sparse matrices has its loops migrated to compiled code, there is no loss of efficiency compared to a pure Fortran or C function working with plain arrays only. We have from additional experiments estimated that a Python call to a Fortran function costs about 70 Fortran floating-point multiplications, which indicates the amount of work that is needed on the Fortran side to make the overhead negligible.
Table 1. Scaled CPU times for various implementations of a matrix-vector product with vector length 50,000 and 231,242 nonzeroes
    Implementation                                                  numpy   Numeric
    Python calling matvec_sparse2, calling C, no array returned      1.0     1.0
    Python calling matvec_sparse2, calling F77, no array returned    1.0     1.0
    Python calling C, calling C++ MatSparse::matvec2                 1.0     1.0
    Python MatSparse_compiled.matvec2                                 1.0     1.0
    Python calling matvec_sparse, calling F77, new array returned    1.10    1.12
    Python calling matvec_sparse, calling C, new array returned      1.13    1.15
    Python MatSparse_compiled.matvec                                  1.13    1.12
    Python MatSparse_compiled: r=A*x                                  1.16    1.13
    Python iterator with new precomputed matrix data                  136     89
    Python MatSparse.matvec                                           153     92
    Python MatSparse.matvec2                                          153     92
    Python MatSparse: r=A*x                                           153     92
    Python function matvec_sparse                                     153     92
    Python iterator based on generator method                         228     156
4 Concluding Remarks
We have in this paper reviewed the ideas of function libraries, object-based programming, object-oriented programming, generic programming, and iterators, using a simple example involving the implementation of a matrix-vector product. The role of the traditional high-performance computing languages Fortran and C, along with the more recently accepted language C++ and the potentially interesting Python language, has also been discussed in the context of the example.

A driving force in the move from "low-level" languages like Fortran and C to higher-level languages such as C++ and Python is to offer interfaces to matrix operations where the specific details of the sparse matrix storage are hidden from the application programmer. That is, one goal is to write one matrix-vector call and make it automatically work for all matrix types. Object-oriented programming, generic programming with templates and iterators, as well as the dynamic typing in Python, provide different means to reach this goal.

A performance study shows that low-level (and fast) Fortran or C code can be wrapped in classes, either in C++ or Python, with negligible overhead when the arrays are large, as in the present tests. Nor is there any difference in speed between loops in Fortran, C, and C++ on the author's machine with GNU v4.0 compilers.

The author's most important experience from this case study is that scientific programming is much more convenient in Python than in the more traditional languages. Therefore, Python shows a great potential in scientific computing, as it promotes quick and convenient prototyping of user-friendly interfaces, while (with a little extra effort of migrating loops to compiled code) maintaining the performance of traditional implementations.
References

1. Boost C++ libraries, http://www.boost.org/libs/libraries.htm
2. Dune library, http://www.dune-project.org
3. Numerical Python software package, http://sourceforge.net/projects/numpy
4. Peterson, P.: F2PY software package, http://cens.ioc.ee/projects/f2py2e
5. Siek, J., Lumsdaine, A.: A modern framework for portable high-performance numerical linear algebra. In: Arge, E., Bruaset, A.M., Langtangen, H.P. (eds.) Advances in Software Tools for Scientific Computing. Springer, Heidelberg (1999)
6. Stroustrup, B.: The C++ Programming Language, 3rd edn. Addison-Wesley, Reading (1997)
7. SWIG software package, http://www.swig.org