ISSN 1500-6050
Oslo Scientific Computing Archive Report 1998-4
Optimizing C++ Code for Explicit Finite Difference Schemes
Elizabeth Acklam, Anders Jacobsen, Hans Petter Langtangen
March 15, 1998
Aims and scope: Traditionally, scientific documentation of many of the activities in modern scientific computing, such as code design and development, software guides and results of extensive computer experiments, has received minor attention, at least in journals, books and preprint series, although the results of such activities are of fundamental importance for further progress in the field. The Oslo Scientific Computing Archive is a forum for documenting advances in scientific computing, with a particular emphasis on topics that are not yet covered in the established literature. These topics include design of computer codes, utilization of modern programming techniques, like object-oriented and object-based programming, user's guides to software packages, verification and reliability of computer codes, visualization techniques and examples, concurrent computing, technical discussions of computational efficiency, problem solving environments, description of mathematical or numerical methods along with a guide to software implementing the methods, results of extensive computer experiments, and review, comparison and/or evaluation of software tools for scientific computing. The archive may also contain the software along with its documentation. More traditional development and analysis of mathematical models and numerical methods are welcome, and the archive may then act as a preprint series. There is no copyright, and the authors are always free to publish the material elsewhere. All contributions are subject to quality control.
Oslo Scientific Computing Archive 1998-4
Revised March 15, 1998
Title
Optimizing C++ Code for Explicit Finite Difference Schemes
Contributed by
Elizabeth Acklam, Anders Jacobsen, Hans Petter Langtangen
Communicated by
Are Magnus Bruaset
Oslo Scientific Computing Archive is available on the World Wide Web. The format of the contributions is chosen by the authors, but is restricted to PostScript files, generated from LaTeX, and HTML files for documents with movies and text, and compressed tar-files for software. There is a special LaTeX style file and instructions for the authors. There is also a standard for the use of HTML. All documents must easily be printed in their complete form.
Contents
1 Introduction
2 Example on C++ Abstractions in Diffpack
3 Outlining an Optimizing Strategy
4 When and How to Introduce the Methods
5 Concluding Remarks
This report should be referenced as shown in the following BibTeX entry:

  @techreport{OSCA1998-4,
    author = "Elizabeth Acklam and Anders Jacobsen and Hans Petter Langtangen",
    title  = "Optimizing C++ Code for Explicit Finite Difference Schemes",
    type   = "Oslo Scientific Computing Archive ",
    note   = "URL: http://www.math.uio.no/OSCA ; ISSN 1500-6050",
    number = "\#{}1998-4",
    year   = "March 15, 1998",
  }
Optimizing C++ Code for Explicit Finite Difference Schemes
Elizabeth Acklam^1 Anders Jacobsen^2 Hans Petter Langtangen^3

Abstract
Most of the CPU time in explicit finite difference schemes is spent on array traversal in nested loops. Implementation of such schemes using high-level field classes in C++ tends to decrease the efficiency significantly compared to a plain Fortran 77 code. The present note outlines an optimization strategy, where the programmer can verify the code using high-level C++ objects and then in a step-by-step refinement procedure introduce more complicated and efficient array look-up methods. The resulting code runs at a speed very close to Fortran 77 on all the tested platforms. The note reports our experience with the ability of various compilers to optimize inline functions and array operations.
^1 Numerical Objects A.S. Email: [email protected].
^2 Petroleum Geo Services (PGS). Email: [email protected].
^3 Mechanics Division, Dept. of Mathematics, University of Oslo, P.O. Box 1053 Blindern, N-0316 Oslo, Norway. Email: [email protected].

Index Optimization in C++

1 Introduction
The use of C++ for scientific computing has grown significantly in recent years. C++ supports object-based and object-oriented programming, as well as template programming and attractive syntax via operator overloading. Experience in the 1990s has shown that these programming features make software development faster and more reliable. However, the computational efficiency of C++ has always been of some concern in the scientific computing community. Many of the attractive features of C++ that are promoted in non-numerical textbooks easily lead to code that runs dramatically slower than a corresponding program using plain Fortran 77 features. Several studies [1, 5] have also revealed that C++ can run at the same speed as Fortran 77, but this requires careful coding. Loops with single-indexed arrays seem to be easily optimized by C and C++ compilers. The study [1], involving BLAS 1 operations and finite element simulators, showed that even small, hand-tuned Fortran 77 programs did not run significantly faster than their C++ counterparts using very generic, high-level libraries.

When we implemented an explicit 3D finite difference scheme for the heat equation, using high-level field abstractions in Diffpack, the program ran up to a factor of 10 slower than a similar Fortran 77 program, despite our original struggle to make the design of the field abstractions computationally efficient. We therefore began to investigate the reasons why C++ was slower in this common test case and how a small part of the simulation code, containing the finite difference scheme, could be safely optimized to give an overall speed close to Fortran 77. In other words, the purpose was to combine the high-level field abstractions, which increase the reliability and simplicity of the software development, with low-level C array syntax, which ensures the speed of Fortran 77. Timing results are presented for a wide range of common Unix platforms.
2 Example on C++ Abstractions in Diffpack
Diffpack [2] is a C++ library with various layers of abstractions. In the bottom layer we have a very simple template class VecSimplest encapsulating a plain C array:

  template <class T>
  class VecSimplest
  {
  protected:
    T* A;          // C array
    int length;
  public:
    T& operator () (int i) { return A[i]; }
  };

This class is as efficient as a plain C array since the indexing operator is an inlined function. Multi-dimensional arrays are implemented in a subclass ArrayGenSimplest. The additional code contains information on the length of the array in each of its dimensions as well as a multi-index operator.

  template <class T>
  class ArrayGenSimplest : public virtual VecSimplest<T>
  {
    int nm[3];     // length of each dimension
  public:
    int index (int i, int j, int k);
    T& operator () (int i, int j, int k) { return this->A[index(i,j,k)]; }
  };

Note that the function index involves a formula for transferring a triple index to a single index. There are various ways of implementing such a function. Diffpack actually stores all constants of the function in internal variables and applies the formula of the function directly inside the operator() function. For numerical array computations one needs member functions for taking the norm, inner product and so on. Fictitious grid points at the boundary (often called ghost boundary) are also handy in an array class for finite difference computations. These additional features are offered in yet another subclass,

  template <class T>
  class ArrayGenSel : public ArrayGenSimplest<T>

The array look-up procedure in terms of operator()(i,j,k) is of course inherited from class ArrayGenSimplest. Field objects usually consist of a grid and a set of function values at the grid points. The latter quantities are normally represented in terms of an array object. Since it can sometimes be convenient to let several fields share the same values, the field object should have a pointer to an array object. In Diffpack we employ smart pointers, called handles, because these can perform reference counting and a form of garbage collection, such that no object can be deleted before all its users are deleted. The smart pointer looks like this:

  template <class T>
  class Handle     // smart pointer
  {
    T* ptr;
    // reference counting and other intelligent features
  public:
    T* operator -> () { return ptr; }
    T& operator * ()  { return *ptr; }
  };

In theory, this smart pointer should be as efficient as an ordinary pointer since the operator-> function is inlined. Finally, we can outline the essential parts of our field object for finite difference programming:

  class FieldFD
  {
    // grid
    Handle<ArrayGenSel<real> > vec;   // array of point values
  public:
    real& valueIndex (int i, int j, int k) { return (*vec)(i,j,k); }
  };

Overloaded versions of valueIndex can, e.g., handle staggered grids, with half-index subscripting. In principle, a valueIndex(i,j,k) call should lead through a chain of pointers to an ordinary, one-dimensional C array. The fundamental question is: Will the compiler inline all the nested inline functions and reduce the chain of pointers such that the code is as efficient as a plain C array?
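The layering in question can be imitated in a small stand-alone sketch. The classes below (SimpleVec, SimpleArray3 and SimpleField) are hypothetical names and are not part of Diffpack; they assume 0-based indices, a std::vector buffer, and a raw non-owning pointer in place of a reference-counting handle. The sketch only shows the chain of nested in-class (and hence implicitly inline) functions that a compiler would have to collapse.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef double real;

// Bottom layer: thin wrapper around a contiguous buffer (0-based in this sketch).
template <class T>
class SimpleVec {
protected:
    std::vector<T> a;
public:
    explicit SimpleVec(std::size_t n) : a(n) {}
    T& operator()(std::size_t i) { return a[i]; }   // defined in-class, implicitly inline
};

// Multi-index layer: maps (i,j,k) to a single index with i running fastest.
template <class T>
class SimpleArray3 : public SimpleVec<T> {
    std::size_t nx, ny, nz;
public:
    SimpleArray3(std::size_t nx_, std::size_t ny_, std::size_t nz_)
        : SimpleVec<T>(nx_ * ny_ * nz_), nx(nx_), ny(ny_), nz(nz_) {}
    std::size_t index(std::size_t i, std::size_t j, std::size_t k) const {
        assert(i < nx && j < ny && k < nz);
        return (k * ny + j) * nx + i;
    }
    T& operator()(std::size_t i, std::size_t j, std::size_t k) {
        return SimpleVec<T>::operator()(index(i, j, k));
    }
};

// Field layer: reaches the array through a pointer, as the handle does in the text.
class SimpleField {
    SimpleArray3<real>* vec;   // non-owning; no reference counting in this sketch
public:
    explicit SimpleField(SimpleArray3<real>* v) : vec(v) {}
    real& valueIndex(std::size_t i, std::size_t j, std::size_t k) {
        return (*vec)(i, j, k);
    }
};

// Writes through the high-level chain and reads back directly from the array.
real roundTrip() {
    SimpleArray3<real> a(4, 3, 2);
    SimpleField f(&a);
    f.valueIndex(1, 2, 1) = 7.5;
    return a(1, 2, 1);
}
```

Whether a call such as f.valueIndex(i,j,k) ends up as cheap as a direct array access depends on the compiler and its optimization settings; the sketch makes no claim about any particular outcome.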
3 Outlining an Optimizing Strategy
In a program solving partial differential equations using finite difference methods, accessing the array elements is the dominating time-consuming part of the code. The efficiency of the method used to access the elements is therefore of vital importance. However, increasing the efficiency of a program often increases the chance of introducing errors into it. The most efficient code is often the hardest to read, hence errors may also be hard to detect. A safe strategy is therefore to write the program using inefficient, but safe tools, like class FieldFD. When the program is running and all the errors have been removed, one can improve the efficiency of the code by using one of the strategies presented below. As a working example we study an explicit finite difference scheme for a 3D heat conduction problem. The governing equation reads

\[
\begin{aligned}
\frac{\partial u}{\partial t} &= \nabla^2 u, & (x,y,z) &\in \Omega, \\
u(x,y,z,0) &= f(x,y,z), & (x,y,z) &\in \Omega, \\
u(x,y,z,t) &= g(x,y,z,t), & (x,y,z) &\in \partial\Omega,
\end{aligned}
\tag{1}
\]
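and a corresponding explicit numerical scheme reads

\[
u^{n+1}_{i,j,k} = u^{n}_{i,j,k} + \Delta t \left[
\frac{1}{(\Delta x)^2}\left(u^{n}_{i+1,j,k} - 2u^{n}_{i,j,k} + u^{n}_{i-1,j,k}\right)
+ \frac{1}{(\Delta y)^2}\left(u^{n}_{i,j+1,k} - 2u^{n}_{i,j,k} + u^{n}_{i,j-1,k}\right)
+ \frac{1}{(\Delta z)^2}\left(u^{n}_{i,j,k+1} - 2u^{n}_{i,j,k} + u^{n}_{i,j,k-1}\right)
\right].
\]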
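As a point of reference, one time step of the scheme can be sketched on a flat array without any class layers. The function heatStep and its helper are hypothetical and assume a 0-based array with i running fastest and boundary values that are simply left untouched; this is not the Diffpack code discussed in this note.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One explicit time step on an nx*ny*nz grid stored in a flat, 0-based array
// with i running fastest. Only interior points are updated.
void heatStep(const std::vector<double>& up, std::vector<double>& u,
              std::size_t nx, std::size_t ny, std::size_t nz,
              double dt, double dx, double dy, double dz)
{
    assert(up.size() == nx * ny * nz && u.size() == up.size());
    const double dtdx2 = dt / (dx * dx);
    const double dtdy2 = dt / (dy * dy);
    const double dtdz2 = dt / (dz * dz);
    const std::size_t sj = nx, sk = nx * ny;   // strides in the j and k directions
    for (std::size_t k = 1; k + 1 < nz; ++k)
        for (std::size_t j = 1; j + 1 < ny; ++j)
            for (std::size_t i = 1; i + 1 < nx; ++i) {
                const std::size_t n = k * sk + j * sj + i;
                u[n] = up[n]
                     + dtdx2 * (up[n + 1]  - 2.0 * up[n] + up[n - 1])
                     + dtdy2 * (up[n + sj] - 2.0 * up[n] + up[n - sj])
                     + dtdz2 * (up[n + sk] - 2.0 * up[n] + up[n - sk]);
            }
}

// Usage helper: a unit spike at the centre of a 3x3x3 grid with unit spacing.
double spikeAfterOneStep(double dt) {
    std::vector<double> up(27, 0.0), u(27, 0.0);
    up[13] = 1.0;                          // centre point (1,1,1)
    heatStep(up, u, 3, 3, 3, dt, 1.0, 1.0, 1.0);
    return u[13];                          // equals 1 - 6*dt for unit spacing
}
```

With unit spacing the centre value of the spike drops from 1 to 1 - 6*dt after one step, which gives a quick sanity check of the stencil.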
Implementing this using the Diffpack class FieldFD to represent $u^{n+1}$ and $u^{n}$ gives the following loop for updating u:

  int k0=1,j0=1,i0=1;
  double dtdx2=dt/(dx*dx), dtdy2=dt/(dy*dy), dtdz2=dt/(dz*dz);
  #define U(i,j,k)  u->valueIndex(i,j,k)
  #define Up(i,j,k) u_prev->valueIndex(i,j,k)
  initCond();
  for ( int t=1; t
values();
and leaving the rest of the code unchanged. As the loops in the program have not been altered, no new errors can have been introduced in them. For further refinement we need to study the ArrayGenSel class more closely. ArrayGenSel implements a multidimensional array in terms of a one-dimensional C array. This means that every time we access an array element using three indices, these indices must be converted into one global index before the contents of the entry can be returned. When the base of the array is 1, the single index $n(i,j,k)$ is given by

\[
n(i,j,k) = (k-1)\,x_l y_l + (j-1)\,x_l + i,
\]

where $x_l$ and $y_l$ are the numbers of grid points in the x and y directions, respectively. This could be implemented more efficiently as

\[
n(i,j,k) = k\,x_l y_l + j\,x_l + i + C, \tag{2}
\]

where C is given by

\[
C = -(x_l y_l) - x_l.
\]
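The two index formulas can be compared in a short sketch. The struct Index3 and its member names are hypothetical and not taken from ArrayGenSel; lx and ly stand for the numbers of grid points in the x and y directions, and all indices are base 1 as in the text.

```cpp
#include <cassert>

// Maps base-1 indices (i,j,k) to a base-1 single index, i running fastest.
struct Index3 {
    int lx;     // number of grid points in the x direction
    int lxly;   // product of the grid point counts in x and y
    int C;      // precomputed constant from the folded formula
    Index3(int lx_, int ly_) : lx(lx_), lxly(lx_ * ly_), C(-lx_ * ly_ - lx_) {
        assert(lx_ > 0 && ly_ > 0);
    }

    // Direct form: subtracts 1 from k and j on every call.
    int direct(int i, int j, int k) const { return (k - 1) * lxly + (j - 1) * lx + i; }

    // Folded form: the subtractions are absorbed into the constant C.
    int folded(int i, int j, int k) const { return k * lxly + j * lx + i + C; }
};

// Checks that both forms agree on every point of an lx*ly*lz grid.
bool formsAgree(int lx, int ly, int lz) {
    Index3 idx(lx, ly);
    for (int k = 1; k <= lz; ++k)
        for (int j = 1; j <= ly; ++j)
            for (int i = 1; i <= lx; ++i)
                if (idx.direct(i, j, k) != idx.folded(i, j, k)) return false;
    return true;
}
```

Both forms give the same index at every grid point. The folded form replaces two subtractions per call with one addition of a constant that is computed once.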
To avoid some of the index calculations, we can use the function local(i,j,k) in ArrayGenSel. This function allows us to access array elements using indices relative to a temporarily fixed index set by the setLocalIndex function. The setLocalIndex function calculates one single index using (2). Then local adds or subtracts a suitable offset from this fixed single index and returns the value in that array element. If the fixed index is $n$ and it is set by $(i,j,k)$, the element $(i,j,k+1)$ is obtained by using local(0,0,1), which returns the array element at $n + x_l y_l$. Similarly, $(i,j-1,k)$ is reached by using local(0,-1,0), which returns the entry at $n - x_l$. The resulting program code, localMInd, is also intuitively similar to the numerical scheme.
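Before the listing, the relative-index idea can be illustrated in isolation. The class LocalArray3 below is a hypothetical stand-in, not the ArrayGenSel implementation; it borrows only the function names setLocalIndex and local from the text and assumes base-1 indices with i running fastest.

```cpp
#include <cassert>
#include <vector>

// Hypothetical array with a movable fixed index. Indices are base 1.
class LocalArray3 {
    std::vector<double> a;
    int lx, lxly, C;   // strides and the constant from formula (2)
    int n;             // current fixed single index (base 1)
public:
    LocalArray3(int lx_, int ly_, int lz_)
        : a(lx_ * ly_ * lz_, 0.0), lx(lx_), lxly(lx_ * ly_),
          C(-lx_ * ly_ - lx_), n(1) {}

    // Full index computation, intended to be done once per grid point.
    void setLocalIndex(int i, int j, int k) { n = k * lxly + j * lx + i + C; }

    // Access relative to the fixed index.
    double& local(int di, int dj, int dk) {
        const int m = n + dk * lxly + dj * lx + di;
        assert(m >= 1 && m <= static_cast<int>(a.size()));
        return a[m - 1];   // the underlying storage is 0-based
    }
};

// Writes two neighbours of (2,2,1) and reads them back at their own positions.
double neighbourDemo() {
    LocalArray3 u(4, 3, 2);
    u.setLocalIndex(2, 2, 1);
    u.local(0, 0, 1) = 5.0;              // element (2,2,2)
    u.local(0, -1, 0) = 3.0;             // element (2,1,1)
    u.setLocalIndex(2, 2, 2);
    const double upper = u.local(0, 0, 0);
    u.setLocalIndex(2, 1, 1);
    return upper + u.local(0, 0, 0);
}
```

In this sketch local still multiplies the offsets by the strides. Any saving relies on the compiler inlining the call and folding constant offsets such as (0,0,1) into a single addition, which is an assumption rather than a guarantee.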
  int k0=1,j0=1,i0=1;
  double dtdx2=dt/(dx*dx), dtdy2=dt/(dy*dy), dtdz2=dt/(dz*dz);
  ArrayGen(real)& U = u->values();
  ArrayGen(real)& Up = u_prev->values();
  initCond();
  for ( int t=1; t