Numerical Study of Three-Dimensional Flow using Fast Parallel Particle Algorithms

Gavin J. Pringle

A dissertation submitted to the Department of Mathematics, Napier University, Edinburgh, in partial fulfilment of the requirements for the degree of Doctor of Philosophy.

February 10, 1994
Abstract

Numerical studies of turbulent flows have always been prone to crude approximations due to the limitations in computing power. With the advent of supercomputers, new turbulence models and fast particle algorithms, more highly resolved models can now be computed.

Vortex Methods are grid-free and so avoid a number of shortcomings of grid-based methods for solving turbulent fluid flow equations; these include such problems as poor resolution and numerical diffusion. In these methods, the continuum vorticity field is discretised into a collection of Lagrangian elements, known as vortex elements, which are free to move in the flow field they collectively induce. The vortex element interaction constitutes an N-body problem, which may be calculated by a direct pairwise summation method, in a time proportional to N². This time complexity may be reduced by use of fast particle algorithms. The most common algorithms are known as the N-body Treecodes and have a hierarchical structure.

An in-depth investigation of one high performance Treecode, namely the Fast Multipole Method, or FMM, is conducted for uniform distributions in both 2 and 3 dimensions. This method has a complexity of O(N) and is most suited to N-body simulations characterised by large N and a high precision. The precision of the FMM may be known a priori, and a tolerance parameter is determined in terms of this prescribed precision. The most commonly used expression to determine this tolerance parameter is shown to be too conservative, which results in an excess of computational effort. Thus, through an investigation of the error-bounds, the value of the tolerance parameter may be reduced, and hence the time for execution may be reduced without compromising the required precision. Other methods of optimising the FMM parameters are also discussed. It is found that a number of rough approximations, such as `super-nodes', have been introduced to the FMM which, under certain conditions, reduce the time for execution at the cost of losing control over the accuracy.

A parallel version of the FMM in both 2 and 3 dimensions is implemented on the Meiko Computing Surface, CS-1, which is a Multiple-Instruction-Multiple-Data, or MIMD, distributed memory parallel computer, using a Single-Program-Multiple-Data, or SPMD, local domain decomposition, explicit message-passing paradigm. Two scalable communication algorithms are required: a systolic loop and nearest-neighbour processor inter-communication. The latter algorithm exhibits poor load balancing, as the boundary processors will lie idle for some of the time.

The `break-even' point between the FMM and the `direct' method, i.e. the point where the two methods take the same amount of time to execute, is N ≈ 180 and N ≈ 5000 particles for the sequential 2D and 3D versions respectively. For the parallel versions the corresponding points are N ≈ 70 for 2D and N ≈ 1000 for 3D.

In the context of one particular Vortex Method it is shown how the FMM can be embedded within a simulation of a 3-dimensional vortex ring. This simulation dictates a highly non-uniform distribution of vortex elements, and thus this parallel version of the FMM exhibits poor load balancing.
Declaration

I hereby declare that all the work presented in this thesis has been carried out by myself between September 1990 and January 1994. No part of this work has previously been submitted in support of a degree validated by a University.
© Gavin James Pringle, 1994.
Acknowledgements

The work was carried out at Napier University of Edinburgh, with collaboration from the University of California at Berkeley, Lawrence Berkeley Laboratories and the University of Edinburgh. The parallel computer employed is the Meiko Computing Surface, CS-1, housed at the Edinburgh Parallel Computing Centre, EPCC. This work was supervised by Dr D. Summers (Napier University), Dr D. Heggie (Edinburgh University), Dr D. Roberts (Napier University) and G.R. Henderson (Napier University).
Many thanks to: Dave S. for his talks, direction, support and for proposing the topic, Douglas for his patience and understanding, Dave R. for his general organisation and pertinent questions, Alexandre Chorin for providing the fluid-flow code and Aaron Longbottom for his assistance. Finally to Louis and Alison, Julie and Dave, Rob, Beto and Vale for their continuing support and friendship.
This work is dedicated to the memory of Frank Zappa.
Contents

1  Introduction
   1.1  Vortex Methods
   1.2  The N-body Problem
   1.3  Fast Summation Techniques
   1.4  Organisation of the Thesis

2  Fast Multipole Method in 2 Dimensions
   2.1  Informal Description of the FMM Algorithm
   2.2  The Mathematical Operators
        2.2.1  Translation of a Multipole Expansion
        2.2.2  Conversion of a Multipole Expansion into a Local Expansion
        2.2.3  Translation of a Local Expansion
   2.3  The Time for Execution of the 2DFMM
   2.4  General Implementation Details
        2.4.1  Exploiting the Symmetry of a Quad-tree
        2.4.2  A Basic Element of Adaptivity
   2.5  Error Analysis for the 2DFMM
        2.5.1  Error Bound of the Truncated Multipole Expansion
        2.5.2  Error Bound of the Truncated Local Expansion
        2.5.3  Accumulated errors
        2.5.4  Numerical Trials
   2.6  The Dynamic p Principle
   2.7  Timing Results for the Sequential 2DFMM

3  Fast Multipole Method in 3 Dimensions
   3.1  Mathematical Operators
        3.1.1  Multipole Expansion
        3.1.2  Translation of a Multipole Expansion
        3.1.3  Conversion of a Multipole Expansion into a Local Expansion
        3.1.4  Translation of a Local Expansion
   3.2  Associated Legendre Polynomials
   3.3  The Time for Execution of the 3DFMM
   3.4  Implementation Details
        3.4.1  Exploiting the Symmetry of an Oct-tree
        3.4.2  Exploiting the Symmetry of spherical harmonics
        3.4.3  A Basic Element of Adaptivity
   3.5  Error Analysis for the 3DFMM
        3.5.1  Error Bound of the Truncated Multipole Expansion
        3.5.2  Error Bound of the Truncated Local Expansion
        3.5.3  Accumulated errors
        3.5.4  Numerical Trials
   3.6  The Dynamic p Principle
   3.7  The Optimum Interaction Set
   3.8  Timing Results for the Sequential 3DFMM

4  A Discussion of Alternative Fast Summation Methods
   4.1  Zhao's Cartesian 3DFMM
        4.1.1  The Use of Super-nodes in 2 Dimensions
        4.1.2  The Use of Super-nodes in 3 Dimensions
   4.2  Anderson's `FMM without the multipoles'
   4.3  The Barnes and Hut Algorithm
   4.4  Buttke's `FMM without the hierarchy'

5  Parallel Computers and Treecodes
   5.1  A Brief Review of Parallel Computing
        5.1.1  The Three Types of Parallel Computer
        5.1.2  Programming Considerations
   5.2  Methods of Parallelising the N-body Treecodes

6  A Parallel FMM on the Meiko Computing Surface
   6.1  The Meiko Computing Surface, CS-1
        6.1.1  The CS Tools Communication Harness
   6.2  The Programming Strategy
   6.3  Domain Decomposition
        6.3.1  A Scalable Parallel Model
        6.3.2  Scattered Domain Decomposition
   6.4  The Required Communication Strategies
        6.4.1  CS Tools commands and definitions
        6.4.2  The Systolic Loop
        6.4.3  Nearest-Neighbour Processor Inter-Communications
   6.5  The Parallel FMM Algorithm Description
   6.6  The Packing and Unpacking of Data
        6.6.1  2 Dimensions
        6.6.2  3 Dimensions
   6.7  The `Optimum' Interaction Set for the Parallel 3DFMM
   6.8  Timing Results
        6.8.1  2 Dimensions
        6.8.2  3 Dimensions
   6.9  Discussion of the Parallel Implementation
        6.9.1  Memory Constraints

7  Fluid Dynamical Application
   7.1  Background Information
   7.2  Random Vortex Methods
        7.2.1  Navier-Stokes and the Vorticity Transport Equations
   7.3  Calculating the Derivative of the Potential
        7.3.1  The Error of the Derivative of a Truncated Local Expansion
   7.4  Application of the FMM; The Vortex Torus
        7.4.1  Embedding the 3DFMM
        7.4.2  The Parallel Program
        7.4.3  Program Verification
        7.4.4  Results

8  Conclusions and Future Directions
   8.1  The Fast Multipole Method
   8.2  The Parallelisation of the FMM
   8.3  The Vortex Application
        8.3.1  An Alternative Adaptive FMM

A  Derivation of Error Bounds
   A.1  Derivation of a Stricter Error Bound for the Truncated Local Expansion in 2 Dimensions
   A.2  The Error of the Derivative of a Truncated Local Expansion
        A.2.1  2 Dimensions
        A.2.2  3 Dimensions

B  The Pseudo-code for the Systolic Loop through a 3D Grid

C  Nearest Neighbour Processor Blocking Inter-Communications

Bibliography
List of Tables

2.1  The FMM algorithm in pseudo-code
2.2  Iteration values using eqn.(2.12), c = 2.8281 and ε = 10^-3
3.1  Estimates for p, for varying ε; c = 2.4641
3.2  Computed absolute and relative errors for a 2DFMM
4.1  The super-nodes for second nearest neighbours in the near-field
4.2  The super-nodes for nearest neighbour cells only in the near-field
6.1  Pseudo-code for a scalable systolic loop in 2D
6.2  Nearest-neighbour processor scalable inter-communications
6.3  The parallel FMM algorithm in pseudo-code
6.4  Regression analysis for parallel 2DFMM times
6.5  The value of s for each level transition
6.6  Regression analysis for the parallel 3DFMM times
7.1  Parallel pseudo-code for the calculation of the velocities
7.2  Velocity program verification
B.1  3D Scalable Systolic Loop Pseudo-code
C.1  Nearest-neighbour scalable blocking inter-communications in 2D
C.2  Nearest-neighbour scalable blocking inter-communications in 3D
List of Figures

2.1   Diagram indicating near-field cells
2.2   Hierarchical mesh refinement
2.3   The interaction set of a single cell for a 2DFMM
2.4   The local expansion for disc D2 is formed from the multipole coefficients of disc D1
2.5   Numbering of the 4 child cells in 2D
2.6   The relationship between p and the execution time
2.7   Graphical representation of ε as a function of p
2.8   Configuration of 3 points in 2D, c = 1.828
2.9   Empirical and theoretical errors, c = 1.828
2.10  The 3×3 and 4×4 configurations
2.11  Graph of the empirical and analytical errors, 3×3 configuration, c = 3.0
2.12  Graph of the empirical and analytical errors, 4×4 configuration, c = 5.0
2.13  Dog-leg configuration
2.14  Empirical and theoretical errors, c = 2.1623
2.15  Typical timing curves for a 2DFMM, p = 11
2.16  Timing curves for the 2DFMM, `direct', p = 11, p = 22
3.1   A cubic domain with 3 levels of refinement
3.2   Points P and Q, with an angle subtended between OP and OQ
3.3   Numbering of the 8 child cells
3.4   The relationship between p and the execution time
3.5   Points P and Q, with an angle subtended between OP and OQ
3.6   Equation (3.24) plotted for varying angles
3.7   Second nearest neighbours configuration, c = 2.4641
3.8   Empirical and theoretical errors, c = 2.4641
3.9   Nearest neighbour configuration, c = 1.309
3.10  Empirical and theoretical errors, c = 1.309
3.11  Empirical and theoretical errors, c = 3.0
3.12  Typical timing curve for the 3DFMM, p = 3
3.13  Timing curves for the 3DFMM, p = 3 and p = 8, and the direct summation method
4.1   Super-nodes in 2 dimensions
5.1   A quad-tree, where each node represents both one processor and one cell
6.1   Mapping for a 4×3 grid onto 12 processors
6.2   Four processors in a loop
6.3   A systolic loop through a 2D grid of processors
6.4   Revised systolic loop layout for a large number of processors
6.5   Communication topology which produces unpredictable results
6.6   Parallel 2DFMM execution times, p = 11
6.7   2D parallel efficiency curves
6.8   Parallel 3DFMM execution times, p = 3
6.9   3D parallel efficiency curves
7.1   A single tube, divided into 8 segments
7.2   Evolving vortex torus, where only one tube is used for clarity
7.3   Execution times for the sequential 3DFMM, non-uniform distribution
Chapter 1

Introduction

Computational modelling of turbulent flow is an important enterprise due to its significant practical applications, and is a popular subject for its fascinating complexity and chaotic nature. However, due to the limitations in computing power and the vast amounts of data and time required to obtain high-resolution simulations, numerical studies of turbulent flows have always been prone to crude approximations. In this thesis we show how three different fields may be combined to aid this computation. Firstly, Vortex Methods are used to enhance the results of turbulent fluid flow models; secondly, Fast Summation Methods reduce the time to execute these vortex methods; and finally, Parallel Computers can be utilised to allow for very large data sets, and to perform the calculations within acceptable time scales. By the union of these fields, we may now realise simulations which once seemed intractable.
1.1 Vortex Methods

Vortex methods [24, 72] are grid-free, and therefore avoid a number of shortcomings of Eulerian, grid-based methods for solving fluid flow equations - such as poor resolution and numerical diffusion. They do this by discretising the continuum vorticity field - the curl of the velocity field - into discrete Lagrangian elements, known as vortex elements. These elements are free to move in the flow field which they create. The velocity field induced by these elements is a solution to the Navier-Stokes equation, and in principle the method should model turbulent flow.
1.2 The N-body Problem

The vortex element interaction constitutes an N-body problem. Consider a collection of N particles; the Nth particle is acted upon by the remaining (N − 1) particles. Hence the amount of computational effort required to evaluate the force acting on each particle is of order N² − N, i.e. O(N²). If we calculate the interactions using a pairwise law then the amount of work becomes of order ½(N² − N), but this remains O(N²). This computation has proved prohibitive for large N due to the amount of storage required and, more crucially, the amount of time required to compute the interactions.

In fact a variety of physical problems can be represented as an evolving distribution of bodies or particles, where each body interacts with the remaining bodies. This is in effect an N-body system evolving in time; a computational model of such a system can provide valuable insight into the physical processes involved. The N-body problem arises not only in Lagrangian Methods in Computational Fluid Mechanics [26], where the bodies may be vortex elements, but also in Astrophysics, where the bodies may be stars [49]; Plasma Physics, where the bodies may be ions and electrons [2]; Molecular Dynamics; and in the field of Computer Graphics with radiosity rendering techniques [75]. Other applications arise in elliptic partial differential equations (solutions of the Laplace equation via potential theory), and in numerical complex analysis when computing Cauchy integrals.
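Purely as an illustration of the cost just described, and not as code taken from the thesis, a direct pairwise summation in FORTRAN 77 might look as follows; the 2-dimensional logarithmic potential of chapter 2 is used only for concreteness, and all routine and array names are illustrative. The pairwise symmetry is exploited, so each of the ½N(N−1) pairs is visited once.

C     Illustrative sketch: direct pairwise summation for N particles,
C     showing the O(N**2) cost discussed above.  q holds the particle
C     strengths and phi accumulates the potential at each particle.
      SUBROUTINE DIRECT(N, x, y, q, phi)
      INTEGER N, i, j
      DOUBLE PRECISION x(N), y(N), q(N), phi(N), r
      DO 10 i = 1, N
         phi(i) = 0.0D0
   10 CONTINUE
      DO 30 i = 1, N - 1
         DO 20 j = i + 1, N
            r = SQRT((x(i)-x(j))**2 + (y(i)-y(j))**2)
C           each pair contributes to both of its particles
            phi(i) = phi(i) + q(j)*LOG(r)
            phi(j) = phi(j) + q(i)*LOG(r)
   20    CONTINUE
   30 CONTINUE
      RETURN
      END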
1.3 Fast Summation Techniques

There are two types of strategy which may be used to speed up the N-body computation: we can reduce the frequency with which the force at an individual particle is calculated [66], or we can reduce the computational cost of calculating the force per particle. In this thesis we shall concentrate on the latter strategy. Purpose-built parallel computers, such as the GRAPE [77], are used to compute the force directly; however, an alternative approach is to implement fast summation techniques on existing parallel machines. These techniques use approximations when calculating the interactions, which can reduce the time for execution substantially. Several algorithms exist which do just this, such as the early particle/mesh methods, first formulated around 20 years ago, and Anderson's more recent local correction method [4], which calculates the force at the grid points of a uniform mesh and evaluates the force at each particle by extrapolation.

In the last few years, faster summation methods have been developed which employ a hierarchical structure, and are commonly referred to as N-body Treecodes. The first of these include Appel [6] and Barnes and Hut [9]. These N-body Treecodes execute in a time of O(N log N). The basic notion behind N-body Treecodes is that a cluster of distant particles is replaced by a `large' single pseudo-particle, and that as the distance to the cluster increases, the size of the pseudo-particle may also increase. The force exerted on near-by particles is approximated by their interaction with this pseudo-particle. Asymptotically faster N-body Treecodes were then developed which are of O(N), at least for uniform distributions, such as Greengard and Rokhlin [40], Zhao [85] and Anderson [5]. The novel aspect of these codes is that the pseudo-particles are, in some sense, allowed to `interact' with each other. Thus the methods are based around cluster-cluster `interactions', in contrast to the Appel and Barnes-Hut methods, which employ particle-cluster interactions. Intuitively, the more recent codes will be faster.

The Greengard-Rokhlin algorithm [40], namely the Fast Multipole Method, or FMM, was the first N-body Treecode where the truncation error is controllable and the algorithm may be fine-tuned to produce a specified precision known a priori. (The codes preceding the Fast Multipole Method can only estimate the error of the approximated interactions.) For the FMM, the pseudo-particle is represented by an infinite multipole expansion centred on a sphere which contains the entire cluster. This expansion is truncated to a finite number of terms, where the number of terms taken in each expansion controls the precision. Forces may be evaluated to machine precision if required, and for a large number of particles this method can be even more accurate than the direct summation technique, which requires many more operations and is therefore more susceptible to round-off error [2]. The FMM is employed for problems demanding a high number of particles and high controllable precision. To produce a worthwhile 3-dimensional turbulence model using the Random Vortex Method [24], upwards of a million vortices may be required [28, 55], and a high precision is necessary to reduce numerical diffusion. Thus the FMM is of compelling interest in this context.

The use of high performance supercomputers is also beneficial, and there exist numerous methods of `parallelising' N-body Treecodes, which have been mounted on a wide range of parallel computers [39, 48, 52, 58, 60, 62, 65, 69, 71, 74, 85]. The computer used in this work is the Meiko Computing Surface, or CS-1, which is a Multiple-Instruction-Multiple-Data, or MIMD, distributed memory machine. An explicit message-passing paradigm is required, utilising the CS Tools communication harness [29]. We follow a Single-Program-Multiple-Data, or SPMD, local domain decomposition methodology.
1.4 Organisation of the Thesis

This thesis begins with a detailed investigation of the Fast Multipole Method for statistically uniform particle distributions, in both 2 and 3 dimensions. Chapter 2 describes the 2-dimensional Fast Multipole Method, or 2DFMM, and an asymptotic description of the execution time is derived in section(2.3). Section(2.4) introduces some general implementation details. We then discuss how elements of symmetry, inherent to the hierarchical tree, can be utilised. A basic element of adaptivity is introduced, to cater for a non-uniform distribution of particles. An error analysis is performed in section(2.5), where a more stringent method of determining the tolerance parameter is presented, through which the time for execution may be reduced substantially without compromising the prescribed precision. A method of reducing the work load by determining the tolerance parameter in terms of the distances to each cluster is presented in section(2.6). The times and results of the sequential version of the 2DFMM are then discussed in the following section, with a view to determining the optimum run-time parameters.

Chapter 3, describing the 3-dimensional Fast Multipole Method, or 3DFMM, follows the same basic thread as chapter 2, highlighting the major differences between the 2D and 3D cases. The topic of associated Legendre Polynomials is discussed in section(3.2), where a stable generator for these polynomials is presented along with a number of time-saving devices. The same error analysis as for the 2DFMM is performed for the 3DFMM, and the optimum `interaction set' is described in section(3.7). The results are presented in section(3.8) with a view to determining the optimum run-time parameters.

Other fast summation methods, those of Zhao [85], Barnes-Hut [9], Anderson [5] and Buttke [15], are reviewed and contrasted in chapter 4, with an investigation into the use of `super-nodes' in section(4.1). The topic of Parallel Computing is introduced in the following chapter, and the various ways the N-body Treecodes have been parallelised in the past are discussed in section(5.2). Our parallel implementations of the 2D and 3DFMM are presented in chapter 6, where the parallel computer is described in the first section, and the method of domain decomposition is discussed in section(6.3). Our manner of parallelising the FMM employs two scalable communication strategies, a systolic loop and a nearest-neighbour processor inter-communication, for both 2 and 3 dimensions; these are described in section(6.4). The following section contains the parallel algorithm, along with general implementation details. Execution times and parallel efficiencies are presented in section(6.8), along with a summary of the parallelisation in section(6.9).

Chapter 7 describes the fluid dynamical application, with a brief review of the modelling of fluid flow, including turbulent flow, in section(7.1). Then the attractions of grid-free methods, such as Chorin's Random Vortex Method [18], are described in section(7.2). The derivative of the potential calculated by the FMM, as required by Vortex Methods, is formulated in the next section and an error analysis is performed. Finally, it is shown in section(7.4) how the FMM may be embedded to enhance the performance of a vortex code which is used to study the transition to turbulence of a vortex torus under non-viscous conditions. Chapter 8 contains the conclusions and recommendations for future developments, including a suggestion for a novel N-body Treecode.
Chapter 2

Fast Multipole Method in 2 Dimensions

The FMM is an O(N) fast summation method, which employs approximations of controllable precision, to reduce the execution time for the N-body problem. The basic concepts behind the FMM may be described as follows. Particles are grouped together into clusters and are represented by a list of coefficients which describes their distribution, namely a multipole expansion. This expansion can be used to approximate the interaction between distant clusters, provided the distance from one cluster to the next is large enough. The clusters are formally defined with the introduction of a grid. As the distance between the clusters increases, the clusters themselves may also increase in size; this is effected by employing a hierarchy of grids. In both 2 and 3 dimensions, the FMM has a complexity of O(N), at least for a uniform distribution of N particles [42, 45], and an investigation into the order of complexity is performed in sections (2.3) and (3.3) for 2 and 3 dimensions respectively. This chapter investigates the 2-dimensional Fast Multipole Method, 2DFMM, in depth and introduces some modifications to improve its performance.

Before we introduce the hierarchy, let us first describe exactly what we mean by a multipole expansion. Consider a particle of strength q_j at x_j inducing a force at x_i, and let r = |x_i − x_j|, where x = (x_1, x_2) ∈ R². The force at x_i is inversely proportional to the distance r. So if we define a potential function, such that the gradient of this function is the force, then the potential function is q_j log r. The R² plane can be mapped onto the complex plane, z ∈ C, and since log|z_i − z_j| = Re{log(z_i − z_j)}, we may conveniently represent the potential at z_i due to a charge of strength q_j, located at z_j, in terms of a complex potential, i.e. \phi(z_i) = q_j log(z_i − z_j), and later take the real part.

The potential is expanded using an infinite series, or multipole expansion, as follows. Suppose that m particles of strengths q_1, q_2, ..., q_m are located at points z_1, z_2, ..., z_m, with |z_i| < r. Then for any z ∈ C with |z| > r, the potential induced by the particles is given by

    \phi(z) = \sum_{i=1}^{m} q_i \log(z - z_i) = Q \log(z) + \sum_{k=1}^{\infty} \frac{a_k}{z^k},    (2.1)

where Q = \sum_{i=1}^{m} q_i and a_k, the multipole coefficients, are given by

    a_k = \sum_{i=1}^{m} \frac{-q_i z_i^k}{k}.    (2.2)
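For illustration only, and not as the thesis code, the truncated form of eqns.(2.1)-(2.2) can be assembled for one cluster as in the following FORTRAN 77 sketch; the particle positions z(i) are assumed to be given relative to the expansion centre, and all names are illustrative.

C     Illustrative sketch: forms the p-term multipole coefficients of
C     eqn.(2.2) for one cluster of m particles.  q0 returns the total
C     strength Q and a(1..p) the coefficients a_k.
      SUBROUTINE MPOLE(m, z, q, p, q0, a)
      INTEGER m, p, i, k
      COMPLEX*16 z(m), a(p), zk
      DOUBLE PRECISION q(m), q0
      q0 = 0.0D0
      DO 10 k = 1, p
         a(k) = (0.0D0, 0.0D0)
   10 CONTINUE
      DO 30 i = 1, m
         q0 = q0 + q(i)
         zk = (1.0D0, 0.0D0)
         DO 20 k = 1, p
            zk = zk*z(i)
C           a_k = sum_i ( -q_i z_i**k / k ), cf. eqn.(2.2)
            a(k) = a(k) - q(i)*zk/DBLE(k)
   20    CONTINUE
   30 CONTINUE
      RETURN
      END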
Suppose the infinite sum in eqn.(2.1) is truncated to p terms, say. Then the precision of the algorithm is controlled by p, which is much smaller than, and indeed independent of, the number of particles the expansion represents. Suppose we insist that |z| > cr, for some c > 1; since |z_i| < r then |z_i/z| < 1/c. If we substitute the multipole coefficients, eqn.(2.2), into the multipole expansion, eqn.(2.1), it can be seen that the multipole expansion is bounded from above by a power series in 1/c, where c = |z|/r. It is shown in [42] that a truncation error-bound may be given as O((1/c)^p), and that, for a given precision, ε, the number of terms in the truncated series, p, is defined by setting ε = c^(−p), and hence p = ⌈−log_c(ε)⌉, where ⌈x⌉ is the smallest integer larger than x ∈ R. This is a rather conservative estimate of the error, and a more stringent bound is derived in section(2.5.2).

The question now arises: how far apart must the clusters be before the use of the multipole expansion is valid? To determine which clusters are sufficiently far-removed, or well-separated, a uniform mesh is laid over the whole area. In figure 2.1, the cells marked `o' are not well-separated from cell `X', and therefore the expansions are not valid in these cells. The cells marked `o' are used to define the near-field of cell `X'. The far-field is defined as the complement of the near-field and cell `X'.

    o o o
    o X o
    o o o

    Figure 2.1: Diagram indicating near-field cells

The clusters which the multipole expansions represent are determined in terms of the uniform mesh, in that a cluster of particles is defined as all the particles within a single cell (any naturally occurring clusters are ignored). In turn, these clusters can themselves be aggregated into increasingly larger clusters, and this gives the algorithm its recursive nature. A key difference between the FMM and prior N-body Treecodes (cf. chapter 4) is that clusters are allowed to, in some sense, `interact' with each other, as opposed to clusters interacting with individual particles. This cluster-cluster `interaction' is achieved by the use of local expansions, which are formed in terms of the multipole coefficients (cf. section 2.2.2). The local expansion of cell `X' in figure 2.1, say, is used to describe the potential in `X' due to particles in cells which are well-separated from `X'. Now the parameter c, which is used to determine p, may be explicitly expressed in terms of the distance |z| between `X' and the nearest well-separated cell, and the radius r of the circle which circumscribes this cell, i.e. c = |z|/r. (If the point of evaluation is the centre of `X' then c = 2.828.) This is true whatever the size of mesh, so this idea can be applied recursively on a successively finer grid, i.e. on a hierarchical mesh.

The hierarchical mesh is introduced by encompassing the entire system of particles by a single cell. This cell is then repeatedly subdivided into four equal cells, until the finest mesh level, level n, is reached. The top level is labelled level 0; hence at any particular level l there are 4^l cells (figure 2.2); this is known as a quad-tree.

    Figure 2.2: Hierarchical mesh refinement (levels 0, 1, 2, ..., n)

The number of levels, n, is chosen such that only a small number of particles, a maximum of s say, are present in each leaf cell, which is a cell at the finest level, i.e. n ≈ ⌈log_4(N/s)⌉. An investigation into the optimum value of s, and hence an optimum value for n, is performed in section(2.7).

We now present some terminology associated with the hierarchical mesh before an informal description of the algorithm is presented. At any level of refinement l, a cell x is subdivided into 4 cells, which are located at level l+1. The 4 cells are known as the children of x; x is known as the parent. Two cells are known as nearest neighbours if they share a common boundary at the same level of refinement, and so in 2 dimensions a cell has 8 nearest neighbour cells. Cells are defined as well-separated if they are at the same level l and are not nearest neighbours. The interaction set of a cell x is defined as those well-separated cells which are the children of the nearest neighbours of x's parent. This can be compactly expressed in set notation as follows:

    int(x) = { cells i : i ∉ near-field(x), i ∈ children(near-field(parent(x))) },

where int(x) denotes the interaction set of cell x and near-field(x) is the cell's nearest neighbours. The maximum number of cells in the interaction set in 2 dimensions is 27. Figure 2.3 shows the layout of a cell's nearest neighbours, marked n, and its interaction set, marked i.
i i i i i i
i i i i i i
i n n n i i
i n x n i i
i n n n i i
i i i i i i
Figure 2.3: The interaction set of a single cell for a 2DFMM.
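For illustration only, membership of the interaction set can be tested directly from integer cell coordinates at a given level. The (ix, iy) coordinate scheme below is an assumption made for this sketch (the thesis itself uses a single cell number, cf. section 2.4); the test simply encodes the definitions just given.

C     Illustrative sketch: is cell (jx,jy) in the interaction set of
C     cell (ix,iy) at the same level?  Coordinates are non-negative
C     integers; parents are obtained by integer division by 2.
      LOGICAL FUNCTION ININT(ix, iy, jx, jy)
      INTEGER ix, iy, jx, jy
      LOGICAL NEIGH
C     exclude the cell itself and its (up to 8) nearest neighbours
      NEIGH = ABS(ix-jx) .LE. 1 .AND. ABS(iy-jy) .LE. 1
C     accept only children of the parent's nearest neighbours
      ININT = .NOT. NEIGH .AND.
     &        ABS(ix/2 - jx/2) .LE. 1 .AND. ABS(iy/2 - jy/2) .LE. 1
      RETURN
      END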
2.1 Informal Description of the FMM Algorithm

An informal description of the FMM algorithm is now presented, using mathematical operators to be described in detail in the following section(2.2). The algorithm requires that 2 `passes' be made, the upward pass and the downward pass.

The upward pass begins at the grid's finest level, n. The multipole expansion, eqn.(2.1), is calculated for every cell. The multipole expansion associated with the parent cell, at level n − 1, is created from the expansions of its four children by shifting the centre of its children's multipole expansions to the parent's own centre. This occurs recursively from level n − 1 through to level 0. Now there is a multipole expansion associated with every cell in the tree, which represents the cluster defined by that cell.

The downward pass runs from level 0 to level n. As the tree is descended, a cell at any level has a local expansion associated with it, and this expansion is calculated from the multipole expansions belonging to every cell in its interaction set. These multipole expansions describe the far-field potential due to particles in those well-separated cells. If the entire ensemble of particles is contained in the single cell of level 0, then there are no interaction sets for level 0 or level 1, and therefore no computations are performed. For level 2 the interaction set simply consists of every well-separated cell. At level 3 the interaction set is every well-separated cell which has not yet been accounted for, and indeed this is the case for all of the remaining levels. So for a cell at any level, a local expansion is formed from the coefficients of the multipole expansions associated with the cells in its interaction set; then these local expansions are summed and shifted to the centres of the four children; this is performed recursively down the tree.

Once at the finest level, level n, we now have a description of the potential in the leaf cells due to all the particles which lie in the well-separated leaf cells. The potential at each particle, due to particles which lie in the far-field, is approximated by evaluating the local expansion associated with that particle's leaf cell. The only particles which have not yet been accounted for will be the near-field particles, which reside in the nearest neighbour leaf cells. The `direct' pairwise interaction summation method is applied to particles in these cells. Clearly, since no work is performed for levels 0 and 1 in the downward pass, the upward pass need only run from level n to level 2, and the downward pass need only run from level 2 to level n. A pseudo-code version of the algorithm is presented in table 2.1.
Let Phi_i be the potential due to the particles in cell i. This is the multipole expansion, and can be evaluated in cells which are well-separated from cell i.
Let Psi_i be the potential inside cell i, due to the particles in cells well-separated from cell i. This is a local expansion and it can be evaluated at any point in cell i.
Let phi_i be the potential at a particle in cell i due to all other particles.
Let Gamma_i be the contribution to the potential in cell i obtained by direct pairwise summation.

1. Do For i in level n, compute Phi_i
   (i.e. calculate the multipole expansion for every cell at the finest level using eqn.(2.2))

2. Do For j = n-1 down to 2
      Do For i in level j
         Phi_i = sum of Phi_k over k in children(i)
   (i.e. shift and add the multipole expansions using eqn.(2.3))

3. Do For i at level 1, Psi_i = 0

4. Do For j = 2, n
      Do For i in level j
         Psi_i = Psi_parent(i) + sum of Phi_k over k in int(i)
   (i.e. convert the multipole expansions to local expansions using eqn.(2.5) and eqn.(2.6) and then shift centres using eqn.(2.7))

5. Do For all i in level n
      phi_i = Psi_i + sum of Gamma_k over k in neigh(i) and k = i

Table 2.1: The FMM algorithm in pseudo-code
2.2 The Mathematical Operators

The following three theorems follow closely the corresponding theorems in Greengard [42] (where proofs will be found) and are the translating operations required by the 2DFMM.
2.2.1 Translation of a Multipole Expansion

During the upward pass, the following theorem is required to shift the centre of the four children's multipole expansions to the centre of their parent's cell. Suppose that eqn.(2.1) is written as

    \phi(z) = a_0 \log(z - z_0) + \sum_{k=1}^{\infty} \frac{a_k}{(z - z_0)^k},

where this is a multipole expansion of the potential due to a set of m particles of strengths q_1, q_2, ..., q_m, all of which are located inside the disc D of radius r with centre at z_0. Then for z outside the disc D_1 of radius (r + |z_0|) and centre at the origin,

    \phi(z) = a_0 \log(z) + \sum_{l=1}^{\infty} \frac{b_l}{z^l},    (2.3)

where

    b_l = -\frac{a_0 z_0^l}{l} + \sum_{k=1}^{l} a_k z_0^{l-k} \binom{l-1}{k-1};    (2.4)

here \binom{l-1}{k-1} is a binomial coefficient. Within the 2DFMM, the infinite sum in eqn.(2.3) is truncated to p terms, and therefore eqn.(2.3) has a truncation error associated with it. However, once the multipole coefficients a_0, a_1, a_2, ..., a_p in eqns.(2.3) and (2.4) have been computed, we may obtain b_1, b_2, ..., b_p exactly by eqn.(2.4), i.e. the centre of the multipole expansion may be shifted without incurring an additional truncation error [42].
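The translation of eqn.(2.4) amounts to a double loop over coefficients. The following FORTRAN 77 sketch is illustrative only (it is not the thesis routine): z0 is the child's expansion centre relative to the parent's, b(1..p) is assumed zero-initialised by the caller before the first child is added, and BINOM is an assumed double-precision binomial-coefficient function (cf. the remarks on binomial coefficients in section 2.4). The parent's own a_0 is simply the sum of the children's a_0 and is handled separately.

C     Illustrative sketch of eqn.(2.4): accumulate a child's shifted
C     multipole coefficients into its parent's coefficients b(1..p).
      SUBROUTINE M2M(p, a0, a, z0, b)
      INTEGER p, k, l
      DOUBLE PRECISION a0, BINOM
      COMPLEX*16 a(p), b(p), z0, t
      DO 20 l = 1, p
         t = -a0*z0**l/DBLE(l)
         DO 10 k = 1, l
            t = t + a(k)*z0**(l-k)*BINOM(l-1, k-1)
   10    CONTINUE
         b(l) = b(l) + t
   20 CONTINUE
      RETURN
      END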
2.2.2 Conversion of a Multipole Expansion into a Local Expansion

During the downward pass, a local expansion associated with a particular cell i, say, is computed from the multipole coefficients of the cells in the interaction set of cell i, by the following theorem.

    Figure 2.4: The local expansion for disc D2 is formed from the multipole coefficients of disc D1

Suppose that m particles of strengths q_1, q_2, ..., q_m are located inside the disc D_1 with radius r and centre at z_0, and that |z_0| > (c+1)r with c > 1 (cf. figure 2.4). Hence |z − z_0| > r, and thus the corresponding multipole expansion of D_1 with coefficients a_k converges inside the disc D_2 of radius r centred about the origin. At a point z inside D_2, the potential due to the particles is described by a power series:

    \phi(z) = \sum_{l=0}^{\infty} b_l z^l,

where

    b_0 = a_0 \log(-z_0) + \sum_{k=1}^{\infty} \frac{a_k}{z_0^k} (-1)^k,    (2.5)

and for l ≥ 1

    b_l = -\frac{a_0}{l z_0^l} + \frac{1}{z_0^l} \sum_{k=1}^{\infty} \frac{a_k}{z_0^k} \binom{l+k-1}{k-1} (-1)^k.    (2.6)

Within the 2DFMM, the local expansion is truncated to p terms, and is computed from a p-term multipole expansion. There is therefore a truncation error associated with this `conversion' process, and the bound to the absolute error of a p-term local expansion is investigated in section(2.5.2).
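For illustration (again, not the thesis code), the conversion of eqns.(2.5)-(2.6), with the inner sums truncated at p terms as in the 2DFMM, can be sketched as follows; BINOM is the same assumed binomial-coefficient function as before, and b(0..p) is assumed to be accumulated over the cells of the interaction set.

C     Illustrative sketch of eqns.(2.5)-(2.6): convert a p-term
C     multipole expansion (a0, a(1..p)) centred at z0 into a p-term
C     local expansion b(0..p) about the origin.
      SUBROUTINE M2L(p, a0, a, z0, b)
      INTEGER p, k, l
      DOUBLE PRECISION a0, BINOM, sgn
      COMPLEX*16 a(p), b(0:p), z0, t
      t = a0*LOG(-z0)
      sgn = 1.0D0
      DO 10 k = 1, p
         sgn = -sgn
         t = t + sgn*a(k)/z0**k
   10 CONTINUE
      b(0) = b(0) + t
      DO 30 l = 1, p
         t = (0.0D0, 0.0D0)
         sgn = 1.0D0
         DO 20 k = 1, p
            sgn = -sgn
            t = t + sgn*a(k)*BINOM(l+k-1, k-1)/z0**k
   20    CONTINUE
C        b_l = -a0/(l z0**l) + (sum above)/z0**l
         b(l) = b(l) + (t - a0/DBLE(l))/z0**l
   30 CONTINUE
      RETURN
      END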
2.2.3 Translation of a Local Expansion

This theorem is used to shift the centre of a local expansion of a cell to the four children cells in the downward pass. For any complex z_0, z and a_k, k = 0, 1, ..., p,

    \sum_{k=0}^{p} a_k (z - z_0)^k = \sum_{l=0}^{p} \left\{ \sum_{k=l}^{p} a_k \binom{k}{l} (-z_0)^{k-l} \right\} z^l,    (2.7)

where p is a natural number. Eqn.(2.7) is exact, and so the translation of a p-term local expansion contributes no truncation error term to the overall error of the FMM.
2.3 The Time for Execution of the 2DFMM

The order of the time for execution of the 2DFMM is now derived using asymptotic analysis. Note that the operation count in what follows does not represent floating point operations. Our intention is to discover how the number of operations at each stage depends on the parameters n, N and p of the algorithm.

At the finest level n, for each of the N particles, a p-term multipole expansion is formed (Np operations). For the translation of the multipole expansions there are 4^(l+1) p² operations for l = n−1, ..., 2, since for each cell at level l a p-term multipole expansion is computed from four other p-term multipole expansions. Within the downward pass, the conversion of a multipole to a local expansion contributes 27·4^l p² operations for l = 2, ..., n, since for each cell at level l a p-term local expansion is computed from a p-term multipole expansion from each cell in the interaction set. Also in the downward pass we have the translation of local expansions, where there are 4^(l+1) p² operations for l = 2, ..., n−1, since for each cell at level l four p-term local expansions are computed from a p-term local expansion. At the finest level, for each particle there is a p-term local expansion to be evaluated (Np operations), followed by the `direct' summation over near-field cells, each containing a maximum of s particles (Ns operations). By explicitly summing these terms we find that

    \text{operation count} = 2Np + \frac{116}{3} 4^n p^2 - \frac{560}{3} p^2 + Ns.    (2.8)

By eqn.(2.8), the asymptotic time can be expressed as

    t \propto Np + \frac{N}{s} p^2 + Ns,

since N = s·4^n and the constants of proportionality are ignored. Following Greengard [17, 39], if we differentiate to calculate an optimal value for s, we have

    \frac{\partial t}{\partial s} \propto -\frac{N}{s^2} p^2 + N = 0 \;\Rightarrow\; s \propto p,

hence the optimal value for s is dependent on p. By substitution, we find that the asymptotic time is of O(Np). If s = 1 then the asymptotic time is of O(Np²).
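As a small check of this final step (not part of the original derivation), substituting the optimum s ∝ p back into the asymptotic estimate, with α an assumed proportionality constant, gives the stated O(Np) behaviour:

    t \;\propto\; Np + \frac{N}{\alpha p}\,p^2 + N\alpha p
      \;=\; \left(1 + \frac{1}{\alpha} + \alpha\right) Np
      \;=\; O(Np).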
2.4 General Implementation Details

A program based on the above 2DFMM has been written in FORTRAN 77 and implemented on a SUN Sparc ELC workstation. This is the case for all programs written in connection with this thesis, unless otherwise stated. Prototypes of the 2DFMM, sequential and parallel, had been written by Summers [80], but some effort was made to improve the storage requirements and decrease the computation time, which are important considerations for a successful sequential, and indeed parallel, implementation. (The maximum number of particles one could run using the sequential prototype was only approximately 1600.)

To address the problem of memory management, a new way of numbering the cells in the algorithm was introduced so that, given a cell number, the parent (or the children) could be located directly through a single expression, rather than through a pointer array which stores the parents (or children) in memory and thus occupies valuable space; the numbering is defined by

    \mathrm{parent}(i) = \left\lfloor \frac{i+2}{4} \right\rfloor \quad \text{and} \quad \mathrm{child}(i) = 4i - k + 2, \quad k = 1, 2, 3, 4 \text{ and } i \ge 1,    (2.9)

where ⌊x⌋ is the largest integer not exceeding x ∈ R. A pointer array which indicated the cell number of a particular particle, and a pointer array for the nearest neighbour of a cell, were replaced with a function which returns the cell number given the coordinates of its centre, or the coordinates of a particle therein.

Once the downward pass is complete, Summers' program ran a loop proportional to N²: a search over every particle and over all the cells at the finest level, which are O(N) in number. This was replaced by a much more efficient pointer list which allocates one particle as the head of the cell and forms a list of the neighbouring particles in that same cell. This pointer list is composed of two 1-dimensional arrays, head(ncell) and nbr(N), where head is the first particle in the list of particles in a cell, nbr is a pointer to the next particle in the list of remaining particles in the cell, ncell is the total number of cells in the tree and N is the total number of particles.

The binomial coefficient subroutine was also re-written, since in Summers' code the factorials were calculated as integers. The 2DFMM requires that (4p)! be calculated, but the maximum integer factorial which FORTRAN 77 can store is only 12! for integer*4. The binomial coefficient subroutine was replaced by a system of successive fraction multiplication, which is exact to machine precision for any binomial coefficient.
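A minimal sketch of the head/nbr pointer list described above follows (illustrative only, not the thesis code); ICELL is an assumed function returning a particle's finest-level cell number.

C     Illustrative sketch: build the particle-in-cell linked list.
C     head(ic) holds one particle of cell ic (0 if the cell is empty);
C     nbr(i) points to the next particle in the same cell (0 at the
C     end of the list).
      SUBROUTINE MKLIST(N, x, y, ncell, head, nbr)
      INTEGER N, ncell, i, ic, ICELL
      INTEGER head(ncell), nbr(N)
      DOUBLE PRECISION x(N), y(N)
      DO 10 ic = 1, ncell
         head(ic) = 0
   10 CONTINUE
      DO 20 i = 1, N
         ic = ICELL(x(i), y(i))
C        push particle i onto the front of cell ic's list
         nbr(i) = head(ic)
         head(ic) = i
   20 CONTINUE
      RETURN
      END

A cell's particles are then visited by starting from i = head(ic) and following i = nbr(i) until i = 0, so each particle is touched exactly once.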
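The successive-multiplication idea for binomial coefficients can be sketched as follows (an illustrative routine, not the thesis subroutine); no factorial is ever formed explicitly.

C     Illustrative sketch: binomial coefficient "n choose k" computed
C     by successive multiplication in double precision.
      DOUBLE PRECISION FUNCTION BINOM(n, k)
      INTEGER n, k, j
      BINOM = 1.0D0
      DO 10 j = 1, k
         BINOM = BINOM*DBLE(n - k + j)/DBLE(j)
   10 CONTINUE
      RETURN
      END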
2.4.1 Exploiting the Symmetry of a Quad-tree

The following scheme exploits the symmetry inherent to the process of translating local expansions, and reduces the overall execution time for the 2DFMM. This idea was introduced by Greengard and Rokhlin in [43], where it was described as a precursor to how the Fast Fourier Transform can be employed to reduce the computational time to compute the local expansions, in terms of multipole coefficients, from O(p²) to O(p log p). (The latter step was not carried out for this thesis, however, because the development of the 2DFMM is not our primary aim.)

Symmetry is exploited in the translation of a local expansion as follows. From eqn.(2.7) we have, for any complex z_0, z and a_k, k = 0, 1, ..., n,

    \sum_{k=0}^{n} a_k (z - z_0)^k = \sum_{l=0}^{n} \left\{ \sum_{k=l}^{n} a_k \binom{k}{l} (-z_0)^{k-l} \right\} z^l.

By way of an example, consider a cell centred on the real axis at (x, 0), say. This cell would require the terms

    b_l = \sum_{k=l}^{n} a_k \binom{k}{l} (-x)^{k-l},

and another cell at (-x, 0), say, would require

    d_l = \sum_{k=l}^{n} a_k \binom{k}{l} x^{k-l},

and so if we define the following 2 functions:

    \alpha_l = \sum_{\substack{k=l \\ k-l \ \mathrm{even}}}^{n} a_k \binom{k}{l} x^{k-l} (-1)^{k-l}
             = \sum_{\substack{k=l \\ k-l \ \mathrm{even}}}^{n} a_k \binom{k}{l} x^{k-l}

and

    \beta_l = \sum_{\substack{k=l \\ k-l \ \mathrm{odd}}}^{n} a_k \binom{k}{l} x^{k-l} (-1)^{k-l}
            = -\sum_{\substack{k=l \\ k-l \ \mathrm{odd}}}^{n} a_k \binom{k}{l} x^{k-l},

we have

    b_l = \alpha_l + \beta_l \quad \text{and} \quad d_l = \alpha_l - \beta_l.

Now we apply a similar principle to the 2DFMM, where each cell has 4 child cells, which may be numbered as in figure 2.5, with the origin at O.

    3   1
      O
    4   2

    Figure 2.5: Numbering of the 4 child cells in 2D

Therefore, we define the following 4 functions:

    A_l = \sum_{\substack{k=l \\ (k-l) \bmod 4 = 0}}^{p} a_k \binom{k}{l} (-z_0)^{k-l}, \qquad
    B_l = \sum_{\substack{k=l \\ (k-l) \bmod 4 = 1}}^{p} a_k \binom{k}{l} (-z_0)^{k-l},

    C_l = \sum_{\substack{k=l \\ (k-l) \bmod 4 = 2}}^{p} a_k \binom{k}{l} (-z_0)^{k-l}, \qquad
    D_l = \sum_{\substack{k=l \\ (k-l) \bmod 4 = 3}}^{p} a_k \binom{k}{l} (-z_0)^{k-l},

where z_0 is the centre of cell 1. Then cell 1 requires b_l(1) = A_l + B_l + C_l + D_l. Similarly, the results for all 4 cells may be written in matrix notation as

    \begin{pmatrix} b_l(1) \\ b_l(2) \\ b_l(3) \\ b_l(4) \end{pmatrix}
    =
    \begin{pmatrix}
    1 & 1 & 1 & 1 \\
    1 & i & -1 & -i \\
    1 & -i & -1 & i \\
    1 & -1 & 1 & -1
    \end{pmatrix}
    \begin{pmatrix} A_l \\ B_l \\ C_l \\ D_l \end{pmatrix}.

Utilising the inherent symmetry in this manner reduces the operation count of this operation from O(4p²) to O(p² + 4p).

For the conversion section of the algorithm, consider a cell with a full interaction set. The set has the same configuration for every cell at every level, unless, of course, some part of the interaction set lies outwith the boundary, and indeed out of the 27 cells in this set there are only 7 distinct distances from the cell to each of the interaction set cells. It was found that storing this information occupied too much valuable memory space, and furthermore accessing a file of precomputed constants was found to slow down the execution time. This set does have an inherent symmetry, but the number of symmetries is low and they have a complex form, and so this method of exploiting the symmetry of the interaction set was not invoked. However, an element of symmetry in the conversion section was employed. If cell j lies in the interaction set of cell i, say, then the opposite is also true, or in set notation:

    cell i ∈ int(cell j) ⇒ cell j ∈ int(cell i),

where int(cell j) denotes the interaction set of cell j. This is similar to the pairwise interaction used in the `direct' method and reduces the computation time by a factor of order 2.
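Purely as an illustration of the decomposition above (not the thesis implementation), the four partial sums and the (1, u, u², u³) weights of the matrix can be combined as follows, where z0 is taken as the centre of cell 1 relative to the parent centre and BINOM is the assumed binomial-coefficient routine of section 2.4.

C     Illustrative sketch: translate a local expansion a(0..p) to the
C     four children using the A,B,C,D partial sums defined above.
C     bl(l,j) receives coefficient l of child j.
      SUBROUTINE L2LSYM(p, a, z0, bl)
      INTEGER p, k, l, m, j
      COMPLEX*16 a(0:p), z0, bl(0:p,4), s(0:3), term, u(4)
      DOUBLE PRECISION BINOM
      u(1) = ( 1.0D0, 0.0D0)
      u(2) = ( 0.0D0, 1.0D0)
      u(3) = ( 0.0D0,-1.0D0)
      u(4) = (-1.0D0, 0.0D0)
      DO 40 l = 0, p
         DO 10 m = 0, 3
            s(m) = (0.0D0, 0.0D0)
   10    CONTINUE
C        split the single sum into the four residue classes of k-l
         DO 20 k = l, p
            term = a(k)*BINOM(k, l)*(-z0)**(k-l)
            s(MOD(k-l,4)) = s(MOD(k-l,4)) + term
   20    CONTINUE
C        combine with the weights (1, u, u**2, u**3) for each child
         DO 30 j = 1, 4
            bl(l,j) = s(0) + u(j)*s(1) + u(j)**2*s(2) + u(j)**3*s(3)
   30    CONTINUE
   40 CONTINUE
      RETURN
      END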
2.4.2 A Basic Element of Adaptivity

The adaptive FMM [17] is a version of the FMM which is sensitive to the distribution of the particles. Instead of creating a hierarchy of cells to a depth n everywhere, each parent cell is subdivided only if it contains more than a few particles. The non-uniformity of this irregular and dynamic hierarchy has a major effect on the downward pass, where a number of new interaction sets are defined. The increase in required memory, and the added complexity of the algorithm, meant that this method of adaptivity was not attempted in the present work.

In our modified version of the 2DFMM we shall, however, introduce the following basic element of adaptivity to cater for non-uniform distributions. For each cell i in the hierarchy we calculate

    A_i = \sum_{k=1}^{m} |q_k|

during the upward pass of the FMM, where there are m particles in cell i of strengths q_1, q_2, ..., q_m. If A_i = 0, then clearly we have no particles in cell i, nor in any of its `offspring'. Therefore, during the downward pass, there is no need to calculate the local expansion coefficients for cell i, and so the relevant loops are skipped. This simple test to ignore empty cells works very well, and is discussed in section(7.4.4).
2.5 Error Analysis for the 2DFMM

Consider a 2DFMM executed with 5K and then 10K particles, with 4 levels (which was found to be the optimum number of levels for p = 11). The real elapsed times* for execution on a Sun Sparc ELC workstation are shown in figure 2.6, where it can be seen that the time is clearly dependent on p. The order of complexity for the 2DFMM is asymptotically O(Np) (section 2.3); this dependence on p suggests that a method for reducing p whilst retaining the prescribed error would be beneficial towards reducing the execution time. The next section investigates a possible approach to reducing the p-value.

* All times stated in this thesis are real elapsed times in seconds.
    Figure 2.6: The relationship between p and the execution time (elapsed time in seconds versus p, for N = 5000 and N = 10000)
2.5.1 Error Bound of the Truncated Multipole Expansion

Greengard [42] shows that the absolute error, ε_abs, of the truncated multipole expansion is bounded by

    \epsilon_{abs} = \left| \sum_{k=p+1}^{\infty} \frac{a_k}{z^k} \right|
      \le \sum_{k=p+1}^{\infty} \frac{A}{k} \frac{r^k}{|z|^k}
      < A \sum_{k=p+1}^{\infty} \left( \frac{r}{|z|} \right)^k
      = \frac{A}{c-1} \left( \frac{1}{c} \right)^p,

where

    a_k = \sum_{i=1}^{m} \frac{-q_i z_i^k}{k}, \qquad A = \sum_{i=1}^{m} |q_i|, \qquad c = \frac{|z|}{r},

and r is the radius of the disc which contains the cluster, cf. eqns.(2.1) and (2.2). Hence a relative error-bound may be estimated by ε = ε_abs/A < O((1/c)^p). From [42], we determine the number of terms, p_1 say, required to achieve this error by

    p_1 = \lceil -\log_c(\epsilon) \rceil.    (2.10)

However, if we note that

    \sum_{k=p+1}^{\infty} \frac{A}{k} \left( \frac{r}{|z|} \right)^k
      \le \frac{A}{p+1} \sum_{k=p+1}^{\infty} \left( \frac{r}{|z|} \right)^k,

then, by using this more stringent bound, we have

    \epsilon_{abs} \le \frac{A}{(c-1)(p+1)} \left( \frac{1}{c} \right)^p    (2.11)

with c = |z|/r, which implies an expression for the number of terms, p_2, where p_2 = ⌈q_2⌉ and q_2 is given implicitly by

    q_2 = -\log_c\{ \epsilon (q_2 + 1)(c - 1) \},    (2.12)

where ε = ε_abs/A as before. The value for p_2 will be less than or equal to p_1 for any ε or any c > 3/2. This can be shown in the following manner. If we let q_1 = −log_c(ε) and q_2 = −log_c{ε(q_2+1)(c−1)}, then p_1 = ⌈q_1⌉ and p_2 = ⌈q_2⌉. If we subtract q_1 from q_2, we find q_2 − q_1 = −log_c{(q_2+1)(c−1)}. Now, since p ≥ 1, we have (q_2+1)(c−1) > 1 provided that c > 3/2, and this implies

    q_2 - q_1 = -\log_c\{(q_2+1)(c-1)\} < 0 \;\Rightarrow\; q_2 < q_1 \;\Rightarrow\; p_2 \le p_1.

Due to the non-linear character of eqn.(2.12), the solution is not trivial, and so the following simple iterative scheme was used to determine p_2, by solving for q_2, since p_2 = ⌈q_2⌉. All such schemes used in this thesis were shown to be convergent for the starting values q^(0) = 0.001, 0.01, 0.1, ..., 1000. Starting from the initial value q_2^(0) = q_1 = −log_c(ε) we can update q_2 using q_2^(i+1) := f(q_2^(i)) with

    f(q) = -\log_c\{ \epsilon (q + 1)(c - 1) \}.

It is seen from table 2.2 how, for c = 2.8281 and ε = 10^-3, p = ⌈q_1⌉ = 7 as determined from eqn.(2.10), while the more stringent error-bound, eqn.(2.12), implies that p = ⌈q_2⌉ = 5. The asymptotic order of the formation of local expansions is O(p²); thus the use of p_2 instead of p_1 reduces the overall computational time for this particular case by roughly a factor of order 2. In general the improvement is greatest for small ε and large c.

    i    q2(i)
    1    6.644821
    2    4.107954
    3    4.495835
    4    4.425429

    Table 2.2: Iteration values using eqn.(2.12), c = 2.8281 and ε = 10^-3.

Using p_2 instead of p_1, the number of terms necessary for a required precision is therefore reduced, yielding a shorter execution time for the formation of the multipole expansion and all other p-dependent parts of the algorithm. (Note, however, that the multipole expansion is not actually evaluated as such; the multipole coefficients are used to calculate a p-term local expansion.)
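The fixed-point iteration described above is straightforward to code. The following sketch is illustrative only (it is not the thesis routine): it returns p_2 for given ε and c, using a fixed number of sweeps, which is an assumption (table 2.2 suggests a handful of sweeps suffice for c = 2.8281 and ε = 10^-3).

C     Illustrative sketch: solve eqn.(2.12) for q2 by fixed-point
C     iteration and return p2 = ceiling(q2).
      INTEGER FUNCTION P2TERM(eps, c)
      DOUBLE PRECISION eps, c, q, logc
      INTEGER it, NITER
      PARAMETER (NITER = 20)
      logc = LOG(c)
C     starting value q1 = -log_c(eps), cf. eqn.(2.10)
      q = -LOG(eps)/logc
      DO 10 it = 1, NITER
         q = -LOG(eps*(q + 1.0D0)*(c - 1.0D0))/logc
   10 CONTINUE
C     round up to the next integer
      P2TERM = INT(q)
      IF (DBLE(P2TERM) .LT. q) P2TERM = P2TERM + 1
      RETURN
      END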
2.5.2 Error Bound of the Truncated Local Expansion

Greengard [42] determined the maximum error of a truncated local expansion to be bounded by

    \epsilon_{abs} < A \, \frac{4e(p+c)(c+1) + c^2}{c(c-1)} \left( \frac{1}{c} \right)^{p+1}.    (2.13)

By asymptotic analysis of eqn.(2.13), and invoking a further inequality, namely

    \epsilon_{abs} < O\left( \left(\tfrac{1}{c}\right)^{p+1} \right) < O\left( \left(\tfrac{1}{c}\right)^{p} \right),

he derives the following expression for p:

    p = \lceil -\log_c(\epsilon) \rceil,    (2.14)

where c = |z|/r − 1 and |z| is the separation distance between the centres of the two expansions. (Note that this `c' is different to the c used in the formulation of the multipole error-bound, where c = |z|/r and the parameter |z| is the distance between the centre of the multipole expansion and the evaluation point.) In fact his analysis is rather conservative; more specifically, retaining p terms in the local expansion, where p is defined by eqn.(2.14), produces an error which is less than the prescribed error, ε. To see this, consider an alternative estimation based on eqn.(2.13), namely

    \epsilon_{abs} < O\left( \left(\tfrac{1}{c}\right)^{p+1} \right).

In analogy to the analysis leading to eqn.(2.14), we can express p as

    p = \lceil -\log_c(\epsilon) \rceil - 1.    (2.15)

This expression reduces the p-value derived from eqn.(2.14) by 1. In fact, a still more precise expression for p can be developed, since eqn.(2.13) implies the following non-linear expression for p:

    p = \left\lceil -\log_c\left( \frac{\epsilon\, c(c-1)}{4e(p+c)(c+1) + c^2} \right) \right\rceil - 1.    (2.16)

This expression is solved for p using a simple iteration procedure, similar to that described previously for the solution of eqn.(2.12). The solution of this equation is depicted graphically in figure 2.7, which illustrates how eqn.(2.16) produces a larger value of p than that produced by eqn.(2.15) for any given ε, and where c = 1.828 as determined by the closest well-separated cell.

    Figure 2.7: Graphical representation of ε as a function of p (curves for eqns.(2.15), (2.16) and (2.17); c = 1.828)

An investigation was carried out into the derivation of the error-bound, eqn.(2.13), and a more stringent bound was developed; the details are presented in Appendix A. The more stringent bound takes the form

    \epsilon = \frac{\epsilon_{abs}}{A} < K(p, c) \left( \frac{1}{c} \right)^{p+1},    (2.17)

where the prefactor K(p, c), given explicitly in Appendix A, is smaller than that of eqn.(2.13). Figure 2.7 illustrates how eqn.(2.17) produces a p-value which lies between the p-values produced by eqns.(2.15) and (2.16). Therefore this represents a reduction in the number of terms from the p-value calculated by Greengard's conservative error-bound. However, the order of eqn.(2.17) is the same as that of eqn.(2.13), i.e.

    \epsilon_{abs} < O\left( \left(\tfrac{1}{c}\right)^{p+1} \right) \;\Rightarrow\; p = \lceil -\log_c(\epsilon) \rceil - 1.

Thus eqn.(2.15) is the expression which produces the smallest value for p. This value for p is larger than that derived in terms of the truncated multipole expansions, eqn.(2.10), due to inequalities introduced in the derivation of eqn.(2.17); however, if a local expansion is required to have p terms, then the program is required to store a p-term multipole expansion for every cell in the tree, and indeed this larger value of p must be used throughout the FMM. In practice we may be able to reduce this p-value further, so to investigate this, numerical experiments were carried out; they are presented in section(2.5.4).
2.5.3 Accumulated errors

The shifting of expansions performed within the FMM, i.e. shifting the centre of a truncated multipole expansion or a local expansion, is exact, and so does not contribute a truncation error to the global error of the FMM. Increasing the number of levels used in the FMM means that more shifting operations are performed; however, this will not increase the overall truncation error, for the above reason. The deduction that increasing the number of levels has no effect on the truncation error is borne out in numerical experiments [5].
2.5.4 Numerical Trials Errors from a large system A system with 5000 uniformly distributed particles each with strength +1:0 was tested. The potential was calculated once, using the 2DFMM in double precision using p = 11 from p = d? logc()e ? 1; where c = 1:828 and = 10?3 . In this case we have N = 5000 and n = 4; this number of levels takes the shortest time to calculate the potential for these parameters. The `direct' method was used to calculate the `exact' values to double precision and the error was calculated as the absolute dierence between the two. Since we have a large value of N , then it is likely that an error very close to the maximum error will occur for some pair of particles, i.e. each will lie near the extremities of its own cluster. It was found that the maximum error (over all particles) was max
i=1;N (abs ) = 8:05 10
?4
?7 ) imax =1;N () = 1:61 10
abs . This con rms that the error , and indeed , lies well within the where = 5000 abs expected bounds. This may imply that more terms were taken than necessary to achieve = 10?3 for this set of parameters.
Errors from a two particle system Employing just 2 particles not only reduces the computational time, but by testing various pairs one can produce speci c errors of interest and acquire a working knowledge of the nature of the errors. The position of one particle is used as a point of evaluation, while the other has a strength of +1:0. A xed number of terms in the expansions (multipole and local) are used so that the empirical error can be compared to the theoretical error-bound. The 2DFMM was executed, without 23
invoking the hierarchy, with one multipole expansion formation, one conversion to a local expansion, and nally a potential evaluation. This means that the particle's potential has undergone every kind of transformation, each of which induces its own unique form of truncation error. (The shifting of a truncated multipole expansion or a local expansion does not contribute an error.) We now assume that the error induced by this numerical experiment is bounded from above by the error of the truncated multipole expansion plus the error of the conversion to a truncated local expansion. Hence emp = jFMM ? exactj mult + local where exact is calculated using the `direct' method in double precision. The theoretical truncated multipole expansion error is much smaller than the theoretical truncated local expansion error and has, in fact, very little eect when added to the local expansion error, and so one can use graphs of the empirical errors versus p as a numerical estimate of the truncated local expansion error; hence the graphs can be used as a device to determine p by superimposing the theoretical truncated local expansion error-bounds. The con guration shown in gure 2.8 arises when using the nearest cell in the interaction set to de ne c, and hence the closest cell for which a multipole expansion is valid. (We evaluated at one point due to a unit particle at the other, similarly numbered point.) Many pairs of points were analysed, but the 3 pairs of
'$'$ &%&% 3 3 2 11 2
Figure 2.8: Con guration of 3 points in 2D, c = 1:828. points illustrated in gure 2.8 were found to characterise the largest errors for this con guration. It should be noted that the points `1' were contrived to lie within the disc centred on the cells shown, as they would normally be allocated to the cell in which they lie. The empirical errors are plotted in gure 2.9 on a log-linear graph, versus p, along with theoretical error-bounds. The error associated with `pair 1' is the maximum error associated with the disc which is used in the calculation of the error-bounds, but does not occur in the FMM since, as already stated, the algorithm assigns these points to the cell in which they lie. 24
1 3 c**(-p-1) 0.1 +2 3 3 2/((p+1)*(c-1))*c**(-p) + 3 Pair 1 3 0.01 2 +2 2 32 3 3 3 Pair 2 + + 2 0.001 3 Pair 3 2 + 2 3 3 + 3 3 0.0001 + 2 3 3 + 2 2 1e-05 3 3 + 2 3 3 2 + 1e-06 2 3 3 + 2 + 1e-07 2 2 2 2 + 2 + 1e-08 5 10 15 20
p Figure 2.9: Empirical and theoretical errors, c = 1:828.
It is seen from gure 2.9 that `pair 3' contributes the largest error which arises from the FMM, except for p = 2 and p = 9. At this point it is worth noting that as p increases the actual maximum error which occurs in the FMM may also increase. If we consider the error which occurs from the use of `pair 3', we can see that the actual error increases as p is incremented from 2 to 3, 9 to 10 and 15 to 16. There are two theoretical error-bounds plotted on gure 2.9. The value of c = jzrj ? 1 1:828 is calculated from the c associated with the formation of a local expansion. The rst bound, `c**(-p-1)', is the bound derived in this thesis, i.e. eqn.(2.15), and is indeed an upper bound to all the errors except that which occurs at p = 1. The other bounds discussed in section(2.5.2), including the bound derived by Greengard, namely eqn.(2.14), are numerically larger, and therefore too conservative. The second bound on gure 2.9, `2/((p+1)*(c-1))*c**(-p)', is twice the error-bound for the formation of the multipole expansions, eqn.(2.11), i.e. p 2 [2:18] = (p + 1)(c ? 1) 1c : This expression was derived empirically and was suggested by the numerical experiment performed for 3 dimensions for a two particle system (cf. section 3.5.4). This expression appears to be a very good estimate of the maximum error. Other con gurations of cells are now examined to test the generality of eqn.(2.15) and eqn.(2.18) before any conclusions are drawn. The two con gurations shown on gure 2.10 are now considered. The 3 3 con guration has a c-value of 3:0, while the 4 4 con guration has a c-value of 5:0. The errors incurred here will be the maximum incurred through the FMM and 25
the theoretical maximum error, since the points of interest lie on the extremities of both the cell and the disc. The two graphs associated with these errors, showing the empirical and theoretical errors, are plotted on gure 2.11 and 2.12. (Note that when the theoretical maximum actual error matches the maximum error in the FMM, then as p increases, the error decreases monotonically.) They show that when c varies according to distance one can vary p, since the larger the separation distance, the smaller the number of terms to retain a given precision. This important result is investigated further in section(2.6) on the `dynamic p principle'. For both the 3 3 and the 4 4 con gurations, eqn.(2.15) is an upper bound to the errors for all values of p, and eqn.(2.18) seems to t the error very closely. So much so, in fact, that the local gradient of the error-bound curve appears to match that of the empirical error curve.
26
'$ &% '$ '$ &% &% 4
4
'$ &% 5
5
c = 3:0 c = 5:0 Figure 2.10: The 3 3 and 4 4 con gurations
0.1 0.01 0.001 0.0001 1e-05 1e-06 1e-07 1e-08 1e-09 1e-10
3
3
3
2
3
4
3
c**(-p-1) 2/((p+1)*(c-1))*c**(-p) Pair 4 3 3
6
3
3
8
3
3
3
10
3
12
3
3
14
3
3
16
3
18
p Figure 2.11: Graph of the empirical and analytical errors, 33 con guration, c = 3:0 0.1 3
3
3
3
3
c**(-p-1) 2/((p+1)*(c-1))*c**(-p) Pair 5 3 3
3
3
3
3
1e-10 2
4
6
8
10
3
3
12
3
3
14
3
3
16
p Figure 2.12: Graph of the empirical and analytical errors, 44 con guration, c = 5:0 27
'$ '$ &% &%
A nal con guration is now examined. Figure 2.13 shows the location of two pairs of points with c = 2:1623 in a dog-leg con guration of cells. Figure 2.14 shows
6,7
6 7
Figure 2.13: Dog-leg con guration
the empirical and theoretical errors and, once again, it can be seen from the graph that eqn.(2.15) and eqn.(2.18) are reasonable estimates for the relation between , c and p.
1 c**(-p-1) 0.1 +3 2/((p+1)*(c-1))*c**(-p) 3 + 0.01 + 3 + + Pair 6 3 + Pair 7 + 3 0.001 + 3 0.0001 + 3 3 + + + 3 1e-05 + 3 3 + 3 1e-06 + 3 3 1e-07 3 + + + + 3 + 3 1e-08 + 3 3 1e-09 2 4 6 8 10 12 14 16 18 20
p Figure 2.14: Empirical and theoretical errors, c = 2:1623
The nature of eqn.(2.18), which corresponded to the theoretical maximum error, is similar in character to that of the equation found in the numerical experiments for the 3-dimensional case (apart from the factor of 2 which was discovered empirically), i.e. it is the multipole expansion's error-bound. For further discussion on the implications of this expression, see section(3.5.4). We can conclude that a sound method for choosing p for the 2DFMM, is to use the formula p = d? logc()e ? 1; c = jzrj ? 1; which was obtained from the theoretical error-bound calculations, eqn.(2.15), and is used as an expression to determine p throughout the remainder of this thesis. At present, the use of eqn.(2.18) lacks theoretical foundation. 28
2.6 The Dynamic p Principle The idea that, as the distance to a cluster increases, then the cluster can be replaced by a larger cluster, is the concept which gives the FMM its hierarchical nature. It has been stated that as the distance to a cluster increases then fewer terms need be taken in the expansions for clusters of equal size. Buttke [15] used this idea in his non-hierarchical particle-cluster interaction method, which is described in section(4.4). We now present an implementation of this `dynamic p principle' for the FMM. From section(2.5.4), the number of terms in a p-term multipole/local expansion can be given by p = d? logc()e ? 1; c = jrzj ? 1 where jzj is the separation distance between the two closest well-separated cells. The fact that p is a function of distance (i.e. depends on jzj through c) as well as precision can be exploited within the FMM during the downward pass where multipole expansions are converted to local expansions - the most expensive calculation of the algorithm. The variable c is an important variable as it controls the value of p, hence the precision and execution time for the algorithm. The value of c will vary from cell to cell, since a cell which is further away will have a larger c-value, and hence requires fewer terms to obtain the same precision. Therefore the distance to a cell de nes how many terms one needs in order to have the same tolerance as any other cell. We denote this c-dependent value of p by pdyn . When forming a p-term local expansion one need only take pdyn terms from the multipole expansions, pdyn p where p is calculated with the smallest c-value, as de ned by the nearest cell in the interaction set. In the conversion section of the 2DFMM, the eqns.(2.5) and (2.6) are now computed as pX dyn a (i) k k; ( ? 1) b0 = a0(i) log(?zi) + k k=1 zi and pX dyn a (i) l + k ? 1 ! a 1 0(i) k k bl = ? l zl + zl zk k ? 1 (?1) where
i
i k=1
i
pdyn = d? logci ()e; ci = jzrij ? 1; l = 0; :::; p and ak (i) are the multipole coecients associated with each cell i in the interaction set centred at zi. 29
Without introducing this dynamic p principle, i.e. p = max i (pdyn ), the time taken to form the local expansion is of O(p2 ), i.e. a p-term local expansion is evaluated from a p-term multipole expansion. But by introducing the dynamic p principle the order of time becomes O(ppdyn ) (for a single cell). Consider now a full interaction set; 27 cells, each with an associated c-value; c 2 [1:828; 5:0]. We may now explicitly compute the order of the time when utilising the dynamic p principle, and hence the speed-up can be approximated by computing the ratio time with constant p = 27p2 1:6: time with dynamic p P27 box=1 ppdyn In other words, by using the dynamic p principle the calculation of a local expansion is approximately 1:6 times faster.
2.7 Timing Results for the Sequential 2DFMM We now present times for the sequential version of our modi ed 2DFMM, running on a Sun Sparc ELC workstation. All the techniques for speed-up discussed above are incorporated, so we de ne p from p = d? logc()e ? 1 and c = jzrj ? 1. The dynamic 10000
2 levels 3 levels 4 levels 5 levels 6 levels
1000 Time (secs) 100 10 1 100
1000
10000 100000 Number of Particles
1000000
Figure 2.15: Typical timing curves for a 2DFMM, p = 11
p principle is employed. The basic element of adaptivity described in section(2.4.2) is not employed for this particular set of results, so that its use in the context of 3 dimensions is more apparent. The graphs shown in gure 2.15 depict typical timing curves of the FMM, and correspond to various choices for n (the total number of levels). All results are computed in double precision for N uniformly distributed ?4 particles using p = 11 (p = max i (pdyn ) = 11 ) < 7:2 10 ). 30
Consider as an example the FMM executed with 4 levels. For N < 2000, the upward and downward passes over the tree dominate the calculation, and since there is no element of adaptivity, the time is barely in uenced by N . As N increases it becomes so large that the `direct' computation, using the particles in the near- eld leaf cells, now dominates the time for execution, which therefore becomes of O(N 2 ). If the optimal number of levels, nopt say, is employed for a particular N , then this `fastest' time curve is given by the lower envelope of all timing curves. This timing curve is a continuous line and is theoretically O(N ), by eqn.(2.8) (cf. gure 2.16). The actual equation of the timing curve can be approximated if we assume that time varies as aN b, since this appears as a straight line on a log-log graph. Hence a straight line regression may be performed through the points which form the lower envelope, over the range N 2 [200 : 100000]. The regression suggests a relationship time = 0:0057 N 1:13secs. 1000 p = 11 p = 22
direct
100 Time (secs) 10 1 100
1000 10000 Number of particles
100000
Figure 2.16: Timing curves for the 2DFMM, `direct', p = 11, p = 22 Figure 2.16 shows the lower envelope of two timing curves for p = 11 and p = 22, as well as the result for the pairwise summation method. The point where the performance of the FMM becomes faster than the direct summation method, shown here as direct, is known as the break-even point and is a useful measure of the performance of the FMM. In 2 dimensions, for p = 11, this point is N 180 particles; for p = 22 it is around N 300 particles. The optimum number of levels, nopt, was derived for the two 2DFMM curves by executing numerous programs with various values for N and a static n. The value for nopt was then found with the aid of a graph, such as that displayed in gure 2.15. 31
As N increases, then so does the optimum number of levels, nopt. Theoretically the dependence of nopt on N depends on s, the maximum number of particles per leaf cell. Values of s can be computed by locating the value for N at which the time for n0 levels say, exceeds the time for n0 + 1 levels (i.e. points of intersection in gure 2.15), so that s is given by s = 4Nn . From gure 2.15 it can be shown that the average value of the four possible values for s is s = 41. For the curve p = 22, the corresponding result is s = 65. It is now clear that s, and hence nopt, is dependent on p. These variables have been thought to be independent [74, 75]. In fact this variable, s, will also depend on the implementation details of each individual FMM. Let us now consider the form of the 2DFMM. It can be broken into two basic component parts: the calculation of the far- eld of O(Np), and the calculation of the near- eld of O(Ns) (see section 2.3). If s is too large, then the direct summation method dominates the calculation, and if s is too small, then the calculation of the far- eld dominates; so we wish to balance the time to perform the two calculations. From section(2.3) we have s / p, which implies that the optimal number of levels is now dependent on p, that is ! N N nopt = log4 s log4 p : 0
Therefore for a given N , as p increases, the optimal number of levels should decrease to minimise the time for execution. Since the constant of proportionality in the expression s / p depends on the method of coding, the language, the compiler, memory access speed, oating point rates, etc., an analytic generic expression to determine the value of nopt is unobtainable. Naturally, suggestions for the value of s have varied for each implementation. In the original form of the 2DFMM, [40], nopt was chosen so that there is only one particle per leaf cell on average. In [75] it was suggested that s = 40, and in [42]; s = 30. In [39] it is suggested that s 2p. This relationship was derived by estimating the computer's parameters, but does not concur with our implementation. (Here we have p = 11 and s = 41, p = 22 and s = 65, hence s = 2:18p + 17.) When using the FMM in practical applications one must locate the optimal number of levels empirically. One such method is presented in detail by Anderson in [5], and is described brie y in section(3.8).
32
Chapter 3 Fast Multipole Method in 3 Dimensions The 3-dimensional Fast Multipole Method, 3DFMM, is conceptually simple since it is a logical extension of the 2DFMM; it is, however, substantially more complex in its formulation and programming requirements. The hierarchical frame-work for the 3D version is essentially the same as the 2D, but in the 3D case, at any particular level l there are 8l cells and one parent cell will have 8 children; this is known as a oct-tree. Figure 3.1 shows a cube containing the particle ensemble, showing 3 levels of re nement, where the entire cube is level 0, thick lines are level 1, thin lines are level 2. The notion of Second nearest neighbours is introduced to the near
HHHHHH HH HHHHHH HHHHHHHHHHH H HHHHHHHHHHHHHHHHHHH HHHHHHHHHHHHH HH HH H HH HHHHH HHH HHHHH HH HH H H H HH HHH HH H HHHHH H HHH H
Figure 3.1: A cubic domain with 3 levels of re nement eld: second nearest neighbours are de ned as the nearest neighbours of the nearest neighbours of a cell (excluding the cell itself and its nearest neighbours). The near eld is now rede ned as the nearest and second nearest cells at the same level, a maximum of 124 cells. The interaction set now increases in size to a maximum of 33
875 cells. Alternatively the near- eld may be de ned as a cell's nearest neighbour cells only [71], which means that the interaction set has a maximum size of 189 cells. The merits of one de nition over the other are not obvious, and an investigation is performed in section(3.7). The mathematical structure of the algorithm in 3D diers dramatically from the 2D case, as the potential is now = rq0
where r0 is the distance between the evaluation point and a particle of strength q. This can be expressed in terms of an in nite sum over Legendre polynomials namely 1 qn q =X r0 n=0 rn+1 Pn (cos ) such that < r, where is the radial distance to the evaluation point from an origin, r is the distance to the particle from the same origin, and is the angle subtended between the two position vectors, see gure 3.2. +AKP = (r; ; ) A A A A
+Q = (; ; )
* A
O Figure 3.2: Points P and Q, with an angle subtended between OP and OQ. The Legendre polynomials can be written in the form bnc
1 3 : : : (2n ? 2k ? 1) (?1)k xn?2k 2k k!(n ? 2k)! k=0 where n = 0; 1; 2; : : : and jxj 1: The Legendre polynomial is not evaluated as such, but is described in terms of spherical harmonics via the Addition Theorem: Let P and Q be points with spherical coordinates (r; ; ) and (; ; ) respectively, and let be the angle subtended between them ( gure 3.2). Then
Pn (x) =
2 X
Pn (cos ) =
n X
m=?n
Yn?m (; )Ynm(; )
[3:1]
where the spherical harmonics are calculated in terms of associated Legendre polynomials; v u u jmj)! P jmj(cos)eim; m [3:2] Yn (; ) = t ((nn ? + jmj)! n 34
p
where i = ?1. The use of associated Legendre polynomials is discussed in section(3.2).
3.1 Mathematical Operators The four operators that form the basis of the Fast Multipole Method are similar to those of the 2D algorithm, in that they eectively perform the same operations, but they have a more complicated structure. These operators are used to form the multipole expansions, translate these expansions, construct local expansions from the multipole coecients and nally to translate those local expansions. The following theorems follow closely those found in [42], where the proofs will be found.
3.1.1 Multipole Expansion Suppose that k particles of strengths q1; q2; :::; qk are located at the points Qi = (i; i; i), i = 1; : : : ; k, with jij < a. Then at any point P = (r; ; ) 2 R3 with r > a, the potential (P ) is given by (P ) =
1 X n X
Mnm Y m (; ) n+1 n n=0 m=?n r
[3:3]
where the multipole coecients are given by
Mnm =
k X i=1
qini Yn?m (i; i):
In analogy to the 2-dimensional case, the in nite sum in eqn.(3.3) is truncated to p terms, and then the precision of the algorithm is controlled by the number of terms taken in the expansion, p, which is much smaller than, and indeed independent of, the number of particles the expansion represents. For a given precision, , Greengard calculates p from p = d? logc ()e using c = 2 [42]. The equation determining p and this choice of c, are investigated in section(3.5) with a view to reducing the p-value whilst retaining the prescribed precision.
3.1.2 Translation of a Multipole Expansion During the upward pass, the following theorem is required to shift the centre of the eight children's multipole expansions to the centre of their parent's cell. 35
Suppose that l particles of strengths q1; q2; : : : ; ql are located inside the sphere D of radius a with centre at Q = (; ; ), and that for points P = (r; ; ) outside D, the potential due to these particles is given by the multipole expansion 1 X n Om X n m 0 0 (P ) = [3:4] 0n+1 Yn ( ; ) r n=0 m=?n (cf. eqn.(3.3)) where P ? Q = (r0; 0; 0) represents a position vector relative to Q. Then for any point P = (r; ; ) outside the sphere D1 of radius (a + ) and centred at the origin, j Mk 1 X X j Y k (; ) [3:5] (P ) = j j +1 r where with
j =0 k=?j
Mjk
Ojk??nm Jmk?m Amn Ajk??nm n Yn?m (; ) = Akj n=0 m=?n j X n X
[3:6]
8 >
and
n ( ? 1) : [3:7] = (n ? m)!(n + m)! Within the 3DFMM, the in nite sum in eqn.(3.5) is truncated to p terms, and therefore eqn.(3.5) has a truncation error associated with it. However, once the multipole coecients Onm in eqn.(3.4) have been computed, then we may obtain Mjk exactly by eqn.(3.6), i.e. the centre of the multipole expansion may be shifted without incurring an additional truncation error [42].
Amn
q
3.1.3 Conversion of a Multipole Expansion into a Local Expansion During the downward pass, a local expansion, associated with a particular cell i, say, is computed in terms of the multipole coecients of the cells in its interaction set, by the following theorem. Suppose that l particles of strengths q1; q2; : : : ; ql are located inside the sphere DQ of radius a with centre at Q = (; ; ), and that > (c + 1)a with c > 1. Then the corresponding multipole expansion converges inside the sphere DO of radius a centred at the origin. At a point P inside DO , the potential due to the particles 36
q1; q2; : : : ; ql is described by a local expansion: (P ) = where
Lkj =
with
j 1 X X
j =0 k=?j
Lkj Yjk (; ) rj
m Am Ak Y m?k (; ) Onm Jn;k n j j +n m ? k j Aj+n +n+1 n=0 m=?n
1 X n X
[3:8] [3:9]
8 >
0 [3:10] => : (?1)n otherwise and Amn is de ned in eqn.(3.7). Within the 3DFMM, the local expansion, eqn.(3.8), is truncated to p terms, and is calculated from a p-term multipole expansion, eqn.(3.9). There is therefore a truncation error associated with this conversion process. A bound to the absolute error of a truncated version of eqn.(3.8), i.e. p-term local expansion, is investigated in section(3.5.2). m Jn;k
3.1.4 Translation of a Local Expansion This theorem is used to shift the centre of a p-term local expansion of a cell to the eight children cells in the downward pass. Let Q = (; ; ) be the origin of a local expansion (P ) = where P = (r; ; ) and P
p X n X
n=0 m=?n ? Q = (r0; 0; 0)
(P ) = where with
Lkj =
Onm Ynm (0; 0) r0n
p X j X
j =0 k=?j
as before for eqn.(3.5). Then
Lkj Yjk (; ) rj
Onm Jnm?j;m?k Amn??jk Akj Ynm??j k (; ) n?j Amn n=j m=?n p X n X 8 > > > >
0 and jmj < jk j Jj;k > > > > : (?1)j otherwise and Amn is de ned in eqn.(3.7). The translation is exact, and therefore has no associated truncation error. In other words, we may shift the centre of a p-term local expansion without incurring an additional truncation error. 37
3.2 Associated Legendre Polynomials The associated Legendre polynomial is de ned as m dm m m 2 Pn (x) = (?1) (1 ? x ) 2 dxm Pn (x)
where jxj 1, n = 0; 2; ::: and m = 0; 1; : : : ; n. We may then compute the function values Pnm (x), where n,m and x are de ned as above, by using the following recursion relations; p Pmm (x) = 1 ? x2(2m ? 1)Pmm??11 (x)
Pmm+1 (x) = x(2m + 1)Pmm (x) m (x) ? (n + m ? 1)P m (x) (2 n ? 1) xP n ?1 n?2 m Pn (x) = n?m 0 using the starting condition P0 = 1:0. These relations are developed from [67]. The computed values for Pnm (x) were checked against the Tables of Associated Legendre Functions, [14]. The recurrence relation given in [42] is Pnm+2 (x) + 2(m + 1) p 2x Pnm+1 (x) = (n ? m)(n + m + 1)Pnm(x): x ?1 p p The term x2 ? 1 is incorrect as jxj 1, but should read 1 ? x2. In fact most recurrences involving m are unstable, and so are unsuitable for numerical calculations [67]. One feature of the associated Legendre polynomials which can be used to save computation time occurs for the cases = 0 or in eqn.(3.9), since it can be shown for these cases that 8 > < (cos )n m = 0 m Pn (cos ) = > : 0 m > 0: This feature can be utilised1 in the formation of the local expansion from the multipole coecients associated with the interaction set cells, when the cells lie directly above or below the evaluation cell. Consider the form of eqn.(3.9) from section(3.1.3) appropriate to the conversion of a p-term multipole to a local expansion, namely Lkj =
m Am Ak Y m?k (; ) Onm Jn;k j j +n n m?k j +n+1 A j +n n=0 m=?n p X n X
1The feature may also be employed when the multipole expansions are formed and during the nal evaluation, where a particle may lie in just such a position, although this would be extremely rare.
38
where 0 j p and jkj j , which is to be calculated for a particular cell when = 0 or ; then by substitution in eqn.(3.2) we nd 8 >
:
(cos )j+n m = k 0 m= 6 k
and hence we have the reduced form p Ok (?1)n+jkj Ak Ak (cos )j +n X n n j : Lkj = 0 j +n+1 A j +n n=jkj The computation involved with eqn.(3.9) may be reduced further for any cell in the interaction set, if we employ the following property of the associated Legendre polynomial: Pnm (?x) = (?1)n+m Pnm(x): This reduces the number of times the spherical harmonic in eqn.(3.9) is calculated by roughly a factor of 2. Furthermore, the computation involved with eqn.(3.12) may also be reduced. If jm ? kj > n + j then Ynm+?j k (; ) is unde ned. This property reduces the number of terms in the summation.
3.3 The Time for Execution of the 3DFMM As with the 2-dimensional case, the order of the time for execution of the 3DFMM is now derived in terms of an operation count depending on the parameters n, N and p. At the nest level n, for each of the N particles, a p2-term multipole expansion is formed (Np2 operations). For the translation of the multipole expansions there are 8l+1 p4 operations for l = n ? 1; :::; 2 since for each cell at level l a p2-term multipole expansion is computed from 8 other p2-term multipole expansions. Within the downward pass, the conversion of a multipole to a local expansion contributes 875 8lp4 operations for l = 2; :::; n since for each cell at level l a p2-term local expansion is computed from a p2-term multipole expansion from each of the cells in the interaction set. Also in the downward pass we have the translation of local expansions, where there are 8l+1 p4 operations for l = 2; :::; n ? 1 since for each cell at level l, eight p2-term local expansions are computed from a p2-term local expansion. At the nest level for each particle there is a p2-term local expansion to be evaluated (Np2 operations), followed by the `direct' summation over the near- eld cells, each 39
containing a maximum of s particles (Ns operations). By explicitly summing these terms we nd that n p4 ? 57024 p4 + Ns: operation count = 2Np2 + 7016 8 [3:13] 7 7 By eqn.(2.13), the asymptotic time can be expressed as t Np2 + Ns p4 + Ns since N = s8n and the constants of proportionality are ignored. Following Greengard's work for the 2-dimensional case [17, 39] we dierentiate to calculate an optimum value for s, i.e. @t ?N p4 + N = 0 ) s / p2: @s s2
The value for s is therefore proportional to p2, which is in contrast to the result s / p in 2 dimensions. By substitution, we nd that the asymptotic time is of O(Np2 ). If s = 1 then the asymptotic time is of O(Np4 ).
3.4 Implementation Details The methods of implementing the 2DFMM which were discussed in section(2.4) are also applied for this 3-dimensional case. For example, the cell numbers of the parents (or the children) are located directly through a single expression, where the numbering is de ned by $
%
parent(i) = i +8 6 and child(i) = 8i ? k + 2; k = 1; 2; :::; 8 and i 1: Consider now the arrays of multipole and local coecients: if the coecients Mnm, n = 0; p and m = 0; n, are stored in a p p 2-dimensional array, then since Mnm is unde ned if m > n, around half of the entries in this 2D array may be assigned to zero. This waste of memory space is avoided by storing the coecients in a 1-dimensional array, without storing coecients Mnm, such that m > n.
3.4.1 Exploiting the Symmetry of an Oct-tree The following scheme exploits the symmetry inherent in the process of translating the local expansions. This idea was introduced by Greengard and Rokhlin in [43], where it was described as a precursor to how the Fast Fourier Transform can be 40
employed to reduce the computational time to form the local expansions O(p4 ) to O(p2 log p). Unfortunately this technique does not work well for large p as it involves large factorials which will exceed the computer's limitation on the storage of integers. The method of successive fraction multiplication, indicated in section(2.4), may alleviate this problem. The Fast Fourier Transform was not implemented in the work described in this thesis, however, for lack of time. The symmetry involved during the translation of local expansions is more complex than in the 2-dimensional case (section 2.4.1). Consider the spherical harmonic term in the expression for Lkj in eqn.(3.12), i.e. v u u t
j ? jm ? kj)! P jm?kj (cos )ei(m?k) ; Ynm??j k (; ) = ((nn ? ? j + jm ? kj)! n?j
p
where i = ?1. Note that the terms Lkj for each of the 8 child cells dier only because of their dependence on and . The eight child cells can be de ned in terms of a shifting vector centred on the parent cell, at level= l, which points to the centres of the 8 children cells, which have the coordinates
p
3d ; cos?1 (?p1)u ; + v (; ; ) = 2l+2 3 4 2 where u = 0; 1, v = 0; 1; 2; 3. (The cube at level= 0 is of side d). (In [41] the shifting vector is incorrectly de ned as ; 4 + u; 4 + v2 with u = 0; 1 and v = 0; 1; 2; 3:) If we de ne function Gm;k n;j as !
!
Onm Jnm?j;m?k Amn??jk Akj Ynm??j k (; ) n?j Amn
Gm;k n;j = with
!
p1 and = 4 ; 3 then we can de ne the following 8 functions = cos?1
8 p X n ca, then ai is bounded above by 1c , therefore we have 1 1 n A X A 1 p+1 = abs < ca [3:20] (c ? 1)a c ; n=p+1 c where A = Pki=1 jqij. An estimate of the relative error-bound may be given by dividing the absolute error bound by an estimate of the mean eld. The largest absolute error is given by a con guration in which all charges have i = ? a, and so an appropriate estimate of the mean eld is ?Aa . Then, in a similar manner to that followed in eqns.(3.19) and (3.20), we have p [3:21] < (c ?1 1) 1c :
This implies that p may be de ned as
p = d? logc ((c ? 1))e where c = a ? 1: [3:22] Greengard [42] applies a number of further inequalities to eqn.(3.20) and derives the expression p = d? logc ())e; [3:23] which is more conservative than eqn.(3.22). Moreover, in the case of the 3DFMM, he assigns c = 2, and this c-value has been employed by all other implementations of the 3DFMM known to the author. The value of c has a profound eect on the value of p, where c is de ned by c = a ? 1, and where is the distance between the centres of the multipole expansion 46
and the local expansion and a is the radius of the two spheres associated with the two expansions. When the hierarchy is introduced, c is determined by the nearest cell which is well-separated, hence when using second nearest neighbours in the near- eld, we assign c = p63 ? 1 2:4641. When the near- eld is de ned as only nearest neighbours, as is the case for some implementations of the 3DFMM [71], then c = p43 ? 1 1:309. Table 3.1 shows three expressions which have been used to determine p. The estimate d? logc()e is the expression derived by Greengard, where he assigns c = 2. The second column in table 3.1 replaces this value with c = 2:4641. The expression
d? log2()e d? logc ()e d? logc((c ? 1))e 10?2 7 6 5 ? 3 10 10 8 8 ? 4 10 14 11 10 10?5 17 13 13 ? 6 10 20 16 15 Table 3.1: Estimates for p, for varying ; c = 2:4641.
d? logc((c ? 1))e is eqn.(3.22) above. It can be seen that the number of terms in the
expansions may be reduced through the appropriate expression for p and choice of c, and since the time to compute the local expansions is of O(p4 ), then the execution time for the FMM can be reduced dramatically. Using a value c = 2 leads to taking too many terms in the expansions when using second nearest neighbours in the near- eld, and this will increase the execution time. When using only nearest neighbours in the near- eld, assigning c = 2 results in taking too few terms and this will result in a loss of accuracy. More speci cally, let us consider in more detail the expression derived in [42], i.e. eqn.(3.23), namely p = d? logc()e. Using the value c = 2:4641 can reduce the number of terms in the expansion from the p-value associated with c = 2; for small this reduction is a factor of O(:77). This can be shown by considering the ratio pc=2:4641 = d? log2:4641()e : [3:24] pc=2 d? log2()e For the case for small , so that we may write d? log e ? log then pc=2:4641 log2:4641() = log(2) 0:77 : pc=2 log2() log(2:4641) 47
From this analysis, we may determine that the speed-up obtained when calculating the local expansions is approximately a factor of O(3) for small . The prescribed accuracy is not compromised by this change of c. The ratio given in eqn.(3.24) is calculated explicitly for a range of relative error, , and is plotted in gure 3.6. The maximum value of the ratio is 1, and the least value is 0:6. Thus eqn.(3.24) implies
dlog2 4641 e dlog2 e :
1 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55
10?2
10?2
10?6
10?8 10?10 10?12 Figure 3.6: Equation (3.24) plotted for varying .
a reduction in p of order 0.6 at best, and at worst there is no reduction at all. The formation of local expansions has a complexity of O(p4 ), and so if c is given the value c = 2:4641, instead of the value c = 2, then this will result in a speed up by 4 a factor of O 01:6 O(7:7) at best; at worst there will be no speed-up. We now note that two dierent c-values are used in the two error-bound expressions, eqn.(3.15) and eqn.(3.21), in the same manner as the c's dier in the 2DFMM. We shall denote the c-value associated with the error-bound for the truncated multipole expansion as cm, and similarly for the truncated local expansion error-bound we denote this c-value as cl. Using this notation we have cm = ar , where r is the distance from the evaluation point to the centre of the multipole expansion; and cl = a ? 1, where is the distance between centres of the local expansion and the multipole expansion. If we de ne the evaluation point for a multipole expansion as a point on a sphere which delineates a local expansion, and is the closest point to the centre of the multipole expansion, then we have r = ? a. By substitution, the cm-value becomes cm = ar = ?a a = a ? 1 = cl: 48
If we substitute this value for cm in the associated error-bound, eqn.(3.15), we may see that the error-bound of both the multipole and local expressions, eqns.(3.15) and (3.21), are in fact equivalent at this point of evaluation. This is the point where the theoretical maximum error occurs. Hence the evaluation of a truncated local expansion does not contribute a larger maximum error than the largest maximum error incurred by the evaluation of a truncated multipole expansion. So we may calculate the required number of terms from eqn.(3.22), namely p = d? logc((c ? 1))e where c = a ? 1:
3.5.3 Accumulated errors As in the 2-dimensional case, the shifting of expansions performed within the FMM, i.e. shifting the centre of a truncated multipole expansion or local expansion, are exact, and so do not contribute an error to the global truncation error of the FMM. Therefore increasing the number of levels, which increases the number of shifting operations, has no eect on the global truncation error. This is borne out in numerical experiments [5].
3.5.4 Numerical Trials Errors from a large system A system with 5000 uniformly distributed particles with strengths qi = +1:0 was tested. Using the optimal 3 levels in the 3DFMM, the potential was calculated once in double precision using p = 3, which is a p-value used in other versions of the 3DFMM [60, 85]. This value is obtained from c = 2:4641, = 4:5 10?2 and p = d? logc((c ? 1))e. A constant p is used throughout. The `direct' method was used to calculate the `exact' values using double precision, and the error was calculated as the absolute dierence between the two. It is likely that an error very close to the maximum possible error will occur for some pair of particles, i.e. each will lie near the extremity of its own cluster. The calculated error is shown in table 3.2 along with the relative error de ned in two dierent ways. (For this case, level= 0 corresponds to the unit cube, is the distance between the two nearest p well-separated cells, which will be at the level= 2, i.e. = 83 and a = 162 .) The error max is the biggest observed error, i.e. max = imax =1;N (i ), which corresponds to 49
error numerical value prescribed: 4:5 10?2 maximum: max 25.760 max relative: 5000 5:15 10?3 a)max relative: (?5000 1:48 10?3 Table 3.2: Computed absolute and relative errors for a 2DFMM a discrete maximum error norm. It can be seen that the relative precision achieved was better than the prescribed precision, which may imply that more terms have been taken than necessary for this set of parameters.
Errors from a two particle system. Employing just 2 particles not only reduces the computational time, but by testing various pairs one can produce a speci c error of interest and acquire a working understanding of the nature of the errors. The position of one particle is used as a point of evaluation, while the other has a strength of +1:0. A xed number of terms in the expansions (multipole and local) are used, and so now the empirical error can be compared to the theoretical error-bound. The 3DFMM was executed, without invoking the hierarchy, with one multipole expansion formation, the conversion to a local expansion, and nally the potential evaluation. This means that the particle's potential has undergone every kind of transformation which induces a truncation error. (The formation of a truncated local expansion has the same maximum error as the truncated multipole expansion error, and the shifting of a multipole or local expansion does not contribute a truncation error.) We now assume that the maximum error induced by this numerical experiment is bounded from above by the error of the truncated multipole expansion plus the error of a truncated local expansion, i.e. twice the absolute error-bound, abs in eqn.(3.22), of the truncated local expansion. Hence we assume that
emp = jFMM ? exact j mult + local = 2 abs: This implies that a bound on the relative error is given by = ( ? Aa)emp 2( ?Aa)abs where is the smallest distance between the centres of two well-separated cells, a is the radius of the multipole expansion and A = P jqij. Here we have A = 1, and 50
so = ( ? a)emp: We wish to relate the empirical errors to expressions of the theoretical maximum error, and so we plot half of the relative empirical error, since abs 21 max(emp). Many pairs of points were analysed, but the 4 pairs of points illustrated in gure 3.7 were found to characterise the largest errors when de ning the near- eld as the nearest and second nearest neighbours, as is the case in the original form of the 3DFMM. (We evaluated at one point due to a unit particle at the 1
1
2,4 3
3 2,4
ELEVATION
1
4 2,3
END ELEVATION
Figure 3.7: Second nearest neighbours con guration, c = 2:4641 other, similarly numbered point.) `Pair 3' lie on the surfaces of the spheres which circumscribe the cubes. It should be noted that the points `3' were contrived to lie on the spheres centred on the most separated cubes in gure 3.7, but these points would normally be allocated to the cube in which they lie. It was discovered empirically that other points chosen would either produce errors less than or equal to the errors of the four pairs of points shown. This con guration of points produces the same results if the axis of symmetry lies along any of the three spatial dimensions; x, y and z. The empirical errors are plotted in gure 3.8 on a log-linear graph, versus p, along with certain theoretical error-bounds. The error associated with `pair 3' is the maximum error associated with the spheres in the formulation of the theoretical error-bounds. This error does not arise from the FMM, since, as already stated, the algorithm assigns these points to the cube in which they lie. It is seen from the gure 3.8 that `pair 1' contributes the largest error for any pair of points associated with the 3DFMM. At this point it is worth noting that as p increases the actual maximum error which arises from the use of the FMM may also increase. As p is incremented from 11 to 12, the error associated with `pair 1', also increases. This is discussed with regard to `super-nodes' in section(4.1). A check can be made on the validity of the empirical errors produced by this 2 particle system, by comparing these errors to the errors produced by the 5000 particle system in the previous section. For the large system we had p = 3 and 51
1
2 2 2**(-p) Pair 1 3 + 2 2 c**(-p) Pair 2 + 3 + 3 3 2 2 1/(c-1)*c**(-p) Pair 3 2 + 3 2 Pair 4 2 2 + 3 + 3 2 2 3 3 + 2 2 + 3 2 2 3 + 2 2 + 3 3 3 2 2 + 3 2 + + 3 3 3 + 3 1e-10
2
4
6
8
10
12
14
16
18
20
p Figure 3.8: Empirical and theoretical errors, c = 2:4641
p
used c?1 1 1c ; which is the third bound in gure 3.8, to calculate a theoretical maximum relative error of = 4:5 10?2 ; which is consistent with gure 3.8. Moreover, the observed maximum relative error for the large system was 1:48 10?3 , which corresponds approximately to the relative error associated with `pair 1', ( = 2:24 10?3 when p = 3), plotted in gure 3.8. Therefore we may conclude that this 2 particle system is indicative of the errors which arise from a larger system. There are three theoretical error-bounds printed on gure 3.8. The rst is the bound used by Greengard [42] in his 3DFMM, i.e. < 2?p , and its use as an estimate for p is widespread. This line is an upper bound to all the errors, but use of this estimate leads to the use of more terms than necessary, as already stated. The second bound represents eqn.(3.23), i.e. < c?p, and is indeed a more stringent bound, as previously stated. The third bound is the error-bound associated with the truncation of a local expansion, eqn.(3.21), and appears to be a very good estimate of the maximum error. Other con gurations of cells are now examined to test the generality of eqn.(3.23) for dierent con gurations of cells. The 4 pairs of points in gure 3.9 characterise the algorithm when using only nearest neighbours in the near- eld, which is the case in [71]. `Pair 5' lie on the 6
6
7,8 5 5 7,8 ELEVATION
6 7
5,8
END ELEVATION
Figure 3.9: Nearest neighbour con guration, c = 1:309 52
surfaces of the spheres which circumscribe the cubes, as before. Once again, this con guration of points produces the same results if the axis of symmetry lies along any of the 3 spatial dimensions, x, y, and z. The empirical errors are plotted in gure 3.10 on a log-linear graph, versus p, along with the theoretical error bounds. It was found empirically that `pair 6' produces the largest error for any pair of points occurring in the FMM. 13 3 3 3 3 3 3 3 3 3 3 3 3 3 +2 3 3 3 3 3 + + + 2 + 3 2 2 + + 2
1e-10
2
4
2 2 + + + + + + + + 2 2 + + + + 2 + 2 2 2 2 2 2 2 2 2 2
6
8
10
12
14
16
18
3 + 2
2**(-p) c**(-p) 1/(c-1)*c**(-p) Pair 5 Pair 6 Pair 7 Pair 8
20
p Figure 3.10: Empirical and theoretical errors, c = 1:309
From gure 3.10 we can now quantify the precision lost for such a near- eld, when using c = 2 in the expression p = d? logc e, by considering the rst bound, i.e. < 2?p . It would appear that this choice of c is permissible for 10?4 or for p 12. If a higher tolerance is required, however, then the equation for p would have to be revised. The estimate < c?p, where c = 1:309, does encompass the error which arises from the FMM, but the values of the maximum error are greater. The third bound, eqn.(3.21), corresponds to the maximum error. This is similar to the result obtained for the previous con guration of cells, where the near- eld was de ned by the second nearest neighbours. Another pair of points were chosen (which occur when using only nearest neighbours in the near- eld) with c = 3:0. Each point lies on the vertex of its associated cube, and hence on the surface of the sphere which circumscribes that cube. This sphere is the one associated with the formation of the error-bounds. The centres of the two spheres and two points under consideration, all lie on a straight line through space, analogous to the 2-dimensional case depicted in gures 2.11 and 2.12, and so the radial component of the error is maximised. We may maximise the angular 53
component of the error, Pn(cos ) (cf. section 3.5.2), if = 0. If we consider the centre of the local expansion as an origin, then is the angle subtended between the position vector of the source particle and the position vector of the evaluation point, see gure 3.5, and in this case, = 0. Hence the maximum error incurred by the FMM will now equal the theoretical maximum error. The empirical and theoretical errors are plotted in gure 3.11. (Note that when the theoretical maximum actual error matches the maximum error of the FMM, then as p increases, the error decreases monotonically.) 1 3
3
3
3
3
3
1e-10
2
4
6
3
2**(-x) c**(-x) 1/(c-1)*c**(-x) Pair 9 3 3
3
8
3
10
3
3
3
12
3
14
3
3
16
3
3
18
p Figure 3.11: Empirical and theoretical errors, c = 3:0
3
20
From gure 3.11 it is clear that using a c-value of 2 is too conservative, and that using c = 3 is more economical. Once again eqn.(3.21) corresponds to the theoretical maximum error, as before. It would be useful to parameterise c, and in turn the error-bounds, in terms of cubes instead of spheres to determine the actual maximum errors which will arise in the FMM, in a similar manner to the `face-distance' parameter introduced by Warren et al. for the Barnes and Hut Algorithm [84] (section 4.3). In the rest of this thesis, eqn.(3.21) will be used to determine the value for p, namely p = d? logc ((c ? 1))e; where c = a ? 1; as this was justi ed both analytically and empirically.
54
3.6 The Dynamic p Principle As with the 2-dimensional case, the fact that p is a function of distance (i.e. depends on r through c), as well as a function of precision, can be exploited within the FMM during the downward pass. At this stage multipole expansions are converted to local expansions - the most expensive calculation of the algorithm. The number of terms taken from each multipole expansion in forming the local expansion can be de ned as pdyn , where pdyn p. In the conversion section, a truncated version of eqn.(3.9) is computed, i.e.
Lkj
m Am Ak Y m?k (i ; i ) Onm(i) Jn;k n j j +n ; = j + m ? k Aj+n i n+1 n=0 m=?n pX dyn
n X
for each cell i in the interaction set centred at (i; i; i) with an associated multipole m and and Am are de ned as before, 0 expansion with coecients Onm (i), where Jn;k n j p; jkj j ; furthermore we de ne pdyn = d? logci ((ci ? 1))e, where ci = ai ? 1. Using a constant p value, i.e. p = max i (pdyn ), the order of time for this calculation is of O(p4 ), but by using the dynamic p principle, the order of time becomes O(p2p2dyn ) (for a single cell). Now consider a full interaction set in 3D, in which the near- eld is de ned by nearest neighbours only. There are 189 cells, each with an associated c-value. The smallest of these is c = 1:309. The speed-up can be estimated as for the 2-dimensional case, see section(2.6). In 3 dimensions, with = 10?3 , the result is time with a constant p = 189p4 8:6; 2 2 time with a dynamic p P189 cell=1 p pdyn where the series in the denominator has been computed explicitly. Thus there is a speed-up by a factor of approximately 8. Now consider a full interaction set in 3D, using nearest and second nearest neighbours to de ne the near- eld, and where = 10?3 . There are 875 cells, each with an associated c-value, and min(c) = 2:4641. The speed-up may be found approximately by computing the ratio 875p4 3:7 P875 2 2 cell=1 p pdyn i.e. the speed-up is by a factor of 3 approximately.
55
3.7 The Optimum Interaction Set Now we consider the relative merits of dierent de nitions of the near- eld. Recall that the interaction set is de ned as follows; int(i)=fcells j : j 2= near- eld(i) , j 2 children(near- eld(parent(i)))g In the original form of the 3DFMM [42] the near- eld was de ned as the nearest and second nearest neighbours, giving c 2 [2:454; 9:0] and a total of 875 cells in a full interaction set. In this case the calculation of local expansions is of O(875p4 ). Some cells at the same level of re nement which lie outside the interaction set of cell i are closer to cell i than some cells within the interaction set. It seems more geometrically appropriate to de ne the near- eld, and hence the interaction set, in terms of a radial distance from cell i, rather than the standard cubic domain. (In fact the near- eld, and hence the interaction set, may be de ned by any con guration of cells, provided the near- eld set surrounds cell i.) Other versions of the 3DFMM have de ned the near- eld as the nearest neighbours only, giving c 2 [1:3094; 5:0] and a total of 189 cells in a full interaction set. Now the calculation of local expansions is of O(189p4 ) for a constant p throughout. This gives fewer cells in the interaction set (compared with the case in which the near- eld is de ned to include also second nearest neighbours) at the cost of more terms in the expansions. It was decided to develop a de nition of the interaction set which would yield the shortest execution time for the conversion section, where local expansions are formed from multipole coecients. This was done by calculating the operation cost of the this section for various interaction sets; all de ned in terms of a radial distance. Using the dynamic p principle, the order of the operation cost may be 2 2 estimated as Pint i=1 p pdyn where int is the total number of cells in the full interaction set and p = max i (pdyn ). The value for p will decrease as the extent of the near eld increases. This sum was computed explicitly for = 10?1 ; 10?2 ; :::; 10?9. An optimal size of the interaction set size was found to lie between the values for the two standard de nitions, stated above, for all . This optimal set has 651 cubes, corresponding to 92 near- eld cells which lie less than a radial distance of 3d away, where d is the length of one side of a cube. Here c 2 [2:4641; 7:24621]. Zhao [85] also de nes a near- eld in terms of a radial distance, and has an interaction set with 567 cubes, so that c 2 [2:26599; 6:57188]. In his formulation of 56
the 3DFMM, he also uses c = 2 to determine the required number of terms p, and indeed his is the smallest of the possible interaction sets which satis es c > 2. A smaller interaction set would give c < 2. Thus Zhao has changed the extend the near- eld, rather than rede ning the c-value. See section(4.1) for a discussion of Zhao's 3DFMM.
3.8 Timing Results for the Sequential 3DFMM We now present times for the sequential version of our modi ed 3DFMM running on a Sun Sparc ELC workstation. This modi ed version incorporates all the techniques for speed-up discussed above. We have the near- eld described in terms of a radial distance, rather than a cubic con guration, with 651 cells in a full interaction set. This interaction set gives c 2 [2:4641; 7:24621]. We use the dynamic p principle, such that p = d? logc((c ? 1))e where c = a ? 1. The basic element of adaptivity is 100000 10000 1000 Time (secs) 100
+
10 1
3
10
3
+ + 3 3
100
+ +
3 3
2 2 3 + + + 3 3
+ 2 levels 3 3 2 2 3 levels + 2 + 4 levels 2 +
3
1000 10000 100000 Number of particles
1000000
Figure 3.12: Typical timing curve for the 3DFMM, p = 3. incorporated (cf. section 3.4.3). All results are computed in double precision with p = max i (pdyn ) = 3. The graphs shown in gure 3.12 are typical timing curves of this FMM, and correspond to various choices for n (the total number of levels). Consider as an example the FMM executed with 2 levels. For small N , N < 200, the element of adaptivity removes any calculations accessing empty cells. As N increases, the cells of the tree ll up and the timing curve displays a point of in ection; we say that the tree is now a full tree. This is in contrast to the 2-dimensional case, where the basic element of adaptivity is not invoked, see gure 2.15. Eventually N becomes so large, N > 2000, that the `direct' interaction of the near- eld at the 57
nest level dominates the time for execution; which therefore becomes of O(N 2 ). If the optimal number of levels, nopt say, is employed for a particular N , then this `fastest' timing curve is given by the lower envelope of all timing curves. This timing curve is a continuous line and is theoretically O(N ), by eqn.(3.13) (cf. gure 3.13). The actual equation of this timing curve can be approximated if we assume that time varies as aN b, since this appears as a straight line on a log-log graph, hence a straight line regression may be performed through the points which form the lower envelope, over the range N 2 [500 : 200000]. The regression analysis suggests a relationship time = 0:0053 N 1:26secs. Figure 3.13 shows the timing curves for p = 3 and p = 8 as well as the result for the pairwise direct calculation. The optimal number of levels, nopt, was derived in the same manner as in the 2D case, that is numerous programs were executed with various values of N and each n. The value for nopt was then found with the aid of a graph, such as that displayed in gure 3.12. 100000
p=8 p=3
10000
direct
Time (secs) 1000 100 10 1000
10000 100000 Number of particles
1000000
Figure 3.13: Timing curves for the 3DFMM, p = 3 and p = 8, and the direct summation method. The value of N where the performance of the FMM is faster than the direct summation method, shown here as direct, is known as the break-even point and is a useful measure of the performance of the FMM. Here this point is N 3500 for p = 3, and N 15000 for p = 8. A version of the FMM which employs Fourier transforms [43] has a break-even point of N 16000 for p = 8, which indicates that the dynamic p principle is competitive with this method of reducing the execution time. 58
As N increases, then so does nopt, see gure 3.12. Theoretically the dependence of nopt on N depends on s, the maximum number of particles per leaf cell. In analogy to the 2-dimensional case, values of s can be computed by locating the value for N at which the time for n0 levels say, exceeds the time for n0 + 1 levels, so that s is given by s = 8Nn . By gure 3.12 it can be shown that the average value of the two possible values for s is s = 171. When p = 8 the corresponding result is s = 955. The optimal value of s, and hence nopt, will depend on the particular implementation of the 3DFMM, but as in the 2D case (section 2.7), it is clear that s is dependent on p. In section(3.3) we showed that s is, in fact, proportional to p2. This implies that the optimal number of levels now varies as ! N N nopt = log8 s log8 p2 : 0
Therefore for a given N , as p increases the optimal number of levels should decrease to minimise the time for execution. The value of n has a great in uence over the execution time of the FMM, but nopt cannot be determined analytically, therefore for practical applications, the optimal n can be located by `operation counting'. This scheme is presented in detail by Anderson in [5]. Brie y stated, `dry runs' are performed, counting the amount of computation for the given parameters, N , and the various values for n. The value of n which achieves the lowest operation count is the value to be used in the actual execution of the FMM.
Chapter 4

A Discussion of Alternative Fast Summation Methods

We now discuss certain other fast summation methods, the manner in which they differ from the FMM, and their practicality. The methods discussed here are the N-body Treecodes of Zhao [85] (in particular the use of `super-nodes'), Anderson [5], and Barnes and Hut [12], and finally the non-hierarchical fast summation method of Buttke [15]. The FMM and the Barnes-Hut Algorithm are generally considered to be the two main fast N-body solvers and are included in the SPLASH suite, Stanford Parallel Applications for Shared-Memory [73], at Stanford University.
4.1 Zhao's Cartesian 3DFMM

Zhao employs the same hierarchical framework as the 3DFMM. Multipole expansions and local expansions are formed, as in the 3DFMM, but the algorithm is formulated in cartesian co-ordinates, as opposed to the spherical co-ordinates of the 3DFMM. He uses p = floor(-log_c(eps)) with c = 2, the commonly adopted c-value, but a different near-field, and therefore a different interaction set, is defined, so that the nearest cell gives c = 2.26599 (cf. section(3.7) on the optimisation of the interaction set). For p = 3 the break-even point with the `direct' method is 1000 particles. Simon [76], assuming that s = 1, finds that the order of Zhao's 3DFMM is O(Np^6), in contrast to Greengard's 3DFMM which is O(Np^4), and it is found to take longer to compute the potential for the same parameters. Simon also finds that
Zhao's version is only practical for eps > 0.01, since the precision cannot be guaranteed for every problem. There are two reasons for this. Firstly, consider the expressions employed to convert a multipole expansion to a local expansion, eqns.(3.8)-(3.10). In the spherical formulation of the FMM, the local expansion coefficients are calculated using eqn.(3.9), where the infinite sum is truncated to p terms. To relate Zhao's cartesian formulation to the spherical formulation, the upper limit of the truncated outer sum of eqn.(3.9) is reduced from p to p - j [76], thus reducing the precision and, indeed, the execution time. The second reason for the loss of precision is a labour-saving device which Zhao introduced in 1987, namely the `super-node'. A super-node is a parent cell with all its children in an interaction set. The parent's p-term multipole expansion is then used, rather than those of its children. The super-node is purely a feature of the hierarchy and does not depend on the co-ordinate system employed. Figure 4.1 illustrates the super-nodes, which are circled,
Figure 4.1: Super-nodes in 2 dimensions.
for the 2-dimensional case. It would appear that the time saved by the use of these super-nodes is roughly a factor of four for 2 dimensions, and a factor of eight for 3 dimensions [85], as it is assumed that the number of terms remains constant. However, this gain in speed-up comes at the cost of lost accuracy. Leathrum and Board [58] found that the general use of super-nodes causes the prescribed error to be exceeded, and to compensate for this lost accuracy they suggest that one should simply take one extra term from the super-node's multipole expansion. But this is not robust, as incrementing p by 1 may actually increase the error which arises from the FMM, as was seen in figures 2.9 and 3.8 in sections (2.5.4) and (3.5.4) respectively. We now investigate the use of super-nodes in 2 and 3 dimensions and determine whether time can be saved without losing precision in certain cases.
4.1.1 The Use of Super-nodes in 2 Dimensions

When the near-field is defined as consisting of the nearest neighbours, there are 5 possible super-nodes, which are circled in figure 4.1. Each super-node has a c-value associated with it which determines the number of terms required to retain the given precision. Recall that c = |z|/r - 1 (cf. section 2.4.2), where |z| is the distance between the centres of the cells, and r is the radius of the sphere circumscribing the super-node cell. There are 3 distinct c-values, namely 0.8028, 1.0616 and 1.5, associated with the 5 super-nodes. From the definition in Greengard [42] for the conversion of a multipole expansion to a local expansion, we require c > 1 to ensure convergence, as the multipole expansion is bounded by a power series in 1/c. If c < 1 then the disc containing the cluster overlaps the disc on which the local expansion is to be defined. However, the original form of the FMM uses spheres of the same size. It can be seen from figure 4.1 that the disc of the local expansion has half the radius of the multipole disc, so the discs do not overlap, and it would therefore appear that c_super = 0.8028 is a potentially valid value for c. It would seem that the time saved by the use of these super-nodes is roughly a factor of four, assuming that the number of terms remains constant, because each super-node replaces its four children. Let us now consider the time saved by the most distant super-node, but without compromising the prescribed precision. The value of c for the most distant super-node is c_super = 1.5, and that for the closest cell in the interaction set is c = 1.828. If p terms are required for a local expansion, for a small eps, then the most distant super-node requires 1.49p terms to retain the same precision. This multiplying factor is calculated as follows. If eqn.(2.15) is used to determine p, then the multiplying factor is given by

    p_super/p = (ceil(-log_{c_super}(eps)) - 1) / (ceil(-log_c(eps)) - 1).    (4.1)

For small eps we may write ceil(-log_c(eps)) - 1 ~ -log_c(eps), so that

    p_super/p ~ log_{c_super}(eps)/log_c(eps) = log(c)/log(c_super) = 1.49.

Therefore the 4p^2 operations required for the four children in the FMM are replaced by (1.49p)^2 = 2.22p^2 operations if the most distant super-node is used, for a small eps, which implies a reduction in the number of operations by a factor of 1.8. In analogy to the analysis of eqn.(3.24), the effect of the ceiling operators in eqn.(4.1) is to
produce values for p_super/p which range from 1.25 to 1.75 inclusive, for eps <= 0.1, but this still implies a reduction in the number of operations. However, the other super-nodes all require more operations than the most distant super-node, and it was discovered empirically that they increase the number of operations for eps <= 0.1. We now consider the use of super-nodes if the dynamic p principle is to be employed. The time required for converting a multipole expansion to a local expansion, when using super-nodes in 2D, can be described by O(p p_super). As before, p is determined by the closest cell used in the formulation of the local expansion. Usually this is the closest cell in the interaction set, but a super-node may have a smaller c-value, so that p_super > p. The largest p-value determines the p-value which is to be used throughout the FMM, and so we use max(p, p_super). We have p = ceil(-log_c(eps)) - 1, by eqn.(2.15), and will furthermore suppose that eps = 10^-3. If we compute the operation count using the super-node with c_super = 1.5, we find that

    max(p, p_super) p_super = p_super^2 = 17^2 = 289.

In comparison, the order of the time for the 4 child cells is

    Sum_{child=1}^{4} p p_child = 11 Sum_{child=1}^{4} p_child = 11 (4 + 4 + 4 + 6) = 198.

This implies that use of the most distant super-node increases the time required if eps = 10^-3. It was discovered empirically that the time is increased for all eps <= 0.1, and therefore the use of any of the super-nodes in 2 dimensions will increase the execution time of the FMM if the dynamic p principle is employed.
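The arithmetic above is easily checked; the following Python sketch recomputes the 1.49 factor and the 289 versus 198 operation counts, assuming only the c-values and the child p-values (4, 4, 4, 6) quoted in this section.

    import math

    def terms_2d(c, eps):
        """p = ceil(-log_c(eps)) - 1, the 2D rule of eqn (2.15)."""
        return math.ceil(-math.log(eps) / math.log(c)) - 1

    c_closest = 1.828      # closest cell in the 2D interaction set
    c_super = 1.5          # most distant 2D super-node

    # Asymptotic factor of eqn (4.1) for small eps: log(c)/log(c_super)
    print(math.log(c_closest) / math.log(c_super))   # ~ 1.49

    # Dynamic-p operation counts for eps = 1e-3
    eps = 1e-3
    p = terms_2d(c_closest, eps)                     # 11
    p_super = terms_2d(c_super, eps)                 # 17
    print(max(p, p_super) * p_super)                 # 289, using the super-node
    p_children = [4, 4, 4, 6]                        # child p-values quoted in the text
    print(p * sum(p_children))                       # 198, using the four children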
4.1.2 The Use of Super-nodes in 3 Dimensions

Let us consider the near-field defined as consisting of the nearest and second nearest neighbours, which gives c = 2.4641. If p terms are required for the local expansion, for a small eps, then the three nearest super-nodes, which share the same c-value (which can be shown to be c_super = 1.06155), require 15p terms to retain the same precision. If we determine p by eqn.(3.23), then for small eps we can write

    p_super/p = ceil(-log_{c_super}(eps)) / ceil(-log_c(eps)) ~ log_{c_super}(eps)/log_c(eps) = log(c)/log(c_super) ~ 15.    (4.2)

As before, the effect of the ceiling operators in eqn.(4.2) is to produce values for the ratio p_super/p which range from 15 to 21 for c_super = 1.06155 and all values of eps. The extra
time required, due to the increase in p, is roughly a factor of 15^4 = 50625 at best, and so there is a huge net increase in time. However, if the most distant super-node is used, with c_super = 3.5, then eqn.(4.2) implies that only 0.72p terms are required for small eps, hence reducing the execution time. There are, in fact, 98 possible super-nodes, with only 20 distinct c-values due to symmetry. These are listed in table 4.1. The column headed `Factor of p' is calculated using eqn.(4.2) for small values of eps, and is the factor by which p must be multiplied to retain the precision. A factor exceeding 1 implies an increase of time for all p-dependent sections of the FMM. The column headed `Time ratio' is calculated in the following manner: a p^2-term local expansion is calculated from a p_super^2-term multipole expansion, and so the time required when using the super-nodes is of O(p^2 p_super^2), where p is determined by the closest cell used in the formulation of the local expansion. As in 2 dimensions, this is usually the closest cell in the interaction set, but a super-node may have a smaller c-value, so that p_super > p; therefore, as was the case in the previous section, we must employ max(p, p_super) throughout. Then the `Time ratio' can be defined as

    Time ratio = (time for super-node)/(time for eight children) = max(p^2, p_super^2) p_super^2 / Sum_{child=1}^{8} p^2 p_child^2.

For table 4.1 we set eps = 10^-4, and use p = ceil(-log_c(eps)), by eqn.(3.23), to calculate the `Time ratio'. This expression for p is the most commonly used, but here we vary the c-value. If the `Time ratio' is less than one, then the use of that super-node reduces the time required by the eight child cells. In Table 4.1 it can be seen, from the lower 12 rows, that there are a total of 47 super-nodes which can be utilised to decrease the overall time for the conversion section of the FMM. However, there is a region in Table 4.1, the middle 4 rows, where the time is reduced but p is calculated from the super-nodes. This p must be used throughout the FMM and so will slow down the other sections of the program. Therefore the use of the 24 super-nodes in these middle 4 rows may or may not reduce the overall execution time, depending on the details of the individual implementation. Where the near-field is defined as consisting of the nearest neighbours only, such as in [71], there are 19 possible super-nodes with 6 distinct c-values.
No. of super-nodes   c_super   Factor of p   Time ratio
3                    1.06155   15.1          11644.9
6                    1.21736   4.6           115.7
3                    1.36291   2.9           21.4
6                    1.50000   2.2           8.0
9                    1.62996   1.8           4.1
6                    1.75379   1.6           2.7
9                    1.87228   1.4           1.6
9                    1.98608   1.3           1.3
6                    2.09570   1.2           0.98
3                    2.20156   1.1           0.72
9                    2.30404   1.1           0.77
6                    2.40343   1.02          0.55
1                    2.50000   0.98          0.61
6                    2.59398   0.95          0.50
3                    2.68556   0.91          0.55
3                    2.77492   0.88          0.55
3                    2.86221   0.86          0.47
3                    2.94757   0.83          0.47
3                    3.19325   0.78          0.38
1                    3.50000   0.72          0.38

Table 4.1: The super-nodes for second nearest neighbours in the near-field
These c-values are listed in Table 4.2, along with their `Factor of p', calculated using a similar expression to that in eqn.(4.2), but now with c = 1.309. As before we use p = ceil(-log_c(eps)) as an expression for the number of terms required. If c_super < 1, then this gives a negative value for p_super. If we are using a constant p throughout the FMM, then the only super-node which decreases the execution time is the super-node with c_super = 1.5, as this has a `Factor of p' less than 1. If we employed the dynamic p principle with eps = 10^-4, then the time to compute the local expansions using the most distant super-node, with c_super = 1.5, would be of order

    p^2 p_super^2 = 35^2 23^2 = 648025.

In comparison, the operation count for the 8 child cells is

    Sum_{child=1}^{8} p^2 p_child^2 = 35^2 Sum_{child=1}^{8} p_child^2 = 35^2 (2 x 6^2 + 4 x 7^2 + 2 x 8^2) = 485100.

Therefore using the super-node with c_super = 1.5 now increases the time required if the dynamic p principle is employed, and hence the use of any of the super-nodes in 3 dimensions, with this definition of the near-field and eps = 10^-4, would increase the execution time of the 3DFMM.
No. of super-nodes   c_super   Factor of p
3                    0.5       -
6                    0.6603    -
3                    0.8930    -
3                    1.0616    4.5
3                    1.2174    1.4
1                    1.5       0.66

Table 4.2: The super-nodes for nearest neighbour cells only in the near-field

Super-nodes were not incorporated into our version of the 3DFMM, since a thorough numerical study needs to be performed to clarify which super-nodes can be used, and how sensitive they are to changes in eps.
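As a quick consistency check on the two tables, the sketch below recomputes the asymptotic `Factor of p' of eqn (4.2), log(c)/log(c_super), for a few of the listed c-values; it is a Python illustration only and assumes nothing beyond the c-values quoted above.

    import math

    def factor_of_p(c, c_super):
        """Asymptotic p_super/p = log(c)/log(c_super) from eqn (4.2);
        only meaningful for c_super > 1."""
        return math.log(c) / math.log(c_super)

    # Second nearest neighbours in the near-field (Table 4.1), c = 2.4641
    for c_super in (1.06155, 1.5, 2.5, 3.5):
        print(c_super, round(factor_of_p(2.4641, c_super), 2))
    # -> about 15.1, 2.22, 0.98, 0.72, matching the corresponding rows of Table 4.1

    # Nearest neighbours only in the near-field (Table 4.2), c = 1.309
    for c_super in (1.0616, 1.2174, 1.5):
        print(c_super, round(factor_of_p(1.309, c_super), 2))
    # -> about 4.51, 1.37, 0.66, consistent with Table 4.2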
4.2 Anderson's `FMM without the multipoles'

Anderson [5] describes a fast summation method which has the same hierarchical structure as the FMM, but instead of a complex power series in 2D or a spherical harmonic expansion in 3D, Poisson's formula is used to represent solutions to Laplace's equation in both 2 and 3 dimensions. This means that the extension from 2D to 3D is far simpler to perform in comparison with the FMM. The Poisson integral is replaced by a K-term trapezoidal rule. The tolerance can be determined a priori in terms of K, in much the same way as p determines the tolerance in the FMM. Anderson uses second nearest neighbours in the near-field for 3D, but he does not attempt an analytic investigation into the truncation error of his formulation.
4.3 The Barnes and Hut Algorithm

As in the case of the FMM, the Barnes-Hut Algorithm has a hierarchical tree structure which defines the clusters. The tree is created by placing the particle ensemble into the root cell of the tree, level 0, and then recursively dividing each cell into sub-cells, 4 children for 2D, until there is only 1 particle per cell. This means that this algorithm is sensitive to the distribution of particles and will have more levels in the tree in regions which have a high particle density. The tree is traversed once per particle to compute the far-field potential acting
on that particle. The algorithm performs the following recursive test on each cell: if the cell is `well-separated', then that cell's cluster is approximated by a single pseudo-particle centred at that cell's centre of mass. If the cell is too close, then the tree is descended and each of the child cells is tested. To determine which cells are `well-separated', a test is made on each cell, namely l/d < theta, where l is the length of the cell, d is the distance from the particle to the cell's centre of mass, and theta is a fixed accuracy parameter.

each with a large memory (> 1 Mbyte), and may be described as a `coarse-grain' parallel computer. A MIMD parallel machine may be categorised by how each processor accesses data.
The first category is the distributed memory MIMD machines, such as the Meiko Computing Surface CS-1, the new Meiko CS-2, the Hypercube, or the Intel Touchstone. Each processor has its own local memory space. Processors may require data which resides on other processors, and so data must be communicated between them; this usually involves the use of explicit message passing, where the responsibility for control over the flow of data is placed with the programmer. The second category is the shared memory MIMD machines, which are in contrast to the distributed memory machines: each processor shares the same data address space. These types of computers include the Encore Multimax, the Sequent Symmetry and the Stanford DASH machine.
SIMD parallel computers

A SIMD (Single-Instruction-Multiple-Data) parallel machine uses one instruction set on many items of data simultaneously. Typically, this type of computer employs a large number of processors (> 1000), each with a small memory (< 1000 bytes), and may be described as a `fine-grain' parallel computer. The CM-1 or CM-2 computers (CM stands for Connection Machine) may be described as SIMD machines. It seems to be increasingly accepted that the SIMD concept has passed its peak, and that the MIMD model is the way of the future [16, 32, 34, 36], but the appropriate choice does rely heavily on the problem to be parallelised.
Vector computers

Vector machines, such as a single CRAY XMP or YMP processor, are another popular type of parallel computer. They employ processors capable of vector processing: if we were adding two vectors of dimension 10, say, then this would take 10 steps on a sequential processor, while on a vector machine this operation would take approximately one step. The additions do not happen simultaneously, but are `pipelined' [36].
5.1.2 Programming Considerations

To use a parallel environment to its full potential, one must ensure that the parallel program exhibits load balancing. This occurs when all the processors perform
approximately the same amount of work, so that no processor lies idle for any significant length of time. Sharing the computational load across the processors is usually done at the expense of increasing inter-processor data dependency, or in other words, decreasing data locality. A program exhibits data locality when spatially close data is held on processors which are physically close to each other. This means less communication between separated processors, which would otherwise require data to be routed through intermediate processors. Many parallelisation paradigms use a form of divide-and-conquer strategy, namely domain decomposition. There are two kinds of domain decomposition: local decomposition divides the problem space into smaller sub-domains which are locally connected, i.e. conserving data locality; in contrast, scattered decomposition distributes the problem in such a way that spatially close data may be on widely separated processors.
5.2 Methods of Parallelising the N-body Treecodes

Katzenelson [55] looked at the parallelisation of the 3DFMM and the 3D Barnes-Hut Algorithm for three configurations of processors, assuming uniform distributions of particles for both algorithms. The first method aligns the processors as a quad-tree for 2D, where each processor has 5 connections, i.e. it has one parent and four children (figure 5.1). In 3 dimensions, oct-trees are used. Messages containing interaction set data are routed through parent cells/processors.
Figure 5.1: A quad-tree, where each node represents both one processor and one cell

The second configuration of processors is a hypercube, where a quad-tree is mapped onto a 3D cube, and an oct-tree is mapped onto a 4D cube. The final configuration is a 3D grid of processors, but a hierarchy of processors
is introduced when the extent of the processor grid exceeds the extent of the near-field cells. We also employ a lattice of processors in this thesis; however, we do not introduce the hierarchy of processors, see section(6.5). Katzenelson uses theoretical times for the CM-2 and the CalTech Hypercube and concludes that the last configuration is the fastest of the three. Greengard and Gropp [39] parallelise the 2DFMM using 16 processors of the Encore Multimax 320, which is a MIMD, shared memory machine. A parallel version of Zhao's FMM [86] was implemented on the CM-2, which is a SIMD machine. For N bodies and N processors, both the SIMD and MIMD versions of a parallel FMM yield a complexity of O(log N). Leathrum and Board [58] implemented 2D and 3DFMMs on shared memory MIMD machines (Sequent Symmetry, Encore Multimax) and on distributed memory MIMD machines (Intel Touchstone Gamma and the INMOS Transputer-based system with 8 T800s), and they are now attempting to parallelise the FMM using clusters of workstations. They employ the super-node idea, which is the key to their communication pattern, as super-nodes reduce the amount of communicated data by a factor of 2 (section 6.6.2). It was found that the distributed memory implementations were not as good as the shared memory implementations, because of the slow communication links on the current generation of transputers [58]. Lustig, Cristy and Pensak [60] have compared the FMM performance on a Cray YMP-4 vector computer, a CM-2 and a WAVETRACER-DTC, the latter two being SIMD parallel machines. They use a binary hypercube connectivity scheme to embed Zhao's 3DFMM on the SIMD machines, but do not employ super-nodes, as they found that SIMD models do not lend themselves to the corresponding form of communication structure. They retain p = 3, or p = 4, and vary the near-field to vary the achieved precision, although with this method an exact bound on the maximum error will not be known prior to execution. Schmidt and Lee [71] implemented the 3DFMM on the CRAY YMP, a uniprocessor vector machine. They used nearest neighbours only in the near-field and naively assigned p = -log_2(eps). Nyland, Prins and Reif [65] are developing a version of the Adaptive 3DFMM using a high-level prototyping language, "Proteus", for a generic vector machine. Salmon and Warren [69, 70] implemented a 3D quadrupole Barnes-Hut model on the Intel Touchstone Delta at CalTech, which is a MIMD distributed memory
machine. A scattered decomposition paradigm is used, namely Orthogonal Recursive Bisection, to ensure load balancing between processors (cf. section(6.3.2) on scattered domain decomposition). Singh [75] has implemented the 3D Barnes-Hut Algorithm and the 2DFMM on a MIMD shared memory machine, namely the Stanford DASH machine. (A 3DFMM was attempted but subsequently forsaken due to the complexity and lack of time.) In this work Singh introduced a more efficient method of load balancing than Salmon's approach, namely Costzones (cf. section(6.3.2) on scattered domain decomposition). The 3D Barnes-Hut Algorithm was parallelised by Barnes [48, 12] on the CM-2, and executes in a time of O(log N) for N bodies on N processors, the same order of complexity as a parallel FMM for N bodies using N processors. Makino and Hut [62] compared the Barnes-Hut Algorithm on the CM-2 to implementations on several vector machines, and found that the former was less efficient. Makino vectorised the tree particle by particle, while Hernquist [52, 53] parallelised the Barnes-Hut Algorithm using FORTRAN90 recursion on a CRAY XMP vector computer, vectorising the tree level by level.
Chapter 6

A Parallel FMM on the Meiko Computing Surface

We now present a parallel version of the FMM on the Meiko Computing Surface. For simplicity it is assumed that the FMM under discussion is the 2-dimensional version, unless otherwise stated, since the 2D strategy is conceptually simpler and the extensions to 3D are straightforward.
6.1 The Meiko Computing Surface, CS-1

The Meiko Computing Surface, CS-1, is housed at the Edinburgh Parallel Computing Centre, EPCC, at Edinburgh University. It is a MIMD, distributed memory machine with 432 INMOS T800 transputers, each with 4 Mbytes of memory. Each transputer can be viewed as a computer in its own right, with memory, floating-point processor, etc. (In the remainder of this thesis the terms transputer and processor are used interchangeably.) The transputers are partitioned into small groups, or domains, of varying number, so that each individual user may access a group of transputers exclusively. There are also 2 domains of transputers which have 8 Mbytes of memory each; one domain has 8 transputers and the other has 17. These domains were used when possible, to allow for larger data sets.
6.1.1 The CS Tools Communication Harness

An interface must be provided between the program and the machinery used to control the communications between processors. For this thesis the communications harness employed is called CS Tools (Computing Surface Tools; version CS Tools for Sun hosts, 01.19.04) and is supplied by Meiko [29]. CS Tools is a high-level interface between the high-level language (here FORTRAN 77) and OCCAM, which is a low-level communication language for the transputers. Each of the processors has only 4 I/O channels, which limits the number of possible configurations, but CS Tools hides this low number of channels from the user and allows the programmer to set up as many communication channels, or transports, between processors as required. Messages are routed through other processors (this is not seen by the user), which effectively means the programmer is relieved of all routing problems. This ease of programming comes at the cost of relinquishing complete control, and can lead to a reduction in parallel efficiency, as the user cannot impose physical data locality.
6.2 The Programming Strategy

A prototype parallel version of the 2DFMM, running on a 2 x 2 grid of processors, had been implemented on the Meiko [80]. This consisted of 4 programs, identical except for the communication functions, one on each of the 4 processors. This was improved by replacing the 4 programs with a single program which is loaded onto each of the processors; it is the processor's number, defined by its abstract location, which determines the route the communications take. This method may be described as an SPMD (Single-Program-Multiple-Data) type of programming, which shares features of both the SIMD and MIMD paradigms. The 2 x 2 grid of processors was then generalised to a grid of any size. Explicit message-passing is employed with the use of CS Tools.
6.3 Domain Decomposition

A simple local domain decomposition paradigm is followed, where an n x m grid of processors is used in 2D, and a grid of n x m x k processors is used in 3D, where
n, m and k are any natural numbers. Figure 6.1 illustrates a mapping of a domain, decomposed as a 4 x 3 grid, to the abstract topology of processors, connected as a 2D grid with the processor numbers shown. The domains of each processor are extended to overlap, so that boundary information is shared at the finest level of the tree, as in figure 6.1, where the particles which lie within the shaded area are mapped onto processor 7. The width of the overlap, or `picture frame' (that is, the shaded area minus box 7), equals the radius of the near-field at the finest level of the FMM: 1 leaf cell in 2D and 2 leaf cells in 3D. With this overlap, the direct summation over the near-field cells may be computed without any communication of data, see section(6.5).
Figure 6.1: Mapping for a 4 x 3 grid onto 12 processors
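The mapping of figure 6.1, together with the overlapping `picture frame', can be sketched as follows. This is an illustrative Python model only (the unit-square domain, the row-major processor numbering, the helper name and the choice of measuring the leaf-cell width on a single global mesh are all assumptions), not the thesis's FORTRAN implementation.

    def owners(x, y, grid_n, grid_m, finest_level, overlap_cells=1):
        """Return the processors (numbered 1..grid_n*grid_m, row-major from the
        bottom, as in figure 6.1) whose extended subdomain contains the point
        (x, y) of the unit square.  overlap_cells = 1 is the 2D `picture frame'
        width of one leaf cell; for simplicity the leaf-cell width is measured
        on a single global 2**finest_level mesh."""
        leaf = 1.0 / (2 ** finest_level)
        frame = overlap_cells * leaf
        procs = []
        for j in range(grid_m):                  # rows of processors
            for i in range(grid_n):              # columns of processors
                x0, x1 = i / grid_n, (i + 1) / grid_n
                y0, y1 = j / grid_m, (j + 1) / grid_m
                if (x0 - frame) <= x < (x1 + frame) and (y0 - frame) <= y < (y1 + frame):
                    procs.append(j * grid_n + i + 1)
        return procs

    # A point deep inside processor 7's box of the 4 x 3 grid of figure 6.1 ...
    print(owners(0.60, 0.45, 4, 3, finest_level=4))   # -> [7]
    # ... and one near a subdomain boundary, shared via the picture frame.
    print(owners(0.51, 0.45, 4, 3, finest_level=4))   # -> [6, 7]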
6.3.1 A Scalable Parallel Model

A scalable parallel model is one where processors may be added to the processor network to `assist' in the computation. A primary factor determining the scalability of a model is the complexity of the communication strategy; it is important that the action of adding processors is as simple as possible. This has been carefully incorporated into the communications used in this thesis, for both 2 and 3 dimensions. Moreover, neither the total number of processors nor the dimensions of the grid have to be a power of 2 (which is often the rule due to network configurations [30, 32, 36]), making the presented scalable strategy more versatile than most. If we consider a fixed number of particles and a fixed number of levels, then we could try to speed up the execution time by simply adding more processors. However, this type of scaling leads to a reduction in parallel efficiency. In fact, as the number of processors increases, the parallel overheads increase relative to the amount of computation, and so there will come a point when simply adding more
processors will actually increase the execution time. Another manner of scaling requires that every processor should be used to its full memory capacity, so that by adding more processors, one adds more particles to the ensemble. A more in-depth discussion of scaling is presented in [75]. From the analysis of the results of the time trials in section(6.8), it is apparent that, for a particular number of particles and prescribed precision, there is an optimum number of processors, in addition to the optimum number of levels employed in the FMM.
6.3.2 Scattered Domain Decomposition

The key to a successful domain decomposition is to balance data locality with a balanced load across all the processors. For our non-adaptive modified FMM we assume a uniform distribution of particles, so a balanced load is ensured. However, under our local decomposition method, non-uniform distributions will lead to poor load balancing. Non-uniform distributions require a scattered domain decomposition to achieve a balanced load. One scattered decomposition paradigm is known as Orthogonal Recursive Bisection, or ORB, and strives to achieve load balancing between processors at the cost of decreasing data locality, so two particles which are geometrically close together may be on 2 distant processors. Salmon and Warren [69, 70] employ the ORB technique, as previously indicated in section(5.2), when parallelising the Barnes-Hut N-body solver. Briefly stated, a work profile is calculated for every particle by estimating the amount of computation which is to be performed. The particle distribution is subdivided by recursively bisecting the computational domain, such that the resulting subdomains have equal profiled computation. Each processor is then allocated the particles which lie in these subdomains. A more efficient method of load balancing than Salmon's ORB approach is introduced by Singh [75], and is called Costzones. In contrast to the ORB approach, the Costzones approach recursively bisects the cells of the hierarchical tree, as opposed to the particle distribution. Singh argues that neither of these load balancing techniques is suited to message-passing machines, such as the Meiko. This is because the burden of data management placed on the programmer is
just too great: the program would be too complex and the parallel overheads would occupy too much memory. A non-hierarchical N-body method which follows the ORB paradigm is presented by Baden [7].
6.4 The Required Communication Strategies

The following two communication strategies are required by our parallel algorithm: the systolic loop and nearest-neighbour processor inter-communication. These algorithms are not confined to this particular parallel paradigm, and can be employed for other problems. Parallel Multigrid Methods [31, 64] can benefit from this work, since they have a similar hierarchical structure to the N-body Treecodes. There are two methods of message-passing communication for MIMD, distributed memory machines; these are called blocking and non-blocking communications. Blocking describes the action of a `send' or `receive' which waits until the function has been executed successfully. The process is suspended until the operation is completed, and this forms a `synchronous periodic barrier', since the two communicating processors must coincide at a certain time. On the other hand, a non-blocking `send' initiates the sending of data and then continues with the computation. A non-blocking `receive' initialises an array in local memory to receive the incoming data, then continues with the computation, during which time the data may be received. Our strategy requires a mixture of both blocking and non-blocking communications. The following CS Tools functions are required by this programming paradigm, and are described in more detail in [29].
6.4.1 CS Tools commands and definitions

csntxnb(transport,flag,peerid,buffer,size,tag) sends a non-blocking message, where transport identifies the sender's identity, flag=0 denotes an asynchronous transmission, peerid identifies the receiver's identity, buffer identifies the start of the message, size is the size of the message and tag is a flag used in csntest.

csnrx(transport,peerid,buffer,maxsize) will block until a message is received, where transport identifies the receiver's identity, peerid is a variable which will hold the sender's identity, buffer identifies where the message is to be held and maxsize states the maximum size of the received message.

csntest(transport,flag,timeout,peerid,tag,status) tests whether a non-blocking send/receive has actually been sent/received, where transport identifies the identity for the test, flag states what kind of check it is, a send (csntxready) or a receive (csnrxready), timeout states how long to wait in microseconds, which can be indefinitely, peerid is a variable which will hold the identity with which the communication took place, tag is a variable which will hold the flag used in the non-blocking send and status is a variable which will indicate whether the communication was successful.
6.4.2 The Systolic Loop

The systolic loop communication strategy enables data residing on each processor to be sent to every other processor, i.e. an `all-to-all broadcast'. By way of an introduction, consider the following example.
Figure 6.2: Four processors in a loop

If we connect 4 processors in a ring, data held in P1 can visit all the other processors in the ring by taking 3 steps, following the path indicated by the arrows in figure 6.2. If we perform the communications in parallel, 4 data sets can each visit the same 4 processors in the same amount of time. In each step, each processor sends its data set in the direction of the outgoing arrows and then receives a new data set. After only 3 steps all 4 data sets have visited all 4 processors. In general this takes (P - 1) steps, where P is the number of processors, i.e. the systolic loop takes a time of O(P). This method of processor inter-communication has been used to parallelise the N-body `direct' computation on SIMD computers, N particles on N processors solved in O(N) time [30], and on MIMD based computers [50]. A particle is located on every processor; a copy of the particle's `information' is then pumped round the systolic loop, visiting every other processor, and hence every other particle.
Figure 6.3: A systolic loop through a 2D grid of processors

The interaction is then calculated between the pairs of particles on each processor at each communication step of the systolic loop. The overall field is therefore calculated in a time proportional to the number of processors. The communication network is configured as a 2D lattice (for the 2DFMM); the ring is then created using communication links which already exist. This saves time by not requiring the creation of any superfluous links. Only one new communication link is required to complete the loop, 12 to 1 in the configuration illustrated in figure 6.3. Table 6.1 presents the pseudo-code which creates and performs the systolic loop, where P is the total number of processors, procNo is the processor's number and N is the number of columns in the N x M grid. Tail is the identity of the processor which sends data to processor 1. North, south, east and west are all identities of processors in the grid relative to the present processor, and loopdir is the direction the communications take. The call to csntest and the blocking receive are required so that the next non-blocking send will not overwrite the data sent previously. Note that this is a scalable communication routine. For the 3-dimensional case, the systolic loop pseudo-code is an extension of the code in table 6.1, requiring 2 new processor identities to be introduced, namely up and down. This is presented in Appendix(B). For a large number of processors there is a more efficient systolic loop configuration than that previously described. This can be illustrated by the specific layout of processors displayed in figure 6.4, in a 2-dimensional grid. Firstly the rows are linked as systolic loops (thick lines) and the data fed round in 3 steps. The columns are then linked as loops (thin lines), which require 2 steps to feed the data round. So the number of steps required in this example is 3 + 2 = 5 steps.
tail = N*M
If (mod(M,2) = 0) then tail = N*(M-1)+1
place = mod(int(real(procNo-1)/real(N)), M) + 1
If (procNo = tail) then
    loopdir = ring
elseif (mod(place,2) = 1) then
    if (mod(procNo,N) = 0) then
        loopdir = north
    else
        loopdir = east
else
    if (mod(procNo,N) = 1 or N = 1) then
        loopdir = north
    else
        loopdir = west
Do For i = 1, P-1
    tag = (i-1)*P + procNo
    csntxnb(transport,0,loopdir,message,size,tag)
    csnrx(transport,peerid,message,size)
    csntest(transport,csntxready,indefinitely,loopdir,tag,st)

Table 6.1: Pseudo-code for a scalable systolic loop in 2D.

This is in contrast to the 11 steps required by a single systolic loop for the same configuration of processors. However, the double systolic loop configuration, figure 6.4, requires more data to be sent per step than the single systolic loop. This is because the first systolic loop (the rows) must retain all the migrating information, to be pumped round the second systolic loop (the columns). So the amount of data to be sent per step is greater than the amount of data sent per step in the single systolic loop, but fewer steps are required.
Figure 6.4: Revised systolic loop layout for a large number of processors
This double systolic loop should be utilised only for a large number of processors, depending on the amount of data and the speed of the communication channels. Consider a general case, in which we disregard the amount of data to be communicated. If there are P processors, the single systolic loop is of order O(P), whereas the revised systolic method, with two loops, takes 2*sqrt(P) - 2 steps for a square grid of P processors, i.e. it is of order O(sqrt(P)).
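To make the circulation concrete, here is a small Python simulation of the single systolic loop; the data structures and names are invented for the example, and no CS Tools calls are modelled (the real implementation uses the calls of table 6.1). After P - 1 shifts around the ring, every processor has seen every other processor's packet.

    def systolic_all_to_all(local_data):
        P = len(local_data)
        seen = [{i: local_data[i]} for i in range(P)]   # data accumulated by each processor
        in_transit = list(local_data)                   # packet currently held by each processor
        origin = list(range(P))                         # which processor that packet came from
        for _ in range(P - 1):
            # every processor passes its packet to the next processor in the ring
            in_transit = [in_transit[(i - 1) % P] for i in range(P)]
            origin = [origin[(i - 1) % P] for i in range(P)]
            for i in range(P):
                seen[i][origin[i]] = in_transit[i]
        return seen

    result = systolic_all_to_all(["a", "b", "c", "d"])
    assert all(len(s) == 4 for s in result)   # each of the 4 processors saw all 4 packets
    print(result[0])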
6.4.3 Nearest-Neighbour Processor Inter-Communications

Consider the general 2D case, where each processor must swap information with its eight neighbouring processors. It is possible to set up 8 transports and send diagonal data explicitly. However, this requires a large memory resource and represents a total of 8 operations. Another possibility is to mix blocking and non-blocking communications and set up 4 non-blocking `sends' with 4 blocking `receives', then locate and send the data needed for diagonal communication; a total of 6 operations. However, minimising the amount of inter-processor communication is of great importance to a successful parallel implementation; we can estimate the communication times in the following manner. The time for message-passing on MIMD machines can be split, approximately, into 2 parts: the start-up time (at which point the initialisations take place) and the time for one byte of data to be sent multiplied by the number of bytes in the message, i.e.
    T_comms = T_start-up + T_byte M    (6.1)
where T_start-up is the start-up time, T_byte is the time for one byte of data to be sent and M is the total number of bytes of data, with T_start-up >> T_byte [30]. Values for T_start-up and T_byte for the Meiko CS-1 are not readily available, but they can be approximated by T_start-up = 9.5 microseconds and T_byte = 0.8 microseconds [51] for two physically neighbouring transputers. This start-up time is relatively small compared to that for the CalTech Hypercube, where T_start-up = 350 microseconds [30]. For the nearest-neighbour processor inter-communications, diagonal communication in the array has been removed, since it is not worth the start-up time for such a small amount of data. The presented nearest-neighbour algorithm is performed in only 4 operations, the data from the diagonal processors being sent twice, vertically and then horizontally. Therefore the order in which the messages are received is of paramount importance, so that diagonal communications are not corrupted.
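As a quick illustration of eqn (6.1), the sketch below evaluates the model with the constants quoted above (interpreted here as microseconds); the message sizes are arbitrary placeholders.

    T_STARTUP = 9.5e-6   # seconds; Meiko CS-1 start-up time quoted above
    T_BYTE = 0.8e-6      # seconds per byte; quoted above

    def comm_time(message_bytes):
        """Eqn (6.1): T_comms = T_start-up + T_byte * M."""
        return T_STARTUP + T_BYTE * message_bytes

    # The two contributions balance at roughly a dozen bytes.
    print("start-up equals byte cost at", T_STARTUP / T_BYTE, "bytes")

    for m in (8, 64, 1024, 16384):                   # arbitrary message sizes
        t = comm_time(m)
        print("%6d bytes: %9.1f microseconds (start-up share %3.0f%%)"
              % (m, t * 1e6, 100 * T_STARTUP / t))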
A communication algorithm is created, similar to the algorithm for the systolic loop, using the same CS Tools routines, where the direction of communication becomes east, then west, north, then south. However, the previous CS Tools routines alone do not provide a reliable system, as the following problem will occur. Consider the configuration of three processors shown in figure 6.5.
Figure 6.5: Communication topology which produces unpredictable results

If P1 and P3 send to P2 at the same time, P2 will sometimes receive data from P1 first, sometimes data from P3 first. This problem of random message-arrival arises in the lattice of processors close to the boundary. In the case of the 2D lattice, a processor at the corner has only 2 directions to send/receive data with, and is required to lie idle for some of the time. Therefore, to ensure that only the required data from a specified processor is received, the following CS Tools function is used: csnsetpeer(transport,peerid). This function only allows messages from peerid to be received. Messages sent to this transport from a processor which is not peerid will receive a response which informs the sender that the destination processor will not accept messages from this address. The `fatal' response to a non-blocking send is to assign the status variable in the csntest function call to a number not equal to the size of the message. The `non-fatal' response to a non-blocking send is to simply remove the message from the communication harness. (Blocked sends will retry automatically with a `non-fatal' response, which occurs when the communication channel to the receiving transputer has not been `opened' yet.) Table 6.2 contains the algorithm for scalable nearest-neighbour communication in 2 dimensions. The 3-dimensional code is a simple extension of this (where direction = east, west, north, south, up, down). More primitive environments may not support non-blocking communications. An alternative, scalable, purely blocking strategy for nearest-neighbour processor inter-communications was devised, and is presented in Appendix(C). This communication routine executes in approximately the same time as the routine described before, but it is structurally more complex.
Do For direction = east, west, north, south
    tag = some-linear-function(direction, procNo)
    incoming = opposite(direction)
    csnsetpeer(transport,incoming)
    csntxnb(transport,0,direction,message,size,tag)
    csnrx(transport,peerid,message,size)
    csntest(transport,csntxready,indefinitely,direction,tag,st)
    while (st /= size) do
        csntxnb(transport,0,direction,message,size,tag)
        csntest(transport,csntxready,indefinitely,direction,tag,st)

Table 6.2: Nearest-neighbour processor scalable inter-communications
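The effect of the routine in table 6.2 can be mimicked with a small Python simulation (names and data are invented; no CS Tools calls are modelled): east/west swaps are performed first, and the north/south swaps then forward everything just received, so diagonal data arrives in two hops without any explicit diagonal messages.

    def neighbour_exchange(grid):
        R, C = len(grid), len(grid[0])
        held = [[{grid[r][c]} for c in range(C)] for r in range(R)]

        # Step 1: east and west swaps (no wrap-around at the boundary).
        ew = [[set(held[r][c]) for c in range(C)] for r in range(R)]
        for r in range(R):
            for c in range(C):
                for dc in (-1, 1):
                    if 0 <= c + dc < C:
                        ew[r][c] |= held[r][c + dc]

        # Step 2: north and south swaps, forwarding everything gathered so far.
        ns = [[set(ew[r][c]) for c in range(C)] for r in range(R)]
        for r in range(R):
            for c in range(C):
                for dr in (-1, 1):
                    if 0 <= r + dr < R:
                        ns[r][c] |= ew[r + dr][c]
        return ns

    grid = [[(r, c) for c in range(4)] for r in range(3)]
    result = neighbour_exchange(grid)
    # The interior processor (1, 1) ends up holding its own datum plus those of
    # all 8 neighbouring processors, including the four diagonals.
    print(sorted(result[1][1]))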
6.5 The Parallel FMM Algorithm Description

The key to a successful parallel algorithm lies in reducing the amount of data to be transmitted to a minimum, and in performing the communication of this data as seldom as possible. In this parallel paradigm, the communication section occurs only once per time-step, at the top of the tree. The amount of data to be communicated is reduced by `routing' data through other processors, i.e. diagonal communications are not performed, as described in the previous section. At the start of our parallel FMM the same program is loaded onto each of the processors in the array. The strengths and positions of the particles are read from a file, provided that the particles lie within the processor's allocated domain, such as the one illustrated in figure 6.1. The trees are then traversed to the top, one tree on every transputer. The multipole expansions are calculated using only particles belonging to the inner box (box 7 in figure 6.1, say), i.e. the particles which lie in the `picture frame' are ignored until the final evaluation. Level 0 now corresponds to each transputer's subdomain of particles belonging to the inner box, and not to the entire domain, as is the case for the sequential algorithm. The `global tree', i.e. a tree containing each of the transputers' individual trees, is not computed. When each transputer reaches the top of its own tree, the top-level interaction set is redefined as the top-level cells which reside in all well-separated transputers. This far-field information is communicated by the systolic loop strategy. A cell's nearest neighbour cell may reside on an adjacent transputer. If so, then a subsection of a cell's interaction set will lie on that adjacent transputer. This means
that nearest-neighbour transputers must also swap boundary information. This is performed with the nearest-neighbour processor inter-communication routine, and reduces the amount of data passed through the transputers in the systolic loop. The tree is then descended, utilising communicated data when necessary. Once the finest level of the FMM is reached the `direct' method is used, taking into consideration the particles lying within the overlapping `picture frame' region. This sharing of boundary information reduces the need for additional communication, at the small expense of extra memory, and gives the direct summation section independence from the rest of the parallel FMM. The pseudo-code version of the parallel algorithm is presented in table 6.3. This algorithm differs from other published MIMD, distributed memory implementations in the following ways. Katzenelson [55] also configures the processors as a grid, where each processor gets an equal share of the domain and owns an equal sub-tree; however, the remaining top of the tree is controlled by just one of the processors. Leathrum and Board [58] use a grid of processors, as in our implementation, but they limit the number of processors to 64 to avoid a processor hierarchy. Such an enforced processor hierarchy is an inherent communication `bottleneck', causing all but one of the processors to lie idle. This bottleneck does not arise in our implementation, since the top of the `global' tree does not enter the computation explicitly, thus removing the need for a processor hierarchy. Each transputer's top-level information is passed in the systolic loop; therefore no processor lies idle and the bottleneck is effectively removed. There is a periodic synchronisation where the processors must coincide to perform the communications, but this is acceptable since the processors must synchronise at the beginning and end of each potential evaluation for the necessary input and output of data and results.
Let Phi_i be the potential due to the particles in cell i. This is the multipole expansion, and can be evaluated in cells which are well-separated from cell i. Let Psi_i be the potential inside cell i due to the particles in cells well-separated from cell i. This is the local expansion and can be evaluated in cell i. Let phi_i be the potential at a particle in cell i due to all bodies. Let Phi^dir_i be the contribution to the potential in cell i obtained by direct pairwise summation.

1. Read data into the relevant processor.

2. Do For i in level n, compute Phi_i, i.e. calculate the multipole expansion for every cell at the finest level.

3. Do For j = n - 1 down to 1
       Do For i in level j
           Phi_i = Sum_{k in children(i)} Phi_k
   i.e. shift and add the multipole expansions.

4. Send and receive level 1 multipole expansions from all other processors using the systolic loop.

5. Do For i in level 1
       Psi_i = Sum_{k in far-field(i)} Phi_k
   i.e. convert the multipole expansions of the far-field cells to local expansions.

6. Pack, send, receive and unpack level 2, 3, ..., n multipole expansions from nearest-neighbour processors.

7. Do For j = 2, n
       Do For i in level j
           Psi_i = Psi_parent(i) + Sum_{k in int(i)} Phi_k
   i.e. shift the centres of the level j - 1 local expansions to the parent cell and convert the multipole expansions of cells in the interaction set to local expansions. Note: a subset of the interaction set may reside in the data arrays from step 6 above.

8. Do For all i in level n
       phi_i = Psi_i + Sum_{k in neigh(i) and i} Phi^dir_k

Table 6.3: The parallel FMM algorithm in pseudo-code
6.6 The Packing and Unpacking of Data

For each side of the processor's domain, the data of the boundary cells and their nearest neighbours, belonging to levels 2, 3, ..., n, is packed and sent out during the nearest-neighbour processor inter-communication. When each data array is received, the array is searched to locate the outlying cells which constitute part of the interaction set of the current cell. The locations of cells within the array are known, since the messages are sent in a predetermined sequence, so the arrays can be systematically unpacked.
6.6.1 2 Dimensions

For each domain boundary there are 2^{l+1} cells of data to be sent east, or west, each with 3 + 2p real numbers, i.e. 2 layers of cells, each cell having its co-ordinates, the sum of the strengths of all the charges contained in the cell and the p multipole coefficients, which are complex numbers. Recall that p = max_i(p_dyn) as before. So for n levels the total number of real numbers to be sent is

    Sum_{l=2}^{n} 2^{l+1} (3 + 2p) = (2^{n+2} - 8)(3 + 2p).

For sending north, or south, there are 2 layers of cells per level plus 4 cells of the received data from the east and 4 cells from the west, so the total number of real numbers is

    Sum_{l=2}^{n} (2^{l+1} + 8)(3 + 2p) = (2^{n+2} + 8n - 16)(3 + 2p).
6.6.2 3 Dimensions

For each side of the processor's domain there are now 4 layers of cells which contain data to be packed and sent out in the nearest-neighbour processor inter-communication. For each domain boundary there are 4^{l+1} cells of data to be sent east, or west, each with p^2 + 3p + 5 real numbers, i.e. the co-ordinates of the cell and the complex variables M_n^m, n = 0, ..., p, m = 0, ..., n, namely the multipole coefficients. Recall that p = max_i(p_dyn) as before. So for n levels the total number of real numbers to be sent is

    Sum_{l=2}^{n} 4^{l+1} (p^2 + 3p + 5) = ((4^{n+2} - 4^3)/3)(p^2 + 3p + 5).

For sending north, or south, there are 4^{l+1} cells per level plus 2^{l+4} cells of received data from each of the east and the west. So the total number of real numbers is

    Sum_{l=2}^{n} (4^{l+1} + 2^{l+5})(p^2 + 3p + 5) = ((4^{n+2} - 4^3)/3 + 2^{n+6} - 2^7)(p^2 + 3p + 5).

For sending up, or down, there are 4^{l+1} cells per level plus 2^{l+4} cells of received data from each of the east and the west and 2^{l+4} + 2^7 cells of received data from each of the north and the south. So the total number of real numbers is

    Sum_{l=2}^{n} (4^{l+1} + 2^{l+6} + 2^8)(p^2 + 3p + 5) = ((4^{n+2} - 4^3)/3 + 2^{n+7} + 2^8(n - 2))(p^2 + 3p + 5).

Leathrum and Board [58] advocate the use of super-nodes for 3D, as this reduces the amount of data to be transmitted by a factor of 2: only 2 layers are sent, since the super-nodes replace their children's nodes. Super-nodes are not employed in this parallel version, since their contribution to the overall error is not fully known, see section(4.1.2).
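The closed forms above can be checked against the explicit sums; a small Python sketch (illustrative only) is given below.

    def per_cell(p):
        # reals per cell: co-ordinates plus the (p+1)(p+2)/2 complex coefficients
        return p * p + 3 * p + 5

    def reals_east_west(n, p):
        total = sum(4 ** (l + 1) for l in range(2, n + 1)) * per_cell(p)
        closed = (4 ** (n + 2) - 4 ** 3) // 3 * per_cell(p)
        assert total == closed
        return closed

    def reals_north_south(n, p):
        total = sum(4 ** (l + 1) + 2 ** (l + 5) for l in range(2, n + 1)) * per_cell(p)
        closed = ((4 ** (n + 2) - 4 ** 3) // 3 + 2 ** (n + 6) - 2 ** 7) * per_cell(p)
        assert total == closed
        return closed

    def reals_up_down(n, p):
        total = sum(4 ** (l + 1) + 2 ** (l + 6) + 2 ** 8 for l in range(2, n + 1)) * per_cell(p)
        closed = ((4 ** (n + 2) - 4 ** 3) // 3 + 2 ** (n + 7) + 2 ** 8 * (n - 2)) * per_cell(p)
        assert total == closed
        return closed

    for n in (2, 3, 4):
        print(n, reals_east_west(n, 3), reals_north_south(n, 3), reals_up_down(n, 3))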
6.7 The `Optimum' Interaction Set for the Parallel 3DFMM

The interaction set used in the parallel implementation is the same non-cubic interaction set derived for the sequential 3DFMM in section(3.7). However, in the parallel case a cell of an interaction set may lie on a neighbouring processor, so accessing the information from such a cell involves the packing, sending, receiving and unpacking of data, or in other words, an increase in the number of required operations compared to the sequential case. The optimum interaction set was derived by operation counting, hence the `sequential' optimum interaction set will differ from the `parallel' optimum interaction set. Since the latter set depends on the parallelisation, it will therefore depend on the number of particles, the number of terms, the number of processors and the number of FMM levels. This is problem-dependent and therefore needs to be investigated individually for each implementation.
6.8 Timing Results

6.8.1 2 Dimensions

This parallel version of our modified 2DFMM is implemented on the Meiko Computing Surface, CS-1, using p = 11. The dynamic p principle is employed. Execution times were obtained using a wide range of n x m grids, and it was found that the n x n grids were representative of all the cases. The timing curves are presented on a log-log graph in figure 6.6, employing the optimum number of levels for any given N.
Figure 6.6: Parallel 2DFMM execution times, p = 11.

This optimum number of levels, n_opt, was derived in the same manner as for the sequential timings; that is, numerous programs were executed with various values for N and a static n. The value for n_opt was then found with the aid of a graph. The size of N is limited by the available memory per transputer, see section(6.9.1). The times for the `direct' method and for the sequential 2DFMM are obtained by executing these programs on a single T800 transputer. The break-even point between the `direct' method and the fastest parallel time occurs at N ~ 70 particles, which is in contrast to N ~ 180 particles for the sequential break-even point. For N ~ 10000 particles, using a 5 x 5 grid of transputers, the parallel version is over a factor of 17 times faster than the sequential version. An array of 8 x 8 processors was attempted, but the times produced varied wildly, see section(6.9). The maximum number of particles per leaf cell, s, for every configuration of processors is s = 34. (The values of s are calculated using the procedure described in section(2.7).) For a single processor, i.e. purely sequential, s is also 34.
This compares to s = 41 for the sequential times from the Sun Sparc ELC. This difference in s is due to the individual character of each system. As with the sequential cases, the timing curve of the parallel FMM using n_opt will appear as a continuous line and is theoretically O(N) for a single processor. The actual equation of the timing curve can be approximated if we assume that time varies as aN^b. This appears as a straight line on a log-log graph, hence a straight-line regression may be performed through the points associated with a certain range where this regression is valid. Table 6.4 contains the results of this regression analysis.

Range            Configuration   Time (secs)
[100 : 20000]    1               0.03 N^1.13
[100 : 100000]   2 x 2           0.01 N^1.08
[500 : 20000]    3 x 3           0.01 N^1.02
[1000 : 50000]   4 x 4           0.01 N^0.967
[2000 : 20000]   5 x 5           0.03 N^0.835

Table 6.4: Regression analysis for parallel 2DFMM times

Using the same regression analysis on the lower envelope of all the timing curves in figure 6.6, over the range [100 : 20000], gives time = 0.0058 N^0.780 secs. This expression represents the fastest achievable time from this parallelisation of the FMM. Note that for each point on this curve, n = 1, i.e. the FMM employs only one level per processor, see section(6.9). The quality of any parallel implementation can be quantified by an expression which is described as the parallel efficiency. The efficiency value indicates the amount of time lost to parallel overheads, such as initialisation, communications, packing and unpacking of data, etc. The parallel efficiency is defined as

    parallel efficiency = (best sequential time on one processor) / (P x best parallel time),    (6.2)

where there are P processors in total. An efficiency of 85% on 10 processors, say, means that the program runs 15% slower than it would on a hypothetical sequential computer with the computing power of 10 processors. To calculate the speed-up one simply multiplies the efficiency by P, and therefore a program which exhibits linear speed-up has an efficiency of 100%. The efficiency curves for this parallelisation are presented in figure 6.7. The sequential version may have a different number of levels to the parallel version, and so we may not be comparing `like' with `like'.
Figure 6.7: 2D parallel efficiency curves

However, the comparison does give an indication of how much speed-up the parallel version achieves. The grid of 2 x 2 processors retains its efficiency for low N, since no processor has a full complement of nearest-neighbour processors, and hence it has the lowest parallel overheads of all the configurations. It can be seen that to retain efficiency as the number of particles increases, the number of processors must also increase, and hence there is an optimum configuration for every value of N. However, the overall efficiency of this parallelisation will decrease as the number of processors increases, due to the systolic loop; this communication time is O(P), where P is the total number of processors (cf. section 6.4.1), so the ratio of communication to computation increases.
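Eqn (6.2) is simple to apply; the Python sketch below does so with purely illustrative timings, not the measured Meiko values.

    def parallel_efficiency(t_seq_one_proc, t_parallel, n_procs):
        """Eqn (6.2): best sequential time on one processor / (P * best parallel time)."""
        return t_seq_one_proc / (n_procs * t_parallel)

    def speed_up(t_seq_one_proc, t_parallel, n_procs):
        return n_procs * parallel_efficiency(t_seq_one_proc, t_parallel, n_procs)

    # Illustrative numbers only: a 5 x 5 grid that runs 17 times faster than one processor.
    t_seq, t_par, P = 425.0, 25.0, 25
    print("efficiency = %.0f%%" % (100 * parallel_efficiency(t_seq, t_par, P)))
    print("speed-up   = %.1f" % speed_up(t_seq, t_par, P))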
6.8.2 3 Dimensions

The parallel version of our modified 3DFMM is also implemented on the Meiko CS-1, using p = 3. The dynamic p principle is employed. Execution times were obtained using a wide range of n x m x k grids, and it was found that the n x n x n grids were representative of all the cases. The timing curves are presented on a log-log graph in figure 6.8, employing the optimum number of levels, n_opt, for any given N. Once again the size of N is limited by the memory of each processor, see section(6.9.1). The times for the `direct' method and for the sequential 3DFMM are obtained by executing these programs on a single T800 transputer.
Figure 6.8: Parallel 3DFMM execution times, p = 3 (2x2x2, 3x3x3 and 4x4x4 processor grids and the `direct' method).
For N = 20000 particles, using a 4x4x4 grid of transputers, the parallel version is only a factor of 13 faster than the sequential version, whereas a parallelisation with 100% efficiency would achieve a factor of 64 speed-up with this configuration. An array of 5x5x5 processors was attempted, but the times produced varied wildly, see section(6.9).

The maximum number of particles per leaf cell, s, varies for each configuration of P processors. The various values for s are displayed in table 6.5. (The values of s are calculated using the procedure described in section(3.8).) The column headed l = 1 -> l = 2 denotes the transition between levels 1 and 2, and `nem' denotes that there is not enough memory available on the processors.

Table 6.5: The value of s for each level transition
  Configuration   l = 1 -> l = 2   l = 2 -> l = 3   l = 3 -> l = 4
  1               297              235              -
  2x2x2           143              108              nem
  3x3x3           149              64               nem
  4x4x4           180              nem              nem
  Sun Sparc ELC   172              170              -

The table illustrates how the value of s, and hence the correct choice of n_opt, depends on each individual implementation, as already stated. As with the sequential cases, the timing curve of the FMM, using n = n_opt, will appear as a continuous line and is theoretically O(N) for a single processor.
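The level transitions in table 6.5 suggest a simple rule for choosing n_opt at run time. The sketch below is only an illustration of such a rule; the thresholds and the eight-children-per-cell assumption are ours, not the thesis procedure of section(3.8):

```python
def choose_n_opt(n_particles, transitions, max_levels=3):
    """Pick the number of levels n: move from level l to l+1 once the mean
    number of particles per leaf cell, s = N / 8**l, exceeds the measured
    transition threshold for that step (3D: 8 children per cell)."""
    n = 1
    for l in range(1, max_levels):
        s = n_particles / 8**l
        threshold = transitions.get((l, l + 1))
        if threshold is not None and s > threshold:
            n = l + 1
        else:
            break
    return n

# Hypothetical thresholds in the spirit of table 6.5 (single-processor column).
transitions = {(1, 2): 297, (2, 3): 235}
print(choose_n_opt(5000, transitions))   # -> 2
```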
The actual equation of the timing curve can be approximated if we assume that time = aN^b. A regression analysis may be performed through the points associated with a certain range, and table 6.6 contains these results. The regression was not valid for the sequential case nor for the 4x4x4 configuration, as the timing curves were not `straight enough'.

Table 6.6: Regression analysis for the parallel 3DFMM times
  Range             Configuration   Time (secs)
  [500 : 200000]    2x2x2           0.0020 N^1.39
  [1000 : 200000]   3x3x3           0.0075 N^1.21

Using the same regression analysis on the lower envelope of all the timing curves in figure 6.8, over the range [500 : 20000], we find that time = 0.08 N^0.848 secs. Once again, as in the 2D case, this expression represents the fastest achievable time from this parallelisation of the FMM. Note that for each point on this curve, n = 1, i.e. the FMM employs only one level, see section(6.9).

The parallel efficiency curves are presented in figure 6.9.
Figure 6.9: 3D parallel efficiency curves (2x2x2, 3x3x3 and 4x4x4 processor grids).
Once again, the sequential version may have a different number of levels to the parallel version, and so we may not be comparing `like' with `like', but it does give an indication of how much speed-up the parallel version achieves. The relatively poor efficiencies for the 3D case are due to the amount of data to be communicated; recall section(6.6.2). When n, the number of levels in the FMM, is low, the ratio of the number of internal cells (the cells for which data is not communicated) to the number of external cells (the cells for which data is communicated) is low. As the number of levels used
increases, this ratio will increase, thus increasing the ratio of computation to communication, and so the efficiency of the 3DFMM will increase when more levels are used. However, due to the memory limitations of the transputers, n > 3 is not possible for this parallel implementation.

The parallel efficiency of the 3D version could be improved in two ways. Firstly, the systemised procedure employed to unpack the incoming data could be refined further, thereby reducing the time required by this operation. Secondly, the optimum interaction set could be determined for each set of run-time parameters (cf. section 6.7).
6.9 Discussion of the Parallel Implementation

The FMM employed to produce the fastest time achievable from this parallelisation has only one level, in both 2 and 3 dimensions. This means that, of the two communication strategies, only the systolic loop is used. It follows that as the number of particles increases, one should increase the number of processors, and not the number of levels, in this parallel FMM. The number of levels should only be increased when there are no longer any available processors. This is because using only one level means there are no cells in a hierarchy, and the hierarchy introduces a large quantity of data which must be swapped between nearest-neighbour processors. That communication strategy causes some boundary processors to lie idle for some of the time, since a boundary processor has fewer neighbours than an internal processor. Therefore, when the ratio of internal processors to external processors is low, a systolic loop is more efficient than employing the nearest-neighbour strategy. Indeed, a parallel N-body solver which uses only a systolic loop for processor inter-communication can be found in [50], where T800s are also used in conjunction with OCCAM II. The nearest-neighbour processor inter-communication is ignored in [50], owing to the poor load balance.

To improve parallel efficiency, computation and communication should in principle be overlapped by the use of non-blocking communications alone. However, this is not possible in this case, since a mixture of blocking and non-blocking communications was required in order to control the order of incoming messages.

The `final' calculation, where the `direct' method is used to calculate the near-field at the finest level, is completely independent of any other computation in the FMM
after initialisation, and so a new transputer may be introduced to perform this task while the far-field is calculated by the original grid of processors. This is the method used for the vortex application, section(7.4.1).

Consider an FMM with a fixed set of parameters. Using all, or nearly all, of the processors in a particular domain executes this FMM in a wide range of run-times. This was found to be the case when using an 8x8 grid in a domain of 65 transputers, and when using a 5x5x5 grid in a domain of 131 transputers. This is due to the physical layout of the transputers: the route that a particular message travels through the network of transputers can vary under the same programming conditions. Indeed, in this situation some configurations are unobtainable, in which case the user should use a domain with many more transputers than is required by the program.
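The systolic loop referred to throughout this section can be pictured with a small, purely schematic simulation: each of P processors repeatedly hands its current packet to its ring neighbour, so that after P - 1 shifts every processor has seen every other processor's data. This models only the communication pattern, not the CS-1 implementation:

```python
def systolic_loop(local_data):
    """Simulate one systolic loop: every processor ends up having seen
    the data packet of every other processor."""
    P = len(local_data)                      # one entry per (simulated) processor
    seen = [[d] for d in local_data]         # each processor starts with its own data
    packet = list(local_data)
    for _ in range(P - 1):
        packet = [packet[(i - 1) % P] for i in range(P)]   # shift packets around the ring
        for i in range(P):
            seen[i].append(packet[i])
    return seen

print(systolic_loop(["A", "B", "C", "D"]))
```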
6.9.1 Memory Constraints

The memory limitations have a dramatic effect on what simulations are possible. Leathrum and Board [58] (cf. section 5.2) assume that, due to the migratory nature of the particles, each processor can contain all the particles at one time. This meant that they were restricted to 20K particles in the ensemble, for p = 8 in 3D.

Consider a particular problem for our 3D parallel code: N = 10^6 and p = 3. If the processors were to cover the domain in a disjoint union of cells, then the average number of particles per processor would simply be N/P, where N is the total number of particles and P is the total number of processors. However, in our implementation the domains of the processors overlap by 2 leaf cells. The average number of particles per processor is then given as

    {8^(2-n) + 3.2^(2-n) + 1} N/P,    [6.3]

where n is the number of levels. (In 2 dimensions, where the overlap is 1 leaf cell, the average number is given by {4^(2-n) + 2^(2-n) + 1} N/P.) From the 3-dimensional time trials, with p = 3, the maximum number of particles per transputer was found to be N_max = 7.2 x 10^4, N_max = 6.5 x 10^4, and N_max = 4.6 x 10^4 for n = 1, 2, 3 respectively, and n ≥ 4 is not possible. Hence for N = 10^6, p = 3 and n = 1 (the most efficient value for n in this parallelisation) we have P > 208, by eqn.(6.3), which represents a number of processors greater than is currently available on the CS-1 without reconfiguring the entire machine. (The largest domain has 131 transputers.) We
now consider values of n which are less efficient. For n = 2, we have P > 61, and for n = 3, we have P > 38; so, although this problem with N = 10^6 is tractable, the most efficient number of levels is not usable. The adaptive FMM (section 2.4.2) requires larger parallel overheads, thereby reducing this N_max value even further.
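To make the arithmetic of eqn.(6.3) concrete, here is a small sketch estimating the minimum number of processors for N = 10^6 at n = 1. The exponents of eqn.(6.3) are reconstructed here as read from the text, so treat the formula as indicative rather than definitive:

```python
import math

def min_processors_3d(n_particles, n_levels, n_max_per_proc):
    """Minimum P such that the average number of particles per processor,
    including the 2-leaf-cell overlap of eqn (6.3), fits within N_max."""
    overlap_factor = 8**(2 - n_levels) + 3 * 2**(2 - n_levels) + 1
    return math.ceil(overlap_factor * n_particles / n_max_per_proc)

# n = 1, N = 10^6, N_max = 7.2e4 particles per transputer (from the time trials).
print(min_processors_3d(1_000_000, 1, 7.2e4))   # -> 209, i.e. P > 208
```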
Chapter 7

Fluid Dynamical Application

7.1 Background Information

When the viscosity of a fluid, ν, is sufficiently small, its motion can be found to evolve very differently from seemingly identical initial conditions. This phenomenon is categorised as turbulence. The velocity takes random values which are not determined by the nature, or the `macroscopic' properties, of the flow, but appear at the microscopic level and are best described probabilistically. It is assumed, however, that the average properties of the fluid motion are deterministic [12].

A useful dimensionless measure in fluid mechanics is the Reynolds number, Re, where

    Re = UL/ν,

in which U is the characteristic velocity of the fluid and L is the characteristic length of the model. If Re is large (Re ≳ 10^4) the flow is typically turbulent. There are a variety of well-known mathematical procedures for modelling turbulence, including the Boussinesq model, which was one of the first; the so-called k-ε model (of which several versions exist); spectral models based upon the energy spectrum of the flow; and the Baldwin-Lomax model [12, 57].

With the development of powerful computers, numerical simulation of turbulent
flow can be attempted by the direct integration of the Navier-Stokes equations, using a regular mesh laid over the entire flow domain [33]. However, there is a shortcoming in such methods which is related to the use of grids. For several turbulence models, microscopic perturbations in the flow field have a profound effect on the macroscopic flow. With grid methods, irrespective of grid size, movements at a microscopic level may seem invisible, since very small perturbations can occur between mesh points, as if they fall through the mesh; this is known as `grid-drain'. One of the models used by the British Meteorological Office is an example of such an Eulerian mesh: a uniform 2-dimensional mesh over the planet with a grid size of 15 km. It is the limitations of the size and speed of present-day computers which dictate this rather coarse mesh. Such a global model will not adequately simulate local wind-flow effects, as turbulence in the atmosphere exhibits structure on much smaller length scales, and therefore this model will not describe wind flow correctly over long time periods. To complicate the problem further, other sections of the flow domain may consist of purely laminar flow, needing relatively few grid points. To attempt to cater for this high non-uniformity, meshes can be refined in areas where a higher resolution is required. This technique is known as adaptive mesh refinement, but the problems of `grid-drain' still persist. The following section describes a computational fluid flow model which is naturally self-adapting.
7.2 Random Vortex Methods

An answer to the problems which arise from the use of an Eulerian approach is to adopt a grid-free, or Lagrangian, fluid flow model, such as the class of models known as Random Vortex Methods. When modelling a particular system, these methods require much less memory than a regular mesh, and yet produce results of equal quality. The Random Vortex Method was developed by Alexandre Chorin [24], and an overview is found in [72]. The method is based on discretising the continuum vorticity field, ∇ x u, into discrete Lagrangian elements, where u is the velocity field. These elements contribute to, and are affected by, the flow; that is to say, each vortex element is free to move in the flow field they collectively induce. There is a higher concentration of vortex elements within the more turbulent regions, and far fewer in a smooth, laminar flow region. The flow they produce is a solution to the Navier-Stokes equation, and in principle the method should model turbulent flow.
7.2.1 Navier-Stokes and the Vorticity Transport Equations

The Navier-Stokes equations for incompressible, viscous flow in a region D with boundary ∂D are

    Du/Dt = -∇P + (1/Re) ∇²u,    [7.1]

with ∇.u = 0 in D and u.s = 0 on ∂D (the no-slip condition, where s is a vector tangential to the boundary). Also we have u.n = 0 on ∂D (the impermeability condition, where n is the outward normal to the boundary). In eqn.(7.1), P = P(x, t) is the pressure, Re is the Reynolds number, u = u(x, t) is the velocity vector at position vector x, and D/Dt is the material derivative,

    D/Dt = ∂/∂t + u.∇.

By taking the curl of the Navier-Stokes equation, as well as invoking vector identities, the Vorticity Transport Equation can be derived, i.e.

    Dω/Dt = (ω.∇)u + (1/Re) ∇²ω,

where the vorticity, ω, is defined by ω = ∇ x u, and u = 0 on ∂D as before. If ν → 0 then Re⁻¹ → 0, and the equations tend to those of inviscid flow. Note that in 2 dimensions ω is orthogonal to the velocity u and the term (ω.∇)u vanishes.

We may determine the velocity field induced by a vorticity distribution in the following manner. Since we have ∇.u = 0 and ω = ∇ x u, there exists a vector function ψ such that u = ∇ x ψ and

    ∇²ψ = -ω.    [7.2]

In 3 dimensions ψ is a vector, known as the velocity vector potential. In 2 dimensions ψ is a scalar, known as the stream function, which is calculated in terms of a complex potential, denoted here by Φ. We may write ψ, and consequently u, in terms of ω. A solution to ψ in eqn.(7.2) is given by the convolution integral

    ψ(x, t) = ∫ L(x - z) ω(z) dz,

where L(x) is the Poisson kernel

    L(x) = -(1/2π) log|x|,   x ∈ R²,
    L(x) =  1/(4π|x|),       x ∈ R³.
In 2D the convolution integral is an area integral, and in 3D it is a volume integral. Since u = ∇ x ψ, we may express the velocity in terms of the vorticity by means of a Biot-Savart integral, such that u = K * ω, where * denotes a convolution, i.e.

    u(x, t) = ∫ K(x - z) ω(z) dz,    [7.3]

where K is the Biot-Savart kernel. If x ∈ R², K is expressed as

    K(x) = (1/2π) (-x₂, x₁) / |x|²,

and if x ∈ R³ then K is expressed as the matrix

    K(x) = 1/(4π|x|³) [   0    x₃   -x₂ ]
                      [ -x₃    0    x₁ ]
                      [  x₂   -x₁    0 ].
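A short sketch (not the thesis code) evaluating these kernels with numpy; in 3D the matrix form above is equivalent to K(x)ω = -(x × ω)/(4π|x|³):

```python
import numpy as np

def biot_savart_2d(x):
    """2D kernel: K(x) = (1/(2*pi*|x|^2)) * (-x2, x1)."""
    r2 = x[0]**2 + x[1]**2
    return np.array([-x[1], x[0]]) / (2.0 * np.pi * r2)

def biot_savart_3d(x, omega):
    """3D kernel applied to a vorticity vector: -(x cross omega)/(4*pi*|x|^3)."""
    r = np.linalg.norm(x)
    return -np.cross(x, omega) / (4.0 * np.pi * r**3)

print(biot_savart_2d(np.array([1.0, 0.0])))
print(biot_savart_3d(np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])))
```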
For numerical purposes the vorticity field is discretised spatially in the following manner: we take

    ω(x) = Σ_{i=1}^{N} ω_i ζ(x - x_i),    [7.4]

where ω_i is a measure of the strength of the ith vortex element, x_i is its position vector, and ζ(x) is a function of small support, which in the first instance may be thought of as a delta function. Substituting eqn.(7.4) into the convolution integral for ψ(x, t), we can calculate ψ at each vortex element i from

    ψ(x_i, t) = Σ_{j=1, j≠i}^{N} L_ζ(x_i - x_j) ω_j,    [7.5]

where L_ζ is a smoothed kernel, L_ζ = L * ζ. Then we may calculate the velocity vector at the ith vortex element as

    u(x_i, t) = ∇ x ψ(x_i, t) = Σ_{j=1, j≠i}^{N} ∇_i x (L_ζ(x_i - x_j) ω_j),

where ∇_i indicates the gradient with respect to the x_i variables. This equation may be expressed in terms of a mollified Biot-Savart kernel:

    u(x_i, t) = Σ_{j=1, j≠i}^{N} K_ζ(x_i - x_j) ω_j.    [7.6]

Therefore, to update the position of the vortex element i, its velocity is calculated in terms of all the other remaining vortex elements by employing the expression given
in eqn.(7.6). This constitutes an N-body problem, so the FMM may be employed to reduce the time for computation.

The function ζ(x) in eqn.(7.4) may be chosen from various delta-like functions called smoothing functions [24], characterised by a smoothing core. These functions describe the discrete elements of vorticity in a de-singularised way, and can be described as vortex blobs, vortex sheets (in the case of shear layers), filament segments, tubes or loops. The structure of the problem determines which definition of ζ(x) is to be used. In 2 dimensions, for flow over a bluff body, vortex blobs are used in the interior while the no-slip condition is simulated by the creation of vortex sheets at the boundary [19, 20]. These sheets are then allowed to move in, and contribute to, the flow until, at a certain distance from the wall, they are transformed into vortex blobs. For 3 dimensions we can think of the Lagrangian discretisation as consisting of vortex filaments in the interior, which may also stretch in the local velocity field.

With viscous incompressible flow, the Vorticity Transport Equation can be split up using a technique known as the fractional step method [26, 72]. For simplicity, consider the Vorticity Transport Equation in 2 dimensions, i.e.

    ∂ω/∂t = -(u.∇)ω + (1/Re) ∇²ω,    [7.7]

where we have decomposed the material derivative into its two parts. The first term is the advection term, the second is the diffusion term. We discretise the problem in time with a time-step Δt. The pure advection equation

    ∂ω/∂t = -(u.∇)ω

is solved, given the distribution ω, for u_i^n at a time nΔt, using eqn.(7.6). Hence the ith vortex segment can be evolved by

    x_i^{n+1} = x_i^n + u_i^n Δt

to determine the vorticity distribution after one time-step of advection¹. Simultaneously with advection, there is diffusion. This can be simulated by Brownian motion: a random-walk displacement, η_i, of mean μ = 0 and variance σ² = 2Δt/Re, is superposed on the advected distribution;

    x_i^{n+1} = x_i^n + u_i^n Δt + η_i.

¹ Note that a simple Euler time-integrator was used for illustration. It is advisable to employ a higher-order time-integrator to retain precision.
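A minimal 2D sketch of one fractional step as described above: a direct O(N²) evaluation of eqn.(7.6) with a simple blob smoothing, a forward-Euler advection step, and the superposed random walk. The smoothing parameter and all names are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def velocities(x, gamma, delta=0.05):
    """Direct pairwise sum of a mollified 2D Biot-Savart kernel:
    u_i = sum_j (1/(2*pi)) * (-dy, dx)/(|d|^2 + delta^2) * gamma_j."""
    n = len(gamma)
    u = np.zeros_like(x)
    for i in range(n):
        d = x[i] - x                           # d_ij = x_i - x_j
        r2 = d[:, 0]**2 + d[:, 1]**2 + delta**2
        r2[i] = np.inf                         # exclude the self-term j = i
        u[i, 0] = np.sum(-d[:, 1] / r2 * gamma) / (2.0 * np.pi)
        u[i, 1] = np.sum( d[:, 0] / r2 * gamma) / (2.0 * np.pi)
    return u

def fractional_step(x, gamma, dt, reynolds, rng):
    """Advect with a forward-Euler step, then superpose the Brownian
    random walk with variance 2*dt/Re (the diffusion sub-step)."""
    u = velocities(x, gamma)
    eta = rng.normal(0.0, np.sqrt(2.0 * dt / reynolds), size=x.shape)
    return x + u * dt + eta

rng = np.random.default_rng(0)
x = rng.random((100, 2))                       # 100 vortex blobs at random positions
gamma = rng.normal(size=100) * 0.01            # with random strengths
x = fractional_step(x, gamma, dt=0.01, reynolds=1000.0, rng=rng)
```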
A more rigorous discussion of this is found in [72].

For viscous flow, the boundary conditions on the velocity field are simple to express, i.e. u.n = 0, the impermeability condition, and u.s = 0, the no-slip condition. We may combine these conditions as u = 0 on ∂D. To uphold the boundary conditions for the vorticity field, we must take a different approach to that used for calculations of the velocity field. Here we have a collection of migrating computational elements, i.e. vortex elements. We ensure that the migrating elements remain within the flow domain, which enforces the impermeability condition. To simulate the no-slip condition, vorticity is created at the boundary by introducing new vortex segments to the ensemble. These two boundary conditions are enforced at each time-step in two separate operations.
7.3 Calculating the Derivative of the Potential

The FMM may be employed to calculate the potential for a wide variety of problems, but it is often the case that it is the derivative of the potential that is required. In gravitational and electrostatic problems, the potential in 2D and 3D is expressed as a scalar, and the force typically takes the form ∇Φ. In fluid flow problems, the velocity field is expressed in terms of the derivative of the vector potential.

In 2 dimensions the velocity is derived from the potential φ, which is the real part of the complex potential, i.e. φ = Re(Φ). Thus, once the complex potential has been calculated, the velocity vector can be retrieved from the real and imaginary components of its derivative, i.e.

    u_x = Re(dΦ/dz)   and   u_y = -Im(dΦ/dz),

where z ∈ C and u = (u_x, u_y) is the velocity vector induced by a vorticity field. Thus if the potential is expressed locally as a pth-order polynomial in z, as in the FMM, where

    Φ(z; p) = Σ_{l=0}^{p} b_l z^l,

then we have

    dΦ(z; p)/dz = Σ_{l=1}^{p} l b_l z^{l-1}.

In 3 dimensions the three components of the vorticity must be represented by a vector potential, ψ, since a divergence-free vector field cannot in general be derived from a scalar field. To use the FMM it becomes necessary to take the 3 components of the vorticity ω = (ω₁, ω₂, ω₃) and consider each one separately as a scalar. Once the 3 components of the potential have been calculated, the velocity vector is calculated by taking the curl of the resulting vector, i.e. u = ∇ x ψ. There are two possible methods of performing this calculation of the 3 components of ψ. The first is to create 3 different multipole expansions and 3 separate local expansions contained in the same tree. The other is to execute the FMM 3 separate times, once for each component. The former idea reduces the execution time, since the framework of the tree is computed only once, but the amount of required memory is increased by a factor of 3. In the following parallel implementation of a 3D vortex code, minimising memory use is paramount, and so the latter method has been chosen.
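A sketch of recovering the 2D velocity from the derivative of a p-term local expansion, i.e. u_x = Re(dΦ/dz), u_y = -Im(dΦ/dz); the coefficients below are hypothetical:

```python
import numpy as np

def velocity_from_local_expansion(b, z):
    """b: complex coefficients b_0..b_p of the local expansion (about the cell
    centre); z: complex evaluation point relative to that centre."""
    dphi_dz = sum(l * b_l * z**(l - 1) for l, b_l in enumerate(b) if l >= 1)
    return dphi_dz.real, -dphi_dz.imag

# Hypothetical coefficients for illustration.
b = np.array([0.1 + 0.2j, -0.3 + 0.1j, 0.05 - 0.02j])
print(velocity_from_local_expansion(b, 0.1 + 0.05j))
```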
7.3.1 The Error of the Derivative of a Truncated Local Expansion

Anderson [5] states that one order of 1/c is lost from the absolute error when the derivative of the local expansion is taken, and that this loss of accuracy can be overcome in one of two ways. Firstly, one can execute the FMM with one extra term in the expansion. In two dimensions this causes only a slight increase in time, but the increase is rather more dramatic in three dimensions. The second method is to reformulate the mathematical operations of the FMM from the analytical expression for the derivative of the potential. This second method retains the precision `lost' and, moreover, removes the need to take the derivative of each of the vector potentials. It thereby reduces the amount of computational labour, but it has not been included in our implementation because of time constraints. Intuitively, the conjecture that accuracy is lost seems reasonable, since the local expansion, i.e. eqn.(2.6), is a power series in 1/c, and taking the derivative subtracts one from the power of the last term. However, the following investigation into the derivative of the potential shows that Anderson's conclusion is not relevant to the relative error.

The 2-dimensional case

The relative error, ε = ε_abs/A, where A = Σ|q_i|, of the local expansion of the potential with p terms was shown in appendix(A.1) to satisfy

    ε < { 1/((p + 1)(c - 1)) + ln(1 + p/c) ((c + 1)/(c - 1)) (p/(p - 1))^{p+1} } (1/c)^{p+1},

where c = |z₀|/r - 1, |z₀| is the distance between the centres of the two expansions, and r is the radius of the disc containing the sources. Hence

    ε = O((1/c)^{p+1}).    [7.8]

The expression for a local expansion can be differentiated as follows:

    Φ'(z) = Σ_{l=0}^{∞} l b_l z^{l-1}

    ⇒  ε_abs = | Φ'(z) - Σ_{l=0}^{p} l b_l z^{l-1} | = | Σ_{l=p+1}^{∞} l b_l z^{l-1} |.

The calculation for the error-bound may be found in Appendix A.2.1, where it is shown that the relative error, ε = |z| ε_abs / A, is bounded according to

    ε < (1/c)(1/(c + 1))^p + ln(1 + p/c) ((c + 1)/(c - 1)) { p + 1 + (c + 1)p/((c - 1)c(p - 1)) } (p/(p - 1))^{p+1} (1/c)^{p+1},

which implies that

    ε = O((1/c)^{p+1}).

Therefore the relative error of Φ'_FMM has the same order of accuracy as the relative error of Φ_FMM, eqn.(7.8).
The 3-dimensional case

The error of the local expansion of the potential with p terms, from eqn.(3.21), was shown to satisfy

    ε = ((ρ - a)/A) ε_abs < (1/(c - 1)) (1/c)^p = O((1/c)^p).    [7.9]

For the derivative we have

    ∇Φ = r̂ ∂Φ/∂r + θ̂ (1/r) ∂Φ/∂θ + φ̂ (1/(r sin θ)) ∂Φ/∂φ,

where r̂, θ̂, φ̂ are unit vectors of the spherical coordinate system (r, θ, φ). The calculation for the error-bound may be found in Appendix(A.2.2), where it is shown that the relative error, ε = ((ρ - a)²/A) ε_abs, is bounded by an expression, involving p and c, which is again of order (1/c)^p; hence the derivative has the same order of relative error as the potential itself, eqn.(7.9).

... for N > 2500, where more of the field is being approximated than for n = 2, 3. However, the time saved by employing the FMM with n = 4 is not as dramatic as in the case of a uniform distribution. Figure 7.3 also indicates the effectiveness of the basic element of adaptivity (cf. section 3.4.3) that has been incorporated into our modified 3DFMM. If we consider the case of N = 1012, then many of the cells in a tree with n = 4 will be empty. The point where a `full' tree occurs is the point at which the time required to traverse the tree does not increase as N increases, i.e. the `direct' summation starts to dominate the overall execution time.
Figure 7.3: Execution times for the sequential 3DFMM, non-uniform distribution (n = 2, 3, 4 and the `direct' method).
From figure 3.12 in section(3.8), we may deduce that a tree with n = 4 is `full' when N = 20000. The execution time for this `full' tree is approximately 10000 secs. We conclude that a tree without an element of adaptivity would execute in a time of at least 10000 secs for all N with n = 4. Our element of adaptivity has reduced the time for traversing an `empty' tree, N = 1012 with n = 4, to only 24.4 secs.
Parallel version results

We tested each of the three possible cubic configurations (i.e. P x P x P for P = 2, 3, 4), with p = 3, for N = 180, N = 1500 and N = 10000. The resulting execution times are slower than both the `direct' method and the sequential versions, due to extremely poor parallel efficiency. For example, let us consider the 4x4x4 grid with n = 1. The leaf cells in this configuration have the same dimension as the leaf cells of the sequential FMM with n = 3. Therefore, the greatest part of the computation is the `direct' summation. This summation is performed by the fluid-flow code transputer, and this executes simultaneously with the parallel FMM, which approximates the far-field potential. Since the far-field approximation is such a relatively small amount of computation, the ratio of computation to communication is simply too small to achieve speed-up.

This parallel FMM is optimised for uniform distributions, but the vortex torus requires a highly non-uniform distribution of vortex segments, as can be seen in figure 7.2. In such a situation, a cell of the tree is communicated between processors even if it is empty, and if many processors are used then some of them may perform no work.

Another method of parallelising the FMM was then tested, such that the `direct' summation computation was also sent to the array of processors, after the far-field approximation had been computed. In this case we achieved a greater speed-up over the sequential version, since we divided up the work of the `direct' computation among the processors, as well as the work of the far-field approximation. Moreover, this parallelised `direct' computation requires no inter-processor communication. However, the available speed-up is poor in comparison to the sequential 3DFMM on the Sun Sparc ELC. An alternative manner of distributing the segments over the transputers is considered in section(8.3), as well as a scattered domain decomposition which could achieve a factor of 6 speed-up on 6 processors.

Many Vortex Methods, and indeed most particle simulations, demand a non-uniform distribution, and therefore we require an adaptive fast summation method with a controllable error. One such method is outlined in section(8.3.1).
Chapter 8

Conclusions and Future Directions

8.1 The Fast Multipole Method

The FMM employs truncated multipole and local expansions to approximate a potential field to a prescribed precision. The number of terms, p, retained in the expansions has a strong influence over the time for execution, as the time to compute the local expansions is of order p² in 2D and of order p⁴ in 3D. However, previous versions of the 3DFMM have used a naive expression to determine p. It was found that a conservative error-bound leads to a liberal choice of parameters, resulting in an excess of computational labour. In this thesis we investigate improved methods of predicting p which satisfy prescribed bounds on the error. Suitable revised expressions for p are derived, but further error analysis will be required to optimise the expression.

The implementation of the FMM described in this thesis is the first O(N) fast summation method to incorporate the idea that, as the distance to a cluster increases, the number of terms can be reduced while still retaining the same precision, i.e. the `dynamic p principle'. This principle would be still more effective when used with an expression for p derived in terms of the actual maximum error which arises from the FMM, rather than an approximate bound on the theoretical error.

The number of levels, n, used in the FMM also has a great influence over the execution time; n is governed by the maximum number of particles per leaf cell,
s. We show that the optimal value of s is not an independent constant, and varies linearly with p for 2D and quadratically with p for 3D; furthermore, the optimal value for n will also depend on the implementation. No analytical expression can be derived to determine the optimal number of levels, and so an empirical approach is required. One such approach is described in detail in [5].

`Super-nodes' are discussed in section(4.1), and it is shown that, while use of the more distant nodes is valid, the closest of the super-nodes will cause a loss of precision. To compensate for this loss, more multipole coefficients are required, which results in an increase in computation. Moreover, this increased work load can exceed the amount of work required if the super-node is not utilised. It is concluded that the use of all the possible super-nodes is not advisable until more information is known about their contribution to the overall error.

We introduced the optimum interaction set for the 3DFMM (section 3.7), and reviewed other devices for speed-up, such as utilising the inherent symmetry of a hierarchical tree (section 3.4.1). A stable generator of associated Legendre polynomials is presented and, moreover, various properties of the associated Legendre polynomial are exploited to reduce the computational effort (section 3.2).

The rough approximations adopted by other authors, such as super-nodes and a naive choice of equation to determine the number of terms in the expansions, have not been found to produce an error which exceeds the prescribed error [61, 71]. In other words, the maximum error produced by the FMM does not attain the analytically-derived maximum error. This is because the truncation errors of individual truncated local expansions may differ in sign due to the particle distribution. Thus the typical error may be reduced because the summation of expansions partially cancels the separate errors. Furthermore, the fact that the number of terms in the expansions must be an integer means that the maximum error produced by the FMM will typically be less than the prescribed error. Under certain circumstances the use of these rough approximations is, in fact, valid, as is illustrated in the numerical trials in sections (2.5.4) and (3.5.4), but they cannot be employed generally.

Future work will include investigating the plausibility of precomputing the constants which are calculated for each shifting and converting of the expansions. Preliminary investigations, however, suggest that this would substantially increase the required amount of memory space, which is at a premium in the parallelisation.
The FMM can be considered to be the `high performance' N-body Treecode, since its use is recommended for problems demanding a high number of particles and high precision. Intuitively, it appears that the FMM is faster than the Barnes-Hut Algorithm for the same precision; however, it has been suggested [74] that the hidden constants in the expression for the order of complexity are so large that, for the FMM to be faster, N would need to be larger than one can simulate with present technology.
8.2 The Parallelisation of the FMM

The method of parallelisation differs from other MIMD distributed-memory implementations using local domain decomposition, in that processors do not form a hierarchy. The bottleneck which is `inherent' to the FMM means that some processors would lie idle while a separate processor acts upon the top of the tree. By catering for the top of the tree at a lower level of the tree, we have been able to remove this bottleneck.

Two scalable communication strategies are presented in this thesis, employing mixtures of blocking and non-blocking protocols. These strategies are a systolic loop through a lattice of processors, and a nearest-neighbour processor inter-communication algorithm. A scalable, purely blocking nearest-neighbour processor inter-communication algorithm is also presented, but not utilised. A drawback to these, and indeed to any nearest-neighbour processor inter-communication strategy, is that boundary processors will lie idle for some of the time, which results in poor load balancing. These communication algorithms are not specific to this particular parallel problem, and can be employed for other parallelisations on similar platforms, such as parallel Multigrid Methods [31, 64], since they have a similar hierarchical structure to N-body Treecodes.

The optimal parallel strategy, for both the 2D and 3DFMM using our local domain decomposition, is to employ a non-hierarchical configuration of cells in conjunction with the systolic loop algorithm. In this context, as the number of particles increases, one should increase the number of processors and not the number of levels in the FMM. The hierarchical tree is only introduced when there are no longer any further available processors.

The break-even points, where the FMM becomes faster than the `direct' method,
are N ≈ 70 and N ≈ 1000 particles for the parallel 2DFMM and 3DFMM respectively. The corresponding break-even points for the sequential versions are N ≈ 180 for 2D and N ≈ 5000 for 3D. These break-even points apply to a uniform distribution of particles.

The act of parallelising the sequential code is quite complex, as is illustrated by the increase in the number of lines of code required. In 2 dimensions, 351 lines of code were extended to 889 lines, and in 3 dimensions, 588 lines of code were extended to 1411 lines. Moreover, explicit message-passing is a more complicated procedure than that of accessing data in shared-memory machines, because in the latter case it is the machine which controls the flow of data as opposed to the programmer. Explicit message passing introduces considerable parallel overheads, such as those associated with data replication and storing communicated data. This occupies valuable memory space which could otherwise be used for computation. The execution of the parallel codes was found to be very time-consuming; several days, even weeks, were not uncommon when obtaining the numerous timing results.

The memory limitations have a dramatic effect on the kinds of simulation that are possible. From the time trials for 3 dimensions (section 3.8), with p = 3, the maximum number of particles per transputer was found to be N_max ≈ 7.2 x 10^4 for n = 1. For n = 2, N_max ≈ 6.5 x 10^4; for n = 3, N_max ≈ 4.6 x 10^4; and at present, n ≥ 4 is not possible. The adaptive form of the FMM requires larger parallel overheads, thereby reducing this N_max value even further.

The speed of computation was also found to be rather slow, which may be illustrated if we consider the system with N = 1000, p = 3 and n = 2. The sequential version on a single transputer took 107 secs, whereas the same program on the Sun Sparc ELC workstation took 26 secs. As another example, consider the `direct' pairwise calculation for N = 1000. The single transputer took 88 secs, whereas the same program on the Sun Sparc ELC took 35 secs. In fact, on average, the T800 transputer is more than a factor of 2.5 slower than the Sun Sparc ELC for this `direct' computation. Therefore, for the FMM on the Meiko CS-1 to be competitive with other parallel versions of the FMM, the memory of each processor, the speed of floating-point operations and the total number of transputers would all have to be increased. Furthermore, it is desirable that the user should be able to configure the transputers himself, rather than utilising CS Tools. This requires a `hard-wiring facility', so
that the actual topology of transputers can closely resemble the abstract topology.
8.3 The Vortex Application

We have successfully embedded a uniform parallel 3DFMM within a vortex simulation. We presented the expressions required to determine each vortex segment's velocity vector, calculated in terms of the vector potential as approximated by the 3DFMM. This required the derivation of a stable generator for the first derivative of the associated Legendre polynomial. However, due to the high non-uniformity of vorticity in many applications of Vortex Methods, our local domain decomposition parallelisation exhibits poor speed-up over the sequential version.

The parallel 3DFMM may have achieved less than we had hoped for, but this parallelisation did provide an insight into the quality of the local domain decomposition paradigm. If we are to retain local domain decomposition, it would be prudent to investigate the idea of dynamically configuring the grid of processors at each time-step. This would entail calculating a cuboid defined by the limits of the distribution of vortex segments at each time-step. The grid of processors would then be formed to encompass this cuboid, such that each processor domain is a cube. At the start of the computation, the program would invoke all available processors, but processors which fall outside the cuboid would be ignored, so that no communication would occur between these `empty' processors and any other processor.

Another possible parallel strategy is to employ a form of scattered domain decomposition; this may achieve a factor of 6 speed-up using 6 processors. If we consider the calculation of the vector potential, we are required to calculate each of the three components of this potential separately. These computations are independent of each other. We may also consider the potential, ψ, as the sum of the far-field potential and the near-field potential, i.e. ψ = ψ_far + ψ_near. ψ_far is approximated by the use of multipoles, and the near-field potential, ψ_near, is calculated by the `direct' summation method. These two calculations are also independent of each other. Thus we have 6 independent calculations, which may be performed by 6 processors with no inter-processor communications. By the time analysis investigations in sections (2.3) and (3.3) we showed that the time to compute ψ_far and ψ_near should be approximately the same to achieve optimal speed-up.
This would suggest that this alternative parallel strategy would exhibit a balanced load, and hence would achieve a high parallel efficiency.

For most Vortex Methods, an adaptive N-body Treecode with a highly controllable precision is essential. If the method of local domain decomposition described in this thesis is used on the CS-1, the adaptive FMM and its associated parallel overheads would occupy too much memory to be viable. Moreover, an adaptive version would also require a more complex form of data management.
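A schematic sketch of the six-way scattered decomposition described above (three components of ψ, each split into an independent far-field and near-field task). The worker functions are placeholders, and a Python process pool merely stands in for six transputers:

```python
from concurrent.futures import ProcessPoolExecutor

def far_field(component):
    # Placeholder for the multipole (far-field) approximation of one component.
    return f"far({component})"

def near_field(component):
    # Placeholder for the `direct' (near-field) summation of one component.
    return f"near({component})"

def six_way_potential():
    """Launch the 6 independent calculations (3 components x {far, near})
    with no inter-task communication, then collect the results."""
    tasks = [(far_field, c) for c in "xyz"] + [(near_field, c) for c in "xyz"]
    with ProcessPoolExecutor(max_workers=6) as pool:
        futures = [pool.submit(fn, c) for fn, c in tasks]
        return [f.result() for f in futures]

if __name__ == "__main__":
    print(six_way_potential())
```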
8.3.1 An Alternative Adaptive FMM

As an alternative, therefore, we might consider the following novel adaptive N-body Treecode, which is a hybrid of the FMM, the Barnes-Hut Algorithm [12] and Buttke's method [15]. Essentially, the adaptive hierarchy of the Barnes-Hut Algorithm is retained to form the tree, but the multipole expansions are centred on the cluster's centre, and not at the cluster's centre-of-mass as in the case of the Barnes-Hut Algorithm. The tree would then be descended once for every particle, wherein each well-separated cell at every level that has not already been accounted for would contribute a p²-term multipole expansion to be evaluated at that particle. The value of p would be determined by the distance to each cell. No local expansions would be calculated.

Consider the 3-dimensional case for a uniform distribution of particles with s = 1, i.e. each leaf cell contains only one particle. This hybrid method would then be of O(p²N log N), as compared to the 3DFMM with s = 1, which is of O(p⁴N). This novel method would be faster than the 3DFMM when log N < Cp², where C is some constant of proportionality which depends on the implementation. So the novel code would be faster than the FMM for small N and high precision. Vortex Methods require a high-precision fast summation method, to minimise numerically-induced instabilities, and we feel confident that this alternative FMM has promise in this respect. The techniques mentioned in section(4.3) on the implementation of the Barnes-Hut Algorithm may also be employed by this N-body Treecode, such as the `face-distance' parameter. The local domain decomposition could be used to parallelise this Treecode using a systolic loop for processor inter-communications, although the scattered domain decompositions suggested by Baden [7], Salmon [69] and Singh [75] would be more efficient. This novel Treecode would be far simpler to implement than the 3DFMM and would similarly define the maximum error a priori. However, the most attractive feature of this code is its sensitivity
to the particle distribution, which is essential for the highly non-uniform problems addressed by most Vortex Methods.
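As a rough illustration of the cross-over argument for the hybrid Treecode, a sketch comparing the two operation-count models p²N log N and p⁴N; the constant C is an assumed, implementation-dependent value:

```python
import math

def hybrid_faster(n_particles, p, c=0.25):
    """The hybrid treecode, O(p^2 N log N), wins over the 3DFMM, O(p^4 N),
    roughly when log N < C * p^2 (C is implementation dependent)."""
    return math.log(n_particles) < c * p**2

for p in (4, 8):
    for n in (10**4, 10**6):
        print(p, n, hybrid_faster(n, p))
```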
Appendix A

Derivation of Error Bounds

A.1 Derivation of a Stricter Error Bound for the Truncated Local Expansion in 2 Dimensions

For definitions and other details in the following analysis see Greengard [42]. Consider the error ε_abs in a p-term local expansion of a multipole expansion centred at z₀. This is given by

    ε_abs = | Φ(z) - Σ_{l=0}^{p} b_l z^l | = | Σ_{l=p+1}^{∞} b_l z^l |,

where

    b_l = -a₀/(l z₀^l) + (1/z₀^l) Σ_{k=1}^{∞} (a_k/z₀^k) C(l+k-1, k-1) (-1)^k,   for l ≥ 1,

in which C(l+k-1, k-1) denotes the binomial coefficient. If we let

    β_l = -a₀/(l z₀^l)   and   γ_l = b_l - β_l,

we see that the error can be bounded by

    ε_abs ≤ S₁ + S₂,   where   S₁ = | Σ_{l=p+1}^{∞} β_l z^l |   and   S₂ = | Σ_{l=p+1}^{∞} γ_l z^l |.

We have

    S₁ ≤ |a₀| Σ_{l=p+1}^{∞} (1/l) |z/z₀|^l ≤ (A/(p+1)) Σ_{l=p+1}^{∞} (1/(c+1))^l = (A/((p+1)c)) (1/(c+1))^p,    [A1]

where |a₀| ≤ Σ|q_i| = A, c = |z₀|/r - 1 and r is the radius of the disc in which the local expansion is to be evaluated (i.e. |z| < r). This is a more stringent bound than the bound calculated in [42] for S₁.

Following Greengard [42], we now define

    M = Σ_{k=1}^{∞} |a_k| / (r(1 + c) - s)^k,   where   s = cr (p - 1)/p.

Then Greengard [42] states that

    S₂ = | Σ_{l=p+1}^{∞} γ_l z^l | < M Σ_{l=p+1}^{∞} (|z|/s)^l = M (|z|/s)^{p+1} / (1 - |z|/s).    [A2]

A stricter bound for M in eqn.(A2) than that given in eqn.(2.36) in [42] can be found as follows. By eqn.(2.9) in [42] we have |a_k| ≤ A r^k / k, so that

    M < A Σ_{k=1}^{∞} (1/k) (p/(p + c))^k = -A ln(1 - p/(p + c)) = A ln(1 + p/c).    [A3]

From [42], when p ≥ 2c/(c - 1) we have (1/2)(c + 1) r ≤ s < cr, and since |z| < r, we have

    |z|/s < 2r/((c + 1)r) = 2/(c + 1)   ⇒   1 - |z|/s > (c - 1)/(c + 1).    [A4]

Therefore, by substituting eqn.(A1), eqn.(A3) and eqn.(A4) into eqn.(A2), which is analogous to the procedure followed in [42], we have

    S₂ < A ln(1 + p/c) ((c + 1)/(c - 1)) (p/(p - 1))^{p+1} (1/c)^{p+1},

which implies that the relative error is bounded by

    ε = ε_abs/A < (1/((p + 1)c)) (1/(c + 1))^p + ln(1 + p/c) ((c + 1)/(c - 1)) (p/(p - 1))^{p+1} (1/c)^{p+1}
                < { 1/((p + 1)(c - 1)) + ln(1 + p/c) ((c + 1)/(c - 1)) (p/(p - 1))^{p+1} } (1/c)^{p+1}
                = O((1/c)^{p+1}).

This expression has the same order in 1/c, but it produces a smaller value of ε for any given c and p compared with eqn.(2.13) in section(2.5.2).
) abs =
p 1 X X 0 l ? 1 l ? 1 (z) ? lblz = lblz l=0 l=p+1 :
As in section A.1
l = ?lzal0 and l = bl ? l: 0 Then the absolute error can be bounded by abs S1 + S2; where Now
S1 = 1 X
1 X
l=p+1
l lz
l?1
S2 =
and
1 X
l=p+1
l lz
l?1
:
p 1 z l?1 A X 1 1 l X a A 1 0 l?1 ? l z ja0j zl jzj S1 = = jzjc c + 1 ; l=p+1 z0 l=p+1 0 l=p+1 c + 1
where A = P jqij. If we de ne
M= where s = cr
p?1 p
1 X
k=1
ak (r(1 + c) ? s)k
, as in section A.1, then we have
1 1 X X l lzl < jMzj S2 = j1zj l jzsj l=p+1 l=p+1
!l
[A5]
:
This series is easily evaluated. If Q = P1l=m lbl and jbj < 1 then
Q ? bQ = mbm +
1 X
l=m+1 bm+1
bl
) Q = 1mb? b + (1 ? b)2 : m
[A6]
By comparison with eqn.(A5) with m p + 1 and b jszj we have 8 >
p + j1zj jzsj : 1? s
!p+1
124
+ 1jzj 2 jzj s 1? s
9 !p+2 > = > ;
:
From eqn.(A4) we have 1 ? jzsj > cc?+11 and s = cr p?p 1 , and so in an analogous manner to the previous section and Greengard [42], we deduce that (
+ 1 p + 1 + (c + 1)p S2 < jMzj cc ? 1 (c ? 1)c(p ? 1)
)
p
p?1
!p+1 p+1
1 c
:
If we de ne the relative error by = absAjzj , and if we note that M < A ln(1 + pc ), eqn.(A3), then the relative error is bounded according to (
p + 1 p + 1 + (c + 1)p < 1c c +1 1 + ln 1 + pc cc ? 1 (c ? 1)c(p ? 1)
)
p
p?1
!p+1 p+1
1 c
:
A.2.2 3 Dimensions The expression for a local expansion in 3 dimensions can be dierentiated using the expression 1 @ r = ^r @@r + ^ 1r @@ + ^ r sin @
where ^r, ^ , ^ are unit vectors of the spherical coordinate system (r; ; ), as before. The potential, as computed in the FMM, is described in terms of a p-term local expansion (cf. eqn. 3.18). In analogy to the analysis of eqn.(3.19), an expression for a bound to the absolute error of rFMM may be given by
1 k 1 r X qirn P (cos ) < X n P (cos ) q r abs = r n i i n i n+1 n+1 n=p+1 ( ? a) i=1 i=1 n=p+1 i 1 k X X
since i > ? a. If we de ne the relative error as = (?aA) abs , where A = Pki=1 jqij, then we may describe an upper bound to the relative error by 2
(
k 1 X rn?1 X < A1 qi ^rPn (cos i)n+ n?1 n=p+1 ( ? a) i=1 )
+^Pn0 (cos i)(cos sin i cos( ? i) ? sin cos i) ? ^ Pn0 (cos i )(sin i sin( ? i))
(cos ) . Since where cos i = cos cos i + sin sin i cos( ? i), and Pn0 (cos ) = dPdn(cos
) jPn(cos )j 1 we have
k 1 n?1 X X r 1 ^ 0 Pn (cos i )(cos sin i cos( ? i ) ? sin cos i ) j q j n + < A i n ? 1 n=p+1 ( ? a) i=1 !
?^ Pn0 (cos i )(sin i sin( ? i)) 125
:
[A7]
If we consider the square of the expression inside the modulus signs above we have
j^ Pn0 (cos )(cos sin cos( ? ) ? sin cos ) ? ^ Pn0 (cos )(sin sin( ? )) 2
2 d 2 dP n (cos ) 2 2 = jrPn(cos )j = d(cos ) sin = d Pn (cos ) ; and so, by substituting back into eqn.(A7), we obtain
!
1 k n?1 d X X 1 r < A j q j n + P (cos
) i n i n ? 1 d i n=p+1 ( ? a) i=1
[A8]
:
To get an upper bound on d d Pn (cos ) we may consider the following de nition of the Legendre Polynomial [1], Z Pn (cos ) = 1 (cos + i sin cos )nd 0 Z d 1 ) d Pn (cos ) = 0 n(? sin + i cos cos )(cos + i sin cos )n?1d d ) d Pn (cos ) n sup j ? sin + i cos cos jj cos + i sin cos jn?1 n: 0 Substituting back into eqn.(A8), and following the procedure illustrated in eqn.(A6), we nd that k 1 X r 2nrn?1 X < A1 j q ij = 2 n ? 1 ?a n=p+1 ( ? a) i=1
!p 8
ca, then ?r a is bounded from above by 1c and we conclude that p p 2 + c2 ? cp ! 1 pc 1