Scalable plasma simulation with ELMFIRE using efficient data structures in MPI process communication

Artur Signell∗, Francisco Ogando†, Mats Aspnäs∗, Jan Westerholm∗
Abstract

We describe the parallel full-f gyrokinetic particle-in-cell plasma simulation code ELMFIRE and the problem of solving an electrostatic potential from particle data distributed across several MPI (Message Passing Interface) processes. The potential is obtained from a linear system with a strongly sparse matrix, and ELMFIRE stores data for the estimated nonzero diagonals of the whole matrix in every MPI process, which scales poorly. We present three alternative, more memory-efficient structures for gathering the data while keeping only a local part of the matrix in each process. We also demonstrate that these alternative structures improve scalability, enabling more MPI processes and a finer time and space scale than before without sacrificing performance.

1 Introduction

Plasma simulation has proven to be a very useful tool, complementary to experimentation, in many areas of physics such as nuclear fusion and astrophysics. Specifically, the study of the dynamics of plasma turbulence is crucial for understanding important processes like magnetic confinement in toroidal devices, or the spontaneous formation of macroscopic zonal flows at both engineering and planetary scales. Due to their particular properties of low density and collisionality, plasmas have to be studied in phase space, since the velocity distributions of their components may well deviate from the equilibrium Maxwellian function. Needless to say, a generic 7-dimensional problem involves such complexity that an analytic approach is probably impossible. Even from a numerical point of view, the complexity poses a strong constraint on the simulation capabilities.

Magnetized plasmas are of special interest because of their ubiquitous presence in astrophysics and engineering. Here the particles are subject to the Lorentz force, and particle motion can be decomposed into a fast gyration around magnetic field lines, a longitudinal streaming and much slower drift motions due to field inhomogeneities and the presence of an electric field. Most relevant plasma processes occur at much longer characteristic times than the particle gyration around the magnetic field lines. This is why the gyrokinetic model was developed, in which particle motion is averaged around magnetic field lines, leading to a simplified 6-dimensional particle distribution function. Hence only the motion of particle gyrocenters is calculated, which is subject to the much smoother streaming and drift motions. This method is called the full-f gyrokinetic model.

The full-f method thus refers to simulating the whole particle distribution function under the gyrokinetic simplification. In many interesting cases this function deviates only moderately from a local equilibrium Maxwellian distribution, a property exploited by the delta-f method, which calculates only these deviations. In contrast to the full-f method, delta-f is computationally less demanding, but has an inherent limitation on its simulation capabilities.

∗ Åbo Akademi University, Department of Information Technologies
† Euratom-Tekes Association TKK, Advanced Energy Lab.; UNED, Department of Energy Engineering

In chapter 2 we present an overview of the computational properties of the full-f gyrokinetic plasma simulation code ELMFIRE[1]. We construct a distributed sparse matrix in chapters 3 and 4, and present various schemes for making the process communication of this sparse matrix efficient. The efficiency and scalability of the process communication is measured in chapter 5.

2 ELMFIRE overview

ELMFIRE[1] is a full-f gyrokinetic plasma simulation code developed as a cooperation between Helsinki University of Technology (TKK) and VTT Technical Research Centre of Finland. The code is based on a particle-in-cell (PIC) algorithm, which calculates the dynamics of a set (10^7–10^9) of markers, each marker corresponding to about 10^10 particle gyrocenters. Markers are followed in a toroidal geometry inside a given magnetic field background and a self-consistent electric field. Since ELMFIRE is a PIC code, markers will from now on be referred to as particles, keeping in mind the difference with actual plasma particles. Particles are associated with a selected main ionic species (normally hydrogen or deuterium), electrons and optionally higher-Z impurities (like oxygen). Electrons may be treated either with the PIC algorithm or with an adiabatic model, which assumes an electron density n_e = ⟨n_i⟩ · exp(eΦ/kT), ⟨n_i⟩ being a local average ion density. Collisions between species may be included in the PIC algorithm. As the code is computationally highly demanding in both memory and CPU-time usage, it has been parallelized using the MPI[2] (Message Passing Interface) standard.

ELMFIRE calculates plasma dynamics with a quasineutrality condition that forces ion and electron charges to neutralize each other every timestep over all system cells. This is justified considering the plasma frequency and characteristic times. Fields and particles are evolved consistently in time so that plasma neutrality is kept at every moment. The calculation of motion, decomposed as mentioned in the introduction, is performed partly explicitly and partly implicitly (consistently with the electric field advanced in time), depending on how sensitive the equation is to the given motion. The electrostatic field is then calculated using the gyrokinetic Poisson equation (eq. 1), considering implicitly the previously mentioned drifts.

\[
\nabla^2 \Phi + \frac{q^2}{m B \varepsilon_0} \int \left[ (\Phi - \langle\Phi\rangle)\,\frac{\partial \langle f\rangle}{\partial \mu} + \frac{m}{q\Omega}\,\langle f\rangle\,\nabla_\perp^2 \langle\Phi\rangle \right] \mathrm{d}v = -\frac{1}{\varepsilon_0}\bigl(q\,\bar{n}_i(x) - e\,n_e(x)\bigr) \tag{1}
\]

where Φ stands for the electrostatic potential, q, m the ionic species charge and mass, B the magnetic field, ε_0 the vacuum permittivity, ⟨·⟩ a gyroaverage operator, f the particle distribution function, Ω the ion gyrofrequency and n_{i,e} the ion and electron densities.

As the full-f algorithm accurately follows arbitrarily strong plasma perturbations, the particle distribution function f has to be sampled every time step in order to compute the coefficients of the GK-Poisson equation. This procedure is different in the delta-f approach, which requires construction of the Poisson equation just once from the assumed invariant particle background. This is why delta-f codes never had to solve the problem that motivates this article. It is the construction and resolution of this implicit problem that leads to a numerically heavy process of collecting data from particles placed almost randomly across the toroidal domain. The optimization of this collection process is the target of the work presented in this article.

2.1 Calculation sequence

A simulation run in ELMFIRE starts with the initialization of the particle coordinates according to initial temperature and density profiles, which may naturally evolve in time. Using the gyrokinetic model, five phase-space coordinates are stored for each particle. Particles may optionally be unequally weighted (equivalent to different numbers of actual plasma particles), in which case it is necessary to store the particle weight as a sixth parameter. Particles can be intuitively initialized using a given density profile and local Maxwellian velocity distributions consistent with a given temperature profile.

After initialization, and as a first step in every time step, particles are propagated explicitly using a Runge-Kutta algorithm, but without consideration of either ion polarization or electron parallel acceleration. The electric field is considered to be zero before the first timestep. This particle propagation process demands most of the CPU-time during a run, effectively limiting the number of particles a single MPI process may treat in order to achieve reasonable calculation times. Using current processor technology a single processor may well handle around 10^6 particles in reasonable time. This amount of particle coordinates presents no problem regarding memory consumption. An optional collision operator calculates the effect of binary collisions on particle positions. Particle collisions produce a relevant effect on gyrocenter positions, contributing to heat transport.

Once all particles have been partially propagated, without the consideration of either ion polarization drift or electron parallel acceleration, the electric field is calculated from the estimated final positions of the particles. This estimation is made consistently with the electrostatic field under calculation. The electrostatic potential is calculated by solving the gyrokinetic Poisson equation on a 3D grid. Cell ordering in the matrix is done along the poloidal, radial and toroidal directions. The grid size is directly related to the global memory requirement, so available memory effectively limits the grid size that can be used. The refinement in the toroidal direction is lower than in the other directions since the cells are field-oriented, following the particle streaming direction. Particle charge density is projected onto the grid to calculate the source term for the Poisson equation.

The potential to be solved will influence the final positions of the particles, and therefore also the densities on the RHS of the equation. This dependence is partially linearized in order to move some of these terms to the left-hand side. Effectively (by neglecting the effect of the Laplace operator in the equation because of the quasineutrality condition) the advanced E-field is calculated so that the plasma final state is electrically neutral. The construction of the density influence (DI) matrix, representing the influence of the advanced E-field on the cell densities, is a CPU-time and memory demanding process. The matrix is constructed from every particle, whose position can in principle be considered random inside the calculation system. The ratio of particles to cells is directly related to the numerical noise of the PIC algorithm, and is kept at values around 2000. This means that the matrix elements are filled in random order and with many contributions to the same element. Even particles stored and handled by different MPI processes may contribute to the same matrix element. These contributions have to be interchanged somehow, and the matrix has to be split in order to let a parallel algebra package invert it.

The resulting matrix is sparse since it represents the effect of cell-averaged potential values on the cell-averaged charge density. Charge density depends on final particle positions, which are influenced by both ion polarization drift and electron acceleration. Both displacements have a spatially limited dependence on the E-field, that is, the charge density has a spatially limited dependence on potential values. This locality has a characteristic length of several local ion gyration (Larmor) radii.
Since the numerical implementation of the gyrokinetic theory easily becomes impractical on present-day computers for cell sizes below the local Larmor radius, we can estimate the Φ ↦ n interaction locality as extending over several (∼10) cells around a given cell in the poloidal plane, and one cell in the toroidal direction. Due to the cell ordering, this structure produces a multidiagonal sparse matrix, where the diagonals correspond to non-zero interactions with neighboring poloidal, radial and toroidal cells. Boundary conditions can be set to either a fixed potential or a fixed electric field.

3 Construction of the density influence matrix using a sparse matrix

Constructing the DI matrix is not a straightforward task because particles are divided between MPI processes in such a way that particles belonging to a single process can be located anywhere in the toroidal geometry and thus affect any element in the matrix. The parallel inversion algorithm used, on the other hand,
requires that each MPI process takes care of a local part of the matrix, that is, a certain number of rows. This means that every MPI process has to gather information on all particles affecting the process' local rows in the matrix from all the other processes in the system.

The matrix construction is accomplished using a temporary contribution matrix C in which every MPI process stores the contributions from its particles. The matrix is three-dimensional: the first dimension represents a grid point in the system, the second the position affecting the grid point, and the third the MPI process that handles these rows in the density influence matrix. As explained in chapter 2, contributions for each grid point will only come from surrounding positions, and therefore the second dimension has been optimized not to include all grid points in the system but only a surrounding box of size ∼500 containing the cells which may possibly contribute. The dimensions of the C matrix are thus N×B×P, where N is the number of process grid points, B is the box size used and P is the number of MPI processes. Even after this optimization the matrix is sparse, with about 80% of the elements zero in a typical simulation.

The matrix is filled while calculating particle propagation, and when all particles' new destinations have been calculated, the matrix is split into several two-dimensional matrices. The two-dimensional matrices are then sent to their respective destinations (except the one that is local) so that every MPI process has complete information on its local part of the matrix (see figure 1). Since the matrix construction is additive from particle data, the original algorithm collects and splits matrix elements with a single call to MPI_REDUCE_SCATTER. Each MPI process receives P−1 contribution matrices (when using P MPI processes), and all of these are summed together with the MPI process' own, local contribution matrix to form the final contribution matrix for that MPI process. The data in the matrix are then processed and the local part of the DI matrix is formed.

Traditionally the temporary contribution matrix has been represented as a dense matrix even though the final matrix will always be sparse (even after the reduction of the second dimension). This contribution matrix is referred to later on as the original version. The dense matrix is very fast and easy to use for this task; however, it does not take into account that the matrix is sparse, and it does not scale well. The memory requirements of the matrix grow with the number of grid points, independent of the number of MPI processes used in the simulation. This effectively limits the largest grid that can be used in the simulation.
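As an illustration, a minimal C sketch of such a collective exchange, assuming the contribution matrix is stored contiguously as one N·B block per destination process (names and layout are illustrative, not taken from the ELMFIRE sources):

    /* Sketch of the original collective exchange.  The send buffer C holds
     * P consecutive blocks of N*B doubles, one block per destination
     * process; every process receives the element-wise sum of its own
     * block over all processes. */
    #include <mpi.h>
    #include <stdlib.h>

    void exchange_contributions(double *C, double *C_local,
                                int N, int B, MPI_Comm comm)
    {
        int P;
        MPI_Comm_size(comm, &P);

        int *recvcounts = malloc(P * sizeof(int));
        for (int p = 0; p < P; p++)
            recvcounts[p] = N * B;          /* same block size for everyone */

        /* sum corresponding blocks from all processes and scatter the
         * results, one block per process, in a single collective call */
        MPI_Reduce_scatter(C, C_local, recvcounts, MPI_DOUBLE, MPI_SUM, comm);

        free(recvcounts);
    }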

Figure 1: The temporary contribution matrix C is constructed in each MPI process from its own particles. When all data for a part of the toroidal geometry has been gathered, the corresponding part of the contribution matrix in each MPI process is sent to its destination. The destination MPI process sums all acquired matrices and, based on that data, generates the local part of the density influence matrix DI.

The obvious solution for a problem where a sparse matrix is represented using a dense matrix is to use a sparse matrix representation instead. The sparse matrix representations (AIJ type, block diagonal type) included in PETSc[4] (used for parallel matrix inversion in ELMFIRE) are, however, extremely inefficient when adding a contribution to an already existing element in the matrix[3]. As all elements in the matrix are built up by adding small contributions during the gyrokinetic motion of the particles, the performance when using one of these sparse matrix representations is poor: in some cases it was up to 10 times worse than with a dense matrix.

As a dense contribution matrix is not scalable (its memory requirement grows with the number of grid points) and a standard sparse matrix representation performs poorly, a new data structure for the contributions is required. The data structure should be scalable, should not use unnecessary memory, and the addition of a contribution to a (possibly) existing element should perform well. A structure that meets these requirements well is an AVL tree.

3.1 AVL tree

An AVL tree[5] is a variant of a binary search tree. The main property of a binary search tree is that every node has a unique identifier (key) and 0–2 children. A child to the right of its parent has a larger key value than its parent, while a left child has a smaller key value. In addition to these properties, an AVL tree is kept balanced: the heights of the two child subtrees of any node differ by at most one (height is defined as the length of the path to the leaf furthest away). Figure 2 shows an example of an AVL tree.

Figure 2: Example of an AVL tree

3.2 Sparse matrix as an AVL tree

To represent a sparse matrix as an AVL tree, the matrix indices have to be recalculated to a single key value. This can be done for instance using equation (2), where colmatrix refers to the number of columns in matrix C and x, y (zero-based) are the indices in the two-dimensional matrix associated with an MPI process.

key = x + y ∗ colmatrix    (2)

The AVL tree can now be built up by creating nodes with key as the node identifier and associating the matrix value with each node. For instance the matrix in figure 3 can be represented by the AVL tree in the same figure. When performing an insert in the tree we need to traverse the tree top down to see if the key already exists. This is an O(log N) operation where N is the number of elements in the tree. If the key is found the value can simply be added to the node, otherwise a new node will be created and added to the tree, and finally the tree will possibly be rebalanced. Rebalancing is a simple rotation operation performed on some of the node’s parent nodes and is not a costly operation (O(log N)).
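A minimal C sketch of the key mapping of eq. (2) and the accumulate-or-insert logic described above; node layout and names are illustrative, and the AVL rebalancing step is omitted for brevity:

    #include <stdlib.h>

    typedef struct node {
        long   key;                  /* key = x + y * colmatrix           */
        double value;                /* accumulated matrix contribution   */
        struct node *left, *right;
        int    balance;              /* subtree height difference, -1..1  */
    } node_t;

    long make_key(long x, long y, long colmatrix)
    {
        return x + y * colmatrix;                        /* eq. (2)       */
    }

    /* Add 'contrib' to the node with the given key, creating the node
     * if it does not exist.  Returns the (possibly new) subtree root. */
    node_t *avl_add(node_t *root, long key, double contrib)
    {
        if (root == NULL) {                              /* key not found */
            node_t *n = calloc(1, sizeof(node_t));
            n->key   = key;
            n->value = contrib;
            return n;
        }
        if (key == root->key)
            root->value += contrib;       /* existing element: accumulate */
        else if (key < root->key)
            root->left  = avl_add(root->left,  key, contrib);
        else
            root->right = avl_add(root->right, key, contrib);
        /* a full AVL implementation updates 'balance' and rotates here */
        return root;
    }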

Figure 3: Example sparse 6x6 matrix and a corresponding AVL tree representation

In addition to the key and the value, pointers to its left and right children must be stored for each node in the tree. As an AVL tree is balanced, the balance of each node (-1, 0 or 1 depending on the subtree heights) must also be stored. This adds up to a long value (key), a double value (value), two pointers (left, right) and an integer (balance). One node would thus take 36 bytes on a standard 64-bit architecture. Compared to the 8 bytes a double value requires in a normal matrix this might seem like a lot. The main difference, however, is that the AVL tree is a dynamic structure and can use the amount of memory available. If the memory is full it is possible to send the current tree to the MPI process it is destined for, clear the tree and start building a new tree. The recipient simply adds all the contributions it receives to its local matrix.

A straightforward implementation with a single structure for each node would, as stated, require at least 36 bytes of memory per node (in practice probably 40 bytes, depending on memory alignment in dynamic memory allocation). Memory usage can be improved by implementing the AVL tree using arrays and splitting the structure into separate smaller structures. Creating the node structure as an array means that the left and right child node pointers can be implemented as integers (4 bytes) referencing indices instead of memory pointers (8 bytes). This saves a total of 8 bytes of memory per node on a 64-bit architecture. By moving the key and value data into a separate array, the data that needs to be sent to other MPI processes is separated from the rest of the data and thus easier to handle. By finally implementing the balance factors as character values in a separate array, the result is the structure in figure 4. This structure uses 25 bytes of memory per node and is more dynamic than a dense matrix because the matrix indices are represented by the explicit key value associated with the matrix value, in contrast to a dense matrix where the indices are implicitly defined.

Figure 4: Data structures used for storing AVL nodes
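A minimal C sketch of the array-based layout of figure 4, with illustrative names; the key/value pairs are kept in their own array so that exactly that block can be shipped to another MPI process:

    #include <stdlib.h>

    typedef struct {        /* sent over MPI: 16 bytes per node          */
        long   key;
        double value;
    } kv_t;

    typedef struct {        /* kept local: 8 bytes per node              */
        int left;           /* index of left child, -1 if none           */
        int right;          /* index of right child, -1 if none          */
    } child_t;

    typedef struct {
        kv_t    *kv;        /* key/value data                            */
        child_t *child;     /* child indices                             */
        char    *balance;   /* -1, 0 or +1: 1 byte per node              */
        int      used;      /* number of nodes currently in use          */
        int      capacity;  /* when reached, the kv block is sent        */
    } avl_arrays_t;         /* total: 25 bytes per node                  */

    avl_arrays_t *avl_arrays_alloc(int capacity)
    {
        avl_arrays_t *t = malloc(sizeof *t);
        t->kv       = malloc(capacity * sizeof(kv_t));
        t->child    = malloc(capacity * sizeof(child_t));
        t->balance  = malloc(capacity * sizeof(char));
        t->used     = 0;
        t->capacity = capacity;
        return t;
    }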

3.3 Improving performance

Compared to a standard matrix, the AVL tree implementation is quite slow to access an element, as the tree has to be traversed on every insert. The search can be sped up by splitting the large AVL tree into many smaller AVL trees and adding a hash table F as a front for the trees (see figure 5). Instead of directly traversing one large tree, the correct smaller tree is first determined by a hashing function such as eq. (3). The selected, smaller tree is then traversed normally. By selecting the hash table length to be a prime number, the data will be quite evenly scattered among the trees.

Figure 5: Hash table combined with AVL trees

h(k) = k mod size    (3)

Using one tree with N nodes, log2(N) comparisons have to be performed to find a leaf node in the worst case. When the tree is split into H smaller trees using the hash table F, the smaller trees contain on average N/H elements. In order to find a leaf node we now do only one lookup in the hash table and log2(N/H) comparisons instead of log2(N). This is a significant speed-up if the original single AVL tree is large enough. For instance, with a tree size of N = 1000000 and a hash table size of H = 10000 we have log2(1000000) ≈ 20 comparisons in the single-tree case, but only log2(100) ≈ 7 comparisons and one hash table lookup in the latter case. The hash table F requires only a small amount of memory. It can be implemented as an array of integers pointing to the roots of the AVL trees in the data arrays. This way only 4·H bytes are required, and as H ≪ N the increase in memory usage is not significant.
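A minimal C sketch of the hash front F over many small trees (eq. 3), reusing the illustrative node_t and avl_add() from the sketch in section 3.2; the table size is an arbitrary prime chosen for the example:

    #define HASH_SIZE 10007

    static node_t *roots[HASH_SIZE];     /* one small AVL tree per slot  */

    static long hash(long key)
    {
        return key % HASH_SIZE;          /* h(k) = k mod size            */
    }

    /* route a contribution to the correct small tree, then insert there */
    void sparse_add(long key, double contrib)
    {
        long h = hash(key);
        roots[h] = avl_add(roots[h], key, contrib);
    }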

3.4 Lookup table

In order to further increase the performance of the search tree, a second hash table L can be added. This table is used as a lookup table when updating elements in the tree. Elements in the lookup table consist of a key value (identifying the element) and an integer that refers to a position in the key/value data array. Every time a new node is added to a tree, a hash is generated from the node key value using eq. (3). The key value of the new node and the position of the node in the data array are stored in the lookup table element at the position given by eq. (3).

Figure 6: Lookup table with only one tree and a lookup table of size 7.

When a search for a node is performed, the second hash is generated and the element at the corresponding position in the lookup table is fetched. If this element's key value is the same as that of the node we are looking for, there is no need to perform the actual search. Instead the node can be directly accessed at the position stored in the lookup table element and updated without a search. A key may not be found in the lookup table either because another node has been stored at the same position in the lookup table or because the key does not exist in the tree at all. In these cases a normal search has to be performed to find the node or to determine where it should be added. The function of the lookup table is illustrated in figure 6.
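A minimal C sketch of the lookup-table fast path, reusing the illustrative kv_t and HASH_SIZE from the earlier sketches; names are ours, not ELMFIRE's:

    /* Each slot caches the key most recently stored there and its position
     * in the key/value array; all slots must be initialised with pos = -1. */
    typedef struct {
        long key;          /* key cached in this slot                    */
        int  pos;          /* index into the key/value array, -1 = empty */
    } lookup_t;

    static lookup_t L[HASH_SIZE];

    /* Try to update an existing element without searching the tree.
     * Returns 1 on a hit, 0 if a normal tree search is still needed. */
    int lookup_add(kv_t *kv, long key, double contrib)
    {
        lookup_t *slot = &L[key % HASH_SIZE];
        if (slot->pos >= 0 && slot->key == key) {
            kv[slot->pos].value += contrib;    /* direct update, no search  */
            return 1;
        }
        return 0;                              /* miss: fall back to search */
    }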

4 Alternatives to AVL trees

It might be argued that an AVL tree is not the best possible data structure for the sparse matrix. The main concerns with AVL trees are the extra memory needed in each node for the pointers to the child nodes, and the time it takes to search for a node when it is not found in the lookup table. By retaining the rest of the data structure (lookup table L and hash table F) and changing the AVL trees to a different data structure, some of these issues can be overcome.

4.1 Short arrays

One way to overcome the issue with two pointers to child nodes is to replace the AVL trees with ordinary, short arrays[3]. Each position in the hash table F is a link to an initially empty array, as shown in figure 7. Each array keeps track of how many elements it currently contains and its maximum size. Adding an element to this structure is very fast, as the element is simply added at the end of the array. If the array becomes full, it can either be reallocated to a larger size so that the new element can be added, or the structure can be considered "full" and communicated to its destination MPI process. Which one is more efficient largely depends on the case at hand: if we have randomly positioned elements that are evenly distributed among the hash table elements (as is the case in ELMFIRE), it can safely be assumed that when one array is full the others are also almost full, and everything can be communicated. If, on the other hand, the elements are distributed unevenly among the hash table elements, it would be more favorable to just reallocate the one array that is full. It should be kept in mind, though, that reallocation is an expensive operation.
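A minimal C sketch of the short-array variant with reallocation, again reusing the illustrative kv_t and HASH_SIZE from the earlier sketches; the growth policy is ours:

    #include <stdlib.h>

    typedef struct {
        kv_t *elems;       /* key/value pairs in insertion order         */
        int   used;
        int   capacity;
    } short_array_t;

    static short_array_t F[HASH_SIZE];   /* one short array per hash slot */

    /* Accumulate if the key is already present, otherwise append at the
     * end, growing the array when it is full (alternatively the whole
     * structure could be flushed to its destination MPI process here). */
    void short_array_add(long key, double contrib)
    {
        short_array_t *a = &F[key % HASH_SIZE];

        for (int i = 0; i < a->used; i++)      /* short sequential search */
            if (a->elems[i].key == key) {
                a->elems[i].value += contrib;
                return;
            }
        if (a->used == a->capacity) {
            a->capacity = a->capacity ? 2 * a->capacity : 8;
            a->elems = realloc(a->elems, a->capacity * sizeof(kv_t));
        }
        a->elems[a->used].key   = key;
        a->elems[a->used].value = contrib;
        a->used++;
    }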

Figure 7: Hash table with uneven short arrays

Searching for an element in an array is done using an ordinary, sequential search. This is efficient because the arrays are kept short, so there are only a few elements to compare. Also, because an array is always allocated as a block and is thus contiguous in memory, the search has a very good cache hit rate: once loaded into the cache, the whole array can be read from the cache.

The drawback with this implementation is mainly how the structure is allocated. When using AVL trees it is possible to allocate a contiguous array and use it to 100%: a new element is added at the first free position and a pointer from the parent node is set. When the whole array is full it can be sent directly without processing. This is not the case with short arrays if we use reallocation. As the short arrays are dynamically allocated they can reside at any position in memory, and thus a send buffer has to be allocated and data from all the arrays copied to it before the data can be sent.

One way to overcome this problem, if the data is random enough, is not to use reallocation but instead to allocate the same contiguous memory array as in the AVL case. This contiguous area is then divided into arrays which are assigned to the hash table elements (see figure 8). Assuming the data is random enough, it is highly likely that almost all elements in the memory area are in use when one array is full. The whole contiguous array can then be sent to the corresponding MPI process, including possible empty elements. This does not require any preprocessing and has only a small penalty of sending some empty elements (which are ignored by the recipient MPI process, as the value of these elements is 0).

Figure 8: Short arrays as a contiguous memory block

4.2 A single large array

Another variant of this data structure keeps the lookup table L but, instead of allocating a hash table F and small arrays for each element in the table, simply allocates one huge array. This array contains all elements that are added to the structure. No pointers or other information is stored, only the key and the value. When an element cannot be found in the lookup table, it is added after the last element in the array without performing any search operation. This means that an element with a certain key value can exist at multiple positions in the array. This is not a problem in ELMFIRE, though, as all contributions with the same key are added together in the receiver.

This variant relies heavily on the fact that most elements are found in the lookup table. If no elements are found in the lookup table, this structure basically acts as a queue, storing and sending all elements one by one. This version has proven quite efficient in ELMFIRE, as a majority of all elements are found in the lookup table due to temporal locality.

The advantages of this structure are that it is very fast and compact. No search operation has to be performed at any point, and no memory is used for pointers, balances or anything else besides actual data. As long as data can be found in the lookup table it performs very well overall. A disadvantage of this structure is that it will probably add elements with the same key to multiple positions in the array. This results in the structure being filled faster, and more elements are sent over the network. With suitable data and a fast network connection between MPI processes this may not be a problem.
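A minimal C sketch of the single-array variant, reusing the illustrative kv_t, lookup_t, L and HASH_SIZE from the earlier sketches; names and the flush policy are ours:

    typedef struct {
        kv_t *data;        /* all contributions, in insertion order      */
        int   used;
        int   capacity;    /* when reached, the array is sent and reset  */
    } big_array_t;

    /* Returns 1 when the structure has become full and should be sent
     * to its destination MPI process (L must then be cleared as well). */
    int big_array_add(big_array_t *b, long key, double contrib)
    {
        lookup_t *slot = &L[key % HASH_SIZE];

        if (slot->pos >= 0 && slot->key == key) {
            b->data[slot->pos].value += contrib;   /* lookup-table hit     */
        } else {
            b->data[b->used].key   = key;          /* miss: append, no search */
            b->data[b->used].value = contrib;
            slot->key = key;                       /* remember new entry   */
            slot->pos = b->used;
            b->used++;
        }
        return b->used == b->capacity;
    }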

4.3 Implementation considerations

Implementing a substitute for a normal, dense matrix in a way that performs well is a challenge. Inserts/additions are performed tens of millions of times in a single time step of a typical ELMFIRE run, which means that the insertion path should execute an absolute minimum of code to be efficient. One thing that needs to be considered is the hashing function. By using a lookup table L and a hash table F of the same size, the hash value has to be computed only once. This might seem like a very minor issue, but in ELMFIRE the performance is about 2–5% poorer when using tables of different sizes.
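A minimal C sketch of this point; lookup_add_at() and tree_add_at() are hypothetical helpers standing in for the lookup-table fast path and the tree insert of the earlier sketches:

    int  lookup_add_at(long h, long key, double contrib);  /* fast path  */
    void tree_add_at(long h, long key, double contrib);    /* slow path  */

    void add_contribution(long key, double contrib)
    {
        long h = key % HASH_SIZE;            /* computed once, used twice */

        if (!lookup_add_at(h, key, contrib)) /* try L[h] first            */
            tree_add_at(h, key, contrib);    /* otherwise insert via F[h] */
    }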

5 Performance

The performance of the different matrix representations was evaluated by running five different ELMFIRE test cases using 32, 64, 128 and 256 MPI processes. All the tests were run on a Cray XT4 with one dual-core 2.6 GHz AMD Opteron 64-bit processor and 2 GB of memory on each node. Only one of the two cores on each node was used for these tests in order to be able to take advantage of the full 2 GB of memory for each MPI process. The interconnect in the XT4 is the Cray SeaStar2, which connects the nodes in a three-dimensional torus. Each link between the nodes has a peak bandwidth of 7.6 GB/s.

The tests clearly showed the difference between the original (dense) matrix and the new dynamic structures. Figure 9 shows the memory usage for each MPI process when running the simulation with different numbers of grid points. The curves start out almost identical for a small number of grid points, but the difference becomes significant when moving towards more grid points. At 384 000 grid points the difference is already roughly 1500 MB vs. 700 MB, and a doubling of the grid points results in a problem too large to solve using the original structure (when each MPI process has 2 GB of memory). The dynamic structure, however, has no problem with this grid size and requires only around 1000 MB of memory per MPI process. Figure 9 also shows that the AVL tree variant requires a little more memory than the variant with arrays. This is expected, as an individual AVL element is larger than an array element, mainly because the AVL element contains two pointers (to its left and right child nodes). The variant with only one array has been omitted for clarity; its memory usage at the scale of the graph is the same as for the variant with many arrays.

The amount of memory used for the dynamic structures can of course be freely chosen, as the structures will be filled and then sent to their destination, freeing up the memory for more data. However, failing to increase the size of the structure when scaling up the problem will result in worse performance. Figure 10 shows the effect the size of the structure has on the run time of the simulation. A larger structure size gives better performance, with the difference being up to 15% between structure sizes of 800K and 25600K. It has to be noted that the tests were run on an XT4 with a very fast network; running on a cluster with e.g. a 1 Gb/s interconnect would make the difference much larger. A good rule of thumb is to use as much memory for the data structure as is available.

5.1 Scaling to larger problems

The biggest problem with the original (dense) matrix is that we cannot use more MPI processes to decrease the memory requirement of the individual MPI process. The memory requirement of the matrix depends only on the number of grid points and is completely independent of the number of MPI processes, as illustrated by figure 11. When using a dynamic structure, however, the situation is completely different.


Figure 9: Memory usage for different matrix structures with increasing grid (problem) size. The "one array" variant has been omitted from this and other graphs for clarity. The run time and memory usage for the "one array" variant are almost identical to those of the "arrays" variant.

Figure 10: The effect of different structure sizes on run time.

Memory requirements for each MPI process are roughly halved when the number of MPI processes is doubled, showing an almost perfect scaling by a factor of two. This makes it possible to simulate much larger problems, assuming a large number of processors is available. As noted before, the memory usage of the dynamic structure can be freely chosen; here it was chosen so that the run time of the simulation scales roughly as well as when using the static matrix. Figure 12 shows the run times for the same test runs that were used for the memory tests in figure 11.

Figure 12 shows that the scaling factor for the dynamic structures is almost the same as for the static matrix. There is, however, an offset between the run times for the static and dynamic structures, as insertion into the static matrix is faster than insertion into the dynamic one. This offset is not very large, and quite insignificant if the goal is to simulate a problem so large that it cannot be handled with the static matrix structure. Figure 12 also shows that the run times for the AVL trees and the arrays are roughly the same. When using only 32 MPI processes the AVL tree performs slightly better, but already with 64 MPI processes the run time is the same, and when going to 256 MPI processes the array variant is faster than the AVL trees and closer to the original matrix. The performance of the structure with only one large array is almost the same as for the structure with many arrays and has been omitted from the figure for clarity.


Figure 11: Memory usage for the matrix structure with varying numbers of MPI processes. It can be seen that the original matrix does not scale at all, while the dynamic structures scale very well and make it possible to use a larger number of processors to solve large problems.

Figure 12: The wall clock time needed for a simulation using the dynamic structure is not much worse than when using a normal matrix. The timing results are from the same simulation run as in figure 11.

6 Conclusions and further development

By replacing the original dense matrix with a sparse matrix it is possible to simulate much larger problems with ELMFIRE than before. The previous limiting factor, the memory usage of the density influence matrix, is no longer limiting in any way (as shown in figure 11), since the sparse matrix can be configured to use as much memory as is available. The sparse matrix is naturally a bit slower than the dense matrix (as shown in figure 12), but the difference is not significant, and the memory gain in larger problems outweighs the decreased performance.

There are, however, still problematic data structures in ELMFIRE, mainly ones that depend on the grid density used. As every MPI process has particles that may reside in any part of the toroidal geometry, every MPI process also has to hold information for the full geometry on the electrostatic potential and other variables that determine how particles propagate in time. These structures, which represent the whole system, cause problems when more grid points are added: their memory usage grows linearly in each MPI process, independent of how many MPI processes are used for the simulation.


Acknowledgements

The facilities of the Finnish IT Center for Science (CSC) were used for this work. This work received support from the European Commission.

References

[1] Heikkinen, J.A.; Henriksson, S.; Janhunen, S.; Kiviniemi, T.P.; Ogando, F., Gyrokinetic simulation of particle and heat transport in the presence of wide orbits and strong profile variations in the edge plasma. Contrib. Plasma Phys. 46/7-9 (2006) 490-495.

[2] William Gropp, Ewing Lusk and Anthony Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.

[3] Mats Aspnäs, Artur Signell, Jan Westerholm, Efficient assembly of sparse matrices using hashing. To be published in Lecture Notes in Computer Science, Proceedings of PARA'06, June 18–21, 2006, Umeå, Sweden.

[4] Satish Balay, Kris Buschelman, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Lois Curfman McInnes, Barry F. Smith and Hong Zhang, PETSc Users Manual. ANL-95/11 - Revision 2.1.5, Argonne National Laboratory, 2004.

[5] G.M. Adelson-Velskii and E.M. Landis, An Algorithm for the Organization of Information. Doklady Akademii Nauk SSSR, Vol. 146, 263-266, 1962; Soviet Math. Doklady, 3:1259-1263, 1962.

