Parallel Implementation of a Lattice Boltzmann Algorithm for the Electrostatic Plasma Turbulence

Giuliana Fogaccia*

Associazione EURATOM-ENEA sulla Fusione, C.R.E. Frascati, C.P. 65, 00044 Frascati, Roma, Italy
Abstract. A parallel version of a Lattice Boltzmann Equation algorithm, which simulates the electrostatic plasma turbulence, has been developed using the High Performance Fortran language. The algorithm evolves a system of particle populations on a discrete lattice, and the data-parallel implementation has been performed by a regular domain decomposition. The system evolution requires both completely local operations and non-local ones, the latter involving communication between processors. The communication phase has been minimized using local HPF procedures. Efficiency tests of the parallel code, performed on a 9076 IBM SP2 parallel computer, have given satisfactory results.
1 Lattice Boltzmann Algorithm for the Electrostatic Plasma Turbulence

A Lattice Boltzmann algorithm was developed to simulate the electrostatic turbulence in a magnetized plasma, within the framework of the two-fluid model [1,2]. The ion continuity, momentum and pressure equations are reproduced, with the electrons being described by the adiabatic response [3,4]. The Lattice Boltzmann Equation (LBE) method is a particularly promising numerical method to perform high-resolution fluid-dynamics simulations [5]. The macroscopic dynamics is simulated starting from a system of particle populations $\{N_j(\mathbf{r},t) \in [0,1] \subset \mathbb{R},\ j=1,\dots,b\}$, moving on a discrete lattice with velocities $\{c_{j\alpha},\ j=1,\dots,b,\ \alpha=1,\dots,D\}$, with $\mathbf{r}$ being the position vector, $b$ the number of neighboring sites, $\alpha$ the coordinate index and $D$ the lattice dimension. Particles which arrive at the same site undergo a collision, with the collision operator chosen in such a way as to conserve locally the particle number and the total momentum. After the collision phase, each particle propagates toward one of the neighboring sites along the direction of its velocity. The LBE describes the evolution of such a particle system, assuming that, under the effect of collisions, the system relaxes toward a local-equilibrium distribution function $N_j^{eq}(\mathbf{r},t)$, and linearizing the collision operator around such a distribution function.
* E-mail: [email protected]; Tel: +39-6-9400.5351; Fax: +39-6-9400.5735
Each ion fluid moment (mass density, momentum density and pressure) is expressed as a linear combination of particle populations. The macroscopic equations are obtained by a multiscale expansion of the LBE [5], similar to the well known Chapman-Enskog expansion of the plasma kinetic equation [6]. In order to perform two-dimensional electrostatic-turbulence simulations, a regular three-dimensional lattice, the so-called pseudo-four-dimensional model, must be used [7]. The number of links, in this case, is $b=18$ and $\alpha$ is equal to $x$, $y$ or $z$. Each quantity is assumed to depend only on the first two coordinates, associated with the two spatial variables, and to be constant along the third (unphysical) direction. Then, the particle motion occurs on the $(x,y)$ plane and $\mathbf{r}=(r_x,r_y)$. In order to reproduce the two-dimensional ion fluid equations, with the electrons being described by the adiabatic response, the $N_j$ populations must evolve in time according to the following propagation and collision rules [1]:

$$
N_j(\mathbf{r}+\Delta r\,\mathbf{c}_j,\, t+\Delta t) = N_j(\mathbf{r},t)
 + \sum_{k=1}^{18} A_{jk}\left[N_k(\mathbf{r},t)-N_k^{eq}(\mathbf{r},t)\right]
 + B_0(\mathbf{r})\sum_{k=1}^{18} B_{jk}\, N_k(\mathbf{r},t)
 - \frac{B_0^2(\mathbf{r})}{24}\sum_{k=1}^{18} s_k\,(c_{jx}c_{kx}+c_{jy}c_{ky})\,N_k(\mathbf{r},t)\,,
\tag{1}
$$

where $\mathbf{c}_j=(c_{jx},c_{jy})$, $s_j=2$ for $j=5,6,8,14,15,17$ and $s_j=1$ for the other values of $j$; $\Delta r\,|\mathbf{c}_j|$ and $\Delta t$ are the microscopic steps, physically related to the mean free path and to the time between two collisions, respectively, with $\Delta r = (2T_e^{eq}/m_i)^{1/2}\,\Delta t$, $T_e^{eq}$ being the equilibrium electron temperature and $m_i$ the ion mass. The second term on the r.h.s. of (1) represents the collision operator linearized about the local-equilibrium distribution function and reproduces, in the macroscopic dynamics, the effects related to the classical ion viscosity and thermal ion conductivity. The collision matrix $A_{jk}$ is an $18\times 18$ constant matrix, which satisfies appropriate symmetries in order to conserve locally both the particle number and the total momentum. The matrix coefficients can be written in terms of only two parameters, connected to the kinematic viscosity and the thermal conductivity and given as input parameters [5]. Taking the magnetic field $\mathbf{B}(\mathbf{r})=B(\mathbf{r})\hat{\mathbf{b}}$ along the $z$ axis and $B_0(\mathbf{r}) = \Omega_i(\mathbf{r})\,\Delta t$, with $\Omega_i(\mathbf{r})=eB(\mathbf{r})/m_i$ being the ion Larmor frequency and $e$ the ion charge, the magnetic-field matrix takes the form

$$
B_{jk} = \frac{s_k}{12}\,(c_{jx}c_{ky}-c_{jy}c_{kx})\,.
$$

$N_k^{eq}(\mathbf{r},t)$ is the value of the population $N_k$ at the local thermodynamic equilibrium and must be written as a suitable combination of fluid moments to reproduce the macroscopic dynamics in the multiscale expansion of (1) [8]. In particular, the following expression yields the electrostatic-turbulence macroscopic equations [1]:

$$
N_k^{eq}(\mathbf{r},t) = \frac{\rho(\mathbf{r},t)}{24}\left\{1 + 2\, v_\alpha(\mathbf{r},t)\, c_{k\alpha}
 + 3\left[Q_{k\alpha\beta}\, v_\alpha(\mathbf{r},t)\, v_\beta(\mathbf{r},t)
 - 0.5\,(-1)^{s_k}\, v_\alpha(\mathbf{r},t)\, v_\alpha(\mathbf{r},t)\right]\right\}
 - \frac{1}{4}\, p(\mathbf{r},t)\,(-1)^{s_k} + \frac{1}{8}\, Q_{k\alpha\beta}\,\Pi_{\alpha\beta}(\mathbf{r},t)\,,
\tag{2}
$$
where the repeated indices are summed, $\alpha,\beta = x,y,z$, $Q_{k\alpha\beta} \equiv c_{k\alpha}c_{k\beta} - \frac{1}{2}\delta_{\alpha\beta}$, $\delta_{\alpha\beta}$ is the Kronecker delta and $\Pi_{\alpha\beta}$ is the ion finite-Larmor-radius tensor [9], expressed in lattice units as

$$
\Pi_{\alpha\beta}(\mathbf{r},t) = -\frac{p(\mathbf{r},t)}{2 B_0(\mathbf{r})}
\left\{ \partial_\alpha \left\langle \mathbf{v}(\mathbf{r}) \wedge \hat{\mathbf{b}}\right\rangle_\beta
 - \left\langle\hat{\mathbf{b}} \wedge \nabla\right\rangle_\alpha v_\beta(\mathbf{r}) \right\}\,.
$$

The mass density $\rho$, the momentum density $\mathbf{J} = \rho\,\mathbf{v}$ and the thermal pressure $p$ are expressed in terms of linear combinations of the microscopic variables $N_j$:

$$
\rho(\mathbf{r},t) = \sum_{j=1}^{18} s_j\, N_j(\mathbf{r},t)\,,
\tag{3}
$$
$$
J_\alpha(\mathbf{r},t) \equiv \rho(\mathbf{r},t)\, v_\alpha(\mathbf{r},t) = \sum_{j=1}^{18} s_j\, c_{j\alpha}\, N_j(\mathbf{r},t)\,,
\tag{4}
$$

$$
p(\mathbf{r},t) \equiv \rho(\mathbf{r},t)\, v_z(\mathbf{r},t) = \sum_{j=1}^{18} s_j\, c_{jz}\, N_j(\mathbf{r},t)\,.
\tag{5}
$$
The last term in (1) has been introduced to compensate an unphysical destabilizing contribution, coming, in the multiscale expansion of (1), from the magnetic-field operator $B_{jk}$ and yielding a numerical instability at long wavelength [1]. Also the $\Pi_{\alpha\beta}$ tensor can give rise to a numerical instability, characterized by a wavelength of the order of the lattice spacing. The spatial derivatives in the $\Pi_{\alpha\beta}$ expression are evaluated by central finite-difference formulas; in order to eliminate such a numerical instability, a smoothing of the velocity fields used to evaluate $\Pi_{\alpha\beta}$ has been performed over some lattice spacings [1].
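For definiteness, the derivatives entering $\Pi_{\alpha\beta}$ are presumably evaluated with the standard second-order central stencils (written here for a generic lattice field $f$ at the site with row index $i$ and column index $j$, with $\Delta r$ the lattice spacing):

$$
\partial_x f\big|_{i,j} \simeq \frac{f_{i+1,j}-f_{i-1,j}}{2\,\Delta r}\,,\qquad
\partial_y f\big|_{i,j} \simeq \frac{f_{i,j+1}-f_{i,j-1}}{2\,\Delta r}\,.
$$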
2 Algorithm Scheme
The numerical evolution of the particle populations $\{N_j(\mathbf{r},t),\ j=1,\dots,18\}$, from time $t$ to time $t+\Delta t$, on a domain of $l \times m$ lattice spacings can be summarized in the following steps (a schematic serial sketch of one complete time step is given after the list):
- Computation of Macroscopic Fields
The values of the macroscopic fields at the time $t$ and on the lattice site $\mathbf{r}$ are calculated in terms of the particle populations $\{N_j(\mathbf{r},t),\ j=1,\dots,18\}$ from (3)-(5).
- Smoothing Phase

The velocity fields used to calculate the $\Pi_{\alpha\beta}$ expression are smoothed over a spatial region of $l_s \times m_s$ lattice spacings, with $l_s$ and $m_s$ much smaller than the maximum number of sites in the respective directions:

$$
v_x^{sm}(\mathbf{r},t) = \frac{1}{(2 l_s+1)(2 m_s+1)} \sum_{i=-l_s}^{l_s} \sum_{j=-m_s}^{m_s} v_x(\mathbf{r} + i\,\Delta r\,\hat{\mathbf{x}} + j\,\Delta r\,\hat{\mathbf{y}},\, t)\,,
$$

$$
v_y^{sm}(\mathbf{r},t) = \frac{1}{(2 l_s+1)(2 m_s+1)} \sum_{i=-l_s}^{l_s} \sum_{j=-m_s}^{m_s} v_y(\mathbf{r} + i\,\Delta r\,\hat{\mathbf{x}} + j\,\Delta r\,\hat{\mathbf{y}},\, t)\,,
$$

$$
\Pi_{\alpha\beta}(\mathbf{r},t) = -\frac{p(\mathbf{r},t)}{2 B_0(\mathbf{r})}
\left\{ \partial_\alpha \left\langle \mathbf{v}^{sm}(\mathbf{r}) \wedge \hat{\mathbf{b}}\right\rangle_\beta
 - \left\langle\hat{\mathbf{b}} \wedge \nabla\right\rangle_\alpha v_\beta^{sm}(\mathbf{r}) \right\}\,.
$$
- Collision Phase
On each lattice site $\mathbf{r}$, the particle distribution functions are modified by the effect of the collisions with the other particles:

$$
N_j\!\left(\mathbf{r},\, t+\frac{\Delta t}{2}\right) = N_j(\mathbf{r},t)
 + \sum_{k=1}^{18} A_{jk}\left[N_k(\mathbf{r},t)-N_k^{eq}(\mathbf{r},t)\right]
 + B_0(\mathbf{r})\sum_{k=1}^{18} B_{jk}\, N_k(\mathbf{r},t)
 - \frac{B_0^2(\mathbf{r})}{24}\sum_{k=1}^{18} s_k\,(c_{jx}c_{kx}+c_{jy}c_{ky})\,N_k(\mathbf{r},t)\,,
\tag{6}
$$
with $j=1,\dots,18$. The local-equilibrium distribution function $N_k^{eq}(\mathbf{r},t)$ is given by expression (2) and uses the macroscopic-field values computed in the first step of the algorithm, except for the $\Pi_{\alpha\beta}$ tensor, which is obtained using the smoothed velocity fields computed in the second step. Note that all the terms on the r.h.s. of (6) are completely local.
- Propagation Phase
In the following $\Delta t/2$ interval, each particle propagates along its velocity direction toward the next neighboring site. Then the particle population with velocity $\mathbf{c}_j$, which at the time $t+\Delta t/2$ is on the site $\mathbf{r}$, will propagate toward the site $\mathbf{r} + \mathbf{c}_j\,\Delta r$ at the time $t+\Delta t$:

$$
N_j(\mathbf{r} + \mathbf{c}_j\,\Delta r,\ t+\Delta t) = N_j\!\left(\mathbf{r},\ t+\frac{\Delta t}{2}\right), \qquad j=1,\dots,18\,.
\tag{7}
$$
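As a rough illustration of the scheme above, a minimal serial sketch of one complete time step is given below in Fortran 90. The array and matrix names (N, s, cx, cy, cz, A, Bmat, B0) are hypothetical, the equilibrium populations of Eq. (2) and the smoothing phase are only indicated by a placeholder, and the velocity components are assumed to be integer-valued in lattice units; this sketch is not the code discussed in the next section.

      ! Schematic serial sketch of one LBE time step (illustrative only):
      ! N(l,m,18) holds the populations; s, cx, cy, cz hold the weights and
      ! lattice velocities; A and Bmat are the collision and magnetic-field
      ! matrices of Eq. (1); B0 is the field B_0(r) on the lattice.
      subroutine lbe_step(N, s, cx, cy, cz, A, Bmat, B0, l, m)
        implicit none
        integer, intent(in) :: l, m
        real, intent(inout) :: N(l,m,18)
        real, intent(in)    :: s(18), cx(18), cy(18), cz(18)
        real, intent(in)    :: A(18,18), Bmat(18,18), B0(l,m)
        real :: Rho(l,m), Jx(l,m), Jy(l,m), P(l,m)
        real :: Neq(l,m,18), Ncoll(l,m,18)
        integer :: j, k

        ! 1) Macroscopic fields, Eqs. (3)-(5)
        Rho = 0.; Jx = 0.; Jy = 0.; P = 0.
        do j = 1, 18
          Rho = Rho + s(j)*N(:,:,j)
          Jx  = Jx  + s(j)*cx(j)*N(:,:,j)
          Jy  = Jy  + s(j)*cy(j)*N(:,:,j)
          P   = P   + s(j)*cz(j)*N(:,:,j)
        end do

        ! 2) Smoothing of Jx/Rho, Jy/Rho and evaluation of the equilibrium
        !    populations of Eq. (2) are omitted in this sketch.
        Neq = 0.

        ! 3) Collision phase, Eq. (6): completely local
        do j = 1, 18
          Ncoll(:,:,j) = N(:,:,j)
          do k = 1, 18
            Ncoll(:,:,j) = Ncoll(:,:,j)                                   &
                 + A(j,k)*(N(:,:,k) - Neq(:,:,k))                         &
                 + B0*Bmat(j,k)*N(:,:,k)                                  &
                 - B0**2/24.*s(k)*(cx(j)*cx(k) + cy(j)*cy(k))*N(:,:,k)
          end do
        end do

        ! 4) Propagation phase, Eq. (7): periodic shift along the velocity
        !    direction (the population that was at r - c_j*dr arrives at r)
        do j = 1, 18
          N(:,:,j) = cshift(cshift(Ncoll(:,:,j), -nint(cx(j)), 1),        &
                            -nint(cy(j)), 2)
        end do
      end subroutine lbe_step

In the HPF version described in the next section, steps 1 and 3 remain purely local to each sub-domain, while steps 2 and 4 are the ones that involve inter-processor communication.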
3 Parallelization Strategy

A parallel version of the LBE algorithm for the electrostatic plasma turbulence has been implemented using the High Performance Fortran (HPF) language [10-12]. The HPF language is a set of constructs and extensions to Fortran 90 [13], which allows one to express parallelism in a relatively simple manner and provides a code which is easily portable and machine independent. Of the two main styles of parallel programming (message passing and data parallel), the HPF language is based on the data-parallel paradigm. The idea behind the data-parallel programming paradigm is to focus on collective operations on arrays, whose elements are distributed over a number of processors. In the two-dimensional LBE simulations, the particle populations N01, N02, ..., N18 and the fluid moments Rho (mass density), Jx, Jy (x and y components of the momentum density, respectively) and P (pressure) are defined on a two-dimensional spatial lattice $l \times m$ and are expressed as two-dimensional arrays, which are updated at every time step, with row index i=1,...,l and column index j=1,...,m:
      REAL*4, DIMENSION(1:l,1:m) ::                        &
           N01,N02,N03,N04,N05,N06,N07,N08,N09,N10,        &
           N11,N12,N13,N14,N15,N16,N17,N18
      REAL*4, DIMENSION(1:l,1:m) :: Rho,Jx,Jy,P
Data-parallel implementation of the code is immediate by a regular domain decomposition. Each processor operates on the local data relative to its own sub-domain and communicates with the other processors when data belonging to different sub-domains are required. In the LBE algorithm, where local data must be communicated to the neighboring sites during the propagation and smoothing phases, a decomposition of the $l \times m$ lattice into contiguous blocks of dimension l x [m/n_procs] is the most convenient one, where [m/n_procs] is the integer upper bound of the division and n_procs the number of processors. An abstract processor set, procs(NUMBER_OF_PROCESSORS()), with rank one and dimension given by the system inquiry function NUMBER_OF_PROCESSORS(), which returns the number of processors available to the program, is specified by the PROCESSORS directive. The array block distribution among the processor set is given by the following DISTRIBUTE and ALIGN directives:

!HPF$ PROCESSORS procs(NUMBER_OF_PROCESSORS())
!HPF$ DISTRIBUTE (*,BLOCK) ONTO procs :: N01
!HPF$ ALIGN WITH N01 :: N02,N03,N04,N05,N06,N07,N08,N09,N10, &
!HPF$       N11,N12,N13,N14,N15,N16,N17,N18
!HPF$ ALIGN WITH N01 :: Rho,Jx,Jy,P
Since the default block size is [m/NUMBER_OF_PROCESSORS()], in order to have load balancing among the processors the number of processors must be a divisor of m. In the following, the HPF implementation of the different phases that constitute the LBE algorithm, described in Sect. 2, will be discussed. The phases relative to the computation of the macroscopic fields and to the collisions are completely local. Specifically, in the former, the fluid moments (density, momentum and pressure) at the time t and the (i,j) lattice site are obtained from the particle-population values at the same time and site; in the latter, the new particle-population values (at time $t+\Delta t/2$) at the (i,j) point are computed using the values of the particle populations and fields at time t and at the same site. For an array block distribution along the column index j, the HPF compiler distributes the loop over j among the processors, with j=1,m/NUMBER_OF_PROCESSORS() for the first processor, j=m/NUMBER_OF_PROCESSORS()+1,2*m/NUMBER_OF_PROCESSORS() for the second one, and so on. After the collisions, the particle populations propagate from the lattice site $\mathbf{r}$ toward the neighboring lattice sites along their velocity directions. Then, communication between processors must be established to calculate the propagated values of the particle populations on the lattice sites belonging to the first and last column of each sub-domain. The particle-population propagation can be easily expressed by means of the Fortran 90 array intrinsic procedure CSHIFT [13]:
      N01 = CSHIFT(CSHIFT(N01,-1,1),+1,2)
      ...
The CSHIFT(array,shift,dim) function provides a circular shift of all the elements of array, leaving its shape unchanged. The shift can be applied to any dimension of an array; a two-dimensional array is shifted along rows if dim=1 or along columns if dim=2, and the shift direction is to the left or upward if the shift is positive, and to the right or downward if the shift is negative. With the above HPF data-distribution directives, the HPF compiler parallelizes the propagation phase automatically and establishes the necessary communication between processors. In particular, before the particle populations propagate, each processor sends the data located in the extreme columns of its own sub-domain to the processors evolving the neighboring sub-domains and correspondingly receives data from the (topologically) adjacent processors. This communication phase is optimized by the compiler, which puts all the data to send (or receive) in a temporary buffer and sends (or receives) them, all at once, by a single send (or receive) call. The smoothing phase performs averages of the velocity fields, at first along the x coordinate (the row index) over l_s lattice spacings, and then along the y coordinate (the column index) over m_s lattice spacings. The HPF version of the average operation along the x coordinate has been obtained employing the CSHIFT function and the following declaration and distribution directives:

      REAL*4, DIMENSION(1:l,1:m) :: Vx,Vx_sm_x
!HPF$ DISTRIBUTE (*,BLOCK) ONTO procs :: Vx,Vx_sm_x

      Vx_sm_x(:,:) = 0
c     average along x (row index)
      do index_shift = 1,l_s
         Vx_sm_x = Vx_sm_x + CSHIFT(Vx,index_shift,1)
      enddo
      do index_shift = 1,l_s
         Vx_sm_x = Vx_sm_x + CSHIFT(Vx,-index_shift,1)
      enddo
      Vx_sm_x(:,:) = (Vx_sm_x(:,:) + Vx(:,:))/(2.*l_s + 1.)
where Vx is the x component of the velocity field, given by the ratio between Jx and Rho. The average of the velocity-field y component, Vy, along the x coordinate is performed likewise. The HPF compiler distributes the average computation among the processors and no communication is performed, consistently with the fact that the average along the row index is an operation which can be executed by each processor independently. In order to perform the average along the y coordinate, each processor needs the data of the m_s columns adjacent to its own sub-domain and belonging to other processors. Using the standard HPF features to parallelize this operation, the computation and the communication established by the compiler are not optimized, and a different strategy must be adopted.
If the average along the column index is performed employing DO loops,

c     average along y (column index)
      do j = 1,m
         do i = 1,l
            Vx_sm(i,j) = Vx_sm_x(i,j)
            do k = 1,m_s
               Vx_sm(i,j) = Vx_sm(i,j) + Vx_sm_x(i,j-k) + Vx_sm_x(i,j+k)
            enddo
            Vx_sm(i,j) = Vx_sm(i,j)/(2.*m_s + 1.)
         enddo
      enddo
the j loop is not distributed, but executed sequentially. In the previous expression, the Vx_sm_x array must be declared as

      REAL*4, DIMENSION(1:l,1-m_s:m+m_s) :: Vx_sm_x
and the values for j<1 and j>m are determined by periodic boundary conditions. All processors perform the whole loop for j=1,m, evaluating for each j whether the index belongs to their own sub-domain (and executing the computation only in this case), and whether data communication between processors is necessary to calculate the average. In such a way, the computation performed by each processor is more expensive than in the serial case. Moreover, the communication is not optimized. Indeed, the m_s data columns belonging to the neighboring sub-domain are not communicated all at once before the average computation; instead, they are communicated separately for every j index. In order to parallelize the computation and minimize the communication in the smoothing phase, we require that the processors communicate before the average computation and that the communication phase between two processors evolving neighboring sub-domains occurs by transferring to each other, all at once, the last m_s data columns of the left sub-domain and the first m_s data columns of the right sub-domain. This has been obtained by the use of EXTRINSIC(HPF_LOCAL) procedures [11], which run sequentially on each processor, with many copies executing on different processors. Such procedures are referred to as local procedures. All HPF actual arguments which are distributed arrays become locally-distributed array sections in the local procedure, while replicated arrays and replicated scalar objects are locally copied. Using an EXTRINSIC(HPF_LOCAL) function, the strategy is to build up, for every processor i_procs, two local arrays, my_left_lay(l,m_s,i_procs) and my_right_lay(l,m_s,i_procs), containing, respectively, the first and the last m_s columns of the sub-domain. Then, data are communicated between processors which evolve neighboring sub-domains by copying the local arrays my_left_lay(l,m_s,i_procs) and my_right_lay(l,m_s,i_procs) into other local arrays, characterized by a different processor index, right_lay(l,m_s,i_procs-1) and left_lay(l,m_s,i_procs+1), respectively:
      n_procs = NUMBER_OF_PROCESSORS()
      FORALL(i=1:l,k=1:m_s,i_procs=1:n_procs-1)                  &
             right_lay(i,k,i_procs) = my_left_lay(i,k,i_procs+1)
      FORALL(i=1:l,k=1:m_s)                                      &
             right_lay(i,k,n_procs) = my_left_lay(i,k,1)
      FORALL(i=1:l,k=1:m_s,i_procs=2:n_procs)                    &
             left_lay(i,k,i_procs) = my_right_lay(i,k,i_procs-1)
      FORALL(i=1:l,k=1:m_s)                                      &
             left_lay(i,k,1) = my_right_lay(i,k,n_procs)
In such a way, communication occurs before the average computation and the HPF compiler transfers the data between processors all at once, using only one temporary buffer to send the m_s data columns. Finally, an EXTRINSIC(HPF_LOCAL) procedure, which performs the average along the y coordinate, is run simultaneously on multiple processors, using, for every processor i_procs, the locally-distributed array sections right_lay(l,m_s,i_procs), left_lay(l,m_s,i_procs) and Vx_sm_x(1:l, 1+m*(i_procs-1)/n_procs : m*i_procs/n_procs). An efficiency comparison between the two different implementations of the smoothing phase will be shown in the following section, along with the results on the performance of the parallel LBE algorithm.
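A minimal sketch of such a local procedure is reported below, under the assumption that the boundary layers have already been filled as shown above and are passed to each copy as two-dimensional local sections; the routine name smooth_y_local and the auxiliary variables (acc, m_loc, jj) are hypothetical, and an explicit interface is required at the call site. It illustrates the EXTRINSIC(HPF_LOCAL) approach rather than reproducing the actual implementation.

      ! Each copy of this local routine averages along the column index within
      ! its own sub-domain, using the previously communicated boundary layers.
      EXTRINSIC(HPF_LOCAL) SUBROUTINE smooth_y_local(Vx_sm, Vx_sm_x,      &
                                           left_lay, right_lay, m_s)
        REAL*4, DIMENSION(:,:), INTENT(OUT) :: Vx_sm     ! smoothed local section
        REAL*4, DIMENSION(:,:), INTENT(IN)  :: Vx_sm_x   ! local section, already averaged along x
        REAL*4, DIMENSION(:,:), INTENT(IN)  :: left_lay  ! m_s columns from the left neighbour
        REAL*4, DIMENSION(:,:), INTENT(IN)  :: right_lay ! m_s columns from the right neighbour
        INTEGER, INTENT(IN) :: m_s
        INTEGER :: j, k, jj, m_loc
        REAL*4  :: acc(SIZE(Vx_sm_x,1))

        m_loc = SIZE(Vx_sm_x,2)          ! number of columns in the local sub-domain
        DO j = 1, m_loc
          acc = Vx_sm_x(:,j)
          DO k = 1, m_s
            jj = j - k                   ! column to the left of j
            IF (jj >= 1) THEN
              acc = acc + Vx_sm_x(:,jj)
            ELSE
              acc = acc + left_lay(:,m_s+jj)     ! columns jj < 1 live in the left layer
            END IF
            jj = j + k                   ! column to the right of j
            IF (jj <= m_loc) THEN
              acc = acc + Vx_sm_x(:,jj)
            ELSE
              acc = acc + right_lay(:,jj-m_loc)  ! columns jj > m_loc live in the right layer
            END IF
          END DO
          Vx_sm(:,j) = acc/(2.*m_s + 1.)
        END DO
      END SUBROUTINE smooth_y_local

Note that this sketch assumes m_s not larger than the local block size m/n_procs, consistently with the requirement that l_s and m_s be much smaller than the lattice dimensions.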
4 Results

The HPF version of the LBE algorithm has been implemented on a 9076 IBM SP2 parallel computer. It belongs to the family of distributed-memory parallel-computation systems and is a cluster of 16 RISC 6000/390 processors, connected to one another by a High Performance Switch (HPS). Each processor has a local RAM (Random Access Memory) of 128 MB, a memory space of 2 GB and a clock frequency of 66.7 MHz. The HPF algorithm has been compiled with version 1.1 of the xlhpf compiler, with the -O3 -qhot options. The performances obtained with two different versions of the smoothing phase have been compared. The first version (labeled as DO LOOP version) is obtained by adding the HPF array-distribution directives to the serial version: the average along the y coordinate is performed by the loop over the column index, parallelized by the HPF compiler. The second version (labeled as EXTRINSIC(HPF_LOCAL) version) is obtained using the EXTRINSIC(HPF_LOCAL) procedures to compute the field average along the y coordinate. The elapsed times required by the smoothing phase ($T_{el}^{sm}$) are shown in Fig. 1, when the algorithm runs on 2, 4 or 8 completely dedicated processors. The field averages have been computed on a lattice of $l \times m$ lattice spacings, with $l=m=512$, averaging over a spatial region of $l_s \times m_s$ lattice spacings, with $l_s=m_s=10$. In the DO LOOP version, the elapsed time is greater than in the serial version for every number of processors ($n_{procs}$), since the computation performed by each processor is more expensive than in the serial case and the communication is not optimized (Sect. 3). In the EXTRINSIC(HPF_LOCAL) version, the elapsed time decreases with the number of processors and, for $n_{procs} \geq 4$, the parallel version is more efficient than the serial one. For $n_{procs} < 4$, the time reduction associated with the distribution of the computation among processors is smaller than the time increase related to the data communication; hence the parallel version is less efficient than the serial one.

In Fig. 2, the performance increment (speed-up) with respect to the serial version, obtained in several parallel executions with different grid sizes, is plotted versus the number of processors for the LBE algorithm. In the parallel version, the smoothing phase is performed using the EXTRINSIC(HPF_LOCAL) procedures. The speed-up is estimated as the ratio between the elapsed time of the serial execution and that of the parallel one, $T_{el}^{serial}/T_{el}^{parallel}(n_{procs})$. The speed-up has been computed for the same value of $l_s=m_s=10$ and different values of $l=m$. For $l=m=64$, the communication phase turns out to be quite heavy in comparison with the computation phase: indeed, for $n_{procs} > 4$ the parallel-version performance decreases with the number of processors, since the overhead associated with the communication cancels the time reduction related to the parallelism of the computation. As the lattice dimensions are increased, the computation phase becomes more and more relevant with respect to the communication phase, and the parallel-code efficiency increases until it approaches the ideal speed-up (the continuous line in Fig. 2). The optimized HPF version has required a thorough analysis of the parallelization. A comparison with an optimized MPI version, in terms of the effort required to obtain it and of its efficiency, would be interesting; a parallelization based on MPI could therefore be the subject of future work.

I wish to give special thanks to S. Briguglio, G. Vlad and B. Di Martino for the helpful discussions.
Fig. 1. Elapsed time $T_{el}^{sm}$ (in seconds) of the smoothing phase vs. the number of processors $n_{procs}$, using the two different parallel versions of the smoothing phase (DO LOOP and EXTRINSIC(HPF_LOCAL)). The continuous line represents the elapsed time required in the serial case. The parameter values are $l=m=512$ and $l_s=m_s=10$.
Fig. 2. Speed-up of the LBE algorithm vs. the number of processors $n_{procs}$, for the same value of $l_s=m_s=10$ and different lattice sizes $l=m=64, 256, 512, 1024$. The continuous line represents the ideal speed-up.
References

1. G. Fogaccia, R. Benzi and F. Romanelli, Phys. Rev. E 54, 4384 (1996).
2. G. Fogaccia, R. Benzi and F. Romanelli, in Lecture Notes in Computer Science (Springer-Verlag, Brussels, 1996), p. 276.
3. S. I. Braginskii, in Reviews of Plasma Physics, edited by M. A. Leontovich (Consultants Bureau, New York, 1985), Vol. 1, p. 285.
4. M. Ottaviani, F. Romanelli, R. Benzi, M. Briscolini, P. Santangelo and S. Succi, Phys. Fluids B 2, 67 (1990).
5. R. Benzi, S. Succi and M. Vergassola, Phys. Rep. 222, 145 (1992).
6. S. Chapman and T. G. Cowling, Mathematical Theory of Nonuniform Gases (Cambridge University Press, 1953).
7. U. Frisch, D. d'Humieres, B. Hasslacher, P. Lallemand, Y. Pomeau and J. P. Rivet, Complex Systems 1, 649 (1987).
8. F. Higuera, S. Succi and R. Benzi, Europhys. Lett. 9, 345 (1989).
9. S. Tsai, F. W. Perkins and T. H. Stix, Phys. Fluids 13, 2108 (1970).
10. R. G. Babb II and R. H. Perrot, "An Introduction to High Performance Fortran", Scientific Programming 4, 87 (1995).
11. High Performance Fortran Forum, "High Performance Fortran Language Specification", Version 1.1 (1994).
12. A. K. Ewing, H. Richardson, A. D. Simpson and R. Kulkarni, "Writing Data Parallel Programs with High Performance Fortran", Student Notes Version 1.3.1, Edinburgh Parallel Computing Centre.
13. ISO, Fortran 90, May 1991 [ISO/IEC 1539:1991 (E) and ANSI X3.198-1992].