Optimizing Parallel Particle-in-Cell Code Using Both PVM and OpenMP and Its Performance on the Tsukuba PC Cluster

DongSheng Cai and Quanming Lu, The University of Tsukuba, [email protected]
Viktor K. Decyk, UCLA, [email protected]
Abstract
Particle-in-Cell (PIC) or Particle-in-Mesh (PIM) codes have been parallelized on many advanced parallel computers. At the same time, many new high-performance PCs, such as those based on the Pentium II or the Alpha 21264 RISC processor, have been introduced recently, and these PCs have a high potential for high-performance computing. We are currently building a dual-PentiumPro PC cluster for a space weather simulator. In this report, using a test skeleton-PIC-code developed by Professor V. K. Decyk of UCLA for benchmarking purposes, we measure the performance of these PCs both in serial and in parallel, after optimizing the code for the PentiumPro. In our benchmarks, we use PVM for message passing between PCs over 10Base-T Ethernet and OpenMP for shared memory accesses within each dual-PentiumPro PC. The benchmark results indicate that our PC cluster is comparable to recent parallel computers such as the SP2.
1 Skeleton-PIC-code
Ever since the emergence of parallel computers, particle-in-cell (PIC) or particle-in-mesh (PIM) simulation has been recognized as a practical tool that scientists in disciplines such as fluid dynamics and plasma sciences can use to study the complex dynamics of particles such as air molecules or sub-atomic ions and electrons [1]. A typical PIC code maps a spatial simulation domain onto a grid [2]. Particles are represented as moving within the grid, while the properties tracked both by the grid points and by the particles are updated. On a parallel computer, one can decompose both the particles and the grid onto the processors as the primary decomposed data structures, in a typical SPMD model in which every processor runs the same PIC program on its own decomposed region. However, when particles move across the decomposed regions, the particle data need to be sent to the appropriate processors. A skeleton-PIC-code has been proposed by Decyk [3] as a testbed where new algorithms can be developed and tested and new computer architectures can be evaluated. The code is deliberately kept minimal, but it includes all the essential pieces for which algorithms need to be developed: depositing charge, advancing particles, and solving the field. The code moves only electrons, with periodic electrostatic forces obtained by solving Poisson's equation with fast Fourier transforms. The code uses the electrostatic approximation, and magnetic fields are neglected. The only diagnostic is the particle and field energy. The basic structure of the main loop of the skeleton code is illustrated in Fig. 2; a minimal sketch of such a main loop is also given below. Recently, many PCs based on high-performance RISC processors, such as the PentiumPro, Pentium II, Alpha 21264, and PowerPC, have been introduced. By interconnecting such PCs through high-speed networks, it is easy and cheap to build a PC cluster for high-performance parallel computation.
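The sketch below illustrates the main-loop structure just described: deposit the charge, solve for the field, and push the particles. The routine and array names are ours and the subroutine bodies are stubs, so this is only an illustrative outline, not the actual skeleton-code interfaces.

c Illustrative sketch of a PIC main loop (deposit, field solve, push).
c Routine and array names are ours; the bodies are stubs.
      program picloop
      integer nx, ny, npmax, nloop, it
      parameter (nx=64, ny=64, npmax=1024, nloop=10)
      double precision part(4,npmax), q(nx,ny), fxy(2,nx,ny)
      do it = 1, nloop
c deposit particle charge onto the grid
         call deposit(part, q, npmax, nx, ny)
c solve Poisson's equation for the electric field (FFT in real code)
         call fsolve(q, fxy, nx, ny)
c interpolate the field to the particles and advance them
         call push(part, fxy, npmax, nx, ny)
      enddo
      end

      subroutine deposit(part, q, npmax, nx, ny)
      integer npmax, nx, ny
      double precision part(4,npmax), q(nx,ny)
c stub: accumulate each particle's charge on nearby grid points
      end

      subroutine fsolve(q, fxy, nx, ny)
      integer nx, ny
      double precision q(nx,ny), fxy(2,nx,ny)
c stub: periodic Poisson solve for fxy from q
      end

      subroutine push(part, fxy, npmax, nx, ny)
      integer npmax, nx, ny
      double precision part(4,npmax), fxy(2,nx,ny)
c stub: update particle positions and velocities
      end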
In the present report, we build a PC cluster from 16 HP Vectra 6/200 machines, inexpensive PCs (about $3,500 each) based on dual PentiumPro processors, and interconnect them through an inexpensive HP 10Base-T Ethernet switch. We optimize the skeleton-PIC-code for the PentiumPro using the common RISC-optimization methods [4] described in Section 3. We benchmark the skeleton-PIC-code on the PC cluster and compare its performance with other commercial parallel computers in Section 4. Since our PC cluster is a distributed shared memory system, i.e. a cluster of dual-PentiumPro nodes, in Section 5 we use PVM for message passing between PCs and OpenMP within each PC for the communication of particle and field data in the skeleton-PIC-code. The Tsukuba dual-PentiumPro PC cluster is being built for a space weather simulator, and our optimization and performance tuning aim at building a more efficient space weather simulator on the PC cluster. The details of the space weather simulator are described in another report.
2 Building Tsukuba dual PentiumPro PC Cluster

The skeleton-PIC-code has been benchmarked on many advanced parallel computers. We have built a dual-PentiumPro PC cluster; each node is equipped with two PentiumPro processors and runs the RedHat 4.1 operating system. For parallel computation, we use PVM 3.4 as the message-passing library, and PGF77 and PGHPF from the Portland Group Inc. as compilers. Each PC is an HP Vectra 6/200, a 200 MHz dual-PentiumPro SMP machine with 64 Mbyte of EDO DIMM memory. As shown in Fig. 1, the PCs are interconnected through 10Base-T Ethernet and a 10Base-T switching hub. We use PGF77 as the main compiler.
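As an aside, PVM programs on such a cluster typically start by having the first task spawn the remaining tasks. The following is a minimal sketch using the standard PVM 3 Fortran interface; the executable name 'pic' and the task count are illustrative only and are not the actual skeleton-code setup.

c Illustrative PVM start-up: the first task spawns the other tasks.
c The executable name and the task count are examples only.
      program start
c fpvm3.h declares the PVM Fortran constants
      include 'fpvm3.h'
      integer mytid, ptid, numt, info, tids(15)
      call pvmfmytid(mytid)
      call pvmfparent(ptid)
c ptid is a negative error code when this task has no PVM parent,
c i.e. when it is the first task started by hand
      if (ptid .lt. 0) then
c flag 0 = default placement, '*' = any host in the virtual machine
         call pvmfspawn('pic', 0, '*', 15, tids, numt)
      endif
c ... the actual PIC computation would go here ...
      call pvmfexit(info)
      end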
Figure 1: Distributed shared memory PC cluster.
Figure 2: Structure of the main loop of the skeleton-PIC-codes.

All the benchmarks of the code are run in double precision. The skeleton-PIC-code was developed by V. Decyk of UCLA. The code is kept as small as possible, but it contains all the essential elements of a PIC code, as shown in Fig. 2. The physical problem in the code is a beam-plasma instability in which 10% of the particles form a beam whose velocity is five times the thermal velocity of the background electrons. In the code, a quadratic spline function is used for the interpolation between grid and particles (an illustrative form of the weights is sketched after this paragraph). In the parallel benchmarks, the benchmark time excludes the initialization, which is always performed on one processor; the initialized particle data are then distributed to the other processors. This is done because the initialization time is negligible and we want to start the simulation in the same state, with the same total energy, even on parallel computers with different numbers of processors. We use a one-dimensional partition as shown in Fig. 3.
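Regarding the quadratic spline interpolation mentioned above: for a particle at normalized offset dx from its nearest grid point (|dx| <= 0.5), a quadratic spline distributes its contribution over the three nearest grid points. The small routine below is an illustrative sketch of the standard quadratic-spline weights; the routine name is ours, and we have not reproduced the skeleton code's exact expressions.

c Illustrative quadratic-spline weights for the three grid points
c nearest a particle; dx is the particle offset from the nearest
c grid point in grid units, with |dx| <= 0.5.
      subroutine qsplin(dx, wl, wc, wr)
      double precision dx, wl, wc, wr
c weight on the grid point to the left of the nearest one
      wl = 0.5d0*(0.5d0 - dx)**2
c weight on the nearest grid point
      wc = 0.75d0 - dx*dx
c weight on the grid point to the right
      wr = 0.5d0*(0.5d0 + dx)**2
c the three weights sum to one for any dx
      end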
Figure 3: One-dimensional grid and particle partition.
3 Single Processor Optimization for PentiumPro
First, we benchmark the performance of the two-dimensional skeleton-PIC-code on recent single RISC processors. Second, we try to improve the single-PentiumPro PIC performance using the usual RISC-optimization methods [4], which aim at making the best use of the pipelines and caches of the PentiumPro processor. In the skeleton-PIC-code, most of the CPU time is spent on the particle acceleration (the subroutine "push" in Fig. 2) and on the charge deposition (the subroutine "deposit" in Fig. 2), so we put most of our effort into optimizing these two subroutines. The details of our RISC-optimization methods are described in [4] and can be summarized as follows:

(1) Change the data structure so that we can make the best use of the cache memory. In the original non-optimized code, the electric fields are kept in two separate two-dimensional arrays fx(i,j) and fy(i,j), and the charge is stored in one two-dimensional array q(i,j). We therefore combine the two electric field arrays fx(i,j) and fy(i,j) into one array:

      fxy(1,i,j)=fx(i,j)
      fxy(2,i,j)=fy(i,j)

and change the charge array into one one-dimensional array qq(k). In addition, we convert the particle coordinate arrays x(i), y(i), vx(i), vy(i) into either two separate arrays xx(2,i) and vv(2,i) or one array part(4,i), depending on the processor:

      xx(1,i)=x(i)
      xx(2,i)=y(i)
      vv(1,i)=vx(i)
      vv(2,i)=vy(i)

or

      part(1,i)=x(i)
      part(2,i)=y(i)
      part(3,i)=vx(i)
      part(4,i)=vy(i)

With the combined layouts, all the data for one particle or one grid point are adjacent in memory; an illustrative particle loop using the part(4,i) layout is sketched below.
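The following loop (our names, not the actual "push" routine) shows the access pattern this layout produces: the four values belonging to particle i are read and written from one column of part, i.e. from consecutive memory locations that usually fall on a single cache line.

c Illustrative particle update loop using the combined part(4,i)
c layout; ax and ay stand in for the interpolated field and are set
c to constants here only to keep the sketch self-contained.
      subroutine accel(part, np, dt)
      integer np, i
      double precision part(4,np), dt, ax, ay
      do i = 1, np
         ax = 0.0d0
         ay = 0.0d0
c velocity update, then position update, all within column i of part
         part(3,i) = part(3,i) + ax*dt
         part(4,i) = part(4,i) + ay*dt
         part(1,i) = part(1,i) + part(3,i)*dt
         part(2,i) = part(2,i) + part(4,i)*dt
      enddo
      end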
(2) Remove IF statements using guard cells so that we can make the best use of the pipelines of RISC processors. When particles cross the boundaries, we use extra cells to allow access to data beyond the boundaries. After all the particles have moved, we either add up the data stored in the extra cells or replicate the proper array values into them. For example, with quadratic interpolation, we enlarge the field arrays by three guard cells in each dimension, one on the left and two on the right. The x component of the field array is then stored as follows:

      dimension fxy(2,0:nx+2,0:ny+2)
      fxy(1,i,j)=fx(i,j)
      fxy(1,0,j)=fx(nx,j)
      fxy(1,nx+1,j)=fx(1,j)
      fxy(1,nx+2,j)=fx(2,j)
      fxy(1,i,0)=fx(i,ny)
      fxy(1,i,ny+1)=fx(i,1)
      fxy(1,i,ny+2)=fx(i,2)

The method is the same for the arrays fy and q; thus we can remove the IF statements and make the best use of the pipeline of the PentiumPro processor.

(3) Particle sorting. In the skeleton-PIC-code, the maximum cache reuse occurs when all the particles in the same cell are processed together. This is the way the code is initialized, but the ordering degrades after the program has run for a while. We use a simple bin-sort routine that calculates how many particles there are at each grid point and their locations in the particle array; the particle data are then reordered in grid order using the location array and a temporary array (a counting-sort pass of this kind is sketched at the end of this section). We determined empirically that it is sufficient to sort the particles every 50 time steps and only in the y direction.

For this benchmark, the problem uses 8192 grid points and 327680 particles, and runs for 325 time steps so that the beam instability is fully developed. From the two subroutines we know how many floating-point operations are needed to move one particle in one iteration, so the actual performance, i.e. the number of floating-point operations per second, can be calculated once the real time spent in these two subroutines in one benchmark run is known. The measured benchmark results are listed in Table 1. The results indicate that such optimizations are significant: the performance improves by a factor of 1.3 to 3 over the "dusty deck" results. As listed in Table 1, the RISC-optimized skeleton-PIC-code reaches 18%-37% of the theoretical peak performance, and on the PentiumPro processor we obtain 24% of the theoretical peak.
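The counting sort mentioned in item (3) can be sketched as follows. This is an illustrative routine with our own names, and it assumes the y coordinate is stored in grid units (0 <= y < ny) so that int(part(2,i))+1 is the row index; the skeleton code's own sorting routine may differ in detail.

c Illustrative bin sort of particles by their y cell index so that
c particles in the same row of cells are stored contiguously.
c part2 is scratch storage; npic holds counts, then offsets.
      subroutine sorty(part, part2, npic, np, ny)
      integer np, ny, i, j, k, ist, isum
      double precision part(4,np), part2(4,np)
      integer npic(ny)
c count the particles in each row of cells
      do k = 1, ny
         npic(k) = 0
      enddo
      do i = 1, np
         k = int(part(2,i)) + 1
         npic(k) = npic(k) + 1
      enddo
c turn the counts into starting offsets (exclusive prefix sum)
      isum = 0
      do k = 1, ny
         ist = npic(k)
         npic(k) = isum
         isum = isum + ist
      enddo
c copy the particles into the scratch array in row order
      do i = 1, np
         k = int(part(2,i)) + 1
         npic(k) = npic(k) + 1
         do j = 1, 4
            part2(j,npic(k)) = part(j,i)
         enddo
      enddo
c copy back in the new order
      do i = 1, np
         do j = 1, 4
            part(j,i) = part2(j,i)
         enddo
      enddo
      end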
4 Parallel Computers and PC Cluster Benchmarks

First, we measure the speedup S and the efficiency E (defined below) of our PC cluster with the two-dimensional skeleton-PIC-code. In this benchmark problem, we use 32768 grid points and 1310720 particles. Although the dual PentiumPro processors share their memory within one PC, we use PVM to send the data to the neighboring processor even inside a PC in all the benchmarks reported in this section.
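For reference, S and E are used here in the usual sense: if T_1 is the time on one processor and T_p the time on p processors, then

    S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}.

For example, the two-processor row of Table 2 gives S(2) = 1745.2/892.9 = 1.95 and E(2) = 1.95/2 = 0.98, i.e. 98%.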
Single RISC Processor Benchmarks

Computer                  Compiler & options                     "dusty deck"      RISC-optimized    Theoretical peak
                                                                 Mflops (% peak)   Mflops (% peak)   (Mflops)
Cray T3E-900*             cf77                                    63 (7%)          188 (21%)          900
Mac G3/266*               Absoft                                  64 (24%)          84 (32%)          266
IBM SP2*                  xlf -O3 -qarch=pwr2 -qautodbl=dblpad4   59 (22%)          98 (37%)          266
Cray T3D*                 cf77 -O1                                17 (11%)          48 (32%)          150
Hitachi SR2201            f77 -O3                                 37 (12%)          53 (18%)          300
Intel Pentium Pro/200MHz  Gnu f77                                 10 (5%)           13 (7%)           200
                          Gnu f77 -O3                             26 (13%)          37 (19%)          200
                          pgf77                                   33 (17%)          44 (22%)          200
                          pgf77 -O2 -Munroll -tp p6 -Mnoframe     38 (19%)          47 (24%)          200
                          Visual Fortran 5.0                      32 (16%)          39 (20%)          200
Table 1: Benchmark results with the two-dimensional skeleton-PIC-code, comparing Intel Pentium Pro/200MHz with other recent RISC processors used in advanced commercial parallel computers. The superscript * means that the data are measured by Professor V. K. Decyk of UCLA.
Processors   Total time (s)   Speedup   Efficiency (%)
    1           1745.2          1.00        --
    2            892.9          1.95        98
    4            455.9          3.83        96
    8            226.3          7.71        96
   16            117.0         14.92        93
Table 2: Benchmark results of our PentiumPro PC cluster for the two-dimensional skeleton-PIC-code. In these benchmarks, we use 32768 meshes and 1310720 particles and run for 325 time steps. The compiler options are "pgf77 -O2 -Munroll -tp p6 -pc 64".
As indicated in the table, the speedup is almost linear and the efficiency stays above 93%. For the comparison with other parallel computers shown in Fig. 4, the two-dimensional skeleton-PIC-code uses 32768 grid points and 3571712 particles and runs for 325 time steps, and the three-dimensional skeleton-PIC-code uses 262144 grid points and 7962624 particles and runs for 425 time steps. The two- and three-dimensional benchmark results are shown in Fig. 4 (a) and (b). As indicated in the figures, the performance of the PC cluster is almost the same as that of the SP2 and SR2201. In addition, we can still expect to improve the performance of the two-dimensional skeleton-PIC-code by about 15%-20% using the RISC-optimization methods on our PentiumPro cluster.
Figure 4: Total time versus number of processors on a log-log scale for various parallel computers and the PC cluster; the compiler options for the PentiumPro cluster are "pgf77 -O2 -Munroll -tp p6 -pc 64". (a) The two-dimensional code uses 32768 grid points and 3571712 particles and runs for 325 time steps. (b) The three-dimensional code uses 262144 grid points and 7962624 particles and runs for 425 time steps. The superscript * means that the data were provided by Professor V. K. Decyk of UCLA.
5 PVM and OpenMP Performances
Because our PCs are all dual-PentiumPro machines whose two processors share memory within one PC, we test the OpenMP performance of the two- and three-dimensional skeleton-PIC-codes and compare it with the PVM performance. OpenMP is a set of compiler directives and a callable runtime library that extend Fortran (and, separately, C and C++) to express shared-memory parallelism. In the problem discussed so far, as indicated in Table 2, the efficiency stays above 93% even though we use a relatively slow 10Base-T switching hub for the PC interconnection. This means that not many particles move to other PCs in one time step in that benchmark. When only 10% to 20% of the particles move from one processor to another, the overhead of OpenMP surpasses the overhead of PVM, and it is therefore impossible to compare the communication or memory-access times of OpenMP and PVM. In order to make more particles move from one processor to the others, and therefore to require more communication, we increase the fraction of beam particles to above 90% of the total number of particles; this is done purely for benchmarking purposes and is not physical. The number of grid points in the vertical direction is also reduced by factors of 2 to 16 while the total number of grid points is kept constant, so that more particles cross processor boundaries.

First, we compare the performance of PVM with that of OpenMP on a single dual-PentiumPro SMP PC. In this case we only compare the PVM message-passing time within one PC with the memory-access time of the shared memory system; the overheads of both OpenMP and PVM are, of course, included. The total numbers of grid points and particles are 32768 and 1015808, respectively, for the two-dimensional code, and 16384 and 458752 for the three-dimensional code. We keep the two programs in the same style, one using PVM and the other using the OpenMP compiler directives. The subroutine "push" is structured as follows.

PVM subroutine push:

      dimension part(idimp,npmax),
C part contains the total particles
     1 sbufl(idimp,nbmax),sbufr(idimp,nbmax)
C sbufl, sbufr contain the particles which move to other processors
C npp is the number of particles on one processor
      do j=1,npp
C here move the particles
         ......
C define the particles which move to other processors
      enddo
      end

      subroutine pmove
C this subroutine moves the particles to other processors
      ....
      end

OpenMP subroutine push:

      dimension part(idimp,npmax,nvp),
C nvp is the number of processors on the SMP node
     1 sbufl(idimp,nbmax,nvp),sbufr(idimp,nbmax,nvp)
!$OMP PARALLEL
!$OMP DO
      do i=1,nvp
         do j=1,npp(i)
C here move the particles
            ....
C define the particles which move to other processors
         enddo
      enddo
!$OMP END DO NOWAIT
!$OMP END PARALLEL
      end

      subroutine pmove
C this subroutine moves the particles to other processors
      ...
      end
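The body of pmove is elided above; in essence it packs the outgoing particle buffers and exchanges them with the two neighboring tasks. The following is a minimal sketch of such an exchange using the standard PVM 3 Fortran interface; the routine name, buffer names, message tags, and the assumption of 8-byte reals are ours, and error handling, hole repacking, and multi-step passes are omitted.

c Illustrative exchange of outgoing particles with the left and right
c neighbor tasks (tidl, tidr) using PVM.  nsl and nsr are the numbers
c of particles leaving to the left and to the right.
      subroutine pxchng(sbufl, sbufr, rbufl, rbufr, nsl, nsr,
     1                  idimp, nbmax, tidl, tidr)
      include 'fpvm3.h'
      integer nsl, nsr, idimp, nbmax, tidl, tidr
      integer bufid, info, nbytes, mtag, mtid
      double precision sbufl(idimp,nbmax), sbufr(idimp,nbmax)
      double precision rbufl(idimp,nbmax), rbufr(idimp,nbmax)
c pack and send the particles leaving through the left boundary
      call pvmfinitsend(PVMDEFAULT, bufid)
      call pvmfpack(REAL8, sbufl, idimp*nsl, 1, info)
      call pvmfsend(tidl, 1, info)
c pack and send the particles leaving through the right boundary
      call pvmfinitsend(PVMDEFAULT, bufid)
      call pvmfpack(REAL8, sbufr, idimp*nsr, 1, info)
      call pvmfsend(tidr, 2, info)
c receive the particles arriving from the right (sent with tag 1)
      call pvmfrecv(tidr, 1, bufid)
      call pvmfbufinfo(bufid, nbytes, mtag, mtid, info)
      call pvmfunpack(REAL8, rbufr, nbytes/8, 1, info)
c receive the particles arriving from the left (sent with tag 2)
      call pvmfrecv(tidl, 2, bufid)
      call pvmfbufinfo(bufid, nbytes, mtag, mtid, info)
      call pvmfunpack(REAL8, rbufl, nbytes/8, 1, info)
      end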
The benchmark results are shown in Fig. 5. The message-passing time of the PVM code is about 5.5 times larger than the corresponding communication time of the OpenMP code, i.e. the SMP memory-access time. When the communication time is a relatively small portion of the total CPU time, the overall performance of PVM is better than that of OpenMP; however, when the communication time is a relatively large portion of the total CPU time, the performance of OpenMP is better than that of PVM, as one would expect.
Figure 5: The total CPU time and the message-passing time for different numbers of grid points in the vertical direction; the run uses 325 time steps and the compiler options are "pgf77 -O2 -Munroll -tp p6 -pc 64". (a) The two-dimensional code uses 32768 grid points and 1015808 particles. (b) The three-dimensional code uses 16384 grid points and 458752 particles.

In our distributed shared memory PC cluster system, we use both PVM for the inter-PC communication and OpenMP for the shared memory accesses. We compare the skeleton-PIC-code using only PVM with the one using both PVM and OpenMP. Here we benchmark the two versions on 8 PCs, i.e. 16 processors. The total numbers of grid points and particles are, respectively, 32768 and 1212416 for the two-dimensional code, and 131072 and 1179648 for the three-dimensional code. In this benchmark, we use linear interpolation functions for convenience. We compare two cases: in case 1, every dual-PentiumPro PC runs two PVM tasks, so messages are sent even within one PC; in case 2, every PC runs only one PVM task and uses the OpenMP compiler directives shown above within the PC. The results are shown in Fig. 6: the total message-passing time of case 1 is about 1.45 times that of case 2, and when the message-passing time becomes larger, the overall performance of case 2 can also exceed that of case 1.
Figure 6: The total CPU time and message-passing time using 8 PCs for different numbers of grid points in the vertical direction; the run uses 325 time steps and the compiler options are "pgf77 -O2 -Munroll -tp p6 -pc 64". (a) The two-dimensional code uses 32768 grid points and 1212416 particles. (b) The three-dimensional code uses 131072 grid points and 1179648 particles.
6 Summary and Conclusion
We have built an inexpensive dual-PentiumPro PC cluster for a space weather simulator. In doing so, we benchmarked the cluster using the skeleton-PIC-code, which was developed as a testbed to measure the performance of parallel computers. The benchmark results are very promising: the PC cluster is almost comparable to recent commercial parallel computers such as the SP2 or SR2201, although we should note that our system is limited to 32 processors. The major advantage is that the PC cluster is definitely cheaper than any commercial parallel computer if one does not need more than about 100 processors. Regarding the mixed use of the PVM and OpenMP paradigms, it has an advantage over PVM alone only when the communication time is significant, for example, when the communication time exceeds 10 to 20% of the total CPU time in our benchmark. However, more detailed investigations are needed, since 4- to 8-processor SMP machines are becoming more popular and cheaper among recent PC servers.
References

[1] Birdsall, C. K., and A. B. Langdon, Computer Simulation Using Particles, McGraw-Hill, New York, 1981.

[2] Liewer, P. C., and V. K. Decyk, A General Concurrent Algorithm for Plasma Particle-in-Cell Simulation Codes, J. Comput. Phys., 85, pp. 302-322, 1989.

[3] Decyk, V. K., Skeleton PIC Codes for Parallel Computers, UCLA Institute of Plasma and Fusion Research Report PPG-1511, 1994 (www.physics.ucla.edu/~uclapic).

[4] Decyk, V. K., S. R. Karmesin, A. de Boer, and P. C. Liewer, Optimization of Particle-in-Cell Codes on RISC Processors, UCLA Institute of Plasma and Fusion Research Report PPG-1546, 1995 (www.physics.ucla.edu/~uclapic).