Some Improvements in PIC performance through sorting, caching, and dynamic load balancing

Viktor Przebinda∗ and John R. Cary†

January 5, 2005

Abstract

VORPAL has achieved a more than fourfold performance improvement through the introduction of sorting, caching, and dynamic load balancing. Particle sorting increases the temporal locality of data structures, thus maximizing utilization of the computer's cache. Manual caching of particle grid coordinates prevents redundant computation of values that are needed during multiple steps of the particle update process. Dynamic load balancing continuously adjusts the load distribution in a parallel environment to maximize utilization across all processors. These optimizations are controlled by cost-minimizing functions that seek a balance between the overhead of performing them and the performance improvement they provide.

1 Introduction

The Particle-in-Cell (PIC) algorithm is a powerful and highly accurate tool for simulating physical effects in a variety of applications. PIC is also relatively computationally expensive compared to other methods. This paper discusses a few significant performance-related improvements to the PIC model that are implemented in the University of Colorado Center for Integrated Plasma Studies' Versatile Object-oriented Plasma Simulation Code (VORPAL). These improvements have led to an overall quadrupling of performance.

The PIC model poses some difficult challenges for modern processors. Particles are updated by factoring in the forces they experience from the electric and magnetic fields, which are themselves updated by factoring in the electromagnetic (EM) currents generated by particle movement. Since all these physical entities are stored in entirely separate and dissimilar data structures, this kind of computation exposes a large processor-memory gap: the CPU is starved of work while waiting on memory requests to return. Particle sorting aims to reduce this wait time by keeping the needed memory in cache, closer to the processor, thus reducing the time required to retrieve it.

∗ [email protected]
† [email protected]

We use particle sorting, discussed in section 2, to align each particle's actual position in memory with its position on the grid where field data is stored. In this way, memory access patterns become sequential rather than random, allowing maximum use of hardware-based performance features such as caching and automatic prefetching.

The field-particle interconnection in PIC raises another computational challenge. Particle positions must be continuously converted from floating-point format to integer format so that the EM and other forces experienced by the particles can contribute to their acceleration. This conversion is needed during multiple steps of the particle update process, not only for finding EM forces. Index and weight caching, discussed in section 3, aims to eliminate redundant conversions. By caching particle indices and weights in software, we are able to reuse these values during multiple steps of the computation. To avoid large memory overhead in caching, particles are updated in groups, so indices and weights are computed for only a small subset of all particles at a time. All subsequent steps in the particle update process use these cached values as needed.

Redundant use of indices and weights is only half of the motivation for a software cache. Most computer architectures have limited hardware support for the float-to-integer conversion, forcing the compiler to generate code that performs the task in a roundabout manner. In section 4 we illustrate and discuss the GCC (http://gcc.gnu.org/) generated assembly language sequences used to perform this operation for the x86 and PowerPC instruction sets.

In VORPAL, parallelism through explicit message passing is done by means of domain decomposition. Each section of the domain is updated by its respective processor and synchronized at the end of each time step. Particles residing in a particular section are stored on the processor updating that section. Therefore, update time per processor depends on the changing concentration of plasma during the course of a simulation; as particles move, the load varies. Section 5 discusses dynamic load balancing, which compensates for this by adjusting the decomposition at runtime, keeping utilization high across all processors.

Because of these and other computational challenges in PIC, much research has been done in this area. Previous optimization-related work tends to be divided between parallel strategies, which focus on high-level implementation to minimize latency, and serial optimization strategies, which tend to be more low-level and processor specific. Viktor Decyk [2] has illustrated some general optimization strategies for numerical computation, including branch minimization and data structure consolidation, and has also addressed the benefits of particle sorting. Kevin Bowers [1] has implemented a linear-time counting sort for PIC.

1.1 VORPAL PIC structure

VORPAL makes extensive use of class structures to maximize flexibility. The two primary objects used in a simulation are an EM object, which simulates the desired electromagnetic properties, and a species object, which is responsible for storing particles and advancing their positions over time in accordance with the forces applied by the electric and magnetic fields. These two objects depend on each other, since the electric and magnetic fields affect particle motion, and particle motion generates EM waves. When performing optimizations, the advancing of particles is of most interest, since it takes the majority of the processing time compared to the EM update.

The particle push algorithm is a general form of the particle-in-cell (PIC) method. The algorithm is designed to support arbitrary boundary conditions and an arbitrary number of dimensions, and to run in parallel on any number of processors. In VORPAL's PIC implementation, particles are stored in groups of C++ Standard Template Library (STL) vectors. Groups are processed by feeding them through various objects, each of which performs some operation on that group. These operations include position-to-grid translation, particle acceleration, particle movement, current weighting, charge weighting, and boundary checking. During these operations, data from other objects must be retrieved. For example, electric and magnetic field values are obtained from the EM object, which consults individual electric and magnetic field objects to interpolate the field values for a set of positions. Particle acceleration and movement are implemented using policy classes, which make it easy to add multiple push algorithms to simulate a variety of characteristics.
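The following sketch illustrates this group-based pipeline. All class and method names are hypothetical, not VORPAL's actual interfaces; it only conveys how groups are fed through operation objects.

#include <cstddef>
#include <vector>

struct ParticleGroup {              // a batch of particles updated together
    std::vector<double> x, y, z;    // positions (illustrative layout)
    std::vector<double> vx, vy, vz; // velocities
};

struct ParticleOperation {          // one stage: translate, accelerate, move, ...
    virtual ~ParticleOperation() {}
    virtual void apply(ParticleGroup& group) = 0;
};

void updateParticles(std::vector<ParticleGroup>& groups,
                     std::vector<ParticleOperation*>& stages) {
    // Each group passes through every stage before the next group is
    // touched, so a group's data stays hot in cache across stages.
    for (std::size_t g = 0; g < groups.size(); ++g)
        for (std::size_t s = 0; s < stages.size(); ++s)
            stages[s]->apply(groups[g]);
}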

2 Particle Sorting

The particle update process involves changing each particle's velocity according to the EM field values at its position. EM field data is stored on a grid, addressed with integer indices, whereas particle positions are stored in floating-point precision and addressed in a serial fashion. Computer programs benefit from maximizing temporal locality because of the caches found within a computer. If the particles are sorted with respect to the grid, temporal locality is maximized; if particles are out of order with respect to the grid, cached data will not have a chance to be reused. This is illustrated in figure 1. Due to the large overhead involved, particle sorting must be implemented with special care so as not to worsen performance.

Figure 1: Sorted (left) vs. unsorted particles (right).

A decent sorting algorithm will have an average running time of O(n log n),² where n is the length of the data set, or the number of particles in our case. Note that the particle update itself has a fixed run time of only O(n). VORPAL uses a modified quicksort [3] to order particles. Quicksort was chosen because it can be done in place and is easily implemented on linked data structures, such as those used in VORPAL, that do not support random-access iteration.

Quicksort relies on a divide-and-conquer strategy. A "pivot" value is chosen and put in place by ensuring that all values to its left are smaller and all values to its right are larger: values on the left that are larger than the pivot are swapped with values on the right that are smaller. The algorithm then recurses, performing the same operation on the left and right halves. For an unsorted array the pivot usually divides the data set roughly in half, giving an expected running time of O(n log n); if the data set is already sorted, however, the pivot does not divide the data set evenly, yielding a large recursive depth and a worst-case running time of O(n²). A modification to the quicksort routine enables it to run in linear time, O(n), by limiting the recursive depth of the algorithm (a sketch is given below). Notice that the recursive depth of quicksort grows most rapidly when the input data is already sorted. As a consequence, particles will be mostly, but not necessarily fully, sorted when the routine finishes. Since we are concerned only with the performance impact that sorted data provides, this limitation is insignificant.

The overhead of sorting must also be controlled: sorting should be done only when its performance benefit outweighs its cost. To this end VORPAL uses a cost-minimization algorithm that decides when to sort based on the following quantities:

• The last time, tSort, that it took to sort, per particle.
• The last time, tSorted, that it took to push the particles after sorting, per particle.
• The total time, tSumPush, that it took to push the particles since the last sort, per particle.
• The number of steps, nSinceSort, since the last sort.

Whenever tSumPush > nSinceSort * tSorted + tSort, it is time to sort again, since the time invested in the sort will be recovered over the next nSinceSort steps. This result can be derived under the assumption that the excess update time grows linearly with the time step: y = mx, where m is a slope constant, x is the step number, and y is the excess update time for that step, i.e., the difference between the total update time and the update time with particles perfectly sorted. We further assume that sorting takes a constant time p. We can see from figure 2 that sorting is best done when the total time spent sorting equals the total extra processing time caused by unsorted particles; sorting too soon or too late increases the total area under the curves shown in figure 2. So we are looking for the value x where p = ½ m x², or x = √(2p/m). In the code this condition is satisfied whenever tSumPush > nSinceSort * tSorted + tSort (a sketch of the bookkeeping follows figure 2).

² Kevin Bowers [1] has implemented a linear-time O(n) sorting algorithm based on a counting sort; however, the algorithm has memory requirements proportional to the number of particles.
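A minimal sketch of the depth-limited quicksort idea follows. It is illustrative only: for brevity it sorts a plain array of integer cell keys with random access, whereas VORPAL's routine operates on its linked particle containers and swaps whole particles.

#include <algorithm>  // std::swap
#include <cstddef>    // std::ptrdiff_t

// Depth-limited quicksort on an array of cell indices. Recursion stops
// at depthLeft == 0, leaving the data mostly (but not necessarily fully)
// sorted, so a single call costs O(n) rather than O(n log n) or worse.
void limitedQuicksort(int* key, std::ptrdiff_t lo, std::ptrdiff_t hi,
                      int depthLeft) {
    if (hi - lo < 2 || depthLeft <= 0) return;
    const int pivot = key[lo + (hi - lo) / 2];
    std::ptrdiff_t i = lo, j = hi - 1;
    while (i <= j) {                  // Hoare-style partition
        while (key[i] < pivot) ++i;
        while (key[j] > pivot) --j;
        if (i <= j) { std::swap(key[i], key[j]); ++i; --j; }
    }
    limitedQuicksort(key, lo, j + 1, depthLeft - 1);
    limitedQuicksort(key, i, hi, depthLeft - 1);
}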


Figure 2: Optimal sorting step makes sorting time equal to excess particle update time. (The plot shows update time versus time step: the update time for sorted particles, the excessive push time due to unsorted particles, and the sorting time.)
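In code, the sorting decision reduces to a comparison of accumulated timings. The structure below is a hypothetical sketch mirroring the quantities defined above, not VORPAL's actual class.

// Bookkeeping for the sort-scheduling criterion
//   tSumPush > nSinceSort * tSorted + tSort
// All times are per particle, as in the list above.
struct SortScheduler {
    double tSort;       // time the last sort took
    double tSorted;     // push time for the step right after that sort
    double tSumPush;    // accumulated push time since the last sort
    int    nSinceSort;  // steps taken since the last sort

    bool shouldSort() const {
        // Sort when the accumulated push time exceeds what nSinceSort
        // sorted-case pushes plus one sort would have cost.
        return tSumPush > nSinceSort * tSorted + tSort;
    }

    void recordPush(double tPush) { tSumPush += tPush; ++nSinceSort; }
    void recordSort(double t) { tSort = t; tSumPush = 0.0; nSinceSort = 0; }
};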

The performance impact of particle sorting varies widely with processor type and simulation parameters. Since particle sorting aims to improve performance by minimizing cache misses, the impact is best seen on machines with small cache sizes relative to the number of particles in the simulation. The results we present here are for 500 time steps of a 2-D simulation with 1800x100 cells. A full Yee EM solve and a relativistic Boris push were used, with five particles per cell loaded at the start of the simulation. The table below shows the average particle push time in microseconds per particle.

Machine              Push time with sorting   Push time without sorting   Speedup
AMD Athlon 2.8GHz    0.534                    1.579                       3
IBM SP               3.769                    4.66                        1.24
G4 800MHz            2.88                     4.45                        1.55
Intel P4 2.4GHz      1.169                    2.813                       2.4

3 Caching of Particle Indices and Weights

During several steps of the particle update process, each particle's indices and weights are needed. A particle's indices are its grid coordinates; the weights convey where within the given grid cell the particle resides. This information is used in three of the update steps. First, the indices and weights are used to change each particle's velocity by looking up electric and magnetic field values at the particle's location; here the weights are used to interpolate field values. The indices are then needed to accumulate currents as a result of particle movement (data from which is later added to the EM field). Finally, the indices are used again to remove particles according to predefined boundary values. This is implemented through a sink array equal to the size of the grid, as shown in figure 3. The array holds pointers to functions that take particle objects. For each particle, the appropriate function pointer is retrieved by consulting the index cache for that particle, and the function pointed to by the cell is then called with the particle as its argument. This technique provides very flexible handling of boundary conditions: periodic boundary conditions, messengers to neighboring processors, and particle absorbers are all implemented as functions whose pointers can be placed in the sink array. The performance improvement from our index cache is roughly a factor of 1.17 across all processors.

Figure 3: Particle Sinking. A grid-sized array whose boundary cells hold pointers to particle removal routines and to particle messengers for neighboring processors.
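As an illustration, the sink array of figure 3 can be realized with an array of function pointers along the following lines; the names here are hypothetical stand-ins for VORPAL's actual routines.

#include <cstddef>
#include <vector>

struct Particle;  // particle type, defined elsewhere

// A "sink" is any routine that consumes a particle: a periodic wrap,
// an absorber, or a messenger that ships the particle to a neighboring
// processor.
typedef void (*ParticleSink)(Particle&);

void keep(Particle&) {}                        // interior cells: do nothing
void absorb(Particle&) { /* mark particle for removal */ }

// One sink pointer per grid cell; boundary cells hold absorbers, wraps,
// or messengers, while interior cells hold the no-op.
std::vector<ParticleSink> sinkArray;

void applySink(Particle& p, std::size_t cachedOffset) {
    // The cell offset comes straight from the index cache, so no
    // float-to-integer conversion is repeated here.
    sinkArray[cachedOffset](p);
}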

Unfortunately, conversion between floating-point and integer types is quite expensive on most machines: neither the Intel nor the PowerPC instruction set provides a convenient way to convert from floating-point to integer format. For this reason we must eliminate all redundant typecasts. To this end, VORPAL implements the data structure shown in figure 4 to store the indices and weights of all particles in a group.

struct {
public:
    size_t offset;
    FLOATTYPE wl[NDIM];
    FLOATTYPE wu[NDIM];
};

Figure 4: Cache data structure for software caching of indices and weights.

Since indices are always used to reference data on a grid, it is more useful to store offsets into the data rather than the actual indices; hence the data structure above includes a single offset instead of per-dimension indices.
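To make the caching concrete, the following sketch shows how one cache entry might be filled on a uniform grid. It assumes linear weighting and row-major strides; FLOATTYPE and NDIM stand for the configuration constants of figure 4, while CacheEntry, fillCacheEntry, and the parameter names are hypothetical.

#include <cstddef>

typedef double FLOATTYPE;    // as configured in a VORPAL build
const int NDIM = 3;

struct CacheEntry {          // the structure of figure 4, given a name here
    std::size_t offset;
    FLOATTYPE wl[NDIM];
    FLOATTYPE wu[NDIM];
};

// Fill one cache entry from a particle position. The truncating cast
// below is the only place the expensive conversion of section 4 is paid.
void fillCacheEntry(const FLOATTYPE pos[NDIM],      // particle position
                    const FLOATTYPE invDx[NDIM],    // reciprocal cell sizes
                    const std::size_t stride[NDIM], // grid strides
                    CacheEntry& e) {
    e.offset = 0;
    for (int d = 0; d < NDIM; ++d) {
        FLOATTYPE scaled = pos[d] * invDx[d];
        std::size_t index = static_cast<std::size_t>(scaled); // truncation
        e.wu[d] = scaled - static_cast<FLOATTYPE>(index);     // upper weight
        e.wl[d] = 1 - e.wu[d];                                // lower weight
        e.offset += index * stride[d];   // flatten indices into one offset
    }
}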

4 Typecasting in Assembly

When dealing with a high-level language, it can be difficult for the compiler to produce optimal code. Furthermore, in some cases the computer architecture is not well suited to a specific task. Typecasting from float to integer is one such case. Here we discuss the assembly-level implementation of such a typecast for both the PowerPC and the Intel x86 instruction sets. This is of interest because the conversion is used so frequently in PIC codes to look up the indices where a particle resides, and must be performed at least once for each index of each particle at each timestep. We first consider the PowerPC instruction set on the G4, and then the x86. The following code manifests a float to size_t (unsigned integer of the maximum machine-supported size) conversion, as generated by gcc 3.3.3:

Instruction   Operand   Flow   Operand     Meaning
lfsx          f3        ⇐      r2,r3       Load from memory
fmr           f2        ⇐      f3          Register copy
fctiwz        f1        ⇐      f2          Round towards zero
stfd          f1        ⇒      -24(r1)     Store to memory
lwz           r5        ⇐      -20(r1)     Load from memory into integer register
stwx          r5        ⇒      r2,r4       Store as integer

We see that the compiler first loads the floating-point value into f3, then copies it to f2. The third instruction, fctiwz, performs the conversion to integer with rounding towards zero (truncation) and places the result in f1, a floating-point register rather than an integer register. To get the result into an integer register, the processor must first store the floating-point register out to memory and then load the least significant four bytes (32 bits) of that data back as an integer. The loaded value is then stored again to memory in the needed location (array element, return value, etc.). Unfortunately, this seemingly roundabout sequence must be performed for each particle, since the PowerPC instruction set has no better support for conversion between the two types. The most likely reason that fctiwz cannot write to an integer register is that doing so would complicate the design of the processor's internal pipeline: the PowerPC, being a RISC architecture, prefers a uniform instruction set, and a floating-point instruction that writes to an integer register would break from RISC ideals and make chip implementation more complicated.

The keen reader may suggest that the compiler use the PowerPC stfsx (store floating-point single precision) instruction to store only the desired 32 bits directly to the appropriate memory location, rather than storing and reloading as is currently done. The problem is that after fctiwz, the data in the floating-point register is in signed integer format, not IEEE floating-point format. Whereas stfd simply stores the register's contents to memory regardless of format, stfsx bundles the store with a conversion from double to single precision; since the register's contents are not in floating-point format, the result would be meaningless. The same scenario appears on the SP when gcc is used. The xlC compiler calls a library routine to perform the conversion, and therefore cannot be inspected so easily.

The sequence for the Intel instruction set, shown below, has its own advantages and weaknesses.

Instruction   Operand   Flow   Operand          Meaning
flds          fpstack   ⇐      (%esi,%edx,4)    Load from memory
fldcw         fpstack   ⇐      -12(%ebp)        Reset rounding mode
fistpl        fpstack   ⇒      (%ebx,%edx,4)    Store as integer
fldcw         fpstack   ⇐      -10(%ebp)        Restore rounding mode

The first instruction loads the value to convert from memory onto the floating-point stack. The second switches the rounding mode from round-to-nearest to round-to-zero, the third stores the value back to memory as an integer, and the last restores the rounding mode to round-to-nearest. We see that the x86 instruction set does not require additional instructions to convert between data lengths, but it does require a change of rounding mode within the floating-point unit for the fistpl instruction to work correctly, whereas the PowerPC provides separate instructions for the two rounding modes.
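For reference, the source construct behind both sequences is simply a truncating cast; the helper below is an illustrative fragment, not VORPAL code.

#include <cstddef>

// One such cast is needed per dimension per particle per timestep.
// On PowerPC it compiles to the fctiwz/store/reload sequence above;
// on x86 it compiles to the fldcw/fistpl/fldcw sequence.
inline std::size_t toCellIndex(float x) {
    return static_cast<std::size_t>(x);  // rounds toward zero
}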

5 Dynamic Load Balancing

A crucial aspect of a parallel program is its ability to fully utilize CPU time across all processors. For PIC codes, synchronization at the end of each time step is the limiting factor in achieving maximum utilization. Update time is a function of particle distribution: processors with a larger number of particles to push take longer to reach the next time step, and during synchronization, processors that finish early must wait idle while the others catch up. Since the particle distribution changes at each time step, it is not possible to set a single ideal decomposition for the whole simulation; a more dynamic approach is needed. VORPAL contains a fully dynamic runtime load balancer with over-allocation and cost minimization.

Before we proceed, a brief discussion of VORPAL's decomposition framework is warranted. A decomposition is described in VORPAL through a list of boundary positions for direction zero. Each of these boundaries cuts through the entire domain, dividing it into slices. Each of these slices can then be further divided independently along direction one, and each slice along direction one can be independently divided along direction two. This concept is illustrated in figure 5.

By measuring each particle's update time, VORPAL is capable of determining an optimal decomposition adjustment. This is done by ranking all update times from greatest to least. Decomposition boundaries are adjusted by shrinking all boundaries around the processor with the longest update time.³ The second-busiest processor then shrinks all of its boundaries, provided that they have not already been adjusted by busier processors. This is repeated for all processors in the simulation. The result is a decomposition adjustment description object, as shown in figure 6, that is broadcast to and applied on all processors. Once it is applied, VORPAL initiates a standard synchronization routine that moves all values to the appropriate processors.

As a result of decomposition adjustment, the size of the local grid on an individual processor will vary, so all local data structures that hold values at grid locations, such as the Yee EM field, must be dynamic. VORPAL supports over-allocation of data structures for this purpose. All gridded data structures are allocated to a specified allocation region, but iterated over active regions, which contain the meaningful data. With this structure it is possible to expand the size of a local region without the overhead involved in reallocating and copying data. Whenever the allocated space runs out, VORPAL reallocates with overflow and copies all data structures to the new size, as shown in figure 7 (a sketch is given below).

A certain level of overhead is associated with performing a decomposition adjustment, which must be mitigated for dynamic load balancing (DLB) to be effective. VORPAL uses a cost-minimization algorithm that performs an adjustment only if it is estimated to reduce the total execution time.

³ It is a limitation of the VORPAL framework that load balancing can move boundaries by no more than one cell per timestep.
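The over-allocation scheme can be sketched as follows. This is a minimal illustration with hypothetical names and an assumed growth factor, not VORPAL's actual grid classes.

#include <cstddef>
#include <vector>

// A 1-D slice of a gridded quantity. The vector is allocated with
// headroom beyond the active region, so growing the active region
// after a decomposition adjustment usually needs no reallocation.
class OverAllocatedArray {
public:
    OverAllocatedArray(std::size_t active, std::size_t headroom)
        : data_(active + headroom), active_(active) {}

    std::size_t activeSize() const { return active_; }

    void resizeActive(std::size_t newActive) {
        if (newActive > data_.size()) {
            // Out of headroom: reallocate with overflow and copy,
            // as in figure 7. The 1.5x growth factor is an assumption.
            std::vector<double> bigger(newActive + newActive / 2);
            for (std::size_t i = 0; i < data_.size(); ++i)
                bigger[i] = data_[i];
            data_.swap(bigger);
        }
        active_ = newActive;  // cheap path: just widen the active region
    }

    double& operator[](std::size_t i) { return data_[i]; }

private:
    std::vector<double> data_;   // allocation region
    std::size_t active_;         // active region holding meaningful data
};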

Figure 5: Boundary adjustment is limited by direction. Direction 0 boundaries divide the entire domain; direction 1 boundaries subdivide the divisions made by direction 0 boundaries; direction 2 boundaries subdivide the divisions made by direction 1 boundaries.

The decision process is made as follows. Given

• the time difference, cpTimeDiff, between the update times of the busiest and least busy processors,
• the change in cpTimeDiff, lastDLBsavings, resulting from the last decomposition adjustment,
• the last time, tSorted, that it took to push the particles after sorting, per particle,
• the remaining number of steps, numRemSteps, in this simulation, and
• the last time, lastDLBtime, that it took to adjust the decomposition,

a decomposition adjustment is made if the following condition is met:

lastDLBsavings * numRemSteps > lastDLBtime
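In code, the test is a single comparison once the timings are collected; the structure below is a hypothetical sketch mirroring the quantities listed above.

// Perform a decomposition adjustment only when the projected savings
// over the remaining steps exceed the cost of the adjustment itself.
struct DLBController {
    double lastDLBsavings;  // reduction in cpTimeDiff from last adjustment
    double lastDLBtime;     // time the last adjustment itself took
    long   numRemSteps;     // steps remaining in the simulation

    bool shouldRebalance() const {
        return lastDLBsavings * numRemSteps > lastDLBtime;
    }
};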

A simple illustration of load balancing on two processors is shown in figure 8. When load balancing is disabled, the update time on each processor varies with the local density of particles. When load balancing is enabled, the load across both processors is controlled independently of the particle density.

Figure 6: Load Balancing. The decomposition adjustment description for four processors (CPU0 through CPU3) over directions 0 and 1, shaded from highest to lowest CPU load.

6 Discussion and concluding remarks

We have presented an overview of three optimizations added to VORPAL, which together have resulted in an overall quadrupling of performance. Particle sorting yields outstanding improvements, especially on machines with small cache sizes. Viktor Decyk's results show a substantially smaller impact from sorting; this is because his simulations load particles in sorted order, whereas VORPAL does not. Kevin Bowers' sorting implementation, which relies on the fact that each particle's position corresponds to an index on a finite grid, is elegant and efficient; however, the additional memory it requires makes it less desirable.

Software caching of particle indices and weights is an optimization very much specific to VORPAL and may be less applicable to other implementations. Since VORPAL is written in an object-oriented style, the functionality of the various operations is finely divided, despite the natural flow of data from one operation to the next. It is for this reason that we divide particles into groups, so that the individual operations within an update are applied to a group of particles at a time. Hence, software caching of otherwise recomputed values such as indices and weights is a natural step.

In practice, we have found dynamic load balancing to be extremely costly due to the communication latency involved in re-synchronizing data values after a decomposition adjustment. We believe our one-cell-per-step limitation on decomposition adjustment is mainly responsible, as we are forced to pay a large performance penalty in exchange for a minimal change in load. A more flexible implementation could prove far more powerful.

Figure 7: Over Allocation. When a new memory block is requested, values are copied over; memory is over-allocated to prevent future resizes.

Figure 8: Update time per processor (msec) versus time step for CPU0 and CPU1, without load balancing (left) and with load balancing (right).

Runtime cost minimization is a feature that we have found to be nicely suited to VORPAL's object-oriented structure. We are able to scatter timer objects throughout the code to measure wall-clock time, user time, and idle time; cost analysis routines can then query their values and act on them. Without cost minimization we would be forced to determine, experimentally, optimal static points at which to sort and to adjust the decomposition. With it, we are assured that performance-related enhancements are exactly that, enhancements, and not potential bottlenecks for some small subset of test cases.

References

[1] K. Bowers, Journal of Computational Physics, vol. 173, issue 2, November 2001.

[2] V. K. Decyk, S. R. Karmesin, A. deBoer, and P. C. Liewer, "Optimization of particle-in-cell codes on reduced instruction set processors," Computers in Physics, vol. 10.

[3] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. Cambridge, MA: The MIT Press, 1990.

[4] C. D. Norton, B. K. Szymanski, and V. K. Decyk, "Object-Oriented Parallel Computation for Plasma Simulation," Communications of the ACM, vol. 38, issue 10, October 1995.

[5] IA-32 Intel Architecture Optimization, Intel Corporation.

[6] IA-32 Intel Architecture Software Developer's Manual, Intel Corporation.

[7] Assembler Language Reference, IBM.

[8] Mac OS X Assembler Guide, Apple Computer Corporation, 2004.

[9] C. Nieter and J. R. Cary, "VORPAL: a versatile plasma simulation code."
