Optimizing 3D Multigrid to be Comparable with the FFT

Kaushik Datta
UC Berkeley
[email protected]
Michael Maire
UC Berkeley
[email protected]
Abstract
Multigrid is an iterative algorithm that uses nearest-neighbor computations to solve PDEs on a regular mesh. Although multigrid is an O(n) algorithm, on problem sizes of interest it generally does not run as fast as the highly optimized FFT, which is an O(n log n) algorithm. This paper examines several optimizations to the multigrid approach for solving PDEs on a three-dimensional grid. The NAS Multigrid (NASMG) benchmark, a recognized standard, serves as a baseline for our performance analysis. We present a performance model for 3D multigrid that incorporates architectural parameters for the processor and memory system. Benchmark results for potential optimizations are obtained on multiple architectures. The Performance Application Programming Interface (PAPI) library is used to examine hardware behavior in detail. We compare these results with the predictions of our performance model and discuss the architectural requirements for further improving the speed of 3D multigrid.

1 Introduction

Fast Fourier Transform (FFT) based algorithms are currently popular for solving partial differential equations (PDEs) on regular meshes. The multigrid algorithm offers a better asymptotic running time of O(n) compared to the FFT's runtime of O(n log n) for a regular mesh, or grid, of n points. However, the constant overhead factor associated with multigrid is large enough to make an optimized FFT algorithm faster on problem sizes typical of current real-world applications. We examine the potential of several optimizations to multigrid to reduce this overhead. In particular, we focus on the task of improving multigrid performance for solving PDEs on 3D grids, as this problem has practical applications and has not been explored by previous work, which focused exclusively on 2D multigrid [8]. We first create a performance model that enables us to predict the runtime of the multigrid algorithm given the characteristics of a particular platform. Using this model as a guide, we develop and test optimizations likely to offer performance improvements. As the multigrid algorithm can be decomposed into four main functional components, we analyze the performance of each of these functions separately. In addition, in order to validate the assumptions in the performance model, each optimization was tested on four processor architectures used for scientific computing: the IBM Power3, Compaq Alpha 21264, AMD Opteron, and Intel Itanium 2.

The remainder of this paper is organized as follows. In Section 2, we review related work on optimizing multigrid for solving 2D PDEs and discuss the additional difficulties associated with the 3D case. Section 3 gives a brief overview of the FFT and multigrid techniques for solving PDEs on a 3D grid. In Section 4, we develop a performance model for multigrid based on processor, cache, and memory parameters. In Section 5, we present optimizations and predict their potential benefit using our performance model. Section 6 discusses the procedure used to test optimized versions of the multigrid code. In Section 7, we analyze the results obtained from these benchmarks, and in Section 8 we discuss their implications for future research.

2 Related Work

Sellappa and Chatterjee [8] previously looked at the problem of optimizing the 2D version of multigrid. They use a slightly different variant of multigrid, the red-black Gauss-Seidel algorithm, than the one on which this paper focuses. However, they face similar challenges in efficiently traversing a grid of points multiple times while performing only a constant number of operations per grid point in each traversal. They found register blocking to improve performance for the 2D case. This optimization loads a block of the grid into registers and traverses along a grid direction, loading only newly encountered cells into registers and shifting grid values in the registers as needed. The larger the block of the grid that can fit into registers, the greater the reduction of unnecessary loads. Analogous to register blocking, cache blocking is a technique that aims to reduce the number of unnecessary loads
from memory to cache. Rivera and Tseng [7] discuss cache blocking for 2D and 3D array layouts and present a method for computing the optimal block size given the cache size and array size. They also discuss copying a subset of an array into a temporary buffer in order to improve cache performance. However, they point out that this optimization is only applicable to algorithms that perform more than constant work per array element, and hence is of little help to stencil codes such as multigrid. Lam et al. [4] give a detailed performance analysis of cache blocking, taking into account cache and array sizes, cache associativity, and array access patterns.
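To make the register blocking idea concrete, the following is a minimal sketch of the technique on a 1D 3-point stencil; it is our own illustration (not code from [8]), and the array and weight names are hypothetical. Values already held in scalar variables, which the compiler can keep in registers, are shifted forward as the sweep advances, so only one new grid value is loaded per cell.

```c
/* Register-blocking sketch for a 1D 3-point stencil sweep.
 * Only one new load is issued per iteration; the other two operands
 * are shifted between register-resident scalars. */
void sweep_1d(const double *u, double *out, int n, double w0, double w1)
{
    double left = u[0];                       /* u[i-1], kept in a register */
    double mid  = u[1];                       /* u[i]                       */
    for (int i = 1; i < n - 1; i++) {
        double right = u[i + 1];              /* the only new load per cell */
        out[i] = w0 * mid + w1 * (left + right);
        left = mid;                           /* shift instead of reloading */
        mid  = right;
    }
}
```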
Figure 1. 27 point stencil (3x3x3 cube) for 3D multigrid. Points shown in the same color have the same weights, as they are equidistant from the center (shown in black).
3 Solving PDEs on a 3D Regular Mesh

In this paper, we restrict our attention to the task of solving the Poisson equation, an elliptic PDE that arises in many important physical problems (such as electrostatic or gravitational potential). In 3D, the continuous version of this equation is

∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z² = a·f(x, y, z)   (1)

where a is a constant and x, y, and z specify spatial coordinates. The discrete version of this problem, in which x, y, and z are restricted to specify coordinates in a three-dimensional grid, can be written as

T u = b   (2)

where T is a matrix of stencil coefficients, and u and b are arrays of values at each grid point. This discretized version can be obtained by approximating ∂²u/∂x² as

(u(x − h, y, z) − 2u(x, y, z) + u(x + h, y, z)) / h²

where h is the grid spacing, and analogously for the y and z directions.

Algorithm     Serial Flops   Memory Usage
FFT           O(N log N)     O(N)
Multigrid     O(N)           O(N)
Lower Bound   O(N)           O(N)

Table 1. Computational complexity of the FFT and multigrid PDE solvers along with the theoretical lower bound.

3.1 FFT PDE Solver

The Fast Fourier Transform (FFT) can be used as an exact solver for Poisson's equation. This is done by first computing the forward transform of the right hand side of equation (2). Then, the actual solve is performed by dividing each element in the Fourier space by a factor based on the diagonal matrix of the eigenvalues of T. Finally, a normalized inverse FFT is performed to produce the solution. For a grid of N points, this procedure requires O(N log N) operations, where N is the total number of grid elements.
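To make the procedure above concrete, the following is a minimal sketch of an FFT-based Poisson solve. It assumes periodic boundary conditions, the standard 7-point discrete Laplacian (whose eigenvalues have a simple closed form), and the FFTW library; the paper does not specify which FFT implementation or which operator eigenvalues were used, so every name below should be read as illustrative.

```c
/* Sketch of an FFT-based Poisson solve: forward FFT of the right hand side,
 * divide by the eigenvalues of the discrete Laplacian, inverse FFT, normalize.
 * Assumes periodic boundaries and a 7-point Laplacian; uses FFTW3. */
#include <math.h>
#include <fftw3.h>

void fft_poisson_solve(double *b, double *u, int n, double h)
{
    int nz = n / 2 + 1;                       /* r2c output size in last dim */
    fftw_complex *bhat = fftw_alloc_complex((size_t)n * n * nz);

    fftw_plan fwd = fftw_plan_dft_r2c_3d(n, n, n, b, bhat, FFTW_ESTIMATE);
    fftw_plan bwd = fftw_plan_dft_c2r_3d(n, n, n, bhat, u, FFTW_ESTIMATE);

    fftw_execute(fwd);                        /* forward transform of the RHS */

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < nz; k++) {
                /* Eigenvalue of the periodic 7-point Laplacian at this mode. */
                double lambda = (2.0 * cos(2.0 * M_PI * i / n)
                               + 2.0 * cos(2.0 * M_PI * j / n)
                               + 2.0 * cos(2.0 * M_PI * k / n) - 6.0) / (h * h);
                size_t idx = ((size_t)i * n + j) * nz + k;
                if (i == 0 && j == 0 && k == 0) {
                    bhat[idx][0] = bhat[idx][1] = 0.0;   /* zero-mean solution */
                } else {
                    bhat[idx][0] /= lambda;
                    bhat[idx][1] /= lambda;
                }
            }

    fftw_execute(bwd);                        /* unnormalized inverse FFT */
    double scale = 1.0 / ((double)n * n * n);
    for (size_t p = 0; p < (size_t)n * n * n; p++)
        u[p] *= scale;                        /* FFTW leaves a factor of n^3 */

    fftw_destroy_plan(fwd);
    fftw_destroy_plan(bwd);
    fftw_free(bhat);
}
```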
3.2 Multigrid PDE Solver

Multigrid is an iterative solver built from four operators, each of which traverses the grid and applies a small stencil computation at every grid point:

coarsen - Restricts values from a fine grid onto the next coarser grid.

prolongate - Interpolates values from a coarse grid onto the next finer grid.

apply smoother - Improves the current solution on the grid, reducing the difference between the right and left hand sides of equation (2). This requires one 27 point stencil computation per grid point.

evaluate residual - Computes the residual b − Tu at each grid point, using the 27 point stencil to approximate Tu.

These operations are applied recursively so that their call graph forms a V, as shown in Figure 2. At each level, constant work is done per grid point. As there are N/8^i grid points at level i, the total number of operations for a 3D grid of size N is

Σ_i c·(N/8^i) < (8/7)·c·N = O(N)   (6)

Thus, provided the number of V-cycles required to produce a solution within some fixed tolerance is independent of the grid size, then for a given tolerance, multigrid converges in O(N) steps. Although we will not provide a rigorous proof here, this is in fact the case, as shown in Figure 3. Conceptually, one can view the error at any point in time in terms of a basis of sine curves of various frequencies. The lower levels of multigrid have smaller frequency domains than higher levels, and a given level of multigrid dampens the error in the upper half of its frequency domain by smoothing. A single V-cycle thus reduces the error in the entire frequency spectrum by a constant fraction, and hence a fixed number of V-cycles is required to bring the error below some predetermined tolerance.

As indicated by Table 1, the multigrid algorithm actually achieves the theoretical lower bound on the computational complexity required to solve the Poisson equation on a regular mesh. However, in practice, the running time of multigrid is proportional to cN, where c is a fixed, yet significant, constant. Specifically, each application of one of the four multigrid operators (coarsen, prolongate, apply smoother, or evaluate residual) traverses the entire grid and applies a stencil operation at each grid point. Thus, even for practical problems with large values of N, the extra log N factor in the runtime of the FFT-based solver may still be smaller than c. Note that since multigrid produces an approximate solution and the FFT produces an exact solution, several iterations of the multigrid V-cycle are required to yield a result comparable to the FFT in practical applications. Typically, 15-20 V-cycles are enough to achieve the desired error bound (see Figure 3).
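As a concrete reference for the kernels analyzed in the rest of the paper, the following is a minimal sketch of the naive 27-point stencil sweep used by apply smoother and evaluate residual. It is our own illustration, not the benchmark source; the array layout and weight table are assumptions.

```c
/* Naive 27-point stencil sweep over an n x n x n grid stored in row-major
 * order (unit stride in k). w[di+1][dj+1][dk+1] holds the stencil weight for
 * offset (di, dj, dk); weights depend only on distance from the center. */
#define IDX(i, j, k, n) (((size_t)(i) * (n) + (j)) * (n) + (k))

void stencil27(const double *in, double *out, int n, const double w[3][3][3])
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            for (int k = 1; k < n - 1; k++) {        /* innermost: unit stride */
                double s = 0.0;
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        for (int dk = -1; dk <= 1; dk++)
                            s += w[di + 1][dj + 1][dk + 1] *
                                 in[IDX(i + di, j + dj, k + dk, n)];
                out[IDX(i, j, k, n)] = s;
            }
}
```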
4 Multigrid Performance Model
As each of the four operators involved in the multigrid algorithm simply traverses the grid, performing a constant amount of computation at each grid point, the speed at which grid values can be loaded from memory into registers is a limiting factor in performance. Moreover, since one floating point register is needed per stencil point at each cell in the grid, register blocking or preloading of grid values into registers before they are used are not feasible optimizations (the register pressure is too great). We first develop a performance model with memory as the limiting factor. The apply smoother and evaluate residual kernels use a full 27 point stencil at each grid cell, whereas coarsen and prolongate access fewer neighbors. Hence, apply smoother and evaluate residual account for the majority of the computation, and we focus our model on them.

For these two functions, the computation at each cell in the grid essentially consists of loading all 27 grid points in the surrounding 3x3x3 cube, multiplying each point by the corresponding weight defined by the stencil, and summing the results. The naive way to traverse the grid without incurring horrible cache performance uses 3 nested loops, one for each dimension, with the innermost loop corresponding to a unit increment in memory location. Thus, reading values for the 27 point stencil in this pattern consists of loading along 9 streams in memory (one stream for each point on the face of the cube in the direction of motion).

Using this description of the behavior of multigrid, we can now estimate performance bounds for a given architecture. The average time for a load to complete (average memory access time, or AMAT) is

AMAT = L1 Hit Time + Miss Rate × Miss Penalty   (7)

where the hit and miss statistics refer to the L1 cache. In the case of multigrid, it is reasonable to assume that higher levels of cache do not impact performance, at least for the unoptimized algorithm we are considering now. The size of the grid is so large that by the time the same cell is seen again in the next traversal of the grid, it will have been evicted from all levels of the cache. With this observation, every load of a grid cell is a cache miss, unless it was brought into the cache when servicing a miss just prior to it. Therefore, we can estimate the L1 cache miss rate as

Miss Rate = B_double / L1 Line Size   (8)

where B_double is the number of bytes used to represent a double-precision floating point number. The miss penalty is simply the time to fetch from main memory (the memory latency). The time per cell spent reading from memory can be estimated as

Cell Time (mem) = Load Streams × AMAT   (9)

where, as explained above, there are 9 load streams in the unoptimized version of the algorithm. The best case for the time per cell spent performing floating point operations can be estimated as

Cell Time (FP) = FP Ops per Cell / Peak Flop Rate   (10)

where there is approximately one FP operation per stencil element. In computing the total time per cell, we use the conservative estimate that half of the floating point operations are allowed to overlap with outstanding memory operations:

Cell Time (mem+FP) = Cell Time (mem) + Cell Time (FP) / 2   (11)

This is a relatively minor point, as the time spent on memory operations dominates that spent on floating point calculations. Similarly, we ignore the time spent performing integer operations needed to compute the memory addresses of the stencil elements.

Property                 Power3   Alpha   Opteron     Itanium 2
Clock (MHz)              375      1000    1600        900
Peak Flop Rate (GF/s)    1.5      –       –           –
Peak Mem Bwidth (GB/s)   1.6      8       3.2/3.2**   6.4
L1 Cache (Data) (KB)     64       64      64          16
L2 Cache (KB)            8192     8192    1024        256
L3 Cache (MB)            –        –       –           1.5
L1 Cache Line (B)        128      64      64          64
L1 Hit Time (cy)         1        1       3           1
Mem Latency (min) (cy)   35       –       –           (20+)*
Mem Latency (max) (cy)   139      –       –           –

Table 2. Processor characteristics. (*) Memory latency for the Itanium 2 is given as 20 cycles, plus the latency resulting from the memory subsystem. (**) 3.2 GB/s in each direction. Specifications for each processor were obtained from the manufacturer's website. Not all statistics were available for all processors.

Parameter                    Min     Max
Miss Rate                    .0625   .0625
AMAT (μsec)                  .0085   .0258
Cell Time (mem) (μsec)       .0765   .2325
Cell Time (FP) (μsec)        .0200   .0200
Cell Time (mem+FP) (μsec)    .0865   .2425
Grid Time (mem+FP) (sec)     1.45    4.07

Table 3. Performance model predictions for the apply smoother kernel on the IBM Power3. Estimates are shown for the time per cell spent accessing memory (mem), the time spent on floating point operations (FP), and the total time (mem+FP), taking into account the overlap of these operations. The grid time is the cell time multiplied by the grid size (256³).

Using these equations, along with the processor specifications shown in Table 2, we obtain the predictions shown in Table 3 for multigrid performance on the IBM Power3 processor. A comparison of the time bounds predicted in Table 3 with the actual running time of the standard multigrid algorithm on the Power3 processor is shown in Figure 4. The actual runtime is close to the minimum predicted by the model. This both validates the model and indicates that we are achieving near the maximum performance possible given the current memory traffic and compute time required per cell. Successful optimizations are therefore more likely to be ones that focus on reducing the amount of work done per cell, rather than, for example, simply rearranging instruction order.

Figure 4. Predicted and actual running times of multigrid on the Power3. Results are shown for the standard algorithm with only common subexpression elimination (CSE) optimizations done by the programmer.
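A small self-contained calculation makes the model concrete. The sketch below plugs the Power3 parameters from Table 2 into equations (7)-(11); the variable names are ours, and the 30 flops/cell figure is an assumption (roughly one FP operation per stencil element) chosen to reproduce the Table 3 numbers.

```c
/* Evaluate the Section 4 performance model (equations (7)-(11)) using the
 * Power3 parameters from Table 2. The output should roughly match Table 3. */
#include <stdio.h>

int main(void)
{
    double clock_mhz      = 375.0;   /* Power3 clock rate                  */
    double peak_gflops    = 1.5;     /* peak flop rate                     */
    double line_bytes     = 128.0;   /* L1 cache line size                 */
    double hit_cycles     = 1.0;     /* L1 hit time                        */
    double miss_cycles    = 35.0;    /* memory latency (min); 139 for max  */
    double load_streams   = 9.0;     /* unoptimized 27-point stencil sweep */
    double flops_per_cell = 30.0;    /* ~one FP op per stencil element     */

    double miss_rate = 8.0 / line_bytes;                      /* eq (8)  */
    double amat_cy   = hit_cycles + miss_rate * miss_cycles;  /* eq (7)  */
    double amat_us   = amat_cy / clock_mhz;                   /* cycles -> usec */

    double t_mem   = load_streams * amat_us;                  /* eq (9)  */
    double t_fp    = flops_per_cell / (peak_gflops * 1000.0); /* eq (10) */
    double t_total = t_mem + 0.5 * t_fp;                      /* eq (11) */

    double cells = 256.0 * 256.0 * 256.0;                     /* grid size */
    printf("MissRate %.4f  AMAT %.4f us  mem %.4f  FP %.4f  total %.4f us\n",
           miss_rate, amat_us, t_mem, t_fp, t_total);
    printf("Grid time %.2f s\n", t_total * cells / 1e6);
    return 0;
}
```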
5 Optimizations

Improvements aimed at reducing the amount of work done per cell can be classified into two areas:

- Optimizations that reduce the number of integer and/or floating point ALU operations per cell. Possibilities include optimizing the computation of stencil coordinates (grid locations of the stencil points) and reusing partial sums common to consecutive stencil operations.

- Optimizations that improve memory behavior, either by reducing the number of loads or by reducing the number of cache misses per cell. The number of loads could be reduced by keeping a block of the grid in registers (register blocking), allowing reuse of values in the block without reloading them from cache. Similarly, keeping a block of the grid in cache (cache blocking) allows reuse of values in the block without reloading them from memory.

Unfortunately, as a result of the size of the stencil, it is not possible to load a large enough section of the grid into registers to achieve any practical amount of reuse. Thus, we focus exclusively on optimizations from the first category and on cache blocking. A description of each optimization we consider follows.
5.1 Common Subexpression Elimination
The version of the multigrid code with common subexpression elimination (CSE) performed is simply meant to show the performance improvement that can be obtained over the totally naive baseline code without changing the algorithm's behavior in any meaningful way. The only difference between the CSE code and the baseline code is that stencil coordinates are computed in an optimized manner. In a sense, the difference between baseline and CSE performance on a particular platform gives some idea of the quality of the compiler optimizations for that system. If the compiler optimized the baseline code well, it should have performance near that of the CSE optimization. For any of the other optimizations to represent a true performance gain, they must result in a speedup over the CSE code.
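As an illustration of what "computing stencil coordinates in an optimized manner" means here, the sketch below hoists the repeated index arithmetic of the naive kernel out of the stencil loops. This is our own example, not the benchmark source.

```c
/* CSE on stencil index arithmetic: the naive code recomputes
 * ((i+di)*n + (j+dj))*n + (k+dk) for all 27 offsets at every cell; here the
 * nine row base addresses are computed once per (i, j) pair instead. */
void stencil27_cse(const double *in, double *out, int n, const double w[3][3][3])
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++) {
            size_t row[3][3];                 /* bases of the 9 rows touched */
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++)
                    row[di + 1][dj + 1] =
                        ((size_t)(i + di) * n + (size_t)(j + dj)) * n;

            for (int k = 1; k < n - 1; k++) {
                double s = 0.0;
                for (int di = 0; di < 3; di++)
                    for (int dj = 0; dj < 3; dj++) {
                        const double *p = in + row[di][dj] + (k - 1);
                        s += w[di][dj][0] * p[0] + w[di][dj][1] * p[1]
                           + w[di][dj][2] * p[2];
                    }
                out[((size_t)i * n + j) * n + k] = s;
            }
        }
}
```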
Parameter        Loop Unrolling   Memoization   Blocking
Load Streams     9                9             3
FP Loads/Cell    28               18            28
FP Stores/Cell   1                3             1
FP Ops/Cell      31               19            31
Int Ops/Cell     18               17            28

Table 4. Operation counts for evaluate residual after applying different optimizations.
5.2 Loop Unrolling

Unrolling the loop that traverses the grid along the innermost dimension has a number of potential benefits. First, it reduces the number of stencil coordinates that must be computed per cell, as consecutive cells have a 3x3x2 block of the grid in common. Unrolling once requires only 9 additional stencil coordinates to be computed for the second cell. In addition, reuse of the values loaded from the shared block is exposed to the compiler. The number of floating point operations per cell, however, is unaffected by the application of loop unrolling alone.
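The following sketch, with names of our own choosing, shows the inner loop of the 27-point kernel unrolled by a factor of two: the index computations are done once and the two grid planes shared by consecutive cells are reused for both outputs.

```c
/* Inner (k) loop of the 27-point stencil unrolled by two. The nine row base
 * addresses are shared by both cells, and cells k and k+1 reuse the 3x3x2
 * block of grid values that their stencils have in common. */
void stencil27_unroll2(const double *in, double *out, int n,
                       const double w[3][3][3])
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++) {
            size_t row[3][3];                       /* shared index arithmetic */
            for (int di = 0; di < 3; di++)
                for (int dj = 0; dj < 3; dj++)
                    row[di][dj] = ((size_t)(i + di - 1) * n + (j + dj - 1)) * n;

            int k = 1;
            for (; k + 1 < n - 1; k += 2) {         /* two cells per iteration */
                double s0 = 0.0, s1 = 0.0;
                for (int di = 0; di < 3; di++)
                    for (int dj = 0; dj < 3; dj++) {
                        const double *p = in + row[di][dj] + (k - 1);
                        /* p[1] and p[2] are used by both cells. */
                        s0 += w[di][dj][0] * p[0] + w[di][dj][1] * p[1]
                            + w[di][dj][2] * p[2];
                        s1 += w[di][dj][0] * p[1] + w[di][dj][1] * p[2]
                            + w[di][dj][2] * p[3];
                    }
                out[((size_t)i * n + j) * n + k]     = s0;
                out[((size_t)i * n + j) * n + k + 1] = s1;
            }
            for (; k < n - 1; k++) {                /* remainder cell, if any */
                double s = 0.0;
                for (int di = 0; di < 3; di++)
                    for (int dj = 0; dj < 3; dj++) {
                        const double *p = in + row[di][dj] + (k - 1);
                        s += w[di][dj][0] * p[0] + w[di][dj][1] * p[1]
                           + w[di][dj][2] * p[2];
                    }
                out[((size_t)i * n + j) * n + k] = s;
            }
        }
}
```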
5.3 Memoization

Memoization achieves the maximal possible reduction of floating point operations by precomputing all partial sums that can be reused between grid cells. At each cell, points in the surrounding grid shown in the same color in Figure 1 are multiplied by a constant and added together. The sum of the four corner points along the leading face of the stencil cube is multiplied by the "red" constant for one cell, the "blue" constant for the following cell, and the "red" constant again for the next cell. Traversing the grid once to precompute these partial sums and again to perform a final multiplication and summation cuts the number of floating point operations per cell by about a third. In addition, memoization reduces register pressure by requiring fewer floating point values to be held in registers at once. Of course, the initial traversal of the grid still uses 9 load streams and should incur the same number of cache misses. There is a trade-off in that memoization produces more store traffic to memory.
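The paper does not give the exact decomposition used, so the sketch below shows one possible realization of the idea (our own formulation): for every point we precompute two weighted sums over its 3x3 neighborhood in the y-z plane, after which each 27-point result is just one sum from the center plane plus one from each neighboring plane. This trades extra stores for roughly a third fewer floating point operations per cell, consistent in spirit with the counts in Table 4. The names and the four-weight convention (w0 center, w1 face, w2 edge, w3 corner) are assumptions.

```c
/* One possible memoization of the 27-point stencil (our formulation).
 * Pass 1 computes, for every point, two weighted in-plane sums:
 *   Q = w0*center + w1*(axial sum) + w2*(diagonal sum)
 *   P = w1*center + w2*(axial sum) + w3*(diagonal sum)
 * Pass 2 forms each result as Q(i) + P(i-1) + P(i+1). */
#include <stdlib.h>

#define IDX(i, j, k, n) (((size_t)(i) * (n) + (j)) * (n) + (k))

void stencil27_memo(const double *in, double *out, int n,
                    double w0, double w1, double w2, double w3)
{
    size_t total = (size_t)n * n * n;
    double *P = malloc(total * sizeof *P);
    double *Q = malloc(total * sizeof *Q);

    /* Pass 1: in-plane partial sums, each reused by three cells along i. */
    for (int i = 0; i < n; i++)
        for (int j = 1; j < n - 1; j++)
            for (int k = 1; k < n - 1; k++) {
                double c = in[IDX(i, j, k, n)];
                double a = in[IDX(i, j - 1, k, n)] + in[IDX(i, j + 1, k, n)]
                         + in[IDX(i, j, k - 1, n)] + in[IDX(i, j, k + 1, n)];
                double d = in[IDX(i, j - 1, k - 1, n)] + in[IDX(i, j - 1, k + 1, n)]
                         + in[IDX(i, j + 1, k - 1, n)] + in[IDX(i, j + 1, k + 1, n)];
                Q[IDX(i, j, k, n)] = w0 * c + w1 * a + w2 * d;
                P[IDX(i, j, k, n)] = w1 * c + w2 * a + w3 * d;
            }

    /* Pass 2: combine the memoized sums from the stencil's three planes. */
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            for (int k = 1; k < n - 1; k++)
                out[IDX(i, j, k, n)] = Q[IDX(i, j, k, n)]
                                     + P[IDX(i - 1, j, k, n)]
                                     + P[IDX(i + 1, j, k, n)];
    free(P);
    free(Q);
}
```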
5.4 Cache Blocking

Cache blocking [4, 7] attacks the cache miss rate by attempting to allow reuse between adjacent 2D slices of the grid. By restricting the range of the two innermost loops used to traverse the grid, we can reduce the number of load streams from 9 to 3. The stencil encompasses three parallel planes of grid points when operating on the cells in the middle plane of these three. The upper two of these planes are accessed again as the lower and center planes when operating on cells in the next level up. Because of the size of the grid, all values loaded from these planes are evicted from the cache before they are read again. The cache-blocked code breaks the plane into small rectangular regions. A column of stacked rectangular regions is traversed before moving on to the next column. Thus, only values loaded from the uppermost plane are not in cache, resulting in 3 load streams. There is, of course, some overhead when switching between blocks.

Table 4 gives a summary of the per-cell operations required by the optimizations discussed above.
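A minimal sketch of the blocked traversal is shown below; it is our own illustration, and the block sizes and names are hypothetical. The two inner dimensions are tiled so that the planes reused between consecutive i iterations stay resident in the L1 cache.

```c
/* Cache-blocked 27-point stencil sweep: the j and k dimensions are tiled so
 * that, within a tile column, the two upper planes of one i iteration are
 * still cached when reused as the lower and center planes of the next. */
#define IDX(i, j, k, n) (((size_t)(i) * (n) + (j)) * (n) + (k))

static inline int min_int(int a, int b) { return a < b ? a : b; }

void stencil27_blocked(const double *in, double *out, int n,
                       const double w[3][3][3], int BJ, int BK)
{
    for (int jb = 1; jb < n - 1; jb += BJ)
        for (int kb = 1; kb < n - 1; kb += BK)
            /* Sweep the whole i range for one (jb, kb) tile before moving on. */
            for (int i = 1; i < n - 1; i++)
                for (int j = jb; j < min_int(jb + BJ, n - 1); j++)
                    for (int k = kb; k < min_int(kb + BK, n - 1); k++) {
                        double s = 0.0;
                        for (int di = -1; di <= 1; di++)
                            for (int dj = -1; dj <= 1; dj++)
                                for (int dk = -1; dk <= 1; dk++)
                                    s += w[di + 1][dj + 1][dk + 1] *
                                         in[IDX(i + di, j + dj, k + dk, n)];
                        out[IDX(i, j, k, n)] = s;
                    }
}
```

With a block size of roughly 20 on each side, for example, the three planes touched by a tile occupy on the order of 20 x 20 x 3 x 8 bytes, about 9-10 KB, which fits comfortably in the Power3's 64 KB L1 cache.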
6 Performance Testing

6.1 Titanium

Titanium is a parallel dialect of Java developed at UC Berkeley for high-performance scientific computing. However, only the serial aspects of the language were used for this analysis. Since Titanium compiles to C, we wrote the multigrid skeleton in Titanium, but all the major kernels were written as C source code. The baseline code was unoptimized, naive multigrid code written by us to match the NAS MG benchmark results. All the optimizations were then performed by hand on this C source code. Note that common subexpression elimination is used in all versions of the code except the baseline, since it can only help performance. In addition, for all our tests, the C compiler was instructed to use the "-O3" (most aggressive) optimization setting.

6.2 Architectures

We chose to test our optimizations on four architectures commonly used by the scientific computing community: the IBM Power3, Compaq Alpha, AMD Opteron, and Intel Itanium 2. On each platform, we verified that there was no load on the processor before submitting the job. All timing data was collected using the Titanium timer, which has microsecond granularity. We ran four multigrid V-cycles in each run and used the median of the timing data for our results.
Figure 5. The effect of each optimization on the runtime of each multigrid operator across four different platforms. Memoization offers the greatest benefit on the Power3 and Opteron platforms, while cache blocking is ineffective on all systems except the Itanium.
Figure 6. Reduction in floating point operations required as a result of memoization. Results shown were obtained using the PAPI library to access the hardware counters on the IBM Power3.
Figure 7. Cache blocking performance on different architectures as a function of block size. The Itanium is the only processor for which cache blocking improves execution time.
Figure 8. Analysis of cache blocking performance on the IBM Power3 system. A block size of between 20 and 30 cells on a side (about 15%-33% of the L1 cache) reduces the number of cache misses compared to other block sizes. At the same time, the number of branch mispredictions is high, indicating a possible source of overhead for small block sizes.
6.3 PAPI

The Performance Application Programming Interface (PAPI) library [6] provides a standard interface for accessing hardware counters that collect performance statistics. Unfortunately, only the Power3 processor had PAPI installed, so we were able to perform more detailed performance analysis only on that particular machine. For all the other architectures, we were only able to gather timing data.
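For reference, the following is a minimal sketch of how hardware counters are typically read through PAPI. The preset events shown (floating point operations, L1 data cache misses, branch mispredictions) are the quantities discussed in Section 7, but the surrounding code and the kernel name are our own illustration, and not every preset event is available on every platform.

```c
/* Count FP operations, L1 data cache misses, and branch mispredictions for
 * one kernel invocation using PAPI preset events. Error handling is minimal. */
#include <stdio.h>
#include <papi.h>

extern void evaluate_residual(void);   /* hypothetical kernel under test */

int main(void)
{
    int events[3] = { PAPI_FP_OPS, PAPI_L1_DCM, PAPI_BR_MSP };
    long long counts[3];
    int event_set = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&event_set);
    PAPI_add_events(event_set, events, 3);

    PAPI_start(event_set);
    evaluate_residual();               /* kernel being measured */
    PAPI_stop(event_set, counts);

    printf("FP ops: %lld  L1 D-misses: %lld  Branch mispredicts: %lld\n",
           counts[0], counts[1], counts[2]);
    return 0;
}
```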
7 Results

Due to the diversity of the platforms, none of the optimizations other than common subexpression elimination did well on every architecture (see Figure 5). On the other hand, all of them did improve performance on at least one platform.
Figure 9. FFT and multigrid solver times for obtaining comparable solutions to the same problem. The multigrid time is that for the best performing set of optimizations we found.
7.1 Memoization

The effects of memoization were mixed across platforms. For example, on the Power3, memoization reduces the evaluateResidual running time by 55%, but on the Itanium, it increases the evaluateResidual running time by 18%. This is because memoization is not a clear-cut optimization. While it lowers the floating point and load instruction counts, it also introduces another inner loop and a slight memory overhead. Depending on the architecture, one or the other of these opposing effects will dominate. In the case of the Power3 evaluateResidual method, the floating point instruction count is reduced by 41% (see Figure 6) and there are 32% fewer load instructions. In addition, since the Power3 has a much lower clock rate than the other processors, the reduced floating point count has increased importance in comparison to memory latency. On the Itanium, however, the negative effects of the extra inner loop seem to overwhelm any of the positive aspects of memoization. Results were mixed on the Alphaserver and the Opteron.
7.2 Cache Blocking

Cache blocking is another optimization that is not clear-cut (see Figure 7). The goal of this optimization is to lessen the number of L1 load misses. On the Power3, for instance, this is slightly reduced at a block size of 20, as shown in Figure 8. However, cache blocking also drastically increases the number of mispredicted branches, especially at the smaller block sizes. This most likely overwhelms any benefit resulting from fewer load misses. The only platform where it seems to help is the Itanium, which is surprising considering most of the block sizes that we examine are too large to fit in the Itanium's small L1 cache. It may be that the Itanium has better L2 cache performance than the other processors, since most of the examined block sizes are stored within L1 and L2 cache.

7.3 Comparison with FFT

Unfortunately, the speedups observed for the optimizations tested in this paper did not improve multigrid to the point at which it would be competitive with the FFT for solving the Poisson equation in 3D. Figure 9 demonstrates that our best version of the multigrid code takes about 1.7 times as long as the FFT to solve the same problem with comparable precision on the Power3. Similar conclusions can be drawn for the Itanium, Alpha, and Opteron, as Figure 5 shows no dramatic improvements over the common subexpression elimination optimization for these platforms.

8 Future Work

While the performance improvements observed are not sufficient to place the 3D multigrid solver on par with the optimized FFT solver, they were still useful in validating and further developing our performance model. Eliminating redundant floating point operations through memoization, while performing slightly more stores, appears to be a worthwhile tradeoff for most architectures. Cache blocking was surprisingly ineffective. It may be that L1 cache sizes on current processors are too small to support a reasonable block size for the 27 point stencil, or (as we suspect for the Power3) that the loop overhead associated with cache blocking triggers undesirable performance from other aspects of the architecture. Memory access remains the performance bottleneck. Explicit prefetch commands, or stream buffers that predict access patterns [5], might improve this situation by eliminating cache misses. Future work could focus on analyzing real or theoretical multigrid performance on architectures supporting such features.
References

[1] B. L. Chamberlain, S. J. Deitz, and L. Snyder. "A Comparative Study of the NAS MG Benchmark across Parallel Languages and Architectures", Proc. of Supercomputing (SC), p. 67, 2000.

[2] J. W. Davidson and S. Jinturkar. "An Aggressive Approach to Loop Unrolling", Technical Report CS-95-26, Department of Computer Science, University of Virginia, Charlottesville, June 1995.

[3] C. Grelck. "Implementing the NAS Benchmark MG in SAC", IPDPS Workshops, April 15-19, 2002.

[4] M. S. Lam, E. E. Rothberg, and M. E. Wolf. "The Cache Performance and Optimizations of Blocked Algorithms", Proceedings of ASPLOS IV, pp. 63-74, April 1991.

[5] S. Palacharla and R. E. Kessler. "Evaluating Stream Buffers as a Secondary Cache Replacement", Proceedings of the International Symposium on Computer Architecture, 1994.

[6] Performance Application Programming Interface (PAPI). http://icl.cs.utk.edu/papi/

[7] G. Rivera and C. Tseng. "Tiling Optimizations for 3D Scientific Computations", Supercomputing, 2000.

[8] S. Sellappa and S. Chatterjee. "Cache-Efficient Multigrid Algorithms", International Journal of High Performance Computing Applications, Special issue on automatic performance tuning, 2002.