A Scalable Parallel SSOR Preconditioner for Efficient Lattice Computations in Gauge Theories

N. Eicker^a, A. Frommer^b, H. Hoeber^c, Th. Lippert^a, B. Medeke^b, G. Ritzenhofer^a and K. Schilling^{a,c}

^a HLRZ, c/o Research Center Jülich, D-52425 Jülich, Germany
^b Department of Mathematics, University of Wuppertal, D-42097 Wuppertal, Germany
^c Department of Physics, University of Wuppertal, D-42097 Wuppertal, Germany

Key Words: SSOR, preconditioning, local lexicographic, lattice QCD, Krylov subspace solver

We extend the parallel SSOR procedure for the efficient preconditioning of modern Krylov subspace solvers [1], recently introduced in [2], towards higher order, quantum-improved discretization schemes [3] for lattice quantum chromodynamics [4].
1. Solving linear systems in LGT

Lattice gauge theory (LGT) deals with the controlled numerical evaluation of gauge theories like quantum chromodynamics (QCD) put on a 4-dimensional space-time grid which, in the low-energy regime, cannot be solved by non-perturbative analytical methods. QCD [5] is considered the fundamental theory of the strong force that binds quarks and gluons to form the known hadrons like the proton or the neutron. Very large scale LGT simulations become more and more essential to provide theoretical input for current and future accelerator experiments that attempt to observe new physics beyond the Standard Model of elementary particle physics [6]. The heaviest computational demands in LGT arise from the repeated solution of a huge system of linear equations,

    M x = φ ,    (1)

with M being the quark matrix (in analogy to the discretized Laplace equation of classical electrodynamics). Its solution, a Green's function x, describes the time behavior of the quarks [7] and allows one to extract physical observables like hadron masses or decay constants. The size of the solution vector is of order O(10^7) elements in today's state-of-the-art simulations. The coefficients of the discretized differential operators in M are taken from a stochastic background field that represents the gluons in lattice QCD. Since it turned out that multigrid methods are not very useful for improving the speed of the computation of x, attempts have been made to use preconditioning techniques in order to accelerate Krylov subspace solvers like BiCGStab, the current method of choice [8].
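As an illustration of this computational kernel (our sketch, not part of the original text), the following toy example solves a much smaller, real-valued stand-in for (1) with SciPy's BiCGStab; the random sparse matrix, its size, and the diagonal shift 4.0 are placeholders for the complex, O(10^7)-dimensional quark matrix:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # A diagonally dominant random sparse matrix stands in for the quark matrix M;
    # in a real simulation its couplings come from the gluonic background field.
    n = 1000
    M = sp.random(n, n, density=5e-3, format="csr") + 4.0 * sp.identity(n)
    phi = np.random.rand(n)                  # stand-in for the source vector

    x, info = spla.bicgstab(M, phi)          # info == 0 signals convergence
    print("converged" if info == 0 else f"info = {info}")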
1.1. Locally Lexicographic SSOR Preconditioning.
Recently, some of us have introduced a parallel preconditioning technique called the locally lexicographic SSOR preconditioner (ll-SSOR) [2]. As opposed to familiar multicolor SSOR preconditioners, which produce a decoupling of variables on a very fine-grained level, the ll-SSOR method reduces the decoupling to the minimum necessary to achieve a given parallelism. In this sense, the approach is scalable, i.e., it scales with the machine size. As for any SSOR preconditioner, the Eisenstat trick [9] is crucial to the efficient implementation of the preconditioner. Our numerical experiments have shown that ll-SSOR leads to the fastest known calculations of fermion Green's functions on current parallel computers [10], if M represents the widely used standard Wilson fermion matrix. The SSOR preconditioner is already applied on the Italian supercomputer APE100/Quadrics and on the Cray T3E within the large scale simulation project SESAM [11,12].
1.2. Parallel ll-SSOR Preconditioning for Improved Fermionic Actions.
Recently, improved actions have generated a lot of interest in LGT [13]. In improved actions, higher order differencing schemes are exploited in order to decrease systematic discretization effects. However, due to the quantum nature of QCD, improved discretizations require a careful tuning of coefficients in order to gain an improvement. In general, improved actions can be written in the form
    M = A + B + C + ... ,    (2)
where A represents diagonal blocks (containing 12 × 12 sub-blocks of internal degrees of freedom), B stands for a nearest-neighbor hopping term, C for a next-to-nearest-neighbor term, and so on. Usually, third-neighbor terms are truncated. The new distinctive features of the present work are the inclusion of the internal degrees of freedom of the block-diagonal term A into the ll-SSOR process and the further inclusion of next-to-nearest-neighbor terms in the action. A particularly nice feature of ll-SSOR is that the latter are dealt with in a very simple manner, whereas multicolor approaches would get increasingly complex due to an increasing number of colors to be used. For the so-called clover action, we compare improvements of iteration numbers and real-life performance results of the new type of preconditioner with standard even-odd (or red-black) preconditioning [14].
2. Preconditioning

The main computational effort in iterative methods for matrix inversion (i.e., the solution of (1)) lies in the matrix-vector multiplication, which is e.g. carried out twice every iteration in the BiCGStab case. In order to improve the performance of the inversion method, the number of matrix-vector multiplications and hence the number of iterations has to be reduced. This can be achieved by replacing the system to be solved by a preconditioned system, in which the matrix M and the vectors x and φ are replaced by the preconditioned quantities M̃, x̃ and φ̃. The aim of this replacement is to obtain a better conditioned system, which is cheaper to solve than the original one. From the solution of the preconditioned system the solution of the original system can readily
be constructed; an overall gain in performance remains. Technically, the implication for the algorithm is the replacement of the matrix-vector multiplication:

    v_i = M p_i   ⟹   solve P z_i = p_i ,  v_i = M z_i .

Here P represents a matrix which is a result of the preconditioning method. Introducing the 'Eisenstat trick' [9], the matrix P is decomposed into a product of three regular matrices R, S and T. Defining a fourth regular matrix K with

    P = R S T ,   K = R + T − M   ⟹   M = R + T − K ,

the result is the replacement:

    v_i = M p_i   ⟹   solve S z_i = p_i
                      solve T ẑ_i = p_i
                      solve R u_i = z_i − K ẑ_i        (3)
                      v_i = ẑ_i + u_i .
2.1. SSOR preconditioning
As already mentioned in [2], the method of choice is Symmetric Successive Over-Relaxation (or, more precisely, symmetric Gauß-Seidel) preconditioning. For clarity we recall the main definitions here. This method needs a decomposition of the matrix M into a (block-)diagonal part D and strictly upper and lower triangular parts U and L, respectively:

    M = D − L − U .

In the next sections we will go into detail about the underlying ordering. The choices for R, S and T in this method are:

    R = (1/ω) D − L ,   S = ((2 − ω)/ω) D^{-1} ,   T = (1/ω) D − U .

Here ω is an over-relaxation parameter to be chosen appropriately. These settings lead to:

    K = R + T − M = (1/ω) D − L + (1/ω) D − U − D + L + U = ((2 − ω)/ω) D .
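As a quick consistency check of these definitions (our illustration, not part of the paper), the following sketch builds R, S, T and K from a small random matrix and verifies K = R + T − M = ((2 − ω)/ω) D numerically:

    import numpy as np

    n, w = 8, 1.4
    M = np.random.rand(n, n) + n * np.eye(n)   # diagonally dominant toy matrix
    D = np.diag(np.diag(M))                    # (block-)diagonal part, here 1x1 blocks
    L = -np.tril(M, -1)                        # strictly lower part, M = D - L - U
    U = -np.triu(M, 1)                         # strictly upper part

    R = D / w - L
    S = (2 - w) / w * np.linalg.inv(D)
    T = D / w - U
    K = R + T - M
    assert np.allclose(K, (2 - w) / w * D)     # K is (block-)diagonal, as claimed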
These special choices of R, S and T render the solution of the three linear equations in (3) quite easy. The solution of T ẑ_i = p_i can formally be written as:

    T ẑ_i = ((1/ω) D − U) ẑ_i = p_i   ⟹   ẑ_i = ω D^{-1} (p_i + U ẑ_i) .    (4)

Due to the fact that T only consists of diagonal elements and elements of the strictly upper triangular matrix, the right-hand side of this formal solution (4) is well-defined: in the last (Nth) equation it does not depend on ẑ, which means that this equation can be solved. Because the (N − 1)st equation only depends on ẑ_N, which is known from the last equation, this equation can be solved, too. The term 'backward solve' seems quite natural for this method. The 'forward solving' method for R u_i = z_i − K ẑ_i is
straightforward. Since S is the inverse of a (block-)diagonal matrix, the first solution is easy to achieve, too. Altogether, this leads to the following replacement of the matrix-vector multiplication (after some redefinitions of the internal vectors of the BiCGStab algorithm):
    S z_i = p_i              multiply diagonal
    T ẑ_i = p_i              solve backward          (5)
    R u_i = z_i − K ẑ_i      solve forward
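The following minimal dense sketch (our illustration, not the paper's code) carries out the three steps of (5) with NumPy/SciPy for the case of a pure diagonal D; a production code would of course never form R and T explicitly, but exploit the sparse lattice structure:

    import numpy as np
    from scipy.linalg import solve_triangular

    def ssor_precond_matvec(d, L, U, w, p):
        """Apply the three solves of (5). d: 1-D array of diagonal entries of D,
        L/U: strictly lower/upper parts of M = D - L - U, w: relaxation parameter."""
        R = np.diag(d) / w - L                       # lower triangular
        T = np.diag(d) / w - U                       # upper triangular
        z = (w / (2.0 - w)) * d * p                  # S z = p: multiply by a diagonal
        z_hat = solve_triangular(T, p, lower=False)  # backward solve T zhat = p
        k = (2.0 - w) / w * d                        # diagonal of K = ((2-w)/w) D
        u = solve_triangular(R, z - k * z_hat, lower=True)  # forward solve
        return z_hat + u                             # v = zhat + u, cf. (3)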
2.2. ll-Ordering
In order to render a matrix strictly upper or lower triangular, the matrix elements need a specification of the ordering. Since different ordering schemes were already discussed in [2], we will only treat the locally lexicographic (ll) ordering here. This scheme seems to be the most natural on parallel SIMD computers like APE100/Quadrics [15] or the Columbia machine [16]. Assuming the processors of the parallel computer to be connected as a p_1 × p_2 × p_3 × p_4 4-dimensional grid¹, the space-time lattice can be matched to the processor grid in a natural way, using a local lattice of size n_loc = n_1^loc × n_2^loc × n_3^loc × n_4^loc with n_i^loc = n_i / p_i on each processor. For simplicity we will assume that, for each i, p_i divides n_i and n_i^loc ≥ k ≥ 2. There will be some remarks on k in section 2.3.
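A few lines of bookkeeping make this mapping concrete; the lattice and grid extents below are hypothetical example values, chosen only for illustration:

    n = (16, 16, 16, 32)                              # global lattice extents n_i (examples)
    p = (2, 2, 2, 4)                                  # processor grid p_1 x p_2 x p_3 x p_4
    assert all(ni % pi == 0 for ni, pi in zip(n, p))  # each p_i must divide n_i
    n_loc = tuple(ni // pi for ni, pi in zip(n, p))   # local lattice per processor
    assert all(nl >= 2 for nl in n_loc)               # n_i^loc >= k >= 2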
[Figure 1. Locally lexicographic ordering and forward solve in 2 dimensions.]

We will now introduce the same lexicographic ordering on the n_loc lattice points of each processor. We call this ll-ordering. A 2-dimensional case is depicted in figure 1 (with n_1^loc = n_2^loc = 4). The arrows show dependencies between the lattice sites during a forward solving step. They indicate that site a does not depend on any other site, whereas site h depends on sites d and g on the same processor and on site e on the next processor in the 1-direction. The major advantage of this ordering is that all sites with the same lexicographic character see the same local topology, which gives the opportunity for an efficient implementation on a SIMD parallel machine. Further discussion of the improvement due to this ordering was presented in [2].

¹ This also covers the case of a lower-dimensional grid, since p_i can be set to 1.
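To make the lockstep pattern of figure 1 concrete, here is a 2-dimensional toy sketch (ours, with a made-up scalar coupling c and a periodic processor grid; np.roll stands in for inter-processor communication). All processors sweep their local sites in the same lexicographic order; a neighbor coupling enters the forward solve only if it points to a site with lower ll-character, which for sites on the upper local boundary is the already-updated site on the next processor:

    import numpy as np

    P1, P2 = 4, 4                     # processor grid (axes 0, 1)
    N1, N2 = 4, 4                     # local lattice per processor (axes 2, 3)
    c = 0.1                           # toy nearest-neighbor coupling strength
    rhs = np.random.rand(P1, P2, N1, N2)
    x = np.zeros_like(rhs)

    for i in range(N1):               # all processors visit local site (i, j)
        for j in range(N2):           # simultaneously, in lexicographic order
            acc = rhs[:, :, i, j].copy()
            if i > 0:                 # backward neighbors on the same processor
                acc += c * x[:, :, i - 1, j]
            if j > 0:
                acc += c * x[:, :, i, j - 1]
            if i == N1 - 1:           # forward neighbor wraps to site (0, j) on the
                acc += c * np.roll(x, -1, axis=0)[:, :, 0, j]   # next processor; it
            if j == N2 - 1:           # was already updated earlier in this sweep
                acc += c * np.roll(x, -1, axis=1)[:, :, i, 0]
            x[:, :, i, j] = acc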
2.3. Next-to-nearest-neighbors and less local terms
The efficiency of the ll-ordered algorithm is due to the fact that all nearest neighbors of a lattice point have an ll-character different from that point, because n_i^loc ≥ 2. When performing forward or backward solves with only nearest-neighbor couplings, this fact implies that lattice points with the same ll-character can be worked upon in parallel, leading to an optimal parallelism. Next-to-nearest-neighbor couplings cannot be treated in the same way if the local lattice has at least one direction with a local size of only n_i^loc = 2. In this case, inside the forward or backward solving step the right-hand side of (4) is not well-defined, since it depends on information located on sites with identical ll-character, i.e. on information that is just being generated. This clash can be avoided by enlarging the local lattice size. If the longest coupling in the matrix M ranges to a site located c lattice spacings from the present lattice site, the local lattice size has to be chosen as n_i^loc ≥ k = c + 1. This leads to the conclusion that ll-ordering works for any sufficiently local matrix M, if the local lattice size is chosen large enough. In particular, the next-to-nearest-neighbor term C in (2) can readily be included in the ll-SSOR preconditioning scheme, simply by choosing the local lattice size larger than 2.
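Expressed as a (hypothetical) helper, the rule of this section reads:

    def ll_min_local_size(c):
        """Smallest admissible local extent when the longest coupling in M
        reaches c lattice spacings: n_i^loc >= k = c + 1 (section 2.3)."""
        return c + 1

    assert ll_min_local_size(1) == 2   # nearest-neighbor only (Wilson, clover)
    assert ll_min_local_size(2) == 3   # with the next-to-nearest term C in (2)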
2.4. Local terms
A nontrivial local term A requires further modifications of the algorithm. In [2] this local term was proportional to the identity. This implied that the choice of D in the SSOR scheme was proportional to the identity, too, which led to a major simplification of the diagonal multiplication step in (5) and replaced the D^{-1} matrix multiplication in the forward/backward solve by a simple scalar multiplication. Because D was diagonal, the 12 internal entries of the local part A were decoupled and could be treated simultaneously. The situation changes dramatically if A is not a pure diagonal, which is the situation considered in this work. First of all, new freedom is introduced in the layout of the algorithm, because there is now a real choice of how the local term A is separated into the diagonal term D and the upper and lower terms U and L, respectively. Since for an efficient implementation the inverse D^{-1} of the diagonal term has to be stored, this choice can be exploited to control the memory overhead. Furthermore, depending on the choice of the diagonal term, the entries of the local part no longer remain independent of each other. They can only be treated in parallel if the diagonal part is chosen as D = A, the choice with the largest memory overhead. For any other choice, an additional ordering controlling these entries has to be introduced, and they have to be treated following the given scheme. Since the entries are all local, this ordering is straightforward. As we will see later, the performance of the algorithm depends only weakly on the choice of the diagonal part.
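A sketch (ours, with dense placeholder storage) of how the block size b chosen for D trades memory against coupling: the 12 × 12 local block A is split into 12/b diagonal blocks whose inverses must be stored, so b = 12 (i.e. D = A) costs 144 complex entries per site, while b = 1 costs only 12:

    import numpy as np

    def invert_diagonal_blocks(A_local, b):
        """A_local: dense 12x12 local block; returns the inverted b x b
        diagonal blocks of D. b = 12 corresponds to the choice D = A."""
        assert 12 % b == 0
        return [np.linalg.inv(A_local[k:k + b, k:k + b]) for k in range(0, 12, b)]

    for b in (1, 3, 6, 12):
        print(b, (12 // b) * b * b)   # entries per site for D^{-1}: 12, 36, 72, 144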
3. Results

As a test we have implemented an inverter for the clover action. The structure of the related quark matrix is:

    S = a^4 Σ_x { ψ̄(x) [ 1 + (c_SW/2) Σ_{μν} σ_{μν} F_{μν}(x) ] ψ(x)
          − κ Σ_μ [ ψ̄(x)(1 + γ_μ) U_μ†(x − μ̂) ψ(x − μ̂) + ψ̄(x)(1 − γ_μ) U_μ(x) ψ(x + μ̂) ] } .    (6)
Clearly, because this matrix contains no next-to-nearest-neighbor couplings, we can use a local lattice size with extension 2 in one or more directions. As an extension of the Wilson case described in [2], this matrix exhibits a local term, representing the 12 × 12 block described above. This local term has the structure

    F = ( 1 + F_1    F_2       F_3       F_4
          F_2†       1 − F_1   F_4†      −F_3
          F_3        F_4       1 + F_1   F_2
          F_4†       −F_3      F_2†      1 − F_1 ) ,   with F_i complex 3 × 3 matrices.    (7)

This fact can be used for an effective implementation of the algorithm.
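For illustration (our code, not the paper's), the 12 × 12 block of (7) can be assembled from the four 3 × 3 colour matrices and then handed to the block-inversion helper sketched in section 2.4:

    import numpy as np

    def clover_block(F1, F2, F3, F4):
        """Assemble the 12x12 local term (7) from complex 3x3 matrices F1..F4."""
        I = np.eye(3, dtype=complex)
        dag = lambda F: F.conj().T
        return np.block([
            [I + F1,   F2,      F3,       F4    ],
            [dag(F2),  I - F1,  dag(F4),  -F3   ],
            [F3,       F4,      I + F1,   F2    ],
            [dag(F4),  -F3,     dag(F2),  I - F1],
        ])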
[Figure 2. Number of iterations against κ (range 0.13–0.134) for the implementation with pure diagonal D (1 × 1 blocks, left) and block-diagonal D (3 × 3 blocks, right); curves for ω = 1.0, ω = 1.4 and EO.]

We implemented the ll-SSOR preconditioning scheme on the APE100/Quadrics machine, applying the BiCGStab algorithm. We tested the inverter on a 16^4 system using three different choices for D: one being the pure diagonal that consists of twelve 1 × 1 blocks, another using the 1 ± F_1 (i.e. 3 × 3) blocks as shown in (7), and one with 6 × 6 blocks. The clover parameter was chosen as c_SW = 1.769. We tested the performance
of the algorithm for different values of κ. For the testing we used eleven different field configurations. The tests were done on the APE Q4 in Wuppertal, a 32-node system. The left diagram of Figure 2 shows iteration numbers for the 1 × 1 implementation. EO represents the results for the even/odd inverter, ω = 1.0 shows the numbers for the case without overrelaxation, and ω = 1.4 the numbers for the optimal value of ω. The right diagram in Figure 2 depicts the same for the 3 × 3 case. The reduction of the number of iterations is about a factor of 2 against EO in the non-overrelaxed case; overrelaxation gives another 10–20%.
[Figure 3. Number of iterations (left) and runtime in sec. (right) against the overrelaxation parameter ω for κ = 0.145, comparing the 1 × 1, 3 × 3 and 6 × 6 block choices with EO. For EO, the solid line in the middle represents the value for this algorithm, the upper and lower dashed lines the errors.]

The dependence on the overrelaxation parameter ω is shown in figure 3. The diagram on the left-hand side gives the numbers of iterations for different values of ω for the 1 × 1, the 3 × 3 and the 6 × 6 case, compared to the even/odd case. While the difference between the three choices of D is quite small, the gain of about 10–20% due to overrelaxation is easy to see. The runtimes of the inverters are depicted in the diagram on the right-hand side of figure 3. The values for the ll-SSOR preconditioned inverters stay below the line for the even/odd result in the region of the optimal ω. This diagram demonstrates clearly that ll-SSOR preconditioning not only decreases the condition number significantly but also allows a quite effective implementation even on the APE100/Quadrics supercomputer, and hence on almost all SIMD machines. Furthermore, since the difference in performance between the 1 × 1, 3 × 3 and 6 × 6 choices of D is quite small, but the difference in memory overhead quite large, ll-SSOR preconditioning turns out to be a very efficient tool to reduce memory overhead. This becomes more and more important, since the memory vs. compute-power ratio on upcoming special-purpose supercomputers for LGT is expected to decrease.
4. Conclusions

We have extended the application of the ll-SSOR preconditioning scheme to the case of improved actions with non-trivial diagonal and next-to-nearest-neighbor couplings. Since the speedup against EO is larger than 2, ll-SSOR beats even/odd preconditioning. Furthermore, our method gives a large gain in memory usage without paying too much on the compute-time side. It is evident that the method exhibits good scaling behavior, since the parallelism is adapted to the parallel system's size.
REFERENCES

1. H. van der Vorst, 'Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Non-symmetric Linear Systems', SIAM J. Sci. Stat. Comput. 13 (1992) 631.
2. S. Fischer, A. Frommer, U. Glässner, Th. Lippert, G. Ritzenhofer, and K. Schilling, 'A Parallel SSOR Preconditioner for Lattice QCD', Comp. Phys. Comm. 98 (1996) 20.
3. T. DeGrand, 'Lattice Gauge Theory for QCD', preprint hep-ph/961039, http://xxx.lanl.gov/abs/hep-ph/961039.
4. M. Creutz, Quarks, Gluons and Lattices (Cambridge University Press, Cambridge, 1983).
5. P. M. Zerwas and H. A. Kastrup (eds.), QCD 20 Years Later (World Scientific, Singapore, 1992).
6. Particle Data Group, Phys. Rev. D 54 (1996) 1.
7. Th. Lippert, K. Schilling, and N. Petkov, 'Quark Propagator on the Connection Machine', Parallel Computing 18 (1992) 1291.
8. A. Frommer, V. Hannemann, Th. Lippert, B. Nöckel, and K. Schilling, 'Accelerating Wilson Fermion Matrix Inversions by Means of the Stabilized Biconjugate Gradient Algorithm', Int. J. Mod. Phys. C 5 (1994) 1073.
9. S. Eisenstat, 'Efficient Implementation of a Class of Preconditioned Conjugate Gradient Methods', SIAM J. Sci. Stat. Comput. 2 (1981) 1.
10. S. Fischer, A. Frommer, U. Glässner, S. Güsken, H. Hoeber, Th. Lippert, G. Ritzenhofer, K. Schilling, G. Siegert, and A. Spitz, 'A Parallel SSOR Preconditioner for Lattice QCD', in C. Bernard, M. Golterman, M. Ogilvie, and J. Potvin (eds.): Proceedings of Lattice '96, Nucl. Phys. B (Proc. Suppl.) 53 (1997) 990.
11. N. Attig, S. Güsken, P. Lacock, Th. Lippert, K. Schilling, P. Ueberholz, and J. Viehoff, 'Highly Optimized Code for Lattice Quantum Chromodynamics on Cray T3E', this conference.
12. SESAM Collaboration: U. Glässner, S. Güsken, H. Hoeber, Th. Lippert, X. Luo, G. Ritzenhofer, K. Schilling, and G. Siegert, 'QCD with Dynamical Wilson Fermions – First Results from SESAM', in T. D. Kieu, B. H. J. McKellar, and A. J. Guttmann (eds.): Proceedings of Lattice '95, Nucl. Phys. B (Proc. Suppl.) 47 (1996) 386.
13. B. Sheikholeslami and R. Wohlert, Nucl. Phys. B 259 (1985) 572.
14. T. DeGrand and P. Rossi, Comp. Phys. Comm. 60 (1990) 211.
15. A. Bartoloni et al., 'The Software of the APE100 Processor', Int. J. Mod. Phys. C 4 (1993) 969, and references therein.
16. I. Arsenin et al., Nucl. Phys. B (Proc. Suppl.) 47 (1996) 804, and references therein.