A Domain-Decomposition Message-Passing Approach to Transient Viscous Incompressible Flow using Explicit Time Integration

Mark A. Christon
Incompressible Fluid Dynamics, Sandia National Laboratories
M/S 0826, P.O. Box 5800, Albuquerque, New Mexico 87185-0826
PH.: (505) 844-8520, FAX: (505) 844-4523
E-mail:
[email protected]

Published in: Computer Methods in Applied Mechanics and Engineering, vol. 148, pp. 329-352, 1997
This work was supported by the U.S. D.O.E. under Contract DE-AC04-94AL85000.
Abstract
This paper discusses the design and implementation of solution algorithms suitable for performing transient, incompressible, viscous flow simulations on massively parallel computers. The finite element formulation for incompressible flow, along with the ad-hoc modifications for explicit time integration, is discussed with an emphasis on implementation aspects for achieving scalable, parallel computations. The issues associated with a domain-decomposition message-passing paradigm are presented in the context of explicit time integration and the solution of a pressure Poisson equation using a staggered grid. An element-by-element conjugate gradient algorithm for the pressure equation is outlined with a parallel sub-domain preconditioner. Sample calculations are presented with scaled speedup and efficiency results to demonstrate the scalability of the domain-decomposition message-passing approach.
1 Introduction

The simulation of flow fields about vehicles and in turbomachinery remains one of the computational grand challenges defined by the Federal High Performance Computing Program[1]. An example of this class of computational fluid dynamics problem is the simulation of the time-dependent flow around a vehicle such as a submarine or an automobile. In order to compute the flow field for a vehicle such as a submarine, it is anticipated that over one million elements will be required to resolve important flow-field features such as shed vortices in regions of separated flow - even for moderate Reynolds numbers (Re). Further, the complex geometry associated with the exterior of a submarine or automobile demands the use of unstructured grids and can require O(10^6) elements just to resolve important details of the geometry with a computational domain of sufficient spatial extent that boundary effects are minimized. In addition to the high degree of spatial discretization, the temporal resolution for this class of problem is also demanding, ultimately requiring the effective mapping of
flow-solution algorithms to massively parallel supercomputer architectures. Aside from the issues associated with mapping flow solution algorithms to parallel computer architectures, the incompressible Navier-Stokes equations pose their own unique challenges for the design and implementation of efficient flow solution algorithms. An example of this is the incompressibility constraint, which imposes a global elliptic nature on the conservation equations and inhibits the use of spatially localized solution strategies commonly employed for compressible flows. The promise of massively parallel computers, i.e., computers with greater than 1024 processing elements (PEs), is to render soluble flow problems that have heretofore been intractable. Emerging TeraFLOP supercomputers with O(4500) PEs and aggregate memory of over 512 GigaBytes (GB) (or 64 x 10^9 64-bit words) promise to make the solution of incompressible flow problems using unstructured meshes with over 10^7 grid points a reality in the near future. However, solving transient incompressible flows on distributed memory parallel computers requires effective treatment of the global aspects of the incompressible Navier-Stokes equations, i.e., the elliptic pressure field.
The primary focus of this work is upon the application of a domain-decomposition message-passing (DDMP) approach to solving the transient, viscous, incompressible Navier-Stokes equations using a modified Galerkin finite element formulation. In this work, the DDMP strategy is applied to an explicit time integration algorithm making use of ad-hoc modifications such as single-point integration, hourglass stabilization, and a lumped mass matrix. Recent examples of similar algorithmic modifications may be found in Kovacs and Kawahara[2] and Palanisamy and Kawahara[3]. The ensuing discussion begins with an overview of the finite element formulation. In section 3, the DDMP approach is outlined, and the element-by-element preconditioned conjugate gradient algorithm for solving the pressure Poisson equation (PPE) is developed. Results for several two- and three-dimensional calculations are presented with performance data to demonstrate the scalability of the DDMP approach. Finally, a summary of the parallel issues surrounding the explicit DDMP solution strategy is presented and conclusions drawn.
2 Finite Element Formulation

The conservation equations for isothermal, time-dependent, laminar, incompressible, viscous flow are
\nabla \cdot u = 0,    (1)

\partial u / \partial t + u \cdot \nabla u = -\nabla P + \nu \nabla^2 u + f,    (2)

where u = (u, v, w) is the velocity, ν is the kinematic viscosity, f is the body force, p is the pressure, ρ is the mass density, and P = p/ρ. For the purposes of this discussion, the constant property form of the conservation equations is sufficient because the extension of the DDMP formulation to include variable properties and multiple scalar transport equations is straightforward. The system of equations above is subject to boundary conditions that consist of specified velocity on Γ₁ as in equation (3), or pseudo-traction boundary conditions on Γ₂ as in equations (4) and (5).

u = \hat{u}  on  \Gamma_1    (3)

-P + \nu \, \partial u_n / \partial n = f_n  on  \Gamma_2    (4)

\nu \, \partial u_\tau / \partial \tau = f_\tau  on  \Gamma_2    (5)
Here, Γ = Γ₁ ∪ Γ₂ is the boundary of the domain as shown in Figure 1. ∂u/∂n and ∂u/∂τ represent the derivative of u in the normal (n) and tangential (τ) directions respectively. Similarly, f_n and f_τ represent the normal and tangential components of the boundary traction. Homogeneous traction boundary conditions correspond to the well-known natural boundary conditions that are typically applied at outflow boundaries.

For a well-posed flow problem, the prescribed initial velocity field in equation (6) must satisfy equations (7) and (8) (see Gresho and Sani[4]). If Γ₂ = ∅ (the null set, e.g., enclosure flows with n·u specified on all surfaces), then global mass conservation enters as an additional solvability constraint as shown in equation (9).

u(x, 0) = u^0(x)    (6)

\nabla \cdot u^0 = 0    (7)

n \cdot u(x, 0) = n \cdot u^0(x)    (8)

\int_\Gamma n \cdot u^0 \, d\Gamma = 0    (9)

The spatial discretization of the conservation equations is achieved using the Q1P0 element with bilinear support for velocity and piecewise constant support for the pressure in two dimensions. In three dimensions, the velocity support is trilinear with piecewise constant support for pressure. The methods for obtaining the weak form of the conservation equations are well known and will not be repeated here (see, for example, Gresho, et al.[5], Hughes[6], and Zienkiewicz and Taylor[7]). The spatially discrete forms of equations (1) and (2) are
C^T u = 0,    (10)

M \dot{u} + A(u)u + K u + C P = F,    (11)

where M is the unit mass matrix, A(u) and K are the advection and viscous diffusion operators respectively, and F is the body force. C is the gradient operator, and C^T is the divergence operator. Here, u and P are understood to be discrete approximations to the continuous velocity and pressure fields. Equations (10) and (11) constitute a differential-algebraic system of equations that precludes the direct application of time-marching algorithms due to the presence of the discrete incompressibility constraint. Following Gresho, et al.[8], a consistent, discrete pressure Poisson equation (PPE) is constructed using a row-sum lumped mass matrix, M_l:

[C^T M_l^{-1} C] P = C^T M_l^{-1} [F - K u - A(u)u].    (12)

The PPE constitutes an algebraic system of equations that is solved for the element-centered pressure during the time-marching procedure. Figure 2 shows the dual, staggered grid associated with the pressure variables. The PPE in equation (12) incorporates
the effect of the essential velocity boundary conditions from equation (3), and automatically builds in the boundary conditions from equations (4) and (5) (see Gresho, et al.[5]). Equations (11) and (12) form the basis for a description of the explicit time integration algorithm. It is assumed that the explicit algorithm begins with a given divergence-free velocity field, u^0, that satisfies the essential boundary conditions, and an initial pressure, P^0. The explicit algorithm proceeds as follows.

1. Calculate the partial acceleration, i.e., the acceleration neglecting the pressure gradient, at time level n:

   \tilde{a}^n = M_l^{-1} \tilde{F}^n,    (13)

   where

   \tilde{F}^n = F^n - K u^n - A(u^n) u^n.    (14)

2. Solve the global PPE for the current pressure field:

   [C^T M_l^{-1} C] P^n = C^T \tilde{a}^n.    (15)

3. Update the nodal velocities:

   u^{n+1} = u^n + \Delta t \, [\tilde{a}^n + M_l^{-1} C P^n].    (16)
4. Repeat steps 1-3 until a maximum simulation time limit or maximum number of time steps is reached.
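To make the data flow of steps 1-3 concrete, the following minimal Python sketch advances the semi-discrete system one step. The helper routines (assemble_rhs, apply_divergence, ppe_solve, apply_gradient) and the lumped-mass vector Ml are assumptions of this sketch and are not defined in the paper; the row-sum lumped mass makes M_l^{-1} an elementwise divide.

    def explicit_step(u, dt, Ml, assemble_rhs, apply_divergence, ppe_solve, apply_gradient):
        """One forward-Euler step of the explicit algorithm, equations (13)-(16).

        assemble_rhs(u)     -> F - K u - A(u)u                      (eq. (14))
        apply_divergence(a) -> C^T a                                (right-hand side of eq. (15))
        ppe_solve(b)        -> P solving [C^T Ml^{-1} C] P = b      (eq. (15))
        apply_gradient(P)   -> C P                                  (pressure-gradient term in eq. (16))
        """
        a_tilde = assemble_rhs(u) / Ml                      # partial acceleration, eq. (13)
        P = ppe_solve(apply_divergence(a_tilde))            # element-centered pressure, eq. (15)
        u_new = u + dt * (a_tilde + apply_gradient(P) / Ml) # velocity update, eq. (16)
        return u_new, P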
Remark
For the computations presented in the results section below, the prescribed initial conditions and boundary conditions are tested and, if necessary, a projection to a divergence-free subspace is performed on the initial velocity field, u^0. This guarantees that the flow problem is well-posed, even if the user-prescribed initial conditions violate the conditions of equations (7) - (9) above. In practice, the criterion for performing a divergence-free projection is based upon the RMS divergence error,

\sqrt{ (C^T u) \cdot (C^T u) / N_{el} } \le \epsilon,    (17)

where N_el is the number of elements and ε is a user-specified tolerance, typically 10^{-10} to 10^{-8}. If the RMS divergence error is greater than the specified tolerance for the initial candidate velocity field, û^0, then the PPE problem in equation (18) is solved for λ, and a mass-consistent projection is performed using equation (19):

[C^T M_l^{-1} C] \lambda = C^T \hat{u}^0,    (18)

u^0 = \hat{u}^0 - M_l^{-1} C \lambda.    (19)

The explicit algorithm must respect both diffusive and convective stability limits. Although the analytical stability limits for the explicit time integration of the Navier-Stokes equations in multiple dimensions remain intractable[8], approximate stability computations may be performed using local grid metrics. In an unstructured grid with variable element size, the calculation of the grid Re (Reynolds) and CFL (Courant-Friedrichs-Lewy) numbers uses the element-local coordinates and centroid velocities. Figure 3 shows the canonical element-local node-numbering scheme, coordinate system and centroid velocity for the 2-D and 3-D elements. The grid Re and CFL numbers are defined as

Re_i = |u \cdot h_i| / (2\nu),    (20)

CFL_i = |u \cdot h_i| \, \Delta t / \|h_i\|^2,    (21)
where i = ξ, η, ζ are the element-local coordinate directions. The grid Re and CFL numbers rely upon the projection of the centroid velocity onto the element-local coordinate directions that are oriented according to the canonical local node numbering scheme for each element type[9, 10]. In order to use equation (21) to estimate a stable time step, a unit vector for each element-local coordinate direction is defined as

\hat{e}_i = h_i / \|h_i\|,    (22)

where ê_i denotes the unit vector for each of the (ξ, η, ζ) coordinate directions. Using the grid Re and the element size, h, the advective-diffusive stability limit becomes

\Delta t_i \le \frac{\|h_i\|^2}{2\nu} \left[ 1 + \sqrt{1 + (Re_i)^2} \right]^{-1},    (23)

where a minimum over all elements and all element-local coordinate directions establishes a global minimum time step. The advective stability limit is established in a similar manner using

\Delta t_i \le CFL \, \frac{\|h_i\|}{|u \cdot \hat{e}_i|}.    (24)

The stable time step is based upon the minimum time step derived from either equation (23) or (24). However, for meshes graded to resolve boundary layers, the advective-diffusive stability limit usually dictates the time step.
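A minimal sketch of the time-step estimate follows, assuming the per-element local coordinate direction vectors h_i and centroid velocities are supplied; the formulas follow equations (20)-(24) as reconstructed above, and the array layout is my own choice.

    import numpy as np

    def stable_dt(h_vectors, u_centroid, nu, cfl_max=1.0):
        """Global stable time step from the grid Re and CFL limits.

        h_vectors  : (Nel, ndim, ndim) array; h_vectors[e, i] is the element-local
                     coordinate direction vector h_i for element e.
        u_centroid : (Nel, ndim) array of element centroid velocities.
        """
        dt = np.inf
        for h_e, u_e in zip(h_vectors, u_centroid):
            for h_i in h_e:
                h_norm = np.linalg.norm(h_i)
                u_dot_h = abs(np.dot(u_e, h_i))
                Re_i = u_dot_h / (2.0 * nu)                                          # eq. (20)
                dt_visc = (h_norm**2 / (2.0 * nu)) / (1.0 + np.sqrt(1.0 + Re_i**2))  # eq. (23)
                dt = min(dt, dt_visc)
                if u_dot_h > 0.0:
                    # eq. (24); |u . e_i| = |u . h_i| / ||h_i||, so the limit is CFL ||h_i||^2 / |u . h_i|
                    dt = min(dt, cfl_max * h_norm**2 / u_dot_h)
        return dt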
2.1 Modified Finite Element Formulation
Several ad-hoc modifications are made to the standard Galerkin finite element formulation for the explicit time integration algorithm. These modifications include the use of a row-sum lumped mass matrix, single-point Gaussian quadrature, balancing tensor diffusivity (BTD), and hourglass stabilization to damp the spurious zero-energy modes known as keystone or hourglass modes. A detailed numerical analysis of these modifications is discussed in Gresho, et al.[8] Before discussing the reduced integration operators, a brief overview of the element matrices associated with equations (10) and (11) is presented. The element-level gradient, mass, advection and diffusion operators are shown in equations (25) - (28). Here, N_a is the element shape function, and the subscripts a and b range from 1 to the number of nodes per element, nnpe.
C_{ia}^e = -\int_{\Omega_e} \frac{\partial N_a}{\partial x_i} \, dV    (25)

M_{ab}^e = \int_{\Omega_e} N_a N_b \, dV    (26)

A_{ab}^e(u) = \int_{\Omega_e} N_a \, u \cdot \nabla N_b \, dV    (27)

K_{ab}^e = \int_{\Omega_e} \nu \, \nabla N_a^T \cdot \nabla N_b \, dV    (28)
Reduced-Integration Operators
In this discussion, reduced integration is considered synonymous with single-point Gaussian quadrature. Using single-point integration, the element area and gradient operators may be written strictly in terms of element-local nodal coordinates:

A^e = \frac{1}{2} [ x_{31} y_{42} + x_{24} y_{31} ],    (29)

C_x^e = \frac{1}{2A^e} [ y_{24}, y_{31}, -y_{24}, -y_{31} ],
C_y^e = \frac{1}{2A^e} [ x_{42}, x_{13}, -x_{42}, -x_{13} ].    (30)

In equations (29) and (30), x_{ab} = x_a - x_b, where the subscripts a and b identify the local node number and range from 1 to 4 for the two-dimensional bilinear element. The fact that C_{x3} = -C_{x1} and C_{x4} = -C_{x2} permits the storage of only the unique values in the gradient operator at the element level. In two dimensions, this requires only 4 floating point values per element for C_x^e and C_y^e.

The computation of the element-level gradient operators in three dimensions is somewhat more involved. To begin, the element-local nodal coordinates in the referential domain are defined as follows:

\xi^T   = [-1,  1, -1,  1,  1, -1,  1, -1]
\eta^T  = [-1,  1,  1, -1, -1,  1,  1, -1]
\zeta^T = [-1, -1, -1, -1,  1,  1,  1,  1].    (31)
The Jacobian, evaluated at the element centroid, or central Gauss point, is defined in terms of the element-local nodal coordinates (x^e, y^e, z^e) and the referential coordinates (ξ, η, ζ) in equation (32):

J(0) = \frac{1}{8}
\begin{bmatrix}
\xi^T x^e & \xi^T y^e & \xi^T z^e \\
\eta^T x^e & \eta^T y^e & \eta^T z^e \\
\zeta^T x^e & \zeta^T y^e & \zeta^T z^e
\end{bmatrix}.    (32)

The element volume is simply the determinant of the Jacobian in equation (33). Note that the element volume in three dimensions may only be computed exactly with single-point integration for elements that are bricks or parallelepipeds.

V^e = \det(J(0))    (33)

The computation of the element-level gradient operators proceeds by first evaluating the co-factors of the Jacobian, the inverse Jacobian in equation (34), and the gradient operators as shown in equation (35). Again, only the unique gradient operators must be stored, i.e., 12 words of storage per element are required for C_x^e, C_y^e, and C_z^e [8].

D = [D_{ij}] = J(0)^{-1}    (34)

C_{x_a}^e = \frac{1}{8} [ D_{11} \xi_a + D_{12} \eta_a + D_{13} \zeta_a ]
C_{y_a}^e = \frac{1}{8} [ D_{21} \xi_a + D_{22} \eta_a + D_{23} \zeta_a ]
C_{z_a}^e = \frac{1}{8} [ D_{31} \xi_a + D_{32} \eta_a + D_{33} \zeta_a ]    (35)
The computation of the element-level mass matrix using one-point quadrature and row-sum lumping yields the 2-D operator in equation (36) and the 3-D operator in equation (37), where δ_ab is the Kronecker delta.

M_{ab}^e = \delta_{ab} \, A^e / 4    (36)

M_{ab}^e = \delta_{ab} \, V^e / 8    (37)
The direct evaluation of the advection operator in equation (27) requires an integral of triple products that is very computationally intensive. Therefore, the advection operator is approximated using an ad-hoc modification known as the centroid advection velocity. This modification assumes that u in equation (27) may be approximated by

\bar{u} = \sum_{a=1}^{nnpe} N_a(0) \, u_a,    (38)

where N_a(0) indicates evaluation of the shape functions at the origin of the referential coordinate system. The application of single-point integration further simplifies the advection operator. In two dimensions, α_1 and α_2 in equation (39) are used to form the advection operators in equation (40), where u_{ab} = u_a - u_b.

\alpha_1 = \bar{u} C_{x1} + \bar{v} C_{y1}
\alpha_2 = \bar{u} C_{x2} + \bar{v} C_{y2}    (39)

A^e(u) u^e = [1, 1, 1, 1]^T \frac{A^e}{4} [ \alpha_1 u_{13} + \alpha_2 u_{24} ]
A^e(u) v^e = [1, 1, 1, 1]^T \frac{A^e}{4} [ \alpha_1 v_{13} + \alpha_2 v_{24} ]    (40)

For the evaluation of the three-dimensional advection operators, α_1 - α_4 defined in equation (41) are used in equation (42).
\alpha_1 = \bar{u} C_{x1} + \bar{v} C_{y1} + \bar{w} C_{z1}
\alpha_2 = \bar{u} C_{x2} + \bar{v} C_{y2} + \bar{w} C_{z2}
\alpha_3 = \bar{u} C_{x3} + \bar{v} C_{y3} + \bar{w} C_{z3}
\alpha_4 = \bar{u} C_{x4} + \bar{v} C_{y4} + \bar{w} C_{z4}    (41)

A^e(u) u^e = [1, 1, 1, 1, 1, 1, 1, 1]^T \frac{V^e}{8} [ \alpha_1 u_{17} + \alpha_2 u_{28} + \alpha_3 u_{35} + \alpha_4 u_{46} ]
A^e(u) v^e = [1, 1, 1, 1, 1, 1, 1, 1]^T \frac{V^e}{8} [ \alpha_1 v_{17} + \alpha_2 v_{28} + \alpha_3 v_{35} + \alpha_4 v_{46} ]
A^e(u) w^e = [1, 1, 1, 1, 1, 1, 1, 1]^T \frac{V^e}{8} [ \alpha_1 w_{17} + \alpha_2 w_{28} + \alpha_3 w_{35} + \alpha_4 w_{46} ]    (42)
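The following minimal sketch evaluates the two-dimensional one-point operators (equations (29)-(30)) and the corresponding element advection contribution (equations (38)-(40)) for a single bilinear quad; the function names and array layout are mine, and the centroid shape-function values N_a(0) = 1/4 are used for the centroid velocity.

    import numpy as np

    def quad_one_point_operators(x, y):
        """Element area and one-point gradient operators, equations (29)-(30).
        x, y are length-4 arrays of nodal coordinates in canonical local order."""
        x31, x24, x42, x13 = x[2] - x[0], x[1] - x[3], x[3] - x[1], x[0] - x[2]
        y42, y31, y24 = y[3] - y[1], y[2] - y[0], y[1] - y[3]
        A = 0.5 * (x31 * y42 + x24 * y31)                    # eq. (29)
        Cx = np.array([y24, y31, -y24, -y31]) / (2.0 * A)    # eq. (30)
        Cy = np.array([x42, x13, -x42, -x13]) / (2.0 * A)
        return A, Cx, Cy

    def quad_advection(x, y, u, v):
        """Element advection contribution A^e(u) u^e using the centroid
        advection velocity (eq. (38)) and equations (39)-(40)."""
        A, Cx, Cy = quad_one_point_operators(x, y)
        u_bar, v_bar = u.mean(), v.mean()                    # N_a(0) = 1/4 for the bilinear quad
        a1 = u_bar * Cx[0] + v_bar * Cy[0]                   # eq. (39)
        a2 = u_bar * Cx[1] + v_bar * Cy[1]
        u13, u24 = u[0] - u[2], u[1] - u[3]
        v13, v24 = v[0] - v[2], v[1] - v[3]
        adv_u = np.ones(4) * (A / 4.0) * (a1 * u13 + a2 * u24)   # eq. (40)
        adv_v = np.ones(4) * (A / 4.0) * (a1 * v13 + a2 * v24)
        return adv_u, adv_v

For a unit square in canonical counter-clockwise order, quad_one_point_operators returns A = 1 and the expected centroid derivative weights of magnitude 1/2, which provides a quick consistency check of the reconstruction.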
The single-point diffusion operator may be stated simply as

K_{ab}^e = C_{ia} \hat{\nu}_{ij} C_{jb} V^e,    (43)

where ν̂_{ij} represents a tensorial diffusivity. Here, i and j range from 1 to the number of spatial dimensions while a, b range from 1 to the number of nodes per element. Generation of the diffusion operator in equation (43) using single-point integration leads to rank deficiency of the element-level operator. The presence of an improper singular mode in the element-level operator may also lead to singularity in the assembled global operator. In two dimensions, there is only one improper singular mode, while in three dimensions, there are four singular modes. These modes are commonly referred to as hourglass modes, and when excited in a numerical solution, they can remain undamped and pollute the field solution.

In the current implementation of the explicit algorithm, several hourglass stabilization methods have been tested. A detailed discussion of the stabilization methods employed is beyond the scope of this paper. However, for the sake of completeness, a brief overview of the so-called h-stabilization is presented. The term h-stabilization derives from the fact that the outer product of the element hourglass vectors is used to form the stabilization operator. In two dimensions, the single hourglass mode is

\Gamma^T = [1, -1, 1, -1].    (44)

This mode, when excited, may be detected visually as a w-mode in isometric plots of the field variable from a numerical solution. Following Goudreau and Hallquist[11] and Gresho, et al.[8], the element-level stabilization operator is formed using the outer product of the hourglass vector.
H_{ab}^e = \epsilon_{hg} A^e \Gamma_a \Gamma_b    (45)

Here ε_hg is a non-dimensional parameter that, in practice, is unity by default for outer-product stabilization (see Gresho, et al.[8]). In three dimensions, the four hourglass vectors are

\Gamma_1^T = [ 1,  1, -1, -1, -1, -1,  1,  1]
\Gamma_2^T = [ 1, -1, -1,  1, -1,  1,  1, -1]
\Gamma_3^T = [ 1, -1,  1, -1,  1, -1,  1, -1]
\Gamma_4^T = [-1,  1, -1,  1,  1, -1,  1, -1].    (46)

The resulting 3-D hourglass stabilization operator is

H_{ab}^e = \epsilon_{hg}^e \, [\Gamma_1 \; \Gamma_2 \; \Gamma_3 \; \Gamma_4]
\begin{bmatrix} C_1 & & & \\ & C_2 & & \\ & & C_3 & \\ & & & C_4 \end{bmatrix}
\begin{Bmatrix} \Gamma_1 \\ \Gamma_2 \\ \Gamma_3 \\ \Gamma_4 \end{Bmatrix},    (47)

where ε_hg^e = 1.0, C_1 = C_2 = C_3 = C_4 = Δh \sqrt[3]{V^e}, and Δh = (\sqrt[3]{V_{max}} - \sqrt[3]{V_{min}})/2. For the reduced integration element, γ-stabilization (see Belytschko, et al.[12], Liu and Belytschko[13], and Liu, et al.[14]) has also been investigated. γ-stabilization refers to the γ-vectors constructed from the hourglass modes for stabilization.
While γ-stabilization is perhaps more robust than h-stabilization, this type of hourglass control also requires more operations and storage. It is the author's experience that it is relatively more difficult to excite the hourglass modes in an Eulerian computation than in a Lagrangian computation, e.g., a DYNA3D[15] simulation. However, γ-stabilization still requires fewer operations and less storage than the fully integrated two-dimensional bilinear element. In three dimensions, this is not the case. Table 1 shows the memory requirements and operations counts for a matrix-vector multiply (Ku) for a variety of element formulations. In 2-D, γ-stabilization requires nearly the same storage as the fully integrated element stored in either a compact, symmetric, element form, or in a global row-compressed form [16]. However, this element requires 9 more operations to achieve a matrix-vector multiply when compared to the element-by-element matrix-vector multiply with 2x2 quadrature. In 3-D, γ-stabilization is about 3 times more expensive to perform than the corresponding global row-compressed matrix-vector multiply.

There is one final modification to the finite element formulation that derives from the explicit treatment of the advective terms. For advection-dominated flows, it is well known that the use of a backward-Euler treatment of the advective terms introduces excessive diffusion. Similarly, Gresho, et al.[8] have shown that forward-Euler treatment of the advective terms results in negative diffusivity, or an under-diffusive scheme. In order to remedy this problem, balancing tensor diffusivity (BTD), derived from a Taylor series analysis to exactly balance the diffusivity deficit, is adopted. In the one-point quadrature element, the BTD term is simply added to the kinematic viscosity in equation (48) to form the tensorial diffusivity used in equation (43):

\hat{\nu}_{ij} = \nu \delta_{ij} + \frac{\Delta t}{2} u_i u_j.    (48)

In summary, the modifications made to the standard finite element formulation include the use of single-point integration, a row-sum lumped mass matrix, hourglass stabilization, and balancing tensor diffusivity. The benefits promised by one-point integration are tremendous in computational fluid dynamics problems because of the requisite mesh sizes for interesting problems and the concomitant memory requirements. The reduction from 8 quadrature points to 1 in three dimensions reduces the computational load by a factor of about 6 to 7 and reduces memory requirements by a factor of 2 for the basic gradient operators. Neglecting the storage costs associated with the PPE, the total storage requirement for the explicit algorithm is 60 words per 3-D element. Further, it has been demonstrated that the convergence rate of the one-point elements is comparable to that of the fully integrated elements at a fraction of the computational cost; see Liu[14]. With the element formulation defined, attention is now turned to the parallel aspects of the explicit algorithm when a domain-decomposition message-passing paradigm is used.
3 Domain-Decomposition Message-Passing Paradigm

This section describes the domain-decomposition message-passing (DDMP) implementation of the explicit time integration algorithm for the incompressible Navier-Stokes equations. A brief overview of spatial domain-decomposition and the associated sub-domain to processor mapping is discussed first. Next, the parallel right-hand-side assembly procedure is presented. The parallel assembly procedure then sets the stage for a discussion of the parallel iterative solution methods applied to the PPE.
3.1 Domain-Decomposition
Domain-decomposition is the process of sub-dividing the spatial domain into sub-domains that can be assigned to individual processors for the subsequent parallel solution process. There has been a great deal of work on the problem of spatial decomposition for unstructured grids over the last several years [17, 18, 19, 20, 21, 22, 23, 24]. In general, the requirements for mesh decomposition are that the sub-domains be defined in such a way that the computational load is uniformly distributed across the available processors, and that the inter-processor communication is minimized. In this work, the decomposition of the finite element mesh into sub-domains is accomplished using the multilevel graph partitioning tools available in CHACO[21, 22, 23, 24]. This type of spectral domain decomposition attempts to sub-divide the computational domain in such a way that the computational load is uniform across processors while attempting to minimize the inter-processor communication. In part, CHACO was selected for this task because of its implicit weighting on the number of wires in a hypercube when using the spectral bisection and octasection methods.

In order to exploit the finite element assembly process[6] for parallelization, the dual graph of the finite element mesh, i.e., the connectivity of the dual grid shown in Figure 2, is used to perform a non-overlapping element-based domain decomposition. Note that the dual grid corresponds to the grid associated with the element-centered pressure variables in the Q1P0 element. Implicit in this choice of a domain decomposition strategy is the idea that elements are uniquely assigned to processors while the nodes at the sub-domain interfaces are stored redundantly on multiple processors. Figure 4 illustrates a non-overlapping, element-based decomposition for a simple two-dimensional vortex-shedding mesh partitioned for four processors. The element coloring in Figure 4 is used to identify vector/cache blocks and will be discussed in Section 3.2.

After defining the sub-domains, the assignment of nodes, boundary conditions, and materials to individual processors is performed. Given that the non-overlapping decomposition assigns elements uniquely to processors, a simple assignment algorithm may be used to generate the necessary on-processor local-to-global mapping of nodes and boundary conditions. This is performed using an algorithm developed by Maltby[25]. Maltby's mapping algorithm uses a truth table to identify the nodes that reside on a given processor, the mapping from processor-local to global node numbering, the nodes with data that must be communicated off-processor, and the processors that must be communicated with.
In order to demonstrate Maltby's mapping procedure, consider the bit-array, i.e., truth table, shown in Figure 5, where a 1-bit corresponds to the existence of a node on-processor. The bit-array is constructed by simply placing a 1-bit in the column corresponding to the processor number assigned during the domain decomposition phase to each element of the mesh. Here, the assignment of elements to processors is assumed to be the responsibility of the spectral decomposition tool, e.g., CHACO. Scanning down any column of the bit-array identifies those global nodes (and elements) of the finite element mesh that are assigned to a processor. Contiguous node numbers are assigned to the nodes that reside on-processor, i.e., in a column of the bit-array. From this numbering, the local-to-global node number mapping is derived. By scanning across the rows of the bit-array, the nodal message lists may be constructed by noting that a global node number with a 1-bit in more than one column must exchange data with the processor corresponding to the column number. Thus, the sub-domain to processor mapping procedure is achieved entirely in an off-line pre-processing operation.
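A minimal sketch of this bookkeeping is given below; it assumes the element-to-processor assignment is already available from the decomposition tool, and the data structures and names (build_node_maps, shared node lists) are mine rather than the paper's.

    import numpy as np

    def build_node_maps(element_nodes, element_proc, num_procs):
        """Truth-table mapping: local node numbering and nodal message lists.

        element_nodes : list of per-element global node lists.
        element_proc  : element-to-processor assignment from the decomposition tool.
        Returns the per-processor local-to-global node map and, for each shared
        global node, the set of processors that must exchange its data.
        """
        num_nodes = 1 + max(n for elem in element_nodes for n in elem)
        truth = np.zeros((num_nodes, num_procs), dtype=bool)      # the bit-array of Figure 5
        for elem, p in zip(element_nodes, element_proc):
            truth[elem, p] = True                                 # 1-bit: node exists on processor p

        local_to_global = [np.flatnonzero(truth[:, p]) for p in range(num_procs)]
        shared = {n: np.flatnonzero(truth[n]) for n in range(num_nodes)
                  if truth[n].sum() > 1}                          # nodes on sub-domain boundaries
        return local_to_global, shared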
3.2 Parallel Assembly via Message Passing
The finite element assembly procedure is an integral part of any finite element code and consists of a global gather operation of nodal quantities, an add operation, and a subsequent scatter back to the global memory locations. A complete description of the sequential assembly algorithm may be found in Hughes[6]. The assembly procedure is used to form global coefficient matrices, right-hand-side vectors, and matrix-vector products in an element-by-element sense. In the case of the explicit time integration algorithm, the emphasis is upon the assembly of the element-level contributions to the global right-hand-side vector,

\hat{F}^n = A_{e=1}^{N_{el}} \{ f_e^n - K_e u_e^n - A_e(u^n) u_e^n \},    (49)

where A is the assembly operator. From an examination of the explicit algorithm, it is clear that the bulk of the computational effort for a discrete time step consists of the formation and assembly of the right-hand-side in equation (49), and the subsequent PPE solution in equation (15). The right-hand-side vector in equation (49) is formed element-by-element using a right-to-left matrix-vector multiply with reduced integration operators for the diffusive and advective terms.

In order to exploit both register-to-register vector and cache-based processors, a topological domain-decomposition scheme is used on-processor to identify groups of elements that are completely data independent. Figure 4 shows the on-processor topological decomposition scheme used to identify the vector/cache blocks. A lucid description of the algorithm used for generating the data-independent element blocks is presented in Ferencz[26]. Identifying data-independent groups of elements that may be processed together not only permits the vectorization of the assembly process, but also maximizes the number of element-level operations with respect to the number of load-store operations. For cache-based architectures, this permits all of the element data in a data-independent group to be loaded into cache once for the element-level operations, e.g., right-to-left matrix-vector multiplies. Thus, the assembly process using the vector/cache blocks proceeds block-by-block with all element-level operations for a block being performed before completing the "add-scatter" portion of the gather-add-scatter assembly procedure. In Figure 4, elements of like color on a single sub-domain constitute a single vector/cache block. On traditional CRAY vector supercomputers, the vector/cache block size is configured to be an integer multiple of 64 to permit strip-mining operations to occur. On cache-based architectures, the size of the cache blocks is tuned to match the size of the data cache, ensuring that all element operations re-use data that remains in cache during the element-level matrix-vector multiplies. In the context of the DDMP approach, the vector/cache blocks may be viewed as a second level of fine-grained parallelism that may be exploited on machines with on-processor vector capabilities, or machines with software pipelining capabilities.

The parallel assembly procedure may be viewed as a generalized form of the finite element assembly algorithm that requires inter-processor communication. As an example of a two-processor assembly, consider the sequential mesh and the assignment of the global nodes to two processors as shown in Figure 6. In Figure 6b, the local node numbers are enclosed in brackets to the right of the global node number (global node 1 in sub-domain P0 is local node [1]). The sub-domain assignment of the global nodes is the consequence of the unique assignment of elements to processors, and reveals the existence of the nodes at the sub-domain boundaries on multiple processors. Figure 6b shows the inter-processor assembly of the sub-domain boundary nodes. Figure 6c illustrates the use of the local-to-global mapping required for the gather-add-scatter operation, and the arrows between global node numbers identify a send-receive pair. Thus, the parallel assembly procedure induces communication in the form of a gather-send-receive-add-scatter process. The parallel right-hand-side assembly for equation (14) may be viewed as a generalized assembly with off-processor communication, where first the on-processor vector/cache blocked assembly is performed according to equation (50), and then the assembly at the sub-domain boundaries is performed according to equation (51):

\hat{F}_p^n = A_{e=1}^{N_{el}} \{ f_e^n - K_e u_e^n - A_e(u^n) u_e^n \},    (50)

\hat{F} = A_{p=0}^{N_p - 1} \{ \hat{F}_p \}.    (51)

There are several things to note about the parallel assembly procedure. First, the parallel assembly is simply a generalization of the sequential assembly procedure that includes inter-processor communication. Second, the algorithm only requires the communication of nodal data at the edges of adjacent sub-domains. Therefore, as the problem size increases, the communication overhead scales with the number of surface nodes associated with sub-domain boundaries. This algorithm also permits the implementation of vector-valued messages in order to avoid start-up issues associated with short messages, i.e., for the assembly shown in equation (51), the total message length is proportional to
the product of the number of sub-domain boundary nodes and the number of degrees-of-freedom per node. Further, the use of non-overlapping grids implies that nodes in the finite element mesh that lie on sub-domain boundaries are stored on multiple processors as shown in Figure 6b. In contrast, over-lapping sub-domains would require the redundant storage of all the data associated with elements at the sub-domain boundaries. In the explicit algorithm, the use of iterative solution methods for the PPE can constitute 80-90% of the computational work. Further, the right-hand-side formation and assembly is performed strictly at the element level. Therefore, with the bulk of the computational work associated with the elements rather than the nodes in the mesh, the choice was made to load balance on the elements rather than the nodes. This may be contrasted with fully-implicit algorithms where the bulk of the computational work is associated with the solution of a node-based non-linear system of equations.
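A minimal sketch of the gather-send-receive-add-scatter exchange at sub-domain boundaries follows, written with mpi4py; the per-neighbor lists of shared local node indices (assumed to be ordered identically on both ranks) come from a pre-processing step such as the mapping sketch above and are not part of the paper's description.

    import numpy as np
    from mpi4py import MPI

    def assemble_boundary(F_local, shared_nodes, comm=MPI.COMM_WORLD):
        """Accumulate partially assembled nodal values at sub-domain boundaries.

        F_local      : (num_local_nodes, ndof) partially assembled right-hand side.
        shared_nodes : dict mapping neighbor rank -> array of local node indices
                       shared with that neighbor (same ordering on both ranks).
        """
        for nbr, idx in shared_nodes.items():
            send_buf = np.ascontiguousarray(F_local[idx])                        # gather
            recv_buf = np.empty_like(send_buf)
            comm.Sendrecv(send_buf, dest=nbr, recvbuf=recv_buf, source=nbr)      # send-receive
            F_local[idx] += recv_buf                                             # add-scatter
        return F_local

Packing all degrees-of-freedom for the shared nodes into one buffer per neighbor realizes the vector-valued messages mentioned above, which keeps the number of message start-ups independent of the number of degrees-of-freedom per node.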
3.3 The Parallel Explicit Algorithm
In this section, the issue of solving the PPE in parallel will not be addressed so that attention may be focused upon the solution process for the nodal variables. The DDMP version of the explicit time integration algorithm proceeds as follows.

1. Calculate the partial acceleration, i.e., the acceleration neglecting the pressure gradient, at time level n.

   i.   All processors calculate the processor-local right-hand-side terms neglecting the pressure gradient according to equation (50).

   ii.  Perform the inter-processor finite element assembly according to equation (51). In this step, processors that share a common sub-domain boundary exchange messages containing partially assembled right-hand-side contributions, and accumulate the fully-summed terms at the sub-domain boundaries.

   iii. All processors compute the processor-local partial acceleration using a pre-calculated lumped mass matrix, i.e., the mass matrix has been fully summed for all nodes in the mesh:

        \tilde{a}^n = M_l^{-1} \hat{F}^n.    (52)

2. Solve the global PPE problem for the current pressure field:

   [C^T M_l^{-1} C] P^n = C^T \tilde{a}^n.    (53)

3. Update the nodal velocities.

   i.   The computation of the pressure-gradient term in equation (54) consists of the parallel on-processor computation of the discrete pressure gradient, followed by an inter-processor assembly:

        u^{n+1} = u^n + \Delta t \, [\tilde{a}^n + M_l^{-1} \{ A_{p=0}^{N_p-1} (C P^n) \}].    (54)

4. Repeat steps 1-3 until a maximum simulation time limit or maximum number of time steps is reached.
Remark
For the explicit solution algorithm, the global row-sum lumped mass is accumulated at all nodes in the finite element mesh during the initialization phase. This requires the nodal assembly of the partial nodal mass at nodes on sub-domain boundaries as shown in Figure 6. All other operators are calculated during the initialization process in a perfectly parallel fashion, i.e., without communication.
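As a usage note tied to the assembly sketch above: the same boundary accumulation used for the right-hand side can be reused once at initialization to fully sum the lumped mass at sub-domain boundary nodes. The snippet below assumes the hypothetical assemble_boundary helper and shared-node lists introduced earlier, along with a processor-local lumped-mass vector Ml_local.

    # Initialization: fully sum the row-sum lumped mass at shared boundary nodes,
    # so no further communication is needed for Ml during time stepping.
    Ml_local = assemble_boundary(Ml_local.reshape(-1, 1), shared_nodes).ravel()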
3.4 Solution Strategies for the PPE
Direct solution methods have been shown to be effective for solving the PPE because a single off-line factorization may be used with one resolve per time step. However, on sequential computers, direct solvers lack scalability with respect to problem size due to memory requirements and the operations count - both a function of the bandwidth of the PPE. In this section, two conjugate gradient based solution methods for the PPE are outlined. The first is based upon an element-by-element matrix-vector multiply with Jacobi preconditioning. This algorithm lays the foundation for consideration of sub-domain SSOR preconditioning with an element-by-element matrix-vector multiply applied only at the sub-domain boundaries.
Element-by-Element Jacobi Preconditioned Conjugate Gradient (EBE/JPCG)
Before proceeding with the details of the element-by-element algorithm, the basic conjugate gradient algorithm is outlined. The conjugate gradient algorithm shown below consists of a preconditioning step (solving Q z^(n-1) = r^(n-1) for z^(n-1)), two dot-products, two saxpy operations, two divide operations, and a matrix-vector multiply. Applying a diagonal preconditioner in the algorithm shown below, where Q is the diagonal operator, leaves the matrix-vector multiply, q^(n) = A p^(n), as the most computationally intensive part of the conjugate gradient algorithm. A parallel matrix-vector multiply that directly uses the assembled form of the PPE on a sub-domain would require messages that differ from those for the nodal variables if the element-centered pressure variables are communicated directly. The overhead of implementing multiple message data structures for passing node-centered and element-centered data is not appealing. However, by performing the matrix-vector multiply element-by-element, the nodal message data structures may be used with vector-valued messages that avoid problems associated with message latency due to short messages.
Preconditioned Conjugate Gradient Algorithm
Compute r^(0) = b - A x^(0) for some initial guess x^(0).
for n = 1, 2, ..., iteration limit
    solve Q z^(n-1) = r^(n-1)
    rho^(n-1) = r^(n-1)T z^(n-1)
    if n = 1
        p^(1) = z^(0)
    else
        beta^(n-1) = rho^(n-1) / rho^(n-2)
        p^(n) = z^(n-1) + beta^(n-1) p^(n-1)
    endif
    q^(n) = A p^(n)
    alpha^(n) = rho^(n-1) / (p^(n)T q^(n))
    x^(n) = x^(n-1) + alpha^(n) p^(n)
    r^(n) = r^(n-1) - alpha^(n) q^(n)
    Check convergence; repeat iteration if required
end

The element-by-element (EBE) matrix-vector multiply proceeds in a manner analogous to the right-hand-side formation for the momentum equations, i.e., in a right-to-left order. The first step of the EBE matrix-vector multiply may be viewed as computing the gradient of the search direction, viz.,

g^{(n)} = A_{p=0}^{N_p-1} \{ A_{e=1}^{N_{el}} ( C p^{(n)} ) \}.    (55)

Thus, a vector-valued gradient of the search direction, g^{(n)}, is assembled at the nodes on
processor, and in a second step, assembled via a gather-send-receive-add-scatter step for the boundary nodes that are assigned to multiple processors. The EBE matrix-vector multiply is completed in a series of completely parallel operations that include dividing g^{(n)} by the lumped mass matrix, and then computing the discrete divergence of the result. Implicit in this computation is the fact that the lumped mass matrix, M_l, has the essential boundary conditions for velocity built in during a pre-processing step. The A p^{(n)} product is completed as

q^{(n)} = C^T [M_l^{-1}] g^{(n)}.    (56)

This approach to the matrix-vector multiply requires communication only for the first step of the A p^{(n)} product, during the assembly of the nodal gradient values of the search vector. Further, this approach re-uses the nodal communication patterns used for the momentum equations, simplifying the implementation of the message passing procedures. The only other communication required in the conjugate gradient algorithm is for the dot-product operations. The dot-product is implemented using a global sum operation and requires a global synchronization. Jacobi preconditioning requires solving Q z^{(n-1)} = r^{(n-1)}, an embarrassingly parallel operation since it requires only the diagonal of the PPE, and this involves strictly
processor-local element data. Therefore, the PPE diagonal and its inverse are pre-computed using all on-processor data associated with the unique assignment of elements to processors as

Q_{ii} = (C_x^i)^T X^i [M_{l_x}^{-1}] C_x^i + (C_y^i)^T X^i [M_{l_y}^{-1}] C_y^i + (C_z^i)^T X^i [M_{l_z}^{-1}] C_z^i,    (57)

where X^i is a gather operator that localizes the nodal lumped mass for each element.
Sub-Domain SSOR Preconditioned Conjugate Gradient (SSOR/PCG)

The parallel Jacobi preconditioned conjugate gradient algorithm performs adequately in terms of scalability, but it requires a relatively large number of iterations and consequently communication overhead. In order to avoid this problem, a symmetric successive over-relaxation (SSOR) preconditioner [27] has also been implemented in an attempt to reduce the communication overhead required to solve the PPE. The SSOR preconditioner is defined using an additive decomposition of the PPE, e.g., A = L + D + U, viz.,

Q = \frac{1}{2 - \omega} \left( \frac{D}{\omega} + L \right) \left( \frac{D}{\omega} \right)^{-1} \left( \frac{D}{\omega} + U \right).    (58)

In a parallel context, this preconditioner relies upon the FEM outflow boundary conditions on a sub-domain and makes use of a modified ITPACK[16] vector storage format for the sub-domain PPE operator. This preconditioner is perfectly parallel, requiring no communication, and it also reduces the required number of iterations to solve the PPE by over a factor of two for most problems. The use of an assembled sub-domain PPE operator for the SSOR preconditioner makes it possible to perform the A p^{(n)} matrix-vector multiply in a two-step process. First, the on-processor matrix-vector multiply is performed using the assembled operator for all elements on the interior of a sub-domain. The second phase of the matrix-vector multiply proceeds in an element-by-element fashion only for the elements that lie on the boundary of a sub-domain. The use of the assembled sub-domain PPE operator for the matrix-vector multiply reduces both the operation count and the data motion for the on-processor portion of the matrix-vector multiply.

A numerical study was conducted to determine the optimal value of ω for the SSOR preconditioner. Summary results of this study are shown in Figure 7 and demonstrate that ω = 1.0 is nearly optimal for most PPE operators in both two and three dimensions. The results in Figure 7 also highlight the influence of the specific problem on the number of iterations required to achieve convergence. This is demonstrated for the 2-D mesh with 4216 elements and the 3-D mesh with 13800 elements, which both contained pressure singularities in the computational domain.
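A minimal sketch of applying the SSOR preconditioner of equation (58), as reconstructed above, is given below as a forward triangular solve, a diagonal scaling, and a backward triangular solve on an assembled sub-domain operator. SciPy sparse storage is used here for brevity in place of the paper's modified ITPACK format, so this is an illustration of the operator splitting rather than of the paper's data structures.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import spsolve_triangular

    def ssor_apply(A_sub, r, omega=1.0):
        """Apply z = Q^{-1} r for the SSOR preconditioner of equation (58),
        built from the additive split A = L + D + U of the sub-domain PPE operator."""
        D = sp.diags(A_sub.diagonal())
        L = sp.tril(A_sub, k=-1, format='csr')
        U = sp.triu(A_sub, k=1, format='csr')
        # Q z = r  with  Q = 1/(2-omega) (D/omega + L) (D/omega)^{-1} (D/omega + U)
        y = spsolve_triangular((D / omega + L).tocsr(), (2.0 - omega) * r, lower=True)   # forward sweep
        y = (A_sub.diagonal() / omega) * y                                               # diagonal scaling
        z = spsolve_triangular((D / omega + U).tocsr(), y, lower=False)                  # backward sweep
        return z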
Remark
The stopping criterion for the preconditioned conjugate gradient algorithm is based upon both the normalized residual in equation (59) and the change in the solution vector in equation (60). A convergence tolerance of 10^{-5} typically yields PPE solutions that are sufficient to maintain a velocity field with an RMS divergence error of less than 10^{-8}.

\| r^{(n)} \| / \| b \| \le \epsilon    (59)

\| x^{(n)} - x^{(n-1)} \| / \| x^{(n)} \| \le \epsilon    (60)
3.5 Communication Costs
This section outlines the communication costs associated with the parallel explicit time-integration algorithm. The communication costs may be broken down into the cost per time step for the momentum equations, and the cost per time step for solving the PPE. To begin, N_Γ is the number of nodes on the boundary of a sub-domain. In two dimensions, the average number of nodes communicated to an adjacent processor is N_Γ/4, and in three dimensions the average is N_Γ/6. Assuming that there are 8 bytes per floating point word, the total number of bytes per processor to be communicated via a send operation is

N_c = 8 N_\Gamma N_{DOF},    (61)

where N_DOF is the number of degrees-of-freedom per node. The communication cost for a send operation may be broken into three parts: the time to initiate the message, t_startup, the cost per packet, t_packet, and the transmission time, t_transmit. Thus, the time to send a message is

t_{send} = t_{startup} + \frac{N_c}{N_{packet}} t_{packet} + N_c t_{transmit},    (62)

where N_packet is the number of bytes per packet. From equations (61) and (62), it is clear that the number of adjacent sub-domains determines the total communication time for a message-passing event. The startup time determines the minimum acceptable degree of granularity for a given parallel architecture. For example, message lengths of approximately 5000 bytes are required on the Meiko CS2 before the asymptotic bandwidth of the network is approached. Similarly, this message length is approximately 10000 bytes on the Intel Paragon.

Next, the communication cost per time step is estimated for the explicit algorithm. For the momentum equations, there are two primary messaging steps. The first occurs during the distributed assembly of the right-hand-side in equation (50). The second occurs during the assembly of the pressure gradient in equation (54). Thus, there are two vector-valued messaging events required for the momentum equations for each time step. For the solution of the PPE, there is one vector-valued message event per iteration, where N_IT iterations are required in the conjugate gradient solution. Therefore, the approximate communication time per time step for the explicit algorithm on a per-processor basis is

t_{comm} \approx (2 + N_{IT}) t_{send}.    (63)

Clearly, for N_IT > 2 the communication associated with the solution of the PPE constitutes the dominant part of the communication overhead per time step. Operational experience with the explicit algorithm has demonstrated that for both the Jacobi and SSOR preconditioned conjugate gradient solvers, the precise number of iterations required to solve the PPE is somewhat problem dependent. However, in general, for the EBE/JPCG PPE solver, N_IT ∝ √Nel in two dimensions, and N_IT ∝ ∛Nel in three dimensions. The impact of the communication costs on the code scalability will be discussed in the results presented below.
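A small sketch of the cost model of equations (61)-(63) is given below; the machine parameters are inputs and must be measured for a given network, so no specific values are assumed here.

    def comm_time_per_step(n_boundary_nodes, ndof, n_it,
                           t_startup, t_packet, t_transmit, packet_bytes):
        """Per-processor communication time per step from equations (61)-(63).
        All times are in seconds; packet_bytes is N_packet."""
        n_c = 8 * n_boundary_nodes * ndof                                         # bytes per send, eq. (61)
        t_send = t_startup + (n_c / packet_bytes) * t_packet + n_c * t_transmit   # eq. (62)
        return (2 + n_it) * t_send                                                # eq. (63)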
4 Results

In this section, several sample calculations are presented with a summary of performance measurements for the parallel explicit time-integration algorithm. The calculations, two of which are motivated in part by the CFD triathlon proposed by Freitas [28], consist of laminar flow over a backward-facing step, a von Karman vortex street, and flow past a circular cylinder attached to a flat plate. All of the computations presented in this section were performed on an 1840-processor Intel Paragon (1328 16-MegaByte processors and 512 32-MegaByte processors).
4.1 The Backward Facing Step
Although motivated in part by Freitas[28], the computations presented here for the backward facing step are based upon the Re = 800 backward facing step benchmark performed by Gartling[29]. This benchmark was chosen, in part, because it presents a demanding test for solution strategies applied to the PPE due to the pressure singularity at the corner of the step. The backward facing step height is taken as H/2, the total channel height is H, and the length of the computational domain is L = 30H. A parabolic inlet velocity profile is specified above the step, with the Reynolds number defined as Re = U_avg H/ν. The isothermal flow solution is integrated in time from an initial divergence-free condition.

Following Gartling [29], four grids with identical boundary conditions were constructed for the backward facing step. The grids are labeled B through E in Table 2 to correspond to those used by Gartling. However, the number of elements in each direction has been doubled (i.e., the grid spacing halved) to account for the fact that the computations use the Q1P0 element rather than the bi-quadratic velocity, linear discontinuous pressure elements employed by Gartling. On mesh E, this yields 387,362 degrees-of-freedom (DOF) in the problem, corresponding to the 355,362 reported by Gartling.
The separation and re-attachment lengths shown in Table 2 are for a snapshot at 400 time units. l_1 and l_3 are the re-attachment points on the lower and upper walls respectively, and l_2 is the separation point on the upper wall. At 400 time units, the flow field has established a steady-state condition, e.g., time-history plots of the kinetic energy and velocity components indicate that the solution is no longer changing in time. The computed separation/re-attachment lengths are within about 1.25% of the benchmark results presented by Gartling, while the primary re-attachment point, l_1, is within 14% of the length estimated from Armaly, et al.[30].

Figure 8 shows snapshots of the pressure and vorticity fields for the backward facing step calculation at 400 time units. Figure 8a shows the entire pressure field, while Figure 8b shows the pressure field for 0 ≤ x ≤ 10. Similarly, Figures 8c and 8d show the vorticity field for the entire domain and for 0 ≤ x ≤ 10 respectively. Although not an exhaustive comparison, this calculation compares well with the published benchmark results [29] in terms of the steady-state field and the separation/re-attachment lengths.
4.2 Vortex Shedding
The simulation of the von Karman vortex street behind a circular cylinder is a long-standing CFD benchmark for evaluating time-dependent solution algorithms for the incompressible Navier-Stokes equations. The results presented here are for a mesh modeled after that used for the benchmark calculation of Engelman and Jamnia [31]. The computational domain for the vortex shedding calculations consists of an interior square region -4 ≤ x ≤ 4, -4 ≤ y ≤ 4 surrounding the unit-diameter cylinder. This core region is surrounded by a grid that spans -8 ≤ x ≤ 24, -8 ≤ y ≤ 8. Table 3 shows the radial grid spacing associated with each of the three meshes of coarse, medium and fine resolution. Here, the coarse mesh is considered the base discretization, while the medium mesh corresponds to a 2x2 refinement, and the fine mesh constitutes a 3x3 refinement relative to the coarse mesh. "Tow-tank" boundary conditions consisting of U = 1, V = 0 are applied at the upstream, top and bottom computational boundaries, while outflow conditions, i.e., natural boundary conditions, are applied downstream. The Reynolds number for the vortex shedding calculations is Re = 100 based upon the cylinder diameter.

The flow computation is started from initial conditions that are divergence-free. After startup, the flow field evolves through a quasi-steady vortex-stretching phase before the buildup of numerical round-off trips the vortex shedding. The number of degrees-of-freedom and the time steps associated with each of the three meshes are shown in Table 3. For the coarse mesh, approximately 142 time steps were taken per vortex shedding cycle, while the time step for the fine mesh resulted in about 457 time steps per cycle. The measured Strouhal number for the fine mesh computation is 0.175, based upon the shedding period measured over 25 shedding cycles. This is consistent with the published numerical results of 0.172-0.173 [31].

Figure 9 shows snapshots of the vorticity and pressure fields on the fine mesh at 150 time units. The close-up of the instantaneous vorticity field shows two distinct secondary
attached vortices in the wake of the cylinder. The color map for the vorticity was chosen specifically to delineate positive and negative vorticity, as illustrated by the region of essentially zero vorticity that separates the upper and lower regions of the vortex street. The pressure field shows the regions of low pressure that correspond to the regions of high vorticity. The decay in vortex strength due to viscous diffusion is clear in the downstream section of the vortex street. The low-pressure regions associated with the vortex centers also exhibit a reduced amplitude in the downstream wake.
4.3 Post and Plate
Juncture flows are of interest in numerous industrial applications, and in particular, exterior flows around vehicles. The presence of geometrical junctures can lead to the generation of longitudinal vortices that pass on each side of the juncture in counter-rotating directions. These types of flow configurations occur in turbomachinery, blade and end-wall flows, aircraft wing-body junctures, and ship/hull appendages. A recent experimental investigation of this type of flow configuration may be found in [32].

The flow past a plate with an attached cylinder is considered here as a simplified prototype juncture configuration for a Reynolds number of 100 based upon the cylinder diameter. The computational domain is chosen such that -4.5 ≤ x ≤ 6.5, -2.5 ≤ y ≤ 2.5, and 0 ≤ z ≤ 2, with a cylinder of unit diameter at the origin oriented in the z-direction, and a plate in the x-y plane at z = 0. The boundary conditions are specified as U = 1, V = 0, W = 0 at the upstream inlet, with no-slip boundary conditions on the plate and post. A symmetry condition is imposed at the upper boundary (z = 2) so that only the x- and y-velocity components vary in the x-y plane. Outflow boundary conditions are used on the y-z planes of the domain. Again, three meshes of coarse, medium, and fine resolution are considered, where the medium and fine meshes consist of 2x2 and 4x4 refinements of the coarse mesh respectively. This resulted in computations with 35,200, 265,952, and 880,416 degrees-of-freedom for the coarse, medium and fine grids.

In Figure 10, snapshots of the pressure, helicity (ω·u), and enstrophy (ω·ω/2) at 300 time units are shown in the left column for the medium mesh resolution. In the right column of Figure 10, the three vorticity fields are shown. The flow field in the symmetry plane at z = 2 exhibits behavior strikingly similar to the two-dimensional von Karman vortex street discussed above. The pressure isosurfaces illustrate the three-dimensional nature of the flow field and highlight the stagnation regions at the leading edge of both the plate and post. In the x-y symmetry plane the pressure isosurfaces share considerable similarities with the pressure contours shown in Figure 9b. The z-vorticity isosurfaces in the x-y symmetry plane, shown in Figure 10, also have a strong resemblance to the vorticity contours in Figure 9a. However, the presence of the juncture, as well as the plate itself, leads to a strongly three-dimensional vorticity field, particularly in the downstream section of the flow. The isosurfaces of x-vorticity illustrate the orientation of the vorticity with the primary flow direction at the post-plate juncture.

Inspection of the helicity field confirms that the flow field contains significant three-dimensional structure, and identifies the regions of the juncture flow where the longitudinal vortices are aligned with the velocity field. The blue and gold helicity isosurfaces on either side of the cylinder-plate juncture clearly identify the presence of counter-rotating longitudinal vortices oriented with the primary flow direction. The fact that the helicity changes sign on the upstream side of the juncture suggests that the flow field is subjected to extreme strain-rates in this region. The y-vorticity isosurfaces correlate to the development of the boundary layer from the leading edge of the plate, and clearly identify the wake region near the plate downstream of the cylinder. The y-vorticity values on the upstream surface of the cylinder also indicate the presence of several counter-rotating vortices centered along the stagnation line of the cylinder. In contrast to the strongly three-dimensional x- and y-vorticity fields, the z-vorticity isosurfaces are relatively two-dimensional except near the plate surface where the influence of the boundary layer is significant. Isosurfaces of the enstrophy field are also shown in Figure 10. The enstrophy isosurfaces show a three-dimensional trough in the downstream region of the post that corresponds to the trough in the y-vorticity isosurfaces. The measured Strouhal number (based upon the y-velocity fluctuations in the symmetry plane) is 0.17, which agrees well with the two-dimensional vortex shedding calculations.
4.4 Performance Results
Before proceeding with a discussion of the performance of the parallel explicit time-integration algorithm, a brief summary of the performance of the vector/cache-blocked implementation is provided as a basis for comparison. The most invariant index for comparison of the explicit algorithm across computational platforms is the element cycle-time measured in micro-seconds per time step per element. In other words, the cycle time is the amount of compute time required to advance one element one time step. On a CRAY Y-MP, the use of a highly vectorized variable-band Cholesky solver [33] for the PPE yields an element cycle-time of 8 μs, essentially independent of the problem size. On a Silicon Graphics Power Indigo-2, the explicit algorithm with the Cholesky PPE solver yields a cycle-time ranging from 8.5 to 94.2 μs for the backward facing step meshes shown in Table 2. The variability in the workstation cycle-time is a function of the half-bandwidth of the PPE. For large bandwidths, the cache utilization of the variable-band solver can degrade, as shown by the Indigo-2 timings. In contrast, the element cycle-time is increased by approximately one order of magnitude when an iterative solver is used to solve the PPE. However, the explicit algorithm exhibits a scalar-vector speedup of approximately 10 on the CRAY Y-MP with either the direct or iterative pressure solver, revealing the high degree of vectorization. Further, the remarkable cycle-times on the Power Indigo-2 indicate the effective cache utilization due to the use of the vector/cache blocks. The performance data for the vector/cache based machines discussed above leads to a baseline cycle-time of 10 μs for the parallel explicit algorithm. Thus, the parallel performance is evaluated in terms of the number of processors, i.e., the parallel granularity,
required to achieve a 10 μs cycle-time for a given problem, consistent with what can be delivered on a traditional vector supercomputer or a high performance workstation. Note that this comparison is biased in favor of the timings for the CRAY Y-MP because a highly vectorized direct solver is being used for the pressure, with a single off-line factorization and a single resolve per time step. In contrast, the parallel algorithms rely upon an iterative approach to solving the pressure equation. With this measure of performance, the explicit algorithm is assessed for relative parallel efficiency and speedup. The methodology used to compute the scaled speedup and efficiency relies upon an accurate measurement of the time spent performing communication (t_comm) and the time spent doing computation (t_grind). The parallel efficiency is computed as η_p = (t_grind - t_comm)/t_grind [36]. With the scaled efficiency, the scaled speedup is computed as S = N_p η_p. Alternatively, the FLOP rate can be measured and used to estimate the scaled speedup. Note that idle time due to load imbalance is included in the measurement of t_comm, but was found to be negligible.

Figure 11a shows the scaled speedup and efficiency curves for the backward facing step meshes in Table 2. Each mesh was decomposed twice to yield sub-domains with 500 or 1000 elements per processor in order to evaluate the element cycle-time in terms of the degree of parallel granularity. Fewer elements per processor, i.e., more processors for a fixed problem size, indicates a reduced degree of granularity or an increased degree of parallelism. The efficiency for each computation is computed based upon the percentage of measured time spent doing communication versus computation. This efficiency is then used to compute a relative scaled speedup. The scaled efficiencies for the backward-facing step computations range between 89% and 98%, with the higher efficiencies associated with the SSOR sub-domain PPE preconditioner. Note that the SSOR sub-domain PPE preconditioner yields more work per processor relative to the EBE/JPCG solver without any additional communications overhead, thus a higher parallel efficiency. The SSOR sub-domain preconditioner reduces the required number of iterates by approximately a factor of 2. However, the total solution time for the PPE remains constant since the SSOR preconditioner effectively requires a second matrix-vector multiply, albeit without communication overhead. A multi-color, vectorized SSOR preconditioner based upon the Eisenstat procedure [34, 35] eliminates this problem, but this preconditioner has not been implemented in parallel. The anomalous efficiency results for the calculations with 500 elements per processor are due to a domain decomposition that yielded increased off-processor communication due to an increased number of sub-domain edges.

Based upon the performance results for the backward facing step, the degree of granularity for the vortex shedding calculations was set at 350 elements per processor. The speedup and efficiency curves for the vortex shedding computations based upon the coarse, medium and fine meshes are shown in Figure 11b. Parallel efficiencies and speedups similar to those for the backward facing step are achieved with approximately 350 elements assigned to each processor. The SSOR sub-domain preconditioner yields efficiencies that are about 5% higher than the EBE/JPCG solver. However, the element cycle-times are 10-20% higher for the SSOR preconditioner.
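For reference, the efficiency and scaled-speedup definitions used above reduce to a two-line calculation; the sketch below simply evaluates those definitions from measured per-step times and is not tied to any particular timing harness.

    def scaled_speedup(t_grind, t_comm, num_procs):
        """Scaled parallel efficiency eta_p = (t_grind - t_comm)/t_grind and
        scaled speedup S = N_p * eta_p from measured per-step times."""
        eta = (t_grind - t_comm) / t_grind
        return eta, num_procs * eta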
The speedup and efficiency results for the post and plate computations are shown in Figure 11c, again for the coarse, medium and fine mesh resolutions discussed above. Here, the effect of the increased 3-D communication overhead may be seen in the somewhat lower parallel efficiencies. It should be noted that higher efficiencies may be obtained by assigning more elements per processor, i.e., more computational work per processor, but there is a concomitant increase in the element cycle-time. The parallel efficiencies presented here are associated with the time that a user is willing to wait for the computation to complete, and the element cycle-time is the metric used to indicate this time. For example, increasing the number of processors from 64 to 256 for the medium resolution mesh reduces the cycle-time by a factor of approximately 3. Thus, the trade-off for the reduced parallel efficiency due to a reduced granularity is a gain in the element cycle-time, and ultimately in the time to deliver a solution.
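To make the cycle-time metric and the granularity trade-off concrete, the short sketch below converts a wall-clock measurement into an element cycle-time. The mesh size, step count, and wall-clock times are assumed values for illustration, not measurements from these runs.

# Sketch: element cycle-time in microseconds per element per time step.
# All numbers below are assumed for illustration only.

def element_cycle_time_us(wall_time_s, n_steps, n_elements):
    # Compute time required to advance one element one time step.
    return wall_time_s * 1.0e6 / (n_steps * n_elements)

n_elements = 72000                      # e.g., a medium-resolution mesh
for n_procs, wall_time_s in ((64, 2250.0), (256, 750.0)):
    ct = element_cycle_time_us(wall_time_s, n_steps=1000, n_elements=n_elements)
    print("%3d processors: cycle-time = %.1f microseconds" % (n_procs, ct))

# Quadrupling the processor count in this example cuts the cycle-time by
# roughly a factor of 3, at the cost of some parallel efficiency because of
# the reduced granularity.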
5 Conclusions

From the results presented above, there are two primary aspects of the parallel explicit algorithm that impact the parallel efficiency. The first is communication, and the second is the number of conjugate gradient iterations required to solve the PPE. It has been observed that the number of EBE/JPCG iterates required to solve the PPE is roughly proportional to √Nel in two dimensions. Thus, increasing Nel by 4x requires 2x more EBE/JPCG steps to achieve a converged PPE solution. As the problem size increases, the number of PPE conjugate gradient iterates increases with an associated increase in the amount of communication overhead. The SSOR sub-domain preconditioner not only provides more parallel work per processor, but the fact that it reduces the number of iterates by a factor of two translates directly into a factor of two reduction in communication overhead. Although the SSOR sub-domain preconditioner offers a reasonable balance between memory and communication requirements, convergence properties, and robustness, the element cycle-times for this preconditioner make it less attractive than the EBE/JPCG solver for some problems. Specifically, the SSOR sub-domain preconditioner performs better for problems that contain a pressure singularity, e.g., the backward facing step.

In order to achieve reasonable element cycle-times on the Intel Paragon, the degree of granularity, i.e., the number of elements assigned to each processor, is approximately 250 elements for the explicit time integration algorithm. This translates into a memory requirement per processor of less than 1 MegaByte - less than 6% of the total memory per processor. To a certain degree, this indicates that the Paragon provides a poor balance between good network speed and relatively poor processor speed, ultimately forcing a higher degree of granularity in order to achieve the desired element cycle-times. However, the fact that reasonable element cycle-times with high resolution meshes can be achieved with the parallel explicit algorithm suggests that this approach is suitable for large eddy simulations (LES) where the time scales associated with an eddy-turnover time must be resolved.
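The √Nel growth of the iteration count gives a simple rule of thumb for how the PPE cost, and hence the communication volume, grows under mesh refinement. The sketch below extrapolates from an assumed reference mesh and iteration count; both reference values are hypothetical and serve only to illustrate the scaling.

# Sketch: extrapolating the PPE iteration count using the observed 2-D
# scaling, iterations ~ sqrt(Nel).  The reference values are assumptions.
import math

def estimated_iterations(n_el, n_el_ref=8000, iters_ref=400):
    return iters_ref * math.sqrt(n_el / n_el_ref)

for n_el in (8000, 32000, 128000):
    print("Nel = %6d: ~%4.0f EBE/JPCG iterations per PPE solve"
          % (n_el, estimated_iterations(n_el)))

# Each 4x increase in Nel roughly doubles the iterates, and with them the
# number of communication steps per PPE solve; a preconditioner that halves
# the iterates therefore also roughly halves the communication overhead.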
This effort has demonstrated the effectiveness of the domain-decomposition message-passing paradigm for the explicit time-integration algorithm for the incompressible, viscous Navier-Stokes equations. The decomposition based upon elements rather than nodes yields reasonable scalability due to the high proportion of computational work associated with the PPE. Although the parallel explicit algorithm is promising, its utility has been in mapping the parallel path for semi-implicit, projection based solution algorithms for the Navier-Stokes equations. A publication on the application of the domain-based parallelism to a second-order projection algorithm is in preparation. In addition, efforts are underway to develop more efficient parallel preconditioners for the solution of the PPE.
Acknowledgments

I would like to acknowledge Shawn Burns, Lee Taylor, and James Peery for their helpful suggestions on the preparation of this manuscript. This research was performed in part using parallel computing resources located at Sandia National Laboratories.
References

[1] The Federal High Performance Computing Program, Executive Office of the President, Office of Science and Technology, pp. 49-50 (September 8, 1989).
[2] A. Kovacs and M. Kawahara, "A Finite Element Scheme Based on the Velocity Correction Method for the Solution of the Time-Dependent Incompressible Navier-Stokes Equations," International Journal for Numerical Methods in Fluids, Vol. 13, pp. 403-423 (1991).
[3] V. Palanisamy and M. Kawahara, "A Fractional Step Arbitrary Lagrangian-Eulerian Finite Element Method for Free Surface Density Flows," International Journal of Computational Fluid Dynamics, Vol. 1, pp. 57-77 (1993).
[4] P. M. Gresho and R. Sani, "On Pressure Boundary Conditions for the Incompressible Navier-Stokes Equations," International Journal for Numerical Methods in Fluids, Vol. 7, p. 1111 (1987).
[5] P. M. Gresho, R. L. Lee, and R. Sani, "On the time-dependent solution of the Navier-Stokes equations in two and three dimensions," Recent Advances in Numerical Methods in Fluids, Vol. 1, Pineridge Press, Swansea, U.K., p. 27 (1980).
[6] T. J. R. Hughes, The Finite Element Method, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, pp. 423-429, pp. 490-512 (1987).
[7] O. C. Zienkiewicz and R. L. Taylor, The Finite Element Method, Volume 2, Fourth edition, McGraw-Hill Book Company Limited, Berkshire, England, pp. 404-437 (1991).
[8] P. M. Gresho, S. T. Chan, R. L. Lee, and C. D. Upson, "A Modified Finite Element Method for Solving the Time-Dependent, Incompressible Navier-Stokes Equations. Part 1: Theory," International Journal for Numerical Methods in Fluids, Vol. 4, pp. 557-598 (1984).
[9] M. A. Christon and T. E. Spelce, "Visualization of High Resolution Three-Dimensional Nonlinear Finite Element Analyses," Proceedings of Visualization '92, Boston, MA, pp. 324-331 (1992).
[10] M. A. Christon, "Visualization Methods for High-Resolution, Transient, 3-D, Finite Element Simulations," Proceedings of the International Workshop on Visualization, Paderborn, Germany, January 18-20 (1994).
[11] G. L. Goudreau and J. O. Hallquist, "Recent Developments in Large-Scale Finite Element Lagrangian Hydrocode Technology," Computer Methods in Applied Mechanics and Engineering, Vol. 33, p. 725 (1982).
[12] T. Belytschko, J. S. Ong, W. K. Liu, and J. M. Kennedy, "Hourglass Control in Linear and Nonlinear Problems," Computer Methods in Applied Mechanics and Engineering, Vol. 43, pp. 251-276 (1984).
[13] W. K. Liu and T. Belytschko, "Efficient Linear and Nonlinear Heat Conduction with a Quadrilateral Element," International Journal for Numerical Methods in Engineering, Vol. 20, pp. 931-948 (1984).
[14] W. K. Liu, J. S. Ong, and R. A. Uras, "Finite Element Stabilization Matrices," Computer Methods in Applied Mechanics and Engineering, Vol. 53, pp. 13-46 (1985).
[15] R. G. Whirley and B. E. Engelmann, "DYNA3D: A Nonlinear, Explicit, Three-Dimensional Finite Element Code for Solid and Structural Mechanics - User Manual," Lawrence Livermore National Laboratory, UCRL-MA-107254, Rev. 1 (1993).
[16] D. R. Kincaid, T. C. Oppe, and D. M. Young, ITPACKV 2D User's Guide, Center for Numerical Analysis, The University of Texas at Austin, May (1989).
[17] C. Farhat and M. Lesoinne, "Automatic Partitioning of Unstructured Meshes for the Parallel Solution of Problems in Computational Mechanics," International Journal for Numerical Methods in Engineering, Vol. 36, pp. 745-764 (1993).
[18] A. Pothen, H. Simon, and K. Liou, "Partitioning Sparse Matrices with Eigenvectors of Graphs," SIAM J. Matrix Anal., Vol. 11, pp. 430-452 (1990).
[19] H. D. Simon, "Partitioning of Unstructured Problems for Parallel Processing," Computing Systems in Engineering, Vol. 2, p. 135 (1991).
[20] S. T. Barnard and H. D. Simon, "A Fast Multilevel Implementation of Recursive Spectral Bisection for Partitioning Unstructured Problems," Proc. 6th SIAM Conf. Parallel Processing for Scientific Computing, SIAM, pp. 711-718 (1993).
[21] B. Hendrickson and R. Leland, "An Improved Spectral Graph Partitioning Algorithm for Mapping Parallel Computations," Sandia National Laboratories Report, SAND92-1460 (September, 1992).
[22] B. Hendrickson and R. Leland, "Multidimensional Spectral Load Balancing," Sandia National Laboratories Report, SAND93-0074 (January, 1993).
[23] B. Hendrickson and R. Leland, "A Multilevel Algorithm for Partitioning Graphs," Sandia National Laboratories Report, SAND93-1301 (October, 1993).
[24] B. Hendrickson and R. Leland, "The Chaco User's Guide - Version 1.0," Sandia National Laboratories Report, SAND93-2339 (October, 1993).
[25] J. D. Maltby, Lawrence Livermore National Laboratory, personal communication (October, 1993).
[26] R. M. Ferencz, "Element-by-element Preconditioning Techniques for Large-Scale, Vectorized Finite Element Analysis in Nonlinear Solid and Structural Mechanics," Ph.D. Thesis, Stanford University (1989).
[27] O. Axelsson, Iterative Solution Methods, Cambridge University Press, New York (1994).
[28] C. J. Freitas, "Perspective: Selected Benchmarks from Commercial CFD Codes," Journal of Fluids Engineering, Vol. 117, pp. 208-218 (1995).
[29] D. K. Gartling, "A Test Problem for Outflow Boundary Conditions - Flow Over A Backward-Facing Step," International Journal for Numerical Methods in Fluids, Vol. 11, pp. 953-967 (1990).
[30] B. F. Armaly, F. Durst, J. C. F. Pereira, and B. Schönung, "Experimental and Theoretical Investigation of Backward-Facing Step Flow," Journal of Fluid Mechanics, Vol. 127, pp. 473-496 (1983).
[31] M. S. Engelman and M. A. Jamnia, "Transient Flow Past a Circular Cylinder: A Benchmark Solution," International Journal for Numerical Methods in Fluids, Vol. 11, pp. 985-1000 (1990).
[32] W. J. Devenport and R. L. Simpson, "Time-dependent and time-averaged turbulence structure near the nose of a wing-body junction," Journal of Fluid Mechanics, Vol. 210, pp. 23-55 (1990).
[33] O. O. Storaasli, D. T. Nguyen, and T. K. Agarwal, "A Parallel-Vector Algorithm for Rapid Structural Analysis on High-Performance Computers," NASA Technical Memorandum 102614 (April, 1990).
[34] S. C. Eisenstat, "Efficient Implementation of a Class of Preconditioned Conjugate Gradient Methods," SIAM J. Sci. Stat. Comput., Vol. 2, No. 1, pp. 1-4 (1981).
[35] S. C. Eisenstat, J. M. Ortega, and C. T. Vaughan, "Efficient Implementation of a Class of Preconditioned Conjugate Gradient Methods," SIAM J. Sci. Stat. Comput., Vol. 11, No. 5, pp. 859-872 (1990).
[36] J. L. Gustafson, G. R. Montry, and R. E. Benner, "Development of Parallel Methods for a 1024-Processor Hypercube," SIAM Journal on Scientific and Statistical Computing, Vol. 9, No. 4, pp. 609-638 (1988).
Figure 1: Flow domain for conservation equations.
Figure 2: Velocity mesh with two degrees-of-freedom (DOF) per node, and the PPE dual grid with one DOF per element.
Figure 3: Grid parameters: a) Two-dimensional element with centroid velocity and characteristic element dimensions hξ and hη, b) Three-dimensional element with centroid velocity and characteristic dimensions hξ, hη, and hζ.
Figure 4: Four processor spatial domain decomposition with element coloring shown for vector/cache blocks.
Figure 5: Bit-array for the Maltby-mapping procedure, where Nnp is the global number of nodes.
Figure 6: Parallel assembly procedure: a) Sequential Mesh, b) 2-Processor sub-domain mapping, c) Parallel assembly mapping.
Figure 7: Iteration count as a function of the relaxation parameter, ω, for the SSOR preconditioned conjugate gradient algorithm (SSOR/PCG solver).
Figure 8: Backward facing step at 400 time units: a) and b) Pressure field, c) and d) Vorticity field.
Figure 9: Re = 100 flow past a circular cylinder at 150 time units: a) vorticity field, b) pressure field.
Figure 10: Re = 100 flow past the post and plate at 300 time units. Left column: Pressure, Helicity (isosurface values ±0.15), and Enstrophy. Right column: X-Vorticity (isosurface values ±0.75), Y-Vorticity (isosurface values ±0.75), and Z-Vorticity field.
Figure 11: Scaled speedup and efficiency curves for the explicit algorithm: a) backward facing step, b) cylinder vortex shedding, c) flow past post and plate.
Table 1: Memory requirements and operation counts for a matrix-vector multiply for various element integration rules, stabilization operators, and storage schemes. (Nel = number of elements, Nnp = number of nodes, EBE = element-by-element)

Dimension   Element Formulation             Storage   +        ×         Total Op.
2-D         1-pt., EBE                      4Nel      12Nel    16Nel     28Nel
2-D         1-pt., h-stabilization, EBE     4Nel      22Nel    18Nel     40Nel
2-D         1-pt., γ-stabilization, EBE     9Nel      19Nel    25Nel     44Nel
2-D         2x2 Quadrature, EBE             10Nel     16Nel    16Nel     32Nel
2-D         2x2, ITPACKV [16]               9Nnp      8Nnp     9Nnp      17Nnp
3-D         1-pt., EBE                      12Nel     29Nel    28Nel     57Nel
3-D         1-pt., h-stabilization, EBE     12Nel     69Nel    46Nel     115Nel
3-D         1-pt., γ-stabilization, EBE     45Nel     61Nel    100Nel    161Nel
3-D         2x2x2 Quadrature, EBE           36Nel     64Nel    64Nel     128Nel
3-D         2x2x2, ITPACKV [16]             27Nnp     26Nnp    27Nnp     53Nnp
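For context on the EBE rows of Table 1, the following minimal sketch shows a generic element-by-element matrix-vector product of the kind tallied above: the element matrices are applied to gathered local values and the results are scattered back, so only per-element storage is required and no global matrix is ever assembled. The 4-node connectivity, full (unsymmetric) element storage, and array names are illustrative assumptions, not the paper's specific PPE operator.

# Sketch: element-by-element (EBE) matrix-vector product y = A*x without
# assembling the global matrix.  Shapes assume 4-node elements; names and
# storage layout are illustrative, not taken from the original code.
import numpy as np

def ebe_matvec(elem_mats, connectivity, x):
    # elem_mats:    (Nel, 4, 4) element matrices (full storage for simplicity)
    # connectivity: (Nel, 4) global DOF numbers for each element
    # x:            (Ndof,) global vector
    y = np.zeros_like(x)
    for a_e, dofs in zip(elem_mats, connectivity):
        y_e = a_e @ x[dofs]            # gather, then local 4x4 multiply
        np.add.at(y, dofs, y_e)        # scatter-add into the global result
    return y

# Per element this costs 16 multiplies and 12 adds in the local product plus
# 4 adds in the scatter, consistent with the 2x2 quadrature EBE row above.
# In the parallel code the scatter-add would be followed by an exchange of
# partial sums at sub-domain boundary DOFs.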
Table 2: Separation lengths for the backward facing step, ReH = 800. Note that the element side lengths are for the uniform region of the mesh. (l1 - length from the step face to the lower re-attachment point, l2 - length from the step face to the upper separation point, l3 - length of the upper separation bubble)

Case            No. of Elements   No. of Unknowns   Element Size (Δx = Δy)   l1     l2     l3
B               8000              24842             0.100                    5.41   4.71    9.87
C               32000             97682             0.050                    5.89   4.89   10.24
D               72000             218522            0.033                    6.03   4.91   10.34
E               128000            387362            0.025                    6.03   4.90   10.37
Gartling [29]   32000             355362            0.025                    6.10   4.85   10.48
Armaly [30]     n/a               n/a               n/a                      7.00   5.30    9.40
Table 3: Vortex shedding mesh parameters. (Δri: radial mesh spacing at the cylinder surface, Δro: radial mesh spacing at x = 4.)

Case     No. of Elements   No. of Unknowns   Δri      Δro      Time Step
Coarse   2800              8584              0.0375   0.3375   0.0400
Medium   11200             33968             0.0188   0.1688   0.0200
Fine     25200             76152             0.0100   0.1125   0.0125