The extended unsymmetric frontal solution for multiple-point constraints

P. Areias ≀⋆, T. Rabczuk • and J. Infante Barbosa ≀

January 30, 2014

≀ Physics Department, University of Évora, Colégio Luís António Verney, Rua Romão Ramalho, 59, 7002-554 Évora, Portugal
• Institute of Structural Mechanics, Bauhaus-University Weimar, Marienstraße 15, 99423 Weimar, Germany
⋆ ICIST
Abstract

Interconnected multiple-point constraints can be topologically ordered and enforced by dense matrix multiplication, as recently shown by Areias et al., where a sparse Gaussian solver was used. For equality constraints, both the solution and all intermediate reactions are obtained directly from the solution of a reduced linear system. However, standard sparse solvers require the additional step of symbolic assembling and typically do not make use of dense linear algebra kernels: indirect addressing is required during decomposition, which limits the use of dense kernels and the effectiveness of shared-memory parallelism. In this work a new approach is presented for the frontal solution method within the multiple-point constraint framework. An a-priori pivot and front sequence is established. Some of the constraint enforcement calculations are performed during the (local) assembling stage. This has been found to speed up forward elimination and back-substitution and conforms directly to the concepts set out in our previous work. Element ordering makes use of a variant of Sloan's algorithm in which the first stage uses exact level sets. When compared with both standard sparse solvers and classical frontal implementations, memory requirements and code size are significantly reduced. A complete software package in Fortran 2003 is described. Examples of clique-based problems are shown with large systems solved in core.
1 Frontal solution and multiple-point constraints: an introduction
The frontal linear solution method was first published in 1970 by Bruce Irons [24] for symmetric systems arising from clique problems (specifically finite elements). In the classical frontal solution, interleaved local "assembling" (clique summation in graph terminology) and forward elimination allow, at the cost of a non-optimal operation count, very large problems to be solved with limited memory resources, cf. [17], especially since factors can be stored out-of-core. For structurally symmetric but numerically unsymmetric problems, Hood presented a modified frontal code [22, 18, 15]. More recent implementations of the frontal solution have introduced several optimizations (cf. [15, 16]), the most significant being the pre-ordering of elements, cf. [35, 32, 29]. In contrast with classical sparse LU factorizations, dense DAXPY and DGER kernels (cf. [27]) can be employed in the elimination and DDOT operations can be adopted in the back-substitution stage. In addition, we can exploit the dense form of the "front" to enforce constraints by pre- and post-multiplication, a fact adopted in this work. For serial execution, performance is seldom competitive with the best sparse solvers based on multifrontal or classical sparse decomposition. However, the frontal method still retains some advantages:

• The possibility of fine-tuning the code with techniques from sparse linear algebra.
• Competitive performance for elements containing a large number of degrees-of-freedom, such as in shell problems and multi-physics problems. If supervariable identification is performed, state-of-the-art frontal methods can outperform standard sparse Gauss decomposition for these problems.
• Lower memory requirements than classical sparse decomposition for many finite element problems (after ordering).
• Shared-memory multiple-core parallelism opportunities in dense BLAS [27] operations with OpenMP [11] in the local assembling, forward elimination and back-substitution stages.

Known shortcomings of frontal methods with respect to sparse Gaussian decomposition are:

• More arithmetic operations than required by classical sparse decomposition (which, in turn, is more complex, cf. [20]).
• Dense or banded factors, compared with classical sparse decompositions, which generate sparse factors.
• Fewer opportunities for distributed-memory parallelism than with other methods (although the method is appropriate for integration in a multiple-front decomposition [33]).
• Motion of data in the front and index manipulation during elimination.

With frontal methods, elimination time and storage are strongly affected by clique ordering. We here use a modification of the second stage of Sloan's algorithm [35] applied to the element-element relation indices. Contrasting with other works, we use the exact level set structure in the two stages of Sloan's algorithm, as well as the ratio between the node (i.e. graph node) depth and width to select the pseudo-peripheral elements. The exact level set structure incurs a performance penalty in the symbolic stage, but produces smaller frontwidths and better elimination performance. For nonlinear problems, the symbolic stage is merely a small part of the total analysis time, and larger problems (above 10^5 degrees-of-freedom) greatly benefit from smaller frontwidths. Recently, despite well-discussed performance shortcomings in classical sparse problems, there has been interest in applying frontal algorithms within the so-called multiple-front algorithm (not to be confused with the multifrontal algorithm) with the message passing interface (MPI) [33]. Given a partitioned domain, the idea is to apply the classical frontal approach within each sub-domain (in parallel) and then obtain several remaining fronts which are solved in serial. This has motivated the development of the present work, along with the imposition of multiple-point constraints, see also [8]. The attractiveness of frontal algorithms lies in a less intricate source code for implementing multiple-point constraints than with a classical sparse solver, as can be consulted in [7]. It is possible to partition discrete engineering problems into two classes: additive constituents (finite elements including continuum, loading, contact elements and other smooth and non-smooth force elements) and multiplicative constituents (equality constraints, master-slave relations, rigid parts and arc-length constraints). This classification tolerates some overlapping, and a definite choice is usually made by considerations of efficiency. Specific formulations of many such constituents are provided in the book by Belytschko et al. [10]. Since Lagrange multipliers are calculated but not explicitly considered in the present approach, the sequence of pivots can be pre-assigned in each clique, allowing a very simple frontal implementation.
Concerning multiplicative components, although all equality constraints can be imposed with Lagrange multipliers, this is often not cost-effective if a direct sparse solver is adopted (see, e.g. [23]). Essential boundary conditions are a good example of the effectiveness of the multiplicative components used in many industrial codes¹. The same applies to rod and shell parametrization: director inextensibility is achieved with multiplicative constituents (at the continuum level by coordinate transformation, see e.g. S.S. Antman [6, 5]). Belytschko et al. [10] have shown solid-based shell elements obtained by transformation of degrees-of-freedom of a standard continuum element. When a sequence of constraints exists, generality is limited by the presence of cycles in the resulting degree-of-freedom (DOF) graph, as we shall see, but also by the well-posedness of the resulting discrete system (dependent on the values of the coefficients). To incorporate all constituents prior to the solution we apply transformations to a clique list of additive constituents. Constraints such as rigid motion, kinematic connections and periodic boundary conditions (see, e.g. [9] for such an application) are effectively treated with multiple-point constraints (MPC) or, as a synonym, matrix transformation methods (MTM). In the context of multibody dynamics, these methods are identified as coordinate reduction methods [4]. The relevance of these techniques has increased in recent years for unit cell analysis in multiscale methodologies. Besides these classical problems, now well solved (with strong restrictions in their generality) by commercial software packages, MTM can also be successfully used in solution control.

¹ Usually, the affected coefficients are implicitly multiplied by zero, which is equivalent to the removal of the equations.
Figure 1: Classification of common discretization components as either additive (elements) or multiplicative (MPC). (Figure labels include: MPCs (essential BC), mirror, rigid link, rigid body, shear band; beam, tetrahedron and shell elements with cracks and internal nodes; nonlocal quadrature points; control equations; contact and interface elements (complementarity); pressure and point load elements; combined meshless arrangements; ALE mesh replacement constraints; debonding; geometric elements.)

Localized arc-length, COD-control and related techniques, which were until now introduced as an "added feature" prone to coding errors and maintenance requirements, are reclassified as equality constraints and therefore MPC. As generalizations of boundary conditions for partial differential equations, essential boundary conditions are also classified as multiplicative constituents and natural boundary conditions as additive constituents; this classification is illustrated in detail in Figure 1. The literature concerned with this subject is often restricted to direct sparse multiplication [3] (unrealistic for large-scale problems, since it creates temporary objects of potentially large size and is not easily parallelized) or to unnested constraints [2, 34]. The present system of linear solution techniques allows competitive performance and greater generality.
2 Main derivations

2.1 Dense forward Gauss elimination with back substitution and the frontal algorithm
The dense version of Gaussian elimination with back substitution can be written as classically depicted in Algorithm 1. With respect to the LU decomposition, frontal solution methods are often based on this variant, since the right-hand side f is obtained by assembling and is not available in any other form. Two important characteristics of the dense version are inherited by the frontal method:

• Fill-in must be accounted for in the frontal version (as in the sparse LU decomposition), preferably a-priori.
• Parallelization of the elimination and substitution loops can be implemented exactly as in the dense case.

Adapting Algorithm 1 to deal with cliques, it is clear that the sums (and subtractions) during elimination can be interleaved with coefficient updating in assembling.
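For concreteness, the following is a minimal runnable sketch (in Python/NumPy, with a toy system of our own choosing) of the procedure in Algorithm 1, shown next; it is not part of the original Fortran package.

    import numpy as np

    def gauss_solve(K, f):
        """Dense forward elimination with back substitution, mirroring Algorithm 1.
        No pivoting is performed (pivots assumed nonzero)."""
        K = np.array(K, dtype=float)
        f = np.array(f, dtype=float)
        n = len(f)
        # forward elimination
        for k in range(n - 1):
            for i in range(k + 1, n):               # parallelizable loop
                xm = K[i, k] / K[k, k]
                K[i, k] = xm                        # store the multiplier (L factor)
                K[i, k + 1:] -= xm * K[k, k + 1:]   # fill-in happens here
                f[i] -= xm * f[k]
        # back substitution
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = (f[i] - K[i, i + 1:] @ x[i + 1:]) / K[i, i]
        return x

    K = [[4.0, 1.0, 0.0], [2.0, 5.0, 1.0], [0.0, 1.0, 3.0]]
    f = [1.0, 2.0, 3.0]
    print(gauss_solve(K, f), np.linalg.solve(K, f))   # both results should agree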
Algorithm 1 Classical dense Gaussian elimination with back substitution. The linear system uses the notation Kx = f. A sparse implementation produces elements (i, j) if there is a degree-of-freedom k related both to i and to j (denoted fill-in, cf. [13]).

  ! *** Forward elimination
  for k = 1, n-1
    for i = k+1, n        ! parallelizable loop
      xm = K[i,k] / K[k,k]
      K[i,k] = xm
      for j = k+1, n
        K[i,j] = K[i,j] - xm * K[k,j]   ! fill-in during elimination
      end for
      f[i] = f[i] - xm * f[k]
    end for
  end for
  ! *** Back substitution
  for i = n, 1, -1
    s = f[i]
    for j = i+1, n        ! parallelizable loop
      s = s - K[i,j] * x[j]
    end for
    x[i] = s / K[i,i]
  end for

2.2 Fixed entity relations

A one-to-one and onto relation between two finite sets (say, T1 and T2) can be represented by a list L_T1T2 such that member I2 ∈ T2 is related to member I1 ∈ T1 according to:

I2 = L_T1T2[I1]    (1)

Its transpose L_T1T2^T is denoted L_T2T1 and relates T2 and T1:

I1 = L_T2T1[I2]    (2)
A permutation can of course be represented by these forms. We use the symbols #(T1) and #(T2) for the cardinality of T1 and T2, respectively. Indices I1 and I2, representing entities, are assumed to lie in the intervals T1 = {1, ..., #(T1)} and T2 = {1, ..., #(T2)}, respectively. If that is not the case, an additional mapping must be established to obtain these intervals². Along the same lines, a many-to-many relation, not necessarily one-to-one or onto, can be represented by two lists:

M_T1T2 = {I_T1T2, J_T1T2}    (3)

where

I2 = J_T1T2[I_T1T2[I1] - 1 + K2]    (4)

with K2 being the local position of object I2 in the relation between sets T1 and T2. A possible interpretation of (4) is: given the local object K2 in the relation between member I1 and set T2, its global number is given by (4). The degree, or number of destinations, of I1 is simply given by I_T1T2[I1 + 1] - I_T1T2[I1]. For example, if (3) represents the relation between the set of elements (T1 = E) and degrees-of-freedom (T2 = G), given element I1 ∈ T1 and the local degree-of-freedom K2, then I2 ∈ T2 is the corresponding global degree-of-freedom. One point worth retaining is that the order of the local objects in (4) is pre-established and not subject to permutation. Using our previous example, the meaning of each local degree-of-freedom in a given element is established by its position K2. The transposition of (3) results in:

M_T2T1 = {I_T2T1, J_T2T1} = M_T1T2^T    (5)

which is obtained with the pseudo-code shown in Algorithm 5 in the Appendix. Of course, in the transposition operation, the local position K2 can be obtained from a third list K_T2T1 such that:

K2 = K_T2T1[I_T2T1[I2] - 1 + K1]    (6)

with

I2 = J_T1T2[I_T1T2[I1] - 1 + K2]    (7)

where

I1 = J_T2T1[I_T2T1[I2] - 1 + K1]    (8)

² Additional indirect addressing or, for dense relations, subsets of T1 and T2.
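As an illustration of the storage (3)-(4), the following minimal Python sketch (with assumed toy index lists, not taken from the paper) shows how a destination and the degree of a member of T1 are retrieved; the 1-based indexing of the text is kept.

    IT1T2 = [1, 4, 6, 9]               # pointers for #(T1) = 3 members; last entry = total + 1
    JT1T2 = [2, 5, 7, 1, 7, 3, 4, 5]   # destinations in T2, stored contiguously per member

    def destination(I1, K2):
        """Global member I2 of T2 that is the K2-th local object of I1, as in (4)."""
        return JT1T2[(IT1T2[I1 - 1] - 1 + K2) - 1]   # trailing -1 converts to 0-based Python indexing

    def degree(I1):
        """Number of destinations of I1: I_T1T2[I1+1] - I_T1T2[I1]."""
        return IT1T2[I1] - IT1T2[I1 - 1]

    print(degree(1), destination(1, 2))   # prints: 3 5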
The reader can note that, in the transposition, the physical meaning of K1 is not inherited from the physical meaning of K2. For example, the local degree-of-freedom of a given element can be given a precise meaning, whereas the local element of a global degree-of-freedom, obtained by transposition, is not guaranteed to be in a meaningful order. We can also observe that (6-8) unveil an ordering in T1. In addition, the natural number N_T2T1(I1, K2) = I_T1T2[I1] - 1 + K2 can also be set in a relation with members of T1 and T2. We call this number a "counter". With these insights, we can now extend the notation M_T2T1 to include the index K_T2T1:

M*_T2T1 = {I_T2T1, J_T2T1, K_T2T1}

and this is now an extended many-to-many relation. Recovering our original list (1), we can observe that it can also represent a many-to-one relation, and in that case the transposition results in a many-to-many representation:

M_T2T1 = L_T1T2^T    (9)
A simple pre-processing of Algorithm 5 is necessary to perform the operation (9); for simplicity, we use a function with the following signature: transpl(cardT1, LT1T2, cardT2, IT2T1, JT2T1, KT2T1). At this point, an extension is performed. A list of objects is, under the present interpretation, also a many-to-one relation. The natural ordering of the list can be seen as an implicit set, which we can identify as T0. In that case, the transposition of L_T0T1, M_T1T0 = L_T0T1^T, will provide the indices or positions of each member of T1. With these definitions in hand, we can join M*_T2T1 and M*_T1T2 using the symmetry symbol:

M*_(T1T2) = {M*_T1T2, M*_T2T1}

Having two many-to-many relations, M_T1T2 and M_T2T3, between T1, T2 and T2, T3, respectively, we obtain the relation between T1 and T3 by means of T2 as:

M_T1T3^(T2) = M_T1T2 · M_T2T3    (10)

which is implemented with the pseudo-code in Algorithm 6, shown in the Appendix. As a particular case, we can relate T1 with itself by means of T2:

M_T1T1^(T2) = M_T1T2 · M_T1T2^T    (11)
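The product (10)-(11) can be sketched with ordinary Python containers (toy data assumed here; the compressed-index version is Algorithm 6 in the Appendix). The example composes a clique/DOF relation with its transpose, which is the same construction used later in (21).

    elem_dofs = {1: {1, 2, 3}, 2: {3, 4}, 3: {4, 5, 6}}      # M_T1T2 (assumed toy data)

    # transpose: DOF -> cliques
    dof_elems = {}
    for e, dofs in elem_dofs.items():
        for g in dofs:
            dof_elems.setdefault(g, set()).add(e)

    # product M_T1T2 . M_T2T1: cliques related through a shared DOF
    elem_elems = {e: sorted({e2 for g in dofs for e2 in dof_elems[g]})
                  for e, dofs in elem_dofs.items()}
    print(elem_elems)   # {1: [1, 2], 2: [1, 2, 3], 3: [2, 3]}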
Multiplication of many-to-many relations is generally non-commutative. We can observe (cf. [8]) that the assembling process relates degrees-of-freedom between themselves by means of cliques. In that case, T1 = G is the set of global degrees-of-freedom and T2 = E is the set of cliques, see also [21]. An important aspect is that, when multiplying two matrices, additional information can be obtained concerning the origin of the result. We omit the procedure to obtain this, as it is a simple extension of the many-to-many multiplication. The sum of compatible many-to-many relations is also easily obtained. Suppose M^I_T1T2 and M^II_T1T2 are two compatible many-to-many relations between sets T1 and T2. The sum (S) of these two relations is obtained by merging the corresponding J indices:

M^S_T1T2 = M^I_T1T2 + M^II_T1T2    (12)
The corresponding algorithm is straightforward and omitted for conciseness. Certain functions over M*_(T1T2) are of practical importance. For example, it is often required to obtain the set of members of T1, L1 ⊂ T1, satisfying a sequence of dependencies in T2: L2 ⊂ T2. Using the colon (:) as the indicator of an array section, we define L1 as follows:

L1 = {I1 ∈ T1 | J_T1T2[I_T1T2[I1] : I_T1T2[I1 + 1] - 1] = L2}    (13)

where the equal sign indicates an equivalence relation. Either value-ordered lists, naturally ordered lists or unordered comparisons may be considered. Generally, this function is defined as:

L1 = g_MT1T2(L2)    (14)
Also, the search for a particular member of T2 can be performed by maintaining two simultaneous lists (naturally ordered and value-ordered) and performing a binary search in the value-ordered one. When two subsets (T1a ⊂ T1 and T1b ⊂ T1) of T1 are related, we write the many-to-many relation as M_T1aT1b. Consider now a sequence of relations between subsets of T1: M_T1i+1T1i. Multiplication is only possible when, as in the dense case, the dimensions and, in this case, the entity types are compatible. Since multiplication is not commutative, the correct form of multiplication can depend on a topological ordering. In the case of a many-to-many relation between two subsets T1a ⊂ T1 and T1b ⊂ T1 of the original set T1, and when each member of T1a is related to all members of T1b, we can write this relation using only two lists, L_T1T1a and L_T1T1b. Since this can be seen as a dense many-to-many relation between subsets of T1, we can name it D (as in dense) and write:

D_T1aT1b = {L_T1T1a, L_T1T1b}    (15)

This is convenient, since a dense relation can make use of dense kernels, especially linear algebra packages such as BLAS, avoiding indirect addressing, temporary workspaces and motion of data. When T1a = T1b, a further simplification can be performed and we have a clique, i.e. a subset of T1 × T1. In that case, it suffices to write C_T1a as:

C_T1a = {L_T1T1a}    (16)
As in the previous case, operations on cliques are simpler than operations on general sparse many-to-many relations and do not require housekeeping. Note that the classical "assembling" operation can be generalized in one of the following (equivalent) versions:

Σ_{a=1}^{Na} C_T1a ,   ( Σ_{a=1}^{Na} C_T1a )^T    (17)
with Na = #({T1a}), where {T1a} is the set of subsets of T1. A note is required concerning the transposition: for reasons of searching, transposition returns the entities in increasing order.

A path of length k between two entities I1_1 ∈ T1 and I1_k ∈ T1 in a relation M_T1T1^(T2) is a set of distinct entities {I1_1, I1_2, ..., I1_{k-1}, I1_k} such that I1_{i-1} and I1_i are connected. The distance between two entities I1 and J1 in a relation M_T1T1^(T2) is the length of the shortest path between them and is denoted d(I1, J1). The diameter of a relation M_T1T1^(T2) is the maximum distance between any two entities in T1:

D(M_T1T1^(T2)) = max{d(I1, J1) : I1 ∈ T1 and J1 ∈ T1}    (18)

An entity I1 ∈ T1 is called peripheral in a relation M_T1T1^(T2) if there is another entity J1 ∈ T1 such that d(I1, J1) = D(M_T1T1^(T2)). Entities I1 and J1 are then both said to be peripheral. The determination of peripheral entities is too costly for use within the frontal method. Therefore, we instead determine pseudo-peripheral entities. A related concept in many-to-many relations, with consequences for entity ordering, is the level set of a given entity I1 ∈ T1 in a relation M_T1T1^(T2). The class of all level sets of I1 is composed of the sets:

L1(I1), L2(I1), L3(I1), ..., Lh(I1)    (19)

where h is the depth of the class of level sets. The width of a given level Li is its cardinality: wi = #(Li). The width of the class is defined as w = max_i wi. The class of level sets (19) is defined by: L1(I1) = {I1} and, for i > 1, Li(I1) is the set of all entities related to entities in Li-1 but not belonging to any Lk with k < i. All entities in the set Li+1(I1) are at distance i from entity I1.
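A short Python sketch of the level-set construction (19) on an assumed toy relation is given below; the depth h and width w defined above are the quantities used for pseudo-peripheral selection (see (22) in Section 2.4).

    def level_sets(adj, start):
        """Class of level sets of `start` in the relation given by the adjacency dict `adj`."""
        levels = [[start]]
        seen = {start}
        while True:
            nxt = [j for i in levels[-1] for j in adj[i] if j not in seen]
            if not nxt:
                break
            nxt = list(dict.fromkeys(nxt))   # unique entities, order preserved
            seen.update(nxt)
            levels.append(nxt)
        return levels

    adj = {1: [2], 2: [1, 3, 4], 3: [2, 5], 4: [2, 5], 5: [3, 4]}   # assumed toy relation
    L = level_sets(adj, 1)
    h = len(L)                    # depth of the class of level sets
    w = max(len(l) for l in L)    # width of the class
    print(L, h, w)                # [[1], [2], [3, 4], [5]] 4 2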
2.3 Liveness analysis
Returning to the analysis of a many-to-many relation, we can focus on the counter N_T2T1(I1, K2) = I_T1T2[I1] - 1 + K2 which, being a mapping between the natural ordering of the relation between entities in T2 and entities in T1, can be used to establish a "life region" of I1 in T2. If the transpose retains ordering (as is the case of the proposed Algorithm 5), then, while stepping from 1 up to I_T2T1[#(T2) + 1] - 1, we know where a certain entity I1 occurs for the first and last time, as well as the corresponding entities in T2. Specifically, if we consider the set of cliques E (with typical member e) and the set of degrees-of-freedom G (with typical member g), two life regions are defined:

• Strict life region of an entity g ∈ G in the relation between G and E, with the counter interval:
  I_EG[J_GE[I_GE[g]]] - 1 + K_GE[I_GE[g]]  :  I_EG[J_GE[I_GE[g + 1] - 1]] - 1 + K_GE[I_GE[g + 1] - 1]
• In terms of cliques:
  J_GE[I_GE[g]]  :  J_GE[I_GE[g + 1] - 1]

This allows a straightforward implementation of the frontal solution algorithm with minimal housekeeping code.
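The clique-level life region of each degree-of-freedom amounts to its first and last appearing clique in assembly order, as in the second bullet above. A minimal Python sketch with assumed toy connectivity:

    cliques = {1: [1, 2, 3], 2: [2, 3, 4], 3: [4, 5]}   # clique -> DOFs, in assembly order

    first, last = {}, {}
    for e in sorted(cliques):          # step through the cliques in assembly order
        for g in cliques[e]:
            first.setdefault(g, e)     # first appearance: the DOF enters the front here
            last[g] = e                # last appearance: the DOF can be eliminated here
    print(first)   # {1: 1, 2: 1, 3: 1, 4: 2, 5: 3}
    print(last)    # {1: 1, 2: 2, 3: 2, 4: 3, 5: 3}

These dictionaries play the role of the LSTARTELEMDOF and LFINISHELEMDOF arrays computed in Algorithm 3 (there with the reordered clique numbering).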
2.4 Clique permutation for front width reduction
Clique permutation for frontwidth reduction (cf. [35, 26]) is here based on:

• Determination of two peripheral or pseudo-peripheral cliques using exact level sets (a simplification was proposed by George and Liu [19]).
• Graph traversal based on neighborhood relations between cliques, which are progressively numbered between the two pseudo-peripheral cliques.

We follow the work of Wang et al. [36]³ for the determination of the pseudo-peripheral cliques and the implementation of Kumfert and Pothen [26] for the permutation (but with exact level sets). A variant of the Sloan algorithm is adopted with the following priority function for each element e ∈ E, with E being the set of cliques:

P_e = -W1 [ D(M_EE^(G)) / w(M_EE^(G)) ] Gain(e) + W2 d(e, f)    (20)

where W1 and W2 are the weights of the gain and of the distance to the finish clique, respectively. The many-to-many relation M_EE^(G) is defined as:

M_EE^(G) = M_EG · M_EG^T    (21)

where M_EG is the clique/degree-of-freedom connectivity. The function Gain(e) estimates the front growth caused by selecting clique e as the subsequent clique. Two pseudo-peripheral (pseudo-diametrically opposed) cliques are determined using the level set construction: s ∈ E (start) and f ∈ E (finish). This ordered pair is obtained from two non-ordered pseudo-peripheral elements, of which the start is the one minimizing:

w(s) / h(s)    (22)

³ We confirm the advantages of their approach.
Both the priority function and the selection of the s element are new to this work and resulted from extensive experimentation. For a related algorithm by Kumfert and Pothen [26], it was reported that the optimal weights W1 and W2 depend on the particular problem to be solved. This was subsequently confirmed with more numerical data by Reid and Scott [30]. For the problems solved in this work, we use W1 = 4 and W2 = 1.
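A schematic greedy loop driven by a priority of the form (20) is sketched below; the gain, the distance to the finish clique, the diameter D and the width w are assumed to be precomputed, and the actual implementation follows Kumfert and Pothen [26] with exact level sets rather than this simplified selection.

    def order_cliques(adj, start, gain, dist_to_finish, D, w, W1=4.0, W2=1.0):
        """Greedy clique ordering: repeatedly pick the active clique with the largest
        priority P_e = -W1*(D/w)*gain[e] + W2*dist_to_finish[e], cf. (20)."""
        def priority(e):
            return -W1 * (D / w) * gain[e] + W2 * dist_to_finish[e]
        order, done = [], set()
        active = {start}
        while active:
            e = max(active, key=priority)
            order.append(e)
            done.add(e)
            active.discard(e)
            active.update(c for c in adj[e] if c not in done)
        return order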
2.5 Multiple-point constraints: solving for the Lagrange multipliers
Consider a set of m constraints g(a), each one a function of the global set of original degrees-of-freedom a. If the unconstrained problem is stated in terms of n + m degrees-of-freedom, a partition of the discrete system of equations can be established:

f = [ f_r ] = [ 0_{n×1} ]    (23)
    [ f_s ]   [ 0_{m×1} ]

where the subscripts s and r indicate slave and retained degrees-of-freedom, respectively (this notation will be clarified below in the text). A partition of the original n + m degrees-of-freedom is achieved similarly:

a = [ a_r ]    (24)
    [ a_s ]

If a_r and a_s are independent, then a scalar version of (23) can be written as:

δa_s · f_s + δa_r · f_r = 0,   ∀ δa_s ∈ R^m, ∀ δa_r ∈ R^n    (25-26)
For the simplified situation where m = 1⁴, a single Lagrange multiplier λ is adopted to enforce the constraint. In that case, application of the Newton method results in the following iteration:

[ f_ss + λ° g_ss   f_sr + λ° g_sr   g_s ] [ Δa_s ]   [ -f_s ]
[ f_rs + λ° g_rs   f_rr + λ° g_rr   g_r ] [ Δa_r ] = [ -f_r ]
[ g_s              g_r^T            0   ] [  λ   ]   [ -g   ]

where λ° is the preceding iteration value of λ, and

f_ss = ∂f_s/∂a_s    (27)
f_sr = ∂f_s/∂a_r    (28)
f_rs = ∂f_r/∂a_s    (29)
f_rr = ∂f_r/∂a_r    (30)
g_s = ∂g/∂a_s    (31)
g_r = ∂g/∂a_r    (32)
g_ss = ∂g_s/∂a_s    (33)
g_sr = ∂g_s/∂a_r    (34)
g_rs = ∂g_r/∂a_s    (35)
g_rr = ∂g_r/∂a_r    (36)

⁴ This in no way inhibits the general case m ≠ 1.
From this, and using the notation K_ss = f_ss + λ° g_ss, K_sr = f_sr + λ° g_sr, etc., we can write⁵:

Δa_s = -g_s^{-1} g_r^T Δa_r - g_s^{-1} g    (37)

Introducing b = -g_s^{-1} g and T = -g_s^{-1} g_r^T, we can write (37) as:

Δa_s = T Δa_r + b    (38)

We can now determine λ as part of the solution:

λ = -g_s^{-1} (K_ss T + K_sr) Δa_r - g_s^{-1} (f_s + K_ss b)    (39)

It is worth noting that, upon convergence, Δa_r = 0 and b = 0. The Lagrange multiplier at the solution (here denoted as λ*) is then simply given as:

λ* = -g_s^{-1} f_s    (40)

⁵ The existence of g_s^{-1} is called the immersion property, cf. [8].
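A minimal NumPy sketch of the single-constraint condensation (37)-(39), with assumed random data (one slave degree-of-freedom and n retained ones), is given below; it solves the full bordered Newton system above directly and checks that (37)-(39) reproduce the same Δa_s and λ.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4                                        # retained DOFs; m = 1 constraint, one slave DOF
    Kss = 4.0
    Ksr = rng.normal(size=n)
    Krs = rng.normal(size=n)
    Krr = 5.0 * np.eye(n) + rng.normal(size=(n, n))
    fs = rng.normal()
    fr = rng.normal(size=n)
    gs, gr, g = 2.0, rng.normal(size=n), 0.3     # linearized constraint data

    # full bordered Newton system (the 3x3 block system above)
    A = np.zeros((n + 2, n + 2))
    A[0, 0] = Kss;      A[0, 1:n+1] = Ksr;      A[0, n+1] = gs
    A[1:n+1, 0] = Krs;  A[1:n+1, 1:n+1] = Krr;  A[1:n+1, n+1] = gr
    A[n+1, 0] = gs;     A[n+1, 1:n+1] = gr      # last diagonal entry stays 0
    rhs = np.concatenate(([-fs], -fr, [-g]))
    sol = np.linalg.solve(A, rhs)
    das, dar, lam = sol[0], sol[1:n+1], sol[n+1]

    # condensation (37)-(39)
    T = -gr / gs                                 # T = -gs^{-1} gr^T
    b = -g / gs
    das_c = T @ dar + b                                          # equation (38)
    lam_c = -(Kss * (T @ dar) + Ksr @ dar + fs + Kss * b) / gs   # equation (39)
    print(np.isclose(das_c, das), np.isclose(lam_c, lam))        # True True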
A straightforward calculation allows the writing of the following reduced system:

(K_rr + K_rs T + T^T K_ss T + T^T K_sr) Δa_r = -[ f_r + T^T f_s + (T^T K_ss + K_rs) b ]    (41)

where the left-hand-side matrix is denoted K*_rr and the bracketed right-hand-side vector is denoted f*_r. By adding columns in T corresponding to the degrees-of-freedom in r, besides the ones corresponding to s (this implies an ordering of the original degrees-of-freedom), to form a new transformation matrix T*, we can rewrite (41) as:

T*^T K T* Δa_r = -T*^T (f + K b*)    (42)
where b* is obtained from b by padding the r-rows with zeros. After Δa_r is calculated from the solution of (42), we can calculate Δa_s from (38) and λ from (39). This is a generalization of the multiple-point constraint methodology described in the inaugural paper of Abel and Shephard [2] and subsequent works [34, 12, 31, 3]. Note that second derivatives of a single constraint are included in the terms K_rr, K_rs, K_ss and K_sr. It is also a generalization of our recent work on multiple-point constraints (cf. [8]), since the slave degrees-of-freedom second derivatives are not assumed to be zero. In the spirit of the original frontal method (cf. [24]), the solver returns the degrees-of-freedom solution and the reactions (more generally, Lagrange multipliers). These must be simultaneously updated in the Newton-Raphson iteration. For several constraints, we can use the same approach if they are preordered (cf. [8]). Assuming that an order of constraint application is pre-established by a topological ordering, each constraint beyond the first one will be applied to an already constrained system. It is obvious that if a certain constraint only affects degrees-of-freedom of the unconstrained system, it should be among the last to be applied. If we assume all constraints to be interconnected, then an ordered sequence must follow according to the closeness to the original degrees-of-freedom. The generalization of the slave update formula for m interconnected equality constraints is presented, after the preliminary step of topological ordering, as:

T* = ∏_{l=m}^{1} T*_l    (43)

b*_m = Σ_{l=1}^{m} [ ( ∏_{p=m}^{l+1} T*_p ) b*_l ]    (44)
where the largest l indices correspond to the innermost constraints. Using the previous notation (43-44), if constraint l - 1 is applied following constraint l and this follows l + 1, we can write:

T*_l^T K_{l+1} T*_l Δa_l = -T*_l^T (f_{l+1} + K_{l+1} b*_l)    (45)

where K_l denotes T*_l^T K_{l+1} T*_l and f_l denotes T*_l^T (f_{l+1} + K_{l+1} b*_l). Equivalently,

T*_{l-1}^T T*_l^T K_{l+1} T*_l T*_{l-1} Δa_{l-1} = -T*_{l-1}^T T*_l^T [ (f_{l+1} + K_{l+1} b*_l) + K_{l+1} T*_l b*_{l-1} ]    (46)
from which we can define b_l* as b*_l + K_{l+1} T*_l b*_{l-1}. Note that, since not all degrees-of-freedom participate in the constraints, the transformation matrices T*_l contain the appropriate unit diagonal entries corresponding to these. Both T*_l and b*_l are sparse, but with different properties: in the sparse T*-matrices there are 1's for the degrees-of-freedom that remain active, and in the sparse b*-vectors the corresponding entries are 0. This perspective of interconnected⁶ constraints is motivated by classical static analysis. Fill-in (or profile) concerns during Gauss decomposition are described in earlier works [2, 34, 12] but are less critical for our frontal method, since the reordering of cliques is effected in the innermost part of the linear solver.

⁶ This nomenclature is also adopted in textbook statics, e.g. [28].
Figure 2: Specific DOF distribution: directed graph and Hasse diagram. Collapse of DOF destinations. (The figure shows the directed graph and the Hasse diagram for eight DOFs, the acyclic test and topological sort, and the DOF destination lists: before collapse, M_T1T1O with T1 = G gives 1: (none), 2: 2, 3: 2, 4: 2, 5: 5, 6: 4,3, 7: 5,3,1, 8: 5,3,1; after collapse (and sum), M_T1T1N gives 1: (none), 2: 2, 3: 2, 4: 2, 5: 5, 6: 2, 7: 2,5, 8: 2,5.)

The user must specify T* and b*, either obtained explicitly from knowledge of the problem or by pre-processing the constraint in the form g(a) = 0. Compared with classical frontal algorithms, the total number of cliques (or generalized elements) is the sum of the number of elements and the number of MPC. This is due to the presence of the second derivatives (33-36) of the constraints.
2.6 Processing of constraints with operations on cliques
We introduce the notion of the extension number en_i of a degree-of-freedom i. This is the cardinality of the set of masters (i.e. retained DOFs) connected to degree-of-freedom i. Degrees-of-freedom which do not participate as slaves in an MPC have unitary extension numbers (and are considered their own masters). Slave degrees-of-freedom can have any non-negative en_i (for example, essential boundary conditions result in a null en_i). Consider the DOF arrangement of Figure 2, where the directed graph and the Hasse diagram for this arrangement are shown. Traversing the Hasse diagram from the top, we obtain the correct sequence for DOF processing. Note that if the graph contains a cycle, the problem is ill-posed. The one in the figure, if self-loops are not accounted for, is acyclic [25]. Self-loops are only allowed here in non-slave DOFs (i.e. a slave DOF cannot master itself). In the sequence of operations entailed by the DOF graph in Figure 2, it can also be observed that DOFs are sorted by their inter-dependence. In this case, after traversal, only two DOFs survive: 2 and 5. Surviving DOFs are characterized by having no proper outgoing edges. The following properties from graph theory [25] are relevant for our application (proofs are given in that reference):

• A partially ordered set corresponds to an acyclic directed graph.
• Every directed acyclic graph admits a topological ordering.
• The resulting DOF depth is at most 2, and can be made exactly either 2 or 0.

User input must guarantee that the digraph is acyclic (a test is performed at the ordering stage) and a partial ordering must be established from the DOF edges. This ordering puts the degrees-of-freedom in topological order [25], cf. Algorithm 7. As discussed in a previous work [8], a solvable problem results in a directed acyclic graph (DAG). Since the numerical operations follow the Hasse diagram, we convert the relation M_T1T1O to the relation M_T1T1N by performing the operations in Algorithm 2 with T1 = G, where G is the set of degrees-of-freedom and M_T1T1N is the relation between DOFs after topological ordering. This scheduling of DOF processing is required to avoid repetitions in the calculation of the transformation matrices.
Algorithm 2 Conversion from M_T1T1O to M_T1T1N. TOP[I1] contains the entity with topological position I1.

  ...
  for I1 = 1, cardT1
    J1 = TOP[I1]
    K = 0
    for K2 = IT1T1O[J1], IT1T1O[J1+1]-1
      K = K + IT1T1N[JT1T1O[K2]]
    end for
    IT1T1N[J1] = K
  end for
  LOL = IT1T1N[1]
  IT1T1N[1] = 1
  for I1 = 1, cardT1
    NEWN = IT1T1N[I1] + LOL
    I11 = I1 + 1
    LOL = IT1T1N[I11]
    IT1T1N[I11] = NEWN
  end for
  for I1 = 1, cardT1
    if (IT1T1O[I1+1] == IT1T1O[I1]+1) then
      if (JT1T1O[IT1T1O[I1]] == I1) JT1T1N[IT1T1N[I1]] = JT1T1O[IT1T1O[I1]]
    end if
  end for
  for I1 = 1, cardT1
    J1 = TOP[I1]
    K = 0
    for J = IT1T1O[J1], IT1T1O[J1+1]-1
      K1 = JT1T1O[J]
      for L = IT1T1N[K1], IT1T1N[K1+1]-1
        K = K + 1
        JT1T1N[IT1T1N[J1]-1+K] = JT1T1N[L]
      end for
    end for
  end for
  ...
As can be observed in Figure 2, there are no repetitions⁷ in the processing of the sequences of DOFs if the order of the Hasse diagram is followed. Multiplication of the transformation matrices benefits from this procedure (further details are given in [8]). Non-slave nodes have unit T*-coefficients, whereas the T*-coefficients of slave nodes depend on the specific constraint imposed.

⁷ To simplify the routines, we retain the multiplications by 1 for self-masters.
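The collapse of DOF destinations in Figure 2 can be reproduced with a few lines of Python; the destination lists below are the ones quoted in the figure, and memoized recursion is used here as a shortcut for the topological processing of Algorithm 2.

    dest_old = {1: [], 2: [2], 3: [2], 4: [2], 5: [5],
                6: [4, 3], 7: [5, 3, 1], 8: [5, 3, 1]}   # M_T1T1O of Figure 2

    def resolve(dof, dest, cache):
        """Collapse the destinations of `dof` onto its surviving (self-mastered) masters."""
        if dof not in cache:
            out = []
            for m in dest[dof]:
                out.extend([m] if m == dof else resolve(m, dest, cache))
            cache[dof] = list(dict.fromkeys(out))   # unique masters, in order of first visit
        return cache[dof]

    cache = {}
    dest_new = {d: resolve(d, dest_old, cache) for d in dest_old}
    print(dest_new[6], dest_new[7], dest_new[8])   # [2] [5, 2] [5, 2]  (cf. M_T1T1N: 6:2, 7:2,5, 8:2,5)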
3 Unsymmetric frontal algorithm: assembling and Gauss elimination
The complete frontal algorithm is partitioned into a symbolic part (whose pseudo-code is presented in Algorithm 3) and a numeric part, shown in Algorithm 4. In the symbolic part, the following quantities are determined:

• The elements of arrival and exit of each degree-of-freedom (IDOF), stored in the arrays LSTARTELEMDOF and LFINISHELEMDOF, respectively.
• The maximum frontwidth (MXFRONT), as well as the lower and upper indices for the variable-frontwidth elimination (the MNFROEL and MXFROEL arrays, respectively).
• The position in the frontal array of each eliminated degree-of-freedom (IDOF), in the array LFRODOF.
• The elimination position of each degree-of-freedom, in the array DOFPOS, and the degree-of-freedom for each elimination position, in the array POSDOF. This is a one-to-one relation and provides the classical elimination sequence (or tree), cf. [13].

The actual forward elimination and backward substitution are performed in Algorithm 4, where OpenMP [11] directives are used in the local assembling, elimination and substitution. Distinctive features are:

• Absence of numerical pivoting.
• Use of OpenMP directives.
• Use of a variable frontwidth, determined in Algorithm 3.
• Non-symmetric factors stored in the buffers ROWDOF and COLDOF, where the latter has reduced dimensions, since it is not used in backward substitution.
• Use of the extended notion of elements (i.e. cliques) to account for the second derivatives in the constraints.
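The maximum frontwidth bookkeeping of Algorithm 3 reduces to a single pass over the per-clique counts of entering and leaving degrees-of-freedom. The Python sketch below uses the counts implied by pstartelemdof and pfinishelemdof of the prototype problem in Table 1 and recovers mxfront = 8.

    # entering[i]/leaving[i]: DOFs appearing for the first/last time in reordered clique i+1,
    # i.e. the consecutive differences of pstartelemdof and pfinishelemdof in Table 1
    entering = [5, 4, 4, 4, 2, 0, 0, 0]
    leaving  = [1, 4, 4, 4, 6, 0, 0, 0]

    ninp = nout = mxfront = 0
    for e, l in zip(entering, leaving):
        ninp += e                                # DOFs brought into the front so far
        mxfront = max(mxfront, ninp - nout)      # front size before this clique's eliminations
        nout += l                                # DOFs eliminated after this clique
    print(mxfront)   # 8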
Figure 3: Prototype problem: 5 elements and 3 MPC, corresponding to essential boundary conditions. (The figure shows the original clique numbers, the reordered clique numbers and the global active degrees-of-freedom. NB: each constraint produces a clique, or pseudo-element, as well as a multiplicative constituent; in this case the multiplicative constituents erase 3 original degrees-of-freedom. Cliques 1 to 5 originate from elements; cliques 6 to 8 originate from multiple-point constraints, here essential boundary conditions.)

It is worth noting that, with the exception of local assembling, the OpenMP directives are the same as in classical Gaussian elimination. The frontal solution code is very concise and robust for an adequate a-priori pivoting.
The frontal solution method in Algorithms 3 and 4 does not incorporate the boundary conditions, which are introduced as MPCs by the transformation matrices.
4 Step-by-step example
To study the evolution of the many-to-many index arrays from the connectivity data, we will now inspect the problem shown in Figure 3. It consists of five elements and three essential boundary conditions. Frontal solution for this problem is based on the set of index arrays shown in Table 1. With these indices, frontal solution becomes trivial and it is completely general, since no assumption is made concerning the cliques and the degrees-of-freedom.
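The transposed lists in Table 1 can be checked directly; the short Python script below rebuilds pdofelem and ddofelem from pelemdof and delemdof (1-based values kept as in the table) and reproduces the entries listed there.

    pelemdof = [1, 9, 17, 25, 31, 36, 36, 36, 36]
    delemdof = [1, 2, 3, 4, 5, 6, 7, 8,
                9, 10, 11, 12, 5, 6, 3, 4,
                13, 14, 11, 12, 9, 10, 15, 16,
                15, 16, 17, 18, 13, 14,
                19, 1, 2, 7, 8]
    ngdof = 19

    per_dof = {g: [] for g in range(1, ngdof + 1)}   # cliques touching each DOF, in clique order
    for e in range(1, len(pelemdof)):                # cliques 1..8 (6-8 are empty MPC cliques)
        for k in range(pelemdof[e - 1], pelemdof[e]):
            per_dof[delemdof[k - 1]].append(e)       # k is a 1-based counter, hence k-1

    pdofelem, ddofelem = [1], []
    for g in range(1, ngdof + 1):
        ddofelem += per_dof[g]
        pdofelem.append(pdofelem[-1] + len(per_dof[g]))
    print(pdofelem)   # {1,3,5,...,33,34,35,36} as in Table 1
    print(ddofelem)   # {1,5,1,5,...,4,4,5} as in Table 1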
5 Tests
For comparison with our unifrontal code, we use the MA37 subroutine (described in [18]) from the Harwell software library [1]. This implements a multifrontal algorithm with minimum degree ordering. Three typical linear finite element problems are solved: a square plate modeled with quadrilateral shell elements, a cube modeled with tetrahedral elements and a square-section beam modeled with tetrahedral elements (see Figure 4). Shared-memory parallelism is adopted (with OpenMP) on a single-processor, 8-core desktop machine. The following conclusions can be drawn from the results shown in Table 2:

• Frontal solutions are somewhat competitive for small problems with a significant number of degrees-of-freedom per node (such as shell problems). Larger problems, due to sparsity, usually tend to favor traditional sparse algorithms.
• Frontal solutions are relatively competitive for problems with slender geometries. However, the symbolic analysis is usually too costly for linear problems. In nonlinear problems, where hundreds of linear solutions can be performed, this shortcoming is diluted.
• Problems with dense connectivities, such as 3D blocks, tend to benefit frontal solutions. Memory access patterns in sparse algorithms are the bottleneck.
• Many large problems can be solved in core successfully with frontal methods but not with traditional sparse algorithms. In problems #2 and #3 this occurs.
• Overall, the new frontal implementation is competitive, often faster in several tests than the widely established (in finite element contexts) MA37 subroutine, and slightly slower in other tests.
Algorithm 3 Symbolic frontal pseudo-code. NELEM is the total number of cliques, NGDOF is the total number of active degrees-of-freedom.

  sym_unifrontal
    transp(NELEM, PELEMDOF, DELEMDOF, NGDOF, PDOFELEM, DDOFELEM, IJDOF)
    mult(NELEM, PELEMDOF, DELEMDOF, NGDOF, PDOFELEM, DDOFELEM, NELEM, PELEMELEM, DELEMELEM)
    reorder(NELEM, PELEMELEM, DELEMELEM, NEWE)
    for IDOF = 1, NGDOF
      IELEMMIN = NELEM
      IELEMMAX = 1
      for IK = PDOFELEM[IDOF], PDOFELEM[IDOF+1]-1
        IELEMMIN = MIN(IELEMMIN, NEWE[DDOFELEM[IK]])
        IELEMMAX = MAX(IELEMMAX, NEWE[DDOFELEM[IK]])
      end for
      LSTARTELEMDOF[IDOF] = IELEMMIN
      LFINISHELEMDOF[IDOF] = IELEMMAX
    end for
    transpl(NGDOF, LFINISHELEMDOF, NELEM, PFINISHELEMDOF, DFINISHELEMDOF, IJFINISH)
    transpl(NGDOF, LSTARTELEMDOF, NELEM, PSTARTELEMDOF, DSTARTELEMDOF, IJSTART)
    MXFRONT = 0
    NINP = 0
    NOUT = 0
    for IELEM = 1, NELEM
      NINP = NINP + PSTARTELEMDOF[IELEM+1] - PSTARTELEMDOF[IELEM]
      MXFRONT = MAX(MXFRONT, NINP - NOUT)
      NOUT = NOUT + PFINISHELEMDOF[IELEM+1] - PFINISHELEMDOF[IELEM]
    end for
    MXFROEL[1] = 0
    MNFROEL[1] = HUGE
    MGDOF = 0
    for IELEM = 1, NELEM
      for IKOUNT = PSTARTELEMDOF[IELEM], PSTARTELEMDOF[IELEM+1]-1
        IDOF = DSTARTELEMDOF[IKOUNT]
        IFOUND = 0
        for IFRO = MNFROEL[IELEM], MXFROEL[IELEM]
          if (ISBUSY[IFRO] == 0) then
            LFRODOF[IDOF] = IFRO
            ISBUSY[IFRO] = 1
            IFOUND = 1
            EXIT
          end if
        end for
        if (IFOUND == 0) then
          for IFRO = 1, MXFRONT
            if (ISBUSY[IFRO] == 0) then
              LFRODOF[IDOF] = IFRO
              ISBUSY[IFRO] = 1
              EXIT
            end if
          end for
        end if
        MNFROEL[IELEM] = MIN(MNFROEL[IELEM], LFRODOF[IDOF])
        MXFROEL[IELEM] = MAX(MXFROEL[IELEM], LFRODOF[IDOF])
      end for
      if (IELEM /= NELEM) MXFROEL[IELEM+1] = MXFROEL[IELEM]
      if (IELEM /= NELEM) MNFROEL[IELEM+1] = MNFROEL[IELEM]
      for IKOUNT = PFINISHELEMDOF[IELEM], PFINISHELEMDOF[IELEM+1]-1
        IDOF = DFINISHELEMDOF[IKOUNT]
        IFRO = LFRODOF[IDOF]
        MGDOF = MGDOF + 1
        DOFPOS[IDOF] = MGDOF
        POSDOF[MGDOF] = IDOF
        if (IFRO == MNFROEL[IELEM+1]) MNFROEL[IELEM+1] = MNFROEL[IELEM+1] + 1
        if (IFRO == MXFROEL[IELEM+1]) MXFROEL[IELEM+1] = MXFROEL[IELEM+1] - 1
        ISBUSY[IFRO] = 0
      end for
    end for
Algorithm 4 Numeric frontal pseudo-code and OpenMP directives.

  num_unifrontal
    RESF = 0.0E00
    GLOAD = 0.0E00
    GSTIF = 0.0E00
    for IELEM = 1, NELEM
      OLDE[NEWE[IELEM]] = IELEM
    end for
    for KELEM = 1, NELEM
      IELEM = OLDE[KELEM]
      NELV = PELEMDOF[IELEM+1] - PELEMDOF[IELEM]
      obtainsclique(IELEM, NELV, FOREL[1:NELV], STIFEL[1:NELV,1:NELV])
      !$OMP PARALLEL DO SHARED(GLOAD, GSTIF)
      for ICOUNT = PELEMDOF[IELEM], PELEMDOF[IELEM+1]-1
        IDOF = DELEMDOF[ICOUNT]
        IFRO = LFRODOF[IDOF]
        ILOC = ICOUNT + 1 - PELEMDOF[IELEM]
        GLOAD[IFRO] = GLOAD[IFRO] + FOREL[ILOC]
        for JCOUNT = PELEMDOF[IELEM], PELEMDOF[IELEM+1]-1
          JDOF = DELEMDOF[JCOUNT]
          JFRO = LFRODOF[JDOF]
          JLOC = JCOUNT + 1 - PELEMDOF[IELEM]
          GSTIF[JFRO, IFRO] = GSTIF[JFRO, IFRO] + STIFEL[JLOC, ILOC]
        end for
      end for
      !$OMP END PARALLEL DO
      IELEM = KELEM
      MINF = MNFROEL[IELEM]
      MAXF = MXFROEL[IELEM]
      ISTART = PFINISHELEMDOF[IELEM]
      IFINISH = PFINISHELEMDOF[IELEM+1]-1
      for IKOUNT = ISTART, IFINISH
        IDOF = DFINISHELEMDOF[IKOUNT]
        IPOS = DOFPOS[IDOF]
        IFRO = LFRODOF[IDOF]
        for JFRO = MINF, MAXF
          COLDOF[JFRO] = 0.0E00
          if (ABS(GSTIF[JFRO, IFRO]) > TOLS OR ABS(GSTIF[IFRO, JFRO]) > TOLS) then
            ROWDOF[JFRO, IPOS] = GSTIF[JFRO, IFRO]
            COLDOF[JFRO] = GSTIF[IFRO, JFRO]
          end if
        end for
        EQRHS[IPOS] = GLOAD[IFRO]
        RESF = RESF + ABS(EQRHS[IPOS])
        RCONST = -1.0E00 / ROWDOF[IFRO, IPOS]
        !$OMP PARALLEL DO SHARED(GLOAD, GSTIF)
        for JFRO = MINF, MAXF
          SCONST = RCONST * COLDOF[JFRO]
          if (ABS(SCONST) > TOLS) then
            for KFRO = MINF, MAXF
              GSTIF[KFRO, JFRO] = GSTIF[KFRO, JFRO] + SCONST * ROWDOF[KFRO, IPOS]
            end for
            GLOAD[JFRO] = GLOAD[JFRO] + EQRHS[IPOS] * SCONST
          end if
        end for
        !$OMP END PARALLEL DO
      end for
    end for
    for IPOS = MGDOF, 1, -1
      IDOF = POSDOF[IPOS]
      IFRO = LFRODOF[IDOF]
      RGASH = 0.0E00
      SGASH = 0.0E00
      !$OMP PARALLEL DO REDUCTION(-:RGASH)
      for KFRO = 1, IFRO-1
        RGASH = RGASH - VECRV[KFRO] * ROWDOF[KFRO, IPOS]
      end for
      !$OMP END PARALLEL DO
      RGASH = RGASH + EQRHS[IPOS]
      !$OMP PARALLEL DO REDUCTION(-:SGASH)
      for KFRO = IFRO+1, MXFRONT
        SGASH = SGASH - VECRV[KFRO] * ROWDOF[KFRO, IPOS]
      end for
      !$OMP END PARALLEL DO
      RGASH = RGASH + SGASH
      EQRHS[IPOS] = RGASH
      VECRV[IFRO] = EQRHS[IPOS] / ROWDOF[IFRO, IPOS]
      SOL[IDOF] = VECRV[IFRO]
    end for
Table 1: Relevant variables and index lists for the prototype problem of Figure 3.

nelem = 8  (number of cliques: elements + MPC)
ngdof = 19  (number of active degrees-of-freedom)
pelemdof = {1,9,17,25,31,36,36,36,36}  (pointer to the beginning of each clique)
delemdof = {1,2,3,4,5,6,7,8, 9,10,11,12,5,6,3,4, 13,14,11,12,9,10,15,16, 15,16,17,18,13,14, 19,1,2,7,8}  (degree-of-freedom pointed to by pelemdof)
pdofelem = {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,34,35,36}  (pointer to the beginning of each degree-of-freedom)
ddofelem = {1,5,1,5,1,2,1,2,1,2,1,2,1,5,1,5,2,3,2,3,2,3,2,3,3,4,3,4,3,4,3,4,4,4,5}  (clique pointed to by pdofelem)
pelemelem = {1,4,7,10,12,14,14,14,14}  (pointer to the beginning of each clique in the clique/clique relation)
delemelem = {1,5,2,2,3,1,3,4,2,3,4,5,1}  (clique pointed to by pelemelem)
olde = {5,1,2,3,4,6,7,8}  (original clique number)
newe = {2,3,4,5,1,6,7,8}  (clique number after renumbering)
{n1,n2} = {5,4}  (pseudo-peripheral cliques)
lstartelemdof = {1,1,2,2,2,2,1,1,3,3,3,3,4,4,4,4,5,5,1}  (start clique of each degree-of-freedom)
lfinishelemdof = {2,2,3,3,3,3,2,2,4,4,4,4,5,5,5,5,5,5,1}  (finish clique of each degree-of-freedom)
pstartelemdof = {1,6,10,14,18,20,20,20,20}  (pointer to the beginning of each clique in dstartelemdof)
dstartelemdof = {1,2,7,8,19,3,4,5,6,9,10,11,12,13,14,15,16,17,18}  (start degree-of-freedom for each clique index)
pfinishelemdof = {1,2,6,10,14,20,20,20,20}  (pointer to the beginning of each clique in dfinishelemdof)
dfinishelemdof = {19,1,2,7,8,3,4,5,6,9,10,11,12,13,14,15,16,17,18}  (finish degree-of-freedom for each clique index)
lfrodof = {1,2,5,6,7,8,3,4,4,3,2,1,5,6,7,8,2,3,5}  (front position of each degree-of-freedom)
mnfroel = {1,1,1,1,2,4,4,4}  (first index in the front for each clique)
mxfroel = {5,8,8,8,8,7,7,7}  (last index in the front for each clique)
dofpos = {2,3,6,7,8,9,4,5,10,11,12,13,14,15,16,17,18,19,1}  (elimination order of each degree-of-freedom)
posdof = {19,1,2,7,8,3,4,5,6,9,10,11,12,13,14,15,16,17,18}  (global degree-of-freedom number for each eliminated degree-of-freedom)
mxfront = 8  (maximum frontwidth)
rmsfront = 5.6236  (RMS frontwidth)
Figure 4: Three typical linear problems solved by the present unifrontal method and HSL MA37 (#1: 5-DOF shell; #2: 3-DOF tetrahedra; #3: 3-DOF tetrahedra). Clamped boundary conditions are used (all around in the plate, a face in #2 and an end-face in #3).

These conclusions have motivated the combination of unifrontal and multifrontal algorithms (see Davis and Duff [14]) to take advantage of the performance characteristics of these algorithms. We note that the reason for the poor time performance of the symbolic analysis is the use of exact level sets, which are not adopted in the classical works ([35] and [26]). Our experience with heuristics in the symbolic part advises against the use of cutoff values, as described by Kumfert and Pothen [26]. In example #2, our unifrontal code performs much better than MA37. For massive structures such as the one in example #2, sparse solvers typically perform poorly because of indirect addressing. Concerning the effectiveness of our clique reordering algorithm, we compare it with the original Sloan algorithm, using the clique-clique graph obtained by sparse multiplication as input. Table 3 presents this comparison. It is worth noting that, although the frontwidths are not always lower than with Sloan's algorithm, in narrow problems (such as #3) our approach allows the solution of larger problems. To further test the parallelization with OpenMP, we use problem #2 with 162000 elements and 86490 degrees-of-freedom. In addition, a multi-CPU machine is used: a DELL PowerEdge R910 server with 256 GB of RAM and 4 Intel XEON processors.
Hard drives are all SAS SSDs with 256 GB each. Figure 5 shows the results for 2, 5, 10, 20, 40 and 80 threads. Both the total clock time and the elimination and substitution time are shown. Up to 40 threads, the elimination+substitution times show an almost linear speedup, although above that there is saturation. This saturation is typically fixed with a multifrontal algorithm. For comparison, MA37 does not benefit from multiple threads.
Table 2: Comparison between the present frontal method and the HSL library MA37 package (best of 5 runs).

Problem            DOFs     Max. frontwidth  Elements   Symbolic analysis [s]    Elimination and substitution [s]
                                                        Frontal     MA37*        Frontal      MA37
#1 (Shell quads)   12201    375              2500       0.0379      0.0270       0.2573       0.3700
                   24081    525              4900       0.1050      0.0569       0.8922       1.2330
                   49401    750              10000      0.3299      0.1240       3.4792       4.0940
                   198801   1500             40000      3.5960      0.6160       85.1454      64.3290
                   311001   1875             62500      8.1910      1.0450       302.1478     133.7950
#2 (Tetrahedra)    3630     483              6000       0.1450      0.0080       0.2891       0.1980
                   9450     897              16464      0.7580      0.0234       1.2236       3.1720
                   26460    1743             48000      5.1530      0.0800       6.1786       47.9000
                   86490    3759             162000     48.3720     0.3110       42.4108      945.1440
                   147852   5415             279936     135.4350    ×            101.1301     ×
#3 (Tetrahedra)    3618     30               4800       0.184       0.0060       0.0603       0.0270
                   7218     27               9600       0.397       0.0099       0.1180       0.0530
                   36018    27               48000      2.973       0.0570       0.5687       0.2920
                   90018    30               120000     44.401      0.1520       1.5754       0.7770
                   216018   33               287999     254.1020    0.4370       3.3587       2.709
                   432000   33               384000     211.9870    ×            7.6941       ×

Compiler: Intel Fortran Compiler 13.1.1 20130313, ifort -O3 -ip -openmp
System: openSUSE 12.3 64 bit, Linux Kernel 3.7.10-1-11
Machine: RAM: 7.8 GB, CPU: Intel Core i7 870 2.93 GHz
Timing function: omp_get_wtime
* Not including explicit assembling, required for MA37 but not for the frontal method. No pivoting is used in MA37 (u = 0).
× Not able to obtain a solution due to insufficient memory.
Table 3: Comparison between the clique reordering and the original Sloan algorithm [35].

Problem            DOFs     Maximum frontwidth     RMS frontwidth
                            Present    Sloan       Present    Sloan
#1 (Shell quads)   12201    375        389         224.83     219.74
                   24081    525        539         326.34     319.61
                   49401    750        809         481.13     492.37
                   198801   1500       1679        1000.12    1074.39
                   311001   1875       2119        1260.08    1374.71
#2 (Tetrahedra)    3630     483        345         336.24     252.18
                   9450     897        645         624.61     480.40
                   26460    1743       1275        1227.06    963.27
                   86490    3759       2805        2673.22    2141.46
                   147852   5415       4011        3843.90    3084.91
#3 (Tetrahedra)    3618     30         78          21.38      18.88
                   7218     27         75          20.52      18.38
                   36018    27         45          20.05      17.92
                   90018    30         78          20.38      17.95
                   216018   33         45          21.67      17.91
                   432000   33         ×           30.52      ×

× Not able to obtain an ordering due to insufficient memory.
Figure 5: Runs with up to 80 threads on a DELL PowerEdge R910 server (4 Intel XEON processors). Problem #2 with 162000 elements. (The plot shows time [s], on a logarithmic scale between 10 and 1000 s, versus the number of threads used in the server, from 0 to 80, for the total time and for the elimination and substitution time.)
6 Conclusions and further developments
We proposed a frontal solution method for unsymmetric systems containing interconnected multiple-point constraints. The frontal solution method is parallelized with OpenMP directives [11], following the same techniques as for the dense case, and it is written in column-major order to retain memory access contiguity. An effort to minimize indirect addressing was also made, since this was found to be of great importance for elimination performance. The source code for this version of the frontal solution is very concise due to the absence of pivoting permutations. Since the frontal solution method is purely clique-based, no assembling is needed (not even symbolic assembling) and therefore a considerable part of our previous derivation [8] is not required. In addition, since the role of degrees-of-freedom in the frontal solution method is limited to the elimination order, degree-of-freedom contraction is not required, simplifying the multiple-point constraint pre-processing required in [8] and reducing the amount of indirect addressing. A topological ordering is performed, similarly to our recent developments [8] but with considerable simplifications. These are due to the use of a clique-based linear solver. MPC sequential processing and reaction calculations were treated, as in the previous work, as path traversal in a directed acyclic graph. Concerning the mass elimination of degrees-of-freedom, a more sophisticated destination algorithm is being developed so that simultaneously eliminated degrees-of-freedom are adjacent in the frontal matrix.
Acknowledgments

The first author gratefully acknowledges the generous help of Professor Iain Duff in obtaining technical documentation of the frontal solution method. He is also grateful to Professor R. Owen for the three-month stay in UCS, Wales, UK, in the year 1999, where interest in the topic emerged. We also gratefully acknowledge the availability of several HSL codes for testing (cf. [1]). The authors gratefully acknowledge financing from the "Fundação para a Ciência e a Tecnologia" under Project PTDC/EME-PME/108751 and the Program COMPETE FCOMP-01-0124-FEDER-010267.
Appendix: Support Algorithms

Algorithm 5 details the transposition of a many-to-many relation (here M_T2T1 is determined from M_T1T2). This algorithm is used to obtain the relation between degrees-of-freedom and cliques from the relation between cliques and degrees-of-freedom. Algorithm 6 performs a multiplication and is here adopted to obtain the relation between cliques using degrees-of-freedom as intermediate entities (T3 and T1 coincide with the clique set E and T2 is the degree-of-freedom set G). Finally, Algorithm 7 is a classical topological ordering implementation (cf. [25]), with a test for the presence of cycles.
Algorithm 5 Determination of M_T2T1 from M_T1T2.

  transp(cardT1, IT1T2, JT1T2, cardT2, IT2T1, JT2T1, KT2T1)
    for I1 = 1, cardT1
      for K1 = IT1T2[I1], IT1T2[I1+1]-1
        I2 = JT1T2[K1]
        IT2T1[I2] = IT2T1[I2] + 1
      end for
    end for
    lol = IT2T1[1]
    IT2T1[1] = 1
    for I2 = 1, cardT2
      new = IT2T1[I2] + lol
      lol = IT2T1[I2+1]
      IT2T1[I2+1] = new
    end for
    for I1 = 1, cardT1
      L = 0
      for K1 = IT1T2[I1], IT1T2[I1+1]-1
        L = L + 1
        I2 = JT1T2[K1]
        next = IT2T1[I2]
        KT2T1[next] = L
        IT2T1[I2] = next + 1
        JT2T1[next] = I1
      end for
    end for
    for I2 = cardT2, 1, -1
      IT2T1[I2+1] = IT2T1[I2]
    end for
    IT2T1[1] = 1
Algorithm 6 Determination of M_T1T3^(T2) from M_T1T2 and M_T2T3.

  mult(cardT1, IT1T2, JT1T2, cardT2, IT2T3, JT2T3, cardT3, IT1T3, JT1T3)
    for I1 = 1, cardT1
      N13 = 0
      for K1 = IT1T2[I1], IT1T2[I1+1]-1
        I2 = JT1T2[K1]
        for K2 = IT2T3[I2], IT2T3[I2+1]-1
          I3 = JT2T3[K2]
          if (WORK[I3] .EQ. 0) then
            N13 = N13 + 1
            WORK[I3] = LAST
            LAST = I3
          end if
        end for
      end for
      IT1T3[I1] = N13
      for I13 = 1, N13
        J = WORK[LAST]
        WORK[LAST] = 0
        LAST = J
      end for
    end for
    LOL = IT1T3[1]
    IT1T3[1] = 1
    for I1 = 1, cardT1
      NEW = IT1T3[I1] + LOL
      LOL = IT1T3[I1+1]
      IT1T3[I1+1] = NEW
    end for
    for I1 = 1, cardT1
      for K1 = IT1T2[I1], IT1T2[I1+1]-1
        I2 = JT1T2[K1]
        for K2 = IT2T3[I2], IT2T3[I2+1]-1
          I3 = JT2T3[K2]
          IP = WORK[I3]
          if (IP .EQ. 0) then
            LEN = LEN + 1
            JT1T3[LEN] = I3
            WORK[I3] = LEN
          end if
        end for
      end for
      for K = IT1T3[I1], LEN
        WORK[JT1T3[K]] = 0
      end for
    end for
Algorithm 7 Verification that a given digraph is acyclic and topological ordering of a many-to-many relation M_T1T1.

  doftop(cardT1, IT1T1, JT1T1, ACYCLIC, TOP)
    M = 1
    for I1 = 1, cardT1
      if (JT1T1[IT1T1[I1]] /= I1) then
        for K = IT1T1[I1], IT1T1[I1+1]-1
          if (JT1T1[K] > 0) IND[JT1T1[K]] = IND[JT1T1[K]] + 1
        end for
      end if
    end for
    IK = 0
    for I1 = 1, cardT1
      if (IND[I1] == 0) then
        IK = IK + 1
        L[cardT1+1-IK] = I1
      end if
    end for
    MK = cardT1
    while (IK /= 0)
      I1 = L[MK]
      MK = MK - 1
      IK = IK - 1
      TOP[M] = I1
      M = M + 1
      if (JT1T1[IT1T1[I1]] /= I1) then
        for J1 = IT1T1[I1], IT1T1[I1+1]-1
          IG = JT1T1[J1]
          if (IG > 0) then
            IND[IG] = IND[IG] - 1
            if (IND[IG] == 0) then
              IK = IK + 1
              L[MK+1-IK] = IG
            end if
          end if
        end for
      end if
    end while
    if (M == cardT1+1) then
      acyclic = TRUE
    else
      acyclic = FALSE
    end if
    for J1 = 1, cardT1/2
      I1 = TOP[cardT1+1-J1]
      TOP[cardT1+1-J1] = TOP[J1]
      TOP[J1] = I1
    end for
References

[1] HSL (2011). A collection of Fortran codes for large scientific computation. http://www.hsl.rl.ac.uk, 2011.
[2] J.F. Abel and M.S. Shephard. An algorithm for multipoint constraints in finite element analysis. Int J Numer Meth Eng, 14(3):464-467, 1979.
[3] M. Ainsworth. Essential boundary conditions and multi-point constraints in finite element analysis. Comp Method Appl M, 190:6323-6339, 2001.
[4] F. Amirouche. Fundamentals of Multibody Dynamics: Theory and Applications. Birkhäuser, 2006.
[5] S.S. Antman. Nonlinear Problems of Elasticity. Springer, second edition, 2005.
[6] S.S. Antman and R.S. Marlow. Material constraints, Lagrange multipliers, and compatibility. Arch Ration Mech An, 116:257-299, 1991.
[7] P. Areias. SIMPLASMPC. http://code.google.com/p/simplasmpc/.
[8] P. Areias, T. Rabczuk, D. Dias da Costa, and E.B. Pires. Implicit solutions with consistent additive and multiplicative components. Finite Elem Anal Des, 57:15-31, 2012.
[9] P. Areias and K. Matouš. Finite element formulation for modeling nonlinear viscoelastic elastomers. Comp Method Appl M, 197:4702-4717, 2008.
[10] T. Belytschko, W.K. Liu, and B. Moran. Nonlinear Finite Elements for Continua and Structures. John Wiley & Sons, 2000.
[11] B. Chapman, G. Jost, and R. van der Pas. Using OpenMP. Portable Shared Memory Parallel Programming. MIT Press, Cambridge, Massachusetts, 2008.
[12] J.I. Curiskis and S. Valliappan. A solution algorithm for linear constraint equations in finite element analysis. Comput Struct, 8:117-124, 1978.
[13] T.A. Davis. Direct Methods for Sparse Linear Systems. SIAM, 2006.
[14] T.A. Davis and I.S. Duff. A combined unifrontal/multifrontal method for unsymmetric sparse matrices. Trans Math Soft-ACM, 25(1):1-20, 1999.
[15] I.S. Duff. Enhancements to the MA32 package for solving sparse unsymmetric equations. Technical Report AERE-R 11009, United Kingdom Atomic Energy Authority, AERE Harwell, Oxfordshire, September 1983.
[16] I.S. Duff. Design features of a frontal code for solving sparse unsymmetric linear systems out-of-core. SIAM J Sci Stat Comput, 5(2):270-, 1984.
[17] I.S. Duff, A.M. Erisman, and J.K. Reid. Direct Methods for Sparse Matrices. Clarendon Press, Oxford, 1986.
[18] I.S. Duff and J.K. Reid. The multifrontal solution of unsymmetric sets of linear equations. SIAM J Sci Stat Comput, 5(3):633-641, 1984.
[19] A. George and J.W.H. Liu. An implementation of a pseudoperipheral node finder. Trans Math Soft-ACM, 5(3):284-295, 1979.
[20] A. Gupta. Recent advances in direct methods for solving unsymmetric sparse systems of linear equations. Trans Math Soft-ACM, 28(3):301-324, 2002.
[21] F.G. Gustavson. Two fast algorithms for sparse matrices: multiplication and permuted transposition. Trans Math Soft-ACM, 4(3):250-269, 1978.
[22] P. Hood. Frontal solution program for unsymmetric matrices. Int J Numer Meth Eng, 10:379-399, 1976.
[23] N.I.M. Gould. On modified factorizations for large-scale linearly constrained optimization. SIAM Journal on Optimization, 9(4):1041-1063, 1999.
[24] B.M. Irons. A frontal solution program for finite element analysis. Int J Numer Meth Eng, 2:5-32, 1970.
[25] D. Jungnickel. Graphs, Networks and Algorithms, volume 5 of Algorithms and Computation in Mathematics. Springer, second edition, 2005.
[26] G. Kumfert and A. Pothen. Two improved algorithms for envelope and wavefront reduction. BIT, 35:1-32, 1997.
[27] C.L. Lawson, R.J. Hanson, D. Kincaid, and F.T. Krogh. Basic linear algebra subprograms for Fortran usage. Trans Math Soft-ACM, 5:308-323, 1979.
[28] J.L. Meriam and L.G. Kraige. Engineering Mechanics: Statics. John Wiley and Sons, fifth edition, 2002.
[29] S. Negre, J.P. Boufflet, J. Carlier, and P. Breitkopf. Improving the finite element ordering for the frontal solver. Revue Européenne des Éléments Finis, 9(8):917-940, 2000.
[30] J.K. Reid and J.A. Scott. Ordering symmetric sparse matrices for small profile and wavefront. Int J Numer Meth Eng, 45:1737-1755, 1999.
[31] W.C. Rheinboldt. Geometric notes on optimization with equality constraints. Applied Mathematics Letters, 9(3):83-87, 1996.
[32] J.A. Scott. On ordering elements for a frontal solver. Commun Numer Meth En, 15:309-323, 1999.
[33] J.A. Scott. Parallel frontal solvers for large sparse linear systems. Trans Math Soft-ACM, 29(4):395-417, 2003.
[34] M.S. Shephard. Linear multipoint constraints applied via transformation as part of a direct stiffness assembly process. Int J Numer Meth Eng, 20:2107-2112, 1984.
[35] S.W. Sloan. An algorithm for profile and wavefront reduction of sparse matrices. Int J Numer Meth Eng, 23:239-251, 1986.
[36] Q. Wang, X.W. Shi, C. Guo, and Y.C. Guo. An improved GPS method with a new pseudo-peripheral nodes finder in finite element analysis. Finite Elem Anal Des, 48:1409-1415, 2012.