Original Article
Improving the parallel efficiency of large-scale structural dynamic analysis using a hierarchical approach
The International Journal of High Performance Computing Applications 1–13 © The Author(s) 2015 Reprints and permissions: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/1094342015581402 hpc.sagepub.com
Xinqiang Miao1, Xianlong Jin1 and Junhong Ding2
Abstract

In order to improve the parallel efficiency of large-scale structural dynamic analysis, a hierarchical approach adapted to the hardware topology of multi-core clusters is proposed. The hierarchical approach is constructed based on the strategies of two-level partitioning and two-level condensation. The data for parallel computing is first prepared through two-level partitioning to guarantee load balancing within and across nodes. Then, during the analysis of each time step, the convergence rate of the interface problem is significantly improved by further reducing its size with two-level condensation. Furthermore, the communication overheads are considerably reduced by separating the intra-node and inter-node communications and minimizing the inter-node communication. Numerical experiments conducted on the Dawning-5000A supercomputer indicate that the hierarchical approach was superior in performance to the conventional Newmark algorithm based on the domain decomposition method.

Keywords
Numerical simulation, multi-core cluster, high performance computing, dynamic analysis, parallel computing
1. Introduction

The finite element method is a powerful numerical tool for studying different phenomena in various areas. Among the many areas of numerical simulation, structural dynamic analyses require much more computational resources than other applications due to the many time steps involved (Li et al., 2006). With the increase in size and complexity of numerical structural models, the growth of the computational cost has outpaced the computational power of a single processor. As a consequence, parallel computers with many processors and memories are employed to meet the requirement in an effective way (Mininni et al., 2011; Paz et al., 2011; Kozubek et al., 2013). Due to its greater computing power and cost-to-performance effectiveness, the multi-core cluster has become one of the most popular parallel computers for solving large-scale problems nowadays (Kayi et al., 2009). In this paper, our research is conducted on a Dawning-5000A supercomputer, which is a typical multi-core cluster located at the Shanghai Supercomputer Center in China. The Dawning-5000A supercomputer is built from multi-core nodes connected by an InfiniBand network. Its architecture is illustrated in Figure 1. As shown in Figure 1, there are two levels of memory model on the Dawning-5000A supercomputer: shared
memory within a single node and distributed memory across nodes. Each node is equipped with four quad-core AMD Barcelona 1.9 GHz processors. The design of the cache organization and memory access model is based on the hierarchical architecture. The caches are organized in a tree-like configuration and the size of the caches increases from the leaves to the root. Each core has its own 512 KB L2 cache. The four cores on the same chip share a 2 MB L3 cache. All the cores in the same node share a 64 GB main memory. In order to reduce memory traffic in the system, it is important to keep the data close to the cores that work on it. There are also two layers of communication channel on a Dawning-5000A supercomputer: intra-node communication and inter-node communication. The communication between two cores in the same node is referred to as intra-node communication, and the communication between two cores on different nodes is referred to as inter-node communication.
1 School of Mechanical Engineering, Shanghai Jiaotong University, China
2 Shanghai Supercomputer Center, China

Corresponding author: Xianlong Jin, School of Mechanical Engineering, Shanghai Jiaotong University, Shanghai 200240, China. Email: [email protected]
Figure 1. Architecture of the Dawning-5000A supercomputer.
The relationship between these cores in a node is tightly coupled, which means they are interconnected by a shared cache or high speed data channel; however, the relationship between nodes is loosely coupled, which means they are interconnected by an InfiniBand network. Therefore, inter-node communication is much slower than intra-node communication on a Dawning-5000A supercomputer. The non-uniform latencies between intra-node and inter-node communications on multi-core clusters introduce new challenges in fully exploiting their computing power to obtain optimal performance.
2. Related work

There is much related work in the literature that focuses on parallel algorithms for finite element analysis. Direct methods provide an exact solution, within numerical error, in a fixed number of operations. Due to their robustness and versatility they are often used for parallel computing in many cases. Farhat and Wilson (1988) presented a parallel active column solver based on the LU decomposition method for sparse and dense symmetric systems of linear equations. They designed versions for both distributed and shared memory systems. George et al. (1989) developed a fan-out algorithm for distributed memory machines. They proposed a subtree-to-subcube mapping to solve large sparse systems with Cholesky decomposition on hypercube machines. Gupta (2006) described a parallel direct solver for general sparse systems of linear equations that had been included in the Watson Sparse Matrix Package (WSMP), and compared the WSMP solver with two similar well-known solvers: MUMPS and SuperLU_DIST. Gueye et al. (2011) presented a new parallel direct solver based on LU factorization of the sparse matrix of the linear system, which automatically detects and properly handles zero-energy modes. Yu and Wang (2014) analyzed the computation and communication costs of the multi-frontal method on hybrid CPU–GPU systems to build timing performance models. They provided theoretical analyses and numerical results to illustrate the characteristics and efficiency of the proposed algorithms.
Although direct methods are robust and versatile, they are prohibitive for large-scale systems because the memory requirements increase rapidly when the size of problems increases. Iterative methods on the other hand require much less memory. In addition, they generally have better scalability for parallel execution. Gullerud and Dodds (2001) proposed a linear preconditioned conjugate gradient (PCG) solver using an element by element framework and described the implementation within a nonlinear implicit finite element code. Rao (2005) presented three formulations combining the domain decomposition-based finite element method with a linear PCG technique for solving large-scale problems in structural mechanics. In the first formulation, the PCG algorithm was applied on the assembled interface stiffness coefficient matrices of all the submeshes. The second formulation operated on the local unassembled submesh matrices and the preconditioner was constructed using the local submesh information. In the third formulation, the sparse PCG algorithm was formulated using the unassembled local Schur complement matrices of submeshes. Bormotin (2013) proposed an iterative method for solving geometrically nonlinear inverse problems of shaping structural elements under creep conditions and implemented it using a software package based on finite element analysis. Wang et al. (2013) presented an iterative method for numerically solving the nonlinear Volterra–Fredholm integral equation. Zhao et al. (2014) proposed a new iterative method to solve the linear equations by changing the problems into smaller scale equations. Although iterative methods require much less memory and are scalable for parallel computing, they do not always converge within a reasonable time. For systems with large condition numbers, they may not converge at all. The hybrid method that combines direct and iterative schemes is derived based on the domain decomposition method (DDM). Unlike other methods, it does not deal with the entire computational domain directly, but divides it into several subdomains. For each subdomain, the internal degrees of freedom are condensed out independently and concurrently. Then, an iterative solver of the parallel PCG algorithm is used to solve
the interface problem. Medek et al. (2007) proposed a quality balancing heuristic that modified classic mesh partitioning so that the partial factorization times were balanced. It saved overall computation time, especially for time-dependent mechanical and non-stationary transport problems. Houzeaux et al. (2009) presented a parallel implementation of fractional solvers for the incompressible Navier–Stokes equations using an algebraic approach. Its advantage was to set a common basis for a parallelization strategy. Giraud et al. (2010) studied the parallel scalability of variants of an algebraic additive Schwarz preconditioner for the solution of large three-dimensional convection–diffusion problems in a non-overlapping domain decomposition framework. Kraus (2012) introduced an algorithm for additive Schur complement approximation. It can be applied in various iterative methods for solving systems of linear equations arising from finite element discretization of partial differential equations. One feature of the hybrid method is that most of the computation is performed within the subdomains without inter-processor communication. Communication only takes place when solving the interface problem. Therefore, the method is very efficient when the interface problem size is small. However, in order to shorten the solution time, large-scale structural dynamic analysis is often performed with a large number of cores. As a result, the finite element mesh of a structure has to be partitioned into a large number of subdomains. With the increase of subdomains, the interface problem size becomes very large. It is difficult to solve because extra iterations are required for convergence (Kocak and Akay, 2001). Moreover, the overheads of communication and synchronization among processes also increase dramatically, for two reasons. Firstly, the number of processes participating in communication is too large, as all the processes are employed for the interface solution. Secondly, the approach does not take into consideration that inter-node communication is much slower than intra-node communication on multi-core clusters. In order to improve the parallel efficiency of the interface solution, a natural way is to reduce the interface problem size by utilizing multi-level approaches. Elwi and Murray (1985) proposed a multi-level substructuring scheme based upon partial decomposition of skyline matrices. They discussed the transformation of coordinates and prediction of the skyline for higher level substructures. Leung (2011) presented a multi-level-multi-scale dynamic substructure method. Yang et al. (2012) adopted a multi-level approach to improve the efficiency of parallel substructure finite element analysis. They demonstrated the effectiveness of the multi-level parallel substructural method for nonlinear dynamic structural analysis. The multi-level method reduces the
interface problem size and parallelizes the computation of the interface solution among substructures; thus, it can speed up the interface solution as well as the overall parallel finite element analysis. However, there are also some disadvantages in the multi-level approach. Firstly, when the cores in one level are busy computing, all the cores in other levels are idle. This leads to a waste of computational resources. Secondly, the communication overheads among different levels increase when the number of levels increases. Thirdly, the multi-level approach is not aware of the memory and communication characteristics of multi-core clusters, thus it cannot exploit their computing power to get optimal performance. For these reasons, further significant improvement of parallel computing performance may be difficult to achieve using the multi-level method. In this paper, a hierarchical parallel computing approach for large-scale structural dynamic analysis is proposed. It is aware of the memory and communication characteristics of multi-core clusters and can fully exploit their computing power to get optimal performance. The proposed approach is constructed based on the strategies of two-level partitioning and two-level condensation. During the structural dynamic analysis most of the cores are busy computing. The approach not only significantly improves the convergence rate of the interface equations by further reducing the problem size, but also considerably reduces the communication overheads by separating the intra-node and inter-node communications and minimizing the inter-node communication. The remainder of this paper is organized as follows. Section 3 briefly describes the conventional Newmark algorithm based on the DDM. The hierarchical parallel computing approach adapted to the hardware topology of multi-core clusters is proposed in Section 4. Numerical experiments are presented in Section 5. Finally, Section 6 concludes the paper.
3. Review of the conventional Newmark algorithm based on the DDM

For structural dynamic analysis, the hybrid approach is implemented with the Newmark algorithm based on the DDM. The equations of motion of the structure discretized by finite elements take the general form

$$[M]\{\ddot{x}\} + [C]\{\dot{x}\} + [K]\{x\} = \{P\} \quad (1)$$
where $\{\ddot{x}\}$, $\{\dot{x}\}$, and $\{x\}$ are the acceleration, velocity, and displacement vectors, respectively. $[M]$, $[C]$, and $[K]$ are the mass, damping, and stiffness matrices, respectively. $\{P\}$ is the external load vector. The Newmark algorithm is one of the most widely used implicit methods for solving the equations of motion. It is based on the following formulas
$$\{\ddot{x}\}_{t+\Delta t} = c_0\left(\{x\}_{t+\Delta t} - \{x\}_t\right) - c_2\{\dot{x}\}_t - c_3\{\ddot{x}\}_t \quad (2)$$
$$\{\dot{x}\}_{t+\Delta t} = \{\dot{x}\}_t + c_6\{\ddot{x}\}_t + c_7\{\ddot{x}\}_{t+\Delta t} \quad (3)$$
Substituting the relations for $\{\ddot{x}\}$ and $\{\dot{x}\}$ from equations (2) and (3) into equation (1) at time $t + \Delta t$, and rearranging terms, the equilibrium equations are obtained in the form

$$[\hat{K}]\{x\}_{t+\Delta t} = \{\hat{P}\}_{t+\Delta t} \quad (4)$$
where $[\hat{K}]$ is the effective stiffness matrix and $\{\hat{P}\}$ is the effective external load vector. They are calculated from the following formulas

$$[\hat{K}] = [K] + c_0[M] + c_1[C] \quad (5)$$
$$\{\hat{P}\}_{t+\Delta t} = \{P\}_{t+\Delta t} + [M]\left(c_0\{x\}_t + c_2\{\dot{x}\}_t + c_3\{\ddot{x}\}_t\right) + [C]\left(c_1\{x\}_t + c_4\{\dot{x}\}_t + c_5\{\ddot{x}\}_t\right) \quad (6)$$
Once the displacements are obtained according to equation (4), the system's accelerations and velocities can be obtained from the following formulas

$$\{\ddot{x}\}_{t+\Delta t} = c_0\left(\{x\}_{t+\Delta t} - \{x\}_t\right) - c_2\{\dot{x}\}_t - c_3\{\ddot{x}\}_t \quad (7)$$
$$\{\dot{x}\}_{t+\Delta t} = \{\dot{x}\}_t + c_6\{\ddot{x}\}_t + c_7\{\ddot{x}\}_{t+\Delta t} \quad (8)$$
where $c_0$–$c_7$ are the integration coefficients. They are determined by the time step $\Delta t$ and the Newmark algorithm's two basic parameters, namely $a$ and $b$. Concretely, they are obtained from the following formulas

$$c_0 = \frac{1}{a\,\Delta t^2},\quad c_1 = \frac{b}{a\,\Delta t},\quad c_2 = \frac{1}{a\,\Delta t},\quad c_3 = \frac{1}{2a} - 1,$$
$$c_4 = \frac{b}{a} - 1,\quad c_5 = \frac{\Delta t}{2}\left(\frac{b}{a} - 2\right),\quad c_6 = \Delta t(1 - b),\quad c_7 = b\,\Delta t \quad (9)$$
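For illustration, the following short C++ fragment evaluates the integration coefficients of equation (9) from the time step and the two Newmark parameters. It is only a sketch (the function and variable names are ours, not part of the paper's code); with a = 0.25, b = 0.5 and Δt = 0.001 s, the values used in Section 5, it yields the coefficients of the constant-average-acceleration scheme.

```cpp
#include <array>
#include <cstdio>

// Integration coefficients c0..c7 of equation (9), computed from the time
// step dt and the Newmark parameters a and b (illustrative sketch only).
std::array<double, 8> newmarkCoefficients(double a, double b, double dt) {
    return {
        1.0 / (a * dt * dt),        // c0
        b / (a * dt),               // c1
        1.0 / (a * dt),             // c2
        1.0 / (2.0 * a) - 1.0,      // c3
        b / a - 1.0,                // c4
        dt / 2.0 * (b / a - 2.0),   // c5
        dt * (1.0 - b),             // c6
        b * dt                      // c7
    };
}

int main() {
    // Parameter values used in the numerical experiments of Section 5.
    auto c = newmarkCoefficients(0.25, 0.5, 0.001);
    for (int i = 0; i < 8; ++i) std::printf("c%d = %g\n", i, c[i]);
    return 0;
}
```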
The Newmark algorithm for structural dynamic analysis is introduced as above. It can be executed in parallel based on the DDM. The details are as follows. Firstly, the structural finite element mesh is partitioned into a specified number of non-overlapping subdomains. Then, the equilibrium equations of each subdomain are formed according to equation (4) independently and concurrently in partitioned form as

$$\begin{bmatrix} \hat{K}_{II} & \hat{K}_{IB} \\ \hat{K}_{BI} & \hat{K}_{BB} \end{bmatrix} \begin{Bmatrix} x_I \\ x_B \end{Bmatrix} = \begin{Bmatrix} \hat{P}_I \\ \hat{P}_B \end{Bmatrix} \quad (10)$$
where the subscripts $I$ and $B$ refer to the internal and boundary degrees of freedom, respectively. Next, the internal degrees of freedom are released from equation (10) to obtain equilibrium equations in terms of the boundary degrees of freedom

$$[\tilde{K}]\{x_B\} = \{\tilde{P}\} \quad (11)$$
in which the condensed stiffness matrix is

$$[\tilde{K}] = [\hat{K}_{BB}] - [\hat{K}_{BI}][\hat{K}_{II}]^{-1}[\hat{K}_{IB}] \quad (12)$$
and the condensed external load vector is

$$\{\tilde{P}\} = \{\hat{P}_B\} - [\hat{K}_{BI}][\hat{K}_{II}]^{-1}\{\hat{P}_I\} \quad (13)$$
For large-scale three-dimensional structures, assembling the total interface problem can place extreme demands on both storage and computational resources. Therefore, a direct method for solving the interface problem could be prohibitively expensive. The usual alternative for solving the Schur complement matrix equations is to apply an iterative solver, namely the parallel preconditioned conjugate gradient algorithm (Carter et al., 1989; Rao, 2005). Once the interface displacements are obtained, the internal unknowns of each subdomain are determined from

$$\{x_I\} = [\hat{K}_{II}]^{-1}\left(\{\hat{P}_I\} - [\hat{K}_{IB}]\{x_B\}\right) \quad (14)$$
Next, the accelerations and velocities at time $t + \Delta t$ are calculated according to equations (7) and (8) successively. Following that, the stresses and deformation of each subdomain are calculated independently and concurrently. Then the solution of the next time step is started, until all the time steps are completed. The matrix operations in equations (12) to (14) only represent the basic concepts of condensation. They are not the actual procedures for programming. In order to reduce memory requirements and computational operations, the inverse matrices are never formed explicitly. The computation related to matrix inversion is usually performed with matrix factorization methods (Han and Abel, 1984; Elwi and Murray, 1985; Farhat et al., 1987). In this paper, the condensation is implemented with the modified Cholesky decomposition method proposed by Han and Abel (1984). For structural linear dynamic analysis, the effective stiffness matrix keeps the same value throughout the solution. Thus, the condensed stiffness matrix, which is calculated according to equation (12), also keeps the same value from beginning to end. Therefore, the effective stiffness matrix and condensed stiffness matrix of each subdomain are only required to be calculated once, when they are utilized for the first time. After that, only the effective external load vector and condensed external load vector need to be updated at each time step. The structural dynamic analysis usually contains many time steps. At each time step, the interface equations have to be solved once. Because communication only takes place during the interface solution, the conventional Newmark algorithm based on the DDM is very efficient when the interface problem size is small. However, when the interface problem size grows with the increase of subdomains, its efficiency may decrease significantly due to extra iterations and an inefficient communication scheme on multi-core clusters.
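To make equations (12) to (14) concrete, the sketch below shows how a single subdomain might cache a factorization of $\hat{K}_{II}$ and reuse it at every time step, so that only the condensed load vector is recomputed. This is an illustration under our own assumptions: it uses dense Eigen matrices and an LDLT factorization rather than the parallel modified Cholesky routine of Han and Abel (1984) actually used in the paper, and the struct and member names are ours.

```cpp
#include <Eigen/Dense>

// Per-subdomain condensation data. The factorization of K_II is computed
// once (linear dynamics) and reused at every time step; only the condensed
// load vector changes. Dense matrices are used purely for illustration.
struct Subdomain {
    Eigen::MatrixXd Kii, Kib, Kbi, Kbb;   // blocks of the effective stiffness, eq. (10)
    Eigen::LDLT<Eigen::MatrixXd> KiiFact; // factorization of K_II, built once
    Eigen::MatrixXd Ktilde;               // condensed stiffness, eq. (12)

    void condenseStiffness() {
        KiiFact.compute(Kii);
        Ktilde = Kbb - Kbi * KiiFact.solve(Kib);          // equation (12)
    }
    Eigen::VectorXd condenseLoad(const Eigen::VectorXd& Pi,
                                 const Eigen::VectorXd& Pb) const {
        return Pb - Kbi * KiiFact.solve(Pi);              // equation (13)
    }
    Eigen::VectorXd recoverInternal(const Eigen::VectorXd& Pi,
                                    const Eigen::VectorXd& xb) const {
        return KiiFact.solve(Pi - Kib * xb);              // equation (14)
    }
};
```

In a real implementation the blocks would be sparse and the factorization performed in parallel, but the per-time-step data flow is the same.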
4. The hierarchical parallel computing approach

This section is devoted to the hierarchical parallel computing approach. To achieve an efficient parallel algorithm on multi-core clusters, the following two conditions should be considered: load balancing and inter-node communication. The load in each node, as well as in each core, should be as balanced as possible to alleviate the synchronization overheads among different processes. Moreover, the inter-node communication should be kept as small as possible, because inter-node communication is much slower than intra-node communication. In order to satisfy these conditions, the proposed hierarchical approach is constructed based on the strategies of two-level partitioning and two-level condensation. The data for parallel computing is first prepared through two-level partitioning to guarantee load balancing within and across nodes. Then, at each time step of the structural dynamic analysis, the inter-node communication is reduced as much as possible by improving the convergence rate of the interface problem with two-level condensation and by reducing the number of processes participating in the interface solution. In a word, the hierarchical approach is adapted to the hardware topology of multi-core clusters.
4.1 Two-level partitioning

METIS (Karypis et al., 2014a) is a set of serial programs for partitioning graphs, partitioning finite element meshes, and producing fill-reducing orderings for sparse matrices. ParMETIS (Karypis et al., 2014b) is a parallel library that implements a variety of algorithms for partitioning unstructured graphs and meshes, and for computing fill-reducing orderings of sparse matrices. ParMETIS extends the functionality provided by METIS and includes routines that are especially suited for parallel computations and large-scale numerical simulations. In this paper, the strategy of two-level partitioning is implemented by combining METIS and ParMETIS. As shown in Figure 2, ParMETIS is first utilized to generate M subdomains in level 1 in parallel. Then, each subdomain in level 1 is further partitioned into N subdomains in level 2 by METIS independently and concurrently. In order to adapt to the architecture of multi-core clusters, M should be equal to the number of nodes available for parallel computing and N should be equal to the total number of cores in each node. The METIS and ParMETIS libraries convert finite element meshes to graphs and partition the graphs, producing parts of equal size while minimizing the number of interface nodes. They provide the possibility to control the imbalance of graph partitions. By setting the value of the unbalance factor to 1.0, we can guarantee that the partitioning is performed in equal parts.
Figure 2. Strategy of two-level partitioning.
In the strategy of two-level partitioning, the load balancing within and across nodes is guaranteed by METIS and ParMETIS, respectively. When the parallel computing is performed, each subdomain in level 1 is assigned to a node and all the subdomains in level 2 derived from the same subdomain in level 1 are assigned to different cores of the same node. As shown in Figure 2, a finite element mesh is first partitioned into two subdomains in level 1 and each subdomain in level 1 is further partitioned into four subdomains in level 2, in the case that M is equal to 2 and N is equal to 4. Through two-level partitioning, all the data information about subdomains in level 1 and level 2 can be stored in the local memory of the corresponding nodes. The distributed storage of data for large-scale problems is important for the performance of parallel computing programs. As all the data is stored close to the cores where it is first touched, the memory access speed is significantly improved.
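As a rough illustration of the second-level split, the fragment below partitions the graph of one level-1 subdomain into N parts with METIS_PartGraphKway. The CSR graph arrays are assumed to have been extracted beforehand, the wrapper function name is ours, and the first-level ParMETIS call is omitted; setting a very small UFACTOR approximates the unbalance factor of 1.0 mentioned above.

```cpp
#include <metis.h>
#include <vector>

// Second-level split of one level-1 subdomain into nparts = N parts
// (N = cores per node, 16 on the Dawning-5000A). The CSR adjacency
// structure (xadj, adjncy) of the subdomain graph is assumed given.
std::vector<idx_t> partitionLevel2(std::vector<idx_t>& xadj,
                                   std::vector<idx_t>& adjncy,
                                   idx_t nparts) {
    idx_t nvtxs = static_cast<idx_t>(xadj.size()) - 1;
    idx_t ncon = 1;                      // one balance constraint
    idx_t objval = 0;                    // edge-cut returned by METIS
    std::vector<idx_t> part(nvtxs);

    idx_t options[METIS_NOPTIONS];
    METIS_SetDefaultOptions(options);
    options[METIS_OPTION_UFACTOR] = 1;   // allow only 0.1% imbalance,
                                         // i.e. essentially equal parts

    METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                        nullptr, nullptr, nullptr, &nparts,
                        nullptr, nullptr, options, &objval, part.data());
    return part;                         // part[i] = level-2 subdomain of vertex i
}
```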
4.2 Two-level condensation

The strategy of two-level condensation is based on the strategy of two-level partitioning. Its goal is to further reduce the problem size by successively releasing the internal degrees of freedom of the subdomains in level 2 and level 1. As illustrated in Figure 3, the equilibrium equations of each subdomain in level 2 are first formed and the internal degrees of freedom are condensed out. Then, the equilibrium equations of each subdomain in level 1 are formed by assembling the interface equations of all subdomains in level 2 within the same node. Next, the internal degrees of freedom of each subdomain in level 1 are released with condensation and the interface equations are finally obtained. For structural linear dynamic analysis, both the effective stiffness matrix and condensed stiffness matrix of the subdomains in level 1 and level 2 keep the same value throughout the solution. Therefore, the effective stiffness matrix and condensed stiffness matrix of each subdomain are only required to be calculated once, when they are utilized for the first time.
Figure 3. Strategy of two-level condensation.

Figure 4. Scheme of three-layer parallel computing.
After that, only the effective external load vector and condensed external load vector need to be updated at each time step. Through two-level condensation the size of the interface equations is further reduced. Compared with the conventional Newmark algorithm based on the DDM, the hierarchical parallel computing approach has fewer interface degrees of freedom at the final stage. It is expected that less time will be required to solve the interface equations because fewer iterations are needed for convergence.
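The assembly step that links the two condensation levels can be pictured with the small sketch below: the condensed systems of the level-2 subdomains owned by one node are scatter-added into the level-1 subdomain system through a DOF map, after which that system is itself condensed again exactly as in equations (12) and (13). The dense Eigen matrices and the function and variable names are our own simplifying assumptions, not the paper's implementation.

```cpp
#include <Eigen/Dense>
#include <vector>

// Assemble the condensed (boundary) systems of the level-2 subdomains owned
// by one node into the level-1 subdomain system. K1 and P1 are assumed to be
// pre-sized to the number of level-1 DOFs; dofMap[s][i] gives the level-1
// equation number of the i-th boundary DOF of level-2 subdomain s.
void assembleLevel1(const std::vector<Eigen::MatrixXd>& Ktilde2,  // eq. (12), per level-2 subdomain
                    const std::vector<Eigen::VectorXd>& Ptilde2,  // eq. (13), per level-2 subdomain
                    const std::vector<std::vector<int>>& dofMap,
                    Eigen::MatrixXd& K1, Eigen::VectorXd& P1) {
    K1.setZero();
    P1.setZero();
    for (std::size_t s = 0; s < Ktilde2.size(); ++s) {
        const auto& map = dofMap[s];
        for (std::size_t i = 0; i < map.size(); ++i) {
            P1(map[i]) += Ptilde2[s](i);
            for (std::size_t j = 0; j < map.size(); ++j)
                K1(map[i], map[j]) += Ktilde2[s](i, j);
        }
    }
    // The level-1 system [K1]{x1} = {P1} is then split into internal and
    // interface DOFs and condensed once more, as in equations (12)-(13).
}
```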
4.3 Implementation of three-layer parallel computing

As mentioned in Section 1, inter-node communication is much slower than intra-node communication on multi-core clusters. Therefore, a better choice for reducing the overheads of communication and synchronization among processes is to separate the two kinds of communication and minimize the inter-node communication. This means that large volumes of local communications should be confined within each node and the volume of global communications among nodes should be reduced as much as possible. The hierarchical parallel computing approach satisfies exactly these requirements. As illustrated in Figure 4, it is implemented with three layers of parallelism based on the strategies of two-level partitioning and two-level condensation.

4.3.1 The first layer of parallelization. In the first layer of parallelization, each process is independently responsible for the processing of a subdomain in level 2 without any communication. The procedure consists of reading the subdomain model and partitioning information, forming the equation system, condensing, back substituting, and calculating stresses. Because all these operations are mostly based on elements, it is not necessary to
interchange information among processors. Each process can run independently according to the model information of the subdomain in level 2 assigned to it.

4.3.2 The second layer of parallelization. In the second layer of parallelization, the operations on each subdomain in level 1, including assembling, parallel condensing and back substituting, are confined within each node with intra-node communication. For the convenience of management, a local master process is employed in each node. It is primarily in charge of the processing of the corresponding subdomain in level 1, such as assembling the subdomain equation system, distributing data for parallel condensation, collecting the parallel computational results and performing the backward substitution. In order to effectively reduce the condensation time, all the processes within the same node participate in the parallel condensation of the corresponding subdomain in level 1. The condensation procedure is implemented with the parallel modified Cholesky decomposition method (Farhat and Wilson, 1988; Nikishkov et al., 1996). During condensation all the cores are busy computing. Compared with the multi-level approach, the hierarchical approach is more efficient because none of the cores is idle. The major novel feature of the parallel modified Cholesky decomposition algorithm used in this paper is that it is executed in parallel only within each node. As the number of cores per node is usually very small, it can achieve a higher efficiency. When all the nodes run in parallel in this manner, the overall performance of the system is improved significantly.

4.3.3 The third layer of parallelization. In the third layer of parallelization, the interface equations of the subdomains in level 1 are solved utilizing the parallel PCG
algorithm by the local master processes with inter-node communication. It is not necessary to assemble the total interface equations of all subdomains. In each node only one process, the local master process, is employed during the solution. Both the condensed stiffness matrices and the condensed external load vectors remain distributed in the corresponding local nodes. The diagonal preconditioners are constructed locally by the local master processes using the condensed stiffness matrices of the subdomains in level 1. The major novel feature of the parallel PCG algorithm used in this paper is that it only involves the local master processes. Because there is only one local master process per node, the number of processes participating in communication is considerably reduced. Moreover, all the communications among these local master processes belong to the category of inter-node communication. Therefore, the overheads of communication and synchronization among processes are reduced as much as possible. In conclusion, through the three layers of parallelism, large volumes of local communications are confined within each node, and the volume of global communications among nodes is reduced as much as possible. As a result, the intra-node and inter-node communications are separated and the inter-node communication is considerably reduced. Because the inter-node communication is much slower than intra-node communication on multi-core clusters, the overheads of communication and synchronization among processes are reduced to a minimum.
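One possible way to realize this separation with an MPI library such as the MVAPICH mentioned in Section 5 is sketched below: an intra-node communicator groups the processes of each node for the second layer, while a separate communicator containing only the local master processes carries the inter-node traffic of the third layer. The MPI-3 call MPI_Comm_split_type is our assumption; the paper does not state how its communicators are actually constructed, and an older MPI could group ranks by hostname instead.

```cpp
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int worldRank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);

    // Intra-node communicator: all processes running on the same node.
    MPI_Comm nodeComm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodeComm);

    int nodeRank = 0;
    MPI_Comm_rank(nodeComm, &nodeRank);

    // Inter-node communicator: only the local master (rank 0 on each node)
    // joins; all other processes pass MPI_UNDEFINED and get MPI_COMM_NULL.
    MPI_Comm masterComm;
    int color = (nodeRank == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(MPI_COMM_WORLD, color, worldRank, &masterComm);

    // Layer 2 (parallel condensation) would communicate over nodeComm;
    // layer 3 (parallel PCG on the level-1 interface) over masterComm.

    if (masterComm != MPI_COMM_NULL) MPI_Comm_free(&masterComm);
    MPI_Comm_free(&nodeComm);
    MPI_Finalize();
    return 0;
}
```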
4.4 The process of hierarchical parallel computing

When the parallel computing is performed, each subdomain in level 1 is assigned to a node and all the subdomains in level 2 derived from the same subdomain in level 1 are assigned to different cores of the same node. For the convenience of management, a local master process is employed in each node to handle all the operations related to the corresponding subdomain in level 1. As shown in Figure 5, the process of hierarchical parallel computing for structural dynamic analysis can be divided into the following steps (a compact sketch of the per-time-step loop is given after the list):

1. Prepare data for parallel computing with the strategy of two-level partitioning, including the elements, nodes, loads, boundary conditions and partitioning information of each subdomain in level 1 and level 2.
2. Form the equilibrium equations of each subdomain in level 2 and condense out its internal degrees of freedom. For structural linear dynamic analysis, the effective stiffness matrix and condensed stiffness matrix of each subdomain in level 2 only need to be calculated once, when they are utilized for the first time.
3. Form the equilibrium equations of each subdomain in level 1 and perform parallel condensation with intra-node communication. For structural linear dynamic analysis, the effective stiffness matrix and condensed stiffness matrix of each subdomain in level 1 are only required to be calculated once, when they are utilized for the first time.
4. Solve the interface equations of the subdomains in level 1 with inter-node communication utilizing the parallel PCG algorithm and then back substitute the internal displacements of each subdomain in level 1.
5. Extract the values of the interface degrees of freedom of each subdomain in level 2 from the computational results of the corresponding subdomain in level 1 and then back substitute its internal displacements.
6. Calculate the accelerations and velocities of each subdomain in level 2 as well as the stresses and deformation. If further time steps remain, go to step (2); otherwise end.

Figure 5. Flowchart of hierarchical parallel computing.
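The per-time-step procedure of steps 2 to 6 can be summarized by the schematic driver below. All functions are empty stubs standing in for the operations described in the text (the names are ours), so the fragment compiles but is only meant to show the ordering of the once-only condensation of the stiffness matrices and the repeated per-step work.

```cpp
// Stubs standing in for the parallel kernels described in steps 2-6;
// in a real code each would wrap the corresponding operation.
void condenseLevel2Stiffness() {}             // step 2, stiffness part (once)
void condenseLevel1Stiffness() {}             // step 3, stiffness part (once)
void updateAndCondenseLoads(int /*step*/) {}  // effective and condensed load vectors
void solveInterfaceWithParallelPCG() {}       // step 4, local masters, inter-node
void backSubstituteLevel1() {}                // step 4, internal DOFs of level 1
void backSubstituteLevel2() {}                // step 5
void updateKinematicsAndStresses() {}         // step 6, equations (7) and (8)

void runDynamicAnalysis(int numSteps) {
    condenseLevel2Stiffness();                // computed only once (linear dynamics)
    condenseLevel1Stiffness();
    for (int step = 0; step < numSteps; ++step) {
        updateAndCondenseLoads(step);
        solveInterfaceWithParallelPCG();
        backSubstituteLevel1();
        backSubstituteLevel2();
        updateKinematicsAndStresses();
    }
}

int main() { runDynamicAnalysis(1000); return 0; }  // e.g. 1000 steps, as in Section 5
```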
5. Numerical experiments

In order to investigate the performance of the parallel algorithms, two numerical examples (shown in Figures 6 and 7) were run on the Dawning-5000A supercomputer. In Figure 6, a cantilever model under the effect of a sinusoidal load was used for the dynamic analysis. The sinusoidal load varied according to the formula P(t) = 300 sin(20t) N. The length of the cantilever
was 4 m and its cross section was a rectangle with dimensions of 0.2 m × 0.2 m. The density of the material was taken as 7800 kg/m³, the Young's modulus as 2.1e5 MPa, and the Poisson's ratio as 0.3. In order to test and evaluate the parallel algorithms with large-scale problems and large processor configurations, the cantilever was meshed with a very small element size. After discretization with hexahedron elements, the model was composed of 8,040,192 elements, 8,909,200 nodes, and 26,727,600 degrees of freedom. The analysis duration was set to 1 s at a constant time step of 0.001 s. Thus, there were 1000 time steps for each analysis. The Newmark algorithm's two basic parameters were taken as 0.25 and 0.5, respectively. The damping effect was taken into consideration in the simulation and was calculated according to the Rayleigh damping formula. The damping coefficients for the mass matrix multiplier and stiffness matrix multiplier were taken as 5 and 0.0001, respectively.

Figure 6. Computational model of a cantilever.

In Figure 7, an actual steel building model was used to simulate the dynamic response of the structure subjected to blast load. There were ten stories in the building and the typical story height was 3 m. After discretization with hexahedron elements, the model was composed of 6,252,066 elements, 7,665,800 nodes, and 22,997,400 degrees of freedom. A bomb explosion in or near the building may have catastrophic effects, destroying or severely damaging portions of the exterior and interior structural framework. The blast waves cause loading on the front, side, and rear faces of the building, as well as on the roof. The reflected overpressure on the front face, where the angle of incidence is equal to zero, is substantially higher than the overpressure on the rest of the structure. Therefore, a simplified approach of considering only the blast loads on the front face of a building often yields approximate but conservative results. This approach was adopted in this study. The blast load was idealized as a triangular pulse having the peak force and positive phase duration shown in Figure 6. The dynamic analysis duration was set to 0.4 s at a constant time step of 0.0002 s. Thus, there were 2000 time steps for each analysis.

Figure 7. Computational model of a building.

5.1 Computational environment

The Dawning-5000A supercomputer is located at the Shanghai Supercomputer Center in China. It is a typical multi-core cluster built from multi-core nodes connected by an InfiniBand network. The theoretical bandwidth of the InfiniBand network is 20 Gbps. Each node is equipped with four quad-core AMD Barcelona 1.9 GHz processors and 64 GB of shared memory; therefore, the number of cores per node is 16. Each core has its own 512 KB L2 cache, and the four cores on the same chip share a 2 MB L3 cache. All the nodes run the SUSE Linux Enterprise Server 10 operating system. The programs of parallel computing for large-scale structural analysis have been developed in the C++ and FORTRAN languages. MVAPICH, a high performance Message Passing Interface implementation over the InfiniBand network, is utilized to handle all message-passing tasks among processes.
5.2 Data preparation

A two-level mesh partitioning result is required before the hierarchical parallel computing approach can be carried out. As described in Section 4.1, in order to adapt to the architecture of multi-core clusters, the number of subdomains in level 1 should be equal to the number of nodes available for parallel computing, and the number of derived subdomains in level 2 of each subdomain in level 1 should be equal to the total number of cores in each node. In this test, the numerical experiments were performed with 16, 32, 48, and 64 nodes, respectively. Consequently, the corresponding numbers of subdomains in level 1 were also 16, 32, 48, and 64, respectively. As the total number of cores of each node was 16 on the Dawning-5000A supercomputer, once the first partitioning was completed each subdomain in level 1 was further partitioned into 16 subdomains in level 2. For example, when utilizing 16 nodes for parallel computing, the result of two-level partitioning for the building model is shown in Figure 8. In this case, the building model was first partitioned into 16 subdomains in level 1, numbered from 1 to 16 in the figure. Then, each subdomain in level 1 was further partitioned into 16 subdomains in level 2 independently and concurrently.
Figure 8. Result of two-level partitioning for the building model.
5.3 Validation of parallel computing results

The numerical experiments on both the conventional Newmark algorithm based on the DDM and the proposed hierarchical approach were performed using the same analysis settings in the same computing environment, as described above. The hierarchical approach is expected to give identical results to the conventional method. The results of the vertical displacement history at the load point of the cantilever are shown in Figure 9. The results of the horizontal displacement history at the top of the building are shown in Figure 10. It can be observed from Figures 9 and 10 that the results obtained using the hierarchical approach were found to be in good agreement with those obtained with the conventional method.

Figure 9. Vertical displacement history at the cantilever's load point.

Figure 10. Horizontal displacement history at building top.
5.4 Performance evaluation of parallel algorithms

Two principal indicators of the performance of a parallel computing algorithm are its speedup, S, and its efficiency, E. In this paper, the speedup is defined as follows. For a specific problem, if the available cores for parallel computing range from m to z (1 < m < n < ... < z), and the corresponding time costs are $t_m, t_n, \ldots, t_z$, respectively, then the speedup of parallel computing with i cores is

$$S_i = \frac{t_m}{t_i} \quad (i = m, n, \ldots, z) \quad (15)$$
and the corresponding parallel efficiency is calculated as

$$E_i = \frac{S_i}{i/m} \times 100\% \quad (16)$$
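As a worked example using the hierarchical-approach figures for the cantilever model in Table 3 (baseline m = 256 cores, t_256 = 59,740 s): with i = 512 cores and t_512 = 30,571 s, equation (15) gives S_512 = 59,740/30,571 ≈ 1.95, and equation (16) gives E_512 = 1.95/(512/256) × 100% ≈ 97.7%, matching the tabulated values.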
Table 1. Interface problem sizes and iterations for the cantilever model.

Cores | Conventional: no. of DOFs on interface | Conventional: average no. of iterations | Hierarchical: level 1 no. of DOFs on interface | Hierarchical: average no. of iterations
256   | 306,000   | 278 | 37,209  | 115
512   | 629,328   | 313 | 75,618  | 139
768   | 920,499   | 481 | 152,472 | 184
1024  | 1,274,448 | 645 | 212,400 | 236
Table 2. Interface problem sizes and iterations for the building model.

Cores | Conventional: no. of DOFs on interface | Conventional: average no. of iterations | Hierarchical: level 1 no. of DOFs on interface | Hierarchical: average no. of iterations
256   | 250,269 | 448  | 73,800  | 164
512   | 362,670 | 600  | 115,083 | 172
768   | 489,624 | 882  | 168,885 | 201
1024  | 623,613 | 1149 | 229,407 | 257
Table 3. Statistics of time and performance of parallel computing for the cantilever model.

Cores | Conventional: interface solving time (s) | Conventional: total time (s) | Conventional: speed-up | Conventional: parallel efficiency | Hierarchical: level 1 solving time (s) | Hierarchical: total time (s) | Hierarchical: speed-up | Hierarchical: parallel efficiency
256   | 3675   | 59,873 | 1    | 100%   | 3568 | 59,740 | 1    | 100%
512   | 8698   | 35,066 | 1.71 | 85.37% | 4025 | 30,571 | 1.95 | 97.71%
768   | 19,634 | 34,855 | 1.72 | 57.26% | 6717 | 21,875 | 2.73 | 91.03%
1024  | 38,020 | 47,085 | 1.27 | 31.79% | 9147 | 18,328 | 3.26 | 81.49%
Table 4. Statistics of time and performance of parallel computing for the building model.

Cores | Conventional: interface solving time (s) | Conventional: total time (s) | Conventional: speed-up | Conventional: parallel efficiency | Hierarchical: level 1 solving time (s) | Hierarchical: total time (s) | Hierarchical: speed-up | Hierarchical: parallel efficiency
256   | 31,985 | 126,874 | 1    | 100%   | 31,728 | 126,745 | 1    | 100%
512   | 36,804 | 69,387  | 1.83 | 91.42% | 32,508 | 64,973  | 1.95 | 97.54%
768   | 54,227 | 71,840  | 1.77 | 58.87% | 33,457 | 51,216  | 2.47 | 82.49%
1024  | 81,436 | 89,324  | 1.42 | 35.51% | 35,365 | 43,448  | 2.92 | 72.93%
The detailed results of parallel computing are listed in Tables 1 to 4. For each model, both the conventional and the proposed methods were employed utilizing 16, 32, 48, and 64 nodes for analysis, respectively. As the total number of cores in each node was 16, the corresponding total numbers of cores used in each case were 256, 512, 768, and 1024, respectively. The total time of parallel computing is the summation of the solution times over all the time steps, where in each time step the solution time extends from forming
the equilibrium equations of each subdomain to the calculation of the stresses and deformation. The interface solving time is the summation of the solution times of the interface equations over all the time steps. The level 1 solving time comprises the times of all the operations on the subdomains in level 1 over all the time steps. As can be observed from Tables 3 and 4, the hierarchical approach achieved higher speed-up and parallel efficiency compared with the conventional method. For the conventional method, the parallel efficiency dropped quickly due to the dramatic increase of the interface solution time. Taking Table 3 as an example, it cost 3675 seconds to solve the interface equations utilizing
256 cores. However, when the number of cores increased up to 1024, the cost of the interface solution quickly increased to 38,020 seconds. This was caused by the increase of the interface problem size and an inefficient communication scheme on multi-core clusters. As listed in Tables 1 and 2, with the increase of subdomains the interface problem size became larger and larger. As a result, extra iterations were required for convergence when solving the interface equations. Operations on the subdomains in level 1 in the hierarchical approach essentially solve the same equations as the conventional interface system. However, the level 1 solving time, namely the summed CPU cost of all operations on the subdomains in level 1 over all the time steps, including forming the equation system, parallel condensing, solving the interface equations, and back substituting, was significantly reduced. Taking Table 4 as an example, the interface solving time was 81,436 seconds utilizing the conventional method with 1024 cores, while it only cost 35,365 seconds utilizing the hierarchical approach to solve the level 1 system. In this case, as much as 46,071 seconds were saved. This was mostly contributed by the proposed hierarchical parallel computing approach. It not only improved the convergence rate of the interface equations by further reducing their size, but also significantly improved the communication efficiency by fully exploiting the memory and communication characteristics of multi-core clusters. As listed in Tables 1 and 2, the hierarchical approach has a smaller interface problem size and fewer iterations were required for convergence compared with the conventional method.

In order to investigate the performance of the parallel algorithms with respect to problem sizes, two test meshes of the building model with 16,451,700 and 29,206,800 degrees of freedom were solved employing both the conventional and hierarchical methods. From the results illustrated in Figures 11 and 12, it can be observed that the overall performance of the hierarchical approach improved with the increase of problem sizes, whereas the conventional method did not show good scalability with respect to problem sizes. For the conventional method, the proportion of the time cost associated with the interface solution to the total time increased considerably with problem sizes, making it more difficult to improve the speedup. However, when utilizing the hierarchical approach the interface problem size solved at the final stage did not increase considerably with the increase of problem sizes. As a result, it showed good scalability with respect to problem sizes because the ratio of computation to communication per core improved.

Figure 11. Performance of parallel algorithms for a building model with 16,451,700 degrees of freedom.

Figure 12. Performance of parallel algorithms for a building model with 29,206,800 degrees of freedom.

6. Conclusions

In order to improve the parallel efficiency of large-scale structural dynamic analysis, a hierarchical approach adapted to the hardware topology of multi-core clusters is proposed. It not only implements load balancing within and across nodes by two-level partitioning, but also considerably reduces the communication overheads by improving the convergence rate of the interface problem and minimizing the inter-node communication. Thus, the hierarchical approach can exploit the memory and communication characteristics of multi-core clusters to obtain optimal performance.

Numerical experiments conducted on a Dawning-5000A supercomputer indicate that the hierarchical approach achieved higher speed-up and parallel efficiency compared to the conventional method. The scalability investigation with respect to problem sizes shows that the overall performance of the hierarchical approach improved with the increase of problem sizes, whereas the conventional method did not show good scalability with respect to problem sizes.
Funding

This work was supported by the National 863 High Technology Research and Development Program of China (grant number 2012AA01A307) and the National Natural Science Foundation of China (grant numbers 11272214 and 51475287).
References

Bormotin KS (2013) Iterative method for solving geometrically nonlinear inverse problems of structural element shaping under creep conditions. Computational Mathematics and Mathematical Physics 53: 1908–1915.
Carter WT, Sham T-L and Law KH (1989) A parallel finite element method and its prototype implementation on a hypercube. Computers & Structures 31: 921–934.
Elwi A and Murray D (1985) Skyline algorithms for multilevel substructure analysis. International Journal for Numerical Methods in Engineering 21: 465–479.
Farhat C and Wilson E (1988) A parallel active column equation solver. Computers & Structures 28: 289–304.
Farhat C, Wilson E and Powell G (1987) Solution of finite element systems on concurrent processing computers. Engineering with Computers 2: 157–165.
George A, Heath M, Liu J, et al. (1989) Solution of sparse positive definite systems on a hypercube. Journal of Computational and Applied Mathematics 27: 129–156.
Giraud L, Haidar A and Saad Y (2010) Sparse approximations of the Schur complement for parallel algebraic hybrid solvers in 3D. Numerical Mathematics: Theory, Methods and Applications 3: 276–294.
Gueye I, El Arem S, Feyel F, et al. (2011) A new parallel sparse direct solver: presentation and numerical experiments in large-scale structural mechanics parallel computing. International Journal for Numerical Methods in Engineering 88: 370–384.
Gullerud AS and Dodds RH (2001) MPI-based implementation of a PCG solver using an EBE architecture and preconditioner for implicit, 3-D finite element analysis. Computers & Structures 79: 553–575.
Gupta A (2006) A shared- and distributed-memory parallel sparse direct solver. Applied Parallel Computing: State of the Art in Scientific Computing 3732: 778–787.
Han TY and Abel JF (1984) Substructure condensation using modified decomposition. International Journal for Numerical Methods in Engineering 20: 1959–1964.
Houzeaux G, Vazquez M, Aubry R, et al. (2009) A massively parallel fractional step solver for incompressible flows. Journal of Computational Physics 228: 6316–6332.
Karypis G, Schloegel K and Kumar V (2014a) METIS – serial graph partitioning and fill-reducing matrix ordering, version 5. Department of Computer Science, University of Minnesota. Available at: http://glaros.dtc.umn.edu/gkhome/metis/metis/overview.
Karypis G, Schloegel K and Kumar V (2014b) ParMETIS – parallel graph partitioning and fill-reducing matrix ordering, version 4. Department of Computer Science, University of Minnesota. Available at: http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview.
Kayi A, Kornkven E, El-Ghazawi T, et al. (2009) Performance analysis and tuning for clusters with ccNUMA nodes for scientific computing – a case study. Computer Systems Science and Engineering 24: 291–302.
Kocak S and Akay HU (2001) Parallel Schur complement method for large-scale systems on distributed memory computers. Applied Mathematical Modelling 25: 873–886.
Kozubek T, Vondrak V, Mensik M, et al. (2013) Total FETI domain decomposition method and its massively parallel implementation. Advances in Engineering Software 60–61: 14–22.
Kraus J (2012) Additive Schur complement approximation and application to multilevel preconditioning. SIAM Journal on Scientific Computing 34: A2872–A2895.
Leung AYT (2011) Dynamic substructure method for elastic fractal structures. Computers & Structures 89: 302–315.
Li Y, Jin X, Li L, et al. (2006) A parallel and integrated system for structural dynamic response analysis. The International Journal of Advanced Manufacturing Technology 30: 40–44.
Medek O, Kruis J, Bittnar Z, et al. (2007) Static load balancing applied to Schur complement method. Computers & Structures 85: 489–498.
Mininni PD, Rosenberg D, Reddy R, et al. (2011) A hybrid MPI-OpenMP scheme for scalable parallel pseudospectral computations for fluid turbulence. Parallel Computing 37: 316–326.
Nikishkov G, Makinouchi A, Yagawa G, et al. (1996) Performance study of the domain decomposition method with direct equation solver for parallel finite element analysis. Computational Mechanics 19: 84–93.
Paz RR, Storti MA, Castro HG, et al. (2011) Using hybrid parallel programming techniques for the computation, assembly and solution stages in finite element codes. Latin American Applied Research 41: 365–377.
Rao A (2005) MPI-based parallel finite element approaches for implicit nonlinear dynamic analysis employing sparse PCG solvers. Advances in Engineering Software 36: 181–198.
Wang KY, Wang QS and Guan KZ (2013) Iterative method and convergence analysis for a kind of mixed nonlinear Volterra-Fredholm integral equation. Applied Mathematics and Computation 225: 631–637.
Yang YS, Hsieh SH and Hsieh TJ (2012) Improving parallel substructuring efficiency by using a multilevel approach. Journal of Computing in Civil Engineering 26: 457–464.
Yu CHD and Wang WC (2014) Performance models and workload distribution algorithms for optimizing a hybrid CPU–GPU multifrontal solver. Computers & Mathematics with Applications 67: 1421–1437.
Zhao JP, Hou YR, Zheng HB, et al. (2014) A new iterative method for linear systems from XFEM. Mathematical Problems in Engineering 2014: 1–8.
Author biographies

Xinqiang Miao received the B.S. and M.S. degrees from the College of Mechanical and Electrical Engineering, Shaanxi University of Science and Technology, Xi'an, China, in 2007 and 2010, respectively. He is currently completing his Ph.D. thesis on high performance computing in the School of Mechanical Engineering of Shanghai Jiaotong University. His
research interests are related to parallel computing and numerical simulation.
Xianlong Jin is a full professor in the School of Mechanical Engineering of Shanghai Jiaotong University. He is focused on high performance computing and numerical simulation. His technical interests include distributed and parallel computing, performance evaluation, design and analysis of algorithms, and numerical simulation.

Junhong Ding is a Software Engineer at the Shanghai Supercomputer Center, China. He received his Ph.D. degree from the School of Mechanical Engineering of Shanghai Jiaotong University in 2006. He has over 8 years of experience working in the field of parallel computing.