parallel implementation of conventional finite element procedures [3~5]. However, the features of a combined finite/discrete element system make such a ...
ACTA MECHANICA SINICA, Vol.20, No.5, October 2004
The Chinese Society of Theoretical and Applied Mechanics
Chinese Journal of Mechanics Press, Beijing, China
Allerton Press, Inc., New York, U.S.A.

ISSN 0567-7718

PARALLEL ANALYSIS OF COMBINED FINITE/DISCRETE ELEMENT SYSTEMS ON PC CLUSTER*

WANG Fujun 1,†   ZHANG Jing 1   Y.T. FENG 2   D.R.J. OWEN 2   LIU Yang 1

1 (College of Water Conservancy & Civil Eng., China Agricultural University, Beijing 100083, China)
2 (Civil & Computational Engineering Centre, University of Wales Swansea, Swansea, SA2 8PP, U.K.)

ABSTRACT: A computational strategy is presented for the nonlinear dynamic analysis of large-scale combined finite/discrete element systems on a PC cluster. In this strategy, a dual-level domain decomposition scheme is adopted to implement the dynamic domain decomposition. The domain decomposition approach matches well the requirement of reducing the memory size per processor of the calculation. To treat the contact between boundary elements in neighbouring subdomains, the elements in a subdomain are classified into internal, interfacial and external elements. In this way, all the contact detection algorithms developed for a sequential computation can be adopted directly in the parallel computation. Numerical examples show that this implementation is suitable for simulating large-scale problems. Two typical numerical examples are given to demonstrate the parallel efficiency and scalability on a PC cluster.

KEYWORDS: parallel computation, finite element, discrete element, PC cluster

1 INTRODUCTION

Combined finite/discrete element methods have recently emerged as a powerful solution strategy for the simulation of a wide variety of practical engineering applications [1]. Examples include process simulation (e.g. granular flow and particle dynamics) and fracture damage modeling (e.g. rock blasting and projectile impact). These situations often involve a large number of discrete bodies that interact with each other or undergo a transition from continuous to discontinuous states. Besides their discrete/discontinuous nature, these applications are often characterized by the following features: they are highly dynamic; their domain configurations change rapidly; and sufficient resolution is required. In the numerical solution context, the contact detection and interaction computations often take more than half of the entire simulation time, and the small time-step imposed by the explicit integration procedure also requires a very large number (e.g. millions) of time increments to be performed. All these factors make the simulation of a realistic application extremely computationally expensive [2]. Consequently, parallelisation becomes an obvious option for significantly increasing existing computational capability, alongside the recent remarkable advances in hardware performance. In recent years, considerable effort has been devoted to the effective parallel implementation of conventional finite element procedures [3~5]. However, the features of a combined finite/discrete element system make such a parallelisation much more difficult and challenging. Only recently have some successful attempts emerged to tackle these problems on shared-memory machines [6,7]. It is evident that the shared-memory configuration is limited by the number of processors that can efficiently access the common memory. The current trend in parallel processing is to create

Received 18 April 2003, revised 3 December 2003
* The project supported by the National Natural Science Foundation of China (10372114) and the Engineering and Physical Sciences Research Council (EPSRC) of UK (GR/R21219)
† E-mail: wangfj@cau.edu.cn


distributed-memory parallel environments. As a typical distributed-memory parallel environment, a PC cluster has the potential to provide the increased capacity and computational speed necessary to make the numerical simulation of large complex systems practical. The ultimate goal of this work is therefore to discuss the major algorithmic aspects of large-scale parallel finite/discrete element analysis on a PC cluster. Our implementation is also intended to be general for other distributed-memory parallel platforms.

The organization of the paper is as follows. A computational model of a combined finite/discrete element system is presented in Section 2. The general solution procedure and some issues related to the parallel implementation of a combined finite/discrete element approach are discussed in this section. Section 3 provides numerical examples to illustrate the parallel performance achieved with the current implementation of the parallelised finite/discrete solution approach on the PC cluster environment. Finally, the concluding remarks are given in Section 4.

2 PARALLEL COMPUTATIONAL STRATEGY

In the combined finite/discrete element system, some continuous regions and some separate bodies are considered. The continuous regions are usually discretised by finite elements. The originally separated bodies, such as particulates, or fractured zones generated from continuous regions, are represented by discrete elements. To simplify the simulation, discrete elements are usually treated as rigid bodies and represented by simple geometric entities, such as disks or spheres, while the finite elements are deformable [1].

2.1 The Governing Equation and Its Temporal Integration

The discretised governing equation for the finite element sub-system can be written as [3]

    M ü + C u̇ + F^int - F^ext - F^c = 0        (1)

where M and C are the mass and damping matrices, respectively, F^int is the global vector of internal resisting nodal forces, F^ext is the vector of consistent nodal forces for the applied external loads, F^c is the vector of consistent nodal contact forces, ü is the global vector of nodal accelerations and u̇ is the global vector of nodal velocities. For a discrete element sub-system, the discrete elements are governed by Newton's second law [1]

    M ü - F^ext - F^c = 0        (2)

The two sub-systems are coupled by contact forces, F^c. Because the discrete element can be viewed as having the same topology as the finite element, i.e. one with only one node, Eq.(2) can be viewed as a special case of Eq.(1). Therefore, the complexity of the algorithm design for solving Eqs.(1) and (2) is reduced, and only one equation, Eq.(1), needs to be considered.

In order to obtain the solution of the governing equation (1), a central difference scheme is often adopted. The accelerations and velocities can be expressed by displacements at different time steps. Thus, the displacements at time t + Δt are given explicitly in terms of the displacements at times t and t - Δt, as [3]

    u_{t+Δt} = (M + (Δt/2) C)^{-1} [ (Δt)² (F^ext + F^c - F^int) + 2 M u_t - (M - (Δt/2) C) u_{t-Δt} ]        (3)

Usually, matrices M and C are diagonal. Therefore, equation set (3) can be solved by calculating independent algebraic equations.

2.2 Subdomain and Its Buffer-Zone

Parallel computations are always performed on subdomains. In this work, each subdomain is confined to be an axis-aligned box. Figure 1 shows a 2D system divided into 7 subdomains (S1, S2, ..., S7). A buffer-zone, used to control the frequency of domain decomposition, is introduced along the boundary of each subdomain. The buffer-zone of subdomain S6 is shown in Fig.2. One half of the buffer-zone is located inside the current subdomain, and the other half is located outside the boundary. Thus, there are three boxes associated with each subdomain, which are termed the subdomain box, internal box and external box, respectively.
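Because M and C are diagonal, the update in Eq.(3) decouples into independent scalar equations, one per degree of freedom. The following is a minimal sketch of that update for lumped (vector-valued) mass and damping; the function name and the single-DOF check are illustrative, not the authors' code:

```python
import numpy as np

def central_difference_step(m, c, f_int, f_ext, f_c, u_t, u_prev, dt):
    """One explicit central-difference update, Eq.(3), assuming the
    vectors m and c hold the diagonals of M and C."""
    lhs = m + 0.5 * dt * c                       # M + (dt/2) C
    rhs = dt**2 * (f_ext + f_c - f_int) + 2.0 * m * u_t \
          - (m - 0.5 * dt * c) * u_prev          # right-hand side of Eq.(3)
    return rhs / lhs                             # componentwise: M, C diagonal

# single-DOF sanity check: unit mass, no damping, constant unit force
m = np.array([1.0]); c = np.array([0.0])
u_t = np.array([0.0]); u_prev = np.array([0.0])
dt = 0.01
u_next = central_difference_step(m, c, np.zeros(1), np.array([1.0]),
                                 np.zeros(1), u_t, u_prev, dt)
```

Since each equation is independent, this step parallelises trivially across the nodes owned by a processor.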

Fig.1 Subdomains



Fig.2 Buffer-zone and boxes of subdomain S6

The size of the buffer-zone depends on the type of application. It has conflicting effects on the overall efficiency of the domain decomposition and the communications at each time step. A larger buffer-zone will result in more elements in each subdomain, which will increase the amount of simulation and communication, but will reduce the frequency with which the domain decomposition has to be performed. With a smaller buffer-zone, the number of elements and nodes within the zones will be fewer and the interprocessor communications become less expensive at each step, but the domain decomposition must be conducted more frequently, thus increasing the computational cost of the decomposition. A carefully selected buffer-zone can balance the costs of the two phases to achieve a better overall performance. In our experience, one half of the average radius of all the discrete elements can be taken as the size of the buffer-zone.

All elements in a subdomain are classified into internal, interfacial and external elements. If the bounding box of an element is totally inside the internal box of a subdomain, it is defined as internal. If the bounding box of an element overlaps with the subdomain buffer-zone and its centre is inside the subdomain box, the element is defined as interfacial. Other elements are defined as external. If a node is located inside the subdomain box, it is defined as an internal node. Other nodes, located in the external box, are defined as external nodes.
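The classification rules above can be made concrete with axis-aligned boxes. A simplified 2D sketch follows; box tuples are (xmin, ymin, xmax, ymax), all names are illustrative, and "overlaps the buffer-zone" is approximated here as "not entirely inside the internal box", which is an assumption of the sketch:

```python
def contains(outer, inner):
    """True if the box `inner` lies entirely inside the box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def classify_element(elem_box, internal_box, subdomain_box):
    """Classify one element of a subdomain as in Section 2.2."""
    if contains(internal_box, elem_box):
        return "internal"
    # element touches the buffer-zone: decide by its centre
    cx = 0.5 * (elem_box[0] + elem_box[2])
    cy = 0.5 * (elem_box[1] + elem_box[3])
    if subdomain_box[0] <= cx <= subdomain_box[2] and \
       subdomain_box[1] <= cy <= subdomain_box[3]:
        return "interfacial"
    return "external"
```

With this classification, every element near a subdomain boundary exists on both neighbouring processors, so no special boundary case is needed during contact detection.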

2.3 Parallel Computational Procedure

For the combined finite/discrete problems, the explicit integration scheme is often adopted. The computational procedure typically performs the contact computations, internal force computations, external force computations, temporal integration, and configuration update at each time-step. To perform a parallel computation on a PC cluster, the master/slave approach [8] is adopted in the present work.


The parallel environment consists of one master processor and several slave processors. The parallel computational procedure implemented on a slave processor at one time-step can be expressed as:
(1) Compute internal forces on all internal and interfacial finite elements in the subdomain;
(2) Compute external forces on all internal and interfacial finite elements and discrete elements in the subdomain;
(3) Detect contacts among all elements in the subdomain;
(4) Compute contact forces on all elements having contacts in the subdomain;
(5) Assemble nodal forces for all nodes in the subdomain;
(6) Exchange nodal forces of interfacial and external elements with other slave processors;
(7) Perform the temporal integration (i.e., update the position) for each internal node;
(8) Exchange displacements and velocities of boundary nodes with other slave processors;
(9) Send the maximum accumulated displacement to the master processor;
(10) Receive a flag from the master processor, which indicates if the domain decomposition is necessary;
(11) If the domain decomposition is necessary, pass the necessary data to the master processor, and receive the redistributed data from the master processor.
The parallel computational procedure implemented on the master processor at one time-step can be expressed as:
(1) Receive the maximum accumulated displacement from all slave processors;
(2) Check the maximum accumulated displacement and the load imbalance to determine if the domain decomposition is necessary;
(3) If the domain decomposition is necessary, collect the necessary data from slave processors, perform the decomposition and redistribute the data among slave processors.
Here, the master processor serves as the controller for the computations by conducting the necessary sequential calculations, including partitioning/re-partitioning the entire domain into subdomains. Each slave processor stores all the data for the problem confined within its subdomain, and performs a simulation for that problem. The slave processor works as a usual processor running a sequential code. The


main difference is that some communication functions are added to exchange information with other slave processors or with the master processor.
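The slave-side steps (1)-(11) above amount to a fixed loop body whose communication points are the only departure from a sequential code. A control-flow sketch, with the communications injected as callables so it can run (and be followed) without MPI; all names are illustrative:

```python
def slave_time_step(sub, exchange_forces, exchange_boundary_state,
                    send_max_displacement, receive_decomposition_flag,
                    migrate_data):
    """One slave-processor time-step, following steps (1)-(11) of
    Section 2.3. `sub` holds the subdomain state; the five callables
    stand in for the interprocessor communication."""
    sub.compute_internal_forces()         # (1) internal + interfacial FEs
    sub.compute_external_forces()         # (2) FEs and DEs
    contacts = sub.detect_contacts()      # (3) all elements, locally
    sub.compute_contact_forces(contacts)  # (4)
    sub.assemble_nodal_forces()           # (5)
    exchange_forces()                     # (6) interfacial/external elements
    sub.integrate_internal_nodes()        # (7) central difference update
    exchange_boundary_state()             # (8) displacements, velocities
    send_max_displacement(sub.max_accumulated_displacement)  # (9)
    if receive_decomposition_flag():      # (10) flag from the master
        migrate_data()                    # (11) redistribute subdomain data
```

Note that only steps (6), (8)-(11) involve communication; everything else is the unmodified sequential code acting on the local subdomain.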

2.4 Contact Detection and Contact Force Evaluation

The movement of finite elements and discrete elements will cause unphysical interpenetration. The goal of the contact calculation is to find the interpenetration and eliminate it. The efficient parallelisation of this task is critical, since in a serial simulation it can require over 50% of the run time. Generally speaking, the contact computations include global search, local search, and contact force evaluation [3]. Owing to the careful design of the parallel procedure described in Section 2.3, all three phases can be conducted locally in each subdomain as a sequential process.

The global search is not conducted at every time-step. The frequency of the global search is determined by the size of the buffer-zone shown in Fig.2. In the global search phase, each slave processor is responsible for detecting all potential contact pairs within its own subdomain. Any element, whether it is internal, interfacial or external, is treated as a regular element, and there is no need to distinguish contacts between elements with different type codes. Therefore, any algorithm used in the sequential code can be employed to perform the global search in this parallel strategy. In the present work, two searching algorithms [9,10] proposed by the authors are adopted. These algorithms can accommodate various geometrical entities including facets and particles. This is achieved by representing each entity with an axis-aligned bounding box extended by a buffer zone. However, the contacts between boundary elements are specially handled in the communications for contact forces.

The local search for contact has to be performed at each time-step. The local search procedure is a purely geometric operation; details can be found in the literature, such as Ref.[11]. Once an actual contact pair is determined, a contact interaction algorithm is employed to evaluate the contact forces. In this work, a penalty function based interaction law [12] is employed. This approach can deal with normal and tangential contact forces efficiently, especially in 3D problems. It should be mentioned that this approach is history dependent. Thus the data redistribution for the contact history must be performed when the domain decomposition is performed.
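To illustrate the penalty idea for two discrete spheres, the repulsive force can be taken proportional to the interpenetration depth. This sketch covers the normal term only; the history-dependent tangential part of the law in Ref.[12] is omitted, and the penalty stiffness `kn` is an assumed parameter:

```python
import math

def sphere_contact_force(c1, r1, c2, r2, kn):
    """Normal penalty force on sphere 1 from sphere 2 (illustrative).
    c1, c2 are centre coordinates; r1, r2 radii; kn a penalty stiffness."""
    d = [a - b for a, b in zip(c1, c2)]
    dist = math.sqrt(sum(x * x for x in d))
    pen = (r1 + r2) - dist            # interpenetration depth
    if pen <= 0.0 or dist == 0.0:
        return (0.0, 0.0, 0.0)        # no contact (or degenerate overlap)
    n = [x / dist for x in d]         # unit normal, pointing towards sphere 1
    return tuple(kn * pen * nx for nx in n)
```

Because the force depends only on the current configuration (plus, in the full law, stored contact history), evaluating it per contact pair inside a subdomain needs no communication.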


2.5 Dynamic Domain Decomposition

The domain decomposition is performed every several time-steps. In this work, a dual-level domain decomposition strategy is proposed, in which the decomposition can be at a low level or at a high level. At the low level, all subdomain positions are updated with the evolution of the sub-problems. At the high level, only the sub-problems are updated, while the subdomain positions remain unchanged. Thus, the partitioning procedure is not required in the high-level decomposition. The decision as to whether the low-level or high-level decomposition should be performed depends on two parameters: the accumulated maximum displacement and the level of workload imbalance. The accumulated maximum displacement since the last decomposition can be evaluated based on the maximum velocity in each time step and the step size. The level of workload imbalance in the current time step can be evaluated based on the actual runtime of each processor during the time step.

If the accumulated maximum displacement is greater than or equal to one half of the size of the subdomain buffer-zone, i.e. when an internal element of the current subdomain may have crossed the subdomain boundary to enter a neighbouring subdomain, the high-level domain decomposition has to be performed. All the data associated with this element need to be copied/migrated so as to keep the completeness and validity of each sub-problem. Clearly, the size of the subdomain buffer-zone will significantly affect the frequency of such decomposition. If the level of workload imbalance exceeds a prescribed value, i.e. a significant workload imbalance is detected, the low-level decomposition is required, which involves a global domain repartition.

To partition/re-partition the entire domain, a Recursive Coordinate Bisection (RCB) [13] based algorithm is used as the partitioner in the present work. The algorithm considers the geometric locations of all the nodes, determines in which coordinate direction the set of nodes is most elongated, and then divides the nodes by splitting along that long direction. The two halves are then further divided by applying the same algorithm recursively. To balance the workload in the partitioning process, the contact list is used to evaluate the computational cost of each element. For a simulation that is dominated by one type of interaction, and where the objects have a nearly even distribution across the whole domain, the global contact list can be used.
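The two-parameter decision above can be condensed into a small selector. In this sketch the thresholds are passed in, and the rule that a detected imbalance takes priority over the displacement criterion is an assumption, not stated in the text:

```python
def choose_decomposition_level(acc_max_disp, buffer_size,
                               imbalance, imbalance_tol):
    """Select the decomposition level of Section 2.5.
    Returns 'low' (global repartition), 'high' (data migration only,
    subdomain boxes unchanged), or None (no decomposition this step)."""
    if imbalance > imbalance_tol:
        return "low"                      # significant workload imbalance
    if acc_max_disp >= 0.5 * buffer_size:
        return "high"                     # an element may have crossed a boundary
    return None
```

The master processor would evaluate such a rule once per step, after receiving the accumulated maximum displacement from every slave.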


However, it is more reasonable to use the local (actual) contact list generated at the last time-step for general problems. Once the computational costs of all elements are determined, the workload of each node can be derived from its corresponding elements. The details of the dynamic decomposition can be found in Ref.[14].
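A minimal sketch of the weighted RCB partitioner described above: recurse on the most elongated coordinate direction, cutting the sorted point set at the weight median. This is illustrative only, not the authors' implementation; in the paper the weights would come from the per-element contact list:

```python
def rcb(indices, points, weights, n_parts):
    """Recursive Coordinate Bisection: split `indices` (into `points`)
    into `n_parts` groups of roughly equal total weight."""
    if n_parts == 1:
        return [indices]
    dims = len(points[0])
    # direction in which the current set is most elongated
    spans = [max(points[i][d] for i in indices) -
             min(points[i][d] for i in indices) for d in range(dims)]
    d = spans.index(max(spans))
    order = sorted(indices, key=lambda i: points[i][d])
    # cut where the accumulated weight reaches the proportional target
    half = n_parts // 2
    target = sum(weights[i] for i in indices) * half / n_parts
    acc, cut = 0.0, 0
    for k, i in enumerate(order):
        acc += weights[i]
        cut = k + 1
        if acc >= target:
            break
    return (rcb(order[:cut], points, weights, half) +
            rcb(order[cut:], points, weights, n_parts - half))
```

Each recursion level halves the problem, so the partitioner touches every point O(log n_parts) times; the resulting parts are axis-aligned boxes, matching the subdomain shape assumed in Section 2.2.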

3 NUMERICAL EXAMPLES

Two examples, both discrete element dominated, are provided in this section to illustrate the performance of the suggested parallel computational model. The parallelised finite/discrete element analysis program is tested on a PC cluster environment. Each node in the cluster is a PC with a 1 GHz processor and 512 MB of memory, running the Linux operating system. The standard GCC C compiler is used to compile the program. An MPI-based message-passing approach, the MPICH package [15], is used to provide a portable parallel execution.

Fig.3 Hopper filling: configurations at 1.5 s

Table 1 Runtime and speedup for example 1
Number of processors        1         2         4         8
runtime/hr              46808.1   23760.3   13528.5    9671.2
speedup                     1       1.97      3.46      4.84

3.1 Hopper Filling

This example simulates a 3D ore hopper filling process. The ore particles are represented by discrete elements as spheres, and the hopper and the wall are represented by finite elements. The particles are initially regularly packed at the top of the hopper and then allowed to fall under the action of gravity. A total of 8 768 discrete elements and 9 quadrilateral finite elements are used to represent the ore, hopper and hopper bin. The radius of the spheres is 0.1 m. The material properties for the hopper and the hopper bin are: Young's modulus E = 2.1 × 10^11 N/m², Poisson's ratio ν = 0.29, mass density ρ = 7.86 × 10³ kg/m³. The material properties for the ore are: E = 6.25 × 10^10 N/m², ν = 0.2, ρ = 2.367 × 10³ kg/m³. The total number of degrees of freedom is 52 656, and friction is included in the simulation. Figure 3 shows the distribution of the balls at 1.5 s. The total runtime required to simulate the filling for 150 000 steps with various numbers of processors is shown in Table 1. The corresponding speedup is also presented in Table 1, and the parallel efficiency is shown in Fig.4. It appears that the parallel efficiency with 8 processors is only 61%. The reason for this is that the communication overhead is significant in such a distributed memory environment for this small-scale problem.

Fig.4 Parallel efficiency for example 1
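The speedup and efficiency figures follow directly from the runtimes in Table 1 via S_p = T_1/T_p and E_p = S_p/p. A quick check (function name illustrative):

```python
def speedup_and_efficiency(runtimes):
    """Map processor count p -> (speedup, efficiency) given runtimes
    {p: T_p}, where T_1 is the single-processor runtime."""
    t1 = runtimes[1]
    return {p: (t1 / t, t1 / (t * p)) for p, t in runtimes.items()}

table1 = {1: 46808.1, 2: 23760.3, 4: 13528.5, 8: 9671.2}
perf = speedup_and_efficiency(table1)
```

Here `perf[8]` gives a speedup of about 4.84 and an efficiency of about 0.60, consistent with the Table 1 entries and the roughly 61% eight-processor efficiency quoted in the text.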

3.2 Dragline Bucket Filling

This example simulates a dragline bucket filling process, as illustrated in Fig.5, where the blasted soil/rock is modeled by discrete elements as a collection of layered particles (spheres). The bucket is modeled with conventional finite elements, and the filling is simulated by dragging the bucket with a prescribed motion. A total of 200 000 discrete elements and 108

Fig.5 Example 2: Dragline bucket filling model


finite elements are used to model the system. The total number of degrees of freedom is 1 200 630, and friction is considered. Table 2 shows the total runtime required to simulate the filling process for 50 000 steps with various numbers of processors. Note that the sequential runtime on one processor cannot be obtained, because the problem is too large to run in a sequential environment. Therefore, the sequential runtime is "predicted" by doubling the runtime for 2 processors. The corresponding speedup is also presented in Table 2. For this example, the total runtime decreases rapidly with the number of processors. This example illustrates the scalability of the program for a larger-scale numerical problem.

All the computations with 4, 8 and 16 processors exhibit superlinear speedup. This often observed effect in parallel computations arises from the aggregate effect of the increasing amount of cache memory available to store data for the fixed problem size. On the other hand, memory requirements increase rapidly for friction calculations when large-scale contacts occur. This growth becomes catastrophic for sequential simulation and for parallel computing with a small number of processors; in this case, the frequent dynamic memory reallocations take a considerable time. Another reason for the superlinear speedup is that the CPU time used to perform the global search does not decrease proportionally with the number of processors.

Table 2 Runtime and speedup for example 2
Number of processors        1        2        4       8       16
runtime/hr               2168.2   1084.4   520.1   254.9   126.3
speedup                     1       1.99     4.17    8.51   17.16

4 CONCLUSIONS

A parallel implementation strategy for combined finite/discrete element analysis on a PC cluster is presented in this work. Based on the discussions and the numerical comparative studies presented above, the main features of the implementation include:
(1) A dual-level domain decomposition scheme is adopted to implement the dynamic domain decomposition. The domain decomposition approach matches well the requirement of reducing the memory size per processor of the calculation.
(2) Classifying elements into internal, interfacial and external objects not only improves the efficiency of the dynamic domain decomposition, but also makes the contact treatment very simple. The global search can be conducted locally on each processor, and all the


contact detection algorithms developed for the sequential computation can be used directly in the parallel computation.
(3) Implementation of the proposed parallel strategy provides good parallel performance with good scalability, as the calculations show superlinear speedups for the example problem. It is suitable for simulating problems involving a large number of separate bodies, especially for calculations including friction. In this way it is possible to obtain the solution in a short time, or to run cases which could not fit into the memory of a single PC.

REFERENCES
1 Munjiza A, Owen DRJ, Bicanic N. A combined finite/discrete element method in transient dynamics of fracturing solids. Engineering Computations, 1995, 12: 145~174
2 Owen DRJ, Feng YT, Wang FJ. The modelling of multi-fracturing solids and particulate media. In: Fifth World Congress on Computational Mechanics. Vienna, Austria, 2002, July 7~12
3 Wang FJ. Parallel computation of contact-impact problems with FEM and its engineering application. [Ph D Thesis]. Beijing, China: Tsinghua University, 2000 (in Chinese)
4 Krysl P, Belytschko T. Object-oriented parallelisation of explicit structural dynamics with PVM. Computers and Structures, 1998, 66: 259~273
5 Krysl P, Bittnar Z. Parallel explicit finite element solid dynamics with domain decomposition and message passing: dual partitioning scalability. Computers and Structures, 2001, 79: 345~360
6 Owen DRJ, Feng YT, Han K, et al. Dynamic domain decomposition and load balancing in parallel simulation of finite/discrete elements. In: ECCOMAS 2000, Barcelona, Spain, 2000, Sep 11~14
7 Owen DRJ, Feng YT. Parallelised finite/discrete element simulation of multi-fracturing solids and discrete systems. Engineering Computations, 2001, 18(3-4): 557~576
8 Wang FJ, Cheng JG, Yao ZH. Parallel algorithm of explicit integration method in structure nonlinear dynamic analysis. Journal of Tsinghua University (Sci & Tech), 2002, 42(4): 431~434 (in Chinese)
9 Wang FJ, Cheng JG, Yao ZH. A contact searching algorithm for finite element analysis of contact-impact problems. Acta Mechanica Sinica, 2000, 16(4): 374~382
10 Feng YT, Owen DRJ. An augmented spatial digital tree algorithm for contact detection in computational mechanics. International Journal for Numerical Methods in Engineering, 2002, 55: 159~176

11 Wang FJ, Cheng JG, Yao ZH. FFS contact searching algorithm for dynamic finite element analysis. International Journal for Numerical Methods in Engineering, 2001, 52: 655~672
12 Han K, Peric D, Owen DRJ, et al. A combined finite/discrete element simulation of shot peening process. Part II: 3D interaction laws. Engineering Computations, 2000, 17(6/7): 683~702
13 Hendrickson B, Devine K. Dynamic load balancing in computational mechanics. Comp Meth Appl Mech Engng, 2000, 184: 485~500
14 Feng YT, Wang FJ, Owen DRJ. Dynamic domain decomposition based parallel strategies for discrete systems on distributed memory platforms. In: VII International Conference on Computational Plasticity, Barcelona, Spain, 2003
15 Gropp W, Lusk E, Skjellum A. Using MPI: Portable Parallel Programming with the Message-Passing Interface. Cambridge, Massachusetts: The MIT Press, 1994
