6th World Congresses of Structural and Multidisciplinary Optimization Rio de Janeiro, 30 May - 03 June 2005, Brazil
Parallel Optimality Criteria-based Topology Optimization for Minimum Compliance Design
Arash Mahdavi, Raghavan Balaji, Mary Frecker, Eric M. Mockensturm
Department of Mechanical and Nuclear Engineering, Pennsylvania State University, University Park, PA 16802, USA

1. Abstract Topology optimization is often used in the conceptual design stage as a preprocessing tool to obtain the overall material distribution in the solution domain. The resulting topology is then used as an initial guess for shape optimization. It is always desirable to use a fine computational grid to obtain high-resolution layouts that minimize the need for shape optimization and post-processing [1], but this approach results in a high computational cost and is prohibitive for large structures. To reduce the computation time of such problems, parallel computing in combination with domain decomposition is used. The power-law approach has been used as the material distribution method, and an optimality criteria-based optimizer is used to locate the optimum solution [2, 3]. The equilibrium equations are solved using a preconditioned conjugate gradient algorithm. These calculations have been carried out using a master-slave programming paradigm on a coarse-grain Multiple Instruction Multiple Data (MIMD) shared-memory architecture. In this study, by avoiding assembly of the global stiffness matrix, both the memory requirement and the computation time have been reduced. The results of the current study show that the parallel computing technique is a valuable tool for solving computationally intensive topology optimization problems.
2. Keywords Topology optimization, parallel computing, finite element analysis, MPI, SIMP, domain decomposition.
3. Introduction Topology optimization has gained widespread popularity in academia and industry and is being applied to the design of materials, mechanisms, micro-electro-mechanical systems (MEMS), and many other complex structural design problems. In the literature, one can find a multitude of approaches for solving topology optimization problems. In 1988, Bendsoe and Kikuchi [4] introduced the so-called microstructure/homogenization approach. In 1989, an alternative approach to topology optimization was introduced by Bendsoe [5]. This method is called the "power-law" or Solid Isotropic Material with Penalization (SIMP) approach. A similar approach was advocated by Zhou and Rozvany in 1991 [6] and also by Mlejnek in 1992 [7]. In the power-law approach, material properties are assumed to be constant within each element, and the design variables are the elements' relative densities. The material properties are modeled as the material properties of the solid material multiplied by the relative density raised to some power [2]. Despite much theoretical progress in this field, and considering the major benefits that industry could gain by using this method in the conceptual design stage, the application of topology optimization to industrial problems is not yet widespread. The main reason is that, as an iterative process, topology optimization of large real-world structures is computationally intensive. Thus, there is a need to find computationally efficient ways to perform the topology optimization of complex structures with a large number of degrees of freedom. One way to address this problem is to use efficient discretization techniques such as the boundary element method (BEM) and meshless techniques. DeRose Jr. and Diaz [8] developed a meshless fictitious domain method based on a wavelet basis and a Galerkin scheme to solve computationally intensive 3D topology optimization problems. Maar and Schulz [9] used a multigrid interior-point approach for solving large topology optimization problems. Another approach is to use a faster finite element solver, either by using different element formulations, such as p-version finite elements, or by using a faster equation solver, such as one of the various iterative solvers. Reducing the total number of analyses through heuristic techniques such as re-analysis can reduce the solution time as well. All of the above approaches are based on reducing the number of floating-point operations needed to solve the topology optimization problem. Another approach is to increase the computational power of the system through parallel computing. Reviewing the literature, it appears that the application of parallel computing to topology optimization is rare. Borrvall and Petersson [10] considered topology optimization of 3D domains using parallel processing. In their work, they used the so-called regularized intermediate density control method to ensure the existence and uniqueness of the solution and to obtain black-and-white final layouts. In their method, a penalty parameter must be calculated for each problem in order to enforce a black-and-white solution, which requires some numerical experimentation and adds to the complexity of the method.
They used the MMA approach for locating the optimal solution. Their study showed the effectiveness of the preconditioned conjugate gradient solver for solving parallel topology optimization problems. The main objective of the present work is to implement a simple and efficient parallel computing technique for solving large-scale topology optimization problems. To this end, a much simpler approach based on the power-law technique and a heuristic filter is adopted here [2]. An optimality criteria method has been employed for updating the design variables (relative densities) [2]. For the solution of the equilibrium equations, a preconditioned conjugate gradient method is used, and in order to speed up the computation, the FEA and sensitivity calculations have been parallelized. The remainder of the paper is organized as follows. Section 3.1 describes the formulation of the topology optimization problem. In Section 3.2, the domain decomposition technique used in this paper is discussed. Section 3.3 explains the numerical methods for the iterative solution of the equilibrium equations. In Section 3.4, the overall structure of the program and the parallelization algorithm are outlined. Section 4 includes numerical results of different case studies. In Section 5, some conclusions based on these case studies are presented.

3.1. Problem Formulation The objective of the optimization process is to determine the stiffest possible structure for a given domain, amount of material, load distribution, and support conditions by minimizing the compliance of the system. In order to solve this problem, a "power-law" or SIMP approach has been adopted [2]. A detailed description of the formulations described in this section can be found in Sigmund [2]. The problem can be expressed as:

\min_{x} : \; c(x) = U^T K U = \sum_{e=1}^{N} (x_e)^p \, u_e^T k_0 \, u_e \qquad (1)

\text{subject to} : \; \frac{V(x)}{V_0} = f \qquad (2)

: \; K U = F \qquad (3)

: \; 0 < x_{min} \le x \le 1 \qquad (4)

where U and F are the global displacement and force vectors, respectively, K is the global stiffness matrix, u_e and k_0 are the element displacement vector and element stiffness matrix, respectively, x is the vector of design variables, and x_{min} is a vector of minimum relative densities (nonzero to avoid singularity). N is the number of elements, p is the penalization power (typically p = 3), V(x) and V_0 are the material volume and the design domain volume, respectively, and f is the prescribed volume fraction [2]. The above optimization problem can be solved using different techniques such as Optimality Criteria (OC), Sequential Linear Programming (SLP), and the Method of Moving Asymptotes (MMA) [2]. Here a standard optimality criteria method with a heuristic design-updating scheme is used:
x_e^{new} =
\begin{cases}
\max(x_{min},\, x_e - move) & \text{if } x_e B_e^{\eta} \le \max(x_{min},\, x_e - move) \\
x_e B_e^{\eta} & \text{if } \max(x_{min},\, x_e - move) < x_e B_e^{\eta} < \min(1,\, x_e + move) \\
\min(1,\, x_e + move) & \text{if } \min(1,\, x_e + move) \le x_e B_e^{\eta}
\end{cases}
\qquad (5)
where move is a positive move limit and η = 0.5 is a damping coefficient. B_e is found from the optimality condition [2]:

B_e = \frac{-\,\partial c / \partial x_e}{\lambda\, \partial V / \partial x_e} \qquad (6)
The Lagrange multiplier λ is obtained using a bisection iterative method. The numerator is the sensitivity of the objective function with respect to the design variable, and is found as [2]:

\frac{\partial c}{\partial x_e} = -p\, (x_e)^{p-1}\, u_e^T k_0\, u_e \qquad (7)
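For illustration, the update scheme of Eqs. (5)-(7) can be sketched in a few lines. The following Python fragment is not the authors' Fortran 90 code; it assumes the element strain-energy terms u_e^T k_0 u_e are available in an array `ce` and that all elements have unit volume, so that ∂V/∂x_e = 1.

```python
import numpy as np

def sensitivities(x, ce, p=3.0):
    """Eq. (7): dc/dx_e = -p * x_e^(p-1) * u_e^T k_0 u_e (ce holds u_e^T k_0 u_e)."""
    return -p * x**(p - 1) * ce

def oc_update(x, dc, volfrac, move=0.2, eta=0.5, xmin=0.001):
    """Eqs. (5)-(6): heuristic OC update with bisection on the Lagrange
    multiplier; unit element volumes are assumed, so dV/dx_e = 1."""
    l1, l2 = 0.0, 1.0e9
    while (l2 - l1) / (l1 + l2 + 1e-30) > 1e-4:
        lam = 0.5 * (l1 + l2)                    # bisection on lambda
        Be = -dc / lam                           # Eq. (6) with dV/dx_e = 1
        xnew = np.clip(x * Be**eta,              # Eq. (5): move limits and bounds
                       np.maximum(xmin, x - move),
                       np.minimum(1.0, x + move))
        if xnew.mean() > volfrac:                # volume constraint, Eq. (2)
            l1 = lam
        else:
            l2 = lam
    return xnew
```

The bisection on λ drives the updated design toward the prescribed volume fraction of Eq. (2) at every design update.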
Since a power-law approach is used, to ensure the existence of solutions to the optimization problem it must be combined with either a perimeter constraint, a gradient constraint, or a filtering technique. In this program a filtering technique is used, which works by modifying the element sensitivities as follows:

\left.\frac{\partial c}{\partial x_e}\right|_{filtered} = \frac{1}{x_e \sum_{f=1}^{N} \bigl(r_{min} - \mathrm{dist}(e,f)\bigr)} \sum_{f=1}^{N} \bigl(r_{min} - \mathrm{dist}(e,f)\bigr)\, x_f\, \frac{\partial c}{\partial x_f} \qquad (8)
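As a concrete illustration of Eq. (8) (again only a sketch, not the authors' implementation), the filtered sensitivities on a regular nelx × nely grid of unit square elements can be computed with element-center distances for dist(e, f), keeping only the neighbours with positive weight:

```python
import numpy as np

def filter_sensitivities(x, dc, nelx, nely, rmin=1.2):
    """Mesh-independency filter of Eq. (8) on a regular grid of unit squares.
    x and dc are (nelx*nely,) arrays ordered column by column (e = i*nely + j)."""
    dcn = np.zeros_like(dc)
    r = int(np.ceil(rmin))
    for i in range(nelx):
        for j in range(nely):
            e = i * nely + j
            num, den = 0.0, 0.0
            for k in range(max(i - r, 0), min(i + r + 1, nelx)):
                for l in range(max(j - r, 0), min(j + r + 1, nely)):
                    f = k * nely + l
                    w = rmin - np.hypot(i - k, j - l)   # weight r_min - dist(e, f)
                    if w > 0.0:
                        num += w * x[f] * dc[f]
                        den += w
            dcn[e] = num / (x[e] * den)
    return dcn
```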
The original sensitivities are thus averaged over a circular area centered at the center of the corresponding element (element e) with radius r_min. These filtered sensitivities are used in the optimality criteria updating process [2]. In this study, r_min = 1.2 and p = 3 have been used for all examples.

3.2. Domain Decomposition The key element in the application of parallel computing to the solution of discretized elliptic PDE problems is the domain decomposition method. Domain decomposition is a technique for dividing the computational load between different processors to speed up the computation. It can be classified into two categories, explicit domain decomposition and implicit domain decomposition [11, 12]. In explicit domain decomposition, the design domain is partitioned into several sub-domains (Fig. 1), one for each processor. All of the elements within each sub-domain are subject to the same instructions. In implicit domain decomposition, instead of a physical
partitioning of the domain, first the global system of equations is assembled, and then the resulting matrices are partitioned and sent to separate processors.
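A minimal sketch of the explicit decomposition shown in Fig. 1 is given below (illustrative only; it assumes column-wise element numbering and that the rectangular grid is split into vertical strips, one per slave processor):

```python
def strip_partition(nelx, nely, nslaves):
    """Assign element columns to slave processors as vertical strips (Fig. 1).
    Returns one list of element indices per slave, assuming elements are
    numbered column by column (element e = i*nely + j)."""
    cols_per_slave = nelx // nslaves
    strips = []
    for s in range(nslaves):
        c0 = s * cols_per_slave
        c1 = nelx if s == nslaves - 1 else (s + 1) * cols_per_slave
        strips.append([i * nely + j for i in range(c0, c1) for j in range(nely)])
    return strips
```

For example, strip_partition(40, 20, 4) would give each of four slaves a strip of 10 element columns (200 elements).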
Fig. 1 Explicit domain decomposition for a 2D rectangular domain.

Traditionally, two major techniques are employed for domain decomposition: iterative techniques and sub-structuring. In the iterative methods, information concerning the nodes along the common boundaries is communicated between processors at each iteration. In the sub-structuring method, each sub-domain is treated as a super-element. Thus, using static condensation, the internal degrees of freedom of each sub-domain are condensed out and the problem is formulated using just the degrees of freedom at the boundaries. After solving for these retained degrees of freedom, the internal degrees of freedom are evaluated (recovery) [12]. In the present work a third approach has been adopted for domain decomposition. In this approach there is no need to transfer boundary data between neighboring sub-domains, and the calculations related to each sub-domain are independent of the neighboring sub-domains. The interaction of the neighboring sub-domains is taken into account during assembly of the internal force vector. By adopting this strategy, inter-processor communication has been minimized.

3.3. Solvers In general, linear equation solvers may be classified into two types: direct solvers (e.g., Gauss elimination) and iterative solvers (e.g., conjugate gradient). Both direct and iterative solvers can be programmed for parallel machines. In the present work, parallel iterative solvers are preferred over parallel direct solvers, since for large problem sizes parallel direct solvers are less efficient due to greater inter-processor communication, have much greater memory requirements than iterative solvers, and are generally not well suited to the large sparse systems of equations that arise in large-scale topology optimization. The main criteria for selecting a parallel solver are the type of analysis and the bandwidth of the stiffness matrix of the structure. According to the literature, conjugate gradient methods become more efficient than direct methods when either three-dimensional elements are encountered or several thousand plane elements are used [12]. In other words, CG-based methods become more competitive when relatively large bandwidths are encountered [12]. For the current study, which considers meshes with more than 10,000 elements, iterative methods such as the conjugate gradient method are superior to direct solvers. Thus, in this paper the conjugate gradient method has been used for solving the equilibrium equations.

3.4. Program Structure A parallel program has been written in Fortran 90 using the Message Passing Interface (MPI) for inter-processor communication. This program consists of the following modules:
1. Finite Element Analysis (FEA)
2. Sensitivity calculation
3. Mesh-independency filter
4. Optimality criteria-based update
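The following schematic driver (a Python-style sketch, not the authors' Fortran 90 program) shows one way these four modules might be nested; it reuses the helper routines sketched above, while `solve_equilibrium` and `element_compliances` are hypothetical callables supplied by the finite element part of the code:

```python
import numpy as np

def topopt_loop(x, F, nelx, nely, volfrac, solve_equilibrium, element_compliances,
                p=3.0, tol=0.01, max_iter=200):
    """Schematic driver: the FEA block (module 1) is nested inside the
    optimization loop (modules 2-4). solve_equilibrium and element_compliances
    are assumed to be provided; in the real program they run on the slaves."""
    U = None                                          # warm start between iterations
    for it in range(max_iter):
        U = solve_equilibrium(x, F, u0=U)             # module 1: Jacobi-PCG FEA
        ce = element_compliances(x, U)                # u_e^T k_0 u_e for each element
        dc = sensitivities(x, ce, p)                  # module 2: Eq. (7)
        dc = filter_sensitivities(x, dc, nelx, nely)  # module 3: Eq. (8)
        xnew = oc_update(x, dc, volfrac)              # module 4: Eqs. (5)-(6)
        if np.max(np.abs(xnew - x)) < tol:            # stop when the design stagnates
            return xnew
        x = xnew
    return x
```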
In this work, both the FEA and sensitivity calculation modules are parallelized. It is observed that parallelizing the FEA has the most prominent effect on performance, since it accounts for the bulk of the computation [1, 10]. In addition, the optimizer module can proceed only after the equilibrium solution has converged. In the program developed, the FEA block (module 1) is nested within the optimization block (modules 2, 3, and 4). Mesh-independency filtering is not parallelized, since filtering is not local in nature and requires averaging the sensitivities over several elements. Also, the calculations related to sensitivity filtering are negligible compared to those required for the solution of the equilibrium equations and the calculation of the element sensitivities (more than 97% of the operations are related to solving the equilibrium equations and calculating the sensitivities). A master-slave paradigm is used for the parallelization of this program, with communication occurring only between the slave processors and the master processor. Finally, to decrease the idleness of the master processor, the sensitivity filtering and the optimality criteria-based update are assigned to the master processor.

3.4.1 Equilibrium Equation Solver Loop In this module, explicit domain decomposition first partitions the domain into a number of vertical strips equal to the number of slave processors (Fig. 1); then a CG iterative solver with Jacobi preconditioning is used to solve the equilibrium equations. In this approach, the global stiffness matrices are neither assembled nor stored, and all the computations requiring stiffness are performed
by the slave processors at the element level, thus replacing the time-consuming process of global assembly with assembly and elimination of constrained degrees of freedom at the sub-domain level. For the load cases considered, a simple geometry and hence a structured mesh have been assumed, where all the elements are unit squares and the aspect ratio of the rectangular domain is maintained through different numbers of divisions in the x and y directions [2].

3.4.2 Optimization Loop The main feature of the optimization loop is that the structure is assumed to change only slightly between consecutive optimization iterations; hence the converged equilibrium solution of the previous iteration is used as the initial guess for the next iteration, which speeds up the overall process considerably. Also, for the first iteration, the initial guess U0 has been obtained by solving D U0 = F, where D is a diagonal matrix consisting of the diagonal elements of the stiffness matrix and F is the load vector. Finally, the contribution of elements with relative densities below a pre-defined threshold is neglected, to further speed up the solution process.
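As a serial illustration of the solver loop described in Sections 3.4.1 and 3.4.2 (the authors' implementation is a parallel Fortran 90/MPI code; this sketch only shows the idea): the stiffness matrix is never assembled, the product K·u is accumulated element by element, and the previous displacement vector is reused as the warm start. The element matrix `k0`, the element-to-dof map `element_dof`, and homogeneous supports listed in `fixed` are assumptions of this sketch; in the parallel version each slave would run the element loops only over its own strip and the partial vectors would be summed on the master.

```python
import numpy as np

def matvec(x, u, k0, element_dof, p=3.0):
    """Matrix-free product K(x) u, accumulated element by element."""
    y = np.zeros_like(u)
    for e, dofs in enumerate(element_dof):        # each slave would loop only
        y[dofs] += (x[e] ** p) * (k0 @ u[dofs])   # over the elements of its strip
    return y

def jacobi_pcg(x, F, k0, element_dof, fixed, u0=None, tol=1e-8, max_iter=10000, p=3.0):
    """Jacobi-preconditioned CG for K(x) U = F without assembling K.
    `fixed` lists constrained dofs (homogeneous); `u0` is the warm start."""
    free = np.setdiff1d(np.arange(F.size), fixed)
    diag = np.zeros_like(F)
    for e, dofs in enumerate(element_dof):        # diagonal of K for the preconditioner
        diag[dofs] += (x[e] ** p) * np.diag(k0)
    u = np.zeros_like(F) if u0 is None else u0.copy()
    u[fixed] = 0.0
    r = F - matvec(x, u, k0, element_dof, p)
    r[fixed] = 0.0
    z = r / diag
    z[fixed] = 0.0
    d = z.copy()
    rz = r[free] @ z[free]
    for _ in range(max_iter):
        q = matvec(x, d, k0, element_dof, p)
        q[fixed] = 0.0
        alpha = rz / (d[free] @ q[free])
        u += alpha * d
        r -= alpha * q
        if np.linalg.norm(r[free]) <= tol * np.linalg.norm(F[free]):
            break
        z = r / diag
        z[fixed] = 0.0
        rz_new = r[free] @ z[free]
        d = z + (rz_new / rz) * d
        d[fixed] = 0.0
        rz = rz_new
    return u
```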
A flowchart of the entire program is shown in Figure 2.

Fig. 2 Flowchart

4. Case Studies In order to evaluate the efficiency of the program, four benchmark problems have been studied. In these problems, the effects of the number of processors, mesh size, loading configuration, and support conditions have been examined. In all of the problems, three different mesh sizes have been studied, 40 × 20, 80 × 40, and 160 × 80, in which the first number refers to the number of elements in the x-direction and the second to the number in the y-direction. Depending on the size of the problem, 1, 2, 4, 8, 16, and 32 processors were used for the parallel computation. The design problems are shown in Fig. 3. For all test cases, E = 1, ν = 0.3, and a volume fraction f = 0.5 have been used. In addition, all the force components are unity.
Fig. 3 Test Cases
The analysis was performed on an SGI Origin 2000 workstation at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign (UIUC). The hardware architecture was a shared-memory MIMD machine with 64 MIPS R10000 processors, each with a clock speed of 195 MHz and 2 operations per clock cycle. Algorithmic speed-up, as defined in (9), together with the parallel efficiency, was used as the performance measure to compare the various test cases.

\text{Algorithmic speed-up:} \quad S_{PA} = T_1 / T_P \qquad (9)

where T_P is the time it takes to run the parallel code on N parallel processors, and T_1 is the time it takes to run the parallel code on a single processor [14].
Fig. 4 Final layouts for problems 1 to 4

As expected, by using a fine mesh, fine details of the optimum shape can be captured, and the final layouts consist of smooth boundary curves. It is possible to use detailed FE meshes of the domain to obtain final layouts that are near-final designs and do not require extensive post-processing. Reviewing the results, it appears that no direct correlation exists between the number of elements and the number of optimization iterations. The number of optimization iterations is found to depend more on the load and support conditions and less on the number of elements. As expected, the accuracy of the calculated compliance increases with increasing mesh density. This increase in accuracy is the result of a better representation of the continuum; as the element size tends to zero, the compliance approaches the exact lower-bound solution.

4.1 Performance Charts Figure 5 presents the variation of speed-up as a function of the number of processors for the four benchmark problems. As can be seen, for a given mesh size, increasing the number of processors initially reduces the computational time significantly, but this increase in processing speed tops out asymptotically, which is in agreement with Amdahl's law [14]. This law states that, for a problem of fixed size, as the number of processors increases, the algorithmic speed-up asymptotically approaches a theoretical limit. The reason is that every code inherently has some serial portion that cannot be parallelized, and this section of the code determines the maximum speed-up that one can achieve through parallel computing. Amdahl's law reads:

S_{PA} = \frac{1}{(1 - R) + R/N} \qquad (11)

where S_{PA} is the algorithmic speed-up, R = P / (S + P) is the ratio of the time spent in the parallel part of the code to the total execution time, S is the time spent in the serial portion of the code, P is the time spent in the parallel portion of the code, and N is the number of processors.
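As a rough numeric illustration of Eq. (11) (an assumption for illustration only: the parallelizable fraction of run time is taken to be comparable to the 97% share of operations reported in Section 3.4), the predicted speed-up can be evaluated for the processor counts used in this study; the plateau beyond about 8-16 processors is already visible, and since communication overhead is neglected the measured speed-ups are lower than these estimates.

```python
def amdahl_speedup(R, N):
    """Eq. (11): algorithmic speed-up for parallel fraction R on N processors."""
    return 1.0 / ((1.0 - R) + R / N)

# Rough illustration assuming R ~ 0.97 (assumed parallelizable fraction):
for N in (1, 2, 4, 8, 16, 32):
    print(N, round(amdahl_speedup(0.97, N), 2))
# prints roughly: 1 -> 1.0, 2 -> 1.94, 4 -> 3.67, 8 -> 6.61, 16 -> 11.03, 32 -> 16.58
```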
[Fig. 5 panels: speed-up vs. number of processors for Problems 1-4, each plotted for the 40 × 20, 80 × 40, and 160 × 80 meshes.]
Fig. 5 Performance charts for problems 1 to 4

As is clear from the above formula, for a problem of fixed size there exists a theoretical limit for S_{PA} as the number of processors is increased:

(S_{PA})_{max} = \lim_{N \to \infty} S_{PA} = \frac{1}{1 - R} \qquad (12)
This partially explains the saturation observed in the performance plots beyond a certain number of processors. As the mesh size increases, this saturation occurs at a larger number of processors. In the benchmark examples documented here, the saturation happens at around 8 processors; the results for 16 and 32 processors are presented only to show that the performance plateau has been reached and that the computational gain from increasing the number of processors beyond a certain limit is not worth the cost of adding the new processors. As can be seen, Amdahl's law does not take into account the time needed for inter-processor communication. As the number of processors increases, for a problem of fixed size, the ratio of communication time to computation time increases for each processor. This increase in communication overhead is another reason for the degradation of the parallel efficiency of the system. It is worth noting that, as the optimization iterations proceed, from a theoretical point of view the number of equilibrium iterations should increase, because more voids are introduced into the solution domain, making it more heterogeneous, and as a result the condition number of the stiffness matrix deteriorates. However, our numerical experiments show that the appearance of these holes does not affect the number of equilibrium iterations much. There are two reasons for this behavior: first, we used Jacobi preconditioning, which reduces the effect of system heterogeneity; second, as the optimization proceeds, the solution of the previous optimization iteration is used as the initial guess for the next set of equilibrium iterations, which dramatically reduces the total number of equilibrium iterations.

5. Conclusions In this paper a parallel processing algorithm for the compliance topology optimization problem is proposed, in which the FEA and sensitivity calculations have been parallelized. The method is shown to significantly reduce computation time for problems with
relatively fine meshes, and it is expected that problems with very large numbers of elements would also benefit from the proposed approach. The domain decomposition approach and the master-slave programming paradigm are shown to be appropriate tools for solving the compliance problem in topology optimization. Solution of the equilibrium equations is accomplished through the use of a parallel Jacobi conjugate gradient solver, which is based on a simple diagonal pre-conditioner. The reason for using this is that the pure conjugate gradient solver is relatively slow and is not suitable for an iterative design process like topology optimization. This pre-conditioning speeds up the solver drastically and is shown to provide about a six-fold improvement in performance. It was found that a good initial guess of the displacements could improve the performance of the program. Here, the displacement vector obtained in the previous optimization iteration has been used as the initial guess for the displacement vector. In the problems studied, no direct correlation was found between the number of elements and the number of optimization iterations. The number of optimization iterations was found to depend more on the load and support conditions and less on the number of elements. For a given mesh size, increasing the number of processors reduces the computational time significantly, but this increase in processing speed tops out asymptotically. For a particular problem, increasing the mesh density results in an increase in the maximum achievable speed-up. In this study, a speed-up of up to 6 has been observed. In addition, the efficiency drops quite rapidly with an increase in the number of processors. Avoiding assembly of the global stiffness matrix decreases the memory requirement as well as the processing time.

6. References
1. Bendsoe, M.P., Sigmund, O., "Topology Optimization: Theory, Methods and Applications", Springer-Verlag, 2003.
2. Sigmund, O., "A 99 line topology optimization code written in MATLAB", Struct. Multidisc. Optim., Vol. 21, pp. 120-127, 2001.
3. Rozvany, G.I.N., Olhoff, N., "Topology Optimization of Structures and Composite Continua", Kluwer Academic Publishers, 2000.
4. Bendsoe, M.P., Kikuchi, N., "Generating optimal topologies in structural design using a homogenization method", Comput. Methods Appl. Mech. Engrg., Vol. 71, pp. 197-224, 1988.
5. Bendsoe, M.P., "Optimal shape design as a material distribution problem", Structural Optimization, Vol. 1, pp. 193-202, 1989.
6. Zhou, M., Rozvany, G.I.N., "The COC algorithm, part II: Topological, geometrical and generalized shape optimization", Comput. Methods Appl. Mech. Engrg., Vol. 89, pp. 197-224, 1991.
7. Mlejnek, H.P., "Some aspects of the genesis of structures", Struct. Optim., Vol. 5, pp. 64-69, 1992.
8. DeRose Jr., G.C.A., Diaz, A.R., "Solving three-dimensional layout optimization problems using fixed-scale wavelets", Computational Mechanics, Vol. 25, pp. 274-285, 2000.
9. Maar, B., Schulz, V., "Interior point multigrid methods for topology optimization", Struct. Multidisc. Optim., Vol. 19, pp. 214-224, 2000.
10. Borrvall, T., Petersson, J., "Large-scale topology optimization in 3D using parallel computing", Comput. Methods Appl. Mech. Engrg., Vol. 190, pp. 6201-6229, 2001.
11. Papadrakakis, M., "Parallel Solution Methods in Computational Mechanics", John Wiley and Sons, 1997.
12. Topping, B.H.V., Khan, A.I., "Parallel Finite Element Computations", Saxe-Coburg, 1996.
13. Dongarra, J., Duff, I., Sorensen, D., Van Der Vorst, H., "Solving Linear Systems on Vector and Shared Memory Computers", SIAM, 1991.
14. Course website of Prof. Lyle N. Long, http://personal.psu.edu/lnl/424/
15. Pacheco, P.S., "Parallel Programming with MPI", Morgan Kaufmann, 1997.