Applications of Boundary Element Methods on the Intel Paragon

David E. Womble, David S. Greenberg, Stephen R. Wheat, Robert E. Benner
Sandia National Laboratories, Albuquerque, NM 87185-1110

Marc S. Ingber
University of New Mexico, Albuquerque, NM 87131

Greg Henry, Satya Gupta
Intel Supercomputer Systems Division, Beaverton, OR 97006

Abstract. This paper describes three applications of the boundary element method and their implementations on the Intel Paragon supercomputer. Each of these applications sustains over 99 Gflops/s based on wall-clock time for the entire application and an actual count of flops executed; one application sustains over 140 Gflops/s! Each application accepts the description of an arbitrary geometry and computes the solution to a problem of commercial and research interest. The common kernel for these applications is a dense equation solver based on LU factorization. It is generally accepted that good performance can be achieved by dense matrix algorithms, but achieving the excellent performance demonstrated here required the development of a variety of special techniques to take full advantage of the power of the Intel Paragon.

1 Introduction

Boundary element methods (BEMs) are commonly used to solve systems of partial differential equations which arise in engineering and science. This paper describes three applications of the boundary element method: SNAP is used in structural mechanics, VERBITRON is used in acoustics, and CARLOS-3D™ and Xpatch™ are used in electromagnetics. Each application program is capable of solving problems with arbitrary geometries and with several types of boundary elements (e.g., triangular and quadrilateral elements). Each also includes high-order approximations to the problem geometry and adaptive quadratures for numerical accuracy.

BEMs produce a system of equations by placing a grid on the boundary of the domain. For three-dimensional problems the resulting linear system is usually dense and has O(s^2) unknowns, where s is the number of elements required for a given resolution. This contrasts with typical finite element methods, which result in a sparse linear system with O(s^3) unknowns. The preferred method for solving these linear systems is LU factorization with triangular solves. Other methods exist, but are limited in the classes of problems to which they can be applied. For example, iterative methods work only when the matrix can be "sparsified," and Cholesky decomposition can be used only for Hermitian matrices.

Achieving high performance for all of these applications required taking advantage of many novel features of the Intel Paragon. For example, the Paragon at Sandia National Laboratories has two i860 processors in each compute node.
This work was supported in part by the United States Department of Energy under Contract DE-AC04-94AL85000. ISSN 1063-9535. Copyright (c) 1994 IEEE. All rights reserved. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE. For information on obtaining permission, send a blank email message to [email protected]. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.

Using SUNMOS, an operating system developed at Sandia and the University of New Mexico, we developed techniques to use both i860s for computation. Future massively parallel computers will probably have multiple-CPU nodes, so our experience with using two compute processors per node should help applications take advantage of architectures to come. The use of SUNMOS also allowed us to experiment with non-blocking collective communications and with hardware-based broadcasts.

The net result of our research has been to create applications which run at speeds considered impossible by some observers of the machine. In fact, we can exceed the manufacturer's original peak performance numbers for the machine. Even more importantly, our increased speeds have allowed us to consider applications requiring multiple runs for statistical accuracy on problem sizes heretofore considered unrealistic.
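To give a sense of scale for the dense systems and LU solves described above, the following back-of-envelope sketch (our own illustrative code, not part of the applications) uses the problem sizes and node count of the runs reported later, together with the standard LU operation counts of roughly 2n^3/3 real flops for a real system and 8n^3/3 for a complex one, ignoring pivoting and the triangular solves.

    /* Back-of-envelope sizing for dense BEM systems. The node count and
       problem sizes are taken from the runs described later in this paper;
       the formulas are the usual LU flop estimates (lower-order terms ignored). */
    #include <stdio.h>

    int main(void) {
        double n_real = 56640.0;   /* real double precision system (Section 2) */
        double n_cplx = 38080.0;   /* complex double precision system (Section 3) */
        double nodes  = 1888.0;    /* compute nodes used for the Section 2 run */

        double mem_real = n_real * n_real * 8.0;    /* 8 bytes per real double */
        double mem_cplx = n_cplx * n_cplx * 16.0;   /* 16 bytes per complex double */
        double flops_real = 2.0 / 3.0 * n_real * n_real * n_real;  /* real LU */
        double flops_cplx = 8.0 / 3.0 * n_cplx * n_cplx * n_cplx;  /* complex LU */

        printf("real matrix: %.1f Gbytes total, %.1f Mbytes/node, ~%.0f Gflops to factor\n",
               mem_real / 1e9, mem_real / nodes / 1e6, flops_real / 1e9);
        printf("complex matrix: %.1f Gbytes total, ~%.0f Gflops to factor\n",
               mem_cplx / 1e9, flops_cplx / 1e9);
        return 0;
    }

The totals this prints land within a fraction of a percent of the counted flops reported later in Tables 1 and 2, and the per-node matrix footprint of roughly 14 Mbytes helps explain why the extra usable memory available under SUNMOS (Section 5) matters.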

Timings and flop rates

The run times presented for each application are wall-clock times for the entire code, including I/O, pre-processing, matrix assembly, linear system solution, and post-processing. The algorithms used are those that solve the problem in the shortest amount of time, and the floating point operations counted in the performance figures are only those that are necessary and that are actually performed. One flop is counted for each double precision addition, subtraction, multiplication, or division. (The Xpatch™ code uses single precision in some cases when double precision is not necessary for numerical accuracy.) Trigonometric function evaluations and duplicate computations are not counted.
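A minimal sketch of this accounting convention (illustrative only; the counter totals below are made-up placeholders, not measured values):

    /* One flop per double precision add, subtract, multiply, or divide;
       the total is divided by wall-clock seconds for the entire run. */
    #include <stdio.h>

    typedef struct { double adds, muls, divs; } flop_count;

    static double gflops_per_s(const flop_count *c, double wall_seconds) {
        /* trigonometric evaluations and duplicated work are deliberately excluded */
        return (c->adds + c->muls + c->divs) / wall_seconds / 1.0e9;
    }

    int main(void) {
        flop_count c = { 60.0e12, 61.0e12, 0.1e12 };  /* placeholder totals */
        double wall = 1057.0;                         /* seconds for the whole run */
        printf("sustained rate: %.1f Gflops/s\n", gflops_per_s(&c, wall));
        return 0;
    }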

2 Structural mechanics

Composite materials are becoming increasingly common, and determining the strength of these materials (and of parts manufactured from them) is an important part of materials research and design. In the simulation presented here, we determine the composite modulus of a short-fiber-reinforced composite rod.

Massively parallel computers are necessary for this application for several reasons. First, the placement of the fibers in the rod is essentially random, so a statistical analysis based on multiple realizations (simulations) of the physical system is necessary to determine any bulk property. Second, a simulation with a small number of large fibers is fundamentally different from a simulation of a large number of small fibers. Hence, a simulation must involve hundreds or thousands of fibers, each with a sufficient number of elements to resolve the aspect ratio and stresses on each fiber.

Our boundary element code, SNAP, has been developed over the past six years, and the Paragon version was developed recently. SNAP is used to determine stresses and displacements in an elastic medium of arbitrary geometry described by a list of shapes, vertices, and element connectivities. (The shapes are provided in a "template" library.) The element library consists of super-parametric triangular and quadrilateral elements. The higher-order geometric elements provide greater numerical reliability for domains containing curved surfaces, especially for incompressible materials.

The pre-processing phase consists of one processor computing a range of element properties, such as area, and broadcasting this information to the remaining processors. The matrix assembly involves a quadrature for each element of the matrix. To obtain the required numerical accuracy, the quadratures use between four and 144 Gauss points, depending on geometric considerations; elements that interact strongly with each other require more points than those with weak interactions. Each quadrature requires between 1,000 and 30,000 floating point operations, depending on the type of elements involved and the number of Gauss points. Trigonometric function and square root evaluations have not been included in the operation counts. The resulting dense matrix contains double precision, real numbers. The post-processing phase consists of the actual calculation of the composite modulus based on the solution of the linear system. This requires relatively few flops, but several global communications.

We used SNAP to simulate a random distribution of 944 short-fiber rods in a cylinder of length 30 cm and diameter 10 cm with a fiber volume fraction of 10%. There are 56,640 degrees of freedom in the system. The run times and computational rates for this simulation on 1,888 nodes of the Intel Paragon at Sandia National Laboratories are given in Table 1. In all tables, operation counts less than one Gflop are reported as zero Gflops. We note that 3.5% of the total time was spent in the matrix assembly. Although the number of flops performed in the matrix assembly is small (compared to the linear system solve), a fast linear system solve alone would not result in good overall performance.

Table 1: 115 Gflops/s structure simulation

    operation                 time (s)     Gflops    Gflops/s
    I/O and pre-processing        4.31          0         0
    matrix assembly              36.8         747        20.3
    linear system solve        1016        121,144      119
    post-processing               0.1           0         0
    total                      1057        121,891      115
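The adaptive choice of Gauss points during matrix assembly, described above for SNAP, can be sketched as follows. This is a hypothetical stand-in, not SNAP's element library: the selection heuristic, the flat quadrilateral patch, and the Laplace-type kernel are all assumptions chosen only to illustrate the 4-to-144-point range.

    #include <math.h>
    #include <stdio.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define MAX_ORDER 12   /* 12 x 12 = 144 points, the upper end of the range above */

    /* n-point Gauss-Legendre rule on [-1,1] via Newton iteration on P_n */
    static void gauss_legendre(int n, double x[], double w[]) {
        for (int i = 0; i < n; i++) {
            double t = cos(M_PI * (i + 0.75) / (n + 0.5));  /* initial guess */
            double dp = 1.0;
            for (int it = 0; it < 100; it++) {
                double p0 = 1.0, p1 = t;
                for (int k = 2; k <= n; k++) {              /* Legendre recurrence */
                    double pk = ((2.0 * k - 1.0) * t * p1 - (k - 1.0) * p0) / k;
                    p0 = p1; p1 = pk;
                }
                dp = n * (t * p1 - p0) / (t * t - 1.0);     /* P_n'(t) */
                double dt = p1 / dp;
                t -= dt;
                if (fabs(dt) < 1e-14) break;
            }
            x[i] = t;
            w[i] = 2.0 / ((1.0 - t * t) * dp * dp);
        }
    }

    /* Hypothetical heuristic: element pairs that are close relative to the
       element size (strong interaction) get a higher quadrature order. */
    static int choose_order(double separation, double elem_size) {
        double ratio = separation / elem_size;
        if (ratio < 1.0) return 12;   /* near-singular: 144 points */
        if (ratio < 2.0) return 8;
        if (ratio < 4.0) return 4;
        return 2;                     /* weak interaction: 4 points */
    }

    /* Tensor-product integral of a kernel over a flat quad patch of half-width h
       centered at (cx, cy, 0); a stand-in for a BEM influence-coefficient quadrature. */
    static double integrate_patch(double cx, double cy, double h,
                                  double (*kernel)(double, double, double), int n) {
        double x[MAX_ORDER], w[MAX_ORDER], sum = 0.0;
        gauss_legendre(n, x, w);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                sum += w[i] * w[j] * h * h * kernel(cx + h * x[i], cy + h * x[j], 0.0);
        return sum;
    }

    /* Example kernel: 1/(4*pi*r) measured from a fixed observation point. */
    static double laplace_kernel(double x, double y, double z) {
        double ox = 3.0, oy = 0.0, oz = 0.5;
        double r = sqrt((x-ox)*(x-ox) + (y-oy)*(y-oy) + (z-oz)*(z-oz));
        return 1.0 / (4.0 * M_PI * r);
    }

    int main(void) {
        double h = 0.5;              /* element half-width */
        double separation = 3.0;     /* distance to the observation element */
        int n = choose_order(separation, 2.0 * h);
        printf("using %d x %d = %d Gauss points\n", n, n, n * n);
        printf("influence coefficient ~ %g\n",
               integrate_patch(0.0, 0.0, h, laplace_kernel, n));
        return 0;
    }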

3 Acoustics

The analysis of the acoustic radiation and scattering of objects is a problem with many commercial and military uses, e.g., the design of quieter automobile passenger compartments, the design of more efficient stereo speakers, and the detection of submerged vehicles. The acoustics modeling code used here is VERBITRON, a general-purpose acoustic radiation and scattering code that was developed at Sandia to study acoustic shape optimization problems.

VERBITRON is based on an integral representation for the acoustic potential in an exterior domain. A problem associated with the integral operator is that its inverse is unbounded at frequencies corresponding to the eigenvalues of the adjoint interior problem. The associated eigenmodes are fictitious, since it can be shown that the exterior problem is well posed at all frequencies. For acoustic shape optimization of elongated bodies, the interior eigenvalues become clustered at larger wave numbers, and many commonly used algorithms for the removal of the fictitious eigenmodes fail. VERBITRON implements a modified Burton-Miller algorithm, which takes a linear combination of the Helmholtz integral equation and the derivative Helmholtz integral equation to reliably remove the fictitious eigenmodes.
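As a schematic of the Burton-Miller combination described above (a textbook form; sign and scaling conventions vary between authors and are not necessarily those used in VERBITRON), the surface Helmholtz integral equation, written abstractly as $\mathcal{C}[\phi](x) = 0$, is combined with its normal-derivative counterpart:

    \[
      \mathcal{C}[\phi](x) \;+\; \alpha\,\frac{\partial}{\partial n_x}\,\mathcal{C}[\phi](x) \;=\; 0,
      \qquad \alpha \;\propto\; \frac{i}{k},
    \]

where $k$ is the wave number. Choosing a coupling parameter $\alpha$ with a nonzero imaginary part is what guarantees a unique solution at all frequencies and thereby removes the fictitious interior eigenmodes.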

There are compelling reasons for a massively parallel implementation of VERBITRON. First, coarse grids do not provide the spatial resolution needed for short wavelengths (high frequencies). Second, there is a substantial increase in computational requirements due to the removal of fictitious eigenmodes at high frequencies. Third, for problems of shape optimization, active or passive noise control, and other types of acoustic design, a solution typically depends on several parameters, and greater computational power and efficiency are required to perform an adequate parameter study.

The element library for VERBITRON contains isoparametric triangular quadratic elements and isoparametric quadrilateral biquadratic elements. Double (multiple) nodes are placed at geometric edges and corners to resolve the ambiguity in the normal direction at those locations. Also, an adaptive quadrature scheme is employed to ensure the accuracy of the integrations, and all calculations are performed in double precision.

For our test problem, we modeled the acoustic scattering from a localized collection of objects. The resulting non-Hermitian, dense, complex linear system has 38,080 unknowns. The times to solve this problem are shown in Table 2.

Table 2: 143 Gflops/s acoustic potential simulation

    operation                 time (s)     Gflops    Gflops/s
    I/O and pre-processing       17.8           0         0
    matrix assembly             114          1,589       14
    linear system solve         912        147,263      161
    post-processing               0.32          0         0
    total                     1,044        148,852      143

4 Electromagnetics

The modeling of electromagnetic radiation, propagation, and scattering is an important aspect of the design of many products, including antennas and aircraft. Different algorithms are required for high- and low-frequency calculations. For low frequencies, we use CARLOS-3D™, a general-purpose code based on the method of moments. This code calculates the scattering from arbitrary three-dimensional objects with multiple homogeneous dielectric regions. CARLOS-3D™ is a proprietary code developed at McDonnell Douglas Aerospace under the sponsorship of the National Electromagnetics Code Consortium and is in commercial use at over forty-five sites. For high-frequency calculations, we use Xpatch™, a general-purpose electromagnetics code based on z-buffer and ray-tracing methods. Xpatch™ is a proprietary code developed at DEMACO, Inc. under the sponsorship of the U.S. Air Force and is in commercial use at over 200 sites.

4.1 CARLOS-3D™ applied to the VFY218 fighter plane

The specific simulation using CARLOS-3D™ run for this paper was the scattering from a VFY218, a conceptual V/STOL fighter aircraft. In this problem there are 38,922 triangular elements and 19,840 nodes. The resulting non-Hermitian, double precision, complex linear system has 58,383 unknowns.

Because there is one plane of symmetry, the problem can be solved as two systems, one of size 29,381 and the other of size 29,002. The run times for this simulation on 1,904 nodes of the Intel Paragon at Sandia National Laboratories are given in Table 3.

The matrix assembly operation is particularly difficult to code efficiently for vector and superscalar architectures because of the potentially large number of special cases that must be handled and the number of options that must be included. For example, quadratures are adjustable (self terms, near terms, and far terms must be evaluated differently); geometric symmetry is used when possible to reduce the computations; and conducting, dielectric, and treated surfaces must be handled differently.

The output from the code is a vector of currents on the surface of the described geometry. Post-processing consists of using these currents to determine the scattered field and a radar cross section. This typically involves a quadrature for each point at which the field is to be determined; because these points are in the far field, the radar cross section can be calculated as a function of the angle from the object of interest.

Table 3: 99.4 Gflops/s CARLOS-3D™ simulation

    operation                 time (s)     Gflops    Gflops/s
    I/O and pre-processing       34.9           0         0
    matrix assembly             340          2,556        7.52
    linear system solve         965        132,685      137.5
    post-processing              20.1           0         0
    total                      1361        135,241       99.4
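As a rough consistency check on Table 3 (our arithmetic, not taken from the paper): treating the LU factorization of an n-unknown complex system as approximately $\tfrac{8}{3}n^3$ real flops and ignoring lower-order terms,

    \[
      \tfrac{8}{3}\bigl(29{,}381^{3} + 29{,}002^{3}\bigr) \approx 1.33\times 10^{14}\ \text{flops},
      \qquad
      \tfrac{8}{3}\,58{,}383^{3} \approx 5.31\times 10^{14}\ \text{flops},
    \]

so the 132,685 Gflops counted for the solve is consistent with the two symmetric half-problems, and exploiting the plane of symmetry saves roughly a factor of four over factoring the full 58,383-unknown system directly.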

4.2 Xpatch™ applied to the V1old fighter plane

Xpatch™ was applied to a well-established fighter aircraft model, V1old, consisting of 7,994 triangular elements and 4,297 nodes. Calculations for Xpatch™ were done in mixed precision (inner kernels of the ray tracer use double precision, while other sections require only single precision). The run times for this simulation on 1,904 nodes of the Intel Paragon at Sandia National Laboratories are given in Table 4.

Table 4: 116 Gflops/s Xpatch™ simulation

    operation                 time (s)     Gflops    Gflops/s (mixed prec.)
    I/O and pre-processing        5.4           0         0
    Z-buffer calculations       687        123,051      179
    ray tracing                 375         11,711       31.3
    post-processing              91.4           0         0
    total                      1158        134,762      116.3
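The mixed-precision arrangement described in Section 4.2, with double precision confined to an inner kernel and single precision used for bulk data, follows a familiar pattern; a minimal illustrative example (not Xpatch code; the dot product is just a stand-in) is:

    #include <stdio.h>

    /* Bulk data stored in single precision; the inner accumulation is done
       in double precision, mirroring the mixed-precision idea above. */
    static double dot_mixed(const float *x, const float *y, int n) {
        double acc = 0.0;
        for (int i = 0; i < n; i++)
            acc += (double)x[i] * (double)y[i];
        return acc;
    }

    int main(void) {
        enum { N = 4 };
        float x[N] = {1.0f, 2.0f, 3.0f, 4.0f};
        float y[N] = {4.0f, 3.0f, 2.0f, 1.0f};
        printf("dot = %g\n", dot_mixed(x, y, N));  /* prints 20 */
        return 0;
    }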

5 Optimizing the linear system solver and the SUNMOS OS

The computational kernel for boundary element methods is the solution of a dense linear system of equations. The most commonly used method for solving these equations is LU factorization with column pivoting, and it is generally accepted that good parallel performance can be achieved using this algorithm. In this section we describe the techniques, beyond standard LU implementations, which we used to increase the performance beyond the expected levels.

Some of our improvements were made possible by the use of SUNMOS, an operating system designed by Sandia and the University of New Mexico. SUNMOS replaces OSF/1 on the compute nodes of the Paragon. One advantage of using SUNMOS is that applications have at least 7 Mbytes more usable memory under SUNMOS than under OSF/1, which enables us to factor matrices with more than 57,000 rows and columns entirely in memory.

A second advantage of SUNMOS is that message passing is both faster and more flexible. This has allowed us to overlap communication with computation more effectively. For example, we have implemented broadcast routines which are non-blocking, i.e., a process can begin the broadcast and then continue other work while waiting for the broadcast to complete. One major advantage of non-blocking communications is that they can help avoid unnecessary overhead due to system buffering and copying of messages.

A third advantage of SUNMOS is that it allowed us to use hardware which is not available to application programs under OSF/1. Each node of an Intel Paragon contains two i860 processors. Intel refers to one as the main computation processor and to the other as the communications co-processor. Under SUNMOS, we were able to use the communications co-processor to perform computations.

The Intel Paragon contains a very high speed communications network which connects the processor nodes into a mesh. At Sandia the mesh has dimensions 16 x 119. The mesh topology (and in particular the long aspect ratio of Sandia's machine) presents a challenge to communication routine writers. We use both tree-based and ring-based broadcasts to make the best use of the network, but broadcasts can still create substantial overhead. SUNMOS has also allowed us to experiment with hardware broadcasts.
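The SUNMOS broadcast interface is not spelled out in this paper, so as a rough modern analogue the overlap pattern can be sketched with MPI-3's nonblocking MPI_Ibcast (an assumption made purely for illustration; the actual codes used SUNMOS, not MPI):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 20)

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *panel = malloc(N * sizeof(double));   /* e.g., the current pivot panel */
        if (rank == 0)
            for (int i = 0; i < N; i++) panel[i] = (double)i;

        /* Start the broadcast, then keep computing on data already in hand. */
        MPI_Request req;
        MPI_Ibcast(panel, N, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

        double local = 0.0;
        for (int i = 0; i < N; i++)     /* stand-in for trailing-matrix updates */
            local += (double)i * 1e-6;

        /* Only block when the broadcast data is actually needed. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        printf("rank %d: local work %g, panel[1] = %g\n", rank, local, panel[1]);
        free(panel);
        MPI_Finalize();
        return 0;
    }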

The additional features provided by SUNMOS led to several new algorithmic issues. The processor-memory bus bandwidth on each node is limited to 400 Mbytes/s. The bus thus becomes a bottleneck for level 1 BLAS routines with just one processor doing computation (each DAXPY update moves roughly 24 bytes of operands and results for every two flops). We have overcome the bus bottleneck by converting to level 3 BLAS routines and modifying them to make better use of the i860 caches. In particular, we developed cache-based versions of DGEMM and ZGEMM (double precision real and complex matrix-matrix multiplies) that avoid the use of bus-saturating "pipelined loads." The new routines use cache-based loads and arbitrate the bus to avoid thrashing. Level 1 BLAS routines such as DAXPY run at about 15 Mflops/s/node, and the original level 3 BLAS routines such as DGEMM run at approximately 46 Mflops/s/node, while our cache-based, dual-processor versions of DGEMM run at approximately 80 Mflops/s/node.

With the faster performance of the computational kernels, the penalty for synchronizations and broadcasts increases. We therefore reordered the computations within the LU factorization so that pivot searches and broadcasts are performed in advance of when the information is needed, and thus in the background of other computations. Column pivoting is still performed.

Finally, in our complex matrix-matrix multiplications (ZGEMM), a technique due to Winograd is used to change the ratio of additions to multiplications. By doing more additions and fewer multiplications, the separate i860 pipelines for addition and multiplication can be used more efficiently. Specifically, the new ZGEMM uses 3 multiplications and 5 additions for each complex multiply-add pair (instead of 4 multiplications and 4 additions). We note that the number of floating point operations does not change; however, because the i860 can perform two additions for each multiplication, the potential computational rate for the ZGEMM increases from 50 to 66 Mflops/s/processor. When both processors are used, we have achieved up to 115 Mflops/s/node.

Using the techniques described in this section, our code runs the LINPACK benchmark at 127.7 Gflops/s on the 1,904-node Paragon at Sandia. The LU factorization and solution of a complex linear system ran at 173.8 Gflops/s, which exceeds the original performance specification of 142.8 Gflops/s.
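The 3M rebalancing described above for ZGEMM can be written out for a single multiply-add as follows. This is an illustrative scalar version (not the Paragon kernel), and it assumes the operand sums are formed once per block and reused, which is what brings the per-pair cost down to 3 multiplications and 5 additions:

    #include <stdio.h>

    typedef struct { double re, im; } zval;

    /* acc += a*b given precomputed sums sa = a.re + a.im and sb = b.re + b.im:
       3 real multiplications and 5 real additions/subtractions. */
    static void zmadd_3m(zval *acc, zval a, zval b, double sa, double sb) {
        double m1 = a.re * b.re;     /* mult 1 */
        double m2 = a.im * b.im;     /* mult 2 */
        double m3 = sa * sb;         /* mult 3 */
        acc->re += m1 - m2;          /* 2 additions */
        acc->im += m3 - m1 - m2;     /* 3 additions */
    }

    int main(void) {
        zval a = {1.0, 2.0}, b = {3.0, -4.0}, acc = {0.0, 0.0};
        zmadd_3m(&acc, a, b, a.re + a.im, b.re + b.im);
        printf("acc = %g + %gi\n", acc.re, acc.im);  /* (1+2i)(3-4i) = 11 + 2i */
        return 0;
    }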

6 Conclusions

We have presented three applications of boundary element methods for commercially important problems on the 1,904-node Intel Paragon at Sandia National Laboratories. All three codes achieve over 99 Gflops/s, and one (VERBITRON) achieves over 140 Gflops/s. These computational speeds are based on wall-clock time and actual flop counts. We achieved outstanding speed in our LU factorization routine, and most of the actual computation is done there; however, because I/O and pre-processing are serial bottlenecks, and because it is difficult to achieve good performance in the matrix assembly, we focused our efforts on all phases of the applications.

Throughout this work, our goal has been to solve important problems in the least amount of time using numerically sound algorithms. We expect the applications described here to be used in production runs on the Paragon. Other applications based on boundary element methods, including a fluid dynamics code, are under development.

Acknowledgments

The authors would like to thank the many people who have helped make this project succeed. Bruce Hendrickson contributed to the early development of the LU kernel. Rolf Riesen and the rest of the SUNMOS team provided the critical operating system support necessary to our effort. Bruce and Rolf, together with Rob Leland, provided advice and support throughout the project. Art Hale of Sandia and Ted Barragy, Natalie Bates, Mike Proicou, and Mack Stallcup of Intel helped ensure that the Sandia Paragon was up and running with the latest hardware. Brent Leback of CETech assisted with the new BLAS routines. John Putnam (McDonnell-Douglas) was the original author of CARLOS-3D™. Joseph Kotulski, John VanDyke, Ray Zazworsky, and John Putnam helped port CARLOS-3D™ to the Paragon and incorporate the LU solver. Major Dennis Andersh of Wright Laboratories sponsored the parallelization of Xpatch™, and Dr. Shung-Wu Lee of DEMACO, Inc. developed the serial version of Xpatch™.
