Proceedings of the 5th International Conference on Computation of Shell and Spatial Structures
June 1-4, 2005, Salzburg, Austria
E. Ramm, W. A. Wall, K.-U. Bletzinger, M. Bischoff (eds.)
www.iassiacm2005.de

Efficiency Aspects for Advanced Fluid Finite Element Formulations

Malte Neumann*, Sunil R. Tiyyagura, Wolfgang A. Wall, Ekkehard Ramm
*Institute of Structural Mechanics, University of Stuttgart, Pfaffenwaldring 7, 70550 Stuttgart, Germany
[email protected]
Abstract

For the numerical simulation of large scale CFD and fluid-structure interaction (FSI) problems, efficiency and robustness of the algorithms are two key requirements. In this paper we describe a very simple concept to significantly increase the performance of the element calculation for an arbitrary unstructured finite element mesh on vector computers. By grouping computationally similar elements together, the length of the innermost loops and thus the vector length can be controlled. In addition, the effect of different programming languages and different array management techniques is investigated. A numerical CFD simulation shows the improvement in the overall time-to-solution on vector computers as well as on other architectures.
1 Introduction
For the numerical simulation of large scale CFD and FSI problems, computing time is still a limiting factor for the size and complexity of the problem. Besides the solution of the set of linear equations, the element evaluation and assembly for stabilized, highly complex elements on unstructured grids is often a main time-consuming part of the calculation. Whereas a lot of research is done in the area of solvers and their efficient implementation, there is hardly any literature on the efficient implementation of advanced finite element formulations. Yet a large amount of computing time can be saved by an expert implementation of the element routines. We would like to propose a straightforward concept to significantly improve the performance of the integration of element matrices of an arbitrary unstructured finite element mesh on vector computers.

Very often algorithms in scientific codes use only a small fraction of the available computer power [1]. It is therefore highly advisable to take a closer look at the efficiency of algorithms and improve them to make the best of the available computer power.

To evaluate the performance of a numerical method, several criteria are available. For computational scientists who attempt to solve a given problem, the most relevant is probably the time-to-solution. This criterion takes a number of different factors into account, for example the efficiency of the algorithm, the use of a particular hardware platform at a certain percentage of its peak speed, and also the effort needed to include additional capabilities in the numerical code. However, the multitude of quantities included in this benchmark makes it difficult to use for comparisons. A more universal performance benchmark is the raw computational speed, typically expressed in FLoating-point OPerations
per Second (FLOPS). Even though the significance of such an isolated performance figure is limited, it still gives an approximate measure of the capability of a given algorithm-architecture combination [3]. FLOPS is also the basis for evaluating the efficiency that an application or algorithm reaches on a given architecture: the efficiency is usually given as the ratio of the sustained FLOPS achieved by the application to the peak FLOPS of the architecture.
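Written as a formula (the symbol E is merely notation introduced here for clarity):

    E = FLOPS_sustained / FLOPS_peak

For example, a code sustaining 4 GFLOPS on a processor with a peak of 8 GFLOPS runs at an efficiency of E = 0.5, i.e. 50 percent of peak.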
2 Computational Efficiency
For the numerical simulation of large scale CFD and fluid-structure interaction (FSI) problems, computing time is still a limiting factor for the size and complexity of the problem. Waiting for more powerful computers will not solve this problem, as the demand for larger and more complex simulations usually grows as fast as the available computer power. It is rather highly advisable to use the full power that computers already offer today.

Especially on superscalar processors the gap between sustained and peak performance is growing for scientific applications; very often the sustained performance is below 5 percent of peak. On the other hand, the efficiency on vector computers is usually much higher: for vectorizable programs it is possible to achieve a sustained performance of 30 to 60 percent of peak, or above [1, 4].

Starting from a very low level of serial efficiency, e.g. on a superscalar computer, it is a reasonable assumption that the overall level of efficiency of the code will drop even further when run in parallel. Especially if one is to use only moderate numbers of processors, it is essential to use them as efficiently as possible. Therefore in this paper we only look at the serial efficiency as one key ingredient of a highly efficient parallel code [1].
3 Performance Optimization
To achieve a high efficiency on a specific system it is in general advantageous to write hardware-specific code, i.e. the code has to make use of system-specific features like vector registers or the cache hierarchy. As our main target architecture is a NEC SX-6 parallel vector computer, we address some aspects of vector optimization in this paper. But, as we will show later, this kind of performance optimization also has a positive effect on the performance of the code on other architectures.

3.1 Vector Processors

Vector processors like the NEC SX-6 processor use a very different architectural approach than conventional scalar processors. Vectorization exploits regularities in the computational structure to accelerate uniform operations on independent data sets. Vector arithmetic instructions involve identical operations on the elements of vector operands located in the vector registers. Many scientific codes, like FE programs, allow vectorization, since they are characterized by predictable fine-grain data parallelism [4].

The SX-6 processor contains an 8-way replicated vector pipe capable of issuing a MADD each cycle and 72 vector registers, each holding 256 64-bit words. For non-vectorizable instructions the SX-6 also contains a cache-based superscalar unit. Since the vector unit is significantly more powerful than this scalar processor, it is critical to achieve high vector operation ratios, either via compiler discovery or explicitly through code and data (re-)organization.

3.2 Vector Optimization

To achieve high performance on a vector architecture there are three main variants of vectorization tuning:
• compiler flags
• compiler directives
• code modifications
In most cases an optimal performance on a vector architecture can only be achieved with code that was especially designed for this kind of processor. Here the data management as well as the structure of the algorithms are important. But often it is also very effective for an existing code to concentrate the vectorization efforts on performance-critical parts and to use more or less extensive code modifications to achieve a better performance.
[Figure 1, old structure: loop all elements → loop Gauss points → shape functions, derivatives, etc. → loop nodes of element → loop nodes of element → calculate stiffness contributions → assemble element matrix.]
[Figure 1, new structure: group similar elements into sets → loop all sets → loop Gauss points → shape functions, derivatives, etc. → loop nodes of element → loop nodes of element → loop elements in set → calculate stiffness contributions → assemble all element matrices.]
Figure 1: Old and new structure of the algorithm to evaluate element matrices.
The reordering or fusion of loops to increase the vector length or the use of temporary variables to break data dependencies in loops can be simple measures to improve the vector performance. We would like to put forward a very simple concept, requiring only small changes to an existing FE code, that significantly improves the vector performance of the integration of element matrices of an arbitrary unstructured finite element mesh.
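As a generic illustration of the loop fusion and temporary-variable measures mentioned above (a minimal sketch in C; the arrays, sizes and function names are hypothetical and not taken from our code):

    /* Loop fusion: two separate loops over the same range would each be
       short; fusing them gives one vectorizable loop with more work per
       iteration at the same vector length.                               */
    enum { N = 256, M = 64 };   /* sizes chosen arbitrarily for the sketch */

    void fused_loops(const double w[N], const double f[N], const double g[N],
                     double a[N], double b[N])
    {
      for (int i = 0; i < N; ++i) {
        a[i] = w[i] * f[i];
        b[i] = w[i] * g[i];
      }
    }

    /* Temporary variable: accumulating into the scalar t instead of into r[j]
       removes the per-iteration store, so the compiler need not assume a
       dependency between r and the input arrays and can vectorize the
       inner loop.                                                          */
    void matvec(const double A[M][N], const double x[N], double r[M])
    {
      for (int j = 0; j < M; ++j) {
        double t = 0.0;
        for (int i = 0; i < N; ++i)
          t += A[j][i] * x[i];
        r[j] = t;
      }
    }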
4 Vectorization Concept for FE
The main idea of this concept is to group computationally similar elements into sets and then perform all calculations necessary to build the element matrices simultaneously for all elements in one set. Computationally similar in this context means that all elements in one set require exactly the same operations to integrate the element matrix, i.e. they have, for example, the same topology and the same number of nodes and integration points. The changes necessary to implement this concept are visualized in the structure charts in figure 1. Instead of looping over all elements and calculating the element matrices individually, all sets of elements are now processed. For every set the usual procedure to integrate the matrices is carried out, except that on the lowest level, i.e. as the innermost loop, a new loop over all elements in the current set is introduced. As some intermediate results now have to be stored for all elements in one set, the size of these sets is limited. The optimal size also depends strongly on the hardware architecture.
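A minimal sketch of the restructured element loop (the array layout, names and the set size are illustrative assumptions, not the actual data structures of our code); the essential point is that the loop over the elements of a set becomes the innermost loop and thus determines the vector length:

    #define MAXNOD  8     /* nodes per element, here for a hexahedral element  */
    #define SETSIZE 256   /* assumed set size, matching the SX-6 vector length */

    /* deriv[gp][node][element] holds an intermediate result (e.g. global shape
       function derivatives) for all elements of one set at each Gauss point;
       detJxW[gp][element] is the integration weight including the Jacobian.   */
    void stiffness_for_set(int nelem, int ngp,
                           const double deriv[][MAXNOD][SETSIZE],
                           const double detJxW[][SETSIZE],
                           double estif[MAXNOD][MAXNOD][SETSIZE])
    {
      for (int gp = 0; gp < ngp; ++gp)          /* loop Gauss points            */
        for (int i = 0; i < MAXNOD; ++i)        /* loop nodes of element        */
          for (int j = 0; j < MAXNOD; ++j)      /* loop nodes of element        */
            for (int e = 0; e < nelem; ++e)     /* innermost: elements in set   */
              /* strongly simplified stiffness contribution, standing in for
                 the actual stabilized-element terms                            */
              estif[i][j][e] += deriv[gp][i][e] * deriv[gp][j][e] * detJxW[gp][e];
    }

Since all elements in a set require exactly the same operations, every iteration of the innermost loop performs identical work on independent data, which is precisely the access pattern the vector pipes need; the element matrices of the set are assembled once the set has been processed.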
5 Further Influences on the Efficiency
It is well known that the programming language can have a large impact on the performance of a scientific code. Fortran is often considered the best choice for highly efficient code [5], whereas some features of modern programming languages, like pointers in C or objects in C++, make vectorization more complicated or even impossible [4]. Especially the very general pointer concept in C makes it difficult for the compiler to identify data-parallel loops, as different pointers might alias each other. There are a few remedies for this problem, like compiler flags or the restrict keyword. The latter is relatively new in the C standard and it seems that it is not yet fully implemented in every compiler.

We have implemented the proposed concept for the calculation of the element matrices in five different variants. The first four of them are implemented in C, the last one in Fortran. Further differences are the array management and the use of the restrict keyword. For a detailed description of the variants see table 1. Multi-dimensional arrays denote the use of 3- or 4-dimensional arrays to store intermediate results, whereas one-dimensional arrays imply manual indexing.
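To make the array-management distinction concrete, a small sketch of the two storage schemes for one intermediate result (array names and dimensions are again only illustrative); the C99 restrict qualifier asserts that the pointer arguments do not alias, which is exactly the information the compiler is otherwise missing:

    /* Variant with multi-dimensional arrays: deriv[node][dim][element]. */
    void scale_multi(int nelem, double deriv[8][3][256], const double fac[256])
    {
      for (int i = 0; i < 8; ++i)
        for (int d = 0; d < 3; ++d)
          for (int e = 0; e < nelem; ++e)
            deriv[i][d][e] *= fac[e];
    }

    /* Variant with one-dimensional arrays and manual indexing; restrict
       promises the compiler that deriv and fac do not overlap.           */
    void scale_one(int nelem, double *restrict deriv, const double *restrict fac)
    {
      for (int i = 0; i < 8; ++i)
        for (int d = 0; d < 3; ++d)
          for (int e = 0; e < nelem; ++e)
            deriv[(i * 3 + d) * 256 + e] *= fac[e];
    }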
           language   array dimensions   restrict keyword   SX-6¹    Itanium2²   Pentium4³
orig       C          multi              –                  1.000    1.000       1.000
var1       C          multi              –                  0.024    1.495       2.289
var2       C          multi              yes                0.024    1.236       1.606
var3       C          one                –                  0.016    0.742       1.272
var4       C          one                yes                0.013    0.207       1.563
var5       Fortran    multi              –                  0.011    0.105       0.523
Table 1: Influences on the performance. Properties of the five different variants and their relative time for the calculation of stiffness contributions.

The results in table 1 give the CPU time spent for the calculation of some representative element matrix contributions, normalized by the time of the original code. The positive effect of the grouping of elements can clearly be seen for the vector processor: the calculation time is reduced to less than 3 percent for all variants. On the other two processors the grouping of elements does not result in a better performance in all cases. The Itanium architecture shows an improved performance only for one-dimensional array management and for the variant implemented in Fortran, and the Pentium processor in general performs worse with the new structure of the code; only for the last variant is the calculation time cut in half. It can also be clearly seen that the effect of the restrict keyword varies between the different compilers/processors and between one-dimensional and multi-dimensional arrays. Using restrict on the SX-6 results in only small improvements for one-dimensional arrays, while on the Itanium architecture the speed-up for this array management is considerable. In contrast, on the Pentium architecture the restrict keyword has a positive effect on the performance of multi-dimensional arrays and a negative effect for one-dimensional ones. The most important result of this analysis is the superior performance of Fortran: the last variant is the fastest on all platforms. This is the reason why we favor Fortran for performance-critical scientific code and use the last variant for the following examples.
6 Results
To conclude, we would like to demonstrate the positive effect of the proposed concept for the calculation of element matrices on a full CFD simulation. The flow is the Beltrami flow (for details see [6]) and the unit cube was discretized by 32768 stabilized 8-noded hexahedral elements [2].
[Figure 2: bar chart of the calculation time in seconds (0 to 20000) for the original code and for variant 5, each bar split into element calculation, solver and other.]
Figure 2: Split-up of total calculation time for 32 time steps of the Beltrami flow on the SX-6.

            element calc.         stiffness contr.
            original   var5       original   var5
SX-6        0.95       29.55      0.83       71.07
Itanium2    8.68       35.01      6.59       59.71
Pentium4    12.52      20.16      10.31      23.98

Table 2: Efficiency of original and new code in percent of peak performance.

¹ NEC SX-6, 565 MHz; NEC C++/SX Compiler, Version 1.0 Rev. 063; NEC FORTRAN/SX Compiler, Version 2.0 Rev. 305.
² Hewlett Packard Itanium2, 1.3 GHz; HP aC++/ANSI C Compiler, Rev. C.05.50; HP F90 Compiler, v2.7.
³ Intel Pentium4, 2.6 GHz; Intel C++ Compiler, Version 8.0; Intel Fortran Compiler, Version 8.0.
In figure 2 the total calculation time for 32 time steps of this example and the fractions for the element calculation and the solver on the SX-6 are given for the original code and the full implementation of variant 5. The time spent for the element calculation, formerly the major part of the total time, could be reduced by a factor of 24. This considerable improvement can also be seen in the sustained performance given in table 2 as a percentage of peak performance. The original code, not written for any specific architecture, shows only a poor performance on the SX-6 and a moderate one on the other platforms. The new code, designed for a vector processor, achieves an acceptable efficiency of around 30 percent for the complete element calculation and, for several subroutines like the calculation of some stiffness contributions, even a superior efficiency of above 70 percent. It has to be noted that these high performance values come along with a vector length of almost 256 and a vector operation ratio of above 99.5 percent. But also for the Itanium2 and Pentium4 processors, which were not the main target architectures, the performance was improved significantly, and for the Itanium2 the new code reaches about the same efficiency as on the vector architecture.
References

[1] Behr, M., Pressel, D.M., Sturek, W.B.: Comments on CFD Code Performance on Scalable Architectures. Computer Methods in Applied Mechanics and Engineering 2000; 190:263–277.
[2] Wall, W.A.: Fluid-Struktur-Interaktion mit stabilisierten Finiten Elementen. PhD thesis, Institut für Baustatik, Universität Stuttgart, 1999.
[3] Tezduyar, T., Aliabadi, S., Behr, M., Johnson, A., Kalro, V., Litke, M.: Flow Simulation and High Performance Computing. Computational Mechanics 1996; 18:397–412.
[4] Oliker, L., Canning, A., Carter, J., Shalf, J., Skinner, D., Ethier, S., Biswas, R., Djomehri, J., van der Wijngaart, R.: Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations. In: Proceedings of the ACM/IEEE Supercomputing Conference 2003, Phoenix, Arizona, USA, 2003.
[5] Pohl, T., Deserno, F., Thürey, N., Rüde, U., Lammers, P., Wellein, G., Zeiser, T.: Performance Evaluation of Parallel Large-scale Lattice Boltzmann Applications on Three Supercomputing Architectures. In: Proceedings of the ACM/IEEE Supercomputing Conference 2004, Pittsburgh, USA, 2004.
[6] Ethier, C.R., Steinman, D.A.: Exact Fully 3D Navier-Stokes Solution for Benchmarking. International Journal for Numerical Methods in Fluids 1994; 19:369–375.