Noname manuscript No. (will be inserted by the editor)
Parallelization Strategies for Computational Fluid Dynamics Software: State of the Art Review

Asif Afzal · Zahid Ansari · Ahmed Rimaz Faizabadi · Ramis M. K.
Received: date / Accepted: date
Abstract Computational fluid dynamics (CFD) is one of the fastest emerging fields of fluid mechanics, used to analyze fluid flow situations. The analysis is based on simulations carried out on computing machines. For complex configurations, the number of grid points is so large that the computational time required to obtain results becomes very high. Parallel computing is therefore adopted to reduce the computational time of CFD by utilizing the available computing resources. Parallel computing tools like OpenMP, MPI, CUDA, combinations of these and a few others are used to parallelize CFD software. This article provides a comprehensive state of the art review of important CFD areas and parallelization strategies for the related software. Issues related to the computational time complexity and parallelization of CFD software are highlighted. Benefits and issues of using various parallel computing tools for parallelization of CFD software are briefed. Open areas of CFD where parallelization has not been attempted much are identified, and parallel computing tools which can be useful for parallelization of CFD software are spotlighted. A few suggestions for future work in parallel computing of CFD software are also provided.

Keywords CFD Code Parallelization · Parallel Computing · CFD · OpenMP · MPI · CUDA

Asif Afzal
Department of Mechanical Engineering, P. A. College of Engineering, Mangaluru, India
E-mail: asif [email protected]

Zahid Ansari
Professor, Department of Computer Science Engineering, P. A. College of Engineering, Mangaluru, India
Tel.: +917899267361
E-mail: zahid [email protected]

Ahmed Rimaz Faizabadi
Department of Computer Science Engineering, P. A. College of Engineering, Mangaluru, India

Ramis M. K.
Department of Mechanical Engineering, P. A. College of Engineering, Mangaluru, India
1 Introduction Computational Fluid Dynamics (CFD) is used to analyze problems involving fluid flow, employing numerical solution of the governing equation. Presently, CFD is used to simulate fluid flow situations for problems ranging from molecular level to global level [73]. CFD has advanced from the analysis of flow over two dimensional configurations to three dimensional configurations. CFD is extensively used as a design and optimizing tool in industries. With the use of CFD, multiphase flow modeling has become easy, adopting huge fine mesh resolution [101]. Generally, the CFD simulation is performed by running the codes/software on modern day computing machines [87]. In CFD, as the number of grid points increases, the computational time required to obtain the necessary simulation also increases. For few complex industrial configurations, the grid points are so massive that, the computational time required to get the necessary simulation is unbearable [87, 57]. To reduce the computational time and fully utilize the available computing resources, parallel computing has to be adopted. Parallel computing is simultaneous execution of computational tasks on multi-processor architectures [16]. Various parallel computing paradigms have been employed for parallelization of the CFD software. From the literature survey [4, 32, 15, 10], it is found that the most
widely used paradigms for parallelization of CFD software are: i) OpenMP, ii) MPI, iii) CUDA and iv) hybrid OpenMP+MPI. In this article we present a literature review of the parallel computing efforts made to parallelize CFD codes belonging to various CFD areas using different parallel computing tools. To the best of the authors' knowledge, this is the first review article on the efforts made to parallelize CFD codes using different parallel computing tools. As the computational time of CFD codes is very large because of their large computational tasks, these CFD codes have to be parallelized to reduce the time consumption. Various researchers have tried many parallel computing tools to achieve parallelization of CFD codes, but effort was required to gather the information regarding the various parallel computing tools and the related work done in CFD. We believe that this survey article will help researchers to identify a suitable parallel computing tool and an appropriate parallelization strategy corresponding to their specific CFD domain with much reduced effort and search time. This article provides the state of the art of CFD code parallelization efforts and what can be achieved further. The issues related to the computational time of CFD software, parallel computing in CFD and its related tools are identified. A few areas of CFD where parallelization has not been attempted, and the parallel computing tools not yet employed in CFD, are also highlighted. The remainder of the article is organised as follows. In Section 2, we present a literature review of CFD, applications of CFD in various areas/fields, and identify the issues related to the computational time of CFD codes. Section 3 provides a brief discussion of parallel computing and its related tools usually applied in CFD. In Section 4, a literature review of the methodologies adopted for CFD problems, the parallelization attempted for the related CFD codes, and the benefits and issues of the various parallel computing tools is provided. In Section 5, the issues related to parallel computing of CFD and suggestions for future work are mentioned. In Section 6, we conclude by presenting the contribution of this article, a brief report on open CFD areas/fields where parallelization has not been attempted, and a few parallel computing tools not yet employed for parallelization of CFD codes.

2 CFD and Issues Related to Computational Time of CFD Software

In this section we briefly discuss CFD and provide an intensive review of the various major areas where CFD is applied for the simulation of different fluid flow situations. Computational time related issues of the massive grid
points of complex configurations in CFD software codes have also been identified.
2.1 Computational fluid dynamics Computational fluid dynamics is the branch of fluid dynamics which provides simulation of fluid flow situations with the help of numerical solution of governing equation [89]. CFD provides simulation, using computational methods of the governing equations which describe and predict the behaviour of fluid flow [79]. CFD provides simulation and numerical modeling in designing, developing and optimizing of existing or new engineering facilities. The CFD method is very attractive, as it produces fluid flow situation more quickly at a lower cost than experimental methods [33]. The simulation provided by CFD has a vital role in various fields ranging from aeronautic to nuclear applications. The methods of CFD are applied in the design, development and optimization of industrial devices to get a more insight of flow behaviour, heat transfer analysis and various other performances [100]. A wide number of applications of CFD can be found in literature in many areas. Applications of CFD in many major areas/fields are mentioned in Table 1. There are various commercial CFD software packages like CFX, ANSYS FLUENT, STARCCM etc., where we can model and simulate the CFD problem. Industrialists and researchers have also developed their own CFD code or solver for numerical investigation of their particular problem. Numerical configurations upto 30 million mesh points in a single simulation is becoming standard in industries that provides more detailed numerical investigations. These numerical computations, software, codes, solvers are run on computers having a single or many central processing units [75, 81, 8].
2.2 Computational time related issues In order to get results with enough accuracy for the numerical simulation and investigation of fluid flow situation, the problem domain has to be sufficiently discretized [47]. For a fluid flow situation to find out whether the grid size is sufficient or not, the numerical calculations has to be repeated for different numbers or range of intervals and inputs. These calculations are repeated until the results are appreciable. This rule is generally accepted for CFD problem but it is rarely practiced because the computational time cost of these calculations is tough to bear [57, 84, 37]. Similar issues related to computational time in numerical methods found in literature are mentioned below.
Table 1 Application of CFD in a few areas/fields

Automotive industry: Flow distribution analysis, combustion analysis, heat transfer analysis, pressure variation analysis, thermal performance analysis and various other internal and external flow analyses. [20, 70, 85]

Aerodynamic industry: In aircraft - external and internal flow simulation, aerodynamic performance, vortex formation prediction, fluid-structure interaction analysis; aerodynamic design of wings, airfoils, vehicles; nozzle flow analysis. In spacecraft - design and analysis of structures, stage separation, rocket engines, re-entry; prediction of ignition, payload safety, takeoff, vehicle stability, drag and pressure variation; design and simulation of space vehicles, loads, acoustics, aerodynamic forces, subsonic, supersonic and hypersonic flows. [74, 82, 120, 91]

Marine industry: Design of structures, propellers, impellers, plants. Analysis of motion, pitching, rolling and ship drag. Behaviour of green water, simulation of fluid flow around ships, wave impact load analysis. Modelling of ocean waves, wave impact loads, wave heights and water currents. Surface analysis of water for reflection and refraction. [48, 66, 102, 30]

Chemical processing industries: Performance analysis of mixing devices like tanks and mixers. Analysis of cyclones, precipitators, boilers, burners, furnaces and fluidized beds. Simulation of processes like multiphase flow, evaporation, gasification and cavitation. Modelling of turbulence, reaction kinetics, energy balance, die filling etc. [36, 14]

Food industry: Prediction of drying, mixing, refrigeration and sterilizing processes for preservation of food. Design of ovens, bakers, hotboxes and thermos flasks. [112, 18]

Meteorology: Prediction of weather, climate, air quality, storms, cyclones, floods and clouds. Modeling and analysis of storm impact, cloud evolution, lightning, thunderstorms, fog, air pollution, heat and cold waves etc. Modeling and analysis of landslides, avalanches, floods, hurricanes, high winds, volcanic eruptions, earthquakes and tsunamis. [21, 109, 44]

Bio-medical industry: Design of medical instruments like artificial blood vessels, surgical equipment, intravascular devices etc. Optimisation of surgeries, fluid flow behaviour in the human body, analysis of biological processes, cardiac flow, investigation of coronary artery diseases. [111, 73]

Acoustic industry: Investigation of acoustic resonance, frequency, sound wave propagation, aero-acoustic design, convection, refraction, reflection, distribution of energy in acoustics, simulation of impingement noise. [19, 110]

Power stations: In thermal power stations - design of burners, furnaces and fluidized beds; analysis of flames, combustion process, superheaters, economisers, erosion, coal-to-air flow ratio, mass fraction, ducts and heat exchangers. In nuclear power stations - modeling, simulation and analysis of nuclear reactors, rods, elements, heat generation and fluids; hydraulic and thermal design, safety analysis, mixing of coolant, thermal shock analysis, performance and safe operation. [101, 106, 49, 99, 88]

Turbomachinery: Heat and fluid flow analysis in diffusers, nozzles, blades, rotors and compressors. Prediction of performance, design investigation and flow visualization in turbines, compressors, fans/ventilators and pumps. [26, 108]

Civil engineering industries: Design and assessment of buildings, dams, canals, bridges etc. Analysis of exhausts from structures, natural ventilation, indoor environment, wind loading, sediment scour, spillway flow, plumbing systems, solar heat radiation, wall pressure etc. [35, 55, 34]

Electronics industries: Analysis of cooling in circuits/printed circuit boards, chips, heat pipes and servers. Thermal analysis of various electronic systems and radar systems. Design of electronic cabinets and enclosures. [69, 17]

HVAC systems: Prediction of thermal comfort, climate control, flow velocity, humidity, density, chemical concentration and contamination transport. Design optimization and performance analysis of HVAC systems. [92, 65]

Fire and smoke modelling: Physical modelling of fire and smoke propagation in forests, buses, tunnels, trains, agriculture and wild land. Fire safety analysis in compartments, buses, airplanes etc. [6, 71]

Molecular dynamics: Simulation of the atoms and molecules of a fluid and their interaction with other fluid particles. Behaviour at the interface of different fluids or types of materials. [40, 50]
– The designers of CFD problems need the completed computation of millions of mesh points in the configuration over night so that they can analyze, change and improve the design all over the day [90, 1]. – The numerical results of the complex configurations in CFD require huge computations that are necessary to be calculated for the flow situation and the integration time taken is very long which are important to attain significant time averages [97, 87].
– CFD computation is generally performed within a time step loop which is repeated for many numbers of times in order to obtain significant results, which leads to increased computational time [38, 8]. – In numerical simulation, as the size of the problem increases beyond a certain limit, the required overall computer memory and the computational time rapidly grow to a point where the simulation time
is no longer feasible with most of the existing computing technology [113, 77]. – The CFD codes for complex industrial configurations commonly need very high end computing platforms to produce flow solutions in a sensible span of time [16, 81, 73]. – With research topics in CFD becoming more and more complicated, even present computer systems cannot produce the results within a reasonable amount of time [61, 55]. In short, one of the major issues related to CFD codes found in the literature is the massive time consumed by a CFD code to obtain the numerical simulation, with references too numerous to cite. Hence, to obtain these extensive computations in a reasonable amount of time and to better utilize the present computational resources, parallel computing has to be adopted [38, 8, 104].
3 Parallel Computing and Related Tools This section provides a brief discussion on parallel computing and related major tools/paradigms employed for parallelization of the CFD codes.
3.1 Parallel computing

The increase in computational demand has outpaced the growth in computational ability of a single processing unit. The stagnation of Central Processing Unit (CPU) clock speeds led to considerable attention on parallel architectures, which provide massive computational power through the use of separate processing units [7, 37, 31]. To take full advantage of multiprocessing architectures, programs need to be written for parallel execution [37]. Parallel computing, or parallelization, is the simultaneous execution of multiple computational tasks on a multiprocessor system by distributing the computational load among the different processors [113, 104]. Figure 1 illustrates the parallel execution of different tasks of a program on a multi-core system. For the execution of computationally intensive applications like CFD, parallel architectures are important tools [16, 8]. Parallel computing with the use of multiprocessing works if the order of execution inside a loop does not depend on any other loop or variable. Existing parallel programming models support shared memory, distributed memory and clusters of shared memory platforms. By interconnecting personal computers or workstations, a distributed parallel computing platform can be easily built to enhance the computational power [61, 81]. Parallel computing in the field of CFD has become an important tool for numerical simulations. In CFD, to obtain the computational results in feasible time and for effective utilization of computing resources, parallel computing is adopted [55, 37]. Currently, many attempts are being made to convert sequential codes into parallel codes on multiprocessors. The parallel CFD codes developed are generally based on the Single Program Multiple Data (SPMD) strategy [13]. In CFD, the parallel computing technique is basically to decompose the large problem domain into smaller subdomains. The generation of subdomains reduces the difficulty of grid generation for complex configurations, as the grids are independent within each subdomain. At the partitioned (inner) boundaries, the subdomains exchange data which are treated as boundary conditions [105, 81]. For every subdomain the computation is carried out simultaneously employing the SPMD CFD code. Using a certain mapping technique, the domain boundary data is exchanged across the boundaries after every iteration. The accuracy and efficiency of the parallel computation is determined by the performance of the mapping procedure followed for the exchange of boundary conditions [39].
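The decomposition step described above can be illustrated with a minimal sketch. The structure and function names below are hypothetical and are not taken from any of the surveyed codes; the example assumes the simplest possible case of a one-dimensional structured grid split into contiguous blocks of nearly equal size, with one ghost layer per side reserved for the boundary-condition exchange.

```c
#include <stdio.h>

/* Illustrative sketch: a 1-D structured grid of n_cells is split into nearly
 * equal subdomains, one per process. Each subdomain is padded with one layer
 * of ghost (halo) cells that will hold copies of the neighbouring subdomain's
 * boundary values.                                                           */
typedef struct {
    int start;   /* global index of the first interior cell owned by this rank */
    int count;   /* number of interior cells owned by this rank                */
    int nghost;  /* ghost layers on each side used for the boundary exchange   */
} Subdomain;

Subdomain decompose_1d(int n_cells, int n_ranks, int rank)
{
    Subdomain s;
    int base = n_cells / n_ranks;      /* cells every rank gets               */
    int rem  = n_cells % n_ranks;      /* the first 'rem' ranks get one extra */
    s.count  = base + (rank < rem ? 1 : 0);
    s.start  = rank * base + (rank < rem ? rank : rem);
    s.nghost = 1;                      /* one halo layer per side             */
    return s;
}

int main(void)
{
    int n_cells = 1000000, n_ranks = 4;
    for (int r = 0; r < n_ranks; ++r) {
        Subdomain s = decompose_1d(n_cells, n_ranks, r);
        printf("rank %d owns cells [%d, %d)\n", r, s.start, s.start + s.count);
    }
    return 0;
}
```

Real CFD codes normally perform the equivalent partitioning on multi-dimensional or unstructured meshes with graph-partitioning tools, but the principle of assigning each process a subdomain plus halo cells is the same.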
3.2 Parallel computing tools

The major parallel computing tools used for parallelization of CFD codes are briefly discussed in this section.

3.2.1 OpenMP

Open Multi Processing (OpenMP) is a key standard parallelization tool adopted on shared memory platforms. OpenMP is a set of callable runtime library routines and compiler directives that support multi-platform shared memory parallel programming in C, C++ and Fortran on a broad range of architectures [72, 8]. OpenMP has standardized the practice of parallel computing on shared memory multiprocessors, and OpenMP code developed on one platform can be ported to any other shared memory platform [24, 58]. OpenMP programs are executed on multi-core or multi-processor systems that share some or all of the available memory. OpenMP offers a very flexible and simple path for code development and is supported by nearly all major compiler vendors [83]. The developer of a shared memory parallel program often starts with serial code and puts OpenMP directives on simple loops to get better computational performance. With the use of OpenMP in CFD there is no need of partitioning the grid or transferring data between different subdomains, which ensures load balancing and saves communication cost [68].
Fig. 1 Parallel execution of tasks on a multi-core system: four different tasks of a program are executed simultaneously on four different cores.

Fig. 2 OpenMP paradigm: fork and join model, in which a master thread forks a team of slave threads that execute the tasks of the parallel region on different cores and join again at the end of the region.

A fork and join execution model is provided by OpenMP, where execution starts with a single thread or process. In an OpenMP parallel program, the serial tasks are executed on a single core until the parallel region is reached. Execution by this thread is sequential until a PARALLEL construct is encountered, whereupon the thread becomes the master thread and creates a team of threads; the newly created threads are called slave threads. These slave threads execute the various tasks of the parallel region on different cores of the system simultaneously, as shown in Figure 2, and the team executes the instructions lexically enclosed by the PARALLEL construct. The team of threads joins at the end of the parallel tasks. Work-sharing constructs such as SECTIONS, DO and SINGLE are offered to distribute the execution of the enclosed region of code among the members of the team. The threads of the team do not depend on any other thread and join at specific points or at the end of every work-sharing construct [72, 83, 51, 104].
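As a minimal sketch of this fork and join model (not taken from any of the surveyed solvers), the loop below smooths a one-dimensional field; the array names and the stencil are purely illustrative. The PARALLEL construct forks the team, the work-sharing for construct splits the independent iterations among the threads, and the implicit barrier at the end of the region corresponds to the join.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double unew[N], uold[N];
    for (int i = 0; i < N; ++i) uold[i] = (double)i / N;   /* serial region */

    /* PARALLEL construct: the master thread forks a team of threads. */
    #pragma omp parallel
    {
        /* Work-sharing construct: iterations are divided among the team;
           each iteration is independent, so no communication is required. */
        #pragma omp for
        for (int i = 1; i < N - 1; ++i)
            unew[i] = 0.5 * (uold[i - 1] + uold[i + 1]);   /* smoothing pass */
    }   /* implicit join: the team synchronizes and only the master continues */

    printf("unew[1] = %f (up to %d threads)\n", unew[1], omp_get_max_threads());
    return 0;
}
```

In Fortran CFD codes the same pattern is expressed with the PARALLEL and DO directives.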
3.2.2 MPI

Message Passing Interface (MPI) is also a key standard parallelization tool, adopted on message passing platforms. MPI is in fact the standard paradigm for parallelization of code on distributed memory platforms [86, 9]. MPI provides a collective set of library routines which exchange messages and manage processes. It is generally used in high end computing applications which are so large that several PCs are required to do the calculations [83]. MPI is relatively simple to implement across a variety of platforms and is portable across distributed and shared memory systems. For CFD solvers, MPI is presently one of the most popular paradigms. In MPI based CFD calculations the grid is partitioned into many subdomains which are allotted to the various processors, and the data is passed across the interfaces between neighbouring subdomains of the CFD configuration [15, 81]. Figure 3 shows the message passing to various CPUs using MPI for parallel execution of the different tasks in a parallel region of a program. The CPUs are interconnected by a network and communicate data among themselves using messages. Several kinds of communication are available in the MPI Application Program Interface (API), such as collective communication and point to point communication. Collective MPI calls lead to communication among all computing cores within a group. These communications can be obtained through blocking or non-blocking protocols [5]. In a blocking protocol, program execution is held until the message buffer is safe to use, whereas in non-blocking protocols the program execution does not wait for the communication buffer to become safe to use. Point to point MPI calls are associated with data exchange between two particular processes. Non-blocking communication gives the benefit of letting the computation move on immediately after the MPI communication call [67, 8, 96].

Fig. 3 MPI paradigm: message passing of parallel tasks using MPI on a cluster of CPUs interconnected by a network.
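A minimal sketch of the subdomain boundary exchange described above is given below; it is not taken from any particular solver, and the layout (one ghost cell on each side of a one-dimensional block) is an assumption made for illustration. Blocking MPI_Sendrecv calls exchange the boundary values with the left and right neighbours, and MPI_PROC_NULL turns the exchange at the physical domain ends into a no-op.

```c
#include <mpi.h>
#include <stdio.h>

#define NLOC 1000              /* interior cells per rank (1-D example) */

int main(int argc, char **argv)
{
    int rank, size;
    double u[NLOC + 2];        /* indices 0 and NLOC+1 are ghost cells  */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 1; i <= NLOC; ++i) u[i] = rank;          /* local data  */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Blocking point to point exchange of subdomain boundary values:
       send the first interior cell to the left neighbour while receiving
       that neighbour's last interior cell into the ghost cell, and the
       mirror exchange towards the right neighbour.                      */
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                 &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[NLOC],     1, MPI_DOUBLE, right, 1,
                 &u[0],        1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: ghosts = %g, %g\n", rank, u[0], u[NLOC + 1]);
    MPI_Finalize();
    return 0;
}
```

Run, for example, with mpirun -np 4 on a workstation or cluster; each rank then holds copies of its neighbours' boundary values before the next iteration.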
3.2.3 CUDA

Existing graphics hardware provides great computational power in the form of the General Purpose Graphics Processing Unit (GP-GPU). Graphics hardware can be employed to speed up the general purpose computations of numerical codes, and such GPUs can achieve considerable speedups over a standard CPU in many applications like CFD [32, 81]. A GPU is a set of Single Instruction Multiple Data (SIMD) extremely parallel co-processors to the CPU, and the memory hierarchy of GPUs is similar to that of conventional processors. GPUs can be utilized as a programmable engine supported by programming tools like the Compute Unified Device Architecture (CUDA) [41]. In the general purpose GPU community the CUDA programming paradigm has been a huge success. CUDA provides a new approach which particularly targets the many cores on a single GPU [94], and it is an extension to the C language that allows developers to initiate and run massively parallel computations on the GPU. The use of GPUs is one of the most cost effective ways to significantly improve the performance of CFD code execution [54]. CUDA provides a parallel computing architecture that introduces a novel programming model established on high abstraction levels, avoiding the earlier graphics pipeline concepts and making the porting of CPU applications simple [11]. The kernel is the computational core of the CUDA programming paradigm. The kernel is passed to the GPU, as shown in Figure 4, and is executed on different data streams by every processing unit. Every kernel is launched from the host CPU and mapped onto a GPU thread grid. This grid consists of many thread blocks, and the shared memory related to a specific block is accessed by all the threads of that block, which synchronize together [64]. The intensive, time consuming computations of CFD-like applications are sorted into an instruction set and passed to the GPU, so that every thread core works on different data but simultaneously executes the same instruction set [56].

Fig. 4 CUDA paradigm: serial regions execute on the CPU, data is copied to the GPU, the parallel tasks of a CUDA kernel are executed on the GPU cores, and the result is copied back to CPU main memory.
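The host-to-device workflow just described (copy data, launch a kernel over a grid of thread blocks, copy the result back) can be sketched as follows. The kernel and variable names are hypothetical; the kernel simply scales a field, standing in for whatever per-cell update a CFD code would perform.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Kernel: each GPU thread updates one cell; every thread of the grid executes
   the same instruction stream on its own piece of data (SIMD/SIMT).          */
__global__ void scale_field(const float *in, float *out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n) out[i] = factor * in[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes), *d_in, *d_out;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaMalloc((void **)&d_in,  bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);   /* copy data to GPU */

    int threadsPerBlock = 256;                             /* one thread block */
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale_field<<<blocks, threadsPerBlock>>>(d_in, d_out, 2.0f, n); /* launch */

    cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);   /* copy result back */
    printf("h[0] = %f\n", h[0]);

    cudaFree(d_in); cudaFree(d_out); free(h);
    return 0;
}
```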
4 Literature Review of CFD problems and Related Software Parallelization using Different Tools In this section we provide a detailed literature review of methodologies adopted for CFD problems and parallelization attempted related to software of these CFD problems using different parallel computing paradigms. Benefits and issues of these various parallel computing paradigms in CFD codes parallelization are also provided.
4.1 Parallelization using OpenMP Methodologies of various CFD problems and parallelization strategy of pertaining CFD codes using OpenMP are described below. Amritkar et al., used Discrete Element Method (DEM) coupled with CFD for simulating dense particulate systems. DEM deals with particle-wall and multiple particleparticle interactions of spherical smooth particles. Based on Newtons second law of motion the forces acting on any individual particle were calculated and using
the soft sphere model the collision forces were calculated. They modelled a linear spring-dashpot for the contact forces acting on a particle in collision with its neighbour in the tangential and normal directions. The conduction heat transfer related to particle-particle and particle-wall interactions of the particulate phase was calculated, and similarly the convective heat transfer coefficient between particles and the fluid phase was calculated based on Nusselt number correlations [3]. Amritkar et al. achieved parallelization of the CFD-DEM scheme using OpenMP. A first touch policy was employed in order to obtain optimal performance with OpenMP. In this policy, data is kept within the node that is local to the core which first initializes and allocates the memory block. The data is thus placed where it is most frequently accessed, by performing the initialization of all the arrays in parallel, supplemented with placement tools. These additional tools ensured process affinity to a particular processor for the duration of the run [3].
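The first touch idea can be sketched as below. This is an illustrative fragment, not the actual GenIDLEST implementation: on a NUMA machine, a memory page is physically placed on the node of the thread that writes it first, so initializing the arrays with the same static loop partitioning that the solver later uses keeps most accesses local.

```c
#include <stdlib.h>

#define N 10000000

int main(void)
{
    double *p = malloc(N * sizeof(double));
    double *f = malloc(N * sizeof(double));

    /* First touch: each thread writes the portion of the arrays it will own
       later, so the pages land on that thread's own NUMA node.              */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; ++i) { p[i] = 0.0; f[i] = 0.0; }

    /* Compute loop with the same static partitioning: each thread mostly
       accesses pages resident on its local memory.                          */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; ++i) p[i] += 0.5 * f[i];

    free(p); free(f);
    return 0;
}
```

Thread pinning tools (the "placement tools" mentioned above) are still needed so that threads do not migrate away from the node that holds their data.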
The interaction between particle motion and turbulent fluid flow was investigated numerically by Zhang et al. To investigate the behaviour of the fluid phase, inter-particle collisions and their effect on particle dispersion, the governing equations were solved by carrying out Direct Numerical Simulation (DNS), while the Discrete Element Method (DEM) was employed to calculate the particle motion. The Navier-Stokes equations were discretized on a staggered grid in space with a fourth-order symmetry preserving scheme, and the temporal discretization was done by employing a fully explicit second-order one-leg scheme. Finally, Newton's equation of motion was used to determine the motion of particles when they are not in direct contact among themselves but driven only by the body force and the fluid flow [119]. The CFD code used for this investigation was parallelized using OpenMP by Zhang et al. There are two parts in this CFD code, a DNS part and a DEM part. The DNS part of the code is parallelized by a domain decomposition method, and for the DEM part parallelization is achieved by dividing the total number of particles to be tracked among the processors [119]. Accary et al. proposed a physical model of forest fire, solving the conservation equations of physics applied to a medium composed of a gas mixture and solid phases. The model couples the main mechanisms of transfer, like radiation, turbulence and convection, and of decomposition, like combustion and drying, which
occur during propagation of forest fire. The related non stationery Navier-Stokes equation with Boussinesq approximation was governed for Newtonian fluid flow. Fully implicit segregated method depending on SIMPLER algorithm is used to solve the transport equations. Further the systems obtained from discretized equation are solved by BiCGStab iterative method whereas the pressure equations symmetric linear system is solved by the Conjugate Gradient method [1]. OpenMP parallelization of 3D CFD software for modeling of forest fire behaviour was proposed by Accary et al. The CFD software parallelized, has explicit nature whose most computational time i.e. nearly 90% was consumed by Conjugate Gradient type solver routines. The author ensured proper distribution of data within local memories of corresponding processor nodes. This method allowed to achieve better parallelizing efficiency for moderate count of cores using OpenMP standard [1]. Berger et al., developed a flow solver for inviscid steady state Euler equation for multi level Cartesian grids with embedded boundaries. The developed flow solver employs finite volume discretization to store the flow quantities at the centroid of the cell. The iteration uses multistage RungeKutta scheme [12]. Berger et al., used OpenMP to parallelize the solver developed for in-
viscid steady state flow. The gradient for each cell was computed by each processor, copied the gradient for the overlap cells from the neighbouring processor, computed a residual, updated the cell it owns and copied the values of new overlap cell from the adjacent processor. Huge chunks of the code were executed in parallel with a coarser granularity compared to a typical fine grained loop level parallelization [12]. Using finite volume technique the Reynolds averaged Navier-Stokes (RANS) equation was discretized on meshes of mixed type elements like pyramids, prisms, tetrahedra, hexahedra etc., by Mavriplis. In the flow solver the elements of the grids were handled by a single unifying edge based data structure. Roe Rieman solver based upwind scheme was used to approximate the matrix dissipation and finite difference approximation is employed to discretize the viscous terms. Similarly a first order discretization is employed for multigrid calculation of convective terms on the coarse grid levels [77]. Mavriplis implemented shared memory parallelization, using OpenMP of the CFD code developed for analysis of high lift configurations. Mavriplis made use of first touch rules, where the memory is allocated to the processor which is the first to access it. Memory placement was hence achieved, in which each processor initializes all the arrays by executing a parallel loop , on the sub-domains to which it is been assigned [77]. Cheng et al., developed a flow solver HUNS3D for viscous flows based on hybrid unstructured meshes. The three-dimensional non-dimensionalized unsteady Reynolds averaged Navier-Stokes (RANS) equations were semi discretized using the cell-centered finite volume method and further linearized the equation by means of first order Taylor expansion. Their HUNS3D flow solver uses an improved lowerupper symmetric GaussSeidel timemarching method which is a very efficient method for structured grids and can be used with unstructured grids as well [22]. Cheng et al., implemented parallel computing using OpenMP for their in house built flow solver HUNS3D. They performed the numerical computation on two and three dimensional aerodynamic configurations like RAE2822 airfoil, DLR-F6 WBNP configuration and an aerospace plan. The computational efficiency of the flow field improved on the shared memory parallel environment maintaining the computational accuracy [22]. FPX rotor code developed for analysis of rotary wing aerodynamics like forward flight cases and helicopter hover was studied by Turner and Hu. The code solves unsteady compressible full potential three dimensional equation using an implicit approximate factorization finite difference scheme in a strong conservative form. The code is large and uses many physical en-
hancement models such as shock induced entropy correction model, embedded wake vortex models, viscous boundary layer models [104]. Turner and Hu successfully converted FPX code written in Fortran in its reduced two dimensional form into parallel version for parallelization. OpenMP was employed for parallelization of the code developed for analysis of rotor wing aerodynamics. Further they parallelized the loop correctly by making sure of each iteration of the loop to be independent of any other iteration. Upto 32 processors were used to obtain parallelism, and obtained fair speedups [104]. Jin et al., used a moderate size CFD code ARC3D which solves Navier Stokes and Eulers equation in three dimension. The code contains turbulent models, more realistic boundary conditions and uses a single rectilinear grid. The implicit scheme of finite difference equations were approximately factorized using Beam Warning algorithm. The finite difference equation is then solved in three directions alternatively [Jin2000]. OVERFLOW software is widely used in aerospace community for airflow simulation. Jin et al., used the software for analysis of airflow over airfoils. The software solves the related compressible Navier Stokes equation with complicated turbulence model, first order implicit scheme and chimera boundary conditions in multiple zones. The domain is decomposed into many logical Cartesian meshes and the software uses finite differences scheme in space [58]. Jin et al., implemented shared memory parallel programming using OpenMP for CFD code ARC3D and OVERFLOW. Coarser grained parallelism was introduced using OpenMP, which provided runtime library functions and directives in performing point to point synchronization. Furthermore implementation of pipeline parallelism was possible with the directives and library functions of OpenMP which ensured the pipelined code not to execute in a thread before the neighbouring thread has done the work. Execution at such a level required the scheduling scheme of the pipelined code to be ordered and static [58]. The different types of CFD problems belonging to various areas and related CFD codes parallelized using OpenMP are described in the above literature survey. They are mentioned in table 2 along with the number of grid points used and test cases. 4.1.1 Benefits of using OpenMP in CFD Benefits of using OpenMP in parallel computing of CFD codes are provided below. – OpenMP is simple to program as it does not require message passing and provides incremental parallelism approach [42].
Table 2 Parallelization using OpenMP

CFD Field/Problem | Code | Mesh/Grid points (millions) | Case study | Reference
Dense particulate system | GenIDLEST | 16 | Fluidized bed and rotary kiln | [3]
Particle laden turbulent flow | — | 8.3 | Turbulent flow through a square duct | [119]
Forest fire behaviour | Fire Paradox | 0.216 | European integrated fire management project | [1]
Inviscid steady state flow | — | 9 | Space shuttle, ship | [12]
High lift configurations | — | 2.4 | High lift configurations | [77]
Viscous flow simulation based on unstructured grids | HUNS3D | 1.56 | RAE2822 airfoil, NHLP-2D L1T2 high-lift multi-element airfoil, DLR-F6 WBNP configuration and an aerospace plane | [22]
Rotor wing | FPX | 3.5 | Helicopter hover, forward flight | [104]
Aero physics | ARC3D | 2.1 | Aero physics applications | [58]
Airfoil | OVERFLOW | 2.1 | Airfoils | [58]
– It is portable and scalable. OpenMP based code developed on a platform can be ported to any OpenMP compiler system reducing the time and money spent in porting of code [104]. – OpenMP provides quick development of parallel CFD program because of its global view of application memory address space [59]. – Lighter maintenance of writing sequential codes is obtained by directive based approach of OpenMP [60]. – OpenMP consumes less time in packing and unpacking buffers [12]. – On non parallel compilers OpenMP can be compiled which increases its portability [119]. 4.1.2 Issues of using OpenMP in CFD Issues of using OpenMP in parallel computing of CFD codes are provided below. – OpenMP program is limited to shared memory multiprocessing architectures and the memory architecture restricts its scalability [42]. – More than a single computing node having few Processing Units is needed for majority of the CFD computations. Therefore OpenMP is not yet enough to parallelize resourcefully a CFD code that is targeted for realistic and huge industrial problems [77].
– At large scale, using OpenMP is difficult to obtain sufficient performance because of the memory model governing fine grained memory access [59]. – Data locality is not ensured by OpenMP [3]. – Multiple levels of parallelization of CFD code employing only OpenMP approach is not much explored in the literature [12]. – The OpenMP overheads increase as the number of threads increase [28].
4.2 Parallelization using MPI Methodologies of various CFD problems and parallelization strategy of pertaining CFD codes using MPI are described below. Wang et al., used their in house built Navier-Stokes code where the governing Reynolds averaged NavierStokes equation is discretized with upwind schemes and finite volume method for steady state solution. To calculate the turbulent eddy viscosity Baldwin-Lomax turbulence model is employed. The implicit time marching scheme is used in conjunction with the unfactored Gauss-Seidel line relaxation to obtain high convergence rate. In each direction the Gauss-Seidel line iteration is applied and is swept backward and forward on each direction [107]. Wang et al., parallelized there in house
built CFD software using MPI for data communication. This MPI based code demonstrated high parallel computation performance, and a few two dimensional and three dimensional flows were computed in this parallel environment [107]. The flow solver NUM3SIS, developed at INRIA Sophia-Antipolis, was used by Duvigneau et al. to study robust design in aerodynamics. In this flow solver the governing compressible Navier-Stokes / Euler equations on three dimensional tetrahedral grids are solved using a mixed finite-element finite-volume method with a cell-vertex approach. Approximate Riemann solvers are used to discretize the convective fluxes, and a MUSCL interpolation is employed to obtain high-order schemes on the dual mesh. The diffusive terms are computed in a classical finite-element framework using a P1-Galerkin approach, and the time integration is achieved using a first-order implicit backward scheme based on a matrix-free implementation [29]. The NUM3SIS CFD code was parallelized by Duvigneau et al. using a domain decomposition approach. The parallelizing tool used was MPI, where the domain decomposition is based on a simple overlapping strategy, since adjacent domains share only boundary nodes. Fluxes are computed independently for these nodes in each domain, the contributions from the different domains are summed through communication, and for shared nodes the state update is performed in all domains [29]. The simulation of granular media was studied by Maknickas et al., where the granular media was assumed to be made of discrete particles which can be treated as discrete elements. These discrete particles are regarded as spherical and are considered to be in a visco-elastic state. The particles change their position due to their dynamic nature, free rigid body motion, contact with the walls defining the computational domain, or contact with neighbouring particles. The dynamical behaviour of the whole granular media is time-dependent, so Newton's second law of rotational and translational motion is applied to each discrete particle. The particle positions, velocities and accelerations are predicted by Gear's predictor and corrected, according to the forces acting on the particles, by Gear's corrector; time integration is thus performed using the two step Gear predictor-corrector scheme [76]. The parallelization of the DEMMAT CFD code was achieved using the MPI paradigm by Maknickas et al. The domain is split into an equal number of subdomains, where each core only computes the forces and updates the particle positions in its own subdomain. To perform the computations, the processors need to share information about the particles which are close to the division boundaries of neighbouring sub-domains. In the implemented algorithm the required communication is therefore local in nature, and a major part of the sequential code can be employed with little or no modification. Partitions having an almost equal number of discrete particles ensure static load balancing on homogeneous PC clusters [76]. Jia and Sunden studied the multi-block three dimensional CFD code CALC-MP (in-house built), developed for fluid flow and heat transfer analysis. The multi-block 3D code is based on the equations of specific turbulence models in conjunction with the energy and Navier-Stokes equations. The code was developed in a body-fitted coordinate system using the finite volume technique. It employs the Rhie and Chow method and a collocated mesh arrangement to establish the velocity at the control volume faces, while pressure and velocity are coupled by the SIMPLEC algorithm. TDMA or SIP based algorithms are used for solving the discretized equations, and the convective fluxes are calculated with hybrid or QUICK schemes [57]. Parallelization of the CFD code CALC-MP using MPI was done by Jia and Sunden. A message passing strategy is employed for all data communication on the inner boundaries, and MPI was chosen to avoid deadlocking and for efficiency considerations. Normally a single block will have more than one neighbouring block, hence it is difficult to manage the message receiving and sending sequence properly. One process is allotted as the master node, and this master node is responsible for the control flow of the job and for distributing or collecting global data such as the reference pressure, residual data and mass-conservation data [57]. The three-dimensional, unsteady, multi-block turbomachinery flow solver TFLO was developed and validated by Yao et al. The unsteady Reynolds-averaged Navier-Stokes equations are solved on arbitrary multi-block meshes using a cell-centered discretization. The solution procedure depends on efficient explicit modified Runge-Kutta methods with a few acceleration techniques like residual averaging, local time stepping and multigrid. The multigrid technique, especially, gives fast solution turn-around and excellent numerical convergence. The numerical dissipation schemes implemented were: 1) a refined convective upwind split pressure dissipation model and 2) the Jameson-Schmidt-Turkel switched scheme. These models and schemes give sharper resolution of contact discontinuities and shock waves with a small increase in computational cost [116]. Yao et al. parallelized the turbomachinery flow solver TFLO using MPI.
The solver was parallelized using a Single Program Multiple Data domain decomposition strategy. This coarse grained parallelism was achieved using a double halo construct, and the communication between the processors inside the flow solver employs MPI asynchronous constructs, which avoid contention and deadlocks in the network and take advantage of any special purpose communication hardware the parallel platform may have [116].

The numerical simulation of three dimensional fluid structure interaction was carried out by Yuguang et al. The associated continuity and Navier-Stokes (NS) equations were solved with a slight modification to account for a moving mesh, turbulence and the Smagorinsky eddy viscosity model. Applying a block-iterative method, the incremental structural and Navier-Stokes equations are discretized, and these equations are solved in conjunction with the block Gauss-Seidel iterative algorithm [118]. MPI was used by Yuguang et al. for parallelization of the CFD code developed for this numerical simulation of fluid structure interaction; MPI was employed to build a suitable parallel environment from the available computing resources. The fluid domain is distributed among 128 blocks during the 3D simulation and the computation of each block is given to one of 128 processors, in which one processor is used as the master and the rest as slaves. A large saving in budget, hardware and maintenance was observed together with good computational performance [118].

Gourdain et al. studied the behaviour of flow in complex geometries using the CFD codes elsA and AVBP. The CFD code elsA deals with the simulation of multiple CFD areas, covering external and internal aerodynamics from the hypersonic to the low subsonic flow regime. The governing compressible Reynolds averaged Navier-Stokes (RANS) equations are solved by a cell centred finite-volume method, with centred or upwind schemes for space discretization stabilized by matrix or scalar artificial dissipation. The semi-discrete equations are integrated with a backward Euler integration or with multistage Runge-Kutta schemes, using robust lower-upper relaxation methods to solve the implicit schemes. The other CFD code, AVBP, is an unstructured flow solver which deals with hybrid meshes of various kinds of cell types. The code solves the turbulent and laminar compressible Navier-Stokes equations, focusing on internal flow geometries with turbulent flows in two or three space dimensions. A cell vertex finite-volume approximation is employed for the data structure of the code, and the discretization method is based on a finite-element type low-dissipation Taylor-Galerkin scheme or on a Lax-Wendroff method together with an artificial viscosity model [42]. Gourdain et al. parallelized the flow solvers elsA and AVBP employing MPI. The communications in both CFD codes are implemented using MPI blocking calls or MPI non-blocking calls, and the scheduling of the MPI blocking communications is based on a heuristic algorithm which employs a weighted multi-graph [42].

The Lattice Boltzmann Method, based on kinetic theory, simulates isothermal multi-phase flows in CFD. The Lattice Boltzmann Method for two-phase flow was analyzed by Shang et al., using two particle distribution functions for the two phases. A stable discretization scheme is employed to deal with the stability of the pressure variation in the Boltzmann equation for two-phase flow. The incompressible flow at high viscosity and density ratio is analyzed using the impact of a liquid droplet on a wet wall having a thin film [96]. The parallel performance of the Lattice Boltzmann Method was studied by Shang et al.; parallelization of the aforesaid method by domain decomposition was investigated using the MPI library. The data communication is achieved through halo cells, and the computational domain is divided into various subdomains which are executed on various processors [96].

J. Hawkes studied and modified the CFD code ReFRESCO, which deals with viscous flow and solves the incompressible Reynolds averaged Navier-Stokes equations in coordination with a volume fraction transport equation and different models, like turbulence and cavitation models, for the various phases of flow. The equations are discretized with a cell centered finite volume approach and the SIMPLE algorithm is employed to ensure conservation of mass. The time integration is performed with first and second order backward schemes, and a segregated strategy is adopted for all transport equations [46]. ReFRESCO was parallelized using MPI by J. Hawkes. In this parallel approach MPI is used for domain decomposition: the grids are partitioned into subdomains, each subdomain having a coating of regular cells known as ghost cells, and each of these subdomains is computed within an MPI process [46].

Berger et al. parallelized the CFD solver developed for the inviscid steady state Euler equations (as mentioned earlier) using OpenMP and later converted it to an MPI version. The code required very little modification because of the coarse granularity of the CFD solver parallelized with OpenMP. The results showed that the MPI version of the code performs worse than the OpenMP version, which is due to the smaller amount of time consumed by OpenMP in packing and unpacking buffers, while MPI in addition suffers overhead [12].
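Several of the MPI parallelizations above (TFLO, elsA/AVBP, ReFRESCO) rely on ghost-cell exchange and, where possible, on non-blocking calls so that interior work can overlap the communication. The fragment below is a hedged sketch of that pattern, not code from any of those solvers; the helper functions and the one-dimensional layout are hypothetical, and the left/right neighbours may be MPI_PROC_NULL at the physical domain ends.

```c
#include <mpi.h>

#define NLOC 1024

/* Sketch of non-blocking ghost-cell exchange with computation overlap:
   post the receives and sends, update the interior cells that do not need
   halo data while the messages are in flight, then wait and finish the
   boundary cells.                                                         */
static void update_interior(double *u) { for (int i = 2; i < NLOC; ++i) u[i] += 1.0; }
static void update_boundary(double *u) { u[1] += u[0]; u[NLOC] += u[NLOC + 1]; }

void exchange_and_update(double *u, int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

    MPI_Irecv(&u[0],        1, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(&u[NLOC + 1], 1, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(&u[1],        1, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(&u[NLOC],     1, MPI_DOUBLE, right, 0, comm, &req[3]);

    update_interior(u);                        /* overlaps the communication  */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);  /* halos are now valid         */
    update_boundary(u);                        /* cells that needed ghost data */
}
```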
Table 3 Parallelization using MPI

CFD Field/Problem | Code | Mesh/Grid points (millions) | Case study | Reference
General subdomain boundary in aerodynamics | — | >16 | Transonic airfoil, co-flow airfoil, air wing | [107]
Design optimization in aerodynamics | NUM3SIS | >0.03 | Wing shape of a business aircraft | [29]
Granular media | DEMMAT | >1 | Granular material in a cubic box | [76]
Fluid flow and heat transfer analysis | CALC-MP | >2 | 3D straight channel, straight square duct | [57]
Turbomachinery flow | TFLO | >2 | 1.5-stage Aachen turbine | [116]
Fluid structure interaction | — | >40 | Long span bridges | [118]
Behaviour of flow in complex geometries | elsA | >100 | — | [42]
Internal flow geometries with turbulent flows | AVBP | >100 | — | [42]
Two phase flow by lattice Boltzmann method | — | 900 | Liquid droplet on a thin liquid film | [96]
Multiphase viscous flow | ReFRESCO | 2.67 | KVLCC2 | [46]
Inviscid steady state flow | — | 9 | Space shuttle, ship | [12]
Incompressible fluid flow | — | 0.1 | Lid driven cavity | [93]
High lift configurations | OVERFLOW | 2.4 | High lift configurations | [77]
Selvam and Hoffmann parallelized a CFD code for incompressible flow based on the Navier-Stokes equations using MPI. The domain is decomposed into many subdomains, where each subdomain is assigned to an MPI process in such a way that the computational workload is uniformly distributed. Here the decomposition depends upon the number of processes and the grid size [93]. Mavriplis implemented distributed memory parallelization using MPI as well, for an unstructured mesh Reynolds averaged Navier-Stokes (RANS) solver for the analysis of high lift configurations. This implementation has performance equivalent to the OpenMP-only parallelization [77]. The different types of CFD problems belonging to various areas and the related CFD codes parallelized using MPI are described in the above literature survey. They are listed in Table 3 along with the number of grid points used and the test cases.
4.2.1 Benefits of using MPI in CFD
Benefits of using MPI in parallel computing of CFD codes are provided below.
– The major benefit of MPI is its achievable performance [59]. – In development cycle of a program MPI provides full user control [29]. – Data locality is ensured by MPI [3]. – The standard API and MPI libraries give rise to portability on a wide range of platforms [95]. – MPI can be employed for both distributed and shared memory [104].
4.2.2 Issues of using MPI in CFD Issues of using MPI in parallel computing of CFD codes are provided below. – Discrete memory view of programming in MPI leads to very long development cycle and even difficult to write [59]. – Increased memory requirement due to duplication of data because of its distributed nature [115]. – To execute MPI tasks on huge parallel platforms is a big challenge due to instabilities of systems [76]. – In complex geometries of CFD the domain decomposition is a challenge [116]. – MPI has scalability issue with non uniform memory access systems because of domain partitioning [42]. – MPI has large overhead [27]. – MPI involves more manpower and needs large resources like network and memory [118].
4.3 Parallelization using CUDA

Methodologies of various CFD problems and the parallelization strategies of the pertaining CFD codes using CUDA are described below. Griebel and Zaspel studied the simulation of incompressible two-phase flows using the level set function. The CFD code NaSt3DGPF developed here utilizes the two-phase Navier-Stokes equations with surface tension. The Navier-Stokes equations are discretized on a staggered grid with finite differences, and the pressure projection approach of Chorin is applied. For the convective terms a fifth order Weighted Essentially Non-Oscillatory (WENO) space discretization is employed, while the time integration uses a third order Runge-Kutta method or a second order Adams-Bashforth scheme. The Poisson equation is solved with an algebraic multigrid method or a Jacobi-preconditioned conjugate gradient iterative method [43]. Griebel and Zaspel parallelized the level set equations using CUDA. The pressure Poisson equation solver, i.e. the Jacobi preconditioned conjugate gradient, and the reinitialization of the level set function, originally implemented as CPU code, were ported to the graphics processing unit (GPU). The authors mention that the Jacobi preconditioner is perfectly suited for single instruction multiple data parallelization on GPUs, and that the point wise product can be easily implemented in a CUDA kernel and gives improved performance [43]. The incompressible fluid flow situation solved using the Navier-Stokes equations was studied by Thibault and Senocak. The diffusion and advection terms of the Navier-Stokes equations are discretized using a second-order accurate central difference scheme, and an explicit Euler first-order accurate scheme is employed for the time derivative term. For incompressible fluid flows, the numerical solution of the Navier-Stokes equations is calculated adopting a projection algorithm, and the pressure Poisson equation is solved using a Jacobi iterative solver to obtain the steady state solution [103]. The Navier-Stokes solver for incompressible flow on multi-GPU desktop platforms was implemented using CUDA by Thibault and Senocak. The usage of shared memory was done in three steps. In a kernel, the first step is that the threads of a block copy the subdomain they are responsible for from global memory to shared memory. Then, utilizing the data in shared memory, the calculation is done by the threads. Finally, the result of the calculation is written back to global memory before exiting the kernel. By increasing the size of the subdomain that is assigned to a thread block, the overhead created by the data transfer between shared memory and global memory is reduced [103].
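The three-step shared-memory pattern described above can be sketched for a simple one-dimensional Jacobi-type update; this is a hypothetical illustration, not the actual multi-GPU solver, and it assumes the kernel is launched with a block size equal to TILE (for example with the host-side launch shown earlier).

```cuda
#include <cuda_runtime.h>

#define TILE 256   /* launch with blockDim.x == TILE */

/* Step 1: each thread block copies its tile of the subdomain, plus one halo
   value on each side, from global to shared memory. Step 2: the threads
   compute from shared memory. Step 3: the result is written back to global
   memory before the kernel exits.                                           */
__global__ void jacobi_shared(const float *u_old, float *u_new, int n)
{
    __shared__ float s[TILE + 2];                        /* tile plus halos  */
    int g = blockIdx.x * blockDim.x + threadIdx.x + 1;   /* skip boundary 0  */
    int l = threadIdx.x + 1;

    if (g < n - 1) {
        s[l] = u_old[g];                                 /* step 1: own cell */
        if (threadIdx.x == 0)
            s[0] = u_old[g - 1];                         /* left halo        */
        if (threadIdx.x == blockDim.x - 1 || g == n - 2)
            s[l + 1] = u_old[g + 1];                     /* right halo       */
    }
    __syncthreads();                                     /* tile is complete */

    if (g < n - 1)
        u_new[g] = 0.5f * (s[l - 1] + s[l + 1]);         /* steps 2 and 3    */
}
```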
Cohen and Molemaker described and performed validation of a second-order, double precision, finite volume Boussinesq code. The Boussinesq approximation was used to solve the incompressible Navier-Stokes equations on a staggered regular grid, discretized using a second order finite volume method. The advection terms are discretized using centered differencing of the flux values, resulting in a discretely conservative second-order advection scheme, and all other spatial terms are discretized using second-order centered differencing. A second-order Adams-Bashforth method was primed with a single forward-Euler step at the start of the numerical integration [23]. Cohen and Molemaker used CUDA to parallelize this CFD code based on the second-order double precision finite volume Boussinesq approximation. For GPU computing on the GT200, threads are grouped into batches of 32 known as warps, which execute in a single instruction multiple data fashion, and memory coalescing combines the accesses of a batch of threads into a single operation. With thousands of simultaneously active threads, the on-chip caches are useful only if threads scheduled to the same processor access the same cache at the same time [23].
Simulation of internal (turbulent) fluid flows using CFD was carried out by Emelyanov et al. The CFD code solves the Reynolds-averaged Navier-Stokes (RANS) equations with local time-stepping, employing an upwind finite difference scheme with Roe's approximate Riemann solver or Godunov's exact solver for the convective fluxes, and a central difference scheme for the viscous fluxes. A Runge-Kutta formulation is used for the time integration, and the simulation uses an unstructured mesh having cells of different shapes. The flexibility to run on meshes made of different types of cells is provided by edge-based
data structures [31]. Parallelization of the CFD code for simulation of internal fluid flows on GPUs was achieved by Emelyanov et al. The parallelizing paradigm is CUDA, while the pre- and post-processing had to be done on the CPU. The time-consuming computation that can be parallelized was performed on the GPU using CUDA. On the CPU, the face areas, cell volumes, flow field assumptions and mesh construction are obtained. For each time step of the computation, GPU kernels compute the cell fluxes, add them up, calculate the difference in properties at every node, smooth the variables and apply the boundary conditions [31].

Simulation of particle coagulation dynamics based on an accelerated population balance-Monte Carlo method was done by Xu et al. The governing equation is solved by quadrature or an appropriate discretization scheme in conjunction with a deterministic scheme such as the method of moments or a sectional method. To capture the time development of the particle size distribution and to reduce the number of time loops with little statistical noise, the Markov jump model builds a coagulation rule matrix belonging to differentially weighted simulation particles [115]. Xu et al. proposed a parallelized CFD code to accelerate the particle coagulation dynamics simulation. GPU parallel computing was implemented using CUDA, the main idea being to use the GPU for highly threaded data-parallel processing tasks. The selection of coagulated particle pairs using the acceptance-rejection scheme and the inverse scheme is implemented with CUDA on the GPU [115].

The Smoothed Particle Hydrodynamics (SPH) method, generally employed to simulate complex flow regimes in CFD, was studied by Crespo et al. SPH does not require a mesh and handles wave propagation, moving interfaces, deformable boundaries and, in particular, free surface flows. The governing equations are based on conservation of mass and momentum, and the fluid domain is represented by particles whose positions are advanced at each time step according to the governing dynamics [25]. The SPHysics CFD code developed for the SPH method was parallelized on GPUs using CUDA. Initially the data is local to the CPU and is transferred to the GPU; the sequential tasks that involve a loop over all particles are parallelized making use of the architecture of the GPU cores. Reordering the particles according to the cells to which they belong is computed with the radix sort algorithm offered by CUDA [25].
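The cell-based particle reordering mentioned for the GPU SPH code relies on a radix sort of cell/particle pairs. A hedged sketch of that step, using the Thrust library that ships with CUDA, is given below; the array names and the single reordered property are assumptions for illustration, not the SPHysics code itself.

```cpp
// Sketch only: reorder particle data by the cell index each particle occupies,
// so that particles in the same cell sit contiguously in memory for the
// neighbour search. Array names and sizes are illustrative assumptions.
#include <cstddef>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/gather.h>

void reorderParticlesByCell(thrust::device_vector<unsigned int>& cellOfParticle,
                            const thrust::device_vector<float>& posX,
                            thrust::device_vector<float>& posXSorted)
{
    const std::size_t n = cellOfParticle.size();

    // Permutation recording where each particle moves to
    thrust::device_vector<unsigned int> order(n);
    thrust::sequence(order.begin(), order.end());

    // Radix sort of (cell id, particle index) pairs on the GPU
    thrust::sort_by_key(cellOfParticle.begin(), cellOfParticle.end(), order.begin());

    // Gather one property array into its sorted layout (repeat per property)
    thrust::gather(order.begin(), order.end(), posX.begin(), posXSorted.begin());
}
```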
The HOSTA CFD code, used for higher order simulations with complex geometries, was parallelized by Xu et al. employing CUDA. In this work the cell loop and the block loop were mapped to CUDA kernels and CUDA streams respectively: a CUDA kernel performs the calculation for a grid block, and a CUDA stream is used to index a particular grid block. Different CUDA configurations and implementations are employed for the various CUDA kernels [114]. The different types of CFD problems belonging to various areas and the related CFD codes parallelized using CUDA are described in the above literature survey. They are listed in table 4 along with the number of grid points used and the test cases.

4.3.1 Benefits of using CUDA in CFD

Benefits of using CUDA in parallel computing of CFD codes are provided below.
– The multi-core architecture of the GPU can be utilized using CUDA [115].
– CUDA has a memory hierarchy similar to that of other conventional processors [43].
– CUDA can handle computationally huge problems with many GPUs in every node [23].
– The memory hierarchy of CUDA is similar to the memory hierarchy of a conventional multiprocessor [31].
– Significant speedup can be obtained using CUDA on GPUs for many general purpose computations [103].
– CUDA provides cheap computing, consuming less energy and needing less workspace [25].

4.3.2 Issues of using CUDA in CFD

Issues of using CUDA in parallel computing of CFD codes are provided below.
– Optimizing memory access by making the best use of shared memory is a challenge for parallelization employing CUDA [23].
– The heterogeneous environment of the storage system is not encapsulated by CUDA [43].
– The chief performance constraint of GPU computations using CUDA is the memory transfer between CPU and GPU [31].
– Enhancing the computational power of a GPU using CUDA for a CFD code needs many changes in the code and optimized compilers [115].
4.4 Parallelization using hybrid OpenMP+MPI

Methodologies of various CFD problems and the parallelization strategy of the pertaining CFD codes using hybrid OpenMP+MPI are described below.
Table 4 Parallelization using CUDA

CFD Field/Problem | Code | Mesh/Grid Points in millions | Case Study | Reference
Two-phase incompressible flow | NaSt3DGPF | 1 | Air bubble rising in water | [43]
Incompressible fluid flow | — | 30 | Lid-driven cavity | [103]
Second-order double precision finite volume Boussinesq code | — | 28 | Rayleigh-Bénard convection problems | [23]
Simulation of internal fluid flows | — | 10 | Lid-driven cavity, shock tube flow, flat plate flow, channel flow | [31]
Simulation of particle coagulation dynamics | — | 0.01 | Free molecular regime | [115]
Smoothed Particle Hydrodynamics | SPHysics | 1 | Dam break flow impacting an obstacle | [25]
Simulation of flow with complex geometries | HOSTA | 800 | EET high-lift airfoil configuration, China's large civil airplane | [114]
Basermann et al. studied the CFD codes TRACE and TAU. The CFD code TRACE uses structured grids and a finite volume approach for solving the Reynolds-averaged Navier-Stokes equations to compute the viscous and convective fluxes. The equations are discretized with an implicit time scheme, which is solved by a Backward-Euler scheme and a Gauss-Seidel relaxation algorithm [10]. Basermann et al. parallelized the CFD code TAU, meant for simulation of external flows, applying a hybrid parallelizing strategy with OpenMP and MPI as the parallelizing tools; a single program multiple data method was employed for 10 to 50 million grid points [10].

Fluid turbulence is a quintessential petascale application that arises from interactions at all temporal and spatial scales. Pseudospectral computations of fluid turbulence were studied by Mininni et al., in which the simulations are based on the incompressible Navier-Stokes equations. These equations are solved with a pseudo-spectral method in which each component is expressed as a Galerkin expansion in a Fourier basis; the nonlinear term is calculated in physical space and then transformed to spectral space using the fast Fourier transform [78]. Mininni et al. used a hybrid MPI+OpenMP method for the parallelization of the CFD code developed for pseudospectral computation of fluid turbulence. The MPI processes provide a coarse grained parallelization employing domain decomposition, while OpenMP loop-level constructs, together with multi-threaded fast Fourier transforms, provide an inner level of parallelization within each MPI task. The MPI communication for the transpose takes place outside the OpenMP parallel regions, in the master-only mode of hybrid computing. To present this inner level of parallelization and to show the two different ways of decomposition, the authors give a code fragment expressing the transpose within a slab with the use of OpenMP directives, which is essential for calculating the FFT [78].
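A minimal sketch of the "master-only" hybrid step described above is given below: OpenMP threads share the per-plane work inside each MPI rank, while the all-to-all transpose communication is issued outside the parallel region. The function and variable names, the placeholder per-plane work and the assumption of a slab decomposition that divides evenly among ranks are illustrative, not the authors' code; MPI initialized with at least MPI_THREAD_FUNNELED is assumed.

```cpp
// Sketch only: "master-only" hybrid MPI+OpenMP step for a slab-decomposed field.
// Each MPI rank owns one slab; OpenMP threads share the loop work inside the
// rank, and the all-to-all transpose is posted outside any parallel region.
#include <mpi.h>
#include <vector>

void hybridStep(std::vector<double>& slab, std::vector<double>& transposed,
                MPI_Comm comm)
{
    // Inner level: OpenMP threads share the per-plane work inside this MPI rank
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(slab.size()); ++i)
        slab[i] *= 2.0 / 3.0;   // placeholder for the real per-plane FFT/nonlinear work

    // Outer level (master-only mode): MPI is called outside the OpenMP parallel
    // region; this all-to-all performs the global transpose between the slabs.
    int nranks = 1;
    MPI_Comm_size(comm, &nranks);
    int chunk = static_cast<int>(slab.size()) / nranks;   // assumes even division
    MPI_Alltoall(slab.data(), chunk, MPI_DOUBLE,
                 transposed.data(), chunk, MPI_DOUBLE, comm);
}
```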
Deterministic and stochastic CFD problems, which demonstrate inherent hierarchical structures when discretized with a spectral/hp element method, were analysed by Dong and Karniadakis. For deterministic CFD problems the Navier-Stokes equations are applied, and an integrated spectral element-Fourier discretization is used to accommodate the need for high order as well as the effective handling of multiply connected computational fields in the non-homogeneous planes. Similarly, hierarchical structures arise when a stochastic investigation is applied using generalized polynomial chaos [28]. The hybrid parallelization of deterministic and stochastic CFD problems was achieved by Dong and Karniadakis. OpenMP and MPI were employed, where the flow domain was decomposed along the homogeneous
direction. With every process computing a single sub-domain, multiple MPI processes are employed at the outer level. At the inner level, multiple OpenMP threads handle the computations within each MPI process, working on the sub-domain in parallel. Data exchange across sub-domains is implemented with MPI, while access to shared objects by multiple threads within each process is coordinated with OpenMP synchronizations. To enable the coarse grain approach, a single parallel region confines the time-integration loops at the uppermost level, which avoids the overhead of thread creation and destruction built into fine grain programs. The computations within a sub-domain are shared by the OpenMP threads by working on disjoint sections of the vectors or on disjoint groups of elements [28].

FUN3D is a CFD code developed for design optimization of automobiles, airplanes and submarines, with irregular meshes consisting of several million mesh points. This FUN3D code was demonstrated by Gropp et al.; it uses a vertex-centered, tetrahedral unstructured mesh and solves the compressible and incompressible Navier-Stokes and Euler equations. For the viscous terms the code uses a Galerkin discretization, and for approximating the convective fluxes a variable-order Roe scheme alongside a control volume discretization is employed. The Newton correction resulting from the Jacobian system is solved with a Krylov method, relying directly on matrix-free Jacobian-vector product operations [45]. The CFD code FUN3D was parallelized by Gropp et al. using a mixed-mode OpenMP and MPI paradigm. The authors mention that the hybrid tool appears more effective at using shared memory than heavyweight subdomain-based processes, particularly if the number of processors is large. MPI alone results in a slower rate of convergence as the number of subdomains, which equals the number of MPI processes, grows. The result of the hybrid model is shorter execution time and a faster convergence rate when working with fewer, chunkier subdomains equal in number to the nodes, in spite of some redundant work when the data from the two threads are joined owing to the lack of a vector reduce operation in the OpenMP standard [45].

Mudigere et al. employed the CFD code FUN3D to optimize the performance of the forward solver used in the incompressible system of an ideal gas [80]. Mudigere et al. parallelized the CFD code PETSc-FUN3D using an MPI-only and a hybrid MPI+OpenMP strategy. The authors examined shared-memory optimizations and reported that the hybrid version performs better than the MPI-only parallelization approach [80].

The two-dimensional Navier-Stokes equations for incompressible flow were discretized by Selvam and Hoffmann with the exact projection method. A higher order finite difference Weighted Essentially Non-Oscillatory scheme and Chorin's exact projection algorithm are applied as the numerical method for the incompressible flow equations; a Poisson solver and Runge-Kutta integration are further employed to obtain the pressure and velocity respectively [93]. Using a hybrid MPI+OpenMP and an MPI-only strategy, Selvam and Hoffmann parallelized this CFD code for incompressible flow. They compared the results and concluded that the hybrid approach gives better performance for the solver than MPI alone. Using MPI, the two-dimensional time-dependent diffusion equation was parallelized, and the higher order finite difference scheme is parallelized with many MPI routines. Fine grain parallelization in a few parts of the code is achieved by incorporating OpenMP [93].

The simulation of environmental flood problems was done by Shang using the CFD software Telemac. Telemac is based on finite element methods, employing the Saint-Venant equations to simulate free surface fluid flow; these equations are basically derived from the Navier-Stokes equations under a few assumptions. The reverse Cuthill-McKee method was applied to reorder the mesh elements to a lower bandwidth, and a fine mesh was developed to refine the given number of nodes and elements [95]. The CFD software Telemac was parallelized by Shang using a hybrid parallelizing strategy with OpenMP and MPI. The Telemac software was recoded with hybrid parallel programming in which OpenMP provides the parallel environment by its fine grain parallelism and MPI provides it by domain partitioning with coarse grain parallelism: MPI performs the partitioning of the domain among different nodes and OpenMP works within each node [95].

Jin et al. used the CFD softwares OVERFLOW and AKIE for the simulation of an airfoil and of turbomachinery respectively. The software OVERFLOW, mentioned by the same authors, has been briefly discussed above. The AKIE code is used in turbomachinery applications to analyze the instability of three-dimensional flows around the rotor blades; AKIE solves the partial differential equations to obtain alternating solutions among the blades for the flow around every blade [59]. Jin et al. applied a hybrid parallel programming strategy using OpenMP and MPI for these pseudo-application based CFD codes. An airfoil and turbomachinery rotor blades were chosen as the practical applications, studied using the CFD software OVERFLOW and AKIE. Hybrid parallelization was achieved for both CFD softwares, and Jin et al. reported that this strategy gives many benefits such as reduction in the overhead and memory footprint related to MPI, good numerical convergence and stability, and improved load balancing [59].
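The coarse-grain scheme attributed to [28], in which a single OpenMP parallel region encloses the whole time-integration loop so that threads are created once and each thread updates a disjoint range of elements, can be sketched roughly as follows. The names, the per-element update and the placeholder exchange are assumptions, not the authors' code; an MPI library initialized with at least MPI_THREAD_FUNNELED is assumed.

```cpp
// Sketch only: coarse-grain hybrid step structure. One OpenMP parallel region
// encloses the whole time loop, so threads are created once; each thread owns a
// disjoint range of elements, and one thread per step handles the MPI exchange.
#include <mpi.h>
#include <omp.h>
#include <vector>

void timeIntegrate(std::vector<double>& u, int nSteps, MPI_Comm comm)
{
    const long n = static_cast<long>(u.size());

    #pragma omp parallel          // threads created once, not once per step
    {
        const int  tid = omp_get_thread_num();
        const int  nth = omp_get_num_threads();
        const long lo  = n * tid / nth;          // disjoint element range of this thread
        const long hi  = n * (tid + 1) / nth;

        for (int step = 0; step < nSteps; ++step) {
            for (long i = lo; i < hi; ++i)
                u[i] += 0.5 * u[i];              // placeholder for the per-element update

            #pragma omp barrier                  // all threads finish the step

            #pragma omp master                   // only one thread talks to MPI
            {
                MPI_Barrier(comm);               // placeholder for the sub-domain exchange
            }
            #pragma omp barrier                  // others wait until the exchange is done
        }
    }
}
```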
Table 5 Parallelization using hybrid OpenMP+MPI

CFD Field/Problem | Code | Mesh/Grid Points in millions | Case Study | Reference
Simulation of external flows | TAU | 50 | Applied in German Aerospace Center | [10]
Pseudospectral computations of fluid turbulence | — | 100 | Rectangular slab | [78]
Deterministic and stochastic CFD problems | NetKar | 0.6 | Turbulent flow past a circular cylinder | [28]
Design optimization of automobiles, airplanes and submarines | FUN3D | 0.3 | — | [45]
Investigation of FUN3D | FUN3D | >2 | — | [80]
Incompressible fluid flow | — | 0.1 | Lid-driven cavity | [93]
Environmental flood flow | Telemac | 1.6 | Malpasset, a benchmark problem used to simulate the flood flow after a dam break | [95]
Pseudo based application | AKIE and OVERFLOW | 36 | Turbomachinery rotor blades | [59]
Simulation of effect of environment on vortices | IR3D | 30 | — | [62]
Analysis of high fidelity aerodynamic shapes | OVERFLOW-D | 30 | — | [27]
High lift configurations | — | 24.7 | 3-element airfoil, wing body | [77]
The simulation of the effects of the environment on trailing vortices was investigated by Jost and Robins using the CFD code IR3D. The control surfaces of underwater vehicles develop vortices whose behaviour depends on environmental effects such as the ambient temperature and the underwater currents. The numerical solution of the Boussinesq equations governing the IR3D code is found employing an explicit second-order Adams-Bashforth time stepping scheme; Fourier transforms and higher order compact methods are used to compute the derivatives. A Poisson solver is employed to solve the Poisson equation, and a projection algorithm is used to maintain the character of the flow [62]. The CFD code IR3D was parallelized using a hybrid OpenMP and MPI strategy by Jost and Robins. Coarse and fine grained parallelism is achieved through MPI and OpenMP respectively, and the performance increased compared to MPI-only parallelization.
The communication overhead is reduced by decreasing the number of MPI processes, and for nodes having idle cores shared memory OpenMP parallelization is applied, thus enabling each MPI process to execute its computationally intensive phase with the use of multiple threads [62]. The CFD software OVERFLOW-D, used for the simulation and analysis of flow over complex aerodynamic shapes, was analyzed by Djomehri and Jin. This dynamic code helps simplify the modelling of bodies in relative motion. The Reynolds-averaged Navier-Stokes equations are discretized using a finite difference scheme in space along with an implicit time stepping scheme, as mentioned for the OVERFLOW software earlier; here the domain is decomposed onto structured grids [27]. Djomehri and Jin implemented parallelization for the OVERFLOW-D CFD code and studied the scalability of the parallel schemes on various architectures.
In this case it was reported that parallelization across a cluster of shared-memory nodes, with shared memory parallelization within each node, performs better than the MPI-only approach. The group loop of the code is run in parallel by MPI, and similarly the very time-consuming grid loop is run in parallel by OpenMP. Here the load is statically balanced, and each MPI process decomposes its work in shared memory using OpenMP [27]. Mavriplis extended his parallelization approach from one level to two levels, where MPI and OpenMP are used: MPI is employed to communicate between the groups of partitioned subdomains, and OpenMP is employed for communication among the subdomains within an MPI process [77]. The different types of CFD problems belonging to various areas and the related CFD codes parallelized using OpenMP+MPI are described in the above literature survey. They are listed in table 5 along with the number of grid points used and the test cases.

4.4.1 Benefits of using hybrid OpenMP+MPI in CFD

Benefits of using hybrid OpenMP+MPI in parallel computing of CFD codes are provided below.
– The computational performance can be increased by exploiting the strengths of MPI and OpenMP and overcoming their respective weaknesses in the hybrid parallelization approach [95].
– OpenMP can be employed within a node for fine grain parallelization and MPI over distributed memory nodes for communication between them. Hence the benefits of both models can be used: performance across nodes is kept by MPI, while the number of MPI processes required within a node is reduced using OpenMP, which further reduces the related overhead [59].
– Shared memory parallelization within a node on clusters of multi-core nodes can be achieved by using the hybrid MPI and OpenMP model [45].
– The issues of overhead and large memory requirements related to MPI can be reduced by employing the hybrid OpenMP and MPI model [10].
– Better computational performance of the CFD code is observed by implementing OpenMP+MPI parallelization, compared to OpenMP-only or MPI-only parallelization [93].
– Hybrid approaches such as MPI+OpenMP parallelization can be easily employed for applications with multiple levels of parallelism [8].
– Using OpenMP for communication inside nodes and MPI for communication between nodes, an improvement in code scalability on multi-core nodes can be obtained [42].
4.4.2 Issues of using hybrid OpenMP+MPI in CFD

Issues of using hybrid OpenMP+MPI in parallel computing of CFD codes are provided below.
– In the hybrid OpenMP+MPI parallelization strategy, a well defined interaction between the OpenMP threads and the MPI processes is not observed [59].
– The performance of OpenMP is limited in this hybrid parallelization approach [10].
– OpenMP directives limit the control of the application over the threads in the hybrid strategy [28].
– Hybrid programming has to be avoided if the exploitation of multiple levels of parallelism is to be kept simple [62].
4.5 Parallelization using hybrid OpenMP+CUDA

Methodologies of various CFD problems and the parallelization strategy of the pertaining CFD codes using hybrid OpenMP+CUDA are described below.

The numerical analysis of particulate flows, found generally in environmental and engineering applications, was done by Yue et al. using the CFD software Trubal. Trubal is based on the Discrete Element Method (DEM), which predicts the motion of the particulate fluid flow by tracking each individual particle. DEM in turn depends on Newton's second law of motion to compute the behaviour of each particle in the fluid flow. The interactions or collisions between the dynamic particles are calculated based on interaction laws such as a linear law, a non-linear law or a hard contact law, all of which are based on classical contact mechanics; the displacement-force relationship is calculated based on Hertz theory [117]. Yue et al. implemented parallel computing of the CFD software Trubal, which performs the numerical simulation of particulate fluid flow. The parallelizing strategy adopted uses a CPU-GPU heterogeneous architecture. The heterogeneous parallelization has two main parts: the CPU, which performs the computations of the part of the code not suitable for data parallelism (for example complex logical transactions), and the GPU, which is responsible for the computations that are of intensively large scale. Optimization of the different types of memory was achieved by using shared memory, to decrease access latency by suspending threads temporarily, and texture memory for random access, aiming to maximize the GPU throughput and thus promote the overall parallel performance [117].
Xu et al. also parallelized the CFD code HOSTA using a hybrid OpenMP and CUDA approach. In this strategy two OpenMP threads are created: one OpenMP thread launches the CUDA kernels which deal with the computation of the GPU grid blocks, and the other thread computes the grid blocks dedicated to the CPU. Compared to GPU-only parallelization, the hybrid scheme showed improved performance [114]. A few CFD problems and the related CFD codes parallelized using OpenMP+CUDA are described in the above literature survey. They are listed in table 6 along with the number of grid points used and the test cases.

4.5.1 Benefits of hybrid OpenMP+CUDA in CFD

Benefits of using hybrid OpenMP+CUDA in parallel computing of CFD codes are provided below.
– In OpenMP and CUDA heterogeneous parallelization, logical computations can be performed on the CPU and intensive computations on the GPUs. This method is cost effective and avoids the computational performance limits of the CPU [117].
– To optimize the usage of CPU and GPU, computations can be performed on both using OpenMP and CUDA [114].

4.5.2 Issues of hybrid OpenMP+CUDA in CFD

Issues of using hybrid OpenMP+CUDA in parallel computing of CFD codes are provided below.
– In this approach, balancing the load between GPU and CPU is difficult [52].
– Proficient utilization of collaborating CPUs and GPUs employing OpenMP and CUDA for complex CFD problems is a big challenge [114].
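A hedged sketch of the two-thread OpenMP+CUDA split described above for HOSTA is shown below: one OpenMP thread launches the CUDA kernel for the GPU grid block while the other works through the CPU grid block. The kernel, function names and block sizes are illustrative assumptions, not the HOSTA implementation.

```cpp
// Sketch only: one OpenMP thread drives the GPU, the other computes CPU blocks.
#include <cuda_runtime.h>
#include <omp.h>

__global__ void updateBlockKernel(double* blk, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) blk[i] += 1.0;                // placeholder for the real flux/update kernel
}

void computeBlockOnCpu(double* blk, int n)   // placeholder CPU path
{
    for (int i = 0; i < n; ++i) blk[i] += 1.0;
}

void heterogeneousSweep(double* gpuBlock, int nGpu, double* cpuBlock, int nCpu)
{
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            // Thread 0: launch the kernel for the GPU-resident grid block
            updateBlockKernel<<<(nGpu + 255) / 256, 256>>>(gpuBlock, nGpu);
            cudaDeviceSynchronize();
        } else {
            // Thread 1: work through the grid block assigned to the CPU
            computeBlockOnCpu(cpuBlock, nCpu);
        }
    }   // implicit barrier: both partitions are finished here
}
```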
4.6 Parallelization using hybrid MPI+CUDA

Methodologies of various CFD problems and the parallelization strategy of the pertaining CFD codes using hybrid MPI+CUDA are described below.

AlOnazi et al. optimized OpenFOAM-based CFD applications, which solve partial differential equations. The first application considered is icoFoam, which has the incompressible laminar Navier-Stokes equations as governing equations; for its time stepping loop the Pressure Implicit with Splitting of Operators (PISO) algorithm is applied, which has three important stages that use different equations to obtain different solutions. The next application is laplacianFoam, which solves the Laplace equation; to solve for temperature in this second application, a Conjugate Gradient solver was used [2].
AlOnazi et al. parallelized the OpenFOAM-based CFD applications using a hybrid strategy combining MPI and CUDA. The computations run in parallel on GPUs and processors: process level parallelism is employed using MPI, with a GPU connected in turn to each MPI process. CUDA accelerates the computation on the GPU, while the processors that are not connected to a GPU run their computations locally. The Thrust and CUSP libraries were used to implement the CUDA kernels. The GPU and CPU computational flow starts by dividing the computational domain into subdomains using the OpenFOAM decomposePar utility, and the Conjugate Gradient solver combines CUDA kernels and MPI routines [2].

Jacobsen and Thibault investigated incompressible fluid flow computations with massive meshes on heterogeneous computer systems. The Navier-Stokes equations are used for incompressible fluid flows driven by buoyancy, modelled by the Boussinesq approximation with small temperature changes in the momentum equation. The governing equations are discretized on a uniform grid using a second order central difference scheme for the spatial derivatives and a second order accurate Adams-Bashforth scheme for the time derivatives. A projection algorithm is employed to obtain the solution of the governing Navier-Stokes equations for incompressible fluid flow, and a Jacobi iterative solver is used to solve the Poisson equation [53]. Jacobsen and Thibault achieved parallelism using hybrid MPI+CUDA for the CFD software based on the three-dimensional incompressible Navier-Stokes equations. NVIDIA's CUDA paradigm was adopted for fine grain parallelization within each GPU, and the MPI library was employed for coarse grain parallelization across the HPC cluster. The three-dimensional volume is divided into one-dimensional layers which are distributed among the GPUs of the computer cluster to obtain a one-dimensional domain decomposition. The parallel environment is implemented with one MPI process per GPU, ensured by an initial mapping of hosts to each GPU: a master process collects all host names and assigns GPU identifiers to every individual host in such a way that no process on a host has the same identifier, and the result is scattered back. The GPU executes such that the MPI communication happens simultaneously [53]. A few CFD problems and the related CFD codes parallelized using MPI+CUDA are described in the above literature survey. They are listed in table 7 along with the number of grid points used and the test cases.
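The cited implementations attach one MPI process to each GPU by gathering host names on a master process and assigning device identifiers. A simpler sketch of the same idea, using MPI-3 node-local communicators instead of the authors' host-name exchange (so this is an assumed alternative, not their code), is given below.

```cpp
// Sketch only: give every MPI process its own GPU on a multi-GPU node.
#include <mpi.h>
#include <cuda_runtime.h>

int attachRankToGpu()
{
    // Node-local communicator: ranks sharing this node get consecutive local ranks
    MPI_Comm nodeComm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodeComm);

    int localRank = 0;
    MPI_Comm_rank(nodeComm, &localRank);

    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);

    // One process per GPU; wrap around if there are more ranks than devices
    int device = (nDevices > 0) ? localRank % nDevices : 0;
    cudaSetDevice(device);

    MPI_Comm_free(&nodeComm);
    return device;
}
```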
Table 6 Parallelization using hybrid OpenMP+CUDA

CFD Field/Problem | Code | Mesh/Grid Points in millions | Case Study | Reference
Particulate fluid flow | Trubal | >2 | Die casting process | [117]
Simulation of flow with complex geometries | HOSTA | 800 | EET high-lift airfoil configuration, China's large civil airplane | [114]

Table 7 Parallelization using MPI+CUDA

CFD Field/Problem | Code | Mesh/Grid Points in millions | Case Study | Reference
OpenFOAM applications | OpenFOAM | >1 | 3D lid-driven cavity flow | [2]
Incompressible fluid flow | — | 60 | Lid-driven cavity | [53]
4.6.1 Benefits of hybrid MPI+CUDA in CFD

Benefits of using hybrid MPI+CUDA in parallel computing of CFD codes are provided below.
– This hybrid model provides an advantage over CUDA-only programming for fine grain parallelization [52].
– This model provides increased performance for a small number of nodes [2].

4.6.2 Issues of hybrid MPI+CUDA in CFD

Issues of using hybrid MPI+CUDA in parallel computing of CFD codes are provided below.
– In this combination MPI does not support threading [2].
– If this hybrid model is applied to a huge number of nodes, its performance reduces [52].
– Scaling is weak in this approach as the number of nodes increases [52].
4.7 Parallelization using hybrid MPI+Pthreads

Methodologies of various CFD problems and the parallelization strategy of the pertaining CFD codes using hybrid MPI+Pthreads are described below.

Simmendinger and Kügeler examined the CFD code TRACE, a steady and unsteady three-dimensional flow solver for the Reynolds- and Favre-averaged compressible Navier-Stokes equations. The flow solver uses unstructured and structured grids and is focused on the analysis of the physics of turbomachinery flows. It uses a second order accurate Roe upwind scheme for the spatial discretization and a conservative hybrid-grid interfacing algorithm. The solver can be extended with a nonlinear aeroelasticity module, implicit non-reflecting boundary conditions, a multimode transition model, etc. [98].
Simmendinger and Kügeler used the hybrid MPI and Pthreads parallelization tools to make better use of multiprocessor architectures. In this hybrid parallelization approach, Pthreads are used for all processing units of a CPU socket and the enveloping MPI process is bound to that socket. Parallelization with MPI is done by domain decomposition across the nodes of a distributed computer or cluster, while Pthreads are used for loop parallelization across the cores. A profiling run was done, and all loops which needed the most CPU time were subsequently parallelized. Parallelization of the CFD code TRACE has been done for both the structured and the unstructured parts [98]. Basermann et al. applied a hybrid parallelization strategy for the CFD turbomachinery code TRACE. Pthreads and MPI were the paradigms used to parallelize the code, where the enveloping MPI process is bound to the socket and Pthreads are used for all processing units of a CPU socket. A single program multiple data technique was employed for the 1911 million grid points required for the analysis of turbomachinery flow [10]. Parallelization of the three-dimensional multi-block CFD code CALC-MP using MPI and Pthreads was done by Jia and Sunden. MPI implements only the message passing between cores on different nodes, while Pthreads take advantage of the shared-memory feature of these nodes by multi-threading. A simple hybrid MPI+Pthreads programming model was implemented: all the MPI calls, i.e. the inter-node communication, are made by the main thread on every node. The other threads need not be aware of the inter-node communication being performed; hence interaction between the hybrid paradigms is avoided [57].
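A minimal sketch of the MPI+Pthreads pattern described for CALC-MP is given below: POSIX threads share the node-local loop work, and only the main thread makes MPI calls for inter-node communication. The work split, the placeholder update and the reduction are illustrative assumptions, not the CALC-MP code.

```cpp
// Sketch only: main thread does all MPI calls; pthreads share the local loop work.
#include <mpi.h>
#include <pthread.h>
#include <vector>

struct WorkRange { double* data; long lo, hi; };

void* workerLoop(void* arg)                   // node-local computation only, no MPI
{
    WorkRange* w = static_cast<WorkRange*>(arg);
    for (long i = w->lo; i < w->hi; ++i) w->data[i] *= 0.5;   // placeholder update
    return nullptr;
}

void hybridSweep(std::vector<double>& field, int nThreads, MPI_Comm comm)
{
    const long n = static_cast<long>(field.size());
    std::vector<pthread_t> threads(nThreads);
    std::vector<WorkRange> ranges(nThreads);

    for (int t = 0; t < nThreads; ++t) {
        ranges[t] = { field.data(), n * t / nThreads, n * (t + 1) / nThreads };
        pthread_create(&threads[t], nullptr, workerLoop, &ranges[t]);
    }
    for (int t = 0; t < nThreads; ++t) pthread_join(threads[t], nullptr);

    // Inter-node communication is handled by the main thread only
    // (requires MPI initialized with at least MPI_THREAD_FUNNELED)
    double localSum = 0.0, globalSum = 0.0;
    for (long i = 0; i < n; ++i) localSum += field[i];
    MPI_Allreduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, comm);
}
```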
Table 8 Parallelization using MPI+Pthreads

CFD Field/Problem | Code | Mesh/Grid Points in millions | Case Study | Reference
Turbomachinery | TRACE | 4 | Multi-stage compressor, single-stage turbine | [98]
Simulation of external flows | TAU | 50 | Applied in German Aerospace Center | [10]
Fluid flow and heat transfer analysis | CALC-MP | 2 | 3D straight channel, straight square duct | [57]

A few CFD problems and the related CFD codes parallelized using MPI+Pthreads are described in the above literature survey. They are listed in table 8 along with the number of grid points used and the test cases.

4.7.1 Benefits of hybrid MPI+Pthreads in CFD

Benefits of using hybrid MPI+Pthreads in parallel computing of CFD codes are provided below.
– The usage of multi-core architectures for both structured and unstructured grids can be improved by this hybrid model [98].
– This approach avoids communication overhead by splitting the computational zone [10].
– This strategy is feasible as it gives more scale-up [10].

4.7.2 Issues of hybrid MPI+Pthreads in CFD

Issues of using hybrid MPI+Pthreads in parallel computing of CFD codes are provided below.
– MPI+Pthreads needs extra effort to achieve good performance [10].
4.8 Parallelization using ParaWise

The methodology of a CFD problem and the parallelization strategy of the pertaining CFD code using ParaWise are described below.

In the simulation of a fluidized bed, a Lagrangian-Eulerian model of the particulate phase was employed and studied by Kafui et al. In the Lagrangian model, Newton's second law of motion is applied to calculate the angular and linear velocities. Contact mechanics laws for non-adhesive/adhesive particle-particle and particle-wall interactions, frictional and elastic/elastoplastic, were used to calculate the solid contact forces, which were computed with an explicit finite difference scheme. For Newtonian fluids a constitutive equation is utilized, and Di Felice's correlation is used for computing the drag force and viscous stress tensor. Further, the volume-averaged Navier-Stokes equations of momentum and continuity are the base of the Eulerian model, and the second-order Barton scheme is used in the discretization of both the momentum fluxes and the convective mass [63]. For the parallelization of the CFD code developed for the simulation of a fluidized powder bed, Kafui et al. used the ParaWise parallel environment. A cluster of 1536 cores was targeted for parallel running of the code in a distributed memory environment. The strategy used was single program multiple data, in which every core executes the same CFD code but runs on different parts of the data. Particle decomposition was implemented using the ParaWise graph partitioner, which is based on the Jostle graph partitioning algorithm. The ParaWise utilities interface efficiently between the graph partitioner's internal data structures and the application code data structures, and handle the identification of owned particles and the inter-processor communication [63]. A CFD problem and its application parallelized using ParaWise are described in the above literature survey. They are listed in table 9 along with the number of grid points used and the test case.

4.8.1 Benefits of ParaWise in CFD

A benefit of using ParaWise in parallel computing of CFD codes is provided below.
– The ParaWise parallel computing environment gives a good advantage in exploiting the computational power of distributed high performance computing [63].

4.8.2 Issues of ParaWise in CFD

An issue of using ParaWise in parallel computing of CFD codes is provided below.
– The parallel performance of the CFD software reduces as the number of processors increases because of communication overhead [63].
Table 9 Parallelization using ParaWise

CFD Field/Problem | Code | Mesh/Grid Points in millions | Case Study | Reference
Fluidization of a powder bed | — | >1 | Bubbling fluidized powder bed | [63]
4.9 Parallelization using hybrid OpenMP+MPI+CUDA

Methodologies of various CFD problems and the parallelization strategy of the pertaining CFD codes using hybrid OpenMP+MPI+CUDA are described below.

The CFD code HOSTA was demonstrated by Xu et al. to simulate flow with complex geometries. The governing equations are the Navier-Stokes or Euler equations, and the inviscid flux derivatives are discretized with a seventh-order hybrid cell-edge and cell-node Dissipative Compact Scheme and a fifth-order Weighted Compact Non-linear Scheme [114]. Xu et al. parallelized the HOSTA CFD code using a tri-level hybrid strategy, with OpenMP+MPI+CUDA as the parallelizing tools for the domain decomposition. For each node of the supercomputer TianHe-1A, one MPI process was created to manage the GPU and the two CPUs. The grid blocks of the MPI process on each node are allocated between the intranode GPU and CPUs. Each CPU-computed grid block is logically divided into many OpenMP chunks, i.e. OpenMP directives were added over the outermost cell loop, and each OpenMP chunk is computed by an OpenMP thread. The grid blocks computed by the GPU are issued concurrently on multiple CUDA streams; each grid block is partitioned logically into CUDA chunks updated by CUDA thread blocks, and finally a CUDA thread calculates each single cell in the chunk [114].

The buoyancy-driven incompressible fluid flow was analyzed by Jacobsen and Senocak. The governing Navier-Stokes equations are discretized using different schemes: for the time derivatives a second order accurate Adams-Bashforth scheme is employed, and for the spatial derivatives a second order central difference scheme is used on a uniform Cartesian staggered mesh. Chorin's projection algorithm is used in this study, and the related pressure Poisson equation is solved using a geometric multigrid solver or a fixed-iteration Jacobi solver [52].
Jacobsen and Senocak used a tri-level parallelizing strategy with OpenMP, MPI and CUDA for this CFD code, based on buoyancy-driven incompressible flow, on multi-GPU clusters with heterogeneous architectures. MPI is used for coarse grain parallelization over the cluster, the CUDA parallelizing paradigm for the fine grain parallel implementation, and OpenMP for the parallel operations within individual compute nodes. The parallel computations of incompressible flow were investigated on the NCSA Lincoln Tesla cluster [52]. A few CFD problems and the related CFD codes parallelized using OpenMP+MPI+CUDA are described in the above literature survey. They are listed in table 10 along with the number of grid points used and the test cases.

4.9.1 Benefits of hybrid OpenMP+MPI+CUDA in CFD

Benefits of using hybrid OpenMP+MPI+CUDA in parallel computing of CFD codes are provided below.
– The computational power of supercomputers can be utilized employing this strategy [114].
– Many block structured grids can be simulated using this tri-level hybrid approach [114].

4.9.2 Issues of hybrid OpenMP+MPI+CUDA in CFD

Issues of using hybrid OpenMP+MPI+CUDA in parallel computing of CFD codes are provided below.
– Programming complexity in utilizing a cluster of CPUs and GPUs [114].
– As the number of nodes increases, the scalability of this tri-level hybrid model reduces [52].
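The GPU part of the tri-level scheme described above, in which each GPU grid block is issued on its own CUDA stream so that independent block updates can overlap, can be sketched as below. The kernel and the per-cell update are placeholders, not HOSTA code, and the MPI/OpenMP levels are assumed to surround this call as described in the text.

```cpp
// Sketch only: issue one kernel per grid block on its own CUDA stream so that
// independent block updates can overlap on the GPU (GPU level of the tri-level scheme).
#include <cstddef>
#include <cuda_runtime.h>
#include <vector>

__global__ void updateGridBlock(double* blk, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) blk[i] += 1.0;                         // placeholder cell update
}

void launchBlocksOnStreams(std::vector<double*>& deviceBlocks, int cellsPerBlock)
{
    std::vector<cudaStream_t> streams(deviceBlocks.size());
    for (std::size_t b = 0; b < deviceBlocks.size(); ++b) {
        cudaStreamCreate(&streams[b]);
        int tpb = 256, blocks = (cellsPerBlock + tpb - 1) / tpb;
        updateGridBlock<<<blocks, tpb, 0, streams[b]>>>(deviceBlocks[b], cellsPerBlock);
    }
    for (std::size_t b = 0; b < streams.size(); ++b) {
        cudaStreamSynchronize(streams[b]);            // wait for each block to finish
        cudaStreamDestroy(streams[b]);
    }
}
```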
5 Parallel Computing Issues and Suggestions for Future Work

In this section, a review of the issues related to parallel computing of CFD software is provided. A few suggestions for future work in parallel computing of CFD software are also listed.
5.1 Issues related to parallel computing of CFD software

Issues related to parallel computing of CFD software are mentioned below.
Table 10 Parallelization using OpenMP+MPI+CUDA

CFD Field/Problem | Code | Mesh/Grid Points in millions | Case Study | Reference
Simulation of flow with complex geometries | HOSTA | 800 | EET high-lift airfoil configuration, China's large civil airplane | [114]
Buoyancy-driven incompressible flow | — | >1 | Lid-driven cavity and natural convection in a heated cavity | [52]
– Utilizing hardware resources to attain a significantly high performance computing code is still a complex job which takes years to master [31].
– For the generic test case the indicated speedup is approximately 75% of the theoretical value, but it drops to 65% in the real application. The cause behind this is still not understood and needs more investigation [98].
– Load balancing and communication are undoubtedly limiting factors of the computational efficiency between the different cores when several hundreds of cores are employed [42].
– Beyond a few cores of a conventional architecture, no improvement in computational performance is noticed [103].
– Computational performance keeps reducing on nodes that integrate many cores, especially for data-intensive codes such as CFD software [78].
– Adopting parallel computing to utilize the capability of the available multi-core architectures is a challenging task [23].
– Rewriting and optimizing the sequential CFD code may be required to better utilize the computational power of existing multi-core GPUs and CPUs [93].
– The memory transfer bottleneck between CPU and GPU needs to be investigated [25].
– Developing and optimizing complicated CFD software on existing multi-core accelerated high performance computing systems is a big challenge in utilizing the capability of heterogeneous systems, particularly when CPUs and GPUs are collaborating [80].

5.2 Suggestions for future work in parallel computing of CFD

Suggestions for future work in parallelization of CFD codes are mentioned below.
– For applications involving different modes of parallelism, such as multiphysics problems, OpenMP threads give significant merits compared to MPI processes; therefore OpenMP suits multiple modes of parallelism better.
– The data locality issue associated with the OpenMP tool can be overcome by carefully developing the OpenMP program for scalable performance.
– Using a nested OpenMP program is one option which can be used to obtain performance comparable to OpenMP+MPI.
– Because of limited memory management schemes and memory bandwidth, no improvement in computational performance is noticed beyond a few cores; this can most likely be solved by increasing the integration of memory, which leads to improved memory bandwidth.
– To get better scaling results across multiple GPUs, one CPU core needs to be devoted to each active GPU.
– The parallel performance of MPI alone is better than hybrid OpenMP+MPI parallelization, but only within a few cores; hybrid parallelization outperforms MPI-only parallelization once the MPI performance stops improving beyond a certain number of cores and grid zones.
– The OpenMP paradigm is better suited to shared memory systems than MPI; hence OpenMP should be used on shared memory systems for parallelization.
– To enhance the compatibility of an OpenMP program with the time stepping scheme for solving high Reynolds number flows, a grid reordering method has to be employed for hybrid unstructured grids.
6 Conclusion In this article we have presented a comprehensive literature review on parallelization of CFD codes belonging to various CFD domains, utilizing various parallel computing tools. The contribution of this article in parallelization of CFD codes is multifold, namely:
– Provided an intensive review of various major areas where CFD is applied for simulation of different fluid flow situations.
– Identified the computational time related issues of the massive grid points of complex configurations in CFD software codes.
– Described parallel computing and the related major tools used for parallelization of CFD codes.
– Provided a detailed review of different CFD areas, problems and parallelization strategies for the related CFD codes using multiple parallel computing tools.
– Highlighted various benefits and issues associated with specific parallel computing models employed in parallelization of CFD software.
– Mentioned the general parallel computing issues encountered in CFD code parallelization and provided a few suggestions for future work in parallelization of CFD codes.

The CFD areas where parallelization has been largely attempted in general include: aerodynamics, automotives, marine, spacecraft, turbomachinery, fluidized beds, lid-driven cavities and molecular dynamics. There are still many fields of CFD where parallelization of CFD codes has not been much attempted even though they suffer from huge computational time issues, as follows:
– Food industry
– Bio-medical industry
– Chemical processing industry
– Meteorology
– Civil engineering industries
– HVAC systems
– Acoustics industry
– Fire and smoke modeling
– Thermal power plants
– Nuclear power plants

The areas identified above need the attention of the research community for parallelization of the related CFD software to reduce the computational time cost.

Tools popularly employed for parallelization of CFD codes include OpenMP, MPI, CUDA and hybrid OpenMP+MPI. There are a few more parallel computing tools that are not much utilized for parallelization of CFD codes, but have good potential:
– Open Computing Language (OpenCL) - can be employed for utilizing the computational power of GPUs in CFD applications.
– Open Accelerators (OpenACC) - this tool can be used for CFD applications on heterogeneous systems.
– MapReduce - can be used to generate and utilize parallel programs on Hadoop clusters.
– POSIX threads (Pthreads) - works as a parallel computing tool on shared memory architectures.
– Hybrid OpenCL+OpenACC - can be used in porting applications to many-core GPUs.
– PETSc - a library for parallel CFD application developers to implement parallel equation solvers with ease.
– ParaWise - an automatic parallelization code generator for existing serial FORTRAN programs; it can be used for legacy CFD applications written in FORTRAN.

Finally, the world of CFD continues to grow enormously along with the computational power of CPUs and GPUs. As mentioned earlier, CFD software needs huge computational time for the simulation of complex configurations. If properly planned and programmed (adopting suitable parallel computing strategies), the growing computational power of CPUs and GPUs can be fully utilized to obtain the CFD results much faster, saving computational time costs.
References
1. Accary, G., Bessonov, O., Fougère, D., Meradji, S., Morvan, D.: Optimized parallel approach for 3d modelling of forest fire behaviour. In: Parallel Computing Technologies, pp. 96–102. Springer (2007)
2. AlOnazi, A., Keyes, D., Lastovetsky, A., Rychkov, V.: Design and optimization of openfoam-based cfd applications for hybrid and heterogeneous hpc platforms. arXiv preprint arXiv:1505.07630 (2015)
3. Amritkar, A., Deb, S., Tafti, D.: Efficient parallel cfd-dem simulations using openmp. Journal of Computational Physics 256, 501–519 (2014)
4. Amritkar, A., Tafti, D., Liu, R., Kufrin, R., Chapman, B.: Openmp parallelism for fluid and fluid-particulate systems. Parallel Computing 38(9), 501–517 (2012)
5. Andersson, B., Ålund, A., Mark, A., Edelvik, F.: Mpi-parallelization of a structured grid cfd solver including an integrated octree grid generator. Tech. rep., Chalmers University of Technology (2013)
6. Andrews, P.L.: Current status and future needs of the behaveplus fire modeling system. International Journal of Wildland Fire 23(1), 21–33 (2014)
7. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., et al.: The landscape of parallel computing research: A view from berkeley. Tech. rep., Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley (2006)
8. Ayguade, E., Gonzalez Tallada, M., Martorell, X., Jost, G.: Employing nested openmp for the parallelization of multi-zone computational fluid dynamics applications. In: Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, p. 6. IEEE (2004)
9. Balaji, P., Buntinas, D., Goodell, D., Gropp, W., Thakur, R.: Fine-grained multithreading support for hybrid threaded mpi programming. International Journal of High Performance Computing Applications 24(1), 49–57 (2010)
10. Basermann, A., Kersken, H.P., Schreiber, A., Gerhold, T., Jägersküpper, J., Kroll, N., Backhaus, J., Kügeler, E., Alrutz, T., Simmendinger, C., et al.: Hicfd: Highly efficient implementation of cfd codes for hpc many-core architectures. In: Competence in High Performance Computing 2010, pp. 1–13. Springer (2012)
11. Baskaran, M.M., Ramanujam, J., Sadayappan, P.: Automatic c-to-cuda code generation for affine programs. In: Compiler Construction, pp. 244–263. Springer (2010)
12. Berger, M.J., Aftosmis, M.J., Marshall, D., Murman, S.M.: Performance of a new cfd flow solver using a hybrid programming paradigm. Journal of Parallel and Distributed Computing 65(4), 414–423 (2005)
13. Blazewicz, M., Brandt, S.R., Diener, P., Koppelman, D.M., Kurowski, K., Löffler, F., Schnetter, E., Tao, J.: A massive data parallel computational framework for petascale/exascale hybrid computer systems. arXiv preprint arXiv:1201.2118 (2012)
14. de Boer, A.H., Hagedoorn, P., Woolhouse, R., Wynn, E.: Computational fluid dynamics (cfd) assisted performance evaluation of the twincer disposable high-dose dry powder inhaler. Journal of Pharmacy and Pharmacology 64(9), 1316–1325 (2012)
15. Bohbot, J., Knop, V., Laget, O., Angelberger, C., Réveillé, B.: High performance 3d cfd codes for complex piston engine applications. In: International Multidimensional Engine Modeling User's Group Meeting at the SAE Congress (2010)
16. Bosshard, C., Bouffanais, R., Deville, M., Gruber, R., Latt, J.: Computational performance of a parallelized three-dimensional high-order spectral element toolbox. Computers & Fluids 44(1), 1–8 (2011)
17. Boukhanouf, R., Haddad, A.: A cfd analysis of an electronics cooling enclosure for application in telecommunication systems. Applied Thermal Engineering 30(16), 2426–2434 (2010)
18. Boulet, M., Marcos, B., Dostie, M., Moresoli, C.: Cfd modeling of heat transfer and flow field in a bakery pilot oven. Journal of Food Engineering 97(3), 393–402 (2010)
19. Caraeni, M., Devaki, R., Aroni, M., Oswald, M., Caraeni, D.: Efficient acoustic modal analysis for industrial cfd. In: 47th AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition (2009)
20. Chandra, S., Lee, A., Gorrell, S., Jensen, C.G.: Cfd analysis of pace formula-1 car. Brigham Young University (2011)
21. Chen, F., Bornstein, R., Grimmond, S., Li, J., Liang, X., Martilli, A., Miao, S., Voogt, J., Wang, Y.: Research priorities in observing and modeling urban weather and climate. Bulletin of the American Meteorological Society 93(11), 1725–1728 (2012)
22. Cheng, M., Wang, G., Mian, H.H.: Reordering of hybrid unstructured grids for an implicit navier-stokes solver based on openmp parallelization. Computers & Fluids (2014)
23. Cohen, J., Molemaker, M.J.: A fast double precision cfd code using cuda. Parallel Computational Fluid Dynamics: Recent Advances and Future Directions pp. 414–429 (2009)
24. Couder-Castañeda, C., Barrios-Piña, H., Gitler, I., Arroyo, M.: Performance of a code migration for the simulation of supersonic ejector flow to smp, mic, and gpu using openmp, openmp+leo, and openacc directives. Scientific Programming 2015, 17 (2015)
25. Crespo, A., Dominguez, J.M., Barreiro, A., G´ omezGesteira, M., Rogers, B.D.: Gpus, a new tool of acceleration in cfd: efficiency and reliability on smoothed particle hydrodynamics methods. PLoS One 6(6), e20,685 (2011) 26. Denton, J., Dawes, W.: Computational fluid dynamics for turbomachinery design. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science 213(2), 107–124 (1998) 27. Djomehri, M.J., Jin, H.: Hybrid mpi+ openmp programming of an overset cfd solver and performance investigations. NASA Ames Research Center, NAS Technical Report NAS-02-002 (2002) 28. Dong, S., Karniadakis, G.E.: Dual-level parallelism for high-order cfd methods. Parallel Computing 30(1), 1– 20 (2004) 29. Duvigneau, R., Kloczko, T., Praveen, C.: A three-level parallelization strategy for robust design in aerodynamics. In: Proc. 20th Intl. Conf. on Parallel Computational Fluid Dynamics, pp. 379–384 (2008) 30. Elangovan, M.: Simulation of irregular waves by cfd. World Academy of Science, Engineering and Technology 55 (2011) 31. Emelyanov, V., Karpenko, A., Volkov, K.: Development of advanced computational fluid dynamics tools and their application to simulation of internal turbulent flows. In: Progress in Flight Physics–Volume 7, vol. 7, pp. 247–268. EDP Sciences (2015) 32. Fan, Z., Qiu, F., Kaufman, A., Yoakum-Stover, S.: Gpu cluster for high performance computing. In: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, p. 47. IEEE Computer Society (2004) 33. Ferziger, J.H., Peric, M.: Computational methods for fluid dynamics. Springer Science & Business Media (2012) 34. Flager, F., Welle, B., Bansal, P., Soremekun, G., Haymaker, J.: Multidisciplinary process integration and design optimization of a classroom building. Journal of Information Technology in Construction 14, 595–612 (2009) 35. Fletcher, C., Mayer, I., Eghlimi, A., Wee, K.: Cfd as a building services engineering tool. International Journal on Architectural Science 2(3), 67–82 (2001) 36. Fries, L., Antonyuk, S., Heinrich, S., Dopfer, D., Palzer, S.: Collision dynamics in fluidised bed granulators: A dem-cfd study. Chemical Engineering Science 86, 108– 123 (2013) 37. Frisch, J., Mundani, R.P., Rank, E., van Treeck, C.: Engineering-based thermal cfd simulations on massive parallel systems. Computation 3(2), 235–261 (2015) 38. Gerndt, A., Sarholz, S., Wolter, M., Mey, D.A., Bischof, C., Kuhlen, T.: Nested openmp for efficient computation of 3d critical points in multi-block cfd datasets. In: SC 2006 Conference, Proceedings of the ACM/IEEE, pp. 46–46. IEEE (2006) 39. Geveler, M., Ribbrock, D., Mallach, S., G¨ oddeke, D.: A simulation suite for lattice-boltzmann based real-time cfd applications exploiting multi-level parallelism on modern multi-and many-core architectures. Journal of Computational Science 2(2), 113–123 (2011) 40. Girod, M., Sanader, Z., Vojkovic, M., Antoine, R., MacAleese, L., Lemoine, J., Bonacic-Koutecky, V., Dugourd, P.: Uv photodissociation of proline-containing peptide ions: Insights from molecular dynamics. Journal of The American Society for Mass Spectrometry 26(3), 432–443 (2015)
41. Göddeke, D., Buijssen, S.H., Wobker, H., Turek, S.: Gpu acceleration of an unmodified parallel finite element navier-stokes solver. In: High Performance Computing & Simulation, 2009. HPCS'09. International Conference on, pp. 12–21. IEEE (2009)
42. Gourdain, N., Gicquel, L., Montagnac, M., Vermorel, O., Gazaix, M., Staffelbach, G., Garcia, M., Boussuge, J., Poinsot, T.: High performance parallel computing of flows in complex geometries: I. methods. Computational Science & Discovery 2(1), 015,003 (2009)
43. Griebel, M., Zaspel, P.: A multi-gpu accelerated solver for the three-dimensional two-phase incompressible navier-stokes equations. Computer Science-Research and Development 25(1-2), 65–73 (2010)
44. Grisogono, B.: On nature, theory, and modelling of atmospheric planetary boundary layers. Bulletin of the American Meteorological Society 92(2), 123–128 (2011)
45. Gropp, W.D., Kaushik, D.K., Keyes, D.E., Smith, B.F.: High-performance parallel implicit cfd. Parallel Computing 27(4), 337–362 (2001)
46. Hawkes, J., Turnock, S., Cox, S., Phillips, A., Vaz, G.: Performance analysis of massively-parallel computational fluid dynamics. Proceedings of the 11th International Conference on Hydrodynamics (ICHD 2014), Singapore (October 19-24, 2014)
47. Heuveline, V., Krause, M.J., Latt, J.: Towards a hybrid parallelization of lattice boltzmann methods. Computers & Mathematics with Applications 58(5), 1071–1080 (2009)
48. Hochkirch, K., Mallol, B.: On the importance of full-scale cfd simulations for ships. In: 11th International Conference on Computer and IT Applications in the Maritime Industries, COMPIT (2013)
49. Höhne, T., Krepper, E., Rohde, U.: Application of cfd codes in nuclear reactor safety analysis. Science and Technology of Nuclear Installations 2010 (2009)
50. Holland, D.M., Lockerby, D.A., Borg, M.K., Nicholls, W.D., Reese, J.M.: Molecular dynamics pre-simulations for nanoscale computational fluid dynamics. Microfluidics and Nanofluidics 18(3), 461–474 (2014)
51. Hu, Y.C., Lu, H., Cox, A.L., Zwaenepoel, W.: Openmp for networks of smps. In: Parallel Processing, 1999. 13th International and 10th Symposium on Parallel and Distributed Processing, 1999. 1999 IPPS/SPDP. Proceedings, pp. 302–310. IEEE (1999)
52. Jacobsen, D.A., Senocak, I.: Scalability of incompressible flow computations on multi-gpu clusters using dual-level and tri-level parallelism. In: 49th AIAA Aerospace Sciences Meeting including the New Horizons Forum and Aerospace Exposition, vol. 4, pp. 2011–947 (2011)
53. Jacobsen, D.A., Thibault, J.C., Senocak, I.: An mpi-cuda implementation for massively parallel incompressible flow computations on multi-gpu clusters. In: 48th AIAA Aerospace Sciences Meeting and Exhibit, vol. 16 (2010)
54. Janßen, C.F., Mierke, D., Überrück, M., Gralher, S., Rung, T.: Validation of the gpu-accelerated cfd solver elbe for free surface flow problems in civil and environmental engineering. Computation 3(3), 354–385 (2015)
55. Jeff Burnham, P.: Modeling dams with computational fluid dynamics: Past success and new directions
56. Jespersen, D.C.: Acceleration of a cfd code with a gpu. Scientific Programming 18(3-4), 193–201 (2010)
57. Jia, R., Sunden, B.: Parallelization of a multi-blocked cfd code via three strategies for fluid flow and heat transfer analysis. Computers & Fluids 33(1), 57–80 (2004)
58. Jin, H., Frumkin, M., Yan, J.: Automatic generation of openmp directives and its application to computational fluid dynamics codes. In: High Performance Computing, pp. 440–456. Springer (2000)
59. Jin, H., Jespersen, D., Mehrotra, P., Biswas, R., Huang, L., Chapman, B.: High performance computing using mpi and openmp on multi-core parallel systems. Parallel Computing 37(9), 562–575 (2011)
60. Jin, H., Jost, G., Johnson, D., Tao, W.K.: Experience on the parallelization of a cloud modeling code using computer-aided tools. NASA Technical Report NAS03-006 (2003)
61. Jost, G., Jin, H., an Mey, D., Hatay, F.F.: Comparing the openmp, mpi, and hybrid programming paradigms on an smp cluster. In: Proceedings of EWOMP, vol. 3 (2003)
62. Jost, G., Robins, B.: Experiences using hybrid mpi/openmp in the real world: parallelization of a 3d cfd solver for multi-core node clusters. Scientific Programming 18(3-4), 127–138 (2010)
63. Kafui, D., Johnson, S., Thornton, C., Seville, J.P.: Parallelization of a lagrangian–eulerian dem/cfd code for application to fluidized beds. Powder Technology 207(1), 270–278 (2011)
64. Karimi, K., Dickson, N.G., Hamze, F.: A performance comparison of cuda and opencl. arXiv preprint arXiv:1005.2581 (2010)
65. Kayne, A.: Computational fluid dynamics (cfd) modeling of mixed convection flows in building enclosures. In: ASME 2013 7th International Conference on Energy Sustainability (2013)
66. Khor, Y.S., Xiao, Q.: Cfd simulations of the effects of fouling and antifouling. Ocean Engineering 38(10), 1065–1079 (2011)
67. Kiris, C.C., Kwak, D., Chan, W., Housman, J.A.: High-fidelity simulations of unsteady flow through turbopumps and flowliners. Computers & Fluids 37(5), 536–546 (2008)
68. Kneer, A., Schreck, E., Hebenstreit, M., Goeszler, A.: Industrial mixed openmp/mpi cfd-application for calculations of free-surface flows. In: WOMPAT 2000 (2000)
69. Kowalski, T., Radmehr, A.: Thermal analysis of an electronics enclosure: coupling flow network modeling (fnm) and computational fluid dynamics (cfd). In: Semiconductor Thermal Measurement and Management Symposium, 2000. Sixteenth Annual IEEE, pp. 60–67. IEEE (2000)
70. Kumar, M., Kumar, N.S., Raj, R.T.K.: Heat transfer analysis of automotive headlamp using cfd methodology. Heat Transfer 2(7) (2015)
71. Larkin, N.K., O'Neill, S.M., Solomon, R., Raffuse, S., Strand, T., Sullivan, D.C., Krull, C., Rorig, M., Peterson, J., Ferguson, S.A.: The bluesky smoke modeling framework. International Journal of Wildland Fire 18(8), 906–920 (2010)
72. Ledur, C.L., Zeve, C.M., dos Anjos, J.C.: Comparative analysis of openacc, openmp and cuda using sequential and parallel algorithms. In: 11th Workshop on Parallel and Distributed Processing (WSPPD) (2013)
73. Lee, B.K.: Computational fluid dynamics in cardiovascular disease. Korean Circulation Journal 41(8), 423–430 (2011)
74. Li, Y., Paik, K.J., Xing, T., Carrica, P.M.: Dynamic overset cfd simulations of wind turbine aerodynamics. Renewable Energy 37(1), 285–298 (2012)
75. Ma, Z., Wang, H., Pu, S.: A parallel meshless dynamic cloud method on graphic processing units for unsteady compressible flows past moving boundaries. Computer Methods in Applied Mechanics and Engineering 285, 146–165 (2015)
76. Maknickas, A., Kaceniauskas, A., Kacianauskas, R., Balevicius, R., Dziugys, A.: Parallel dem software for simulation of granular media. Informatica, Lith. Acad. Sci. 17(2), 207–224 (2006)
77. Mavriplis, D.J.: Parallel performance investigations of an unstructured mesh navier-stokes solver. International Journal of High Performance Computing Applications 16(4), 395–407 (2002)
78. Mininni, P.D., Rosenberg, D., Reddy, R., Pouquet, A.: A hybrid mpi–openmp scheme for scalable parallel pseudospectral computations for fluid turbulence. Parallel Computing 37(6), 316–326 (2011)
79. Morris, P.D., Narracott, A., von Tengg-Kobligk, H., Soto, D.A.S., Hsiao, S., Lungu, A., Evans, P., Bressloff, N.W., Lawford, P.V., Hose, D.R., et al.: Computational fluid dynamics modelling in cardiovascular medicine. Heart, heartjnl-2015 (2015)
80. Mudigere, D., Sridharan, S., Deshpande, A., Park, J., Heinecke, A., Smelyanskiy, M., Kaul, B., Dubey, P., Kaushik, D., Keyes, D.: Exploring shared-memory optimizations for an unstructured mesh cfd application on modern parallel systems. In: Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, pp. 723–732. IEEE (2015)
81. Müller, M.S., van Waveren, M., Lieberman, R., Whitney, B., Saito, H., Kumaran, K., Baron, J., Brantley, W.C., Parrott, C., Elken, T., et al.: Spec mpi2007 - an application benchmark suite for parallel systems using mpi. Concurrency and Computation: Practice and Experience 22(2), 191–205 (2010)
82. Nakata, T., Liu, H., Bomphrey, R.J.: A cfd-informed quasi-steady model of flapping-wing aerodynamics. Journal of Fluid Mechanics 783, 323–343 (2015)
83. Notay, Y., Napov, A.: A massively parallel solver for discrete poisson-like problems. Journal of Computational Physics 281, 237–250 (2015)
84. Ogasawara, E., de Oliveira, D., Chirigati, F., Barbosa, C.E., Elias, R., Braganholo, V., Coutinho, A., Mattoso, M.: Exploring many task computing in scientific workflows. In: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, p. 2. ACM (2009)
85. Patel, H.B., Dinesan, M.D.: Optimization and performance analysis of an automobile radiator using cfd: a review. International Journal for Innovative Research in Science and Technology 1(7), 123–126 (2015)
86. Plimpton, S.J., Devine, K.D.: Mapreduce in mpi for large-scale graph algorithms. Parallel Computing 37(9), 610–632 (2011)
87. Rumsey, C.L., Allison, D.O., Biedron, R.T., Buning, P.G., Gainer, T.G., Morrison, J.H., Rivers, S.M., Mysko, S.J., Witkowski, D.P.: Cfd sensitivity analysis of a modern civil transport near buffet-onset conditions (Dec 2001)
88. Saha, P., Aksan, N., Andersen, J., Yan, J., Simoneau, J., Leung, L., Bertrand, F., Aoto, K., Kamide, H.: Issues and future direction of thermal-hydraulics research and development in nuclear power reactors. Nuclear Engineering and Design 264, 3–23 (2013)
89. Sayma, A.: Computational fluid dynamics. Bookboon (2009)
90. Schornbaum, F., Rüde, U.: Massively parallel algorithms for the lattice boltzmann method on non-uniform grids. arXiv preprint arXiv:1508.07982 (2015)
91. Schuster, D.M.: The expanding role of applications in the development and validation of cfd at nasa. In: Computational Fluid Dynamics 2010, pp. 3–29. Springer (2011)
92. Selma, B., Désilets, M., Proulx, P.: Optimization of an industrial heat exchanger using an open-source cfd code. Applied Thermal Engineering 69(1), 241–250 (2014)
93. Selvam, M., Hoffmann, K.A.: Mpi/openmp hybridization of higher order weno scheme for the incompressible navier-stokes equations. AIAA SciTech (January 5–9, 2015, Kissimmee, Florida)
94. Senocak, I., Thibault, J.C., Caylor, M.: Rapid-response urban cfd simulations using a gpu computing paradigm on desktop supercomputers. In: Eighth Symposium on the Urban Environment, Phoenix, Arizona (January 10–15, 2009)
95. Shang, Z.: High performance computing for flood simulation using telemac based on hybrid mpi/openmp parallel programming. International Journal of Modeling, Simulation, and Scientific Computing 5(4), 1472001 (2014)
96. Shang, Z., Cheng, M., Lou, J.: Parallelization of lattice boltzmann method using mpi domain decomposition technology for a drop impact on a wetted solid wall. International Journal of Modeling, Simulation, and Scientific Computing 5(2), 1350024 (2014)
97. Shimpalee, S., Greenway, S., Spuckler, D., Van Zee, J.: Predicting water and current distributions in a commercial-size pemfc. Journal of Power Sources 135(1), 79–87 (2004)
98. Simmendinger, C., Kügeler, E.: Hybrid parallelization of a turbomachinery cfd code: performance enhancements on multicore architectures. In: Proceedings of the V European Conference on Computational Fluid Dynamics, ECCOMAS CFD (2010)
99. Smith, B.L.: Assessment of cfd codes used in nuclear reactor safety simulations. Nuclear Engineering and Technology 42(4), 339–364 (2010)
100. Smith, C.W., Matthews, B., Rasquin, M., Jansen, K.E.: Performance and scalability of unstructured mesh cfd workflow on emerging architectures (2015)
101. Stopford, P.J.: Recent applications of cfd modelling in the power generation and combustion industries. Applied Mathematical Modelling 26(2), 351–374 (2002)
102. Tessendorf, J., et al.: Simulating ocean water. Simulating Nature: Realistic and Interactive Techniques, SIGGRAPH 1(2), 5 (2001)
103. Thibault, J.C., Senocak, I.: Cuda implementation of a navier-stokes solver on multi-gpu desktop platforms for incompressible flows. In: Proceedings of the 47th AIAA Aerospace Sciences Meeting, pp. 2009–758 (2009)
104. Turner, E.L., Hu, H.: A parallel cfd rotor code using openmp. Advances in Engineering Software 32(8), 665–671 (2001)
105. Vázquez, M., Rubio, F., Houzeaux, G., González, J., Giménez, J., Beltran, V., de la Cruz, R., Folch, A.: Xeon phi performance for hpc-based computational mechanics codes. Tech. rep., PRACE-RI (2014)
106. Vijiapurapu, S., Cui, J., Munukutla, S.: Cfd application for coal/air balancing in power plants. Applied Mathematical Modelling 30(9), 854–866 (2006)
107. Wang, B., Hu, Z., Zha, G.C.: General subdomain boundary mapping procedure for structured grid implicit cfd parallel computation. Journal of Aerospace Computing, Information, and Communication 5(11), 425–447 (2008)
108. Wang, J.F., Piechna, J., Mueller, N.: A novel design of composite water turbine using cfd. Journal of Hydrodynamics, Ser. B 24(1), 11–16 (2012)
109. Warner, T.T.: Numerical weather and climate prediction. Cambridge University Press (2010)
110. Weyna, S.: Acoustic intensity imaging methods for in-situ wave propagation. Archives of Acoustics 35(2), 265–273 (2010)
111. Wong, K.K., Inthavong, K., Zhonghua, S., Liow, K., Jiyuan, T.: In-vivo experimental and numerical studies of cardiac flow in right atrium. HKIE Transactions 17(4), 73–78 (2010)
112. Xia, B., Sun, D.W.: Applications of computational fluid dynamics (cfd) in the food industry: a review. Computers and Electronics in Agriculture 34(1), 5–24 (2002)
113. Xiao, J., Travis, J.R., Royl, P., Svishchev, A., Jordan, T., Breitung, W.: Petsc-based parallel semi-implicit cfd code gasflow-mpi in application of hydrogen safety analysis in containment of nuclear power plant. In: Joint International Conference on Mathematics and Computation (M&C), Supercomputing in Nuclear Applications (SNA) and the Monte Carlo (MC) Method, Nashville, TN (2015)
114. Xu, C., Deng, X., Zhang, L., Fang, J., Wang, G., Jiang, Y., Cao, W., Che, Y., Wang, Y., Wang, Z., et al.: Collaborating cpu and gpu for large-scale high-order cfd simulations with complex grids on the tianhe-1a supercomputer. Journal of Computational Physics 278, 275–297 (2014)
115. Xu, Z., Zhao, H., Zheng, C.: Accelerating population balance-monte carlo simulation for coagulation dynamics from the markov jump model, stochastic algorithm and gpu parallel computing. Journal of Computational Physics 281, 844–863 (2015)
116. Yao, J., Jameson, A., Alonso, J.J., Liu, F.: Development and validation of a massively parallel flow solver for turbomachinery flows. Journal of Propulsion and Power 17(3), 659–668 (2001)
117. Yue, X., Zhang, H., Luo, C., Shu, S., Feng, C.: Parallelization of a dem code based on cpu-gpu heterogeneous architecture. In: Parallel Computational Fluid Dynamics, pp. 149–159. Springer (2014)
118. Yuguang, B., Guoqiang, W., Yuguang, Z.: A novel parallel computing method for computational fluid dynamics. International Journal of Computer Science Issues (IJCSI) 10(1) (2013)
119. Zhang, H., Trias Miquel, F.X., Tan, Y., Sheng, Y., Oliva Llena, A., et al.: Parallelization of a dem/cfd code for the numerical simulation of particle-laden turbulent flows. In: 23rd International Conference on Parallel Computational Fluid Dynamics, Barcelona (2011)
120. Zubanov, V., Egorychev, V., Shabliy, L.: Design of rocket engine for spacecraft using cfd-modeling. Procedia Engineering 104, 29–35 (2015)