Multi-CPU/GPU parallelization, optimization and machine learning based autotuning of structured grid CFD codes

Weicheng Xue1, Charles W. Jackson2 and Christopher J. Roy3
Virginia Tech, Blacksburg, VA, 24061

This paper focuses on the multi-CPU/GPU parallelization, optimization and machine learning based autotuning of a 3D buoyancy driven cavity solver and our in-house research CFD code SENSEI. OpenACC directives and MPI are used to scale the performance across multiple CPUs/GPUs without significantly modifying the legacy codes. For the buoyancy driven cavity code, on both GPUs and CPUs, domain decomposition in multiple dimensions performs much better than decomposition in a single dimension. Two optimization techniques are investigated for the GPU-accelerated version and are shown to successfully improve the memory throughput and reduce the communication overhead between hosts and devices. Using buffers to store noncontiguous boundary data in contiguous locations speeds up the code by at least a factor of 2 in terms of weak scalability. For our GPU-accelerated research CFD code SENSEI, one optimization, removing temporary arrays when calling routines in parallel regions, gives a 2.1x~4.1x speedup over the previous best version. A machine learning based autotuning technique is used to autotune six parameters mapping the GPU kernels to the GPU architecture. Less than 1% of all possible configurations is needed to set up a good machine learning model, which saves a lot of time. The machine learning model reproduces the measured performance well in the training stage and predicts it well in the predicting stage.

Nomenclature
p          = pressure
ρ          = density
T          = temperature
V          = velocity vector
x_j        = coordinate component
t          = time
F_v        = body force
ν          = kinematic viscosity
t_serial   = serial runtime
t_parallel = parallel runtime
NP         = number of processors

I. Introduction

Computational Fluid Dynamics (CFD) uses computational methods to determine flow behaviors. There are numerous studies discussing a variety of CFD solvers applicable to different problems. The current state of CFD is satisfactory for some problems, but for it to be used in more applications, the accuracy must be increased and the time it takes to run a simulation must be reduced. Meshes with many nodes are needed to capture

1 Graduate Teaching Assistant, Kevin T. Crofton Department of Aerospace and Ocean Engineering, 215 Randolph Hall, AIAA Student Member.
2 Graduate Research Assistant, Kevin T. Crofton Department of Aerospace and Ocean Engineering, 215 Randolph Hall, AIAA Student Member.
3 Professor, Kevin T. Crofton Department of Aerospace and Ocean Engineering, 215 Randolph Hall, AIAA Associate Fellow.

flow details for more complicated problems, but such simulations may take too long to be practical in many cases. For example, turbulence models such as Large Eddy Simulation require several orders of magnitude more mesh cells (and more computation) than the Reynolds Averaged Navier-Stokes equations, making them hard to compute quickly using traditional serial computing.1 With the development of high performance computing, solving complicated CFD problems faster using advanced parallel methods has become possible.

On CPUs, there are three common paradigms for parallel programming: shared memory, distributed memory, and a hybrid of the two. All are suitable for single instruction, multiple data (SIMD) programs. A parallel program using a shared memory model such as OpenMP2 utilizes the full performance of a socket by launching multiple threads on it; however, this paradigm does not scale well to multiple sockets. The distributed memory model, as used in MPI3, relies on message passing, which makes it much more suitable for distributed systems with a large number of processors such as supercomputers. There is also a hybrid MPI+OpenMP paradigm4 which combines the advantages of MPI and OpenMP. This hybrid approach maps naturally onto modern multi-processor architectures and makes it easy to realize two levels of parallelism: MPI at the coarse-grain level and OpenMP at the fine-grain level. The SIMD property of OpenMP and MPI is well suited to CFD since there is typically a large amount of data that is processed the same way. In CFD, all three approaches (OpenMP, MPI, and hybrid MPI+OpenMP) have been used and evaluated, alone or together, in many papers.5-10 Berger et al.5 showed that, with careful attention to memory placement and locality, an OpenMP implementation was marginally faster than an MPI implementation when scaling up to 640 CPUs on multi-level Cartesian grids with embedded boundaries; the difference was attributed to the communication overhead of the MPI implementation. Amritkar et al.6 likewise highlighted the importance of constructing OpenMP code to improve data locality, achieving good scalability up to 64 CPUs for coupled fluid-material systems. Gourdain et al.7 considered the effects of load balancing, partitioning algorithms, communication overhead, and structured versus unstructured meshes in the MPI implementation of very complicated CFD solvers. Mininni et al.8 compared the performance of a pure MPI implementation and a hybrid MPI+OpenMP implementation of an incompressible Navier-Stokes solver and found that the hybrid approach does not outperform pure MPI when scaling up to about 20,000 cores, which may be caused by cache contention and memory bandwidth limits. Yilmaz et al.10 drew a similar conclusion, although they only scaled up to 256 cores. For a large number of processors residing on distributed sockets, MPI may therefore scale better than OpenMP or the hybrid approach. In addition, MPI gives each processor its own memory space, which also suits our goal of comparing CPU and GPU performance directly, so MPI is preferable for this work.

Apart from high performance CPU computing, general purpose GPUs have become more popular in scientific computing. A single GPU may have thousands of processing units, so an instruction is executed by many threads operating in a vectorized fashion on multiple data values, which is known as single instruction, multiple threads (SIMT).
In GPU computing, the compute-intensive part of a program is offloaded from the CPU to the GPU, where many threads execute the code in parallel. After the GPU finishes its task, the data are moved back to the CPU for further work such as saving or outputting the result. Since the GPU has more compute units than the CPU, the program is usually accelerated; however, many factors such as workload and memory bandwidth limit the amount of acceleration possible. There are three popular options used to accelerate codes on GPUs11: CUDA12-15, OpenCL16 and OpenACC17-20. Using CUDA or OpenCL in an existing CFD code is difficult, as users need to add a lot of C/C++ extensions and doing so requires familiarity with low-level hardware architecture. Accelerating a code with CUDA or OpenCL is fairly time-consuming, although the performance is often good. CUDA is also architecture-dependent, so it cannot be easily ported to other GPU architectures. OpenCL has not been commonly used in CFD codes, and it is difficult to find comprehensive papers discussing the implementation and performance of OpenCL in CFD applications. OpenACC is a directive-based model, which enables users to adapt their CFD code more readily without significant modification. OpenACC requires the programmer to know less about the GPU architecture, although the architecture must still be taken into account for the best performance. Also, because of its directive-based implementation, OpenACC can be used across multiple platforms, which is a big advantage over CUDA. Considering the ease of programming, the portability of the code, and the parallel performance it provides, OpenACC was used to accelerate the execution of our codes on the GPU.

There are many parameters that can be used to tune a GPU-parallelized CFD code. These include the choice of algorithm, the method of decomposing the domain for different problems and problem sizes, the choice of time integration scheme, the way kernels are mapped to the GPU, etc. Pickering et al.17 investigated the effect of block size in two dimensions for both single and double precision on two different GPU architectures; optimal configurations were obtained through exhaustive search. However, exhaustive search becomes time-consuming and impractical when the parameter search space is huge. Many auto-tuning methods have been developed to solve such problems. Starchart21 uses a statistical tree-based partitioning approach, clustering parameters according to their importance, but its predictions are largely confined by the samples, which may give very poor predictions in complicated situations. MaSiF22 is a machine learning guided autotuning tool that uses Principal Components Analysis (PCA) to truncate the dimensionality of the search space. This method is good for problems with a large search space, but accuracy can be lost due to the dimension reduction. Falch et al.23 used a machine learning based approach to tune OpenCL applications. They achieved good predictions for some benchmark codes, but the approach is not robust and the results are poor for some other benchmarks, especially on GPUs.

II. Approach
Many factors in a CFD code can greatly affect performance, including the numerical algorithm, mesh size, mesh type, and the implementation of the parallelization. One major consideration is the mesh type (structured or unstructured) used to discretize the domain. In this paper, structured meshes are used for the following reasons. First, the residual on a structured mesh usually takes less time to compute, as there is a benefit of data locality during iteration, allowing the cache to be used more efficiently. Second, unstructured meshes require extra space to store the connectivity information for each cell, which increases the memory load; this overhead can have a significant performance impact on a GPU, which is typically memory bound. Third, a more accurate estimate of the numerical error can be obtained on a structured mesh than on an unstructured one. Last, for general problems, the mesh quality is usually better for a multi-block structured mesh than for an unstructured mesh, so more accurate solutions can be obtained. Thus, choosing structured meshes reflects a combined consideration of performance and solution accuracy.

To evaluate the performance of both the CPU and GPU parallel codes, weak scalability and strong scalability (or weak scaling and strong scaling) are used24. Strong scalability measures how the execution time varies with the number of processors for a fixed total problem size, while weak scalability measures how the execution time varies with the number of processors when the problem size per processor is fixed. These two scalabilities usually need to be investigated together, since in practice we may have limited compute and memory resources (so the problem size cannot grow much) or a program whose execution time is too long (so we want to use as many processors as possible). The second situation fits our needs better, so we are more interested in the weak scalability.

Apart from the two types of scalability, there are two ways of analyzing them, depending on whether the scalability analysis is intra-socket or inter-socket25. When operating intra-socket, the processors in a socket have equal access to the same memory system and can exchange data faster, reducing the effect of memory latency; however, other resources (such as cache) are limited and must be shared by the processors within that socket. With the inter-socket approach, every processor has its own resources. This has certain benefits, such as a larger cache per process, ensuring better data locality, which is especially important for memory-bound problems like CFD. However, processes on different nodes must communicate over the network, which may bring serious latency issues when scaling to a large number of processors. Since we want to directly compare the speedup of the code on GPUs and CPUs (each node has two sockets, and each socket has one GPU and six CPU cores), strong and weak scalability are investigated and presented using the inter-socket approach (each process is bound to a socket).

To improve the performance of the GPU parallelization26, we need to reduce both the computation cost and the communication overhead.
Rinard et al.27 discuss some aspects of communication optimization in great detail. The computation cost is largely determined by the numerical algorithm used, and the computation time by the compute capability of the GPU, neither of which is our focus when investigating parallel optimization methods. Data locality, load balancing, synchronization, the ratio of communication to computation, latency reduction and hiding, concurrent fetches, and communication efficiency are the important factors to consider in reducing the communication overhead. Regarding load balancing, a good domain decomposition28 is very important. For a multi-block structured grid, the grid size can differ considerably between blocks: one processor may own a larger number of smaller blocks while another owns fewer, larger blocks. Even so, load balance sometimes cannot be completely guaranteed, so reducing the adverse effect of load imbalance across different GPUs is a significant issue. The number of data transfers, the amount of data to be transferred, and the synchronization overhead due to load imbalance are mainly determined by the decomposition. To reduce the communication overhead of CPU-GPU data transfer and MPI calls, the majority of the data should be kept on the GPU, and the amount of data transferred, both between a CPU and its GPU and between CPUs, should be minimized. To reduce the synchronization overhead, the amount of work that can be done asynchronously should be maximized. To improve data locality and coalescing, data should be loaded into cache in chunks, which makes reading and writing global memory more efficient. To reduce latency, the memory throughput should be maximized by transferring only contiguous data whenever possible; non-contiguous data transfer between GPUs is a major bottleneck that deteriorates performance greatly. To hide latency, kernel execution and data transfer should be overlapped as much as possible, which should help as the communication overhead grows with increasing processor count. To improve communication efficiency, private temporary arrays for threads and newly introduced dependencies in nested loops should be avoided as much as possible: a large number of temporary arrays wastes memory and makes the processors use the cache much less efficiently.

There may be several bottlenecks in a code, and a very good way to find the hotspots is through profiling29. Nvidia provides a GPU profiling tool, nvprof, that helps identify the hotspots. Once the hotspots are identified, new approaches or algorithms can be used to improve these areas. The bottlenecks may include memory throughput, GPU occupancy, data transfers, etc. Some of these issues may be solved simply by using GPUs with higher compute capability, so we should first test on different GPU platforms and decide whether they are really bottlenecks. The compute capability is confined by the physical limits of the GPU itself, i.e., how many processors it has, what the architecture is, etc. We should also test our optimizations on different platforms, as one of our goals is good portability across architectures: a parallelization strategy that works well for a specific CFD code on a specific platform may not work well on another, since the optimization may be architecture specific.
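As a concrete illustration of the data-locality point above, the following sketch (generic code, not taken from SENSEI or the BDC solver) shows a stencil update written so that the innermost loop runs over the unit-stride index i, giving contiguous accesses for the CPU cache and coalesced accesses on the GPU:

   subroutine smooth(p, p_new, ni, nj, nk)
      integer, intent(in)    :: ni, nj, nk
      real(8), intent(in)    :: p(ni, nj, nk)
      real(8), intent(inout) :: p_new(ni, nj, nk)
      integer :: i, j, k
      ! A single directive offloads the whole loop nest; i varies fastest,
      ! matching Fortran's column-major storage, so accesses are contiguous.
      !$acc parallel loop collapse(3) copyin(p) copy(p_new)
      do k = 2, nk - 1
         do j = 2, nj - 1
            do i = 2, ni - 1
               p_new(i,j,k) = (p(i-1,j,k) + p(i+1,j,k) + p(i,j-1,k) &
                             + p(i,j+1,k) + p(i,j,k-1) + p(i,j,k+1)) / 6.0d0
            end do
         end do
      end do
   end subroutine smooth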

III. Description of the CFD Codes
The finite difference CFD code in this paper solves the classic 3D buoyancy-driven cavity (BDC) problem. It uses the artificial compressibility method developed by Chorin30: an artificial compressibility term added to the continuity equation turns the elliptic system of equations into a hyperbolic system that can be marched in pseudo-time to obtain the steady state solution. The governing equations are

\[
\frac{1}{\beta^2}\frac{\partial p}{\partial t} + \nabla \cdot \left(\rho \vec{V}\right) = \varepsilon_j \frac{\partial^4 p}{\partial x_j^4},
\]
\[
\frac{\partial \vec{V}}{\partial t} + \vec{V}\cdot\nabla \vec{V} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \vec{V} + \frac{1}{\rho}\vec{F}_v,
\]
\[
\frac{\partial T}{\partial t} + \vec{V}\cdot\nabla T = \alpha \nabla^2 T, \qquad (1)
\]

where ε_j denotes the components of the numerical dissipation coefficient and β denotes the artificial compressibility parameter.

SENSEI (Structured, Euler/Navier-Stokes Explicit-Implicit Solver) is our in-house flow solver developed by Derlaga et al.31 It uses modern Fortran techniques to solve 2D/3D CFD problems. SENSEI is a multi-block finite volume code with several flux options such as Roe and van Leer. The system of equations is

   ∂  (2) Ω + − = Q d F F d s S ∫ ∂Ω i v ∫Ω dΩ , ∂t ∫Ω     where Q is the vector of conserved variables, Fi and Fv are the inviscid and viscid vectors respectively. S is the

source term. For more details on the implementation see Derlag et at.31

IV. Implementation
A. Performance Metrics
There are many ways to measure the performance of a parallel code. The most common metric is the number of floating point operations performed per second (reported here in GFLOPS). It is obtained by timing the program and counting how many floating point operations were performed (typically by hand or using a code analysis tool); an example of this was given by Pickering et al.17 For parallel computing it is also common to measure the speedup relative to the serial implementation:

\[
\mathrm{speedup} = \frac{t_{serial}}{t_{parallel}}. \qquad (3)
\]

The speedup shows how much faster the parallel version of the code is compared to the serial version. It can also be useful to measure how efficiently all of the parallel resources are being used. Ideally, a program running on two nodes would be twice as fast as the same code run on one node; this is linear scalability, with an efficiency of 100%. The efficiency is calculated as

\[
\mathrm{efficiency} = \frac{\mathrm{speedup}}{NP}. \qquad (4)
\]

Typically a program will have sub-linear scalability, with an efficiency less than 100%, due to the overhead associated with parallel computing such as communication costs. For example, a speedup of 6.4 on 8 processors corresponds to an efficiency of 80%. Sometimes it is possible to achieve super-linear scalability, where the efficiency is higher than 100%; this is commonly due to improved use of the cache on a smaller portion of the domain.

B. Decomposition Strategy
There are many ways to decompose a general computational domain, such as graph partitioning, Hilbert space-filling curves, etc.28 For a single-block structured mesh, decomposition is more straightforward, with three main methods: 1D, 2D and 3D decomposition. 3D decomposition means that the domain is decomposed in three dimensions; similarly, 2D and 1D decomposition mean that the domain is decomposed in two dimensions or one dimension, respectively. In the BDC code, each node has two layers of neighboring nodes in each direction in its pressure stencil and only one layer of neighboring nodes for the other variables. Figure 1 shows the stencil used at node (i, j, k) in the BDC code. SENSEI uses two layers of ghost cells at the edges of the domain so that the interior stencil can be used everywhere.

Figure 1. Stencil (black + red: stencil for temperature and velocity; blue + black + red: stencil for pressure)

However, 3D decomposition may be more promising for larger numbers of processors for two reasons. First, the surface-to-volume ratio of the sub-blocks is larger in 1D decomposition, which means more data need to be transferred between processors. Second, 1D decomposition may generate slices that are too thin, which impedes good scaling because the surface-to-volume ratio becomes very large. The 3D decomposition adopted is shown in Figure 2. Every processor is assigned a small block of the whole domain, and ghost nodes/cells are appended to each block. Then only the boundaries of each small block need to be communicated to exchange data, in order to fill the stencil values of the nodes/cells on the boundaries of each block.


Figure 2. 3D domain decomposition

The data layout of multi-dimensional arrays in Fortran is column-major. For a three-dimensional array A(i, j, k), the unit-stride direction is the first index (i), so i varies fastest and k slowest; this should be exploited to improve data locality in nested loops in both serial and parallel programming. Accordingly, the loop indices (from the innermost to the outermost loop) should be ordered i, j, k, and the domain should be decomposed preferentially in the k, then j, then i directions so that locality can be fully exploited. For a given number of processors, we first take the cube root of the processor count and round it down to an integer, using it as the number of partitions in the i direction; the remaining factor is then decomposed as evenly as possible between the j and k directions. This strategy tries to utilize as many processors as possible and divides the domain in three directions as evenly as possible (a short sketch of this splitting heuristic is given at the end of this section).

C. Hardware Configuration
Three platforms with different configurations are used to assess the performance of the GPU parallelization and optimization of both the 3D BDC code and SENSEI.

Hokiespeed. The Hokiespeed cluster at Virginia Tech was used to perform the testing for this paper. Hokiespeed has 204 nodes connected by a quad-data-rate InfiniBand interconnect. Each node is outfitted with 24 GB of memory, two six-core Xeon E5645 CPUs and two Nvidia M2050/C2050 GPUs. Every GPU has 14 multiprocessors (MPs) and 3 GB of global memory in total; the peak bandwidth to this memory is 148.4 GB/s. Every MP has 32 CUDA cores, 48 KB of shared memory and 16 KB of L1 cache. All accesses to global memory go through the 512 KB L2 cache. The peak double precision performance is 513 GFLOPS. There are two reasons why Hokiespeed was chosen. First, some of our research group's previous results for the BDC code were obtained on Hokiespeed, so using it enables comparison with those results. Second, Hokiespeed has many sockets, each with one GPU, allowing massive CPU/GPU processing. The compiler used is PGI 15.7 with the -O3 optimization option.

Thermisto. Thermisto is a workstation in our research lab with two Nvidia Tesla C2075 GPUs and 32 CPU cores. Every GPU has 14 multiprocessors, each with 32 CUDA cores. The peak memory bandwidth is 144 GB/s and the peak double precision performance is 515 GFLOPS; the configuration and peak double precision performance of this GPU are very similar to those of the Hokiespeed C2050 GPU. The compiler used is PGI 16.5 with the -O3 optimization option.

Newriver. Newriver has 39 GPU nodes. Each node is equipped with two Intel Xeon E5-2680v4 (Broadwell) 2.4 GHz CPUs (28 cores per node in all), 512 GB of memory, and two Nvidia P100 GPUs. Each P100 is capable of up to 4.7 TFLOPS of double-precision performance and offers much higher throughput and memory bandwidth than the Nvidia M2050/C2050 GPU on Hokiespeed or the Nvidia C2075 GPU on Thermisto. The compiler used is PGI 17.5 with the -O3 optimization option.
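The splitting heuristic referred to above can be sketched as follows (assumed logic based on the description; the routine used in the actual codes may differ):

   subroutine split_3d(np, pi, pj, pk)
      ! Split np processors into pi x pj x pk partitions: the fewest partitions
      ! go to the unit-stride i direction and the most to the k direction.
      integer, intent(in)  :: np
      integer, intent(out) :: pi, pj, pk
      integer :: rem
      pi = int(real(np, 8)**(1.0d0/3.0d0) + 1.0d-9)   ! floor of the cube root
      do while (mod(np, pi) /= 0)                     ! back off until np is divisible
         pi = pi - 1
      end do
      rem = np / pi
      pj = int(sqrt(real(rem, 8)) + 1.0d-9)           ! split the remainder as evenly as possible
      do while (mod(rem, pj) /= 0)
         pj = pj - 1
      end do
      pk = rem / pj
   end subroutine split_3d

For NP = 12 this returns (2,2,3), which matches the best-performing combination identified for NP = 12 in the results section.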

V. Results
A. Parallelization and optimization of the 3D BDC code
The pressure solution using 64 GPUs (3D decomposition) with overlap of communication and computation (C-C) at 50,000 steps for a coarse grid of size 64^3 is shown in Figure 3(a). The solution is smooth and there is no discontinuity between sub-blocks on this very coarse grid. Figure 3(b) shows the largest difference (among the five variables, pressure differs the most) between the serial version and the 64-GPU version (3D decomposition with C-C, the most complicated implementation for the BDC case) for a grid size of 64^3 at 50,000 steps. The relative difference is only O(10^-8), which is within round-off error. All of the implementations presented here were checked in this way to ensure correctness.

Figure 3. Validation of the solution at 50,000 steps (64 GPUs, C-C): (a) pressure contour, (b) maximum solution difference contour

CPU performance
An optimized single-CPU performance is used as the base metric. In this paper, the single-CPU double precision performance in serial mode is found to be very close to that in MPI mode, so the efficiency using one MPI process can be regarded as 100%, as can be seen in Table 1. The grid size used in Table 1 is 256^3.

Table 1. Single-CPU double precision (DP) performance in different modes
        | Serial mode | MPI mode
GFLOPS  | 1.897       | 1.849

Figure 4 compares the CPU strong scalability and weak scalability efficiency of decompositions in different numbers of dimensions. Several trends can be seen in Figure 4(a). First, in terms of efficiency, 2D decomposition is the most efficient, followed by 3D and then 1D decomposition. The reason 1D decomposition scales the worst is that a large amount of data (the surface area of the sub-blocks) needs to be transferred. However, there is a limit to the surface-area argument, which is why 2D decomposition scales better than 3D decomposition: with 2D decomposition a block communicates with only two other blocks instead of six, so there are fewer messages (each with its own latency), and the transferred data lie in more contiguous memory locations, which also helps the efficiency of this approach. Second, Figure 4 shows that super-linear scaling was observed for 2D and 3D decomposition. This is because only one process is placed on each socket; we want to distribute the processes across as many sockets as possible because it makes comparing CPU scaling and GPU scaling convenient (every socket has exactly one GPU). As a result, each process has more resources (cache, memory bandwidth) than in the case where multiple processes within a socket must share the resources (and the efficiency is lower). For 2D decomposition, the highest efficiency occurs at eight processors, and the efficiency for all processor counts tested exceeds 120%; the efficiency would drop below 100% if many more CPUs were used, as the communication overhead would increase greatly. Third, for all decompositions, the efficiency begins to drop once the number of processors exceeds a certain value, because of the steadily increasing communication overhead.

Figure 4. Parallel efficiency using different decompositions (for CPUs): (a) strong scalability, (b) weak scalability

There are many ways to decompose a domain into a given number of blocks (one per processor), and these were analyzed to determine the best way to perform the decomposition. For example, 12 processors can be arranged as (1,1,12) using 1D decomposition, (1,3,4) or (1,2,6) using 2D decomposition, and (2,2,3) or (2,3,2) using 3D decomposition (among many other possibilities). A good decomposition should maximize contiguous memory access (data locality) and minimize communication overhead. Considering that arrays in Fortran are column-major, the largest number of partitions should be assigned to the k index, and then to the j and i indices. For NP = 12, three different combinations are compared in Table 2. The combination (2,2,3) clearly gives the best performance in this example, so it is used for NP = 12 in the 3D decomposition. Decompositions for other numbers of processors follow the same idea.

Table 2. Performance of different decompositions (NP = 12)
        | (2,2,3) | (2,3,2) | (3,2,2)
GFLOPS  | 26.463  | 26.404  | 24.864

GPU performance
First, the different optimization versions are defined and briefly introduced; 1D, 2D and 3D denote the different decompositions. The single-GPU performance is 35.925 GFLOPS, which is 18.94 times the single-CPU performance, and all GPU speedups are computed relative to this single-GPU performance. There are three versions of the GPU-accelerated code: the first is a baseline ported directly from the MPI CPU version, and the other two are successive optimizations aimed at improving the memory throughput and reducing the synchronization overhead.

Baseline
OpenACC directives are added to the MPI CPU code; MPI Isend/Irecv calls are applied between neighboring hosts to communicate, and the data are then updated on their respective devices (GPUs). In this version, !$acc async clauses and non-blocking communication are used to reduce some of the synchronization overhead. It is a very straightforward adaptation of the MPI CPU version, obtained by adding OpenACC directives to the code. Since the decomposition can be done in one, two, or three dimensions, this version has three variants. Note that since GPUDirect is not available and we are not using CUDA-aware MPI, the GPUs cannot exchange data directly.
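A sketch of this baseline exchange pattern is given below (a minimal example with illustrative variable names, not code from the actual solver). Only the contiguous k-direction faces are shown; the i and j faces are handled analogously and give rise to the non-contiguous transfers discussed later:

   subroutine exchange_k_faces(p, ni, nj, nk, nbr_lo, nbr_hi, comm)
      use mpi
      integer, intent(in)    :: ni, nj, nk, nbr_lo, nbr_hi, comm
      real(8), intent(inout) :: p(ni, nj, 0:nk+1)      ! one ghost layer in k
      integer :: req(4), ierr
      ! Device -> host copies of the two faces that will be sent.
      !$acc update host(p(:, :, 1:1), p(:, :, nk:nk))
      ! Non-blocking host-to-host exchange with the k-direction neighbors.
      call MPI_Irecv(p(:, :, 0),    ni*nj, MPI_DOUBLE_PRECISION, nbr_lo, 0, comm, req(1), ierr)
      call MPI_Irecv(p(:, :, nk+1), ni*nj, MPI_DOUBLE_PRECISION, nbr_hi, 1, comm, req(2), ierr)
      call MPI_Isend(p(:, :, 1),    ni*nj, MPI_DOUBLE_PRECISION, nbr_lo, 1, comm, req(3), ierr)
      call MPI_Isend(p(:, :, nk),   ni*nj, MPI_DOUBLE_PRECISION, nbr_hi, 0, comm, req(4), ierr)
      call MPI_Waitall(4, req, MPI_STATUSES_IGNORE, ierr)
      ! Host -> device copies of the received ghost faces, queued asynchronously.
      !$acc update device(p(:, :, 0:0), p(:, :, nk+1:nk+1)) async(1)
      !$acc wait(1)
   end subroutine exchange_k_faces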


3D V1
The goal of this optimization is to improve the memory throughput, since we found that the data transfer between CPU and GPU is very slow for the 3D decomposition, where non-contiguous data transfer is the bottleneck. We allocate buffers on the GPUs for sending and receiving: noncontiguous boundary data are first packed into the send buffer (this can be parallelized explicitly using !$acc parallel loop collapse), the buffers are then updated on the CPUs, transferred between CPUs with MPI, updated back on the devices, and finally the contiguous receive buffer is unpacked into the noncontiguous ghost locations on the GPUs (again using !$acc parallel loop collapse). The pseudo code is given in Figure 5; soln_send and soln_recv are the buffers used to store the boundary data for each block. McCall32 implemented this optimization method for SENSEI and obtained a factor of 3.6 improvement in memory throughput.
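Expanded slightly from the pseudo code in Figure 5, the packing and unpacking of a single face might look as follows (a sketch with illustrative array names and loop bounds; only soln_send and soln_recv come from the description above):

   ! Pack the non-contiguous i-max face into a contiguous device buffer ...
   !$acc parallel loop collapse(2) present(q, soln_send)
   do k = 1, nk
      do j = 1, nj
         soln_send(j, k) = q(ni, j, k)
      end do
   end do
   !$acc update host(soln_send)        ! one contiguous device -> host transfer
   ! ... MPI_Isend(soln_send)/MPI_Irecv(soln_recv) with the neighboring rank ...
   !$acc update device(soln_recv)      ! one contiguous host -> device transfer
   ! ... and unpack the received buffer into the non-contiguous ghost face.
   !$acc parallel loop collapse(2) present(q, soln_recv)
   do k = 1, nk
      do j = 1, nj
         q(ni+1, j, k) = soln_recv(j, k)
      end do
   end do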

Figure 5. Pseudo code of using buffers for GPU parallelism: (a) send buffer, (b) receive buffer, (c) LDC solver

3D V2
This optimization reduces the amount of data transferred and some of the synchronization overhead. Since the stencils for the different variables are different, we transfer only the amount of data required by the stencil of each component (density, the three velocity components, and pressure) rather than sizing every transfer for the largest stencil. We also reorder the data-transfer clauses and add as many !$acc async clauses during the transfers as possible (after checking that this does not affect the solution). This optimization can only be applied after V1, since it relies on the buffers (a short sketch is given after the discussion of Figure 6).

Figure 6 shows the weak scalability speedup and efficiency of the different decompositions with the optimization strategies listed above; the 3D grid growth used here is shown in Table 3. Regarding weak scalability, the 3D baseline is the worst, as it requires big chunks of noncontiguous data to be transferred between hosts and devices, which is the most important bottleneck of the 3D decomposition in the multi-GPU implementation. The 1D decomposition is the second worst, confirming that its scalability is not very good. Of the two optimizations, 3D V1 overcomes the bottleneck just mentioned by creating buffers, so that the data transferred between hosts and devices are in contiguous locations, while 3D V2 reduces the communication overhead by transferring less data and also increases the effective memory throughput by having more independent transfers in flight at a time.
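A minimal sketch of the per-variable asynchronous updates used in 3D V2 (buffer names are illustrative):

   ! Each variable moves only the ghost layers its stencil needs, and the
   ! independent host updates are queued on different async queues so that
   ! several transfers can be in flight at once.
   !$acc update host(T_send)   async(1)   ! temperature: one ghost layer
   !$acc update host(vel_send) async(2)   ! velocity components: one ghost layer
   !$acc update host(p_send)   async(3)   ! pressure: wider stencil, two layers
   !$acc wait                             ! all faces on the host before the MPI calls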

Figure 6. Weak scalability using different decompositions and optimizations (for GPUs): (a) speedup, (b) efficiency


Table 3. Grid growth for weak scalability
Number of GPUs | 3D decomposition dims | Grid size dims
1              | (1,1,1)               | (256,256,256)
2              | (1,1,2)               | (256,256,512)
4              | (1,2,2)               | (256,512,512)
8              | (2,2,2)               | (512,512,512)
16             | (2,2,4)               | (512,512,1024)
32             | (2,4,4)               | (512,1024,1024)
64             | (4,4,4)               | (1024,1024,1024)

Figure 7 shows the strong scalability speedup for two grid levels (256^3 and 512^3). Comparing the three versions of the GPU code decomposed in three dimensions, 3D V1 and 3D V2 perform much better than the 3D baseline. Compared with 3D V1, 3D V2 brings a performance gain of about 11.1% and 8.3% for the grid sizes 256^3 and 512^3, respectively. However, the performance of the two optimizations is very close when the number of GPUs is 64, as there is less work for each GPU to do (a low-occupancy issue).

Figure 7. Strong scalability speedup using different decompositions and optimizations for two grids (for GPUs)

Overlap of communication and computation (for both CPU and GPU)
Before presenting the performance of overlapping communication and computation (C-C), we first investigate how much performance gain can be obtained by using non-blocking MPI Isend/Irecv calls. Table 4 shows the CPU performance comparison between the two data transfer approaches for 4 and 8 processors. When non-blocking communication is used, there must be an explicit wait somewhere to ensure that the data transfer has finished before moving to the next iteration. There is a clear gain from non-blocking transfers, and it becomes more pronounced as NP increases; the gain is around 10.6% at NP = 8. Other numbers of processors have also been tested, and non-blocking communication is always faster than blocking communication.

Table 4. Non-blocking vs. blocking communication (for CPU)
Number of CPUs | Blocking      | Non-blocking
4              | 9.652 GFLOPS  | 9.676 GFLOPS
8              | 16.082 GFLOPS | 17.984 GFLOPS

In the C-C version, every sub-block is divided into two parts: the boundary and the interior. For large grids, the interior has many more nodes. The interior does not need data from other sub-blocks, so it can be computed at the same time that the data transfer is happening, hiding the communication latency. After the data are transferred, the computation continues by updating the boundaries of each sub-block. Figure 8(a) shows the strong scaling performance on two grid levels, comparing the overlapped (C-C) version and the non-overlapped version (3D V2) on the CPU. For the larger grid size (256^3), overlapping communication and computation does not provide much performance gain, which is surprising. For the smaller grid size (64^3), C-C performs much better than the non-overlapped version, especially when running on more processors (around a 31% improvement on 32 processors). Figure 8(b) gives the comparison of the C-C and non-overlapped strong scaling performance for two grid levels on GPUs; since the compute capability of a GPU is much higher than that of a CPU, larger grid sizes are used. On the GPU, C-C reduces the performance in all cases tested.
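The C-C structure described above can be sketched as follows (illustrative names; exchange_faces and update_boundary_nodes are hypothetical placeholders for the packing/MPI/unpacking and boundary-update steps):

   ! Interior nodes need no neighbor data, so their update is launched
   ! asynchronously while the faces are exchanged.
   !$acc parallel loop collapse(3) present(q, q_new) async(1)
   do k = 2, nk - 1
      do j = 2, nj - 1
         do i = 2, ni - 1
            q_new(i,j,k) = 0.25d0 * (q(i-1,j,k) + q(i+1,j,k) + q(i,j-1,k) + q(i,j+1,k))  ! placeholder stencil
         end do
      end do
   end do
   call exchange_faces(q)                 ! pack, MPI_Isend/Irecv, unpack (hypothetical routine)
   !$acc wait(1)                          ! interior update finished
   call update_boundary_nodes(q_new, q)   ! boundary layers, now that ghost data have arrived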

Figure 8. Comparison of C-C and non-overlapped strong scaling for two different grids: (a) CPU, (b) GPU

Different grid sizes were tested on CPUs to evaluate the performance of C-C. The maximum and minimum gains (negative values indicate a loss) are shown in Table 5. The maximum gain always occurs at NP = 32 and the minimum gain at NP = 8. For large grid sizes (512^3 or 256^3), C-C always performs worse than the non-overlapped version, but the performance loss is small (within 5%). For small grid sizes (64^3 and 32^3), the performance gain is very obvious. For the grid size 128^3, whether there is a gain or a loss depends on how many CPUs are used.

Table 5. Performance gain using C-C (for CPU)
Grid size | Maximum gain    | Minimum gain
32^3      | 63.9% @ NP=32   | 23.3% @ NP=8
64^3      | 31.3% @ NP=32   | 9.13% @ NP=8
128^3     | 10.6% @ NP=32   | -4.32% @ NP=8
256^3     | -2.49% @ NP=32  | -4.37% @ NP=8
512^3     | -1.37% @ NP=32  | -2.30% @ NP=8

Figure 9 gives the weak scaling performance gain (negative means a performance loss) for a grid size of 256^3 on both CPU and GPU. For weak scaling, C-C brings hardly any performance gain on either the CPU or the GPU; in terms of weak scaling, C-C is worse than the non-overlapped code.


Figure 9. Weak scaling performance gain for a grid size of 256^3

To determine why C-C does not scale very well, profiling was used. A schematic diagram of the profiling is given in Figure 10. In this figure, only the time from the start of sending buffers to the end of the residual calculation (residual and dissipation calculations are merged for simplicity) is considered for the non-overlapped version, and from the start of sending buffers to the end of receiving buffers for C-C. This simplification does not affect the later analysis.

(a) Non-overlapped execution

(b) C-C execution
Figure 10. Schematic diagram of profiling

The reason the GPU C-C version does not outperform the non-overlapped version appears to be that the kernel execution and the MPI data transfer are not overlapped very well, or that the overlapping makes the communication 5x longer. The behavior is unclear and seems strange (for some grid sizes the CPU version benefits but the GPU version does not). In this example, C-C needs 141.5 s to finish the sending/receiving and residual calculation while the non-overlapped version needs only 138.5 s. The reason behind this is unclear and requires more investigation.


B. Parallel optimization of SENSEI
Some considerations for the optimizations
The parallelization and some optimization of SENSEI were done by McCall in his thesis32, where detailed performance results and discussion can be found. There are several difficulties in porting a complicated code like SENSEI to the GPU. The first is solution debugging: it is usually difficult to check intermediate results, since the relevant variables must first be updated on the host before their values can be printed, and it becomes impossible inside an !$acc loop region because OpenACC forbids it. The second difficulty is that it is hard to parallelize a complicated code written in modern Fortran: OpenACC requires some additional modification of such code, and some of these restrictions are not clearly stated in the OpenACC standard or tutorials. McCall listed some ways to overcome these difficulties in his thesis: generation of temporary arrays as parameters to a procedure call, temporary arrays, multi-dimensional array assignments, etc. However, after some tests, we found that the GPU-accelerated code of McCall32 only achieves a 1.3x~3.4x speedup over the single-CPU performance, depending on the case. This is not as much speedup as was observed for the BDC code, so there is room to further improve SENSEI on GPUs.

After profiling the SENSEI code, we found many bottlenecks step by step. Some loops with very small trip counts had been mistakenly parallelized in the earlier code; removing the parallel directives enclosing these small loops gives a 1.14x~1.89x speedup over the earlier single-GPU performance, depending on the case. The baseline performance defined later is the performance of SENSEI after this correction. After obtaining this improvement, we continued to use the Nvidia visual profiler to find more bottlenecks. In his thesis32, McCall mentioned that the developer must manually create and pass temporary arrays as parameters to procedure calls if the kernel is to be executed on the GPU. He also found that enforcing the boundary conditions on the CPU instead of the GPU produced a 32% increase in performance. For a 3D case, the performance deterioration from enforcing boundary conditions on the GPU is much worse, especially when one dimension is very large. Moving the BC enforcement to the CPU may be a viable option for improving performance, but the gain is still unsatisfactory, and it leaves no opportunity to tune the BC enforcement itself. The manually created or automatically generated temporary arrays are a big problem for performance. Instead of passing array slices to a procedure, the entire array is passed together with the indices of the desired slice, as shown in Figure 11. This avoids creating the temporary arrays, whose allocation, deallocation and data movement can have a significant cost.

(a) Original: passing an array slice (an implicit temporary array is generated)
    !$acc parallel loop
    do …
       call func(a(imin : imax, jmin : jmax))
    enddo
    …

(b) Modified: passing the whole array with the slice indices
    !$acc parallel loop
    do …
       call func(a, imin, imax, jmin, jmax)
    enddo
    …

Figure 11. Pseudo code for avoiding automatically generated temporary arrays

In Figure 11, the original way of calling the routine passes the reshaped array slice, which has two issues. First, it is not well supported by the PGI compiler: it can return an "nvlink error", and whether passing array slices to a procedure call should be supported is still under internal discussion within the PGI compiler group. Second, the compilation feedback shows that temporary arrays are generated implicitly, which may greatly deteriorate the performance; this is the reason PGI does not support it.

Parallel solution
In order to analyze the performance of SENSEI in parallel, a supersonic inlet case was run. The inlet conditions for this problem are given in Table 6, and the outflow boundary is a supersonic outflow. The pressure solution for this problem is shown in Figure 12. The difference between the steady-state parallel solution and the serial solution is on the order of 10^-8.


Figure 12. Pressure distribution in the 2D supersonic inlet case

Table 6. Inflow conditions for the inlet case
Mach number | 4.0
Pressure    | 12270 Pa
Temperature | 217 K

Performance
There are several optimization versions. The baseline (BCs on CPU) is the version in which the boundary conditions are enforced on the CPU; the baseline (BCs on GPU) is the version in which they are enforced on the GPU; GPU V1 is the version in which the private temporary arrays are removed from the accelerated regions for all the boundaries; and GPU V2, built on GPU V1, is the further optimization described below.

GPU V2
The goal of this modification is to optimize the kernel that extrapolates to the cells at corners and edges. In the earlier version of the GPU-accelerated code, this kernel was executed on the GPU using "!$acc routine seq" and used some temporary arrays. Here, the "!$acc routine seq" is replaced by an "!$acc parallel loop" structure, together with removal of the temporary arrays. The new version has two advantages over the old one. First, when "!$acc routine seq" is used (even with "seq" removed), how the routine is executed on the GPU is left to the compiler, whereas "!$acc parallel loop" explicitly parallelizes the loops and returns higher performance; we can also tune the loops, which is convenient for our future autotuning work. Second, with "!$acc routine seq" the Nvidia profiler cannot report how the kernel is executed, so it is a black box to us, while with "!$acc parallel loop" we can easily determine how the kernel is mapped, how much time it costs, and whether there are other bottlenecks, offering a chance to further optimize the code.

Figures 13 and 14 show the performance of the different versions of BC enforcement on two platforms, Newriver and Thermisto. For both the 2D and 3D inlet cases, the performance of the baseline (BCs on GPU) is much worse than that of the baseline (BCs on CPU), which shows that a naïve implementation of GPU parallelism can make the code slower than the CPU version. Both figures also show that the improvements made to the BC enforcement on the GPU greatly improved the performance of the code as a whole. The improvement is larger for the 3D cases because the boundary conditions in 3D do more to improve the occupancy of the GPU.
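A sketch of the GPU V2 change (illustrative code, not the actual SENSEI routine) contrasts the two styles:

   ! Before (GPU V1 and earlier): the extrapolation routine is compiled for the
   ! device with "!$acc routine seq"; how it is launched is left to the compiler
   ! and the kernel cannot be inspected with the Nvidia profiler.
   !   !$acc routine seq
   !   subroutine extrapolate_edges(q, ...)
   !      ... sequential loops over edge and corner cells, using temporary arrays ...
   !   end subroutine
   !
   ! After (GPU V2): the loops are exposed directly to OpenACC, so the mapping to
   ! gangs/vectors is explicit, tunable, and visible in the profiler.
   !$acc parallel loop collapse(2) present(q)
   do k = kmin, kmax
      do j = jmin, jmax
         q(imin-1, j, k) = 2.0d0*q(imin, j, k) - q(imin+1, j, k)   ! simple edge extrapolation
      end do
   end do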

Figure 13. Performance of different versions of BC enforcement on Newriver

Figure 14. Performance of different versions of BC enforcement on Thermisto

Analysis
To investigate why the performance improves so much, profiling information for the optimized kernels is needed. The part of the code running on the CPU cannot be profiled by the Nvidia tools, but its cost can be approximated by manually measuring the corresponding interval in the timeline generated by the nvvp tool. Table 7 shows the time spent in the different kernels (on the GPU) or the corresponding routines (on the CPU) for one substep.


Table 7. Time spent in different kernels or routines (ms) on Thermisto
Kernel or routine                  | Baseline (BCs on CPU)        | Baseline (BCs on GPU) | GPU V1 | GPU V2
kernel computing boundary imin     |                              | 30.5                  | 0.0978 | 0.0947
kernel computing boundary imax     |                              | 29.0                  | 0.1401 | 0.1389
kernel computing boundary jmin     |                              | 96.5                  | 0.1771 | 0.1755
kernel computing boundary jmax     | ~229.9 (all BC routines,     | 77.3                  | 0.1060 | 0.1034
kernel computing boundary kmin     |  measured on the CPU)        | 123.5                 | 0.2413 | 0.2438
kernel computing boundary kmax     |                              | 117.4                 | 0.2418 | 0.2458
kernel updating edges & corners    |                              | ~94.7                 | 94.7   | 0.8114
total                              | 229.9                        | 568.9                 | 95.7   | 1.8135

From Table 7, the total BC time of the baseline (BCs on GPU) is 2.48 times that of the baseline (BCs on CPU), which is the reason McCall enforced the BCs on the CPU prior to this work. After optimization, GPU V1 reduces the time spent on the BCs on the six sides by more than a factor of 200, and with further improvements GPU V2 accelerates the kernel updating the edges and corners by a factor of 116.7. Although these kernels have been accelerated greatly, the speedup of GPU V1 and GPU V2 over the baseline (BCs on CPU) is just 2.08x for a 3D case, due to Amdahl's law33. For more analysis of why this occurs, see Appendix A.

C. Machine learning based autotuning of SENSEI
Preliminary introduction to artificial neural networks
Machine learning is a field that gives computers the ability to learn without being explicitly programmed. There are many models, such as decision trees, artificial neural networks, genetic algorithms, etc. An artificial neural network is composed of many layers of neurons. A neuron is a set of arithmetic operations, shown schematically in Figure 15. Each neuron is followed by an activation function, such as a step, linear, or sigmoid function; the effect of the activation function is to introduce non-linearity into the network before the output is sent to the next layer. An artificial neural network is a loss-driven method, in which a loss function needs to be defined beforehand. The loss function can be defined as an error norm between the predicted value and the real value, or any other meaningful function quantifying the error. Usually a penalty term is added to the loss function to make the solution smooth.
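As an illustration (this particular form is an assumption for exposition, not necessarily the exact loss used in the framework), a typical regression loss with an L2 weight penalty is

\[
L = \frac{1}{N}\sum_{i=1}^{N}\left(y_{pred,i} - y_{actual,i}\right)^2 + \alpha \lVert W \rVert_2^2,
\]

where W collects the network weights and α controls the strength of the penalty.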

Figure 15. A neuron

Many neurons together form an artificial neural network, shown in Figure 16. The first layer is the input layer; tuning parameters such as the gang size and vector length of the different kernels, the kind of loop structure, etc., can be regarded as inputs. After the input layer there are several hidden layers, ranging from one layer to 30~50 layers or more. More hidden layers make the network more complex and able to produce more accurate predictions. However, more layers may also cause severe issues, such as a slow convergence rate, vanishing gradients, and exploding gradients33.

Figure 16. Artificial neural network

A follow-up to AUMA
This paper uses an artificial neural network model. We followed up on a framework called AUMA introduced in Ref. [23] and adapted the code to work with our in-house CFD code SENSEI. Several improvements were made along the way, including using different training algorithms, scaling the input features, using a better error estimate, and optimizing the training expense. Figure 17 shows the process of training the model using the machine learning based approach. The machine learning code first generates random samples, which are the inputs to the neural network (the different tuning parameters). These inputs are passed to SENSEI, which runs the sample cases and measures the execution time. The execution times are passed back to the machine learning code, which creates and trains the artificial neural network; the trained network is then used to predict the parameters expected to produce the fastest running code. SENSEI then runs all the predicted cases and returns the actual execution times to the machine learning code, which finally returns the best performing parameters of that round; these should be the best parameters to use in SENSEI.

Figure 17. A follow-up to AUMA

Some settings
The training algorithm used is the Adaptive Moment Estimation (Adam) method34. It has many parameters which may need to be tuned to get good results; the settings of some of these parameters are given in Table 8. When training and predicting, we need a metric to assess how good the model and its predictions are. Scikit-learn35, a well-known machine learning framework, provides the metric score R2 in its toolkit, whose definition is given in Eq. (5),


\[
R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_{actual,i} - y_{pred,i}\right)^2}{\sum_{i=1}^{N}\left(y_{actual,i} - y_{actual,mean}\right)^2}, \qquad (5)
\]

where yactual,i, and ypred,i are the actual value and the predicted value of sample i, respectively, and yactual,mean is the mean actual value of all the samples. N is the number of samples. If R2 is very close to 1, then the fitting is very good and if R2 is a very large negative value, then the fitting is very poor. Table 7. Parameter settings in the ML code Stepsize α 0.001 The 1st moment exponential decay factor β1 0.95 The 2nd moment exponential decay factor β2 0.9 Activation function solver Relu Initial learning rate learning_rate_init 0.0001 Maximum epoch number max_iter 400 Tolerance tol 10-6 Numerical stability factor ϵ 1e-9 Feature scaling is an improvement method which is often used to improve the accuracy in machine learning based application. Scikit-learn provides different kinds of scaling including standardscaler, minmaxscaler, maxabsscaler, etc. These scalers are affine and enables all features to vary on the comparable levels. The standardscalar is used in this paper as it is one of the most straightforward scalers. The scaler is used to remove the mean value of each features, and then scaled by dividing their standard deviation, which is called “mean removal and variance scaling”. The standard sclaler is appropriate for data which has the property of Gaussian distribution. However, in some cases even if the data does not have such a distribution or if we do not know too much about the distribution, it is often used because of its simplicity. The definition of the scaler is defined as

\[
x_{s,j,i} = \frac{x_{j,i} - \bar{x}_j}{\sigma_j}, \qquad (6)
\]

where the subscript j denotes the jth feature, i the ith sample, and s "scaled".

Data collection
There are many parameters that could be tuned in SENSEI. Only the three most important kernels, which take most of the execution time (about 60%), are considered in this paper. Each of the three kernels is a triply nested loop; they compute the fluxes in the three directions: xi_flux, eta_flux, and zeta_flux. The ranges of the six tuning parameters are given in Table 9. Every kernel has two tuning parameters, a gang size and a vector length (for example, inviscid_rhs_eta_flux_g and inviscid_rhs_eta_flux_v for the eta flux kernel), which are runtime variables predefined in a namelist file (a sketch of how such parameters can be applied to a kernel is given at the end of this subsection). The whole search space contains 512,000 configurations. In our work, 4000 samples (0.78%) are generated, with 3000 samples (0.59%) used for the training stage (setting up a ML model) and the remaining 1000 samples (0.20%) used for the predicting stage (evaluating how good the model is on unknown samples).

Table 9. Parameter ranges for the six tuning parameters
Parameter                          | Parameter range
Gang size for xi_flux kernel       | 100, 200, …, 1000
Vector length for xi_flux kernel   | 32, 64, …, 256
Gang size for eta_flux kernel      | 100, 200, …, 1000
Vector length for eta_flux kernel  | 32, 64, …, 256
Gang size for zeta_flux kernel     | 100, 200, …, 1000
Vector length for zeta_flux kernel | 32, 64, …, 256

Since we want to train as efficiently as possible, we need an iteration independence analysis, which is similar to a grid independence study in a CFD simulation. The iteration independence analysis requires that the measured performance per step does not change too much as the number of iteration steps increases. We define the average execution time per step at 100 steps as the "true" value, and consider a measurement acceptable if the average execution time per step is within ±1% of this value. The average execution time per step and its difference from the true value at different step counts are given in Figure 18 and Table 10, respectively. Running for 5 steps was chosen because it is within 1% of the "true" time per step measured with 100 iteration steps.

Figure 18. Average execution time per step vs. iteration steps

Table 10. The difference in average execution time per step at different step counts
Iteration steps | Execution time/step | Difference
100             | 0.4638 s            | 0
5               | 0.4672 s            | 0.73%
3               | 0.475 s             | 2.41%
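A sketch of how such runtime tuning parameters might be read from a namelist and applied to a kernel follows (the file name, default values and loop are illustrative; the parameter names are patterned on those given above):

   integer :: inviscid_rhs_eta_flux_g = 400    ! gang count for the eta flux kernel
   integer :: inviscid_rhs_eta_flux_v = 128    ! vector length for the eta flux kernel
   namelist /tuning/ inviscid_rhs_eta_flux_g, inviscid_rhs_eta_flux_v

   open(10, file='tuning.nml')                 ! hypothetical namelist file written by the ML code
   read(10, nml=tuning)
   close(10)

   !$acc parallel loop collapse(3) num_gangs(inviscid_rhs_eta_flux_g) &
   !$acc&   vector_length(inviscid_rhs_eta_flux_v)
   do k = 1, nk
      do j = 1, nj
         do i = 1, ni
            ! ... eta-direction flux computation ...
         end do
      end do
   end do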

Results
Figure 19 shows the results of the training stage and the predicting stage, respectively. The vertical and horizontal axes represent the predicted time and the actual time of the samples, respectively. If the predicted time and the actual time were exactly the same for all samples, all scattered points would lie on the red line with a slope of 1. However, that is not what we really want to achieve, since it would mean overfitting the data, which may imply bad predictions for unknown data; we want some error, so that R2 is not exactly 1 but close to it. From Figure 19, the R2 scores in both the training and predicting stages are very close to 1, and most points lie close to the line, meaning the predicted and actual times are close for all samples. The R2 score is a little better for the training data than for the predicting data, which is normal since the predicting data are unknown to the machine learning model and may contain features that do not appear in the training data.


Figure 19. Predicted runtime vs. actual runtime (N=80, Lh=3): (a) training samples, (b) test samples

There are many hyperparameters that determine the accuracy of the machine learning model. We are interested in how sensitive the accuracy of the model is to these hyperparameters and want to make the model more robust. The hyperparameters include (but are not restricted to) the number of neurons per layer, the number of hidden layers, the activation function, the definition of the loss function, whether a sparsely connected network is used, whether the data are scaled, etc. Three hyperparameters are considered in this paper: the number of neurons per layer, the number of hidden layers, and whether the data are scaled. Figures 20 and 21 show the results after changing the number of neurons per layer and the number of hidden layers, with the total number of neurons fixed at 240. Together with Figure 19, all three figures show that the accuracy of the machine learning model is not sensitive to the number of neurons per layer or the number of hidden layers. Other values were also tried; only when the number of hidden layers is increased to 8 or more does the accuracy drop markedly and the iteration become unstable. This indicates that the machine learning model is robust.

(a) Training samples (b) Test samples
Figure 20. Prediction runtime vs actual runtime (N=240, Lh=1)

(a) Training samples (b) Test samples
Figure 21. Prediction runtime vs actual runtime (N=120, Lh=2)

Initially we applied a scaling rule provided by the scikit-learn toolkit, and the results were good. However, we are also interested in how well the model performs when unscaled data are used. Figure 22 shows the result using 5 hidden layers, each with 80 neurons. If only 2 or 3 hidden layers are used with unscaled data, the prediction for all inputs is simply a constant, so the number of hidden layers and the number of neurons were changed in this case in an attempt to obtain better results. Even so, both the training and the prediction are very poor: the data lie far from the linear line, which means the predicted and actual runtimes are quite different, and the R2 score is negative.

(a) Training samples (b) Test samples
Figure 22. Prediction runtime vs actual runtime (N=80, Lh=5)
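The scaling step referred to above can be sketched as a feature-wise rescaling applied before training; StandardScaler is used here purely as an example (whether the actual scaling rule is this one or another scikit-learn scaler is an assumption), again reusing the arrays from the earlier sketch.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

# Fit the scaler on the training features only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

model = MLPRegressor(hidden_layer_sizes=(80, 80, 80), solver="adam",
                     max_iter=2000, random_state=0)
model.fit(X_train_s, y_train)
print("prediction R2 with scaling:", r2_score(y_test, model.predict(X_test_s)))
```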


VI. Conclusions & Future Work

OpenACC directives are used to accelerate our in-house CFD code because of the ease of programming, the good portability, and the fair performance they provide. For the multi-GPU-accelerated 3D BDC code, decomposing the domain in multiple dimensions improves the efficiency. A number of optimizations were also examined, such as creating shared buffers to store noncontiguous boundary data, reducing the overhead of communication and synchronization, and overlapping communication and computation. Each of these increases the performance considerably, except for overlapping communication and computation, which brings little benefit in the BDC code.

For the GPU-accelerated SENSEI, several optimizations have been made. Removing private temporary arrays in OpenACC kernels gives a large performance improvement for both the 2D and 3D cases, as it saves a lot of memory and also improves cache utilization.

For the machine learning based autotuning of SENSEI, six tuning parameters for the three most time-consuming kernels are used to tune the performance. We followed prior work by other researchers and adapted it to our own needs. The machine learning model developed gives very good predictions of the execution time for unknown data (about 0.9 in terms of the R2 score). Moreover, the model is not sensitive to the number of neurons in each layer or the number of hidden layers, although using unscaled data causes it to perform badly.

In terms of future work, there are two main directions. First, one of our goals is to further optimize the performance of SENSEI through profiling. Our next task is to optimize the interblock boundary treatment to improve the multi-GPU performance, since only single-GPU performance of SENSEI is given in this paper. Making better use of the cache, for example through tiling, should also be investigated.36-38 Finally, we want to optimize the implicit solver and use a hybrid GPU+CPU approach to accelerate SENSEI. Second, we want to use machine learning or even deep learning techniques to autotune SENSEI with more parameters, such as the iterative solver, the time advancement scheme, and the hardware architecture. Identifying which parameters affect the performance most is another important task.

References

1Zhiyin, Y., "Large-eddy simulation: Past, present and the future," Chinese Journal of Aeronautics, Vol. 28, No. 1, 2015, pp. 11-24.
2"OpenMP Application Programming Interface, Version 4.5," 2015.
3Barney, B., "Message Passing Interface," https://computing.llnl.gov/tutorials/mpi/.
4Zollweg, J., "Hybrid Programming with OpenMP and MPI," https://www.cac.cornell.edu/education/Training/Intro/Hybrid090529.pdf.
5Berger, M. J., Aftosmis, M. J., Marshall, D. D., and Murman, S. M., "Performance of a new CFD flow solver using a hybrid programming paradigm," Journal of Parallel and Distributed Computing, Vol. 65, No. 4, 2005, pp. 414-423.
6Amritkar, A., Deb, S., and Tafti, D., "Efficient parallel CFD-DEM simulations using OpenMP," Journal of Computational Physics, Vol. 256, 2014, pp. 501-519.
7Gourdain, N., Gicquel, L., Montagnac, M., Vermorel, O., Gazaix, M., Staffelbach, G., et al., "High performance parallel computing of flows in complex geometries: I. methods," Computational Science & Discovery, Vol. 2, No. 1, 2009, p. 015003.
8Mininni, P. D., Rosenberg, D., Reddy, R., and Pouquet, A., "A hybrid MPI-OpenMP scheme for scalable parallel pseudospectral computations for fluid turbulence," Parallel Computing, Vol. 37, No. 6, 2011, pp. 316-326.
9Hoeflinger, J., Alavilli, P., Jackson, T., and Kuhn, B., "Producing scalable performance with OpenMP: Experiments with two CFD applications," Parallel Computing, Vol. 27, No. 4, 2001, pp. 391-413.
10Yilmaz, E., Payli, R., Akay, H., and Ecer, A., "Hybrid Parallelism for CFD Simulations: Combining MPI with OpenMP," Parallel Computational Fluid Dynamics 2007, Lecture Notes in Computational Science and Engineering, Vol. 67, Springer, Berlin, Heidelberg.
11Herdman, J. A., Gaudin, W. P., McIntosh-Smith, S., Boulton, M., Beckingsale, D. A., Mallinson, A. C., et al., "Accelerating hydrocodes with OpenACC, OpenCL and CUDA," High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, IEEE, Nov. 2012, pp. 465-471.
12Chandar, D. D., Sitaraman, J., and Mavriplis, D. J., "A Hybrid Multi-GPU/CPU Computational Framework for Rotorcraft Flows on Unstructured Overset Grids," 21st AIAA Computational Fluid Dynamics Conference, 2013, p. 2855.
13Jacobsen, D. A., and Senocak, I., "Multi-level parallelism for incompressible flow computations on GPU clusters," Parallel Computing, Vol. 39, No. 1, 2013, pp. 1-20.
14Elsen, E., LeGresley, P., and Darve, E., "Large calculation of the flow over a hypersonic vehicle using a GPU," Journal of Computational Physics, Vol. 227, No. 24, 2008, pp. 10148-10161.
15Brandvik, T., and Pullan, G., "Acceleration of a 3D Euler solver using commodity graphics hardware," 46th AIAA Aerospace Sciences Meeting and Exhibit, Jan. 2008, p. 607.
16Gorobets, A., Trias, F. X., and Oliva, A., "An OpenCL-based Parallel CFD Code for Simulations on Hybrid Systems with Massively-parallel Accelerators," Procedia Engineering, Vol. 61, 2013, pp. 81-86.
17Pickering, B. P., Jackson, C. W., Scogland, T. R., Feng, W. C., and Roy, C. J., "Directive-based GPU programming for computational fluid dynamics," Computers & Fluids, Vol. 114, 2015, pp. 242-253.
18Xu, R., Tian, X., Chandrasekaran, S., and Chapman, B., "Multi-GPU support on single node using directive-based programming model," Scientific Programming, 2015, p. 3.
19Xia, Y., Lou, J., Luo, H., Edwards, J., and Mueller, F., "OpenACC acceleration of an unstructured CFD solver based on a reconstructed discontinuous Galerkin method for compressible flows," International Journal for Numerical Methods in Fluids, Vol. 78, No. 3, 2015, pp. 123-139.
20Luo, L., Edwards, J. R., Luo, H., and Mueller, F., "Performance assessment of a multiblock incompressible Navier-Stokes solver using directive-based GPU programming in a cluster environment," 52nd Aerospace Sciences Meeting, 2013.
21Jia, W., Shaw, K. A., and Martonosi, M., "Starchart: Hardware and software optimization using recursive partitioning regression trees," Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, IEEE Press, Oct. 2013, pp. 257-268.
22Collins, A., Fensch, C., Leather, H., and Cole, M., "MaSiF: Machine learning guided auto-tuning of parallel skeletons," 20th International Conference on High Performance Computing (HiPC), IEEE, Nov. 2013, pp. 186-195.
23Falch, T. L., and Elster, A. C., "Machine learning-based auto-tuning for enhanced performance portability of OpenCL applications," Concurrency and Computation: Practice and Experience, 2016.
24Lantz, S., "Scalability," Workshop: High Performance Computing on Stampede, 2015.
25Abdi, D. S., and Bitsuamlak, G. T., "Asynchronous parallelization of a CFD solver," Journal of Computational Engineering, 2015.
26Wolley, C., "GPU Optimization Fundamentals," https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf, 2013.
27Rinard, M., "Communication Optimizations for Parallel Computing Using Data Access Information," Proceedings of the IEEE/ACM SC95 Conference, 1995.
28"Domain decomposition," http://mosaic.mpi-cbg.de/docs/DCAMM/04_Thursday_1.pdf.
29Wolley, C., "Profiling and Tuning OpenACC Code," http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0517B-Monday-Programming-GPUs-OpenACC.pdf.
30Chorin, A. J., "A numerical method for solving incompressible viscous flow problems," Journal of Computational Physics, Vol. 135, No. 2, 1997, pp. 118-125.
31Derlaga, J. M., Phillips, T., and Roy, C. J., "SENSEI computational fluid dynamics code: A case study in modern Fortran software development," 21st AIAA Computational Fluid Dynamics Conference, 2013, p. 2450.
32McCall, A., "Multi-level Parallelism with MPI and OpenACC for CFD Applications," Master's Thesis, Kevin T. Crofton Department of Aerospace and Ocean Engineering, Virginia Tech, Blacksburg, VA, 2017.
33"Why are deep neural networks hard to train?," http://neuralnetworksanddeeplearning.com/chap5.html.
34Kingma, D., and Ba, J. L., "Adam: A method for stochastic optimization," Proceedings of the 3rd International Conference on Learning Representations, 2014.
35scikit-learn, http://scikit-learn.org/stable/.
36Lam, M. D., Rothberg, E. E., and Wolf, M. E., "The cache performance and optimizations of blocked algorithms," ACM SIGARCH Computer Architecture News, Vol. 19, No. 2, ACM, Apr. 1991, pp. 63-74.
37Rivera, G., and Tseng, C. W., "Tiling optimizations for 3D scientific computations," Supercomputing, ACM/IEEE 2000 Conference, IEEE, Nov. 2000, p. 32.
38Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., et al., "Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures," Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, Nov. 2008, p. 4.

Appendix A.

According to Amdahl's law33, the speedup for a fixed-size problem is given as

\[
S(p, N) = \frac{1}{(1 - p) + \dfrac{p}{N}}, \qquad (7)
\]

where S is the theoretical speedup of the whole task, N is the speedup of the part of the task that benefits from the improvement, and p is the proportion of the runtime taken by that part. Suppose we have a code in which all parts are treated as serial except for the boundary conditions on the 6 sides and the edges & corners. This is the baseline, or "serial", code, and we want to investigate how much speedup can be obtained by improving the 7 kernels related to the boundary conditions and the edges & corners. We can do this step by step. First, comparing the baseline (BCs on CPU) with the baseline (BCs on GPU), nvvp shows that the optimized proportion is p1 = 0.53 (obtained by measuring the lengths of the corresponding intervals in the nvvp timeline). The speedup N1 of the 7 kernels and the resulting global speedup are then

\[
N_1 = \frac{229.9}{568.9} = 0.404, \qquad
S(p_1, N_1) = \frac{1}{(1 - p_1) + \dfrac{p_1}{N_1}} = \frac{1}{1 - 0.53 + \dfrac{0.53}{0.404}} = 0.56,
\]

while from Figure 14 the measured speedup is 0.41x, reasonably close to the theoretical 0.56x. The discrepancy is caused by other overhead that is not accounted for and by the inaccuracy of approximating the routine time, as it is mixed with the CPU-GPU data exchange in the nvvp timeline. Similarly, comparing the baseline (BCs on GPU) with the GPU V1 version, the optimized proportion is p2 = 0.56, and the speedup N2 of the 6 kernels computing the BCs and the resulting global speedup are

\[
N_2 = \frac{474.2}{1.004} = 472, \qquad
S(p_2, N_2) = \frac{1}{(1 - p_2) + \dfrac{p_2}{N_2}} = \frac{1}{1 - 0.56 + \dfrac{0.56}{472}} = 2.27,
\]

while from Figure 14 the measured speedup of GPU V1 over the baseline (BCs on GPU) is 3.46x. This discrepancy may be due to better cache use once all private temporary arrays were removed, which also saved a lot of memory. Finally, we compare the GPU V1 and GPU V2 versions. The optimized proportion is p3 = 0.21, and the speedup N3 of the kernel computing the values at the corners and edges and the resulting global speedup are

\[
N_3 = \frac{97.3}{0.8114} = 116, \qquad
S(p_3, N_3) = \frac{1}{(1 - p_3) + \dfrac{p_3}{N_3}} = \frac{1}{1 - 0.21 + \dfrac{0.21}{116}} = 1.26,
\]

and from Figure 14 the measured speedup of GPU V2 over GPU V1 is 1.46x, which is close. The measured speedup is again larger than the theoretical one; the reason may be more efficient use of the cache, as further private temporary arrays were removed, which saves resources. The theoretical speedup is only a very coarse approximation, but it helps to locate other bottlenecks and provides an estimate of how much speedup can be obtained. When optimizing the performance of a code, the proportion of the code that is executed sequentially can be the dominant bottleneck, as is shown in this case: although the speedup of one or several kernels can be more than 500, the final speedup is only about 2x. On NewRiver, a similar analysis can be made.
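For reference, the three estimates above follow directly from Eq. (7); the short sketch below reproduces the arithmetic using the p values and kernel speedups quoted from the nvvp measurements.

```python
def amdahl(p, n):
    """Overall speedup from Eq. (7) when a fraction p of the runtime
    is sped up by a factor n."""
    return 1.0 / ((1.0 - p) + p / n)

print(amdahl(0.53, 229.9 / 568.9))   # baseline (BCs on CPU) -> baseline (BCs on GPU): ~0.56
print(amdahl(0.56, 474.2 / 1.004))   # baseline (BCs on GPU) -> GPU V1: ~2.27
print(amdahl(0.21, 97.3 / 0.8114))   # GPU V1 -> GPU V2: ~1.26
```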

