Pattern-Driven Hybrid Multi- and Many-Core Acceleration in the MPAS Shallow-Water Model

Peng Zhang∗†, Yulong Ao∗†, Chao Yang∗‡, Yiqun Liu∗†, Fangfang Liu∗, Changmao Wu∗, Haitao Zhao∗
∗ Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
† University of Chinese Academy of Sciences, Beijing 100049, China
‡ State Key Laboratory of Computer Science, Chinese Academy of Sciences, Beijing 100190, China
Corresponding author: Chao Yang ([email protected]).

Abstract—There is an urgent demand for efficient methodologies that enable hybrid multi- and many-core acceleration in global climate simulations. The Model for Prediction Across Scales (MPAS) is a family of earth-system component models that is receiving increasing attention. Like many other models, MPAS, though it features some emerging numerical algorithms, employs a pure MPI approach for parallel computing, which, to date, lacks support for multi-threaded parallelism, especially on many-core accelerated systems. In this work, we extend the shallow-water model in MPAS to demonstrate a pattern-driven approach for hybrid multi- and many-core acceleration of climate models. We first identify all basic computation patterns through a rigorous analysis of the MPAS code. Then, for the whole model, we use the identified patterns as building blocks to draw a data-flow diagram, which serves as an effective indicator for recognizing data dependencies and exploiting inherent parallelism. Finally, based on the data-flow diagram, a hybrid algorithm is designed to support concurrent computations on both multi-core CPUs and many-core accelerators. We implement the algorithm and optimize it on an x86-based heterogeneous supercomputer equipped with both Intel Xeon CPUs and Intel Xeon Phi devices. Experiments show that our hybrid design delivers an 8.35x speedup as compared to the original code and scales up to 64 processes with a nearly ideal parallel efficiency.

Keywords: global atmospheric modeling; many-core acceleration; hybrid algorithm; Intel Xeon Phi; MPAS

I. INTRODUCTION

There are grand challenges lying ahead in modeling and predicting the changes in the global climate [1–3]. While many Earth system models have been developed to build different component models in a consistent way, frameworks or methodologies that help promote extreme-scale climate simulations on today's state-of-the-art supercomputers are urgently required. Due to power consumption limits, heterogeneous architectures based on both multi-core processors and many-core accelerators have become a competitive choice in designing supercomputing systems. Examples include Tianhe-2, which employs both Intel Xeon CPUs and Intel Xeon Phi coprocessors, and Titan, which uses both AMD Opteron CPUs and NVIDIA Tesla K20x GPUs. Although the performance increase of heterogeneous architectures can be tremendous, developers have to face additional

difficulties in exploiting hybrid acceleration from different types of processing units. Among the many Earth system models, the Model for Prediction Across Scales (MPAS, [4]) is a collaborative project for developing future-generation atmosphere, ocean and other component models that are built with advanced mathematical concepts such as the spherical centroidal Voronoi tessellations (SCVTs, [5]). Like many other models, MPAS employs a pure MPI approach for parallel computing, which, to date, lacks support for multi-threaded parallelism, especially on many-core accelerated heterogeneous systems. The unstructured SCVT meshes employed in MPAS, as compared to the regular meshes frequently used in many other models, bring extra difficulties in designing algorithms that fit modern computing systems. To demonstrate a new approach for designing hybrid multi- and many-core accelerations of climate models, we select the shallow-water model in MPAS as a proxy and extend it to support multi- and many-core heterogeneous architectures. Our proposed approach consists of three major steps. First, all basic computation patterns on the SCVT mesh are identified through a rigorous analysis of the MPAS code. Second, by using the identified patterns as fundamental building blocks, a data-flow diagram of the whole model is drawn, which further helps recognize data dependencies and inherent parallelism throughout the code. Third, a hybrid algorithm for concurrent heterogeneous computing and overlapped data movement is designed by following the data dependencies and exploiting the inherent parallelism. We implement the algorithm and optimize it on an x86-based heterogeneous supercomputer equipped with both Intel Xeon CPUs and MIC (Many Integrated Core) based Intel Xeon Phi accelerators. Experiments show that our hybrid design delivers an 8.35x speedup as compared to the original code and scales up to 64 processes with a nearly ideal parallel efficiency. The rest of this paper is organized as follows. In Section 2, we introduce the background of our research, including related work on accelerating geoscientific models, a brief introduction to the MPAS model, and a discussion on different choices for designing hybrid algorithms. We then

present our pattern-driven hybrid method in Section 3, where we show how to identify basic computation patterns, compose a data-flow diagram and design a hybrid algorithm based on it. Also shown in Section 3 is how we deal with irregular reductions that commonly occur in many computation patterns. Following Section 3, details on implementing and optimizing the hybrid algorithm are provided in Section 4. We conduct experiments on a hybrid CPU-MIC system and show the results in Section 5 to validate the correctness of the hybrid implementation and assess the performance of different optimization techniques. Scalability results are also presented in Section 5 to demonstrate that the proposed hybrid design achieves a high speedup as compared to the original code and maintains a nearly ideal scalability. The paper is concluded in Section 6.

II. BACKGROUND

A. Related Work

A large body of work has been done on many-core acceleration of certain modules in well-known climate or weather-forecasting models, such as the cloud microphysics processes [6–8], the chemical kinetics kernels [9], and the long-wave radiation modules [10], among others. In many of these works, significant performance increases are achieved in the specific many-core accelerated modules, but the overall speedups of the entire models are still limited. To enable many-core acceleration of whole geoscientific models, continuous efforts have been made. For example, Fuhrer et al. [11] accelerated the dynamic core of the COSMO weather forecasting model on GPUs with a speedup of 2.8x. Xu et al. [12] ported the whole parallel Princeton Ocean Model to a 4-GPU cluster and delivered performance equivalent to a pure CPU system with over 400 CPU cores. Shimokawabe et al. proposed a multi-GPU algorithm for the ASUCA nonhydrostatic model and scaled to nearly 4,000 GPUs on the TSUBAME 2.0 supercomputer [13, 14]. Xue et al. [15] designed a hybrid CPU-MIC algorithm for global shallow-water simulations on Tianhe-2, scaling to thousands of heterogeneous computing nodes. These works, though they demonstrate promising speedups with convincing results, often require the whole model code to be rewritten completely, introducing extra difficulties for long-term maintenance and future revisions.

For the MPAS model, we present a new approach for the design and implementation of hybrid algorithms on modern heterogeneous supercomputers. We take the shallow water model of MPAS as a starting point and demonstrate the approach on a hybrid CPU-MIC platform. Our paper differs from the above works in several aspects. First, we focus on accelerating the whole model instead of a certain portion of it. Second, a new approach based on data-flow diagrams is proposed to guide the design and implementation of hybrid algorithms. Third, the hybrid algorithm is flexible for any heterogeneous architecture with arbitrary host-to-device ratios. And fourth, the data-flow diagram is easy to revise to incorporate future model development.

B. Problem Description

The spherical shallow-water equations used in MPAS are expressed as

  ∂h/∂t + ∇·F = 0,
  ∂u/∂t + q F⊥ = −g∇(h + b) − ∇K,        (1)

where the prognostic variables are the fluid thickness h and the velocity u. In addition, the flux-related terms are F = hu and F⊥ = k × hu, the kinetic energy is K = |u|²/2, and the total potential vorticity is q = [k·(∇×u) + f]/h. Except for the unit normal vector k, all other symbols, including the gravitational coefficient g, the Coriolis parameter f and the bottom topography b, are given constants.

Figure 1. Schematic diagram of all three types of mesh points (mass, velocity and vorticity points) on the C-staggered horizontal Voronoi mesh, together with the Voronoi cells and the dual Delaunay triangles.

One of the most important features of MPAS is the spherical centroidal Voronoi tessellation (SCVT, [16]) based horizontal mesh. Given a domain that spans the surface of the sphere and a set of distinct points, one can generate an SCVT mesh accordingly. The SCVT mesh is comprised of Voronoi mesh cells that are mostly hexagons, with a dual mesh consisting of Delaunay triangles. A standard finite-volume scheme [17, 18] with C-grid staggering on the SCVT mesh is employed in MPAS for spatial discretization. As shown in Figure 1, there are three types of mesh points representing discretized physical variables that are related to mass, velocity and vorticity, respectively. For the shallow water model, the prognostic variables h and u are discretized separately at the mass points and the velocity points. A number of other intermediate variables are defined at the three types of mesh points for use during the computation.
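For orientation, the sketch below lists the kind of connectivity arrays that back this staggering. CellsOnEdge, EdgesOnCell and nEdgesOnCell reappear in Algorithms 2 and 3 later in the paper; the module name and the remaining declarations are illustrative assumptions rather than the exact MPAS data structures.

! Illustrative sketch (not the actual MPAS declarations) of the mesh fields
! and connectivity arrays behind the three types of mesh points.
module mesh_sketch
   implicit none
   integer :: nCells, nEdges, nVertices, maxEdges
   ! prognostic fields: thickness h on mass points, normal velocity u on velocity points
   real(kind=8), allocatable :: h(:)              ! size nCells
   real(kind=8), allocatable :: u(:)              ! size nEdges
   ! connectivity between the different kinds of mesh points
   integer, allocatable :: CellsOnEdge(:,:)       ! (nEdges, 2): the two cells sharing an edge
   integer, allocatable :: EdgesOnCell(:,:)       ! (nCells, maxEdges): edges bounding a cell
   integer, allocatable :: nEdgesOnCell(:)        ! (nCells): number of edges of each cell
   integer, allocatable :: VerticesOnEdge(:,:)    ! (nEdges, 2): endpoints (vorticity points)
end module mesh_sketch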

Algorithm 1 RK-4 time-stepping algorithm.
 1: for time_step ← 1 to n do
 2:   for RK_step ← 1 to 4 do
 3:     call compute_tend
 4:     call enforce_boundary_edge
 5:     if RK_step < 4 then
 6:       call compute_next_substep_state
 7:       call compute_solve_diagnostics
 8:       call accumulative_update
 9:     else
10:       call accumulative_update
11:       call compute_solve_diagnostics
12:       call mpas_reconstruct
13:     end if
14:   end for
15: end for

All component models in MPAS follow the same three-phase running procedure, namely initialization, time integration and finalization. The initialization phase is responsible for reading meshes, allocating memory and initializing data structures for later computation. The finalization phase writes back the computation results and does some cleaning, such as deallocating the memory that has been allocated during initialization. The most time-consuming phase, which lies between the above two, is time integration. In this phase, the main computation occurs at each time step following a fourth-order Runge-Kutta (RK-4) procedure [19]. For the shallow water model, the RK-4 time-stepping algorithm is shown in Algorithm 1. All models in MPAS are based on the same framework, which only supports MPI for parallel computing. A hybrid algorithm is therefore required to exploit intra-node parallelism on multi- and many-core accelerated heterogeneous platforms.

C. Discussion on Hybrid Algorithm Design

On heterogeneous systems equipped with both multi-core CPUs and many-core accelerators, we may either port all kernels to the accelerators, leaving the CPU aside to handle MPI communication, or design a hybrid algorithm that utilizes both types of heterogeneous computing resources. The former method can be suitable for accelerator-rich platforms on which the computing capacity of the CPUs is negligible as compared to the accelerators, e.g., the TSUBAME 2.0 supercomputer [20]. The latter, on the other hand, aims to make a balanced utilization of both CPUs and accelerators, and has the potential to achieve higher performance on a wide range of heterogeneous architectures. In this paper, we only consider the hybrid method, which fully exploits the intra-node computing resources. To that end, two very different design choices are available:
• Code-level algorithm design.
• Kernel-level algorithm design.

In the code-level design, the data structure is usually redesigned so as to adapt to the heterogeneous architecture. As a result, the original code is heavily revised or even entirely rewritten, which requires a lot of work. Although the performance gain might be substantial, the optimized code is hard to maintain for long-term development. This makes it impractical to apply in many climate simulation projects.

Figure 2. Flowchart of a hybrid algorithm based on the kernel-level design (kernels assigned to the CPU and MIC sides, with halo exchanges and CPU-MIC data transfers in between).

In the kernel-level design, one usually profiles the code to identify the most time-consuming kernels. These kernels are then ported to the accelerators for further optimization. Kernels that are independent of each other can be run simultaneously on host and device. This method is minimally invasive to the original code and is therefore friendly to future code maintenance and development. However, the performance gain of this method is limited due to the partial optimization of the code and the repeated data transfers between host and device. Besides, the data dependencies among kernels are unclear and hard to track, so the load balance between the heterogeneous processors is usually far below what is expected.

Here, for comparison purposes, we briefly demonstrate how a kernel-level hybrid algorithm can be designed for the shallow water model in MPAS. In the kernel-level design, computation kernels are treated as the basic elements of the application. The code is profiled to examine the cost of each kernel. Then the data dependency among all kernels is identified by checking the input and output of each kernel. We say that there exists a data dependence between two kernels only when the output of one kernel is related to the input of the other. If there is no dependence between two kernels, they can be launched concurrently. Since the many-core device has higher computing power, the more time-consuming kernels reside on it. The kernel-level design of the hybrid algorithm is shown in Figure 2. In this figure, all the computation kernels are separated into two groups: kernels in the left group run on the CPUs and those in the right group run on the accelerators. The "Exchange halo" blocks with red arrows represent MPI communication in the multi-process case. It is expected that performance-critical kernels are assigned to the accelerators so that they can be accelerated as much as possible, while some other kernels are run concurrently on the CPU side in order to exploit the computing capacity of the CPUs. But it is not very clear how the kernels should be arranged on the CPU side to maximize the performance, and the predictable load imbalance between the CPU and MIC sides also degrades the overall performance.

Figure 3. All eight stencil patterns identified in the shallow water model of MPAS.

III. PATTERN-DRIVEN DESIGN

It is easy to see that the granularity of the kernel-level design is usually coarse, and there is little mechanism for fine-grained adjustment of the load balance. On the other hand, if we go down to the code level, the overall maintenance of the application project will be severely affected. We therefore seek an intermediate approach with a granularity between the kernel level and the code level.

A. Basic Stencil Patterns

Like many other geoscientific models, the shallow water model in MPAS uses an explicit RK-4 method for time stepping. In each time step, computations are done on variables defined on the mesh. By examining the whole code, we find that there are two types of computations done in each RK loop. One is the local computation of a physical variable defined on a certain type of mesh points. The other is stencil computations, whose input and output data belong to different physical variables that are possibly defined on different types of mesh points. The local computation, such as the native accumulation, can be embarrassingly parallelized without any data dependency and is therefore easy to parallelize with high performance. The stencil computations, however, depend on different input and output variables, and require further analysis before optimizing.

As mentioned earlier, there are three types of mesh points in the CVT-based C-grid. We find that in total eight stencil patterns are repeatedly used throughout the shallow water model. All eight stencil patterns are shown in Figure 3. In the figure, colored points with squares, triangles and circles represent the three types of mesh points as defined in Figure 1. The centered point in each stencil pattern represents the type of the output variable, and all other points represent the type of the input variables. For instance, pattern A in Figure 3 describes a computation pattern in which, in order to calculate a variable on a mass point, we need the variables on the neighboring velocity points. Analogous interpretations apply to the remaining seven patterns in the figure.
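To make the shape of these patterns concrete, the sketch below spells out pattern A in code: a quantity on a mass point (cell) is assembled from the values on the surrounding velocity points (edges). The subroutine name and the weight array edge_weight are hypothetical and introduced only for illustration; the connectivity arrays follow the conventions used later in Algorithms 2 and 3.

! Sketch of stencil pattern A (hypothetical names): a cell-based quantity is
! computed from the values stored on the cell's surrounding edges.
subroutine pattern_a_sketch(nCells, nEdges, maxEdges, nEdgesOnCell, &
                            EdgesOnCell, edge_weight, u_edge, out_cell)
   implicit none
   integer, intent(in)  :: nCells, nEdges, maxEdges
   integer, intent(in)  :: nEdgesOnCell(nCells), EdgesOnCell(nCells, maxEdges)
   real(kind=8), intent(in)  :: edge_weight(nCells, maxEdges), u_edge(nEdges)
   real(kind=8), intent(out) :: out_cell(nCells)
   integer :: icell, i, iedge

   do icell = 1, nCells
      out_cell(icell) = 0.0d0
      do i = 1, nEdgesOnCell(icell)
         iedge = EdgesOnCell(icell, i)
         ! gather from the velocity points (edges) around this mass point (cell)
         out_cell(icell) = out_cell(icell) + edge_weight(icell, i) * u_edge(iedge)
      end do
   end do
end subroutine pattern_a_sketch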

B. Data-Flow Chart

After identifying all computation patterns, we use them as building blocks to compose a data-flow diagram. The connections between these building blocks are determined by analyzing the input and output variables of neighboring computation patterns. For the shallow water model described in Algorithm 1, the data-flow diagram is shown in Figure 4 (a), where rectangular blocks marked with X1 to X6 stand for the six local computations and circled blocks marked with A to H stand for the eight stencil computations (with different numbers distinguishing their uses in different scenarios).

Figure 4. The data-flow diagram for the shallow water model in MPAS: (a) the original data-flow diagram with patterns grouped by kernels; (b) the reorganized data-flow diagram for the pattern-level hybrid algorithm design.

The whole diagram is organized like a "circuit diagram", with the data flow being the "electric current" and the computation patterns being the "circuit components". Red numbers are also shown in the diagram to indicate the numbers of independent sets of input variables. All the input/output variables grouped by patterns are listed in Table I.

C. Adjustable Hybrid Algorithm

As is done in Figure 2 for the kernel-level hybrid design, we use gray boxes in Figure 4 (a) to show the kernels processed by the CPUs and dark yellow boxes to represent kernels on the accelerator side. Also marked in the diagram are the synchronizations required for the MPI halo exchange and the corresponding data transfers between host and device. It is clearly seen from Figure 4 (a) that the load balance between host and device is not well maintained. In order to improve it, we may use the data-flow diagram based on the computation patterns to reveal further potential parallelism. As seen in Figure 4 (b), this can be done by reorganizing and separating the computation patterns of the same kernel function and distributing them between host and device. Redundant computations might be introduced to increase the concurrency without destroying the completeness of the pattern structure.

Table I. List of all patterns and their input/output variables, grouped by kernel (compute_tend, enforce_boundary_edge, compute_next_substep_state, accumulative_update, compute_solve_diagnostics, mpas_reconstruct). The variables involved include h, u, provis_h, provis_u, h_edge, ke, divergence, vorticity, pv_cell, pv_vertex, pv_edge, tend_h, tend_u, d2fdx2_cell1, d2fdx2_cell2, uReconstructX, uReconstructY, uReconstructZ, uReconstructZonal and uReconstructMeridional.

The newly designed hybrid algorithm is based on patterns instead of kernels, with finer granularity and more efficient utilization of the heterogeneous processing units. In the pattern-driven design, there is another type of box, colored light yellow, in which the operations can be adaptively controlled according to the configuration of the heterogeneous system, so that the load balance is improved.
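To illustrate how such a pattern-level split can be expressed in code, the sketch below launches the accelerator share of a kernel asynchronously and runs the host share in the meantime. The subroutine names are hypothetical, and the directive spelling assumes Intel's offload (LEO) programming model; exact clauses may differ between compiler versions.

! Sketch of running the MIC share and the CPU share of one kernel concurrently.
! All subroutine names are hypothetical placeholders.
subroutine diagnostics_patterns_mic()
   !dir$ attributes offload:mic :: diagnostics_patterns_mic
   ! stencil patterns assigned to the accelerator side (placeholder body)
end subroutine diagnostics_patterns_mic

subroutine diagnostics_patterns_cpu()
   ! the adjustable share of patterns kept on the host (placeholder body)
end subroutine diagnostics_patterns_cpu

subroutine compute_solve_diagnostics_hybrid()
   implicit none
   !dir$ attributes offload:mic :: diagnostics_patterns_mic
   integer :: sig                   ! tag used to signal the asynchronous offload
   sig = 1

   ! launch the MIC share of the kernel without blocking the host
   !dir$ offload target(mic:0) signal(sig)
   call diagnostics_patterns_mic()

   ! meanwhile the CPU share runs under the host OpenMP threads
   call diagnostics_patterns_cpu()

   ! synchronize before results from the MIC side are needed (e.g. for the halo exchange)
   !dir$ offload_wait target(mic:0) wait(sig)
end subroutine compute_solve_diagnostics_hybrid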

Algorithm 2 An example of irregular reduction.
Real Y(nCells), X(nEdges);             /* data arrays */
Integer CellsOnEdge(nEdges,2);         /* index array */
Integer cell1, cell2;                  /* intermediate variables */
for iedge = 1 to nEdges do
  cell1 = CellsOnEdge(iedge,1);
  cell2 = CellsOnEdge(iedge,2);
  Y(cell1) = Y(cell1) + X(iedge);
  Y(cell2) = Y(cell2) − X(iedge);
end for

Algorithm 3 Regularity-aware loop refactoring.
Real Y(nCells), X(nEdges);             /* data arrays */
Integer CellsOnEdge(nEdges,2);         /* index array */
Integer EdgesOnCell(nCells,maxEdges);  /* index array */
Integer nEdgesOnCell(nCells);          /* index array */
Integer iedge;                         /* intermediate variable */
for icell = 1 to nCells do
  for i = 1 to nEdgesOnCell(icell) do
    iedge = EdgesOnCell(icell,i);
    if (icell == CellsOnEdge(iedge,1)) then
      Y(icell) = Y(icell) + X(iedge);
    else
      Y(icell) = Y(icell) − X(iedge);
    end if
  end for
end for

D. Regularity-Aware Loop Refactoring

The stencil computations in MPAS often exhibit irregular reduction modes [21] that are not directly suitable for parallelization with OpenMP. This is because the input and output vectors may belong to different types of mesh points, and the code may have been designed to mix two arbitrary types of unknowns. An example of this kind of irregularity is shown in Algorithm 2. The loop traverses the mesh data in edge order but writes back variables in cell order. This pattern is not suitable for OpenMP since it will cause a data race if multiple threads working on edges access the same cell to which those edges belong. The race condition can be completely resolved by refactoring the loop into cell order, as shown in Algorithm 3. In order to achieve high performance on both multi-core CPUs and many-core accelerators, we perform this loop refactoring throughout the code to make all computation patterns regularity-aware.
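As a concrete illustration, the following is a minimal sketch, with a hypothetical subroutine name, of the cell-ordered loop of Algorithm 3 parallelized with OpenMP: each thread owns a disjoint range of cells, so no two threads update the same Y(icell) and the race of Algorithm 2 disappears.

! Sketch: the cell-ordered loop of Algorithm 3 parallelized with OpenMP.
subroutine accumulate_cells(nCells, nEdges, maxEdges, nEdgesOnCell, &
                            EdgesOnCell, CellsOnEdge, X, Y)
   implicit none
   integer, intent(in)  :: nCells, nEdges, maxEdges
   integer, intent(in)  :: nEdgesOnCell(nCells), EdgesOnCell(nCells, maxEdges)
   integer, intent(in)  :: CellsOnEdge(nEdges, 2)
   real(kind=8), intent(in)    :: X(nEdges)
   real(kind=8), intent(inout) :: Y(nCells)
   integer :: icell, i, iedge

   !$omp parallel do private(i, iedge) schedule(static)
   do icell = 1, nCells
      do i = 1, nEdgesOnCell(icell)
         iedge = EdgesOnCell(icell, i)
         if (icell == CellsOnEdge(iedge, 1)) then
            Y(icell) = Y(icell) + X(iedge)     ! this cell owns the "+" side of the edge
         else
            Y(icell) = Y(icell) - X(iedge)     ! this cell owns the "-" side of the edge
         end if
      end do
   end do
   !$omp end parallel do
end subroutine accumulate_cells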

IV. IMPLEMENTATION & OPTIMIZATION

We implement the hybrid algorithm on a supercomputer equipped with both multi-core CPUs and many-core MIC devices. In addition to the MPI programming model for inter-node parallelism, we exploit thread-level parallelism by using OpenMP, which supports both the multi-core CPUs and the many-core MIC devices. Based on OpenMP, the hybrid implementation requires only minimal revisions to the original MPI code. Here we list some major optimizations that are critical for improving the overall performance.

A. Reducing data transfer overhead

The hybrid algorithm requires data transfer between host and device. If the data transfer is done only on demand, the large amount of transferred data may severely degrade the performance. However, in our largest test case (15-km), all the data that needs to be offloaded to the MIC amounts to about 5.3GB, which does not exceed the local memory of the MIC device. That is to say, we can transfer all the necessary data, including both mesh and computing data, at the very beginning of the run. During the whole computation, the mesh data is unchanged and thus kept on the device, and only a small amount of computing data is transferred between host and device. In practice, we use the offload programming method to implement the algorithm on the CPU-MIC system. For MPAS, taking the 30-km resolution mesh (with 655,362 mesh cells) as an example, the average amount of data transferred is reduced by at least a factor of 4x.
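The sketch below illustrates this one-time transfer, assuming Intel's offload (LEO) directives with persistent device allocation; the subroutine name is hypothetical and the exact clause spellings may vary between compiler versions.

! Copy the read-only mesh data to the MIC once at start-up and keep the
! device copies alive (free_if(.false.)), so that later offloads can refer
! to them with nocopy(...) instead of re-transferring.
subroutine push_mesh_to_mic(nCells, nEdges, maxEdges, CellsOnEdge, EdgesOnCell)
   implicit none
   integer, intent(in) :: nCells, nEdges, maxEdges
   integer, intent(in) :: CellsOnEdge(nEdges, 2), EdgesOnCell(nCells, maxEdges)

   !dir$ offload_transfer target(mic:0) in(CellsOnEdge : alloc_if(.true.) free_if(.false.))
   !dir$ offload_transfer target(mic:0) in(EdgesOnCell : alloc_if(.true.) free_if(.false.))
end subroutine push_mesh_to_mic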

B. OpenMP multi-threading

As mentioned earlier, OpenMP is used to exploit thread-level parallelism. On the CPU side, the number of threads is set equal to the number of available cores. On the MIC side, we leave one hardware core for the offloading engine and run four threads on each of the remaining hardware cores. In the shallow water model of MPAS, there are a number of basic computation patterns, including local computations and stencil computations. Each computation pattern can be parallelized with OpenMP as a single parallel region. However, setting up an individual parallel region for each computation pattern may introduce a large overhead due to the implicit synchronizations of OpenMP. Therefore we only set up one parallel region for each kernel and remove all unnecessary implicit synchronizations. The compact thread affinity is used for better data reuse in the L2 cache.
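A minimal sketch of this arrangement, with placeholder loop bodies and hypothetical array names: one parallel region covers the whole kernel, each pattern becomes a worksharing loop, and a nowait clause removes the implicit barrier between patterns that are independent of each other.

! Sketch: one OpenMP parallel region per kernel, each pattern as a worksharing loop.
subroutine kernel_single_region(nCells, nEdges, a_cell, b_edge)
   implicit none
   integer, intent(in) :: nCells, nEdges
   real(kind=8), intent(inout) :: a_cell(nCells), b_edge(nEdges)
   integer :: icell, iedge

   !$omp parallel private(icell, iedge)

   !$omp do schedule(static)
   do icell = 1, nCells
      a_cell(icell) = 2.0d0 * a_cell(icell)      ! pattern 1: cell-based local update
   end do
   !$omp end do nowait                           ! independent of pattern 2, so no barrier

   !$omp do schedule(static)
   do iedge = 1, nEdges
      b_edge(iedge) = 0.5d0 * b_edge(iedge)      ! pattern 2: edge-based local update
   end do
   !$omp end do                                  ! implicit barrier retained here

   !$omp end parallel
end subroutine kernel_single_region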

C. Loop refactoring

The regularity-aware loop refactoring is applied to all stencil computation loops to avoid possible write conflicts when running in parallel. This greatly helps increase the concurrency of all kernels and improve the overall performance. Some intermediate variables might be introduced during the loop refactoring, but the overhead is negligible compared to the performance gain.

D. Vectorization

Utilizing the 512-bit SIMD instruction-level parallelism on the MIC usually brings a large performance gain. Although auto-vectorization is enabled by the default Intel compiler options, we choose not to rely on it because of the highly irregular memory accesses in most kernels. Instead, we insert SIMD directives at proper positions to perform manual vectorization. For the refactored loops shown in Algorithm 3, a label matrix L defined as

  L(i, j) = +1, if i = CellsOnEdge(EdgesOnCell(i, j), 1),
  L(i, j) = −1, otherwise,

is employed to record the "+/-" operator in the code and remove the conditional branches. The branch-free loop refactoring is shown in Algorithm 4.

Algorithm 4 Branch-free loop refactoring.
Real Y(nCells), X(nEdges);             /* data arrays */
Integer CellsOnEdge(nEdges,2);         /* index array */
Integer EdgesOnCell(nCells,maxEdges);  /* index array */
Integer L(nCells,maxEdges);            /* label matrix */
Integer iedge;                         /* intermediate variable */
for icell = 1 to nCells do
  for i = 1 to nEdgesOnCell(icell) do
    iedge = EdgesOnCell(icell,i);
    Y(icell) = Y(icell) + L(icell,i)*X(iedge);
  end for
end for
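Below is a hedged sketch of how the branch-free inner loop of Algorithm 4 can be threaded and vectorized. The subroutine name is hypothetical, and the portable !$omp simd spelling is used here; Intel's !dir$ simd directive is a compiler-specific equivalent of the SIMD directives mentioned above.

! Sketch: threaded and manually vectorized version of Algorithm 4.
subroutine accumulate_cells_simd(nCells, nEdges, maxEdges, nEdgesOnCell, &
                                 EdgesOnCell, L, X, Y)
   implicit none
   integer, intent(in)  :: nCells, nEdges, maxEdges
   integer, intent(in)  :: nEdgesOnCell(nCells), EdgesOnCell(nCells, maxEdges)
   integer, intent(in)  :: L(nCells, maxEdges)        ! label matrix of +/-1
   real(kind=8), intent(in)    :: X(nEdges)
   real(kind=8), intent(inout) :: Y(nCells)
   integer :: icell, i
   real(kind=8) :: acc

   !$omp parallel do private(i, acc) schedule(static)
   do icell = 1, nCells
      acc = 0.0d0
      !$omp simd reduction(+:acc)
      do i = 1, nEdgesOnCell(icell)
         acc = acc + L(icell, i) * X(EdgesOnCell(icell, i))   ! gather + signed add, no branch
      end do
      Y(icell) = Y(icell) + acc
   end do
   !$omp end parallel do
end subroutine accumulate_cells_simd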

E. Streaming store

The Intel Xeon Phi coprocessor provides streaming store instructions. These instructions are intended to speed up vector-aligned unmasked stores in streaming kernels and to reduce memory bandwidth pressure. To exploit them, we first align the arrays used on the MIC to 64 bytes and then insert vector nontemporal directives at proper places.

F. Other methods

Some other standard optimizations, including prefetching, 2MB paging and loop fusion, are also carried out. Except for the last one, these methods are implemented by simply adding compiler options or setting environment variables. Loop fusion is done by properly fusing adjacent computation patterns without affecting the data dependencies in the data-flow diagram.

V. TEST AND ANALYSIS

Our experiment platform is a cluster consisting of 32 computing nodes connected by a 56Gb FDR InfiniBand network. Each node is equipped with two Intel Xeon E5-2680 V2 CPUs and two Intel Xeon Phi 5110P coprocessors. The configuration details of the platform are shown in Table II. In the tests, we group each 10-core CPU with a Xeon Phi accelerator and assign one MPI process to them. All experiments are conducted in standard double-precision accuracy. The meshes we use in our experiments are all quasi-uniform and SCVT-based. Information on the horizontal resolution as well as the total number of mesh cells of the tested meshes is provided in Table III.

Table II. Configurations of the test platform.
                    Intel Xeon E5-2680 V2      Intel Xeon Phi 5110P
Frequency           2.8 GHz                    1.1 GHz
Cores/Threads       10 / 10                    60 / 240
SIMD width          4 double / 8 integer       8 double / 16 integer
Instruction set     AVX                        IMCI
Gflops in D.P.      224                        1010.8
L1/L2/L3 cache      32KB / 256KB / 25MB        32KB / 512KB / -
Memory capacity     32GB                       7.8GB

Table III. Mesh information list.
Resolution         120-km    60-km      30-km      15-km
# of mesh cells    40,962    163,842    655,362    2,621,442

A. Correctness Validation

We first validate the correctness of the hybrid implementation. To do so, we compare the simulation results of the original single-core CPU code and our pattern-driven hybrid CPU-MIC implementation. There are a number of test cases [22] available for the global shallow water model. Among them we choose the fifth test case, which describes the evolution of a zonal flow over an isolated mountain. We remark that we have tested all the other test cases for the shallow water model, all with similar results; only the result of test case five is shown here for the sake of brevity. Figure 5 shows the total height distribution at day 15 and the difference between the two codes. The tests are done on the quasi-uniform SCVT mesh with 120-km horizontal resolution and 40,962 mesh cells. Since all computation kernels are parallelized in the hybrid implementation, and some loops are even refactored, the two results are not bit-wise identical. Nevertheless, the difference between the two is negligible as compared to the absolute values of the total height, and the two results are consistent within machine precision.

Figure 5. Results on the total height distribution (i.e., h + b) at day 15 for the zonal flow over an isolated mountain problem: (a) result of the original CPU implementation; (b) result of the hybrid CPU-MIC implementation; (c) difference between the two results.

B. Single Device Performance

Shown in Figure 6 are results on the performance improvement obtained by applying various optimization techniques to the many-core implementation on a single Intel Xeon Phi accelerator. The SCVT-based mesh with 30-km horizontal resolution and 655,362 mesh cells is used in the test. In Figure 6, the baseline result is obtained from the original single-core implementation without any optimizations. From the figure, we observe that without the regularity-aware loop refactoring, the direct application of OpenMP multi-threading leads to poor performance; only a speedup of less than 20x is obtained on the 60-core device. When the loop refactoring is applied, the speedup quickly increases to over 60x, which clearly shows the effectiveness of this optimization. The SIMD vectorization, on the other hand, only improves the performance by about another 20%. This is expected because of the irregular memory access patterns in the computing kernels. Other optimization techniques, namely streaming stores, prefetching, 2MB paging and loop fusion, gradually increase the speedup further to nearly 100x.

Figure 6. Performance improvement on the Intel Xeon Phi with various optimization techniques (baseline, OpenMP, refactoring, SIMD, streaming, others).

C. Performance of Hybrid Implementations

We then examine the performance of the hybrid implementations by comparing with the original CPU implementation. Four typical meshes are used in the test, with the horizontal resolution increasing from 120-km to 15-km. In the test, both the kernel-level and the pattern-driven hybrid implementations are used for comparison purposes. Shown in Figure 7 are the testing results, including the averaged execution time (in seconds) per time step and the relative speedup as compared to the single-core CPU version. It is seen from the figure that as the mesh becomes finer, higher speedups are obtained. The kernel-level hybrid implementation sustains a speedup of 6.05x, and the pattern-driven version further improves the speedup to 8.35x, which is a 38% increase. The performance improvement is due to the finer granularity of the pattern-driven design, which leads to superior load balance as compared to the kernel-level algorithm.

Figure 7. Performance comparison of different implementations on different scales of SCVT meshes (execution time per step and speedup over the single-core CPU version for the 40,962-, 163,842-, 655,362- and 2,621,442-cell meshes).

D. Scalability Tests

Now we investigate the parallel performance of the pattern-driven hybrid implementation when running with multiple MPI processes. Both the strong and weak scalabilities are of interest to us. For the sake of comparison, we also test the parallel performance of the original CPU code.

In the strong scaling test, we employ two meshes: a relatively small mesh of 30-km horizontal resolution with 655,362 mesh cells, and a relatively large one of 15-km horizontal resolution with 2,621,442 mesh cells. For each mesh, we gradually increase the number of MPI processes from 1 to 64 by a factor of 2. From Figure 8 (a) we observe that in the case of the small mesh, the hybrid implementation has a good parallel efficiency when the number of MPI processes is less than 16, but the parallel efficiency degrades severely when scaling to larger numbers of MPI processes. This is expected because the average execution time of the hybrid implementation is much smaller than that of the original CPU code, leaving little room for further speedup as the number of MPI processes keeps increasing. When the large mesh is used in the strong scaling test, as seen in Figure 8 (b), the parallel scalability of the hybrid design becomes even better: it not only outperforms the original CPU code by nearly one order of magnitude but also maintains a comparable parallel efficiency.

Figure 8. Strong scaling results: (a) 30-km mesh (655,362 cells); (b) 15-km mesh (2,621,442 cells).

In the weak scaling tests, we fix the number of mesh cells per MPI process to be about 40,962. Due to the limited availability of the mesh data, we can only perform a weak-scaling test from 1 MPI process to 64 processes by a factor of 4. The testing results are shown in Figure 9, together with the result of the same weak-scaling test on the original CPU code. We observe from the figure that both the original version and the hybrid implementation are able to maintain a nearly perfect weak scalability as the number of MPI processes increases. This indicates that the pattern-driven hybrid design inherits the good parallel performance of the original code, with additional support for hybrid many-core platforms.

Figure 9. Weak scaling results.

VI. CONCLUSION

In this paper, we have presented a new pattern-driven approach for the design, implementation, and optimization of hybrid algorithms on multi- and many-core accelerated systems in geoscientific model development. The shallow water model in the MPAS project has been selected as a proxy to demonstrate our method. The approach consists of three major steps. First, all basic computation patterns are identified through a rigorous analysis of the MPAS code. Second, these identified patterns are used as building blocks

to draw a data-flow diagram, which serves as an effective indicator for recognizing data dependencies and exploiting inherent parallelism. Third, a hybrid algorithm is designed based on the data-flow diagram to support concurrent computations on both multi-core CPUs and many-core accelerators. We have practiced the pattern-driven approach on an x86-based heterogeneous supercomputer equipped with both Intel Xeon CPUs and Intel Xeon Phi devices. Experiments show that our hybrid design is able to deliver an 8.35x speedup as compared to the original code and scales up to 64 processes with a nearly ideal parallel efficiency. It is worth noting that the pattern-driven design has great potential to be applied in many stencil-based application scenarios. Future work on this subject may include building performance models for the pattern-driven design and leveraging automatic code generation techniques for the ease of implementation and optimization.

ACKNOWLEDGMENT

The work was supported in part by NSF China (grant# 61170075) and the 863 Program of China (grant# 2015AA01A302).

REFERENCES

[1] D. R. Easterling, G. A. Meehl, C. Parmesan, S. A. Changnon, T. R. Karl, and L. O. Mearns, "Climate extremes: observations, modeling, and impacts," Science, vol. 289, no. 5487, pp. 2068–2074, 2000.
[2] K. Hamilton and W. Ohfuchi, Eds., High Resolution Numerical Modelling of the Atmosphere and Ocean. Springer, 2008.
[3] S. Shingu, H. Takahara, H. Fuchigami, M. Yamada et al., "A 26.58 Tflops global atmospheric simulation with the spectral transform method on the Earth Simulator," in Proceedings of the ACM/IEEE Conference on Supercomputing (SC '02), 2002, pp. 1–19.
[4] "MPAS: Model for Prediction Across Scales," http://mpas-dev.github.io.
[5] Q. Du, V. Faber, and M. Gunzburger, "Centroidal Voronoi Tessellations: Applications and Algorithms," SIAM Review, vol. 41, no. 4, pp. 637–676, 1999.
[6] J. Michalakes and M. Vachharajani, "GPU acceleration of numerical weather prediction," Parallel Processing Letters, vol. 18, no. 04, pp. 531–548, 2008.
[7] J. Mielikainen, B. Huang, H. Huang, and M. D. Goldberg, "Improved GPU/CUDA based parallel weather and research forecast (WRF) single moment 5-class (WSM5) cloud microphysics," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 5, no. 4, pp. 1256–1265, 2012.
[8] J. Mielikainen, B. Huang, H.-L. Huang, M. Goldberg, and A. Mehta, "Speeding Up the Computation of WRF Double-Moment 6-Class Microphysics Scheme with GPU," Journal of Atmospheric and Oceanic Technology, vol. 30, no. 12, pp. 2896–2906, 2013.
[9] J. C. Linford, J. Michalakes, M. Vachharajani, and A. Sandu, "Multi-core acceleration of chemical kinetics for simulation and prediction," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '09), 2009, pp. 7:1–7:11.

[10] E. Price, J. Mielikainen, B. Huang, H. A. Huang, and T. Lee, "GPU acceleration experience with RRTMG long wave radiation model," in SPIE Remote Sensing. International Society for Optics and Photonics, 2013.
[11] O. Fuhrer, C. Osuna, X. Lapillonne, T. Gysi, M. Bianco, and T. Schulthess, "Towards GPU-accelerated Operational Weather Forecasting," in The GPU Technology Conference, 2013.
[12] S. Xu, X. Huang, Y. Zhang, Y. Hu, H. Fu, and G. Yang, "Porting the Princeton Ocean Model to GPUs," in Algorithms and Architectures for Parallel Processing. Springer, 2014, pp. 1–14.
[13] T. Shimokawabe, T. Aoki, C. Muroi, J. Ishida, K. Kawano, T. Endo, A. Nukada, N. Maruyama, and S. Matsuoka, "An 80-fold speedup, 15.0 TFlops full GPU acceleration of nonhydrostatic weather model ASUCA production code," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '10), 2010, pp. 1–11.
[14] T. Shimokawabe, T. Aoki, J. Ishida, K. Kawano, and C. Muroi, "145 TFlops performance on 3990 GPUs of TSUBAME 2.0 supercomputer for an operational weather prediction," Procedia Computer Science, vol. 4, pp. 1535–1544, 2011.
[15] W. Xue, C. Yang, H. Fu, X. Wang, Y. Xu, L. Gan, Y. Lu, and X. Zhu, "Enabling and scaling a global shallow-water atmospheric model on Tianhe-2," in Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS '14), 2014, pp. 745–754.
[16] L. Ju, T. Ringler, and M. Gunzburger, "Voronoi tessellations and their application to climate and global modeling," in Numerical Techniques for Global Atmospheric Models. Springer, 2011, pp. 313–342.
[17] T. D. Ringler, D. Jacobsen, M. Gunzburger, L. Ju, M. Duda, and W. Skamarock, "Exploring a multiresolution modeling approach within the shallow-water equations," Monthly Weather Review, vol. 139, no. 11, pp. 3348–3368, 2011.
[18] S. Ii and F. Xiao, "A global shallow water model using high order multi-moment constrained finite volume method and icosahedral grid," Journal of Computational Physics, vol. 229, no. 5, pp. 1774–1796, 2010.
[19] A. Jameson, W. Schmidt, E. Turkel et al., "Numerical solutions of the Euler equations by finite volume methods using Runge-Kutta time-stepping schemes," AIAA Paper 81-1259, 1981.
[20] T. Shimokawabe, T. Aoki, T. Takaki, T. Endo, A. Yamanaka, N. Maruyama, A. Nukada, and S. Matsuoka, "Petascale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), 2011, pp. 3:1–3:11.
[21] X. Huo, V. T. Ravi, and G. Agrawal, "Porting irregular reductions on heterogeneous CPU-GPU configurations," in Proceedings of the 18th International Conference on High Performance Computing (HiPC). IEEE, 2011, pp. 1–10.
[22] D. L. Williamson, J. B. Drake, J. J. Hack, R. Jakob, and P. N. Swarztrauber, "A standard test set for numerical approximations to the shallow water equations in spherical geometry," Journal of Computational Physics, vol. 102, no. 1, pp. 211–224, 1992.
