Parallel sparse nonlinear solvers of obstacle problems on GPU clusters

Lilia Ziane Khodja (1), Ming Chau (2), Raphaël Couturier (1), Pierre Spitéri (3), Jacques Bahi (1)

(1) LIFC, EA 4269, BP 527, 90016 Belfort CEDEX, France
[email protected], [email protected], [email protected]
(2) Advanced Solutions Accelerator, 199 rue de l'Oppidum, 34170 Castelnau Le Lez, France
[email protected]
(3) ENSEEIHT-IRIT, 2 rue Charles Camichel, 31071 Toulouse CEDEX, France
[email protected]

Abstract

The present study deals with the solution of a nonlinear boundary value problem: the obstacle problem, defined in a three-dimensional domain. This mathematical problem arises, for instance, in financial mathematics (American options) or in mechanics. This paper focuses on iterative parallel algorithms for the solution of the obstacle problem on a cluster of GPUs. Two solvers have been implemented, both mixing MPI and CUDA: the projected Richardson algorithm and the projected block Gauss-Seidel algorithm. In each iterative algorithm, the communication between GPU nodes can be either synchronous or asynchronous. Note that the convergence of the asynchronous algorithms is ensured by a mathematical theory, which is briefly presented. Moreover, the performances of the considered GPU solvers are compared with those of equivalent CPU solvers. The experimental results clearly show that asynchronous iterations on a GPU cluster constitute the fastest method.
Keywords: nonlinear systems, obstacle problem, GPU cluster, CUDA, asynchronous iterations
1 Introduction
The simulation of many physical phenomena leads to the solution of very large-scale algebraic systems with iterative methods. Thus parallel computers are widely used in order to solve these mathematical problems. Nowadays, among the most attractive parallel platforms are those using the computing power of GPU (Graphics Processing Unit) cards. Their hardware and software architectures have rapidly evolved, allowing them to become high performance accelerators for the data-parallel tasks and intensive arithmetic computations of many applications. Several works have proved the ability of GPUs to provide better performance than CPUs for many applications using iterative solvers [1, 2, 3].

In the present study, we concentrate on the comparison between CPU and GPU implementations of parallel synchronous and asynchronous iterative algorithms. Asynchronous iterative algorithms are an attractive way of reducing the idle time due to blocking message passing and synchronisation barriers, which arise in such data-parallel algorithms. In particular, asynchronous algorithms enable one to consider parallel computations whereby processors go at their own pace according to their intrinsic characteristics and computational load. This approach avoids the use of load balancing techniques, which cannot be easily implemented in all situations in order to reduce the parallelisation overhead. Asynchronous parallel iterative methods have been studied by many authors [20, 21, 23, 24], and have so far been implemented on CPU clusters.

In the present study, an implementation on a cluster of 12 GPUs has been performed. More precisely, each algebraic system is parallelized and solved iteratively using the whole cluster. The data communication at each iteration between the GPU nodes can be either synchronous or asynchronous, whereas inside each GPU node a CUDA parallelisation is performed. Thus, there are two levels of parallelism: MPI between GPU nodes and CUDA inside each node. In the sequel, we will compare the performances of the synchronous and asynchronous versions of two parallel iterative algorithms, both implemented on CPU and GPU architectures. The model problem is a nonlinear PDE occurring in financial mathematics (option pricing) and constrained structural mechanics: the obstacle problem (see [4]). Our experimental results show a significant acceleration when GPU clusters are used, compared to CPU clusters. Furthermore, asynchronous algorithms are faster than synchronous ones on both kinds of clusters.

The paper is organized as follows: in section 2, the model problem, its time and space discretization, and the mathematical formulation of the parallel algorithms being used are presented. Section 3 is devoted to a brief description of the GPU architecture. The CUDA implementations of the parallel synchronous and asynchronous iterative algorithms are explained in section 4. Some experimental results, which show comparisons between the CPU and GPU implementations of the projected Richardson and projected block Gauss-Seidel algorithms, are presented in section 5. Finally, an improvement of the projected Richardson algorithm with red-black ordering is described in section 6; experimental results are also shown.
2 Solving large scale obstacle problems
The obstacle problem occurs in many applications such as mechanics, free boundary problems and financial derivatives. It consists in solving a time-dependent nonlinear problem:

$$
\left\{
\begin{array}{l}
\dfrac{\partial u}{\partial t} + b^t.\nabla u - \nu.\Delta u + c.u - f \ge 0, \quad u \ge \phi, \quad \text{e.w. in } [0,T]\times\Omega, \quad \nu > 0, \\
\left(\dfrac{\partial u}{\partial t} + b^t.\nabla u - \nu.\Delta u + c.u - f\right)(u - \phi) = 0, \quad \text{e.w. in } [0,T]\times\Omega, \\
u(0,x,y,z) = u_0(x,y,z), \\
\text{B.C. on } u(t,x,y,z) \text{ defined on } \partial\Omega,
\end{array}
\right.
\qquad (1)
$$
where u_0 is the initial condition; c ≥ 0, b and ν are physical parameters; T is the final time; u = u(t,x,y,z) is the unknown to be computed; f is a right-hand side that could represent, for instance, external forces; B.C. describes the boundary conditions on the boundary ∂Ω of the domain Ω; and φ models a constraint imposed on u. In practice, the Dirichlet condition (where u is fixed on ∂Ω) or the Neumann condition (where the normal derivative of u is fixed on ∂Ω) is classically considered. The previous time-dependent equation is solved numerically by considering an implicit or a semi-implicit time marching scheme, where at each time step a stationary nonlinear problem has to be solved:

$$
\left\{
\begin{array}{l}
b^t.\nabla u - \nu.\Delta u + (c + \delta).u - g \ge 0, \quad u \ge \phi, \quad \text{e.w. in } [0,T]\times\Omega, \quad \nu > 0, \\
\left(b^t.\nabla u - \nu.\Delta u + (c + \delta).u - g\right)(u - \phi) = 0, \quad \text{e.w. in } [0,T]\times\Omega, \\
\text{B.C. on } u(t,x,y,z) \text{ defined on } \partial\Omega,
\end{array}
\right.
\qquad (2)
$$

where δ is the inverse of the time step, g = f + δ u^prec, and u^prec is the solution obtained at the previous time step. The numerical analysis and the solution of such stationary equations have been studied by many authors [4, 5, 6, 7, 8, 9, 10, 11, 12], where sequential, parallel synchronous and parallel asynchronous algorithms on classical CPU architectures have been considered. In the scope of our current study, we will consider the implementation on GPU architectures of two algorithms: the projected block relaxation Gauss-Seidel method on the one hand, and the projected Richardson method (a point relaxation algorithm) on the other hand.
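To fix ideas on how these solvers are used, the overall solution process follows the time marching scheme above and can be sketched as the following C outline; solve_stationary_problem stands for one of the two iterative solvers studied below, and all names are our own illustrative assumptions, not taken from the authors' code.

#include <stdlib.h>
#include <string.h>

/* Assumed prototype of a stationary solver for problem (2)/(4):
   it computes u such that u >= phi, given g and delta = 1/dt.     */
void solve_stationary_problem(double *u, const double *g,
                              const double *phi, double delta, int M);

/* Schematic semi-implicit time marching: at each time step, build
   g = f + delta * u_prec and solve the stationary obstacle problem. */
void time_marching(double *u, const double *u0, const double *f,
                   const double *phi, int M, double dt, double T)
{
    double delta = 1.0 / dt;                 /* inverse of the time step */
    double *g = (double *)malloc(M * sizeof(double));
    memcpy(u, u0, M * sizeof(double));       /* initial condition u(0) = u0 */
    for (double t = 0.0; t < T; t += dt) {
        for (int i = 0; i < M; i++)
            g[i] = f[i] + delta * u[i];      /* u still holds the previous solution */
        solve_stationary_problem(u, g, phi, delta, M);
    }
    free(g);
}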
2.1 Discretization
First of all, note that in the previous stationary problem (2) the convection-diffusion operator is not self-adjoint. Consequently, its spatial discretization does not lead to a symmetric matrix. Nevertheless, since the convection coefficients arising in the operator are constant, a classical change of variables allows us to formulate the same problem by means of a self-adjoint operator. Indeed, consider the following stationary convection-diffusion operator:

$$
b^t\nabla v - \nu\Delta v + (c + \delta)v = g, \quad \text{e.w. in } [0,T]\times\Omega, \quad c \ge 0, \ \delta > 0,
$$

where b = {b_1, b_2, b_3}, and consider also the following general change of variables v = e^a . u, where a is defined by a = b^t(x,y,z)/(2ν); then the previous stationary convection-diffusion operator is changed as follows:

$$
-\nu\Delta u + \left(\frac{\|b\|_2^2}{4\nu} + c + \delta\right) u = e^{-a}.g = f,
\qquad (3)
$$
where ‖b‖_2 denotes the Euclidean norm. By this change of variables, the stationary convection-diffusion operator is transformed into a stationary diffusion operator, which has the major property of being self-adjoint. The formulation (3), on the one hand, leads to considering optimisation algorithms for the numerical solution. On the other hand, the formulation (2) leads to considering relaxation algorithms, since the operator is not self-adjoint due to the convection terms.

In the sequel, the domain Ω ⊂ R^3 is set to Ω = [0,1]^3, and is discretized with a uniform Cartesian mesh constituted by M = m^3 discretization points, where m is related to the spatial discretization step by h = 1/(m+1). A classical second-order finite difference approximation of the Laplacian is used. So the complete discretization of both stationary boundary value problems (2) and (3) leads to the solution of a large discrete complementarity problem of the following form, when either Dirichlet or Neumann boundary conditions are used:

Find U* ∈ R^M such that
$$
\left\{
\begin{array}{l}
(A + \delta I)U^* - G \ge 0, \quad U^* \ge \bar{\Phi}, \\
\left((A + \delta I)U^* - G\right)^T (U^* - \bar{\Phi}) = 0,
\end{array}
\right.
\qquad (4)
$$
where A is the matrix obtained after spatial discretization by the finite difference method, G is derived from the Euler first-order implicit time marching scheme and from the discretized right-hand side of the obstacle problem, δ is the inverse of the time step, and I is the identity matrix. The matrix A is symmetric when the self-adjoint operator is considered, and nonsymmetric otherwise. According to the chosen discretization scheme of the Laplacian, A is an M-matrix (irreducibly diagonally dominant; see [13]) and consequently the matrix (A + δI) is also an M-matrix. This property will be important in the sequel.
2.2 Parallel iterative solvers
In the sequel, owing to the large size of the previous discrete complementarity problem, we will solve problem (4) by parallel synchronous or asynchronous iterative algorithms (see [20, 21, 22, 23, 24]).

2.2.1 Mathematical framework
Let α be a positive integer. Assume that E = R^M; note that E is a Hilbert space. Consider also that E = ∏_{i=1}^{α} E_i is a product of α subspaces denoted E_i = R^{m_i}, where ∑_{i=1}^{α} m_i = M; note that each E_i is also a Hilbert space, in which ⟨·,·⟩_i denotes the scalar product and |·|_i the associated norm, for all i ∈ {1,...,α}. Then, for all u, v ∈ E, denote by ⟨u,v⟩ = ∑_{i=1}^{α} ⟨u_i,v_i⟩_i the scalar product on E and by ‖·‖ its associated norm.

In the sequel, we consider the following general fixed point problem:

$$
\text{Find } U^* \in E \text{ such that } U^* = F(U^*),
\qquad (5)
$$

where V ↦ F(V) is a mapping from E into E. Let V ∈ E and consider the following block decomposition of V and the corresponding decomposition of F:

$$
V = (V_1, \ldots, V_\alpha), \qquad F(V) = (F_1(V), \ldots, F_\alpha(V)).
$$
In order to solve problem (5), let us now consider the parallel asynchronous iterations defined as follows: let U^0 ∈ E be given; then for all p ∈ N, U^{p+1} is recursively defined by

$$
U_i^{p+1} =
\left\{
\begin{array}{ll}
F_i\!\left(U_1^{\rho_1(p)}, \ldots, U_j^{\rho_j(p)}, \ldots, U_\alpha^{\rho_\alpha(p)}\right) & \text{if } i \in s(p), \\
U_i^{p} & \text{if } i \notin s(p),
\end{array}
\right.
\qquad (6)
$$

where

$$
\left\{
\begin{array}{l}
\forall p \in \mathbb{N}, \ s(p) \subset \{1,\ldots,\alpha\} \text{ and } s(p) \neq \emptyset, \\
\forall i \in \{1,\ldots,\alpha\}, \ \{p \mid i \in s(p)\} \text{ is denumerable,}
\end{array}
\right.
\qquad (7)
$$

and, ∀j ∈ {1,...,α},

$$
\left\{
\begin{array}{l}
\forall p \in \mathbb{N}, \ \rho_j(p) \in \mathbb{N}, \ 0 \le \rho_j(p) \le p \text{ and } \rho_j(p) = p \text{ if } j \in s(p), \\
\lim\limits_{p \to \infty} \rho_j(p) = +\infty.
\end{array}
\right.
\qquad (8)
$$
The previous asynchronous iterative scheme models computations that are carried out in parallel without any order or synchronization, and describes a subdomain method without overlapping. In particular, it enables one to consider distributed computations whereby processors go at their own pace according to their intrinsic characteristics and computational load. The parallelism between the processors is well described by the set s(p), which contains at each step p the indices of the components relaxed in parallel by the processors, while the use of delayed components in (6) permits one to model nondeterministic behavior and does not imply inefficiency of the considered distributed scheme of computation. Note that, according to [21], each component of the vector must theoretically be relaxed an infinite number of times. The choice of the relaxed components may be guided by any criterion; in particular, a natural criterion is to pick up the most recently available values of the components computed by the processors.

Remark 1 Such asynchronous iterations describe various classes of parallel algorithms; in particular, they reduce to parallel synchronous iterations if ∀j ∈ {1,...,α}, ∀p ∈ N, ρ_j(p) = p.
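To relate this abstract scheme to an actual program, the sketch below shows how the delays ρ_j(p) arise when each MPI process relaxes its own block with whatever interface values have already arrived. The helper functions, buffers and the (simplified) convergence test are our own illustrative assumptions, not the authors' code.

#include <mpi.h>

/* Schematic asynchronous relaxation loop: relax the local block with the
   freshest halo values available (the delayed components of scheme (6)),
   then exchange interface values without synchronization. Global
   convergence detection, more delicate in the asynchronous case, is
   abstracted into the user-supplied relax_and_test function.           */
void async_iterations(double *u_local, double **halo_in, double **halo_out,
                      const int *neighbor, const int *count, int nb_neighbors,
                      int (*relax_and_test)(double *u, double **halo))
{
    const int TAG = 0;
    int converged = 0;
    while (!converged) {
        converged = relax_and_test(u_local, halo_in);

        /* post non-blocking sends of the updated interface values */
        for (int k = 0; k < nb_neighbors; k++) {
            MPI_Request req;
            MPI_Isend(halo_out[k], count[k], MPI_DOUBLE, neighbor[k],
                      TAG, MPI_COMM_WORLD, &req);
            MPI_Request_free(&req);
        }
        /* drain interface values that have already arrived, if any;
           a synchronous version would use blocking receives here       */
        for (int k = 0; k < nb_neighbors; k++) {
            int flag = 1;
            while (flag) {
                MPI_Iprobe(neighbor[k], TAG, MPI_COMM_WORLD, &flag,
                           MPI_STATUS_IGNORE);
                if (flag)
                    MPI_Recv(halo_in[k], count[k], MPI_DOUBLE, neighbor[k],
                             TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        }
    }
}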
2.2.2 Projected parallel Richardson methods
Many equivalent formulations of the obstacle problem exist, and the reader is referred to [4] for complements. According to the boundary value problem formulation with a self-adjoint operator (3), we can consider here the equivalent optimisation problem and the fixed point mapping associated with its solution. The goal is to define a parallel iterative algorithm within the framework defined in section 2.2.1.

Let K be the closed convex set defined by

$$
K = \{ V \mid V \ge \bar{\Phi} \text{ everywhere in } E \},
$$

where Φ̄ is the discrete obstacle function. In fact, the obstacle problem (4) is formulated as the following constrained optimisation problem:

$$
\text{Find } U^* \in K \text{ such that } \forall V \in K, \ J(U^*) \le J(V),
$$
where the cost function is given by

$$
J(V) = \frac{1}{2}\langle \mathcal{A}.V , V \rangle - \langle G , V \rangle,
$$

in which ⟨·,·⟩ denotes the scalar product in E and $\mathcal{A} = A + \delta I$ is symmetric positive definite, according to the chosen discretization method. Owing to the great size of such a system, in order to reduce the computation time, the former optimization problem can be solved numerically by using a projected parallel asynchronous method on the convex set. More particularly, we will consider here an asynchronous parallel adaptation of the projected Richardson method.

In order to define the parallel asynchronous projected Richardson method, we extend the formalism introduced in subsection 2.2.1 as follows. Assume that ∀i ∈ {1,...,α}, K_i ⊂ E_i and K_i is a closed convex set, and let K = ∏_{i=1}^{α} K_i. Let also G = (G_1,...,G_α) ∈ E. For any V ∈ E, let P_K(V) be the projection of V onto K such that P_K(V) = (P_{K_1}(V_1),...,P_{K_α}(V_α)), where ∀i ∈ {1,...,α}, P_{K_i} is the projector from E_i onto K_i. For any γ ∈ R, γ > 0, the fixed point mapping F_γ is defined by

$$
U^* = P_K\left(U^* - \gamma(\mathcal{A}.U^* - G)\right) = F_\gamma(U^*),
\qquad (9)
$$

which can also be written F_γ(V) = (F_{1,γ}(V),...,F_{α,γ}(V)) in the following way:

$$
\forall V \in E, \quad F_{i,\gamma}(V) = P_{K_i}\left(V_i - \gamma(\mathcal{A}_i.V - G_i)\right).
$$
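To make the mapping (9) concrete, one projected Richardson relaxation amounts, component by component, to a damped residual correction followed by a projection onto the obstacle. The following sequential C sketch illustrates it for the discrete problem (4) with K = {V ≥ Φ̄}; the sparse matrix-vector product spmv and the argument names are illustrative assumptions of ours, not the authors' GPU code.

/* One projected Richardson sweep: U <- P_K( U - gamma*(A*U - G) ),
   where P_K is the component-wise projection onto {V >= Phi_bar}.
   spmv is an assumed sparse matrix-vector product y = A*x for the
   discretized operator A + delta*I.                                 */
void projected_richardson_sweep(double *u, const double *g,
                                const double *phi_bar, double gamma, int M,
                                const void *A,
                                void (*spmv)(const void *A, const double *x, double *y),
                                double *work)
{
    spmv(A, u, work);                                  /* work = A*u        */
    for (int i = 0; i < M; i++) {
        double v = u[i] - gamma * (work[i] - g[i]);    /* Richardson step   */
        u[i] = (v > phi_bar[i]) ? v : phi_bar[i];      /* projection P_K    */
    }
}

Each component update only reads the previous iterate, which is the Jacobi-like property exploited in section 4 to assign one GPU thread per component.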
2.2.3 Projected parallel block relaxation methods
We can also consider a projected parallel asynchronous block relaxation algorithm, related to the natural block decomposition of the discretized operator, with the same notation as in section 2.2.2. The projected block relaxation algorithm is then associated with the following fixed point mapping:

$$
U_i^* = P_{K_i}\left(\mathcal{A}_{i,i}^{-1}\Big(G_i - \sum_{j \ne i} \mathcal{A}_{i,j} U_j^*\Big)\right) = F_{B_i}(U^*), \quad \forall i \in \{1, \ldots, \alpha\}.
\qquad (10)
$$
Then we can associate with this fixed point mapping F_B a parallel asynchronous block method defined by (6), (7) and (8).

Remark 2 In the implementation, the number of processors is obviously smaller than the number of blocks in the previous model of parallel iterations. Several adjacent block components of the discretization matrix and of the iterate vector are therefore processed by each processor. Such an implementation leads to a more multiplicative behavior of the considered subdomain methods without overlapping.
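As a sequential illustration of the mapping (10), one projected block relaxation (Gauss-Seidel) sweep can be sketched as follows; gather_rhs and solve_diag_block stand respectively for the computation of G_i − Σ_{j≠i} A_{i,j}U_j and for the triangular solves with the diagonal block mentioned in section 4, and all names are assumptions of ours.

/* One projected block Gauss-Seidel sweep of mapping (10). Since u is
   updated in place block after block, block i already sees the new
   values of blocks j < i and the old values of blocks j > i.         */
void projected_block_sweep(double *u, const double *phi_bar,
                           int nblocks, const int *block_start,
                           void (*gather_rhs)(int i, const double *u, double *rhs),
                           void (*solve_diag_block)(int i, double *rhs))
{
    for (int i = 0; i < nblocks; i++) {
        int lo = block_start[i], hi = block_start[i + 1];
        double rhs[hi - lo];               /* C99 VLA, enough for a sketch    */
        gather_rhs(i, u, rhs);             /* rhs = G_i - sum_{j!=i} A_ij*U_j */
        solve_diag_block(i, rhs);          /* rhs = A_ii^{-1} * rhs           */
        for (int k = lo; k < hi; k++) {    /* projection onto K_i             */
            double v = rhs[k - lo];
            u[k] = (v > phi_bar[k]) ? v : phi_bar[k];
        }
    }
}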
2.2.4 Convergence of the methods
In both cases, the important property which ensures the convergence of the parallel synchronous and asynchronous algorithms defined above is the fact that A is an M-matrix.
In the case of the projected Richardson method, the convergence follows from a result of [16]. Indeed, there exists a value γ_0 > 0 such that, ∀γ ∈ ]0, γ_0[, the synchronous and asynchronous iterations (6), (7) and (8) associated with the fixed point mapping F_γ (9) converge to the unique solution U* of the discretized problem. For more details, the reader is referred to [12].

In the case of the projected block relaxation methods, the convergence has been established in various ways, using contraction techniques (see [18, 15]) or partial ordering techniques (see [19, 14, 17]). To sum up, the synchronous and asynchronous iterations (6), (7) and (8) associated with the fixed point mapping F_B (10) converge to U*. Moreover, assuming that the algebraic system is split into q blocks, q ≤ α, corresponding to a coarser subdomain decomposition without overlapping, then, using a result of [15], it can be shown that the convergence of this method still holds. Furthermore, if the subdomain decomposition associated with the α blocks is a point decomposition (i.e. α = M), then the parallel asynchronous block relaxation methods converge for every subdomain decomposition.

Remark 3 Note that convergence is ensured in both cases for every initial guess U^0.
3 GPU hardware and software architectures
A GPU (Graphics Processing Unit) is a massively multithreaded many-core processor designed to assist the CPU in intensive parallel computations. It is built around processor and memory hierarchies. Indeed, a GPU is composed of hundreds of cores (up to 512 cores in recent GPUs) organized in several arrays of processors called streaming multiprocessors. It is also equipped with different high-bandwidth and low-latency memories. It has thousands of 32-bit registers and 16KB (or 48KB for recent GPUs) of fast shared memory per multiprocessor, and a read/write global (or device) memory shared between all its cores. Besides these commonly used memories, each multiprocessor has access to two additional cached memory spaces, the texture and constant memories, which reside in the device memory and are accessible in read-only mode by all the cores of a multiprocessor. These last memories are used to improve the access times to the non-cached device memory, and thus to get maximum memory throughput. The only GPU memory accessible by the CPU is the device memory, through a PCI-Express interface, such that all data transfers between the GPU and its CPU are performed from and/or to the device memory. So the CPU can read/write the device, texture and constant memory spaces.

In order to exploit the computing power of GPU architectures, Nvidia has released CUDA (Compute Unified Device Architecture) [25] for general-purpose computing on GPUs. The CUDA architecture provides a C/C++ parallel programming environment with a set of minimal extensions to program the GPUs for the general computations (graphic and/or non-graphic application purposes) that are usually performed by the CPUs. A CUDA program consists of sequential C code to be executed by the CPU and of several data-parallel functions, called kernels, to be performed by the GPU. At the GPU level, the same kernel is executed in parallel by thousands or even millions of GPU threads. The GPU organizes this large number of threads into a grid of several 1D, 2D or
3D thread blocks that are distributed among its multiprocessors. Indeed, each multiprocessor executes one or more thread blocks in SIMD fashion (Single Instruction, Multiple Data) and, in turn, each core of a multiprocessor runs one or more threads within a block using the SIMT architecture (Single Instruction, Multiple Threads). A GPU multiprocessor manages and executes the threads of its thread blocks in small groups of threads called warps, such that each warp contains 32 threads with consecutive and increasing IDs, the first warp containing thread 0. At any given clock cycle, the threads of a warp execute concurrently the same instruction of a kernel but operate on different data, and are free to follow the same or different execution paths without any synchronization point. All threads within the same thread block execute concurrently the same instruction on different data and can cooperate among themselves through barrier synchronization (__syncthreads() in CUDA) and the shared memory. In contrast, within the grid of thread blocks of the same kernel, there is no synchronization at all between the thread blocks, except that they read/write the input and output data from/to the global memory.

The number of threads per block and the number of thread blocks within the grid of a kernel are restricted by the limited resources of the multiprocessors and, therefore, they affect the performance and execution time of the kernel computation on the GPU. A kernel will fail to launch if the number of threads per block, specified in its execution configuration, is above the maximum number of threads per block (up to 1024 threads for the new generation of GPUs), or if it requires more registers and/or more shared memory space than available per multiprocessor for a given GPU. Moreover, the full efficiency of a kernel is achieved, first, when all 32 threads of a warp agree on their execution path, because otherwise the different branch execution paths are serially executed, and, second, when their read/write memory accesses are as optimal as possible, in order to maximize the memory throughput.

For optimal memory usage, a kernel should use as much as possible the low-latency shared memory and/or coalesced read/write accesses to the high-latency device memory. In fact, memory coalescing can efficiently improve the access times by reducing the number of memory transactions for a warp. Full coalescing is achieved when a half-warp (the first or the last 16 consecutive threads of a warp) accesses sixteen 1-, 2-, 4-, 8- or 16-byte words of the same data type, addressed in the same 32-, 64- or 128-byte segment of the device memory. In this case, the reads/writes of all the threads of a half-warp are performed in a single memory transaction. If, instead, the 16 threads of a half-warp address 16 words in n different memory segments, then n different memory transactions are performed for this half-warp. When coalescence is not ensured, the texture memory is recommended in order to improve the data reading from the device memory.
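As a small illustration of these coalescing rules, the toy kernel below lets consecutive threads access consecutive words of the device memory, so that each half-warp reads and writes within a single segment; this example is ours and is not part of the solvers.

// Toy CUDA kernel with coalesced device-memory accesses: thread k of a
// half-warp addresses word k of a contiguous segment, so the reads and
// writes of the half-warp are served by a single memory transaction.
__global__ void scale_coalesced(const double *in, double *out,
                                double gamma, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        out[i] = gamma * in[i];   // consecutive threads, consecutive addresses
}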
GPUs only work on data stored in their device memories, and the final results of their kernel executions must be communicated to their CPUs. Hence, the data must be transferred in and out of the GPU. However, the speed of memory copies between the GPU and the CPU is lower than the bandwidths of the GPU memories, and thus it can dramatically affect the performance of GPU computations. Accordingly, it is necessary to limit data transfers between the GPU and its CPU during the computations. For more details about GPU programming and the CUDA architecture, please refer to [25].
4 GPU implementation
The parallel implementation of both solvers, the projected Richardson and the projected block relaxation methods, requires the data to be partitioned among the computing nodes of the GPU cluster. Let N denote the number of computing nodes of the cluster, where a computing node is a pair made of a CPU core, holding one MPI process, and a GPU. Before starting the computations, the three-dimensional problem to be solved is split into N parallelepipedic sub-problems, one for each pair (MPI process, GPU), as shown in Figure 1. Indeed, the y and z axes of the three-dimensional domain of the problem are split into Ny and Nz parts, respectively, such that N = Ny × Nz. This block-based data partitioning of the problem reduces the data exchanges at sub-domain boundaries compared to a naive row-wise partitioning. After the decomposition of the problem, all the data generated by the partitioning are copied from the CPU memories to the GPU global memories, in order to be processed on the GPUs.
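For illustration, each MPI process could locate its own parallelepipedic sub-domain in this Ny × Nz decomposition as in the following sketch, where the variable names (rank, size_y, size_z, ...) are ours and the y and z sizes are assumed divisible by Ny and Nz.

/* Illustrative computation of the sub-domain owned by one computing node
   in the Ny x Nz partitioning of Figure 1; the x direction is kept whole. */
void get_subdomain(int rank, int Ny, int Nz, int size_y, int size_z,
                   int *y_begin, int *y_end, int *z_begin, int *z_end)
{
    int jy = rank % Ny;         /* position of the node along the y axis  */
    int jz = rank / Ny;         /* position of the node along the z axis  */
    int ny = size_y / Ny;       /* local size along y (assumed divisible) */
    int nz = size_z / Nz;       /* local size along z (assumed divisible) */
    *y_begin = jy * ny;  *y_end = *y_begin + ny;
    *z_begin = jz * nz;  *z_end = *z_begin + nz;
}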
Figure 1: Data partitioning of a problem to be solved among N = 3 × 4 GPUs

Like many other iterative methods, the algorithms of the projected Richardson method and of the projected block relaxation method use a set of data-parallel linear algebra operations that are amenable to implementation on parallel CPU architectures, and thus on GPUs. However, their implementations on a GPU cluster differ on two main points:

- The projected Richardson solver is implemented as a fixed point based iteration and uses the vector component updates of the Jacobi method;
- The projected block relaxation solver is implemented as a block-based iteration and uses the vector block component updates of the Gauss-Seidel method.

In the parallel computations, the same algorithm of a method is executed in parallel by each computing node of the GPU cluster, but on different three-dimensional sub-problems. In each computing node, the MPI process acts as a
controller of the main loop of the method, and the GPU executes all the data-parallel linear algebra operations inside this main loop as kernels.

The implementation of all the kernels of the projected Richardson method on GPUs uses a perfectly fine-grained multithreaded parallelism. Since this method is implemented as a fixed point method, each kernel is executed by a large number of GPU threads, such that each thread is in charge of the computation of one component of the local iterate vector U of the sub-problem. Moreover, this method uses the vector updates of the Jacobi method, which means that each thread i computes the new value of its vector component, U_i^{p+1}, independently of the new values, U_j^{p+1} for j ≠ i, computed in parallel by the other threads at the same relaxation p + 1.

Unfortunately, this is not the case for the projected block relaxation method. First, its block-based nature requires triangular solves of the matrix blocks. At the initialization step, the projected block relaxation method performs a triangulation of the largest block of the tridiagonal matrix (in principle they are all of the same size) along the x axis, according to the numbering of the grid points. Then, at each relaxation, the method performs triangular solves by the back substitution method for each matrix block. In this case, each GPU thread is in charge of a block of vector components instead of only one vector component as in the projected Richardson method, because of the recurrence relations in the computations of the vector components inside each block. Second, this method uses the vector updates of the Gauss-Seidel method, such that each computation of the new value of a vector component, U_i^{p+1}, involves the new values, U_j^{p+1} for j < i, and the old values, U_j^p for j > i, of the other vector components. Consequently, each GPU thread must wait for the new values of the other components, computed by other threads, before computing the new values of the components of its own block. However, having threads wait for other threads to complete their computations dramatically affects the computation performance of the GPUs. Therefore, the implementation of the projected block relaxation method performs mixed CPU/GPU computations, such that the triangular solves and the vector updates are performed by the MPI process, which involves, at each relaxation, data transfers between the CPU core and the GPU.

The dimensions of the grid and of the blocks of threads that execute a given kernel on the GPU are provided by the MPI process in the execution configuration of the kernel, defined by the syntax <<< ... >>>. They depend on the resources of the GPU multiprocessors and on the resource requirements of the kernel, as mentioned in Section 3. So if block defines the size of a thread block, which must not exceed the maximum size of a thread block (512 threads on our GPUs), then the number of thread blocks in the grid, denoted grid, can be computed according to the size n of the problem as follows:

grid = (n + block − 1)/block    (11)
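For instance, for a one-dimensional kernel over n elements, the execution configuration (11) would be set up on the host as follows; the kernel name, the device pointer and the block size of 256 are illustrative choices of ours.

// Host-side sketch of the execution configuration (11) for a 1D kernel;
// my_kernel, d_data and the block size of 256 are placeholders.
int block = 256;                        // threads per block (<= 512 on our GPUs)
int grid  = (n + block - 1) / block;    // number of thread blocks, as in (11)
my_kernel<<<grid, block>>>(d_data, n);  // kernel launch on the GPU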
However, when solving very large problems, the size of the grid can exceed the maximum number of thread blocks that can be executed on the GPUs (up to 65535 thread blocks), and the kernel would then fail to launch. Therefore, for each kernel, we decompose the three-dimensional sub-problem (nx × ny × nz) into nz two-dimensional slices of size (nx × ny), as shown in Figure 2. All the slices of the same kernel are processed in a for loop by nx × ny parallel threads organized in a two-dimensional grid of two-dimensional thread blocks, as shown in Figure 3. Each thread is in charge of nz discretization points (one in each slice), accessed in the GPU memory with a constant stride (nx × ny).
Figure 2: Decomposition of a sub-problem in a GPU into nz slices
//GPU kernel
__global__ void kernel(..., int n, int nx, int ny, int slices, int stride, ...)
{
  int tx = blockIdx.x * blockDim.x + threadIdx.x;  //x-coordinate of a thread
  int ty = blockIdx.y * blockDim.y + threadIdx.y;  //y-coordinate of a thread
  int tid = tx + ty * nx;                          //thread ID in the grid
  for(int i=0; i<slices; i++) {                    //loop over the slices of the sub-problem
    if(tid < n) {
      //...computation at the discretization point of index tid in slice i...
    }
    tid += stride;                                 //same (x,y) point in the next slice
  }
}
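A possible host-side configuration for such a sliced kernel is sketched below; the 16 × 16 block size and the exact argument list are assumptions of ours, with n = nx × ny × nz, slices = nz and stride = nx × ny as described above.

// Illustrative 2D execution configuration for the sliced kernel above.
dim3 block(16, 16);                       // two-dimensional thread block
dim3 grid((nx + block.x - 1) / block.x,   // two-dimensional grid covering
          (ny + block.y - 1) / block.y);  // the nx x ny points of a slice
kernel<<<grid, block>>>(/*...,*/ nx * ny * nz, nx, ny, nz, nx * ny /*, ...*/);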