J. Cent. South Univ. (2013) 20: 1189−1203 DOI: 10.1007/s11771-013-1602-z
Programming for scientific computing on peta-scale heterogeneous parallel systems

YANG Can-qun(杨灿群)1, WU Qiang(吴强)1, TANG Tao(唐滔)1, WANG Feng(王锋)1, XUE Jing-ling(薛京灵)2

1. State Key Laboratory of High Performance Computing (National University of Defense Technology), Changsha 410073, China;
2. School of Computer Science and Engineering, University of New South Wales, Sydney NSW 2052, Australia

© Central South University Press and Springer-Verlag Berlin Heidelberg 2013

Abstract: Peta-scale high-performance computing systems are increasingly built with heterogeneous CPU and GPU nodes to achieve higher power efficiency and computation throughput. While providing unprecedented capabilities to conduct computational experiments of historic significance, these systems are presently difficult to program. Their users, who are domain experts rather than computer experts, prefer programming models closer to their domains (e.g., physics and biology) rather than MPI and OpenMP. This has led to the development of domain-specific programming frameworks that provide domain-specific programming interfaces but abstract away some performance-critical architecture details. Based on experience in designing large-scale computing systems, a hybrid programming framework for scientific computing on heterogeneous architectures is proposed in this work. Its design philosophy is to provide a collaborative mechanism for domain experts and computer experts so that both domain-specific knowledge and performance-critical architecture details can be adequately exploited. Two real-world scientific applications have been evaluated on TH-1A, a peta-scale CPU-GPU heterogeneous system that is currently the 5th fastest supercomputer in the world. The experimental results show that the proposed framework is well suited for developing large-scale scientific computing applications on peta-scale heterogeneous CPU/GPU systems.

Key words: heterogeneous parallel system; programming framework; scientific computing; GPU computing; molecular dynamics
1 Introduction

With the rapid development of GPU (graphics processing unit) computing performance and programming environments, there is growing interest in using GPUs to accelerate non-graphics computing, especially high performance computing [1−2]. Constructing supercomputers with both CPUs and GPUs has become a new trend in the high performance computing (HPC) area [3]. The rough architecture of these supercomputers is shown in Fig. 1, in which GPUs are generally used to accelerate specific algorithms. This architecture has advantages in many aspects, such as delivering very high performance at a relatively small system scale and offering much higher energy efficiency than its homogeneous counterpart. Still, CPU-GPU heterogeneous systems face several problems, such as programmability and reliability [4−5]. The programming model is the interface between the system and programmers, and is thus an important measure of the system's usability.
Fig. 1 Rough architecture of CPU-GPU heterogeneous parallel systems
To the users of these systems, the architecture is not their main concern when programming. They may not be experts in computer science, but rather in mathematics, physics or biology, for instance. They prefer a programming framework that can describe the application features more succinctly and efficiently, without involving too many architecture details. This so-called domain-specific programming framework has
Foundation item: Project(61170049) supported by the National Natural Science Foundation of China; Project(2012AA010903) supported by the National High Technology Research and Development Program of China
Received date: 2012−07−06; Accepted date: 2013−01−11
Corresponding author: TANG Tao, PhD; Tel: +86−13517403741; E-mail: [email protected]
become a popular research direction in high performance computing in recent years [6]. JASMIN [7], used in this work, is such a framework designed for scientific computing, especially for computations on adaptive block-structured meshes. However, domain-specific programming frameworks also have disadvantages. Because they put the main emphasis on application features and ignore architecture details, they cannot exploit the computing potential of specific architectures very efficiently, such as the CPU-GPU heterogeneous architecture mentioned above. Present CPU-GPU heterogeneous systems typically adopt CUDA [8] or OpenCL [9] as their programming interface, which demands a deep understanding of the architecture to develop efficient programs. We cannot assume that domain experts understand the architecture as deeply as computer experts. Similarly, computer experts usually cannot completely understand the algorithms or models of the applications, especially for large-scale real-world applications. Therefore, we argue that large-scale applications should be developed cooperatively by domain experts and computer experts. In this situation, the programming framework plays a critical role between domain and computer experts. Their relationship is illustrated in Fig. 2. The key issue of this hybrid programming framework is the task distribution between domain and computer experts. Domain experts take charge of describing the application-specific information such as models, parameters and algorithms, while computer experts are responsible for fully exploiting the computing power according to the architecture details. Their programs should be connected in a natural way and should exploit the computing power of the parallel system efficiently. To the best of our knowledge, this is the first study of a programming framework that covers both aspects. Based on our experience in developing high performance computing systems and applications, we propose a hybrid programming framework integrating JASMIN,
OpenMP and CUDA/OpenCL, aiming to improve the programming interface for scientific computing on CPU-GPU heterogeneous parallel systems. In our framework, domain experts describe application features with JASMIN, which provides a concise application-level task distribution mechanism to exploit coarse-grained parallelism between computing nodes. Computer experts adopt OpenMP and CUDA/OpenCL to exploit fine-grained intra-node parallelism for specific numerical procedures. With this framework, domain and computer experts can collaboratively and efficiently develop scientific applications for large-scale CPU-GPU heterogeneous systems. We extend JASMIN with a dynamic performance monitoring interface and improve its application-level checkpointing implementation with a distributed in-memory checkpointing scheme, aiming to improve the reliability of the systems while keeping the programming interface concise. With these approaches, the developed applications can tolerate all single-node faults with high performance and are sensitive to nodes' performance abnormalities at runtime. We develop two real-world scientific applications on TH-1A [4], a peta-scale CPU-GPU heterogeneous parallel system. The results show that applications developed with this framework can efficiently exploit the computing power of large-scale CPU-GPU heterogeneous systems. We are convinced that this work offers a promising candidate for programming future exa-scale systems.
2 Collaborative programming framework

2.1 JASMIN overview

JASMIN is a parallel software infrastructure for scientific computing, especially for parallel adaptive mesh applications. By encapsulating data structures, integrating numerical algorithms and shielding large-scale parallel computing details, JASMIN supports fast programming of adaptive mesh applications on large-
Fig. 2 Domain-computer experts collaborative programming framework
scale parallel computing systems. With the help of JASMIN, users can quickly develop adaptive mesh applications based on partial differential equations by providing physical models, discrete stencils and problem-specific algorithms. They need not be familiar with parallel computing, adaptive computing or high performance computing techniques. Most underlying implementation details, such as MPI parallelization, general numerical algorithm libraries, load balancing, mesh management, input/output and result visualization, are handled by the infrastructure automatically. The software architecture of JASMIN is illustrated in Fig. 3. The bottom layer supports operations for structured adaptive mesh refinement (SAMR), including parameter input, memory management, communication and load balancing management, and mesh refinement/coarsening modules. The middle layer is a set of general-purpose numerical algorithms extracted from real-world applications. The top layer provides a C++ programming interface, based on which users can design serial numerical subroutines for physical models, parameters, discrete stencils and special algorithms. JASMIN integrates all these modules into a complete parallel program.

Fig. 3 Software architecture of JASMIN (application code with numerical subroutines for physics models, parameters, discrete stencils and special algorithms; top layer: interfaces for user applications; middle layer: numerical algorithms, including time integrator, grid geometry, application utilities, math operations and solvers; supporting layer: SAMR meshes, including mesh adaptivity, patch hierarchy, patch data, communications and tool box)
In JASMIN, all mesh-based simulations adopt a particle-cell-patch organization to exploit parallelism. For example, in a molecular dynamics simulation, the 3D simulation domain is first organized as a 3D grid of cubes, each of which is called a cell. Particles are distributed among these cells. The size of a cell is usually chosen according to the cut-off distance in the force calculation. The interaction between particles decays with their distance and can be ignored once the distance is long enough. Therefore, to reduce the computation in the simulation, a cut-off distance is usually assigned; particles farther apart than this threshold are considered to have no interaction with each other. The cell size is generally equal to or slightly larger than the cut-off distance, so that interactions between two particles that are not in the same or neighboring cells need not be considered. Cells are used to exploit fine-grained parallelism, while patches are used to exploit coarse-grained parallelism. JASMIN packs several contiguous cells into a cell set. A cell set together with all its boundary cells (the ghost layer) is called a patch, as shown in Fig. 4. In JASMIN, a patch is both an encapsulated data structure and a basic task scheduling unit. Task distribution in JASMIN applications is generally implemented on the basis of patches. The management of patch data and the inter-patch communication caused by particle migration are handled by JASMIN automatically.
Fig. 4 A cutaway view of particle-cell-patch organization
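For illustration, the mapping from a particle position to its cell under this decomposition can be sketched as follows. This is a minimal example with hypothetical names; it is not JASMIN's data structure. It only assumes that the cell edge is chosen no smaller than the cut-off distance rc, so that interacting particles always lie in the same or in neighboring cells.

#include <cmath>

// Hypothetical cell grid: edge lengths are derived from the cut-off distance rc.
struct CellGrid {
    double origin[3];   // lower corner of the simulation box
    double edge[3];     // cell edge lengths, each >= rc for a box wider than rc
    int    dim[3];      // number of cells along each axis
};

inline CellGrid make_grid(const double lo[3], const double hi[3], double rc) {
    CellGrid g;
    for (int d = 0; d < 3; ++d) {
        g.origin[d] = lo[d];
        int n = static_cast<int>(std::floor((hi[d] - lo[d]) / rc));  // as many cells as fit
        g.dim[d]  = n > 0 ? n : 1;
        g.edge[d] = (hi[d] - lo[d]) / g.dim[d];
    }
    return g;
}

// Linear index of the cell containing a particle at position r.
inline int cell_of(const CellGrid& g, const double r[3]) {
    int c[3];
    for (int d = 0; d < 3; ++d) {
        c[d] = static_cast<int>((r[d] - g.origin[d]) / g.edge[d]);
        if (c[d] < 0) c[d] = 0;                       // clamp particles on the boundary
        if (c[d] >= g.dim[d]) c[d] = g.dim[d] - 1;
    }
    return (c[2] * g.dim[1] + c[1]) * g.dim[0] + c[0];
}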
Presently, JASMIN only supports programming on homogeneous architectures. The infrastructure adopts MPI to exploit inter-node parallelism, while inside a node, OpenMP is typically used to exploit thread-level parallelism on multi-core CPUs. In this work, we study how to integrate GPU programming into this framework efficiently, so as to support programming on CPU-GPU heterogeneous parallel systems.

2.2 Collaborative programming framework

Although JASMIN provides a high-level programming interface for domain experts, the programs, when mapped onto traditional parallel computers, are still implemented as OpenMP threads nested in MPI processes, as shown in Fig. 5. Without loss of generality, we assume in this work that each node is equipped with two multi-core CPUs. In this scheme, domain experts provide domain-specific information such as models, parameters and algorithms to JASMIN, which partitions the problem space and handles the MPI parallelization accordingly. Typically, one process runs on each node. Meanwhile, thread-level parallelization of the numerical algorithms is accomplished by computer experts to fully utilize all CPU cores in each process.
With a GPU introduced into the computing node, another level of parallelism is needed to describe the relationship between the CPU and the GPU. In our design, each MPI process spawns two control threads, each of which runs on one CPU core, as shown in Fig. 6. One thread manages the GPU through the CUDA/OpenCL interface, and the other spawns a new team of OpenMP threads to run on all the other CPU cores in the node. These two control threads dynamically apply for computation tasks from JASMIN and forward them to the GPU program or to the nested OpenMP threads. The GPU program and the nested OpenMP program are provided by computer experts, aiming at taking full advantage of the computing power. It should be noted that the above execution scheme is not the only choice in our framework. For instance, we can also generate two MPI processes on
Fig. 5 Execution scheme on homogeneous parallel systems
Fig. 6 Execution scheme on GPU-accelerated parallel systems
Fig. 7 Execution scheme with two processes for one node
each node, as shown in Fig. 7. Each process runs on one core of one CPU. As before, each process spawns two control threads, which manage the GPU and the other cores of the local CPU, respectively. The two GPU-control threads compete for the GPU, and the competition is handled by the GPU's runtime environment. More processes may result in better performance, since some fragments of the process that are not suitable for OpenMP parallelization can be parallelized across multiple processes. However, higher parallelism may sometimes incur more redundant computation. Hence, the choice depends on the concrete application and is beyond the scope of this work. In the following, we consider the former scheme, i.e., one process per node. Considering JASMIN's patch-based task distribution scheme, our framework first distributes all
patches in the simulation domain evenly among the computing nodes. Then, in each process, the two control threads dynamically apply for patches until no unprocessed patch is left in the simulation domain. Each acquired patch is then passed to the GPU or to the nested thread team.
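A minimal sketch of this per-process execution scheme (Fig. 6) is given below. The function names and the patch counter are illustrative placeholders, not JASMIN interfaces; they only show how two control threads can dynamically apply for patches and forward them to the GPU or to a nested CPU team.

#include <atomic>
#include <thread>

static std::atomic<int> next_patch{0};       // next unprocessed patch of this process
static int num_local_patches = 0;            // set by the infrastructure (assumption)

void compute_patch_on_gpu(int p) { (void)p; /* CUDA/OpenCL path written by computer experts */ }
void compute_patch_on_cpu(int p) { (void)p; /* nested OpenMP path written by computer experts */ }

// Each control thread repeatedly applies for a patch and forwards it.
static void control_loop(bool drives_gpu) {
    for (;;) {
        int p = next_patch.fetch_add(1);     // dynamically apply for a patch
        if (p >= num_local_patches) break;   // no unprocessed patch left
        if (drives_gpu) compute_patch_on_gpu(p);
        else            compute_patch_on_cpu(p);
    }
}

void process_timestep() {
    next_patch = 0;
    std::thread gpu_ctrl(control_loop, true);    // one core manages the GPU
    std::thread cpu_ctrl(control_loop, false);   // the other core leads the CPU thread team
    gpu_ctrl.join();
    cpu_ctrl.join();
}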
3 Software infrastructures

To better exploit the computing power of CPU-GPU heterogeneous parallel systems, we enhanced JASMIN's software infrastructure with several optimizations and fault-tolerant schemes tailored to the architecture features.

3.1 Optimization strategies

3.1.1 Intra-node load balancing

In the particle-cell-patch parallelization scheme, particles may migrate from one cell (patch) to another at runtime. Therefore, even if the distribution is initialized evenly, after a long run some cells (patches) may hold many more particles than others, which means they will have much more computation in the next timestep. Although we distribute the patches evenly among all computing nodes at first, the workload between nodes may become unbalanced as particles migrate. This imbalance is handled by JASMIN's infrastructure automatically by migrating patches from higher-load nodes to lower-load nodes at regular intervals. As mentioned before, we adopt dynamic task scheduling between the CPU and the GPU inside each node, because the performance model of the GPU is complex and its execution time is hard to predict, so a static task partition is not feasible. Workload imbalance may also arise inside the node due to the different particle counts of the cells. The kernel function running on the GPU generates enough threads to calculate all particles of each cell in parallel. Therefore, for cells with fewer particles, some threads will be idle and the processing unit they are assigned to will exhibit lower occupancy. To handle this problem, our infrastructure partitions a kernel into multiple subkernels according to the number of particles in each cell. For instance, we categorized the cells into four intervals, [0, 63], [64, 127], [128, 191] and [192, 255], and for each group a particular subkernel with a proper thread count was issued. As a result, the number of idle threads was reduced.
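The sub-kernel partitioning can be sketched as follows. This CUDA sketch is illustrative only; the kernel body and the host-side bookkeeping are assumptions, not the production code. Cells are grouped by particle count and each group is launched with a block size matching its interval, so fewer threads sit idle on small cells.

#include <cuda_runtime.h>
#include <vector>

__global__ void force_subkernel(const int* cell_ids, int ngroup_cells) {
    int cell = cell_ids[blockIdx.x];   // one thread block per cell of this group
    int tid  = threadIdx.x;            // one thread per particle slot
    // ... force computation for particle tid of cell `cell` (omitted) ...
    (void)cell; (void)tid; (void)ngroup_cells;
}

void launch_by_particle_count(const int* particle_count, int ncells) {
    const int upper[4] = {64, 128, 192, 256};   // bins [0,63], [64,127], [128,191], [192,255]
    std::vector<int> group[4];
    for (int c = 0; c < ncells; ++c)
        for (int b = 0; b < 4; ++b)
            if (particle_count[c] < upper[b]) { group[b].push_back(c); break; }

    for (int b = 0; b < 4; ++b) {
        if (group[b].empty()) continue;
        int* d_ids = nullptr;
        size_t bytes = group[b].size() * sizeof(int);
        cudaMalloc(&d_ids, bytes);
        cudaMemcpy(d_ids, group[b].data(), bytes, cudaMemcpyHostToDevice);
        // Block size equals the bin's upper bound, i.e. the largest cell it may hold.
        force_subkernel<<<(unsigned)group[b].size(), upper[b]>>>(d_ids, (int)group[b].size());
        cudaDeviceSynchronize();
        cudaFree(d_ids);
    }
}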
3.1.2 Communication overlapping

The low-bandwidth communication between the CPU and the GPU can evidently affect the performance of the GPU program. Although the bandwidth of the PCIe bus between the CPU and the GPU can reach up to 8 Gbps, data transfers across the bus are snooped, which greatly reduces the communication performance [8]. Write-combining page-locked memory can be used to improve the performance: it is not snooped during transfers across the PCIe bus, which can improve transfer performance by up to 40% [8]. Moreover, copies between write-combining page-locked host memory and device memory can be performed concurrently with kernel execution on devices of compute capability 2.0. Since write-combining page-locked host memory is a scarce resource in the system, we cannot allocate all patches in it at the same time. Instead, we allocate two blocks of such memory at a time: while one patch is being calculated, the other patch's data can be transferred. With this double-buffered method, we overlap kernel execution and data transfer.
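The double-buffered overlap can be sketched with CUDA streams as follows. The patch layout and the kernel are placeholders; the sketch only shows the mechanism of staging one patch in a write-combining page-locked buffer while the other buffer's patch is being computed.

#include <cuda_runtime.h>
#include <cstring>

__global__ void compute_forces(float* patch_data, int n) { (void)patch_data; (void)n; /* force kernel omitted */ }

void process_patches(float* const* host_patches, int npatch, int patch_elems) {
    size_t bytes = (size_t)patch_elems * sizeof(float);
    float* h_pinned[2]; float* d_buf[2]; cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaHostAlloc(&h_pinned[i], bytes, cudaHostAllocWriteCombined);  // not snooped on PCIe
        cudaMalloc(&d_buf[i], bytes);
        cudaStreamCreate(&stream[i]);
    }
    for (int p = 0; p < npatch; ++p) {
        int b = p & 1;                          // alternate between the two buffers
        cudaStreamSynchronize(stream[b]);       // wait until buffer b is free again
        std::memcpy(h_pinned[b], host_patches[p], bytes);   // stage patch p
        cudaMemcpyAsync(d_buf[b], h_pinned[b], bytes, cudaMemcpyHostToDevice, stream[b]);
        compute_forces<<<256, 256, 0, stream[b]>>>(d_buf[b], patch_elems);
        // While stream b transfers and computes patch p, the next iteration
        // stages patch p+1 in the other buffer, overlapping the two.
    }
    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(h_pinned[i]);
        cudaFree(d_buf[i]);
    }
}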
3.2 Towards higher reliability

The introduction of the GPU increases the complexity of the computing node and thus decreases its reliability. The GPU itself is a complex system consisting of several independent components, such as processing units, memory and the bus interconnect, and faults may arise in any of them at runtime. From practical observations, we categorize node faults into two types: functional faults and performance faults. A functional fault means the node crashes and no longer responds, while a performance fault means the program on the node can eventually finish, but with much poorer performance than on other nodes. Aiming at these two types of faults, we extended and enhanced the programming framework with fault-tolerant schemes.

3.2.1 Double in-memory checkpointing

The most commonly used technique to deal with functional faults is checkpoint-restart (CR) [10]. In a CR-based method, the state of the program, known as a checkpoint, is periodically saved to stable storage (typically disks). When a failure happens, the computation can be restarted from the most recent checkpoint. CR-based methods can be categorized into system-level and application-level ones. The former save checkpoints automatically without any indication from the program, while the latter let users insert checkpointing statements explicitly in the program to indicate which data should be saved. Presently, JASMIN provides a restartable execution mechanism: selected classes and objects can be denoted as restartable, and the programmer can save the state of these objects into a restart database and restore them when the program restarts after a crash. This mechanism can therefore be considered an application-level CR method. JASMIN adopts a global file system to store the restart database, which may throttle performance evidently due to bandwidth competition when the system scale becomes large. To improve scalability, we integrated an in-memory double checkpointing scheme into JASMIN's infrastructure. In-memory means that checkpoints are saved in memory (if possible) instead of on disk, which dramatically improves checkpointing performance (typically by two orders of magnitude or more). Double checkpointing [11] is a distributed fault-tolerant method that avoids the use of a global file system and thus improves scalability. As shown in Fig. 8, for each data object two copies of the checkpoint are saved, one on the local node and the other on a remote node. The remote node is called the buddy [12] or partner [13] of the local node. When a node crashes at runtime, its computation task is migrated to its partner node or to a spare node, and with a small amount of communication the computation can be restored. The double checkpointing method has proved very efficient for tolerating single-node faults, which are the most common faults in HPC systems. It should be noted that we provide an underlying implementation without changing JASMIN's programming interface, which requires almost no modification to the application codes.

Fig. 8 In-memory double checkpoint
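The underlying mechanism can be illustrated with the following MPI sketch. It is not JASMIN's interface; the pairing rule and buffer types are assumptions. Each rank keeps its checkpoint in local memory and exchanges a second copy with a buddy rank, so that a single-node failure can be recovered from the surviving copy.

#include <mpi.h>
#include <vector>

void double_checkpoint(const std::vector<char>& my_state,
                       std::vector<char>& buddy_state, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int buddy = rank ^ 1;                    // simple pairing; real schemes may differ
    if (buddy >= size) buddy = rank;         // odd rank count: degenerate self-pairing
    int nsend = (int)my_state.size(), nrecv = 0;
    MPI_Sendrecv(&nsend, 1, MPI_INT, buddy, 0,
                 &nrecv, 1, MPI_INT, buddy, 0, comm, MPI_STATUS_IGNORE);
    buddy_state.resize(nrecv);
    MPI_Sendrecv(my_state.data(), nsend, MPI_CHAR, buddy, 1,
                 buddy_state.data(), nrecv, MPI_CHAR, buddy, 1, comm, MPI_STATUS_IGNORE);
    // my_state stays resident in local memory; buddy_state now holds the
    // partner's checkpoint on this rank.
}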
3.2.2 Dynamic performance monitoring

In a large-scale parallel computing system, a node may sometimes exhibit much poorer performance than the others at runtime, which we call a performance fault. The introduction of the GPU evidently increases the probability of this kind of fault. As mentioned before, the GPU itself is a complex system, and a performance fault may occur in it for several reasons: for example, a fault in the device memory may incur heavy ECC-protection overhead, and a fault in the PCIe interconnect may greatly reduce the available bandwidth. If such deviant nodes cannot be detected and fixed in time, the performance of the whole program will suffer. Based on this observation, we integrated a dynamic performance monitoring interface into JASMIN. Programmers only need to invoke the performance monitor function periodically (typically at the end of a coarse-grain loop); the application then monitors the performance of all nodes at runtime and reports a warning if some node fails the performance test, so that the user or the task management system can fix or replace the deviant nodes in time. Customized performance test cases and warning thresholds can be specified by passing parameters to the monitor function. This method has proved very convenient for locating performance faults in the systems we designed previously, especially those with very large scale.
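The mechanism behind such a monitor can be sketched as follows. This is an illustrative MPI implementation, not JASMIN's actual interface: every rank times the same small test case, the fastest time is obtained with an all-reduce, and ranks slower than a given threshold relative to the fastest one are reported as deviant.

#include <mpi.h>
#include <cstdio>

void monitor_node_performance(void (*test_case)(), double threshold, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    test_case();                                  // small, fixed-size benchmark kernel
    double local = MPI_Wtime() - t0, fastest = 0.0;
    MPI_Allreduce(&local, &fastest, 1, MPI_DOUBLE, MPI_MIN, comm);
    if (local > threshold * fastest) {
        int rank;
        MPI_Comm_rank(comm, &rank);
        std::fprintf(stderr, "warning: rank %d is %.1fx slower than the fastest node\n",
                     rank, local / fastest);
    }
}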
4 Application case studies

In this section, we present the development of two real-world scientific applications using our framework on TH-1A [4], a peta-scale CPU-GPU heterogeneous system.

4.1 Morse potential molecular dynamics simulation

4.1.1 Computation task analysis

Molecular dynamics (MD) is frequently used in the study of nanoscale physical phenomena. By modeling the motions of atoms or particles within a molecular system, in-depth understanding of complex physical mechanisms can be gained. The computation task in an MD simulation is to integrate the set of coupled differential Newton's equations given by

m_i \frac{\mathrm{d}v_i}{\mathrm{d}t} = \sum_{j \neq i} F_2(r_i, r_j)    (1)

\frac{\mathrm{d}r_i}{\mathrm{d}t} = v_i    (2)
where m_i, r_i and v_i are the mass, position vector and velocity vector of particle i, respectively, and F_2 is a force function describing the pairwise interaction between particles (three-body and many-body interactions can be added). The force terms in Eq. (1) may be either long-range or short-range in nature. For long-range forces such as Coulombic interactions, each particle interacts with all the others. Long-range force models are not commonly used in classical MD simulations since computing them directly is too costly [14]. Therefore, we adopted the widely used short-range force model, in which case the summation in Eq. (1) is restricted to particles within a cut-off distance r_c. The force terms in Eq. (1) are the derivatives of potential energy expressions. We adopted the Morse potential function to describe the potential energy between particles; however, this work can be generalized easily to other cut-off potentials such as the Lennard-Jones potential [15]. The potential energy between two particles i and j, denoted as φ(i, j), is given by
\varphi(i, j) = \varepsilon \left( e^{-2\beta (r_{ij} - r_c)} - 2 e^{-\beta (r_{ij} - r_c)} \right)    (3)
where ε and β are constants denoting the dissociation energy and the gradient coefficient of the potential energy curve, respectively, and r_{ij} is the distance between particles i and j. Given the potential function, the force between particles i and j can be described as

F_2(r_i, r_j) = -\frac{\partial \varphi(i, j)}{\partial r_{ij}} = 2\varepsilon\beta \left( e^{-2\beta (r_{ij} - r_c)} - e^{-\beta (r_{ij} - r_c)} \right)    (4)
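For reference, Eqs. (3) and (4) transcribe directly into code. The sketch below is a scalar helper of the kind used as force(dr) in Algorithm 1 later; eps, beta and rc correspond to the constants of the reconstructed equations, and the structure name is hypothetical.

#include <cmath>

struct MorsePotential {
    double eps, beta, rc;                   // dissociation energy, gradient coefficient, cut-off
    double phi(double rij) const {          // Eq. (3)
        double e = std::exp(-beta * (rij - rc));
        return eps * (e * e - 2.0 * e);
    }
    double force(double rij) const {        // Eq. (4): -d(phi)/d(rij)
        double e = std::exp(-beta * (rij - rc));
        return 2.0 * eps * beta * (e * e - e);
    }
};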
Although the computational complexity has been reduced by using the Morse potential, determining the total forces is still the most computationally intensive part of MD simulations. As in many other GPU-accelerated MD codes [16−17], we use the GPU to accelerate the force computation. The update task and other bookkeeping tasks (e.g., load balancing and communication) are all performed on the CPU to reduce the GPU's scheduling overhead.

4.1.2 CPU implementation

When the CPU-control thread is assigned a patch, it spawns 10 slave OpenMP threads to run on the remaining 10 CPU cores (two Intel Xeon X5670 CPUs provide 12 cores). These 11 threads run concurrently to process the cells of the patch: they dynamically apply for cells, each thread being allotted one cell at a time until there are no unprocessed cells left in the patch. Then, the 11 threads are synchronized.
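This dynamic cell scheduling can be expressed compactly with OpenMP, as in the following sketch (names are placeholders; the per-cell work is the body of Algorithm 1 below).

void compute_patch_on_cpu_cells(int ncells) {
    #pragma omp parallel num_threads(11)
    {
        #pragma omp for schedule(dynamic, 1)          // one cell at a time per thread
        for (int cell = 0; cell < ncells; ++cell) {
            // compute_cell_forces(cell);             // Algorithm 1 body for one cell (omitted)
        }
    }   // implicit barrier: the 11 threads synchronize here
}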
Algorithm 1: Calculate forces on CPU
1. tempf←0
2. for all cells i in patch p do
3.   for z=0 to 13 do
4.     j←HNB[i][z]
5.     for k=0 to NA[i]−1 do
6.       m←NN[i][k]; tempf←0
7.       if i=j then
8.         o←k+1
9.       else
10.        o←0
11.      endif
12.      for l=o to NA[j]−1 do
13.        n←NN[j][l]
14.        dr←|POS[m]−POS[n]|
15.        if dr≤rc then
16.          f←force(dr)
17.          tempf←tempf+f
18.          if j is not a ghost cell then
19.            F[n]←F[n]−f
20.          endif
21.        endif
22.      endfor
23.      F[m]←F[m]+tempf
24.    endfor
25.  endfor
26. endfor

The CPU calculation of the forces on one patch is shown in Algorithm 1. For each particle, the neighboring particles are stored in 27 cells: the cell it resides in and the 26 surrounding cells. By exploiting Newton's third law (the mutual forces of action and reaction between two bodies are equal and opposite), only half of these cells need to be considered. In our implementation, the indices of these cells are stored in a 2-D array HNB. The number of particles varies from cell to cell, so an array NA stores the number of particles in each cell. The 2-D array NN maps the local index of a particle within a cell to the global index of that particle. The coordinates of all the particles in the patch are stored in the array POS, and the array F stores the resulting force on each particle. The kernel is essentially a four-level loop nest. The first-level loop at line 2 iterates over all the cells of patch p. The second-level loop at line 3 iterates over cell i and half of its neighboring cells. The inner two loops, from line 5 to line 22, incrementally accumulate the total forces acting on the particles: the third-level loop iterates over all particles of cell i, and the fourth-level loop iterates over the particles of cell j, starting from the index o set at lines 7 to 11. When the neighbor is cell i itself, lines 7 to 11 ensure that each particle only interacts with particles whose local indices are larger than its own. The function force implements Eq. (4). To guarantee the correctness of the results, all scattered accesses to F use atomic read-modify-write operations while accumulating forces at lines 19 and 23.

4.1.3 GPU implementation

As mentioned in Section 2.2, the computation on the GPU is developed using GPU programming interfaces such as CUDA [8] and OpenCL [9] (CUDA in this work). In a CUDA program, a kernel running on the GPU launches many lightweight threads to exploit data parallelism at runtime. These threads are organized into thread blocks, which are co-scheduled groups of threads that can share data through a fast, writable shared memory and synchronize with each other using barrier instructions. Given a patch, each thread block is assigned one cell and each thread is assigned one particle. Many works [18−21] have developed GPU algorithms for short-range force calculation. However, none of them employs Newton's third law to halve the computation, since the overhead of scattered accesses with atomic operations is very expensive on the GPU [22]. First, it requires atomic read-modify-write operations on double-precision floating-point data, which are not presently supported in the GPU's hardware. Moreover, the extra write operations introduce scattered memory accesses, which evidently decrease performance. In our implementation, we solve this problem by taking advantage of the shared memory on the GPU, which greatly reduces the overhead of the atomic operations introduced by applying Newton's third law.
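The shared-memory idea can be illustrated with the following single-precision CUDA sketch. It is not the NTkernel given in Algorithms 2 and 3 below; it assumes two distinct cells, a hypothetical cell_start/cell_count layout, and a block of at least max(ni, nj) threads with nj*sizeof(float3) bytes of dynamic shared memory. The reaction forces on the neighboring cell are accumulated with cheap shared-memory atomics and flushed to global memory only once per particle, instead of one global atomic per interacting pair.

#include <cuda_runtime.h>

__global__ void pair_cell_forces(const float3* pos, float3* force,
                                 const int* cell_start, const int* cell_count,
                                 int cell_i, int cell_j, float rc2) {
    extern __shared__ float3 s_fj[];              // reaction-force buffer for cell_j
    int t  = threadIdx.x;
    int ni = cell_count[cell_i], nj = cell_count[cell_j];
    if (t < nj) s_fj[t] = make_float3(0.f, 0.f, 0.f);
    __syncthreads();

    if (t < ni) {                                 // one thread per particle of cell_i
        float3 ri = pos[cell_start[cell_i] + t];
        float3 fi = make_float3(0.f, 0.f, 0.f);
        for (int l = 0; l < nj; ++l) {
            float3 rj = pos[cell_start[cell_j] + l];
            float dx = ri.x - rj.x, dy = ri.y - rj.y, dz = ri.z - rj.z;
            float r2 = dx*dx + dy*dy + dz*dz;
            if (r2 < rc2) {
                float f = 1.0f;                   // pair force magnitude (placeholder for Eq. (4))
                fi.x += f*dx; fi.y += f*dy; fi.z += f*dz;
                atomicAdd(&s_fj[l].x, -f*dx);     // Newton's third law via shared-memory atomics
                atomicAdd(&s_fj[l].y, -f*dy);
                atomicAdd(&s_fj[l].z, -f*dz);
            }
        }
        atomicAdd(&force[cell_start[cell_i] + t].x, fi.x);
        atomicAdd(&force[cell_start[cell_i] + t].y, fi.y);
        atomicAdd(&force[cell_start[cell_i] + t].z, fi.z);
    }
    __syncthreads();
    if (t < nj) {                                 // one global update per particle of cell_j
        atomicAdd(&force[cell_start[cell_j] + t].x, s_fj[t].x);
        atomicAdd(&force[cell_start[cell_j] + t].y, s_fj[t].y);
        atomicAdd(&force[cell_start[cell_j] + t].z, s_fj[t].z);
    }
}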
Our solution, which we call the NTkernel, is shown in Algorithm 2 and Algorithm 3. Algorithm 2 calculates the forces between the particles of the same cell, while Algorithm 3 handles the forces from neighboring cells.

Algorithm 2: NTkernel: Part 1
1. tempf←0; SP←0
2. i←NN[b][t]
3. if t