FULL PAPER
Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units

Mohamed Hacene,[a] Ani Anciaux-Sedrakian,[a] Xavier Rozanska,[b]† Diego Klahr,[a]‡ Thomas Guignon,*[a] and Paul Fleurat-Lessard*[b]

We present a way to improve the performance of the electronic structure Vienna Ab initio Simulation Package (VASP) program. We show that high-performance computers equipped with graphics processing units (GPUs) as accelerators may drastically reduce the computation time when the time-consuming parts of the code are offloaded to the graphic chips. The procedure consists of (i) profiling the performance of the code to isolate the time-consuming parts, (ii) rewriting these so that the algorithms become better suited for the chosen graphic accelerator, and (iii) optimizing the memory traffic between the host computer and the GPU accelerator. We chose to accelerate VASP with NVIDIA GPUs using CUDA. We compare the GPU and original versions of VASP by evaluating the Davidson and RMM-DIIS algorithms on chemical systems of up to 1100 atoms. In these tests, the total time is reduced by a factor between 3 and 8 when running on n (CPU core + GPU) compared to n CPU cores only, without any accuracy loss. © 2012 Wiley Periodicals, Inc.
Introduction
Computational chemistry evolved to become a highly valuable tool for the characterization and analysis of materials and chemical phenomena. Quantum chemical calculations are now routinely used to complement experiments.[1,2] Because the studied systems are increasingly complex, reducing the computation time is an important issue. Known bottlenecks in quantum chemical-physics simulations at the atomistic level are matrix products, vector operations, and Fast Fourier Transforms (FFTs). Two approaches may be followed to reduce the overall computation time and to allow simulating larger chemical systems. In the first one, we could optimize (i) the mathematical approximations for the theoretical chemical equations (e.g., resolution of identity density functional theory (DFT),[3–6] wavelet DFT,[7,8] linear scaling approaches[9]), and (ii) the physicochemical approximations for the chemical system and/or its environment (e.g., cluster, periodic, or hybrid approaches, implicit electrostatic environments). In the second one, we could (re)design completely or partially the software to take advantage of the newest hardware technologies. We follow the latter approach.

Current hardware designs include multicore CPU (that is, a Central Processing Unit containing more than two cores), many-core CPU (more than 32 cores per Central Processing Unit), and GPU (Graphics Processing Unit) platforms. Heterogeneous GPU-based multicore platforms are composed of GPUs and multicore CPUs. The efficiency of these architectures is demonstrated by different benchmarks, such as FFT or sparse matrix-vector multiplication tests.[10,11] In this article, we study the performance of the heterogeneous GPU-based multicore architecture in numerical simulations of chemical systems. During the past years, theoretical chemistry software has been modified or developed from scratch to benefit from the massively parallel GPU technology.[8,12–24]

The Vienna Ab initio Simulation Package (VASP) is an efficient plane-wave code based on periodic DFT.[25–28] It allows the theoretical study of chemical systems via energy and force calculations, which permit geometry optimizations, molecular dynamics simulations, and the determination of a wide range of physicochemical properties for solids or surfaces.[1,2,25–29] It shows good performance on CPU hardware and could gain attractiveness after being ported to GPU hardware. Maintz et al.[24] recently ported the VASP Blocked-Davidson wave function optimization to GPU. Their modification resulted in a computation time reduction by a factor of 7 on a C2050 graphic card (Fermi architecture) in comparison to an Intel Xeon X5560 2.8 GHz processor. However, the RMM-DIIS algorithm is more efficient for large chemical systems and molecular dynamics simulations.[28] It is thus desirable to port this algorithm to GPU. Therefore, in this work, we focus on the routines involved in the electronic minimization and their behavior on GPU-based clusters. In particular, we study the GPU versions of the blocked Davidson (ALGO = Normal keyword), RMM-DIIS (ALGO =
DOI: 10.1002/jcc.23096
[a] M. Hacene, A. Anciaux-Sedrakian, D. Klahr, T. Guignon
IFP Energies Nouvelles, 1 et 4 avenue de Bois-Préau, F-92852 Rueil-Malmaison Cedex, France
E-mail: [email protected]
[b] X. Rozanska, P. Fleurat-Lessard
Laboratoire de Chimie de l'ENS de Lyon, Université de Lyon, UMR CNRS 5182, 46 Allée d'Italie, F-69364 Lyon Cedex 07, France
E-mail: [email protected]
† Present address: Materials Design, 18 rue de Saisset, F-92120 Montrouge, France.
‡ Present address: Total E&P, Centre Scientifique et Technique J. Feger, Avenue Larribau, F-64000 Pau, France.
Contract/grant sponsor: King Abdullah University of Science and Technology (KAUST, Award No. UK-C0017).
© 2012 Wiley Periodicals, Inc.
VERYFAST), and mixed blocked Davidson and RMM-DIIS (ALGO = FAST) algorithms. Before describing the applied approach and implementation details, we briefly summarize the necessary background to understand the hardware and software constraints imposed by a GPU architecture. This article is organized as follows: in the first part, we present the GPU evolution, the main features of current GPUs, and their specificities in terms of programming models. The second part describes the porting of some VASP routines to GPU, with special care taken for multicore CPU and multi-GPU architectures. Results are gathered and analyzed in the third section, while the fourth section concludes this work.
GPU Background: Hardware and Software

In the past 20 years, GPUs have evolved from fixed-function processors to massively parallel floating-point engines. Hence, the idea of general-purpose computation on GPUs emerged to take advantage of the processing power of GPUs for nongraphical tasks. Such general-purpose computations were first done with OpenGL graphical application programming, where the computational concepts (data structures, algorithms, etc.) had to be mapped to graphical concepts (3D objects, textures, pixel shaders, etc.). This programming model is difficult to handle for a non-OpenGL expert and is inefficient for large-scale porting of codes to GPUs (see, for example, Ref. [30]). General-purpose programming tools like BrookGPU,[31] CUDA,[32] or OpenCL[33] ease GPU use for all programmers. However, fully benefitting from the GPU computational power still requires that algorithms exhibit a high degree of parallelism and regular data structures. This stems from the specific GPU hardware architecture: hundreds of basic computation cores (also called Stream Processors, SP) handle arithmetic and load/store operations without any flow control. The latter is handled by an additional unit called a sequencer. This leads to a programming model called Data Parallel.[34,35] Such a programming and execution model is similar to those used in the massively parallel supercomputers of the 1980s and 1990s.[34,35]

Nvidia GPUs provide a massively multithreaded execution model in which threads are identical instruction flows working on different data. This corresponds to the Single Instruction, Multiple Data programming model. Threads are scheduled by the sequencer in groups of 32, called warps. Usually, there are more threads than SPs in a typical computation sequence, called a kernel. Such an execution model makes it possible to hide the memory access latency.[36]

Here, we focus on the two most widely available and used Nvidia GPU architectures at the time of writing this manuscript, namely, GT200 and FERMI. The GT200 architecture provides up to 240 SP for the C/M1060§ card models. These cores are grouped into 30 groups of 8 SP, and each group is called a stream multiprocessor (SM).
§The C1060 and the M1060 have the same technical specifications. The main difference is that the C1060 has its own fan to actively cool it down, while the M1060 has only a passive cooling system. As a consequence, the M1060 is thinner and can be integrated closer to the CPU in 1U servers.
The FERMI architecture provides 448 SP for the C/M2070 and up to 512 SP for the M2090 models, grouped into 14 SMs (16 for the M2090), each containing 32 SP. The main difference between FERMI and GT200 is the capacity of FERMI to execute different kernels in parallel on different SMs (which permits task parallelism), while the GT200 can only execute one kernel at a time. As a consequence, on the GT200, the SPs not used by this kernel remain inactive during its execution, leading to a waste of computational resources. FERMI also has better double-precision computing performance: it contains 16 double-precision units per 32 SP, while the GT200 offers only one double-precision floating-point unit per 8 SP.

On the software side, CUDA and OpenCL have emerged as the leading general-purpose software tools. Although CUDA only provides support for Nvidia GPUs, OpenCL is more generic and supports Nvidia and AMD GPUs, as well as multicore CPUs. These software tools provide extensions to programming languages (mainly C) for handling GPU hardware concepts such as:

• thread parallelism, which has been formalized as "stream computing";
• memory hierarchy: local SM memory, different types of cache, GPU main memory, and sometimes host memory;
• asynchronous computation between the host CPU and the GPU;
• task parallelism between SMs with CUDA streams, thanks to the FERMI architecture.

Aside from the language extension and the associated compiler, other tools such as debuggers, profilers, and libraries are also available. In this work, we use the CUDA programming environment. Moreover, we consider that the GPUs and the CPUs process disjoint parts of the code. In this context, the GPU role is to decrease the computation time by assisting the CPU for the dedicated parts, even when taking into account the data transfer penalty.
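As a concrete illustration of this data-parallel model, the following minimal CUDA sketch (our own toy example, not taken from VASP; the kernel name, array, and sizes are arbitrary) launches far more threads than there are SPs, each thread updating one element of a double-precision array:

#include <cuda_runtime.h>

__global__ void scale(double *x, double alpha, int n)
{
    /* Each thread computes its global index and handles one element. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          /* guard: more threads are launched than elements */
        x[i] *= alpha;
}

int main(void)
{
    const int n = 1 << 20;            /* arbitrary vector length */
    double *d_x;
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMemset(d_x, 0, n * sizeof(double));

    /* Far more threads than SPs: the sequencer schedules them in warps
       of 32, which hides the memory access latency. */
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_x, 2.0, n);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    return 0;
}

The guard on the index is needed because the number of launched threads is rounded up to a multiple of the block size.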
GPU Acceleration of VASP

The main goal of this work is to accelerate the most time-consuming functions in VASP using GPUs and to analyze the software behavior in this context. To reach this goal, we first profiled VASP to identify the most time-demanding functions. The most expensive parts of these functions were modified to take advantage of the GPU architecture. We then monitored the data transfers between the CPU and the GPU. As these transfers proceed via the slow PCIe bus, some "inexpensive" functions executed between two consecutive calls to the GPUs were also ported to GPU to decrease the number of transfers. In its original implementation, VASP only uses double precision to compute the wave function, the energy, and its derivatives. Therefore, in this work, the port of VASP to GPU is only done with double-precision computation.

Profiling

To get an overview of VASP 5.2.2 performance, we used the Valgrind tool suite[37] together with KCachegrind.[38] The performance results for a solid MgH2 system containing six atoms (BULK) are shown in Figure 1 for the serial version of VASP. The energy for this system is obtained using the mixed blocked Davidson and RMM-DIIS algorithms (ALGO = FAST). BULK is described in the Results and Discussion section.

To know if the code can be ported on GPU, we need to keep abreast of the libraries that the software is using (in our case FFTW,[39] BLAS,[40] LAPACK,[41] and SCALAPACK[42]) and identify the functions that are often used. As seen in Figure 1, many functions use the FFTW library. For the BULK test case, ca. 60% of the computation time is spent in functions of this library. The importance of the FFTW library was then confirmed with a larger system of 240 atoms, denoted SILICA, and described in the Results and Discussion section. Further analyses indicated that other functions are time consuming: EDDAV in the case of the Blocked Davidson algorithm, EDDIAG and RMMDIIS for the RMM-DIIS algorithm, and the POTLOK, ORTHCH, and CHARGE functions for both algorithms.¶ All these routines were thus considered for optimization.

Figure 1. Most time-consuming routines (expressed in percentage of the total time) called during a geometry optimization done with VASP for the BULK test system.

Porting VASP to GPU

We now analyze the time-consuming routines to clarify how to proceed with the porting to GPUs.

CUDA version of libraries. First, we replace standard computational libraries by their GPU counterparts. The CUBLAS (BLAS library on GPU)[43] and the CUFFT (FFT library on GPU)[10] libraries are properly optimized for the target GPU architecture. Previous work has shown that on the Tesla M1060, the speedup factor in double precision is 3 between the BLAS and CUBLAS libraries and between 2.5 and 5 when CUFFT is used instead of the FFTW library.[11]** The CULA library, which provides some LAPACK routines on GPU,[44] is not used in this work due to the lack of accuracy and the unavailability of some routines for double-precision complex entities (for example, to store the wave function).
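As a hedged illustration of this library substitution (a sketch under our own assumptions, not the actual VASP code; the helper name zgemm_on_gpu and the square matrix dimension are invented), a double-precision complex matrix product that would be a ZGEMM call on the CPU can be routed through CUBLAS as follows:

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Hypothetical helper: computes C = A*B for n x n double-complex matrices
   stored column-major on the host, using CUBLAS on the GPU. */
void zgemm_on_gpu(const cuDoubleComplex *hA, const cuDoubleComplex *hB,
                  cuDoubleComplex *hC, int n)
{
    size_t bytes = (size_t)n * n * sizeof(cuDoubleComplex);
    cuDoubleComplex *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    /* Same semantics as the BLAS ZGEMM call it replaces. */
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, dA, n, dB, n, &zero, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}

An FFTW call would be replaced in the same spirit by a CUFFT plan and an execution call such as cufftExecZ2Z on device data.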
Data transfer. In a second stage, we study how the data transfers from the CPU to the GPU may be reduced. Indeed, even though the latest graphic chips have fast local memory, the data transfers are done through the PCIe bus, which is much slower than memory access. Therefore, optimizing data transfers is compulsory to ensure that they are not the bottleneck and to fully benefit from the computation power of the GPUs. We use two strategies to reduce the memory traffic.

First, we detect the portions of the software where the simulation alternates between CPU and GPU. To reduce the time-consuming data transfer between the CPU and the GPU, functions executed between two successive calls to the GPUs were also ported to GPU, thus avoiding two data transfers.

Then, we embed the data transfers into the computation part. This is done using an asynchronous memory copy from CPU to GPU: while the data are transferred, a GPU kernel and/or a CPU function can be launched at the same time, as long as the data transfers and the calculation are independent. This is illustrated in Figure 2.
¶Main purpose of the cited VASP functions: the Blocked Davidson algorithm (Algo = Normal) is done in EDDAV; the RMM-DIIS algorithm (Algo = VeryFast) is done in three steps: EDDIAG: sub-space rotation, RMM-DIIS: residual minimization, ORTHCH: orthogonalization. After each iteration, in both algorithms, one can compute the new charges in CHARGE.
Figure 2. Hiding the data transfers by the computation parts. The data transfer and the GPU kernel are done with asynchronous calls: the CPU is thus free to run other computations. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
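The following CUDA sketch illustrates this overlap (illustrative only, not VASP source; the kernel work_on and the buffer names are placeholders): a host-to-device copy of the next data block is issued in one stream while a kernel processes the previous, independent block in another stream, leaving the CPU free in between.

#include <cuda_runtime.h>

__global__ void work_on(double *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0;          /* placeholder GPU work */
}

void overlap_transfer_and_compute(double *d_prev, double *d_next,
                                  const double *h_next, int n)
{
    /* h_next must point to page-locked (cudaMallocHost) memory for the
       copy to be truly asynchronous. */
    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    /* Upload the next, independent data block while the GPU works on
       the previous one; the CPU is free to run other code meanwhile. */
    cudaMemcpyAsync(d_next, h_next, n * sizeof(double),
                    cudaMemcpyHostToDevice, copy_stream);
    work_on<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_prev, n);

    /* ... other CPU work here ... */

    cudaStreamSynchronize(copy_stream);
    cudaStreamSynchronize(compute_stream);
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
}

For the copy to proceed asynchronously, the host buffer must be allocated as page-locked memory (cudaMallocHost); otherwise the runtime silently falls back to a synchronous transfer.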
Porting other functions to GPU

The algorithm design may strongly impact the performance of the application on the different platforms. In this section, we review how to adapt the algorithms to take advantage of the GPU accelerator.[13,15]
**Tesla C2050 Performance Benchmarks, Tesla GPU Computing, NVIDIA. The configuration used for this benchmark is a GPU M1060 (and C2050) versus CPU Quad-Core Intel core i7 3.2 GHz. The configuration used in Ref. 11 is a GPU M1060 (and C2050) vs. CPU Xeon Quad-Core E5520 2.26 GHz.
Many subroutines in VASP use nested loops on the spin, on the electrons, on the ions, and on the bands. For example, in VASP 4.4 and later, one can use a blocked version of the RMM-DIIS algorithm in which bands are treated in parallel by blocks. One can also treat ions in blocks. Such loops can be schematized as:

do NP = 1, NPLim
   Action(NP)
end do

where there is no dependency between the Action(NP). Porting such code to GPU can be done in three ways (see also Fig. 3):
a) Synchronous calls to GPU for Action(NP): for each value of NP, the GPU is called with one kernel and the CPU waits for the result.
b) Asynchronous calls to GPU: the GPU is called for Action(1), Action(2), …, Action(NPLim) in a serialized way. While the GPU is executing these actions, the CPU can execute other parts of the code.
c) Asynchronous calls to GPU using task parallelism: the GPU is called for Action(1), Action(2), …, Action(NPLim) with different CUDA streams running concurrently. While the GPU is executing these actions, the CPU can execute other parts of the code.

The third approach could be the fastest one if Action(NP) does not use all the GPU computing resources. In this case, CUDA streams give the opportunity to use the remaining GPU resources. The last point needed to obtain optimal performance is to detect the GPU architecture being used and to choose dynamically the most efficient algorithm for this architecture among (i) not using the GPU (for small systems, for example), (ii) using the GPU with asynchronous calls (Fig. 3b), or (iii) using the GPU with task parallelism (Fig. 3c). For example, the RMM-DIIS algorithm used in VASP contains a small set of loops. The GPU version of this algorithm is well suited for the FERMI architecture because of its ability to launch concurrent kernels on different SMs, which is not the case for the G80 and GT200 architectures. A sketch of these strategies is given below.

Figure 3. Schematic representation of three approaches to port loops on GPU. a) Synchronous calls; b) Asynchronous calls; c) Asynchronous calls with concurrent kernels using SPMD-like parallelism on a Fermi GPU. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
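The CUDA sketch below contrasts strategies (b) and (c) under stated assumptions: action_kernel is a hypothetical stand-in for Action(NP), the launch configuration and the number of streams are arbitrary, and the architecture test simply checks for a Fermi-class device (compute capability 2.x or higher).

#include <cuda_runtime.h>

/* Hypothetical kernel standing in for Action(NP). */
__global__ void action_kernel(double *data, int np)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        data[np] += 1.0;               /* placeholder body */
}

void port_loop(double *d_data, int NPLim)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    if (prop.major >= 2) {
        /* (c) Fermi and later: concurrent kernels in several CUDA streams. */
        const int NSTREAMS = 4;        /* arbitrary choice */
        cudaStream_t streams[NSTREAMS];
        for (int s = 0; s < NSTREAMS; ++s)
            cudaStreamCreate(&streams[s]);
        for (int np = 0; np < NPLim; ++np)
            action_kernel<<<64, 256, 0, streams[np % NSTREAMS]>>>(d_data, np);
        for (int s = 0; s < NSTREAMS; ++s) {
            cudaStreamSynchronize(streams[s]);
            cudaStreamDestroy(streams[s]);
        }
    } else {
        /* (b) GT200: asynchronous but serialized launches in the default
           stream; the CPU may run other code until the synchronization. */
        for (int np = 0; np < NPLim; ++np)
            action_kernel<<<64, 256>>>(d_data, np);
        /* ... other CPU work here ... */
        cudaDeviceSynchronize();
    }
}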
Extension to Multiple GPUs

In this section, we focus on heterogeneous GPU-based multicore platforms composed of GPUs and multicore CPUs. We consider situations with equal numbers of CPU cores and GPUs. Further work might consider sharing all the CPU cores and GPUs.

When a parallel code is developed, it is crucial to study the actual speedup gained by running on multiple cores with respect to a single-core run. Indeed, the lack of specific optimization might lead to strongly nonideal scaling that will hamper running on massively parallel architectures. Such a concern also applies to the use of multiple graphic cards associated to improve the speedup of the code. The heterogeneous GPU-based multicore configuration, using the Message Passing Interface (MPI),[45] allows using all the GPUs available on the machine. Thus, each MPI process can be accelerated by a GPU to improve the performance. A simplified version of the mechanism is shown in Figure 4. However, using many GPUs and CPUs simultaneously might result in many data transfers between CPUs and GPUs, slowing down the whole process. To reduce the time taken by data transfers, we can benefit from the fact that only a small amount of the data (taken from a vector or a matrix) is modified after an MPI communication. Therefore, only the modified sections, which are easily identified, are transferred to the GPU. A minimal sketch of this MPI + GPU setup follows.

Figure 4. Multi-GPU communications: a single-core computation (top) can be accelerated in different ways: (i) (on the left) by using a GPU, where the computation is offloaded to the GPU with data transfers before and after the computation; (ii) (on the right) by using multiple processes communicating with MPI; in this case, the algorithm requires "data synchronization" between processes, that is, MPI communication (red section) during the computation. If we want to use GPU and multiple-process acceleration (bottom), we have to consider that each MPI communication requires getting data from/to the GPU to go to/from the "communication device." These additional data transfers are an overhead compared to the single-core GPU acceleration. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
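The following sketch of the one-GPU-per-MPI-process setup is our own illustration, not the actual VASP implementation; the round-robin device binding and the names in the comment (d_vec, h_vec, offset, count) are assumptions made for the example.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);   /* one GPU per MPI process */

    /* Each process then runs its share of the work, offloading its
       dedicated parts to its own GPU. After an MPI exchange, only the
       modified section of a vector or matrix needs to be sent back to
       the device, e.g.
           cudaMemcpy(d_vec + offset, h_vec + offset,
                      count * sizeof(double), cudaMemcpyHostToDevice);
       where d_vec, h_vec, offset, and count are placeholders.          */

    MPI_Finalize();
    return 0;
}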
Results and Discussion

We now present the effectiveness of our contribution. All results are obtained using VASP 5.2.2 and our developed GPU version of VASP 5.2.2.
The original version of VASP is compiled with the Intel compilers, ifort, and Intel MPI, and the GPU version is compiled with the same Intel compilers and CUDA 4.0. Total computation times were obtained by performing 20 geometry optimization steps. For some routines, we also report the execution time for one iteration: this was arbitrarily taken as the average over the 11th, 12th, and 13th iterations. It is worth noticing that the wave function optimization is the most computationally demanding part. In all cases, the tests are made in double precision and we checked that the energies and geometries obtained with our GPU version of VASP were the same as those obtained with the original version up to 10^-6 eV for energies and 0.01 Å for geometries.

Platform

The two platforms used to conduct our tests are described below.

B505 configuration. This platform is composed of nine nodes. Each node contains two quad-core Xeon E5540 CPUs with a frequency of 2.53 GHz and two Tesla M1060 GPU cards based on the GT200 architecture.

Workstation configuration. This platform comprises one node with one quad-core Q9450 CPU with a frequency of 2.66 GHz and one Tesla C2070 GPU card based on the Fermi architecture.

Test systems

To validate our approach, we use four systems, presented below. They are labeled BULK, SILICA, LIQUID, and SLAB and are displayed in Figure 5. The generalized gradient approximation was used in the formulation of Perdew and Wang (PW91).[46] Atomic cores were described with the projected augmented wave method,[47,48] which is equivalent to an all-electron frozen-core approach, for the BULK, SILICA, and LIQUID test cases, and by ultrasoft pseudopotentials for the SLAB system.[49] The wave functions are developed on a basis set of plane waves. With the selected pseudopotentials, converged electronic energies were obtained using cutoff energies of 250, 400, 250, and 200 eV for BULK, SILICA, SLAB, and LIQUID, respectively. The Brillouin zone integration was converged with a 3 × 3 × 5 (BULK), 1 × 1 × 1 (SILICA), 1 × 1 × 1 (LIQUID), and 1 × 1 × 1 (SLAB) k-point mesh generated by the Monkhorst-Pack algorithm.[50] On the chemical side, these four systems represent typical systems used in our groups. On the technical side, they span a large range of possibilities used to model extended systems in terms of the choice of the pseudopotentials, cutoff energies, or k-point sampling.

BULK test case. The BULK system corresponds to the α-phase of the MgH2 solid (P42/mnm space group).[51] The unit cell comprises six atoms and is depicted in Figure 5a. This system is a benchmark material for experimentalists in the field of hydrogen storage applications.[52,53]

SILICA test case. The SILICA system is a 7 Å thick slab of {111} β-cristobalite terminated by hydroxyl groups and used to model the surface of amorphous silica.[54] This system is an insulator. The unit cell contains 240 atoms (Si68O148H24) and is depicted in Figure 5b and in Figure S1 in the Supporting Information.

SLAB test case. The SLAB system, presented in Figure 5c and Figure S2, is an oxide slab of 328 atoms. The unit cell geometry is defined by Al112Si16O200. This aluminosilicate was obtained from an alumina model[55,56] by substituting 16 aluminum atoms by silicon atoms. It was used as a precursor in the modeling of amorphous silica-alumina systems.[57]

LIQUID test case. Small silicate oligomers are proposed as nucleation precursors and growth units for zeolite crystals.[58] The condensation of oligomers is usually achieved in the presence of an organic template cation that orients the zeolite growth toward a given structure. The fourth system, LIQUID,
illustrated in Figure 5d, corresponds to a silicate [Si8O8(OH)7O] and a tetramethyl-ammonium [N(CH3)4]+ ion pair solvated by 230 water molecules equilibrated under NVT conditions.[59]

Figure 5. The four test cases. a) BULK corresponds to the α-phase of the MgH2 solid.[35] The unit cell is shown in blue. Mg atoms are in light blue, H atoms in white. b) SILICA: Si yellow, O red, H white. Unit cell in blue. c) SLAB: Si in yellow, Al in light blue, O in red. Unit cell is indicated in blue. d) LIQUID: Si in yellow, O in red, C in light blue, N in blue, H in white.

Acceleration on many CPUs and M1060 GPUs

In this first part, we consider the acceleration given by the cheapest GPU available to us: the Tesla M1060. As this GPU generation cannot launch concurrent kernels, we ported the routines using asynchronous calls (see Fig. 3b).

Choice of the minimization algorithm. We first used the SILICA test case to check the performance of the three standard options for the ALGO keyword: Normal (Blocked Davidson), VeryFast (RMM-DIIS), and Fast (mixing Blocked Davidson and RMM-DIIS). The total timing for the three algorithms is shown in Figure 6. As expected, the RMM-DIIS algorithm is the fastest one.[28] Surprisingly, our speedup for the Blocked Davidson function is only slightly larger than 3, in disagreement with the one published by Maintz et al., who obtained an acceleration between 5 and 7. This might come from the different systems that we are considering or from different optimizations of the GPU port. However, it is worth noting that even with a speedup of 7, the RMM-DIIS algorithm is competitive with the Blocked Davidson one. Similar conclusions are obtained for the SLAB case when comparing one CPU core with one CPU core + one GPU, and for the LIQUID system when comparing two cores of two CPUs with two cores of two CPUs + four GPUs. These results are reported in Figures S3 and S4 in the Supporting Information. For the next tests, we report only the results for the recommended mixed Blocked Davidson/RMM-DIIS algorithm (ALGO = FAST).

Detailed results for the RMM-DIIS algorithm. Performances for the four most time-consuming functions are shown in Figure 7 and Figure S5 for the SILICA and SLAB test cases. They highlight the benefit obtained by using the heterogeneous CPU + GPU version of VASP compared to the CPU one on the B505 configuration. The LIQUID system is only considered with many CPU cores and GPUs because it is too large for the other architectures. For the first three routines, the gain is around a factor of 10, whereas it is only between 2 and 4 for the RMM-DIIS function. As a consequence, we fall to a factor of less than 4 for the total computational time of an entire sequence because of the larger weight of the RMM-DIIS routine compared to the other routines in an entire sequence.

The SILICA and SLAB test cases put emphasis on the RMM-DIIS function speedup on GPU: its speedup is not as good as that of the other ported functions. However, it depends greatly on the size of the system. The SLAB test case (Figure S5) gives better results for the RMM-DIIS function than the SILICA one (Fig. 7): the GPU acceleration is 3.6 for SLAB, whereas it is only 2.3 for SILICA. Indeed, depending on the data size, using CPU cores instead of the GPU might be preferable. In VASP, as in many other quantum chemical programs, the electronic wave functions are obtained through iterative processes that end based on predefined convergence criteria.[60] As the computation time of the iterations is roughly the same from one iteration to another, we compare the two first iterations (the first one
with GPU, the second one with a CPU core) to select the proper architecture to use.

The experimental results on the B505 configuration are shown in Figure 8 and Figures S6 and S7. It can be seen that the scalability of VASP is not altered after the porting on GPU. As a rule, the times for one core of 16 CPUs + 16 GPUs and four cores of 16 CPUs are essentially the same.

Figure 6. Total time (in seconds) for the SILICA test case for the three standard wave function minimization algorithms using one Xeon E5540 core and using one Xeon E5540 core + M1060 GPU. Acceleration factors of the second configuration over the first one are given in brackets. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Figure 7. Time (in seconds) for the SILICA test case for one iteration of the main routines of the RMM-DIIS algorithm using one Xeon E5540 core and using one Xeon E5540 core + M1060 GPU. Acceleration factors of the second configuration over the first one are given in brackets. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Figure 8. Total time (in seconds) for the SILICA test on VASP multi-GPU. Acceleration factors for n cores of CPU compared to one CPU core are indicated in square brackets. Acceleration factors for n (CPU core + GPU) relative to n CPU cores are indicated in brackets. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

To check the effect of the network on the results for many CPUs and GPUs, we used another configuration comprising two cores of two Xeon E5520 and four Tesla C1060 on the same card. Similar results were obtained (see Table S1 in the Supporting Information): two cores of two CPUs + four GPUs are faster than four cores of two CPUs by a factor of around 1.6. It is worth noting that the remaining cores of a configuration can be used to run other calculations in parallel with our GPU version of VASP.

Acceleration on one Fermi C2070 GPU
Previous results on the Tesla M1060 GPUs have shown that the RMM-DIIS routine is more difficult to port to GPU than the others, directly impacting the total computational time. In this part, we take advantage of the streaming multiprocessors of the NVIDIA Fermi architecture to improve the results for RMM-DIIS. In practice, this means that we use the approach described in Figure 3c. As we focused on the RMM-DIIS routine, we show results for this algorithm only.

The total computational time and the timing of the main RMM-DIIS routines for one geometry optimization iteration are shown in Figure 9 for the SLAB system and in Figure S8 for the SILICA system, with the SM optimization activated or not. Using the SMs allows a drastic increase of the RMM-DIIS acceleration, which goes up from 3.8 to 11.1. This also impacts the total time,
which is now accelerated by a factor of 8.1 instead of 3.1. Similar results are found for the SILICA system, where the total time is reduced by a factor of 5.7 with the SM optimization (Figure S8).

Figure 9. Time (in seconds) and total computation time (divided by 100, in seconds) for the main routines of the RMM-DIIS algorithm for the SLAB test case using a Xeon E5540 (label E5540), Xeon E5540 + M1060 (label M1060), Xeon E5540 + C2070 without using the Streaming Multiprocessor (label C2070-no SM), and Xeon E5540 + C2070 with the SM manager (label C2070-SM). Acceleration factors are given in brackets.
Conclusions

We have provided a hybrid, massively parallelized ab initio molecular dynamics software for GPU clusters. To avoid continuously transferring data from CPUs (resp. GPUs) to GPUs (resp. CPUs), we have ported some functions to CUDA and achieved a balanced combination of CUFFT, CUBLAS, and CUDA. We have established a multi-GPU platform to improve the overall performance of the software. Indeed, on the B505 configuration, adding 16 NVIDIA GT200 GPUs to only 16 cores (out of 64) provides the computational power offered by the full 64-core architecture, while leaving 48 cores available for other calculations. Moreover, putting a Tesla Fermi in a traditional machine improves the speedup of VASP by a factor between 3 and 8 (using a Xeon Q9450).
Acknowledgments

This publication is based on work supported by Award No. UK-C0017, made by King Abdullah University of Science and Technology (KAUST). The B505 blade center has been bought within the CADENCED project with KAUST. The authors also thank their KAUST colleagues for insightful discussions. They thank P. Raybaud and C. Chizallet for providing the test cases BULK and SLAB.

Keywords: graphics processing units · plane-waves · Fortran · scientific computing · accelerated computing · hybrid computing · VASP
How to cite this article: M. Hacene, A. Anciaux-Sedrakian, X. Rozanska, D. Klahr, T. Guignon, P. Fleurat-Lessard, J. Comput. Chem. 2012, 33, 2581–2589. DOI: 10.1002/jcc.23096

Additional Supporting Information may be found in the online version of this article.

References
[1] P. Sautet, F. Delbecq, Chem. Rev. 2010, 110, 1788.
[2] J. Hafner, J. Comput. Chem. 2008, 29, 2044.
[3] M. Feyereisen, G. Fitzgerald, A. Komornicki, Chem. Phys. Lett. 1993, 208, 359.
[4] F. Weigend, M. Häser, H. Patzelt, R. Ahlrichs, Chem. Phys. Lett. 1998, 294, 143.
[5] H.-J. Werner, F. R. Manby, J. Chem. Phys. 2006, 124, 054114.
[6] L. Maschio, D. Usvyat, F. R. Manby, S. Casassa, C. Pisani, M. Schütz, Phys. Rev. B 2007, 76, 075101.
[7] L. Genovese, A. Neelov, S. Goedecker, T. Deutsch, S. Alireza Ghasemi, A. Willand, D. Caliste, O. Zilberberg, M. Rayson, A. Bergman, R. Schneider, J. Chem. Phys. 2008, 129, 014109; Available at: http://inac.cea.fr/sp2m/L_Sim/BigDFT.
[8] L. Genovese, B. Videau, M. Ospici, T. Deutsch, S. Goedecker, J.-F. Méhaut, C. R. Mécanique 2011, 339, 149.
[9] S. Goedecker, Rev. Mod. Phys. 2009, 71, 1085.
[10] NVIDIA Corp. Available at: http://developer.nvidia.com/cuda/cufft. Accessed on August 8, 2012.
[11] Chen, X. Cui, H. Mei, Improving Performance of Matrix Multiplication and FFT on GPU, In Proceedings of the 24th ACM International Conference on Supercomputing, 2–4 June 2010, Tsukuba, Ibaraki, Japan, 2010, pp. 315–324.
[12] A. W. Götz, T. Wölfle, R. C. Walker, Annu. Rep. Comput. Chem. 2010, 6, 21.
[13] M. S. Friedrichs, P. Eastman, V. Vaidyanathan, M. Houston, S. Legrand, A. L. Beberg, D. L. Ensign, C. M. Bruns, V. S. Pande, J. Comput. Chem. 2009, 30, 864.
[14] N. Luehr, I. S. Ufimtsev, T. J. Martínez, J. Chem. Theory Comput. 2011, 7, 949.
[15] I. S. Ufimtsev, T. J. Martínez, J. Chem. Theory Comput. 2008, 4, 222.
[16] I. S. Ufimtsev, T. J. Martínez, J. Chem. Theory Comput. 2009, 5, 2619.
[17] M. Hutchinson, M. Widom, Comput. Phys. Commun. 2012, 183, 1422.
[18] Y. Uejima, T. Terashima, R. Maezono, J. Comput. Chem. 2011, 32, 2264.
[19] K. A. Wilkinson, P. Sherwood, M. F. Guest, K. J. Naidoo, J. Comput. Chem. 2011, 32, 2313.
[20] A. E. DePrince III, J. R. Hammond, J. Chem. Theory Comput. 2011, 7, 1287.
[21] R. Olivares-Amaya, M. A. Watson, R. G. Edgar, L. Vogt, Y. Shao, A. Aspuru-Guzik, J. Chem. Theory Comput. 2010, 6, 135.
[22] L. Genovese, M. Ospici, T. Deutsch, J.-F. Méhaut, A. Neelov, S. Goedecker, J. Chem. Phys. 2009, 131, 034103.
[23] K. Yasuda, J. Chem. Theory Comput. 2008, 4, 1230.
[24] S. Maintz, B. Eck, R. Dronskowski, Comput. Phys. Commun. 2011, 182, 1421.
[25] G. Kresse, J. Hafner, Phys. Rev. B: Condens. Matter 1993, 48, 13115.
[26] G. Kresse, J. Hafner, Phys. Rev. B: Condens. Matter 1994, 49, 14251.
[27] G. Kresse, J. Furthmüller, Comput. Mater. Sci. 1996, 6, 15.
[28] G. Kresse, J. Furthmüller, Phys. Rev. B 1996, 54, 11169.
[29] G. Sun, J. Kürti, P. Rajczy, M. Kertesz, J. Hafner, G. Kresse, J. Mol. Struct.: THEOCHEM 2003, 624, 37.
[30] J. Bolz, I. Farmer, E. Grinspun, P. Schröder, ACM Trans. Graph. 2003, 22, 917.
[31] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P. Hanrahan, ACM Trans. Graph. 2004, 23, 777.
[32] NVIDIA Corp. Available at: http://www.nvidia.com/object/cuda_home_new.html. Accessed on August 8, 2012.
[33] Available at: http://www.khronos.org/opencl/. Accessed on August 8, 2012.
[34] D. E. Culler, J. Pal Singh, A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach; Morgan Kaufmann Publishers Inc.: San Francisco, CA, 1997; pp. 44–47.
[35] D. A. Patterson, J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 4th ed.; Morgan Kaufmann/Elsevier: Waltham, MA, 2011; Chapter 7 and Appendix A.
[36] R. Alverson, D. Callahan, D. Cummings, A. Porterfield, B. Smith, The Tera Computer System, In Proceedings of the ICS '90 ACM SIGARCH International Conference on Supercomputing, Amsterdam, The Netherlands, June 11–15, 1990; ACM: New York, 1990; pp. 1–6.
[37] Valgrind Documentation. Available at: http://valgrind.org/. Accessed on August 8, 2012.
[38] J. Weidendorfer, KCachegrind: Performance Optimization: Simulation and Real Measurement. Available at: http://kcachegrind.sourceforge.net/html/Documentation.html. Accessed on August 8, 2012.
[39] Fastest Fourier Transform in the West. Available at: http://www.fftw.org/. Accessed on August 8, 2012.
[40] Basic Linear Algebra Subprograms. Available at: http://netlib.org/blas/. Accessed on August 8, 2012.
[41] Linear Algebra PACKage. Available at: http://www.netlib.org/lapack/. Accessed on August 8, 2012.
[42] Scalable LAPACK. Available at: http://www.netlib.org/scalapack/. Accessed on August 8, 2012.
[43] NVIDIA Corp. Available at: http://developer.nvidia.com/cuda/cublas. Accessed on August 8, 2012.
[44] CULA Tools Performance. Available at: http://www.culatools.com/dense/performance. Accessed on August 8, 2012.
[45] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 2.2; High-Performance Computing Center Stuttgart: Germany, 2009.
[46] J. P. Perdew, J. A. Chevary, S. H. Vosko, K. A. Jackson, M. R. Pederson, D. J. Singh, C. Fiolhais, Phys. Rev. B 1992, 46, 6671.
[47] P. E. Blöchl, C. J. Först, J. Schimpl, Bull. Mater. Sci. 2003, 26, 33.
[48] P. E. Blöchl, Phys. Rev. B 1994, 50, 17953.
[49] G. Kresse, J. Hafner, J. Phys.: Condens. Matter 1994, 6, 8245.
[50] H. J. Monkhorst, J. D. Pack, Phys. Rev. B 1976, 13, 5188.
[51] W. H. Zachariasen, C. E. Holley, Jr., J. F. Stamer, Jr., Acta Crystallogr. 1963, 16, 352.
[52] J. Yang, A. Sudik, C. Wolverton, D. J. Siegel, Chem. Soc. Rev. 2010, 39, 656.
[53] J.-N. Chotard, W. S. Tang, P. Raybaud, R. Janot, Chem. Eur. J. 2011, 17, 12302.
[54] X. Rozanska, F. Delbecq, P. Sautet, Phys. Chem. Chem. Phys. 2010, 12, 14930.
[55] M. Digne, P. Sautet, P. Raybaud, P. Euzen, H. Toulhoat, J. Catal. 2002, 211, 1.
[56] M. Digne, P. Sautet, P. Raybaud, P. Euzen, H. Toulhoat, J. Catal. 2004, 226, 54.
[57] C. Chizallet, P. Raybaud, Angew. Chem. Int. Ed. 2009, 48, 2891.
[58] C. S. Cundy, P. A. Cox, Chem. Rev. 2003, 103, 663.
[59] T. T. Trinh, X. Rozanska, F. Delbecq, P. Sautet, Phys. Chem. Chem. Phys. 2012, 14, 3369.
[60] G. Kresse, J. Furthmüller, VASP: The Guide; Institut für Materialphysik, Universität Wien: Sensengasse 8, A-1130 Wien, Austria, 2012. Available at: http://cms.mpi.univie.ac.at/vasp/vasp/vasp.html. Accessed on August 8, 2012.
Received: 20 December 2011; Revised: 20 July 2012; Accepted: 24 July 2012; Published online: 20 August 2012