IEIT Journal of Adaptive & Dynamic Computing Vol. 1, No. 2, 33-42 OPEN ACCESS

ISSN 2219-4088 www.ieit-web.org/ijadc

GPU Accelerated Dissipative Particle Dynamics with Parallel Cell-list Updating

Hao Wu1,*, Junbo Xu, Shengfei Zhang, Hao Wen

1 Institute of Process Engineering, Chinese Academy of Sciences, Beijing 100190, China
E-Mails: [email protected]
* Institute of Process Engineering, Chinese Academy of Sciences, Beijing 100190

Received: Nov. 2010 / Accepted: Jan. 2011 / In Press: Apr. 2011 / Published: Apr. 2011

Abstract: A general-purpose DPD simulation implemented entirely on the GPU is presented in this paper, including cell-list updating, force calculation and integrating forward. The algorithms and optimizations needed to obtain the best performance from the GPU are discussed. The performance benchmarks show that our implementation running on a single GPU can be more than 20x faster than a conventional implementation running on a single CPU core and 10x faster than one running on 5 CPU cores.

Keywords: dissipative particle dynamics; cell-list updating

1. Introduction

Dissipative particle dynamics (DPD) is a coarse-grained method first introduced by Hoogerbrugge1 and Groot2-3 for simulating dynamical and rheological properties of both simple and complex fluids. By introducing bead-and-spring type particles, polymers can be simulated as well4. It has been used successfully for calculations of block-copolymer micro-phase separation5, surfactant behavior6-7, phase separation in binary immiscible fluids8, and so forth. Though DPD can tackle hydrodynamic time and space scales beyond those available with molecular dynamics (MD), the demand for computing power still exceeds what a desktop computer can provide when simulating complex systems, and a computer cluster is often required to run these simulations. In recent years, the rapidly developing GPU has offered an alternative.

A graphics processing unit, or GPU, is a specialized processor that offloads 3D graphics rendering from the microprocessor. GPUs have grown extremely powerful and far exceed CPUs in terms of raw computing power. As a result, the use of GPUs for general-purpose computing has become an important and rapidly growing area of research. In November 2006, NVIDIA® introduced CUDA™ (Compute Unified Device Architecture)9, a general-purpose parallel computing architecture – with a new parallel programming model and instruction set architecture – that leverages the parallel compute engine in NVIDIA® GPUs to solve many complex computational problems more efficiently than on a CPU.

A few previous works have investigated GPU implementations of specific algorithms used for MD and DPD. Elsen et al. implemented a simple implicit solvent model (distance-dependent dielectric)10. Stone et al. examined a GPU implementation for electrostatics11. Anderson et al. developed a general-purpose molecular dynamics code that runs entirely on a single GPU12. van Meel et al. presented double-buffered partial updates of the cell-list13. Friedrichs et al. implemented all-atom protein molecular dynamics running entirely on the GPU14. Rozen et al. presented an adaptation of the bucket sort algorithm capable of running entirely on the GPU architecture15.

This paper presents an implementation of DPD running on the GPU. To make use of the GPU as optimally as possible, the kernel code is built from the very beginning instead of being migrated from a CPU version, which would impose too many restrictions. It is developed in the C programming language under the NVIDIA® CUDA environment. Three major computations are needed in every time step of a DPD simulation, as in MD: updating the cell-list, calculating forces, and integrating forward to the next time step. With the new capabilities of the latest GPUs, all three computations can be programmed in a highly parallel way. The algorithms presented in this paper can also be used for MD.

2. Implementation details

2.1 GT200 architecture

The GPU offers tremendous computing power and is still developing at a speed faster than Moore's law. The latest GPU architecture from NVIDIA® available on the consumer market is GT200. The next generation is GT300, code name Fermi, which has just been unveiled. The features of GPUs improve from generation to generation, so it is important to understand how they can be programmed to obtain the best performance. A single GT200 is composed of 10 Thread Processing Clusters (TPC). Each TPC is further made up of 3 Streaming Multiprocessors (SM), plus a texture pipeline, which serves as the memory pipeline for each group of 3 SMs. Each SM has 8 stream processors, also known as CUDA cores. Each streaming multiprocessor has its own 16KB of shared local memory, which runs at the same clock speed as the thread processors.

2.2 CUDA overview

C for CUDA9 extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions. These independent threads are organized into blocks, which can contain anywhere from 32 to 512 threads each, but all blocks must be the same size. Threads within a block can cooperate among themselves by sharing data through shared memory and by synchronizing their execution to coordinate memory accesses. One important consequence of this execution model is the smallest unit of execution on the device, called a "warp": all 32 threads in a warp must execute the same instruction in a data-parallel fashion.

CUDA threads may access data from multiple memory spaces during their execution. Each thread has a private local memory. Each thread block has a shared memory visible to all threads of the block and with the same lifetime as the block. Finally, all threads have access to the same global memory. The global memory space is not cached, so global memory access is very expensive. The simultaneous memory accesses by threads in a half-warp can be coalesced into a single memory transaction of 32, 64, or 128 bytes, which is critical for obtaining optimal performance. Access to global memory is slow, but transferring data between the GPU and CPU is slower still. For this reason, communication between the GPU and CPU should be reduced to a minimum.
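For readers unfamiliar with this model, the following minimal sketch illustrates a kernel launched over N threads grouped into blocks, as described above. The kernel name, array and block size are illustrative assumptions, not code from the paper.

```cuda
#include <cuda_runtime.h>

// Each CUDA thread handles one array element (here: one particle's velocity).
__global__ void scaleVelocities(float4 *vel, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        vel[i].x *= factor;
        vel[i].y *= factor;
        vel[i].z *= factor;
    }
}

int main(void)
{
    const int n = 1 << 16;
    float4 *d_vel;
    cudaMalloc(&d_vel, n * sizeof(float4));
    cudaMemset(d_vel, 0, n * sizeof(float4));

    // Launch n threads organized into blocks of 256 threads each.
    int blockSize = 256;
    int gridSize  = (n + blockSize - 1) / blockSize;
    scaleVelocities<<<gridSize, blockSize>>>(d_vel, 0.5f, n);
    cudaDeviceSynchronize();

    cudaFree(d_vel);
    return 0;
}
```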

2.3 Cell-list updating

Finding the particles within the cutoff range for force calculation can be done by an all-versus-all comparison, which has a time complexity of O(N2). Algorithmic improvements on the CPU have reduced this to scale as O(N) in the average case16-17. The simulation box is typically decomposed into smaller domains, called cells. For a given particle, all interaction partners are then located in the neighboring cells. By choosing rcell equal to rmax, the cut-off radius of the forces, 27 neighboring cells need to be considered when calculating forces.

There are several data structures for implementing cell-lists18. One commonly used approach is the linked cell-list. A critical-section mechanism19 is needed when applying the linked cell-list in a parallel algorithm. Although CUDA provides atomic functions19, there is no efficient way to invoke a critical section across warps, so the linked cell-list is not well suited for a GPU implementation. We choose a data structure that assigns a fixed-size array of placeholders to every cell and physically copies particle IDs into this array. The disadvantage of this data structure is that the maximum number of particles that could appear in a single cell must be defined, and a portion of storage is wasted since not all cells contain the maximum number of particles. Fortunately, density fluctuations are weak during a DPD simulation, so the maximum number of particles in a single cell does not need to be large enough to waste much storage.

Updating the cell-list involves looping through the N particles and placing them into Ncell cells of width rcell, a process called binning the particles. A simple implementation invokes the serial algorithm on only 1 thread of the GPU. However, any flow control instruction (if, switch, do, for, while) can significantly affect instruction throughput by causing threads of the same warp to diverge, which makes running a serial loop on the GPU very slow20. An alternative is copying particle positions from device (GPU) memory to host (CPU) memory, updating the cell-list on the CPU, and then copying the cell-list back to device memory. The bottleneck here is the low throughput between the device and the host mentioned before, so the overall performance is not guaranteed compared to running the whole process on the GPU. With the ability of the latest GPUs to perform read-modify-write operations atomically, we can implement a fully parallel cell-list updating algorithm on the GPU. Each thread bins 1 particle into the cell-list. If 2 simultaneously executed threads try to bin particles into the same cell, a race condition would occur because of thread competition, so an atomic function19 is applied to solve the problem.
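As an illustration of this fully parallel binning step, the sketch below assigns one thread per particle and uses atomicAdd to reserve a slot in the fixed-size per-cell array. The kernel, the array layout and the MAX_PER_CELL bound are assumptions for illustration, not the authors' actual code.

```cuda
// Hypothetical sketch of parallel cell-list binning, one thread per particle.
// cellCount[c] holds the current number of particles in cell c and is assumed
// to be zeroed (e.g. with cudaMemset) before each rebuild;
// cellList[c * MAX_PER_CELL + k] holds the k-th particle ID of cell c.
#define MAX_PER_CELL 32   // assumed upper bound on particles per cell

__global__ void binParticles(const float4 *pos, int *cellCount, int *cellList,
                             int nParticles, int nCellsPerSide, float cellWidth)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;

    // Compute the 3D cell index of particle i (positions assumed in [0, box)).
    int cx = (int)(pos[i].x / cellWidth);
    int cy = (int)(pos[i].y / cellWidth);
    int cz = (int)(pos[i].z / cellWidth);
    int cell = (cz * nCellsPerSide + cy) * nCellsPerSide + cx;

    // Atomically reserve a slot; writes are serialized only among threads
    // that hit the same cell, so contention stays low.
    int slot = atomicAdd(&cellCount[cell], 1);
    if (slot < MAX_PER_CELL)
        cellList[cell * MAX_PER_CELL + slot] = i;
}
```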

2.4 Force calculation

Dissipative particle dynamics (DPD) describes a fluid system by dividing it into small interacting fluid packages; each DPD bead represents one fluid package. The evolution of the positions $\mathbf{r}_i$ and velocities $\mathbf{v}_i$ of all interacting beads over time is governed by Newton's second law of motion:

$$\frac{\partial \mathbf{r}_i}{\partial t} = \mathbf{v}_i, \qquad m_i \frac{\partial \mathbf{v}_i}{\partial t} = \mathbf{f}_i \tag{1}$$

The equations of motion are solved using the modified velocity-Verlet algorithm presented by Groot and Warren2. The total force acting on a bead is composed of three pairwise additive forces: the conservative $\mathbf{F}_{ij}^{C}$, dissipative $\mathbf{F}_{ij}^{D}$ and random $\mathbf{F}_{ij}^{R}$ forces,

$$\mathbf{f}_i = \sum_{j \neq i} \left( \mathbf{F}_{ij}^{C} + \mathbf{F}_{ij}^{D} + \mathbf{F}_{ij}^{R} \right) \tag{2}$$

Beads interact only with those within a certain cut-off radius $r_c$. With $r_c = 1$,

$$\mathbf{F}_{ij}^{C} = \begin{cases} a_{ij}\,(1 - r_{ij})\,\hat{\mathbf{r}}_{ij} & r_{ij} < 1 \\ 0 & r_{ij} \geq 1 \end{cases} \tag{3}$$

$$\mathbf{F}_{ij}^{D} = \begin{cases} -\gamma\,\omega^{D}(r_{ij})\,(\hat{\mathbf{r}}_{ij} \cdot \mathbf{v}_{ij})\,\hat{\mathbf{r}}_{ij} & r_{ij} < 1 \\ 0 & r_{ij} \geq 1 \end{cases} \tag{4}$$

$$\mathbf{F}_{ij}^{R} = \begin{cases} \sigma\,\omega^{R}(r_{ij})\,\xi_{ij}\,\hat{\mathbf{r}}_{ij} & r_{ij} < 1 \\ 0 & r_{ij} \geq 1 \end{cases} \tag{5}$$

where $\xi_{ij}$ is a random number drawn from a uniform distribution with zero mean, $\omega(r_{ij})$ is a weight function, $\gamma$ is a friction factor and $\sigma$ defines the fluctuation amplitude. To make a polymer, monomers are threaded together into linear chains using the spring interaction force

$$\mathbf{F}_{i}^{Spring} = \sum_{j} C\,\mathbf{r}_{ij}$$

where the sum runs over all particles to which particle i is connected.
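For completeness, the dissipative and random terms above are not independent: in the Groot–Warren formulation2 their weight functions and amplitudes are linked by the fluctuation-dissipation relation, which is what lets the pair (γ, σ) act as a thermostat. The relation below is quoted from Ref. 2 and is not derived in the original text:

$$\omega^{D}(r_{ij}) = \left[ \omega^{R}(r_{ij}) \right]^{2}, \qquad \sigma^{2} = 2\,\gamma\,k_{B}T$$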

When calculating forces, each particle is assigned to one independent thread. The position and velocity of the particle are copied from global memory into local variables of the thread to which the particle is assigned. The 27 neighboring cells are searched for particles within the interaction range, and then the DPD forces are calculated. Note that the distance between 2 particles and each force pair are calculated twice, by 2 different threads. Though this increases the floating-point computation, the number of memory accesses is reduced. This is suggested by NVIDIA20 for obtaining better performance on the GPU due to the non-cached global memory architecture, and has also been confirmed in practice12.

A disadvantage of this algorithm is that every thread loads particle positions from global memory, which is not cached. Three mechanisms can be used to reduce the global memory access latency: registers, shared memory and memory coalescing. Generally, accessing a register costs zero extra clock cycles per instruction. Local variables can be stored in registers if enough of them are available, and the device is capable of reading 4-byte, 8-byte, or 16-byte words from global memory into registers in a single instruction. After finishing the force calculation, the summed force on the particle is copied back to global memory. Each thread requires only one global memory read and one write operation for the particle assigned to it. Since the threads can be assigned in the same order in which particles are stored in global memory, these memory accesses are coalesced. However, the positions and velocities of the particles within the cutoff range must be read from global memory every time. These particles are not stored in any particular order, so those memory accesses are not coalesced, which is the critical bottleneck in our implementation.

We do not take advantage of shared memory, because only 16KB of shared memory is available per block and there is no efficient way of synchronizing between blocks. When shared memory is not large enough to store all particles, great caution is needed in designing the distributed algorithm. Another reason is that the next generation of GPU architecture, code name Fermi21, deploys a cache mechanism, so shared memory may no longer offer much advantage. We intend to implement an algorithm that can be migrated to the new architecture smoothly and efficiently.

Though most DPD simulations choose the cell size as 1, the cutoff of the conservative force, other cell sizes are also valid. For example, by choosing rcell = 0.5 rc, 125 neighboring cells need to be accessed instead of 27, but the smaller cells contain fewer particles outside the interaction range, so fewer particle positions must be loaded from global memory. Since memory accesses cost more than the extra floating-point operations on the GPU, rcell = 0.5 rc is chosen for this work.

A DPD implementation requires a pseudo-random number generator. Groot2 indicates that DPD simulation is not sensitive to the distribution of the random numbers, so we use the MT19937 (Mersenne Twister) library from the CUDA SDK as the random number generator instead of developing our own.
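To make the one-thread-per-particle force loop described in this subsection concrete, the following rough sketch computes only the conservative force over the neighboring cells, reusing the fixed-size cell arrays assumed in Section 2.3. All names, the cell-loop bounds and the MAX_PER_CELL constant are illustrative assumptions, not the paper's actual kernel.

```cuda
// Hypothetical per-particle force kernel (conservative DPD force only).
#define MAX_PER_CELL 32   // same assumption as the binning sketch in Section 2.3

__global__ void computeForces(const float4 *pos, float4 *force,
                              const int *cellCount, const int *cellList,
                              int nParticles, int nCellsPerSide,
                              float cellWidth, float aij)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;

    float4 pi = pos[i];                              // one coalesced read per thread
    float3 fi = make_float3(0.0f, 0.0f, 0.0f);
    float box = nCellsPerSide * cellWidth;

    int cx = (int)(pi.x / cellWidth);
    int cy = (int)(pi.y / cellWidth);
    int cz = (int)(pi.z / cellWidth);

    // Loop over the 27 neighboring cells with periodic wrap-around.
    for (int dz = -1; dz <= 1; ++dz)
    for (int dy = -1; dy <= 1; ++dy)
    for (int dx = -1; dx <= 1; ++dx) {
        int nx = (cx + dx + nCellsPerSide) % nCellsPerSide;
        int ny = (cy + dy + nCellsPerSide) % nCellsPerSide;
        int nz = (cz + dz + nCellsPerSide) % nCellsPerSide;
        int cell = (nz * nCellsPerSide + ny) * nCellsPerSide + nx;

        int count = min(cellCount[cell], MAX_PER_CELL);
        for (int k = 0; k < count; ++k) {
            int j = cellList[cell * MAX_PER_CELL + k];
            if (j == i) continue;

            float4 pj = pos[j];                      // uncoalesced read: the bottleneck
            float3 rij = make_float3(pi.x - pj.x, pi.y - pj.y, pi.z - pj.z);
            // Minimum-image convention for the periodic box.
            rij.x -= box * rintf(rij.x / box);
            rij.y -= box * rintf(rij.y / box);
            rij.z -= box * rintf(rij.z / box);
            float r = sqrtf(rij.x * rij.x + rij.y * rij.y + rij.z * rij.z);
            if (r < 1.0f && r > 0.0f) {
                float w = aij * (1.0f - r) / r;      // F^C = a_ij (1 - r) r_hat
                fi.x += w * rij.x;
                fi.y += w * rij.y;
                fi.z += w * rij.z;
            }
        }
    }
    force[i] = make_float4(fi.x, fi.y, fi.z, 0.0f);  // one coalesced write per thread
}
```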

2.5 Position and velocity updating

After finishing the force calculation, the third portion of the computation is integrating forward to the next time step, which involves updating the positions and velocities of all particles. The algorithm is similar to the force calculation, with one independent thread per particle. The velocity-Verlet method is used to update the position and velocity to the next time step. Taking advantage of registers and memory coalescing, this part of the computation is expected to be the least time consuming.
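As an illustration of this per-particle integration step, here is a minimal sketch of the first half of a velocity-Verlet update (the modified Groot–Warren variant with its λ parameter is not reproduced here); the kernel and parameter names are assumptions.

```cuda
// Hypothetical per-particle velocity-Verlet half-step:
// v(t + dt/2) = v(t) + 0.5 * dt * f(t) / m ;  r(t + dt) = r(t) + dt * v(t + dt/2)
__global__ void integrateStep1(float4 *pos, float4 *vel, const float4 *force,
                               int nParticles, float dt, float invMass)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;

    float4 v = vel[i];
    float4 f = force[i];
    v.x += 0.5f * dt * f.x * invMass;
    v.y += 0.5f * dt * f.y * invMass;
    v.z += 0.5f * dt * f.z * invMass;

    float4 r = pos[i];
    r.x += dt * v.x;
    r.y += dt * v.y;
    r.z += dt * v.z;

    // Coalesced writes: thread i touches only element i.
    vel[i] = v;
    pos[i] = r;
}
```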

3. Brief comparison to other recent works

Rozen et al.15 also implemented a neighbor search method running on the GPU for DPD. Though our implementation shows better performance, the comparison is not on an equal footing: the hardware they used is a GeForce 6800, which is less advanced than the GeForce GTX285 used for this paper. There are no other directly relevant DPD works. The neighbor search method of DPD is similar to that of MD, so we compare algorithmically with GPU implementations of MD. Anderson et al.'s12 implementation copies the particle positions from the device, bins them on the CPU and then copies the resulting cell-lists back to the device. Van Meel et al.13 use double-buffered partial updates of the cell-list to safely update cells in a parallel environment. In short, they designed their algorithms to work around the limitations of older GPU architectures. With the new capabilities of the latest GPUs, the algorithm presented in this paper shows a performance advantage.

4. Performance measurement

4.1 Cell-list updating

Three different cell-list updating methods are compared: 1) migrating the original serial algorithm from CPU to GPU; 2) copying the particle data back to CPU memory each time step and updating the cell-list on the CPU; 3) the parallel cell-list updating presented in this paper. Table 1 gives the average updating time of a single time step. The classic cell-list updating method used on the CPU involves looping through the N particles and placing them into Ncell cells. Running such a serial loop on the GPU is time consuming when N becomes large, so cell-list updating is extremely slow on the GPU with this method. A large loop is more efficient on the CPU thanks to the computational power of modern CPUs and compiler-level optimization, so cell-list updating should be expected to be more efficient on the CPU. The results show that serial cell-list updating on the CPU is an order of magnitude faster than on the GPU, even counting the inefficient memory copy between GPU and CPU. With the atomic function support of the latest GPU architecture, cell-list updating can be accomplished in a highly parallel way: each thread bins one particle into the cell-list. Though the memory writes must be serialized for each cell, there are thousands of cells, so the number of serialized memory writes is relatively small. The results for 20*20*20 are not included because the system is too small for precise comparison. The time of the force calculation is also shown in the table for comparison. Apparently, the running time of parallel cell-list updating is negligible; by contrast, the running time of serial cell-list updating on the GPU even exceeds the time of the force calculation.

4.2 Force calculation

The critical bottleneck here is memory access. The GPU has a deep memory hierarchy, and the fast memory should be used as much as possible. Global memory access is very expensive. Since all particle positions are updated every time step in a DPD simulation, as in other particle-based methods, global memory access is inevitable in each time step, so memory accesses have to be coalesced as much as possible for better performance. The 3-dimensional vectors (position, velocity, etc.) in our implementation are defined as float4 instead of float3. The float3 data type requires 2 operations to access; by contrast, float4 needs only 1. Moreover, float4 is 16 bytes wide, which satisfies the coalescing condition. Table 2 gives the statistics of memory accesses for the different vector type definitions, as reported by the CUDA Visual Profiler. When vectors are defined as float4, fewer small (32-byte and 64-byte) memory accesses are needed and more memory accesses are coalesced. Therefore, the total numbers of both global memory load and store requests are reduced. This leads to an approximately 30% advantage compared with the float3 vector definition, although a quarter of the storage is wasted.

Registers are the fastest storage, so using more registers to store local variables and reducing the global memory dependency benefits performance. However, if a thread uses too many registers, occupancy diminishes. Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps. One of the keys to good performance is to keep the multiprocessors on the GPU as busy as possible, i.e., at high occupancy. Table 3 gives statistics of the force calculation reported by the CUDA Visual Profiler. It uses 35 registers per thread and has its highest occupancy of 0.375 at block size 192. With the occupancy calculator provided in the CUDA SDK, the optimal block size can be predicted instead of running the program at different block sizes and searching for the best one, so only the results for the optimal block size are listed in the table. The maximum number of registers used by a single thread can be limited by adding the compiler option "-maxrregcount=xx". Without enough registers, local variables are stored in local memory, which is as expensive as global memory. Therefore, there is a tradeoff between occupancy and register usage. By limiting the register count of a single thread to 25, the occupancy increases to 0.625 at block size 320. But this does not translate into any performance improvement; on the contrary, it loses approximately 10% efficiency. This also indicates that memory access is the critical bottleneck and that reducing latency is more important than keeping occupancy high.
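The float4 layout discussed above might look like the following minimal sketch, in which each thread issues a single 16-byte load for its particle; the kernel and array names are hypothetical.

```cuda
// Hypothetical illustration of float4 global loads: consecutive threads in a
// half-warp read consecutive 16-byte elements, so the accesses can coalesce.
// (A register limit such as the one discussed above could be applied at build
// time with the "-maxrregcount=25" compiler option quoted in the text.)
__global__ void readPositions(const float4 *pos, float *radiusSq, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 p = pos[i];                                  // single 16-byte load
    radiusSq[i] = p.x * p.x + p.y * p.y + p.z * p.z;    // .w is unused padding
}
```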

4.3 Integrating forward

The third portion of the calculation is the least time consuming. It takes only 2 ms on average for the 80*80*80 system and less than 1 ms for the smaller systems.

4.4 Comparison with commercial software

The optimum performance of our implementation is compared with two well-known commercial software packages: Material Studio® 3.1 developed by Accelrys®, a software environment that brings advanced and validated materials simulation technology to desktop computing, and the Culgi® Library 4.0 from Culgi B.V., a multi-functional chemistry simulation platform for soft matter research providing modeling and simulation algorithms covering a range of length and time scales. Since the DPD module of Material Studio is only available in a serial version, it is chosen as the baseline. The single CPU/GPU benchmarks of our implementation and Material Studio were run on a desktop computer with an Intel® Core™ 2 Quad Q9400 CPU and an NVIDIA® GTX285 graphics card manufactured by GigaByte®. The multi-CPU benchmarks of Culgi were run on a workstation with 2 Intel Xeon 5320 CPUs and 4GB RAM. The DPD module of Culgi requires the number of MPI processes to be an exact divisor of the simulation box size, so the number of processes is set to 5. The density of the simulation system is 3. Each benchmark runs for 10,000 time steps.

Table 4 gives the benchmark results for simulation systems of size 20*20*20, 40*40*40 and 80*80*80. Culgi encounters a memory limitation at 80*80*80 because the workstation has only 4GB of RAM, so that result is not listed in the table. Culgi running on 5 cores shows about 2x speedup compared with Material Studio. It should be noted that the parallel version of Culgi and the serial version of Material Studio run on different CPUs; limited by hardware and licenses, we could not perform all the benchmarks in an identical environment. The Appendix gives benchmarks of the serial version of Culgi, which can be used to compare relative performance. Our GPU implementation shows more than 20x speedup compared with the serial DPD module of Material Studio. Previously, it could take days or even weeks to run simulations of complex systems such as surfactant oligomers; with the power of the GPU, they can now be done within hours on a desktop computer. Moreover, a widely used system size in DPD simulation is 20*20*20, which can now be simulated almost instantly, giving us much more flexibility when applying simulations.

4.5 Numerical precision

Current GPUs only offer full support for single-precision floating point operations, and some of them are not IEEE compliant. Though double precision is supported by the GT200 architecture, it still carries a large performance penalty. To test the numerical precision of our implementation, the mean square displacement (MSD) and the self-diffusion coefficient were measured. Groot2 showed that the self-diffusion coefficient of a DPD particle is approximately $D \approx 45 k_B T / (2\pi\gamma\rho r_c^3)$. With $\gamma = 6.75$ and $\rho = 3$, the theoretical expression gives $D \approx 0.354$ and the simulation gives $D \approx 0.306$. Material Studio does not report the MSD but only the diffusion coefficient. Our simulation gives $D \approx 0.302$, while Material Studio and Culgi give 0.310 and 0.296 respectively. These results give us confidence that single precision does not harm the quality of the dynamics.
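As a quick check of the quoted theoretical value, and assuming reduced DPD units with $k_B T = 1$ and $r_c = 1$ (and reading the friction factor as $\gamma = 6.75$), the arithmetic is:

$$D \approx \frac{45\,k_B T}{2\pi\,\gamma\,\rho\,r_c^{3}} = \frac{45}{2\pi \times 6.75 \times 3} \approx \frac{45}{127.2} \approx 0.354$$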

5. Conclusion

A general-purpose DPD simulation fully implemented on the GPU with highly parallel cell-list updating is presented in this paper. The GPU implementation is compared with the commercial solutions Material Studio and Culgi. With a properly designed algorithm and memory access strategy, it shows over 20x speedup against the serial DPD version provided by Material Studio and over 10x speedup against the parallel DPD version provided by Culgi running on 5 CPU cores. Current technology allows 4 GPUs to be hosted in a single workstation, which theoretically means four times the performance with proper usage. Moreover, the next generation of GPU architecture, GT300, will offer valuable new capabilities (parallel kernel execution, improved atomic functions, cached memory, etc.), which are expected to provide over 20x speedup compared with the current GT200 architecture. With those new GPUs, larger-scale DPD simulations could be run on a desktop supercomputer instead of power-consuming computer clusters.

Acknowledgement

The authors are grateful for the financial support from the National Natural Science Foundation of China (projects 20776141 and 20821092) and the State Key Laboratory of Multi-phase Complex Systems, and appreciate the instruction of Yang Yang from Culgi B.V.

Appendix

Limited by hardware and licenses, we could not perform all the benchmarks in an identical environment. Table 5 gives the benchmarks of the serial version of Culgi running on different machines, which can be used to compare relative performance.

References and Notes

1. Hoogerbrugge, P. J.; Koelman, J. Europhysics Letters 1992, 19(3), 155-160.
2. Groot, R. D.; Warren, P. B. Journal of Chemical Physics 1997, 107(11), 4423-4435.
3. Groot, R. D. Journal of Chemical Physics 2003, 118(24), 11265-11277.
4. Schlijper, A. G.; Hoogerbrugge, P. J.; Manke, C. W. Journal of Rheology 1995, 39(3), 567-579.
5. Xu, J. B.; Wu, H.; Lu, D. Y.; He, X. F.; Zhao, Y. H.; Wen, H. Molecular Simulation 2006, 32(5), 357-362.
6. Wu, H.; Xu, J. B.; He, X. F.; Zhao, Y. H.; Wen, H. Colloids and Surfaces A: Physicochemical and Engineering Aspects 2006, 290(1-3), 239-246.
7. Groot, R. D.; Rabone, K. L. Biophysical Journal 2001, 81(2), 725-736.
8. Novik, K. E.; Coveney, P. V. International Journal of Modern Physics C 1997, 8(4), 909-918.
9. NVIDIA. CUDA Programming Guide, 2009.
10. Elsen, E.; Houston, M.; Vishal, V.; Darve, E.; Hanrahan, P.; Pande, V. S. In SC06 Proceedings, 2006.
11. Stone, J. E.; Phillips, J. C.; Freddolino, P. L.; Hardy, D. J.; Trabuco, L. G.; Schulten, K. Journal of Computational Chemistry 2007, 28(16), 2618-2640.
12. Anderson, J. A.; Lorenz, C. D.; Travesset, A. Journal of Computational Physics 2008, 227(10), 5342-5359.
13. van Meel, J. A.; Arnold, A.; Frenkel, D.; Zwart, S. F. P.; Belleman, R. G. Molecular Simulation 2008, 34(3), 259-266.
14. Friedrichs, M. S.; Eastman, P.; Vaidyanathan, V.; Houston, M.; Legrand, S.; Beberg, A. L.; Ensign, D. L.; Bruns, C. M.; Pande, V. S. Journal of Computational Chemistry 2009, 30(6), 864-872.
15. Rozen, T.; Boryczko, K.; Alda, W. Journal of WSCG 2008, 16(1-3), 161-167.
16. Plimpton, S. Journal of Computational Physics 1995, 117(1), 1-19.
17. Yao, Z. H.; Wang, H. S.; Liu, G. R.; Cheng, M. Computer Physics Communications 2004, 161(1-2), 27-35.
18. Frenkel, D.; Smit, B. Understanding Molecular Simulation; Academic Press, 2002.
19. Silberschatz, A.; Galvin, P. B.; Gagne, G. Operating System Concepts, 7th Edition; Wiley, 2004.
20. NVIDIA. CUDA C Programming Best Practices Guide, 2009.
21. NVIDIA. http://www.nvidia.com/object/fermi_architecture.html, 2009.

© 2010 by the authors; licensee IEIT, Italy. This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
