
International Conference and Workshop on Emerging Trends in Technology (ICWET 2010) – TCET, Mumbai, India

Fast Heterogeneous Computing with CUDA Compatible Tesla GPU Computing Processor (Personal Supercomputing)

D Kumar

M A Qadeer

Department of Computer Engineering Zakir Hussain College of Engg. & Tech. Aligarh Muslim University Aligarh 202002, India +91-9457393654

Department of Computer Engineering Zakir Hussain College of Engg. & Tech. Aligarh Muslim University Aligarh 202002, India +91-9897705269

[email protected]

[email protected]

ABSTRACT This paper presents how fast heterogeneous computing can be achieved with the Tesla GPU computing processor. The Tesla GPU brings the performance of a cluster to a workstation, turning it into a personal supercomputer. We have chosen the field of molecular dynamics to demonstrate fast, high-performance computing with the Tesla GPU, and we present a direct Coulomb summation (DCS) algorithm for computing electrostatic fields around molecules with CUDA. The Tesla GPU speeds up this molecular dynamics application by up to 240X. Tesla GPUs are programmed with CUDA, NVIDIA's many-core programming architecture, which provides an easy development environment and accelerates scientific and engineering applications to a great extent.

Categories and Subject Descriptors C.1.3 [Processor Architectures]: Other Architecture Styles – Heterogeneous (hybrid) systems, Pipeline processors

General Terms Algorithms, Performance, Design, Reliability, Standardization.

Keywords CUDA, Tesla GPU, DCS (Direct Coulomb Summation), Molecular Dynamics, Personal Supercomputing

1. INTRODUCTION The simulation models of fields such as geosciences, molecular dynamics and medical diagnostics are becoming exponentially more complex, so they require vast computing resources. NVIDIA took a giant step toward meeting this challenge with its announcement of a new class of processors based on a revolutionary new GPU. Under the NVIDIA® Tesla™ brand, NVIDIA offers a family of GPU computing products [10],[12] that place power previously available only from supercomputers in the hands of every scientist and engineer; today's workstations have been transformed into personal supercomputers. Many of the molecular structures we analyze are so large that the calculations required for their physical simulation can take weeks of processing time. NVIDIA's Tesla GPU computing technology has given a 100-fold increase in the speed of such programs, and promises to take this further with more flexible computing solutions. Computing on NVIDIA Tesla is now available to any software developer through the world's only C-language development environment for the GPU. NVIDIA® CUDA™ is a complete software development solution [11] that includes a C compiler for the GPU, a debugger/profiler, a dedicated driver and standard libraries. CUDA simplifies parallel computing on the GPU by using standard C to create programs that process large quantities of data in parallel. Programs written with CUDA and run on Tesla can process thousands of threads simultaneously, providing the high computational throughput the GPU needs to solve complex computational problems quickly. Most applications that require massive compute power can leverage NVIDIA Tesla to make parallel computing more pervasive and affordable. CUDA's recent success can be seen in both the academic and application development communities: it is actively used by thousands of developers and scientists in applications ranging from molecular simulation to seismic analysis to medical device design.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Figure 1. Heterogeneous Computing [13]

ICWET’10, February 26–27, 2010, Mumbai, Maharashtra, India. Copyright 2010 ACM 978-1-60558-812-4…$10.00.


2. BACKGROUND

2.1 Advanced Features of the Tesla GPU Computing Processor

The Tesla GPU computing processor has many advanced features that earlier GPUs lacked [12]. Its massively parallel many-core architecture, with 240 cores, solves compute problems on workstations that previously required a cluster installation. It uses the CUDA C programming environment, which expresses application parallelism easily. It supports IEEE 754 single- and double-precision arithmetic, achieving high-precision performance from a single chip while meeting the precision requirements of high-performance computing applications.

Figure 2. Floating point performance [11]

Its asynchronous transfer capability boosts system performance by overlapping data transfers with computation. One Tesla GPU contains 4 GB of global memory, allowing it to compute on larger data sets than before. Groups of processor cores can collaborate through low-latency memory, and very high-speed data transfer enables fast, high-bandwidth communication between CPU and GPU.

Figure 3. Bandwidth

2.2 Tesla Green Supercomputing Energy-efficient supercomputing is known as green supercomputing. It has traditionally been viewed as passé, even to the point of public ridicule, but today it is finally coming into vogue [16]. Four Tesla C1060 GPUs in a workstation turn it into a personal supercomputer with the performance and power of a cluster. One Tesla GPU delivers 933 gigaflops of single-precision and 78 gigaflops of double-precision performance; hence a four-GPU Tesla supercomputer delivers 3.7 teraflops single precision and 312 gigaflops double precision.

2.3 Power/Performance Ratio The performance-to-power ratio of the Tesla S1060 supercomputing system is very high compared to an x86 server: performance per watt of the Tesla S1060 is 20X better than that of an x86 server [13].

Figure 4. Power Performance Ratio

3. CUDA'S SCALABLE PROGRAMMING MODEL The advent of multi-core CPUs and many-core GPUs means that mainstream processor chips are now parallel systems, and their parallelism continues to scale with Moore's law [12]. The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to many-core GPUs with widely varying numbers of cores. CUDA's parallel programming model is designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C. At its core are three key abstractions: a hierarchy of thread groups [12], shared memories, and barrier synchronization, exposed to the programmer as a minimal set of language extensions. These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism [16]. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel, and then into finer pieces that can be solved cooperatively in parallel. Such a decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables transparent scalability, since each sub-problem can be scheduled on any of the available processor cores. A compiled CUDA program can therefore execute on any number of processor cores [11]; only the runtime system needs to know the physical processor count. This scalable programming model allows the CUDA architecture to span a wide market range simply by scaling the number of processors and memory partitions: from the high-performance enthusiast GeForce GTX 280 GPU [11][12] and the professional Quadro and Tesla computing products to a variety of inexpensive, mainstream GeForce GPUs.


4. TESLA GPU ACCELERATION OF MOLECULAR MODELING Modern graphics processing units (GPUs) contain hundreds of arithmetic units and can be harnessed to provide tremendous acceleration for many numerically intensive scientific applications. The increased flexibility of the most recent generation of GPU hardware (Tesla), combined with high-level GPU programming languages such as CUDA, has unlocked this computational power and made it far more accessible to computational scientists. The key to effective utilization of GPUs for scientific computing is the design and implementation of efficient data-parallel algorithms that can scale to hundreds of tightly coupled processing units. Many molecular modeling applications are well suited to GPUs [10], owing to their extensive computational requirements and because they lend themselves to parallel implementations. The use of multiple GPUs [2] can bring even more computational power to bear on highly parallelizable problems [10].

5. MOLECULAR DYNAMICS Continuing increases in high-performance computing technology have rapidly expanded the domain of biomolecular simulation from isolated proteins in solvent to complex aggregates, often in a lipid environment. Such systems routinely comprise 100,000 atoms, and several published NAMD [2] simulations have exceeded 1,000,000 atoms [10]. Studying the function of even the simplest biomolecular machines requires simulations of 100 ns or longer, even when employing simulation techniques for accelerating processes of interest [15]. One of the most time-consuming calculations in a typical molecular dynamics simulation is the evaluation of forces between atoms that do not share bonds. The high degree of parallelism and floating-point capability of GPUs can attain performance levels twenty times that of a single CPU core. This twenty-fold acceleration decreases the runtime of the non-bonded force evaluations so that they can be overlapped with the bonded-force and PME long-range force calculations on the CPU.

5.1 Multi-GPU Coulomb Summation Just as scientific computing can be done on clusters composed of a large number of CPU cores, some problems can be decomposed and run in parallel on multiple GPUs within a single host machine, achieving correspondingly higher levels of performance. One drawback of multi-core CPUs for scientific computing has been the limited memory bandwidth available to each CPU socket [9], which often severely limits the performance of bandwidth-intensive scientific codes; this problem has recently been exacerbated because per-socket memory bandwidth has not kept pace with the increasing number of cores in current CPUs. Since Tesla GPUs contain their own on-board high-performance memory, the memory bandwidth available to computational kernels scales with the number of GPUs. This property allows single-system multi-GPU codes to scale much better than their multi-core CPU counterparts, and highly data-parallel, memory-bandwidth-intensive problems are often excellent candidates for such multi-GPU performance scaling [6]. The direct Coulomb summation algorithm implemented here is an exemplary case for multi-GPU acceleration [3],[4]. The scaling efficiency of direct summation across multiple GPUs is nearly perfect: the use of 4 GPUs delivers almost exactly a 4X performance increase. A single GPU evaluates up to 39 billion atom potentials per second, performing 290 GFLOPS of floating-point arithmetic. With four GPUs, total performance increases to 157 billion atom potentials per second and 1.156 TFLOPS, for a multi-GPU speedup of 3.99 [2] and a scaling efficiency of 99.7%. To match this level of performance using CPUs, hundreds of state-of-the-art CPU cores would be required, along with their attendant cabling, power and cooling requirements. While only one of the first steps in our exploration of the use of multiple GPUs, this result clearly demonstrates that multiple GPUs in a single system can be harnessed with high efficiency.

5.2 Direct Coulomb Summation On DCS, the GPU outruns a CPU core by 44X [15]. In the DCS algorithm, the work is decomposed into tens of thousands of independent threads, multiplexed onto hundreds of GPU processor cores. Single-precision floating-point arithmetic is adequate for the intended application; numerical accuracy can be further improved by compensated summation, spatially ordered summation groupings, or accumulation of the potential in double precision. At each lattice point j, the potential contributions of all atoms i in the simulated structure are summed:

    potential[j] = potential[j] + charge[i] / r_ij

where r_ij is the distance between atom i and lattice point j.

Figure 5. Evaluation of a Lattice Point

The atom list has the smallest memory footprint, which makes it the best choice for the inner loop (on both CPU and GPU). Lattice point coordinates are computed on the fly, and atom coordinates are made relative to the origin of the potential map, eliminating redundant arithmetic. Arithmetic can be reduced significantly by precalculating and reusing distance components.

5.3 Single-Slice DCS with CUDA The host-side reference routine computes one z slice of the potential map; atoms are stored as x, y, z, charge quadruples:

    void cenergy(float *energygrid, dim3 grid, float gridspacing,
                 float z, const float *atoms, int numatoms) {
      int i, j, n;
      int atomarrdim = numatoms * 4;
      for (j = 0; j < grid.y; j++) {
        float y = gridspacing * (float) j;
        for (i = 0; i < grid.x; i++) {
          float x = gridspacing * (float) i;
          float energy = 0.0f;
          for (n = 0; n < atomarrdim; n += 4) {  /* contribution of each atom */
            float dx = x - atoms[n];
            float dy = y - atoms[n+1];
            float dz = z - atoms[n+2];
            energy += atoms[n+3] / sqrtf(dx*dx + dy*dy + dz*dz);
          }
          energygrid[grid.x*j + i] += energy;    /* accumulate into this slice */
        }
      }
    }